[Docs] Fix link, refine and add vc update example (#1385)
* fix docs link, refine docs, add vc update

* fix cluster configuration path
mzmssg committed Sep 17, 2018
1 parent 9dae3eb commit 84a6ef8
Showing 2 changed files with 68 additions and 28 deletions.
91 changes: 66 additions & 25 deletions docs/yarn/yarn.md

### Goal
Enhance YARN to support resource management and job scheduling of PAI.

### Architecture
YARN is the core component of Hadoop 2. YARN employs a master/slave architecture;
the master and slaves are called the Resource Manager (RM) and Node Managers (NM) respectively.
For more details of YARN, please refer to the [official documentation](http://hadoop.apache.org/docs/current/).
Some PAI functions, such as Virtual Cluster, are based on YARN features.

We also add a [patch](https://issues.apache.org/jira/browse/YARN-7481) to enhance YARN,
with which YARN can manage more resource types (GPU, port) than the official version.
This enhancement makes AI job scheduling possible.
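For example, with this enhancement a job can request GPUs as a first-class resource. The snippet below is only an illustration; the field names follow PAI's job configuration format and the values are hypothetical (see the job tutorial for the authoritative schema).

```yaml
# Illustrative PAI job fragment: the patched YARN accounts for gpuNumber
# alongside CPU and memory when scheduling this task role.
taskRoles:
  - name: train
    taskNumber: 1
    cpuNumber: 4
    memoryMB: 8192
    gpuNumber: 2      # scheduled by the patched YARN, not just an env hint
    command: python train.py
```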

### Dependencies
Cluster-configuration (the base of all services), drivers (GPU support), zookeeper (job status store), hdfs (file system).

### Build

`
paictl.py image build -p /path/to/cluster/config -n hadoop-run
`

The command performs 3 major steps:
1. Build a dev container that includes the necessary build environment.
2. Run the container to build the patched hadoop binary.
3. Copy the binary files into the hadoop-run image and set the related ENVs.
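
Conceptually, these steps map to plain docker operations like the sketch below; the image names and paths are hypothetical, and the real build logic lives in `paictl.py` and the hadoop-build scripts.

```bash
# 1. Build a dev image with the compile toolchain (name/path are illustrative)
docker build -t hadoop-build ./hadoop-build

# 2. Run it to compile hadoop with the hadoop-ai patch applied,
#    collecting the resulting binaries on the host
docker run --rm -v "$(pwd)/hadoop-binary:/hadoop-binary" hadoop-build

# 3. Bake the binaries into the runtime image
docker build -t hadoop-run ./hadoop-run
```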

The image should also be pushed to the registry with the following command for deployment.

`
paictl.py image push -p /path/to/cluster/config -n hadoop-run
`

### Configuration
Most service configurations are the same as in official hadoop, except the GPU- and port-related items;
for these extra items, please refer to [hadoop-ai](../hadoop-ai/README.md).


YARN configuration files ([RM](../../src/hadoop-resource-manager/deploy/hadoop-resource-manager-configuration),
[NM](../../src/hadoop-node-manager/deploy/hadoop-node-manager-configuration))
will be generated according to the [settings](../../deployment/quick-start) and hardware when the service starts.
Advanced users can modify the related scripts under these folders to customize the cluster.
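
For instance, the generated `yarn-site.xml` on each node typically pins the NM's schedulable resources to the machine's hardware. The property names below are standard Hadoop ones; the values are illustrative.

```xml
<!-- Illustrative: values are derived from the node's hardware at generation time -->
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>196608</value>
</property>
<property>
  <name>yarn.nodemanager.resource.cpu-vcores</name>
  <value>24</value>
</property>
```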


### Deployment

YARN deployment leverages k8s Daemonsets. Start the RM first, then the NM:

`
paictl.py service start -p /path/to/cluster/config -n hadoop-resource-manager/hadoop-node-manager
`

This command creates a k8s Daemonset,
which selects nodes according to the cluster configuration and deploys the corresponding service.

You can also stop a service similarly.


`
paictl.py service stop -p /path/to/cluster/config -n hadoop-resource-manager/hadoop-node-manager
`


**Important:** Stopping the NM will cause all jobs on that node to fail!
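
After starting or stopping a service, you can verify the Daemonset and its pods with plain kubectl (assuming kubectl is configured against the PAI cluster; the exact object names may differ):

```bash
# Confirm the desired/ready counts of the Daemonsets match
kubectl get daemonsets

# Inspect which nodes the hadoop service pods landed on
kubectl get pods -o wide | grep hadoop
```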

### Refreshment

Currently, a service must be restarted to refresh its configuration:

* step 1: Modify the configuration files as you need.
* step 2: Stop the service; for the command, refer to [Deployment](#deployment).
* step 3: Start the service; for the command, refer to [Deployment](#deployment). The new configuration will overwrite the old.

*Note:* The refresh command in `paictl.py` is temporarily redundant for hadoop services.

Take VC updating as an example:

The VC feature leverages YARN scheduler queues and allows the admin to split resources into multiple parts.
By default, there is only one VC called `default`,
which takes up 100% of the resources; the related fields in `services-configuration.yaml` look as follows.
```yaml
virtualClusters:
  default:
    description: Default VC.
    capacity: 100
```
If the admin wants to reserve some resources for other usage,
a new VC (e.g., `vc1`) can be added:
* step 1: Create a new VC called `vc1` and allocate a certain quota to it:
```yaml
virtualClusters:
  default:
    description: Default VC.
    capacity: 70
  vc1:
    capacity: 30
```
* step 2: Stop RM, `paictl.py service stop -p /path/to/cluster/config -n hadoop-resource-manager`
* step 3: Restart RM, `paictl.py service start -p /path/to/cluster/config -n hadoop-resource-manager`

Then `vc1` will be available in the webportal.
Similarly, the admin can change VC quotas or delete a VC;
in the deletion scenario, all running jobs under the deleted VC should be stopped first.
**During the entire refreshment, the admin must ensure that the sum of all VC capacities equals 100.**
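
A quick way to check the 100% rule before restarting the RM is to sum the capacities straight from the configuration file. Below is a minimal sketch, assuming PyYAML is installed and the file has the layout shown above:

```python
# check_vc.py - verify that VC capacities sum to 100
import sys
import yaml  # pip install pyyaml

with open(sys.argv[1]) as f:
    config = yaml.safe_load(f)

vcs = config["virtualClusters"]
total = sum(vc["capacity"] for vc in vcs.values())
print("total capacity: %s" % total)
if total != 100:
    sys.exit("error: VC capacities must sum to 100, got %s" % total)
```

Run it as `python check_vc.py /path/to/services-configuration.yaml` before stopping the RM.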

Todo:
1. For items that support runtime reconfiguration, avoid restarting the service to refresh.
2. Automatic refreshment.

### Upgrading

Either build a new image by yourself or use our [pre-built images](https://hub.docker.com/r/openpai/hadoop-run/).
Currently, upgrading requires restarting the related services as well.

### Service Metrics

Todo:

### High Availability

The major issue of YARN is the single point of failure of the RM,
and we will solve it with multiple RM nodes, the same as official Hadoop.
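
For reference, official Hadoop enables RM HA through `yarn-site.xml`, backed by ZooKeeper for leader election. The property names below are standard; the ids and hostnames are placeholders.

```xml
<property>
  <name>yarn.resourcemanager.ha.enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.resourcemanager.ha.rm-ids</name>
  <value>rm1,rm2</value>
</property>
<property>
  <name>yarn.resourcemanager.hostname.rm1</name>
  <value>master1.example.com</value>
</property>
<property>
  <name>yarn.resourcemanager.hostname.rm2</name>
  <value>master2.example.com</value>
</property>
<property>
  <name>yarn.resourcemanager.zk-address</name>
  <value>zk1:2181,zk2:2181,zk3:2181</value>
</property>
```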

Todo:
1. Achieve HA with multiple RM nodes.

### Runtime Requirements
For the RM service, we recommend allocating at least 8G of heap size (the default).
You can try the default value at first;
if the memory usage is close to the limit, consider raising it to 12G or 16G.
The corresponding ENV is `YARN_RESOURCEMANAGER_HEAPSIZE` in `yarn-env.sh`.

For the NM service, the ENV is `YARN_NODEMANAGER_HEAPSIZE` (default 4G), and you can keep the default.
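
For example, to raise the RM heap to 12G while keeping the NM default, set the following in `yarn-env.sh` (hadoop interprets these values in MB):

```bash
# in yarn-env.sh - heap sizes are given in MB
export YARN_RESOURCEMANAGER_HEAPSIZE=12288
export YARN_NODEMANAGER_HEAPSIZE=4096
```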

Todo:
1. More accurate resource requirements.
2. Automatically configure the requirements according to cluster size.

5 changes: 2 additions & 3 deletions docs/zookeeper/zookeeper.md

### Goal

Provide a centralized service for maintaining configuration information and distributed synchronization.

`
paictl.py image push -p /path/to/cluster/config -n zookeeper
`

* Manual build

Refer to the [dockerfile](../../src/zookeeper/build/zookeeper.dockerfile).


### Configuration

The configuration file is [here](../../src/zookeeper/build/zoo.cfg); you can keep it as-is in most cases.
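
For reference, a minimal `zoo.cfg` looks like the following; the values are typical defaults, and the file shipped in the repo is authoritative.

```
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/var/lib/zookeeper
clientPort=2181
```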


### Deployment
