This repository has been archived by the owner on Jun 6, 2024. It is now read-only.

OpenPAI deployment

This document introduces the detailed procedures to boot up OpenPAI on a cluster. Please refer to this section if you need complete information on cluster deployment and maintenance.


OpenPAI deploy

We assume that the whole cluster has already been configured by the system maintainer to meet the Prerequisites.


With the cluster set up, the steps to bring PAI up on it are as follows:

Customized deploy

Steps:

Step 0. Prepare the dev-box

It is recommended to perform the operations below in a dev box.

A dev-box is a docker container used to boot up and/or maintain a PAI cluster. For convenience, we provide a prebuilt Docker image on Docker Hub.

Please refer to this section for details on setting up a customized dev-box.

Use the prebuilt dev-box image

(1) Run your dev-box

Note that the dev-box should run on a machine outside of the PAI cluster; it must not run on any PAI cluster node.

# Pull the dev-box image from Docker Hub
sudo docker pull docker.io/openpai/dev-box

# Run your dev-box
# Assume the path of custom-hadoop-binary-path in your service-configuration is /pathHadoop,
#   and the path of your cluster-configuration is /pathConfiguration.
# For now, you can leave them as they are; we only mount these two directories into the docker container for later use.
sudo docker run -itd \
        -e COLUMNS=$COLUMNS -e LINES=$LINES -e TERM=$TERM \
        -v /var/run/docker.sock:/var/run/docker.sock \
        -v /pathHadoop:/pathHadoop \
        -v /pathConfiguration:/cluster-configuration  \
        --pid=host \
        --privileged=true \
        --net=host \
        --name=dev-box \
        docker.io/openpai/dev-box
(2) Working in your dev-box
sudo docker exec -it dev-box /bin/bash
(3) Check out the latest release branch of OpenPAI
cd /pai

# fetch tags
git fetch --tags

# please go to https://github.com/Microsoft/pai/releases to find the latest release.
# check out a release tag. For example: v0.x.y
git checkout v0.x.y

# check current branch
git status
  • successful result:
HEAD detached at v0.6.1
nothing to commit, working directory clean
(4) Go to pai working dir
cd /pai

Now you are free to configure your cluster and run PAI commands...

How to check

  • exec cmd:
sudo docker ps
  • successful result:
24c286d888f5        openpai/dev-box                                       "/container-setup.sh"    3 days ago          Up 3 days                                    dev-box

Step 1. Prepare the quick-start.yaml file

Prepare the file under dev-box folder: /pai/deployment/quick-start/

There is an example file under the path: /pai/deployment/quick-start/quick-start-example.yaml

An example yaml file is shown below. Note that you should change the IP addresses of the machines and the SSH information accordingly.

# quick-start.yaml

# (Required) Please fill in the IP address of the server you would like to deploy OpenPAI
machines:
  - 192.168.1.11
  - 192.168.1.12
  - 192.168.1.13

# (Required) Log-in info of all machines. System administrator should guarantee
# that the username/password pair or username/key-filename is valid and has sudo privilege.
ssh-username: pai
ssh-password: pai-password

# (Optional, default=None) The key file that the ssh client uses; it has higher priority than the password.
#ssh-keyfile-path: <keyfile-path>

# (Optional, default=22) Port number of ssh service on each machine.
#ssh-port: 22

# (Optional, default=DNS of the first machine) Cluster DNS.
#dns: <ip-of-dns>

# (Optional, default=10.254.0.0/16) IP range used by Kubernetes. Note that
# this IP range should NOT conflict with the current network.
#service-cluster-ip-range: <ip-range-for-k8s>

How to check

Check that all configuration items in quick-start.yaml are correct.
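Beyond eyeballing the file, one useful sanity check is that every machine listed in quick-start.yaml is actually reachable over SSH. The helper below is only a sketch, not part of paictl: `list_machines` is a hypothetical function, and the file path, username `pai`, and port 22 are the example values from above, so substitute your own.

```shell
# list_machines <file>: print the IPs listed under the "machines:" key.
# Assumes one "- <ip>" entry per line, as in the example above.
list_machines() {
  awk '/^machines:/ {in_block=1; next}
       in_block && /^[[:space:]]*-[[:space:]]*/ {
         sub(/^[[:space:]]*-[[:space:]]*/, ""); print; next
       }
       in_block && !/^[[:space:]]*(#|$)/ {in_block=0}' "$1"
}

# Try a no-op SSH command on every machine (example values from above).
for ip in $(list_machines /pai/deployment/quick-start/quick-start.yaml); do
  if ssh -o BatchMode=yes -o ConnectTimeout=5 -p 22 pai@"$ip" true 2>/dev/null; then
    echo "OK   $ip"
  else
    echo "FAIL $ip"
  fi
done
```

If any machine reports FAIL, fix its credentials or network reachability before moving on to Step 2.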

Step 2. Generate OpenPAI configuration files

After the quick-start.yaml is ready, use it to generate four configuration yaml files as follows.

(1) generate configuration files
cd /pai

# cmd should be executed under pai directory in the dev-box.

python paictl.py config generate -i /pai/deployment/quick-start/quick-start.yaml -o ~/pai-config -f
(2) Update the docker tag to the release version
vi ~/pai-config/services-configuration.yaml

For example, if you checked out the v0.x.y release, change docker-tag to v0.x.y.

docker-tag: v0.x.y
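If you prefer not to edit the file by hand, a small sed helper can rewrite the docker-tag line in place. This is a convenience sketch only; `set_docker_tag` is a hypothetical function (not a paictl command), and v0.x.y is a placeholder for your real release tag, e.g. v0.6.1.

```shell
# set_docker_tag <file> <tag>: rewrite the "docker-tag:" line in place.
set_docker_tag() {
  sed -i "s/^\([[:space:]]*docker-tag:[[:space:]]*\).*/\1$2/" "$1"
}

# Point it at the generated file and verify the change.
cfg=~/pai-config/services-configuration.yaml
if [ -f "$cfg" ]; then
  set_docker_tag "$cfg" v0.x.y    # use your real release tag here
  grep 'docker-tag' "$cfg"
fi
```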

Appendix: Default values in auto-generated configuration files

How to check

The command will generate the following four yaml files.

cluster-configuration.yaml
k8s-role-definition.yaml
kubernetes-configuration.yaml
services-configuration.yaml

Please refer to this section for the details of the configuration files.
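A quick way to confirm the generation step succeeded is to check that all four files exist in the output directory. `check_generated` below is a hypothetical helper (not part of paictl), and ~/pai-config is the output path used in the command above.

```shell
# check_generated <dir>: report which of the four expected yaml files exist.
check_generated() {
  for f in cluster-configuration k8s-role-definition \
           kubernetes-configuration services-configuration; do
    if [ -f "$1/$f.yaml" ]; then
      echo "found   $f.yaml"
    else
      echo "missing $f.yaml"
    fi
  done
}

check_generated ~/pai-config
```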

Step 3 (Optional). Customize the OpenPAI configuration

This method is for advanced users.

The description of each field in these configuration files can be found in A Guide For Cluster Configuration.

If you want to customize the configuration, please see the table below.

How to check

Check that all configuration items are correct.

Step 4. Boot up Kubernetes

After the configuration files are prepared, the Kubernetes services can be started using paictl tool:

cd /pai

# cmd should be executed under /pai directory in the dev-box.

python paictl.py cluster k8s-bootup \
  -p ~/pai-config

The paictl tool does the following things:

  • Install the kubectl command on the current machine (the dev-box).

  • Generate Kubernetes-related configuration files based on cluster-configuration.yaml, kubernetes-configuration.yaml and k8s-role-definition.yaml.

  • Use kubectl to boot up Kubernetes on target machines.

How to check

After this step, the system maintainer can check the status of Kubernetes by accessing Kubernetes Dashboard:

http://<master>:9090

Where <master> denotes the IP address of the load balancer of Kubernetes master nodes. When there is only one master node and a load balancer is not used, it is usually the IP address of the master node itself.
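Since paictl installs kubectl into the dev-box in this step, node health can also be checked from the command line. `count_ready` below is a hypothetical helper, not part of OpenPAI; it simply counts rows of `kubectl get nodes` output whose STATUS column reads Ready.

```shell
# count_ready: read `kubectl get nodes` output on stdin and count the
# nodes whose STATUS column (field 2) is exactly "Ready".
count_ready() { awk 'NR > 1 && $2 == "Ready" {n++} END {print n + 0}'; }

# On a live cluster, every machine in quick-start.yaml should be counted.
if command -v kubectl >/dev/null; then
  echo "ready nodes: $(kubectl get nodes | count_ready)"
fi
```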

Step 5. Start all PAI services

When Kubernetes is up and running, PAI services can then be deployed to it using paictl tool:

cd /pai

# cmd should be executed under /pai directory in the dev-box.

python paictl.py service start \
  -p ~/pai-config \
  [ -n service-name ]

If the -n parameter is specified, only the given service, e.g. rest-server, webportal, watchdog, etc., will be deployed. If not, all PAI services will be deployed. In the latter case, the above command does the following things:

  • Generate Kubernetes-related configuration files based on cluster-configuration.yaml.

  • Use kubectl to set up config maps and create pods on Kubernetes.

How to check

After this step, the system maintainer can check the status of OpenPAI services by accessing OpenPAI kubernetes web portal:

http://<master>:9090/#!/pod?namespace=default

Where <master> is the same as in the previous section.
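As with the previous step, the same check can be done with kubectl from the dev-box. `not_running` is a hypothetical one-line filter (not an OpenPAI tool) that prints any pod whose STATUS column is not Running; an empty result means all pods are healthy.

```shell
# not_running: read `kubectl get pods` output on stdin and print the
# NAME and STATUS of every pod that is not in the "Running" state.
not_running() { awk 'NR > 1 && $3 != "Running" {print $1, $3}'; }

if command -v kubectl >/dev/null; then
  kubectl get pods --namespace default | not_running
fi
```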

Singlebox deploy

Steps:

Prepare the file under the dev-box folder: /pai/deployment/quick-start/

There is an example file under the path: /pai/deployment/quick-start/quick-start-example.yaml

An example yaml file is shown below. Note that you should change the IP address of the machine and the SSH information accordingly.

# quick-start.yaml

# (Required) Please fill in the IP address of the server you would like to deploy PAI
# For single box deployment, only one IP address needs to be configured
machines:
  - 192.168.1.11

# (Required) Log-in info of all machines. System administrator should guarantee
# that the username/password pair or username/key-filename is valid and has sudo privilege.
ssh-username: pai
ssh-password: pai-password
ssh-key-filename: key-filename

# (Optional, default=22) Port number of ssh service on each machine.
#ssh-port: 22

# (Optional, default=DNS of the first machine) Cluster DNS.
#dns: <ip-of-dns>

# (Optional, default=10.254.0.0/16) IP range used by Kubernetes. Note that
# this IP range should NOT conflict with the current network.
#service-cluster-ip-range: <ip-range-for-k8s>

Troubleshooting


1 Troubleshooting OpenPAI services

1.1 Diagnosing the problem

  • Monitor

From kubernetes webportal:

Dashboard:

http://<master>:9090


From OpenPAI watchdog:

OpenPAI watchdog

  • Log

From kubernetes webportal:


From each node's container / pod log files:

View container logs under the folder:

ls /var/log/containers

View pod logs under the folder:

ls /var/log/pods
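When hunting for one particular service's output in those folders, it can help to jump straight to the newest matching file. `latest_log` below is a small sketch, not part of OpenPAI; the directory and the rest-server name are only examples.

```shell
# latest_log <dir> <name-fragment>: print the most recently modified
# file in <dir> whose name contains <name-fragment>.
latest_log() {
  ls -t "$1"/*"$2"* 2>/dev/null | head -n 1
}

# Example: tail the newest rest-server container log, if one exists.
f=$(latest_log /var/log/containers rest-server)
if [ -n "$f" ]; then tail -n 50 "$f"; else echo "no matching log file"; fi
```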
  • Debug

As OpenPAI services are deployed on Kubernetes, please refer to Debug Kubernetes Pods.

1.2 Fixing the problem

  • Update OpenPAI Configuration

Check and refine the four yaml files:

    - cluster-configuration.yaml
    - kubernetes-configuration.yaml
    - k8s-role-definition.yaml
    - services-configuration.yaml
  • Customize config for specific service

If you want to customize a single service, you can find the service config file under src and the image dockerfile under src.

  • Update Code & Image

    • Customize the image dockerfile or code

You can find the service's image dockerfile under src and customize it.

  • Rebuild image

Execute the following commands:

Build docker image

    paictl.py image build -p /path/to/configuration/ [ -n image-x ]

Push docker image

    paictl.py image push -p /path/to/configuration/ [ -n image-x ]

If the -n parameter is specified, only the given image, e.g. rest-server, webportal, watchdog, etc., will be built / pushed.

1.3 Reboot service

  1. Stop a single service or all services.
python paictl.py service stop \
  -p /path/to/cluster-configuration/dir \
  [ -n service-name ]

If the -n parameter is specified, only the given service, e.g. rest-server, webportal, watchdog, etc., will be stopped. If not, all PAI services will be stopped.

  2. Boot up a single service or all OpenPAI services.

Please refer to this section for details.
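The stop/start pair described above can be wrapped in a single helper. `restart_service` is a hypothetical wrapper written for this sketch, not a paictl subcommand; the optional third argument lets you preview the commands with echo instead of running them.

```shell
# restart_service <config-dir> <service> [runner]: stop, then start one
# PAI service. Pass "echo" as the runner to dry-run the commands.
restart_service() {
  runner="${3:-python}"
  "$runner" paictl.py service stop  -p "$1" -n "$2" &&
  "$runner" paictl.py service start -p "$1" -n "$2"
}

# Dry-run preview of what would be executed:
restart_service ~/pai-config rest-server echo
```

Run it without the third argument (from the /pai directory in the dev-box) to actually restart the service.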

2 Troubleshooting Kubernetes Clusters

Please refer to Kubernetes Troubleshoot Clusters.

3 Getting help

  • StackOverflow: If you have questions about OpenPAI, please submit a question on Stack Overflow under the tag openpai.
  • Report an issue: If you find a bug or have a feature request, please submit an issue on GitHub.

Maintenance