This repository has been archived by the owner on Nov 3, 2021. It is now read-only.

Backup process hangs after a couple of tries due to docker fifo issue #51

Closed
rashamalek opened this issue Mar 23, 2017 · 8 comments

@rashamalek
Contributor

When the backup process starts, it creates a file named in-progress (meant to prevent other backup processes from starting). But when the process becomes unresponsive (stuck for some reason), the backup never finishes, the process stays in the process list, and the in-progress file remains until the next day, when /delete-instuck-backups/delete_instuck_progress.py takes care of it and deletes the in-progress file (only after one day).

The issue is that it does not take care of the running (stuck) process.

On the other hand, /start_backup.sh only checks whether the pid exists in the process list:

pidof -o $$ -x "$0" >/dev/null 2>&1 && exit 1

In this case no other backup will be executed until someone manually kills the old stuck process or restarts the docker machine.
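
A guard that also reaps stale instances could look roughly like the following (a minimal bash sketch, assuming ps supports the etimes field; the 12-hour threshold and the unconditional kill are assumptions, not the repo's actual behavior):

#!/bin/bash
# Hypothetical extension of the /start_backup.sh guard: instead of only
# refusing to start while another instance exists, also kill instances
# that have been running longer than a healthy backup plausibly could.
MAX_AGE=$((12 * 3600))   # assumed upper bound (in seconds) for a healthy backup

for pid in $(pidof -o $$ -x "$0"); do
    age=$(ps -o etimes= -p "$pid" | tr -d ' ')   # elapsed seconds of that pid
    if [ "${age:-0}" -gt "$MAX_AGE" ]; then
        echo "killing stale backup process $pid (running for ${age}s)"
        kill -9 "$pid"   # note: child processes (rsync, ssh) may need cleanup too
    else
        exit 1           # a healthy backup is still running; do not start another
    fi
done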

@lotharschulz
Contributor

Thanks for this one.

delete_instuck_progress.py's purpose is to delete in-progress files, not to handle stuck processes (#20). So stuck processes are indeed not covered.
Let's see if those processes exist.

@rashamalek
Contributor Author

Hi,
Please find the process list below:

root@3c2bcda1c335:/backup# ps  auxwf
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root      5800  0.0  0.3  21164  3664 ?        Ss   09:04   0:00 bash
root      5822  0.0  0.3  37364  3248 ?        R+   09:06   0:00  \_ ps auxwf
root      4972  0.0  0.0  21180   460 ?        Ss+  Mar22   0:00 bash
root      4087  0.0  0.0  21160   464 ?        Ss+  Mar22   0:00 bash
root         1  0.0  0.0  20980   256 ?        Ss   Mar21   0:00 /bin/bash /backup/final-docker-cmd.sh
root        15  0.0  0.0  29008   908 ?        Ss   Mar21   0:00 cron
root      4109  0.0  0.0  45804   384 ?        S    Mar22   0:00  \_ CRON
root      4110  0.0  0.0   4508    76 ?        Ss   Mar22   0:00      \_ /bin/sh -c /start_backup.sh
root      4111  0.0  0.0   9568   200 ?        S    Mar22   0:00          \_ /bin/bash /start_backup.sh
root      4113  0.0  0.0   9672   296 ?        S    Mar22   0:00              \_ bash /backup/backup-utils/bin/ghe-backup -v
root      4876  0.0  0.0   9660   292 ?        S    Mar22   0:00                  \_ bash /backup/backup-utils/share/github-backup-utils/ghe-backup-pages-rsync
root      4890  0.0  0.0   9664   296 ?        S    Mar22   0:00                      \_ bash /backup/backup-utils/share/github-backup-utils/ghe-backup-userdata pages
root      4923  0.0  0.0   9660   248 ?        S    Mar22   0:00                          \_ bash /backup/backup-utils/share/github-backup-utils/ghe-rsync -avz -e ghe-ssh -p 122 --rsync-path=sudo -u git rsync --link-dest=../../current/pages :
root      4930  0.0  0.0   9660   292 ?        S    Mar22   0:00                              \_ bash /backup/backup-utils/share/github-backup-utils/ghe-rsync -avz -e ghe-ssh -p 122 --rsync-path=sudo -u git rsync --link-dest=../../current/pages GHE_URL
root      4933  0.0  4.7  95944 48300 ?        S    Mar22   0:00                              |   \_ rsync --bwlimit=30000 -avz -e ghe-ssh -p 122 --rsync-path=sudo -u git rsync --link-dest=../../current/pages GHE_URL:/data/user/pages/ /data/ghe-prod
root      4936  0.0  0.4  50964  4780 ?        S    Mar22   0:00                              |   |   \_ ssh -p 22 -l admin -o StrictHostKeyChecking=no -p 122 -o BatchMode=yes GHE_URL -- nice -n 19 ionice -c 3 sudo -u git rsync --server --sender -vl
root      4950  0.0  2.9 159756 29656 ?        S    Mar22   0:00                              |   |   \_ rsync --bwlimit=30000 -avz -e ghe-ssh -p 122 --rsync-path=sudo -u git rsync --link-dest=../../current/pages GHE_URL:/data/user/pages/ /data/ghe-
root      4934  0.0  0.0   9660   292 ?        S    Mar22   0:00                              |   \_ bash /backup/backup-utils/share/github-backup-utils/ghe-rsync -avz -e ghe-ssh -p 122 --rsync-path=sudo -u git rsync --link-dest=../../current/pages GHE_URL
root      4935  0.0  0.0  11284   100 ?        S    Mar22   0:00                              |       \_ grep -E -v ^(file has vanished: |rsync warning: some files vanished before they could be transferred)
root      4931  0.0  0.0   9660   284 ?        S    Mar22   0:00                              \_ bash /backup/backup-utils/share/github-backup-utils/ghe-rsync -avz -e ghe-ssh -p 122 --rsync-path=sudo -u git rsync --link-dest=../../current/pages GHE_URL
root      4932  0.0  0.0  11284   100 ?        S    Mar22   0:00                                  \_ grep -E -v ^(file has vanished: |rsync warning: some files vanished before they could be transferred)
root        16  0.0  0.0   7324    72 ?        S    Mar21   0:13 tail -F /var/log/ghe-prod-backup.log

The last log provided from the box:
[screenshot: last lines of the backup log, 2017-03-24 10:39:44]

The next step would be to find the related logs on the GHE instance.

@lotharschulz lotharschulz self-assigned this Mar 24, 2017
@rashamalek rashamalek changed the title delete_instuck_progress.py progress just removes the in-progress file, and does not kill the process itself. Backup process hangs after a couple of tries due to docker fifo issue Apr 24, 2017
@rashamalek
Contributor Author

Title has been edited from "delete_instuck_progress.py progress just removes the in-progress file, and does not kill the process itself." to "Backup process hangs after a couple of tries due to docker fifo issue".

Here in DockerfileBus,
CMD ["/backup/final-docker-cmd.sh"] is the main process, and
tailing "/var/log/ghe-prod-backup.log" is what keeps that process running.

We use this fifo to move log messages out of the container:
/var/log/ghe-prod-backup.log is not a regular file but a fifo ("https://github.com/zalando/ghe-backup/blob/master/DockerfileBus#L54").
It seems that at some point the fifo is no longer emptied, its buffer fills up, and the process writing to it blocks, which leaves rsync in a wait status.
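
The back-pressure is easy to reproduce with a plain fifo. This is only an illustration (the path and sizes are arbitrary, not from the repo):

mkfifo /tmp/demo.fifo
# hold the read end open without ever reading, mimicking a stalled log consumer
sleep 1000 < /tmp/demo.fifo &
# the writer succeeds until the kernel pipe buffer (64 KiB by default on
# Linux) is full, then blocks: the same wait state rsync ended up in here
dd if=/dev/zero of=/tmp/demo.fifo bs=1k count=128 &
sleep 2; jobs                    # dd is still listed: ~64 KiB written, now blocked
kill %1 %2; rm /tmp/demo.fifo    # cleanup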

There are multiple reported issues with similar symptoms around fifo usage:
moby/moby#22502
http://stackissue.com/docker/for-mac/docker-daemon-hangs-opening-or-writing-to-fifo-911.html
moby/moby#12023
docker/for-mac#911

@lotharschulz
Contributor

@lotharschulz
Contributor

The issue disappeared; backups complete again on both instances:

root@b2a62bd1bc6d:/data/ghe-production-data# ll
....
drwxr-xr-x 11 root root 4096 Apr 27 16:33 20170427T161301/
drwxr-xr-x 11 root root 4096 Apr 27 20:25 20170427T201301/
lrwxrwxrwx  1 root root   15 Apr 27 20:25 current -> 20170427T201301/
root@b2a62bd1bc6d:/data/ghe-production-data# ll 20170427T201301
total 1839036
drwxr-xr-x 11 root root       4096 Apr 27 20:25 ./
drwxr-xr-x 42 root root       4096 Apr 27 20:28 ../
drwx------  3  500  500       4096 Oct 15  2016 alambic_assets/
drwxr-xr-x  2 root root       4096 Apr 27 20:16 audit-log/
-rw-r--r--  1 root root       7967 Apr 27 20:13 authorized-keys.json
drwxr-xr-x  2 root root       4096 Apr 27 20:13 benchmarks/
drwx------  3  601  601       4096 Oct 16  2016 elasticsearch/
-rw-r--r--  1 root root      17520 Apr 27 20:13 enterprise.ghl
drwxr-xr-x  3 root root       4096 Apr 27 20:25 git-hooks/
drwx------  2  500  500       4096 Oct 15  2016 hookshot/
-rw-r--r--  1 root root          0 Apr 27 20:13 manage-password+
-rw-r--r--  1 root root 1203762941 Apr 27 20:15 mysql.sql.gz
drwx------ 10  500  500       4096 Oct 22  2015 pages/
-rw-r--r--  1 root root  679234082 Apr 27 20:16 redis.rdb
drwx------ 19  500  500       4096 Apr 27 20:16 repositories/
-rw-r--r--  1 root root      10240 Apr 27 20:13 saml-keys.tar
-rw-r--r--  1 root root      23303 Apr 27 20:13 settings.json
-rw-r--r--  1 root root      20480 Apr 27 20:13 ssh-host-keys.tar
-rw-r--r--  1 root root      20480 Apr 27 20:13 ssl-ca-certificates.tar
drwx------ 19  500  500       4096 Apr 24 16:16 storage/
-rw-r--r--  1 root root          6 Apr 27 20:13 strategy
-rw-r--r--  1 root root         37 Apr 27 20:13 uuid
-rw-r--r--  1 root root          7 Apr 27 20:13 version
root@0ecb5c11160a:/data/ghe-production-data# ll
.....
drwxr-xr-x 11 root root 4096 Apr 27 14:33 20170427T141301/
drwxr-xr-x 11 root root 4096 Apr 27 18:32 20170427T181301/
lrwxrwxrwx  1 root root   15 Apr 27 18:32 current -> 20170427T181301/
root@0ecb5c11160a:/data/ghe-production-data# ll 20170427T181301                                                                                  
total 1838180
drwxr-xr-x 11 root root       4096 Apr 27 18:32 ./
drwxr-xr-x 42 root root       4096 Apr 27 18:35 ../
drwx------  3  500  500       4096 Oct 15  2016 alambic_assets/
drwxr-xr-x  2 root root       4096 Apr 27 18:16 audit-log/
-rw-r--r--  1 root root       7967 Apr 27 18:13 authorized-keys.json
drwxr-xr-x  2 root root       4096 Apr 27 18:13 benchmarks/
drwx------  3  601  601       4096 Oct 16  2016 elasticsearch/
-rw-r--r--  1 root root      17520 Apr 27 18:13 enterprise.ghl
drwxr-xr-x  3 root root       4096 Apr 27 18:31 git-hooks/
drwx------  2  500  500       4096 Oct 15  2016 hookshot/
-rw-r--r--  1 root root          0 Apr 27 18:13 manage-password+
-rw-r--r--  1 root root 1202885869 Apr 27 18:15 mysql.sql.gz
drwx------ 10  500  500       4096 Oct 22  2015 pages/
-rw-r--r--  1 root root  679234060 Apr 27 18:16 redis.rdb
drwx------ 19  500  500       4096 Apr 27 18:17 repositories/
-rw-r--r--  1 root root      10240 Apr 27 18:13 saml-keys.tar
-rw-r--r--  1 root root      23303 Apr 27 18:13 settings.json
-rw-r--r--  1 root root      20480 Apr 27 18:13 ssh-host-keys.tar
-rw-r--r--  1 root root      20480 Apr 27 18:13 ssl-ca-certificates.tar
drwx------ 19  500  500       4096 Apr 24 16:16 storage/
-rw-r--r--  1 root root          6 Apr 27 18:13 strategy
-rw-r--r--  1 root root         37 Apr 27 18:13 uuid
-rw-r--r--  1 root root          7 Apr 27 18:13 version

@rashamalek
Contributor Author

It would also be possible to configure cron for more frequent backups (every other hour, or even every hour) for both instances.
It could be addressed in another issue, if you agree.
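
For illustration, the schedule change could be as small as one crontab line (a sketch; the container's real crontab entry is not shown in this thread, only that cron runs /start_backup.sh):

# hypothetical crontab entry: run the backup at the top of every second hour
0 */2 * * * /start_backup.sh
# or every hour:
# 0 * * * * /start_backup.sh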

@lotharschulz
Contributor

lotharschulz commented Apr 28, 2017

Yes, let's do this in another issue.

@lotharschulz
Contributor

The issue is still gone, closing this.
