Zipfulldata.rb crashes #444

Closed · gedankenstuecke opened this issue Dec 6, 2017 · 24 comments

@gedankenstuecke (Member)

Related to the crashes mentioned in #443:

I upgraded our sidekiq machine to the next level, which means it now has a nice 8 GB of memory to fulfill its tasks. As a user mentioned via email that the opensnp_datadump.current.zip is super outdated, I manually started the Zipfulldata task and observed the memory consumption. Turns out it is so hungry that even the 8 GB weren't enough; the whole machine crashed once again. 😒

This might also explain why the newsletters can't be sent at regular intervals: zipfulldata.rb runs once a day to create the latest dump. Since it kills the whole machine, the newsletter jobs run out of memory at that point. 😂

I could not fully figure out why the task takes up that much memory. One thing I did notice is that we're heavily over-logging, as every single phenotype/user combination creates a new logging event. Another thing I noticed: create_picture_zip produces nothing but errors:

Dec  5 19:28:55 sidekiq docker/71c7c216ff34[627]: 2017-12-05T18:28:55.153Z 2155 TID-gdihg Zipfulldata JID-5b09eef25e572a91040079c7 INFO: create_picture_zip: Errno::ENOENT: No such file or directory @ rb_file_s_lstat - /home/app/snpr/public/system/user_picture_phenotypes/phenotype_pictures/000/000/057/original/WIN_20141119_184544.JPG

I found a very simple reason for this: the whole path /home/app/snpr/public/system/ is not available in the sidekiq Docker container, while it is available in the web container! Somehow the mounting there doesn't work like it's supposed to. @tsujigiri @philippbayer Help is appreciated 😱 Do you have ideas for the following?

  1. How do we make sure that the .../system folder shows up on the sidekiq machine?
  2. Should we fix that, remove the super-detailed/senseless logging events (see the sketch after this list), and see what happens when we run the job again?
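
A minimal sketch of what toned-down logging could look like; all names here (phenotypes, add_to_zip, log) are hypothetical stand-ins, since this isn't the actual loop from zipfulldata.rb:

# Hypothetical sketch: replace the per-combination log lines with a
# periodic progress summary every 1,000 items.
LOG_EVERY = 1_000

phenotypes.each_with_index do |phenotype, i|
  add_to_zip(phenotype) # placeholder for the real zipping step
  if ((i + 1) % LOG_EVERY).zero?
    log.info("processed #{i + 1} phenotype/user combinations")
  end
end
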
@gedankenstuecke (Member Author)

What I figured out so far:

  1. On the web machine we do have a folder /data2, which is mounted as the system directory in the docker container on that machine.
  2. On the sidekiq machine, /data2 is not mounted at all.
  3. Correspondingly, the system directory is never mounted in the sidekiq docker container (a quick way to confirm this is sketched below). 😂
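
To compare, one can ask Docker directly which bind mounts each running container has (standard docker inspect; the container name is a hypothetical placeholder):

# List the bind mounts of a running container; run on both hosts.
docker inspect --format '{{ json .Mounts }}' snpr_app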

@tsujigiri (Collaborator) commented Dec 6, 2017

Maybe the storage box got disconnected again? In that case, remounting it should help. Take a look at /etc/fstab for the mount point. I can't access the machine currently (out of memory?).
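
For reference, remounting everything listed in /etc/fstab is standard mount(8) usage:

# Mount everything from /etc/fstab that isn't currently mounted.
mount -a

# Or remount a single share by its mount point:
mount /mnt/web-data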

@gedankenstuecke (Member Author)

Yeah, I guess it's because the Zipfulldata job would have started again? :D

I restarted the machine; here's the /etc/fstab:

root@sidekiq ~ # cat /etc/fstab
proc /proc proc defaults 0 0
# /dev/sda1 during Installation (RescueSystem)
UUID=4473aef2-ec5a-4f64-ada0-14063df854d0 / ext4 defaults,discard 0 0
//u114097.your-storagebox.de/backup/web-data /mnt/web-data cifs credentials=/etc/smbcredentials,iocharset=utf8,sec=ntlm,defaults,noperm,uid=9999,forceuid,gid=9999,forcegid  0  0
//u114097.your-storagebox.de/backup/backups /mnt/backups cifs credentials=/etc/smbcredentials,iocharset=utf8,sec=ntlm,defaults  0  0

Here's the same thing for the web machine:

root@web ~ # cat /etc/fstab
proc /proc proc defaults 0 0
# /dev/sda1 during Installation (RescueSystem)
UUID=4ec01ba1-7491-4149-b702-e9b630ec2c26 / ext4 defaults,discard 0 0
//u114097.your-storagebox.de/backup/web-data /mnt/web-data cifs credentials=/root/.smbcredentials,iocharset=utf8,sec=ntlm,defaults,noperm,uid=web-data,forceuid,gid=web-data,forcegid,_netdev,cache=strict  0  0
#//u114097.your-storagebox.de/backup/web-data /mnt/web-data cifs credentials=/root/.smbcredentials,iocharset=utf8,sec=ntlm,defaults,noperm,uid=web-data,forceuid,gid=web-data,forcegid,_netdev,file_mode=0666,cache=strict  0  0
#u114097@u114097.your-storagebox.de:/web-data /mnt/web-data fuse.sshfs defaults  0  0

So the storage box itself is available on sidekiq (which makes sense; otherwise we couldn't access any data). What's absent is what we can find in /data2 on the web machine:

root@web /data2 # ll
total 16
drwxr-xr-x  4 web-data web-data 4096 Nov 16  2015 ./
drwxr-xr-x 23 root     root     4096 Feb 11  2016 ../
-rw-r--r--  1 web-data web-data    0 Nov 16  2015 test
drwxr-xr-x  3 web-data web-data 4096 Oct 27  2015 user_picture_phenotypes/
drwxr-xr-x  3 web-data web-data 4096 Oct 27  2015 users/

But I can't figure out where it comes from. (Well, it's 7 am here and I haven't had coffee yet.)

@philippbayer (Member)

Thank you for having a look :)

Not sure how to fix the mounting for now (it's 23:30 here...), but there are some tricks to get rubyzip to use less memory, for example by fiddling with the buffer settings as in this piece of code: rubyzip/rubyzip#261
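
The general idea as a minimal sketch: stream each entry in fixed-size chunks instead of reading whole files into memory. Zip::OutputStream is real rubyzip API; the paths here are hypothetical:

require 'zip'

CHUNK_SIZE = 16 * 1024 # 16 KiB read buffer keeps memory flat per file

Zip::OutputStream.open('/tmp/opensnp_dump.zip') do |zos|
  Dir.glob('/tmp/dump_sources/*').each do |path|
    next unless File.file?(path)
    zos.put_next_entry(File.basename(path))
    File.open(path, 'rb') do |f|
      zos.write(f.read(CHUNK_SIZE)) until f.eof?
    end
  end
end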

We could also see what happens if we skip the gem and call the system zip directly (sketched below). We're already running the latest version of rubyzip; I'll give that a try ASAP.
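
A sketch of the system-zip variant, assuming the zip binary is available inside the container (the paths are hypothetical):

# Shell out to the system zip; it streams files itself, so the Ruby
# process's memory use stays constant regardless of archive size.
ok = system('zip', '-r', '/tmp/opensnp_dump.zip', '/tmp/dump_sources')
raise "zip failed with exit status #{$?.exitstatus}" unless ok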

@tsujigiri (Collaborator) commented Dec 6, 2017

Isn't /data2 just the mount-point of the storage box within the docker container? I'm not sure either. Just guessing.

@gedankenstuecke (Member Author)

Actually not: if you look at the web machine, you'll find this in /root/deploy/docker-run:

docker run --name=#{tmp_name} \
…
             -v /data2:/home/app/snpr/public/system \
…

So the host's /data2 will be mapped to the public/system folder in the docker container.

@tsujigiri (Collaborator)

Ah, I see. So, I guess this is simply to ensure that the data in there doesn't get lost between container instances.

@gedankenstuecke (Member Author)

Right, but where is /data2 on the web machine mounted from? It should be part of the storage box as well; otherwise there's no way to get it onto the sidekiq machine.

That's the reason we have -v /mnt/web-data/data:/home/app/snpr/public/data in both the sidekiq and the web deployment after all.

@tsujigiri (Collaborator)

I get that, but I don't know. 😄

@gedankenstuecke (Member Author)

Haha, great! Maybe @philippbayer has an idea. At least the folders on the web machine are owned by the web-data user, so I hope we can just mount them on sidekiq too. It's funny though: given this setup, the zipfulldata task can never have saved the pictures, right? :D

@tsujigiri (Collaborator)

Doesn't seem like it has been missed. 😅

@gedankenstuecke (Member Author)

Seems like another feature that could be cut out then. ;)

Do we have some documentation of how the storage boxes are mounted? Couldn't find anything on that in the ops repo. :D

@tsujigiri (Collaborator)

Other than fstab? 😄

@gedankenstuecke (Member Author)

Ok, so according to /etc/fstab, the mount points on both the sidekiq and web machines are basically the same. /data2 on the web machine is definitely not mounted from anywhere; it's just a directory on that machine that belongs to the web-data user. To re-enable the inclusion of the picture phenotype data in zipfulldata.rb, we'd have to store this directory on the storage box too and then mount it into the sidekiq containers as well.

Should we do that?

@philippbayer (Member)

OK, my recent PR hopefully alleviates some of the issue, though it probably won't fix it completely.

Currently, the script /root/deploy/docker-run mounts /data2 into /home/app/snpr/public/system, but only on the web host. It would be best (I guess?) to move /data2 into /mnt/web-data/data2, then change the web deploy script to reflect the new path and add the corresponding line to the deploy script on sidekiq (sketched below). How does that sound?
It's just a little bit scary what could happen while I move /data2/ around. Are there any possible problems besides "a user uploads a picture at exactly that moment", which is unlikely?
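
A sketch of that migration with the paths from this thread; the rsync invocation and the exact -v line are assumptions, not tested commands:

# On the web host: copy /data2 onto the storage box, preserving ownership.
rsync -a /data2/ /mnt/web-data/data2/

# Then change the bind mount in /root/deploy/docker-run on the web host,
# and add the same line to the deploy script on sidekiq:
#   -v /mnt/web-data/data2:/home/app/snpr/public/system \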

@gedankenstuecke (Member Author)

Ack, how big is the folder? If it's small enough, we could briefly shut down the web server so that this can't happen (no one can upload anything) and then transfer it?

@philippbayer (Member)

That would probably be the safest. It's only 106 MB.

@gedankenstuecke (Member Author)

Ok, then that sounds like a good way to me :)

@gedankenstuecke (Member Author)

Current status: @philippbayer moved the /data2 that was on the web machine into our storage box as well. It's now also mounted on the sidekiq machine, so the zip job should be able to run through again. That's why I opened #449.

Also: with the fix from #447 for where the zip file is written, the non-picture zips all ran through without an issue again. Let's hope the combination of all of this fixes our memory problems!

@gedankenstuecke (Member Author)

Status update:

  • Added the picture zipping parts back into the larger job
  • Adapted the test so that it works with the new tmp file structure

Next step:

  • @philippbayer will run the new task and see whether there are still weird memory issues going on. If not, we can finally get back to the newsletter that started this whole odyssey! 😂

@philippbayer (Member)

Dec 11 10:10:31 sidekiq docker/7f942f861f3e[567]: 2017-12-11T09:10:31.494Z 144 TID-ke3s0 Zipfulldata JID-304f4a9b9470641f53280a93 INFO: Added 119.jpeg
Dec 11 10:10:33 sidekiq docker/7f942f861f3e[567]: 2017-12-11T09:10:33.633Z 144 TID-ke3s0 Zipfulldata JID-304f4a9b9470641f53280a93 INFO: created picture zip file
Dec 11 10:17:02 sidekiq docker/7f942f861f3e[567]: Dec 11 09:17:01 opensnp CRON[347]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Dec 11 11:17:02 sidekiq docker/7f942f861f3e[567]: Dec 11 10:17:02 opensnp CRON[351]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Dec 11 12:15:34 sidekiq docker/7f942f861f3e[567]: 2017-12-11T11:15:34.987Z 144 TID-ke3s0 Zipfulldata JID-304f4a9b9470641f53280a93 DEBUG: Retries exhausted for job
Dec 11 12:15:34 sidekiq docker/7f942f861f3e[567]: 2017-12-11T11:15:34.998Z 144 TID-ke3s0 Zipfulldata JID-304f4a9b9470641f53280a93 INFO: fail: 11373.317 sec
Dec 11 12:15:34 sidekiq docker/7f942f861f3e[567]: 2017-12-11T11:15:34.999Z 144 TID-ke3s0 WARN: {"context":"Job raised exception","job":{"class":"Zipfulldata","args":[],"retry":0,"queue":"zipfulldata","unique":true,"dead":false,"jid":"304f4a9b9470641f53280a93","created_at":1512979561.6552744,"enqueued_at":1512979561.6555452,"error_message":"Permission denied @ chown_internal - /home/app/snpr/public/data/zip/opensnp_datadump.201712110806.zip","error_class":"Errno::EACCES","failed_at":1512990934.9876456,"retry_count":0},"jobstr":"{\"class\":\"Zipfulldata\",\"args\":[],\"retry\":0,\"queue\":\"zipfulldata\",\"unique\":true,\"dead\":false,\"jid\":\"304f4a9b9470641f53280a93\",\"created_at\":1512979561.6552744,\"enqueued_at\":1512979561.6555452}"}

It still fails, but as I wrote on Gitter, I think that's because FileUtils.mv falls back to cp across filesystems while preserving file attributes, i.e. it runs the chown/chmod automatically. Should we just delete the chmod command, since it's Samba storage anyway?
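
One possible workaround as a sketch (not necessarily what the eventual fix does): replace the FileUtils.mv with a copy-then-delete that skips metadata preservation, so no chown is attempted on the CIFS mount. The method name and paths are hypothetical:

require 'fileutils'

# Move a file across filesystems without preserving owner/group/mtime.
# FileUtils.cp without preserve: true copies only the contents, so the
# CIFS target never sees the chown that makes FileUtils.mv blow up.
def move_without_preserve(src, dst)
  FileUtils.cp(src, dst)
  FileUtils.rm(src)
end

# e.g.: move_without_preserve('/tmp/opensnp_dump.zip',
#   '/home/app/snpr/public/data/zip/opensnp_datadump.current.zip')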

@gedankenstuecke (Member Author)

That should be fixed with the next PR (#452). I'll merge once Travis greenlights it and then restart the zip job.

@gedankenstuecke (Member Author)

Ha, look what happened now:

Dec 11 21:29:04 sidekiq docker/2ae4c8320706[567]: 2017-12-11T20:29:04.237Z 142 TID-fdyo4 WARN: {"context":"Job raised exception","job":{"class":"Zipfulldata","args":[],"retry":0,"queue":"zipfulldata","unique":true,"dead":false,"jid":"ffa54f0b12bc317252d6f7ab","created_at":1513012665.6447785,"enqueued_at":1513012665.6450741,"error_message":"Permission denied @ chown_internal - /home/app/snpr/public/data/zip/opensnp_datadump.201712111717.zip","error_class":"Errno::EACCES","failed_at":1513024144.2150855,"retry_count":0},"jobstr":"{\"class\":\"Zipfulldata\",\"args\":[],\"retry\":0,\"queue\":\"zipfulldata\",\"unique\":true,\"dead\":false,\"jid\":\"ffa54f0b12bc317252d6f7ab\",\"created_at\":1513012665.6447785,\"enqueued_at\":1513012665.6450741}"}
Dec 11 21:29:04 sidekiq docker/2ae4c8320706[567]: 2017-12-11T20:29:04.238Z 142 TID-fdyo4 WARN: Errno::EACCES: Permission denied @ chown_internal - /home/app/snpr/public/data/zip/opensnp_datadump.201712111717.zip
Dec 11 21:29:04 sidekiq docker/2ae4c8320706[567]: 2017-12-11T20:29:04.238Z 142 TID-fdyo4 WARN: /usr/local/rvm/rubies/ruby-2.3.3/lib/ruby/2.3.0/fileutils.rb:1411:in `chown'

As @philippbayer mentioned: FileUtils.mv runs a cp and tries to preserve the file attributes, which means it internally tries to run chown, and that fails again. Not sure how best to solve this right now. 🤔

@gedankenstuecke (Member Author)

Ok, my cheap hack in #453 worked, so I think we can close this! 👍
