Zipfulldata.rb crashes #444

Closed · gedankenstuecke opened this issue Dec 6, 2017 · 24 comments

@gedankenstuecke (Member)

Related to the crashes mentioned in #443:

I upgraded our sidekiq machine to the next level, which means it now has a nice 8 GB of memory to fulfill its tasks. As a user mentioned via email that the opensnp_datadump.current.zip is super outdated, I manually started the Zipfulldata task and observed the memory consumption. Turns out it is so hungry that even the 8 GB weren't enough; the whole machine crashed once again. 😒

This might also explain why the newsletters can't be sent at regular intervals: zipfulldata.rb runs once a day to create the latest dump. Since it kills the whole machine, the newsletter jobs run out of memory at that point. 😂

I could not fully figure out why the task takes up that much memory. One thing I did notice is that we're heavily over-logging, as every single phenotype/user combination creates a new logging event. Another thing I noticed: create_picture_zip produces nothing but errors:

Dec  5 19:28:55 sidekiq docker/71c7c216ff34[627]: 2017-12-05T18:28:55.153Z 2155 TID-gdihg Zipfulldata JID-5b09eef25e572a91040079c7 INFO: create_picture_zip: Errno::ENOENT: No such file or directory @ rb_file_s_lstat - /home/app/snpr/public/system/user_picture_phenotypes/phenotype_pictures/000/000/057/original/WIN_20141119_184544.JPG

I found a very simple reason for this: the whole path /home/app/snpr/public/system/ is not available in the sidekiq Docker container, while it is available in the web container! Somehow the mounting there doesn't work like it's supposed to. @tsujigiri @philippbayer Help is appreciated 😱 Do you have ideas for the following?

  1. How do we make sure that the .../system folder shows up on the sidekiq machine?
  2. Should we fix that, remove the super-detailed/senseless logging events (see the sketch after this list), and see what happens when we run the job again?
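
A minimal sketch of what toned-down logging could look like; all names here (phenotypes, add_to_zip, log) are hypothetical stand-ins, since this isn't the actual loop from zipfulldata.rb:

# Hypothetical sketch: replace the per-combination log lines with a
# periodic progress summary every 1,000 items.
LOG_EVERY = 1_000

phenotypes.each_with_index do |phenotype, i|
  add_to_zip(phenotype) # placeholder for the real zipping step
  if ((i + 1) % LOG_EVERY).zero?
    log.info("processed #{i + 1} phenotype/user combinations")
  end
end
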
@gedankenstuecke (Member Author)

What I figured out so far:

  1. On the web machine we do have a folder /data2, which is mounted as the system directory in the docker container on that machine.
  2. On the sidekiq machine, /data2 is not mounted at all.
  3. Correspondingly, the system directory is never mounted in the sidekiq docker container (a quick way to confirm this is sketched below). 😂
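
To compare, one can ask Docker directly which bind mounts each running container has (standard docker inspect; the container name is a hypothetical placeholder):

# List the bind mounts of a running container; run on both hosts.
docker inspect --format '{{ json .Mounts }}' snpr_app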

@tsujigiri (Collaborator) commented Dec 6, 2017

Maybe the storage box got disconnected again? In that case, remounting it should help. Take a look at /etc/fstab for the mount point. I can't access the machine currently (out of memory?).
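
For reference, remounting everything listed in /etc/fstab is standard mount(8) usage:

# Mount everything from /etc/fstab that isn't currently mounted.
mount -a

# Or remount a single share by its mount point:
mount /mnt/web-data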

@gedankenstuecke (Member Author)

Yeah, I guess it's because the Zipfulldata job would have started again? :D

I restarted the machine; here's the /etc/fstab:

root@sidekiq ~ # cat /etc/fstab
proc /proc proc defaults 0 0
# /dev/sda1 during Installation (RescueSystem)
UUID=4473aef2-ec5a-4f64-ada0-14063df854d0 / ext4 defaults,discard 0 0
//u114097.your-storagebox.de/backup/web-data /mnt/web-data cifs credentials=/etc/smbcredentials,iocharset=utf8,sec=ntlm,defaults,noperm,uid=9999,forceuid,gid=9999,forcegid  0  0
//u114097.your-storagebox.de/backup/backups /mnt/backups cifs credentials=/etc/smbcredentials,iocharset=utf8,sec=ntlm,defaults  0  0

Here's the same thing for the web machine:

root@web ~ # cat /etc/fstab
proc /proc proc defaults 0 0
# /dev/sda1 during Installation (RescueSystem)
UUID=4ec01ba1-7491-4149-b702-e9b630ec2c26 / ext4 defaults,discard 0 0
//u114097.your-storagebox.de/backup/web-data /mnt/web-data cifs credentials=/root/.smbcredentials,iocharset=utf8,sec=ntlm,defaults,noperm,uid=web-data,forceuid,gid=web-data,forcegid,_netdev,cache=strict  0  0
#//u114097.your-storagebox.de/backup/web-data /mnt/web-data cifs credentials=/root/.smbcredentials,iocharset=utf8,sec=ntlm,defaults,noperm,uid=web-data,forceuid,gid=web-data,forcegid,_netdev,file_mode=0666,cache=strict  0  0
#u114097@u114097.your-storagebox.de:/web-data /mnt/web-data fuse.sshfs defaults  0  0

So the storage box itself is available on sidekiq (which makes sense; otherwise we couldn't access any data). What's absent is what we can find in /data2 on the web machine:

root@web /data2 # ll
total 16
drwxr-xr-x  4 web-data web-data 4096 Nov 16  2015 ./
drwxr-xr-x 23 root     root     4096 Feb 11  2016 ../
-rw-r--r--  1 web-data web-data    0 Nov 16  2015 test
drwxr-xr-x  3 web-data web-data 4096 Oct 27  2015 user_picture_phenotypes/
drwxr-xr-x  3 web-data web-data 4096 Oct 27  2015 users/

But I can't figure out where it comes from. (Well, it's 7 am here and I haven't had coffee yet.)

@philippbayer (Member)

Thank you for having a look :)

Not sure how to fix the mounting for now (it's 23:30 here...), but there are some tricks to get rubyzip to use less memory, for example by fiddling with the buffer settings as in this piece of code: rubyzip/rubyzip#261
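
The general idea as a minimal sketch: stream each entry in fixed-size chunks instead of reading whole files into memory. Zip::OutputStream is real rubyzip API; the paths here are hypothetical:

require 'zip'

CHUNK_SIZE = 16 * 1024 # 16 KiB read buffer keeps memory flat per file

Zip::OutputStream.open('/tmp/opensnp_dump.zip') do |zos|
  Dir.glob('/tmp/dump_sources/*').each do |path|
    next unless File.file?(path)
    zos.put_next_entry(File.basename(path))
    File.open(path, 'rb') do |f|
      zos.write(f.read(CHUNK_SIZE)) until f.eof?
    end
  end
end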

We could also see what happens if we skip the gem and call the system zip directly (sketched below). We're already running the latest version of rubyzip; I'll give that a try ASAP.
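
A sketch of the system-zip variant, assuming the zip binary is available inside the container (the paths are hypothetical):

# Shell out to the system zip; it streams files itself, so the Ruby
# process's memory use stays constant regardless of archive size.
ok = system('zip', '-r', '/tmp/opensnp_dump.zip', '/tmp/dump_sources')
raise "zip failed with exit status #{$?.exitstatus}" unless ok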

@tsujigiri (Collaborator) commented Dec 6, 2017

Isn't /data2 just the mount-point of the storage box within the docker container? I'm not sure either. Just guessing.

@gedankenstuecke (Member Author)

Actually not: if you look at the web machine, you'll find this in /root/deploy/docker-run:

docker run --name=#{tmp_name} \
…
             -v /data2:/home/app/snpr/public/system \
…

So the host's /data2 will be mapped to the public/system folder in the docker container.

@tsujigiri (Collaborator)

Ah, I see. So, I guess this is simply to ensure that the data in there doesn't get lost between container instances.

@gedankenstuecke (Member Author)

Right, but where is /data2 on the web machine mounted from? It should be part of the storage box as well; otherwise there's no way to get it onto the sidekiq machine.

That's the reason we have -v /mnt/web-data/data:/home/app/snpr/public/data in both the sidekiq and the web deployment after all.

@tsujigiri (Collaborator)

I get that, but I don't know. 😄

@gedankenstuecke (Member Author)

Haha, great! Maybe @philippbayer has an idea. At least the folders on the web machine are owned by the web-data user, so I hope we can just mount them on sidekiq too. It's funny though: given this setup, the zipfulldata task can never have saved the pictures, right? :D

@tsujigiri (Collaborator)

Doesn't seem like it has been missed. 😅

@gedankenstuecke (Member Author)

Seems like another feature that could be cut out then. ;)

Do we have some documentation of how the storage boxes are mounted? Couldn't find anything on that in the ops repo. :D

@tsujigiri (Collaborator)

Other than fstab? 😄

@gedankenstuecke (Member Author)

Ok, so according to /etc/fstab, the mount points on both the sidekiq and web machines are basically the same. /data2 on the web machine is definitely not mounted from anywhere; it's just a directory on that machine that belongs to the web-data user. To re-enable the inclusion of the picture phenotype data in zipfulldata.rb, we'd have to store this directory on the storage box too and then mount it into the sidekiq containers as well.

Should we do that?

@philippbayer (Member)

OK, my recent PR hopefully alleviates some of the issue, though it probably won't fix it completely.

Currently, the script /root/deploy/docker-run mounts /data2 into /home/app/snpr/public/system, but only on the web host. It would be best (I guess?) to move /data2 into /mnt/web-data/data2, then change the web deploy script to reflect the new path and add the corresponding line to the deploy script on sidekiq (sketched below). How does that sound?
It's just a little bit scary what could happen while I move /data2/ around. Are there any possible problems besides "a user uploads a picture at exactly that moment", which is unlikely?
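
A sketch of that migration with the paths from this thread; the rsync invocation and the exact -v line are assumptions, not tested commands:

# On the web host: copy /data2 onto the storage box, preserving ownership.
rsync -a /data2/ /mnt/web-data/data2/

# Then change the bind mount in /root/deploy/docker-run on the web host,
# and add the same line to the deploy script on sidekiq:
#   -v /mnt/web-data/data2:/home/app/snpr/public/system \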

@gedankenstuecke (Member Author)

Ack, how big is the folder? If it's small enough, we could briefly shut down the web server so that this can't happen (no one can upload anything) and then transfer it?

@philippbayer (Member)

That would probably be the safest. It's only 106 MB.

@gedankenstuecke (Member Author)

Ok, then that sounds like a good way to me :)

@gedankenstuecke (Member Author)

Current status: @philippbayer moved the /data2 that was on the web machine into our storage box as well. It's now also mounted on the sidekiq machine, so the zip job should be able to run through again. That's why I opened #449.

Also: with the fix from #447 for where the zip file is written, the non-picture zips all ran through without an issue again. Let's hope the combination of all of this fixes our memory problems!

@gedankenstuecke (Member Author)

Status update:

  • Added the picture zipping parts back into the larger job
  • Adapted the test so that it works with the new tmp file structure

Next step:

  • @philippbayer will run the new task and see whether there are still weird memory issues going on. If not, we can finally get back to the newsletter that started this whole odyssey! 😂

@philippbayer (Member)

Dec 11 10:10:31 sidekiq docker/7f942f861f3e[567]: 2017-12-11T09:10:31.494Z 144 TID-ke3s0 Zipfulldata JID-304f4a9b9470641f53280a93 INFO: Added 119.jpeg
Dec 11 10:10:33 sidekiq docker/7f942f861f3e[567]: 2017-12-11T09:10:33.633Z 144 TID-ke3s0 Zipfulldata JID-304f4a9b9470641f53280a93 INFO: created picture zip file
Dec 11 10:17:02 sidekiq docker/7f942f861f3e[567]: Dec 11 09:17:01 opensnp CRON[347]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Dec 11 11:17:02 sidekiq docker/7f942f861f3e[567]: Dec 11 10:17:02 opensnp CRON[351]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Dec 11 12:15:34 sidekiq docker/7f942f861f3e[567]: 2017-12-11T11:15:34.987Z 144 TID-ke3s0 Zipfulldata JID-304f4a9b9470641f53280a93 DEBUG: Retries exhausted for job
Dec 11 12:15:34 sidekiq docker/7f942f861f3e[567]: 2017-12-11T11:15:34.998Z 144 TID-ke3s0 Zipfulldata JID-304f4a9b9470641f53280a93 INFO: fail: 11373.317 sec
Dec 11 12:15:34 sidekiq docker/7f942f861f3e[567]: 2017-12-11T11:15:34.999Z 144 TID-ke3s0 WARN: {"context":"Job raised exception","job":{"class":"Zipfulldata","args":[],"retry":0,"queue":"zipfulldata","unique":true,"dead":false,"jid":"304f4a9b9470641f53280a93","created_at":1512979561.6552744,"enqueued_at":1512979561.6555452,"error_message":"Permission denied @ chown_internal - /home/app/snpr/public/data/zip/opensnp_datadump.201712110806.zip","error_class":"Errno::EACCES","failed_at":1512990934.9876456,"retry_count":0},"jobstr":"{\"class\":\"Zipfulldata\",\"args\":[],\"retry\":0,\"queue\":\"zipfulldata\",\"unique\":true,\"dead\":false,\"jid\":\"304f4a9b9470641f53280a93\",\"created_at\":1512979561.6552744,\"enqueued_at\":1512979561.6555452}"}

It still fails, but as I wrote on Gitter, I think that's because FileUtils.mv falls back to cp across filesystems while preserving file attributes, i.e. it runs the chown/chmod automatically. Should we just delete the chmod command, since it's Samba storage anyway?
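
One possible workaround as a sketch (not necessarily what the eventual fix does): replace the FileUtils.mv with a copy-then-delete that skips metadata preservation, so no chown is attempted on the CIFS mount. The method name and paths are hypothetical:

require 'fileutils'

# Move a file across filesystems without preserving owner/group/mtime.
# FileUtils.cp without preserve: true copies only the contents, so the
# CIFS target never sees the chown that makes FileUtils.mv blow up.
def move_without_preserve(src, dst)
  FileUtils.cp(src, dst)
  FileUtils.rm(src)
end

# e.g.: move_without_preserve('/tmp/opensnp_dump.zip',
#   '/home/app/snpr/public/data/zip/opensnp_datadump.current.zip')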

@gedankenstuecke (Member Author)

That should be fixed with the next PR (#452). I'll merge once Travis greenlights it and then restart the zip job.

@gedankenstuecke (Member Author)

Ha, look what happened now:

Dec 11 21:29:04 sidekiq docker/2ae4c8320706[567]: 2017-12-11T20:29:04.237Z 142 TID-fdyo4 WARN: {"context":"Job raised exception","job":{"class":"Zipfulldata","args":[],"retry":0,"queue":"zipfulldata","unique":true,"dead":false,"jid":"ffa54f0b12bc317252d6f7ab","created_at":1513012665.6447785,"enqueued_at":1513012665.6450741,"error_message":"Permission denied @ chown_internal - /home/app/snpr/public/data/zip/opensnp_datadump.201712111717.zip","error_class":"Errno::EACCES","failed_at":1513024144.2150855,"retry_count":0},"jobstr":"{\"class\":\"Zipfulldata\",\"args\":[],\"retry\":0,\"queue\":\"zipfulldata\",\"unique\":true,\"dead\":false,\"jid\":\"ffa54f0b12bc317252d6f7ab\",\"created_at\":1513012665.6447785,\"enqueued_at\":1513012665.6450741}"}
Dec 11 21:29:04 sidekiq docker/2ae4c8320706[567]: 2017-12-11T20:29:04.238Z 142 TID-fdyo4 WARN: Errno::EACCES: Permission denied @ chown_internal - /home/app/snpr/public/data/zip/opensnp_datadump.201712111717.zip
Dec 11 21:29:04 sidekiq docker/2ae4c8320706[567]: 2017-12-11T20:29:04.238Z 142 TID-fdyo4 WARN: /usr/local/rvm/rubies/ruby-2.3.3/lib/ruby/2.3.0/fileutils.rb:1411:in `chown'

As @philippbayer mentioned: FileUtils.mv runs a cp and tries to preserve the file attributes, which means it internally tries to run chown, and that fails again. Not sure how best to solve this right now. 🤔

@gedankenstuecke (Member Author)

Ok, my cheap hack in #453 worked, so I think we can close this! 👍
