memory leaks in image pipeline #2447

Closed

pawelmhm opened this issue Dec 13, 2016 · 9 comments

pawelmhm (Contributor) commented Dec 13, 2016
It seems to me that the image pipeline is leaking memory in a very significant way. I have a spider that downloads lists of images. There have always been problems with memory when downloading images, but now my list of images to download got larger, so I decided to open an issue here.

Basically, after processing some images, memory usage goes up and stays up (it is not reset to the previous value). It might be an issue with PIL, or it might be something we are doing in the pipeline. In any case this looks worrying, and I think we should reflect on what steps to take to limit this problem.

The following code reproduces the problem (I know it's long, but this is really the shortest I could get). It relies on the presence of an images.txt file that contains a list of image URLs.

import resource
import shutil
import sys
import tempfile

import scrapy
from scrapy.pipelines.images import ImagesPipeline
from scrapy.utils.test import get_crawler
from twisted.internet import reactor
from twisted.python import log


def log_memory(result):
    mem = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    log.msg("{} bytes".format(mem))


class SomeSpider(scrapy.Spider):
    name = 'foo'


def download_image(url, pipe, spider):
    # Download image with pipeline.
    item = {
        'image_urls': [url]
    }
    dfd = pipe.process_item(item, spider)
    dfd.addBoth(log_memory)
    return dfd


log.startLogging(sys.stdout)
# Directory must be removed, otherwise pipeline will not attempt to download.
some_dir = tempfile.mkdtemp()
crawler = get_crawler(settings_dict={'IMAGES_STORE': some_dir,
                                     'IMAGES_EXPIRES': 0})

spider = SomeSpider()
spider.crawler = crawler
crawler.crawl(spider)
pipeline = ImagesPipeline.from_crawler(crawler)
pipeline.open_spider(spider)


def clean_up():
    print("removing {}".format(some_dir))
    log_memory(None)
    shutil.rmtree(some_dir)


with open('images.txt') as image_list:
    image_urls = image_list.read().split()

for url in image_urls[:20]:
    dfd = download_image(url, pipeline, spider)

reactor.addSystemEventTrigger('before', 'shutdown', clean_up)
reactor.run()

Sample output with the attached file of image URLs: images.txt

2016-12-13 13:13:00+0100 [-] Log opened.
2016-12-13 13:13:00+0100 [-] TelnetConsole starting on 6023
2016-12-13 13:13:00+0100 [-] 44516 bytes
2016-12-13 13:13:00+0100 [-] 44752 bytes
2016-12-13 13:13:00+0100 [-] 49492 bytes
2016-12-13 13:13:01+0100 [-] 49680 bytes
2016-12-13 13:13:01+0100 [-] 49680 bytes
2016-12-13 13:13:01+0100 [-] 49680 bytes
2016-12-13 13:13:01+0100 [-] 52312 bytes
2016-12-13 13:13:01+0100 [-] 52312 bytes
2016-12-13 13:13:01+0100 [-] 52316 bytes
2016-12-13 13:13:01+0100 [-] 52316 bytes
2016-12-13 13:13:01+0100 [-] 52316 bytes
2016-12-13 13:13:01+0100 [-] 52532 bytes
2016-12-13 13:13:01+0100 [-] 52532 bytes
2016-12-13 13:13:01+0100 [-] 52700 bytes
2016-12-13 13:13:01+0100 [-] 52700 bytes
2016-12-13 13:13:01+0100 [-] 52700 bytes
2016-12-13 13:13:01+0100 [-] 52700 bytes
2016-12-13 13:13:01+0100 [-] 52700 bytes
2016-12-13 13:13:02+0100 [-] 52700 bytes
2016-12-13 13:13:03+0100 [-] (TCP Port 6023 Closed)
2016-12-13 13:13:03+0100 [-] 52700 bytes
^C2016-12-13 13:13:13+0100 [-] Received SIGINT, shutting down.
2016-12-13 13:13:13+0100 [-] removing /tmp/tmpD1dRmc
2016-12-13 13:13:13+0100 [-] 52700 bytes
2016-12-13 13:13:13+0100 [-] Main loop terminated.

Notice how memory goes up and stays up, from 44516 to 52700. Also notice the delay between the final request and SIGINT (10 seconds); after this delay memory usage still stays at 52700.

pawelmhm (Contributor, Author) commented

There is an issue in PIL about memory leaks in Python 3, but I'm seeing this in Python 2.7: python-pillow/Pillow#2019

rmax (Contributor) commented Feb 8, 2017

Additionally, there are 2-3x memory requirements for each image due to the format conversion and thumbnailing. Also, if I'm not wrong, the images pipeline bypasses the concurrent requests limit, which results in a lot of in-flight image requests.

I haven't seen memory issues with the images downloader when setting CONCURRENT_ITEMS = 1.
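
For reference, a minimal settings.py sketch of the mitigation rmax describes; the pipeline path and setting names are the standard Scrapy ones, and the store path is a placeholder:

# settings.py -- sketch of limiting in-flight image work via item concurrency
ITEM_PIPELINES = {
    'scrapy.pipelines.images.ImagesPipeline': 1,
}
IMAGES_STORE = '/tmp/images'  # placeholder path

# Process one scraped item at a time, so fewer image requests (and their
# response bodies) are held in memory simultaneously.
CONCURRENT_ITEMS = 1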

kmike (Member) commented Feb 8, 2017

See also: #482.

The pipeline doesn't bypass website concurrency limits, but requests are sent directly to the Downloader, without putting them in the Scheduler - this indeed means they are all kept in memory.
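
A simplified, illustrative sketch of the flow kmike describes (this paraphrases scrapy.pipelines.media.MediaPipeline; it is not the actual source, and names and signatures may differ between Scrapy versions):

from twisted.internet.defer import DeferredList


def process_media_requests_sketch(pipeline, item, info):
    # One Request per URL in item['image_urls'].
    requests = pipeline.get_media_requests(item, info)
    deferreds = []
    for request in requests:
        # Handed straight to the engine's downloader: the request never goes
        # through the Scheduler, so every pending request (and later its
        # response body) stays in memory until the item's deferred fires.
        dfd = pipeline.crawler.engine.download(request, info.spider)
        dfd.addCallback(pipeline.media_downloaded, request, info)
        deferreds.append(dfd)
    # The item is only completed once all image deferreds have fired.
    return DeferredList(deferreds, consumeErrors=True)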

dev-iwf commented Dec 6, 2017

Hi, I have a similar problem.
I may have narrowed it down slightly, using the telnet console and prefs().

Attribute Type Count Oldest (seconds ago)
HtmlResponse 611 1222
MemleakscrapyItem 6 332
MemleakspiderSpider 1 1230
Request 804 1230
Response 105 1153
Selector 12 332
TextResponse 84 1227
XmlResponse 1 103

Attribute Type Count Oldest (seconds ago)
HtmlResponse 630 1260
MemleakscrapyItem 9 370
MemleakspiderSpider 1 1268
Request 4658 1268
Response 108 1191
Selector 11 370
TextResponse 85 1264
XmlResponse 1 141

Attribute Type Count Oldest (seconds ago)
HtmlResponse 2516 4304
MemleakscrapyItem 3 3
MemleakspiderSpider 1 4311
Request 3767 4311
Response 527 4234
Selector 45 238
TextResponse 743 4308
XmlResponse 11 3185

Attribute Type Count Oldest (seconds ago)
HtmlResponse 2802 4913
MemleakscrapyItem 3 274
MemleakspiderSpider 1 4921
Request 4336 4920
Response 566 4843
Selector 22 274
TextResponse 883 4917
XmlResponse 11 3794

Past that, I used iter_all() and it looks like the majority of the requests are image file requests; some are robots.txt requests.
It is not all the requests that go uncleaned, though, as then I would be looking at much, much larger numbers.

I've done this testing on a completely fresh project, created with scrapy startproject and with the images pipeline enabled.
The spider just looks for images in //img/@src on a page, adds them to image_urls and then yields the item (see the sketch below).
This becomes a real problem when I try it with scrapy-cluster, where the spiders never close.
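
A minimal sketch of the kind of spider described above; the spider name and start URL are illustrative, not taken from the report:

import scrapy


class ImageSpider(scrapy.Spider):
    # Illustrative spider: collects <img src=...> URLs and hands them to the
    # images pipeline via the standard image_urls field.
    name = 'image_sketch'
    start_urls = ['http://example.com']  # placeholder start page

    def parse(self, response):
        yield {'image_urls': response.xpath('//img/@src').extract()}
        # Follow links so the crawl keeps producing items.
        for href in response.xpath('//a/@href').extract():
            yield response.follow(href, callback=self.parse)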

dev-iwf commented Dec 7, 2017

Okay. The problem for me was related to request caching in the media pipeline (self.spiderinfo.downloaded).
See this issue: #939
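
A rough sketch of the cache-clearing idea discussed in #939, as a hedged illustration only: info.downloaded is an internal MediaPipeline attribute keyed by request fingerprint, clearing it trades memory for the possibility of re-downloading images, and the threshold below is hypothetical.

from scrapy.pipelines.images import ImagesPipeline


class CacheClearingImagesPipeline(ImagesPipeline):
    # Hypothetical threshold; tune for your crawl.
    MAX_CACHED_RESULTS = 10000

    def item_completed(self, results, item, info):
        # info.downloaded maps request fingerprints to cached results (or
        # Failures) and otherwise grows for the lifetime of the spider.
        if len(info.downloaded) > self.MAX_CACHED_RESULTS:
            info.downloaded.clear()
        return super(CacheClearingImagesPipeline, self).item_completed(results, item, info)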

raphapassini (Contributor) commented

Seems that this PR will solve the problem when merged: #2823

dev-iwf commented Dec 12, 2017

Looks like it. Thanks.

icanka commented Mar 22, 2018

I had a significant memory leak too when downloading images with IMAGES_MIN_HEIGHT and IMAGES_MIN_WIDTH set. All of the image responses that do not meet the min_height and min_width conditions raise ImageException, and memory fills up with these image responses.

I deduced that the info.downloaded[fp] = result part of the _cache_result_and_execute_waiters function was the problem. result turns out to be a twisted Failure and, I don't know why, but it causes these image responses to fill up memory. I tried emptying the info.downloaded cache as mentioned in #939, but then scraping obviously takes significantly longer.

I realized that only the key is checked (if fp in info.downloaded:) to decide whether a request has already been downloaded, so I solved it by assigning None to info.downloaded[fp] if the result is an instance of Failure (roughly as in the sketch below). Total request count and scraping time don't seem to change. I don't know how good an idea this is, but it solved my issue.
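
A rough sketch of the override icanka describes; _cache_result_and_execute_waiters is an internal MediaPipeline method whose signature may differ between Scrapy versions, and replacing cached Failures with None changes what duplicate requests see, so treat this as an illustration rather than a fix:

from twisted.python.failure import Failure

from scrapy.pipelines.images import ImagesPipeline


class SlimCacheImagesPipeline(ImagesPipeline):
    def _cache_result_and_execute_waiters(self, result, fp, info):
        super(SlimCacheImagesPipeline, self)._cache_result_and_execute_waiters(result, fp, info)
        # Failed image downloads (e.g. ImageException for images below
        # IMAGES_MIN_WIDTH / IMAGES_MIN_HEIGHT) otherwise keep a Failure --
        # and with it the response -- alive in the cache. Keep only the
        # fingerprint so 'if fp in info.downloaded' still prevents a re-download.
        if isinstance(result, Failure):
            info.downloaded[fp] = None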

Gallaecio (Member) commented

Closing as a duplicate of #939.

Gallaecio closed this as not planned (duplicate) on Feb 28, 2024