Preface

Continuing the crawler research: last time we took the "trace it back to the source" line of investigation; this time we switch direction and let things "happen naturally", i.e., let the page's JavaScript run and render the content itself.
The Scrapy documentation suggests an approach for exactly this: it recommends using scrapy-splash to pre-render the JavaScript.
What is Splash

Splash is a service that executes a page's JavaScript and renders the result. It's clearer to quote the original description:

> Splash is a javascript rendering service. It's a lightweight web browser with an HTTP API.

Using it, you can:
- process multiple webpages in parallel;
- get HTML results and/or take screenshots;
- turn OFF images or use Adblock Plus rules to make rendering faster;
- execute custom JavaScript in page context;
- write Lua browsing scripts;
- develop Splash Lua scripts in Splash-Jupyter Notebooks;
- get detailed rendering info in HAR format.
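The "HTTP API" part is meant literally: any HTTP client can drive Splash directly. As a rough sketch (assuming a local Splash on port 8050 as set up below, and using the documented render.html endpoint), plain Python works:

```python
import requests

# Ask a locally running Splash instance to render a page after executing its JavaScript.
# Assumes Splash is listening on localhost:8050 (the Docker command below maps that port).
resp = requests.get(
    "http://localhost:8050/render.html",
    params={"url": "http://sh.weather.com.cn/", "wait": 0.5},
)
print(resp.text[:500])  # first 500 characters of the rendered HTML
```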
Installing scrapy-splash

scrapy-splash, then, is the piece that ties the two together: a set of Scrapy middlewares that connect Scrapy to Splash.
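The package itself comes from PyPI and is installed the usual way:

```
$ pip install scrapy-splash
```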
Usage steps
1. Start the Splash service. Docker is the first-choice way to run it everywhere:
```
$ docker run -p 8050:8050 scrapinghub/splash
```
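If you'd rather not keep a terminal tied up, Docker's standard detach flag runs the same image in the background:

```
$ docker run -d -p 8050:8050 scrapinghub/splash
```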
2. Test the service status.
As long as port 8050 responds normally, we're good.
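A quick command-line check (standard curl flags; this prints the HTTP status code, which should be 200 once Splash is up):

```
$ curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8050/
```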
Next, add the following configuration (these are Scrapy settings, so they go in the project's settings.py):
```python
SPLASH_URL = 'http://127.0.0.1:8050'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
```
These lines are the standard steps straight from the scrapy-splash README, though I skipped its step 5; everything runs fine without it for now. For reference, step 5 is:
```python
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
```
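Worth noting: that storage class only takes effect if Scrapy's HTTP cache is enabled at all, so the pair would look like this (HTTPCACHE_ENABLED is a standard Scrapy setting, not part of the README snippet above):

```python
HTTPCACHE_ENABLED = True  # switch on Scrapy's built-in HTTP cache
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'  # Splash-aware storage backend
```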
5. The main spider code.
Once the middleware is in play, the start_requests method must be overridden so that requests go out as SplashRequest. Splash then executes the JavaScript and fills the results into the DOM for us, so we can simply query the rendered document and dig out the values we want; it saves a lot of work.
```python
import scrapy
from scrapy_splash import SplashRequest


class MySpider(scrapy.Spider):
    name = "temp"
    start_urls = ["http://sh.weather.com.cn/"]

    def start_requests(self):
        # Route every start URL through Splash so the page's JS gets executed;
        # wait 0.5s to give the scripts time to populate the DOM.
        for url in self.start_urls:
            yield SplashRequest(url, self.parse, args={'wait': 0.5})

    def parse(self, response):
        # The temperature only exists in the DOM after JS rendering.
        sktemp = response.css('p.sk-temp')
        temp = sktemp.css('span::text').get()
        unit = sktemp.css('em::text').get()
        self.log(f'temp is {temp} {unit}')
        yield {"current_temp": f'{temp} {unit}'}
```
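Assuming the spider lives in an ordinary Scrapy project, the stock CLI runs it and dumps the scraped items:

```
$ scrapy crawl temp -o result.json
```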
Final result

What the page looked like at the time:

The run output: