Getting Started with Web Crawlers (2): Rendering JavaScript #241

soapgu opened this issue Apr 4, 2024 · 0 comments
  • Preface

Continuing our crawler research: last time we took a trace-to-the-source, investigative approach; this time we change direction and let the page load "naturally".

The Scrapy official documentation offers a recommended approach:

scrapy-splash is the suggested tool for rendering javascript here.

  • What is Splash

Splash is a service that executes a webpage's javascript and renders the result.
It's clearest to quote the original description:

Splash is a javascript rendering service. It's a lightweight web browser with an HTTP API. Using it, you can:

  • process multiple webpages in parallel;

  • get HTML results and/or take screenshots;

  • turn OFF images or use Adblock Plus rules to make rendering faster;

  • execute custom JavaScript in page context;

  • write Lua browsing scripts;

  • develop Splash Lua scripts in Splash-Jupyter Notebooks.

  • get detailed rendering info in HAR format.

  • Installing scrapy-splash

scrapy-splash, then, is the Scrapy middleware that connects Scrapy to Splash.
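Conceptually, the middleware repackages each SplashRequest as a call to the Splash HTTP API before it leaves Scrapy. A rough sketch of that idea (this is not the actual scrapy_splash source; the endpoint and field names follow the Splash HTTP API, which accepts a JSON POST to /render.html):

```python
import json

SPLASH_URL = "http://127.0.0.1:8050"

def to_splash_request(url, endpoint="render.html", **args):
    """Repackage a target URL as a POST to the Splash HTTP API.

    Extra keyword args (e.g. wait=0.5) are forwarded to Splash
    alongside the target URL in the JSON body.
    """
    body = json.dumps({"url": url, **args})
    return {
        "method": "POST",
        "url": f"{SPLASH_URL}/{endpoint}",
        "headers": {"Content-Type": "application/json"},
        "body": body,
    }

req = to_splash_request("http://sh.weather.com.cn/", wait=0.5)
print(req["url"])  # http://127.0.0.1:8050/render.html
```

The real middleware handles cookies, deduplication arguments, and response unwrapping on top of this, which is why several middleware classes get registered in settings.py below.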

Usage steps

  1. Run the Splash service

Docker is the first choice here:

$ docker run -p 8050:8050 scrapinghub/splash

  2. Check the service status

As long as port 8050 responds normally, the service is OK.
(screenshot)
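Beyond opening http://127.0.0.1:8050/ in a browser, you can exercise the render.html endpoint directly. A minimal sketch using only the standard library to build the request URL (assumes Splash is listening on the default port; uncomment the fetch to actually hit the running service):

```python
from urllib.parse import urlencode
# from urllib.request import urlopen  # uncomment to actually fetch

base = "http://127.0.0.1:8050/render.html"
params = {"url": "http://sh.weather.com.cn/", "wait": 0.5}
render_url = f"{base}?{urlencode(params)}"
print(render_url)
# html = urlopen(render_url).read()  # returns the page after its JS has run
```

The `wait` parameter here is the same one passed via SplashRequest's args later on: it gives the page's javascript time to run before the HTML is captured.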

  3. Create the scrapy project
  4. Configure settings.py

Add the following configuration:

SPLASH_URL = 'http://127.0.0.1:8050'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'

These are all the standard steps from the README, except that step 5 of its setup (below) was left out; for now everything still runs fine without it:

HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'

  5. The main spider code

  • Initialize a SplashRequest object

Once this middleware is in use, you must override the start_requests method.

  • Parse the HTML

Splash executes the javascript and fills the results into the DOM for us; we can then just query the document and dig out the values we need, which saves a lot of work.

import scrapy
from scrapy_splash import SplashRequest


class MySpider(scrapy.Spider):
    name = "temp"
    start_urls = [
        "http://sh.weather.com.cn/"
    ]

    def start_requests(self):
        # Route every start URL through Splash; wait 0.5s so the page's
        # javascript has time to run before the HTML is captured.
        for url in self.start_urls:
            yield SplashRequest(url, self.parse, args={'wait': 0.5})

    def parse(self, response):
        # Splash has already rendered the page, so the temperature values
        # are present in the DOM and plain CSS selectors can reach them.
        sktemp = response.css('p.sk-temp')
        temp = sktemp.css('span::text').get()
        unit = sktemp.css('em::text').get()
        self.log(f'temp is {temp} {unit}')

        yield {'current_temp': f'{temp} {unit}'}
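The extraction that response.css performs in parse can be sketched without Splash at all. Against a static snippet shaped like the weather page's markup (the exact structure here is an assumption inferred from the selectors used), a standard-library stand-in for the same lookup:

```python
from html.parser import HTMLParser

class SkTempParser(HTMLParser):
    """Crude stand-in for response.css('p.sk-temp span::text') / ('em::text')."""

    def __init__(self):
        super().__init__()
        self.in_sktemp = False   # inside <p class="sk-temp">?
        self.capture = None      # 'span' or 'em' while inside one
        self.temp = None
        self.unit = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "p" and "sk-temp" in (attrs.get("class") or ""):
            self.in_sktemp = True
        elif self.in_sktemp and tag in ("span", "em"):
            self.capture = tag

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_sktemp = False
        if tag == self.capture:
            self.capture = None

    def handle_data(self, data):
        if not data.strip():
            return
        if self.capture == "span" and self.temp is None:
            self.temp = data
        elif self.capture == "em" and self.unit is None:
            self.unit = data

# Hypothetical rendered markup, matching the selectors above:
html = '<p class="sk-temp"><span>21</span><em>°C</em></p>'
p = SkTempParser()
p.feed(html)
print(p.temp, p.unit)  # 21 °C
```

The point is that the spider's selectors only work because Splash has already run the page's javascript; against the raw, unrendered HTML, `p.sk-temp` would match nothing.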

  • Final results

What the webpage looked like at the time:
(screenshot)

The run results:
(screenshot)

@soapgu changed the title from "Getting Started with Web Crawlers (2): Using scrapy-splash" to "Getting Started with Web Crawlers (2): Rendering JavaScript" on Apr 9, 2024