Scrapy Shell and Scrapy Splash

Z时代
2024-01-10
Category: Q&A

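We have been using the scrapy-splash middleware to pass the scraped HTML source through the Splash JavaScript engine running inside a Docker container.

To use Splash in a spider, you need to configure several required project settings and yield a Request with specific meta arguments: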

yield Request(url, self.parse_result, meta={
    'splash': {
        'args': {
            # set rendering arguments here
            'html': 1,
            'png': 1,
            # 'url' is prefilled from request url
        },
        # optional parameters
        'endpoint': 'render.json',  # optional; default is render.json
        'splash_url': '<url>',      # overrides SPLASH_URL
        'slot_policy': scrapy_splash.SlotPolicy.PER_DOMAIN,  # needs "import scrapy_splash"
    }
})
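The "required project settings" mentioned above are not shown in the question. As a rough sketch, the scrapy-splash README describes a configuration along these lines. The `SPLASH_URL` value is an assumption (a Splash container listening on localhost:8050); adjust it to wherever your instance runs.

```python
# settings.py (sketch; values follow the scrapy-splash README)

# Assumption: Splash is reachable on localhost:8050.
SPLASH_URL = "http://localhost:8050"

# Downloader middlewares that route requests through Splash.
DOWNLOADER_MIDDLEWARES = {
    "scrapy_splash.SplashCookiesMiddleware": 723,
    "scrapy_splash.SplashMiddleware": 725,
    "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": 810,
}

# Spider middleware that deduplicates repeated Splash arguments.
SPIDER_MIDDLEWARES = {
    "scrapy_splash.SplashDeduplicateArgsMiddleware": 100,
}

# Dupefilter that takes Splash arguments into account.
DUPEFILTER_CLASS = "scrapy_splash.SplashAwareDupeFilter"
```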

This works as documented. But how can we use scrapy-splash inside the Scrapy Shell?

Answer:

Just wrap the URL you want to open in the shell in the Splash HTTP API.

So you would want something like:

scrapy shell 'http://localhost:8050/render.html?url=http://domain.com/page-with-javascript.html&timeout=10&wait=0.5'

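where localhost:port is the address your Splash service is running on,

url is the URL you want to crawl (don't forget to URL-quote it!),

render.html is one of the possible HTTP API endpoints; in this case it returns the rendered HTML page,

timeout is the timeout, in seconds,

wait is the time, in seconds, to wait for JavaScript to execute before the HTML is read/saved.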
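Since the target URL has to be quoted before it is embedded in the Splash URL, it can help to build the command programmatically. The following is a minimal sketch using only the standard library; `splash_render_url` is a hypothetical helper name, and the default base address assumes Splash on localhost:8050 as in the command above.

```python
from urllib.parse import urlencode


def splash_render_url(target_url, splash_base="http://localhost:8050",
                      endpoint="render.html", **args):
    """Build a Splash HTTP API URL with the target URL percent-encoded.

    splash_render_url is a hypothetical helper, not part of scrapy-splash.
    Extra keyword arguments (e.g. timeout, wait) become query parameters.
    """
    # urlencode quotes each value, so '&', '?' and '=' inside target_url
    # cannot be mistaken for Splash's own query parameters.
    query = urlencode({"url": target_url, **args})
    return f"{splash_base}/{endpoint}?{query}"


shell_url = splash_render_url(
    "http://domain.com/page-with-javascript.html", timeout=10, wait=0.5
)
# Print a ready-to-paste shell command.
print(f"scrapy shell '{shell_url}'")
```

Quoting matters most when the target URL carries its own query string: without it, everything after the first `&` would be read as an argument to Splash rather than as part of the page address.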

Source: utcz.com/qa/421305.html