How to stop a Scrapy spider after a certain number of requests?

I am working on a simple scraper that fetches 9gag posts and their images, but due to some technical difficulties I am unable to stop the scraper and it keeps on crawling, which I don't want. I want to increment a counter value and stop after 100 posts. However, the 9gag page is designed so that each response only returns 10 posts, and my counter is reset on every iteration, so it never gets past 10; as a result my loop runs indefinitely and never stops.

# -*- coding: utf-8 -*-
import scrapy

from _9gag.items import GagItem


class FirstSpider(scrapy.Spider):
    name = "first"
    allowed_domains = ["9gag.com"]
    start_urls = (
        'http://www.9gag.com/',
    )

    last_gag_id = None

    def parse(self, response):
        count = 0  # reset on every callback, so it never gets past the ~10 posts per page
        for article in response.xpath('//article'):
            gag_id = article.xpath('@data-entry-id').extract()
            count += 1
            if gag_id:
                if count != 100:
                    last_gag_id = gag_id[0]
                    ninegag_item = GagItem()
                    ninegag_item['entry_id'] = gag_id[0]
                    ninegag_item['url'] = article.xpath('@data-entry-url').extract()[0]
                    ninegag_item['votes'] = article.xpath('@data-entry-votes').extract()[0]
                    ninegag_item['comments'] = article.xpath('@data-entry-comments').extract()[0]
                    ninegag_item['title'] = article.xpath('.//h2/a/text()').extract()[0].strip()
                    ninegag_item['img_url'] = article.xpath('.//div[1]/a/img/@src').extract()
                    yield ninegag_item
                else:
                    break

        # a follow-up request is always yielded, so the crawl never terminates
        next_url = 'http://9gag.com/?id=%s&c=200' % last_gag_id
        yield scrapy.Request(url=next_url, callback=self.parse)
        print count

Here is the code for items.py:

from scrapy.item import Item, Field


class GagItem(Item):
    entry_id = Field()
    url = Field()
    votes = Field()
    comments = Field()
    title = Field()
    img_url = Field()

So I want a global count value that keeps increasing; I tried passing 3 arguments to the parse function, but that gives an error:

TypeError: parse() takes exactly 3 arguments (2 given)

So, is there a way to pass a global count value along, get it back after each iteration, and stop after (say) 100 posts?
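One way to do this without adding extra parameters to parse() is to carry the counter in the request's meta dict, which Scrapy hands to the next callback as response.meta. A minimal sketch of that idea, reusing the question's URL pattern and its limit of 100 (the item-building part is elided):

def parse(self, response):
    # the count accumulated by earlier pages rides along in meta;
    # it defaults to 0 on the very first page
    count = response.meta.get('count', 0)
    last_gag_id = None
    for article in response.xpath('//article'):
        last_gag_id = article.xpath('@data-entry-id').extract()[0]
        count += 1
        # ... build and yield the GagItem here ...
    if last_gag_id is not None and count < 100:
        next_url = 'http://9gag.com/?id=%s&c=10' % last_gag_id
        # pass the updated count on to the next callback
        yield scrapy.Request(next_url, callback=self.parse,
                             meta={'count': count})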

The whole project is here: Github. The infinite loop happens even when I set POST_LIMIT = 100; here is the command I ran:

scrapy crawl first -s POST_LIMIT=10 --output=output.json
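(Note that POST_LIMIT is not a setting Scrapy itself knows about, so passing it with -s has no effect unless the spider reads it explicitly. For a plain item or page cap, Scrapy's built-in CloseSpider extension already does this, e.g.:

scrapy crawl first -s CLOSESPIDER_ITEMCOUNT=100 --output=output.json

CLOSESPIDER_ITEMCOUNT closes the spider after roughly that many scraped items, and CLOSESPIDER_PAGECOUNT does the same for crawled responses; the shutdown is graceful, so a few in-flight items may still come through.)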

Answer:

First: use self.count, initialized outside of parse. Then, instead of breaking off the parsing of the items, only generate new requests while the count is below the limit. See the following code:

# -*- coding: utf-8 -*-
import scrapy
from scrapy import Item, Field


class GagItem(Item):
    entry_id = Field()
    url = Field()
    votes = Field()
    comments = Field()
    title = Field()
    img_url = Field()


class FirstSpider(scrapy.Spider):
    name = "first"
    allowed_domains = ["9gag.com"]
    start_urls = ('http://www.9gag.com/', )

    last_gag_id = None
    COUNT_MAX = 30
    count = 0  # lives on the spider instance, so it survives across callbacks

    def parse(self, response):
        for article in response.xpath('//article'):
            gag_id = article.xpath('@data-entry-id').extract()

            ninegag_item = GagItem()
            ninegag_item['entry_id'] = gag_id[0]
            ninegag_item['url'] = article.xpath('@data-entry-url').extract()[0]
            ninegag_item['votes'] = article.xpath('@data-entry-votes').extract()[0]
            ninegag_item['comments'] = article.xpath('@data-entry-comments').extract()[0]
            ninegag_item['title'] = article.xpath('.//h2/a/text()').extract()[0].strip()
            ninegag_item['img_url'] = article.xpath('.//div[1]/a/img/@src').extract()

            self.last_gag_id = gag_id[0]
            self.count = self.count + 1
            yield ninegag_item

        # only follow the next page while under the limit; once no new
        # request is yielded, the spider finishes on its own
        if self.count < self.COUNT_MAX:
            next_url = 'http://9gag.com/?id=%s&c=10' % self.last_gag_id
            yield scrapy.Request(url=next_url, callback=self.parse)
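If you also want the -s POST_LIMIT=... flag from the question to take effect, the spider can read the custom setting itself through self.settings (available once the spider is bound to a crawler). A sketch of that variant; POST_LIMIT is the asker's own setting name, not a Scrapy built-in, and the item is trimmed to just the id for brevity:

import scrapy


class FirstSpider(scrapy.Spider):
    name = "first"
    allowed_domains = ["9gag.com"]
    start_urls = ('http://www.9gag.com/', )

    last_gag_id = None
    count = 0

    def parse(self, response):
        # custom command-line setting, defaulting to 100 when absent
        post_limit = self.settings.getint('POST_LIMIT', 100)
        for article in response.xpath('//article'):
            self.last_gag_id = article.xpath('@data-entry-id').extract()[0]
            self.count += 1
            yield {'entry_id': self.last_gag_id}

        if self.count < post_limit:
            next_url = 'http://9gag.com/?id=%s&c=10' % self.last_gag_id
            yield scrapy.Request(url=next_url, callback=self.parse)

With this in place, the original command, scrapy crawl first -s POST_LIMIT=10 --output=output.json, stops the crawl after roughly 10 posts.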
