CrawlerProcess vs CrawlerRunner

The Scrapy 1.x documentation explains that there are two ways to run a Scrapy spider from a script:

  • using CrawlerProcess
  • using CrawlerRunner

What is the difference between the two? When should I use the "process" and when the "runner"?

Answer:

Scrapy's documentation does a poor job of giving examples of real-world uses of both.

CrawlerProcess assumes that Scrapy is the only thing using the Twisted reactor. If you are using threads in Python to run other code, this isn't always true. Let's take this as an example.

from scrapy.crawler import CrawlerProcess
import scrapy

def notThreadSafe(x):
    """do something that isn't thread-safe"""
    # ...

class MySpider1(scrapy.Spider):
    # Your first spider definition
    ...

class MySpider2(scrapy.Spider):
    # Your second spider definition
    ...

process = CrawlerProcess()
process.crawl(MySpider1)
process.crawl(MySpider2)
process.start()  # the script will block here until all crawling jobs are finished
notThreadSafe(3)  # it will get executed when the crawlers stop

Now, as you can see, the function is only executed after the crawlers stop, because process.start() runs the Twisted reactor internally and does not return until all crawling jobs are finished. What if I want the function to be executed while the crawlers are crawling in the same reactor?

from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
import scrapy

def notThreadSafe(x):
    """do something that isn't thread-safe"""
    # ...

class MySpider1(scrapy.Spider):
    # Your first spider definition
    ...

class MySpider2(scrapy.Spider):
    # Your second spider definition
    ...

runner = CrawlerRunner()
runner.crawl(MySpider1)
runner.crawl(MySpider2)
d = runner.join()
d.addBoth(lambda _: reactor.stop())
reactor.callFromThread(notThreadSafe, 3)
reactor.run()  # it will run both crawlers and code inside the function

The Runner class is not limited to this functionality.
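For example, unlike CrawlerProcess, CrawlerRunner does not configure logging or install shutdown handlers for you, which also means you are free to set those up however you like and to pass custom settings per runner. A minimal sketch of that (the spider, URL, and settings values below are illustrative, not from the original answer):

from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
import scrapy

class MySpider(scrapy.Spider):
    # illustrative spider, not from the original answer
    name = "myspider"
    start_urls = ["https://example.com"]

    def parse(self, response):
        yield {"title": response.css("title::text").get()}

# CrawlerRunner, unlike CrawlerProcess, leaves logging entirely to you
configure_logging({"LOG_LEVEL": "INFO"})

# settings may be a dict (or a Settings object); these values are examples
runner = CrawlerRunner(settings={
    "USER_AGENT": "my-crawler (+https://example.com)",
    "CONCURRENT_REQUESTS": 8,
})

d = runner.crawl(MySpider)
d.addBoth(lambda _: reactor.stop())  # stop the reactor once the crawl is done
reactor.run()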
