Crawling from a script always blocks script execution after the crawl

This is part of my script:

    crawler = Crawler(Settings(settings))
    crawler.configure()
    spider = crawler.spiders.create(spider_name)
    crawler.crawl(spider)
    crawler.start()
    log.start()
    reactor.run()
    print "It can't be printed out!"

It works the way it should: it visits the pages, scrapes the information I need, and stores the output JSON where I tell it to (via FEED_URI). But once the spider finishes its work (I can see that from the item count in the output JSON), my script never resumes execution. This is probably not a tricky problem; the answer should be somewhere around the Twisted reactor. How can I release the thread so execution continues?
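For reference, the feed output described above is driven by the crawler settings. A minimal sketch of what the settings dict passed to Settings(settings) might contain (the keys are standard Scrapy feed-export settings; the values are placeholders, not taken from the original script):

    # Assumed example only: standard Scrapy feed-export settings with placeholder values
    settings = {
        'FEED_URI': 'output.json',   # where the scraped items are written
        'FEED_FORMAT': 'json',       # serialize the feed as JSON
    }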

Answer:

You will need to stop the reactor once the spider has finished. You can do this by listening for the spider_closed signal:

    from twisted.internet import reactor
    from scrapy import log, signals
    from scrapy.crawler import Crawler
    from scrapy.settings import Settings
    from scrapy.xlib.pydispatch import dispatcher
    from testspiders.spiders.followall import FollowAllSpider

    def stop_reactor():
        reactor.stop()

    dispatcher.connect(stop_reactor, signal=signals.spider_closed)

    spider = FollowAllSpider(domain='scrapinghub.com')
    crawler = Crawler(Settings())
    crawler.configure()
    crawler.crawl(spider)
    crawler.start()
    log.start()
    log.msg('Running reactor...')
    reactor.run()  # the script will block here until the spider is closed
    log.msg('Reactor stopped.')

The command-line log output might look something like this:

    stav@maia:/srv/scrapy/testspiders$ ./api
    2013-02-10 14:49:38-0600 [scrapy] INFO: Running reactor...
    2013-02-10 14:49:47-0600 [followall] INFO: Closing spider (finished)
    2013-02-10 14:49:47-0600 [followall] INFO: Dumping Scrapy stats:
        {'downloader/request_bytes': 23934,...}
    2013-02-10 14:49:47-0600 [followall] INFO: Spider closed (finished)
    2013-02-10 14:49:47-0600 [scrapy] INFO: Reactor stopped.
    stav@maia:/srv/scrapy/testspiders$
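Note that the log and dispatcher modules used above come from older Scrapy versions. On Scrapy 1.0 and later, CrawlerProcess starts and stops the reactor for you, so the script resumes on its own once the crawl finishes. A minimal sketch under that assumption (FollowAllSpider is the same example spider used above, and the settings values are placeholders):

    from scrapy.crawler import CrawlerProcess
    from testspiders.spiders.followall import FollowAllSpider

    # CrawlerProcess manages the Twisted reactor itself
    process = CrawlerProcess({'FEED_URI': 'output.json', 'FEED_FORMAT': 'json'})
    process.crawl(FollowAllSpider, domain='scrapinghub.com')
    process.start()  # blocks here and stops the reactor automatically when the crawl is done
    print('This line is reached after the spider closes.')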

