Running Scrapy from a script: pipelines are not included

I'm running Scrapy from a script, but all it does is activate the spider. The items never go through my item pipelines. I've read http://scrapy.readthedocs.org/en/latest/topics/practices.html, but it doesn't say anything about including pipelines.

My setup:

Scraper/
    scrapy.cfg
    ScrapyScript.py
    Scraper/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            my_spider.py

My script:

from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy.settings import Settings
from scrapy import log, signals
from Scraper.spiders.my_spider import MySpiderSpider

spider = MySpiderSpider(domain='myDomain.com')
settings = get_project_settings
crawler = Crawler(Settings())
crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
crawler.configure()
crawler.crawl(spider)
crawler.start()
log.start()
log.msg('Reactor activated...')
reactor.run()
log.msg('Reactor stopped.')

My pipeline:

from scrapy.exceptions import DropItem
from scrapy import log
import sqlite3


class ImageCheckPipeline(object):

    def process_item(self, item, spider):
        if item['image']:
            log.msg("Item added successfully.")
            return item
        else:
            del item
            raise DropItem("Non-image thumbnail found: ")


class StoreImage(object):

    def __init__(self):
        self.db = sqlite3.connect('images')
        self.cursor = self.db.cursor()
        try:
            self.cursor.execute('''
                CREATE TABLE IMAGES(IMAGE BLOB, TITLE TEXT, URL TEXT)
            ''')
            self.db.commit()
        except sqlite3.OperationalError:
            self.cursor.execute('''
                DELETE FROM IMAGES
            ''')
            self.db.commit()

    def process_item(self, item, spider):
        title = item['title'][0]
        image = item['image'][0]
        url = item['url'][0]
        self.cursor.execute('''
            INSERT INTO IMAGES VALUES (?, ?, ?)
        ''', (image, title, url))
        self.db.commit()
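For reference, Scrapy only runs pipelines that are enabled in the project settings. Assuming the two classes above are registered in Scraper/settings.py, the entry would look roughly like the sketch below; the priority values are placeholders, not taken from the question:

# Scraper/settings.py -- assumed registration of the pipelines shown above
# (order numbers are placeholders; lower numbers run first)
ITEM_PIPELINES = {
    'Scraper.pipelines.ImageCheckPipeline': 300,
    'Scraper.pipelines.StoreImage': 800,
}

If this dictionary lives only in the project's settings.py, it is never seen by a crawler that was constructed with default Settings(), which is exactly the symptom described here.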

Output of the script:

[name@localhost Scraper]$ python ScrapyScript.py
2014-08-06 17:55:22-0400 [scrapy] INFO: Reactor activated...
2014-08-06 17:55:22-0400 [my_spider] INFO: Closing spider (finished)
2014-08-06 17:55:22-0400 [my_spider] INFO: Dumping Scrapy stats:
    {'downloader/request_bytes': 213,
     'downloader/request_count': 1,
     'downloader/request_method_count/GET': 1,
     'downloader/response_bytes': 18852,
     'downloader/response_count': 1,
     'downloader/response_status_count/200': 1,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2014, 8, 6, 21, 55, 22, 518492),
     'item_scraped_count': 51,
     'response_received_count': 1,
     'scheduler/dequeued': 1,
     'scheduler/dequeued/memory': 1,
     'scheduler/enqueued': 1,
     'scheduler/enqueued/memory': 1,
     'start_time': datetime.datetime(2014, 8, 6, 21, 55, 22, 363898)}
2014-08-06 17:55:22-0400 [my_spider] INFO: Spider closed (finished)
2014-08-06 17:55:22-0400 [scrapy] INFO: Reactor stopped.
[name@localhost Scraper]$

Answer:

get_project_settings actually needs to be called: in the posted code, the Settings object passed to the Crawler gives you the default values, not your project's settings (so ITEM_PIPELINES from settings.py is never loaded). You need to write something like this:

from scrapy.utils.project import get_project_settings

settings = get_project_settings()
crawler = Crawler(settings)
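Applied to the script from the question, a minimal sketch of the corrected version (using the same Scrapy 0.24-era Crawler/log API the question already uses) would be:

from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy import log, signals
from scrapy.utils.project import get_project_settings

from Scraper.spiders.my_spider import MySpiderSpider

spider = MySpiderSpider(domain='myDomain.com')

# Load the project's settings.py (including ITEM_PIPELINES) instead of the defaults
settings = get_project_settings()
crawler = Crawler(settings)

crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
crawler.configure()
crawler.crawl(spider)
crawler.start()

log.start()
log.msg('Reactor activated...')
reactor.run()
log.msg('Reactor stopped.')

Run the script from the directory that contains scrapy.cfg so that get_project_settings() can locate Scraper/settings.py; with the project settings loaded, the enabled pipelines will process the scraped items.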
