Scrapy and response status codes: how to check them?

I'm crawling my sitemap with Scrapy to check for 404, 302 and 200 pages, but I can't seem to get at the response code. Here is my code so far:

    from scrapy.contrib.spiders import SitemapSpider

    class TothegoSitemapHomesSpider(SitemapSpider):
        name = 'tothego_homes_spider'

        ## things we need for tothego ##
        sitemap_urls = []
        ok_log_file = '/opt/Workspace/myapp/crawler/valid_output/ok_homes'
        bad_log_file = '/opt/Workspace/myapp/crawler/bad_homes'
        fourohfour = '/opt/Workspace/myapp/crawler/404/404_homes'

        def __init__(self, **kwargs):
            SitemapSpider.__init__(self)
            if len(kwargs) > 1:
                if 'domain' in kwargs:
                    self.sitemap_urls = ['http://url_to_sitemap%s/sitemap.xml' % kwargs['domain']]
                if 'country' in kwargs:
                    self.ok_log_file += "_%s.txt" % kwargs['country']
                    self.bad_log_file += "_%s.txt" % kwargs['country']
                    self.fourohfour += "_%s.txt" % kwargs['country']
            else:
                print "USAGE: scrapy [crawler_name] -a country=[country] -a domain=[domain] \nWith [crawler_name]:\n- tothego_homes_spider\n- tothego_cars_spider\n- tothego_jobs_spider\n"
                exit(1)

        def parse(self, response):
            try:
                if response.status == 404:
                    ## 404s are also tracked separately
                    self.append(self.bad_log_file, response.url)
                    self.append(self.fourohfour, response.url)
                elif response.status == 200:
                    ## write to ok_log_file
                    self.append(self.ok_log_file, response.url)
                else:
                    self.append(self.bad_log_file, response.url)
            except Exception, e:
                self.log('[exception] : %s' % e)

        def append(self, file, string):
            file = open(file, 'a')
            file.write(string + "\n")
            file.close()
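(Aside: the routing in parse() above boils down to "status code → which log file(s)", which can be factored into a small pure function and checked without running a crawl. A minimal sketch; the short file names here are stand-ins for the full paths above:)

```python
# Stand-in log file names (the real spider builds these from full paths).
OK_LOG = 'ok_homes.txt'
BAD_LOG = 'bad_homes.txt'
FOUROHFOUR_LOG = '404_homes.txt'

def logs_for_status(status):
    """Return the log files a URL with this response status belongs in."""
    if status == 200:
        return [OK_LOG]
    if status == 404:
        # 404s go to the generic bad log and are also tracked separately
        return [BAD_LOG, FOUROHFOUR_LOG]
    # everything else (302, 500, ...) goes to the bad log only
    return [BAD_LOG]
```

parse() would then just loop over logs_for_status(response.status) and append response.url to each file.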

The Scrapy documentation says that response.status is an integer corresponding to the status code of the response. So far it only logs the 200-status URLs; the 302s never make it into the output files (although I can see the redirects in crawl.log). What do I have to do to "catch" the 302 requests and save their URLs?

Answer:

http://doc.codingdict.com/scrapy/index.html

Assuming the default spider middlewares are enabled, HttpErrorMiddleware filters out any response whose status code falls outside the 200-300 range. You can tell the middleware that you want to handle 404s by setting the handle_httpstatus_list attribute on your spider:

    class TothegoSitemapHomesSpider(SitemapSpider):
        handle_httpstatus_list = [404]
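To see the 302s as well, add them to the same list. (Note that RedirectMiddleware acts on 3xx responses first; in recent Scrapy versions it also consults handle_httpstatus_list and hands listed redirect responses to your callback instead of following them, while on older versions you may additionally need request.meta['dont_redirect'] = True — treat that detail as version-dependent.) The filtering rule can be mimicked with a tiny stand-in function (this is not Scrapy's actual code, just a sketch of the behaviour):

```python
def reaches_callback(status, handle_httpstatus_list=()):
    """Mimic HttpErrorMiddleware's rule: would a response with this
    status reach the spider's callback?"""
    # 2xx responses always pass through.
    if 200 <= status < 300:
        return True
    # Anything else only passes if the spider opted in.
    return status in handle_httpstatus_list
```

So with handle_httpstatus_list = [404, 302], both the 404s and the 302s show up in parse() and can be written to the log files.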

Source: utcz.com/qa/429740.html
