Python Scrapy: custom function never gets called
While scraping a page I ran into a very strange problem: when I route the work through a custom function, the `yield item` is never executed. The page being scraped: http://www.duilian360.com/chu...
The code is as follows:
import scrapy

from shufa.items import DuilianItem


class DuilianSpiderSpider(scrapy.Spider):
    name = 'duilian_spider'
    start_urls = [
        {"url": "http://www.duilian360.com/chunjie/117.html", "category_name": "春联", "group_name": "鼠年春联"},
    ]
    base_url = 'http://www.duilian360.com'

    def start_requests(self):
        for topic in self.start_urls:
            url = topic['url']
            yield scrapy.Request(url=url, callback=lambda response: self.parse_page(response))

    def parse_page(self, response):
        div_list = response.xpath("//div[@class='contentF']/div[@class='content_l']/p")
        self.parse_paragraph(div_list)

    def parse_paragraph(self, div_list):
        for div in div_list:
            duilian_text_list = div.xpath('./text()').extract()
            for duilian_text in duilian_text_list:
                duilian_item = DuilianItem()
                duilian_item['category_id'] = 1
                duilian = duilian_text
                duilian_item['name'] = duilian
                duilian_item['desc'] = ''
                print('I reach here...')  # this line is never executed
                yield duilian_item
In the code above, the print statement is never executed, and even with a breakpoint I can't step into the parse_paragraph function. But if I paste the body of parse_paragraph directly at the call site, the print statement does produce output, like this:
import scrapy

from shufa.items import DuilianItem


class DuilianSpiderSpider(scrapy.Spider):
    name = 'duilian_spider'
    start_urls = [
        {"url": "http://www.duilian360.com/chunjie/117.html", "category_name": "春联", "group_name": "鼠年春联"},
    ]
    base_url = 'http://www.duilian360.com'

    def start_requests(self):
        for topic in self.start_urls:
            url = topic['url']
            yield scrapy.Request(url=url, callback=lambda response: self.parse_page(response))

    def parse_page(self, response):
        div_list = response.xpath("//div[@class='contentF']/div[@class='content_l']/p")
        for div in div_list:
            duilian_text_list = div.xpath('./text()').extract()
            for duilian_text in duilian_text_list:
                duilian_item = DuilianItem()
                duilian_item['category_id'] = 1
                duilian = duilian_text
                duilian_item['name'] = duilian
                duilian_item['desc'] = ''
                print('I reach here...')
                yield duilian_item

    # def parse_paragraph(self, div_list):
    #     for div in div_list:
    #         duilian_text_list = div.xpath('./text()').extract()
    #         for duilian_text in duilian_text_list:
    #             duilian_item = DuilianItem()
    #             duilian_item['category_id'] = 1
    #             duilian = duilian_text
    #             duilian_item['name'] = duilian
    #             duilian_item['desc'] = ''
    #             print('I reach here...')
    #             yield duilian_item
Why is this? My code has many custom functions and many for loops. Pasting the code inline at every call site would be ugly and hard to maintain, since the same custom function may be called from many places.
Answer:
I finally found the answer. The calling convention was wrong: prefix the call to the custom function with `yield from` and it works.

def parse_page(self, response):
    div_list = response.xpath("//div[@class='contentF']/div[@class='content_l']/p")
    yield from self.parse_paragraph(div_list)
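The root cause can be shown without Scrapy at all. In plain Python, a bare call to a generator function only creates a generator object and discards it, whereas `yield from` delegates to the sub-generator and re-yields every item it produces. A minimal sketch, with illustrative stand-in functions rather than the actual spider code:

```python
def parse_paragraph(texts):
    # Helper generator, like the one in the question: yields one item per text.
    for t in texts:
        yield {'name': t}

def parse_page_broken(texts):
    # Bare call: this only creates a generator object and throws it away.
    # The body of parse_paragraph never runs, and this function yields nothing.
    parse_paragraph(texts)

def parse_page_fixed(texts):
    # 'yield from' delegates: every item the helper yields is
    # re-yielded by this outer generator, so the caller sees them.
    yield from parse_paragraph(texts)

print(parse_page_broken(['a', 'b']))        # None - nothing was produced
print(list(parse_page_fixed(['a', 'b'])))   # [{'name': 'a'}, {'name': 'b'}]
```

Scrapy iterates the generator your callback produces, so items only reach the engine if the outer callback actually re-yields them.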
Answer:
You are only missing a `return`. Without it, the generator produced by the yield is never iterated, which is why you never see the print output.
Change it like this:

def parse_page(self, response):
    div_list = response.xpath("//div[@class='contentF']/div[@class='content_l']/p")
-   self.parse_paragraph(div_list)
+   return self.parse_paragraph(div_list)
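Returning the generator works because Scrapy iterates whatever iterable a callback returns. The same behavior can be sketched in plain Python; note that `consume` below is a hypothetical stand-in for the Scrapy engine, not a real Scrapy function:

```python
def parse_paragraph(texts):
    # Generator helper, as in the question.
    for t in texts:
        yield {'name': t}

def parse_page(texts):
    # Returning the generator hands it to the caller, which iterates it.
    return parse_paragraph(texts)

def consume(callback_result):
    # Hypothetical stand-in for the engine: it iterates whatever the
    # callback returned (a generator, a list, or None for "no output").
    return list(callback_result or [])

items = consume(parse_page(['a', 'b']))
print(items)  # [{'name': 'a'}, {'name': 'b'}]
```

With the bare call (no `return`), the callback returns None, so the engine has nothing to iterate and the helper's body never runs.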
Answer:
(Follow-up)
This is the first time I've seen someone make a mistake and still be this certain that the other answers are wrong.
First, under normal circumstances Scrapy's start_requests runs only once, at startup. By default it wraps each URL in the start_urls list in a scrapy.Request and yields them to the scheduler one by one; after the download, the engine invokes the parse callback. If you need to keep crawling, just yield another scrapy.Request inside the parse function; and when you have collected data, create an Item subclass instance and yield it - Scrapy will recognize it and hand it to the pipeline. As the answer below says: doesn't your yield from just amount to a return?
The problem is not there. You haven't understood how Scrapy operates: the start_requests function runs only once; it processes the URLs in the list, passes them to the scheduler, and is done.
What matters is that you override the (default) parse function, or your custom callback, and yield a Request from it. That is what keeps the crawl going:
def parse_page(self, response):
    div_list = response.xpath("//div[@class='contentF']/div[@class='content_l']/p")
    self.parse_paragraph(div_list)
    yield scrapy.Request(something)

def parse_paragraph(self, div_list):
    for div in div_list:
        duilian_text_list = div.xpath('./text()').extract()
        for duilian_text in duilian_text_list:
            duilian_item = DuilianItem()
            duilian_item['category_id'] = 1
            duilian = duilian_text
            duilian_item['name'] = duilian
            duilian_item['desc'] = ''
            print('I reach here...')  # this line is never executed
            yield duilian_item
Answer:
Your answer is wrong. The custom function contains a yield, which makes it a generator function, and you don't invoke a generator's body by simply calling the function name.
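That last point is easy to verify in a few lines of plain Python: calling a generator function never executes its body; the body (including any print) runs only when the resulting generator is iterated:

```python
def make_items():
    # Generator function: calling it does NOT run this body.
    print('body is running')
    yield 'item'

gen = make_items()           # nothing is printed here
print(type(gen).__name__)    # generator

items = list(gen)            # iteration finally runs the body and prints
print(items)                 # ['item']
```

This is exactly why the bare `self.parse_paragraph(div_list)` call in the question appears to "not be called": the generator it creates is simply never iterated.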
That is the full content of "Python Scrapy: custom function never gets called". Source: utcz.com/a/157806.html