Python Scrapy: custom function is not being called

While scraping a page I ran into a very strange problem: if I move the logic into a custom function, the `yield item` never runs. The page being scraped: http://www.duilian360.com/chu...
The code:

    import scrapy

    from shufa.items import DuilianItem

    class DuilianSpiderSpider(scrapy.Spider):
        name = 'duilian_spider'
        start_urls = [
            {"url": "http://www.duilian360.com/chunjie/117.html", "category_name": "春联", "group_name": "鼠年春联"},
        ]
        base_url = 'http://www.duilian360.com'

        def start_requests(self):
            for topic in self.start_urls:
                url = topic['url']
                yield scrapy.Request(url=url, callback=lambda response: self.parse_page(response))

        def parse_page(self, response):
            div_list = response.xpath("//div[@class='contentF']/div[@class='content_l']/p")
            self.parse_paragraph(div_list)

        def parse_paragraph(self, div_list):
            for div in div_list:
                duilian_text_list = div.xpath('./text()').extract()
                for duilian_text in duilian_text_list:
                    duilian_item = DuilianItem()
                    duilian_item['category_id'] = 1
                    duilian = duilian_text
                    duilian_item['name'] = duilian
                    duilian_item['desc'] = ''
                    print('I reach here...')  # this line is never reached
                    yield duilian_item

In the code above, the print statement never executes, and a breakpoint inside parse_paragraph is never hit either. But if I paste the body of parse_paragraph directly at the call site, the print statement does fire, like this:

    import scrapy

    from shufa.items import DuilianItem

    class DuilianSpiderSpider(scrapy.Spider):
        name = 'duilian_spider'
        start_urls = [
            {"url": "http://www.duilian360.com/chunjie/117.html", "category_name": "春联", "group_name": "鼠年春联"},
        ]
        base_url = 'http://www.duilian360.com'

        def start_requests(self):
            for topic in self.start_urls:
                url = topic['url']
                yield scrapy.Request(url=url, callback=lambda response: self.parse_page(response))

        def parse_page(self, response):
            div_list = response.xpath("//div[@class='contentF']/div[@class='content_l']/p")
            for div in div_list:
                duilian_text_list = div.xpath('./text()').extract()
                for duilian_text in duilian_text_list:
                    duilian_item = DuilianItem()
                    duilian_item['category_id'] = 1
                    duilian = duilian_text
                    duilian_item['name'] = duilian
                    duilian_item['desc'] = ''
                    print('I reach here...')
                    yield duilian_item

        # def parse_paragraph(self, div_list):
        #     for div in div_list:
        #         duilian_text_list = div.xpath('./text()').extract()
        #         for duilian_text in duilian_text_list:
        #             duilian_item = DuilianItem()
        #             duilian_item['category_id'] = 1
        #             duilian = duilian_text
        #             duilian_item['name'] = duilian
        #             duilian_item['desc'] = ''
        #             print('I reach here...')
        #             yield duilian_item

Why is that? My code has many custom functions and many for loops. Pasting the code inline at every call site would be ugly and hard to maintain, since the same function may be called from many places.


Answer:

Finally found the answer. The call was simply made the wrong way: prefix the call to the custom function with `yield from` and it works.

    def parse_page(self, response):
        div_list = response.xpath("//div[@class='contentF']/div[@class='content_l']/p")
        yield from self.parse_paragraph(div_list)
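Outside Scrapy, the effect of `yield from` can be shown with plain generators (the function names and data below are illustrative, not from the spider): the outer generator re-yields every item the inner one produces, so the inner function's body actually runs when the result is iterated.

```python
def parse_paragraph(texts):
    # Inner generator: produces one item dict per text fragment.
    for text in texts:
        yield {'name': text, 'desc': ''}

def parse_page(texts):
    # A bare `parse_paragraph(texts)` call would only create a generator
    # object and discard it; `yield from` passes its items through.
    yield from parse_paragraph(texts)

items = list(parse_page(['first couplet', 'second couplet']))
print(items)
# [{'name': 'first couplet', 'desc': ''}, {'name': 'second couplet', 'desc': ''}]
```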


Answer:

You are just one `return` short. Without it, the generator created by the function containing `yield` is never consumed, which is why you never see the print output.

Change it to:

    def parse_page(self, response):
        div_list = response.xpath("//div[@class='contentF']/div[@class='content_l']/p")
    -   self.parse_paragraph(div_list)
    +   return self.parse_paragraph(div_list)
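The difference between the broken call and the `return` fix can be reproduced without Scrapy. Scrapy iterates whatever a callback returns, so handing back the generator is enough; the simulation below is a sketch with made-up names:

```python
def make_items(texts):
    # Generator: its body runs only when someone iterates it.
    for t in texts:
        print('I reach here...')
        yield {'name': t}

def parse_broken(texts):
    make_items(texts)          # generator created and thrown away: nothing
                               # printed, and parse_broken returns None

def parse_fixed(texts):
    return make_items(texts)   # hand the generator to the caller to iterate

parse_broken(['a'])                # no output at all
print(list(parse_fixed(['a'])))    # prints 'I reach here...' then [{'name': 'a'}]
```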


Answer:

(Addendum)
First time I've seen someone so sure that other people's answers are wrong while their own code is broken.
First, Scrapy's start_requests normally runs only once, at startup. By default it wraps each URL in the start_urls list in a scrapy.Request and yields them one by one to the scheduler; after each download the scheduler calls back into the parse function. If you need to keep crawling, just yield another scrapy.Request inside the parse function. When you have collected data, create an Item subclass instance and yield it; Scrapy recognizes it automatically and hands it to the pipeline. The answer below has a point: doesn't your `yield from` just amount to a `return`?


That is not where the problem is. You haven't understood how Scrapy works: start_requests runs only once, processes the URLs in the list, passes them to the scheduler, and is done.
What matters is that you override the (default) parse function, or your own callback, and yield a Request from it. That is how the crawl keeps going:

    def parse_page(self, response):
        div_list = response.xpath("//div[@class='contentF']/div[@class='content_l']/p")
        self.parse_paragraph(div_list)
        yield scrapy.Request(something)

    def parse_paragraph(self, div_list):
        for div in div_list:
            duilian_text_list = div.xpath('./text()').extract()
            for duilian_text in duilian_text_list:
                duilian_item = DuilianItem()
                duilian_item['category_id'] = 1
                duilian = duilian_text
                duilian_item['name'] = duilian
                duilian_item['desc'] = ''
                print('I reach here...')  # this line is never reached
                yield duilian_item


Answer:

Your answer is wrong: the custom function contains a `yield`, so it is a generator function. You can't run a generator by simply calling the function name.
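This point is easy to verify in a plain Python session. The snippet below (names are illustrative) shows that calling a generator function only builds a generator object; the body does not run until the generator is iterated:

```python
def emit():
    print('body is running')   # executes only once the generator is iterated
    yield 'item'

gen = emit()                   # nothing is printed here
print(type(gen).__name__)      # 'generator'
print(next(gen))               # prints 'body is running', then 'item'
```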

That is all for "Python Scrapy: custom function is not being called". Source: utcz.com/a/157806.html
