How do I correctly use Rule and restrict_xpaths to crawl and parse URLs?

I am trying to program a crawl spider that crawls a website's RSS feeds and then parses the meta tags of the articles.

The first RSS page is a page that displays the RSS categories. I managed to extract the links because the <a> tags sit inside <td> tags. It looks like this:

    <tr>
      <td class="xmlLink">
        <a href="http://feeds.example.com/subject1">subject1</a>
      </td>
    </tr>
    <tr>
      <td class="xmlLink">
        <a href="http://feeds.example.com/subject2">subject2</a>
      </td>
    </tr>

Once you click one of those links, it brings you to the articles for that RSS category, which look like this:

    <li class="regularitem">
      <h4 class="itemtitle">
        <a href="http://example.com/article1">article1</a>
      </h4>
    </li>
    <li class="regularitem">
      <h4 class="itemtitle">
        <a href="http://example.com/article2">article2</a>
      </h4>
    </li>
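As a sanity check on the XPath idea, the expression //h4[@class="itemtitle"]/a really does select those article anchors. A minimal, standalone illustration using only Python's standard-library xml.etree (note: Scrapy uses its own selectors with full XPath support, while ElementTree only supports a limited subset — this is just to show the expression matches the markup above):

    # Illustrative only: verify the XPath-style expression against the snippet.
    import xml.etree.ElementTree as ET

    snippet = """
    <ul>
      <li class="regularitem">
        <h4 class="itemtitle">
          <a href="http://example.com/article1">article1</a>
        </h4>
      </li>
      <li class="regularitem">
        <h4 class="itemtitle">
          <a href="http://example.com/article2">article2</a>
        </h4>
      </li>
    </ul>
    """

    root = ET.fromstring(snippet)
    # ElementTree's limited XPath equivalent of //h4[@class="itemtitle"]/a
    links = [a.get('href') for a in root.findall(".//h4[@class='itemtitle']/a")]
    print(links)  # ['http://example.com/article1', 'http://example.com/article2']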

As you can see, I can again get the links with XPath if I use the <h4> tag. I want my crawler to follow the links inside those tags and parse the meta tags for me.

Here is my crawler code:

    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
    from scrapy.selector import HtmlXPathSelector
    from tutorial.items import exampleItem

    class MetaCrawl(CrawlSpider):
        name = 'metaspider'
        start_urls = ['http://example.com/tools/rss']  # urls from which the spider will start crawling
        rules = [
            Rule(SgmlLinkExtractor(restrict_xpaths=('//td[@class="xmlLink"]')), follow=True),
            Rule(SgmlLinkExtractor(restrict_xpaths=('//h4[@class="itemtitle"]')), callback='parse_articles'),
        ]

        def parse_articles(self, response):
            hxs = HtmlXPathSelector(response)
            meta = hxs.select('//meta')
            items = []
            for m in meta:
                item = exampleItem()
                item['link'] = response.url
                item['meta_name'] = m.select('@name').extract()
                item['meta_value'] = m.select('@content').extract()
                items.append(item)
            return items

However, this is the output when I run the crawler:

    DEBUG: Crawled (200) <GET http://http://feeds.example.com/subject1> (referer: http://example.com/tools/rss)
    DEBUG: Crawled (200) <GET http://http://feeds.example.com/subject2> (referer: http://example.com/tools/rss)

What am I doing wrong here? I have read the documentation over and over, but I feel like I keep overlooking something. Any help would be appreciated.

EDIT: Added items.append(item) — I had forgotten it in the original post.

EDIT: I have also tried the following, and it results in the same output:

    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
    from scrapy.selector import HtmlXPathSelector
    from reuters.items import exampleItem
    from scrapy.http import Request

    class MetaCrawl(CrawlSpider):
        name = 'metaspider'
        start_urls = ['http://example.com/tools/rss']  # urls from which the spider will start crawling
        rules = [
            Rule(SgmlLinkExtractor(allow=[r'.*'], restrict_xpaths=('//td[@class="xmlLink"]')), follow=True),
            Rule(SgmlLinkExtractor(allow=[r'.*'], restrict_xpaths=('//h4[@class="itemtitle"]')), follow=True),
        ]

        def parse(self, response):
            hxs = HtmlXPathSelector(response)
            meta = hxs.select('//td[@class="xmlLink"]/a/@href')
            for m in meta:
                yield Request(m.extract(), callback=self.parse_link)

        def parse_link(self, response):
            hxs = HtmlXPathSelector(response)
            meta = hxs.select('//h4[@class="itemtitle"]/a/@href')
            for m in meta:
                yield Request(m.extract(), callback=self.parse_again)

        def parse_again(self, response):
            hxs = HtmlXPathSelector(response)
            meta = hxs.select('//meta')
            items = []
            for m in meta:
                item = exampleItem()
                item['link'] = response.url
                item['meta_name'] = m.select('@name').extract()
                item['meta_value'] = m.select('@content').extract()
                items.append(item)
            return items

Answer:

You were returning an empty items list — you need to append each item to items.

You can also yield item inside the loop instead.
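To make the two callback styles concrete, here is a minimal sketch of the loop from parse_articles, with plain dicts standing in for the exampleItem class (and fake_meta standing in for the selected <meta> tags) so the snippet runs standalone outside Scrapy:

    # Sketch of the two callback styles: collect-and-return vs. yield.
    # `exampleItem` is replaced by a plain dict, and `fake_meta` is stand-in
    # data for the (name, content) pairs a real spider would select.
    fake_meta = [("description", "an article"), ("keywords", "rss,crawl")]

    def parse_articles_return(url, meta_tags):
        """Collect items in a list and return it (note the append call)."""
        items = []
        for name, content in meta_tags:
            item = {"link": url, "meta_name": name, "meta_value": content}
            items.append(item)  # this line was missing in the original post
        return items

    def parse_articles_yield(url, meta_tags):
        """Alternatively, yield each item as it is built."""
        for name, content in meta_tags:
            yield {"link": url, "meta_name": name, "meta_value": content}

    returned = parse_articles_return("http://example.com/article1", fake_meta)
    yielded = list(parse_articles_yield("http://example.com/article1", fake_meta))
    assert returned == yielded  # both styles produce the same items

In a Scrapy callback the yield style is usually preferred, since items are emitted as they are built rather than held in a list.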

Source: utcz.com/qa/434405.html
