Finding and downloading PDF files from a website with Scrapy

My task is to extract PDF files from a website using Scrapy. I'm not new to Python, but Scrapy is new to me. I've been experimenting with the console and a few basic spiders. I found and modified the following code:

import urlparse

import scrapy

from scrapy.http import Request


class pwc_tax(scrapy.Spider):
    name = "pwc_tax"
    allowed_domains = ["www.pwc.com"]
    start_urls = ["http://www.pwc.com/us/en/tax-services/publications/research-and-insights.html"]

    def parse(self, response):
        base_url = "http://www.pwc.com/us/en/tax-services/publications/research-and-insights.html"
        for a in response.xpath('//a[@href]/@href'):
            link = a.extract()
            if link.endswith('.pdf'):
                link = urlparse.urljoin(base_url, link)
                yield Request(link, callback=self.save_pdf)

    def save_pdf(self, response):
        path = response.url.split('/')[-1]
        with open(path, 'wb') as f:
            f.write(response.body)

I run the following from the command line:

scrapy crawl mySpider

and I get nothing back. I haven't created any Items to scrape, because I just want to crawl the site and download the files, with no metadata. Any help would be appreciated.

Answer:

I've updated your code, and this one actually works:

import scrapy

from scrapy.http import Request


class pwc_tax(scrapy.Spider):
    name = "pwc_tax"
    allowed_domains = ["www.pwc.com"]
    start_urls = ["http://www.pwc.com/us/en/tax-services/publications/research-and-insights.html"]

    def parse(self, response):
        # Follow each article link from the results listing page.
        for href in response.css('div#all_results h3 a::attr(href)').extract():
            yield Request(
                url=response.urljoin(href),
                callback=self.parse_article
            )

    def parse_article(self, response):
        # On the article page, request every download link ending in ".pdf".
        for href in response.css('div.download_wrapper a[href$=".pdf"]::attr(href)').extract():
            yield Request(
                url=response.urljoin(href),
                callback=self.save_pdf
            )

    def save_pdf(self, response):
        # Name the file after the last segment of the URL and write the
        # response body to the directory the crawl was started from.
        path = response.url.split('/')[-1]
        self.logger.info('Saving PDF %s', path)
        with open(path, 'wb') as f:
            f.write(response.body)
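To run it, `scrapy crawl` takes the spider's `name` attribute, which here is "pwc_tax" (not the class or file name), so from inside the Scrapy project directory:

scrapy crawl pwc_tax

The downloaded PDFs end up in whatever directory the command is started from.

As an alternative to writing the files by hand in a callback, Scrapy's built-in FilesPipeline can take care of the downloads. Below is a minimal sketch of that approach; the spider name, the "./pdfs" output directory and the dict-style items are my own assumptions, not part of the original answer:

import scrapy


class pwc_tax_files(scrapy.Spider):
    # Hypothetical variant of the spider above: instead of saving PDFs in a
    # callback, it yields items that Scrapy's stock FilesPipeline downloads.
    name = "pwc_tax_files"
    allowed_domains = ["www.pwc.com"]
    start_urls = ["http://www.pwc.com/us/en/tax-services/publications/research-and-insights.html"]

    custom_settings = {
        # Enable the built-in pipeline and tell it where to store the files.
        "ITEM_PIPELINES": {"scrapy.pipelines.files.FilesPipeline": 1},
        "FILES_STORE": "./pdfs",  # assumed output directory
    }

    def parse(self, response):
        # Same listing-page selector as in the answer above.
        for href in response.css('div#all_results h3 a::attr(href)').extract():
            yield scrapy.Request(response.urljoin(href), callback=self.parse_article)

    def parse_article(self, response):
        # FilesPipeline downloads every URL listed under "file_urls" and
        # records the download results under a "files" key on the item.
        for href in response.css('div.download_wrapper a[href$=".pdf"]::attr(href)').extract():
            yield {"file_urls": [response.urljoin(href)]}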
