Should I create a pipeline to save files in Scrapy?

Z时代
2024-01-10
Category: Q&A

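I need to save files (.pdf) but I'm not sure how to do it. I need to save the PDFs and store them in directories organized the same way as on the site I'm scraping them from.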

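From what I can gather, I need to build a pipeline, but as far as I understand, pipelines save "Items", and items are just basic data such as strings and numbers. Is saving files a proper use of pipelines, or should I save the files in the spider instead?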

Answer:

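When you fetch a PDF, the whole response body is held in memory. That is fine as long as the PDFs are not large enough to exhaust your available memory.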

You can save the PDF in the spider callback:

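from scrapy import Request

def parse_listing(self, response):
    # ... extract pdf urls into pdf_urls
    for url in pdf_urls:
        # schedule each PDF for download; save_pdf receives the response
        yield Request(url, callback=self.save_pdf)

def save_pdf(self, response):
    # get_path is your own helper that maps a URL to a local file path
    path = self.get_path(response.url)
    # response.body is bytes, so open the file in binary mode
    with open(path, "wb") as f:
        f.write(response.body)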
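
The snippet above calls `self.get_path` without defining it. A minimal sketch of such a helper, assuming the goal is to mirror the site's URL path under a local root directory. The names `url_to_path` and `ensure_parent_dir` and the `downloads` root are hypothetical, not part of Scrapy:

```python
import os
from urllib.parse import urlparse, unquote


def url_to_path(url, root="downloads"):
    """Map a URL to a local path that mirrors the site's directory layout."""
    parsed = urlparse(url)
    # Decode percent-escapes, then drop empty, "." and ".." segments so the
    # result can never point outside of `root`.
    segments = [s for s in unquote(parsed.path).split("/")
                if s not in ("", ".", "..")]
    if not segments:
        segments = ["index"]  # assumed fallback name for a bare host URL
    # Use the host as the top-level folder so different sites do not collide.
    return os.path.join(root, parsed.netloc, *segments)


def ensure_parent_dir(path):
    """Create the directory that will contain `path` if it is missing."""
    os.makedirs(os.path.dirname(path), exist_ok=True)
```

With this sketch, `http://example.com/docs/2020/report.pdf` maps to `downloads/example.com/docs/2020/report.pdf` on a POSIX system. Call `ensure_parent_dir(path)` before `open(path, "wb")`, since `open` does not create missing directories.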

If you choose to do it in a pipeline instead:

# in the spider
def parse_pdf(self, response):
    i = MyItem()
    i['body'] = response.body  # raw PDF bytes
    i['url'] = response.url
    # you can add more metadata to the item
    return i

# in your pipeline
def process_item(self, item, spider):
    path = self.get_path(item['url'])
    with open(path, "wb") as f:
        f.write(item['body'])
    # remove body and add path as reference
    del item['body']
    item['path'] = path
    # let the item be processed by other pipelines, e.g. a database store
    return item
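
Put together as a class, the pipeline variant might look like the sketch below. A Scrapy item pipeline is a plain Python class with a `process_item` method, so the example needs no Scrapy import and works with dict items. The class name `PdfWriterPipeline`, the `root` argument, and the path layout are assumptions for illustration:

```python
import os
from urllib.parse import urlparse


class PdfWriterPipeline:
    """Write the bytes carried by an item to disk and keep only the path."""

    def __init__(self, root="downloads"):
        self.root = root  # hypothetical base directory for saved files

    def get_path(self, url):
        # Mirror the URL path under root/<host>/, ignoring unsafe segments.
        parsed = urlparse(url)
        parts = [p for p in parsed.path.split("/") if p not in ("", ".", "..")]
        return os.path.join(self.root, parsed.netloc, *(parts or ["index"]))

    def process_item(self, item, spider):
        path = self.get_path(item["url"])
        # open() does not create missing directories, so create them first.
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "wb") as f:
            f.write(item["body"])
        # Drop the large body and keep the path as a reference instead.
        del item["body"]
        item["path"] = path
        return item
```

To enable a pipeline, list it in the `ITEM_PIPELINES` setting, for example `ITEM_PIPELINES = {"myproject.pipelines.PdfWriterPipeline": 300}`, where `myproject` is a placeholder for your project package. If `MyItem` subclasses `scrapy.Item`, the `body`, `url`, and `path` fields must all be declared on it, because assigning an undeclared field raises `KeyError`.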