Scrapy exits immediately after starting, without running the spider?

Problem description

I have recently been learning Scrapy. While writing a project to crawl images, I found that after the spider starts it exits immediately without ever running my callback. I have searched for a long time without finding where the problem is, and there are no relevant answers online either.

Debugging ends the same way, immediately. A breakpoint set on the first line of the parse_item function is never hit ┭┮﹏┭┮
Debugger output below:

pydev debugger: process 3331 is connecting

Connected to pydev debugger (build 193.6494.30)

Environment and what I have already tried

  • At first I was crawling the 彼岸图网 wallpaper site; later I switched to ZOL wallpapers
  • Modified the headers
  • Tried rebuilding the project from scratch
  • Suspected my IP was banned, but the site is reachable from scrapy shell
  • Switched to a similar target site; same result

Relevant code

Spider code: meizi.py

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from ..items import CrawlBiAnItem


class MeiziSpider(CrawlSpider):
    name = 'meizi'
    allowed_domains = ['sj.zol.com.cn/']
    start_urls = ['http://sj.zol.com.cn/bizhi/meinv/2.html']
    base_url = 'sj.zol.com.cn/bizhi/meinv'
    img_urls = []

    rules = (
        Rule(LinkExtractor(allow=r'http://sj.zol.com.cn/bizhi/meinv/\d{3,5}.html'),
             callback="parse_item", follow=True),
    )

    def parse_item(self, response):
        print("++")
        index_urls = response.xpath("//li[@class='photo-list-padding']//a/@href").getall()
        for i in index_urls:
            index_url = self.base_url + i
            yield scrapy.Request(index_url, callback=self.crawl_img_url)
        item = CrawlBiAnItem(image_urls=self.img_urls)
        yield item

    def crawl_img_url(self, response):
        img_url = response.xpath("//img[@id='bigImg']/@src").get()
        self.img_urls.append(img_url)

settings.py

import os

from scrapy.pipelines.images import ImagesPipeline

BOT_NAME = 'crawl_bi_an'
LOG_LEVEL = 'WARNING'
SPIDER_MODULES = ['crawl_bi_an.spiders']
NEWSPIDER_MODULE = 'crawl_bi_an.spiders'

IMAGES_STORE = os.path.join(os.path.dirname(os.path.dirname(__file__)), "images")

DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'zh-CN',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.122 Safari/537.36',
}

ITEM_PIPELINES = {
    'scrapy.pipelines.images.ImagesPipeline': 1,
}

pipelines.py

# (empty)

items.py

import scrapy

class CrawlBiAnItem(scrapy.Item):

# define the fields for your item here like:

# name = scrapy.Field()

image_urls = scrapy.Field()

images = scrapy.Field()

pass


Answer:

Change this:

  allowed_domains = ['sj.zol.com.cn']

(no trailing slash). If it still doesn't work, post the log output.
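The reason the trailing slash matters: Scrapy's OffsiteMiddleware compares each request's hostname against the entries in allowed_domains, and a hostname can never equal 'sj.zol.com.cn/', so every link the Rule extracts is silently dropped and the crawl finishes at once. A minimal sketch of that comparison (a simplified stand-in for illustration, not Scrapy's actual code):

```python
from urllib.parse import urlparse

def is_offsite(url, allowed_domains):
    # Simplified version of the hostname check OffsiteMiddleware performs:
    # the host must equal an allowed domain or be a subdomain of one.
    host = urlparse(url).hostname or ""
    return not any(host == d or host.endswith("." + d) for d in allowed_domains)

url = "http://sj.zol.com.cn/bizhi/meinv/1234.html"
print(is_offsite(url, ["sj.zol.com.cn/"]))  # True  -> request filtered, spider exits
print(is_offsite(url, ["sj.zol.com.cn"]))   # False -> request allowed
```

Note that LOG_LEVEL = 'WARNING' in settings.py hides the DEBUG-level "Filtered offsite request" messages that would have pointed at this; temporarily setting LOG_LEVEL = 'DEBUG' should make them visible.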


Answer:

You should handle the images inside the pipeline.
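To expand on why: parse_item builds the item from self.img_urls, but Scrapy processes the requests it yields asynchronously — the item is emitted before any crawl_img_url callback has run, so image_urls is empty (or carries leftovers from other pages). A toy simulation of that ordering (plain Python with hypothetical names, not Scrapy internals):

```python
def simulate():
    img_urls = []   # stands in for the shared self.img_urls
    pending = []    # stands in for Scrapy's scheduler queue
    emitted = []    # items as they are yielded

    def parse_item():
        for i in range(3):                        # three detail-page links found
            pending.append(lambda i=i: img_urls.append(f"img-{i}.jpg"))
        emitted.append(list(img_urls))            # item yielded NOW -> still empty

    parse_item()
    for callback in pending:                      # crawl_img_url runs only later
        callback()
    return emitted[0], img_urls

item_urls, final_urls = simulate()
print(item_urls)    # [] -> the yielded item contains no image URLs
print(final_urls)   # the URLs arrive only after the item was already emitted
```

A more reliable pattern is to yield one item per page inside crawl_img_url — e.g. yield CrawlBiAnItem(image_urls=[img_url]) — and let ImagesPipeline download each one, instead of accumulating URLs in spider state.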

The above is the full content of "Scrapy exits immediately after starting, without running the spider?". Source: utcz.com/a/164154.html