Scrapy outputs an empty JSON file

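I should say up front that I have read some answers to the same problem, but I still cannot solve mine. I am new to Python. I am trying to extract data about apps and stores from Aptoide, and I want the output as a .json (or .csv) file, but the file I get is empty and I don't know why.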

Here is my code:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import HtmlXPathSelector

class ApptoideItem(scrapy.Item):
    app_name = scrapy.Field()
    rating = scrapy.Field()
    security_status = scrapy.Field()
    good_flag = scrapy.Field()
    licence_flag = scrapy.Field()
    fake_flag = scrapy.Field()
    freeze_flag = scrapy.Field()
    virus_flag = scrapy.Field()
    five_stars = scrapy.Field()
    four_stars = scrapy.Field()
    three_stars = scrapy.Field()
    two_stars = scrapy.Field()
    one_stars = scrapy.Field()
    info = scrapy.Field()
    download = scrapy.Field()
    version = scrapy.Field()
    size = scrapy.Field()
    link = scrapy.Field()
    store = scrapy.Field()

class AppSpider(CrawlSpider):
    name = "second"
    allowed_domains = ["aptoide.com"]
    start_urls = ["http://www.aptoide.com/page/morestores/type:top"]

    rules = (
        Rule(LinkExtractor(allow=(r'\w+\.store\.aptoide\.com$'))),
        Rule(LinkExtractor(allow=(r'\w+\.store\.aptoide\.com/app/market')), callback='parse_item')
    )

    def parse_item(self, response):
        item = ApptoideItem()
        item['app_name'] = str(response.css(".app_name::text").extract()[0])
        item['rating'] = str(response.css(".app_rating_number::text").extract()[0])
        item['security_status'] = str(response.css("#show_app_malware_data::text").extract()[0])
        item['good_flag'] = int(response.css(".good > div:nth-child(3)::text").extract()[0])
        item['licence_flag'] = int(response.css(".license > div:nth-child(3)::text").extract()[0])
        item['fake_flag'] = int(response.css(".fake > div:nth-child(3)::text").extract()[0])
        item['freeze_flag'] = int(response.css(".freeze > div:nth-child(3)::text").extract()[0])
        item['virus_flag'] = int(response.css(".virus > div:nth-child(3)::text").extract()[0])
        item['five_stars'] = int(response.css("div.app_ratting_bar_holder:nth-child(1) > div:nth-child(3)::text").extract()[0])
        item['four_stars'] = int(response.css("div.app_ratting_bar_holder:nth-child(2) > div:nth-child(3)::text").extract()[0])
        item['three_stars'] = int(response.css("div.app_ratting_bar_holder:nth-child(3) > div:nth-child(3)::text").extract()[0])
        item['two_stars'] = int(response.css("div.app_ratting_bar_holder:nth-child(4) > div:nth-child(3)::text").extract()[0])
        item['link'] = response.url
        item['one_stars'] = int(response.css("div.app_ratting_bar_holder:nth-child(5) > div:nth-child(3)::text").extract()[0])
        item['download'] = int(response.css("p.app_meta::text").re('(\d[\w\.]*)')[0].replace('.', ''))
        item['version'] = str(response.css("p.app_meta::text").re('(\d[\w\.]*)')[1])
        item['size'] = str(response.css("p.app_meta::text").re('(\d[\w\.]*)')[2])
        item['store_name'] = str(response.css(".sec_header_txt::text").extract()[0])
        item['info_store'] = str(response.css(".ter_header2::text").extract()[0])
        yield item

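I am fairly sure the problem is that the parse_item method is never called, and I don't know why. The first rule follows the links to the stores, and the second follows the links to the apps inside each store. I believe the regular expression syntax is correct.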
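One way to test the claim about the regular expressions outside Scrapy is to run the two allow patterns against a few URLs with Python's re module. LinkExtractor applies each allow pattern as a regular-expression search on the absolute URL, so re.search is a reasonable stand-in. This is only a sketch: the sample URLs and the helper name matching_rules are hypothetical and are not taken from the Aptoide site.

```python
import re

# The two "allow" patterns copied from the spider's rules.
STORE_RULE = re.compile(r'\w+\.store\.aptoide\.com$')
APP_RULE = re.compile(r'\w+\.store\.aptoide\.com/app/market')

# Hypothetical URLs, made up for illustration only.
SAMPLE_URLS = [
    "http://apps.store.aptoide.com",
    "http://apps.store.aptoide.com/",
    "http://apps.store.aptoide.com/app/market/com.example/1/2/example",
]


def matching_rules(url):
    """Return the names of the rules whose pattern is found in url."""
    names = []
    if STORE_RULE.search(url):
        names.append("store")
    if APP_RULE.search(url):
        names.append("app")
    return names


for url in SAMPLE_URLS:
    print(url, "->", matching_rules(url))
```

If the store links on the start page end with a slash, the `$` anchor in the first pattern would stop them from matching, so the spider would never reach the app pages. Whether that is the case here has to be checked against the real links in the crawl log.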

The settings are:

CLOSESPIDER_PAGECOUNT = 1000
CLOSESPIDER_ITEMCOUNT = 500
CONCURRENT_REQUESTS = 1
CONCURRENT_ITEMS = 1
BOT_NAME = 'nuovo'
SPIDER_MODULES = ['nuovo.spiders']
NEWSPIDER_MODULE = 'nuovo.spiders'

Can anyone spot the problem and suggest a solution?

Answer:

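Your code is full of errors. When you run the spider, you can save the log and grep through it:

    scrapy crawl spidername 2>&1 | tee crawl.log

A few errors I found:

  • ApptoideItem is missing several fields that parse_item assigns, such as store_name.

  • All of the int() conversions are unsafe. If response.css matches nothing, extract() returns an empty list, so extract()[0] raises an IndexError, and int() raises a ValueError when the text is not a number.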
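The first point can be checked without running the spider by comparing the field names declared on ApptoideItem with the keys that parse_item assigns. The sketch below uses plain Python sets, with both lists copied by hand from the question, so it is a stand-in for reading the code rather than anything Scrapy does itself. In Scrapy, assigning a key that is not a declared Field raises a KeyError.

```python
# Field names declared on ApptoideItem, copied from the question.
declared_fields = {
    "app_name", "rating", "security_status", "good_flag", "licence_flag",
    "fake_flag", "freeze_flag", "virus_flag", "five_stars", "four_stars",
    "three_stars", "two_stars", "one_stars", "info", "download", "version",
    "size", "link", "store",
}

# Keys assigned inside parse_item, copied from the question.
assigned_keys = {
    "app_name", "rating", "security_status", "good_flag", "licence_flag",
    "fake_flag", "freeze_flag", "virus_flag", "five_stars", "four_stars",
    "three_stars", "two_stars", "link", "one_stars", "download", "version",
    "size", "store_name", "info_store",
}

# Keys that would raise KeyError when assigned on the item.
undeclared = sorted(assigned_keys - declared_fields)
# Declared fields that parse_item never fills in.
unused = sorted(declared_fields - assigned_keys)

print("assigned but not declared:", undeclared)
print("declared but never assigned:", unused)
```

The two sets differ by the same pair of names in each direction, which suggests that the fields were renamed in one place only.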

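To fix the second issue, I recommend looking into Scrapy's ItemLoaders, which let you specify default processing for fields, for example converting the _flag fields to booleans.
Also, as @Jan mentioned in the comments, you should use the extract_first() method instead of extract()[0]. extract_first() lets you specify a default value to return when nothing is found, e.g. .extract_first(default=0).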
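The difference between indexing an empty result and falling back to a default can be sketched in plain Python. The helpers first_or_default and to_int below are hypothetical names written for this illustration. They imitate the idea behind extract_first(default=...) on an ordinary list and are not part of Scrapy.

```python
def first_or_default(values, default=None):
    """Return the first element of values, or default when values is empty."""
    return values[0] if values else default


def to_int(text, default=0):
    """Convert scraped text to int, returning default for None or non-numeric text."""
    if text is None:
        return default
    try:
        return int(str(text).strip())
    except ValueError:
        return default


# An empty list stands in for what extract() returns when nothing matches.
no_match = []

try:
    no_match[0]
except IndexError:
    print("indexing an empty result raises IndexError")

print(to_int(first_or_default(no_match)))    # falls back to the default
print(to_int(first_or_default([" 42 "])))    # surrounding whitespace is stripped
```

With helpers like these, a missing element yields a default value instead of an exception that aborts parse_item before the item is yielded.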

