Can't get past the search form

I'm new to Scrapy, and I'm trying to get some information from a real-estate website. The site has a home page with a search form (method GET). In start_requests I'm requesting the results page (recherche.php) directly, setting all the GET parameters I can see in the address bar through the formdata argument. I also set my cookies, but that didn't work either.
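For reference: FormRequest sends a POST by default, so submitting a GET form explicitly requires method="GET" (Scrapy then encodes the formdata into the URL as a query string). A minimal, hypothetical sketch reusing two field names from the spider below:

from scrapy.http import FormRequest

# Hypothetical sketch: FormRequest defaults to POST, so a GET form has to be
# submitted with method="GET"; the formdata is then appended to the URL
# as a query string instead of being sent as a request body.
request = FormRequest(url="http://www.elyseavenue.com/recherche.php",
                      method="GET",
                      formdata={'recherche_distance_km_0': '20',
                                'recherche_type_logement': '9'})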

Here is my spider:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.http import FormRequest, Request
from robots_immo.items import AnnonceItem

class ElyseAvenueSpider(BaseSpider):
    name = "elyse_avenue"
    allowed_domains = ["http://www.elyseavenue.com/"]

    def start_requests(self):
        return [FormRequest(url="http://www.elyseavenue.com/recherche.php",
                            formdata={'recherche': 'recherche',
                                      'compteurLigne': '2',
                                      'numLigneCourante': '0',
                                      'inseeVille_0': '',
                                      'num_rubrique': '',
                                      'rechercheOK': 'recherche',
                                      'recherche_budget_max': '',
                                      'recherche_budget_min': '',
                                      'recherche_surface_max': '',
                                      'recherche_surface_min': '',
                                      'recherche_distance_km_0': '20',
                                      'recherche_reference_bien': '',
                                      'recherche_type_logement': '9',
                                      'recherche_ville_0': ''},
                            cookies={'PHPSESSID': '4e1d729f68d3163bb110ad3e4cb8ffc3',
                                     '__utma': '150766562.159027263.1340725224.1340725224.1340727680.2',
                                     '__utmc': '150766562',
                                     '__utmz': '150766562.1340725224.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none)',
                                     '__utmb': '150766562.14.10.1340727680'},
                            callback=self.parseAnnonces)]

    def parseAnnonces(self, response):
        hxs = HtmlXPathSelector(response)
        annonces = hxs.select('//div[@id="contenuCentre"]/div[@class="blocVignetteBien"]')
        items = []
        for annonce in annonces:
            item = AnnonceItem()
            item['nom'] = annonce.select('span[contains(@class,"nomBienImmo")]/a/text()').extract()
            item['superficie'] = annonce.select('table//tr[2]/td[2]/span/text()').extract()
            item['prix'] = annonce.select('span[@class="prixVignette"]/span[1]/text()').extract()
            items.append(item)
        return items

SPIDER = ElyseAvenueSpider()

When I run the spider there is no error, but the page that gets loaded is not the right one (it says "please specify your search", and I get no results):

2019-06-26 20:04:54+0200 [elyse_avenue] INFO: Spider opened
2019-06-26 20:04:54+0200 [elyse_avenue] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-06-26 20:04:54+0200 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2019-06-26 20:04:54+0200 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2019-06-26 20:04:54+0200 [elyse_avenue] DEBUG: Crawled (200) <POST http://www.elyseavenue.com/recherche.php> (referer: None)
2019-06-26 20:04:54+0200 [elyse_avenue] INFO: Closing spider (finished)
2019-06-26 20:04:54+0200 [elyse_avenue] INFO: Dumping spider stats:
    {'downloader/request_bytes': 808,
     'downloader/request_count': 1,
     'downloader/request_method_count/POST': 1,
     'downloader/response_bytes': 7590,
     'downloader/response_count': 1,
     'downloader/response_status_count/200': 1,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2019, 6, 26, 18, 4, 54, 924624),
     'scheduler/memory_enqueued': 1,
     'start_time': datetime.datetime(2019, 6, 26, 18, 4, 54, 559230)}
2019-06-26 20:04:54+0200 [elyse_avenue] INFO: Spider closed (finished)
2019-06-26 20:04:54+0200 [scrapy] INFO: Dumping global stats:
    {'memusage/max': 27410432, 'memusage/startup': 27410432}

Thanks.

Answer:

I would use FormRequest.from_response(), which does all of that work for you, since you may still be missing some fields:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.http import FormRequest, Request
from robots_immo.items import AnnonceItem

class ElyseAvenueSpider(BaseSpider):
    name = "elyse_avenue"
    allowed_domains = ["elyseavenue.com"]  # I fixed this
    start_urls = ["http://www.elyseavenue.com/"]  # I added this

    def parse(self, response):
        yield FormRequest.from_response(response,
                                        formname='moteurRecherche',
                                        formdata={'recherche_distance_km_0': '20',
                                                  'recherche_type_logement': '9'},
                                        callback=self.parseAnnonces)

    def parseAnnonces(self, response):
        hxs = HtmlXPathSelector(response)
        annonces = hxs.select('//div[@id="contenuCentre"]/div[@class="blocVignetteBien"]')
        items = []
        for annonce in annonces:
            item = AnnonceItem()
            item['nom'] = annonce.select('span[contains(@class,"nomBienImmo")]/a/text()').extract()
            item['superficie'] = annonce.select('table//tr[2]/td[2]/span/text()').extract()
            item['prix'] = annonce.select('span[@class="prixVignette"]/span[1]/text()').extract()
            items.append(item)
        return items
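Note that FormRequest.from_response() parses the actual <form> element, so it picks up every hidden field and honours the form's declared method and action URL. The log above shows the hand-built request going out as a POST even though the form uses GET, which is likely why the site answered with "please specify your search". A quick, hypothetical way to inspect what from_response() would submit, using scrapy shell (the form name is taken from the answer; the rest is an assumption):

$ scrapy shell http://www.elyseavenue.com/
>>> from scrapy.http import FormRequest
>>> req = FormRequest.from_response(response, formname='moteurRecherche',
...                                 formdata={'recherche_distance_km_0': '20',
...                                           'recherche_type_logement': '9'})
>>> req.method   # should be GET, mirroring the form's declared method
>>> req.url      # for GET forms the formdata ends up in the query string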
