selenium scrape:

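I have a spider that scrapes a website whose content is reloaded by JavaScript on the page. To move to the next page to scrape, I have been using Selenium to click the month links at the top of the site.

The problem is that, even though my code steps through each link as expected, the spider only scrapes the first month's (September's) data, once for every month link, and returns this duplicated data.

How can I fix this?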

import time

# Scrapy 0.x-era import paths (assumed; the original post does not show them).
from scrapy.contrib.spiders.init import InitSpider
from scrapy.selector import HtmlXPathSelector
from selenium import webdriver

# GigsInScotlandMainItem comes from the project's own items module
# (its import is not shown in the original post).


class GigsInScotlandMain(InitSpider):
    name = 'gigsinscotlandmain'
    allowed_domains = ["gigsinscotland.com"]
    start_urls = ["http://www.gigsinscotland.com"]

    def __init__(self):
        InitSpider.__init__(self)
        self.br = webdriver.Firefox()

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        self.br.get(response.url)
        time.sleep(2.5)

        # Get the string for each month on the page.
        months = hxs.select("//ul[@id='gigsMonths']/li/a/text()").extract()

        for month in months:
            link = self.br.find_element_by_link_text(month)
            link.click()
            time.sleep(5)

            # Get all the divs containing info to be scraped.
            listitems = hxs.select("//div[@class='listItem']")
            for listitem in listitems:
                item = GigsInScotlandMainItem()
                item['artist'] = listitem.select("div[contains(@class, 'artistBlock')]/div[@class='artistdiv']/span[@class='artistname']/a/text()").extract()
                #
                # Get other data ...
                #
                yield item

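Answer:

The problem is that you are reusing the HtmlXPathSelector that was built from the initial response, so every iteration queries the same first-month HTML. Rebuild it from the Selenium browser's page source (page_source) after each click: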

...
        for month in months:
            link = self.br.find_element_by_link_text(month)
            link.click()
            time.sleep(5)

            # Re-parse the HTML the browser is showing now, not the initial response.
            hxs = HtmlXPathSelector(text=self.br.page_source)

            # Get all the divs containing info to be scraped.
            listitems = hxs.select("//div[@class='listItem']")
...
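
To see why the original loop returns September's data over and over, here is a minimal, standard-library-only sketch of the same mistake and its fix. It is not the spider itself: `FakeBrowser`, its `click()` method, and the sample HTML are hypothetical stand-ins for the Selenium driver and the real site, and `html.parser` stands in for the Scrapy selector.

```python
from html.parser import HTMLParser


class ArtistParser(HTMLParser):
    """Collects the text of every <span class="artistname"> element."""

    def __init__(self):
        super().__init__()
        self.artists = []
        self._in_artist = False

    def handle_starttag(self, tag, attrs):
        if tag == "span" and dict(attrs).get("class") == "artistname":
            self._in_artist = True

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_artist = False

    def handle_data(self, data):
        if self._in_artist and data.strip():
            self.artists.append(data.strip())


def extract_artists(html):
    """Parse an HTML string and return the artist names found in it."""
    parser = ArtistParser()
    parser.feed(html)
    parser.close()
    return parser.artists


class FakeBrowser:
    """Hypothetical stand-in for a Selenium driver: click() swaps page_source."""

    PAGES = {
        "Sept": '<div class="listItem"><span class="artistname">Artist A</span></div>',
        "Oct": '<div class="listItem"><span class="artistname">Artist B</span></div>',
    }

    def __init__(self):
        self.page_source = self.PAGES["Sept"]

    def click(self, month):
        self.page_source = self.PAGES[month]


def scrape_stale(browser, months):
    """The bug: the HTML is captured once, before any click."""
    initial_html = browser.page_source
    results = []
    for month in months:
        browser.click(month)
        results.append(extract_artists(initial_html))
    return results


def scrape_fresh(browser, months):
    """The fix: re-read page_source after every click."""
    results = []
    for month in months:
        browser.click(month)
        results.append(extract_artists(browser.page_source))
    return results


if __name__ == "__main__":
    months = ["Sept", "Oct"]
    print(scrape_stale(FakeBrowser(), months))  # same first-month data twice
    print(scrape_fresh(FakeBrowser(), months))  # one result per month
```

The only difference between the two functions is where the HTML is read. Clicking changes what the browser shows, but anything parsed from the earlier response stays frozen, which is the situation in the spider above.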

