How to scrape multiple pages when the URL does not change - Python

I am trying to scrape this site: http://data.eastmoney.com/xg/xg/

So far I have used selenium to execute the JavaScript and scrape the table. However, my code only gets me the first page. I would like to know if there is a way to access the other 17 pages, because when I click "next page" the URL does not change, so I cannot simply iterate over a different URL each time.

Below is the code I have so far:

```python
from selenium import webdriver
from bs4 import BeautifulSoup


def scrape():
    url = 'http://data.eastmoney.com/xg/xg/'
    d = {}
    driver = webdriver.PhantomJS()
    driver.get(url)

    # column indices to keep (column 2 is skipped below)
    lst = [x for x in range(0, 25)]

    htmlsource = driver.page_source
    bs = BeautifulSoup(htmlsource, "lxml")

    # build the header line from <thead>
    heading = bs.find_all('thead')[0]
    hlist = []
    for header in heading.find_all('tr'):
        head = header.find_all('th')
        for i in lst:
            if i != 2:
                hlist.append(head[i].get_text().strip())
    h = '|'.join(hlist)
    print h

    # collect the data rows from <tbody>, keyed by the first cell
    table = bs.find_all('tbody')[0]
    for row in table.find_all('tr'):
        cells = row.find_all('td')
        d[cells[0].get_text()] = [y.get_text() for y in cells]

    for key in d:
        ret = []
        for i in lst:
            if i != 2:
                ret.append(d.get(key)[i])
        s = '|'.join(ret)
        print s


if __name__ == "__main__":
    scrape()
```

Alternatively, could I use webdriver.Chrome() instead of PhantomJS to click "next" through a real browser after each page, and then run my Python code on each new page?

Answer:

This is not a trivial page to interact with: it requires the use of Explicit Waits to wait for the invisibility of the "loading" indicators.
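As background, an explicit wait is essentially a polling loop around a predicate. The sketch below is a minimal, library-free illustration of what Selenium's `WebDriverWait(...).until(...)` does internally; the function name `wait_until` and its defaults are my own for illustration, not part of Selenium's API:

```python
import time


def wait_until(condition, timeout=10.0, poll=0.5):
    """Poll `condition` until it returns a truthy value or `timeout` expires.

    Mirrors the behavior of WebDriverWait.until: repeatedly evaluate a
    zero-argument predicate, sleeping between attempts, and raise if the
    deadline passes without a truthy result.
    """
    deadline = time.time() + timeout
    while time.time() < deadline:
        result = condition()
        if result:
            return result
        time.sleep(poll)
    raise RuntimeError("condition not met within %.1f seconds" % timeout)
```

In the answer below, the predicate is `EC.invisibility_of_element_located(...)`, which returns truthy once the "loading" element is gone from the page.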

Here is a complete and working implementation you can use as a starting point:

```python
# -*- coding: utf-8 -*-
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium import webdriver
import time

url = "http://data.eastmoney.com/xg/xg/"
driver = webdriver.PhantomJS()
driver.get(url)


def get_table_results(driver):
    for row in driver.find_elements_by_css_selector("table#dt_1 tr[class]"):
        print [cell.text for cell in row.find_elements_by_tag_name("td")]


# initial wait for results
WebDriverWait(driver, 10).until(EC.invisibility_of_element_located((By.XPATH, u"//th[. = '加载中......']")))

while True:
    # print current page number
    page_number = driver.find_element_by_id("gopage").get_attribute("value")
    print "Page #" + page_number

    get_table_results(driver)

    next_link = driver.find_element_by_link_text("下一页")
    if "nolink" in next_link.get_attribute("class"):
        break

    next_link.click()
    time.sleep(2)  # TODO: fix?

    # wait for results to load
    WebDriverWait(driver, 10).until(EC.invisibility_of_element_located((By.XPATH, u"//img[contains(@src, 'loading')]")))
    print "------"
```

The idea is to have an endless loop that we exit only when the "Next Page" (下一页) link becomes disabled, meaning no more pages are available. On every iteration we grab the table results (printed to the console for the sake of the example), click the next-page link, and wait for the invisibility of the "loading" spinner that appears on top of the grid.
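The per-page extraction step (the `table#dt_1 tr[class]` selection in `get_table_results`) can also be exercised offline against saved page source using BeautifulSoup. The HTML fragment and cell values below are made up for illustration; only the table id and the "rows that carry a class attribute" selection logic come from the answer above:

```python
from bs4 import BeautifulSoup

# A made-up fragment standing in for the results grid (#dt_1). On the real
# page this grid is rendered by JavaScript, which is why Selenium (rather
# than plain requests + BeautifulSoup) is needed to obtain the source.
html = """
<table id="dt_1">
  <tr><th>Code</th><th>Name</th></tr>
  <tr class="odd"><td>600001</td><td>Stock A</td></tr>
  <tr class="even"><td>600002</td><td>Stock B</td></tr>
</table>
"""


def parse_table_rows(page_source):
    """Return the text of each data cell, row by row.

    Uses the same CSS selection as get_table_results: only <tr> elements
    that carry a class attribute are data rows; the header row is skipped.
    """
    soup = BeautifulSoup(page_source, "html.parser")
    return [[td.get_text(strip=True) for td in tr.find_all("td")]
            for tr in soup.select("table#dt_1 tr[class]")]
```

Separating the parsing from the browser automation this way makes the extraction logic testable without launching a driver at all.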

The above is the full content of "How to scrape multiple pages when the URL does not change - Python". Source: utcz.com/qa/403005.html
