Scrolling down a page slowly using Selenium

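I'm trying to scrape some data from a flight search page.

The page works like this:

You fill in a form and click the search button; that part is fine. After you click the button, you are redirected to a page with the results, and this is where the problem starts. The page keeps adding results for, say, a minute, which is not a big deal in itself.

The problem is getting all of those results. In a real browser you have to scroll down the page before they appear, so I tried scrolling down with Selenium. It probably scrolls to the bottom of the page so fast, or jumps there instead of scrolling, that the page doesn't load any new results.

When you scroll down slowly, it loads more results, but if you do it very quickly, it stops loading.

I'm not sure whether my code helps in understanding this, so I'm attaching it.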

import re

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

import mLib  # the asker's own helper module, not shown in the post

SEARCH_STRING = """URL"""


class spider():

    def __init__(self):
        self.driver = webdriver.Firefox()

    @staticmethod
    def prepare_get(dep_airport, arr_airport, dep_date, arr_date):
        string = SEARCH_STRING % (dep_airport, arr_airport, arr_airport, dep_airport, dep_date, arr_date)
        return string

    def find_flights_html(self, dep_airport, arr_airport, dep_date, arr_date):
        if isinstance(dep_airport, list):
            airports_string = '%20'.join(dep_airport)
            dep_airport = airports_string

        wait = WebDriverWait(self.driver, 60)  # wait for results
        self.driver.get(spider.prepare_get(dep_airport, arr_airport, dep_date, arr_date))
        wait.until(EC.invisibility_of_element_located((By.XPATH, '//img[contains(@src, "loading")]')))
        wait.until(EC.invisibility_of_element_located((By.XPATH, u'//div[. = "Poprosíme o trpezlivosť, hľadáme pre Vás ešte viac letov"]/preceding-sibling::img')))

        # both calls jump straight to the bottom of the page
        self.driver.execute_script("window.scrollTo(0,document.body.scrollHeight);")
        self.driver.find_element(By.XPATH, '//body').send_keys(Keys.CONTROL + Keys.END)
        return self.driver.page_source

    @staticmethod
    def get_info_from_borderbox(div):
        arrival = div.find('div', class_='departure').text
        price = div.find('div', class_='pricebox').find('div', class_=re.compile('price'))
        departure = div.find_all('div', class_='departure')[1].contents
        date_departure = departure[1].text
        airport_departure = departure[5].text
        arrival = div.find_all('div', class_='arrival')[0].contents
        date_arrival = arrival[1].text
        airport_arrival = arrival[3].text[1:]

        print('DEPARTURE: ')
        print(date_departure, airport_departure)
        print('ARRIVAL: ')
        print(date_arrival, airport_arrival)

    @staticmethod
    def get_flights_from_result_page(html):
        def match_tag(tag, classes):
            return (tag.name == 'div'
                    and 'class' in tag.attrs
                    and all([c in tag['class'] for c in classes]))

        soup = mLib.getSoup_html(html)
        divs = soup.find_all(lambda t: match_tag(t, ['borderbox', 'flightbox', 'p2']))
        for div in divs:
            spider.get_info_from_borderbox(div)
        print(len(divs))


spider_inst = spider()
spider.get_flights_from_result_page(spider_inst.find_flights_html(['BTS', 'BRU', 'PAR'], 'MAD', '2015-07-15', '2015-08-15'))

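So the main problem, I think, is that it scrolls too fast to trigger the loading of new results.

Do you have any idea how to make it work?

Answer:

Here is a different approach that worked for me. It involves scrolling the last search result into view and waiting for additional elements to load before scrolling again: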

# -*- coding: utf-8 -*-
from selenium import webdriver
from selenium.common.exceptions import StaleElementReferenceException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait


class wait_for_more_than_n_elements(object):
    """Custom expected condition: more than `count` elements match `locator`."""

    def __init__(self, locator, count):
        self.locator = locator
        self.count = count

    def __call__(self, driver):
        try:
            # public find_elements instead of the private EC._find_elements helper
            count = len(driver.find_elements(*self.locator))
            # strictly greater: with >= the wait would return immediately
            return count > self.count
        except StaleElementReferenceException:
            return False


driver = webdriver.Firefox()

dep_airport = ['BTS', 'BRU', 'PAR']
arr_airport = 'MAD'
dep_date = '2015-07-15'
arr_date = '2015-08-15'

dep_airport = '%20'.join(dep_airport)

url = "https://www.pelikan.sk/sk/flights/list?dfc=C%s&dtc=C%s&rfc=C%s&rtc=C%s&dd=%s&rd=%s&px=1000&ns=0&prc=&rng=1&rbd=0&ct=0" % (dep_airport, arr_airport, arr_airport, dep_airport, dep_date, arr_date)
driver.maximize_window()
driver.get(url)

wait = WebDriverWait(driver, 60)
wait.until(EC.invisibility_of_element_located((By.XPATH, '//img[contains(@src, "loading")]')))
wait.until(EC.invisibility_of_element_located((By.XPATH,
    u'//div[. = "Poprosíme o trpezlivosť, hľadáme pre Vás ešte viac letov"]/preceding-sibling::img')))

while True:  # TODO: make the endless loop end
    results = driver.find_elements(By.CSS_SELECTOR, "div.flightbox")
    print("Results count: %d" % len(results))

    # scroll to the last element
    driver.execute_script("arguments[0].scrollIntoView();", results[-1])

    # wait for more results to load (raises TimeoutException if none arrive within 60 s)
    wait.until(wait_for_more_than_n_elements((By.CSS_SELECTOR, 'div.flightbox'), len(results)))

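Notes:

  • You will need to figure out when to stop the loop, for example at a particular len(results) value
  • wait_for_more_than_n_elements is a custom Expected Condition that helps identify when the next batch has loaded and we can scroll again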
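
One possible way to end the loop, sketched under assumptions rather than taken from the answer: stop once scrolling to the last result produces no new `div.flightbox` elements within a timeout. The function name `scroll_until_no_new_results`, its `timeout` and `poll` parameters, and the `FakeDriver` stand-in are hypothetical. The only driver calls used are `find_elements` and `execute_script`, and `"css selector"` is the string value behind `By.CSS_SELECTOR`.

```python
import time

CSS_SELECTOR = "css selector"  # string value of selenium's By.CSS_SELECTOR


def scroll_until_no_new_results(driver, selector="div.flightbox",
                                timeout=10.0, poll=0.5):
    """Scroll the last result into view until no new results appear.

    Returns the final number of elements matching `selector`.
    """
    results = driver.find_elements(CSS_SELECTOR, selector)
    while results:
        # Same scrolling step as in the answer's loop.
        driver.execute_script("arguments[0].scrollIntoView();", results[-1])

        # Poll until the count grows or the timeout expires.
        deadline = time.monotonic() + timeout
        current = driver.find_elements(CSS_SELECTOR, selector)
        while len(current) <= len(results) and time.monotonic() < deadline:
            time.sleep(poll)
            current = driver.find_elements(CSS_SELECTOR, selector)

        if len(current) <= len(results):
            break  # nothing new arrived in time: assume the list is complete
        results = current
    return len(results)


class FakeDriver(object):
    """Hypothetical stand-in for a WebDriver so the loop runs without a browser.

    Each scroll "loads" one more batch until `batches` is used up.
    """

    def __init__(self, batches, per_batch):
        self.batches = batches
        self.per_batch = per_batch
        self.loaded = per_batch

    def find_elements(self, by, value):
        return list(range(self.loaded))

    def execute_script(self, script, *args):
        if self.batches > 0:
            self.batches -= 1
            self.loaded += self.per_batch


if __name__ == "__main__":
    total = scroll_until_no_new_results(FakeDriver(3, 10), timeout=0.2, poll=0.05)
    print("Results count: %d" % total)
```

With a real browser, the call would presumably be `scroll_until_no_new_results(driver)` after the two `wait.until(...)` calls. The timeout is a trade-off: too short and slow batches are missed, too long and the script idles at the end.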

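The question itself asks about scrolling slowly, which is a different technique from the scroll-into-view loop in the answer. A minimal sketch of it, assuming the page loads more results whenever the viewport gets close to the bottom: scroll a fixed number of pixels at a time with `window.scrollBy`, pause, and stop when the bottom of the viewport reaches `document.body.scrollHeight`. `scroll_slowly`, its parameters, and `FakePage` are hypothetical names, and `step` and `pause` would need tuning for the real site.

```python
import time


def scroll_slowly(driver, step=300, pause=0.25, max_steps=1000):
    """Scroll down `step` pixels at a time; return the number of steps taken."""
    steps = 0
    while steps < max_steps:
        driver.execute_script("window.scrollBy(0, arguments[0]);", step)
        steps += 1
        time.sleep(pause)  # give the page time to append more results
        bottom = driver.execute_script(
            "return window.pageYOffset + window.innerHeight;")
        height = driver.execute_script("return document.body.scrollHeight;")
        if bottom >= height:
            break  # the viewport bottom has reached the current end of the page
    return steps


class FakePage(object):
    """Hypothetical stand-in for a WebDriver: a page that grows near the bottom."""

    def __init__(self, height=1000, viewport=500, grow_by=400, grows=2):
        self.height, self.viewport = height, viewport
        self.grow_by, self.grows = grow_by, grows
        self.y = 0

    def execute_script(self, script, *args):
        if script.startswith("window.scrollBy"):
            self.y = min(self.y + args[0], self.height - self.viewport)
            near_bottom = self.y + self.viewport >= self.height - 100
            if self.grows > 0 and near_bottom:
                self.grows -= 1
                self.height += self.grow_by  # simulate lazily loaded results
            return None
        if "scrollHeight" in script:
            return self.height
        return self.y + self.viewport


if __name__ == "__main__":
    page = FakePage()
    print("Steps taken: %d" % scroll_slowly(page, step=300, pause=0))
```

If the pause is shorter than the time the site needs to append a batch, this loop can stop early, so combining it with a wait on the element count may be more reliable.
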
Source: utcz.com/qa/429591.html
