Using InitSpider with Splash: only the login page gets parsed?

Z时代
2024-01-10
Category: Q&A

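I am trying to scrape a web page that requires logging in first. After authentication, however, the page I need has to run some Javascript before its content can be viewed. I have installed Splash following the instructions here to try to render the Javascript. However…

Before I switched to Splash, authentication with Scrapy's InitSpider worked fine. I was getting through the login page and scraping the target page correctly (except, obviously, without the Javascript running). But as soon as I add the code that passes the requests through Splash, it looks as though the target page is no longer being parsed.

The spider is below. The only difference between the Splash version (here) and the non-Splash version is the function def start_requests(). Everything else is the same between the two.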

import scrapy
from scrapy.spiders.init import InitSpider
from scrapy.spiders import Rule
from scrapy.linkextractors import LinkExtractor
class BboSpider(InitSpider):
    name = "bbo"
    allowed_domains = ["bridgebase.com"]
    start_urls = [
            "http://www.bridgebase.com/myhands/index.php"
            ]
    login_page = "http://www.bridgebase.com/myhands/myhands_login.php?t=%2Fmyhands%2Findex.php%3F" 
    # authentication
    def init_request(self):
        return scrapy.http.Request(url=self.login_page, callback=self.login)
    def login(self, response):
        return scrapy.http.FormRequest.from_response(
            response,
            formdata={'username': 'USERNAME', 'password': 'PASSWORD'},
            callback=self.check_login_response)
    def check_login_response(self, response):
        if "recent tournaments" in response.text:
            self.log("Login successful")
            return self.initialized()
        else:
            self.log("Login failed")
            print(response.body)
    # pipe the requests through splash so the JS renders 
    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, self.parse, meta={
                'splash': {
                    'endpoint': 'render.html',
                    'args': {'wait': 0.5}
                }
            }) 
    # what to do when a link is encountered
    rules = (
            Rule(LinkExtractor(), callback='parse_item'),
            )
    # do nothing on new link for now
    def parse_item(self, response):
        pass
    def parse(self, response):
        filename = 'test.html' 
        with open(filename, 'wb') as f:
            f.write(response.body)

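What happens now is that test.html, the output of parse(), is just the login page itself rather than the page I should be redirected to after logging in.

The log shows this: normally I would see the "Login successful" line from check_login_response(), but as you can see below, it seems I am not even getting to that step. Is this because Scrapy is now sending the authentication requests through Splash as well, and they are getting stuck there? If so, is there any way to bypass Splash for the authentication part only?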

2019-01-24 14:54:56 [scrapy] INFO: Spider opened
2019-01-24 14:54:56 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-01-24 14:54:56 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2019-01-24 14:55:02 [scrapy] DEBUG: Crawled (200) <POST http://localhost:8050/render.html> (referer: None)
2019-01-24 14:55:02 [scrapy] INFO: Closing spider (finished)

Can anyone point me to some documentation that explains what is going on?

Answer:

I don't think Splash alone would handle this particular case well.
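As background to the log in the question, where no login request shows up at all, here is a minimal sketch using plain-Python stand-in classes (these are not Scrapy's real classes). It illustrates how overriding start_requests() in a subclass can skip a login step that the base class triggers from that same method. As far as I know, Scrapy's InitSpider schedules init_request() from its own start_requests(), so this may be what is happening in the spider above; treat that as an assumption and check it against the Scrapy source for your version. The names InitSpiderLike, WithLogin and OverridesStart are made up for illustration.

```python
class Spider:
    """Stand-in for a basic spider: start_requests() yields the start URLs."""
    start_urls = []

    def start_requests(self):
        for url in self.start_urls:
            yield ("GET", url)


class InitSpiderLike(Spider):
    """Stand-in that issues an init (login) request before the start URLs."""

    def start_requests(self):
        # Hold back the normal start requests until initialization is done.
        self._postinit_reqs = super().start_requests()
        yield from self.init_request()

    def init_request(self):
        yield ("GET", "login-page")
        # In a real spider the login callback would call initialized().
        yield from self.initialized()

    def initialized(self):
        return self._postinit_reqs


class WithLogin(InitSpiderLike):
    start_urls = ["target-page"]


class OverridesStart(InitSpiderLike):
    start_urls = ["target-page"]

    def start_requests(self):
        # Overriding start_requests() replaces the base-class hook,
        # so init_request() is never called.
        for url in self.start_urls:
            yield ("POST", "splash:" + url)


print(list(WithLogin().start_requests()))
print(list(OverridesStart().start_requests()))
```

In the stand-in, only WithLogin emits the login request; OverridesStart goes straight to the rendered target page, which matches the single POST to render.html seen in the log.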

Here is the working idea:

log in to the website using selenium and the PhantomJS headless browser

pass the browser cookies from PhantomJS into Scrapy

The code:

import scrapy
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
class BboSpider(scrapy.Spider):
    name = "bbo"
    allowed_domains = ["bridgebase.com"]
    login_page = "http://www.bridgebase.com/myhands/myhands_login.php?t=%2Fmyhands%2Findex.php%3F"
    def start_requests(self):
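        # Log in with a headless browser so the site's JavaScript runs.
        # Note: webdriver.PhantomJS is only available in Selenium 3.x and earlier.
        driver = webdriver.PhantomJS()
        driver.get(self.login_page)
        driver.find_element(By.ID, "username").send_keys("user")
        driver.find_element(By.ID, "password").send_keys("password")
        driver.find_element(By.NAME, "submit").click()
        driver.save_screenshot("test.png")
        # Wait for a link that only appears once the login has succeeded.
        WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.LINK_TEXT, "Click here for results of recent tournaments")))
        cookies = driver.get_cookies()
        # quit() closes the window and also shuts down the browser process.
        driver.quit()
        # Hand the authenticated session cookies over to Scrapy.
        yield scrapy.Request("http://www.bridgebase.com/myhands/index.php", cookies=cookies)
    def parse(self, response):
        # response.text is a str; response.body is bytes under Python 3.
        if "recent tournaments" in response.text:
            self.log("Login successful")
        else:
            self.log("Login failed")
        print(response.text)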
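
The cookie hand-off in the code above passes the list from driver.get_cookies() straight to scrapy.Request, which accepts a list of cookie dicts. Selenium's dicts also carry browser-side fields such as expiry and httpOnly. If you prefer to pass Scrapy only the fields it needs, a small helper along these lines could be used. This is a sketch, not part of the original answer: the function name selenium_cookies_to_scrapy is hypothetical, and the cookie values in the example are made up.

```python
def selenium_cookies_to_scrapy(selenium_cookies):
    """Keep only the cookie fields that a Scrapy Request needs."""
    wanted = ("name", "value", "domain", "path")
    return [
        {key: cookie[key] for key in wanted if key in cookie}
        for cookie in selenium_cookies
    ]


# Example input in the shape returned by driver.get_cookies();
# the cookie name and values here are invented for illustration.
example = [
    {
        "name": "PHPSESSID",
        "value": "abc123",
        "domain": "www.bridgebase.com",
        "path": "/",
        "secure": False,
        "httpOnly": True,
        "expiry": 1700000000,
    }
]

print(selenium_cookies_to_scrapy(example))
```

In the spider, the result would be used as cookies=selenium_cookies_to_scrapy(driver.get_cookies()) when building the request.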