XPath scraper returns an empty result for a Baidu Baike page. How can I fix this?


import urllib.parse
import urllib.request

from lxml import etree


def query(content):
    # Request URL
    url = 'https://baike.baidu.com/item/' + urllib.parse.quote(content)
    # Request headers
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'
    }
    # Build the request object from the URL and the headers
    req = urllib.request.Request(url=url, headers=headers, method='GET')
    # Send the request and get the response
    response = urllib.request.urlopen(req)
    # Read the response body and decode it to text
    text = response.read().decode('utf-8')
    # Build an _Element object
    html = etree.HTML(text)
    # Match the data with XPath; the result is a list of strings
    sen_list = html.xpath('//div[contains(@class,"lemma-summary") or contains(@class,"lemmaWgt-lemmaSummary")]//text()')
    # Filter the data by stripping surrounding whitespace
    sen_list_after_filter = [item.strip() for item in sen_list]
    # Join the list of strings into one string and return it
    return ''.join(sen_list_after_filter)


if __name__ == '__main__':
    while True:
        content = input('Search term: ')
        result = query(content)
        print("Result: %s" % result)

Any advice would be greatly appreciated.


Answer:

curl https://baike.baidu.com/item/叶挺 -v
* Uses proxy env variable NO_PROXY == '127.0.0.1,localhost'
* Trying 157.255.77.133...
* TCP_NODELAY set
* Connected to baike.baidu.com (157.255.77.133) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
* CAfile: /etc/ssl/cert.pem
  CApath: none
* TLSv1.2 (OUT), TLS handshake, Client hello (1):
* TLSv1.2 (IN), TLS handshake, Server hello (2):
* TLSv1.2 (IN), TLS handshake, Certificate (11):
* TLSv1.2 (IN), TLS handshake, Server key exchange (12):
* TLSv1.2 (IN), TLS handshake, Server finished (14):
* TLSv1.2 (OUT), TLS handshake, Client key exchange (16):
* TLSv1.2 (OUT), TLS change cipher, Change cipher spec (1):
* TLSv1.2 (OUT), TLS handshake, Finished (20):
* TLSv1.2 (IN), TLS change cipher, Change cipher spec (1):
* TLSv1.2 (IN), TLS handshake, Finished (20):
* SSL connection using TLSv1.2 / ECDHE-RSA-AES128-GCM-SHA256
* ALPN, server accepted to use http/1.1
* Server certificate:
* subject: C=CN; ST=beijing; L=beijing; OU=service operation department; O=Beijing Baidu Netcom Science Technology Co., Ltd; CN=baidu.com
* start date: Jul 5 05:16:02 2022 GMT
* expire date: Aug 6 05:16:01 2023 GMT
* subjectAltName: host "baike.baidu.com" matched cert's "*.baidu.com"
* issuer: C=BE; O=GlobalSign nv-sa; CN=GlobalSign RSA OV SSL CA 2018
* SSL certificate verify ok.
> GET /item/%E5%8F%B6%E6%8C%BA HTTP/1.1
> Host: baike.baidu.com
> User-Agent: curl/7.64.1
> Accept: */*
>
< HTTP/1.1 302 Found
< Connection: keep-alive
< Content-Length: 0
< Content-Type: text/html; charset=UTF-8
< Date: Mon, 16 Jan 2023 02:58:23 GMT
< Location: /item/%E5%8F%B6%E6%8C%BA/299649
< P3p: CP=" OTI DSP COR IVA OUR IND COM "
< P3p: CP=" OTI DSP COR IVA OUR IND COM "
< Server: nginx/1.8.0
< Set-Cookie: X_ST_FLOW=0; expires=Mon, 16-Jan-2023 03:08:23 GMT; Max-Age=600; path=/
< Set-Cookie: BAIDUID=5FC2411197265E4B5F181B1BB28C2293:FG=1; expires=Tue, 16-Jan-24 02:58:23 GMT; max-age=31536000; path=/; domain=.baidu.com; version=1
< Set-Cookie: BAIDUID=5FC2411197265E4BE03F24AD35285668:FG=1; expires=Tue, 16-Jan-24 02:58:23 GMT; max-age=31536000; path=/; domain=.baidu.com; version=1
<
* Connection #0 to host baike.baidu.com left intact
* Closing connection 0

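The server responds with 302 Found and an empty body (Content-Length: 0), so the redirect has to be handled. The target is given in the response header Location: /item/%E5%8F%B6%E6%8C%BA/299649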
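To see the 302 from Python rather than from curl, the sketch below disables automatic redirects and follows the Location header by hand. It is a minimal sketch under stated assumptions, not the answerer's code: the names NoRedirect, resolve_location and fetch_following_redirects are hypothetical, and adding a cookie processor is a guess prompted by the Set-Cookie headers in the trace. Note that urllib.request.urlopen already follows 302 responses by default, so if the result is still empty once the final page has been fetched, the cause may lie elsewhere, for example in an XPath expression that does not match the page markup.

```python
import urllib.error
import urllib.parse
import urllib.request


class NoRedirect(urllib.request.HTTPRedirectHandler):
    """Hypothetical handler: returning None makes a 3xx surface as HTTPError."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None


def resolve_location(request_url, location):
    """Turn a (possibly relative) Location header into an absolute URL."""
    return urllib.parse.urljoin(request_url, location)


def fetch_following_redirects(url, headers, max_hops=5):
    """Fetch url, following redirects manually; returns (final_url, text)."""
    # Assumption: keeping cookies between hops may matter, since the 302
    # response in the trace sets several cookies.
    opener = urllib.request.build_opener(
        NoRedirect, urllib.request.HTTPCookieProcessor()
    )
    for _ in range(max_hops):
        req = urllib.request.Request(url, headers=headers, method='GET')
        try:
            with opener.open(req) as response:
                return response.geturl(), response.read().decode('utf-8')
        except urllib.error.HTTPError as err:
            location = err.headers.get('Location')
            if err.code in (301, 302, 303, 307, 308) and location:
                # Log the hop, then retry against the redirect target
                print('redirect %d -> %s' % (err.code, location))
                url = resolve_location(url, location)
                continue
            raise
    raise RuntimeError('too many redirects')
```

Inside query(), the call `final_url, text = fetch_following_redirects(url, headers)` could stand in for the urlopen and read steps; printing final_url and the first few hundred characters of text shows which page the XPath expression is actually being run against.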

Source link: utcz.com/p/938722.html
