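XPath scraper returns an empty result for Baidu Baike pages. How can I fix this?
My XPath-based scraper for Baidu Baike pages returns an empty result. Here is the code: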
import urllib.request
import urllib.parse
from lxml import etree


def query(content):
    # Request URL
    url = 'https://baike.baidu.com/item/' + urllib.parse.quote(content)
    # Request headers
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'
    }
    # Build the request object from the URL and headers
    req = urllib.request.Request(url=url, headers=headers, method='GET')
    # Send the request and get the response
    response = urllib.request.urlopen(req)
    # Read the response body and decode it to text
    text = response.read().decode('utf-8')
    # Parse the text into an _Element object
    html = etree.HTML(text)
    # Match the summary text with XPath; this returns a list of strings
    sen_list = html.xpath('//div[contains(@class,"lemma-summary") or contains(@class,"lemmaWgt-lemmaSummary")]//text()')
    # Strip newline characters from each string
    sen_list_after_filter = [item.strip('\n') for item in sen_list]
    # Join the list into a single string and return it
    return ''.join(sen_list_after_filter)


if __name__ == '__main__':
    while True:
        content = input('Search term: ')
        result = query(content)
        print("Result: %s" % result)
Any advice would be greatly appreciated.
Answer:
curl https://baike.baidu.com/item/叶挺 -v
* Uses proxy env variable NO_PROXY == '127.0.0.1,localhost'
* Trying 157.255.77.133...
* TCP_NODELAY set
* Connected to baike.baidu.com (157.255.77.133) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
* CAfile: /etc/ssl/cert.pem
CApath: none
* TLSv1.2 (OUT), TLS handshake, Client hello (1):
* TLSv1.2 (IN), TLS handshake, Server hello (2):
* TLSv1.2 (IN), TLS handshake, Certificate (11):
* TLSv1.2 (IN), TLS handshake, Server key exchange (12):
* TLSv1.2 (IN), TLS handshake, Server finished (14):
* TLSv1.2 (OUT), TLS handshake, Client key exchange (16):
* TLSv1.2 (OUT), TLS change cipher, Change cipher spec (1):
* TLSv1.2 (OUT), TLS handshake, Finished (20):
* TLSv1.2 (IN), TLS change cipher, Change cipher spec (1):
* TLSv1.2 (IN), TLS handshake, Finished (20):
* SSL connection using TLSv1.2 / ECDHE-RSA-AES128-GCM-SHA256
* ALPN, server accepted to use http/1.1
* Server certificate:
* subject: C=CN; ST=beijing; L=beijing; OU=service operation department; O=Beijing Baidu Netcom Science Technology Co., Ltd; CN=baidu.com
* start date: Jul 5 05:16:02 2022 GMT
* expire date: Aug 6 05:16:01 2023 GMT
* subjectAltName: host "baike.baidu.com" matched cert's "*.baidu.com"
* issuer: C=BE; O=GlobalSign nv-sa; CN=GlobalSign RSA OV SSL CA 2018
* SSL certificate verify ok.
> GET /item/%E5%8F%B6%E6%8C%BA HTTP/1.1
> Host: baike.baidu.com
> User-Agent: curl/7.64.1
> Accept: */*
>
< HTTP/1.1 302 Found
< Connection: keep-alive
< Content-Length: 0
< Content-Type: text/html; charset=UTF-8
< Date: Mon, 16 Jan 2023 02:58:23 GMT
< Location: /item/%E5%8F%B6%E6%8C%BA/299649
< P3p: CP=" OTI DSP COR IVA OUR IND COM "
< P3p: CP=" OTI DSP COR IVA OUR IND COM "
< Server: nginx/1.8.0
< Set-Cookie: X_ST_FLOW=0; expires=Mon, 16-Jan-2023 03:08:23 GMT; Max-Age=600; path=/
< Set-Cookie: BAIDUID=5FC2411197265E4B5F181B1BB28C2293:FG=1; expires=Tue, 16-Jan-24 02:58:23 GMT; max-age=31536000; path=/; domain=.baidu.com; version=1
< Set-Cookie: BAIDUID=5FC2411197265E4BE03F24AD35285668:FG=1; expires=Tue, 16-Jan-24 02:58:23 GMT; max-age=31536000; path=/; domain=.baidu.com; version=1
<
* Connection #0 to host baike.baidu.com left intact
* Closing connection 0
The server responds with 302 Found.
The redirect has to be handled by following the Location header: /item/%E5%8F%B6%E6%8C%BA/299649
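One way to make the redirect visible from Python is sketched below. It is only an illustration, not a confirmed fix: `NoRedirect`, `resolve_location`, and `fetch` are hypothetical helper names, and the sketch assumes the page is served as UTF-8. Note that curl does not follow redirects unless `-L` is passed, whereas `urllib.request.urlopen` follows a 302 on its own by default, so this is mainly useful for checking where the request actually ends up (calling `response.geturl()` in the original code is a shorter way to check the same thing).

```python
import urllib.error
import urllib.parse
import urllib.request


class NoRedirect(urllib.request.HTTPRedirectHandler):
    """Stop urllib from following redirects so each hop can be inspected."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        # Returning None tells urllib not to follow; the 3xx response
        # then surfaces as an HTTPError that the caller can examine.
        return None


def resolve_location(request_url, location):
    """Turn a possibly relative Location header into an absolute URL."""
    return urllib.parse.urljoin(request_url, location)


def fetch(url, headers=None, max_redirects=5):
    """Fetch url, following up to max_redirects redirects by hand.

    Returns a (final_url, body_text) tuple.
    """
    opener = urllib.request.build_opener(NoRedirect)
    for _ in range(max_redirects + 1):
        req = urllib.request.Request(url, headers=headers or {})
        try:
            with opener.open(req) as resp:
                # Assumption: the body is UTF-8, as in the question's code.
                return url, resp.read().decode('utf-8', errors='replace')
        except urllib.error.HTTPError as err:
            location = err.headers.get('Location')
            if err.code in (301, 302, 303, 307, 308) and location:
                # Follow the redirect explicitly and try again.
                url = resolve_location(url, location)
                continue
            raise
    raise RuntimeError('too many redirects')
```

In the question's `query` function, `fetch(url, headers)` could stand in for the `Request` / `urlopen` / `read` steps. If the returned final URL is already the `/item/.../299649` page and the XPath still matches nothing, the redirect is being followed and the cause lies elsewhere, for example in the markup the server actually returned.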