BeautifulSoup findAll does not find everything

Z时代
2024-01-10
Category: Q&A

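I am trying to parse a website and pull some information out of it with BeautifulSoup.findAll, but it does not find all of the matching elements. I am using Python 3.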

Here is the code:

#!/usr/bin/python3
from bs4 import BeautifulSoup
from urllib.request import urlopen
page = urlopen("http://mangafox.me/directory/")
# print(page.read())
soup = BeautifulSoup(page.read())
manga_img = soup.findAll('a', {'class': 'manga_img'}, limit=None)
for manga in manga_img:
    print(manga['href'])

It only prints half of them…

Answer:

Different HTML parsers handle broken HTML differently. That page serves broken HTML, and the lxml parser does not cope with it very well:

>>> import requests
>>> from bs4 import BeautifulSoup
>>> r = requests.get('http://mangafox.me/directory/')
>>> soup = BeautifulSoup(r.content, 'lxml')
>>> len(soup.find_all('a', class_='manga_img'))
18

The standard-library html.parser has less trouble with this specific page:

>>> soup = BeautifulSoup(r.content, 'html.parser')
>>> len(soup.find_all('a', class_='manga_img'))
44
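
To see this kind of difference without depending on the live site, the same comparison can be run on a small inline document. The following is only a sketch: the markup in BROKEN_HTML is made up for illustration (it is not the real page), and count_links and compare_parsers are hypothetical helper names. Parsers that are not installed are skipped, and the counts for the real page will differ from whatever this toy input gives.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Hypothetical, deliberately malformed markup standing in for the real page:
# the <li> elements are never closed and the second <a> has no closing tag.
BROKEN_HTML = b"""
<ul>
  <li><a class="manga_img" href="/manga/one/">One</a>
  <li><a class="manga_img" href="/manga/two/">Two
  <li><a class="manga_img" href="/manga/three/">Three</a>
</ul>
"""


def count_links(markup, parser):
    """Return how many <a class="manga_img"> tags the given parser produces."""
    soup = BeautifulSoup(markup, parser)
    return len(soup.find_all('a', class_='manga_img'))


def compare_parsers(markup, parsers=('html.parser', 'lxml', 'html5lib')):
    """Map each installed parser name to its link count, skipping missing ones."""
    counts = {}
    for parser in parsers:
        try:
            counts[parser] = count_links(markup, parser)
        except FeatureNotFound:
            # lxml and html5lib are optional third-party packages.
            continue
    return counts


counts = compare_parsers(BROKEN_HTML)
for parser, count in counts.items():
    print(parser, count)
```

If the parsers disagree on a page, that usually points at malformed markup rather than at a problem with find_all itself.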

Translating that to your specific code sample, which uses urllib, you would specify the parser like this:

soup = BeautifulSoup(page, 'html.parser')  # BeautifulSoup can do the reading
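
Put together, the question's script might look roughly like the sketch below. The function names manga_links and fetch_manga_links are hypothetical, and the sample markup is invented so the extraction step can be tried without network access. Splitting the download from the parsing is just one possible layout.

```python
from urllib.request import urlopen

from bs4 import BeautifulSoup


def manga_links(markup):
    """Return the href of every <a class="manga_img"> found in the markup.

    `markup` may be a string, bytes, or an open file-like object.
    """
    # Naming the parser explicitly keeps the result from depending on
    # which optional parsers happen to be installed.
    soup = BeautifulSoup(markup, 'html.parser')
    return [a['href'] for a in soup.find_all('a', class_='manga_img', href=True)]


def fetch_manga_links(url):
    """Download the page at `url` and extract the links (needs network access)."""
    with urlopen(url) as page:
        # BeautifulSoup reads from the response object itself.
        return manga_links(page)


# Invented sample markup; the second matching <a> is left unclosed on purpose.
sample = (b'<div><a class="manga_img" href="/manga/a/">A</a>'
          b'<a class="other" href="/x/">X</a>'
          b'<a class="manga_img" href="/manga/b/">B</div>')
for href in manga_links(sample):
    print(href)
```

Calling fetch_manga_links('http://mangafox.me/directory/') would then correspond to the original loop, assuming the site is still reachable and still uses the manga_img class.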

Source link: utcz.com/qa/431573.html
