Scraping text from h3 and div tags with BeautifulSoup and Python

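I have no experience with Python, BeautifulSoup, Selenium, etc., but I would like to scrape data from a website and store it as a CSV file. A single sample of the data I need is marked up as follows (one row of data).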

<div class="box effect">

<div class="row">

<div class="col-lg-10">

<h3>HEADING</h3>

<div><i class="fa user"></i>&nbsp;&nbsp;NAME</div>

<div><i class="fa phone"></i>&nbsp;&nbsp;MOBILE</div>

<div><i class="fa mobile-phone fa-2"></i>&nbsp;&nbsp;&nbsp;NUMBER</div>

<div><i class="fa address"></i>&nbsp;&nbsp;&nbsp;XYZ_ADDRESS</div>

<div class="space">&nbsp;</div>

<div style="padding:10px;padding-left:0px;"><a class="btn btn-primary btn-sm" href="www.link_to_another_page.com"><i class="fa search-plus"></i> &nbsp;more info</a></div>

</div>

<div class="col-lg-2">

</div>

</div>

</div>

The output I need is HEADING,NAME,MOBILE,NUMBER,XYZ_ADDRESS

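I noticed that these pieces of data have no ID or class, yet they appear on the site as plain text. I tried BeautifulSoup and Python Selenium separately, and with both I got stuck on the extraction step, because I could not find any tutorial that shows how to pull the text out of these <h3> and <div> tags.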

My code using BeautifulSoup

import urllib2

from bs4 import BeautifulSoup

import requests

import csv

MAX = 2

'''with open("lg.csv", "a") as f:

w=csv.writer(f)'''

##for i in range(1,MAX+1)

url="http://www.example_site.com"

page=requests.get(url)

soup = BeautifulSoup(page.content,"html.parser")

for h in soup.find_all('h3'):

    print(h.get('h3'))

My Selenium code

import csv

from selenium import webdriver

MAX_PAGE_NUM = 2

driver = webdriver.Firefox()

for i in range(1, MAX_PAGE_NUM+1):

    url = "http://www.example_site.com"

    driver.get(url)

    name = driver.find_elements_by_xpath('//div[@class = "col-lg-10"]/h3')

    #contact = driver.find_elements_by_xpath('//span[@class="item-price"]')

    # phone =

    # mobile =

    # address =

    # print(len(buyers))

    # num_page_items = len(buyers)

    # with open('res.csv','a') as f:

    # for i in range(num_page_items):

    # f.write(buyers[i].text + "," + prices[i].text + "\n")

    print (name)

driver.close()

Answer:

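You can use a CSS selector to find the data you need. In your case, div > h3 ~ div finds every div element that is a direct child of a div element and is preceded by a sibling h3 element.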

import bs4

page= """

<div class="box effect">

<div class="row">

<div class="col-lg-10">

<h3>HEADING</h3>

<div><i class="fa user"></i>&nbsp;&nbsp;NAME</div>

<div><i class="fa phone"></i>&nbsp;&nbsp;MOBILE</div>

<div><i class="fa mobile-phone fa-2"></i>&nbsp;&nbsp;&nbsp;NUMBER</div>

<div><i class="fa address"></i>&nbsp;&nbsp;&nbsp;XYZ_ADDRESS</div>

</div>

</div>

</div>

"""

soup = bs4.BeautifulSoup(page, 'lxml')

# find all div elements that are direct children of a div element

# and are preceded by a sibling h3 element

selector = 'div > h3 ~ div'

# find elements that contain the data we want

found = soup.select(selector)

# Extract data from the found elements

# the parser decodes &nbsp; to U+00A0, which strip=True trims as whitespace

data = [x.get_text(strip=True) for x in found]

for x in data:

    print(x)

Edit: scraping the text in the heading.

heading = soup.find('h3') 

heading_data = heading.text

print(heading_data)
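
The heading and the field values can then be joined into the single comma-separated row the question asks for. This is only a minimal sketch, assuming the same markup as in the sample above: the `extract_row` helper name is hypothetical, and `html.parser` is used instead of `lxml` only so that no extra parser needs to be installed.

```python
import csv
import io

import bs4

page = """
<div class="box effect">
<div class="row">
<div class="col-lg-10">
<h3>HEADING</h3>
<div><i class="fa user"></i>&nbsp;&nbsp;NAME</div>
<div><i class="fa phone"></i>&nbsp;&nbsp;MOBILE</div>
<div><i class="fa mobile-phone fa-2"></i>&nbsp;&nbsp;&nbsp;NUMBER</div>
<div><i class="fa address"></i>&nbsp;&nbsp;&nbsp;XYZ_ADDRESS</div>
</div>
</div>
</div>
"""


def extract_row(html):
    """Return [heading, field, field, ...] for one record (hypothetical helper)."""
    soup = bs4.BeautifulSoup(html, "html.parser")
    # Text of the first h3 in the document
    heading = soup.find("h3").get_text(strip=True)
    # Text of every div that follows an h3 sibling inside a parent div
    fields = [d.get_text(strip=True) for d in soup.select("div > h3 ~ div")]
    return [heading] + fields


row = extract_row(page)

# Let the csv module handle quoting in case a value contains a comma
buffer = io.StringIO()
csv.writer(buffer).writerow(row)
csv_line = buffer.getvalue().strip()
print(csv_line)
```

Going through `csv.writer` rather than joining with `","` by hand keeps the row valid if an address happens to contain a comma.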

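Edit: alternatively, you can get the heading and the other div elements in one go with a selector such as div.col-lg-10 > *. This finds all elements that are direct children of a div element of class col-lg-10.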

soup = bs4.BeautifulSoup(page, 'lxml')

# find all elements inside a div element of class col-lg-10

selector = 'div.col-lg-10 > *'

# find elements that contain the data we want

found = soup.select(selector)

# Extract data from the found elements

# the parser decodes &nbsp; to U+00A0, which strip=True trims as whitespace

data = [x.get_text(strip=True) for x in found]

for x in data:

    print(x)
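
On the full markup from the question, these selectors also match the spacer div and the "more info" button div, and a real page will probably contain many such boxes. The following sketch shows one way to produce one CSV row per box. It rests on several assumptions: the `extract_records` helper is hypothetical, the second sample record is made up for illustration, the rule "keep divs that contain an icon but no link" is a guess about the page layout, and the `lg.csv` file name is borrowed from the commented-out code in the question.

```python
import csv
import os
import tempfile

import bs4

# Two records in the layout from the question; the second one is invented.
page = """
<div class="box effect"><div class="row"><div class="col-lg-10">
<h3>HEADING</h3>
<div><i class="fa user"></i>&nbsp;&nbsp;NAME</div>
<div><i class="fa phone"></i>&nbsp;&nbsp;MOBILE</div>
<div><i class="fa mobile-phone fa-2"></i>&nbsp;&nbsp;&nbsp;NUMBER</div>
<div><i class="fa address"></i>&nbsp;&nbsp;&nbsp;XYZ_ADDRESS</div>
<div class="space">&nbsp;</div>
<div style="padding:10px;padding-left:0px;"><a class="btn btn-primary btn-sm" href="www.link_to_another_page.com"><i class="fa search-plus"></i> &nbsp;more info</a></div>
</div><div class="col-lg-2"></div></div></div>
<div class="box effect"><div class="row"><div class="col-lg-10">
<h3>HEADING_2</h3>
<div><i class="fa user"></i>&nbsp;&nbsp;NAME_2</div>
<div><i class="fa phone"></i>&nbsp;&nbsp;MOBILE_2</div>
<div><i class="fa mobile-phone fa-2"></i>&nbsp;&nbsp;&nbsp;NUMBER_2</div>
<div><i class="fa address"></i>&nbsp;&nbsp;&nbsp;ADDRESS_2</div>
<div class="space">&nbsp;</div>
</div><div class="col-lg-2"></div></div></div>
"""


def extract_records(html):
    """Return one [heading, field, ...] list per col-lg-10 block (hypothetical helper)."""
    soup = bs4.BeautifulSoup(html, "html.parser")
    rows = []
    for record in soup.select("div.col-lg-10"):
        heading = record.find("h3")
        if heading is None:
            continue  # not a data record
        fields = [
            div.get_text(strip=True)
            for div in record.select("h3 ~ div")
            # Assumed rule: data divs hold an <i> icon and no <a> link,
            # which skips the spacer div and the "more info" button.
            if div.find("i") is not None and div.find("a") is None
        ]
        rows.append([heading.get_text(strip=True)] + fields)
    return rows


rows = extract_records(page)

# Write to the temp directory so the sketch does not touch the working directory
csv_path = os.path.join(tempfile.gettempdir(), "lg.csv")
with open(csv_path, "w", newline="", encoding="utf-8") as f:
    csv.writer(f).writerows(rows)

for row in rows:
    print(",".join(row))
```

For a live site, the `page` string would be replaced by the response body, for example `requests.get(url).content` as in the question's own code.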

