How do I convert an HTML table to CSV with Python and BeautifulSoup?

Z时代
2024-01-10
Category: Tech sharing

For example, I want to process the first table on this Wikipedia page:
https://zh.wikipedia.org/wiki/文件编辑器比较

Table: General information comparison (常规信息比较)

My code is as follows:

import urllib
import urllib2
import cookielib
import re
import csv
import codecs
from bs4 import BeautifulSoup
wiki = 'https://zh.wikipedia.org/wiki/文件编辑器比较'
header = {'User-Agent': 'Mozilla/5.0'}
req = urllib2.Request(wiki,headers=header)
page = urllib2.urlopen(req)
soup = BeautifulSoup(page)
name = ""       # name
creater = ""    # creator / owner
first = ""      # date of first public release
latest = ""     # latest stable version
cost = ""       # price
licence = ""    # license terms
table = soup.find("table", {"class" : "sortable wikitable"})
f = open('table.csv', 'w')
for row in table.findAll("tr"):
    cells = row.findAll("td")
    if len(cells) == 4:
        name = cells[0].find(text=True)
        creater = cells[1].find(text=True)
        first = cells[2].find(text=True)
        latest = cells[3].find(text=True)
        cost = cells[4].find(text=True)
        licence = cells[5].find(text=True)

(1) I modeled this on https://adesquared.wordpress.com/2013/06/16/using-python-beautifulsoup-to-scrape-a-wikipedia-table/ , so what is if len(cells) == 4 actually for here?

(2) How do I then write the data to a CSV file?

Thanks, and sorry for the trouble.

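Answer:

In that article every tr row has four td cells, which is why its author writes if len(cells) == 4: the check keeps only the rows that have the full set of columns. Your table does not have four columns, so you cannot reuse that condition as it is. (A row with exactly four cells would also make cells[4] and cells[5] raise an IndexError.)
I have modified your code; take a look.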

#coding:utf8
import urllib
import urllib2
import cookielib
import re
import csv
import codecs
from bs4 import BeautifulSoup
wiki = 'https://zh.wikipedia.org/wiki/文件编辑器比较'
header = {'User-Agent': 'Mozilla/5.0'}
req = urllib2.Request(wiki,headers=header)
page = urllib2.urlopen(req)
soup = BeautifulSoup(page)
name = ""       # name
creater = ""    # creator / owner
first = ""      # date of first public release
latest = ""     # latest stable version
cost = ""       # price
licence = ""    # license terms
table = soup.find("table", {"class" : "sortable wikitable"})
f = open('table.csv', 'w')
csv_writer = csv.writer(f)
td_th = re.compile('t[dh]')
for row in table.findAll("tr"):
    cells = row.findAll(td_th)
    if len(cells) == 6:
        name = cells[0].find(text=True)
        if not name:
            continue
        creater = cells[1].find(text=True)
        first = cells[2].find(text=True)
        latest = cells[3].find(text=True)
        cost = cells[4].find(text=True)
        licence = cells[5].find(text=True)
        csv_writer.writerow([ x.encode('utf-8') for x in [name, creater, first, latest, cost, licence]])
f.close()
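
To see what the cell-count check and the t[dh] pattern do, here is a minimal sketch on a tiny made-up table. The markup, including the <th> in the first column of the data row, is an assumption for illustration and is not copied from the Wikipedia page; "html.parser" is used only so that no extra parser has to be installed.

```python
import re
from bs4 import BeautifulSoup

# Made-up miniature table (hypothetical markup, 3 columns instead of 6).
HTML = """
<table class="sortable wikitable">
  <tr><th>Name</th><th>Creator</th><th>First release</th></tr>
  <tr><th>Acme</th><td>Rob Pike</td><td>1993</td></tr>
  <tr><td colspan="3">footnote row</td></tr>
</table>
"""

soup = BeautifulSoup(HTML, "html.parser")
rows = soup.find("table").find_all("tr")

# Matches tags named exactly "td" or "th".
td_or_th = re.compile("^t[dh]$")

# Counting only <td> cells misses every cell that is marked up as <th>.
td_counts = [len(row.find_all("td")) for row in rows]

# Counting <td> and <th> together gives the real number of cells per row.
cell_counts = [len(row.find_all(td_or_th)) for row in rows]

# Keep only rows that have the full number of columns (3 in this toy table).
full_rows = [row for row in rows if len(row.find_all(td_or_th)) == 3]
```

Here td_counts is [0, 2, 1] and cell_counts is [3, 3, 1], so only the first two rows survive the filter. The header row has the full number of cells too, so it passes the check and becomes the first line of the CSV.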

Partial output of the modified script:

Acme,Rob Pike,1993年,隨第4版,免費,LPL
AkelPad,Aleksander Shengalts、Alexey Kuznetsov和其他贡献者,2003年,4.5.4,免費,BSD许可证
Alphatk,原屬Pete Keleher，現歸Alpha-development cabal,1990年,8.0,$ 40，共享軟件,內核不開源，含有
Alphatk,Vince Darley,1999年,8.3.3,$ 40,專有
AptEdit,Brother Technology,2003年,4.8.1,$ 44.95,專有
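
The script above targets Python 2 (urllib2, manual .encode('utf-8')). For Python 3, a rough sketch of the same idea might look like the following. The function names table_rows and write_csv and the sample HTML are hypothetical, and get_text(strip=True) is swapped in for find(text=True), so it collects all the text in a cell rather than only the first text node.

```python
import csv
import io
from bs4 import BeautifulSoup


def table_rows(html, n_cols):
    """Return the text of every row that has exactly n_cols <td>/<th> cells."""
    soup = BeautifulSoup(html, "html.parser")
    # class_="wikitable" also matches class="sortable wikitable".
    table = soup.find("table", class_="wikitable")
    rows = []
    for tr in table.find_all("tr"):
        cells = tr.find_all(["td", "th"])
        if len(cells) != n_cols:
            continue  # skip rows that lack the full set of columns
        rows.append([cell.get_text(strip=True) for cell in cells])
    return rows


def write_csv(rows, fileobj):
    """Write rows as CSV to an already-open text file object."""
    csv.writer(fileobj).writerows(rows)


# Made-up sample input standing in for the downloaded page.
SAMPLE = """
<table class="sortable wikitable">
  <tr><th>Name</th><th>Creator</th><th>Cost</th></tr>
  <tr><th>Acme</th><td>Rob Pike</td><td>Free</td></tr>
  <tr><th>AptEdit</th><td>Brother Technology</td><td>$ 44.95</td></tr>
</table>
"""

rows = table_rows(SAMPLE, n_cols=3)
buffer = io.StringIO()  # in-memory stand-in for a real file
write_csv(rows, buffer)
csv_text = buffer.getvalue()
```

For a real file, open it with open('table.csv', 'w', newline='', encoding='utf-8') and pass that to write_csv; in Python 3 the csv module writes str values directly, so the per-cell .encode('utf-8') is no longer needed. The page itself could be fetched with urllib.request in place of urllib2, and n_cols would be 6 for the table in the question.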

Source: utcz.com/a/163811.html
