Extracting data from an HTML table

Z时代
2024-01-10
Category: Q&A

I'm looking for a way to pull certain information out of an HTML document in a Linux shell environment.

This is the bit I'm interested in:

<table class="details" border="0" cellpadding="5" cellspacing="2" width="95%">
  <tr valign="top">
    <th>Tests</th>
    <th>Failures</th>
    <th>Success Rate</th>
    <th>Average Time</th>
    <th>Min Time</th>
    <th>Max Time</th>
  </tr>
  <tr valign="top" class="Failure">
    <td>103</td>
    <td>24</td>
    <td>76.70%</td>
    <td>71 ms</td>
    <td>0 ms</td>
    <td>829 ms</td>
  </tr>
</table>

I'd like to store these values in shell variables, or echo them as key-value pairs extracted from the HTML above. Example:

Tests : 103
Failures : 24
Success Rate : 76.70 %
and so on..

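Right now, what I can do is write a Java program that uses a SAX parser or an HTML parser such as jsoup to extract this information.

But using Java here seems cumbersome, since the runnable jar would have to be bundled with the "wrapper" script that calls it.

I'm sure there must be "shell" languages that can do the same thing, e.g. perl, python, bash, etc.

My problem is that I have zero experience with any of these. Can somebody help me solve this "fairly easy" issue?

I forgot to mention that the .html document contains more tables and more rows; sorry about that (early morning).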

I tried installing BeautifulSoup like this, since I don't have root access:

$ wget http://www.crummy.com/software/BeautifulSoup/bs4/download/4.0/beautifulsoup4-4.1.0.tar.gz
$ tar -zxvf beautifulsoup4-4.1.0.tar.gz
$ cp -r beautifulsoup4-4.1.0/bs4 .
$ vi htmlParse.py # paste code from Tichodromas' answer; just in case, this (http://pastebin.com/4Je11Y9q) is what I pasted
$ run file (python htmlParse.py)

Error:

$ python htmlParse.py
Traceback (most recent call last):
  File "htmlParse.py", line 1, in ?
    from bs4 import BeautifulSoup
  File "/home/gdd/setup/py/bs4/__init__.py", line 29
    from .builder import builder_registry
         ^
SyntaxError: invalid syntax

Running Tichodromas' answer gives the following error:

Traceback (most recent call last):
  File "test.py", line 27, in ?
    headings = [th.get_text() for th in table.find("tr").find_all("th")]
TypeError: 'NoneType' object is not callable

Any ideas?

Answer:

A Python solution using BeautifulSoup4 (with proper skipping of the header row, and using class="details" to select the table):

from bs4 import BeautifulSoup
html = """
  <table class="details" border="0" cellpadding="5" cellspacing="2" width="95%">
    <tr valign="top">
      <th>Tests</th>
      <th>Failures</th>
      <th>Success Rate</th>
      <th>Average Time</th>
      <th>Min Time</th>
      <th>Max Time</th>
   </tr>
   <tr valign="top" class="Failure">
     <td>103</td>
     <td>24</td>
     <td>76.70%</td>
     <td>71 ms</td>
     <td>0 ms</td>
     <td>829 ms</td>
  </tr>
</table>"""
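# Name the parser explicitly so the result does not depend on which parsers are installed.
soup = BeautifulSoup(html, "html.parser")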
table = soup.find("table", attrs={"class":"details"})
# The first tr contains the field names.
headings = [th.get_text() for th in table.find("tr").find_all("th")]
datasets = []
for row in table.find_all("tr")[1:]:
    # list() keeps the pairs reusable on Python 3, where zip() returns a one-shot iterator.
    dataset = list(zip(headings, (td.get_text() for td in row.find_all("td"))))
    datasets.append(dataset)
print(datasets)

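The result looks like this (as printed under Python 2; Python 3 shows the same strings without the u prefix):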

[[(u'Tests', u'103'),
  (u'Failures', u'24'),
  (u'Success Rate', u'76.70%'),
  (u'Average Time', u'71 ms'),
  (u'Min Time', u'0 ms'),
  (u'Max Time', u'829 ms')]]
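
The question also asks about storing the values in shell variables. Since each dataset is just a list of (heading, value) pairs, one option is to have Python print NAME=value lines that the calling shell can eval. The following is a minimal sketch, assuming Python 3.3+ (for shlex.quote). The helper names shell_name and shell_assignments and the upper-case naming scheme are illustrative choices, not part of the original answer, and the dataset literal is copied from the result above.

```python
import re
import shlex

# One dataset, copied from the result shown above.
dataset = [
    ("Tests", "103"),
    ("Failures", "24"),
    ("Success Rate", "76.70%"),
    ("Average Time", "71 ms"),
    ("Min Time", "0 ms"),
    ("Max Time", "829 ms"),
]


def shell_name(heading):
    """Turn a heading such as 'Success Rate' into a name like SUCCESS_RATE (hypothetical scheme)."""
    name = re.sub(r"[^0-9A-Za-z]+", "_", heading).strip("_").upper()
    # Shell variable names cannot start with a digit.
    return "_" + name if name[:1].isdigit() else name


def shell_assignments(pairs):
    """Return one NAME=value line per pair, with the value quoted for the shell."""
    return ["{0}={1}".format(shell_name(k), shlex.quote(v)) for k, v in pairs]


if __name__ == "__main__":
    for line in shell_assignments(dataset):
        print(line)
```

Assuming the script is saved under a hypothetical name such as extract.py, a wrapper script could run eval "$(python3 extract.py)" and then read "$TESTS" or "$SUCCESS_RATE". Quoting each value with shlex.quote keeps values such as 71 ms intact and guards against cell text that contains shell metacharacters.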

To produce the desired output, use something like this:

for dataset in datasets:
    for field in dataset:
        print("{0:<16}: {1}".format(field[0], field[1]))

Result:

Tests           : 103
Failures        : 24
Success Rate    : 76.70%
Average Time    : 71 ms
Min Time        : 0 ms
Max Time        : 829 ms
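
Because the question mentions that installing bs4 failed without root access and that the real document contains more tables and rows, a standard-library fallback may be worth considering. The sketch below swaps BeautifulSoup for html.parser.HTMLParser (Python 3) and collects every data row from every table whose class includes "details". It assumes the tables are not nested and that all tr, th and td end tags are written out. The class name DetailsTableParser, the function extract_details and the SAMPLE markup (including its second data row and the "summary" table) are made up for illustration.

```python
from html.parser import HTMLParser


class DetailsTableParser(HTMLParser):
    """Collect (heading, value) pairs from every <table class="details">."""

    def __init__(self):
        super().__init__()
        self.datasets = []      # one list of pairs per data row
        self._in_table = False  # currently inside a matching table?
        self._headings = []     # <th> texts of the current table
        self._row = None        # <td> texts of the current <tr>
        self._cell = None       # text fragments of the current cell

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            classes = (dict(attrs).get("class") or "").split()
            self._in_table = "details" in classes
            self._headings = []
        elif not self._in_table:
            return
        elif tag == "tr":
            self._row = []
        elif tag in ("th", "td") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if not self._in_table:
            return
        if tag in ("th", "td") and self._cell is not None:
            text = "".join(self._cell).strip()
            if tag == "th":
                self._headings.append(text)
            elif self._row is not None:
                self._row.append(text)
            self._cell = None
        elif tag == "tr":
            # The header row has no <td> cells, so it is skipped here.
            if self._row:
                self.datasets.append(list(zip(self._headings, self._row)))
            self._row = None
        elif tag == "table":
            self._in_table = False


def extract_details(html):
    """Parse html and return a list of datasets, one per data row."""
    parser = DetailsTableParser()
    parser.feed(html)
    parser.close()
    return parser.datasets


# Illustrative input: the "summary" table and the second data row are invented.
SAMPLE = """
<table class="summary"><tr><th>Name</th></tr><tr><td>ignored</td></tr></table>
<table class="details" border="0">
  <tr valign="top"><th>Tests</th><th>Failures</th><th>Success Rate</th></tr>
  <tr valign="top" class="Failure"><td>103</td><td>24</td><td>76.70%</td></tr>
  <tr valign="top"><td>5</td><td>0</td><td>100.00%</td></tr>
</table>
"""

if __name__ == "__main__":
    for dataset in extract_details(SAMPLE):
        for heading, value in dataset:
            print("{0:<16}: {1}".format(heading, value))
        print()
```

The returned structure matches the datasets list built in the BeautifulSoup version, so the same printing loop applies to either. For markup that omits end tags or nests tables, a real HTML library is likely the safer choice.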

Source link: utcz.com/qa/428636.html
