Extracting data from an HTML table

Question

I am looking for a way to extract certain information from HTML in a Linux shell environment.

This is the bit I'm interested in:

<table class="details" border="0" cellpadding="5" cellspacing="2" width="95%">
  <tr valign="top">
    <th>Tests</th>
    <th>Failures</th>
    <th>Success Rate</th>
    <th>Average Time</th>
    <th>Min Time</th>
    <th>Max Time</th>
  </tr>
  <tr valign="top" class="Failure">
    <td>103</td>
    <td>24</td>
    <td>76.70%</td>
    <td>71 ms</td>
    <td>0 ms</td>
    <td>829 ms</td>
  </tr>
</table>

I want to store these values in shell variables, or echo them as key-value pairs extracted from the HTML above. Example:

Tests         : 103
Failures      : 24
Success Rate  : 76.70 %
and so on..

At the moment, what I can do is write a Java program that uses a SAX parser or an HTML parser such as jsoup to extract this information.

But using Java here seems like overhead, since the runnable jar would have to be bundled with the "wrapper" script that gets executed.

I'm sure there are "shell" languages out there that can do the same thing, e.g. Perl, Python, or Bash.

My problem is that I have zero experience with these. Can somebody help me resolve this "fairly easy" issue?

Quick update:

I forgot to mention that the .html document contains more tables and more rows. Sorry about that (early morning).

Update #2:

I tried to install BeautifulSoup like this, since I don't have root access:

$ wget http://www.crummy.com/software/BeautifulSoup/bs4/download/4.0/beautifulsoup4-4.1.0.tar.gz
$ tar -zxvf beautifulsoup4-4.1.0.tar.gz
$ cp -r beautifulsoup4-4.1.0/bs4 .
$ vi htmlParse.py   # pasted the code from Tichodromas' answer; just in case, this is what I pasted: http://pastebin.com/4Je11Y9q
$ python htmlParse.py   # run the file

Error:

$ python htmlParse.py
Traceback (most recent call last):
  File "htmlParse.py", line 1, in ?
    from bs4 import BeautifulSoup
  File "/home/gdd/setup/py/bs4/__init__.py", line 29
    from .builder import builder_registry
         ^
SyntaxError: invalid syntax

Update #3:

Running Tichodromas' answer gives this error:

Traceback (most recent call last):
  File "test.py", line 27, in ?
    headings = [th.get_text() for th in table.find("tr").find_all("th")]
TypeError: 'NoneType' object is not callable

Any ideas?

Recommended answer

A Python solution using BeautifulSoup4 (with proper skipping of the heading row, and using class="details" to select the table):

from bs4 import BeautifulSoup

html = """
  <table class="details" border="0" cellpadding="5" cellspacing="2" width="95%">
    <tr valign="top">
      <th>Tests</th>
      <th>Failures</th>
      <th>Success Rate</th>
      <th>Average Time</th>
      <th>Min Time</th>
      <th>Max Time</th>
   </tr>
   <tr valign="top" class="Failure">
     <td>103</td>
     <td>24</td>
     <td>76.70%</td>
     <td>71 ms</td>
     <td>0 ms</td>
     <td>829 ms</td>
  </tr>
</table>"""

soup = BeautifulSoup(html)
table = soup.find("table", attrs={"class":"details"})

# The first tr contains the field names.
headings = [th.get_text() for th in table.find("tr").find_all("th")]

datasets = []
for row in table.find_all("tr")[1:]:
    dataset = zip(headings, (td.get_text() for td in row.find_all("td")))
    datasets.append(dataset)

print datasets

The result looks like this:

[[(u'Tests', u'103'),
  (u'Failures', u'24'),
  (u'Success Rate', u'76.70%'),
  (u'Average Time', u'71 ms'),
  (u'Min Time', u'0 ms'),
  (u'Max Time', u'829 ms')]]
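Since the question mentions that installing packages without root access was a problem and that the document contains several tables, here is a minimal sketch of the same extraction using only the standard library. It assumes a Python 3 interpreter is available; the names `TableExtractor` and `table_records` are hypothetical helpers made up for this illustration, and the parser is not hardened against malformed or nested markup the way BeautifulSoup is.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect the rows of every <table> whose class attribute matches."""

    def __init__(self, css_class="details"):
        super().__init__()
        self.css_class = css_class
        self.tables = []   # one list of rows per matching table
        self._depth = 0    # > 0 while inside a matching table
        self._row = None   # cells of the <tr> currently being read
        self._cell = None  # text fragments of the <th>/<td> being read

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            classes = (dict(attrs).get("class") or "").split()
            if self._depth or self.css_class in classes:
                self._depth += 1
                if self._depth == 1:
                    self.tables.append([])
        elif self._depth == 1 and tag == "tr":
            self._row = []
        elif self._depth == 1 and tag in ("th", "td") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "table" and self._depth:
            self._depth -= 1
        elif self._depth == 1 and tag in ("th", "td") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif self._depth == 1 and tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None


def table_records(html, css_class="details"):
    """Return one {heading: value} dict per data row of each matching table."""
    parser = TableExtractor(css_class)
    parser.feed(html)
    parser.close()
    records = []
    for rows in parser.tables:
        if not rows:
            continue
        # The first row holds the headings; the remaining rows hold the data.
        headings, data_rows = rows[0], rows[1:]
        records.extend(dict(zip(headings, row)) for row in data_rows)
    return records


sample = """<table class="details">
  <tr><th>Tests</th><th>Failures</th><th>Success Rate</th></tr>
  <tr class="Failure"><td>103</td><td>24</td><td>76.70%</td></tr>
</table>
<table class="other"><tr><td>ignored</td></tr></table>"""

for record in table_records(sample):
    for heading, value in record.items():
        print("{0:<16}: {1}".format(heading, value))
```

Tables whose class does not match are skipped, and each data row becomes its own dictionary, so documents with several "details" tables or several rows per table produce one record per row.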

Edit2: To produce the desired output, use something like this:

for dataset in datasets:
    for field in dataset:
        print "{0:<16}: {1}".format(field[0], field[1])

Result:

Tests           : 103
Failures        : 24
Success Rate    : 76.70%
Average Time    : 71 ms
Min Time        : 0 ms
Max Time        : 829 ms
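The question also asks about storing the values in shell variables. One possible approach, sketched here under the assumption of Python 3, is to print shell-quoted assignments that the calling script can evaluate. The function name `shell_assignments` and the `REPORT_` prefix are hypothetical choices for this illustration, and the sample record simply repeats values from the table above.

```python
import re
import shlex


def shell_assignments(record, prefix="REPORT_"):
    """Turn {heading: value} into lines such as REPORT_AVERAGE_TIME='71 ms'."""
    lines = []
    for heading, value in record.items():
        # Shell variable names may only contain letters, digits and underscores,
        # so every other run of characters in the heading becomes one underscore.
        name = prefix + re.sub(r"[^0-9A-Za-z]+", "_", heading).strip("_").upper()
        # shlex.quote adds quoting only where the shell would need it.
        lines.append("{0}={1}".format(name, shlex.quote(value)))
    return lines


record = {
    "Tests": "103",
    "Failures": "24",
    "Success Rate": "76.70%",
    "Average Time": "71 ms",
}
print("\n".join(shell_assignments(record)))
```

If this were saved as a script (say, a hypothetical to_shell.py), the wrapper script could load the values with eval "$(python3 to_shell.py)" and then read them as "$REPORT_TESTS", "$REPORT_SUCCESS_RATE", and so on. Quoting the values matters because eval executes whatever text it is given, so only evaluate output produced from HTML you trust.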
