Extracting data from HTML table

Question

I am looking for a way to extract certain information from an HTML file in a Linux shell environment.

This is the part I'm interested in:

<table class="details" border="0" cellpadding="5" cellspacing="2" width="95%">
  <tr valign="top">
    <th>Tests</th>
    <th>Failures</th>
    <th>Success Rate</th>
    <th>Average Time</th>
    <th>Min Time</th>
    <th>Max Time</th>
  </tr>
  <tr valign="top" class="Failure">
    <td>103</td>
    <td>24</td>
    <td>76.70%</td>
    <td>71 ms</td>
    <td>0 ms</td>
    <td>829 ms</td>
  </tr>
</table>

I want to store the values in shell variables, or echo them as key-value pairs extracted from the HTML above. Example:

Tests         : 103
Failures      : 24
Success Rate  : 76.70 %
and so on..

What I can do at the moment is write a Java program that uses a SAX parser or an HTML parser such as jsoup to extract this info.

But using Java here seems like overhead, since the runnable JAR would have to be bundled with the "wrapper" script that runs it.

I'm sure there are "shell" languages that can do the same, e.g. Perl, Python or Bash.

My problem is that I have zero experience with these. Can somebody help me resolve this "fairly easy" issue?

Quick update:

I forgot to mention that the .html document contains more tables and more rows. Sorry about that (early morning).

Update #2:

I tried to install BeautifulSoup like this, since I don't have root access:

$ wget http://www.crummy.com/software/BeautifulSoup/bs4/download/4.0/beautifulsoup4-4.1.0.tar.gz
$ tar -zxvf beautifulsoup4-4.1.0.tar.gz
$ cp -r beautifulsoup4-4.1.0/bs4 .
$ vi htmlParse.py   # pasted the code from Tichodromas' answer; exactly what I pasted: http://pastebin.com/4Je11Y9q
$ python htmlParse.py   # run the file

Error:

$ python htmlParse.py
Traceback (most recent call last):
  File "htmlParse.py", line 1, in ?
    from bs4 import BeautifulSoup
  File "/home/gdd/setup/py/bs4/__init__.py", line 29
    from .builder import builder_registry
         ^
SyntaxError: invalid syntax

Update #3:

Running Tichodromas' answer gives this error:

Traceback (most recent call last):
  File "test.py", line 27, in ?
    headings = [th.get_text() for th in table.find("tr").find_all("th")]
TypeError: 'NoneType' object is not callable

Any ideas?

Recommended answer

A Python solution using BeautifulSoup4 (it skips the header row properly and uses class="details" to select the table):

from bs4 import BeautifulSoup

html = """
  <table class="details" border="0" cellpadding="5" cellspacing="2" width="95%">
    <tr valign="top">
      <th>Tests</th>
      <th>Failures</th>
      <th>Success Rate</th>
      <th>Average Time</th>
      <th>Min Time</th>
      <th>Max Time</th>
   </tr>
   <tr valign="top" class="Failure">
     <td>103</td>
     <td>24</td>
     <td>76.70%</td>
     <td>71 ms</td>
     <td>0 ms</td>
     <td>829 ms</td>
  </tr>
</table>"""

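# Name the parser explicitly so the result does not depend on which
# optional parsers (lxml, html5lib) happen to be installed.
soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", attrs={"class": "details"})

# The first tr contains the field names.
headings = [th.get_text() for th in table.find("tr").find_all("th")]

# Every later tr is one data row; pair each cell with its heading.
datasets = []
for row in table.find_all("tr")[1:]:
    # list() keeps the pairs reusable on Python 3, where zip() is lazy.
    dataset = list(zip(headings, (td.get_text() for td in row.find_all("td"))))
    datasets.append(dataset)

print(datasets)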

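The result looks like this (shown as Python 2 prints it, reformatted for readability; Python 3 omits the u prefixes):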

[[(u'Tests', u'103'),
  (u'Failures', u'24'),
  (u'Success Rate', u'76.70%'),
  (u'Average Time', u'71 ms'),
  (u'Min Time', u'0 ms'),
  (u'Max Time', u'829 ms')]]
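Since the question also asks about storing the values in shell variables, here is a minimal sketch of one way to turn such pairs into shell assignments. It assumes Python 3.3 or later (for `shlex.quote`), uses hard-coded sample pairs instead of parsed HTML, and the function name `to_shell_assignments` and its naming rule (upper-case, non-alphanumerics replaced by `_`) are hypothetical choices, not part of the original answer.

```python
import re
import shlex

# Sample data in the same shape as one entry of `datasets` above.
dataset = [
    ("Tests", "103"),
    ("Failures", "24"),
    ("Success Rate", "76.70%"),
    ("Average Time", "71 ms"),
]


def to_shell_assignments(pairs):
    """Turn (heading, value) pairs into shell-safe NAME=value lines."""
    lines = []
    for heading, value in pairs:
        # Hypothetical naming rule: upper-case, non-alphanumerics become "_".
        name = re.sub(r"[^A-Za-z0-9]+", "_", heading).strip("_").upper()
        # shlex.quote adds quotes only when the value needs them.
        lines.append("{0}={1}".format(name, shlex.quote(value)))
    return lines


assignments = to_shell_assignments(dataset)
print("\n".join(assignments))
```

If a script printed these lines, a wrapper could load them with something like `eval "$(python htmlParse.py)"` and then read `$TESTS` or `$SUCCESS_RATE`. A heading that starts with a digit would not give a valid shell variable name, so real input may need an extra prefix.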

To produce the desired output, use something like this:

for dataset in datasets:
    for heading, value in dataset:
        # Left-align the heading in a 16-character column.
        print("{0:<16}: {1}".format(heading, value))

Result:

Tests           : 103
Failures        : 24
Success Rate    : 76.70%
Average Time    : 71 ms
Min Time        : 0 ms
Max Time        : 829 ms
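The question mentions that the real document has more tables and more rows, and that installing extra packages is awkward without root access. As a rough alternative under those constraints, the sketch below uses only the standard library's `html.parser` and assumes a Python 3 interpreter is available. The class `DetailsTableParser`, the function `extract_records` and the `SAMPLE` markup are all made up for illustration, and the parser assumes tables are not nested.

```python
from html.parser import HTMLParser


class DetailsTableParser(HTMLParser):
    """Collect the rows of every <table> whose class list contains css_class."""

    def __init__(self, css_class="details"):
        super().__init__()
        self.css_class = css_class
        self.tables = []        # one list of rows per matching table
        self._in_table = False
        self._row = None        # cells of the row being read, or None
        self._cell = None       # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            classes = (dict(attrs).get("class") or "").split()
            if self.css_class in classes:
                self._in_table = True
                self.tables.append([])
        elif self._in_table and tag == "tr":
            self._row = []
        elif self._row is not None and tag in ("th", "td"):
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None
            self._cell = None
        elif tag == "table":
            self._in_table = False


def extract_records(html, css_class="details"):
    """Return one list of (heading, value) pairs per data row."""
    parser = DetailsTableParser(css_class)
    parser.feed(html)
    parser.close()
    records = []
    for rows in parser.tables:
        if not rows:
            continue
        headings = rows[0]      # the first row holds the field names
        for row in rows[1:]:
            records.append(list(zip(headings, row)))
    return records


# Made-up sample: one table to ignore, one "details" table with two data rows.
SAMPLE = """
<table class="summary"><tr><th>Ignored</th></tr><tr><td>x</td></tr></table>
<table class="details">
  <tr><th>Tests</th><th>Failures</th></tr>
  <tr class="Failure"><td>103</td><td>24</td></tr>
  <tr><td>12</td><td>0</td></tr>
</table>
"""

records = extract_records(SAMPLE)
for record in records:
    for heading, value in record:
        print("{0:<16}: {1}".format(heading, value))
    print()
```

Each data row becomes its own block of key-value lines, so extra rows and extra matching tables simply produce more blocks. For messy real-world HTML, a full parser such as BeautifulSoup is likely to be more forgiving than this sketch.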
