Extracting data from HTML table
Question
I am looking for a way to get certain information out of an HTML file in a Linux shell environment.
This is the bit that I'm interested in:
<table class="details" border="0" cellpadding="5" cellspacing="2" width="95%">
<tr valign="top">
<th>Tests</th>
<th>Failures</th>
<th>Success Rate</th>
<th>Average Time</th>
<th>Min Time</th>
<th>Max Time</th>
</tr>
<tr valign="top" class="Failure">
<td>103</td>
<td>24</td>
<td>76.70%</td>
<td>71 ms</td>
<td>0 ms</td>
<td>829 ms</td>
</tr>
</table>
I want to store these values in shell variables, or echo them as key-value pairs extracted from the HTML above. For example:
Tests : 103
Failures : 24
Success Rate : 76.70 %
and so on..
What I can do at the moment is write a Java program that uses a SAX parser, or an HTML parser such as jsoup, to extract this information.
But using Java here seems like overhead, since the runnable JAR would have to be bundled with the "wrapper" script that executes it.
I'm sure there must be "shell" languages out there that can do the same, e.g. Perl, Python, Bash.
My problem is that I have zero experience with these. Can somebody help me resolve this "fairly easy" issue?
Quick update:
I forgot to mention that the .html document contains more tables and more rows. Sorry about that (early morning).
Update #2:
I tried to install BeautifulSoup like this, since I don't have root access:
$ wget http://www.crummy.com/software/BeautifulSoup/bs4/download/4.0/beautifulsoup4-4.1.0.tar.gz
$ tar -zxvf beautifulsoup4-4.1.0.tar.gz
$ cp -r beautifulsoup4-4.1.0/bs4 .
$ vi htmlParse.py   # paste the code from Tichodromas' answer; in case it matters, this is what I pasted: http://pastebin.com/4Je11Y9q
$ python htmlParse.py   # run the file
Error:
$ python htmlParse.py
Traceback (most recent call last):
File "htmlParse.py", line 1, in ?
from bs4 import BeautifulSoup
File "/home/gdd/setup/py/bs4/__init__.py", line 29
from .builder import builder_registry
^
SyntaxError: invalid syntax
Update #3:
Running Tichodromas' answer gives this error:
Traceback (most recent call last):
File "test.py", line 27, in ?
headings = [th.get_text() for th in table.find("tr").find_all("th")]
TypeError: 'NoneType' object is not callable
Any ideas?
Recommended answer
A Python solution using BeautifulSoup4 (Edit: with proper skipping. Edit 3: using class="details" to select the table):
from bs4 import BeautifulSoup
html = """
<table class="details" border="0" cellpadding="5" cellspacing="2" width="95%">
<tr valign="top">
<th>Tests</th>
<th>Failures</th>
<th>Success Rate</th>
<th>Average Time</th>
<th>Min Time</th>
<th>Max Time</th>
</tr>
<tr valign="top" class="Failure">
<td>103</td>
<td>24</td>
<td>76.70%</td>
<td>71 ms</td>
<td>0 ms</td>
<td>829 ms</td>
</tr>
</table>"""
# Name the parser explicitly so the result does not depend on what is installed.
soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", attrs={"class": "details"})
# The first tr contains the field names.
headings = [th.get_text() for th in table.find("tr").find_all("th")]
datasets = []
# Every following tr is one data row; pair each cell with its heading.
for row in table.find_all("tr")[1:]:
    dataset = list(zip(headings, (td.get_text() for td in row.find_all("td"))))
    datasets.append(dataset)
print(datasets)
The result looks like this (under Python 2, hence the u prefixes):
[[(u'Tests', u'103'),
(u'Failures', u'24'),
(u'Success Rate', u'76.70%'),
(u'Average Time', u'71 ms'),
(u'Min Time', u'0 ms'),
(u'Max Time', u'829 ms')]]
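Where BeautifulSoup cannot be installed, as in Update #2 of the question, roughly the same list of (heading, value) pairs could be built with the standard library alone. The `in ?` in the question's tracebacks suggests a very old interpreter (likely older than Python 2.5, which would explain the SyntaxError on the relative import), so the sketch below assumes a Python 3 interpreter is available. The names `DetailsTableParser` and `extract_datasets` are made up for illustration, and nested tables are not handled.

```python
from html.parser import HTMLParser


class DetailsTableParser(HTMLParser):
    """Collect the rows of every <table class="details"> as lists of cell text."""

    def __init__(self):
        super().__init__()
        self.tables = []        # one list of rows per matching table
        self._in_table = False  # True while inside a matching table
        self._row = None        # cells of the row being read, or None
        self._cell = None       # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "table" and "details" in classes:
            self._in_table = True
            self.tables.append([])
        elif self._in_table and tag == "tr":
            self._row = []
        elif self._row is not None and tag in ("th", "td"):
            self._cell = []

    def handle_data(self, data):
        # Text outside a th/td (such as whitespace between tags) is ignored.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None
            self._cell = None
        elif tag == "table":
            self._in_table = False


def extract_datasets(html):
    """Return one list of (heading, value) pairs per data row, over all matching tables."""
    parser = DetailsTableParser()
    parser.feed(html)
    parser.close()
    datasets = []
    for rows in parser.tables:
        if not rows:
            continue
        headings = rows[0]  # the first tr holds the field names
        for row in rows[1:]:
            datasets.append(list(zip(headings, row)))
    return datasets
```

Called as `extract_datasets(html)` with the `html` string above, this should give the same pairs as the BeautifulSoup version, as plain strings. Because it walks every table with class "details" and every row after the header, it also covers the "more tables and more rows" case from the quick update.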
Edit 2: To produce the desired output, use something like this:
for dataset in datasets:
    for field in dataset:
        # Left-align the heading in a 16-character column.
        print("{0:<16}: {1}".format(field[0], field[1]))
Result:
Tests           : 103
Failures        : 24
Success Rate    : 76.70%
Average Time    : 71 ms
Min Time        : 0 ms
Max Time        : 829 ms
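The question also asks about storing the values in shell variables. One possible approach, sketched below, is to have the Python script print shell assignments that the wrapper script then evaluates. The helper name `to_shell_assignments` and its `prefix` parameter are hypothetical, and `shlex.quote` requires Python 3.3 or newer.

```python
import re
import shlex


def to_shell_assignments(dataset, prefix=""):
    """Turn (heading, value) pairs into shell assignment lines such as TESTS=103."""
    lines = []
    for heading, value in dataset:
        # Keep only characters that are valid in a shell variable name.
        name = re.sub(r"[^A-Za-z0-9]+", "_", heading).strip("_").upper()
        if not name or name[0].isdigit():
            name = "_" + name
        # shlex.quote adds quoting only where the shell needs it (e.g. for spaces).
        lines.append("{0}{1}={2}".format(prefix, name, shlex.quote(value)))
    return lines
```

If the script prints these lines for a single row, the wrapper could pick them up with something like `eval "$(python htmlParse.py)"` and then use `$TESTS`, `$SUCCESS_RATE` and so on. With several data rows the later assignments would overwrite the earlier ones, so a per-row `prefix` (for example `ROW1_`) may be needed.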