Using BeautifulSoup to extract a table in Python 3
Problem description
I would like to use BeautifulSoup to extract a table from a website and store it as structured data. The final output I need is something that can be exported to a .csv file with a header row and multiple data rows.
I followed the answer to this question, but it appears that updates to Python (or BeautifulSoup) since it was posted 8 years ago require some adjustments. I think I have that mostly solved (see below), but in addition, the original answer seems to stop just short of actually structuring the data, instead outputting a list of header-data pairs.
I'd like to use a similar solution because it seems really close to what I need. My data is already parsed using BeautifulSoup, so I'm specifically asking for a solution using that package rather than Pandas.
The HTML below is altered from the original question by adding a second data row, as my data has many rows.
from bs4 import BeautifulSoup
html = """
<table class="details" border="0" cellpadding="5" cellspacing="2" width="95%">
<tr valign="top">
<th>Tests</th>
<th>Failures</th>
<th>Success Rate</th>
<th>Average Time</th>
<th>Min Time</th>
<th>Max Time</th>
</tr>
<tr valign="top" class="Failure">
<td>103</td>
<td>24</td>
<td>76.70%</td>
<td>71 ms</td>
<td>0 ms</td>
<td>829 ms</td>
</tr>
<tr valign="top" class="Failure">
<td>109</td>
<td>35</td>
<td>82.01%</td>
<td>12 ms</td>
<td>2 ms</td>
<td>923 ms</td>
</tr>
</table>"""
soup = BeautifulSoup(html)
table = soup.find("table", attrs={"class":"details"})
# The first tr contains the field names.
headings = [th.get_text() for th in table.find("tr").find_all("th")]
datasets = []
for row in table.find_all("tr")[1:]:
    dataset = zip(headings, (td.get_text() for td in row.find_all("td")))
    datasets.append(dataset)
print(datasets)
The result is supposed to look like the following (though with multiple rows, I'm not sure of the precise structure).
[[(u'Tests', u'103'),
(u'Failures', u'24'),
(u'Success Rate', u'76.70%'),
(u'Average Time', u'71 ms'),
(u'Min Time', u'0 ms'),
(u'Max Time', u'829 ms')]]
But instead it looks like:
[<zip object at 0x7fb06b5efdc0>, <zip object at 0x7fb06b5ef980>]
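For context, the zip objects appear because zip returns a lazy iterator in Python 3, whereas in Python 2 it returned a list. A minimal sketch of that behaviour, using shortened lists that are illustrative stand-ins for the real headings and cells:

```python
# Shortened, illustrative stand-ins for the table's headings and one row of cells.
headings = ['Tests', 'Failures']
cells = ['103', '24']

pairs = zip(headings, cells)   # lazy iterator in Python 3, not a list
first_pass = list(pairs)       # materializes the pairs and consumes the iterator
second_pass = list(pairs)      # the iterator is already exhausted

print(first_pass)    # [('Tests', '103'), ('Failures', '24')]
print(second_pass)   # []
```

Because the iterator can only be consumed once, it has to be materialized (for example with list, tuple, or dict) before it is printed or reused.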
Attempted solution
I tried using datasets.append(tuple(dataset)) in the existing for loop, which resulted in:
[(('Tests', '103'), ('Failures', '24'), ('Success Rate', '76.70%'), ('Average Time', '71 ms'), ('Min Time', '0 ms'), ('Max Time', '829 ms')),
(('Tests', '109'), ('Failures', '35'), ('Success Rate', '82.01%'), ('Average Time', '12 ms'), ('Min Time', '2 ms'), ('Max Time', '923 ms'))]
This is closer to the original answer's expected output, but it repeats the header in every pair rather than creating a data table with one header row and rows of values. So I'm not sure what to do with the data from this point.
Recommended answer
So you already have:
datasets = [
(('Tests', '103'), ('Failures', '24'), ('Success Rate', '76.70%'), ('Average Time', '71 ms'), ('Min Time', '0 ms'), ('Max Time', '829 ms')),
(('Tests', '109'), ('Failures', '35'), ('Success Rate', '82.01%'), ('Average Time', '12 ms'), ('Min Time', '2 ms'), ('Max Time', '923 ms'))
]
Here's how you can transform it. Assuming all rows have the same headers in the same order, you can extract the headers from the first row:
headers_row = [hdr for hdr, data in datasets[0]]
Now extract the second field of each tuple, such as ('Tests', '103'), in each row:
processed_rows = [
    [data for hdr, data in row]
    for row in datasets
]
# [['103', '24', '76.70%', '71 ms', '0 ms', '829 ms'], ['109', '35', '82.01%', '12 ms', '2 ms', '923 ms']]
Now you have the header row and a list of processed_rows. You can write them to a CSV file with the standard csv module.
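The answer does not show this step, so here is a minimal sketch of it using csv.writer. The helper name rows_to_csv is hypothetical, and the output goes to an in-memory io.StringIO buffer purely so the result is easy to inspect:

```python
import csv
import io

# Values as produced by the two steps in the answer above.
headers_row = ['Tests', 'Failures', 'Success Rate', 'Average Time', 'Min Time', 'Max Time']
processed_rows = [
    ['103', '24', '76.70%', '71 ms', '0 ms', '829 ms'],
    ['109', '35', '82.01%', '12 ms', '2 ms', '923 ms'],
]

def rows_to_csv(headers, rows):
    """Return CSV text: one header line followed by one line per data row.

    Hypothetical helper for illustration; not part of the original answer.
    """
    buffer = io.StringIO(newline='')   # newline='' leaves the csv line endings untouched
    writer = csv.writer(buffer)
    writer.writerow(headers)           # single header row
    writer.writerows(rows)             # all data rows at once
    return buffer.getvalue()

csv_text = rows_to_csv(headers_row, processed_rows)
print(csv_text)
```

To write to disk instead, the same writerow and writerows calls should work on a file opened with open('data.csv', 'w', newline='').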
A better solution may be to keep your original format and use csv.DictWriter.
- Extract the headers into headers_row, as shown above.
- Write the data:
import csv
with open('data.csv', 'w', newline='') as csvfile:
    writer = csv.DictWriter(csvfile, fieldnames=headers_row)
    writer.writeheader()
    for row in datasets:  # your original data
        writer.writerow(dict(row))
For example, dict(datasets[0]) is:
{'Tests': '103', 'Failures': '24', 'Success Rate': '76.70%', 'Average Time': '71 ms', 'Min Time': '0 ms', 'Max Time': '829 ms'}
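If the header-data pairs are not needed anywhere else, one possible variation is to build the dictionaries directly with dict(zip(...)) and skip the intermediate tuples. The sketch below assumes the headings and cell texts have already been pulled out with get_text(); cell_rows and records are hypothetical names standing in for that data, and the CSV is written to an in-memory buffer for inspection:

```python
import csv
import io

headings = ['Tests', 'Failures', 'Success Rate', 'Average Time', 'Min Time', 'Max Time']
# Hypothetical stand-in for the td.get_text() values of each data row.
cell_rows = [
    ['103', '24', '76.70%', '71 ms', '0 ms', '829 ms'],
    ['109', '35', '82.01%', '12 ms', '2 ms', '923 ms'],
]

# One dict per table row, keyed by heading.
records = [dict(zip(headings, cells)) for cells in cell_rows]

buffer = io.StringIO(newline='')
writer = csv.DictWriter(buffer, fieldnames=headings)
writer.writeheader()        # header row taken from fieldnames
writer.writerows(records)   # one CSV line per dict
dict_csv_text = buffer.getvalue()
print(dict_csv_text)
```

Keeping each row as a dict means the column order is controlled by fieldnames rather than by the order in which the cells were scraped.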