Using BeautifulSoup to extract a table in Python 3

Question

I would like to use BeautifulSoup to extract a table from a website and store it as structured data. The final output I require is something that can be exported to a .csv with a header row and multiple data rows.

I followed the answer to this question, but it appears that updates to Python (or BeautifulSoup) since it was posted 8 years ago require some adjustments. I think I have that mostly solved (see below), but in addition, the original answer seems to stop just short of actually structuring the data, instead outputting a list of header-data pairs.

I'd like to use a similar solution because it seems really close to what I need. My data is already parsed using BeautifulSoup so I'm specifically asking for a solution using that package rather than Pandas.

Altered from original question by adding a second row, as my data has many rows.

from bs4 import BeautifulSoup

html = """
  <table class="details" border="0" cellpadding="5" cellspacing="2" width="95%">
    <tr valign="top">
      <th>Tests</th>
      <th>Failures</th>
      <th>Success Rate</th>
      <th>Average Time</th>
      <th>Min Time</th>
      <th>Max Time</th>
   </tr>
   <tr valign="top" class="Failure">
     <td>103</td>
     <td>24</td>
     <td>76.70%</td>
     <td>71 ms</td>
     <td>0 ms</td>
     <td>829 ms</td>
  </tr>
  <tr valign="top" class="Failure">
     <td>109</td>
     <td>35</td>
     <td>82.01%</td>
     <td>12 ms</td>
     <td>2 ms</td>
     <td>923 ms</td>
  </tr>
</table>"""

# Name the parser explicitly; omitting it makes bs4 guess and emit a warning.
soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", attrs={"class":"details"})

# The first tr contains the field names.
headings = [th.get_text() for th in table.find("tr").find_all("th")]

datasets = []
for row in table.find_all("tr")[1:]:
    dataset = zip(headings, (td.get_text() for td in row.find_all("td")))
    datasets.append(dataset)

print(datasets)

The result is supposed to look like the following (though with multiple rows, I'm not sure precisely the structure).

[[(u'Tests', u'103'),
  (u'Failures', u'24'),
  (u'Success Rate', u'76.70%'),
  (u'Average Time', u'71 ms'),
  (u'Min Time', u'0 ms'),
  (u'Max Time', u'829 ms')]]

But instead it looks like:

[<zip object at 0x7fb06b5efdc0>, <zip object at 0x7fb06b5ef980>]

Attempted solution

I tried using datasets.append(tuple(dataset)) in the existing for loop, which resulted in:

[(('Tests', '103'), ('Failures', '24'), ('Success Rate', '76.70%'), ('Average Time', '71 ms'), ('Min Time', '0 ms'), ('Max Time', '829 ms')), 
(('Tests', '109'), ('Failures', '35'), ('Success Rate', '82.01%'), ('Average Time', '12 ms'), ('Min Time', '2 ms'), ('Max Time', '923 ms'))]

This is closer to the original answer's expected output, but obviously duplicates the pairs rather than creating a data table with headers and values. So I'm not sure what to do with the data from this point.

Recommended answer

So you already have:

datasets = [
  (('Tests', '103'), ('Failures', '24'), ('Success Rate', '76.70%'), ('Average Time', '71 ms'), ('Min Time', '0 ms'), ('Max Time', '829 ms')), 
  (('Tests', '109'), ('Failures', '35'), ('Success Rate', '82.01%'), ('Average Time', '12 ms'), ('Min Time', '2 ms'), ('Max Time', '923 ms'))
]
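
For context, the list above is what the loop in the question yields once each zip object is materialized. In Python 3, zip returns a lazy iterator rather than a list (as it did in Python 2), which is why printing the unmodified datasets showed <zip object ...> entries. A minimal sketch of the difference, using shortened stand-in values rather than parsed HTML:

```python
# Stand-in values; in the question these come from the th/td get_text() calls.
headings = ["Tests", "Failures"]
cells = ["103", "24"]

# A zip object is an iterator; its repr does not show the pairs it will yield.
lazy_pairs = zip(headings, cells)
lazy_repr = repr(lazy_pairs)

# Materializing the iterator produces the actual (header, value) pairs.
pairs = list(zip(headings, cells))

print(lazy_repr.startswith("<zip object"))  # True
print(pairs)  # [('Tests', '103'), ('Failures', '24')]
```

Wrapping the zip call in list(...) or tuple(...) inside the loop, as the question's attempted fix does, gives the pair structure shown above.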

Here's how you can transform it. Assuming all rows are the same, you can extract headers from the first row:

headers_row = [hdr for hdr, data in datasets[0]]

Now, extract the second field of each tuple like ('Tests', '103') in each row:

processed_rows = [
  [data for hdr, data in row]
  for row in datasets
]
# [['103', '24', '76.70%', '71 ms', '0 ms', '829 ms'], ['109', '35', '82.01%', '12 ms', '2 ms', '923 ms']]

Now you have the header row and a list of processed_rows. You can write them to a CSV file with the standard csv module.
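
The csv-module step could look roughly like the sketch below. It assumes headers_row and processed_rows hold the values shown above; the helper name rows_to_csv and the in-memory io.StringIO buffer (used here instead of a file on disk) are illustrative choices, not part of the original answer.

```python
import csv
import io


def rows_to_csv(headers, rows):
    """Return CSV text: one header line followed by the data rows."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(headers)  # header row first
    writer.writerows(rows)    # then every data row
    return buffer.getvalue()


headers_row = ["Tests", "Failures", "Success Rate",
               "Average Time", "Min Time", "Max Time"]
processed_rows = [
    ["103", "24", "76.70%", "71 ms", "0 ms", "829 ms"],
    ["109", "35", "82.01%", "12 ms", "2 ms", "923 ms"],
]

csv_text = rows_to_csv(headers_row, processed_rows)
print(csv_text)
```

To write to disk instead, the same writerow and writerows calls work on a file opened with open('data.csv', 'w', newline='').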

A better solution may be to keep your original format and use csv.DictWriter.

  1. Extract the headers into headers_row, as shown above.

  2. Write the data:

import csv

with open('data.csv', 'w', newline='') as csvfile:
    writer = csv.DictWriter(csvfile, fieldnames=headers_row)

    writer.writeheader()

    for row in datasets: # your original data
        writer.writerow(dict(row))

For example, dict(datasets[0]) is:

{'Tests': '103', 'Failures': '24', 'Success Rate': '76.70%', 'Average Time': '71 ms', 'Min Time': '0 ms', 'Max Time': '829 ms'}
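
Since each row converts to a dict this way, the intermediate pair tuples could arguably be skipped by building the dicts directly with dict(zip(...)). The sketch below is one way to do that, assuming the header and cell text have already been pulled out of the parsed table. The names cell_rows, records, and buffer are hypothetical, and an in-memory buffer stands in for data.csv.

```python
import csv
import io

# Assumed to come from the th/td get_text() calls in the question.
headings = ["Tests", "Failures", "Success Rate",
            "Average Time", "Min Time", "Max Time"]
cell_rows = [
    ["103", "24", "76.70%", "71 ms", "0 ms", "829 ms"],
    ["109", "35", "82.01%", "12 ms", "2 ms", "923 ms"],
]

# One dict per table row, keyed by column heading.
records = [dict(zip(headings, cells)) for cells in cell_rows]

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=headings)
writer.writeheader()
writer.writerows(records)

print(buffer.getvalue())
```

Note that zip stops at the shorter input, so a row with fewer cells than headings would produce a dict with missing keys; DictWriter fills those columns with its restval, an empty string by default.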
