在Python中将HTML表转换为CSV [英] Converting a HTML table to a CSV in Python
问题描述
我正在尝试将HTML中的表转换为Python中的csv。我要提取的表就是这个表:
I am trying to convert a table in HTML to a csv in Python. The table I am trying to extract is this one:
<table class="tblperiode">
<caption>Dades de període</caption>
<tr>
<th class="sortable"><span class="tooltip" title="Període (Temps Universal)">Període</span><br/>TU</th>
<th><span class="tooltip" title="Temperatura mitjana (°C)">TM</span><br/>°C</th>
<th><span class="tooltip" title="Temperatura màxima (°C)">TX</span><br/>°C</th>
<th><span class="tooltip" title="Temperatura mínima (°C)">TN</span><br/>°C</th>
<th><span class="tooltip" title="Humitat relativa mitjana (%)">HRM</span><br/>%</th>
<th><span class="tooltip" title="Precipitació (mm)">PPT</span><br/>mm</th>
<th><span class="tooltip" title="Velocitat mitjana del vent (km/h)">VVM (10 m)</span><br/>km/h</th>
<th><span class="tooltip" title="Direcció mitjana del vent (graus)">DVM (10 m)</span><br/>graus</th>
<th><span class="tooltip" title="Ratxa màxima del vent (km/h)">VVX (10 m)</span><br/>km/h</th>
<th><span class="tooltip" title="Irradiància solar global mitjana (W/m2)">RS</span><br/>W/m<sup>2</sup></th>
</tr>
<tr>
<th>
00:00 - 00:30
</th>
<td>16.2</td>
<td>16.5</td>
<td>15.4</td>
<td>93</td>
<td>0.0</td>
<td>6.5</td>
<td>293</td>
<td>10.4</td>
<td>0</td>
</tr>
<tr>
<th>
00:30 - 01:00
</th>
<td>16.4</td>
<td>16.5</td>
<td>16.1</td>
<td>90</td>
<td>0.0</td>
<td>5.8</td>
<td>288</td>
<td>8.6</td>
<td>0</td>
</tr>
我希望它看起来像这样:
And I want it to look something like this:
要实现这一点,我尝试解析html并成功构建数据正确执行以下操作的数据框:
To achieve so, what I have tried is to parse the html and I have managed to build a dataframe with the data correctly doing the following:
from bs4 import BeautifulSoup
import csv
html = open("table.html").read()
soup = BeautifulSoup(html)
table = soup.select_one("table.tblperiode")
output_rows = []
for table_row in table.findAll('tr'):
columns = table_row.findAll('td')
output_row = []
for column in columns:
output_row.append(column.text)
output_rows.append(output_row)
df = pd.DataFrame(output_rows)
print(df)
但是,我想将列命名为d指示时间间隔的列,在上面的html示例中,其中只有两个出现在00:00-00:30和00:30 1:00。因此,我的表应具有两行,其中一行对应于00:00-00:30的观察值,另一行对应于00:30和1:00的观察值。
However, I would like to have the columns name and a column indicating the interval of time, in the example of html above just two of them appear 00:00-00:30 and 00:30 1:00. Therefore my table should have two rows, one corresponding with the observations of 00:00-00:30 and another one with the observations of 00:30 and 1:00.
如何从HTML中获取此信息?
How could I get this information from my HTML?
推荐答案
这里是一种方法,它可能不是最好的方法,但它可以工作!您可以通读注释以了解代码在做什么!
Here's a way of doing it, it's probably not the nicest way but it works! You can read through the comments to figure out what the code is doing!
from bs4 import BeautifulSoup
import csv
#read the html
html = open("table.html").read()
soup = BeautifulSoup(html, 'html.parser')
# get the table from html
table = soup.select_one("table.tblperiode")
# find all rows
rows = table.findAll('tr')
# strip the header from rows
headers = rows[0]
header_text = []
# add the header text to array
for th in headers.findAll('th'):
header_text.append(th.text)
# init row text array
row_text_array = []
# loop through rows and add row text to array
for row in rows[1:]:
row_text = []
# loop through the elements
for row_element in row.findAll(['th', 'td']):
# append the array with the elements inner text
row_text.append(row_element.text.replace('\n', '').strip())
# append the text array to the row text array
row_text_array.append(row_text)
# output csv
with open("out.csv", "w") as f:
wr = csv.writer(f)
wr.writerow(header_text)
# loop through each row array
for row_text_single in row_text_array:
wr.writerow(row_text_single)
这篇关于在Python中将HTML表转换为CSV的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!