Convert HTML table with a header to JSON - Python
Question
Suppose I have the following HTML table:
<table>
<tr>
<th>Name</th>
<th>Age</th>
<th>License</th>
<th>Amount</th>
</tr>
<tr>
<td>John</td>
<td>28</td>
<td>Y</td>
<td>12.30</td>
</tr>
<tr>
<td>Kevin</td>
<td>25</td>
<td>Y</td>
<td>22.30</td>
</tr>
<tr>
<td>Smith</td>
<td>38</td>
<td>Y</td>
<td>52.20</td>
</tr>
<tr>
<td>Stewart</td>
<td>21</td>
<td>N</td>
<td>3.80</td>
</tr>
</table>
I'd like to convert this table to JSON, potentially in the following format:
data= [
{
Name: 'John',
Age: 28,
License: 'Y',
Amount: 12.30
},
{
Name: 'Kevin',
Age: 25,
License: 'Y',
Amount: 22.30
},
{
Name: 'Smith',
Age: 38,
License: 'Y',
Amount: 52.20
},
{
Name: 'Stewart',
Age: 21,
License: 'N',
Amount: 3.80
}
];
I've seen another example that roughly does the above, which I found here. However, there are a couple of things I can't get working with that answer:
- It is limited to two rows in the table. If I add another row, I get an error:
print(json.dumps(OrderedDict(table_data)))
ValueError: too many values to unpack (expected 2)
- It does not take the table's header row into account.
This is my code so far:
html_data = """
<table>
<tr>
<th>Name</th>
<th>Age</th>
<th>License</th>
<th>Amount</th>
</tr>
<tr>
<td>John</td>
<td>28</td>
<td>Y</td>
<td>12.30</td>
</tr>
<tr>
<td>Kevin</td>
<td>25</td>
<td>Y</td>
<td>22.30</td>
</tr>
<tr>
<td>Smith</td>
<td>38</td>
<td>Y</td>
<td>52.20</td>
</tr>
<tr>
<td>Stewart</td>
<td>21</td>
<td>N</td>
<td>3.80</td>
</tr>
</table>
"""
from bs4 import BeautifulSoup
from collections import OrderedDict
import json
table_data = [[cell.text for cell in row("td")]
for row in BeautifulSoup(html_data, features="lxml")("tr")]
print(json.dumps(OrderedDict(table_data)))
But I'm getting the following error:
print(json.dumps(OrderedDict(table_data)))
ValueError: need more than 0 values to unpack
EDIT: The answer below works perfectly if there is only one table in the HTML. What if there are two tables? For example:
<html>
<body>
<h1>My Heading</h1>
<p>Hello world</p>
<table>
<tr>
<th>Name</th>
<th>Age</th>
<th>License</th>
<th>Amount</th>
</tr>
<tr>
<td>John</td>
<td>28</td>
<td>Y</td>
<td>12.30</td>
</tr>
<tr>
<td>Kevin</td>
<td>25</td>
<td>Y</td>
<td>22.30</td>
</tr>
<tr>
<td>Smith</td>
<td>38</td>
<td>Y</td>
<td>52.20</td>
</tr>
<tr>
<td>Stewart</td>
<td>21</td>
<td>N</td>
<td>3.80</td>
</tr>
</table>
<table>
<tr>
<th>Name</th>
<th>Age</th>
<th>License</th>
<th>Amount</th>
</tr>
<tr>
<td>Rich</td>
<td>28</td>
<td>Y</td>
<td>12.30</td>
</tr>
<tr>
<td>Kevin</td>
<td>25</td>
<td>Y</td>
<td>22.30</td>
</tr>
<tr>
<td>Smith</td>
<td>38</td>
<td>Y</td>
<td>52.20</td>
</tr>
<tr>
<td>Stewart</td>
<td>21</td>
<td>N</td>
<td>3.80</td>
</tr>
</table>
</body>
</html>
If I plug this into the code below, only the first table appears in the JSON output.
Recommended answer
This code does exactly what you want:
from bs4 import BeautifulSoup
import json

xml_data = """
[[your xml data]]"""

if __name__ == '__main__':
    model = BeautifulSoup(xml_data, features='lxml')
    fields = []
    table_data = []
    # Collect the column names from the header cells.
    for tr in model.table.find_all('tr', recursive=False):
        for th in tr.find_all('th', recursive=False):
            fields.append(th.text)
    # Build one dict per data row, keyed by the column names.
    for tr in model.table.find_all('tr', recursive=False):
        datum = {}
        for i, td in enumerate(tr.find_all('td', recursive=False)):
            datum[fields[i]] = td.text
        if datum:
            table_data.append(datum)
    print(json.dumps(table_data, indent=4))
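Regarding the EDIT in the question: `model.table` only ever refers to the first `<table>` in the document, which is why a second table is ignored. The following is a minimal sketch of one way to handle every table, assuming each table carries its own `<th>` header row and contains no nested tables. The names `coerce`, `tables_to_records`, and `sample_html` are hypothetical and used only for illustration. It also converts numeric-looking cells to numbers, since the desired output in the question shows `Age` and `Amount` unquoted.

```python
import json

from bs4 import BeautifulSoup


def coerce(text):
    """Turn '28' into 28 and '12.30' into 12.3; leave other text unchanged."""
    text = text.strip()
    for cast in (int, float):
        try:
            return cast(text)
        except ValueError:
            pass
    return text


def tables_to_records(html):
    """Return one list of row dicts per <table> found in the HTML string."""
    soup = BeautifulSoup(html, "html.parser")
    result = []
    for table in soup.find_all("table"):
        # Assumption: the header cells of this table are its <th> elements.
        fields = [th.get_text(strip=True) for th in table.find_all("th")]
        records = []
        for tr in table.find_all("tr"):
            cells = tr.find_all("td")
            if cells:  # the header row has no <td> cells, so it is skipped
                records.append(
                    {field: coerce(td.get_text()) for field, td in zip(fields, cells)}
                )
        result.append(records)
    return result


# Illustrative input: two small tables, shortened from the question's example.
sample_html = """
<table>
<tr><th>Name</th><th>Age</th><th>License</th><th>Amount</th></tr>
<tr><td>John</td><td>28</td><td>Y</td><td>12.30</td></tr>
</table>
<table>
<tr><th>Name</th><th>Age</th><th>License</th><th>Amount</th></tr>
<tr><td>Rich</td><td>28</td><td>Y</td><td>12.30</td></tr>
</table>
"""

print(json.dumps(tables_to_records(sample_html), indent=4))
```

The result is a list with one entry per table, each entry being a list of row dictionaries. Note that `coerce` treats any text Python's `float` accepts as a number, so a cell containing `nan` or `inf` would also be converted; drop the helper and use the cell text directly if string values are preferred.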