Convert HTML table with a header to JSON - Python


Question

Suppose I have the following HTML table:

<table>
  <tr>
    <th>Name</th>
    <th>Age</th>
    <th>License</th>
    <th>Amount</th>
  </tr>
  <tr>
    <td>John</td>
    <td>28</td>
    <td>Y</td>
    <td>12.30</td>
  </tr>
  <tr>
    <td>Kevin</td>
    <td>25</td>
    <td>Y</td>
    <td>22.30</td>
  </tr>
  <tr>
    <td>Smith</td>
    <td>38</td>
    <td>Y</td>
    <td>52.20</td>
  </tr>
  <tr>
    <td>Stewart</td>
    <td>21</td>
    <td>N</td>
    <td>3.80</td>
  </tr>
</table>

I'd like to convert this table to JSON, potentially in the following format:

data= [
  { 
    Name: 'John',         
    Age: 28,
    License: 'Y',
    Amount: 12.30
  },
  { 
    Name: 'Kevin',         
    Age: 25,
    License: 'Y',
    Amount: 22.30
  },
  { 
    Name: 'Smith',         
    Age: 38,
    License: 'Y',
    Amount: 52.20
  },
  { 
    Name: 'Stewart',         
    Age: 21,
    License: 'N',
    Amount: 3.80
  }
];

I've seen another example that more or less does this, which I found here. However, given that answer, there are a couple of things I can't get working:

  • It is limited to two rows in the table. If I add another row, I get an error:

    print(json.dumps(OrderedDict(table_data)))
    ValueError: too many values to unpack (expected 2)

  • It does not take the table's header row into account.

This is my code so far:

    html_data = """
    <table>
      <tr>
        <th>Name</th>
        <th>Age</th>
        <th>License</th>
        <th>Amount</th>
      </tr>
      <tr>
        <td>John</td>
        <td>28</td>
        <td>Y</td>
        <td>12.30</td>
      </tr>
      <tr>
        <td>Kevin</td>
        <td>25</td>
        <td>Y</td>
        <td>22.30</td>
      </tr>
      <tr>
        <td>Smith</td>
        <td>38</td>
        <td>Y</td>
        <td>52.20</td>
      </tr>
      <tr>
        <td>Stewart</td>
        <td>21</td>
        <td>N</td>
        <td>3.80</td>
      </tr>
    </table>
    """
    
    from bs4 import BeautifulSoup
    from collections import OrderedDict
    import json
    
    table_data = [[cell.text for cell in row("td")]
                             for row in BeautifulSoup(html_data, features="lxml")("tr")]
    
    print(json.dumps(OrderedDict(table_data)))
    

But I'm getting the following error:

    print(json.dumps(OrderedDict(table_data)))
    ValueError: need more than 0 values to unpack
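
Both errors quoted above appear to share one cause: `OrderedDict`, like `dict`, accepts an iterable of key/value pairs, so every inner list must contain exactly two items. A row with four `<td>` cells has too many, and the header row contains only `<th>` cells, so `row("td")` yields an empty list for it. The following is a minimal sketch of that behaviour. The rows are hand-written stand-ins for parsed output, the helper `try_ordered_dict` is hypothetical, and the exact wording of the error message can differ between Python versions.

```python
from collections import OrderedDict


def try_ordered_dict(rows):
    """Return the OrderedDict, or the error text if construction fails."""
    try:
        return OrderedDict(rows)
    except ValueError as exc:
        return "ValueError: {}".format(exc)


# Hand-written stand-ins for what the list comprehension produces (assumed values).
header_row = []                                 # a <tr> holding only <th> cells has no <td>
data_row = ["John", "28", "Y", "12.30"]         # four cells, not a key/value pair
pair_rows = [["Name", "John"], ["Age", "28"]]   # exactly two items each

empty_result = try_ordered_dict([header_row])   # fails: nothing to unpack
wide_result = try_ordered_dict([data_row])      # fails: more than two values
pair_result = try_ordered_dict(pair_rows)       # succeeds

print(empty_result)
print(wide_result)
print(pair_result)
```

This suggests the data needs to be reshaped into one dictionary per row, keyed by the header names, rather than passed to `OrderedDict` directly.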

EDIT: The answer below works perfectly if there is only one table in the HTML. What if there are two tables? For example:

    <html>
        <body>
            <h1>My Heading</h1>
            <p>Hello world</p>
            <table>
                <tr>
                    <th>Name</th>
                    <th>Age</th>
                    <th>License</th>
                    <th>Amount</th>
                </tr>
                <tr>
                    <td>John</td>
                    <td>28</td>
                    <td>Y</td>
                    <td>12.30</td>
                </tr>
                <tr>
                    <td>Kevin</td>
                    <td>25</td>
                    <td>Y</td>
                    <td>22.30</td>
                </tr>
                <tr>
                    <td>Smith</td>
                    <td>38</td>
                    <td>Y</td>
                    <td>52.20</td>
                </tr>
                <tr>
                    <td>Stewart</td>
                    <td>21</td>
                    <td>N</td>
                    <td>3.80</td>
                </tr>
            </table>
            <table>
                <tr>
                    <th>Name</th>
                    <th>Age</th>
                    <th>License</th>
                    <th>Amount</th>
                </tr>
                <tr>
                    <td>Rich</td>
                    <td>28</td>
                    <td>Y</td>
                    <td>12.30</td>
                </tr>
                <tr>
                    <td>Kevin</td>
                    <td>25</td>
                    <td>Y</td>
                    <td>22.30</td>
                </tr>
                <tr>
                    <td>Smith</td>
                    <td>38</td>
                    <td>Y</td>
                    <td>52.20</td>
                </tr>
                <tr>
                    <td>Stewart</td>
                    <td>21</td>
                    <td>N</td>
                    <td>3.80</td>
                </tr>
            </table>
        </body>
    </html>
    

If I plug this into the code below, only the first table appears in the JSON output.

Recommended answer

This code does exactly what you want:

    from bs4 import BeautifulSoup
    import json
    
    xml_data = """
    [[your xml data]]"""
    
    
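    if __name__ == '__main__':
        model = BeautifulSoup(xml_data, features='lxml')
        fields = []
        table_data = []
        # First pass: collect the column names from the <th> cells.
        # Note: model.table refers to the first <table> in the document only.
        for tr in model.table.find_all('tr', recursive=False):
            for th in tr.find_all('th', recursive=False):
                fields.append(th.text)
        # Second pass: build one dict per row, keyed by column name.
        for tr in model.table.find_all('tr', recursive=False):
            datum = {}
            for i, td in enumerate(tr.find_all('td', recursive=False)):
                datum[fields[i]] = td.text
            # The header row has no <td> cells, so its dict is empty and is skipped.
            if datum:
                table_data.append(datum)
    
        # Cell values are emitted as strings, e.g. "28" rather than 28.
        print(json.dumps(table_data, indent=4))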
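
On the follow-up about two tables: `model.table` returns only the first `<table>` element, which would explain why only the first table shows up in the output. One possible extension, sketched here under assumptions, is to loop over `find_all("table")` and build a separate list of records per table. The helper names `coerce` and `tables_to_records` are hypothetical, the built-in `"html.parser"` is used instead of `lxml` only to avoid an extra dependency, and the sketch assumes flat (non-nested) tables whose header cells are `<th>` elements. The numeric conversion is optional and is there only to approximate the numbers in the desired output.

```python
import json

from bs4 import BeautifulSoup


def coerce(text):
    """Convert cell text to int or float when possible, else keep the string."""
    text = text.strip()
    for cast in (int, float):
        try:
            return cast(text)
        except ValueError:
            continue
    return text


def tables_to_records(html):
    """Return one list of row dicts for each <table> found in the HTML."""
    soup = BeautifulSoup(html, "html.parser")
    result = []
    for table in soup.find_all("table"):
        headers = [th.get_text(strip=True) for th in table.find_all("th")]
        rows = []
        for tr in table.find_all("tr"):
            cells = [coerce(td.get_text()) for td in tr.find_all("td")]
            if cells:  # the header row has no <td> cells, so it is skipped
                rows.append(dict(zip(headers, cells)))
        result.append(rows)
    return result


# Shortened sample with two tables (an assumed stand-in for the full HTML).
sample_html = """
<table>
  <tr><th>Name</th><th>Age</th><th>License</th><th>Amount</th></tr>
  <tr><td>John</td><td>28</td><td>Y</td><td>12.30</td></tr>
</table>
<table>
  <tr><th>Name</th><th>Age</th><th>License</th><th>Amount</th></tr>
  <tr><td>Rich</td><td>28</td><td>Y</td><td>12.30</td></tr>
</table>
"""

records = tables_to_records(sample_html)
print(json.dumps(records, indent=4))
```

Because `float` also accepts strings such as `nan` or `inf`, a stricter conversion may be preferable if cells can contain such text. JSON numbers also do not keep trailing zeros, so `12.30` is serialized as `12.3`.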
    
