HTML table to pandas table: Info inside html tags

Problem description

I have a large table from the web, accessed via requests and parsed with BeautifulSoup. Part of it looks something like this:

<table>
<tbody>
<tr>
<td>265</td>
<td> <a href="/j/jones03.shtml">Jones</a>Blue</td>
<td>29</td>
</tr>
<tr >
<td>266</td>
<td> <a href="/s/smith01.shtml">Smith</a></td>
<td>34</td>
</tr>
</tbody>
</table>

When I convert this to pandas using pd.read_html(tbl) the output is like this:

    0    1          2
 0  265  JonesBlue  29
 1  266  Smith      34

I need to keep the information in the <A HREF ... > tag, since the unique identifier is stored in the link. That is, the table should look like this:

    0    1        2
 0  265  jones03  29
 1  266  smith01  34

I'm fine with various other outputs (for example, jones03 Jones would be even more helpful) but the unique ID is critical.

Other cells also have html tags in them, and in general I don't want those to be saved, but if that's the only way of getting the uid I'm OK with keeping those tags and cleaning them up later, if I have to.

Is there a simple way of accessing this information?

Recommended answer

Since this parsing job requires extracting both text and attribute values, it cannot be done entirely "out of the box" by a function such as pd.read_html. Some of it has to be done by hand.
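As a hedged aside: pandas 1.5 and later accept an `extract_links` argument in `pd.read_html`, which returns each affected cell as a `(text, href)` tuple. The following is a minimal sketch, assuming such a pandas version and lxml are installed; the variable names are illustrative, and this is not the method the answer goes on to describe.

```python
from io import StringIO

import pandas as pd

content = '''
<table>
<tbody>
<tr><td>265</td><td> <a href="/j/jones03.shtml">Jones</a>Blue</td><td>29</td></tr>
<tr><td>266</td><td> <a href="/s/smith01.shtml">Smith</a></td><td>34</td></tr>
</tbody>
</table>'''

# extract_links="body" turns every body cell into a (text, href) tuple;
# href is None for cells that contain no link. Assumes pandas >= 1.5.
df = pd.read_html(StringIO(content), extract_links="body")[0]

# Pull the hrefs out of the name column (column label 1) before unpacking.
links = df[1].map(lambda cell: cell[1])

# Keep only the text part of every cell.
df = df.apply(lambda col: col.map(lambda cell: cell[0]))

# Reduce '/j/jones03.shtml' to 'jones03'.
df['refname'] = links.str.extract(r'([^./]+)[.]', expand=False)
print(df)
```

With this option the cell text comes back as strings, so numeric columns would still need an explicit conversion afterwards.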

Using lxml, you could extract the attribute values with XPath:

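from io import StringIO

import lxml.html as LH
import pandas as pd

content = '''
<table>
<tbody>
<tr>
<td>265</td>
<td> <a href="/j/jones03.shtml">Jones</a>Blue</td>
<td >29</td>
</tr>
<tr >
<td>266</td>
<td> <a href="/s/smith01.shtml">Smith</a></td>
<td>34</td>
</tr>
</tbody>
</table>'''

table = LH.fromstring(content)
# wrap the string in StringIO: newer pandas versions deprecate passing literal HTML
for df in pd.read_html(StringIO(content)):
    # collect every href in document order (assumes exactly one link per row)
    df['refname'] = table.xpath('//tr/td/a/@href')
    # keep the part between the last '/' and the first '.', e.g. 'jones03'
    df['refname'] = df['refname'].str.extract(r'([^./]+)[.]', expand=False)
    print(df)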

yields

     0          1   2  refname
0  265  JonesBlue  29  jones03
1  266      Smith  34  smith01


The above may be useful since it requires only a few extra lines of code to add the refname column.
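The regular expression `([^./]+)[.]` does the work of turning a link into an identifier. The standard-library sketch below shows what it captures on the two sample hrefs; the helper name `href_to_id` is hypothetical and only used for illustration.

```python
import re

# One or more characters that are neither '.' nor '/', followed by a literal '.'.
HREF_ID = re.compile(r'([^./]+)[.]')


def href_to_id(href):
    """Return the first path segment that precedes a dot, or None if there is none."""
    match = HREF_ID.search(href)
    return match.group(1) if match else None


for href in ('/j/jones03.shtml', '/s/smith01.shtml'):
    print(href, '->', href_to_id(href))
```

Like `Series.str.extract`, `search` returns the first match only, so a dot earlier in the path (for example in a directory name) would be picked up instead of the file name.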

However, LH.fromstring and pd.read_html each parse the HTML, so the approach above parses it twice. Its efficiency could be improved by removing pd.read_html and parsing the table once with LH.fromstring:

table = LH.fromstring(content)
# extract the text from `<td>` tags (surrounding whitespace is kept as-is)
data = [[elt.text_content() for elt in tr.xpath('td')]
        for tr in table.xpath('//tr')]
df = pd.DataFrame(data, columns=['id', 'name', 'val'])
# text_content() returns strings, so convert the numeric columns
for col in ('id', 'val'):
    df[col] = df[col].astype(int)
# extract the href attribute values (assumes exactly one link per row)
df['refname'] = table.xpath('//tr/td/a/@href')
df['refname'] = df['refname'].str.extract(r'([^./]+)[.]', expand=False)
print(df)

yields

    id        name  val  refname
0  265   JonesBlue   29  jones03
1  266       Smith   34  smith01
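Both snippets assign `table.xpath('//tr/td/a/@href')` to a column directly, which only lines up when every row contains exactly one link; if a row has none, the list is shorter than the frame and the assignment raises a ValueError. A possible variant, sketched here under the assumption that some rows may lack a link, collects the href row by row. The third table row and the function name `table_to_frame` are hypothetical.

```python
import re

import lxml.html as LH
import pandas as pd

# Same sample as above plus a hypothetical third row that has no link.
content = '''
<table>
<tbody>
<tr><td>265</td><td> <a href="/j/jones03.shtml">Jones</a>Blue</td><td>29</td></tr>
<tr><td>266</td><td> <a href="/s/smith01.shtml">Smith</a></td><td>34</td></tr>
<tr><td>267</td><td>Brown</td><td>41</td></tr>
</tbody>
</table>'''

HREF_ID = re.compile(r'([^./]+)[.]')  # same pattern as in the answer


def table_to_frame(html):
    """Parse the table once, keeping one refname (or None) per row."""
    table = LH.fromstring(html)
    rows = []
    for tr in table.xpath('//tr'):
        cells = [td.text_content().strip() for td in tr.xpath('td')]
        hrefs = tr.xpath('td/a/@href')  # links found in this row only
        match = HREF_ID.search(hrefs[0]) if hrefs else None
        rows.append(cells + [match.group(1) if match else None])
    df = pd.DataFrame(rows, columns=['id', 'name', 'val', 'refname'])
    return df.astype({'id': int, 'val': int})


df = table_to_frame(content)
print(df)
```

Because the href is looked up inside each `tr`, the identifier stays attached to its own row, and rows without a link get a missing value instead of shifting the column.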
