Parsing an HTML table with pd.read_html where cells contain full tables themselves

Problem description

I need to parse a table from HTML that has other tables nested within the larger table. When called as below with pd.read_html, each of these nested tables is parsed and then "inserted"/"concatenated" as rows.

I'd like each of these nested tables to be parsed into its own pd.DataFrame and then inserted as an object as the value of the corresponding column.

If this is not possible, having the raw HTML for the nested table as a string in the corresponding position would be fine.

Code tested

import pandas as pd

# pd.read_html returns a list of DataFrames, one per <table> matching attrs
df_up = pd.read_html("up_pf00344.test.html", attrs={'id': 'results'})

Screenshot of output:

Screenshot of the table as rendered in HTML:

Link to file: https://gist.github.com/smsaladi/6adb30efbe70f9fed0306b226e8ad0d8#file-up_pf00344-test-html-L62

Recommended answer

You can't use pd.read_html on its own to keep the nested tables as cell values.
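One way to get nested tables into cells is to walk the outer table yourself and convert any cell that wraps a table into its own DataFrame. The following is a minimal sketch of that idea, not the answer's original code: it assumes BeautifulSoup (bs4) is installed, the helper names own_rows, cell_value and table_to_frame are hypothetical, and SAMPLE_HTML is made-up data standing in for the asker's file. It also assumes the first row of each table is its header.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Made-up stand-in for the asker's file: one cell wraps a complete table.
SAMPLE_HTML = """
<table id="results">
  <tr><th>Entry</th><th>Length</th><th>Transmembrane</th></tr>
  <tr id="row-a"><td>A</td><td>430</td>
      <td><table>
            <tr><th>start</th><th>end</th></tr>
            <tr><td>10</td><td>30</td></tr>
            <tr><td>45</td><td>65</td></tr>
          </table></td></tr>
  <tr id="row-b"><td>B</td><td>512</td><td></td></tr>
</table>
"""


def own_rows(table):
    """Rows that belong to `table` itself, skipping rows of nested tables."""
    return [tr for tr in table.find_all("tr")
            if tr.find_parent("table") is table]


def cell_value(cell):
    """A DataFrame if the cell wraps a table, otherwise the cell text."""
    nested = cell.find("table")
    if nested is not None:
        return table_to_frame(nested)
    return cell.get_text(" ", strip=True)


def table_to_frame(table):
    """Convert one <table> tag to a DataFrame, recursing into nested tables."""
    rows = own_rows(table)
    # Assumption: the first row holds the column names.
    header = [c.get_text(" ", strip=True)
              for c in rows[0].find_all(["th", "td"], recursive=False)]
    records = [[cell_value(c)
                for c in row.find_all(["th", "td"], recursive=False)]
               for row in rows[1:]]
    return pd.DataFrame(records, columns=header)


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
df = table_to_frame(soup.find("table", id="results"))
print(df.iloc[0].map(type))
```

Collecting all rows first and building the DataFrame once avoids growing it row by row. To get the raw HTML fallback instead, cell_value could return str(nested) rather than recursing. The listing that follows is from the original answer's run on the asker's file, not from this sample.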

Result of df.iloc[0].map(type):

                                                            <class 'str'>
Entry                                                       <class 'str'>
Organism                                                    <class 'str'>
Protein names                                               <class 'str'>
Gene names                                                  <class 'str'>
Length                                                      <class 'str'>
Cross-reference (Pfam)                                      <class 'str'>
Cross-reference (InterPro)                                  <class 'str'>
Taxonomic lineage IDs                                       <class 'str'>
Subcellular location [CC]                                   <class 'str'>
Signal peptide                                              <class 'str'>
Transit peptide                                             <class 'str'>
Topological domain                  <class 'pandas.core.frame.DataFrame'>
Transmembrane                       <class 'pandas.core.frame.DataFrame'>
Intramembrane                       <class 'pandas.core.frame.DataFrame'>
Sequence caution                                            <class 'str'>
Caution                                                     <class 'str'>
Taxonomic lineage (SUPERKINGDOM)                            <class 'str'>
Taxonomic lineage (KINGDOM)                                 <class 'str'>
Taxonomic lineage (PHYLUM)                                  <class 'str'>
Cross-reference (RefSeq)                                    <class 'str'>
Cross-reference (EMBL)                                      <class 'str'>
e                                                           <class 'str'>

Bonus: As your table rows have an id, you could use it as the index of your DataFrame, with df.loc[row.get('id')] = df_row instead of df.loc[len(df)] = df_row.
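The difference between the two assignments can be seen on a small example. This is only a sketch under assumptions: it uses BeautifulSoup (bs4) to read the rows, and the HTML, the row ids and the names by_position and by_id are made up for illustration.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Made-up table whose data rows carry an id attribute.
html = """
<table id="results">
  <tr><th>Entry</th><th>Length</th></tr>
  <tr id="row-a"><td>A</td><td>430</td></tr>
  <tr id="row-b"><td>B</td><td>512</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
rows = soup.find("table", id="results").find_all("tr")
columns = [th.get_text(strip=True) for th in rows[0].find_all("th")]

by_position = pd.DataFrame(columns=columns)
by_id = pd.DataFrame(columns=columns)

for row in rows[1:]:
    df_row = [td.get_text(strip=True) for td in row.find_all("td")]
    # Append under the next integer label: 0, 1, ...
    by_position.loc[len(by_position)] = df_row
    # Append under the row's own id attribute instead.
    by_id.loc[row.get("id")] = df_row

print(by_id)
```

With the id as index, a row can be looked up directly, for example by_id.loc["row-b"].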
