Parsing an HTML table with pd.read_html where cells contain full tables themselves
Problem description
I need to parse a table from HTML that has other tables nested within the larger table. When called as below, pd.read_html parses each of these nested tables and then "inserts"/"concatenates" them as rows of the outer table.
I'd like each of these nested tables to be parsed into its own pd.DataFrame and then inserted as an object, as the value of the corresponding column.
If this is not possible, having the raw HTML of the nested table as a string in the corresponding position would be fine.
Code tested:
import pandas as pd
df_up = pd.read_html("up_pf00344.test.html", attrs = {'id': 'results'})
Screenshot of output:
Screenshot of table as rendered in HTML:
Link to file: https://gist.github.com/smsaladi/6adb30efbe70f9fed0306b226e8ad0d8#file-up_pf00344-test-html-L62
Recommended answer
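You can't use pd.read_html on its own here, because it flattens the nested tables into the rows of the outer one. The cells that hold a nested table have to be parsed separately.
Result of df.iloc[0].map(type):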
<class 'str'>
Entry <class 'str'>
Organism <class 'str'>
Protein names <class 'str'>
Gene names <class 'str'>
Length <class 'str'>
Cross-reference (Pfam) <class 'str'>
Cross-reference (InterPro) <class 'str'>
Taxonomic lineage IDs <class 'str'>
Subcellular location [CC] <class 'str'>
Signal peptide <class 'str'>
Transit peptide <class 'str'>
Topological domain <class 'pandas.core.frame.DataFrame'>
Transmembrane <class 'pandas.core.frame.DataFrame'>
Intramembrane <class 'pandas.core.frame.DataFrame'>
Sequence caution <class 'str'>
Caution <class 'str'>
Taxonomic lineage (SUPERKINGDOM) <class 'str'>
Taxonomic lineage (KINGDOM) <class 'str'>
Taxonomic lineage (PHYLUM) <class 'str'>
Cross-reference (RefSeq) <class 'str'>
Cross-reference (EMBL) <class 'str'>
e <class 'str'>
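One way to get per-cell types like those listed above is to walk the outer table by hand and call pd.read_html only on each nested table. The following is a minimal sketch under stated assumptions, not necessarily the answerer's own code: it assumes BeautifulSoup is installed alongside pandas (and a parser backend that pd.read_html can use, such as lxml), and the sample HTML, parse_cell, parse_outer_table and the keep_html flag are hypothetical names made up for illustration.

```python
from io import StringIO

import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical stand-in for the real file: an outer table with id="results"
# whose last column holds a nested table in every row.
HTML = """
<table id="results">
  <tr><th>Entry</th><th>Length</th><th>Transmembrane</th></tr>
  <tr id="P12345">
    <td>P12345</td><td>120</td>
    <td><table>
      <tr><th>start</th><th>end</th></tr>
      <tr><td>5</td><td>25</td></tr>
      <tr><td>40</td><td>60</td></tr>
    </table></td>
  </tr>
  <tr id="Q67890">
    <td>Q67890</td><td>98</td>
    <td><table>
      <tr><th>start</th><th>end</th></tr>
      <tr><td>10</td><td>30</td></tr>
    </table></td>
  </tr>
</table>
"""


def parse_cell(cell, keep_html=False):
    """Return the cell text, or the nested table as a DataFrame / raw HTML."""
    nested = cell.find("table")
    if nested is None:
        return cell.get_text(strip=True)
    if keep_html:
        # Fallback from the question: keep the nested table as a string.
        return str(nested)
    # read_html returns a list of DataFrames; only one table was passed in.
    return pd.read_html(StringIO(str(nested)))[0]


def parse_outer_table(html, table_id, keep_html=False):
    """Parse the outer table, keeping nested tables as single cell values."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", id=table_id)
    # Keep only rows that belong to the outer table, not to nested ones.
    rows = [tr for tr in table.find_all("tr")
            if tr.find_parent("table") is table]
    header, *body = rows
    columns = [c.get_text(strip=True)
               for c in header.find_all(["th", "td"], recursive=False)]
    records = []
    for row in body:
        cells = row.find_all(["th", "td"], recursive=False)
        records.append({col: parse_cell(cell, keep_html)
                        for col, cell in zip(columns, cells)})
    return pd.DataFrame(records, columns=columns)


df = parse_outer_table(HTML, "results")
print(df.iloc[0].map(type))
```

The sketch collects every row as a dict and builds the frame once at the end, rather than appending row by row. Passing keep_html=True stores the nested table's markup as a string instead of a DataFrame.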
Bonus: As your table rows have an id, you could use it as the index of your dataframe, with df.loc[row.get('id')] = df_row instead of df.loc[len(df)] = df_row.
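The difference between the two assignments can be seen on a small table without nested cells. This is only an illustrative sketch: it assumes BeautifulSoup for the row loop, and the sample HTML and the ids row_a and row_b are made up.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical two-row table whose <tr> elements carry an id attribute.
HTML = """
<table id="results">
  <tr><th>Entry</th><th>Length</th></tr>
  <tr id="row_a"><td>P12345</td><td>120</td></tr>
  <tr id="row_b"><td>Q67890</td><td>98</td></tr>
</table>
"""

soup = BeautifulSoup(HTML, "html.parser")
header, *body = soup.find("table", id="results").find_all("tr")
df = pd.DataFrame(columns=[th.get_text(strip=True)
                           for th in header.find_all("th")])

for row in body:
    df_row = [td.get_text(strip=True) for td in row.find_all("td")]
    # Label the new row with the <tr> id instead of the running count len(df).
    df.loc[row.get("id")] = df_row

# Rows can now be looked up by their HTML id.
print(df.loc["row_b", "Length"])
```

With len(df) as the label the index would simply be 0, 1, ...; with the row id, each record can be looked up by the identifier used in the page.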