Handle o:p tag in BeautifulSoup
Question
I am extracting some disease information from: http://people.dbmi.columbia.edu/~friedma/Projects/DiseaseSymptomKB/index.html
However, the data is contained inside a tag that I don't know how to handle.
One way I found is to use the find_all function, but is there a way to reach it with dotted access, something like tr.td.span.[o:p or similar]?
<td width="584" nowrap="" valign="top" style="width:438.0pt;padding:0in 5.4pt 0in 5.4pt;
height:12.75pt">
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:"Arial","sans-serif"">UMLS:C0008031_pain
chest
<o:p>&nbsp;</o:p>
</span>
</p>
</td>
Recommended answer
import pandas as pd

df = pd.read_html(
    "http://people.dbmi.columbia.edu/~friedma/Projects/DiseaseSymptomKB/index.html")[0]
df.to_csv("out.csv", index=False, header=False)
That covers the case where you want the full table.
To get only what you asked for, use:
import pandas as pd

df = pd.read_html(
    "http://people.dbmi.columbia.edu/~friedma/Projects/DiseaseSymptomKB/index.html")[0]
print(df[2][1:].values.tolist())
With bs4, use:
import requests
from bs4 import BeautifulSoup

r = requests.get(
    "http://people.dbmi.columbia.edu/~friedma/Projects/DiseaseSymptomKB/index.html")
soup = BeautifulSoup(r.text, 'html.parser')

# The cell text sits inside <p class="MsoNormal">; keep only the UMLS entries.
for item in soup.find_all("p", {'class': 'MsoNormal'}):
    text = item.get_text(strip=True)
    if text.startswith("UMLS"):
        print(text)
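On the dotted-access part of the question: "o:p" is not a valid Python identifier, so it cannot be written as an attribute such as span.o:p. Passing the name as a string to find() appears to be the closest equivalent. The sketch below is only an illustration: it parses a trimmed copy of the snippet from the question (the surrounding table/tr wrapper and the variable names are assumptions added here) using the same 'html.parser' backend as above.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the cell from the question; the <table>/<tr> wrapper
# is an assumption added so that dotted navigation has something to walk.
html = """
<table><tr><td>
<p class="MsoNormal"><span style="font-size:10.0pt">UMLS:C0008031_pain
chest
<o:p>&nbsp;</o:p>
</span>
</p>
</td></tr></table>
"""

soup = BeautifulSoup(html, "html.parser")

# Dotted access works for ordinary tag names...
span = soup.table.tr.td.p.span

# ...but "o:p" has to be looked up by its string name.
o_p = span.find("o:p")

# Collapse newlines and the non-breaking space into single spaces.
text = " ".join(span.get_text().split())

print(o_p.name)
print(text)
```

Other parsers may treat the "o:" prefix differently, so the tag name seen here should be taken as specific to 'html.parser'.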