Extract information from HTML tags into pandas
Problem description
I have a folder full of HTML files. I am trying to select the right HTML tags so that the citations print correctly; the only output I need is the publication number and title. So far, with help from various posts on SO, I have this:
from bs4 import BeautifulSoup
import pandas as pd

# Runs inside a loop over the HTML files in the folder
with open(filename, 'r', encoding='utf-8') as f:
    patent = f.read()
# print(filename)
soup = BeautifulSoup(patent, 'html.parser')
x = soup.select('tr[itemprop="backwardReferencesOrig"]')
backorigdf = pd.read_html(str(x))  # this call raises ValueError: No tables found
print(backorigdf.loc[:, ['Publication number', 'Title']])
However, I get the error message ValueError: No tables found. I want the citations from multiple HTML files in a pandas DataFrame so that the data is easier to analyse. Can someone tell me what I am doing wrong? This is the link to the HTML file: https://patents.google.com/patent/US4458945?oq=US4458945A. The file is saved as an HTML file on my computer, and I don't want to read it from the URL; I want the code to read from the saved HTML documents.
Recommended answer
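It would help to know the total number of results expected. The error most likely arises because the selector returns bare tr rows, while pd.read_html only parses complete table elements. In the following, I retrieve 25 unique results by using :contains to target the citation h2 elements and then moving to the adjacent table: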
from io import StringIO

import pandas as pd
import requests
from bs4 import BeautifulSoup as bs

r = requests.get('https://patents.google.com/patent/US4458945?oq=US4458945A')
soup = bs(r.content, 'lxml')
# For each citations heading, parse the table that follows it.
# Newer soupsieve releases prefer the spelling :-soup-contains over :contains.
df = pd.concat([pd.read_html(StringIO(str(t.find_next('table'))))[0]
                for t in soup.select('h2:contains("Citations", "Family Cites")')])
df.drop_duplicates(inplace=True)
df.sort_values(by=['Priority date'], inplace=True)
df.reset_index(drop=True, inplace=True)
print(df)
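Since the question asks for saved HTML files rather than a URL, the same idea can be applied to a local folder. The following is only a minimal sketch under some assumptions: the helper names citation_tables and citations_from_folder, the source_file column, and the *.html file pattern are hypothetical and not part of the original answer, and the saved pages are assumed to contain the same h2 headings followed by a table as the live page. It uses the :-soup-contains spelling, which requires a reasonably recent soupsieve; on older versions :contains may be needed instead.

```python
from io import StringIO
from pathlib import Path

import pandas as pd
from bs4 import BeautifulSoup


def citation_tables(html: str) -> pd.DataFrame:
    """Collect the tables that follow citation headings in one HTML document."""
    soup = BeautifulSoup(html, "html.parser")
    # Headings whose text mentions "Citations" or "Family Cites", as in the answer.
    headings = soup.select('h2:-soup-contains("Citations", "Family Cites")')
    frames = []
    for heading in headings:
        table = heading.find_next("table")
        if table is not None:
            # read_html needs a complete <table>; it returns a list of DataFrames.
            frames.append(pd.read_html(StringIO(str(table)))[0])
    if not frames:
        return pd.DataFrame()
    combined = pd.concat(frames, ignore_index=True).drop_duplicates()
    return combined.reset_index(drop=True)


def citations_from_folder(folder: str) -> pd.DataFrame:
    """Apply citation_tables to every .html file in a folder (hypothetical layout)."""
    frames = []
    for path in sorted(Path(folder).glob("*.html")):
        df = citation_tables(path.read_text(encoding="utf-8"))
        if not df.empty:
            # Record which file each row came from (assumed column name).
            df.insert(0, "source_file", path.name)
            frames.append(df)
    return pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()
```

Assuming the files sit in a folder named patents, calling citations_from_folder("patents") and then selecting the 'Publication number' and 'Title' columns should give the output the question asks for, provided the saved tables carry those headers.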