Extract information from HTML tags into pandas

Question

I have a folder full of HTML files. I am trying to pick the right HTML tags so that the citations print correctly; the only output I need is the publication number and title. So far I have put this together with help from various posts on SO:

import pandas as pd
from bs4 import BeautifulSoup

with open(filename, 'r', encoding='utf-8') as f:  # inside a loop over the HTML files in the folder
    patent = f.read()
    # print(filename)
    soup = BeautifulSoup(patent, 'html.parser')
    x = soup.select('tr[itemprop="backwardReferencesOrig"]')
    backorigdf = pd.read_html(str(x))  # ValueError: No tables found is raised here
    print(backorigdf.loc[:, ['Publication number', 'Title']])

But I get the error message ValueError: No tables found. I want the citations from the multiple HTML files in a pandas DataFrame so that the data is easier to analyse. Can someone tell me what I am doing wrong? This is the link to the HTML file: https://patents.google.com/patent/US4458945?oq=US4458945A. The file is saved as an HTML file on my computer and I don't want to read from the URL; I want the code to read from the saved HTML documents.

Recommended answer

It would help to know the total number of results expected. In the following I retrieve 25 unique results by using :contains to target the citations h2 elements and then moving to the table that follows each one:

from io import StringIO

from bs4 import BeautifulSoup as bs
import requests
import pandas as pd

r = requests.get('https://patents.google.com/patent/US4458945?oq=US4458945A')
soup = bs(r.content, 'lxml')
# For each matching h2, parse the first table that follows it.
# Newer soupsieve releases spell this selector :-soup-contains(...).
df = pd.concat([pd.read_html(StringIO(str(t.find_next('table'))))[0]
                for t in soup.select('h2:contains("Citations", "Family Cites")')])

df.drop_duplicates(inplace=True)
df.sort_values(by=['Priority date'], inplace=True)
df.reset_index(drop=True, inplace=True)
print(df)
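
The question asks for files saved on disk rather than a URL, so here is a rough sketch of the same idea applied to local HTML. pd.read_html looks for <table> elements, so a string that holds only <tr> rows gives it nothing to parse; the sketch instead walks from each citations heading to the table after it and reads the cells directly. The function name citations_from_html is hypothetical, and the sketch assumes the saved pages keep the same h2 headings and the 'Publication number' and 'Title' column headers as the live page.

```python
import pandas as pd
from bs4 import BeautifulSoup

WANTED = ["Publication number", "Title"]  # column headers named in the question


def citations_from_html(html: str) -> pd.DataFrame:
    """Return the WANTED columns from every table that follows a citations-style <h2>.

    Hypothetical helper: assumes the heading text contains "Citations" or
    "Family Cites" and that the first row of each table holds the column names.
    """
    soup = BeautifulSoup(html, "html.parser")
    frames = []
    for heading in soup.find_all("h2"):
        text = heading.get_text()
        if "Citations" not in text and "Family Cites" not in text:
            continue
        table = heading.find_next("table")
        if table is None:
            continue
        rows = [
            [cell.get_text(" ", strip=True) for cell in tr.find_all(["th", "td"])]
            for tr in table.find_all("tr")
        ]
        if len(rows) < 2:
            continue
        header, body = rows[0], rows[1:]
        # Skip rows (for example full-width separator rows) that do not match the header.
        body = [row for row in body if len(row) == len(header)]
        frames.append(pd.DataFrame(body, columns=header))
    if not frames:
        return pd.DataFrame(columns=WANTED)
    combined = pd.concat(frames, ignore_index=True).drop_duplicates()
    # reindex keeps only the wanted columns and avoids a KeyError if one is missing.
    return combined.reindex(columns=WANTED).reset_index(drop=True)
```

To cover a whole folder, one option is to call it once per file, for example citations_from_html(Path(name).read_text(encoding='utf-8')) for each saved page, and pd.concat the non-empty results.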
