Extract information from html tags into pandas

Views: 90 Published: 2021/6/13 20:26:15 Tags: python html pandas string beautifulsoup


Problem description

I have a folder full of HTML files. I am trying to pick the right HTML tags so that the citations print correctly; the output I need is just the publication number and title. So far I have this, put together with help from various posts on SO:

import pandas as pd
from bs4 import BeautifulSoup

with open(filename, 'r', encoding='utf-8') as f:  # runs inside a loop over the HTML files in the folder
    patent = f.read()
    # print(filename)
    soup = BeautifulSoup(patent, 'html.parser')
    x = soup.select('tr[itemprop="backwardReferencesOrig"]')
    backorigdf = pd.read_html(str(x))  # this call raises ValueError: No tables found
    print(backorigdf.loc[:, ['Publication number', 'Title']])

But I get the error message ValueError: No tables found. I want the citations from the multiple HTML files in a pandas DataFrame so that the data is easier to analyse. Can someone tell me what I am doing wrong? This is the link to the HTML file: https://patents.google.com/patent/US4458945?oq=US4458945A. The file is saved as an HTML file on my computer and I don't want to read it from the URL; I want the code to work from the saved HTML documents.

Recommended answer

It would help to know the total number of results expected. In the following, I retrieve 25 unique results by using :contains to target the citations h2 elements and then moving to the adjacent table:

from io import StringIO

from bs4 import BeautifulSoup as bs
import requests
import pandas as pd

r = requests.get('https://patents.google.com/patent/US4458945?oq=US4458945A')
soup = bs(r.content, 'lxml')
# For each citations heading, parse the table that follows it.
# ':-soup-contains' is the current name for the deprecated ':contains' selector.
df = pd.concat([pd.read_html(StringIO(str(t.find_next('table'))))[0]
                for t in soup.select('h2:-soup-contains("Citations", "Family Cites")')])

df.drop_duplicates(inplace=True)
df.sort_values(by=['Priority date'], inplace=True)
df.reset_index(drop=True, inplace=True)
print(df)
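
Since the question asks for saved HTML files rather than a URL, the same heading-then-adjacent-table idea could be applied to files on disk. Note that pd.read_html only parses <table> elements, which is likely why passing the bare <tr> rows produced "No tables found". The following is a minimal sketch, not the answer author's code: the helper names citations_from_html and citations_from_folder and the source_file column are hypothetical, and instead of pd.read_html it reads the table cells directly with BeautifulSoup so that only bs4 and pandas are needed. It assumes each table has a single header row of th cells and that the files are UTF-8 encoded.

```python
from pathlib import Path

import pandas as pd
from bs4 import BeautifulSoup


def citations_from_html(html: str) -> pd.DataFrame:
    """Return the rows of every table that follows a citations heading."""
    soup = BeautifulSoup(html, "html.parser")
    frames = []
    # ':-soup-contains' is the current spelling of the older ':contains'.
    for heading in soup.select('h2:-soup-contains("Citations", "Family Cites")'):
        table = heading.find_next("table")
        if table is None:
            continue
        # Assumption: the th cells of the table form one header row.
        header = [th.get_text(strip=True) for th in table.find_all("th")]
        records = [
            dict(zip(header, (td.get_text(" ", strip=True) for td in tr.find_all("td"))))
            for tr in table.find_all("tr")
            if tr.find("td") is not None  # skip the header row
        ]
        frames.append(pd.DataFrame(records))
    if not frames:
        # No citations heading found: return an empty frame with the wanted columns.
        return pd.DataFrame(columns=["Publication number", "Title"])
    return pd.concat(frames, ignore_index=True).drop_duplicates().reset_index(drop=True)


def citations_from_folder(folder: str) -> pd.DataFrame:
    """Combine the citations of every *.html file in a folder (hypothetical helper)."""
    frames = []
    for path in sorted(Path(folder).glob("*.html")):
        df = citations_from_html(path.read_text(encoding="utf-8"))
        # Record which file each row came from.
        frames.append(df.assign(source_file=path.name))
    if not frames:
        return pd.DataFrame()
    return pd.concat(frames, ignore_index=True)
```

Assuming the saved pages keep the same headings and column names as the live page, citations_from_folder('path/to/folder')[['Publication number', 'Title']] would then give just the two columns asked for in the question; the folder path here is a placeholder.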
