Extract a specific header from HTML using Beautiful Soup

Question

This is the patent example I am using: https://patents.google.com/patent/EP1208209A1/en?oq=medicinal+chemistry . Below is the code I used. I want the code to display only the "Cited By (3)" count, so that I know how many times this patent was cited. How can I get the output to show only the cited-by count, 3? Kindly help!

 
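from bs4 import BeautifulSoup

# `patent` is assumed to hold the HTML of the patent page, fetched earlier
soup = BeautifulSoup(patent, 'html.parser')
cited_section = soup.findAll({"h2": "Cited By"})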

print(cited_section)
The output I get is:

[<h2>Info</h2>, <h2>Links</h2>, <h2>Images</h2>, <h2>Classifications</h2>, <h2>Abstract</h2>, <h2>Description</h2>, <h2>Claims (<span itemprop="count">57</span>)</h2>, <h2>Priority Applications (5)</h2>, <h2>Applications Claiming Priority (1)</h2>, <h2>Related Parent Applications (1)</h2>, <h2>Publications (2)</h2>, <h2>ID=38925605</h2>, <h2>Family Applications (1)</h2>, <h2>Country Status (1)</h2>, <h2>Cited By (3)</h2>, <h2>Families Citing this family (12)</h2>, <h2>Citations (306)</h2>, <h2>Patent Citations (348)</h2>, <h2>Non-Patent Citations (23)</h2>, <h2>Cited By (4)</h2>, <h2>Also Published As</h2>, <h2>Similar Documents</h2>, <h2>Legal Events</h2>]

Recommended answer

The number of citations is created dynamically via JavaScript. However, you can count the elements with itemprop="forwardReferencesFamily" to get the count. For example:

import requests
from bs4 import BeautifulSoup


url = 'https://patents.google.com/patent/EP1208209A1/en?oq=medicinal+chemistry'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

print(len(soup.select('tr[itemprop="forwardReferencesFamily"]')))

Prints:

4
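
If the headers listed in the question's output are present in the HTML you parsed, another option might be to read the number straight out of the header text. The following is only a sketch under that assumption: the sample markup is copied from the question's output rather than fetched from the live page, and cited_by_count is a hypothetical helper name, not part of Beautiful Soup.

```python
import re
from bs4 import BeautifulSoup

# Sample markup copied from the headers shown in the question's output;
# the live page may differ.
html = """
<h2>Country Status (1)</h2>
<h2>Cited By (3)</h2>
<h2>Families Citing this family (12)</h2>
<h2>Citations (306)</h2>
"""


def cited_by_count(markup):
    """Return the number from the first 'Cited By (N)' header, or None."""
    soup = BeautifulSoup(markup, 'html.parser')
    pattern = re.compile(r'Cited By \((\d+)\)')
    # Check the text of each h2 header in document order.
    for header in soup.find_all('h2'):
        match = pattern.search(header.get_text())
        if match:
            return int(match.group(1))
    return None


print(cited_by_count(html))
```

With the sample markup above this prints 3. Note that the question's output contains two "Cited By" headers, (3) and (4); this sketch returns only the first one it finds.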
