Is there a way to find the most common span style with BeautifulSoup in Python?
Problem description
I need to process many PDFs with different styles, and I am assuming that the main content will use the most common span style.
Is there a way to find the most frequently occurring span style with BeautifulSoup in Python?
This is the command I use to find spans with one specific style, font-family: CBCDEE+ArialMT; font-size:12px:
    spans = soup.find_all('span',
                          attrs={'style': 'font-family: CBCDEE+ArialMT; font-size:12px'})
Is there a way to find the most common one? More generally, is there a way to list the span styles and count how often each one appears?
Many thanks.
Solution
You could use a Python Counter() (from the collections module) to count all of the different styles and then display the most_common() element as follows:
    from collections import Counter

    from bs4 import BeautifulSoup

    html = """
    <span style="font-family: CBCDEE+ArialMT; font-size:12px">1</span>
    <span style="font-family: CBCDEE+ArialMT; font-size:14px">2</span>
    <span style="font-family: CBCDEE+ArialMT; font-size:14px">3</span>
    <span style="font-family: CBCDEE+Arial; font-size:12px">4</span>
    <span style="font-family: CBCDEE+ArialMT; font-size:12px">5</span>"""
    soup = BeautifulSoup(html, "html.parser")

    # Tally how many times each exact style string occurs.
    style_counts = Counter()
    for span in soup.find_all('span', style=True):
        style_counts[span['style']] += 1

    # most_common(1) returns [(style, count)]; print only the style string.
    print(style_counts.most_common(1)[0][0])
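For this example, two styles are tied at two occurrences each. In Python 3.7 and later, most_common() lists tied elements in the order they were first encountered, so it would display: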
font-family: CBCDEE+ArialMT; font-size:12px
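The question also asks for the full list of styles with their counts, and the end goal is to pull out the main content. The following is a minimal sketch of both steps, not a definitive implementation. The helper names normalize_style and count_span_styles are hypothetical, and the whitespace normalization is an added assumption: PDF-to-HTML converters may write the same style with different spacing or a trailing semicolon, and an exact string comparison would count those as different styles. The sample HTML is extended with a sixth span for illustration.

```python
from collections import Counter

from bs4 import BeautifulSoup


def normalize_style(style):
    """Hypothetical helper: make equivalent style strings compare equal."""
    declarations = []
    for declaration in style.split(";"):
        if ":" not in declaration:
            continue  # skip empty pieces, e.g. after a trailing semicolon
        name, value = declaration.split(":", 1)
        # Lowercase the property name and collapse whitespace in the value.
        declarations.append(f"{name.strip().lower()}:{' '.join(value.split())}")
    return "; ".join(declarations)


def count_span_styles(soup):
    """Hypothetical helper: map each normalized span style to its count."""
    return Counter(
        normalize_style(span["style"]) for span in soup.find_all("span", style=True)
    )


html = """
<span style="font-family: CBCDEE+ArialMT; font-size:12px">1</span>
<span style="font-family: CBCDEE+ArialMT; font-size:14px">2</span>
<span style="font-family: CBCDEE+ArialMT; font-size:14px">3</span>
<span style="font-family: CBCDEE+Arial; font-size:12px">4</span>
<span style="font-family:CBCDEE+ArialMT; font-size: 12px;">5</span>
<span style="font-family: CBCDEE+ArialMT;font-size:12px">6</span>
"""
soup = BeautifulSoup(html, "html.parser")
style_counts = count_span_styles(soup)

# Every style with its count, most frequent first.
for style, count in style_counts.most_common():
    print(count, style)

# Text of the spans that use the most common style (assumed to be the main content).
main_style = style_counts.most_common(1)[0][0]
main_text = [
    span.get_text()
    for span in soup.find_all("span", style=True)
    if normalize_style(span["style"]) == main_style
]
print(main_text)
```

Calling most_common() without an argument returns every (style, count) pair, which gives the style list the question asks for. If the exact style strings are known to be consistent across a file, the normalization step can be dropped and span['style'] used directly.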