Is there a way to find the most appeared/common span style in beautifulsoup python?

Problem description

As I need to process many PDFs with different styles, I am assuming that the main content will be under the most frequently occurring span style.

Is there a way to find the most frequently occurring span style with BeautifulSoup in Python?

This is the command I use to find one specific span style, font-family: CBCDEE+ArialMT; font-size:12px:

 spans = soup.find_all('span',
                       attrs={'style': 'font-family: CBCDEE+ArialMT; font-size:12px'})

Is there any way to find the most common one? Or, more generally, is there a way to get the list of span styles and count how often each one appears?

Many thanks.

Solution

You could use a Python Counter() to count all of the different styles and then display the most_common() element as follows:

from bs4 import BeautifulSoup
from collections import Counter

html = """
    <span style="font-family: CBCDEE+ArialMT; font-size:12px">1</span>
    <span style="font-family: CBCDEE+ArialMT; font-size:14px">2</span>
    <span style="font-family: CBCDEE+ArialMT; font-size:14px">3</span>
    <span style="font-family: CBCDEE+Arial; font-size:12px">4</span>
    <span style="font-family: CBCDEE+ArialMT; font-size:12px">5</span>"""

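soup = BeautifulSoup(html, "html.parser")
style_counts = Counter()

# style=True matches only spans that actually carry a style attribute
for span in soup.find_all('span', style=True):
    style_counts[span['style']] += 1

# most_common(1) returns [(style, count)]; [0][0] picks out the style string
print(style_counts.most_common(1)[0][0])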

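In this example two styles tie with two occurrences each; on Python 3.7 and later, most_common() returns the one encountered first, so it would display: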

font-family: CBCDEE+ArialMT; font-size:12px
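
The question also asks for the full list of styles with their counts, and the stated goal is to pull out the main content once the dominant style is known. The following is a minimal sketch of both steps, not a definitive implementation. It assumes that style strings differing only in whitespace, letter case of property names, or declaration order should count as the same style. The helper normalize_style is a hypothetical name introduced here, and the sample HTML is made-up data extended from the example above. The helper splits on ';' naively, so a value that itself contains a semicolon would need a real CSS parser.

```python
from collections import Counter
from bs4 import BeautifulSoup

# Made-up sample data: span 5 has the same style as span 1, written differently.
html = """
<span style="font-family: CBCDEE+ArialMT; font-size:12px">1</span>
<span style="font-family: CBCDEE+ArialMT; font-size:14px">2</span>
<span style="font-family: CBCDEE+ArialMT; font-size:14px">3</span>
<span style="font-family: CBCDEE+Arial; font-size:12px">4</span>
<span style="font-family:CBCDEE+ArialMT;font-size:12px ">5</span>
<span style="font-family: CBCDEE+ArialMT; font-size:12px">6</span>"""


def normalize_style(style):
    """Hypothetical helper: canonicalise an inline style string so that
    whitespace and declaration order do not create separate Counter keys."""
    declarations = []
    for part in style.split(';'):
        if ':' not in part:
            continue  # skip empty fragments such as a trailing ';'
        name, value = part.split(':', 1)
        declarations.append((name.strip().lower(), value.strip()))
    return '; '.join(f'{name}: {value}' for name, value in sorted(declarations))


soup = BeautifulSoup(html, "html.parser")
spans = soup.find_all('span', style=True)

# Count each normalised style once per span.
style_counts = Counter(normalize_style(span['style']) for span in spans)

# Full list of styles with their counts, most frequent first.
for style, count in style_counts.most_common():
    print(count, style)

# Keep only the spans whose normalised style is the most common one.
main_style = style_counts.most_common(1)[0][0]
main_spans = [span for span in spans if normalize_style(span['style']) == main_style]
main_text = ' '.join(span.get_text(strip=True) for span in main_spans)
print(main_text)
```

Calling most_common() without an argument returns every (style, count) pair, which gives the style list the question asks for. Comparing normalised styles rather than passing the raw string to find_all avoids missing spans whose style attribute is spelled slightly differently.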
