Need to extract all the font sizes and the text using BeautifulSoup


Problem description

I have the following HTML file stored on my local system:

<span style="position:absolute; border: gray 1px solid; left:0px; top:50px; width:612px; height:792px;"></span>
<div style="position:absolute; top:50px;"><a name="1">Page 1</a></div>
<div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:45px; top:71px; width:322px; height:38px;"><span style="font-family: BAAAAA+DejaVuSans-Bold; font-size:30px">One
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:45px; top:104px; width:175px; height:40px;"><span style="font-family: BAAAAA+DejaVuSans-Bold; font-size:15px">Two</span><span style="font-family: CAAAAA+DejaVuSans; font-size:16px">: two txt
<br></span><span style="font-family: BAAAAA+DejaVuSans-Bold; font-size:15px">Three</span><span style="font-family: CAAAAA+DejaVuSans; font-size:16px">: Three txt
<br></span><span style="font-family: BAAAAA+DejaVuSans-Bold; font-size:15px">Four</span><span style="font-family: CAAAAA+DejaVuSans; font-size:16px">: Four txt
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:274px; top:144px; width:56px; height:19px;"><span style="font-family: BAAAAA+DejaVuSans-Bold; font-size:19px">FIVE
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:45px; top:171px; width:515px; height:44px;"><span style="font-family: CAAAAA+DejaVuSans; font-size:18px">five txt
<br>five txt2 
<br>five txt3
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:220px; top:223px; width:164px; height:19px;"><span style="font-family: BAAAAA+DejaVuSans-Bold; font-size:19px">SIX
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:44px; top:247px; width:489px; height:159px;"><span style="font-family: BAAAAA+DejaVuSans-Bold; font-size:17px">six txt
<br></span><span style="font-family: CAAAAA+DejaVuSans; font-size:18px">six txt2
<br>- six txt2
<br>• six txt3
<br>• six txt4 
<br>• six txt5
<br></span>

I need to extract all the font sizes that occur in this HTML file. I am using BeautifulSoup, but I only know how to extract the text.

I can extract the text using the following code:

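from bs4 import BeautifulSoup

# Read the file and parse it with an explicitly named parser.
with open('/home/usr/Downloads/files/output.html', 'r') as htmlData:
    soup = BeautifulSoup(htmlData, 'html.parser')

# Collect every text node in the document.
texts = soup.find_all(string=True)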

I need to extract the font size of each piece of text and store the font-text pairs in a list or array. Specifically, I want a data structure like [('One','30'),('Two','15')] and so on, where 30 comes from font-size:30px and 15 comes from font-size:15px.

The only problem is that I can't figure out how to get the font-size value. Any ideas?

Recommended answer

Hope this helps. I also suggest reading more of the BeautifulSoup documentation.

import re
from bs4 import BeautifulSoup

with open('/home/usr/Downloads/files/output.html', 'r') as htmlData:
    soup = BeautifulSoup(htmlData, 'html.parser')

# Keep only the spans whose markup mentions font-size.
font_spans = [data for data in soup.select('span') if 'font-size' in str(data)]
output = []
for i in font_spans:
    # group(2) is the value between "font-size:" and "px".
    fonts_size = re.search(r'(?is)(font-size:)(.*?)(px)', str(i.get('style'))).group(2)
    output.append((str(i.text).strip(), fonts_size.strip()))

print(output)
# [('One', '30'),('Two', '15'), ....]

If you want to drop the text values that contain txt, you can add the check if 'txt' not in i.text: inside the loop.
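
A minimal sketch of that filter, assuming a short inline snippet instead of the file on disk. The name `sample_html` and its contents are made up for illustration, and the regex is a simplified variant of the one in the answer:

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample markup, shortened from the question's file.
sample_html = (
    '<div><span style="font-family: A; font-size:15px">Two</span>'
    '<span style="font-family: B; font-size:16px">: two txt\n<br></span></div>'
)

soup = BeautifulSoup(sample_html, "html.parser")

filtered = []
for span in soup.select("span"):
    match = re.search(r"font-size:\s*(\d+)px", span.get("style", ""))
    # Skip spans without a font size and spans whose text contains "txt".
    if match and "txt" not in span.text:
        filtered.append((span.text.strip(), match.group(1)))

print(filtered)  # [('Two', '15')]
```

Note that the check is a plain substring test, so any text that happens to contain "txt" is dropped, not only the body lines in this particular file.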

Explanation:

First, identify the tags whose markup contains font-size:

font_spans = [ data for data in soup.select('span') if 'font-size' in str(data) ]
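
As a possible alternative for this step, a CSS attribute selector can match on the `style` attribute directly instead of stringifying each tag. This is only a sketch: it assumes the font size is always declared inline in `style`, as in the question's file, and `sample_html` is a made-up snippet.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: one span without a font size, one with.
sample_html = (
    '<span style="position:absolute; left:0px;"></span>'
    '<span style="font-family: A; font-size:30px">One</span>'
)
soup = BeautifulSoup(sample_html, "html.parser")

# [style*="font-size"] matches tags whose style attribute contains that substring.
font_spans = soup.select('span[style*="font-size"]')
print([span.text for span in font_spans])  # ['One']
```

Unlike `'font-size' in str(data)`, the selector looks only at the tag's own `style` attribute, so a span is not picked up merely because a nested child mentions font-size.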

Then iterate over font_spans and extract the font size and the text value:

textvalue = i.text # One,Two..
fonts_size = re.search(r'(?is)(font-size:)(.*?)(px)',str(i.get('style'))).group(2) # 30, 15, 16..
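
One thing to watch in the line above: `re.search` returns `None` when nothing matches, so calling `.group(2)` on it would raise an `AttributeError`. A defensive sketch of the same extraction follows; the helper name `font_size_from_style` and the slightly stricter pattern are illustrative choices, not part of the original answer.

```python
import re

# Matches e.g. "font-size:30px" or "font-size: 12.5px" and captures the number.
FONT_SIZE_RE = re.compile(r"font-size:\s*(\d+(?:\.\d+)?)px", re.IGNORECASE)

def font_size_from_style(style):
    """Return the font size as a string, or None if the style has none."""
    match = FONT_SIZE_RE.search(style or "")
    return match.group(1) if match else None

print(font_size_from_style("font-family: BAAAAA+DejaVuSans-Bold; font-size:30px"))  # 30
print(font_size_from_style("position:absolute; top:50px;"))  # None
```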

Finally, build a list that holds each text and font-size pair as a tuple.
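
Putting the three steps together, here is a self-contained sketch that can be run without the file on disk. The function name `extract_font_text_pairs` and the inline `sample_html` (a trimmed copy of the question's markup) are assumptions made for illustration.

```python
import re
from bs4 import BeautifulSoup

def extract_font_text_pairs(html):
    """Return (text, font_size) tuples for every span with an inline font-size."""
    soup = BeautifulSoup(html, "html.parser")
    pairs = []
    for span in soup.find_all("span"):
        # Step 1 and 2: look for a font size in the span's own style attribute.
        match = re.search(r"font-size:\s*(\d+)px", span.get("style", ""))
        if match:
            # Step 3: store the stripped text together with the size.
            pairs.append((span.get_text().strip(), match.group(1)))
    return pairs

sample_html = (
    '<div><span style="font-family: BAAAAA+DejaVuSans-Bold; font-size:30px">One\n<br></span></div>'
    '<div><span style="font-family: BAAAAA+DejaVuSans-Bold; font-size:15px">Two</span>'
    '<span style="font-family: CAAAAA+DejaVuSans; font-size:16px">: two txt\n<br></span></div>'
)

print(extract_font_text_pairs(sample_html))
# [('One', '30'), ('Two', '15'), (': two txt', '16')]
```

The sizes are kept as strings to match the structure requested in the question; converting with `int(match.group(1))` would be a reasonable change if numeric comparison is needed.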
