Need to extract all the font sizes and the text using beautifulsoup
Question
I have the following HTML file stored on my local system:
<span style="position:absolute; border: gray 1px solid; left:0px; top:50px; width:612px; height:792px;"></span>
<div style="position:absolute; top:50px;"><a name="1">Page 1</a></div>
<div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:45px; top:71px; width:322px; height:38px;"><span style="font-family: BAAAAA+DejaVuSans-Bold; font-size:30px">One
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:45px; top:104px; width:175px; height:40px;"><span style="font-family: BAAAAA+DejaVuSans-Bold; font-size:15px">Two</span><span style="font-family: CAAAAA+DejaVuSans; font-size:16px">: two txt
<br></span><span style="font-family: BAAAAA+DejaVuSans-Bold; font-size:15px">Three</span><span style="font-family: CAAAAA+DejaVuSans; font-size:16px">: Three txt
<br></span><span style="font-family: BAAAAA+DejaVuSans-Bold; font-size:15px">Four</span><span style="font-family: CAAAAA+DejaVuSans; font-size:16px">: Four txt
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:274px; top:144px; width:56px; height:19px;"><span style="font-family: BAAAAA+DejaVuSans-Bold; font-size:19px">FIVE
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:45px; top:171px; width:515px; height:44px;"><span style="font-family: CAAAAA+DejaVuSans; font-size:18px">five txt
<br>five txt2
<br>five txt3
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:220px; top:223px; width:164px; height:19px;"><span style="font-family: BAAAAA+DejaVuSans-Bold; font-size:19px">SIX
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:44px; top:247px; width:489px; height:159px;"><span style="font-family: BAAAAA+DejaVuSans-Bold; font-size:17px">six txt
<br></span><span style="font-family: CAAAAA+DejaVuSans; font-size:18px">six txt2
<br>- six txt2
<br>• six txt3
<br>• six txt4
<br>• six txt5
<br></span>
I need to extract all the font sizes that occur in this HTML file. I am using beautifulsoup, but I only know how to extract the text.
I can extract the text using the following code:
from bs4 import BeautifulSoup

with open('/home/usr/Downloads/files/output.html', 'r') as htmlData:
    soup = BeautifulSoup(htmlData, 'html.parser')
texts = soup.findAll(text=True)
I need to extract the font size of each piece of text and store each text/font-size pair in a list or array. Specifically, I want a data structure like [('One','30'),('Two','15')] and so on, where 30 comes from font-size:30px and 15 from font-size:15px.
The only problem is that I can't figure out how to get the font-size value. Any ideas?
Recommended answer
Hope this helps. I also suggest reading more of the BeautifulSoup documentation.
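import re
from bs4 import BeautifulSoup

# Parse the local HTML file, naming the parser explicitly.
with open('/home/usr/Downloads/files/output.html', 'r') as htmlData:
    soup = BeautifulSoup(htmlData, 'html.parser')

# Keep only the <span> tags whose markup mentions font-size.
font_spans = [data for data in soup.select('span') if 'font-size' in str(data)]
output = []
for i in font_spans:
    # Capture whatever sits between "font-size:" and "px" in the style attribute.
    fonts_size = re.search(r'(?is)(font-size:)(.*?)(px)', str(i.get('style'))).group(2)
    tup = (str(i.text).strip(), fonts_size.strip())
    output.append(tup)
print(output)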
[('One', '30'),('Two', '15'), ....]
If you want to exclude text values that contain txt, you can add the check if not 'txt' in i.text: inside the loop.
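As a minimal sketch of that filter, assuming the (text, size) pairs have already been collected into a list like the output above. The name `pairs` and its sample values are illustrative only, and the filter is applied after the loop rather than inside it:

```python
# Hypothetical sample of (text, font size) pairs, shaped like the output above.
pairs = [("One", "30"), ("Two", "15"), (": two txt", "16"), ("Three", "15")]

# Keep only the pairs whose text does not contain the substring "txt".
filtered = [(text, size) for text, size in pairs if "txt" not in text]

print(filtered)
```

Note that this is a plain substring test, so it would also drop any heading that happens to contain "txt".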
Explanation:
First, you need to identify the tags that contain font-size:
font_spans = [ data for data in soup.select('span') if 'font-size' in str(data) ]
Then you need to iterate over font_spans and extract the font size and text value:
textvalue = i.text # One,Two..
fonts_size = re.search(r'(?is)(font-size:)(.*?)(px)',str(i.get('style'))).group(2) # 30, 15, 16..
Finally, you need to create a list that contains all of your output as tuples.
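The three steps can also be put together in one self-contained sketch. This is only one possible arrangement, not the answer's exact code: it parses a short inline excerpt of the sample HTML instead of reading the file, and the names `html`, `FONT_SIZE_RE` and `extract_font_pairs` are made up for illustration. It also skips spans without a font-size instead of assuming the regex always matches.

```python
import re
from bs4 import BeautifulSoup

# A small inline excerpt of the sample HTML, used here instead of a file on disk.
html = (
    '<div><span style="font-family: BAAAAA+DejaVuSans-Bold; font-size:30px">One\n'
    '<br></span></div>'
    '<div><span style="font-family: BAAAAA+DejaVuSans-Bold; font-size:15px">Two</span>'
    '<span style="font-family: CAAAAA+DejaVuSans; font-size:16px">: two txt\n'
    '<br></span></div>'
)

# Matches e.g. "font-size:30px" or "font-size: 30px" and captures the number.
FONT_SIZE_RE = re.compile(r'font-size\s*:\s*(\d+(?:\.\d+)?)px', re.IGNORECASE)


def extract_font_pairs(markup):
    """Return (text, font_size) tuples for every span with an inline font-size."""
    soup = BeautifulSoup(markup, 'html.parser')
    pairs = []
    for span in soup.find_all('span'):
        match = FONT_SIZE_RE.search(span.get('style', ''))
        if match is None:
            continue  # this span has no inline font-size, so skip it
        pairs.append((span.get_text().strip(), match.group(1)))
    return pairs


print(extract_font_pairs(html))
```

Matching against the span's own style attribute, rather than the whole tag converted to a string, avoids picking up a font-size that belongs to a nested element.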