Extracting text between tags using BeautifulSoup


Problem description


Using BeautifulSoup, I am trying to extract text from a series of webpages that all follow a similar format. The HTML containing the text I wish to extract is below. The actual link is here: http://www.p2016.org/ads1/bushad120215.html.

 <p><span style="color: rgb(153, 153, 153);"></span><font size="-1">      <span
 style="font-family: Arial;"><big><span style="color: rgb(153, 153, 153);"></span></big></span></font><span style="color: rgb(153, 153, 153);"></span><font size="-1"><span style="font-family: Arial;"><big><span
 style="color: rgb(153, 153, 153);"></span></big></span></font><font
 size="-1"><span style="font-family: Arial;"><big><span style="color: rgb(153, 153, 153);"></span></big></span></font><font size="-1"><span style="font-family: Arial;"><big><span style="color: rgb(153, 153, 153);"></span></big></span></font></p>   <p><span style="color: rgb(153, 153, 153);">[Music]</span><span
 style="text-decoration: underline;"><br>
</span></p>
<p><small><span style="text-decoration: underline;">TEXT</span>: The
Medal of Honor is the highest award for valor in action against an
enemy force</small><span style="text-decoration: underline;"><br>
</span></p>
<p><span style="text-decoration: underline;">Col. Jay Vargas</span>:&nbsp;
We
were
completely
surrounded,
116 Marines locking heads with 15,000
North Vietnamese.&nbsp; Forty hours with no sleep, fighting hand to
hand.<span style="text-decoration: underline;"><br>
<span style="font-family: helvetica,sans-serif;"><br>
</span>

I'd like to find a way to iterate through all the HTML files in my folder and extract the text between all the tags. Here are the relevant sections of my code:

text = []

for page in pages:
    html_doc = codecs.open(page, 'r')
    soup = BeautifulSoup(html_doc, 'html.parser')
    for t in soup.find_all('<p>'):
        t = t.get_text()
        text.append(t.encode('utf-8'))
        print t

However, nothing is coming up. Apologies for the noob question and thanks in advance for your help.

Solution

The problem is in this line:

for t in soup.find_all('<p>'):

Just specify the tag name, not its representation:

for t in soup.find_all('p'):
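Folding that fix back into the loop from the question, a minimal corrected sketch might look like the following (the function name `extract_paragraphs` is added here for illustration; `pages` is assumed, as in the original code, to hold paths to local HTML files):

```python
# A minimal corrected sketch of the loop from the question.
import codecs
from bs4 import BeautifulSoup

def extract_paragraphs(pages):
    text = []
    for page in pages:
        with codecs.open(page, 'r', encoding='utf-8') as html_doc:
            soup = BeautifulSoup(html_doc, 'html.parser')
        # 'p' is the tag name; find_all('<p>') matches nothing
        for t in soup.find_all('p'):
            text.append(t.get_text())
    return text
```

Note that `get_text()` already returns a string, so there is no need to re-encode it unless you are writing bytes out on Python 2.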


Here is how you can narrow the search down to the dialogue paragraphs: the underlined spans hold the speaker labels, and the spoken text follows each span as a sibling node:

for span in soup.find_all("span", style="text-decoration: underline;"):
    text = span.next_sibling

    if text:
        print(span.text, text.strip())
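Run against a trimmed fragment modeled on the page (the inline `html` string below is an illustration, not the full source), that pattern pairs each speaker with the line that follows:

```python
from bs4 import BeautifulSoup

# Trimmed fragment modeled on the page from the question
html = ('<p><span style="text-decoration: underline;">Col. Jay Vargas</span>'
        ': We were completely surrounded</p>')
soup = BeautifulSoup(html, 'html.parser')

pairs = []
for span in soup.find_all("span", style="text-decoration: underline;"):
    text = span.next_sibling  # the NavigableString right after the span
    if text:
        pairs.append((span.text, text.strip()))
        print(span.text, text.strip())
```

`next_sibling` returns the node immediately after the closing `</span>`, which here is the plain text of the quote; the `if text:` guard skips spans with no following sibling, such as the ones wrapping only a `<br>`.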
