Extracting text between tags using BeautifulSoup

Views: 329 Posted: 2016/8/5 19:09:42 Tags: python regex web-scraping beautifulsoup bs4


Question

I am trying to extract text from a series of webpages that all follow a similar format using BeautifulSoup. The html for the text I wish to extract is below. The actual link is here: http://www.p2016.org/ads1/bushad120215.html.
 <big></big><big></big><big></big><big></big> [Music] 

TEXT: The
Medal of Honor is the highest award for valor in action against an
enemy force 

Col. Jay Vargas:&nbsp;
We
were
completely
surrounded,
116 Marines locking heads with 15,000
North Vietnamese.&nbsp; Forty hours with no sleep, fighting hand to
hand. 
 

I'd like to find a way to iterate through all the html files in my folder and extract the text between all the markers. I've included here the relevant sections of my code:
text = []

for page in pages:
    html_doc = codecs.open(page, 'r')
    soup = BeautifulSoup(html_doc, 'html.parser')
    for t in soup.find_all('<p>'):
        t = t.get_text()
        text.append(t.encode('utf-8'))
        print t
However, nothing is coming up. Apologies for the noob question and thanks in advance for your help.
Solution

for t in soup.find_all('<p>'):

Just specify the tag name, not its representation:
for t in soup.find_all('p'):
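A minimal sketch of the difference, written for Python 3: `find_all` compares its argument against tag names, so `'<p>'` matches nothing while `'p'` matches every paragraph. The `html` string and the variable names are illustrative assumptions, not taken from the page in the question.

```python
from bs4 import BeautifulSoup

# Illustrative HTML only; not copied from the page in the question.
html = """
<p><span style="text-decoration: underline;">TEXT</span>: The Medal of Honor</p>
<p><span style="text-decoration: underline;">Col. Jay Vargas</span>: We were surrounded.</p>
"""

soup = BeautifulSoup(html, "html.parser")

# find_all compares its argument against tag names; no tag is named "<p>".
wrong = soup.find_all("<p>")

# Passing the bare tag name matches both paragraphs.
paragraphs = [p.get_text() for p in soup.find_all("p")]

print(len(wrong))
for paragraph in paragraphs:
    print(paragraph)
```

This would explain why the original loop ran without errors but printed nothing: the result list was simply empty.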
Here is how you can narrow down the search to the dialogue paragraphs:
for span in soup.find_all("span", style="text-decoration: underline;"):
    text = span.next_sibling

    if text:
        print(span.text, text.strip())
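The loop above relies on `span.next_sibling` being a text node. If an underlined span is followed by another tag, or by nothing, a slightly more defensive version may be safer. The following is only a sketch for Python 3, assuming that a speaker label is an underlined `span` immediately followed by its text; the helper names `extract_dialogue` and `extract_from_folder`, the `*.html` folder layout, and the sample HTML are all hypothetical.

```python
from pathlib import Path

from bs4 import BeautifulSoup, NavigableString

UNDERLINE = "text-decoration: underline;"


def extract_dialogue(html):
    """Return (label, text) pairs for underlined spans followed by text."""
    soup = BeautifulSoup(html, "html.parser")
    pairs = []
    for span in soup.find_all("span", style=UNDERLINE):
        sibling = span.next_sibling
        # Skip spans followed by another tag or by nothing at all.
        if not isinstance(sibling, NavigableString):
            continue
        label = span.get_text(strip=True)
        # Collapse line breaks and non-breaking spaces into single spaces,
        # then drop the colon that separates the label from the text.
        text = " ".join(sibling.split()).lstrip(": ")
        if label and text:
            pairs.append((label, text))
    return pairs


def extract_from_folder(folder):
    """Apply extract_dialogue to every .html file in a folder (assumed layout)."""
    results = {}
    for path in sorted(Path(folder).glob("*.html")):
        html = path.read_text(encoding="utf-8", errors="replace")
        results[path.name] = extract_dialogue(html)
    return results


# Illustrative HTML only; modelled loosely on the excerpt in the question.
sample = """
<p><span style="text-decoration: underline;"><br></span></p>
<p><span style="text-decoration: underline;">Col. Jay Vargas</span>:&nbsp;
We
were
completely
surrounded.</p>
"""

for label, text in extract_dialogue(sample):
    print(f"{label}: {text}")
```

Checking the sibling's type avoids calling `strip()` on a tag, and `extract_from_folder` covers the "iterate through all the html files in my folder" part of the question without keeping file handles open.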