Python text parsing between two words
Question
I'm using BeautifulSoup and want to extract all the text between two words on a webpage.
For example, imagine the following website text:
This is the text of the webpage. It is just a string of a bunch of stuff and maybe some tags in between.
I want to pull out everything on the page that starts with "text" and ends with "bunch".
In this case I'd want only:
text of the webpage. It is just a string of a bunch
However, there may be multiple instances of this on a page.
What is the best way to do this?
Here is my current setup:
#!/usr/bin/env python
import re

from mechanize import Browser
from BeautifulSoup import BeautifulSoup

mech = Browser()
urls = [
    "http://ca.news.yahoo.com/forget-phoning-business-app-sends-text-instead-100143774--sector.html",
]

for url in urls:
    page = mech.open(url)
    html = page.read()
    soup = BeautifulSoup(html)
    text = soup.prettify()
    texts = soup.findAll(text=True)

    def visible(element):
        if element.parent.name in ['style', 'script', '[document]', 'head', 'title']:
            # If the parent of the element is any of those, ignore it
            return False
        elif re.match('<!--.*-->', str(element)):
            # If the element is an HTML comment, ignore it
            return False
        else:
            # Otherwise, return True as these are the elements we need
            return True

    # filter() keeps only the items of texts for which visible() returns True.
    # We use those to build our final list.
    visible_texts = filter(visible, texts)
    for line in visible_texts:
        print(line)
Recommended answer
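Since you're just parsing the text, a regular expression is all you need. The non-greedy .*? stops at the first "bunch" after each "text", and re.findall returns every such match on the page: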
import re
result = re.findall("text.*?bunch", text_from_web_page)
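As a rough illustration of how that pattern behaves, here is a minimal sketch run against the sample sentence from the question. The helper name between_words and the sample variable are hypothetical, introduced only for this example. It also assumes you want the two words treated literally (hence re.escape) and that matches may span line breaks (hence re.DOTALL), neither of which the answer above states.

```python
import re


def between_words(text, start, end):
    """Return every shortest span that begins with `start` and ends with `end`."""
    # re.escape keeps both words literal; .*? is non-greedy, so each match
    # stops at the nearest `end`; re.DOTALL lets '.' match newlines as well.
    pattern = re.escape(start) + r".*?" + re.escape(end)
    return re.findall(pattern, text, flags=re.DOTALL)


# Sample text taken from the question.
sample = (
    "This is the text of the webpage. It is just a string of a bunch "
    "of stuff and maybe some tags in between."
)

matches = between_words(sample, "text", "bunch")
print(matches)
# ['text of the webpage. It is just a string of a bunch']
```

Because re.findall scans the whole string, a page containing several such spans yields one list entry per span. Note that this matches "text" and "bunch" as substrings, so it would also match inside longer words such as "context"; wrapping each word in \b is one way to restrict it to whole words if that matters.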