Python Beautifulsoup get_text() not getting all text
Problem description
I'm trying to get all the text from an HTML tag using the BeautifulSoup get_text() method. I use Python 2.7 and BeautifulSoup 4.4.0. It works most of the time, but sometimes it returns only the first paragraph inside the tag, and I can't figure out why. Please see the following example.
from bs4 import BeautifulSoup
import urllib2
job_url = "http://www.indeed.com/viewjob?jk=0f5592c8191a21af"
site = urllib2.urlopen(job_url).read()
soup = BeautifulSoup(site, "html.parser")
text = soup.find("span", {"class": "summary"}).get_text()
print text
I want to get all the content from this Indeed job description. Basically, I want all the text inside the span with class "summary". However, with the code above I only get "Please note that this is a 1 year contract assignment. Candidates cannot start an assignment until background check and drug test is completed". Why am I losing the rest of the text? How can I get all the text from this tag without specifying sub-tags?
Thanks a lot.
Try a different parser, such as lxml, instead of html.parser:
Replace:
soup = BeautifulSoup(site, "html.parser")
with:
soup = BeautifulSoup(site, "lxml")
Make sure you have the lxml parser installed first: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser
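To check whether the parser is really what truncates the text, it can help to run the same lookup with each parser that is installed and compare the results. The following is only a minimal sketch written for Python 3: the helper name summary_text_by_parser and the sample markup are made up for illustration and are not taken from the Indeed page. Parsers that are not installed are skipped.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def summary_text_by_parser(html, parsers=("lxml", "html5lib", "html.parser")):
    """Return {parser name: text of <span class="summary">} for each installed parser.

    Hypothetical helper for comparing parsers; not part of BeautifulSoup.
    """
    results = {}
    for name in parsers:
        try:
            soup = BeautifulSoup(html, name)
        except FeatureNotFound:
            # This parser is not installed, so skip it.
            continue
        tag = soup.find("span", {"class": "summary"})
        results[name] = tag.get_text() if tag is not None else None
    return results


# Made-up sample markup: block-level tags nested inside a span.
sample = (
    '<div><span class="summary">'
    "<p>First paragraph.</p><p>Second paragraph.</p>"
    "</span></div>"
)

for name, text in summary_text_by_parser(sample).items():
    print(name, len(text or ""), repr(text))
```

If one parser returns noticeably less text than the others for the real page, the markup is probably malformed in a way that parser handles differently, and switching parsers as described above is a reasonable fix.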