Python BeautifulSoup get_text() not getting all text


Problem description


I'm trying to get all the text from an HTML tag using BeautifulSoup's get_text() method. I use Python 2.7 and BeautifulSoup 4.4.0. It works most of the time. However, sometimes this method returns only the first paragraph inside the tag, and I can't figure out why. Please see the following example.

from bs4 import BeautifulSoup
import urllib2

job_url = "http://www.indeed.com/viewjob?jk=0f5592c8191a21af"
site = urllib2.urlopen(job_url).read()
soup = BeautifulSoup(site, "html.parser")
text = soup.find("span", {"class": "summary"}).get_text()
print text

I want to get all the content of this Indeed job description. Basically, I want all the text inside the <span class="summary"> tag. However, with the code above I only get "Please note that this is a 1 year contract assignment. Candidates cannot start an assignment until background check and drug test is completed". Why am I losing the rest of the text? How can I get all the text from this tag without specifying sub-tags?

Thanks a lot.

Solution

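Try a different parser, such as lxml, instead of html.parser. Parsers repair invalid markup differently, so on a malformed page one parser may close the tag early while another keeps the rest of the text.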

Replace:

soup = BeautifulSoup(site, "html.parser")

with:

soup = BeautifulSoup(site, "lxml")

Make sure you have the lxml parser installed first: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser
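
If lxml may not be installed on every machine, the parser swap above can be wrapped in a small fallback. The following is only a sketch, written for Python 3 rather than the Python 2.7 used in the question. The helper names pick_parser and summary_text are made up for illustration and are not part of BeautifulSoup, and the separator and strip arguments to get_text() are an optional extra, not something the answer requires.

```python
import importlib.util


def pick_parser(preferred=("lxml", "html5lib")):
    """Return the first parser in `preferred` that is installed.

    Falls back to "html.parser", which ships with Python.
    (Hypothetical helper, not a BeautifulSoup function.)
    """
    for name in preferred:
        # find_spec returns None when a top-level module is not installed.
        if importlib.util.find_spec(name) is not None:
            return name
    return "html.parser"


def summary_text(html, parser=None):
    """Return all text inside the first <span class="summary">, or None if absent."""
    # Imported here so pick_parser() stays usable without bs4 installed.
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, parser or pick_parser())
    tag = soup.find("span", {"class": "summary"})
    if tag is None:
        return None
    # A separator keeps text from adjacent child tags from running together.
    return tag.get_text(separator="\n", strip=True)
```

Calling summary_text(site) with the page source from the question would use lxml when it is available. Passing parser="html.parser" explicitly makes it easy to compare what each parser returns for the same page.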
