BeautifulSoup returns unexpected extra spaces

Problem description

I am trying to grab some text from HTML documents with BeautifulSoup. In a case that is very relevant for me, it produces a strange and interesting result: after a certain point, the soup is full of extra spaces within the text (a space separates every letter from the next one). I searched the web for a reason, but I only found reports of the opposite bug (no spaces at all).

Do you have any suggestions or hints on why this happens, and how to solve it?

This is the very basic code that I wrote:

from bs4 import BeautifulSoup

import urllib2
html = urllib2.urlopen("http://www.beppegrillo.it")
prova = html.read()
soup = BeautifulSoup(prova)
print soup

And this is a line taken from the output, the line where the problem starts to appear:

value=\"Giuseppe labbate ogm? non vorremmo nuovi uccelli chiamati lontre\"><input onmouseover=\"Tip('<cen t e r c l a s s = \ \ ' t i t l e _ v i d e o \ \ ' > < b > G i u s e p p e l a b b a t e o g m ? n o n v o r r e m m o n u o v i u c c e l l i c h i a m a t i l o n t r e <

Recommended answer

I believe this is a bug in lxml's HTML parser. Try:

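# Python 2, matching the code in the question.
from bs4 import BeautifulSoup
import urllib2

html = urllib2.urlopen("http://www.beppegrillo.it")
prova = html.read()
# Workaround: relabel the encoding declared in the page before parsing.
soup = BeautifulSoup(prova.replace('ISO-8859-1', 'utf-8'))
print soup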
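
For readers on Python 3, where `read()` returns bytes, the same idea might look like the sketch below. The helper names `relabel_charset` and `fetch_soup` are made up for illustration. Unlike the plain `replace` call, this variant only touches the label that follows `charset=` and ignores case. Passing `"lxml"` as the parser is an assumption based on the diagnosis above. Like the original workaround, it only changes the declared label and does not transcode the bytes.

```python
import re

# Matches the charset label in a declaration such as
#   <meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
#   <meta charset="ISO-8859-1">
CHARSET_RE = re.compile(rb'(charset\s*=\s*["\']?)iso-8859-1', re.IGNORECASE)


def relabel_charset(raw: bytes, new_label: bytes = b"utf-8") -> bytes:
    """Return raw with a declared ISO-8859-1 charset replaced by new_label."""
    # A function replacement avoids backslash handling in the new label.
    return CHARSET_RE.sub(lambda m: m.group(1) + new_label, raw)


def fetch_soup(url: str):
    """Fetch url and parse it after relabelling the declared charset."""
    from urllib.request import urlopen
    from bs4 import BeautifulSoup  # third-party package: beautifulsoup4

    with urlopen(url) as response:
        raw = response.read()  # bytes in Python 3
    return BeautifulSoup(relabel_charset(raw), "lxml")


# Example call (performs a network request):
# soup = fetch_soup("http://www.beppegrillo.it")
```

Only `relabel_charset` is needed if the page bytes are already in hand.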

Replacing the declared encoding this way is a workaround for the problem. I believe the issue was fixed in lxml 3.0 alpha 2 and lxml 2.3.6, so it may be worth checking whether you need to upgrade to a newer version.
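
To see whether an upgrade is needed, a rough check along these lines could help. `installed_lxml_version` and `needs_upgrade` are hypothetical helper names; `etree.LXML_VERSION` is the version tuple that lxml itself exposes. The sketch only compares against the 2.3.6 release mentioned above, takes that version number on trust, and does not try to tell 3.0 pre-releases apart.

```python
FIXED_IN = (2, 3, 6)  # release named in the answer; not independently verified


def installed_lxml_version():
    """Return lxml's version as a tuple of ints, or None if lxml is missing."""
    try:
        from lxml import etree  # third-party package: lxml
    except ImportError:
        return None
    return tuple(etree.LXML_VERSION)


def needs_upgrade(version, fixed_in=FIXED_IN):
    """True if version (a tuple of ints) is older than fixed_in."""
    return tuple(version)[:len(fixed_in)] < fixed_in


version = installed_lxml_version()
if version is None:
    print("lxml is not installed")
else:
    print("lxml", version, "needs upgrade:", needs_upgrade(version))
```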

If you want more info on the bug, it was initially filed here:

https://bugs.launchpad.net/beautifulsoup/+bug/972466

Hope this helps,

Hayden
