BeautifulSoup returns unexpected extra spaces

Question

I am trying to grab some text from HTML documents with BeautifulSoup. In a case that is very relevant to me, it produces a strange and interesting result: after a certain point, the soup is full of extra spaces within the text (a space separates every letter from the following one). I searched the web for a reason, but I found only reports of the opposite bug (no spaces at all).

Do you have any suggestions or hints as to why this happens and how to solve it?

This is the very basic code that I wrote:

from bs4 import BeautifulSoup

import urllib2
html = urllib2.urlopen("http://www.beppegrillo.it")
prova = html.read()
soup = BeautifulSoup(prova)
print soup

And this is a line taken from the output, the line where the problem starts to appear:

value="Giuseppe labbate ogm? non vorremmo nuovi uccelli chiamati lontre"><input onmouseover="Tip('<cen t e r c l a s s = ' t i t l e _ v i d e o ' > < b > G i u s e p p e l a b b a t e o g m ? n o n v o r r e m m o n u o v i u c c e l l i c h i a m a t i l o n t r e <

Recommended answer

I believe this is a bug in lxml's HTML parser. Try:

from bs4 import BeautifulSoup

import urllib2
html = urllib2.urlopen("http://www.beppegrillo.it")
prova = html.read()
soup = BeautifulSoup(prova.replace('ISO-8859-1', 'utf-8'))
print soup

This is a workaround for the problem. I believe the issue was fixed in lxml 3.0 alpha 2 and lxml 2.3.6, so it could be worth checking whether you need to upgrade to a newer version.
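The workaround above rewrites the charset name in the raw page before parsing. On Python 3, `read()` returns bytes, so the replacement has to operate on bytes as well. Below is a minimal sketch, assuming the `ISO-8859-1` string being replaced sits in a `<meta>` charset declaration (the thread does not confirm this). `rewrite_declared_charset` and the sample document are hypothetical illustrations, not part of BeautifulSoup or lxml, and only the standard library is used:

```python
import re

# Captures everything up to the charset value inside a <meta ...> tag (group 1)
# and the charset name itself (group 2). Matching is case-insensitive.
_META_CHARSET = re.compile(
    rb'(<meta[^>]*charset\s*=\s*["\']?)([\w.:-]+)', re.IGNORECASE
)


def rewrite_declared_charset(raw_html, new_charset=b"utf-8"):
    """Return raw_html with the first <meta> charset name replaced.

    Hypothetical helper: it only edits the declaration, it does not
    re-encode the document body.
    """
    return _META_CHARSET.sub(
        lambda m: m.group(1) + new_charset, raw_html, count=1
    )


# Made-up sample page standing in for the downloaded bytes.
sample = (
    b'<html><head><meta http-equiv="Content-Type" '
    b'content="text/html; charset=ISO-8859-1"></head>'
    b'<body>ciao</body></html>'
)
patched = rewrite_declared_charset(sample)
print(patched)
```

The patched bytes could then be handed to `BeautifulSoup(...)` in place of the `prova.replace(...)` expression. Unlike a plain `replace`, this touches only the first `<meta>` declaration rather than every occurrence of the string in the page.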

If you want more information on the bug, it was initially filed here:

https://bugs.launchpad.net/beautifulsoup/+bug/972466
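To see whether an installed lxml is at least one of the versions the answer names as fixed (2.3.6 or 3.0 alpha 2), the version tuple can be compared directly. This is a rough sketch: `lxml_has_fix` is a hypothetical helper, and it makes the conservative assumption that only final 3.0 releases and later count as fixed in the 3.x line, since it does not try to tell 3.0 pre-releases apart. `etree.LXML_VERSION` is lxml's own version tuple; the import is guarded so the snippet still runs when lxml is absent.

```python
def lxml_has_fix(version):
    """Return True if a version tuple is at least a release named as fixed.

    Assumption: 2.x needs >= 2.3.6; 3.x and later need a final release
    (>= 3.0.0). 3.0 pre-releases are conservatively reported as False.
    """
    v = tuple(version[:3])
    if v[0] < 3:
        return v >= (2, 3, 6)
    return v >= (3, 0, 0)


try:
    from lxml import etree
except ImportError:
    print("lxml is not installed")
else:
    print(etree.LXML_VERSION, lxml_has_fix(etree.LXML_VERSION))
```

If the check comes back false, upgrading lxml is the route the answer suggests; until then the charset workaround remains an option.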

Hope this helps,

Hayden
