BeautifulSoup soup.prettify() gives strange output


Question

I'm trying to parse a web site so that I can use the data later in my Django project. To do that, I'm using urllib2 and BeautifulSoup4. However, I can't get what I want: the output of the BeautifulSoup object is weird. I tried different pages and they worked (the output was normal), so I thought the page was the cause. But when my friend tried the same thing on the same page, he got normal output. I haven't been able to figure out the problem.

This is the web site I want to parse.

This is an example of the weird output after calling "soup.prettify()":

t   d       B   G   C   O   L   O   R   =   "   #   9   9   0   4   0   4   "       w   i   d   t   h   =   "   3   "   &gt;   i   m   g       S   R   C   =   "   1   p   .   g   i   f   "       A   L   T       B   O   R   D   E   R   =   "   0   "       h   e   i   g   h   t   =   "   1   "       w   i   d   t   h   =   "   3   "   &gt;   /   t   d   &gt;   \n           /   t   r   &gt;   \n           t   r   &gt;   \n                   t   d       c   o   l   s   p   a   n   =   "   3   "       B   G   C   O   L   O   R   =   "   #   9   9   0   4   0   4   "       w   i   d   t   h   =   "   6   0   0   "       h   e   i   g   h   t   =   "   3   "   &gt;   i   m   g       s   r   c   =   "   1   p   .   g   i   f   "       w   i   d   t   h   =   "   6   0   0   "   \n                   h   e   i   g   h   t   =   "   1   "   &gt;   /   t   d   &gt;   \n           /   t   r   &gt;   \n   /   t   a   b   l   e   &gt;   \n   /   c   e   n   t   e   r   &gt;   /   d   i   v   &gt;   \n   \n   p   &gt;   &amp;n   b   s   p   ;   &amp;n   b   s   p   ;   &amp;n   b   s   p   ;   &amp;n   b   s   p   ;   /   p   &gt;   \n   /   b   o   d   y   &gt;   \n   /   h   t   m   l   &gt;\n  </p>\n </body>\n</html>'

Recommended answer

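Here is a minimal example that works for me, including the snippet of HTML that you are having a problem with. It's hard to tell without your code, but my best guess is that you were doing something like ''.join(A.split()) somewhere.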

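# Python 2 (urllib2 and the print statement do not exist in Python 3)
import urllib2
import bs4

url = "http://kafemud.bilkent.edu.tr/monu_tr.html"

# Fetch the page and keep the raw bytes, with no string manipulation
req = urllib2.urlopen(url)
raw = req.read()

# Hand the raw bytes straight to BeautifulSoup
soup = bs4.BeautifulSoup(raw)

# prettify() returns unicode, so encode it before printing
print soup.prettify().encode('utf-8')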

gives:

....
<td bgcolor="#990404" width="3">
       <img alt="" border="0" src="1p.gif" width="3"/>
      </td>
      <td bgcolor="#FFFFFF" valign="TOP">
       <div align="left">
        <table align="left" border="0" cellpadding="10" cellspacing="0" valign="TOP" width="594">
         <tr>
          <td align="left" valign="top">
           <table align="left" border="0" cellpadding="0" cellspacing="0" class="icerik" width="574">
....
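
The guess about string handling can be illustrated without fetching anything. The following is a minimal sketch in Python 3 syntax (the example above is Python 2); `snippet`, `collapsed`, and `spaced` are made-up names, and nothing here is known to match the asker's actual code. It shows that ''.join(A.split()) removes all whitespace, while calling join directly on a string places the separator between every pair of characters, which resembles the output in the question.

```python
# Hypothetical markup, loosely modeled on a tag from the question's output.
snippet = '<td BGCOLOR="#990404" width="3">'

# split() with no arguments splits on any run of whitespace, so joining the
# pieces with an empty string removes every space, tab, and newline.
collapsed = ''.join(snippet.split())

# Joining a string directly iterates over its characters, so the separator
# ends up between every pair of characters, including around real spaces.
spaced = '   '.join(snippet)

print(collapsed)
print(spaced)
```

If a script calls join on a whole string rather than on a list of pieces, that line is worth checking first.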
