Beautifulsoup functionality not working properly in specific scenario


Question

I am trying to read in the following url using urllib2: http://frcwest.com/ and then search the data for the meta redirect.
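
For context, this is the setup the question describes; a minimal sketch, assuming Python 2 (since urllib2 is used) and BeautifulSoup 4 with its default parser:

    import urllib2
    from bs4 import BeautifulSoup

    # Fetch the page and hand the raw markup to BeautifulSoup.
    html = urllib2.urlopen('http://frcwest.com/').read()
    soup = BeautifulSoup(html)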

It reads in the following data:

   <!--?xml version="1.0" encoding="UTF-8"?--><!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
   <html xmlns="http://www.w3.org/1999/xhtml"><head><title></title><meta content="0;url= Home.html" http-equiv="refresh"/></head><body></body></html>

Reading it into Beautifulsoup works fine. However, for some reason none of the functionality works for this specific scenario, and I don't understand why. Beautifulsoup has worked great for me in all other scenarios. However, when simply trying:

    soup.findAll('meta')

there are no results.

My eventual goal is to run:

    soup.find("meta",attrs={"http-equiv":"refresh"})

But if:

    soup.findAll('meta')

isn't even working, then I'm stuck. Any insight into this mystery would be appreciated, thanks!
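
For reference, a minimal, self-contained reproduction of the symptom (a sketch assuming BeautifulSoup 4's default parser; on the asker's setup the final call returns an empty list):

    from bs4 import BeautifulSoup

    # The exact markup served by http://frcwest.com/, as quoted above.
    html = ('<!--?xml version="1.0" encoding="UTF-8"?-->'
            '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"'
            ' "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">'
            '<html xmlns="http://www.w3.org/1999/xhtml"><head><title></title>'
            '<meta content="0;url= Home.html" http-equiv="refresh"/>'
            '</head><body></body></html>')

    soup = BeautifulSoup(html)
    print(soup.findAll('meta'))  # [] on the asker's setup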

Answer

It's the comment and doctype that throw the parser off here, and subsequently BeautifulSoup.

Even the HTML tag seems 'gone':

>>> soup.find('html') is None
True

Yet it is still there in the .contents iterable. You can find things again with:

# The <html> element ends up as a direct child of the soup, a sibling
# of the comment and the doctype, so scan the top-level elements for it.
for elem in soup:
    if getattr(elem, 'name', None) == u'html':
        soup = elem
        break

soup.find_all('meta')

Demo:

>>> for elem in soup:
...     if getattr(elem, 'name', None) == u'html':
...         soup = elem
...         break
... 
>>> soup.find_all('meta')
[<meta content="0;url= Home.html" http-equiv="refresh"/>]
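
With the <meta> tag recovered, the redirect target the asker was after can be pulled out of its content attribute. A minimal sketch (the string handling here is illustrative, not part of the original answer):

    # content looks like "0;url= Home.html": a delay, a semicolon,
    # then url=<target>.
    meta = soup.find('meta', attrs={'http-equiv': 'refresh'})
    delay, _, rest = meta['content'].partition(';')
    url = rest.split('=', 1)[1].strip()
    print(url)  # -> Home.html

Alternatively, a more lenient parser may cope with the stray comment before the doctype without the workaround above, e.g. BeautifulSoup(html, 'html5lib') if html5lib is installed.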
