Beautifulsoup functionality not working properly in specific scenario
Question
I am trying to read in the following URL using urllib2: http://frcwest.com/, and then search the data for the meta redirect.

It reads in the following data:
<!--?xml version="1.0" encoding="UTF-8"?--><!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"><head><title></title><meta content="0;url= Home.html" http-equiv="refresh"/></head><body></body></html>
Reading it into BeautifulSoup works fine. However, for some reason none of the functionality works in this specific scenario, and I don't understand why. BeautifulSoup has worked great for me in all other scenarios. But when simply trying:
soup.findAll('meta')
there are no results.
My eventual goal is to run:
soup.find("meta",attrs={"http-equiv":"refresh"})
but if:
soup.findAll('meta')
isn't even working, then I'm stuck. Any insight into this mystery would be appreciated. Thanks!
Answer
It's the comment and doctype that throw off the parser here, and subsequently BeautifulSoup.
Even the html tag seems 'gone':
>>> soup.find('html') is None
True
Yet it is still there in the .contents iterable. You can find things again with:
for elem in soup:
    if getattr(elem, 'name', None) == u'html':
        soup = elem
        break

soup.find_all('meta')
Demo:
>>> for elem in soup:
...     if getattr(elem, 'name', None) == u'html':
...         soup = elem
...         break
...
>>> soup.find_all('meta')
[<meta content="0;url= Home.html" http-equiv="refresh"/>]
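Once the meta refresh tag has been located, the redirect target still has to be pulled out of its content attribute, which has the form "<delay>;url=<target>" and, as in this page, may contain stray whitespace. Below is a minimal sketch of such a helper; the function name parse_refresh is my own, not part of BeautifulSoup:

```python
def parse_refresh(content):
    """Split a meta-refresh content value into (delay, url).

    Handles values like "0;url= Home.html": the part before the
    semicolon is the delay in seconds, the rest is a "url=" prefix
    (matched case-insensitively) followed by the target.
    """
    delay, _, target = content.partition(';')
    target = target.strip()
    if target.lower().startswith('url='):
        # Drop the "url=" prefix and any whitespace around the target.
        target = target[4:].strip()
    return int(delay.strip()), target

print(parse_refresh("0;url= Home.html"))  # (0, 'Home.html')
```

From there the target can be joined against the page URL (e.g. with urlparse.urljoin) to follow the redirect.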