BeautifulSoup fails to parse long view state
Problem description
I am trying to use BeautifulSoup4 to parse the HTML retrieved from http://exporter.nih.gov/ExPORTER_Catalog.aspx?index=0. If I print the resulting soup, it ends like this:
kZXI9IjAi"/></form></body></html>
Searching the raw HTML for the last characters, 9IjAi, I found that they sit in the middle of a huge view state. BeautifulSoup seems to have a problem with this. Any hint as to what I might be doing wrong, or how to parse such a page?
BeautifulSoup uses a pluggable HTML parser to build the 'soup'; you need to try out different parsers, as each will treat a broken page differently.
I had no problems parsing that page with any of the parsers, however:
>>> from bs4 import BeautifulSoup  # the package is beautifulsoup4, the module is bs4
>>> import requests
>>> r = requests.get('http://exporter.nih.gov/ExPORTER_Catalog.aspx?index=0')
>>> for parser in ('html.parser', 'lxml', 'html5lib'):
...     # show the last 60 characters of the serialized soup for each parser
...     print(repr(str(BeautifulSoup(r.text, parser))[-60:]))
...
';
pageTracker._trackPageview();
</script>
</body>
</html>
'
'();
pageTracker._trackPageview();
</script>
</body></html>'
'();
pageTracker._trackPageview();
</script>
</body></html>'
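The check the asker did by hand (finding the soup's last characters in the raw HTML) can be sketched offline. This is only an illustration under stated assumptions: `locate_cutoff`, the synthetic page, and the `id="after"` paragraph are made up for the example and are not taken from the real NIH page. The BeautifulSoup part runs only if `bs4` is installed.

```python
def locate_cutoff(raw_html, probe):
    """Find where `probe` (the last characters of the printed soup) sits in the raw page.

    Returns (offset_after_probe, characters_remaining), or None if the
    probe does not occur in raw_html.
    """
    start = raw_html.find(probe)
    if start == -1:
        return None
    end = start + len(probe)
    return end, len(raw_html) - end


# Synthetic stand-in for a page with a large hidden view state field
# (hypothetical content, not the real page).
viewstate = "A" * 1000 + "9IjAi" + "B" * 1000
raw = (
    '<html><body><form>'
    '<input type="hidden" name="__VIEWSTATE" value="' + viewstate + '"/>'
    '</form><p id="after">content after the form</p></body></html>'
)

cutoff = locate_cutoff(raw, "9IjAi")
if cutoff is not None:
    offset, remaining = cutoff
    print("probe ends at offset %d, %d characters of raw HTML follow" % (offset, remaining))

# Optional: confirm that a parser keeps the content after the view state.
try:
    from bs4 import BeautifulSoup
except ImportError:
    BeautifulSoup = None

if BeautifulSoup is not None:
    soup = BeautifulSoup(raw, "html.parser")
    print("content after the form survived:", soup.find(id="after") is not None)
```

If a large number of characters remain after the probe, the parser stopped early rather than the download being cut short, which points at the parser or library version rather than at the request.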
Make sure you have the latest BeautifulSoup4 package installed; I have seen consistent problems in the 4.1 series that were solved in 4.2.
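A quick way to see which version is installed is to read `bs4.__version__`. The sketch below assumes a simple dotted version string; `version_tuple` and `needs_upgrade` are hypothetical helpers written for this example, and the 4.2 threshold comes from the observation above rather than from any official changelog.

```python
import re


def version_tuple(version):
    """Turn a dotted version string such as '4.1.3' into a tuple of ints.

    Only the leading digits of each dotted piece are used, so a piece
    like '0b1' counts as 0.
    """
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        parts.append(int(match.group()) if match else 0)
    return tuple(parts)


def needs_upgrade(installed, minimum="4.2"):
    """Return True if `installed` is older than `minimum`."""
    return version_tuple(installed) < version_tuple(minimum)


try:
    import bs4
except ImportError:
    print("bs4 is not installed")
else:
    print("bs4 version:", bs4.__version__)
    if needs_upgrade(bs4.__version__):
        print("older than 4.2, consider upgrading")
```

Upgrading is typically done with `pip install --upgrade beautifulsoup4`.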