Why can't lxml.html parse all the div elements in target.html?
Problem description
Please download the file from Dropbox and save it as /tmp/target.html.
Open it in Firefox with Firebug to inspect the HTML structure.
It is clear that there are at least 10 div elements in target.html.
Now try to parse all the div elements in target.html with lxml.html:
python3
>>> import lxml.html
>>> doc=lxml.html.parse("/tmp/target.html")
>>> divs=doc.xpath("//div")
>>> len(divs)
4
The result is 4. Why does the code above find so few of the divs? There are at least 10 divs in target.html.
The same thing happens when parsing the tables in target.html. There are at least 9 tables in target.html (please check with Firebug), but only 3 are found:
python3
>>> import lxml.html
>>> doc=lxml.html.parse("/tmp/target.html")
>>> tables=doc.xpath("//table")
>>> len(tables)
3
Recommended answer
Thanks to sideshowbarker.
First, install html5lib with pip:
sudo pip3 install html5lib
import html5lib

# Parse with html5lib (which follows the HTML5 parsing algorithm) into an lxml tree.
with open('/tmp/target.html', 'rb') as f:
    doc = html5lib.parse(f, treebuilder='lxml', namespaceHTMLElements=False)
divs = doc.xpath('//div')
tables = doc.xpath('//table')
print(len(divs))
print(len(tables))
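
When two parsers disagree on the element count, it can help to count the start tags in the raw markup without building a tree at all. The following is only a rough sketch of such a cross-check using the standard library's html.parser module. The names TagCounter and count_tags and the inline sample string are made up for illustration and are not part of the original question or answer. Note that Firebug shows the DOM after the browser has parsed the page and run its scripts, so its counts may still differ from the raw source.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Hypothetical helper: count start tags in raw markup (no tree is built)."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names before calling this method.
        self.counts[tag] += 1


def count_tags(html_text):
    """Return a Counter mapping tag name to the number of start tags seen."""
    counter = TagCounter()
    counter.feed(html_text)
    counter.close()
    return counter.counts


# Made-up sample; in practice, read the text of /tmp/target.html instead.
sample = "<div><table><tr><td><div>a</div></td></tr></table></div><DIV>b</DIV>"
counts = count_tags(sample)
print(counts["div"], counts["table"])
```

If the raw counts match what html5lib reports but not what lxml.html reports, that suggests the difference comes from how lxml's parser recovers from the markup in that file rather than from the XPath expressions.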