Parsing an HTML page containing & using Python
Question
I am trying to parse an HTML page in Python using urllib2 and ElementTree, and I am having trouble parsing the HTML. The webpage contains "&" within a quoted string, but ElementTree throws a ParseError for lines containing &.
Script:
import urllib2
url = 'http://eciresults.nic.in/ConstituencywiseU011.htm'
req = urllib2.Request(url, headers={'Content-type': 'text/xml'})
r = urllib2.urlopen(req).read()
import xml.etree.ElementTree as ET
htmlpage=ET.fromstring(r)
This throws the following error in Python 2.7:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 1282, in XML
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 1624, in feed
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 1488, in _raiseerror
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 676, column 73
The error corresponds to the following line:
<input type="hidden" id="HdnFldAndamanNicobar" value="1,Andaman & Nicobar Islands;" />
It looks like when the HTML page is read, the & sign is not parsed as &amp; in the variable r.
I tried parsing the same page in R with htmlTreeParse, and there "&" gets converted to &amp; properly.
Let me know if I am missing anything in urllib2.
EDIT: I replaced "&" with &amp;, but line 904 contains a < sign within JavaScript, which throws the same error. There should be a better option than replacing characters.
LINE:904 for (i = 0; i < strac.length - 1; i++) {
Recommended answer
First of all, xml.etree.ElementTree is an XML parser. It does not handle HTML entities out of the box. An unescaped & is illegal inside XML, and this is why it is failing.
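To see the failure in isolation, here is a minimal sketch that feeds the single line from the question to ElementTree, first as-is and then with the ampersand escaped. It is written for Python 3 and uses only the standard library; the variable names (raw, escaped, parsed_raw) are illustrative and not part of the original script.

```python
import xml.etree.ElementTree as ET

# The line from the question, with a bare ampersand in the attribute value.
raw = '<input type="hidden" id="HdnFldAndamanNicobar" value="1,Andaman & Nicobar Islands;" />'

try:
    ET.fromstring(raw)
    parsed_raw = True
except ET.ParseError as exc:
    # A bare & is not well-formed XML, so the parser rejects it.
    parsed_raw = False
    print("XML parser rejected the line:", exc)

# The same line with the ampersand escaped is well-formed XML.
escaped = raw.replace("&", "&amp;")
element = ET.fromstring(escaped)
print(element.get("value"))
```

Escaping by hand only works for this one line. As the EDIT in the question shows, the rest of the page (for example a < inside JavaScript) is still not XML, so a string replacement does not scale to the whole document.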
Get yourself going with a real, specialized HTML parser such as BeautifulSoup:
>>> from urllib2 import urlopen
>>> from bs4 import BeautifulSoup
>>> url = 'http://eciresults.nic.in/ConstituencywiseU011.htm'
>>> soup = BeautifulSoup(urlopen(url))
>>> soup.find('td').text.strip()
u'ELECTION COMMISSION OF INDIA'
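If installing BeautifulSoup is not an option, the same idea (use an HTML parser rather than an XML parser) can be sketched with the standard library's html.parser module. This is a swapped-in alternative, not what the answer above uses; it assumes Python 3, and the InputValueCollector class and the inline page string are hypothetical stand-ins for the real page. The snippet contains both constructs that broke ElementTree: the bare & in an attribute and the < inside a script.

```python
from html.parser import HTMLParser


class InputValueCollector(HTMLParser):
    """Hypothetical helper: collects the value attribute of every <input> tag."""

    def __init__(self):
        super().__init__()
        self.values = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag.
        if tag == "input":
            attr_map = dict(attrs)
            if "value" in attr_map:
                self.values.append(attr_map["value"])


# Stand-in for the downloaded page, built from the two lines quoted in the question.
page = """<html><body>
<input type="hidden" id="HdnFldAndamanNicobar" value="1,Andaman & Nicobar Islands;" />
<script>for (i = 0; i < strac.length - 1; i++) { }</script>
</body></html>"""

collector = InputValueCollector()
collector.feed(page)
collector.close()
print(collector.values)
```

An HTML parser tolerates the bare ampersand and treats the script body as raw text, so neither line needs to be rewritten before parsing.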