Parsing an HTML page containing & using Python


Problem description

I am trying to parse an HTML page in Python using urllib2 and ElementTree, and I am having trouble parsing the HTML. The web page contains "&" within a quoted string, and ElementTree throws a ParseError for lines containing &.

Script:

import urllib2

url = 'http://eciresults.nic.in/ConstituencywiseU011.htm'
req = urllib2.Request(url, headers={'Content-type': 'text/xml'})
r = urllib2.urlopen(req).read()

import xml.etree.ElementTree as ET
htmlpage=ET.fromstring(r)

This raises the following error in Python 2.7:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File     "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 1282, in XML
  File     "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 1624, in feed
  File     "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 1488, in _raiseerror
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 676, column 73

The error corresponds to the following line:

<input type="hidden" id="HdnFldAndamanNicobar" value="1,Andaman & Nicobar Islands;" />

It looks like when the HTML page is read, the & sign is not converted to &amp; in the variable r.

I tried parsing the same page with htmlTreeParse in R, and "&" gets converted to &amp; properly.

Let me know if I am missing anything in urllib2.

EDIT: I replaced "&" with &amp;, but line 904 contains a < sign within JavaScript, which throws the same error. There should be a better option than replacing characters.

LINE:904    for (i = 0; i < strac.length - 1; i++) {

Recommended answer

First of all, xml.etree.ElementTree is an XML parser. It does not handle HTML out of the box. A bare & is illegal in XML, where it must be written as &amp;, and this is why parsing fails. The unescaped < inside the JavaScript fails for the same reason.
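The failure can be reproduced without fetching the page. The following is a minimal sketch that feeds the hidden input line quoted in the question to ElementTree, once as-is and once with the ampersand escaped. The helper name try_parse is made up for illustration.

```python
import xml.etree.ElementTree as ET

# The line from the question, with a bare "&" inside the attribute value.
bad = ('<input type="hidden" id="HdnFldAndamanNicobar" '
       'value="1,Andaman & Nicobar Islands;" />')
# The same element with the ampersand escaped, as XML requires.
good = bad.replace(" & ", " &amp; ")


def try_parse(markup):
    """Hypothetical helper: return the value attribute, or the ParseError text."""
    try:
        return ET.fromstring(markup).get("value")
    except ET.ParseError as exc:
        return "ParseError: %s" % exc


bad_result = try_parse(bad)    # the XML parser rejects the bare "&"
good_result = try_parse(good)  # "&amp;" is decoded back to "&"
print(bad_result)
print(good_result)
```

Escaping by hand only works for this one line. The < in the page's inline JavaScript would need the same treatment, which is why an HTML parser is the better tool.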

Use a dedicated HTML parser instead, such as BeautifulSoup:

>>> from urllib2 import urlopen
>>> from bs4 import BeautifulSoup
>>> url = 'http://eciresults.nic.in/ConstituencywiseU011.htm'
>>> soup = BeautifulSoup(urlopen(url))
>>> soup.find('td').text.strip()
u'ELECTION COMMISSION OF INDIA'
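If installing BeautifulSoup is not an option, the standard library's lenient HTML parser is one possible alternative; this is not the method the answer recommends. The sketch below assumes Python 3, where the module is html.parser (it is HTMLParser in Python 2). The class name HiddenInputCollector and the sample markup are made up for illustration, built from the two problem lines quoted in the question.

```python
from html.parser import HTMLParser


class HiddenInputCollector(HTMLParser):
    """Hypothetical helper: collect id -> value for <input type="hidden">."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.hidden = {}

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs.
        attrs = dict(attrs)
        if tag == "input" and attrs.get("type") == "hidden":
            self.hidden[attrs.get("id")] = attrs.get("value")


# Sample markup assembled from the two lines quoted in the question.
sample = """
<html><body>
<input type="hidden" id="HdnFldAndamanNicobar" value="1,Andaman & Nicobar Islands;" />
<script>
for (i = 0; i < strac.length - 1; i++) { }
</script>
</body></html>
"""

collector = HiddenInputCollector()
collector.feed(sample)
collector.close()
hidden_fields = collector.hidden
print(hidden_fields)
```

Neither the bare & nor the < inside the script block raises an error here, because an HTML parser treats script content as raw text and tolerates unescaped ampersands.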
