Parsing HTML page containing & using Python

Views: 62  Published: 2021/5/3 20:56:30  Tags: python-2.7 urllib2 elementtree


Problem description

I am trying to parse an HTML page in Python using urllib2 and ElementTree, and I am having trouble parsing the HTML. The web page contains "&" within a quoted string, but ElementTree throws a ParseError for lines containing &.

Script:

import urllib2

url = 'http://eciresults.nic.in/ConstituencywiseU011.htm'
req = urllib2.Request(url, headers={'Content-type': 'text/xml'})
r = urllib2.urlopen(req).read()

import xml.etree.ElementTree as ET
htmlpage=ET.fromstring(r)

This throws the following error in Python 2.7:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File     "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 1282, in XML
  File     "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 1624, in feed
  File     "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 1488, in _raiseerror
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 676, column 73

The error corresponds to the following line:

<input type="hidden" id="HdnFldAndamanNicobar" value="1,Andaman & Nicobar Islands;" />

It looks like the & sign is not converted to &amp; when the HTML page is read into the variable r.

I tried parsing the page in R using htmlTreeParse, and there "&" gets converted to &amp; properly.

Let me know if I am missing anything in urllib2.

EDIT: I replaced "&" with &amp;, but line 904 contains a < sign within JavaScript, which throws the same error. There should be a better option than replacing characters.

LINE:904    for (i = 0; i < strac.length - 1; i++) {

Recommended answer

First of all, xml.etree.ElementTree is an XML parser, not an HTML parser. It does not handle HTML entities out of the box, and an unescaped & is not allowed in well-formed XML, which is why parsing fails.
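To illustrate the point about well-formedness, here is a minimal sketch that feeds the offending line from the question to ElementTree twice: once as it appears on the page and once with the ampersand escaped. It is written for Python 3 (an assumption; the question uses Python 2.7), and the variable names are chosen for illustration only.

```python
import xml.etree.ElementTree as ET

# The line from the question, with a bare "&" inside the attribute value.
raw = ('<input type="hidden" id="HdnFldAndamanNicobar" '
       'value="1,Andaman & Nicobar Islands;" />')

try:
    ET.fromstring(raw)
    parsed_raw = True
except ET.ParseError as exc:
    # A bare "&" is not well-formed XML, so the parser rejects it.
    parsed_raw = False
    print("ElementTree rejected the raw line:", exc)

# With the ampersand escaped as &amp; the same line is valid XML.
escaped = raw.replace(" & ", " &amp; ")
element = ET.fromstring(escaped)
value = element.get("value")
print(value)
```

Escaping works for this single line, but as the question's edit notes, a real page also contains things like < inside JavaScript, so patching characters one by one does not scale.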

Use a real, dedicated HTML parser instead, such as BeautifulSoup:

>>> from urllib2 import urlopen
>>> from bs4 import BeautifulSoup
>>> url = 'http://eciresults.nic.in/ConstituencywiseU011.htm'
>>> soup = BeautifulSoup(urlopen(url))
>>> soup.find('td').text.strip()
u'ELECTION COMMISSION OF INDIA'

See also:

How to use the standard library
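If installing BeautifulSoup is not an option, the standard library's HTMLParser is also lenient about a bare & and about < inside script blocks. The following is only a sketch under stated assumptions: it runs on an inline sample rather than the live page, the class name HiddenInputCollector is hypothetical, and the second hidden input (HdnFldExample) is made-up data added to show that parsing continues past the JavaScript.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2.7


class HiddenInputCollector(HTMLParser):
    """Collect id -> value for every <input type="hidden"> element."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.values = {}

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <input ... /> are also routed here
        # by the default handle_startendtag implementation.
        attributes = dict(attrs)
        if tag == "input" and attributes.get("type") == "hidden":
            self.values[attributes.get("id")] = attributes.get("value")


# Inline sample: the first input and the loop come from the question;
# the second input is hypothetical illustration data.
sample = """<html><body>
<input type="hidden" id="HdnFldAndamanNicobar" value="1,Andaman & Nicobar Islands;" />
<script>
for (i = 0; i < strac.length - 1; i++) { }
</script>
<input type="hidden" id="HdnFldExample" value="2,Example" />
</body></html>"""

collector = HiddenInputCollector()
collector.feed(sample)
collector.close()
print(collector.values)
```

In real use, the string passed to feed() would be the decoded page body fetched with urllib2 (or urllib.request on Python 3) instead of the inline sample.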
