Obtaining position info when parsing HTML in Python


Question



I'm trying to find a way to parse (potentially malformed) HTML in Python and, if a set of conditions is met, output that piece of the document together with its position (line, column). The position information is what is tripping me up here. And to be clear, I have no need to build an object tree. I simply want to find certain pieces of data and their position in the original document (think of a spell checker, for example: 'word "foo" at line x, column y, is misspelled').

As an example I want something like this (using ElementTree's Target API):

import xml.etree.ElementTree as ET

class EchoTarget:
    def start(self, tag, attrib):
        if somecondition():
            print "start", tag, attrib, self.getpos()
    def end(self, tag):
        if somecondition():
            print "end", tag, self.getpos()
    def data(self, data):
        if somecondition():
            print "data", repr(data), self.getpos()

target = EchoTarget()
parser = ET.XMLParser(target=target)
parser.feed("<p>some text</p>")
parser.close() 

However, as far as I can tell, the getpos() method (or something like it) doesn't exist. And, of course, that is using an XML parser. I want to parse potentially malformed HTML.

Interestingly, the HTMLParser class in the Python standard library does offer support for obtaining location info (with a getpos() method), but it is horrible at handling malformed HTML, so I have eliminated it as a possible solution. I need to parse HTML that exists in the real world without breaking the parser.
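
For reference, this is roughly what that looks like (shown with the Python 3 module name, html.parser; in Python 2 the module is called HTMLParser). getpos() returns a (line, offset) tuple for the event currently being handled, and EchoParser is just an illustrative name:

from html.parser import HTMLParser

class EchoParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        # getpos() gives (lineno, offset) of the "<" that opened this tag
        print("start", tag, attrs, self.getpos())
    def handle_endtag(self, tag):
        print("end", tag, self.getpos())
    def handle_data(self, data):
        # position of the start of this run of character data
        print("data", repr(data), self.getpos())

EchoParser().feed("<p>some text</p>")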

I'm aware of two HTML parsers that would work well at parsing malformed HTML, namely lxml and html5lib. And in fact, I would prefer to use either one of them over any other options available in Python.

However, as far as I can tell, html5lib offers no event API and would require that the document be parsed to a tree object. Then I would have to iterate through the tree. Of course, by that point, there is no association with the source document and all location information is lost. So, html5lib is out, which is a shame because it seems like the best parser for handling malformed HTML.
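
For concreteness, html5lib's public API is essentially just tree building; a quick sketch, assuming a default install of html5lib:

import html5lib

# Returns an ElementTree-style document (the default "etree" tree builder).
tree = html5lib.parse("<p>some text</p>")
for element in tree.iter():
    # Tags, attributes and text are all here, but no line/column info.
    print(element.tag, element.attrib, repr(element.text))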

The lxml library offers a Target API which mostly mirrors ElementTree's, but again, I'm not aware of any way to access location information for each event. A glance at the source code offered no hints either.
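
For anyone unfamiliar with it, here is a minimal sketch of lxml's target interface (assuming lxml is installed; EchoTarget is just an illustrative name). The callbacks mirror ElementTree's, and as noted, none of them receive any position argument:

from lxml import etree

class EchoTarget(object):
    def start(self, tag, attrib):
        print("start", tag, attrib)   # no line/column information available
    def end(self, tag):
        print("end", tag)
    def data(self, data):
        print("data", repr(data))
    def close(self):
        pass

parser = etree.HTMLParser(target=EchoTarget())
parser.feed("<p>some text</p>")
parser.close()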

lxml also offers an API to SAX events. Interestingly, Python's standard lib mentions that SAX has support for Locator Objects, but offers little documentation about how to use them. This SO Question provides some info (when using a SAX Parser), but I don't see how that relates to the limited support for SAX events that lxml provides.
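
For completeness, this is how those Locator objects are used with the standard library's xml.sax (a sketch; PositionHandler is just an illustrative name). It only works on well-formed XML, so it doesn't solve the malformed-HTML problem, but it shows the kind of API I'm after:

import xml.sax

class PositionHandler(xml.sax.ContentHandler):
    def setDocumentLocator(self, locator):
        # The parser hands the handler a locator before parsing starts.
        self._locator = locator
    def startElement(self, name, attrs):
        print("start", name, "at line", self._locator.getLineNumber(),
              "column", self._locator.getColumnNumber())
    def characters(self, content):
        print("data", repr(content), "at line", self._locator.getLineNumber())

xml.sax.parseString(b"<p>some text</p>", PositionHandler())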

Finally, before anyone suggests Beautiful Soup, I will point out that, as stated on the home page, "Beautiful Soup sits on top of popular Python parsers like lxml and html5lib". All it gives me is an object to extract data from with no connection to the original source document. Like with html5lib, all location info is lost by the time I have access to the data. I want/need raw access to the parser directly.

To expand on the spell checker example I mentioned at the beginning, I would want to check the spelling only of words in the document text (but not tag names or attributes) and may want to skip checking the content of specific tags (like the script or code tags). Therefore, I need a real HTML parser. However, when reporting misspelled words, I am only interested in their position in the original source document and have no need to build a tree object. To be clear, this is only an example of one potential use; I may use it for something completely different, but the needs would be essentially the same. In fact, I once built something very similar using HTMLParser, but never used it because the error handling wasn't going to work for that use case. That was years ago, and I seem to have lost that file somewhere along the line. I'd like to use lxml or html5lib instead this time around.

So, is there something I'm missing? I have a hard time believing that none of these parsers (aside from the mostly useless HTMLParser) have any way to access position information. But if they do, it must be undocumented, which seems strange to me.

Solution

After some additional research and a more careful review of the source code of html5lib, I discovered that html5lib.tokenizer.HTMLTokenizer does retain partial position information. By "partial," I mean that it knows the line and column of the last character of a given token. Unfortunately, it does not retain the position of the start of the token (I suppose it could be extrapolated, but that feels like re-implementing much of the tokenizer in reverse, and no, using the end position of the previous token won't work if there is whitespace between tokens).
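
As a rough sketch of what I mean (this assumes the older public html5lib.tokenizer module path and the stream.position() method I found while reading the source; both are internal details and may differ between html5lib versions):

from html5lib.tokenizer import HTMLTokenizer
from html5lib.constants import tokenTypes

tokenizer = HTMLTokenizer("<p>some text</p>")
for token in tokenizer:
    # stream.position() reports the line/column at the *end* of the token
    # that was just emitted; that is the "partial" position info described above.
    line, col = tokenizer.stream.position()
    if token["type"] == tokenTypes["StartTag"]:
        print("start", token["name"], "ends near", (line, col))
    elif token["type"] == tokenTypes["Characters"]:
        print("data", repr(token["data"]), "ends near", (line, col))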

In any event, I was able to wrap the HTMLTokenizer and create an HTMLParser clone which mostly replicates the API. You can find my work here: https://gist.github.com/waylan/7d5b7552078f1abc6fac.

However, as the tokenizer is only part of the parsing process implemented by html5lib, we lose the good parts of html5lib. For example, no normalization has been done at that stage in the process, so you get the raw (potentially invalid) tokens rather than a normalized document. As stated in the comments there, it is not perfect, and I question whether it is even useful.

In fact, I also discovered that the HTMLParser included in the Python standard library was updated for Python 3.3 and no longer crashes hard on invalid input. As far as I can tell, it is better (for my use case) in that it provides actually useful position info (as it always has). In all other respects, it is no better or worse than my wrapper of html5lib (except, of course, that it has presumably received much more testing and is therefore more stable). Unfortunately, the update has not been back-ported to Python 2 or to earlier Python 3 versions, although I don't imagine that would be all that difficult to do myself.

In any event, I've decided to move forward with HTMLParser in the standard library and reject my own wrapper around html5lib. You can see an early effort here, which appears to work fine with minimal testing.
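
The rough shape of that approach (not the code linked above, just an illustrative sketch tying it back to the spell-checker example; SKIP and WordReporter are made-up names) is a tag stack to skip things like script and code content, plus getpos() and an in-line offset to locate each word:

from html.parser import HTMLParser
import re

SKIP = {"script", "style", "code"}

class WordReporter(HTMLParser):
    def __init__(self):
        super().__init__()
        self.stack = []

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop back to the matching tag; stray end tags are ignored.
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        if any(tag in SKIP for tag in self.stack):
            return
        line, col = self.getpos()  # start of this block of character data
        for match in re.finditer(r"[A-Za-z']+", data):
            # Only accurate while the data stays on one line; a real version
            # would also track newlines inside the data string.
            print(match.group(), "at", (line, col + match.start()))

parser = WordReporter()
parser.feed("<p>some text</p><script>ignore_this()</script>")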


According to the Beautiful Soup docs, HTMLParser was updated to support invalid input in Python 2.7.3 and 3.2.2, which is earlier than 3.3.
