HTMLParser misunderstands entities in href. Is it a bug or not? Should I report it?


Problem description

I don't need to know how to work around the problem, because I have already solved it on my own. I'm only asking whether this is really a bug, and whether and how I should report it. The code and its output are below:

from html.parser import HTMLParser

class MyParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        for at in attrs:
            if at[0] == 'href':
                print(at[1])
        return super().handle_starttag(tag, attrs)

    def handle_data(self, data):
        return super().handle_data(data)

    def handle_endtag(self, tag):
        return super().handle_endtag(tag)



s = '<a href="/home?ID=123&gt3=7">nomeLink</a>'

p = MyParser()
p.feed(s)

Here is the output:

"/home?ID=123>3=7"

Recommended answer

No, it is not a bug. You are feeding the parser invalid HTML. The correct way to include & in a URL inside an HTML attribute is to escape it as &amp;:

>>> s = '<a href="/home?ID=123&amp;gt3=7">nomeLink</a>'
>>> p = MyParser()
>>> p.feed(s)
/home?ID=123&gt3=7
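
If the link is generated by your own code, the escaping can be left to the standard library. The following is a minimal sketch, assuming the path and query parameters from the question; the helper name build_link is hypothetical and only serves the illustration:

```python
from html import escape
from urllib.parse import urlencode


def build_link(path, params, text):
    """Return an <a> element with an escaped href (hypothetical helper)."""
    # urlencode joins the parameters with a raw "&", which is right for a URL
    # but not yet safe to place inside an HTML attribute.
    url = f"{path}?{urlencode(params)}"
    # escape() turns "&" into "&amp;"; quote=True also escapes quote characters,
    # so the value can sit inside a double-quoted attribute.
    return f'<a href="{escape(url, quote=True)}">{escape(text)}</a>'


link = build_link("/home", {"ID": 123, "gt3": 7}, "nomeLink")
print(link)  # <a href="/home?ID=123&amp;gt3=7">nomeLink</a>
```

Feeding this string to the parser yields the intended /home?ID=123&gt3=7, as in the session above.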

The parser did its best (as required by the HTML standard) and gave you 'repaired' data. In this case, it tried to repair another common broken-HTML error: spelling &gt; as &gt (forgetting the ; semicolon).
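
The same lenient decoding can be observed in isolation with html.unescape from the standard library, which is documented to apply the HTML 5 rules for both valid and invalid character references. This is a small sketch; apart from the first and last strings, the sample URLs are made up for illustration:

```python
from html import unescape

# "&gt" is a legacy entity name that is recognised even without a semicolon.
repaired = unescape("/home?ID=123&gt3=7")
print(repaired)  # /home?ID=123>3=7

# "&copy" is another such name, so a parameter called "copy" gets corrupted.
corrupted = unescape("/page?a=1&copy=2")
print(corrupted)  # /page?a=1©=2

# A name that does not start with a known entity is left untouched.
untouched = unescape("/page?a=1&foo=2")
print(untouched)  # /page?a=1&foo=2

# A properly escaped ampersand decodes to a literal "&" and nothing more.
escaped = unescape("/home?ID=123&amp;gt3=7")
print(escaped)  # /home?ID=123&gt3=7
```

Whether an unescaped & survives therefore depends on what happens to follow it, which is why escaping it as &amp; is the only reliable option.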

Rather than building on top of the (rather low-level) html.parser library yourself, I recommend using BeautifulSoup instead. BeautifulSoup supports multiple parsers, and some of them handle broken HTML better than others.

For example, the html5lib parser handles unescaped ampersands in attributes better than html.parser does:

>>> from bs4 import BeautifulSoup
>>> s = '<a href="/home?ID=123&gt3=7">nomeLink</a>'
>>> BeautifulSoup(s, 'html.parser').find('a')['href']
'/home?ID=123>3=7'
>>> BeautifulSoup(s, 'html5lib').find('a')['href']
'/home?ID=123&gt3=7'

For completeness' sake, the third supported parser, lxml, also treats unescaped ampersands as if they were escaped:

>>> BeautifulSoup(s, 'lxml').find('a')['href']
'/home?ID=123&gt3=7'

You could use lxml and html5lib directly, but then you would forgo the nice high-level API that BeautifulSoup offers.
