Using HTMLParser in Python efficiently


Question

In response to the question "Python regular expression", I tried to implement an HTML parser using HTMLParser:

import HTMLParser  # Python 2 module name; Python 3 renamed it to html.parser

class ExtractHeadings(HTMLParser.HTMLParser):

  def __init__(self):
    HTMLParser.HTMLParser.__init__(self)
    self.text = None
    self.headings = []

  def is_relevant(self, tagname):
    return tagname == 'h1' or tagname == 'h2'

  def handle_starttag(self, tag, attrs):
    if self.is_relevant(tag):
      self.text = ''

  def handle_endtag(self, tag):
    if self.is_relevant(tag):
      self.headings.append(self.text)
      self.text = None

  def handle_data(self, data):
    if self.text is not None:
      self.text += data

  def handle_charref(self, name):
    # name arrives without '&#' and ';', e.g. 'x6c' (hex) or '72' (decimal)
    if self.text is not None:
      if name[0] == 'x':
        self.text += chr(int(name[1:], 16))
      else:
        self.text += chr(int(name))

  def handle_entityref(self, name):
    if self.text is not None:
      print 'TODO: entity %s' % name

def extract_headings(text):
  parser = ExtractHeadings()
  parser.feed(text)
  return parser.headings

print extract_headings('abdk3<h1>The content we need</h1>aaaaabbb<h2>The content we need2</h2>')
print extract_headings('before<h1>&#72;e&#x6c;&#108;o</h1>after')

While doing that, I wondered whether the API of this module is poorly designed or whether I had overlooked something important. My questions are:

  • Why does my implementation of handle_charref have to be that complex? I would have expected a good API to pass the code point as a parameter, rather than a string such as x6c or 72.
  • Why doesn't the default implementation of handle_charref call handle_data with an appropriate string?
  • Why is there no utility implementation of handle_entityref that I could just call? It could be named handle_entityref_HTML4; it would look up the entities defined in HTML 4 and then call handle_data on them.

If that API were provided, writing custom HTML parsers would be much easier. So where is my misunderstanding?
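The API wished for above can be approximated with a small base class. The following is only a sketch, written for Python 3 (html.parser and html.entities) rather than the Python 2 module used in the question; the class name RefsAsDataParser is hypothetical and not part of the standard library. It routes both numeric and named references into handle_data, so a subclass only has to deal with plain text.

```python
from html.entities import name2codepoint
from html.parser import HTMLParser


class RefsAsDataParser(HTMLParser):
    """Hypothetical helper: forwards character/entity references to handle_data."""

    def __init__(self):
        # convert_charrefs=False keeps handle_charref/handle_entityref active.
        super().__init__(convert_charrefs=False)

    def handle_charref(self, name):
        # name is e.g. '72' (decimal) or 'x6c' (hex), without '&#' and ';'.
        try:
            if name[0] in 'xX':
                codepoint = int(name[1:], 16)
            else:
                codepoint = int(name)
            self.handle_data(chr(codepoint))
        except (ValueError, OverflowError):
            # Not a valid code point: pass the reference through unchanged.
            self.handle_data('&#%s;' % name)

    def handle_entityref(self, name):
        if name in name2codepoint:
            self.handle_data(chr(name2codepoint[name]))
        else:
            # Unknown entity name: pass the reference through unchanged.
            self.handle_data('&%s;' % name)


class ExtractHeadings(RefsAsDataParser):
    def __init__(self):
        super().__init__()
        self.text = None
        self.headings = []

    def handle_starttag(self, tag, attrs):
        if tag in ('h1', 'h2'):
            self.text = ''

    def handle_endtag(self, tag):
        if tag in ('h1', 'h2') and self.text is not None:
            self.headings.append(self.text)
            self.text = None

    def handle_data(self, data):
        if self.text is not None:
            self.text += data


def extract_headings(text):
    parser = ExtractHeadings()
    parser.feed(text)
    parser.close()
    return parser.headings
```

With this in place, ExtractHeadings no longer needs its own handle_charref or handle_entityref; the heading logic stays in the three handlers that deal with tags and text.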

Recommended answer

Well, I tend to agree that it's a horrible oversight for HTMLParser not to include code to convert HTML entity references into normal ASCII and/or other characters. I gather that this has been remedied by separate work in Python 3.
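For reference, here is a minimal sketch of what the Python 3 standard library offers, assuming Python 3.5 or later: html.unescape decodes references in a string, and html.parser.HTMLParser decodes them in text content before calling handle_data, because its convert_charrefs argument defaults to True. The class CollectText and the function text_of are illustrative names only.

```python
import html
from html.parser import HTMLParser


class CollectText(HTMLParser):
    """Illustrative parser that gathers all text content."""

    def __init__(self):
        # convert_charrefs defaults to True in Python 3.5+, so references
        # in text are already decoded when handle_data is called.
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def text_of(markup):
    parser = CollectText()
    parser.feed(markup)
    parser.close()
    return ''.join(parser.chunks)


# Decoding a string directly, without a parser:
decoded = html.unescape('&#72;e&#x6c;&#108;o &amp; more')
```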

However, it seems we can write a fairly simple entity handler, something like:

import htmlentitydefs
def entity2char(x):
    if x.startswith('&#x'):
        # convert from hexadecimal
        return chr(int(x[3:-1], 16))
    elif x.startswith('&#'):
        # convert from decimal
        return chr(int(x[2:-1]))
    elif x[1:-1] in htmlentitydefs.entitydefs:
        return htmlentitydefs.entitydefs[x[1:-1]]
    else:
        return x

... though we should add further input validation and wrap the integer conversions in exception-handling code.

But this handles the bare minimum in about 10 lines of code. Adding the exception handling would perhaps double its line count.
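As a rough illustration of that hardened version, here is a sketch ported to Python 3, where htmlentitydefs became html.entities. It assumes the caller passes a complete reference including the leading '&' and trailing ';', and it returns anything it cannot decode unchanged; that fallback is a design choice of this sketch, not something the original answer specifies.

```python
from html.entities import name2codepoint


def entity2char(ref):
    """Return the character for a reference such as '&#x6c;', else ref unchanged."""
    # Basic input validation: must look like '&...;' with a non-empty body.
    if not (ref.startswith('&') and ref.endswith(';') and len(ref) > 2):
        return ref
    body = ref[1:-1]
    try:
        if body[:2] in ('#x', '#X'):
            # Hexadecimal numeric reference.
            return chr(int(body[2:], 16))
        if body.startswith('#'):
            # Decimal numeric reference.
            return chr(int(body[1:]))
    except (ValueError, OverflowError):
        # Malformed number or code point out of range.
        return ref
    if body in name2codepoint:
        # Named entity defined in HTML 4.
        return chr(name2codepoint[body])
    return ref
```

The result is indeed roughly twice as long as the minimal version, most of it spent on rejecting malformed input.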
