Using HTMLParser in Python efficiently


Question

In response to the question "Python regular expression", I tried to implement an HTML parser using HTMLParser:

import HTMLParser

class ExtractHeadings(HTMLParser.HTMLParser):

  def __init__(self):
    HTMLParser.HTMLParser.__init__(self)
    self.text = None
    self.headings = []

  def is_relevant(self, tagname):
    return tagname == 'h1' or tagname == 'h2'

  def handle_starttag(self, tag, attrs):
    if self.is_relevant(tag):
      self.in_heading = True
      self.text = ''

  def handle_endtag(self, tag):
    if self.is_relevant(tag):
      self.headings += [self.text]
      self.text = None

  def handle_data(self, data):
    if self.text is not None:
      self.text += data

  def handle_charref(self, name):
    # name is the reference without '&#' and ';', e.g. 'x6c' or '72'
    if self.text is not None:
      if name[0] == 'x':
        self.text += chr(int(name[1:], 16))
      else:
        self.text += chr(int(name))

  def handle_entityref(self, name):
    if self.text is not None:
      print 'TODO: entity %s' % name

def extract_headings(text):
  parser = ExtractHeadings()
  parser.feed(text)
  return parser.headings

print extract_headings('abdk3<h1>The content we need</h1>aaaaabbb<h2>The content we need2</h2>')
print extract_headings('before<h1>&#72;e&#x6c;&#108;o</h1>after')

While doing that, I wondered whether the API of this module is bad or whether I missed something important. My questions are:


  • Why does my implementation of handle_charref have to be that complex? I would have expected that a good API passes the codepoint as a parameter, not either x6c or 72 as string.
  • Why doesn't the default implementation of handle_charref call handle_data with an appropriate string?
  • Why is there no utility implementation of handle_entityref that I could just call? It could be named handle_entityref_HTML4 and would look up the entities defined in HTML 4 and then call handle_data on them.

If that API were provided, writing custom HTML parsers would be much easier. So where is my misunderstanding?

Recommended answer

Well, I tend to agree that it's a horrible oversight for the HTMLParser not to include code to convert HTML entity references into normal ASCII and/or other characters. I gather that this is remedied by completely different work in Python3.
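For comparison, here is a minimal sketch of how the same heading extractor might look on Python 3. It is not part of the original question; it assumes Python 3.5 or later, where html.parser.HTMLParser accepts convert_charrefs and defaults it to True, so character and entity references inside text reach handle_data already decoded and handle_charref/handle_entityref are not needed for this use case.

```python
from html.parser import HTMLParser


class ExtractHeadings(HTMLParser):
    """Collect the text of every <h1> and <h2> element."""

    def __init__(self):
        # convert_charrefs=True is the default since Python 3.5; it is
        # spelled out here only to make the behaviour explicit.
        super().__init__(convert_charrefs=True)
        self.text = None
        self.headings = []

    def handle_starttag(self, tag, attrs):
        if tag in ('h1', 'h2'):
            self.text = ''

    def handle_endtag(self, tag):
        if tag in ('h1', 'h2') and self.text is not None:
            self.headings.append(self.text)
            self.text = None

    def handle_data(self, data):
        # Character and entity references arrive here already decoded.
        if self.text is not None:
            self.text += data


def extract_headings(text):
    parser = ExtractHeadings()
    parser.feed(text)
    parser.close()  # flush any text still buffered by the parser
    return parser.headings
```

Calling extract_headings('before<h1>&#72;e&#x6c;&#108;o</h1>after') with this sketch gives ['Hello'].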

However, it seems we can write a fairly simple entity handler something like:

import htmlentitydefs
def entity2char(x):
    if x.startswith('&#x'):
        return chr(int(x[3:-1],16))
    elif x.startswith('&#'):
        return chr(int(x[2:-1]))
    elif x[1:-1] in htmlentitydefs.entitydefs:
        return htmlentitydefs.entitydefs[x[1:-1]]
    else:
        return x

... though we should add further input validation, and wrap the integer conversions in exception handling code.

But this should handle the bare minimum in about 10 lines of code. Adding the exception handling would, perhaps, double its line count.
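As a rough illustration of what the validated version could look like, here is a sketch written for Python 3, where the entity table lives in html.entities rather than htmlentitydefs. The choice to return undecodable input unchanged is an assumption carried over from the else branch of the function above, not something the standard library prescribes.

```python
from html.entities import name2codepoint


def entity2char(ref):
    """Decode one reference such as '&#x6c;', '&#108;' or '&amp;'.

    Anything that cannot be decoded is returned unchanged.
    """
    if not (ref.startswith('&') and ref.endswith(';')):
        return ref
    body = ref[1:-1]
    try:
        if body[:2] in ('#x', '#X'):
            return chr(int(body[2:], 16))
        if body.startswith('#'):
            return chr(int(body[1:]))
        return chr(name2codepoint[body])
    except (KeyError, ValueError, OverflowError):
        # Unknown entity name, malformed number, or code point out of range.
        return ref
```

With this sketch, entity2char('&#x6c;') returns 'l' and entity2char('&amp;') returns '&', while an unknown reference such as '&nosuch;' comes back as-is.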
