Using HTMLParser in Python efficiently

Views: 119 | Published: 2016/5/22 22:20:48 | Tags: python, api, html-parsing

Question

In response to the question "Python regular expression", I tried to implement an HTML parser using HTMLParser:

import HTMLParser  # Python 2 module name; in Python 3 this is html.parser

class ExtractHeadings(HTMLParser.HTMLParser):

  def __init__(self):
    HTMLParser.HTMLParser.__init__(self)
    self.text = None
    self.headings = []

  def is_relevant(self, tagname):
    return tagname == 'h1' or tagname == 'h2'

  def handle_starttag(self, tag, attrs):
    if self.is_relevant(tag):
      # Start collecting the text of this heading.
      self.text = ''

  def handle_endtag(self, tag):
    # Ignore a stray closing tag that has no matching opening tag.
    if self.is_relevant(tag) and self.text is not None:
      self.headings.append(self.text)
      self.text = None

  def handle_data(self, data):
    if self.text is not None:
      self.text += data

  def handle_charref(self, name):
    # name is the reference without '&#' and ';', e.g. '72' or 'x6c'.
    if self.text is not None:
      if name[0] in 'xX':
        self.text += chr(int(name[1:], 16))
      else:
        self.text += chr(int(name))

  def handle_entityref(self, name):
    if self.text is not None:
      print 'TODO: entity %s' % name

def extract_headings(text):
  parser = ExtractHeadings()
  parser.feed(text)
  return parser.headings

print extract_headings('abdk3<h1>The content we need</h1>aaaaabbb<h2>The content we need2</h2>')
print extract_headings('before<h1>&#72;e&#x6c;&#108;o</h1>after')

Doing that, I wondered whether the API of this module is bad or whether I overlooked something important. My questions are:

1. Why does my implementation of handle_charref have to be that complex? I would have expected a good API to pass the code point as a parameter, not either x6c or 72 as a string.
2. Why doesn't the default implementation of handle_charref call handle_data with an appropriate string?
3. Why is there no utility implementation of handle_entityref that I could just call? It could be named handle_entityref_HTML4; it would look up the entities defined in HTML 4 and then call handle_data on them.

If that API were provided, writing custom HTML parsers would be much easier. So where is my misunderstanding?
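
The behaviour wished for in the three questions can be approximated with a small base class. What follows is only a minimal sketch under stated assumptions, not part of the standard library: it targets Python 3 (where the module is `html.parser` and the entity table is `html.entities.name2codepoint`), and the names `DecodingHTMLParser`, `HeadingExtractor` and `extract_headings` are made up for illustration. It passes `convert_charrefs=False` so that `handle_charref` and `handle_entityref` are still called, then forwards the decoded character to `handle_data`.

```python
from html.entities import name2codepoint
from html.parser import HTMLParser


class DecodingHTMLParser(HTMLParser):
    """Hypothetical base class: turns references into handle_data calls."""

    def __init__(self):
        # convert_charrefs=False keeps handle_charref/handle_entityref active.
        super().__init__(convert_charrefs=False)

    def handle_charref(self, name):
        # name is e.g. '72' (decimal) or 'x6c' / 'X6C' (hexadecimal).
        try:
            if name[0] in 'xX':
                codepoint = int(name[1:], 16)
            else:
                codepoint = int(name)
            self.handle_data(chr(codepoint))
        except (ValueError, OverflowError):
            # Not a usable code point: pass the reference through unchanged.
            self.handle_data('&#%s;' % name)

    def handle_entityref(self, name):
        # Look the name up in the HTML 4 entity table.
        if name in name2codepoint:
            self.handle_data(chr(name2codepoint[name]))
        else:
            self.handle_data('&%s;' % name)


class HeadingExtractor(DecodingHTMLParser):
    """Collects the text of every <h1> and <h2> element."""

    def __init__(self):
        super().__init__()
        self.text = None      # None means "not inside a heading"
        self.headings = []

    def handle_starttag(self, tag, attrs):
        if tag in ('h1', 'h2'):
            self.text = ''

    def handle_endtag(self, tag):
        if tag in ('h1', 'h2') and self.text is not None:
            self.headings.append(self.text)
            self.text = None

    def handle_data(self, data):
        if self.text is not None:
            self.text += data


def extract_headings(markup):
    parser = HeadingExtractor()
    parser.feed(markup)
    parser.close()
    return parser.headings


print(extract_headings('before<h1>&#72;e&#x6c;&#108;o</h1>after'))  # ['Hello']
```

With this split, a subclass such as `HeadingExtractor` only has to implement `handle_data`; it never sees the raw `x6c` or `72` strings.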

Recommended answer

Well, I tend to agree that it's a horrible oversight for HTMLParser not to include code to convert HTML entity references into normal ASCII and/or other characters. I gather that this is remedied by completely different work in Python 3.
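
For reference, here is a minimal sketch of what that looks like in Python 3, as far as the standard library documents it. `html.parser.HTMLParser` accepts a `convert_charrefs` argument (the default is `True` since Python 3.5) that decodes character and entity references before `handle_data` is called, and `html.unescape` does the same for a plain string. The class name `TextCollector` is hypothetical and exists only for this illustration.

```python
from html import unescape
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical helper: gathers every text chunk the parser reports."""

    def __init__(self):
        # With convert_charrefs=True, handle_data receives decoded text and
        # handle_charref/handle_entityref are not called for that text.
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


collector = TextCollector()
collector.feed('<h1>&#72;e&#x6c;&#108;o &amp; goodbye</h1>')
collector.close()
decoded = ''.join(collector.chunks)

# The same decoding applied to a bare string, without any parser.
plain = unescape('&#72;e&#x6c;&#108;o &amp; goodbye')

print(decoded)
print(plain)
```

Both calls should yield the same decoded text, so under Python 3 a custom `handle_charref` is only needed when the raw reference itself matters.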
However, it seems we can write a fairly simple entity handler, something like:
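
import htmlentitydefs  # Python 2 module; renamed html.entities in Python 3

def entity2char(x):
    # x is a complete reference including '&' and ';', e.g. '&#x6c;' or '&amp;'.
    # Note: chr() in Python 2 only accepts 0..255; larger code points need unichr().
    if x[:3].lower() == '&#x':
        # Hexadecimal character reference ('&#x' or '&#X').
        return chr(int(x[3:-1],16))
    elif x.startswith('&#'):
        # Decimal character reference.
        return chr(int(x[2:-1]))
    elif x[1:-1] in htmlentitydefs.entitydefs:
        # Named entity from the htmlentitydefs table.
        return htmlentitydefs.entitydefs[x[1:-1]]
    else:
        # Unknown reference: return it unchanged.
        return x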

... though we should add further input validation and wrap the integer conversions in exception handling code.
But this should handle the bare minimum in about 10 lines of code. Adding the exception handling would, perhaps, double the line count.
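
As a rough illustration of that remark, the handler with input validation and exception handling might look like the sketch below. It is written for Python 3, so it uses `html.entities.name2codepoint` instead of `htmlentitydefs`, and the function name `entity_to_char` is an assumption made for this example rather than anything from the original answer.

```python
from html.entities import name2codepoint


def entity_to_char(ref):
    """Decode one reference such as '&#x6c;', '&#108;' or '&amp;'.

    Anything that is not a well-formed, known reference is returned unchanged.
    """
    # Input validation: must look like '&...;' with something in between.
    if len(ref) < 3 or not ref.startswith('&') or not ref.endswith(';'):
        return ref
    body = ref[1:-1]
    try:
        if body[:2] in ('#x', '#X'):
            # Hexadecimal character reference.
            return chr(int(body[2:], 16))
        if body.startswith('#'):
            # Decimal character reference.
            return chr(int(body[1:]))
    except (ValueError, OverflowError):
        # Unparseable digits or a code point outside the Unicode range.
        return ref
    if body in name2codepoint:
        # Named entity from the HTML 4 table.
        return chr(name2codepoint[body])
    return ref


print(entity_to_char('&#x6c;'), entity_to_char('&amp;'), entity_to_char('&#xZZ;'))
```

The body comes to roughly twenty lines, in line with the estimate above. Returning the input unchanged on any failure is one possible design choice; raising an exception would be equally defensible.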
