Using HTMLParser in Python efficiently
Question
In response to the question "Python regular expression", I tried to implement an HTML parser using HTMLParser:
import HTMLParser  # Python 2 module; in Python 3 it is html.parser


class ExtractHeadings(HTMLParser.HTMLParser):

    def __init__(self):
        HTMLParser.HTMLParser.__init__(self)
        self.text = None  # None while the parser is outside a heading
        self.headings = []

    def is_relevant(self, tagname):
        return tagname == 'h1' or tagname == 'h2'

    def handle_starttag(self, tag, attrs):
        if self.is_relevant(tag):
            self.text = ''

    def handle_endtag(self, tag):
        if self.is_relevant(tag):
            self.headings.append(self.text)
            self.text = None

    def handle_data(self, data):
        if self.text is not None:
            self.text += data

    def handle_charref(self, name):
        # name is e.g. 'x6c' (hexadecimal) or '72' (decimal)
        if self.text is not None:
            if name[0] == 'x':
                # unichr, because chr fails for code points above 255
                self.text += unichr(int(name[1:], 16))
            else:
                self.text += unichr(int(name))

    def handle_entityref(self, name):
        if self.text is not None:
            print 'TODO: entity %s' % name


def extract_headings(text):
    parser = ExtractHeadings()
    parser.feed(text)
    return parser.headings


print extract_headings('abdk3<h1>The content we need</h1>aaaaabbb<h2>The content we need2</h2>')
print extract_headings('before<h1>Hello</h1>after')
While doing that, I wondered whether the API of this module is bad or whether I missed something important. My questions are:
- Why does my implementation of handle_charref have to be that complex? I would have expected a good API to pass the code point as a parameter, not either x6c or 72 as a string.
- Why doesn't the default implementation of handle_charref call handle_data with an appropriate string?
- Why is there no utility implementation of handle_entityref that I could just call? It could be named handle_entityref_HTML4 and would look up the entities defined in HTML 4 and then call handle_data on them.
If that API were provided, writing custom HTML parsers would be much easier. So where is my misunderstanding?
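The behaviour asked for in these questions can be sketched as a small base class. This is only an illustration, written for Python 3's html.parser rather than the Python 2 module used above; the names CharRefForwardingParser, CollectText and collect_text are hypothetical and not part of the standard library.

```python
from html.entities import name2codepoint
from html.parser import HTMLParser


class CharRefForwardingParser(HTMLParser):
    """Hypothetical base class: turns references into handle_data() calls."""

    def __init__(self):
        # convert_charrefs=False keeps handle_charref/handle_entityref active.
        super().__init__(convert_charrefs=False)

    def handle_charref(self, name):
        # name is 'x6c' (hexadecimal) or '108' (decimal), without '&#' and ';'.
        if name[0] in ('x', 'X'):
            codepoint = int(name[1:], 16)
        else:
            codepoint = int(name)
        self.handle_data(chr(codepoint))

    def handle_entityref(self, name):
        # Look the name up in the HTML 4 entity table; pass unknown names on unchanged.
        if name in name2codepoint:
            self.handle_data(chr(name2codepoint[name]))
        else:
            self.handle_data('&%s;' % name)


class CollectText(CharRefForwardingParser):
    """Example subclass that only needs to override handle_data()."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def collect_text(markup):
    parser = CollectText()
    parser.feed(markup)
    parser.close()
    return ''.join(parser.chunks)
```

With such a base class, a subclass sees plain text, character references and named entities through the single handle_data hook, which is what the three questions are asking for.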
Recommended answer
Well, I tend to agree that it's a horrible oversight for the HTMLParser not to include code to convert HTML entity references into normal ASCII and/or other characters. I gather that this is remedied by completely different work in Python3.
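As a rough illustration of the Python 3 situation: html.parser.HTMLParser accepts a convert_charrefs argument, and when it is true (the default in recent Python 3 versions, as far as I know) character and entity references are decoded before handle_data is called. The sketch below assumes that behaviour; the class mirrors the one in the question, and the names are reused only for illustration.

```python
from html.parser import HTMLParser


class ExtractHeadings(HTMLParser):
    def __init__(self):
        # With convert_charrefs=True, references arrive already decoded
        # in handle_data(); handle_charref/handle_entityref are not needed.
        super().__init__(convert_charrefs=True)
        self.text = None  # None while the parser is outside a heading
        self.headings = []

    def handle_starttag(self, tag, attrs):
        if tag in ('h1', 'h2'):
            self.text = ''

    def handle_endtag(self, tag):
        if tag in ('h1', 'h2') and self.text is not None:
            self.headings.append(self.text)
            self.text = None

    def handle_data(self, data):
        if self.text is not None:
            self.text += data


def extract_headings(markup):
    parser = ExtractHeadings()
    parser.feed(markup)
    parser.close()
    return parser.headings
```

Under that assumption, the heading extractor shrinks to the three handlers that deal with tags and text.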
However, it seems we can write a fairly simple entity handler something like:
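import htmlentitydefs  # Python 2 module; in Python 3 it is html.entities

def entity2char(x):
    # x is a complete reference such as '&#x6c;', '&#72;' or '&amp;'
    if x.startswith('&#x'):
        return unichr(int(x[3:-1], 16))
    elif x.startswith('&#'):
        return unichr(int(x[2:-1]))
    elif x[1:-1] in htmlentitydefs.name2codepoint:
        # name2codepoint maps every HTML 4 entity name to its code point
        return unichr(htmlentitydefs.name2codepoint[x[1:-1]])
    else:
        return x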
... though we should add further input validation and wrap the integer conversions in exception-handling code.
But this should handle the bare minimum in about 10 lines of code. Adding the exception handling would, perhaps, double its line count.
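A version with the validation and exception handling added might look roughly like the following. It is a sketch for Python 3 (html.entities instead of htmlentitydefs), not the original author's code; the REFERENCE pattern and the unescape_text helper are assumptions introduced here, and the pattern deliberately accepts only references that end in ';'.

```python
import re
from html.entities import name2codepoint

# Matches '&#x6c;', '&#108;' or '&amp;' (the terminating ';' is required here).
REFERENCE = re.compile(r'&(#[xX][0-9a-fA-F]+|#[0-9]+|[A-Za-z][A-Za-z0-9]*);')


def entity2char(ref):
    """Return the character for one reference, or ref itself if unusable."""
    match = REFERENCE.fullmatch(ref)
    if match is None:
        return ref  # not a well-formed reference
    body = match.group(1)
    try:
        if body[:2] in ('#x', '#X'):
            return chr(int(body[2:], 16))
        if body.startswith('#'):
            return chr(int(body[1:]))
        return chr(name2codepoint[body])
    except (KeyError, ValueError, OverflowError):
        # Unknown entity name, or a code point outside the Unicode range.
        return ref


def unescape_text(text):
    """Replace every well-formed reference in text via entity2char()."""
    return REFERENCE.sub(lambda m: entity2char(m.group(0)), text)
```

Unknown or out-of-range references are returned unchanged rather than raising, which matches the fallback branch of the shorter version. In Python 3, the standard library's html.unescape covers the same ground and is probably the better choice outside of an exercise like this.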