How can I use the Python HTMLParser library to extract data from a specific div tag?
Problem description
I am trying to get a value out of an HTML page using the Python HTMLParser library. The value I want to get hold of is within this HTML element:
...
<div id="remository">20</div>
...
This is my HTMLParser class so far:
import HTMLParser
import urllib

class LinksParser(HTMLParser.HTMLParser):
    def __init__(self):
        HTMLParser.HTMLParser.__init__(self)
        self.seen = {}

    def handle_starttag(self, tag, attributes):
        if tag != 'div':
            return
        for name, value in attributes:
            if name == 'id' and value == 'remository':
                #print value
                return

    def handle_data(self, data):
        print data

p = LinksParser()
f = urllib.urlopen("http://domain.com/somepage.html")
html = f.read()
p.feed(html)
p.close()
Can someone point me in the right direction? I want the class to extract the value 20.
Recommended answer
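import HTMLParser  # Python 2 module name; in Python 3 it is html.parser

class LinksParser(HTMLParser.HTMLParser):
    def __init__(self):
        HTMLParser.HTMLParser.__init__(self)
        self.recording = 0  # nesting depth inside the triggering div (0 = not recording)
        self.data = []      # text chunks collected while recording

    def handle_starttag(self, tag, attributes):
        if tag != 'div':
            return
        if self.recording:
            # already inside the triggering div: count the nested div
            self.recording += 1
            return
        for name, value in attributes:
            if name == 'id' and value == 'remository':
                break
        else:
            # loop finished without break: this div is not the triggering one
            return
        self.recording = 1

    def handle_endtag(self, tag):
        if tag == 'div' and self.recording:
            self.recording -= 1

    def handle_data(self, data):
        if self.recording:
            self.data.append(data)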
self.recording counts the number of nested div tags starting from a "triggering" one. When we're in the sub-tree rooted in a triggering tag, we accumulate the data in self.data.
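To make the counter's behaviour concrete, here is a minimal sketch that logs the value of self.recording each time it changes. It is a Python 3 port (html.parser instead of the Python 2 HTMLParser module); the class name DepthTracer, the trace attribute, and the sample HTML are made up for illustration and are not part of the original answer.

```python
from html.parser import HTMLParser  # Python 3 name of the HTMLParser module

class DepthTracer(HTMLParser):
    """Hypothetical helper: same counting logic, plus a log of counter changes."""
    def __init__(self):
        super().__init__()
        self.recording = 0
        self.data = []
        self.trace = []  # (event, counter value after the event)

    def handle_starttag(self, tag, attributes):
        if tag != 'div':
            return
        if self.recording:
            self.recording += 1          # nested div inside the triggering one
        elif ('id', 'remository') in attributes:
            self.recording = 1           # the triggering div itself
        else:
            return                       # unrelated div outside the sub-tree
        self.trace.append(('open', self.recording))

    def handle_endtag(self, tag):
        if tag == 'div' and self.recording:
            self.recording -= 1
            self.trace.append(('close', self.recording))

    def handle_data(self, data):
        if self.recording:
            self.data.append(data)

tracer = DepthTracer()
tracer.feed('<div id="remository">a<div>b</div>c</div><div>d</div>')
tracer.close()
print(tracer.trace)  # [('open', 1), ('open', 2), ('close', 1), ('close', 0)]
print(tracer.data)   # ['a', 'b', 'c']
```

The text "d" is skipped because the counter has already dropped back to 0 by the time the second top-level div is reached.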
At the end of the parse, the data are left in self.data (a list of strings, possibly empty if no triggering tag was met). Code outside the class can access the list directly from the instance once parsing finishes, or you can add appropriate accessor methods, depending on what exactly your goal is.
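As a rough illustration of both options, the sketch below reads self.data directly and also through an accessor. It assumes Python 3 (html.parser); the get_text method is a hypothetical accessor I made up for the example, and the HTML is the snippet from the question rather than a fetched page.

```python
from html.parser import HTMLParser  # Python 3 name of the HTMLParser module

class LinksParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.recording = 0
        self.data = []

    def handle_starttag(self, tag, attributes):
        if tag != 'div':
            return
        if self.recording:
            self.recording += 1
            return
        # attributes is a list of (name, value) tuples
        if ('id', 'remository') in attributes:
            self.recording = 1

    def handle_endtag(self, tag):
        if tag == 'div' and self.recording:
            self.recording -= 1

    def handle_data(self, data):
        if self.recording:
            self.data.append(data)

    def get_text(self):
        # Hypothetical accessor: join the collected chunks into one string.
        return ''.join(self.data).strip()

p = LinksParser()
p.feed('...\n<div id="remository">20</div>\n...')
p.close()
print(p.data)        # ['20']
print(p.get_text())  # 20
```

If no triggering div is found, p.data stays an empty list and get_text() returns an empty string, so the caller can tell the two cases apart.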
The class could easily be made a bit more general by replacing the constant literal strings seen in the code above ('div', 'id', and 'remository') with instance attributes self.tag, self.attname and self.attvalue, set by __init__ from arguments passed to it. I avoided that cheap generalization step in the code above to avoid obscuring the core points: keep track of a count of nested tags, and accumulate data into a list while the recording state is active.
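One way that generalization might look is sketched below, again as a Python 3 port. The attribute names self.tag, self.attname and self.attvalue come from the paragraph above; the class name TagTextParser and the sample HTML are assumptions for the example. Note that the attribute value is compared for exact equality, so a multi-valued attribute such as class="price big" would not match 'price'.

```python
from html.parser import HTMLParser  # Python 3 name of the HTMLParser module

class TagTextParser(HTMLParser):
    """Collects text inside the first-level element matching tag/attname/attvalue."""
    def __init__(self, tag, attname, attvalue):
        super().__init__()
        self.tag = tag
        self.attname = attname
        self.attvalue = attvalue
        self.recording = 0
        self.data = []

    def handle_starttag(self, tag, attributes):
        if tag != self.tag:
            return
        if self.recording:
            self.recording += 1   # nested tag of the same kind
            return
        if (self.attname, self.attvalue) in attributes:
            self.recording = 1    # triggering tag

    def handle_endtag(self, tag):
        if tag == self.tag and self.recording:
            self.recording -= 1

    def handle_data(self, data):
        if self.recording:
            self.data.append(data)

parser = TagTextParser('span', 'class', 'price')
parser.feed('<p>Total: <span class="price">42 <span>EUR</span></span> incl. tax</p>')
parser.close()
print(parser.data)  # ['42 ', 'EUR']
```

Calling TagTextParser('div', 'id', 'remository') reproduces the behaviour of the hard-coded class.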