How can I use the Python HTMLParser library to extract data from a specific div tag?


Problem description

I am trying to get a value out of an HTML page using the Python HTMLParser library. The value I want to get hold of is within this HTML element:

...
<div id="remository">20</div>
...

This is my HTMLParser class so far:

class LinksParser(HTMLParser.HTMLParser):
  def __init__(self):
    HTMLParser.HTMLParser.__init__(self)
    self.seen = {}

  def handle_starttag(self, tag, attributes):
    if tag != 'div': return
    for name, value in attributes:
      if name == 'id' and value == 'remository':
        #print value
        return

  def handle_data(self, data):
    print data


p = LinksParser()
f = urllib.urlopen("http://domain.com/somepage.html")
html = f.read()
p.feed(html)
p.close()

Can someone point me in the right direction? I want the class functionality to get the value 20.

Solution

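import HTMLParser  # Python 2 module name; Python 3 renamed it to html.parser

class LinksParser(HTMLParser.HTMLParser):
  def __init__(self):
    HTMLParser.HTMLParser.__init__(self)
    self.recording = 0  # div nesting depth inside the triggering div (0 = not recording)
    self.data = []      # text chunks collected while recording

  def handle_starttag(self, tag, attributes):
    if tag != 'div':
      return
    if self.recording:
      # Already inside the triggering div: a nested div deepens the count.
      self.recording += 1
      return
    for name, value in attributes:
      if name == 'id' and value == 'remository':
        break
    else:
      # The loop ended without a break: this div is not the triggering one.
      return
    self.recording = 1

  def handle_endtag(self, tag):
    # Each closing div while recording pops one nesting level.
    if tag == 'div' and self.recording:
      self.recording -= 1

  def handle_data(self, data):
    if self.recording:
      self.data.append(data)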

self.recording counts the number of nested div tags starting from a "triggering" one. When we're in the sub-tree rooted in a triggering tag, we accumulate the data in self.data.

At the end of the parse, the data are left in self.data (a list of strings, empty if no triggering tag was met). Code outside the class can read the list directly from the instance once parsing is done, or you can add accessor methods for the purpose, depending on what exactly your goal is.
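As a rough illustration of reading self.data from outside the class, here is a minimal sketch. It assumes Python 3, where the module is named html.parser instead of HTMLParser, and it feeds a made-up HTML string (the `sample` variable is hypothetical) rather than fetching a real page.

```python
from html.parser import HTMLParser  # Python 3 name of the Python 2 HTMLParser module


class LinksParser(HTMLParser):
    """Python 3 port of the class above; the logic is unchanged."""

    def __init__(self):
        super().__init__()
        self.recording = 0
        self.data = []

    def handle_starttag(self, tag, attributes):
        if tag != 'div':
            return
        if self.recording:
            self.recording += 1
            return
        for name, value in attributes:
            if name == 'id' and value == 'remository':
                break
        else:
            return
        self.recording = 1

    def handle_endtag(self, tag):
        if tag == 'div' and self.recording:
            self.recording -= 1

    def handle_data(self, data):
        if self.recording:
            self.data.append(data)


# Hypothetical input: a triggering div with a nested div, plus unrelated markup.
sample = '<p>skip</p><div id="remository">20<div>inner</div></div><div>after</div>'

parser = LinksParser()
parser.feed(sample)
parser.close()

print(parser.data)  # ['20', 'inner']
```

Text inside the nested div is collected too, because the counter only drops back to zero when the triggering div itself closes. For the simple page in the question, the list would hold the single string '20'.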

The class could easily be made a bit more general by replacing the constant literal strings seen in the code above ('div', 'id', and 'remository') with instance attributes self.tag, self.attname and self.attvalue, set by __init__ from arguments passed to it. I skipped that cheap generalization step in the code above to avoid obscuring the core points: keep a count of nested tags, and accumulate data into a list while the recording state is active.
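That generalization might look something like the sketch below. The class name `TagDataParser` and the sample markup are made up for illustration, and the code again assumes Python 3's html.parser module; only the attribute names self.tag, self.attname and self.attvalue come from the answer.

```python
from html.parser import HTMLParser


class TagDataParser(HTMLParser):
    """Hypothetical generalization: collect text under <tag attname="attvalue">."""

    def __init__(self, tag, attname, attvalue):
        super().__init__()
        self.tag = tag
        self.attname = attname
        self.attvalue = attvalue
        self.recording = 0  # nesting depth of self.tag inside the triggering element
        self.data = []

    def handle_starttag(self, tag, attributes):
        if tag != self.tag:
            return
        if self.recording:
            self.recording += 1
            return
        # attributes is a list of (name, value) pairs, so a membership test
        # stands in for the for/else loop used in the original class.
        if (self.attname, self.attvalue) in attributes:
            self.recording = 1

    def handle_endtag(self, tag):
        if tag == self.tag and self.recording:
            self.recording -= 1

    def handle_data(self, data):
        if self.recording:
            self.data.append(data)


# Made-up markup: only the span whose class is "price" should be recorded.
price_parser = TagDataParser('span', 'class', 'price')
price_parser.feed('<span class="price">9.99</span><span class="other">x</span>')
price_parser.close()

print(price_parser.data)  # ['9.99']
```

Calling `TagDataParser('div', 'id', 'remository')` would reproduce the behavior of the original class.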
