Get the HTML under a tag using HTMLParser in Python


Question

I want to get the whole HTML under a tag using HTMLParser. I can currently get the data between the tags; here is my code:

from html.parser import HTMLParser  # Python 2: from HTMLParser import HTMLParser

class LinksParser(HTMLParser):
  def __init__(self):
    HTMLParser.__init__(self)
    self.recording = 0
    self.data = ''

  def handle_starttag(self, tag, attributes):
    if tag != 'span':
      return
    if self.recording:
      self.recording += 1
      return
    for name, value in attributes:
      if name == 'itemprop' and value == 'description':
        break
    else:
      return
    self.recording = 1

  def handle_endtag(self, tag):
    if tag == 'span' and self.recording:
      self.recording -= 1

  def handle_data(self, data):
    if self.recording:
      self.data += data

I also want the HTML tags that appear inside the input, for example:

<span itemprop="description">
<h1>My First Heading</h1>
<p>My first <br/><br/>paragraph.</p>
</span>

When this is provided as input, I only get the data without the tags. Is there any way to get the whole HTML between the tags?

Recommended answer

One could use xml.etree.ElementTree.TreeBuilder to build a tree from the parser events, and then use the etree API to find and manipulate the <span> element:

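import sys
from html.parser import HTMLParser  # Python 2: from HTMLParser import HTMLParser
from xml.etree import ElementTree as etree  # cElementTree was removed in Python 3.9

class LinksParser(HTMLParser):
    """Forward HTMLParser events to an ElementTree TreeBuilder."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.tb = etree.TreeBuilder()

    def handle_starttag(self, tag, attributes):
        self.tb.start(tag, dict(attributes))

    def handle_endtag(self, tag):
        self.tb.end(tag)

    def handle_data(self, data):
        self.tb.data(data)

    def close(self):
        HTMLParser.close(self)
        return self.tb.close()  # root element of the parsed tree

parser = LinksParser()
parser.feed(sys.stdin.read())
root = parser.close()
# find() only searches below root, so the <span> must not be the root itself
span = root.find(".//span[@itemprop='description']")
etree.ElementTree(span).write(sys.stdout, encoding="unicode")  # Python 2: omit encoding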

Output

<span itemprop="description">
<h1>My First Heading</h1>
<p>My first <br /><br />paragraph.</p>
</span>
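
Note that the output above is a re-serialization of the tree: `<br/>` in the input comes back as `<br />`. If the markup is needed exactly as written, one alternative (not part of the answer above) is to stay with plain HTMLParser and record the raw start tags through its `get_starttag_text()` method. The following is a minimal Python 3 sketch of that idea; the class name `InnerHTMLRecorder` and the helper `inner_html_verbatim` are made up for illustration, and comments or doctype declarations inside the span are not captured.

```python
from html.parser import HTMLParser


class InnerHTMLRecorder(HTMLParser):
    """Collect the markup found inside <span itemprop="description">."""

    def __init__(self):
        # convert_charrefs=False keeps entities such as &amp; as written
        super().__init__(convert_charrefs=False)
        self.depth = 0   # nesting level of <span> while recording
        self.parts = []  # raw pieces of the inner HTML

    def handle_starttag(self, tag, attrs):
        if self.depth:
            # already recording: keep the start tag exactly as in the source
            self.parts.append(self.get_starttag_text())
            if tag == 'span':
                self.depth += 1
        elif tag == 'span' and ('itemprop', 'description') in attrs:
            self.depth = 1  # start recording after this tag

    def handle_startendtag(self, tag, attrs):
        # self-closing tags such as <br/>
        if self.depth:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if not self.depth:
            return
        if tag == 'span':
            self.depth -= 1
            if self.depth == 0:
                return  # closing tag of the outer span: stop recording
        self.parts.append('</%s>' % tag)

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)

    def handle_entityref(self, name):
        if self.depth:
            self.parts.append('&%s;' % name)

    def handle_charref(self, name):
        if self.depth:
            self.parts.append('&#%s;' % name)


def inner_html_verbatim(html):
    """Hypothetical helper: return the recorded inner HTML as one string."""
    parser = InnerHTMLRecorder()
    parser.feed(html)
    parser.close()
    return ''.join(parser.parts)


SAMPLE = ('<span itemprop="description">\n'
          '<h1>My First Heading</h1>\n'
          '<p>My first <br/><br/>paragraph.</p>\n'
          '</span>')
print(inner_html_verbatim(SAMPLE))
```

Because nothing is parsed into a tree, this variant also tolerates unclosed tags such as `<br>`, but end tags are rebuilt from the lowercased tag name rather than copied from the source.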

To print without the parent (root) <span> tag:

sys.stdout.write(span.text)
for child in span:
    sys.stdout.write(etree.tostring(child)) # add encoding="unicode" on Python 3
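
On Python 3 the same loop can be wrapped in a small function that returns a string instead of writing to stdout. This is only a sketch: the name `inner_html` is made up here, and the element is built with `etree.fromstring` from the well-formed sample input purely to keep the example self-contained.

```python
from xml.etree import ElementTree as etree


def inner_html(element):
    """Serialize the content of element without its own start and end tags."""
    parts = [element.text or ""]  # text before the first child, if any
    for child in element:
        # tostring() includes the child's tail text, so nothing is lost
        parts.append(etree.tostring(child, encoding="unicode"))
    return "".join(parts)


# The sample input happens to be well-formed XML, so fromstring() can parse it.
span = etree.fromstring(
    '<span itemprop="description">\n'
    '<h1>My First Heading</h1>\n'
    '<p>My first <br/><br/>paragraph.</p>\n'
    '</span>'
)
print(inner_html(span))
```

The `element.text or ""` guard matters when the element starts directly with a child tag, because `text` is then `None`.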
