Parsing HTML to get text inside an element


Problem description

I need to get the text inside the two elements into a string:

source_code = """<span class="UserName"><a href="#">Martin Elias</a></span>"""

>>> text
'Martin Elias'

How could I achieve this?

Solution

I searched "python parse html" and this was the first result: https://docs.python.org/2/library/htmlparser.html

This code is taken from the Python docs:

from HTMLParser import HTMLParser  # Python 2 module; Python 3 renamed it to html.parser

# create a subclass and override the handler methods
class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print "Encountered a start tag:", tag
    def handle_endtag(self, tag):
        print "Encountered an end tag :", tag
    def handle_data(self, data):
        print "Encountered some data  :", data

# instantiate the parser and feed it some HTML
parser = MyHTMLParser()
parser.feed('<html><head><title>Test</title></head>'
            '<body><h1>Parse me!</h1></body></html>')

Here is the result:

Encountered a start tag: html
Encountered a start tag: head
Encountered a start tag: title
Encountered some data  : Test
Encountered an end tag : title
Encountered an end tag : head
Encountered a start tag: body
Encountered a start tag: h1
Encountered some data  : Parse me!
Encountered an end tag : h1
Encountered an end tag : body
Encountered an end tag : html
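
The trace above shows the order in which the handler methods fire. As a rough sketch of the same idea, assuming Python 3 (where the module is `html.parser`), the events can be recorded in a list instead of printed. The class name `EventRecorder` and the `events` attribute are made up for illustration and are not part of the docs example.

```python
from html.parser import HTMLParser  # Python 3 name of the HTMLParser module


class EventRecorder(HTMLParser):  # hypothetical name, not from the docs
    """Record each handler call as a (kind, value) tuple."""

    def __init__(self):
        super().__init__()  # sets up the parser's internal state
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))


recorder = EventRecorder()
recorder.feed('<html><head><title>Test</title></head>'
              '<body><h1>Parse me!</h1></body></html>')
recorder.close()  # flush any text still buffered by the parser

for kind, value in recorder.events:
    print(kind, value)
```

Keeping the events in a list makes it possible to inspect or filter them after parsing, which is the approach the custom parser in this answer takes.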

Building on the docs example, and after reading the HTMLParser source, I came up with this:

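class myhtmlparser(HTMLParser):
    def __init__(self):
        # Run the base initializer (it calls reset() itself)
        HTMLParser.__init__(self)
        self.NEWTAGS = []   # tag names, in document order
        self.NEWATTRS = []  # one list of (name, value) pairs per start tag
        self.HTMLDATA = []  # text found between tags
    def handle_starttag(self, tag, attrs):
        self.NEWTAGS.append(tag)
        self.NEWATTRS.append(attrs)
    def handle_data(self, data):
        self.HTMLDATA.append(data)
    def clean(self):
        # Drop everything collected so far so the parser can be reused
        self.NEWTAGS = []
        self.NEWATTRS = []
        self.HTMLDATA = []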

You can use it like this:

from HTMLParser import HTMLParser  # Python 2 module; Python 3 renamed it to html.parser

pstring = source_code = """<span class="UserName"><a href="#">Martin Elias</a></span>"""


class myhtmlparser(HTMLParser):
    def __init__(self):
        # Run the base initializer (it calls reset() itself)
        HTMLParser.__init__(self)
        self.NEWTAGS = []
        self.NEWATTRS = []
        self.HTMLDATA = []
    def handle_starttag(self, tag, attrs):
        self.NEWTAGS.append(tag)
        self.NEWATTRS.append(attrs)
    def handle_data(self, data):
        self.HTMLDATA.append(data)
    def clean(self):
        self.NEWTAGS = []
        self.NEWATTRS = []
        self.HTMLDATA = []

parser = myhtmlparser()
parser.feed(pstring)

# Extract data from parser
tags  = parser.NEWTAGS
attrs = parser.NEWATTRS
data  = parser.HTMLDATA

# Clean the parser
parser.clean()

# Print out our data
print tags
print attrs
print data

Now you should be able to extract your data from those lists easily. For the sample input, data is ['Martin Elias'], so data[0] is the string you asked for. I hope this helps!
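
If only the text is needed, the tag and attribute lists can be skipped. The following is a minimal sketch, assuming Python 3 (`html.parser` instead of `HTMLParser`); the class name `TextCollector` and its `chunks` attribute are hypothetical names chosen for this example. It gathers every text node and joins them into one string.

```python
from html.parser import HTMLParser  # Python 3 name of the HTMLParser module


class TextCollector(HTMLParser):  # hypothetical helper, for illustration only
    """Collect every piece of text passed to handle_data."""

    def __init__(self):
        super().__init__()  # sets up the parser's internal state
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


source_code = '<span class="UserName"><a href="#">Martin Elias</a></span>'

collector = TextCollector()
collector.feed(source_code)
collector.close()  # flush any text still buffered by the parser

# Join the text nodes and trim surrounding whitespace
text = "".join(collector.chunks).strip()
print(text)
```

Joining the chunks rather than taking the first one also covers input where the text is split across several nested elements.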
