Remove html tags AND get start/end indices of marked-down text?


Question


I have a bunch of text in Markdown format:

a**b**c

which renders as abc, with the b in bold.

I've converted it to HTML tags to make it more regular:

a<strong>b</strong>c

I know there are a lot of tools out there to convert to plain text, but I want to do that AND get the indices of the inner text for each markdown/tag.

For example, the input

a<strong>b</strong>c 

would return both the stripped text:

abc

and give me the start (position of the first char, b) and end (position of the first char AFTER the tagged string, c), so for this example (start, end) = (1, 2). This also has to work on nested tags. I know there are a lot of libraries out there (I'm using Python 3) to remove the tags, but I haven't found one that will do both tasks. Can anyone help me by either pointing out something that does this, or describing an algorithm that might work?

Examples of nested markup:

Some tags can be nested inside their own tag type indefinitely:

<sup><sup>There</sup></sup> <sup><sup>was</sup></sup> <sup><sup>another</sup></sup> <sup><sup>thread</sup></sup> <sup><sup>like</sup></sup> <sup><sup>this</sup></sup>

Lists, too:

<ul>
<li>https://steamcommunity.com/tradeoffer/new/partner=30515749&token=WOIxg5eB</li>
<li>79</li>
<li>Why did the elephants get kicked out of the public pool?  THEY KEPT DROPPING THEIR TRUNKS! </li>
</ul>

Strikethrough can also be nested inside italic, etc.:

<em><strike>a</strike></em>

Solution

It looks like what you want is an HTML parser. HTML parsers are complicated things, so you want to use an existing library (creating your own is hard and likely to fail on many edge cases). Unfortunately, as highlighted in this question, most of the existing HTML parsing libraries do not retain position information. The good news is that the one HTML parser which reliably retains position information is in the Python standard library (see html.parser.HTMLParser). And as you are using Python 3, the problems that parser had in earlier versions have been fixed.

A basic example might look like this:

from html.parser import HTMLParser


class StripTextParser(HTMLParser):
    def __init__(self, *args, **kwargs):
        # Collected (text, startpos, endpos) tuples, in document order.
        self.data = []
        super().__init__(*args, **kwargs)

    def handle_data(self, data):
        if data.strip():
            # Only keep strings which contain more than whitespace.
            startpos = self.getpos()
            # `self.getpos()` returns `(line, column)` of start position.
            # Use that plus length of data to calculate end position.
            # Note: this assumes `data` does not span multiple lines.
            endpos = (startpos[0], startpos[1] + len(data))
            self.data.append((data, startpos, endpos))


def strip_text(html):
    parser = StripTextParser()
    parser.feed(html)
    parser.close()  # flush any text the parser is still buffering
    return parser.data

test1 = "<sup><sup>There</sup></sup> <sup><sup>was</sup></sup> <sup><sup>another</sup></sup> <sup><sup>thread</sup></sup> <sup><sup>like</sup></sup> <sup><sup>this</sup></sup>" 

print(strip_text(test1))

# Outputs: [('There', (1, 10), (1, 15)), ('was', (1, 38), (1, 41)), ('another', (1, 64), (1, 71)), ('thread', (1, 94), (1, 100)), ('like', (1, 123), (1, 127)), ('this', (1, 150), (1, 154))]


test2 = """
<ul>
<li>https://steamcommunity.com/tradeoffer/new/partner=30515749&token=WOIxg5eB</li>
<li>79</li>
<li>Why did the elephants get kicked out of the public pool?  THEY KEPT DROPPING THEIR TRUNKS! </li>
</ul>
"""

print(strip_text(test2))

# Outputs: [('https://steamcommunity.com/tradeoffer/new/partner=30515749&token=WOIxg5eB', (3, 4), (3, 77)), ('79', (4, 4), (4, 6)), ('Why did the elephants get kicked out of the public pool?  THEY KEPT DROPPING THEIR TRUNKS! ', (5, 4), (5, 95))]

test3 = "<em><strike>a</strike></em>"

print(strip_text(test3))

# Outputs: [('a', (1, 12), (1, 13))]
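
The `(line, column)` pairs printed above refer to the source HTML. If flat character offsets are easier to work with, one possible approach is to record where each line starts and add the column to it. This is only a sketch and not part of the original answer: `OffsetTextParser` and `text_offsets` are names made up for illustration, and it assumes the whole document is passed in a single `feed()` call and that the text contains no character references such as `&amp;` (those are decoded before `handle_data` sees them, so `len(data)` would no longer match the source length).

```python
from html.parser import HTMLParser


class OffsetTextParser(HTMLParser):
    """Record each text chunk with absolute (start, end) offsets into the source."""

    def __init__(self):
        super().__init__()
        self.line_starts = [0]  # absolute offset at which each line begins
        self.data = []

    def feed(self, html):
        # Assumption: the whole document arrives in one feed() call.
        self.line_starts = [0]
        for i, ch in enumerate(html):
            if ch == "\n":
                self.line_starts.append(i + 1)
        super().feed(html)

    def handle_data(self, data):
        if data.strip():
            line, col = self.getpos()  # line is 1-based, column is 0-based
            start = self.line_starts[line - 1] + col
            # Assumption: no character references, so len(data) matches the source.
            self.data.append((data, start, start + len(data)))


def text_offsets(html):
    parser = OffsetTextParser()
    parser.feed(html)
    parser.close()
    return parser.data


source = "<ul>\n<li>79</li>\n</ul>"
for text, start, end in text_offsets(source):
    # Slicing the source with the offsets gives back the same text.
    print(text, start, end, source[start:end] == text)
```

Because the offsets are absolute, text that spans several lines is handled without any special casing.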

Without more specific information about the desired output format, I just created a list of tuples. Of course, you can refactor this to fit your specific needs. And if you want all of the whitespace, remove the `if data.strip():` line.
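
The question asks for offsets into the stripped text rather than into the HTML source. One way to refactor toward that, offered as a rough sketch under assumptions rather than a definitive implementation, is to keep a running length of the text seen so far plus a stack of open tags. `SpanParser` and `strip_with_spans` are hypothetical names, and the sketch assumes reasonably well-formed markup: stray end tags are ignored, and tags that are never closed (for example `<br>`) simply produce no span.

```python
from html.parser import HTMLParser


class SpanParser(HTMLParser):
    """Collect plain text plus (tag, start, end) offsets into that text."""

    def __init__(self):
        super().__init__()
        self.chunks = []     # pieces of stripped text, joined at the end
        self.length = 0      # running length of the stripped text
        self.open_tags = []  # stack of (tag, start offset) for unclosed tags
        self.spans = []      # finished (tag, start, end) triples

    def handle_starttag(self, tag, attrs):
        # Remember where in the stripped text this tag's content begins.
        self.open_tags.append((tag, self.length))

    def handle_endtag(self, tag):
        # Close the nearest open tag with the same name; ignore stray end tags.
        for i in range(len(self.open_tags) - 1, -1, -1):
            if self.open_tags[i][0] == tag:
                _, start = self.open_tags.pop(i)
                self.spans.append((tag, start, self.length))
                break

    def handle_data(self, data):
        self.chunks.append(data)
        self.length += len(data)


def strip_with_spans(html):
    parser = SpanParser()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks), parser.spans


text, spans = strip_with_spans("a<strong>b</strong>c")
print(text)   # abc
print(spans)  # [('strong', 1, 2)]
```

Spans are appended when a tag closes, so for nested markup the inner tag is listed before the outer one; `<em><strike>a</strike></em>` gives `[('strike', 0, 1), ('em', 0, 1)]`.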
