How can I remove POS tags before slashes in nltk?

Question

This is part of my project, where I need to represent the output of phrase detection as (a, x, b), where a, x and b are phrases. I wrote the code and got output like this:

(CLAUSE (NP Jack/NNP) (VP loved/VBD) (NP Peter/NNP))
(CLAUSE (NP Jack/NNP) (VP stayed/VBD) (NP in/IN London/NNP))
(CLAUSE (NP Tom/NNP) (VP is/VBZ) (NP in/IN Kolkata/NNP))

I want it to look like the (a, x, b) representation above, which means I have to remove the tags 'CLAUSE', 'NP', 'VP', 'VBD', 'NNP' and so on.

How can I do that?

I first wrote this output to a text file, tokenized it, and used list.remove('word'), but that did not help at all. Let me clarify a bit more. Given this input:

(CLAUSE (NP Jack/NNP) (VP loved/VBD) (NP Peter/NNP)) (CLAUSE (NP Jack/NNP) (VP stayed/VBD) (NP in/IN London/NNP))

the desired output is [Jack,loved,Peter], [Jack,stayed,in London]: one item per bracketed phrase, with the tags removed.

Recommended answer

Since you tagged this nltk, let's use NLTK's tree parser to process your trees. We'll read in each tree, then simply print out the leaves. Done.

>>> import nltk
>>> text = "(CLAUSE (NP Jack/NNP) (VP stayed/VBD) (NP in/IN London/NNP))"
>>> tree = nltk.Tree.fromstring(text, read_leaf=lambda x: x.split("/")[0])
>>> print(tree.leaves())

['Jack', 'stayed', 'in', 'London']

The lambda form splits each word/tag pair and discards the tag, keeping just the word.
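One caveat, offered as an assumption about data the question does not show: if a token itself contains a slash (say 1/2/CD), x.split("/")[0] would keep only "1". A minimal sketch of a more defensive splitter follows; strip_tag is a hypothetical helper name, not part of NLTK, and it can be passed as read_leaf=strip_tag in place of the lambda.

```python
def strip_tag(token):
    """Return the word part of a word/TAG token (hypothetical helper)."""
    # rpartition splits on the LAST slash, so slashes inside the word survive.
    word, sep, _tag = token.rpartition("/")
    # If there is no slash at all, sep is empty and the token is returned as-is.
    return word if sep else token


print(strip_tag("London/NNP"))  # London
print(strip_tag("1/2/CD"))      # 1/2
print(strip_tag("hello"))       # hello
```

For the sample trees in the question, both versions give the same result, since none of the words contains a slash.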

I know, you're going to ask me how to process a whole file's worth of such trees, some of which span more than one line. That's the job of NLTK's BracketParseCorpusReader, but it expects terminals in the form (POS word) instead of word/POS. I won't bother doing it that way, since it's even easier to trick Tree.fromstring() into reading all your trees as if they were branches of a single tree:

import nltk

# Stand-in for the contents of your file: one bracketed tree per clause
allmytext = """
(CLAUSE (NP Jack/NNP) (VP loved/VBD) (NP Peter/NNP))
(CLAUSE (NP Jack/NNP) (VP stayed/VBD) (NP in/IN London/NNP))
(CLAUSE (NP Tom/NNP) (VP is/VBZ) (NP in/IN Kolkata/NNP))
"""
wrapped = "(ROOT " + allmytext + " )"  # Add a "root" node at the top
trees = nltk.Tree.fromstring(wrapped, read_leaf=lambda x: x.split("/")[0])
for tree in trees:  # Each child of ROOT is one of the original trees
    print(tree.leaves())

As you see, the only difference is we added "(ROOT " and " )" around the file contents, and used a for-loop to generate the output. The loop gives us the children of the top node, i.e. the actual trees.
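The question asked for one item per phrase ("in London" rather than "in", "London"), while leaves() returns a flat list of words. Here is a minimal sketch of one way to group by phrase, assuming every child of a CLAUSE is either a bracketed phrase or a bare token; the names phrase_text and clauses are mine, not part of NLTK.

```python
import nltk

allmytext = """
(CLAUSE (NP Jack/NNP) (VP loved/VBD) (NP Peter/NNP))
(CLAUSE (NP Jack/NNP) (VP stayed/VBD) (NP in/IN London/NNP))
(CLAUSE (NP Tom/NNP) (VP is/VBZ) (NP in/IN Kolkata/NNP))
"""


def phrase_text(node):
    """Join the words under one phrase node (hypothetical helper)."""
    if isinstance(node, nltk.Tree):
        return " ".join(node.leaves())
    return node  # a bare leaf directly under CLAUSE is already a word


# Same ROOT-wrapping trick as above, then one list per clause.
root = nltk.Tree.fromstring("(ROOT " + allmytext + " )",
                            read_leaf=lambda x: x.split("/")[0])
clauses = [[phrase_text(child) for child in clause] for clause in root]

for clause in clauses:
    print(clause)
# ['Jack', 'loved', 'Peter']
# ['Jack', 'stayed', 'in London']
# ['Tom', 'is', 'in Kolkata']
```

Iterating over a clause yields its immediate children (the NP and VP nodes), so each phrase becomes a single string regardless of how many words it contains.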
