How can I remove POS tags before slashes in nltk?
Question
This is part of my project, where I need to represent the output after phrase detection like this: (a, x, b), where a, x, and b are phrases. I wrote the code and got output like this:
(CLAUSE (NP Jack/NNP) (VP loved/VBD) (NP Peter/NNP))
(CLAUSE (NP Jack/NNP) (VP stayed/VBD) (NP in/IN London/NNP))
(CLAUSE (NP Tom/NNP) (VP is/VBZ) (NP in/IN Kolkata/NNP))
I want it to look like the representation above, which means I have to remove tags such as 'CLAUSE', 'NP', 'VP', 'VBD', 'NNP', etc.
How can I do that?
First I wrote this to a text file, tokenized it, and used list.remove('word'). But that was not helpful at all.
Let me clarify a bit more. My input:
(CLAUSE (NP Jack/NNP) (VP loved/VBD) (NP Peter/NNP))
(CLAUSE (NP Jack/NNP) (VP stayed/VBD) (NP in/IN London/NNP))
Desired output: [Jack,loved,Peter], [Jack,stayed,in London]. The output simply follows the bracketing, without the tags.
Recommended answer
Since you tagged this nltk, let's use NLTK's tree parser to process your trees. We'll read in each tree, then simply print out the leaves. Done.
>>> import nltk
>>> text = "(CLAUSE (NP Jack/NNP) (VP stayed/VBD) (NP in/IN London/NNP))"
>>> tree = nltk.Tree.fromstring(text, read_leaf=lambda x: x.split("/")[0])
>>> print(tree.leaves())
['Jack', 'stayed', 'in', 'London']
The lambda splits each word/tag pair and discards the tag, keeping just the word.
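One caveat with x.split("/")[0]: if a token itself contains a slash (for example a fraction written as 1/2/CD), only the text before the first slash survives. Below is a minimal sketch of a more defensive leaf reader, assuming the tag is always whatever follows the last slash. The name strip_tag is a hypothetical helper for illustration, not part of NLTK.

```python
def strip_tag(token):
    """Return the word part of a 'word/TAG' token.

    Assumes the tag is everything after the LAST slash, so words that
    contain slashes themselves (e.g. '1/2/CD') are kept whole.
    Tokens without any slash are returned unchanged.
    """
    word, sep, _tag = token.rpartition("/")
    return word if sep else token


print(strip_tag("Jack/NNP"))
print(strip_tag("1/2/CD"))
print(strip_tag("London"))
```

Since read_leaf accepts any function that maps a leaf string to a value, this helper could be passed instead of the lambda, as read_leaf=strip_tag.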
I know, you're going to ask how to process a whole file's worth of such trees, some of which span more than one line. That's the job of NLTK's BracketParseCorpusReader, but it expects terminals in the form (POS word) instead of word/POS. I won't bother doing it that way, since it's even easier to trick Tree.fromstring() into reading all your trees as if they were branches of a single tree:
import nltk

allmytext = """
(CLAUSE (NP Jack/NNP) (VP loved/VBD) (NP Peter/NNP))
(CLAUSE (NP Jack/NNP) (VP stayed/VBD) (NP in/IN London/NNP))
(CLAUSE (NP Tom/NNP) (VP is/VBZ) (NP in/IN Kolkata/NNP))
"""
wrapped = "(ROOT " + allmytext + " )"  # Add a "root" node at the top
# read_leaf strips the "/TAG" suffix from every word/TAG leaf
trees = nltk.Tree.fromstring(wrapped, read_leaf=lambda x: x.split("/")[0])
for tree in trees:  # iterating over ROOT yields each CLAUSE subtree
    print(tree.leaves())
As you can see, the only difference is that we added "(ROOT " and " )" around the file contents and used a for-loop to generate the output. The loop gives us the children of the top node, i.e. the actual trees.
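The question asked for one entry per phrase (so that "in London" stays together) rather than a flat list of words. Here is a rough sketch of one way to get that, assuming every direct child of a clause is a phrase subtree, as in the sample data. The name clause_phrases is a hypothetical helper, not an NLTK function.

```python
import nltk


def clause_phrases(clause):
    """Join the leaves under each direct child of a clause into one string.

    Assumes every child of the clause is itself a subtree (NP, VP, ...),
    which holds for the sample data but not necessarily for other input.
    """
    return [" ".join(phrase.leaves()) for phrase in clause]


allmytext = """
(CLAUSE (NP Jack/NNP) (VP loved/VBD) (NP Peter/NNP))
(CLAUSE (NP Jack/NNP) (VP stayed/VBD) (NP in/IN London/NNP))
"""
wrapped = "(ROOT " + allmytext + " )"  # same wrapping trick as above
trees = nltk.Tree.fromstring(wrapped, read_leaf=lambda x: x.split("/")[0])

# One list of phrase strings per clause
results = [clause_phrases(clause) for clause in trees]
for result in results:
    print(result)
```

Each inner list then corresponds to one (a, x, b) triple; wrapping it in tuple() would give the tuple form if that is preferred.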