How to get a common tag pattern for a list of sentences in Python with NLTK
Question
I have a list of sentences. With NLTK I can tag each sentence and get its tag pattern, so I can get the tag patterns for the whole list. What I want, though, is to identify the common tag patterns that most of the sentences match. For example:
What is encapsulation
tag pattern : {<WP><VBZ><NN>}
How was your wedding
tag pattern : {<WRB><VBD><PRP$><NN>}
What is your plan today
tag pattern : {<WP><VBZ><PRP$><NN><NN>}
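The per-sentence patterns above can be reproduced with a few lines of Python. This is only a minimal sketch: the tags are hard-coded from the three examples, and the names `TAGGED`, `tag_pattern`, and `pattern_counts` are made up for illustration. In practice the (word, tag) pairs would come from something like `nltk.pos_tag(nltk.word_tokenize(text))`, which needs the NLTK tokenizer and tagger data to be downloaded first.

```python
from collections import Counter

# (word, tag) pairs copied from the examples above. In a real run they would
# be produced by nltk.pos_tag(nltk.word_tokenize(text)) instead of typed in.
TAGGED = [
    [("What", "WP"), ("is", "VBZ"), ("encapsulation", "NN")],
    [("How", "WRB"), ("was", "VBD"), ("your", "PRP$"), ("wedding", "NN")],
    [("What", "WP"), ("is", "VBZ"), ("your", "PRP$"), ("plan", "NN"), ("today", "NN")],
]


def tag_pattern(tagged_sentence):
    """Render a tagged sentence as a chunk-style pattern such as {<WP><VBZ><NN>}."""
    return "{" + "".join(f"<{tag}>" for _, tag in tagged_sentence) + "}"


patterns = [tag_pattern(sentence) for sentence in TAGGED]

# Counting exact patterns shows which ones recur, but it does not generalize:
# every distinct tag sequence is still a separate pattern.
pattern_counts = Counter(patterns)

for pattern in patterns:
    print("tag pattern :", pattern)
```

Counting exact patterns only helps when many sentences share the same tag sequence; the three examples here all differ, which is why a generalized pattern is wanted.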
So the common tag pattern (combining the tags with regexp quantifiers) for the three sentences above is:
{<W.+><V.+><PRP.?>?<NN>+} - one "Wh" word, one verb, zero or one pronoun, one or more nouns
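Whether a hand-written common pattern really covers every sentence can be checked with `nltk.RegexpParser`, which accepts this tag-pattern syntax. The following is a sketch under assumptions: the chunk label `Q` and the helper `fully_matches` are invented for this example, and the tags are again hard-coded from the question rather than produced by a tagger.

```python
import nltk

# The common pattern from the question, wrapped in a chunk rule labelled "Q"
# (the label is arbitrary and chosen only for this example).
GRAMMAR = "Q: {<W.+><V.+><PRP.?>?<NN>+}"
parser = nltk.RegexpParser(GRAMMAR)

# (word, tag) pairs copied from the question's three examples.
TAGGED = [
    [("What", "WP"), ("is", "VBZ"), ("encapsulation", "NN")],
    [("How", "WRB"), ("was", "VBD"), ("your", "PRP$"), ("wedding", "NN")],
    [("What", "WP"), ("is", "VBZ"), ("your", "PRP$"), ("plan", "NN"), ("today", "NN")],
]


def fully_matches(tagged_sentence):
    """Return True if a single Q chunk spans the entire tagged sentence."""
    tree = parser.parse(tagged_sentence)
    # If the root has exactly one child and that child is a Q subtree,
    # no token was left outside the chunk.
    return (
        len(tree) == 1
        and isinstance(tree[0], nltk.Tree)
        and tree[0].label() == "Q"
    )


for sentence in TAGGED:
    words = " ".join(word for word, _ in sentence)
    print(words, "->", fully_matches(sentence))
```

This only verifies a pattern that has already been written; it does not derive one from the data.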
So I want to generalize the tag patterns of individual sentences into common ones.
Can someone tell me how to do that?
Recommended answer
It sounds like you are after a regexp (with quantifiers) that will match all the different tag sequences in your data. That is not an easy problem in itself, but I suspect your real goal is to find a pattern that captures the sequences that are legal sentences. Is that right?
If so, regexps (and finite-state approaches in general) are inherently the wrong tool for the job. To even get a start on characterizing your sentence collection, you need to look at context-free grammars. Take a look at the NLTK materials on the topic.
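To give a feel for the context-free approach, here is a small sketch that uses `nltk.CFG` and `nltk.ChartParser` with POS tags as terminals. The grammar is an illustrative assumption written by hand to cover the three example sentences from the question; it is not taken from the answer and is not a general grammar of English questions. The helper name `is_accepted` is likewise made up.

```python
import nltk

# Hand-written toy grammar over POS tags (an assumption for illustration).
# The start symbol is Q, the left-hand side of the first production.
grammar = nltk.CFG.fromstring("""
    Q  -> WH V NP
    WH -> 'WP' | 'WRB'
    V  -> 'VBZ' | 'VBD'
    NP -> 'PRP$' N | N
    N  -> 'NN' N | 'NN'
""")
parser = nltk.ChartParser(grammar)


def is_accepted(tags):
    """Return True if the tag sequence has at least one parse under the grammar."""
    try:
        return next(iter(parser.parse(tags)), None) is not None
    except ValueError:
        # ChartParser raises ValueError when a token is not covered by the grammar.
        return False


for tags in (
    ["WP", "VBZ", "NN"],
    ["WRB", "VBD", "PRP$", "NN"],
    ["WP", "VBZ", "PRP$", "NN", "NN"],
    ["NN", "VBZ", "WP"],
):
    print(tags, "->", is_accepted(tags))
```

Unlike a flat tag pattern, the rules here name constituents such as `NP`, so the same noun-phrase rule could be reused wherever a noun phrase is allowed as the grammar grows.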