How to use WEKA for keyphrase extraction from text arguments


Question

I am working on a project on keyphrase extraction from text arguments. I first cleaned the input and then determined a list of candidate phrases (around 300 in total) using the Stanford parser (POS tagging). I then computed the feature values of every phrase, and I followed these steps for each document in my dataset. How should I proceed now, that is, how can I use WEKA to find the keyphrases? How should I store the phrases and their feature values (TF×IDF) in WEKA? And how can I measure the efficiency of the final project?

Recommended answer

WEKA does an excellent and simple job with text classification tasks (such as text categorization and clustering), in which the instances are relatively long pieces of text (anything from tweets to documents) and the classes, when available, are non-overlapping tags (for example, thematic classes such as economy/sports/..., spam/legitimate email, or positive/negative in sentiment analysis).

However, WEKA is not a direct fit for term classification tasks such as part-of-speech tagging, word sense disambiguation, named entity recognition or, in your case, keyphrase extraction. To apply WEKA, you need not only your original texts and the manually extracted keyphrases, but also a decision about which attributes make those pieces of text actual keyphrases. You have to inspect your examples and decide, for instance, that the parts of speech of the words in a keyphrase and of the surrounding words are actually important for guessing that a piece of text is a keyphrase.

I strongly recommend you take a look at the representation used in the datasets of the CoNLL NER shared tasks (CoNLL 2002 and 2003). Each word in a named entity is treated independently and marked as being at the start, in the middle, or at the end of the named entity. Additionally, the features you can use are the actual words, the surrounding words, and their parts of speech.
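The same per-word marking can be applied to keyphrases. The following is only a minimal sketch under stated assumptions: the tag names B-KP, I-KP and O and the function `label_tokens` are hypothetical (they are not part of WEKA or of the CoNLL data), and matching is done by simple case-insensitive comparison of whitespace-separated words.

```python
def label_tokens(tokens, keyphrases):
    """Tag each token as B-KP (first word of a keyphrase),
    I-KP (any later word of a keyphrase) or O (outside any keyphrase).

    The tag names are an assumption made for this sketch.
    """
    labels = ["O"] * len(tokens)
    lowered = [token.lower() for token in tokens]
    # Try longer phrases first so a long keyphrase is not split by a shorter one.
    for phrase in sorted(keyphrases, key=lambda p: -len(p.split())):
        words = phrase.lower().split()
        n = len(words)
        if n == 0:
            continue
        for start in range(len(tokens) - n + 1):
            span = range(start, start + n)
            # Only label spans that match and have not been labelled already.
            if lowered[start:start + n] == words and all(labels[i] == "O" for i in span):
                labels[start] = "B-KP"
                for i in span[1:]:
                    labels[i] = "I-KP"
    return labels


tokens = "Support vector machines are used for text classification".split()
labels = label_tokens(tokens, ["support vector machines", "text classification"])
for token, label in zip(tokens, labels):
    print(f"{token}\t{label}")
```

Each token then becomes one training instance whose class is its tag, which is the shape of problem WEKA classifiers expect.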

For instance, consider this excerpt from the CoNLL 2003 NER dataset:

   U.N.         NNP  I-NP  I-ORG 
   official     NN   I-NP  O 
   Ekeus        NNP  I-NP  I-PER 
   heads        VBZ  I-VP  O 
   for          IN   I-PP  O 

Here the word "Ekeus" is an NNP, it is inside a noun phrase (I-NP), and it is a named entity of type "person" (I-PER). You can process this format to get an instance file in which you use the POS tags and the actual words in a two-word window:

@attribute word-2 string
@attribute word-1 string
@attribute word string
@attribute word+1 string
@attribute word+2 string
@attribute postag-2 {NNP, NN, ....} % The full list of available POS tags
@attribute postag-1 {NNP, NN, ....}
% ../..
@attribute named-entity-class {O, I-PER, I-ORG, ...} % The full list of possible NE tags

@data
"U.N.","official","Ekeus","heads","for",NNP,NN,NNP,VBZ,IN,I-PER
../..

As you can see, you have to decide which attributes you need for each word and then build windows with those attributes.
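The window construction can be sketched in a few lines of Python. This is an illustration under assumptions rather than a definitive converter: the helper names `quote`, `window_rows` and `to_arff`, the `PAD` filler value for positions outside the sentence, and the attribute name `class` are all hypothetical, and every value is single-quoted with backslash escapes so that tags containing punctuation stay readable as ARFF.

```python
PAD = "PAD"  # hypothetical filler for window positions outside the sentence


def quote(value):
    """Single-quote a value for ARFF, escaping backslashes and quotes."""
    return "'" + value.replace("\\", "\\\\").replace("'", "\\'") + "'"


def window_rows(sentence, size=2):
    """Yield one row per token: window words, window POS tags, then the label.

    `sentence` is a list of (word, pos_tag, label) tuples.
    """
    words = [PAD] * size + [word for word, _, _ in sentence] + [PAD] * size
    tags = [PAD] * size + [tag for _, tag, _ in sentence] + [PAD] * size
    width = 2 * size + 1
    for i, (_, _, label) in enumerate(sentence):
        yield words[i:i + width] + tags[i:i + width] + [label]


def suffix(offset):
    """Return '-2', '-1', '', '+1', '+2' for the attribute names."""
    return f"{offset:+d}" if offset else ""


def to_arff(sentences, relation="tokens", size=2):
    """Render the windowed rows of all sentences as ARFF text."""
    offsets = range(-size, size + 1)
    tag_values = sorted({tag for s in sentences for _, tag, _ in s} | {PAD})
    label_values = sorted({label for s in sentences for _, _, label in s})
    tag_spec = "{" + ",".join(quote(v) for v in tag_values) + "}"
    label_spec = "{" + ",".join(quote(v) for v in label_values) + "}"

    lines = [f"@relation {relation}", ""]
    lines += [f"@attribute word{suffix(o)} string" for o in offsets]
    lines += [f"@attribute postag{suffix(o)} {tag_spec}" for o in offsets]
    lines += [f"@attribute class {label_spec}", "", "@data"]
    for sentence in sentences:
        for row in window_rows(sentence, size):
            lines.append(",".join(quote(v) for v in row))
    return "\n".join(lines) + "\n"


sentence = [
    ("U.N.", "NNP", "I-ORG"),
    ("official", "NN", "O"),
    ("Ekeus", "NNP", "I-PER"),
    ("heads", "VBZ", "O"),
    ("for", "IN", "O"),
]
arff_text = to_arff([sentence])
print(arff_text)
```

For keyphrase extraction, the label column would hold your keyphrase tags instead of the named-entity tags. Most WEKA classifiers do not accept string attributes directly, so the word columns usually have to be converted first, for example with the StringToNominal filter.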
