Algorithms to detect phrases and keywords from text

Problem description

I have around 100 megabytes of text, without any markup, divided into approximately 10,000 entries. I would like to automatically generate a 'tag' list. The problem is that there are word groups (i.e. phrases) that only make sense when they are grouped together.

If I just count the words, I get a large number of really common words (is, the, for, in, am, etc.). I have counted the words and the number of other words that appear before and after each one, but now I really cannot figure out what to do next. The information relating to the 2- and 3-word phrases is present, but how do I extract this data?
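(For concreteness, a minimal sketch of this kind of neighbour counting in Python might look like the following. This is illustrative only, not the OP's code; "corpus.txt" is a placeholder path and plain whitespace tokenization is assumed.)

    from collections import Counter, defaultdict

    # Sketch of the counting described above (illustrative, not the OP's code).
    # "corpus.txt" is a placeholder path; whitespace tokenization is assumed.
    tokens = open("corpus.txt", encoding="utf-8").read().lower().split()

    word_counts = Counter(tokens)      # plain word frequencies
    before = defaultdict(Counter)      # before[w][x]: times x occurs right before w
    after = defaultdict(Counter)       # after[w][x]:  times x occurs right after w

    for prev, curr in zip(tokens, tokens[1:]):
        after[prev][curr] += 1
        before[curr][prev] += 1

    print(word_counts.most_common(10))  # dominated by "is", "the", "for", ...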

Recommended answer

Before anything, try to preserve the info about "boundaries" which comes in the input text.
(If such info has not already been lost: your question implies that maybe the tokenization has already been done.)
During the tokenization (word parsing, in this case) process, look for patterns that may define expression boundaries (such as punctuation, particularly periods, and also multiple LF/CR separations); use these. Words like "the" can also often be used as boundaries. Such expression boundaries are typically "negative", in the sense that they separate two token instances which are sure not to be included in the same expression. A few positive boundaries are quotes, particularly double quotes. This type of info may be useful to filter out some of the n-grams (see the next paragraph). Word sequences such as "for example" or "in lieu of" or "need to" can be used as expression boundaries as well (but using such info is edging on using "priors", which I discuss later).
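As an illustration only (not part of the original answer), a minimal Python sketch of such boundary-aware tokenization might look like this, assuming the corpus sits in a plain-text file ("corpus.txt" is a placeholder) and using only sentence punctuation, blank lines and the word "the" as negative boundaries:

    import re

    # Sketch of the boundary idea above. Assumptions: the corpus is a plain-text
    # file ("corpus.txt" is a placeholder), and the only negative boundaries used
    # are sentence punctuation, blank lines and the word "the".
    text = open("corpus.txt", encoding="utf-8").read()

    # Split on negative boundaries first, so n-grams are never built across them.
    BOUNDARY = re.compile(r"[.!?;:()]+|\n\s*\n|\bthe\b", re.IGNORECASE)
    spans = [s.strip() for s in BOUNDARY.split(text) if s.strip()]

    # Tokenize each span separately; digrams/trigrams are later taken only
    # inside a span, never across one.
    tokenized_spans = [re.findall(r"[a-z']+", span.lower()) for span in spans]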

Without using external data (other than the input text), you can have relative success with this by running statistics on the text's digrams and trigrams (sequences of 2 and 3 consecutive words). Then [most of] the sequences with a significant (*) number of instances will likely be the type of "expressions/phrases" you are looking for.
This somewhat crude method will yield a few false positives, but on the whole may be workable. Filtering out the n-grams known to cross "boundaries", as hinted in the first paragraph, may help significantly, because in natural languages sentence endings and sentence starts tend to draw from a limited subset of the message space and hence produce combinations of tokens that may appear to be statistically well represented but which are typically not semantically related.
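A minimal sketch of these digram/trigram statistics, continuing from the `tokenized_spans` list of the previous sketch (the MIN_COUNT threshold of 5 is an arbitrary placeholder for "a significant number of instances"):

    from collections import Counter

    def ngrams(tokens, n):
        """Yield consecutive n-word tuples from a token list."""
        return zip(*(tokens[i:] for i in range(n)))

    counts = Counter()
    for span in tokenized_spans:
        counts.update(ngrams(span, 2))   # digrams
        counts.update(ngrams(span, 3))   # trigrams

    MIN_COUNT = 5                        # placeholder significance threshold
    candidates = sorted(
        ((" ".join(gram), c) for gram, c in counts.items() if c >= MIN_COUNT),
        key=lambda pair: pair[1],
        reverse=True,
    )

    for phrase, count in candidates[:50]:
        print(f"{count:6d}  {phrase}")

Because the n-grams are taken inside boundary-delimited spans only, sequences that would otherwise straddle a sentence ending and the next sentence's start never enter the counts.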

Better methods (possibly more expensive, processing-wise, and design/investment-wise) will make use of extra "priors" relevant to the domain and/or national languages of the input text.

  • POS (Part-Of-Speech) tagging is quite useful, in several ways: it provides additional, more objective expression boundaries, and also "noise" word classes; for example, all articles, even when used in the context of entities, are typically of little value in the tag clouds the OP wants to produce (a minimal sketch follows this list).
  • Dictionaries, lexicons and the like can be quite useful too. In particular, those which identify "entities" (aka instances in WordNet lingo) and their alternative forms. Entities are very important for tag clouds (though they are not the only class of words found in them), and by identifying them it is also possible to normalize them (the many different expressions which can be used for, say, "Senator T. Kennedy"), hence eliminating duplicates but also increasing the frequency of the underlying entities.
  • If the corpus is structured as a document collection, it may be useful to use various tricks related to TF (Term Frequency) and IDF (Inverse Document Frequency).
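As flagged in the first bullet, here is a minimal, illustrative sketch of a POS-based filter over the `candidates` list from the previous sketch, using NLTK (mentioned below). The set of "noise" tags is an assumption, not something prescribed by the answer:

    import nltk

    # Requires NLTK's default POS tagger data, e.g.
    # nltk.download("averaged_perceptron_tagger").
    NOISE_TAGS = {"DT", "IN", "CC", "TO", "PRP", "PRP$"}  # articles, prepositions, ...

    def looks_like_phrase(phrase):
        """Reject candidates that start or end with a 'noise' part of speech."""
        tagged = nltk.pos_tag(phrase.split())
        return tagged[0][1] not in NOISE_TAGS and tagged[-1][1] not in NOISE_TAGS

    filtered = [(phrase, count) for phrase, count in candidates
                if looks_like_phrase(phrase)]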

[Sorry, gotta go for now (plus I would like more detail about your specific goals, etc.). I'll try and provide more detail and pointers later.]

[BTW, I want to plug here Jonathan Feinberg's and Dervin Thunk's responses from this post, as they provide excellent pointers, in terms of methods and tools, for the kind of task at hand. In particular, NLTK and Python at large provide an excellent framework for experimenting.]
