Algorithms to detect phrases and keywords from text

Problem description

I have around 100 megabytes of text, without any markup, divided into approximately 10,000 entries. I would like to automatically generate a 'tag' list. The problem is that there are word groups (i.e. phrases) that only make sense when they are grouped together.

If I just count the words, I get a large number of really common words (is, the, for, in, am, etc.). I have counted each word and the number of other words that appear before and after it, but now I cannot figure out what to do next. The information about the 2- and 3-word phrases is present, but how do I extract this data?

Recommended answer

Before anything else, try to preserve the information about "boundaries" that comes with the input text.
(If such info has not already been lost; your question implies that the tokenization may already have been done.)
During the tokenization (word parsing, in this case) process, look for patterns that may define expression boundaries, such as punctuation, particularly periods, and also multiple LF/CR separations; use these. Words like "the" can also often serve as boundaries. Such expression boundaries are typically "negative", in the sense that they separate two token instances which are sure not to be included in the same expression. A few positive boundaries are quotes, particularly double quotes. This type of information may be useful to filter out some of the n-grams (see the next paragraph). Word sequences such as "for example", "in lieu of" or "need to" can be used as expression boundaries as well (but using such info is edging toward using "priors", which I discuss later).
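
As a rough illustration of this boundary idea (not from the original answer), here is a minimal Python sketch that splits raw text into candidate expression segments at punctuation, blank lines, and a small, assumed set of boundary words; the function name and the boundary list are illustrative only and would need tuning for a real corpus:

```python
import re

# "Negative" expression boundaries: candidate phrases should never span
# across these tokens. This is an illustrative subset only.
BOUNDARY_WORDS = {"the", "a", "an", "and", "or", "but", "is", "for", "in", "of", "to"}

def split_into_segments(text):
    """Split raw text into candidate expression segments.

    Punctuation (periods, commas, quotes, ...) and blank lines are treated
    as hard boundaries; single boundary words further split each piece.
    """
    # Hard boundaries: sentence punctuation, quotes, and multiple LF/CR.
    pieces = re.split(r'[.!?;:,"“”()]+|\n\s*\n', text)
    segments = []
    for piece in pieces:
        current = []
        for word in piece.lower().split():
            if word in BOUNDARY_WORDS:
                if current:
                    segments.append(current)
                current = []
            else:
                current.append(word)
        if current:
            segments.append(current)
    return segments
```

Each returned segment is a list of lowercased words; the n-gram counting in the next step can then be restricted to stay within a segment.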

Without using external data (other than the input text), you can have relative success with this by running statistics on the text's digrams and trigrams (sequences of 2 and 3 consecutive words). Then [most of] the sequences with a significant (*) number of instances will likely be the type of "expressions/phrases" you are looking for.
This somewhat crude method will yield a few false positives, but on the whole may be workable. Filtering out the n-grams known to cross "boundaries", as hinted in the first paragraph, may help significantly, because in natural languages sentence endings and sentence starts tend to draw from a limited subset of the message space and hence produce combinations of tokens that may appear statistically well represented but are typically not semantically related.
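
A minimal sketch of the digram/trigram counting described above, assuming the text has already been split into boundary-respecting segments (lists of words); the frequency threshold stands in for the "(*) significant number of instances" and is an arbitrary default to be tuned:

```python
from collections import Counter

def ngram_counts(segments, n):
    """Count n-grams, never letting an n-gram cross a segment boundary."""
    counts = Counter()
    for seg in segments:
        for i in range(len(seg) - n + 1):
            counts[tuple(seg[i:i + n])] += 1
    return counts

def candidate_phrases(segments, min_count=5):
    """Digrams and trigrams frequent enough to be phrase/tag candidates."""
    candidates = {}
    for n in (2, 3):
        for gram, count in ngram_counts(segments, n).items():
            if count >= min_count:
                candidates[" ".join(gram)] = count
    # Most frequent candidates first.
    return sorted(candidates.items(), key=lambda kv: -kv[1])
```

With the illustrative default of min_count=5, candidate_phrases(split_into_segments(text)) would return the frequent 2- and 3-word sequences with their counts, most frequent first.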

Better methods (possibly more expensive, processing-wise and design/investment-wise) will make use of extra "priors" relevant to the domain and/or national languages of the input text.

  • POS (Part-Of-Speech) tagging is quite useful, in several ways: it provides additional, more objective expression boundaries, and also identifies "noise" word classes; for example, all articles, even when used in the context of entities, are typically of little value in the kind of tag cloud the OP wants to produce.
  • Dictionaries, lexicons and the like can be quite useful too. In particular, those which identify "entities" (aka instances in WordNet lingo) and their alternative forms. Entities are very important for tag clouds (though they are not the only class of words found in them), and by identifying them it is also possible to normalize them (the many different expressions which can be used for, say, "Senator T. Kennedy"), hence eliminating duplicates but also increasing the frequency of the underlying entities.
  • If the corpus is structured as a document collection, it may be useful to use various tricks related to TF (Term Frequency) and IDF (Inverse Document Frequency); a rough sketch combining this with POS filtering follows this list.
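
As a rough sketch of how two of these "priors" could be combined (assuming NLTK is installed and its tagger data has been downloaded; the noun-based filter and the plain TF-IDF computation are illustrative choices, not a prescribed pipeline):

```python
import math
from collections import Counter

import nltk  # assumes nltk.download('averaged_perceptron_tagger') has been run

def is_noun_phrase_like(phrase):
    """Crude POS-based filter: keep candidate phrases whose last word is a noun.

    Tag clouds are built mostly from noun phrases, so candidates ending in
    articles, prepositions, etc. are dropped.
    """
    tagged = nltk.pos_tag(phrase.split())
    return tagged[-1][1].startswith("NN")

def tf_idf(documents):
    """Plain TF-IDF over a document collection (each document a list of terms)."""
    n_docs = len(documents)
    doc_freq = Counter()
    for terms in documents:
        doc_freq.update(set(terms))
    scores = []
    for terms in documents:
        tf = Counter(terms)
        scores.append({t: tf[t] * math.log(n_docs / doc_freq[t]) for t in tf})
    return scores
```

The candidate phrases from the earlier sketch could be passed through is_noun_phrase_like, and tf_idf could be applied per entry if the 10,000 entries are treated as documents.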

[Sorry, gotta go for now (plus I would like more detail about your specific goals etc.). I'll try to provide more detail and pointers later.]

[BTW, I want to plug here Jonathan Feinberg's and Dervin Thunk's responses from this post, as they provide excellent pointers in terms of methods and tools for the kind of task at hand. In particular, NLTK and Python-at-large provide an excellent framework for experimenting.]

