Java library for keyword extraction from input text
Question
I'm looking for a Java library to extract keywords from a block of text.
The process should be as follows:
stop word cleaning -> stemming -> searching for keywords based on statistical information about the English language, meaning that if a word appears more often in the text than its probability in general English would predict, then it is a keyword candidate.
Is there a library that performs this task?
Recommended answer
Here is a possible solution using Apache Lucene. I didn't use the latest version but the 3.6.2 one, since this is the one I know best. Besides /lucene-core-x.x.x.jar, don't forget to add /contrib/analyzers/common/lucene-analyzers-x.x.x.jar from the downloaded archive to your project: it contains the language-specific analyzers (especially the English one in your case).
Note that this will only find the frequencies of the input text words based on their respective stems. Comparing these frequencies with the English language statistics has to be done afterwards (this answer may help, by the way).
One keyword per stem. Different words may share the same stem, hence the terms set. The keyword frequency is incremented every time a term is added, even if that term has already been seen (the set simply ignores the duplicate).
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class Keyword implements Comparable<Keyword> {

    private final String stem;
    // distinct original words that share this stem
    private final Set<String> terms = new HashSet<String>();
    // total number of occurrences, duplicates included
    private int frequency = 0;

    public Keyword(String stem) {
        this.stem = stem;
    }

    public void add(String term) {
        terms.add(term);
        frequency++;
    }

    @Override
    public int compareTo(Keyword o) {
        // descending order
        return Integer.compare(o.frequency, frequency);
    }

    @Override
    public boolean equals(Object obj) {
        if (this == obj) {
            return true;
        } else if (!(obj instanceof Keyword)) {
            return false;
        } else {
            // two keywords are equal when they share the same stem
            return stem.equals(((Keyword) obj).stem);
        }
    }

    @Override
    public int hashCode() {
        return Arrays.hashCode(new Object[] { stem });
    }

    public String getStem() {
        return stem;
    }

    public Set<String> getTerms() {
        return terms;
    }

    public int getFrequency() {
        return frequency;
    }
}
Utilities
To stem a term:
public static String stem(String term) throws IOException {
    TokenStream tokenStream = null;
    try {
        // tokenize
        tokenStream = new ClassicTokenizer(Version.LUCENE_36, new StringReader(term));
        // stem
        tokenStream = new PorterStemFilter(tokenStream);

        // add each token in a set, so that duplicates are removed
        Set<String> stems = new HashSet<String>();
        CharTermAttribute token = tokenStream.getAttribute(CharTermAttribute.class);
        tokenStream.reset();
        while (tokenStream.incrementToken()) {
            stems.add(token.toString());
        }

        // if no stem or 2+ stems have been found, return null
        if (stems.size() != 1) {
            return null;
        }
        String stem = stems.iterator().next();
        // if the stem contains anything other than letters, digits or dashes, return null
        if (!stem.matches("[a-zA-Z0-9-]+")) {
            return null;
        }
        return stem;
    } finally {
        if (tokenStream != null) {
            tokenStream.close();
        }
    }
}
To search a collection (used for the list of potential keywords); if no equal element is found, the example itself is added to the collection and returned:
public static <T> T find(Collection<T> collection, T example) {
    for (T element : collection) {
        if (element.equals(example)) {
            return element;
        }
    }
    collection.add(example);
    return example;
}
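The find helper scans the whole collection on every call, which is fine for short texts. For longer inputs, one possible alternative is to key the bookkeeping by stem in a map. The sketch below is not part of the original answer: the StemIndex class and its method names are hypothetical, and it uses only the JDK (Java 8 or later) so it can be tried without Lucene.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical helper (not from the original answer): map-based lookup by stem.
final class StemIndex {

    // stem -> distinct original words seen for that stem
    private final Map<String, Set<String>> terms = new LinkedHashMap<>();
    // stem -> number of occurrences, duplicates included
    private final Map<String, Integer> counts = new LinkedHashMap<>();

    void add(String stem, String term) {
        // create the term set on first sight of the stem, then record the word
        terms.computeIfAbsent(stem, k -> new LinkedHashSet<>()).add(term);
        // increment the occurrence count, starting at 1
        counts.merge(stem, 1, Integer::sum);
    }

    int frequency(String stem) {
        return counts.getOrDefault(stem, 0);
    }

    Set<String> terms(String stem) {
        return terms.getOrDefault(stem, Collections.<String>emptySet());
    }

    public static void main(String[] args) {
        StemIndex index = new StemIndex();
        index.add("compil", "compiled");
        index.add("compil", "compiler");
        index.add("compil", "compiled");
        System.out.println("compil x" + index.frequency("compil") + " " + index.terms("compil"));
    }
}
```

The counting semantics match the Keyword class: the frequency grows on every call, while the term set keeps each distinct word once.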
Core
Here is the main entry-point method:
public static List<Keyword> guessFromString(String input) throws IOException {
    TokenStream tokenStream = null;
    try {
        // hack to keep dashed words (e.g. "non-specific" rather than "non" and "specific")
        input = input.replaceAll("-+", "-0");
        // replace any punctuation char but apostrophes and dashes by a space
        input = input.replaceAll("[\\p{Punct}&&[^'-]]+", " ");
        // remove most common english contraction suffixes
        input = input.replaceAll("(?:'(?:[tdsm]|[vr]e|ll))+\\b", "");

        // tokenize input
        tokenStream = new ClassicTokenizer(Version.LUCENE_36, new StringReader(input));
        // to lowercase
        tokenStream = new LowerCaseFilter(Version.LUCENE_36, tokenStream);
        // remove dots from acronyms (and "'s" but already done manually above)
        tokenStream = new ClassicFilter(tokenStream);
        // convert any char to ASCII
        tokenStream = new ASCIIFoldingFilter(tokenStream);
        // remove english stop words
        tokenStream = new StopFilter(Version.LUCENE_36, tokenStream, EnglishAnalyzer.getDefaultStopSet());

        List<Keyword> keywords = new LinkedList<Keyword>();
        CharTermAttribute token = tokenStream.getAttribute(CharTermAttribute.class);
        tokenStream.reset();
        while (tokenStream.incrementToken()) {
            String term = token.toString();
            // stem each term
            String stem = stem(term);
            if (stem != null) {
                // create the keyword or get the existing one if any
                Keyword keyword = find(keywords, new Keyword(stem.replaceAll("-0", "-")));
                // add its corresponding initial token
                keyword.add(term.replaceAll("-0", "-"));
            }
        }

        // reverse sort by frequency
        Collections.sort(keywords);
        return keywords;
    } finally {
        if (tokenStream != null) {
            tokenStream.close();
        }
    }
}
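The three replaceAll calls at the top of guessFromString are easy to misread, so here is a small JDK-only sketch that isolates them and shows what they do to a sample sentence. The InputNormalizer class and its method names are hypothetical, not part of the original answer. The "-0" trick relies on ClassicTokenizer's documented behavior of not splitting a hyphenated token that contains a digit.

```java
// Hypothetical helper (not from the original answer): the pre-tokenization cleanup in isolation.
final class InputNormalizer {

    static String normalize(String input) {
        // protect hyphenated words: "non-specific" becomes "non-0specific"
        input = input.replaceAll("-+", "-0");
        // replace each run of punctuation, except apostrophes and dashes, with one space
        input = input.replaceAll("[\\p{Punct}&&[^'-]]+", " ");
        // drop common English contraction suffixes: 't 'd 's 'm 've 're 'll
        input = input.replaceAll("(?:'(?:[tdsm]|[vr]e|ll))+\\b", "");
        return input;
    }

    static String restoreDashes(String token) {
        // undo the "-0" marker once tokenizing and stemming are done
        return token.replaceAll("-0", "-");
    }

    public static void main(String[] args) {
        String normalized = normalize("It's a non-specific, well-known idea.");
        // brackets make leading and trailing spaces visible
        System.out.println("[" + normalized + "]");
        System.out.println(restoreDashes("non-0specific"));
    }
}
```

Note that the contraction rule removes only the suffix, so "isn't" is reduced to "isn" rather than to "is not"; the stop-word filter does not help with such leftovers unless they happen to be in the stop set.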
Example
Using the guessFromString method on the introduction of the Java Wikipedia article, here are the 10 most frequent keywords (i.e. stems) that were found:
java x12 [java]
compil x5 [compiled, compiler, compilers]
sun x5 [sun]
develop x4 [developed, developers]
languag x3 [languages, language]
implement x3 [implementation, implementations]
applic x3 [application, applications]
run x3 [run]
origin x3 [originally, original]
gnu x3 [gnu]
Iterate over the output list to find out which original words were found for each stem by getting the terms sets (displayed between brackets [...] in the above example).
Compare each stem's frequency, divided by the sum of all frequencies, with the corresponding English language statistics, and keep me in the loop if you manage it: I could be quite interested too :)
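As a starting point for that comparison, here is a minimal sketch under stated assumptions. It assumes you already have a table of relative frequencies for general English, keyed by stems produced with the same stemmer. The KeywordScorer class, its method names, the fallback value for unseen stems and the reference numbers in main are all made up for illustration; only the JDK is used.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper (not from the original answer): text frequency versus reference frequency.
final class KeywordScorer {

    // Returns, per stem, (count / total count in the text) / (relative frequency in English).
    // unseenFrequency must be positive; it is used for stems missing from the reference table.
    static Map<String, Double> score(Map<String, Integer> stemCounts,
                                     Map<String, Double> englishFrequency,
                                     double unseenFrequency) {
        long total = 0;
        for (int count : stemCounts.values()) {
            total += count;
        }
        Map<String, Double> scores = new LinkedHashMap<>();
        if (total == 0) {
            return scores;
        }
        for (Map.Entry<String, Integer> entry : stemCounts.entrySet()) {
            double observed = (double) entry.getValue() / total;
            double expected = englishFrequency.getOrDefault(entry.getKey(), unseenFrequency);
            scores.put(entry.getKey(), observed / expected);
        }
        return scores;
    }

    // Keeps the stems whose ratio is strictly above the threshold.
    static List<String> candidates(Map<String, Double> scores, double threshold) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Double> entry : scores.entrySet()) {
            if (entry.getValue() > threshold) {
                result.add(entry.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        counts.put("java", 6);
        counts.put("run", 2);
        // made-up reference values, NOT real English statistics
        Map<String, Double> reference = new LinkedHashMap<>();
        reference.put("java", 0.25);
        reference.put("run", 0.5);
        Map<String, Double> scores = score(counts, reference, 1e-6);
        System.out.println(scores);
        System.out.println(candidates(scores, 1.0));
    }
}
```

A ratio above 1 means the stem is over-represented in the text relative to the reference, which is the candidate criterion described in the question. The threshold and the handling of unseen stems are tuning choices.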