Java library for keywords extraction from input text
Question
I'm looking for a Java library to extract keywords from a block of text.
The process should look as follows:
stop word cleaning -> stemming -> searching for keywords based on English language statistics: if a word appears in the text with a higher relative frequency than its probability in the English language, then it is a keyword candidate.
Is there a library that performs this task?
Recommended answer
Here is a possible solution using Apache Lucene. I didn't use the latest version but 3.6.2, since it is the one I know best. Besides /lucene-core-x.x.x.jar, don't forget to add /contrib/analyzers/common/lucene-analyzers-x.x.x.jar from the downloaded archive to your project: it contains the language-specific analyzers (in your case, the English one).
Note that this only finds the frequencies of the words in the input text, grouped by their respective stems. Comparing these frequencies with English language statistics has to be done afterwards (this answer may help, by the way).
One keyword per stem. Different words may share the same stem, hence the terms set. The keyword frequency is incremented every time a term is added, even if that term has already been found (the set automatically removes duplicates).
    import java.util.HashSet;
    import java.util.Set;

    public class Keyword implements Comparable<Keyword> {

        private final String stem;
        private final Set<String> terms = new HashSet<String>();
        private int frequency = 0;

        public Keyword(String stem) {
            this.stem = stem;
        }

        // record one more occurrence of a term that reduces to this stem
        public void add(String term) {
            terms.add(term);
            frequency++;
        }

        @Override
        public int compareTo(Keyword o) {
            // descending order of frequency
            return Integer.compare(o.frequency, frequency);
        }

        // two keywords are equal when they share the same stem
        @Override
        public boolean equals(Object obj) {
            if (this == obj) {
                return true;
            } else if (!(obj instanceof Keyword)) {
                return false;
            } else {
                return stem.equals(((Keyword) obj).stem);
            }
        }

        @Override
        public int hashCode() {
            return stem.hashCode();
        }

        public String getStem() {
            return stem;
        }

        public Set<String> getTerms() {
            return terms;
        }

        public int getFrequency() {
            return frequency;
        }
    }
Utilities
To stem a word:
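    // imports needed by this method (Lucene 3.6.x package names)
    import java.io.IOException;
    import java.io.StringReader;
    import java.util.HashSet;
    import java.util.Set;
    import org.apache.lucene.analysis.PorterStemFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.ClassicTokenizer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.util.Version;

    public static String stem(String term) throws IOException {
        TokenStream tokenStream = null;
        try {
            // tokenize
            tokenStream = new ClassicTokenizer(Version.LUCENE_36, new StringReader(term));
            // stem
            tokenStream = new PorterStemFilter(tokenStream);

            // add each token to a set, so that duplicates are removed
            Set<String> stems = new HashSet<String>();
            CharTermAttribute token = tokenStream.getAttribute(CharTermAttribute.class);
            tokenStream.reset();
            while (tokenStream.incrementToken()) {
                stems.add(token.toString());
            }

            // if no stem or 2+ stems have been found, return null
            if (stems.size() != 1) {
                return null;
            }
            String stem = stems.iterator().next();
            // if the stem contains anything other than letters, digits or dashes, return null
            if (!stem.matches("[a-zA-Z0-9-]+")) {
                return null;
            }
            return stem;
        } finally {
            if (tokenStream != null) {
                tokenStream.close();
            }
        }
    }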
To search a collection (this will be used by the list of potential keywords):
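    import java.util.Collection;

    // returns the element of the collection that equals the example;
    // if there is none, adds the example to the collection and returns it
    public static <T> T find(Collection<T> collection, T example) {
        for (T element : collection) {
            if (element.equals(example)) {
                return element;
            }
        }
        collection.add(example);
        return example;
    }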
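One detail of find that is easy to miss: when no equal element exists, it adds the example to the collection and returns it, so the caller always gets back the instance that lives in the collection. The sketch below copies the method into a hypothetical FindDemo class (my naming, not part of the answer) and exercises it with plain strings instead of Keyword objects.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

class FindDemo {
    // Same logic as the find utility above: return the stored equal element,
    // or add the example to the collection and return it.
    static <T> T find(Collection<T> collection, T example) {
        for (T element : collection) {
            if (element.equals(example)) {
                return element;
            }
        }
        collection.add(example);
        return example;
    }

    public static void main(String[] args) {
        List<String> stems = new ArrayList<String>();
        // not present yet: the example itself is added and returned
        String first = find(stems, new String("java"));
        // an equal element exists: the stored instance is returned, nothing is added
        String second = find(stems, new String("java"));
        System.out.println(stems.size());
        System.out.println(first == second);
    }
}
```

This is why guessFromString can call keyword.add(...) on the returned object right away: the frequency always accumulates on the one Keyword stored in the list.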
Core
Here is the main input method:
    // imports needed by this method (Lucene 3.6.x package names)
    import java.io.IOException;
    import java.io.StringReader;
    import java.util.Collections;
    import java.util.LinkedList;
    import java.util.List;
    import org.apache.lucene.analysis.ASCIIFoldingFilter;
    import org.apache.lucene.analysis.LowerCaseFilter;
    import org.apache.lucene.analysis.StopFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.en.EnglishAnalyzer;
    import org.apache.lucene.analysis.standard.ClassicFilter;
    import org.apache.lucene.analysis.standard.ClassicTokenizer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.util.Version;

    public static List<Keyword> guessFromString(String input) throws IOException {
        TokenStream tokenStream = null;
        try {
            // hack to keep dashed words (e.g. "non-specific" rather than "non" and "specific")
            input = input.replaceAll("-+", "-0");
            // replace any punctuation char but apostrophes and dashes by a space
            input = input.replaceAll("[\\p{Punct}&&[^'-]]+", " ");
            // strip the most common English contraction suffixes ('t, 'd, 's, 'm, 've, 're, 'll)
            input = input.replaceAll("(?:'(?:[tdsm]|[vr]e|ll))+\\b", "");

            // tokenize input
            tokenStream = new ClassicTokenizer(Version.LUCENE_36, new StringReader(input));
            // to lowercase
            tokenStream = new LowerCaseFilter(Version.LUCENE_36, tokenStream);
            // remove dots from acronyms (and "'s" but already done manually above)
            tokenStream = new ClassicFilter(tokenStream);
            // convert any char to ASCII
            tokenStream = new ASCIIFoldingFilter(tokenStream);
            // remove English stop words
            tokenStream = new StopFilter(Version.LUCENE_36, tokenStream, EnglishAnalyzer.getDefaultStopSet());

            List<Keyword> keywords = new LinkedList<Keyword>();
            CharTermAttribute token = tokenStream.getAttribute(CharTermAttribute.class);
            tokenStream.reset();
            while (tokenStream.incrementToken()) {
                String term = token.toString();
                // stem each term
                String stem = stem(term);
                if (stem != null) {
                    // create the keyword or get the existing one if any (undoing the dash hack)
                    Keyword keyword = find(keywords, new Keyword(stem.replaceAll("-0", "-")));
                    // add its corresponding initial token
                    keyword.add(term.replaceAll("-0", "-"));
                }
            }

            // sort by descending frequency (see Keyword.compareTo)
            Collections.sort(keywords);
            return keywords;
        } finally {
            if (tokenStream != null) {
                tokenStream.close();
            }
        }
    }
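To see what the three regex passes at the top of guessFromString do before Lucene gets involved, here is a small standalone sketch that uses only the JDK. The class name PreprocessDemo and the sample sentence are assumptions for illustration; the regular expressions are copied unchanged from the method above.

```java
class PreprocessDemo {
    // Applies the same three replaceAll passes as guessFromString, in the same order.
    static String preprocess(String input) {
        // mark dashes with a trailing 0, so that dashed words can survive tokenization
        input = input.replaceAll("-+", "-0");
        // replace any punctuation char except apostrophes and dashes by a space
        input = input.replaceAll("[\\p{Punct}&&[^'-]]+", " ");
        // strip the most common English contraction suffixes
        input = input.replaceAll("(?:'(?:[tdsm]|[vr]e|ll))+\\b", "");
        return input;
    }

    public static void main(String[] args) {
        String sample = "We've seen non-specific results; they're fine.";
        // brackets make the leftover whitespace visible
        System.out.println("[" + preprocess(sample) + "]");
        // prints "[We seen non-0specific results  they fine ]"
    }
}
```

Note the leftover "-0" marker and the doubled spaces: the marker is removed again when the keyword is created, and the extra whitespace is harmless because the tokenizer splits on it anyway.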
Example
Using the guessFromString method on the introduction of the Java Wikipedia article (en.wikipedia.org/wiki/Java_(programming_language)), here are the 10 most frequent keywords (i.e. stems) that were found:
java x12 [java]
compil x5 [compiled, compiler, compilers]
sun x5 [sun]
develop x4 [developed, developers]
languag x3 [languages, language]
implement x3 [implementation, implementations]
applic x3 [application, applications]
run x3 [run]
origin x3 [originally, original]
gnu x3 [gnu]
Iterate over the output list and read each keyword's terms set (displayed between brackets [...] in the above example) to know which original words were found for each stem.
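As a rough sketch of that iteration, the helper below formats one line in the style of the listing above. It is self-contained, so it takes the three values as plain parameters; the class name KeywordReportSketch and the stand-in values are hypothetical. With the real classes you would loop over the list returned by guessFromString and pass keyword.getStem(), keyword.getFrequency() and keyword.getTerms().

```java
import java.util.Arrays;
import java.util.Collection;

class KeywordReportSketch {
    // Formats one result line as "<stem> x<frequency> [term1, term2, ...]".
    static String format(String stem, int frequency, Collection<String> terms) {
        // Java collections already print as "[a, b, c]"
        return stem + " x" + frequency + " " + terms;
    }

    public static void main(String[] args) {
        // stand-in values taken from the example listing
        System.out.println(format("compil", 5,
                Arrays.asList("compiled", "compiler", "compilers")));
        // prints "compil x5 [compiled, compiler, compilers]"
    }
}
```

Since Keyword stores its terms in a HashSet, the order of the words inside the brackets is not guaranteed when printing real results.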
Compare the ratio of each stem's frequency to the sum of all frequencies with the corresponding English language statistics, and keep me in the loop if you manage it: I could be quite interested too :)
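A minimal sketch of that last comparison step, assuming you have some reference table of English stem probabilities (the answer does not provide one). The class name KeywordScoreSketch, the englishProbability map, its made-up values and the fallback probability for unknown stems are all hypothetical. The idea is to divide each stem's share of the text by its expected share in English; a ratio well above 1 marks a keyword candidate.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

class KeywordScoreSketch {
    // Hypothetical floor used for stems that are missing from the reference table.
    static final double UNKNOWN_PROBABILITY = 1e-6;

    // For each stem: (share of the stem in the text) / (expected share in English).
    static Map<String, Double> score(Map<String, Integer> stemFrequencies,
                                     Map<String, Double> englishProbability) {
        int total = 0;
        for (int frequency : stemFrequencies.values()) {
            total += frequency;
        }
        Map<String, Double> ratios = new LinkedHashMap<String, Double>();
        if (total == 0) {
            return ratios; // empty input: nothing to score
        }
        for (Map.Entry<String, Integer> entry : stemFrequencies.entrySet()) {
            double observed = (double) entry.getValue() / total;
            Double expected = englishProbability.get(entry.getKey());
            if (expected == null || expected <= 0) {
                expected = UNKNOWN_PROBABILITY;
            }
            ratios.put(entry.getKey(), observed / expected);
        }
        return ratios;
    }

    public static void main(String[] args) {
        Map<String, Integer> frequencies = new LinkedHashMap<String, Integer>();
        frequencies.put("java", 12);
        frequencies.put("run", 4);
        // made-up reference probabilities, for illustration only
        Map<String, Double> english = new HashMap<String, Double>();
        english.put("java", 0.25);
        english.put("run", 0.5);
        System.out.println(score(frequencies, english));
    }
}
```

Because stop words are removed before counting, the reference probabilities should ideally be computed over a corpus that was cleaned and stemmed the same way; otherwise the ratios are only comparable with each other, not meaningful in absolute terms.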