Java library for keyword extraction from input text


Question

I'm looking for a Java library to extract keywords from a block of text.

The process should be as follows:

stop word cleaning -> stemming -> searching for keywords based on statistical information about the English language: if a word appears more often in the text than its probability in general English would predict, then it is a keyword candidate.

Is there a library that performs this task?

Recommended answer

Here is a possible solution using Apache Lucene. I did not use the latest version but 3.6.2, since it is the one I know best. Besides /lucene-core-x.x.x.jar, don't forget to add /contrib/analyzers/common/lucene-analyzers-x.x.x.jar from the downloaded archive to your project: it contains the language-specific analyzers (in your case, the English one).

Note that this only finds the frequencies of the words in the input text, grouped by stem. Comparing these frequencies with English language statistics has to be done afterwards (this answer may help, by the way).

One keyword per stem. Different words may share the same stem, hence the terms set. The keyword frequency is incremented every time a term is added, even if that term has already been seen (the set silently discards the duplicate, but the occurrence is still counted).

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class Keyword implements Comparable<Keyword> {

  private final String stem;
  private final Set<String> terms = new HashSet<String>();
  private int frequency = 0;

  public Keyword(String stem) {
    this.stem = stem;
  }

  public void add(String term) {
    terms.add(term);
    frequency++;
  }

  @Override
  public int compareTo(Keyword o) {
    // descending order of frequency
    // (note: this ordering is not consistent with equals, which compares stems)
    return Integer.compare(o.frequency, frequency);
  }

  @Override
  public boolean equals(Object obj) {
    if (this == obj) {
      return true;
    } else if (!(obj instanceof Keyword)) {
      return false;
    } else {
      return stem.equals(((Keyword) obj).stem);
    }
  }

  @Override
  public int hashCode() {
    return Arrays.hashCode(new Object[] { stem });
  }

  public String getStem() {
    return stem;
  }

  public Set<String> getTerms() {
    return terms;
  }

  public int getFrequency() {
    return frequency;
  }

}

Utilities

To stem a term:

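// Imports for the utility and core methods (package names as of Lucene 3.6.x).
// The static methods are assumed to live together in one utility class.
import java.io.IOException;
import java.io.StringReader;
import java.util.Collection;
import java.util.Collections;
import java.util.HashSet;
import java.util.LinkedList;
import java.util.List;
import java.util.Set;
import org.apache.lucene.analysis.ASCIIFoldingFilter;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.PorterStemFilter;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.standard.ClassicFilter;
import org.apache.lucene.analysis.standard.ClassicTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public static String stem(String term) throws IOException {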

  TokenStream tokenStream = null;
  try {

    // tokenize
    tokenStream = new ClassicTokenizer(Version.LUCENE_36, new StringReader(term));
    // stem
    tokenStream = new PorterStemFilter(tokenStream);

    // add each token in a set, so that duplicates are removed
    Set<String> stems = new HashSet<String>();
    CharTermAttribute token = tokenStream.getAttribute(CharTermAttribute.class);
    tokenStream.reset();
    while (tokenStream.incrementToken()) {
      stems.add(token.toString());
    }

    // if no stem or 2+ stems have been found, return null
    if (stems.size() != 1) {
      return null;
    }
    String stem = stems.iterator().next();
    // if the stem contains anything other than letters, digits, or dashes, return null
    if (!stem.matches("[a-zA-Z0-9-]+")) {
      return null;
    }

    return stem;

  } finally {
    if (tokenStream != null) {
      tokenStream.close();
    }
  }

}

To search a collection (this will be used on the list of potential keywords):

public static <T> T find(Collection<T> collection, T example) {
  for (T element : collection) {
    if (element.equals(example)) {
      return element;
    }
  }
  collection.add(example);
  return example;
}
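
As a quick illustration of the find-or-add behaviour of this helper, here is a minimal sketch that runs without Lucene. The class name FindDemo and the use of plain strings in place of Keyword objects are assumptions made for the example only.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

// Hypothetical demo class; only find(...) comes from the answer.
class FindDemo {

  // Return the stored element equal to `example`,
  // or add `example` to the collection and return it.
  static <T> T find(Collection<T> collection, T example) {
    for (T element : collection) {
      if (element.equals(example)) {
        return element;
      }
    }
    collection.add(example);
    return example;
  }

  public static void main(String[] args) {
    List<String> stems = new ArrayList<>();
    // Distinct String instances with equal content, to mimic "new Keyword(stem)"
    String first = find(stems, new String("compil"));   // not present yet: added
    String second = find(stems, new String("compil"));  // present: stored instance returned
    System.out.println(stems.size());     // 1
    System.out.println(first == second);  // true
  }
}
```

Because find scans the collection linearly on every call, a map keyed by stem would probably be the usual alternative for large inputs.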

Core

Here is the main entry-point method:

public static List<Keyword> guessFromString(String input) throws IOException {

  TokenStream tokenStream = null;
  try {

    // hack to keep dashed words (e.g. "non-specific" rather than "non" and "specific")
    input = input.replaceAll("-+", "-0");
    // replace any punctuation char except apostrophes and dashes with a space
    input = input.replaceAll("[\\p{Punct}&&[^'-]]+", " ");
    // remove the most common English contractions
    input = input.replaceAll("(?:'(?:[tdsm]|[vr]e|ll))+\\b", "");

    // tokenize input
    tokenStream = new ClassicTokenizer(Version.LUCENE_36, new StringReader(input));
    // to lowercase
    tokenStream = new LowerCaseFilter(Version.LUCENE_36, tokenStream);
    // remove dots from acronyms (and "'s" but already done manually above)
    tokenStream = new ClassicFilter(tokenStream);
    // convert any char to ASCII
    tokenStream = new ASCIIFoldingFilter(tokenStream);
    // remove english stop words
    tokenStream = new StopFilter(Version.LUCENE_36, tokenStream, EnglishAnalyzer.getDefaultStopSet());

    List<Keyword> keywords = new LinkedList<Keyword>();
    CharTermAttribute token = tokenStream.getAttribute(CharTermAttribute.class);
    tokenStream.reset();
    while (tokenStream.incrementToken()) {
      String term = token.toString();
      // stem each term
      String stem = stem(term);
      if (stem != null) {
        // create the keyword or get the existing one if any
        Keyword keyword = find(keywords, new Keyword(stem.replaceAll("-0", "-")));
        // add its corresponding initial token
        keyword.add(term.replaceAll("-0", "-"));
      }
    }

    // reverse sort by frequency
    Collections.sort(keywords);

    return keywords;

  } finally {
    if (tokenStream != null) {
      tokenStream.close();
    }
  }

}
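
The three string replacements at the top of guessFromString can be tried on their own, without Lucene. The sketch below applies them in the same order; the class and method names (PreprocessDemo, preprocess) are made up for the example. The backslashes are doubled, as Java string literals require.

```java
// Hypothetical demo class; the three patterns are the ones used in guessFromString.
class PreprocessDemo {

  static String preprocess(String input) {
    // put a digit after each run of dashes (undone later with replaceAll("-0", "-"))
    input = input.replaceAll("-+", "-0");
    // replace runs of punctuation, except apostrophes and dashes, with a space
    input = input.replaceAll("[\\p{Punct}&&[^'-]]+", " ");
    // drop common English contraction suffixes ('t, 'd, 's, 'm, 've, 're, 'll)
    input = input.replaceAll("(?:'(?:[tdsm]|[vr]e|ll))+\\b", "");
    return input;
  }

  public static void main(String[] args) {
    System.out.println(preprocess("it's non-specific")); // it non-0specific
    System.out.println(preprocess("a,b"));               // a b
  }
}
```

Note that the contraction pattern strips the suffix only, so a word like "don't" is reduced to "don" before tokenization.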

Example

Using the guessFromString method on the introduction of the Java Wikipedia article, here are the 10 most frequent keywords (i.e. stems) that were found:

java         x12    [java]
compil       x5     [compiled, compiler, compilers]
sun          x5     [sun]
develop      x4     [developed, developers]
languag      x3     [languages, language]
implement    x3     [implementation, implementations]
applic       x3     [application, applications]
run          x3     [run]
origin       x3     [originally, original]
gnu          x3     [gnu]

Iterate over the output list to see which original words were found for each stem, by getting the terms sets (displayed between brackets [...] in the above example).
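
A listing in the style of the one above could be produced roughly as follows. This is only a sketch: the format string is a guess at the layout, not the code originally used, and KeywordReport and formatLine are hypothetical names. To keep the example runnable without Lucene, the stem, frequency and terms are passed as plain parameters; in the real code they would come from getStem(), getFrequency() and getTerms() on each Keyword in the returned list.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper for printing one result line per keyword.
class KeywordReport {

  // Left-align the stem in 12 columns, then "x" + frequency, then the terms set.
  static String formatLine(String stem, int frequency, Set<String> terms) {
    return String.format("%-12s x%-5d %s", stem, frequency, terms);
  }

  public static void main(String[] args) {
    Set<String> terms =
        new LinkedHashSet<>(Arrays.asList("compiled", "compiler", "compilers"));
    System.out.println(formatLine("compil", 5, terms));
  }
}
```

Keyword stores its terms in a HashSet, so the order of the words inside the brackets is not guaranteed when printing real results.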

Compare each stem's share of the total (stem frequency / sum of all frequencies) with the corresponding English language statistics, and keep me in the loop if you manage it: I could be quite interested too :)
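
One possible way to do that comparison is sketched below, under stated assumptions: the reference frequencies are invented for illustration (they are not real corpus data), the total token count is assumed, and KeywordScore and ratio are hypothetical names. The idea is simply to divide a stem's relative frequency in the document by its relative frequency in general English; a ratio well above 1 suggests a keyword candidate.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical scoring helper, not part of the answer's code.
class KeywordScore {

  // (count / totalCount) is the stem's relative frequency in the document;
  // referenceFrequency is its relative frequency in a reference English corpus.
  static double ratio(int count, int totalCount, double referenceFrequency) {
    double documentFrequency = (double) count / totalCount;
    return documentFrequency / referenceFrequency;
  }

  public static void main(String[] args) {
    // Made-up reference frequencies, for illustration only
    Map<String, Double> reference = new LinkedHashMap<>();
    reference.put("java", 0.00001);
    reference.put("run", 0.001);

    // Counts as returned by getFrequency() in the example listing
    Map<String, Integer> counts = new LinkedHashMap<>();
    counts.put("java", 12);
    counts.put("run", 3);
    int total = 100; // assumed sum of all keyword frequencies

    for (Map.Entry<String, Integer> e : counts.entrySet()) {
      double r = ratio(e.getValue(), total, reference.get(e.getKey()));
      System.out.printf("%s -> %.1f%n", e.getKey(), r);
    }
  }
}
```

The sketch does not handle stems that are missing from the reference table, and the reference table would have to be keyed by stems rather than by whole words for the lookup to line up with the output of guessFromString.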
