Java library for keywords extraction from input text

Question

I'm looking for a Java library to extract keywords from a block of text.

The process should be as follows:

stop word cleaning -> stemming -> searching for keywords based on English linguistics statistical information, meaning that if a word appears in the text more often than its probability in the English language would predict, then it's a keyword candidate.

Is there a library that performs this task?

Recommended answer

Here is a possible solution using Apache Lucene. I didn't use the latest version but 3.6.2, since this is the one I know best. Besides /lucene-core-x.x.x.jar, don't forget to add /contrib/analyzers/common/lucene-analyzers-x.x.x.jar from the downloaded archive to your project: it contains the language-specific analyzers (in your case, the English one).

Note that this will only find the frequencies of the input text words, grouped by their respective stems. Comparing these frequencies with the English language statistics has to be done afterwards (this answer may help by the way).

Each keyword corresponds to one stem. Different words may share the same stem, hence the set of terms. The keyword frequency is incremented every time a term is added, even if that term has already been found (the set automatically discards duplicates).

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class Keyword implements Comparable<Keyword> {

  private final String stem;
  private final Set<String> terms = new HashSet<String>();
  private int frequency = 0;

  public Keyword(String stem) {
    this.stem = stem;
  }

  public void add(String term) {
    terms.add(term);
    frequency++;
  }

  @Override
  public int compareTo(Keyword o) {
    // descending order: keywords with a higher frequency sort first
    return Integer.compare(o.frequency, frequency);
  }

  @Override
  public boolean equals(Object obj) {
    if (this == obj) {
      return true;
    } else if (!(obj instanceof Keyword)) {
      return false;
    } else {
      return stem.equals(((Keyword) obj).stem);
    }
  }

  @Override
  public int hashCode() {
    return Arrays.hashCode(new Object[] { stem });
  }

  public String getStem() {
    return stem;
  }

  public Set<String> getTerms() {
    return terms;
  }

  public int getFrequency() {
    return frequency;
  }

}

Utilities

To stem a word:

public static String stem(String term) throws IOException {

  TokenStream tokenStream = null;
  try {

    // tokenize
    tokenStream = new ClassicTokenizer(Version.LUCENE_36, new StringReader(term));
    // stem
    tokenStream = new PorterStemFilter(tokenStream);

    // add each token in a set, so that duplicates are removed
    Set<String> stems = new HashSet<String>();
    CharTermAttribute token = tokenStream.getAttribute(CharTermAttribute.class);
    tokenStream.reset();
    while (tokenStream.incrementToken()) {
      stems.add(token.toString());
    }

    // if no stem or 2+ stems have been found, return null
    if (stems.size() != 1) {
      return null;
    }
    String stem = stems.iterator().next();
    // if the stem has chars other than ASCII letters, digits and dashes, return null
    if (!stem.matches("[a-zA-Z0-9-]+")) {
      return null;
    }

    return stem;

  } finally {
    if (tokenStream != null) {
      tokenStream.close();
    }
  }

}

To search into a collection (will be used by the list of potential keywords):

public static <T> T find(Collection<T> collection, T example) {
  for (T element : collection) {
    if (element.equals(example)) {
      return element;
    }
  }
  collection.add(example);
  return example;
}
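
For long inputs, the linear scan in find costs one pass over the list per token. A possible alternative, sketched here under the assumption that Java 8 or later is available, keeps the bookkeeping in maps keyed by stem. StemIndex and its methods are hypothetical names used for illustration; they are not part of the code in this answer or of Lucene.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical helper (not from the answer): same bookkeeping as Keyword + find,
// but with constant-time lookup by stem instead of a scan over a list.
public class StemIndex {

  // stem -> distinct original terms, in first-seen order
  private final Map<String, Set<String>> termsByStem = new LinkedHashMap<>();
  // stem -> number of occurrences (duplicates of a term still count)
  private final Map<String, Integer> countByStem = new HashMap<>();

  public void add(String stem, String term) {
    termsByStem.computeIfAbsent(stem, k -> new LinkedHashSet<>()).add(term);
    countByStem.merge(stem, 1, Integer::sum);
  }

  public int frequency(String stem) {
    return countByStem.getOrDefault(stem, 0);
  }

  public Set<String> terms(String stem) {
    return termsByStem.getOrDefault(stem, Collections.<String>emptySet());
  }

  public Set<String> stems() {
    return termsByStem.keySet();
  }
}
```

The trade-off is that the result is no longer a ready-to-sort list of Keyword objects; sorting by frequency would be a separate step over stems().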






Core

Here is the main input method:

public static List<Keyword> guessFromString(String input) throws IOException {

  TokenStream tokenStream = null;
  try {

    // hack to keep dashed words (e.g. "non-specific" rather than "non" and "specific"):
    // ClassicTokenizer only keeps a hyphenated token whole when it contains a digit
    input = input.replaceAll("-+", "-0");
    // replace any punctuation char but apostrophes and dashes by a space
    input = input.replaceAll("[\\p{Punct}&&[^'-]]+", " ");
    // strip the most common English contraction suffixes ('t, 'd, 's, 'm, 've, 're, 'll)
    input = input.replaceAll("(?:'(?:[tdsm]|[vr]e|ll))+\\b", "");

    // tokenize input
    tokenStream = new ClassicTokenizer(Version.LUCENE_36, new StringReader(input));
    // to lowercase
    tokenStream = new LowerCaseFilter(Version.LUCENE_36, tokenStream);
    // remove dots from acronyms (and "'s" but already done manually above)
    tokenStream = new ClassicFilter(tokenStream);
    // convert any char to ASCII
    tokenStream = new ASCIIFoldingFilter(tokenStream);
    // remove english stop words
    tokenStream = new StopFilter(Version.LUCENE_36, tokenStream, EnglishAnalyzer.getDefaultStopSet());

    List<Keyword> keywords = new LinkedList<Keyword>();
    CharTermAttribute token = tokenStream.getAttribute(CharTermAttribute.class);
    tokenStream.reset();
    while (tokenStream.incrementToken()) {
      String term = token.toString();
      // stem each term
      String stem = stem(term);
      if (stem != null) {
        // create the keyword or get the existing one if any
        Keyword keyword = find(keywords, new Keyword(stem.replaceAll("-0", "-")));
        // add its corresponding initial token
        keyword.add(term.replaceAll("-0", "-"));
      }
    }

    // reverse sort by frequency
    Collections.sort(keywords);

    return keywords;

  } finally {
    if (tokenStream != null) {
      tokenStream.close();
    }
  }

}

Example

Using the guessFromString method on the introduction of the Java Wikipedia article, here are the 10 most frequent keywords (i.e. stems) that were found:

java         x12    [java]
compil       x5     [compiled, compiler, compilers]
sun          x5     [sun]
develop      x4     [developed, developers]
languag      x3     [languages, language]
implement    x3     [implementation, implementations]
applic       x3     [application, applications]
run          x3     [run]
origin       x3     [originally, original]
gnu          x3     [gnu]

Iterate over the output list and read each keyword's terms set (displayed between brackets [...] in the above example) to see which original words were found for each stem.

Compare each stem's frequency / sum of frequencies ratio with the corresponding English language statistics, and keep me in the loop if you manage it: I could be quite interested too :)
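
As a rough illustration of that last step, here is a minimal sketch that divides each stem's share of the text by a baseline probability and sorts by the resulting ratio. Everything in it is an assumption rather than part of the answer: KeywordScorer, Scored and score are hypothetical names, the baseline map stands in for English statistics you would have to obtain yourself (and stem in the same way as the text), and unseenProbability is an arbitrary fallback for stems missing from the baseline. It assumes Java 8 or later.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Hypothetical scorer (not from the answer): ranks stems by how over-represented
// they are in the text compared with a baseline probability per stem.
public class KeywordScorer {

  /** A stem with its ratio of observed share to baseline probability. */
  public static final class Scored {
    public final String stem;
    public final double score;

    Scored(String stem, double score) {
      this.stem = stem;
      this.score = score;
    }
  }

  public static List<Scored> score(Map<String, Integer> stemCounts,
                                   Map<String, Double> baseline,
                                   double unseenProbability) {
    if (unseenProbability <= 0) {
      throw new IllegalArgumentException("unseenProbability must be positive");
    }
    // sum of all stem frequencies in the text
    long total = 0;
    for (int count : stemCounts.values()) {
      total += count;
    }
    List<Scored> result = new ArrayList<>();
    if (total == 0) {
      return result;
    }
    for (Map.Entry<String, Integer> entry : stemCounts.entrySet()) {
      // share of the text taken by this stem
      double observed = (double) entry.getValue() / total;
      // assumed baseline probability; baseline values are expected to be positive
      double expected = baseline.getOrDefault(entry.getKey(), unseenProbability);
      result.add(new Scored(entry.getKey(), observed / expected));
    }
    // most over-represented stems first
    result.sort(Comparator.comparingDouble((Scored s) -> s.score).reversed());
    return result;
  }
}
```

The stemCounts map can be filled from the list returned by guessFromString, using getStem() as the key and getFrequency() as the value. A ratio well above 1 suggests a keyword candidate; where to put the threshold is a judgment call that depends on the baseline used.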
