Text segmentation: dictionary-based word splitting

Problem Description

Split database column names into equivalent English text to seed a data dictionary. The English dictionary is created from a corpus of corporate documents, wikis, and email. The dictionary (lexicon.csv) is a CSV file with words and probabilities. Thus, the more often someone writes the word "therapist" (in email or on a wiki page), the higher the chance that "therapistname" splits into "therapist name" as opposed to something else. (The lexicon probably won't even include the word "rapist".) A minimal sketch for loading the lexicon follows the file list below.

  • TextSegmenter.java @ http://pastebin.com/taXyE03L
  • SortableValueMap.java @ http://pastebin.com/v3hRXYan
  • lexicon.csv @ http://pastebin.com/0crECtXY
  • columns.txt @ http://pastebin.com/EtN9Qesr
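
Since lexicon.csv is described as a plain CSV of words and probabilities, a minimal loading sketch could look like this. It is an illustration only: the class name LexiconLoader is made up, and it assumes two comma-separated columns with no header row.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

public class LexiconLoader {
  // Loads "word,probability" rows into a map; malformed rows are skipped.
  public static Map<String, Double> load(String path) throws IOException {
    Map<String, Double> lexicon = new HashMap<>();
    for (String row : Files.readAllLines(Paths.get(path))) {
      String[] fields = row.split(",");
      if (fields.length == 2) {
        try {
          lexicon.put(fields[0].trim(), Double.parseDouble(fields[1].trim()));
        } catch (NumberFormatException ignored) {
          // Skip rows whose second column is not a number.
        }
      }
    }
    return lexicon;
  }
}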

The following problem was encountered:

dependentrelationship::end depend ent dependent relationship
end=0.86
ent=0.001
dependent=0.8
relationship=0.9

These possible solutions exist:

dependentrelationship::dependent relationship
dependentrelationship::dep end ent relationship
dependentrelationship::depend ent relationship

The lexicon contains words with their relative probabilities (based on word frequency): dependent 0.8, end 0.86, relationship 0.9, depend 0.3, and ent 0.001.

Eliminate the solution dep end ent relationship because dep is not in the lexicon (that candidate covers only 75% of its words), whereas the other two solutions cover 100% of their words. Of the remaining solutions, the probability of dependent relationship is 0.8 × 0.9 = 0.72, whereas depend ent relationship is 0.3 × 0.001 × 0.9 = 0.00027. We can therefore select dependent relationship as the correct solution.
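
To make that arithmetic concrete, here is a small hypothetical scorer (the names CandidateScorer and score are illustrative) that multiplies lexicon probabilities and treats out-of-lexicon words as zero, which is one simple way to encode the coverage rule above:

import java.util.Arrays;
import java.util.List;
import java.util.Map;

public class CandidateScorer {
  // Product of the lexicon probabilities of each word; unknown words
  // score zero, which eliminates any candidate containing them.
  static double score(List<String> words, Map<String, Double> lexicon) {
    double p = 1.0;
    for (String word : words) {
      p *= lexicon.getOrDefault(word, 0.0);
    }
    return p;
  }

  public static void main(String[] args) {
    Map<String, Double> lexicon = Map.of(
        "dependent", 0.8, "end", 0.86, "relationship", 0.9,
        "depend", 0.3, "ent", 0.001);

    // 0.8 * 0.9 = 0.72
    System.out.println(score(Arrays.asList("dependent", "relationship"), lexicon));
    // 0.3 * 0.001 * 0.9 = 0.00027
    System.out.println(score(Arrays.asList("depend", "ent", "relationship"), lexicon));
    // 0.0, because "dep" is not in the lexicon
    System.out.println(score(Arrays.asList("dep", "end", "ent", "relationship"), lexicon));
  }
}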

Related resources:

  • How to separate words in a "sentence" with spaces?
  • Top Coder - Text Segmentation Presentation 1/2
  • Top Coder - Text Segmentation Presentation 2/2
  • Linear Text Segmentation using Dynamic Programming Algorithm
  • Dynamic Programming: Segmentation
  • Dynamic Programming: A Computational Tool

Given:

// The concatenated phrase or database column (e.g., dependentrelationship).
String concat;

// All words (String) in the lexicon within concat, in left-to-right order; and
// the ranked probability of those words (Double). (E.g., {end, 0.97},
// {dependent, 0.86}, {relationship, 0.95}.)
Map.Entry<String, Double> word;

How would you implement a routine that generates the most likely solution based on lexicon coverage and probabilities? For example:

for( Map.Entry<String, Double> word : words ) {
  result.append( word.getKey() ).append( ' ' );

  // What goes here?

  System.out.printf( "%s=%f\n", word.getKey(), word.getValue() );
}

Thank you!

Recommended Answer

Peter Norvig has written some relevant code in Python:

http://norvig.com/ngrams/ngrams.py

It contains a function called segment, which computes the naive Bayes probability of a sequence of words. It works well and can be a good basis for what you're trying to accomplish in Java.
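
Pending a full port, here is a rough Java sketch of the same idea: a memoized recursion that picks the split maximizing the sum of word log-probabilities. It is not ngrams.py itself; the unknown-word penalty and the 20-character word cap are assumptions borrowed loosely from that script.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class Segmenter {
  private static final int MAX_WORD_LENGTH = 20;
  private final Map<String, Double> lexicon;
  private final Map<String, List<String>> memo = new HashMap<>();

  public Segmenter(Map<String, Double> lexicon) {
    this.lexicon = lexicon;
  }

  // Unknown words get a small probability that shrinks with length
  // (an assumption), so known words are preferred without rejecting
  // strings that contain novel fragments.
  private double wordProb(String word) {
    Double p = lexicon.get(word);
    return (p != null) ? p : 1e-3 / Math.pow(10, word.length());
  }

  // Sum of word log-probabilities; logs avoid floating-point underflow.
  private double logProb(List<String> words) {
    double lp = 0.0;
    for (String w : words) {
      lp += Math.log(wordProb(w));
    }
    return lp;
  }

  // Most probable segmentation of text, memoized on suffixes.
  public List<String> segment(String text) {
    if (text.isEmpty()) {
      return new ArrayList<>();
    }
    List<String> cached = memo.get(text);
    if (cached != null) {
      return cached;
    }
    List<String> best = null;
    double bestScore = Double.NEGATIVE_INFINITY;
    int limit = Math.min(text.length(), MAX_WORD_LENGTH);
    for (int i = 1; i <= limit; i++) {
      // Try every prefix as the first word; recurse on the remainder.
      List<String> candidate = new ArrayList<>();
      candidate.add(text.substring(0, i));
      candidate.addAll(segment(text.substring(i)));
      double score = logProb(candidate);
      if (score > bestScore) {
        bestScore = score;
        best = candidate;
      }
    }
    memo.put(text, best);
    return best;
  }
}

With the lexicon from the question, new Segmenter(lexicon).segment("dependentrelationship") should return [dependent, relationship].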

If you get it converted to Java, I'd be interested in seeing the implementation.

Thanks.

Mike
