Text segmentation: dictionary-based word splitting


Problem Description


Split database column names into equivalent English text to seed a data dictionary. The English dictionary is created from a corpus of corporate documents, wikis, and email. The dictionary (lexicon.csv) is a CSV file with words and probabilities. Thus, the more often someone writes the word "therapist" (in email or on a wiki page) the higher the chance of "therapistname" splits to "therapist name" as opposed to something else. (The lexicon probably won't even include the word rapist.)

Given the following problem:

dependentrelationship::end depend ent dependent relationship
end=0.86
ent=0.001
dependent=0.8
relationship=0.9


These possible solutions exist:

dependentrelationship::dependent relationship
dependentrelationship::dep end ent relationship
dependentrelationship::depend ent relationship


The lexicon contains words with their relative probabilities (based on word frequency): dependent 0.8, end 0.86, relationship 0.9, depend 0.3, and ent 0.001.


Eliminate the solution dep end ent relationship because dep is not in the lexicon (i.e., only 75% of its words are in the lexicon), whereas the other two solutions cover 100% of their words with the lexicon. Of the remaining solutions, the probability of dependent relationship is 0.8 × 0.9 = 0.72, whereas that of depend ent relationship is 0.3 × 0.001 × 0.9 = 0.00027. We can therefore select dependent relationship as the correct solution.
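The elimination-and-scoring rule above can be sketched directly: score each candidate split as the product of its words' probabilities, and disqualify any split containing an out-of-lexicon token. This is a minimal illustration using the lexicon values quoted in the question; the class and method names are my own.

```java
import java.util.List;
import java.util.Map;

public class CandidateScore {
    // Lexicon probabilities as quoted in the question.
    static final Map<String, Double> LEXICON = Map.of(
            "dependent", 0.8, "end", 0.86, "relationship", 0.9,
            "depend", 0.3, "ent", 0.001);

    // Product of the words' probabilities; 0.0 disqualifies any split that
    // contains a word missing from the lexicon (e.g. "dep").
    static double score(List<String> words) {
        double p = 1.0;
        for (String w : words) {
            Double pw = LEXICON.get(w);
            if (pw == null) {
                return 0.0;
            }
            p *= pw;
        }
        return p;
    }

    public static void main(String[] args) {
        // dependent relationship beats depend ent relationship (0.72 vs 0.00027),
        // and dep end ent relationship is eliminated outright.
        System.out.println(score(List.of("dependent", "relationship")));
        System.out.println(score(List.of("depend", "ent", "relationship")));
        System.out.println(score(List.of("dep", "end", "ent", "relationship")));
    }
}
```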

  • How to separate words in a "sentence" with spaces?
  • Top Coder - Text Segmentation Presentation 1/2
  • Top Coder - Text Segmentation Presentation 2/2
  • Linear Text Segmentation using Dynamic Programming Algorithm
  • Dynamic Programming: Segmentation
  • Dynamic Programming: A Computational Tool

Given:

// The concatenated phrase or database column (e.g., dependentrelationship).
String concat;

// All words (String) in the lexicon found within concat, in left-to-right
// order, together with the ranked probability of each word (Double). (E.g.,
// {end, 0.97}, {dependent, 0.86}, {relationship, 0.95}.)
List<Map.Entry<String, Double>> words;


How would you implement a routine that generates the most likely solution based on lexicon coverage and probabilities? For example:

for( Map.Entry<String, Double> word : words ) {
  result.append( word.getKey() ).append( ' ' );

  // What goes here?

  System.out.printf( "%s=%f\n", word.getKey(), word.getValue() );
}
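One way to implement such a routine (a sketch of the general technique, not the asker's intended code) is dynamic programming over prefixes of concat: best[i] holds the highest probability of any full-lexicon segmentation of the first i characters, and split[i] records where the last word of that segmentation begins. The hard-coded lexicon stands in for one loaded from lexicon.csv.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Map;

public class Segmenter {
    // Stand-in for a lexicon loaded from lexicon.csv.
    static final Map<String, Double> LEXICON = Map.of(
            "dependent", 0.8, "end", 0.86, "relationship", 0.9,
            "depend", 0.3, "ent", 0.001);

    static List<String> segment(String concat) {
        int n = concat.length();
        double[] best = new double[n + 1]; // best[i]: max probability of segmenting concat[0..i)
        int[] split = new int[n + 1];      // split[i]: start index of the last word in that segmentation
        best[0] = 1.0;
        for (int i = 1; i <= n; i++) {
            for (int j = 0; j < i; j++) {
                Double p = LEXICON.get(concat.substring(j, i));
                if (p != null && best[j] * p > best[i]) {
                    best[i] = best[j] * p;
                    split[i] = j;
                }
            }
        }
        if (best[n] == 0.0) {
            return List.of(concat); // no segmentation covers the string with lexicon words
        }
        // Walk the split[] chain backwards to recover the winning word sequence.
        Deque<String> words = new ArrayDeque<>();
        for (int i = n; i > 0; i = split[i]) {
            words.addFirst(concat.substring(split[i], i));
        }
        return new ArrayList<>(words);
    }

    public static void main(String[] args) {
        System.out.println(segment("dependentrelationship")); // [dependent, relationship]
    }
}
```

The inner loop considers every possible last word ending at position i, so each prefix is scored once and the overall cost is quadratic in the column-name length, which is negligible for database identifiers.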

Thank you!

Recommended Answer

Peter Norvig has written some stuff in Python:

http://norvig.com/ngrams/ngrams.py

It contains a function called segment that runs a Naive Bayes probability over a sequence of words. It works well, and could be a good basis for what you're trying to accomplish in Java.
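For reference, here is a rough Java transliteration of the idea behind Norvig's segment: memoized recursion over every possible first-word split. The unigram model is just the question's toy lexicon with an ad-hoc, length-scaled penalty for unknown words, not Norvig's n-gram counts or smoothing.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class NorvigStyleSegment {
    static final Map<String, Double> LEXICON = Map.of(
            "dependent", 0.8, "end", 0.86, "relationship", 0.9,
            "depend", 0.3, "ent", 0.001);

    static final Map<String, List<String>> memo = new HashMap<>();

    // Unigram probability; the tiny penalty for unknown words is an assumption
    // so the recursion always returns some answer, not Norvig's smoothing.
    static double pWord(String w) {
        return LEXICON.getOrDefault(w, 1e-9 / w.length());
    }

    static double pWords(List<String> words) {
        double p = 1.0;
        for (String w : words) {
            p *= pWord(w);
        }
        return p;
    }

    // Try every first-word split, recurse on the remainder, keep the most
    // probable candidate; memoize per suffix so each suffix is solved once.
    static List<String> segment(String text) {
        if (text.isEmpty()) {
            return List.of();
        }
        List<String> cached = memo.get(text);
        if (cached != null) {
            return cached;
        }
        List<String> best = null;
        for (int i = 1; i <= text.length(); i++) {
            List<String> candidate = new ArrayList<>();
            candidate.add(text.substring(0, i));
            candidate.addAll(segment(text.substring(i)));
            if (best == null || pWords(candidate) > pWords(best)) {
                best = candidate;
            }
        }
        memo.put(text, best);
        return best;
    }

    public static void main(String[] args) {
        System.out.println(segment("dependentrelationship")); // [dependent, relationship]
    }
}
```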

If you get it converted to Java, I'd be interested in seeing the implementation.

Thanks.

Mike

