English string segmentation

Problem description

Recently I ran into a problem with English string segmentation in my project.
I have a database of substrings, and I need to segment an input string as fully as possible by matching it against that database.

For example:

inputstring="womenshic++programmerandorcodeproject";
substringdatebase[]={"wo","men","wome",
                     "c","c++","shi",
                     "program","programmer","and",
                     "or","code","project"
                     ..........
                    };


After segmentation, the output should look like this:

outstring={"wo","men","shi".....};

Solution

To make the search efficient, you can store your substrings in a trie data structure (http://en.wikipedia.org/wiki/Trie).
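
As a rough illustration of the trie idea, here is a minimal Python sketch. The class name Trie and the method match_lengths are made up for this example, and the word list contains only the entries actually shown in the question (the elided ones are unknown). It stores the substrings character by character and reports every database entry that matches at a given position of the input.

```python
class Trie:
    """Minimal character trie; each node is a dict of child nodes."""

    _END = object()  # marker key: a database entry ends at this node

    def __init__(self, words=()):
        self.root = {}
        for word in words:
            self.insert(word)

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.setdefault(ch, {})
        node[self._END] = True

    def match_lengths(self, text, start=0):
        """Return the lengths of all entries that match text at position start."""
        lengths = []
        node = self.root
        for i in range(start, len(text)):
            node = node.get(text[i])
            if node is None:
                break  # no entry continues with this character
            if self._END in node:
                lengths.append(i - start + 1)
        return lengths


# Only the entries listed in the question; the elided ones are not known.
trie = Trie(["wo", "men", "wome", "c", "c++", "shi",
             "program", "programmer", "and", "or", "code", "project"])
text = "womenshic++programmerandorcodeproject"
print(trie.match_lengths(text, 0))  # [2, 4], i.e. "wo" and "wome"
```

A single walk down the trie finds all candidate matches at one position, so the cost per position is bounded by the length of the longest database entry rather than by the number of entries.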

Then, starting at the current position, repeatedly search for the longest string that matches some substring in the database and continue right after it. Note that this greedy approach is not a bulletproof solution: a long substring match can supersede a shorter one and lead to a dead end.
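
The greedy longest-match loop could be sketched as follows in Python. The function names build_trie, longest_match and greedy_segment are hypothetical, and the database is limited to the entries visible in the question, so the result may differ with the full list. On that data the loop already gets stuck: it takes "wome" first and then finds nothing that starts with "n".

```python
def build_trie(words):
    """Nested-dict trie; the key None marks the end of a database entry."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[None] = True
    return root


def longest_match(root, text, start):
    """Length of the longest entry matching text at start, or 0 if none."""
    best = 0
    node = root
    for i in range(start, len(text)):
        node = node.get(text[i])
        if node is None:
            break
        if None in node:
            best = i - start + 1
    return best


def greedy_segment(words, text):
    """Greedy longest-match split. Returns (pieces, unmatched_rest)."""
    root = build_trie(words)
    pieces, pos = [], 0
    while pos < len(text):
        n = longest_match(root, text, pos)
        if n == 0:
            break  # dead end: nothing in the database matches here
        pieces.append(text[pos:pos + n])
        pos += n
    return pieces, text[pos:]


# Only the entries listed in the question; the elided ones are not known.
database = ["wo", "men", "wome", "c", "c++", "shi",
            "program", "programmer", "and", "or", "code", "project"]
pieces, rest = greedy_segment(database, "womenshic++programmerandorcodeproject")
print(pieces, rest)  # ['wome'] nshic++programmerandorcodeproject
```

An empty rest means the whole input was segmented; a non-empty rest is the part the greedy pass could not cover.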

Example: looking for { "ab", "abc", "ce" } in "abce" would detect "abc" and then fail on "e", whereas there is a solution with "ab" followed by "ce".
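
One way around the dead end, which is not part of the answer above but a common alternative, is to backtrack: try the longest match first and fall back to shorter ones when the rest cannot be segmented. The Python sketch below does this with memoization. The function name segment is made up for this example, and for brevity it checks candidates with str.startswith against a plain list instead of walking a trie.

```python
from functools import lru_cache


def segment(words, text):
    """Return one full segmentation of text as a list, or None if impossible."""
    # Longest candidates first; empty strings are dropped to avoid looping.
    candidates = sorted({w for w in words if w}, key=len, reverse=True)

    @lru_cache(maxsize=None)
    def solve(pos):
        if pos == len(text):
            return ()  # reached the end: nothing left to split
        for word in candidates:
            if text.startswith(word, pos):
                rest = solve(pos + len(word))
                if rest is not None:
                    return (word,) + rest
        return None  # every candidate at this position leads to a dead end

    result = solve(0)
    return None if result is None else list(result)


print(segment(["ab", "abc", "ce"], "abce"))  # ['ab', 'ce']

# Only the entries listed in the question; the elided ones are not known.
database = ["wo", "men", "wome", "c", "c++", "shi",
            "program", "programmer", "and", "or", "code", "project"]
print(segment(database, "womenshic++programmerandorcodeproject"))
# ['wo', 'men', 'shi', 'c++', 'programmer', 'and', 'or', 'code', 'project']
```

The memoization means each position of the input is solved at most once, so a failed branch is never explored twice. The recursion is one level deep per piece, so very long inputs would need an iterative version or a higher recursion limit.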
