字分离算法 [英] Word-separating algorithm

查看:146
本文介绍了字分离算法的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

什么是算法 - 似乎在域名停放页面使用的 - 这需要一个spaceless一堆话(如thecarrotofcuriosity)和更多或更少的正确将其分解成组成的单词(比如好奇胡萝卜)?<​​/ P>

解决方案

开始一个基本特里数据结构重新presenting你的字典。当你通过字符串的字符循环,搜索自己的方式,通过线索一组指针,而不是一个单一的指针 - 全套的种子,特里的根源。每个字母,整个集通过由字母表示的指针前进一次,并且如果一组元件不能由字母被推进,它是从集合中移除。当你达到一个可以结束的话,新根的-线索添加到组(跟踪与该组元素相关见过的单词列表)。最后,一​​旦所有字符都已经被处理,返回字的一个任意的列表,它是在根-的线索。如果有多个,这意味着该字符串可以以多种方式(如therapistforum可以解析为【治疗师,论坛]或[了,色魔,论坛被打破了]),它是不确定的,我们会回来。

或者,在wacked了伪code(Java的的foreach,元组表示与括号,设置标有括号,的缺点使用头::尾,[]是空列表):

 名单,其中,字符串&GT;分手(字符串str,特里根){
    设置≤(名单&LT;字符串&gt;中特里)&GT;集= {([],根)};
    对于(字符C:STR){
        设置≤(名单&LT;字符串&gt;中特里)&GT; newSet = {};
        对于(名单&LT;字符串&GT; LS,特里T:集){
            特里T下一页= t.follow(C);
            如果(T下一页!= NULL){
                newSet.add((LS,T下一页));
                如果(tNext.isWord()){
                    newSet.add((t.follow(C).getWord():: LS,根));
                }
            }
        }
        设置= newSet;
     }
     对于(名单&LT;字符串&GT; LS,特里T:集){
        如果(T ==根)返回LS;
     }
     返回null;
 }
 

让我知道如果我需要澄清或我错过了什么......

What is the algorithm - seemingly in use on domain parking pages - that takes a spaceless bunch of words (eg "thecarrotofcuriosity") and more-or-less correctly breaks it down into the constituent words (eg "the carrot of curiosity") ?

解决方案

Start with a basic Trie data structure representing your dictionary. As you iterate through the characters of the the string, search your way through the trie with a set of pointers rather than a single pointer - the set is seeded with the root of the trie. For each letter, the whole set is advanced at once via the pointer indicated by the letter, and if a set element cannot be advanced by the letter, it is removed from the set. Whenever you reach a possible end-of-word, add a new root-of-trie to the set (keeping track of the list of words seen associated with that set element). Finally, once all characters have been processed, return an arbitrary list of words which is at the root-of-trie. If there's more than one, that means the string could be broken up in multiple ways (such as "therapistforum" which can be parsed as ["therapist", "forum"] or ["the", "rapist", "forum"]) and it's undefined which we'll return.

Or, in a wacked up pseudocode (Java foreach, tuple indicated with parens, set indicated with braces, cons using head :: tail, [] is the empty list):

List<String> breakUp(String str, Trie root) {
    Set<(List<String>, Trie)> set = {([], root)};
    for (char c : str) {
        Set<(List<String>, Trie)> newSet = {};
        for (List<String> ls, Trie t : set) {
            Trie tNext = t.follow(c);
            if (tNext != null) {
                newSet.add((ls, tNext));
                if (tNext.isWord()) {
                    newSet.add((t.follow(c).getWord() :: ls, root));
                }
            }
        }
        set = newSet;
     }
     for (List<String> ls, Trie t : set) {
        if (t == root) return ls;
     }
     return null;
 }

Let me know if I need to clarify or I missed something...

这篇关于字分离算法的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆