分词算法 [英] Word-separating algorithm

查看:34
本文介绍了分词算法的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

什么是算法 - 似乎在域名停放页面上使用 - 需要一堆无空间的词(例如thecarrotofcuriosity")并或多或少正确地将其分解为组成词(例如好奇心的胡萝卜"") ?

解决方案

从基本的Trie 开始

a> 表示字典的数据结构.当您遍历字符串的字符时,请使用一组指针而不是单个指针在特里搜索您的方式 - 该集以特里的根为种子.对于每个字母,整个集合通过字母所指示的指针一次推进,如果一个集合元素不能被该字母推进,则将其从集合中移除.每当您到达可能的词尾时,向集合中添加一个新的树根(跟踪与该集合元素相关联的单词列表).最后,处理完所有字符后,返回位于特里根的任意单词列表.如果有多个,则意味着字符串可以以多种方式分解(例如therapistforum",可以解析为 ["therapist", "forum"] 或 ["the", "rapist", "forum"]) 并且它是未定义的,我们将返回.

或者,在一个奇怪的伪代码中(Java foreach,用括号表示的元组,用大括号表示的集合,cons 使用 head :: tail,[] 是空列表):

ListbreakUp(String str, Trie root) {Set<(List, Trie)>set = {([], root)};for (char c: str) {Set<(List, Trie)>新集 = {};for (List ls, Trie t : set) {尝试 tNext = t.follow(c);如果(tNext != null){newSet.add((ls, tNext));如果(tNext.isWord()){newSet.add((t.follow(c).getWord() :: ls, root));}}}集 = 新集;}for (List ls, Trie t : set) {if (t == root) 返回 ls;}返回空;}

如果我需要澄清或遗漏了什么,请告诉我...

What is the algorithm - seemingly in use on domain parking pages - that takes a spaceless bunch of words (eg "thecarrotofcuriosity") and more-or-less correctly breaks it down into the constituent words (eg "the carrot of curiosity") ?

解决方案

Start with a basic Trie data structure representing your dictionary. As you iterate through the characters of the the string, search your way through the trie with a set of pointers rather than a single pointer - the set is seeded with the root of the trie. For each letter, the whole set is advanced at once via the pointer indicated by the letter, and if a set element cannot be advanced by the letter, it is removed from the set. Whenever you reach a possible end-of-word, add a new root-of-trie to the set (keeping track of the list of words seen associated with that set element). Finally, once all characters have been processed, return an arbitrary list of words which is at the root-of-trie. If there's more than one, that means the string could be broken up in multiple ways (such as "therapistforum" which can be parsed as ["therapist", "forum"] or ["the", "rapist", "forum"]) and it's undefined which we'll return.

Or, in a wacked up pseudocode (Java foreach, tuple indicated with parens, set indicated with braces, cons using head :: tail, [] is the empty list):

List<String> breakUp(String str, Trie root) {
    Set<(List<String>, Trie)> set = {([], root)};
    for (char c : str) {
        Set<(List<String>, Trie)> newSet = {};
        for (List<String> ls, Trie t : set) {
            Trie tNext = t.follow(c);
            if (tNext != null) {
                newSet.add((ls, tNext));
                if (tNext.isWord()) {
                    newSet.add((t.follow(c).getWord() :: ls, root));
                }
            }
        }
        set = newSet;
     }
     for (List<String> ls, Trie t : set) {
        if (t == root) return ls;
     }
     return null;
 }

Let me know if I need to clarify or I missed something...

这篇关于分词算法的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆