Finding dictionary words


Problem description

I have a lot of compound strings that are a combination of two or three English words.

    e.g. "Spicejet" is a combination of the words "spice" and "jet"

I need to separate these individual English words from such compound strings. My dictionary is going to consist of around 100000 words.

What would be the most efficient way to separate individual English words from such compound strings?

Solution

I'm not sure how often you need to do this or how much time you have for it (is it a one-time operation? daily? weekly?), but you're obviously going to want a quick, weighted dictionary lookup.

You'll also want to have a conflict resolution mechanism, perhaps a side-queue to manually resolve conflicts on tuples that have multiple possible meanings.

I would look into Tries. Using one you can efficiently find (and weight) your prefixes, which are precisely what you will be looking for.

You'll have to build the Trie yourself from a good dictionary source, and put a weight on each node that ends a full word, so that you have a quality score to refer to when ranking matches.
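As a rough illustration of that idea, here is a minimal weighted Trie in Python. The class names, the `prefixes` method and the weights are all made up for this sketch; real weights would have to come from your dictionary source (word frequency, for instance).

```python
from dataclasses import dataclass, field


@dataclass
class TrieNode:
    children: dict = field(default_factory=dict)
    weight: float = 0.0  # > 0 only where a complete dictionary word ends


class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word, weight=1.0):
        """Add a word and store its weight on the node that ends it."""
        node = self.root
        for ch in word.lower():
            node = node.children.setdefault(ch, TrieNode())
        node.weight = weight

    def prefixes(self, text, start=0):
        """Yield (end, weight) for every dictionary word starting at text[start]."""
        text = text.lower()
        node = self.root
        for i in range(start, len(text)):
            node = node.children.get(text[i])
            if node is None:
                return
            if node.weight > 0:
                yield i + 1, node.weight


# Hypothetical weights, for illustration only.
trie = Trie()
for word, weight in [("spic", 0.1), ("spice", 5.0), ("jet", 4.0), ("ice", 3.0)]:
    trie.insert(word, weight)

matches = [("spicejet"[:end], w) for end, w in trie.prefixes("spicejet")]
print(matches)  # [('spic', 0.1), ('spice', 5.0)]
```

A single walk down the Trie returns every dictionary word that starts at a given position, which is what makes the prefix search cheap even with 100000 words.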

Just brainstorming here, but if you know your dataset consists primarily of duplets or triplets, you could probably get away with multiple Trie lookups: for example, look up 'Spic' and then 'ejet', find that both results have a low score, and fall back to 'Spice' and 'Jet', where the two lookups yield a good combined score.
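The scoring idea could look something like the sketch below. To keep it short, a plain dict stands in for the weighted Trie, and the weights, the `UNKNOWN` penalty, the "weakest piece" scoring rule and the `needs_review` threshold are all assumptions of mine rather than anything definitive.

```python
from itertools import combinations

# Hypothetical weights; a plain dict stands in for the weighted Trie here.
WEIGHTS = {"spic": 0.1, "spice": 5.0, "jet": 4.0, "ice": 3.0}
UNKNOWN = 0.01  # assumed weight for a piece that is not in the dictionary


def candidate_splits(text, max_parts=3):
    """Return every split of text into 1..max_parts pieces, best score first."""
    text = text.lower()
    scored = []
    for parts in range(1, max_parts + 1):
        for cuts in combinations(range(1, len(text)), parts - 1):
            bounds = (0, *cuts, len(text))
            pieces = [text[a:b] for a, b in zip(bounds, bounds[1:])]
            # Score a split by its weakest piece, so one bad word sinks it.
            score = min(WEIGHTS.get(p, UNKNOWN) for p in pieces)
            scored.append((score, pieces))
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored


def needs_review(scored, ratio=0.5):
    """Flag a string for the manual queue when the runner-up is close to the best."""
    return len(scored) > 1 and scored[1][0] >= ratio * scored[0][0]


ranked = candidate_splits("Spicejet")
best_score, best_pieces = ranked[0]
print(best_pieces, best_score)  # ['spice', 'jet'] 4.0
print(needs_review(ranked))     # False
```

Here 'spic' + 'ejet' scores 0.01 while 'spice' + 'jet' scores 4.0, so the second split wins. Enumerating every cut is fine for short strings with at most three parts; with a real Trie you would only try cut points where a dictionary word actually ends. Strings where `needs_review` returns True are the ones I'd push onto the manual side-queue.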

Also I would consider utilizing frequency analysis on the most common prefixes up to an arbitrary or dynamic limit, e.g. filtering 'the' or 'un' or 'in' and weighting those accordingly.
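A quick way to get at those prefix frequencies is to count them over the dictionary itself. This is only a sketch: the word list, the `max_len` limit and the cut-off for what counts as "common" are placeholders, and how you turn the counts into weights is up to you.

```python
from collections import Counter


def prefix_counts(words, max_len=3):
    """Count how many words start with each proper prefix of length 1..max_len."""
    counts = Counter()
    for word in words:
        word = word.lower()
        for n in range(1, min(max_len, len(word) - 1) + 1):
            counts[word[:n]] += 1
    return counts


# Hypothetical sample; in practice this would be the full dictionary.
words = ["the", "then", "there", "undo", "untie", "into", "inside", "jet"]
counts = prefix_counts(words)

# Treat prefixes of two or more letters seen at least twice as "common".
common = {p for p, c in counts.items() if len(p) >= 2 and c >= 2}
print(sorted(common))  # ['in', 'th', 'the', 'un']
```

Pieces in `common` could then get a reduced weight, so that a short, very frequent fragment does not beat a longer word that contains it.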

Sounds like a fun problem, good luck!
