Finding dictionary words

Problem description

I have a lot of compound strings that are a combination of two or three English words.

    e.g. "Spicejet" is a combination of the words "spice" and "jet"

I need to separate these individual English words from such compound strings. My dictionary is going to consist of around 100000 words.

What would be the most efficient way to separate the individual English words from such compound strings?

Recommended answer

I'm not sure how often you need to do this (is it a one-time operation? daily? weekly?), but you're obviously going to want a quick, weighted dictionary lookup.

You'll also want to have a conflict resolution mechanism, perhaps a side-queue to manually resolve conflicts on tuples that have multiple possible meanings.
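To make the side-queue idea concrete, here is a minimal sketch. Everything in it is an assumption for illustration: the toy word set `WORDS`, the helper names `all_splits` and `split_or_defer`, and the choice to treat any string with more than one valid split as a conflict.

```python
from collections import deque

# Hypothetical toy dictionary; a real one would hold around 100,000 words.
WORDS = {"spice", "jet", "no", "now", "here", "where"}

def all_splits(text, words, max_parts=3):
    """Return every way to split text into 1..max_parts dictionary words."""
    text = text.lower()
    results = []

    def walk(start, parts):
        if start == len(text):
            results.append(list(parts))
            return
        if len(parts) == max_parts:
            return
        for end in range(start + 1, len(text) + 1):
            piece = text[start:end]
            if piece in words:
                parts.append(piece)
                walk(end, parts)
                parts.pop()

    walk(0, [])
    return results

review_queue = deque()  # ambiguous strings wait here for a human decision

def split_or_defer(text, words):
    """Return the unique split, or None after queueing an ambiguous string."""
    candidates = all_splits(text, words)
    if len(candidates) == 1:
        return candidates[0]
    if len(candidates) > 1:
        review_queue.append((text, candidates))
    return None
```

With this toy data, "Spicejet" has exactly one split and is returned directly, while "nowhere" can be read as "no" + "where" or "now" + "here", so it lands in `review_queue` together with both candidates.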

I would look into Tries. Using one you can efficiently find (and weight) your prefixes, which are precisely what you will be looking for.

You'll have to build the Tries yourself from a good dictionary source, and weight the nodes on full words to provide yourself a good quality mechanism for reference.
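A weighted trie along these lines might look roughly like the sketch below. The class names, the `prefixes_of` method, and the weights are all made up for illustration; in practice the weights would come from whatever word-frequency source you trust.

```python
class TrieNode:
    __slots__ = ("children", "weight")

    def __init__(self):
        self.children = {}   # maps a character to the next TrieNode
        self.weight = 0.0    # > 0 only where a full dictionary word ends

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word, weight=1.0):
        """Add a word and store its weight on the node where it ends."""
        node = self.root
        for ch in word.lower():
            node = node.children.setdefault(ch, TrieNode())
        node.weight = weight

    def prefixes_of(self, text, start=0):
        """Yield (end, weight) for each dictionary word starting at text[start].

        Assumes text is already lowercase.
        """
        node = self.root
        for i in range(start, len(text)):
            node = node.children.get(text[i])
            if node is None:
                return
            if node.weight > 0:
                yield i + 1, node.weight

trie = Trie()
# Hypothetical weights: higher means a more plausible standalone word.
for word, weight in [("spice", 5.0), ("spic", 0.1), ("jet", 4.0), ("spa", 2.0)]:
    trie.insert(word, weight)

matches = [("spicejet"[:end], w) for end, w in trie.prefixes_of("spicejet")]
```

A single walk down the trie finds every dictionary word that starts at a given position, so `matches` here holds both "spic" (low weight) and "spice" (high weight) after reading only five characters.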

Just brainstorming here, but if you know your dataset consists primarily of duplets or triplets, you could probably get away with multiple Trie lookups. For example, look up 'Spic' and then 'ejet'; on finding that both results have a low score, abandon that split and try 'Spice' and 'Jet', where the two lookups yield a good combined result.
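Here is one way that try-a-split, score-it, fall-back loop could be sketched. It uses a plain dict in place of the trie to stay short, and the weights, the function names, and the "weakest link" scoring rule (a split scores as low as its worst piece, unknown pieces score 0) are all assumptions, not the only reasonable choice.

```python
from itertools import combinations

# Hypothetical weights; higher means a more plausible standalone word.
WEIGHTS = {"spice": 5.0, "spic": 0.1, "jet": 4.0, "ice": 3.0}

def candidate_splits(text, max_parts=3):
    """Yield every split of text into 1..max_parts non-empty pieces."""
    n = len(text)
    for parts in range(1, max_parts + 1):
        for cuts in combinations(range(1, n), parts - 1):
            bounds = (0, *cuts, n)
            yield [text[a:b] for a, b in zip(bounds, bounds[1:])]

def score(pieces, weights):
    """Weakest-link score: a split is only as good as its worst piece."""
    return min(weights.get(p, 0.0) for p in pieces)

def best_split(text, weights, max_parts=3):
    """Return the highest-scoring split and its score."""
    text = text.lower()
    best = max(candidate_splits(text, max_parts), key=lambda s: score(s, weights))
    return best, score(best, weights)
```

With these weights, 'spic' + 'ejet' scores 0 because 'ejet' is unknown, while 'spice' + 'jet' scores 4.0 and wins. Brute force is tolerable here because the strings are short and capped at three parts; walking a trie instead would avoid generating splits whose first piece is not a word at all.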

Also I would consider utilizing frequency analysis on the most common prefixes up to an arbitrary or dynamic limit, e.g. filtering 'the' or 'un' or 'in' and weighting those accordingly.
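One possible reading of this, sketched under assumptions: count how often each short leading substring appears across your compound strings, then scale down the weight of dictionary words that show up as a prefix suspiciously often. The function names, the `max_len` limit, and the `threshold` and `factor` values are all hypothetical knobs.

```python
from collections import Counter

def prefix_counts(strings, max_len=3):
    """Count how often each leading substring of length 1..max_len occurs."""
    counts = Counter()
    for s in strings:
        s = s.lower()
        for k in range(1, min(max_len, len(s)) + 1):
            counts[s[:k]] += 1
    return counts

def down_weight(weights, counts, total, threshold=0.5, factor=0.1):
    """Return a copy of weights with very common prefixes scaled down."""
    adjusted = dict(weights)
    for word in weights:
        if counts.get(word, 0) / total >= threshold:
            adjusted[word] = weights[word] * factor
    return adjusted

# Hypothetical sample data, just to exercise the functions.
strings = ["unjet", "inland", "unspice", "undo"]
counts = prefix_counts(strings)
weights = {"un": 2.0, "in": 2.0, "jet": 4.0}
adjusted = down_weight(weights, counts, total=len(strings))
```

In this sample 'un' starts three of the four strings, so its weight is cut, while 'in' and 'jet' keep theirs. Whether you down-weight, filter outright, or boost such prefixes depends on your data.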

Sounds like a fun problem, good luck!
