Find the existence of a word in a large dictionary


Problem description

Suppose I'm given a large dictionary in a flat file with 200 million words, and my function needs to check whether any given word exists in the dictionary. What's the fastest way to do it? You can't store the dictionary in memory because you only have 1GB of memory. You can store it in a database, but querying it would still be very slow without any optimization. You can't index the full words because you don't have enough resources.

Edit: in addition to the file optimization approach mentioned below, are there any database optimizations? I'm thinking of creating partial indices: say, for every 2 letters in the word, up to a limit, I create an index. Would this speed up the db query?

Solution

Binary search

Assuming the dictionary has the words in alphabetical order, I would attempt a modified binary search. Divide and conquer the file by jumping to a midpoint location in the file and seeing what word is there. If the guess is too high, split the lower half in half and try again; if it is too low, do the same with the upper half. Repeat until the word is found or there is no file location left to try.

(As outis mentioned in a comment, after jumping to a file location, you'll need to scan backwards and forwards to find the boundaries of the word you jumped to.)
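As a rough illustration of the seek-and-realign idea, here is a minimal Python sketch. It assumes things the question does not state: one word per line, lines sorted in plain byte order (for example with `LC_ALL=C sort`), and a hypothetical function name `word_in_file`. Instead of scanning backwards, it seeks one byte before the midpoint and reads forward to the next line boundary, which has the same effect.

```python
import os

def word_in_file(path, word, encoding="utf-8"):
    """Binary search a byte-sorted, one-word-per-line file without loading it."""
    target = word.encode(encoding)
    with open(path, "rb") as f:
        lo, hi = 0, os.path.getsize(path)
        # Invariant: if the target exists, its line starts at an offset in [lo, hi).
        while lo < hi:
            mid = (lo + hi) // 2
            if mid > 0:
                # Step back one byte and read to the end of that line, so that
                # we land on the first line starting at or after `mid`.
                f.seek(mid - 1)
                f.readline()
            else:
                f.seek(0)
            start = f.tell()
            line = f.readline()
            if start >= hi:
                # No line starts inside [mid, hi): continue in the lower part.
                hi = mid
                continue
            candidate = line.rstrip(b"\r\n")
            if candidate == target:
                return True
            if candidate < target:
                lo = start + len(line)  # next line start
            else:
                hi = mid
    return False
```

Each iteration costs one seek and at most two short reads, so memory use stays constant regardless of the file size.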

You might be able to optimize this by guessing a starting location right off the bat based on the first letter of the word. For example, if the word begins with "c", start your search around the 3/26 mark of the file. In reality, though, I think this early guess will only make a negligible difference overall, since it saves at most a handful of binary search steps.

Other optimizations could include keeping a small partial index. For example, keep an index of the position of the first word that starts with each letter of the alphabet, or of the first word that starts with each possible two-letter combination. This would allow you to immediately narrow your search area.
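The two-letter index could look something like the sketch below. It is only an illustration under the same assumptions as before (one word per line, byte-sorted file); `build_prefix_index` and `lookup` are hypothetical names, and the index is built with a single sequential pass over the file. For lowercase ASCII words it holds at most 26 × 26 = 676 entries, so it fits easily in memory.

```python
import bisect

def build_prefix_index(path, prefix_len=2):
    """Record the byte offset of the first line for each distinct prefix."""
    keys, offsets, pos = [], [], 0
    with open(path, "rb") as f:
        for line in f:
            prefix = line.rstrip(b"\r\n")[:prefix_len]
            if not keys or keys[-1] != prefix:
                keys.append(prefix)
                offsets.append(pos)
            pos += len(line)
    offsets.append(pos)  # sentinel: end of file
    return keys, offsets

def lookup(path, index, word, prefix_len=2, encoding="utf-8"):
    """Binary search only the byte range that shares the word's prefix."""
    keys, offsets = index
    target = word.encode(encoding)
    i = bisect.bisect_left(keys, target[:prefix_len])
    if i == len(keys) or keys[i] != target[:prefix_len]:
        return False  # no word in the file starts with this prefix
    lo, hi = offsets[i], offsets[i + 1]
    with open(path, "rb") as f:
        while lo < hi:
            mid = (lo + hi) // 2
            f.seek(max(mid - 1, 0))
            if mid > 0:
                f.readline()  # realign to the first line start at or after mid
            start = f.tell()
            line = f.readline()
            if start >= hi:
                hi = mid
                continue
            candidate = line.rstrip(b"\r\n")
            if candidate == target:
                return True
            if candidate < target:
                lo = start + len(line)
            else:
                hi = mid
    return False
```

A word whose prefix is missing from the index is rejected without touching the file at all, and every other lookup starts from a range that is already a small fraction of the file.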

The search runs in O(log n) time.
