How to find a word in a large word list (vocabulary) with decent memory consumption and look-up time?


Problem description

[Below is a description of what the application should do and under which constraints.]

I want a data structure that can check whether a string exists in a 250,000-word list, while using only a fair amount of RAM and keeping the time it takes to load this data structure into RAM small (let's say 0-8 seconds). The time it takes to find a word should also be short (let's say 0 to 0.5 seconds), but RAM usage is more important. It should also be possible to create multiple games (more on what this game is about under the heading "Use") without needing significantly more memory.

It would also be highly valuable to know which words start with a given string, but not valuable enough to sacrifice many seconds of load-time for it.

Use

It is for an Android offline game, so only limited RAM is available. The maximum amount of RAM an application can use is, according to this post, between 16 and 32 MB depending on the device. My empty Android application already uses about 17 MB (measured with the Memory Monitor in Android Studio). My Android device caps RAM usage at 26 MB, leaving me about 8 MB of free space for my whole Activity.

The approaches I have tried all seem doomed in different ways:


  1. Hashmap - Read all words into a hash-map object (a minimal sketch follows after point 1.4).

1.1 initialize speed: slow; reading every word into the hash-map took 23 seconds.

1.2 ram usage: uses a significant amount of RAM, although I forgot exactly how much.

1.3 search speed: finding whether a word exists in the list is of course quick.

1.4 narrowing down on possible words (optional): slow, since it needs to go through the whole hash-map and delete words one by one. And because it relies on deletion, multiple games cannot be played on the same hash-map instance. Adding more games would take too much memory, which makes narrowing down on the possible words impossible with this approach.
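
A minimal sketch of what this approach looks like, assuming the word-list is a plain text file with one word per line (the file path and class name are made up for illustration); a HashSet is used here, since for pure membership checks it behaves like a HashMap keyed by the word:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

class HashSetDictionary {
    // Pre-sizing the set limits rehashing while the 250,000 words are loaded.
    private final Set<String> words = new HashSet<>(400000);

    HashSetDictionary(String wordListPath) throws IOException {
        try (BufferedReader reader = new BufferedReader(new FileReader(wordListPath))) {
            String line;
            while ((line = reader.readLine()) != null) {
                words.add(line.trim());
            }
        }
    }

    boolean contains(String word) {
        return words.contains(word);
    }
}

Reading with a BufferedReader instead of a Scanner typically speeds up the load considerably, which may already help with the 23-second figure, but it does not change the RAM picture.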

  2. Trie - Implement a RadixTree. You can see my implementation here (a minimal trie sketch follows after point 2.4).

2.1 initialize speed: slow; reading every word into the RadixTree took 47 seconds.

2.2 ram usage: uses so much RAM that Android suspends the threads a couple of times.

2.3 search speed: finding whether a word exists in the list is quick.

2.4 narrowing down on possible words (optional): ultra fast, since only a reference to a node in the tree is needed to then find all possible words among its children. You can play a lot of games while narrowing down the possible words, because an extra game only requires one more reference to a node in the tree!
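
For comparison, here is a minimal (uncompressed) trie sketch of the idea behind 2.4, assuming lowercase a-z words only; it is not the linked RadixTree implementation, just an illustration of why narrowing down on a prefix needs nothing more than a node reference, so several games can share one read-only tree:

class TrieNode {
    TrieNode[] children = new TrieNode[26];   // one slot per letter a-z
    boolean isWord;                           // true if the path to this node spells a word
}

class Trie {
    final TrieNode root = new TrieNode();

    void insert(String word) {
        TrieNode node = root;
        for (char c : word.toCharArray()) {
            int i = c - 'a';
            if (node.children[i] == null) {
                node.children[i] = new TrieNode();
            }
            node = node.children[i];
        }
        node.isWord = true;
    }

    // Returns the node reached by the prefix, or null if no word starts with it.
    // Keeping this reference per game is all that "narrowing down" requires.
    TrieNode nodeFor(String prefix) {
        TrieNode node = root;
        for (char c : prefix.toCharArray()) {
            node = node.children[c - 'a'];
            if (node == null) {
                return null;
            }
        }
        return node;
    }

    boolean contains(String word) {
        TrieNode node = nodeFor(word);
        return node != null && node.isWord;
    }
}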

  3. Scanner - Go through the word-file sequentially.

3.1 initialize speed: none.

3.2 ram usage: none.

3.3 search speed: about 20 seconds.

3.4 narrowing down on possible words (optional): not possible.

Simple code:

// wordFile is assumed to be a java.util.Scanner opened on the word-list file
String word;
String wordToFind = "example";
boolean foundWord = false;

// Read the file line by line until the word is found or the end is reached
while (wordFile.hasNextLine()) {
    word = wordFile.nextLine();
    if (word.equals(wordToFind)) {
        foundWord = true;
        break;
    }
}

wordFile.close();

Options I thought of:


  1. Long-binary-search-tree: convert the word-list to a list of longs, read these in and do a binary search on them (a sketch follows after point 1.4).

1.1 initialize speed: probably the same as for the hash-map, or a little less, at around 20 seconds. I do hope that calling Arrays.sort() does not take too much time, but I have no idea yet.

1.2 ram usage: if you only account for words of 12 letters or fewer over a 26-letter alphabet, you need 5 bits (2^5 = 32) per letter, so a whole word fits into a 64-bit long. An array of 250,000 longs then needs 250,000 * 8 bytes = around 2 MB, which is not too much.

1.3 search speed: fast, using Arrays.binarySearch().

1.4 narrowing down on possible words (optional): narrowing down on possible words could be possible, but I am not sure how, according to a comment on this post.
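
A rough sketch of how this could look, assuming lowercase a-z words of at most 12 letters, so that 12 * 5 bits = 60 bits fit into one 64-bit long (class and method names are made up for illustration):

import java.util.Arrays;

class LongEncodedWordList {
    private final long[] encoded;

    LongEncodedWordList(String[] words) {
        encoded = new long[words.length];
        for (int i = 0; i < words.length; i++) {
            encoded[i] = encode(words[i]);
        }
        Arrays.sort(encoded);   // must be sorted before Arrays.binarySearch can be used
    }

    // 5 bits per letter, values 1..26; 0 is never produced, so no two words collide.
    static long encode(String word) {
        long value = 0;
        for (int i = 0; i < word.length(); i++) {
            value = (value << 5) | (word.charAt(i) - 'a' + 1);
        }
        return value;
    }

    boolean contains(String word) {
        return Arrays.binarySearch(encoded, encode(word)) >= 0;
    }
}

Note that this encoding packs letters into the low bits, so words sharing a prefix are not adjacent after sorting; if narrowing down (1.4) matters, packing the first letter into the highest bits instead (padding short words with zeros) preserves alphabetical order, and all words with a given prefix then form one contiguous range in the sorted array.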

  2. Hashmap with storage - Create a hash function that maps a word to an index into the word-list file, then access the file at that specific location and look from there to find whether the word exists. You can make use of the alphabetical ordering to determine whether the word can still be found, since the word-list is in natural order (a hypothetical sketch follows after point 2.4).

2.1 initialize speed: not needed (since I need to put every word at the right index beforehand).

2.2 ram usage: none.

2.3 search speed: fast.

2.4 narrowing down on possible words (optional): not possible.
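
Below is a purely hypothetical sketch of this idea, under the assumption that the word-list has first been rewritten offline into fixed-width records placed at hash-determined slots with linear probing; the record width, file layout and names are invented for illustration and differ from the natural-order variant described above:

import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

class OnDiskHashLookup {
    private static final int RECORD_BYTES = 16;   // assumed: at most 12 letters, padded with spaces
    private final RandomAccessFile file;
    private final long slots;

    OnDiskHashLookup(String path) throws IOException {
        file = new RandomAccessFile(path, "r");
        slots = file.length() / RECORD_BYTES;
    }

    boolean contains(String word) throws IOException {
        long slot = (word.hashCode() & 0x7fffffffL) % slots;
        byte[] record = new byte[RECORD_BYTES];
        // Linear probing: an empty (all-space) record means the word was never stored.
        for (long i = 0; i < slots; i++) {
            file.seek(((slot + i) % slots) * RECORD_BYTES);
            file.readFully(record);
            String stored = new String(record, StandardCharsets.US_ASCII).trim();
            if (stored.isEmpty()) {
                return false;
            }
            if (stored.equals(word)) {
                return true;
            }
        }
        return false;
    }
}

This keeps both heap usage and load-time close to zero, at the cost of one or more file seeks per lookup, and it gives up the ability to narrow down on prefixes, which matches points 2.1-2.4 above.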

Specific questions I have


  1. Are the options under "Options I thought of" viable, or is there something I missed that makes them impossible to implement?

  2. Are there options I have not thought of that are better than or equal to these in performance?



End remarks

I have been stuck on this for about a week now, so any new ideas are more than welcome. If any of my assumptions above are incorrect, I would also be pleased to hear about it.

I made this post this way so others can learn from it as well, either by seeing my mistakes or by seeing what does work in the answers.

Answer

This sounds like an ideal use for a Bloom filter. If you are willing to allow the risk of something being falsely considered a word, you can condense your word-list into an amount of memory as small or as large as you are willing to make it.
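
A minimal Bloom-filter sketch in plain Java, assuming a bit count and hash count that you tune to your memory budget; the double-hashing scheme and names are illustrative, not a tested library:

import java.util.BitSet;

class BloomFilter {
    private final BitSet bits;
    private final int size;     // number of bits in the filter
    private final int hashes;   // number of hash functions per word

    BloomFilter(int size, int hashes) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashes = hashes;
    }

    void add(String word) {
        for (int i = 0; i < hashes; i++) {
            bits.set(index(word, i));
        }
    }

    // May return true for words that were never added (false positives),
    // but never returns false for a word that was added.
    boolean mightContain(String word) {
        for (int i = 0; i < hashes; i++) {
            if (!bits.get(index(word, i))) {
                return false;
            }
        }
        return true;
    }

    // Double hashing: combine two cheap hashes to simulate several independent ones.
    private int index(String word, int i) {
        int h1 = word.hashCode();
        int h2 = Integer.reverse(h1) ^ 0x9e3779b9;
        return ((h1 + i * h2) & 0x7fffffff) % size;
    }
}

For 250,000 words, roughly 10 bits per word (about 300 KB of bits) with around 7 hash functions gives a false-positive rate on the order of 1%, well within the ~8 MB budget; the bit array can also be precomputed once and stored as a byte array, so initialization stays cheap.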
