Java - Custom Hash Map/Table Some Points


Question

In some previous posts I asked several questions about coding a custom hash map/table in Java. Since I could not solve the problem, and I may have failed to mention clearly what I really want, I am summarizing all of it here to make it clear and precise.

What I am going to do:

I am trying to write code for our server, in which I have to find a user's access type by URL.

Now, I have about 1,110 million URLs (roughly 1.11 billion).

So, what we did:

1) Divided the database into 10 parts, each with 110 million URLs.

2) Built a HashMap using parallel arrays, whose keys are one part of the URL (represented as a LONG) and whose values are the other part of the URL (represented as an INT) - a key can have multiple values (see the sketch after this list).

3) Then, each day when the system starts, we search the HashMap for other URLs (the millions of URLs saved during that day).
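For illustration, here is a minimal sketch (not the actual code from the question) of a parallel-array hash map with open addressing: long keys, int values, and duplicate keys allowed so one key can carry several values. The capacity handling, the 64-to-32-bit hash mix, and the use of 0 as the "empty" key are assumptions made for this sketch.

    import java.util.ArrayList;
    import java.util.List;

    class LongIntMultiMap {
        private static final long EMPTY = 0L;    // assumes 0 never occurs as a real key
        private final long[] keys;                // parallel arrays: keys[i] pairs with values[i]
        private final int[] values;
        private final int mask;

        // capacityPow2 must be a power of two and at least 2x the entry count
        // (load factor 0.5, as in the question), otherwise put() could loop forever.
        LongIntMultiMap(int capacityPow2) {
            keys = new long[capacityPow2];
            values = new int[capacityPow2];
            mask = capacityPow2 - 1;
        }

        private int slot(long key) {              // cheap 64 -> 32 bit mix, then mask to table size
            key ^= key >>> 33;
            key *= 0xff51afd7ed558ccdL;
            key ^= key >>> 33;
            return (int) key & mask;
        }

        void put(long key, int value) {
            int i = slot(key);
            while (keys[i] != EMPTY) {            // linear probing past occupied slots
                i = (i + 1) & mask;
            }
            keys[i] = key;
            values[i] = value;                    // duplicates of the same key land in later slots
        }

        List<Integer> get(long key) {
            List<Integer> out = new ArrayList<>();
            int i = slot(key);
            while (keys[i] != EMPTY) {            // walk the probe chain, collect every match
                if (keys[i] == key) out.add(values[i]);
                i = (i + 1) & mask;
            }
            return out;
        }
    }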

What have you tried:

1) I have tried many NoSQL databases; however, we found them not so good for our purpose.

2) I have built our own custom hash map (using two parallel arrays) for that purpose.

So, here is the issue:

When the system starts, we have to load the hash table of each database and perform searches for millions of URLs.

Now, the issue is:

1) Although the HashTable's search performance is quite good, the code takes a long time to load the HashTable (we are using a FileChannel and a memory-mapped buffer to load it, and it takes 20 seconds to load a HashTable of 220 million entries - with a load factor of 0.5, which we found to be fastest).

So, we are spending time: (HashTable load + HashTable search) * number of DBs = (20 + 5) * 10 = 250 seconds. That is quite expensive for us, and most of the time (200 out of the 250 seconds) goes to loading the hash tables.
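For reference, here is a hedged sketch of the kind of load path described in 1): copying a memory-mapped file into the two parallel arrays through a FileChannel. The on-disk layout (all long keys first, then all int values, little-endian) is an assumption for this sketch, not the format actually used; note also that a single mapping is limited to about 2 GB, so larger regions would need several mappings.

    import java.io.RandomAccessFile;
    import java.nio.ByteOrder;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;

    class TableLoader {

        // Parallel arrays filled from the file: 'entries' long keys, then 'entries' int values.
        static long[] keys;
        static int[] values;

        static void load(String path, int entries) throws Exception {
            long keyBytes = (long) entries * Long.BYTES;
            long valBytes = (long) entries * Integer.BYTES;
            try (RandomAccessFile raf = new RandomAccessFile(path, "r");
                 FileChannel ch = raf.getChannel()) {
                // Each mapping must stay below ~2 GB; 220 million longs (~1.76 GB) still fits.
                MappedByteBuffer kb = ch.map(FileChannel.MapMode.READ_ONLY, 0, keyBytes);
                MappedByteBuffer vb = ch.map(FileChannel.MapMode.READ_ONLY, keyBytes, valBytes);
                kb.order(ByteOrder.LITTLE_ENDIAN);
                vb.order(ByteOrder.LITTLE_ENDIAN);
                keys = new long[entries];
                values = new int[entries];
                kb.asLongBuffer().get(keys);      // bulk copy out of the mapped region
                vb.asIntBuffer().get(values);
            }
        }
    }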

Have you thought of any other way:

One way can be:

Do not worry about loading and storing at all, and leave caching to the operating system by using a memory-mapped buffer. But, as I have to search for millions of keys, this gives worse performance than the above.
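As a rough sketch of what "leave it to the OS" could look like, the table can be probed directly inside the mapped file instead of being copied into arrays first, at the cost of a possible page fault on every probe. The 12-byte slot layout (long key followed by int value), the hash, and the use of 0 as the empty key are assumptions; a single mapping is also limited to about 2 GB, so a full-size table would have to be split across several mappings.

    import java.io.RandomAccessFile;
    import java.nio.ByteOrder;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;

    class MappedProbe {
        private static final int SLOT = Long.BYTES + Integer.BYTES; // 12 bytes: key + value
        private final MappedByteBuffer buf;
        private final int slots;                                    // power of two

        MappedProbe(String path, int slots) throws Exception {
            this.slots = slots;
            try (RandomAccessFile raf = new RandomAccessFile(path, "r");
                 FileChannel ch = raf.getChannel()) {
                // The mapping stays valid after the channel is closed.
                buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, (long) slots * SLOT);
                buf.order(ByteOrder.LITTLE_ENDIAN);
            }
        }

        /** Returns the value of the first matching slot, or -1 if the key is absent. */
        int get(long key) {
            int i = (int) (key ^ (key >>> 32)) & (slots - 1);
            for (int probes = 0; probes < slots; probes++) {
                long k = buf.getLong(i * SLOT);                     // absolute read; may page-fault
                if (k == key) return buf.getInt(i * SLOT + Long.BYTES);
                if (k == 0L)  return -1;                            // empty slot ends the probe chain
                i = (i + 1) & (slots - 1);
            }
            return -1;
        }
    }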

As we found that the HashTable's performance is good but its loading time is high, we thought of cutting the loading time out in another way, like this:

1) Create an array of linked lists of size Integer.MAX_VALUE (my own custom linked list).

2) Insert the values (ints) into the linked list whose index is the key number (we reduce the key size to INT).

3) So, we only have to store the linked lists to disk.

Now, the issue is that it takes a lot of time to create that many linked lists, and creating such a large number of linked lists is pointless if the data is not well distributed.
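To make the memory cost of that idea concrete, here is a rough sketch of a directly indexed table; small int arrays are used per slot instead of a custom linked list, purely for brevity. Even before any values are stored, a reference array with Integer.MAX_VALUE slots needs roughly 8 GB with compressed references (about double that without), and HotSpot in fact rejects array lengths that close to Integer.MAX_VALUE, which lines up with the concern above.

    import java.util.Arrays;

    class DirectIndexTable {
        private final int[][] buckets;        // buckets[key] -> all values stored for that key

        // 'slots' would be close to Integer.MAX_VALUE in the scheme above; note that
        // HotSpot caps array length slightly below that, and the reference array alone is huge.
        DirectIndexTable(int slots) {
            buckets = new int[slots][];
        }

        void put(int key, int value) {
            int[] old = buckets[key];
            if (old == null) {
                buckets[key] = new int[] { value };
            } else {                          // at most 2-3 values per key, so copying stays cheap
                int[] grown = Arrays.copyOf(old, old.length + 1);
                grown[old.length] = value;
                buckets[key] = grown;
            }
        }

        int[] get(int key) {
            return buckets[key];              // null when the key has no values
        }
    }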

So, what are your requirements:

Simply, my requirements are:

1) Keys with multiple values, supporting insertion and search. Looking for good search performance.

2) A fast way to load it (especially) into memory.

(Keys are 64-bit INTs and values are 32-bit INTs; one key can have at most 2-3 values. We could also make our keys 32-bit, but that would give more collisions, which is acceptable for us if we can do better overall.)

Can anyone help me solve this, or offer any comment on how to approach this issue?

Thanks.

NB:

1) As per previous Stack Overflow suggestions, pre-reading the data to warm the disk cache is not possible, because our application starts working as soon as the system starts, and the system is started fresh the next day.

2) We have not found that NoSQL databases scale well for us, since our requirements are simple (just insert hash-table key/value pairs, then load and search (retrieve values)).

3) As our application is part of a small project to be deployed on a small campus, I don't think anybody will buy me an SSD for it. That is my limitation.

4) We have also used Guava/Trove, but they are not able to store such a large amount of data even in 16 GB (we are using a 32 GB Ubuntu server).

Solution

I don't really understand in what form you are storing the data on disk. If what you are storing consists of URLs and some numbers, you might be able to speed up loading from disk quite a bit by compressing the data (unless you are already doing that).

Creating a multithreaded loader that decompresses while loading might be able to give you quite a big boost.
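A hedged sketch of that suggestion, assuming the table is stored as several independently gzip-compressed chunk files (a layout invented here, e.g. "table.0.gz" to "table.9.gz", each holding fixed-size (long key, int value) pairs), so that each chunk can be decompressed and loaded by its own thread:

    import java.io.BufferedInputStream;
    import java.io.DataInputStream;
    import java.io.FileInputStream;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;
    import java.util.zip.GZIPInputStream;

    class ParallelGzipLoader {

        /** Loads chunk files prefix.0.gz .. prefix.(chunks-1).gz, each holding
         *  entriesPerChunk (long key, int value) pairs, into the shared arrays. */
        static void load(String prefix, int chunks, int entriesPerChunk,
                         long[] keys, int[] values) throws Exception {
            ExecutorService pool = Executors.newFixedThreadPool(chunks);
            List<Future<?>> pending = new ArrayList<>();
            for (int c = 0; c < chunks; c++) {
                final int chunk = c;
                pending.add(pool.submit(() -> {
                    int base = chunk * entriesPerChunk;        // this thread's slice of the arrays
                    try (DataInputStream in = new DataInputStream(new GZIPInputStream(
                            new BufferedInputStream(new FileInputStream(prefix + "." + chunk + ".gz"))))) {
                        for (int i = 0; i < entriesPerChunk; i++) {
                            keys[base + i] = in.readLong();    // decompression happens while loading
                            values[base + i] = in.readInt();
                        }
                    }
                    return null;                               // Callable, so checked IO exceptions are allowed
                }));
            }
            for (Future<?> f : pending) f.get();               // rethrow any failure from the workers
            pool.shutdown();
        }
    }

Whether this beats a single memory-mapped read depends on how well the data compresses and on how many cores and how much disk bandwidth are available.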
