Java disk-based HashMap

Question

I'm working on a web crawler (please don't suggest an existing one, it's not an option). I have it working the way it is expected to. My only issue is that I'm currently using a sort of server/client model, whereby the server does the crawling and processes the data, then puts it in a central location.

This location is an object created from a class I wrote. Internally the class maintains a hashmap defined as HashMap<String, HashMap<String, String>>

I store data in the map with the URL as the key (I keep these unique), and the inner hashmap value stores the corresponding data fields for that URL, such as title, value, etc.
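For readers who want to picture the structure described above, here is a minimal sketch of such a store. The class name CrawlStore and its methods are hypothetical (the question does not show the author's class), and ConcurrentHashMap is swapped in for plain HashMap on the assumption that several crawler threads write at once.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for the class described in the question.
class CrawlStore {
    // URL -> (field name -> field value), e.g. "title" -> "Example Domain".
    private final Map<String, Map<String, String>> pages = new ConcurrentHashMap<>();

    // Records one field for a URL; safe to call from several crawler threads.
    void putField(String url, String field, String value) {
        pages.computeIfAbsent(url, k -> new ConcurrentHashMap<>()).put(field, value);
    }

    // Returns the stored value, or null if the URL or the field is unknown.
    String getField(String url, String field) {
        Map<String, String> fields = pages.get(url);
        return fields == null ? null : fields.get(field);
    }

    // Number of distinct URLs stored so far.
    int size() {
        return pages.size();
    }
}
```

Because the URL is the outer key, writing the same URL twice updates its fields rather than creating a second entry, which matches the uniqueness requirement in the question.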

I occasionally serialize the internal objects used, but the spider is multi-threaded, and as soon as I have, say, 5 threads crawling, the memory requirements go up exponentially.

So far the performance has been excellent with the hashmap, crawling 15K URLs in 2.r minutes with about 30 seconds of CPU time, so I really don't need to be pointed in the direction of an existing spider, as most forum users have suggested.

Can anyone suggest a fast disk-based solution that supports concurrent reading and writing? The data structure doesn't have to be the same; it just needs to be able to store related meta tag values together, etc.

Thanks in advance.

Recommended answer

I suggest using EhCache for this, even though what you're building isn't really a cache. EhCache allows you to configure the cache instance so that it overflows to disk storage, while keeping the most recent items in memory. It can also be configured to be disk-persistent, i.e. data is flushed to disk on shutdown and read back into memory at startup. On top of all that, it's key-value based, so it already fits your model. It supports concurrent access, and since the disk storage is managed in a separate thread, you shouldn't need to worry about disk access concurrency.
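To make the overflow-to-disk idea concrete, here is a minimal sketch using only the JDK. It is not EhCache and does not reflect how EhCache is implemented. The class name DiskOverflowMap, its methods, and the one-file-per-URL layout are all assumptions made for illustration. It keeps the most recently used entries in memory and serializes the least recently used one to a file when the limit is exceeded.

```java
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Base64;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: an LRU map that spills evicted entries to disk.
class DiskOverflowMap {
    private final Path dir;
    private final LinkedHashMap<String, HashMap<String, String>> memory;

    DiskOverflowMap(Path dir, final int maxInMemory) throws IOException {
        this.dir = Files.createDirectories(dir);
        // accessOrder = true makes iteration order least-recently-used first.
        this.memory = new LinkedHashMap<String, HashMap<String, String>>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, HashMap<String, String>> eldest) {
                if (size() <= maxInMemory) {
                    return false;
                }
                spill(eldest.getKey(), eldest.getValue()); // write to disk before dropping
                return true;
            }
        };
    }

    // One coarse lock keeps the sketch simple; a real store would lock more finely.
    synchronized void put(String url, HashMap<String, String> fields) {
        memory.put(url, fields);
    }

    // Looks in memory first, then on disk; a disk hit is promoted back into memory.
    synchronized HashMap<String, String> get(String url) {
        HashMap<String, String> hit = memory.get(url);
        if (hit != null) {
            return hit;
        }
        Path file = fileFor(url);
        if (!Files.exists(file)) {
            return null;
        }
        try (ObjectInputStream in = new ObjectInputStream(Files.newInputStream(file))) {
            @SuppressWarnings("unchecked")
            HashMap<String, String> fields = (HashMap<String, String>) in.readObject();
            memory.put(url, fields);
            return fields;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    synchronized int inMemoryCount() {
        return memory.size();
    }

    private void spill(String url, HashMap<String, String> fields) {
        try (ObjectOutputStream out = new ObjectOutputStream(Files.newOutputStream(fileFor(url)))) {
            out.writeObject(fields);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Assumption: URLs are short enough that the encoded name fits file name limits.
    private Path fileFor(String url) {
        String name = Base64.getUrlEncoder().withoutPadding()
                .encodeToString(url.getBytes(StandardCharsets.UTF_8));
        return dir.resolve(name + ".ser");
    }
}
```

A library such as EhCache adds what this sketch leaves out, including background disk writes, persistence across restarts, and finer-grained concurrency, which is why the answer recommends it over a hand-rolled store.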

Alternatively, you could consider a proper embedded database such as Hypersonic (or numerous others of a similar style), but that's probably going to be more work.
