Java disk-based HashMap

Question

I'm working on a web crawler (please don't suggest an existing one; it's not an option). I have it working the way it is expected to. My only issue is that I'm currently using a sort of server/client model, whereby the server does the crawling, processes the data, and then puts it in a central location.

This location is an object created from a class I wrote. Internally, the class maintains a hash map defined as HashMap<String, HashMap<String, String>>.

I store data in the map with the URL as the key (I keep these unique), and the inner HashMap value stores the corresponding data fields for that URL, such as title, value, etc.
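For illustration, the structure described above might look roughly like the following. This is a minimal sketch, not the actual class: the name `CrawlStore` and its methods are hypothetical, and `ConcurrentHashMap` is swapped in for the plain `HashMap` on the assumption that several crawler threads write at once.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical in-memory store: URL -> (field name -> field value). */
class CrawlStore {
    // Outer key is the unique URL; the inner map holds that URL's data fields.
    // ConcurrentHashMap is an assumption made here for multi-threaded writers.
    private final Map<String, Map<String, String>> pages = new ConcurrentHashMap<>();

    /** Records one field for a URL, creating the URL's entry on first use. */
    void putField(String url, String field, String value) {
        pages.computeIfAbsent(url, u -> new ConcurrentHashMap<>()).put(field, value);
    }

    /** Returns the stored value, or null if the URL or the field is unknown. */
    String getField(String url, String field) {
        Map<String, String> fields = pages.get(url);
        return fields == null ? null : fields.get(field);
    }

    /** Number of distinct URLs stored so far. */
    int size() {
        return pages.size();
    }
}
```

Everything here lives on the heap, so memory use grows with the number of URLs crawled.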

I occasionally serialize the internal objects used, but the spider is multi-threaded, and as soon as I have, say, 5 threads crawling, the memory requirements go up exponentially.

So far the performance with the HashMap has been excellent, crawling 15K URLs in 2.r minutes with about 30 seconds of CPU time, so I really don't need to be pointed in the direction of an existing spider, as most forum users have suggested.

Can anyone suggest a fast disk-based solution that supports concurrent reading and writing? The data structure doesn't have to be the same; it just needs to be able to store related meta tag values together, etc.

Thanks in advance.

Solution

I suggest using EhCache for this, even though what you're building isn't really a cache. EhCache allows you to configure a cache instance so that it overflows to disk storage while keeping the most recent items in memory. It can also be configured to be disk-persistent, i.e. data is flushed to disk on shutdown and read back into memory at startup. On top of all that, it's key-value based, so it already fits your model. It supports concurrent access, and since the disk storage is managed on a separate thread, you shouldn't need to worry about disk access concurrency.
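To make the overflow idea concrete, here is a minimal standard-library sketch of a store that keeps the most recently used entries in memory and spills the rest to files. It is not EhCache's API or implementation: the class `SpillingStore`, its methods, and the one-file-per-URL layout are assumptions made purely for illustration, and disk I/O happens under a single lock instead of on a separate thread.

```java
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical store: recent entries stay in memory, older ones spill to disk. */
class SpillingStore {
    private final Path dir;
    private final LinkedHashMap<String, HashMap<String, String>> memory;

    SpillingStore(Path dir, int maxInMemory) {
        this.dir = dir;
        // accessOrder=true keeps the least recently used entry first.
        this.memory = new LinkedHashMap<String, HashMap<String, String>>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, HashMap<String, String>> eldest) {
                if (size() <= maxInMemory) return false;
                writeToDisk(eldest.getKey(), eldest.getValue()); // spill before eviction
                return true;
            }
        };
    }

    /** Convenience factory that stores spilled entries in a fresh temp directory. */
    static SpillingStore inTempDir(int maxInMemory) {
        try {
            return new SpillingStore(Files.createTempDirectory("spill"), maxInMemory);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Stores a copy of the fields for a URL; may push an older entry to disk. */
    synchronized void put(String url, Map<String, String> fields) {
        memory.put(url, new HashMap<>(fields));
    }

    /** Looks in memory first, then on disk; returns null if the URL is unknown. */
    synchronized Map<String, String> get(String url) {
        HashMap<String, String> fields = memory.get(url);
        if (fields == null) {
            fields = readFromDisk(url);
            if (fields != null) memory.put(url, fields); // promote back into memory
        }
        return fields == null ? null : Collections.unmodifiableMap(fields);
    }

    /** Number of entries currently held in memory. */
    synchronized int inMemory() {
        return memory.size();
    }

    // Hex-encodes the URL into a file name; very long URLs could exceed
    // file-name limits, which this sketch does not handle.
    private Path fileFor(String url) {
        StringBuilder hex = new StringBuilder();
        for (byte b : url.getBytes(StandardCharsets.UTF_8)) hex.append(String.format("%02x", b));
        return dir.resolve(hex.toString());
    }

    private void writeToDisk(String url, HashMap<String, String> fields) {
        try (ObjectOutputStream out = new ObjectOutputStream(Files.newOutputStream(fileFor(url)))) {
            out.writeObject(fields);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    @SuppressWarnings("unchecked")
    private HashMap<String, String> readFromDisk(String url) {
        Path file = fileFor(url);
        if (!Files.exists(file)) return null;
        try (ObjectInputStream in = new ObjectInputStream(Files.newInputStream(file))) {
            return (HashMap<String, String>) in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

With `SpillingStore.inTempDir(2)`, for example, storing a third URL writes the least recently used entry to a file, and a later `get` for that URL reads it back. A library such as EhCache adds the parts this sketch leaves out, such as background disk writes, expiry, and persistence across restarts.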

Alternatively, you could consider a proper embedded database such as Hypersonic (HSQLDB), or one of the numerous others of a similar style, but that's probably going to be more work.
