Store huge std::map, mostly on disk


Question

I've got a C++ program that's likely to generate a HUGE amount of data -- billions of binary records of varying sizes, most probably less than 256 bytes but a few stretching to several K. Most of the records will seldom be looked at by the program after they're created, but some will be accessed and modified regularly. There's no way to tell which are which when they're created.

Considering the volume of data, there's no way I can store it all in memory. But as the data only needs to be indexed and accessed by its number (a 64-bit integer), I don't want the overhead of a full-fledged database program. Ideally I'd like to treat it as an std::map with its data stored on disk until requested.
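A minimal sketch of that idea, using nothing beyond the standard library: keep a small in-memory std::map from the 64-bit id to an (offset, size) pair, and the variable-size payloads themselves in a flat file. All names here are illustrative, not an existing library.

```cpp
// Sketch of a "std::map with its data on disk": an in-memory index of
// 64-bit id -> (file offset, record size), payloads appended to a flat file.
#include <cstdint>
#include <fstream>
#include <map>
#include <string>
#include <utility>
#include <vector>

class DiskRecordStore {  // hypothetical name, for illustration only
public:
    explicit DiskRecordStore(const std::string& path)
        : file_(path, std::ios::in | std::ios::out |
                      std::ios::binary | std::ios::app) {}

    // Append a record's bytes to the file and remember where they landed.
    void put(std::uint64_t id, const std::vector<char>& data) {
        file_.seekp(0, std::ios::end);
        const std::streamoff off = file_.tellp();
        file_.write(data.data(), static_cast<std::streamsize>(data.size()));
        file_.flush();
        index_[id] = {off, data.size()};
    }

    // Fetch a record back by its 64-bit id.
    std::vector<char> get(std::uint64_t id) {
        const auto [off, size] = index_.at(id);
        std::vector<char> buf(size);
        file_.seekg(off);
        file_.read(buf.data(), static_cast<std::streamsize>(size));
        return buf;
    }

private:
    std::fstream file_;
    std::map<std::uint64_t, std::pair<std::streamoff, std::size_t>> index_;
};
```

Note the obvious gaps: the index vanishes on restart, updates simply append a new copy (the old bytes leak), and a crash mid-write leaves a torn record. Those are exactly the integrity problems the question circles back to.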

Edit: After some thought, I realized that Rob Walker's answer had a valid point: I'd be hard-pressed to get anywhere near the same kind of data integrity out of a home-brew class that I'd get from a real database.

Although BerkeleyDB (as suggested by RHM) looks like it would do exactly what we're looking for, the dual-licensing is a headache that we don't want to deal with. When we're done with the code and can prove that it would benefit noticeably from BerkeleyDB (which it probably would), we'll reexamine the issue.

I did look at Ferruccio's suggestion of stxxl, but I wasn't able to tell how it would handle the program being interrupted and restarted (maybe with changes). With that much data, I'd hate to just scrap what it had already completed and start over every time, if some of the data could be saved.

So we've decided to use an SQLite database, at least for the initial development. Thanks to everyone who answered or voted.
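For this use case, the SQLite schema can stay trivial. A possible sketch (table and column names are illustrative): SQLite's INTEGER type is a signed 64-bit integer, and an INTEGER PRIMARY KEY column aliases the rowid, so lookups by id go straight through the table's B-tree.

```sql
-- Hypothetical schema: records keyed by a 64-bit integer, payload as a BLOB.
CREATE TABLE IF NOT EXISTS records (
    id      INTEGER PRIMARY KEY,  -- 64-bit; aliases the rowid
    payload BLOB NOT NULL
);

-- Lookups and updates by id are then single-row operations:
--   SELECT payload FROM records WHERE id = ?;
--   INSERT OR REPLACE INTO records(id, payload) VALUES (?, ?);
```

Wrapping batches of inserts in a single transaction matters a great deal for write throughput at this volume, since SQLite syncs to disk at each commit.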

Answer

I doubt you will find a library that meets your requirements exactly, so you'll have to decide on what 'features' are really important to you and then decide if an existing DB solution comes close enough.

Billions of records is a large dataset by any stretch. What rate are records generated at? How long do they persist? Does the access pattern change over time?

Are updates always with the same amount of data as the original?

I would suggest proving definitively that a DB solution isn't going to work before starting to roll your own, particularly if integrity of the data is paramount (and it usually is...) Maintaining that volume of data on disk reliably can definitely be a challenge. Do you need any kind of transaction semantics when changing the data? Is the client multithreaded?
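One concrete illustration of why integrity is hard to get right by hand: even replacing a single file crash-safely takes care. A minimal sketch of the classic write-temp-then-rename pattern, which makes whole-file replacement atomic on POSIX filesystems (names here are illustrative):

```cpp
// Replace `path` with `contents` so that a crash leaves either the old
// or the new version on disk, never a torn mix of the two.
#include <filesystem>
#include <fstream>
#include <string>

void atomic_write(const std::string& path, const std::string& contents) {
    const std::string tmp = path + ".tmp";
    {
        std::ofstream out(tmp, std::ios::binary | std::ios::trunc);
        out.write(contents.data(),
                  static_cast<std::streamsize>(contents.size()));
        out.flush();  // hand the bytes to the OS before the rename
    }                 // stream closed here
    std::filesystem::rename(tmp, path);  // atomic replacement on POSIX
}
```

Even this is incomplete: true durability also requires fsync() on the file and on its directory, which std::ofstream cannot express, and it only covers whole-file replacement, not in-place updates to billions of records. A real database handles all of that, which is the point of the answer above.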
