What is a better approach for storing and querying a big dataset of meteorological data?


Question



    I am looking for a convenient way to store and query a huge amount of meteorological data (a few TB). More information about the type of data is in the middle of the question.

    Previously I was looking in the direction of MongoDB (I was using it for many of my own previous projects and feel comfortable dealing with it), but recently I found out about HDF5 data format. Reading about it, I found some similarities with Mongo:

    HDF5 simplifies the file structure to include only two major types of object: Datasets, which are multidimensional arrays of a homogeneous type; and Groups, which are container structures that can hold datasets and other groups. This results in a truly hierarchical, filesystem-like data format. Metadata is stored in the form of user-defined, named attributes attached to groups and datasets.

    This looks like arrays and embedded objects in Mongo, and HDF5 also supports indices for querying the data.

    Because it uses B-trees to index table objects, HDF5 works well for time series data such as stock price series, network monitoring data, and 3D meteorological data.
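The hierarchical layout quoted above can be sketched with h5py (assuming h5py is installed; the group, dataset, and attribute names here are purely illustrative):

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical layout: one group per sensor, one dataset per variable.
path = os.path.join(tempfile.mkdtemp(), "weather.h5")
with h5py.File(path, "w") as f:
    sensor = f.create_group("region_A/sensor_001")
    # Sensor metadata stored as attributes, much like fields of a Mongo document.
    sensor.attrs["name"] = "sensor_001"
    sensor.attrs["lat"], sensor.attrs["lng"] = 52.1, 4.3
    # One 2-D dataset per variable: rows = time steps, columns = heights (0m, 10m, 25m).
    temps = np.random.uniform(-10.0, 35.0, size=(1440, 3))
    sensor.create_dataset("temperature", data=temps,
                          chunks=(144, 3), compression="gzip")

with h5py.File(path, "r") as f:
    ds = f["region_A/sensor_001/temperature"]
    print(ds.shape, dict(f["region_A/sensor_001"].attrs)["name"])
```

Attributes play the role of a document's metadata fields, while datasets hold the bulk numeric arrays.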

    The data:

    A specific region is divided into smaller squares. At each intersection a sensor is located (a dot).

    This sensor collects the following information every X minutes:

    • solar luminosity
    • wind direction and speed
    • humidity
    • and so on (this information is mostly the same, sometimes a sensor does not collect all the information)

    It also collects this at different heights (0 m, 10 m, 25 m). The heights are not always the same. Each sensor also has some sort of metainformation:

    • name
    • lat, lng
    • is it in water, and many others

    Given this, I do not expect the size of one element to be bigger than 1 MB. I also have enough storage in one place to save all the data (so, as far as I understand, no sharding is required).
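One observation as described above is tiny. As a rough sanity check on the "well under 1 MB per element" claim, a plausible per-record layout can be written as a NumPy structured dtype (field names are illustrative, not taken from the question):

```python
import numpy as np

# A hypothetical record layout for one observation at one height.
obs_dtype = np.dtype([
    ("time",       "datetime64[m]"),  # X-minute timestamp
    ("height_m",   "f4"),             # 0, 10, 25, ...
    ("luminosity", "f4"),
    ("wind_dir",   "f4"),
    ("wind_speed", "f4"),
    ("humidity",   "f4"),
])
row = np.zeros(1, dtype=obs_dtype)
print(obs_dtype.itemsize)  # bytes per observation: 8 + 5 * 4 = 28
```

Even with many more variables, a record stays orders of magnitude below 1 MB; the total volume comes from the number of sensors times the number of time steps.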

    Operations with the data. There are several ways I am going to interact with the data:

    • convert and store a big amount of it: A few TB of data will be given to me at some point in NetCDF format, and I will need to store it (it is relatively easy to convert it to HDF5). Then, periodically, smaller parts of the data (1 GB per week) will be provided, and I have to add them to the storage. Just to highlight: I have enough storage to save all this data on one machine.

    • query the data. Often there is a need to query the data in real time. The most frequent queries are: tell me the temperature of the sensors in a specific region at a specific time; show me the data from a specific sensor for a specific time; show me the wind for some region over a given time range. Aggregated queries (what is the average temperature over the last two months?) are highly unlikely. Here I think that Mongo is nicely suitable, but hdf5+pytables is an alternative.

    • perform some statistical analysis. Currently I do not know exactly what it would be, but I know that this does not have to be in real time. So I was thinking that using Hadoop with Mongo might be a nice idea, but HDF5 with R is a reasonable alternative.
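The region-at-a-time query pattern above reduces to boolean masking over coordinate and value arrays. A toy in-memory version (pure NumPy, with made-up coordinates; a real deployment would push the same predicate into the storage layer, e.g. via PyTables' `where()`):

```python
import numpy as np

# Toy stand-in for "temperature of sensors in a region at one time step".
rng = np.random.default_rng(0)
n = 1000
lat  = rng.uniform(50.0, 54.0, n)    # sensor latitudes
lng  = rng.uniform(3.0, 7.0, n)      # sensor longitudes
temp = rng.uniform(-5.0, 30.0, n)    # temperature at the chosen time

# Bounding-box "region" query: select sensors inside the box.
mask = (lat > 51.0) & (lat < 53.0) & (lng > 4.0) & (lng < 6.0)
print(temp[mask].size, float(temp[mask].mean()))
```

The same mask-based selection works whether the arrays live in memory, in a Mongo result set, or in an HDF5 dataset read back as NumPy arrays.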

    I know that questions about the "better approach" are not encouraged, but I am looking for the advice of experienced users. If you have any questions, I would be glad to answer them and will appreciate your help.

    P.S. I reviewed some interesting discussions similar to mine: hdf-forum, searching in hdf5, storing meteorological data

    Solution

    It's a difficult question and I am not sure if I can give a definite answer but I have experience with both HDF5/pyTables and some NoSQL databases.
    Here are some thoughts.

    • HDF5 per se has no notion of an index. It is only a hierarchical storage format that is well suited for multidimensional numeric data. It is possible to build on top of HDF5 to implement an index for the data (e.g. PyTables, HDF5 FastQuery).
    • HDF5 (unless you are using the MPI version) does not support concurrent write access (read access is possible).
    • HDF5 supports compression filters which, contrary to popular belief, can actually make data access faster (but you have to think about the proper chunk size, which depends on the way you access the data).
    • HDF5 is not a database. MongoDB has ACID properties; HDF5 doesn't (this might be important).
    • There is a package (SciHadoop) that combines Hadoop and HDF5.
    • HDF5 makes it relatively easy to do out-of-core computation (i.e. when the data is too big to fit into memory).
    • PyTables supports some fast "in kernel" computations directly in HDF5 using numexpr.
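The out-of-core point above can be sketched with h5py (assuming h5py is installed; file and dataset names are illustrative): stream the dataset chunk by chunk instead of loading it whole, so memory use stays bounded regardless of dataset size.

```python
import os
import tempfile

import h5py
import numpy as np

# Build a small chunked dataset to stand in for a multi-TB one.
path = os.path.join(tempfile.mkdtemp(), "big.h5")
with h5py.File(path, "w") as f:
    f.create_dataset("temperature",
                     data=np.arange(10_000, dtype="f8"),
                     chunks=(1000,))

# Out-of-core mean: only one chunk is ever resident in memory.
total, count = 0.0, 0
with h5py.File(path, "r") as f:
    ds = f["temperature"]
    for start in range(0, ds.shape[0], 1000):
        block = ds[start:start + 1000]  # reads one chunk from disk
        total += float(block.sum())
        count += block.size
print(total / count)  # mean of 0..9999 = 4999.5
```

Aligning the read stride with the dataset's chunk size is what makes such streaming efficient, which is why the chunk-size point in the list above matters.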

    I think your data generally is a good fit for storing in HDF5. You can also do statistical analysis either in R or via Numpy/Scipy.
    But you can also think about a hybrid approach. Store the raw bulk data in HDF5 and use MongoDB for the metadata or for caching specific values that are often used.
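A minimal sketch of that hybrid split (a plain dict stands in for the MongoDB collection to keep the example self-contained; in production it would be a pymongo collection, and all names here are hypothetical):

```python
import os
import tempfile

import h5py
import numpy as np

# Bulk numeric arrays live in HDF5.
path = os.path.join(tempfile.mkdtemp(), "bulk.h5")
with h5py.File(path, "w") as f:
    f.create_dataset("sensor_001/temperature",
                     data=np.random.rand(1440, 3))

# Metadata documents live in the document store (dict as a stand-in).
meta_store = {}
meta_store["sensor_001"] = {
    "name": "sensor_001",
    "lat": 52.1, "lng": 4.3,
    "in_water": False,
    "hdf5_path": "sensor_001/temperature",  # pointer into the HDF5 file
}

# Query flow: look up metadata first, then fetch the bulk array it points to.
doc = meta_store["sensor_001"]
with h5py.File(path, "r") as f:
    data = f[doc["hdf5_path"]][:]
print(data.shape)
```

The document store answers the "which sensors are in this region / in water" questions cheaply, while HDF5 serves the large array reads.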
