Container for large amount of data of various types


Question

Hello,

I am relatively new to C++ and have been mulling over a container-related problem for a while now. I have a ~2 GB ASCII file of time-ordered data with five comma-separated columns. The first column is a string identifier, the second and third are date and time, and the last two are numerical data, e.g.:

frog, 10/06/2006, 23:56:03, 12.3456, 322.551
frog, 11/06/2006, 22:00:06, 6.12136, 41.1236
fish, 12/06/2006, 09:01:54, 1.3456, 0.3321

I want to perform some numerical analysis on the data columns. Suppose I only have frogs and fish, but a lot of them. I was thinking of defining a Vertebrate class and creating a frog and a fish object, each of which will contain my numerical data and times. My Vertebrate constructor will read the ASCII file and pass the data to the object's container (after reducing the date and time to a long int which is the number of seconds elapsed since 1/1/2000 or whatever).
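The date-and-time reduction mentioned above could be sketched roughly as follows. This is only an illustration: the helper names `daysFromCivil` and `secondsSince2000` are hypothetical, the date is assumed to be in dd/mm/yyyy order (the sample rows do not settle this), and time zones and leap seconds are ignored.

```cpp
#include <cstdio>
#include <string>

// Days from 1970-01-01 to the given Gregorian date (may be negative).
// This is the widely used "days from civil" integer arithmetic.
long daysFromCivil(int y, int m, int d) {
    y -= m <= 2;  // treat January and February as months 13 and 14 of the previous year
    const long era = (y >= 0 ? y : y - 399) / 400;
    const long yoe = y - era * 400;                                   // year of era, [0, 399]
    const long doy = (153 * (m + (m > 2 ? -3 : 9)) + 2) / 5 + d - 1;  // day of year, [0, 365]
    const long doe = yoe * 365 + yoe / 4 - yoe / 100 + doy;           // day of era, [0, 146096]
    return era * 146097 + doe - 719468;
}

// Hypothetical helper: converts "dd/mm/yyyy" and "hh:mm:ss" into seconds
// elapsed since 2000-01-01 00:00:00. Returns false on malformed input.
// ASSUMPTION: day comes before month in the date field.
bool secondsSince2000(const std::string& date, const std::string& time, long long& out) {
    int d, mo, y, h, mi, s;
    if (std::sscanf(date.c_str(), "%d/%d/%d", &d, &mo, &y) != 3) return false;
    if (std::sscanf(time.c_str(), "%d:%d:%d", &h, &mi, &s) != 3) return false;
    const long days = daysFromCivil(y, mo, d) - daysFromCivil(2000, 1, 1);
    out = static_cast<long long>(days) * 86400 + h * 3600 + mi * 60 + s;
    return true;
}
```

The result is written to a `long long` rather than a `long int` because `long` is only guaranteed to be 32 bits wide, which would overflow for dates far enough from the epoch.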

Now I will iterate over all of the objects' data to compute, say, the mean of the numerical data or correlations between frog and fish data. The container must associate string, long int, and float types with each iterator value or key and should be optimized for sequential access. I considered a vector of structs:

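#include <vector>

class Vertebrate {
public:
    // constructor, destructor, various methods
private:
    struct animal {
        char species[4];   // fixed-width tag such as "frog" or "fish" (no room for a terminating '\0')
        long int dayTime;  // seconds elapsed since 1/1/2000
        float data1;
        float data2;
    };
    std::vector<animal> animals;
    std::vector<animal>::iterator itr;
};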

My question: is this an efficient way to store this varied data, or are there STL containers better suited for this purpose? (Also, I shouldn't expect the container to fit in the machine's memory, so I considered STXXL's analog of vector, which stores the vector in a kind of page file on the hard disk, but I haven't even tried to get that going yet.)

Less importantly, given the amount of time that reading and processing the data into the container would presumably take, would it be possible to somehow save the object as a binary file to disk to be reloaded directly when the program is restarted, thus avoiding reprocessing the entire ASCII data?
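One way the binary save-and-reload idea could look is sketched below, assuming the records are plain fixed-layout structs. The `Record` type and the `saveRecords` and `loadRecords` functions are illustrative names, not part of any library, and the file format (a 64-bit count followed by the raw records) is an arbitrary choice made for this example.

```cpp
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <string>
#include <type_traits>
#include <vector>

// Illustrative fixed-layout record (field names are assumptions).
struct Record {
    std::int64_t dayTime;  // seconds since 2000-01-01
    float data1;
    float data2;
    char species[4];       // "frog" or "fish", not NUL-terminated
};
static_assert(std::is_trivially_copyable<Record>::value,
              "raw byte I/O is only valid for trivially copyable types");

// Writes the record count followed by the raw bytes of all records.
bool saveRecords(const std::string& path, const std::vector<Record>& recs) {
    std::ofstream out(path, std::ios::binary);
    if (!out) return false;
    const std::uint64_t n = recs.size();
    out.write(reinterpret_cast<const char*>(&n), sizeof n);
    if (n > 0) {
        out.write(reinterpret_cast<const char*>(recs.data()),
                  static_cast<std::streamsize>(n * sizeof(Record)));
    }
    return static_cast<bool>(out);
}

// Reads back a file produced by saveRecords on a machine with the same layout.
bool loadRecords(const std::string& path, std::vector<Record>& recs) {
    std::ifstream in(path, std::ios::binary);
    if (!in) return false;
    std::uint64_t n = 0;
    if (!in.read(reinterpret_cast<char*>(&n), sizeof n)) return false;
    recs.resize(static_cast<std::size_t>(n));
    if (n > 0) {
        in.read(reinterpret_cast<char*>(recs.data()),
                static_cast<std::streamsize>(n * sizeof(Record)));
    }
    return static_cast<bool>(in);
}
```

A raw dump like this depends on the byte order, struct padding, and type sizes of the machine that wrote it, so on a grid of non-identical machines the file would probably need to be regenerated per architecture or written field by field in a fixed byte order.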

Recommended answer

I think you've got a good start on solving the problem, and I think a struct is the way to go. There's no STL container that I know of that can store multiple types (unless they are subclasses of one parent class). However, looking at your program, I'm not sure I would do it this way. Just think about the organization for a moment: does a 'Vertebrate' have-an 'animal'? No, a 'Vertebrate' is-an 'animal'. I think you should tackle this problem using inheritance, not composition. Also, is it the Vertebrate class's job to interpret textual data, or could you write a function to process the text and then create an object using the retrieved values?

Just a few things to think about.


There's a lot of stuff not being said here.

First: Your container does not need to store the objects. Instead, it can store a handle to each object. See the article on Handle Classes in the C/C++ Articles forum.

Second: You do not need to associate a string, long int, and float. The problem is that adding a new association would require the container to be reloaded.

Third: Use polymorphism. That is, have a base class. However, separate the interface (the public methods of the base class) from the implementation (the virtual methods of the base class). The base class should have no public virtual methods. Instead, the base class virtual methods should be private, and it is these methods that the derived class overrides. Check out the design pattern called Template Method.
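A minimal sketch of that arrangement might look like this. The method names `describe`, `speciesName`, and `legCount` are invented for illustration. The one public virtual member kept here is the destructor, which is the usual exception so that objects can be deleted safely through a base-class pointer.

```cpp
#include <string>

class Vertebrate {
public:
    // Public virtual destructor: the customary exception to
    // "no public virtual methods", needed for deletion via a base pointer.
    virtual ~Vertebrate() = default;

    // Public, non-virtual interface. It fixes the overall steps
    // (the "template method") and delegates details to private virtuals.
    std::string describe() const {
        return speciesName() + ":" + std::to_string(legCount());
    }

private:
    // Private virtual hooks: derived classes override these,
    // but callers can only reach them through describe().
    virtual std::string speciesName() const = 0;
    virtual int legCount() const = 0;
};

class Frog : public Vertebrate {
private:
    std::string speciesName() const override { return "frog"; }
    int legCount() const override { return 4; }
};

class Fish : public Vertebrate {
private:
    std::string speciesName() const override { return "fish"; }
    int legCount() const override { return 0; }
};
```

With this layout the base class controls the order in which the hooks run, so checks or logging can later be added in `describe()` without touching any derived class.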

Fourth: the Vertebrate constructor should not read the disc file. Instead, use a CreateContainer function that a) creates the container of handles to vertebrates, b) reads the disc file, c) creates the frog or fish object, d) loads the object with file data, e) adds the object to the container as a handle to a base object. At the end of the disc file, this function returns a handle to the container.
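Steps a) through e) could be sketched as below. This assumes C++14, uses `std::unique_ptr` as the handle type, and returns the container by value instead of returning a handle to it; those are choices made for the example, as are the class layout and the decision to skip malformed lines. The date and time are kept as text here to keep the sketch short.

```cpp
#include <exception>
#include <istream>
#include <memory>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Base class with a non-virtual public interface; all names are illustrative.
class Vertebrate {
public:
    Vertebrate(std::string date, std::string time, float d1, float d2)
        : date_(std::move(date)), time_(std::move(time)), data1_(d1), data2_(d2) {}
    virtual ~Vertebrate() = default;

    std::string species() const { return doSpecies(); }
    float data1() const { return data1_; }
    float data2() const { return data2_; }

private:
    virtual std::string doSpecies() const = 0;
    std::string date_, time_;
    float data1_, data2_;
};

class Frog : public Vertebrate {
public:
    using Vertebrate::Vertebrate;  // inherit the base constructor
private:
    std::string doSpecies() const override { return "frog"; }
};

class Fish : public Vertebrate {
public:
    using Vertebrate::Vertebrate;
private:
    std::string doSpecies() const override { return "fish"; }
};

using Container = std::vector<std::unique_ptr<Vertebrate>>;

// Reads "species, date, time, data1, data2" lines and builds the container.
// Lines with too few fields, bad numbers, or an unknown species are skipped.
Container CreateContainer(std::istream& in) {
    Container animals;
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream fields(line);
        std::string f[5];
        bool complete = true;
        for (std::string& field : f) {
            if (!std::getline(fields, field, ',')) { complete = false; break; }
            field.erase(0, field.find_first_not_of(" \t"));  // trim leading blanks
        }
        if (!complete) continue;

        float d1 = 0.0f, d2 = 0.0f;
        try {
            d1 = std::stof(f[3]);
            d2 = std::stof(f[4]);
        } catch (const std::exception&) {
            continue;  // non-numeric data column
        }

        if (f[0] == "frog") {
            animals.push_back(std::make_unique<Frog>(f[1], f[2], d1, d2));
        } else if (f[0] == "fish") {
            animals.push_back(std::make_unique<Fish>(f[1], f[2], d1, d2));
        }
    }
    return animals;
}
```

Taking a `std::istream&` rather than a file name lets the same function read from a `std::ifstream` in production and from a `std::istringstream` in a quick test.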

Fifth: Consider using a flat file where the objects are stored in the file rather than in the container. The container just has the offset from the beginning of the file to the object. That is, a frog knows it is 112345 bytes from the beginning of the file. You can seek to that location and read the frog. This would allow adding to the end of the file. If you do this, you never need to reload the data into the computer. You can have a giant file and a small computer program.
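The offset idea can be sketched with standard file streams as follows, using the text file itself as the flat file. The function names `buildOffsetIndex` and `readRecordAt` are hypothetical, and a real program would likely keep one stream open instead of reopening the file for every read.

```cpp
#include <fstream>
#include <string>
#include <vector>

// Scans the file once and records the byte offset at which each line starts.
// Only these offsets are kept in memory, not the records themselves.
std::vector<std::streamoff> buildOffsetIndex(const std::string& path) {
    std::vector<std::streamoff> offsets;
    std::ifstream in(path, std::ios::binary);
    std::string line;
    while (true) {
        const std::streamoff pos = in.tellg();  // position before reading the line
        if (!std::getline(in, line)) break;
        offsets.push_back(pos);
    }
    return offsets;
}

// Seeks to a stored offset and reads the single record (line) found there.
bool readRecordAt(const std::string& path, std::streamoff offset, std::string& line) {
    std::ifstream in(path, std::ios::binary);
    if (!in) return false;
    in.seekg(offset, std::ios::beg);
    return static_cast<bool>(std::getline(in, line));
}
```

Opening the file in binary mode keeps the offsets reported by `tellg` consistent with the byte positions later passed to `seekg`, which text-mode newline translation on some platforms could otherwise disturb.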

Sixth: Why are you not using a SQL database? I mean, all of this database work has already been done. You just need your data model, then create your tables and process your SQL queries and updates. If you are on Unix, use Oracle. If you are on Windows, use SQL Server.


Thank you very much for your comments. I guess I should elaborate a bit. I'm still 'testing the water' when it comes to OOP, so I might well be suggesting doing things in the most impractical way. Let me reply in reverse order:

Sixth. My script will be uploaded to a grid of non-identical fast machines running Unix and the Sun Grid Engine. My queued job scripts will compile my code with the machine-specific compiler on each node they spawn to and then run it. The nodes do not have any sophisticated database program as far as I know. Whilst I could in principle run a database from the job directory, I suppose there is a more compact solution given the simplicity of what I'm trying to do. (And I like to understand what's going on behind the scenes, anyway.)

Fifth. As for the 'flat file', I guess you mean hashing or indexing the file and then using gseek() to iterate through the file. I thought of that, but isn't file access very slow compared to accessing a container's elements? Especially if this has to be done over the grid network, even though it's gigabit Ethernet. The routines I will be using involve some heavy repeated cycling through the data. A faster disk-related method might involve STXXL, which apparently uses some sophisticated prefetching and caching of a local page file it creates, but I think it's limited to vector and set containers.

Particle physicists process terabytes of data on server farms using the TTree object from the ROOT data processing framework developed at CERN. You can store anything in a TTree. After you create one, you can save it to disk as a compressed binary file (a "ROOT" file). I guess it's some kind of hashed table. I'd really like to use that class, but ROOT is an arcane non-standard mix of C and C++ which inherits decades of legacy C and FORTRAN routines and uses the CINT interpreter, and I didn't manage to figure out how to extract the TTree class and its dependencies.

As for your first three points, coming from a brief C background, I'm still trying to get my head around OOP, so I'll try to read up on what you suggested to understand how to implement it. But in the meantime:

Fourth. Sounds good. I'll do that.

Third. Will read about the "Template Method".

Second. Why would the vector container be reloaded? The vector only sees a struct. Do you refer to the struct as a container? I'm still getting used to the terminology...

First. Will read about Handle Classes. By a container storing a handle to the object, do you mean something like an array of pointers to different types in C? I was hoping to avoid using dynamic memory with objects, but I'll have to learn it someday...

