How to produce a massive amount of data?


Question



I'm doing some testing with nutch and hadoop and I need a massive amount of data. I want to start with 20GB, go to 100 GB, 500 GB and eventually reach 1-2 TB.

The problem is that I don't have this amount of data, so I'm thinking of ways to produce it.

The data itself can be of any kind. One idea is to take an initial set of data and duplicate it, but that's not good enough because I need files that differ from one another (identical files are ignored).

Another idea is to write a program that will create files with dummy data.

Any other ideas?

Solution

This may be a better question for the statistics StackExchange site (see, for instance, my question on best practices for generating synthetic data).

However, if you're not so interested in the data properties as in the infrastructure to manipulate and work with the data, then you can ignore the statistics site. In particular, if you are not focused on statistical aspects of the data, and merely want "big data", then we can focus on how one can generate a large pile of data.

I can offer several answers:

  1. If you are just interested in random numeric data, generate a large stream from your favorite implementation of the Mersenne Twister. There is also /dev/random (see this Wikipedia entry for more info). I prefer a known random number generator, as the results can be reproduced ad nauseam by anyone else.
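A minimal Python sketch of this idea (Python's `random` module is itself a Mersenne Twister implementation; the file name, record count, and seed here are arbitrary placeholders):

```python
import random

def generate_numeric_file(path, num_values, seed):
    # Python's random module uses the Mersenne Twister, so the same
    # seed reproduces exactly the same data on any machine.
    rng = random.Random(seed)
    with open(path, "w") as f:
        for _ in range(num_values):
            f.write(f"{rng.random()}\n")

# Vary the seed per file to get files that differ from one another.
generate_numeric_file("part-00000.txt", 1_000_000, seed=42)
```

Because the output is fully determined by the seed, anyone can regenerate the same files later instead of shipping terabytes around.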

  2. For structured data, you can look at mapping random numbers to indices and create a table that maps indices to, say, strings, numbers, etc., such as one might encounter in producing a database of names, addresses, etc. If you have a large enough table or a sufficiently rich mapping target, you can reduce the risk of collisions (e.g. same names), though perhaps you'd like to have a few collisions, as these occur in reality, too.
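A toy sketch of the index-to-table mapping (the lookup tables and column layout here are made up for illustration; a real run would use far larger tables to control the collision rate):

```python
import random

# Tiny illustrative tables; scale these up to reduce collisions,
# or keep them small if you want duplicate names, as in real data.
FIRST_NAMES = ["Alice", "Bob", "Carol", "Dave", "Eve"]
LAST_NAMES = ["Smith", "Jones", "Lee", "Garcia", "Chen"]
CITIES = ["Berlin", "Tokyo", "Austin", "Lagos", "Lima"]

def random_record(rng):
    # Map random indices into the lookup tables to build one CSV row.
    return "{},{},{},{}".format(
        rng.choice(FIRST_NAMES),
        rng.choice(LAST_NAMES),
        rng.choice(CITIES),
        rng.randint(18, 90),  # e.g. an age column
    )

rng = random.Random(7)
records = [random_record(rng) for _ in range(5)]
```

The same pattern extends to addresses, timestamps, product IDs, or any other column you want in the synthetic database.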

  3. Keep in mind that with any generative method you need not store the entire data set before beginning your work. As long as you record the state (e.g. of the RNG), you can pick up where you left off.
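A small sketch of checkpointing the generator state with Python's `Random.getstate`/`setstate` (batch sizes are arbitrary):

```python
import pickle
import random

rng = random.Random(123)
first_batch = [rng.random() for _ in range(1000)]

# Save the generator's state before stopping...
state = pickle.dumps(rng.getstate())

# ...then later restore it and continue exactly where we left off.
rng2 = random.Random()
rng2.setstate(pickle.loads(state))
second_batch = [rng2.random() for _ in range(1000)]

# A fresh generator run straight through yields the same stream,
# so the two-session run is indistinguishable from one session.
check = random.Random(123)
full = [check.random() for _ in range(2000)]
assert full == first_batch + second_batch
```

This is what lets you grow the dataset from 20 GB to 2 TB incrementally without keeping everything generated so far.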

  4. For text data, you can look at simple random string generators. You might create your own estimates for the probability of strings of different lengths or different characteristics. The same can go for sentences, paragraphs, documents, etc. - just decide what properties you'd like to emulate, create a "blank" object, and fill it with text.
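One possible sketch of such a generator, drawing word and sentence lengths from simple Gaussian distributions (the distribution parameters here are arbitrary placeholders; tune them to the corpus you want to mimic):

```python
import random
import string

def random_word(rng, mean_len=5):
    # Word length drawn from a distribution you can tune to your corpus.
    length = max(1, int(rng.gauss(mean_len, 2)))
    return "".join(rng.choice(string.ascii_lowercase) for _ in range(length))

def random_sentence(rng, mean_words=12):
    n = max(3, int(rng.gauss(mean_words, 4)))
    words = [random_word(rng) for _ in range(n)]
    return " ".join(words).capitalize() + "."

def random_document(rng, num_sentences=50):
    # A "blank" document filled with generated sentences; vary the
    # seed per file so no two files are identical.
    return " ".join(random_sentence(rng) for _ in range(num_sentences))

doc = random_document(random.Random(99))
```

Layering the same idea (words, sentences, paragraphs, documents) gives text files of whatever size and shape your Nutch/Hadoop tests need.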

