How can I quickly create large (>1 GB) text+binary files with "natural" content? (C#)

Question

For purposes of testing compression, I need to be able to create large files, ideally in text, binary, and mixed formats.

  • The content of the files should be neither completely random nor uniform.
    A binary file with all zeros is no good. A binary file with totally random data is also not good. For text, a file with totally random sequences of ASCII is not good - the text files should have patterns and frequencies that simulate natural language, or source code (XML, C#, etc). Pseudo-real text.
  • The size of each individual file is not critical, but for the set of files, I need the total to be ~8 GB.
  • I'd like to keep the number of files at a manageable level, let's say O(10).

For creating binary files, I can allocate a large buffer and call System.Random.NextBytes followed by FileStream.Write in a loop, like this:

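// Assumed to be defined elsewhere: size (total bytes to write), sz (buffer length),
// Filename, zeroes (bool flag), and _rnd (a System.Random instance).
Int64 bytesRemaining = size;
byte[] buffer = new byte[sz];
using (Stream fileStream = new FileStream(Filename, FileMode.Create, FileAccess.Write))
{
    while (bytesRemaining > 0)
    {
        // Write a full buffer, or only the remaining tail on the last pass.
        int sizeOfChunkToWrite = (bytesRemaining > buffer.Length) ? buffer.Length : (int)bytesRemaining;
        // When zeroes is set the buffer stays zero-filled; otherwise refill it with random bytes.
        if (!zeroes) _rnd.NextBytes(buffer);
        fileStream.Write(buffer, 0, sizeOfChunkToWrite);
        bytesRemaining -= sizeOfChunkToWrite;
    }
    // No explicit Close() is needed: the using block disposes the stream.
}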

With a large enough buffer, let's say 512 KB, this is relatively fast, even for files over 2 or 3 GB. But the content is totally random, which is not what I want.

For text files, the approach I have taken is to use Lorem Ipsum, and repeatedly emit it via a StreamWriter into a text file. The content is non-random and non-uniform, but it does have many identical repeated blocks, which is unnatural. Also, because the Lorem Ipsum block is so small (<1 KB), it takes many loops and a very, very long time.

Neither of these is quite satisfactory for me.

I have seen the answers to "Quickly create large file on a Windows system?" (http://stackoverflow.com/questions/982659/quickly-create-large-file-on-a-windows-system). Those approaches are very fast, but I think they just fill the file with zeroes, or random data, neither of which is what I want. I have no problem with running an external process like contig or fsutil, if necessary.

The tests run on Windows.
Rather than create new files, does it make more sense to just use files that already exist in the filesystem? I don't know of any that are sufficiently large.

What about starting with a single existing file (maybe c:\windows\Microsoft.NET\Framework\v2.0.50727\Config\enterprisesec.config.cch for a text file) and replicating its content many times? This would work with either a text or binary file.
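The replication idea above could be sketched roughly as follows. This is only an illustration: `FileReplicator`, `Replicate`, and the temp-file names are hypothetical, and the seed content is a stand-in for a real config file. The seed is tiled into a buffer of about 512 KB first, so each `FileStream.Write` call is large, in line with the buffer size mentioned earlier for the binary case.

```csharp
using System;
using System.IO;

// Hypothetical helper: fills a target file by repeating the bytes of a seed file.
public static class FileReplicator
{
    public static void Replicate(string seedPath, string targetPath, long targetSize)
    {
        byte[] seed = File.ReadAllBytes(seedPath);
        if (seed.Length == 0)
            throw new ArgumentException("Seed file is empty.", nameof(seedPath));

        // Tile the seed into a buffer of roughly 512 KB so each write is large.
        int copies = Math.Max(1, (512 * 1024) / seed.Length);
        byte[] buffer = new byte[seed.Length * copies];
        for (int i = 0; i < copies; i++)
            Buffer.BlockCopy(seed, 0, buffer, i * seed.Length, seed.Length);

        long remaining = targetSize;
        using (var output = new FileStream(targetPath, FileMode.Create, FileAccess.Write))
        {
            while (remaining > 0)
            {
                // Write a full buffer, or only the tail on the last pass.
                int chunk = (int)Math.Min(remaining, buffer.Length);
                output.Write(buffer, 0, chunk);
                remaining -= chunk;
            }
        }
    }

    public static void Main()
    {
        // Stand-in seed content; a real test would point at an existing file instead.
        string seed = Path.Combine(Path.GetTempPath(), "seed.txt");
        string target = Path.Combine(Path.GetTempPath(), "replicated.bin");
        File.WriteAllText(seed, "<config><entry key=\"a\" value=\"1\" /></config>\r\n");
        Replicate(seed, target, 10L * 1024 * 1024);
        Console.WriteLine(new FileInfo(target).Length); // prints 10485760
    }
}
```

Note that the output still consists of identical repeated blocks, so it shares the drawback already described for the Lorem Ipsum approach.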

Currently I have an approach that sort of works but it takes too long to run.

Has anyone else solved this?

Is there a much faster way to write a text file than via StreamWriter?

Suggestions?

EDIT: I like the idea of a Markov chain to produce a more natural text. Still need to confront the issue of speed, though.

Recommended answer

I think you might be looking for something like a Markov chain process to generate this data. It's both stochastic (randomised) and structured, in that it operates based on a finite state machine.

Indeed, Markov chains have been used for generating semi-realistic looking text in human languages. In general, they are not trivial things to analyse properly, but the fact that they exhibit certain properties should be good enough for you. (Again, see the "Properties of Markov chains" section of the page.) Hopefully you can see how to design one; the concept is actually quite simple to implement. Your best bet will probably be to create a framework for a generic Markov process and then analyse either natural language or source code (whichever you want your random data to emulate) in order to "train" your Markov process. In the end, this should give you very high quality data in terms of your requirements. Well worth the effort if you need test data of these enormous lengths.
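As a rough illustration of the train-then-generate idea, here is a minimal sketch of an order-1, word-level Markov chain in C#. It is one possible shape under stated assumptions, not the answer's own implementation: the names `MarkovTextGenerator`, `Train`, and `Generate`, the tiny sample sentence, and the 80-column line wrap are all hypothetical choices. In practice the sample would be a large body of natural language or source code.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

// Hypothetical helper (not from the original answer): order-1, word-level Markov chain.
public static class MarkovTextGenerator
{
    // Map each word to the list of words that followed it in the sample.
    // Duplicates are kept, so frequent transitions are chosen more often.
    public static Dictionary<string, List<string>> Train(string sample)
    {
        // A null separator array splits on any whitespace.
        string[] words = sample.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
        var table = new Dictionary<string, List<string>>();
        for (int i = 0; i + 1 < words.Length; i++)
        {
            if (!table.TryGetValue(words[i], out var followers))
            {
                followers = new List<string>();
                table[words[i]] = followers;
            }
            followers.Add(words[i + 1]);
        }
        return table;
    }

    // Emit words until roughly targetChars characters have been written.
    public static long Generate(Dictionary<string, List<string>> table,
                                TextWriter writer, long targetChars, Random rnd)
    {
        if (table.Count == 0)
            throw new ArgumentException("Training sample needs at least two words.", nameof(table));

        var keys = new List<string>(table.Keys);
        string current = keys[rnd.Next(keys.Count)];
        long written = 0;
        int column = 0;
        while (written < targetChars)
        {
            writer.Write(current);
            written += current.Length + 1;
            column += current.Length + 1;
            // Break lines at about 80 columns so the output resembles ordinary text.
            if (column >= 80) { writer.Write('\n'); column = 0; }
            else writer.Write(' ');

            // Step to a random follower; restart from a random word at a dead end.
            current = table.TryGetValue(current, out var next)
                ? next[rnd.Next(next.Count)]
                : keys[rnd.Next(keys.Count)];
        }
        return written;
    }

    public static void Main()
    {
        string sample = "the quick brown fox jumps over the lazy dog and the quick dog sleeps";
        var table = Train(sample);
        string path = Path.Combine(Path.GetTempPath(), "markov-sample.txt");

        // A large StreamWriter buffer (1 MB here) reduces the number of small writes.
        using (var writer = new StreamWriter(path, false, Encoding.ASCII, 1 << 20))
        {
            long chars = Generate(table, writer, 1000000, new Random(42));
            Console.WriteLine("Wrote about " + chars + " characters to " + path);
        }
    }
}
```

A higher-order chain (keyed on the previous two or three words) would likely produce more natural text at the cost of a larger table. For binary files, the same structure could in principle be keyed on bytes instead of words, though that variant is not shown here.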
