Sorting gigantic binary files with C#

Question

I have a large file, roughly 400 GB in size, generated daily by an external closed system. It is a binary file with the following format:

byte[8]byte[4]byte[n]

where n is the int32 value stored in the byte[4] field.

This file has no delimiters; to read the whole file you just repeat until EOF, with each "item" represented as byte[8]byte[4]byte[n].

The file looks like:

byte[8]byte[4]byte[n]byte[8]byte[4]byte[n]...EOF

byte[8] is a 64-bit number representing a period of time, expressed in .NET Ticks. I need to sort this file but can't seem to figure out the quickest way to do so.

Presently, I read to the end of the file, loading each item's Ticks and the start and end positions of its byte[n] into a struct. After this, I sort the List in memory by the Ticks property, then open a BinaryReader, seek to each position in Ticks order, read the byte[n] value, and write it to an external file.
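For readers following along, the indexing pass described above might look roughly like the sketch below. It is not the asker's actual code: the name `BuildSortedIndex` and the tuple layout are made up for illustration, and it assumes both header fields are little-endian (which is what `BinaryReader` reads), something the question does not state.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

// Hypothetical helper: scan the stream once, record where each item's
// payload lives, then sort the index by Ticks. Payload bytes are skipped.
static List<(long Ticks, long Offset, int Length)> BuildSortedIndex(Stream input)
{
    var index = new List<(long Ticks, long Offset, int Length)>();
    using var reader = new BinaryReader(input, Encoding.UTF8, leaveOpen: true);
    while (input.Position < input.Length)
    {
        long ticks = reader.ReadInt64();   // byte[8], assumed little-endian
        int length = reader.ReadInt32();   // byte[4], assumed little-endian
        index.Add((ticks, input.Position, length)); // payload starts here
        input.Seek(length, SeekOrigin.Current);     // jump over byte[n]
    }
    // List<T>.Sort is not stable, so items with equal Ticks may be reordered.
    index.Sort((a, b) => a.Ticks.CompareTo(b.Ticks));
    return index;
}

// Tiny in-memory demo with two items, stored out of order.
var sample = new MemoryStream();
using (var writer = new BinaryWriter(sample, Encoding.UTF8, leaveOpen: true))
{
    writer.Write(200L); writer.Write(3); writer.Write(new byte[] { 1, 2, 3 });
    writer.Write(100L); writer.Write(1); writer.Write(new byte[] { 9 });
}
sample.Position = 0;
var sampleIndex = BuildSortedIndex(sample);
foreach (var entry in sampleIndex)
    Console.WriteLine($"{entry.Ticks} @ {entry.Offset} ({entry.Length} bytes)");
```

Only the index is held in memory here; the payloads stay on disk, which is why the later pass has to seek to each one in Ticks order.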

At the end of the process I end up with a sorted binary file, but it takes FOREVER. I am using C# .NET and a pretty beefy server, but disk IO seems to be an issue.

Server specs:

  • 2x 2.6 GHz Intel Xeon (hex-core with HT) (24 threads)
  • 32GB RAM
  • 500GB RAID 1 + 0
  • 2TB RAID 5

I've looked all over the internet and can only find examples where a huge file is 1GB (makes me chuckle).

Does anyone have any suggestions?

Recommended answer

A great way to speed up this kind of file access is to memory-map the entire file into address space and let the OS take care of reading whatever bits of the file it needs. So do the same thing as you're doing right now, except read from memory instead of using a BinaryReader/seek/read.

You've got lots of main memory, so this should provide pretty good performance (as long as you're using a 64-bit OS).
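A minimal sketch of that idea using `System.IO.MemoryMappedFiles` follows. The method name `SortFile` is hypothetical, and the sketch makes a few assumptions not confirmed by the question: a 64-bit process on a little-endian machine, a non-empty input file, and an output that contains whole items (header plus payload) in Ticks order.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.MemoryMappedFiles;

// Hypothetical helper: index the items through a read-only mapping,
// sort the index by Ticks, then copy whole items out in that order.
static void SortFile(string inputPath, string outputPath)
{
    // The view's capacity may be rounded up to a page boundary,
    // so take the real length from the file system.
    long fileLength = new FileInfo(inputPath).Length;
    using var mmf = MemoryMappedFile.CreateFromFile(
        inputPath, FileMode.Open, null, 0, MemoryMappedFileAccess.Read);
    using var view = mmf.CreateViewAccessor(0, 0, MemoryMappedFileAccess.Read);

    var index = new List<(long Ticks, long Offset, int Length)>();
    long pos = 0;
    while (pos < fileLength)
    {
        long ticks = view.ReadInt64(pos);      // byte[8]
        int length = view.ReadInt32(pos + 8);  // byte[4]
        index.Add((ticks, pos, length));       // offset of the item's start
        pos += 12 + length;
    }
    index.Sort((a, b) => a.Ticks.CompareTo(b.Ticks));

    using var output = new FileStream(outputPath, FileMode.Create,
        FileAccess.Write, FileShare.None, 1 << 20);
    byte[] buffer = new byte[64 * 1024];
    foreach (var (_, offset, length) in index)
    {
        int itemSize = 12 + length;            // header plus payload
        if (buffer.Length < itemSize) buffer = new byte[itemSize];
        view.ReadArray(offset, buffer, 0, itemSize); // the OS pages data in as needed
        output.Write(buffer, 0, itemSize);
    }
}

// Small demo: three items written out of order to a temp file.
string inputPath = Path.GetTempFileName();
string outputPath = Path.GetTempFileName();
using (var writer = new BinaryWriter(File.Create(inputPath)))
{
    writer.Write(300L); writer.Write(2); writer.Write(new byte[] { 3, 3 });
    writer.Write(100L); writer.Write(1); writer.Write(new byte[] { 1 });
    writer.Write(200L); writer.Write(0);
}
SortFile(inputPath, outputPath);
byte[] sorted = File.ReadAllBytes(outputPath);
Console.WriteLine($"{sorted.Length} bytes written, first Ticks = {BitConverter.ToInt64(sorted, 0)}");
```

The reads are still in Ticks order rather than file order, so how much this helps depends on how well the OS page cache copes with that access pattern on the actual data.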