Reading and writing in chunks on Linux using C


Problem description


I have an ASCII file where every line contains a record of variable length. For example:

Record-1: 15 characters
Record-2: 200 characters
Record-3: 500 characters
...
...
Record-n: X characters

As the file size is about 10 GB, I would like to read the records in chunks. Once read, I need to transform them and write them into another file in binary format.

So, for reading, my first reaction was to create a char array such as

FILE *stream; 
char buffer[104857600]; //100 MB char array
fread(buffer, sizeof(buffer), 104857600, stream);

  1. Is it correct to assume that Linux will issue one system call and fetch the entire 100 MB?
  2. As the records are separated by newlines, I search character by character for a newline character in the buffer and reconstruct each record (a sketch of this scan follows below).
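
For reference, a minimal sketch of the character-by-character scan described in point 2, assuming a chunk of n bytes has already been read into buf; handle_record is a hypothetical callback, and a record that straddles the end of the chunk must be carried over into the next read (not shown):

#include <stddef.h>

static void handle_record(const char *rec, size_t len);  /* hypothetical per-record handler */

static void scan_chunk(const char *buf, size_t n)
{
    size_t start = 0;
    for (size_t i = 0; i < n; i++) {
        if (buf[i] == '\n') {
            handle_record(buf + start, i - start);  /* one record, newline stripped */
            start = i + 1;
        }
    }
    /* buf[start..n) is a partial record to prepend to the next chunk */
}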

My question is: is this how I should read in chunks, or is there a better alternative to read data in chunks and reconstitute each record? Is there an alternative way to read x variable-sized lines from an ASCII file in one call?

Next, during writing, I do the same. I have a write char buffer, which I pass to fwrite to write a whole set of records in one call.

fwrite(buffer, sizeof(buffer), 104857600, stream);
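
For reference, fwrite(ptr, size, nmemb, stream) writes size * nmemb bytes, so passing both sizeof(buffer) and 104857600 requests far more than one buffer's worth. A minimal sketch of writing only the bytes actually placed in the buffer, with an element size of 1 so that the return value is a byte count (the used variable is illustrative):

size_t used = ...;  /* number of bytes of buffer actually filled with records */
size_t written = fwrite(buffer, 1, used, stream);
if (written != used) {
    perror("fwrite");  /* short write or error */
}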

UPDATE: If I setbuf(stream, buffer), where buffer is my 100 MB char buffer, would fgets return data from the buffer or cause disk IO?
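
For what it's worth, a custom-sized stdio buffer like that is normally installed with setvbuf rather than setbuf (setbuf assumes a buffer of exactly BUFSIZ bytes); a minimal sketch, with the file name and buffer size purely illustrative:

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    FILE *stream = fopen("records.txt", "r");  /* illustrative file name */
    if (stream == NULL) { perror("fopen"); return 1; }

    size_t bufsz = 100u * 1024 * 1024;         /* 100 MB, allocated on the heap */
    char *big = malloc(bufsz);
    if (big == NULL) { perror("malloc"); return 1; }

    /* Must be called after fopen and before the first read; _IOFBF = fully buffered.
       fgets then returns data from this buffer, and the C library refills it from
       the file when it runs dry (typically with reads up to the buffer size). */
    if (setvbuf(stream, big, _IOFBF, bufsz) != 0) { perror("setvbuf"); return 1; }

    char line[1000];
    while (fgets(line, sizeof(line), stream)) {
        /* process one line */
    }

    fclose(stream);
    free(big);  /* the buffer must stay valid until the stream is closed */
    return 0;
}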

Solution

  1. Yes, fread will fetch the entire thing at once. (Assuming it's a regular file.) But it won't read 105 MB unless the file itself is 105 MB, and if you don't check the return value you have no way of knowing how much data was actually read, or whether there was an error (a checked-read sketch follows after this list).

  2. Use fgets (see man fgets) instead of fread. This will search for the line breaks for you.

    char linebuf[1000];
    FILE *file = ...;
    while (fgets(linebuf, sizeof(linebuf), file)) {
        // decode one line
    }
    

  3. There is a problem with your code.

    char buffer[104857600]; // too big
    

    If you try to allocate a large buffer (105 MB is certainly large) on the stack, then it will fail and your program will crash. If you need a buffer that big, you will have to allocate it on the heap with malloc or similar. I'd certainly keep stack usage for a single function in the tens of KB at most, although you could probably get away with a few MB on most stock Linux systems.
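
Putting points 1 and 3 together, a minimal sketch of a heap-allocated chunk buffer with the fread return value checked; the file name and chunk size are illustrative:

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    FILE *in = fopen("records.txt", "r");  /* illustrative file name */
    if (in == NULL) { perror("fopen"); return 1; }

    size_t chunk = 100u * 1024 * 1024;     /* 100 MB, on the heap, not the stack */
    char *buf = malloc(chunk);
    if (buf == NULL) { perror("malloc"); return 1; }

    size_t got;
    while ((got = fread(buf, 1, chunk, in)) > 0) {
        /* element size is 1, so got is the number of bytes actually read;
           scan buf[0..got) for newlines and rebuild the records here */
    }
    if (ferror(in)) { perror("fread"); return 1; }  /* distinguish a read error from EOF */

    free(buf);
    fclose(in);
    return 0;
}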

As an alternative, you could just mmap the entire file into memory. This will not improve or degrade performance in most cases, but it is easier to work with.

#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int r, fdes;
struct stat st;
void *ptr;
size_t sz;

fdes = open(filename, O_RDONLY);
if (fdes < 0) abort();
r = fstat(fdes, &st);
if (r) abort();
if (st.st_size > (size_t) -1) abort(); // too big to map
sz = st.st_size;
ptr = mmap(NULL, sz, PROT_READ, MAP_SHARED, fdes, 0);
if (ptr == MAP_FAILED) abort();
close(fdes); // file no longer needed

// now, ptr has the data, sz has the data length
// you can use ordinary string functions

The advantage of using mmap is that your program won't run out of memory. On a 64-bit system, you can put the entire file into your address space at the same time (even a 10 GB file), and the system will automatically read new chunks as your program accesses the memory. The old chunks will be automatically discarded, and re-read if your program needs them again.
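
To make that concrete, a hedged sketch of walking the mapped region record by record; it continues from the snippet above, so ptr and sz are the pointer and length obtained there, memchr comes from <string.h>, and handle_record stands in for whatever per-record transformation is done:

const char *p = ptr;                      /* start of the mapped file data */
const char *end = (const char *)ptr + sz;
while (p < end) {
    const char *nl = memchr(p, '\n', (size_t)(end - p));
    size_t len = nl ? (size_t)(nl - p) : (size_t)(end - p);  /* last record may lack a newline */
    handle_record(p, len);                /* a whole record; no chunk boundaries to stitch together */
    p += len + 1;                         /* step past the newline (or past the end) */
}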

It's a very nice way to plow through large files.
