mmap slower than getline?


Question

I face the challenge of reading/writing files (in Gigs) line by line.

Reading many forum entries and sites (including a bunch of SO's), mmap was suggested as the fastest option to read/write files. However, when I implement my code with both readline and mmap techniques, mmap is the slower of the two. This is true for both reading and writing. I have been testing with files ~600 MB large.

My implementations parse line by line and then tokenize the line. I will present file input only.

Here is the getline implementation:

void two(char* path) {

    std::ios::sync_with_stdio(false);
    ifstream pFile(path);
    string mystring;

    if (pFile.is_open()) {
        while (getline(pFile,mystring)) {
            // c style tokenizing
        }
    }
    else perror("error opening file");
    pFile.close();
}

Here is the mmap implementation:

void four(char* path) {

    int fd;
    char *map;
    char *FILEPATH = path;
    unsigned long FILESIZE;

    // find file size
    FILE* fp = fopen(FILEPATH, "r");
    fseek(fp, 0, SEEK_END);
    FILESIZE = ftell(fp);
    fseek(fp, 0, SEEK_SET);
    fclose(fp);

    fd = open(FILEPATH, O_RDONLY);

    map = (char *) mmap(0, FILESIZE, PROT_READ, MAP_SHARED, fd, 0);

    /* Read the file char-by-char from the mmap
     */
    char c;
    stringstream ss;

    for (unsigned long i = 0; i < FILESIZE; ++i) {  // '<', not '<=': map[FILESIZE] is one past the mapping
        c = map[i];
        if (c != '\n') {
            ss << c;
        }
        else {
            // c style tokenizing
            ss.str("");
        }

    }

    if (munmap(map, FILESIZE) == -1) perror("Error un-mmapping the file");

    close(fd);

}

I omitted much error checking in the interest of brevity.

Is my mmap implementation incorrect, and thus affecting performance? Or is mmap perhaps not ideal for my application?

Thanks for any comments or help.

Recommended answer

The real power of mmap is being able to freely seek in a file, use its contents directly as a pointer, and avoid the overhead of copying data from kernel cache memory to userspace. However, your code sample is not taking advantage of this.

In your loop, you scan the buffer one character at a time, appending to a stringstream. The stringstream doesn't know how long the string is, and so has to reallocate several times in the process. At this point you've killed off any performance increase from using mmap - even the standard getline implementation avoids multiple reallocations (by using a 128-byte on-stack buffer, in the GNU C++ implementation).

If you want to use mmap to its fullest power:


  • Don't copy your strings. At all. Instead, pass around pointers that point directly into the mmap buffer.
  • Use library functions such as memchr to find newlines; these make use of hand-tuned assembler and other optimizations to run faster than most open-coded search loops. Prefer the length-bounded mem* functions: a mapped file is not NUL-terminated, so the str* functions are not safe on it.
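
The two points above can be sketched roughly as follows. This is only an illustration, not the asker's code: it assumes a POSIX system and C++17 (for std::string_view), and the names for_each_line and count_lines_mmap are made up for this example. Each line is handed to the callback as a view into the mapping, so no bytes are copied, and memchr does the newline search.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>      // std::memchr
#include <string_view>
#include <fcntl.h>      // open
#include <sys/mman.h>   // mmap, munmap
#include <sys/stat.h>   // fstat
#include <unistd.h>     // close

// Hypothetical helper: calls on_line(std::string_view) once per line in
// [data, data + size). The views point straight into the buffer; nothing
// is copied, so they are only valid while the buffer (mapping) is alive.
template <typename Fn>
void for_each_line(const char* data, std::size_t size, Fn on_line) {
    assert(data != nullptr || size == 0);
    const char* p = data;
    const char* end = data + size;
    while (p < end) {
        // memchr is bounded by the remaining length, so it never reads
        // past the end of the mapping.
        const char* nl = static_cast<const char*>(
            std::memchr(p, '\n', static_cast<std::size_t>(end - p)));
        const char* line_end = nl ? nl : end;  // last line may lack '\n'
        on_line(std::string_view(p, static_cast<std::size_t>(line_end - p)));
        p = nl ? nl + 1 : end;
    }
}

// Hypothetical example: maps a file read-only and counts its lines.
// Returns -1 on error.
long count_lines_mmap(const char* path) {
    int fd = open(path, O_RDONLY);
    if (fd == -1) return -1;

    struct stat st;
    if (fstat(fd, &st) == -1) { close(fd); return -1; }
    if (st.st_size == 0) { close(fd); return 0; }  // mmap rejects length 0

    const std::size_t size = static_cast<std::size_t>(st.st_size);
    void* map = mmap(nullptr, size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);  // the mapping remains valid after the descriptor is closed
    if (map == MAP_FAILED) return -1;

    long lines = 0;
    for_each_line(static_cast<const char*>(map), size,
                  [&](std::string_view) { ++lines; });

    munmap(map, size);
    return lines;
}
```

The tokenizing step would go inside the callback and operate on the string_view directly. Using fstat on the already open descriptor also avoids the separate fopen/fseek/ftell pass used to find the file size in the question.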
