将大文件加载到内存中并启用快速访问 [英] Loading Large file in memory and enable fast access

查看:145
本文介绍了将大文件加载到内存中并启用快速访问的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个我要加载的大文件,在文件的每一行中我都有一个KEY。

我使用了哈希映射,但使用哈希映射的大小非常大。



还有什么方法可以做到这一点?

I have a large file i want to load, where in each line of the file i have a "KEY".
I used hash maps, but the size was really big using hash maps.

What other way to do this ?

推荐答案

没有说明你的意思真的很大 ;而你的实际问题是什么,很难帮助你。尝试回答以下问题:



(1)虚拟地址空间的大小是多少? (在典型的32位系统中,它大约为2 GB;在64位系统中,它要大得多。)



(2)整个文件是否适合你的进程的虚拟内存?



(3)如果没有,至少密钥集是否适合虚拟内存? (假设您的平均密钥长度为12个字节,加上一个指向数据的8字节指针,加上3个用于树中链接的指针,每个密钥大约需要32个字节。在典型的32位系统中,您可以使用一个密钥来存储大约5000个密钥。 B树状结构)



如果密钥集和文件内容都适合您的虚拟内存,只需加载它们并构建密钥数据结构。哈希映射不是最经济的方法,但速度非常快。如果您想将尺寸保持在最小值,请尝试使用红黑树。例如,您可能想尝试QMap,因为Qt 5.0基于红黑树。



如果连你的钥匙的红黑树都没有适应虚拟内存,事情变得复杂一些。在这种情况下,您可以使用像B树这样的结构,每个内存块存储多个节点,并将这些内存块链接在一起形成一个完整的索引。您只需要在内存中有几个块,其余块可能驻留在磁盘文件中。这就是大多数数据库存储其索引的方式。 Google for B-tree。



您在索引中存储的指针可以直接指向内存位置(如果索引和文件适合虚拟内存),也可以指向到相对于数据文件开头的位置。在后一种情况下,您需要一个额外的磁盘访问来获取所需的行。
Without specifying what you mean by "really big" and what your actual problem is, it is difficult to help you. Try to answer the following questions:

(1) What is the size of your virtual address space? (In a typical 32-bit system it's about 2 GB; in 64-bit systems it is much larger.)

(2) Does the entire file fit into the virtual memory of your process?

(3) If not, does at least the key set fit into the virtual memory? (Assuming your average key length is 12 bytes, plus an 8 byte pointer to the data, plus 3 pointers for linkage in a tree, gives about 32 bytes per key. In typical 32-bit system you could store about 50 million keys using a B-tree like structure)

If key set and file contents all fit into your virtual memory, just load them and build the key data structure. A hash map is not the most economical method for that, but very fast. If you want to keep the size to a minimum try using a red-black tree. For example you might want to try QMap, which since Qt 5.0 is based on a red-black tree.

If even the red-black tree with your keys doesn't fit into virtual memory, things get a little more complicated. In that case you can resort to structures like a B-tree, which stores multiple nodes per memory block and links those memory blocks together to form a complete index. You only need to have a couple of those blocks in memory, the rest may reside on a disk file. That is how most databases store their indices. Google for B-tree.

The pointers you store in your index may either point directly to a memory location (if index and file fit into virtual memory) or may point to a location relative to the begin of the data file. In the latter case you need an addition disk access to grab the desired line.


解决方案1:使用文本数据文件存储哈希映射:



Solution 1: Using text data file to store hash maps:

#include <fstream>
#include <iostream>
#include <string>
#include <cstdio>

using namespace std;

int _tmain(int argc, _TCHAR* argv[])
{
	std::ifstream file_in;
	std::string filename = "inp.txt";
	char line[1024] = "\0"; std::streamsize size = 1024;
        // Open file
	file_in.open(filename.c_str(), std::ifstream::in);
        // Fetch each line from the file into the buffer
	while (file_in.good() && file_in.getline(line, size, '\n'))
	{
		char key[256] = "\0", value[256] = "\0";
                // Parse each line according to specific format
		sscanf_s(line, "KEY: %s VALUE: %s", key, 256, value, 256);
                // Output the values of key and value
		std::cout << key << " " << value << endl;
	}

	file_in.close();

	std::cin.get();

	return 0;
}





解决方案2:使用二进制文件和mmap功能:





Solution 2: Using binary files and mmap function:

#include <stdio.h>
#include <stdlib.h>
#include <memory.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/mman.h>

#define NUMITEMS 10

typedef struct tagHash
{
     char key[256];
     char value[256];
}HASH, *PHASH;

int main(int argc, char** argv)
{
    const char fname[256] = "inp.bin";

    int fd = -1;
    if ((fd = open(fname, O_RDWR | O_CREAT | O_TRUNC, (mode_t)0600)) == -1) 
    {
	perror("Error opening file for writing");
	exit(EXIT_FAILURE);
    }    

    PHASH hash_array = new HASH[NUMITEMS];
    memset((void*)hash_array, 0x00, sizeof(HASH) * NUMITEMS);

    for (int i = 0; i < NUMITEMS; i++)
    {
         sprintf(hash_array[i].key,"KEY: %d",i);
         sprintf(hash_array[i].value,"VALUE: %d",i);
    }

    write(fd, (void*)hash_array, sizeof(HASH) * NUMITEMS);

    long cur_pos = 0L, file_size = 0L;
    cur_pos = lseek(fd, 0, SEEK_CUR);
    file_size = lseek(fd, 0, SEEK_END); 
    lseek(fd, cur_pos, SEEK_CUR);

    HASH* hash_ptr = NULL;
    if ((hash_ptr = reinterpret_cast<HASH*>(mmap(0, file_size, 
        PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0))) == MAP_FAILED) 
    {
	close(fd);
	perror("Error mmapping the file");
	exit(EXIT_FAILURE);
    }

    while (hash_ptr != NULL)
    {
         printf("key = %s value = %s\n",hash_ptr->key, hash_ptr->value);
         hash_ptr++;
    }

    close(fd);

    return 0;
}





解决方案3:使用文件中的数据项填充QtHash





Solution 3: Filling QtHash with data item from a file

#include <stdio.h>
#include <stdlib.h>
#include <memory.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/mman.h>

#include <QHash>
#include <QString>

#define NUMITEMS 10

typedef struct tagHash
{
     char key[256];
     char value[256];
}HASH, *PHASH;

int main(int argc, char** argv)
{
    const char fname[256] = "inp.bin";

    int fd = -1;
    if ((fd = open(fname, O_RDWR | O_CREAT | O_TRUNC, (mode_t)0600)) == -1) 
    {
	perror("Error opening file for writing");
	exit(EXIT_FAILURE);
    }    

    PHASH hash_array = new HASH[NUMITEMS];
    memset((void*)hash_array, 0x00, sizeof(HASH) * NUMITEMS);

    for (int i = 0; i < NUMITEMS; i++)
    {
         sprintf(hash_array[i].key,"KEY: %d",i);
         sprintf(hash_array[i].value,"VALUE: %d",i);
    }

    write(fd, (void*)hash_array, sizeof(HASH) * NUMITEMS);

    long cur_pos = 0L, file_size = 0L;
    cur_pos = lseek(fd, 0, SEEK_CUR);
    file_size = lseek(fd, 0, SEEK_END); 
    lseek(fd, cur_pos, SEEK_CUR);

    HASH* hash_ptr = NULL;
    if ((hash_ptr = reinterpret_cast<HASH*>(mmap(0, file_size, 
        PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0))) == MAP_FAILED) 
    {
	close(fd);
	perror("Error mmapping the file");
	exit(EXIT_FAILURE);
    }

    QHash<QString, QString> hashmap;
    while (hash_ptr != NULL)
    {
         hashmap.insert(QString(hash_ptr->key), QString(hash_ptr->value));
         hash_ptr++;
    }

    close(fd);

    return 0;
}





P.S.如果您仍想尝试使用mmap,请参阅 LINK [^] 解释如何使用mmap函数的在线资源。



就是这样。



P.S. If you still want to experiment with mmap, here's the LINK[^] to the online resources explaining how to use the mmap function.

That's it.


这篇关于将大文件加载到内存中并启用快速访问的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆