Processing huge text files

Problem description

Problem: I have a huge raw text file (assume about 3 GB). I need to go through each word in the file and find out how many times each word appears in the file.

My proposed solution: Split the huge file into multiple files, each holding its words in sorted order. For example, all the words starting with "a" will be stored in a "*a.dic" file. So, at any time, we will not exceed 26 files.

The problem with this approach is that I can use streams to read the file, but I wanted to use threads to read certain parts of the file. For example, read bytes 0-1024 with a separate thread (with at least 4-8 threads, based on the number of processors in the box). Is this possible, or am I dreaming?

Is there a better approach?

Note: It should be a pure C++ or C based solution. No databases etc. are allowed.

Recommended answer

You need to look at 'The Practice of Programming' by Kernighan and Pike, and specifically chapter 3.

In C++, use a map keyed on strings, with a count as the value (std::map<std::string, std::size_t>, IIRC). Read the file once (it's too big to read more than once), splitting it into words as you go (for some definition of 'word'), and increment the count in the map entry for each word you find.
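A minimal sketch of that single-pass approach might look like the following. The names count_words, count_words_in_text and WordCounts are made up for illustration, and the definition of a 'word' (a maximal run of ASCII letters, folded to lower case) is an assumption rather than anything the question specifies.

```cpp
#include <cctype>
#include <cstddef>
#include <istream>
#include <map>
#include <sstream>
#include <string>

// Hypothetical alias: maps each distinct word to its number of occurrences.
using WordCounts = std::map<std::string, std::size_t>;

// Reads the stream once, character by character, and counts words.
// Assumption: a "word" is a maximal run of ASCII letters, case-insensitive.
WordCounts count_words(std::istream& in) {
    WordCounts counts;
    std::string word;
    char ch;
    while (in.get(ch)) {
        const unsigned char uc = static_cast<unsigned char>(ch);
        if (std::isalpha(uc)) {
            // Still inside a word: append the lower-cased letter.
            word.push_back(static_cast<char>(std::tolower(uc)));
        } else if (!word.empty()) {
            // A non-letter ends the current word; operator[] creates
            // the entry with count 0 on first sight, then we increment.
            ++counts[word];
            word.clear();
        }
    }
    if (!word.empty()) {
        ++counts[word];  // the input ended in the middle of a word
    }
    return counts;
}

// Convenience wrapper for trying the function on an in-memory string.
WordCounts count_words_in_text(const std::string& text) {
    std::istringstream in(text);
    return count_words(in);
}
```

For the real file, one would open a std::ifstream and pass it to count_words. Memory use then grows with the number of distinct words rather than with the size of the file, which is why a single pass over 3 GB can be feasible.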

In C, you'll have to create the map yourself. (Or find David Hanson's "C Interfaces and Implementations".)

Or you can use Perl, or Python, or Awk (all of which have associative arrays, equivalent to a map).
