Word frequency in a large text file


Problem description

I am trying to read a large text file and output the distinct words in it along with their counts. I've tried a couple of approaches so far, and this is by far the fastest solution I have come up with.

private static readonly char[] separators = { ' ' };

public IDictionary<string, int> Parse(string path)
{
    var wordCount = new Dictionary<string, int>();

    using (var fileStream = File.Open(path, FileMode.Open, FileAccess.Read))
    using (var streamReader = new StreamReader(fileStream))
    {
        string line;
        while ((line = streamReader.ReadLine()) != null)
        {
            var words = line.Split(separators, StringSplitOptions.RemoveEmptyEntries);

            foreach (var word in words)
            {
                if (wordCount.ContainsKey(word))
                {
                    wordCount[word] = wordCount[word] + 1;
                }
                else
                {
                    wordCount.Add(word, 1);
                }
            }
        }
    }

    return wordCount;
}

How I am measuring my solution

I have a 200 MB text file for which I know the total word count (via a text editor). I'm using the Stopwatch class and counting the words to ensure accuracy and to measure the time taken. So far it takes around 9 seconds.
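As a rough illustration of that kind of harness (a sketch only: the class name WordCounter, the file path and the use of LINQ's Sum are placeholders, not details from the original post):

// Hypothetical measurement harness around the Parse method above; requires System.Linq.
var stopwatch = System.Diagnostics.Stopwatch.StartNew();

IDictionary<string, int> counts = new WordCounter().Parse("bigfile.txt");   // placeholder path

stopwatch.Stop();

Console.WriteLine("Distinct words: " + counts.Count);
Console.WriteLine("Total words:    " + counts.Values.Sum());   // compare against the known total
Console.WriteLine("Elapsed:        " + stopwatch.ElapsedMilliseconds + " ms");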

Other attempts

  • I have tried to utilise multithreading to split out the work via the TPL library. This involved batching multiple lines, sending the processing of each batch of lines to a separate task and locking the read/write operations on the dictionary. This, however, did not seem to provide any performance improvement.
  • It took around 30 seconds. I suspect that locking the reads/writes to the dictionary is too costly to gain any performance.
  • I also had a look at the ConcurrentDictionary type, but from my understanding the AddOrUpdate method requires the calling code to handle the synchronization, and it brought no performance benefit. (A rough sketch of what such an attempt could look like follows after this list.)
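For reference, a minimal sketch of what a ConcurrentDictionary-based attempt might look like; the exact batching and task layout from the question is not shown, so a simple Parallel.ForEach over the file's lines is assumed here:

// Assumed sketch only - needs System.Collections.Concurrent, System.Threading.Tasks
// and System.IO; 'separators' is the field from the question's code.
var wordCount = new ConcurrentDictionary<string, int>();

Parallel.ForEach(File.ReadLines(path), line =>
{
    foreach (var word in line.Split(separators, StringSplitOptions.RemoveEmptyEntries))
    {
        // AddOrUpdate is itself thread-safe per key, but contention on the most
        // frequent words is exactly where the time goes.
        wordCount.AddOrUpdate(word, 1, (_, count) => count + 1);
    }
});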

I am sure there is a faster way to achieve this! Is there a better data structure to use for this problem?

Any suggestions/criticisms of my solution are welcome - I'm trying to learn and improve here!

Cheers.

UPDATE: Here is the link to the test file I'm using.

Recommended answer

The best short answer I can give is to measure, measure, measure. Stopwatch is nice for getting a feel for where time is spent, but eventually you'll end up sprinkling large swaths of your code with it, or you will have to find a better tool for this purpose. I would suggest getting a dedicated profiler tool; there are many available for C# and .NET.

I've managed to shave off about 43% of the total runtime in three steps.

First I measured your code with a profiler. The results seem to indicate that there are two hotspots here that we can try to combat:

  1. The string splitting (SplitInternal)
  2. Dictionary maintenance (FindEntry, Insert, get_Item)

The last piece of the time is spent reading the file, and I really doubt we can gain much by changing that part of the code. One other answer here mentions using specific buffer sizes; I tried this and could not measure any difference.
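(For reference, the buffer-size variant simply means passing explicit buffer sizes to the stream and the reader; the 64 KB values below are the ones that appear in the final code at the end of this answer.)

// Explicit 64 KB buffers for both the FileStream and the StreamReader.
using (var fileStream = new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.None, 65536))
using (var streamReader = new StreamReader(fileStream, Encoding.Default, false, 65536))
{
    // ... same line-reading loop as before ...
}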

The first, string splitting, is somewhat easy but involves rewriting a very simple call to string.Split into a bit more code. I rewrote the loop that processes one line to this:

while ((line = streamReader.ReadLine()) != null)
{
    int lastPos = 0;
    for (int index = 0; index <= line.Length; index++)
    {
        if (index == line.Length || line[index] == ' ')
        {
            if (lastPos < index)
            {
                string word = line.Substring(lastPos, index - lastPos);
                // process word here
            }
            lastPos = index + 1;
        }
    }
}

I then rewrote the processing of one word to this:

int currentCount;
wordCount.TryGetValue(word, out currentCount);
wordCount[word] = currentCount + 1;

This relies on the following facts:

  1. TryGetValue is cheaper than checking whether the word exists and then retrieving its current count
  2. If TryGetValue fails to get the value (the key does not exist), it initializes the currentCount variable to its default value, which is 0. This means we don't really need to check whether the word actually existed.
  3. We can add new words to the dictionary through the indexer (it will either overwrite the existing value or add a new key+value pair to the dictionary)

The final loop looks like this:

while ((line = streamReader.ReadLine()) != null)
{
    int lastPos = 0;
    for (int index = 0; index <= line.Length; index++)
    {
        if (index == line.Length || line[index] == ' ')
        {
            if (lastPos < index)
            {
                string word = line.Substring(lastPos, index - lastPos);
                int currentCount;
                wordCount.TryGetValue(word, out currentCount);
                wordCount[word] = currentCount + 1;
            }
            lastPos = index + 1;
        }
    }
}

A new measurement then shows the following details:

  1. We went from 6876ms to 5013ms
  2. We lost the time spent in SplitInternal, FindEntry and get_Item
  3. We gained time spent in TryGetValue and Substring

As you can see, we lost more time in the old hotspots than we gained in the new ones, which resulted in a net improvement.

However, we can do better. We're doing two dictionary lookups here, each of which involves calculating the hash code of the word and comparing it to keys in the dictionary. The first lookup is part of TryGetValue and the second is part of wordCount[word] = ....

We can remove the second dictionary lookup by creating a smarter data structure inside the dictionary, at the cost of more heap memory used.

We can use Xanatos' trick of storing the count inside an object so that we can remove the second dictionary lookup:

public class WordCount
{
    public int Count;
}

...

var wordCount = new Dictionary<string, WordCount>();

...

string word = line.Substring(lastPos, index - lastPos);
WordCount currentCount;
if (!wordCount.TryGetValue(word, out currentCount))
    wordCount[word] = currentCount = new WordCount();
currentCount.Count++;

This retrieves the count from the dictionary only once; adding one extra occurrence does not involve the dictionary at all. The method's result also changes to return this WordCount type as part of the dictionary, instead of just an int.

Net result: ~43% savings.

The final code:

public class WordCount
{
    public int Count;
}

public static IDictionary<string, WordCount> Parse(string path)
{
    var wordCount = new Dictionary<string, WordCount>();

    using (var fileStream = new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.None, 65536))
    using (var streamReader = new StreamReader(fileStream, Encoding.Default, false, 65536))
    {
        string line;
        while ((line = streamReader.ReadLine()) != null)
        {
            int lastPos = 0;
            for (int index = 0; index <= line.Length; index++)
            {
                if (index == line.Length || line[index] == ' ')
                {
                    if (lastPos < index)
                    {
                        string word = line.Substring(lastPos, index - lastPos);
                        WordCount currentCount;
                        if (!wordCount.TryGetValue(word, out currentCount))
                            wordCount[word] = currentCount = new WordCount();
                        currentCount.Count++;
                    }
                    lastPos = index + 1;
                }
            }
        }
    }

    return wordCount;
}
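Since the return type is now IDictionary<string, WordCount> rather than IDictionary<string, int>, callers read counts through the Count field. A hypothetical usage sketch (the file path and the top-ten report are illustrative only; requires System.Linq):

var counts = Parse("bigfile.txt");                      // placeholder path

long total = counts.Values.Sum(wc => wc.Count);         // total word count
Console.WriteLine("Distinct: " + counts.Count + ", total: " + total);

// Ten most frequent words, purely as an example of consuming the result.
foreach (var pair in counts.OrderByDescending(p => p.Value.Count).Take(10))
{
    Console.WriteLine(pair.Key + ": " + pair.Value.Count);
}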

