The most efficient way to find character frequencies in a very, very long string


Question

I have to find a highly optimized way to count the frequency of each character in a very, very long file containing words, using C/C++ (case is ignored: lower-case and upper-case letters should be counted together). I already know one approach, shown below. Here I am reading input from the user at the terminal, but in my real case I will be reading from a file, so please don't dwell on the input function (gets()); please focus on my main objective, which is to find a more optimized way than this, if one is possible:

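#include <ctype.h>
#include <stdio.h>

int main(void)
{
   char string[100];
   int c = 0, count[26] = {0};

   printf("Enter a string\n");
   /* fgets stands in for the original gets(), which is unsafe
      and was removed from the language in C11 */
   if (fgets(string, sizeof string, stdin) == NULL)
      return 1;

   while (string[c] != '\0')
   {
      /** Fold to lower case so 'A' and 'a' share a counter;
          consider 'a' to 'z' only and ignore everything else */
      int ch = tolower((unsigned char)string[c]);

      if (ch >= 'a' && ch <= 'z')
         count[ch - 'a']++;

      c++;
   }

   for (c = 0; c < 26; c++)
   {
      /** Printing only those characters
          whose count is at least 1 */

      if (count[c] != 0)
         printf("%c occurs %d times in the entered string.\n", c + 'a', count[c]);
   }

   return 0;
}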

But I want to optimize it further, because it has to work on a very, very long file containing a lot of words. Could someone please give me any suggestions or ideas? Thanks.

Recommended answer

The asymptotic complexity cannot get any better (every character has to be looked at once, so O(n) is the floor), and in general the algorithm is already close to the bare minimum.

The single most important change you can make is to call IO functions less frequently (and you are not really going to call gets anyway): use fread and read into a big buffer (say, 4 KB). Larger sizes are usually not beneficial.
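As a rough illustration of the fread suggestion, here is a minimal sketch in C. The function name count_letters_stream and the CHUNK_SIZE macro are made up for this example and do not come from the original post; the 4 KB size is simply the figure mentioned above. It assumes an ASCII-compatible character set and the default "C" locale for tolower.

```c
#include <ctype.h>
#include <stdio.h>

#define CHUNK_SIZE 4096  /* hypothetical name; the "say, 4 KB" buffer from the answer */

/* Adds the letter frequencies of everything readable from fp to count
   (26 slots, 'a'..'z'), treating upper and lower case as the same letter.
   Returns 0 on success, -1 if a read error occurred.
   count_letters_stream is an illustrative name, not part of any library. */
int count_letters_stream(FILE *fp, unsigned long long count[26])
{
    unsigned char buf[CHUNK_SIZE];
    size_t n;

    /* One fread call per chunk instead of one IO call per character or line */
    while ((n = fread(buf, 1, sizeof buf, fp)) > 0) {
        for (size_t i = 0; i < n; i++) {
            int ch = tolower(buf[i]);  /* buf is unsigned char, so this is safe */
            if (ch >= 'a' && ch <= 'z')
                count[ch - 'a']++;
        }
    }
    return ferror(fp) ? -1 : 0;
}
```

To use it, open the file with fopen(path, "rb"), pass a zero-initialized array of 26 counters, and print the non-zero entries afterwards exactly as in the question's second loop. The counters are unsigned long long here because an int could overflow on a sufficiently large file.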

Depending on the CPU and cache, if you already had the whole string in memory, it might gain you something to make count 256 elements long and drop the if that tests for alphabetical characters (trading one less branch for the predictor to handle against a larger cache footprint). But I doubt the difference would even be measurable: your code should by now be completely IO-bound, with the CPU time needed for processing negligible compared to the wait for the disk reads.
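The 256-element variant could look roughly like the sketch below. The helper names count_bytes and fold_letters are hypothetical, chosen for this example only, and the folding step assumes an ASCII-compatible character set in which 'a'..'z' and 'A'..'Z' are contiguous. Whether it actually runs faster than the branching version is, as noted above, doubtful and would need measuring.

```c
#include <stddef.h>

/* Counts every byte value in data with no test inside the loop.
   table must have 256 slots. count_bytes is an illustrative name. */
void count_bytes(const unsigned char *data, size_t len,
                 unsigned long long table[256])
{
    for (size_t i = 0; i < len; i++)
        table[data[i]]++;
}

/* Folds the 256-entry table into 26 case-insensitive letter counts.
   This runs once per input, so its cost does not depend on input size.
   Assumes ASCII-style contiguous letters. */
void fold_letters(const unsigned long long table[256],
                  unsigned long long letters[26])
{
    for (int i = 0; i < 26; i++)
        letters[i] = table['a' + i] + table['A' + i];
}
```

The case-insensitivity check moves out of the hot loop entirely: the loop only increments, and the upper-case and lower-case totals are added together once at the end.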
