Is `ls -f | grep -c .` the fastest way to count files in directory, when using POSIX / Unix system (Big Data)?


Question


I used to do ls path-to-whatever | wc -l, until I discovered that it actually consumes a huge amount of memory. Then I moved to find path-to-whatever -name "*" | wc -l, which seems to consume a much more modest amount of memory, regardless of how many files you have.

Then I learned that ls is mostly slow and less memory-efficient because it sorts the results. By using ls -f | grep -c ., one gets very fast results; the only problem is filenames that might contain newlines. However, that is a very minor problem for most use cases.
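For the newline-in-filename caveat mentioned above, one newline-safe alternative (a sketch assuming GNU find, whose -printf action is a GNU extension, not POSIX) is to print a single byte per file and count bytes instead of lines:

```shell
# Count regular files in the current directory, one byte per file.
# Robust against newlines in filenames; -printf is a GNU find extension.
find . -maxdepth 1 -type f -printf x | wc -c
```

Because each file contributes exactly one byte regardless of what its name contains, the byte count equals the file count.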

Is this the fastest way to count files?

EDIT / Possible Answer: It seems that when it comes to Big Data, some versions of ls, find, etc. have been reported to hang with >8 million files (this needs to be confirmed, though). To succeed with very large file counts (my guess is >2.2 billion), one should use the getdents64 system call instead of getdents, which can be done in most programming languages that support POSIX standards. Some filesystems might offer faster non-POSIX methods for counting files.
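A minimal sketch of calling getdents64 directly, assuming Linux (the raw syscall and record layout are Linux-specific, not portable POSIX, and the function name here is mine): glibc does not expose the kernel's record structure, so it has to be declared by hand.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Layout of the records the kernel writes into the buffer
 * (matches Linux's struct linux_dirent64). */
struct linux_dirent64 {
    uint64_t       d_ino;
    int64_t        d_off;
    unsigned short d_reclen;
    unsigned char  d_type;
    char           d_name[];
};

/* Count all directory entries (including "." and "..") with the raw
 * getdents64 system call; returns -1 on error. */
long long count_entries_getdents64(const char *path) {
    int fd = open(path, O_RDONLY | O_DIRECTORY);
    if (fd == -1)
        return -1;

    /* A large, aligned buffer means fewer syscalls per directory. */
    char buf[65536] __attribute__((aligned(8)));
    long long count = 0;
    long nread;

    while ((nread = syscall(SYS_getdents64, fd, buf, sizeof buf)) > 0) {
        for (long pos = 0; pos < nread; ) {
            struct linux_dirent64 *d = (struct linux_dirent64 *)(buf + pos);
            count++;
            pos += d->d_reclen;     /* advance to the next record */
        }
    }
    close(fd);
    return nread == 0 ? count : -1;  /* -1 if the final read failed */
}
```

To count only regular files, check d->d_type == DT_REG inside the loop, with the same filesystem caveats as for readdir's d_type.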

Solution

One way would be to use readdir and count the entries (in one directory). Below I'm counting regular files using d_type == DT_REG, which is available only on some OSs and filesystems (see man readdir, NOTES), but you could just comment out that line and count all the dir entries:

#include <stdio.h>
#include <dirent.h>

int main (int argc, char *argv[]) {

  struct dirent *entry;
  DIR *dirp;

  long long c = 0;                        // 64-bit counter (must be initialized)

  if(argc<=1)                             // require dir
    return 1;

  dirp = opendir (argv[1]);

  if (dirp == NULL) {                     // dir not found
    return 2;
  }

  while ((entry = readdir(dirp)) != NULL) {
    if(entry->d_type==DT_REG)
      c++;
      // printf ("%s\n", entry->d_name);  // for outputing filenames
  }
  printf ("%lli\n", c);

  closedir (dirp);
  return 0;
}

Compile and run:

$ gcc code.c
$ ./a.out ~
254

(I need to clean my home dir :)

Edit:

I touched 1000000 files into a dir and ran a quick comparison (best user+sys of 5 runs presented):

$ time ls -f | grep -c .
1000005

real    0m1.771s
user    0m0.656s
sys     0m1.244s

$ time ls -f | wc -l
1000005

real    0m1.733s
user    0m0.520s
sys     0m1.248s

$ time ../a.out  .
1000003

real    0m0.474s
user    0m0.048s
sys     0m0.424s

Edit 2:

As requested in comments:

$ time ./a.out testdir | wc -l
1000004

real    0m0.567s
user    0m0.124s
sys     0m0.468s
