How to quickly find added / removed files?

Question

I am writing a little program that creates an index of all files in my directories. It basically iterates over each file on the disk and stores it in a searchable database, much like Unix's locate. The problem is that index generation is quite slow, since I have about a million files.

Once I have generated an index, is there a quick way to find out which files have been added or removed on the disk since the last run?

EDIT: I do not want to monitor file system events. I think the risk of getting out of sync is too high; I would much prefer something like a quick re-scan that finds where files have been added or removed. Maybe using the directory's last-modified date or something?

A Little Benchmark

I just made a little benchmark. Running

dir /b /s M:\tests\  >c:\out.txt

takes 0.9 seconds and gives me all the information I need. When I use a Java implementation (much like this), it takes about 4.5 seconds. Any ideas how to improve at least this brute force approach?
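For reference, a brute-force scan in Java might look roughly like the sketch below. It is only an illustration and not the implementation linked above: the class name BruteForceScan and the methods listFiles and added are made up for this example. It walks the tree with Files.walkFileTree, collects every non-directory entry, and diffs two listings to find added paths.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical helper for illustration; not taken from the original post. */
final class BruteForceScan {
    /** Walks the whole tree under root and returns the path of every non-directory entry. */
    static Set<String> listFiles(Path root) {
        Set<String> found = new TreeSet<>();
        try {
            Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
                @Override
                public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) {
                    found.add(file.toString());
                    return FileVisitResult.CONTINUE;
                }

                @Override
                public FileVisitResult visitFileFailed(Path file, IOException e) {
                    return FileVisitResult.CONTINUE; // skip entries that cannot be read
                }
            });
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return found;
    }

    /** Paths present in 'now' but not in 'before'; swap the arguments to get removed paths. */
    static Set<String> added(Set<String> before, Set<String> now) {
        Set<String> diff = new TreeSet<>(now);
        diff.removeAll(before);
        return diff;
    }
}
```

Calling added(previous, current) yields the new files and added(current, previous) the removed ones, but every run still has to enumerate the entire tree.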

Related posts: How to see if a subfile of a directory has changed

Solution

I've done this in my tool MetaMake. Here is the recipe:

  1. If the index is empty, add the root directory to the index with a timestamp == dir.lastModified()-1.
  2. Find all directories in the index.
  3. Compare the timestamp of each directory in the index with the one from the filesystem. This is a fast operation since you have the full path (no scanning of all files/dirs in the tree involved).
  4. If the timestamp has changed, something changed in this directory. Rescan it and update the index.
  5. If the rescan in step 4 shows that a directory is missing, delete its subtree from the index.
  6. If the rescan encounters a directory that is already in the index, ignore it (it will be checked in step 2).
  7. If the rescan encounters a new directory, add it with timestamp == dir.lastModified()-1. Make sure it gets considered in step 2.
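The recipe above could be sketched in Java roughly as follows. This is a minimal in-memory illustration, not MetaMake's actual code: the names DirIndex, Entry, update, rescan and removeSubtree are invented here, a real tool would persist the index between runs, and symbolic links are not treated specially. The "-1" presumably exists so that a freshly added directory looks changed and gets scanned on its first check.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/** Hypothetical in-memory index; a real tool would persist it between runs. */
final class DirIndex {
    /** What the index remembers about one directory. */
    static final class Entry {
        long timestamp;                              // lastModified() seen at the last scan
        final Set<String> files = new HashSet<>();   // names of non-directory children
        final Set<String> subdirs = new HashSet<>(); // names of child directories
        Entry(long timestamp) { this.timestamp = timestamp; }
    }

    final Map<String, Entry> dirs = new TreeMap<>(); // directory path -> entry

    /** One update pass; returns "+ path" for added and "- path" for removed files. */
    List<String> update(File root) {
        List<String> changes = new ArrayList<>();
        if (dirs.isEmpty()) {                        // step 1: seed with a stale timestamp
            dirs.put(root.getPath(), new Entry(root.lastModified() - 1));
        }
        List<String> work = new ArrayList<>(dirs.keySet()); // step 2: all known directories
        for (int i = 0; i < work.size(); i++) {      // the list grows when step 7 adds entries
            String path = work.get(i);
            Entry entry = dirs.get(path);
            if (entry == null) continue;             // already dropped with a removed parent
            File dir = new File(path);
            long current = dir.lastModified();       // step 3: one lookup, no enumeration
            if (current == entry.timestamp) continue;
            rescan(dir, entry, work, changes);       // step 4
            entry.timestamp = current;
        }
        return changes;
    }

    private void rescan(File dir, Entry entry, List<String> work, List<String> changes) {
        Set<String> seenFiles = new HashSet<>();
        Set<String> seenDirs = new HashSet<>();
        File[] children = dir.listFiles();           // null if the directory cannot be listed
        if (children != null) {
            for (File child : children) {
                (child.isDirectory() ? seenDirs : seenFiles).add(child.getName());
            }
        }
        for (String name : entry.files) {
            if (!seenFiles.contains(name)) changes.add("- " + new File(dir, name).getPath());
        }
        for (String name : seenFiles) {
            if (!entry.files.contains(name)) changes.add("+ " + new File(dir, name).getPath());
        }
        for (String name : entry.subdirs) {          // step 5: missing directory, drop its subtree
            if (!seenDirs.contains(name)) removeSubtree(new File(dir, name).getPath(), changes);
        }
        for (String name : seenDirs) {
            if (entry.subdirs.contains(name)) continue; // step 6: known directory, leave it alone
            File sub = new File(dir, name);          // step 7: new directory, stale timestamp
            dirs.put(sub.getPath(), new Entry(sub.lastModified() - 1));
            work.add(sub.getPath());
        }
        entry.files.clear();
        entry.files.addAll(seenFiles);
        entry.subdirs.clear();
        entry.subdirs.addAll(seenDirs);
    }

    private void removeSubtree(String path, List<String> changes) {
        Entry gone = dirs.remove(path);
        if (gone == null) return;
        for (String name : gone.files) changes.add("- " + new File(path, name).getPath());
        for (String name : gone.subdirs) removeSubtree(new File(path, name).getPath(), changes);
    }
}
```

A first call such as new DirIndex().update(new File("M:\\tests")) reports every file as added; later calls on the same DirIndex only list directories whose timestamp differs from the recorded one. The timestamp is read before the directory is listed, so a change that happens during the listing should at worst cause one extra rescan on the next run.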

This will allow you to notice new and deleted files efficiently. Since you only check known paths in step 2, this is fast: file systems are slow at enumerating all the entries in a directory, but they are fast when you know the exact name.

Drawback: You will not notice changed files. If you edit a file, this will not be reflected in the directory's timestamp. If you need this information too, you will have to repeat the algorithm above for the file nodes in your index. This time, you can ignore new/deleted files because they have already been handled during the run over the directories.
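A sketch of that second pass, under the assumption that the index also stores a timestamp per file (FileStamps, track and findModified are hypothetical names, not part of MetaMake):

```java
import java.io.File;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical per-file timestamp table kept alongside the directory index. */
final class FileStamps {
    final Map<String, Long> stamps = new HashMap<>(); // file path -> recorded lastModified()

    /** Starts tracking a file that the directory pass reported as new. */
    void track(File file) {
        stamps.put(file.getPath(), file.lastModified());
    }

    /** Returns indexed files whose timestamp differs from the recorded one and records the new value. */
    List<String> findModified() {
        List<String> modified = new ArrayList<>();
        for (Map.Entry<String, Long> e : stamps.entrySet()) {
            long current = new File(e.getKey()).lastModified();
            if (current == 0L) continue;      // missing file: left to the directory pass
            if (current != e.getValue()) {
                modified.add(e.getKey());
                e.setValue(current);          // remember the new timestamp
            }
        }
        return modified;
    }
}
```

Each check is a single lookup by exact path, so this pass avoids enumerating directories, but it still touches every indexed file once per run.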

[EDIT] Zach mentioned that timestamps are not enough. My reply is: There simply is no other way to do this. The notion of "size" is completely undefined for directories and changes from implementation to implementation. There is no API where you can register "I want to be notified of any change being made to something in the file system". There are APIs which work while your application is alive but if it stops or misses an event, then you're out of sync.

If the file system is remote, things get worse because all kinds of network problems can cause you to get out of sync. So while my solution might not be 100% perfect and watertight, it will work for all but the most contrived edge cases. And it's the only solution which even gets this far.

Now there is a single kind of application which would want to preserve the timestamp of a directory after making a modification: a virus or worm. This will clearly break my algorithm, but then it's not meant to protect against a virus infection. If you want to protect against this, you must use a completely different approach.

The only other way to achieve what Zach wants is to build a new filesystem which logs this information permanently somewhere, sell it to Microsoft and wait a few years (probably 10 or more) until everyone uses it.
