Checking for corrupted files in a directory with hundreds of thousands of images gradually slows down
Question
So I have 600,000+ images. My estimate is that roughly 5-10% of these are corrupted. I'm generating a log of exactly which images this pertains to.
Using Python, my approach thus far is this:
from PIL import Image

def img_validator(source):
    # get_paths is the asker's own helper (not shown here)
    files = get_paths(source)  # A list of complete paths to each image
    invalid_files = []
    for img in files:
        try:
            im = Image.open(img)
            im.verify()
            im.close()
        except (IOError, OSError, Image.DecompressionBombError):
            invalid_files.append(img)
    # Write invalid_files to file
The first 200-250K images were quite fast to process, taking only around 1-2 hours. I left the process running overnight (it was at 230K at the time); 8 hours later it was only at 310K, though still progressing.
Anyone got an idea of why that is? At first I thought it might be due to the images being stored on an HDD, but that doesn't really make sense seeing as it was very fast the first 200-250k.
Recommended answer
If you have that many images, I would suggest you use multiprocessing. I created 100,000 files of which 5% were corrupt and checked them like this:
#!/usr/bin/env python3

import glob
from multiprocessing import Pool
from PIL import Image

def CheckOne(f):
    try:
        im = Image.open(f)
        im.verify()
        im.close()
        # DEBUG: print(f"OK: {f}")
        return
    except (IOError, OSError, Image.DecompressionBombError):
        # DEBUG: print(f"Fail: {f}")
        return f

if __name__ == '__main__':
    # Create a pool of processes to check files
    p = Pool()

    # Create a list of files to process
    files = [f for f in glob.glob("*.jpg")]

    print(f"Files to be checked: {len(files)}")

    # Map the list of files to check onto the Pool
    result = p.map(CheckOne, files)

    # Filter out None values representing files that are ok, leaving just corrupt ones
    result = list(filter(None, result))
    print(f"Num corrupt files: {len(result)}")
Sample output
Files to be checked: 100002
Num corrupt files: 5001
That takes 1.6 seconds on my 12-core CPU with an NVMe disk. Your hardware may differ, but it should still be noticeably faster for you.
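One possible refinement, which is not part of the original answer: `p.map` only returns once every file has been checked, so with 600,000+ files it can be hard to see progress or keep partial results. The sketch below uses `Pool.imap_unordered` with a `chunksize` and writes each corrupt path to a log as soon as it is found. The names `check_one` and `find_corrupt`, the log file name `corrupt_files.log`, and the `chunksize` and `report_every` values are illustrative assumptions and have not been tuned.

```python
import glob
from multiprocessing import Pool

from PIL import Image


def check_one(path):
    """Return the path if Pillow cannot verify the file, otherwise None."""
    try:
        # The context manager closes the file even when verify() raises
        with Image.open(path) as im:
            im.verify()
        return None
    except (OSError, Image.DecompressionBombError):
        return path


def find_corrupt(paths, log_path, chunksize=256, report_every=10_000):
    """Check paths in parallel; write each corrupt path to log_path as found."""
    corrupt = 0
    with Pool() as pool, open(log_path, "w") as log:
        # imap_unordered yields results as workers finish, in arbitrary order
        results = pool.imap_unordered(check_one, paths, chunksize=chunksize)
        for done, bad in enumerate(results, start=1):
            if bad is not None:
                log.write(bad + "\n")
                corrupt += 1
            if done % report_every == 0:
                print(f"Checked {done}, corrupt so far: {corrupt}")
    return corrupt


if __name__ == "__main__":
    files = glob.glob("*.jpg")
    print(f"Files to be checked: {len(files)}")
    # 'corrupt_files.log' is a hypothetical output name
    print(f"Num corrupt files: {find_corrupt(files, 'corrupt_files.log')}")
```

Because results arrive out of order, the log is not sorted by filename; sort it afterwards if order matters. The periodic progress line also makes it easier to see at which point throughput starts to drop.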