Python: slow read & write for millions of small files


Problem description


Conclusion: It seems that HDF5 is the way to go for my purposes. Basically, "HDF5 is a data model, library, and file format for storing and managing data" and it is designed to handle incredible amounts of data. It has a Python module called python-tables. (The link is in the answer below.)

HDF5 does the job 1000% for saving tons and tons of data. Reading/modifying the data from 200 million rows is a pain, though, so that's the next problem to tackle.
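The question doesn't show how those 200 million rows would actually be queried; as a rough sketch only, PyTables column indexes plus in-kernel where() queries are the usual way to keep single-row lookups fast at that scale. The file name, table name, and column names below are assumptions chosen to match the sketch in the answer further down, not anything from the question:

import tables

# Assumed layout: an existing table /scores with Int64 columns `file_id` and
# `score` (see the sketch in the answer below); all names are hypothetical.
h5file = tables.open_file("scores.h5", mode="a")
table = h5file.root.scores

# A one-time index on the lookup column keeps queries fast even at
# hundreds of millions of rows.
if not table.cols.file_id.is_indexed:
    table.cols.file_id.create_index()

# In-kernel query: only the matching rows are materialized in Python.
hits = table.read_where("file_id == 123456")   # NumPy structured array
print(hits["score"])

h5file.close()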

I am building a directory tree that has tons of subdirectories and files. There are about 10 million files spread around a hundred thousand directories. Each file is under 32 subdirectories.

I have a Python script that builds this filesystem and reads & writes those files. The problem is that once I reach more than a million files, the read and write methods become extremely slow.

Here's the function I have that reads the contents of a file (the file contains an integer string), adds a certain number to it, then writes it back to the original file.

import shutil

def addInFile(path, scoreToAdd):
    """Read the integer stored in `path`, add `scoreToAdd`, and write it back."""
    num = scoreToAdd
    try:
        # Work on a temporary copy; the author found this faster than
        # modifying the file in place.
        shutil.copyfile(path, '/tmp/tmp.txt')
        fp = open('/tmp/tmp.txt', 'r')
        num += int(fp.readlines()[0])
        fp.close()
    except (IOError, ValueError, IndexError):
        # Missing, unreadable, or empty file: start from scoreToAdd alone.
        pass
    fp = open('/tmp/tmp.txt', 'w')
    fp.write(str(num))
    fp.close()
    shutil.copyfile('/tmp/tmp.txt', path)
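For reference, a driver loop like the hypothetical one below (the path layout is invented for illustration; the question doesn't show the real one) is the kind of workload being timed later in the question:

import os

# Hypothetical driver loop: call addInFile() on 1000 small counter files.
for i in range(1000):
    d = os.path.join("tree", "%02d" % (i % 100))
    os.makedirs(d, exist_ok=True)                        # make sure the subdirectory exists
    addInFile(os.path.join(d, "item_%d.txt" % i), 1)     # add 1 to each file's counter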

• Relational databases seem too slow for accessing this data, so I opted for a filesystem approach.
• I previously tried using Linux console commands for these operations, but they were way slower.
• I copy the file to a temporary file first, then access/modify it, then copy it back, because I found this faster than accessing the file directly.
• Putting all the files into one directory (in reiserfs format) caused too much slowdown when accessing the files.
I think the cause of the slowdown is the sheer number of files. Performing this function 1000 times used to clock in at less than a second, but now it's reaching a minute.

How do you suggest I fix this? Should I change my directory tree structure?

All I need is fast access to each file in this huge pool of files.

Recommended answer

I know this isn't a direct answer to your question, but it is a direct solution to your problem.

You need to look into using something like HDF5. It is designed for exactly this type of hierarchical data with millions of individual data points.

You are REALLY in luck because there are awesome Python bindings for HDF5 called pytables. I have used it in a very similar way and had tremendous success.
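As a rough illustration of the idea (not the answerer's actual code), here is a minimal PyTables sketch that replaces the millions of one-integer files with a single HDF5 table; the file name, table name, and columns are assumptions chosen for this example:

import tables

# Hypothetical schema: one row per former "small file".
class Score(tables.IsDescription):
    file_id = tables.Int64Col()   # stands in for the old file path
    score = tables.Int64Col()     # the integer that used to live inside the file

h5file = tables.open_file("scores.h5", mode="w", title="Score store")
table = h5file.create_table("/", "scores", Score, "per-item scores")

# Bulk-load some initial rows.
row = table.row
for i in range(1_000_000):
    row["file_id"] = i
    row["score"] = 0
    row.append()
table.flush()

# The equivalent of addInFile(path, 7) for item 42:
for r in table.where("file_id == 42"):
    r["score"] = r["score"] + 7
    r.update()        # modify the row in place during iteration
table.flush()

h5file.close()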

