Python: slow read & write for millions of small files


Problem description


Conclusion: It seems that HDF5 is the way to go for my purposes. Basically, "HDF5 is a data model, library, and file format for storing and managing data" and it is designed to handle incredible amounts of data. It has a Python module called python-tables. (The link is in the answer below.)

HDF5 does the job 1000% for saving tons and tons of data. Reading/modifying the data from 200 million rows is a pain, though, so that's the next problem to tackle.
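The question doesn't show how those 200 million rows would actually be queried; as a rough sketch only, PyTables column indexes plus in-kernel where() queries are the usual way to keep single-row lookups fast at that scale. The file name, table name, and column names below are assumptions chosen to match the sketch in the answer further down, not anything from the question:

import tables

# Assumed layout: an existing table /scores with Int64 columns `file_id` and
# `score` (see the sketch in the answer below); all names are hypothetical.
h5file = tables.open_file("scores.h5", mode="a")
table = h5file.root.scores

# A one-time index on the lookup column keeps queries fast even at
# hundreds of millions of rows.
if not table.cols.file_id.is_indexed:
    table.cols.file_id.create_index()

# In-kernel query: only the matching rows are materialized in Python.
hits = table.read_where("file_id == 123456")   # NumPy structured array
print(hits["score"])

h5file.close()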

I am building a directory tree that has tons of subdirectories and files. There are about 10 million files spread around a hundred thousand directories. Each file is under 32 subdirectories.

I have a Python script that builds this filesystem and reads & writes those files. The problem is that once I reach more than a million files, the read and write methods become extremely slow.

Here's the function I have that reads the contents of a file (the file contains an integer string), adds a certain number to it, then writes it back to the original file.

import shutil

def addInFile(path, scoreToAdd):
    """Read the integer stored in `path`, add `scoreToAdd`, and write it back."""
    num = scoreToAdd
    try:
        # Work on a temporary copy; the author found this faster than
        # modifying the file in place.
        shutil.copyfile(path, '/tmp/tmp.txt')
        fp = open('/tmp/tmp.txt', 'r')
        num += int(fp.readlines()[0])
        fp.close()
    except (IOError, ValueError, IndexError):
        # Missing, unreadable, or empty file: start from scoreToAdd alone.
        pass
    fp = open('/tmp/tmp.txt', 'w')
    fp.write(str(num))
    fp.close()
    shutil.copyfile('/tmp/tmp.txt', path)
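For reference, a driver loop like the hypothetical one below (the path layout is invented for illustration; the question doesn't show the real one) is the kind of workload being timed later in the question:

import os

# Hypothetical driver loop: call addInFile() on 1000 small counter files.
for i in range(1000):
    d = os.path.join("tree", "%02d" % (i % 100))
    os.makedirs(d, exist_ok=True)                        # make sure the subdirectory exists
    addInFile(os.path.join(d, "item_%d.txt" % i), 1)     # add 1 to each file's counter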

• Relational databases seem too slow for accessing this data, so I opted for a filesystem approach.
• I previously tried using Linux console commands for these operations, but they were way slower.
• I copy the file to a temporary file first, then access/modify it, then copy it back, because I found this faster than accessing the file directly.
• Putting all the files into one directory (in reiserfs format) caused too much slowdown when accessing the files.
I think the cause of the slowdown is the sheer number of files. Performing this function 1000 times used to clock in at less than a second, but now it's reaching a minute.

How do you suggest I fix this? Should I change my directory tree structure?

All I need is fast access to each file in this huge pool of files.

Recommended answer

I know this isn't a direct answer to your question, but it is a direct solution to your problem.

You need to look into using something like HDF5. It is designed for exactly this type of hierarchical data with millions of individual data points.

You are REALLY in luck because there are awesome Python bindings for HDF5 called pytables. I have used it in a very similar way and had tremendous success.
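As a rough illustration of the idea (not the answerer's actual code), here is a minimal PyTables sketch that replaces the millions of one-integer files with a single HDF5 table; the file name, table name, and columns are assumptions chosen for this example:

import tables

# Hypothetical schema: one row per former "small file".
class Score(tables.IsDescription):
    file_id = tables.Int64Col()   # stands in for the old file path
    score = tables.Int64Col()     # the integer that used to live inside the file

h5file = tables.open_file("scores.h5", mode="w", title="Score store")
table = h5file.create_table("/", "scores", Score, "per-item scores")

# Bulk-load some initial rows.
row = table.row
for i in range(1_000_000):
    row["file_id"] = i
    row["score"] = 0
    row.append()
table.flush()

# The equivalent of addInFile(path, 7) for item 42:
for r in table.where("file_id == 42"):
    r["score"] = r["score"] + 7
    r.update()        # modify the row in place during iteration
table.flush()

h5file.close()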

