Tips for working with a large quantity of .txt files (and overall large size) - Python?


Problem description

I'm working on a script to parse txt files and store them into a pandas dataframe that I can export to a CSV.

My script worked fine when I was using <100 of my files, but now that I'm trying to run it on the full sample, I'm running into a lot of issues.

I'm dealing with ~8,000 .txt files with an average size of 300 KB, so about 2.5 GB in total.

I was wondering if I could get tips on how to make my code more efficient.

For opening and reading the files, I use:

import os

filenames = os.listdir('.')
file_contents = {}  # renamed so the built-in dict type is not shadowed
for file in filenames:
    with open(file) as f:
        contents = f.read()
        file_contents[file.replace(".txt", "")] = contents

Doing print(file_contents) crashes my Python session (or at least it seems to). Is there a better way to handle this?
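
For example, is checking just the size and a small sample, along the lines of the sketch below, the usual way to sanity-check it? (This assumes the file_contents dict built above.)

print(len(file_contents), "files loaded")
for name in list(file_contents)[:3]:
    print(name, "->", file_contents[name][:100])  # first 100 characters of each file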

Additionally, I convert all the values in my dict to lowercase, using:

def lower_dict(d):
    # build a new dict with every value lowercased
    lcase_dict = {k: v.lower() for k, v in d.items()}
    return lcase_dict

lower = lower_dict(file_contents)

I haven't tried this yet (I can't get past the opening/reading stage), but I was wondering if this would cause problems?
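
For example, would it be better to lowercase each file as it is read, so that two full copies of the text are never held in memory at once? A rough sketch of what I mean (same directory layout as above):

import os

file_contents = {}
for file in os.listdir('.'):
    with open(file) as f:
        file_contents[file.replace(".txt", "")] = f.read().lower()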

Now, before I am marked as a duplicate, I did read this: How can I read large text files in Python, line by line, without loading it into memory?

However, that user seemed to be working with one very large file of 5 GB, whereas I am working with many small files totalling 2.5 GB (and in fact my ENTIRE sample is something like 50 GB across 60,000 files). So I was wondering if my approach would need to be different. Sorry if this is a dumb question; unfortunately, I am not well versed in RAM and computer processing methods.
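
For example, would something along the lines of the sketch below make more sense: reading the files in batches and appending each batch to the CSV, so the whole 50 GB sample never sits in memory at once? (The batch size of 500, the output file name and the column names are just placeholders on my part.)

import os
import pandas as pd

out_path = "parsed.csv"      # placeholder output file name
batch, batch_size = [], 500  # 500 files per batch is an arbitrary guess
wrote_header = False
for file in os.listdir('.'):
    if not file.endswith(".txt"):
        continue
    with open(file) as f:
        batch.append({"name": file.replace(".txt", ""), "text": f.read().lower()})
    if len(batch) == batch_size:
        pd.DataFrame(batch).to_csv(out_path, mode="a", header=not wrote_header, index=False)
        wrote_header, batch = True, []
if batch:  # write out the final partial batch
    pd.DataFrame(batch).to_csv(out_path, mode="a", header=not wrote_header, index=False)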

Any help is very much appreciated.

Thanks

Recommended answer

I believe the thing slowing your code down the most is the .replace() method you are using. I believe this is because the built-in replace method is iterative, and as a result is very inefficient. Try using the re module in your for loops. Here is an example of how I used the module recently to replace the characters "T", ":" and "-" with "", which in this case removed them from the file:

import re

for line in lines:
    line = re.sub('[T:-]', '', line)  # line now holds the text with T, : and - stripped out
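
Applied to your filename handling, the same idea would look roughly like this (an untested sketch; the anchored pattern makes sure only a trailing ".txt" is stripped):

import os
import re

file_contents = {}
for file in os.listdir('.'):
    with open(file) as f:
        # strip only a trailing ".txt" from the key name
        file_contents[re.sub(r'\.txt$', '', file)] = f.read()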

Let me know if this helps!
