How can I speed up reading multiple files and putting the data into a dataframe?

Question

I have a number of text files, say 50, that I need to read into a massive dataframe. At the moment, I am using the following steps.

  1. Read each file and check what the labels are. The information I need is usually contained in the first few lines. The same labels merely repeat for the rest of the file, with different types of data listed against them each time.
  2. Create a dataframe with those labels.
  3. Read the file again and fill the dataframe with values.
  4. Concatenate that dataframe with a master dataframe.

This works pretty well for files of around 100 KB - a few minutes - but at 50 MB it takes hours and is not practical.

How can I optimise my code? In particular -

  1. How can I identify what functions are taking the most time, which I need to optimise? Is it the reading of the file? Is it the writing to the dataframe? Where is my program spending time?
  2. Should I consider multithreading or multiprocessing?
  3. Can I improve the algorithm?
    • Perhaps read the entire file in one go into a list, rather than line by line,
    • Parse data in chunks/entire file, rather than line by line,
    • Assign data to the dataframe in chunks/one go, rather than row by row.
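For question 1, the standard-library `cProfile` module answers "where is my program spending time?" directly. A minimal sketch, where `parse_all_files` is a hypothetical stand-in for the parsing loop above:

```python
import cProfile
import io
import pstats

def parse_all_files():
    """Hypothetical stand-in for the file-parsing loop being profiled."""
    return sum(i * i for i in range(100_000))

profiler = cProfile.Profile()
profiler.enable()
parse_all_files()
profiler.disable()

# report the ten functions with the highest cumulative time
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats('cumulative').print_stats(10)
print(stream.getvalue())
```

If the report is dominated by `open`/`readline` calls the bottleneck is file I/O; if it is dominated by `DataFrame.loc` lookups, the bottleneck is the row-by-row pandas indexing.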

Here is an example code. My own code is a little more complex, as the text files are more complex, such that I have to use about 10 regular expressions and multiple while loops to read the data in and allocate it to the right location in the right array. To keep the MWE simple, I haven't used repeating labels in the input files either, so it looks like I'm reading the file twice for no reason. I hope that makes sense!

import re
import pandas as pd

df = pd.DataFrame()
paths = ["../gitignore/test1.txt", "../gitignore/test2.txt"]
reg_ex = re.compile(r'^(.+) (.+)\n')  # raw string avoids invalid-escape warnings
# read all files to determine what indices are available
for path in paths:
    with open(path, 'r') as file_obj:
        print(file_obj.readlines())

['a 1\n', 'b 2\n', 'end']
['c 3\n', 'd 4\n', 'end']

indices = []
for path in paths:
    index = []
    with open(path, 'r') as file_obj:
        line = True
        while line:
            try:
                line = file_obj.readline()
                match = reg_ex.match(line)
                index.append(match.group(1))  # append, not +=, so multi-character labels stay intact
            except AttributeError:
                pass
    indices.append(index)
# read files again and put data into a master dataframe
for path, index in zip(paths, indices):
    subset_df = pd.DataFrame(index=index, columns=["Number"])
    with open(path, 'r') as file_obj:
        line = True
        while line:
            try:
                line = file_obj.readline()
                match = reg_ex.match(line)
                subset_df.loc[[match.group(1)]] = match.group(2)
            except AttributeError:
                pass
    df = pd.concat([df, subset_df]).sort_index()
print(df)

  Number
a      1
b      2
c      3
d      4

My input files:

test1.txt

a 1
b 2
end

test2.txt

c 3
d 4
end

Answer

It turns out that creating a blank DataFrame first, searching the index to find the right place for a row of data, and then updating just that one row of the DataFrame is an extremely time-expensive process.

A much faster way of doing this is to read the contents of the input file into a primitive data structure, such as a list of lists or a list of dicts, and then convert that into a DataFrame.

Use lists when all of the data that you're reading in belongs in the same columns. Otherwise, use dicts to say explicitly which column each bit of data should go to.
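As a sketch of that approach, assuming the simple `label value` format from the question's MWE (the file contents are inlined as lists here so the snippet is self-contained): collect one dict per matching line and build the DataFrame in a single call at the end, instead of updating it row by row.

```python
import re

import pandas as pd

reg_ex = re.compile(r'^(.+) (.+)$')

def parse_lines(lines):
    """Collect one dict per data line; lines without a match (e.g. 'end') are skipped."""
    rows = []
    for line in lines:
        match = reg_ex.match(line)
        if match:
            rows.append({"label": match.group(1), "Number": match.group(2)})
    return rows

# stand-ins for the contents of test1.txt and test2.txt
files = [["a 1", "b 2", "end"], ["c 3", "d 4", "end"]]
rows = []
for lines in files:
    rows.extend(parse_lines(lines))

# one DataFrame construction at the end, rather than per-row .loc updates
df = pd.DataFrame(rows).set_index("label").sort_index()
print(df)
```

This replaces the per-row `.loc` assignment and repeated `pd.concat` with a single `pd.DataFrame(rows)` call, which is where the speedup comes from.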

Update Jan 18: This is linked to How to parse complex text files using Python? I also wrote a blog article explaining how to parse complex files to beginners.
