How can I speed up reading multiple files and putting the data into a dataframe?

Question

I have a number of text files, say 50, that I need to read into a massive dataframe. At the moment, I am using the following steps.

  1. Read each file and check what the labels are. The information I need is usually contained in the first few lines. The same labels merely repeat for the rest of the file, with different types of data listed against them each time.
  2. Create a dataframe with those labels.
  3. Read the file again and fill the dataframe with values.
  4. Concatenate that dataframe with a master dataframe.

This works pretty well for files of around 100 KB - a few minutes - but at 50 MB it takes hours and is not practical.

How can I optimise my code? In particular -

  1. How can I identify what functions are taking the most time, which I need to optimise? Is it the reading of the file? Is it the writing to the dataframe? Where is my program spending time?
  2. Should I consider multithreading or multiprocessing?
  3. Can I improve the algorithm?
    • Perhaps read the entire file in one go into a list, rather than line by line,
    • Parse data in chunks/entire file, rather than line by line,
    • Assign data to the dataframe in chunks/one go, rather than row by row.
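For question 1, the standard-library `cProfile` module answers "where is my program spending time?" directly. A minimal sketch, where `parse_all_files` is a hypothetical stand-in for the parsing loop above:

```python
import cProfile
import io
import pstats

def parse_all_files():
    """Hypothetical stand-in for the file-parsing loop being profiled."""
    return sum(i * i for i in range(100_000))

profiler = cProfile.Profile()
profiler.enable()
parse_all_files()
profiler.disable()

# report the ten functions with the highest cumulative time
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats('cumulative').print_stats(10)
print(stream.getvalue())
```

If the report is dominated by `open`/`readline` calls the bottleneck is file I/O; if it is dominated by `DataFrame.loc` lookups, the bottleneck is the row-by-row pandas indexing.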

Here is an example code. My own code is a little more complex, as the text files are more complex, such that I have to use about 10 regular expressions and multiple while loops to read the data in and allocate it to the right location in the right array. To keep the MWE simple, I haven't used repeating labels in the input files either, so it looks like I'm reading the file twice for no reason. I hope that makes sense!

import re
import pandas as pd

df = pd.DataFrame()
paths = ["../gitignore/test1.txt", "../gitignore/test2.txt"]
reg_ex = re.compile(r'^(.+) (.+)\n')  # raw string avoids invalid-escape warnings
# read all files to determine what indices are available
for path in paths:
    with open(path, 'r') as file_obj:
        print(file_obj.readlines())

['a 1\n', 'b 2\n', 'end']
['c 3\n', 'd 4\n', 'end']

indices = []
for path in paths:
    index = []
    with open(path, 'r') as file_obj:
        line = True
        while line:
            try:
                line = file_obj.readline()
                match = reg_ex.match(line)
                index.append(match.group(1))  # append, not +=, so multi-character labels stay intact
            except AttributeError:
                pass
    indices.append(index)
# read files again and put data into a master dataframe
for path, index in zip(paths, indices):
    subset_df = pd.DataFrame(index=index, columns=["Number"])
    with open(path, 'r') as file_obj:
        line = True
        while line:
            try:
                line = file_obj.readline()
                match = reg_ex.match(line)
                subset_df.loc[[match.group(1)]] = match.group(2)
            except AttributeError:
                pass
    df = pd.concat([df, subset_df]).sort_index()
print(df)

  Number
a      1
b      2
c      3
d      4

My input files:

test1.txt

a 1
b 2
end

test2.txt

c 3
d 4
end

Answer

It turns out that creating a blank DataFrame first, searching the index to find the right place for a row of data, and then updating just that one row of the DataFrame is an extremely time-expensive process.

A much faster way of doing this is to read the contents of the input file into a primitive data structure, such as a list of lists or a list of dicts, and then convert that into a DataFrame.

Use lists when all of the data that you're reading in belongs in the same columns. Otherwise, use dicts to say explicitly which column each bit of data should go to.
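As a sketch of that approach, assuming the simple `label value` format from the question's MWE (the file contents are inlined as lists here so the snippet is self-contained): collect one dict per matching line and build the DataFrame in a single call at the end, instead of updating it row by row.

```python
import re

import pandas as pd

reg_ex = re.compile(r'^(.+) (.+)$')

def parse_lines(lines):
    """Collect one dict per data line; lines without a match (e.g. 'end') are skipped."""
    rows = []
    for line in lines:
        match = reg_ex.match(line)
        if match:
            rows.append({"label": match.group(1), "Number": match.group(2)})
    return rows

# stand-ins for the contents of test1.txt and test2.txt
files = [["a 1", "b 2", "end"], ["c 3", "d 4", "end"]]
rows = []
for lines in files:
    rows.extend(parse_lines(lines))

# one DataFrame construction at the end, rather than per-row .loc updates
df = pd.DataFrame(rows).set_index("label").sort_index()
print(df)
```

This replaces the per-row `.loc` assignment and repeated `pd.concat` with a single `pd.DataFrame(rows)` call, which is where the speedup comes from.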

Update Jan 18: This is linked to How to parse complex text files using Python? I also wrote a blog article explaining how to parse complex files to beginners.
