Applying the same operations on multiple .csv files in pandas


Problem Description

I have six .csv files. Their overall size is approximately 4 GB. I need to clean each one and perform some data analysis tasks on each. These operations are the same for all the frames. This is my code for reading them:

#df = pd.read_csv(r"yellow_tripdata_2018-01.csv")
#df = pd.read_csv(r"yellow_tripdata_2018-02.csv")
#df = pd.read_csv(r"yellow_tripdata_2018-03.csv")
#df = pd.read_csv(r"yellow_tripdata_2018-04.csv")
#df = pd.read_csv(r"yellow_tripdata_2018-05.csv")
df = pd.read_csv(r"yellow_tripdata_2018-06.csv")

Each time I run the kernel, I activate one of the files to be read. I am looking for a more elegant way to do this. I thought about doing a for loop: making a list of file names and then reading them one after the other. But I don't want to merge them together, so I think another approach must exist. I have been searching for it, but it seems all the questions lead to concatenating the files that were read at the end.

Solution

You could use a list to hold all of the dataframes:

import pandas as pd

number_of_files = 6
dfs = []

for file_num in range(1, number_of_files + 1):
    # I use Python 3.6, so I'm used to f-strings now. If you're using Python <3.6, use .format()
    dfs.append(pd.read_csv(f"yellow_tripdata_2018-0{file_num}.csv"))

Then to get a certain dataframe use:

df1 = dfs[0]

Edit:

As you are trying to keep from loading all of these in memory, I'd resort to streaming them. Try changing the for loop to something like this:

import csv

for file_num in range(1, number_of_files + 1):
    # open for reading ('r', not 'wb') and keep the handle open so the reader stays usable
    f = open(f"yellow_tripdata_2018-0{file_num}.csv", 'r', newline='')
    dfs.append(csv.reader(f))

Then just use a for loop over dfs[n] or next(dfs[n]) to read each line into memory.
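
For example, a minimal usage sketch (assuming the dfs list of csv readers built above):

header = next(dfs[0])      # the first call reads just the header row of file 1
first_row = next(dfs[0])   # the next call reads one data row into memory
for row in dfs[2]:         # or stream the third file row by row
    pass                   # per-row handling goes here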

P.S.

You may need multi-threading to iterate through each one at the same time.
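
If you go that route, one possible shape is sketched below; process_file is a hypothetical per-file worker (not from the original answer), and threads mainly help here because the work is I/O-bound:

from concurrent.futures import ThreadPoolExecutor
import csv

def process_file(path):
    # hypothetical per-file worker: stream one file row by row
    with open(path, 'r', newline='') as f:
        for row in csv.reader(f):
            pass  # per-row cleaning goes here

paths = [f"yellow_tripdata_2018-0{n}.csv" for n in range(1, 7)]
with ThreadPoolExecutor(max_workers=len(paths)) as pool:
    list(pool.map(process_file, paths))  # list() forces completion and surfaces any exceptions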

Loading/Editing/Saving - using the csv module

Ok, so I've done a lot of research: Python's csv module does load one line at a time, most likely because of the mode we're opening the file in (explained here).
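
A tiny sketch of that line-at-a-time behaviour (the file name is just the one from the question):

import csv

with open("yellow_tripdata_2018-01.csv", 'r', newline='') as f:
    reader = csv.reader(f)    # nothing is read yet
    header = next(reader)     # only the first line is parsed here
    first_row = next(reader)  # then the second, and so on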

If you don't want to use pandas (chunking may honestly be the answer; if so, just work that into @seralouk's answer), then yes! The approach below is, in my mind, the best one; we just need to change a couple of things.

import csv

number_of_files = 6
filename = "yellow_tripdata_2018-{}.csv"

for file_num in range(1, number_of_files + 1):
    # notice I'm opening the original file as f in mode 'r' for read only
    # and the new file as nf in mode 'a' for append
    with open(filename.format(str(file_num).zfill(2)), 'r', newline='') as f, \
         open(filename.format(str(file_num).zfill(2) + "-new"), 'a', newline='') as nf:
        # initialize the writer before looping over every line
        w = csv.writer(nf)
        for row in csv.reader(f):
            # do your "data cleaning" here (THIS IS PER-ROW, REMEMBER)
            # then save the row to the new file
            w.writerow(row)

Note:

You may want to consider using a DictReader and/or DictWriter; I'd prefer them over the regular reader/writer since I find them easier to understand.
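
A rough sketch of the same copy loop with DictReader/DictWriter; the column name used below ("trip_distance") is an assumption about the schema, not taken from the question:

import csv

with open("yellow_tripdata_2018-01.csv", 'r', newline='') as f, \
     open("yellow_tripdata_2018-01-new.csv", 'w', newline='') as nf:
    reader = csv.DictReader(f)
    writer = csv.DictWriter(nf, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        # each row is a dict keyed by column name, e.g. row["trip_distance"] (assumed column)
        writer.writerow(row)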

Pandas Approach - using chunks

PLEASE READ this answer if you'd like to steer away from my csv approach and stick with pandas :) It literally seems like the same issue as yours, and its answer is what you're asking for.

Basically, pandas allows you to load a file partially, as chunks, apply any alterations, and then write those chunks out to a new file. The code below is largely from that answer, but I did some further reading in the docs myself.

import pandas as pd

number_of_files = 6
chunksize = 500  # find the chunksize that works best for you
filename = "yellow_tripdata_2018-{}.csv"

for file_num in range(1, number_of_files + 1):
    first_chunk = True
    for chunk in pd.read_csv(filename.format(str(file_num).zfill(2)), chunksize=chunksize):
        # Do your data cleaning on the chunk here
        # see again we're in append mode, so the new file is built up chunk by chunk;
        # writing the header only for the first chunk keeps it from being repeated
        chunk.to_csv(filename.format(str(file_num).zfill(2) + "-new"),
                     mode='a', index=False, header=first_chunk)
        first_chunk = False

For more info on chunking the data, see here; it's also good reading for anyone, such as yourself, getting headaches over these memory issues.
