Read multiple parquet files in a folder and write to a single csv file using Python

Problem description

I am new to Python and I have a scenario where there are multiple parquet files with sequentially numbered file names, e.g. par_file1, par_file2, par_file3 and so on, up to 100 files in a folder.

I need to read these parquet files in order, starting from file1, and write them to a single csv file. After the contents of file1 are written, the contents of file2 should be appended to the same csv without a header. Note that all files have the same column names; only the data is split across multiple files.

I learned to convert a single parquet file to a csv file using pandas (with pyarrow as the parquet engine) with the following code:

import pandas as pd    
df = pd.read_parquet('par_file.parquet')    
df.to_csv('csv_file.csv')

But I couldn't extend this into a loop that reads multiple parquet files and appends them to a single csv. Is there a method in pandas to do this? Any other way to do this would also be of great help. Thank you.

Recommended answer

If you are going to copy the files to your local machine and run your code there, you could do something like the following. The code below assumes that you run it in the same directory as the parquet files. It also assumes the files are named as you described above: "par_file1, par_file2, par_file3 and so on, up to 100 files in a folder." If you need to search for your files, you will need to get the file names using glob and explicitly provide the path where you want to save the csv: open(r'this\is\your\path\to\csv_file.csv', 'a'). Hope this helps.

import pandas as pd

# Create the csv file and write the first parquet file with headers.
# newline='' is what the pandas docs recommend for text-mode file objects,
# so line endings are not translated a second time (matters on Windows).
with open('csv_file.csv', 'w', newline='') as csv_file:
    print('Reading par_file1.parquet')
    df = pd.read_parquet('par_file1.parquet')
    df.to_csv(csv_file, index=False)
    print('par_file1.parquet written to csv_file.csv\n')

# Build the remaining file names (par_file2 ... par_file100) to look for in the current directory
files = [f'par_file{i}.parquet' for i in range(2, 101)]

# Read each remaining file in order and append its rows to csv_file.csv without a header
for f in files:
    print(f'Reading {f}')
    df = pd.read_parquet(f)
    with open('csv_file.csv', 'a', newline='') as file:
        df.to_csv(file, header=False, index=False)
        print(f'{f} appended to csv_file.csv\n')

You can remove the print statements if you want.
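The answer mentions using glob when the files have to be searched for. A minimal sketch of that step follows, assuming the par_file<N>.parquet naming from the question; the helper name ordered_parquet_files and its prefix parameter are hypothetical and not part of the original answer. A plain alphabetical sort would place par_file10 before par_file2, so the sketch sorts on the numeric suffix instead.

```python
import glob
import os
import re


def ordered_parquet_files(folder, prefix="par_file"):
    """Return paths of <prefix><N>.parquet files in `folder`, sorted by N."""
    pattern = re.compile(rf"{re.escape(prefix)}(\d+)\.parquet$")
    numbered = []
    for path in glob.glob(os.path.join(folder, f"{prefix}*.parquet")):
        match = pattern.match(os.path.basename(path))
        if match:  # skip files whose name does not end in a number
            numbered.append((int(match.group(1)), path))
    # Sort on the integer suffix so par_file2 comes before par_file10
    return [path for _, path in sorted(numbered, key=lambda item: item[0])]
```

The returned list could replace the hand-built list of file names in the loop above.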

Using pandas 0.23.3
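As a variation on the approach above, the csv can be opened once and the header written only for the first file, which avoids reopening the output for every input. This is a sketch under assumptions rather than the answer author's code: the function name append_parquet_to_csv and its reader parameter are hypothetical, and the reader defaults to pd.read_parquet. It has not been checked against the pandas version named above.

```python
import pandas as pd


def append_parquet_to_csv(paths, csv_path, reader=pd.read_parquet):
    """Write the rows of each file in `paths`, in order, to one csv file."""
    # newline='' follows the pandas to_csv guidance for text-mode file objects
    with open(csv_path, "w", newline="") as csv_file:
        for position, path in enumerate(paths):
            df = reader(path)
            # Only the first file contributes the header row
            df.to_csv(csv_file, header=(position == 0), index=False)
```

For the 100 files in the question, the call would look like append_parquet_to_csv([f'par_file{i}.parquet' for i in range(1, 101)], 'csv_file.csv'). Only one file's data is held in memory at a time, which is the reason for writing file by file rather than concatenating everything first.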
