Parallel Loading of Input Files into a Pandas DataFrame
Problem Description
I have a requirement where I have three input files and need to load them into Pandas DataFrames, then merge two of them into a single DataFrame.
The file extension always changes: it could be .txt one time and .xlsx or .csv another time.
How can I run this process in parallel, in order to save the waiting/loading time?
Here is my current code:
from time import time  # to measure the time taken to run the code

import pandas as pd  # to work with the data frames

start_time = time()

Primary_File = "//ServerA/Testing Folder File Open/Report.xlsx"
Secondary_File_1 = "//ServerA/Testing Folder File Open/Report2.csv"
Secondary_File_2 = "//ServerA/Testing Folder File Open/Report2.csv"

Primary_df = pd.read_excel(Primary_File)
Secondary_1_df = pd.read_csv(Secondary_File_1)
Secondary_2_df = pd.read_csv(Secondary_File_2)

Secondary_df = Secondary_1_df.merge(Secondary_2_df, how='inner', on=['ID'])

end_time = time()
print(end_time - start_time)
It takes around 20 minutes to load my Primary_df and Secondary_df, so I am looking for an efficient solution, possibly using parallel processing, to save time. I timed the reading operations; they account for most of the total, approximately 18 minutes 45 seconds.
Hardware configuration: Intel i5 processor, 16 GB RAM, and a 64-bit operating system.
Question made eligible for bounty: I am looking for working code with detailed steps, using a package available in the Anaconda environment, that loads my input files in parallel and stores each in its own pandas DataFrame. This should ultimately save time.
Recommended Answer
Try this:
from time import time
from multiprocessing.pool import ThreadPool

import pandas as pd

start_time = time()
pool = ThreadPool(processes=3)

Primary_File = "//ServerA/Testing Folder File Open/Report.xlsx"
Secondary_File_1 = "//ServerA/Testing Folder File Open/Report2.csv"
Secondary_File_2 = "//ServerA/Testing Folder File Open/Report2.csv"

# Define a function for each thread
def import_xlsx(file_name):
    return pd.read_excel(file_name)

def import_csv(file_name):
    return pd.read_csv(file_name)

# Submit all three reads before collecting any result. Calling .get()
# immediately after each apply_async would block on that read and
# serialize the work, defeating the purpose of the pool.
primary_async = pool.apply_async(import_xlsx, (Primary_File,))
secondary_1_async = pool.apply_async(import_csv, (Secondary_File_1,))
secondary_2_async = pool.apply_async(import_csv, (Secondary_File_2,))

Primary_df = primary_async.get()
Secondary_1_df = secondary_1_async.get()
Secondary_2_df = secondary_2_async.get()

Secondary_df = Secondary_1_df.merge(Secondary_2_df, how='inner', on=['ID'])
end_time = time()
print(end_time - start_time)
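As an alternative sketch (not part of the original answer), the same idea can be written with the standard library's concurrent.futures. The read_any dispatcher below is a hypothetical helper that handles the changing extensions mentioned in the question; it assumes .txt files are tab-delimited, which you would need to confirm for your data.

```python
from concurrent.futures import ThreadPoolExecutor

import pandas as pd

def read_any(path):
    """Pick a pandas reader based on the file extension,
    since the extension may change between runs."""
    lower = path.lower()
    if lower.endswith((".xlsx", ".xls")):
        return pd.read_excel(path)
    if lower.endswith(".csv"):
        return pd.read_csv(path)
    # Assumption: .txt files are tab-delimited
    return pd.read_csv(path, sep="\t")

def load_parallel(paths):
    # executor.map submits all reads at once and yields the
    # resulting DataFrames in the same order as `paths`
    with ThreadPoolExecutor(max_workers=len(paths)) as executor:
        return list(executor.map(read_any, paths))
```

Because pandas readers spend much of their time in I/O, threads can overlap the waits; if parsing itself dominates, a ProcessPoolExecutor may be worth trying instead.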