What is the fastest and most efficient way to append rows to a DataFrame?
Question
I have a large dataset that I need to convert to .csv format; it has 29 columns and more than a million rows. I am using Python and a pandas DataFrame for this job. I have noticed that as the DataFrame grows, appending rows to it becomes more and more time-consuming. Is there a faster way to do this? The relevant snippet from my code is below.
Any suggestions are welcome.
df = DataFrame()
for startID in range(0, 100000, 1000):
    s1 = time.time()
    tempdf = DataFrame()
    url = f'https://******/products?startId={startID}&size=1000'
    r = requests.get(url, headers={'****-Token': 'xxxxxx', 'Merchant-Id': '****'})
    jsonList = r.json()  # datatype= list, contains= dict
    normalized = json_normalize(jsonList)
    # type(normal) = pandas.DataFrame
    print(startID / 1000)  # status indicator
    for series in normalized.iterrows():
        series = series[1]  # iterrows returns tuple (index, series)
        offers = series['offers']
        series = series.drop(columns='offers')
        length = len(offers)
        for offer in offers:
            n = json_normalize(offer).squeeze()  # squeeze() casts DataFrame into Series
            concatinated = concat([series, n]).to_frame().transpose()
            tempdf = tempdf.append(concatinated, ignore_index=True)
    del normalized
    df = df.append(tempdf)
    f1 = time.time()
    print(f1 - s1, ' seconds')
df.to_csv('out.csv')
Recommended answer
As Mohit Motwani suggested, the fastest way is to collect the data into a list of dictionaries and then load everything into a data frame in one step. Below are some speed measurements for 10,000 rows of 30 columns:
import pandas as pd
import numpy as np
import time
import random
end_value = 10000
Measurement for building a list of dictionaries and loading it all into a data frame at the end:
start_time = time.time()
dictinary_list = []
for i in range(0, end_value, 1):
    dictionary_data = {k: random.random() for k in range(30)}
    dictinary_list.append(dictionary_data)
df_final = pd.DataFrame.from_dict(dictinary_list)
end_time = time.time()
print('Execution time = %.6f seconds' % (end_time-start_time))
Execution time = 0.090153 seconds
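Applied to nested records like those in the question, the same pattern might look like the sketch below. The sample `products` list and its field names (`id`, `name`, `offers`, `price`, `seller`) are made up for illustration; the real API response will differ.

```python
import pandas as pd

# Hypothetical sample standing in for one page of r.json() from the question.
products = [
    {"id": 1, "name": "widget", "offers": [{"price": 9.5, "seller": "a"},
                                            {"price": 9.9, "seller": "b"}]},
    {"id": 2, "name": "gadget", "offers": [{"price": 20.0, "seller": "a"}]},
]

rows = []  # plain Python list; appending to it is cheap
for product in products:
    # Copy every product-level field except the nested offers list.
    base = {k: v for k, v in product.items() if k != "offers"}
    for offer in product["offers"]:
        # One output row per offer: product fields plus offer fields.
        rows.append({**base, **offer})

# Build the data frame once, after all rows have been collected.
df = pd.DataFrame(rows)
print(df)
```

If the offers themselves contain nested dictionaries, `pd.json_normalize(rows)` could be used in place of `pd.DataFrame(rows)` to flatten them, still in a single call at the end.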
Measurement for appending single-row data frames to a list and concatenating them at the end:
start_time = time.time()
appended_data = []
for i in range(0, end_value, 1):
    data = pd.DataFrame(np.random.randint(0, 100, size=(1, 30)), columns=list('A'*30))
    appended_data.append(data)
appended_data = pd.concat(appended_data, axis=0)
end_time = time.time()
print('Execution time = %.6f seconds' % (end_time-start_time))
Execution time = 4.183921 seconds
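Measurement for appending data frames one at a time. Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so this variant requires an older pandas version: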
start_time = time.time()
df_final = pd.DataFrame()
for i in range(0, end_value, 1):
    df = pd.DataFrame(np.random.randint(0, 100, size=(1, 30)), columns=list('A'*30))
    df_final = df_final.append(df)
end_time = time.time()
print('Execution time = %.6f seconds' % (end_time-start_time))
Execution time = 11.085888 seconds
Measurement for inserting rows one at a time with loc:
start_time = time.time()
df = pd.DataFrame(columns=list('A'*30))
for i in range(0, end_value, 1):
    df.loc[i] = list(np.random.randint(0, 100, size=30))
end_time = time.time()
print('Execution time = %.6f seconds' % (end_time-start_time))
Execution time = 21.029176 seconds
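Since the end goal in the question is a CSV file, another option may be to write each batch to disk as it arrives instead of growing one large data frame. This is a minimal sketch, assuming every batch is a list of flat dictionaries with the same keys; `fetch_batches` is a hypothetical stand-in for the paginated request loop, and the output goes to a temporary directory.

```python
import os
import tempfile

import pandas as pd


def fetch_batches(n_batches=3, batch_size=4):
    """Hypothetical stand-in for the paginated API calls in the question."""
    for b in range(n_batches):
        yield [{"id": b * batch_size + i, "value": i * 0.5}
               for i in range(batch_size)]


out_path = os.path.join(tempfile.mkdtemp(), "out.csv")

for batch_no, batch in enumerate(fetch_batches()):
    first = batch_no == 0
    pd.DataFrame(batch).to_csv(
        out_path,
        mode="w" if first else "a",  # overwrite on the first batch, append afterwards
        header=first,                # write the header row only once
        index=False,
    )

# Read the file back to check that all batches were written.
result = pd.read_csv(out_path)
print(len(result))
```

This keeps memory use bounded by the batch size, but it only works as written if every batch produces the same columns in the same order; otherwise the rows would not line up under the header.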