How to optimize this Pandas code to run faster

Problem description

I have this code to create a swarmplot from the data in a DataFrame:

import numpy as np
import pandas as pd
import seaborn as sns

df = pd.DataFrame({"Refined_ID": some_id_list,
                   "Refined_Age": age_list,
                   "Name": name_list})
# Flag the rows in which any value is a str
select = df.apply(lambda row: any(isinstance(e, str) for e in row), axis=1)
# Keep only the unflagged rows, as a copy so the assignments below are safe
dfAnalysis = df[~select].copy()
dfAnalysis['Refined_Age'] = dfAnalysis['Refined_Age'].replace('', np.nan)
dfAnalysis = dfAnalysis.dropna()
dfAnalysis['Refined_Age'] = dfAnalysis['Refined_Age'].apply(int)
# print(dfAnalysis)
print(type(dfAnalysis['Refined_Age'].iloc[0]))
g = sns.swarmplot(x=dfAnalysis['Refined_ID'], y=dfAnalysis['Refined_Age'], hue=dfAnalysis['Name'], orient="v")
g.set_xticklabels(g.get_xticklabels(), rotation=30)
# print(g)

It's taking a crazy amount of time to run (14 hours and counting!). How can I speed it up? Also, why is the code so slow in the first place?

The 3 lists that go into the DataFrame come from a CouchDB database with about 320k documents.

Update 1

I had intended to view only the first 20 categories, but left out the code that does so.

The line should be:

x = dfAnalysis['Refined_ID'].iloc[:20]
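
As far as I can tell, `.iloc[:20]` takes the first 20 rows of the column rather than the first 20 categories, and it leaves x shorter than y and hue. A minimal sketch of filtering the whole frame to the first 20 distinct IDs instead, assuming a stand-in `dfAnalysis` with made-up values (the `first_ids` and `dfTop` names are hypothetical):

```python
import pandas as pd

# Hypothetical stand-in for dfAnalysis from the question.
dfAnalysis = pd.DataFrame({
    "Refined_ID": [i % 50 for i in range(1000)],
    "Refined_Age": [20 + i % 60 for i in range(1000)],
})

# First 20 distinct IDs in order of appearance. Use
# dfAnalysis["Refined_ID"].value_counts().index[:20] instead
# if the 20 most frequent categories are wanted.
first_ids = dfAnalysis["Refined_ID"].drop_duplicates().iloc[:20]

# Keep every row that belongs to those categories, so that the
# x, y and hue columns stay the same length and stay aligned.
dfTop = dfAnalysis[dfAnalysis["Refined_ID"].isin(first_ids)]
print(dfTop["Refined_ID"].nunique(), len(dfTop))
```

The filtered frame can then be passed to the plotting call in place of `dfAnalysis`.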


Recommended answer

Do you really mean a swarmplot with several hundred thousand points? Besides taking forever, it makes no sense. Try it with the first 1000 points and see what kind of mess you get. Then use a boxplot or a violinplot instead. Try to understand your tools before using them.
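
A rough sketch of that advice, assuming synthetic data in place of the CouchDB lists (the values, the 20 categories and the random sample are all made up for illustration): draw the swarmplot from a 1000-row sample only, and let a boxplot summarize the full data set.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in for the question's data (hypothetical values).
rng = np.random.default_rng(0)
n = 320_000
df = pd.DataFrame({
    "Refined_ID": rng.integers(0, 20, size=n),
    "Refined_Age": rng.integers(0, 100, size=n),
})

# A swarmplot places every point individually, so only give it a small sample.
sample = df.sample(n=1000, random_state=0)

fig, (ax_swarm, ax_box) = plt.subplots(1, 2, figsize=(14, 5))
sns.swarmplot(x="Refined_ID", y="Refined_Age", data=sample, ax=ax_swarm, size=2)

# A boxplot only draws per-category summaries, so it can take all the rows.
# sns.violinplot accepts the same x, y, data and ax arguments.
sns.boxplot(x="Refined_ID", y="Refined_Age", data=df, ax=ax_box)
fig.tight_layout()
# plt.show()  # uncomment in an interactive session
```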

From the docstring:


[...] it does not scale as well to large numbers of observations (both in terms of the ability to show all the points and in terms of the computation needed to arrange them).
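
The plot is the main cost, but the row-wise `apply` in the question is also slower than it needs to be on 320k rows. A minimal sketch of a vectorized cleanup, assuming the goal is simply to keep the rows whose age parses as a number (this is not identical to the question's filter, which drops any row containing a str; the sample values are made up):

```python
import pandas as pd

# Hypothetical sample with the kinds of bad values the question guards against.
df = pd.DataFrame({
    "Refined_ID": [1, 2, 3, 4],
    "Refined_Age": [34, "", "unknown", 51],
    "Name": ["a", "b", "c", "d"],
})

# Coerce the whole column at once: entries that are not numbers,
# including the empty string, become NaN.
age = pd.to_numeric(df["Refined_Age"], errors="coerce")

# Keep the rows that parsed and store their ages as integers.
valid = age.notna()
dfAnalysis = df[valid].assign(Refined_Age=age[valid].astype(int))
print(dfAnalysis)
```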
