catplot(kind="count") is significantly slower than countplot()
Question
I am working on a fairly large dataset (~40m rows). I have found that if I call sns.countplot() directly then my visualisation plots really quickly:
%%time
ax = sns.countplot(x="age_band",data=acme)
However if I do the same visualisation using catplot(kind="count") then the speed of execution slows down dramatically:
%%time
g = sns.catplot(x="age_band",data=acme,kind="count")
Is there a reason for such a large performance difference? Is catplot() doing some sort of conversion on my data before it can plot it?
If there is a known reason for this, does it extend to all figure-level functions versus axes-level functions, e.g. is sns.scatterplot() faster than sns.relplot(kind="scatter")?
My preference would be to use catplot() as I like its flexibility and easy plotting on a FacetGrid but if it is going to take so much longer to achieve the same plot then I will just use the axis level functions directly.
Recommended answer
There is a lot of overhead in catplot, or for that matter in FacetGrid, that ensures the categories are synchronized across the grid. Consider, for example, a variable plotted along the columns of the grid for which not every age group occurs in every column. You would still need to show the non-occurring age group and hold on to its color. Hence, two countplots next to each other do not necessarily make up one catplot.
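The synchronization requirement described above can be illustrated with pandas alone. This is a minimal sketch, not seaborn's internal implementation: the `region` and `age_band` columns and their values are made up for illustration. It shows why each panel has to be counted against the full set of categories rather than only the ones it happens to contain.

```python
import pandas as pd

# Hypothetical data: the "south" region contains only one of the three age bands.
df = pd.DataFrame({
    "region": ["north", "north", "north", "south", "south"],
    "age_band": ["18-29", "30-44", "45-59", "18-29", "18-29"],
})

# The full category set, shared by every panel of the grid.
bands = ["18-29", "30-44", "45-59"]

# Count per region, then reindex so that missing bands appear with a count of 0.
# Without the reindex, the "south" panel would have a single bar and the bar
# positions and colors would no longer line up between panels.
counts = {
    region: group["age_band"].value_counts().reindex(bands, fill_value=0)
    for region, group in df.groupby("region")
}

for region, series in counts.items():
    print(region, series.tolist())
```

Every panel ends up with the same bar positions, so a given age band can keep the same slot and color across the grid. A figure-level function has to do this kind of bookkeeping for every facet, which is part of the extra work compared with a single axes-level call.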
However, if you are only interested in a single countplot, a catplot is clearly overkill. On the other hand, even a single countplot is overkill compared to a barplot of the counts. That is,
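import matplotlib.pyplot as plt
import numpy as np
# Count each category once, then draw the bars directly from the counts.
counts = df["Category"].value_counts().sort_index()
colors = plt.cm.tab10(np.arange(len(counts)))
ax = counts.plot.bar(color=colors)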
will be twice as fast as
ax = sns.countplot(x="Category", data=df)
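To check the difference on your own machine, the two approaches can be timed side by side. The following is a rough sketch under stated assumptions: the data is synthetic (200,000 rows and five made-up categories, not the original dataset), the helper names `time_call`, `plot_precounted` and `plot_countplot` are hypothetical, and the measured ratio will vary with data size, library versions and backend.

```python
import time

import matplotlib
matplotlib.use("Agg")  # render off-screen so no GUI window is needed
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in data (an assumption; the original dataset is not available).
rng = np.random.default_rng(0)
df = pd.DataFrame({"Category": rng.choice(list("ABCDE"), size=200_000)})


def time_call(plot_fn):
    """Run plot_fn on a fresh Axes and return (axes, elapsed seconds)."""
    fig, ax = plt.subplots()
    start = time.perf_counter()
    plot_fn(ax)
    elapsed = time.perf_counter() - start
    return ax, elapsed


def plot_precounted(ax):
    # Aggregate first, then draw one bar per category.
    counts = df["Category"].value_counts().sort_index()
    colors = plt.cm.tab10(np.arange(len(counts)))
    counts.plot.bar(color=colors, ax=ax)


def plot_countplot(ax):
    # Let seaborn do both the counting and the drawing.
    sns.countplot(x="Category", data=df, ax=ax)


ax_bar, t_bar = time_call(plot_precounted)
ax_sns, t_sns = time_call(plot_countplot)
print(f"value_counts + bar: {t_bar:.3f} s")
print(f"sns.countplot:      {t_sns:.3f} s")
plt.close("all")
```

A single run like this is noisy, so repeat it a few times before drawing conclusions. Adding a third function that calls sns.catplot(kind="count") would need slightly different handling, because catplot creates its own figure and does not accept an ax argument.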