Seaborn showing values not found in Pandas columns


Problem description

Original dataframe:

dp.head(10)

Creating new dataframe using recommended selection method:

dtest = pd.DataFrame(dp[dp['numdept'].isin([3, 6, 8, 10])]).dropna()
dtest.reset_index(drop=True, inplace=True)
dtest.head(10)

Testing to make sure that only the values in [3,6,8,10] are in dtest['numdept']:

print("numdept is 5:", dtest[dtest["numdept"].isin([5])])
print("set of distinct values in the numdept column:", sorted(set(dtest['numdept'].tolist())))

>> numdept is 5: Empty DataFrame
>> Columns: [numgrade, numyear, numdept]
>> Index: []
>> set of distinct values in the numdept column: [3, 6, 8, 10]

Plotting:

plt.figure(figsize=(16, 8))
sb.boxplot(x="numyear", y="numgrade", hue="numdept", data=dtest)

Question: Why does the plot legend for "numdept" show categories other than 3, 6, 8 and 10?

The problem surfaced in an IPython notebook, but it recurs even when I run the code in a regular environment. I also tried to avoid seaborn-related issues by using the suggestion here, to no avail.

Using Canopy 1.7.4.3348, jupyter 1.0.0-15, pandas 0.19.0-1, matplotlib 1.5.1-9 and seaborn 0.7.0-6.

EDIT: On an impulse, inserted the following before the plotting code:

grouped = dtest.groupby(['numdept', 'numyear'])
grouped.mean()

The output has numdept values that should not exist in dtest.

Does this make it a pandas bug?

Solution

You are using a categorical variable. It appears the legend is based on the categories in the categorical variable, not the values that are actually present. A categorical variable may represent categories that don't actually occur in the data, and these categories are still shown in the legend.
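
The effect described above can be reproduced with a small, made-up frame. This is only a sketch: the data and the `numgrade` values are hypothetical stand-ins for the asker's `dp`, and it assumes `numdept` has a categorical dtype and a pandas version that accepts the `observed` argument of `groupby` (newer than the 0.19 release mentioned in the question).

```python
import pandas as pd

# Hypothetical stand-in for the asker's data; numdept is stored as a categorical.
dp = pd.DataFrame({
    "numdept": pd.Categorical([3, 5, 6, 8, 10, 12]),
    "numgrade": [70, 80, 90, 60, 75, 85],
})

# Filter the rows the same way the question does.
dtest = dp[dp["numdept"].isin([3, 6, 8, 10])].dropna().reset_index(drop=True)

# Values that actually occur in the remaining rows.
present_values = sorted(dtest["numdept"].tolist())

# Categories the dtype still declares, including ones with no rows left.
declared_categories = dtest["numdept"].cat.categories.tolist()

# Grouping on the categorical with observed=False reports every declared
# category, so the filtered-out departments appear as empty groups.
group_sizes = dtest.groupby("numdept", observed=False).size()

print(present_values)
print(declared_categories)
print(group_sizes)
```

The filtered rows contain only four departments, yet the dtype keeps all six categories. A legend or a `groupby` that reads the categories rather than the values will therefore list departments that no longer have any rows.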

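As suggested in the documentation, you can remove the empty categories with dtest.numdept.cat.remove_unused_categories(). By default this returns a new Series rather than modifying the column in place, so assign the result back: dtest['numdept'] = dtest.numdept.cat.remove_unused_categories().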
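
A minimal sketch of that cleanup, again on hypothetical data rather than the asker's actual frame. It assumes `numdept` is categorical and a pandas version that accepts the `observed` argument of `groupby`.

```python
import pandas as pd

# Hypothetical stand-in for the asker's data; numdept is stored as a categorical.
dp = pd.DataFrame({
    "numdept": pd.Categorical([3, 5, 6, 8, 10, 12]),
    "numgrade": [70, 80, 90, 60, 75, 85],
})
dtest = dp[dp["numdept"].isin([3, 6, 8, 10])].reset_index(drop=True)

# remove_unused_categories() returns a new Series, so assign it back
# to the column for the change to take effect.
dtest["numdept"] = dtest["numdept"].cat.remove_unused_categories()

# Only the categories that still have rows remain.
remaining = dtest["numdept"].cat.categories.tolist()

# Grouping no longer produces empty groups for the removed departments.
group_sizes = dtest.groupby("numdept", observed=False).size()

print(remaining)
print(group_sizes)
```

Once the unused categories are dropped, a plot with hue="numdept" should only have the four remaining departments to draw in its legend.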
