How to plot parallel coordinates on a pandas DataFrame with some columns containing strings?

Problem description

I would like to plot parallel coordinates for a pandas DataFrame containing columns with numbers and other columns containing strings as values.

I have the following test code, which works for plotting parallel coordinates with numbers:

import pandas as pd
import matplotlib.pyplot as plt
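# Import location in old pandas releases such as 0.9.0; newer releases expose it as pandas.plotting
from pandas.tools.plotting import parallel_coordinates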

df = pd.DataFrame([["line 1",20,30,100],\
    ["line 2",10,40,90],["line 3",10,35,120]],\
    columns=["element","var 1","var 2","var 3"])
parallel_coordinates(df,"element")
plt.show()

This ends up showing the following graphic:

However, what I would like to do is add some variables that contain strings to my plot. But when I run the following code:

df2 = pd.DataFrame([["line 1",20,30,100,"N"],\
    ["line 2",10,40,90,"N"],["line 3",10,35,120,"N-1"]],\
    columns=["element","var 1","var 2","var 3","regime"])
parallel_coordinates(df2,"element")
plt.show()

I get this error:

ValueError: invalid literal for float(): N

I suppose this means that the parallel_coordinates function does not accept strings.

Example of what I am trying to do

I am attempting to do something like this example, where Race and Sex are strings and not numbers:

Question

Is there any way to produce such a graphic using pandas parallel_coordinates? If not, how could I attempt such a graphic? Maybe with matplotlib?

I must mention that I am particularly looking for a solution under Python 2.5 with pandas version 0.9.0.

Recommended answer

It wasn't entirely clear to me what you wanted to do with the regime column.

If the problem was just that its presence prevented the plot from showing, then you could simply omit the offending columns from the plot:

parallel_coordinates(df2, class_column='element', cols=['var 1', 'var 2', 'var 3'])
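
As a rough sketch of the same idea, the numeric columns could be picked out automatically instead of being listed by hand. This assumes a pandas release that provides DataFrame.select_dtypes (more recent than the 0.9.0 mentioned in the question); the name numeric_cols is only illustrative.

```python
import pandas as pd

df2 = pd.DataFrame(
    [["line 1", 20, 30, 100, "N"],
     ["line 2", 10, 40, 90, "N"],
     ["line 3", 10, 35, 120, "N-1"]],
    columns=["element", "var 1", "var 2", "var 3", "regime"],
)

# Keep only columns with a numeric dtype; "element" and "regime"
# hold strings, so they are left out.
numeric_cols = df2.select_dtypes(include="number").columns.tolist()
print(numeric_cols)  # ['var 1', 'var 2', 'var 3']
```

The resulting list can then be passed as cols=numeric_cols in the parallel_coordinates call.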

Looking at the example you provided, I understood that you want categorical variables to be placed on a vertical line as well, with each value of the category represented by a different y-value. Am I getting this right?

If so, you need to encode your categorical variables (here, regime) as numerical values. To do this, I used this tip I found on this website.

df2.regime = df2.regime.astype('category')
df2['regime_encoded'] = df2.regime.cat.codes


print(df2)
    element var 1   var 2   var 3   regime  regime_encoded
0   line 1  20      30      100     N       0
1   line 2  10      40      90      N       0
2   line 3  10      35      120     N-1     1
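
The integer codes follow the order of the categories, so the mapping from code back to label can be recovered when it is needed later (for example, for labelling the axis). A minimal sketch, assuming a pandas version with the .cat accessor; the name code_to_label is hypothetical.

```python
import pandas as pd

regime = pd.Series(["N", "N", "N-1"], name="regime").astype("category")

# cat.codes holds the position of each value within cat.categories
codes = regime.cat.codes

# Map each integer code back to its original string label
code_to_label = dict(enumerate(regime.cat.categories))
print(code_to_label)  # {0: 'N', 1: 'N-1'}
```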

The astype('category') and cat.codes step creates a new column (regime_encoded) in which each value of the category regime is coded as an integer. You can then plot your new dataframe, including the newly created column:

parallel_coordinates(df2[['element', 'var 1', 'var 2', 'var 3', 'regime_encoded']],"element")

The problem is that the encoded values of the categorical variable (0, 1) are on a completely different scale from your other variables, so all the lines seem to converge toward the same point on that axis. The answer is to scale the encoding to the range of your data. Here I did it very simply by multiplying by the maximum, because your data is bounded between 0 and 120; if that is not the case in your real dataframe, you probably need to scale from the minimum value as well.

df2['regime_encoded'] = df2.regime.cat.codes * max(df2.max(axis=1, numeric_only=True))
parallel_coordinates(df2[['element', 'var 1', 'var 2', 'var 3', 'regime_encoded']],"element")
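
For data that does not start near zero, one possible way to scale from the minimum is to spread the codes evenly between the smallest and largest numeric value. This is only a sketch under that assumption, not the method used above; the names lo, hi and span are illustrative.

```python
import pandas as pd

df2 = pd.DataFrame(
    [["line 1", 20, 30, 100, "N"],
     ["line 2", 10, 40, 90, "N"],
     ["line 3", 10, 35, 120, "N-1"]],
    columns=["element", "var 1", "var 2", "var 3", "regime"],
)
df2["regime"] = df2["regime"].astype("category")
codes = df2["regime"].cat.codes.astype(float)

# Overall range of the numeric variables that share the plot
numeric = df2[["var 1", "var 2", "var 3"]]
lo = numeric.min().min()
hi = numeric.max().max()

# Spread the codes evenly over [lo, hi]; guard against a single category
span = max(len(df2["regime"].cat.categories) - 1, 1)
df2["regime_encoded"] = lo + codes / span * (hi - lo)
print(df2["regime_encoded"].tolist())  # [10.0, 10.0, 120.0]
```

The scaled column can then be plotted with parallel_coordinates in the same way as before.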

To match your example more closely, you can add annotations:

df2['regime_encoded'] = df2.regime.cat.codes * max(df2.max(axis=1, numeric_only=True))
parallel_coordinates(df2[['element', 'var 1', 'var 2', 'var 3', 'regime_encoded']],"element")
ax = plt.gca()
# Label each distinct regime value next to the last axis (x=3) at its encoded height
for i,(label,val) in df2.loc[:,['regime','regime_encoded']].drop_duplicates().iterrows():
    ax.annotate(label, xy=(3,val), ha='left', va='center')
plt.show()
