Making matplotlib scatter plots from dataframes in Python's pandas


Problem description


What is the best way to make a series of scatter plots using matplotlib from a pandas dataframe in Python?

For example, if I have a dataframe df that has some columns of interest, I find myself typically converting everything to arrays:

import matplotlib.pyplot as plt
# df is a DataFrame: fetch col1 and col2 
# and drop na rows if any of the columns are NA
mydata = df[["col1", "col2"]].dropna(how="any")
# Now plot with matplotlib
vals = mydata.values
plt.scatter(vals[:, 0], vals[:, 1])

The problem with converting everything to array before plotting is that it forces you to break out of dataframes.

Consider these two use cases where having the full dataframe is essential to plotting:

  1. For example, what if you now wanted to look up the col3 value for each point plotted in the call to scatter, and color (or size) each point by that value? You'd have to go back, pull out the non-NA values of col1,col2 and check their corresponding col3 values.

    Is there a way to plot while preserving the dataframe? For example:

    mydata = df.dropna(how="any", subset=["col1", "col2"])
    # plot a scatter of col1 by col2, with sizes according to col3
    scatter(mydata[["col1", "col2"]], s=mydata["col3"])
    

  2. Similarly, imagine that you wanted to filter or color each point differently depending on the values of some of its columns. E.g., what if you wanted to automatically plot the labels of the points that meet a certain cutoff on col1, col2 alongside them (where the labels are stored in another column of the df), or color these points differently, like people do with dataframes in R? For example:

    mydata = df.dropna(how="any", subset=["col1", "col2"]) 
    myscatter = scatter(mydata[["col1", "col2"]], s=1)
    # Plot in red, with smaller size, all the points that 
    # have a col2 value greater than 0.5
    myscatter.replot(mydata["col2"] > 0.5, color="red", s=0.5)
    

How can this be done?

EDIT Reply to crewbum:

You say that the best way is to plot each condition (like subset_a, subset_b) separately. What if you have many conditions, e.g. you want to split up the scatters into 4 types of points or even more, plotting each in a different shape/color? How can you elegantly apply condition a, b, c, etc. and make sure you then plot "the rest" (things not in any of these conditions) as the last step?

Similarly, in your example where you plot col1,col2 differently based on col3, what if there are NA values that break the association between col1,col2,col3? For example, suppose you want to plot all col2 values based on their col3 values, but some rows have an NA value in either col1 or col3, forcing you to use dropna first. So you would do:

mydata = df.dropna(how="any", subset=["col1", "col2", "col3"])

then you can plot using mydata like you show -- plotting the scatter between col1,col2 using the values of col3. But mydata will be missing some points that have values for col1,col2 but are NA for col3, and those still have to be plotted... so how would you basically plot "the rest" of the data, i.e. the points that are not in the filtered set mydata?

Solution

Try passing columns of the DataFrame directly to matplotlib, as in the examples below, instead of extracting them as numpy arrays.

import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randn(10,2), columns=['col1','col2'])
df['col3'] = np.arange(len(df))**2 * 100 + 100

In [5]: df
Out[5]: 
       col1      col2  col3
0 -1.000075 -0.759910   100
1  0.510382  0.972615   200
2  1.872067 -0.731010   500
3  0.131612  1.075142  1000
4  1.497820  0.237024  1700
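
As a possible alternative that stays entirely inside pandas, the DataFrame's own plotting accessor can draw the scatter. This is only a sketch and not part of the original answer: the seed, the divisor used to scale the marker sizes, and the colormap are arbitrary assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only
df = pd.DataFrame(rng.standard_normal((10, 2)), columns=["col1", "col2"])
df["col3"] = np.arange(len(df)) ** 2 * 100 + 100

# x, y and c are given as column names; s is passed as values scaled down
# by an arbitrary factor so the largest markers stay readable.
ax = df.plot.scatter(x="col1", y="col2", s=df["col3"] / 20, c="col3", colormap="viridis")
```

The call returns a matplotlib Axes, so further matplotlib calls (titles, annotations, saving the figure) can be chained on `ax`.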

Vary scatter point size based on another column

plt.scatter(df.col1, df.col2, s=df.col3)
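
To keep working with a DataFrame while dropping NA rows, one option is matplotlib's `data` keyword, which lets `scatter` look up columns by name. The sketch below assumes a made-up missing value in row 3 and an arbitrary seed; it is one way to do this, not the only one.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed
df = pd.DataFrame(rng.standard_normal((10, 2)), columns=["col1", "col2"])
df["col3"] = np.arange(len(df)) ** 2 * 100.0 + 100
df.loc[3, "col2"] = np.nan  # hypothetical missing value, for illustration

# Drop rows only where a plotted column is NA; the result is still a DataFrame.
mydata = df.dropna(how="any", subset=["col1", "col2", "col3"])

fig, ax = plt.subplots()
# The strings are resolved as column names of `data`, so no array extraction is needed.
points = ax.scatter("col1", "col2", s="col3", data=mydata)
ax.set_xlabel("col1")
ax.set_ylabel("col2")
```

Because `mydata` keeps its index and all other columns, any plotted point can still be traced back to its full row.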

Vary scatter point color based on another column

colors = np.where(df.col3 > 300, 'r', 'k')
plt.scatter(df.col1, df.col2, s=120, c=colors)
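
When the third column is continuous rather than a two-way split, it may be simpler to pass the column itself as `c` and let a colormap do the mapping. A minimal sketch, assuming the `viridis` colormap and an arbitrary seed:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)  # arbitrary seed
df = pd.DataFrame(rng.standard_normal((10, 2)), columns=["col1", "col2"])
df["col3"] = np.arange(len(df)) ** 2 * 100 + 100

fig, ax = plt.subplots()
# Numeric values passed as `c` are mapped to colors through the colormap.
points = ax.scatter(df["col1"], df["col2"], s=120, c=df["col3"], cmap="viridis")
colorbar = fig.colorbar(points, ax=ax)
colorbar.set_label("col3")
```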

Scatter plot with legend

However, the easiest way I've found to create a scatter plot with legend is to call plt.scatter once for each point type.

cond = df.col3 > 300
subset_a = df[cond].dropna()
subset_b = df[~cond].dropna()
plt.scatter(subset_a.col1, subset_a.col2, s=120, c='b', label='col3 > 300')
plt.scatter(subset_b.col1, subset_b.col2, s=60, c='r', label='col3 <= 300') 
plt.legend()
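
The same subset-then-plot pattern can also cover the question's second use case, labelling the points that pass a cutoff. In the sketch below the `label` column, its contents, the cutoff of 0.5 and the text offset are all hypothetical choices made for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)  # arbitrary seed
df = pd.DataFrame(rng.standard_normal((10, 2)), columns=["col1", "col2"])
df["label"] = [f"pt{i}" for i in range(len(df))]  # hypothetical label column

fig, ax = plt.subplots()
ax.scatter(df["col1"], df["col2"], s=40, c="k")

# Re-plot the points above the cutoff in red and attach their labels.
selected = df[df["col2"] > 0.5]
ax.scatter(selected["col1"], selected["col2"], s=40, c="r")
annotations = [
    ax.annotate(row.label, (row.col1, row.col2),
                xytext=(5, 5), textcoords="offset points")
    for row in selected.itertuples()
]
```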

Update

From what I can tell, matplotlib simply skips points with NA x/y coordinates or NA style settings (e.g., color/size). To find points skipped due to NA, try the isnull method: df[df.col3.isnull()]
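
To plot "the rest" explicitly instead of letting it be skipped, boolean masks can split the rows with usable coordinates into those that have a col3 value and those that do not. This is a sketch under assumed data: the two missing values, the size divisor and the grey "x" style are illustrative choices.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)  # arbitrary seed
df = pd.DataFrame(rng.standard_normal((10, 2)), columns=["col1", "col2"])
df["col3"] = np.arange(len(df)) ** 2 * 100.0 + 100
df.loc[[2, 7], "col3"] = np.nan  # hypothetical missing values

has_xy = df["col1"].notna() & df["col2"].notna()
has_col3 = df["col3"].notna()
complete = df[has_xy & has_col3]  # points that can be sized by col3
rest = df[has_xy & ~has_col3]     # plottable points that lack col3

fig, ax = plt.subplots()
ax.scatter(complete["col1"], complete["col2"], s=complete["col3"] / 20,
           c="b", label="col3 present")
ax.scatter(rest["col1"], rest["col2"], s=60, c="0.6", marker="x",
           label="col3 missing")
ax.legend()
```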

To split a list of points into many types, take a look at numpy select, which is a vectorized if-then-else implementation and accepts an optional default value. For example:

df['subset'] = np.select([df.col3 < 150, df.col3 < 400, df.col3 < 600],
                         [0, 1, 2], -1)
for color, label in zip('bgrm', [0, 1, 2, -1]):
    subset = df[df.subset == label]
    plt.scatter(subset.col1, subset.col2, s=120, c=color, label=str(label))
plt.legend()
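
If each type should also get its own marker shape and a readable legend entry, the integer codes from `np.select` can be paired with a small style table and a `groupby` loop. The names, colors and markers in `styles` below are hypothetical; only the thresholds follow the example above.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)  # arbitrary seed
df = pd.DataFrame(rng.standard_normal((10, 2)), columns=["col1", "col2"])
df["col3"] = np.arange(len(df)) ** 2 * 100 + 100

# The first matching condition wins; rows matching none get the default code -1.
conditions = [df["col3"] < 150, df["col3"] < 400, df["col3"] < 600]
df["subset"] = np.select(conditions, [0, 1, 2], default=-1)

# Hypothetical style table: code -> (legend name, color, marker).
styles = {0: ("low", "b", "o"), 1: ("mid", "g", "s"),
          2: ("high", "r", "^"), -1: ("rest", "m", "x")}

fig, ax = plt.subplots()
for code, group in df.groupby("subset"):
    name, color, marker = styles[int(code)]
    ax.scatter(group["col1"], group["col2"], s=120, c=color,
               marker=marker, label=name)
ax.legend()
```

Because the default code catches everything the conditions miss, "the rest" is plotted by the same loop without a separate final step.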
