Making matplotlib scatter plots from dataframes in Python's pandas

Question

What is the best way to make a series of scatter plots with matplotlib from a pandas dataframe in Python?

For example, if I have a dataframe df that has some columns of interest, I typically find myself converting everything to arrays:

import matplotlib.pyplot as plt
# df is a DataFrame: fetch col1 and col2
# and drop rows where either column is NA
mydata = df[["col1", "col2"]].dropna(how="any")
# Now plot with matplotlib
vals = mydata.values
plt.scatter(vals[:, 0], vals[:, 1])

The problem with converting everything to arrays before plotting is that it forces you to break out of dataframes.

Consider these two use cases where having the full dataframe is essential to plotting:

  1. For example, what if you now wanted to look at the values of col3 that correspond to the points you plotted in the call to scatter, and color (or size) each point by that value? You'd have to go back, pull out the non-NA values of col1 and col2, and look up their corresponding col3 values.

Is there a way to plot while preserving the dataframe? For example:

mydata = df.dropna(how="any", subset=["col1", "col2"])
# plot a scatter of col1 by col2, with sizes according to col3
scatter(mydata[["col1", "col2"]], s=mydata["col3"])

  2. Similarly, imagine that you wanted to filter or color each point differently depending on the values of some of its columns. E.g. what if you wanted to automatically plot the labels of the points that meet a certain cutoff on col1, col2 alongside them (where the labels are stored in another column of the df), or color these points differently, as people do with dataframes in R? For example:

    mydata = df.dropna(how="any", subset=["col1", "col2"]) 
    myscatter = scatter(mydata[["col1", "col2"]], s=1)
    # Plot in red, with smaller size, all the points that 
    # have a col2 value greater than 0.5
    myscatter.replot(mydata["col2"] > 0.5, color="red", s=0.5)
    

    How can this be done?

    EDIT in reply to crewbum:

    You say that the best way is to plot each condition (like subset_a, subset_b) separately. What if you have many conditions, e.g. you want to split the scatter into 4 or more types of points, plotting each in a different shape/color? How can you elegantly apply conditions a, b, c, etc. and make sure you then plot "the rest" (points that match none of these conditions) as the last step?

    Similarly, in your example where you plot col1, col2 differently based on col3, what if there are NA values that break the association between col1, col2, and col3? For example, suppose you want to plot all col2 values based on their col3 values, but some rows have an NA value in either col1 or col3, forcing you to use dropna first. So you would do:

    mydata = df.dropna(how="any", subset=["col1", "col2", "col3"])
    

    Then you can plot using mydata as you show, i.e. a scatter of col1 against col2 styled by the values of col3. But mydata will be missing the points that have values for col1 and col2 but are NA for col3, and those still have to be plotted. So how would you plot "the rest" of the data, i.e. the points that are not in the filtered set mydata?

    Recommended answer

    Try passing columns of the DataFrame directly to matplotlib, as in the examples below, instead of extracting them as numpy arrays.

    import matplotlib.pyplot as plt
    import numpy as np
    import pandas as pd

    df = pd.DataFrame(np.random.randn(10,2), columns=['col1','col2'])
    df['col3'] = np.arange(len(df))**2 * 100 + 100
    
    In [5]: df
    Out[5]: 
           col1      col2  col3
    0 -1.000075 -0.759910   100
    1  0.510382  0.972615   200
    2  1.872067 -0.731010   500
    3  0.131612  1.075142  1000
    4  1.497820  0.237024  1700
    

    Vary scatter point size based on another column

    plt.scatter(df.col1, df.col2, s=df.col3)
    # OR (with pandas 0.13 and up)
    df.plot(kind='scatter', x='col1', y='col2', s=df.col3)
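
Another way to stay inside the dataframe is matplotlib's `data` keyword, which lets string arguments refer to column names. The following is a minimal sketch, assuming a reasonably recent matplotlib; the seeded random frame and the `Agg` backend line are assumptions made only so the example runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; drop this line in an interactive session
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Illustrative data with the same shape as the example frame above.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((10, 2)), columns=["col1", "col2"])
df["col3"] = np.arange(len(df)) ** 2 * 100 + 100

fig, ax = plt.subplots()
# With data=df, the strings "col1", "col2" and "col3" are looked up as columns of df.
points = ax.scatter("col1", "col2", s="col3", data=df)
ax.set_xlabel("col1")
ax.set_ylabel("col2")
```

The call returns a `PathCollection`, so the plotted points can still be inspected or restyled afterwards while `df` remains untouched.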
    

    Vary scatter point color based on another column

    colors = np.where(df.col3 > 300, 'r', 'k')
    plt.scatter(df.col1, df.col2, s=120, c=colors)
    # OR (with pandas 0.13 and up)
    df.plot(kind='scatter', x='col1', y='col2', s=120, c=colors)
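
If the goal is to color each point by the value of col3 itself rather than by a two-way threshold, a numeric column can be passed as `c` together with a colormap. This is a sketch under assumptions: the `viridis` colormap and the seeded random data are illustrative choices, not part of the original answer.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; drop this line in an interactive session
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((10, 2)), columns=["col1", "col2"])
df["col3"] = np.arange(len(df)) ** 2 * 100 + 100

fig, ax = plt.subplots()
# A numeric c is mapped through the colormap; the colorbar shows the col3 scale.
points = ax.scatter(df["col1"], df["col2"], s=120, c=df["col3"], cmap="viridis")
fig.colorbar(points, ax=ax, label="col3")
```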
    

    However, the easiest way I've found to create a scatter plot with a legend is to call plt.scatter once for each point type.

    cond = df.col3 > 300
    subset_a = df[cond].dropna()
    subset_b = df[~cond].dropna()
    plt.scatter(subset_a.col1, subset_a.col2, s=120, c='b', label='col3 > 300')
    plt.scatter(subset_b.col1, subset_b.col2, s=60, c='r', label='col3 <= 300') 
    plt.legend()
    

    From what I can tell, matplotlib simply skips points with NA x/y coordinates or NA style settings (e.g., color/size). To find points skipped due to NA, try the isnull method: df[df.col3.isnull()]
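
One way to plot "the rest" from the question, i.e. rows that have col1 and col2 but no col3, is to build boolean masks on the original frame instead of calling dropna up front. The sketch below uses a small hand-made frame; the values, colors, and marker choices are assumptions for illustration only.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; drop this line in an interactive session
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical data: rows 1 and 3 lack col3, row 4 lacks col1.
df = pd.DataFrame({
    "col1": [0.1, 0.5, 0.9, 1.3, np.nan],
    "col2": [1.0, 0.4, 0.7, 0.2, 0.6],
    "col3": [100.0, np.nan, 500.0, np.nan, 300.0],
})

has_xy = df[["col1", "col2"]].notna().all(axis=1)  # point can be placed on the axes
has_col3 = df["col3"].notna()                      # point can be sized by col3

complete = df[has_xy & has_col3]  # sized by col3
rest = df[has_xy & ~has_col3]     # plottable, but no col3 to size by

fig, ax = plt.subplots()
ax.scatter(complete["col1"], complete["col2"], s=complete["col3"], c="b",
           label="col3 known")
ax.scatter(rest["col1"], rest["col2"], s=60, c="gray", marker="x",
           label="col3 missing")
ax.legend()
```

Because both subsets are ordinary DataFrames indexed like `df`, any other column (labels, for instance) is still available for each plotted point.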

    To split a list of points into many types, take a look at numpy select, which is a vectorized if-then-else implementation and accepts an optional default value. For example:

    df['subset'] = np.select([df.col3 < 150, df.col3 < 400, df.col3 < 600],
                             [0, 1, 2], -1)
    for color, label in zip('bgrm', [0, 1, 2, -1]):
        subset = df[df.subset == label]
        plt.scatter(subset.col1, subset.col2, s=120, c=color, label=str(label))
    plt.legend()
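
The same idea can be combined with `groupby` so that each class carries its own color, marker, and legend text, with the default class drawn last. This is only a sketch: the `styles` dictionary, its labels, and the seeded data are hypothetical names and values introduced for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; drop this line in an interactive session
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((10, 2)), columns=["col1", "col2"])
df["col3"] = np.arange(len(df)) ** 2 * 100 + 100

# The first matching condition wins; rows matching none get the default -1.
df["subset"] = np.select(
    [df["col3"] < 150, df["col3"] < 400, df["col3"] < 600], [0, 1, 2], default=-1
)

# Hypothetical style table: class code -> (color, marker, legend label).
styles = {
    0: ("b", "o", "col3 < 150"),
    1: ("g", "s", "150 <= col3 < 400"),
    2: ("r", "^", "400 <= col3 < 600"),
    -1: ("m", "x", "the rest"),
}

fig, ax = plt.subplots()
plotted = {}
# sort=False keeps groups in order of first appearance, so -1 comes last here.
for code, group in df.groupby("subset", sort=False):
    color, marker, label = styles[code]
    ax.scatter(group["col1"], group["col2"], s=120, c=color, marker=marker,
               label=label)
    plotted[code] = len(group)
ax.legend()
```

Adding a new class then only requires one more condition in the `np.select` call and one more entry in the style table.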
    
