How to make a multiline graph with matplotlib subplots and pandas?


Question

I'm fairly new to coding (completely self-taught) and have started using it at my job as a research assistant in a cancer lab. I need some help setting up a few line graphs in matplotlib.

I have a dataset that includes next-generation sequencing data for about 80 patients. For each patient, we have different timepoints of analysis, different genes detected (out of 40), and the associated % mutation for each gene.

My goal is to write two scripts. The first will generate a "by patient" plot: a line graph with % mutation on the y-axis and time of measurement on the x-axis, with a different colored line for each of the patient's associated genes. The second will generate a "by gene" plot: one plot containing a different colored line for each patient's x/y values for that specific gene.

Here is an example dataframe for 1 genenumber for the above script:

gene      yaxis   xaxis   pt#   gene#
ASXL1-3   34      1       3     1
ASXL1-3   0       98      3     1
IDH1-3    24      1       3     11
IDH1-3    0       98      3     11
RUNX1-3   38      1       3     21
RUNX1-3   0       98      3     21
U2AF1-3   33      1       3     26
U2AF1-3   0       98      3     26

I have set up a groupby script that, when I iterate over it, gives me a dataframe with every gene-timepoint for each patient.

grouped = df.groupby('pt #')
for groupObject in grouped:
    group = groupObject[1]

For patient 1, this gives the following output:

        y     x   gene  patientnumber patientgene  genenumber  dxtotransplant  \
0    40.0  1712  ASXL1              1     ASXL1-1           1            1857   
1    26.0  1835  ASXL1              1     ASXL1-1           1            1857   
302   7.0  1835  RUNX1              1     RUNX1-1          21            1857   

I need help writing a script that will create either of the plots described above. Using the by-patient example, my general idea is that I need to create a different subplot for every gene a patient has, where each subplot is the line graph for that one gene.

Using matplotlib, this is about as far as I have gotten:

plt.figure()

grouped = df.groupby('patient number')

for groupObject in grouped:
    group = groupObject[1]
    df = group #may need to remove this
    for element in range(len(group)): 
        xs = np.array(df[df.columns[1]]) #"x" column
        ys= np.array(df[df.columns[0]]) #"y" column
        gene = np.array(df[df.columns[2]])[element] #"gene" column
        plt.subplot(1,1,1) 
        plt.scatter(xs,ys, label=gene)
        plt.plot(xs,ys, label=gene)
        plt.legend()
    plt.show()

This produces the following output:

In this output, the circled line is not supposed to be connected to the other 2 points. In this case, this is patient 1, who has the following datapoints:

x       y   gene
1712    40  ASXL1
1835    26  ASXL1
1835    7   RUNX1

Using seaborn, I have gotten close to my desired graph with this code:

grouped = df.groupby(['patientnumber'])
for groupObject in grouped:
    group = groupObject[1]
    g = sns.FacetGrid(group, col="patientgene", col_wrap=4, size=4, ylim=(0,100))  
    g = g.map(plt.scatter, "x", "y", alpha=0.5)
    g = g.map(plt.plot, "x", "y", alpha=0.5)
    plt.title= "gene:%s"%element

Using this code I get the following:

If I tweak the line:

g = sns.FacetGrid(group, col="patientnumber", col_wrap=4, size=4, ylim=(0,100))

I get the following result:

As you can see in the 2nd example, the plot is treating every point as if it came from the same line (but they are actually 4 separate lines).

How can I tweak my iterations so that each patient-gene is treated as a separate line on the same graph?

Recommended answer

I wrote a subplot function that may give you a hand. I modified the data a tad to help illustrate the plotting functionality.

gene,yaxis,xaxis,pt #,gene #
ASXL1-3,34,1,3,1
ASXL1-3,3,98,3,1
IDH1-3,24,1,3,11
IDH1-3,7,98,3,11
RUNX1-3,38,1,3,21
RUNX1-3,2,98,3,21
U2AF1-3,33,1,3,26
U2AF1-3,0,98,3,26
ASXL1-3,39,1,4,1
ASXL1-3,8,62,4,1
ASXL1-3,0,119,4,1
IDH1-3,27,1,4,11
IDH1-3,12,62,4,11
IDH1-3,1,119,4,11
RUNX1-3,42,1,4,21
RUNX1-3,3,62,4,21
RUNX1-3,1,119,4,21
U2AF1-3,16,1,4,26
U2AF1-3,1,62,4,26
U2AF1-3,0,119,4,26
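
To experiment without a data.csv on disk, a few of the rows above can be read from a string instead. The sketch below is only an illustration: `SAMPLE_CSV`, `sample_df` and `line_df` are names made up here, and only a subset of the rows is included. It also shows that grouping on both 'pt #' and 'gene' yields one sub-frame per patient-gene pair, which is the unit that should become a single line on a plot.

```python
import io

import pandas as pd

# A subset of the sample rows above, inlined so no data.csv is needed.
SAMPLE_CSV = """gene,yaxis,xaxis,pt #,gene #
ASXL1-3,34,1,3,1
ASXL1-3,3,98,3,1
IDH1-3,24,1,3,11
IDH1-3,7,98,3,11
ASXL1-3,39,1,4,1
ASXL1-3,8,62,4,1
ASXL1-3,0,119,4,1
"""

sample_df = pd.read_csv(io.StringIO(SAMPLE_CSV))

# Grouping on both columns gives one sub-frame per (patient, gene) pair.
for (patient, gene), line_df in sample_df.groupby(["pt #", "gene"]):
    points = list(zip(line_df["xaxis"], line_df["yaxis"]))
    print(patient, gene, points)
```

Each printed row corresponds to one line that should be drawn separately, rather than all of a patient's points being joined together.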

This is the subplotting function...with some extra bells and whistles :)

def plotByGroup(df, group, xCol, yCol, title = "", xLabel = "", yLabel = "", lineColors = ["red", "orange", "yellow", "green", "blue", "purple"], lineWidth = 2, lineOpacity = 0.7, plotStyle = 'ggplot', showLegend = False):
    """
    Plot multiple lines from a Pandas Data Frame for each group using DataFrame.groupby() and MatPlotLib PyPlot.
    @params
        df          - Required  - Data Frame    - Pandas Data Frame
        group       - Required  - String        - Column name to group on           
        xCol        - Required  - String        - Column name for X axis data
        yCol        - Required  - String        - Column name for y axis data
        title       - Optional  - String        - Plot Title
        xLabel      - Optional  - String        - X axis label
        yLabel      - Optional  - String        - Y axis label
        lineColors  - Optional  - List          - Colors to plot multiple lines
        lineWidth   - Optional  - Integer       - Width of lines to plot
        lineOpacity - Optional  - Float         - Alpha of lines to plot
        plotStyle   - Optional  - String        - MatPlotLib plot style
        showLegend  - Optional  - Boolean       - Show legend
    @return
        MatPlotLib Plot Object

    """
    # Import MatPlotLib Plotting Function & Set Style
    from matplotlib import pyplot as plt
    plt.style.use(plotStyle)                # Use plt.style; the bare matplotlib module is never imported here
    figure, axis = plt.subplots()           # Initialize Figure with a single Axes
    grouped = df.groupby(group)             # Set Group
    for i, (idx, grp) in enumerate(grouped):
        colorIndex = i % len(lineColors)    # Cycle through the line colors
        lineLabel = grp[group].values[0]    # Get a group label from first position
        xValues = grp[xCol]                 # Get x vector
        yValues = grp[yCol]                 # Get y vector
        # Draw one line per group on the shared Axes
        axis.plot(xValues, yValues, label = lineLabel, color = lineColors[colorIndex], lw = lineWidth, alpha = lineOpacity)
    # Plot legend once, after all lines are drawn
    if showLegend:
        axis.legend()
    # Set title & Labels on the same Axes the lines were drawn on
    axis.set_title(title)
    axis.set_xlabel(xLabel)
    axis.set_ylabel(yLabel)
    # Return plot for saving, showing, etc.
    return plt

And to use it...

import pandas

# Load the Data into Pandas
df = pandas.read_csv('data.csv')    

#
# Plotting - by Patient
#

# Create Patient Grouping
patientGroup = df.groupby('pt #')

# Iterate Over Groups
for idx, patientDf in patientGroup:
    # Let's give them specific titles
    plotTitle = "Gene Frequency over Time by Gene (Patient %s)" % str(patientDf['pt #'].values[0])
    # Call the subplot function
    plot = plotByGroup(patientDf, 'gene', 'xaxis', 'yaxis', title = plotTitle, xLabel = "Days", yLabel = "Gene Frequency")
    # Add Vertical Lines at Assay Timepoints
    timepoints = set(patientDf.xaxis.values)
    for timepoint in timepoints:
        plot.axvline(x = timepoint, linewidth = 1, linestyle = "dashed", color='gray', alpha = 0.4)
    # Let's see it
    plot.show()

And of course, we can do the same by gene.

#
# Plotting - by Gene
#

# Create Gene Grouping
geneGroup   = df.groupby('gene')

# Generate Plots for Groups
for idx, geneDf in geneGroup:
    plotTitle = "%s Gene Frequency over Time by Patient" % str(geneDf['gene'].values[0])
    # Keyword names must match the function signature (xLabel / yLabel)
    plot = plotByGroup(geneDf, 'pt #', 'xaxis', 'yaxis', title = plotTitle, xLabel = "Days", yLabel = "Frequency")
    plot.show()
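
If one figure per patient becomes unwieldy with about 80 patients, the same grouping idea can be laid out as a grid of subplots. The following is a rough sketch, not part of the original answer: `plot_patient_grid`, its parameters and `demo_df` are hypothetical names introduced here, and the column names are assumed to match the sample data ('pt #', 'gene', 'xaxis', 'yaxis'). Each panel is one patient and each line within a panel is one gene.

```python
import math

import matplotlib.pyplot as plt
import pandas as pd


def plot_patient_grid(df, patient_col="pt #", gene_col="gene",
                      x_col="xaxis", y_col="yaxis", n_cols=2):
    """Hypothetical helper: one subplot per patient, one line per gene."""
    patients = list(df.groupby(patient_col))
    n_rows = math.ceil(len(patients) / n_cols)
    # squeeze=False keeps `axes` two-dimensional even for a single row
    fig, axes = plt.subplots(n_rows, n_cols, sharey=True, squeeze=False,
                             figsize=(5 * n_cols, 4 * n_rows))
    flat_axes = axes.ravel()
    for ax, (patient, patient_df) in zip(flat_axes, patients):
        # A separate plot() call per gene keeps the genes as separate lines
        for gene, gene_df in patient_df.groupby(gene_col):
            gene_df = gene_df.sort_values(x_col)
            ax.plot(gene_df[x_col], gene_df[y_col], marker="o", label=gene)
        ax.set_title(f"Patient {patient}")
        ax.set_xlabel("Days")
        ax.set_ylabel("% mutation")
        ax.legend()
    # Hide any panels left over when the grid is not completely filled
    for ax in flat_axes[len(patients):]:
        ax.set_visible(False)
    fig.tight_layout()
    return fig, axes


# Small demo frame built from a few of the sample rows
demo_df = pd.DataFrame({
    "gene": ["ASXL1-3", "ASXL1-3", "IDH1-3", "IDH1-3",
             "ASXL1-3", "ASXL1-3", "ASXL1-3"],
    "yaxis": [34, 3, 24, 7, 39, 8, 0],
    "xaxis": [1, 98, 1, 98, 1, 62, 119],
    "pt #": [3, 3, 3, 3, 4, 4, 4],
})
fig, axes = plot_patient_grid(demo_df)
```

Calling `plt.show()` afterwards displays the grid, and `fig.savefig(...)` writes it to a file. Swapping `patient_col` and `gene_col` gives the "by gene" layout, with one panel per gene and one line per patient.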

If this isn't what you're looking for, provide a clarification and I'll take another crack at it.
