Concatenate the strings of one column based on a condition on another column


Question

I have a data frame from which I want to remove duplicates in the column named "sample", and then add the string information from the gene and status columns to new columns, as shown in the attached pictures.

Thanks in advance.

Below is the modified version of the data frame, in which the gene entries in the rows are replaced by actual gene names.

Recommended answer

Here, df is your pandas DataFrame.

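import pandas as pd

def new_1(g):
    # Join all gene names of one sample group, e.g. "p53,gata,nb"
    return ','.join(g.gene)

def new_2(g):
    # Join "gene-status" pairs of one sample group, e.g. "p53-gain,gata-gain,nb-loss"
    return ','.join(g.gene + '-' + g.status)

# One row per sample, indexed by "sample"
new_1_data = df.groupby("sample").apply(new_1).to_frame(name="new_1")
new_2_data = df.groupby("sample").apply(new_2).to_frame(name="new_2")

# Attach the joined strings to the original rows, then keep one row per sample
new_data = pd.merge(new_1_data, new_2_data, on="sample")
new_df = pd.merge(df, new_data, on="sample").drop_duplicates("sample")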

In new_df, "sample" is already an ordinary column. If you also want the row index renumbered from 0 (discarding the old index), then add

new_df = new_df.reset_index(drop=True)

Lastly, as you did not specify which of the original duplicate rows to retain, I simply use the default behavior of drop_duplicates, which drops all but the first occurrence.
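To make the choice of retained row concrete, here is a minimal sketch of the `keep` parameter of `drop_duplicates`. The three-row frame is made up for illustration and is not the asker's data.

```python
import pandas as pd

# Hypothetical miniature frame, used only to illustrate `keep`.
demo = pd.DataFrame({
    "sample": ["ppar", "ppar", "srty"],
    "gene": ["p53", "gata", "nf1"],
})

# Default (keep="first"): the first row of each duplicated sample survives.
first_rows = demo.drop_duplicates("sample")

# keep="last": the last row of each duplicated sample survives instead.
last_rows = demo.drop_duplicates("sample", keep="last")

print(first_rows)
print(last_rows)
```

The joined columns are the same either way; only the gene and status values carried over from the surviving row differ.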

I converted your example to the following CSV file (delimited by ','), which I will call "data.csv".

sample,gene,status
ppar,p53,gain
ppar,gata,gain
ppar,nb,loss
srty,nf1,gain
srty,cat,gain
srty,cd23,gain
tygd,brac1,loss
tygd,brac2,gain
tygd,ras,loss

I loaded this data with

# Default delimiter is ','. Pass `sep` argument to specify delimiter.
df = pd.read_csv("data.csv")  

Running the code above and printing new_df produces the output

  sample   gene status            new_1                           new_2
0   ppar    p53   gain      p53,gata,nb      p53-gain,gata-gain,nb-loss
3   srty    nf1   gain     nf1,cat,cd23     nf1-gain,cat-gain,cd23-gain
6   tygd  brac1   loss  brac1,brac2,ras  brac1-loss,brac2-gain,ras-loss

This is exactly the expected output given in your example.
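If only the joined columns are needed, the same strings can probably be built in a single groupby pass. The following is a sketch, not the method used above: it uses pandas named aggregation, the helper column name `pair` is my own, and the result has one row per sample without the original gene and status columns.

```python
import io
import pandas as pd

# Same sample data as data.csv, inlined so the sketch is self-contained.
csv_text = """sample,gene,status
ppar,p53,gain
ppar,gata,gain
ppar,nb,loss
srty,nf1,gain
srty,cat,gain
srty,cd23,gain
tygd,brac1,loss
tygd,brac2,gain
tygd,ras,loss
"""
df = pd.read_csv(io.StringIO(csv_text))

# Build the "gene-status" label for every row before grouping.
# "pair" is a hypothetical helper column name.
labelled = df.assign(pair=df["gene"] + "-" + df["status"])

# One row per sample: comma-join the genes and the gene-status pairs.
summary = (
    labelled.groupby("sample", sort=False)
    .agg(new_1=("gene", ",".join), new_2=("pair", ",".join))
    .reset_index()
)
print(summary)
```

Here `reset_index()` turns the group key "sample" back into an ordinary column.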

Note that the left-most column of numbers (0, 3, 6) in the printed new_df is what remains of the merged dataframe's row index after the duplicates are dropped. When you write this dataframe to a file, you can exclude it by passing index=False to new_df.to_csv(...).
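As a small illustration of `index=False`, the sketch below writes a made-up two-row frame to an in-memory buffer rather than to a real file; the frame and buffer are assumptions for the demo.

```python
import io
import pandas as pd

# Hypothetical frame with a gappy index like the one left by drop_duplicates.
demo = pd.DataFrame(
    {"sample": ["ppar", "srty"], "new_1": ["p53,gata,nb", "nf1,cat,cd23"]},
    index=[0, 3],
)

# index=False omits the row index from the written CSV.
buffer = io.StringIO()
demo.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

Because the joined values themselves contain commas, pandas wraps them in double quotes in the output, so the file still parses back into the same columns.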

I checked the CSV file you emailed me. There is a space after the word "gene" in the header of your CSV file.

Change the first line of your CSV file from

sample,gene ,status

to

sample,gene,status
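If editing the file by hand is inconvenient, the header can also be cleaned after loading. This is a minimal sketch using an inline string in place of the emailed file, which I do not have.

```python
import io
import pandas as pd

# Inline stand-in for a CSV whose header has a trailing space after "gene".
raw = "sample,gene ,status\nppar,p53,gain\n"
df = pd.read_csv(io.StringIO(raw))

# Before: the column is literally named "gene " so df.gene would fail.
print(list(df.columns))

# Strip leading/trailing whitespace from every column name.
df.columns = df.columns.str.strip()
print(list(df.columns))
```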

Also, there are spaces in your entries. If you wish to remove them, you can use

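# Strip leading/trailing spaces from string entries; leave other values unchanged.
# (DataFrame.applymap is deprecated since pandas 2.1 in favor of DataFrame.map.)
df = df.applymap(lambda x: x.strip() if isinstance(x, str) else x)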
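A column-wise variant is sketched below as one possible alternative: it applies the vectorized `.str.strip()` only to columns that pandas reports as string-typed. The frame and the numeric column `count` are made up for illustration.

```python
import pandas as pd

# Hypothetical frame with stray spaces and one numeric column.
df = pd.DataFrame({
    "sample": [" ppar", "srty "],
    "gene": ["p53 ", " nf1"],
    "count": [1, 2],
})

# Strip only the string-typed columns; numeric columns are left alone.
for col in df.columns:
    if pd.api.types.is_string_dtype(df[col]):
        df[col] = df[col].str.strip()

print(df)
```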
