Concatenate the strings of one column based on a condition on another column
Problem description
I have a data frame from which I want to remove duplicates in the column named "sample" and then add the string information from the gene and status columns to new columns, as shown in the attached pictures.
Thanks in advance.
Below is a modified version of the data frame, in which the gene placeholders in the rows are replaced by actual gene names.
Recommended answer
Here, df is your Pandas DataFrame.
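import pandas as pd

def new_1(g):
    # Join all gene names of one sample into a comma-separated string
    return ','.join(g.gene)

def new_2(g):
    # Join the "gene-status" pairs of one sample
    return ','.join(g.gene + '-' + g.status)

new_1_data = df.groupby("sample").apply(new_1).to_frame(name="new_1")
new_2_data = df.groupby("sample").apply(new_2).to_frame(name="new_2")
new_data = pd.merge(new_1_data, new_2_data, on="sample")
# Attach the new columns to every original row, then keep one row per sample
new_df = pd.merge(df, new_data, on="sample").drop_duplicates("sample")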
In new_df, "sample" is already a regular column. If you also want the leftover row labels renumbered from 0, then add
new_df = new_df.reset_index(drop=True)
Lastly, as you did not specify which of the original duplicate rows to retain, I simply use the default behavior of Pandas, which keeps the first occurrence and drops the rest.
I converted your example to the following CSV file (delimited by ','), which I will call "data.csv".
sample,gene,status
ppar,p53,gain
ppar,gata,gain
ppar,nb,loss
srty,nf1,gain
srty,cat,gain
srty,cd23,gain
tygd,brac1,loss
tygd,brac2,gain
tygd,ras,loss
I loaded this data as
# Default delimiter is ','. Pass `sep` argument to specify delimiter.
df = pd.read_csv("data.csv")
Running the code above and printing the dataframe produces the output
  sample   gene status            new_1                           new_2
0   ppar    p53   gain      p53,gata,nb      p53-gain,gata-gain,nb-loss
3   srty    nf1   gain     nf1,cat,cd23     nf1-gain,cat-gain,cd23-gain
6   tygd  brac1   loss  brac1,brac2,ras  brac1-loss,brac2-gain,ras-loss
This is exactly the expected output given in your example.
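As a point of comparison, the same table can probably be built in a single groupby pass with named aggregation instead of two apply calls and two merges. This is only a sketch, not the method used above: the helper column name "pair" and the inline CSV text are assumptions made for illustration, and it keeps the first gene and status of each sample, mirroring the drop_duplicates default.

```python
import io
import pandas as pd

# Same sample data as "data.csv", inlined so the sketch is self-contained
csv_text = """sample,gene,status
ppar,p53,gain
ppar,gata,gain
ppar,nb,loss
srty,nf1,gain
srty,cat,gain
srty,cd23,gain
tygd,brac1,loss
tygd,brac2,gain
tygd,ras,loss
"""
df = pd.read_csv(io.StringIO(csv_text))

# "pair" is a hypothetical helper column holding "gene-status" per row
labelled = df.assign(pair=df["gene"] + "-" + df["status"])

# One output row per sample; sort=False keeps the samples in file order
summary = labelled.groupby("sample", sort=False).agg(
    gene=("gene", "first"),      # first gene of each sample
    status=("status", "first"),  # first status of each sample
    new_1=("gene", ",".join),    # all genes, comma-separated
    new_2=("pair", ",".join),    # all gene-status pairs, comma-separated
).reset_index()

print(summary)
```

Because the grouping already yields one row per sample, no drop_duplicates step is needed, and reset_index() turns "sample" back into a column with a clean 0, 1, 2 index.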
Note that the left-most column of numbers (0, 3, 6) is the remnant of the index of the original dataframe left over after the merges. When you write this dataframe to a file, you can exclude it by setting index=False in df.to_csv(...).
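To make the effect of index=False concrete, here is a small sketch that writes a cut-down frame to a string rather than a file. The two-column frame is an assumption made to keep the example short. Note that values containing commas are quoted in the output, since the comma is also the field delimiter.

```python
import pandas as pd

# A reduced stand-in for new_df, with the leftover index labels 0 and 3
new_df = pd.DataFrame(
    {"sample": ["ppar", "srty"], "new_1": ["p53,gata,nb", "nf1,cat,cd23"]},
    index=[0, 3],
)

# With no path given, to_csv returns the CSV text instead of writing a file
csv_text = new_df.to_csv(index=False)
print(csv_text)
```

If the quoting is unwanted downstream, choosing a different separator for the joined values (for example ';') is one option.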
I checked the CSV file you emailed me. There is a space after the word "gene" in the header of your CSV file.
Change the first line of your CSV file from
sample,gene ,status
to
sample,gene,status
Also, there are spaces in your entries. If you wish to remove them, you can
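# Strip spaces from string entries; non-string entries are left unchanged.
# In pandas 2.1 and later, applymap is deprecated in favor of DataFrame.map.
df = df.applymap(lambda x: x.strip() if isinstance(x, str) else x)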
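If editing the CSV file by hand is inconvenient, the stray spaces in the header can likely be cleaned up after loading instead. The following is a sketch under that assumption; the two-row inline CSV is made up to reproduce the "gene " header and the padded entries.

```python
import io
import pandas as pd

# Made-up sample with a trailing space in the "gene " header and in entries
raw = "sample,gene ,status\nppar , p53,gain\nppar,gata ,gain\n"
df = pd.read_csv(io.StringIO(raw))

# Strip whitespace from the column names
df.columns = df.columns.str.strip()

# Strip whitespace from every text column, leaving other dtypes untouched
for col in df.columns:
    if pd.api.types.is_string_dtype(df[col]):
        df[col] = df[col].str.strip()

print(df)
```

After this, df.groupby("sample") and the g.gene attribute access in the answer's functions should see the expected column names.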