How to create a column of ascending values based on unique values in another column in pandas

Question

I have a dataset where each row is a sample, and a column named "Sample_ID" identifies each sample (df1 below). Some samples are repeated multiple times (i.e. they have identical values for "Sample_ID"). I would like to generate a new column, which I'll call "Sample_code", that gives each sample a distinct name following a simple ascending pattern (e.g. SAMP001, SAMP002, SAMP003, etc.) from the first row to the last row of the table. However, rows with identical Sample_IDs need to have identical Sample_code values as well, so I can't simply generate an ascending set of sample names for the new column.

In the example data below, df1 represents my starting data. df2 is what I want to end up with: the Sample_code values ascend as you go down the rows, but rows where Sample_ID is duplicated share the same value.

I'm not sure where to start, so any help would be much appreciated. Thank you.

import numpy as np
import pandas as pd

# df1
data1 = {'Sample_ID': ['123123','123456','123123','123789','456789', '123654'], 
    'Variable_A': [15,12,7,19,3,12],
    'Variable_B':["blue","red","red","blue","blue", "red"]}
df1 = pd.DataFrame(data1)


# df2
data2 = {'Sample_ID': ['123123','123456','123123','123789','456789', '123654'],
     'Sample_code' : ['SAMP001', 'SAMP002', 'SAMP001', 'SAMP003', 'SAMP004', 'SAMP005'],
    'Variable_A': [15,12,7,19,3,12],
    'Variable_B':["blue","red","red","blue","blue", "red"]}
df2 = pd.DataFrame(data2)

df1
df2

EDIT: Ideally, I would like the ascending Sample_code names to follow the original order of the rows, because the rows in the starting dataset are ordered by date of collection. The Sample_code for a given sample should be based on the first time that sample appears as you go down the rows. A new illustrative df3 includes a date column to give a sense of what I mean.

# df3
data3 = {'Sample_ID': ['123123','123456','123123','123789','456789', 
'123654', '123123', '123789'], 
        'Date' : ['15/06/2019', '23/06/2019', '30/06/2019', '07/07/2019',
                  '15/07/2019', '31/07/2019', '12/08/2019', '27/08/2019'],
        'Variable_A': [15,12,7,19,3,12,7,9],
        'Variable_B':["blue","red","red","blue","blue", "red","blue", "red"]}
df3 = pd.DataFrame(data3)
df3

The solution suggested below works, but it creates Sample_code names based on the final row in which a repeated Sample_ID appears. For example, Sample_ID "123123" is labelled "SAMP006" (from the final row in which this value appears), whereas I'd like it to be "SAMP001" (from the first row in which it appears).

lookup = {}
for i, sample_name in enumerate(df3.Sample_ID):
    lookup[sample_name] = f'SAMP{i:03}'

df3['Sample_code'] = df3.Sample_ID.apply(lambda x: lookup[x])
df3


Recommended answer

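You can create a lookup table by iterating over the unique values and then map it onto a new column. Series.unique() returns values in order of first appearance, so each Sample_ID gets the code of the first row in which it occurs. The loop must run over unique(), not over the raw column: iterating over every row overwrites each entry with the index of its last occurrence, which is what produces "SAMP006" in the question's edit. Passing start=1 to enumerate makes the numbering begin at SAMP001 rather than SAMP000.

# Build the lookup from the unique IDs, in order of first appearance.
lookup = {}
for i, sample_name in enumerate(df3.Sample_ID.unique(), start=1):
    lookup[sample_name] = f'SAMP{i:03}'

# Map every row's Sample_ID to its code.
df3['Sample_code'] = df3.Sample_ID.map(lookup)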
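
As an alternative to the explicit loop, here is a minimal sketch using pd.factorize, which assigns integer codes (0, 1, 2, ...) in order of first appearance when called without sorting. This is not part of the original answer; the data is a trimmed copy of df3 from the question, and the "SAMP" prefix with three-digit padding is taken from the desired output in df2.

```python
import pandas as pd

# Trimmed copy of df3 from the question (Date and Variable_B omitted).
df3 = pd.DataFrame({
    'Sample_ID': ['123123', '123456', '123123', '123789', '456789',
                  '123654', '123123', '123789'],
    'Variable_A': [15, 12, 7, 19, 3, 12, 7, 9],
})

# factorize() returns one integer code per row plus the unique values;
# codes are assigned in order of first appearance, starting at 0.
codes, uniques = pd.factorize(df3['Sample_ID'])

# Shift to 1-based numbering and format as SAMP001, SAMP002, ...
df3['Sample_code'] = [f'SAMP{code + 1:03d}' for code in codes]

print(df3)
```

With this data, the three rows for "123123" all receive SAMP001 and both rows for "123789" receive SAMP003, matching the order of first appearance.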
