Moving row values between columns based on other column values in a Pandas DataFrame

Question

I have a pandas data frame with a list of organism names and their antibiotic sensitivities. I wish to consolidate all organisms into one column, in the DataFrame below, based on the following rules.

  1. If ORG1 == A, do nothing.
  2. If ORG1 != A and ORG2 == A, move the ORG2 value into the ORG1 column.
  3. If ORG1 != A and ORG3 == A, move the ORG3 value into the ORG1 column.

If condition 2 is met, then as well as moving the ORG2 value into the ORG1 column, also move the values in the AS20* columns into AS10*.

Similarly, if condition 3 is met, then as well as moving the ORG3 value into the ORG1 column, also move the values in the AS30* columns into AS10*.

I tried this myself by writing a function based on the rules above and had limited success based on the following:

If ORG2 == A:
       return ORG1.map(ORG2)

I got lost when I tried to sequentially map AS201 -> AS101, AS202 -> AS102, AS203 -> AS103 etc. based on the condition.

The other issue I have is that the organism names are not single letters, nor are they pretty. A in the example is equivalent to re.match('aureus') in my dataset.

Also, there are 20 AS columns for every ORG column and in excess of 150,000 records so I hope to make it generalizable for any number of antibiotic sensitivity results.

I am struggling a bit with it so a couple of shoves in the right direction would really help.

Thanks.


Index   ORG1    ORG2    ORG3    AB1    AS101    AS201   AS301     AB2   AS102   AS202 AS302
1          A     NaN     NaN    pen        S      NaN     NaN   dfluc       S     NaN   NaN
2          A       B       C    pen        R        S       S   dfluc       S       R     S
3          B       A       B    pen        S        S       R   dfluc       S       S     R
4          A     NaN     NaN    pen        R      NaN     NaN   dfluc       S     NaN   NaN
5          A     NaN     NaN    pen        R      NaN     NaN   dfluc       S     NaN   NaN
6          C       A       A    pen        S        R       R   dfluc       R       S     R
7          B     NaN       A    pen        R      NaN       S   dfluc       S     NaN     S
8          A       B       A    pen        R        R       R   dfluc       R       R     R
9          A     NaN     NaN    pen        R      NaN     NaN   dfluc       S     NaN   NaN

Recommended answer

We can select the rows where ORG1 != A and ORG2 == A with

mask = (df['ORG1'] != 'A') & (df['ORG2'] == 'A')

mask is then a boolean Series. To copy values from ORG2 to ORG1, we could then use

df.loc[mask, 'ORG1'] = df.loc[mask, 'ORG2']

or, since we know the value on the right is A, we could just use

df.loc[mask, 'ORG1'] = 'A'

Copying the AS columns can be done similarly.
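As a minimal sketch of what "similarly" could look like, the block below copies ORG2 and its AS20* results into ORG1 and AS10* for the masked rows. The toy values and the `sources`/`targets` lists are made up for illustration; only the column names come from the question.

```python
import numpy as np
import pandas as pd

# Toy frame using the question's column names; the values are invented.
df = pd.DataFrame({
    'ORG1': ['A', 'B', 'C'],
    'ORG2': [np.nan, 'A', 'A'],
    'AS101': ['S', 'S', 'S'],
    'AS201': [np.nan, 'R', 'R'],
    'AS102': ['S', 'S', 'R'],
    'AS202': [np.nan, 'R', 'S'],
})

# Rows where ORG1 is not A but ORG2 is.
mask = (df['ORG1'] != 'A') & (df['ORG2'] == 'A')

# Hypothetical pairing of source and target columns for this toy frame.
sources = ['ORG2', 'AS201', 'AS202']
targets = ['ORG1', 'AS101', 'AS102']

# Copy each source column into its target, only for the selected rows.
for src, dst in zip(sources, targets):
    df.loc[mask, dst] = df.loc[mask, src]

print(df)
```

Assigning through `df.loc[mask, column]` writes into `df` itself, so the update does not depend on whether an intermediate selection is a view or a copy.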

We can find rows whose column value contains some string like 'aureus' with

df[orgi].str.contains('aureus') == True

str.contains can take any regex pattern as its argument. See the docs: Vectorized String Methods.

Note: Usually it would be enough to use df[orgi].str.contains('aureus') (without the == True), but since df[orgi] might contain NaN values, we also need to map the NaNs to False, so we use df[orgi].str.contains('aureus') == True.
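The NaN handling can be checked on a small Series. This is a sketch with made-up organism names; it also shows the `na=False` parameter of `str.contains`, which is an alternative to the `== True` comparison and is not used in the original answer.

```python
import numpy as np
import pandas as pd

# Invented organism names with one missing entry; object dtype is assumed.
s = pd.Series(['Staphylococcus aureus', 'Escherichia coli', np.nan], dtype=object)

# Comparing with True turns the result for the missing entry into False.
eq_true = s.str.contains('aureus') == True

# The na parameter fills the result for missing entries directly.
na_false = s.str.contains('aureus', na=False)

print(eq_true.tolist())
print(na_false.tolist())
```

Both forms give a purely boolean mask that is safe to combine with `&` and to pass to `df.loc`.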

import pandas as pd

filename = 'data.txt'
df = pd.read_table(filename, sep=r'\s+')
print(df)
#    Index ORG1 ORG2 ORG3  AB1 AS101 AS201 AS301    AB2 AS102 AS202 AS302
# 0      1    A  NaN  NaN  pen     S   NaN   NaN  dfluc     S   NaN   NaN
# 1      2    A    B    C  pen     R     S     S  dfluc     S     R     S
# 2      3    B    A    B  pen     S     S     R  dfluc     S     S     R
# 3      4    A  NaN  NaN  pen     R   NaN   NaN  dfluc     S   NaN   NaN
# 4      5    A  NaN  NaN  pen     R   NaN   NaN  dfluc     S   NaN   NaN
# 5      6    C    A    A  pen     S     R     R  dfluc     R     S     R
# 6      7    B  NaN    A  pen     R   NaN     S  dfluc     S   NaN     S
# 7      8    A    B    A  pen     R     R     R  dfluc     R     R     R
# 8      9    A  NaN  NaN  pen     R   NaN   NaN  dfluc     S   NaN   NaN

for i in range(2, 4):
    orgi = 'ORG{i}'.format(i=i)
    # mask = (df['ORG1'] != 'A') & (df[orgi] == 'A')
    # "== False" / "== True" also turn NaN results into False
    mask = (df['ORG1'].str.contains('A') == False) & (df[orgi].str.contains('A') == True)
    # Move ORGi --> ORG1 (.loc avoids chained assignment)
    df.loc[mask, 'ORG1'] = df.loc[mask, orgi]
    for j in range(1, 4):
        # Move ASij --> AS1j
        source_as = 'AS{i}{j:02d}'.format(i=i, j=j)
        target_as = 'AS1{j:02d}'.format(j=j)
        # Skip AS columns that are not present in the frame
        if source_as in df.columns and target_as in df.columns:
            df.loc[mask, target_as] = df.loc[mask, source_as]

print(df)

yields

   Index ORG1 ORG2 ORG3  AB1 AS101 AS201 AS301    AB2 AS102 AS202 AS302
0      1    A  NaN  NaN  pen     S   NaN   NaN  dfluc     S   NaN   NaN
1      2    A    B    C  pen     R     S     S  dfluc     S     R     S
2      3    A    A    B  pen     S     S     R  dfluc     S     S     R
3      4    A  NaN  NaN  pen     R   NaN   NaN  dfluc     S   NaN   NaN
4      5    A  NaN  NaN  pen     R   NaN   NaN  dfluc     S   NaN   NaN
5      6    A    A    A  pen     R     R     R  dfluc     S     S     R
6      7    A  NaN    A  pen     S   NaN     S  dfluc     S   NaN     S
7      8    A    B    A  pen     R     R     R  dfluc     R     R     R
8      9    A  NaN  NaN  pen     R   NaN   NaN  dfluc     S   NaN   NaN

Note that if ORG2 == A and ORG3 == A, then the values in columns AS20* and AS30* both compete to overwrite the values in AS10*. I'm not sure which value you want to win. In the code above, the first match wins, which would be AS20*: once the ORG2 value has been moved into ORG1, the row no longer satisfies ORG1 != A when ORG3 is checked. Row 5 of the output shows this, where AS102 takes its value from AS202 rather than AS302.
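To cover any number of AS columns and make the precedence explicit, the loop can be wrapped in a function that discovers the AS columns from their names. This is a sketch, not the answer's own code: `consolidate`, `pattern`, and `n_orgs` are hypothetical names, it assumes the first matching ORG column should win, and it treats rows with a missing ORG1 as needing a move.

```python
import io
import re

import pandas as pd


def consolidate(df, pattern, n_orgs=3):
    """Hypothetical helper: move the first ORGi matching `pattern` into ORG1."""
    out = df.copy()
    # Rows whose ORG1 does not match yet (a missing ORG1 counts as no match).
    pending = ~out['ORG1'].str.contains(pattern, na=False)
    for i in range(2, n_orgs + 1):
        orgi = f'ORG{i}'
        mask = pending & out[orgi].str.contains(pattern, na=False)
        out.loc[mask, 'ORG1'] = out.loc[mask, orgi]
        # Copy every AS{i}xx column into the matching AS1xx column.
        for col in out.columns:
            m = re.fullmatch(rf'AS{i}(\d\d)', col)
            if m and f'AS1{m.group(1)}' in out.columns:
                out.loc[mask, f'AS1{m.group(1)}'] = out.loc[mask, col]
        # Rows handled here are skipped for later ORG columns: first match wins.
        pending &= ~mask
    return out


# The sample table from the question.
text = """\
Index ORG1 ORG2 ORG3 AB1 AS101 AS201 AS301 AB2 AS102 AS202 AS302
1 A NaN NaN pen S NaN NaN dfluc S NaN NaN
2 A B C pen R S S dfluc S R S
3 B A B pen S S R dfluc S S R
4 A NaN NaN pen R NaN NaN dfluc S NaN NaN
5 A NaN NaN pen R NaN NaN dfluc S NaN NaN
6 C A A pen S R R dfluc R S R
7 B NaN A pen R NaN S dfluc S NaN S
8 A B A pen R R R dfluc R R R
9 A NaN NaN pen R NaN NaN dfluc S NaN NaN
"""
df = pd.read_csv(io.StringIO(text), sep=r'\s+')
result = consolidate(df, 'A')
print(result)
```

For the real data, `pattern` would be something like 'aureus'. To let the last matching column win instead, the `pending &= ~mask` line could be dropped.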
