How to extract dataframe by row values by conditions with other columns?

Problem description

I have a dataframe as follows:

import pandas as pd

# values
a=["003C", "003P1", "003P1", "003P1", "004C", "004P1", "004P2", "003C", "003P2", "003P1", "003C", "003P1", "003P2", "003C", "003P1", "004C", "004P2", "001C", "001P1"]
b=["chr18", "chr20", "chr8", "chr8", "chr11", "chr11", "chr11", "chr11", "chr11", "chr11", "chr1", "chr1", "chr1", "chr1", "chr1", "chr11", "chr11", "chr9", "chr9"]
c=[48399,145653,244695,244695,1163940,1163940,1163940,5986513,5986513,5986513,248650751,248650751,248650751,125895,125895,2587895,2587895,14587952,14587952]
d=["C", "G", "C", "C", "C", "C", "C", "G", "G", "G", "T", "T", "T", "T", "T", "C", "C", "T", "T"]
e=["A", "T", "A", "A", "G", "G", "G", "A", "A", "A", "A", "A", "A", "A", "A", "G", "G", "C", "C"]
#Make dataframe
df = pd.DataFrame({'Sample':a, 'CHROM':b, 'POS':c, 'REF':d, 'ALT':e})

df

    Sample  CHROM   POS         REF  ALT
0   003C    chr18   48399       C    A
1   003P1   chr20   145653      G    T
2   003P1   chr8    244695      C    A
3   003P1   chr8    244695      C    A
4   004C    chr11   1163940     C    G
5   004P1   chr11   1163940     C    G
6   004P2   chr11   1163940     C    G
7   003C    chr11   5986513     G    A
8   003P2   chr11   5986513     G    A
9   003P1   chr11   5986513     G    A
10  003C    chr1    248650751   T    A
11  003P1   chr1    248650751   T    A
12  003P2   chr1    248650751   T    A
13  003C    chr1    125895      T    A
14  003P1   chr1    125895      T    A
15  004C    chr11   2587895     C    G
16  004P2   chr11   2587895     C    G
17  001C    chr9    14587952    T    C
18  001P1   chr9    14587952    T    C

I want to extract the rows where a C sample in df['Sample'] shares the same 'CHROM', 'POS', 'REF' and 'ALT' values with its P1, its P2, or both P1 and P2. For example, 003C has a corresponding 003P1 or 003P2 with all matching values of 'CHROM', 'POS', 'REF' and 'ALT'; see indices 7,8,9, then 13,14, and 10,11,12. I want to extract all of them:

The expected output is:

    Sample  CHROM   POS       REF   ALT
0   003C    chr1    125895     T    A
1   003P1   chr1    125895     T    A
2   004C    chr11   1163940    C    G
3   004P1   chr11   1163940    C    G
4   004P2   chr11   1163940    C    G
5   004C    chr11   2587895    C    G
6   004P2   chr11   2587895    C    G
7   003C    chr11   5986513    G    A
8   003P2   chr11   5986513    G    A
9   003P1   chr11   5986513    G    A
10  001C    chr9    14587952   T    C
11  001P1   chr9    14587952   T    C
12  003C    chr1    248650751  T    A
13  003P1   chr1    248650751  T    A
14  003P2   chr1    248650751  T    A

I tried the following code:

df[['INT','STR']] = df['Sample'].str.extract(r'(\d+)(.*)')
df = df[df.groupby(['CHROM', 'POS', 'REF', 'ALT', 'INT'])['STR'].transform('size').eq(3)]

But it only pulls the rows that are common to all three (C, P1 and P2), not the rows where C is common with P1 or P2.

Any help is appreciated. Thanks

Solution

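c = ['CHROM', 'POS', 'REF', 'ALT', 'INT']
# Split e.g. '003P1' into the numeric ID ('003') and the sample type ('P1')
df[['INT','STR']] = df['Sample'].str.extract(r'(\d+)(.*)')

# m: valid sample type; m1: group contains a C; m2: group has at least 2 distinct types
m  = df['STR'].isin(['C', 'P1', 'P2'])
m1 = df['STR'].eq('C').groupby([*df[c].values.T]).transform('any')
m2 = df['STR'].mask(~m).groupby([*df[c].values.T]).transform('nunique').ge(2)

# pandas 2.0 and later require the axis as a keyword, hence columns=...
df = df[m & m1 & m2].sort_values('POS', ignore_index=True).drop(columns=['INT', 'STR'])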

Explanations

Extract the columns INT and STR by using str.extract with a regex pattern

>>> df[['INT','STR']]

    INT STR
0   003   C
1   003  P1
2   003  P1
3   003  P1
4   004   C
5   004  P1
6   004  P2
7   003   C
8   003  P2
9   003  P1
10  003   C
11  003  P1
12  003  P2
13  003   C
14  003  P1
15  004   C
16  004  P2
17  001   C
18  001  P1
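
To see the extraction step in isolation, here is a minimal sketch on a three-element series. The series `samples` and the frame `parts` are hypothetical names used only for illustration; the regex is the one from the solution.

```python
import pandas as pd

# Hypothetical toy input, not the full data from the question
samples = pd.Series(["003C", "003P1", "004P2"], name="Sample")

# (\d+) captures the leading digits, (.*) captures whatever follows
parts = samples.str.extract(r"(\d+)(.*)")
parts.columns = ["INT", "STR"]
print(parts)
```

Because `\d+` is greedy and only matches digits, the split always falls at the first non-digit character.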

Create a boolean mask m using isin that is True wherever the extracted column STR is one of the values C, P1 or P2. In this data every row qualifies, so m is True throughout

>>> m

0     True
1     True
2     True
3     True
4     True
5     True
6     True
7     True
8     True
9     True
10    True
11    True
12    True
13    True
14    True
15    True
16    True
17    True
18    True
Name: STR, dtype: bool
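
Since m is all True for this data, its effect is easier to see on input that contains an unexpected suffix. A small sketch, assuming a made-up value "X" that does not occur in the question's data:

```python
import pandas as pd

# "X" is a hypothetical invalid sample type, added only to show a False entry
suffix = pd.Series(["C", "P1", "P2", "X"], name="STR")

m = suffix.isin(["C", "P1", "P2"])
print(m.tolist())
```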

Compare the STR column with C to create a boolean mask, then group this mask on the columns ['CHROM', 'POS', 'REF', 'ALT', 'INT'] and transform using any to create a boolean mask m1. m1 is True for every row whose group contains a C sample

>>> m1
0      True
1     False
2     False
3     False
4      True
5      True
6      True
7      True
8      True
9      True
10     True
11     True
12     True
13     True
14     True
15     True
16     True
17     True
18     True
Name: STR, dtype: bool
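
The group-wise any can be sketched on a toy frame. Here a single hypothetical key column POS stands in for the five grouping columns of the solution:

```python
import pandas as pd

# Toy data: group 100 contains a C sample, group 200 does not
toy = pd.DataFrame({
    "POS": [100, 100, 200, 200],
    "STR": ["C", "P1", "P1", "P2"],
})

# transform("any") broadcasts the per-group result back to every row
m1 = toy["STR"].eq("C").groupby(toy["POS"]).transform("any")
print(m1.tolist())
```

Using transform rather than a plain aggregation keeps the result aligned with the original index, so it can be combined with the other masks directly.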

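Mask the values in column STR where the boolean mask m is False (this replaces anything other than C, P1 and P2 with NaN), then group this masked column by ['CHROM', 'POS', 'REF', 'ALT', 'INT'] and transform using nunique, then chain with ge(2) to create a boolean mask m2. m2 is True for every row whose group contains at least two distinct sample types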

>>> m2

0     False
1     False
2     False
3     False
4      True
5      True
6      True
7      True
8      True
9      True
10     True
11     True
12     True
13     True
14     True
15     True
16     True
17     True
18     True
Name: STR, dtype: bool
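
A minimal sketch of the mask-then-nunique step, again with a hypothetical single key column POS and a made-up invalid suffix "X". It relies on nunique ignoring NaN by default:

```python
import pandas as pd

toy = pd.DataFrame({
    "POS": [100, 100, 200, 200, 300, 300],
    "STR": ["C", "P1", "C", "X", "P1", "P1"],
})

m = toy["STR"].isin(["C", "P1", "P2"])
masked = toy["STR"].mask(~m)  # "X" becomes NaN and is not counted below

# Group 100 has 2 distinct types; group 200 has 1 (C only); group 300 has 1 (P1 only)
m2 = masked.groupby(toy["POS"]).transform("nunique").ge(2)
print(m2.tolist())
```

Note that m2 alone would also accept a group holding only P1 and P2, which is why the solution combines it with m1.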

Now take the logical and of the masks m, m1 and m2, and use this to filter the required rows in the dataframe

>>> df[m & m1 & m2].sort_values('POS', ignore_index=True).drop(columns=['INT', 'STR'])

   Sample  CHROM        POS REF ALT
0    003C   chr1     125895   T   A
1   003P1   chr1     125895   T   A
2    004C  chr11    1163940   C   G
3   004P1  chr11    1163940   C   G
4   004P2  chr11    1163940   C   G
5    004C  chr11    2587895   C   G
6   004P2  chr11    2587895   C   G
7    003C  chr11    5986513   G   A
8   003P2  chr11    5986513   G   A
9   003P1  chr11    5986513   G   A
10   001C   chr9   14587952   T   C
11  001P1   chr9   14587952   T   C
12   003C   chr1  248650751   T   A
13  003P1   chr1  248650751   T   A
14  003P2   chr1  248650751   T   A
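
For reuse, the same three masks can be wrapped in a function that leaves the input frame untouched. This is a sketch rather than part of the original answer: the function name `extract_matched` and its `keys` parameter are hypothetical, the grouping uses a list of Series instead of `df[c].values.T`, and `kind="stable"` is an added assumption so that rows with equal POS keep their original order.

```python
import pandas as pd

def extract_matched(df, keys=("CHROM", "POS", "REF", "ALT")):
    """Keep rows whose (keys + sample ID) group has a C and at least 2 sample types."""
    parts = df["Sample"].str.extract(r"(\d+)(.*)")
    work = df.assign(INT=parts[0], STR=parts[1])   # helper columns on a copy
    by = [work[k] for k in (*keys, "INT")]

    m = work["STR"].isin(["C", "P1", "P2"])
    m1 = work["STR"].eq("C").groupby(by).transform("any")
    m2 = work["STR"].mask(~m).groupby(by).transform("nunique").ge(2)

    # Filter the original frame, so no helper columns need to be dropped
    return df[m & m1 & m2].sort_values("POS", kind="stable", ignore_index=True)

# Data from the question
df = pd.DataFrame({
    "Sample": ["003C", "003P1", "003P1", "003P1", "004C", "004P1", "004P2",
               "003C", "003P2", "003P1", "003C", "003P1", "003P2", "003C",
               "003P1", "004C", "004P2", "001C", "001P1"],
    "CHROM": ["chr18", "chr20", "chr8", "chr8", "chr11", "chr11", "chr11",
              "chr11", "chr11", "chr11", "chr1", "chr1", "chr1", "chr1",
              "chr1", "chr11", "chr11", "chr9", "chr9"],
    "POS": [48399, 145653, 244695, 244695, 1163940, 1163940, 1163940,
            5986513, 5986513, 5986513, 248650751, 248650751, 248650751,
            125895, 125895, 2587895, 2587895, 14587952, 14587952],
    "REF": ["C", "G", "C", "C", "C", "C", "C", "G", "G", "G",
            "T", "T", "T", "T", "T", "C", "C", "T", "T"],
    "ALT": ["A", "T", "A", "A", "G", "G", "G", "A", "A", "A",
            "A", "A", "A", "A", "A", "G", "G", "C", "C"],
})

result = extract_matched(df)
print(result)
```

Filtering `df` with masks computed on the working copy works because both frames share the same index.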
