How to extract dataframe rows by row values, subject to conditions on other columns?
I have a dataframe as follows:
import pandas as pd

# values
a=["003C", "003P1", "003P1", "003P1", "004C", "004P1", "004P2", "003C", "003P2", "003P1", "003C", "003P1", "003P2", "003C", "003P1", "004C", "004P2", "001C", "001P1"]
b=["chr18", "chr20", "chr8", "chr8", "chr11", "chr11", "chr11", "chr11", "chr11", "chr11", "chr1", "chr1", "chr1", "chr1", "chr1", "chr11", "chr11", "chr9", "chr9"]
c=[48399,145653,244695,244695,1163940,1163940,1163940,5986513,5986513,5986513,248650751,248650751,248650751,125895,125895,2587895,2587895,14587952,14587952]
d=["C", "G", "C", "C", "C", "C", "C", "G", "G", "G", "T", "T", "T", "T", "T", "C", "C", "T", "T"]
e=["A", "T", "A", "A", "G", "G", "G", "A", "A", "A", "A", "A", "A", "A", "A", "G", "G", "C", "C"]
#Make dataframe
df = pd.DataFrame({'Sample':a, 'CHROM':b, 'POS':c, 'REF':d, 'ALT':e})
df
Sample CHROM POS REF ALT
0 003C chr18 48399 C A
1 003P1 chr20 145653 G T
2 003P1 chr8 244695 C A
3 003P1 chr8 244695 C A
4 004C chr11 1163940 C G
5 004P1 chr11 1163940 C G
6 004P2 chr11 1163940 C G
7 003C chr11 5986513 G A
8 003P2 chr11 5986513 G A
9 003P1 chr11 5986513 G A
10 003C chr1 248650751 T A
11 003P1 chr1 248650751 T A
12 003P2 chr1 248650751 T A
13 003C chr1 125895 T A
14 003P1 chr1 125895 T A
15 004C chr11 2587895 C G
16 004P2 chr11 2587895 C G
17 001C chr9 14587952 T C
18 001P1 chr9 14587952 T C
I want to extract the rows where a C sample in df['Sample'] shares the same 'CHROM', 'POS', 'REF' and 'ALT' values with its P1 sample, its P2 sample, or both P1 and P2.
For example, 003C has a corresponding 003P1 and/or 003P2 with all of 'CHROM', 'POS', 'REF' and 'ALT' matching: see indices 7, 8, 9, then 13, 14, then 10, 11, 12. I want to extract all such rows.
The expected output is:
Sample CHROM POS REF ALT
0 003C chr1 125895 T A
1 003P1 chr1 125895 T A
2 004C chr11 1163940 C G
3 004P1 chr11 1163940 C G
4 004P2 chr11 1163940 C G
5 004C chr11 2587895 C G
6 004P2 chr11 2587895 C G
7 003C chr11 5986513 G A
8 003P2 chr11 5986513 G A
9 003P1 chr11 5986513 G A
10 001C chr9 14587952 T C
11 001P1 chr9 14587952 T C
12 003C chr1 248650751 T A
13 003P1 chr1 248650751 T A
14 003P2 chr1 248650751 T A
I tried the following code:
df[['INT','STR']] = df['Sample'].str.extract(r'(\d+)(.*)')
df = df[df.groupby(['CHROM', 'POS', 'REF', 'ALT', 'INT'])['STR'].transform('size').eq(3)]
But it only keeps the groups where all three of C, P1 and P2 are present, not the groups where C is paired with P1 or P2 alone.
Any help is appreciated. Thanks.
Solution
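# Columns that identify one variant within one sample family
c = ['CHROM', 'POS', 'REF', 'ALT', 'INT']
# Split Sample into its numeric prefix (INT) and its suffix (STR)
df[['INT','STR']] = df['Sample'].str.extract(r'(\d+)(.*)')
# m: the suffix is one of C, P1, P2
m = df['STR'].isin(['C', 'P1', 'P2'])
# m1: the row's group contains a C row
m1 = df['STR'].eq('C').groupby([*df[c].values.T]).transform('any')
# m2: the row's group contains at least two distinct valid suffixes
m2 = df['STR'].mask(~m).groupby([*df[c].values.T]).transform('nunique').ge(2)
# drop(columns=...) is used because newer pandas versions reject a positional axis argument
df = df[m & m1 & m2].sort_values('POS', ignore_index=True).drop(columns=['INT', 'STR'])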
Explanations
Extract the columns INT and STR by using str.extract with a regex pattern:
>>> df[['INT','STR']]
INT STR
0 003 C
1 003 P1
2 003 P1
3 003 P1
4 004 C
5 004 P1
6 004 P2
7 003 C
8 003 P2
9 003 P1
10 003 C
11 003 P1
12 003 P2
13 003 C
14 003 P1
15 004 C
16 004 P2
17 001 C
18 001 P1
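As a minimal sketch of this extraction step on its own, the snippet below runs the same regex on a short, hand-picked list of sample names (the variable names `samples` and `parts` are only for illustration). Each capture group becomes one column, and the numeric prefix stays a string, so leading zeros are preserved.

```python
import pandas as pd

# A few sample names in the same "digits + suffix" shape as the question's data
samples = pd.Series(["003C", "003P1", "004P2", "001C"], name="Sample")

# Two capture groups give two columns: the leading digits, then the remainder
parts = samples.str.extract(r"(\d+)(.*)")
parts.columns = ["INT", "STR"]

print(parts)
```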
Create a boolean mask m using isin to check that the extracted column STR contains only the values C, P1 and P2:
>>> m
0 True
1 True
2 True
3 True
4 True
5 True
6 True
7 True
8 True
9 True
10 True
11 True
12 True
13 True
14 True
15 True
16 True
17 True
18 True
Name: STR, dtype: bool
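In the question's data every suffix is valid, so m is True everywhere. To see what the mask guards against, here is a small sketch with one made-up suffix, "X9", which does not occur in the original data and is purely an assumption for illustration.

```python
import pandas as pd

# "X9" is a hypothetical suffix, added only to show a False entry
suffix = pd.Series(["C", "P1", "P2", "X9"], name="STR")

# True where the suffix is one of the three accepted values
valid = suffix.isin(["C", "P1", "P2"])

print(valid.tolist())  # [True, True, True, False]
```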
Compare the STR column with C to create a boolean mask, then group this mask on the columns ['CHROM', 'POS', 'REF', 'ALT', 'INT'] and transform using any to create a boolean mask m1, which is True for every row whose group contains a C sample:
>>> m1
0 True
1 False
2 False
3 False
4 True
5 True
6 True
7 True
8 True
9 True
10 True
11 True
12 True
13 True
14 True
15 True
16 True
17 True
18 True
Name: STR, dtype: bool
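The expression [*df[c].values.T] unpacks the key columns into a list of arrays, one per column, which groupby accepts as grouping keys. A sketch of the same idea on toy data (invented here, with only two key columns) passes a list of Series instead, which some readers may find easier to follow.

```python
import pandas as pd

# Toy data: the first two rows form one group, the last two rows another
toy = pd.DataFrame({
    "POS": [100, 100, 200, 200],
    "INT": ["003", "003", "003", "003"],
    "STR": ["C", "P1", "P1", "P1"],
})
keys = ["POS", "INT"]

# Row-level test: is this row a C sample?
is_c = toy["STR"].eq("C")

# Broadcast "any C in my group" back onto every row of that group
has_c = is_c.groupby([toy[k] for k in keys]).transform("any")

print(has_c.tolist())  # [True, True, False, False]
```

The second group holds only P1 rows, so none of its rows pass, which mirrors rows 2 and 3 of the question's data.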
Mask the values in column STR where the boolean mask m is False (the code calls mask(~m)), then group this masked column by ['CHROM', 'POS', 'REF', 'ALT', 'INT'] and transform using nunique, then chain with ge(2) to create a boolean mask m2, which is True for groups holding at least two distinct suffixes:
>>> m2
0 False
1 False
2 False
3 False
4 True
5 True
6 True
7 True
8 True
9 True
10 True
11 True
12 True
13 True
14 True
15 True
16 True
17 True
18 True
Name: STR, dtype: bool
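A small sketch of this counting step, again on invented toy data with a made-up "X9" suffix: masking turns invalid suffixes into missing values, and nunique ignores missing values by default, so they do not count toward the two distinct suffixes a group needs.

```python
import pandas as pd

toy = pd.DataFrame({
    "POS": [100, 100, 200, 200, 300, 300],
    "STR": ["C", "P1", "P1", "P1", "C", "X9"],  # "X9" is a hypothetical suffix
})
valid = toy["STR"].isin(["C", "P1", "P2"])

# Invalid suffixes become missing, then each row gets its group's distinct count
distinct = toy["STR"].mask(~valid).groupby(toy["POS"]).transform("nunique")

# A group qualifies only with at least two distinct valid suffixes
enough = distinct.ge(2)

print(enough.tolist())
```

Only the first group (C together with P1) qualifies. A duplicated P1 or a C paired with an invalid suffix each count as a single distinct value.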
Now take the logical and of the masks m, m1 and m2, and use this to filter the required rows in the dataframe:
>>> df[m & m1 & m2].sort_values('POS', ignore_index=True).drop(columns=['INT', 'STR'])
Sample CHROM POS REF ALT
0 003C chr1 125895 T A
1 003P1 chr1 125895 T A
2 004C chr11 1163940 C G
3 004P1 chr11 1163940 C G
4 004P2 chr11 1163940 C G
5 004C chr11 2587895 C G
6 004P2 chr11 2587895 C G
7 003C chr11 5986513 G A
8 003P2 chr11 5986513 G A
9 003P1 chr11 5986513 G A
10 001C chr9 14587952 T C
11 001P1 chr9 14587952 T C
12 003C chr1 248650751 T A
13 003P1 chr1 248650751 T A
14 003P2 chr1 248650751 T A
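For convenience, the steps can be gathered into one self-contained script. This is only a sketch under a few assumptions that are not part of the original answer: the function name extract_shared is made up, the helper columns are added to a copy so the input frame is left unchanged, groupby receives a list of Series rather than unpacked arrays, and a stable sort is requested so rows with equal POS keep their original relative order.

```python
import pandas as pd


def extract_shared(df):
    """Keep rows whose variant is shared by a C sample and a P1 and/or P2 sample."""
    # Split Sample into numeric prefix and suffix, on a copy of the input
    parts = df["Sample"].str.extract(r"(\d+)(.*)")
    work = df.assign(INT=parts[0], STR=parts[1])

    keys = [work[k] for k in ["CHROM", "POS", "REF", "ALT", "INT"]]

    valid = work["STR"].isin(["C", "P1", "P2"])
    has_c = work["STR"].eq("C").groupby(keys).transform("any")
    two_kinds = work["STR"].mask(~valid).groupby(keys).transform("nunique").ge(2)

    out = work[valid & has_c & two_kinds]
    # Stable sort keeps the original order of rows that share a POS value
    out = out.sort_values("POS", kind="stable", ignore_index=True)
    return out.drop(columns=["INT", "STR"])


# Data copied from the question
a = ["003C", "003P1", "003P1", "003P1", "004C", "004P1", "004P2", "003C", "003P2", "003P1",
     "003C", "003P1", "003P2", "003C", "003P1", "004C", "004P2", "001C", "001P1"]
b = ["chr18", "chr20", "chr8", "chr8", "chr11", "chr11", "chr11", "chr11", "chr11", "chr11",
     "chr1", "chr1", "chr1", "chr1", "chr1", "chr11", "chr11", "chr9", "chr9"]
c = [48399, 145653, 244695, 244695, 1163940, 1163940, 1163940, 5986513, 5986513, 5986513,
     248650751, 248650751, 248650751, 125895, 125895, 2587895, 2587895, 14587952, 14587952]
d = ["C", "G", "C", "C", "C", "C", "C", "G", "G", "G", "T", "T", "T", "T", "T", "C", "C", "T", "T"]
e = ["A", "T", "A", "A", "G", "G", "G", "A", "A", "A", "A", "A", "A", "A", "A", "G", "G", "C", "C"]

df = pd.DataFrame({"Sample": a, "CHROM": b, "POS": c, "REF": d, "ALT": e})
result = extract_shared(df)
print(result)
```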