pandas extract regex allowing mismatches

Problem description

Pandas has a very fast and nice string method, extract(). This method works perfectly with a regex such as this one:

strict_pattern = r"^(?P<pre_spacer>ACGAG)(?P<UMI>.{9,13})(?P<post_spacer>TGGAGTCT)"

test_df

    R1
21  ACGAGTTTTCGTATTTTTGGAGTCTTGTGG
22  ACGAGTAGGGAGGGGGGTGGAGTCTCAGCG
23  ACGAGGGGGGGGAGGCTGGAGTCTCCGGGT
24  ACGAGAATAACGTTTGGTGGAGTCTACCAC
25  ACGAGGGGAATAAATATTGGAGTCTCCTCC
26  ACGAGATTGGGTATGCTGGAGTCTCTGTTC
27  ACGAGGTACCCGCGCCATGGAGTCTCTCTG
28  ACGAGTGGTTTTTGTCGTGGAGTCTCACCA
29  ACGAGACGTGTCCACCATGGAGTCTTGTCT

test_df.R1.str.extract(strict_pattern)

    pre_spacer  UMI     post_spacer
21  ACGAG   TTTTCGTATTTT    TGGAGTCT
22  ACGAG   TAGGGAGGGGGG    TGGAGTCT
23  ACGAG   GGGGGGGAGGC     TGGAGTCT
24  ACGAG   AATAACGTTTGG    TGGAGTCT
25  ACGAG   GGGAATAAATAT    TGGAGTCT
26  ACGAG   ATTGGGTATGC     TGGAGTCT
27  ACGAG   GTACCCGCGCCA    TGGAGTCT
28  ACGAG   TGGTTTTTGTCG    TGGAGTCT
29  ACGAG   ACGTGTCCACCA    TGGAGTCT

But since it uses re rather than the regex package (if I'm not wrong), it does not support a regex that allows mismatches, such as this one:

lax_pattern = r"^(?P<pre_spacer>ACGAG){s<=1}(?P<UMI>.{9,13})(?P<post_spacer>TGGAGTCT){s<=1}"

This regex allows one substitution in the pre_spacer and post_spacer sequences.
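
To make "one substitution" concrete: a substitution replaces a base in place, so the observed spacer keeps its length and differs from the expected one at a single position. The helper below is only an illustration (the name `substitutions` is made up here and is not part of pandas or regex); it counts differing positions between two equal-length sequences.

```python
def substitutions(expected: str, observed: str) -> int:
    """Count positions where two equal-length sequences differ (Hamming distance)."""
    if len(expected) != len(observed):
        raise ValueError("substitutions only compare equal-length sequences")
    return sum(a != b for a, b in zip(expected, observed))


print(substitutions("ACGAG", "ACGAG"))        # 0: exact spacer
print(substitutions("ACGAG", "ACTAG"))        # 1: within the {s<=1} budget
print(substitutions("TGGAGTCT", "TGCAGTGT"))  # 2: exceeds the {s<=1} budget
```

Insertions and deletions change the length of the sequence and are not covered by the `s` constraint.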

As shown in this example, the regex package allows this kind of regex:

import regex

seq = 'ACGAGCGCCCACCCGCCTGGAGTCTACCAACGGTAACAGCTG'
lax_pattern = r"^(?P<pre_spacer>ACGAG){s<=1}(?P<UMI>.{9,13})(?P<post_spacer>TGGAGTCT){s<=1}"
m = regex.match(lax_pattern, seq)
m.groupdict()

{'pre_spacer': 'ACGAG', 'UMI': 'CGCCCACCCGCC', 'post_spacer': 'TGGAGTCT'}

What I would like is to make extract() compatible with this kind of regex, or any fast workaround.

I have done this, but it is 12 times slower than extract(), and I deal with very big DataFrames.

import numpy as np
import regex

def extract_regex(pattern, seq):
    # Return the three named groups, or NaNs when the sequence does not match.
    m = regex.match(pattern, seq)
    try:
        d = m.groupdict()
        return list(d.values())
    except AttributeError:  # m is None: no match
        return [np.nan] * 3

test_df["pre_spacer"], test_df["UMI"], test_df["post_spacer"] = zip(*test_df.apply(lambda row: extract_regex(lax_pattern, row.R1), axis=1))

test_df

    R1  pre_spacer  UMI     post_spacer
21  ACGAGTTTTCGTATTTTTGGAGTCTTGTGG  ACGAG   TTTTCGTATTTT    TGGAGTCT
22  ACGAGTAGGGAGGGGGGTGGAGTCTCAGCG  ACGAG   TAGGGAGGGGGG    TGGAGTCT
23  ACGAGGGGGGGGAGGCTGGAGTCTCCGGGT  ACGAG   GGGGGGGAGGC     TGGAGTCT
24  ACGAGAATAACGTTTGGTGGAGTCTACCAC  ACGAG   AATAACGTTTGG    TGGAGTCT
25  ACGAGGGGAATAAATATTGGAGTCTCCTCC  ACGAG   GGGAATAAATAT    TGGAGTCT
26  ACGAGATTGGGTATGCTGGAGTCTCTGTTC  ACGAG   ATTGGGTATGC     TGGAGTCT
27  ACGAGGTACCCGCGCCATGGAGTCTCTCTG  ACGAG   GTACCCGCGCCA    TGGAGTCT
28  ACGAGTGGTTTTTGTCGTGGAGTCTCACCA  ACGAG   TGGTTTTTGTCG    TGGAGTCT
29  ACGAGACGTGTCCACCATGGAGTCTTGTCT  ACGAG   ACGTGTCCACCA    TGGAGTCT

Any ideas of how to tune the pandas extract() method or to provide the desired function with a similar speed?

Thanks in advance!

Paul.

Recommended answer

pandas' .extract is built on Python's standard re module, so until pandas supports the regex library you can't use these fuzzy-matching features in .extract.
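
For the narrow case in the question (at most one substitution in a short, fixed spacer), one possible workaround that is not part of the original answer is to spell the fuzzy group out as an ordinary alternation that re understands. The sketch below assumes only substitutions are allowed, and the helper name `allow_one_substitution` is invented for illustration. When a sequence can be split in more than one way, the groups captured may differ from what regex would return.

```python
import re


def allow_one_substitution(literal: str) -> str:
    """Build a plain-re alternation matching `literal` with at most one substituted character."""
    variants = [re.escape(literal)]  # exact spacer first
    for i in range(len(literal)):
        # Replace position i with "." so that any single character is accepted there.
        variants.append(re.escape(literal[:i]) + "." + re.escape(literal[i + 1:]))
    return "(?:" + "|".join(variants) + ")"


pre = allow_one_substitution("ACGAG")
post = allow_one_substitution("TGGAGTCT")
expanded_pattern = "^(?P<pre_spacer>" + pre + ")(?P<UMI>.{9,13})(?P<post_spacer>" + post + ")"

# "ACTAG" differs from "ACGAG" at one position, so this sequence should still match.
m = re.match(expanded_pattern, "ACTAGTTTTCGTATTTTTGGAGTCTTGTGG")
print(m.groupdict())

# Two substitutions in the pre-spacer are rejected.
print(re.match(expanded_pattern, "TCTAGTTTTCGTATTTTTGGAGTCTTGTGG"))
```

Because `expanded_pattern` uses only standard re syntax, it can be passed straight to `test_df.R1.str.extract(expanded_pattern)`. The alternation grows with the spacer length and with the number of allowed substitutions, so this is only practical for short literals and a small error budget.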

You will probably have to rely on .apply with a custom method:

import regex
import pandas as pd

test_df = pd.DataFrame({"R1": ['ACGAGTTTTCGTATTTTTGGAGTCTTGTGG', 'AAAAGGGA']})

lax_pattern = regex.compile(r"^(?P<pre_spacer>ACGAG){s<=1}(?P<UMI>.{9,13})(?P<post_spacer>TGGAGTCT){s<=1}")

# Row returned for sequences that do not match.
empty_val = pd.Series(["", "", ""], index=['pre_spacer', 'UMI', 'post_spacer'])

def extract_regex(seq):
    # Return the named groups as a Series so that .apply expands them into columns.
    m = lax_pattern.search(seq)
    if m:
        return pd.Series(list(m.groupdict().values()), index=['pre_spacer', 'UMI', 'post_spacer'])
    else:
        return empty_val


test_df[["pre_spacer","UMI","post_spacer"]] = test_df['R1'].apply(extract_regex)

Output:

>>> test_df
                               R1 pre_spacer           UMI post_spacer
0  ACGAGTTTTCGTATTTTTGGAGTCTTGTGG      ACGAG  TTTTCGTATTTT    TGGAGTCT
1                        AAAAGGGA                                     
