pandas extract regex allowing mismatches
Question
Pandas has a very fast and convenient string method, extract(). It works perfectly with a regex such as this one:
strict_pattern = r"^(?P<pre_spacer>ACGAG)(?P<UMI>.{9,13})(?P<post_spacer>TGGAGTCT)"
test_df
R1
21 ACGAGTTTTCGTATTTTTGGAGTCTTGTGG
22 ACGAGTAGGGAGGGGGGTGGAGTCTCAGCG
23 ACGAGGGGGGGGAGGCTGGAGTCTCCGGGT
24 ACGAGAATAACGTTTGGTGGAGTCTACCAC
25 ACGAGGGGAATAAATATTGGAGTCTCCTCC
26 ACGAGATTGGGTATGCTGGAGTCTCTGTTC
27 ACGAGGTACCCGCGCCATGGAGTCTCTCTG
28 ACGAGTGGTTTTTGTCGTGGAGTCTCACCA
29 ACGAGACGTGTCCACCATGGAGTCTTGTCT
test_df.R1.str.extract(strict_pattern)
pre_spacer UMI post_spacer
21 ACGAG TTTTCGTATTTT TGGAGTCT
22 ACGAG TAGGGAGGGGGG TGGAGTCT
23 ACGAG GGGGGGGAGGC TGGAGTCT
24 ACGAG AATAACGTTTGG TGGAGTCT
25 ACGAG GGGAATAAATAT TGGAGTCT
26 ACGAG ATTGGGTATGC TGGAGTCT
27 ACGAG GTACCCGCGCCA TGGAGTCT
28 ACGAG TGGTTTTTGTCG TGGAGTCT
29 ACGAG ACGTGTCCACCA TGGAGTCT
But since it uses the re module rather than the regex package (if I'm not wrong), it does not support regexes that allow mismatches, such as this one:
lax_pattern = r"^(?P<pre_spacer>ACGAG){s<=1}(?P<UMI>.{9,13})(?P<post_spacer>TGGAGTCT){s<=1}"
This regex allows one substitution in each of the pre_spacer and post_spacer sequences.
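To make "at most one substitution" concrete, here is a minimal pure-Python sketch of the check that `{s<=1}` expresses for a fixed-length spacer. The helper names `count_substitutions` and `within_substitutions` are hypothetical and only illustrate the idea; they are not part of pandas, re, or regex.

```python
def count_substitutions(observed, expected):
    """Count positions where two equal-length strings differ (Hamming distance)."""
    if len(observed) != len(expected):
        raise ValueError("substitutions are only defined for equal-length strings")
    return sum(a != b for a, b in zip(observed, expected))


def within_substitutions(observed, expected, max_subs=1):
    """Return True if `observed` equals `expected` up to `max_subs` substituted characters."""
    return (
        len(observed) == len(expected)
        and count_substitutions(observed, expected) <= max_subs
    )


# The exact spacer and a spacer with one changed base are both accepted;
# a spacer with two changed bases is rejected.
exact = within_substitutions("ACGAG", "ACGAG")
one_off = within_substitutions("ACGTG", "ACGAG")
two_off = within_substitutions("ATGTG", "ACGAG")
```

Insertions and deletions are deliberately not handled here, matching the substitution-only constraint in the pattern above.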
As shown in this example, the regex package supports this kind of pattern:
import regex

seq = 'ACGAGCGCCCACCCGCCTGGAGTCTACCAACGGTAACAGCTG'
lax_pattern = r"^(?P<pre_spacer>ACGAG){s<=1}(?P<UMI>.{9,13})(?P<post_spacer>TGGAGTCT){s<=1}"
m = regex.match(lax_pattern, seq)
m.groupdict()
{'pre_spacer': 'ACGAG', 'UMI': 'CGCCCACCCGCC', 'post_spacer': 'TGGAGTCT'}
What I would like is to make extract() compatible with this kind of regex, or to find any fast workaround.
I have done the following, but it is 12 times slower than extract(), and I deal with very large dataframes.
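import numpy as np
import regex

def extract_regex(pattern, seq):
    # regex.match returns None when the sequence does not match
    m = regex.match(pattern, seq)
    try:
        d = m.groupdict()
        return list(d.values())
    except AttributeError:
        # no match: fill the three columns with NaN
        return [np.nan] * 3

test_df["pre_spacer"], test_df["UMI"], test_df["post_spacer"] = zip(*test_df.apply(lambda row: extract_regex(lax_pattern, row.R1), axis=1))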
test_df
R1 pre_spacer UMI post_spacer
21 ACGAGTTTTCGTATTTTTGGAGTCTTGTGG ACGAG TTTTCGTATTTT TGGAGTCT
22 ACGAGTAGGGAGGGGGGTGGAGTCTCAGCG ACGAG TAGGGAGGGGGG TGGAGTCT
23 ACGAGGGGGGGGAGGCTGGAGTCTCCGGGT ACGAG GGGGGGGAGGC TGGAGTCT
24 ACGAGAATAACGTTTGGTGGAGTCTACCAC ACGAG AATAACGTTTGG TGGAGTCT
25 ACGAGGGGAATAAATATTGGAGTCTCCTCC ACGAG GGGAATAAATAT TGGAGTCT
26 ACGAGATTGGGTATGCTGGAGTCTCTGTTC ACGAG ATTGGGTATGC TGGAGTCT
27 ACGAGGTACCCGCGCCATGGAGTCTCTCTG ACGAG GTACCCGCGCCA TGGAGTCT
28 ACGAGTGGTTTTTGTCGTGGAGTCTCACCA ACGAG TGGTTTTTGTCG TGGAGTCT
29 ACGAGACGTGTCCACCATGGAGTCTTGTCT ACGAG ACGTGTCCACCA TGGAGTCT
Any ideas on how to tune the pandas extract() method, or how to provide the desired functionality at a similar speed?
Thanks in advance!
Paul.
Recommended answer
Until pandas uses the regex library internally (instead of re), you can't use these features in .extract.
You will probably have to rely on .apply with a custom method:
import regex
import pandas as pd

test_df = pd.DataFrame({"R1": ['ACGAGTTTTCGTATTTTTGGAGTCTTGTGG', 'AAAAGGGA']})
# Compile once so the pattern is not re-parsed for every row
lax_pattern = regex.compile(r"^(?P<pre_spacer>ACGAG){s<=1}(?P<UMI>.{9,13})(?P<post_spacer>TGGAGTCT){s<=1}")
# Returned for rows that do not match
empty_val = pd.Series(["", "", ""], index=['pre_spacer', 'UMI', 'post_spacer'])

def extract_regex(seq):
    m = lax_pattern.search(seq)
    if m:
        return pd.Series(list(m.groupdict().values()), index=['pre_spacer', 'UMI', 'post_spacer'])
    else:
        return empty_val

test_df[["pre_spacer", "UMI", "post_spacer"]] = test_df['R1'].apply(extract_regex)
Output:
>>> test_df
R1 pre_spacer UMI post_spacer
0 ACGAGTTTTCGTATTTTTGGAGTCTTGTGG ACGAG TTTTCGTATTTT TGGAGTCT
1 AAAAGGGA
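As a possible alternative that is not part of the answer above: when only substitutions are needed and the spacers are short, the fuzzy group can be expanded into an ordinary alternation that the standard re module understands. The sketch below assumes single-line sequences, so that `.` can stand for any substituted base. The helpers `expand_one_substitution` and `build_lax_pattern` are hypothetical names introduced for illustration, and the result is not guaranteed to split every sequence exactly as the regex package would.

```python
import re


def expand_one_substitution(literal):
    """Build a non-capturing alternation matching `literal` with at most one substituted character."""
    # The exact literal comes first so that exact matches are tried first
    variants = [re.escape(literal)]
    for i in range(len(literal)):
        # Replace position i with "." to allow any character there
        variants.append(re.escape(literal[:i]) + "." + re.escape(literal[i + 1:]))
    return "(?:" + "|".join(variants) + ")"


def build_lax_pattern(pre="ACGAG", post="TGGAGTCT"):
    """Assemble a plain `re` pattern with the same named groups as the question's pattern."""
    pre_alt = expand_one_substitution(pre)
    post_alt = expand_one_substitution(post)
    return f"^(?P<pre_spacer>{pre_alt})(?P<UMI>.{{9,13}})(?P<post_spacer>{post_alt})"


pattern = re.compile(build_lax_pattern())

# Exact spacers, one substitution in pre_spacer, and two substitutions in pre_spacer
exact = pattern.match("ACGAGTTTTCGTATTTTTGGAGTCTTGTGG")
one_sub = pattern.match("ACGTGTTTTCGTATTTTTGGAGTCTTGTGG")
two_subs = pattern.match("ATGTGTTTTCGTATTTTTGGAGTCTTGTGG")
```

Because the string returned by `build_lax_pattern()` is an ordinary re pattern with named groups, it should be usable directly in `test_df.R1.str.extract(...)`. Whether this reaches the speed of the strict pattern on large dataframes has not been measured here.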