Pandas DataFrame: check if a value exists using regex
Question
I have a big dataframe and I want to check whether any cell contains the string admin.
   col1                   col2  ...                   coln
0   323           roster_admin  ...              rota_user
1   542  assignment_rule_admin  ...      application_admin
2   123           contact_user  ...  configuration_manager
3   235         admin_incident  ...          incident_user
...  ...                   ...  ...                    ...
I tried df.isin(['*admin*']).any(), but it seems that isin doesn't support regex. How can I search through all columns using regex?
I have avoided loops because the dataframe contains over 10 million rows and many columns, and efficiency is important to me.
Recommended answer
There are two solutions:
- The df.col.apply method is more straightforward, but also slower:
In [1]: import pandas as pd
In [2]: import re
In [3]: df = pd.DataFrame({'col1':[1,2,3,4,5], 'col2':['admin', 'aa', 'bb', 'c_admin_d', 'ee_admin']})
In [4]: df
Out[4]:
col1 col2
0 1 admin
1 2 aa
2 3 bb
3 4 c_admin_d
4 5 ee_admin
In [5]: r = re.compile(r'.*(admin).*')
In [6]: df.col2.apply(lambda x: bool(r.match(x)))
Out[6]:
0 True
1 False
2 False
3 True
4 True
Name: col2, dtype: bool
In [7]: %timeit -n 100000 df.col2.apply(lambda x: bool(r.match(x)))
167 µs ± 1.02 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)
- The np.vectorize method requires importing numpy, but it is more efficient (about 4 times faster in my timeit test):
In [1]: import numpy as np
In [2]: import pandas as pd
In [3]: import re
In [4]: df = pd.DataFrame({'col1':[1,2,3,4,5], 'col2':['admin', 'aa', 'bb', 'c_admin_d', 'ee_admin']})
In [5]: df
Out[5]:
col1 col2
0 1 admin
1 2 aa
2 3 bb
3 4 c_admin_d
4 5 ee_admin
In [6]: r = re.compile(r'.*(admin).*')
In [7]: regmatch = np.vectorize(lambda x: bool(r.match(x)))
In [8]: regmatch(df.col2.values)
Out[8]: array([ True, False, False, True, True])
In [9]: %timeit -n 100000 regmatch(df.col2.values)
43.4 µs ± 362 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
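For reference, the two approaches above can be put side by side in one script. This is a minimal sketch and not the answer's exact code: it assumes re.search in place of re.match (so the leading and trailing .* are unnecessary), passes otypes=[bool] to np.vectorize, and uses Series.to_numpy() instead of .values. The names via_apply and via_vectorize are illustrative only.

```python
import re

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "col1": [1, 2, 3, 4, 5],
    "col2": ["admin", "aa", "bb", "c_admin_d", "ee_admin"],
})

# Assumption: re.search scans the whole string, so ".*" padding is not needed.
pattern = re.compile(r"admin")

# Approach 1: Series.apply calls the function once per element.
via_apply = df["col2"].apply(lambda x: bool(pattern.search(x)))

# Approach 2: np.vectorize wraps the same per-element function for arrays.
# otypes=[bool] fixes the output dtype up front.
regmatch = np.vectorize(lambda x: bool(pattern.search(x)), otypes=[bool])
via_vectorize = regmatch(df["col2"].to_numpy())

print(via_apply.tolist())
print(via_vectorize.tolist())
```

Both calls evaluate the regex once per element, so the two results should agree; any speed difference is best measured on your own data rather than on a five-row frame.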
Since you have changed your question to check any cell, and are also concerned about time efficiency:
# if you want to check all columns no matter what `dtypes` they are
dfs = df.astype(str, copy=True, errors='raise')
regmatch(dfs.values) # This will return a 2-d array of booleans
regmatch(dfs.values).any() # For existence.
You can still use the df.applymap method (renamed DataFrame.map in pandas 2.1, where applymap is deprecated), but again, it will be slower.
dfs = df.astype(str, copy=True, errors='raise')
r = re.compile(r'.*(admin).*')
dfs.applymap(lambda x: bool(r.match(x))) # This will return a dataframe of booleans.
dfs.applymap(lambda x: bool(r.match(x))).any().any() # For existence.
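As a possible alternative that is not part of the answer above, pandas' built-in Series.str.contains also accepts a regular expression and can be applied column by column after casting to strings. The sketch below uses sample data modeled on the question's table (the elided middle columns are left out); the names mask, exists and per_column are illustrative, and no claim is made here about its speed relative to the np.vectorize approach.

```python
import pandas as pd

# Sample data modeled on the question; the "..." columns are omitted.
df = pd.DataFrame({
    "col1": [323, 542, 123, 235],
    "col2": ["roster_admin", "assignment_rule_admin",
             "contact_user", "admin_incident"],
    "coln": ["rota_user", "application_admin",
             "configuration_manager", "incident_user"],
})

# Cast everything to strings so the .str accessor works on every column.
dfs = df.astype(str)

# One str.contains call per column; the pattern is treated as a regex.
mask = dfs.apply(lambda col: col.str.contains(r"admin", regex=True))

exists = bool(mask.to_numpy().any())  # does any cell match?
per_column = mask.any()               # which columns contain a match?

print(exists)
print(per_column.to_dict())
```

Here mask has the same shape as the dataframe, so it can also be used to locate the matching rows, for example with df[mask.any(axis=1)].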