Python Pandas removing substring using another column
Problem description
I've tried searching around and can't figure out an easy way to do this, so I'm hoping your expertise can help.
I have a pandas data frame with two columns:
import numpy as np
import pandas as pd

pd.options.display.width = 1000
testing = pd.DataFrame({
    'NAME': ['FIRST', np.nan, 'NAME2', 'NAME3', 'NAME4', 'NAME5', 'NAME6'],
    'FULL_NAME': ['FIRST LAST', np.nan, 'FIRST LAST', 'FIRST NAME3',
                  'FIRST NAME4 LAST', 'ANOTHER NAME', 'LAST NAME'],
})
This gives me

          FULL_NAME   NAME
0        FIRST LAST  FIRST
1               NaN    NaN
2        FIRST LAST  NAME2
3       FIRST NAME3  NAME3
4  FIRST NAME4 LAST  NAME4
5      ANOTHER NAME  NAME5
6         LAST NAME  NAME6
What I'd like to do is take the value from the 'NAME' column and remove it from the 'FULL_NAME' column if it's there. The function would then return

          FULL_NAME   NAME           NEW
0        FIRST LAST  FIRST          LAST
1               NaN    NaN           NaN
2        FIRST LAST  NAME2    FIRST LAST
3       FIRST NAME3  NAME3         FIRST
4  FIRST NAME4 LAST  NAME4    FIRST LAST
5      ANOTHER NAME  NAME5  ANOTHER NAME
6         LAST NAME  NAME6     LAST NAME
So far, I've defined the function below and am applying it with the apply method. This runs rather slowly on my large data set, though, and I'm hoping there's a more efficient way to do it. Thanks!
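import re

def address_remove(x):
    # Remove the NAME value from FULL_NAME, then trim surrounding whitespace.
    # Note: re.sub treats NAME as a regular expression pattern.
    try:
        newADDR1 = re.sub(x['NAME'], '', x['FULL_NAME'])
        return newADDR1.strip()
    except TypeError:
        # NaN (a float) in either column: return FULL_NAME unchanged.
        return x['FULL_NAME']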
Recommended answer
Here is one solution that is quite a bit faster than your current one, though I'm not convinced there isn't something faster still.
In [13]: import numpy as np
import pandas as pd

n = 1000
testing = pd.DataFrame({
    'NAME': ['FIRST', np.nan, 'NAME2', 'NAME3', 'NAME4', 'NAME5', 'NAME6'] * n,
    'FULL_NAME': ['FIRST LAST', np.nan, 'FIRST LAST', 'FIRST NAME3',
                  'FIRST NAME4 LAST', 'ANOTHER NAME', 'LAST NAME'] * n,
})
This is kind of a long one-liner, but it should do what you need.
The fastest solution I can come up with uses replace, as mentioned in another answer:
In [37]: %timeit testing['NEW2'] = [e.replace(k, '') for e, k in zip(testing.FULL_NAME.astype('str'), testing.NAME.astype('str'))]
100 loops, best of 3: 4.67 ms per loop
Original answer:
In [14]: %timeit testing['NEW'] = [''.join(str(e).split(k)) for e, k in zip(testing.FULL_NAME.astype('str'), testing.NAME.astype('str'))]
100 loops, best of 3: 7.24 ms per loop
Compared to your current solution:
In [16]: %timeit testing['NEW1'] = testing.apply(address_remove, axis=1)
10 loops, best of 3: 166 ms per loop
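These give essentially the same answer as your current solution, with two differences: the results are not whitespace-stripped, and NaN rows come out as empty strings rather than NaN, because astype('str') turns NaN into the string 'nan' here.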
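The list-comprehension approach can also be wrapped in a small helper that keeps missing values as NaN and strips surrounding whitespace, as the question's function does. This is a minimal sketch rather than the answer's own code: the helper name remove_name is hypothetical, and it assumes NAME should be matched as a literal substring (str.replace) rather than as a regular expression.

```python
import numpy as np
import pandas as pd


def remove_name(full_name, name):
    """Remove the literal substring `name` from `full_name` and strip outer whitespace.

    Hypothetical helper for illustration; not part of pandas.
    """
    # A missing FULL_NAME (NaN is a float, not a str) is returned unchanged.
    if not isinstance(full_name, str):
        return full_name
    # A missing NAME means there is nothing to remove.
    if not isinstance(name, str):
        return full_name.strip()
    return full_name.replace(name, '').strip()


testing = pd.DataFrame({
    'NAME': ['FIRST', np.nan, 'NAME2', 'NAME3', 'NAME4', 'NAME5', 'NAME6'],
    'FULL_NAME': ['FIRST LAST', np.nan, 'FIRST LAST', 'FIRST NAME3',
                  'FIRST NAME4 LAST', 'ANOTHER NAME', 'LAST NAME'],
})

# Iterate over the two columns in parallel instead of using DataFrame.apply.
testing['NEW'] = [remove_name(e, k)
                  for e, k in zip(testing['FULL_NAME'], testing['NAME'])]
print(testing)
```

Only the outer whitespace is stripped, so removing a name from the middle of a string (row 4) leaves two consecutive spaces, as the question's function also would.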