Python Pandas removing substring using another column
Question
I've tried searching around and can't figure out an easy way to do this, so I'm hoping your expertise can help.
I have a pandas DataFrame with two columns:
import numpy as np
import pandas as pd

pd.options.display.width = 1000
testing = pd.DataFrame({
    'NAME': ['FIRST', np.nan, 'NAME2', 'NAME3', 'NAME4', 'NAME5', 'NAME6'],
    'FULL_NAME': ['FIRST LAST', np.nan, 'FIRST LAST', 'FIRST NAME3',
                  'FIRST NAME4 LAST', 'ANOTHER NAME', 'LAST NAME'],
})
This gives me:
          FULL_NAME   NAME
0        FIRST LAST  FIRST
1               NaN    NaN
2        FIRST LAST  NAME2
3       FIRST NAME3  NAME3
4  FIRST NAME4 LAST  NAME4
5      ANOTHER NAME  NAME5
6         LAST NAME  NAME6
What I'd like to do is take the value from the 'NAME' column and remove it from the 'FULL_NAME' column if it's there. The function would then return:
          FULL_NAME   NAME           NEW
0        FIRST LAST  FIRST          LAST
1               NaN    NaN           NaN
2        FIRST LAST  NAME2    FIRST LAST
3       FIRST NAME3  NAME3         FIRST
4  FIRST NAME4 LAST  NAME4    FIRST LAST
5      ANOTHER NAME  NAME5  ANOTHER NAME
6         LAST NAME  NAME6     LAST NAME
So far, I've defined the function below and am calling it with the apply method. It runs rather slowly on my large data set, though, and I'm hoping there's a more efficient way to do it. Thanks!
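import re

def address_remove(x):
    # Remove the NAME value from FULL_NAME, then trim surrounding whitespace.
    try:
        newADDR1 = re.sub(x['NAME'], '', x['FULL_NAME'])
        return newADDR1.strip()
    except TypeError:
        # NaN (a float) in either column: return FULL_NAME unchanged.
        return x['FULL_NAME']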
Recommended answer
Here is one solution that is quite a bit faster than your current one, though I'm not convinced there isn't something faster still.
In [13]: import numpy as np
    ...: import pandas as pd
    ...: n = 1000
    ...: testing = pd.DataFrame({
    ...:     'NAME': ['FIRST', np.nan, 'NAME2', 'NAME3', 'NAME4', 'NAME5', 'NAME6'] * n,
    ...:     'FULL_NAME': ['FIRST LAST', np.nan, 'FIRST LAST', 'FIRST NAME3',
    ...:                   'FIRST NAME4 LAST', 'ANOTHER NAME', 'LAST NAME'] * n,
    ...: })
This is a fairly long one-liner, but it should do what you need.
The fastest solution I can come up with uses str.replace, as mentioned in another answer:
In [37]: %timeit testing['NEW2'] = [e.replace(k, '') for e, k in zip(testing.FULL_NAME.astype('str'), testing.NAME.astype('str'))]
100 loops, best of 3: 4.67 ms per loop
Original answer:
In [14]: %timeit testing['NEW'] = [''.join(str(e).split(k)) for e, k in zip(testing.FULL_NAME.astype('str'), testing.NAME.astype('str'))]
100 loops, best of 3: 7.24 ms per loop
Compared to your current solution:
In [16]: %timeit testing['NEW1'] = testing.apply(address_remove, axis=1)
10 loops, best of 3: 166 ms per loop
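%timeit is an IPython magic, so the timings above cannot be rerun in a plain script as written. A rough equivalent using the standard-library timeit module might look like the sketch below. The helper names with_replace and with_split_join are mine, and the sketch calls the built-in str() on each value instead of astype('str'), which I'm assuming is close enough for a timing comparison. Absolute numbers will differ by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

n = 1000
testing = pd.DataFrame({
    'NAME': ['FIRST', np.nan, 'NAME2', 'NAME3', 'NAME4', 'NAME5', 'NAME6'] * n,
    'FULL_NAME': ['FIRST LAST', np.nan, 'FIRST LAST', 'FIRST NAME3',
                  'FIRST NAME4 LAST', 'ANOTHER NAME', 'LAST NAME'] * n,
})


def with_replace(df):
    # str.replace on each (FULL_NAME, NAME) pair, both converted with str().
    return [str(e).replace(str(k), '')
            for e, k in zip(df['FULL_NAME'], df['NAME'])]


def with_split_join(df):
    # Split FULL_NAME on NAME and glue the remaining pieces back together.
    return [''.join(str(e).split(str(k)))
            for e, k in zip(df['FULL_NAME'], df['NAME'])]


# Call each approach 10 times and report the average time per call.
for func in (with_replace, with_split_join):
    seconds = timeit.timeit(lambda: func(testing), number=10) / 10
    print(f'{func.__name__}: {seconds * 1000:.2f} ms per loop')
```

Both helpers return a plain list, so the result can be assigned to a new column in the same way as in the one-liners above.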
These give you the same answer as your current solution.
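One caveat worth checking on your own data: the list comprehensions above cast both columns to strings first, so missing values may not come back as NaN, and the space left next to the removed name is not stripped. As a minimal sketch, assuming NAME should be treated as literal text rather than a regular expression, the hypothetical helper remove_name below (my name, not a pandas function) passes missing values through and collapses leftover whitespace.

```python
import numpy as np
import pandas as pd


def remove_name(full_name, name):
    """Remove `name` from `full_name`, leaving non-string values untouched."""
    # NaN is a float, so missing values in either column skip the replacement.
    if not isinstance(full_name, str) or not isinstance(name, str):
        return full_name
    # str.replace treats `name` literally; split/join collapses extra spaces.
    return ' '.join(full_name.replace(name, '').split())


testing = pd.DataFrame({
    'NAME': ['FIRST', np.nan, 'NAME2', 'NAME3', 'NAME4', 'NAME5', 'NAME6'],
    'FULL_NAME': ['FIRST LAST', np.nan, 'FIRST LAST', 'FIRST NAME3',
                  'FIRST NAME4 LAST', 'ANOTHER NAME', 'LAST NAME'],
})

testing['NEW'] = [remove_name(e, k)
                  for e, k in zip(testing['FULL_NAME'], testing['NAME'])]
print(testing[['FULL_NAME', 'NAME', 'NEW']])
```

This keeps the plain-Python loop over zip that makes the answer fast, at the cost of a couple of isinstance checks per row.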