Python Pandas使用另一列删除子字符串 [英] Python Pandas removing substring using another column

查看:326
本文介绍了Python Pandas使用另一列删除子字符串的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我已经尝试过搜索,但找不到一种简单的方法来完成此操作,因此希望您的专业知识可以为您提供帮助.

I've tried searching around and can't figure out an easy way to do this, so I'm hoping your expertise can help.

我有一个两列的熊猫数据框

I have a pandas data frame with two columns

import numpy as np
import pandas as pd

pd.options.display.width = 1000
testing = pd.DataFrame({'NAME':[
    'FIRST', np.nan, 'NAME2', 'NAME3', 
    'NAME4', 'NAME5', 'NAME6'], 'FULL_NAME':['FIRST LAST', np.nan, 'FIRST LAST', 'FIRST NAME3', 'FIRST NAME4 LAST', 'ANOTHER NAME', 'LAST NAME']})

这给了我

          FULL_NAME   NAME
0        FIRST LAST  FIRST
1               NaN    NaN
2        FIRST LAST  NAME2
3       FIRST NAME3  NAME3
4  FIRST NAME4 LAST  NAME4
5      ANOTHER NAME  NAME5
6         LAST NAME  NAME6

我想做的是从名称"列中获取值,然后从全名"列中删除(如果有的话).因此该函数将返回

what I'd like to do is take the values from the 'NAME' column and remove then from the 'FULL NAME' column if it's there. So the function would then return

          FULL_NAME   NAME           NEW
0        FIRST LAST  FIRST          LAST
1               NaN    NaN           NaN
2        FIRST LAST  NAME2    FIRST LAST
3       FIRST NAME3  NAME3         FIRST
4  FIRST NAME4 LAST  NAME4    FIRST LAST
5      ANOTHER NAME  NAME5  ANOTHER NAME
6         LAST NAME  NAME6     LAST NAME

到目前为止,我已经在下面定义了一个函数,并且正在使用apply方法.但是,这在我的大数据集上运行得相当慢,我希望有一种更有效的方法来执行此操作.谢谢!

So far, I've defined a function below and am using the apply method. This runs rather slow on my large data set though and I'm hoping there's a more efficient way to do it. Thanks!

def address_remove(x):
    try:
        newADDR1 = re.sub(x['NAME'], '', x[-1])
        newADDR1 = newADDR1.rstrip()
        newADDR1 = newADDR1.lstrip()
        return newADDR1
    except:
        return x[-1]

推荐答案

这是一个比您当前的解决方案快很多的解决方案,但我不相信这样做不会更快.

Here is one solution that is quite a bit faster than your current solution, I'm not convinced that there wouldn't be something faster though

In [13]: import numpy as np
         import pandas as pd
         n = 1000
         testing  = pd.DataFrame({'NAME':[
         'FIRST', np.nan, 'NAME2', 'NAME3', 
         'NAME4', 'NAME5', 'NAME6']*n, 'FULL_NAME':['FIRST LAST', np.nan, 'FIRST  LAST', 'FIRST NAME3', 'FIRST NAME4 LAST', 'ANOTHER NAME', 'LAST NAME']*n})

这是一个很长的班轮,但它应该可以满足您的需求

This is kind of a long one liner but it should do what you need

我可以想到的快速解决方案是使用replace,如另一个答案中所述:

Fasted solution I can come up with is using replace as mentioned in another answer:

In [37]: %timeit testing ['NEW2'] = [e.replace(k, '') for e, k in zip(testing.FULL_NAME.astype('str'), testing.NAME.astype('str'))]
100 loops, best of 3: 4.67 ms per loop

原始答案:

In [14]: %timeit testing ['NEW'] = [''.join(str(e).split(k)) for e, k in zip(testing.FULL_NAME.astype('str'), testing.NAME.astype('str'))]
100 loops, best of 3: 7.24 ms per loop

与您当前的解决方案相比:

compared to your current solution:

In [16]: %timeit testing['NEW1'] = testing.apply(address_remove, axis=1)
10 loops, best of 3: 166 ms per loop

这些将为您提供与当前解决方案相同的答案

These get you the same answer as your current solution

这篇关于Python Pandas使用另一列删除子字符串的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆