Removal of substring from all dataframe columns

Question

I have a list of around 1000 words, which I call negative words.

['CAST','ARTICLES','SANITARY','JAN','CLAUSES','SPECIAL','ENDORSEMENT']

I'll soon be making a dataframe out of this list of words.

I also have a dataframe which looks like this:

    FileName    PageNo     LineNo   GOODS_DESC                   
1   17668620    TM000004    36      CAST ARTICLES IRON SANITARY  
59  17668620    TM000014    41      CRATES                       
60  17668620    TM000014    42      CAST ARTICLES IRON           
61  17668620    TM000014    49      JAN ANIMAL AND VEGETABLE     
63  17668620    TM000016    49      SETTLING AGENT               
65  17668620    TM000016    29      JAN 
66  17668620    TM000016    32      CLAUSES SPECIAL CONDITIONS WARRANTIES   
67  17668620    TM000016    37      CARGO ISM ENDORSEMENT
69  17668620    TM000017    113     QUANTITY DECLARED IRON CRATES   

I want to remove the negative words from the dataframe as fast as possible and get a refined dataframe that looks like this:

    FileName    PageNo     LineNo   GOODS_DESC                   
1   17668620    TM000004    36      IRON 
59  17668620    TM000014    41      CRATES                       
60  17668620    TM000014    42      IRON             
61  17668620    TM000014    49      ANIMAL AND VEGETABLE     
63  17668620    TM000016    49      SETTLING AGENT               
65  17668620    TM000016    29      NaN
66  17668620    TM000016    32      CONDITIONS WARRANTIES   
67  17668620    TM000016    37      CARGO ISM
69  17668620    TM000017    113     QUANTITY DECLARED IRON CRATES   

Currently I iterate over the dataframe, split each row's GOODS_DESC into words, and check whether each word is in the negative words list. The words that are not in the list are joined into a new string and written back to the dataframe; rows left with no words are dropped.

for rows in df.itertuples():
    a = []
    flat_list = []
    a.append(rows.GOODS_DESC)
    flat_list = [item.strip() for sublist in a for item in sublist.split(' ') if item.strip()]
    flat_list = list(sorted(set(flat_list), key=flat_list.index))
    flat_list = [i for i in flat_list if i.lower() not in negative_words_list]

    if(not flat_list):
        df.drop(rows.Index,inplace=True)
        continue
    s=' '.join(flat_list)
    df.loc[rows.Index,'GOODS_DESC']=s
df['GOODS_DESC'] = df['GOODS_DESC'].str.upper()

The only problem with this approach is that it's too slow.

If you have any hints or ideas, please share. Can someone show me how to do this with a pandas dataframe in less time?

Recommended answer

Because the .str accessor in pandas is slow and loops internally, it may be better to just use a list comprehension:

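import re
import numpy as np

l = ['CAST','ARTICLES','SANITARY','JAN','CLAUSES','SPECIAL','ENDORSEMENT']
pattern = re.compile('|'.join(l))  # build the alternation once, not twice per row
# an empty result is falsy, so `or` turns it into NaN
df['GOODS_DESC'] = [pattern.sub('', i).strip() or np.nan for i in df.GOODS_DESC]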

Output:

    FileName    PageNo  LineNo                     GOODS_DESC
1   17668620  TM000004      36                           IRON
59  17668620  TM000014      41                         CRATES
60  17668620  TM000014      42                           IRON
61  17668620  TM000014      49           ANIMAL AND VEGETABLE
63  17668620  TM000016      49                 SETTLING AGENT
65  17668620  TM000016      29                            NaN
66  17668620  TM000016      32          CONDITIONS WARRANTIES
67  17668620  TM000016      37                      CARGO ISM
69  17668620  TM000017     113  QUANTITY DECLARED IRON CRATES
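One caveat with a plain alternation pattern is that it also matches inside longer words (for example `JAN` inside `JANUARY`) and can leave double spaces when a word is removed from the middle of a string. The following is a minimal sketch of a whole-word variant, assuming that only complete words should be removed; the helper name `remove_negative_words` and the extra sample rows are hypothetical and not part of the original answer.

```python
import re

import numpy as np
import pandas as pd

negative_words = ['CAST', 'ARTICLES', 'SANITARY', 'JAN',
                  'CLAUSES', 'SPECIAL', 'ENDORSEMENT']

# \b restricts matches to whole words; re.escape protects against any
# regex metacharacters in the word list. The pattern is compiled once.
pattern = re.compile(
    r'\b(?:' + '|'.join(map(re.escape, negative_words)) + r')\b'
)


def remove_negative_words(text):
    """Hypothetical helper: drop whole-word matches, then collapse spaces."""
    cleaned = ' '.join(pattern.sub(' ', text).split())
    return cleaned if cleaned else np.nan


# Small illustrative frame (sample rows are assumptions for the sketch).
df = pd.DataFrame({'GOODS_DESC': ['CAST ARTICLES IRON SANITARY',
                                  'JAN',
                                  'JANUARY BROADCAST',
                                  'IRON CAST CRATES']})
df['GOODS_DESC'] = [remove_negative_words(s) for s in df['GOODS_DESC']]
print(df)
```

With this pattern, `JANUARY BROADCAST` is left untouched, while a row made up only of negative words still becomes NaN.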

Timings

%timeit [re.sub('|'.join(l),'',i).strip() if re.sub('|'.join(l),'',i).strip() != '' else np.nan for i in df.GOODS_DESC]

89.6 µs ± 667 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Using the .str accessor:

%timeit df['GOODS_DESC'].str.replace('|'.join(l), '', regex=True).str.strip()

466 µs ± 10.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
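Since the question describes a split-and-check approach with around 1000 negative words, another option worth trying is to keep that logic but store the words in a `set` and run it inside a list comprehension instead of row-by-row `df.loc` assignments. This is only a sketch under those assumptions; the helper name `drop_negative_tokens` is hypothetical, and it has not been timed against the regex version above.

```python
import numpy as np
import pandas as pd

# A set gives constant-time membership checks, which matters more as the
# word list grows towards ~1000 entries.
negative_words = {'CAST', 'ARTICLES', 'SANITARY', 'JAN',
                  'CLAUSES', 'SPECIAL', 'ENDORSEMENT'}


def drop_negative_tokens(text):
    """Hypothetical helper: keep only the words not in negative_words."""
    kept = [word for word in text.split() if word not in negative_words]
    return ' '.join(kept) if kept else np.nan


# Small illustrative frame using rows from the question.
df = pd.DataFrame({'GOODS_DESC': ['CAST ARTICLES IRON SANITARY',
                                  'CRATES',
                                  'JAN',
                                  'CARGO ISM ENDORSEMENT']})
df['GOODS_DESC'] = [drop_negative_tokens(s) for s in df['GOODS_DESC']]

# Optional: drop rows that ended up empty, as the original loop does.
df_non_empty = df.dropna(subset=['GOODS_DESC'])
print(df_non_empty)
```

Because it compares whole tokens, this version does not touch words that merely contain a negative word as a substring.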
