Removal of substring from all dataframe columns
Problem description
I have a list of words (around 1000), which I call negative words:
['CAST','ARTICLES','SANITARY','JAN','CLAUSES','SPECIAL','ENDORSEMENT']
I'll soon be making a dataframe out of this list of words.
I also have a dataframe which looks like -
    FileName  PageNo    LineNo  GOODS_DESC
1   17668620  TM000004  36      CAST ARTICLES IRON SANITARY
59  17668620  TM000014  41      CRATES
60  17668620  TM000014  42      CAST ARTICLES IRON
61  17668620  TM000014  49      JAN ANIMAL AND VEGETABLE
63  17668620  TM000016  49      SETTLING AGENT
65  17668620  TM000016  29      JAN
66  17668620  TM000016  32      CLAUSES SPECIAL CONDITIONS WARRANTIES
67  17668620  TM000016  37      CARGO ISM ENDORSEMENT
69  17668620  TM000017  113     QUANTITY DECLARED IRON CRATES
I want to remove the negative words from the dataframe (as fast as possible) and get the refined dataframe, so that it looks like this:
    FileName  PageNo    LineNo  GOODS_DESC
1   17668620  TM000004  36      IRON
59  17668620  TM000014  41      CRATES
60  17668620  TM000014  42      IRON
61  17668620  TM000014  49      ANIMAL AND VEGETABLE
63  17668620  TM000016  49      SETTLING AGENT
65  17668620  TM000016  29      NaN
66  17668620  TM000016  32      CONDITIONS WARRANTIES
67  17668620  TM000016  37      CARGO ISM
69  17668620  TM000017  113     QUANTITY DECLARED IRON CRATES
Currently, my approach is to iterate over the dataframe, take each row, split it, and check whether each word is in the negative words list. If it is not, I build a new string by joining the remaining words and write it back to the dataframe.
for rows in df.itertuples():
    a = []
    flat_list = []
    a.append(rows.GOODS_DESC)
    flat_list = [item.strip() for sublist in a for item in sublist.split(' ') if item.strip()]
    flat_list = list(sorted(set(flat_list), key=flat_list.index))
    flat_list = [i for i in flat_list if i.lower() not in negative_words_list]
    if(not flat_list):
        df.drop(rows.Index,inplace=True)
        continue
    s=' '.join(flat_list)
    df.loc[rows.Index,'GOODS_DESC']=s
df['GOODS_DESC'] = df['GOODS_DESC'].str.upper()
The only problem with this approach is that it's too slow.
If you have any hint or logic, please share it. Can someone show me how to do this with a pandas dataframe in less time?
Recommended answer
Because the pandas .str accessor is slow and loops internally, it may be better to just use a list comprehension:
import re
import numpy as np

l = ['CAST','ARTICLES','SANITARY','JAN','CLAUSES','SPECIAL','ENDORSEMENT']
pattern = re.compile('|'.join(l))  # build the alternation once instead of on every call
df['GOODS_DESC'] = [pattern.sub('', i).strip() or np.nan for i in df.GOODS_DESC]  # empty string -> NaN
Output:
    FileName  PageNo    LineNo  GOODS_DESC
1   17668620  TM000004  36      IRON
59  17668620  TM000014  41      CRATES
60  17668620  TM000014  42      IRON
61  17668620  TM000014  49      ANIMAL AND VEGETABLE
63  17668620  TM000016  49      SETTLING AGENT
65  17668620  TM000016  29      NaN
66  17668620  TM000016  32      CONDITIONS WARRANTIES
67  17668620  TM000016  37      CARGO ISM
69  17668620  TM000017  113     QUANTITY DECLARED IRON CRATES
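One caveat: a plain alternation matches substrings, so 'JAN' would also be cut out of a longer word such as 'JANUARY', and a word removed from the middle of a description leaves a double space behind. If whole-word removal is what is wanted, something like the following might work. It is only a sketch: the helper name remove_words and the sample frame are made up for illustration, and it assumes every GOODS_DESC value is a string.

```python
import re

import numpy as np
import pandas as pd

negative_words = ['CAST', 'ARTICLES', 'SANITARY', 'JAN', 'CLAUSES', 'SPECIAL', 'ENDORSEMENT']

# \b anchors each alternative to word boundaries; re.escape guards against
# regex metacharacters that might appear in the word list.
word_pattern = re.compile(r'\b(?:' + '|'.join(map(re.escape, negative_words)) + r')\b')


def remove_words(text):
    """Hypothetical helper: drop negative words, collapse spaces, NaN if nothing is left."""
    cleaned = ' '.join(word_pattern.sub('', text).split())
    return cleaned if cleaned else np.nan


# Illustrative data only (assumed, not taken from the real dataframe).
sample = pd.DataFrame({'GOODS_DESC': ['CAST ARTICLES IRON SANITARY',
                                      'JAN',
                                      'IRON SPECIAL CRATES',
                                      'JANUARY SALE']})
sample['GOODS_DESC'] = [remove_words(s) for s in sample['GOODS_DESC']]
```

Here 'JANUARY SALE' is left untouched, while a row containing only 'JAN' becomes NaN.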
Timings
%timeit [re.sub('|'.join(l),'',i).strip() if re.sub('|'.join(l),'',i).strip() != '' else np.nan for i in df.GOODS_DESC]
89.6 µs ± 667 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Using the .str accessor:
%timeit df['GOODS_DESC'].str.replace('|'.join(l), '', regex=True).str.strip()
466 µs ± 10.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
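With around 1000 negative words the alternation pattern gets long, so another option, closer to the loop in the question, is to split each description and test the tokens against a set, where membership checks take constant time on average. This is a sketch rather than a measured improvement: filter_tokens and the sample frame are hypothetical, and it assumes the words are whitespace-separated strings whose casing matches the list.

```python
import numpy as np
import pandas as pd

# A set gives average O(1) membership tests, unlike a list.
negative_words = {'CAST', 'ARTICLES', 'SANITARY', 'JAN', 'CLAUSES', 'SPECIAL', 'ENDORSEMENT'}


def filter_tokens(text, blocked=negative_words):
    """Hypothetical helper: keep only the words that are not in the blocked set."""
    kept = [word for word in text.split() if word not in blocked]
    return ' '.join(kept) if kept else np.nan


# Illustrative data only (assumed, not the real dataframe).
sample = pd.DataFrame({'GOODS_DESC': ['CAST ARTICLES IRON SANITARY',
                                      'JAN',
                                      'CARGO ISM ENDORSEMENT']})
sample['GOODS_DESC'] = [filter_tokens(s) for s in sample['GOODS_DESC']]

# Optional: drop rows that became empty, as the loop in the question does.
refined = sample.dropna(subset=['GOODS_DESC'])
```

Because this compares whole tokens, it does not touch negative words embedded inside longer words. Timing both variants on the real data with %timeit would show which one is faster there.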