Pass a list of text to the pandas str.replace function at once instead of iterating over individual list elements
Problem description
The pandas function str.replace takes two parameters: the pattern to search for and the value to replace it with. Say I have two lists, keyword and lookupid, as follows.
lookupid = ['##10##','##13##','##12##','##13##']
keyword = ['IT Manager', 'Sales Manager', 'IT Analyst', 'Store Manager']
Instead of iterating through the lists using zip() or any other means, I want to pass both lists directly to str.replace. Is there any way to avoid the loop and still do this faster? The dataframe in which I am doing the find and replace has millions of records, and the lookupid and keyword lists each have close to 200,000 elements, so performance matters. How can I execute this faster?
df_find.currentTitle.str.replace(r'\bkeyword\b', r'lookupId', case=False)
I get the following error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-12-cb36f6429008> in <module>()
----> 1 df_find.currentTitle=df_find.currentTitle.str.replace(r'\b'+df_replace.keyword+r'\b',r' '+df_replace.lookupId+ ' ',case=False)
c:\python27\lib\site-packages\pandas\core\strings.pyc in replace(self, pat, repl, n, case, flags)
1504 def replace(self, pat, repl, n=-1, case=True, flags=0):
1505 result = str_replace(self._data, pat, repl, n=n, case=case,
-> 1506 flags=flags)
1507 return self._wrap_result(result)
1508
c:\python27\lib\site-packages\pandas\core\strings.pyc in str_replace(arr, pat, repl, n, case, flags)
320 # Check whether repl is valid (GH 13438)
321 if not is_string_like(repl):
--> 322 raise TypeError("repl must be a string")
323 use_re = not case or len(pat) > 1 or flags
324
TypeError: repl must be a string
My input data looks like this:
current_title
I have been working here as a store manager since after I passed from college
I am sales manager and primarily work in the ASEAN region. My primary rolw is to bring new customers.
I initially joined as a IT analyst and because of my sheer drive and dedication, I was promoted to IT manager position within 3 years
Expected output:
current_title
I have been working here as a ##13## since after I passed from college
I am ##13## and primarily work in the ASEAN region. My primary rolw is to bring new customers.
I initially joined as a ##12## and because of my sheer drive and dedication, I was promoted to ##10## position within 3 years
As per jezrel's answer, I followed the suggestion, but now I am getting a new error.
TypeError Traceback (most recent call last)
<ipython-input-8-699e487f230e> in <module>()
----> 1 df_find.currentTitle.replace(keyword, df_replace['lookupId'], regex=True)
c:\python27\lib\site-packages\pandas\core\generic.pyc in replace(self, to_replace, value, inplace, limit, regex, method, axis)
3506 dest_list=value,
3507 inplace=inplace,
-> 3508 regex=regex)
3509
3510 else: # [NA, ''] -> 0
c:\python27\lib\site-packages\pandas\core\internals.pyc in replace_list(self, src_list, dest_list, inplace, regex, mgr)
3211 operator.eq)
3212
-> 3213 masks = [comp(s) for i, s in enumerate(src_list)]
3214
3215 result_blocks = []
c:\python27\lib\site-packages\pandas\core\internals.pyc in comp(s)
3209 return isnull(values)
3210 return _possibly_compare(values, getattr(s, 'asm8', s),
-> 3211 operator.eq)
3212
3213 masks = [comp(s) for i, s in enumerate(src_list)]
c:\python27\lib\site-packages\pandas\core\internals.pyc in _possibly_compare(a, b, op)
4613 type_names[1] = 'ndarray(dtype=%s)' % b.dtype
4614
-> 4615 raise TypeError("Cannot compare types %r and %r" % tuple(type_names))
4616 return result
4617
TypeError: Cannot compare types 'ndarray(dtype=object)' and 'str'
Recommended answer
It seems you need a list comprehension with Series.replace (not Series.str.replace):
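# (?i) must come first: Python 3.11+ rejects inline flags that are not at the start of the pattern
keyword = [r'(?i)\b' + x + r'\b' for x in keyword]
df_find.currentTitle = df_find.currentTitle.replace(keyword, lookupid, regex=True)
# temporarily widen the column display so long strings are not truncated
with pd.option_context('display.max_colwidth', 130):
    print(df_find)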
currentTitle
0 I have been working here as a ##13## since after I passed from college
1 I am ##13## and primarily work in the ASEAN region. My primary rolw is to bring new customers.
2 I initially joined as a ##12## and because of my sheer drive and dedication, I was promoted to ##10## position within 3 years
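With around 200,000 keywords, Series.replace with a list of patterns applies many separate regexes to the column. One alternative worth benchmarking is a single combined pattern with a dictionary lookup, so each string is scanned once. The following is only a sketch under assumptions: it relies on a recent pandas where Series.str.replace accepts a compiled pattern and a callable repl, and the names mapping, pattern, and lookup are illustrative rather than part of the original answer. zip() is used once here to build the dictionary, not to loop over replacements.

```python
import re
import pandas as pd

lookupid = ['##10##', '##13##', '##12##', '##13##']
keyword = ['IT Manager', 'Sales Manager', 'IT Analyst', 'Store Manager']

# Hypothetical helper: lowercased keyword -> replacement id (built once).
mapping = {k.lower(): v for k, v in zip(keyword, lookupid)}

# One alternation of escaped keywords, longest first so that a longer
# phrase is tried before any shorter phrase that shares its prefix.
alternation = '|'.join(re.escape(k) for k in sorted(keyword, key=len, reverse=True))
pattern = re.compile(r'\b(?:' + alternation + r')\b', flags=re.IGNORECASE)

df_find = pd.DataFrame({'currentTitle': [
    'I have been working here as a store manager since after I passed from college',
    'I am sales manager and primarily work in the ASEAN region. My primary rolw is to bring new customers.',
    'I initially joined as a IT analyst and because of my sheer drive and dedication, '
    'I was promoted to IT manager position within 3 years',
]})

def lookup(match):
    # Normalise the matched text so the lookup is case-insensitive.
    return mapping[match.group(0).lower()]

# A compiled pattern carries its own flags, so case/flags are not passed here.
df_find['currentTitle'] = df_find['currentTitle'].str.replace(pattern, lookup, regex=True)
```

re.escape keeps keywords that contain regex metacharacters (for example "C++ Developer") from being interpreted as patterns. Whether this is actually faster on millions of rows depends on the data, so it should be timed against the Series.replace version before adopting it.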