Smarter way to check if a string contains an element in a list - python
Question
The list top_brands contains brand names, such as:
top_brands = ['Coca Cola', 'Apple', 'Victoria\'s Secret', ....]
items is a pandas.DataFrame with the structure shown below. My task is to fill brand_name from item_title when brand_name is missing.
row | item_title            | brand_name
1   | Apple 6S              | Apple
2   | New Victoria's Secret | missing   <-- needs to be filled with Victoria's Secret
3   | Used Samsung TV       | missing   <-- needs to be filled with Samsung
4   | Used bike             | missing   <-- nothing to do: the title contains no brand name
....
My code is below. The problem is that it is too slow for a DataFrame with 2 million records. Is there a way to handle this task with pandas or numpy?
def get_brand_name(row):
    if row['brand_name'] != 'missing':
        return row['brand_name']
    item_title = row['item_title']
    for brand in top_brands:
        brand_start = brand + ' '
        brand_in_between = ' ' + brand + ' '
        brand_end = ' ' + brand
        if ((brand_in_between in item_title) or item_title.endswith(brand_end) or item_title.startswith(brand_start)):
            print(brand)
            return brand
    return 'missing'  ### end of get_brand_name

items['brand_name'] = items.apply(lambda x: get_brand_name(x), axis=1)
Recommended answer
Try this:
pd.concat([df['item_title'], df['item_title'].str.extract('(?P<brand_name>{})'.format("|".join(top_brands)), expand=True).fillna('missing')], axis=1)
Output:
item_title brand_name
0 Apple 6S Apple
1 New Victoria's Secret Victoria's Secret
2 Used Samsung TV Samsung
3 Used Bike missing
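The one-liner above joins the brand names into a regular expression as they are, matches them anywhere in the title, and rebuilds brand_name for every row. A minimal sketch of a stricter variant follows, assuming the column names from the question: it escapes each brand with re.escape, requires whitespace or a string boundary on both sides of the match, and touches only rows whose brand_name is 'missing'. The sample rows (including "Applesauce jar") and the four-entry top_brands list are illustrative, not data from the question.

```python
import re

import pandas as pd

# Illustrative data modeled on the question; a real top_brands list is longer.
top_brands = ["Coca Cola", "Apple", "Victoria's Secret", "Samsung"]

items = pd.DataFrame({
    "item_title": [
        "Apple 6S",
        "New Victoria's Secret",
        "Used Samsung TV",
        "Used bike",
        "Applesauce jar",  # contains "Apple" only as part of a longer word
    ],
    "brand_name": ["Apple", "missing", "missing", "missing", "missing"],
})

# Escape each brand so regex metacharacters are matched literally, and try
# longer names first so a short brand cannot shadow a longer one.
alternatives = "|".join(
    re.escape(brand) for brand in sorted(top_brands, key=len, reverse=True)
)

# The lookarounds require whitespace or a string boundary on both sides.
pattern = rf"(?<!\S)({alternatives})(?!\S)"

# Extract only from rows whose brand is missing, then write the hits back.
is_missing = items["brand_name"] == "missing"
extracted = items.loc[is_missing, "item_title"].str.extract(pattern, expand=False)
items.loc[is_missing, "brand_name"] = extracted.fillna("missing")

print(items)
```

The boundary check is close to the startswith/endswith/in tests in the question, with one difference: it also matches a title that consists of the brand name alone. Existing brand_name values are left untouched because the assignment is restricted to the masked rows.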
I ran it against a random sample of 2 million items on my machine:
import time

import pandas as pd

# top_brands is the brand list from the question
def read_file():
    df = pd.read_csv('file1.txt')
    new_df = pd.concat([df['item_title'], df['item_title'].str.extract('(?P<brand_name>{})'.format("|".join(top_brands)), expand=True).fillna('missing')], axis=1)
    return new_df

start = time.time()
print(read_file())
end = time.time() - start
print(f'Took {end}s to process')
Output:
item_title brand_name
0 LG watch LG
1 Sony watch Sony
2 Used Burger missing
3 New Bike missing
4 New underwear missing
5 New Sony Sony
6 Used Apple underwear Apple
7 Refurbished Panasonic Panasonic
8 Used Victoria's Secret TV Victoria's Secret
9 Disney phone Disney
10 Used laptop missing
... ... ...
1999990 Refurbished Disney tablet Disney
1999991 Refurbished laptop missing
1999992 Nintendo Coffee Nintendo
1999993 Nintendo desktop Nintendo
1999994 Refurbished Victoria's Secret Victoria's Secret
1999995 Used Burger missing
1999996 Nintendo underwear Nintendo
1999997 Refurbished Apple Apple
1999998 Refurbished Sony Sony
1999999 New Google phone Google
[2000000 rows x 2 columns]
Took 3.2660000324249268s to process
My machine's specs:
Windows 7 Pro 64-bit, Intel i7-4770 @ 3.40 GHz, 12.0 GB RAM
3.266 seconds is pretty fast... right?
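The contents of file1.txt are not shown, so the timing cannot be reproduced directly. As a rough sketch, a comparable test can be run on synthetic titles. The vocabulary below is an assumption loosely modeled on the sample output, and make_titles and extract_brands are hypothetical helper names, not part of the original answer. Note that the measured time here covers only the extraction, whereas the test above also includes reading the CSV and printing.

```python
import random
import re
import time

import pandas as pd

# Hypothetical vocabulary, loosely modeled on the sample output above.
brands = ["LG", "Sony", "Apple", "Panasonic", "Victoria's Secret",
          "Disney", "Nintendo", "Google"]
conditions = ["New", "Used", "Refurbished"]
products = ["watch", "phone", "TV", "laptop", "tablet", "bike"]


def make_titles(n, seed=0):
    """Build n synthetic titles; roughly a third of them carry no brand."""
    rng = random.Random(seed)
    titles = []
    for _ in range(n):
        parts = [rng.choice(conditions)]
        if rng.random() < 2 / 3:
            parts.append(rng.choice(brands))
        parts.append(rng.choice(products))
        titles.append(" ".join(parts))
    return pd.DataFrame({"item_title": titles})


def extract_brands(df, brand_list):
    """Return a brand_name Series, using 'missing' where no brand matches."""
    pattern = "({})".format("|".join(re.escape(b) for b in brand_list))
    return df["item_title"].str.extract(pattern, expand=False).fillna("missing")


sample = make_titles(100_000)  # raise toward 2_000_000 to mirror the test above
start = time.perf_counter()
sample["brand_name"] = extract_brands(sample, brands)
elapsed = time.perf_counter() - start
print(f"Extracted brands for {len(sample)} titles in {elapsed:.3f}s")
```

time.perf_counter is used here instead of time.time because it is intended for measuring short intervals. The result will vary with hardware, pandas version, and the size of the brand list.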