Smarter way to check if a string contains an element in a list - python

Question

The list top_brands contains brand names, such as:

top_brands = ['Coca Cola', 'Apple', 'Victoria\'s Secret', ....]

items is a pandas.DataFrame with the structure shown below. My task is to fill brand_name from item_title wherever brand_name is missing.

row  |  item_title              |  brand_name
1    |  Apple 6S                |  Apple
2    |  New Victoria's Secret   |  missing  <-- needs to be filled with Victoria's Secret
3    |  Used Samsung TV         |  missing  <-- needs to be filled with Samsung
4    |  Used bike               |  missing  <-- nothing to do, the title contains no brand name
....

My code is below. The problem is that it is too slow for a DataFrame that contains 2 million records. Is there a way to handle this task with pandas or numpy?

def get_brand_name(row):
    if row['brand_name'] != 'missing':
        return row['brand_name']

    item_title = row['item_title']

    for brand in top_brands:
        brand_start = brand + ' '
        brand_in_between = ' ' + brand + ' '
        brand_end = ' ' + brand
        if ((brand_in_between in item_title) or item_title.endswith(brand_end) or item_title.startswith(brand_start)): 
            print(brand)
            return brand

    return 'missing'    ### end of get_brand_name


items['brand_name'] = items.apply(lambda x: get_brand_name(x), axis=1)

Recommended answer

Try this:

pd.concat([df['item_title'], df['item_title'].str.extract('(?P<brand_name>{})'.format("|".join(top_brands)), expand=True).fillna('missing')], axis=1)

Output:

              item_title         brand_name
0               Apple 6S              Apple
1  New Victoria's Secret  Victoria's Secret
2        Used Samsung TV            Samsung
3              Used Bike            missing
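
Note that the one-liner above puts the brand names into the regex unescaped and matches them anywhere in the title, so a brand containing regex metacharacters, or a brand that happens to be the start of a longer word, could give surprising matches. Below is a minimal sketch of building the pattern more defensively with the standard `re` module, assuming whitespace-delimited matching is what the question intends. `build_brand_pattern` and `find_brand` are hypothetical helper names, and the brand list is only a sample. The result is not identical to the loop in the question: the regex returns the leftmost match in the title rather than the first matching brand in list order, and it also matches a title that consists of the brand alone.

```python
import re

def build_brand_pattern(brands):
    """Compile one regex that matches any brand as whitespace-delimited text."""
    # Longest brands first, so a longer brand wins over a shorter brand
    # that is a prefix of it at the same position.
    ordered = sorted(brands, key=len, reverse=True)
    # re.escape makes every brand match literally.
    alternation = "|".join(re.escape(b) for b in ordered)
    # (?<!\S) and (?!\S): the match must be bounded by whitespace
    # or by the start/end of the string.
    return re.compile(r"(?<!\S)(?P<brand_name>{})(?!\S)".format(alternation))

def find_brand(title, pattern, default="missing"):
    """Return the first brand found in title, or default if none matches."""
    match = pattern.search(title)
    return match.group("brand_name") if match else default

# Sample list for illustration only.
top_brands = ["Coca Cola", "Apple", "Victoria's Secret", "Samsung"]
pattern = build_brand_pattern(top_brands)

print(find_brand("New Victoria's Secret", pattern))
print(find_brand("Used Applesauce jar", pattern))
```

The compiled pattern's source string (`pattern.pattern`) contains one named group, so it should also be usable as the pattern argument of `str.extract` in place of the plain `"|".join(top_brands)` version.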

I ran this against a random sample of 2 million items on my machine:

import time

import pandas as pd

def read_file():
    # top_brands is the list of brand names from the question
    df = pd.read_csv('file1.txt')
    new_df = pd.concat([df['item_title'], df['item_title'].str.extract('(?P<brand_name>{})'.format("|".join(top_brands)), expand=True).fillna('missing')], axis=1)
    return new_df

start = time.time()
print(read_file())
elapsed = time.time() - start
print(f'Took {elapsed}s to process')

Output:

                                   item_title         brand_name
0                                    LG watch                 LG
1                                  Sony watch               Sony
2                                 Used Burger            missing
3                                    New Bike            missing
4                               New underwear            missing
5                                    New Sony               Sony
6                        Used Apple underwear              Apple
7                       Refurbished Panasonic          Panasonic
8                   Used Victoria's Secret TV  Victoria's Secret
9                                Disney phone             Disney
10                                Used laptop            missing
...                                       ...                ...
1999990             Refurbished Disney tablet             Disney
1999991                    Refurbished laptop            missing
1999992                       Nintendo Coffee           Nintendo
1999993                      Nintendo desktop           Nintendo
1999994         Refurbished Victoria's Secret  Victoria's Secret
1999995                           Used Burger            missing
1999996                    Nintendo underwear           Nintendo
1999997                     Refurbished Apple              Apple
1999998                      Refurbished Sony               Sony
1999999                      New Google phone             Google

[2000000 rows x 2 columns]
Took 3.2660000324249268s to process

My machine's specs:

Windows 7 Pro 64-bit, Intel i7-4770 @ 3.40 GHz, 12.0 GB RAM

3.266 seconds is pretty fast... right?
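
One caveat: the snippet in this answer builds a new two-column frame and derives brand_name from the title for every row, including rows that already have a brand. The question asks to fill only the rows marked 'missing'. A minimal sketch of that, assuming the column names from the question; the sample frame and the brand list (including 'Samsung') are made up for illustration:

```python
import re

import pandas as pd

# Sample data modeled on the table in the question (illustrative only).
top_brands = ["Coca Cola", "Apple", "Victoria's Secret", "Samsung"]
items = pd.DataFrame({
    "item_title": ["Apple 6S", "New Victoria's Secret", "Used Samsung TV", "Used bike"],
    "brand_name": ["Apple", "missing", "missing", "missing"],
})

# Escape each brand so regex metacharacters in a name match literally.
pattern = "({})".format("|".join(re.escape(b) for b in top_brands))

# One vectorized pass over all titles; NaN where no brand is found.
extracted = items["item_title"].str.extract(pattern, expand=False)

# Overwrite only the rows whose brand_name is 'missing';
# rows with no match in the title stay 'missing'.
needs_fill = items["brand_name"] == "missing"
items.loc[needs_fill, "brand_name"] = extracted[needs_fill].fillna("missing")

print(items)
```

Existing brand_name values are left untouched, and the extraction still runs as a single `str.extract` call rather than a Python-level loop per row.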
