Create a new column (data frame) based on suffix of its items


Problem description

I have the following dataset built using pandas:

                           URLS                           Paths
0
1                   www.gene.eu          /news-room/q-a-detail/
2  www.cittametropolitana.me.it            /emergenza-sanitari/
3     www.regione.basilicata.it  /giunta/site/giunta/detail.jsp
4                 www.bbc.co.uk                         /focus/

I would like to check the suffix of each entry in URLS (eu, it, co.uk, ...) and assign one of these values:

suffix=['.it','.uk','.eu'] # this should be used as set which includes all the suffix that I want to check
country=['Italy','United Kingdom','Europe'] # values to assign based on the suffix

zipped = list(zip(suffix, country)) # create a connection between suffix and country

I have tried several ways (thanks also to a couple of users who helped me with this problem) to add this new column with suffix information to my sample data frame, but without success (a related question with a different sample: Adding new column with condition):

country = {k.lower() : v for (k,v) in zipped}
og = {k : v for (k,v) in suffix}
country.update(og)
# (1)
df['value'] = df['URLS'].str.split(".", expand=True).stack().reset_index(1).query(
    "level_1 == level_1.max()"
)[0].map(country)

# (2)
original_domain = {x: y for x, y  in zipped}

df['value'] = df['URLS'].apply(lambda sen : original_domain.get( sen[-1], 'Unknown') ) )

# (3)
df['value']=df['URLS'].map(lambda x: x[-3:] in zipped) 

#(4)
df['value'] = np.where(df['URLS'].str.endswith(suffix), pd.to_datetime(df['value'])) # this raises errors, and it would need another step to assign the country

But none of these pieces of code works. URLS is a column derived by parsing the links. I think the problem may lie in defining the value column from a calculated item without creating a list, so I would need to create it from URLS. How can I add this new column by looking at the suffix each URL ends with and assigning the corresponding value (Italy, United Kingdom, ...)?

I hope you can help me.

Thanks

df is defined as follows:

df=pd.read_csv('path/text.csv', sep=';', engine='python')

I think this can cause an error when I try to apply the code proposed by sK500.

Recommended answer

There you go if I understood correctly:

import pandas as pd


suffix = ['it', 'uk', 'eu']
country = ['Italy', 'United Kingdom', 'Europe']
mapping = dict(zip(suffix, country))
urls = ['www.gene.eu', 'www.cittametropolitana.me.it', 'www.regione.basilicata.it', 'www.bbc.co.uk']
paths = ['/news-room/q-a-detail/', '/emergenza-sanitari/', '/giunta/site/giunta/detail.jsp', '/focus/']
frame = pd.DataFrame(zip(urls, paths), columns=['urls', 'paths'])
for ext in mapping:
    frame.loc[frame['urls'].apply(lambda x: x.split('.')[-1]) == ext, 'Country'] = mapping[ext]
print(frame)

Out:

                           urls                           paths         Country
0                   www.gene.eu          /news-room/q-a-detail/          Europe
1  www.cittametropolitana.me.it            /emergenza-sanitari/           Italy
2     www.regione.basilicata.it  /giunta/site/giunta/detail.jsp           Italy
3                 www.bbc.co.uk                         /focus/  United Kingdom
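
As an alternative to the loop, the same lookup can probably be done in one vectorized step. This is only a minimal sketch, assuming the same `mapping` dictionary and the same sample URLs as above: it takes the text after the last dot with `Series.str.rsplit` and looks it up with `Series.map`.

```python
import pandas as pd

# Same suffix-to-country mapping and sample URLs as in the answer above.
mapping = {'it': 'Italy', 'uk': 'United Kingdom', 'eu': 'Europe'}
frame = pd.DataFrame({'urls': ['www.gene.eu',
                               'www.cittametropolitana.me.it',
                               'www.regione.basilicata.it',
                               'www.bbc.co.uk']})

# Split once from the right, keep the part after the last dot,
# then translate it through the dictionary.
frame['Country'] = frame['urls'].str.rsplit('.', n=1).str[-1].map(mapping)
print(frame)
```

Like the loop, this leaves NaN wherever the last part of the URL is not a key in `mapping`.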

Note that for this to work properly, you need to add every extension you want to cover to the mapping beforehand, and the data must be uniform: every URL must contain a . and end with an extension included in the mapping, otherwise you will get unwanted NaN values.
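
If the data cannot be made uniform, one option is to give unmatched rows an explicit label instead of NaN. The sketch below is an assumption about how that might look, not part of the original answer: the `'Unknown'` label, the missing first row, and the `www.example.org` and `localhost` entries are hypothetical values added for illustration. It uses the `.str` accessor, which passes missing values through rather than raising the way `x.split('.')` would on a non-string.

```python
import pandas as pd

mapping = {'it': 'Italy', 'uk': 'United Kingdom', 'eu': 'Europe'}

# Hypothetical sample: a missing URL, two mapped suffixes,
# one unmapped suffix, and one value with no dot at all.
frame = pd.DataFrame({'urls': [None,
                               'www.gene.eu',
                               'www.bbc.co.uk',
                               'www.example.org',
                               'localhost']})

# .str methods skip missing values, so the empty row does not raise.
last_part = frame['urls'].str.rsplit('.', n=1).str[-1]

# Anything that is missing or not in the mapping becomes 'Unknown'.
frame['Country'] = last_part.map(mapping).fillna('Unknown')
print(frame)
```

Whether a placeholder such as `'Unknown'` is preferable to NaN depends on how the column is used later; NaN is easier to filter with `isna()`, while a label is easier to read in a report.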
