How to create multiple flag columns based on list values found in the dataframe column?

Views: 92. Published: 2020/10/17 1:37:15. Tags: pandas, dataframe, hive, pyspark, data-manipulation


Question

The table looks like this:

     ID  |CITY
    ----------------------------------
    1  |London|Paris|Tokyo
    2  |Tokyo|Barcelona|Mumbai|London
    3  |Vienna|Paris|Seattle

The CITY column contains around 1000+ distinct values, delimited by |.

I want to create one flag column per city of interest, indicating whether the person visited that city.

    city_of_interest=['Paris','Seattle','Tokyo']

There are 20 such values in the list.

The output should look like this:

     ID      |Paris   | Seattle | Tokyo    
     -------------------------------------------
     1       |1       |0        |1      
     2       |0       |0        |1       
     3       |1       |1        |0

The solution can be in either pandas or pyspark.

Recommended answer

For pyspark, use split + array_contains (see http://spark.apache.org/docs/2.4.0/api/python/pyspark.sql.html#pyspark.sql.functions.split):

    from pyspark.sql.functions import split, array_contains

    # split() takes a regex, so the literal '|' must be escaped.
    # array_contains() returns a boolean, cast here to a 0/1 integer flag.
    df.withColumn('cities', split('CITY', r'\|')) \
      .select('ID', *[array_contains('cities', c).astype('int').alias(c) for c in city_of_interest]) \
      .show()
    +---+-----+-------+-----+
    | ID|Paris|Seattle|Tokyo|
    +---+-----+-------+-----+
    |  1|    1|      0|    1|
    |  2|    0|      0|    1|
    |  3|    1|      1|    0|
    +---+-----+-------+-----+

For pandas, use Series.str.get_dummies:

    # get_dummies() splits on '|' by default and returns one 0/1 column per distinct value.
    df[city_of_interest] = df.CITY.str.get_dummies()[city_of_interest]
    df = df.drop('CITY', axis=1)
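
One caveat with the pandas snippet: selecting `[city_of_interest]` from the dummies raises a KeyError if a city of interest never appears in the CITY column. The following is a minimal, self-contained sketch of one way to guard against that, using `reindex` with `fill_value=0`. The sample rows are copied from the question. "Oslo" is a hypothetical extra city, added only to show the missing-value case, and the variable names `dummies`, `flags`, and `result` are illustrative.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "ID": [1, 2, 3],
    "CITY": [
        "London|Paris|Tokyo",
        "Tokyo|Barcelona|Mumbai|London",
        "Vienna|Paris|Seattle",
    ],
})

# "Oslo" is a hypothetical city of interest that no row contains.
city_of_interest = ["Paris", "Seattle", "Tokyo", "Oslo"]

# One 0/1 column per distinct city; sep="|" is the default, spelled out for clarity.
dummies = df["CITY"].str.get_dummies(sep="|")

# reindex keeps only the cities of interest, in list order, and fills
# cities that never occur with 0 instead of raising KeyError.
flags = dummies.reindex(columns=city_of_interest, fill_value=0)

# Attach the flags to the ID column; CITY is left out of the result.
result = pd.concat([df[["ID"]], flags], axis=1)
print(result)
```

For the sample rows, the Paris, Seattle, and Tokyo columns match the expected output in the question, and the hypothetical Oslo column is all zeros.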
