How to split a string into different variables?


Problem description

I'm trying to analyze a large data set of Airbnb listings. The amenities column lists the amenities that each listing has.

For example:

{"Wireless Internet","Air conditioning",Kitchen,Heating,"Fire extinguisher",Essentials,Shampoo,Hangers}

{TV,"Wireless Internet","Air conditioning",Kitchen,"Elevator in building",Heating,"Suitable for events","Smoke detector","Carbon monoxide detector","First aid kit",Essentials,Shampoo,"Lock on bedroom door",Hangers,"Hair dryer",Iron,"Laptop friendly workspace","translation missing: en.hosting_amenity_49","translation missing: en.hosting_amenity_50"}

I have two problems to solve:

  1. I would like to split the string into different columns, e.g. there would be a column titled TV. If the string contains TV, the entry in the corresponding cell is 1, and 0 otherwise. How can I do this?

  2. How do I remove the variables with a missing translation:.....


Recommended answer

Here is an approach that also uses dcast() from the data.table package, as another answer does, but additionally handles the tedious but important details of data cleaning.

library(data.table)

# read data file, returning one column
raw <- fread("AirBnB.csv", header = FALSE, sep = "\n", col.names = "amenities")
# add column with row numbers
raw[, rn := seq_len(.N)]
# remove opening and closing curly braces
raw[, amenities := stringr::str_replace_all(amenities, "^\\{|\\}$", "")]

# split amenities, thereby reshaping from wide to long format
long <- raw[, strsplit(amenities, ",", fixed = TRUE), by = rn]
# remove double quotes and leading and trailing whitespace
long[, V1 := stringr::str_trim(stringr::str_replace_all(V1, '["]', ""))]

# reshape from long to wide format, omitting rows which contain "translation missing..."
dcast(long[!V1 %like% "^translation missing"], rn ~ V1, length, value.var = "rn", fill = 0)
#   rn Air conditioning Carbon monoxide detector Elevator in building Essentials
#1:  1                1                        0                    0          1
#2:  2                1                        1                    1          1
#   Fire extinguisher First aid kit Hair dryer Hangers Heating Iron Kitchen
#1:                 1             0          0       1       1    0       1
#2:                 0             1          1       1       1    1       1
#   Laptop friendly workspace Lock on bedroom door Shampoo Smoke detector
#1:                         0                    0       1              0
#2:                         1                    1       1              1
#   Suitable for events TV Wireless Internet
#1:                   0  0                 1
#2:                   1  1                 1
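
For comparison, the same 0/1 table can be sketched in base R alone, which may be handy when data.table and stringr are not installed. This is a minimal sketch and not the method used in the answer: the two sample strings are inlined instead of being read from a file, and split_amenities() is a hypothetical helper name introduced only for illustration.

```r
# Two sample rows from the question, inlined so that no CSV file is needed
amenities <- c(
  '{"Wireless Internet","Air conditioning",Kitchen,Heating,"Fire extinguisher",Essentials,Shampoo,Hangers}',
  '{TV,"Wireless Internet","Air conditioning",Kitchen,"Elevator in building",Heating,"Suitable for events","Smoke detector","Carbon monoxide detector","First aid kit",Essentials,Shampoo,"Lock on bedroom door",Hangers,"Hair dryer",Iron,"Laptop friendly workspace","translation missing: en.hosting_amenity_49","translation missing: en.hosting_amenity_50"}'
)

# split_amenities() is a hypothetical helper, not part of any package
split_amenities <- function(x) {
  x <- gsub("^\\{|\\}$", "", x)                  # drop the enclosing curly braces
  parts <- strsplit(x, ",", fixed = TRUE)        # one character vector per listing
  lapply(parts, function(p) {
    p <- trimws(gsub('"', "", p, fixed = TRUE))  # drop quotes and stray whitespace
    p[!grepl("^translation missing", p)]         # discard untranslated placeholders
  })
}

items <- split_amenities(amenities)
all_amenities <- sort(unique(unlist(items)))

# One row per listing, one 0/1 column per amenity
indicator <- matrix(
  unlist(lapply(items, function(p) as.integer(all_amenities %in% p))),
  nrow = length(items), byrow = TRUE,
  dimnames = list(NULL, all_amenities)
)

# Inspect a few of the columns
indicator[, c("TV", "Kitchen", "Iron")]
```

Like the answer's code, this splits on every comma, so it assumes that no amenity name contains a comma itself.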



Data file



The OP provided only two data samples, which were copied into a data file named AirBnB.csv:

{"Wireless Internet","Air conditioning",Kitchen,Heating,"Fire extinguisher",Essentials,Shampoo,Hangers}
{TV,"Wireless Internet","Air conditioning",Kitchen,"Elevator in building",Heating,"Suitable for events","Smoke detector","Carbon monoxide detector","First aid kit",Essentials,Shampoo,"Lock on bedroom door",Hangers,"Hair dryer",Iron,"Laptop friendly workspace","translation missing: en.hosting_amenity_49","translation missing: en.hosting_amenity_50"}
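
To reproduce the answer, these two lines have to exist on disk first. A minimal sketch, assuming the file should be written to the current working directory under the name AirBnB.csv used in the answer (sample_lines is a hypothetical variable name):

```r
# The two sample rows exactly as given in the question
sample_lines <- c(
  '{"Wireless Internet","Air conditioning",Kitchen,Heating,"Fire extinguisher",Essentials,Shampoo,Hangers}',
  '{TV,"Wireless Internet","Air conditioning",Kitchen,"Elevator in building",Heating,"Suitable for events","Smoke detector","Carbon monoxide detector","First aid kit",Essentials,Shampoo,"Lock on bedroom door",Hangers,"Hair dryer",Iron,"Laptop friendly workspace","translation missing: en.hosting_amenity_49","translation missing: en.hosting_amenity_50"}'
)

# Write one listing per line, without a header row
writeLines(sample_lines, "AirBnB.csv")

# Sanity check: the file should contain one line per listing
length(readLines("AirBnB.csv"))
```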
