在大数据表中操纵字符串的最佳方式 [英] best way to manipulate strings in big data.table

查看：142 发布时间：2017/3/12 11:58:25 r data.table strsplit stringr

本文介绍了在大数据表中操纵字符串的最佳方式的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

我有一个67MM的行数据表，人姓名和姓氏用空格分隔。我只需要为每个单词创建一个新列。

I have a 67MM row data.table with people names and surname separated by spaces. I just need to create a new column for each word.

以下是数据的一小部分：

Here is an small subset of the data:

n <- structure(list(Subscription_Id = c("13.855.231.846.091.000", 
"11.156.048.529.090.800", "24.940.584.090.830", "242.753.039.111.124", 
"27.843.782.090.830", "13.773.513.145.090.800", "25.691.374.090.830", 
"12.236.174.155.090.900", "252.027.904.121.210", "11.136.991.054.110.100"
), Account_Desc = c("AGUAYO CARLA", "LEIVA LILIANA", "FULLANA MARIA LAURA", 
"PETREL SERGIO", "IPTICKET SRL", "LEDESMA ORLANDO", "CATTANEO LUIS RAUL", 
"CABRAL CARMEN ESTELA", "ITURGOYEN HECTOR", "CASA CASILDO"), 
    V1 = c("AGUAYO", "LEIVA", "FULLANA", "PETREL", "IPTICKET", 
    "LEDESMA", "CATTANEO", "CABRAL", "ITURGOYEN", "CASA"), V2 = c("CARLA", 
    "LILIANA", "MARIA", "SERGIO", "SRL", "ORLANDO", "LUIS", "CARMEN", 
    "HECTOR", "CASILDO"), V3 = c(NA, NA, "LAURA", NA, NA, NA, 
    "RAUL", "ESTELA", NA, NA), `NA` = c(NA_character_, NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_
    )), .Names = c("Subscription_Id", "Account_Desc", "V1", "V2", 
"V3", NA), class = c("data.table", "data.frame"), row.names = c(NA, 
-10L), .internal.selfref = <pointer: 0x0000000000200788>)


require("data.table")
n <- data.table(n)

strong>

Expected Output

# Subscription_Id Account_Desc V1 V2 V3 NA # 1: 13.855.231.846.091.000 AGUAYO CARLA AGUAYO CARLA NA NA # 2: 11.156.048.529.090.800 LEIVA LILIANA LEIVA LILIANA NA NA # 3: 24.940.584.090.830 FULLANA MARIA LAURA FULLANA MARIA LAURA NA

第一次尝试

如何使这项工作成为第一个问题

1st Attempt

How to make this work would be the first question

library(stringr) # This separates the strings, but i loose the Subscription_Id variable. n[, str_split_fixed(Account_Desc, "[ +]", 4)] # This doesn't work. n[, paste0("V",1:4) := str_split_fixed(Account_Desc, "[ +]", 4)]

第二次尝试

这样做，但我似乎在做计算3次。不确定其
是否是最有效的方式

2nd Attempt

This works, but i seem to be doing the calculation 3 times. Not sure if its the most effiecient way

cols = paste0("V",1:3) for(j in 1:3){ set(n,i=NULL,j=cols[j],value = sapply(strsplit(as.character(n$Account_Desc),"[ +]"), "[", j)) }

/ strong> to benchmarck

Let's use big_n to benchmarck

big_n <- data.table(Subscription_Id = rep(n[,Subscription_Id],1e7), Account_Desc = rep(n[,Account_Desc],1e7) )

推荐答案

我不使用数据集任何地方附近这个规模，所以我不知道这是否会使用或不。有一件事是使用矩阵和矩阵索引。

I don't work with datasets anywhere near this scale, so I have no idea if this is going to be of use or not. One thing that comes to mind is to use a matrix and matrix indexing.

由于我不耐烦，我'

big_n <- data.table(Subscription_Id = rep(n[,Subscription_Id],1e5), Account_Desc = rep(n[,Account_Desc],1e5))

写一个函数创建你的矩阵

Write a function to create your matrix

StringMat <- function(input) { Temp <- strsplit(input, " ", fixed = TRUE) Lens <- vapply(Temp, length, 1L) A <- unlist(Temp, use.names = FALSE) Rows <- rep(sequence(length(Temp)), Lens) Cols <- sequence(Lens) m <- matrix(NA, nrow = length(Temp), ncol = max(Lens), dimnames = list(NULL, paste0("V", sequence(max(Lens))))) m[cbind(Rows, Cols)] <- A m }

查看输出的时间

Time it and view the output

system.time(outB1 <- cbind(big_n, StringMat(big_n$Account_Desc))) # user system elapsed # 4.524 0.000 4.533 outB1 # Subscription_Id Account_Desc V1 V2 V3 # 1: 13.855.231.846.091.000 AGUAYO CARLA AGUAYO CARLA NA # 2: 11.156.048.529.090.800 LEIVA LILIANA LEIVA LILIANA NA # 3: 24.940.584.090.830 FULLANA MARIA LAURA FULLANA MARIA LAURA # 4: 242.753.039.111.124 PETREL SERGIO PETREL SERGIO NA # 5: 27.843.782.090.830 IPTICKET SRL IPTICKET SRL NA # --- # 999996: 13.773.513.145.090.800 LEDESMA ORLANDO LEDESMA ORLANDO NA # 999997: 25.691.374.090.830 CATTANEO LUIS RAUL CATTANEO LUIS RAUL # 999998: 12.236.174.155.090.900 CABRAL CARMEN ESTELA CABRAL CARMEN ESTELA # 999999: 252.027.904.121.210 ITURGOYEN HECTOR ITURGOYEN HECTOR NA # 1000000: 11.136.991.054.110.100 CASA CASILDO CASA CASILDO NA

更正 set_method 函数和比较计时

Correct the set_method function and compare timings

set_method <- function(DT){ cols = paste0("V",1:3) for(j in 1:3){ set(DT,i=NULL,j=cols[j], value = sapply(strsplit(as.character(DT[, Account_Desc, with = TRUE]), "[ +]"), "[", j)) } } system.time(set_method(big_n)) # user system elapsed # 25.319 0.022 25.586

重置big_n数据集并尝试 str_split_fixed （uch！）

Reset the "big_n" dataset and try out str_split_fixed (ouch!)

big_n[, c("V1", "V2", "V3") := NULL] library(stringr) system.time(outBrodie <- cbind(big_n, as.data.table(str_split_fixed( big_n$Account_Desc, "[ +]", 4)))) # user system elapsed # 204.966 0.514 206.910

这篇关于在大数据表中操纵字符串的最佳方式的文章就介绍到这了，希望我们推荐的答案对大家有所帮助，也希望大家多多支持IT屋！

查看全文

在大数据表中操纵字符串的最佳方式 [英] best way to manipulate strings in big data.table

问题描述

第一次尝试

如何使这项工作成为第一个问题

1st Attempt

How to make this work would be the first question

第二次尝试

2nd Attempt

推荐答案

写一个函数创建你的矩阵

Write a function to create your matrix

查看输出的时间

Time it and view the output

更正 `set_method` 函数和比较计时

Correct the `set_method` function and compare timings

重置big_n数据集并尝试 `str_split_fixed` （uch！）

Reset the "big_n" dataset and try out `str_split_fixed` (ouch!)

相关文章

其他开发最新文章

热门教程

热门工具

登录关闭

在大数据表中操纵字符串的最佳方式 [英] best way to manipulate strings in big data.table

问题描述

第一次尝试

如何使这项工作成为第一个问题

1st Attempt

How to make this work would be the first question

第二次尝试

2nd Attempt

推荐答案

写一个函数创建你的矩阵

Write a function to create your matrix

查看输出的时间

Time it and view the output

更正 set_method 函数和比较计时

Correct the set_method function and compare timings

重置big_n数据集并尝试 str_split_fixed （uch！）

Reset the "big_n" dataset and try out str_split_fixed (ouch!)

相关文章

其他开发最新文章

热门教程

热门工具

登录 关闭

更正 `set_method` 函数和比较计时

Correct the `set_method` function and compare timings

重置big_n数据集并尝试 `str_split_fixed` （uch！）

Reset the "big_n" dataset and try out `str_split_fixed` (ouch!)

登录关闭