How can I do fast/advanced data manipulation in a nested data.table (a data.table within a data.table)?


Question

I have a data.table named route_data in R. I need to add a nested data.table, leg_data, to each row of route_data, built from the information in that row.

library(data.table)

route_data <- data.table(route = c("Seattle>NewDelhi>Patna>Motihari", "Seattle>NewDelhi>Motihari","Seattle>Hyderabad>NewDelhi>Patna>Motihari"),
                         travel_type = c("business_meeting", "casual_trip","office_meeting"), 
                         leg1_time_hr = c(18.0,18.0,18.0),
                         leg2_time_hr = c(2,18,2.25),
                         leg3_time_hr = c(4.0,NA,1.75),
                         leg4_time_hr = c(NA,NA,4.0))

route_data

                                       route      travel_type leg1_time_hr leg2_time_hr leg3_time_hr leg4_time_hr
1:           Seattle>NewDelhi>Patna>Motihari business_meeting           18         2.00         4.00           NA
2:                 Seattle>NewDelhi>Motihari      casual_trip           18        18.00           NA           NA
3: Seattle>Hyderabad>NewDelhi>Patna>Motihari   office_meeting           18         2.25         1.75            4

I need to create a nested leg_data in route_data. For example, for the route Seattle>Hyderabad>NewDelhi>Patna>Motihari it should look like this:

example_nested_data <- data.table(leg = c("Seattle>Hyderabad", "Hyderabad>NewDelhi","NewDelhi>Patna","Patna>Motihari"),
                         leg_num = c(1,2,3,4), 
                         leg_transit_time_hr = c(18.0,2.25,1.75,4.0)
                         )

example_nested_data

                  leg leg_num leg_transit_time_hr
1:  Seattle>Hyderabad       1               18.00
2: Hyderabad>NewDelhi       2                2.25
3:     NewDelhi>Patna       3                1.75
4:     Patna>Motihari       4                4.00

The other rows of route_data should each get a nested table of the same shape.

Recommended answer

I am going to try answering this myself. I do see a warning message along the way and would like to understand whether it points to any limitation; for me the result is correct if I ignore the warning.

On a side note, data.table removes many of the limitations that keep R from processing large data, and I want to document my own research here before I forget it.

Meanwhile, let us create a function that breaks a route up into legs:

construct.legs <- function(ro) {
      # Split a single route string such as "A>B>C" into its nodes
      node_vector <- unlist(strsplit(ro, ">", fixed = TRUE))
      # Drop the first/last node by position so that repeated stops are kept
      d_nodes <- node_vector[-1]
      o_nodes <- node_vector[-length(node_vector)]
      # Return the legs visibly, e.g. "A>B" "B>C"
      paste(o_nodes, d_nodes, sep = ">")
    }
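As a quick check of the leg-splitting idea, here is a minimal standalone sketch. The helper name split_route_into_legs is hypothetical (it is not part of the original answer); it pairs each node with its successor by position, so a route that visits the same city twice still yields every leg.

```r
# Hypothetical helper illustrating the same idea as construct.legs
split_route_into_legs <- function(route) {
  # Split one route string into its nodes
  nodes <- strsplit(route, ">", fixed = TRUE)[[1]]
  # Pair every node except the last with its successor
  paste(head(nodes, -1), tail(nodes, -1), sep = ">")
}

legs <- split_route_into_legs("Seattle>NewDelhi>Patna>Motihari")
round_trip <- split_route_into_legs("Seattle>NewDelhi>Seattle")
print(legs)
print(round_trip)
```

Filtering with %in% on the first or last node would drop every occurrence of that city, which is why the sketch indexes by position instead.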

Now create a nested leg_data table for each route, containing the legs of that route, using the function construct.legs defined above:

route_data[, leg_data := .(list(data.table(leg = construct.legs(route)))), by = list(row.names(route_data))]

What does route_data look like now?

                                       route      travel_type leg1_time_hr leg2_time_hr leg3_time_hr leg4_time_hr     leg_data
1:           Seattle>NewDelhi>Patna>Motihari business_meeting           18         2.00         4.00           NA <data.table>
2:                 Seattle>NewDelhi>Motihari      casual_trip           18        18.00           NA           NA <data.table>
3: Seattle>Hyderabad>NewDelhi>Patna>Motihari   office_meeting           18         2.25         1.75            4 <data.table>

Let's look at the nested data.table in the third row of route_data:

route_data$leg_data[3]        # data.frame-style access; returns a list of length 1
route_data$leg_data[[3]]      # returns the nested leg_data as a data.table
route_data[3, leg_data]       # data.table-style access; returns a list of length 1
route_data[3, leg_data[[1]]]  # returns the nested leg_data as a data.table

The data.table stored in the 3rd row of route_data:

                  leg
1:  Seattle>Hyderabad
2: Hyderabad>NewDelhi
3:     NewDelhi>Patna
4:     Patna>Motihari

Let me add a row number to route_data, which I will use later to populate the transit times in the nested table leg_data:

route_data[, route_num := seq_len(.N)]

Similarly, add a leg number to the nested table leg_data:

route_data[, leg_data := .(list(leg_data[[1]][, leg_num := seq_len(.N)])), by = list(row.names(route_data))]

This produces a warning saying that an invalid internal self-reference was detected and fixed by taking a shallow copy. I am ignoring it for now, but I would appreciate help from someone who can tell me whether it breaks anything. Anyway, let's proceed.
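As far as I understand it, data.table raises this warning when := is applied to a table whose internal self-reference is no longer valid, which can happen to tables that R has copied while they sat inside a list column. The sketch below shows one possible way around it, under the assumption that taking an explicit copy() of each nested table before using := is acceptable; the table routes and its contents are made up for illustration.

```r
library(data.table)

# Toy outer table with one nested data.table per row (illustrative data)
routes <- data.table(
  route = c("A>B>C", "A>C"),
  leg_data = list(data.table(leg = c("A>B", "B>C")),
                  data.table(leg = "A>C"))
)

# copy() returns a fresh data.table with a valid self-reference,
# so := on the copy has nothing to repair
routes[, leg_data := lapply(leg_data, function(d) {
  copy(d)[, leg_num := seq_len(.N)]
})]

print(routes$leg_data[[1]])
```

The trade-off is an extra copy of each nested table, which is usually cheap when the nested tables are small.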

Why do we need [[1]]? It ensures that the nested table is returned as a data.table rather than as a list. Try running route_data[3, leg_data[[1]]] and route_data[3, leg_data] to see the difference.
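The difference can be seen on a small standalone table. This is only an illustrative sketch; outer_dt and its column names are invented for the example.

```r
library(data.table)

# Toy table whose list column holds one data.table per row
outer_dt <- data.table(
  id = 1:2,
  nested = list(data.table(x = 1:2), data.table(x = 3:5))
)

as_list  <- outer_dt[2, nested]        # a list of length 1 wrapping the table
as_table <- outer_dt[2, nested[[1]]]   # the nested data.table itself

print(class(as_list))
print(class(as_table))
```

Because a list column stores each cell as a list element, selecting one row still yields a list; [[1]] unwraps its single element.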

Now, finally, add the transit times from route_data to the nested leg_data:

route_data[, leg_data := .(list(leg_data[[1]][, leg_transit_time_hr := sapply(leg_num, function(x) {route_data[[route_num, 2+x]]})])), by = list(row.names(route_data))]

What are we doing here?

We loop over the leg numbers leg_num of leg_data with sapply, and use the row number route_num of route_data together with the leg number to pick the right transit-time column of route_data. The offset 2+x works because leg1_time_hr is the third column of route_data.

Why did we use double brackets in route_data[[route_num, 2+x]]?

Double brackets ensure that the value is returned as a plain vector element rather than as a data.table. The with = FALSE argument belongs to data.table's single-bracket [ and is not needed with [[.
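The positional offset depends on leg1_time_hr being the third column, so reordering or adding columns would silently pick the wrong values. A possible alternative, sketched here under the assumption that the time columns always follow the legN_time_hr naming pattern, is to look the column up by name. The helper leg_time and the table route_times are hypothetical.

```r
library(data.table)

# Small illustrative table with the same column layout as route_data
route_times <- data.table(
  route = c("A>B>C>D", "A>D"),
  travel_type = c("business", "casual"),
  leg1_time_hr = c(18, 18),
  leg2_time_hr = c(2, NA),
  leg3_time_hr = c(4, NA)
)

# Hypothetical helper: transit time of one leg, found by column name
leg_time <- function(dt, row, leg_num) {
  dt[[paste0("leg", leg_num, "_time_hr")]][row]
}

times_row1 <- sapply(1:3, function(k) leg_time(route_times, 1, k))
print(times_row1)

# Positional lookup for comparison: row 1, column 2 + 1
positional <- route_times[[1, 3]]
```

Both lookups return a bare number, but the name-based one keeps working if the column order changes.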

And, finally, let's take a look at the nested data.table leg_data in the 3rd row of route_data:

route_data[3, leg_data[[1]]]
                  leg leg_num leg_transit_time_hr
1:  Seattle>Hyderabad       1               18.00
2: Hyderabad>NewDelhi       2                2.25
3:     NewDelhi>Patna       3                1.75
4:     Patna>Motihari       4                4.00

Let's see what the nested table in the 2nd row looks like:

         leg            leg_num       leg_transit_time_hr
1:  Seattle>NewDelhi       1                  18
2: NewDelhi>Motihari       2                  18
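For comparison, the three steps (legs, leg numbers, transit times) could also be done in one pass per row. The following is only a sketch under the assumption that the time columns are named leg1_time_hr, leg2_time_hr and so on, in leg order; build_leg_table and routes are hypothetical names, and the data is a shortened version of the example.

```r
library(data.table)

routes <- data.table(
  route = c("Seattle>NewDelhi>Patna>Motihari", "Seattle>NewDelhi>Motihari"),
  leg1_time_hr = c(18, 18),
  leg2_time_hr = c(2, 18),
  leg3_time_hr = c(4, NA)
)
time_cols <- paste0("leg", 1:3, "_time_hr")

# Hypothetical helper: build the full nested table for one route
build_leg_table <- function(route, times) {
  nodes <- strsplit(route, ">", fixed = TRUE)[[1]]
  n_legs <- length(nodes) - 1L
  data.table(
    leg = paste(head(nodes, -1), tail(nodes, -1), sep = ">"),
    leg_num = seq_len(n_legs),
    leg_transit_time_hr = times[seq_len(n_legs)]
  )
}

# One group per row; .SD holds that row's time columns
routes[, leg_data := .(list(build_leg_table(route, unname(unlist(.SD))))),
       by = seq_len(nrow(routes)), .SDcols = time_cols]

print(routes$leg_data[[2]])
```

Each nested table is created complete, so no later := on a nested table is needed.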
