Producing summary tables for very large datasets


Question

I am working with migration data, and I want to produce three summary tables from a very large dataset (>4 million rows). A small example of the data is given below:

migration <- structure(list(
  area.old = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L,
                         3L, 3L, 3L, 3L, 3L, 3L, 3L),
                       .Label = c("leeds", "london", "plymouth"),
                       class = "factor"),
  area.new = structure(c(7L, 13L, 3L, 2L, 4L, 7L, 6L, 7L, 6L, 13L, 5L, 8L,
                         7L, 11L, 12L, 9L, 1L, 10L, 11L),
                       .Label = c("bath", "bristol", "cambridge", "glasgow",
                                  "harrogate", "leeds", "london", "manchester",
                                  "newcastle", "oxford", "plymouth", "poole",
                                  "york"),
                       class = "factor"),
  persons = c(6L, 3L, 2L, 5L, 6L, 7L, 8L, 4L, 5L, 6L, 3L, 4L, 1L, 1L, 2L,
              3L, 4L, 9L, 4L)),
  .Names = c("area.old", "area.new", "persons"),
  class = "data.frame", row.names = c(NA, -19L))

Summary table 1: 'area.within'

The first table I wish to create is called 'area.within'. It should cover only moves within the same area (i.e. it counts the total number of persons for rows where 'london' appears in both 'area.old' and 'area.new'). Such rows will probably occur several times in the data table. The same is then done for every other area, so the summary would be:

      area.within persons
1      london      13
2       leeds       5
3    plymouth       5

Using the data.table package, I have got as far as:

setDT(migration)[as.character(area.old)==as.character(area.new)]

... but this doesn't get rid of duplicates: each area still appears once per matching row instead of once with a total.

Summary table 2: 'moved.from'

The second table will summarise areas which have experienced people moving out (i.e. the unique values in 'area.old'). It will identify rows for which columns 1 and 2 are different and add together all the persons detailed there (i.e. excluding those who moved within the same area, who are counted in summary table 1). The resulting table should be:

      moved.from persons
1     london      24
2      leeds      17
3   plymouth      19

Summary table 3: 'moved.to'

The third table summarises which areas have experienced people moving in (i.e. the unique values in 'area.new'). It will identify all the unique areas for which columns 1 and 2 are different and add together all the persons detailed there (i.e. excluding those who moved within the same area, who are counted in summary table 1). The resulting table should be:

      moved.to persons
1       london       5
2         york       9
3    cambridge       2
4      bristol       5
5      glasgow       6
6        leeds       8
7    harrogate       3
8   manchester       4
9     plymouth       0
10       poole       2
11   newcastle       3
12        bath       4
13      oxford       9

Most importantly, the sum of all the persons in table 2 should equal the sum of all the persons in table 3. That value, added to the persons total for table 1, should then equal the sum of all the persons in the original table.

If anyone could help me work out how to structure my code using the data.table package to produce these tables, I should be most grateful.

Recommended answer

I think using data.table is a good choice.

library(data.table)
setDT(migration) # converts to a data.table by reference; this has to be done only once



1.

To avoid duplicates, just sum the persons up by city as follows:

migration[as.character(area.old)==as.character(area.new), 
                 .(persons = sum(persons)), 
                 by=.(area.within = area.new)]



2.

This is very similar to 1., but uses != in the i argument:

migration[as.character(area.old)!=as.character(area.new), 
                 .(persons = sum(persons)), 
                 by=.(moved.from = area.old)]



3.

Same as 2., but grouped by area.new:

migration[as.character(area.old)!=as.character(area.new), 
                 .(persons = sum(persons)), 
                 by=.(moved.to = area.new)]
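Grouping after the filter only returns areas that actually occur among the filtered rows, so an area with no arrivals from elsewhere (plymouth in the example data) is absent from this result rather than listed with 0. If a zero row is wanted, one possible approach is to join the grouped result against the full list of factor levels and replace the resulting NA values. This is only a sketch: the names all.areas, arrivals and arrivals.full are made up for illustration, and the example data is rebuilt compactly so that the block runs on its own.

```r
library(data.table)

# Example data from the question, rebuilt compactly (same 19 rows)
migration <- data.table(
  area.old = factor(rep(c("london", "leeds", "plymouth"), times = c(7, 5, 7))),
  area.new = factor(c("london", "york", "cambridge", "bristol", "glasgow",
                      "london", "leeds", "london", "leeds", "york",
                      "harrogate", "manchester", "london", "plymouth",
                      "poole", "newcastle", "bath", "oxford", "plymouth")),
  persons  = c(6L, 3L, 2L, 5L, 6L, 7L, 8L, 4L, 5L, 6L, 3L, 4L, 1L, 1L, 2L,
               3L, 4L, 9L, 4L)
)

# Every destination area, including those with no arrivals from elsewhere
# (all.areas is a hypothetical helper table)
all.areas <- data.table(moved.to = levels(migration$area.new))

# Arrivals per destination, as in 3., with the key stored as character
arrivals <- migration[as.character(area.old) != as.character(area.new),
                      .(persons = sum(persons)),
                      by = .(moved.to = as.character(area.new))]

# Keep one row per area in all.areas; areas without a match get NA
arrivals.full <- arrivals[all.areas, on = "moved.to"]

# Replace the NA totals with 0
arrivals.full[is.na(persons), persons := 0L]
```

The join keeps one row per factor level, so the row order follows levels(migration$area.new) rather than the order of first appearance in the data.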

Alternative

As 2. and 3. are very similar, you can also do:

moved <- migration[as.character(area.old)!=as.character(area.new)]
#2
moved[,.(persons = sum(persons)), by=.(moved.from = area.old)]
#3
moved[,.(persons = sum(persons)), by=.(moved.to = area.new)]

This way, the selection of the right rows has to be done only once.
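The consistency check requested in the question (tables 2 and 3 have equal totals, and together with table 1 they account for every person in the original data) can be written down directly. The following is a minimal sketch under a few assumptions: the example data is rebuilt with character columns, so the as.character() calls are not needed here, and the result names area.within, moved.from and moved.to are chosen for illustration only.

```r
library(data.table)

# Example data from the question, rebuilt with character columns
migration <- data.table(
  area.old = rep(c("london", "leeds", "plymouth"), times = c(7, 5, 7)),
  area.new = c("london", "york", "cambridge", "bristol", "glasgow",
               "london", "leeds", "london", "leeds", "york",
               "harrogate", "manchester", "london", "plymouth",
               "poole", "newcastle", "bath", "oxford", "plymouth"),
  persons  = c(6L, 3L, 2L, 5L, 6L, 7L, 8L, 4L, 5L, 6L, 3L, 4L, 1L, 1L, 2L,
               3L, 4L, 9L, 4L)
)

# Table 1: moves within the same area
area.within <- migration[area.old == area.new,
                         .(persons = sum(persons)),
                         by = .(area.within = area.new)]

# Tables 2 and 3: select the between-area rows once, then group twice
moved      <- migration[area.old != area.new]
moved.from <- moved[, .(persons = sum(persons)), by = .(moved.from = area.old)]
moved.to   <- moved[, .(persons = sum(persons)), by = .(moved.to = area.new)]

# Departures and arrivals must balance
stopifnot(sum(moved.from$persons) == sum(moved.to$persons))

# Within-area plus between-area moves must cover the whole dataset
stopifnot(sum(area.within$persons) + sum(moved.from$persons) ==
            sum(migration$persons))
```

stopifnot() raises an error if either condition fails, so the same two lines can be kept in a script that processes the full dataset.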
