Stata: Subsetting data using criteria stored in other data set


Problem description

I have a large data set. I have to subset the data set (Big_data) using values stored in another dta file (Criteria_data). I will show you the problem first:

   **Big_data**                           **Criteria_data**
====================      ================================================
  lon        lat             4_digit_id   minlon  maxlon  minlat  maxlat
-76.22      44.27              0765       -78.44  -77.22  34.324  35.011
-67.55      33.19              6161       -66.11  -65.93  40.32   41.88
    .......                                   ........
 (over 1 million obs)                    (271 observations)        
====================      ================================================

I have to subset the big data as follows:

use Big_data

preserve
keep if (-78.44<lon<-77.22) & (34.324<lat<35.011)
save data_0765, replace
restore

preserve
keep if (-66.11<lon<-65.93) & (40.32<lat<41.88)
save data_6161, replace
restore

....

(1) What would be an efficient way to program this subsetting in Stata? (2) Are the inequality expressions correctly written?

Answer

1) Subsetting the data

With 400,000 observations in the main file and 300 in the reference file, it takes about 1.5 minutes. I can't test this with double the observations in the main file because the lack of RAM brings my computer to a crawl.

The strategy involves creating as many variables as needed to hold the reference latitudes and longitudes (271*4 = 1084 in the OP's case; Stata IC and up can handle this, see help limits). This requires some reshaping and appending. Then we flag those observations of the big data file that meet the conditions.

clear all
set more off

*----- create example databases -----

tempfile bigdata reference

input ///
lon        lat   
-76.22      44.27
-66.0      40.85 // meets conditions
-77.10     34.8 // meets conditions
-66.00    42.0 
end

expand 100000

save "`bigdata'"
*list

clear all

input ///
str4 id   minlon  maxlon  minlat  maxlat
"0765"       -78.44  -75.22  34.324  35.011
"6161"       -66.11  -65.93  40.32   41.88
end

drop id
expand 150
gen id = _n

save "`reference'"
*list


*----- reshape original reference file -----

use "`reference'", clear

tempfile reference2

destring id, replace
levelsof id, local(lev)

gen i = 1
reshape wide minlon maxlon minlat maxlat, i(i) j(id) 

gen lat = .
gen lon = .

save "`reference2'"


*----- create working database -----

use "`bigdata'"

timer on 1
quietly {
    forvalues num = 1/300 {
        gen minlon`num' = .
        gen maxlon`num' = .
        gen minlat`num' = .
        gen maxlat`num' = .
    }
}
timer off 1

timer on 2
append using "`reference2'"
drop i
timer off 2

*----- flag observations for which conditions are met -----

timer on 3
gen byte flag = 0
foreach le of local lev {
    quietly replace flag = 1 if inrange(lon, minlon`le'[_N], maxlon`le'[_N]) & inrange(lat, minlat`le'[_N], maxlat`le'[_N])
}
timer off 3

*keep if flag
*keep lon lat

*list

timer list

The inrange() function implies that the minimums and maximums must be adjusted beforehand to satisfy the OP's strict inequalities (the function tests <=, >=).

Probably some expansion using expand, plus sequential identifiers and by (so the data is in long form), could speed things up. It's not totally clear to me right now. I'm sure there are better ways in plain Stata. Mata may be even better.

(joinby was also tested, but again RAM was a problem.)
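For reference, a rough sketch of that joinby route, reusing the tempfiles from the example above (ref2 is just a scratch name). The constant key crosses every big-data row with every reference row, which is exactly why RAM becomes a problem:

use "`reference'", clear
gen byte key = 1
tempfile ref2
save "`ref2'"

use "`bigdata'", clear
gen byte key = 1
joinby key using "`ref2'"    // forms all pairwise combinations within key
keep if inrange(lon, minlon, maxlon) & inrange(lat, minlat, maxlat)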

Doing the computations in chunks rather than on the complete database significantly improves the RAM issue. Using a main file with 1.2 million observations and a reference file with 300 observations, the following code does all the work in about 1.5 minutes:

set more off

*----- create example big data -----

clear all

set obs 1200000
set seed 13056

gen lat = runiform()*100
gen lon = runiform()*100

local sizebd `=_N' // to be used in computations

tempfile bigdata
save "`bigdata'"

*----- create example reference data -----

clear all

set obs 300
set seed 97532

gen minlat = runiform()*100
gen maxlat = minlat + runiform()*5

gen minlon = runiform()*100
gen maxlon = minlon + runiform()*5

gen id = _n

tempfile reference
save "`reference'"


*----- reshape original reference file -----

use "`reference'", clear

destring id, replace
levelsof id, local(lev)

gen i = 1
reshape wide minlon maxlon minlat maxlat, i(i) j(id) 
drop i

tempfile reference2
save "`reference2'"


*----- create file to save results -----

tempfile results
clear all
set obs 0

gen lon = .
gen lat = .

save "`results'"


*----- start computations -----

clear all

* local that controls # of observations in intermediate files
local step = 5000 // can't be larger than sizebd

timer clear

timer on 99
forvalues en = `step'(`step')`sizebd' {

    * load observations and join with references
    timer on 1
    local start = `en' - (`step' - 1)
    use in `start'/`en' using "`bigdata'", clear
    timer off 1

    timer on 2
    append using "`reference2'"
    timer off 2

    * flag observations that meet conditions
    timer on 3
    gen byte flag = 0
    foreach le of local lev {
        quietly replace flag = 1 if inrange(lon, minlon`le'[_N], maxlon`le'[_N]) & inrange(lat, minlat`le'[_N], maxlat`le'[_N])
    }
    timer off 3

    * append to result database
    timer on 4
    quietly {
        keep if flag
        keep lon lat
        append using "`results'"
        save "`results'", replace
    }
    timer off 4

}
timer off 99

timer list
display "total time is " `r(t99)'/60 " minutes"

use "`results'"
browse

2) The inequalities

You ask if your inequalities are correct. They are in fact legal, meaning that Stata will not complain, but the result is probably not what you expect.

The following result may seem surprising:

. display  (66.11 < 100 < 67.93)
1

How can the expression evaluate to true (i.e. 1)? Stata first evaluates 66.11 < 100, which is true, and then sees 1 < 67.93, which of course is also true.
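You can verify the two steps one at a time:

. display (66.11 < 100)
1

. display (1 < 67.93)
1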

The intended expression was (and Stata will now do what you want):

. display  (66.11 < 100) & (100 < 67.93)
0

You can also rely on the function inrange().
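For example (keep in mind that its bounds are inclusive):

. display inrange(100, 66.11, 67.93)
0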

The following example is consistent with the previous explanation:

. display  (66.11 < 100 < 0)
0

Stata sees 66.11 < 100, which is true (i.e. 1), and follows up with 1 < 0, which is false (i.e. 0).
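Applied to the OP's code, the first keep statement would therefore read:

keep if (lon > -78.44) & (lon < -77.22) & (lat > 34.324) & (lat < 35.011)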
