Stata: Subsetting data using criteria stored in another data set
Question
I have a large data set (Big_data) that I have to subset using values stored in another dta file (Criteria_data). I will show the problem first:
**Big_data**              **Criteria_data**
====================      ================================================
lon       lat             4_digit_id   minlon   maxlon   minlat   maxlat
-76.22    44.27           0765         -78.44   -77.22   34.324   35.011
-67.55    33.19           6161         -66.11   -65.93   40.32    41.88
.......                   ........
(over 1 million obs)      (271 observations)
====================      ================================================
I have to subset the big data as follows:
use Big_data
preserve
keep if (-78.44<lon<-77.22) & (34.324<lat<35.011)
save data_0765, replace
restore
preserve
keep if (-66.11<lon<-65.93) & (40.32<lat<41.88)
save data_6161, replace
restore
....
(1) What is an efficient way to program this subsetting in Stata? (2) Are the inequality expressions written correctly?
Recommended answer
1) Subsetting the data
With 400,000 observations in the main file and 300 in the reference file, the code below takes about 1.5 minutes. I can't test it with double the observations in the main file because the lack of RAM slows my computer to a crawl.
The strategy involves creating as many variables as needed to hold the reference latitudes and longitudes (271*4 = 1084 in the OP's case; Stata IC and up can handle this, see help limits). This requires some reshaping and appending. Then we check which observations of the big data file meet the conditions.
clear all
set more off
*----- create example databases -----
tempfile bigdata reference
input ///
lon lat
-76.22 44.27
-66.0 40.85 // meets conditions
-77.10 34.8 // meets conditions
-66.00 42.0
end
expand 100000
save "`bigdata'"
*list
clear all
input ///
str4 id minlon maxlon minlat maxlat
"0765" -78.44 -75.22 34.324 35.011
"6161" -66.11 -65.93 40.32 41.88
end
drop id
expand 150
gen id = _n
save "`reference'"
*list
*----- reshape original reference file -----
use "`reference'", clear
tempfile reference2
destring id, replace
levelsof id, local(lev)
gen i = 1
reshape wide minlon maxlon minlat maxlat, i(i) j(id)
gen lat = .
gen lon = .
save "`reference2'"
*----- create working database -----
use "`bigdata'"
timer on 1
quietly {
    forvalues num = 1/300 {
        gen minlon`num' = .
        gen maxlon`num' = .
        gen minlat`num' = .
        gen maxlat`num' = .
    }
}
timer off 1
timer on 2
append using "`reference2'"
drop i
timer off 2
*----- flag observations for which conditions are met -----
timer on 3
gen byte flag = 0
foreach le of local lev {
    quietly replace flag = 1 if inrange(lon, minlon`le'[_N], maxlon`le'[_N]) & inrange(lat, minlat`le'[_N], maxlat`le'[_N])
}
timer off 3
*keep if flag
*keep lon lat
*list
timer list
Using the inrange() function means that the minimums and maximums must be adjusted beforehand to satisfy the OP's strict inequalities (the function tests <= and >=).
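As a small illustration of that difference, the sketch below compares inrange() with explicit strict comparisons for the first reference box used in the example above. The two test points and the variable names in_closed and in_open are made up for illustration; the coordinates are read as double so that the boundary value compares exactly with the literal.

```stata
clear
input double(lon lat)
-78.44 34.5
-77.10 34.8
end

* closed intervals: inrange(x, a, b) tests a <= x <= b
gen byte in_closed = inrange(lon, -78.44, -75.22) & inrange(lat, 34.324, 35.011)

* open intervals, as in the question's strict inequalities
gen byte in_open = (lon > -78.44) & (lon < -75.22) & (lat > 34.324) & (lat < 35.011)

list
```

The first observation sits exactly on the minlon boundary, so it should be flagged by the closed test only; the second lies strictly inside the box and should be flagged by both.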
Some expansion using expand, together with correlatives and by (so the data is in long form), could probably speed things up. It's not totally clear to me right now. I'm sure there are better ways in plain Stata mode. Mata may be even better.
(joinby was also tested, but again RAM was a problem.)
Doing the computations in chunks rather than on the complete database significantly eases the RAM problem. Using a main file with 1.2 million observations and a reference file with 300 observations, the following code does all the work in about 1.5 minutes:
set more off
*----- create example big data -----
clear all
set obs 1200000
set seed 13056
gen lat = runiform()*100
gen lon = runiform()*100
local sizebd `=_N' // to be used in computations
tempfile bigdata
save "`bigdata'"
*----- create example reference data -----
clear all
set obs 300
set seed 97532
gen minlat = runiform()*100
gen maxlat = minlat + runiform()*5
gen minlon = runiform()*100
gen maxlon = minlon + runiform()*5
gen id = _n
tempfile reference
save "`reference'"
*----- reshape original reference file -----
use "`reference'", clear
destring id, replace
levelsof id, local(lev)
gen i = 1
reshape wide minlon maxlon minlat maxlat, i(i) j(id)
drop i
tempfile reference2
save "`reference2'"
*----- create file to save results -----
tempfile results
clear all
set obs 0
gen lon = .
gen lat = .
save "`results'"
*----- start computations -----
clear all
* local that controls # of observations in intermediate files
* (sizebd should be a multiple of step; otherwise the trailing observations are never loaded)
local step = 5000 // can't be larger than sizebd
timer clear
timer on 99
forvalues en = `step'(`step')`sizebd' {
    * load observations and join with references
    timer on 1
    local start = `en' - (`step' - 1)
    use in `start'/`en' using "`bigdata'", clear
    timer off 1
    timer on 2
    append using "`reference2'"
    timer off 2
    * flag observations that meet conditions
    timer on 3
    gen byte flag = 0
    foreach le of local lev {
        quietly replace flag = 1 if inrange(lon, minlon`le'[_N], maxlon`le'[_N]) & inrange(lat, minlat`le'[_N], maxlat`le'[_N])
    }
    timer off 3
    * append to result database
    timer on 4
    quietly {
        keep if flag
        keep lon lat
        append using "`results'"
        save "`results'", replace
    }
    timer off 4
}
timer off 99
timer list
display "total time is " `r(t99)'/60 " minutes"
use "`results'"
browse
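The code above pools every matching observation into one result file, whereas the question saves one file per criteria row (data_0765, data_6161, ...). A minimal sketch of that variant follows. It is not part of the original answer and makes several assumptions: the identifier is a string variable called id (the question's 4_digit_id is not a legal Stata variable name), the tiny data sets typed in below stand in for Criteria_data and Big_data, and the strict inequalities of the question are wanted. The bounds are kept in a matrix (the name C is arbitrary) and the ids in local macros so that they survive loading the second data set.

```stata
clear all

* stand-in for Criteria_data (values taken from the example above)
input str4 id double(minlon maxlon minlat maxlat)
"0765" -78.44 -75.22 34.324 35.011
"6161" -66.11 -65.93 40.32 41.88
end
local n = _N

* bounds go into matrix C, ids into locals id1, id2, ...
mkmat minlon maxlon minlat maxlat, matrix(C)
forvalues i = 1/`n' {
    local id`i' = id[`i']
}

* stand-in for Big_data; plain clear keeps matrix C in memory
clear
input double(lon lat)
-76.22 44.27
-66.0 40.85
-77.10 34.8
-66.00 42.0
end

* one pass per criteria row, writing data_<id>.dta to the working directory
forvalues i = 1/`n' {
    preserve
    keep if (lon > C[`i',1]) & (lon < C[`i',2]) & (lat > C[`i',3]) & (lat < C[`i',4])
    save "data_`id`i''", replace
    restore
}
```

This favors simplicity over speed: preserve and restore copy the full data set on every pass, so with over a million observations and 271 criteria rows the chunked approach above may still be the better starting point.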
2) Inequalities
You ask whether your inequalities are correct. They are in fact legal, meaning that Stata will not complain, but the result is probably not what you expect.
The following result may seem surprising:
. display (66.11 < 100 < 67.93)
1
How can the expression evaluate to true (i.e. 1)? Stata first evaluates 66.11 < 100, which is true (i.e. 1), and then sees 1 < 67.93, which is also true, of course.
The intended expression was (and Stata will now do what you want):
. display (66.11 < 100) & (100 < 67.93)
0
You can also rely on the function inrange().
The following example is consistent with the previous explanation:
. display (66.11 < 100 < 0)
0
Stata sees 66.11 < 100, which is true (i.e. 1), and follows up with 1 < 0, which is false (i.e. 0).
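Applied to the filter in the question, the same left-to-right evaluation means the chained longitude test can never be true: (-78.44 < lon) yields 0 or 1, and neither is below -77.22. The sketch below shows this on two made-up observations (the data values are illustrative only, not from the question's files).

```stata
clear
input double(lon lat)
-78.00 34.5
-76.22 44.27
end

* chained form from the question: the lon part is always false,
* while the lat part is always true (0 or 1 is always below 35.011)
count if (-78.44 < lon < -77.22) & (34.324 < lat < 35.011)

* explicit form: each comparison is made against lon or lat itself
count if (lon > -78.44) & (lon < -77.22) & (lat > 34.324) & (lat < 35.011)
```

The first count should be 0, so the question's keep if would drop every observation; the second should be 1, since only the first observation lies inside the box.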