Stata: Extracting values and saving them as scalars (and more)


Question


This is a follow-up to the question "Stata: replace, if, forvalues". Consider this data:

set seed 123456
set obs 5000
g firmid = "firm" + string(_n)    /* Observation (firm) id */
g nw = floor(100*runiform())      /* Number of workers in a firm */
g double lat = 39+runiform()      /* Latitude in decimal degree of a firm */
g double lon = -76+runiform()     /* Longitude in decimal degree of a firm */

The first 10 observations are:

     +--------------------------------------+
     | firmid   nw         lat          lon |
     |--------------------------------------|
  1. |  firm1   81   39.915526   -75.505018 |
  2. |  firm2   35   39.548523   -75.201567 |
  3. |  firm3   10   39.657866    -75.17988 |
  4. |  firm4   83   39.957938   -75.898837 |
  5. |  firm5   56   39.575881   -75.169157 |
  6. |  firm6   73   39.886184   -75.857255 |
  7. |  firm7   27    39.33288   -75.724665 |
  8. |  firm8   75   39.165549    -75.96502 |
  9. |  firm9   64   39.688819   -75.232764 |
 10. | firm10   76   39.012228   -75.166272 |
     +--------------------------------------+

I need to calculate the distances between firm 1 and all other firms. So, the vincenty command looks like:

. scalar theLat = 39.915526
. scalar theLon = -75.505018
. vincenty lat lon theLat theLon, hav(distance_km) inkm

The vincenty command creates the distance_km variable that has distances between each observation and firm 1. Here, I manually copy and paste the two numbers that are 39.915526 and -75.505018.

Question 1: What's the syntax that extracts those numbers?

Now, I can keep the observations where distance_km <= 2. Then,

. egen near_nw_sum = sum(nw)

will create the sum of workers within 2 kilometers of firm 1. (Or the collapse command may do the job.)

Question 2: I have to do this for all firms, and the final data should look like:

     +-----------------------------------------------------------------+
     | firmid   nw         lat          lon            near_nw_sum     |
     |-----------------------------------------------------------------|
  1. |  firm1   81   39.915526   -75.505018  (# workers near firm1)    |
  2. |  firm2   35   39.548523   -75.201567  (# workers near firm2)    |
  3. |  firm3   10   39.657866    -75.17988  (# workers near firm3)    |
  4. |  firm4   83   39.957938   -75.898837  (# workers near firm4)    |
  5. |  firm5   56   39.575881   -75.169157  (# workers near firm5)    |
  6. |  firm6   73   39.886184   -75.857255  (# workers near firm6)    |
  7. |  firm7   27    39.33288   -75.724665  (# workers near firm7)    |
  8. |  firm8   75   39.165549    -75.96502  (# workers near firm8)    |
  9. |  firm9   64   39.688819   -75.232764  (# workers near firm9)    |
 10. | firm10   76   39.012228   -75.166272  (# workers near firm10)   |
     +-----------------------------------------------------------------+

Creating the near_nw_sum variable is my final goal. I need your help here because my data management skills are weak.

Solution

The following is basically the same strategy found here and is based on your "final goal". Again, whether it is useful depends on the size of your original dataset: joinby creates one observation per pair of firms (N^2 observations for N firms), so you may exceed Stata's limit on the number of observations. However, I believe it does what you want.

clear all
set more off

set seed 123456
set obs 10
g firmid = _n   /* Observation (firm) id */
g nw = floor(100*runiform())      /* Number of workers in a firm */
g double lat = 39+runiform()      /* Latitude in decimal degree of a firm */
g double lon = -76+runiform()     /* Longitude in decimal degree of a firm */
gen dum = 1
list

* joinby procedure
tempfile main
save "`main'"

rename (firmid lat lon nw) =0
joinby dum using "`main'"
drop dum

* Pretty print
sort firmid0 firmid
order firmid0 firmid
list, sepby(firmid0)

* Uncomment if you do not want to include workers in the "base" firm.
*drop if firmid0 == firmid

* Compute distance
vincenty lat0 lon0 lat lon, hav(distance_km) inkm
keep if distance_km <= 40 // an arbitrary distance
list, sepby(firmid0)

* Compute workers of nearby-firms
collapse (sum) nw_sum=nw (mean) nw0 lat0 lon0, by(firmid0)
list

What it does is form pairwise combinations of firms, compute the distance within each pair, and sum the workers of nearby firms. There is no need to extract scalars as asked in Question 1. There is also no need to complicate the variable firmid by converting it to a string.
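To make the "pairwise combinations" step concrete, here is a minimal sketch on three made-up firms. It uses Stata's built-in cross command rather than joinby on a constant; that substitution is mine, not part of the answer above, and it should give the same set of pairs without needing the dum variable.

```stata
* Toy data: three firms with made-up worker counts
clear
set obs 3
generate firmid = _n
generate nw = 10 * _n

tempfile main
save "`main'"

* The data in memory become the "base" firms (suffix 0)
rename (firmid nw) (firmid0 nw0)

* Pair every base firm with every firm in the saved file
cross using "`main'"

sort firmid0 firmid
list, sepby(firmid0)
```

With 3 firms this leaves 3 x 3 = 9 observations, one per ordered pair, including each firm paired with itself.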

The following overcomes the problem of the Stata limit on number of observations.

clear all
set more off

* Create empty database
gen x = .
tempfile results
save "`results'", replace

* Create input for exercise
set seed 123456
set obs 500
g firmid = _n   /* Observation (firm) id */
g nw = floor(100*runiform())      /* Number of workers in a firm */
g double lat = 39+runiform()      /* Latitude in decimal degree of a firm */
g double lon = -76+runiform()     /* Longitude in decimal degree of a firm */
gen dum = 1
*list

* Save number of firms
local size = _N
display "`size'"

* joinby procedure
tempfile main
save "`main'"

timer clear 1
timer clear 2
timer clear 3
timer clear 4

quietly {
    timer on 1
    forvalues i=1/`size'{
        timer on 2
        use "`main'" in `i', clear // assumed sorted on firmid
        rename (firmid lat lon nw) =0

        joinby dum using "`main'", unmatched(using)
        drop _merge dum
        order firmid0 firmid
        timer off 2

        timer on 3
        vincenty lat0 lon0 lat lon, hav(dist) inkm
        timer off 3
        keep if dist <= 40 // an arbitrary distance

        timer on 4
        collapse (sum) nw_sum=nw (mean) nw0 lat0 lon0, by(firmid0)

        append using "`results'"
        save "`results'", replace
        timer off 4
    }
    timer off 1
}

use "`results'", clear
sort firmid0
drop x
list

timer list

However inefficient this is, some testing with timer shows that most of the computation time goes into the vincenty command, which you cannot avoid. The following are the times (in seconds) for 10,000 observations with an Intel Core i5 processor and a conventional hard drive (not SSD). Timer 1 is the total, while 2, 3 and 4 are the components (approx.). Timer 3 corresponds to vincenty:

. timer list
   1:   1953.99 /        1 =    1953.9940
   2:    169.19 /    10000 =       0.0169
   3:   1669.95 /    10000 =       0.1670
   4:     94.47 /    10000 =       0.0094
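For readers unfamiliar with timer, each line of that output is the accumulated time, the number of on/off cycles, and their ratio. A small sketch (the timed command is an arbitrary placeholder) shows how the same figures can be read back from r():

```stata
clear
set obs 100000

timer clear 1
timer on 1
generate double u = runiform()   // placeholder for the work being timed
timer off 1

* timer list leaves the total in r(t1) and the number of cycles in r(nt1)
timer list 1
display "total seconds: " r(t1) "   cycles: " r(nt1)
```

In the loop above, timers 2 to 4 are switched on and off once per firm, which is why their counts equal the number of observations.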

Of course, note that both versions compute every distance twice (e.g. both firm1-firm2 and firm2-firm1), which you can probably avoid. As it stands, 110,000 observations will take a long time. On the positive side, I noticed that this second setup demands very little RAM compared with the same number of observations in the first setup. In fact, my 4GB machine freezes with the first setup.
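One possible way to avoid the duplicate distance computations, sketched here for the first (joinby) setup on 10 firms: keep each unordered pair once, compute the distance, then mirror the surviving pairs before collapsing. This is an untimed sketch, not something from the tests above. It assumes the user-written vincenty command is installed, and the mirror and tmp variables are names introduced only for this illustration.

```stata
* Sketch only; assumes vincenty is installed (ssc install vincenty)
clear all
set seed 123456
set obs 10
generate firmid = _n
generate nw = floor(100*runiform())
generate double lat = 39 + runiform()
generate double lon = -76 + runiform()
generate dum = 1

tempfile main
save "`main'"
rename (firmid nw lat lon) (firmid0 nw0 lat0 lon0)
joinby dum using "`main'"
drop dum

* Keep each unordered pair once (self-pairs included), so about half
* as many distances are computed
keep if firmid0 <= firmid
vincenty lat0 lon0 lat lon, hav(distance_km) inkm
keep if distance_km <= 40   // an arbitrary distance

* Mirror the off-diagonal pairs so that both firms act as the base firm
expand 2 if firmid0 != firmid, generate(mirror)
foreach v in firmid nw lat lon {
    generate double tmp = `v'0
    replace `v'0 = `v' if mirror
    replace `v' = tmp if mirror
    drop tmp
}

collapse (sum) nw_sum=nw (mean) nw0 lat0 lon0, by(firmid0)
list
```

Note that joinby still creates all N^2 pairs before half of them are dropped, so this reduces the time spent in vincenty but not the peak memory use.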

Also note that even though I use the same seed as you do, data is different because I create different numbers of observations (not 5000), which makes a difference in the variable creation process.

(By the way, if you wanted to save the value as a scalar you could use subscripting: scalar latitude = lat[1]).
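For completeness, a sketch of how subscripting could answer Question 1 and, placed in a loop, Question 2 directly. It assumes vincenty is installed and accepts scalars for the second pair of coordinates, as in the question. The 2 km cutoff is the one from the question. With N firms it calls vincenty N times on N observations, so it is not expected to be faster than the approaches above.

```stata
* Sketch only; assumes vincenty is installed (ssc install vincenty)
clear all
set seed 123456
set obs 10
generate firmid = _n
generate nw = floor(100*runiform())
generate double lat = 39 + runiform()
generate double lon = -76 + runiform()

generate double near_nw_sum = .

forvalues i = 1/`=_N' {
    * Question 1: pull firm i's coordinates out by subscripting
    scalar theLat = lat[`i']
    scalar theLon = lon[`i']

    capture drop distance_km
    quietly vincenty lat lon theLat theLon, hav(distance_km) inkm

    * Sum workers within 2 km of firm i (firm i itself is included)
    quietly summarize nw if distance_km <= 2, meanonly
    quietly replace near_nw_sum = r(sum) in `i'
}

drop distance_km
list
```

The result keeps one row per firm with near_nw_sum alongside firmid, nw, lat and lon, which is the layout asked for in Question 2.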
