String comparison against factors in Stata
Problem description
Suppose I have a factor variable with labels "a", "b" and "c" and want to see which observations have a label of "b". Stata refuses to parse
gen isb = myfactor == "b"
Sure, there is literally a "type mismatch", since my factor is encoded as an integer and so cannot be compared to the string "b". However, it wouldn't kill Stata to (i) perform the obvious parse or (ii) provide a translator function so I can write the comparison as
label(myfactor) == "b"
Using decode to (re)create a string variable defeats the purpose of encoding, which is to save space and make computations more efficient, right?
I hadn't really expected the comparison above to work, but I at least figured there would be a one- or two-line approach. Here is what I have found so far. There is a nice macro ("extended") function that maps the other way (from an integer to a label), seen below as local labi: label ... Here's the solution using it:
// sample data
clear
input str5 mystr int mynum
a 5
b 5
b 6
c 4
end
encode mystr, gen(myfactor)
// first, how many groups are there?
by myfactor, sort: gen ng = _n == 1
replace ng = sum(ng)
scalar ng = ng[_N]
drop ng
// now, which code corresponds to "b"?
forvalues i = 1/`=ng' {
    local labi: label myfactor `i'
    if "b" == "`labi'" {
        scalar bcode = `i'
        // leave the loop once the match is found
        continue, break
    }
}
di bcode
The second step is what irks me, but I'm sure there's also a faster, more idiomatic way of performing the first step. Can I grab the length of the label vector, for example?
Solution
An example:
clear all
set more off
sysuse auto
gen isdom = 1 if foreign == "Domestic":`:value label foreign'
list foreign isdom in 1/60
This creates a variable called isdom that equals 1 if foreign's value label is "Domestic" and is missing otherwise. It uses an extended macro function. From [U] 18.3.8 Macro expressions:
Also, typing
command that makes reference to `:extended macro function'
is equivalent to
local macroname : extended macro function
command that makes reference to `macroname'
This explains one of the two colons (:) in the offered syntax. The other can be explained by:
... to specify value labels directly in an expression, rather than through the underlying numeric value ... You specify the label in double quotes (""), followed by a colon (:), followed by the name of the value label.
The quote is from Stata tip 14: Using value labels in expressions, by Kenneth Higbee, The Stata Journal (2004). Freely available at http://www.stata-journal.com/sjpdf.html?articlenum=dm0009
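Applied to the data in the question, the same syntax might look like the sketch below. It assumes the value label created by encode is named myfactor (encode names the label after the new variable unless told otherwise); the variable names isb and isb2 and the macro name lblname are made up for illustration.

```stata
// Sketch: rebuild the question's sample data, then flag label "b" directly.
clear
input str5 mystr int mynum
a 5
b 5
b 6
c 4
end
encode mystr, gen(myfactor)

// "label":labelname evaluates to the numeric code behind that label.
// Assumption: the value label is called myfactor (the encode default).
gen byte isb = myfactor == "b":myfactor

// Two-step form: look up the label name first, then use it.
local lblname : value label myfactor
gen byte isb2 = myfactor == "b":`lblname'

list mystr myfactor isb isb2
```

Unlike the gen ... if form above, this writes 0 rather than missing for the non-matching observations, which is usually more convenient for an indicator.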
Edit
On computing the number of distinct observations, another way is:
by myfactor, sort: gen ng = _n == 1
count if ng
scalar sc_ng = r(N)
display sc_ng
But yours is fine. In fact, it is documented here: http://www.stata.com/support/faqs/data-management/number-of-distinct-observations/, along with more methods and comments.
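As for grabbing the "length of the label vector", two further possibilities are sketched here, not as the recommended method but as options to check against your Stata version. The first counts the distinct values actually present in the data; the second asks the value label how many values it maps, assuming the label is named myfactor. The macro names levels, ng and nlab are hypothetical.

```stata
// Sketch: the question's sample data again, so the block runs on its own.
clear
input str5 mystr int mynum
a 5
b 5
b 6
c 4
end
encode mystr, gen(myfactor)

// (1) Collect the distinct values, then count the words in that list.
levelsof myfactor, local(levels)
local ng : word count `levels'
display `ng'

// (2) Ask the value label itself: label list leaves the number of
//     mapped values in r(k). Assumes the label is named myfactor.
label list myfactor
local nlab = r(k)
display `nlab'
```

The two counts can differ: the first reflects the observations currently in memory, while the second reflects every value defined in the label, whether or not any observation uses it.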