String comparison against factors in Stata


Question

Suppose I have a factor variable with labels "a", "b", and "c" and want to see which observations have a label of "b". Stata refuses to parse

gen isb = myfactor == "b"

Sure, there is literally a "type mismatch", since my factor is encoded as an integer and so cannot be compared to the string "b". However, it wouldn't kill Stata to (i) perform the obvious parse or (ii) provide a translator function so I can write the comparison as label(myfactor) == "b". Using decode to (re)create a string variable defeats the purpose of encoding, which is to save space and make computations more efficient, right?


I hadn't really expected the comparison above to work, but I at least figured there would be a one- or two-line approach. Here is what I have found so far. There is a nice macro ("extended") function that maps the other way (from an integer to a label, seen below as local labi: label ...). Here's the solution using it:

// sample data 

clear
input str5 mystr int mynum
a 5
b 5
b 6
c 4
end

encode mystr, gen(myfactor)

// first, how many groups are there?

by myfactor, sort: gen ng = _n == 1
replace ng = sum(ng)
scalar ng = ng[_N]
drop ng

// now, which code corresponds to "b"?

forvalues i = 1/`=ng' {
    local labi: label myfactor `i'
    if "b" == "`labi'" {
        scalar bcode = `i'
        * exit the loop once the matching code is found
        continue, break
    }
}

di bcode

The second step is what irks me, but I'm sure there's also a faster, more idiomatic way of performing the first step. Can I grab the length of the label vector, for example?

Solution

An example:

clear all
set more off

sysuse auto

gen isdom = 1 if foreign == "Domestic":`:value label foreign'

list foreign isdom in 1/60

This creates a variable called isdom that equals 1 if the value label of foreign is "Domestic" and is missing otherwise. It uses an extended macro function.
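Applied to the sample data from the question, the same syntax might look like the sketch below. It assumes that encode was run without a label() option, in which case the value label takes the name of the new variable (myfactor). The variable name isb is taken from the question; the byte storage type is my own choice.

```stata
* sample data from the question
clear
input str5 mystr int mynum
a 5
b 5
b 6
c 4
end

encode mystr, gen(myfactor)

* assumption: no label() option was given, so the value label is also named myfactor
* isb is 1 where the label is "b" and 0 otherwise (no missings, unlike the "if" form)
gen byte isb = myfactor == "b":myfactor

list mystr myfactor isb
```

No decode is needed here: the comparison is still made on the underlying integer codes.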

From [U] 18.3.8 Macro expressions:

Also, typing

command that makes reference to `:extended macro function'

is equivalent to

local macroname : extended macro function
command that makes reference to `macroname'

This explains one of the two colons (:) in the offered syntax. The other can be explained by

... to specify value labels directly in an expression, rather than through the underlying numeric value ... You specify the label in double quotes (""), followed by a colon (:), followed by the name of the value label.

The quote is from Stata tip 14: Using value labels in expressions, by Kenneth Higbee, The Stata Journal (2004). Freely available at http://www.stata-journal.com/sjpdf.html?articlenum=dm0009
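To see the two colons at work separately, here is a small sketch on the auto data that writes the macro step out in full. The names lblname, isdom_a and isdom_b are only illustrative.

```stata
sysuse auto, clear

* step 1: the extended macro function returns the name of the value label
local lblname : value label foreign
* in the auto data this displays: origin
display "`lblname'"

* step 2: "label":labelname evaluates to the numeric code behind the label
gen byte isdom_a = foreign == "Domestic":`lblname'

* the inline form does both steps in a single line
gen byte isdom_b = foreign == "Domestic":`:value label foreign'

* both variables should agree on every observation
assert isdom_a == isdom_b
```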

Edit

On computing the number of distinct observations, another way is:

by myfactor, sort: gen ng = _n == 1
count if ng
scalar sc_ng = r(N)

display sc_ng

But yours is fine. In fact, it is documented here: http://www.stata.com/support/faqs/data-management/number-of-distinct-observations/, along with more methods and comments.
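For what it's worth, both steps from the question can probably be shortened further. The sketch below is one possibility rather than the only way. It relies on tabulate storing the number of distinct values in r(r), on label list storing the number of mapped values in r(k), and on the "label":labelname syntax being usable in any expression. The scalar names ngroups and nlabels are made up for illustration; bcode is from the question.

```stata
* sample data from the question
clear
input str5 mystr int mynum
a 5
b 5
b 6
c 4
end

encode mystr, gen(myfactor)

* number of distinct values actually present in the data
quietly tabulate myfactor
scalar ngroups = r(r)

* number of values defined in the value label (the "length of the label vector")
quietly label list myfactor
scalar nlabels = r(k)

* code corresponding to "b", without looping over the labels
scalar bcode = "b":myfactor

display scalar(ngroups) " " scalar(nlabels) " " scalar(bcode)
```

Note that the two counts need not agree in general: a value label can define codes that never occur in the data, and the data can hold values that have no label.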
