Stata to Pandas: even if there are repeated Value Labels?


Question

I am trying to open a .dta file as a DataFrame, but an error appears: "ValueError: Value labels for column ... are not unique. The repeated labels are:" followed by the labels which appear twice in a column.

I know that labeling multiple codes with the exact same value label in Stata is not clever (not my fault :)). After some research I know that pandas will not accept repeated value labels (this IS clever).

But I can't figure out a (good) solution. Is there:

a. a smooth way to open the data with pandas and just rename the duplicates (like "label" to "label(2)") in the process?

Here is what the data looks like (value labels in parentheses):

  | multilabel    
1 | 11 (oneone or twotwo)
2 | 22 (oneone or twotwo)
3 | 33 (other-label-which-is-unique)

My code so far:

import pandas as pd

#followed by any option that delivers this solution:
dataframe = pd.read_stata('file.dta')

b. a fast and easy way to tell Stata: just rename all repeated value labels to "label(2)" instead of "label"? And yes, the code so far is also rather boring:

use "file.dta"

* followed by a loop which finds repeated labels and changes them

save "file.dta", replace

And yes, there are too many repeated value labels to go through them one by one.

And here are the Stata commands to produce a minimal example:

set obs 1
generate var1 = 1 in 1
set obs 2
replace var1 = 2 in 2
set obs 3
replace var1 = 3 in 3
generate var2 = 11 in 1
replace var2 = 22 in 2
replace var2 = 33 in 3
rename var2 multilabel
label define labelrepeat 11 "oneone or twotwo" 22 "oneone or twotwo"
label values multilabel labelrepeat

I'd be glad of any suggestion!

Recommended answer

If you have a variable with repeated labels, then

decode multilabel, gen(valuelabel)
label values multilabel

puts the value labels in a string variable and then undoes the association between the values of multilabel and the previously attached value labels. I don't know what else you need to do, or why you would need to do anything else. You now have the same information as before. I don't know whether pandas will ignore the definition of value labels.
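On the pandas side, option (a) from the question could be approached roughly as follows. This is only a sketch, not something the answer above tested: dedupe_labels is a hypothetical helper name, and the "label(2)" suffix format is simply the one the question asked for. It takes one {code: label} mapping and gives every repeated label a numeric suffix.

```python
from collections import Counter


def dedupe_labels(labels):
    """Return a copy of a {code: label} mapping in which repeated labels
    get a numeric suffix: "label", "label(2)", "label(3)", ...

    Hypothetical helper for illustration; not part of pandas.
    """
    seen = Counter()
    unique = {}
    # Walk the codes in sorted order so the lowest code keeps the bare label.
    for code in sorted(labels):
        label = labels[code]
        seen[label] += 1
        unique[code] = label if seen[label] == 1 else f"{label}({seen[label]})"
    return unique


# The label set from the minimal example in the question.
labelrepeat = {11: "oneone or twotwo", 22: "oneone or twotwo"}
fixed = dedupe_labels(labelrepeat)
print(fixed)
```

To use it, read the raw codes with pd.read_stata('file.dta', convert_categoricals=False), which skips the label conversion that raises the ValueError, and then apply the cleaned mapping with something like dataframe['multilabel'].map(fixed). Codes that have no label in the mapping come out as NaN under map, so those need separate handling. The sketch also does not check whether a suffixed name such as "label(2)" already exists as a genuine label.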

For completeness, here's a way to find out which variables have value labels that aren't in one-to-one correspondence with numeric values.

* your sandbox, simplified and extended  
clear 
set obs 3
generate var1 = _n 
generate multilabel = 11 * _n
label define labelrepeat 11 "oneone or twotwo" 22 "oneone or twotwo"
label values multilabel labelrepeat

label define var1 1 "frog" 2 "toad" 3 "newt"
label val var1 var1 


* my code 
local bad 
ds *, has(vallabel) 

quietly foreach v in `r(varlist)' { 
    tempvar decoded diff 
    decode `v', gen(`decoded') 
    bysort `decoded' (`v') : gen `diff' = `v'[1] != `v'[_N] & !missing(`decoded') 
    count if `diff' 
    if r(N) > 0 local bad `bad' `v' 
    drop `decoded' `diff' 
} 

di "`bad'" 
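A similar check can be sketched in Python. It assumes the label definitions are available as a nested dict of the form {label_set_name: {code: label}}, which, as far as I understand it, is the shape returned by the value_labels() method of pandas' StataReader. The function name repeated_labels and the sample data are illustrative only. Unlike the Stata loop above, which looks at values actually present in the data, this inspects the label definitions themselves.

```python
from collections import Counter


def repeated_labels(value_labels):
    """Map each label-set name to the sorted list of labels that are
    attached to more than one code. Label sets without repeats are omitted.

    Hypothetical helper for illustration; not part of pandas.
    """
    bad = {}
    for name, mapping in value_labels.items():
        counts = Counter(mapping.values())
        repeats = sorted(label for label, n in counts.items() if n > 1)
        if repeats:
            bad[name] = repeats
    return bad


# Same label sets as the Stata sandbox above.
value_labels = {
    "labelrepeat": {11: "oneone or twotwo", 22: "oneone or twotwo"},
    "var1": {1: "frog", 2: "toad", 3: "newt"},
}
bad = repeated_labels(value_labels)
print(bad)
```

Only labelrepeat is reported here, since frog, toad and newt each belong to a single code.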
