How can I import SAS format files into R?

Question

I am trying to analyze data from the 2012-2013 NATS survey, from this location. The zip folder there contains three files: 2012-2013 NATS format.sas, formats.sas7bcat and nats2012.sas7bdat. The third file contains the actual data, and the second contains the labels that go with the data. For example, if the variable 'Race' in the raw data file has categories 1, 2, 3 and 4, the labels show that these categories stand for 'Caucasian', 'African-American', 'Hispanic' and 'Other'. I have been able to import the sas7bdat file into R using the 'sas7bdat' package, but when I do cross-tabulations I cannot see which category each cell represents. For example, if I try this:

table(SMOKSTATUS_R, RACEETHNIC)

What I get is:

RACEETHNIC
SMOKSTATUS_R     1     2     3     4     5     6     7     8     9
           1  4045   455    55     7    63     0   675   393   373
           2  1183   222    38     2    26     0   217   255   154
           3 14480   957   238    14    95     3  1112   950   369
           4 23923  2532  1157    23   147     1  1755  3223   909
           5    81    18     4     0     1     0    11    17     9

As far as I can tell, the only way to attach the labels to the data is to type them in manually, but there are 240 variables, and besides, the labels already exist in the formats.sas7bcat file. Is there any way to import the format file into R so that the labels can be attached to the variables? This is how it is done in SAS, but I do not have access to SAS right now. Thanks for all the help.

Recommended answer

This should be a one-liner:

library('haven')
sas <- read_sas('nats2012.sas7bdat', 'formats.sas7bcat')

with(sas, table(SMOKSTATUS_R, RACEETHNIC))
#             RACEETHNIC
# SMOKSTATUS_R     1     2     3     4     5     6     7     8     9
#            1  4045   455    55     7    63     0   675   393   373
#            2  1183   222    38     2    26     0   217   255   154
#            3 14480   957   238    14    95     3  1112   950   369
#            4 23923  2532  1157    23   147     1  1755  3223   909
#            5    81    18     4     0     1     0    11    17     9

table(names(attr(sas[, 'SMOKSTATUS_R'], 'labels')[sas[, 'SMOKSTATUS_R']]),
      names(attr(sas[, 'RACEETHNIC'], 'labels')[sas[, 'RACEETHNIC']]))

#                          Amer. Indian, AK Nat. Only, Non-Hispanic
# Current everyday smoker                                        63
# Current some days smoker                                       26
# Former smoker                                                  95
# Never smoker                                                  147
# Unknown                                                         1

Use haven to read in the data. It also gives you some useful attributes, namely the variable label and the value labels:

attributes(sas$SMOKSTATUS_R)
# $label
# [1] "SMOKER STATUS (4-level)"
# 
# $class
# [1] "labelled"
# 
# $labels
# Current everyday smoker Current some days smoker            Former smoker 
#                       1                        2                        3 
# Never smoker                  Unknown 
#            4                        5 
# 
# $is_na
# [1] FALSE FALSE FALSE FALSE FALSE
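
To make the lookup concrete, here is a minimal base-R sketch on toy data. The vector `x` below is made up and only imitates the shape of a column read by haven (codes plus a named `labels` attribute); it is not the NATS data. Under that assumption, the codes can be turned into a labelled factor like this:

```r
# Toy stand-in for a column read by haven: numeric codes plus a named
# 'labels' attribute (names are the labels, values are the codes).
x <- c(4, 3, 3, 4, 1)
attr(x, "labels") <- c("Current everyday smoker"  = 1,
                       "Current some days smoker" = 2,
                       "Former smoker"            = 3,
                       "Never smoker"             = 4,
                       "Unknown"                  = 5)

lbl <- attr(x, "labels")

# Map each code to its label; the factor levels follow the order of the codes.
f <- factor(x, levels = unname(lbl), labels = names(lbl))

table(f)
```

A factor keeps categories with zero counts in the table, which plain character labels would drop. haven also provides `as_factor()` for labelled vectors, which may be more convenient if your installed version supports it.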

You can easily wrap this in a function for more general use:

do_fmt <- function(x, fmt) {
  lbl <- if (!missing(fmt))
    unlist(unname(fmt)) else attr(x, 'labels')

  if (!is.null(lbl))
    tryCatch(names(lbl[match(unlist(x), lbl)]),
             error = function(e) {
               message(sprintf('formatting failed for %s', attr(x, 'label')),
                       domain = NA)
               x
             }) else x
}

table(do_fmt(sas[, 'SMOKSTATUS_R']),
      do_fmt(sas[, 'RACEETHNIC']))

#                          Amer. Indian, AK Nat. Only, Non-Hispanic
# Current everyday smoker                                        63
# Current some days smoker                                       26
# Former smoker                                                  95
# Never smoker                                                  147
# Unknown                                                         1

And apply it to the entire data set:

sas[] <- lapply(sas, do_fmt)
sas$SMOKSTATUS_R[1:4]
# [1] "Never smoker"  "Former smoker" "Former smoker" "Never smoker" 
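
The `sas[] <- lapply(...)` idiom can be tried on a small made-up data frame first. In this sketch, `to_label()` is a simplified, hypothetical stand-in for `do_fmt()` (no error handling, no external format list), and `df` is toy data rather than the survey file:

```r
# Simplified stand-in for do_fmt(): look up each code in the 'labels'
# attribute and return the label names; columns without labels pass through.
to_label <- function(x) {
  lbl <- attr(x, "labels")
  if (is.null(lbl)) return(x)
  names(lbl)[match(x, lbl)]
}

status <- c(4, 3, 1)
attr(status, "labels") <- c("Current everyday smoker" = 1,
                            "Former smoker"           = 3,
                            "Never smoker"            = 4)

df <- data.frame(id = 1:3)
df$status <- status            # assigning with $ keeps the 'labels' attribute

# Assigning into df[] replaces the columns but keeps the data frame itself.
df[] <- lapply(df, to_label)
df
```

Columns that carry no `labels` attribute (here `id`) come back unchanged, so the idiom is safe to run over a data set where only some variables are coded.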

This sometimes fails, though, as in the example below, where the values attached to the negative codes are not whole numbers. This looks like a problem with the haven package:

attr(sas$SMOKTYPE, 'labels')
# INAPPLICABLE            REFUSED                 DK    NOT ASCERTAINED 
#     -4.00000           -0.62500           -0.50000           -0.46875 
# PREMADE CIGARETTES      ROLL-YOUR-OWN               BOTH 
#            1.00000            2.00000            3.00000 
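
One way to spot variables affected by this is to look for codes that occur in the data but have no matching value among the labels. The helper `unlabelled_codes()` below is hypothetical (it is not part of haven), and the vector `x` is toy data that only mimics the mismatch shown above:

```r
# Hypothetical helper: return the codes present in the data that do not
# appear among the values of the 'labels' attribute.
unlabelled_codes <- function(x, lbl = attr(x, "labels")) {
  codes <- unique(x[!is.na(x)])
  sort(codes[!codes %in% lbl])
}

# Toy data: the data use -1, -7 and -8, but the labels attribute carries
# -4, -0.625 and -0.5 for the negative codes.
x <- c(-1, -1, 1, 2, 3, -7, -8, 1)
attr(x, "labels") <- c(INAPPLICABLE = -4, REFUSED = -0.625, DK = -0.5,
                       "PREMADE CIGARETTES" = 1, "ROLL-YOUR-OWN" = 2,
                       BOTH = 3)

unlabelled_codes(x)
```

An empty result means every code in the data has a label; anything else points at a variable whose labels should be checked against the format file.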

So instead, you can parse the format.sas file with some simple regexes:

locf <- function(x) {
  x <- data.frame(x, stringsAsFactors = FALSE)
  x[x == ''] <- NA
  indx <- !is.na(x)

  x[] <- lapply(seq_along(x), function(ii) {
    idx <- cumsum(indx[, ii])
    idx[idx == 0] <- NA
    x[, ii][indx[, ii]][idx]
  })
  x[, 1]
}

fmt <- readLines('~/desktop/2012-2013-NATS-Format/2012-2013-NATS-Format.sas')
## not sure if comments are allowed in the value definitions, but
## this will check for those in case
fmt <- gsub('\\*.*;|\\/\\*.*\\*\\/', '', fmt)

vars <- gsub('(?i)value\\W+(\\w*)|.', '\\1', fmt, perl = TRUE)
vars <- locf(vars)

regex <- '[\'\"].*[\'\"]|[\\w\\d-]+'
vals <- gsub(sprintf('(?i)\\s*(%s)\\s*(=)\\s*(%s)|.', regex, regex),
               '\\1\\2\\3', fmt, perl = TRUE)

View(dd <- na.omit(data.frame(values = vars, formats = vals,
                              stringsAsFactors = FALSE)))

sp <- split(dd$formats, dd$values)
sp <- lapply(sp, function(x) {
  x <- Filter(nzchar, x)
  x <- strsplit(x, '=')
  tw <- function(x) gsub('^\\s+|\\s+$', '', x)
  sapply(x, function(y)
    setNames(tw(y[1]), tw(y[2])))
})
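
The parsing idea is easier to follow on a small input. The sketch below is not the code above: it uses a made-up fragment written in the style of a `value` block, handles a single block only, and assumes every mapping sits on its own line as `code = 'label'`:

```r
# Made-up fragment in the style of a PROC FORMAT 'value' block.
fmt <- c("proc format;",
         "  value A5_",
         "    -1 = 'INAPPLICABLE'",
         "    1 = 'PREMADE CIGARETTES'",
         "    2 = 'ROLL-YOUR-OWN'",
         "  ;",
         "run;")

# Match lines of the form  code = 'label'  and capture both parts.
pat  <- "^\\s*(-?\\d+)\\s*=\\s*'([^']*)'.*$"
rows <- grep(pat, fmt, value = TRUE, perl = TRUE)

codes  <- as.numeric(sub(pat, "\\1", rows, perl = TRUE))
labels <- sub(pat, "\\2", rows, perl = TRUE)

# Named vector in the same shape as a haven 'labels' attribute:
# names are the labels, values are the codes.
lbl <- setNames(codes, labels)
lbl
```

Because `lbl` has the same shape as a `labels` attribute, it can be used for the same `match()`-based lookup as before. Real format files may contain ranges, unquoted labels or several mappings per line, which this sketch does not cover.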

So, for example, the smoke type format (one of those that failed above) gets parsed like this:

sp['A5_']
# $A5_
# 'INAPPLICABLE'            'REFUSED'                 'DK' 
#           "-1"                 "-7"                 "-8" 
# 'NOT ASCERTAINED' 'PREMADE CIGARETTES'      'ROLL-YOUR-OWN'  'BOTH' 
#              "-9"                  "1"                  "2"     "3" 

Then you can use the function again to apply the parsed formats to the data:

table(do_fmt(sas['SMOKTYPE'], sp['A5_']))

# 'BOTH'                 'DK'       'INAPPLICABLE' 
#   736                   17                51857 
# 'PREMADE CIGARETTES'            'REFUSED'      'ROLL-YOUR-OWN' 
#                 7184                    2                  396 
