sapply vs. lapply while reading files and rbind'ing them

Problem description

I followed Hadley's answer in the thread "Issue in Loading multiple .csv files into single dataframe in R using rbind" to read multiple CSV files and combine them into one data frame. I also experimented with lapply vs. sapply, as discussed in "Grouping functions (tapply, by, aggregate) and the *apply family".

Here's my first CSV file:

dput(File1)
structure(list(First.Name = structure(c(1L, 2L, 1L, 1L, 1L), .Label = c("A", 
"C"), class = "factor"), Last.Name = structure(c(1L, 2L, 2L, 
2L, 2L), .Label = c("B", "D"), class = "factor"), Income = c(55L, 
23L, 34L, 45L, 44L), Tax = c(23L, 21L, 22L, 24L, 25L), Location = structure(c(3L, 
3L, 1L, 4L, 2L), .Label = c("Americas", "AP", "EMEA", "LATAM"
), class = "factor")), .Names = c("First.Name", "Last.Name", 
"Income", "Tax", "Location"), class = "data.frame", row.names = c(NA, 
-5L))

Here's my second CSV file:

dput(File2)
structure(list(First.Name = structure(c(1L, 2L, 1L, 1L, 1L), .Label = c("A", 
"C"), class = "factor"), Last.Name = structure(c(1L, 2L, 2L, 
2L, 2L), .Label = c("B", "D"), class = "factor"), Income = c(55L, 
55L, 55L, 55L, 55L), Tax = c(24L, 24L, 24L, 24L, 24L), Location = structure(c(3L, 
3L, 1L, 4L, 2L), .Label = c("Americas", "AP", "EMEA", "LATAM"
), class = "factor")), .Names = c("First.Name", "Last.Name", 
"Income", "Tax", "Location"), class = "data.frame", row.names = c(NA, 
-5L))

Here's my code:

dat1 <-",First.Name,Last.Name,Income,Tax,Location\n1,A,B,55,23,EMEA\n2,C,D,23,21,EMEA\n3,A,D,34,22,Americas\n4,A,D,45,24,LATAM\n5,A,D,44,25,AP"
dat2 <-",First.Name,Last.Name,Income,Tax,Location\n1,A,B,55,24,EMEA\n2,C,D,55,24,EMEA\n3,A,D,55,24,Americas\n4,A,D,55,24,LATAM\n5,A,D,55,24,AP"

tc1 <- textConnection(dat1)
tc2 <- textConnection(dat2)

merged_file <- do.call(rbind, lapply(list(tc1,tc2), read.csv))

While this works beautifully, I wanted to change lapply to sapply. From the above thread, I realize that sapply would change the factors read from the CSV files into matrices, but I am unsure why the fields are flipped. For instance, the Income field occupies rows 3 and 8 instead of sitting in a single column.

Here's the code:

tc1 <- textConnection(dat1)
tc2 <- textConnection(dat2)

# change lapply to sapply    
merged_file <- do.call(rbind, sapply(list(tc1,tc2), read.csv))

Here's the output:

    [,1] [,2] [,3] [,4] [,5]
 [1,]    1    2    1    1    1
 [2,]    1    2    2    2    2
 [3,]   55   23   34   45   44
 [4,]   23   21   22   24   25
 [5,]    3    3    1    4    2
 [6,]    1    2    1    1    1
 [7,]    1    2    2    2    2
 [8,]   55   55   55   55   55
 [9,]   24   24   24   24   24
[10,]    3    3    1    4    2

I'd appreciate any help. I am fairly new to R and not sure what's going on.

Recommended answer

The issue has nothing to do with factors; it's the generic sapply vs. lapply difference. Why does sapply get it so wrong whereas lapply gets it right? Remember that in R, data frames are lists of columns, and each column can have a distinct type.
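To make the "list of columns" point concrete, here is a minimal sketch. The data frame `df` is a hypothetical stand-in with made-up values, not one of the asker's files.

```r
# A small stand-in data frame (hypothetical values, for illustration only).
df <- data.frame(
  First.Name = c("A", "C"),
  Income     = c(55L, 23L),
  stringsAsFactors = TRUE
)

is.list(df)        # a data frame is a list...
length(df)         # ...whose length is the number of columns, not rows
sapply(df, class)  # each column keeps its own type
```

Anything that treats a data frame as a generic list will therefore iterate over its columns, which matters for what follows.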

  • lapply returns a list of data frames to rbind, which dispatches to its data frame method and does the concatenation correctly. It keeps corresponding columns together, so your factors emerge correctly.
  • sapply, however...
    • tries to simplify that list, and because every data frame has the same number of columns, it returns a matrix of mode list: one cell per column, one matrix column per file.
    • do.call then hands every cell to rbind as a separate argument, so rbind sees bare vectors rather than data frames.
    • rbind stacks those vectors as rows, which is the unwanted transpose: columns now correspond to rows, so Income lands in rows 3 and 8.
    • with only bare vectors to work with, rbind builds a single numeric matrix (a matrix can hold only one type, unlike a data frame), so the factors are reduced to their internal integer codes (garbage!).
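The difference can be inspected directly. The following is a sketch under stated assumptions: `dat1` and `dat2` are shortened two-column stand-ins for the question's data, `read_one` is a hypothetical helper, and `read.csv(text = ...)` is used in place of `textConnection` so that both calls can reuse the same input.

```r
# Shortened stand-ins for the question's CSV contents (two columns only).
dat1 <- "First.Name,Income\nA,55\nC,23\nA,34"
dat2 <- "First.Name,Income\nA,55\nC,55\nA,55"

# Hypothetical helper: parse a CSV string. stringsAsFactors = TRUE reproduces
# the factor columns from the question on R >= 4.0 as well.
read_one <- function(txt) read.csv(text = txt, stringsAsFactors = TRUE)

via_lapply <- lapply(list(dat1, dat2), read_one)
via_sapply <- sapply(list(dat1, dat2), read_one)

class(via_lapply)  # a plain list holding two data frames
mode(via_sapply)   # "list": every cell holds one whole column
dim(via_sapply)    # one row per CSV column, one column per file

good <- do.call(rbind, via_lapply)  # rbind receives 2 data frames
bad  <- do.call(rbind, via_sapply)  # rbind receives 4 bare vectors

str(good)
bad
```

Here `good` is a six-row data frame with First.Name still a factor, while `bad` is a matrix in which each original column has become a row and the factor levels appear as integer codes.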

Summary: just use lapply.
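For reading files from disk, the pattern might look like the sketch below. The temporary directory, the file names, and their contents are assumptions made so the example runs on its own; in practice `paths` would point at the real CSV files.

```r
# Write two small CSV files to a temporary directory so the sketch is
# self-contained (hypothetical file names and contents).
tmp <- tempfile("csv_demo_")
dir.create(tmp)
writeLines("First.Name,Income\nA,55\nC,23", file.path(tmp, "file1.csv"))
writeLines("First.Name,Income\nA,55\nC,55", file.path(tmp, "file2.csv"))

paths <- list.files(tmp, pattern = "\\.csv$", full.names = TRUE)

# lapply always returns a list, so rbind receives whole data frames.
frames <- lapply(paths, read.csv, stringsAsFactors = TRUE)
merged_file <- do.call(rbind, frames)

str(merged_file)
```

If sapply is preferred for some other reason, calling it with `simplify = FALSE` also returns a plain list and avoids the matrix simplification.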
