Row numbers differ (NA vs 1) when adding first row to empty data.frame

Question

I'd like to understand why the first of these two methods for filling an empty data.frame by index results in an NA row number being assigned to the first row only:

Method 1:

df <- data.frame(Number=numeric(), Text=character(), stringsAsFactors = FALSE)
df[1,]$Number <- 123456
df[1,]$Text <- "abcdef"
df[2,]$Number <- 456789
df[2,]$Text <- "abcdef"

Output 1:

> df
   Number   Text
NA 123456 abcdef
2  456789 abcdef

Method 2:

df <- data.frame(Number=numeric(), Text=character(), stringsAsFactors = FALSE)
df[1,1] <- 123456
df[1,2] <- "abcdef"
df[2,1] <- 456789
df[2,2] <- "abcdef"

Output 2:

> df
  Number   Text
1 123456 abcdef
2 456789 abcdef

The only difference I see is that the first method accesses the data.frame using the column name instead of the column number. I don't see why this results in an NA row number being assigned to the first observation only, since the row numbers seem to work as expected from the second row onwards.

Recommended answer

Well, the most important part of this answer is that code like this should be avoided. It is very inefficient to add data to a data.frame in R row by row (see Circle 2 of The R Inferno). There are almost always better ways to do this, depending on what exactly you are doing.
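As one illustration of a better way, here is a minimal sketch that assumes all values are known up front. It builds whole column vectors first and creates the data.frame in a single call. The vector names numbers and texts are made up for this example.

```r
# Build the columns as whole vectors first, then create the data.frame once.
numbers <- c(123456, 456789)
texts <- c("abcdef", "abcdef")
df <- data.frame(Number = numbers, Text = texts, stringsAsFactors = FALSE)

# Row names are the automatic sequence, with no NA entry.
print(rownames(df))
```

Because no row is ever extracted and merged back, the row-name question never comes up with this approach.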

But to get to what's going on here: all of this comes down to the $<-.data.frame, [.data.frame, and [<-.data.frame functions. In the first case, with

df[1,]$Number <- 123456

you are doing the subset first, which calls [.data.frame. When you ask for a row of a data.frame that doesn't exist, you get a bunch of NA values for everything (including the row names). So now you have a one-row data.frame with NA values in the columns and in the row name. Next, $<-.data.frame is called to update just the Number column; it does not touch the row names. This new value then gets passed to [<-.data.frame to merge it back into the original data.frame. When that function runs, it checks that there are no duplicated row names. For the first row, since there's only one row and it has the name NA, that name is kept. However, when a name would be a duplicate, the function replaces it with the index of the row. That's why you get an NA for the first row, but when it tries to add the next row, it tries NA again, sees that it's a duplicate, and has to choose a new name. (See what happens when you try df[1:2,]$Number <- 123456 and then df[3,]$Number <- 456789.)
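The sequence described above can be replayed by hand. This is only a sketch of what the replacement call expands to, not the exact internal code path, and the variable name tmp is an assumption made for illustration.

```r
df <- data.frame(Number = numeric(), Text = character(), stringsAsFactors = FALSE)

# Step 1: extract a row that does not exist yet; everything is NA,
# including the row name.
tmp <- df[1, ]
print(rownames(tmp))

# Step 2: update only the Number column of the extracted row.
tmp$Number <- 123456

# Step 3: merge the one-row data.frame back; its row name is reused
# because it does not clash with any existing row name.
df[1, ] <- tmp
print(rownames(df))
```

Printing rownames(tmp) right after step 1 shows where the NA label comes from, before any assignment has happened.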

On the other hand, when you do

df[1,1] <- 123456

that doesn't do the subsetting first to create a row with a missing row name. You go right to the assignment, skipping $<-.data.frame and [.data.frame. In this case, it doesn't have to merge in a new row with an NA row name; it can create the row right away and assign it a row name. This is just a special property of calling the assignment operator without having to do the extraction first. You can turn on the debugger with debug(`[<-.data.frame`) to see exactly how that happens.
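For contrast, a minimal sketch of the direct path: the assigned value is a plain number rather than a one-row data.frame, so there is no incoming row name to reuse and the new row is labelled by its index.

```r
df <- data.frame(Number = numeric(), Text = character(), stringsAsFactors = FALSE)

# Assign straight into cell [1, 1]; no extraction happens beforehand.
df[1, 1] <- 123456

# The new row is named after its position.
print(rownames(df))

# The column that was not assigned is filled with NA for the new row.
print(is.na(df$Text[1]))
```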

So the first method is basically doing three steps: 1) extract df[1,], 2) change the value of the Number column, then 3) merge that new value back into df[1,]. The second method skips the first two steps and just merges the values directly into df[1,]. The real difference is how each of those functions chooses row names for rows that don't exist yet.
