Why am I getting X. in my column names when reading a data frame?

Question

I asked a question about this a few months back, and I thought the answer had solved my problem, but I ran into the problem again and the solution didn't work for me.

I'm importing a CSV:

orders <- read.csv("<file_location>", sep=",", header=T, check.names = FALSE)

Here's the structure of the dataframe:

str(orders)

'data.frame':   3331575 obs. of  2 variables:
 $ OrderID  : num  -2034590217 -2034590216 -2031892773 -2031892767 -2021008573 ...
 $ OrderDate: Factor w/ 402 levels "2010-10-01","2010-10-04",..: 263 263 269 268 301 300 300 300 300 300 ...

If I run the length command on the first column, OrderID, I get this:

length(orders$OrderID)
[1] 0

If I run the length on OrderDate, it returns correctly:

length(orders$OrderDate)
[1] 3331575

This is a copy/paste of the head of the CSV.

OrderID,OrderDate
-2034590217,2011-10-14
-2034590216,2011-10-14
-2031892773,2011-10-24
-2031892767,2011-10-21
-2021008573,2011-12-08
-2021008572,2011-12-07
-2021008571,2011-12-07
-2021008570,2011-12-07
-2021008569,2011-12-07

Now, if I re-run the read.csv, but take out the check.names option, the first column of the dataframe now has an X. at the start of the name.

orders2 <- read.csv("<file_location>", sep=",", header=T)

str(orders2)

'data.frame':   3331575 obs. of  2 variables:
 $ X.OrderID: num  -2034590217 -2034590216 -2031892773 -2031892767 -2021008573 ...
 $ OrderDate: Factor w/ 402 levels "2010-10-01","2010-10-04",..: 263 263 269 268 301 300 300 300 300 300 ...

length(orders2$X.OrderID)
[1] 3331575

This works correctly.

My question is: why does R add an X. to the beginning of the first column name? As you can see from the CSV file, there are no special characters. It should be a simple load. Setting check.names = FALSE does import the name as it appears in the CSV, but then the data does not load correctly for me to perform analysis on.

What can I do to fix this?

Side note: I realize this is minor. I'm mostly frustrated that I think I am loading the file correctly, yet not getting the result I expected. I could rename the column using colnames(orders)[1] <- "OrderID", but I still want to know why it doesn't load correctly.

Solution

read.csv() is a wrapper around the more general read.table() function. The latter function has an argument, check.names, which is documented as:

check.names: logical.  If ‘TRUE’ then the names of the variables in the
         data frame are checked to ensure that they are syntactically
         valid variable names.  If necessary they are adjusted (by
         ‘make.names’) so that they are, and also to ensure that there
         are no duplicates.
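A minimal sketch of what that argument does in practice, assuming a small hypothetical CSV (the labels "Order ID" and "Order-Date" are invented for illustration and are not from the file in the question). It also shows why `$` can return NULL, and hence a length of 0, when the name you type is not the name actually stored in the data frame:

```r
# Hypothetical inline CSV whose header labels are not valid R names.
csv_text <- "Order ID,Order-Date\n1,2011-10-14\n2,2011-10-21\n"

# Default (check.names = TRUE): labels are passed through make.names().
checked <- read.csv(text = csv_text)
print(names(checked))

# check.names = FALSE: labels are kept exactly as they appear in the file.
unchecked <- read.csv(text = csv_text, check.names = FALSE)
print(names(unchecked))

# Unchanged labels need [[ ]] or backticks to be accessed.
print(length(unchecked[["Order ID"]]))
print(length(unchecked$`Order ID`))

# Asking for a name that is not stored returns NULL, and length(NULL) is 0.
print(length(unchecked$Order.ID))
```

If a label looks right on screen but `$` still returns NULL, the stored name probably differs from the typed one by a character that is not being displayed.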

If your header contains labels that are not syntactically valid, then make.names() will replace them with valid names derived from the invalid ones, translating invalid characters to a dot and possibly prepending X:

R> make.names("$Foo")
[1] "X.Foo"

This is documented in ?make.names:

Details:

    A syntactically valid name consists of letters, numbers and the
    dot or underline characters and starts with a letter or the dot
    not followed by a number.  Names such as ‘".2way"’ are not valid,
    and neither are the reserved words.

    The definition of a _letter_ depends on the current locale, but
    only ASCII digits are considered to be digits.

    The character ‘"X"’ is prepended if necessary.  All invalid
    characters are translated to ‘"."’.  A missing value is translated
    to ‘"NA"’.  Names which match R keywords have a dot appended to
    them.  Duplicated values are altered by ‘make.unique’.
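The rules quoted above can be tried out directly. This is a small sketch using made-up labels (only "$Foo" and ".2way" come from the text above; the rest are assumptions chosen to exercise each rule):

```r
# Illustrative labels: one valid name, then one per rule in ?make.names.
labels <- c("OrderID", "$Foo", "2way", ".2way", "Order Date", "if")

fixed <- make.names(labels)

# Show each original label next to its adjusted form.
print(data.frame(original = labels, adjusted = fixed))

# Duplicates are only made unique when unique = TRUE is requested,
# which is what read.table() does when check.names = TRUE.
print(make.names(c("id", "id"), unique = TRUE))
```

A valid label such as "OrderID" passes through unchanged, so a prepended "X." means the label as read contained a leading character that R considered invalid.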

The behaviour you are seeing is entirely consistent with the documented way read.table() loads your data. That suggests you have syntactically invalid labels in the header row of your CSV file. Note the point above from ?make.names that what counts as a letter depends on the locale of your system. For example, the CSV file might include a character that your text editor displays as valid, but if R is not running in the same locale, that character may not be valid there.

I would look at the CSV file and identify any non-ASCII characters in the header line; there are possibly non-visible characters (or escape sequences; ?) in the header row also. A lot may be going on between reading in the file with the non-valid names and displaying it in the console, which might be masking the non-valid characters, so don't take the fact that nothing looks wrong with check.names = FALSE as indicating that the file is OK.
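One way to do that inspection is to look at the raw bytes of the header rather than at its printed form. The sketch below builds a hypothetical demo file whose header starts with a UTF-8 byte-order mark (bytes EF BB BF). That is only an assumed example of an invisible character, not a confirmed diagnosis of the file in the question, and the variable names are invented for illustration:

```r
# Build a hypothetical demo file: BOM bytes followed by an ordinary header.
path <- tempfile(fileext = ".csv")
con <- file(path, open = "wb")
writeBin(as.raw(c(0xEF, 0xBB, 0xBF)), con)
writeBin(charToRaw("OrderID,OrderDate\n-2034590217,2011-10-14\n"), con)
close(con)

# Read the first 20 bytes of the file as raw values, bypassing any decoding.
header_bytes <- readBin(path, what = "raw", n = 20)
print(header_bytes)

# Flag bytes outside printable ASCII (above 127, or control codes below 32).
byte_values <- as.integer(header_bytes)
suspicious <- byte_values[byte_values > 127L | byte_values < 32L]
print(suspicious)

# Check specifically for a UTF-8 byte-order mark at the start of the file.
has_bom <- identical(header_bytes[1:3], as.raw(c(0xEF, 0xBB, 0xBF)))
print(has_bom)

# If a BOM is present, fileEncoding = "UTF-8-BOM" asks R to strip it on read.
orders_fixed <- read.csv(path, fileEncoding = "UTF-8-BOM", check.names = FALSE)
print(names(orders_fixed))
```

For a real file, replace `path` with the file's location and read enough bytes to cover the whole header line. Any value reported in `suspicious` is a byte that a text editor may hide but that R still sees as part of the first column name.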

Posting the output of sessionInfo() would also be useful.
