How to read files in HDFS in R without losing column and row names
Problem description
My problem is that when I read a CSV file that contains column names (a header), the column names disappear and are replaced by "V1", "V2", ...
I have the mtcars dataset in CSV format; here is a preview:
    model,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
    Mazda RX4,21,6,160,110,3.9,2.62,16.46,0,1,4,4
    Mazda RX4 Wag,21,6,160,110,3.9,2.875,17.02,0,1,4,4
    Datsun 710,22.8,4,108,93,3.85,2.32,18.61,1,1,4,1
I would like to upload it to HDFS and read it, so I go to the "HUE" platform and upload the file. I can view it in the file manager. Here is a small preview (screenshot: https://i.stack.imgur.com/yzTA1.png).
Then, in the R session, using plyrmr, I run the following code:
    filename3 <- "/user/sgerony/mtcars.csv"
    input(filename3, format = make.input.format(format = "csv", sep = ","))
and the result is this:
                     V1   V2 V3   V4  V5   V6    V7    V8 V9 V10 V11 V12
    1 Chrysler Imperial 14.7  8  440 230 3.23 5.345 17.42  0   0   3   4
    2          Fiat 128 32.4  4 78.7  66 4.08   2.2 19.47  1   1   4   1
    3       Honda Civic 30.4  4 75.7  52 4.93 1.615 18.52  1   1   4   2
    4    Toyota Corolla 33.9  4 71.1  65 4.22 1.835  19.9  1   1   4   1
As you can see, the column names are gone. What am I doing wrong?
Thanks
Solution
This is the solution I found (I really don't like it, so if you have a better one, please share).
I split the CSV file into two CSV files: one containing only the column names (mtcars_names.csv) and the other containing only the data (mtcars_no_names.csv). I then uploaded both through the file manager.
    filename <- "/user/sgerony/mtcars_no_names.csv"
    filename.names <- "/user/sgerony/mtcars_names.csv"
    filename.names <- as.data.frame(input(filename.names,
                                          format = make.input.format(format = "csv", sep = ",")))
    # convert the columns to "character" type
    for (i in 1:dim(filename.names)[2]) {
      filename.names[, i] <- as.character(filename.names[, i])
    }
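The manual split into a names file and a data file could also be scripted before uploading. The following is a minimal base R sketch of that step, run on a local copy of the file. The helper name `split_header` and the temporary paths are hypothetical, not part of plyrmr or of the original answer, and the upload to HDFS is left out.

```r
# Build a small local copy of the CSV preview shown in the question.
# The temporary path is an assumption made for illustration only.
src <- tempfile(fileext = ".csv")
writeLines(c(
  "model,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb",
  "Mazda RX4,21,6,160,110,3.9,2.62,16.46,0,1,4,4",
  "Mazda RX4 Wag,21,6,160,110,3.9,2.875,17.02,0,1,4,4",
  "Datsun 710,22.8,4,108,93,3.85,2.32,18.61,1,1,4,1"
), src)

# Hypothetical helper: write the first line (the header) to one file
# and all remaining lines (the data) to another.
# It reads the whole file into memory, so it only suits small files.
split_header <- function(path, names_path, data_path) {
  lines <- readLines(path)
  writeLines(lines[1], names_path)
  writeLines(lines[-1], data_path)
  invisible(c(names = names_path, data = data_path))
}

out_dir <- tempdir()
paths <- split_header(src,
                      file.path(out_dir, "mtcars_names.csv"),
                      file.path(out_dir, "mtcars_no_names.csv"))
```

The two resulting files correspond to mtcars_names.csv and mtcars_no_names.csv above and would still need to be uploaded, for example through the HUE file manager.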
Now, every time I write or read the file, I use:
    ### column name information is once more lost
    output(input(filename,
                 format = make.input.format(format = "csv", sep = ",",
                                            col.names = filename.names[1, ])),
           path = "/user/sgerony/mtcars_output_csv")
    input("/user/sgerony/mtcars_output_csv",
          format = make.input.format(format = "csv", sep = ",",
                                     col.names = filename.names[1, ]))
This can get quite messy if I generate data subsets: for each subset with different column names, a new file containing those column names has to be generated.
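One way to reduce this bookkeeping might be to keep the column names in an R character vector instead of a separate file, and to derive the names of a subset from that vector. The sketch below uses base R `read.table` as a local stand-in, since plyrmr needs a Hadoop cluster. It reproduces the "V1", "V2", ... symptom with `header = FALSE` and restores the names through `col.names`. Whether `make.input.format` accepts a plain character vector for `col.names` in the same way is an assumption based on the call above and has not been tested here.

```r
# Headerless CSV rows, standing in for mtcars_no_names.csv.
csv_text <- c(
  "Mazda RX4,21,6,160,110,3.9,2.62,16.46,0,1,4,4",
  "Datsun 710,22.8,4,108,93,3.85,2.32,18.61,1,1,4,1"
)

# Column names kept in the R session rather than in a second file.
col_names <- c("model", "mpg", "cyl", "disp", "hp", "drat",
               "wt", "qsec", "vs", "am", "gear", "carb")

# Without col.names, read.table falls back to V1, V2, ..., V12.
no_names <- read.table(text = csv_text, sep = ",", header = FALSE)

# Supplying col.names restores the intended names.
with_names <- read.table(text = csv_text, sep = ",", header = FALSE,
                         col.names = col_names, stringsAsFactors = FALSE)

# For a subset, the matching names are a subset of the same vector,
# so no additional names file has to be written.
subset_names <- c("model", "mpg", "cyl")
subset_df <- with_names[, subset_names]
```

If the same approach carries over to plyrmr, `names(subset_df)` would be the value to pass as `col.names` when reading the subset back.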