How to read files in HDFS in R without losing column and row names


Problem description

My problem is that when I read a CSV file that has a header row of column names, the column names disappear and are replaced by "V1", "V2", and so on.

I have the mtcars dataset in csv format and here is the preview

model,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
Mazda RX4,21,6,160,110,3.9,2.62,16.46,0,1,4,4
Mazda RX4 Wag,21,6,160,110,3.9,2.875,17.02,0,1,4,4
Datsun 710,22.8,4,108,93,3.85,2.32,18.61,1,1,4,1

I would like to upload it to HDFS and read it, so I go to the "HUE" platform and upload the file. I can view it in the file manager. Here is a small preview (screenshot: https://i.stack.imgur.com/yzTA1.png).

Then, in an R session using plyrmr, I run the following code:

filename3 <- "/user/sgerony/mtcars.csv"
input(filename3,format=make.input.format(format = "csv", sep=","))

and the result is this:

                V1   V2  V3    V4  V5   V6    V7    V8 V9 V10  V11  V12
1    Chrysler Imperial 14.7   8   440 230 3.23 5.345 17.42  0   0    3    4
2             Fiat 128 32.4   4  78.7  66 4.08   2.2 19.47  1   1    4    1
3          Honda Civic 30.4   4  75.7  52 4.93 1.615 18.52  1   1    4    2
4       Toyota Corolla 33.9   4  71.1  65 4.22 1.835  19.9  1   1    4    1

As you can see the column names have gone away. What am I doing wrong?

Thanks

Solution

This is the solution I found (I really don't like it, so if you have a better one, please do share).

I split the CSV file into two CSV files, one containing only the column names (mtcars_names.csv) and the other containing the data (mtcars_no_names.csv). Then I uploaded both through the file manager.
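The split itself can be done locally before uploading. Below is a minimal sketch in base R, assuming the original file is available on the local machine. The helper name split_header and the temporary paths are illustrative and not from the original post.

```r
# Split a CSV file into a header-only file and a data-only file.
# `split_header` is a hypothetical helper; all paths here are assumptions.
split_header <- function(csv_path, names_path, data_path) {
  lines <- readLines(csv_path)
  stopifnot(length(lines) >= 1)
  writeLines(lines[1], names_path)   # first line: column names
  writeLines(lines[-1], data_path)   # remaining lines: data rows
  invisible(length(lines) - 1)       # number of data rows written
}

# Demo on a small sample written to a temporary directory
tmp <- tempdir()
src <- file.path(tmp, "mtcars.csv")
writeLines(c(
  "model,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb",
  "Mazda RX4,21,6,160,110,3.9,2.62,16.46,0,1,4,4",
  "Datsun 710,22.8,4,108,93,3.85,2.32,18.61,1,1,4,1"
), src)

names_file <- file.path(tmp, "mtcars_names.csv")
data_file <- file.path(tmp, "mtcars_no_names.csv")
n_rows <- split_header(src, names_file, data_file)
```

The two resulting files would still need to be uploaded to HDFS (for example through the HUE file manager, as described above).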

filename <- "/user/sgerony/mtcars_no_names.csv"
filename.names <- "/user/sgerony/mtcars_names.csv"

# read the one-row names file into a data frame
# (filename.names is reused: first it holds the path, then the data)
filename.names <- as.data.frame(input(filename.names,
  format = make.input.format(format = "csv", sep = ",")))

# convert every column to "character" type
for (i in seq_len(ncol(filename.names))) {
  filename.names[, i] <- as.character(filename.names[, i])
}

Now, every time I write or read the file, I use:

### column name information is lost once more
output(input(filename,
             format = make.input.format(format = "csv", sep = ",",
                                        col.names = filename.names[1, ])),
       path = "/user/sgerony/mtcars_output_csv")

input("/user/sgerony/mtcars_output_csv",
      format = make.input.format(format = "csv", sep = ",",
                                 col.names = filename.names[1, ]))

This can get quite messy if I generate data subsets: for each subset with different column names, a new file containing the column names has to be generated.
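One way to reduce that bookkeeping, sketched here in base R under the assumption that each subset is available locally as a data frame before it is uploaded, is to generate the names file from the data frame itself. write_names_file is a hypothetical helper, not part of plyrmr, and the upload to HDFS is left out.

```r
# Write a one-line CSV holding the column names of `df`.
# `write_names_file` is a hypothetical helper, not part of plyrmr.
write_names_file <- function(df, path) {
  writeLines(paste(names(df), collapse = ","), path)
  invisible(path)
}

# Example: a subset of the built-in mtcars data with fewer columns
subset_df <- mtcars[mtcars$cyl == 4, c("mpg", "cyl", "wt")]
names_path <- file.path(tempdir(), "mtcars_subset_names.csv")
write_names_file(subset_df, names_path)

# Reading the file back gives a character vector of column names
col_names <- strsplit(readLines(names_path), ",", fixed = TRUE)[[1]]
```

This keeps the names file in sync with whatever columns the subset actually has, though it does not remove the need for a separate file per subset.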
