How to read files in HDFS in R without losing column and row names

Views: 235 | Posted: 2017/2/26 15:29:36 | Tags: r, csv, hadoop, hdfs


Problem description

My problem is that when I read a CSV file that contains column names (a header row), the column names disappear and are replaced by "V1", "V2", ...

I have the mtcars dataset in CSV format; here is a preview:

model,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
Mazda RX4,21,6,160,110,3.9,2.62,16.46,0,1,4,4
Mazda RX4 Wag,21,6,160,110,3.9,2.875,17.02,0,1,4,4
Datsun 710,22.8,4,108,93,3.85,2.32,18.61,1,1,4,1

I would like to upload it to HDFS and read it, so I go to the "HUE" platform and upload the file. I can view it in the file manager. Here is a small preview (screenshot: https://i.stack.imgur.com/yzTA1.png).

Then, in an R session using plyrmr, I run the following code:

filename3 <- "/user/sgerony/mtcars.csv"
input(filename3,format=make.input.format(format = "csv", sep=","))

and the result is this:

                V1   V2  V3    V4  V5   V6    V7    V8 V9 V10  V11  V12
1    Chrysler Imperial 14.7   8   440 230 3.23 5.345 17.42  0   0    3    4
2             Fiat 128 32.4   4  78.7  66 4.08   2.2 19.47  1   1    4    1
3          Honda Civic 30.4   4  75.7  52 4.93 1.615 18.52  1   1    4    2
4       Toyota Corolla 33.9   4  71.1  65 4.22 1.835  19.9  1   1    4    1

As you can see, the column names are gone. What am I doing wrong?

Thanks
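For reference, the "V1", "V2", ... names are what base R's read.table assigns when it is told there is no header row. The sketch below reproduces the symptom locally with the sample rows from the preview. It assumes (this is not confirmed in the post) that plyrmr's csv format ends up reading the data with read.table semantics and header = FALSE; nothing here touches HDFS.

```r
# Sample rows copied from the preview above, held in memory instead of HDFS.
csv_text <- paste(c(
  "model,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb",
  "Mazda RX4,21,6,160,110,3.9,2.62,16.46,0,1,4,4",
  "Mazda RX4 Wag,21,6,160,110,3.9,2.875,17.02,0,1,4,4",
  "Datsun 710,22.8,4,108,93,3.85,2.32,18.61,1,1,4,1"
), collapse = "\n")

# header = FALSE: the header line is treated as data and R invents V1..V12.
no_header <- read.table(text = csv_text, sep = ",", header = FALSE,
                        stringsAsFactors = FALSE)
print(names(no_header))

# header = TRUE: the first line supplies the column names.
with_header <- read.table(text = csv_text, sep = ",", header = TRUE,
                          stringsAsFactors = FALSE)
print(names(with_header))
```

If that assumption holds, the names are not so much lost as never read: the header line is just another row of data to the reader.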

Solution

This is the solution I found (I really don't like it, so if you have a better one, please do share).

I split the CSV file into two CSV files: one containing only the column names (mtcars_names.csv) and the other containing the data (mtcars_no_names.csv). Then I uploaded both through the file manager.
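The split itself can be done in R before uploading. This is a minimal local sketch using only base R; the temporary directory and the three sample rows are stand-ins for the real file, and the upload to HDFS is still done separately.

```r
# Write a small stand-in for mtcars.csv to a temporary location.
src <- tempfile(fileext = ".csv")
writeLines(c(
  "model,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb",
  "Mazda RX4,21,6,160,110,3.9,2.62,16.46,0,1,4,4",
  "Mazda RX4 Wag,21,6,160,110,3.9,2.875,17.02,0,1,4,4",
  "Datsun 710,22.8,4,108,93,3.85,2.32,18.61,1,1,4,1"
), src)

# Read the file as raw lines: line 1 is the header, the rest is data.
all_lines <- readLines(src)

names_file <- file.path(tempdir(), "mtcars_names.csv")
data_file  <- file.path(tempdir(), "mtcars_no_names.csv")

writeLines(all_lines[1],  names_file)  # header only
writeLines(all_lines[-1], data_file)   # data rows only
```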

filename <- "/user/sgerony/mtcars_no_names.csv"
filename.names <- "/user/sgerony/mtcars_names.csv"
filename.names <- as.data.frame(input(filename.names,
  format = make.input.format(format = "csv", sep = ",")))

# convert every column to character type
for (i in seq_len(ncol(filename.names))) {
  filename.names[, i] <- as.character(filename.names[, i])
}

Now, every time I write or read the file, I use:

### column name information is once more lost
output(input(filename, format = make.input.format(format = "csv",
    sep = ",", col.names = filename.names[1, ])),
  path = "/user/sgerony/mtcars_output_csv")

input("/user/sgerony/mtcars_output_csv",
  format = make.input.format(format = "csv",
    sep = ",", col.names = filename.names[1, ]))

This can get quite messy if I generate data subsets: for each subset with different column names, a new file containing the column names has to be generated.
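One way to reduce that bookkeeping, sketched here as a suggestion rather than a tested plyrmr recipe, is to keep the column names in a character vector in the script instead of a second file, and to subset that vector together with the data. The example runs locally with base R's read.table and write.table. Whether such a vector can be passed as col.names to make.input.format in the same way as filename.names[1, ] above is an assumption.

```r
# Column names kept in the script rather than in a separate names file.
col_names <- c("model", "mpg", "cyl", "disp", "hp", "drat",
               "wt", "qsec", "vs", "am", "gear", "carb")

# Header-less data, as in mtcars_no_names.csv (sample rows only).
data_text <- paste(c(
  "Mazda RX4,21,6,160,110,3.9,2.62,16.46,0,1,4,4",
  "Mazda RX4 Wag,21,6,160,110,3.9,2.875,17.02,0,1,4,4",
  "Datsun 710,22.8,4,108,93,3.85,2.32,18.61,1,1,4,1"
), collapse = "\n")

cars <- read.table(text = data_text, sep = ",", header = FALSE,
                   col.names = col_names, stringsAsFactors = FALSE)

# A subset carries its own names: reuse the same vector of kept columns
# when writing the header-less file and when reading it back.
keep <- c("model", "mpg", "wt")
subset_file <- tempfile(fileext = ".csv")
write.table(cars[, keep], file = subset_file, sep = ",",
            row.names = FALSE, col.names = FALSE, quote = FALSE)

reread <- read.table(subset_file, sep = ",", header = FALSE,
                     col.names = keep, stringsAsFactors = FALSE)
print(names(reread))
```

The names then live next to the code that creates each subset, so no extra file has to be generated and uploaded per subset.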
