R + Hadoop: How to read a CSV file from HDFS and execute mapreduce?


Question



In the following example:

  small.ints = to.dfs(1:1000)
  mapreduce(
    input = small.ints, 
    map = function(k, v) cbind(v, v^2))

The data input for the mapreduce function is an object named small.ints, which refers to blocks in HDFS.

Now I have a CSV file already stored in HDFS as

"hdfs://172.16.1.58:8020/tmp/test_short.csv"

How can I get an object for it?

As far as I know (which may be wrong), if I want data from a CSV file as input for mapreduce, I first have to generate a table in R that contains all the values in the CSV file. I do have a method like:

data = from.dfs("hdfs://172.16.1.58:8020/tmp/test_short.csv",
                make.input.format(format = "csv", sep = ","))
mydata = data$val

It seems OK to use this method to get mydata and then do object = to.dfs(mydata), but the problem is that test_short.csv is huge, around a terabyte in size, and memory can't hold the output of from.dfs!

Actually, I'm wondering: if I use "hdfs://172.16.1.58:8020/tmp/test_short.csv" as the mapreduce input directly and do the from.dfs() work inside the map function, would I be able to get the data blocks?

Please give me some advice!

Solution

mapreduce(input = path, input.format = make.input.format(...), map ...)

from.dfs is for small data. In most cases you won't use from.dfs in the map function; the map arguments already hold a portion of the input data.
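For illustration, here is a minimal sketch of that pattern (it assumes the rmr2 package and made-up column names x and y for test_short.csv, so adjust those to the real schema): the HDFS path is passed straight to mapreduce, make.input.format parses the CSV, and the map function receives one chunk of rows at a time as a data frame, so the full terabyte-sized file never has to fit in memory at once.

  library(rmr2)

  # CSV input format; the extra arguments are forwarded to the underlying
  # read.table-style parser (these column names are assumptions)
  csv.format = make.input.format(
    format = "csv",
    sep = ",",
    col.names = c("x", "y"),
    stringsAsFactors = FALSE)

  out = mapreduce(
    input = "hdfs://172.16.1.58:8020/tmp/test_short.csv",
    input.format = csv.format,
    map = function(k, v) {
      # v is a data frame holding only one chunk of the CSV,
      # so there is no need to call from.dfs here
      keyval(NULL, cbind(v, x.squared = v$x^2))
    })

  # from.dfs is used only on the (assumed small) result, never on the raw CSV
  head(from.dfs(out)$val)

Because Hadoop streams the file to the mappers split by split, the only data that has to fit in R's memory is the chunk currently held in v (plus the final result, if you choose to pull it back with from.dfs).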
