R + Hadoop: How to read a CSV file from HDFS and execute mapreduce?


Question



In the following example:

  small.ints = to.dfs(1:1000)
  mapreduce(
    input = small.ints, 
    map = function(k, v) cbind(v, v^2))

The data input for the mapreduce function is an object named small.ints, which refers to blocks in HDFS.

Now I have a CSV file already stored in HDFS as

"hdfs://172.16.1.58:8020/tmp/test_short.csv"

How can I get an object for it?

As far as I know (which may be wrong), if I want data from a CSV file as input for mapreduce, I first have to generate a table in R that contains all the values in the CSV file. I do have a method like:

data = from.dfs("hdfs://172.16.1.58:8020/tmp/test_short.csv",
                make.input.format(format = "csv", sep = ","))
mydata = data$val

It seems OK to use this method to get mydata and then do object = to.dfs(mydata), but the problem is that test_short.csv is huge, around a terabyte in size, and memory can't hold the output of from.dfs!

Actually, I'm wondering: if I use "hdfs://172.16.1.58:8020/tmp/test_short.csv" as the mapreduce input directly and do the from.dfs() work inside the map function, would I be able to get the data blocks?

Please give me some advice!

Solution

mapreduce(input = path, input.format = make.input.format(...), map ...)

from.dfs is for small data. In most cases you won't use from.dfs in the map function; the map arguments already hold a portion of the input data.
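For illustration, here is a minimal sketch of that pattern (it assumes the rmr2 package and made-up column names x and y for test_short.csv, so adjust those to the real schema): the HDFS path is passed straight to mapreduce, make.input.format parses the CSV, and the map function receives one chunk of rows at a time as a data frame, so the full terabyte-sized file never has to fit in memory at once.

  library(rmr2)

  # CSV input format; the extra arguments are forwarded to the underlying
  # read.table-style parser (these column names are assumptions)
  csv.format = make.input.format(
    format = "csv",
    sep = ",",
    col.names = c("x", "y"),
    stringsAsFactors = FALSE)

  out = mapreduce(
    input = "hdfs://172.16.1.58:8020/tmp/test_short.csv",
    input.format = csv.format,
    map = function(k, v) {
      # v is a data frame holding only one chunk of the CSV,
      # so there is no need to call from.dfs here
      keyval(NULL, cbind(v, x.squared = v$x^2))
    })

  # from.dfs is used only on the (assumed small) result, never on the raw CSV
  head(from.dfs(out)$val)

Because Hadoop streams the file to the mappers split by split, the only data that has to fit in R's memory is the chunk currently held in v (plus the final result, if you choose to pull it back with from.dfs).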
