How to select 3 columns from a CSV file using Spark, group by one column, and finally sum another


Question


I am new to Spark and want to write a Spark program in Java. I have to load a CSV file that contains 75 columns and 1.4 million rows, select only 3 of those columns, apply a filter condition, group by one of the columns, and sum another.

Answer


Depending on which version of Spark you are running (1.3 or 1.4), you can load the CSV file using the Databricks spark-csv package with either:

Spark 1.3

val df = sqlContext.load("com.databricks.spark.csv", Map("path" -> filePath,"header"->"true"))

Spark 1.4

val df = sqlContext.read.format("com.databricks.spark.csv").options(Map("path" -> filePath,"header"->"true")).load()
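For completeness, a minimal setup sketch for the code above, assuming Spark 1.x with the spark-csv artifact on the classpath (the app name and local master are placeholders, not from the original answer):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Placeholder app name and local master; adjust for your cluster.
val conf = new SparkConf().setAppName("csv-groupby-sum").setMaster("local[*]")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
```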


In the following I will assume that you are only interested in columns 2, 3 and 32 and that column 2 needs to be parsed as a date, column 3 is an ID of type String and column 32 is an amount that needs to be parsed as a Double.


So, once the file is loaded you can get the 3 columns like this:

val allData = df.map(row => (row.getString(3), row.getString(32).toDouble, LocalDate.parse(row.getString(2), DateTimeFormatter.ISO_LOCAL_DATE)))


(Note that I am using Java's LocalDate, which is part of Java 8, here. You could use JodaTime instead if you prefer.)
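For reference, DateTimeFormatter.ISO_LOCAL_DATE expects dates in yyyy-MM-dd form; a quick sketch of the parsing used above (the sample date strings and the alternative pattern are illustrative assumptions, not from the original question):

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter

// ISO_LOCAL_DATE parses strings like "2015-05-24"
val d = LocalDate.parse("2015-05-24", DateTimeFormatter.ISO_LOCAL_DATE)

// If the CSV stores dates in another layout, build a custom formatter:
val custom = DateTimeFormatter.ofPattern("dd/MM/yyyy")
val d2 = LocalDate.parse("24/05/2015", custom)
```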


Assuming that you only want rows with a date later than, say, May 24th, 2015, you can use a filter to get rid of unwanted rows:

val startDate = LocalDate.of(2015,5,24)
val filteredData = allData.filter{case(_,_,date) => date.isAfter(startDate)}


Now, to sum a particular column for each ID, you need to map your data to key-value pairs (ID, amount), and then sum the amounts using reduceByKey:

filteredData.map{case(id,amount, _) => (id, amount)}
            .reduceByKey(_ + _)
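To actually materialize the per-ID sums, you could collect the result to the driver for small outputs, or write it out for large ones (a sketch building on the variables above; the output path is a placeholder):

```scala
// Assuming 'filteredData' from above: RDD of (id, amount, date) tuples
val sums = filteredData.map { case (id, amount, _) => (id, amount) }
                       .reduceByKey(_ + _)

// Small results: bring them to the driver and print
sums.collect().foreach { case (id, total) => println(s"$id: $total") }

// Large results: save to storage instead of collect()
// sums.saveAsTextFile("output/sums")
```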


Was this what you were looking for?
