How to select 3 columns from a CSV file using Spark, group by one column, and finally sum another


Question


I am new to Spark and want to write a Spark program in Java. I have to load a CSV file that contains 75 columns and 1.4 million rows, select only 3 of those columns, apply a filter condition, group by one of the columns, and sum another.

Answer


Depending on which version of Spark you are running (1.3 or 1.4), you can load the CSV file using the Databricks spark-csv package with either:

Spark 1.3

val df = sqlContext.load("com.databricks.spark.csv", Map("path" -> filePath,"header"->"true"))

Spark 1.4

val df = sqlContext.read.format("com.databricks.spark.csv").options(Map("path" -> filePath,"header"->"true")).load()
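For completeness, a minimal setup sketch for the code above, assuming Spark 1.x with the spark-csv artifact on the classpath (the app name and local master are placeholders, not from the original answer):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Placeholder app name and local master; adjust for your cluster.
val conf = new SparkConf().setAppName("csv-groupby-sum").setMaster("local[*]")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
```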


In the following I will assume that you are only interested in columns 2, 3 and 32 and that column 2 needs to be parsed as a date, column 3 is an ID of type String and column 32 is an amount that needs to be parsed as a Double.


So, once the file is loaded you can get the 3 columns like this:

val allData = df.map(row => (row.getString(3), row.getString(32).toDouble, LocalDate.parse(row.getString(2), DateTimeFormatter.ISO_LOCAL_DATE)))


(Note that I am using Java's LocalDate, which is part of Java 8, here. You could use JodaTime instead if you prefer.)
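For reference, DateTimeFormatter.ISO_LOCAL_DATE expects dates in yyyy-MM-dd form; a quick sketch of the parsing used above (the sample date strings and the alternative pattern are illustrative assumptions, not from the original question):

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter

// ISO_LOCAL_DATE parses strings like "2015-05-24"
val d = LocalDate.parse("2015-05-24", DateTimeFormatter.ISO_LOCAL_DATE)

// If the CSV stores dates in another layout, build a custom formatter:
val custom = DateTimeFormatter.ofPattern("dd/MM/yyyy")
val d2 = LocalDate.parse("24/05/2015", custom)
```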


Assuming that you only want rows with a date later than, say, May 24th, 2015, you can use a filter to get rid of unwanted rows:

val startDate = LocalDate.of(2015,5,24)
val filteredData = allData.filter{case(_,_,date) => date.isAfter(startDate)}


Now, to sum a particular column for each ID, you need to map your data to key-value pairs (ID, amount), and then sum the amounts using reduceByKey:

filteredData.map{case(id,amount, _) => (id, amount)}
            .reduceByKey(_ + _)
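To actually materialize the per-ID sums, you could collect the result to the driver for small outputs, or write it out for large ones (a sketch building on the variables above; the output path is a placeholder):

```scala
// Assuming 'filteredData' from above: RDD of (id, amount, date) tuples
val sums = filteredData.map { case (id, amount, _) => (id, amount) }
                       .reduceByKey(_ + _)

// Small results: bring them to the driver and print
sums.collect().foreach { case (id, total) => println(s"$id: $total") }

// Large results: save to storage instead of collect()
// sums.saveAsTextFile("output/sums")
```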


Was this what you were looking for?
