Apache Spark: map CSV file to key:value format


Problem description


I'm totally new to Apache Spark and Scala, and I'm having problems with mapping a .csv file into a key-value (like JSON) structure.


What I want to accomplish is to get the .csv file:

user, timestamp, event
ec79fcac8c76ebe505b76090f03350a2,2015-03-06 13:52:56,USER_PURCHASED
ad0e431a69cb3b445ddad7bb97f55665,2015-03-06 13:52:57,USER_SHARED
83b2d8a2c549fbab0713765532b63b54,2015-03-06 13:52:57,USER_SUBSCRIBED
ec79fcac8c76ebe505b76090f03350a2,2015-03-06 13:53:01,USER_ADDED_TO_PLAYLIST
...

into a structure like this:

ec79fcac8c76ebe505b76090f03350a2: [(2015-03-06 13:52:56,USER_PURCHASED), (2015-03-06 13:53:01,USER_ADDED_TO_PLAYLIST)]
ad0e431a69cb3b445ddad7bb97f55665: [(2015-03-06 13:52:57,USER_SHARED)]
83b2d8a2c549fbab0713765532b63b54: [(2015-03-06 13:52:57,USER_SUBSCRIBED)]
...


How can this be done if the file is read by:

val csv = sc.textFile("file.csv")

Any help is much appreciated!

Answer

Something like:

    case class MyClass(user: String, date: String, event: String)

    def csvToMyClass(line: String) = {
      val split = line.split(',')
      // This is a good place to do validations
      // and convert strings to numbers, enums, UUIDs, etc.
      MyClass(split(0), split(1), split(2))
    }

    val csv = sc.textFile("file.csv")
      .map(csvToMyClass)


Of course, do a little more work to have more concrete data types on your class rather than just strings...
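As a minimal sketch of that "more concrete data types" suggestion (the `Event` record, `fmt` pattern, and `parse` helper below are illustrative names, not part of the original answer), the timestamp string can be parsed into a `java.time.LocalDateTime`:

```scala
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter

// Illustrative richer record: the timestamp becomes a real date-time
// instead of a String. (Event, fmt, and parse are hypothetical names.)
case class Event(user: String, at: LocalDateTime, kind: String)

val fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")

def parse(line: String): Event = {
  val split = line.split(',')
  Event(split(0), LocalDateTime.parse(split(1), fmt), split(2))
}

val e = parse("ec79fcac8c76ebe505b76090f03350a2,2015-03-06 13:52:56,USER_PURCHASED")
// e.at.getHour is 13; e.kind is "USER_PURCHASED"
```

A malformed timestamp then fails loudly at parse time (with a `DateTimeParseException`) instead of silently flowing through as a bad string.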


This is for reading the CSV file into a structure (seems to be your main question). If you then need to merge all data for a single user you can map to a key/value tuple (String -> (String, String)) instead and use .aggregateByKey() to join all tuples for a user. Your aggregation function can then return whatever structure you want.
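The merge step described above can be sketched with plain Scala collections, so the snippet runs without a cluster; on an RDD the grouping would instead be the `aggregateByKey` call shown in the comment. The names `pairs` and `byUser` are illustrative:

```scala
// Local stand-in for the RDD pipeline described above: same shape,
// plain Scala collections instead of an RDD.
val lines = List(
  "ec79fcac8c76ebe505b76090f03350a2,2015-03-06 13:52:56,USER_PURCHASED",
  "ad0e431a69cb3b445ddad7bb97f55665,2015-03-06 13:52:57,USER_SHARED",
  "83b2d8a2c549fbab0713765532b63b54,2015-03-06 13:52:57,USER_SUBSCRIBED",
  "ec79fcac8c76ebe505b76090f03350a2,2015-03-06 13:53:01,USER_ADDED_TO_PLAYLIST")

// Map each row to a key/value pair: user -> (timestamp, event).
val pairs = lines.map { line =>
  val split = line.split(',')
  (split(0), (split(1), split(2)))
}

// Collect all tuples per user. On an RDD you would write, e.g.:
//   pairs.aggregateByKey(List.empty[(String, String)])(_ :+ _, _ ::: _)
val byUser: Map[String, List[(String, String)]] =
  pairs.groupBy(_._1).map { case (user, kvs) => (user, kvs.map(_._2)) }
```

Looking up `byUser("ec79fcac8c76ebe505b76090f03350a2")` then yields both of that user's `(timestamp, event)` tuples, matching the target structure in the question.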

