How to split CSV lines into tuples with Spark Scala


Question

Here is the data I want to process with Scala. It looks like this:

    userId,movieId
    1,1172
    1,1405
    1,2193
    1,2968
    2,52
    2,144
    2,248

First, I want to skip the first line, then split each row into user and movie with split(",") and map it to (userID, movieID).

This is my first time trying Scala, and everything is driving me insane. I wrote this code to skip the first line and do the split:

rdd.mapPartitionsWithIndex { (idx, iter) =>
  if (idx == 0) iter.drop(1)
  else iter
}.flatMap(line => line.split(","))

But the result looks like this:

    1
    1172
    1
    1405
    1
    2193
    1
    2968
    2
    52

I guess it's because of mapPartitionsWithIndex. Is there any way to correctly skip the header without changing the structure?

Answer

Ah, your question is not about the header, but about how to split the lines into (userid, movieid)? Instead of .flatMap(line => line.split(",")) you should try this:

.map(line => line.split(",") match { case Array(userid, movieid) => (userid, movieid) })
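To see the difference between the two operations without needing a Spark cluster, here is a minimal plain-Scala sketch on a small sample of the data (the Seq stands in for the RDD, which supports the same flatMap/map semantics): flatMap flattens the Array returned by split into a single stream of fields, which is exactly the output the question shows, while map with the Array pattern match keeps one (userId, movieId) tuple per line.

```scala
object SplitDemo {
  def main(args: Array[String]): Unit = {
    val lines = Seq("1,1172", "1,1405", "2,52")

    // What the original code does: flatMap flattens each Array("1", "1172")
    // into individual elements, so the pairing is lost.
    val flattened = lines.flatMap(line => line.split(","))
    println(flattened)  // List(1, 1172, 1, 1405, 2, 52)

    // What the answer suggests: map keeps one tuple per input line.
    val tuples = lines.map(line => line.split(",") match {
      case Array(userId, movieId) => (userId, movieId)
    })
    println(tuples)     // List((1,1172), (1,1405), (2,52))
  }
}
```

Note that the pattern match will throw a MatchError on any line that does not have exactly two fields, which is one more reason to drop the "userId,movieId" header line first.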

