How to identify null fields in a csv file?


Question

I'm using Spark 2.1.1 and Scala 2.11.8.

I have to read data from a csv file with columns ranging from a minimum of 6 to a maximum of 8. I have to split the 9 entries, and once split, columns 0 to 5 will always have data. However, data may or may not be present in columns 6 to 8. I separated and stored the required columns in an RDD using:

val read_file = sc.textFile("Path to input file")

val uid = read_file.map(line => {
  val arr = line.split(",")
  (arr(2).split(":")(0), arr(3), arr(4).split(":")(0), arr(5).split(":")(0),
   arr(6).split(":")(0), arr(7).split(":")(0), arr(8).split(":")(0))
})
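As an aside, this split indexes arr(6) through arr(8) unconditionally, so a row with only 6 columns would throw an ArrayIndexOutOfBoundsException. A minimal plain-Scala sketch (no Spark needed; the row literal is a shortened, hypothetical 6-column line) of a bounds-checked accessor that returns an Option for absent columns:

```scala
// A hypothetical 6-column row, shaped like the shortest sample line below.
val line = "2017-05-09 21:52:42 , 1494391962 , v73392772_serv33i:9:1400 , 1:4x , c2eb11fd:314129, fb_217409795286934:294262"
val arr = line.split(",")

// Bounds-checked access: None when the column is absent from this row.
def col(i: Int): Option[String] =
  if (i < arr.length) Some(arr(i).trim) else None

println(col(8))                        // None: only 6 columns here
println(col(2).map(_.split(":")(0)))   // Some(v73392772_serv33i)
```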

Now, in the RDD uid obtained, columns 0 to 3 will always be filled, but 4 to 7 may or may not have data. For example, in the csv file from which I'm reading the data:

2017-05-09 21:52:42 , 1494391962 , p69465323_serv80i:10:450 , 7 , fb_406423006398063:396560, guest_861067032060185_android:671051, fb_100000829486587:186589, fb_100007900293502:407374, fb_172395756592775:649795

2017-05-09 21:52:42 , 1494391962 , z67265107_serv77i:4:45 , 2:Re , fb_106996523208498:110066, fb_274049626104849:86632, fb_111857069377742:69348, fb_127277511127344:46246

2017-05-09 21:52:42 , 1494391962 , v73392772_serv33i:9:1400 , 1:4x , c2eb11fd-99dc-4dee-a75c-bc9bfd2e0ae4iphone:314129, fb_217409795286934:294262

As can be seen, the first entry has all 9 columns filled, the second has 8 filled, and the third has only 6 columns filled.

From the RDD obtained, I have to map column arr(1)(0) with columns arr(3)(0) to arr(7)(0). The mapping of column 1 should be done only with the filled columns from 3 to 7; empty columns between 3 and 7 do not have to be mapped with column 1. I was trying to do this using a for loop:

Once I have this after executing the statement val uid = read_file.map():

(String, String, String, String, String, String, String) = (" p69465323_serv80i"," 7 "," fb_406423006398063"," guest_861067032060185_android"," fb_100000829486587"," fb_100007900293502"," fb_172395756592775")

I do:

for (x <- 5 to 7) {
  if (arr(x) != null) {
    val pairedRdd = uid.map(x => ((x._1, x._3), (x._1, x._4), (x._1, x._5), (x._1, x._6), (x._1, x._7)))
  }
}
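The intent above can also be expressed without a loop: pair the id with every field that is actually present. A minimal sketch over a plain Scala Array standing in for one row of uid (the row values are illustrative, not from the original); inside Spark, the same body would go in a flatMap over uid:

```scala
// One row modeled as an Array: element 0 is the id, the elements from
// position 2 onward are the optional targets; absent columns are simply
// not in the array.
val row = Array("p69465323_serv80i", "7", "fb_406423006398063",
                "guest_861067032060185_android")

val id = row(0)
// Pair the id with every present, non-empty field from position 2 onward.
val pairs = row.drop(2).filter(_.trim.nonEmpty).map(v => (id, v))

pairs.foreach(println)
// (p69465323_serv80i,fb_406423006398063)
// (p69465323_serv80i,guest_861067032060185_android)
```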

This would work for the first line in the data example above, but not for the second and third.

The logic is wrong, I admit, but it's only to convey an idea of what I'm trying to do.

P.S.: Use of Spark SQL is not allowed.

Answer

You can do the following:

val read_file = sc.textFile("Path to input file")
// For every field in the row, pair the row id (field 2, before its first ':')
// with the field itself, keeping only the part before ':' when one is present.
val uid = read_file.map(line => line.split(",")).map(array => array.map(arr => {
    if (arr.contains(":")) (array(2).split(":")(0), arr.split(":")(0))
    else (array(2).split(":")(0), arr)
}))

Now, doing

uid.map(array => array.drop(2)).map(array => array.toSeq)

will give you the RDD as

WrappedArray(( p69465323_serv80i, p69465323_serv80i), ( p69465323_serv80i, 7 ), ( p69465323_serv80i, fb_406423006398063), ( p69465323_serv80i, guest_861067032060185_android), ( p69465323_serv80i, fb_100000829486587), ( p69465323_serv80i, fb_100007900293502), ( p69465323_serv80i, fb_172395756592775))
WrappedArray(( z67265107_serv77i, z67265107_serv77i), ( z67265107_serv77i, 2), ( z67265107_serv77i, fb_106996523208498), ( z67265107_serv77i, fb_274049626104849), ( z67265107_serv77i, fb_111857069377742), ( z67265107_serv77i, fb_127277511127344))
WrappedArray(( v73392772_serv33i, v73392772_serv33i), ( v73392772_serv33i, 1), ( v73392772_serv33i, c2eb11fd-99dc-4dee-a75c-bc9bfd2e0ae4iphone), ( v73392772_serv33i, fb_217409795286934))

And doing

uid.map(array => array.drop(2)).flatMap(array => array)

will give you the RDD as

( p69465323_serv80i, p69465323_serv80i)
( p69465323_serv80i, 7 )
( p69465323_serv80i, fb_406423006398063)
( p69465323_serv80i, guest_861067032060185_android)
( p69465323_serv80i, fb_100000829486587)
( p69465323_serv80i, fb_100007900293502)
( p69465323_serv80i, fb_172395756592775)
( z67265107_serv77i, z67265107_serv77i)
( z67265107_serv77i, 2)
( z67265107_serv77i, fb_106996523208498)
( z67265107_serv77i, fb_274049626104849)
( z67265107_serv77i, fb_111857069377742)
( z67265107_serv77i, fb_127277511127344)
( v73392772_serv33i, v73392772_serv33i)
( v73392772_serv33i, 1)
( v73392772_serv33i, c2eb11fd-99dc-4dee-a75c-bc9bfd2e0ae4iphone)
( v73392772_serv33i, fb_217409795286934)
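The same transformation can be exercised without a SparkContext by running it over a plain List of lines; the function body is exactly what goes inside the RDD's map/flatMap. This sketch also adds a .trim (an addition, not in the answer above) so the keys are not polluted by the spaces around the commas; the sample lines are shortened, hypothetical versions of the question's data:

```scala
// Two shortened sample lines, same shape as the question's csv rows.
val lines = List(
  "2017-05-09 21:52:42 , 1494391962 , p69465323_serv80i:10:450 , 7 , fb_406423006398063:396560",
  "2017-05-09 21:52:42 , 1494391962 , v73392772_serv33i:9:1400 , 1:4x , fb_217409795286934:294262")

val pairs = lines.map(_.split(","))
  .flatMap { array =>
    val id = array(2).trim.split(":")(0)   // row id from column 2
    array.drop(2).map { field =>           // skip the two timestamp columns
      val v = field.trim
      (id, if (v.contains(":")) v.split(":")(0) else v)
    }
  }

pairs.foreach(println)
// (p69465323_serv80i,p69465323_serv80i)
// (p69465323_serv80i,7)
// ...
```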

The choice is yours.

