How to identify null fields in a csv file?
Question
I'm using Spark 2.1.1 and Scala 2.11.8.
I have to read data from a csv file in which each row has between 6 and 9 columns. After splitting a row on commas, columns 0 to 5 always have data, but columns 6 to 8 may be present or absent. I separated the required columns and stored them in an RDD using:
val read_file = sc.textFile("Path to input file");
val uid = read_file.map(line => {
  val arr = line.split(",")
  (arr(2).split(":")(0), arr(3), arr(4).split(":")(0), arr(5).split(":")(0),
   arr(6).split(":")(0), arr(7).split(":")(0), arr(8).split(":")(0))
})
Now, in the resulting RDD 'uid', columns 0 to 3 will always be filled, but columns 4 to 7 may or may not have data. For example, here is the csv file I'm reading from:
2017-05-09 21:52:42 , 1494391962 , p69465323_serv80i:10:450 , 7 , fb_406423006398063:396560, guest_861067032060185_android:671051, fb_100000829486587:186589, fb_100007900293502:407374, fb_172395756592775:649795
2017-05-09 21:52:42 , 1494391962 , z67265107_serv77i:4:45 , 2:Re , fb_106996523208498:110066, fb_274049626104849:86632, fb_111857069377742:69348, fb_127277511127344:46246
2017-05-09 21:52:42 , 1494391962 , v73392772_serv33i:9:1400 , 1:4x , c2eb11fd-99dc-4dee-a75c-bc9bfd2e0ae4iphone:314129, fb_217409795286934:294262
As can be seen, the first entry has all 9 columns filled, the second has 8, and the third has only 6.
From the resulting RDD, I have to map column arr(1)(0) with each of columns arr(3)(0) to arr(7)(0). Column 1 should be mapped only with the filled columns from 3 to 7; empty columns in that range should be skipped. I was trying to do this using a for loop:
After executing the statement val uid = read_file.map(), I have this:
(String, String, String, String, String, String, String) = (" p69465323_serv80i"," 7 "," fb_406423006398063"," guest_861067032060185_android"," fb_100000829486587"," fb_100007900293502"," fb_172395756592775")
I then do:
for (var x <= 5 to 7) { if var arr => (arr(x) != null) {
val pairedRdd = uid.map(x => ((x._1, x._3), (x._1, x._4), (x._1, x._5), (x._1, x._6), (x._1, x._7)) ) }
This works for the first row of the example data but not for the second and third.
The logic is wrong, I admit, but it's only meant to convey what I'm trying to do.
P.S : Use of Spark SQL is not allowed.
Recommended answer
You can do the following:
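val read_file = sc.textFile("Path to input file")
// Pair every field of a row with the id from column 2,
// keeping only the text before the first ":" in each field.
val uid = read_file.map(line => line.split(",")).map(array => array.map(arr => {
  if (arr.contains(":")) (array(2).split(":")(0), arr.split(":")(0))
  else (array(2).split(":")(0), arr)
}))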
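The reason this avoids the problem in the question is that a missing column is not a null value: split(",") simply returns a shorter array, so indexing arr(8) on a 6-field row throws rather than yielding null. The following is a minimal sketch in plain Scala (no Spark needed; it can be pasted into spark-shell or the Scala REPL). The sample strings are shortened, made-up rows that only mimic the field counts in the question.

```scala
// Made-up rows with the same shape as the question's data (9 and 6 fields).
val full  = "d, t, id:1:2, 7, a:1, b:2, c:3, d:4, e:5"
val short = "d, t, id:1:2, 1:4x, a:1, b:2"

val fullFields  = full.split(",")
val shortFields = short.split(",")

println(fullFields.length)   // 9
println(shortFields.length)  // 6

// shortFields(8) would throw ArrayIndexOutOfBoundsException;
// lift returns an Option instead of throwing.
println(shortFields.lift(8))             // None
println(fullFields.lift(8).map(_.trim))  // Some(e:5)

// String.split drops trailing empty strings unless a negative limit is given.
println("a,b,,".split(",").length)       // 2
println("a,b,,".split(",", -1).length)   // 4
```

Mapping over whatever fields the array actually contains, as the answer does, sidesteps the fixed indices entirely.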
Now, running
uid.map(array => array.drop(2)).map(array => array.toSeq)
will give you an RDD such as
WrappedArray(( p69465323_serv80i, p69465323_serv80i), ( p69465323_serv80i, 7 ), ( p69465323_serv80i, fb_406423006398063), ( p69465323_serv80i, guest_861067032060185_android), ( p69465323_serv80i, fb_100000829486587), ( p69465323_serv80i, fb_100007900293502), ( p69465323_serv80i, fb_172395756592775))
WrappedArray(( z67265107_serv77i, z67265107_serv77i), ( z67265107_serv77i, 2), ( z67265107_serv77i, fb_106996523208498), ( z67265107_serv77i, fb_274049626104849), ( z67265107_serv77i, fb_111857069377742), ( z67265107_serv77i, fb_127277511127344))
WrappedArray(( v73392772_serv33i, v73392772_serv33i), ( v73392772_serv33i, 1), ( v73392772_serv33i, c2eb11fd-99dc-4dee-a75c-bc9bfd2e0ae4iphone), ( v73392772_serv33i, fb_217409795286934))
And running
uid.map(array => array.drop(2)).flatMap(array => array)
will give you an RDD such as
( p69465323_serv80i, p69465323_serv80i)
( p69465323_serv80i, 7 )
( p69465323_serv80i, fb_406423006398063)
( p69465323_serv80i, guest_861067032060185_android)
( p69465323_serv80i, fb_100000829486587)
( p69465323_serv80i, fb_100007900293502)
( p69465323_serv80i, fb_172395756592775)
( z67265107_serv77i, z67265107_serv77i)
( z67265107_serv77i, 2)
( z67265107_serv77i, fb_106996523208498)
( z67265107_serv77i, fb_274049626104849)
( z67265107_serv77i, fb_111857069377742)
( z67265107_serv77i, fb_127277511127344)
( v73392772_serv33i, v73392772_serv33i)
( v73392772_serv33i, 1)
( v73392772_serv33i, c2eb11fd-99dc-4dee-a75c-bc9bfd2e0ae4iphone)
( v73392772_serv33i, fb_217409795286934)
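Note that the output above still contains the id paired with itself and with column 3, and the values keep their leading spaces. If only columns 4 onward should be paired with the id, as the question seems to ask, one possible variation is sketched below. The helper name pairsFromLine is hypothetical and not part of the original answer; it assumes the id is always in column 2 and that blank fields should be skipped.

```scala
// Hypothetical helper (an assumption, not from the original answer):
// pairs the id in column 2 with every non-empty field from column 4 onward.
def pairsFromLine(line: String): Seq[(String, String)] = {
  // Limit -1 keeps trailing empty fields so they can be filtered explicitly.
  val fields = line.split(",", -1).map(_.trim)
  if (fields.length < 3) Seq.empty
  else {
    // Keep only the text before the first ':' (the whole field if there is none).
    val key = fields(2).takeWhile(_ != ':')
    fields.drop(4)
      .filter(_.nonEmpty)
      .map(f => (key, f.takeWhile(_ != ':')))
      .toSeq
  }
}

// Third sample row from the question (6 fields).
val row = "2017-05-09 21:52:42 , 1494391962 , v73392772_serv33i:9:1400 , 1:4x , " +
  "c2eb11fd-99dc-4dee-a75c-bc9bfd2e0ae4iphone:314129, fb_217409795286934:294262"

pairsFromLine(row).foreach(println)
// (v73392772_serv33i,c2eb11fd-99dc-4dee-a75c-bc9bfd2e0ae4iphone)
// (v73392772_serv33i,fb_217409795286934)

// With Spark, the same function could be applied per line:
//   val pairs = read_file.flatMap(pairsFromLine)
```

Keeping the parsing in a plain function makes it easy to check against sample rows before running it on the RDD.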
Which form to use is up to you.