Add a column to a Spark DataFrame and calculate a value for it


Problem Description

I have a CSV file, containing latitude and longitude columns, that I'm loading into a SQLContext.

val sqlContext = new org.apache.spark.sql.SQLContext(sc)

val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "false")
  .option("delimiter", "\t")
  .schema(customSchema)
  .load(inputFile)
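The customSchema referenced above isn't shown in the question; a minimal sketch matching the three columns in the example CSV below (the types are my assumptions) might look like:

import org.apache.spark.sql.types._

// Hypothetical schema for the example CSV; all fields are marked
// nullable to allow missing values such as an absent metro_code
val customSchema = StructType(Seq(
  StructField("metro_code", IntegerType, nullable = true),
  StructField("resolved_lat", DoubleType, nullable = true),
  StructField("resolved_lon", DoubleType, nullable = true)
))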

Example CSV:

metro_code, resolved_lat, resolved_lon
602, 40.7201, -73.2001

I'm trying to figure out the best way to add a new column and calculate the GeoHex for each row. Hashing the lat and long is easy with the geohex package. I think I need to either use the parallelize method, or pass a function to withColumn, as I've seen in some examples.

Recommended Answer

Wrapping the required function in a UDF should do the trick:

import org.apache.spark.sql.functions.udf
import org.geohex.geohex4j.GeoHex
import sqlContext.implicits._ // for toDF and the $"col" syntax

val df = sc.parallelize(Seq(
  (Some(602), 40.7201, -73.2001), (None, 5.7805, 139.5703)
)).toDF("metro_code", "resolved_lat", "resolved_lon")

// Builds a UDF that encodes a (lat, lon) pair at the given GeoHex level
def geoEncode(level: Int) = udf(
  (lat: Double, long: Double) => GeoHex.encode(lat, long, level))

df.withColumn("code", geoEncode(9)($"resolved_lat", $"resolved_lon")).show
// +----------+------------+------------+-----------+
// |metro_code|resolved_lat|resolved_lon|       code|
// +----------+------------+------------+-----------+
// |       602|     40.7201|    -73.2001|PF384076026|
// |      null|      5.7805|    139.5703|PR081331784|
// +----------+------------+------------+-----------+
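Since the question also mentions parallelize, here is a rough sketch of the non-UDF alternative: map over the DataFrame's underlying RDD and rebuild the DataFrame with an extended schema. The variable names are my own, and this is only an illustration of that approach:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Compute the GeoHex code row by row and append it to each Row
val encoded = df.rdd.map { row =>
  val lat = row.getAs[Double]("resolved_lat")
  val lon = row.getAs[Double]("resolved_lon")
  Row.fromSeq(row.toSeq :+ GeoHex.encode(lat, lon, 9))
}

// Extend the original schema with the new string column
val schemaWithCode = StructType(df.schema.fields :+
  StructField("code", StringType, nullable = true))

val dfWithCode = sqlContext.createDataFrame(encoded, schemaWithCode)

The UDF version above is more concise and keeps everything inside the DataFrame API, so it is generally the better choice here.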
