Finding the difference of two columns in Spark dataframes and appending to a new column


Question

Below is my code for loading CSV data into a dataframe, computing the difference between two columns, and appending it as a new column using withColumn. The two columns whose difference I am trying to find are of type Double. Please help me figure out the following exception:

import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.SparkSession

/**
  * Created by Guest1 on 5/10/2017.
  */
object arith extends App {
  Logger.getLogger("org").setLevel(Level.ERROR)
  Logger.getLogger("akka").setLevel(Level.ERROR)

  val spark = SparkSession.builder().appName("Arithmetics").
                config("spark.master", "local").getOrCreate()
  val df = spark.read.option("header", "true")
                  .option("inferSchema", "true")
                  .csv("./Input/Arith.csv").persist()

//  df.printSchema()
val sim =df("Average Total Payments") -df("Average Medicare Payments").show(5)
}

I am getting the following exception:

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot resolve column name "Average Total Payments" among (DRG Definition, Provider Id, Provider Name, Provider Street Address, Provider City, Provider State, Provider Zip Code, Hospital Referral Region Description,  Total Discharges ,  Average Covered Charges ,  Average Total Payments , Average Medicare Payments);
    at org.apache.spark.sql.Dataset$$anonfun$resolve$1.apply(Dataset.scala:219)
    at org.apache.spark.sql.Dataset$$anonfun$resolve$1.apply(Dataset.scala:219)
    at scala.Option.getOrElse(Option.scala:121)
    at org.apache.spark.sql.Dataset.resolve(Dataset.scala:218)
    at org.apache.spark.sql.Dataset.col(Dataset.scala:1073)
    at org.apache.spark.sql.Dataset.apply(Dataset.scala:1059)
    at arith$.delayedEndpoint$arith$1(arith.scala:19)
    at arith$delayedInit$body.apply(arith.scala:7)
    at scala.Function0$class.apply$mcV$sp(Function0.scala:34)
    at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
    at scala.App$$anonfun$main$1.apply(App.scala:76)
    at scala.App$$anonfun$main$1.apply(App.scala:76)
    at scala.collection.immutable.List.foreach(List.scala:381)
    at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:35)
    at scala.App$class.main(App.scala:76)
    at arith$.main(arith.scala:7)
    at arith.main(arith.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:497)
    at com.intellij.rt.execution.application.AppMain.main(AppMain.java:144)

Answer

There are a couple of issues here.

First, if you look at the exception, it basically tells you that there is no "Average Total Payments" column in the dataframe (it also helpfully lists the columns it does see). It seems the column names read from the CSV carry extra spaces at the start and end.
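A minimal workaround (a sketch, not part of the original answer) is to trim every column name right after reading the CSV, so that df("Average Total Payments") resolves:

// Rename each column to its trimmed form; withColumnRenamed is a no-op
// when the old and new names are already identical.
val cleaned = df.columns.foldLeft(df) { (d, c) =>
  d.withColumnRenamed(c, c.trim)
}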

Second, df("Average Total Payments") and df("Average Medicare Payments") are Columns.

You are trying to call show on df("Average Medicare Payments"). show is not a member of Column (and on a DataFrame it returns Unit, so you couldn't do df("Average Total Payments") - df("Average Medicare Payments").show(5) anyway, because that would be Column - Unit).
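To make the distinction concrete, here is a hedged sketch (assuming the trimmed column names from the snippet above): the subtraction only builds a Column expression, and show becomes available once that expression is wrapped in select, which returns a DataFrame:

// The subtraction is a Column; select(...) turns it into a DataFrame,
// which is what exposes show().
cleaned.select(
  (cleaned("Average Total Payments") - cleaned("Average Medicare Payments")).alias("diff")
).show(5)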

What you want to do is define a new column that is the difference between the two, add it to the dataframe as a new column, and then select just that column and show it. For example:

val sim = df.withColumn("diff", df("Average Total Payments") - df("Average Medicare Payments"))
sim.select("diff").show(5)
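Note that, because of the first issue, the column names read from the CSV may still carry the stray padding, so either trim them first (as sketched earlier) or reference the padded names exactly as the exception prints them. A hedged variant, where the exact padding is copied from the exception message and may differ in your file:

// " Average Total Payments " (with surrounding spaces) is an assumption
// taken from the exception output; adjust it to match your CSV header.
val sim2 = df.withColumn("diff",
  df(" Average Total Payments ") - df("Average Medicare Payments"))
sim2.select("diff").show(5)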
