Finding the difference of two columns in Spark dataframes and appending to a new column


Question

Below is my code for loading CSV data into a dataframe, taking the difference of two columns, and appending it as a new column using withColumn. The two columns whose difference I am trying to find are of type Double. Please help me figure out the following exception:

import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.SparkSession

/**
  * Created by Guest1 on 5/10/2017.
  */
object arith extends App {
  Logger.getLogger("org").setLevel(Level.ERROR)
  Logger.getLogger("akka").setLevel(Level.ERROR)

  val spark = SparkSession.builder().appName("Arithmetics").
                config("spark.master", "local").getOrCreate()
  val df = spark.read.option("header", "true")
                  .option("inferSchema", "true")
                  .csv("./Input/Arith.csv").persist()

//  df.printSchema()
  val sim = df("Average Total Payments") - df("Average Medicare Payments").show(5)
}

I am getting the following exception:

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot resolve column name "Average Total Payments" among (DRG Definition, Provider Id, Provider Name, Provider Street Address, Provider City, Provider State, Provider Zip Code, Hospital Referral Region Description,  Total Discharges ,  Average Covered Charges ,  Average Total Payments , Average Medicare Payments);
    at org.apache.spark.sql.Dataset$$anonfun$resolve$1.apply(Dataset.scala:219)
    at org.apache.spark.sql.Dataset$$anonfun$resolve$1.apply(Dataset.scala:219)
    at scala.Option.getOrElse(Option.scala:121)
    at org.apache.spark.sql.Dataset.resolve(Dataset.scala:218)
    at org.apache.spark.sql.Dataset.col(Dataset.scala:1073)
    at org.apache.spark.sql.Dataset.apply(Dataset.scala:1059)
    at arith$.delayedEndpoint$arith$1(arith.scala:19)
    at arith$delayedInit$body.apply(arith.scala:7)
    at scala.Function0$class.apply$mcV$sp(Function0.scala:34)
    at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
    at scala.App$$anonfun$main$1.apply(App.scala:76)
    at scala.App$$anonfun$main$1.apply(App.scala:76)
    at scala.collection.immutable.List.foreach(List.scala:381)
    at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:35)
    at scala.App$class.main(App.scala:76)
    at arith$.main(arith.scala:7)
    at arith.main(arith.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:497)
    at com.intellij.rt.execution.application.AppMain.main(AppMain.java:144)

Answer

There are multiple issues here.

First, if you look at the exception, it basically tells you that there is no "Average Total Payments" column in the dataframe (it also helpfully lists the columns it does see). The column names read from the CSV appear to have extra spaces at the ends.
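Column names are plain strings, so stray whitespace makes "Average Total Payments" and " Average Total Payments " different keys. A minimal, Spark-free sketch of the clean-up (the raw names below are copied from the exception message):

```scala
object TrimCols extends App {
  // Raw header names as reported in the AnalysisException; note the
  // leading/trailing spaces on several of them.
  val rawCols = Seq(" Total Discharges ", " Average Covered Charges ",
                    " Average Total Payments ", "Average Medicare Payments")

  // Trim every name. On a real DataFrame the same idea can be applied
  // with df.toDF(df.columns.map(_.trim): _*).
  val trimmed = rawCols.map(_.trim)
  println(trimmed.mkString(", "))
}
```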

Second, df("Average Total Payments") and df("Average Medicare Payments") are Columns.

You are trying to call show on df("Average Medicare Payments"). show is not a member of Column (and on a DataFrame it returns Unit, so you couldn't do df("Average Total Payments") - df("Average Medicare Payments").show(5) anyway, because that would be Column - Unit).
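The type error comes from how the expression parses: the method call on the right binds before the - operator, so the subtraction sees the result of show, not the column. A toy sketch with hypothetical stand-in types (not Spark's real Column class) makes the parse visible:

```scala
object PrecedenceDemo extends App {
  // Hypothetical stand-in for a Spark Column: supports `-` and a
  // show method that returns Unit, like the real one.
  case class Col(name: String) {
    def -(other: Col): Col = Col(s"($name - ${other.name})")
    def show(n: Int): Unit = println(s"showing $name, first $n rows")
  }

  val a = Col("Average Total Payments")
  val b = Col("Average Medicare Payments")

  // a - b.show(5) parses as a.-(b.show(5)), i.e. Col - Unit: a type error.
  // Build the difference first, then show it:
  val diff = a - b
  diff.show(5)
}
```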

What you want to do is define a new column that is the difference between the two, add it to the dataframe as a new column, then select just that column and show it. For example:

val sim = df.withColumn("diff", df("Average Total Payments") - df("Average Medicare Payments"))
sim.select("diff").show(5)

