Duplicate columns in Spark Dataframe
Question
I have a 10GB CSV file in a Hadoop cluster with duplicate columns. I try to analyse it in SparkR, so I use the spark-csv package to parse it as a DataFrame:
df <- read.df(
sqlContext,
FILE_PATH,
source = "com.databricks.spark.csv",
header = "true",
mode = "DROPMALFORMED"
)
But since df has duplicate Email columns, trying to select that column errors out:
select(df, 'Email')
15/11/19 15:41:58 ERROR RBackendHandler: select on 1422 failed
Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
org.apache.spark.sql.AnalysisException: Reference 'Email' is ambiguous, could be: Email#350, Email#361.;
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:278)
...
I want to keep the first occurrence of the Email column and delete the latter. How can I do that?
Answer
The best way would be to change the column name upstream ;)
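If you do control the file, one way to sketch that upstream rename is to rewrite the CSV header so repeated names become unique before Spark ever reads the file. This is an illustrative Python sketch (the sample data is made up), using a suffixing scheme similar to R's make.unique:

```python
import csv
import io

def uniquify_header(names):
    """Append .1, .2, ... to repeated column names, like R's make.unique."""
    seen = {}
    out = []
    for name in names:
        if name in seen:
            seen[name] += 1
            out.append(f"{name}.{seen[name]}")
        else:
            seen[name] = 0
            out.append(name)
    return out

# Simulate a CSV with a duplicated Email column.
raw = "Name,Email,Email\nalice,a@x.com,a@y.com\n"
rows = list(csv.reader(io.StringIO(raw)))
rows[0] = uniquify_header(rows[0])  # header becomes Name, Email, Email.1

buf = io.StringIO()
csv.writer(buf).writerows(rows)
print(buf.getvalue())
```

For a real 10GB file you would stream this line by line on the cluster rather than load it into memory, but the header fix itself only touches the first line.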
However, it seems that is not possible, so there are a couple of options:
If the case of the column names differs ("email" vs "Email"), you can turn on case sensitivity:
sql(sqlContext, "set spark.sql.caseSensitive=true")
If the column names are exactly the same, you will need to manually specify the schema and skip the first row to avoid the headers:
customSchema <- structType(
structField("year", "integer"),
structField("make", "string"),
structField("model", "string"),
structField("comment", "string"),
structField("blank", "string"))
df <- read.df(sqlContext, "cars.csv", source = "com.databricks.spark.csv", header="true", schema = customSchema)
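The same idea (supplying your own unique field names and skipping the file's header row) can be sketched outside Spark with Python's csv module. The data and the Email2 name here are hypothetical, just to show why the first Email occurrence becomes unambiguous:

```python
import csv
import io

# A file whose header repeats "Email".
raw = "Email,Email\nfirst@x.com,second@x.com\n"

# Supply unique field names ourselves, mirroring a custom schema,
# then skip the file's own (duplicated) header line.
reader = csv.DictReader(io.StringIO(raw), fieldnames=["Email", "Email2"])
next(reader)  # drop the original header row
rows = list(reader)
print(rows[0]["Email"])  # the first occurrence is now addressable by name
```

With the customSchema approach above, the field names you pass to structType play exactly this role: they replace the file's ambiguous header, so `select(df, 'Email')` resolves to a single column.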