"skip.header.line.count"="1" does not work in Hive in SparkSession


Problem description


I am trying to load CSV data into a Hive table using SparkSession. I want to skip the header row while loading into the Hive table, but setting tblproperties("skip.header.line.count"="1") does not work either.

I am using the below code:

import java.io.File

import org.apache.spark.sql.{SparkSession,Row,SaveMode}

case class Record(key: Int, value: String)

val warehouseLocation=new File("spark-warehouse").getAbsolutePath

val spark=SparkSession.builder().appName("Apache Spark Book Crossing Analysis").config("spark.sql.warehouse.dir",warehouseLocation).enableHiveSupport().getOrCreate()

import spark.implicits._
import spark.sql
//sql("set hive.vectorized.execution.enabled=false")
sql("drop table if exists BookTemp")
sql ("create table BookTemp(ISBN int,BookTitle String,BookAuthor String ,YearOfPublication int,Publisher String,ImageURLS String,ImageURLM String,ImageURLL String)row format delimited fields terminated by ';' ")
sql("alter table BookTemp set TBLPROPERTIES("skip.header.line.count"="1")")
sql("load data local inpath 'BX-Books.csv' into table BookTemp")
sql("select * from BookTemp limit 5").show

Console error:

res55: org.apache.spark.sql.DataFrame = []
<console>:1: error: ')' expected but '.' found.
sql("alter table BookTemp set TBLPROPERTIES("skip.header.line.count"="1")")

2019-02-20 22:48:09 WARN  LazyStruct:151 - Extra bytes detected at the end of the row! Ignoring similar problems.
+----+--------------------+--------------------+-----------------+--------------------+--------------------+--------------------+--------------------+
|ISBN|           BookTitle|          BookAuthor|YearOfPublication|           Publisher|           ImageURLS|           ImageURLM|           ImageURLL|
+----+--------------------+--------------------+-----------------+--------------------+--------------------+--------------------+--------------------+
|null|        "Book-Title"|       "Book-Author"|             null|         "Publisher"|       "Image-URL-S"|       "Image-URL-M"|       "Image-URL-L"|
|null|"Classical Mythol...|"Mark P. O. Morford"|             null|"Oxford Universit...|"http://images.am...|"http://images.am...|"http://images.am...|
|null|      "Clara Callan"|"Richard Bruce Wr...|             null|"HarperFlamingo C...|"http://images.am...|"http://images.am...|"http://images.am...|
|null|"Decision in Norm...|      "Carlo D'Este"|             null|   "HarperPerennial"|"http://images.am...|"http://images.am...|"http://images.am...|
|null|"Flu: The Story o...|  "Gina Bari Kolata"|             null|"Farrar Straus Gi...|"http://images.am...|"http://images.am...|"http://images.am...|
+----+--------------------+--------------------+-----------------+--------------------+--------------------+--------------------+--------------------+
only showing top 5 rows

As shown in the result, I want to skip the first row of data.
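A note on the console error above: the nested double quotes inside the ALTER TABLE call terminate the Scala string literal early, which is what produces `')' expected but '.' found`. A minimal fix (reusing the table name and property from the question) is to use single quotes inside the SQL:

```scala
// Single quotes inside the Scala double-quoted string avoid the parse error
sql("alter table BookTemp set TBLPROPERTIES('skip.header.line.count'='1')")
```

Even with the corrected syntax, older Spark versions have been reported to ignore `skip.header.line.count` when reading Hive text tables (tracked as SPARK-11374), so the header row may still show up in query results.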

Recommended answer

Another alternative I was trying: using Spark SQL to convert the CSV with a header into Parquet:

val df = spark.sql("select * from schema.table")

df.coalesce(1).write.options(Map("header" -> "true", "compression" -> "snappy")).mode(SaveMode.Overwrite).parquet()
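A common alternative (a sketch, not from the original answer; the file name and `;` delimiter are taken from the question) is to sidestep the Hive table property entirely and let Spark's native CSV reader consume the header line:

```scala
// Spark's CSV reader drops the first line when header=true
val booksDF = spark.read
  .option("header", "true")
  .option("delimiter", ";")
  .csv("BX-Books.csv")

// Overwrite the Hive table with the header-free data
booksDF.write.mode(SaveMode.Overwrite).saveAsTable("BookTemp")
```

This way the header is removed at read time, so no `TBLPROPERTIES` workaround is needed on the Hive side.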

