Spark 2.2 Thrift server error on dataframe: NumberFormatException when querying a Hive table


Question

        I have Hortonworks HDP 2.6.3 running Spark2 (v2.2). My test case is very simple:

        1. Create a Hive table with some random values (Hive listens on port 10000)

        2. Turn on Spark Thrift server at 10016

        3. Run pyspark and query the Hive table via 10016

        However, I was unable to get the data from Spark due to NumberFormatException.

        Here is my test case:

        1. Create a Hive table with sample rows:

        beeline> !connect jdbc:hive2://localhost:10000/default hive hive
        
        create table test1 (id int, desc varchar(40));
        
        insert into table test1 values (1,"aa"),(2,"bb");
        

        2. Run the Spark Thrift server:

        su - spark -c '/usr/hdp/2.6.3.0-235/spark2/sbin/start-thriftserver.sh --master yarn-client --executor-memory 512m --hiveconf hive.server2.thrift.port=10016'

        3. Run pyspark as the spark user: su - spark -c 'pyspark'

        4. Type in the code below:

          df = sqlContext.read.format("jdbc").options(driver="org.apache.hive.jdbc.HiveDriver", url="jdbc:hive2://localhost:10016/default", dbtable="test1",user="hive", password="hive").load()

          df.select("*").show()

        5. I got this error:

        17/12/15 08:04:13 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 2) java.sql.SQLException: Cannot convert column 1 to integerjava.lang.NumberFormatException: For input string: "id" at org.apache.hive.jdbc.HiveBaseResultSet.getInt(HiveBaseResultSet.java:351) at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$6.apply(JdbcUtils.scala:394) at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$6.apply(JdbcUtils.scala:393) at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:330) at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:312) at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395) at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:234) at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:228) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) at org.apache.spark.scheduler.Task.run(Task.scala:108) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: java.lang.NumberFormatException: For input string: "id" at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65) at java.lang.Integer.parseInt(Integer.java:580) at java.lang.Integer.valueOf(Integer.java:766) at org.apache.hive.jdbc.HiveBaseResultSet.getInt(HiveBaseResultSet.java:346) ... 
23 more 17/12/15 08:04:13 WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2, localhost, executor driver): java.sql.SQLException: Cannot convert column 1 to integerjava.lang.NumberFormatException: For input string: "id" at org.apache.hive.jdbc.HiveBaseResultSet.getInt(HiveBaseResultSet.java:351) at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$6.apply(JdbcUtils.scala:394) at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$6.apply(JdbcUtils.scala:393) at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:330) at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:312) at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395) at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:234) at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:228) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) at org.apache.spark.scheduler.Task.run(Task.scala:108) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: java.lang.NumberFormatException: For input string: "id" at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65) at java.lang.Integer.parseInt(Integer.java:580) at java.lang.Integer.valueOf(Integer.java:766) at org.apache.hive.jdbc.HiveBaseResultSet.getInt(HiveBaseResultSet.java:346) ... 23 more

        17/12/15 08:04:14 ERROR TaskSetManager: Task 0 in stage 2.0 failed 1 times; aborting job Traceback (most recent call last): File "", line 1, in File "/usr/hdp/current/spark2-client/python/pyspark/sql/dataframe.py", line 336, in show print(self._jdf.showString(n, 20)) File "/usr/hdp/current/spark2-client/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in call File "/usr/hdp/current/spark2-client/python/pyspark/sql/utils.py", line 63, in deco return f(*a, **kw) File "/usr/hdp/current/spark2-client/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value py4j.protocol.Py4JJavaError: An error occurred while calling o75.showString. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2.0 failed 1 times, most recent failure: Lost task 0.0 in stage 2.0 (TID 2, localhost, executor driver): java.sql.SQLException: Cannot convert column 1 to integerjava.lang.NumberFormatException: For input string: "id" at org.apache.hive.jdbc.HiveBaseResultSet.getInt(HiveBaseResultSet.java:351) at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$6.apply(JdbcUtils.scala:394) at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$6.apply(JdbcUtils.scala:393) at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:330) at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:312) at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395) at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:234) at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:228) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) at org.apache.spark.scheduler.Task.run(Task.scala:108) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: java.lang.NumberFormatException: For input string: "id" at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65) at java.lang.Integer.parseInt(Integer.java:580) at java.lang.Integer.valueOf(Integer.java:766) at org.apache.hive.jdbc.HiveBaseResultSet.getInt(HiveBaseResultSet.java:346) ... 23 more

        Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1517) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1505) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1504) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1504) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814) at scala.Option.foreach(Option.scala:257) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:814) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1732) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1687) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1676) at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:630) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2029) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2050) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2069) at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:336) at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38) at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:2854) at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2154) at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2154) at org.apache.spark.sql.Dataset$$anonfun$55.apply(Dataset.scala:2838) at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65) at org.apache.spark.sql.Dataset.withAction(Dataset.scala:2837) at org.apache.spark.sql.Dataset.head(Dataset.scala:2154) at org.apache.spark.sql.Dataset.take(Dataset.scala:2367) at org.apache.spark.sql.Dataset.showString(Dataset.scala:245) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) at py4j.Gateway.invoke(Gateway.java:280) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:214) at java.lang.Thread.run(Thread.java:745) Caused by: java.sql.SQLException: Cannot convert column 1 to integerjava.lang.NumberFormatException: For input string: "id" at org.apache.hive.jdbc.HiveBaseResultSet.getInt(HiveBaseResultSet.java:351) at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$6.apply(JdbcUtils.scala:394) at 
org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$6.apply(JdbcUtils.scala:393) at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:330) at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:312) at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395) at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:234) at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:228) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) at org.apache.spark.scheduler.Task.run(Task.scala:108) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) ... 1 more Caused by: java.lang.NumberFormatException: For input string: "id" at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65) at java.lang.Integer.parseInt(Integer.java:580) at java.lang.Integer.valueOf(Integer.java:766) at org.apache.hive.jdbc.HiveBaseResultSet.getInt(HiveBaseResultSet.java:346) ... 23 more

        6. I suspected it had something to do with the id column, so I changed to this: df.select("desc").show()

        7. Then I got this strange result:

        +----+
        |desc|
        +----+
        |desc|
        |desc|
        +----+
        

        8. If I go back to querying with beeline, everything works fine via port 10016:

        beeline> !connect jdbc:hive2://localhost:10016/default hive hive
        
        select * from test1;
        
        +-----+-------+--+
        | id  | desc  |
        +-----+-------+--+
        | 1   | aa    |
        | 2   | bb    |
        +-----+-------+--+
        

        9. If I change to port 10000 in pyspark, the same problem persists.

        Could you please help me understand why and how to get the rows via Spark?
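        For reference, a quick way to inspect the schema Spark derives from the JDBC ResultSet metadata is shown below (a diagnostic sketch, reusing the df from step 4; note that the schema is resolved at load() time, before any rows are fetched):

            # Print the schema Spark built from the Hive JDBC metadata;
            # it should match the Hive DDL (id: int, desc: string).
            df.printSchema()
            print(df.dtypes)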

        UPDATE 1

        I followed @Achyuth's advice below in both cases and it still doesn't work.

        Case 1

        Beeline:

        create table test4 (id String, desc String);
        insert into table test4 values ("1","aa"),("2","bb");
        select * from test4;
        

        Pyspark:

        >>> df = sqlContext.read.format("jdbc").options(driver="org.apache.hive.jdbc.HiveDriver", url="jdbc:hive2://localhost:10016/default", dbtable="test4",user="hive", password="hive").option("fetchsize", "10").load()
        >>> df.select("*").show()
        +---+----+
        | id|desc|
        +---+----+
        | id|desc|
        | id|desc|
        +---+----+
        

        For some reason, it returned the column names as row values?!
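        The rows above look like the column headers echoed back as data. A crude way to confirm that is to filter out header-valued rows and see whether anything real remains (a diagnostic sketch, not a fix, reusing the test4 df):

            from pyspark.sql.functions import col

            # If nothing remains after dropping rows whose id equals the
            # literal string "id", the JDBC read returned only header
            # rows and no actual table data.
            df.filter(col("id") != "id").show()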

        Case 2

        Beeline:

        create table test5 (id int, desc varchar(40)) STORED AS ORC;
        insert into table test5 values (1,"aa"),(2,"bb");
        select * from test5;
        

        Pyspark:

        Still the same error: Caused by: java.lang.NumberFormatException: For input string: "id"
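        One possible way to dodge the integer conversion on the driver side is to push a cast into a JDBC subquery, so the Hive driver only ever returns strings. This is a sketch only; whether HiveDriver honors the subquery form of dbtable was not verified here:

            # Spark's JDBC source accepts a parenthesized subquery as dbtable.
            df = sqlContext.read.format("jdbc").options(
                driver="org.apache.hive.jdbc.HiveDriver",
                url="jdbc:hive2://localhost:10016/default",
                dbtable="(select cast(id as string) as id, `desc` from test5) t",
                user="hive",
                password="hive").load()
            df.show()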

        UPDATE 2

        I created a table and inserted values via Hive port 10000, then queried it. This works fine via beeline:

        beeline> !connect jdbc:hive2://localhost:10000/default hive hive
        Connecting to jdbc:hive2://localhost:10000/default
        Connected to: Apache Hive (version 1.2.1000.2.5.3.0-37)
        Driver: Hive JDBC (version 1.2.1000.2.5.3.0-37)
        Transaction isolation: TRANSACTION_REPEATABLE_READ
        0: jdbc:hive2://localhost:10000/default> create table test2 (id String, desc String) STORED AS ORC;
        No rows affected (0.3 seconds)
        0: jdbc:hive2://localhost:10000/default> insert into table test2 values ("1","aa"),("2","bb");
        INFO  : Session is already open
        INFO  : Dag name: insert into table tes..."1","aa"),("2","bb")(Stage-1)
        INFO  : Tez session was closed. Reopening...
        INFO  : Session re-established.
        INFO  :
        
        INFO  : Status: Running (Executing on YARN cluster with App id application_1514019042819_0006)
        
        INFO  : Map 1: -/-
        INFO  : Map 1: 0/1
        INFO  : Map 1: 0(+1)/1
        INFO  : Map 1: 1/1
        INFO  : Loading data to table default.test2 from webhdfs://demo.myapp.local:40070/apps/hive/warehouse/test2/.hive-staging_hive_2017-12-23_04-29-54_569_601147868480753216-3/-ext-10000
        INFO  : Table default.test2 stats: [numFiles=1, numRows=2, totalSize=317, rawDataSize=342]
        No rows affected (15.414 seconds)
        0: jdbc:hive2://localhost:10000/default> select * from table2;
        Error: Error while compiling statement: FAILED: SemanticException [Error 10001]: Line 1:14 Table not found 'table2' (state=42S02,code=10001)
        0: jdbc:hive2://localhost:10000/default> select * from test2;
        +-----------+-------------+--+
        | test2.id  | test2.desc  |
        +-----------+-------------+--+
        | 1         | aa          |
        | 2         | bb          |
        +-----------+-------------+--+
        2 rows selected (0.364 seconds)
        

        Also via beeline, I can use the Spark Thrift Server on port 10016 to do the same thing, and it works fine:

        beeline> !connect jdbc:hive2://localhost:10016/default hive hive
        Connecting to jdbc:hive2://localhost:10016/default
        1: jdbc:hive2://localhost:10016/default> create table test3 (id String, desc String) STORED AS ORC;
        +---------+--+
        | Result  |
        +---------+--+
        +---------+--+
        No rows selected (1.234 seconds)
        1: jdbc:hive2://localhost:10016/default> insert into table test3 values ("1","aa"),("2","bb");
        +---------+--+
        | Result  |
        +---------+--+
        +---------+--+
        No rows selected (9.111 seconds)
        1: jdbc:hive2://localhost:10016/default> select * from test3;
        +-----+-------+--+
        | id  | desc  |
        +-----+-------+--+
        | 1   | aa    |
        | 2   | bb    |
        +-----+-------+--+
        2 rows selected (3.387 seconds)
        

        This means Spark and the Thrift Server work fine. But using pyspark I hit the same problem: the results come back empty:

        >>> df = sqlContext.read.format("jdbc").options(driver="org.apache.hive.jdbc.HiveDriver", url="jdbc:hive2://localhost:10016/default", dbtable="test3",user="hive", password="hive").load()
        >>> df.select("*").show()
        +---+----+
        | id|desc|
        +---+----+
        +---+----+
        
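        As a sanity check, the same table can be read without going through JDBC at all, since pyspark on this cluster shares the Hive metastore (a workaround sketch, assuming the pyspark session was built with Hive support, which is the HDP default):

            # Read test3 through the shared Hive metastore instead of JDBC;
            # Spark resolves the column types and file format directly.
            df2 = spark.table("default.test3")
            df2.show()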

        UPDATE 3

        DESCRIBE EXTENDED test3;

        # Detailed Table Information  | CatalogTable(
            Table: `default`.`test3`
            Owner: hive
            Created: Sat Dec 23 04:37:14 PST 2017
            Last Access: Wed Dec 31 16:00:00 PST 1969
            Type: MANAGED
            Schema: [`id` string, `desc` string]
            Properties: [totalSize=620, numFiles=2, transient_lastDdlTime=1514032656, STATS_GENERATED_VIA_STATS_TASK=true]
            Storage(Location: webhdfs://demo.myapp.local:40070/apps/hive/warehouse/test3, InputFormat: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat, OutputFormat: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat, Serde: org.apache.hadoop.hive.ql.io.orc.OrcSerde, Properties: [serialization.format=1]))
        

        SHOW CREATE TABLE test3;

        CREATE TABLE `test3`(`id` string, `desc` string)
        ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
        WITH SERDEPROPERTIES (
          'serialization.format' = '1'
        )
        STORED AS
          INPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
          OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
        TBLPROPERTIES (
          'totalSize' = '620',
          'numFiles' = '2',
          'transient_lastDdlTime' = '1514032656',
          'STATS_GENERATED_VIA_STATS_TASK' = 'true'
        )
        

        su - spark -c 'hdfs dfs -cat webhdfs://demo.myapp.local:40070/apps/hive/warehouse/test3/part-00000'

        Solution

        Even though you create the Hive table with specific data types, the underlying data you inserted is stored in String format.

        So when Spark tries to read the data, it uses the metastore to find the data types. The column is present as int in the Hive metastore but as string in the file, and that throws the cast exception.
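        One way to check this claim is to read the files behind the table directly, bypassing the metastore, and compare the file schema with the declared one (a sketch; the warehouse path is taken from the DESCRIBE EXTENDED output above):

            # Read the ORC files behind test3 without the metastore and
            # print the schema that is actually stored in the files.
            raw = spark.read.orc("webhdfs://demo.myapp.local:40070/apps/hive/warehouse/test3")
            raw.printSchema()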

        Solutions

        Create the table with string columns and read it from Spark; it will work:

        create table test1 (id String, desc String);
        
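        With an all-string table, the intended numeric type can be restored on the Spark side after loading (a sketch, reusing the JDBC options from the question):

            from pyspark.sql.functions import col

            df = sqlContext.read.format("jdbc").options(
                driver="org.apache.hive.jdbc.HiveDriver",
                url="jdbc:hive2://localhost:10016/default",
                dbtable="test1",
                user="hive",
                password="hive").load()

            # Cast the string column back to the intended integer type in Spark.
            typed = df.withColumn("id", col("id").cast("int"))
            typed.printSchema()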

        If you want the data types preserved, specify one of the file formats such as ORC or Parquet when creating the table, then insert into it. You can then read the table from Spark without exceptions:

         create table test1 (id int, desc varchar(40)) STORED AS ORC;
        
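        Reading such a typed ORC table tends to be most reliable through Spark's own Hive integration rather than the Hive JDBC driver (a sketch, assuming a Hive-enabled Spark session):

            # Spark takes the column types from the metastore and reads the
            # ORC data natively, so no string-to-int parsing happens at read time.
            spark.sql("select * from test1").show()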

        Now the question is: why is Hive able to read it? Hive has good cast options available, while Spark doesn't.
