Spark serialization error


Question


I am trying to learn Spark + Scala. I want to read from HBase, but without MapReduce. I created a simple HBase table, "test", and did 3 puts into it. I want to read it via Spark (not HBaseTest, which uses MapReduce). I tried to run the following commands in the shell:

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.{Get, HTable}
import org.apache.hadoop.hbase.util.Bytes

val numbers = Array(
  new Get(Bytes.toBytes("row1")),
  new Get(Bytes.toBytes("row2")),
  new Get(Bytes.toBytes("row3")))
val conf = new HBaseConfiguration()
val table = new HTable(conf, "test")
sc.parallelize(numbers, numbers.length).map(table.get).count()


I keep getting the error:

org.apache.spark.SparkException: Job aborted: Task not serializable: java.io.NotSerializableException: org.apache.hadoop.hbase.HBaseConfiguration


Can someone help me? How can I create an HTable that uses a serializable configuration?

Thanks

Answer


Your problem is that table is not serializable (rather, its member conf is not), and you're trying to serialize it by using it inside a map. The way you're trying to read HBase isn't quite correct: it looks like you're issuing some specific Gets and then trying to run them in parallel. Even if you did get this working, it really wouldn't scale, because you'd be performing random reads. What you want to do is perform a table scan using Spark; here is a code snippet that should help you do it:

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.mapreduce.TableInputFormat

val conf = HBaseConfiguration.create()
conf.set(TableInputFormat.INPUT_TABLE, tableName)  // e.g. "test"

sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
  classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
  classOf[org.apache.hadoop.hbase.client.Result])


This will give you an RDD containing the NavigableMaps that constitute the rows. Below is how you can change the NavigableMap into a normal Scala map of Strings:

import java.util.NavigableMap
import scala.collection.JavaConverters._
import org.apache.hadoop.hbase.util.Bytes

// Type aliases implied by the signatures below: the raw HBase row layout is
// family -> qualifier -> timestamp -> value, first as bytes, then as Strings.
type HBaseRow = NavigableMap[Array[Byte],
  NavigableMap[Array[Byte], NavigableMap[java.lang.Long, Array[Byte]]]]
type CFTimeseriesRow = Map[Array[Byte], Map[Array[Byte], Map[Long, Array[Byte]]]]
type CFTimeseriesRowStr = Map[String, Map[String, Map[Long, String]]]

...
.map(kv => (kv._1.get(), navMapToMap(kv._2.getMap)))
.map(kv => (Bytes.toString(kv._1), rowToStrMap(kv._2)))

def navMapToMap(navMap: HBaseRow): CFTimeseriesRow =
  navMap.asScala.toMap.map(cf =>
    (cf._1, cf._2.asScala.toMap.map(col =>
      (col._1, col._2.asScala.toMap.map(elem => (elem._1.toLong, elem._2))))))

def rowToStrMap(navMap: CFTimeseriesRow): CFTimeseriesRowStr =
  navMap.map(cf =>
    (Bytes.toString(cf._1), cf._2.map(col =>
      (Bytes.toString(col._1), col._2.map(elem => (elem._1, Bytes.toString(elem._2)))))))
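
Putting the pieces together, here is a minimal end-to-end sketch, assuming the imports and helpers above; hBaseRDD and rows are illustrative names, not part of the original answer:

val hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
  classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
  classOf[org.apache.hadoop.hbase.client.Result])

// Row key plus family -> qualifier -> timestamp -> value, all as Strings
val rows = hBaseRDD
  .map(kv => (kv._1.get(), navMapToMap(kv._2.getMap)))
  .map(kv => (Bytes.toString(kv._1), rowToStrMap(kv._2)))
rows.take(3).foreach(println)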


Final point: if you really do want to try to perform random reads in parallel, I believe you might be able to put the HBase table initialization inside the map.
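
A minimal sketch of that idea, assuming the same "test" table and the same HBase client API as the question; it uses mapPartitions rather than map so that one (non-serializable) HTable is created per partition instead of per row:

sc.parallelize(numbers, numbers.length).mapPartitions { gets =>
  // Build the HTable on the executor, so nothing non-serializable is shipped
  val conf = HBaseConfiguration.create()
  val table = new HTable(conf, "test")
  // Materialize results eagerly so the table can be closed safely
  val results = gets.map(table.get).toList
  table.close()
  results.iterator
}.count()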

