Spark To Cassandra: Writing Sparse Rows With No Null Values To Cassandra


Problem description


Q: How do I write only the columns that have values from a Spark DataFrame into Cassandra, and do this efficiently? (efficiently as in minimal lines of Scala code, not creating a bunch of tombstones in Cassandra, having it run quickly, etc.)


I have a Cassandra table with two key columns and 300 potential descriptor values.

create table sample (
    key1   text,
    key2   text,
    "0"    text,
    ............
    "299"  text,
    PRIMARY KEY (key1, key2)
);


I have a Spark dataframe that matches the underlying table but each row in the dataframe is very sparse - other than the two key values, a particular row may have only 4 to 5 of the "descriptors" (columns 0->299) with a value.
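To make the shape of the data concrete, here is a minimal plain-Scala sketch of such a sparse row (independent of Spark and the connector; the names and values are illustrative):

```scala
// A sparse row: two key columns plus a map holding only the descriptors that are set.
case class SparseRow(key1: String, key2: String, descriptors: Map[String, String])

// All 300 descriptor columns ("0" to "299") exist in Cassandra,
// but a typical row has values for only 4 to 5 of them.
val row = SparseRow("XYZ", "10", Map("0" -> "49849", "3" -> "F", "25" -> "TO11142017_Import"))

// Only these columns would need to be written; the other ~297 should stay untouched.
val columnsToWrite = Seq("key1", "key2") ++ row.descriptors.keys
```

The question is how to get the connector to write only such per-row subsets without generating tombstones for all the missing columns.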


I am currently converting the Spark dataframe to an RDD and using saveToCassandra on that RDD to write the data.


This works, but "null" is stored in columns when there is no value.

For example:

  val saveRdd = sample.rdd

  saveRdd.map(line => (
    line(0), line(1), line(2),
    line(3), line(4), line(5),
    line(6), line(7), line(8),
    line(9), line(10), line(11),
    line(12), line(13), line(14),
    line(15), line(16), line(17),
    line(18), line(19), line(20))).saveToCassandra..........


Creates this in Cassandra:


XYZ | 10 | 49849 | F | | null | null | null | null | null | null | null | null | null | null | | null | null | null | null | null | null | null | null | null | null | TO11142017_Import | null | null | null | null | null | null | null | null | null | null | 20 | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | Scott Dick-Peddie | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | 7/13/2014 0:00 | null | null | null | null | null | null | null | null | null | null | 0 | null | null | null | null | null | null | null | null | null | null | | null | null | null | null | null | null | null | null | null | null | 8 | null | null | null | null | null | null | null | null | null | null | | null | null | null | null | null | null | null | null | null | null | LOCATIONS | null | null | null | 
null | null | null | null | null | null | null | LOCATIONS | null | null | null | null | null | null | null | null | null | null


Setting spark.cassandra.output.ignoreNulls on SparkSession does not work:

spark.conf.set("spark.cassandra.output.ignoreNulls", "true")
spark.conf.get("spark.cassandra.output.ignoreNulls")

This does not work either:

spark-shell  --conf spark.cassandra.output.ignoreNulls=true


(I tried different ways to set this, and it doesn't seem to work any way I set it.)


withColumn and filter do not seem to be appropriate solutions. An unset concept might be the right thing, but I'm not sure how to use that in this case.
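For what it's worth, the connector does expose an unset concept directly: com.datastax.spark.connector.types.CassandraOption, whose Unset value skips the cell entirely on write (no tombstone), while Null writes an explicit null. A sketch, assuming a spark-shell with the connector on the classpath and a hypothetical four-column table test.t3 (id, t1, t2, t3):

```scala
import com.datastax.spark.connector._
import com.datastax.spark.connector.types.CassandraOption

// Unset leaves the cell untouched; Value("x") writes it; Null would write a tombstone.
case class T3Unset(id: Int, t1: CassandraOption[String], t2: CassandraOption[String], t3: CassandraOption[String])
val rdd = sc.parallelize(Seq(T3Unset(1, CassandraOption.Unset, CassandraOption.Value("t2"), CassandraOption.Unset)))
rdd.saveToCassandra("test", "t3")
```

There is also, if I recall the API correctly, a CassandraOption.unsetIfNone helper to convert a plain Option into this form.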

cassandra 3.11.2

spark-cassandra-connector:2.3.0-s_2.11

spark 2.2.0.2.6.3.0-235

Thanks!

Answer


Are you sure that ignoreNulls doesn't work for you? Cassandra outputs null when there is no value in a given cell. You can check whether the data was really written into the SSTable using the sstabledump tool - you'll definitely see the cells with deletion information attached (that's how nulls are stored).


Here is an example of running Spark without ignoreNulls (the default), and with ignoreNulls set to true. Testing was done on DSE 5.1.11, which has an older version of the connector, but one matching Cassandra 3.11.


Let's create a test table like this:

create table test.t3 (id int primary key, t1 text, t2 text, t3 text);


Without ignoreNulls, we need the following code for testing:

import com.datastax.spark.connector._

case class T3(id: Int, t1: Option[String], t2: Option[String], t3: Option[String])
val rdd = sc.parallelize(Seq(T3(1, None, Some("t2"), None)))
rdd.saveToCassandra("test", "t3")


If we look at the data using cqlsh, we will see the following:

cqlsh:test> SELECT * from test.t3;

 id | t1   | t2 | t3
----+------+----+------
  1 | null | t2 | null

(1 rows)


After doing nodetool flush we can look at the SSTables. This is what we'll see:

>sstabledump mc-1-big-Data.db
[
  {
    "partition" : {
      "key" : [ "1" ],
      "position" : 0
    },
    "rows" : [
      {
        "type" : "row",
        "position" : 30,
        "liveness_info" : { "tstamp" : "2018-11-06T07:53:38.418171Z" },
        "cells" : [
          { "name" : "t1", "deletion_info" : { "local_delete_time" : "2018-11-06T07:53:38Z" }
          },
          { "name" : "t2", "value" : "t2" },
          { "name" : "t3", "deletion_info" : { "local_delete_time" : "2018-11-06T07:53:38Z" }
          }
        ]
      }
    ]
  }
]


You can see that for the columns t1 & t3, which were null, there is a deletion_info field.


Now, let's remove the data with TRUNCATE test.t3, and start spark-shell again with ignoreNulls set to true:

dse spark --conf spark.cassandra.output.ignoreNulls=true


After executing the same Spark code, we'll see the same results in cqlsh:

cqlsh:test> SELECT * from test.t3;

 id | t1   | t2 | t3
----+------+----+------
  1 | null | t2 | null


But after performing a flush, sstabledump shows a completely different picture:

>sstabledump mc-3-big-Data.db
[
  {
    "partition" : {
      "key" : [ "1" ],
      "position" : 0
    },
    "rows" : [
      {
        "type" : "row",
        "position" : 27,
        "liveness_info" : { "tstamp" : "2018-11-06T07:56:27.035600Z" },
        "cells" : [
          { "name" : "t2", "value" : "t2" }
        ]
      }
    ]
  }
]


As you can see, we have data only for column t2, and no mention of columns t1 & t3, which were null.
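If the session-wide setting still doesn't take effect (spark.conf.set at runtime is not necessarily picked up by the connector), ignoreNulls can also be passed per write via WriteConf - a sketch assuming spark-cassandra-connector 2.x and the rdd from the example above:

```scala
import com.datastax.spark.connector._
import com.datastax.spark.connector.writer.WriteConf

// Pass ignoreNulls explicitly for this single write instead of relying on
// the spark.cassandra.output.ignoreNulls session setting.
rdd.saveToCassandra("test", "t3", writeConf = WriteConf(ignoreNulls = true))
```

This scopes the behavior to one save call, which can be handy when some writes should keep explicit nulls.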
