Performance degradation with Datastax Cassandra when using multiple map types in a table


Problem description

I have the following table with five map-type collections. Each collection holds at most 12 elements, and each element is at most 50 bytes.


CREATE TABLE persons (
  treeid int,
  personid bigint,
  birthdate text,
  birthplace text,
  clientnote text,
  clientnoteisprivate boolean,
  confidence int,
  connections map<int, bigint>,
  createddate timestamp,
  deathdate text,
  deathplace text,
  familyrelations map<text, text>,
  flags int,
  gender text,
  givenname text,
  identifiers map<int, text>,
  issues int,
  media map<uuid, int>,
  mergedpersonas map<int, bigint>,
  note text,
  primaryphotoid uuid,
  quality int,
  suffix text,
  surname text,
  userid uuid,
  vitalstatus int,
  PRIMARY KEY (treeid, personid)
);

Here my partition key is treeid and the unique key is personid. I am trying to insert records into this table from a .NET application using the DataStax .NET driver. I have about 200K records to insert, and performance degrades (beyond 200 ms/op) as the number of inserted records increases. In OpsCenter I see that the ParNew garbage collection time also increases (beyond 20 ms) as the records are being inserted.


If I recreate the table with a different primary key, PRIMARY KEY (personid, treeid), and do the same inserts, the performance is much better (below 1 ms/op) and the garbage collection time stays well under 1 ms.


Why does the partition key make such a difference in performance here? I have other tables with (treeid, personid) as the primary key but without any map data types, and they perform very well. I want to query by treeid; how should I do that? Should I move the maps into separate tables, or create a secondary index on the treeid column? Which is more efficient for reading?

I am using DataStax Enterprise 4.0.1 (Cassandra 2.0.5). I have a three-node cluster on CentOS 6.4 with a replication factor of 3.

Solution

The first column mentioned in the primary key is known as the partition key. Any additional columns mentioned in the primary key are known as the clustering columns. All of the rows that share a given partition key are stored as a single Cassandra partition (guaranteed to be together on a single node) - what used to be known as a "wide row". So each treeid refers to a single partition, with each personid being a row within that partition.

How many treeids do you have? If you have a small number of tree ids and a large number of persons, that results in a very small number of Cassandra partitions, each with a large number of rows (a traditional wide row).
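To make the effect of the partition key concrete, here is a minimal Python sketch that only counts how rows would be grouped into partitions. It does not talk to Cassandra; the data set (4 trees, 200,000 persons spread evenly) and the helper name `partition_sizes` are assumptions made for illustration.

```python
from collections import Counter

def partition_sizes(rows, partition_key):
    """Count how many rows fall into each partition for the given key columns."""
    return Counter(tuple(row[col] for col in partition_key) for row in rows)

# Hypothetical data: 200,000 persons spread evenly over 4 trees.
rows = [{"treeid": pid % 4, "personid": pid} for pid in range(200_000)]

# PRIMARY KEY (treeid, personid): treeid alone is the partition key.
by_tree = partition_sizes(rows, ("treeid",))

# PRIMARY KEY (personid, treeid): personid alone is the partition key.
by_person = partition_sizes(rows, ("personid",))

print(len(by_tree), max(by_tree.values()))      # 4 partitions, 50000 rows each
print(len(by_person), max(by_person.values()))  # 200000 partitions, 1 row each
```

With treeid as the partition key, every insert for a tree lands in the same ever-growing partition; with personid, each insert lands in its own small partition.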

Feel free to elaborate on what you are trying to do with the tree id, but on the surface it sounds as if person id is the better choice for the partition key.

Or maybe you really want a "composite partition key":

PRIMARY KEY ((treeid, personid))

That way, the combination of tree id and person id is used to distinguish partitions.
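One consequence of a composite partition key is worth spelling out, since the question asks about querying by treeid: a partition can only be located directly when every partition key column is known. The Python sketch below models partitions as a dictionary keyed by (treeid, personid). It is an illustration of the grouping only, not driver code, and the sample rows and the helper name `build_partitions` are assumptions.

```python
from collections import defaultdict

def build_partitions(rows, partition_key):
    """Group rows into partitions keyed by the tuple of partition-key columns."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[tuple(row[col] for col in partition_key)].append(row)
    return partitions

# Hypothetical data: 2 trees with 3 persons each.
rows = [{"treeid": t, "personid": p} for t in (1, 2) for p in range(3)]

# PRIMARY KEY ((treeid, personid)): both columns form the partition key.
composite = build_partitions(rows, ("treeid", "personid"))

# With the full key, the partition is found directly.
direct = composite[(1, 2)]

# With treeid alone, every partition has to be inspected.
scanned = [row for key, part in composite.items() if key[0] == 1 for row in part]

print(len(composite), len(direct), len(scanned))
```

So the composite key spreads writes evenly, but reading "all persons in a tree" then needs some other access path, such as a separate lookup table or an index.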

It depends on how you really want to organize your data.
