Counter a better choice for uniqueness?


Problem description


I currently have the following table layout for a basic user event table:

CREATE TABLE IF NOT EXISTS events.events_by_user(
    user text,
    added_week int,
    added_timestamp timestamp,
    event text,
    uuid uuid,
    PRIMARY KEY((user, added_week), added_timestamp, event, uuid))
WITH CLUSTERING ORDER BY (added_timestamp DESC);

Thus uniqueness is basically guaranteed by the uuid as the last column of the primary key. There is a chance that several identical events for the same user occur in the same millisecond (timestamp).

Another approach might be (if I am not mistaken) to drop the uuid column and replace it with a counter column, like this:

CREATE TABLE IF NOT EXISTS events.events_by_user(
    user text,
    added_week int,
    added_timestamp timestamp,
    event text,
    frequency counter,
    PRIMARY KEY((user, added_week), added_timestamp, event))
WITH CLUSTERING ORDER BY (added_timestamp DESC);

My thinking is that I could save some space by using this counter, and my rows would not grow as wide. I am not sure, though, whether maintaining this counter has other performance implications, or whether there are other reasons why this might not be a good idea.

Solution

Why would you use a counter to save space? The C* design idiom is to use space to gain efficiency.

Back to your question: counters are very limiting. They must live in their own tables, where you can have as many primary key columns as you want, but every other column has to be a counter. They support only increment and decrement operations, so counter updates are not idempotent and cannot be safely retried. You also have to be able to live with inaccuracies in the "counted" value (over- and under-counting is a well-known problem, even though C* 2.1+ mitigated it a bit).

That means a column such as event can only appear in a counter table as part of the primary key; a design with any other non-counter column is not valid.
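The counter restrictions described above can be sketched in CQL. This is only an illustration, not a recommended design: the table name event_counts_by_user and the sample values are hypothetical, and the sketch assumes every non-counter column is placed in the primary key.

```cql
-- Hypothetical counter table: apart from the primary key columns,
-- only counter columns are allowed.
CREATE TABLE IF NOT EXISTS events.event_counts_by_user (
    user text,
    added_week int,
    added_timestamp timestamp,
    event text,
    frequency counter,
    PRIMARY KEY ((user, added_week), added_timestamp, event)
) WITH CLUSTERING ORDER BY (added_timestamp DESC, event ASC);

-- Counters cannot be written with INSERT; they are only incremented
-- or decremented through UPDATE. Retrying this statement after a
-- timeout may count the same event twice (it is not idempotent).
UPDATE events.event_counts_by_user
SET frequency = frequency + 1
WHERE user = 'alice'
  AND added_week = 12
  AND added_timestamp = '2015-03-20 10:15:30.123+0000'
  AND event = 'login';
```

Note that all identical events in the same millisecond collapse into a single row here, so the individual events are no longer stored, only how many of them were counted.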

Back to your uniqueness requirements, you could use the timeuuid column type. These are time-based Type 1 UUIDs and provide a reasonably low collision probability. From the Cassandra wiki:

A Type 1 UUID consists of the following:

  • A timestamp consisting of a count of 100-nanosecond intervals since 00:00:00.00, 15 October 1582 (the date of Gregorian reform to the Christian calendar).

  • A version (which should have a value of 1).

  • A variant (which should have a value of 2).

  • A sequence number, which can be a counter or a pseudo-random number.

  • A "node" which will be the machine's MAC address (which should make the UUID unique across machines).

The challenge with a UUID is to make it be unique for multiple processes running on a single machine and multiple threads running in a single process. The Type 1 UUID as specified above does neither. On a fast machine with multiple cores it is quite possible to have a UUID generated with the same time value. This can be remedied only if the sequence number can span threads and processes, something that is quite challenging to do efficiently.

The Time Based UUID referenced compensates for these issues by:

  • Only using the normal millisecond granularity returned by System.currentTimeMillis() and adjusting it to pretend to contain 100 ns counts.

  • Incrementing the time by 1 (in a non-threadsafe manner) whenever a duplicate time value is encountered.

  • Using a pseudo-random number associated with the UUID Class for the sequence number. Incrementing the time by 1 allows multiple threads to uniquely create up to 10,000 UUIDs in the same millisecond in the same process. Using a pseudo-random number for the sequence number provides a 1 in a 16,384 chance that each UUID Class will have a unique id.

These mechanisms provide a reasonable probability that the generated UUIDs will be unique. However, the issues to be aware of are:

  • The computer is capable of generating more than 10,000 UUIDs per millisecond.

  • Applications creating UUIDs on different threads could get duplicates since the time is not incremented in a thread-safe manner.

  • More than one instance of the Class is in the VM in different Class Loaders - this will be mitigated by each Class having its own sequence number.

  • There is no guarantee that two instances of a UUID in the same or different VMs will have a different sequence number - just a reasonable probability that they will.

In practice, C* will already do what you want. However, if you really fear that you'll end up with duplicates, then you need to do proper counting yourself, and I'd suggest implementing that at the application level.
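As a rough sketch of the timeuuid suggestion, the timestamp and uuid columns from the question could be merged into one timeuuid clustering column. The table name events_by_user_timeuuid, the column name event_id, and the sample values are assumptions made for illustration. toTimestamp() is assumed to be available (Cassandra 2.2 or later; older versions offer dateOf() instead).

```cql
-- Hypothetical variant of the question's table: one timeuuid column
-- carries both the event time and the uniqueness.
CREATE TABLE IF NOT EXISTS events.events_by_user_timeuuid (
    user text,
    added_week int,
    event_id timeuuid,
    event text,
    PRIMARY KEY ((user, added_week), event_id)
) WITH CLUSTERING ORDER BY (event_id DESC);

-- now() generates a new timeuuid on the coordinator node.
INSERT INTO events.events_by_user_timeuuid (user, added_week, event_id, event)
VALUES ('alice', 12, now(), 'login');

-- The event time can be recovered from the timeuuid when reading.
SELECT toTimestamp(event_id) AS added_timestamp, event
FROM events.events_by_user_timeuuid
WHERE user = 'alice' AND added_week = 12;
```

Rows still come back newest first within a partition, because timeuuid values are ordered by their embedded timestamp.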
