Structuring a Cassandra database

Question

I don't understand one thing about Cassandra. Say I have a website similar to Facebook, where people can share, like, comment, upload images and so on.

Now, let's say there are events such as:

  • username1 liked your comment
  • username2 updated his profile picture

and so on.

So after a lot of reading, I guess what I would need to do is create a new column family for each single thing, for example: user_likes, user_comments, user_shares. Basically, anything you can think of. And even after I do that, would I still need to create secondary indexes for most of the columns just so I could search the data? And even so, how would I know which users are my friends? Would I need to first get all of my friends' IDs and then search through all of those column families for each user ID?

EDIT: OK, so I did some more reading and now I understand things a little better, but I still can't really figure out how to structure my tables. So I will set a bounty, and I want to get a clear example of what my tables should look like if I want to store and retrieve data in this kind of order:


  • All
  • Likes
  • Comments
  • Favourites
  • Downloads
  • Shares
  • Messages

So let's say I want to retrieve the last ten uploaded files of all my friends, or of the people I follow. This is how it would look:

John uploaded song AC/DC - Back in Black 10 minutes ago

And everything else, like comments and shares, would be similar to that...

Now, probably the biggest challenge would be to retrieve the last 10 things of all categories together, so the list would be a mix of all the things...

Now, I don't need an answer with fully detailed tables. I just need a really clear example of how I would structure and retrieve data, like I would do in MySQL with joins.

Recommended answer

With SQL, you structure your tables to normalize your data, and use indexes and joins to query. With Cassandra, you can't do that, so you structure your tables to serve your queries, which requires denormalization.

You want to query items which your friends uploaded. One way to do this is to have a single row per user, and write to this row whenever a friend of that user uploads something.

friendUploads { # column family
    userid { # row key
        timestamp-upload-id : null # column name : no value
    }
}

For example:

friendUploads {
    userA {
         12313-upload5 : null
         12512-upload6 : null
         13512-upload8 : null
    }
}

friendUploads {
    userB {
         11313-upload3 : null
         12512-upload6 : null
    }
}

Note that upload6 is duplicated as a column in two different rows, because whoever uploaded it is a friend of both user A and user B.
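The write path for this scheme can be sketched in Python. This is only an in-memory illustration, not Cassandra client code: the `friend_uploads` dict stands in for the column family, and `friends_of`, `column_name` and `record_upload` are hypothetical names. Zero-padding the timestamp is an added assumption so that string order matches numeric order.

```python
from collections import defaultdict

# Hypothetical in-memory stand-in for the friendUploads column family:
# row key (user id) -> {column name: value}
friend_uploads = defaultdict(dict)

# Hypothetical friend lists: uploader -> users who should see the upload.
friends_of = {
    "userC": ["userA", "userB"],
    "userD": ["userA"],
}

def column_name(timestamp, upload_id):
    # Assumption: zero-pad so that string order matches numeric order.
    return f"{timestamp:010d}-{upload_id}"

def record_upload(uploader, timestamp, upload_id):
    """Fan out on write: add one column to the row of each friend."""
    name = column_name(timestamp, upload_id)
    recipients = friends_of.get(uploader, [])
    for friend in recipients:
        friend_uploads[friend][name] = None  # the column name carries the data
    return len(recipients)  # number of writes performed

writes = record_upload("userC", 12512, "upload6")
record_upload("userD", 12313, "upload5")
```

Here `userC` has two friends, so a single upload costs two writes, and upload6 ends up in the rows of both `userA` and `userB`.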

Now, to display the friends' uploads for a user, do a getSlice with a limit of 10 on that user's row. This will return the first 10 items, sorted by column name.

To put newest items first, use a reverse comparator that sorts larger timestamps before smaller timestamps.
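As a rough illustration of the read side, the following Python sketch returns the newest entries of one row. In Cassandra the ordering is applied by the comparator when the data is stored, so the sort below only imitates the effect in application code. `newest_first` and `user_a_row` are hypothetical names, and the row contents are copied from the userA example above.

```python
def newest_first(row, limit=10):
    """Return up to `limit` column names, largest timestamp first."""
    def timestamp_of(name):
        # Column names look like "<timestamp>-<upload id>".
        return int(name.split("-", 1)[0])
    return sorted(row, key=timestamp_of, reverse=True)[:limit]

# Hypothetical row for userA, taken from the example above.
user_a_row = {
    "12313-upload5": None,
    "12512-upload6": None,
    "13512-upload8": None,
}

latest = newest_first(user_a_row, limit=2)
# latest == ["13512-upload8", "12512-upload6"]
```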

The drawback of this scheme is that when user A uploads a song, you have to do N writes to update the friendUploads rows, where N is the number of people who are friends of user A.

For the value associated with each timestamp-upload-id key, you can store enough information to display the results (probably in a JSON blob), or you can store nothing and fetch the upload information using the upload ID.
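The two options for the column value can be compared with a small Python sketch. It uses a plain dict as a stand-in for one row, and `write_column`, `read_column` and the `uploads` lookup table are hypothetical names introduced only for illustration.

```python
import json

# Hypothetical row in friendUploads for one user.
row = {}

def write_column(row, timestamp, upload_id, details=None):
    """Store display details as a JSON string, or None to store nothing."""
    name = f"{timestamp}-{upload_id}"
    row[name] = json.dumps(details) if details is not None else None

def read_column(row, name, fetch_upload):
    """Use the stored blob if present, otherwise look the upload up by id."""
    value = row[name]
    if value is not None:
        return json.loads(value)
    upload_id = name.split("-", 1)[1]
    return fetch_upload(upload_id)

# Hypothetical uploads table used for the second lookup.
uploads = {"upload6": {"user": "John", "title": "AC/DC - Back in Black"}}

write_column(row, 12512, "upload6", uploads["upload6"])  # denormalized copy
write_column(row, 13512, "upload6")                      # column name only

from_blob = read_column(row, "12512-upload6", uploads.get)
from_lookup = read_column(row, "13512-upload6", uploads.get)
```

Both reads return the same details. The blob saves a second lookup per item at the cost of larger, duplicated writes.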

To avoid duplicating writes, you can use a structure like this:

userUploads { # column family
    userid { # row key
        timestamp-upload-id : null # column name : no value
    }
}

This stores the uploads for a particular user. Now, when you want to display the uploads of user B's friends, you have to do N queries, one for each friend of user B, and merge the results in your application. This is slower to query, but faster to write.
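The merge step for this second scheme might look like the Python sketch below. It assumes each friend's row is already returned newest first, and it models the N queries as dict lookups. `user_uploads` and `friends_feed` are hypothetical names, and the sample data is made up for illustration.

```python
import heapq
from itertools import islice

# Hypothetical userUploads rows: user id -> list of (timestamp, upload id),
# each list already ordered newest first.
user_uploads = {
    "userC": [(12512, "upload6"), (11313, "upload3")],
    "userD": [(13512, "upload8"), (12313, "upload5")],
    "userE": [],
}

def friends_feed(friend_ids, limit=10):
    """Run one query per friend, then merge the sorted results."""
    # At most `limit` items are needed from any single friend.
    per_friend = [user_uploads.get(f, [])[:limit] for f in friend_ids]
    # reverse=True expects inputs sorted from largest to smallest.
    merged = heapq.merge(*per_friend, reverse=True)
    return list(islice(merged, limit))

feed = friends_feed(["userC", "userD", "userE"], limit=3)
# feed == [(13512, "upload8"), (12512, "upload6"), (12313, "upload5")]
```

The cost moves from write time to read time: every page view runs one query per friend before the merge.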

Most likely, if users can have thousands of friends, you would use the first scheme, and do more writes rather than more queries, as you can do the writes in the background after the user uploads, but the queries have to happen while the user is waiting.

As an example of denormalization, look at how many writes Twitter's Rainbird does when a single click occurs. Each write is used to support a single query.
