Column-family concept and data model

Question

I'm investigating the different types of NoSQL databases and I'm trying to wrap my head around the data model of column-family stores, such as Bigtable, HBase and Cassandra.

Some people describe a column family as a collection of rows, where each row contains columns [1], [2]. An example of this model (column families are in uppercase):

{
  "USER":
  {
    "codinghorror": { "name": "Jeff", "blog": "http://codinghorror.com/" },
    "jonskeet": { "name": "Jon Skeet", "email": "jskeet@site.com" }
  },
  "BOOKMARK":
  {
    "codinghorror":
    {
      "http://codinghorror.com/": "My awesome blog",
      "http://unicorns.com/": "Weaponized ponies"
    },
    "jonskeet":
    {
      "http://msmvps.com/blogs/jon_skeet/": "Coding Blog",
      "http://manning.com/skeet2/": "C# in Depth, Second Edition"
    }
  }
}

Second model

Other sites describe a column family as a group of related columns within a row [3], [4]. Data from the previous example, modeled in this fashion:

{
  "codinghorror":
  {
    "USER": { "name": "Jeff", "blog": "http://codinghorror.com/" },
    "BOOKMARK":
    {
      "http://codinghorror.com/": "My awesome blog",
      "http://unicorns.com/": "Weaponized ponies"
    }
  },
  "jonskeet":
  {
    "USER": { "name": "Jon Skeet", "email": "jskeet@site.com" },
    "BOOKMARK":
    {
      "http://msmvps.com/blogs/jon_skeet/": "Coding Blog",
      "http://manning.com/skeet2/": "C# in Depth, Second Edition"
    }
  }
}

A possible rationale behind the first model is that not all column families are related the way USER and BOOKMARK are. This implies that not all column families contain identical keys. From this point of view, placing the column families at the outer level feels more natural.

The name 'column family' implies a group of columns. This is exactly how column families are presented in the second model.

Both models are valid representations of the data. I realize that these representations are solely for communicating the data to humans; applications don't 'think' of data in such a way.
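To make the point that both layouts carry the same information, here is a minimal Python sketch that converts one into the other by swapping the two outer nesting levels. The function name `swap_levels` and the variable names are illustrative only; this says nothing about how Bigtable, HBase or Cassandra store data internally.

```python
# Model 1 from the question: column family -> row key -> columns.
by_family = {
    "USER": {
        "codinghorror": {"name": "Jeff", "blog": "http://codinghorror.com/"},
        "jonskeet": {"name": "Jon Skeet", "email": "jskeet@site.com"},
    },
    "BOOKMARK": {
        "codinghorror": {
            "http://codinghorror.com/": "My awesome blog",
            "http://unicorns.com/": "Weaponized ponies",
        },
        "jonskeet": {
            "http://msmvps.com/blogs/jon_skeet/": "Coding Blog",
            "http://manning.com/skeet2/": "C# in Depth, Second Edition",
        },
    },
}


def swap_levels(nested):
    """Swap the two outer levels of a dict-of-dicts-of-dicts (hypothetical helper)."""
    swapped = {}
    for outer_key, inner_map in nested.items():
        for inner_key, columns in inner_map.items():
            # Copy the innermost dict so the two views do not share state.
            swapped.setdefault(inner_key, {})[outer_key] = dict(columns)
    return swapped


# Model 2 from the question: row key -> column family -> columns.
by_row = swap_levels(by_family)

# Swapping again recovers model 1, so no information is lost either way.
round_trip = swap_levels(by_row)
```

Because the transformation is reversible for this data, the choice between the two models is one of presentation, not of content.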

What is the 'standard' definition of a column family? Is it a collection of rows, or a group of related columns within a row?

I have to write a paper on the subject, so I'm also interested in how people usually explain the 'column family' concept to other people. Both of these models seem to contradict each other. I'd like to use the 'correct' or generally accepted model to describe column-family stores.

I have settled on the second model for explaining the data model in my paper. I'm still interested in how you explain the data model of column-family stores to other people.

Recommended answer

The Cassandra database follows your first model, I think. A ColumnFamily is a collection of rows, each of which can contain any columns, in a sparse fashion (so each row can have a different collection of column names, if desired). The number of columns allowed in a row is almost unlimited (2 billion in Cassandra v0.7).

A key point is that row keys must be unique within a column family, by definition - but can be re-used in other column families. So you can store unrelated data about the same key in different ColumnFamilies.
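As a rough illustration of that rule, here is a toy in-memory store in Python. `ToyStore`, `put` and `get` are hypothetical names invented for this sketch and are not Cassandra's API; the only thing it models is that a row is identified by the pair (column family, row key).

```python
class ToyStore:
    """Hypothetical in-memory stand-in for a column-family store."""

    def __init__(self):
        # column family name -> row key -> {column name: value}
        self._families = {}

    def put(self, family, row_key, columns):
        # A row key identifies exactly one row per column family, so writing
        # to an existing key adds columns to that row instead of creating
        # a second row.
        row = self._families.setdefault(family, {}).setdefault(row_key, {})
        row.update(columns)

    def get(self, family, row_key):
        # Return a copy; a missing family or row yields an empty dict.
        return dict(self._families.get(family, {}).get(row_key, {}))


store = ToyStore()

# Two writes to the same key in USER end up in one sparse row.
store.put("USER", "jonskeet", {"name": "Jon Skeet"})
store.put("USER", "jonskeet", {"email": "jskeet@site.com"})

# The same key is reused in BOOKMARK for unrelated data.
store.put("BOOKMARK", "jonskeet",
          {"http://manning.com/skeet2/": "C# in Depth, Second Edition"})

# Another USER row with a different set of column names.
store.put("USER", "codinghorror",
          {"name": "Jeff", "blog": "http://codinghorror.com/"})

user_row = store.get("USER", "jonskeet")
bookmark_row = store.get("BOOKMARK", "jonskeet")
```

Here `user_row` and `bookmark_row` share the key "jonskeet" but are separate rows, and the two USER rows carry different column names.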

In Cassandra this matters because the data in a particular column family is stored in the same files on disk - so it is more efficient to place data items that are likely to be retrieved together in the same ColumnFamily. This is partly a practical speed concern, but also a matter of organising your data into a clear schema. This touches upon your second definition - one might consider all the data about a particular key to be a "row", partitioned by ColumnFamily. However, in Cassandra it is not really a single row, because for the same row key the data in one ColumnFamily can be changed independently of the data in other ColumnFamilies.
