Does Google BigQuery/Amazon Redshift use a column-based relational database or a NoSQL database?


Question

I'm still not clear on the difference between a column-based relational database and a column-based NoSQL database.

Google BigQuery supports SQL-like queries, so how can it be NoSQL?

The column-based relational databases I know of are InfoBright, Vertica and Sybase IQ.

The column-based NoSQL databases I know of are Cassandra and HBase.

The following article about Redshift starts by calling it "NoSQL" but ends by saying that it uses PostgreSQL (which is relational): http://nosqlguide.com/column-store/intro-to-amazon-redshift-a-columnar-nosql-database/

Recommended answer

A few things to clarify here, mostly about Google BigQuery.

BigQuery is a hybrid system: it stores data in columns, but it reaches into the NoSQL world with additional features such as the record type and nested fields. You can also have a 2 MB STRING column in which you store a raw document, such as a JSON document. See the other data formats and limits that apply. You can also write user-defined functions in JavaScript; for example, you can paste in a JavaScript library that does NLP.

With all these capabilities for storing data, you can use, for example, the JSON functions to query a document stored in one of the columns. The column can therefore serve as schemaless storage: you did not define the JSON document's structure for that column, you just stored it as JSON. Got it?

A basic example: query the reason key of the meta column, which holds a JSON document, and use the contains operator to count how many users have the word "unsubscribed" in that key:

SELECT 
  SUM(IF(JSON_EXTRACT_SCALAR(meta,'$.reason') contains 'unsubscribed',1,0))  
FROM ...
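
To make the logic of that query concrete, here is a minimal Python sketch of what it computes, run over a few in-memory rows rather than against BigQuery. The sample rows and the function name count_unsubscribed are assumptions made for illustration; only the meta column and the reason key come from the query above.

```python
import json

# Hypothetical rows: each has a `meta` column holding a raw JSON string.
# No structure was declared for these documents, so keys may vary per row.
ROWS = [
    {"user_id": 1, "meta": '{"reason": "user unsubscribed from list"}'},
    {"user_id": 2, "meta": '{"reason": "bounced"}'},
    {"user_id": 3, "meta": '{"source": "import"}'},  # no "reason" key at all
    {"user_id": 4, "meta": '{"reason": "unsubscribed"}'},
]


def count_unsubscribed(rows):
    """Count rows whose JSON `reason` value contains 'unsubscribed'."""
    total = 0
    for row in rows:
        doc = json.loads(row["meta"])
        reason = doc.get("reason")
        # A missing key or a non-string value simply does not match.
        if isinstance(reason, str) and "unsubscribed" in reason:
            total += 1
    return total


print(count_unsubscribed(ROWS))
```

The third row shows the schemaless aspect: a document without the reason key is stored just like the others and is simply skipped by the count.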

On the other hand, there is table wildcard querying, which you need if your rows are spread across many tables. Table wildcard functions are a cost-effective way to query data from a specific set of tables: when you use one, BigQuery only accesses, and charges you for, the tables that match the wildcard. It is therefore advisable to store data in similarly structured tables partitioned by a set time frame, e.g. daily or monthly tables.
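
As a rough illustration of why time-partitioned tables keep the scanned set small, the Python sketch below works out which daily tables a date range would touch. It does not call BigQuery. The events_YYYYMMDD naming scheme, the sample table list, and both function names are assumptions, not something the answer prescribes.

```python
from datetime import date, timedelta


def daily_table_names(prefix, start, end):
    """Return the daily table names covering start..end inclusive."""
    if end < start:
        return []
    days = (end - start).days + 1
    return [
        prefix + (start + timedelta(days=i)).strftime("%Y%m%d")
        for i in range(days)
    ]


def tables_to_scan(existing_tables, prefix, start, end):
    """Keep only the existing tables that fall inside the date range."""
    wanted = set(daily_table_names(prefix, start, end))
    return sorted(t for t in existing_tables if t in wanted)


# Hypothetical dataset with one table per day.
existing = [
    "events_20240101",
    "events_20240102",
    "events_20240103",
    "events_20240201",
]
print(tables_to_scan(existing, "events_", date(2024, 1, 2), date(2024, 1, 3)))
```

Only two of the four tables match the range, which is the same reason a wildcard query over daily tables reads less data than a query over one large table.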

We should not forget that BigQuery is append-only by design, so you cannot update old records; there is no UPDATE language construct (Update: there are now DML constructs for some update/delete operations). Instead, you append a new record, and your queries must be written so that they always work with the latest version of your data. If your system is event-driven, this is very simple, because each event is just appended to BigQuery. But if a user updates their profile, you need to store the profile again; you cannot update the old row. You need a version/date column that tells you which version is the most recent, and your queries first obtain the most recent version of each row and then apply the rest of the logic.

You can use something like over/partition by on that field and keep only the most recent row, seqnum=1.

This returns, from profile, the last email for each user_id, as defined by the most recent entry in the timestamp column:

SELECT email
   FROM
     (SELECT email,
             row_number() over (partition BY user_id
                                ORDER BY TIMESTAMP DESC) seqnum
      FROM [profile]
    )
   WHERE seqnum=1
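
The same "append, then read the latest version" pattern can be sketched in plain Python. This is only an illustration of the idea under assumed data: the sample rows, the integer timestamps, and the function name latest_emails are hypothetical, and the code runs in memory rather than against BigQuery.

```python
def latest_emails(profile_rows):
    """Return {user_id: email}, taken from each user's most recent row."""
    latest = {}
    for row in profile_rows:
        current = latest.get(row["user_id"])
        # Keep the row with the highest timestamp for each user_id,
        # which is what selecting seqnum = 1 does in the query.
        if current is None or row["timestamp"] > current["timestamp"]:
            latest[row["user_id"]] = row
    return {uid: row["email"] for uid, row in latest.items()}


# Append-only log: user 1 changed their email, so a second row exists.
PROFILE = [
    {"user_id": 1, "email": "old@example.com", "timestamp": 100},
    {"user_id": 2, "email": "b@example.com", "timestamp": 150},
    {"user_id": 1, "email": "new@example.com", "timestamp": 200},
]
print(latest_emails(PROFILE))
```

The old row for user 1 is never modified or deleted; it is ignored at read time because a newer row exists for the same user_id.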
