Does Google BigQuery / Amazon Redshift use a column-based relational database or a NoSQL database?


Question

I'm still not clear on the difference between a column-based relational database and a column-based NoSQL database.

Google BigQuery supports SQL-like queries, so how can it be NoSQL?

The column-based relational databases I know of are InfoBright, Vertica and Sybase IQ.

The column-based NoSQL databases I know of are Cassandra and HBase.

The following article about Redshift starts by calling it "NoSQL" but ends by saying that PostgreSQL (which is relational) is used: http://nosqlguide.com/column-store/intro-to-amazon-redshift-a-columnar-nosql-database/

Solution

A few things to clarify here, mostly about Google BigQuery.

BigQuery is a hybrid system: it stores data in columns, but it reaches into the NoSQL world with additional features such as the record type and nested fields. You can also have a 2 MB STRING column in which you store a raw document, such as a JSON document. See the other data formats and limits that apply. You can also write user-defined functions in JavaScript; for example, you can paste in a JavaScript library that does NLP.

With all these ways to store data, you can use the JSON functions, for example, to query a document stored in one of the columns. This lets you use BigQuery as schemaless storage: you never define the structure of the JSON document for that column, you just store it as JSON.

A basic example: query the reason key of the meta column, which holds a JSON document, and use the contains construct to count how many users have the word "unsubscribed" in that key:

SELECT 
  SUM(IF(JSON_EXTRACT_SCALAR(meta,'$.reason') contains 'unsubscribed',1,0))  
FROM ...
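
To make the logic of that query concrete, here is a minimal Python sketch of the same computation done outside BigQuery. The sample rows and the `count_reason_matches` helper are hypothetical and only illustrate the idea; the `meta` column and its `reason` key are the only names taken from the example above.

```python
import json


def count_reason_matches(rows, needle="unsubscribed"):
    """Count rows whose JSON `meta` document has a `reason` containing `needle`."""
    total = 0
    for row in rows:
        try:
            doc = json.loads(row["meta"])
        except (TypeError, ValueError):
            continue  # not valid JSON, so there is nothing to extract
        reason = doc.get("reason") if isinstance(doc, dict) else None
        if isinstance(reason, str) and needle in reason:
            total += 1
    return total


# Hypothetical rows: no structure is declared for the documents inside `meta`.
rows = [
    {"meta": '{"reason": "user unsubscribed from newsletter"}'},
    {"meta": '{"reason": "bounced"}'},
    {"meta": '{"source": "import"}'},
    {"meta": "not json"},
]
print(count_reason_matches(rows))
```

Rows without a `reason` key, or with content that is not JSON at all, are simply skipped, which is what makes the column usable as schemaless storage.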

On the other hand, you have table wildcard querying, which you need when your rows are spread across many tables. Table wildcard functions are a cost-effective way to query data from a specific set of tables: when you use one, BigQuery only accesses, and charges you for, the tables that match the wildcard. It is therefore advisable to store data in tables of the same structure, split per time frame, e.g. daily or monthly tables.
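
As a rough illustration of why date-sharded tables keep costs down, the Python sketch below mimics what a date-range wildcard function (legacy SQL has `TABLE_DATE_RANGE`, for instance) does conceptually: it narrows a dataset down to the tables whose name suffix falls inside a date range. The table names, the `events_` prefix and the `tables_in_range` helper are assumptions made up for this example, not part of BigQuery.

```python
from datetime import date, timedelta


def tables_in_range(all_tables, prefix, start, end):
    """Return the daily tables named <prefix>YYYYMMDD whose date lies in [start, end]."""
    wanted = set()
    day = start
    while day <= end:
        wanted.add(prefix + day.strftime("%Y%m%d"))
        day += timedelta(days=1)
    return sorted(t for t in all_tables if t in wanted)


# Hypothetical dataset with one table per day plus an unrelated table.
all_tables = [
    "events_20240101",
    "events_20240102",
    "events_20240103",
    "events_20240201",
    "users",
]
selected = tables_in_range(all_tables, "events_", date(2024, 1, 2), date(2024, 1, 3))
print(selected)
```

Only the tables in `selected` would be read, so the other daily tables add nothing to the cost of the query.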

We should not forget that BigQuery is append-only by design, so you cannot update old records; there is no UPDATE language construct (Update: there are now DML language constructs for some update/delete operations). Instead, you append a new record, and your queries must be written so that they always work with the latest version of your data. If your system is event-driven, this is very simple, because each event is just appended to BigQuery. But if a user updates their profile, you need to store the profile again; you cannot update the old row. You need a version or date column that tells you which version is the most recent, and your queries must first obtain the most recent version of each row and then deal with the logic.

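You can use a window function, something like row_number() over (partition by ...) on that field, and keep only the most recent row, seqnum = 1.

The following query returns, from the profile table, the latest email for each user_id, where "latest" means the entry with the most recent value in the timestamp column.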

SELECT email
FROM
  (SELECT email,
          ROW_NUMBER() OVER (PARTITION BY user_id
                             ORDER BY timestamp DESC) seqnum
   FROM [profile]
  )
WHERE seqnum = 1
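
The same "append, then read the latest version" pattern can be sketched in plain Python to show what the window function is doing. The sample `profile` rows and the `latest_email_per_user` helper are hypothetical; the column names `user_id`, `email` and `timestamp` follow the query above.

```python
def latest_email_per_user(rows):
    """Keep, for each user_id, the email of the row with the greatest timestamp."""
    latest = {}
    for row in rows:
        current = latest.get(row["user_id"])
        # On equal timestamps the row seen first is kept; the SQL version
        # would pick one of the tied rows arbitrarily.
        if current is None or row["timestamp"] > current["timestamp"]:
            latest[row["user_id"]] = row
    return {user_id: row["email"] for user_id, row in latest.items()}


# Hypothetical append-only profile table: user 1 changed their email once.
profile = [
    {"user_id": 1, "email": "old@example.com", "timestamp": 100},
    {"user_id": 1, "email": "new@example.com", "timestamp": 200},
    {"user_id": 2, "email": "only@example.com", "timestamp": 150},
]
print(latest_email_per_user(profile))
```

The old row for user 1 is never modified or deleted; it is simply ignored at read time, which is the trade-off an append-only store asks for.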
