BigQuery 架构的定义/文档? [英] Definition/documentation for BigQuery schemas?

查看:16
本文介绍了BigQuery 架构的定义/文档?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

有谁知道 BigQuery 架构定义的文档在哪里?换句话说,您在上传文件时提供的 JSON 架构 - personsDataSchema.json 在这个例子中.

我已经在谷歌上搜索了很长时间,但我找不到任何关于模式模式的文档.

我能得到的最接近的是关于自动检测架构的文档.但如果这不合适并且您需要提供预定义的 JSON 模式,是否有任何文档说明哪些字段是必需的,哪些值是允许的?

解决方案

要定义一个模式,你只需要定义 3 个字段:nametype模式.

表中的每个字段都必须定义了这 3 个键.例如,如果您有一个像这样的表:

user_id 来源1 次搜索2 电子邮件

那么架构可以定义为:

[{"name": "user_id", "type": "INT64", "mode": "REQUIRED"},{"name": "source", "type": "STRING", "mode": "NULLABLE"}]

name只是描述了字段名,比如user_id".

关键的type是数据类型,如STRING、INTEGER、FLOAT等.目前,BigQuery 支持这些类型:

  • STRING
  • INT64
  • FLOAT64
  • 布尔
  • BYTES(编码字节为您提供字符串表示).
  • 日期
  • 日期时间
  • 时间
  • 时间戳
  • 记录

现在,如果您打开文档,您会看到我们还有数据类型 ARRAY,它是一个 REPEATED 字段.稍后我将详细讨论它们.

第三个键,mode,可以是其中之一:

  • NULLABLE(允许值为 NULL)
  • 必需(不允许值为 NULL)
  • REPEATED(这是 ARRAY 字段,这意味着该字段基本上是一个值列表).

所以,让我们以前面的例子为例,添加一个重复的字段(即 ARRAY 字段)来说明:

user_id 源愿望清单1 次搜索 ["sku 0", "sku 1"]2 电子邮件 []3 直接 ["sku 0", "sku 3"]

架构可以定义如下:

[{"name": "user_id", "type": "INT64", "mode": "REQUIRED"},{"name": "source", "type": "STRING", "mode": "NULLABLE"},{"name": "wishlist", "type": "STRING", "mode": "REPEATED"}]

就这样,ARRAY 字段定义为字符串值的重复.

我们仍然只剩下一种类型的字段,那就是 RECORD 字段 (STRUCT).这些基本相同,除了我们还为它们定义了第四个键fields.由于 RECORD 包括其他字段,您还必须描述它们的定义;举个例子更容易理解:

user_id 源心愿单 location.country location.city1 搜索 ["sku 0", "sku 1"] 美国纽约2 电子邮件 [] 美国洛杉矶3 直接 ["sku 0", "sku 3"] BR SP

这里,location 是一个 RECORD (STRUCT),里面有 2 个键:countrycity.这就是您为它们定义架构的方式:

[{"name": "user_id", "type": "INT64", "mode": "REQUIRED"},{"name": "source", "type": "STRING", "mode": "NULLABLE"},{"name": "wishlist", "type": "STRING", "mode": "REPEATED"},{"name": "location", "type": "RECORD", "mode": "NULLABLE", "fields": [{"name": "country", "type": "STRING", "mode": "NULLABLE"}, {"name": "city", "type": "STRING", "mode": "NULLABLE"}]}]

您想要一个 REPEATED 字段的 RECORDS 吗?当然,为什么不呢!例如,如果您希望客户在您网站上的每个 hit 都有一个 REPEATED 字段,您可以像这样定义架构:

[{"name": "user_id", "type": "INT64", "mode": "REQUIRED"},{"name": "source", "type": "STRING", "mode": "NULLABLE"},{"name": "wishlist", "type": "STRING", "mode": "REPEATED"},{"name": "location", "type": "RECORD", "mode": "NULLABLE", "fields": [{"name": "country", "type": "STRING", "mode": "NULLABLE"}, {"name": "city", "type": "STRING", "mode": "NULLABLE"}]},{"name": "hit", "type": "RECORD", "mode": "REPEATED", "fields": [{"name": "hitNumber", "type": "INT64", "mode": "NULLABLE"}, {"name": "hitPage", "type": "STRING", "mode": "NULLABLE"}]}]

综上所述,我们终于可以回答您的问题了,dataPersons.json 架构将如何定义?

这是一行personsData的例子:

{"kind": "person","fullName": "John Doe",年龄":22,"性别": "男","phoneNumber": {"areaCode": "206", "number": "1234567"},儿童":[{姓名":简",性别":女性",年龄":6"},{"name": "John", "gender": "Male", "age": "15"}],"citiesLived": [{"place": "Seattle", "yearsLived": ["1995"]},{"place": "Stockholm", "yearsLived": ["2005"]}]}

首先,我们有 "kind": "person".这很简单,它的架构是:

{"name": "kind", "type": "STRING", "mode": "REQUIRED" or "NULLABLE"}

phoneNumber 是一个 RECORD (STRUCT) 字段,具有两个内部字段,areaCodenumber.好吧,我们已经看到了如何定义它们!

{"name": "phoneNumber","类型": "记录","mode": "NULLABLE 或 REQUIRED",字段":[{名称":区域代码",类型":INT64",模式":NULLABLE"},{"name": "number", "type": "INT64", "mode": "NULLABLE"}]}

现在childrencitiesLived有相同的定义,即它们都是RECORDs (STRUCT)的REPEATED (ARRAY)字段.就像我们上一个例子一样,这个例子也应该很简单;citiesLived 将被定义为:

{"name": "citiesLived","类型": "记录",模式":重复",字段":[{名称":地点",类型":字符串",模式":NULLABLE"},{"name": "yearLived", "type": "INT64", "mode": "REPEATED"}]}

这就给你了.这基本上就是模式定义的全部内容.例如,如果您正在使用 Python,那么想法是相同的.您导入类 SchemaField 来定义每个字段,如下所示:

 from google.cloud.bigquery import SchemaFieldfield_kind = SchemaField(name="kind", type="STRING", mode="NULLABLE")

其他客户也会遵循同样的想法.

总而言之,您始终必须为表中的每个字段定义 3 个键:nametypemode.如果字段是 RECORD 类型,那么您还必须定义 fields 并且对于每个内部字段,您再次定义 3 个键(如果内部字段再次是 RECORD 类型,则为 4 个).

希望这能让我们更清楚地了解如何定义架构.如果您对此主题仍有任何疑问,请告诉我,我会更新答案.

Does anyone know where the documentation is for the definition of BigQuery schemas? In other words, the JSON schema you supply when uploading files - personsDataSchema.json in this example.

I have been Googling for ages, but I cannot find any documentation about the schema for schemas.

The closest I can get is documentation about auto-detecting schemas. But in cases where that is not appropriate and you need to supply a pre-defined JSON schema, is there any documentation about which fields are required and which values are allowed?

解决方案

To define a schema, all you need basically is to define 3 fields: name, type and mode.

Each field in your table must have defined these 3 keys. If you have for instance a table like:

user_id    source
1          search
2          email

Then the schema could be defined as:

[{"name": "user_id", "type": "INT64", "mode": "REQUIRED"},
 {"name": "source", "type": "STRING", "mode": "NULLABLE"}]

The key name just describes the field name, such as "user_id".

The key type is the data type, such as STRING, INTEGER, FLOAT and so on. Currently, BigQuery supports these types:

  • STRING
  • INT64
  • FLOAT64
  • BOOL
  • BYTES (enconding bytes gives you the string representation).
  • DATE
  • DATETIME
  • TIME
  • TIMESTAMP
  • RECORD

Now, if you open the documentation, you'll see that we also have the data type ARRAY that is a REPEATED field. I'll discuss more about them later.

The third key, mode, can be one of these:

  • NULLABLE (allows values to be NULL)
  • REQUIRED (does not allow values to be NULL)
  • REPEATED (this is the ARRAY field, it means that the field is basically a list of values).

So, let's take our previous example and add a repeated field (i.e, ARRAY field) to illustrate:

user_id    source    wishlist
1          search    ["sku 0", "sku 1"]
2          email     []
3          direct    ["sku 0", "sku 3"]

The schema could be defined as follows:

[{"name": "user_id", "type": "INT64", "mode": "REQUIRED"},
 {"name": "source", "type": "STRING", "mode": "NULLABLE"},
 {"name": "wishlist", "type": "STRING", "mode": "REPEATED"}]

And there you have it, the ARRAY field defined as a repetition of string values.

We are still left with one type of field and that is the RECORD field (STRUCT). These are basically the same, except that we also defined a fourth key fields for them. As RECORDs includes other fields, you must describe their definition as well; this is easier to understand with an example:

user_id    source    wishlist            location.country    location.city
1          search    ["sku 0", "sku 1"]  USA                 NY
2          email     []                  USA                 LA
3          direct    ["sku 0", "sku 3"]  BR                  SP

Here, location is a RECORD (STRUCT) with 2 keys inside: the country and the city. That's how you'd define a schema for them:

[{"name": "user_id", "type": "INT64", "mode": "REQUIRED"},
 {"name": "source", "type": "STRING", "mode": "NULLABLE"},
 {"name": "wishlist", "type": "STRING", "mode": "REPEATED"},
 {"name": "location", "type": "RECORD", "mode": "NULLABLE", "fields": [{"name": "country", "type": "STRING", "mode": "NULLABLE"}, {"name": "city", "type": "STRING", "mode": "NULLABLE"}]}]

You want to have a REPEATED field of RECORDS? Sure, why not! If you want a REPEATED field for each hit your client had in your website for instance, you can define the schema like so:

[{"name": "user_id", "type": "INT64", "mode": "REQUIRED"},
 {"name": "source", "type": "STRING", "mode": "NULLABLE"},
 {"name": "wishlist", "type": "STRING", "mode": "REPEATED"},
 {"name": "location", "type": "RECORD", "mode": "NULLABLE", "fields": [{"name": "country", "type": "STRING", "mode": "NULLABLE"}, {"name": "city", "type": "STRING", "mode": "NULLABLE"}]},
 {"name": "hit", "type": "RECORD", "mode": "REPEATED", "fields": [{"name": "hitNumber", "type": "INT64", "mode": "NULLABLE"}, {"name": "hitPage", "type": "STRING", "mode": "NULLABLE"}]}]

Given all that, we can finally answer your question, how would dataPersons.json schema be defined?

This is an example of a row of personsData:

{"kind": "person",
 "fullName": "John Doe",
 "age": 22,
 "gender": "Male",
 "phoneNumber": {"areaCode": "206", "number": "1234567"},
 "children": [{"name": "Jane", "gender": "Female", "age": "6"},
              {"name": "John", "gender": "Male", "age": "15"}],
 "citiesLived": [{"place": "Seattle", "yearsLived": ["1995"]},
                 {"place": "Stockholm", "yearsLived": ["2005"]}]}

First, we have "kind": "person". This is easy, its schema would be:

{"name": "kind", "type": "STRING", "mode": "REQUIRED" or "NULLABLE"}

phoneNumber is a RECORD (STRUCT) field with two inner fields, areaCode and number. Well, we already saw how to define them!

{"name": "phoneNumber",
 "type": "RECORD",
 "mode": "NULLABLE OR REQUIRED",
 "fields": [{"name": "areaCode", "type": "INT64", "mode": "NULLABLE"},
            {"name": "number", "type": "INT64", "mode": "NULLABLE"}]}

Now children and citiesLived have the same definition, that is, they are both a REPEATED (ARRAY) field of RECORDs (STRUCT). Just as in our last example, this one should be straightforward as well; citiesLived would be defined as:

{"name": "citiesLived",
 "type": "RECORD",
 "mode": "REPEATED",
 "fields": [{"name": "place", "type": "STRING", "mode": "NULLABLE"},
            {"name": "yearLived", "type": "INT64", "mode": "REPEATED"}]}

And there you have it. That's basically all there is to schemas definition. If you are using Python for instance, the idea is the same. You import the class SchemaField to define each field, like so:

from google.cloud.bigquery import SchemaField
field_kind = SchemaField(name="kind", type="STRING", mode="NULLABLE")

Other clients will follow the same idea.

So to summarize, you always have to define 3 keys for each field in your table: name, type and mode. If the field is of type RECORD, then you also have to define fields and for each inner field, you again define the 3 keys (4, if the inner field is of type RECORD again).

Hopefully this made a bit more clear on how to define a schema. Let me know if you still have any questions regarding this subject and I'll update the answer.

这篇关于BigQuery 架构的定义/文档?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆