Definition/documentation for BigQuery schemas?


Problem description

Does anyone know where the documentation is for the definition of BigQuery schemas? In other words, the JSON schema you supply when uploading files - personsDataSchema.json in this example.

I have been Googling for ages, but I cannot find any documentation about the schema for schemas.

The closest I can get is documentation about auto-detecting schemas. But in cases where that is not appropriate and you need to supply a pre-defined JSON schema, is there any documentation about which fields are required and which values are allowed?

Solution

To define a schema, you basically need three keys per field: name, type and mode (mode falls back to NULLABLE if you leave it out, but it is clearer to spell it out).

Each field in your table must define these 3 keys. For instance, if you have a table like:

user_id    source
1          search
2          email

Then the schema could be defined as:

[{"name": "user_id", "type": "INT64", "mode": "REQUIRED"},
 {"name": "source", "type": "STRING", "mode": "NULLABLE"}]

The name key simply holds the field name, such as "user_id".

The type key is the data type, such as STRING, INT64, FLOAT64 and so on (INTEGER and FLOAT are the legacy spellings of INT64 and FLOAT64). At the time of writing, BigQuery supports these types:

  • STRING
  • INT64
  • FLOAT64
  • BOOL
  • BYTES (encoding the bytes gives you their string representation).
  • DATE
  • DATETIME
  • TIME
  • TIMESTAMP
  • RECORD

Now, if you open the documentation, you'll see that there is also an ARRAY data type. In a schema, an ARRAY is expressed as a field whose mode is REPEATED. I'll come back to this below.

The third key, mode, can be one of these:

  • NULLABLE (allows values to be NULL)
  • REQUIRED (does not allow values to be NULL)
  • REPEATED (this is the ARRAY case: the field is basically a list of values).
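
To make the three keys concrete, here is a minimal Python sketch that normalizes a flat field definition. The helper name normalize_field and the alias table are my own, not part of any BigQuery client. It assumes that INTEGER, FLOAT and BOOLEAN are accepted as legacy spellings of INT64, FLOAT64 and BOOL, and that an omitted mode means NULLABLE. Upper-casing the input and mapping STRUCT to RECORD are just conventions of this helper.

```python
# Legacy spellings mapped to the names used in this answer (helper convention).
TYPE_ALIASES = {
    "INTEGER": "INT64",
    "FLOAT": "FLOAT64",
    "BOOLEAN": "BOOL",
    "STRUCT": "RECORD",
}
VALID_MODES = {"NULLABLE", "REQUIRED", "REPEATED"}


def normalize_field(field):
    """Return a copy of a flat field definition with a canonical type and an explicit mode."""
    if "name" not in field or "type" not in field:
        raise ValueError("a field needs at least 'name' and 'type'")
    type_name = field["type"].upper()
    # Assumption: a missing mode is treated as NULLABLE.
    mode = field.get("mode", "NULLABLE").upper()
    if mode not in VALID_MODES:
        raise ValueError(f"unknown mode: {mode}")
    return {
        "name": field["name"],
        "type": TYPE_ALIASES.get(type_name, type_name),
        "mode": mode,
    }


schema = [
    normalize_field(f)
    for f in [
        {"name": "user_id", "type": "INTEGER", "mode": "REQUIRED"},
        {"name": "source", "type": "string"},
    ]
]
```

After this runs, schema holds the same two definitions as the user_id/source example above, with INT64 and an explicit NULLABLE filled in.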

So, let's take our previous example and add a repeated field (i.e., an ARRAY field) to illustrate:

user_id    source    wishlist
1          search    ["sku 0", "sku 1"]
2          email     []
3          direct    ["sku 0", "sku 3"]

The schema could be defined as follows:

[{"name": "user_id", "type": "INT64", "mode": "REQUIRED"},
 {"name": "source", "type": "STRING", "mode": "NULLABLE"},
 {"name": "wishlist", "type": "STRING", "mode": "REPEATED"}]

And there you have it: the ARRAY field is defined as a repetition of STRING values.

We are still left with one type of field, the RECORD field (STRUCT). These are basically the same, except that they take a fourth key, fields. Since a RECORD contains other fields, you must describe their definitions as well; this is easier to understand with an example:

user_id    source    wishlist            location.country    location.city
1          search    ["sku 0", "sku 1"]  USA                 NY
2          email     []                  USA                 LA
3          direct    ["sku 0", "sku 3"]  BR                  SP

Here, location is a RECORD (STRUCT) with 2 keys inside: country and city. This is how you'd define the schema for it:

[{"name": "user_id", "type": "INT64", "mode": "REQUIRED"},
 {"name": "source", "type": "STRING", "mode": "NULLABLE"},
 {"name": "wishlist", "type": "STRING", "mode": "REPEATED"},
 {"name": "location", "type": "RECORD", "mode": "NULLABLE", "fields": [{"name": "country", "type": "STRING", "mode": "NULLABLE"}, {"name": "city", "type": "STRING", "mode": "NULLABLE"}]}]
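
The dotted column names in the table above (location.country, location.city) follow directly from the nesting in the schema. As a rough illustration, this Python sketch walks a schema and lists the dotted path of every leaf field. The function name column_paths is hypothetical and it only reads the name, type and fields keys.

```python
def column_paths(fields, prefix=""):
    """Yield the dotted column name of every non-RECORD field in a schema."""
    for field in fields:
        path = prefix + field["name"]
        if field["type"] == "RECORD":
            # A RECORD has no value of its own; descend into its inner fields.
            yield from column_paths(field["fields"], path + ".")
        else:
            yield path


schema = [
    {"name": "user_id", "type": "INT64", "mode": "REQUIRED"},
    {"name": "source", "type": "STRING", "mode": "NULLABLE"},
    {"name": "wishlist", "type": "STRING", "mode": "REPEATED"},
    {"name": "location", "type": "RECORD", "mode": "NULLABLE", "fields": [
        {"name": "country", "type": "STRING", "mode": "NULLABLE"},
        {"name": "city", "type": "STRING", "mode": "NULLABLE"},
    ]},
]

paths = list(column_paths(schema))
print(paths)
# ['user_id', 'source', 'wishlist', 'location.country', 'location.city']
```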

Do you want a REPEATED field of RECORDs? Sure, why not! For instance, if you want a REPEATED field holding each hit your client made on your website, you can define the schema like so:

[{"name": "user_id", "type": "INT64", "mode": "REQUIRED"},
 {"name": "source", "type": "STRING", "mode": "NULLABLE"},
 {"name": "wishlist", "type": "STRING", "mode": "REPEATED"},
 {"name": "location", "type": "RECORD", "mode": "NULLABLE", "fields": [{"name": "country", "type": "STRING", "mode": "NULLABLE"}, {"name": "city", "type": "STRING", "mode": "NULLABLE"}]},
 {"name": "hit", "type": "RECORD", "mode": "REPEATED", "fields": [{"name": "hitNumber", "type": "INT64", "mode": "NULLABLE"}, {"name": "hitPage", "type": "STRING", "mode": "NULLABLE"}]}]

Given all that, we can finally answer your question: how would the schema for personsData.json be defined?

This is an example of a row from personsData:

{"kind": "person",
 "fullName": "John Doe",
 "age": 22,
 "gender": "Male",
 "phoneNumber": {"areaCode": "206", "number": "1234567"},
 "children": [{"name": "Jane", "gender": "Female", "age": "6"},
              {"name": "John", "gender": "Male", "age": "15"}],
 "citiesLived": [{"place": "Seattle", "yearsLived": ["1995"]},
                 {"place": "Stockholm", "yearsLived": ["2005"]}]}

First, we have "kind": "person". This one is easy; its schema would be (the mode can be either "REQUIRED" or "NULLABLE"):

{"name": "kind", "type": "STRING", "mode": "NULLABLE"}

phoneNumber is a RECORD (STRUCT) field with two inner fields, areaCode and number. Well, we already saw how to define those! Again, the mode can be either "NULLABLE" or "REQUIRED":

{"name": "phoneNumber",
 "type": "RECORD",
 "mode": "NULLABLE",
 "fields": [{"name": "areaCode", "type": "INT64", "mode": "NULLABLE"},
            {"name": "number", "type": "INT64", "mode": "NULLABLE"}]}

Now, children and citiesLived have the same kind of definition: each is a REPEATED (ARRAY) field of RECORDs (STRUCTs). Just as in our last example, this one should be straightforward as well; citiesLived would be defined as:

{"name": "citiesLived",
 "type": "RECORD",
 "mode": "REPEATED",
 "fields": [{"name": "place", "type": "STRING", "mode": "NULLABLE"},
            {"name": "yearsLived", "type": "INT64", "mode": "REPEATED"}]}
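
If you want to sanity-check a row against a schema before loading it, something like the following Python sketch can help. It only checks structure (REPEATED means a list, RECORD means a dict, NULL only where the mode allows it, no unknown keys) and does not check the scalar types themselves. The names conforms and row_conforms are my own. The rules are this sketch's assumptions and may be stricter than what a real load job accepts.

```python
def conforms(value, field):
    """Structural check of one value against one field definition."""
    mode = field.get("mode", "NULLABLE")
    if mode == "REPEATED":
        if not isinstance(value, list):
            return False
        # Each element is checked as a single, non-NULL value of the same type.
        single = dict(field, mode="REQUIRED")
        return all(conforms(item, single) for item in value)
    if value is None:
        return mode == "NULLABLE"
    if field["type"] == "RECORD":
        return isinstance(value, dict) and row_conforms(value, field["fields"])
    # Scalar field: anything that is not a container passes this rough check.
    return not isinstance(value, (list, dict))


def row_conforms(row, fields):
    """Check a whole row (or an inner RECORD) against a list of fields."""
    known = {f["name"] for f in fields}
    if set(row) - known:
        return False  # the row has a key the schema does not describe
    return all(conforms(row.get(f["name"]), f) for f in fields)


schema = [
    {"name": "citiesLived", "type": "RECORD", "mode": "REPEATED", "fields": [
        {"name": "place", "type": "STRING", "mode": "NULLABLE"},
        {"name": "yearsLived", "type": "INT64", "mode": "REPEATED"},
    ]},
]
row = {"citiesLived": [{"place": "Seattle", "yearsLived": ["1995"]},
                       {"place": "Stockholm", "yearsLived": ["2005"]}]}

ok = row_conforms(row, schema)
```

Note that a misspelled key in the row or in the schema makes the check fail, because the names no longer line up.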

And there you have it. That's basically all there is to schema definition. If you are using Python, for instance, the idea is the same. You import the class SchemaField to define each field, like so:

from google.cloud.bigquery import SchemaField
# Note: the keyword for the data type is field_type, not type.
field_kind = SchemaField(name="kind", field_type="STRING", mode="NULLABLE")

Other client libraries follow the same idea.

So, to summarize: for each field in your table you define 3 keys: name, type and mode. If the field is of type RECORD, you also have to define fields, and for each inner field you again define the 3 keys (4, if the inner field is itself a RECORD).
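
The summary above can be turned into a small recursive check. This is only a sketch with a hypothetical helper name (schema_errors). It knows just the types listed in this answer, so newer types such as NUMERIC or GEOGRAPHY would have to be added to the set. It treats mode as optional, on the assumption that a missing mode means NULLABLE.

```python
SCALAR_TYPES = {"STRING", "INT64", "FLOAT64", "BOOL", "BYTES",
                "DATE", "DATETIME", "TIME", "TIMESTAMP"}
MODES = {"NULLABLE", "REQUIRED", "REPEATED"}


def schema_errors(fields, prefix=""):
    """Return a list of human-readable problems found in a schema."""
    errors = []
    for field in fields:
        path = prefix + str(field.get("name", "<unnamed>"))
        for key in ("name", "type"):
            if key not in field:
                errors.append(f"{path}: missing '{key}'")
        field_type = field.get("type")
        if field_type == "RECORD":
            inner = field.get("fields")
            if not inner:
                errors.append(f"{path}: RECORD needs a non-empty 'fields' list")
            else:
                # Inner fields follow exactly the same rules, one level down.
                errors.extend(schema_errors(inner, path + "."))
        elif field_type is not None and field_type not in SCALAR_TYPES:
            errors.append(f"{path}: unknown type {field_type!r}")
        mode = field.get("mode", "NULLABLE")
        if mode not in MODES:
            errors.append(f"{path}: unknown mode {mode!r}")
    return errors


good = [{"name": "phoneNumber", "type": "RECORD", "mode": "NULLABLE", "fields": [
    {"name": "areaCode", "type": "INT64", "mode": "NULLABLE"},
    {"name": "number", "type": "INT64", "mode": "NULLABLE"},
]}]
bad = [{"name": "phoneNumber", "type": "RECORD", "mode": "NULLABLE OR REQUIRED"}]

good_errors = schema_errors(good)
bad_errors = schema_errors(bad)
```

Here good_errors is empty, while bad_errors reports both the missing fields list and the invalid mode string.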

Hopefully this makes it a bit clearer how to define a schema. Let me know if you still have any questions on this subject and I'll update the answer.
