How to insert/append unstructured data to bigquery table


Problem description




Background

I want to insert/append newline-delimited JSON into a BigQuery table through the Python client API.

For example:

{"name":"xyz","mobile":"xxx","location":"abc"}
{"name":"xyz","mobile":"xxx","age":22}

The issue is that all fields in a row are optional, and there is no fixed schema for the data.

Query

I have read that we can use federated tables, which support automatic schema detection.

However, I am looking for a feature that would automatically detect the schema from the data, create tables accordingly, and adjust the existing table's schema when extra columns/keys appear in the data, instead of creating a new table.

Is this possible using the Python client API?

Solution

You can use autodetect with the BigQuery load API. Your example, run with the bq CLI tool, looks like the following:

~$ cat /tmp/x.json
{"name":"xyz","mobile":"xxx","location":"abc"}
{"name":"xyz","mobile":"xxx","age":"22"}

~$ bq load --autodetect --source_format=NEWLINE_DELIMITED_JSON tmp.x /tmp/x.json
Upload complete.

~$ bq show tmp.x
Table tmp.x

   Last modified          Schema          Total Rows   Total Bytes   Expiration  
 ----------------- --------------------- ------------ ------------- ------------ 
  16 Aug 08:23:35   |- age: integer       2            33                        
                    |- location: string                                          
                    |- mobile: string                                            
                    |- name: string                                              


~$ bq query "select * from tmp.x"

+------+----------+--------+------+
| age  | location | mobile | name |
+------+----------+--------+------+
| NULL | abc      | xxx    | xyz  |
|   22 | NULL     | xxx    | xyz  |
+------+----------+--------+------+
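
Since the question asks about the Python client, the same autodetect load can probably be expressed with the google-cloud-bigquery package. The following is a minimal sketch, not a tested recipe: it assumes the package is installed, that default credentials and a default project are configured, and that the dataset already exists. The helper names to_ndjson and load_rows are hypothetical, chosen only for this illustration.

```python
import io
import json


def to_ndjson(rows):
    """Serialize dicts as newline-delimited JSON bytes, one object per line."""
    return "\n".join(json.dumps(row) for row in rows).encode("utf-8")


def load_rows(table_id, rows):
    """Append rows to table_id with schema autodetection (hypothetical helper)."""
    # Imported lazily so the serializer above works without the client library.
    from google.cloud import bigquery

    client = bigquery.Client()  # assumes default credentials and project
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
        autodetect=True,
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    )
    job = client.load_table_from_file(
        io.BytesIO(to_ndjson(rows)), table_id, job_config=job_config
    )
    job.result()  # wait for the load job; raises if the job failed
    return client.get_table(table_id).schema


# The two rows from the question; every field is optional.
rows = [
    {"name": "xyz", "mobile": "xxx", "location": "abc"},
    {"name": "xyz", "mobile": "xxx", "age": 22},
]
payload = to_ndjson(rows)
```

Calling load_rows("tmp.x", rows) would then correspond to the bq load --autodetect command above, with the table resolved against the client's default project.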

Update: If you later need to add fields, you can use schema_update_option to allow new fields. Unfortunately, it does not yet work with autodetect, so you need to provide the new schema explicitly to the load API:

~$ cat /tmp/x1.json 
{"name":"abc","mobile":"yyy","age":"25","gender":"male"}

~$ bq load --schema=name:STRING,age:INTEGER,location:STRING,mobile:STRING,gender:STRING --schema_update_option=ALLOW_FIELD_ADDITION --source_format=NEWLINE_DELIMITED_JSON tmp.x /tmp/x1.json
Upload complete.

~$ bq show tmp.x
Table tmp.x

   Last modified          Schema          Total Rows   Total Bytes   Expiration  
 ----------------- --------------------- ------------ ------------- -----------
  19 Aug 10:43:09   |- name: string       3            57                        
                    |- age: integer                                              
                    |- location: string                                          
                    |- mobile: string                                            
                    |- gender: string                                            


~$ bq query "select * from tmp.x"
status: DONE   
+------+------+----------+--------+--------+
| name | age  | location | mobile | gender |
+------+------+----------+--------+--------+
| abc  |   25 | NULL     | yyy    | male   |
| xyz  | NULL | abc      | xxx    | NULL   |
| xyz  |   22 | NULL     | xxx    | NULL   |
+------+------+----------+--------+--------+
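
Because the new schema has to be passed explicitly, it may help to compute it from the existing columns plus whatever keys show up in the incoming rows. The sketch below uses only the standard library; merge_schema and to_bq_schema_flag are hypothetical helper names, the type mapping is deliberately simplified, and only flat rows are handled (nested objects and arrays are out of scope here). Existing columns keep their declared type, which is why the quoted "25" does not change age from INTEGER.

```python
import json

# Simplified mapping from JSON value types to BigQuery column types.
# Assumption: flat rows only; anything unrecognized falls back to STRING.
_TYPE_NAMES = {bool: "BOOLEAN", int: "INTEGER", float: "FLOAT", str: "STRING"}


def merge_schema(existing, ndjson_text):
    """Return the existing (name, type) pairs plus new fields found in the rows."""
    merged = list(existing)
    known = {name for name, _ in existing}
    for line in ndjson_text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        for key, value in json.loads(line).items():
            if key in known or value is None:
                continue  # keep the existing type; nulls carry no type
            merged.append((key, _TYPE_NAMES.get(type(value), "STRING")))
            known.add(key)
    return merged


def to_bq_schema_flag(fields):
    """Format fields as the name:TYPE,... string accepted by bq load --schema."""
    return ",".join(f"{name}:{type_}" for name, type_ in fields)


existing = [
    ("name", "STRING"),
    ("age", "INTEGER"),
    ("location", "STRING"),
    ("mobile", "STRING"),
]
new_rows = '{"name":"abc","mobile":"yyy","age":"25","gender":"male"}\n'
schema_flag = to_bq_schema_flag(merge_schema(existing, new_rows))
print(schema_flag)
# → name:STRING,age:INTEGER,location:STRING,mobile:STRING,gender:STRING
```

The printed string matches the --schema value used in the bq load command above, so it could be passed along with --schema_update_option=ALLOW_FIELD_ADDITION.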
