How to import a JSON file from Cloud Storage into BigQuery


Question

I am trying to import a file (json.txt) from Cloud Storage into BigQuery via the API, and errors are thrown. When this is done via the web UI it works with no errors (I even set maxBadRecords=0). Could someone please tell me what I'm doing incorrectly here? Is the code wrong, or do I need to change some setting in BigQuery somewhere?

The file is a plain-text UTF-8 file with the contents shown below. I've kept to the docs on BigQuery and JSON imports.

{"person_id":225,"person_name":"John","object_id":1}
{"person_id":226,"person_name":"John","object_id":1}
{"person_id":227,"person_name":"John","object_id":null}
{"person_id":229,"person_name":"John","object_id":1}

On importing, the job throws the following error for every single line: "Value cannot be converted to expected type."

   {
    "reason": "invalid",
    "location": "Line:15 / Field:1",
    "message": "Value cannot be converted to expected type."
   },
   {
    "reason": "invalid",
    "location": "Line:16 / Field:1",
    "message": "Value cannot be converted to expected type."
   },
   {
    "reason": "invalid",
    "location": "Line:17 / Field:1",
    "message": "Value cannot be converted to expected type."
   },
   {
    "reason": "invalid",
    "location": "Line:18 / Field:1",
    "message": "Value cannot be converted to expected type."
   },
   {
    "reason": "invalid",
    "message": "Too many errors encountered. Limit is: 10."
   }
  ]
 },
 "statistics": {
  "creationTime": "1384484132723",
  "startTime": "1384484142972",
  "endTime": "1384484182520",
  "load": {
   "inputFiles": "1",
   "inputFileBytes": "960",
   "outputRows": "0",
   "outputBytes": "0"
  }
 }
}

The file can be accessed here: http://www.sendspace.com/file/7q0o37

My code and schema are as follows:

def insert_and_import_table_in_dataset(tar_file, table, dataset=DATASET)
  config = {
    'configuration' => {
      'load' => {
        'sourceUris' => ["gs://test-bucket/#{tar_file}"],
        'schema' => {
          'fields' => [
            { 'name' => 'person_id',   'type' => 'INTEGER', 'mode' => 'nullable' },
            { 'name' => 'person_name', 'type' => 'STRING',  'mode' => 'nullable' },
            { 'name' => 'object_id',   'type' => 'INTEGER', 'mode' => 'nullable' }
          ]
        },
        'destinationTable' => {
          'projectId' => @project_id.to_s,
          'datasetId' => dataset,
          'tableId' => table
        },
        'sourceFormat' => 'NEWLINE_DELIMITED_JSON',
        'createDisposition' => 'CREATE_IF_NEEDED',
        'maxBadRecords' => 10
      }
    }
  }

  result = @client.execute(
    :api_method => @bigquery.jobs.insert,
    :parameters => {
      # 'uploadType' => 'resumable',
      :projectId => @project_id.to_s,
      :datasetId => dataset
    },
    :body_object => config
  )

  # upload = result.resumable_upload
  # @client.execute(upload) if upload.resumable?

  puts result.response.body
  json = JSON.parse(result.response.body)
  # Poll until the load job reports DONE.
  while true
    job_status = get_job_status(json['jobReference']['jobId'])
    if job_status['status']['state'] == 'DONE'
      puts "DONE"
      return true
    else
      puts job_status['status']['state']
      puts job_status
      sleep 5
    end
  end
end
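The get_job_status helper is not shown in the question. A minimal sketch of one, assuming the same @client, @bigquery, and @project_id objects used above, could be:

def get_job_status(job_id)
  # Fetch the job via the BigQuery jobs.get method and return the parsed JSON body.
  result = @client.execute(
    :api_method => @bigquery.jobs.get,
    :parameters => { :projectId => @project_id.to_s, :jobId => job_id }
  )
  JSON.parse(result.response.body)
end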

Could someone please tell me what I am doing wrong? What do I fix, and where?

Also, at some point in the future I expect to be using compressed files and importing from them. Is ".tar.gz" OK for that, or do I need to make it a ".gz" only?

Thank you in advance for all help. Appreciate it.

Answer

You're getting hit by the same thing that a lot of people (including me) have gotten hit by: you are importing a JSON file but not specifying an import format, so it defaults to CSV.

If you set configuration.load.sourceFormat to NEWLINE_DELIMITED_JSON, you should be good to go.
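For reference, a minimal sketch of where that key belongs, assuming the same config-hash layout used in the question (the bucket path is hypothetical):

config = {
  'configuration' => {
    'load' => {
      'sourceUris' => ['gs://test-bucket/json.txt'],  # hypothetical path
      # The fix: without this key, the load job parses the file as CSV.
      'sourceFormat' => 'NEWLINE_DELIMITED_JSON',
      'destinationTable' => {
        'projectId' => @project_id.to_s,
        'datasetId' => dataset,
        'tableId' => table
      }
    }
  }
}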

We've got a bug filed to make this harder to do, or at least to detect when the file is the wrong type, and I'll bump its priority.

