How to import a json from a file on cloud storage to Bigquery


Problem Description

I am trying to import a file (json.txt) from cloud storage into BigQuery via the API, and errors are thrown. When this is done via the web UI, it works with no errors (I even set maxBadRecords=0). Could someone please tell me what I'm doing incorrectly here? Is the code wrong, or do I need to change some setting in BigQuery somewhere?

The file is a plain-text UTF-8 file with the following contents; I've kept to the docs on BigQuery and JSON imports:

{"person_id":225,"person_name":"John","object_id":1}
{"person_id":226,"person_name":"John","object_id":1}
{"person_id":227,"person_name":"John","object_id":null}
{"person_id":229,"person_name":"John","object_id":1}

and on importing, the job throws the error "Value cannot be converted to expected type." for every single line:

    {
    "reason": "invalid",
    "location": "Line:15 / Field:1",
    "message": "Value cannot be converted to expected type."
   },
   {
    "reason": "invalid",
    "location": "Line:16 / Field:1",
    "message": "Value cannot be converted to expected type."
   },
   {
    "reason": "invalid",
    "location": "Line:17 / Field:1",
    "message": "Value cannot be converted to expected type."
   },
  {
    "reason": "invalid",
    "location": "Line:18 / Field:1",
    "message": "Value cannot be converted to expected type."
   },
   {
    "reason": "invalid",
    "message": "Too many errors encountered. Limit is: 10."
   }
  ]
 },
 "statistics": {
  "creationTime": "1384484132723",
  "startTime": "1384484142972",
  "endTime": "1384484182520",
  "load": {
   "inputFiles": "1",
   "inputFileBytes": "960",
   "outputRows": "0",
   "outputBytes": "0"
  }
 }
}

The file can be accessed here: http://www.sendspace.com/file/7q0o37

and my code and schema are as follows:

def insert_and_import_table_in_dataset(tar_file, table, dataset=DATASET)
config= {
  'configuration'=> {
      'load'=> {
        'sourceUris'=> ["gs://test-bucket/#{tar_file}"],
        'schema'=> {
          'fields'=> [
            { 'name'=>'person_id', 'type'=>'INTEGER', 'mode'=> 'nullable'},
            { 'name'=>'person_name', 'type'=>'STRING', 'mode'=> 'nullable'},
            { 'name'=>'object_id',  'type'=>'INTEGER', 'mode'=> 'nullable'}
          ]
        },
        'destinationTable'=> {
          'projectId'=> @project_id.to_s,
          'datasetId'=> dataset,
          'tableId'=> table
        },
        'sourceFormat' => 'NEWLINE_DELIMITED_JSON',
        'createDisposition' => 'CREATE_IF_NEEDED',
        'maxBadRecords'=> 10,
      }
    },
  }

result = @client.execute(
  :api_method=> @bigquery.jobs.insert,
  :parameters=> {
     #'uploadType' => 'resumable',          
      :projectId=> @project_id.to_s,
      :datasetId=> dataset},
  :body_object=> config
)

# upload = result.resumable_upload
# @client.execute(upload) if upload.resumable?

puts result.response.body
json = JSON.parse(result.response.body)    
while true
  job_status = get_job_status(json['jobReference']['jobId'])
  if job_status['status']['state'] == 'DONE'
    puts "DONE"
    return true
  else
   puts job_status['status']['state']
   puts job_status 
   sleep 5
  end
end
end

Could someone please tell me what I am doing wrong? What do I fix and where?

Also, at some point in the future I expect to be using compressed files and importing from them: is "tar.gz" OK for that, or do I need to make it a ".gz" only?

Thank you in advance for all the help. Appreciate it.

Solution

You're getting hit by the same thing that a lot of people (including me) have been hit by: you are importing a JSON file but not specifying an import format, so it defaults to CSV. Parsed as CSV, the first field of each line is text like {"person_id":225, which cannot be converted to the INTEGER type of the first schema column; that is exactly the "Value cannot be converted to expected type." error on Field:1 that you're seeing.

If you set configuration.load.sourceFormat to NEWLINE_DELIMITED_JSON, you should be good to go.
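For illustration, here is a minimal sketch of the load configuration with the format set explicitly, mirroring the hash from the question (the bucket, dataset and table names are the same placeholders used above):

config = {
  'configuration' => {
    'load' => {
      # Without this line the loader assumes CSV and rejects every JSON row
      # with "Value cannot be converted to expected type."
      'sourceFormat' => 'NEWLINE_DELIMITED_JSON',
      'sourceUris'   => ["gs://test-bucket/json.txt"],
      'schema' => {
        'fields' => [
          { 'name' => 'person_id',   'type' => 'INTEGER', 'mode' => 'nullable' },
          { 'name' => 'person_name', 'type' => 'STRING',  'mode' => 'nullable' },
          { 'name' => 'object_id',   'type' => 'INTEGER', 'mode' => 'nullable' }
        ]
      },
      'destinationTable' => {
        'projectId' => @project_id.to_s,
        'datasetId' => dataset,
        'tableId'   => table
      },
      'createDisposition' => 'CREATE_IF_NEEDED'
    }
  }
}

result = @client.execute(
  :api_method  => @bigquery.jobs.insert,
  :parameters  => { :projectId => @project_id.to_s },
  :body_object => config
)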

We've got a bug filed to make this harder to do, or at least to detect when the file is the wrong type, but I'll bump the priority.
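If you want to double-check that the job actually loaded rows after the change, one option is to extend the polling loop from the question so that it prints the load errors and statistics once the state is DONE. A minimal sketch, assuming get_job_status returns the parsed jobs.get response as in the question:

job_status = get_job_status(json['jobReference']['jobId'])
if job_status['status']['state'] == 'DONE'
  errors = job_status['status']['errors'] || []
  if errors.empty?
    # statistics.load.outputRows should now match the number of JSON lines
    puts "Loaded #{job_status['statistics']['load']['outputRows']} rows"
  else
    errors.each { |e| puts "#{e['location']}: #{e['message']}" }
  end
end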
