bigquery bug: do not receive all bad records when uploading data

Problem Description

I'm trying to upload data to a BigQuery table.

here's the table schema:

[{
    "name": "temp",
    "type": "STRING"
  }]

here is the file I'm uploading:

{"temp" : "0"}
{"temp1" : "1"}
{"temp2" : "2"}
{"temp3" : "3"}
{"temp4" : "4"}
{"temp5" : "5"}
{"temp6" : "6"}
{"temp7" : "7"}
{"temp" : "8"}
{"temp" : "9"}

here is the bq command for uploading, with bad records allowed:

bq load --source_format=NEWLINE_DELIMITED_JSON --max_bad_records=100 mydataset.mytable ./tmp.json 

I receive:

Upload complete.
Waiting on bqjob_123.._1 ... (2s) Current status: DONE   
Warnings encountered during job execution:

JSON parsing error in row starting at position 15 at file: file-00000000. No such field: temp1.

JSON parsing error in row starting at position 31 at file: file-00000000. No such field: temp2.

JSON parsing error in row starting at position 47 at file: file-00000000. No such field: temp3.

JSON parsing error in row starting at position 63 at file: file-00000000. No such field: temp4.

JSON parsing error in row starting at position 79 at file: file-00000000. No such field: temp5.

now I'm using:

bq --format=prettyjson show -j <jobId> 

and this is what I get (I copied here only relevant fields):

{
  "configuration": {
    ...
      "maxBadRecords": 100

    }
  ,
  "statistics": {
    "load": {
      "inputFileBytes": "157",
      "inputFiles": "1",
      "outputBytes": "9",
      "outputRows": "3"
    }
  },
  "status": {
    "errors": [
      {
        "message": "JSON parsing error in row starting at position 15 at file: file-00000000. No such field: temp1.",
        "reason": "invalid"
      },
      {
        "message": "JSON parsing error in row starting at position 31 at file: file-00000000. No such field: temp2.",
        "reason": "invalid"
      },
      {
        "message": "JSON parsing error in row starting at position 47 at file: file-00000000. No such field: temp3.",
        "reason": "invalid"
      },
      {
        "message": "JSON parsing error in row starting at position 63 at file: file-00000000. No such field: temp4.",
        "reason": "invalid"
      },
      {
        "message": "JSON parsing error in row starting at position 79 at file: file-00000000. No such field: temp5.",
        "reason": "invalid"
      }
    ],
    "state": "DONE"
  }
}
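The same job metadata can also be read programmatically. As a minimal sketch (an illustration, assuming default credentials and the google-cloud-bigquery Python client rather than the bq CLI), something like this returns the same state, row count and truncated error list:

from google.cloud import bigquery

client = bigquery.Client()
job = client.get_job("<jobId>")  # the job id printed by bq load

print(job.state)        # DONE
print(job.output_rows)  # rows actually loaded, here 3
for error in job.errors or []:
    print(error["reason"], error["message"])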

now when I go to my table I actually have 3 new records (which matches the outputRows: 3 field):

{"temp" : "0"}
{"temp" : "8"}
{"temp" : "9"}

now these are my questions:

  1. as you can see I had 6 bad records, but I received only 5 of them - I didn't receive the temp6 error. I then tried uploading files with more bad records and always received only 5. Is this a BigQuery bug?

  2. assuming my records are larger and I upload many of them with errors allowed, how can I know after uploading which records were the bad ones? I need to know which records weren't loaded into BigQuery. All I get is "JSON parsing error in row starting at position 15 at file...". The position doesn't tell me much. Why can't I receive the number of the record? Or is there a way to calculate the record number from the position?

Solution

  1. We only return the first 5 errors, as we don't want to make the reply too big.
  2. As I explained in another thread, BigQuery is designed to process large files fast by processing them in parallel. If the file is 1GB, we might create hundreds of workers, and each worker processes a chunk of the file. If a worker is processing the last 10MB of the file and finds a bad record, then to know the number of that record it would have to read all of the previous 990MB. Thus every worker just reports the start position of the bad record. Some editors support seeking to an offset in a file: in vim, 1000go will move to position 1000; in less, it's 1000P. A short script can also map these offsets back to record numbers, as sketched below.
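Building on that last point, here is a minimal sketch (an illustration, not part of the original answer; it assumes a newline-delimited JSON file such as ./tmp.json and that each reported position is the byte offset at which the bad row starts) that maps the offsets from the error messages back to record numbers and the offending lines:

def bad_records(path, offsets):
    """Return (record_number, record_text) for each reported byte offset."""
    with open(path, "rb") as f:
        data = f.read()
    results = []
    for offset in offsets:
        # The record number is the count of newlines before the offset, plus one.
        number = data.count(b"\n", 0, offset) + 1
        # The bad record runs from the offset to the next newline (or end of file).
        end = data.find(b"\n", offset)
        end = len(data) if end == -1 else end
        results.append((number, data[offset:end].decode("utf-8")))
    return results

# Offsets copied from the error messages above.
for number, record in bad_records("./tmp.json", [15, 31, 47, 63, 79]):
    print(number, record)

For the file above this prints records 2 through 6, i.e. the temp1 through temp5 rows.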
