What's reformatting my input data before I get to it?

Question

I have a Data Lake Store account. I have a directory full of files containing data in JSON format, including some string values that contain times in ISO 8601 format, to wit: { "reading_time": "2008-09-15T15:53:00.91077" }

Now when I create a Pipeline with a Data Factory that uses these JSON files as an input dataset, it sees the value of reading_time in a typical US format: "9/15/2008 3:53:00 PM". Specifically, I get this message when I try to populate a DateTime field in the output dataset:

Column 'reports.reading_time' contains an invalid value '9/15/2008 3:53:00 PM'. Cannot convert '9/15/2008 3:53:00 PM' to type 'DateTime'

I thought hey, what if I tell my input dataset to specifically expect an ISO input date? So I tried this in my pipeline specification:

"datasets": [
  {
    "name": "ImprovedInputDataset",
    "properties": {
      "structure": [
        {
          "name": "reports.reading_time",
          "type": "Datetime",
          "format": "ISO"
        }
      ]
    }
  }
]

I was pretty impressed with myself for getting a slightly different error message (see "with format 'ISO'" at end):

Column 'reports.reading_time' contains an invalid value '9/15/2008 3:53:00 PM'. Cannot convert '9/15/2008 3:53:00 PM' to type 'DateTime' with format 'ISO'

Long story short, it seems as though something is noticing the ISO date format in my original input and doing me the dubious "favor" of converting it to a US-style date string before my pipeline gets to see it. I can't find anything in the Azure documentation online that explains exactly what happens to my input dataset before my Pipeline spec executes though.

I would appreciate it if someone would either a) explain to me what it is that's converting my ISO date/time string to a US-style date/time string and how to correct it; or b) point me to the documentation on the "preprocessing" that must be happening inside the Data Factory before my Pipeline spec is run.

Recommended answer

I can reproduce this issue, but I got it to work by using the "String" datatype for the input dataset. You can also leave the datatype unspecified, e.g.:

{
    "name": "InputDataset-9ad",
    "properties": {
        "structure": [
            {
                "name": "reading_time"
            }
        ],
...

This is in line with my current thinking that JSON does not have a datetime datatype as such. The documentation suggests that format should be a .NET format string, so "ISO" will never work. I spent some time trying many different date formats, e.g. "yyyy-MM-ddTHH:mm:ss.fffffff", but none of them worked either. My guess is that either datetime is simply not supported for JSON, or it's buggy or has an issue with the "T", and it basically ignores the format and defaults to something that looks like "en-US" in your example.
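
To illustrate the point that JSON has no datetime type, here is a minimal Python sketch. It is not Data Factory code and says nothing about what Data Factory does internally; it only shows that the sample value from the question arrives as a plain string, and that turning it into a datetime is a separate, explicit step. The variable names and the format string are assumptions chosen for illustration.

```python
import json
from datetime import datetime

# The sample document from the question.
raw = '{ "reading_time": "2008-09-15T15:53:00.91077" }'
record = json.loads(raw)

# JSON has no datetime type, so the parser hands back a plain string.
value = record["reading_time"]
print(type(value).__name__)

# Converting to a datetime requires an explicit format.
# %f accepts one to six fractional digits, so the five-digit fraction parses.
parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%S.%f")
print(parsed.isoformat())
```

Any tool that reports this value as a date has therefore made its own decision about how to interpret the string.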

I did find that most date formats "just work" without specifying structure. If you did have something that isn't international or portable, e.g. "01/04/2017" (is it 1 April or 4 January?), then the workaround would be to import it into a staging table as a string and transform it from there.
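
The ambiguity, and the "keep it as a string, convert later" workaround, can be sketched in Python. This is only an illustration of the idea, not the staging-table transform itself; `convert_staged` is a hypothetical helper, and the format strings are assumptions about what the source data means.

```python
from datetime import datetime

ambiguous = "01/04/2017"

# The same string yields two different dates depending on the assumed convention.
day_first = datetime.strptime(ambiguous, "%d/%m/%Y")    # 1 April 2017
month_first = datetime.strptime(ambiguous, "%m/%d/%Y")  # 4 January 2017


def convert_staged(value: str, fmt: str) -> datetime:
    """Hypothetical helper: convert a value that was staged as a string,
    using a format the caller states explicitly instead of a guessed default."""
    return datetime.strptime(value, fmt)


print(day_first.date(), month_first.date())
print(convert_staged("2008-09-15T15:53:00.91077", "%Y-%m-%dT%H:%M:%S.%f"))
```

Staging the raw string first means the convention is chosen once, in the transform, rather than inferred by whatever reads the file.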

I do have a question outstanding with an internal newsgroup and I'll update this post if I receive any further information. NB I do not work for Microsoft.

HTH
