Spark data type guesser UDAF

Question

I would like to take something like https://github.com/fitzscott/AirQuality/blob/master/HiveDataTypeGuesser.java and turn it into a Hive UDAF: an aggregate function that returns a guess at a column's data type.

Does Spark already have something like this built in? It would be very useful for exploring new, wide datasets. It would help with ML too, e.g. to decide between categorical and numerical variables.

How do you normally determine data types in Spark?

P.S. Frameworks like H2O automatically determine data types by scanning a sample of the data or the whole dataset. One can then decide, e.g., whether a variable should be categorical or numerical.

P.P.S. Another use case is when you get an arbitrary data set (we get them quite often) and want to save it as a Parquet table. Providing correct data types makes Parquet more space-efficient (and probably faster at query time, e.g. better Parquet bloom filters than when everything is stored as string/varchar).

Solution

Does Spark have something like this already built-in?

Partially. There are some tools in the Spark ecosystem which perform schema inference, like spark-csv or pyspark-csv, and category inference (categorical vs. numerical), like VectorIndexer.
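
To make the category inference idea concrete: VectorIndexer treats a feature as categorical when its number of distinct values does not exceed its maxCategories parameter. Below is a minimal pure-Python sketch of that rule. It is not Spark's implementation; the helper name looks_categorical and the two sample columns are hypothetical, made up for illustration.

```python
from typing import Hashable, Iterable


def looks_categorical(values: Iterable[Hashable], max_categories: int = 20) -> bool:
    """Return True if the column has at most `max_categories` distinct values.

    Stops as soon as the threshold is exceeded, so the set of seen values
    never holds more than max_categories + 1 entries.
    """
    seen = set()
    for v in values:
        seen.add(v)
        if len(seen) > max_categories:
            return False
    return True


# Hypothetical sample columns.
day_of_week = [d % 7 for d in range(1000)]            # 7 distinct values
temperature = [20.0 + 0.01 * i for i in range(1000)]  # 1000 distinct values

print(looks_categorical(day_of_week))  # True
print(looks_categorical(temperature))  # False
```

Note that a distinct-count rule like this would still classify a low-cardinality numeric measurement as categorical, so the threshold remains a judgment call.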

So far so good. The problem is that schema inference has limited applicability, is not an easy task in general, can introduce hard-to-diagnose problems, and can be quite expensive:

  1. Not many of the formats that can be used with Spark require schema inference. In practice it is limited to variants of CSV and fixed-width formatted data.
  2. Depending on the data representation, it can be impossible to determine the correct data type, or the inferred type can lead to information loss:

    • interpreting numeric data as float or double can lead to an unacceptable loss of precision, especially when working with financial data
    • date or number formats can differ depending on the locale
    • some common identifiers can look like numbers while having some internal structure which can be lost in conversion
  3. Automatic schema inference can mask various problems with the input data, and it can be dangerous if it is not backed by additional tools that highlight possible issues. Moreover, any mistakes made during data loading and cleaning can propagate through the complete data processing pipeline.

    Arguably, we should develop a good understanding of the input data before we even start to think about possible representations and encodings.

  4. Schema inference and/or category inference may require a full data scan and/or large lookup tables. Both can be expensive, or even infeasible, on large datasets.
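
The pitfalls from point 2 are easy to reproduce without Spark. The following is a small illustration in plain Python; the sample values (the date string and the ZIP code) are made up for the example.

```python
from datetime import datetime
from decimal import Decimal

# 1. Binary floating point cannot represent most decimal fractions exactly,
#    while a decimal type keeps them intact.
float_sum_ok = (0.1 + 0.2 == 0.3)
decimal_sum_ok = (Decimal("0.1") + Decimal("0.2") == Decimal("0.3"))
print(float_sum_ok, decimal_sum_ok)  # False True

# 2. The same string is a different date under day-first and month-first conventions.
raw_date = "03/04/2016"
day_first = datetime.strptime(raw_date, "%d/%m/%Y").date()
month_first = datetime.strptime(raw_date, "%m/%d/%Y").date()
print(day_first, month_first)  # 2016-04-03 2016-03-04

# 3. An identifier that looks numeric loses its leading zero when cast to an integer.
zip_code = "02134"
round_tripped = str(int(zip_code))
print(zip_code, round_tripped)  # 02134 2134
```

In each case the inferred type accepts the value without complaint, which is exactly why such problems are hard to diagnose later.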

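Edit:

It looks like schema inference for CSV files has been added directly to Spark SQL. See CSVInferSchema (https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVInferSchema.scala).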
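
As a rough illustration of what a type-guessing aggregate does, here is a minimal pure-Python sketch: guess a type for each value, then fold the guesses together by keeping the wider one. This is not the code of CSVInferSchema or of the linked HiveDataTypeGuesser; the four-type ordering, the regular expressions, and all names (guess_value, merge, guess_column) are assumptions made for the example. In a real user-defined aggregate, guess_value would belong to the per-row update step and merge to the step that combines partial results.

```python
import re
from functools import reduce

# Types ordered from narrowest to widest; merging two guesses keeps the wider one.
TYPES = ["null", "int", "double", "string"]
RANK = {name: i for i, name in enumerate(TYPES)}

INT_RE = re.compile(r"[+-]?[0-9]+")
DOUBLE_RE = re.compile(r"[+-]?([0-9]+\.[0-9]*|\.[0-9]+|[0-9]+)([eE][+-]?[0-9]+)?")


def guess_value(raw):
    """Guess the narrowest type that can hold a single raw string value."""
    if raw is None or raw.strip() == "":
        return "null"
    text = raw.strip()
    if INT_RE.fullmatch(text):
        return "int"
    if DOUBLE_RE.fullmatch(text):
        return "double"
    return "string"


def merge(a, b):
    """Combine two guesses. Taking the wider type is associative and commutative,
    so partial results from separate partitions can be merged in any order."""
    return a if RANK[a] >= RANK[b] else b


def guess_column(values):
    """Fold the per-value guesses of a whole column into a single guess."""
    return reduce(merge, (guess_value(v) for v in values), "null")


print(guess_column(["1", "2", "", "3"]))  # int
print(guess_column(["1", "2.5", "1e3"]))  # double
print(guess_column(["1", "2.5", "n/a"]))  # string

# Merging per-partition guesses gives the same answer as a single pass.
part_a = guess_column(["1", "2"])
part_b = guess_column(["3.0", None])
print(merge(part_a, part_b))              # double
```

A sketch like this inherits the caveats listed above: it would happily call a column of ZIP codes an int column, and a single stray value such as "n/a" turns the whole column into a string.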
