PySpark/Hive: how to CREATE TABLE with LazySimpleSerDe to convert boolean 't' / 'f'?

Question

Hello dear stackoverflow community,

Here is my problem:

A) I have data in csv with some boolean columns; unfortunately, the values in these columns are t or f (single letter); this is an artifact (from Redshift) that I cannot control.

B) I need to create a spark dataframe from this data, hopefully converting t -> true and f -> false. For that, I create a Hive DB and a temp Hive table and then SELECT * from it, like this:

sql_str = """SELECT * FROM {db}.{s}_{t} """.format(
             db=hive_db_name, s=schema, t=table)
df = sql_cxt.sql(sql_str)

This works, I can print df, and it gives me all my columns with correct data types. But:

C) If I create the table like this:

CREATE EXTERNAL TABLE IF NOT EXISTS {db}.{schema}_{table}({cols})
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|t'
STORED AS TEXTFILE
LOCATION ...

, this converts all my t and f to Nulls.

So:

D) I found out about LazySimpleSerDe that presumably must do what I mean (convert t and f to true and false on the fly). From https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties (quote):

"""
hive.lazysimple.extended_boolean_literal
Default Value: false
Added in: Hive 0.14 with HIVE-3635
LazySimpleSerDe uses this property to determine 
if it treats 'T', 't', 'F', 'f', '1', and '0' as extended, 
legal boolean literals, in addition to 'TRUE' and 'FALSE'. 
The default is false, which means only 'TRUE' and 'FALSE' 
are treated as legal boolean literals.
"""
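To make the quoted rule concrete, here is a minimal Python sketch of the parsing behaviour it describes. It is only an illustration, not Hive's actual code: the function name `parse_hive_boolean` is hypothetical, and treating 'TRUE'/'FALSE' case-insensitively is an assumption on my part.

```python
def parse_hive_boolean(text, extended=False):
    """Return True, False, or None (i.e. NULL) for one text field.

    Illustrative model of the documented rule only; the name and the
    case-insensitive matching are assumptions, not Hive's implementation.
    """
    strict = {"true": True, "false": False}
    # Extra literals listed in the docs for extended_boolean_literal=true.
    extra = {"t": True, "f": False, "1": True, "0": False}

    key = text.lower()
    if key in strict:
        return strict[key]
    if extended and key in extra:
        return extra[key]
    # Anything else is not a legal boolean literal and is read as NULL.
    return None


print(parse_hive_boolean("t"))                 # default behaviour
print(parse_hive_boolean("t", extended=True))  # with the property enabled
```

With the default setting, 't' and 'f' fall through to the last branch, which matches the Nulls seen above.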

According to this (or at least so I think), I now create a table in Hive DB like this:

create_table_sql = """
    CREATE EXTERNAL TABLE IF NOT EXISTS {db_name}.{schema}_{table}({cols})
    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
    WITH SERDEPROPERTIES ("separatorChar" = "\|")
    STORED AS TEXTFILE
    LOCATION '{loc}'
    TBLPROPERTIES ('hive.lazysimple.extended_boolean_literal'='true')
    """.format(db_name=hive_db_name,
               schema=schema,
               table=table,
               cols=",\n".join(cols),
               loc=location)

return sql_cxt.sql(create_table_sql)

This does create a table, I can again see all the columns with proper data types, the df.count() is correct, but df.head(3) still gives me all values for my boolean columns == Null.

(:___

I tried for hours different variants for my CREATE TABLE...

  • with or without SERDEPROPERTIES,
  • with or without TBLPROPERTIES,
  • with "FIELDS TERMINATED BY ..."

All of them give me

  • Null in place of 't' and 'f', or
  • an empty df (nothing from df.head(5)), or
  • a syntax error, or
  • some 100 pages of Java exceptions.

The real problem is, I would say, that there is no single example of CREATE TABLE with LazySimpleSerDe that does the job that is described in the docs.

I would really, really appreciate your help or any ideas. I pulled out almost all my hair.

Thanks in advance!

Recommended answer

According to the patch in the JIRA issue:

SET hive.lazysimple.extended_boolean_literal=true;

So for example, if you have a tab-delimited text file with a header row and 't'/'f' for true/false:

create table mytable(myfield boolean)
row format delimited
fields terminated by '\t'
lines terminated by '\n'
location '/path'
tblproperties (
    'skip.header.line.count' = '1'
);
...
select count(*) from mytable where myfield is null; -- returns 100% null
...
SET hive.lazysimple.extended_boolean_literal=true;
select count(*) from mytable where myfield is null; -- the SerDe now interprets the booleans more leniently, so the count changes
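From PySpark, the same idea might look like the sketch below, which issues the SET statement through the question's `sql_cxt` before running the SELECT. The helper name `select_with_extended_booleans` is hypothetical. Whether Spark passes this session setting on to the Hive SerDe is an assumption you would need to check on your own cluster.

```python
def select_with_extended_booleans(sql_cxt, db, schema, table):
    """Enable the extended boolean literals, then read the table.

    `sql_cxt` is any object with a .sql(str) method, such as the
    SQLContext/HiveContext used in the question. Hypothetical helper;
    it assumes the session-level SET reaches LazySimpleSerDe.
    """
    # Session-level setting, as in the answer's Hive example.
    sql_cxt.sql("SET hive.lazysimple.extended_boolean_literal=true")

    # Same query shape as in the question.
    sql_str = "SELECT * FROM {db}.{s}_{t}".format(db=db, s=schema, t=table)
    return sql_cxt.sql(sql_str)
```

Usage would be something like `df = select_with_extended_booleans(sql_cxt, hive_db_name, schema, table)`, followed by `df.head(3)` to see whether the boolean columns are still Null.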
