Correctly reading the types from file in PySpark

Problem description

I have a file with rows such as

id1 name1   ['a', 'b']  3.0 2.0 0.0 1.0

that is, an id, a name, a list with some strings, and a series of 4 float attributes. I am reading this file as

# every field comes back as a string after the split
rdd = sc.textFile('myfile.tsv') \
    .map(lambda row: row.split('\t'))
df = sqlc.createDataFrame(rdd, schema)

where I give the schema as

from pyspark.sql.types import (StructType, StructField, StringType,
                               ArrayType, FloatType)

schema = StructType([
    StructField('id', StringType(), True),
    StructField('name', StringType(), True),
    StructField('list', ArrayType(StringType()), True),
    StructField('att1', FloatType(), True),
    StructField('att2', FloatType(), True),
    StructField('att3', FloatType(), True),
    StructField('att4', FloatType(), True)
])

The problem is that neither the list nor the attributes get read properly, judging from a collect on the DataFrame. In fact, I get None for all of them:

Row(id=u'id1', brand_name=u'name1', list=None, att1=None, att2=None, att3=None, att4=None)

What am I doing wrong?

Answer

It is read correctly; it just doesn't work the way you expect. The schema argument declares the column types in order to avoid expensive schema inference, not how to cast the data. Providing input that matches the declared schema is your responsibility.

This can also be handled by the data source (take a look at spark-csv and its inferSchema option), although that won't handle complex types like arrays.
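
For example, a minimal sketch of that approach, assuming the com.databricks:spark-csv package is on the classpath and a Spark 1.4+ DataFrameReader; the list column would still come back as a plain string:

# assumes spark-csv is available, e.g. started with
# --packages com.databricks:spark-csv_2.10:1.x.0
df = sqlc.read \
    .format('com.databricks.spark.csv') \
    .option('delimiter', '\t') \
    .option('inferSchema', 'true') \
    .load('myfile.tsv')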

Since your schema is mostly flat and you know the types, you can try something like this:

from pyspark.sql.functions import col

df = rdd.toDF([f.name for f in schema.fields])

exprs = [
    # you should exclude casting
    # on other complex types as well
    col(f.name).cast(f.dataType) if f.dataType.typeName() != "array"
    else col(f.name)
    for f in schema.fields
]

df = df.select(*exprs)

and handle the complex types separately using string processing functions or UDFs. Alternatively, since you are reading the data in Python anyway, just enforce the desired types before you create the DataFrame.
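
For instance, a minimal sketch of that second approach, assuming the list column is written as a Python-style literal such as ['a', 'b'] (the parse_line helper is hypothetical):

import ast

def parse_line(line):
    # hypothetical helper: turn one tab-separated line into values
    # that already match the declared schema
    fields = line.split('\t')
    return (fields[0],                    # id
            fields[1],                    # name
            ast.literal_eval(fields[2]),  # "['a', 'b']" -> list of strings
            float(fields[3]), float(fields[4]),
            float(fields[5]), float(fields[6]))

rdd = sc.textFile('myfile.tsv').map(parse_line)
df = sqlc.createDataFrame(rdd, schema)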
