Correctly reading the types from file in PySpark
Question
I have a file containing rows like

id1 name1 ['a', 'b'] 3.0 2.0 0.0 1.0

that is, an id, a name, a list with some strings, and a series of 4 float attributes. I am reading this file as
rdd = sc.textFile('myfile.tsv') \
    .map(lambda row: row.split('\t'))
df = sqlc.createDataFrame(rdd, schema)
where I give the schema as
schema = StructType([
    StructField('id', StringType(), True),
    StructField('name', StringType(), True),
    StructField('list', ArrayType(StringType()), True),
    StructField('att1', FloatType(), True),
    StructField('att2', FloatType(), True),
    StructField('att3', FloatType(), True),
    StructField('att4', FloatType(), True)
])
Problem is, both the list and the attributes do not get properly read, judging from a collect on the DataFrame. In fact, I get None for all of them:
Row(id=u'id1', brand_name=u'name1', list=None, att1=None, att2=None, att3=None, att4=None)
What am I doing wrong?
Recommended answer
It is properly read; it just doesn't work as you expect. The schema argument declares what the types are, so that expensive schema inference can be avoided, not how to cast the data. Providing input that matches the declared schema is your responsibility.
This can also be handled by the data source (take a look at spark-csv and its inferSchema option). It won't handle complex types like arrays though.
Since your schema is mostly flat and you know the types, you can try something like this:
from pyspark.sql.functions import col

df = rdd.toDF([f.name for f in schema.fields])

exprs = [
    # Cast the flat columns; leave arrays as-is (you should exclude
    # casting on other complex types as well)
    col(f.name).cast(f.dataType) if f.dataType.typeName() != "array"
    else col(f.name)
    for f in schema.fields
]

df = df.select(*exprs)
and handle the complex types separately using string processing functions or UDFs. Alternatively, since you read the data in Python anyway, just enforce the desired types before you create the DataFrame.
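That last suggestion, enforcing the types in Python before building the DataFrame, can be sketched as a small parser for the question's row layout (the parse_row helper and the ast.literal_eval call for the list column are assumptions for illustration, not part of the original answer):

```python
import ast

def parse_row(line):
    """Coerce one tab-separated line into the types the schema declares."""
    fields = line.split('\t')
    return (
        fields[0],                          # id stays a string
        fields[1],                          # name stays a string
        ast.literal_eval(fields[2]),        # "['a', 'b']" -> ['a', 'b']
        *[float(x) for x in fields[3:7]],   # the four float attributes
    )

# Then create the DataFrame from already-typed rows:
# rdd = sc.textFile('myfile.tsv').map(parse_row)
# df = sqlc.createDataFrame(rdd, schema)
```

With the rows typed up front, the declared schema matches the data and the collect returns real values instead of None.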