Spark Option: inferSchema vs header = true


Question

Reference: pyspark: performance difference between spark.read.format("csv") and spark.read.csv

I thought I needed .option("inferSchema", "true") and .option("header", "true") to print my headers, but apparently I can still print my CSV with its headers.

What is the difference between header and schema? I don't really understand the meaning of "inferSchema: automatically infers column types. It requires one extra pass over the data and is false by default".

Recommended answer

The header and the schema are separate things.

Header:

If the CSV file has a header (column names in the first row), set header=true. Spark will then use the first row of the file as the dataframe's column names. With header=false (the default), the first row is treated as data and the dataframe gets default column names: _c0, _c1, _c2, etc.

Whether to set this to true or false depends on your input file.

Schema:

The schema referred to here is the set of column types. A column can be of type String, Double, Long, etc. With inferSchema=false (the default), every column in the resulting dataframe is a string (StringType). Depending on what you want to do, strings may not be suitable. For example, if you want to add numbers from different columns, those columns should be of some numeric type rather than strings.

With inferSchema=true, Spark goes through the CSV file and infers the type of each column. This requires an extra pass over the file, so reading with inferSchema set to true is slower. In return, the dataframe will most likely have a correct schema for its input.

As an alternative to reading a CSV with inferSchema, you can provide the schema yourself when reading. This is faster than inferring the schema and still gives a dataframe with the correct column types. In addition, for CSV files without a header row, the schema supplies the column names. To provide a schema, see e.g.: Provide schema while reading csv file as a dataframe
