Provide a schema while reading a CSV file as a DataFrame
Question
I am trying to read a CSV file into a DataFrame. I know what the schema of my DataFrame should be, since I know my CSV file. I am using the spark-csv package to read the file, and I am trying to specify the schema as shown below.
val pagecount = sqlContext.read.format("csv")
  .option("delimiter"," ").option("quote","")
  .option("schema","project: string ,article: string ,requests: integer ,bytes_served: long")
  .load("dbfs:/databricks-datasets/wikipedia-datasets/data-001/pagecounts/sample/pagecounts-20151124-170000")
But when I check the schema of the DataFrame I created, it seems to have used its own schema. Am I doing something wrong? How can I make Spark pick up the schema I specified?
> pagecount.printSchema
root
|-- _c0: string (nullable = true)
|-- _c1: string (nullable = true)
|-- _c2: string (nullable = true)
|-- _c3: string (nullable = true)
Recommended answer
Try the code below; you do not need to specify the schema. When you set inferSchema to true, Spark should infer the column types from your CSV file.
val pagecount = sqlContext.read.format("csv")
  .option("delimiter"," ").option("quote","")
  // header = true treats the first line as column names; use "false" if the file has no header row
  .option("header", "true")
  .option("inferSchema", "true")
  .load("dbfs:/databricks-datasets/wikipedia-datasets/data-001/pagecounts/sample/pagecounts-20151124-170000")
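If the file turns out to have no header row, inference still works, but the columns keep the default names _c0 to _c3 seen in the question. The following is a minimal sketch of renaming them with toDF after inference. It assumes Spark 2.x or later with a SparkSession; the local session, the temporary file, and the two sample rows are hypothetical stand-ins for the real dataset.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// Local session for illustration only; on Databricks a `spark` session already exists.
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("csv-infer-sketch")
  .getOrCreate()

// Hypothetical sample rows in the same space-delimited layout as the question's file.
val samplePath = Files.createTempFile("pagecounts-sample", ".txt")
Files.write(samplePath, "en Main_Page 42 123456\nen Spark 7 9876\n".getBytes("UTF-8"))

val inferred = spark.read.format("csv")
  .option("delimiter", " ")
  .option("header", "false")      // no header row, so the first line is data
  .option("inferSchema", "true")  // requires an extra pass over the data
  .load(samplePath.toUri.toString)
  .toDF("project", "article", "requests", "bytes_served") // replaces _c0 .. _c3

inferred.printSchema()
```

Inference picks types from the values it sees, so a column you expect to be long may come back as integer. When the exact types matter, an explicit schema is the safer choice.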
If you want to specify the schema manually, build a StructType and pass it to the reader's schema method (rather than as an option), as below:
import org.apache.spark.sql.types._

// One StructField per column: name, type, nullable
val customSchema = StructType(Array(
  StructField("project", StringType, true),
  StructField("article", StringType, true),
  StructField("requests", IntegerType, true),
  StructField("bytes_served", LongType, true) // long, as in the schema the question asks for
))

val pagecount = sqlContext.read.format("csv")
  .option("delimiter"," ").option("quote","")
  .option("header", "true")
  .schema(customSchema) // the schema goes through .schema(...), not .option("schema", ...)
  .load("dbfs:/databricks-datasets/wikipedia-datasets/data-001/pagecounts/sample/pagecounts-20151124-170000")
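The question tried to pass the schema as a string. Something close to that is possible: since Spark 2.3 the reader's schema method also accepts a DDL-formatted string. The sketch below assumes Spark 2.3 or later; the local session, the temporary file, and the sample rows are hypothetical, and BIGINT is used for bytes_served to match the long type the question asks for.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

// Local session for illustration only; on Databricks a `spark` session already exists.
val session = SparkSession.builder()
  .master("local[1]")
  .appName("csv-ddl-schema-sketch")
  .getOrCreate()

// Hypothetical sample rows, space-delimited, with no header row.
val dataPath = Files.createTempFile("pagecounts-ddl", ".txt")
Files.write(dataPath, "en Main_Page 42 123456\nen Spark 7 9876\n".getBytes("UTF-8"))

// DDL string: column name followed by a SQL type, comma-separated.
val ddlSchema = "project STRING, article STRING, requests INT, bytes_served BIGINT"

val typed = session.read.format("csv")
  .option("delimiter", " ")
  .option("header", "false")
  .schema(ddlSchema) // passed via .schema(...), not .option("schema", ...)
  .load(dataPath.toUri.toString)

typed.printSchema()
```

With an explicit schema Spark does not need to scan the data to infer types, which avoids the extra pass that inferSchema makes.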