在读取csv文件作为数据框时提供架构 [英] Provide schema while reading csv file as a dataframe
问题描述
我正在尝试将csv文件读入数据帧.由于知道csv文件,因此我知道数据框的架构应该是什么.另外我正在使用spark csv包来读取文件.我试图指定如下所示的模式.
I am trying to read a csv file into a dataframe. I know what the schema of my dataframe should be since I know my csv file. Also I am using spark csv package to read the file. I trying to specify the schema like below.
val pagecount = sqlContext.read.format("csv")
.option("delimiter"," ").option("quote","")
.option("schema","project: string ,article: string ,requests: integer ,bytes_served: long")
.load("dbfs:/databricks-datasets/wikipedia-datasets/data-001/pagecounts/sample/pagecounts-20151124-170000")
但是当我检查创建的数据框的架构时,它似乎采用了自己的架构.我做错什么了吗?如何发出火花来拾取我提到的架构?
But when I check the schema of the data frame I created, it seems to have taken its own schema. Am I doing anything wrong ? how to make spark to pick up the schema I mentioned ?
> pagecount.printSchema
root
|-- _c0: string (nullable = true)
|-- _c1: string (nullable = true)
|-- _c2: string (nullable = true)
|-- _c3: string (nullable = true)
推荐答案
尝试以下代码,无需指定架构.当将inferSchema设置为true时,应从csv文件中获取它.
Try the below code, you need not specify the schema. When you give inferSchema as true it should take it from your csv file.
val pagecount = sqlContext.read.format("csv")
.option("delimiter"," ").option("quote","")
.option("header", "true")
.option("inferSchema", "true")
.load("dbfs:/databricks-datasets/wikipedia-datasets/data-001/pagecounts/sample/pagecounts-20151124-170000")
如果要手动指定架构,则可以按照以下步骤进行操作:
If you want to manually specify the schema, you can do it as below:
import org.apache.spark.sql.types._
val customSchema = StructType(Array(
StructField("project", StringType, true),
StructField("article", StringType, true),
StructField("requests", IntegerType, true),
StructField("bytes_served", DoubleType, true))
)
val pagecount = sqlContext.read.format("csv")
.option("delimiter"," ").option("quote","")
.option("header", "true")
.schema(customSchema)
.load("dbfs:/databricks-datasets/wikipedia-datasets/data-001/pagecounts/sample/pagecounts-20151124-170000")
这篇关于在读取csv文件作为数据框时提供架构的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!