在将 csv 文件作为数据帧读取时提供架构 [英] Provide schema while reading csv file as a dataframe

查看:27
本文介绍了在将 csv 文件作为数据帧读取时提供架构的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在尝试将 csv 文件读入数据帧.我知道我的数据框的架构应该是什么,因为我知道我的 csv 文件.我也使用 spark csv 包来读取文件.我试图指定如下所示的架构.

I am trying to read a csv file into a dataframe. I know what the schema of my dataframe should be since I know my csv file. Also I am using spark csv package to read the file. I trying to specify the schema like below.

val pagecount = sqlContext.read.format("csv")
  .option("delimiter"," ").option("quote","")
  .option("schema","project: string ,article: string ,requests: integer ,bytes_served: long")
  .load("dbfs:/databricks-datasets/wikipedia-datasets/data-001/pagecounts/sample/pagecounts-20151124-170000")

但是当我检查我创建的数据框的架构时,它似乎采用了自己的架构.我做错了什么吗?如何让 spark 获取我提到的架构?

But when I check the schema of the data frame I created, it seems to have taken its own schema. Am I doing anything wrong ? how to make spark to pick up the schema I mentioned ?

> pagecount.printSchema
root
|-- _c0: string (nullable = true)
|-- _c1: string (nullable = true)
|-- _c2: string (nullable = true)
|-- _c3: string (nullable = true)

推荐答案

试试下面的代码,你不需要指定架构.当您将 inferSchema 设为 true 时,它​​应该从您的 csv 文件中获取.

Try the below code, you need not specify the schema. When you give inferSchema as true it should take it from your csv file.

val pagecount = sqlContext.read.format("csv")
  .option("delimiter"," ").option("quote","")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("dbfs:/databricks-datasets/wikipedia-datasets/data-001/pagecounts/sample/pagecounts-20151124-170000")

如果您想手动指定架构,您可以按如下方式进行:

If you want to manually specify the schema, you can do it as below:

import org.apache.spark.sql.types._

val customSchema = StructType(Array(
  StructField("project", StringType, true),
  StructField("article", StringType, true),
  StructField("requests", IntegerType, true),
  StructField("bytes_served", DoubleType, true))
)

val pagecount = sqlContext.read.format("csv")
  .option("delimiter"," ").option("quote","")
  .option("header", "true")
  .schema(customSchema)
  .load("dbfs:/databricks-datasets/wikipedia-datasets/data-001/pagecounts/sample/pagecounts-20151124-170000")

这篇关于在将 csv 文件作为数据帧读取时提供架构的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆