Provide schema while reading csv file as a dataframe

Views: 27 | Published: 2021/11/12 5:29:22 | Tags: scala apache-spark dataframe apache-spark-sql spark-csv


Question

I am trying to read a csv file into a dataframe. I know what the schema of my dataframe should be, since I know my csv file. I am using the spark-csv package to read the file, and I am trying to specify the schema as shown below.

val pagecount = sqlContext.read.format("csv")
  .option("delimiter"," ").option("quote","")
  .option("schema","project: string ,article: string ,requests: integer ,bytes_served: long")
  .load("dbfs:/databricks-datasets/wikipedia-datasets/data-001/pagecounts/sample/pagecounts-20151124-170000")

But when I check the schema of the dataframe I created, it seems to have taken its own schema. Am I doing anything wrong? How can I make Spark pick up the schema I specified?

> pagecount.printSchema
root
|-- _c0: string (nullable = true)
|-- _c1: string (nullable = true)
|-- _c2: string (nullable = true)
|-- _c3: string (nullable = true)

Recommended answer

Try the code below; you do not need to specify the schema by hand. With inferSchema set to true, Spark infers the column types from the data in the csv file. Set header to true only if the first line of the file contains column names.

val pagecount = sqlContext.read.format("csv")
  .option("delimiter"," ").option("quote","")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("dbfs:/databricks-datasets/wikipedia-datasets/data-001/pagecounts/sample/pagecounts-20151124-170000")
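
If the file has no header row, as the _c0 to _c3 names in the question's output suggest, inference still works but the columns keep their default names. Below is a minimal sketch, assuming Spark 2.x or later and a local SparkSession; the two-line sample file and the name samplePath are made up for illustration. It infers the types and then renames the columns with toDF:

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// Local session for illustration only; on Databricks a session already exists as `spark`.
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("infer-schema-sketch")
  .getOrCreate()

// Hypothetical sample in the same space-delimited layout as the pagecounts file.
val samplePath = Files.createTempFile("pagecounts-sample", ".txt")
Files.write(samplePath, "en Main_Page 42 12345\nfr Accueil 7 678\n".getBytes("UTF-8"))

val inferred = spark.read.format("csv")
  .option("delimiter", " ")
  .option("quote", "")
  .option("inferSchema", "true") // requires an extra pass over the data
  .load(samplePath.toString)
  .toDF("project", "article", "requests", "bytes_served") // replaces _c0.._c3

inferred.printSchema()
```

Inference picks the narrowest type that fits the values it sees, so a column you expect to be long may come back as integer on a small file. Specifying the schema explicitly avoids that.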

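If you want to specify the schema manually, build a StructType and pass it to the reader's schema method. The csv reader has no option named "schema", so the string passed through option in the question was ignored: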

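import org.apache.spark.sql.types._

// Column names and types as listed in the question (bytes_served is a long).
val customSchema = StructType(Array(
  StructField("project", StringType, nullable = true),
  StructField("article", StringType, nullable = true),
  StructField("requests", IntegerType, nullable = true),
  StructField("bytes_served", LongType, nullable = true)
))

// Pass the schema with .schema(...) rather than .option(...).
val pagecount = sqlContext.read.format("csv")
  .option("delimiter"," ").option("quote","")
  .option("header", "true") // skips the first line; drop this if the file has no header row
  .schema(customSchema)
  .load("dbfs:/databricks-datasets/wikipedia-datasets/data-001/pagecounts/sample/pagecounts-20151124-170000")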
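
The question tried to pass the schema as a string. Since Spark 2.3, the reader's schema method also accepts a DDL-formatted string, which is close to that original attempt. Below is a sketch under the same assumptions as before: a local SparkSession and a made-up one-line sample file (ddlSamplePath is a hypothetical name). Note that DDL uses INT and BIGINT where the question wrote integer and long.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// Local session for illustration only; on Databricks a session already exists as `spark`.
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("ddl-schema-sketch")
  .getOrCreate()

// Hypothetical sample in the same space-delimited layout as the pagecounts file.
val ddlSamplePath = Files.createTempFile("pagecounts-ddl-sample", ".txt")
Files.write(ddlSamplePath, "en Main_Page 42 12345\n".getBytes("UTF-8"))

val withDdl = spark.read.format("csv")
  .option("delimiter", " ")
  .option("quote", "")
  .schema("project STRING, article STRING, requests INT, bytes_served BIGINT")
  .load(ddlSamplePath.toString)

withDdl.printSchema()
```

The StructType form is more verbose but lets you set nullability per field. The DDL string is shorter and easier to keep in configuration.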
