Databricks: error copying a file to DBFS and reading it back when it is > 2GB


Problem description

I have a CSV of size 6GB. So far I have been using the following line; when I check the file's size on DBFS after this copy (done with Java IO), it still shows 6GB, so I assumed the copy was correct. But when I do spark.read.csv(samplePath) it reads only 18mn rows instead of 66mn.

Files.copy(Paths.get(_outputFile), Paths.get("/dbfs" + _outputFile))

So I tried dbutils to copy instead, as shown below, but it gives an error. I have added the Maven dbutils dependency and imported it in the object where I am calling this line. Is there anywhere else I should make a change in order to use dbutils in Scala code running on Databricks?

dbutils.fs.cp("file:" + _outputFile, _outputFile)
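(For reference: when dbutils is called from compiled Scala code rather than from a notebook cell, the dbutils-api library exposes it through a holder object. A minimal sketch, assuming the com.databricks dbutils-api artifact is on the classpath and the jar runs on a Databricks cluster; the object name and path handling are illustrative, not the asker's actual code:)

import com.databricks.dbutils_v1.DBUtilsHolder.dbutils

object CopyToDbfs {
  // Copy a file from the driver's local disk into DBFS.
  // Both arguments are URIs: the local side needs the file: scheme,
  // and the DBFS side can carry an explicit dbfs: scheme.
  def copy(outputFile: String): Boolean =
    dbutils.fs.cp("file:" + outputFile, "dbfs:" + outputFile)
}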

Databricks automatically assumes that when you do spark.read.csv(path) it should look for this path on DBFS by default. How can I make sure it reads this path from the driver's local file system instead of DBFS? I suspect the file copy is not actually copying all rows, because of the 2GB size limit when using Java IO with Databricks.

Can I use this?

spark.read.csv("file:/databricks/driver/sampleData.csv")

Any suggestions?

Thanks.

Recommended answer

Note: Local file I/O APIs only support files less than 2GB in size. If you use local file I/O APIs to read or write files larger than 2GB you might see corrupted files. Instead, access files larger than 2GB using the DBFS CLI, dbutils.fs, or Spark APIs.

When you're using Spark APIs, you reference files with "/mnt/training/file.csv" or "dbfs:/mnt/training/file.csv". If you're using local file APIs, you must provide the path under /dbfs, for example: "/dbfs/mnt/training/file.csv". You cannot use the /dbfs form of the path with Spark APIs.
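As a concrete illustration of the two conventions, using the /mnt/training/file.csv path from the paragraph above (a sketch, assuming the spark session that Databricks notebooks provide; not part of the original answer):

// Spark APIs: reference the file with a dbfs:/ (or bare /mnt/...) path.
val sparkDf = spark.read.csv("dbfs:/mnt/training/file.csv")

// Local file APIs: the same file is visible under the /dbfs FUSE mount.
val firstLine = scala.io.Source.fromFile("/dbfs/mnt/training/file.csv").getLines().next()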

There are multiple ways to solve this issue.

Option 1: You can use local file APIs to read and write to DBFS paths. Azure Databricks configures each cluster node with a FUSE mount that allows processes running on cluster nodes to read and write to the underlying distributed storage layer with local file APIs. For example:

Python:

#write a file to DBFS using python i/o apis
with open("/dbfs/tmp/test_dbfs.txt", 'w') as f:
  f.write("Apache Spark is awesome!\n")
  f.write("End of example!")

# read the file
with open("/dbfs/tmp/test_dbfs.txt", "r") as f_read:
  for line in f_read:
    print(line)

Scala:

import scala.io.Source

val filename = "/dbfs/tmp/test_dbfs.txt"
for (line <- Source.fromFile(filename).getLines()) {
  println(line)
}
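The write half of the example can be done in Scala through the same /dbfs mount with plain java.io, for instance (a sketch mirroring the Python snippet above, not part of the original answer):

import java.io.PrintWriter

// Write a file to DBFS through the local /dbfs FUSE mount.
val writer = new PrintWriter("/dbfs/tmp/test_dbfs.txt")
writer.write("Apache Spark is awesome!\n")
writer.write("End of example!")
writer.close()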

Option 2: Read large DBFS-mounted files using the Python APIs.

Move the file from dbfs:// to the local file system (file://), then read it with the Python API. For example:

  1. Copy the file from dbfs:// to file:// (for a programmatic equivalent, see the sketch after these steps):

%fs cp dbfs:/mnt/large_file.csv file:/tmp/large_file.csv

  2. Read the file with the pandas API:

import pandas as pd

pd.read_csv('file:/tmp/large_file.csv').head()
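The %fs cp magic in step 1 is a notebook command; the same copy can also be issued programmatically with dbutils.fs.cp (shown in Scala to match the question's code; a sketch, not part of the original answer):

// Copy the file from DBFS to the driver's local disk, then read it locally.
dbutils.fs.cp("dbfs:/mnt/large_file.csv", "file:/tmp/large_file.csv")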

Hope this helps.
