Is it possible to read a .csv from a remote server, using Paramiko and Dask's read_csv() method in conjunction?


Problem description


Today I began using the Dask and Paramiko packages, partly as a learning exercise, and partly because I'm beginning a project that will require dealing with large datasets (10s of GB) that must be accessed from a remote VM only (i.e. cannot store locally).

The following piece of code belongs to a short helper program that will make a Dask dataframe from a large csv file hosted on the VM. I want to later pass its output (a reference to the Dask dataframe) to a second function that will perform some overview analysis on it.

import dask.dataframe as dd
import paramiko as pm
import pandas as pd
import sys

def remote_file_to_dask_dataframe(remote_path):

   if isinstance(remote_path, (str)):
      try:
         client = pm.SSHClient()
         client.load_system_host_keys()
         client.connect('#myserver', username='my_username', password='my_password')
         sftp_client = client.open_sftp()
         remote_file = sftp_client.open(remote_path)
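         # this is the line that fails (raises TypeError, see below)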
         df = dd.read_csv(remote_file)
         remote_file.close()
         sftp_client.close()
         return df 
      except:
         print("An error occurred.")
         sftp_client.close()
         remote_file.close()
   else:
      raise ValueError("Path to remote file as string required")

The code is neither nice nor complete, and I will replace the username and password with ssh keys in time, but this is not the issue. In a Jupyter notebook, I've previously opened the sftp connection with a path to a file on the server, and read it into a dataframe with a regular Pandas read_csv call. However, here the equivalent line, using Dask, is the source of the problem: df = dd.read_csv(remote_file).
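
For reference, a minimal sketch of that working Pandas call ('#myserver', 'my_username' and 'my_password' are the same placeholders as above; the csv path is hypothetical):

import pandas as pd
import paramiko as pm

client = pm.SSHClient()
client.load_system_host_keys()
client.connect('#myserver', username='my_username', password='my_password')
sftp_client = client.open_sftp()
remote_file = sftp_client.open('/data/large_file.csv')  # hypothetical path
df = pd.read_csv(remote_file)  # Pandas accepts the file-like SFTPFile object
remote_file.close()
sftp_client.close()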

I've looked at the documentation online (here), but I can't tell whether what I'm trying above is possible. It seems that, for networked options, Dask wants a URL. The parameter-passing options for, e.g., S3 appear to depend on that infrastructure's backend. I unfortunately cannot make any sense of the dask-ssh documentation (here).

I've poked around with print statements, and the only line that fails to execute is the one stated. The error raised is:

raise TypeError('url type not understood: %s' % urlpath)
TypeError: url type not understood:

Can anybody point me in the right direction for achieving what I'm trying to do? I'd expected Dask's read_csv to function as Pandas' does, since it's built on top of Pandas.

I'd appreciate any help, thanks.

p.s. I'm aware of Pandas' read_csv chunksize option, but I would like to achieve this through Dask, if possible.

Solution

In the master version of Dask, file-system operations now use fsspec, which, along with the previous implementations (s3, gcs, hdfs), supports some additional file-systems; see fsspec.registry.known_implementations for the mapping of protocol identifiers to implementations.
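
The protocols your installed fsspec supports can be inspected directly (a quick sketch; 'sftp' should be among the keys on a sufficiently recent install):

import fsspec

# known_implementations maps protocol identifiers ('s3', 'gcs', 'sftp', ...)
# to the file-system class that implements each one
print(sorted(fsspec.registry.known_implementations))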

In short, a URL like "sftp://user:pw@host:port/path" should now work for you, if you install fsspec and Dask from master.
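
For example, a minimal sketch under that setup (server name, credentials and path are placeholders; fsspec's SFTP implementation forwards the storage_options keywords to paramiko's connect() call):

import dask.dataframe as dd

# credentials can be embedded in the URL itself, as in the answer above,
# or passed separately via storage_options to keep them out of the path
df = dd.read_csv(
    'sftp://my_username@myserver:22/data/large_file.csv',  # placeholder URL
    storage_options={'password': 'my_password'},
)
print(df.head())  # triggers a real read over SFTP

Calling df.head() forces a small amount of data to actually be read over SFTP, which is a cheap way to confirm the connection works before running the full analysis.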
