Parsing CSV from SFTP server too slow, how to improve efficiency?


Question

So I have an SFTP server that hosts a single CSV file that contains data about multiple courses. The data is in the following format (4 columns):

Activity Name,Activity Code,Completion Status,Full Name
Safety with Lasers, 3XX1, 10-Jul-20, "Person, Name"
Safety with Lasers, 3XX1, NaN, "OtherP, OtherName"
How to use wrench, 7NPA, 10-Aug-19, "OtherName, Person"
etc...

I am using Paramiko to access the file using the following code:

file = sftp.open('Data.csv')

But the issue I am having is that it is an SFTPFile type. How can I go about parsing the data from it? I need to extract the names of the courses and keep track of how many people have and have not completed each one. I am using the following code at the moment, but it is horrendously slow. Any suggestions would be appreciated:

import pandas

Courses = ['']
Total = [0]
Compl = [0]
i = 0  # index of the course currently being counted
csvreal = pandas.read_csv(file)
for index, row in csvreal.iterrows():
    # Render the row as one string and split it on double spaces (slow)
    string = csvreal.loc[[index]].to_string(index=False, header=False)
    if Courses[i] != string.split('  ')[0]:
        i += 1
        Courses.append(string.split('  ')[0])
        Total.append(0)
        Compl.append(0)
    if len(string.split('  ')[2]) > 3:  # Note that incomplete courses do not have completion date, so it is NaN
        Compl[i] += 1
    Total[i] += 1

I know it is very terrible, I'm new and have no idea what I am doing. Any advice on where to read up on relevant documentation or tutorials would be appreciated. Thank you!

Recommended answer

sftp.open opens the file on the remote server, so every read will take place over the network. This network traversal is very slow compared to reading from local disk. It would be more efficient to copy the file to your local machine using sftp.get, and then it can be read without incurring the overhead of traversing the network. If you need to update the file you can update the local copy and then copy back to the server with sftp.put.

The code would be something like this (untested, as I don't have an SFTP server to hand):

# Retrieve a local copy, then read it from disk.
# sftp.get returns None, so pass the local path to read_csv.
sftp.get('Data.csv', 'local-copy-Data.csv')
csvreal = pandas.read_csv('local-copy-Data.csv')

# Update remote
sftp.put('local-copy-Data.csv', 'Data.csv')
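
Separately from the transfer cost, the loop in the question formats every row as a string and splits it several times, which is likely to be slow on its own. Here is a minimal sketch of counting completions per course in a single pass, using only the standard library `csv` module. It assumes the column names from the sample in the question and that a blank or `NaN` completion status means "not completed". The function name `count_completions` and the inline sample data are illustrative and not part of the original post.

```python
import csv
import io
from collections import Counter


def count_completions(lines):
    """Return (total, completed) Counters keyed by course name."""
    total, completed = Counter(), Counter()
    # skipinitialspace=True drops the space after each comma, so that
    # ', "Person, Name"' is still parsed as one quoted field.
    for row in csv.DictReader(lines, skipinitialspace=True):
        course = row["Activity Name"]
        total[course] += 1
        # Assumption: a blank or "NaN" status means the course is not completed.
        status = (row["Completion Status"] or "").strip()
        if status not in ("", "NaN"):
            completed[course] += 1
    return total, completed


# Sample rows copied from the question, used here in place of the real file.
sample = io.StringIO(
    'Activity Name,Activity Code,Completion Status,Full Name\n'
    'Safety with Lasers, 3XX1, 10-Jul-20, "Person, Name"\n'
    'Safety with Lasers, 3XX1, NaN, "OtherP, OtherName"\n'
    'How to use wrench, 7NPA, 10-Aug-19, "OtherName, Person"\n'
)

total, completed = count_completions(sample)
for course in total:
    not_completed = total[course] - completed[course]
    print(course, completed[course], not_completed)
```

With the downloaded copy, the same function could be called on a file opened with `open('local-copy-Data.csv', newline='')`. If you would rather stay in pandas, grouping by `Activity Name` and aggregating `Completion Status` should give the same counts without an explicit loop.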
