reading csv in Julia is slow compared to Python


Question

Reading large text/CSV files in Julia takes a long time compared to Python. Here are the times to read a file that is 486.6 MB and has 153,895 rows and 644 columns.

Python 3.3 example

import pandas as pd
import time

start = time.time()
# Raw string keeps the backslash in the Windows path literal.
myData = pd.read_csv(r"C:\myFile.txt", sep="|", header=None, low_memory=False)
print(time.time() - start)

Output: 19.90

R 3.0.2 example

system.time(myData<-read.delim("C:/myFile.txt",sep="|",header=F,
   stringsAsFactors=F,na.strings=""))

Output:
User    System  Elapsed
181.13  1.07    182.32

Julia 0.2.0 (Julia Studio 0.4.4) example #1

using DataFrames
timing = @time myData = readtable("C:/myFile.txt",separator='|',header=false)

Output:
elapsed time: 80.35 seconds (10319624244 bytes allocated)

Julia 0.2.0 (Julia Studio 0.4.4) example #2

timing = @time myData = readdlm("C:/myFile.txt",'|',header=false)

Output:
elapsed time: 65.96 seconds (9087413564 bytes allocated)

  1. Julia is faster than R, but quite slow compared to Python. What can I do differently to speed up reading a large text file?

  2. A separate issue: the size in memory is 18× the on-disk file size in Julia, but only 2.5× for Python. In Matlab, which I have found to be the most memory-efficient for large files, it is 2× the on-disk file size. Is there a particular reason for the large in-memory size in Julia?
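The memory-to-disk ratio in point 2 can be measured directly on the Python side. Below is a minimal sketch using pandas; the frame shape, temp directory, and file name are illustrative, not from the original post:

```python
# Sketch: compare a DataFrame's in-memory footprint to its on-disk CSV size.
import os
import tempfile

import numpy as np
import pandas as pd

# Build a small numeric frame and write it as a '|'-separated file.
df = pd.DataFrame(np.random.rand(1000, 10))
path = os.path.join(tempfile.mkdtemp(), "myFile.txt")
df.to_csv(path, sep="|", header=False, index=False)

disk_bytes = os.path.getsize(path)
# deep=True also counts the payload of object (string) columns.
mem_bytes = df.memory_usage(deep=True).sum()
print(f"memory / disk ratio: {mem_bytes / disk_bytes:.2f}")
```

For an all-float frame the in-memory size is simply 8 bytes per value, so the ratio depends mostly on how verbosely the numbers were printed to disk; string-heavy data shifts the ratio the other way.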

Answer

The best answer is probably that I'm not as good a programmer as Wes.

In general, the code in DataFrames is much less well-optimized than the code in Pandas. I'm confident that we can catch up, but it will take some time as there's a lot of basic functionality that we need to implement first. Since there's so much that needs to be built in Julia, I tend to focus on doing things in three parts: (1) build any version, (2) build a correct version, (3) build a fast, correct version. For the work I do, Julia often doesn't offer any versions of essential functionality, so my work gets focused on (1) and (2). As more of the tools I need get built, it'll be easier to focus on performance.

As for memory usage, I think the answer is that we use a set of data structures when parsing tabular data that are much less efficient than those used by Pandas. If I knew the internals of Pandas better, I could list off the places where we're less efficient, but for now I'll just speculate that one obvious failing is that we're reading the whole dataset into memory rather than grabbing chunks from disk. This can certainly be avoided, and there are open issues for doing so. It's just a matter of time.
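The chunked-reading idea mentioned above is what pandas exposes through its `chunksize` option. A minimal sketch (file path and chunk size are made up for illustration):

```python
# Sketch: stream a '|'-separated file in fixed-size chunks instead of loading
# it whole, which caps peak memory at roughly one chunk of rows.
import os
import tempfile

import pandas as pd

# Write a 10,000-row sample file to parse.
path = os.path.join(tempfile.mkdtemp(), "myFile.txt")
pd.DataFrame({"a": range(10000), "b": range(10000)}).to_csv(
    path, sep="|", header=False, index=False
)

total_rows = 0
# With chunksize set, read_csv yields DataFrames of at most 2000 rows each.
reader = pd.read_csv(path, sep="|", header=None, chunksize=2000)
for chunk in reader:
    total_rows += len(chunk)

print(total_rows)  # → 10000
```

Each chunk can be aggregated or appended to an on-disk store, so the full dataset never has to be resident at once.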

On that note, the readtable code is fairly easy to read. The most certain way to make readtable faster is to whip out the Julia profiler and start fixing the performance flaws it uncovers.
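The measure-then-fix workflow the answer describes is not Julia-specific. As an analogue only (not the Julia profiler itself), here is the same loop in Python with `cProfile`, profiling a `read_csv` call on an illustrative temp file:

```python
# Sketch of a profiling workflow: record where a CSV read spends its time,
# then attack the most expensive calls first.
import cProfile
import io
import os
import pstats
import tempfile

import pandas as pd

# Write a small sample file to profile against.
path = os.path.join(tempfile.mkdtemp(), "myFile.txt")
pd.DataFrame({"a": range(5000), "b": range(5000)}).to_csv(
    path, sep="|", header=False, index=False
)

profiler = cProfile.Profile()
profiler.enable()
pd.read_csv(path, sep="|", header=None)
profiler.disable()

# Report the ten most expensive calls by cumulative time.
buf = io.StringIO()
pstats.Stats(profiler, stream=buf).sort_stats("cumulative").print_stats(10)
report = buf.getvalue()
print(report[:500])
```

In Julia the equivalent entry point is the standard-library `Profile` module's `@profile` macro; the workflow is the same: sort by cost, fix the top entries, re-measure.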
