Python with Numpy/Scipy vs. Pure C++ for Big Data Analysis

Question

Working in Python on relatively small projects has made me appreciate the language's dynamic typing (no declaration code needed to keep track of types), which often makes for a quicker and less painful development process. However, I feel that in much larger projects this may actually be a hindrance, as the code would run slower than, say, its equivalent in C++. Then again, using Numpy and/or Scipy with Python may get your code to run just as fast as a native C++ program (where the C++ code would sometimes take longer to develop).

I am posting this question after reading Justin Peel's comment on the thread "Is Python faster and lighter than C++?", where he states: "Also, people who speak of Python being slow for serious number crunching haven't used the Numpy and Scipy modules. Python is really taking off in scientific computing these days. Of course, the speed comes from using modules written in C or libraries written in Fortran, but that's the beauty of a scripting language in my opinion." Or, as S. Lott writes on the same thread regarding Python: "...Since it manages memory for me, I don't have to do any memory management, saving hours of chasing down core leaks." I also looked at a related Python/Numpy/C++ performance question, "Benchmarking (python vs. c++ using BLAS) and (numpy)", where J.F. Sebastian writes: "...There is no difference between C++ and numpy on my machine."

Both of these threads got me wondering whether knowing C++ confers any real advantage on a Python programmer who uses Numpy/Scipy to produce software for analyzing "big data", where performance is obviously of great importance (but code readability and development speed are also a must).

Note: I'm especially interested in handling huge text files, on the order of 100K-800K lines with multiple columns, where Python could take a good five minutes to analyze a file "only" 200K lines long.

Answer

First off, if the bulk of your "work" comes from processing huge text files, that often means that your only meaningful speed bottleneck is your disk I/O speed, regardless of programming language.
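One way to check whether disk I/O really is the bottleneck for a given file is to time a raw read separately from a read that also parses every field. The following is only a rough sketch using the standard library: the tab-separated sample file, its size, and the helper names (`make_sample_file`, `time_read_only`, `time_read_and_parse`) are assumptions made for illustration, not anything from the original thread. Note that a file that was just written is usually served from the OS cache, so the read-only figure will look better than a cold read from disk.

```python
import os
import tempfile
import time


def make_sample_file(path, n_rows, n_cols=5):
    """Write a tab-separated text file of numbers (hypothetical sample data)."""
    with open(path, "w") as f:
        for i in range(n_rows):
            f.write("\t".join(str(i * n_cols + j) for j in range(n_cols)) + "\n")


def time_read_only(path):
    """Seconds spent just pulling the bytes in, with no parsing at all."""
    start = time.perf_counter()
    with open(path, "rb") as f:
        while f.read(1 << 20):  # read in 1 MiB chunks and discard the data
            pass
    return time.perf_counter() - start


def time_read_and_parse(path):
    """Seconds spent reading and converting every field to float, plus row count."""
    start = time.perf_counter()
    n_rows = 0
    with open(path) as f:
        for line in f:
            _ = [float(x) for x in line.split("\t")]
            n_rows += 1
    return time.perf_counter() - start, n_rows


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "sample.tsv")
        make_sample_file(path, 200_000)  # roughly the "200K lines" case from the question
        io_s = time_read_only(path)
        total_s, rows = time_read_and_parse(path)
        print(f"{rows} rows: read only {io_s:.3f}s, read + parse {total_s:.3f}s")
```

If the two numbers are close, the language is unlikely to matter much; if parsing dominates, the time is going into per-line Python work rather than into the disk.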


As to the core question, it's probably too opinion-rich to "answer", but I can at least give you my own experience. I've been writing Python to do big data processing (weather and environmental data) for years. I have never once encountered significant performance problems due to the language.

Something that developers (myself included) tend to forget is that once the process runs fast enough, it's a waste of company resources to spend time making it run any faster. Python (using mature tools like pandas/scipy) runs fast enough to meet the requirements, and it's fast to develop, so for my money, it's a perfectly acceptable language for "big data" processing.
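To make the "mature tools" point concrete, here is a small sketch that computes per-column means of a whitespace-delimited text table twice: once with a plain Python loop and once by handing the parsing and arithmetic to NumPy. The three-column sample and both function names are made up for the example. `np.loadtxt` is used only because it keeps the sketch short; for large files `pandas.read_csv` is generally the faster reader.

```python
import io

import numpy as np


def column_means_pure_python(text):
    """Mean of each column using only built-in types."""
    sums, count = None, 0
    for line in io.StringIO(text):
        values = [float(x) for x in line.split()]
        if not values:
            continue  # skip blank lines
        if sums is None:
            sums = [0.0] * len(values)
        for j, v in enumerate(values):
            sums[j] += v
        count += 1
    return [s / count for s in sums]


def column_means_numpy(text):
    """Same result, with parsing and arithmetic delegated to NumPy."""
    # loadtxt splits on whitespace by default; ndmin=2 keeps a 2-D array
    # even when the input has a single row.
    data = np.loadtxt(io.StringIO(text), ndmin=2)
    return data.mean(axis=0)


if __name__ == "__main__":
    sample = "1 2 3\n4 5 6\n7 8 9\n"  # hypothetical three-column file contents
    print(column_means_pure_python(sample))
    print(column_means_numpy(sample))
```

The NumPy version is shorter, and once the data is in an array, further column-wise operations run in compiled code instead of the interpreter loop.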
