API compatibility between Scala and Python?

Question

I have read a dozen pages of docs, and it seems that:


  1. I can skip learning the Scala part
  2. the API is completely implemented in Python (I don't need to learn Scala for anything)
  3. the interactive mode works as completely and as quickly as the Scala shell, and troubleshooting is equally easy
  4. Python modules like numpy will still be imported (no crippled Python environment)

Are there fall-short areas that will make it impossible?

Answer

In recent Spark releases (1.0+), we've implemented all of the missing PySpark features listed below. A few new features are still missing, such as Python bindings for GraphX, but the other APIs have achieved near parity (including an experimental Python API for Spark Streaming).
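
For reference, here is a minimal sketch of what that experimental streaming API looks like: a socket word count, assuming Spark 1.2+ (where the Python streaming bindings first appeared), with the host and port as placeholders.

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    # One-second micro-batches over a TCP text stream; the host/port
    # are placeholders for a real source.
    sc = SparkContext(appName="StreamingSketch")
    ssc = StreamingContext(sc, 1)

    lines = ssc.socketTextStream("localhost", 9999)
    counts = (lines.flatMap(lambda line: line.split())
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b))
    counts.pprint()

    ssc.start()
    ssc.awaitTermination()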

My earlier answers are reproduced below:

A lot has changed in the seven months since my original answer (reproduced at the bottom of this answer):

  • Spark 0.7.3 fixed the "forking JVMs with large heaps" issue.
  • Spark 0.8.1 added support for persist(), sample(), and sort() (exercised in the sketch after this list).
  • The upcoming Spark 0.9 release adds partial support for custom Python -> Java serializers.
  • Spark 0.9 also adds Python bindings for MLLib (docs).
  • I've implemented tools to help keep the Java API up-to-date.
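
A quick sketch exercising the persist(), sample(), and sort-related calls from the list above, assuming the `sc` that the pyspark shell provides (sorting goes through sortByKey() in PySpark):

    from pyspark import StorageLevel

    rdd = sc.parallelize(range(1000))
    rdd.persist(StorageLevel.MEMORY_ONLY)      # persist() at an explicit level
    subset = rdd.sample(False, 0.1, 42)        # sample() without replacement
    pairs = subset.map(lambda x: (x % 10, x))
    print(pairs.sortByKey().take(5))           # sorting via sortByKey()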

As of Spark 0.9, the main missing features in PySpark are:

  • zip() / zipPartitions (a possible workaround is sketched after this list).
  • Support for reading and writing non-text input formats, like Hadoop SequenceFile (there's an open pull request for this).
  • Support for running on YARN clusters.
  • Cygwin support (PySpark works fine under Windows PowerShell or cmd.exe, though).
  • Support for job cancellation.
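
In the meantime, zip() can be approximated by hand. Below is one hedged sketch (the helper names are mine, not PySpark's): give every element a global index computed from per-partition counts, then join the two indexed RDDs. It assumes both RDDs have the same number of elements; note that older releases spelled mapPartitionsWithIndex as mapPartitionsWithSplit.

    def with_indices(rdd):
        # Count elements per partition, turn counts into offsets, then
        # tag each element with its global position.
        counts = rdd.mapPartitionsWithIndex(
            lambda i, it: [(i, sum(1 for _ in it))]).collect()
        offsets, total = {}, 0
        for i, c in sorted(counts):
            offsets[i] = total
            total += c
        def tag(i, it):
            for j, x in enumerate(it):
                yield (offsets[i] + j, x)
        return rdd.mapPartitionsWithIndex(tag)

    def zip_rdds(left, right):
        # Pair elements by position, like Scala's zip(); assumes equal
        # element counts in both RDDs.
        return with_indices(left).join(with_indices(right)) \
                                 .sortByKey().values()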

Although we've made many performance improvements, there's still a performance gap between Spark's Scala and Python APIs. The Spark users mailing list has an open thread discussing its current performance.

If you discover any missing features in PySpark, please open a new ticket on our JIRA issue tracker (https://spark-project.atlassian.net/issues/?jql=project%20%3D%20SPARK%20AND%20resolution%20%3D%20Unresolved%20AND%20component%20%3D%20PySpark%20ORDER%20BY%20priority%20DESC).

The Spark Python Programming Guide has a list of missing PySpark features. As of Spark 0.7.2, PySpark is currently missing support for sample(), sort(), and persistence at different StorageLevels. It's also missing a few convenience methods added to the Scala API.

The Java API was in sync with the Scala API when it was released, but a number of new RDD methods have been added since then and not all of them have been added to the Java wrapper classes. There's a discussion about how to keep the Java API up-to-date at https://groups.google.com/d/msg/spark-developers/TMGvtxYN9Mo/UeFpD17VeAIJ. In that thread, I suggested a technique for automatically finding missing features, so it's just a matter of someone taking the time to add them and submit a pull request.
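
The thread doesn't pin down the technique, but its spirit is a set difference over method names. A toy illustration (the method sets below are hand-written stand-ins; a real check would collect them from the Scala and Java classes via reflection):

    # Stand-in method sets; reflection over RDD and JavaRDD would
    # produce the real ones.
    scala_rdd = {"map", "filter", "sample", "persist", "zip", "zipPartitions"}
    java_rdd = {"map", "filter", "sample", "persist"}

    missing = sorted(scala_rdd - java_rdd)
    print("Not yet wrapped in the Java API:", missing)
    # -> ['zip', 'zipPartitions']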

Regarding performance, PySpark is going to be slower than Scala Spark. Part of the performance difference stems from a weird JVM issue when forking processes with large heaps, but there's an open pull request that should fix that. The other bottleneck comes from serialization: right now, PySpark doesn't require users to explicitly register serializers for their objects (we currently use binary cPickle plus some batching optimizations). In the past, I've looked into adding support for user-customizable serializers that would allow you to specify the types of your objects and thereby use specialized serializers that are faster; I hope to resume work on this at some point.
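
To make the "cPickle plus some batching" point concrete, here is a conceptual sketch of the batching idea (not PySpark's actual wire format): pickling a batch of elements as a single object amortizes the per-object overhead.

    import pickle  # PySpark used cPickle on Python 2; pickle is the Python 3 name

    def dump_batched(items, batch_size=1024):
        # Pickle elements in batches so per-call overhead is paid once
        # per batch instead of once per element.
        batch = []
        for item in items:
            batch.append(item)
            if len(batch) == batch_size:
                yield pickle.dumps(batch, pickle.HIGHEST_PROTOCOL)
                batch = []
        if batch:
            yield pickle.dumps(batch, pickle.HIGHEST_PROTOCOL)

    def load_batched(chunks):
        # Inverse: unpickle each batch and stream its elements back.
        for chunk in chunks:
            for item in pickle.loads(chunk):
                yield item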

PySpark is implemented using a regular cPython interpreter, so libraries like numpy should work fine (this wouldn't be the case if PySpark was written in Jython).
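
A minimal sketch of what that means in practice, assuming a SparkContext `sc`: numpy objects pass straight through RDD operations because the workers are ordinary CPython processes.

    import numpy as np

    vectors = sc.parallelize([np.array([3.0, 4.0]), np.array([6.0, 8.0])])
    norms = vectors.map(lambda v: float(np.linalg.norm(v)))
    print(norms.collect())  # [5.0, 10.0]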

It's pretty easy to get started with PySpark; simply downloading a pre-built Spark package and running the pyspark interpreter should be enough to test it out on your personal computer, and it will let you evaluate its interactive features. If you'd like to use IPython, you can use IPYTHON=1 ./pyspark in your shell to launch PySpark with an IPython shell.
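
Once the shell is up, a classic word count makes a reasonable first interactive test; README.md here stands in for any local text file.

    # Inside the pyspark shell, which provides `sc` automatically.
    lines = sc.textFile("README.md")
    counts = (lines.flatMap(lambda line: line.split())
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b))
    print(counts.take(5))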
