Access element of a vector in a Spark DataFrame (Logistic Regression probability vector)

Views: 162 Published: 2020/9/4 0:11:58 Tags: python apache-spark pyspark spark-dataframe apache-spark-ml

Question

I trained a LogisticRegression model in PySpark (ML package), and the prediction result is a PySpark DataFrame (cv_predictions_prod) (see [1]). The probability column (see [2]) is of vector type (see [3]).

[1]
type(cv_predictions_prod)
pyspark.sql.dataframe.DataFrame

[2]
cv_predictions_prod.select('probability').show(10, False)
+----------------------------------------+
|probability                             |
+----------------------------------------+
|[0.31559134817066054,0.6844086518293395]|
|[0.8937864350711228,0.10621356492887715]|
|[0.8615878905395029,0.1384121094604972] |
|[0.9594427633777901,0.04055723662220989]|
|[0.5391547673698157,0.46084523263018434]|
|[0.2820729747752462,0.7179270252247538] |
|[0.7730465873083118,0.22695341269168817]|
|[0.6346585276598942,0.3653414723401058] |
|[0.6346585276598942,0.3653414723401058] |
|[0.637279255218404,0.362720744781596]   |
+----------------------------------------+
only showing top 10 rows

[3]
cv_predictions_prod.printSchema()
root
 ...
 |-- rawPrediction: vector (nullable = true)
 |-- probability: vector (nullable = true)
 |-- prediction: double (nullable = true)

How do I parse the vector column of the PySpark DataFrame so that I can create a new column containing just the first element of each probability vector?

This question is similar to the ones listed below, but the solutions given there didn't work or weren't clear to me:

How to access the values of denseVector in PySpark

How to access element of a VectorUDT column in a Spark DataFrame?

Recommended answer

Update:

It seems there is a bug in Spark that prevents you from accessing individual elements of a dense vector in a select statement. Normally you should be able to access them just as you would the elements of a numpy array, but when running the code posted previously (the original answer below), you may get the error pyspark.sql.utils.AnalysisException: "Can't extract value from probability#12;"

So, one way to work around this silly bug is to use a udf. As in the other question, you can define a udf in the following way:

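from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

# Vector elements are 64-bit doubles, so DoubleType avoids the precision loss of FloatType.
# float() converts the numpy.float64 element into a plain Python float for the udf to return.
firstelement = udf(lambda v: float(v[0]), DoubleType())
cv_predictions_prod.select(firstelement('probability')).show()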

Behind the scenes this still accesses the elements of the DenseVector the way it would a numpy array, but it doesn't raise the same error as before.
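To get a new column rather than a one-off select, the same udf can be attached with withColumn. The following is a minimal sketch, not a definitive implementation: the helper names first_element and add_first_probability and the output column name prob_0 are hypothetical, and DoubleType is chosen on the assumption that full double precision is wanted.

```python
def first_element(v):
    """Return the first entry of a vector-like value as a plain Python float."""
    # Null rows reach the udf as None; pass them through unchanged.
    if v is None:
        return None
    # float() turns the numpy.float64 element into a built-in float.
    return float(v[0])


def add_first_probability(df, input_col="probability", output_col="prob_0"):
    """Return df with an extra column holding element 0 of input_col.

    The column names are illustrative defaults, not part of any Spark API.
    """
    # Imported here so first_element can be tried out without a Spark install.
    from pyspark.sql.functions import udf
    from pyspark.sql.types import DoubleType

    first_element_udf = udf(first_element, DoubleType())
    return df.withColumn(output_col, first_element_udf(df[input_col]))
```

With the DataFrame from the question, this would be called as add_first_probability(cv_predictions_prod).select('probability', 'prob_0').show(5, False).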

Since this is getting a lot of upvotes, I figured I should strike through the incorrect portion of this answer.

~~Original answer: A dense vector is just a wrapper for a numpy array, so you can access its elements the same way you would access the elements of a numpy array.~~

There are several ways to access individual elements of an array in a dataframe. One is to explicitly reference the column cv_predictions_prod['probability'] in your select statement. By referencing the column explicitly, you can perform operations on it, such as selecting the first element of the array. For example:

cv_predictions_prod.select(cv_predictions_prod['probability'][0]).show()

~~should solve the problem.~~
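On Spark 3.0 or later, a different technique from the udf is available: pyspark.ml.functions.vector_to_array converts a vector column into an ordinary array column, and array columns do support the bracket indexing attempted in the struck-through answer. A hedged sketch, assuming Spark 3.0+; the helper name with_first_probability and the column name prob_0 are hypothetical:

```python
def with_first_probability(df, input_col="probability", output_col="prob_0"):
    """Return df with element 0 of a vector column extracted without a Python udf.

    Assumes Spark 3.0 or later; the column names are illustrative defaults.
    """
    # vector_to_array was added to pyspark.ml.functions in Spark 3.0.
    from pyspark.ml.functions import vector_to_array

    # Convert vector -> array<double>, then index the array column with [0].
    as_array = vector_to_array(df[input_col])
    return df.withColumn(output_col, as_array[0])
```

For the DataFrame in the question, with_first_probability(cv_predictions_prod).select('prob_0').show() would display the extracted values. Because the conversion runs inside the JVM, this avoids the per-row Python serialization that a udf involves.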
