Preserve index-string correspondence with Spark's StringIndexer
Question
Spark's StringIndexer is quite useful, but it's common to need to retrieve the correspondences between the generated index values and the original strings, and it seems like there should be a built-in way to accomplish this. I'll illustrate using this simple example from the Spark documentation:
from pyspark.ml.feature import StringIndexer
df = sqlContext.createDataFrame(
    [(0, "a"), (1, "b"), (2, "c"), (3, "a"), (4, "a"), (5, "c")],
    ["id", "category"])
indexer = StringIndexer(inputCol="category", outputCol="categoryIndex")
indexed_df = indexer.fit(df).transform(df)
This simplified case gives us:
+---+--------+-------------+
| id|category|categoryIndex|
+---+--------+-------------+
| 0| a| 0.0|
| 1| b| 2.0|
| 2| c| 1.0|
| 3| a| 0.0|
| 4| a| 0.0|
| 5| c| 1.0|
+---+--------+-------------+
All fine and dandy, but for many use cases I want to know the mapping between my original strings and the index labels. The simplest way I can think to do this off hand is something like this:
In [8]: indexed_df.select('category','categoryIndex').distinct().show()
+--------+-------------+
|category|categoryIndex|
+--------+-------------+
| b| 2.0|
| c| 1.0|
| a| 0.0|
+--------+-------------+
The result of which I could store as a dictionary or similar if I wanted:
In [12]: mapping = {row.categoryIndex: row.category for row in
                    indexed_df.select('category','categoryIndex').distinct().collect()}
In [13]: mapping
Out[13]: {0.0: u'a', 1.0: u'c', 2.0: u'b'}
My question is this: Since this is such a common task, and I'm guessing (but could of course be wrong) that the string indexer is somehow storing this mapping anyway, is there a way to accomplish the above task more simply?
My solution is more or less straightforward, but for large data structures this involves a bunch of extra computation that (perhaps) I can avoid. Ideas?
Accepted answer
The label mapping can be extracted from the column metadata:
meta = [
    f.metadata for f in indexed_df.schema.fields if f.name == "categoryIndex"
]
meta[0]
## {'ml_attr': {'name': 'category', 'type': 'nominal', 'vals': ['a', 'c', 'b']}}
where ml_attr.vals provides the mapping between position and label:
dict(enumerate(meta[0]["ml_attr"]["vals"]))
## {0: 'a', 1: 'c', 2: 'b'}
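One caveat worth noting when using this dictionary for lookups: the categoryIndex column produced by StringIndexer holds floats (0.0, 1.0, ...), while enumerate yields int keys, so a cast is needed. A minimal sketch, using the labels list from the example above:

```python
# The labels list as recovered from meta[0]["ml_attr"]["vals"] above
labels = ['a', 'c', 'b']
index_to_label = dict(enumerate(labels))

# categoryIndex values come back as floats, so cast to int before lookup
category_index = 2.0  # a value as it would appear in the indexed column
print(index_to_label[int(category_index)])  # -> 'b'
```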
Spark 1.6+
You can convert numeric values back to labels using IndexToString. This will use the column metadata shown above.
from pyspark.ml.feature import IndexToString
idx_to_string = IndexToString(
    inputCol="categoryIndex", outputCol="categoryValue")
idx_to_string.transform(indexed_df).drop("id").distinct().show()
## +--------+-------------+-------------+
## |category|categoryIndex|categoryValue|
## +--------+-------------+-------------+
## | b| 2.0| b|
## | a| 0.0| a|
## | c| 1.0| c|
## +--------+-------------+-------------+
Spark <= 1.5
It is a dirty hack, but you can simply extract labels from the underlying Java indexer as follows:
from pyspark.ml.feature import StringIndexerModel
# A simple monkey patch so we don't have to _call_java later
def labels(self):
    return self._call_java("labels")
StringIndexerModel.labels = labels
# Fit indexer model
indexer = StringIndexer(inputCol="category", outputCol="categoryIndex").fit(df)
# Extract mapping
mapping = dict(enumerate(indexer.labels()))
mapping
## {0: 'a', 1: 'c', 2: 'b'}
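Whichever route you take, once you have the labels list, both directions of the mapping follow from plain Python. A short sketch, again assuming the labels list from the example above:

```python
# Build both directions of the mapping from the recovered labels list
# ('a', 'c', 'b' are the illustrative values from the example above)
labels = ['a', 'c', 'b']
index_to_label = dict(enumerate(labels))                     # {0: 'a', 1: 'c', 2: 'b'}
label_to_index = {v: k for k, v in index_to_label.items()}   # {'a': 0, 'c': 1, 'b': 2}
```

The reverse direction is handy when you need to encode new categorical values consistently with an already-fitted indexer.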