How to select multiple columns of a Dataset, given a list of column names?


Question

How can I select multiple columns of a Dataset ds in Spark 2.3 (Java) by passing a list as the argument?

For example, this works fine:

ds.select("col1","col2","col3").show();

However, this fails:

List<String> columns = Arrays.asList("col1","col2","col3");
ds.select(columns.toString()).show();
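A likely reason for the failure is that toString() collapses the whole list into one string, so select receives a single (nonexistent) column name rather than three. The sketch below uses only the JDK to show that string; the class and method names are hypothetical and not part of the original question.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical demo class: shows what select(columns.toString()) is actually given.
final class ToStringPitfall {

    // Returns the single string that would be passed to select as the column name.
    static String asSingleArgument(List<String> columns) {
        return columns.toString();
    }

    public static void main(String[] args) {
        List<String> columns = Arrays.asList("col1", "col2", "col3");
        // One string, brackets and commas included, not three column names.
        System.out.println(asSingleArgument(columns)); // prints [col1, col2, col3]
    }
}
```

Spark then looks for one column literally named "[col1, col2, col3]", which the Dataset does not have.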

Recommended answer

With Spark 2.4.0, one way to do this is to convert the List<String> to a Scala Seq<String> and pass it to selectExpr, which, per the Spark documentation, selects a set of SQL expressions (plain column names included).

If you want to use select with a Seq<String> instead, you have to remove the first column from your list and pass it as a separate first argument, because that overload takes the first column name on its own, followed by the remaining names.
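Splitting the list into its first element and the remainder can be sketched with plain JDK calls, as below. The class and method names are hypothetical; with Spark on the classpath, the two results would be passed on as ds.select(head, convertListToSeq(tail)), using the answer's own helper.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: splits a column list into "first" and "rest".
final class HeadTailSplit {

    // First column name; throws IndexOutOfBoundsException if the list is empty.
    static String head(List<String> columns) {
        return columns.get(0);
    }

    // Remaining column names as a view of the original list (may be empty).
    static List<String> tail(List<String> columns) {
        return columns.subList(1, columns.size());
    }

    public static void main(String[] args) {
        List<String> columns = Arrays.asList("InvoiceNo", "StockCode", "Description");
        System.out.println(head(columns)); // prints InvoiceNo
        System.out.println(tail(columns)); // prints [StockCode, Description]
    }
}
```

This avoids maintaining a second, hand-written list that omits the first column.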

Both versions are shown below.

Suppose you have the following .csv file:

InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-12-01 08:26:00,2.55,17850.0,United Kingdom
536365,71053,WHITE METAL LANTERN,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,2010-12-01 08:26:00,2.75,17850.0,United Kingdom
536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom

You can use this code to solve your issue:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

import java.util.Arrays;
import java.util.List;
import scala.collection.JavaConverters;
import scala.collection.Seq;


public class SparkJavaTest {
    public static SparkSession spark = SparkSession
            .builder()
            .appName("JavaSparkTest")
            .master("local")
            .getOrCreate();

    public static Seq<String> convertListToSeq(List<String> inputList) {
        return JavaConverters.asScalaIteratorConverter(inputList.iterator()).asScala().toSeq();
    }

    public static void main(String[] args) {
        Dataset<Row> ds = spark.read().option("header",true).csv("spark-file.csv");

        List<String> columns = Arrays.asList("InvoiceNo","StockCode","Description");

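        // Version 1: selectExpr accepts the whole list once it is converted to a Seq<String>.
        ds.selectExpr(convertListToSeq(columns)).show(false);

        // Version 2: select takes the first column name separately,
        // so the list passed as a Seq holds only the remaining columns.
        List<String> columns2 = Arrays.asList("StockCode","Description");

        ds.select("InvoiceNo", convertListToSeq(columns2)).show(false);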

    }
}
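As a possible alternative to the Scala Seq conversion, the list can be turned into a String[] and handed to a Java varargs overload. The sketch below covers only the JDK part; the class and method names are hypothetical. It assumes that Dataset exposes selectExpr(String...) and select(String, String...) to Java in your Spark version, which is worth confirming in the Javadoc before relying on it.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: converts a column list into an array usable as varargs.
final class VarargsColumns {

    // Copies the list into a new String[] of the same length and order.
    static String[] toArray(List<String> columns) {
        return columns.toArray(new String[0]);
    }

    public static void main(String[] args) {
        List<String> columns = Arrays.asList("InvoiceNo", "StockCode", "Description");
        String[] asArray = toArray(columns);
        System.out.println(asArray.length); // prints 3

        // Assumed usage with the Dataset<Row> ds from the answer (needs Spark on the classpath):
        //   ds.selectExpr(asArray).show(false);
    }
}
```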

Hope it helps :)
