有什么办法可以将Spark的Dataset.show()方法的输出作为字符串获取? [英] Is there any way to get the output of Spark's Dataset.show() method as a string?

查看:945
本文介绍了有什么办法可以将Spark的Dataset.show()方法的输出作为字符串获取?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

Spark Dataset.show()方法对于查看数据集的内容很有用,尤其是在调试时(它会打印出格式正确的表).据我所知,它只能打印到控制台,但是能够以字符串形式获取将很有用.例如,最好使用IntelliJ进行调试时能够将其写入日志或将其视为表达式的结果.

The Spark Dataset.show() method is useful for seeing the contents of a dataset, particularly for debugging (it prints out a nicely-formatted table). As far as I can tell, it only prints to the console, but it would be useful to be able to get this as a string. For example, it would be nice to be able to write it to a log, or see it as the result of an expression when debugging with, say, IntelliJ.

有什么方法可以将Dataset.show()的输出作为字符串获取?

Is there any way to get the output of Dataset.show() as a string?

推荐答案

sql包外部看不到show后面的相应方法.我已经采用了相应的方法并对其进行了更改,以便可以将数据框作为参数传递(代码取自Dataset.scala):

The corresponding method behind show isn't visible from outside the sql package. I've taken the corresponding method and changed it such that a dataframe can be passed as parameter (code taken from Dataset.scala) :

  def showString(df:DataFrame,_numRows: Int = 20, truncate: Int = 20): String = {
    val numRows = _numRows.max(0)
    val takeResult = df.take(numRows + 1)
    val hasMoreData = takeResult.length > numRows
    val data = takeResult.take(numRows)

    // For array values, replace Seq and Array with square brackets
    // For cells that are beyond `truncate` characters, replace it with the
    // first `truncate-3` and "..."
    val rows: Seq[Seq[String]] = df.schema.fieldNames.toSeq +: data.map { row =>
      row.toSeq.map { cell =>
        val str = cell match {
          case null => "null"
          case binary: Array[Byte] => binary.map("%02X".format(_)).mkString("[", " ", "]")
          case array: Array[_] => array.mkString("[", ", ", "]")
          case seq: Seq[_] => seq.mkString("[", ", ", "]")
          case _ => cell.toString
        }
        if (truncate > 0 && str.length > truncate) {
          // do not show ellipses for strings shorter than 4 characters.
          if (truncate < 4) str.substring(0, truncate)
          else str.substring(0, truncate - 3) + "..."
        } else {
          str
        }
      }: Seq[String]
    }

    val sb = new StringBuilder
    val numCols = df.schema.fieldNames.length

    // Initialise the width of each column to a minimum value of '3'
    val colWidths = Array.fill(numCols)(3)

    // Compute the width of each column
    for (row <- rows) {
      for ((cell, i) <- row.zipWithIndex) {
        colWidths(i) = math.max(colWidths(i), cell.length)
      }
    }

    // Create SeparateLine
    val sep: String = colWidths.map("-" * _).addString(sb, "+", "+", "+\n").toString()

    // column names
    rows.head.zipWithIndex.map { case (cell, i) =>
      if (truncate > 0) {
        StringUtils.leftPad(cell, colWidths(i))
      } else {
        StringUtils.rightPad(cell, colWidths(i))
      }
    }.addString(sb, "|", "|", "|\n")

    sb.append(sep)

    // data
    rows.tail.map {
      _.zipWithIndex.map { case (cell, i) =>
        if (truncate > 0) {
          StringUtils.leftPad(cell.toString, colWidths(i))
        } else {
          StringUtils.rightPad(cell.toString, colWidths(i))
        }
      }.addString(sb, "|", "|", "|\n")
    }

    sb.append(sep)

    // For Data that has more than "numRows" records
    if (hasMoreData) {
      val rowsString = if (numRows == 1) "row" else "rows"
      sb.append(s"only showing top $numRows $rowsString\n")
    }

    sb.toString()
  }

这篇关于有什么办法可以将Spark的Dataset.show()方法的输出作为字符串获取?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆