有什么办法可以将Spark的Dataset.show()方法的输出作为字符串获取? [英] Is there any way to get the output of Spark's Dataset.show() method as a string?
问题描述
Spark Dataset.show()
方法对于查看数据集的内容很有用,尤其是在调试时(它会打印出格式正确的表).据我所知,它只能打印到控制台,但是能够以字符串形式获取将很有用.例如,最好使用IntelliJ进行调试时能够将其写入日志或将其视为表达式的结果.
The Spark Dataset.show()
method is useful for seeing the contents of a dataset, particularly for debugging (it prints out a nicely-formatted table). As far as I can tell, it only prints to the console, but it would be useful to be able to get this as a string. For example, it would be nice to be able to write it to a log, or see it as the result of an expression when debugging with, say, IntelliJ.
有什么方法可以将Dataset.show()
的输出作为字符串获取?
Is there any way to get the output of Dataset.show()
as a string?
推荐答案
在sql
包外部看不到show
后面的相应方法.我已经采用了相应的方法并对其进行了更改,以便可以将数据框作为参数传递(代码取自Dataset.scala):
The corresponding method behind show
isn't visible from outside the sql
package. I've taken the corresponding method and changed it such that a dataframe can be passed as parameter (code taken from Dataset.scala) :
def showString(df:DataFrame,_numRows: Int = 20, truncate: Int = 20): String = {
val numRows = _numRows.max(0)
val takeResult = df.take(numRows + 1)
val hasMoreData = takeResult.length > numRows
val data = takeResult.take(numRows)
// For array values, replace Seq and Array with square brackets
// For cells that are beyond `truncate` characters, replace it with the
// first `truncate-3` and "..."
val rows: Seq[Seq[String]] = df.schema.fieldNames.toSeq +: data.map { row =>
row.toSeq.map { cell =>
val str = cell match {
case null => "null"
case binary: Array[Byte] => binary.map("%02X".format(_)).mkString("[", " ", "]")
case array: Array[_] => array.mkString("[", ", ", "]")
case seq: Seq[_] => seq.mkString("[", ", ", "]")
case _ => cell.toString
}
if (truncate > 0 && str.length > truncate) {
// do not show ellipses for strings shorter than 4 characters.
if (truncate < 4) str.substring(0, truncate)
else str.substring(0, truncate - 3) + "..."
} else {
str
}
}: Seq[String]
}
val sb = new StringBuilder
val numCols = df.schema.fieldNames.length
// Initialise the width of each column to a minimum value of '3'
val colWidths = Array.fill(numCols)(3)
// Compute the width of each column
for (row <- rows) {
for ((cell, i) <- row.zipWithIndex) {
colWidths(i) = math.max(colWidths(i), cell.length)
}
}
// Create SeparateLine
val sep: String = colWidths.map("-" * _).addString(sb, "+", "+", "+\n").toString()
// column names
rows.head.zipWithIndex.map { case (cell, i) =>
if (truncate > 0) {
StringUtils.leftPad(cell, colWidths(i))
} else {
StringUtils.rightPad(cell, colWidths(i))
}
}.addString(sb, "|", "|", "|\n")
sb.append(sep)
// data
rows.tail.map {
_.zipWithIndex.map { case (cell, i) =>
if (truncate > 0) {
StringUtils.leftPad(cell.toString, colWidths(i))
} else {
StringUtils.rightPad(cell.toString, colWidths(i))
}
}.addString(sb, "|", "|", "|\n")
}
sb.append(sep)
// For Data that has more than "numRows" records
if (hasMoreData) {
val rowsString = if (numRows == 1) "row" else "rows"
sb.append(s"only showing top $numRows $rowsString\n")
}
sb.toString()
}
这篇关于有什么办法可以将Spark的Dataset.show()方法的输出作为字符串获取?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!