R with tcltk/tcltk2: Improve slow performance when displaying big data.frame with TkTable?


Question

Please see the two edits below (added later)...

I have loaded a big data.frame into memory (2.7 million rows and 7 columns, 74 MB of RAM).

If I want to view the data using Tcl/Tk's Tktable widget via the tcltk2 package function tk2edit:

  • it takes over 15 minutes until the window with the data is displayed
  • and R (including Tcl/Tk) consumes about 7 GB of additional RAM (!)

Example:

library(tcltk2)

my.data.frame <- data.frame(ID=1:2600000,
                            col1=rep(LETTERS,100000),
                            col2=rep(letters,1E5),
                            col3=26E5:1)       # about 40 MB of data

tk2edit(my.data.frame)

The basic problem seems to be that each cell of the data.frame must be loaded into a Tcl array via two nested loops (see the code in this tktable question).

The tcltk2 package's function tk2edit works the same way; over-simplified:

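# my.data.frame contains a lot of rows...
tclarray1 <- tclArray()
for (i in seq_len(nrow(my.data.frame))) {
        for (j in seq_len(ncol(my.data.frame))) {
                # R indices are 1-based, Tktable columns are 0-based;
                # table row 0 is left free for the column titles
                tclarray1[[i, j - 1]] <- as.character(my.data.frame[i, j])
        }
}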

Question: Is there any way to optimize displaying big data.frames with tktable, e.g. by avoiding the nested loops? I just want to view the data (no editing required)...

tktable has the -variable option, where you can set the Tcl array variable that contains ALL the data of the table. So we "only" have to find a way to create a Tcl array from an R data.frame with "one call to Tcl from R"...
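One way the "one call" idea could be approached is sketched below. It is only a sketch under assumptions, not a tested solution for 2.7 million rows: the helper name flatten_for_tcl_array and the Tcl array name tableData are made up for illustration, and the single `array set` call relies on tcl() and as.tclObj() from the tcltk package passing a character vector to Tcl as a list. The flattening itself is plain vectorized R. Note that this would only remove the per-cell round trips between R and Tcl; the Tcl array would still need the same amount of memory.

```r
# Hypothetical helper: flatten a data.frame into alternating "row,col" keys and
# cell values, which is the list layout that Tcl's `array set` command expects.
flatten_for_tcl_array <- function(df) {
  n_row <- nrow(df)
  n_col <- ncol(df)
  # Convert every column to character in one vectorized step per column.
  body <- do.call(cbind, lapply(df, as.character))
  # Row 0 holds the column names (title row); rows 1..n_row hold the data.
  cells <- rbind(names(df), body)
  # Keys in column-major order, matching as.vector(cells); Tktable is 0-based.
  keys <- paste(rep(0:n_row, times = n_col),
                rep(seq_len(n_col) - 1L, each = n_row + 1L),
                sep = ",")
  # Interleave keys and values: key1, value1, key2, value2, ...
  as.vector(rbind(keys, as.vector(cells)))
}

small <- data.frame(ID = 1:2, col1 = c("A", "B"), stringsAsFactors = FALSE)
kv <- flatten_for_tcl_array(small)

# Assumption: a single `array set` call fills the whole Tcl array at once.
# Guarded so the sketch also runs where tcltk is not available.
if (interactive() && requireNamespace("tcltk", quietly = TRUE)) {
  tcltk::tcl("array", "set", "tableData", tcltk::as.tclObj(kv))
}
```

The resulting Tcl array name could then be handed to the table widget through its -variable option, in the same way the Tcl benchmark further down uses -variable data.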

PS: This is not a problem of the tcltk2 package, but seems to be a general problem of how to "bulk load" the data of a data.frame into Tcl variables...

PS2: The good thing is that Tktable seems to be able to display that much data efficiently (I can scroll and even edit cells without noticing any severe delays).

I have prepared a simple benchmark in Tcl/Tk to measure the execution time and memory consumption of filling a similar Tktable:

#!/usr/bin/env wish

package require Tktable

set rows 2700000
set columns 4

for {set row 0} {$row <= $rows} {incr row} {
  for {set column 0} {$column < $columns} {incr column} {
    if {$row == 0} {
      set data($row,$column) Titel$column
    } else {
      set data($row,$column) R${row}C${column}
    }
  }
}

ttk::frame .fr

table .fr.table -rows $rows -cols $columns -titlerows 1 -titlecols 0 -height 5 -width 25 -rowheight 1 -colwidth 9 -maxheight 100 -maxwidth 400 -selectmode extended -variable data -xscrollcommand {.fr.xscroll set} -yscrollcommand {.fr.yscroll set}

scrollbar .fr.xscroll -command {.fr.table xview} -orient horizontal
scrollbar .fr.yscroll -command {.fr.table yview}

pack .fr -fill both -expand 1
pack .fr.xscroll -side bottom -fill x
pack .fr.yscroll -side right -fill y
pack .fr.table -side right -fill both -expand 1

Results:

  • Memory consumption: 3.2 GB
  • Time until the table is displayed: 15 seconds

Conclusion: Tcl/Tk arrays waste memory, but the performance is very good (the 15-minute runtime when using R with tcltk seems to be caused by the communication overhead between R and Tcl/Tk).

Test setup: Ubuntu 14.04 64-bit, 16 GB RAM...

To compare the memory consumption of Tktable to that of ttk::treeview, I wrote this code:

#!/usr/bin/env wish
set rows 2700000
set columns 4
set data {}
set colnames {}
for {set i 0} {$i < $columns} {incr i} {
  lappend colnames Title$i
}
for {set row 0} {$row <= $rows} {incr row} {
  set newrow {}
  for {set column 0} {$column < $columns} {incr column} {
      lappend newrow R${row}C${column}
  }
  lappend data $newrow
}

ttk::treeview .tv -columns $colnames -show headings -yscrollcommand {.sbY set} -xscrollcommand {.sbX set}
foreach Element $data {
   .tv insert {} end -values $Element
}
foreach column $colnames {
  .tv heading $column -text $column
}
ttk::scrollbar .sbY -command {.tv yview}
ttk::scrollbar .sbX -command {.tv xview} -orient horizontal
pack .sbY -side right -fill y
pack .sbX -side bottom -fill x
pack .tv -side left -fill both

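Results:

  • Memory consumption: 2 GB (of which the data stored as a list takes 1.2 GB)
  • Time until the table is displayed: 15 seconds
  • For comparison: 10 million rows consume 7.2 GB of RAM, but selecting a row then takes several seconds (2 to 5) (possible reason: internal list traversal?)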

Conclusion:

  • The treeview is more memory efficient than Tktable since it can use a list instead of an array.
  • For bigger data sizes (more than a few million rows) the row selection is slow (the closer to the end, the slower!)

Recommended answer

I have found one possible solution/workaround: using Tktable in an "unbound" (command) mode.

With the -command option of Tktable you can specify a function that is called each time a cell is about to be displayed on the screen. This avoids loading all the data from R into Tcl at once, which improves the start-up time and significantly reduces the memory consumption caused by Tcl's way of storing arrays and lists.

This way, every time you scroll, a series of function calls is made to ask for the content of the visible cells.

It works for me even with over 10 million rows!

Drawback: Calling an R function that returns a Tcl variable for each cell is still far from efficient. When you scroll through a region for the first time, you can watch the cells being updated. Therefore I am still looking for a bulk data transfer solution between R and Tcl/Tk.

Any suggestions to improve the performance are welcome!

I have implemented a small demo (with 1 million rows and 21 columns, consuming 1.2 GB of RAM) and added some buttons to test different features (like caching).

Note: The long start-up time is caused by creating the underlying test data, NOT by Tktable!

library(tcltk)
library(data.table)

# Tktable example with -command ("unbound" mode) ---------------------------
# Doc: http://tktable.sourceforge.net/tktable/doc/tkTable.html

NUM.ROWS <- 1E6
NUM.COLS <- 20

# generate a big data.frame - this will take a while but is required for the demo
dt.data <- data.table(ID = 1:NUM.ROWS)

for (i in 1:NUM.COLS) {
  dt.data[, (paste("Col",i)) := paste0("R", 1:NUM.ROWS, " C", i)]
}

# Fill one cell with a long text containing special control characters to test the Tktable behaviour
dt.data[3,3 := "This is a long text with backslash \\ and \"quotes\"!"]

tclRequire("Tktable")

t <- tktoplevel()

tkwm.protocol(t, "WM_DELETE_WINDOW", function() tkdestroy(t))

# Function to return the current row and column as "calculated" value (without an underlying data "model")
calculated.data <- function(C) {
  # Function arguments  for Tcl "substitutions":
  # See:   http://tktable.sourceforge.net/tktable/doc/tkTable.html
  #   %c the column of the triggered cell.
  #   %C A convenience substitution for %r,%c.
  #   %i 0 for a read (get) and 1 for a write (set). Otherwise it is the current cursor position in the cell.
  #   %r the row of the triggered cell.
  return(tclVar(C))  # this does work!
}

# Function to return the content of a data.table for the current row and column
data.frame.data <- function(r, c) {
  if (r == "0")
    return(tclVar(names(dt.data)[as.integer(c)+1]))             # Title row 0 contains the column names
  else
    return(tclVar(as.character(dt.data[as.integer(r), as.integer(c)+1, with = FALSE])))   # Table row r (the title row is row 0) shows data row r
}

frame <- ttklabelframe(t, text = "Data:")
# Add the table to the window environment to ensure killing it when the window is closed (= no more phantom calls to the data command handler)
# Cache = TRUE: This greatly enhances speed performance when used with -command but uses extra memory.
t$env$table <- tkwidget(frame, "table", rows = NUM.ROWS, cols = NUM.COLS, titlerows = 1, selecttype = "cell", selectmode = "extended", command = calculated.data, cache = TRUE, yscrollcommand = function(...) tkset(scroll.y, ...), xscrollcommand = function(...) tkset(scroll.x, ...))

scroll.x <- ttkscrollbar(frame, orient = "horizontal", command=function(...) tkxview(t$env$table,...))  # command that performs the scrolling
scroll.y <- ttkscrollbar(frame, orient = "vertical", command=function(...) tkyview(t$env$table,...))  # command that performs the scrolling

buttons <- ttkframe(t)
btn.read.only <- ttkbutton(buttons, text = "make read only", command = function() tkconfigure(t$env$table, state = "disabled"))
btn.read.write <- ttkbutton(buttons, text = "make writable", command = function() tkconfigure(t$env$table, state = "normal"))
btn.clear.cache <- ttkbutton(buttons, text = "clear cache", command = function() tcl(t$env$table, "clear", "cache"))
btn.bind.data.frame <- ttkbutton(buttons, text = "Fill cells from R data.table",
                                 command = function() {
                                   # One extra table row is needed for the title row with the column names
                                   tkconfigure(t$env$table, command = data.frame.data, rows = nrow(dt.data) + 1, cols = ncol(dt.data), titlerows = 1)
                                   tcl(t$env$table, "clear", "cache")
                                   tkwm.title(t,"Cells are filled from an R data.table")
                                 })
btn.bind.calc.value <- ttkbutton(buttons, text = "Fill cells with calculated values",
                                 command = function() {
                                   tkconfigure(t$env$table, command = calculated.data, rows = 1E5, cols = 40)
                                   tcl(t$env$table, "clear", "cache")
                                   tkwm.title(t,"Cells are calculated values (to test the highest performance possible)")
                                 })

tkgrid(btn.read.only, row = 0, column = 1)
tkgrid(btn.read.write, row = 0, column = 2)
tkgrid(btn.clear.cache, row = 0, column = 3)
tkgrid(btn.bind.data.frame, row = 0, column = 5)
tkgrid(btn.bind.calc.value, row = 0, column = 6)

tkpack(frame, fill = "both", expand = TRUE)
tkpack(scroll.x, fill = "x", expand = FALSE, side = "bottom")
tkpack(scroll.y, fill = "y", expand = FALSE, side = "right")
tkpack(t$env$table, fill = "both", expand = TRUE, side = "left")
tkpack(buttons, side = "bottom")
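
One part of the per-cell cost named in the drawback is the lookup on the R side: the callback data.frame.data indexes the data.table with `[`, which has a noticeable fixed overhead per call. A minimal sketch of a cheaper lookup follows, assuming a hypothetical helper called cell_text that is not part of the original demo. It extracts the column vector with `[[` and then picks a single element. How much this helps in practice is untested here, and it does not remove the R to Tcl call overhead itself.

```r
# Hypothetical helper (not part of the original demo): return the text of one
# table cell, where row 0 is the title row and columns are 0-based as in Tktable.
cell_text <- function(data, r, c) {
  row <- as.integer(r)
  col <- as.integer(c) + 1L           # Tktable column 0 is R column 1
  if (row == 0L) {
    return(names(data)[col])          # title row shows the column names
  }
  # `[[` extracts the column vector directly and avoids the general-purpose
  # `[` method of data.frame/data.table for a single-cell lookup.
  as.character(data[[col]][row])
}

demo <- data.frame(ID = 1:3, Col1 = c("a", "b", "c"), stringsAsFactors = FALSE)
cell_text(demo, "0", "1")   # "Col1"
cell_text(demo, "2", "1")   # "b"
```

Inside the demo, the callback would wrap the result the same way as before, for example tclVar(cell_text(dt.data, r, c)).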
