How to create R output that looks like a confusion matrix table


Problem description


I have two directories.

The first directory is named "model" and the second is named "test". Both directories contain the same list of files, but the contents differ. Each directory holds the same number of files: 37.

Here is an example of the content of one file from each directory.

First file, from the model directory:

File name: Model_A5B45

                               data
1  papaya | durian | orange | grapes
2                             orange
3                             grapes
4                    banana | durian
5                             tomato
6                     apple | tomato
7                              apple
8                        mangostine 
9                         strawberry
10                strawberry | mango

dput output:

structure(list(data = structure(c(7L, 6L, 4L, 3L, 10L, 2L, 1L, 
5L, 8L, 9L), .Label = c("apple", "apple | tomato", "banana | durian", 
"grapes", "mangostine ", "orange", "papaya | durian | orange | grapes", 
"strawberry", "strawberry | mango", "tomato"), class = "factor")), .Names = "data", class = "data.frame", row.names = c(NA, 
-10L))

Second file, from the test directory:

File name: Test_A5B45

                               data
1                             apple
2            orange | apple | mango
3                             apple
4                            banana
5                            grapes
6                            papaya
7                            durian
8 tomato | orange | papaya | durian

dput output:

structure(list(data = structure(c(1L, 5L, 1L, 2L, 4L, 6L, 3L, 
7L), .Label = c("apple", "banana", "durian", "grapes", "orange | apple | mango", 
"papaya", "tomato | orange | papaya | durian"), class = "factor")), .Names = "data", class = "data.frame", row.names = c(NA, 
-8L))

For each file in the test directory, I want to calculate the percentage of its rows that also appear in a file from the model directory (intersect) and the percentage that do not (except).

Here is my code for just two of the files (Model_A5B45 and Test_A5B45):

library(dplyr)

data_test <- read.csv("Test_A5B45")
data_model <- read.csv("Model_A5B45")
intersect <- semi_join(data_test,data_model)
except <- anti_join(data_test,data_model)
except_percentage <- (nrow(except)/nrow(data_test))*100
intersect_percentage <- (nrow(intersect)/nrow(data_test))*100
sprintf("%s/%s",intersect_percentage,except_percentage) 

Output: "37.5/62.5"

My question: I want to apply this code to all of the files (looping over both directories) so that the output looks like a confusion matrix.

Example of my expected output:

##             y
##              Model_A5B45       Model_A6B46    Model_A7B47
##   Test_A5B45     37.5/62.5          value         value
##   Test_A6B46      value             value         value
##   Test_A7B47      value             value         value

My attempt:

I've written code that processes all of the files, but I still don't know how to make the output look like a confusion matrix.

This is my code (I don't know whether it is efficient; I use a for loop):

f_performance_testing <- function(data_model_path, data_test_path){
  library(dplyr)
  data_model <- read.csv(data_model_path, header=TRUE)
  data_test <- read.csv(data_test_path, header=TRUE)
  intersect <- semi_join(data_test,data_model)
  except <- anti_join(data_test,data_model)
  except_percentage <- (nrow(except)/nrow(data_test))*100
  intersect_percentage <- (nrow(intersect)/nrow(data_test))*100

  return(list("intersect"=intersect_percentage,"except"=except_percentage))
}


for (model in model_list){
  for (test in test_list){
    result <- f_performance_testing(model,test)
    intersect_percentage <- round(result$intersect,3)
    except_percentage <- round(result$except,3)
    final_output <- sprintf("intersect : %s | except : %s",intersect_percentage,except_percentage) 
    cat(print(paste(substring(model,57),substring(test,56), final_output,sep=",")),file="outfile.txt",append=TRUE,"\n")
    print("Writing to file.......")
  }
}

The output is:

Model_A5B45,Test_A5B45, 37.5/62.5 
Model_A5B45,Test_A6B46, value
Model_A5B45,Test_A7B47, value
Model_A6B46,...... 
Model_A7B47,.....
...............
......
....

Can anyone help me transform this output into a table that looks like a confusion matrix?

Solution

This won't answer your question directly, but hopefully gives you enough information to arrive at your own solution.

I would recommend creating a function like the following:

myFun <- function(model, test, datasource) {
  model <- datasource[[model]]
  test <- datasource[[test]]
  paste(rev(mapply(function(x, y) (x/y)*100, 
                   lapply(split(test, test %in% model), length), 
                   length(test))), 
        collapse = "/")
}

This function is meant to be used with a two-column data.frame whose rows cover all combinations of "model" and "test" names. It expects the data themselves as character vectors (why work with a data.frame structure when a character vector would suffice?).

Here's an example of such a data.frame (other sample data is found at the end of the answer).

models <- c("model_1", "model_2", "model_3")
tests <- c("test_1", "test_2", "test_3")
A <- expand.grid(models, tests, stringsAsFactors = FALSE)

Next, create a named list of your models and tests. If you've read your data in using lapply, you likely have names to work with anyway.

dataList <- mget(c(models, tests))
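As a minimal illustration of what mget does here: it looks up objects by name and returns them as a list whose names are those same strings, which is what lets myFun index datasource by name. The objects x_1 and x_2 below are made up for this sketch.

```r
# Two made-up character vectors living in the current environment
x_1 <- c("A", "B")
x_2 <- c("C")

# mget() fetches them by name and returns a named list
lst <- mget(c("x_1", "x_2"))

names(lst)      # "x_1" "x_2"
lst[["x_2"]]    # "C"
```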

Now, calculate the relevant values. Here, we can use apply to cycle through each row and perform the relevant calculation.

A$value <- apply(A, 1, function(x) myFun(x[1], x[2], dataList))

Finally, you reshape the data from a "long" form to a "wide" form.

reshape(A, direction = "wide", idvar = "Var1", timevar = "Var2")
#      Var1 value.test_1 value.test_2 value.test_3
# 1 model_1        75/25          100        75/25
# 2 model_2        50/50        50/50    62.5/37.5
# 3 model_3    62.5/37.5        50/50    87.5/12.5
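One cell in the table above shows a lone 100. That happens because split() only returns the groups that actually occur, so when every test value (or no test value) is found in the model, myFun pastes a single number and the cell no longer says which side it refers to. Below is a hedged sketch of a variant that always reports both percentages. The name pctBoth is hypothetical and not part of the answer above; the two vectors are copied from the dput output in the question.

```r
# Hypothetical helper (not from the original answer): always returns
# "intersect/except", even when one of the two groups is empty.
pctBoth <- function(model, test) {
  hit <- mean(test %in% model) * 100   # share of test rows found in model
  paste(hit, 100 - hit, sep = "/")
}

# Values taken from the question's dput output
model_A5B45 <- c("papaya | durian | orange | grapes", "orange", "grapes",
                 "banana | durian", "tomato", "apple | tomato", "apple",
                 "mangostine ", "strawberry", "strawberry | mango")
test_A5B45 <- c("apple", "orange | apple | mango", "apple", "banana",
                "grapes", "papaya", "durian",
                "tomato | orange | papaya | durian")

pctBoth(model_A5B45, test_A5B45)   # "37.5/62.5"
pctBoth("A", "A")                  # "100/0"
pctBoth("A", "B")                  # "0/100"
```

Three of the eight test rows ("apple" twice and "grapes") occur in the model file, which reproduces the 37.5/62.5 figure from the question.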


Here's some sample data. Note that they are basic character vectors and not data.frames.

set.seed(1)
sets <- c("A", "A|B", "B", "C", "A|B|C", "A|C", "D", "A|D", "B|C", "B|D")

test_1 <- sample(sets, 8, TRUE)
model_1 <- sample(sets, 10, TRUE)
test_2 <- sample(sets, 8, TRUE)
model_2 <- sample(sets, 10, TRUE)
test_3 <- sample(sets, 8, TRUE)
model_3 <- sample(sets, 10, TRUE)

In a real world application, you would probably do something like:

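testFiles <- list.files(path = "path/to/test/files", full.names = TRUE)
modelFiles <- list.files(path = "path/to/model/files", full.names = TRUE)
# read one file and keep only its "data" column as a character vector
readData <- function(x) read.csv(x, stringsAsFactors = FALSE)$data
# name each list element after its file so it can be looked up by name
testList <- setNames(lapply(testFiles, readData), basename(testFiles))
modelList <- setNames(lapply(modelFiles, readData), basename(modelFiles))
dataList <- c(testList, modelList)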

But, this is pure speculation on my part based on what you've shared in your question as working code (for example, csv files with no file extension).
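To make the whole path from files to table concrete, here is a self-contained sketch under stated assumptions: it writes a few made-up, extension-less CSV files to a temporary folder, reads them back into named lists, and fills a character matrix with one row per test file and one column per model file, which is the layout asked for in the question. The helpers writeData, readDir and pctBoth, the file names and the data are all hypothetical.

```r
# Throwaway directories holding extension-less CSV files (made-up data)
base <- tempfile("cm_")
modelDir <- file.path(base, "model")
testDir  <- file.path(base, "test")
dir.create(modelDir, recursive = TRUE)
dir.create(testDir, recursive = TRUE)

writeData <- function(values, dir, name) {
  write.csv(data.frame(data = values), file.path(dir, name), row.names = FALSE)
}
writeData(c("A", "B", "C", "A|B"), modelDir, "Model_1")
writeData(c("A", "B", "D", "B|C"), modelDir, "Model_2")
writeData(c("A", "B", "D", "B|C"), testDir, "Test_1")
writeData(c("A", "A", "C", "D"), testDir, "Test_2")

# Read every file in a directory into a named list of character vectors
readDir <- function(dir) {
  files <- list.files(dir, full.names = TRUE)
  setNames(lapply(files, function(f) read.csv(f, stringsAsFactors = FALSE)$data),
           basename(files))
}
modelList <- readDir(modelDir)
testList  <- readDir(testDir)

# "intersect/except" percentages of one test vector against one model vector
pctBoth <- function(model, test) {
  hit <- mean(test %in% model) * 100
  paste(hit, 100 - hit, sep = "/")
}

# Rows are tests, columns are models
res <- sapply(modelList, function(m) sapply(testList, function(tst) pctBoth(m, tst)))
noquote(res)
```

With these made-up files, the Test_1 row holds 50/50 and 100/0, and the Test_2 row holds 75/25 in both columns. The nested sapply keeps the list names as row and column names, so no reshaping step is needed; with 37 files per directory the same call would produce a 37 by 37 matrix.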
