Aggregating and mapping observations from an open questionnaire

Question

Summary

I want to create a boxplot like this one displaying the most frequent perceived problems in every neighborhood of a city.

Unfortunately, the boxplot is useless as it is, since the data I am using comes from an open questionnaire and it has two main problems:

  1. There are a lot of irrelevant answers (by irrelevant I mean those given by only one or a few people)
  2. There are problems that refer to the same concept but have been rephrased differently and thus are counted as something different.

In order to make it more useful I would like to aggregate irrelevant answers into a single group (e.g. "Other problems") and rename the problems that mean the same thing so they are worded identically and can be displayed properly in the barplot. Unfortunately I didn't succeed in doing so.

Detailed explanation

Let's take a look at some sample code (the names in the data frame are just examples: I have modified them for the sake of clarity, so it is easier to see that two or more problems are related, but the real terms can't always be deduced from a regular expression):

library(plyr)
library(dplyr)
library(tidyr)

df= read.csv("http://pastebin.com/raw/bUxANQw6")

problems = df %>%
  select(Problems) %>%
  gather(variable, value) %>%
  group_by(value) %>%
  summarise(Total = n()) %>%
  arrange(desc(Total))

This results in the following data frame:

> problems
Source: local data frame [27 x 2]

          value Total
1     Problem 1   282
2     Problem 3   268
3     Problem 2   186
4   No problems   160
5     Problem 4    76
6     Problem 5    68
7     Problem 6     6
8     Problem 7     5
9  Doesn't know     4
10    Problem 8     2
..          ...   ...
> 
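As an aside, the select/gather/group_by/summarise chain can probably be shortened. The following is a minimal sketch, assuming the responses sit in a single Problems column; the toy data frame and the names toy and problems_toy are made up for illustration. dplyr's count() tallies and sorts in one step:

```r
library(dplyr)

# Hypothetical toy data standing in for the survey responses
toy <- data.frame(
  Problems = c("Problem 1", "Problem 1", "Problem 3", "No problems", "Problem 1"),
  stringsAsFactors = FALSE
)

# count() groups by Problems, stores the tally in Total, and sorts descending
problems_toy <- toy %>%
  count(Problems, sort = TRUE, name = "Total")

print(problems_toy)
```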

Looking at the 27 distinct answers more carefully, we could establish some groups:

  1. Relevant data: Problems 1 to 7 + No Problems and Doesn't know
  2. Synonyms: we have Problem 9, Problem 9', Problem 9'' or Problem 9''' (amongst others)
  3. Irrelevant data, which should be grouped under a single label, like "Other Problems": Problems 12 to 18

My suggested approach

That's what I thought I could do in order to overcome these two problems:

In order to deal with synonyms, I thought of renaming the synonym values to a single one, possibly using the revalue command, something like this:

df$Problems = revalue(df$Problems, c('Problem 9’' = 'Problem 9',
                                     'Problem 9’’' = 'Problem 9',
                                     'Problem 9’’’' = 'Problem 9'))

However, as an R newbie (and a newbie to programming languages as well), I think there should be a faster way to achieve that, since maintaining a "synonyms dictionary" will be very tedious, and the dictionary will keep growing as more replies come in.
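When the variants cannot be derived from a pattern, some kind of dictionary is probably unavoidable, but it does not have to live inside a revalue() call. The sketch below assumes a named character vector as the lookup table; the labels and the names synonyms and answers are hypothetical. The same table could just as well be kept in a CSV file and read in:

```r
# Hypothetical lookup table: names are variant wordings, values are canonical labels
synonyms <- c(
  "Noise at night"    = "Noise",
  "Loud neighbours"   = "Noise",
  "Rubbish on street" = "Litter"
)

answers <- c("Noise", "Loud neighbours", "Litter", "Rubbish on street", "Traffic")

# Replace only the answers found in the table; everything else is left untouched
hit <- answers %in% names(synonyms)
answers[hit] <- unname(synonyms[answers[hit]])

print(answers)
```

Answers missing from the table pass through unchanged, so new wordings only need a new entry once they turn out to be synonyms.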

In order to deal with irrelevant answers, I could take a similar approach and revalue them as Other problems, but I would like to do it in an automated way, since the questionnaire has not yet finished, the list of irrelevant terms will keep growing, and I cannot map all of them manually (e.g. map all values that have been given by fewer than 5 people, i.e. Total < 5). I guess I should create a function and use a control structure (for ... in), but I have not yet succeeded in that.
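A loop is likely unnecessary here, because R can look up each response's overall frequency in a vectorised way. A minimal base R sketch, assuming the answers are held in a character vector; the values "A", "B" and "C" are placeholders, and the cut-off of 5 comes from the question:

```r
# Hypothetical answers: "A" is common, "B" and "C" are rare
answers <- c(rep("A", 6), rep("B", 2), "C")

# Overall frequency of each distinct answer
totals <- table(answers)

# For every response, look up its answer's overall total and flag the rare ones
rare <- as.vector(totals[answers]) < 5
answers[rare] <- "Other problems"

print(table(answers))
```

Because the totals are recomputed from the data each time, the relabelling keeps up as new replies arrive.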

Since I need to display a boxplot of the answers grouped by neighborhoods, I'm afraid I can't use the problems dataframe as it is. So although it is useful to calculate total number of votes per problem, I do not know what to do with it other than use it as informative data. On the other hand, I cannot determine if an answer is irrelevant based only on the replies received in each neighborhood, as it would bias the results, since it is expected that different neighborhoods may have different problems.

Any help with these two problems will be really appreciated. Thanks

Solution

I had a look at your data and code. Your data frame, problems, contains values such as Problem 9’, Problem 7' and so forth, so what you want is to remove ’ and '. That is your first task, and you can achieve it with the following line.

problems$value <- gsub(pattern = "’+|'+", replacement = "", x = problems$value)

You can achieve the other task with which(). You want the rows where Total < 5; which() returns their indices, and you then replace value in those rows with Other problems. I hope this is what you are after.

problems$value[which(problems$Total < 5)] <- "Other problems"
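Two details may be worth checking when combining these lines. First, the pattern above removes apostrophes anywhere in the string, so it would also turn "Doesn't know" into "Doesnt know"; anchoring it with $ is one way around that. Second, the totals were computed before the variants were merged, so re-aggregating before applying the threshold lets merged variants pool their votes. The sketch below uses a subset of the DATA block (with \u2019 standing for the curly apostrophe); the name merged is an assumption for illustration:

```r
# Subset of the DATA block; \u2019 is the curly apostrophe
problems <- data.frame(
  value = c("Problem 7", "Problem 9", "Problem 9\u2019", "Doesn't know",
            "Problem 12", "Problem 7'", "Problem 9\u2019\u2019",
            "Problem 9\u2019\u2019\u2019"),
  Total = c(5L, 2L, 2L, 4L, 1L, 1L, 1L, 1L),
  stringsAsFactors = FALSE
)

# Strip only trailing marks, so the apostrophe inside "Doesn't know" survives
problems$value <- gsub("[\u2019']+$", "", problems$value)

# Re-aggregate so merged variants pool their votes before thresholding
merged <- aggregate(Total ~ value, data = problems, FUN = sum)
merged$value <- as.character(merged$value)  # guard in case a factor comes back

# Only now relabel the rare answers, then pool them as well
merged$value[merged$Total < 5] <- "Other problems"
merged <- aggregate(Total ~ value, data = merged, FUN = sum)

print(merged)
```

With a cut-off of 5, "Doesn't know" (4 votes) is pooled into "Other problems" too; whether such answers should be exempt is a judgment call for the analyst.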

DATA

problems <- structure(list(value = c("Problem 1", "Problem 3", "Problem 2", 
"No problems", "Problem 4", "Problem 5", "Problem 6", "Problem 7", 
"Doesn't know", "Problem 8", "Problem 9", "Problem 9’", "Other problems", 
"Problem 10", "Problem 10’", "Problem 11", "Problem 11'", "Problem 12", 
"Problem 13", "Problem 14", "Problem 15", "Problem 16", "Problem 17", 
"Problem 18", "Problem 7'", "Problem 9’’", "Problem 9’’’"
), Total = c(282L, 268L, 186L, 160L, 76L, 68L, 6L, 5L, 4L, 2L, 
2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, 
-27L), .Names = c("value", "Total"))

EDIT

Seeing the OP's first comment, the following lines build a data frame suitable for plotting.

count(df, Neighborhoods, Problems) -> temp

temp$Problems <- gsub(pattern = "’+|'+", replacement = "", x = temp$Problems)

temp$Problems[which(temp$n < 5)] <- "Other problems"

group_by(temp, Neighborhoods, Problems) %>%
summarize(Total = sum(n)) -> temp2
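Note that these lines apply the n < 5 cut-off within each neighborhood, whereas the question asks for irrelevance to be judged on the overall totals. One possible variant, sketched under that assumption, computes the city-wide total per problem first and only then counts per neighborhood. The toy data, the column name overall and the cut-off min_votes are hypothetical:

```r
library(dplyr)

# Hypothetical responses: one row per respondent
df_toy <- data.frame(
  Neighborhoods = c(rep("North", 5), rep("South", 4)),
  Problems = c("Noise", "Noise", "Noise", "Litter", "Parking",
               "Noise", "Noise", "Litter", "Dogs"),
  stringsAsFactors = FALSE
)

min_votes <- 3  # illustrative cut-off; the question uses 5

temp2 <- df_toy %>%
  add_count(Problems, name = "overall") %>%   # city-wide total per problem
  mutate(Problems = ifelse(overall < min_votes, "Other problems", Problems)) %>%
  count(Neighborhoods, Problems, name = "Total")

print(temp2)
```

In this toy example "Noise" keeps its own label in South even though only two people there mentioned it, because the decision is based on the city-wide count.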
