从具有重叠值的几个列中创建唯一值的向量 [英] Create a vector of unique values out of several columns with overlapping values
问题描述
在我的data.frame中,我在一行的SUBJECT上有三列。我想为每行添加一个独特的主题列。首先,我的数据如何:
In my data.frame I have three columns on the SUBJECT of a row. I want an additional column with a unique subject for each row. First, how my data looks like:
DATE <- c("1","2","3","4","5","6","7","1","2","3","4","5","6","7")
COMP <- c("A", "A", "A", "A", "A", "A", "A", "B", "B", "B", "B", "B", "B", "B")
RET <- c(-2.0,1.1,3,1.4,-0.2, 0.6, 0.1, -0.21, -1.2, 0.9, 0.3, -0.1,0.3,-0.12)
CLASS <- c("positive", "negative", "aneutral", "positive", "positive", "negative", "aneutral", "positive", "negative", "negative", "positive", "aneutral", "aneutral", "aneutral")
SUBJECT.1 <- c("LITIGATION","LAYOFF","POLLUTION","CHEMICAL DISASTER","PRESS RELEASE","PEOPLE","EMISSIONS","ENERGY","WASTE MANAGEMENT","EMPLOYEES","MANAGEMENT","PRESS RELEASE","HOTELS","POLLUTION")
SUBJECT.2 <- c("POLLUTION","EMPLOYEES","NUCLEAR","FUELS","STOCK OPTION PLAN","EXECUTIVES","CO2","SOLAR","POLLUTION","EXECUTIVES","PRESS RELEASE","CELEBRITIES","CELEBRITIES","LITIGATION")
SUBJECT.3 <- c("ENVIRONMENT","JOB REDUCTIONS","POWER PLANTS","POLLUTION","EMPLOYEES","FRAUD","CLIMATE CHANGE","SUSTAINABILITY","HAZARDOUS WASTE","BONUS PAY","LITIGATION","EMISSIONS","SCANDALS","SCANDALS")
CONTROLVAR <- c("11","13","13","14","13","14","12","11","13","13","14","13","14","12")
mydf <- data.frame(DATE, COMP, RET, CLASS, SUBJECT.1, SUBJECT.2, SUBJECT.3, CONTROLVAR, stringsAsFactors=F)
mydf
# DATE COMP RET CLASS SUBJECT.1 SUBJECT.2 SUBJECT.3 CONTROLVAR
# 1 1 A -2.00 positive LITIGATION POLLUTION ENVIRONMENT 11
# 2 2 A 1.10 negative LAYOFF EMPLOYEES JOB REDUCTIONS 13
# 3 3 A 3.00 aneutral POLLUTION NUCLEAR POWER PLANTS 13
# 4 4 A 1.40 positive CHEMICAL DISASTER FUELS POLLUTION 14
# 5 5 A -0.20 positive PRESS RELEASE STOCK OPTION PLAN EMPLOYEES 13
# 6 6 A 0.60 negative PEOPLE EXECUTIVES FRAUD 14
# 7 7 A 0.10 aneutral EMISSIONS CO2 CLIMATE CHANGE 12
# 8 1 B -0.21 positive ENERGY SOLAR SUSTAINABILITY 11
# 9 2 B -1.20 negative WASTE MANAGEMENT POLLUTION HAZARDOUS WASTE 13
# 10 3 B 0.90 negative EMPLOYEES EXECUTIVES BONUS PAY 13
# 11 4 B 0.30 positive MANAGEMENT PRESS RELEASE LITIGATION 14
# 12 5 B -0.10 aneutral PRESS RELEASE CELEBRITIES EMISSIONS 13
# 13 6 B 0.30 aneutral HOTELS CELEBRITIES SCANDALS 14
# 14 7 B -0.12 aneutral POLLUTION LITIGATION SCANDALS 12
由于我想将主题作为虚拟变量(应该是排他性的)进行后续回归,所以我需要一个单独的列SUBJECT,具有唯一的s每一行的对象。我想专注于LITIGATION,POLLUTION和LAYOFF这个科目。
Since I want to include the subject as a dummy variable (which should be exclusive) for a later regression, I want a single column SUBJECT with a unique subject for each row. I'd like to focus on the subjects LITIGATION, POLLUTION and LAYOFF.
我想从左到右,并查看每个SUBJECT列的LITIGATION,POLLUTION和LAYOFF。
I want to go from left to right and check each SUBJECT column for LITIGATION, POLLUTION and LAYOFF.
如果在第一列中有三个题目LITIGATION,POLLUTION或LAYOFF中的一个被采用。如果在第一列有不同的主题,我检查第二列等等。如果三个主题列中没有一个包含LITIGATION,污染或LAYOFF,则该主题应称为OTHER。
另外,一些科目应该分组。在这个例子中,排放歧视应被视为污染。
If there is one of the three subjects LITIGATION, POLLUTION or LAYOFF in the first column, this subject is taken. If there is a different subject in the first column, I check the second column and so on. If none of the three subject-columns contain LITIGATION, POLLUTION or LAYOFF, then the subject should be called OTHER. Also, some of the subjects should be grouped. In this example EMISSIONS should be treated as if it was POLLUTION.
输出应如下所示:
# DATE COMP RET CLASS SUBJECT.1 SUBJECT.2 SUBJECT.3 SUBJECT CONTROLVAR
# 1 1 A -2.00 positive LITIGATION POLLUTION ENVIRONMENT LITIGATION 11
# 2 2 A 1.10 negative LAYOFF EMPLOYEES JOB REDUCTIONS LAYOFF 13
# 3 3 A 3.00 aneutral POLLUTION NUCLEAR POWER PLANTS POLLUTION 13
# 4 4 A 1.40 positive CHEMICAL DISASTER FUELS POLLUTION POLLUTION 14
# 5 5 A -0.20 positive PRESS RELEASE STOCK OPTION PLAN EMPLOYEES OTHER 13
# 6 6 A 0.60 negative PEOPLE EXECUTIVES FRAUD OTHER 14
# 7 7 A 0.10 aneutral EMISSIONS CO2 CLIMATE CHANGE POLLUTION 12
# 8 1 B -0.21 positive ENERGY SOLAR SUSTAINABILITY OTHER 11
# 9 2 B -1.20 negative WASTE MANAGEMENT POLLUTION HAZARDOUS WASTE POLLUTION 13
# 10 3 B 0.90 negative EMPLOYEES EXECUTIVES BONUS PAY OTHER 13
# 11 4 B 0.30 positive MANAGEMENT PRESS RELEASE LITIGATION LITIGATION 14
# 12 5 B -0.10 aneutral PRESS RELEASE CELEBRITIES EMISSIONS POLLUTION 13
# 13 6 B 0.30 aneutral HOTELS CELEBRITIES SCANDALS OTHER 14
# 14 7 B -0.12 aneutral POLLUTION LITIGATION SCANDALS POLLUTION 12
谢谢! p>
Thank you!
推荐答案
mydf$SUBJECT <- "OTHER"
sapply(c("SUBJECT.3", "SUBJECT.2", "SUBJECT.1"), function(x) mydf[mydf[, x] %in% c("LITIGATION", "POLLUTION", "LAYOFF", "EMISSIONS"), "SUBJECT"] <<- mydf[mydf[, x] %in% c("LITIGATION", "POLLUTION", "LAYOFF", "EMISSIONS"), x])
mydf$SUBJECT[mydf$SUBJECT == "EMISSIONS"] <- "POLLUTION"
这篇关于从具有重叠值的几个列中创建唯一值的向量的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!