从具有重叠值的几个列中创建唯一值的向量 [英] Create a vector of unique values out of several columns with overlapping values

查看:209
本文介绍了从具有重叠值的几个列中创建唯一值的向量的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

在我的data.frame中,我在一行的SUBJECT上有三列。我想为每行添加一个独特的主题列。首先,我的数据如何:

In my data.frame I have three columns on the SUBJECT of a row. I want an additional column with a unique subject for each row. First, how my data looks like:

DATE <- c("1","2","3","4","5","6","7","1","2","3","4","5","6","7")
COMP <- c("A", "A", "A", "A", "A", "A", "A", "B", "B", "B", "B", "B", "B", "B")
RET <- c(-2.0,1.1,3,1.4,-0.2, 0.6, 0.1, -0.21, -1.2, 0.9, 0.3, -0.1,0.3,-0.12)
CLASS <- c("positive", "negative", "aneutral", "positive", "positive", "negative", "aneutral", "positive", "negative", "negative", "positive", "aneutral", "aneutral", "aneutral")
SUBJECT.1 <- c("LITIGATION","LAYOFF","POLLUTION","CHEMICAL DISASTER","PRESS RELEASE","PEOPLE","EMISSIONS","ENERGY","WASTE MANAGEMENT","EMPLOYEES","MANAGEMENT","PRESS RELEASE","HOTELS","POLLUTION")
SUBJECT.2 <- c("POLLUTION","EMPLOYEES","NUCLEAR","FUELS","STOCK OPTION PLAN","EXECUTIVES","CO2","SOLAR","POLLUTION","EXECUTIVES","PRESS RELEASE","CELEBRITIES","CELEBRITIES","LITIGATION")
SUBJECT.3 <- c("ENVIRONMENT","JOB REDUCTIONS","POWER PLANTS","POLLUTION","EMPLOYEES","FRAUD","CLIMATE CHANGE","SUSTAINABILITY","HAZARDOUS WASTE","BONUS PAY","LITIGATION","EMISSIONS","SCANDALS","SCANDALS")
CONTROLVAR <- c("11","13","13","14","13","14","12","11","13","13","14","13","14","12")

mydf <- data.frame(DATE, COMP, RET, CLASS, SUBJECT.1, SUBJECT.2, SUBJECT.3, CONTROLVAR, stringsAsFactors=F)

mydf

#    DATE COMP   RET    CLASS         SUBJECT.1         SUBJECT.2       SUBJECT.3 CONTROLVAR
# 1     1    A -2.00 positive        LITIGATION         POLLUTION     ENVIRONMENT         11
# 2     2    A  1.10 negative            LAYOFF         EMPLOYEES  JOB REDUCTIONS         13
# 3     3    A  3.00 aneutral         POLLUTION           NUCLEAR    POWER PLANTS         13
# 4     4    A  1.40 positive CHEMICAL DISASTER             FUELS       POLLUTION         14
# 5     5    A -0.20 positive     PRESS RELEASE STOCK OPTION PLAN       EMPLOYEES         13
# 6     6    A  0.60 negative            PEOPLE        EXECUTIVES           FRAUD         14
# 7     7    A  0.10 aneutral         EMISSIONS               CO2  CLIMATE CHANGE         12
# 8     1    B -0.21 positive            ENERGY             SOLAR  SUSTAINABILITY         11
# 9     2    B -1.20 negative  WASTE MANAGEMENT         POLLUTION HAZARDOUS WASTE         13
# 10    3    B  0.90 negative         EMPLOYEES        EXECUTIVES       BONUS PAY         13
# 11    4    B  0.30 positive        MANAGEMENT     PRESS RELEASE      LITIGATION         14
# 12    5    B -0.10 aneutral     PRESS RELEASE       CELEBRITIES       EMISSIONS         13
# 13    6    B  0.30 aneutral            HOTELS       CELEBRITIES        SCANDALS         14
# 14    7    B -0.12 aneutral         POLLUTION        LITIGATION        SCANDALS         12

由于我想将主题作为虚拟变量(应该是排他性的)进行后续回归,所以我需要一个单独的列SUBJECT,具有唯一的s每一行的对象。我想专注于LITIGATION,POLLUTION和LAYOFF这个科目。

Since I want to include the subject as a dummy variable (which should be exclusive) for a later regression, I want a single column SUBJECT with a unique subject for each row. I'd like to focus on the subjects LITIGATION, POLLUTION and LAYOFF.

我想从左到右,并查看每个SUBJECT列的LITIGATION,POLLUTION和LAYOFF。

I want to go from left to right and check each SUBJECT column for LITIGATION, POLLUTION and LAYOFF.

如果在第一列中有三个题目LITIGATION,POLLUTION或LAYOFF中的一个被采用。如果在第一列有不同的主题,我检查第二列等等。如果三个主题列中没有一个包含LITIGATION,污染或LAYOFF,则该主题应称为OTHER。
另外,一些科目应该分组。在这个例子中,排放歧视应被视为污染。

If there is one of the three subjects LITIGATION, POLLUTION or LAYOFF in the first column, this subject is taken. If there is a different subject in the first column, I check the second column and so on. If none of the three subject-columns contain LITIGATION, POLLUTION or LAYOFF, then the subject should be called OTHER. Also, some of the subjects should be grouped. In this example EMISSIONS should be treated as if it was POLLUTION.

输出应如下所示:

#    DATE COMP   RET    CLASS         SUBJECT.1         SUBJECT.2       SUBJECT.3    SUBJECT CONTROLVAR
# 1     1    A -2.00 positive        LITIGATION         POLLUTION     ENVIRONMENT LITIGATION         11
# 2     2    A  1.10 negative            LAYOFF         EMPLOYEES  JOB REDUCTIONS     LAYOFF         13
# 3     3    A  3.00 aneutral         POLLUTION           NUCLEAR    POWER PLANTS  POLLUTION         13
# 4     4    A  1.40 positive CHEMICAL DISASTER             FUELS       POLLUTION  POLLUTION         14
# 5     5    A -0.20 positive     PRESS RELEASE STOCK OPTION PLAN       EMPLOYEES      OTHER         13
# 6     6    A  0.60 negative            PEOPLE        EXECUTIVES           FRAUD      OTHER         14
# 7     7    A  0.10 aneutral         EMISSIONS               CO2  CLIMATE CHANGE  POLLUTION         12
# 8     1    B -0.21 positive            ENERGY             SOLAR  SUSTAINABILITY      OTHER         11
# 9     2    B -1.20 negative  WASTE MANAGEMENT         POLLUTION HAZARDOUS WASTE  POLLUTION         13
# 10    3    B  0.90 negative         EMPLOYEES        EXECUTIVES       BONUS PAY      OTHER         13
# 11    4    B  0.30 positive        MANAGEMENT     PRESS RELEASE      LITIGATION LITIGATION         14
# 12    5    B -0.10 aneutral     PRESS RELEASE       CELEBRITIES       EMISSIONS  POLLUTION         13
# 13    6    B  0.30 aneutral            HOTELS       CELEBRITIES        SCANDALS      OTHER         14
# 14    7    B -0.12 aneutral         POLLUTION        LITIGATION        SCANDALS  POLLUTION         12

谢谢! p>

Thank you!

推荐答案

mydf$SUBJECT <- "OTHER"
sapply(c("SUBJECT.3", "SUBJECT.2", "SUBJECT.1"), function(x) mydf[mydf[, x] %in% c("LITIGATION", "POLLUTION", "LAYOFF", "EMISSIONS"), "SUBJECT"] <<- mydf[mydf[, x] %in% c("LITIGATION", "POLLUTION", "LAYOFF", "EMISSIONS"), x])
mydf$SUBJECT[mydf$SUBJECT == "EMISSIONS"] <- "POLLUTION"

这篇关于从具有重叠值的几个列中创建唯一值的向量的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆