How to set new column based on multiple conditions in data.table?


Problem description


I'm trying to collect catalogue information based on a text search: search for a certain string in the column Text and put a description into a new column, C_Organization.

Here is the sample data:

# load packages:
pacman::p_load("data.table",
               "stringr")

# make sample data:
DE <- data.table(c("John", "Sussan", "Bill"),
                 c("Text contains MIT", "some text with Stanford University", "He graduated from Yale"))

colnames(DE) <- c("Name", "Text")

> DE
     Name                               Text
1:   John                  Text contains MIT
2: Sussan some text with Stanford University
3:   Bill             He graduated from Yale

Search for a certain string and make a new data.table with a new column:

mit <- DE[str_detect(DE$Text, "MIT"), .(Name, C_Organization = "MIT")]
yale <- DE[str_detect(DE$Text, "Yale"), .(Name, C_Organization = "Yale")]
stanford <- DE[str_detect(DE$Text, "Stanford"), .(Name, C_Organization = "Stanford")]

# bind them together:
combine_table <- rbind(mit, yale, stanford)

combine_table

     Name C_Organization
1:   John            MIT
2:   Bill           Yale
3: Sussan       Stanford

This pick-and-combine approach works fine, but it seems a little tedious. Is it possible to do it in one step in data.table?

Edit

Because my data analysis skills are limited and the data is unclean, I need to clarify the question:

  1. The real data is a little complicated:

    (1) There are cases where a person is from more than one organization, like Jack, UC Berkeley, Bell lab, and

    (2) The same person from the same organization appears in different years, like Steven, MIT, 2011 and Steven, MIT, 2014.

  2. I want to figure out:

    (1) How many people come from each organization. If one person belongs to more than one organization, take the organization that appears most often (i.e. by popularity) as their organization. For example, for John, MIT, AMS, Bell lab: if MIT appears 30 times, AMS 12 times and Bell lab 26 times, then MIT is his organization.

    (2) How many people there are for each year. This is not directly related to my original question, but I don't want to throw away these records because I need them for later calculations.

Solution

An alternative solution that takes several matches in one text into account; it operates rowwise and binds the matches together:

uni <- c("MIT","Yale","Stanford")
DE[,idx:=.I][, c_org := paste(uni[str_detect(Text, uni)], collapse=","), idx]

This gives:

> DE
     Name                                   Text idx             c_org
1:   John                      Text contains MIT   1               MIT
2: Sussan     some text with Stanford University   2          Stanford
3:   Bill He graduated from Yale, MIT, Stanford.   3 MIT,Yale,Stanford
4:   Bill                              some text   4                  
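To see why grouping by idx works, it may help to look at a single row in isolation. The sketch below assumes only that stringr is installed; text, hits and matched are illustrative names that do not appear in the answer. With a single string and a vector of patterns, str_detect returns one logical value per pattern, which is then used to subset uni:

```r
library(stringr)

uni <- c("MIT", "Yale", "Stanford")

# one row's worth of text (row 3 of the data used in this answer)
text <- "He graduated from Yale, MIT, Stanford."

# a single string is checked against every pattern in uni,
# giving one logical value per pattern
hits <- str_detect(text, uni)

# keep the patterns that matched and collapse them into one string
matched <- uni[hits]
c_org <- paste(matched, collapse = ",")

# a text without any match collapses to an empty string
no_match <- paste(uni[str_detect("some text", uni)], collapse = ",")
```

Because every idx group contains exactly one Text value, the same thing happens once per row inside the data.table call.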

The advantage of operating rowwise is evident when you have identical names in Name. When you do:

DE[, uni[str_detect(Text, uni)], Name]

you do not get the correct result:

     Name       V1
1:   John      MIT
2: Sussan Stanford
3:   Bill      MIT
4:   Bill Stanford

=> you don't know which Bill you have in the fourth row. Moreover, Yale isn't included for the 'first' Bill (i.e. row 3 of the original dataset).
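If one row per match is preferred over a comma-separated string, the row index can be kept as a grouping column so that each match stays traceable to its source row. This is only a sketch under the same sample data; long and org_counts are hypothetical names, and the count is of matching rows, not of distinct people:

```r
library(data.table)
library(stringr)

DE <- data.table(
  Name = c("John", "Sussan", "Bill", "Bill"),
  Text = c("Text contains MIT",
           "some text with Stanford University",
           "He graduated from Yale, MIT, Stanford.",
           "some text")
)
uni <- c("MIT", "Yale", "Stanford")

# row index, so that identical names can still be told apart
DE[, idx := .I]

# one output row per (source row, matched organization);
# rows without any match return a zero-length result and are dropped
long <- DE[, .(c_org = uni[str_detect(Text, uni)]), by = .(idx, Name)]

# number of matching rows per organization
org_counts <- long[, .N, by = c_org]
```

In this layout the second Bill (idx 4) simply does not appear, and the three matches of the first Bill all carry idx 3.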


Used data:

DE <- data.table(Name = c("John", "Sussan", "Bill", "Bill"),
                 Text = c("Text contains MIT",
                          "some text with Stanford University",
                          "He graduated from Yale, MIT, Stanford.",
                          "some text"))
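When only one organization per row is wanted, a first-match-wins variant is also possible. The sketch below is not part of the original answer; it assumes data.table 1.13.0 or later (for fcase) and uses the column name C_Organization from the question. Conditions are checked in order, so a text that mentions several organizations gets the first one listed, and a text without any match gets NA:

```r
library(data.table)
library(stringr)

DE <- data.table(
  Name = c("John", "Sussan", "Bill", "Bill"),
  Text = c("Text contains MIT",
           "some text with Stanford University",
           "He graduated from Yale, MIT, Stanford.",
           "some text")
)

# fcase evaluates the conditions in order; the first TRUE one wins,
# and rows where no condition is TRUE are filled with NA
DE[, C_Organization := fcase(
  str_detect(Text, "MIT"),      "MIT",
  str_detect(Text, "Yale"),     "Yale",
  str_detect(Text, "Stanford"), "Stanford"
)]
```

This adds the column in a single step, but unlike the rowwise approach it silently drops the additional matches in row 3.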
