Converting different columns to different formats

Question

I have a df in R that I have loaded using:

library(data.table)
data <- fread("Data/LuminateDataExport_UTDP2_011818.csv", colClasses = 'character', stringsAsFactors = FALSE)

I did this because I had to perform certain operations like stripping "$", etc.

Now, I am trying to convert the columns into the appropriate formats without having to call an as.*() function on each column individually...

The structure of the current df is:

> str(data)
Classes ‘data.table’ and 'data.frame':  196879 obs. of  32 variables:
 $ city             : chr  "" "" "" "" ...
 $ company_goal     : chr  "" "" "" "" ...
 $ company_name     : chr  "" "" "" "" ...
 $ event_date       : chr  "5/14/2016" "9/26/2015" "9/12/2015" "6/3/2017" ...
 $ event_year       : chr  "FY 2016" "FY 2016" "FY 2016" "FY 2017" ...
 $ fundraising_goal : chr  "250" "200" "350" "0" ...
 $ name             : chr  "Heart Walk 2015-2016 St. Louis MO" "Heart Walk 2015-2016 Canton, OH" "Heart Walk 2015-2016 Dallas, TX" "FDA HW 2016-2017 Albany, NY WO-65355" ...
 $ participant_id   : chr  "2323216" "2273391" "2419569" "4088558" ...
 $ state            : chr  "" "OH" "TX" "" ...
 $ street           : chr  "" "" "" "" ...
 $ team_average     : chr  "176" "123" "306" "47" ...
 $ team_captain     : chr  "No" "No" "Yes" "No" ...
 $ team_count       : chr  "7" "6" "4" "46" ...
 $ team_id          : chr  "152788" "127127" "45273" "179207" ...
 $ team_member_goal : chr  "0" "0" "0" "0" ...
 $ team_name        : chr  "Team Clayton" "Cardiac Crusaders" "BIS - Team Myers" "Independent Walkers" ...
 $ team_total_gifts : chr  "1,230 " "738" "1,225 " "2,145 " ...
 $ zip              : chr  "" "" "" "" ...
 $ gifts_count      : chr  "2" "1" "2" "1" ...
 $ registration_gift: chr  "No" "No" "No" "No" ...
 $ participant_gifts: chr  "236" "218" "225" "0" ...
 $ personal_gift    : chr  "0" "0" "0" "250" ...
 $ total_gifts      : chr  "236" "218" "225" "250" ...
 $ match_code       : chr  "UX000" "UX000" "UX000" "UX000" ...
 $ tap_level        : chr  "X" "X" "X" "X" ...
 $ tap_desc         : chr  "" "" "" "" ...
 $ tap_lifed        : chr  "" "" "" "" ...
 $ medage_cy        : chr  "0" "0" "0" "0" ...
 $ divindx_cy       : chr  "0" "0" "0" "0" ...
 $ medhinc_cy       : chr  "0" "0" "0" "0" ...
 $ meddi_cy         : chr  "0" "0" "0" "0" ...
 $ mednw_cy         : chr  "0" "0" "0" "0" ...
 - attr(*, ".internal.selfref")=<externalptr> 

Now, as a first step, I am trying to convert all of the number columns to numeric.

I have tried every one of the solutions found here but none of them have worked.

The errors I keep getting are:

Error in [.data.table(data, , cols) : j (the 2nd argument inside [...]) is a single symbol but column name 'cols' is not found. Perhaps you intended DT[,..cols] or DT[,cols,with=FALSE]. This difference to data.frame is deliberate and explained in FAQ 1.1.

Error in [.data.table(data, cols) : When i is a data.table (or character vector), the columns to join by must be specified either using 'on=' argument (see ?data.table) or by keying x (i.e. sorted, and, marked as sorted, see ?setkey). Keyed joins might have further speed benefits on very large data due to x being sorted in RAM.

Here is some more info on the data:

> dput(data[1:6, 1:11])
structure(list(city = c("", "", "", "", "", ""), company_goal = c("", 
"", "", "", "", ""), company_name = c("", "", "", "", "", ""), 
    event_date = c("5/14/2016", "9/26/2015", "9/12/2015", "6/3/2017", 
    "5/6/2017", "10/17/2015"), event_year = c("FY 2016", "FY 2016", 
    "FY 2016", "FY 2017", "FY 2017", "FY 2016"), fundraising_goal = c("250", 
    "200", "350", "0", "0", "100"), name = c("Heart Walk 2015-2016 St. Louis MO", 
    "Heart Walk 2015-2016 Canton, OH", "Heart Walk 2015-2016 Dallas, TX", 
    "FDA HW 2016-2017 Albany, NY WO-65355", "FDA HW 2016-2017 New Haven, CT WO-66497", 
    "Heart Walk 2015-2016 Puget Sound, WA"), participant_id = c("2323216", 
    "2273391", "2419569", "4088558", "4527010", "2424207"), state = c("", 
    "OH", "TX", "", "", "WA"), street = c("", "", "", "", "", 
    ""), team_average = c("176", "123", "306", "47", "0", "97"
    )), .Names = c("city", "company_goal", "company_name", "event_date", 
"event_year", "fundraising_goal", "name", "participant_id", "state", 
"street", "team_average"), class = c("data.table", "data.frame"
), row.names = c(NA, -6L), .internal.selfref = <pointer: 0x10200c378>)

Any advice?

(Once I do this, I will also have to convert different columns to factors, etc.)

Recommended answer

I realize this is an older question that you're probably not working on anymore, but since it's one of the first questions that comes up when people search for simultaneously formatting multiple columns as numeric in R, I thought I'd add a thought.

Regarding the first part of your question--how to identify which columns are numeric, which columns are dates, which columns are factors, etc.--I do not have a good answer, particularly because a factor can start out as character or some other type and only later be designated as a factor. Deciding which columns to convert is largely up to the designer. If none of the entries are legitimately NA, you could use the logic here to determine which columns should be formatted as numeric. Once you have decided which columns to convert . . .
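One way to apply that kind of check is sketched below. It is only an illustration: the sample table is made up from a few columns in the question, and is_numeric_like() is a hypothetical helper, not something from the linked post. It treats a column as numeric when every non-empty value parses as a number once "$", "," and whitespace are removed.

```r
library(data.table)

# Hypothetical sample mirroring a few columns from the question
dt <- data.table(
  state            = c("", "OH", "TX"),
  fundraising_goal = c("250", "200", "350"),
  team_total_gifts = c("1,230 ", "738", "1,225 "),
  event_date       = c("5/14/2016", "9/26/2015", "9/12/2015")
)

# is_numeric_like() is a hypothetical helper: TRUE when every non-empty
# value parses as a number after removing "$", "," and whitespace
is_numeric_like <- function(x) {
  cleaned <- gsub("[$,[:space:]]", "", x)
  cleaned <- cleaned[cleaned != ""]
  length(cleaned) > 0 && !anyNA(suppressWarnings(as.numeric(cleaned)))
}

# Names of the columns that pass the check
numericCols <- names(dt)[vapply(dt, is_numeric_like, logical(1))]
print(numericCols)
```

A check like this would also flag identifier columns such as participant_id, team_id or zip, so the resulting list probably still needs a manual review before converting.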

I'm guessing that your second error appears because you're using data tables slightly differently than the current syntax requires. You can find instructions for changing a selection of columns using data table syntax in one of the later answers on this post:

Coerce multiple columns to factors at once

In that post, they coerce a set of columns to factor; the same process works for coercing to numeric.
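As the error messages in the question suggest, a character vector of column names held in a variable has to be marked as such inside [.data.table. The sketch below uses a small made-up table (dt and cols are illustrative names) to show three ways of selecting columns that should give the same result.

```r
library(data.table)

# Small made-up table; "cols" holds the names of the columns to select
dt <- data.table(a = c("1", "2"), b = c("x", "y"), c = c("3", "4"))
cols <- c("a", "c")

# dt[, cols] looks for a column literally named "cols" (the first error above),
# and dt[cols] treats the character vector as a join (the second error above).
sub1 <- dt[, ..cols]               # ".." prefix: look up cols outside the table
sub2 <- dt[, cols, with = FALSE]   # same selection using with = FALSE
sub3 <- dt[, .SD, .SDcols = cols]  # .SD restricted to the named columns
print(sub1)
```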

To keep it simple, you specify the columns you want (using numeric positions or column names or otherwise--in your case, this value will be assigned programmatically using whatever logic and rules you apply to divide your data into groups). E.g.,

# Select the columns by position ...
colsToConvert <- c(6, 11, 13)

# ... or, equivalently for this data, by name
colsToConvert <- c("fundraising_goal", "team_average", "team_count")

Then you use lapply together with the .SDcols argument, which restricts .SD to the chosen columns:

data[, (colsToConvert) := lapply(.SD, as.numeric), .SDcols = colsToConvert]
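For a self-contained illustration, the sketch below runs the same pattern on a few values copied from the str() output in the question. Note that team_total_gifts contains entries such as "1,230 ", which plain as.numeric would turn into NA, so this version swaps in a hypothetical helper, to_number(), that strips "$", "," and whitespace first.

```r
library(data.table)

# Sample values copied from the str() output in the question
data <- data.table(
  fundraising_goal = c("250", "200", "350", "0"),
  team_average     = c("176", "123", "306", "47"),
  team_total_gifts = c("1,230 ", "738", "1,225 ", "2,145 ")
)

colsToConvert <- c("fundraising_goal", "team_average", "team_total_gifts")

# to_number() is a hypothetical helper: drop "$", "," and whitespace, then convert
to_number <- function(x) as.numeric(gsub("[$,[:space:]]", "", x))

# Update the selected columns in place (by reference)
data[, (colsToConvert) := lapply(.SD, to_number), .SDcols = colsToConvert]
str(data)
```

The parentheses around colsToConvert on the left of := tell data.table to use the names stored in the variable rather than create a column called colsToConvert.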

That should do your conversion. Repeat this process for as many data types as you prefer, changing the formatting from as.numeric to whichever type is appropriate.
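As a rough sketch of that repetition, the example below converts two columns to factor and one to Date. The grouping of columns is an assumption made for illustration, and the date format "%m/%d/%Y" is inferred from sample values such as "5/14/2016" in the question.

```r
library(data.table)

# Sample values copied from the str() output in the question
data <- data.table(
  event_date   = c("5/14/2016", "9/26/2015", "9/12/2015", "6/3/2017"),
  event_year   = c("FY 2016", "FY 2016", "FY 2016", "FY 2017"),
  team_captain = c("No", "No", "Yes", "No")
)

# Assumed grouping: these columns become factors
factorCols <- c("event_year", "team_captain")
data[, (factorCols) := lapply(.SD, as.factor), .SDcols = factorCols]

# Assumed format: month/day/four-digit year
dateCols <- "event_date"
data[, (dateCols) := lapply(.SD, as.Date, format = "%m/%d/%Y"), .SDcols = dateCols]

str(data)
```

Extra arguments after the function name in lapply, such as format here, are passed on to the conversion function for every selected column.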
