Is there a way in R to join broken lines of a CSV file?


Question

I have a program that exports a CSV file but doesn't quote embedded newlines, and doesn't use \n for them instead of \r\n either. It uses the same end-of-line marker in the middle of records as it does at the end. The program does, however, use a comma delimiter between variables. How can I tell R to delete all the EOL markers until the number of variables in the data is reached?

My data would look like this:

name, rank, serial number, age, height, weight

mike, noob, 123456, 22, 6, 34.4

bob, officer, 345

323, 24, 6, 2

3.5

ted, officer, 34234, 2

5, 6, 35.2

How would I delete the line break after the 5 in row 2, after the 2 in row 3 and after the 2 in row 6? There should be 5 commas and 6 variables in each row. My real data doesn't have the extra blank line between rows; I just couldn't stop this post from putting everything on one line without adding them. My data has 43 variables and new lines are constantly being added. Most of the time when it is read in, there are a few thousand lines, and about 20% of them have the CR problem.

I should also add that a new record always starts on a new line; it never continues on the same line as the previous record.

The data frame should look like this:

name, rank, serial number, age, height, weight

mike, noob, 123456, 22, 6, 34.4

bob, officer, 345323, 24, 6, 23.5

ted, officer, 34234, 25, 6, 35.2

This is what my data looks like, if that helps. The first line is a header, followed by what should be 6 records, but read.csv, fread and everything else I tried give me 10 records. The 6th record has the extra CRs but still has 42 variables; it is just broken up into 5 lines.

EFPCName,EFUseAPPE,log pdl,pdl error,device pretty name,num pages,num sheets,copies printed,total pages printed,total sheets printed,total color pages printed,total bw pages printed,total tab pages printed,total sample pages printed,num copies,print status,instructions,notes1,notes2,username,noneutf8lastuser,non utf8 submitted by,title,size,logical printer,fiery,time,date,total rip duration,timestamp spooling,timestamp done spooling,timestamp waiting to rip,timestamp ripping,timestamp done ripping,timestamp waiting to print,timestamp printing,timestamp done printing,media weight,input slot,media size,media type,interpreter,

LZX  Laser 24 - 11 x 17 Tabloid,,postscript,,Canon,2,1,1,2,1,1,1,0,0,1,OK,,,,TeamMember,,TeamMember,78053.01.pdf,4004491,Canon hold,SERVER-Shredder,2013 06 07 19 37 13,2013 06 07 19 37 00,3,2013 06 07 19 37 23,2013 06 07 19 37 24,2013 06 07 19 38 02 118342,2013 06 07 19 38 02 118342,2013 06 07 19 38 09,2013 06 07 19 38 09,2013 06 07 19 38 38,2013 06 07 19 39 19 124419,,Tray5,Tabloid,Plain,PS,

LZX  Laser 24 - 11 x 17 Tabloid,,postscript,,Canon,2,1,1,2,1,1,1,0,0,1,OK,,,,TeamMember,,TeamMember,78053.01.pdf,4004520,none,SERVER-Shredder,2013 06 07 19 37 13,2013 06 07 19 37 00,,2013 06 07 19 44 07 926090,2013 06 07 19 44 07 926744,2013 06 07 19 44 07 926090,2013 06 07 19 44 07 926090,2013 06 07 19 44 07 926744,2013 06 07 19 44 07,2013 06 07 19 44 11,2013 06 07 19 44 53 141084,,Tray5,Tabloid,Plain,PS,

LZX  Laser 24 - 11 x 17 Tabloid,,postscript,,Canon,2,1,1,2,1,1,1,0,0,1,OK,,,,TeamMember,,TeamMember,78053.01.pdf,4004520,none,SERVER-Shredder,2013 06 07 19 37 13,2013 06 07 19 37 00,,2013 06 07 19 46 01 550964,2013 06 07 19 46 01 551451,2013 06 07 19 46 01 550964,2013 06 07 19 46 01 550964,2013 06 07 19 46 01 551451,2013 06 07 19 46 01,2013 06 07 19 46 05,2013 06 07 19 46 46 911557,,Tray5,Tabloid,Plain,PS,

LZX80  Color Copy Cover - 11 x 17 Tabloid,,postscript,,Canon,1,2,2,2,2,2,0,0,0,2,OK,,,,TeamMember,,TeamMember,78011.01.pdf,874486,Canon hold,SERVER-Shredder,2013 06 07 19 47 07,2013 06 07 19 47 00,3,2013 06 07 19 47 17,2013 06 07 19 47 17 507576,2013 06 07 19 47 47 960542,2013 06 07 19 47 47 960542,2013 06 07 19 47 51,2013 06 07 19 47 51,2013 06 07 19 47 54,2013 06 07 19 48 25 77595,,Tray3,Tabloid,Heavy5,PS,

LZX  Laser 24 - 11 x 17 Tabloid,,postscript,,Canon,2,1,1,2,1,1,1,0,0,1,OK,,,,TeamMember,,TeamMember,78053.01.pdf,4004520,none,SERVER-Shredder,2013 06 07 19 37 13,2013 06 07 19 37 00,,2013 06 07 19 48 04 501212,2013 06 07 19 48 04 502522,2013 06 07 19 48 04 501212,2013 06 07 19 48 04 501212,2013 06 07 19 48 04 502522,2013 06 07 19 48 04,2013 06 07 19 48 07,2013 06 07 19 48 48 188474,,Tray5,Tabloid,Plain,PS,

EX32  Laser 32 - 11 x 17 Tabloid,,pdf,,Canon,63,64,1,63,64,4,59,0,0,1,OK,Size: 11 x 17
Finishing: Coil Binding  Cutting  Punching
Pages: 
1-63  4/0  EX32  Laser 32 - 11 x 17  11 x 17
 ,Color 77992:01Employee Handbook REVISED_2up(NFC).pdf, McAllen TX,EFI Pace,,,Color 77992:01Employee Handbook REVISED_2up(NFC).pdf,518880,none,SERVER-Shredder,2013 06 07 20 01 52,2013 06 07 20 01 00,3,2013 06 07 20 02 41 495216,2013 06 07 20 02 44 780196,2013 06 07 20 02 41 871208,2013 06 07 20 02 41 871208,2013 06 07 20 02 45,2013 06 07 20 02 45,2013 06 07 20 03 25,2013 06 07 20 05 45 741386,,Tray4,Tabloid,Heavy1,PS,
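
A quick way to see which physical lines are incomplete is to count the fields on each one. The sketch below is only an illustration: it writes a hypothetical miniature of the export (the `broken` vector, not the real file) to a temporary file and uses count.fields from base R's utils package, with quoting turned off so that each physical line is counted on its own.

```r
# Hypothetical miniature of the broken export; the real file has far more columns.
broken <- c(
  "name,rank,serial,age,height,weight",
  "mike,noob,123456,22,6,34.4",
  "bob,officer,345",
  "323,24,6,23.5",
  "ted,officer,34234,2",
  "5,6,35.2"
)

tmp <- tempfile(fileext = ".csv")
writeLines(broken, tmp)

# Number of comma-separated fields on every physical line.
n_fields <- count.fields(tmp, sep = ",", quote = "")

# Assume the header is intact and defines the expected field count.
expected <- n_fields[1]

# Physical lines that do not hold a complete record on their own.
broken_lines <- which(n_fields != expected)
print(data.frame(line = seq_along(n_fields), fields = n_fields))
```

Any line listed in `broken_lines` is a fragment that has to be glued to its neighbours before the file can be parsed.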

Solution

This is what I have for now. See how this works on your data.

dat <- readLines("temp.txt") # read whatever is in there, one line at a time
varnames <- unlist(strsplit(dat[1], ",")) # extract variable names from the header
nvar <- length(varnames)

k <- 1 # line counter; line 1 is the header
dat1 <- matrix(NA, ncol = nvar, dimnames = list(NULL, varnames))

# use `<` rather than `<=`: k is incremented inside the loop, so `<=` would read past the end
while (k < length(dat)) {
    k <- k + 1
    if (dat[k] == "") { # skip empty lines
        print(paste("data line", k, "is an empty string"))
        next
    }
    temp <- dat[k]
    # keep appending lines until there are enough commas, i.e. the record is no longer broken;
    # regmatches() gives a true count (0 when there is no comma at all)
    while (lengths(regmatches(temp, gregexpr(",", temp, fixed = TRUE))) < nvar - 1 &&
           k < length(dat)) {
        k <- k + 1
        temp <- paste0(temp, dat[k])
    }
    temp <- unlist(strsplit(temp, ","))
    message(k) # progress: last line consumed for this record
    dat1 <- rbind(dat1, temp)
}

dat1 <- dat1[-1, ] # delete the empty initial row

The general idea is to keep collapsing text until there are enough commas in the string. Once that is achieved, the data is split at commas and added as a single row into a matrix. The code is horribly clunky and will be slow for large data files. It is the best I can do though.
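
One detail worth checking when counting commas this way: gregexpr returns -1 when there is no match, so taking the length of its result reports 1 for a fragment that contains no comma at all. The small comparison below is a sketch with hypothetical helper names (`count_commas_naive`, `count_commas`); combining gregexpr with regmatches gives a count that is 0 in that case.

```r
# Hypothetical helpers, for illustration only.

# Length of the match-position vector: a lone -1 ("no match") still has length 1.
count_commas_naive <- function(x) length(gregexpr(",", x, fixed = TRUE)[[1]])

# Extract the actual matches first, then count them.
count_commas <- function(x) lengths(regmatches(x, gregexpr(",", x, fixed = TRUE)))

fragment <- "3.5"          # a continuation line with no comma
full     <- "a,b,c,d,e,f"  # a complete six-field record

naive_fragment <- count_commas_naive(fragment)
exact_fragment <- count_commas(fragment)
naive_full     <- count_commas_naive(full)
exact_full     <- count_commas(full)
```

The two counts agree whenever at least one comma is present, so the difference only shows up on comma-free fragments.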

For the original data example, the code works and creates a character matrix with 42 columns and 6 rows. For the smaller example, the code cannot handle the break in the last column.
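
If the loop turns out to be too slow, the same idea can be sketched without growing a matrix row by row: count the commas on every line, take a running total, and start a new record each time the total passes a multiple of the expected count. This is a sketch under assumptions, not a drop-in fix: `join_broken_lines` and the sample vector are hypothetical, no field may contain a comma, and, like the loop, it cannot tell a break inside the last column from the start of a new record unless every record ends with a trailing comma, as in the original data.

```r
# Hypothetical helper: glue physical lines back into logical records.
# n_commas is the number of commas in one complete record.
join_broken_lines <- function(lines, n_commas) {
  lines <- lines[nzchar(lines)]  # drop empty lines
  commas <- lengths(regmatches(lines, gregexpr(",", lines, fixed = TRUE)))
  # Commas seen before each line; every full multiple of n_commas
  # means one more complete record precedes this line.
  before <- c(0, head(cumsum(commas), -1))
  record_id <- before %/% n_commas + 1
  unname(vapply(split(lines, record_id), paste0, character(1), collapse = ""))
}

# Hypothetical miniature of the export, with no break in the last column.
broken <- c(
  "name,rank,serial,age,height,weight",
  "mike,noob,123456,22,6,34.4",
  "bob,officer,345",
  "323,24,6,23.5",
  "ted,officer,34234,2",
  "5,6,35.2"
)

fixed <- join_broken_lines(broken, n_commas = 5)
df <- read.csv(text = fixed, stringsAsFactors = FALSE)
print(df)
```

Because the record boundaries come from a running total, a single stray comma inside a field would shift every later record, so it is worth checking that each element of `fixed` has the expected number of fields before trusting the result.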
