R: removing the last three dots from a string

Problem description

I have a text data file that I will likely read with readLines. The initial portion of each string contains a lot of gibberish, followed by the data I need. The gibberish and the data are usually separated by three dots. I would like to split each string after its last three dots, or replace the last three dots with a marker of some sort that tells R to treat everything to the left of those three dots as one column.

Here is a similar post on Stack Overflow that will locate the last dot:

R: find the last dot in a string

However, in my case some of the data have decimals, so locating the last dot will not suffice. Also, I think ... has a special meaning in R, which might be complicating the issue. Another potential complication is that some of the dots are bigger than others. In addition, in some lines one of the three dots was replaced with a comma.

In addition to gregexpr from the post above, I have tried using gsub, but I cannot figure out the solution.

Here is an example data set and the outcome I hope to achieve:

aa = matrix(c(
'first string of junk... 0.2 0 1', 
'next string ........2 0 2', 
'%%%... ! 1959 ...  0 3 3',
'year .. 2 .,.  7 6 5',
'this_string   is . not fine .•. 4 2 3'), 
nrow=5, byrow=TRUE,
dimnames = list(NULL, c("C1")))

aa <- as.data.frame(aa, stringsAsFactors = FALSE)
aa

# desired result
#                             C1  C2 C3 C4
# 1        first string of junk  0.2  0  1
# 2            next string .....   2  0  2
# 3             %%%... ! 1959      0  3  3
# 4                 year .. 2      7  6  5
# 5 this_string   is . not fine    4  2  3

I hope this question is not considered too specific. The text data file was created using the steps outlined in my post from yesterday about reading an MSWord file in R.

Some of the lines do not contain gibberish or three dots, but only data. However, that might be a complication for a follow-up post.

Thank you for any advice.

Recommended answer

This does the trick, though it is not especially elegant...

options(stringsAsFactors = FALSE)


# Greedily match everything up to the last run of three delimiter characters
# (dot, comma or bullet), skip any whitespace after it, and capture the rest
# of the line in parentheses (referenced in the replacement as \\1)
nums <- gsub(pattern = "^.*[.,•]{3}\\s*(.*)", replacement = "\\1", x = aa$C1)

# Use strsplit to break each result apart at spaces, giving one character
# vector of numbers per line, then use do.call(rbind, ...) to stack those
# vectors into a character matrix with one row per line
num.mat <- do.call(rbind, strsplit(nums, split = " "))


# Mash it back together with your original strings
result <- as.data.frame(cbind(aa, num.mat))

# Give it informative names
names(result) <- c("original.string", "num1", "num2", "num3")
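
The code above keeps the full original string next to the numbers, whereas the desired result in the question has only the junk text in C1. A minimal sketch of one way to get closer to that layout follows. It reuses the answer's delimiter class with an extra capture group for the text before the last delimiter run. The names x, pat, has_delim, junk, data_part and split_result are hypothetical, the sixth line "9 8 7" is an invented data-only line added to illustrate the case mentioned in the question, and treating such lines as "empty junk, whole line is data" is an assumption rather than something the original answer addresses.

```r
# Sample lines from the question, plus one invented data-only line ("9 8 7").
# "\u2022" is the bullet character that appears in the fifth line.
x <- c(
  "first string of junk... 0.2 0 1",
  "next string ........2 0 2",
  "%%%... ! 1959 ...  0 3 3",
  "year .. 2 .,.  7 6 5",
  "this_string   is . not fine .\u2022. 4 2 3",
  "9 8 7"
)

# Same delimiter class as the answer (dot, comma or bullet, three in a row),
# with a first capture group for everything before the last such run
pat <- "^(.*)[.,\u2022]{3}\\s*(.*)$"
has_delim <- grepl(pat, x)

# Text before the last delimiter run; empty for lines without a delimiter
junk <- ifelse(has_delim, sub(pat, "\\1", x), "")
# Text after the last delimiter run; the whole line when there is no delimiter
data_part <- ifelse(has_delim, sub(pat, "\\2", x), x)

# Split the data on runs of whitespace and convert each field to numeric
fields <- strsplit(trimws(data_part), "\\s+")
num_mat <- do.call(rbind, lapply(fields, as.numeric))

split_result <- data.frame(C1 = trimws(junk), num_mat, stringsAsFactors = FALSE)
names(split_result) <- c("C1", "C2", "C3", "C4")
split_result
```

This sketch assumes every line carries exactly three numeric fields after the delimiter. Lines with a different number of fields would need to be padded or handled separately before the rbind step.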
