R - find/replace line breaks using regex


Problem description

I'm trying to clean a bunch of .txt files in a folder using regex, but I can't seem to get R to find line breaks.

This is the code I'm using. It works for character substitution, but not for line breaks.

gsub_dir(dir = "folder_name", pattern = "\\n", replacement = "#")

I've also tried \r and various other permutations. In a plain text editor, searching for \n finds all the line breaks.

Recommended answer

You can't do that with xfun::gsub_dir.

Have a look at the source code:

  • The files are read in using read_utf8, which basically executes x = readLines(con, encoding = 'UTF-8', warn = FALSE),
  • then gsub is applied to these lines, and when all replacements are done,
  • the write_utf8 function joins the lines again with the LF (newline) symbol.

Since readLines returns each line without its line terminator, the pattern never gets to see a newline character.
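
The reading step can be reproduced with base R alone. The following is a minimal sketch, assuming a temporary file with made-up contents; it does not call xfun, it only mimics the readLines step described above:

```r
# Write a two-line file to a temporary location (hypothetical demo data).
tmp <- tempfile(fileext = ".txt")
writeLines(c("first line", "second line"), tmp)

# readLines() splits on line breaks and drops them from each element.
x <- readLines(tmp, encoding = "UTF-8", warn = FALSE)
length(x)                          # 2, one element per line
any(grepl("\n", x, fixed = TRUE))  # FALSE, no element contains a newline

# A per-line gsub() therefore has nothing to replace.
y <- gsub("\\n", "#", x)
identical(x, y)                    # TRUE
unlink(tmp)
```

Whatever the pattern for the line break looks like, it is matched against strings that no longer contain one.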

You need a custom function for that. Here is a "quick and dirty" one that replaces all LF symbols with #:

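# Rewrite every file under `dir`, joining its lines with `newline` instead of LF.
lbr_change_gsub_dir = function(newline = '\n', encoding = 'UTF-8', dir = '.', recursive = TRUE) {
  files = list.files(dir, full.names = TRUE, recursive = recursive)
  for (f in files) {
    # readLines() drops the line terminators; cat() re-joins the lines with `newline`.
    x = readLines(f, encoding = encoding, warn = FALSE)
    cat(x, sep = newline, file = f)  # overwrites the file in place
  }
}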

folder <- "C:\\MyFolder\\Here"
lbr_change_gsub_dir(newline="#", dir=folder)
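
Because the function overwrites files in place, it may be worth trying it on a throwaway folder first. The sketch below repeats the function so that it runs on its own; the scratch folder name lbr_demo and the file contents are made up for the demonstration:

```r
lbr_change_gsub_dir <- function(newline = "\n", encoding = "UTF-8", dir = ".", recursive = TRUE) {
  files <- list.files(dir, full.names = TRUE, recursive = recursive)
  for (f in files) {
    x <- readLines(f, encoding = encoding, warn = FALSE)
    cat(x, sep = newline, file = f)
  }
}

# Hypothetical scratch folder containing one small file.
demo_dir <- file.path(tempdir(), "lbr_demo")
dir.create(demo_dir, showWarnings = FALSE)
demo_file <- file.path(demo_dir, "a.txt")
writeLines(c("one", "two", "three"), demo_file)

lbr_change_gsub_dir(newline = "#", dir = demo_dir)

# Read the raw file contents back as a single string.
result <- readChar(demo_file, nchars = file.size(demo_file))
result  # "one#two#three"
unlink(demo_dir, recursive = TRUE)
```

Note that cat only puts the separator between elements, so the rewritten file does not end with a trailing # or newline.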

If you want to be able to match multiline patterns, paste the lines together, collapsing them with newline, and use any pattern you like:

# Apply a regex replacement to the whole text of every file under `dir`.
lbr_gsub_dir = function(pattern, replacement, perl = TRUE, newline = '\n', encoding = 'UTF-8', dir = '.', recursive = TRUE) {
  files = list.files(dir, full.names = TRUE, recursive = recursive)
  for (f in files) {
    x <- readLines(f, encoding = encoding, warn = FALSE)
    # Collapse the lines into one string so the pattern can span line breaks.
    x <- paste(x, collapse = newline)
    x <- gsub(pattern, replacement, x, perl = perl)
    cat(x, file = f)  # overwrites the file in place
  }
}

folder <- "C:\\1"
lbr_gsub_dir("(?m)\\d+\\R(.+)", "\\1", dir = folder)

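This removes digit-only lines that are followed by a non-empty line. More precisely, any run of digits directly before a line break is removed together with that line break, and the line after it is kept.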
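
A quick check on an in-memory string shows exactly what the pattern removes before it is run over a folder. This is a minimal sketch; the sample strings are made up for illustration:

```r
# Made-up sample text: a digits-only line sits between text lines.
x <- paste(c("alpha", "123", "beta", "gamma"), collapse = "\n")

# Same pattern and replacement as above; perl = TRUE is required for \R.
out <- gsub("(?m)\\d+\\R(.+)", "\\1", x, perl = TRUE)
writeLines(out)
# alpha
# beta
# gamma

# Digits at the end of a longer line are removed too, joining the two lines.
out2 <- gsub("(?m)\\d+\\R(.+)", "\\1", "item 42\nnext", perl = TRUE)
out2  # "item next"
```

The pattern has no ^ or $ anchors, so it is not limited to lines that consist of digits only. Anchoring it, for example with ^\\d+\\R, would be one way to restrict it, if that is what is wanted.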
