Can't remove blank lines in a txt file with R
Problem description
I am doing a text analysis with R and need to convert the first letter of each sentence to lowercase while leaving the other capitalized words as they are. So I used the command
x <- gsub("(\\..*?[A-Z])", '\\L\\1', x, perl=TRUE)
This worked, but only partially. For the text analysis I had to convert the PDF files to txt format, and the txt files now contain a lot of empty lines (page breaks, possibly carriage returns), so the command does not convert capital letters that appear at the start of a new line. I tried to eliminate the empty lines using gsub with various combinations of multiple \s, \r and \n, but nothing worked. When I run inspect(x) from the tm package, the output looks like this:
[346]
[347] Thank you.
[348]
[349] Vice President of Investor Relations
[350]
I would be grateful if anyone could help me!
Recommended answer
Given your output, the empty lines appear to be separate strings in a character vector. You can filter those out using grepl:
empty_lines = grepl('^\\s*$', x)
x = x[! empty_lines]
Then you can perform your subsequent analysis, but you will probably still need to concatenate the lines first to get a single character string:
x = paste(x, collapse = '\n')
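The steps above can be sketched end to end as follows. This is a minimal sketch, assuming x is a plain character vector with one element per line (for example, as returned by readLines()); the sample values are made up for illustration. One caveat: with perl=TRUE, the `.` in the question's pattern does not match a newline by default, so this sketch joins the lines with a space instead of '\n'. Prefixing the pattern with `(?s)` would be another option.

```r
# Hypothetical sample input: one element per line, including blank lines
x <- c("", "Thank you.", "   ", "Vice President of Investor Relations", "")

# Flag elements that are empty or contain only whitespace, then drop them
empty_lines <- grepl("^\\s*$", x)
x <- x[!empty_lines]

# Join with a space (an assumption of this sketch) so that the pattern
# below can reach across what used to be a line break
x <- paste(x, collapse = " ")

# The pattern from the question: lowercase everything from a period
# up to and including the next capital letter
x <- gsub("(\\..*?[A-Z])", "\\L\\1", x, perl = TRUE)

print(x)
# [1] "Thank you. vice President of Investor Relations"
```

Note that the question's pattern lowercases the next capital letter after any period, even if that letter is several words away, so it may be worth tightening it (for instance to a period followed only by whitespace) depending on the text.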