Can't remove blank lines in txt file with R

Problem description

I am doing text analysis with R and need to convert the first letter of each sentence to lowercase while keeping the other capitalized words the way they are. So I used the command

     x <- gsub("(\\..*?[A-Z])", '\\L\\1', x, perl=TRUE)

which worked, but only partially. The problem is that for the text analysis I had to convert the PDF files into txt format, and the txt files now contain a lot of empty lines (page breaks, possibly carriage returns), so the command I used does not convert the capital letters that appear at the start of a new line. I tried to eliminate the empty lines using gsub with different combinations of \s, \r and \n, but nothing worked. When I run inspect(x) from the tm package, the output looks like this:

[346]                                                                                                                                                                                                                                                  
[347]    Thank you.                                                                                                                                                                                                                                    
[348]                                                                                                                                                                                                                                                  
[349]    Vice President of Investor Relations                                                                                                                                                                                               
[350]   

I would be grateful if anyone could help me!

Recommended answer

Given your output, the empty lines appear to be separate character strings in a character vector. You need to filter those out using grepl, which returns a logical vector you can index with:

empty_lines = grepl('^\\s*$', x)
x = x[! empty_lines]
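
To make the filtering step concrete, here is a minimal sketch on a made-up vector. The sample strings are an assumption modelled on the inspect() output in the question, and the trimws() call is an extra that the answer does not use; it only strips the leading and trailing padding.

```r
# Hypothetical sample resembling the inspect() output in the question
x <- c("   ", "   Thank you.   ", "",
       "   Vice President of Investor Relations   ", "\t")

# TRUE for elements that are empty or contain only whitespace
empty_lines <- grepl("^\\s*$", x)

# Keep the remaining elements; trimws() (not part of the answer)
# removes the surrounding whitespace
x_clean <- trimws(x[!empty_lines])

print(x_clean)
```

Because the blank lines are whole elements of the vector, indexing with the negated logical vector removes them, which gsub on each element cannot do.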

Then you can perform your subsequent analysis, but you probably still need to concatenate the lines first to get a single character string:

x = paste(x, collapse = '\n')
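
As a rough sketch of how the concatenated string might then be passed to the gsub command from the question: the sample lines below are made up. One thing worth checking on real data is that, with perl=TRUE, `.` does not match a newline by default, so the original pattern may still stop at line breaks after collapsing with '\n'. Two possible workarounds, neither of which comes from the answer, are to collapse with a space or to add the `(?s)` flag to the pattern.

```r
# Hypothetical lines, already stripped of blank entries
x <- c("Thank you.", "Vice President of Investor Relations")

# Variant 1 (assumption): join with a space so the original pattern can
# reach the capital letter that starts the next line
txt_space <- paste(x, collapse = " ")
out_space <- gsub("(\\..*?[A-Z])", "\\L\\1", txt_space, perl = TRUE)

# Variant 2 (assumption): keep the newlines and add (?s) so that '.'
# also matches '\n'
txt_nl <- paste(x, collapse = "\n")
out_nl <- gsub("(?s)(\\..*?[A-Z])", "\\L\\1", txt_nl, perl = TRUE)

print(out_space)
cat(out_nl, "\n")
```

Which variant is preferable depends on whether the later analysis needs the original line breaks.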
