Read csv file with hidden or invisible character ^M


Problem description


I am attempting unsuccessfully to read a *.csv file containing hidden or invisible characters. The file contents are shown here:

my.data2 <- read.table(text = '
Common.name, Scientific.name, Stuff1, Stuff2
Greylag.Goose, Anser.anser, AAC, rr
Snow.Goose, Anser.caerulescens, AAC, rr
Greater.Canada.Goose, Branta.canadensis, AAC, rr
Barnacle.Goose, Branta.leucopsis, AAC, rr
Brent.Goose, Branta.bernicla, AAC, rr
', header = TRUE, sep=',', stringsAsFactors = FALSE)

Note that the above read.table command reads the data correctly. However, read.csv cannot read the file correctly because in many lines there is a hidden character following the second blank space. In some lines there is also a hidden character after the first blank space. In some lines there are no hidden characters. For example:

setwd('c:/users/mmiller21/simple R programs')

my.data <- read.csv('invisible.delimiter2.csv', header = TRUE)
my.data

returns:

            Common.name    Scientific.name Stuff1 Stuff2
1         Greylag.Goose        Anser.anser              
2                   AAC                 rr              
3            Snow.Goose                                 
4    Anser.caerulescens                                 
5                   AAC                 rr              
6  Greater.Canada.Goose  Branta.canadensis    AAC     rr
7        Barnacle.Goose   Branta.leucopsis              
8                   AAC                 rr              
9           Brent.Goose    Branta.bernicla              
10                  AAC                 rr              

More specifically, if I open the *.csv file in Notepad and use the right-arrow key to move the cursor along the first line of data, I have to press the right-arrow key twice to move past the first A in AAC.
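A symptom like this can also be checked from within R by looking at the raw bytes of a line. The following is only a minimal sketch: the sample string is made up for illustration, and it assumes the hidden character is a control character (here a carriage return, written `\r`).

```r
# Hypothetical one-line sample; the \r stands in for the hidden character.
line <- "Greylag.Goose, Anser.anser, \rAAC, rr"

# Convert the string to raw bytes so that non-printing characters become visible.
bytes <- charToRaw(line)
codes <- as.integer(bytes)

# Flag control characters (codes below 32) other than tab (9) and line feed (10).
hidden_positions <- which(codes < 32 & !(codes %in% c(9L, 10L)))
hidden_codes <- codes[hidden_positions]

print(hidden_positions)  # where the hidden bytes sit in the line
print(hidden_codes)      # 13 is a carriage return, which Vim displays as ^M
```

For a real file, the bytes could be obtained with `readBin(path, what = "raw", n = file.size(path))` instead of `charToRaw()`, since reading in binary mode leaves every byte untouched.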

The following line does not solve the problem:

my.data <- read.csv('invisible.delimiter2.csv', sep=',', header = TRUE)

In my experience tabs are a fairly common hidden character or delimiter. However, I have tried searching for and replacing tabs and that does not help.

I have also tried converting the *.csv file to a *.txt file, but that returns the following:

> my.data3 <- read.table('invisible.delimiter2.txt', sep=',', header = TRUE)
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  : 
  line 1 did not have 4 elements
> my.data3
Error: object 'my.data3' not found

I am not familiar with other possible solutions. The file is too large to manually search every space for a hidden character and remove it.

Thank you for any advice on how to read a file like this or on how to find and remove hidden characters prior to reading the file into R.

If it helps, I originally obtained the data by copying a table from Wikipedia. Perhaps that might help identify the hidden character.

EDIT

Thanks to the comments below, I opened the example data file using gVim 7.3. That software displays the hidden character and reveals it to be ^M. Unfortunately, I have not been able to remove that character from the data file with a simple find and replace within gVim 7.3. If and when I figure out how to remove the ^M, I will post the approach here.

Here is a post on how to remove ^M with Perl.

In Perl, how to do you remove ^M from a file?

Hopefully I can figure out how to remove it with R or a text editor.
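One possible way to do this in R, offered as a sketch rather than a tested fix for the actual file: read the file as raw bytes, drop every carriage-return byte (0x0d, which Vim shows as ^M), and parse what is left. Reading in binary mode matters because R's text readers accept a lone CR as a line ending, which would explain the split rows shown above. The sample content and the temporary file below are made up for illustration.

```r
# Build a small sample file with stray carriage returns (0x0d) inside two rows.
sample_text <- paste0(
  "Common.name, Scientific.name, Stuff1, Stuff2\n",
  "Greylag.Goose, Anser.anser, \rAAC, rr\n",
  "Snow.Goose, \rAnser.caerulescens, \rAAC, rr\n",
  "Greater.Canada.Goose, Branta.canadensis, AAC, rr\n"
)
path <- tempfile(fileext = ".csv")
writeBin(charToRaw(sample_text), path)

# Read the file as raw bytes so that R does not treat CR as a line ending.
bytes <- readBin(path, what = "raw", n = file.size(path))

# Drop every carriage-return byte, then parse the cleaned text.
clean <- bytes[bytes != as.raw(0x0d)]
my.data <- read.csv(text = rawToChar(clean), header = TRUE,
                    strip.white = TRUE, stringsAsFactors = FALSE)
print(my.data)

# Remove the temporary sample file.
unlink(path)
```

With the real data, `path` would simply be `'invisible.delimiter2.csv'`. Note that this removes every CR, so a file with Windows (CRLF) line endings ends up with plain LF line endings, which R reads without trouble.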

Here is a link where the example *.csv file is stored.

https://github.com/markwmiller/Rcode/blob/93d07bd2e389e516b6da92017e025a1e97173db0/invisible.delimiter2.csv

and an alternative link to the same file on the same site:

https://github.com/markwmiller/Rcode

Solution

In gVim you should be able to remove the ^M characters by typing the following:

:%s/<ctrl>V<ctrl>M//g<return>

If you've typed it in correctly it will look like ':%s/^M//g' in gVim. When you press return, gVim runs a substitute command (the 's') on every line of the file (the '%'): it searches for what's between the first and second slash and replaces it with what's between the second and third slash (here, nothing), for every match on a line (the 'g').
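For comparison, the same pattern-and-replacement idea can be sketched in R. This is an illustration, not part of the gVim procedure: `sub()` replaces only the first match, much like leaving off the 'g', while `gsub()` replaces every match. The sample string is made up, with `\r` standing in for ^M.

```r
# Hypothetical line containing two carriage returns.
line <- "Snow.Goose, \rAnser.caerulescens, \rAAC, rr"

# Replace only the first match (comparable to a substitute without 'g').
first_only <- sub("\r", "", line, fixed = TRUE)

# Replace every match (comparable to a substitute with 'g').
all_matches <- gsub("\r", "", line, fixed = TRUE)

# Check whether any carriage return is left in each result.
print(grepl("\r", first_only, fixed = TRUE))
print(grepl("\r", all_matches, fixed = TRUE))
```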

NOTE: If you are on a Windows box and <ctrl>V seems to be pasting text, then gVim may be configured with 'windows behavior'. In that case, use <ctrl>Q<ctrl>M instead of <ctrl>V<ctrl>M.

When I load your sample file into gVim 7.3, it looks like this:

[screenshot not reproduced]

After typing the characters

:%s/<ctrl>V<ctrl>M//g

but BEFORE hitting return I see this:

[screenshot not reproduced]

After hitting return I see this:

[screenshot not reproduced]

You can then do File->Save or File->Save As, which do what you would expect.
