Dealing with Byte Order Mark (BOM) in R


Question

Sometimes a Byte Order Mark (BOM) is present at the beginning of a .CSV file. The symbol is not visible when you open the file in Notepad or Excel. However, when you read the file in R using various methods, you will see different symbols in the name of the first column. Here is an example.

A sample CSV file with a BOM at the beginning:

ID,title,clean_title,clean_title_id
1,0 - 0,,0
2,"""0 - 1,000,000""",,0
27448,"20yr. rope walker
igger",Rope Walker Igger,1832700817
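
Before comparing readers, it can help to confirm that a file really starts with a BOM. The following is a minimal sketch, assuming the file is UTF-8 encoded, in which case the BOM is the three bytes EF BB BF. The helper name `has_utf8_bom` and the temporary file are hypothetical and exist only for illustration.

```r
# Hypothetical helper: TRUE if the file starts with the UTF-8 BOM bytes EF BB BF.
has_utf8_bom <- function(path) {
  first_bytes <- readBin(path, what = "raw", n = 3L)
  identical(first_bytes, as.raw(c(0xEF, 0xBB, 0xBF)))
}

# Build a small file with a BOM so the example is self-contained.
bom_file <- tempfile(fileext = ".csv")
writeBin(c(as.raw(c(0xEF, 0xBB, 0xBF)), charToRaw("ID,title\n1,a\n")), bom_file)

# Build the same file without a BOM for comparison.
plain_file <- tempfile(fileext = ".csv")
writeBin(charToRaw("ID,title\n1,a\n"), plain_file)

has_utf8_bom(bom_file)    # TRUE
has_utf8_bom(plain_file)  # FALSE
```

Reading the raw bytes avoids any decoding step, so the check does not depend on the locale or on how a particular CSV reader treats the BOM.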

Reading with read.csv from base R:

(x1 = read.csv("file1.csv",stringsAsFactors = FALSE))
#   ï..ID                raw_title        semi_clean semi_clean_id
# 1     1                    0 - 0                               0
# 2     2          "0 - 1,000,000"                               0
# 3 27448 20yr. rope walker\nigger Rope Walker Igger    1832700817

Reading with fread from the data.table package:

(x2 = data.table::fread("file1.csv"))
#    ï»¿ID                raw_title        semi_clean semi_clean_id
# 1:     1                    0 - 0                               0
# 2:     2        ""0 - 1,000,000""                               0
# 3: 27448 20yr. rope walker\rigger Rope Walker Igger    1832700817

Reading with read_csv from the readr package:

(x3 = readr::read_csv("file1.csv"))
#   <U+FEFF>ID                raw_title        semi_clean semi_clean_id
# 1          1                    0 - 0              <NA>             0
# 2          2          "0 - 1,000,000"              <NA>             0
# 3      27448 20yr. rope walker\rigger Rope Walker Igger    1832700817

Notice the different characters in front of the variable name ID.

Here are the results of running names() on each of these:

names(x1)
# [1] "ï..ID"         "raw_title"     "semi_clean"    "semi_clean_id"
names(x2)
# [1] "ï»¿ID"      "raw_title"     "semi_clean"    "semi_clean_id"
names(x3)
# [1] "ID"             "raw_title"     "semi_clean"    "semi_clean_id"

In x3, there is nothing visible in front of ID, but when you check:

names(x3)[[1]]=="ID"
# [1] FALSE

How can I get rid of these unwanted characters in each case? PS: Please add more methods of reading CSV files, the problems encountered, and the solutions.

Recommended answer

For read.csv in base R, use:

x1 = read.csv("file1.csv",stringsAsFactors = FALSE, fileEncoding = "UTF-8-BOM")
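
To see the effect of `fileEncoding = "UTF-8-BOM"` without needing file1.csv, here is a small self-contained sketch. The temporary file and its two-column contents are made up for illustration and are not the data from the question.

```r
# Write a tiny CSV that starts with the UTF-8 BOM bytes EF BB BF.
tmp <- tempfile(fileext = ".csv")
writeBin(c(as.raw(c(0xEF, 0xBB, 0xBF)), charToRaw("ID,title\n1,a\n2,b\n")), tmp)

# "UTF-8-BOM" tells the file connection to drop a leading BOM if one is present.
x <- read.csv(tmp, stringsAsFactors = FALSE, fileEncoding = "UTF-8-BOM")

names(x)[[1]] == "ID"  # TRUE
```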

For fread, use:

x2 = data.table::fread("file1.csv")
# "ï»¿" is how the BOM bytes EF BB BF appear when decoded as Latin-1
data.table::setnames(x2, "ï»¿ID", "ID")

For read_csv, use:

x3 = readr::read_csv("file1.csv")
data.table::setDT(x3) # convert to a data.table in place, so that setnames can be used
data.table::setnames(x3, "\uFEFFID", "ID")
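
If the BOM survives as the invisible character U+FEFF at the start of the first column name (as in the read_csv output in the question), the name can also be cleaned without knowing the column name in advance. This is a sketch under that assumption. `strip_bom_from_names` is a hypothetical helper, not a function from readr or data.table.

```r
# Hypothetical helper: drop a leading U+FEFF from the first column name, if present.
strip_bom_from_names <- function(df) {
  names(df)[1] <- sub("^\uFEFF", "", names(df)[1])
  df
}

# Simulate a data frame whose first name carries a hidden BOM.
df <- data.frame(a = 1:2, title = c("x", "y"), stringsAsFactors = FALSE)
names(df)[1] <- "\uFEFFID"
names(df)[[1]] == "ID"       # FALSE, the BOM is still there

cleaned <- strip_bom_from_names(df)
names(cleaned)[[1]] == "ID"  # TRUE
```

This only handles the U+FEFF form. The mis-decoded forms shown for read.csv and fread would need their own patterns.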

A non-R solution is to open the file in Notepad++, change the encoding to UTF-8 without BOM from the Encoding menu, and then save the file.
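
The same clean-up can be approximated from within R by rewriting the file without its first three bytes. This is a minimal sketch, assuming a UTF-8 file small enough to read into memory. `remove_utf8_bom` and the temporary paths are hypothetical names used only for this example.

```r
# Hypothetical helper: copy path_in to path_out, dropping a leading UTF-8 BOM if present.
remove_utf8_bom <- function(path_in, path_out) {
  bytes <- readBin(path_in, what = "raw", n = file.size(path_in))
  bom <- as.raw(c(0xEF, 0xBB, 0xBF))
  if (length(bytes) >= 3L && identical(bytes[1:3], bom)) {
    bytes <- bytes[-(1:3)]
  }
  writeBin(bytes, path_out)
  invisible(path_out)
}

# Self-contained demonstration with a temporary file that starts with a BOM.
with_bom <- tempfile(fileext = ".csv")
writeBin(c(as.raw(c(0xEF, 0xBB, 0xBF)), charToRaw("ID,title\n1,a\n")), with_bom)

without_bom <- tempfile(fileext = ".csv")
remove_utf8_bom(with_bom, without_bom)

# The rewritten file now reads cleanly with default settings.
y <- read.csv(without_bom, stringsAsFactors = FALSE)
names(y)[[1]] == "ID"  # TRUE
```

Writing to a separate output path keeps the original file intact in case the bytes at the start were not a BOM after all.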
