How to determine the correct file encoding for use with read.fwf (or use a workaround to remove non-conforming characters)


Question

I tried the approach in the following question and am still stuck.

How to detect the right encoding for read.csv?

The code below should be reproducible... Any ideas? I'd rather not use scan() or readLines because I've been using this code successfully for assorted state-level ACS data in the past...

My other thought was to edit the text file prior to importing it. However, I store the files zipped and use a script to unzip them before accessing the data, so having to edit the file outside of the R environment would really gum up that process. Thanks in advance!
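One way to keep that whole step inside R is base R's unzip(); a minimal sketch, assuming the zipped copy is named "g20095us.zip" (a hypothetical name, substitute whatever the script actually stores):

# Hypothetical zip name; unzip() extracts the file and returns its path,
# so any pre-import clean-up can happen on the extracted copy inside R.
Zip <- "g20095us.zip"
Txt <- unzip(Zip, files = "g20095us.txt", exdir = tempdir())
# Txt now holds a local path usable directly in read.fwf()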

Filename <- "g20095us.txt"
Url <- "http://www2.census.gov/acs2005_2009_5yr/summaryfile/2005-2009_ACSSF_By_State_By_Sequence_Table_Subset/UnitedStates/All_Geographies_Not_Tracts_Block_Groups/"

Widths <- c(6,2,3,2,7,1,1,1,2,2,3,5,5,6,1,5,4,5,1,3,5,5,5,3,5,1,1,5,3,5,5,5,2,3,
        3,6,3,5,5,5,5,5,1,1,6,5,5,40,200,6,1,50)
Classes <- c(rep('character',4),'integer',rep('character',47))
Names <- c('fileid','stusab','sumlev','geocomp','logrecno','us','region','division',
       'statece','state','county','cousub','place','tract','blkgrp','concit',
       rep('blank',14),'ua',rep('blank',11),'ur',rep('blank',4),'geoid','name',rep('blank',3))
GeoHeader <- read.fwf(paste0(Url,Filename),widths=Widths,
                  colClasses=Classes,col.names=Names,fill=TRUE,strip.white=TRUE)

Four lines from the file "g20095us.txt" are shown below. The second one, "Cañoncito", is causing the problems. The other files in the download are csv, but this one is fixed-width and is necessary to identify geographies of interest (the organization of the data is not very intuitive).

ACSSF US251000000964 2430 090 25100US2430090 Cameron Chapter, Navajo Nation Reservation and Off-Reservation Trust Land, AZ--NM--UT
ACSSF US251000000965 2430 092 25100US2430092 Cañoncito Chapter, Navajo Nation Reservation and Off-Reservation Trust Land, AZ--NM--UT
ACSSF US251000000966 2430 095 25100US2430095 Casamero Lake Chapter, Navajo Nation Reservation and Off-Reservation Trust Land, AZ--NM--UT
ACSSF US251000000967 2430 105 25100US2430105 Chi Chil Tah Chapter, Navajo Nation Reservation and Off-Reservation Trust Land, AZ--NM--UT

Answer

First, we start by identifying all non-ASCII characters. I do this by converting to a raw vector, and then looking for values over 127 (the last unambiguously encoded value in ASCII).

lines <- readLines("g20095us.txt")

# TRUE if any byte in the string falls outside the 7-bit ASCII range
non_ascii <- function(x) {
  any(charToRaw(x) > 127)
}

bad <- vapply(lines, non_ascii, logical(1), USE.NAMES = FALSE)
lines[bad]
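As an aside (not part of the original answer): base R ships a helper in the tools package that performs much the same scan.

# Prints (and invisibly returns) the elements containing non-ASCII bytes
tools::showNonASCII(lines)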

We then need to figure out what the correct encoding is. This is challenging when we have only two cases, and it often involves some trial and error. In this case I googled for "encoding \xf1" and discovered Why doesn't this conversion to utf8 work?, which suggested that latin1 might be the correct encoding.
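To make the trial and error a little more systematic, one approach (a sketch of mine, with the candidate list chosen by hand) is to convert a suspect line under each candidate encoding and eyeball the results. iconv() returns NA when the bytes are invalid in the source encoding, which rules a guess out immediately; latin1 accepts every byte, so a survivor still has to look right when printed.

# Hand-picked candidates (an assumption); inspect the printed results
candidates <- c("latin1", "windows-1252", "UTF-8")
sapply(candidates, function(enc) iconv(lines[bad][1], enc, "UTF-8"))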

I tested that using iconv, which converts from one encoding to another (and you always want to use utf-8):

iconv(lines[bad], "latin1", "utf-8")
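As a quick sanity check (my addition, not in the original answer), base R's validUTF8() confirms the conversion produced well-formed UTF-8:

converted <- iconv(lines[bad], "latin1", "utf-8")
all(validUTF8(converted))  # TRUE when every converted line is valid UTF-8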

Finally, we reload the file with the correct encoding. Confusingly, the encoding argument to any of the read.* functions doesn't do this - you need to manually specify an encoding on the connection:

fixed <- readLines(file("g20095us.txt", encoding = "latin1"))
fixed[bad]
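To tie this back to the original question: read.fwf also accepts a connection, so the same trick should make the original import work. A sketch reusing the question's layout vectors (untested against the live file; recent versions of read.fwf also offer a fileEncoding argument):

# Hand read.fwf a connection that declares the encoding, instead of a
# bare file name, reusing Widths/Classes/Names from the question.
con <- file("g20095us.txt", encoding = "latin1")
GeoHeader <- read.fwf(con, widths = Widths, colClasses = Classes,
                      col.names = Names, fill = TRUE, strip.white = TRUE)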
