< U + 00A0>读取csv文件时的特殊字符 [英] <U+00A0> special characters when reading a csv file

查看:103
本文介绍了< U + 00A0>读取csv文件时的特殊字符的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在将多个csv文件读入R作为数据帧列表.在Windows计算机上工作.

I am reading multiple csv files into R as list of dataframe. Working on Windows machine.

     create_lstdf_csv <- function(path, pattern = "*.csv") {
      files <- dir(path = path, pattern)
      lstdf <- files %>%
        purrr::map(function(x) vroom::vroom(file = file.path(path, x),
        .name_repair = ~ janitor::make_clean_names(.)),
trimws = T) %>%
        stats::setNames(tools::file_path_sans_ext(files)) %>%
        purrr::map(~.x,janitor::remove_empty(which = c("rows", "cols")))
      return(lstdf)
    }

数据框中的某些列具有一些空格 \ xa0 .即使vroom函数将 trimws 设置为True,它也不会删除前导空白.

certain columns in the data frame has some spaces\xa0. Even though vroom function has trimws as True, it did not remove the leading an trailing white space.

   <chr>          
 1 "CTLA4"        
 2 "PDCD1"        
 3  NA            
 4  NA            
 5 "CXCR3"        
 6  NA            
 7 "\xa0KLRK1"    
 8 "\xa0NCR3\xa0" 
 9 "\xa0NCR2"     
10 "IL-12A/IL-12B" 

即使在对UTF-8进行编码后,当我使用 gsub("\\ xA0&",,",df $ gene,perl = TRUE)时,也会遇到相同的错误./p>

When I use gsub("\\xA0", " ", df$gene, perl = TRUE) even after encoding to UTF-8, I get the same error.

Error in gsub("\\xA0", " ", df$gene, perl = TRUE) : 
  input string 7 is invalid UTF-8

在将文件读入列表df时,是否有一种方法可以避免此错误?

is there a way to avoid this error while reading files into list df ?

数据

structure(list(gene = c("CTLA4", "PDCD1", NA, NA, "CXCR3", NA, 
"<U+00A0>KLRK1", "<U+00A0>NCR3<U+00A0>", "<U+00A0>NCR2", "IL-12A/IL-12B", 
"IL18R1 and IL18RAP", "<U+00A0>KLRK1", "IFNG", NA, "<U+00A0>KLRK1", 
"<U+00A0>KLRK1", "CXCR (gene group)", "CTLA4", "CTLA4", "PDCD1<U+00A0>", 
"HAVCR2", "CD28", "CD28", "CTLA4", "CTLA4", "CTLA4", "CTLA4", 
"PDCD1<U+00A0>", "PDCD1<U+00A0>", "PDCD1<U+00A0>", "PDCD1<U+00A0>", 
"CD80", "CD80", "LAG3", "LAG3", "<U+00A0>HAVCR2", "<U+00A0>HAVCR2", 
"<U+00A0>HAVCR2", "TNFRSF9", "TNFRSF9", "TNFRSF18", "TNFRSF18", 
"CD40", "CD40", "TNFRSF4", NA, NA, NA, NA, "TLR2", NA, NA, "<U+00A0>KLRK1", 
"<U+00A0>KLRK1", "CCR6", NA, "PDCD1<U+00A0>", "CCR4", "CCR4", 
"ITGAE", "TNFRSF9", "CSF1R", "CCR4", "CCR4", "CCR2", "CD40", 
"TNFRSF17", "TNFRSF13B", "FLT3", "CSF2RA", "CD40", "TNFRSF14", 
"IL12RB1 and IL12RB2", "IL12RB1 and IL12RB2", "IL18R1 and IL18RAP", 
"IL18R1 and IL18RAP", "IL18R1 and IL18RAP", NA, "TIGIT", "TMIGD2", 
"ICOS", "CD27", "TNFRSF14", "TNFRSF14", "TNFRSF14", "TNFRSF14", 
"<U+00A0>HAVCR2", "<U+00A0>HAVCR2", "LAG3", "LAG3", "TIGIT", 
"TIGIT", "TIGIT", "TIGIT", "TIGIT", "TIGIT", "TMIGD2", "TMIGD2", 
"ICOS", "ICOS", "CD27", "CD27", "TNFRSF9", "TNFRSF9", "TNFRSF18", 
"TNFRSF18", "TNFRSF4", "TNFRSF4", "CD40", "CD40", "TNFRSF14", 
"TNFRSF14", "FAS", "CD28", "CTLA4", "PDCD1<U+00A0>", "CD28", 
"CD28", "CD28", "CD28", "CTLA4", "CTLA4", "CTLA4", "CTLA4", "PDCD1<U+00A0>", 
"PDCD1<U+00A0>", NA, "CD40", "PDCD1<U+00A0>", "CTLA4", "CD28", 
"IL6R", "EPHA4", "THY1", "PDCD1<U+00A0>", "CD28", "CD28", "CTLA4", 
"CTLA4", "PDCD1<U+00A0>", "<U+00A0>HAVCR2", "LAG3", "TIGIT", 
"TIGIT", NA)), row.names = c(NA, -145L), class = c("tbl_df", 
"tbl", "data.frame"))

推荐答案

这应该对您有用:

df %>% 
  mutate(clean_gene = gsub("<([[:alpha:]][[:alnum:]]*)(.[^>]*)>([.^<]*)", "\\3", gene))

注意 clean_gene

gene               clean_gene        
   <chr>              <chr>             
 1 IL-12A/IL-12B      IL-12A/IL-12B     
 2 IL18R1 and IL18RAP IL18R1 and IL18RAP
 3 <U+00A0>KLRK1      KLRK1             
 4 IFNG               IFNG              
 5 NA                 NA                
 6 <U+00A0>KLRK1      KLRK1             
 7 <U+00A0>KLRK1      KLRK1            

要应用于 data.frame s的列表:

library(purrr)
library(dplyr)

list_of_dfs <- list_of_dfs %>% 
  map(~mutate(., gene = gsub("<([[:alpha:]][[:alnum:]]*)(.[^>]*)>([.^<]*)", "\\3", gene)))

这篇关于&lt; U + 00A0&gt;读取csv文件时的特殊字符的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆