使用dplyr和SQLite进行UTF-8编码 [英] UTF-8 encoding with dplyr and SQLite

查看:119
本文介绍了使用dplyr和SQLite进行UTF-8编码的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我在SQLite中有一个表,我想用dplyr打开它。在使用Win 7的PC上使用SQLite Expert版本35.58.2478,R Studio Version 0.98.1062。



使用src_sqlite()连接到数据库并使用tbl读取()我得到了表。但是包含的字符是错误的。从csv文件读取相同的表只是通过在函数read.csv中添加encoding =utf-8,但是在这种情况下会出现第一列名称中的另一个错误(请考虑下面的最小例子)。 >

请注意,在SQLite表中,编码为UTF-8,SQLite正确显示数据。



我尝试更改在R Studio选项中的编码没有成功。也可以更改窗口或r中的区域没有任何影响。



有没有解决方案使用dplyr将表中的字符正确插入r?

最小例子



  library(dplyr)
db< - src_sqlite(C:/Users/Jens/Documents/SQLite/my_db.sqlite)
tbl(db,prozesse)
##来源:sqlite 3.7.17 [C:/ Users / Jens / Documents / SQLite / my_db.sqlite]
##来自:prozesse [4 x 4]
##
## KH_IDEinschätzungProzess Gruppe
## 1 1 3 Buchung IT
## 2 2 4 Buchung IT
## 3 3 3 Buchung OLP
## 4 4 5 Buchung OLP

您在第二列的名称中看到错误的编码。这个问题在ä,ö,ü等的列中也出现。



第二列的名称正确显示,但第一列错误: p>

  read.csv(C:/Users/Jens/Documents/SQLite/prozess.csv,encoding =UTF-8 )
## XUFEFF.KH_IDEinschätzungGruppe Prozess
## 1 1 3 PO visite
## 2 2 3 IT visite
## 3 3 3 IT visite
## 4 2 3 PO visite


sessionInfo()
## R版本3.1.1(2014-07-10)
##平台:x86_64 -w64-mingw32 / x64(64位)
##
## locale:
## [1] LC_COLLATE = German_Germany.1252 LC_CTYPE = German_Germany.1252
## [3] LC_MONETARY = German_Germany.1252 LC_NUMERIC = C
## [5] LC_TIME = German_Germany.1252
##
##附加的基本包:
## [1]统计图形grDevices utils数据集方法base
##
##其他附件包:
## [1] RSQLite.extfuns_0.0.1 RSQLite_0.11.4 DBI_0.3.0
## [4] dplyr_0.2
##
##通过命名空间加载(而不附加):
## [1] assertthat_0.1 digest_0.6.4 evaluate_0.5.5 formatR_1.0
## [5] htmltools_0.2.6 knitr_1 .6 parallel_3.1.1 Rcpp_0.11.2
## [9] rmarkdown_0.3.3 stringr_0.6.2 tools_3.1.1 yaml_2.1.13

$ b $我也有同样的问题。我下面解决了但是,我不保证解决方案是坚实的。尝试一下:

  library(dplyr)
库(sqldf)

#修改内置的mtcars数据集

mtcars $ test < -
c(č,ž,š,č,ž,š )%>%
enc2utf8(。)

mtcars $češćžä< -
c(č,ž,š,č, ,š,字母)%>%
enc2utf8(。)

名称(mtcars)< -
iconv(names(mtcars),cp1250 utf-8)

#连接到sqlite数据库

my_db< - src_sqlite(my_db.sqlite3,create = T)

#将mtcars数据集导出到数据库
copy_to(my_db,mtcars,temporary = FALSE)

#dbSendQuery(my_db $ con,drop table mtcars)

#从sqlite数据库获取数据
my_mtcars_from_db< -
collect(tbl(my_db,mtcars))

#断开数据库
dbDisconnect(my_db $ con )



convert_to_encoding()函数



 #一个编码
的函数#列名称和字符列中的值
#与指定的编码
convert_to_encoding < -
函数(x,from_encoding =UTF-8,to_encoding =cp1250){

#列的名称以指定的编码编码
my_names< -
iconv(names(x),from_encoding,to_encoding)

#如果有任何列名称是NA,留下名字
#否则用新名称替换
if(any(is.na(my_names))){
names(x)
} else {
名称(x)< - my_names
}

#获取列类
x_char_columns< - sapply(x,class)
#识别字符列
x_cols< - 名称(x_char_columns [x_char_columns ==character])

#将字符列中的所有字符串值转换为
#指定编码
x< -
x%>%
mutate_each_(funs(iconv(。,from_encoding,to_encoding)),
x_cols)
# return x
return(x)
}

#使用
convert_to_encoding(my_mtcars_from_db,UTF-8,cp1250)



结果



  
my_mtcars_from_db

来源:本地数据框[32 x 13]

mpg cyl disp hp drat wt qsec vs am gear carbÄŤeĹćžä‡test testžÞà b 1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4ÄŤÄŤ
2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4žž
3 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1Ĺ Ĺ
4 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1ÄŤÄŤ
5 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2žž
6 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1ĹĹ
7 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 aa
8 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 bb
9 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 cc
10 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 dd
.. ... ... ... ... ... ... ... .. ... ... ... ...

转换后
convert_to_encoding(my_mtcars_from_db,UTF- 8,cp1250)

来源:本地数据框[32 x 13]

mpg cyl disp hp drat wt qsec vs am gear carb testčešćžä
1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4čč
2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4ž
3 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1šš
4 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1čč
5 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2ž
6 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1šš
7 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 aa
8 24.4 4 14 6.7 62 3.69 3.190 20.00 1 0 4 2 bb
9 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 cc
10 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 dd
.. ... ... ... ... ... ... ... .. ... ... ... ...



会话信息



  devtools :: session_info()

会话信息------------------------------------------ -------------------------
设置值
版本R版本3.2.0(2015-04-16)
系统x86_64,mingw32
ui RStudio(0.99.441)
语言(EN)
collat​​e Slovenian_Slovenia.1250
tz欧洲/布拉格

包-------------------------------------------------- ---------------------
包*版本日期来源
assertthat * 0.1 2013-12-06 CRAN(R 3.2.0)
chron * 2.3-45 2014-02-11 CRAN(R 3.2.0)
DBI 0.3.1 2014-09-24 CRAN(R 3.2.0)
devtools * 1.7.0 2015-01-17 CRAN(R 3.2 .0)
dplyr 0.4.1 2015-01-14 CRAN(R 3.2.0)
gsubfn 0.6-6 2014-08-27 CRAN(R 3.2.0)
lazyeval * 0.1 .10 2015-01-02 CRAN(R 3.2.0)
magrittr * 1.5 2014-11-22 CRAN(R 3.2.0)
proto 0.3-10 2012-12-22 CRAN(R 3.2 .0)
R6 * 2.0.1 2014-10-29 CRAN(R 3.2.0)
Rcpp * 0.11.6 2015-05-01 CRAN(R 3.2.0)
RSQLite 1.0.0 2014-10-25 CRAN(R 3.2.0)
rstudioapi * 0.3.1 2015-04-07 CRAN(R 3.2.0)
sqldf 0.4-10 2014-11-07 CRAN (R 3.2.0)


I have a table in SQLite and I’d like to open it with dplyr. I use SQLite Expert Version 35.58.2478, R Studio Version 0.98.1062 on a PC with Win 7.

After connecting to the database with src_sqlite() and reading with tbl() I get the table. But the character enconding is wrong. Reading the same table from a csv-file just works by adding encoding = "utf-8" to the function read.csv but in this case another error in the first column name occurs (please consider the minimal example below).

Note that in the SQLite table the encoding is UTF-8 and SQLite displays the data correctly.

I tried to change the encoding in R Studio options with no success. Also changing the region in windows or in r doesn’t have any effect.

Is there any solution of getting the characters in the table correctly into r using dplyr?

Minimal Example

library(dplyr)
db <- src_sqlite("C:/Users/Jens/Documents/SQLite/my_db.sqlite")
tbl(db, "prozesse")
## Source: sqlite 3.7.17 [C:/Users/Jens/Documents/SQLite/my_db.sqlite]
## From: prozesse [4 x 4]
## 
##   KH_ID Einschätzung Prozess Gruppe
## 1     1             3 Buchung     IT
## 2     2             4 Buchung     IT
## 3     3             3 Buchung    OLP
## 4     4             5 Buchung    OLP

You see the wrong encoding in the name of the second column. This issue occures as well in the colums with ä, ö, ü etc.

The name of the second column is displayed correctly, but the first column is wrong:

read.csv("C:/Users/Jens/Documents/SQLite/prozess.csv", encoding = "UTF-8")
##   X.U.FEFF.KH_ID Einschätzung Gruppe Prozess
## 1              1            3     PO  visite
## 2              2            3     IT  visite
## 3              3            3     IT  visite
## 4              2            3     PO  visite


sessionInfo()
## R version 3.1.1 (2014-07-10)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## 
## locale:
## [1] LC_COLLATE=German_Germany.1252  LC_CTYPE=German_Germany.1252   
## [3] LC_MONETARY=German_Germany.1252 LC_NUMERIC=C                   
## [5] LC_TIME=German_Germany.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] RSQLite.extfuns_0.0.1 RSQLite_0.11.4        DBI_0.3.0            
## [4] dplyr_0.2            
## 
## loaded via a namespace (and not attached):
##  [1] assertthat_0.1  digest_0.6.4    evaluate_0.5.5  formatR_1.0    
##  [5] htmltools_0.2.6 knitr_1.6       parallel_3.1.1  Rcpp_0.11.2    
##  [9] rmarkdown_0.3.3 stringr_0.6.2   tools_3.1.1     yaml_2.1.13

解决方案

I had the same problem. I solved it like below. However, I do not guarantee that the solution is rock solid. Give it a try:

library(dplyr)
library(sqldf)

# Modifying built-in mtcars dataset

mtcars$test <- 
  c("č", "ž", "š", "č", "ž", "š", letters) %>% 
  enc2utf8(.)

mtcars$češćžä <- 
  c("č", "ž", "š", "č", "ž", "š", letters) %>% 
  enc2utf8(.)

names(mtcars) <- 
  iconv(names(mtcars), "cp1250", "utf-8")

# Connecting to sqlite database

my_db <- src_sqlite("my_db.sqlite3", create = T)

# exporting mtcars dataset to database
copy_to(my_db, mtcars, temporary = FALSE)

# dbSendQuery(my_db$con, "drop table mtcars")

# getting data from sqlite database
my_mtcars_from_db <-
  collect(tbl(my_db, "mtcars"))

# disconnecting from database
dbDisconnect(my_db$con)

convert_to_encoding() function

# a function that encodes 
# column names and values in character columns
# with specified encodings
convert_to_encoding <- 
  function(x, from_encoding = "UTF-8", to_encoding = "cp1250"){

    # names of columns are encoded in specified encoding
    my_names <- 
      iconv(names(x), from_encoding, to_encoding) 

    # if any column name is NA, leave the names
    # otherwise replace them with new names
    if(any(is.na(my_names))){
      names(x)
    } else {
      names(x) <- my_names
    }

    # get column classes
    x_char_columns <- sapply(x, class)
    # identify character columns
    x_cols <- names(x_char_columns[x_char_columns == "character"])

    # convert all string values in character columns to 
    # specified encoding
    x <- 
      x %>%
      mutate_each_(funs(iconv(., from_encoding, to_encoding)), 
                   x_cols)
    # return x
    return(x)
  }

# use
convert_to_encoding(my_mtcars_from_db, "UTF-8", "cp1250")

Results

# before conversion
my_mtcars_from_db

Source: local data frame [32 x 13]

    mpg cyl  disp  hp drat    wt  qsec vs am gear carb češćžä test
1  21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4          ÄŤ   ÄŤ
2  21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4          Ĺľ   Ĺľ
3  22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1          š   š
4  21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1          ÄŤ   ÄŤ
5  18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2          Ĺľ   Ĺľ
6  18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1          š   š
7  14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4           a    a
8  24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2           b    b
9  22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2           c    c
10 19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4           d    d
..  ... ...   ... ...  ...   ...   ... .. ..  ...  ...         ...  ...

# after conversion
convert_to_encoding(my_mtcars_from_db, "UTF-8", "cp1250")

Source: local data frame [32 x 13]

    mpg cyl  disp  hp drat    wt  qsec vs am gear carb test češćžä
1  21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4    č      č
2  21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4    ž      ž
3  22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1    š      š
4  21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1    č      č
5  18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2    ž      ž
6  18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1    š      š
7  14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4    a      a
8  24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2    b      b
9  22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2    c      c
10 19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4    d      d
..  ... ...   ... ...  ...   ...   ... .. ..  ...  ...  ...    ...

Session information

devtools::session_info()

Session info -------------------------------------------------------------------
 setting  value                       
 version  R version 3.2.0 (2015-04-16)
 system   x86_64, mingw32             
 ui       RStudio (0.99.441)          
 language (EN)                        
 collate  Slovenian_Slovenia.1250     
 tz       Europe/Prague               

Packages -----------------------------------------------------------------------
 package    * version date       source        
 assertthat * 0.1     2013-12-06 CRAN (R 3.2.0)
 chron      * 2.3-45  2014-02-11 CRAN (R 3.2.0)
 DBI          0.3.1   2014-09-24 CRAN (R 3.2.0)
 devtools   * 1.7.0   2015-01-17 CRAN (R 3.2.0)
 dplyr        0.4.1   2015-01-14 CRAN (R 3.2.0)
 gsubfn       0.6-6   2014-08-27 CRAN (R 3.2.0)
 lazyeval   * 0.1.10  2015-01-02 CRAN (R 3.2.0)
 magrittr   * 1.5     2014-11-22 CRAN (R 3.2.0)
 proto        0.3-10  2012-12-22 CRAN (R 3.2.0)
 R6         * 2.0.1   2014-10-29 CRAN (R 3.2.0)
 Rcpp       * 0.11.6  2015-05-01 CRAN (R 3.2.0)
 RSQLite      1.0.0   2014-10-25 CRAN (R 3.2.0)
 rstudioapi * 0.3.1   2015-04-07 CRAN (R 3.2.0)
 sqldf        0.4-10  2014-11-07 CRAN (R 3.2.0)

这篇关于使用dplyr和SQLite进行UTF-8编码的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆