使用dplyr和SQLite进行UTF-8编码 [英] UTF-8 encoding with dplyr and SQLite
问题描述
使用src_sqlite()连接到数据库并使用tbl读取()我得到了表。但是包含的字符是错误的。从csv文件读取相同的表只是通过在函数read.csv中添加encoding =utf-8,但是在这种情况下会出现第一列名称中的另一个错误(请考虑下面的最小例子)。 >
请注意,在SQLite表中,编码为UTF-8,SQLite正确显示数据。
我尝试更改在R Studio选项中的编码没有成功。也可以更改窗口或r中的区域没有任何影响。
有没有解决方案使用dplyr将表中的字符正确插入r?
最小例子
library(dplyr)
db< - src_sqlite(C:/Users/Jens/Documents/SQLite/my_db.sqlite)
tbl(db,prozesse)
##来源:sqlite 3.7.17 [C:/ Users / Jens / Documents / SQLite / my_db.sqlite]
##来自:prozesse [4 x 4]
##
## KH_IDEinschätzungProzess Gruppe
## 1 1 3 Buchung IT
## 2 2 4 Buchung IT
## 3 3 3 Buchung OLP
## 4 4 5 Buchung OLP
您在第二列的名称中看到错误的编码。这个问题在ä,ö,ü等的列中也出现。
第二列的名称正确显示,但第一列错误: p>
read.csv(C:/Users/Jens/Documents/SQLite/prozess.csv,encoding =UTF-8 )
## XUFEFF.KH_IDEinschätzungGruppe Prozess
## 1 1 3 PO visite
## 2 2 3 IT visite
## 3 3 3 IT visite
## 4 2 3 PO visite
sessionInfo()
## R版本3.1.1(2014-07-10)
##平台:x86_64 -w64-mingw32 / x64(64位)
##
## locale:
## [1] LC_COLLATE = German_Germany.1252 LC_CTYPE = German_Germany.1252
## [3] LC_MONETARY = German_Germany.1252 LC_NUMERIC = C
## [5] LC_TIME = German_Germany.1252
##
##附加的基本包:
## [1]统计图形grDevices utils数据集方法base
##
##其他附件包:
## [1] RSQLite.extfuns_0.0.1 RSQLite_0.11.4 DBI_0.3.0
## [4] dplyr_0.2
##
##通过命名空间加载(而不附加):
## [1] assertthat_0.1 digest_0.6.4 evaluate_0.5.5 formatR_1.0
## [5] htmltools_0.2.6 knitr_1 .6 parallel_3.1.1 Rcpp_0.11.2
## [9] rmarkdown_0.3.3 stringr_0.6.2 tools_3.1.1 yaml_2.1.13
$ b $我也有同样的问题。我下面解决了但是,我不保证解决方案是坚实的。尝试一下:
library(dplyr)
库(sqldf)
#修改内置的mtcars数据集
mtcars $ test < -
c(č,ž,š,č,ž,š )%>%
enc2utf8(。)
mtcars $češćžä< -
c(č,ž,š,č, ,š,字母)%>%
enc2utf8(。)
名称(mtcars)< -
iconv(names(mtcars),cp1250 utf-8)
#连接到sqlite数据库
my_db< - src_sqlite(my_db.sqlite3,create = T)
#将mtcars数据集导出到数据库
copy_to(my_db,mtcars,temporary = FALSE)
#dbSendQuery(my_db $ con,drop table mtcars)
#从sqlite数据库获取数据
my_mtcars_from_db< -
collect(tbl(my_db,mtcars))
#断开数据库
dbDisconnect(my_db $ con )
convert_to_encoding()函数
#一个编码
的函数#列名称和字符列中的值
#与指定的编码
convert_to_encoding < -
函数(x,from_encoding =UTF-8,to_encoding =cp1250){
#列的名称以指定的编码编码
my_names< -
iconv(names(x),from_encoding,to_encoding)
#如果有任何列名称是NA,留下名字
#否则用新名称替换
if(any(is.na(my_names))){
names(x)
} else {
名称(x)< - my_names
}
#获取列类
x_char_columns< - sapply(x,class)
#识别字符列
x_cols< - 名称(x_char_columns [x_char_columns ==character])
#将字符列中的所有字符串值转换为
#指定编码
x< -
x%>%
mutate_each_(funs(iconv(。,from_encoding,to_encoding)),
x_cols)
# return x
return(x)
}
#使用
convert_to_encoding(my_mtcars_from_db,UTF-8,cp1250)
结果
my_mtcars_from_db
来源:本地数据框[32 x 13]
mpg cyl disp hp drat wt qsec vs am gear carbÄŤeĹćžä‡test testžÞà b 1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4ÄŤÄŤ
2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4žž
3 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1Ĺ Ĺ
4 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1ÄŤÄŤ
5 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2žž
6 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1ĹĹ
7 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 aa
8 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 bb
9 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 cc
10 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 dd
.. ... ... ... ... ... ... ... .. ... ... ... ...
转换后
convert_to_encoding(my_mtcars_from_db,UTF- 8,cp1250)
来源:本地数据框[32 x 13]
mpg cyl disp hp drat wt qsec vs am gear carb testčešćžä
1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4čč
2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4ž
3 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1šš
4 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1čč
5 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2ž
6 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1šš
7 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 aa
8 24.4 4 14 6.7 62 3.69 3.190 20.00 1 0 4 2 bb
9 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 cc
10 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 dd
.. ... ... ... ... ... ... ... .. ... ... ... ...
会话信息
devtools :: session_info()
会话信息------------------------------------------ -------------------------
设置值
版本R版本3.2.0(2015-04-16)
系统x86_64,mingw32
ui RStudio(0.99.441)
语言(EN)
collate Slovenian_Slovenia.1250
tz欧洲/布拉格
包-------------------------------------------------- ---------------------
包*版本日期来源
assertthat * 0.1 2013-12-06 CRAN(R 3.2.0)
chron * 2.3-45 2014-02-11 CRAN(R 3.2.0)
DBI 0.3.1 2014-09-24 CRAN(R 3.2.0)
devtools * 1.7.0 2015-01-17 CRAN(R 3.2 .0)
dplyr 0.4.1 2015-01-14 CRAN(R 3.2.0)
gsubfn 0.6-6 2014-08-27 CRAN(R 3.2.0)
lazyeval * 0.1 .10 2015-01-02 CRAN(R 3.2.0)
magrittr * 1.5 2014-11-22 CRAN(R 3.2.0)
proto 0.3-10 2012-12-22 CRAN(R 3.2 .0)
R6 * 2.0.1 2014-10-29 CRAN(R 3.2.0)
Rcpp * 0.11.6 2015-05-01 CRAN(R 3.2.0)
RSQLite 1.0.0 2014-10-25 CRAN(R 3.2.0)
rstudioapi * 0.3.1 2015-04-07 CRAN(R 3.2.0)
sqldf 0.4-10 2014-11-07 CRAN (R 3.2.0)
I have a table in SQLite and I’d like to open it with dplyr. I use SQLite Expert Version 35.58.2478, R Studio Version 0.98.1062 on a PC with Win 7.
After connecting to the database with src_sqlite() and reading with tbl() I get the table. But the character enconding is wrong. Reading the same table from a csv-file just works by adding encoding = "utf-8" to the function read.csv but in this case another error in the first column name occurs (please consider the minimal example below).
Note that in the SQLite table the encoding is UTF-8 and SQLite displays the data correctly.
I tried to change the encoding in R Studio options with no success. Also changing the region in windows or in r doesn’t have any effect.
Is there any solution of getting the characters in the table correctly into r using dplyr?
Minimal Example
library(dplyr)
db <- src_sqlite("C:/Users/Jens/Documents/SQLite/my_db.sqlite")
tbl(db, "prozesse")
## Source: sqlite 3.7.17 [C:/Users/Jens/Documents/SQLite/my_db.sqlite]
## From: prozesse [4 x 4]
##
## KH_ID Einschätzung Prozess Gruppe
## 1 1 3 Buchung IT
## 2 2 4 Buchung IT
## 3 3 3 Buchung OLP
## 4 4 5 Buchung OLP
You see the wrong encoding in the name of the second column. This issue occures as well in the colums with ä, ö, ü etc.
The name of the second column is displayed correctly, but the first column is wrong:
read.csv("C:/Users/Jens/Documents/SQLite/prozess.csv", encoding = "UTF-8")
## X.U.FEFF.KH_ID Einschätzung Gruppe Prozess
## 1 1 3 PO visite
## 2 2 3 IT visite
## 3 3 3 IT visite
## 4 2 3 PO visite
sessionInfo()
## R version 3.1.1 (2014-07-10)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
##
## locale:
## [1] LC_COLLATE=German_Germany.1252 LC_CTYPE=German_Germany.1252
## [3] LC_MONETARY=German_Germany.1252 LC_NUMERIC=C
## [5] LC_TIME=German_Germany.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] RSQLite.extfuns_0.0.1 RSQLite_0.11.4 DBI_0.3.0
## [4] dplyr_0.2
##
## loaded via a namespace (and not attached):
## [1] assertthat_0.1 digest_0.6.4 evaluate_0.5.5 formatR_1.0
## [5] htmltools_0.2.6 knitr_1.6 parallel_3.1.1 Rcpp_0.11.2
## [9] rmarkdown_0.3.3 stringr_0.6.2 tools_3.1.1 yaml_2.1.13
I had the same problem. I solved it like below. However, I do not guarantee that the solution is rock solid. Give it a try:
library(dplyr)
library(sqldf)
# Modifying built-in mtcars dataset
mtcars$test <-
c("č", "ž", "š", "č", "ž", "š", letters) %>%
enc2utf8(.)
mtcars$češćžä <-
c("č", "ž", "š", "č", "ž", "š", letters) %>%
enc2utf8(.)
names(mtcars) <-
iconv(names(mtcars), "cp1250", "utf-8")
# Connecting to sqlite database
my_db <- src_sqlite("my_db.sqlite3", create = T)
# exporting mtcars dataset to database
copy_to(my_db, mtcars, temporary = FALSE)
# dbSendQuery(my_db$con, "drop table mtcars")
# getting data from sqlite database
my_mtcars_from_db <-
collect(tbl(my_db, "mtcars"))
# disconnecting from database
dbDisconnect(my_db$con)
convert_to_encoding() function
# a function that encodes
# column names and values in character columns
# with specified encodings
convert_to_encoding <-
function(x, from_encoding = "UTF-8", to_encoding = "cp1250"){
# names of columns are encoded in specified encoding
my_names <-
iconv(names(x), from_encoding, to_encoding)
# if any column name is NA, leave the names
# otherwise replace them with new names
if(any(is.na(my_names))){
names(x)
} else {
names(x) <- my_names
}
# get column classes
x_char_columns <- sapply(x, class)
# identify character columns
x_cols <- names(x_char_columns[x_char_columns == "character"])
# convert all string values in character columns to
# specified encoding
x <-
x %>%
mutate_each_(funs(iconv(., from_encoding, to_encoding)),
x_cols)
# return x
return(x)
}
# use
convert_to_encoding(my_mtcars_from_db, "UTF-8", "cp1250")
Results
# before conversion
my_mtcars_from_db
Source: local data frame [32 x 13]
mpg cyl disp hp drat wt qsec vs am gear carb češćžä test
1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 ÄŤ ÄŤ
2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 Ĺľ Ĺľ
3 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 š š
4 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 ÄŤ ÄŤ
5 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 Ĺľ Ĺľ
6 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 š š
7 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 a a
8 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 b b
9 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 c c
10 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 d d
.. ... ... ... ... ... ... ... .. .. ... ... ... ...
# after conversion
convert_to_encoding(my_mtcars_from_db, "UTF-8", "cp1250")
Source: local data frame [32 x 13]
mpg cyl disp hp drat wt qsec vs am gear carb test češćžä
1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 č č
2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 ž ž
3 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 š š
4 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 č č
5 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 ž ž
6 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 š š
7 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 a a
8 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 b b
9 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 c c
10 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 d d
.. ... ... ... ... ... ... ... .. .. ... ... ... ...
Session information
devtools::session_info()
Session info -------------------------------------------------------------------
setting value
version R version 3.2.0 (2015-04-16)
system x86_64, mingw32
ui RStudio (0.99.441)
language (EN)
collate Slovenian_Slovenia.1250
tz Europe/Prague
Packages -----------------------------------------------------------------------
package * version date source
assertthat * 0.1 2013-12-06 CRAN (R 3.2.0)
chron * 2.3-45 2014-02-11 CRAN (R 3.2.0)
DBI 0.3.1 2014-09-24 CRAN (R 3.2.0)
devtools * 1.7.0 2015-01-17 CRAN (R 3.2.0)
dplyr 0.4.1 2015-01-14 CRAN (R 3.2.0)
gsubfn 0.6-6 2014-08-27 CRAN (R 3.2.0)
lazyeval * 0.1.10 2015-01-02 CRAN (R 3.2.0)
magrittr * 1.5 2014-11-22 CRAN (R 3.2.0)
proto 0.3-10 2012-12-22 CRAN (R 3.2.0)
R6 * 2.0.1 2014-10-29 CRAN (R 3.2.0)
Rcpp * 0.11.6 2015-05-01 CRAN (R 3.2.0)
RSQLite 1.0.0 2014-10-25 CRAN (R 3.2.0)
rstudioapi * 0.3.1 2015-04-07 CRAN (R 3.2.0)
sqldf 0.4-10 2014-11-07 CRAN (R 3.2.0)
这篇关于使用dplyr和SQLite进行UTF-8编码的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!