R: NaN came up after as.numeric() while using dplyr/sparklyr and manipulating data with pipes in a sparklyr connection
Question
I began to use sparklyr to handle big data, so I need to work only with pipes.
But while manipulating the data frame I got into trouble. Here is a look at it:
csj %>% head()
Below is how my data looks. (screenshot omitted)
What I want to do first is create a new column, length_of_review, counting the number of characters of reviewText, and another new column that shows the category.
So I wrote code like this:
csj<- csj %>% mutate(length_of_review = nchar(csj$reviewText,keppNA=TRUE),
category ="Clothes_shoes_jewelry") %>%
select(c('_c0',reviewerID,asin,helpful,length_of_review,overall,unixReviewTime,category))
csj %>% head()
Error: Invalid number of args to SQL LENGTH. Expecting 1
The category part worked, but the length_of_review part didn't. So I tried again with as.numeric:
csj<- csj %>% mutate(length_of_review = as.numeric(nchar(csj$reviewText)),
category ="Clothes_shoes_jewelry") %>%
select(c('_c0',reviewerID,asin,helpful,length_of_review,overall,unixReviewTime,category))
csj %>% head()
#Source: lazy query [?? x 8]
# Database: spark_connection
`_c0` reviewerID asin helpful length_of_review overall unixReviewTime category
<int> <chr> <chr> <chr> <dbl> <chr> <chr> <chr>
1 0 A1KLRMWW2FWPL4 31887 [0, 0] NaN 5 1297468800 Clothes_shoes_jewelry
2 1 A2G5TCU2WDFZ65 31887 [0, 0] NaN 5 1358553600 Clothes_shoes_jewelry
3 2 A1RLQXYNCMWRWN 31887 [0, 0] NaN 5 1357257600 Clothes_shoes_jewelry
4 3 A8U3FAMSJVHS5 31887 [0, 0] NaN 5 1398556800 Clothes_shoes_jewelry
5 4 A3GEOILWLK86XM 31887 [0, 0] NaN 5 1394841600 Clothes_shoes_jewelry
6 5 A27UF1MSF3DB2 31887 [0, 0] NaN 4 1396224000 Clothes_shoes_jewelry
And it turned into NaN :(
Also, a similar but separate problem concerns the helpful column. I want to make a new column, help = (first number in helpful) / (second number in helpful). I asked on this site before and got this code:
csj %>%
+ mutate(col1 = as.numeric(stringi::stri_extract_first_regex(csj$helpful, pattern = "[0-9]")),#extract first number
+ col2 = as.numeric(stringi::stri_extract_last_regex(csj$helpful, pattern = "[0-9]")),#extract second
+ col3 = ifelse(col2 == 0, 1, col2 ),#change 0s to 1
+ help = col1/col3) #divide row1 and 3
# Source: lazy query [?? x 12]
# Database: spark_connection
`_c0` reviewerID asin helpful length_of_review overall unixReviewTime category col1 col2 col3 help
<int> <chr> <chr> <chr> <dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 0 A1KLRMWW2FWPL4 31887 [0, 0] NaN 5 1297468800 Clothes_sh~ NaN NaN NaN NaN
2 1 A2G5TCU2WDFZ65 31887 [0, 0] NaN 5 1358553600 Clothes_sh~ NaN NaN NaN NaN
3 2 A1RLQXYNCMWRWN 31887 [0, 0] NaN 5 1357257600 Clothes_sh~ NaN NaN NaN NaN
4 3 A8U3FAMSJVHS5 31887 [0, 0] NaN 5 1398556800 Clothes_sh~ NaN NaN NaN NaN
5 4 A3GEOILWLK86XM 31887 [0, 0] NaN 5 1394841600 Clothes_sh~ NaN NaN NaN NaN
6 5 A27UF1MSF3DB2 31887 [0, 0] NaN 4 1396224000 Clothes_sh~ NaN NaN NaN NaN
7 6 A16GFPNVF4Y816 31887 [0, 0] NaN 5 1399075200 Clothes_sh~ NaN NaN NaN NaN
8 7 A2M2APVYIB2U6K 31887 [0, 0] NaN 5 1356220800 Clothes_sh~ NaN NaN NaN NaN
9 8 A1NJ71X3YPQNQ9 31887 [0, 0] NaN 4 1384041600 Clothes_sh~ NaN NaN NaN NaN
10 9 A3EERSWHAI6SO 31887 [7, 8] NaN 5 1349568000 Clothes_sh~ NaN NaN NaN NaN
# ... with more rows
Both approaches seem like they should work, but they don't. I can't even start the analysis because I've been stuck here for such a long time :(
Does anyone know why, and have a solution for this? I'd be very happy if I could solve these problems. Any help would be much appreciated!
Here is str(csj):
> str(csj)
List of 2
$ src:List of 1
..$ con:List of 10
.. ..$ master : chr "local[4]"
.. ..$ method : chr "shell"
.. ..$ app_name : chr "sparklyr"
.. ..$ config :List of 4
.. .. ..$ spark.env.SPARK_LOCAL_IP.local : chr "127.0.0.1"
.. .. ..$ sparklyr.csv.embedded : chr "^1.*"
.. .. ..$ sparklyr.cores.local : int 4
.. .. ..$ spark.sql.shuffle.partitions.local: int 4
.. .. ..- attr(*, "config")= chr "default"
.. .. ..- attr(*, "file")= chr "C:\\Users\\ms\\Documents\\R\\win-library\\3.5\\sparklyr\\conf\\config-template.yml"
.. ..$ spark_home : chr "C:\\spark"
.. ..$ backend : 'sockconn' int 4
.. .. ..- attr(*, "conn_id")=<externalptr>
.. ..$ monitor : 'sockconn' int 3
.. .. ..- attr(*, "conn_id")=<externalptr>
.. ..$ output_file : chr "C:\\Users\\ms\\AppData\\Local\\Temp\\RtmpygTIca\\file371068ce6a02_spark.log"
.. ..$ spark_context:Classes 'spark_jobj', 'shell_jobj' <environment: 0x00000000daa77a50>
.. ..$ java_context :Classes 'spark_jobj', 'shell_jobj' <environment: 0x00000000daa365b8>
.. ..- attr(*, "class")= chr [1:3] "spark_connection" "spark_shell_connection" "DBIConnection"
..- attr(*, "class")= chr [1:3] "src_spark" "src_sql" "src"
$ ops:List of 4
..$ name: chr "select"
..$ x :List of 4
.. ..$ name: chr "mutate"
.. ..$ x :List of 2
.. .. ..$ x : 'ident' chr "review_csj"
.. .. ..$ vars: chr [1:7] "_c0" "reviewerID" "asin" "helpful" ...
.. .. ..- attr(*, "class")= chr [1:3] "op_base_remote" "op_base" "op"
.. ..$ dots:List of 2
.. .. ..$ length_of_review: language ~as.numeric(nchar(NULL))
.. .. .. ..- attr(*, ".Environment")=<environment: 0x00000000dd5f2548>
.. .. ..$ category : language ~"Clothes_shoes_jewelry"
.. .. .. ..- attr(*, ".Environment")=<environment: R_EmptyEnv>
.. ..$ args: list()
.. ..- attr(*, "class")= chr [1:3] "op_mutate" "op_single" "op"
..$ dots:List of 1
.. ..$ : language ~c("_c0", reviewerID, asin, helpful, length_of_review, overall, unixReviewTime, category)
.. .. ..- attr(*, ".Environment")=<environment: 0x00000000dd6190f0>
.. ..- attr(*, "class")= chr "quosures"
..$ args: list()
..- attr(*, "class")= chr [1:3] "op_select" "op_single" "op"
- attr(*, "class")= chr [1:4] "tbl_spark" "tbl_sql" "tbl_lazy" "tbl"
>
Here is my session_info():
Session info ----------------------------------------------------------------------------------------------------
setting value
version R version 3.5.0 (2018-04-23)
system x86_64, mingw32
ui RStudio (1.1.453)
language (EN)
collate English_United States.1252
tz Europe/Berlin
date 2018-05-21
Packages --------------------------------------------------------------------------------------------------------
package * version date source
assertthat 0.2.0 2017-04-11 CRAN (R 3.5.0)
backports 1.1.2 2017-12-13 CRAN (R 3.5.0)
base * 3.5.0 2018-04-23 local
base64enc 0.1-3 2015-07-28 CRAN (R 3.5.0)
bindr 0.1.1 2018-03-13 CRAN (R 3.5.0)
bindrcpp 0.2.2 2018-03-29 CRAN (R 3.5.0)
broom 0.4.4 2018-03-29 CRAN (R 3.5.0)
cli 1.0.0 2017-11-05 CRAN (R 3.5.0)
colorspace 1.3-2 2016-12-14 CRAN (R 3.5.0)
compiler 3.5.0 2018-04-23 local
config 0.3 2018-03-27 CRAN (R 3.5.0)
crayon 1.3.4 2017-09-16 CRAN (R 3.5.0)
datasets * 3.5.0 2018-04-23 local
DBI 1.0.0 2018-05-02 CRAN (R 3.5.0)
dbplyr 1.2.1 2018-02-19 CRAN (R 3.5.0)
devtools * 1.13.5 2018-02-18 CRAN (R 3.5.0)
digest * 0.6.15 2018-01-28 CRAN (R 3.5.0)
dplyr * 0.7.5 2018-05-19 CRAN (R 3.5.0)
foreign 0.8-70 2017-11-28 CRAN (R 3.5.0)
ggplot2 * 2.2.1 2016-12-30 CRAN (R 3.5.0)
glue 1.2.0 2017-10-29 CRAN (R 3.5.0)
graphics * 3.5.0 2018-04-23 local
grDevices * 3.5.0 2018-04-23 local
grid * 3.5.0 2018-04-23 local
gtable 0.2.0 2016-02-26 CRAN (R 3.5.0)
hms 0.4.2 2018-03-10 CRAN (R 3.5.0)
htmltools 0.3.6 2017-04-28 CRAN (R 3.5.0)
httpuv 1.4.3 2018-05-10 CRAN (R 3.5.0)
httr 1.3.1 2017-08-20 CRAN (R 3.5.0)
jsonlite 1.5 2017-06-01 CRAN (R 3.5.0)
later 0.7.2 2018-05-01 CRAN (R 3.5.0)
lattice 0.20-35 2017-03-25 CRAN (R 3.5.0)
lazyeval 0.2.1 2017-10-29 CRAN (R 3.5.0)
magrittr 1.5 2014-11-22 CRAN (R 3.5.0)
memoise 1.1.0 2017-04-21 CRAN (R 3.5.0)
methods * 3.5.0 2018-04-23 local
mime 0.5 2016-07-07 CRAN (R 3.5.0)
mnormt 1.5-5 2016-10-15 CRAN (R 3.5.0)
munsell 0.4.3 2016-02-13 CRAN (R 3.5.0)
nlme 3.1-137 2018-04-07 CRAN (R 3.5.0)
openssl 1.0.1 2018-03-03 CRAN (R 3.5.0)
parallel 3.5.0 2018-04-23 local
pillar 1.2.2 2018-04-26 CRAN (R 3.5.0)
pkgconfig 2.0.1 2017-03-21 CRAN (R 3.5.0)
plyr 1.8.4 2016-06-08 CRAN (R 3.5.0)
promises 1.0.1 2018-04-13 CRAN (R 3.5.0)
psych 1.8.4 2018-05-06 CRAN (R 3.5.0)
purrr 0.2.4 2017-10-18 CRAN (R 3.5.0)
R6 2.2.2 2017-06-17 CRAN (R 3.5.0)
RColorBrewer * 1.1-2 2014-12-07 CRAN (R 3.5.0)
Rcpp 0.12.17 2018-05-18 CRAN (R 3.5.0)
readr * 1.1.1 2017-05-16 CRAN (R 3.5.0)
reshape2 1.4.3 2017-12-11 CRAN (R 3.5.0)
rlang 0.2.0 2018-02-20 CRAN (R 3.5.0)
rprojroot 1.3-2 2018-01-03 CRAN (R 3.5.0)
rstudioapi 0.7 2017-09-07 CRAN (R 3.5.0)
scales * 0.5.0 2017-08-24 CRAN (R 3.5.0)
shiny 1.1.0 2018-05-17 CRAN (R 3.5.0)
sparklyr * 0.8.3 2018-05-12 CRAN (R 3.5.0)
stats * 3.5.0 2018-04-23 local
stringi * 1.1.7 2018-03-12 CRAN (R 3.5.0)
stringr * 1.3.1 2018-05-10 CRAN (R 3.5.0)
tibble 1.4.2 2018-01-22 CRAN (R 3.5.0)
tidyr * 0.8.1 2018-05-18 CRAN (R 3.5.0)
tidyselect 0.2.4 2018-02-26 CRAN (R 3.5.0)
tools 3.5.0 2018-04-23 local
utf8 1.1.3 2018-01-03 CRAN (R 3.5.0)
utils * 3.5.0 2018-04-23 local
withr 2.1.2 2018-03-15 CRAN (R 3.5.0)
xtable 1.8-2 2016-02-05 CRAN (R 3.5.0)
yaml 2.1.19 2018-05-01 CRAN (R 3.5.0)
>
Answer
You've made two fundamental mistakes, and the former actually masks the latter. Let's trace it step by step.
df <- copy_to(
sc, tibble(id=1, helpful="[0, 0]", reviewText="Here goes some text")
)
The first mistake is attempting to use $ to access columns. tbl_spark objects (and, as far as I am aware, other database-backed objects such as tbl_sql) don't provide $ access:

> df$a
NULL
In general you shouldn't reference columns like this with dplyr. sparklyr doesn't support direct execution of plain R code (with the exception of functions like spark_apply; these are, however, very inefficient). So if you omit $, the query actually fails:

> df %>% mutate(length_of_review = nchar(reviewText, keppNA=TRUE))
Error: Invalid number of args to SQL LENGTH. Expecting 1
You can omit the R-specific arguments:

> df %>% mutate(length_of_review = nchar(reviewText))
# Source: lazy query [?? x 4]
# Database: spark_connection
     id helpful reviewText          length_of_review
  <dbl> <chr>   <chr>                          <int>
1     1 [0, 0]  Here goes some text               19

which is translated to a native Spark call of the length function (note that it is not base::length):

> df %>% mutate(length_of_review = length(reviewText))
# Source: lazy query [?? x 4]
# Database: spark_connection
     id helpful reviewText          length_of_review
  <dbl> <chr>   <chr>                          <int>
1     1 [0, 0]  Here goes some text               19
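A quick way to check what SQL a mutate() expression will generate, without touching Spark at all, is dbplyr's translate_sql() (a sketch; the exact quoting in the output depends on your dbplyr version):

```r
library(dbplyr)

# nchar() is translated to the single-argument SQL LENGTH();
# keepNA-style arguments have no SQL counterpart, which is why
# passing an extra argument raises
# "Invalid number of args to SQL LENGTH. Expecting 1".
translate_sql(nchar(reviewText))
```

This prints the generated SQL, something along the lines of LENGTH("reviewText").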
To sum up:

- To check length, use the length function (see above).
- To process strings with regular expressions, use the regexp_extract function with Java regular expressions.
In general, operations applied on the connection are translated to SQL queries, so you should use functions that are accessible on the specific platform (or are aliased by dplyr). For the list of available functions and their arguments you can check the official Scala documentation.

Regarding your problem:

> df %>% mutate(col1 = as.numeric(regexp_extract(helpful, "^\\\\[([0-9]+), ([0-9]+)\\\\]$", 1)))
# Source: lazy query [?? x 4]
# Database: spark_connection
     id helpful reviewText           col1
  <dbl> <chr>   <chr>               <dbl>
1     1 [0, 0]  Here goes some text     0
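Putting this together for the help column, a sketch (it assumes a live Spark connection and the csj table as registered above; note [0-9]+ instead of the original [0-9], so multi-digit counts like "[12, 15]" are captured whole):

```r
csj <- csj %>%
  mutate(
    # pull both numbers out of strings like "[7, 8]":
    # capture group 1 is the first number, group 2 the second
    col1 = as.numeric(regexp_extract(helpful, "^\\\\[([0-9]+), ([0-9]+)\\\\]$", 1)),
    col2 = as.numeric(regexp_extract(helpful, "^\\\\[([0-9]+), ([0-9]+)\\\\]$", 2)),
    # guard against division by zero, then compute the ratio
    help = col1 / ifelse(col2 == 0, 1, col2)
  )
```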
You could also use spark_apply, but as already mentioned, it is not an efficient approach.
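For completeness, the spark_apply route would look roughly like this (a sketch; it ships each partition through an R process on the workers, which is why it is so much slower than the SQL translation):

```r
# runs plain R (here base nchar) on each partition's data frame
csj %>% spark_apply(function(df) {
  df$length_of_review <- nchar(df$reviewText)
  df
})
```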