R: How can I extract an element from a column of data in a Spark connection (sparklyr) in a pipe?
Problem description
I have a dataset as below.
Because the data is large, I loaded it through the sparklyr package, so I can only use pipe statements.
pos <- str_sub(csj$helpful,2)
neg1 <- str_sub(csj$helpful,4)
csj <- csj %>% mutate(neg=replace(helpful,stringr::str_sub(csj$helpful,4)==1,0))
csj <- csj %>% mutate(help=pos/neg)
csj
is.null(csj$helpful)
I want to create a column named 'help', defined as the first number in the helpful column divided by the second number in the helpful column.
If the second number is 0, I need to change it to 1 before dividing.
The data frame is named csj.
But this doesn't work.
I'd be glad if someone could help me solve this problem.
After following @Sebastian Hoyos's advice, I still got col1, col2, and col3 as NaN, as in the picture below (although the example he gave me worked). How should I solve this problem?
+) After trying it without the as.numeric part, I got this result:
> csj %>%
+ mutate(col1 = stringi::stri_extract_first_regex(csj$helpful, pattern = "[0-9]"),#extract first number
+ col2 = stringi::stri_extract_last_regex(csj$helpful, pattern = "[0-9]"),#extract second
+ col3 = ifelse(col2 == 0, 1, col2 ),#change 0s to 1
+ help = col1/col3) #divide row1 and 3
# Source: lazy query [?? x 12]
# Database: spark_connection
`_c0` reviewerID asin helpful length_of_review overall unixReviewTime category col1 col2 col3 help
<int> <chr> <chr> <chr> <dbl> <chr> <chr> <chr> <chr> <chr> <chr> <dbl>
1 0 A1KLRMWW2FWPL4 31887 [0, 0] 172 5 1297468800 Clothes_s~ "" "" NA NaN
2 1 A2G5TCU2WDFZ65 31887 [0, 0] 306 5 1358553600 Clothes_s~ "" "" NA NaN
3 2 A1RLQXYNCMWRWN 31887 [0, 0] 312 5 1357257600 Clothes_s~ "" "" NA NaN
4 3 A8U3FAMSJVHS5 31887 [0, 0] 405 5 1398556800 Clothes_s~ "" "" NA NaN
5 4 A3GEOILWLK86XM 31887 [0, 0] 453 5 1394841600 Clothes_s~ "" "" NA NaN
6 5 A27UF1MSF3DB2 31887 [0, 0] 375 4 1396224000 Clothes_s~ "" "" NA NaN
7 6 A16GFPNVF4Y816 31887 [0, 0] 334 5 1399075200 Clothes_s~ "" "" NA NaN
8 7 A2M2APVYIB2U6K 31887 [0, 0] 158 5 1356220800 Clothes_s~ "" "" NA NaN
9 8 A1NJ71X3YPQNQ9 31887 [0, 0] 96 4 1384041600 Clothes_s~ "" "" NA NaN
10 9 A3EERSWHAI6SO 31887 [7, 8] 532 5 1349568000 Clothes_s~ "" "" NA NaN
# ... with more rows
>
Recommended answer
Although this isn't the most elegant piece of code, it should get the job done. Since no sample dataset was provided other than a screenshot, I created a sample containing only the elements you were interested in.
library(dplyr)
# sample data in the same "[a,b]" format as the helpful column
csj <- tibble(helpful = rep(c("[0,0]","[0,1]","[0,2]","[1,3]"),100),
              overall = rep(c(5,4,3,2),100))
# extract both numbers and create the help column
csj %>%
  mutate(col1 = as.numeric(stringi::stri_extract_first_regex(helpful, pattern = "[0-9]+")), # first number ("+" keeps multi-digit values whole)
         col2 = as.numeric(stringi::stri_extract_last_regex(helpful, pattern = "[0-9]+")), # second number
         col3 = ifelse(col2 == 0, 1, col2), # change 0s to 1
         help = col1/col3) %>% # divide col1 by col3
  select(helpful, help) # select the columns you wish to keep
This should work as long as you adapt the functions to your dataset as needed. Also note that helpful is a character column in your dataset, which is why the extracted values need to be converted to numeric.
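As a quick local check of the arithmetic, the sketch below runs the same extract-then-divide logic on a plain character vector using only base R (`regmatches()` and `gregexpr()` in place of stringi). The sample values and the name `help_ratio` are made up for illustration. It also shows that `[0-9]` matches a single digit only, so a multi-digit count such as 12 needs `[0-9]+`.

```r
# Made-up sample values in the same "[a, b]" format as the helpful column
helpful <- c("[0, 0]", "[7, 8]", "[12, 30]")

# All runs of digits in each string, returned as a list of character vectors
nums <- regmatches(helpful, gregexpr("[0-9]+", helpful))

# First and last number of each row, converted from character to numeric
first  <- as.numeric(vapply(nums, function(x) x[1], character(1)))
second <- as.numeric(vapply(nums, function(x) x[length(x)], character(1)))

# Replace a zero denominator with 1 before dividing
denom      <- ifelse(second == 0, 1, second)
help_ratio <- first / denom

# For comparison: a single-digit pattern captures only the leading "1" of "12"
one_digit <- regmatches("[12, 30]", regexpr("[0-9]", "[12, 30]"))
```

This only works on data held in local R memory; it is meant for checking the expected values, not for running against a Spark table.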
I then read up on sparklyr and realized why the code wasn't working, so I created an example to test it myself. Although I didn't replicate your data completely, I reproduced enough of it to hopefully provide a working solution.
library(sparklyr)
library(dplyr)
sc <- spark_connect(master="local")
# create a local data frame, including malformed and missing values
cjs <- tibble(helpful = rep(c("[0, 0]","[0, 1]","[0, 2]","[1, 3]","[,1]",NA,"a"),100),
              overall = rep(c(6,5,4,3,2,1,0),100))
# copy it to Spark
csj <- copy_to(sc, cjs, "cjs")
# this should do the trick
csj %>%
  mutate(newcol2 = regexp_replace(helpful, "[^0-9,]", " "), # blank out everything except digits and commas
         newcol3 = as.numeric(substring_index(newcol2, ",", 1)), # text before the first comma
         newcol4 = as.numeric(substring_index(newcol2, ",", -1)), # text after the last comma
         newcol5 = ifelse(newcol4 == 0, 1, newcol4), # change 0s to 1
         help = newcol3/newcol5) %>%
  select(starts_with("new"), help) # select the columns you need, with help calculated appropriately
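As far as I understand it, this version runs because sparklyr translates the `mutate()` expressions to Spark SQL and passes functions it does not recognize, such as `regexp_replace` and `substring_index`, through to Spark, whereas stringi functions operate on R vectors that a remote table does not hold. To see what each step computes without a Spark session, here is a base R emulation. The `gsub()` and `sub()` calls are stand-ins I chose to mimic the Spark functions, not what Spark actually executes, and the sample values are made up.

```r
# Made-up rows, including malformed and missing values
helpful <- c("[0, 0]", "[7, 8]", "[12, 30]", "[,1]", NA, "a")

# Stand-in for regexp_replace(helpful, "[^0-9,]", " "):
# every character that is not a digit or a comma becomes a space
cleaned <- gsub("[^0-9,]", " ", helpful)

# Stand-in for substring_index(cleaned, ",", 1): text before the first comma
before_comma <- sub(",.*$", "", cleaned)
# Stand-in for substring_index(cleaned, ",", -1): text after the last comma
after_comma <- sub("^.*,", "", cleaned)

# as.numeric() ignores surrounding spaces; blank or missing input becomes NA
first  <- suppressWarnings(as.numeric(before_comma))
second <- suppressWarnings(as.numeric(after_comma))

# Replace a zero denominator with 1, then divide
denom      <- ifelse(second == 0, 1, second)
help_ratio <- first / denom
```

Rows that do not contain two numbers ("[,1]", NA, "a") come out as NA here, so it is worth deciding whether to filter them out or fill them in before using the column.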