Read a random sample from URL
Question
I want to read a random sample of a csv-formatted file from a URL.
So far:
library(tidyverse)
library(data.table)
# load dataset from url, skip the first 16 rows
# then *after* reading it completely, use dplyr function
# for sampling. quite dumb, I want to do it while
# reading the file
df <- read.csv('http://datashaping.com/passwords.txt', header = F, skip = 16) %>%
  sample_frac(.01) %>%
  rename(password = V1)
Then I tried, as suggested in several posts:
df <- fread("shuf -n 10 http://datashaping.com/passwords.txt", skip = 16, header = F)
But this doesn't work for me. The error:
shuf: 'http://datashaping.com/passwords.txt': No such file or directory
Error in fread("shuf -n 10 http://datashaping.com/passwords.txt", skip = 16, :
File is empty: /dev/shm/file1ab1608b13cf
Moreover, fread seems to be rather slow.
Any ideas? Benchmarks?
I tried benchmarking read.csv() against fread():
library(rbenchmark)  # provides benchmark()
benchmark("read.csv" = {
  # read everything, then sample 10 rows with dplyr
  df <- read.csv('http://datashaping.com/passwords.txt', header = F, skip = 16)
  df <- df %>%
    sample_n(10) %>%
    rename(password = V1)
}, "fread" = {
  # let wget and shuf do the download and the sampling
  df <- fread("wget -S -O - http://datashaping.com/passwords.txt | shuf -n10")
},
replications = 100,
columns = c("test", "replications", "elapsed",
            "relative", "user.self", "sys.self"))
Warning message in fread("wget -S -O - http://datashaping.com/passwords.txt | shuf -n10"):
"Stopped reading at empty line 9 but text exists afterwards (discarded): 08090728"
Warning message in fread("wget -S -O - http://datashaping.com/passwords.txt | shuf -n10"):
"Stopped reading at empty line 6 but text exists afterwards (discarded): 0307737205"
Recommended answer
It looks like that file is not a CSV, and the data starts on line 15. I am on Windows 10 right now, and this read the whole file (not a random sample) very quickly:
> test <- fread("http://datashaping.com/passwords.txt",skip=15)
trying URL 'http://datashaping.com/passwords.txt'
Content type 'text/plain' length 20163417 bytes (19.2 MB)
downloaded 19.2 MB
Read 2069847 rows and 1 (of 1) columns from 0.019 GB file in 00:00:03
It gives a data.table structure, as expected:
> str(test)
Classes ‘data.table’ and 'data.frame': 2069847 obs. of 1 variable:
$ #: chr "07606374520" "piontekendre" "rambo144" "primoz123" ...
- attr(*, ".internal.selfref")=<externalptr>
You can access all the data like this (use with=FALSE to reference by column number):
> test[,1,with=FALSE]
#
1: 07606374520
2: piontekendre
3: rambo144
4: primoz123
5: sal1387
---
2069843: 26778982
2069844: brazer1
2069845: usethisone
2069846: scare222273
2069847: anto1962
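Since the whole file reads in a few seconds, one way to get the random sample the question asks for is to draw rows from the table after reading it. The following is only a sketch: the small table is a stand-in for test so the example runs without a network connection, and the new column name password is taken from the question's rename() call.

```r
library(data.table)

# Small stand-in for `test`; the real table comes from fread() on the URL.
# The single column is named "#", as in the str() output above.
test <- data.table(`#` = c("07606374520", "piontekendre", "rambo144",
                           "primoz123", "sal1387"))

setnames(test, "#", "password")   # give the column a usable name

set.seed(1)                       # only so the draw is reproducible
smp <- test[sample(.N, 3)]        # 3 distinct rows, chosen at random
print(smp)
```

For a fraction rather than a fixed count, something like test[sample(.N, ceiling(.N * 0.01))] should give roughly a 1% sample.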
And you can access individual passwords like this:
> test[1,1,with=FALSE]
#
1: 07606374520
> test[5,1,with=FALSE]
#
1: sal1387
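As for the shuf error in the question: shuf reads a local file or standard input, not a URL, which is why it reports "No such file or directory". On a Unix-like system with GNU coreutils (so not the Windows setup used above), a possible workaround is to download the file once and let tail, grep and shuf pick the lines before R sees them. This is a rough sketch, not a benchmarked solution; sample_lines is a hypothetical helper name, the skip count has to match the file's header length, and the demo uses a small local file in place of the real download.

```r
library(data.table)

# Hypothetical helper: draw n random non-empty lines from a local text file,
# ignoring the first `skip` lines. Needs tail, grep and shuf on the PATH.
sample_lines <- function(path, n, skip = 0) {
  cmd <- sprintf("tail -n +%d %s | grep -v '^$' | shuf -n %d",
                 skip + 1, shQuote(path), n)
  lines <- system(cmd, intern = TRUE)   # character vector, one element per line
  data.table(password = lines)
}

# Local stand-in file: two header lines, a blank line, then data.
# For the real data, download once first, e.g.
#   download.file("http://datashaping.com/passwords.txt", tmp, quiet = TRUE)
tmp <- tempfile(fileext = ".txt")
writeLines(c("# header 1", "# header 2", "", "rambo144", "primoz123",
             "sal1387", "brazer1"), tmp)

smp <- sample_lines(tmp, n = 2, skip = 2)
print(smp)
```

Reading the sampled lines with system(..., intern = TRUE) instead of fread() sidesteps separator detection on a file that is not really a CSV, at the cost of depending on external tools.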