更好的方式在Haskell中从列表中抽取样本而无需更换 [英] Better way of taking random sample without replacement from list in Haskell

查看:86
本文介绍了更好的方式在Haskell中从列表中抽取样本而无需更换的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我需要从一个较长的列表中取一个没有替换的随机样本(每个元素只在样本中出现一次)。我使用下面的代码,但现在我想知道:


  1. 是否有一个库函数可以实现此功能?

    li>
  2. 如何改进此代码? (我是一个Haskell初学者,所以即使有库函数,这也是有用的)。

抽样的目的是能够将样本分析结果推广到人口。

  import System.Random 

- |从列表中取出一个没有更换尺寸大小的随机样本。
takeRandomSample :: Int - > Int - > [a] - > [a]
takeRandomSample种子大小xs
|尺寸< hi =子集xs rs
|否则=错误样本量必须小于总体。
其中
rs = randomSample种子大小lo hi
lo = 0
hi = length xs - 1

getOneRandomV g lo hi = randomR(lo,嗨)g

rs帮助大小lo hi gx acc
| x`notElem` acc&&长度acc< size = rsHelper size lo hi new_g new_x(x:acc)
| x`elem` acc&&长度acc< size = rsHelper size lo hi new_g new_x acc
|否则= acc
其中(new_x,new_g)= getOneRandomV g lo hi

- |获得一个随机样本,无需在lo和hi之间替换尺寸大小。
randomSample种子大小lo hi = rsHelper size lo hi gx []其中
(x,g)= getOneRandomV(mkStdGen种子)lo hi

子集l = map(l! !)


解决方案

这里有一个快速的back-of-the 信封实现Daniel Fischer在他的评论中提出的建议,使用我的首选PRNG(mwc-random):

  { - # LANGUAGE BangPatterns# - } 

模块示例(示例)其中

导入Control.Monad.Primitive
导入Data.Foldable(toList)
导入限定Data.Sequence as Seq
import System.Random.MWC

sample :: PrimMonad m => [a] - > Int - > Gen(PrimState m) - > m [a]
sample ys size = go 0(l-1)(Seq.fromList ys)其中
l =长度ys
go!n!i xs g | n> =大小=返回$! (toList。Seq.drop(l-size))xs
|否则= do
j < - uniformR(0,i)g
letI = xs`Seq.index` j
toJ = xs`Seq.index` i
next = (Seq.update i toI.Seq.update j toJ)xs
go(n + 1)(i-1)next g
{ - #INLINE sample# - }
$ b

这是对R的内部C版本 sample(),因为它被称为没有替换。
$ b 样本仅仅是递归工作函数的一个包装,它递增地混合总体直到达到所需的样本大小,只返回许多混洗的元素。写这样的函数可以确保GHC可以内联它。



它很容易使用:

  *主>创建>> =样本[1..100] 10 
[51,94,58,3,91,70,19,65,24,53]

生产版本可能希望按顺序使用类似可变向量的元素,而不是 Data.Sequence 减少花在GC上的时间。

I need to take a random sample without replacement (each element only occurring once in the sample) from a longer list. I'm using the code below, but now I'd like to know:

  1. Is there a library function that does this?
  2. How can I improve this code? (I'm a Haskell beginner, so this would be useful even if there is a library function).

The purpose of the sampling is to be able to generalize findings from analyzing the sample to the population.

import System.Random

-- | Take a random sample without replacement of size size from a list.
takeRandomSample :: Int -> Int -> [a] -> [a]
takeRandomSample seed size xs
    | size < hi  = subset xs rs
    | otherwise = error "Sample size must be smaller than population."
    where
        rs = randomSample seed size lo hi
        lo = 0
        hi = length xs - 1

getOneRandomV g lo hi = randomR (lo, hi) g

rsHelper size lo hi g x acc
    | x `notElem` acc && length acc < size = rsHelper size lo hi new_g new_x (x:acc)
    | x `elem` acc && length acc < size = rsHelper size lo hi new_g new_x acc
    | otherwise = acc
    where (new_x, new_g) = getOneRandomV g lo hi

-- | Get a random sample without replacement of size size between lo and hi.
randomSample seed size lo hi = rsHelper size lo hi g x [] where
(x, g)  = getOneRandomV (mkStdGen seed) lo hi

subset l = map (l !!) 

解决方案

Here's a quick 'back-of-the-envelope' implementation of what Daniel Fischer suggested in his comment, using my preferred PRNG (mwc-random):

{-# LANGUAGE BangPatterns #-}

module Sample (sample) where

import Control.Monad.Primitive
import Data.Foldable (toList)
import qualified Data.Sequence as Seq
import System.Random.MWC

sample :: PrimMonad m => [a] -> Int -> Gen (PrimState m) -> m [a]
sample ys size = go 0 (l - 1) (Seq.fromList ys) where
    l = length ys
    go !n !i xs g | n >= size = return $! (toList . Seq.drop (l - size)) xs
                  | otherwise = do
                      j <- uniformR (0, i) g
                      let toI  = xs `Seq.index` j
                          toJ  = xs `Seq.index` i
                          next = (Seq.update i toI . Seq.update j toJ) xs
                      go (n + 1) (i - 1) next g
{-# INLINE sample #-}

This is pretty much a (terse) functional rewrite of R's internal C version of sample() as it's called without replacement.

sample is just a wrapper over a recursive worker function that incrementally shuffles the population until the desired sample size is reached, returning only that many shuffled elements. Writing the function like this ensures that GHC can inline it.

It's easy to use:

*Main> create >>= sample [1..100] 10
[51,94,58,3,91,70,19,65,24,53]

A production version might want to use something like a mutable vector instead of Data.Sequence in order to cut down on time spent doing GC.

这篇关于更好的方式在Haskell中从列表中抽取样本而无需更换的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆