在Go中以恒定的时间从文件中读取随机行 [英] Reading a random line from a file in constant time in Go

查看:44
本文介绍了在Go中以恒定的时间从文件中读取随机行的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有以下代码从包含 ip:port 形式的行的文件中选择2条随机行:

  import("os""fmt"数学/rand"日志"时间""unicode/utf8"//"bufio")func main(){fmt.Println(行中的数字字节为:\ n",utf8.RuneCountInString("10.244.1.8:8080"))file_pods_array,err_file_pods_array:= os.Open("pods_array.txt")如果err_file_pods_array!= nil {log.Fatalf(打开文件失败:%s",err_file_pods_array)}//16 = ip:port对中的字节数randsource:= rand.NewSource(time.Now().UnixNano())randgenerator:= rand.New(randsource)firstLoc:= randgenerator.Intn(10)secondLoc:= randgenerator.Intn(10)候选人1:="候选人2:="num_bytes_from_start_first:= 16 *(firstLoc + 1)num_bytes_from_start_second:= 16 *(secondLoc + 1)buf_ipport_first:= make([] byte,int64(15))buf_ipport_second:= make([] byte,int64(15))start_first:= int64(num_bytes_from_start_first)start_second:= int64(num_bytes_from_start_second)_,err_first:= file_pods_array.ReadAt(buf_ipport_first,start_first)first_ipport_ep:= buf_ipport_first如果err_first == nil {候选人1 =字符串(first_ipport_ep)}_,err_second:= file_pods_array.ReadAt(buf_ipport_second,start_second)second_ipport_ep:= buf_ipport_second如果err_second == nil {候选人2 =字符串(second_ipport_ep)}fmt.Println("first is:",候选人1)fmt.Println("sec是:",候选人2)} 

这有时会打印空白行或部分行.为什么会发生这种情况,我该如何解决?

输出示例:

行中的

  num个字节是:15首先是:10.244.1.17:808秒是:10.244.1.11:80 

谢谢.

解决方案

如果行的长度是固定的,则可以在固定时间内执行此操作.

  1. 每行的长度是L.
  2. 检查文件大小S.
  3. 将S/L除以得到行数N.
  4. 选择0到N-1之间的随机数R.
  5. 在文件中寻找R * L.
  6. 读取L个字节.


但是您不会没有固定长度的线.我们不能执行恒定时间,但是可以使用Donald E. Knuth的计算机编程艺术,第2卷,第3.4.2节,中的技术在恒定的内存和O(n)时间中进行.em>.

  1. 阅读一行.记住它的行号M.
  2. 选择一个从1到M的随机数.
  3. 如果为1,请记住这一行.

也就是说,当您阅读每一行时,您有1/M的机会选择它.累积起来,每行总计为1/N.

如果我们有3条线,则第一条线有1/1的几率被选中.然后剩下1/2的几率.然后剩下2/3的机会.总机率:1 * 1/2 * 2/3 = 1/3.

第二行被选中的机率是1/2,剩下的几率为2/3.总机率:1/2 * 2/3 = 1/3.

第三行被选中的几率是1/3.

 程序包主要进口("bufio""fmt""os"日志"数学/rand"时间");func main(){文件,错误:= os.Open("pods_array.txt")如果err!= nil {log.Fatal(错误)}延迟file.Close()扫描仪:= bufio.NewScanner(文件)randsource:= rand.NewSource(time.Now().UnixNano())randgenerator:= rand.New(randsource)lineNum:= 1var选择字符串用于扫描仪.Scan(){行:= Scanner.Text()fmt.Printf(将%v考虑为1/%v.\ n",scanner.Text(),lineNum)//而不是1到N,而是0到N-1roll:= randgenerator.Intn(lineNum)fmt.Printf(我们滚动%v.\ n",滚动)如果roll == 0 {fmt.Printf(提货线.\ n")选择=线}lineNum + = 1}fmt.Printf("Picked:%v \ n",选择)} 

因为 rand.Intn(n)返回 [0,n),即从0到n-1,所以我们检查0,而不是1.


也许您在想:如果我寻找文件中的一个随机点然后读取下一个完整的行该怎么办?"那不是完全恒定的时间,而是O(最长行),但并不是真正的随机.更长的台词会更频繁地被选择.


请注意,由于这些(我假设)是所有可以的IP地址和端口,因此记录长度是恒定的.将IPv4地址存储为32位,将端口存储为16位.每行48位.

但是,这将在IPv6上中断.为了实现前向兼容性,请将所有内容存储为IPv6:IP的128位和端口的16位.每行144位.将IPv4地址转换为IPv6 进行存储.

这将允许您在恒定时间内选择随机地址,并节省磁盘空间.

或者,将将其存储在SQLite中.

I have the following code to choose 2 random lines from a file containing lines of the form ip:port:

import (
  "os"
   "fmt"
"math/rand"
"log"
"time"
"unicode/utf8"

//"bufio"
)
func main() {
fmt.Println("num bytes in line is: \n", utf8.RuneCountInString("10.244.1.8:8080"))
file_pods_array, err_file_pods_array := os.Open("pods_array.txt")
if err_file_pods_array != nil {
        log.Fatalf("failed opening file: %s", err_file_pods_array)
}
//16 = num of bytes in ip:port pair
randsource := rand.NewSource(time.Now().UnixNano())
                randgenerator := rand.New(randsource)
                firstLoc := randgenerator.Intn(10)
                secondLoc := randgenerator.Intn(10)
                candidate1 := ""
                candidate2 := ""
num_bytes_from_start_first := 16 * (firstLoc + 1)
num_bytes_from_start_second := 16 * (secondLoc + 1)
    buf_ipport_first := make([]byte, int64(15))
    buf_ipport_second := make([]byte, int64(15))
    start_first := int64(num_bytes_from_start_first)
    start_second := int64(num_bytes_from_start_second)
    _, err_first := file_pods_array.ReadAt(buf_ipport_first, start_first)
    first_ipport_ep := buf_ipport_first
    if err_first == nil {
            candidate1 = string(first_ipport_ep)
    }
    _, err_second := file_pods_array.ReadAt(buf_ipport_second, start_second)
    second_ipport_ep := buf_ipport_second

    if err_second == nil {
            candidate2 = string(second_ipport_ep)
    }
fmt.Println("first is: ", candidate1)
fmt.Println("sec is: ", candidate2)
}

This sometimes prints empty or partial lines. Why does this happen and how can I fix it?

Output example:

num bytes in line is:
 15
first is: 10.244.1.17:808
sec is:
10.244.1.11:80

Thank you.

解决方案

If your lines were of a fixed length you could do this in constant time.

  1. Length of each line is L.
  2. Check the size of the file, S.
  3. Divide S/L to get the number of lines N.
  4. Pick a random number R from 0 to N-1.
  5. Seek to R*L in the file.
  6. Read L bytes.


But you don't have fixed length lines. We can't do constant time, but we can do it in constant memory and O(n) time using the technique from The Art of Computer Programming, Volume 2, Section 3.4.2, by Donald E. Knuth.

  1. Read a line. Remember its line number M.
  2. Pick a random number from 1 to M.
  3. If it's 1, remember this line.

That is, as you read each line you have a 1/M chance of picking it. Cumulatively this adds up to 1/N for every line.

If we have three lines, the first line has a 1/1 chance of being picked. Then a 1/2 chance of remaining. Then a 2/3 chance of remaining. Total chance: 1 * 1/2 * 2/3 = 1/3.

The second line has a 1/2 chance of being picked and a 2/3 chance of remaining. Total chance: 1/2 * 2/3 = 1/3.

The third line has a 1/3 chance of being picked.

package main

import(
    "bufio"
    "fmt"
    "os"
    "log"
    "math/rand"
    "time"
);

func main() {
    file, err := os.Open("pods_array.txt")
    if err != nil {
        log.Fatal(err)
    }
    defer file.Close()

    scanner := bufio.NewScanner(file)

    randsource := rand.NewSource(time.Now().UnixNano())
    randgenerator := rand.New(randsource)

    lineNum := 1
    var pick string
    for scanner.Scan() {
        line := scanner.Text()
        fmt.Printf("Considering %v at 1/%v.\n", scanner.Text(), lineNum)

        // Instead of 1 to N it's 0 to N-1
        roll := randgenerator.Intn(lineNum)        
        fmt.Printf("We rolled a %v.\n", roll)
        if roll == 0 {
            fmt.Printf("Picking line.\n")
            pick = line
        }

        lineNum += 1
    }

    fmt.Printf("Picked: %v\n", pick)
}

Because rand.Intn(n) returns [0,n), that is from 0 to n-1, we check for 0, not 1.


Maybe you're thinking "what if I seek to a random point in the file and then read the next full line?" That wouldn't quite be constant time, it would beO(longest-line), but it wouldn't be truly random. Longer lines would get picked more frequently.


Note that since these are (I assume) all IP addresses and ports you could have constant record lengths. Store the IPv4 address as a 32 bits and the port as a 16 bits. 48 bits per line.

However, this will break on IPv6. For forward compatibility store everything as IPv6: 128 bits for the IP and 16 bits for the port. 144 bits per line. Convert IPv4 addresses to IPv6 for storage.

This will allow you to pick random addresses in constant time, and it will save disk space.

Alternatively, store them in SQLite.

这篇关于在Go中以恒定的时间从文件中读取随机行的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆