Fastest method to generate big random string with lower Latin letters


Question

I'm trying to solve this problem from Timus Online Judge. To solve it, you need to generate a sequence of 1 000 000 lowercase Latin letters and write it to stdout within 1 second.

It is easy to solve this problem with C++ or Java. Here is my Python solution:

import os
from random import randint

s = ''.join(chr(97 + randint(0, 25)) for i in range(1000000))
os.write(1, bytes(s, 'utf8'))

It takes 1.7 seconds:

$ time python3.3 1219.py > /dev/null

real    0m1.756s
user    0m1.744s
sys     0m0.008s

The result is "Time limit exceeded". So the question is: how can this be done faster?

UPD1: Using randint(97, 122) reduces the time by 16 ms. Now it is 1.740s.

UPD2: The solution by @Martijn Pieters takes 0.979s, but it doesn't pass the test either.

UPD3: Martijn Pieters suggested some very good solutions, but they are still slow:

from sys import stdout
from random import choice
from string import ascii_lowercase

s = ''.join([choice(ascii_lowercase) for _ in range(1000000)])
stdout.write(s) 

It takes 0.924s.

from sys import stdout
from random import choice
from string import ascii_lowercase

for _ in range(1000000):
    stdout.write(choice(ascii_lowercase))

It takes 1.173s.

from sys import stdout
from random import choice
from string import ascii_lowercase
bal = [c.encode('ascii') for c in ascii_lowercase]
out = stdout.buffer

for _ in range(1000000):
    out.write(choice(bal))

It takes 1.155s.

from sys import stdout
from random import choice
from string import ascii_lowercase

bal = [c.encode('ascii') for c in ascii_lowercase]
stdout.buffer.write(b''.join([choice(bal) for _ in range(1000000)]))

It takes 0.901s.

UPD4

Someone has just solved the problem on Timus. I hope he will share his solution :)

UPD5: Thanks to Ashwini Chaudhary for sharing his Python 2.x solution with us:

from random import choice
from string import ascii_lowercase
lis=list(ascii_lowercase)
print ''.join(choice(lis) for _ in xrange(1000000)) 

It takes 0.527s on my computer and it passes the tests on Timus. But the problem with Python 3.x still remains.

UPD6: Thanks to Markku K. This code:

import os
from random import random
from string import ascii_lowercase

bal = [c.encode('ascii') for c in ascii_lowercase]
os.write(1, b''.join([bal[int(random() * 26)] for _ in range(1000000)]))

takes 0.445s, but it still doesn't pass the test.

Answer

Here's Python 3 code that generates 1000000 "random" lowercase letters in 0.28 seconds (see also the 0.11-second solution below; on my machine, @Ashwini Chaudhary's code from the question takes 0.55 seconds and @Markku K.'s code takes 0.53):

#!/usr/bin/env python3
import os
import sys

def write_random_lowercase(n):
    min_lc = ord(b'a')
    len_lc = 26
    ba = bytearray(os.urandom(n))
    for i, b in enumerate(ba):
        ba[i] = min_lc + b % len_lc # convert 0..255 to 97..122
    sys.stdout.buffer.write(ba)

write_random_lowercase(1000000)

% len_lc skews the distribution (see the end of the answer for how to fix it), though the output still satisfies the conditions (ASCII, lowercase, frequencies of 1-, 2-, and 3-letter sequences):

$ python3 generate-random.py | python3 check-seq.py

where check-seq.py is:

#!/usr/bin/env python3
import sys
from collections import Counter
from string import ascii_lowercase

def main():
    limits = [40000, 2000, 100]

    s = sys.stdin.buffer.readline() # a single line
    assert 1000000 <= len(s) <= 1000002 # check length +/- newline
    s.decode('ascii','strict') # check ascii
    assert set(s) == set(ascii_lowercase.encode('ascii')) # check lowercase

    for n, lim in enumerate(limits, start=1):
        freq = Counter(tuple(s[i:i+n]) for i in range(len(s)))
        assert max(freq.values()) <= lim, freq

main()

Note: on acm.timus.ru, generate-random.py gives "Output limit exceeded".

To improve performance, you could use the bytes.translate() method (0.11 seconds):

#!/usr/bin/env python3
import os
import sys

# make translation table from 0..255 to 97..122
tbl = bytes.maketrans(bytearray(range(256)),
                      bytearray([ord(b'a') + b % 26 for b in range(256)]))
# generate random bytes and translate them to lowercase ascii
sys.stdout.buffer.write(os.urandom(1000000).translate(tbl))
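To compare the two approaches on your own machine, a small timing harness along these lines may help. It is only a sketch: the function names with_loop, with_translate, and compare are illustrative and not part of the original answer, and the absolute numbers depend on the interpreter and the operating system.

```python
import os
import timeit

# 256-entry table mapping every byte value 0..255 to a letter 97..122
TABLE = bytes(ord('a') + b % 26 for b in range(256))


def with_loop(n):
    """Map random bytes to lowercase letters one byte at a time."""
    ba = bytearray(os.urandom(n))
    for i, b in enumerate(ba):
        ba[i] = ord('a') + b % 26
    return bytes(ba)


def with_translate(n):
    """Map random bytes to lowercase letters with a single translate() call."""
    return os.urandom(n).translate(TABLE)


def compare(n=1000000, repeat=3):
    """Return the best wall-clock time in seconds for each approach."""
    results = {}
    for func in (with_loop, with_translate):
        timings = timeit.repeat(lambda: func(n), number=1, repeat=repeat)
        results[func.__name__] = min(timings)
    return results
```

Calling compare() returns a dict of timings keyed by function name. Both functions keep the % 26 skew discussed in this answer.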

How to fix % len_lc skew

256 (the number of byte values) is not evenly divisible by 26 (the number of lowercase Latin letters), so the formula min_lc + b % len_lc makes some values appear less often than others, e.g.:

#!/usr/bin/env python3
"""Find out skew: x = 97 + y % 26 where y is uniform from [0, 256) range."""
from collections import Counter, defaultdict

def find_skew(random_bytes):
    char2freq = Counter(chr(ord(b'a') + b % 26) for b in random_bytes)
    freq2char = defaultdict(set)
    for char, freq in char2freq.items():
        freq2char[freq].add(char)
    return {f: ''.join(sorted(c)) for f, c in freq2char.items()}

print(find_skew(range(256)))
# -> {9: 'wxyz', 10: 'abcdefghijklmnopqrstuv'}

Here, the input range(256) is uniformly distributed (each byte occurs exactly once), but the letters 'wxyz' occur less often in the output than the rest: 9 vs. 10 occurrences. To fix it, the unaligned bytes can be dropped:

print(find_skew(range(256 - (256 % 26))))
# -> {9: 'abcdefghijklmnopqrstuvwxyz'}

Here, the input is uniformly distributed bytes in the range [0, 234), and the output is uniformly distributed ASCII lowercase letters.

bytes.translate() accepts a second argument that specifies the bytes to delete:

#!/usr/bin/env python3
import os
import sys

nbytes = 256
nletters = 26
naligned = nbytes - (nbytes % nletters)
tbl = bytes.maketrans(bytearray(range(naligned)),
                      bytearray([ord(b'a') + b % nletters
                                 for b in range(naligned)]))
bytes2delete = bytearray(range(naligned, nbytes))
R = lambda n: os.urandom(n).translate(tbl, bytes2delete)

def write_random_ascii_lowercase_letters(write, n):
    """*write* *n* random ascii lowercase letters."""    
    while n > 0:
        # R(k) is expected to drop a `(nbytes - naligned) / nbytes` fraction
        # of its k input bytes; request more bytes up front to compensate
        n -= write(memoryview(R(n * nbytes // naligned + 1))[:n])

write = sys.stdout.buffer.write
write_random_ascii_lowercase_letters(write, 1000000)

If the random generator (os.urandom here) produces long sequences of bytes that fall outside the aligned range (>= 234), then the while loop may execute many times.
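One rough way to see how often the loop repeats is to count the iterations directly. The sketch below is an illustration rather than part of the original answer: count_iterations and its source parameter are hypothetical names, and the function only deletes the unaligned bytes (22 of the 256 byte values) without remapping the rest to letters.

```python
import os

N_BYTES, N_LETTERS = 256, 26
N_ALIGNED = N_BYTES - N_BYTES % N_LETTERS  # 234
TO_DELETE = bytes(range(N_ALIGNED, N_BYTES))  # the 22 unaligned byte values


def count_iterations(n, source=os.urandom):
    """Count the oversampled requests needed to collect n aligned bytes.

    *source* is any callable that returns k bytes when called with k
    (an assumption made here so that other generators can be plugged in).
    """
    iterations = 0
    while n > 0:
        raw = source(n * N_BYTES // N_ALIGNED + 1)
        kept = raw.translate(None, TO_DELETE)  # delete only, no remapping
        n -= min(n, len(kept))
        iterations += 1
    return iterations
```

For example, count_iterations(1000000) reports how many requests one run needed. A source that never returns a byte below 234 would make the loop run forever, which is the degenerate case described above.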

Performance can be improved by another order of magnitude if random.getrandbits(8*n).to_bytes(n, 'big') is used instead of os.urandom(n). The former uses the Mersenne Twister as its core generator, which may be faster than os.urandom(), which draws on sources provided by the operating system. The latter is more secure if you use the random string for secrets.
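Swapping the byte source could look roughly like this. It is a sketch under assumptions: random_bytes and random_lowercase are illustrative names, the function returns the letters instead of writing them, and the Mersenne Twister output must not be used for secrets.

```python
import random

N_BYTES = 256
N_LETTERS = 26
N_ALIGNED = N_BYTES - (N_BYTES % N_LETTERS)  # 234

# Map the aligned bytes 0..233 to letters; bytes 234..255 are deleted.
TABLE = bytes.maketrans(
    bytes(range(N_ALIGNED)),
    bytes(ord('a') + b % N_LETTERS for b in range(N_ALIGNED)),
)
TO_DELETE = bytes(range(N_ALIGNED, N_BYTES))


def random_bytes(n):
    """Return n pseudo-random bytes from the Mersenne Twister (not for secrets)."""
    return random.getrandbits(8 * n).to_bytes(n, 'big')


def random_lowercase(n):
    """Return exactly n uniformly distributed ASCII lowercase letters."""
    chunks = []
    missing = n
    while missing > 0:
        # Oversample to compensate for the bytes >= 234 that get deleted.
        raw = random_bytes(missing * N_BYTES // N_ALIGNED + 1)
        chunk = raw.translate(TABLE, TO_DELETE)[:missing]
        chunks.append(chunk)
        missing -= len(chunk)
    return b''.join(chunks)
```

To emit the letters, pass the result to sys.stdout.buffer.write(), e.g. sys.stdout.buffer.write(random_lowercase(1000000)) after importing sys.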
