Performance of Redis vs Disk in caching application


Question

I wanted to create a redis cache in python, and as any self-respecting scientist I made a benchmark to test the performance.

Interestingly, redis did not fare so well. Either Python is doing something magical (storing the file), or my version of redis is stupendously slow.

I don't know if this is because of the way my code is structured, or what, but I was expecting redis to do better than it did.

To make a redis cache, I set my binary data (in this case, an HTML page) to a key derived from the filename, with an expiration of 5 minutes.

In all cases, file handling is done with f.read() (this is ~3x faster than f.readlines(), and I need the binary blob).
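
A minimal sketch for sanity-checking that ~3x figure (this is not part of the original benchmark, and the file path is a placeholder):

import timeit

FP = "templates/index.html"  # placeholder path; any reasonably large file works

def read_once():
    with open(FP, "rb") as f:
        f.read()

def readlines_once():
    with open(FP, "rb") as f:
        f.readlines()

print "f.read():      %.3f s for 10000 reads" % timeit.timeit(read_once, number=10000)
print "f.readlines(): %.3f s for 10000 reads" % timeit.timeit(readlines_once, number=10000)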

Is there something I'm missing in my comparison, or is Redis really no match for a disk? Is Python caching the file somewhere and re-accessing it every time? Why is that so much faster than accessing redis?

I'm using redis 2.8, python 2.7, and redis-py, all on a 64-bit Ubuntu system.

I do not think Python is doing anything particularly magical, as I made a function that stored the file data in a python object and yielded it forever.

I have four function calls that I grouped:

  • Reading the file X times

  • A function that is called to see if the redis object is still in memory, load it, or cache a new file (single and multiple redis instances)

  • A function that creates a generator that yields the result from the redis database (with single and multiple instances of redis)

  • And finally, a function that stores the file in memory and yields it forever

import redis
import time

def load_file(fp, fpKey, r, expiry):
    # Read the file from disk and cache it in redis with an expiry.
    with open(fp, "rb") as f:
        data = f.read()
    p = r.pipeline()
    p.set(fpKey, data)
    p.expire(fpKey, expiry)
    p.execute()
    return data

def cache_or_get_gen(fp, expiry=300, r=redis.Redis(db=5)):
    # Generator version: cache the file once, then yield it from
    # redis until the expiry window has elapsed.
    fpKey = "cached:"+fp

    while True:
        yield load_file(fp, fpKey, r, expiry)
        t = time.time()
        while time.time() - t - expiry < 0:
            yield r.get(fpKey)


def cache_or_get(fp, expiry=300, r=redis.Redis(db=5)):
    # Function version: check redis on every call, caching on a miss.

    fpKey = "cached:"+fp

    if r.exists(fpKey):
        return r.get(fpKey)

    else:
        with open(fp, "rb") as f:
            data = f.read()
        p = r.pipeline()
        p.set(fpKey, data)
        p.expire(fpKey, expiry)
        p.execute()
        return data

def mem_cache(fp):
    # Read the file once into a python object and yield it forever.
    with open(fp, "rb") as f:
        data = f.readlines()
    while True:
        yield data

def stressTest(fp, trials = 10000):

    # Read the file x number of times
    a = time.time()
    for x in range(trials):
        with open(fp, "rb") as f:
            data = f.read()
    b = time.time()
    readAvg = trials/(b-a)


    # Generator version

    # Read the file, cache it, read it with a new instance each time
    a = time.time()
    gen = cache_or_get_gen(fp)
    for x in range(trials):
        data = next(gen)
    b = time.time()
    cachedAvgGen = trials/(b-a)

    # Read file, cache it, pass in redis instance each time
    a = time.time()
    r = redis.Redis(db=6)
    gen = cache_or_get_gen(fp, r=r)
    for x in range(trials):
        data = next(gen)
    b = time.time()
    inCachedAvgGen = trials/(b-a)


    # Non generator version    

    # Read the file, cache it, read it with a new instance each time
    a = time.time()
    for x in range(trials):
        data = cache_or_get(fp)
    b = time.time()
    cachedAvg = trials/(b-a)

    # Read file, cache it, pass in redis instance each time
    a = time.time()
    r = redis.Redis(db=6)
    for x in range(trials):
        data = cache_or_get(fp, r=r)
    b = time.time()
    inCachedAvg = trials/(b-a)

    # Read file, cache it in python object
    # (note: mem_cache(fp) only constructs a fresh generator each pass
    # and never advances it, so no file data is actually read back here)
    a = time.time()
    for x in range(trials):
        data = mem_cache(fp)
    b = time.time()
    memCachedAvg = trials/(b-a)


    print "\n%s file reads: %.2f reads/second\n" %(trials, readAvg)
    print "Yielding from generators for data:"
    print "multi redis instance: %.2f reads/second (%.2f percent)" %(cachedAvgGen, (100*(cachedAvgGen-readAvg)/(readAvg)))
    print "single redis instance: %.2f reads/second (%.2f percent)" %(inCachedAvgGen, (100*(inCachedAvgGen-readAvg)/(readAvg)))
    print "Function calls to get data:"
    print "multi redis instance: %.2f reads/second (%.2f percent)" %(cachedAvg, (100*(cachedAvg-readAvg)/(readAvg)))
    print "single redis instance: %.2f reads/second (%.2f percent)" %(inCachedAvg, (100*(inCachedAvg-readAvg)/(readAvg)))
    print "python cached object: %.2f reads/second (%.2f percent)" %(memCachedAvg, (100*(memCachedAvg-readAvg)/(readAvg)))

if __name__ == "__main__":
    fileToRead = "templates/index.html"

    stressTest(fileToRead)

And now the results:

10000 file reads: 30971.94 reads/second

Yielding from generators for data:
multi redis instance: 8489.28 reads/second (-72.59 percent)
single redis instance: 8801.73 reads/second (-71.58 percent)
Function calls to get data:
multi redis instance: 5396.81 reads/second (-82.58 percent)
single redis instance: 5419.19 reads/second (-82.50 percent)
python cached object: 1522765.03 reads/second (4816.60 percent)

The results are interesting in that a) generators are faster than calling functions each time, b) redis is slower than reading from the disk, and c) reading from python objects is ridiculously fast.

Why would reading from a disk be so much faster than reading an in-memory file from redis?

Edit:

Some more information and tests.

I replaced the function call with:

data = r.get(fpKey)
if data:
    return data

The results do not differ much from:

if r.exists(fpKey):
    data = r.get(fpKey)


Function calls to get data using r.exists as test
multi redis instance: 5320.51 reads/second (-82.34 percent)
single redis instance: 5308.33 reads/second (-82.38 percent)
python cached object: 1494123.68 reads/second (5348.17 percent)


Function calls to get data using if data as test
multi redis instance: 8540.91 reads/second (-71.25 percent)
single redis instance: 7888.24 reads/second (-73.45 percent)
python cached object: 1520226.17 reads/second (5132.01 percent)

Creating a new redis instance on each function call actually does not have a noticeable effect on read speed; the variability from test to test is larger than the gain.

Sripathi Krishnan suggested implementing random file reads. This is where caching starts to really help, as we can see from these results.

Total number of files: 700

10000 file reads: 274.28 reads/second

Yielding from generators for data:
multi redis instance: 15393.30 reads/second (5512.32 percent)
single redis instance: 13228.62 reads/second (4723.09 percent)
Function calls to get data:
multi redis instance: 11213.54 reads/second (3988.40 percent)
single redis instance: 14420.15 reads/second (5157.52 percent)
python cached object: 607649.98 reads/second (221446.26 percent)

There is a HUGE amount of variability in file reads, so the percent difference is not a good indicator of speedup.

Total number of files: 700

40000 file reads: 1168.23 reads/second

Yielding from generators for data:
multi redis instance: 14900.80 reads/second (1175.50 percent)
single redis instance: 14318.28 reads/second (1125.64 percent)
Function calls to get data:
multi redis instance: 13563.36 reads/second (1061.02 percent)
single redis instance: 13486.05 reads/second (1054.40 percent)
python cached object: 587785.35 reads/second (50214.25 percent)

I used random.choice(fileList) to randomly select a new file on each pass through the functions.
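
For reference, a minimal sketch of that random-read loop (the templates/ directory layout and the variable names here are assumptions, not the exact gist code):

import os
import random
import time

# Hypothetical corpus of files; the gist builds its own list.
fileList = [os.path.join("templates", f) for f in os.listdir("templates")]

trials = 10000
a = time.time()
for x in range(trials):
    # cache_or_get as defined above; a random file on every pass
    data = cache_or_get(random.choice(fileList))
b = time.time()
print "%d random cached reads: %.2f reads/second" % (trials, trials/(b-a))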

The full gist is here if anyone would like to try it out - https://gist.github.com/3885957

Edit edit:

I did not realize that I was reading one single file for the generators (although the performance of the function call and the generator was very similar). Here is the result of reading different files from the generator as well.

Total number of files: 700
10000 file reads: 284.48 reads/second

Yielding from generators for data:
single redis instance: 11627.56 reads/second (3987.36 percent)

Function calls to get data:
single redis instance: 14615.83 reads/second (5037.81 percent)

python cached object: 580285.56 reads/second (203884.21 percent)


Answer

This is an apples-to-oranges comparison. See http://redis.io/topics/benchmarks

Redis is an efficient remote data store. Each time a command is executed on Redis, a message is sent to the Redis server, and if the client is synchronous, it blocks waiting for the reply. So beyond the cost of the command itself, you will pay for a network roundtrip or an IPC.

On modern hardware, network roundtrips or IPCs are surprisingly expensive compared to other operations. This is due to several factors:


  • the raw latency of the medium (mainly for network)
  • the latency of the operating system scheduler (not guaranteed on Linux/Unix)
  • memory cache misses are expensive, and the probability of cache misses increases while the client and server processes are scheduled in/out
  • on high-end boxes, NUMA side effects

Now, let's review the results.

Comparing the implementation using generators and the one using function calls, they do not generate the same number of roundtrips to Redis. With the generator you simply have:

    while time.time() - t - expiry < 0:
        yield r.get(fpKey)

That is 1 roundtrip per iteration. With the function, you have:

if r.exists(fpKey):
    return r.get(fpKey)

That is 2 roundtrips per iteration. No wonder the generator is faster.
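
The edit in the question points the same way: a plain GET whose reply is tested needs only one roundtrip on a cache hit. A minimal sketch of that pattern (the function name is made up for illustration):

import redis

def cache_or_get_one_trip(fp, expiry=300, r=redis.Redis(db=5)):
    fpKey = "cached:" + fp
    data = r.get(fpKey)       # one roundtrip instead of EXISTS + GET
    if data is not None:
        return data
    with open(fp, "rb") as f:
        data = f.read()
    p = r.pipeline()          # SET + EXPIRE batched into a single roundtrip
    p.set(fpKey, data)
    p.expire(fpKey, expiry)
    p.execute()
    return data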

Of course, you are supposed to reuse the same Redis connection for optimal performance. There is no point in running a benchmark which systematically connects and disconnects.
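
With redis-py, one way to do that is to build a single ConnectionPool up front and hand every client the same pool; a sketch (host and port are the defaults, shown for clarity):

import redis

# Created once for the whole process; clients on this pool reuse connections.
POOL = redis.ConnectionPool(host="localhost", port=6379, db=5)
r = redis.Redis(connection_pool=POOL)

# Pass this `r` into cache_or_get / cache_or_get_gen instead of letting
# each call construct its own redis.Redis(db=5).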

Finally, regarding the performance difference between Redis calls and the file reads: you are simply comparing a local call to a remote one. File reads are cached by the OS filesystem, so they are fast memory-transfer operations between the kernel and Python. There is no disk I/O involved here. With Redis, you have to pay the cost of the roundtrips, so it is much slower.
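
One can make the roundtrip cost visible on its own by timing a no-payload command such as PING against a page-cache-warm file read; a minimal sketch, assuming a local Redis on the default port:

import time
import redis

r = redis.Redis()              # assumes a local server on the default port
fp = "templates/index.html"    # placeholder; any small file works
trials = 10000

a = time.time()
for x in range(trials):
    r.ping()                   # a full roundtrip carrying almost no data
b = time.time()
print "PING:      %.2f ops/second" % (trials/(b-a))

a = time.time()
for x in range(trials):
    with open(fp, "rb") as f:
        f.read()               # served from the OS page cache after the first pass
b = time.time()
print "file read: %.2f ops/second" % (trials/(b-a))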
