Why is reading and calling an API from a file slower using Python async than synchronously?


Question

I have a large file, with a JSON record on each line. I'm writing a script to upload a subset of these records to CouchDB via the API, and I'm experimenting with different approaches to see what works fastest. Here's what I've found, from fastest to slowest (on a CouchDB instance on my localhost):

  1. Read each needed record into memory. After all records are in memory, generate an upload coroutine for each record, and gather/run all the coroutines at once.

  2. Synchronously read the file and, when a needed record is encountered, synchronously upload it.

  3. Use aiofiles to read the file and, when a needed record is encountered, asynchronously upload it.

Approach #1 is much faster than the other two (about twice as fast). I am confused why approach #2 is faster than #3, especially in contrast to this example here, which takes half as much time to run asynchronously as synchronously (sync code not provided, had to rewrite it myself). Is it the context switching from file I/O to HTTP I/O, especially with file reads occurring much more often than API uploads?

For additional illustration, here's some Python pseudo-code that represents each approach:

Approach #1 - Synchronous file IO, asynchronous HTTP IO

import json
import asyncio
import aiohttp

records = []
# Synchronously read the whole file first, keeping the needed records in memory
with open('records.txt', 'r') as record_file:
    for line in record_file:
        record = json.loads(line)
        if valid(record):
            records.append(record)

async def batch_upload(records):
    async with aiohttp.ClientSession() as session:
        # Create one upload coroutine per record, then run them all concurrently
        tasks = []
        for record in records:
            task = async_upload(record, session)
            tasks.append(task)
        await asyncio.gather(*tasks)

asyncio.run(batch_upload(records))

Approach #2 - Synchronous file IO, synchronous HTTP IO

import json

with open('records.txt', 'r') as record_file:
    for line in record_file:
        record = json.loads(line)
        if valid(record):
            sync_upload(record)

Approach #3 - Asynchronous file IO, asynchronous HTTP IO

import json
import asyncio
import aiohttp
import aiofiles

async def batch_upload():
    async with aiohttp.ClientSession() as session:
        # aiofiles.open (not the builtin open) is needed for async file IO
        async with aiofiles.open('records.txt', 'r') as record_file:
            line = await record_file.readline()
            while line:
                record = json.loads(line)
                if valid(record):
                    await async_upload(record, session)
                line = await record_file.readline()

asyncio.run(batch_upload())

The file I'm developing this with is about 1.3 GB, with 100000 records total, 691 of which I upload. Each upload begins with a GET request to see if the record already exists in CouchDB. If it does, then a PUT is performed to update the CouchDB record with any new information; if it doesn't, the record is POSTed to the db. So, each upload consists of two API requests. For dev purposes, I'm only creating records, so I run the GET and POST requests, 1382 API calls total.
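For reference, a minimal sketch of what async_upload might look like, assuming each record carries its CouchDB document id in an _id field and the database lives at a hypothetical localhost URL (the question only describes the GET-then-PUT/POST flow, so both assumptions are mine):

import aiohttp

DB_URL = 'http://localhost:5984/records'  # hypothetical database URL

async def async_upload(record, session):
    # GET first to see whether the record already exists.
    doc_url = f"{DB_URL}/{record['_id']}"
    async with session.get(doc_url) as resp:
        if resp.status == 200:
            # Exists: merge in the new fields and PUT the doc back,
            # keeping the current _rev so CouchDB accepts the update.
            doc = await resp.json()
            doc.update(record)
            async with session.put(doc_url, json=doc) as put_resp:
                await put_resp.read()
        else:
            # Doesn't exist: POST the new record to the database.
            async with session.post(DB_URL, json=record) as post_resp:
                await post_resp.read()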


Approach #1 takes about 17 seconds, approach #2 takes about 33 seconds, and approach #3 takes about 42 seconds.

Answer

Your code uses async, but it does the work sequentially, and in this case it will be slower than the sync approach. Async won't speed up the execution if it is not constructed/used effectively: approach #3 awaits each upload to completion before reading the next line, so the file reads and HTTP requests never overlap.
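To make the difference concrete, here is a minimal sketch (do_request is a hypothetical stand-in for one HTTP round trip):

import asyncio

async def do_request(i):
    await asyncio.sleep(1)  # stands in for one HTTP round trip
    return i

async def sequential():
    # Awaiting inside the loop: each request finishes before the next
    # starts, so 10 requests take ~10 seconds (approach #3's pattern).
    return [await do_request(i) for i in range(10)]

async def concurrent():
    # gather() lets the requests overlap, so 10 requests take ~1 second
    # (approach #1's pattern).
    return await asyncio.gather(*(do_request(i) for i in range(10)))

asyncio.run(sequential())   # ~10 s
asyncio.run(concurrent())   # ~1 s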

You can create two coroutines, one reading and one uploading, and make them run concurrently; perhaps that speeds up the operation.

示例:

#!/usr/bin/env python3

import asyncio


async def upload(event, queue):
    # Consume until the reader has finished AND the queue is drained,
    # so queued records aren't dropped at shutdown.
    while not (event.is_set() and queue.empty()):
        record = await queue.get()
        print(f'uploading record : {record}')


async def read(event, queue):
    # Dummy producer: in the real script, read the file here and put
    # each valid record on the queue.
    for i in range(1, 10):
        await queue.put(i)
    # Signal that reading is finished.
    event.set()


async def main():
    event = asyncio.Event()
    queue = asyncio.Queue()

    uploader = asyncio.create_task(upload(event, queue))
    reader = asyncio.create_task(read(event, queue))
    tasks = [uploader, reader]

    await asyncio.gather(*tasks)


if __name__ == '__main__':
    asyncio.run(main())
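Adapting that pattern to the question's script might look like the sketch below. valid() and async_upload() are the question's own placeholders, and the worker count and queue size are arbitrary assumptions:

import asyncio
import json

import aiofiles
import aiohttp

NUM_WORKERS = 20  # arbitrary; tune to what CouchDB handles comfortably

async def reader(queue):
    # Producer: stream the file and queue every valid record.
    async with aiofiles.open('records.txt', 'r') as record_file:
        async for line in record_file:
            record = json.loads(line)
            if valid(record):            # valid() as in the question
                await queue.put(record)
    for _ in range(NUM_WORKERS):
        await queue.put(None)            # one shutdown sentinel per worker

async def uploader(queue, session):
    # Consumer: upload records as they become available.
    while True:
        record = await queue.get()
        if record is None:
            break
        await async_upload(record, session)  # as in the question

async def main():
    queue = asyncio.Queue(maxsize=100)   # bound memory use
    async with aiohttp.ClientSession() as session:
        workers = [asyncio.create_task(uploader(queue, session))
                   for _ in range(NUM_WORKERS)]
        await reader(queue)
        await asyncio.gather(*workers)

asyncio.run(main())

This way the reads and the uploads overlap, and multiple uploads can be in flight at once, without first loading the whole 1.3 GB file into memory as approach #1 does.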
