Is it possible to keep spaCy in memory to reduce the load time?

Question

I want to use spaCy for NLP in an online service. Each time a user makes a request, I call the script "my_script.py", which starts with:

from spacy.en import English
nlp = English()

The problem I'm having is that those two lines take over 10 seconds. Is it possible to keep English() in RAM, or is there some other option to reduce this load time to less than a second?

Recommended answer

You said that you want to launch a freestanding script (my_script.py) whenever a request comes in, and that the script should use capabilities from spacy.en without the overhead of loading spacy.en. With this approach, the operating system always creates a new process when you launch your script. So there is only one way to avoid loading spacy.en each time: have a separate process that is already running, with spacy.en loaded, and have your script communicate with that process. The code below shows a way to do that. However, as others have said, you will probably benefit from changing your server architecture so that spacy.en is loaded within your web server (e.g., using a Python-based web server).

The most common form of inter-process communication is via TCP/IP sockets. The code below implements a small server which keeps spacy.en loaded and processes requests from the client. It also has a client which transmits requests to that server and receives results back. It's up to you to decide what to put into those transmissions.

There is also a third script. Since both client and server need send and receive functions, those functions are in a shared script called comm.py. (Note that the client and server each load a separate copy of comm.py; they do not communicate through a single module loaded into shared memory.)

I assume both scripts are run on the same machine. If not, you will need to put a copy of comm.py on both machines and change comm.server_host to the machine name or IP address for the server.

Run nlp_server.py as a background process (or just in a different terminal window for testing). This waits for requests, processes them and sends the results back:

import comm
import socket
from spacy.en import English
nlp = English()

def process_connection(sock):
    print("processing transmission from client...")
    # receive data from the client
    data = comm.receive_data(sock)
    # do something with the data
    result = {"data received": data}
    # send the result back to the client
    comm.send_data(result, sock)
    # close the socket with this particular client
    sock.close()
    print("finished processing transmission from client...")

server_sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# open socket even if it was used recently (e.g., server restart)
server_sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
server_sock.bind((comm.server_host, comm.server_port))
# queue up to 5 connections
server_sock.listen(5)
print("listening on port {}...".format(comm.server_port))
try:
    while True:
        # accept connections from clients
        (client_sock, address) = server_sock.accept()
        # process this connection 
        # (this could be launched in a separate thread or process)
        process_connection(client_sock)
except KeyboardInterrupt:
    print("Server process terminated.")
finally:
    server_sock.close()
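
The accept loop above handles one client at a time, and its comment notes that each connection could instead be handed to a separate thread. A minimal sketch of that variant, assuming Python 3: the upper-casing handler is only a stand-in for process_connection, and it uses a raw recv/sendall rather than comm.py so that the block is self-contained. The names handle_client, serve and demo are illustrative.

```python
import socket
import threading

def handle_client(sock):
    # stand-in for process_connection(): send the bytes back in upper case
    with sock:
        data = sock.recv(8192)
        sock.sendall(data.upper())

def serve(server_sock, stop_event):
    # accept connections until stop_event is set; one thread per client
    server_sock.settimeout(0.2)
    while not stop_event.is_set():
        try:
            client_sock, _address = server_sock.accept()
        except socket.timeout:
            continue
        threading.Thread(target=handle_client, args=(client_sock,),
                         daemon=True).start()

def demo():
    server_sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server_sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server_sock.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    server_sock.listen(5)
    port = server_sock.getsockname()[1]
    stop_event = threading.Event()
    server_thread = threading.Thread(target=serve, args=(server_sock, stop_event))
    server_thread.start()
    try:
        with socket.create_connection(("127.0.0.1", port)) as client:
            client.sendall(b"hello")
            return client.recv(8192)
    finally:
        stop_event.set()
        server_thread.join()
        server_sock.close()
```

With one thread per connection, a slow request no longer blocks the clients queued behind it, although all threads still share the single loaded model.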

Run my_script.py as a quick-running script that requests a result from the NLP server (e.g., python my_script.py here are some arguments):

import socket, sys
import comm

# data can be whatever you want (even just sys.argv)
data = sys.argv

print("sending to server:")
print(data)

# send data to the server and receive a result
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# disable Nagle algorithm (probably only needed over a network) 
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, True)
sock.connect((comm.server_host, comm.server_port))
comm.send_data(data, sock)
result = comm.receive_data(sock)
sock.close()

# do something with the result...
print("result from server:")
print(result)

comm.py contains code that is used by both the client and server:

import sys, struct
try:
    import cPickle as pickle  # Python 2
except ImportError:
    import pickle  # Python 3

# pick a port that is not used by any other process
server_port = 17001
server_host = '127.0.0.1' # localhost
message_size = 8192
# code to use with struct.pack to convert transmission size (int) 
# to a byte string
header_pack_code = '>I'
# number of bytes used to represent size of each transmission
# (corresponds to header_pack_code)
header_size = 4  

def send_data(data_object, sock):
    # serialize the data so it can be sent through a socket
    data_string = pickle.dumps(data_object, -1)
    data_len = len(data_string)
    # send a header showing the length, packed into 4 bytes
    sock.sendall(struct.pack(header_pack_code, data_len))
    # send the data
    sock.sendall(data_string)

def receive_data(sock):
    """ Receive a transmission via a socket, and convert it back into a binary object. """
    # This runs as a loop because the message may be broken into arbitrary-size chunks.
    # This assumes each transmission starts with a 4-byte binary header showing the size of the transmission.
    # See https://docs.python.org/3/howto/sockets.html
    # and http://code.activestate.com/recipes/408859-socketrecv-three-ways-to-turn-it-into-recvall/

    # bytes literals keep this working on both Python 2 and Python 3
    header_data = b''
    header_done = False
    # set dummy values to start the loop
    received_len = 0
    transmission_size = sys.maxsize

    while received_len < transmission_size:
        sock_data = sock.recv(message_size)
        if not sock_data:
            # recv() returns nothing once the peer has closed the socket
            raise EOFError("connection closed before the transmission was complete")
        if not header_done:
            # still receiving header info
            header_data += sock_data
            if len(header_data) >= header_size:
                header_done = True
                # split the already-received data between header and body
                messages = [header_data[header_size:]]
                received_len = len(messages[0])
                header_data = header_data[:header_size]
                # find actual size of transmission
                transmission_size = struct.unpack(header_pack_code, header_data)[0]
        else:
            # already receiving data
            received_len += len(sock_data)
            messages.append(sock_data)

    # combine messages into a single byte string
    data_string = b''.join(messages)
    # convert to an object
    data_object = pickle.loads(data_string)
    return data_object
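
The length-prefixed framing can be exercised without a running server. The following sketch, assuming Python 3, re-implements the same idea in compact form and round-trips an object over a connected socket pair. The helper names send_msg, recv_exact, recv_msg and round_trip are illustrative and not part of comm.py.

```python
import pickle
import socket
import struct

HEADER = struct.Struct(">I")  # 4-byte big-endian length, as in comm.py

def send_msg(sock, obj):
    # serialize, then send the length header followed by the payload
    payload = pickle.dumps(obj, pickle.HIGHEST_PROTOCOL)
    sock.sendall(HEADER.pack(len(payload)) + payload)

def recv_exact(sock, n):
    # keep calling recv() until exactly n bytes have arrived
    chunks = []
    remaining = n
    while remaining:
        chunk = sock.recv(remaining)
        if not chunk:
            raise EOFError("socket closed mid-transmission")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)

def recv_msg(sock):
    # read the header first, then exactly as many bytes as it announces
    (size,) = HEADER.unpack(recv_exact(sock, HEADER.size))
    return pickle.loads(recv_exact(sock, size))

def round_trip(obj):
    # send obj through one end of a socket pair and read it from the other
    a, b = socket.socketpair()
    try:
        send_msg(a, obj)
        return recv_msg(b)
    finally:
        a.close()
        b.close()
```

Reading an exact byte count for the header and then for the body avoids the bookkeeping needed when a single recv() call may return part of the header, all of it, or the header plus some of the body.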

Note: you should make sure the result sent from the server only uses native data structures (dicts, lists, strings, etc.). If the result includes an object defined in spacy.en, then the client will automatically import spacy.en when it unpacks the result, in order to provide the object's methods.
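
One way to follow this advice is to flatten the parsed document into plain dicts before sending it. The sketch below assumes tokens expose text, lemma_ and pos_ attributes, as spaCy tokens do in recent versions. The helper name doc_to_native is made up for illustration, and simple stand-in objects replace a real parsed document so the block runs without spaCy.

```python
import pickle
from types import SimpleNamespace

def doc_to_native(doc):
    # keep only plain strings, so unpickling never needs spaCy
    return [
        {"text": tok.text, "lemma": tok.lemma_, "pos": tok.pos_}
        for tok in doc
    ]

# stand-in for a parsed document: any iterable of token-like objects
fake_doc = [
    SimpleNamespace(text="Dogs", lemma_="dog", pos_="NOUN"),
    SimpleNamespace(text="bark", lemma_="bark", pos_="VERB"),
]

result = doc_to_native(fake_doc)
# the result survives a pickle round trip using only built-in types
restored = pickle.loads(pickle.dumps(result, -1))
```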

This setup is very similar to the HTTP protocol (server waits for connections, client connects, client sends a request, server sends a response, both sides disconnect). So you might do better to use a standard HTTP server and client instead of this custom code. That would be a "RESTful API", which is a popular term these days (with good reason). Using standard HTTP packages would save you the trouble of managing your own client/server code, and you might even be able to call your data-processing server directly from your existing web server instead of launching my_script.py. However, you will have to translate your request into something compatible with HTTP, e.g., a GET or POST request, or maybe just a specially formatted URL.
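
As a rough sketch of the HTTP variant, assuming Python 3 and only the standard library: the expensive object is built once at startup and every POST request reuses it. Here load_model and its whitespace tokenizer are placeholders for English(), and the handler name, the JSON shape and the demo helper are likewise illustrative.

```python
import http.client
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

def load_model():
    # placeholder for the slow step (e.g. nlp = English()); runs once
    return lambda text: text.split()

nlp = load_model()

class NLPHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # read the request body and run it through the preloaded model
        length = int(self.headers.get("Content-Length", 0))
        text = self.rfile.read(length).decode("utf-8")
        body = json.dumps({"tokens": nlp(text)}).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo quiet

def query(port, text):
    # plays the role of my_script.py: one POST, one JSON reply
    conn = http.client.HTTPConnection("127.0.0.1", port)
    try:
        conn.request("POST", "/", body=text.encode("utf-8"))
        return json.loads(conn.getresponse().read().decode("utf-8"))
    finally:
        conn.close()

def demo():
    server = HTTPServer(("127.0.0.1", 0), NLPHandler)  # port 0: any free port
    thread = threading.Thread(target=server.serve_forever)
    thread.start()
    try:
        return query(server.server_address[1], "keep the model loaded")
    finally:
        server.shutdown()
        thread.join()
        server.server_close()
```

In a real deployment the server would run as its own long-lived process, and the existing web server could call it directly instead of launching a script per request.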

Another option would be to use a standard interprocess communication package such as PyZMQ, redis, mpi4py or maybe zmq_object_exchanger. See this question for some ideas: Efficient Python IPC
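
Python's standard library also ships a small IPC layer, multiprocessing.connection. It is not one of the packages named above, but it handles pickling and message framing in the same spirit. A minimal sketch, assuming Python 3: the authkey value and the echo-style reply are placeholders, and the server runs in a thread here only so the example is self-contained.

```python
import threading
from multiprocessing.connection import Client, Listener

AUTHKEY = b"change-me"  # placeholder shared secret

def serve_once(listener):
    # accept one client, answer one request, then hang up
    with listener.accept() as conn:
        request = conn.recv()
        conn.send({"data received": request})

def demo():
    # port 0 asks the OS for a free port; listener.address reports it
    with Listener(("127.0.0.1", 0), authkey=AUTHKEY) as listener:
        server = threading.Thread(target=serve_once, args=(listener,))
        server.start()
        with Client(listener.address, authkey=AUTHKEY) as conn:
            conn.send(["some", "arguments"])
            reply = conn.recv()
        server.join()
    return reply
```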

Or you may be able to save a copy of the spacy.en object on disk using the dill package (https://pypi.python.org/pypi/dill) and then restore it at the start of my_script.py. That may be faster than importing/reconstructing it each time and simpler than using interprocess communication.
