Is it possible to keep spaCy in memory to reduce the load time?

Question

I want to use spaCy for NLP in an online service. Each time a user makes a request, I call the script "my_script.py", which begins with:

from spacy.en import English
nlp = English()

The problem is that those two lines take over 10 seconds to run. Is it possible to keep English() in RAM, or is there some other way to reduce the load time to under a second?

Recommended answer

You said that you want to launch a freestanding script (my_script.py) whenever a request comes in, and that the script should use capabilities from spacy.en without the overhead of loading spacy.en. With this approach, the operating system creates a new process every time you launch the script, so there is only one way to avoid loading spacy.en each time: keep a separate process running with spacy.en already loaded, and have your script communicate with that process. The code below shows a way to do that. However, as others have said, you will probably benefit from changing your server architecture so that spacy.en is loaded within your web server (e.g., using a Python-based web server).

The most common form of inter-process communication is via TCP/IP sockets. The code below implements a small server which keeps spacy.en loaded and processes requests from the client. It also has a client which transmits requests to that server and receives results back. It's up to you to decide what to put into those transmissions.

There is also a third script. Since both client and server need send and receive functions, those functions are in a shared script called comm.py. (Note that the client and server each load a separate copy of comm.py; they do not communicate through a single module loaded into shared memory.)

I assume both scripts are run on the same machine. If not, you will need to put a copy of comm.py on both machines and change comm.server_host to the machine name or IP address for the server.

Run nlp_server.py as a background process (or just in a different terminal window for testing). This waits for requests, processes them and sends the results back:

import comm
import socket
# spaCy 0.x import path, as in the question; newer releases load a model with spacy.load()
from spacy.en import English
nlp = English()  # loaded once, when the server starts

def process_connection(sock):
    print("processing transmission from client...")
    try:
        # receive data from the client
        data = comm.receive_data(sock)
        # do something with the data (this is where nlp would be used)
        result = {"data received": data}
        # send the result back to the client
        comm.send_data(result, sock)
    except EOFError:
        # the client went away before a full transmission arrived
        print("client disconnected early")
    finally:
        # close the socket with this particular client, even after an error
        sock.close()
    print("finished processing transmission from client...")

server_sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# open socket even if it was used recently (e.g., server restart)
server_sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
server_sock.bind((comm.server_host, comm.server_port))
# queue up to 5 connections
server_sock.listen(5)
print("listening on port {}...".format(comm.server_port))
try:
    while True:
        # accept connections from clients
        (client_sock, address) = server_sock.accept()
        # process this connection
        # (this could be launched in a separate thread or process)
        process_connection(client_sock)
except KeyboardInterrupt:
    print("Server process terminated.")
finally:
    server_sock.close()

Run my_script.py as a quick-running script to request a result from the NLP server (e.g., python my_script.py here are some arguments):

import socket, sys
import comm

# data can be whatever you want (even just sys.argv)
data = sys.argv

print("sending to server:")
print(data)

# send data to the server and receive a result
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# disable Nagle algorithm (probably only needed over a network)
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, True)
sock.connect((comm.server_host, comm.server_port))
comm.send_data(data, sock)
result = comm.receive_data(sock)
sock.close()

# do something with the result...
print("result from server:")
print(result)

comm.py contains code that is used by both the client and server:

import struct
try:
    import cPickle as pickle  # Python 2
except ImportError:
    import pickle  # Python 3

# pick a port that is not used by any other process
server_port = 17001
server_host = '127.0.0.1' # localhost
message_size = 8192
# code to use with struct.pack to convert transmission size (int)
# to a byte string
header_pack_code = '>I'
# number of bytes used to represent size of each transmission
# (corresponds to header_pack_code)
header_size = 4

def send_data(data_object, sock):
    # serialize the data so it can be sent through a socket
    data_string = pickle.dumps(data_object, -1)
    data_len = len(data_string)
    # send a header showing the length, packed into 4 bytes
    sock.sendall(struct.pack(header_pack_code, data_len))
    # send the data
    sock.sendall(data_string)

def receive_data(sock):
    """ Receive a transmission via a socket, and convert it back into a Python object. """
    # This runs as a loop because the message may be broken into arbitrary-size chunks.
    # This assumes each transmission starts with a 4-byte binary header showing the size of the transmission.
    # See https://docs.python.org/3/howto/sockets.html
    # and http://code.activestate.com/recipes/408859-socketrecv-three-ways-to-turn-it-into-recvall/

    header_data = b''
    header_done = False
    messages = []
    # set dummy values to start the loop
    received_len = 0
    transmission_size = float('inf')

    while received_len < transmission_size:
        sock_data = sock.recv(message_size)
        if not sock_data:
            # recv() returns an empty string once the peer has closed the
            # connection; without this check the loop would spin forever
            raise EOFError("connection closed before the full transmission arrived")
        if not header_done:
            # still receiving header info
            header_data += sock_data
            if len(header_data) >= header_size:
                header_done = True
                # split the already-received data between header and body
                messages = [header_data[header_size:]]
                received_len = len(messages[0])
                header_data = header_data[:header_size]
                # find actual size of transmission
                transmission_size = struct.unpack(header_pack_code, header_data)[0]
        else:
            # already receiving data
            received_len += len(sock_data)
            messages.append(sock_data)

    # combine messages into a single byte string
    data_string = b''.join(messages)
    # convert to an object
    data_object = pickle.loads(data_string)
    return data_object
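
The framing used by comm.py (a 4-byte big-endian length header followed by the pickled payload) can also be read with a helper that pulls an exact number of bytes from the socket, which avoids tracking header and body in one loop. The following is only a sketch for Python 3, assuming the same '>I' header; recv_exactly, send_msg, recv_msg and roundtrip are hypothetical names and are not part of comm.py:

```python
import pickle
import socket
import struct

HEADER = struct.Struct(">I")  # same 4-byte big-endian length header as comm.py


def recv_exactly(sock, n):
    """Read exactly n bytes from sock, or raise EOFError if the peer closes early."""
    chunks = []
    remaining = n
    while remaining > 0:
        chunk = sock.recv(min(remaining, 8192))
        if not chunk:
            raise EOFError("socket closed with {} bytes still expected".format(remaining))
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def send_msg(sock, obj):
    """Pickle obj and send it with a length header in front."""
    payload = pickle.dumps(obj, pickle.HIGHEST_PROTOCOL)
    sock.sendall(HEADER.pack(len(payload)) + payload)


def recv_msg(sock):
    """Read one length-prefixed message and unpickle it."""
    (size,) = HEADER.unpack(recv_exactly(sock, HEADER.size))
    return pickle.loads(recv_exactly(sock, size))


def roundtrip(obj):
    """Send a small obj through a connected socket pair and return what arrives."""
    left, right = socket.socketpair()
    try:
        send_msg(left, obj)
        return recv_msg(right)
    finally:
        left.close()
        right.close()
```

Calling roundtrip(["here", "are", "some", "arguments"]) should hand back an equal list. The single-threaded roundtrip helper is only meant for small test objects, since a payload larger than the socket buffer would block in sendall.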

Note: you should make sure the result sent from the server only uses native data structures (dicts, lists, strings, etc.). If the result includes an object defined in spacy.en, then the client will automatically import spacy.en when it unpacks the result, in order to provide the object's methods.
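
One way to follow that advice is to copy the fields you need out of the spaCy objects into plain dicts on the server before sending. The sketch below assumes the token attributes text, lemma_ and pos_ found in recent spaCy releases (older versions may differ); tokens_to_plain and FakeToken are hypothetical names, and FakeToken only stands in for real tokens so the example runs without spaCy installed:

```python
import json
from collections import namedtuple


def tokens_to_plain(tokens):
    """Convert token-like objects into a list of plain dicts of strings."""
    return [
        {"text": tok.text, "lemma": tok.lemma_, "pos": tok.pos_}
        for tok in tokens
    ]


# Stand-in for spaCy tokens (an assumption made so the sketch is self-contained).
FakeToken = namedtuple("FakeToken", ["text", "lemma_", "pos_"])

fake_doc = [FakeToken("Dogs", "dog", "NOUN"), FakeToken("bark", "bark", "VERB")]
plain = tokens_to_plain(fake_doc)

# Plain dicts and strings can be pickled, or even sent as JSON,
# and the client can decode them without importing spaCy.
encoded = json.dumps(plain)
```

On the server, the same function would be applied to the result of nlp(...) before calling comm.send_data.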

This setup is very similar to the HTTP protocol (server waits for connections, client connects, client sends a request, server sends a response, both sides disconnect). So you might do better to use a standard HTTP server and client instead of this custom code. That would be a "RESTful API", which is a popular term these days (with good reason). Using standard HTTP packages would save you the trouble of managing your own client/server code, and you might even be able to call your data-processing server directly from your existing web server instead of launching my_script.py. However, you will have to translate your request into something compatible with HTTP, e.g., a GET or POST request, or maybe just a specially formatted URL.
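
As a rough illustration of the HTTP variant, here is a sketch for Python 3 using only the standard library. load_model is a hypothetical stand-in for the slow English() call, and the handler name, the JSON response shape, and the use of a plain POST body are all assumptions rather than anything prescribed above:

```python
import http.client
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


def load_model():
    """Hypothetical stand-in for the slow English() load; runs once at startup."""
    return lambda text: {"tokens": text.split()}


nlp = load_model()  # loaded once, then reused by every request


class NLPHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # read the request body, run the already-loaded model, reply with JSON
        length = int(self.headers.get("Content-Length", 0))
        text = self.rfile.read(length).decode("utf-8")
        body = json.dumps(nlp(text)).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the sketch quiet


def query(port, text):
    """Client side: POST the text and decode the JSON reply."""
    conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
    try:
        conn.request("POST", "/", body=text.encode("utf-8"))
        return json.loads(conn.getresponse().read().decode("utf-8"))
    finally:
        conn.close()


def demo():
    """Start the server on a free port, send one request, shut down."""
    server = HTTPServer(("127.0.0.1", 0), NLPHandler)  # port 0: pick a free port
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        return query(server.server_address[1], "keep the model in memory")
    finally:
        server.shutdown()
        server.server_close()
```

In a real deployment the server would run as a long-lived process on a fixed port, and my_script.py (or the existing web server) would only need the few lines in query.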

Another option would be to use a standard interprocess communication package such as PyZMQ, redis, mpi4py or maybe zmq_object_exchanger. See this question for some ideas: Efficient Python IPC
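
Python's standard library also ships a small IPC helper, multiprocessing.connection, which is not one of the packages named above but handles the framing and pickling that comm.py does by hand. A sketch for Python 3, where AUTHKEY, serve_once, request and demo are hypothetical names and the server runs in a thread only to keep the example self-contained:

```python
import threading
from multiprocessing.connection import Client, Listener

AUTHKEY = b"change-me"  # hypothetical shared secret for both sides


def serve_once(listener, handler):
    """Accept one connection, answer one request, then close it."""
    conn = listener.accept()
    try:
        conn.send(handler(conn.recv()))
    finally:
        conn.close()


def request(address, data):
    """Client side: send one object and wait for the reply."""
    conn = Client(address, authkey=AUTHKEY)
    try:
        conn.send(data)
        return conn.recv()
    finally:
        conn.close()


def demo():
    listener = Listener(("127.0.0.1", 0), authkey=AUTHKEY)  # port 0: any free port
    worker = threading.Thread(
        target=serve_once, args=(listener, lambda d: {"data received": d})
    )
    worker.start()
    try:
        return request(listener.address, ["here", "are", "some", "arguments"])
    finally:
        worker.join()
        listener.close()
```

In practice the listener would live in the long-running process that has the model loaded, looping over accept(), and my_script.py would contain only the request part.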

Or you may be able to save a copy of the spacy.en object on disk using the dill package (https://pypi.python.org/pypi/dill) and then restore it at the start of my_script.py. That may be faster than importing/reconstructing it each time and simpler than using interprocess communication.
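
The save-and-restore idea can be sketched as a small cache helper. This example uses the standard pickle module so that it runs without extra packages; dill exposes dump and load functions with the same calling convention, so it could be swapped in. Whether a given spaCy object can be serialized this way, and whether loading it is actually faster, is an assumption that would need to be measured. load_or_build and demo are hypothetical names:

```python
import os
import pickle
import tempfile


def load_or_build(cache_path, build):
    """Return the cached object if cache_path exists, else build and cache it."""
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            return pickle.load(f)
    obj = build()
    with open(cache_path, "wb") as f:
        pickle.dump(obj, f, pickle.HIGHEST_PROTOCOL)
    return obj


def demo():
    """Call load_or_build twice and count how often the slow build runs."""
    calls = []

    def slow_build():
        # stand-in for an expensive constructor such as English()
        calls.append(1)
        return {"vocab": ["dog", "bark"]}

    path = os.path.join(tempfile.mkdtemp(), "model.pkl")
    first = load_or_build(path, slow_build)
    second = load_or_build(path, slow_build)
    return first, second, len(calls)
```

In demo, the second call reads the object back from disk, so the build function runs only once.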
