Can NLTK be used in a Postgres Python Stored Procedure


Question

Has anyone done this, or does anyone even know whether it's possible, to use NLTK within a Postgres Python stored procedure or trigger?

Recommended answer

You can use pretty much any Python library in a PL/Python stored procedure or trigger.

See the PL/Python documentation.

The crucial point to understand is that PL/Python is CPython (in PostgreSQL up to and including 9.3, anyway). It uses exactly the same interpreter as the normal standalone Python; it just loads it as a library into the PostgreSQL backend. With a few limitations (outlined below), if it works with CPython it works with PL/Python.

If you have multiple Python interpreters installed on your system - different versions, distributions, 32-bit vs 64-bit, etc - you might need to make sure you're installing extensions and libraries into the right one when running distutils scripts and the like, but that's about it.

Since you can load any library available to the system Python, there's no reason to think NLTK would be a problem unless you know it requires things like threading that aren't really recommended in a PostgreSQL backend. (Sure enough, I tried it and it "just worked"; see below.)

One possible concern is that the startup overhead of something like NLTK might be quite big. You probably want to preload PL/Python in the postmaster and import the module in your setup code so it's ready when backends start. Understand that the postmaster is the parent process that all the other backends fork() from, so if the postmaster preloads something it's available to the backends with greatly reduced overhead. Test performance either way.
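To get a feel for how much import overhead is at stake, a small timing helper can be run with the same Python interpreter that PL/Python uses. This is only a rough sketch: `timed_import` is a hypothetical helper name, and `json` is a stand-in module that is always available. It does not preload anything into the postmaster; it only shows that the cost of an import is paid once per interpreter and that later imports are served from the module cache.

```python
import importlib
import time


def timed_import(module_name):
    """Import `module_name` and return (module, seconds spent in the import call)."""
    start = time.perf_counter()
    module = importlib.import_module(module_name)
    return module, time.perf_counter() - start


if __name__ == "__main__":
    # "json" is a stand-in; substitute "nltk" on a machine where it is installed.
    _, first = timed_import("json")
    _, repeat = timed_import("json")  # answered from sys.modules, no real work
    print("first import:  %.6f s" % first)
    print("repeat import: %.6f s" % repeat)
```

Since the interpreter lives as long as the backend, the first call in each new session is the one that pays the full import cost unless the module was already loaded before the fork.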

Because you can load arbitrary C libraries via PL/Python and because the Python interpreter has no real security model, plpythonu is an "untrusted" language. Scripts have full and unrestricted access to the system as the postgres user and can fairly simply bypass access controls in PostgreSQL. For obvious security reasons this means that PL/Python functions and triggers may only be created by the superuser, though it's quite reasonable to GRANT normal users the ability to run carefully written functions that were installed by the superuser.

The upside is that you can do pretty much anything you can do in normal Python, keeping in mind that the Python interpreter's lifetime is that of the database connection (session). Threading isn't recommended, but most other things are fine.

PL/Python functions must be written with careful input sanitization, must set search_path when invoking the SPI to run queries, etc. This is discussed more in the manual.
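As a rough illustration of careful input handling, the sketch below passes the user-supplied value as a query parameter through `plpy.prepare` and `plpy.execute` instead of pasting it into the SQL string. The table `public.docs`, its `title` column and the function name are hypothetical. Inside a real PL/Python function `plpy` is already available as a global; here it is passed in as an argument only so the logic can be exercised outside the database.

```python
def count_docs_with_title(plpy, title):
    """Body of a hypothetical PL/Python function: count rows with a given title.

    `plpy` is the module PostgreSQL provides inside PL/Python functions.
    """
    # The table name is schema-qualified so the lookup does not depend on the
    # caller's search_path. Another option is to create the function with a
    # fixed setting, e.g. CREATE FUNCTION ... SET search_path = ...
    plan = plpy.prepare(
        "SELECT count(*) AS n FROM public.docs WHERE title = $1",
        ["text"],
    )
    # The value travels as a parameter and is never interpolated into the SQL.
    rows = plpy.execute(plan, [title])
    return rows[0]["n"]
```

Because the title is bound to `$1`, quotes or SQL fragments inside it are treated as data and not as part of the statement.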

Long-running or potentially problematic things like DNS lookups, HTTP connections to remote systems, SMTP mail delivery, etc should generally be done from a helper script using LISTEN and NOTIFY rather than as an in-backend job, in order to preserve PostgreSQL's performance and avoid hampering VACUUM with lots of long transactions. You can do these things in the backend; it just isn't a great idea.
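A helper script of that kind might look roughly like the sketch below. It runs as a separate process outside the database, so using the psycopg2 client library here is fine. The channel name `slow_jobs`, the JSON payload with a `url` key, and the function names are all assumptions made for illustration; the wait loop follows the pattern from the psycopg2 documentation for asynchronous notifications.

```python
import json
import select


def parse_job(payload):
    """Decode a NOTIFY payload, assumed here to be a JSON object with a "url" key."""
    job = json.loads(payload)
    if not isinstance(job, dict) or "url" not in job:
        raise ValueError("unexpected payload: %r" % (payload,))
    return job


def listen_for_jobs(dsn, channel="slow_jobs", handle=print):
    """Block forever, passing each notified job to `handle` outside any backend."""
    # Third-party client library, imported lazily; it belongs in this helper
    # process and never inside a PL/Python function.
    import psycopg2
    import psycopg2.extensions

    conn = psycopg2.connect(dsn)
    # Autocommit so the listener never holds a long transaction open.
    conn.set_isolation_level(psycopg2.extensions.ISOLATION_LEVEL_AUTOCOMMIT)
    cur = conn.cursor()
    cur.execute("LISTEN " + channel)  # channel is a trusted constant, not user input
    while True:
        # Wait up to 5 seconds for the connection's socket to become readable.
        if select.select([conn], [], [], 5) == ([], [], []):
            continue
        conn.poll()
        while conn.notifies:
            note = conn.notifies.pop(0)
            handle(parse_job(note.payload))
```

On the database side, a trigger or function would only need to issue something like `NOTIFY slow_jobs, '{"url": "..."}'` (or call `pg_notify`) and return immediately, leaving the slow work to the helper.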

You should avoid creating threads within the PostgreSQL backend.

Don't attempt to load any Python library that will load the libpq C library. This could cause all sorts of exciting problems with the backend. When talking to PostgreSQL from PL/Python, use the SPI routines, not a regular client library.

Don't do very long-running things in the backend; you'll cause vacuum problems.

Don't load anything that might load a different version of an already loaded native C library - say a different libcrypto, libssl, etc.

Don't write directly to files in the PostgreSQL data directory, ever.

PL/Python functions run as the postgres system user on the OS, so they don't have access to things like the user's home directory or files on the client side of the connection.

$ yum install python-nltk
$ psql -U postgres regress

regress=# CREATE LANGUAGE plpythonu;

regress=# CREATE OR REPLACE FUNCTION nltk_word_tokenize(word text) RETURNS text[] AS $$
          import nltk
          return nltk.word_tokenize(word)
          $$ LANGUAGE plpythonu;

regress=# SELECT nltk_word_tokenize('This is a test, it''s going to work fine');
              nltk_word_tokenize               
-----------------------------------------------
 {This,is,a,test,",",it,'s,going,to,work,fine}
(1 row)

So, as I said: try it. So long as the Python interpreter PostgreSQL is using for plpython has nltk and its dependencies installed, it will work fine.

PL/Python is CPython, but I'd love to see a PyPy-based alternative that can run untrusted code using PyPy's sandbox features.
