Paths in AWS Lambda with Python NLTK

Question

I'm encountering problems with the NLTK package in AWS Lambda. However, I believe the issue is more related to incorrect path configuration in Lambda. NLTK is having trouble finding data libraries that are stored locally and are not part of the module install. Many of the solutions listed on SO are simple path configs, as can be found here, but I think this issue is related to pathing in Lambda:

How to config nltk data directory from code?

What to download in order to make nltk.tokenize.word_tokenize work?

I should also mention that this relates to a previous question I posted here: Using NLTK corpora with AWS Lambda functions in Python

but the issue seems more general, so I have elected to redefine the question as it relates to how to correctly configure path environments in Lambda to work with modules that require external libraries like NLTK. NLTK stores a lot of its data in a local nltk_data folder; however, even when this folder is included in the Lambda zip for upload, NLTK doesn't seem to find it.
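As a hedged sketch (this is not the asker's code, and the variable names are assumptions): the Lambda deployment zip is unpacked under /var/task, which the runtime also exposes through the LAMBDA_TASK_ROOT environment variable, so a bundled nltk_data folder is usually addressed by its root directory rather than by its subfolders:

```python
import os

# Sketch: Lambda unpacks the deployment zip under /var/task and exposes
# that directory via the LAMBDA_TASK_ROOT environment variable.
task_root = os.environ.get("LAMBDA_TASK_ROOT", "/var/task")

# nltk.data.path entries should point at the nltk_data root, not at
# subdirectories such as tokenizers/punkt.
nltk_data_dir = os.path.join(task_root, "nltk_data")

# import nltk
# nltk.data.path.append(nltk_data_dir)
print(nltk_data_dir)
```

NLTK then resolves resource names like `tokenizers/punkt/english.pickle` relative to each entry in `nltk.data.path`, which is why appending the subfolders themselves does not help.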

Also included in the Lambda func zip file are the following files and dirs:

\nltk_data\taggers\averaged_perceptron_tagger\averaged_perceptron_tagger.pickle
\nltk_data\tokenizers\punkt\english.pickle
\nltk_data\tokenizers\punkt\PY3\english.pickle

From the following site, it seems that /var/task/ is the folder in which the Lambda function executes, and I have tried including this path to no avail: https://alestic.com/2014/11/aws-lambda-environment/

From the docs it also seems that there are a number of environment variables that can be used; however, I'm not sure how to include them in a Python script (coming from Windows, not Linux): http://docs.aws.amazon.com/lambda/latest/dg/current-supported-versions.html
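Reading those variables from Python works the same on Windows and Linux via os.environ. A hedged sketch (the /var/task/nltk_data location is an assumption based on this question's setup): NLTK also honors an NLTK_DATA environment variable, provided it is set before nltk is imported:

```python
import os

# Environment variables are read and written with os.environ on any OS.
task_root = os.environ.get("LAMBDA_TASK_ROOT", "/var/task")

# NLTK honors an NLTK_DATA variable; it must be set before `import nltk`
# so that nltk.data picks it up. The path below is assumed, per this
# question's zip layout.
os.environ["NLTK_DATA"] = os.path.join(task_root, "nltk_data")
# import nltk  # import only after NLTK_DATA is set
```

This avoids hard-coding paths inside the handler, though as the accepted answer below-the-fold notes, results inside Lambda may vary.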

I'm throwing this up here in case anyone has experience configuring Lambda paths. Despite searching, I haven't seen many questions relating to this specific issue, so I'm hoping it could be useful to resolve it.

The code is here:

import nltk
import pymysql.cursors
import re
import rds_config
import logging
from boto_conn import botoConn
from warnings import filterwarnings
from nltk import word_tokenize

nltk.data.path.append("/nltk_data/tokenizers/punkt")
nltk.data.path.append("/nltk_data/taggers/averaged_perceptron_tagger")

logger = logging.getLogger()

logger.setLevel(logging.INFO)

rds_host = "nodexrd2.cw7jbiq3uokf.ap-southeast-2.rds.amazonaws.com"
name = rds_config.db_username
password = rds_config.db_password
db_name = rds_config.db_name

filterwarnings("ignore", category=pymysql.Warning)


def parse():

    tknzr = word_tokenize

    stopwords = ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself','yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself', 'it', 'its', 'itself',
                 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that','these','those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do',
                 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of','at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above',
                 'below','to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then','once', 'here','there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other',
                 'some', 'such','no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will','just', 'don', 'should','now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', 'couldn', 'didn', 'doesn', 'hadn', 'hasn',
                 'haven', 'isn', 'ma','mightn', 'mustn', 'needn', 'shan', 'shouldn', 'wasn', 'weren', 'won', 'wouldn']

    s3file = botoConn(None, 1).getvalue()
    db = pymysql.connect(rds_host, user=name, passwd=password, db=db_name, connect_timeout=5, charset='utf8mb4', cursorclass=pymysql.cursors.DictCursor)
    lines = s3file.split('\n')

    for line in lines:

        tkn = tknzr(line)
        tagged = nltk.pos_tag(tkn)

        excl = ['the', 'and', 'of', 'at', 'what', 'to', 'it', 'a', 'of', 'i', 's', 't', 'is', 'I\'m', 'Im', 'U', 'RT', 'RTs', 'its']  # Arg

        x = [i for i in tagged if i[0] not in stopwords]
        x = [i for i in x if i[0] not in excl]
        x = [i for i in x if len(i[0]) > 1]
        x = [i for i in x if 'https' not in i[0]]
        x = [i for i in x if i[1] == 'NNP' or i[1] == 'VB' or i[1] == 'NN']
        x = [(re.sub(r'[^A-Za-z0-9]+' + '()', r'', i[0])) for i in x]
        sql_dat_a, sql_dat = [], []

The output log is here:

   **********************************************************************
  Resource u'tokenizers/punkt/english.pickle' not found.  Please
  use the NLTK Downloader to obtain the resource:  >>>
  nltk.download()
  Searched in:
    - '/home/sbx_user1067/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
    - '/nltk_data/tokenizers/punkt'
    - '/nltk_data/taggers/averaged_perceptron_tagger'
    - u''
**********************************************************************: LookupError
Traceback (most recent call last):
  File "/var/task/Tweetscrape_Timer.py", line 27, in schedule
    server()
  File "/var/task/Tweetscrape_Timer.py", line 14, in server
    parse()
  File "/var/task/parse_to_SQL.py", line 91, in parse
    tkn = tknzr(line)
  File "/var/task/nltk/tokenize/__init__.py", line 109, in word_tokenize
    return [token for sent in sent_tokenize(text, language)
  File "/var/task/nltk/tokenize/__init__.py", line 93, in sent_tokenize
    tokenizer = load('tokenizers/punkt/{0}.pickle'.format(language))
  File "/var/task/nltk/data.py", line 808, in load
    opened_resource = _open(resource_url)
  File "/var/task/nltk/data.py", line 926, in _open
    return find(path_, path + ['']).open()
  File "/var/task/nltk/data.py", line 648, in find
    raise LookupError(resource_not_found)
LookupError: 
**********************************************************************
  Resource u'tokenizers/punkt/english.pickle' not found.  Please
  use the NLTK Downloader to obtain the resource:  >>>
  nltk.download()
  Searched in:
    - '/home/sbx_user1067/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
    - '/nltk_data/tokenizers/punkt'
    - '/nltk_data/taggers/averaged_perceptron_tagger'
    - u''
**********************************************************************

Answer

So I've found the answer to this question. After a couple of days of messing around, I finally figured it out. The data.py file in the nltk folder needs to be modified as follows: basically, remove the /usr/... paths, add the folder that Lambda executes from, /var/task/, and ensure that your nltk_data folder is in the root of your execution zip.

Not sure why, but the inline nltk.data.path.append() method did not work with AWS Lambda here, and the data.py file needed to be modified directly.

else:
    # Common locations on UNIX & OS X:
    path += [
        str('/var/task/nltk_data')
        #str('/usr/share/nltk_data'),
        #str('/usr/local/share/nltk_data'),
        #str('/usr/lib/nltk_data'),
        #str('/usr/local/lib/nltk_data')
    ]
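With nltk_data at the zip root, a quick sanity check (a sketch, reusing the relative paths from the question's zip listing) can confirm the resources sit where the traceback expects them once the package unpacks to /var/task:

```python
import os

# Relative paths taken from the question's zip listing; they should exist
# under /var/task once the deployment package is unpacked there.
root = "/var/task"
expected = [
    "nltk_data/tokenizers/punkt/english.pickle",
    "nltk_data/tokenizers/punkt/PY3/english.pickle",
    "nltk_data/taggers/averaged_perceptron_tagger/averaged_perceptron_tagger.pickle",
]
for rel in expected:
    full = os.path.join(root, rel)
    # os.path.exists is True only inside the Lambda container, of course.
    print(full, os.path.exists(full))
```

Logging these checks at cold start is a cheap way to tell a packaging problem apart from a search-path problem.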
