Paths in AWS Lambda with Python NLTK

Question

I'm encountering problems with the NLTK package in AWS Lambda. However, I believe the issue is more related to incorrect path configuration in Lambda. NLTK is having trouble finding data libraries that are stored locally and are not part of the module install. Many of the solutions listed on SO are simple path configs, as can be found here, but I think this issue is related to pathing in Lambda:

How to config nltk data directory from code?

What to download in order to make nltk.tokenize.word_tokenize work?

I should also mention that this relates to a previous question I posted here: Using NLTK corpora with AWS Lambda functions in Python

but the issue seems more general, so I have elected to redefine the question as it relates to how to correctly configure path environments in Lambda to work with modules that require external libraries like NLTK. NLTK stores a lot of its data in a local nltk_data folder; however, even when this folder is included in the Lambda zip for upload, NLTK doesn't seem to find it.
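As a hedged sketch (this is not the asker's code, and the variable names are assumptions): the Lambda deployment zip is unpacked under /var/task, which the runtime also exposes through the LAMBDA_TASK_ROOT environment variable, so a bundled nltk_data folder is usually addressed by its root directory rather than by its subfolders:

```python
import os

# Sketch: Lambda unpacks the deployment zip under /var/task and exposes
# that directory via the LAMBDA_TASK_ROOT environment variable.
task_root = os.environ.get("LAMBDA_TASK_ROOT", "/var/task")

# nltk.data.path entries should point at the nltk_data root, not at
# subdirectories such as tokenizers/punkt.
nltk_data_dir = os.path.join(task_root, "nltk_data")

# import nltk
# nltk.data.path.append(nltk_data_dir)
print(nltk_data_dir)
```

NLTK then resolves resource names like `tokenizers/punkt/english.pickle` relative to each entry in `nltk.data.path`, which is why appending the subfolders themselves does not help.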

Also included in the Lambda func zip file are the following files and dirs:

\nltk_data\taggers\averaged_perceptron_tagger\averaged_perceptron_tagger.pickle
\nltk_data\tokenizers\punkt\english.pickle
\nltk_data\tokenizers\punkt\PY3\english.pickle

From the following site, it seems that /var/task/ is the folder in which the Lambda function executes, and I have tried including this path to no avail: https://alestic.com/2014/11/aws-lambda-environment/

From the docs it also seems that there are a number of environment variables that can be used; however, I'm not sure how to include them in a Python script (coming from Windows, not Linux): http://docs.aws.amazon.com/lambda/latest/dg/current-supported-versions.html
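Reading those variables from Python works the same on Windows and Linux via os.environ. A hedged sketch (the /var/task/nltk_data location is an assumption based on this question's setup): NLTK also honors an NLTK_DATA environment variable, provided it is set before nltk is imported:

```python
import os

# Environment variables are read and written with os.environ on any OS.
task_root = os.environ.get("LAMBDA_TASK_ROOT", "/var/task")

# NLTK honors an NLTK_DATA variable; it must be set before `import nltk`
# so that nltk.data picks it up. The path below is assumed, per this
# question's zip layout.
os.environ["NLTK_DATA"] = os.path.join(task_root, "nltk_data")
# import nltk  # import only after NLTK_DATA is set
```

This avoids hard-coding paths inside the handler, though as the accepted answer below-the-fold notes, results inside Lambda may vary.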

I'm throwing this up here in case anyone has experience configuring Lambda paths. Despite searching, I haven't seen many questions relating to this specific issue, so I'm hoping it could be useful to resolve it.

The code is here:

import nltk
import pymysql.cursors
import re
import rds_config
import logging
from boto_conn import botoConn
from warnings import filterwarnings
from nltk import word_tokenize

nltk.data.path.append("/nltk_data/tokenizers/punkt")
nltk.data.path.append("/nltk_data/taggers/averaged_perceptron_tagger")

logger = logging.getLogger()

logger.setLevel(logging.INFO)

rds_host = "nodexrd2.cw7jbiq3uokf.ap-southeast-2.rds.amazonaws.com"
name = rds_config.db_username
password = rds_config.db_password
db_name = rds_config.db_name

filterwarnings("ignore", category=pymysql.Warning)


def parse():

    tknzr = word_tokenize

    stopwords = ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself','yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself', 'it', 'its', 'itself',
                 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that','these','those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do',
                 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of','at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above',
                 'below','to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then','once', 'here','there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other',
                 'some', 'such','no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will','just', 'don', 'should','now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', 'couldn', 'didn', 'doesn', 'hadn', 'hasn',
                 'haven', 'isn', 'ma','mightn', 'mustn', 'needn', 'shan', 'shouldn', 'wasn', 'weren', 'won', 'wouldn']

    s3file = botoConn(None, 1).getvalue()
    db = pymysql.connect(rds_host, user=name, passwd=password, db=db_name, connect_timeout=5, charset='utf8mb4', cursorclass=pymysql.cursors.DictCursor)
    lines = s3file.split('\n')

    for line in lines:

        tkn = tknzr(line)
        tagged = nltk.pos_tag(tkn)

        excl = ['the', 'and', 'of', 'at', 'what', 'to', 'it', 'a', 'of', 'i', 's', 't', 'is', 'I\'m', 'Im', 'U', 'RT', 'RTs', 'its']  # Arg

        x = [i for i in tagged if i[0] not in stopwords]
        x = [i for i in x if i[0] not in excl]
        x = [i for i in x if len(i[0]) > 1]
        x = [i for i in x if 'https' not in i[0]]
        x = [i for i in x if i[1] == 'NNP' or i[1] == 'VB' or i[1] == 'NN']
        x = [(re.sub(r'[^A-Za-z0-9]+' + '()', r'', i[0])) for i in x]
        sql_dat_a, sql_dat = [], []

The output log is here:

   **********************************************************************
  Resource u'tokenizers/punkt/english.pickle' not found.  Please
  use the NLTK Downloader to obtain the resource:  >>>
  nltk.download()
  Searched in:
    - '/home/sbx_user1067/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
    - '/nltk_data/tokenizers/punkt'
    - '/nltk_data/taggers/averaged_perceptron_tagger'
    - u''
**********************************************************************: LookupError
Traceback (most recent call last):
  File "/var/task/Tweetscrape_Timer.py", line 27, in schedule
    server()
  File "/var/task/Tweetscrape_Timer.py", line 14, in server
    parse()
  File "/var/task/parse_to_SQL.py", line 91, in parse
    tkn = tknzr(line)
  File "/var/task/nltk/tokenize/__init__.py", line 109, in word_tokenize
    return [token for sent in sent_tokenize(text, language)
  File "/var/task/nltk/tokenize/__init__.py", line 93, in sent_tokenize
    tokenizer = load('tokenizers/punkt/{0}.pickle'.format(language))
  File "/var/task/nltk/data.py", line 808, in load
    opened_resource = _open(resource_url)
  File "/var/task/nltk/data.py", line 926, in _open
    return find(path_, path + ['']).open()
  File "/var/task/nltk/data.py", line 648, in find
    raise LookupError(resource_not_found)
LookupError: 
**********************************************************************
  Resource u'tokenizers/punkt/english.pickle' not found.  Please
  use the NLTK Downloader to obtain the resource:  >>>
  nltk.download()
  Searched in:
    - '/home/sbx_user1067/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
    - '/nltk_data/tokenizers/punkt'
    - '/nltk_data/taggers/averaged_perceptron_tagger'
    - u''
**********************************************************************

Answer

So I've found the answer to this question. After a couple of days of messing around, I finally figured it out. The data.py file in the nltk folder needs to be modified as follows: basically, remove the /usr/... paths, add the folder that Lambda executes from, /var/task/, and ensure that your nltk_data folder is in the root of your execution zip.

Not sure why, but the inline nltk.data.path.append() method did not work with AWS Lambda here, and the data.py file needed to be modified directly.

else:
    # Common locations on UNIX & OS X:
    path += [
        str('/var/task/nltk_data')
        #str('/usr/share/nltk_data'),
        #str('/usr/local/share/nltk_data'),
        #str('/usr/lib/nltk_data'),
        #str('/usr/local/lib/nltk_data')
    ]
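With nltk_data at the zip root, a quick sanity check (a sketch, reusing the relative paths from the question's zip listing) can confirm the resources sit where the traceback expects them once the package unpacks to /var/task:

```python
import os

# Relative paths taken from the question's zip listing; they should exist
# under /var/task once the deployment package is unpacked there.
root = "/var/task"
expected = [
    "nltk_data/tokenizers/punkt/english.pickle",
    "nltk_data/tokenizers/punkt/PY3/english.pickle",
    "nltk_data/taggers/averaged_perceptron_tagger/averaged_perceptron_tagger.pickle",
]
for rel in expected:
    full = os.path.join(root, rel)
    # os.path.exists is True only inside the Lambda container, of course.
    print(full, os.path.exists(full))
```

Logging these checks at cold start is a cheap way to tell a packaging problem apart from a search-path problem.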
