How to save Python NLTK alignment models for later use?


Question

In Python, I'm using NLTK's alignment module to create word alignments between parallel texts. Aligning bitexts can be a time-consuming process, especially when done over considerable corpora. It would be nice to do alignments in batch one day and use those alignments later on.

from nltk import IBMModel1 as ibm
biverses = [...]  # placeholder for a list of AlignedSent objects
model = ibm(biverses, 20)

with open(path + "eng-taq_model.txt", 'w') as f:
    f.write(model.train(biverses, 20))  # makes empty file

Once I create a model, how can I (1) save it to disk and (2) reuse it later?

Recommended answer

The immediate answer is to pickle it, see https://wiki.python.org/moin/UsingPickle

But because the trained IBMModel1 object holds a reference to a lambda function, it cannot be pickled with the default pickle / cPickle (see https://github.com/nltk/nltk/blob/develop/nltk/align/ibm1.py#L74 and https://github.com/nltk/nltk/blob/develop/nltk/align/ibm1.py#L104).
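
The failure can be reproduced without NLTK. The sketch below is only an illustration: ToyModel is a hypothetical stand-in, and it assumes the lambda is used as the default factory of a defaultdict, which is a guess about the model's internals rather than a statement about NLTK's code.

```python
import pickle
from collections import defaultdict


class ToyModel:
    """Hypothetical stand-in for a trained model (not NLTK's class)."""

    def __init__(self):
        # Assumption: the model keeps its probabilities in a defaultdict
        # whose default factory is a lambda.
        self.probabilities = defaultdict(lambda: 1e-12)
        self.probabilities[("house", "Haus")] = 0.9


def try_pickle(obj):
    """Return None if obj pickles cleanly, otherwise the exception raised."""
    try:
        pickle.dumps(obj)
        return None
    except (pickle.PicklingError, AttributeError, TypeError) as exc:
        return exc


# The lambda has no importable name, so the standard pickler gives up on it.
error = try_pickle(ToyModel())
print(type(error).__name__, error)

# Plain data without a lambda pickles without complaint.
print(try_pickle({("house", "Haus"): 0.9}))
```

The exact exception type and message differ between Python versions, which is why the helper catches several of them.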

So we'll use dill instead. First, install dill (see the related question "Can Python pickle lambda functions?"):

$ pip install dill
$ python
>>> import dill as pickle

Then:

>>> import dill as pickle
>>> from nltk.corpus import comtrans
>>> from nltk.align import IBMModel1
>>> bitexts = comtrans.aligned_sents()[:100]
>>> ibm = IBMModel1(bitexts, 20)
>>> with open('model1.pk', 'wb') as fout:
...     pickle.dump(ibm, fout)
...
>>> exit()

To use the pickled model:

>>> import dill as pickle
>>> from nltk.corpus import comtrans
>>> bitexts = comtrans.aligned_sents()[:100]
>>> with open('model1.pk', 'rb') as fin:
...     ibm = pickle.load(fin)
... 
>>> aligned_sent = ibm.align(bitexts[0])
>>> aligned_sent.words
['Wiederaufnahme', 'der', 'Sitzungsperiode']
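
The two sessions above can be wrapped into a pair of helpers. This is a minimal sketch, not part of NLTK or dill: save_model and load_model are hypothetical names, and a plain dictionary stands in for the trained model so the snippet runs without a corpus. It falls back to the standard pickle module when dill is not installed, in which case objects that hold lambdas will still fail to serialize.

```python
import os
import tempfile

try:
    import dill as pickler  # third-party; can serialize lambdas
except ImportError:
    import pickle as pickler  # fallback; cannot serialize lambdas


def save_model(model, path):
    """Serialize a model object to a binary file (hypothetical helper)."""
    with open(path, "wb") as fout:
        pickler.dump(model, fout)


def load_model(path):
    """Load a model object written by save_model (hypothetical helper)."""
    with open(path, "rb") as fin:
        return pickler.load(fin)


# Stand-in for a trained model; with NLTK this would be the IBMModel1 object.
toy_model = {"translation_table": {("house", "Haus"): 0.9}}

model_path = os.path.join(tempfile.mkdtemp(), "model1.pk")
save_model(toy_model, model_path)
restored = load_model(model_path)
print(restored == toy_model)
```

As with any pickle-based format, only load files from sources you trust, because unpickling can execute arbitrary code.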






If you try to pickle the IBMModel1 object, which contains a lambda function, with the standard pickler, you'll end up with this:

>>> import cPickle as pickle
>>> from nltk.corpus import comtrans
>>> from nltk.align import IBMModel1
>>> bitexts = comtrans.aligned_sents()[:100]
>>> ibm = IBMModel1(bitexts, 20)
>>> with open('model1.pk', 'wb') as fout:
...     pickle.dump(ibm, fout)
... 
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/usr/lib/python2.7/copy_reg.py", line 70, in _reduce_ex
    raise TypeError, "can't pickle %s objects" % base.__name__
TypeError: can't pickle function objects

(Note: the above code snippet comes from NLTK version 3.0.0)

In Python 3 with NLTK 3.0.0, the standard pickle module fails for the same reason (the IBMModel1 object holds a lambda function), while dill works:

alvas@ubi:~$ python3
Python 3.4.0 (default, Apr 11 2014, 13:05:11) 
[GCC 4.8.2] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pickle
>>> from nltk.corpus import comtrans
>>> from nltk.align import IBMModel1
>>> bitexts = comtrans.aligned_sents()[:100]
>>> ibm = IBMModel1(bitexts, 20)
>>> with open('model1.pk', 'wb') as fout:
...     pickle.dump(ibm, fout)
... 
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
_pickle.PicklingError: Can't pickle <function IBMModel1.train.<locals>.<lambda> at 0x7fa37cf9d620>: attribute lookup <lambda> on nltk.align.ibm1 failed'

>>> import dill
>>> with open('model1.pk', 'wb') as fout:
...     dill.dump(ibm, fout)
... 
>>> exit()

alvas@ubi:~$ python3
Python 3.4.0 (default, Apr 11 2014, 13:05:11) 
[GCC 4.8.2] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import dill
>>> from nltk.corpus import comtrans
>>> with open('model1.pk', 'rb') as fin:
...     ibm = dill.load(fin)
... 
>>> bitexts = comtrans.aligned_sents()[:100]
>>> aligned_sent = ibm.aligned(bitexts[0])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'IBMModel1' object has no attribute 'aligned'
>>> aligned_sent = ibm.align(bitexts[0])
>>> aligned_sent.words
['Wiederaufnahme', 'der', 'Sitzungsperiode']

(Note: In Python 3 there is no separate cPickle module; pickle automatically uses its C implementation when available, see http://docs.pythonsprints.com/python3_porting/py-porting.html)
