Extracting tuples with nltk?


Question

Reading the nltk documentation, I found that it is possible to extract tuples with str2tuple(). For instance, assume I have the following sentence (the real file is clearly much larger):

sent = "pero pero CC " \
        "tan tan RG " \
        "antigua antiguo AQ0FS0 " \
        "que que CS " \
        "según según SPS00 " \
        "mi mi  DP1CSS " \
        "madre madre NCFS000"

I would like to extract a list of tuples, e.g.:

> ([antigua, AQ0FS0],[madre, NCFS000])

That is, the words carrying the feminine adjective tag (AQ0FS0) and the feminine noun tag (NCFS000). Is this possible with str2tuple(), or would a regular expression be a better approach?

This is what I tried:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import nltk as nl

sent = "pero pero CC " \
              "tan tan RG " \
              "antigua antiguo AQ0FS0 " \
              "que que CS " \
              "según según SPS00 " \
              "mi mi  DP1CSS " \
              "madre madre NCFS000"

tuples = [nl.tag.str2tuple(t) for t in sent.split()]

Recommended answer

Since you're presumably interested in using your corpus with NLTK: assuming your file is stored in this format, you should read it in, parse it (using str2tuple or a simpler method), and load it with TaggedCorpusReader. You can then use all the standard NLTK corpus functions with it. You basically have two types of tags: part of speech and (presumably) word lemma. If this is what you're after, I can add more specific information to this answer.
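As a rough sketch of that workflow (not a definitive recipe): str2tuple() and TaggedCorpusReader both expect tokens of the form word/TAG, with "/" as the default separator, so the triples have to be rewritten first. The temporary directory and the file name sample.pos are made up for illustration, and this sketch simply drops the lemma column, which may not be what you want.

```python
import os
import tempfile

from nltk.corpus.reader import TaggedCorpusReader

raw = """pero pero CC
tan tan RG
antigua antiguo AQ0FS0
que que CS
según según SPS00
mi mi DP1CSS
madre madre NCFS000"""

# Rewrite each "word lemma TAG" triple as "word/TAG" (the lemma is dropped here).
tokens = []
for line in raw.splitlines():
    word, lemma, tag = line.split()
    tokens.append(f"{word}/{tag}")

# Hypothetical corpus location: a temporary directory holding one file.
corpus_dir = tempfile.mkdtemp()
with open(os.path.join(corpus_dir, "sample.pos"), "w", encoding="utf-8") as f:
    f.write(" ".join(tokens) + "\n")

# Load every .pos file in that directory as a tagged corpus.
reader = TaggedCorpusReader(corpus_dir, r".*\.pos", encoding="utf-8")
tagged = list(reader.tagged_words())
print(tagged)
```

Each element of tagged is a (word, tag) tuple, so the usual corpus methods such as tagged_sents() are available on reader as well.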

Assuming your string actually includes a newline after each triple, the easy way to parse it into a list of tuples is this:

sent = """pero pero CC
tan tan RG
antigua antiguo AQ0FS0
que que CS
según según SPS00
mi  mi DP1CSS
madre madre NCFS000"""

tuples = [ line.split() for line in sent.splitlines() ]

A detail: split() actually returns a list, not a tuple. If you need to use the results as dictionary keys, replace line.split() with tuple(line.split()).
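To get from the parsed triples to the output asked for in the question, one option is to filter on the tag column. This is a minimal sketch in plain Python, assuming every line has exactly three whitespace-separated fields (word, lemma, tag); the names wanted_tags and pairs are illustrative only.

```python
sent = """pero pero CC
tan tan RG
antigua antiguo AQ0FS0
que que CS
según según SPS00
mi  mi DP1CSS
madre madre NCFS000"""

# The two tags from the question: feminine adjective and feminine noun.
wanted_tags = {"AQ0FS0", "NCFS000"}

# Keep (word, tag) for every line whose tag is one of the wanted ones.
pairs = []
for line in sent.splitlines():
    word, lemma, tag = line.split()  # assumes exactly three fields per line
    if tag in wanted_tags:
        pairs.append((word, tag))

print(pairs)  # [('antigua', 'AQ0FS0'), ('madre', 'NCFS000')]
```

Neither str2tuple() nor a regular expression is needed for this. If the text has no newlines, as in the question, the same idea works after grouping the result of sent.split() into chunks of three.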
