Converting plural to singular in a text file with Python


Question

I have txt files that look like this:

word, 23
Words, 2
test, 1
tests, 4

And I want them to look like this:

word, 23
word, 2
test, 1
test, 4

I want to be able to take a txt file in Python and convert plural words to singular. Here's my code:

import nltk

f = raw_input("Please enter a filename: ")

def openfile(f):
    with open(f,'r') as a:
       a = a.read()
       a = a.lower()
       return a

def stem(a):
    p = nltk.PorterStemmer()
    [p.stem(word) for word in a]
    return a

def returnfile(f, a):
    with open(f,'w') as d:
        d = d.write(a)
    #d.close()

print openfile(f)
print stem(openfile(f))
print returnfile(f, stem(openfile(f)))

I have also tried these 2 definitions instead of the stem definition:

def singular(a):
    for line in a:
        line = line[0]
        line = str(line)
        stemmer = nltk.PorterStemmer()
        line = stemmer.stem(line)
        return line

def stem(a):
    for word in a:
        for suffix in ['s']:
            if word.endswith(suffix):
                return word[:-len(suffix)]
            return word

Afterwards I'd like to take duplicate words (e.g. test and test) and merge them by adding up the numbers next to them. For example:

word, 25
test, 5

I'm not sure how to do that. A solution would be nice but not necessary.

Recommended answer

It seems like you're pretty familiar with Python, but I'll still try to explain some of the steps. Let's start with the first question of depluralizing words. When you read in a multiline file (the word, number csv in your case) with a.read(), you're going to be reading the entire body of the file into one big string.

def openfile(f):
    with open(f,'r') as a:
        a = a.read() # a will equal 'soc, 32\nsoc, 1\n...' in your example
        a = a.lower()
        return a

This is fine and all, but when you want to pass the result into stem(), it will be as one big string, and not as a list of words. This means that when you iterate through the input with for word in a, you will be iterating through each individual character of the input string and applying the stemmer to those individual characters.

def stem(a):
    p = nltk.PorterStemmer()
    a = [p.stem(word) for word in a] # ['s', 'o', 'c', ',', ' ', '3', '2', '\n', ...]
    return a
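A quick way to convince yourself of this, without nltk installed, is to loop over a small string by hand. The sample text below is made up to resemble what a.read() would return, and the snippet is written to run on Python 3:

```python
text = "soc, 32\nsocs, 1\n"  # made-up sample of what a.read() returns

# A for loop (or comprehension) over a str walks it character by character.
pieces = [ch for ch in text]

print(pieces[:8])   # ['s', 'o', 'c', ',', ' ', '3', '2', '\n']
print(len(pieces))  # 16
```

So the stemmer in stem() only ever sees single characters, never a whole word.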

This definitely doesn't work for your purposes, and there are a few different things we can do.

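  1. We can change it so that we read the input file in as a list of lines.
  2. We can take the big string and break it up into a list ourselves.
  3. We can go through and stem each line in the list of lines, one at a time.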

Just for expedience's sake, let's roll with #1. This will require changing openfile(f) to the following:

def openfile(f):
    with open(f,'r') as a:
        a = a.readlines() # a will equal ['soc, 32\n', 'soc, 1\n', ...] in your example
        b = [x.lower() for x in a]
        return b

This should give us b as a list of lines, i.e. ['soc, 32\n', 'soc, 1\n', ...] (readlines() keeps the trailing newline on each line). So the next problem becomes what we do with the list of strings when we pass it to stem(). One way is the following:

def stem(a):
    p = nltk.PorterStemmer()
    b = []
    for line in a:
        split_line = line.split(',') #break it up so we can get access to the word
        new_line = str(p.stem(split_line[0])) + ',' + split_line[1] #put it back together 
        b.append(new_line) #add it to the new list of lines
    return b

This is definitely a pretty rough solution, but should adequately iterate through all of the lines in your input, and depluralize them. It's rough because splitting strings and reassembling them isn't particularly fast when you scale it up. However, if you're satisfied with that, then all that's left is to iterate through the list of new lines, and write them to your file. In my experience it's usually safer to write to a new file, but this should work fine.

def returnfile(f, a):
    with open(f,'w') as d:
        for line in a:
            d.write(line)


print openfile(f)
print stem(openfile(f))
print returnfile(f, stem(openfile(f)))

When I have the following input.txt:

soc, 32
socs, 1
dogs, 8

I get the following stdout:

Please enter a filename: input.txt
['soc, 32\n', 'socs, 1\n', 'dogs, 8\n']
['soc, 32\n', 'soc, 1\n', 'dog, 8\n']
None

And input.txt looks like this afterwards:

soc, 32
soc, 1
dog, 8
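As mentioned above, writing to a new file is usually the safer route. Here is a rough sketch of that variant for Python 3. The names stem_file and naive_singular are made up for illustration, and naive_singular is only a stand-in that strips one trailing 's' so the example runs without nltk (it will mangle words like "class"). You could pass nltk.PorterStemmer().stem as stem_word instead:

```python
import os
import tempfile

def naive_singular(word):
    # Stand-in for nltk.PorterStemmer().stem; only strips one trailing 's'.
    return word[:-1] if word.endswith('s') and len(word) > 1 else word

def stem_file(src_path, dst_path, stem_word=naive_singular):
    """Read 'word, count' lines from src_path and write stemmed lines to dst_path."""
    with open(src_path) as src, open(dst_path, 'w') as dst:
        for line in src:
            word, sep, count = line.partition(',')
            if not sep:
                # No comma (e.g. a blank line): copy the line through unchanged.
                dst.write(line)
                continue
            dst.write(stem_word(word.strip().lower()) + ',' + count)

# Usage with throwaway files, so the original input is left untouched.
tmp_dir = tempfile.mkdtemp()
src_path = os.path.join(tmp_dir, 'input.txt')
dst_path = os.path.join(tmp_dir, 'output.txt')
with open(src_path, 'w') as fh:
    fh.write('soc, 32\nsocs, 1\ndogs, 8\n')

stem_file(src_path, dst_path)
with open(dst_path) as fh:
    result = fh.read()
print(result)
```

Using partition(',') instead of split(',') also means a line without a comma no longer raises an IndexError.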

The second question, about merging the numbers for identical words, changes our solution from above. As per the suggestion in the comments, you should take a look at using dictionaries to solve this. Instead of doing this all as one big list, the better (and probably more Pythonic) way is to iterate through each line of your input and stem each one as you process it. I'll write up code about this in a bit, if you're still working to figure it out.
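In the meantime, a minimal sketch of the dictionary approach could look like this (Python 3). The names merge_counts and naive_singular are made up for illustration, and naive_singular is again just a stand-in for nltk.PorterStemmer().stem so the example runs without nltk:

```python
from collections import OrderedDict

def naive_singular(word):
    # Stand-in for nltk.PorterStemmer().stem; only strips one trailing 's'.
    return word[:-1] if word.endswith('s') and len(word) > 1 else word

def merge_counts(lines, stem_word=naive_singular):
    """Sum the counts of 'word, count' lines whose words stem to the same key."""
    totals = OrderedDict()  # remembers the order in which words first appear
    for line in lines:
        word, sep, count = line.partition(',')
        if not sep:
            continue  # skip lines without a comma, such as blank lines
        key = stem_word(word.strip().lower())
        totals[key] = totals.get(key, 0) + int(count)  # int() ignores spaces and '\n'
    return totals

lines = ['word, 23\n', 'Words, 2\n', 'test, 1\n', 'tests, 4\n']
totals = merge_counts(lines)
for word, total in totals.items():
    print('%s, %d' % (word, total))
# word, 25
# test, 5
```

Because each line is stemmed and added to the running total as it is read, the same loop works directly on an open file object, with no need to build the intermediate list first.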
