Create a Corpus Containing the Vocabulary of Words


Question


I am calculating the inverse document frequency for all the words in my dictionary of documents, and I have to show the top 5 documents ranked by score for each query. But I am stuck in slow loops while creating the corpus containing the vocabulary of words in the documents. Please help me improve my code. This block of code reads each file and removes punctuation and stop words:

from string import punctuation
# english_stopwords is assumed to be a set of stop words, e.g. from NLTK:
# english_stopwords = set(nltk.corpus.stopwords.words("english"))

def wordList(doc):
    """
    1: Remove punctuation
    2: Remove stop words
    3: Return a list of words
    """
    file = open("C:\\Users\\Zed\\PycharmProjects\\ACL txt\\" + doc, 'r', encoding="utf8", errors='ignore')
    text = file.read().strip()
    file.close()
    nopunc = [char for char in text if char not in punctuation]
    nopunc = ''.join(nopunc)
    return [word for word in nopunc.split() if word.lower() not in english_stopwords]


This block of code stores all the file names in my folder:

from pathlib import Path

file_names = []
for file in Path("ACL txt").rglob("*.txt"):
    file_names.append(file.name)


This block of code builds the dictionary of documents that I am working with:

documents = {}
for i in file_names:
    documents[i] = wordList(i)


The code above runs well and fast, but this block takes a long time to build the corpus. How can I improve it?

# create a corpus containing the vocabulary of words in the documents
corpus = []  # a list that will store the words of the vocabulary
for doc in documents.values():  # iterate through documents
    for word in doc:  # go through each word in the current doc
        if word not in corpus:
            corpus.append(word)  # add word to corpus if not already present


This code creates a dictionary that stores the document frequency of each word in the corpus:

df_corpus = {}  # document frequency for every word in the corpus
for word in corpus:
    k = 0  # initial document frequency set to 0
    for doc in documents.values():  # iterate through documents
        if word in doc:  # doc is already a list of words, so no .split() here
            k += 1
    df_corpus[word] = k


It has been building the corpus for 2 hours and is still running. Please help me improve my code. This is the data set I am working with: https://drive.google.com/open?id=1D1GjN_JTGNBv9rPNcWJMeLB_viy9pCfJ

Answer


How about making corpus a set instead of a list? You won't need the extra membership check either:

corpus = set()  # a set that will store the words of the vocabulary
for doc in documents.values():  # iterate through documents
    corpus.update(doc)  # adds each word only once, no membership test needed
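The same idea also speeds up the document-frequency step, which has the same quadratic shape as the corpus loop. Here is a minimal runnable sketch; the toy documents dict and the Counter-based df_corpus are my additions for illustration, not part of the original answer:

```python
from collections import Counter

# Toy stand-in for the asker's documents dict (file name -> list of cleaned words);
# the file names and words are made up for illustration.
documents = {
    "doc1.txt": ["apple", "banana", "apple"],
    "doc2.txt": ["banana", "cherry"],
    "doc3.txt": ["cherry", "cherry", "date"],
}

# Vocabulary as a set: update() adds every word, duplicates are ignored automatically.
corpus = set()
for doc in documents.values():
    corpus.update(doc)

# Document frequency in a single pass over the documents: counting set(doc)
# counts each word at most once per document, replacing the nested
# corpus-by-documents loop with one Counter update per document.
df_corpus = Counter()
for doc in documents.values():
    df_corpus.update(set(doc))

print(sorted(corpus))       # ['apple', 'banana', 'cherry', 'date']
print(df_corpus["banana"])  # 2 (appears in doc1 and doc2)
```

With this shape, building the vocabulary is linear in the total number of words, and the document-frequency pass no longer rescans every document for every vocabulary word.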

