Optimizing python script extracting and processing large data files
Question
I am new to Python and naively wrote a Python script for the following task:
I want to create a bag-of-words representation of multiple objects. Each object is basically a pair (a movie name and its synopsis), and a bag-of-words representation of the synopsis is to be made, so each object is converted to a (movie name, bag of words of synopsis) pair in the final documents.
Here is the script:
import re
import math
import itertools
from nltk.corpus import stopwords
from nltk import PorterStemmer
from collections import defaultdict
from collections import Counter
from itertools import dropwhile
import sys, getopt

inp = "inp_6000.txt"  #input file name
out = "bowfilter10"   #output file name
with open(inp, 'r') as plot_data:
    main_dict = Counter()
    file1, file2 = itertools.tee(plot_data, 2)
    line_one = itertools.islice(file1, 0, None, 4)
    line_two = itertools.islice(file2, 2, None, 4)
    dictionary = defaultdict(Counter)
    doc_count = defaultdict(Counter)
    for movie_name, movie_plot in itertools.izip(line_one, line_two):
        movie_plot = movie_plot.lower()
        words = re.findall(r'\w+', movie_plot, flags=re.UNICODE | re.LOCALE)  #split words
        elemStopW = filter(lambda x: x not in stopwords.words('english'), words)  #remove stop words, python nltk
        for word in elemStopW:
            word = PorterStemmer().stem_word(word)  #use python stemmer class to do stemming
            #increment the word count of the word in the particular movie synopsis
            dictionary[movie_name][word] += 1
            #increment the count of a particular word in the main dictionary, which stores frequency over all documents
            main_dict[word] += 1
            #This is done to calculate term frequency-inverse document frequency. Take note of the first occurrence of the word in the synopsis and neglect all others.
            if doc_count[word]['this_mov'] == 0:
                doc_count[word].update(count=1, this_mov=1)
        for word in doc_count:
            doc_count[word].update(this_mov=-1)

#print "---------main_dict---------"
#print main_dict
#Remove all the words with frequency less than 5 in the whole set of movies
for key, count in dropwhile(lambda key_count: key_count[1] >= 5, main_dict.most_common()):
    del main_dict[key]
#print main_dict

#Write to file
bow_vec = open(out, 'w')
#calculate the bag-of-words vector and write it
m = len(dictionary)
for movie_name in dictionary.keys():
    #print movie_name
    vector = []
    for word in list(main_dict):
        #print word, dictionary[movie_name][word]
        x = dictionary[movie_name][word] * math.log(m / doc_count[word]['count'], 2)
        vector.append(x)
    #write to file
    bow_vec.write("%s" % movie_name)
    for item in vector:
        bow_vec.write("%s," % item)
    bow_vec.write("\n")
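For reference, the weighting the last loop computes is plain tf-idf with a base-2 logarithm. Note that in the Python 2 code above, m / doc_count[word]['count'] is integer division, which truncates before the logarithm; a float version of the formula is sketched below (Python 3 syntax, function name mine):

```python
import math

def tf_idf(term_count, num_docs, doc_freq):
    """Term frequency times log2 of inverse document frequency,
    mirroring the x = tf * log(m / df, 2) line in the script."""
    return term_count * math.log(num_docs / doc_freq, 2)

# A word appearing 3 times in one synopsis, present in 2 of 8 documents:
print(tf_idf(3, 8, 2))  # 3 * log2(4) = 6.0
```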
Format of the data file and additional information about the data: The data file has the following format:
Movie name.
Empty line.
Movie synopsis (one can assume the size to be around 150 words).
Empty line.
Note: <*> is used to denote a pair.
Size of the input file:
The file size is around 200 MB.
As of now, this script takes around 10-12 hours on a 3 GHz Intel processor.
Note: I am looking for improvements in the serial code. I know parallelization would improve it, but I want to look into that later. I want to take this opportunity to make this serial code more efficient.
Any help is appreciated.
Answer
First of all - try to drop regular expressions, they are heavy. My original advice was crappy - it would not have worked. Maybe this will be more efficient:
import string

trans_table = string.maketrans(string.punctuation,
                               ' ' * len(string.punctuation))
words = movie_plot.lower().translate(trans_table).split()
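(A side note not in the original answer) string.maketrans was removed in Python 3; there the equivalent is the str.maketrans class method, with lower-casing applied to the text rather than the table. A minimal sketch:

```python
import string

# Map every punctuation character to a space, then lower-case and split.
trans_table = str.maketrans(string.punctuation, ' ' * len(string.punctuation))

def tokenize(text):
    return text.lower().translate(trans_table).split()

print(tokenize("This is a long; and meaningless - sentence"))
# ['this', 'is', 'a', 'long', 'and', 'meaningless', 'sentence']
```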
(An afterthought) I cannot test it, but I think that if you store the result of this call in a variable:
stops = stopwords.words('english')
or, probably better, convert it into a set first (if the function does not return one):
stops = set(stopwords.words('english'))
you will get some improvement as well.
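To make the gain concrete, here is a minimal sketch of the cached-stopword filter in Python 3 syntax; STOPS is a hand-picked stand-in for set(stopwords.words('english')) so the example runs without NLTK:

```python
# STOPS is built ONCE, outside the loop, instead of calling
# stopwords.words('english') for every single word as the script does.
STOPS = {'is', 'a', 'and', 'the', 'this'}

def remove_stopwords(words):
    # Set membership is O(1) on average, vs O(n) for a list.
    return [w for w in words if w not in STOPS]

print(remove_stopwords(['this', 'is', 'a', 'long', 'sentence']))
# ['long', 'sentence']
```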
(To answer your question in the comments) Every function call consumes time; if you recompute a large block of data that you could have kept around, the waste of time may be huge. As for set vs. list - compare the results:
In [49]: my_list = range(100)
In [50]: %timeit 10 in my_list
1000000 loops, best of 3: 193 ns per loop
In [51]: %timeit 101 in my_list
1000000 loops, best of 3: 1.49 us per loop
In [52]: my_set = set(my_list)
In [53]: %timeit 101 in my_set
10000000 loops, best of 3: 45.2 ns per loop
In [54]: %timeit 10 in my_set
10000000 loops, best of 3: 47.2 ns per loop
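The same comparison can be reproduced outside IPython with the standard timeit module (absolute numbers will differ per machine; only the relative gap matters):

```python
import timeit

my_list = list(range(100))
my_set = set(my_list)

# 101 is absent from both, so the list scan has to walk all 100 elements,
# while the set does a single hash lookup.
list_miss = timeit.timeit('101 in my_list', globals=globals(), number=100000)
set_miss = timeit.timeit('101 in my_set', globals=globals(), number=100000)

print("list miss: %.4fs, set miss: %.4fs" % (list_miss, set_miss))
```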
While we are at the greasy details - here are measurements for split vs. regular expressions:
In [30]: %timeit words = 'This is a long; and meaningless - sentence'.split(split_let)
1000000 loops, best of 3: 271 ns per loop
In [31]: %timeit words = re.findall(r'\w+', 'This is a long; and meaningless - sentence', flags = re.UNICODE | re.LOCALE)
100000 loops, best of 3: 3.08 us per loop
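A sanity check worth running before switching: for ordinary text the two tokenizers agree, but they genuinely differ on underscores, which \w+ treats as a word character while string.punctuation maps them to spaces (Python 3 syntax; the translation table is the one suggested above):

```python
import re
import string

text = "this is a long; and meaningless - sentence"

table = str.maketrans(string.punctuation, ' ' * len(string.punctuation))
split_tokens = text.translate(table).split()
regex_tokens = re.findall(r'\w+', text)

print(split_tokens == regex_tokens)  # True for this input

# The two approaches disagree on underscores:
print(re.findall(r'\w+', 'foo_bar'))       # ['foo_bar']
print('foo_bar'.translate(table).split())  # ['foo', 'bar']
```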