How to extract specific lines from a data file
Problem description
I have a problem but I feel the solution should be quite simple. I'm building a model and want to test its accuracy by 10-fold cross-validation. To do this I have to split my training corpus 90%/10% into training and test sections, then train my model on the 90% and test on the 10%. This I want to do ten times, by taking a different 90%/10% split every time, so that eventually each bit of the corpus has been used as testing data. Then I'll average the results for each 10% test.
I have tried to write a script to extract 10% of the training corpus and write it to a new file, but so far I can't get it working. What I have done is count the total number of lines in the file, and then divide this number by ten to know the size of each of the ten different test sets that I want to extract.
    trainFile = open("danish.train")
    numberOfLines = 0
    for line in trainFile:
        numberOfLines += 1
    lengthTest = numberOfLines / 10
I have found, for my own training file, that it consists of 3638 lines, so each test should consist roughly of 363 lines.
How do I write lines 1-363, lines 364-726, etc. to different test files?
Solution
Once you have the count of lines, go back to the beginning of the file and start copying lines out to danish.train.part-01. When the line number is a multiple of the size of the 10% test set, open a new file for the next part.
    #!/usr/bin/env python2.7

    trainFile = open("danish.train")
    numberOfLines = 0
    for line in trainFile:
        numberOfLines += 1
    lengthTest = numberOfLines / 10

    # rewind file to beginning
    trainFile.seek(0)
    numberOfLines = 0
    file_number = 0
    for line in trainFile:
        if numberOfLines % lengthTest == 0:
            file_number += 1
            output = open('danish.train.part-%02d' % file_number, 'w')
        numberOfLines += 1
        output.write(line)
On this input file (sorry I don’t speak Danish!):
    one
    two
    three
    four
    five
    six
    seven
    eight
    nine
    ten
    eleven
    twelve
    thirteen
    fourteen
    fifteen
    sixteen
    seventeen
    eighteen
    nineteen
    twenty
    twenty-one
    twenty-two
    twenty-three
    twenty-four
    twenty-five
    twenty-six
    twenty-seven
    twenty-eight
    twenty-nine
    thirty
This creates files
    danish.train.part-01
    danish.train.part-02
    danish.train.part-03
    danish.train.part-04
    danish.train.part-05
    danish.train.part-06
    danish.train.part-07
    danish.train.part-08
    danish.train.part-09
    danish.train.part-10
and part 5, for example, contains:
    thirteen
    fourteen
    fifteen
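One thing to watch for: when the line count is not a multiple of ten, fixed-size chunks leave a remainder. With the 3638-line file from the question and a chunk size of 363, the loop would open an eleventh file holding the last 8 lines. The sketch below is one possible way around that, not part of the original answer. It assumes Python 3 and that the corpus fits in memory; the function names and the `.test-NN` / `.train-NN` output file names are made up for illustration. It spreads the remainder over the first few folds and also writes the complementary 90% training file for each fold.

```python
def split_into_folds(lines, k=10):
    """Split a list into k contiguous folds whose sizes differ by at most one."""
    base, extra = divmod(len(lines), k)
    folds = []
    start = 0
    for i in range(k):
        # The first `extra` folds each take one line of the remainder.
        size = base + (1 if i < extra else 0)
        folds.append(lines[start:start + size])
        start += size
    return folds


def train_test_pairs(folds):
    """Yield (train, test) pairs, using each fold exactly once as the test set."""
    for i, test in enumerate(folds):
        train = [item for j, fold in enumerate(folds) if j != i for item in fold]
        yield train, test


def write_folds(path, k=10):
    """Write path.test-NN and path.train-NN files (hypothetical naming scheme)."""
    with open(path) as corpus:
        lines = corpus.readlines()
    pairs = train_test_pairs(split_into_folds(lines, k))
    for n, (train, test) in enumerate(pairs, start=1):
        with open("%s.test-%02d" % (path, n), "w") as out:
            out.writelines(test)
        with open("%s.train-%02d" % (path, n), "w") as out:
            out.writelines(train)
```

Calling `write_folds("danish.train")` would then produce ten test files and ten matching training files. For a 3638-line corpus the folds come out as eight of 364 lines and two of 363.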