How to extract specific lines from a data file


Problem description

I have a problem but I feel the solution should be quite simple. I'm building a model and want to test its accuracy by 10-fold cross-validation. To do this I have to split my training corpus 90%/10% into training and test sections, then train my model on the 90% and test on the 10%. This I want to do ten times, by taking a different 90%/10% split every time, so that eventually each bit of the corpus has been used as testing data. Then I'll average the results for each 10% test.
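The procedure described above can be sketched in a few lines. This is only an illustration under stated assumptions: the corpus is small enough to hold in memory as a list of lines, and the names `k_fold_splits`, `cross_validate` and the `evaluate` callback are hypothetical, not part of the original post.

```python
def k_fold_splits(lines, k=10):
    """Yield (train, test) pairs so that every line is in exactly one test set."""
    # When len(lines) is not a multiple of k, the first `extra` folds
    # receive one additional line each.
    base, extra = divmod(len(lines), k)
    start = 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        test = lines[start:start + size]
        train = lines[:start] + lines[start + size:]
        yield train, test
        start += size


def cross_validate(lines, evaluate, k=10):
    """Average the score of a hypothetical evaluate(train, test) over k folds."""
    scores = [evaluate(train, test) for train, test in k_fold_splits(lines, k)]
    return sum(scores) / len(scores)
```

Here `evaluate` stands in for whatever trains the model on the 90% and returns its accuracy on the 10%.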

I have tried to write a script that extracts 10% of the training corpus and writes it to a new file, but so far I can't get it working. What I have done is count the total number of lines in the file and then divide this number by ten to get the size of each of the ten test sets I want to extract.

trainFile = open("danish.train")
numberOfLines = 0

for line in trainFile:
    numberOfLines += 1

# integer division, so the result is a whole number of lines
lengthTest = numberOfLines // 10

My own training file has 3638 lines, so each test set should consist of roughly 363 lines.

How do I write lines 1-363, lines 364-726, etc. to different test files?

Solution

Once you have the count of lines, go back to the beginning of the file, and start copying out lines to danish.train.part-01. When the line number is a multiple of the size of the 10% test set, open a new file for the next part.

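#!/usr/bin/env python2.7

trainFile = open("danish.train")

# first pass: count the lines
numberOfLines = 0
for line in trainFile:
    numberOfLines += 1

# size of each 10% test set; // keeps this an integer on Python 3 as well
lengthTest = numberOfLines // 10

# rewind file to beginning
trainFile.seek(0)

# second pass: copy lines out, starting a new part every lengthTest lines
numberOfLines = 0
file_number = 0
output = None
for line in trainFile:
    if numberOfLines % lengthTest == 0:
        if output is not None:
            output.close()  # finish the previous part before opening the next
        file_number += 1
        output = open('danish.train.part-%02d' % file_number, 'w')

    numberOfLines += 1
    output.write(line)

if output is not None:
    output.close()
trainFile.close()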

On this input file (sorry I don’t speak Danish!):

one
two
three
four
five
six
seven
eight
nine
ten
eleven
twelve
thirteen
fourteen
fifteen
sixteen
seventeen
eighteen
nineteen
twenty
twenty-one
twenty-two
twenty-three
twenty-four
twenty-five
twenty-six
twenty-seven
twenty-eight
twenty-nine
thirty

This creates files

danish.train.part-01
danish.train.part-02
danish.train.part-03
danish.train.part-04
danish.train.part-05
danish.train.part-06
danish.train.part-07
danish.train.part-08
danish.train.part-09
danish.train.part-10

and part 5, for example, contains:

thirteen
fourteen
fifteen
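
One caveat worth noting: 30 divides evenly by 10, but 3638 does not (3638 = 10 × 363 + 8), so the loop above would open an eleventh file for the 8 leftover lines. A minimal sketch of one way around this is to spread the remainder over the first few parts. It assumes the file fits in memory, and `split_into_parts` is a hypothetical helper name, not something from the original answer.

```python
def split_into_parts(path, n_parts=10):
    """Write the lines of `path` into exactly `n_parts` files; return their names."""
    with open(path) as src:
        lines = src.readlines()

    # The first `extra` parts get one additional line each,
    # so part sizes differ by at most one line.
    base, extra = divmod(len(lines), n_parts)

    part_names = []
    start = 0
    for i in range(n_parts):
        size = base + (1 if i < extra else 0)
        name = "%s.part-%02d" % (path, i + 1)
        with open(name, "w") as out:
            out.writelines(lines[start:start + size])
        part_names.append(name)
        start += size
    return part_names
```

Calling `split_into_parts("danish.train")` on a 3638-line file would then give eight parts of 364 lines and two parts of 363 lines, with the same `danish.train.part-NN` naming as above.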
