Split diary file into multiple files using Python

Problem description

I keep a diary file of tech notes. Each entry is timestamped like so:

# Monday 02012-05-07 at 01:45:20 PM

This is a sample note

Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod
tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam,
quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo
consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse
cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non
proident, sunt in culpa qui officia deserunt mollit anim id est laborum.

# Wednesday 02012-06-06 at 03:44:11 PM

Here is another one.

Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia 
deserunt mollit anim id est laborum.

I would like to break these notes down into individual files based on the timestamp headers, e.g. This is a sample note.txt, Here is another really long title.txt. I'm sure I would have to truncate the filename at some point, but the idea is to seed the filename from the first line of the diary entry.

It doesn't look like I can modify the file's creation date via Python, so I would like to preserve the entry's timestamp as part of the note's body.
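As a side note, here is a hedged sketch of a partial workaround. The creation date is not something the standard library sets portably, as far as I know, but the modification time can be set with os.utime. The helper names parse_stamp and set_mtime and the pattern STAMP_RE are made up for this illustration, and the parsing assumes the five-digit year shown in the sample headers and an English locale for AM/PM.

```python
import os
import re
import tempfile
from datetime import datetime

# Hypothetical pattern: tolerate the leading zero in years such as "02012".
STAMP_RE = re.compile(r"0?(\d{4}-\d{2}-\d{2}) at (\d{2}:\d{2}:\d{2} [AP]M)")

def parse_stamp(header):
    """Return the header's timestamp as a naive local datetime, or None."""
    m = STAMP_RE.search(header)
    if m is None:
        return None
    return datetime.strptime(m.group(1) + " " + m.group(2),
                             "%Y-%m-%d %I:%M:%S %p")

def set_mtime(path, when):
    """Set both access and modification time of path to `when`."""
    ts = when.timestamp()
    os.utime(path, (ts, ts))

when = parse_stamp("# Monday 02012-05-07 at 01:45:20 PM")
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "note.txt")
    with open(path, "w") as f:
        f.write("This is a sample note\n")
    set_mtime(path, when)
    restored = datetime.fromtimestamp(os.path.getmtime(path))
```

Keeping the timestamp in the note's body, as planned above, is still the more robust record, since modification times change whenever a file is edited.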

I've got a RegEx pattern to capture the timestamps that suits me well:

#(\s)(Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday)(\s)(.*)

and can likely use that regex to loop through the file and break each entry down, but I'm not quite sure how to loop through the diary file and break it out into individual files. There are a lot of examples of grabbing the actual regex pattern, or a particular line, but I want to do a few more things here and am having some difficulty piecing it together.
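For reference, a minimal sketch of how the pattern above behaves when compiled with Python's re module. The name HEADER_RE is an assumption made for this illustration; the pattern itself is the one from the question.

```python
import re

# The question's pattern, unchanged. match() only matches at the start of the line.
HEADER_RE = re.compile(
    r"#(\s)(Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday)(\s)(.*)"
)

m = HEADER_RE.match("# Monday 02012-05-07 at 01:45:20 PM\n")
weekday = m.group(2)   # the day name
rest = m.group(4)      # everything after the day name, up to the newline

# An ordinary body line does not match, so match() returns None.
not_header = HEADER_RE.match("This is a sample note\n")
```

Groups 1 and 3 only capture the single whitespace characters, so in practice groups 2 and 4 are the useful ones.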

Here is an example of the desired file contents (datestamp + all text up until next datestamp match):

bash$ cat This\ is\ a\ sample\ note.txt
Monday 02012-05-07 at 01:45:20 PM

This is a sample note

Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod
tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam,
quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo
consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse
cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non
proident, sunt in culpa qui officia deserunt mollit anim id est laborum.

bash$

Solution

Here's the general ;-) approach:

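f = open("diaryfile", "r")
body = []                          # lines of the entry collected so far
for line in f:
    if your_regexp.match(line):    # a timestamp header starts a new entry
        if body:
            write_one(body)        # dump the previous entry
        body = []
    body.append(line)
if body:
    write_one(body)                # the last entry has no header after it
f.close()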

In short, you just keep appending all lines to a list (body). When you find a magical line, you call write_one() to dump what you have so far, and clear the list. The last chunk of the file is a special case, because you're not going to find your magical regexp again. So you again dump what you have after the loop.
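The loop described above could be sketched as a self-contained generator. This is only an illustration under some assumptions: the names HEADER_RE and split_entries are invented here, the pattern is the one from the question, and the sample text is a shortened version of the diary excerpt.

```python
import re

# The question's pattern, compiled once.
HEADER_RE = re.compile(
    r"#(\s)(Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday)(\s)(.*)"
)

def split_entries(lines):
    """Yield each diary entry as a list of lines, header line first."""
    body = []
    for line in lines:
        if HEADER_RE.match(line):
            if body:            # a new header closes the previous entry
                yield body
            body = []
        body.append(line)
    if body:                    # the last entry is never closed by a header
        yield body

sample = """# Monday 02012-05-07 at 01:45:20 PM

This is a sample note

Lorem ipsum dolor sit amet.

# Wednesday 02012-06-06 at 03:44:11 PM

Here is another one.
"""

entries = list(split_entries(sample.splitlines(keepends=True)))
```

Note that any text appearing before the first header would be yielded as an entry of its own, which is the same behavior as the loop in the answer.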

You can make any transformations you like in your write_one() function. For example, it sounds like you want to remove the leading "# " from the input timestamp lines. That's fine - just do, e.g.,

body[0] = body[0][2:]

in write_one. All the lines can be written out in one gulp via, e.g.,

with open(file_name_extracted_from_body_goes_here, "w") as f:
    f.writelines(body)
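Putting those pieces together, one possible write_one might look like the sketch below. Several details are assumptions rather than part of the answer: the 50-character limit, the set of characters stripped from the title, the "untitled" fallback, the out_dir parameter, and the helper name title_to_filename are all hypothetical choices for illustration.

```python
import os
import re
import tempfile

MAX_NAME_LEN = 50  # hypothetical truncation limit

def title_to_filename(title, max_len=MAX_NAME_LEN):
    """Turn an entry's first text line into a truncated .txt filename."""
    # Drop characters that are commonly unsafe in filenames.
    cleaned = re.sub(r'[\\/:*?"<>|]', "", title).strip()
    return (cleaned[:max_len].rstrip() or "untitled") + ".txt"

def write_one(body, out_dir="."):
    """Write one entry to a file named after its first non-blank text line."""
    body = list(body)                  # do not modify the caller's list
    if body[0].startswith("# "):
        body[0] = body[0][2:]          # strip the leading "# " from the header
    # The title is the first non-blank line after the timestamp header.
    title = next((ln.strip() for ln in body[1:] if ln.strip()), "untitled")
    path = os.path.join(out_dir, title_to_filename(title))
    with open(path, "w") as f:
        f.writelines(body)
    return path

entry = [
    "# Monday 02012-05-07 at 01:45:20 PM\n",
    "\n",
    "This is a sample note\n",
    "\n",
    "Lorem ipsum dolor sit amet.\n",
]
with tempfile.TemporaryDirectory() as tmp:
    path = write_one(entry, out_dir=tmp)
    written_name = os.path.basename(path)
    with open(path) as f:
        written_text = f.read()
```

As written, this overwrites an existing file of the same name without warning.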

You probably want to check that the file doesn't exist first! If it's anything like my diary, the first line of many entries will be "Rotten day." ;-)
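One way to handle such collisions, sketched under the assumption that a numeric suffix is acceptable, is to probe for a free name before writing. The helper name unique_path and the " (2)" suffix style are invented for this example.

```python
import os
import tempfile

def unique_path(path):
    """Return path if unused, else add ' (2)', ' (3)', ... before the extension."""
    if not os.path.exists(path):
        return path
    root, ext = os.path.splitext(path)
    n = 2
    while os.path.exists(f"{root} ({n}){ext}"):
        n += 1
    return f"{root} ({n}){ext}"

with tempfile.TemporaryDirectory() as tmp:
    names = []
    for _ in range(3):
        p = unique_path(os.path.join(tmp, "Rotten day.txt"))
        with open(p, "w") as f:
            f.write("Rotten day.\n")
        names.append(os.path.basename(p))
```

A simpler alternative is to open the file with mode "x" instead of "w", which raises FileExistsError rather than overwriting; that also avoids the small window between the existence check and the write.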
