Python: sorting the contents of a file by timestamp and writing them to a new file?
Problem description
I have a file that stores data in the following format:
TIME[04.26_12:30:30:853664]ID[ROLL:201987623]MARKS[PHY:100|MATH:200|CHEM:400]
TIME[03.27_12:29:30.553669]ID[ROLL:201987623]MARKS[PHY:100|MATH:1200|CHEM:900]
TIME[03.26_12:28:30.753664]ID[ROLL:2341987623]MARKS[PHY:100|MATH:200|CHEM:400]
TIME[03.26_12:29:30.853664]ID[ROLL:201978623]MARKS[PHY:0|MATH:0|CHEM:40]
TIME[04.27_12:29:30.553664]ID[ROLL:2034287623]MARKS[PHY:100|MATH:200|CHEM:400]
This data is stored in a text file. From it I am creating several files, one per ROLL number, each holding the lines for that roll number. I am using a regex in Python for this. The file is so large that I cannot load it into a list with readlines (that raises a memory error), so I have to read it line by line. Here is the code I have written:
import re
import os
import fileinput
from datetime import datetime
from collections import defaultdict

# Dictionary holding the timestamps seen for each roll number
timer_for_roll_numbers = defaultdict()
with open('Marksinfo.txt', 'r') as f:
    for line in f:
        # group(2) is the timestamp, group(4) is the roll number
        ind = re.match(r'(.*)TIME\[(.*?)\](.*)\[ROLL:(.*?)\]', line, re.M | re.I)
        timer_for_roll_numbers.setdefault(int(ind.group(4)), defaultdict(list))['TIME'].append(ind.group(2))
        # Append the line to the file for its roll number
        p = open('ROLL_{}.txt'.format(ind.group(4)), "a")
        p.write(line)
        p.close()
The code above creates the files as I want, but I also want the data in each file to be sorted by the timestamp on each line. I have no idea how to do that, because the code reads lines sequentially from the input file and appends them to the new files without considering whether they are in timestamp order.
The actual output I currently get is as follows.
In the file ROLL_201987623.txt:
TIME[04.26_12:30:30:853664]ID[ROLL:201987623]MARKS[PHY:100|MATH:200|CHEM:400]
TIME[03.27_12:29:30.553669]ID[ROLL:201987623]MARKS[PHY:100|MATH:1200|CHEM:900]
The desired output is as follows:
TIME[03.27_12:29:30.553669]ID[ROLL:201987623]MARKS[PHY:100|MATH:1200|CHEM:900]
TIME[04.26_12:30:30:853664]ID[ROLL:201987623]MARKS[PHY:100|MATH:200|CHEM:400]
Likewise, the lines for every roll number should be sorted in their respective files. Please suggest how to do this.
In my code I have also extracted the timestamps and converted them with Python's datetime library. For example, to get every timestamp recorded for a particular roll number (say 201987623), I use:
time_for_particular_roll=timer_for_roll_numbers[201987623]['TIME']
dt = [datetime.strptime(s, '%m.%d_%H:%M:%S.%f') for s in time_for_particular_roll]
dt then holds values in the format below, which I can access easily:
(4,26,12,30,30,853664)
Now I don't know how to write the information for a particular roll number, in sorted order, into the newly created file for that roll number.
Recommended answer
I would use sorted and itertools.groupby.
groupby groups the lines by ROLL once they have been sorted by ROLL and timestamp. Here is the script I would use as a first approach:
import re
from itertools import groupby
regex = re.compile(r"^.*TIME\[([^]]+)\]ID\[ROLL:([^]]+)\].+$")
I would define three callables for filtering, sorting and grouping lines:
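def func1(arg) -> bool:
    # Filter: keep only lines that match the expected format
    return regex.match(arg) is not None

def func2(arg) -> str:
    # Sort key: the timestamp string inside TIME[...]
    match = regex.match(arg)
    if match:
        return match.group(1)
    return ""

def func3(arg) -> int:
    # Sort and group key: the roll number inside ID[ROLL:...]
    match = regex.match(arg)
    if match:
        return int(match.group(2))
    return 0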
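The two sort keys can also be combined into a single tuple key, so that one call to sorted orders by ROLL and then by timestamp. This is a minimal sketch, assuming the same line format as in the question. `sort_key` is a hypothetical helper name, and the sample lines are shortened for illustration.

```python
import re

regex = re.compile(r"^.*TIME\[([^]]+)\]ID\[ROLL:([^]]+)\].+$")

def sort_key(line):
    """Return (roll, timestamp) so one sort orders by ROLL, then by time."""
    match = regex.match(line)
    if match:
        return int(match.group(2)), match.group(1)
    return 0, ""  # non-matching lines sort first

# Shortened, made-up sample lines in the question's format
lines = [
    "TIME[04.27_12:29:30.553664]ID[ROLL:2]MARKS[PHY:1]\n",
    "TIME[03.27_12:29:30.553669]ID[ROLL:1]MARKS[PHY:2]\n",
    "TIME[03.26_12:28:30.753664]ID[ROLL:2]MARKS[PHY:3]\n",
]
ordered = sorted(lines, key=sort_key)
```

Tuples compare element by element, so the roll number decides first and the timestamp only breaks ties within the same roll.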
Then loop over your input file.
First reject non-compliant data. Sort the remaining data by timestamp, then by ROLL; Python's sort is stable, so the timestamp order is preserved within each ROLL. Then group the data by ROLL.
with open(your_input_file) as fr:
    collection = filter(func1, fr)
    collection = sorted(collection, key=func2)
    collection = sorted(collection, key=func3)
for key, group in groupby(collection, key=func3):
    with open(f"ROLL_{key}", mode="w") as fw:
        fw.writelines(group)
With your example data, that snippet produces four files, each with its lines sorted by ascending timestamp.
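Note that sorted holds every line in memory at once, and the question says the input is too large for that. One possible workaround is a two-pass variant: stream the lines into per-ROLL files first, then sort each of those smaller files on its own. The sketch below is not part of the original answer. `split_then_sort` is a hypothetical name, and it assumes that each per-ROLL file fits in memory, that every line ends with a newline, and that the output directory holds no ROLL files from an earlier run.

```python
import re
import tempfile
from pathlib import Path

regex = re.compile(r"^.*TIME\[([^]]+)\]ID\[ROLL:([^]]+)\].+$")

SAMPLE = """\
TIME[04.26_12:30:30:853664]ID[ROLL:201987623]MARKS[PHY:100|MATH:200|CHEM:400]
TIME[03.27_12:29:30.553669]ID[ROLL:201987623]MARKS[PHY:100|MATH:1200|CHEM:900]
TIME[03.26_12:28:30.753664]ID[ROLL:2341987623]MARKS[PHY:100|MATH:200|CHEM:400]
TIME[03.26_12:29:30.853664]ID[ROLL:201978623]MARKS[PHY:0|MATH:0|CHEM:40]
TIME[04.27_12:29:30.553664]ID[ROLL:2034287623]MARKS[PHY:100|MATH:200|CHEM:400]
"""

def split_then_sort(input_path, out_dir):
    """Pass 1: stream lines into per-ROLL files. Pass 2: sort each file."""
    out_dir = Path(out_dir)
    targets = set()
    with open(input_path) as fr:
        for line in fr:  # only one input line is held in memory at a time
            match = regex.match(line)
            if not match:
                continue  # skip non-compliant lines
            target = out_dir / f"ROLL_{int(match.group(2))}.txt"
            with open(target, "a") as fw:  # append: assumes no stale files
                fw.write(line)
            targets.add(target)
    for target in targets:
        # Only the lines of a single roll number are loaded here.
        lines = target.read_text().splitlines(keepends=True)
        lines.sort(key=lambda line: regex.match(line).group(1))
        target.write_text("".join(lines))
    return sorted(t.name for t in targets)

# Demonstration on the question's sample data, in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "Marksinfo.txt"
    source.write_text(SAMPLE)
    created = split_then_sort(source, tmp)
    first = (Path(tmp) / "ROLL_201987623.txt").read_text().splitlines()
```

Opening the target file once per line is slow for very large inputs. Keeping a small cache of open file handles would be a natural refinement.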
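Of course, don't change the timestamp format, for example by putting the day first. The sort compares timestamps as plain strings, which matches chronological order only because the fields are zero-padded and run from month down to microseconds.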
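If the timestamp format cannot be relied on, for instance because the first sample line writes a colon instead of a dot before the microseconds, the key can be parsed into a datetime rather than compared as text. This is a rough sketch under stated assumptions. `TIME_RE` and `parse_timestamp` are hypothetical names, and since the data carries no year, the year 2000 is used as an arbitrary placeholder, so data that crosses a year boundary would not sort correctly.

```python
import re
from datetime import datetime

# Accept either "." or ":" before the fractional seconds (an assumption
# based on the sample data, which contains both forms).
TIME_RE = re.compile(
    r"TIME\[(\d{2})\.(\d{2})_(\d{2}):(\d{2}):(\d{2})[.:](\d{1,6})\]"
)

def parse_timestamp(line):
    """Parse the TIME[...] field of a line into a datetime."""
    match = TIME_RE.search(line)
    if match is None:
        raise ValueError(f"no timestamp in line: {line!r}")
    month, day, hour, minute, second = (int(g) for g in match.groups()[:5])
    # Right-pad the fraction so "5" means 500000 microseconds
    micro = int(match.group(6).ljust(6, "0"))
    # Placeholder year: the data has none; 2000 is a leap year, so Feb 29 parses
    return datetime(2000, month, day, hour, minute, second, micro)

lines = [
    "TIME[04.26_12:30:30:853664]ID[ROLL:201987623]MARKS[PHY:100|MATH:200|CHEM:400]\n",
    "TIME[03.27_12:29:30.553669]ID[ROLL:201987623]MARKS[PHY:100|MATH:1200|CHEM:900]\n",
]
ordered = sorted(lines, key=parse_timestamp)
```

Passing `parse_timestamp` as the sort key in place of the raw timestamp string keeps the rest of the approach unchanged.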