Working with log files in R
Problem description
I have a .log file with an inconsistent data format.
The data looks something like this and is stored as "Little-endian UTF-16 Unicode" text:
2017-06-21 00:00:30.483 START THIS THING
2017-06-21 00:00:56.400 SOMETHING ELSE HAPPENS
[XYZ 1000 T1]:1
2017-06-22 01:15:17.945 NOTHING 'D': 989
[CASE] IN: [ID: 1010]33
[CASE] IN: [ID: 2010]8
2017-06-21 00:00:30.483 START THIS THING
2017-06-21 00:00:56.400 SOMETHING ELSE HAPPENS
2017-06-21 00:00:30.483 START THIS THING
2017-06-21 00:00:56.400 SOMETHING ELSE HAPPENS
2017-06-21 00:00:30.483 START THIS THING
2017-06-21 00:00:56.400 SOMETHING ELSE HAPPENS
323133.....238813 76378 989899 000000000000
I have several log files that follow this kind of pattern. I have tried scan() and read.table(), but neither returns the data in the format I expect.
The data format I am expecting looks like this:
Date String
2017-06-21 00:00:30.483 START THIS THING
However, lines like these appear multiple times in the log files:
[CASE] IN: [ID: 1010]33
[CASE] IN: [ID: 2010]8
And then:
323133.....238813 76378 989899 000000000000
What would be the best way to approach this problem? Thanks!
Recommended answer
Here is just a rough sketch in base R (it ignores the time part of your timestamps and the column names), without any performance optimisation such as using data.table::fread or the lubridate package:
log.data <- "2017-06-21 00:00:30.483 START THIS THING
2017-06-21 00:00:56.400 SOMETHING ELSE HAPPENS
[XYZ 1000 T1]:1
2017-06-22 01:15:17.945 NOTHING 'D': 989
[CASE] IN: [ID: 1010]33
[CASE] IN: [ID: 2010]8
2017-06-21 00:00:30.483 START THIS THING
2017-06-21 00:00:56.400 SOMETHING ELSE HAPPENS
2017-06-21 00:00:30.483 START THIS THING
2017-06-21 00:00:56.400 SOMETHING ELSE HAPPENS
2017-06-21 00:00:30.483 START THIS THING
2017-06-21 00:00:56.400 SOMETHING ELSE HAPPENS
323133.....238813 76378 989899 000000000000"
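# Read each line as one row of a single-column data frame (column V1)
log <- read.csv(text = log.data, sep = "\n", header = FALSE, stringsAsFactors = FALSE)
# as.Date() parses the leading date and yields NA for lines without one
log$timestamp <- as.Date(log[, 1])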
The result is:
> log
                                               V1  timestamp
1        2017-06-21 00:00:30.483 START THIS THING 2017-06-21
2  2017-06-21 00:00:56.400 SOMETHING ELSE HAPPENS 2017-06-21
3                                 [XYZ 1000 T1]:1       <NA>
4        2017-06-22 01:15:17.945 NOTHING 'D': 989 2017-06-22
5                         [CASE] IN: [ID: 1010]33       <NA>
6                          [CASE] IN: [ID: 2010]8       <NA>
7        2017-06-21 00:00:30.483 START THIS THING 2017-06-21
8  2017-06-21 00:00:56.400 SOMETHING ELSE HAPPENS 2017-06-21
9        2017-06-21 00:00:30.483 START THIS THING 2017-06-21
10 2017-06-21 00:00:56.400 SOMETHING ELSE HAPPENS 2017-06-21
11       2017-06-21 00:00:30.483 START THIS THING 2017-06-21
12 2017-06-21 00:00:56.400 SOMETHING ELSE HAPPENS 2017-06-21
13    323133.....238813 76378 989899 000000000000       <NA>
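To get closer to the Date/String layout asked for in the question, one possible next step is to keep only the lines that start with a timestamp and split them with a regular expression. The following is a minimal sketch, not part of the original answer: the names ts.pattern, is.entry, entries and other.lines are made up for illustration, and the pattern assumes every log entry starts with "YYYY-MM-DD HH:MM:SS.mmm " followed by the message.

```r
# Sample lines taken from the question; in practice they would come from the log file.
lines <- c(
  "2017-06-21 00:00:30.483 START THIS THING",
  "2017-06-21 00:00:56.400 SOMETHING ELSE HAPPENS",
  "[XYZ 1000 T1]:1",
  "2017-06-22 01:15:17.945 NOTHING 'D': 989",
  "[CASE] IN: [ID: 1010]33",
  "323133.....238813 76378 989899 000000000000"
)

# Assumed entry format: timestamp with milliseconds, one space, then the message.
ts.pattern <- "^([0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2}\\.[0-9]{3}) (.*)$"
is.entry <- grepl(ts.pattern, lines)

# Split the matching lines into the two columns the question asks for.
entries <- data.frame(
  Date   = sub(ts.pattern, "\\1", lines[is.entry]),
  String = sub(ts.pattern, "\\2", lines[is.entry]),
  stringsAsFactors = FALSE
)

# Optional: parse the full timestamp; %OS accepts fractional seconds on input.
entries$Time <- as.POSIXct(entries$Date, format = "%Y-%m-%d %H:%M:%OS", tz = "UTC")

# Lines without a leading timestamp are kept separately instead of being dropped.
other.lines <- lines[!is.entry]
```

Whether the non-timestamp lines should be discarded, attached to the preceding entry, or parsed on their own depends on what they mean in your logs, so the sketch only sets them aside.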
Update 1:
Since you found out that your log file uses the UTF-16 little-endian encoding (checked with the file command in a Linux/OSX terminal), you have to pass the file encoding to read.csv so that R converts the file content correctly while reading:
log <- read.csv(file = "my.log", sep = "\n", header = F, fileEncoding = "UTF-16LE", encoding = "UTF-8")
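If you prefer to work with the raw lines rather than a data frame, the same conversion can be done with a connection and readLines(). This is only a sketch under assumptions: it first writes a small UTF-16LE file to a temporary path so that it is self-contained, whereas with a real log you would point file() at your own path (for example the "my.log" used above). The names tmp, sample.text, con and lines are illustrative.

```r
# Create a small UTF-16LE encoded file so the example can run on its own.
tmp <- tempfile(fileext = ".log")
sample.text <- "2017-06-21 00:00:30.483 START THIS THING\n[XYZ 1000 T1]:1\n"
writeBin(iconv(sample.text, from = "UTF-8", to = "UTF-16LE", toRaw = TRUE)[[1]], tmp)

# Open the file with its declared encoding; R re-encodes the content while reading.
con <- file(tmp, encoding = "UTF-16LE")
lines <- readLines(con)
close(con)

print(lines)
```

The resulting character vector has one element per log line and can be filtered with grepl() or passed on to other parsing code.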