Working with log files in R

Published: 2020/10/29 6:42:10 Tags: r, encoding

Problem description

I have a .log file whose data format is inconsistent.

The data looks something like this and is stored as "Little-endian UTF-16 Unicode" text:

2017-06-21 00:00:30.483 START THIS THING
2017-06-21 00:00:56.400 SOMETHING ELSE HAPPENS
     [XYZ 1000 T1]:1
2017-06-22 01:15:17.945 NOTHING 'D': 989
     [CASE] IN: [ID: 1010]33
     [CASE] IN: [ID: 2010]8
2017-06-21 00:00:30.483 START THIS THING
2017-06-21 00:00:56.400 SOMETHING ELSE HAPPENS
2017-06-21 00:00:30.483 START THIS THING
2017-06-21 00:00:56.400 SOMETHING ELSE HAPPENS
2017-06-21 00:00:30.483 START THIS THING
2017-06-21 00:00:56.400 SOMETHING ELSE HAPPENS

323133.....238813   76378    989899 000000000000

I have several log files that follow this kind of pattern. I have tried scan() and read.table(), but neither returns the data in the format I expect.

The data format I am expecting looks like this:

Date                          String
2017-06-21 00:00:30.483       START THIS THING

However, lines like these appear multiple times in the log files:

 [CASE] IN: [ID: 1010]33
 [CASE] IN: [ID: 2010]8

followed by:

323133.....238813   76378    989899 000000000000

What would be the best way to approach this problem? Thanks!

Recommended answer

Just a raw sketch in base R. It ignores the time part of your timestamp and the column names, and it makes no attempt at performance optimisation (such as using data.table::fread or the lubridate package):

log.data <- "2017-06-21 00:00:30.483 START THIS THING
2017-06-21 00:00:56.400 SOMETHING ELSE HAPPENS
     [XYZ 1000 T1]:1
2017-06-22 01:15:17.945 NOTHING 'D': 989
     [CASE] IN: [ID: 1010]33
     [CASE] IN: [ID: 2010]8
2017-06-21 00:00:30.483 START THIS THING
2017-06-21 00:00:56.400 SOMETHING ELSE HAPPENS
2017-06-21 00:00:30.483 START THIS THING
2017-06-21 00:00:56.400 SOMETHING ELSE HAPPENS
2017-06-21 00:00:30.483 START THIS THING
2017-06-21 00:00:56.400 SOMETHING ELSE HAPPENS

323133.....238813   76378    989899 000000000000"

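# Read one log line per row: with "\n" as the separator each line lands in column V1
log <- read.csv(text = log.data, sep = "\n", header = FALSE, stringsAsFactors = FALSE)
# Keep only the date part; lines that do not start with a date become NA
log$timestamp <- as.Date(log[, 1])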

The result is:

> log
                                                 V1  timestamp
1          2017-06-21 00:00:30.483 START THIS THING 2017-06-21
2    2017-06-21 00:00:56.400 SOMETHING ELSE HAPPENS 2017-06-21
3                                   [XYZ 1000 T1]:1       <NA>
4          2017-06-22 01:15:17.945 NOTHING 'D': 989 2017-06-22
5                           [CASE] IN: [ID: 1010]33       <NA>
6                            [CASE] IN: [ID: 2010]8       <NA>
7          2017-06-21 00:00:30.483 START THIS THING 2017-06-21
8    2017-06-21 00:00:56.400 SOMETHING ELSE HAPPENS 2017-06-21
9          2017-06-21 00:00:30.483 START THIS THING 2017-06-21
10   2017-06-21 00:00:56.400 SOMETHING ELSE HAPPENS 2017-06-21
11         2017-06-21 00:00:30.483 START THIS THING 2017-06-21
12   2017-06-21 00:00:56.400 SOMETHING ELSE HAPPENS 2017-06-21
13 323133.....238813   76378    989899 000000000000       <NA>
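
To get closer to the Date / String layout asked for in the question, the timestamp can be separated from the message with a regular expression. The following is only a minimal sketch in base R. It assumes that every log entry starts with a `YYYY-MM-DD HH:MM:SS.mmm` timestamp followed by a space, and that any other non-blank line (an indented continuation line or the trailing numeric line) should be kept with an NA timestamp. The names `parse_log_lines`, `ts_pattern`, `sample_lines`, `parsed` and `entries` are hypothetical and not part of the original answer:

```r
# Hypothetical helper: split raw log lines into a timestamp and a message.
# Assumption: entries start with "YYYY-MM-DD HH:MM:SS.mmm "; every other
# non-blank line is kept, with an NA timestamp.
parse_log_lines <- function(lines) {
  lines <- lines[nzchar(trimws(lines))]  # drop blank lines
  ts_pattern <- paste0(
    "^([0-9]{4}-[0-9]{2}-[0-9]{2} ",
    "[0-9]{2}:[0-9]{2}:[0-9]{2}\\.[0-9]{3}) (.*)$"
  )
  has_ts <- grepl(ts_pattern, lines)

  # Timestamp text only where the pattern matched, NA elsewhere
  ts_text <- ifelse(has_ts, sub(ts_pattern, "\\1", lines), NA_character_)
  # Message part for timestamped lines, trimmed raw line otherwise
  msg <- ifelse(has_ts, sub(ts_pattern, "\\2", lines), trimws(lines))

  data.frame(
    Date   = as.POSIXct(ts_text, format = "%Y-%m-%d %H:%M:%OS", tz = "UTC"),
    String = msg,
    stringsAsFactors = FALSE
  )
}

# A few lines taken from the sample data in the question
sample_lines <- c(
  "2017-06-21 00:00:30.483 START THIS THING",
  "     [XYZ 1000 T1]:1",
  "2017-06-22 01:15:17.945 NOTHING 'D': 989",
  "",
  "323133.....238813   76378    989899 000000000000"
)
parsed <- parse_log_lines(sample_lines)

# Keep only the rows that carry a timestamp
entries <- parsed[!is.na(parsed$Date), ]
```

The time zone is set to "UTC" here only as an assumption, since the log does not say which zone it uses. Fractional seconds are stored in the `Date` column but are not printed by default; setting `options(digits.secs = 3)` shows them.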

Update 1:

Since you found out that your log file uses the UTF-16 little-endian file encoding (checked with the file command of Linux/OSX in a terminal), you have to pass the file encoding to read.csv so that R converts the file content correctly while reading:

log <- read.csv(file = "my.log", sep = "\n", header = FALSE, fileEncoding = "UTF-16LE", encoding = "UTF-8")
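
As an alternative to read.csv, the file could also be read line by line through a connection that declares the encoding. This is a sketch only, assuming the file really is UTF-16LE as reported by `file`. The temporary file and its two lines are a stand-in so the example is self-contained; with a real log, the path to that file would be used instead. The names `tmp`, `txt`, `con` and `log_lines` are hypothetical:

```r
# Build a small UTF-16LE stand-in file (hypothetical content taken from
# the sample data; a real log file would be used instead).
tmp <- tempfile(fileext = ".log")
txt <- "2017-06-21 00:00:30.483 START THIS THING\n     [XYZ 1000 T1]:1\n"
writeBin(iconv(txt, from = "UTF-8", to = "UTF-16LE", toRaw = TRUE)[[1]], tmp)

# Read it back; the connection converts UTF-16LE to the native encoding
con <- file(tmp, encoding = "UTF-16LE")
log_lines <- readLines(con)
close(con)

# Remove the stand-in file again
unlink(tmp)
```

Here `log_lines` is a plain character vector with one element per line, which can be easier to filter or split than a one-column data frame.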
