如何在Linux中加快awk命令? [英] How to speed up awk command in linux?

查看:73
本文介绍了如何在Linux中加快awk命令?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在linux中使用awk命令将utc转换为本地时间,但是文件大小很大(> 30 gb),并且将花费一个多小时.

I am using awk command in linux to convert utc to localtime, but the file size is huge (>30 gb) and it would cost more than an hour.

这是我的代码:

awk -F"," '{cmd="date -d \"$(date -d \""$1"\")-4hours\" \"+%Y%m%d_%H\"";cmd | getline datum; close(cmd); print $0 ","datum""}' data.txt

如何加快此命令的速度,或者有没有简单的方法可以在Linux中进行此转换?

How can I speed up this command or is there any simple way to do this conversion in linux?

Here is the sample of input:

utc,id
2018-03-31 16:00:49.425,4485
2018-04-1 17:01:19.425,30019
2018-05-31 18:01:49.425,15427
2018-08-20 19:01:55.425,17579
2018-09-2 20:02:31.425,23716
2018-10-15 21:03:34.425,24772

expected output:

utc,id,localtime
2018-03-31 16:00:49.425,4485,20180331_12
2018-04-1 17:01:19.425,30019,20180401_13
2018-05-31 18:01:49.425,15427,20180531_14
2018-08-20 19:01:55.425,17579,20180820_15
2018-09-2 20:02:31.425,23716,20180902_16
2018-10-15 21:03:34.425,24772,20181015_17

推荐答案

原始解决方案运行缓慢的原因是由于对 date 的系统调用.用awk处理的每个记录/行都调用一个外部命令来执行日期转换.此类外部调用需要加载到内存中,然后执行,其输出需要由awk处理.如果我们可以在awk本身中进行实际的日期转换,则可以加快速度.

The reason why your original solution is slow is due to the system call to date. Each record/line you process with awk calls an external command to perform the date-conversion. Such an external call needs to be loaded into memory, executed and its output needs to be processed by awk. We can speed this up if we can do the actual date-conversion in awk itself.

一般性评论::将日期和时间从UTC转换为本地时区时,必须考虑到1月1日与8月1日所在的时区不同.这是由于白天-节约时间.以下算法无法提供解决方案,因为OP刚刚要求将班次更改为4h或更改为当前时区.(备注:适用于gawk 4.1.2或更高版本的解决方案将考虑DST)

general comment: when you convert dates and times from UTC to your local timezone, you have to take into account that January 1 is in a different timezone than August 1. This is due to daylight-saving time. The algorithm below does not provide a solution to that as the OP just requested a shift of 4h, or a shift to his current time-zone. (remark: the solution for gawk 4.1.2 or newer will take DST into account)

下面,我根据您使用的awk提出了几种解决方案,您可以使用它们:

Below I present several solutions which you can use depending on which awk you use:

Gnu awk : gawk 的各种扩展之一是时间功能.此问题的两个有用的时间函数是 mktime strftime :

Gnu awk: One of the various extensions of gawk are Time-functions. The two useful time-functions for this problem are mktime and strftime:

  • mktime(datespec,[utc-flag]) :这将转换为以下格式的日期指定字符串 datespec YYYY MM DD hh mm ss 转换为Unix纪元时间,即自1970 01 01 UTC以来的总秒数.从gawk-4.2.1开始,您可以使用 utc-flag 指示 datespec 是否采用UTC.在gawk-4.2.1之前,它采用本地时区.

  • mktime(datespec,[utc-flag]): This converts a date specification string, datespec, of the form YYYY MM DD hh mm ss into a Unix epoch time, i.e. the total seconds since 1970 01 01 UTC. Since gawk-4.2.1 you can use the utc-flag to indicate datespec is in UTC or not. Prior to gawk-4.2.1 it assumed the local time-zone.

strftime(format,timestamp,[utc-flag]) :这会将时代时间 timestamp 转换为格式化的字符串(格式与 date 命令相同).您可以使用 utc-flag 指示返回的时间应采用UTC或本地时区.

strftime(format,timestamp,[utc-flag]): This converts an epoch-time timestamp into a formatted string (same formatting as the date command). You can use the utc-flag the indicate that the returned time should be in UTC or in the local time-zone.

GNU awk手册中的更多信息.

我们要将字段1从UTC转换为本地时区.由于我们不知道字段1的格式,因此我们假设存在一个函数 convert_time(str),该函数将 str 格式化为 YYYY MM DD hh mm可以被 mktime :

We want to convert field 1 from UTC to local time-zone. Since we do not know the format of field 1, we assume the existence of a function convert_time(str) which formats str into the format YYYY MM DD hh mm ss that can be accepted by mktime:

gawk 4.1.2或更高版本:

$ awk 'BEGIN{FS=OFS=","}
      { # convert $1 into YYYY MM DD hh mm ss
        datestring=convert_time($1)
        # convert datestring (as UTC) into epoch 
        datum=mktime(datestring,1)
        # convert epoch into string (local TZ)
        datum=strftime("\042%Y%m%d_%H\042",datum)
        # print and append
        print $0,datum
      }' data.txt

gawk 4.1.2之前的版本:在这里,我们无法使用 utc-flag ,因此我们迫使awk在UTC中工作:

prior to gawk 4.1.2: Here we cannot make use of the utc-flag, so we force awk to work in UTC:

$ TZ=UTC awk 'BEGIN{FS=OFS=","}
             { # convert $1 into YYYY MM DD hh mm ss
               datestring=convert_time($1)
               # convert datestring (as UTC) into epoch 
               datum=mktime(datestring)
               # perform TZ correction
               datum-=4*3600;
               # convert epoch into string (local TZ)
               datum=strftime("\042%Y%m%d_%H\042",datum)
               # print and append
               print $0,datum
              }' data.txt

POSIX awk::如果您没有GNU awk,但没有任何其他awk,则不能使用这些时间功能,因为它们是GNU awk特有的.但是可以实现它们:

POSIX awk: if you do not have GNU awk, but any other awk, you cannot use those time-functions as they are GNU awk specific. It is however possible to implement them:

awk '

# Algorithm from "Astronomical Algorithms" By J.Meeus
function mktime_posix(datestring,    a,t) {
    split(datestring,a," ")
    if (a[1] < 1970) return -1
    if (a[2] <= 2) { a[1]--; a[2]+=12 }
    t=int(a[1]/100); t=2-t+int(t/4)
    t=int(365.25*a[1]) + int(30.6001*(a[2]+1)) + a[3] + t - 719593
    return t*86400 + a[4]*3600 + a[5]*60 + a[6]
}

function strftime_posix(epoch, JD,yyyy,mm,dd,HH,MM,SS,A,B,C,D,E ) {
    if (epoch < 0 ) return "0000 00 00 00 00 00.000000"
    JD=epoch; SS=JD%60; JD-=SS; JD/=60; MM=JD%60;
    JD-=MM; JD/=60; HH=JD%24; JD-=HH; JD/=24;
    JD+=2440588
    A=int((JD - 1867216.25)/(36524.25))
    A=JD+1+A-int(A/4)
    B=A+1524; C=int((B-122.1)/365.25); D=int(365.25*C); E=int((B-D)/30.6001)
    dd=B-D-int(30.6001*E)
    mm = E < 14 ? E-1 : E - 13
    yyyy=mm>2?C-4716:C-4715
    return sprintf("\042%0.4d-%0.2d-%0.2d %0.2d:%0.2d:%06.3f",yyyy,mm,dd,HH,MM,SS)
}

{ # convert $1 into YYYY MM DD hh mm ss
  datestring=convert_time($1)
  # convert datestring (as UTC) into epoch 
  datum=mktime_posix(datestring)
  # perform TZ correction
  datum-=4*3600;
  # convert epoch into string (local TZ)
  datum=strftime_posix(datum)
  # print and append
  print $0,datum
}' data.txt

这篇关于如何在Linux中加快awk命令?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆