How to parse multi-line records (with awk?)


Problem description


I'm trying to figure out how to extract particular fields from multi line records separated by \n\n.

In this instance, it happens to be output from apt-cache, which is akin to Debian control files. See the output of apt-cache show "$package":

Package: caffeine
Priority: optional
Section: misc
Installed-Size: 641
Maintainer: Reuben Thomas <rrt@sc3d.org>
Architecture: all
Version: 2.8.3
Depends: python3:any (>= 3.3.2-2~), python3, gir1.2-gtk-3.0, gir1.2-appindicator3-0.1, python3-xlib, python3-pkg-resources, libnet-dbus-perl
Filename: pool/main/c/caffeine/caffeine_2.8.3_all.deb
Size: 58774
MD5sum: 4438db3f6d1cf43a4f4b49cc7f24cda0
SHA1: e748370ac5ccd7de6fc9466ce0451d2e90d179d4
SHA256: ae303b4e32949cc1e1af80df7217e3406291679e3f18fa8f78a5bbb97504c4f6
Description-en: Prevent the desktop becoming idle in full-screen mode
 Caffeine stops the desktop becoming idle when an application
 is running full-screen. A desktop indicator ‘caffeine-indicator’
 supplies a manual toggle, and the command ‘caffeinate’ can be used
 to prevent idleness for the duration of any command.
Description-md5: 7c14f8adc007b10f6ecafed36260bedb

Package: caffeine
Priority: optional
Section: misc
Installed-Size: 655
Maintainer: Reuben Thomas <rrt@sc3d.org>
Architecture: all
Version: 2.6+555~ubuntu14.04.1
Depends: python:any (<< 2.8), python:any (>= 2.7.5-5~), python, gir1.2-gtk-2.0, gir1.2-appindicator3-0.1, x11-utils, python-dbus
Filename: pool/main/c/caffeine/caffeine_2.6+555~ubuntu14.04.1_all.deb
Size: 58604
MD5sum: 1051c3f7d40d344f986bb632d7436849
SHA1: 5e5f622595e8cbba8fb7468b3cffe2914b0ba110
SHA256: 11c5bbf2d28dcda6a7b82872195f740f1f79521b60d3c9acea3037bf0ab3a60e
Description: Prevent the desktop becoming idle
 Caffeine allows the user to prevent the desktop becoming idle,
 either manually or when certain applications are run. This
 prevents screen-blanking, locking, suspending, and so on.
Description-md5: 738866350e5086e77408d7a9c7ffa59b

Package: caffeine
Status: install ok installed
Priority: optional
Section: misc
Installed-Size: 794
Maintainer: Isaiah Heyer <freshapplepy@gmail.com>
Architecture: all
Version: 2.4.1+478~raring1
Depends: dconf-gsettings-backend | gsettings-backend, python (>= 2.6), python-central (>= 0.6.11), python-xlib, python-appindicator, python-xdg, python-notify, python-kaa-metadata
Description: Caffeine
 A status bar application able to temporarily prevent the activation
 of both the screensaver and the "sleep" powersaving mode.
Description-md5: 1c29acf1ab0f2e6636db29fbde1d14a3
Homepage: https://launchpad.net/caffeine
Python-Version: >= 2.6

My desired output is one line per record in the format apt-get download $pkg=$ver -a=$arch. Basically a list of the installation commands for available packages...

So far what I've got is apt-cache show "$package" | awk '/^Package: / { print $2 } /^Version: / { print $2 } /^Architecture: / { print $2 }' | xargs -n3 | awk '{printf "apt-get download %s=%s -a=%s\n", $1, $3, $2}'

This is the actual output:

apt-get download caffeine=2.8.3 -a=all
apt-get download caffeine=2.6+555~ubuntu14.04.1 -a=all
apt-get download caffeine=2.4.1+478~raring1 -a=all

This is as desired, but it appears to be a fluke: it works only because the order of the fields is consistent in this example. It would break if the order of the fields were different.
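To make the order dependence concrete, here is a small sketch that feeds the same pipeline a made-up record in which Version comes before Architecture. The `demo` package and the `order_dependent` function name are assumptions for illustration only; the sample text stands in for real apt-cache output.

```shell
# Hypothetical record in which Version precedes Architecture.
record='Package: demo
Version: 1.0
Architecture: amd64
'

# The pipeline from above, reading the sample record on stdin
# instead of apt-cache output.
order_dependent() {
  awk '/^Package: / { print $2 } /^Version: / { print $2 } /^Architecture: / { print $2 }' |
    xargs -n3 |
    awk '{printf "apt-get download %s=%s -a=%s\n", $1, $3, $2}'
}

printf '%s\n' "$record" | order_dependent
# prints: apt-get download demo=amd64 -a=1.0
```

The version and architecture end up swapped, because the second awk assumes a fixed position for each value rather than looking at the field names.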

I can do parsing like this using object orientation in Python, but I'm having difficulty getting it done in one awk command. The only way I can see to do this correctly would be to split each record into individual tmp files (using split or something along those lines) and then parse each file individually (which is straightforward). Obviously I'd really like to avoid unnecessary I/O, as this seems like something that awk is well equipped for. Do any awk pros know how to solve this? I'd even be open to a Perl one-liner or to using bash, but I'm really interested in learning how to better leverage awk.

Solution

$ package=sed
$ apt-cache show "$package" | awk '/^Package: /{p=$2} /^Version: /{v=$2} /^Architecture: /{a=$2} /^$/{print "apt-get download "p"="v" -a="a}' 
apt-get download sed=4.2.1-10 -a=amd64

How it works

  • /^Package: /{p=$2}

    Save the package information in variable p.

  • /^Version: /{v=$2}

    Save the version information in variable v.

  • /^Architecture: /{a=$2}

    Save the architecture information in variable a.

  • /^$/{print "apt-get download "p"="v" -a="a}

    When we reach a blank line, print out the information in the desired form.

    My version of apt-cache always outputs a blank line after each package. Your sample output is missing the last blank line. If your apt-cache genuinely does not produce that last blank line, then we will need to add a little bit more code to compensate.

    As a matter of style, some may prefer printf to print. In that case, replace the final clause above with:

    /^$/{printf "apt-get download %s=%s -a=%s\n",p,v,a}'
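If the final blank line really is missing, one possible way to compensate is to also emit the pending record in an END block. The following is a minimal sketch under that assumption, not a tested replacement for the one-liner above: the `emit` helper, the `build_commands` wrapper, and the sample `demo` records are all names introduced here for illustration, and the sample text stands in for real apt-cache output.

```shell
# Sample input standing in for apt-cache output (hypothetical records).
# The second record has no trailing blank line.
sample='Package: demo
Version: 1.0
Architecture: amd64

Package: demo
Architecture: all
Version: 0.9'

build_commands() {
  awk '
    # emit() is a helper introduced for this sketch: print one command
    # if a package name has been collected, then reset the variables.
    function emit() {
      if (p != "") printf "apt-get download %s=%s -a=%s\n", p, v, a
      p = v = a = ""
    }
    /^Package: /      { p = $2 }
    /^Version: /      { v = $2 }
    /^Architecture: / { a = $2 }
    /^$/              { emit() }   # blank line: end of a record
    END               { emit() }   # last record when no blank line follows
  '
}

printf '%s\n' "$sample" | build_commands
# prints:
# apt-get download demo=1.0 -a=amd64
# apt-get download demo=0.9 -a=all
```

With real data, the last line would become apt-cache show "$package" | build_commands. Resetting the variables after each record also keeps a value from one record from leaking into the next if a field happens to be absent, and the check on p means an input that does end with a blank line is not printed twice.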
    
