How to parse multi-line records (with awk?)


Question

I'm trying to figure out how to extract particular fields from multi-line records separated by blank lines.

In this instance, it happens to be output from apt-cache, akin to DEBIAN control files. See the output of apt-cache show "$package":

Package: caffeine
Priority: optional
Section: misc
Installed-Size: 641
Maintainer: Reuben Thomas <rrt@sc3d.org>
Architecture: all
Version: 2.8.3
Depends: python3:any (>= 3.3.2-2~), python3, gir1.2-gtk-3.0, gir1.2-appindicator3-0.1, python3-xlib, python3-pkg-resources, libnet-dbus-perl
Filename: pool/main/c/caffeine/caffeine_2.8.3_all.deb
Size: 58774
MD5sum: 4438db3f6d1cf43a4f4b49cc7f24cda0
SHA1: e748370ac5ccd7de6fc9466ce0451d2e90d179d4
SHA256: ae303b4e32949cc1e1af80df7217e3406291679e3f18fa8f78a5bbb97504c4f6
Description-en: Prevent the desktop becoming idle in full-screen mode
 Caffeine stops the desktop becoming idle when an application
 is running full-screen. A desktop indicator ‘caffeine-indicator’
 supplies a manual toggle, and the command ‘caffeinate’ can be used
 to prevent idleness for the duration of any command.
Description-md5: 7c14f8adc007b10f6ecafed36260bedb

Package: caffeine
Priority: optional
Section: misc
Installed-Size: 655
Maintainer: Reuben Thomas <rrt@sc3d.org>
Architecture: all
Version: 2.6+555~ubuntu14.04.1
Depends: python:any (<< 2.8), python:any (>= 2.7.5-5~), python, gir1.2-gtk-2.0, gir1.2-appindicator3-0.1, x11-utils, python-dbus
Filename: pool/main/c/caffeine/caffeine_2.6+555~ubuntu14.04.1_all.deb
Size: 58604
MD5sum: 1051c3f7d40d344f986bb632d7436849
SHA1: 5e5f622595e8cbba8fb7468b3cffe2914b0ba110
SHA256: 11c5bbf2d28dcda6a7b82872195f740f1f79521b60d3c9acea3037bf0ab3a60e
Description: Prevent the desktop becoming idle
 Caffeine allows the user to prevent the desktop becoming idle,
 either manually or when certain applications are run. This
 prevents screen-blanking, locking, suspending, and so on.
Description-md5: 738866350e5086e77408d7a9c7ffa59b

Package: caffeine
Status: install ok installed
Priority: optional
Section: misc
Installed-Size: 794
Maintainer: Isaiah Heyer <freshapplepy@gmail.com>
Architecture: all
Version: 2.4.1+478~raring1
Depends: dconf-gsettings-backend | gsettings-backend, python (>= 2.6), python-central (>= 0.6.11), python-xlib, python-appindicator, python-xdg, python-notify, python-kaa-metadata
Description: Caffeine
 A status bar application able to temporarily prevent the activation
 of both the screensaver and the "sleep" powersaving mode.
Description-md5: 1c29acf1ab0f2e6636db29fbde1d14a3
Homepage: https://launchpad.net/caffeine
Python-Version: >= 2.6

My desired output is one line per record in the format apt-get download $pkg=$ver -a=$arch. Basically, a list of the installation commands for the available packages.

So far what I've got is:

apt-cache show "$package" | awk '/^Package: / { print $2 } /^Version: / { print $2 } /^Architecture: / { print $2 }' | xargs -n3 | awk '{printf "apt-get download %s=%s -a=%s\n", $1, $3, $2}'

Here is the actual output:

apt-get download caffeine=2.8.3 -a=all
apt-get download caffeine=2.6+555~ubuntu14.04.1 -a=all
apt-get download caffeine=2.4.1+478~raring1 -a=all

This is as desired, but it appears to be a fluke: it works only because the order of the fields is consistent in this example. It would break if the order of the fields were different.

I can do parsing like this using object orientation in Python, but I'm having difficulty getting it done in one awk command. The only way I can see to do this correctly would be to split each record into individual tmp files (using split or something along those lines) and then parse each file individually (which is straightforward). Obviously I'd really like to avoid unnecessary I/O, as this seems like something awk is well equipped for. Do any awk pros know how to solve this? I'd even be open to a Perl one-liner or to using bash, but I'm really interested in learning how to better leverage awk.

Recommended answer

$ package=sed
$ apt-cache show "$package" | awk '/^Package: /{p=$2} /^Version: /{v=$2} /^Architecture: /{a=$2} /^$/{print "apt-get download "p"="v" -a="a}' 
apt-get download sed=4.2.1-10 -a=amd64

How it works

  • /^Package: /{p=$2}

    Save the package name in variable p.

  • /^Version: /{v=$2}

    Save the version in variable v.

  • /^Architecture: /{a=$2}

    Save the architecture in variable a.

  • /^$/{print "apt-get download "p"="v" -a="a}

    When we reach a blank line, print the saved information in the desired form. Because each field is stored by name, the order in which the fields appear within a record does not matter.

    My version of apt-cache always outputs a blank line after each package. Your sample output is missing the last blank line. If your apt-cache genuinely does not produce that last blank line, the final record will never be printed, and we will need to add a little more code to compensate.
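One possible way to compensate for a missing final blank line is to print the pending record from an END rule as well. This is a minimal sketch, not part of the original answer: the function name records_to_commands and the awk helper emit are hypothetical names chosen for illustration, and the sketch assumes every record contains a Package: line.

```shell
# Hypothetical helper: reads apt-cache-style records on stdin.
# Usage: apt-cache show "$package" | records_to_commands
records_to_commands() {
    awk '
        # Print the pending record (if any), then reset the saved fields.
        function emit() {
            if (p != "") print "apt-get download " p "=" v " -a=" a
            p = v = a = ""
        }
        /^Package: /      { p = $2 }
        /^Version: /      { v = $2 }
        /^Architecture: / { a = $2 }
        /^$/              { emit() }   # a blank line ends a record
        END               { emit() }   # flush a last record with no blank line after it
    '
}
```

Resetting the variables after each record also keeps a field from leaking into the next record if that record happens to lack it.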

    As a matter of style, some may prefer printf to print. In that case, replace the final rule (the /^$/ one) with:

    /^$/{printf "apt-get download %s=%s -a=%s\n",p,v,a}'
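As an alternative that is not taken from the answer above, awk's paragraph mode (RS set to the empty string) treats each blank-line-separated block as one record, so a missing trailing blank line no longer matters. The sketch below is illustrative only: the function name paragraphs_to_commands is hypothetical, and it assumes each field of interest sits on its own line in the form "Name: value".

```shell
# Hypothetical helper: reads apt-cache-style records on stdin.
# Usage: apt-cache show "$package" | paragraphs_to_commands
paragraphs_to_commands() {
    awk '
        # Paragraph mode: one record per blank-line-separated block,
        # and one field per line within that block.
        BEGIN { RS = ""; FS = "\n" }
        {
            p = v = a = ""
            for (i = 1; i <= NF; i++) {
                if      ($i ~ /^Package: /)      p = substr($i, length("Package: ") + 1)
                else if ($i ~ /^Version: /)      v = substr($i, length("Version: ") + 1)
                else if ($i ~ /^Architecture: /) a = substr($i, length("Architecture: ") + 1)
            }
            if (p != "") printf "apt-get download %s=%s -a=%s\n", p, v, a
        }
    '
}
```

Indented continuation lines, such as the body of a Description field, simply become extra fields that match none of the three patterns and are ignored.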
    
