How to parse XML in Bash?

Question

Ideally, what I would like to be able to do is:

cat xhtmlfile.xhtml |
getElementViaXPath --path='/html/head/title' |
sed -E -e 's%(^<title>|</title>$)%%g' > titleOfXHTMLPage.txt

Recommended answer

This is really just an explanation of Yuzem's answer, but I didn't feel like this much editing should be done to someone else's answer, and comments don't allow formatting, so...

rdom () { local IFS=\> ; read -d \< E C ;}

Let's call that "read_dom" instead of "rdom", space it out a bit and use longer variables:

read_dom () {
    local IFS=\>
    read -d \< ENTITY CONTENT
}

Okay, so it defines a function called read_dom. The first line makes IFS (the internal field separator) local to this function and changes it to '>'. That means that when you read data, instead of automatically being split on spaces, tabs or newlines, it gets split on '>'. The next line says to read input from stdin and, instead of stopping at a newline, to stop when you see a '<' character (the -d is the delimiter flag). What is read is then split using the IFS and assigned to the variables ENTITY and CONTENT. So take the following:

<tag>value</tag>

The first call to read_dom gets an empty string (since the '<' is the first character). That gets split by IFS into just '', since there isn't a '>' character. Read then assigns an empty string to both variables. The second call gets the string 'tag>value'. That then gets split by the IFS into the two fields 'tag' and 'value'. Read then assigns the variables like: ENTITY=tag and CONTENT=value. The third call gets the string '/tag>'. That gets split by the IFS into the two fields '/tag' and ''. Read then assigns the variables like: ENTITY=/tag and CONTENT= (apart from any trailing newline in the input). Because end of file is reached before another '<' is seen, read returns a non-zero status on this call, so a `while read_dom` loop ends here.
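To watch these calls happen, here is a minimal sketch that feeds the one-line sample through read_dom and prints what each successful call assigned. The helper name trace_dom is made up for illustration, and the `-r` flag (which stops read from interpreting backslashes) is an addition that is not in the original function.

```bash
#!/usr/bin/env bash

read_dom () {
    local IFS='>'                   # split fields on '>' only inside this function
    read -r -d '<' ENTITY CONTENT   # read up to the next '<'
}

# trace_dom (hypothetical helper): print ENTITY and CONTENT for every
# call to read_dom that returns a zero status.
trace_dom () {
    while read_dom; do
        printf 'ENTITY=[%s] CONTENT=[%s]\n' "$ENTITY" "$CONTENT"
    done
}

printf '<tag>value</tag>' | trace_dom
```

Only calls that return zero are printed, so a final chunk that runs into end of input without reaching another '<' is assigned to the variables but never shown by the loop.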

Now his while loop cleaned up a bit to match the above:

while read_dom; do
    if [[ $ENTITY = "title" ]]; then
        echo $CONTENT
        exit
    fi
done < xhtmlfile.xhtml > titleOfXHTMLPage.txt

The first line just says, "while the read_dom function returns a zero status, do the following." The second line checks if the entity we've just seen is "title". The next line echoes the content of the tag. The fourth line exits. If it wasn't the title entity then the loop repeats on the sixth line. We redirect "xhtmlfile.xhtml" into standard input (for the read_dom function) and redirect standard output to "titleOfXHTMLPage.txt" (the echo from earlier in the loop).
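If the loop lives inside a larger script, calling exit would end the whole script. One possible variation, sketched here under the assumption that you want a reusable function, wraps the loop and uses return instead. The name get_title and the sample markup are invented for the example.

```bash
#!/usr/bin/env bash

read_dom () {
    local IFS='>'
    read -r -d '<' ENTITY CONTENT
}

# get_title (hypothetical helper): print the text of the first <title>
# element found on standard input, then stop reading.
get_title () {
    while read_dom; do
        if [[ $ENTITY = "title" ]]; then
            printf '%s\n' "$CONTENT"
            return 0
        fi
    done
    return 1   # no <title> element was seen
}

# Made-up sample document for illustration.
sample='<html><head><title>My Page</title></head><body/></html>'
get_title <<< "$sample"
```

Note that the comparison is against the whole ENTITY string, so a tag written with attributes, such as `<title lang="en">`, would not match this test.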

Now given the following (similar to what you get from listing a bucket on S3) for input.xml:

<ListBucketResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
  <Name>sth-items</Name>
  <IsTruncated>false</IsTruncated>
  <Contents>
    <Key>item-apple-iso@2x.png</Key>
    <LastModified>2011-07-25T22:23:04.000Z</LastModified>
    <ETag>&quot;0032a28286680abee71aed5d059c6a09&quot;</ETag>
    <Size>1785</Size>
    <StorageClass>STANDARD</StorageClass>
  </Contents>
</ListBucketResult>

and the following loop:

while read_dom; do
    echo "$ENTITY => $CONTENT"
done < input.xml

you should get:

 => 
ListBucketResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/" => 
Name => sth-items
/Name => 
IsTruncated => false
/IsTruncated => 
Contents => 
Key => item-apple-iso@2x.png
/Key => 
LastModified => 2011-07-25T22:23:04.000Z
/LastModified => 
ETag => &quot;0032a28286680abee71aed5d059c6a09&quot;
/ETag => 
Size => 1785
/Size => 
StorageClass => STANDARD
/StorageClass => 
/Contents => 

So if we wrote a while loop like Yuzem's:

while read_dom; do
    if [[ $ENTITY = "Key" ]] ; then
        echo $CONTENT
    fi
done < input.xml

We'd get a listing of all the files in the S3 bucket.
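If you need the keys for further processing rather than just printed, one option is to collect them into a Bash array. The following is a sketch only: list_keys is a hypothetical helper, and the two-object listing is made-up sample data, much shorter than a real S3 response.

```bash
#!/usr/bin/env bash

read_dom () {
    local IFS='>'
    read -r -d '<' ENTITY CONTENT
}

# list_keys (hypothetical helper): print the content of every <Key> element.
list_keys () {
    while read_dom; do
        if [[ $ENTITY = "Key" ]]; then
            printf '%s\n' "$CONTENT"
        fi
    done
}

# Made-up sample with two objects; real S3 listings carry more fields.
xml='<ListBucketResult>
  <Contents><Key>item-apple-iso@2x.png</Key><Size>1785</Size></Contents>
  <Contents><Key>item-pear-iso@2x.png</Key><Size>2048</Size></Contents>
</ListBucketResult>'

# Capture the keys in an array, one element per line of output.
keys=()
while IFS= read -r key; do
    keys+=("$key")
done < <(list_keys <<< "$xml")

printf 'found %d keys\n' "${#keys[@]}"
```

The process substitution `< <(...)` keeps the outer loop in the current shell, so the keys array is still populated after the loop finishes.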

EDIT If for some reason local IFS=\> doesn't work for you and you set it globally, you should reset it at the end of the function like:

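read_dom () {
    ORIGINAL_IFS=$IFS
    IFS=\>
    read -d \< ENTITY CONTENT
    # Save read's status; otherwise the function returns the status of the
    # IFS assignment below (always 0) and a while loop never ends.
    local ret=$?
    IFS=$ORIGINAL_IFS
    return $ret
}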

Otherwise, any line splitting you do later in the script will be messed up.
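Another way to avoid touching the global IFS, offered here as an alternative sketch rather than as part of the original answer, is to put the assignment directly in front of read. Bash then applies it to that single command only, so nothing needs to be saved or restored.

```bash
#!/usr/bin/env bash

read_dom () {
    # The IFS assignment is a prefix to read, so it applies to that one
    # command only and the caller's IFS is left untouched.
    IFS='>' read -r -d '<' ENTITY CONTENT
}

before=$IFS

# Two calls on the same input: the first consumes the empty text before
# the first '<', the second reads "Name>sth-items".
{ read_dom; read_dom; } <<< '<Name>sth-items</Name>'

echo "$ENTITY => $CONTENT"
[[ $IFS == "$before" ]] && echo "IFS unchanged"
```

Because read is the last command in the function, its exit status is also what read_dom returns, so a `while read_dom` loop still terminates at end of input.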

EDIT 2 To split out attribute name/value pairs you can augment the read_dom() like so:

read_dom () {
    local IFS=\>
    read -d \< ENTITY CONTENT
    local ret=$?
    TAG_NAME=${ENTITY%% *}
    ATTRIBUTES=${ENTITY#* }
    return $ret
}
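The two parameter expansions do the splitting: `%% *` removes the longest suffix that starts at a space, and `#* ` removes the shortest prefix that ends at a space. A small sketch, using a hypothetical helper named split_entity, shows what they produce for a tag with and without attributes.

```bash
#!/usr/bin/env bash

# split_entity (hypothetical helper): apply the same two expansions to a
# sample ENTITY string and print the results.
split_entity () {
    local ENTITY=$1
    local TAG_NAME=${ENTITY%% *}    # drop the longest suffix starting at a space
    local ATTRIBUTES=${ENTITY#* }   # drop the shortest prefix ending at a space
    printf 'TAG_NAME=[%s] ATTRIBUTES=[%s]\n' "$TAG_NAME" "$ATTRIBUTES"
}

split_entity 'foo size="1789" type="unknown"'
split_entity 'example'
```

When the tag has no attributes there is no space to match, so ATTRIBUTES ends up equal to the tag name itself. Code that consumes ATTRIBUTES may want to check for that case.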

Then write your function to parse and get the data you want like this:

parse_dom () {
    if [[ $TAG_NAME = "foo" ]] ; then
        eval local $ATTRIBUTES
        echo "foo size is: $size"
    elif [[ $TAG_NAME = "bar" ]] ; then
        eval local $ATTRIBUTES
        echo "bar type is: $type"
    fi
}

Then, inside the loop that calls read_dom, call parse_dom:

while read_dom; do
    parse_dom
done

Then given the following example markup:

<example>
  <bar size="bar_size" type="metal">bars content</bar>
  <foo size="1789" type="unknown">foos content</foo>
</example>

You should get this output:

$ cat example.xml | ./bash_xml.sh 
bar type is: metal
foo size is: 1789
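Since eval runs whatever text appears in the attribute string, it is only reasonable for input you trust. As a hedged alternative, not part of the original answer, the attributes can be picked apart with a regular expression and stored in an associative array. The names parse_attributes and ATTR are invented for this sketch, which assumes Bash 4 or later and double-quoted attribute values.

```bash
#!/usr/bin/env bash

# ATTR (hypothetical name) holds attribute name/value pairs; associative
# arrays need bash 4 or later.
declare -A ATTR

# parse_attributes (hypothetical helper): fill ATTR from a string of
# name="value" pairs without using eval.
parse_attributes () {
    local rest=$1
    local re='^[[:space:]]*([A-Za-z_:][A-Za-z0-9_:.-]*)="([^"]*)"(.*)$'
    ATTR=()
    while [[ $rest =~ $re ]]; do
        ATTR[${BASH_REMATCH[1]}]=${BASH_REMATCH[2]}
        rest=${BASH_REMATCH[3]}   # continue with the unparsed remainder
    done
}

parse_attributes 'size="1789" type="unknown"'
echo "size=${ATTR[size]} type=${ATTR[type]}"
```

The loop stops at the first piece of text that does not look like name="value", so single-quoted values or stray text are ignored rather than executed.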

EDIT 3 Another user said they were having problems with it in FreeBSD and suggested saving the exit status from read and returning it at the end of read_dom like:

read_dom () {
    local IFS=\>
    read -d \< ENTITY CONTENT
    local RET=$?
    TAG_NAME=${ENTITY%% *}
    ATTRIBUTES=${ENTITY#* }
    return $RET
}

I don't see any reason why that shouldn't work.
