Batch processing tab-delimited files in XSLT

Problem description

I have an XML file with a list of 92 tab-delimited text files:

<?xml version="1.0" encoding="UTF-8"?>
<dumpSet>
  <dump filename="file_one.txt"/>
  <dump filename="file_two.txt"/>
  <dump filename="file_three.txt"/>
  ...
</dumpSet>

The first row in each file contains the field names for the subsequent rows. This is just an example. The names and number of elements will vary by record. Most will have around 50 field names.

Title   Translated Title    Watch Video Interviewee Interviewer Videographer
Interview with Barack Obama         Obama, Barack   Walters, Barbara
Interview with Sarah Palin          Palin, Sarah    Couric, Katie   Smith, John
...

Oxygen XML Editor has an Import function that can convert text files to XML, but--as far as I know--this cannot be done in a batch process with multiple files. So far, the batch processing part has not been a problem. I am using XSLT 2.0's unparsed-text() function to pull in the content from the files in the list. However, I am struggling to group the XML output correctly. Example of desired output:

<collection>
  <record>
    <title>Interview with Barack Obama</title>
    <translatedtitle></translatedtitle>
    <watchvideo></watchvideo>
    <interviewee>Obama, Barack</interviewee>
    <interviewer>Walters, Barbara</interviewer>
    <videographer>Smith, John</videographer>
  </record>
  <record>
    <title>Interview with Sarah Palin</title>
    <translatedtitle></translatedtitle>
    <watchvideo></watchvideo>
    <interviewee>Palin, Sarah</interviewee>
    <interviewer>Couric, Katie</interviewer>
    <videographer>Smith, John</videographer>
  </record>
  ...
</collection>

Right now, here is the kind of output I am getting:

<collection>
  <record>
    <title>title</title>
    <value>Interview with Barack Obama</value>
    <value>Interview with Sarah Palin</value>
    <translatedtitle>translatedtitle</translatedtitle>
    <value/>
    <value/>
    <watchvideo>watchvideo</watchvideo>
    <value/>
    <value/>
    <interviewee>interviewee</interviewee>
    <value>Obama, Barack</value>
    <value>Palin, Sarah</value>
    <interviewer>interviewer</interviewer>
    <value>Walters, Barbara</value>
    <value>Couric, Katie</value>
    <videographer>videographer</videographer>
    <value>Smith, John</value>
    <value>Smith, John </value>
    <value/>
    <value/>
  </record>
</collection>

That is, I'm not able to group the output by record. Here's the current code I'm working with, based on an example in Doug Tidwell's XSLT book:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema" exclude-result-prefixes="#all" version="2.0">

    <xsl:param name="i" select="1"/>
    <xsl:param name="increment" select="1"/>
    <xsl:param name="operator" select="'&lt;='"/>
    <xsl:param name="testVal" select="100"/>    

    <xsl:template match="/">
        <collections>
            <collection>
                <xsl:for-each select="dumpSet/dump">

                    <!-- Pull in external tab-delimited files -->  
                    <xsl:for-each select="unparsed-text(concat('../2013-04-26/',@filename),'UTF-8')">
                        <record>

                            <!-- Call recursive template to loop through elements. -->
                            <xsl:call-template name="for-loop">
                                <xsl:with-param name="i" select="$i"/>
                                <xsl:with-param name="increment" select="$increment"/>
                                <xsl:with-param name="operator" select="$operator"/>
                                <xsl:with-param name="testVal" select="$testVal"/>
                            </xsl:call-template>
                        </record>
                    </xsl:for-each>
                </xsl:for-each>
            </collection>
        </collections>
    </xsl:template>

    <xsl:template name="for-loop">
        <xsl:param name="i"/>
        <xsl:param name="increment"/>
        <xsl:param name="operator"/>
        <xsl:param name="testVal"/>
        <xsl:variable name="testPassed">
            <xsl:choose>
                <xsl:when test="$operator = '&lt;='">
                    <xsl:if test="$i &lt;= $testVal">
                        <xsl:text>true</xsl:text>
                    </xsl:if>
                </xsl:when>
            </xsl:choose>
        </xsl:variable>
        <xsl:if test="$testPassed = 'true'">

            <!-- Separate the header from the tab-delimited file. -->
            <xsl:for-each select="tokenize(.,'\r|\n')[1]">

                <!-- Spit out the field names. -->
                <xsl:for-each select="tokenize(.,'\t')[$i]">
                    <xsl:element name="{replace(lower-case(translate(.,'-.','')),' ','')}">
                        <xsl:value-of select="replace(lower-case(translate(.,'-.','')),' ','')"/>
                    </xsl:element>
                </xsl:for-each>
            </xsl:for-each>

            <!-- For the following rows, loop through the field values. -->
            <xsl:for-each select="tokenize(.,'\r|\n')[position()&gt;1]">
                <xsl:for-each select="tokenize(.,'\t')[$i]">
                    <value>
                        <xsl:value-of select="."/>
                    </value>
                </xsl:for-each>
            </xsl:for-each>

            <!-- Call the template to increment. -->  
            <xsl:call-template name="for-loop">
                <xsl:with-param name="i" select="$i + $increment"/>
                <xsl:with-param name="increment" select="$increment"/>
                <xsl:with-param name="operator" select="$operator"/>
                <xsl:with-param name="testVal" select="$testVal"/>
            </xsl:call-template>
        </xsl:if>
    </xsl:template>
</xsl:stylesheet>

How should I change this to group the output by record?

Recommended answer

It might be easier if you use xsl:analyze-string to parse each record. There might be a better way to get the element names from the header than what is in my example, but I didn't have time to think about this too long.

Notes:

You may have to change the encoding for unparsed-text(). I usually pass the encoding in as a parameter so I don't have to modify the stylesheet. Maybe the encoding could be added to <dump/>?
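As a rough sketch of that idea, the encoding could be read from a stylesheet parameter, with an optional per-file override. The `encoding` parameter name and the `encoding` attribute on `<dump/>` are assumptions made for illustration; neither exists in the original input.

```xml
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

    <!-- Hypothetical parameter: the default encoding, overridable at run time. -->
    <xsl:param name="encoding" select="'UTF-8'"/>

    <xsl:template match="dump">
        <!-- Use the (hypothetical) encoding attribute on <dump/> if present,
             otherwise fall back to the stylesheet parameter. -->
        <xsl:variable name="enc" select="(@encoding, $encoding)[1]"/>
        <xsl:variable name="text" select="unparsed-text(@filename, $enc)"/>
        <!-- ... parse $text into records here ... -->
    </xsl:template>

</xsl:stylesheet>
```

With this in place, a file that needs a different encoding could be listed as `<dump filename="file_one.txt" encoding="iso-8859-1"/>`, and the default could be changed by passing a value for `encoding` when invoking the processor, without editing the stylesheet.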

It would be a good idea to use unparsed-text-available() to see if the file exists and can be read with the specified encoding.
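A minimal sketch of such a guard follows. It assumes UTF-8 and simply reports unreadable files with `xsl:message`; what to do with them is a choice left open here.

```xml
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

    <xsl:template match="dump">
        <xsl:choose>
            <!-- True only if the file can be found and decoded with this encoding. -->
            <xsl:when test="unparsed-text-available(@filename, 'UTF-8')">
                <xsl:variable name="text" select="unparsed-text(@filename, 'UTF-8')"/>
                <!-- ... parse $text into records here ... -->
            </xsl:when>
            <xsl:otherwise>
                <!-- Report the problem and carry on with the next file. -->
                <xsl:message>
                    <xsl:text>Skipping unreadable file: </xsl:text>
                    <xsl:value-of select="@filename"/>
                </xsl:message>
            </xsl:otherwise>
        </xsl:choose>
    </xsl:template>

</xsl:stylesheet>
```

Note that a relative URI passed to either function is resolved against the base URI of the stylesheet, not of the source XML document, so the file names in the list need to be relative to where the stylesheet lives.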

Also, you may want to check that each value from the header is a valid QName. For example, if a header contains an apostrophe, you'll get an error when it is used as an element name. It might be better to use the field names from the header as attribute values instead of element names. (Like: <field name="Interviewee">Obama, Barack</field>)
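One way to combine both suggestions is sketched below: use the header as an element name when it is safe to do so, and fall back to a generic `<field>` element otherwise. The `emit-field` template name and the sample header `Interviewer's Notes` are made up for illustration, and the check is for a name without a colon (an NCName), which is slightly stricter than a QName.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="xs">

    <!-- Hypothetical helper: write one field, choosing a safe representation. -->
    <xsl:template name="emit-field">
        <xsl:param name="header" as="xs:string"/>
        <xsl:param name="value" as="xs:string"/>
        <!-- Lower-case the header and remove whitespace, as in the answer. -->
        <xsl:variable name="name" select="replace(lower-case($header), '\s+', '')"/>
        <xsl:choose>
            <!-- \i and \c are the XML name-start and name character classes;
                 subtracting ':' restricts the test to names without a prefix. -->
            <xsl:when test="matches($name, '^[\i-[:]][\c-[:]]*$')">
                <xsl:element name="{$name}">
                    <xsl:value-of select="$value"/>
                </xsl:element>
            </xsl:when>
            <xsl:otherwise>
                <!-- Not usable as an element name: keep the header as an attribute value. -->
                <field name="{$header}">
                    <xsl:value-of select="$value"/>
                </field>
            </xsl:otherwise>
        </xsl:choose>
    </xsl:template>

    <!-- Small demonstration with one safe and one unsafe header. -->
    <xsl:template match="/">
        <record>
            <xsl:call-template name="emit-field">
                <xsl:with-param name="header" select="'Interviewee'"/>
                <xsl:with-param name="value" select="'Obama, Barack'"/>
            </xsl:call-template>
            <xsl:call-template name="emit-field">
                <xsl:with-param name="header" select="'Interviewer''s Notes'"/>
                <xsl:with-param name="value" select="'none'"/>
            </xsl:call-template>
        </record>
    </xsl:template>

</xsl:stylesheet>
```

In the demonstration, the first call takes the element-name branch and the second, because of the apostrophe, takes the `<field name="...">` branch. Using `<field>` for every column would be simpler still and gives a uniform structure across files with different headers.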

Here is my example:

XML input

<dumpSet>
  <dump filename="file_one.txt"/>
</dumpSet>

file_one.txt

Title   Translated Title    Watch Video Interviewee Interviewer Videographer
Interview with Barack Obama         Obama, Barack   Walters, Barbara
Interview with Sarah Palin          Palin, Sarah    Couric, Katie   Smith, John

XSLT 2.0

<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output indent="yes"/>
    <xsl:strip-space elements="*"/>

    <xsl:template match="dumpSet">
        <collection>
            <xsl:apply-templates select="dump[@filename]"/>
        </collection>
    </xsl:template>

    <xsl:template match="dump">
        <!-- Read the whole file as one string; adjust the encoding as needed. -->
        <xsl:variable name="text" select="unparsed-text(@filename, 'iso-8859-1')"/>
        <!-- '.' does not match a newline, so (..*) matches each non-empty line in turn.
             The first match is the header row. -->
        <xsl:variable name="header">
            <xsl:analyze-string select="$text" regex="(..*)">
                <xsl:matching-substring>
                    <xsl:if test="position()=1">
                        <xsl:value-of select="regex-group(1)"/>
                    </xsl:if>
                </xsl:matching-substring>
            </xsl:analyze-string>
        </xsl:variable>
        <xsl:variable name="headerTokens" select="tokenize($header,'\t')"/>
        <!-- Every line after the header becomes one <record>. -->
        <xsl:analyze-string select="$text" regex="(..*)">
            <xsl:matching-substring>
                <xsl:if test="not(position()=1)">
                    <record>
                        <!-- Each match is either a non-empty field (plus its trailing tab, if any)
                             or a lone tab, which stands for an empty field. -->
                        <xsl:analyze-string select="." regex="([^\t][^\t]*)\t?|\t">
                            <xsl:matching-substring>
                                <xsl:variable name="pos" select="position()"/>
                                <!-- Name the element after the header cell at the same position. -->
                                <xsl:element name="{replace(normalize-space(lower-case($headerTokens[$pos])),' ','')}">
                                    <xsl:value-of select="normalize-space(regex-group(1))"/>
                                </xsl:element>
                            </xsl:matching-substring>
                        </xsl:analyze-string>
                    </record>
                </xsl:if>
            </xsl:matching-substring>
        </xsl:analyze-string>
    </xsl:template>

</xsl:stylesheet>

Output

<collection>
   <record>
      <title>Interview with Barack Obama</title>
      <translatedtitle/>
      <watchvideo/>
      <interviewee>Obama, Barack</interviewee>
      <interviewer>Walters, Barbara</interviewer>
   </record>
   <record>
      <title>Interview with Sarah Palin</title>
      <translatedtitle/>
      <watchvideo/>
      <interviewee>Palin, Sarah</interviewee>
      <interviewer>Couric, Katie</interviewer>
      <videographer>Smith, John</videographer>
   </record>
</collection>
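
Note that the first record above has no videographer element, because that row ends before the last column. If every record should carry every column, one possible alternative is to drive the inner loop from the header instead of from the row. The following is only a sketch using tokenize() in place of xsl:analyze-string; it assumes that lines end in a newline (optionally preceded by a carriage return) and that every header cell is non-empty and yields a valid element name.

```xml
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output indent="yes"/>

    <xsl:template match="dumpSet">
        <collection>
            <xsl:apply-templates select="dump[@filename]"/>
        </collection>
    </xsl:template>

    <xsl:template match="dump">
        <!-- Split the file into lines and drop blank ones. -->
        <xsl:variable name="lines"
            select="tokenize(unparsed-text(@filename, 'iso-8859-1'), '\r?\n')[normalize-space(.)]"/>
        <!-- Element names from the header row: lower-cased, whitespace removed. -->
        <xsl:variable name="names"
            select="for $h in tokenize($lines[1], '\t')
                    return replace(lower-case($h), '\s+', '')"/>
        <!-- One <record> per data row. -->
        <xsl:for-each select="$lines[position() gt 1]">
            <xsl:variable name="values" select="tokenize(., '\t')"/>
            <record>
                <!-- Loop over the header so short rows still get every element. -->
                <xsl:for-each select="$names">
                    <xsl:variable name="pos" select="position()"/>
                    <xsl:element name="{.}">
                        <xsl:value-of select="normalize-space($values[$pos])"/>
                    </xsl:element>
                </xsl:for-each>
            </record>
        </xsl:for-each>
    </xsl:template>

</xsl:stylesheet>
```

Because tokenize() keeps empty tokens between consecutive tabs, column positions stay aligned with the header, and a value missing at the end of a row comes out as an empty element.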
