Using XSL to convert XML to a text file just returns all the text concatenated instead of formatted
Question
My company processes a lot of product feeds using Hadoop. We have a process that extracts exactly one product node and makes it a single line in a file. We then use XSL to convert the product XML to a single-line, triple-pipe-delimited file. This has worked well so far. However, I ran into an issue with one client: they made some changes in the new XML file and are using some namespaces, which caused things to break. (I had to modify the links in the XML so I could post it; I changed http to httc.) The original XML file was set up like this:
<?xml version="1.0" encoding="utf-8"?>
<CATALOG APIKEY="88ac00e4f3e16e44" xmlns="urn:rrXML" xmlns:xsd="httc://www.w3.org/2001/XMLSchema" xmlns:xsi="httc://www.w3.org/2001/XMLSchema-instance">
<PRODUCTS>
<PRODUCT ID="692174">
<PRODUCTNAME>HP Pavilion g6t Laptop 3rd generation Intel® Core™ i5-3210M 2.5GHz SuperMulti 8X DVD+/-R/RW</PRODUCTNAME>
<PRODUCTDESCRIPTION></PRODUCTDESCRIPTION>
<PRODUCTSKU>100005487</PRODUCTSKU>
<LISTPRICE>$499.99</LISTPRICE>
<SALEPRICE xsi:type="xsd:string" xmlns:xsi="httc://www.w3.org/2001/XMLSchema-instance">$499.99</SALEPRICE>
<PRODUCTURL>/.product.100005487.html</PRODUCTURL>
<IMAGEURL>httc://images.test-static.com/image/media/150-__1</IMAGEURL>
<RATING xsi:type="xsd:string" xmlns:xsi="httc://www.w3.org/2001/XMLSchema-instance">0.0</RATING>
<BRAND>HEWLETT PACKARD</BRAND>
<INSTOCK>1</INSTOCK>
<REVIEWS xsi:type="xsd:string" xmlns:xsi="httc://www.w3.org/2001/XMLSchema-instance">0</REVIEWS>
<KEYWORDS></KEYWORDS>
<ACTIONBUTTONURL></ACTIONBUTTONURL>
<PARENTPRODUCTID>100005487</PARENTPRODUCTID>
<CATEGORIES />
<ATTRIBUTES>
<ATTRIBUTE NAME="Categories">Kaspersky Promotion</ATTRIBUTE>
<ATTRIBUTE NAME="FSA">False</ATTRIBUTE>
<ATTRIBUTE NAME="HIDEPRICEFROMBROWSE">False</ATTRIBUTE>
<ATTRIBUTE NAME="ADDTOCARTFROMSEARCH">0</ATTRIBUTE>
<ATTRIBUTE NAME="ITEMMINQTY">1.0</ATTRIBUTE>
<ATTRIBUTE NAME="ITEMMAXQTY">1.0</ATTRIBUTE>
<ATTRIBUTE NAME="MERCHANDISINGDESC"></ATTRIBUTE>
<ATTRIBUTE NAME="DISCOUNTDESC"></ATTRIBUTE>
<ATTRIBUTE NAME="ALTTEXT">HP Pavilion g6t Laptop 3rd generation Intel® Core™ i5-3210M 2.5GHz SuperMulti 8X DVD+/-R/RW</ATTRIBUTE>
<ATTRIBUTE NAME="MAPITEM">False</ATTRIBUTE>
<ATTRIBUTE NAME="MEMBERONLYITEM">False</ATTRIBUTE>
<ATTRIBUTE NAME="Brand">HP</ATTRIBUTE>
<ATTRIBUTE NAME="Graphic Card">Intel HD Graphics</ATTRIBUTE>
<ATTRIBUTE NAME="Hard Drive Size">500 GB</ATTRIBUTE>
<ATTRIBUTE NAME="Operating System">Windows ®</ATTRIBUTE>
<ATTRIBUTE NAME="RAM Included">4 GB</ATTRIBUTE>
<ATTRIBUTE NAME="Screen Size">15.6 in.</ATTRIBUTE>
</ATTRIBUTES>
</PRODUCT>
The new XML file is set up like this:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<CATALOG APIKEY="88ac00e4f3e16e44" xmlns="urn:rrXML" xmlns:xsd="httc://www.w3.org/2001/XMLSchema" xmlns:xsi="httc://www.w3.org/2001/XMLSchema-instance">
<PRODUCTS>
<PRODUCT ID="692174">
<PRODUCTNAME>HP Pavilion g6t Laptop 3rd generation Intel® Core™ i5-3210M 2.5GHz SuperMulti 8X DVD+/-R/RW</PRODUCTNAME>
<PRODUCTDESCRIPTION></PRODUCTDESCRIPTION>
<PRODUCTSKU>100005487</PRODUCTSKU>
<LISTPRICE>$499.99</LISTPRICE>
<SALEPRICE xsi:type="xsd:string">$499.99</SALEPRICE>
<PRODUCTURL>/.product.100005487.html</PRODUCTURL>
<IMAGEURL>httc://images.test-static.com/image/media/150-__1</IMAGEURL>
<RATING xsi:type="xsd:string">0.0</RATING>
<BRAND>HEWLETT PACKARD</BRAND>
<INSTOCK>1</INSTOCK>
<REVIEWS xsi:type="xsd:string">0</REVIEWS>
<KEYWORDS></KEYWORDS>
<ACTIONBUTTONURL></ACTIONBUTTONURL>
<PARENTPRODUCTID>100005487</PARENTPRODUCTID>
<CATEGORIES>
<CATEGORY ID="103510">
<CATEGORYNAME>Kaspersky Promotion</CATEGORYNAME>
</CATEGORY>
</CATEGORIES>
<ATTRIBUTES>
<ATTRIBUTE NAME="Categories">Kaspersky Promotion</ATTRIBUTE>
<ATTRIBUTE NAME="FSA">False</ATTRIBUTE>
<ATTRIBUTE NAME="HIDEPRICEFROMBROWSE">False</ATTRIBUTE>
<ATTRIBUTE NAME="ADDTOCARTFROMSEARCH">0</ATTRIBUTE>
<ATTRIBUTE NAME="ITEMMINQTY">1.0</ATTRIBUTE>
<ATTRIBUTE NAME="ITEMMAXQTY">1.0</ATTRIBUTE>
<ATTRIBUTE NAME="MERCHANDISINGDESC"></ATTRIBUTE>
<ATTRIBUTE NAME="DISCOUNTDESC"></ATTRIBUTE>
<ATTRIBUTE NAME="ALTTEXT">HP Pavilion g6t Laptop 3rd generation Intel® Core™ i5-3210M 2.5GHz SuperMulti 8X DVD+/-R/RW</ATTRIBUTE>
<ATTRIBUTE NAME="MAPITEM">False</ATTRIBUTE>
<ATTRIBUTE NAME="MEMBERONLYITEM">False</ATTRIBUTE>
<ATTRIBUTE NAME="Brand">HP</ATTRIBUTE>
<ATTRIBUTE NAME="Graphic Card">Intel HD Graphics</ATTRIBUTE>
<ATTRIBUTE NAME="Hard Drive Size">500 GB</ATTRIBUTE>
<ATTRIBUTE NAME="Operating System">Windows ®</ATTRIBUTE>
<ATTRIBUTE NAME="RAM Included">4 GB</ATTRIBUTE>
<ATTRIBUTE NAME="Screen Size">15.6 in.</ATTRIBUTE>
</ATTRIBUTES>
</PRODUCT>
When converting the product to a single line, we take only the content between and including the PRODUCT start and end tags.
When we did this with the new file, it failed because the namespace declarations were dropped. So I modified the process to include a wrapper element around the product that carries the namespace declarations. The text now being sent through the XSL transformation is:
<wrapper xmlns="urn:rrXML" xmlns:xsd="httc://www.w3.org/2001/XMLSchema" xmlns:xsi="httc://www.w3.org/2001/XMLSchema-instance">
<PRODUCTS>
<PRODUCT ID="692174">
<PRODUCTNAME>HP Pavilion g6t Laptop 3rd generation Intel® Core™ i5-3210M 2.5GHz SuperMulti 8X DVD+/-R/RW</PRODUCTNAME>
<PRODUCTDESCRIPTION></PRODUCTDESCRIPTION>
<PRODUCTSKU>100005487</PRODUCTSKU>
<LISTPRICE>$499.99</LISTPRICE>
<SALEPRICE xsi:type="xsd:string">$499.99</SALEPRICE>
<PRODUCTURL>/.product.100005487.html</PRODUCTURL>
<IMAGEURL>httc://images.test-static.com/image/media/150-__1</IMAGEURL>
<RATING xsi:type="xsd:string">0.0</RATING>
<BRAND>HEWLETT PACKARD</BRAND>
<INSTOCK>1</INSTOCK>
<REVIEWS xsi:type="xsd:string">0</REVIEWS>
<KEYWORDS></KEYWORDS>
<ACTIONBUTTONURL></ACTIONBUTTONURL>
<PARENTPRODUCTID>100005487</PARENTPRODUCTID>
<CATEGORIES>
<CATEGORY ID="103510">
<CATEGORYNAME>Kaspersky Promotion</CATEGORYNAME>
</CATEGORY>
</CATEGORIES>
<ATTRIBUTES>
<ATTRIBUTE NAME="Categories">Kaspersky Promotion</ATTRIBUTE>
<ATTRIBUTE NAME="FSA">False</ATTRIBUTE>
<ATTRIBUTE NAME="HIDEPRICEFROMBROWSE">False</ATTRIBUTE>
<ATTRIBUTE NAME="ADDTOCARTFROMSEARCH">0</ATTRIBUTE>
<ATTRIBUTE NAME="ITEMMINQTY">1.0</ATTRIBUTE>
<ATTRIBUTE NAME="ITEMMAXQTY">1.0</ATTRIBUTE>
<ATTRIBUTE NAME="MERCHANDISINGDESC"></ATTRIBUTE>
<ATTRIBUTE NAME="DISCOUNTDESC"></ATTRIBUTE>
<ATTRIBUTE NAME="ALTTEXT">HP Pavilion g6t Laptop 3rd generation Intel® Core™ i5-3210M 2.5GHz SuperMulti 8X DVD+/-R/RW</ATTRIBUTE>
<ATTRIBUTE NAME="MAPITEM">False</ATTRIBUTE>
<ATTRIBUTE NAME="MEMBERONLYITEM">False</ATTRIBUTE>
<ATTRIBUTE NAME="Brand">HP</ATTRIBUTE>
<ATTRIBUTE NAME="Graphic Card">Intel HD Graphics</ATTRIBUTE>
<ATTRIBUTE NAME="Hard Drive Size">500 GB</ATTRIBUTE>
<ATTRIBUTE NAME="Operating System">Windows ®</ATTRIBUTE>
<ATTRIBUTE NAME="RAM Included">4 GB</ATTRIBUTE>
<ATTRIBUTE NAME="Screen Size">15.6 in.</ATTRIBUTE>
</ATTRIBUTES>
</PRODUCT>
</wrapper>
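As a rough illustration of why the wrapper is needed, here is a minimal Python sketch using the standard library's xml.etree.ElementTree rather than the Hadoop process described above. The helper names wrap and parses, and the trimmed-down product fragment, are assumptions made for this example only. It shows that an extracted PRODUCT line using the xsi prefix is not well-formed on its own, and that once wrapped it parses but PRODUCT now lives in the urn:rrXML namespace.

```python
import xml.etree.ElementTree as ET

# One extracted product line as it might look in the new feed: the xsi
# prefix is used, but its declaration stayed behind on the CATALOG root.
FRAGMENT = ('<PRODUCT ID="692174">'
            '<PRODUCTSKU>100005487</PRODUCTSKU>'
            '<SALEPRICE xsi:type="xsd:string">$499.99</SALEPRICE>'
            '</PRODUCT>')

# Namespace declarations copied from the post (httc substitution kept).
WRAPPER_OPEN = ('<wrapper xmlns="urn:rrXML" '
                'xmlns:xsd="httc://www.w3.org/2001/XMLSchema" '
                'xmlns:xsi="httc://www.w3.org/2001/XMLSchema-instance">')


def wrap(fragment):
    """Hypothetical helper: surround a fragment with the wrapper element."""
    return WRAPPER_OPEN + fragment + "</wrapper>"


def parses(text):
    """Return True if the text is well-formed, namespace-aware XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


print(parses(FRAGMENT))        # bare fragment: xsi prefix is unbound
print(parses(wrap(FRAGMENT)))  # wrapped fragment: prefixes are declared
print(ET.fromstring(wrap(FRAGMENT))[0].tag)  # PRODUCT's expanded name
```

The last line matters for the stylesheet: because the wrapper declares xmlns="urn:rrXML", every unprefixed element inside it, PRODUCT included, belongs to that namespace.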
The XSL I am trying to use is:
<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="httc://www.w3.org/1999/XSL/Transform">
<xsl:output method="text" indent="no" />
<xsl:strip-space elements="*" />
<xsl:template match="PRODUCT">
<!-- skuId --><xsl:value-of select="PRODUCTSKU"/>
<xsl:text>|||</xsl:text>
<!-- parentSkuId --><xsl:value-of select="PARENTPRODUCTID"/>
<xsl:text>|||</xsl:text>
<!-- globalSkuID --><xsl:text></xsl:text>
<xsl:text>|||</xsl:text>
<!-- TaxonomyKey Path --><xsl:text></xsl:text>
<xsl:text>|||</xsl:text>
<!-- TaxonomyText --><xsl:text></xsl:text>
<xsl:text>|||</xsl:text>
<!-- upc --><xsl:text></xsl:text>
<xsl:text>|||</xsl:text>
<!-- mpn --><xsl:text></xsl:text>
<xsl:text>|||</xsl:text>
<!-- model_Number --><xsl:text></xsl:text>
<xsl:text>|||</xsl:text>
<!-- Name --><xsl:value-of select="PRODUCTNAME"/>
<xsl:text>|||</xsl:text>
<!-- shortDescription --><xsl:text></xsl:text>
<xsl:text>|||</xsl:text>
<!-- longDescription --><xsl:value-of select="PRODUCTDESCRIPTION"/>
<xsl:text>|||</xsl:text>
<!-- price --><xsl:value-of select="SALEPRICE"/>
<xsl:text>|||</xsl:text>
<!-- comparePrice --><xsl:value-of select="LISTPRICE"/>
<xsl:text>|||</xsl:text>
<!-- productPage --><xsl:value-of select="PRODUCTURL"/>
<xsl:text>|||</xsl:text>
<!-- thumbnail --><xsl:value-of select="IMAGEURL"/>
<xsl:text>|||</xsl:text>
<!-- fullImage --><xsl:value-of select="IMAGEURL"/>
<xsl:text>|||</xsl:text>
<!-- rating --><xsl:value-of select="RATING"/>
<xsl:text>|||</xsl:text>
<!-- brand --><xsl:value-of select="BRAND"/>
<xsl:text>|||</xsl:text>
<!-- isActive --><xsl:value-of select="INSTOCK"/>
<xsl:text>|||</xsl:text>
<!-- ReviewCouunt --><xsl:value-of select="REVIEWS"/>
<xsl:text>|||</xsl:text>
<!-- AlternateTaxonomyKeys -->
<xsl:for-each select="CATEGORIES/CATEGORY">
<xsl:value-of select="@ID" /><xsl:text>^</xsl:text>
</xsl:for-each>
<xsl:text>|||</xsl:text>
<!-- AlternateTaxonomyNames -->
<xsl:for-each select="CATEGORIES/CATEGORY/CATEGORYNAME">
<xsl:value-of select="." /><xsl:text>^</xsl:text>
</xsl:for-each>
<xsl:text>|||</xsl:text>
<!-- AttributeNames -->
<xsl:for-each select="ATTRIBUTES/ATTRIBUTE">
<xsl:value-of select="@NAME" /><xsl:text>^</xsl:text>
</xsl:for-each>
<xsl:text>|||</xsl:text>
<!-- Attribute Values -->
<xsl:for-each select="ATTRIBUTES/ATTRIBUTE">
<xsl:value-of select="." /><xsl:text>^</xsl:text>
</xsl:for-each>
<xsl:text>
</xsl:text>
</xsl:template>
</xsl:stylesheet>
This results in output that is just the text of the product-level nodes concatenated together, like: HP Pavilion g6t Laptop 3rd generation Intel® Core™ i5-3210M 2.5GHz SuperMulti 8X DVD+/-R/RW100005487$499.99$499.99/.product.100005487.htmlhttc://images.test-static.com/image/media/150-__10.0HEWLETT PACKARD10100005487
I'm guessing it has something to do with the namespaces they are including, but I don't know enough about XSL to figure out what. Please help.
Recommended answer
You have to add the namespace of the XML document to the XSLT by declaring a prefix bound to the same namespace URI, e.g. xmlns:u="urn:rrXML". In XPath 1.0 an unprefixed name such as PRODUCT matches only elements in no namespace, so your template never fires and the built-in templates simply copy every text node to the output, which is the concatenated string you are seeing. With the prefix declared, you can address the elements in the XML through it: you get the value using <xsl:value-of select="u:PRODUCTSKU"/> instead of <xsl:value-of select="PRODUCTSKU"/>. Once the missing closing PRODUCTS tag is added to your input XML, the following XSLT
<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:u="urn:rrXML"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text" indent="no" />
<xsl:strip-space elements="*" />
<xsl:template match="u:PRODUCT" >
<!-- skuId --><xsl:value-of select="u:PRODUCTSKU"/>
<xsl:text>|||</xsl:text>
<!-- parentSkuId --><xsl:value-of select="u:PARENTPRODUCTID"/>
<xsl:text>|||</xsl:text>
<!-- globalSkuID --><xsl:text></xsl:text>
<xsl:text>|||</xsl:text>
<!-- TaxonomyKey Path --><xsl:text></xsl:text>
<xsl:text>|||</xsl:text>
<!-- TaxonomyText --><xsl:text></xsl:text>
<xsl:text>|||</xsl:text>
<!-- upc --><xsl:text></xsl:text>
<xsl:text>|||</xsl:text>
<!-- mpn --><xsl:text></xsl:text>
<xsl:text>|||</xsl:text>
<!-- model_Number --><xsl:text></xsl:text>
<xsl:text>|||</xsl:text>
<!-- Name --><xsl:value-of select="u:PRODUCTNAME"/>
<xsl:text>|||</xsl:text>
<!-- shortDescription --><xsl:text></xsl:text>
<xsl:text>|||</xsl:text>
<!-- longDescription --><xsl:value-of select="u:PRODUCTDESCRIPTION"/>
<xsl:text>|||</xsl:text>
<!-- price --><xsl:value-of select="u:SALEPRICE" />
<xsl:text>|||</xsl:text>
<!-- comparePrice --><xsl:value-of select="u:LISTPRICE"/>
<xsl:text>|||</xsl:text>
<!-- productPage --><xsl:value-of select="u:PRODUCTURL"/>
<xsl:text>|||</xsl:text>
<!-- thumbnail --><xsl:value-of select="u:IMAGEURL"/>
<xsl:text>|||</xsl:text>
<!-- fullImage --><xsl:value-of select="u:IMAGEURL"/>
<xsl:text>|||</xsl:text>
<!-- rating --><xsl:value-of select="u:RATING"/>
<xsl:text>|||</xsl:text>
<!-- brand --><xsl:value-of select="u:BRAND"/>
<xsl:text>|||</xsl:text>
<!-- isActive --><xsl:value-of select="u:INSTOCK"/>
<xsl:text>|||</xsl:text>
<!-- ReviewCouunt --><xsl:value-of select="u:REVIEWS"/>
<xsl:text>|||</xsl:text>
<!-- AlternateTaxonomyKeys -->
<xsl:for-each select="u:CATEGORIES/u:CATEGORY">
<xsl:value-of select="@ID" /><xsl:text>^</xsl:text>
</xsl:for-each>
<xsl:text>|||</xsl:text>
<!-- AlternateTaxonomyNames -->
<xsl:for-each select="u:CATEGORIES/u:CATEGORY/u:CATEGORYNAME">
<xsl:value-of select="." /><xsl:text>^</xsl:text>
</xsl:for-each>
<xsl:text>|||</xsl:text>
<!-- AttributeNames -->
<xsl:for-each select="u:ATTRIBUTES/u:ATTRIBUTE">
<xsl:value-of select="@NAME" /><xsl:text>^</xsl:text>
</xsl:for-each>
<xsl:text>|||</xsl:text>
<!-- Attribute Values -->
<xsl:for-each select="u:ATTRIBUTES/u:ATTRIBUTE">
<xsl:value-of select="." /><xsl:text>^</xsl:text>
</xsl:for-each>
<xsl:text>
</xsl:text>
</xsl:template>
</xsl:stylesheet>
produces the output
100005487|||100005487|||||||||||||||||||||HP Pavilion g6t Laptop 3rd generation Intel® Core™ i5-3210M 2.5GHz SuperMulti 8X DVD+/-R/RW|||||||||$499.99|||$499.99|||/.product.100005487.html|||httc://images.test-static.com/image/media/150-__1|||httc://images.test-static.com/image/media/150-__1|||0.0|||HEWLETT PACKARD|||1|||0|||103510^|||Kaspersky Promotion^|||Categories^FSA^HIDEPRICEFROMBROWSE^ADDTOCARTFROMSEARCH^ITEMMINQTY^ITEMMAXQTY^MERCHANDISINGDESC^DISCOUNTDESC^ALTTEXT^MAPITEM^MEMBERONLYITEM^Brand^Graphic Card^Hard Drive Size^Operating System^RAM Included^Screen Size^|||Kaspersky Promotion^False^False^0^1.0^1.0^^^HP Pavilion g6t Laptop 3rd generation Intel® Core™ i5-3210M 2.5GHz SuperMulti 8X DVD+/-R/RW^False^False^HP^Intel HD Graphics^500 GB^Windows ®^4 GB^15.6 in.^
on a single line, if that's really the intended output.
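The same matching rule can be observed outside XSLT. The following is a minimal Python sketch using the standard library's xml.etree.ElementTree, not the XSLT pipeline itself; the trimmed-down sample document and the to_line helper are assumptions made for illustration. It shows that an unprefixed path finds nothing in a document with a default namespace, while a prefix bound to urn:rrXML does.

```python
import xml.etree.ElementTree as ET

# Trimmed-down sample in the same default namespace as the feed.
XML = """<wrapper xmlns="urn:rrXML">
  <PRODUCT ID="692174">
    <PRODUCTSKU>100005487</PRODUCTSKU>
    <BRAND>HEWLETT PACKARD</BRAND>
  </PRODUCT>
</wrapper>"""

# The prefix "u" is arbitrary; only the URI has to match the document.
NS = {"u": "urn:rrXML"}

root = ET.fromstring(XML)

# Unprefixed name: looks for PRODUCT in no namespace, so nothing matches.
unqualified = root.find("PRODUCT")

# Prefixed name: resolves to {urn:rrXML}PRODUCT and matches.
product = root.find("u:PRODUCT", NS)


def to_line(prod):
    """Hypothetical helper: join two fields with the triple-pipe delimiter."""
    fields = [
        prod.findtext("u:PRODUCTSKU", default="", namespaces=NS),
        prod.findtext("u:BRAND", default="", namespaces=NS),
    ]
    return "|||".join(fields)


print(unqualified)       # None
print(to_line(product))  # 100005487|||HEWLETT PACKARD
```

As in the stylesheet, the prefix in the lookup does not need to match anything in the input document; it only has to be bound to the same namespace URI.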