Excel - 使用 FILTERXML 从字符串中提取子字符串 [英] Excel - Extract substring(s) from string using FILTERXML

查看:55
本文介绍了Excel - 使用 FILTERXML 从字符串中提取子字符串的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

背景

最近,我一直试图更加熟悉将分隔字符串更改为 XML 以使用 Excel 的 FILTERXML 并检索那些感兴趣的子字符串.请注意,此功能在 Excel 2013 中可用,不适用于 Excel for Mac 或 Excel Online.

Lately I've been trying to get more familiar with the concept of changing a delimited string into an XML to parse with Excel's FILTERXML and retrieve those substrings that are of interest. Please note that this function came available from Excel 2013 and is not available on Excel for Mac nor Excel Online.

对于分隔字符串,我的意思是使用空格作为分隔符的普通句子或可用于定义字符串中的子字符串的任何其他字符组合.例如,让我们想象以下内容:

With a delimited string, I meant anything from a normal sentence using spaces as delimiters or any other combination of characters that could be used to define substrings within a string. For example let's imagine the following:

ABC|123|DEF|456|XY-1A|ZY-2F|XY-3F|XY-4f|xyz|123

<小时>

问题

所以,很多人都知道如何获取 nth 元素(例如:=TRIM(MID(SUBSTITUTE(A1,"|",REPT(" ",LEN(A1))),3*LEN(A1)+1,LEN(A1))) 检索456).或者与 LEN()MID()FIND() 和所有这些结构的其他组合,我们如何使用 FILTERXML 使用更具体的标准来提取关注的子字符串并清理完整的字符串?例如如何检索:

So, where a lot of people know how to get the nth element (e.g.: =TRIM(MID(SUBSTITUTE(A1,"|",REPT(" ",LEN(A1))),3*LEN(A1)+1,LEN(A1))) to retrieve 456). Or other combinationes with LEN(), MID(), FIND() and all those constructs, how do we use FILTERXML to use more specific criteria to extract substrings of concern and clean up the full string? For example, how to retrieve:

  • 元素按位置
  • 数字或非数字元素
  • 自身包含子字符串的元素
  • 以子字符串开头或结尾的元素
  • 大写或小写的元素
  • 包含数字的元素
  • 独特的价值
  • ...

推荐答案

Excel 的 FILTERXML 使用 XPATH 1.0 不幸的是,这意味着它不像我们想要的那样多样化成为.此外,Excel 似乎允许返回返工节点值,只允许您按外观顺序选择节点.然而,我们仍然可以使用相当多的功能.可以在此处找到更多相关信息.

Excel's FILTERXML uses XPATH 1.0 which unfortunately means it is not as diverse as we would maybe want it to be. Also, Excel seems to not allow returning reworked node values and exclusively allows you to select nodes in order of appearance. However there is a fair share of functions we can still utilize. More information about that can be found here.

该函数有两个参数:=FILTERXML(,)

假设单元格 A1 包含字符串:ABC|123|DEF|456|XY-1A|ZY-2F|XY-3F|XY-4f|xyz|123.为了创建有效的 XML 字符串,我们使用 SUBSTITUTE 将分隔符更改为有效的结束和开始标记结构.因此,要获得给定示例的有效 XML 构造,我们可以这样做:

Let's say cell A1 holds the string: ABC|123|DEF|456|XY-1A|ZY-2F|XY-3F|XY-4f|xyz|123. To create a valid XML string we use SUBSTITUTE to change the delimiter to valid end- and start-tag constructs. So to get a valid XML construct for the given example we could do:

"<t><s>"&SUBSTITUTE(A1,"|","</s><s>")&"</s>"

出于可读性原因,我将使用 作为占位符 来引用上述结构.下面你会发现不同的有用的 XPATH 函数在过滤节点的有效构造中:

For readability reasons I'll refer to the above construct with the word <XML> as a placeholder. Below you'll find different usefull XPATH functions in a valid construct to filter nodes:

1) 所有元素:

=FILTERXML(<XML>,"//s")

返回:ABC123DEF456XY-1AZY-2FXY-3FXY-4fxyz123(所有节点)

Returns: ABC, 123, DEF, 456, XY-1A, ZY-2F, XY-3F, XY-4f, xyz and 123 (all nodes)

2) 元素按位置:

=FILTERXML(<XML>,"//s[position()=4]")

或者:

=FILTERXML(<XML>,"//s[4]")

返回:456(索引 4 上的节点)

Returns: 456 (node on index 4)

=FILTERXML(<XML>,"//s[position()<4]")

返回:ABC123DEF(索引 <4 上的节点)

Returns: ABC, 123 and DEF (nodes on index < 4)

=FILTERXML(<XML>,"//s[position()=2 or position()>5]")

返回:123ZY-2FXY-3FXY-4fxyz123(索引 2 或 > 5 上的节点)

Returns: 123, ZY-2F, XY-3F, XY-4f, xyz and 123 (nodes on index 2 or > 5)

=FILTERXML(<XML>,"//s[last()]")

返回:123(最后一个索引上的节点)

Returns: 123 (node on last index)

=FILTERXML(<XML>,"//s[position() mod 2 = 1]")

返回:ABCDEFXY-1AXY-3Fxyz(奇数节点)

Returns: ABC, DEF, XY-1A, XY-3F and xyz (odd nodes)

=FILTERXML(<XML>,"//s[position() mod 2 = 0]")

返回:123456ZF-2FXY-4f123(偶数节点)

Returns: 123, 456, ZF-2F, XY-4f and 123 (even nodes)

3)(非)数字元素:

=FILTERXML(<XML>,"//s[number()=.]")

或者:

=FILTERXML(<XML>,"//s[.*0=0]")

返回:123456123(数字节点)

Returns: 123, 456, and 123 (numeric nodes)

=FILTERXML(<XML>,"//s[not(number()=.)]")

或者:

=FILTERXML(<XML>,"//s[.*0!=0)]")

返回:ABCDEFXY-1AZY-2FXY-3FXY-4fxyz(非数字节点)

Returns: ABC, DEF, XY-1A, ZY-2F, XY-3F, XY-4f and xyz (non-numeric nodes)

4) 元素(不)包含:

=FILTERXML(<XML>,"//s[contains(., 'Y')]")

返回:XY-1AZY-2FXY-3FXY-4f> (包含'Y',注意XPATH区分大小写,不包括xyz)

Returns: XY-1A, ZY-2F, XY-3F and XY-4f (containing 'Y', notice XPATH is case sensitive, exclusing xyz)

=FILTERXML(<XML>,"//s[not(contains(., 'Y'))]")

返回:ABC123DEF456xyz123(不包含'Y',注意XPATH区分大小写,包括xyz)

Returns: ABC, 123, DEF, 456, xyz and 123 (not containing 'Y', notice XPATH is case sensitive, including xyz)

5) 以(不)开始或/和结束的元素:

=FILTERXML(<XML>,"//s[starts-with(., 'XY')]")

返回:XY-1AXY-3FXY-4f(以XY"开头)

Returns: XY-1A, XY-3F and XY-4f (starting with 'XY')

=FILTERXML(<XML>,"//s[not(starts-with(., 'XY'))]")

返回:ABC123DEF456ZY-2Fxyz123(不要以 'XY' 开头)

Returns: ABC, 123, DEF, 456, ZY-2F, xyz and 123 (don't start with 'XY')

=FILTERXML(<XML>,"//s[substring(., string-length(.) - string-length('F') +1) = 'F']")

返回:DEFZY-2FXY-3F(以 'F' 结尾,注意 XPATH1.0 不支持 ends-with)

Returns: DEF, ZY-2F and XY-3F (end with 'F', notice XPATH 1.0 does not support ends-with)

=FILTERXML(<XML>,"//s[not(substring(., string-length(.) - string-length('F') +1) = 'F')]")

返回:ABC123456XY-1AXY-4fxyz123(不要以'F'结尾)

Returns: ABC, 123, 456, XY-1A, XY-4f, xyz and 123 (don't end with 'F')

=FILTERXML(<XML>,"//s[starts-with(., 'X') and substring(., string-length(.) - string-length('A') +1) = 'A']")

返回:XY-1A(以X"开头,以A"结尾)

Returns: XY-1A (start with 'X' and end with 'A')

6) 大写或小写的元素:

=FILTERXML(<XML>,"//s[translate(.,'abcdefghijklmnopqrstuvwxyz','ABCDEFGHIJKLMNOPQRSTUVWXYZ')=.]")

返回:ABC123DEF456XY-1AZY-2FXY-3F123(大写节点)

Returns: ABC, 123, DEF, 456, XY-1A, ZY-2F, XY-3F and 123 (uppercase nodes)

=FILTERXML(<XML>,"//s[translate(.,'ABCDEFGHIJKLMNOPQRSTUVWXYZ','abcdefghijklmnopqrstuvwxyz')=.]")

返回:123456xyz123(小写节点)

Returns: 123, 456, xyz and 123 (lowercase nodes)

注意:不幸的是 XPATH 1.0 不支持 upper-case() 也不支持 lower-case() 所以以上是一种解决方法.如果需要,添加特殊字符.

NOTE: Unfortunately XPATH 1.0 does not support upper-case() nor lower-case() so the above is a workaround. Add special characters if need be.

7) 元素(不)包含任何数字:

=FILTERXML(<XML>,"//s[translate(.,'1234567890','')!=.]")

返回:123456XY-1AZY-2FXY-3FXY-4f123(包含任意数字)

Returns: 123, 456, XY-1A, ZY-2F, XY-3F, XY-4f and 123 (contain any digit)

=FILTERXML(<XML>,"//s[translate(.,'1234567890','')=.]")

返回:ABCDEFxyz(不包含任何数字)

Returns: ABC, DEF and xyz (don't contain any digit)

=FILTERXML(<XML>,"//s[translate(.,'1234567890','')!=. and .*0!=0]")

返回:XY-1AZY-2FXY-3FXY-4f>(持有数字但不是自己的数字)

Returns: XY-1A, ZY-2F, XY-3F and XY-4f (holding digits but not a a number on it's own)

8) 唯一元素或重复元素:

=FILTERXML(<XML>,"//s[preceding::*=.]")

返回:123(重复节点)

Returns: 123 (duplicate nodes)

=FILTERXML(<XML>,"//s[not(preceding::*=.)]")

返回:ABC123DEF456XY-1AZY-2FXY-3FXY-4fxyz(唯一节点)

Returns: ABC, 123, DEF, 456, XY-1A, ZY-2F, XY-3F, XY-4f and xyz (unique nodes)

=FILTERXML(<XML>,"//s[not(following::*=. or preceding::*=.)]")

返回:ABCDEF456XY-1AZY-2FXY-3FXY-4f(没有相似兄弟节点的节点)

Returns: ABC, DEF, 456, XY-1A, ZY-2F, XY-3F and XY-4f (nodes that have no similar sibling)

9) 一定长度的元素:

=FILTERXML(<XML>,"//s[string-length()=5]")

返回:XY-1AZY-2FXY-3FXY-4f>(5 个字符长)

Returns: XY-1A, ZY-2F, XY-3F and XY-4f (5 characters long)

=FILTERXML(<XML>,"//s[string-length()<4]")

返回:ABC123DEF456xyz123(少于 4 个字符)

Returns: ABC, 123, DEF, 456, xyz and 123 (shorter than 4 characters)

10) 基于前面/后面的元素:

=FILTERXML(<XML>,"//s[preceding::*[1]='456']")

返回:XY-1A(前一个节点等于'456')

Returns: XY-1A (previous node equals '456')

=FILTERXML(<XML>,"//s[starts-with(preceding::*[1],'XY')]")

返回:ZY-2FXY-4fxyz(前一个节点以 'XY' 开头)

Returns: ZY-2F, XY-4f, and xyz (previous node starts with 'XY')

=FILTERXML(<XML>,"//s[following::*[1]='123']")

返回:ABCxyz(后面的节点等于 '123')

Returns: ABC, and xyz (following node equals '123')

=FILTERXML(<XML>,"//s[contains(following::*[1],'1')]")

返回:ABC456xyz(后续节点包含 '1')

Returns: ABC, 456, and xyz (following node contains '1')

=FILTERXML(<XML>,"//s[preceding::*='ABC' and following::*='XY-3F']")

返回:123DEF456XY-1AZY-2F('ABC' 和 'XY-3f' 之间的所有内容)

Returns: 123, DEF, 456, XY-1A and ZY-2F (everything between 'ABC' and 'XY-3f')

11) 基于子字符串的元素:

=FILTERXML(<XML>,"//s[substring-after(., '-') = '3F']")

返回:XY-3F(连字符后以3F"结尾的节点)

Returns: XY-3F (nodes ending with '3F' after hyphen)

=FILTERXML(<XML>,"//s[contains(substring-after(., '-') , 'F')]")

返回:ZY-2FXY-3F(连字符后包含F"的节点)

Returns: ZY-2F and XY-3F (nodes containing 'F' after hyphen)

=FILTERXML(<XML>,"//s[substring-before(., '-') = 'ZY']")

返回:ZY-2F(连字符前以ZY"开头的节点)

Returns: ZY-2F (nodes starting with 'ZY' before hyphen)

=FILTERXML(<XML>,"//s[contains(substring-before(., '-'), 'Y')]")

返回:XY-1AZY-2FXY-3FXY-4f>(连字符前包含Y"的节点)

Returns: XY-1A, ZY-2F, XY-3F and XY-4f (nodes containing 'Y' before hyphen)

12) 基于串联的元素:

=FILTERXML(<XML>,"//s[concat(., '|', following::*[1])='ZY-2F|XY-3F']")

返回:ZY-2F(当节点与|"连接并且其后的兄弟节点等于ZY-2F|XY-3F")

Returns: ZY-2F (nodes when concatenated with '|' and following sibling equals 'ZY-2F|XY-3F')

=FILTERXML(<XML>,"//s[contains(concat(., preceding::*[2]), 'FA')]")

返回:DEF(当节点与左边的两个兄弟索引连接时包含'FA')

Returns: DEF (nodes when concatenated with sibling two indices to the left contains 'FA')

13) 空与非空:

=FILTERXML(<XML>,"//s[count(node())>0]")

或者:

=FILTERXML(<XML>,"//s[node()]")

返回:ABC123DEF456XY-1AZY-2FXY-3FXY-4fxyz123(所有不为空的节点)

Returns: ABC, 123, DEF, 456, XY-1A, ZY-2F, XY-3F, XY-4f, xyz and 123 (all nodes that are not empty)

=FILTERXML(<XML>,"//s[count(node())=0]")

或者:

=FILTERXML(<XML>,"//s[not(node())]")

返回:无(所有节点都是空的)

现在很明显,上面是 XPATH 1.0 函数的可能性演示,您可以获得上述的各种组合以及更多!我试图涵盖最常用的字符串函数.如有遗漏,欢迎留言.

Now obviously the above is a demonstration of possibilities with XPATH 1.0 functions and you can get a whole range of combinations of the above and more! I tried to cover most commonly used string functions. If you are missing any please feel free to comment.

尽管问题本身相当广泛,但我希望就如何将 FILTERXML 用于手头的查询提供一些总体指导.该公式返回要以任何其他方式使用的节点数组.很多时候我会在 TEXTJOIN()INDEX() 中使用它.但我想其他选项将是新的 DA 函数来溢出结果.

Whereas the question is quite broad on itself, I was hoping to give some general direction on how to use FILTERXML for the queries at hand. The formula returns an array of nodes to be used in any other way. A lot of the times I would use it in TEXTJOIN() or INDEX(). But I guess other options would be new DA-functions to spill results.

请注意,在通过 FILTERXML() 解析字符串时,与字符 (&) 和左尖括号 (<) 必须 not 以其字面形式出现.它们将分别需要替换为 &amp;&lt;.另一种选择是使用他们的数字 ISO/IEC 10646 字符 code分别为 &#38;&#60; .解析后,该函数将以文字形式将这些字符返回给您.不用说,用分号分割字符串变得很棘手.

Be alert that while parsing a string through FILTERXML(), the ampersand character (&) and the left angle bracket (<) must not appear in their literal form. They will respectively need to be substituted with either &amp; or &lt;. Another option would be to use their numeric ISO/IEC 10646 character code being &#38; or &#60; respectively. After parsing, the function will return these characters back to you in their literal form. Needless to say that splitting a string by the semi-colon therefor just became tricky.

这篇关于Excel - 使用 FILTERXML 从字符串中提取子字符串的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆