PHP parsing XML file error
Problem description
I am trying to use SimpleXML to get data from http://rates.fxcm.com/RatesXML using simplexml_load_file(). I get errors at times because the response always has stray strings/numbers before and after the XML document.
Example:
2000<?xml version="1.0" encoding="UTF-8"?>
<Rates>
<Rate Symbol="EURUSD">
<Bid>1.27595</Bid>
<Ask>1.2762</Ask>
<High>1.27748</High>
<Low>1.27385</Low>
<Direction>-1</Direction>
<Last>23:29:11</Last>
</Rate>
</Rates>
0
I then decided to fetch the response with file_get_contents(), use substr() to remove the strings before and after the XML, and parse the result as a string with simplexml_load_string(). However, sometimes the random strings appear in between the nodes, like this:
<Rate Symbol="EURTRY">
<Bid>2.29443</Bid>
<Ask>2.29562</Ask>
<High>2.29841</High>
<Low>2.28999</Low>
137b
<Direction>1</Direction>
<Last>23:29:11</Last>
</Rate>
My question is: is there any way to deal with all these random strings in one go with a regex function, regardless of where they are placed? (I think that would be a better idea than contacting the site to get them to serve proper XML files.)
Recommended answer
Here is a preg_replace() call that removes the stray non-whitespace characters from the beginning of the string, from the end of the string, and after closing/self-closing tags:
$string = preg_replace( '~
    (?|                 # branch-reset group: capturing group numbering restarts
                        # at 1 in each alternative
        ^[^<]*          # match non-< characters at the beginning of the string
    |                   # OR
        [^>]*$          # match non-> characters at the end of the string
    |                   # OR
        (               # start of capturing group $1: closing tag
            </[^>]++>   # match a closing tag; note the possessive quantifier (++): it
                        # suppresses backtracking, which is a convenient optimization
                        # because the following bit is mutually exclusive anyway
                        # (this is used throughout the regex)
            \s++        # and the following whitespace
        )               # end of $1
        [^<\s]*+        # match non-<, non-whitespace characters (the "bad" ones)
        (?:             # start subgroup to repeat for more whitespace/non-whitespace
                        # sequences
            \s++        # match whitespace
            [^<\s]++    # match at least one "bad" character
        )*              # repeat
                        # note that this kind of pattern keeps all whitespace
                        # before the first and after the last "bad" character
    |                   # OR
        (               # start of capturing group $1: self-closing tag
            <[^>/]+/>   # match a self-closing tag
            \s++        # and the following whitespace
        )               # end of $1
        [^<\s]*+(?:\s++[^<\s]++)*
                        # same as before
    )                   # end of alternation
    ~x',
    '$1',
    $input);
The replacement string '$1' then simply writes back the closing or self-closing tag, if there was one.
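As a rough end-to-end sketch of how this could be used, the snippet below cleans a sample shaped like the one in the question and then parses it. The helper name strip_stray_text() is hypothetical, the pattern is a condensed form of the one above, and the hard-coded sample stands in for whatever file_get_contents() would return.

```php
<?php
// Hypothetical helper (not part of the original answer): applies a condensed
// form of the pattern above and returns the cleaned string.
function strip_stray_text(string $input): string
{
    $pattern = '~
        (?|
            ^[^<]*                                       # text before the first tag
          | [^>]*$                                       # text after the last tag
          | (</[^>]++>\s++) [^<\s]*+ (?:\s++[^<\s]++)*   # text after a closing tag
          | (<[^>/]+/>\s++) [^<\s]*+ (?:\s++[^<\s]++)*   # text after a self-closing tag
        )
    ~x';
    $result = preg_replace($pattern, '$1', $input);
    if ($result === null) {
        throw new RuntimeException('preg_replace failed, code ' . preg_last_error());
    }
    return $result;
}

// Sample input modelled on the question; in practice this would be the
// string returned by file_get_contents().
$raw = <<<'XML'
2000<?xml version="1.0" encoding="UTF-8"?>
<Rates>
<Rate Symbol="EURTRY">
<Bid>2.29443</Bid>
<Ask>2.29562</Ask>
<High>2.29841</High>
<Low>2.28999</Low>
137b
<Direction>1</Direction>
<Last>23:29:11</Last>
</Rate>
</Rates>
0
XML;

$clean = strip_stray_text($raw);

$xml = simplexml_load_string($clean);
if ($xml === false) {
    throw new RuntimeException('The cleaned string is still not well-formed XML');
}

// Print the symbol and direction of the first rate.
$rate = $xml->Rate[0];
echo $rate['Symbol'], ' ', $rate->Direction, "\n";
```

Note that the pattern only looks after closing and self-closing tags, so stray text directly after an opening tag would not be removed by this sketch.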
One of the reasons this approach is not safe is that closing or self-closing tags might occur inside comments or attribute strings. But I can hardly suggest you use an XML parser instead, since your XML parser can't parse this XML either.
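To illustrate the caveat, here is a small sketch with a made-up input: a well-formed document whose comment happens to contain a closing tag. The pattern is the same one as above, written on one line without the x flag. With this input, the replacement eats the end of the comment, so a document that parsed before no longer parses afterwards.

```php
<?php
// Same pattern as in the answer, condensed onto one line (no x flag needed).
$pattern = '~(?|^[^<]*|[^>]*$|(</[^>]++>\s++)[^<\s]*+(?:\s++[^<\s]++)*|(<[^>/]+/>\s++)[^<\s]*+(?:\s++[^<\s]++)*)~';

// Hypothetical input: valid XML with a closing tag inside a comment.
$valid = "<Rates>\n<!-- see </Bid> here -->\n<Rate/>\n</Rates>";

// Collect libxml errors internally instead of emitting PHP warnings.
libxml_use_internal_errors(true);

$before  = simplexml_load_string($valid);      // parses: the input is well-formed
$cleaned = preg_replace($pattern, '$1', $valid);
$after   = simplexml_load_string($cleaned);    // fails: the comment end was removed

var_dump($before !== false);
var_dump($after !== false);
var_dump($cleaned);
```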