XML file parsing in C


Problem description



hi,
is it possible to parse an XML file in C so that i can fulfill these
requirements :
1) replace all "<" and ">" signs inside the body of tag by a space, e.g. :
Example 1:
<fooblabla < bla </foo>

becomes

<fooblabla bla </foo>

Example 2:

<foo>blablabla </foo>

becomes
<fooblablabla </foo>
2) Remove all extra spaces at the end of every line of the XML file
3) Replace all special characters ( Unicode or Hexadecimal characters) by a
space
I mean the XML file is not well formed if there are "<" and ">" signs a
little bit everywhere,
it is not a valid file in that case, so i do not think the use of a parser
would be appropriate in that case. (How would the parser react when it
encounters a < that does not correspond to the beginning of a tag ???)

Do you have an idea on how i can write a program to deal with these
requirements ?
Technical environment is : Unix, KSH, and C (gcc)

I am thinking of using the "sed" command instead, i can get rid of the extra
spaces and replace the special characters but i still do not know how to
deal with the extra ">" and "<" signs.

Thanks for your help.

Solution

Marc Dubois wrote:

> hi,
> is it possible to parse an XML file in C so that i can fulfill these
> requirements :
> 1) replace all "<" and ">" signs inside the body of tag by a space, e.g. :

[...]

> 2) Remove all extra spaces at the end of every line of the XML file
> 3) Replace all special characters ( Unicode or Hexadecimal characters) by a
> space
>
> I mean the XML file is not well formed if there are "<" and ">" signs a
> little bit everywhere,
> it is not a valid file in that case, so i do not think the use of a parser
> would be appropriate in that case. (How would the parser react when it
> encounters a < that does not correspond to the beginning of a tag ???)
>
> Do you have an idea on how i can write a program to deal with these
> requirements ?
> Technical environment is : Unix, KSH, and C (gcc)
>
> I am thinking of using the "sed" command instead, i can get rid of the extra
> spaces and replace the special characters but i still do not know how to
> deal with the extra ">" and "<" signs.

Pretty much OT for this group. Try a newsgroup that deals with POSIX
tools, or try "man sed".

<state:off-topic>
Also, look at XML tidy/validation tools. HTML tidy has limited XML support.
</>


Marc Dubois wrote:

> hi,
> is it possible to parse an XML file in C

Of course it is "possible." Is it easy?
Depends on your experience writing parsers.
The XML grammar is not especially complicated
-- that's sort of the point of it.

If you are willing to take a canned solution, there is expat for C
http://www.jclark.com/xml/expat.html

However, your problem seems to be formatting and error correction, not
XML parsing. For example,

> <fooblabla < bla </foo>

Is not XML.

> <foo>blablabla </foo>

Is not XML.

> 2) Remove all extra spaces at the end of every line of the XML file

You don't need anything but an address of a char array and '\0' to do
that :-)
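
To make that remark concrete, here is a minimal sketch of trimming trailing whitespace from one line held in a char array. The function name `rstrip` is made up for this illustration, and it assumes the line has already been read into a writable, NUL-terminated buffer (for example with `fgets`).

```c
#include <ctype.h>
#include <string.h>

/* Hypothetical helper: strip trailing whitespace (spaces, tabs, CR, LF)
 * from a NUL-terminated line, in place. Returns the new length. */
size_t rstrip(char *line)
{
    size_t len = strlen(line);

    /* Walk back from the end while the last character is whitespace. */
    while (len > 0 && isspace((unsigned char)line[len - 1]))
        len--;

    /* Writing the terminator here is all it takes to drop the tail. */
    line[len] = '\0';
    return len;
}
```

In a filter program one would probably call this on every line read with `fgets` and then write the line back out followed by a single newline.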

> 3) Replace all special characters ( Unicode or Hexadecimal characters) by a
> space

This part might be an interesting problem.
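
The post does not say exactly what counts as a "special character", so the following is only a sketch under one possible reading: every byte that is not printable ASCII, a tab, or a newline is treated as special. The name `blank_special` is invented for this example. Note that a multi-byte UTF-8 character becomes several spaces under this byte-wise approach.

```c
#include <stddef.h>

/* Hypothetical helper: replace every byte that is not printable ASCII
 * (0x20..0x7E), a tab, or a newline with a single space, in place.
 * ASSUMPTION: this is what "special character" means; the original
 * post does not define it. */
void blank_special(char *s)
{
    for (; *s != '\0'; s++) {
        unsigned char c = (unsigned char)*s;

        if (c == '\t' || c == '\n')
            continue;            /* keep ordinary layout characters */
        if (c < 0x20 || c > 0x7E)
            *s = ' ';            /* control byte or non-ASCII byte */
    }
}
```

If one space per character (rather than per byte) is wanted, the input encoding has to be known first, which is probably why this part is the interesting one.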

> I mean the XML file is not well formed if there are "<" and ">" signs a
> little bit everywhere,

Right, so you realize this, and you realize that an XML parser will
simply choke on it and (maybe) tell you where the errors are :-)

> it is not a valid file in that case, so i do not think the use of a parser
> would be appropriate in that case. (How would the parser react when it
> encounters a < that does not correspond to the beginning of a tag ???)

Hopefully, it will emit a diagnostic message ...

> Do you have an idea on how i can write a program to deal with these
> requirements ?
> Technical environment is : Unix, KSH, and C (gcc)
>
> I am thinking of using the "sed" command instead, i can get rid of the extra
> spaces and replace the special characters but i still do not know how to
> deal with the extra ">" and "<" signs.

You could use a lookahead technique since you always know what you want
to match. The naive approach I'd start with would be to work the
tokens from the outer extremes to inner, maybe making a pass first just
to validate that the angle brackets all match up.
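
One way the lookahead idea might be sketched in C is shown below. It is a heuristic, not a parser, and it is not the outer-to-inner pass described above: on each `<` it looks ahead to see whether a plausible tag follows (a name-like character, `/`, `?` or `!`, then a `>` before any other `<`). If so the tag is skipped untouched; otherwise the `<` is blanked. Any `>` met outside a tag is blanked as well. The function names are invented for this example, and `>` inside attribute values, comments, and CDATA sections are deliberately not handled.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: s[i] is '<'. Look ahead and return the index of
 * the '>' that closes a plausible tag, or 0 if this '<' looks stray.
 * (0 is a safe "none" value because a closing '>' is always after i.) */
static size_t tag_end(const char *s, size_t i)
{
    unsigned char next = (unsigned char)s[i + 1];
    size_t j;

    if (!(isalpha(next) || next == '_' || next == '/' ||
          next == '?' || next == '!'))
        return 0;

    for (j = i + 1; s[j] != '\0'; j++) {
        if (s[j] == '<')
            return 0;   /* another '<' before any '>': not a tag */
        if (s[j] == '>')
            return j;
    }
    return 0;           /* ran off the end without a closing '>' */
}

/* Hypothetical helper: blank out '<' and '>' that do not delimit a tag. */
void blank_stray_brackets(char *s)
{
    size_t i = 0;

    while (s[i] != '\0') {
        if (s[i] == '<') {
            size_t end = tag_end(s, i);
            if (end != 0) {
                i = end + 1;    /* skip the whole tag untouched */
                continue;
            }
            s[i] = ' ';         /* stray '<' in the text */
        } else if (s[i] == '>') {
            s[i] = ' ';         /* '>' seen outside any tag */
        }
        i++;
    }
}
```

On an illustrative input such as `<foo>blabla < bla </foo>`, only the middle `<` is replaced and both tags are left alone. Input that is already well formed passes through unchanged, so the result can then be handed to a real parser such as expat.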

> Thanks for your help.

I replied to your post because I work in a Java environment, and I
realize I am spoiled. Doing XML in Java is too simple to warrant much
discussion. Doing an XML parser in C, on the other hand, from scratch,
would be a very interesting problem.

After considering it for about half a second, I'd look into the
difficulty level of using the Xerces-C++ library in a C app. Or the
XML::Parser Perl module.

I realize you want to feed it invalid XML and correct errors; I know
from experience that you can use Xerces to a certain extent to locate
errors, so it might not be terribly hard to take that approach - make
passes through the Xerces validator to find errors, fix them, and end up
with the ability to do SAX or DOM on the document for free.

I have never, ever, even considered touching Xerces-C++, so I don't know
if it has anything in common with Xerces-Java. The docs on the Xerces
site make it look easy enough to use.

Somebody out there has done this, right?


Just curious, why do you want to use C for this? I'm not bashing C,
(I love it), but this seems like the kind of task Perl was created
for.

--
-Rob Hoelz

On Tue, 12 Dec 2006 22:07:14 +0100 "Marc Dubois" <no@spam.com>
wrote:

> hi,
> is it possible to parse an XML file in C so that i can fulfill these
> requirements :
> 1) replace all "<" and ">" signs inside the body of tag by a space,
> e.g. : Example 1:
> <fooblabla < bla </foo>
>
> becomes
>
> <fooblabla bla </foo>
>
> Example 2:
>
> <foo>blablabla </foo>
>
> becomes
> <fooblablabla </foo>
> 2) Remove all extra spaces at the end of every line of the XML file
> 3) Replace all special characters ( Unicode or Hexadecimal
> characters) by a space
> I mean the XML file is not well formed if there are "<" and ">" signs
> a little bit everywhere,
> it is not a valid file in that case, so i do not think the use of a
> parser would be appropriate in that case. (How would the parser react
> when it encounters a < that does not correspond to the beginning of a
> tag ???)
>
> Do you have an idea on how i can write a program to deal with these
> requirements ?
> Technical environment is : Unix, KSH, and C (gcc)
>
> I am thinking of using the "sed" command instead, i can get rid of
> the extra spaces and replace the special characters but i still do
> not know how to deal with the extra ">" and "<" signs.
>
> Thanks for your help.

