在iPhone上使用NSXMLParser解析html实体 [英] Resolving html entities with NSXMLParser on iPhone

查看:122
本文介绍了在iPhone上使用NSXMLParser解析html实体的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我想我读过与此问题相关的每一个网页,但我仍然无法找到解决方案,所以我在这里。

i think i read every single web page relating to this problem but i still cannot find a solution to it, so here i am.

我有一个HTML网页页面哪个不受我的控制,我需要从我的iPhone应用程序解析它。这是我正在谈论的网页示例:

I have an HTML web page wich is not under my control and i need to parse it from my iPhone application. Here it is a sample of the web page i'm talking about:

<HTML>
  <HEAD>
    <META http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
  </HEAD>
  <BODY>
    <LI class="bye bye" rel="hello 1">
      <H5 class="onlytext">
        <A name="morning_part">morning</A>
      </H5>
      <DIV class="mydiv">
        <SPAN class="myclass">something about you</SPAN> 
        <SPAN class="anotherclass">
          <A href="http://www.google.it">Bye Bye &egrave; un saluto</A>
        </SPAN>
      </DIV>
    </LI>
  </BODY>
</HTML>

我正在使用NSXMLParser,直到找到è html实体。它调用foundCharacters:forBye Bye,然后调用 resolveExternalEntityName:systemID :: ,实体名称为egrave。
在这个方法中,我只是返回在NSData中转换的字符è,再次调用foundCharacters将字符串è添加到前一个Bye Bye,然后解析器引发 NSXMLParserUndeclaredEntityError 错误。

I'm using NSXMLParser and it is going well till it find the è html entity. It calls foundCharacters: for "Bye Bye" and then it calls resolveExternalEntityName:systemID:: with an entityName of "egrave". In this method i'm just returning the character "è" trasformed in an NSData, the foundCharacters is called again adding the string "è" to the previous one "Bye Bye " and then the parser raise the NSXMLParserUndeclaredEntityError error.

我没有DTD,我无法更改我正在解析的html文件。你对这个问题有什么想法吗?在此先感谢各位,
Rob。

I have no DTD and i cannot change the html file i'm parsing. Do you have any ideas on this problem? Thanks in advance to all of you, Rob.

更新(12/03/2010)。在Griffo的建议之后我得到了类似的结果:

Update (12/03/2010). After the suggestion of Griffo i ended up with something like this:

data = [self replaceHtmlEntities:data];
NSXMLParser *parser = [[NSXMLParser alloc] initWithData:data];
[parser setDelegate:self];
[parser parse];

其中replaceHtmlEntities:(NSData *)是这样的:

where replaceHtmlEntities:(NSData *) is something like this:

- (NSData *)replaceHtmlEntities:(NSData *)data {

    NSString *htmlCode = [[NSString alloc] initWithData:data encoding:NSISOLatin1StringEncoding];
    NSMutableString *temp = [NSMutableString stringWithString:htmlCode];

    [temp replaceOccurrencesOfString:@"&amp;" withString:@"&" options:NSLiteralSearch range:NSMakeRange(0, [temp length])];
    [temp replaceOccurrencesOfString:@"&nbsp;" withString:@" " options:NSLiteralSearch range:NSMakeRange(0, [temp length])];
    ...
    [temp replaceOccurrencesOfString:@"&Agrave;" withString:@"À" options:NSLiteralSearch range:NSMakeRange(0, [temp length])];

    NSData *finalData = [temp dataUsingEncoding:NSISOLatin1StringEncoding];
    return finalData;

}

但我仍然在寻找解决此问题的最佳方法。我将在接下来的几天尝试使用TouchXml,但我仍然认为应该有一种方法可以使用NSXMLParser API来做到这一点,所以如果你知道如何,请随意在这里写:)

But i am still looking the best way to solve this problem. I will try TouchXml in the next days but i still think that there should be a way to do this using NSXMLParser API, so if you know how, feel free to write it here :)

推荐答案

在探索了几种替代方案之后,似乎NSXMLParser将不支持标准实体以外的实体& lt;,& gt;,& ',& quot;和& amp;

After exploring several alternatives, it appears that NSXMLParser will not support entities other than the standard entities &lt;, &gt;, &apos;, &quot; and &amp;

以下代码失败,导致 NSXMLParserUndeclaredEntityError

The code below fails resulting in an NSXMLParserUndeclaredEntityError.


// Create a dictionary to hold the entities and NSString equivalents
// A complete list of entities and unicode values is described in the HTML DTD
// which is available for download http://www.w3.org/TR/xhtml1/DTD/xhtml-lat1.ent


NSDictionary *entityMap = [NSDictionary dictionaryWithObjectsAndKeys: 
                     [NSString stringWithFormat:@"%C", 0x00E8], @"egrave",
                     [NSString stringWithFormat:@"%C", 0x00E0], @"agrave", 
                     ...
                     ,nil];

NSXMLParser *parser = [[NSXMLParser alloc] initWithData:data];
[parser setDelegate:self];
[parser setShouldResolveExternalEntities:YES];
[parser parse];

// NSXMLParser delegate method
- (NSData *)parser:(NSXMLParser *)parser resolveExternalEntityName:(NSString *)entityName systemID:(NSString *)systemID {
    return [[entityMap objectForKey:entityName] dataUsingEncoding: NSUTF8StringEncoding];
}

尝试通过在ENTITY声明前面添加HTML文档来声明实体将通过,但是,展开的实体不会传回解析器:foundCharacters ,并且会删除è和à字符。

Attempts to declare the entities by prepending the HTML document with ENTITY declarations will pass, however the expanded entities are not passed back to parser:foundCharacters and the è and à characters are dropped.

<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE HTML PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"
[
  <!ENTITY agrave "à">
  <!ENTITY egrave "è">
]>

在另一个实验中,我使用内部DTD创建了一个完全有效的xml文档

In another experiment, I created a completely valid xml document with an internal DTD

<?xml version="1.0" standalone="yes" ?>
<!DOCTYPE author [
    <!ELEMENT author (#PCDATA)>
    <!ENTITY js "Jo Smith">
]>
<author>&lt; &js; &gt;</author>

我实现了解析器:foundInternalEntityDeclarationWithName:value:; 委托方法,很明显解析器正在获取实体数据,但解析器:foundCharacters 仅针对预定义的实体进行调用。

I implemented the parser:foundInternalEntityDeclarationWithName:value:; delegate method and it is clear that the parser is getting the entity data, however the parser:foundCharacters is only called for the pre-defined entities.

2010-03-20 12:53:59.871 xmlParsing[1012:207] Parser Did Start Document
2010-03-20 12:53:59.873 xmlParsing[1012:207] Parser foundElementDeclarationWithName: author model: 
2010-03-20 12:53:59.873 xmlParsing[1012:207] Parser foundInternalEntityDeclarationWithName: js value: Jo Smith
2010-03-20 12:53:59.874 xmlParsing[1012:207] didStartElement: author type: (null)
2010-03-20 12:53:59.875 xmlParsing[1012:207] parser foundCharacters Before: 
2010-03-20 12:53:59.875 xmlParsing[1012:207] parser foundCharacters After: <
2010-03-20 12:53:59.876 xmlParsing[1012:207] parser foundCharacters Before: <
2010-03-20 12:53:59.876 xmlParsing[1012:207] parser foundCharacters After: < 
2010-03-20 12:53:59.877 xmlParsing[1012:207] parser foundCharacters Before: < 
2010-03-20 12:53:59.878 xmlParsing[1012:207] parser foundCharacters After: <  
2010-03-20 12:53:59.879 xmlParsing[1012:207] parser foundCharacters Before: <  
2010-03-20 12:53:59.879 xmlParsing[1012:207] parser foundCharacters After: <  >
2010-03-20 12:53:59.880 xmlParsing[1012:207] didEndElement: author with content: <  >
2010-03-20 12:53:59.880 xmlParsing[1012:207] Parser Did End Document

我找到了指向使用SAX界面的教程的链接LibXML NSXMLParser 使用的 xmlSAXHandler 允许 getEntity 回调定义。在调用 getEntity 之后,实体的扩展将传递给字符回调。

I found a link to a tutorial on Using the SAX Interface of LibXML. The xmlSAXHandler that is used by NSXMLParser allows for a getEntity callback to be defined. After calling getEntity, the expansion of the entity is passed to the characters callback.

NSXMLParser 此处缺少功能。应该发生的是 NSXMLParser 或其委托存储实体定义并将它们提供给 xmlSAXHandler getEntity 回调。这显然没有发生。我将提交错误报告。

NSXMLParser is missing functionality here. What should happen is that the NSXMLParser or its delegate store the entity definitions and provide them to the xmlSAXHandler getEntity callback. This is clearly not happening. I will file a bug report.

与此同时,如果您的文档很小,那么执行字符串替换的早期答案是完全可以接受的。查看上面提到的SAX教程以及Apple的XMLPerformance示例应用程序,看看是否值得实现 libxml 解析器。

In the meantime, the earlier answer of performing a string replacement is perfectly acceptable if your documents are small. Check out the SAX tutorial mentioned above along with the XMLPerformance sample app from Apple to see if implementing the libxml parser on your own is worthwhile.

这很有趣。

这篇关于在iPhone上使用NSXMLParser解析html实体的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆