NSXMLDocumentTidyHTML doesn't tidy some XHTML validation errors

Question

I want to grab text from a list of web pages. After some experimenting, I've found that WebKit is the best fit for my needs.

Once the source of the page has been grabbed, I want to strip out all the HTML tags, using the technique in this comment.

Here's my code:

- (void)webView:(WebView *)sender didFinishLoadForFrame:(WebFrame *)frame {
    if (frame == [sender mainFrame]) {
        // Receives the error from parsing or from the XSLT transform
        NSError *theError = nil;
        // Raw HTML source of the main frame
        NSString *content = [[[[sender mainFrame] dataSource] representation] documentSource];
        // Parse the source, asking NSXMLDocument to tidy it as HTML; returns nil on failure
        NSXMLDocument *theDocument = [[NSXMLDocument alloc] initWithXMLString:content options:NSXMLDocumentTidyHTML error:&theError];
        // Stylesheet that outputs text only and drops the head and script elements
        NSString *theXSLTString = @"<?xml version='1.0' encoding='utf-8'?>\n<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform' xmlns:xhtml='http://www.w3.org/1999/xhtml'>\n<xsl:output method='text'/>\n<xsl:template match='xhtml:head'></xsl:template>\n<xsl:template match='xhtml:script'></xsl:template>\n</xsl:stylesheet>";
        NSData *theData = [theDocument objectByApplyingXSLTString:theXSLTString arguments:nil error:&theError];
        // The stripped text of the page
        NSString *theString = [[NSString alloc] initWithData:theData encoding:NSUTF8StringEncoding];
    }
}

This works fine on most pages. However, if a page doesn't validate correctly as XHTML, I sometimes get an error from the initWithXMLString: call.

That's fair enough: I'm asking it to tidy up the XHTML, so I'd expect it to report the problems it encounters. But when there's a problem with the validation, it returns nil and an error rather than actually tidying up the XHTML.

One specific page that causes the problem is the Ruby class documentation.

I've found that the excellent third-party HTML Tidy application cleans up this XHTML fine, but I'd expect NSXMLDocumentTidyHTML to be able to just add some quotes around cellpadding values. It's a fairly basic cleanup operation, and I'm not keen to add another dependency to my code base.

Is there something I'm missing about the way Cocoa cleans up XHTML? Or do I just need to bite the bullet and use HTML Tidy in my code instead?

Recommended answer

XHTML documents are treated as XML, so you may have better luck with the NSXMLDocumentTidyXML flag.
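As a rough sketch of how that suggestion could be wired in, the helper below tries NSXMLDocumentTidyXML first and falls back to NSXMLDocumentTidyHTML. The function name ParsePageSource is hypothetical and not part of the original post, and whether either flag actually repairs a given page (such as unquoted cellpadding values) would still need to be checked against that page.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper (an assumption, not from the original post):
// parse page source, trying each tidy option in turn until one succeeds.
static NSXMLDocument *ParsePageSource(NSString *source, NSError **outError) {
    // Try the XML tidy flag first, as the answer suggests, then the HTML one.
    NSXMLNodeOptions candidates[] = { NSXMLDocumentTidyXML, NSXMLDocumentTidyHTML };
    size_t count = sizeof(candidates) / sizeof(candidates[0]);
    NSError *lastError = nil;

    for (size_t i = 0; i < count; i++) {
        NSXMLDocument *doc = [[NSXMLDocument alloc] initWithXMLString:source
                                                              options:candidates[i]
                                                                error:&lastError];
        if (doc != nil) {
            return doc; // The first option that parses wins.
        }
    }

    // Every option failed: hand back the error from the last attempt.
    if (outError != NULL) {
        *outError = lastError;
    }
    return nil;
}
```

In the delegate method from the question, the initWithXMLString: line could then become a call such as ParsePageSource(content, &theError), followed by a nil check before the XSLT is applied.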
