Open source HTML parsing class not properly parsing spaces between paragraphs

Problem description

I'm using an open source method that converts HTML text into an NSString.

The resulting string has a large amount of white space between the first couple of paragraphs, but only a single blank line between subsequent paragraphs. An example of the output appears after the code.

Below is the method I'm calling. I've changed only two lines of the code: for stopCharacters and newLineAndWhitespaceCharacters, I removed \n from the character set, because when it was included the entire text came out as one long paragraph.

- (NSString *)stringByConvertingHTMLToPlainText {

    // Pool
    NSAutoreleasePool *pool = [[NSAutoreleasePool alloc] init];

    // Character sets
    NSCharacterSet *stopCharacters = [NSCharacterSet characterSetWithCharactersInString:[NSString stringWithFormat:@"< \t\r%C%C%C%C", 0x0085, 0x000C, 0x2028, 0x2029]];
    NSCharacterSet *newLineAndWhitespaceCharacters = [NSCharacterSet characterSetWithCharactersInString:[NSString stringWithFormat:@" \t\r%C%C%C%C", 0x0085, 0x000C, 0x2028, 0x2029]];
    NSCharacterSet *tagNameCharacters = [NSCharacterSet characterSetWithCharactersInString:@"abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"];

    // Scan and find all tags
    NSMutableString *result = [[NSMutableString alloc] initWithCapacity:self.length];
    NSScanner *scanner = [[NSScanner alloc] initWithString:self];
    [scanner setCharactersToBeSkipped:nil];
    [scanner setCaseSensitive:YES];
    NSString *str = nil, *tagName = nil;
    BOOL dontReplaceTagWithSpace = NO;
    do {

        // Scan up to the start of a tag or whitespace
        if ([scanner scanUpToCharactersFromSet:stopCharacters intoString:&str]) {
            [result appendString:str];
            str = nil; // reset
        }

        // Check if we've stopped at a tag/comment or whitespace
        if ([scanner scanString:@"<" intoString:NULL]) {

            // Stopped at a comment or tag
            if ([scanner scanString:@"!--" intoString:NULL]) {

                // Comment
                [scanner scanUpToString:@"-->" intoString:NULL];
                [scanner scanString:@"-->" intoString:NULL];

            } else {

                // Tag - remove and replace with a space, unless it's
                // a closing inline tag, in which case don't add a space
                if ([scanner scanString:@"/" intoString:NULL]) {

                    // Closing tag - replace with space unless it's inline
                    tagName = nil; dontReplaceTagWithSpace = NO;
                    if ([scanner scanCharactersFromSet:tagNameCharacters intoString:&tagName]) {
                        tagName = [tagName lowercaseString];
                        dontReplaceTagWithSpace = ([tagName isEqualToString:@"a"] ||
                                                   [tagName isEqualToString:@"b"] ||
                                                   [tagName isEqualToString:@"i"] ||
                                                   [tagName isEqualToString:@"q"] ||
                                                   [tagName isEqualToString:@"span"] ||
                                                   [tagName isEqualToString:@"em"] ||
                                                   [tagName isEqualToString:@"strong"] ||
                                                   [tagName isEqualToString:@"cite"] ||
                                                   [tagName isEqualToString:@"abbr"] ||
                                                   [tagName isEqualToString:@"acronym"] ||
                                                   [tagName isEqualToString:@"label"]);
                    }

                    // Replace tag with a space unless it was an inline tag
                    if (!dontReplaceTagWithSpace && result.length > 0 && ![scanner isAtEnd]) [result appendString:@" "];

                }

                // Scan past tag
                [scanner scanUpToString:@">" intoString:NULL];
                [scanner scanString:@">" intoString:NULL];

            }

        } else {

            // Stopped at whitespace - replace all whitespace and newlines with a space
            if ([scanner scanCharactersFromSet:newLineAndWhitespaceCharacters intoString:NULL]) {
                if (result.length > 0 && ![scanner isAtEnd]) [result appendString:@" "]; // Dont append space to beginning or end of result
            }

        }

    } while (![scanner isAtEnd]);

    // Cleanup
    [scanner release];

    // Decode HTML entities and return
    NSString *retString = [[result stringByDecodingHTMLEntities] retain];
    [result release];

    // Drain
    [pool drain];

    // Return
    return [retString autorelease];

}

Here is the NSLog of the string. I pasted only the first few paragraphs.

Mitt Romney spent the past six years running for president. After his loss to President Barack Obama, he'll have to chart a different course.  


 His initial plan: spend time with his family. He has five sons and 18 grandchildren, with a 19th on the way.  






 "I don't look at postelection to be a time of regrouping. Instead it's a time of forward focus," Romney told reporters aboard his plane Tuesday evening as he returned to Boston after the final campaign stop of his political career. "I have, of course, a family and life important to me, win or lose."  

 The most visible member of that family — wife Ann Romney — says neither she nor her husband will seek political office again.  

etc...

for (int j = 25; j < 50; j++) {
    char test = [completeTrimmed characterAtIndex:([completeTrimmed rangeOfString:@"chart a different course."].location + j)];

    NSLog(@"%hhd", test);
}

2012-11-11 17:15:57.668 LMU_LAL_LAUNCHER[5431:c07] 32
2012-11-11 17:15:57.669 LMU_LAL_LAUNCHER[5431:c07] 32
2012-11-11 17:15:57.669 LMU_LAL_LAUNCHER[5431:c07] 10
2012-11-11 17:15:57.670 LMU_LAL_LAUNCHER[5431:c07] 32
2012-11-11 17:15:57.670 LMU_LAL_LAUNCHER[5431:c07] 32
2012-11-11 17:15:57.670 LMU_LAL_LAUNCHER[5431:c07] 10
2012-11-11 17:15:57.671 LMU_LAL_LAUNCHER[5431:c07] 32
2012-11-11 17:15:57.671 LMU_LAL_LAUNCHER[5431:c07] 32
2012-11-11 17:15:57.671 LMU_LAL_LAUNCHER[5431:c07] 10
2012-11-11 17:15:57.672 LMU_LAL_LAUNCHER[5431:c07] 32
2012-11-11 17:15:57.672 LMU_LAL_LAUNCHER[5431:c07] 72
2012-11-11 17:15:57.672 LMU_LAL_LAUNCHER[5431:c07] 105
2012-11-11 17:15:57.673 LMU_LAL_LAUNCHER[5431:c07] 115
2012-11-11 17:15:57.673 LMU_LAL_LAUNCHER[5431:c07] 32
2012-11-11 17:15:57.673 LMU_LAL_LAUNCHER[5431:c07] 105
2012-11-11 17:15:57.673 LMU_LAL_LAUNCHER[5431:c07] 110
2012-11-11 17:15:57.674 LMU_LAL_LAUNCHER[5431:c07] 105
2012-11-11 17:15:57.674 LMU_LAL_LAUNCHER[5431:c07] 116
2012-11-11 17:15:57.674 LMU_LAL_LAUNCHER[5431:c07] 105
2012-11-11 17:15:57.675 LMU_LAL_LAUNCHER[5431:c07] 97
2012-11-11 17:15:57.675 LMU_LAL_LAUNCHER[5431:c07] 108
2012-11-11 17:15:57.675 LMU_LAL_LAUNCHER[5431:c07] 32
2012-11-11 17:15:57.675 LMU_LAL_LAUNCHER[5431:c07] 112
2012-11-11 17:15:57.676 LMU_LAL_LAUNCHER[5431:c07] 108
2012-11-11 17:15:57.676 LMU_LAL_LAUNCHER[5431:c07] 97

Recommended answer

I tried the code from the question above, and this is how I fixed it:

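NSString *decoded = [result stringByDecodingHTMLEntities];
[result release];

// Collapse runs of whitespace, then runs of newlines
NSString *collapsed = [decoded stripDuplicateCharactersInSet:[NSCharacterSet whitespaceCharacterSet] withString:@" "];
collapsed = [collapsed stripDuplicateCharactersInSet:[NSCharacterSet newlineCharacterSet] withString:@"\n"];

// Retain only the final string so it survives the pool drain
NSString *retString = [collapsed retain];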

I defined a category method on NSString:

- (NSString *)stripDuplicateCharactersInSet:(NSCharacterSet *)characterSet withString:(NSString *)joiningString;

with the following implementation:

- (NSString *)stripDuplicateCharactersInSet:(NSCharacterSet *)characterSet withString:(NSString *)joiningString {

    NSMutableString *originalStr = [NSMutableString string];

    // Splitting on every character in the set yields an empty component
    // for each extra separator in a run
    NSArray *componentsArray = [self componentsSeparatedByCharactersInSet:characterSet];

    NSUInteger counter = 0;
    for (NSString *stringComponent in componentsArray) {

        counter++;

        // Skip empty components and components that are a lone space (or a lone
        // newline when joining with newlines); this is what collapses the runs
        if (([stringComponent length] > 0) && (![stringComponent isEqualToString:@" "]) && ((![stringComponent isEqualToString:@"\n"]) || (![joiningString isEqualToString:@"\n"]))) {

            // Rejoin with a single separator, except after the final component
            if ([componentsArray count] == counter) {
                [originalStr appendFormat:@"%@", stringComponent];
            } else {
                [originalStr appendFormat:@"%@%@", stringComponent, joiningString];
            }
        }
    }

    return originalStr;
}

Add the above method to the NSString+HTML.m file as a category on NSString. In the HTML you provided, white space and newlines were interleaved repeatedly, so stripping newlines alone did not work. The method therefore splits the string on the given character set, discards the pieces that are empty or contain only a space, and appends what remains to the result with a single separator, which removes the duplicate white space and newlines.
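The reason the split-and-rejoin approach collapses runs can be seen in a minimal sketch like the one below. The sample string is made up for illustration; the only API used is componentsSeparatedByCharactersInSet:, which produces an empty string for each pair of adjacent separators.

```objc
#import <Foundation/Foundation.h>

int main(void) {
    @autoreleasepool {
        // Hypothetical sample: two spaces, a newline, two spaces
        NSString *sample = @"a  \n  b";

        // Splitting on spaces/tabs only; the newline stays inside a component.
        // The components are: "a", "", "\n", "", "b"
        NSArray *parts = [sample componentsSeparatedByCharactersInSet:
                             [NSCharacterSet whitespaceCharacterSet]];

        for (NSString *part in parts) {
            NSLog(@"length %lu", (unsigned long)[part length]);
        }
    }
    return 0;
}
```

Dropping the zero-length components and joining the rest with one space is what turns each run of spaces into a single space; the second pass does the same for newlines.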

Alternatively, you can try:

NSString *decoded = [result stringByDecodingHTMLEntities];
[result release];

// Retain the stripped string so it survives the pool drain
NSString *retString = [[decoded stripDuplicateNewlineCharacters] retain];

where the method is defined as:

- (NSString *)stripDuplicateNewlineCharacters {

    NSMutableString *originalStr = [NSMutableString string];

    // Split into lines; consecutive newlines produce empty components
    NSArray *componentsArray = [self componentsSeparatedByCharactersInSet:[NSCharacterSet newlineCharacterSet]];

    NSUInteger counter = 0;
    for (NSString *stringComponent in componentsArray) {

        counter++;

        // Collapse runs of spaces within the line. A single replacement pass
        // only halves a run, so repeat until no double space remains.
        NSString *line = stringComponent;
        while ([line rangeOfString:@"  "].location != NSNotFound) {
            line = [line stringByReplacingOccurrencesOfString:@"  " withString:@" "];
        }

        // Skip lines that are empty or hold only a single space
        if (([line length] > 0) && (![line isEqualToString:@" "])) {

            // Rejoin with one newline, except after the final component
            if ([componentsArray count] == counter) {
                [originalStr appendFormat:@"%@", line];
            } else {
                [originalStr appendFormat:@"%@\n", line];
            }
        }
    }

    return originalStr;
}

In this case, the duplicate white space is removed inside the method itself, at the same time as the duplicate newlines are stripped.
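A different technique that may give a similar result, and is not part of the answer above, is a regular-expression replace. The sketch below assumes a hypothetical helper named CollapseWhitespace and uses NSRegularExpressionSearch with stringByReplacingOccurrencesOfString:withString:options:range:. The sample string mirrors the character codes logged in the question (two spaces and a newline, repeated three times, then a space).

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper, not part of NSString+HTML. Any run of spaces/tabs
// that contains at least one newline becomes a single "\n"; any remaining
// run of two or more spaces/tabs becomes a single space.
static NSString *CollapseWhitespace(NSString *input) {
    NSString *text = [input stringByReplacingOccurrencesOfString:@"[ \\t]*(\\r?\\n[ \\t]*)+"
                                                       withString:@"\n"
                                                          options:NSRegularExpressionSearch
                                                            range:NSMakeRange(0, [input length])];

    return [text stringByReplacingOccurrencesOfString:@"[ \\t]{2,}"
                                            withString:@" "
                                               options:NSRegularExpressionSearch
                                                 range:NSMakeRange(0, [text length])];
}

int main(void) {
    @autoreleasepool {
        NSString *sample = @"chart a different course.  \n  \n  \n His initial plan";
        NSLog(@"%@", CollapseWhitespace(sample));
    }
    return 0;
}
```

Unlike the category methods, this version also removes the spaces that sit directly before and after a newline, so it is a design alternative and not a drop-in equivalent.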
