How to fix orphaned punctuation in iText

Question

In "How to fix iText's text wrapping for Chinese characters", another user describes a problem similar to the one we're facing. A response by https://stackoverflow.com/users/1622493/bruno-lowagie indicated that DefaultSplitCharacter has taken Chinese characters into account since iText 5. We're using iText 5.5.6, but still see the problem.

As near as I can tell, DefaultSplitCharacter is working correctly; the problem appears to be that the ColumnText class allows lines to begin with these punctuation marks.

[Screenshot: the PdfChunks used by the BidiLine class to render the text]

However, the result is written so that the 3rd and 5th lines both begin with punctuation characters, as shown in this image of the PDF output. [Image: PDF output]

I can simply add line breaks in the proper places to make it look correct, but then my fix may stop working if the text is ever re-translated internally. Does anyone know how to ensure that iText won't begin a line with these punctuation characters?

Recommended answer

For breaking lines in Asian languages you need to write your own implementation of SplitCharacter. A good reference for line breaking is Unicode® Standard Annex #14, Unicode Line Breaking Algorithm. Another one is https://msdn.microsoft.com/en-us/library/cc194864.aspx.
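Before writing a custom SplitCharacter, it can help to see where a general-purpose line breaker would allow breaks in your text. The following is a minimal sketch that uses only the JDK's java.text.BreakIterator, not iText. The class and method names are hypothetical, and the JDK's rules are not guaranteed to match Unicode Standard Annex #14 or the kinsoku rules exactly, so treat the output as a point of comparison rather than a specification.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper class, not part of iText.
class LineBreakOpportunities {

    /** Returns the boundaries (including 0 and text.length()) reported by the JDK line breaker. */
    public static List<Integer> breakOffsets(String text, Locale locale) {
        BreakIterator it = BreakIterator.getLineInstance(locale);
        it.setText(text);
        List<Integer> offsets = new ArrayList<>();
        for (int pos = it.first(); pos != BreakIterator.DONE; pos = it.next()) {
            offsets.add(pos);
        }
        return offsets;
    }

    public static void main(String[] args) {
        // Sample Japanese sentence written with escapes so the source file stays ASCII-only.
        String text = "\u4eca\u65e5\u306f\u6674\u308c\u3001\u660e\u65e5\u306f\u96e8\u3002";
        System.out.println(breakOffsets(text, Locale.JAPANESE));
    }
}
```

Comparing these offsets with the positions where iText actually wraps your text shows which breaks you need to prohibit in your own SplitCharacter.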

Having suffered through implementing this for Japanese, I'm posting the example code I wrote for Japanese text mixed with English text. This code could be modified for Chinese fairly easily using the references above.

Here is a snippet showing JapaneseSplitCharacter in use:

  // asianText is the String to render; asianFont is a Font that contains the required glyphs
  Chunk chunk = new Chunk(asianText, asianFont);
  chunk.setSplitCharacter(JapaneseSplitCharacter.SplitCharacter);
  Paragraph paragraph = new Paragraph(chunk);

Here is the code for JapaneseSplitCharacter:

import com.itextpdf.text.SplitCharacter;
import com.itextpdf.text.pdf.DefaultSplitCharacter;
import com.itextpdf.text.pdf.PdfChunk;

/**
 * <p/>
 * For basic latin characters spaces, periods, commas, etc. are split characters. For Japanese characters lines can break
 * anywhere, unless prohibited. This class uses logic for Japanese, non-starting and non-ending characters based on the
 * kinsoku rule and uses the DefaultSplitCharacter class for basic latin characters while writing free flowing text to a PDF.
 * <p/>
 */

public class JapaneseSplitCharacter implements SplitCharacter {

  // line of text cannot start or end with this character
  static final char u2060 = '\u2060';   //       - WORD JOINER

  // a line of text cannot start with any following characters in NOT_BEGIN_CHARACTERS[]
  static final char u30fb = '\u30fb';   //  ・   - KATAKANA MIDDLE DOT
  static final char u2022 = '\u2022';   //  •    - BLACK SMALL CIRCLE (BULLET)
  static final char uff65 = '\uff65';   //  ・    - HALFWIDTH KATAKANA MIDDLE DOT
  static final char u300d = '\u300d';   //  」   - RIGHT CORNER BRACKET
  static final char uff09 = '\uff09';   //  )   - FULLWIDTH RIGHT PARENTHESIS
  static final char u0021 = '\u0021';   //  !    - EXCLAMATION MARK
  static final char u0025 = '\u0025';   //  %    - PERCENT SIGN
  static final char u0029 = '\u0029';   //  )    - RIGHT PARENTHESIS
  static final char u002c = '\u002c';   //  ,    - COMMA
  static final char u002e = '\u002e';   //  .    - FULL STOP
  static final char u003f = '\u003f';   //  ?    - QUESTION MARK
  static final char u005d = '\u005d';   //  ]    - RIGHT SQUARE BRACKET
  static final char u007d = '\u007d';   //  }    - RIGHT CURLY BRACKET
  static final char uff61 = '\uff61';   //  。    - HALFWIDTH IDEOGRAPHIC FULL STOP
  static final char uff63 = '\uff63';   //  」    - HALFWIDTH RIGHT CORNER BRACKET
  static final char uff64 = '\uff64';   //  、    - HALFWIDTH IDEOGRAPHIC COMMA
  static final char uff67 = '\uff67';   //  ァ    - HALFWIDTH KATAKANA LETTER SMALL A
  static final char uff68 = '\uff68';   //  ィ    - HALFWIDTH KATAKANA LETTER SMALL I
  static final char uff69 = '\uff69';   //  ゥ    - HALFWIDTH KATAKANA LETTER SMALL U
  static final char uff6a = '\uff6a';   //  ェ    - HALFWIDTH KATAKANA LETTER SMALL E
  static final char uff6b = '\uff6b';   //  ォ    - HALFWIDTH KATAKANA LETTER SMALL O
  static final char uff6c = '\uff6c';   //  ャ    - HALFWIDTH KATAKANA LETTER SMALL YA
  static final char uff6d = '\uff6d';   //  ュ    - HALFWIDTH KATAKANA LETTER SMALL YU
  static final char uff6e = '\uff6e';   //  ョ    - HALFWIDTH KATAKANA LETTER SMALL YO
  static final char uff6f = '\uff6f';   //  ッ    - HALFWIDTH KATAKANA LETTER SMALL TU
  static final char uff70 = '\uff70';   //  ー    - HALFWIDTH KATAKANA-HIRAGANA PROLONGED SOUND MARK
  static final char uff9e = '\uff9e';   //  ゙    - HALFWIDTH KATAKANA VOICED SOUND MARK
  static final char uff9f = '\uff9f';   //  ゚    - HALFWIDTH KATAKANA SEMI-VOICED SOUND MARK
  static final char u3001 = '\u3001';   //  、    - IDEOGRAPHIC COMMA
  static final char u3002 = '\u3002';   //  。    - IDEOGRAPHIC FULL STOP
  static final char uff0c = '\uff0c';   //  ,    - FULLWIDTH COMMA
  static final char uff0e = '\uff0e';   //  .    - FULLWIDTH FULL STOP
  static final char uff1a = '\uff1a';   //  :    - FULLWIDTH COLON
  static final char uff1b = '\uff1b';   //  ;    - FULLWIDTH SEMICOLON
  static final char uff1f = '\uff1f';   //  ?    - FULLWIDTH QUESTION MARK
  static final char uff01 = '\uff01';   //  !    - FULLWIDTH EXCLAMATION MARK
  static final char u309b = '\u309b';   //  ゛    - KATAKANA-HIRAGANA VOICED SOUND MARK
  static final char u309c = '\u309c';   //  ゜    - KATAKANA-HIRAGANA SEMI-VOICED SOUND MARK
  static final char u30fd = '\u30fd';   //  ヽ    - KATAKANA ITERATION MARK
  static final char u30fe = '\u30fe';   //  ヾ    - KATAKANA VOICED ITERATION MARK
  static final char u309d = '\u309d';   //  ゝ    - HIRAGANA ITERATION MARK
  static final char u309e = '\u309e';   //  ゞ    - HIRAGANA VOICED ITERATION MARK
  static final char u3005 = '\u3005';   //  々    - IDEOGRAPHIC ITERATION MARK
  static final char u30fc = '\u30fc';   //  ー    - KATAKANA-HIRAGANA PROLONGED SOUND MARK
  static final char u2019 = '\u2019';   //  ’    - RIGHT SINGLE QUOTATION MARK
  static final char u201d = '\u201d';   //  ”    - RIGHT DOUBLE QUOTATION MARK
  static final char u3015 = '\u3015';   //  〕    - RIGHT TORTOISE SHELL BRACKET
  static final char uff3d = '\uff3d';   //  ]    - FULLWIDTH RIGHT SQUARE BRACKET
  static final char uff5d = '\uff5d';   //  }    - FULLWIDTH RIGHT CURLY BRACKET
  static final char u3009 = '\u3009';   //  〉    - RIGHT ANGLE BRACKET
  static final char u300b = '\u300b';   //  》    - RIGHT DOUBLE ANGLE BRACKET
  static final char u300f = '\u300f';   //  』    - RIGHT WHITE CORNER BRACKET
  static final char u3011 = '\u3011';   //  】    - RIGHT BLACK LENTICULAR BRACKET
  static final char u00b0 = '\u00b0';   //  °    - DEGREE SIGN
  static final char u2032 = '\u2032';   //  ′    - PRIME
  static final char u2033 = '\u2033';   //  ″    - DOUBLE PRIME
  static final char u2103 = '\u2103';   //  ℃    - DEGREE CELSIUS
  static final char u00a2 = '\u00a2';   //  ¢    - CENT SIGN
  static final char uff05 = '\uff05';   //  %    - FULLWIDTH PERCENT SIGN
  static final char u2030 = '\u2030';   //  ‰    - PER MILLE SIGN
  static final char u3041 = '\u3041';   //  ぁ    - HIRAGANA LETTER SMALL A
  static final char u3043 = '\u3043';   //  ぃ    - HIRAGANA LETTER SMALL I
  static final char u3045 = '\u3045';   //  ぅ    - HIRAGANA LETTER SMALL U
  static final char u3047 = '\u3047';   //  ぇ    - HIRAGANA LETTER SMALL E
  static final char u3049 = '\u3049';   //  ぉ    - HIRAGANA LETTER SMALL O
  static final char u3063 = '\u3063';   //  っ    - HIRAGANA LETTER SMALL TU
  static final char u3083 = '\u3083';   //  ゃ    - HIRAGANA LETTER SMALL YA
  static final char u3085 = '\u3085';   //  ゅ    - HIRAGANA LETTER SMALL YU
  static final char u3087 = '\u3087';   //  ょ    - HIRAGANA LETTER SMALL YO
  static final char u308e = '\u308e';   //  ゎ    - HIRAGANA LETTER SMALL WA
  static final char u30a1 = '\u30a1';   //  ァ    - KATAKANA LETTER SMALL A
  static final char u30a3 = '\u30a3';   //  ィ    - KATAKANA LETTER SMALL I
  static final char u30a5 = '\u30a5';   //  ゥ    - KATAKANA LETTER SMALL U
  static final char u30a7 = '\u30a7';   //  ェ    - KATAKANA LETTER SMALL E
  static final char u30a9 = '\u30a9';   //  ォ    - KATAKANA LETTER SMALL O
  static final char u30c3 = '\u30c3';   //  ッ    - KATAKANA LETTER SMALL TU
  static final char u30e3 = '\u30e3';   //  ャ    - KATAKANA LETTER SMALL YA
  static final char u30e5 = '\u30e5';   //  ュ    - KATAKANA LETTER SMALL YU
  static final char u30e7 = '\u30e7';   //  ョ    - KATAKANA LETTER SMALL YO
  static final char u30ee = '\u30ee';   //  ヮ    - KATAKANA LETTER SMALL WA
  static final char u30f5 = '\u30f5';   //  ヵ    - KATAKANA LETTER SMALL KA
  static final char u30f6 = '\u30f6';   //  ヶ    - KATAKANA LETTER SMALL KE

  static final char[] NOT_BEGIN_CHARACTERS = new char[]{u30fb, u2022, uff65, u300d, uff09, u0021, u0025, u0029, u002c,
          u002e, u003f, u005d, u007d, uff61, uff63, uff64, uff67, uff68, uff69, uff6a, uff6b, uff6c, uff6d, uff6e,
          uff6f, uff70, uff9e, uff9f, u3001, u3002, uff0c, uff0e, uff1a, uff1b, uff1f, uff01, u309b, u309c, u30fd,
          u30fe, u309d, u309e, u3005, u30fc, u2019, u201d, u3015, uff3d, uff5d, u3009, u300b, u300f, u3011, u00b0,
          u2032, u2033, u2103, u00a2, uff05, u2030, u3041, u3043, u3045, u3047, u3049, u3063, u3083, u3085, u3087,
          u308e, u30a1, u30a3, u30a5, u30a7, u30a9, u30c3, u30e3, u30e5, u30e7, u30ee, u30f5, u30f6, u2060};

  // a line of text cannot end with any following characters in NOT_ENDING_CHARACTERS[]
  static final char u0024 = '\u0024';   //  $   - DOLLAR SIGN
  static final char u0028 = '\u0028';   //  (   - LEFT PARENTHESIS
  static final char u005b = '\u005b';   //  [   - LEFT SQUARE BRACKET
  static final char u007b = '\u007b';   //  {   - LEFT CURLY BRACKET
  static final char u00a3 = '\u00a3';   //  £   - POUND SIGN
  static final char u00a5 = '\u00a5';   //  ¥   - YEN SIGN
  static final char u201c = '\u201c';   //  “   - LEFT DOUBLE QUOTATION MARK
  static final char u2018 = '\u2018';   //   ‘  - LEFT SINGLE QUOTATION MARK
  static final char u300a = '\u300a';   //  《  - LEFT DOUBLE ANGLE BRACKET
  static final char u3008 = '\u3008';   //  〈  - LEFT ANGLE BRACKET
  static final char u300c = '\u300c';   //  「  - LEFT CORNER BRACKET
  static final char u300e = '\u300e';   //  『  - LEFT WHITE CORNER BRACKET
  static final char u3010 = '\u3010';   //  【  - LEFT BLACK LENTICULAR BRACKET
  static final char u3014 = '\u3014';   //  〔  - LEFT TORTOISE SHELL BRACKET
  static final char uff62 = '\uff62';   //  「   - HALFWIDTH LEFT CORNER BRACKET
  static final char uff08 = '\uff08';   //  (  - FULLWIDTH LEFT PARENTHESIS
  static final char uff3b = '\uff3b';   //  [  - FULLWIDTH LEFT SQUARE BRACKET
  static final char uff5b = '\uff5b';   //  {  - FULLWIDTH LEFT CURLY BRACKET
  static final char uffe5 = '\uffe5';   //  ¥  - FULLWIDTH YEN SIGN
  static final char uff04 = '\uff04';   //  $  - FULLWIDTH DOLLAR SIGN

  static final char[] NOT_ENDING_CHARACTERS = new char[]{u0024, u0028, u005b, u007b, u00a3, u00a5, u201c, u2018, u3008,
          u300a, u300c, u300e, u3010, u3014, uff62, uff08, uff3b, uff5b, uffe5, uff04, u2060};

  /**
   * A shared instance of JapaneseSplitCharacter.
   */
  public static final JapaneseSplitCharacter SplitCharacter = new JapaneseSplitCharacter();

  /**
   * An instance of DefaultSplitCharacter used for Basic Latin characters.
   */
  private static final SplitCharacter defaultSplitCharacter = new DefaultSplitCharacter();

  public JapaneseSplitCharacter() { }

  /**
   * Custom SplitCharacter method that handles Japanese characters.
   * Returns <CODE>true</CODE> if the character can split a line. The splitting implementation
   * is free to look ahead or look behind characters to make a decision.
   *
   * @param start   the lower limit of <CODE>cc</CODE> inclusive
   * @param current the pointer to the character in <CODE>cc</CODE>
   * @param end     the upper limit of <CODE>cc</CODE> exclusive
   * @param cc      an array of characters at least <CODE>end</CODE> sized
   * @param ck      an array of <CODE>PdfChunk</CODE>. The main use is to be able to call
   *                {@link PdfChunk#getUnicodeEquivalent(int)}. It may be <CODE>null</CODE>
   *                or shorter than <CODE>end</CODE>. If <CODE>null</CODE> no conversion takes place.
   *                If shorter than <CODE>end</CODE> the last element is used
   * @return <CODE>true</CODE> if the character(s) can split a line
   */
  @Override
  public boolean isSplitCharacter(int start, int current, int end, char[] cc, PdfChunk[] ck) {

    // Note: without this try/catch, a problem inside isSplitCharacter() fails silently in iText
    // and you have no idea anything went wrong.
    try {

      char charCurrent = getCharacter(current, cc, ck);

      int next = current + 1;
      // 'end' is the exclusive upper limit of cc, which may be longer than the text, so do not look past it
      if (next < end) {
        char charNext = getCharacter(next, cc, ck);
        for (char not_begin_character : NOT_BEGIN_CHARACTERS) {
          if (charNext == not_begin_character) {
            return false;
          }
        }
      }

      for (char not_ending_character : NOT_ENDING_CHARACTERS) {
        if (charCurrent == not_ending_character) {
          return false;
        }
      }

      boolean isBasicLatin = Character.UnicodeBlock.of(charCurrent) == Character.UnicodeBlock.BASIC_LATIN;
      if (isBasicLatin) {
        return defaultSplitCharacter.isSplitCharacter(start, current, end, cc, ck);
      }

      return true;

    } catch (Exception ex) {
      ex.printStackTrace();
    }

    return true;
  }

  /**
   * Returns a character in the array (Note: modified from the iText default version by adding the null
   * check '|| ck[Math.min(position, ck.length - 1)] == null').
   *
   * @param position position in the array
   * @param ck       chunk array
   * @param cc       the character array that has to be checked
   * @return the character
   */
  protected char getCharacter(int position, char[] cc, PdfChunk[] ck) {
    if (ck == null || ck[Math.min(position, ck.length - 1)] == null) {
      return cc[position];
    }
    return (char) ck[Math.min(position, ck.length - 1)].getUnicodeEquivalent(cc[position]);
  }

}
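
To see the effect of the two prohibition lists without setting up iText, here is a minimal, self-contained sketch of the same decision rule applied by a naive fixed-width wrapper. It is not how iText lays out text: the class name, the method names, the character-count width, and the deliberately short character lists are all assumptions made for illustration. The full lists are the ones in JapaneseSplitCharacter above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; it does not depend on iText.
class KinsokuWrapDemo {

    // Illustrative subset of characters that must not begin a line:
    // ideographic comma and full stop, right corner bracket, fullwidth ) ! ?, and ASCII , . ) ! ?
    static final String NOT_BEGIN = "\u3001\u3002\u300d\uff09\uff01\uff1f,.)!?";

    // Illustrative subset of characters that must not end a line:
    // left corner bracket, fullwidth (, and ASCII (
    static final String NOT_END = "\u300c\uff08(";

    /** Returns true if a line may be broken between text[index] and text[index + 1]. */
    public static boolean canBreakAfter(String text, int index) {
        if (index + 1 >= text.length()) {
            return true; // end of text
        }
        if (NOT_BEGIN.indexOf(text.charAt(index + 1)) >= 0) {
            return false; // the next line would start with a prohibited character
        }
        return NOT_END.indexOf(text.charAt(index)) < 0;
    }

    /** Greedy wrap with at most maxChars (>= 1) characters per line, backing up to an allowed break. */
    public static List<String> wrap(String text, int maxChars) {
        List<String> lines = new ArrayList<>();
        int start = 0;
        while (text.length() - start > maxChars) {
            int breakAt = start + maxChars - 1; // index of the last character on this line
            while (breakAt > start && !canBreakAfter(text, breakAt)) {
                breakAt--;
            }
            if (!canBreakAfter(text, breakAt)) {
                breakAt = start + maxChars - 1; // no allowed break on this line: force one
            }
            lines.add(text.substring(start, breakAt + 1));
            start = breakAt + 1;
        }
        lines.add(text.substring(start));
        return lines;
    }

    public static void main(String[] args) {
        // Sample sentence written with escapes so the source file stays ASCII-only.
        String text = "\u4eca\u65e5\u306f\u6674\u308c\u3001\u660e\u65e5\u306f\u96e8\u3002";
        for (String line : wrap(text, 5)) {
            System.out.println(line);
        }
    }
}
```

The sample text is 今日は晴れ、明日は雨。 With a width of 5, a plain fixed-width split would start the second line with 、. The sketch backs up one character instead and produces 今日は晴 / れ、明日は / 雨。, which is the same behaviour the look-ahead in isSplitCharacter() is meant to give iText.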

Hope this helps.
