How can I get text formatting with iTextSharp

Question

I am using iTextSharp to read the text content of a PDF, and that part works. But I am losing the text formatting, such as the font and color. Is there any way to get that formatting as well?

Below is the code segment I am using to extract the text:

PdfReader reader = new PdfReader("F:\\EBooks\\AspectsOfAjax.pdf");
textBox1.Text = ExtractTextFromPDFBytes(reader.GetPageContent(1));

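// Size of the look-back window. Assumed value: the question does not show this field.
// It must be at least 4 because CheckToken inspects the last four characters.
private const int _numberOfCharsToKeep = 15;

private string ExtractTextFromPDFBytes(byte[] input)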
{
    if (input == null || input.Length == 0) return "";
    try
    {
        string resultString = "";
        // Flag showing whether we are currently inside a text object
        bool inTextObject = false;
        // Flag showing whether the next character is a literal, e.g. '\\' to get a '\' character or '\(' to get '('
        bool nextLiteral = false;
        // () bracket nesting level. Text appears inside ()
        int bracketDepth = 0;
        // Keep the most recent characters so that operators, numbers etc. can be recognized by looking back
        char[] previousCharacters = new char[_numberOfCharsToKeep];
        for (int j = 0; j < _numberOfCharsToKeep; j++) previousCharacters[j] = ' ';
        for (int i = 0; i < input.Length; i++)
        {
            char c = (char)input[i];
            if (inTextObject)
            {
                // Position the text
                if (bracketDepth == 0)
                {
                    if (CheckToken(new string[] { "TD", "Td" }, previousCharacters))
                    {
                        resultString += "\r\n";
                    }
                    else
                    {
                        if (CheckToken(new string[] {"'", "T*", "\""}, previousCharacters))
                        {
                            resultString += "\n";
                        }
                        else
                        {
                            if (CheckToken(new string[] { "Tj" }, previousCharacters))
                            {
                                resultString += " ";
                            }
                        }
                    }
                }
                // End of a text object, also go to a new line.
                if (bracketDepth == 0 && CheckToken( new string[]{"ET"}, previousCharacters))
                {
                    inTextObject = false;
                    resultString += " ";
                }
                else
                {
                    // Start outputting text
                    if ((c == '(') && (bracketDepth == 0) && (!nextLiteral))
                    {
                        bracketDepth = 1;
                    }
                    else
                    {
                        // Stop outputting text
                        if ((c == ')') && (bracketDepth == 1) && (!nextLiteral))
                        {
                            bracketDepth = 0;
                        }
                        else
                        {
                            // Just a normal text character:
                            if (bracketDepth == 1)
                            {
                                // Only print out next character no matter what. 
                                // Do not interpret.
                                if (c == '\\' && !nextLiteral)
                                {
                                    nextLiteral = true;
                                }
                                else
                                {
                                    if (((c >= ' ') && (c <= '~')) || ((c >= 128) && (c < 255)))
                                    {
                                        resultString += c.ToString();
                                    }
                                    nextLiteral = false;
                                }
                            }
                        }
                    }
                }
            }
            // Store the recent characters for when we have to go back for a checking
            for (int j = 0; j < _numberOfCharsToKeep - 1; j++)
            {
                previousCharacters[j] = previousCharacters[j + 1];
            }
            previousCharacters[_numberOfCharsToKeep - 1] = c;

            // Start of a text object
            if (!inTextObject && CheckToken(new string[]{"BT"}, previousCharacters))
            {
                inTextObject = true;
            }
        }
        return resultString;
    }
    catch
    {
        return "";
    }
}

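private bool CheckToken(string[] tokens, char[] recent)
{
    int last = _numberOfCharsToKeep - 1;
    // The most recent character must be a delimiter that ends the operator
    if (!IsDelimiter(recent[last])) return false;
    foreach (string token in tokens)
    {
        // The operator sits just before the trailing delimiter and must itself
        // be preceded by a delimiter. Using token.Length here also handles the
        // one-character operators ' and ", which previously indexed token[1]
        // and threw an IndexOutOfRangeException.
        int start = last - token.Length;
        if (start < 1 || !IsDelimiter(recent[start - 1])) continue;
        bool match = true;
        for (int k = 0; k < token.Length; k++)
        {
            if (recent[start + k] != token[k]) { match = false; break; }
        }
        if (match) return true;
    }
    return false;
}

// Space, carriage return and line feed separate operators in the content stream
private static bool IsDelimiter(char c)
{
    return c == ' ' || c == 0x0d || c == 0x0a;
}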

Solution

Let me try pointing you in a different direction. iTextSharp has a really beautiful and simple text extraction system that handles some of the basic tokens. Unfortunately it doesn't handle color information, but according to @Mark Storer (http://stackoverflow.com/questions/5872051/how-to-get-text-with-a-certain-color-from-a-pdf-c/5873831#5873831) it might not be too hard to implement yourself.
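
For comparison, here is a minimal sketch of the built-in extraction path, before any custom strategy is involved. It assumes iTextSharp 5.x is referenced. The class name PlainTextExtractionSketch and the method ExtractPage are made up for illustration, and SimpleTextExtractionStrategy returns plain text only, with no font or color information.

```csharp
using System;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

// Hypothetical helper, not part of iTextSharp
public static class PlainTextExtractionSketch
{
    // Returns the plain text of one page (page numbers start at 1)
    public static string ExtractPage(string pdfPath, int pageNumber)
    {
        PdfReader reader = new PdfReader(pdfPath);
        try
        {
            // The strategy object receives every text chunk the parser finds;
            // swapping in a custom ITextExtractionStrategy is how formatting
            // can be captured as well.
            return PdfTextExtractor.GetTextFromPage(
                reader, pageNumber, new SimpleTextExtractionStrategy());
        }
        finally
        {
            reader.Close();
        }
    }
}
```

The custom strategy further down plugs into the same GetTextFromPage call, so replacing the strategy argument is the only change needed at the call site.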

BEGIN EDIT

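I started work on implementing color information. See my blog post (http://chrishaas.wordpress.com/2011/07/31/getting-color-information-from-itextsharps-textrenderinfo-and-itextextractionstrategy/) for more details. (Sorry for the bad formatting, heading off to dinner now.)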

END EDIT

The code below combines several questions and answers here, including this one (http://stackoverflow.com/questions/2375674/itextsharp-how-to-get-the-position-of-word-on-a-page/4866110#4866110) to get the font height (although it's not exact), as well as another one (that for the life of me I can't seem to find anymore) that shows how to detect faux bold.

PostscriptFontName returns some additional characters in front of the font name (for example NJNSWD+ in front of Papyrus-Regular). I think this happens when the font is embedded as a subset.
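
If the prefix gets in the way, it can be removed before the name is written out. The sketch below assumes the prefix is a PDF subset tag, which the PDF specification defines as six uppercase letters followed by a plus sign. FontNameSketch and StripSubsetTag are hypothetical names, not part of iTextSharp.

```csharp
using System;

// Hypothetical helper, not part of iTextSharp
public static class FontNameSketch
{
    // Removes a subset tag such as "NJNSWD+" from the front of a font name.
    // Names that do not start with six uppercase letters and '+' are returned unchanged.
    public static string StripSubsetTag(string postscriptName)
    {
        if (postscriptName == null || postscriptName.Length < 8 || postscriptName[6] != '+')
        {
            return postscriptName;
        }
        for (int i = 0; i < 6; i++)
        {
            char ch = postscriptName[i];
            if (ch < 'A' || ch > 'Z')
            {
                return postscriptName;
            }
        }
        return postscriptName.Substring(7);
    }
}
```

Calling StripSubsetTag on the value of PostscriptFontName before building the style attribute would leave only the base font name in the generated HTML.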

Below is a complete WinForms application that targets iTextSharp 5.1.1.0 and extracts text as HTML.

Screenshot of sample PDF

Sample text extracted as HTML

<span style="font-family:NJNSWD+Papyrus-Regular;font-size:11.61407">Hello </span>
<span style="font-family:NJNSWD+Papyrus-Regular-Bold;font-size:11.61407">w</span>
<span style="font-family:NJNSWD+Papyrus-Regular-Bold;font-size:37.87201">o</span>
<span style="font-family:NJNSWD+Papyrus-Regular-Bold;font-size:11.61407">rl</span>
<span style="font-family:NJNSWD+Papyrus-Regular;font-size:11.61407">d </span>
<br />
<span style="font-family:NJNSWD+Papyrus-Regular;font-size:11.61407">Test </span>

Code

using System;
using System.Collections.Generic;
using System.Text;
using System.Windows.Forms;
using iTextSharp.text.pdf.parser;
using iTextSharp.text.pdf;

namespace WindowsFormsApplication2
{
    public partial class Form1 : Form
    {
        public Form1()
        {
            InitializeComponent();
        }

        private void Form1_Load(object sender, EventArgs e)
        {
            PdfReader reader = new PdfReader(System.IO.Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.Desktop), "Document.pdf"));
            TextWithFontExtractionStategy S = new TextWithFontExtractionStategy();
            string F = iTextSharp.text.pdf.parser.PdfTextExtractor.GetTextFromPage(reader, 1, S);
            Console.WriteLine(F);

            this.Close();
        }

        public class TextWithFontExtractionStategy : iTextSharp.text.pdf.parser.ITextExtractionStrategy
        {
            //HTML buffer
            private StringBuilder result = new StringBuilder();

            //Store last used properties
            private Vector lastBaseLine;
            private string lastFont;
            private float lastFontSize;

            //http://api.itextpdf.com/itext/com/itextpdf/text/pdf/parser/TextRenderInfo.html
            private enum TextRenderMode
            {
                FillText = 0,
                StrokeText = 1,
                FillThenStrokeText = 2,
                Invisible = 3,
                FillTextAndAddToPathForClipping = 4,
                StrokeTextAndAddToPathForClipping = 5,
                FillThenStrokeTextAndAddToPathForClipping = 6,
                AddTextToPaddForClipping = 7
            }



            public void RenderText(iTextSharp.text.pdf.parser.TextRenderInfo renderInfo)
            {
                string curFont = renderInfo.GetFont().PostscriptFontName;
                //Check if faux bold is used
                if ((renderInfo.GetTextRenderMode() == (int)TextRenderMode.FillThenStrokeText))
                {
                    curFont += "-Bold";
                }

                //This code assumes that if the baseline changes then we're on a newline
                Vector curBaseline = renderInfo.GetBaseline().GetStartPoint();
                Vector topRight = renderInfo.GetAscentLine().GetEndPoint();
                iTextSharp.text.Rectangle rect = new iTextSharp.text.Rectangle(curBaseline[Vector.I1], curBaseline[Vector.I2], topRight[Vector.I1], topRight[Vector.I2]);
                Single curFontSize = rect.Height;

                //See if something has changed, either the baseline, the font or the font size
                if ((this.lastBaseLine == null) || (curBaseline[Vector.I2] != lastBaseLine[Vector.I2]) || (curFontSize != lastFontSize) || (curFont != lastFont))
                {
                    //if we've put down at least one span tag close it
                    if ((this.lastBaseLine != null))
                    {
                        this.result.AppendLine("</span>");
                    }
                    //If the baseline has changed then insert a line break
                    if ((this.lastBaseLine != null) && curBaseline[Vector.I2] != lastBaseLine[Vector.I2])
                    {
                        this.result.AppendLine("<br />");
                    }
                    //Create an HTML tag with appropriate styles
                    this.result.AppendFormat("<span style=\"font-family:{0};font-size:{1}\">", curFont, curFontSize);
                }

                //Append the current text
                this.result.Append(renderInfo.GetText());

                //Set currently used properties
                this.lastBaseLine = curBaseline;
                this.lastFontSize = curFontSize;
                this.lastFont = curFont;
            }

            public string GetResultantText()
            {
                //If we wrote anything then we'll always have a missing closing tag so close it here
                if (result.Length > 0)
                {
                    result.Append("</span>");
                }
                return result.ToString();
            }

            //Not needed
            public void BeginTextBlock() { }
            public void EndTextBlock() { }
            public void RenderImage(ImageRenderInfo renderInfo) { }
        }
    }
}
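
One possible hardening of the HTML output, offered as a sketch rather than as part of the original answer: the code above appends the extracted text and the font name unescaped, and AppendFormat formats the font size with the current culture, which can yield a decimal comma on some systems. The helper below assumes .NET 4.0 or later for System.Net.WebUtility. The names HtmlSpanSketch, OpenSpan and EncodeText are made up for illustration.

```csharp
using System;
using System.Globalization;
using System.Net;

// Hypothetical helper, not part of iTextSharp
public static class HtmlSpanSketch
{
    // Builds the opening span tag with an escaped font name and a
    // culture-independent font size (always a decimal point).
    public static string OpenSpan(string fontName, float fontSize)
    {
        return string.Format(
            CultureInfo.InvariantCulture,
            "<span style=\"font-family:{0};font-size:{1}\">",
            WebUtility.HtmlEncode(fontName),
            fontSize);
    }

    // Escapes characters such as <, > and & in text taken from the PDF
    public static string EncodeText(string text)
    {
        return WebUtility.HtmlEncode(text);
    }
}
```

In RenderText, OpenSpan could stand in for the AppendFormat call and EncodeText could wrap renderInfo.GetText() before it is appended.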
