Text extraction with iText does not work: encoding problem or encrypted text?


Problem description

I have a PDF file that has the following security properties: printing: allowed; document assembly: NOT allowed; content copy: allowed; content copy for accessibility: allowed; page extraction: NOT allowed.

I tried to get the text with sample code based on the documentation sample, as follows:

pdftext.Text = null;
StringBuilder text = new StringBuilder();
PdfReader pdfReader = new PdfReader(filename);
for (int page = 1; page <= pdfReader.NumberOfPages; page++)
{
    ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
    string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);
    text.Append(System.Environment.NewLine);
    text.Append("\n Page Number:" + page);
    text.Append(System.Environment.NewLine);
    currentText = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(currentText)));
    text.Append(currentText);
    progressBar1.Value++;

}

pdftext.Text += text.ToString();
pdfReader.Close();

but the output text consists of lines with values like "??? ? ???????\n?? ??? ? ".

It seems that the file is encrypted or that we have an encoding problem...

Note the results of the following lines:

var f = pdfReader.IsOpenedWithFullPermissions; -> FALSE
var f1 = pdfReader.IsEncrypted(); -> FALSE
var f2 = pdfReader.ComputeUserPassword(); -> NULL
var f3 = pdfReader.Is128Key(); -> FALSE
var f4 = pdfReader.HasUsageRights(); -> FALSE

f, f1, f3 and f4 all return FALSE, so it seems that the document is not encrypted. I therefore don't know whether this is an encoding problem or something related to encrypted strings...

Can someone help me? Thanks in advance. G.G.

Solution

Whenever you have trouble extracting text from a document using standard code, the first thing to do is try to copy&paste the text from it using Adobe Acrobat Reader. Adobe Reader copy&paste implements text extraction according to the recommendations of the PDF specification, and if this fails, it usually means that the information required for text extraction is either missing from the document or broken (by accident or by design). To extract the text, one then either needs to tailor the code to the specific PDF or resort to OCR.

In case of the document at hand, Adobe Reader copy&paste does result in garbage, too, just like when extracting with iText. Thus, there is something fishy in the document.

Inspecting the document one finds that the fonts contain ToUnicode mappings like this:

/CIDInit /ProcSet
findresource begin 12 dict begin begincmap /CIDSystemInfo<</Registry(Adobe)
/Ordering(Identity)
/Supplement 0
>>
def
/CMapName/F18 def
1 begincodespacerange <0000> <FFFF> endcodespacerange
44 beginbfrange
<20> <20> <0020>
<21> <21> <E0F9>
<22> <22> <E0F1>
<23> <23> <E0FA>
<24> <24> <E0F7>
<25> <25> <E0A3>
<26> <26> <E084>
<27> <27> <E097>
<28> <28> <E098>
<29> <29> <E09A>
<2A> <2A> <E08A>
<2B> <2B> <E099>
<2C> <2C> <E0A5>
<2D> <2D> <E086>
<2E> <2E> <E094>
<2F> <2F> <E0DE>
<30> <30> <E0A6>
<31> <31> <E096>
<32> <32> <E088>
<33> <33> <E082>
<34> <34> <E04C>
<35> <35> <E0A4>
<36> <36> <E0F6>
<37> <37> <E0F2>
<38> <38> <E0D8>
<39> <39> <E0AA>
<3A> <3A> <E06C>
<3B> <3B> <E087>
<3C> <3C> <E095>
<3D> <3D> <E0C4>
<3E> <3E> <E07E>
<3F> <3F> <E055>
<40> <40> <E089>
<41> <41> <E085>
<42> <42> <E083>
<43> <43> <E070>
<44> <44> <E0E6>
<45> <45> <E080>
<46> <46> <E0C8>
<47> <47> <E0F4>
<48> <48> <E062>
<49> <49> <E0F3>
<4A> <4A> <E04E>
<4B> <4B> <E05E>
endbfrange
endcmap CMapName currentdict /CMap defineresource pop end end 

In other words, for readers not familiar with this syntax: the fonts claim that all their glyphs (with the exception of the space glyph at 0x20) represent characters U+E0xx from the Unicode private use area. As the name of that area indicates, there is no common meaning of characters with these values.

Thus, text extraction according to the PDF specification will return strings of characters with undefined meaning, with results like those you observed in iText and I saw in Adobe Reader.
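A quick way to confirm this diagnosis on extracted text is to count how many characters fall into the private use area. The following is a minimal sketch, not part of the original answer; the class and method names are made up for illustration. It also shows why such characters end up as question marks after a round trip through a single-byte encoding. ASCII is used here as a deterministic stand-in; the assumption is that the `Encoding.Default` round trip in the question behaves similarly when `Encoding.Default` is an ANSI code page, as it is on .NET Framework.

```csharp
using System.Text;

static class PuaCheck
{
    // True for characters in the BMP private use area (U+E000..U+F8FF).
    public static bool IsPrivateUse(char c)
    {
        return c >= '\uE000' && c <= '\uF8FF';
    }

    // Fraction of non-whitespace characters that lie in the private use area.
    public static double PrivateUseRatio(string text)
    {
        int total = 0, pua = 0;
        foreach (char c in text)
        {
            if (char.IsWhiteSpace(c))
                continue;
            total++;
            if (IsPrivateUse(c))
                pua++;
        }
        return total == 0 ? 0.0 : (double)pua / total;
    }

    // Round trip through ASCII (stand-in for an ANSI code page):
    // characters the encoding cannot represent are replaced by '?'.
    public static string RoundTripAscii(string text)
    {
        return Encoding.ASCII.GetString(Encoding.ASCII.GetBytes(text));
    }
}
```

For the string "\uE0F9\uE0F1 \uE0FA", `PrivateUseRatio` returns 1.0 and `RoundTripAscii` returns "?? ?". If that assumption about `Encoding.Default` holds, the question marks come from the conversion step rather than from iText, and dropping that step would keep the U+E0xx values intact for later processing.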


Sometimes in such a situation one can still enforce proper text extraction by ignoring the ToUnicode map and using either the font Encoding or information inside the embedded font program.

Unfortunately it turns out that here the Encoding effectively contains the same information as does the ToUnicode map, e.g. for the same font as above

/Differences [ 32 /space /uniE0F9 /uniE0F1 /uniE0FA /uniE0F7 /uniE0A3 /uniE084 /uniE097 /uniE098 
/uniE09A /uniE08A /uniE099 /uniE0A5 /uniE086 /uniE094 /uniE0DE /uniE0A6 /uniE096 
/uniE088 /uniE082 /uniE04C /uniE0A4 /uniE0F6 /uniE0F2 /uniE0D8 /uniE0AA /uniE06C 
/uniE087 /uniE095 /uniE0C4 /uniE07E /uniE055 /uniE089 /uniE085 /uniE083 /uniE070 
/uniE0E6 /uniE080 /uniE0C8 /uniE0F4 /uniE062 /uniE0F3 /uniE04E /uniE05E ] 

and the fonts turn out to be Type3 fonts, i.e. there is no embedded font program but each glyph is defined as an individual PDF canvas without further character information.

Thus, nothing to gain here either.

Actually these small PDF canvasses contain inlined bitmap graphics of the respective glyph which also is the cause of the poor graphical quality of the document (if you don't see that immediately, simply zoom in a bit and you'll see the ragged outlines of the glyphs).

By the way, such a construct usually means that the producer of the PDF explicitly wants to prevent text extraction.


If you happen to have to extract text from many such documents, you can try and determine a mapping from their U+E0xx characters to actually sensible Unicode characters and apply that mapping to your extracted text.

If all those fonts in all those documents happen to use the same U+E0xx codepoints for the same actual characters, you'll be able to do text extraction from those documents after investing a certain amount of initial work.

Otherwise do try OCR.
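The remapping idea described above could be sketched as follows. This is only an illustration under stated assumptions: the class name is invented, and the two entries in `SampleMap` are hypothetical placeholders, not values taken from the actual document. The real table has to be built by hand from a visual comparison of glyphs and code points.

```csharp
using System.Collections.Generic;
using System.Text;

static class PuaRemapper
{
    // HYPOTHETICAL sample entries; the real pairs must be determined by
    // looking at which glyph each U+E0xx code point actually draws.
    public static readonly Dictionary<char, char> SampleMap = new Dictionary<char, char>
    {
        { '\uE0F9', 'A' }, // placeholder, not from the actual document
        { '\uE0F1', 'B' }, // placeholder, not from the actual document
    };

    // Replaces every character that has an entry in the map and
    // leaves all other characters (spaces, line breaks, ...) untouched.
    public static string Remap(string extracted, IDictionary<char, char> map)
    {
        StringBuilder sb = new StringBuilder(extracted.Length);
        foreach (char c in extracted)
        {
            char mapped;
            sb.Append(map.TryGetValue(c, out mapped) ? mapped : c);
        }
        return sb.ToString();
    }
}
```

The function should be applied to the string returned by the text extractor directly. Any lossy encoding conversion in between would have to be removed first, since the U+E0xx values cannot be recovered once they have been replaced.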


The following code adds pages to a document which map the ToUnicode values to the characters shown:

void AddFontsTo(PdfReader reader, PdfStamper stamper)
{
    int documentPages = reader.NumberOfPages;
    for (int page = 1; page <= documentPages; page++)
    {
        // ignore inherited resources for now
        PdfDictionary pageResources = reader.GetPageResources(page);
        if (pageResources == null)
            continue;
        PdfDictionary pageFonts = pageResources.GetAsDict(PdfName.FONT);
        if (pageFonts == null || pageFonts.Size == 0)
            continue;

        List<BaseFont> fonts = new List<BaseFont>();
        List<string> fontNames = new List<string>();
        HashSet<char> chars = new HashSet<char>();
        foreach (PdfName key in pageFonts.Keys)
        {
            PdfIndirectReference fontReference = pageFonts.GetAsIndirectObject(key);
            if (fontReference == null)
                continue;
            DocumentFont font = (DocumentFont) BaseFont.CreateFont((PRIndirectReference)fontReference);
            if (font == null)
                continue;

            PdfObject toUni = PdfReader.GetPdfObjectRelease(font.FontDictionary.Get(PdfName.TOUNICODE));
            CMapToUnicode toUnicodeCmap = null; 
            if (toUni is PRStream)
            {
                try
                {
                    byte[] touni = PdfReader.GetStreamBytes((PRStream)toUni);
                    CidLocationFromByte lb = new CidLocationFromByte(touni);
                    toUnicodeCmap = new CMapToUnicode();
                    CMapParserEx.ParseCid("", toUnicodeCmap, lb);
                }
                catch
                {
                    toUnicodeCmap = null;
                }
            }
            if (toUnicodeCmap == null)
                continue;
            ICollection<int> mapValues = toUnicodeCmap.CreateDirectMapping().Values;
            if (mapValues.Count == 0)
                continue;

            fonts.Add(font);
            fontNames.Add(key.ToString());

            foreach (int value in mapValues)
                chars.Add((char)value);
        }
        if (fonts.Count == 0 || chars.Count == 0)
            continue;

        Rectangle size = (fonts.Count > 10) ? PageSize.A4.Rotate() : PageSize.A4;

        PdfPTable table = new PdfPTable(fonts.Count + 1);
        table.AddCell("Page " + page);
        foreach (String name in fontNames)
        {
            table.AddCell(name);
        }
        table.HeaderRows = 1;
        float[] widths = new float[fonts.Count + 1];
        widths[0] = 2;
        for (int i = 1; i <= fonts.Count; i++)
            widths[i] = 1;
        table.SetWidths(widths);
        table.WidthPercentage = 100;

        List<char> charList = new List<char>(chars);
        charList.Sort();
        foreach (char character in charList)
        {
            table.AddCell(((int)character).ToString("X4"));
            foreach (BaseFont font in fonts)
            {
                table.AddCell(new PdfPCell(new Phrase(character.ToString(), new Font(font))));
            }
        }

        stamper.InsertPage(reader.NumberOfPages + 1, size);
        ColumnText columnText = new ColumnText(stamper.GetUnderContent(reader.NumberOfPages));
        columnText.AddElement(table);
        columnText.SetSimpleColumn(size);
        while ((ColumnText.NO_MORE_TEXT & columnText.Go(false)) == 0)
        {
            stamper.InsertPage(reader.NumberOfPages + 1, size);
            columnText.Canvas = stamper.GetUnderContent(reader.NumberOfPages);
            columnText.SetSimpleColumn(size);
        }
    }
}

I applied it to your document like this:

string input = @"4700198773.pdf";
string output = @"4700198773-fonts.pdf";

using (PdfReader reader = new PdfReader(input))
using (FileStream stream = new FileStream(output, FileMode.Create, FileAccess.Write))
using (PdfStamper stamper = new PdfStamper(reader, stream))
{
    AddFontsTo(reader, stamper);
}

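The additional pages look like this (the screenshot from the original answer is not reproduced here): each one holds a table with a header row naming the source page and its fonts, one row per ToUnicode value in hexadecimal, and one column per font showing the glyph that font draws for that value.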

Now you have to compare the outputs for the different fonts and pages of this document with each other and with those of a representative selection of files. If you find a good enough pattern, you can try this replacement approach.
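Once per-font tables have been written down by hand, the comparison can be partly automated. The sketch below is an assumption about how one might do that, with invented names; it merges several per-font tables into one and reports every code point that two fonts assign to different characters. An empty conflict list would suggest that a single shared replacement table is feasible.

```csharp
using System.Collections.Generic;

static class MappingMerger
{
    // Merges per-font tables (code point -> real character) into 'merged'.
    // Returns the code points for which two tables disagree; for those,
    // the first value seen is kept in 'merged'.
    public static List<char> Merge(
        IEnumerable<IDictionary<char, char>> perFont,
        IDictionary<char, char> merged)
    {
        List<char> conflicts = new List<char>();
        foreach (IDictionary<char, char> table in perFont)
        {
            foreach (KeyValuePair<char, char> pair in table)
            {
                char existing;
                if (!merged.TryGetValue(pair.Key, out existing))
                    merged[pair.Key] = pair.Value;
                else if (existing != pair.Value && !conflicts.Contains(pair.Key))
                    conflicts.Add(pair.Key);
            }
        }
        return conflicts;
    }
}
```

If conflicts do show up, the tables would have to be kept per font (or per document), which makes the replacement approach considerably more laborious and OCR correspondingly more attractive.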
