How do I extract rows from a PDF file into a CSV file?

Problem description

I want to get a list of all the colleges in the USA from this PDF file and put it into a CSV file. I will then import the CSV file into SQL Server (so that I can run queries easily).

I tried several online PDF-to-CSV converters and Java-based PDF-to-CSV tutorials. Nothing worked. I have spent 6-8 hours on this today and failed. My CSV files were messed up, and I had a lot of nulls in my DB when I imported the CSV. I even tried searching for a DHS API which could give me this info but found none.

Can someone please help me extract the colleges exactly as they are shown in the PDF file?

PS: You can also see all the colleges using this URL. But you have to scroll manually to extract all the results; that would take too long, and the data would not be in the format given in the PDF file.

Solution

As already claimed in a comment to the question,

Considering the fairly straight forward page content stream style, the data should be extractable using a not too complicated custom text extractor.

In detail:

The page content stream style

Regular table entry content is drawn entry by entry, each entry field by field in reading order. Thus, while going through the content stream we do not have to try and re-arrange the content to establish that order. This makes this task fairly easy.

So the main work will be to ignore non-entries, i.e. the header on the first page, the bars indicating where a new first letter starts, and the page numbers.

We do so by

  • ignoring graphics and non-black text, which takes care of the header and the first letter bars;
  • not accepting entries that do not start with data in the SCHOOL NAME column, which takes care of the page numbers, as these only live in the CAMPUS NAME column.

(Other approaches would also have worked, e.g. ignoring everything in a bottom page area to take care of the page numbers.)

Now we merely have to split the entries into their fields.

Again the document structure helps: as it is a very uniform document, the table columns have identical positions and dimensions on each page. So we merely have to dissect at fixed x values.

There is just one stumbling block: in some entries atomic text chunks contain content of different columns. E.g. sometimes the contents of the F and M columns are drawn as a single string like "YN" and the optical distance is introduced via character spacing.

So we have to process the text chunks character by character, not as a whole.
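As a rough illustration of that per-character dissection, here is a minimal sketch that does not depend on any PDF library. The class and method names are hypothetical, the x positions in main are made-up values, and only the column start values are taken from the fieldstarts array of the sample implementation. It is deliberately simpler than the real strategy, which also rejects characters that are neither in the current nor in the next field.

```java
public class ColumnSplitSketch {
    // Column start x coordinates (same values as fieldstarts in the sample implementation).
    static final int[] FIELD_STARTS = {20, 254, 404, 415, 431, 508, 534};

    /** Returns the index of the column containing x, or -1 if x lies left of the first column. */
    public static int columnOf(float x) {
        int column = -1;
        for (int i = 0; i < FIELD_STARTS.length; i++) {
            if (x >= FIELD_STARTS[i]) {
                column = i;
            }
        }
        return column;
    }

    /** Joins characters into one line, inserting a tab for every column boundary crossed. */
    public static String split(char[] chars, float[] xs) {
        StringBuilder line = new StringBuilder();
        int current = -1;
        for (int i = 0; i < chars.length; i++) {
            int column = columnOf(xs[i]);
            if (current != -1) {
                for (int c = current; c < column; c++) {
                    line.append('\t');
                }
            }
            line.append(chars[i]);
            current = column;
        }
        return line.toString();
    }

    public static void main(String[] args) {
        // "YN" arrives as one chunk; the x values are assumed to fall into the F and M columns.
        System.out.println(split(new char[] {'Y', 'N'}, new float[] {406f, 417f}));
    }
}
```

Because the decision is made per character, a chunk like "YN" is split into two fields even though the PDF draws it as a single string.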

A sample implementation

I use Java and the PDF library iText (current version 5.5.7 development snapshot) here. This does not mean at all that it cannot be done using a different setup; this merely is the setup I'm most accustomed to.

As the separator I use the tab character because other likely candidates also occur as part of the text and I did not want to have to cope with escaping them.

This is the custom RenderListener class introduced to cope with the content as explained above:

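import java.io.IOException;

import com.itextpdf.text.BaseColor;
import com.itextpdf.text.pdf.GrayColor;
import com.itextpdf.text.pdf.parser.ImageRenderInfo;
import com.itextpdf.text.pdf.parser.RenderListener;
import com.itextpdf.text.pdf.parser.TextRenderInfo;
import com.itextpdf.text.pdf.parser.Vector;

import static com.itextpdf.text.pdf.parser.Vector.I1;
import static com.itextpdf.text.pdf.parser.Vector.I2;

public class CertifiedSchoolListExtractionStrategy implements RenderListener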
{
    public CertifiedSchoolListExtractionStrategy(Appendable data, Appendable nonData)
    {
        this.data = data;
        this.nonData = nonData;
    }

    //
    // RenderListener implementation
    //
    @Override
    public void beginTextBlock() { }

    @Override
    public void endTextBlock() { }

    @Override
    public void renderImage(ImageRenderInfo renderInfo) { }

    @Override
    public void renderText(TextRenderInfo renderInfo)
    {
        try
        {
            Vector startPoint = renderInfo.getBaseline().getStartPoint();
            BaseColor fillColor = renderInfo.getFillColor();
            if (fillColor instanceof GrayColor && ((GrayColor)fillColor).getGray() == 0)
            {
                if (debug)
                    data.append(String.format("%4d\t%3.3f %3.3f\t%s\n", chunk, startPoint.get(I1), startPoint.get(I2), renderInfo.getText()));
                for (TextRenderInfo info : renderInfo.getCharacterRenderInfos())
                {
                    renderCharacter(info);
                }
            }
            else
            {
                if (debug)
                    nonData.append(String.format("%4d\t%3.3f %3.3f\t%s\n", chunk, startPoint.get(I1), startPoint.get(I2), renderInfo.getText()));
                if (currentField > -1)
                    finishEntry();
                entryBuilder.append(renderInfo.getText());
            }
        }
        catch (IOException e)
        {
            e.printStackTrace();
        }
        finally
        {
            chunk++;
        }
    }

    public void renderCharacter(TextRenderInfo renderInfo) throws IOException
    {
        Vector startPoint = renderInfo.getBaseline().getStartPoint();

        float x = startPoint.get(I1);

        if (currentField > -1)
        {
            if (isInCurrentField(x))
            {
                entryBuilder.append(renderInfo.getText());
                return;
            }
            if (isInNextField(x))
            {
                currentField++;
                entryBuilder.append('\t').append(renderInfo.getText());
                return;
            }
            finishEntry();
        }
        if (isInNextField(x))
        {
            finishEntry();
            currentField = 0;
        }
        entryBuilder.append(renderInfo.getText());
    }

    public void close() throws IOException
    {
        finishEntry();
    }

    boolean isInCurrentField(float x)
    {
        if (currentField == -1)
            return false;

        if (x < fieldstarts[currentField])
            return false;

        if (currentField == fieldstarts.length - 1)
            return true;

        return x <= fieldstarts[currentField + 1];
    }

    boolean isInNextField(float x)
    {
        if (currentField == fieldstarts.length - 1)
            return false;

        if (x < fieldstarts[currentField + 1])
            return false;

        if (currentField == fieldstarts.length - 2)
            return true;

        return x <= fieldstarts[currentField + 2];
    }

    void finishEntry() throws IOException
    {
        if (entryBuilder.length() > 0)
        {
            if (currentField == fieldstarts.length - 1)
            {
                data.append(entryBuilder).append('\n');
            }
            else
            {
                nonData.append(entryBuilder).append('\n');
            }

            entryBuilder.setLength(0);
        }
        currentField = -1;
    }

    //
    // hidden members
    //
    final Appendable data, nonData;
    boolean debug = false;

    int chunk = 0;
    int currentField = -1;
    StringBuilder entryBuilder = new StringBuilder();

    final int[] fieldstarts = {20, 254, 404, 415, 431, 508, 534};
}

(CertifiedSchoolListExtractionStrategy.java)

We can use it like this:

@Test
public void testCertifiedSchoolList_9_16_2015() throws IOException
{
    try (   Writer data = new OutputStreamWriter(new FileOutputStream(new File(RESULT_FOLDER, "data.txt")), "UTF-8");
            Writer nonData = new OutputStreamWriter(new FileOutputStream(new File(RESULT_FOLDER, "non-data.txt")), "UTF-8")    )
    {
        CertifiedSchoolListExtractionStrategy strategy = new CertifiedSchoolListExtractionStrategy(data, nonData);
        PdfReader reader = new PdfReader("certified-school-list-9-16-2015.pdf");

        PdfReaderContentParser parser = new PdfReaderContentParser(reader);
        for (int page = 1; page <= reader.getNumberOfPages(); page++)
            parser.processContent(page, strategy);
        strategy.close();
        reader.close();
    }
}

(ExtractCertifiedSchoolList.java)

Now data.txt contains all the entries as tab-separated lines, and non-data.txt contains everything that was ignored.
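Since the question asks for CSV, the tab-separated lines could be converted in a small post-processing step. The following is only a sketch under assumptions: the class and method names are hypothetical, and it applies the common CSV convention of quoting fields that contain a comma, a double quote, or a line break, and doubling embedded quotes. Many database import tools can also read tab-separated files directly, in which case no conversion is needed.

```java
public class TsvToCsvSketch {
    /** Converts one tab-separated line to a CSV line, quoting fields where necessary. */
    public static String toCsvLine(String tsvLine) {
        String[] fields = tsvLine.split("\t", -1); // -1 keeps trailing empty fields
        StringBuilder csv = new StringBuilder();
        for (int i = 0; i < fields.length; i++) {
            if (i > 0) {
                csv.append(',');
            }
            csv.append(quote(fields[i]));
        }
        return csv.toString();
    }

    /** Wraps a field in double quotes if it contains a comma, a quote, or a line break. */
    public static String quote(String field) {
        boolean needsQuotes = field.contains(",") || field.contains("\"")
                || field.contains("\n") || field.contains("\r");
        if (!needsQuotes) {
            return field;
        }
        return "\"" + field.replace("\"", "\"\"") + "\"";
    }

    public static void main(String[] args) {
        // First entry of the sample PDF; the school name itself contains double quotes.
        String line = "\"I Am\" School Inc.\t\"I Am\" School Inc.\tY\tN\tMount Shasta\tCA\t41789";
        System.out.println(toCsvLine(line));
    }
}
```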

Behind the scenes

To understand what is happening here, one first has to know how page content in PDFs is organized and how (for the sample code given) iText operates on it.

Inside the PDF

PDF documents are structures built from a number of base object types, some primitive types (numbers, strings, ...) and some more complex ones (arrays or dictionaries of other objects or streams).

A page in a PDF document is represented by such a dictionary object containing entries defining some page properties (like page dimensions) and other entries referencing objects that define what is drawn on the page: the content streams.

Content streams essentially contain a sequence of operations, which may

  • select a color (for stroking or filling),
  • define a path (move to some point, line to some other point, curve to yet another one, ...),
  • stroke or fill such a path,
  • draw some bitmap image somewhere,
  • draw some text somewhere, or
  • do numerous other things.

For the question at hand we are mostly interested in the operations involved in drawing text. In contrast to word processors, the operations are not "take this long string of text and arrange it as a paragraph" but, more primitively, "move the text position here, draw this short string here, move the text position again, and draw another string there".

E.g. in the sample PDF the operations for drawing the table header and the first entry line are these:

/TT2 1 Tf

Select font TT2 at size 1.

9.72 0 0 9.72 20.16 687.36 Tm

Set the text matrix to move the text insertion coordinates to 20.16, 687.36 and scale everything following by a factor of 9.72.

0 g

Select the grayscale fill color black.

0 Tc
0 Tw

Set additional character and word spacing to 0.

(SCHOOL)Tj

Draw "SCHOOL" here.

/TT1 1 Tf

Select font TT1.

3.4082 0 TD

Move text insertion point by 3.4082 in x direction.

<0003>Tj

Draw a space character (the current font uses a different encoding with 16 bits per character instead of 8; the string is written here in hexadecimal).

/TT2 1 Tf
.2261 0 TD
[(NAME)-17887.4(CAMPUS)]TJ

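Select font, move text insertion point, and draw the string "NAME", then a gap of 17887.4 thousandths of a text space unit (numbers in a TJ array are subtracted from the current position, so a negative number moves to the right), then draw "CAMPUS".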
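To get a feeling for the size of such a gap, here is a small worked calculation as a sketch. The class and method names are hypothetical, and it assumes default horizontal scaling (100%) and no additional character or word spacing, as set by the operators above.

```java
public class TjGapSketch {
    /**
     * Horizontal displacement, in text space units, caused by a number in a TJ array.
     * The number is given in thousandths of a text space unit and is subtracted
     * from the current position, so a negative number moves to the right.
     */
    public static double displacementInTextSpace(double tjNumber, double fontSize) {
        return -tjNumber / 1000.0 * fontSize;
    }

    public static void main(String[] args) {
        double textSpace = displacementInTextSpace(-17887.4, 1.0); // font size 1 from "/TT2 1 Tf"
        double userSpace = textSpace * 9.72;                       // scale factor from the Tm operator
        System.out.println(textSpace + " text space units, " + userSpace + " user space units");
    }
}
```

So the gap between "NAME" and "CAMPUS" is about 17.89 text space units, which the text matrix scales to roughly 173.9 user space units.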

/TT1 1 Tf
24.1809 0 TD
<0003>Tj
/TT2 1 Tf
.2261 0 TD
[(NAME)-8986.6(F)-923.7(M)-459.3(CITY)-6349.9(ST)-1390.2(CAMPUS)]TJ
/TT1 1 Tf
28.5147 0 TD
<0003>Tj
/TT2 1 Tf
.2261 0 TD
(ID)Tj

Draw the rest of the header line.

/TT4 1 Tf
-56.782 -1.3086 TD

Move left by 56.782 and down by 1.3086, i.e. to the start of the first entry line.

("I)Tj
/TT3 1 Tf
.6528 0 TD
<0003>Tj
/TT4 1 Tf
.2261 0 TD
(Am")Tj
/TT3 1 Tf
1.7783 0 TD
<0003>Tj
/TT4 1 Tf
.2261 0 TD
(School)Tj
/TT3 1 Tf
2.6919 0 TD
<0003>Tj
/TT4 1 Tf
.2261 0 TD
[(Inc.)-16894.2("I)]TJ
/TT3 1 Tf
18.9997 0 TD
<0003>Tj
/TT4 1 Tf
.2261 0 TD
(Am")Tj
/TT3 1 Tf
1.7783 0 TD
<0003>Tj
/TT4 1 Tf
.2261 0 TD
(School)Tj
/TT3 1 Tf
2.6919 0 TD
<0003>Tj
/TT4 1 Tf
.2261 0 TD
[(Inc.)-8239.9(Y)-1018.9(N)-576.7(Mount)]TJ
/TT3 1 Tf
15.189 0 TD
<0003>Tj
/TT4 1 Tf
.2261 0 TD
[(Shasta)-2423.3(CA)-2443.7(41789)]TJ

And draw the first entry line.

As you can see, and as mentioned above, the table content is drawn in reading order. Even multi-line column entries come in the needed order, e.g. the campus name "A F International of Westlake Village":

[(Inc.)-7228.7(A)]TJ
/TT3 1 Tf
9.26 0 TD 
<0003>Tj
/TT4 1 Tf
.2261 0 TD
(F)Tj
/TT3 1 Tf
.4595 0 TD
<0003>Tj
/TT4 1 Tf
.2261 0 TD
(International)Tj
/TT3 1 Tf
5.2886 0 TD
<0003>Tj
/TT4 1 Tf
.2261 0 TD
(of)Tj
/TT3 1 Tf
.8325 0 TD
<0003>Tj
/TT4 1 Tf
.2261 0 TD
(Westlake)Tj
/TT3 1 Tf
3.7739 0 TD
<0003>Tj
/TT4 1 Tf
-11.8374 -1.3086 TD

Move down to the second line of the column.

(Village)Tj
15.4938 1.3086 TD

Move up again to the main line of the entry.

[(Y)-1018.9(N)-576.7(Westlake)]TJ 

So we can digest the text as it comes; there is no need for sorting (in other PDFs the content could be ordered in a completely different way).

But we also see that there are no obvious column start and end points. To associate the text with a column, therefore, we have to calculate the positions of each character and compare them with externally given column start positions.
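How such positions come about can be sketched for the simple case seen in the sample stream. This is a simplified, hypothetical tracker, not what iText does internally: it only handles an unrotated, uniformly scaled text matrix, only the x coordinate, and only the start of the current line. Positions of individual characters additionally require glyph widths from the font, which is the part a PDF library takes care of.

```java
public class TextPositionSketch {
    private double scale; // uniform scale factor taken from the Tm operator
    private double lineX; // x of the current line start in user space

    /** Handles "a 0 0 a e f Tm" for the unrotated, uniformly scaled case (a = scale, e = x). */
    public void setTextMatrix(double scale, double x) {
        this.scale = scale;
        this.lineX = x;
    }

    /** Handles the x part of "tx ty TD": the offset is relative to the current line start. */
    public void moveLine(double tx) {
        lineX += tx * scale;
    }

    public double currentLineX() {
        return lineX;
    }

    public static void main(String[] args) {
        TextPositionSketch pos = new TextPositionSketch();
        pos.setTextMatrix(9.72, 20.16); // 9.72 0 0 9.72 20.16 687.36 Tm
        pos.moveLine(3.4082);           // 3.4082 0 TD
        System.out.println(pos.currentLineX());
    }
}
```

Applying all the TD offsets of the header line and then the -56.782 offset brings the x coordinate back to approximately 20.16, i.e. the left edge of the SCHOOL NAME column, which matches the first value in the fieldstarts array.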

Parsing supported by libraries

PDF libraries usually provide some mechanism to help parse such content streams.

There are two basic architectures for this. A library may parse the content stream

  • as a whole and provide it as a big array of positioned text chunks, or
  • piecewise and forward individual positioned text chunks using a listener pattern.

The former variant at first seems easier to handle but may have big resource requirements (I have come across multi-MB content streams), while the latter seems a bit more difficult to handle but has smaller memory requirements.
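The difference between the two architectures can be sketched without any PDF library. All types and names here are hypothetical, the parse method is merely a stand-in for a real content stream parser, and the chunk coordinates are illustrative values.

```java
import java.util.ArrayList;
import java.util.List;

public class ParsingStylesSketch {
    /** A positioned text chunk; a hypothetical type for illustration only. */
    public static final class Chunk {
        public final float x;
        public final float y;
        public final String text;

        public Chunk(float x, float y, String text) {
            this.x = x;
            this.y = y;
            this.text = text;
        }
    }

    /** Listener style: the parser pushes each chunk as soon as it has been parsed. */
    public interface ChunkListener {
        void onChunk(Chunk chunk);
    }

    /** Stand-in for a real content stream parser; it just emits two fixed chunks. */
    public static void parse(ChunkListener listener) {
        listener.onChunk(new Chunk(20.16f, 687.36f, "SCHOOL"));
        listener.onChunk(new Chunk(55.49f, 687.36f, "NAME"));
    }

    /** Whole-stream style, built here on top of the listener by collecting every chunk. */
    public static List<Chunk> parseAll() {
        List<Chunk> chunks = new ArrayList<>();
        parse(chunks::add);
        return chunks;
    }

    public static void main(String[] args) {
        // Piecewise: each chunk is handled and can be discarded immediately.
        parse(chunk -> System.out.println(chunk.x + "\t" + chunk.text));
        // As a whole: all chunks are held in memory before any processing starts.
        System.out.println(parseAll().size() + " chunks held in memory");
    }
}
```

As the sketch suggests, the array style can always be built on top of a listener by collecting the chunks, whereas the reverse direction does not recover the memory advantage.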

The library I used (iText) follows the latter approach, but your problem could also have been solved using a library following the former one.

RenderListener is the listener interface to implement here; its renderText method receives the individual text chunks with their positions.

In the implementation above (CertifiedSchoolListExtractionStrategy) the renderText method first checks the fill color associated with the chunk and only forwards black text for further processing in renderCharacter. That method (and some helpers) in turn checks which field the text is in (using hard-coded position boundaries) and exports tab-separated values accordingly. This logic could be implemented similarly using other libraries, too.
