I want to read the content of a PDF file correctly


Question

I have developed a WinForms application that reads the content of text files (.txt) very well. When I use the same code to read PDF files, it runs, but the content looks like this:

"횶땐擇몎态㺛갿籕因뚜靐⨎ᴪ䣌塥並ࠧ町뫳俫黶뫜ﭪ଻혫᭍떼㌵ꇨ㯽☐녴샯﹯蛪髚☐㉾翐☐䜓☐幄뤄ꇥል貑꒥⣔☐⭸쨧렅½캽泜빳燗⁇圷춪⏖뚍鳀餡ꊾᴦ뗖詒Ꝅ퍃怮鐙좽聗逋麟☐ധш♉℩邝䥎ᒼ翏狲Ꮘ쮛旾睬譚칺馵ว퀑뒷ꞹ䰛涉죢㐆蓮捥ﳂ濼跛ᬹ䲷妞ఞ"

I cannot make sense of this content. I obtained this result by stepping through the code with breakpoints.

The code is as follows:

using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.Data;
using System.Drawing;
using System.Linq;
using System.Text;
using System.IO;
using System.Collections;
using System.Windows.Forms;

namespace test
{
    public partial class Form1 : Form
    {
        public Form1()
        {
            InitializeComponent();
        }
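        // Pairs consecutive bytes into 16-bit chars (low byte first), i.e. it
        // treats the file as UTF-16LE text. A trailing odd byte is ignored.
        public static string StringFromBytes(byte[] arr)
        {
            char[] ch = new char[arr.Length / 2];
            for (int i = 0; i < ch.Length; ++i)
            {
                ch[i] = (char)((int)arr[i * 2] + (((int)arr[i * 2 + 1]) << 8));
            }
            return new String(ch);
        }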

        private void button1_Click(object sender, EventArgs e)
        {
            ArrayList fileStatistics = new ArrayList();
            String datasetPath = @"D:\Data Sets\Enron";
            DirectoryInfo d = new DirectoryInfo(datasetPath);
            FileInfo[] files = d.GetFiles("*.pdf");
            MessageBox.Show(files.Length.ToString());

            foreach (FileInfo file in files)
            {
                // create instance of data class
                fileAtt f = new fileAtt();

                f.fFullName = file.FullName;
                f.fName = file.Name;
                f.FileSize = file.Length;
                f.fExtension = file.Extension;
                byte[] bytes = File.ReadAllBytes(file.FullName);
                f.content = Form1.StringFromBytes(bytes);
                //f.content = Encoding.ASCII.GetString(bytes);
                f.lastaccesstime = file.LastAccessTime;
                fileStatistics.Add(f);
                //StreamReader r = new StreamReader(datasetPath);
            }
            gvStatistics.DataSource = fileStatistics;
        }
    }
}





fileAtt is the property class:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

namespace test
{
    class fileAtt
    {
        public long FileSize { get; set; }
        public string fName { get; set; }
        public string fFullName { get; set; }
        public string fExtension { get; set; }

        public string content { get; set; }

        public DateTime lastaccesstime { get; set; }
    }
}







I want to read the content of PDFs correctly, i.e. as text the user can understand. That is my requirement, and I would like a solution based on the code above.

Please help me.

Thank you.

Recommended answer

PDF files are not plain text. They are binary files with a fairly complex internal structure, so you cannot simply read the raw bytes and expect to see the text of the document. Pairing the bytes into 16-bit characters, as StringFromBytes does, only reinterprets that binary data as arbitrary Unicode characters, which is the output shown in the question.
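To make the effect concrete, here is a minimal sketch that runs the same byte pairing on the eight ASCII bytes of a typical PDF header. The class name PdfHeaderDemo and the sample header "%PDF-1.4" are chosen for illustration only; StringFromBytes mirrors the method from the question.

```csharp
using System;
using System.Text;

public static class PdfHeaderDemo
{
    // Same byte pairing as StringFromBytes in the question:
    // two bytes become one 16-bit char, low byte first.
    public static string StringFromBytes(byte[] arr)
    {
        char[] ch = new char[arr.Length / 2];
        for (int i = 0; i < ch.Length; ++i)
        {
            ch[i] = (char)(arr[i * 2] + (arr[i * 2 + 1] << 8));
        }
        return new string(ch);
    }

    public static void Main()
    {
        // A PDF file begins with a header such as "%PDF-1.4" (8 ASCII bytes).
        byte[] header = Encoding.ASCII.GetBytes("%PDF-1.4");

        string paired = StringFromBytes(header);

        // The 8 bytes collapse into 4 chars.
        Console.WriteLine(paired.Length);

        // '%' is 0x25 and 'P' is 0x50, so the first char is 0x25 + (0x50 << 8) = U+5025,
        // a CJK character rather than anything resembling "%P".
        Console.WriteLine(((int)paired[0]).ToString("X4"));

        // Decoded one byte per character, the header is readable again.
        Console.WriteLine(Encoding.ASCII.GetString(header));
    }
}
```

Decoding byte by byte only helps for the header, though. The page content of a PDF is usually stored in compressed streams, so switching the whole file to Encoding.ASCII would still not produce the document text.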

I think the easiest approach is to use a ready-made library such as iTextSharp to open the PDF and extract the text from it.
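As a rough sketch of that suggestion, the following assumes the iTextSharp 5.x assembly is referenced by the project and uses its PdfReader and PdfTextExtractor classes. The class name PdfTextReader and the ExtractText helper are hypothetical names introduced here, not part of the original code.

```csharp
using System;
using System.Text;
using iTextSharp.text.pdf;          // assumption: iTextSharp 5.x is referenced
using iTextSharp.text.pdf.parser;

public static class PdfTextReader
{
    // Returns the text of every page in the PDF at 'path', one page after another.
    public static string ExtractText(string path)
    {
        PdfReader reader = new PdfReader(path);
        try
        {
            StringBuilder text = new StringBuilder();
            // iTextSharp numbers pages starting at 1.
            for (int page = 1; page <= reader.NumberOfPages; ++page)
            {
                text.AppendLine(PdfTextExtractor.GetTextFromPage(reader, page));
            }
            return text.ToString();
        }
        finally
        {
            // Release the file handle even if extraction fails.
            reader.Close();
        }
    }

    public static void Main(string[] args)
    {
        // args[0] is a hypothetical path to a PDF file supplied by the caller.
        Console.WriteLine(ExtractText(args[0]));
    }
}
```

In the loop from the question, a call such as f.content = PdfTextReader.ExtractText(file.FullName); could then take the place of the File.ReadAllBytes and StringFromBytes lines. Note that a PDF made of scanned images contains no text layer, so a text extractor is likely to return little or nothing for such files.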

