I want to read the content of a PDF file correctly
Problem description
I have developed a WinForms application that reads the content of text files (.txt)
correctly. When I use the same code to read PDF files, it runs, but the content
looks like this: "횶땐擇몎态㺛갿籕因뚜靐⨎ᴪ䣌塥並ࠧ町뫳俫黶뫜ﭪ혫떼㌵ꇨ㯽☐녴샯蛪髚☐㉾翐☐䜓☐幄뤄ꇥል貑꒥⣔☐⭸쨧렅½캽泜빳燗⁇圷춪⏖뚍鳀餡ꊾᴦ뗖詒Ꝅ퍃怮鐙좽聗逋麟☐ധш♉℩邝䥎ᒼ翏狲Ꮘ쮛旾睬譚칺馵ว퀑뒷ꞹ䰛涉죢㐆蓮捥ﳂ濼跛ᬹ䲷妞ఞ".
This content is unreadable. I found this result by stepping through the code with breakpoints.
The code is as follows:
using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.Data;
using System.Drawing;
using System.Linq;
using System.Text;
using System.IO;
using System.Collections;
using System.Windows.Forms;

namespace test
{
    public partial class Form1 : Form
    {
        public Form1()
        {
            InitializeComponent();
        }

        public static string StringFromBytes(byte[] arr)
        {
            char[] ch = new char[arr.Length / 2];
            for (int i = 0; i < ch.Length; ++i)
            {
                ch[i] = (char)((int)arr[i * 2] + (((int)arr[i * 2 + 1]) << 8));
            }
            return new String(ch);
        }

        private void button1_Click(object sender, EventArgs e)
        {
            ArrayList fileStatistics = new ArrayList();
            String datasetPath = @"D:\Data Sets\Enron";
            DirectoryInfo d = new DirectoryInfo(datasetPath);
            FileInfo[] files = d.GetFiles("*.pdf");
            MessageBox.Show(files.Length.ToString());
            foreach (FileInfo file in files)
            {
                //create instance of data class
                fileAtt f = new fileAtt();
                f.fFullName = file.FullName;
                f.fName = file.Name;
                f.FileSize = file.Length;
                f.fExtension = file.Extension;
                byte[] bytes = File.ReadAllBytes(file.FullName);
                f.content = Form1.StringFromBytes(bytes);
                //f.content = Encoding.ASCII.GetString(bytes);
                f.lastaccesstime = file.LastAccessTime;
                fileStatistics.Add(f);
                // StreamReader r = new StreamReader(datasetPath);
                //foreach
            }
            gvStatistics.DataSource = fileStatistics;
        }
    }
}
fileAtt is the property class:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

namespace test
{
    class fileAtt
    {
        public long FileSize { get; set; }
        public string fName { get; set; }
        public string fFullName { get; set; }
        public string fExtension { get; set; }
        public string content { get; set; }
        public DateTime lastaccesstime { get; set; }
    }
}
I want to read the content of the PDFs correctly, i.e. as text the user can understand. That is
my requirement. I would like a solution based on the code above.
Please help me.
Thank you.
Recommended answer
PDF files are not plain text; they are binary files with a fairly complex internal structure. So you cannot simply read the raw bytes and expect to see the text of the PDF document.
I think the easiest approach is to use a ready-made library such as iTextSharp to parse the PDF and extract the text from it.
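As a rough illustration of the library approach, here is a minimal sketch that assumes iTextSharp 5.x is referenced by the project (for example via NuGet). The class name PdfTextHelper and the method name ExtractText are hypothetical names chosen for this example; PdfReader and PdfTextExtractor are the iTextSharp types used to open the file and pull text out page by page.

```csharp
using System.Text;
using iTextSharp.text.pdf;         // assumption: iTextSharp 5.x is referenced
using iTextSharp.text.pdf.parser;  // contains PdfTextExtractor

namespace test
{
    // Hypothetical helper, not part of the original question's code.
    static class PdfTextHelper
    {
        // Returns the text of every page in the PDF, one page after another.
        public static string ExtractText(string pdfPath)
        {
            PdfReader reader = new PdfReader(pdfPath);
            try
            {
                StringBuilder text = new StringBuilder();
                // iTextSharp numbers pages starting at 1.
                for (int page = 1; page <= reader.NumberOfPages; page++)
                {
                    text.AppendLine(PdfTextExtractor.GetTextFromPage(reader, page));
                }
                return text.ToString();
            }
            finally
            {
                // Release the file handle even if extraction throws.
                reader.Close();
            }
        }
    }
}
```

In button1_Click, the File.ReadAllBytes and StringFromBytes lines could then be replaced with f.content = PdfTextHelper.ExtractText(file.FullName);. The original StringFromBytes pairs up raw bytes as 16-bit characters, which is why the compressed binary data of a PDF showed up as random CJK-looking symbols. Note that a PDF consisting only of scanned images contains no text to extract, so this sketch would return little or nothing for such files.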