Compare two near-duplicate documents (PDF files) in C#


Problem description

I have develop application for compare two near duplicate documents(pdf files) in c#.

What I actually want is to compare the content of the two files and find out how much of file1 matches file2, that is, the percentage of file1's content that also appears in file2.

Whenever two PDF files are compared, the result should be that match percentage.

My code looks like this:

// Fragment of a Windows Forms class. It needs:
// using System; using System.IO; using System.Windows.Forms;
private bool FileCompare(string file1, string file2)
{
    // Determine if the same file was referenced two times.
    if (file1 == file2)
    {
        // Return true to indicate that the files are the same.
        return true;
    }

    // Open the two files read-only. The using statements close both
    // streams on every exit path, including exceptions.
    using (FileStream fs1 = new FileStream(file1, FileMode.Open, FileAccess.Read))
    using (FileStream fs2 = new FileStream(file2, FileMode.Open, FileAccess.Read))
    {
        // Check the file sizes. If they are not the same, the files
        // are not the same.
        if (fs1.Length != fs2.Length)
        {
            // Return false to indicate files are different
            return false;
        }

        int file1byte;
        int file2byte;

        // Read and compare a byte from each file until either a
        // non-matching set of bytes is found or until the end of
        // file1 is reached (ReadByte returns -1 at end of stream).
        do
        {
            // Read one byte from each file.
            file1byte = fs1.ReadByte();
            file2byte = fs2.ReadByte();
        }
        while ((file1byte == file2byte) && (file1byte != -1));

        // Return the success of the comparison. "file1byte" is
        // equal to "file2byte" at this point only if the files are
        // the same.
        return file1byte == file2byte;
    }
}

private void PdfCompare_Load(object sender, EventArgs e)
{
}

private void button1_Click(object sender, EventArgs e)
{
    if (FileCompare(this.textBox1.Text, this.textBox2.Text))
    {
        MessageBox.Show("Files are equal.");
    }
    else
    {
        MessageBox.Show("Files are not equal.");
    }
}

} // closes the form class, whose declaration was not included in the post



Please help me calculate the match percentage when comparing two near-duplicate PDF files.

Solution

Quote:

I have develop application

If you're saying here that you wrote that method: I don't believe you. The English of the code comments doesn't match the English of your question at all.

However, that method compares two files byte by byte and only reports whether they are identical. I don't think you actually want that, and extending it to count matching bytes wouldn't make much sense either. Take a text document, make a copy of it and delete only the very first character in the copy. The contents are the same except that one character is missing from the copy and every other character is shifted by one position. If you compare those files byte by byte, each position now holds a different character, so the resulting match percentage is essentially meaningless, even though the documents are almost identical.

You have to go for a completely different approach: extract the text from both PDFs first, then run a diff algorithm over the extracted text. It is too complex to explain in full here, but I will give you these pointers:

Extracting text from PDF documents: Toxy

"Diff" algorithms/libraries:
http://stackoverflow.com/questions/138331/any-decent-text-diff-merge-engine-for-net
https://github.com/mmanela/diffplex
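
To illustrate the thought experiment above, here is a minimal sketch that contrasts a position-by-position comparison with a diff-style comparison. It assumes the text has already been extracted from both PDFs into plain strings (for example with a library such as Toxy); the extraction step itself is not shown. The class and method names (TextSimilarity, PositionalMatchPercent, LcsLength, WordMatchPercent) are hypothetical, and the longest-common-subsequence (LCS) computation is only a simple stand-in for what a real diff library such as DiffPlex does more efficiently.

```csharp
using System;

public static class TextSimilarity
{
    // Percentage of character positions that hold the same character,
    // measured against the longer text. This mirrors the byte-by-byte idea.
    public static double PositionalMatchPercent(string a, string b)
    {
        int longer = Math.Max(a.Length, b.Length);
        if (longer == 0) return 100.0;

        int shorter = Math.Min(a.Length, b.Length);
        int same = 0;
        for (int i = 0; i < shorter; i++)
        {
            if (a[i] == b[i]) same++;
        }
        return 100.0 * same / longer;
    }

    // Length of the longest common subsequence of two word arrays,
    // computed with the classic dynamic-programming table (two rows only).
    public static int LcsLength(string[] x, string[] y)
    {
        int[] prev = new int[y.Length + 1];
        int[] curr = new int[y.Length + 1];

        for (int i = 1; i <= x.Length; i++)
        {
            for (int j = 1; j <= y.Length; j++)
            {
                curr[j] = x[i - 1] == y[j - 1]
                    ? prev[j - 1] + 1
                    : Math.Max(prev[j], curr[j - 1]);
            }
            // Swap rows: the row just filled becomes the previous row.
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[y.Length];
    }

    // Percentage of the words in text1 that also occur, in the same order,
    // in text2. Measuring against text1 is an assumption based on the
    // question ("how much of file1 matches file2").
    public static double WordMatchPercent(string text1, string text2)
    {
        char[] separators = { ' ', '\t', '\r', '\n' };
        string[] w1 = text1.Split(separators, StringSplitOptions.RemoveEmptyEntries);
        string[] w2 = text2.Split(separators, StringSplitOptions.RemoveEmptyEntries);

        if (w1.Length == 0) return w2.Length == 0 ? 100.0 : 0.0;
        return 100.0 * LcsLength(w1, w2) / w1.Length;
    }
}
```

For the sentence "The quick brown fox jumps over the lazy dog" and a copy with the first character deleted, PositionalMatchPercent returns 0 because every position is shifted, while WordMatchPercent returns roughly 89 because eight of the nine words still line up. The LCS table costs time proportional to the product of the two word counts, so for long documents a dedicated diff library is likely the better choice.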

