How to remove junk characters while reading a Word document stored in an 'OLE Object' field in an Access database through C#?
Question
I am accessing an Ms Access database through C#. I am able to read all the fields. The problem is that when I read .txt and .doc files stored in an OLE Object field of the table, a lot of extra junk characters are also read before and after the actual text, like this: ÿÿÿÿ‡€ ÿÿÿÿÿÿÿÿˆ ÿÿÿÿÿÿÿÿ€ ˆˆˆˆˆˆˆˆ€ ÿÿÿÿÿÿÿÿþ
i 8 @ñÿ 8 N o r m a l CJ _H aJ mH sH tH < A@òÿ¡ <
D e f a u l t P a r a g r a p h F o n t … ÿÿÿÿ ( f p ³ ú ÿ A Ä M • À ' n î 0 q Œ Ï
My C# code looks like this: `
/*Read from the query and write in a temporary file*/
var oleBytes = (Byte[])Cmd.ExecuteScalar();
MemoryStream ms = new MemoryStream();
ms.Write(oleBytes, 0, oleBytes.Length - 0);
var file = Path.GetTempFileName();
using (var fileStream = File.OpenWrite(file))
{
var buffer = ms.GetBuffer();
fileStream.Write(buffer, 0, (int)ms.Length);
}
`
I then read this temporary file as a Word document: `
Microsoft.Office.Interop.Word.ApplicationClass wordObject = new ApplicationClass();
object fpath = file; //this is the path
object nullobject = System.Reflection.Missing.Value;
Microsoft.Office.Interop.Word.Document docs = wordObject.Documents.Open
(ref fpath, ref nullobject, ref nullobject, ref nullobject,
ref nullobject, ref nullobject, ref nullobject, ref nullobject,
ref nullobject, ref nullobject, ref nullobject, ref nullobject,
ref nullobject, ref nullobject, ref nullobject, ref nullobject);
docs.ActiveWindow.Selection.WholeStory();
docs.ActiveWindow.Selection.Copy();
IDataObject iData = Clipboard.GetDataObject();
if (iData != null)
data = iData.GetData(DataFormats.Text).ToString();
`
I don't know what is going wrong. Am I also reading the field's metadata from the table? If so, how can I avoid it? What would be an efficient way to read an OLE Object field that stores files other than images?
Recommended answer
I found a solution for Word documents (.doc files). OLE object storage in Ms Access contains some header information before the actual data, so simply extracting the field contents as a byte array and saving it to disk does not work. Any OLE Object file has some standard signature. For Word documents, the OLEheaderLength is 85 bytes, so I strip the first 85 bytes of the byte array, like this:
Con.Open();
string _query = "select licenseDoc from Products where ID=56";
// Column licenseDoc contains Word and text documents as OLE Objects
OleDbCommand Cmd = new OleDbCommand(_query, Con);
const int offset = 85; // OLE header length for .doc files
var oleBytes = (Byte[])Cmd.ExecuteScalar();
// Copy everything after the header into a memory stream
MemoryStream ms = new MemoryStream();
ms.Write(oleBytes, offset, oleBytes.Length - offset);
// Write the stripped bytes to a temporary file
var file = Path.GetTempFileName();
using (var fileStream = File.OpenWrite(file))
{
var buffer = ms.GetBuffer();
fileStream.Write(buffer, 0, (int)ms.Length);
}
The variable file contains the path of the .tmp file, which holds the data read from the Word document stored as an OLE object in Ms Access. This file can be opened directly as a Word document, or its extension can be changed to .doc.
The OLEheaderLength values for other formats are as follows:
1] JPEG/JPG=224
2] BMP=78
3] PDF=85
4] SNP=74
5] DOC=85/90
6] DOCX=87
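The header lengths listed above could be collected in a lookup table so that the stripping step is not hard-coded to a single format. The following is only a minimal sketch: the class name OleHeaderStripper and both method names are hypothetical, the offsets are copied from the list as reported (with 85 chosen for DOC out of 85/90), and none of them has been verified against real files.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper, not part of any library.
static class OleHeaderStripper
{
    // Offsets copied from the list above; DOC is listed as 85/90, 85 is assumed here.
    static readonly Dictionary<string, int> HeaderLengths =
        new Dictionary<string, int>(StringComparer.OrdinalIgnoreCase)
        {
            { ".jpg", 224 }, { ".jpeg", 224 }, { ".bmp", 78 }, { ".pdf", 85 },
            { ".snp", 74 }, { ".doc", 85 }, { ".docx", 87 }
        };

    // Returns a copy of oleBytes without the leading header for the given extension.
    public static byte[] StripHeader(byte[] oleBytes, string extension)
    {
        if (oleBytes == null) throw new ArgumentNullException("oleBytes");
        int offset;
        if (!HeaderLengths.TryGetValue(extension, out offset))
            throw new ArgumentException("No known header length for " + extension);
        if (oleBytes.Length < offset)
            throw new ArgumentException("Field is shorter than the expected header.");

        var payload = new byte[oleBytes.Length - offset];
        Buffer.BlockCopy(oleBytes, offset, payload, 0, payload.Length);
        return payload;
    }

    // Writes the stripped bytes to a new file in the temp folder with the given extension.
    public static string SaveToTempFile(byte[] oleBytes, string extension)
    {
        string name = Path.ChangeExtension(Path.GetRandomFileName(), extension);
        string path = Path.Combine(Path.GetTempPath(), name);
        File.WriteAllBytes(path, StripHeader(oleBytes, extension));
        return path;
    }
}
```

With the oleBytes array from the query above, a call such as OleHeaderStripper.SaveToTempFile(oleBytes, ".doc") would return a path that already carries the .doc extension, so no renaming is needed afterwards.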
I don't know the OLEheaderLength of .txt (simple text) files. Unfortunately, the above solution works only for .doc files. When it comes to .docx files and any other file format, it fails.
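Since a fixed offset does not carry over to other formats, one possible alternative is to search the field bytes for the embedded file's own leading signature and cut everything before it. This is a swapped-in technique, not the method used in the answer above. The sketch assumes that the signature first appears at the start of the embedded file and not earlier inside the Access header. It does not remove any trailing bytes, and it does not help for plain .txt files, which have no signature. The class and member names are hypothetical.

```csharp
using System;

// Hypothetical helper, not part of any library.
static class OleSignatureScanner
{
    // Well-known leading bytes of a few file formats.
    public static readonly byte[] DocSignature =
        { 0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1 };                  // .doc (OLE compound file)
    public static readonly byte[] DocxSignature = { 0x50, 0x4B, 0x03, 0x04 }; // .docx (ZIP container)
    public static readonly byte[] PdfSignature = { 0x25, 0x50, 0x44, 0x46 };  // "%PDF"

    // Returns the index of the first occurrence of signature in data, or -1 if absent.
    public static int IndexOf(byte[] data, byte[] signature)
    {
        for (int i = 0; i <= data.Length - signature.Length; i++)
        {
            int j = 0;
            while (j < signature.Length && data[i + j] == signature[j]) j++;
            if (j == signature.Length) return i;
        }
        return -1;
    }

    // Returns the bytes from the signature to the end of the field,
    // or null when the signature is not found.
    public static byte[] ExtractFrom(byte[] oleBytes, byte[] signature)
    {
        int start = IndexOf(oleBytes, signature);
        if (start < 0) return null;

        var payload = new byte[oleBytes.Length - start];
        Buffer.BlockCopy(oleBytes, start, payload, 0, payload.Length);
        return payload;
    }
}
```

For a .doc field this would be called as OleSignatureScanner.ExtractFrom(oleBytes, OleSignatureScanner.DocSignature), and the result written to disk in the same way as before. A null result means the signature was not found and the field should be handled some other way.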
To find out the length of the OLE header, you can simply use the library that is described and can be downloaded from here-