Word Document.SaveAs ignores encoding when called through OLE from Ruby or VBS

Problem description

I have a script (in VBS or Ruby) that saves a Word document as 'Filtered HTML', but the Encoding parameter is ignored: the HTML file is always written in Windows-1252. I'm using Word 2007 SP3 on Windows 7 SP1.

require 'win32ole'

word = WIN32OLE.new('Word.Application')
word.visible = false
word_document = word.documents.open('C:\whatever.doc')
# FileFormat 10 is wdFormatFilteredHTML; Encoding 65001 is msoEncodingUTF8.
word_document.saveas({'FileName' => 'C:\whatever.html', 'FileFormat' => 10, 'Encoding' => 65001})
word_document.close()
word.quit

VBS example:

Option Explicit
Dim MyWord
Dim MyDoc
Set MyWord = CreateObject("Word.Application")
MyWord.Visible = False
Set MyDoc = MyWord.Documents.Open("C:\whatever.doc")
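' Argument 2 is FileFormat (10 = wdFormatFilteredHTML); argument 12 is Encoding (65001 = msoEncodingUTF8).
MyDoc.SaveAs "C:\whatever2.html", 10, , , , , , , , , , 65001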
MyDoc.Close
MyWord.Quit
Set MyDoc = Nothing
Set MyWord = Nothing
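
To confirm which encoding Word actually wrote, the bytes of the output file can be inspected directly. The following is a minimal sketch and not part of the original question: the helper names `declared_charset` and `valid_utf8?` are hypothetical, and the sample bytes merely stand in for a real exported file.

```ruby
# Hypothetical helpers for checking the encoding of an exported HTML file.

# Returns the charset named in the HTML meta declaration, or nil if none is found.
def declared_charset(bytes)
  match = bytes.b.match(/charset=["']?([A-Za-z0-9_-]+)/i)
  match && match[1].downcase
end

# Returns true if the bytes form a valid UTF-8 sequence.
def valid_utf8?(bytes)
  bytes.dup.force_encoding(Encoding::UTF_8).valid_encoding?
end

# Sample bytes standing in for Word's output: "é" is the single byte 0xE9 in Windows-1252.
sample = "<meta http-equiv=Content-Type content=\"text/html; charset=windows-1252\"><p>caf\xE9</p>".b

puts declared_charset(sample)
puts valid_utf8?(sample)
```

With a real export, `File.binread` on the saved HTML file would supply the bytes in place of `sample`.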

Documentation:

Document.SaveAs: http://msdn.microsoft.com/en-us/library/bb221597.aspx

msoEncoding values: http://msdn.microsoft.com/en-us/library/office/aa432511(v=office.12).aspx

Any suggestions on how to make Word save the HTML file in UTF-8?

Recommended answer

My solution was to open the HTML file using the same character set that Word used to save it (Windows-1252) and transcode it to UTF-8. I also added a whitelist filter (Sanitize) to clean up the HTML. Further cleaning is done with Nokogiri, which Sanitize also relies on.

require 'nokogiri'
require 'sanitize'

# ... add some code converting a Word file to HTML.

# Post-export cleanup.
# Read the file as Windows-1252 (what Word wrote) and transcode it to UTF-8.
html = File.open(html_file_name, 'r:windows-1252:utf-8') do |f|
  '<!DOCTYPE html>' + f.read
end
html_document = Nokogiri::HTML::Document.parse(html)
# clean_node! is the Sanitize 2.x name; releases from 3.0 on name this method node!.
Sanitize.new(Sanitize::Config::RESTRICTED).clean_node!(html_document)
html_document.css('html').first['lang'] = 'en-US'
# Remove Word's Generator meta tag, if one is present.
generator = html_document.css('meta[name="Generator"]').first
generator.remove if generator

# ... add more cleaning up of Word's HTML noise.

sanitized_html = html_document.to_html(:encoding => 'utf-8', :indent => 0)
# Write the output to a (new) file.
sanitized_html_file_name = word_file_name.sub(/(.*)\..*$/, '\1.html')
File.open(sanitized_html_file_name, 'w:UTF-8') do |f|
  f.write sanitized_html
end
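
The transcoding step this workaround relies on can be tried in isolation, without Word or Nokogiri. The sketch below is an illustration under assumptions: `transcode_word_html` is a hypothetical helper, the sample bytes stand in for a real export, and rewriting the charset declaration is an extra step that the answer itself does not perform.

```ruby
# Hypothetical helper: reinterpret Word's Windows-1252 bytes as UTF-8 text
# and update the charset declaration to match.
def transcode_word_html(bytes)
  text = bytes.dup.force_encoding(Encoding::Windows_1252).encode(Encoding::UTF_8)
  text.sub(/charset=windows-1252/i, 'charset=utf-8')
end

# Sample bytes standing in for an exported file; 0xE9 is "é" in Windows-1252.
word_bytes = "<meta content=\"text/html; charset=windows-1252\"><p>caf\xE9</p>".b

html = transcode_word_html(word_bytes)
puts html.encoding
puts html.valid_encoding?
```

Opening the file with the mode string `r:windows-1252:utf-8`, as the answer does, performs the same conversion while reading.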

HTML Sanitizer: https://github.com/rgrove/sanitize/

HTML parser and modifier: http://nokogiri.org/

In Word 2010 there is a new method, SaveAs2: http://msdn.microsoft.com/en-us/library/ff836084(v=office.14).aspx

I haven't tested SaveAs2, since I don't have Word 2010.
