Word Document.SaveAs ignores encoding when called through OLE from Ruby or VBS
Question
I have a script (VBS or Ruby) that saves a Word document as "Filtered HTML", but the encoding parameter is ignored: the HTML file is always encoded as Windows-1252. I'm using Word 2007 SP3 on Windows 7 SP1.
Ruby example:
require 'win32ole'
word = WIN32OLE.new('Word.Application')
word.visible = false
word_document = word.documents.open('C:\whatever.doc')
word_document.saveas({'FileName' => 'C:\whatever.html', 'FileFormat' => 10, 'Encoding' => 65001})
word_document.close()
word.quit
VBS example:
Option Explicit
Dim MyWord
Dim MyDoc
Set MyWord = CreateObject("Word.Application")
MyWord.Visible = False
Set MyDoc = MyWord.Documents.Open("C:\whatever.doc")
MyDoc.SaveAs "C:\whatever2.html", 10, , , , , , , , , , 65001
MyDoc.Close
MyWord.Quit
Set MyDoc = Nothing
Set MyWord = Nothing
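For readers unfamiliar with the numeric arguments in both scripts: 10 is the value of wdFormatFilteredHTML and 65001 is the value of msoEncodingUTF8 (1252 is msoEncodingWestern). A minimal sketch of naming them in Ruby follows; the constant names and the `save_as_args` helper are hypothetical and not part of the original scripts.

```ruby
# Named stand-ins for the magic numbers used in the scripts above.
WD_FORMAT_FILTERED_HTML = 10     # WdSaveFormat: wdFormatFilteredHTML
MSO_ENCODING_UTF8       = 65001  # MsoEncoding: msoEncodingUTF8
MSO_ENCODING_WESTERN    = 1252   # MsoEncoding: msoEncodingWestern

# Hypothetical helper: builds the named-argument hash that is passed to SaveAs.
def save_as_args(file_name, encoding = MSO_ENCODING_UTF8)
  {
    'FileName'   => file_name,
    'FileFormat' => WD_FORMAT_FILTERED_HTML,
    'Encoding'   => encoding
  }
end
```

With this in place, the Ruby call would read `word_document.saveas(save_as_args('C:\whatever.html'))`, which is equivalent to the hash written out in the example above.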
Documentation:
Document.SaveAs: http://msdn.microsoft.com/en-us/library/bb221597.aspx
msoEncoding values: http://msdn.microsoft.com/en-us/library/office/aa432511(v=office.12).aspx
Any suggestions on how to make Word save the HTML file as UTF-8?
Recommended answer
My solution was to open the HTML file using the same character set that Word used to save it (Windows-1252) and convert it to UTF-8 myself. I also added a whitelist filter (Sanitize) to clean up the HTML. Further cleaning is done with Nokogiri, which Sanitize also relies on.
require 'nokogiri'
require 'sanitize'
# ... add some code converting a Word file to HTML.
# Post-export cleanup.
# Word wrote the file as Windows-1252; transcode it to UTF-8 while reading.
html = '<!DOCTYPE html>' + File.read(html_file_name, encoding: 'windows-1252:utf-8')
html_document = Nokogiri::HTML::Document.parse(html)
# clean_node! is the method name in the Sanitize release used here;
# other releases may name it differently.
Sanitize.new(Sanitize::Config::RESTRICTED).clean_node!(html_document)
html_document.css('html').first['lang'] = 'en-US'
# Drop Word's Generator meta tag (no-op if it is already gone).
html_document.css('meta[name="Generator"]').remove
# ... add more cleaning up of Word's HTML noise.
sanitized_html = html_document.to_html({:encoding => 'utf-8', :indent => 0})
# Write the output to a (new) file next to the Word file.
sanitized_html_file_name = word_file_name.sub(/(.*)\..*$/, '\1.html')
File.open(sanitized_html_file_name, 'w:UTF-8') do |f|
  f.write sanitized_html
end
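The core of the workaround, transcoding Windows-1252 bytes to UTF-8, can also be sketched in plain Ruby without Sanitize or Nokogiri. The helper names below are hypothetical, and the sample assumes the exported file declares `charset=windows-1252` in a meta tag, which should be checked against the actual output.

```ruby
# Hypothetical helper: interpret raw bytes as Windows-1252 and transcode to UTF-8.
def cp1252_to_utf8(bytes)
  bytes.dup.force_encoding(Encoding::Windows_1252).encode(Encoding::UTF_8)
end

# Hypothetical helper: make the declared charset match the new encoding.
# Assumes the declaration appears literally as "charset=windows-1252".
def declare_utf8(html)
  html.sub(/charset=windows-1252/i, 'charset=utf-8')
end

# Sample bytes standing in for a file saved by Word:
# \xE9 is "e acute", \x93 and \x94 are curly double quotes in Windows-1252.
raw = "<meta http-equiv=Content-Type content=\"text/html; charset=windows-1252\">" \
      "caf\xE9 \x93quoted\x94".b

html = declare_utf8(cp1252_to_utf8(raw))
```

Updating the charset declaration matters because a browser that trusts a stale `windows-1252` declaration would misread the UTF-8 bytes.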
HTML Sanitizer: https://github.com/rgrove/sanitize/
HTML parser and modifier: http://nokogiri.org/
In Word 2010 there is a new method, SaveAs2: http://msdn.microsoft.com/en-us/library/ff836084(v=office.14).aspx
I haven't tested SaveAs2, since I don't have Word 2010.
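Another avenue that this answer does not try: Word's object model exposes a `WebOptions.Encoding` property on the document, which controls the encoding of the saved web page. The sketch below sets it before calling SaveAs. It is untested against Word 2007, so whether it changes the output there is an assumption; the helper name `save_filtered_html_utf8` is hypothetical.

```ruby
# Hypothetical helper: ask Word for UTF-8 via WebOptions before saving.
# `document` is expected to be a Word Document obtained through WIN32OLE.
def save_filtered_html_utf8(document, file_name)
  # 65001 = msoEncodingUTF8
  document.WebOptions.Encoding = 65001
  # 10 = wdFormatFilteredHTML
  document.SaveAs({ 'FileName' => file_name, 'FileFormat' => 10, 'Encoding' => 65001 })
end

# Usage on Windows with Word installed (not run here):
#   require 'win32ole'
#   word = WIN32OLE.new('Word.Application')
#   doc  = word.Documents.Open('C:\whatever.doc')
#   save_filtered_html_utf8(doc, 'C:\whatever.html')
#   doc.Close
#   word.Quit
```

If the resulting file is still Windows-1252, the post-export transcoding described in this answer remains the fallback.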