How to use .bat formatting to batch-format unicode files to ANSI files?

Problem description

Total beginner to .bat programming, so please bear with me: I've been trying to convert a massive database of Unicode files collected from scientific instruments to ANSI format. Furthermore, I need to convert all these files to .txt files.

Now, the second part is pretty trivial -- I used to do it with the "Bulk Rename Utility", and I've been able to make it work so far, I think.

The first part should be pretty straightforward, and I've found several similar questions, but they all seem to be for PowerShell, for a single file, or end in long discussions about the specific encoding being used. One question seems to match mine exactly, but when I tried its suggested code, only half the file seems to transfer fine; the other half comes through as nonsense. I've been using the code:

for %%F in (*.001) do ren "*SS.001" "*SS1.001"

for %%F in (*.001) do type "%%F" >"%%~nF.txt"

and then deleting/moving the extra files.

I've converted the files by hand successfully in the past (left), but the current encoding seems to be failing (right):

[Image: side-by-side comparison of files encoded by hand vs. by program]

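My questions are:

  1. Is it possible that a single file I get from my instrument is in multiple encodings (part UTF-8, part UTF-16), and that this is messing up my program (or, more likely, I'm using an encoding that is too small)? If so, I could see why characters like the squared and degree symbols in particular would break, but not the data, which is just numbers.
  2. Is there some obvious typo in my code that is causing this bizarre error?
  3. If the error might be embedded in what unicode (8 vs 16 vs 32) or ANSI (1252 vs ???) I'm using, how would I check?
  4. How would I fix this code to work?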

If there are any better questions I should be asking, or additional information I need to add, please let me know. Thank you!!

Recommended answer

Is it possible that a single file I get from my instrument is in multiple encodings (part UTF-8, part UTF-16), and that this is messing up my program (or more likely, I'm using an encoding that is too small)?

I don't believe a single file can contain multiple encodings.

Is there some obvious typo in my code that is causing this bizarre error?

The cmd environment can handle different code pages easily enough, but it struggles with multi-byte encodings and byte order marks. Indeed, this is a common problem when trying to read WMI results returned in UCS-2 LE. Although there exists a pure batch workaround for sanitizing WMI results, it unfortunately doesn't work universally with every other encoding.

If the error might be embedded in what unicode (8 vs 16 vs 32) or ANSI (1252 vs ???) I'm using, how would I check? How would I fix this code to work?

.NET is much better at sanely dealing with files of unknown encodings. The StreamReader class, when it reads its first character, will read the BOM and detect the file encoding automatically. I know you were hoping to avoid a PowerShell solution, but PowerShell really is the easiest way to access IO methods to handle these files transparently.
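
To answer the "how would I check?" part more directly, the byte order mark can also be inspected by hand. The following is a minimal sketch rather than part of the original answer: `Get-BomName` is a hypothetical helper name, and the check only works for files that actually begin with a BOM. A file without one is reported as such, and this check alone cannot tell it apart from ANSI.

```powershell
# Hypothetical helper (not from the original answer): name a file's encoding from its BOM.
function Get-BomName ([string]$path) {
    # read at most the first four bytes of the file
    $stream = [IO.File]::OpenRead($path)
    try {
        $buf = New-Object byte[] 4
        $count = $stream.Read($buf, 0, 4)
    } finally {
        $stream.Dispose()
    }
    $hex = [BitConverter]::ToString($buf, 0, $count)   # e.g. "FF-FE-41-00"

    # test the 4-byte UTF-32 marks before the 2-byte UTF-16 marks they begin with
    if     ($hex.StartsWith('FF-FE-00-00')) { 'utf-32 LE' }
    elseif ($hex.StartsWith('00-00-FE-FF')) { 'utf-32 BE' }
    elseif ($hex.StartsWith('EF-BB-BF'))    { 'utf-8' }
    elseif ($hex.StartsWith('FF-FE'))       { 'utf-16 LE' }
    elseif ($hex.StartsWith('FE-FF'))       { 'utf-16 BE' }
    else                                    { 'no BOM (ANSI, or BOM-less Unicode)' }
}

# report every .001 file in the current folder
gci *.001 | %{ '{0}: {1}' -f $_.Name, (Get-BomName $_.FullName) }
```

Run from a PowerShell prompt in the data folder, this prints one line per file, which should show whether all instrument files share one encoding.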

There is a simple way to incorporate PowerShell hybrid code into a batch script though. Save this with a .bat extension and see whether it does what you want.

<# : batch portion
@echo off & setlocal

powershell -noprofile "iex (${%~f0} | out-string)"
goto :EOF
: end batch / begin PowerShell hybrid #>

function file2ascii ($infile, $outfile) {

    # construct IO streams for reading and writing
    $reader = new-object IO.StreamReader($infile)
    # StreamWriter has no (path, encoding) constructor, so pass append = $false explicitly
    $writer = new-object IO.StreamWriter($outfile, $false, [Text.Encoding]::ASCII)

    # copy infile to ASCII encoded outfile
    while (!$reader.EndOfStream) { $writer.WriteLine($reader.ReadLine()) }

    # output summary
    $encoding = $reader.CurrentEncoding.WebName
    "{0} ({1}) -> {2} (ascii)" -f (gi $infile).Name, $encoding, (gi $outfile).Name

    # dispose of both streams to flush the writer and release the file handles
    foreach ($stream in ($reader, $writer)) { $stream.Dispose() }
}

# loop through all .001 files and apply file2ascii()
gci *.001 | %{
    $outfile = "{0}\{1}.txt" -f $_.Directory, $_.BaseName
    file2ascii $_.FullName $outfile
}

While it's true that this process could be simplified using the get-content and out-file cmdlets, the IO stream methods demonstrated above avoid having to load an entire data file into memory, which is a benefit if any of your data files is large.
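
One caveat about the target encoding: `[Text.Encoding]::ASCII` covers only 7-bit characters, so symbols such as ° and ² are written as `?`. If "ANSI" here means Windows code page 1252 (an assumption; the question itself is unsure which code page applies), a variant along the following lines may be closer to what was asked. `file2ansi` is a hypothetical name, and the sketch assumes Windows PowerShell, where code page 1252 is available through `GetEncoding`.

```powershell
# Assumption: "ANSI" means Windows-1252; change the number if your code page differs.
$ansi = [Text.Encoding]::GetEncoding(1252)

# Hypothetical variant of file2ascii that writes code page 1252 instead of ASCII.
function file2ansi ($infile, $outfile) {
    # StreamReader detects the input encoding from the BOM on its first read
    $reader = New-Object IO.StreamReader($infile)
    # (path, append, encoding): $false overwrites any existing output file
    $writer = New-Object IO.StreamWriter($outfile, $false, $ansi)
    try {
        # copy line by line so large files are never held in memory at once
        while (!$reader.EndOfStream) { $writer.WriteLine($reader.ReadLine()) }
    } finally {
        # flush the writer and release both file handles
        $writer.Dispose()
        $reader.Dispose()
    }
}
```

To use it, paste the function into the PowerShell portion of the hybrid script and call `file2ansi` in place of `file2ascii` inside the loop. Characters that code page 1252 cannot represent would still be substituted.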
