How can I remove any UTF-8 BOM that exists -within- some text, not at the start of some text

Question

We receive some files that another party has concatenated. In the middle of these files are some BOM characters.

Is there a way we can detect these 3 bytes (EF BB BF, the UTF-8 encoding of the BOM) and remove them? I've seen plenty of examples of how to remove the BOM from the -start- of a file, but none for the middle.

Recommended answer

Assuming that your file is small enough to hold in memory, and that you have an Enumerable.Replace extension method for replacing subsequences (LINQ does not ship one; an implementation is given below), you could use:

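// Requires: using System.IO; using System.Linq;
var bytes = File.ReadAllBytes(filePath);
var bom = new byte[] { 0xEF, 0xBB, 0xBF };   // UTF-8 encoding of U+FEFF
var empty = Enumerable.Empty<byte>();
// Removes every occurrence of the BOM, including one at the start of the file.
bytes = bytes.Replace(bom, empty).ToArray();
File.WriteAllBytes(filePath, bytes);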
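
If the file is too large to hold in memory, the same idea can be applied while streaming. The following is only a sketch under that assumption, not part of the original answer; the class name BomStreamFilter and the method name CopyWithoutUtf8Boms are made up for illustration. It holds back up to two bytes while they could still be the start of a BOM, and writes them out as soon as the match fails.

```csharp
using System;
using System.IO;

// Hypothetical helper (not from the original answer).
public static class BomStreamFilter
{
    private static readonly byte[] Utf8Bom = { 0xEF, 0xBB, 0xBF };

    // Copies input to output, dropping every EF BB BF sequence.
    public static void CopyWithoutUtf8Boms(Stream input, Stream output)
    {
        if (input == null) throw new ArgumentNullException(nameof(input));
        if (output == null) throw new ArgumentNullException(nameof(output));

        int matched = 0; // how many leading BOM bytes are currently held back (0..2)
        int b;
        while ((b = input.ReadByte()) != -1)
        {
            if (b == Utf8Bom[matched])
            {
                matched++;
                if (matched == Utf8Bom.Length)
                    matched = 0; // a complete BOM was seen: drop it
                continue;
            }

            // Mismatch: the held-back bytes were ordinary data, so emit them.
            output.Write(Utf8Bom, 0, matched);
            matched = 0;

            // The current byte may itself start a new BOM.
            if (b == Utf8Bom[0])
                matched = 1;
            else
                output.WriteByte((byte)b);
        }

        // Emit a trailing partial match (for example a lone EF BB at end of input).
        output.Write(Utf8Bom, 0, matched);
    }
}
```

A typical call would open the source with File.OpenRead and write to a different path created with File.Create, then swap the files afterwards; writing to the path being read would not work. The simple reset on mismatch is valid here because the three BOM bytes are all distinct, so a failed partial match can never overlap the start of another one.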

Here is a simple (inefficient) implementation of the Replace extension method. It must be declared inside a static class and needs using System.Collections.Generic and using System.Linq:

public static IEnumerable<TSource> Replace<TSource>(
    this IEnumerable<TSource> source,
    IEnumerable<TSource> match,
    IEnumerable<TSource> replacement)
{
    return Replace(source, match, replacement, EqualityComparer<TSource>.Default);
}

public static IEnumerable<TSource> Replace<TSource>(
    this IEnumerable<TSource> source,
    IEnumerable<TSource> match,
    IEnumerable<TSource> replacement,
    IEqualityComparer<TSource> comparer)
{
    int sLength = source.Count();
    int mLength = match.Count();

    if (sLength < mLength || mLength == 0)
        return source;

    int[] matchIndexes = (
        from sIndex in Enumerable.Range(0, sLength - mLength + 1)
        where source.Skip(sIndex).Take(mLength).SequenceEqual(match, comparer)
        select sIndex
    ).ToArray();

    var result = new List<TSource>();
    int sPosition = 0;
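    foreach (int mPosition in matchIndexes)
    {
        // Ignore a match that overlaps the previous one, so shared elements are not replaced twice.
        if (mPosition < sPosition)
            continue;

        var sPart = source.Skip(sPosition).Take(mPosition - sPosition);
        result.AddRange(sPart);
        result.AddRange(replacement);
        sPosition = mPosition + mLength;
    }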

    var sLastPart = source.Skip(sPosition).Take(sLength - sPosition);
    result.AddRange(sLastPart);
    return result;
}
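
Because the generic version above re-enumerates the source for every candidate position, it can be slow on larger inputs. For the specific case of stripping the UTF-8 BOM from a byte array, a single pass is enough. This is a minimal sketch, not the answer's method; BomStripper and RemoveUtf8Boms are hypothetical names chosen for illustration.

```csharp
using System;
using System.IO;

// Hypothetical helper (not from the original answer).
public static class BomStripper
{
    // Returns a copy of data with every EF BB BF sequence removed.
    public static byte[] RemoveUtf8Boms(byte[] data)
    {
        if (data == null) throw new ArgumentNullException(nameof(data));

        using (var result = new MemoryStream(data.Length))
        {
            int i = 0;
            while (i < data.Length)
            {
                bool isBom = i + 2 < data.Length
                    && data[i] == 0xEF
                    && data[i + 1] == 0xBB
                    && data[i + 2] == 0xBF;

                if (isBom)
                {
                    i += 3; // skip the whole BOM
                }
                else
                {
                    result.WriteByte(data[i]);
                    i++;
                }
            }
            return result.ToArray();
        }
    }
}
```

It would slot into the earlier snippet as bytes = BomStripper.RemoveUtf8Boms(bytes);. Note that in valid UTF-8 the bytes EF BB BF always encode U+FEFF, so this also removes any U+FEFF that was deliberately placed mid-text as a zero-width no-break space; for concatenated files that is usually the intended result.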
