NodeJS RTF ANSI Find and Replace Words With Special Chars


Problem description

I have a find and replace script that works no problem when the words don't have any special characters. However, there will be a lot of times where there will be special characters since it's finding names. As of now this is breaking the script.

The script looks for {<some-text>} and attempts to replace the contents (as well as remove the braces).

Example:

text.rtf

Here's a name with special char {Kotouč}

script.ts

import * as fs from "fs";

// Ingest the rtf file.
const content: string = fs.readFileSync("./text.rtf", "utf8");
console.log("content::\n", content);

// The string we are looking to match in file text.
const plainText: string = "{Kotouč}";

// Look for all text that matches the pattern `{TEXT_HERE}`.
const anyMatchPattern: RegExp = /{(.*?)}/gi;
const matches: string[] = content.match(anyMatchPattern) || [];
const matchesLen: number = matches.length;
for (let i: number = 0; i < matchesLen; i++) {

    // It correctly identifies the targeted text.
    const currMatch: string = matches[i];
    const isRtfMetadata: boolean = currMatch.endsWith(";}");
    if (isRtfMetadata) {
        continue;
    }

    // Here I need a way to escape `plainText` string so that it matches the source.
    console.log("currMatch::", currMatch);
    console.log("currMatch === plainText::", currMatch === plainText);
    if (currMatch === plainText) {
        const newContent: string = content.replace(currMatch, "IT_WORKS!");
        console.log("newContent:", newContent);
    }
}

Output

content::
 {\rtf1\ansi\ansicpg1252\cocoartf1671\cocoasubrtf600
{\fonttbl\f0\fswiss\fcharset0 Helvetica;}
{\colortbl;\red255\green255\blue255;}
{\*\expandedcolortbl;;}
\margl1440\margr1440\vieww10800\viewh8400\viewkind0
\pard\tx720\tx1440\tx2160\tx2880\tx3600\tx4320\tx5040\tx5760\tx6480\tx7200\tx7920\tx8640\pardirnatural\partightenfactor0

\f0\fs24 \cf0 Here's a name with special char \{Kotou\uc0\u269 \}.}

currMatch:: {Kotou\uc0\u269 \}

currMatch === plainText:: false

It looks like ANSI escaping, and I've tried using jsesc but that produces a different string, {Kotou\u010D} instead of what the document produces {Kotou\uc0\u269 \}.

How can I dynamically escape the plainText string variable so that it matches what is found in the document?

Recommended answer

What I needed was to deepen my knowledge on rtf formatting as well as general text encoding.

The raw RTF text read from the file gives us a few hints:

{\rtf1\ansi\ansicpg1252\cocoartf1671\cocoasubrtf600...

This part of the rtf file metadata tells us a few things.

It is using RTF file format version 1. The encoding is ANSI, specifically code page 1252 (the \ansicpg1252 control word), also known as Windows-1252 or CP-1252, which is:


...a single-byte character encoding of the Latin alphabet

The valuable piece of information here is that the file uses the Latin alphabet, which will be used later.
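
To make the single-byte point concrete, here is a minimal sketch (not part of the original answer) of how the \'92 escape seen in the script output can be decoded under Windows-1252. The names CP1252_HIGH, cp1252ByteToChar and decodeHexEscapes are hypothetical helpers introduced for illustration. Note that č (U+010D) has no Windows-1252 byte at all, which is presumably why the file falls back to a \u escape for it.

```typescript
// Windows-1252 differs from Latin-1 only in the 0x80-0x9F range.
// Entries are Unicode code points; 0 marks bytes that cp1252 leaves undefined.
const CP1252_HIGH: number[] = [
    0x20ac, 0, 0x201a, 0x0192, 0x201e, 0x2026, 0x2020, 0x2021,
    0x02c6, 0x2030, 0x0160, 0x2039, 0x0152, 0, 0x017d, 0,
    0, 0x2018, 0x2019, 0x201c, 0x201d, 0x2022, 0x2013, 0x2014,
    0x02dc, 0x2122, 0x0161, 0x203a, 0x0153, 0, 0x017e, 0x0178,
];

// Hypothetical helper: map one cp1252 byte to its character.
const cp1252ByteToChar = (byte: number): string => {
    if (byte >= 0x80 && byte <= 0x9f) {
        const codePoint = CP1252_HIGH[byte - 0x80];
        // Fall back to the raw byte value for the undefined slots.
        return String.fromCharCode(codePoint || byte);
    }
    // Everywhere else cp1252 agrees with Latin-1, so the byte is the code point.
    return String.fromCharCode(byte);
};

// Hypothetical helper: decode every \'hh escape in a piece of RTF text.
const decodeHexEscapes = (rtf: string): string =>
    rtf.replace(/\\'([0-9a-fA-F]{2})/g, (_match: string, hex: string) =>
        cp1252ByteToChar(parseInt(hex, 16)));

// Byte 0x92 is the right single quotation mark (U+2019) in cp1252.
const decodedApostrophe: string = decodeHexEscapes("Here\\'92s a name");
console.log(decodedApostrophe);
```

The table is only needed for 0x80-0x9F; a fuller implementation would probably lean on a proper code page library instead of a hand-written array.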

Knowing the specific RTF version used, I stumbled upon the RTF 1.5 Spec.

A quick search of that spec for one of the escape sequences I was looking into revealed that \uc0 is an RTF-specific control sequence. Knowing that, I was able to isolate what I was really after: \u269. Now I knew it was Unicode and had a good hunch that \u269 stood for Unicode character code 269. So I looked that up...

The \u269 (char code 269) shows up on this page to confirm. Now I know the character set and what needs to be done to get the equivalent plain text (unescaped), and there's a basic SO post I used here to get the function started.
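
As a rough illustration of that lookup, the sketch below decodes RTF \uN escapes directly. It also shows why jsesc produced a different string: RTF writes the code in decimal (269), while a JavaScript escape writes the same value in hexadecimal (\u010D). The helper name decodeUnicodeEscapes is hypothetical, and the sketch makes simplifying assumptions: it drops a leading \ucN and one delimiter space, but it does not track \uc state across the document or skip fallback characters.

```typescript
// Hypothetical helper: turn RTF \uN escapes into the characters they name.
const decodeUnicodeEscapes = (rtf: string): string =>
    rtf.replace(/(?:\\uc\d+ ?)?\\u(-?\d{1,5}) ?/g, (_match: string, num: string) => {
        const value: number = Number(num);
        // RTF stores N as a signed 16-bit number, so code units above 32767
        // appear as negative values; adding 65536 recovers the code unit.
        const codeUnit: number = value < 0 ? value + 65536 : value;
        return String.fromCharCode(codeUnit);
    });

// The escaped name from the sample document, minus the braces.
const decodedName: string = decodeUnicodeEscapes("Kotou\\uc0\\u269 ");
console.log(decodedName);

// Decimal 269 and hexadecimal 10d are the same code point.
const hexOf269: string = (269).toString(16);
console.log(hexOf269);
```

Because the regex requires digits after \u, control words such as \ul are left alone.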

Using all this knowledge I was able to piece it together from there. Here's the full corrected script and its output:

script.ts

import * as fs from "fs";


// Match RTF unicode control sequence: http://www.biblioscape.com/rtf15_spec.htm
// The `\uc0` prefix is optional so that a bare `\u269` is stripped as well.
const unicodeControlReg: RegExp = /(?:\\uc0)?\\u/g;

// Extracts the unicode character from an escape sequence with handling for rtf.
const matchEscapedChars: RegExp = /\\uc0\\u(\d{2,6})|\\u(\d{2,6})/g;

/**
 * Util function to strip junk characters from string for comparison.
 * @param {string} str
 * @returns {string}
 */
const cleanupRtfStr = (str: string): string => {
    return str
        .replace(/\s/g, "")
        .replace(/\\/g, "");
};

/**
 * Detects escaped unicode and looks up the character by that code.
 * @param {string} str
 * @returns {string}
 */
const unescapeString = (str: string): string => {
    const unescaped = str.replace(matchEscapedChars, (cc: string) => {
        const stripped: string = cc.replace(unicodeControlReg, "");
        const charCode: number = Number(stripped);

        // See unicode character codes here:
        //  https://unicodelookup.com/#latin/11
        return String.fromCharCode(charCode);
    });

    // Whitespace and backslashes are stripped separately by cleanupRtfStr.
    return unescaped;
};

// Ingest the rtf file.
const content: string = fs.readFileSync("./src/TEST.rtf", "binary");
console.log("content::\n", content);

// The string we are looking to match in file text.
const plainText: string = "{Kotouč}";

// Look for all text that matches the pattern `{TEXT_HERE}`.
const anyMatchPattern: RegExp = /{(.*?)}/gi;
const matches: string[] = content.match(anyMatchPattern) || [];
const matchesLen: number = matches.length;
for (let i: number = 0; i < matchesLen; i++) {
    const currMatch: string = matches[i];
    const isRtfMetadata: boolean = currMatch.endsWith(";}");
    if (isRtfMetadata) {
        continue;
    }

    if (currMatch === plainText) {
        const newContent: string = content.replace(currMatch, "IT_WORKS!");
        console.log("\n\nnewContent:", newContent);
        break;
    }

    const unescapedMatch: string = unescapeString(currMatch);
    const cleanedMatch: string = cleanupRtfStr(unescapedMatch);
    if (cleanedMatch === plainText) {
        const newContent: string = content.replace(currMatch, "IT_WORKS_UNESCAPED!");
        console.log("\n\nnewContent:", newContent);
        break;
    }
}

Output

content::
 {\rtf1\ansi\ansicpg1252\cocoartf1671\cocoasubrtf600
{\fonttbl\f0\fswiss\fcharset0 Helvetica;}
{\colortbl;\red255\green255\blue255;}
{\*\expandedcolortbl;;}
\margl1440\margr1440\vieww10800\viewh8400\viewkind0
\pard\tx560\tx1120\tx1680\tx2240\tx2800\tx3360\tx3920\tx4480\tx5040\tx5600\tx6160\tx6720\pardirnatural\partightenfactor0

\f0\fs24 \cf0 Here\'92s a name with special char \{Kotou\uc0\u269 \}}


newContent: {\rtf1\ansi\ansicpg1252\cocoartf1671\cocoasubrtf600
{\fonttbl\f0\fswiss\fcharset0 Helvetica;}
{\colortbl;\red255\green255\blue255;}
{\*\expandedcolortbl;;}
\margl1440\margr1440\vieww10800\viewh8400\viewkind0
\pard\tx560\tx1120\tx1680\tx2240\tx2800\tx3360\tx3920\tx4480\tx5040\tx5600\tx6160\tx6720\pardirnatural\partightenfactor0

\f0\fs24 \cf0 Here\'92s a name with special char \IT_WORKS_UNESCAPED!}

Hopefully that helps others who aren't familiar with character encoding/escaping and its uses in RTF-formatted documents!
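
For the direction the question originally asked about (escaping plainText so it matches the document), a minimal sketch follows. It is an alternative to the unescape-and-compare approach above, not the answer's method, and escapeForRtf is a hypothetical helper. It assumes the writer emits non-ASCII characters as \uc0\uN followed by a space, as the sample file does for č; characters the writer chose to emit as \'hh (such as the apostrophe in the output above) would not match.

```typescript
// Hypothetical helper: escape plain text the way the sample file escapes it.
const escapeForRtf = (text: string): string => {
    let out: string = "";
    for (let i = 0; i < text.length; i++) {
        const ch: string = text.charAt(i);
        const code: number = text.charCodeAt(i);
        if (ch === "\\" || ch === "{" || ch === "}") {
            // Backslash and braces are RTF syntax and need a leading backslash.
            out += "\\" + ch;
        } else if (code > 127) {
            // \uN takes a signed 16-bit decimal number; the trailing space
            // ends the control word. Assumption: the writer uses \uc0.
            const signed: number = code > 32767 ? code - 65536 : code;
            out += "\\uc0\\u" + signed + " ";
        } else {
            out += ch;
        }
    }
    return out;
};

// "{Kotouč}" written with an explicit escape for č (U+010D).
const escapedName: string = escapeForRtf("{Kotou\u010d}");
console.log(escapedName);

// Searching for the escaped form also covers the leading backslash,
// so no stray "\" is left in front of the replacement text.
const sampleLine: string = "char \\{Kotou\\uc0\\u269 \\}}";
const replacedLine: string = sampleLine.split(escapedName).join("IT_WORKS!");
console.log(replacedLine);
```

If the replacement text can itself contain braces, backslashes or non-ASCII characters, it would likely need to go through the same escaping before being written back.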
