Regex for the TODO keyword when scanning a list of directories for files that contain TODO as a comment keyword (e.g. //TODO) but not as a variable or string
Question
I'm trying to write an application that looks through a directory and flags all files (in the directory itself or in its subdirectories) that contain the TODO keyword (the one that is highlighted in color in a code editor; I am using Visual Studio Code).
I have most of the code running; only the last bit is puzzling me. Because my RegEx accepts 'TODO' as a whole word, it also picks up files that have TODO as a variable name or as string content, e.g.
var todo = 'TODO'
or var TODO = 'abcdefg'
so it is messing up my test cases. How do we write a robust TODO regex that picks up just the TODO keyword (e.g. //TODO or // TODO) and ignores the other use cases (in variables, strings, etc.)? I don't want to hardcode // or anything similar in the regex, as I would prefer it to be as cross-language as possible (e.g. // (single-line) or /* (multi-line) for JavaScript, # for Python, etc.).
Here is my code:
import * as fs from 'fs';
import * as path from 'path';

const args = process.argv.slice(2);
const directory = args[0];

// Using recursion, we find every file with the desired extension, even if it is deeply nested in subfolders.
// Returns a list of file paths
const getFilesInDirectory = (dir, ext) => {
  if (!fs.existsSync(dir)) {
    console.log(`Specified directory: ${dir} does not exist`);
    return;
  }
  let files = [];
  fs.readdirSync(dir).forEach(file => {
    const filePath = path.join(dir, file);
    const stat = fs.lstatSync(filePath); // Get details of the file (symbolic links are not followed)
    // If we hit a directory, recurse into the subdirectory. If we hit a file (base case), add it to the array of files
    if (stat.isDirectory()) {
      const nestedFiles = getFilesInDirectory(filePath, ext);
      files = files.concat(nestedFiles);
    } else {
      if (path.extname(file) === ext) {
        files.push(filePath);
      }
    }
  });
  return files;
};

const checkFilesWithKeyword = (dir, keyword, ext) => {
  if (!fs.existsSync(dir)) {
    console.log(`Specified directory: ${dir} does not exist`);
    return;
  }
  const allFiles = getFilesInDirectory(dir, ext);
  const checkedFiles = [];
  allFiles.forEach(file => {
    const fileContent = fs.readFileSync(file);
    // We want full words, so we use word boundaries in the regex.
    const regex = new RegExp('\\b' + keyword + '\\b');
    if (regex.test(fileContent)) {
      // console.log(`Your word was found in file: ${file}`);
      checkedFiles.push(file);
    }
  });
  console.log(checkedFiles);
  return checkedFiles;
};

checkFilesWithKeyword(directory, 'TODO', '.js');
Any help is much appreciated!
Recommended answer
I don't think there is a reliable way to exclude TODO in variable names or string values across languages. You'd need to parse each language properly and scan for TODO in comments.
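To give a feel for what "parsing properly" involves even for a single language, here is a minimal sketch of a character scanner that tracks whether it is inside a comment or a string and reports the keyword only in comments. The helper name findTodoInComments is hypothetical, and the sketch assumes JavaScript-style syntax only: regex literals and ${...} expressions inside template strings are not handled, and the keyword is assumed to contain no regex metacharacters.

```javascript
// Hypothetical helper, not part of the original code.
// Returns the 1-based line numbers on which `keyword` appears inside a comment.
const findTodoInComments = (source, keyword = 'TODO') => {
  const wordRegex = new RegExp('\\b' + keyword + '\\b');
  const hits = [];
  let state = 'code'; // 'code', 'line', 'block', or the quote char of an open string
  let line = 1;
  let comment = '';   // comment text collected on the current line

  // Check the comment text gathered for the current line, then reset it.
  const flush = () => {
    if (wordRegex.test(comment)) hits.push(line);
    comment = '';
  };

  for (let i = 0; i < source.length; i++) {
    const ch = source[i];
    const next = source[i + 1];

    if (ch === '\n') {
      flush();
      // Line comments and plain quoted strings end with the line;
      // block comments and backtick strings continue.
      if (state === 'line' || state === "'" || state === '"') state = 'code';
      line++;
    } else if (state === 'code') {
      if (ch === '/' && next === '/') { state = 'line'; i++; }
      else if (ch === '/' && next === '*') { state = 'block'; i++; }
      else if (ch === "'" || ch === '"' || ch === '`') state = ch;
    } else if (state === 'line') {
      comment += ch;
    } else if (state === 'block') {
      if (ch === '*' && next === '/') { state = 'code'; comment += ' '; i++; }
      else comment += ch;
    } else {
      // Inside a string: skip escaped characters, close on the matching quote.
      if (ch === '\\') { if (next === '\n') line++; i++; }
      else if (ch === state) state = 'code';
    }
  }
  flush();
  return hits;
};

const sample = [
  '// TODO blah',
  'let a = "TODO";',
  'let TODO = 1; // done',
  '/* stuff',
  ' * TODO',
  ' */',
  'let s = `multi',
  '// TODO inside template`;',
].join('\n');

console.log(findTodoInComments(sample)); // [ 1, 5 ]
```

Even this small scanner is tied to one language's comment and string syntax, which is why a single cross-language regex cannot be fully reliable.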
You can do an approximation that you can tweak over time:
- for variable names you'd need to exclude TODO = assignments, and any type of use, such as TODO.length
- for string values you could exclude 'TODO' and "TODO", and even "Something TODO today", by looking for matching quotes. What about a multi-line string with backticks?
This is a start using a bunch of negative lookaheads:
const input = `Test Case:
// TODO blah
// TODO do "stuff"
/* stuff
* TODO
*/
let a = 'TODO';
let b = 'Something TODO today';
let c = "TODO";
let d = "More stuff TODO today";
let TODO = 'stuff';
let l = TODO.length;
let e = "Even more " + TODO + " to do today";
let f = 'Nothing to do';
`;
let keyword = 'TODO';
const regex = new RegExp(
  // exclude TODO in string value with matching quotes:
  '^(?!.*([\'"]).*\\b' + keyword + '\\b.*\\1)' +
  // exclude TODO.property access:
  '(?!.*\\b' + keyword + '\\.\\w)' +
  // exclude TODO = assignment
  '(?!.*\\b' + keyword + '\\s*=)' +
  // final TODO match
  '.*\\b' + keyword + '\\b'
);
input.split('\n').forEach((line) => {
  let m = regex.test(line);
  console.log(m + ': ' + line);
});
Output:
false: Test Case:
true: // TODO blah
true: // TODO do "stuff"
false: /* stuff
true: * TODO
false: */
false: let a = 'TODO';
false: let b = 'Something TODO today';
false: let c = "TODO";
false: let d = "More stuff TODO today";
false: let TODO = 'stuff';
false: let l = TODO.length;
false: let e = "Even more " + TODO + " to do today";
false: let f = 'Nothing to do';
false:
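As an example of the kind of tweak mentioned above, the sketch below adds the backtick to the quote character class. The variable name tweaked is illustrative and not from the original answer. It handles a single-line template string, but a multi-line one still slips through, because each line is tested in isolation.

```javascript
const keyword = 'TODO';

// Same structure as the regex above; only the quote class also lists the backtick.
const tweaked = new RegExp(
  '^(?!.*([\'"`]).*\\b' + keyword + '\\b.*\\1)' +
  '(?!.*\\b' + keyword + '\\.\\w)' +
  '(?!.*\\b' + keyword + '\\s*=)' +
  '.*\\b' + keyword + '\\b'
);

console.log(tweaked.test('// TODO blah'));        // true
console.log(tweaked.test('let g = `TODO`;'));     // false: enclosed in backticks on one line
console.log(tweaked.test('let h = `first line')); // false: no keyword on this line
console.log(tweaked.test('TODO second line`;'));  // true: false positive inside a multi-line string
```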
Explanation of the regex composition:
- ^ - start of string (in our case start of line, due to the split)
- exclude TODO in a string value with matching quotes:
  - (?! - negative lookahead start
  - .* - greedy scan (scan over all chars, but still match what follows)
  - (['"]) - capture group for either a single quote or a double quote
  - .* - greedy scan
  - \b - word boundary before keyword (expect keyword enclosed in non-word chars)
  - add keyword here
  - \b - word boundary after keyword
  - .* - greedy scan
  - \1 - back reference to the capture group (either a single quote or a double quote, but the one captured above)
  - ) - negative lookahead end
- exclude TODO.property access:
  - (?! - negative lookahead start
  - .* - greedy scan
  - \b - word boundary before keyword
  - add keyword here
  - \.\w - a dot followed by a word char, such as .x
  - ) - negative lookahead end
- exclude TODO = assignment:
  - (?! - negative lookahead start
  - .* - greedy scan
  - \b - word boundary before keyword
  - add keyword here
  - \s*= - optional spaces followed by =
  - ) - negative lookahead end
- final TODO match:
  - .* - greedy scan
  - \b - word boundary (expect keyword enclosed in non-word chars)
  - add keyword here
  - \b - word boundary
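The regex is anchored with ^ and is meant to be tested one line at a time, whereas the code in the question tests the whole file content at once. A minimal sketch of how the pieces could be wrapped for per-line use follows; the helper names escapeRegExp, buildKeywordRegex and findKeywordLines are hypothetical and not from the original answer.

```javascript
// Hypothetical helpers; the names are illustrative only.

// Escape regex metacharacters so the keyword is inserted into the pattern literally.
const escapeRegExp = (text) => text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');

// Build the same per-line regex as in the answer for an arbitrary keyword.
const buildKeywordRegex = (keyword) => {
  const k = escapeRegExp(keyword);
  return new RegExp(
    '^(?!.*([\'"]).*\\b' + k + '\\b.*\\1)' + // not between matching quotes
    '(?!.*\\b' + k + '\\.\\w)' +             // not a property access
    '(?!.*\\b' + k + '\\s*=)' +              // not an assignment
    '.*\\b' + k + '\\b'                      // the keyword itself
  );
};

// Return the 1-based numbers of the lines that match.
const findKeywordLines = (text, keyword) => {
  const regex = buildKeywordRegex(keyword);
  const lines = [];
  text.split(/\r?\n/).forEach((line, index) => {
    if (regex.test(line)) lines.push(index + 1);
  });
  return lines;
};

const sample = '// TODO blah\nlet a = "TODO";\nlet TODO = 1;\n * TODO';
console.log(findKeywordLines(sample, 'TODO')); // [ 1, 4 ]
```

In checkFilesWithKeyword from the question, a check such as findKeywordLines(fs.readFileSync(file, 'utf8'), keyword).length > 0 could stand in for regex.test(fileContent), assuming the files are UTF-8 text.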
Learn more about regular expressions: https://twiki.org/cgi-bin/view/Codev/TWikiPresentation2018x10x14Regex