Regex for the TODO keyword when walking a list of directories to get the files that contain a TODO comment (e.g. //TODO) but not TODO as a variable or string


Problem description

I'm trying to write an application that looks through a directory and flags all files (in the directory or its subdirectories) that contain the TODO keyword (the one that is highlighted in color in a code editor; I am using Visual Studio Code).

I have most of the code running; only the last bit is puzzling me. Because my regex accepts 'TODO' as a whole word, it also picks up files that have TODO as a variable name or as string content, e.g.

var todo = 'TODO' or var TODO = 'abcdefg'

This is messing up my test cases. How do we write a robust TODO regex that picks up just the TODO keyword (e.g. //TODO or // TODO) and ignores the other use cases (in variables, strings, etc.)? I also don't want to hardcode // or anything similar in the regex, because I would prefer it to be as cross-language as possible (e.g. // (single-line) or /* (multi-line) for JavaScript, # for Python, etc.).

Here is my code:

import * as fs from 'fs'; 
import * as path from 'path';

const args = process.argv.slice(2);
const directory = args[0];

// Using recursion, we find every file with the desired extension, even if it is deeply nested in subfolders.
// Returns a list of file paths
const getFilesInDirectory = (dir, ext) => {
  if (!fs.existsSync(dir)) {
    console.log(`Specified directory: ${dir} does not exist`);
    return;
  }

  let files = [];
  fs.readdirSync(dir).forEach(file => {
    const filePath = path.join(dir, file);
    const stat = fs.lstatSync(filePath); // lstat does not follow symbolic links, so linked directories are not traversed

    // If we hit a directory, recurse our fx to subdir. If we hit a file (basecase), add it to the array of files
    if (stat.isDirectory()) {
      const nestedFiles = getFilesInDirectory(filePath, ext);
      files = files.concat(nestedFiles);
    } else {
      if (path.extname(file) === ext) {
        files.push(filePath);
      }
    }
  });

  return files;
};



const checkFilesWithKeyword = (dir, keyword, ext) => {
  if (!fs.existsSync(dir)) {
    console.log(`Specified directory: ${dir} does not exist`);
    return;
  }

  const allFiles = getFilesInDirectory(dir, ext);
  const checkedFiles = [];

  allFiles.forEach(file => {
    const fileContent = fs.readFileSync(file, 'utf8'); // read as text so the regex tests a string, not a Buffer

    // We want full words, so we use full word boundary in regex.
    const regex = new RegExp('\\b' + keyword + '\\b');
    if (regex.test(fileContent)) {
      // console.log(`Your word was found in file: ${file}`);
      checkedFiles.push(file);
    }
  });

  console.log(checkedFiles);
  return checkedFiles;
};

checkFilesWithKeyword(directory, 'TODO', '.js');



Help is much appreciated!

Recommended answer

I don't think there is a reliable way to exclude TODO in variable names or string values across languages. You'd need to parse each language properly and scan for TODO in comments only.
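To give a feel for what "parsing properly" involves, here is a minimal sketch for JavaScript-like syntax only. The function name `hasKeywordInComment` is hypothetical, and the scanner is deliberately incomplete: it assumes `//` and `/* */` comments and quoted or backtick strings, and it does not understand regex literals or `${}` nesting inside template strings.

```javascript
// Sketch: report whether `keyword` occurs inside a comment of JavaScript-like
// source. Strings are skipped so that TODO inside quotes is ignored.
// Assumption: regex literals and template-string ${} nesting are NOT handled.
function hasKeywordInComment(source, keyword) {
  const wordRe = new RegExp('\\b' + keyword + '\\b');
  let i = 0;
  while (i < source.length) {
    const ch = source[i];
    const next = source[i + 1];
    if (ch === '/' && next === '/') {
      // Line comment: runs to the end of the line.
      let end = source.indexOf('\n', i);
      if (end === -1) end = source.length;
      if (wordRe.test(source.slice(i, end))) return true;
      i = end;
    } else if (ch === '/' && next === '*') {
      // Block comment: runs to the closing */ (or end of input if unterminated).
      let end = source.indexOf('*/', i + 2);
      end = end === -1 ? source.length : end + 2;
      if (wordRe.test(source.slice(i, end))) return true;
      i = end;
    } else if (ch === '"' || ch === "'" || ch === '`') {
      // String literal: skip to the matching unescaped quote.
      i++;
      while (i < source.length && source[i] !== ch) {
        if (source[i] === '\\') i++; // skip the escaped character
        i++;
      }
      i++; // step past the closing quote
    } else {
      i++;
    }
  }
  return false;
}
```

Even this small scanner is tied to one comment syntax; a Python or shell file would need its own rules for `#` comments and its own string forms, which is why a single cross-language regex can only approximate the result.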

You can build an approximation that you tweak over time:

  • for variable names, you'd need to exclude TODO = assignments and any kind of use, such as TODO.length
  • for string values, you could exclude 'TODO' and "TODO", and even "Something TODO today", by looking for matching quotes. What about a multi-line string with backticks?

This is a start using a bunch of negative lookaheads:

const input = `Test Case:
// TODO blah
// TODO do "stuff"
/* stuff
 * TODO
 */
let a = 'TODO';
let b = 'Something TODO today';
let c = "TODO";
let d = "More stuff TODO today";
let TODO = 'stuff';
let l = TODO.length;
let e = "Even more " + TODO + " to do today";
let f = 'Nothing to do';
`;
let keyword = 'TODO';
const regex = new RegExp(
  // exclude TODO in string value with matching quotes:
  '^(?!.*([\'"]).*\\b' + keyword + '\\b.*\\1)' +
  // exclude TODO.property access:
  '(?!.*\\b' + keyword + '\\.\\w)' +
  // exclude TODO = assignment
  '(?!.*\\b' + keyword + '\\s*=)' +
  // final TODO match
  '.*\\b' + keyword + '\\b'
);
input.split('\n').forEach((line) => {
  let m = regex.test(line);
  console.log(m + ': ' + line);
});

Output:

false: Test Case:
true: // TODO blah
true: // TODO do "stuff"
false: /* stuff
true:  * TODO
false:  */
false: let a = 'TODO';
false: let b = 'Something TODO today';
false: let c = "TODO";
false: let d = "More stuff TODO today";
false: let TODO = 'stuff';
false: let l = TODO.length;
false: let e = "Even more " + TODO + " to do today";
false: let f = 'Nothing to do';
false: 
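The regex above is tested one line at a time, while the code in the question tests the whole file content at once. One possible way to bridge the two, sketched here under assumptions, is to add the `m` flag so that `^` anchors at the start of every line; since `.` does not match line terminators, each lookahead still stays within its own line. The helpers `escapeRegExp`, `buildKeywordRegex` and `contentHasKeyword` are hypothetical names, and escaping the keyword is an extra precaution that the original snippet does not take.

```javascript
// Escape regex metacharacters so an arbitrary keyword is matched literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Same pattern as in the answer, built with the 'm' (multiline) flag so that
// ^ matches at the start of each line of a multi-line string.
function buildKeywordRegex(keyword) {
  const k = escapeRegExp(keyword);
  return new RegExp(
    // exclude keyword in a string value with matching quotes
    '^(?!.*([\'"]).*\\b' + k + '\\b.*\\1)' +
    // exclude keyword.property access
    '(?!.*\\b' + k + '\\.\\w)' +
    // exclude keyword = assignment
    '(?!.*\\b' + k + '\\s*=)' +
    // final keyword match
    '.*\\b' + k + '\\b',
    'm'
  );
}

// True if at least one line of `content` passes all the exclusions.
function contentHasKeyword(content, keyword) {
  return buildKeywordRegex(keyword).test(content);
}
```

In the question's `checkFilesWithKeyword`, this would stand in for the `\b` + keyword + `\b` test, applied to the file read as text, for example `contentHasKeyword(fs.readFileSync(file, 'utf8'), 'TODO')`.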

Explanation of the regex composition:

  • ^ - start of string (in our case start of line, due to the split)
  • exclude TODO in a string value with matching quotes:
    • (?! - negative lookahead start
    • .* - greedy scan (scan over all chars, but still match what follows)
    • (['"]) - capture group for either a single quote or a double quote
    • .* - greedy scan
    • \b - word boundary before keyword (expect keyword enclosed in non-word chars)
    • add keyword here
    • \b - word boundary after keyword
    • .* - greedy scan
    • \1 - back reference to the capture group (either a single quote or a double quote, whichever was captured above)
    • ) - negative lookahead end
  • exclude TODO.property access:
    • (?! - negative lookahead start
    • .* - greedy scan
    • \b - word boundary before keyword
    • add keyword here
    • \.\w - a dot followed by a word char, such as .x
    • ) - negative lookahead end
  • exclude TODO = assignment:
    • (?! - negative lookahead start
    • .* - greedy scan
    • \b - word boundary before keyword
    • add keyword here
    • \s*= - optional spaces followed by =
    • ) - negative lookahead end
  • final TODO match:
    • .* - greedy scan
    • \b - word boundary (expect keyword enclosed in non-word chars)
    • add keyword here
    • \b - word boundary
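To see which exclusion is responsible for rejecting a given line, the three lookahead bodies can be tried in isolation as ordinary positive patterns. This is only a debugging sketch; `kw`, `exclusions` and `explainLine` are hypothetical names, not part of the answer's code.

```javascript
const kw = 'TODO';

// Each lookahead body from the combined regex, as a standalone pattern.
const exclusions = {
  inQuotes: new RegExp('^.*([\'"]).*\\b' + kw + '\\b.*\\1'),
  propertyAccess: new RegExp('^.*\\b' + kw + '\\.\\w'),
  assignment: new RegExp('^.*\\b' + kw + '\\s*='),
};

// Returns the names of the exclusion rules that fire for this line.
// An empty array means no rule objects to the line.
function explainLine(line) {
  return Object.keys(exclusions).filter((name) => exclusions[name].test(line));
}

console.log(explainLine("let a = 'TODO';"));      // [ 'inQuotes' ]
console.log(explainLine("let TODO = 'stuff';"));  // [ 'assignment' ]
console.log(explainLine('let l = TODO.length;')); // [ 'propertyAccess' ]
console.log(explainLine('// TODO blah'));         // []
```

Note that the quote rule only fires when a quote appears before the keyword and the same quote appears after it, which is why `// TODO do "stuff"` is still accepted.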

Learn more about regular expressions: https://twiki.org/cgi-bin/view/Codev/TWikiPresentation2018x10x14Regex
