What is the regular expression for a multi-line string?


Question

I am learning to write a compiler, and the language has rules for strings, such as single-line strings:

char ch[] ="abcd";

and multi-line strings:

printf("This is\
a multi\
string");

I wrote this regular expression:

STRING \"([^\"\n]|\\{NEWLINE})*\"

It works for single-line strings, but not for multi-line strings, where a line ends with a '\' character. What should I change?

Recommended answer

A common pattern for strings is:

\"([^"\\\n]|\\(.|\n))*\"

This matches strings that contain escaped double quotes (\") and escaped backslashes (\\). It uses \\(.|\n) to allow any character, including a newline, after a backslash. Some backslash sequences are longer than one character (\x40, for example), but none of them contains a non-alphanumeric character after the first one, so the rest of the sequence is matched as ordinary characters by [^"\\\n].
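To see what the pattern accepts, it can be exercised outside the scanner. The following is a minimal sketch, assuming a transliteration of the flex pattern into Python's re module; the names STRING and first_string are made up for this illustration, and re.search scans for a match rather than tokenising the way flex does.

```python
import re

# Python transliteration of the flex pattern \"([^"\\\n]|\\(.|\n))*\"
# (assumption: Python's re is used here only to exercise the pattern).
STRING = re.compile(r'"(?:[^"\\\n]|\\(?:.|\n))*"')

def first_string(text):
    """Return the first string literal found in text, or None."""
    m = STRING.search(text)
    return m.group(0) if m else None

# The two examples from the question.
single = 'char ch[] ="abcd";'
multi = 'printf("This is\\\na multi\\\nstring");'  # backslash + newline inside

# A literal with escaped quotes and an escaped backslash.
escaped = r'"say \"hi\" and \\"'

# A bare newline inside the quotes is not accepted.
unterminated = '"no closing quote\nnext line"'

print(repr(first_string(single)))
print(repr(first_string(multi)))
print(repr(first_string(escaped)))
print(repr(first_string(unterminated)))
```

The multi-line literal is matched as a single lexeme because each newline in it is preceded by a backslash, while the last input yields no match at all.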

Your input might use Windows line endings (CR-LF), in which case the backslash is not directly followed by a newline; it is followed by a carriage return. If you want to accept that input rather than report an error (which might be more appropriate), you need to do so explicitly:

\"([^"\\\n]|\\(.|\r?\n))*\"
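The difference between the two patterns can be checked directly. This is a small sketch under the same assumption as before (the flex patterns transliterated into Python's re module); LF_ONLY, CRLF_OK and the sample strings are illustrative names.

```python
import re

# First pattern: a backslash may be followed by any character or by LF.
LF_ONLY = re.compile(r'"(?:[^"\\\n]|\\(?:.|\n))*"')
# Second pattern: a backslash may also be followed by CR LF.
CRLF_OK = re.compile(r'"(?:[^"\\\n]|\\(?:.|\r?\n))*"')

lf_text = '"This is\\\na multi\\\nstring"'        # Unix line endings
crlf_text = '"This is\\\r\na multi\\\r\nstring"'  # Windows line endings

for name, pattern in (("LF_ONLY", LF_ONLY), ("CRLF_OK", CRLF_OK)):
    for label, text in (("lf", lf_text), ("crlf", crlf_text)):
        matched = pattern.fullmatch(text) is not None
        print(name, label, matched)
```

With the first pattern, the backslash consumes the carriage return as its escaped character, which leaves a bare newline inside the string and makes the match fail. The second pattern accepts both line-ending styles.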

But recognising a string and understanding what it represents are two different things. Normally a compiler needs to turn the representation of a string into a byte sequence, which requires, for example, turning \n into the byte 10 and removing backslash-newline pairs altogether.

That task is easily done in a (f)lex scanner using start conditions. (Or, of course, you can rescan the string with a different lexical scanner.)
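The rescanning option can be sketched as a separate pass over the matched lexeme. The following is an illustration in Python, not a flex start-condition implementation; decode_string_literal and SIMPLE_ESCAPES are hypothetical names, the set of supported escapes is an assumption, and \x is simplified to read at most two hex digits.

```python
# Assumed escape set for this sketch; a real language defines its own.
SIMPLE_ESCAPES = {"n": 10, "t": 9, "r": 13, "0": 0, "\\": 92, '"': 34, "'": 39}
HEX_DIGITS = "0123456789abcdefABCDEF"

def decode_string_literal(lexeme: str) -> bytes:
    """Turn a matched string lexeme (quotes included) into its bytes."""
    body = lexeme[1:-1]  # drop the surrounding double quotes
    out = bytearray()
    i = 0
    while i < len(body):
        ch = body[i]
        if ch != "\\":
            out += ch.encode("utf-8")  # ordinary character
            i += 1
            continue
        # The recognising pattern guarantees a character after a backslash.
        nxt = body[i + 1]
        if nxt == "\n":
            i += 2  # backslash-newline: removed altogether
        elif nxt == "\r" and body[i + 2 : i + 3] == "\n":
            i += 3  # backslash followed by CR LF: also removed
        elif nxt == "x":
            j = i + 2
            while j < len(body) and j < i + 4 and body[j] in HEX_DIGITS:
                j += 1
            if j == i + 2:
                raise ValueError("\\x needs at least one hex digit")
            out.append(int(body[i + 2 : j], 16))
            i = j
        elif nxt in SIMPLE_ESCAPES:
            out.append(SIMPLE_ESCAPES[nxt])
            i += 2
        else:
            raise ValueError("unknown escape after backslash: " + repr(nxt))
    return bytes(out)

print(decode_string_literal('"This is\\\na multi\\\nstring"'))
print(decode_string_literal(r'"a\nb"'))
```

Keeping recognition and decoding separate means the pattern stays simple, and escape errors can be reported with their own messages.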

Additionally, you need to think about error handling. Once you ban unescaped newlines inside strings (as C does), you open the door to unterminated strings, where a newline is encountered before the closing quote. The same can happen at the end of the file if a string is not correctly closed.

If you have a single-character fallback rule, it will recognise the opening quote of an unterminated string. That is undesirable, because the scanner will then scan the contents of the string as program text, leading to a cascade of errors. If you are not attempting error recovery, this doesn't matter; if you are, it is usually better to use a separate pattern that recognises the unterminated string as such, at least up to the newline.
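One way to picture the separate pattern is the string pattern without its closing quote. The sketch below is an assumption-laden illustration in Python; TOKEN, BODY, classify and the group names are invented for it. In flex the longest match would pick the terminated rule automatically, whereas Python tries alternatives in order, so the terminated alternative is listed first.

```python
import re

# Shared body of a string literal: ordinary characters or backslash escapes.
BODY = r'(?:[^"\\\n]|\\(?:.|\n))*'

TOKEN = re.compile(
    r'(?P<STRING>"' + BODY + r'")'        # properly closed string
    r'|(?P<UNTERMINATED>"' + BODY + r')'  # same pattern, closing quote missing
    r'|(?P<OTHER>[^"]+)'                  # stand-in for all other program text
)

def classify(text):
    """Return (kind, lexeme) pairs for the whole input."""
    return [(m.lastgroup, m.group(0)) for m in TOKEN.finditer(text)]

source = 'x = "abc\ny = "ok";'
for kind, lexeme in classify(source):
    print(kind, repr(lexeme))
```

Here the broken literal is reported as one UNTERMINATED token that stops at the newline, and scanning resumes cleanly on the next line, so the following "ok" is still recognised as a normal string.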
