Can you propose a more elegant way to 'tokenize' C# code for HTML formatting?
(This question about refactoring F# code got me one down vote, but also some interesting and useful answers. And 62 F# questions out of the 32,000+ on SO seems pitiful, so I'm going to take the risk of more disapproval!)
I was trying to post a bit of code on a blogger blog yesterday, and turned to this site, which I had found useful in the past. However, the blogger editor ate all the style declarations, so that turned out to be a dead end.
So (like any hacker), I thought "how hard can it be?" and rolled my own in <100 lines of F#.
Here is the 'meat' of the code, which turns an input string into a list of 'tokens'. Note that these tokens aren't to be confused with the lexing/parsing-style tokens. I did look at those briefly, and though I hardly understood anything, I did understand that they would give me only tokens, whereas I want to keep my original string.
The question is: is there a more elegant way of doing this? I don't like the n re-definitions of s required to remove each token string from the input string, but it's difficult to split the string into potential tokens in advance, because of things like comments, strings and the #region directive (which contains a non-word character).
open System
open System.Text.RegularExpressions

//Types of tokens we are going to detect
type Token =
    | Whitespace of string
    | Comment of string
    | Strng of string
    | Keyword of string
    | Text of string
    | EOF

//turn a string into a list of recognised tokens
let tokenize (s:String) =
    //this is the 'parser' - should we look at compiling the regexes in advance?
    let nexttoken (st:String) =
        match st with
        | st when Regex.IsMatch(st, "^\s+") -> Whitespace(Regex.Match(st, "^\s+").Value)
        | st when Regex.IsMatch(st, "^//.*?\r?\n") -> Comment(Regex.Match(st, "^//.*?\r?\n").Value) //this is double slash-style comments
        | st when Regex.IsMatch(st, "^/\*(.|[\r?\n])*?\*/") -> Comment(Regex.Match(st, "^/\*(.|[\r?\n])*?\*/").Value) // /* */ style comments http://ostermiller.org/findcomment.html
        | st when Regex.IsMatch(st, @"^""([^""\\]|\\.|"""")*""") -> Strng(Regex.Match(st, @"^""([^""\\]|\\.|"""")*""").Value) // unescaped = "([^"\\]|\\.|"")*" http://wordaligned.org/articles/string-literals-and-regular-expressions
        | st when Regex.IsMatch(st, "^#(end)?region") -> Keyword(Regex.Match(st, "^#(end)?region").Value)
        | st when st <> "" ->
            match Regex.Match(st, @"^[^""\s]*").Value with //all text until next whitespace or quote (this may be wrong)
            | x when iskeyword x -> Keyword(x) //iskeyword (defined in the rest of the code, not shown) uses Microsoft.CSharp.CSharpCodeProvider.IsValidIdentifier - a bit fragile...
            | x -> Text(x)
        | _ -> EOF
    //tail-recursive use of next token to transform string into token list
    let tokeneater s =
        let rec loop s acc =
            let t = nexttoken s
            match t with
            | EOF -> List.rev acc //return accumulator (have to reverse it because built backwards with tail recursion)
            | Whitespace(x) | Comment(x)
            | Keyword(x) | Text(x) | Strng(x) ->
                loop (s.Remove(0, x.Length)) (t::acc) //tail recursive
        loop s []
    tokeneater s
(If anyone is really interested, I am happy to post the rest of the code)
EDIT: Using the excellent suggestion of active patterns by kvb, the central bit looks like this, much better!
let nexttoken (st:String) =
    match st with
    | Matches "^\s+" s -> Whitespace(s)
    | Matches "^//.*?\r?(\n|$)" s -> Comment(s) //this is double slash-style comments
    | Matches "^/\*(.|[\r?\n])*?\*/" s -> Comment(s) // /* */ style comments http://ostermiller.org/findcomment.html
    | Matches @"^@?""([^""\\]|\\.|"""")*""" s -> Strng(s) // unescaped regexp = ^@?"([^"\\]|\\.|"")*" http://wordaligned.org/articles/string-literals-and-regular-expressions
    | Matches "^#(end)?region" s -> Keyword(s)
    | Matches @"^[^""\s]+" s -> //all text until next whitespace or quote (this may be wrong)
        match s with
        | IsKeyword x -> Keyword(s)
        | _ -> Text(s)
    | _ -> EOF
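The `IsKeyword` pattern used in the edited version is not shown in the post. As a rough sketch only, a partial active pattern of that name could look like the following. It assumes a small hand-written keyword set rather than the `CSharpCodeProvider.IsValidIdentifier` check the original `iskeyword` relied on; `csharpKeywords`, its deliberately incomplete contents, and the `classify` helper are illustrative names, not part of the original code.

```fsharp
// Hypothetical, incomplete keyword list; extend it as needed.
let csharpKeywords =
    set [ "class"; "namespace"; "public"; "private"; "static"
          "void"; "int"; "string"; "using"; "return"; "if"; "else"; "new" ]

// Partial active pattern: succeeds (returning the word) only for listed keywords.
let (|IsKeyword|_|) (word: string) =
    if Set.contains word csharpKeywords then Some word else None

// Small helper showing the pattern in a match expression.
let classify (word: string) =
    match word with
    | IsKeyword k -> sprintf "Keyword(%s)" k
    | other -> sprintf "Text(%s)" other
```

With this definition, `classify "class"` takes the keyword branch and `classify "foo"` falls through to the text branch.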
I'd use an active pattern to encapsulate the Regex.IsMatch and Regex.Match pairs, like so:
let (|Matches|_|) re s =
    let m = Regex(re).Match(s)
    if m.Success then
        Some m.Value
    else
        None
Then your nexttoken function can look like:
let nexttoken (st:String) =
    match st with
    | Matches "^\s+" s -> Whitespace(s)
    | Matches "^//.*?\r?\n" s -> Comment(s)
    ...
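On the two concerns raised in the question (compiling the regexes in advance, and re-creating the input string after every token), one possible direction is to build each `Regex` once and scan by index instead of calling `s.Remove`. The sketch below is an assumption-laden variant, not the code from the post: it uses the `\G` anchor together with `Regex.Match(input, startat)` so that each pattern must match exactly at the current position, it drops the `EOF` case and keyword detection for brevity, and the `rules` table is a made-up name.

```fsharp
open System
open System.Text.RegularExpressions

type Token =
    | Whitespace of string
    | Comment of string
    | Strng of string
    | Keyword of string
    | Text of string

// Each rule pairs a regex (constructed once) with the constructor for its token.
// \G anchors the match at the start index passed to Regex.Match.
let rules : (Regex * (string -> Token)) list =
    [ (Regex(@"\G\s+"), Whitespace)
      (Regex(@"\G//[^\r\n]*"), Comment)
      (Regex(@"\G/\*[\s\S]*?\*/"), Comment)
      (Regex(@"\G@?""([^""\\]|\\.|"""")*"""), Strng)
      (Regex(@"\G#(end)?region"), Keyword)
      (Regex(@"\G[^""\s]+"), Text) ]

let tokenize (source: string) =
    let rec loop pos acc =
        if pos >= source.Length then
            List.rev acc
        else
            // First rule that matches a non-empty span at the current position.
            let hit =
                rules
                |> List.tryPick (fun (re, make) ->
                    let m = re.Match(source, pos)
                    if m.Success && m.Length > 0 then Some(make m.Value, m.Length) else None)
            match hit with
            | Some (token, len) -> loop (pos + len) (token :: acc)
            // Nothing matched (e.g. an unterminated quote): emit one character
            // as Text so the scan always advances.
            | None -> loop (pos + 1) (Text(source.Substring(pos, 1)) :: acc)
    loop 0 []
```

Because every character ends up in exactly one token, concatenating the token strings reproduces the input, which is the property the question asks for. The rule order matters: the text rule is last, so it only applies when nothing more specific matches.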