Identify and isolate Hebrew words with regexp
Question
I need to parse a Hebrew sentence to identify and isolate every word in it (so I can wrap each one in a 'span' tag). As a first step I'm ignoring punctuation and just trying to separate the non-space characters from the space characters, but it still doesn't work:
var regex = /(\s)*(\S)+(\s)*/g;
Any ideas? Thanks.
edit: I already have a regular expression that does the job in English. I'm including it in case it helps to show what I want to achieve:
var regExp = /\b([^\s']+)\b/g;
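One likely reason the English pattern does not carry over: in JavaScript, \b is defined in terms of \w, which covers only [A-Za-z0-9_]. Hebrew letters therefore count as non-word characters, and no boundary is found between them and a space. A minimal sketch illustrating this (the sample sentences are made up for the demonstration and are not from the question):

```javascript
// The English-oriented pattern from the question.
var englishPattern = /\b([^\s']+)\b/g;

// Sample inputs chosen for this illustration only.
var englishSentence = "hello how are you";
var hebrewSentence = "שלום מה קורה";

// \b needs a \w character ([A-Za-z0-9_]) on exactly one side, so it works here...
var englishWords = englishSentence.match(englishPattern);

// ...but Hebrew letters are not \w, so no position qualifies as a boundary
// and match() returns null.
var hebrewWords = hebrewSentence.match(englishPattern);

console.log(englishWords);
console.log(hebrewWords);
```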
edit2: Adding a code example:
var regex = /(\s)*(\S)+(\s)*/g;
var sentence = "שלום מה קורה מהיום";
sentence.replace(regex, function(match, p1, p2, p3){console.log('"' + match + '"', '"' + p1 + '"', '"' + p2 + '"', '"' + p3 + '"');});
// result
"שלום " "undefined" "ם" " "
"מה " "undefined" "ה" " "
"קורה " "undefined" "ה" " "
"היום" "undefined" "ם" "undefined"
"undefinedundefinedundefinedundefined"
edit3: I need to be able to reassemble the sentence with the same punctuation at the end.
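A note on the edit2 output: the full match is already the whole word, but the second group only holds the last letter. When a quantifier is applied to a capturing group, as in (\S)+, the group keeps only its final iteration. Likewise, (\s)* reports undefined when it matches zero times. A sketch of the same experiment with the quantifiers moved inside the groups (the parameter names leading, word and trailing are chosen here for illustration):

```javascript
var sentence = "שלום מה קורה מהיום";

// Quantifier inside each group: the group captures the whole run of characters.
var wholeWord = /(\s*)(\S+)(\s*)/g;

var captured = [];
sentence.replace(wholeWord, function (match, leading, word, trailing) {
  captured.push(word); // the complete word, not just its last letter
  return match;        // leave the sentence itself unchanged
});

console.log(captured);
```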
Accepted answer
Well, as you might know, Hebrew on the web is a bitch. Try using this regex:
[\s]*(\S)+[\s]*
For example:
var words = sentence.match(/[\s]*(\S)+[\s]*/g);
It does leave in the trailing spaces. To clear them up you could do something like this:
words = words.join("").split(" ") // join("") because join() alone would insert commas
I'm trying out some other regex variations to try and circumvent the join-split hack. I'll update if I find anything.
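One possible way around the join-split step, offered as a sketch rather than as the answer's own method: match only the runs of non-whitespace characters, so the spaces never end up in the results.

```javascript
var sentence = "שלום מה קורה מהיום";

// \S+ matches each run of non-whitespace characters, so no spaces are captured.
// match() returns null when nothing matches, hence the fallback to an empty array.
var words = sentence.match(/\S+/g) || [];

console.log(words.length);
```

This still treats punctuation attached to a word as part of that word, so it only covers the punctuation-free case.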
Also, you could go the "replace" way and do:
var words = sentence.replace(/[#`~?!$%.;:,]+/g, "").split(" ")
Just make sure to add any other punctuation characters that might be used.
Then, to get a new HTML string with the words wrapped in span tags, you can do this.
Let's say:
var sentence = "?שלום, מה קורה מהיום"
var words = sentence.replace(/[#`~?!$%.;:,]+/g, "").split(" ")
Then:
var newSentence = encodeURI(sentence)
words.forEach(function(word){
word = encodeURI(word)
newSentence = newSentence.replace(word, "<span>" + word + "</span>")
})
newSentence = decodeURI(newSentence);
newSentence will then have your words wrapped in span tags, with the punctuation left in place.
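As an alternative sketch that skips the encodeURI round trip, the wrapping can be done in a single replace call that matches Hebrew characters directly. The helper name wrapHebrewWords is hypothetical, and the code assumes that a "word" is any run of characters from the Unicode Hebrew block (U+0590 to U+05FF); everything else, including spaces and punctuation, is left where it is.

```javascript
// Hypothetical helper, not part of the original answer.
// Assumption: a "word" is a run of characters in the Unicode Hebrew block
// (U+0590 to U+05FF), which covers letters and vowel points. The block also
// contains a few Hebrew punctuation marks such as maqaf (U+05BE), which
// would be treated as part of a word here.
function wrapHebrewWords(sentence) {
  return sentence.replace(/[\u0590-\u05FF]+/g, function (word) {
    return "<span>" + word + "</span>";
  });
}

var sentence = "שלום, מה קורה מהיום?";
var wrapped = wrapHebrewWords(sentence);
console.log(wrapped);

// Stripping the tags again gives back the original sentence, punctuation included.
var restored = wrapped.replace(/<\/?span>/g, "");
console.log(restored === sentence); // true
```

Because the punctuation is never removed, the sentence can be reassembled simply by stripping the tags, which is what edit3 in the question asks for.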