OCaml: How to remove all non-alphabetic characters from a string?

Question

How do I remove all the non-alphabetic characters from a string?

For example:

"Wë_1ird?!"  ->  "Wëird"

In Perl, I'd do this with =~ s/[\W\d_]+//g. In Python, I'd use

re.sub(ur'[\W\d_]+', u'', u"Wë_1ird?!", flags=re.UNICODE)

and so on.

AFAICT, Str.regex does not support \W, \d, etc. (I can't tell whether it supports Unicode, but somehow I doubt it).

Recommended answer

I'm not an expert in regexes or UTF, but if I were in your shoes I would use the re2 library. Here is my first approximation:

open Core.Std
open Re2.Std
open Re2.Infix

let drop _match = ""

let keep_alpha s = Re2.replace ~/"\\PL" ~f:drop s

The first three lines open libraries and bring their definitions into scope. You do not need to open a library to use it, but otherwise you have to prefix each definition with its module name. The OCaml Core library is specifically designed so that a user opens the Std submodule to bring all the necessary definitions into scope. The Re2 library is from the same authors and follows consistent conventions. open Re2.Infix brings the infix and prefix operators into scope, namely ~/, which creates a regex from a string. The drop function just ignores its argument and returns an empty string. I've prefixed the parameter with an underscore, since that is the convention for unused parameters (and the compiler respects it). You can also use a plain underscore as a wildcard instead, as in let drop _ = "". Next is the keep_alpha function, which replaces every UTF symbol that doesn't match the UTF letter class (\PL is the negation of the letter class \pL) with an empty string, i.e., removes it from the output.
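
For comparison, the same "replace every non-letter with nothing" idea can be sketched with the standard library alone. This is not the re2 approach above and not a substitute for it: it assumes OCaml 4.14 or later (for String.get_utf_8_uchar), and the hypothetical helper is_letter only accepts ASCII letters plus the Latin-1 letters U+00C0..U+00FF, which is far narrower than the full Unicode letter class. The names is_letter and keep_alpha_approx are made up for this illustration.

```ocaml
(* Stdlib-only sketch, OCaml >= 4.14.
   Assumption: "letter" means A-Z, a-z and the Latin-1 letters
   U+00C0..U+00FF (excluding the signs U+00D7 and U+00F7).
   This is an approximation, not the full Unicode letter class. *)
let is_letter (u : Uchar.t) : bool =
  let c = Uchar.to_int u in
  (c >= 0x41 && c <= 0x5A)
  || (c >= 0x61 && c <= 0x7A)
  || (c >= 0xC0 && c <= 0xFF && c <> 0xD7 && c <> 0xF7)

(* Decode the string as UTF-8, one scalar value at a time, and copy
   only the accepted letters into the buffer. Invalid byte sequences
   are skipped. *)
let keep_alpha_approx (s : string) : string =
  let buf = Buffer.create (String.length s) in
  let rec loop i =
    if i < String.length s then begin
      let d = String.get_utf_8_uchar s i in
      let u = Uchar.utf_decode_uchar d in
      if Uchar.utf_decode_is_valid d && is_letter u then
        Buffer.add_utf_8_uchar buf u;
      loop (i + Uchar.utf_decode_length d)
    end
  in
  loop 0;
  Buffer.contents buf

(* "\xc3\xab" is the UTF-8 encoding of U+00EB (e with diaeresis). *)
let () = print_endline (keep_alpha_approx "W\xc3\xab_1ird?!")
```

Decoding before filtering matters here: filtering byte by byte would split the two-byte encoding of the accented letter and either drop it or leave invalid UTF-8 behind.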

I've checked my code and fixed the errors. I would also like to show how to play with this code in the toplevel. You have several options, but the easiest is to use the coretop script that ships with the core library. It uses the utop toplevel, so make sure that you have it installed:

 $ opam install -y utop

Once that is done, you can start the toplevel:

 $ coretop -require re2

The -require re2 flag will automatically find the re2 library and load it into your toplevel. You can load additional libraries without restarting utop with the following command:

 # #require "libname";;

The first # is the toplevel's prompt, which you shouldn't type; the second is the start of the directive, so make sure you actually type it. Every directive starts with the # symbol. There are other useful directives in utop, namely:

 # #use "filename.ml";;   (* will load and evaluate filename.ml      *)
 # #list;;                (* will list all available packages        *)
 # #typeof "keep_alpha";; (* will infer and print type of expression *)

The toplevel will not evaluate your code until you terminate it with the ;; sequence. You may sometimes see this ugly ;; in real code, but it is not needed there; it only tells the toplevel that you want it to evaluate your code right at this point and show you the result.
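
To illustrate the point about ;; in source files, here is a minimal sketch of a compiled file that never uses it. The file name demo.ml and the names greeting and shout are invented for this example; side effects are wrapped in let () = ... so that every top-level phrase is a definition.

```ocaml
(* demo.ml (hypothetical file): no ";;" is needed between
   top-level definitions in a compiled source file. *)
let greeting = "hello"

let shout s = String.uppercase_ascii s

(* Top-level side effects go into a unit binding instead of a bare
   expression terminated by ";;". *)
let () = print_endline (shout greeting)
```

In the toplevel you would type the same definitions one at a time, ending each with ;; to have it evaluated and its type printed.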
