Module Re

Module Re: code for creating and using regular expressions, independently of regular expression syntax.

type t

Regular expression

type re

Compiled regular expression

module Group : sig ... end

Manipulate matching groups.

type groups = Group.t
  • deprecated Use Group.t

Compilation and execution of a regular expression

val compile : t -> re

Compile a regular expression into an executable version that can be used to match strings, e.g. with exec.

val group_count : re -> int

Return the number of capture groups (including the one corresponding to the entire regexp).

val group_names : re -> (string * int) list

Return named capture groups with their index.
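
A minimal sketch of how group_count and group_names relate, assuming a pattern with two named groups. The pattern and the names year and month are illustrative, not part of the library.

```ocaml
(* Illustrative pattern: "<digits>-<digits>" with two named groups. *)
let date_re =
  Re.compile
    Re.(seq [ group ~name:"year" (rep1 digit);
              char '-';
              group ~name:"month" (rep1 digit) ])

(* Group 0 is the whole match, so two explicit groups give a count of 3. *)
let count = Re.group_count date_re

(* Look up the index of each named group; the order of the list is not assumed. *)
let year_index = List.assoc "year" (Re.group_names date_re)
let month_index = List.assoc "month" (Re.group_names date_re)
```

The indices returned by group_names can be passed to Group.get to read the corresponding submatch.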

val exec : ?pos:int -> ?len:int -> re -> string -> Group.t

exec re str searches str for a match of the compiled expression re, and returns the matched groups if any.

More specifically, when a match exists, exec returns a match that starts at the earliest position possible. If multiple such matches are possible, the one specified by the match semantics described below is returned.

Examples:

# let regex = Re.compile Re.(seq [str "//"; rep print ]);;
val regex : re = <abstr>

# Re.exec regex "// a C comment";;
- : Re.Group.t = <abstr>

# Re.exec regex "# a C comment?";;
Exception: Not_found

# Re.exec ~pos:1 regex "// a C comment";;
Exception: Not_found
  • parameter pos

    optional beginning of the string (default 0)

  • parameter len

    length of the substring of str that can be matched (default -1, meaning to the end of the string)

  • raises Not_found

    if the regular expression can't be found in str
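
The examples above show only an abstract Group.t. As a hedged sketch, the matched text and offsets can be read back with Group.get and Group.offset. The key=value pattern and the helper names parse_kv and value_span are illustrative.

```ocaml
(* Illustrative pattern: "<letters>=<digits>", one group per side. *)
let kv_re =
  Re.compile Re.(seq [ group (rep1 alpha); char '='; group (rep1 digit) ])

(* Returns the key and the value, or None when exec raises Not_found. *)
let parse_kv s =
  match Re.exec kv_re s with
  | g -> Some (Re.Group.get g 1, Re.Group.get g 2)
  | exception Not_found -> None

(* Byte offsets (start, end) of the value group within the input.
   Raises Not_found if the input does not match. *)
let value_span s = Re.Group.offset (Re.exec kv_re s) 2
```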

val exec_opt : ?pos:int -> ?len:int -> re -> string -> Group.t option

Similar to exec, but returns an option instead of using an exception.

Examples:

# let regex = Re.compile Re.(seq [str "//"; rep print ]);;
val regex : re = <abstr>

# Re.exec_opt regex "// a C comment";;
- : Re.Group.t option = Some <abstr>

# Re.exec_opt regex "# a C comment?";;
- : Re.Group.t option = None

# Re.exec_opt ~pos:1 regex "// a C comment";;
- : Re.Group.t option = None
val execp : ?pos:int -> ?len:int -> re -> string -> bool

Similar to exec, but returns true if the expression matches, and false if it doesn't. This function is more efficient than calling exec or exec_opt and ignoring the returned group.

Examples:

# let regex = Re.compile Re.(seq [str "//"; rep print ]);;
val regex : re = <abstr>

# Re.execp regex "// a C comment";;
- : bool = true

# Re.execp ~pos:1 regex "// a C comment";;
- : bool = false
val exec_partial : ?pos:int -> ?len:int -> re -> string -> [ `Full | `Partial | `Mismatch ]

More detailed version of execp. `Full is equivalent to true, while `Mismatch and `Partial are equivalent to false, but `Partial indicates the input string could be extended to create a match.

Examples:

# let regex = Re.compile Re.(seq [bos; str "// a C comment"]);;
val regex : re = <abstr>

# Re.exec_partial regex "// a C comment here.";;
- : [ `Full | `Mismatch | `Partial ] = `Full

# Re.exec_partial regex "// a C comment";;
- : [ `Full | `Mismatch | `Partial ] = `Partial

# Re.exec_partial regex "//";;
- : [ `Full | `Mismatch | `Partial ] = `Partial

# Re.exec_partial regex "# a C comment?";;
- : [ `Full | `Mismatch | `Partial ] = `Mismatch
val exec_partial_detailed : ?pos:int -> ?len:int -> re -> string -> [ `Full of Group.t | `Partial of int | `Mismatch ]

More detailed version of exec_opt. `Full group is equivalent to Some group, while `Mismatch and `Partial _ are equivalent to None. `Partial position indicates that the input string could be extended to create a match, and that no match could start in the input string before the given position. When more input becomes available, that position can be passed as the ?pos argument so that the whole input does not have to be searched again.
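
A rough sketch of dispatching on the three outcomes, assuming an anchored literal pattern. The pattern and the helper describe are illustrative, and the exact position carried by `Partial is not assumed here.

```ocaml
(* Illustrative pattern: the literal "abc" anchored at the beginning of the string. *)
let abc_re = Re.compile Re.(seq [ bos; str "abc" ])

(* Turn each possible outcome into a short description. *)
let describe input =
  match Re.exec_partial_detailed abc_re input with
  | `Full g -> "full match: " ^ Re.Group.get g 0
  | `Partial pos ->
    (* No match can start before [pos]; a later call on the extended
       input could pass it as ~pos. *)
    Printf.sprintf "partial, resume at %d" pos
  | `Mismatch -> "mismatch"
```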

module Mark : sig ... end

Marks

High Level Operations

type split_token = [
  | `Text of string  (* Text between delimiters *)
  | `Delim of Group.t  (* Delimiter *)
]

val all : ?pos:int -> ?len:int -> re -> string -> Group.t list

Repeatedly calls exec on the given string, starting at given position and length.

Examples:

# let regex = Re.compile Re.(seq [str "my"; blank; word(rep alpha)]);;
val regex : re = <abstr>

# Re.all regex "my head, my shoulders, my knees, my toes ...";;
- : Re.Group.t list = [<abstr>; <abstr>; <abstr>; <abstr>]

# Re.all regex "My head, My shoulders, My knees, My toes ...";;
- : Re.Group.t list = []
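
Since all returns abstract groups, a hedged sketch of turning them into concrete data may help. The digit pattern and the helper numbers_with_spans are illustrative.

```ocaml
(* Illustrative pattern: a run of one or more digits. *)
let number_re = Re.compile Re.(rep1 digit)

(* For every match, return the matched text and its (start, end) offsets. *)
let numbers_with_spans s =
  Re.all number_re s
  |> List.map (fun g -> (Re.Group.get g 0, Re.Group.offset g 0))
```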
type 'a gen = unit -> 'a option
val all_gen : ?pos:int -> ?len:int -> re -> string -> Group.t gen
val all_seq : ?pos:int -> ?len:int -> re -> string -> Group.t Seq.t
val matches : ?pos:int -> ?len:int -> re -> string -> string list

Same as all, but extracts the matched substring rather than returning the whole group. This basically iterates over matched strings.

Examples:

# let regex = Re.compile Re.(seq [str "my"; blank; word(rep alpha)]);;
val regex : re = <abstr>

# Re.matches regex "my head, my shoulders, my knees, my toes ...";;
- : string list = ["my head"; "my shoulders"; "my knees"; "my toes"]

# Re.matches regex "My head, My shoulders, My knees, My toes ...";;
- : string list = []

# Re.matches regex "my my my my head my 1 toe my ...";;
- : string list = ["my my"; "my my"]

# Re.matches ~pos:2 regex "my my my my head my +1 toe my ...";;
- : string list = ["my my"; "my head"]
val matches_gen : ?pos:int -> ?len:int -> re -> string -> string gen
val matches_seq : ?pos:int -> ?len:int -> re -> string -> string Seq.t
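
The _seq variants carry no description here. The sketch below assumes matches_seq yields the same strings as matches, produced on demand as a standard Seq.t. The helpers first_word and all_words are illustrative.

```ocaml
(* Illustrative pattern: a run of one or more letters. *)
let word_re = Re.compile Re.(rep1 alpha)

(* First match only: forces just the head of the sequence. *)
let first_word s =
  match Re.matches_seq word_re s () with
  | Stdlib.Seq.Cons (w, _) -> Some w
  | Stdlib.Seq.Nil -> None

(* Forcing the whole sequence gives a list of all matches. *)
let all_words s = List.of_seq (Re.matches_seq word_re s)
```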
val split : ?pos:int -> ?len:int -> re -> string -> string list

split re s splits s into chunks separated by re. It yields the chunks themselves, not the separator. An occurrence of the separator at the beginning or the end of the string is ignored.

Examples:

# let regex = Re.compile (Re.char ',');;
val regex : re = <abstr>

# Re.split regex "Re,Ocaml,Jerome Vouillon";;
- : string list = ["Re"; "Ocaml"; "Jerome Vouillon"]

# Re.split regex "No commas in this sentence.";;
- : string list = ["No commas in this sentence."]

# Re.split regex ",1,2,";;
- : string list = ["1"; "2"]

# Re.split ~pos:3 regex "1,2,3,4. Commas go brrr.";;
- : string list = ["3"; "4. Commas go brrr."]

Zero-length patterns:

Be careful when using split with zero-length patterns like eol, bow, and eow. Because they have no width, the characters next to the match position are not consumed and remain in the result. (Note the position of the \n and space characters in the output.)

# Re.split (Re.compile Re.eol) "a\nb";;
- : string list = ["a"; "\nb"]

# Re.split (Re.compile Re.bow) "a b";;
- : string list = ["a "; "b"]

# Re.split (Re.compile Re.eow) "a b";;
- : string list = ["a"; " b"]

Compare this to the behavior of splitting on the char itself. (Note that the delimiters are not present in the output.)

# Re.split (Re.compile (Re.char '\n')) "a\nb";;
- : string list = ["a"; "b"]

# Re.split (Re.compile (Re.char ' ')) "a b";;
- : string list = ["a"; "b"]
val split_delim : ?pos:int -> ?len:int -> re -> string -> string list

split_delim re s splits s into chunks separated by re. It yields the chunks themselves, not the separator. Occurrences of the separator at the beginning or the end of the string will produce empty chunks.

Examples:

# let regex = Re.compile (Re.char ',');;
val regex : re = <abstr>

# Re.split_delim regex "Re,Ocaml,Jerome Vouillon";;
- : string list = ["Re"; "Ocaml"; "Jerome Vouillon"]

# Re.split_delim regex "No commas in this sentence.";;
- : string list = ["No commas in this sentence."]

# Re.split_delim regex ",1,2,";;
- : string list = [""; "1"; "2"; ""]

# Re.split ~pos:3 regex "1,2,3,4. Commas go brrr.";;
- : string list = ["3"; "4. Commas go brrr."]

Zero-length patterns:

Be careful when using split_delim with zero-length patterns like eol, bow, and eow. Because they have no width, the characters next to the match position are not consumed and remain in the result. (Note the position of the \n and space characters in the output.)

# Re.split_delim (Re.compile Re.eol) "a\nb";;
- : string list = ["a"; "\nb"; ""]

# Re.split_delim (Re.compile Re.bow) "a b";;
- : string list = [""; "a "; "b"]

# Re.split_delim (Re.compile Re.eow) "a b";;
- : string list = ["a"; " b"; ""]

Compare this to the behavior of splitting on the char itself. (Note that the delimiters are not present in the output.)

# Re.split_delim (Re.compile (Re.char '\n')) "a\nb";;
- : string list = ["a"; "b"]

# Re.split_delim (Re.compile (Re.char ' ')) "a b";;
- : string list = ["a"; "b"]
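
The difference between split and split_delim shows up only when a separator sits at an edge of the input. A small side-by-side sketch follows. The semicolon separator and the helper names are illustrative.

```ocaml
(* Illustrative separator. *)
let semicolon = Re.compile (Re.char ';')

(* split ignores separators at the beginning and the end of the string. *)
let fields_skipping_edges s = Re.split semicolon s

(* split_delim reports them as empty chunks. *)
let fields_keeping_edges s = Re.split_delim semicolon s
```

split_delim is the better fit when the number of fields matters, for example when an empty leading or trailing column must be preserved.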
val split_gen : ?pos:int -> ?len:int -> re -> string -> string gen
val split_seq : ?pos:int -> ?len:int -> re -> string -> string Seq.t
val split_full : ?pos:int -> ?len:int -> re -> string -> split_token list

split_full re s splits s into chunks separated by re. It yields the chunks along with the separators. For instance this can be used with a whitespace-matching re such as "[\t ]+".

Examples:

# let regex = Re.compile (Re.char ',');;
val regex : re = <abstr>

# Re.split_full regex "Re,Ocaml,Jerome Vouillon";;
- : Re.split_token list =
  [`Text "Re"; `Delim <abstr>; `Text "Ocaml"; `Delim <abstr>;
  `Text "Jerome Vouillon"]

# Re.split_full regex "No commas in this sentence.";;
- : Re.split_token list = [`Text "No commas in this sentence."]

# Re.split_full ~pos:3 regex "1,2,3,4. Commas go brrr.";;
- : Re.split_token list =
  [`Delim <abstr>; `Text "3"; `Delim <abstr>; `Text "4. Commas go brrr."]
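
As a hedged sketch of consuming split_token values, the tokens can be pattern matched to recover both the text and the delimiters. The helpers rebuild and count_delims are illustrative.

```ocaml
(* Illustrative separator. *)
let comma = Re.compile (Re.char ',')

(* Concatenate text and delimiter tokens in order. *)
let rebuild s =
  Re.split_full comma s
  |> List.map (function
       | `Text t -> t
       | `Delim g -> Re.Group.get g 0)
  |> String.concat ""

(* Count how many delimiters were found. *)
let count_delims s =
  Re.split_full comma s
  |> List.filter (function `Delim _ -> true | `Text _ -> false)
  |> List.length
```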
val split_full_gen : ?pos:int -> ?len:int -> re -> string -> split_token gen
val split_full_seq : ?pos:int -> ?len:int -> re -> string -> split_token Seq.t
module Seq : sig ... end

String expressions (literal match)

val str : string -> t
val char : char -> t

Basic operations on regular expressions

val alt : t list -> t

Alternative.

alt [] is equivalent to empty.

By default, the leftmost match is preferred (see match semantics below).

val seq : t list -> t

Sequence

val empty : t

Match nothing

val epsilon : t

Empty word

val rep : t -> t

0 or more matches

val rep1 : t -> t

1 or more matches

val repn : t -> int -> int option -> t

repn re i j matches re at least i times and at most j times, bounds included. j = None means no upper bound.
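
A short sketch of both bound styles, wrapped in whole_string so that the bounds apply to the entire input. The patterns are illustrative.

```ocaml
(* Illustrative: a string made of 2 to 4 digits and nothing else. *)
let two_to_four = Re.compile Re.(whole_string (repn digit 2 (Some 4)))

(* Illustrative: at least 3 letters, with no upper bound. *)
let three_or_more = Re.compile Re.(whole_string (repn alpha 3 None))
```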

val opt : t -> t

0 or 1 matches

String, line, word

We define a word as a sequence of latin1 letters, digits and underscore.

val bol : t

Beginning of line

val eol : t

End of line

val bow : t

Beginning of word

val eow : t

End of word

val bos : t

Beginning of string. This differs from start because it matches the beginning of the input string even when using ~pos arguments:

let b = execp (compile (seq [ bos; str "a" ])) "aa" ~pos:1 in
assert (not b)
val eos : t

End of string. This is different from stop in the way described in bos.

val leol : t

Last end of line or end of string

val start : t

Initial position. This differs from bos because it takes into account the ~pos arguments:

let b = execp (compile (seq [ start; str "a" ])) "aa" ~pos:1 in
assert b
val stop : t

Final position. This is different from eos in the way described in start.

val word : t -> t

Word
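
A brief sketch of the effect of word, assuming it requires its argument to start and end on word boundaries (that is, to behave like seq [ bow; r; eow ]). The patterns are illustrative.

```ocaml
(* Illustrative: "cat" as a whole word. *)
let cat_word = Re.compile Re.(word (str "cat"))

(* Illustrative: "cat" anywhere, including inside a longer word. *)
let cat_anywhere = Re.compile (Re.str "cat")
```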

val not_boundary : t

Not at a word boundary

val whole_string : t -> t

Only matches the whole string, i.e. fun t -> seq [ bos; t; eos ].

Match semantics

A regular expression frequently matches a string in multiple ways. For instance exec (compile (opt (str "a"))) "ab" can match "" or "a". Match semantics can be modified with the functions below, allowing one to choose which of these is preferable.

By default, the leftmost branch of alternations is preferred, and repetitions are greedy.

Note that the existence of matches cannot be changed by specifying match semantics. seq [ bos; str "a"; non_greedy (opt (str "b")); eos ] will match when applied to "ab". However if seq [ bos; str "a"; non_greedy (opt (str "b")) ] is applied to "ab", it will match "a" rather than "ab".

Also note that multiple match semantics can conflict. In this case, the one executed earlier takes precedence. For instance, any match of shortest (seq [ bos; group (rep (str "a")); group (rep (str "a")); eos ]) will always have an empty first group. Conversely, if we use longest instead of shortest, the second group will always be empty.
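
The defaults and two of the modifiers can be compared side by side. This is a sketch under the semantics stated above. The helper matched and the patterns are illustrative.

```ocaml
(* Illustrative helper: text of the whole match (raises Not_found if none). *)
let matched re s = Re.Group.get (Re.exec (Re.compile re) s) 0

(* Alternation: leftmost branch by default, longest match under [longest]. *)
let alt_ab = Re.(alt [ str "a"; str "ab" ])
let default_alt = matched alt_ab "ab"
let longest_alt = matched (Re.longest alt_ab) "ab"

(* Repetition: greedy by default, as few repetitions as possible
   under [non_greedy]. *)
let greedy_tag = matched Re.(seq [ char '<'; rep any; char '>' ]) "<a><b>"
let lazy_tag =
  matched Re.(seq [ char '<'; non_greedy (rep any); char '>' ]) "<a><b>"
```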

val longest : t -> t

Longest match semantics. That is, matches will match as many bytes as possible. If multiple choices match the maximum amount of bytes, the one respecting the inner match semantics is preferred.

val shortest : t -> t

Same as longest, but matching the least number of bytes.

val first : t -> t

First match semantics for alternations (not repetitions). That is, matches will prefer the leftmost branch of the alternation that matches the text.

val greedy : t -> t

Greedy matches for repetitions (opt, rep, rep1, repn): they will match as many times as possible.

val non_greedy : t -> t

Non-greedy matches for repetitions (opt, rep, rep1, repn): they will match as few times as possible.

Groups (or submatches)

val group : ?name:string -> t -> t

Delimit a group. The group is considered as matching if it is used at least once (it may be used multiple times if it is nested inside rep, for instance). If it is used multiple times, the last match is what gets captured.

val no_group : t -> t

Remove all groups
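
A small sketch of the effect of no_group on the group count, assuming an illustrative pattern with two groups. Only the implicit group for the whole match should remain.

```ocaml
(* Illustrative pattern with two explicit groups. *)
let with_groups = Re.(seq [ group (rep1 alpha); char '-'; group (rep1 digit) ])

(* Whole match plus two explicit groups. *)
let n_with = Re.group_count (Re.compile with_groups)

(* Explicit groups removed: only the whole match is left. *)
let n_without = Re.group_count (Re.compile (Re.no_group with_groups))
```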

val nest : t -> t

When matching against nest e, only the group matching in the last match of e will be considered as matching.

For instance:

let re = compile (rep1 (nest (alt [ group (str "a"); str "b" ]))) in
let group = Re.exec re "ab" in
assert (Group.get_opt group 1 = None);
(* same thing but without [nest] *)
let re = compile (rep1 (alt [ group (str "a"); str "b" ])) in
let group = Re.exec re "ab" in
assert (Group.get_opt group 1 = Some "a")
val mark : t -> Mark.t * t

Mark a regexp. The returned mark can then be used to know if this regexp was used.
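
A hedged sketch of using marks to find out which branch of an alternation matched, relying on Mark.test (listed further down as the replacement for marked). The sub-patterns and the helper classify are illustrative.

```ocaml
(* Illustrative sub-patterns; each call to mark returns a mark and the
   marked regexp. *)
let number_mark, number = Re.mark Re.(rep1 digit)
let word_mark, letters = Re.mark Re.(rep1 alpha)
let token = Re.compile (Re.alt [ number; letters ])

(* Report which marked branch took part in the match, if any. *)
let classify s =
  match Re.exec_opt token s with
  | None -> "other"
  | Some g ->
    if Re.Mark.test g number_mark then "number"
    else if Re.Mark.test g word_mark then "word"
    else "other"
```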

Character sets

val set : string -> t

Any character of the string

val rg : char -> char -> t

Character ranges

val inter : t list -> t

Intersection of character sets

val diff : t -> t -> t

Difference of character sets

val compl : t list -> t

Complement of union
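
The set operations can be combined as in the sketch below. The particular sets and the helper find are illustrative, and the inputs are plain ASCII.

```ocaml
(* Lowercase letters that are not vowels. *)
let consonant = Re.(diff lower (set "aeiou"))

(* Characters in both 'a'..'m' and 'h'..'z', that is 'h'..'m'. *)
let h_to_m = Re.(inter [ rg 'a' 'm'; rg 'h' 'z' ])

(* Any character that is not a digit. *)
let non_digit = Re.(compl [ digit ])

(* Illustrative helper: all matched substrings. *)
let find re s = Re.matches (Re.compile re) s
```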

Predefined character sets

val any : t

Any character

val notnl : t

Any character but a newline

val alnum : t
val wordc : t
val alpha : t
val ascii : t
val blank : t
val cntrl : t
val digit : t
val graph : t
val lower : t
val print : t
val punct : t
val space : t
val upper : t
val xdigit : t

Case modifiers

val case : t -> t

Case sensitive matching. Note that this works on latin1, not ascii and not utf8.

val no_case : t -> t

Case insensitive matching. Note that this works on latin1, not ascii and not utf8.
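
A minimal sketch of the two modifiers, assuming case restores case sensitivity for the part it wraps inside a no_case expression. The patterns are illustrative.

```ocaml
(* Case-insensitive and default (case-sensitive) versions of one literal. *)
let hello_ci = Re.compile Re.(no_case (str "hello"))
let hello_cs = Re.compile (Re.str "hello")

(* Insensitive prefix followed by a case-sensitive "X". *)
let mixed = Re.compile Re.(no_case (seq [ str "id:"; case (str "X") ]))
```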

Internal debugging

val pp : Format.formatter -> t -> unit
val pp_re : Format.formatter -> re -> unit
val print_re : Format.formatter -> re -> unit

Alias for pp_re. Deprecated

Experimental functions

val witness : t -> string

witness r generates a string s such that execp (compile r) s is true.

Be warned that this function is buggy because it ignores zero-width assertions like beginning of words. As a result it can generate incorrect results.

Deprecated functions

type substrings = Group.t

Alias for Group.t. Deprecated

  • deprecated Use Group.t
val get : Group.t -> int -> string

Same as Group.get. Deprecated

  • deprecated Use Group.get
val get_ofs : Group.t -> int -> int * int

Same as Group.offset. Deprecated

  • deprecated Use Group.offset
val get_all : Group.t -> string array

Same as Group.all. Deprecated

  • deprecated Use Group.all
val get_all_ofs : Group.t -> (int * int) array

Same as Group.all_offset. Deprecated

  • deprecated Use Group.all_offset
val test : Group.t -> int -> bool

Same as Group.test. Deprecated

  • deprecated Use Group.test
type markid = Mark.t

Alias for Mark.t. Deprecated

  • deprecated Use Mark.
val marked : Group.t -> Mark.t -> bool

Same as Mark.test. Deprecated

  • deprecated Use Mark.test
val mark_set : Group.t -> Mark.Set.t

Same as Mark.all. Deprecated

  • deprecated Use Mark.all
val replace : ?pos:int -> ?len:int -> ?all:bool -> re -> f:(Group.t -> string) -> string -> string

replace ~all re ~f s iterates on s, and replaces every occurrence of re with f substring where substring is the current match. If all = false, then only the first occurrence of re is replaced.
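
A short sketch of replace with a computed replacement, where f receives the Group.t of the current match. The digit pattern and the helper names are illustrative.

```ocaml
(* Illustrative pattern: a run of one or more digits. *)
let number = Re.compile Re.(rep1 digit)

(* Replacement function: double the matched number. *)
let double g = string_of_int (2 * int_of_string (Re.Group.get g 0))

(* Rewrite every number in the input. *)
let double_numbers s = Re.replace number ~f:double s

(* Rewrite only the first number. *)
let double_first s = Re.replace ~all:false number ~f:double s
```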

val replace_string : ?pos:int -> ?len:int -> ?all:bool -> re -> by:string -> string -> string

replace_string ~all re ~by s iterates on s, and replaces every occurrence of re with by. If all = false, then only the first occurrence of re is replaced.

Examples:

# let regex = Re.compile (Re.char ',');;
val regex : re = <abstr>

# Re.replace_string regex ~by:";" "[1,2,3,4,5,6,7]";;
- : string = "[1;2;3;4;5;6;7]"

# Re.replace_string regex ~all:false ~by:";" "[1,2,3,4,5,6,7]";;
- : string = "[1;2,3,4,5,6,7]"
module View : sig ... end
module Emacs : sig ... end

Emacs-style regular expressions

module Glob : sig ... end

Shell-style regular expressions

module Perl : sig ... end

Perl-style regular expressions

module Pcre : sig ... end
module Posix : sig ... end

References:

module Str : sig ... end

Module Str: regular expressions and high-level string processing