Taurus Software
   

Regular Expressions in Warehouse and DataBridger

 

Regular expressions are available in Warehouse and DataBridger Studio using the RXFIND and RXMATCH functions. RXFIND is used to search a string for a pattern and RXMATCH is used to test if an entire string matches a pattern. The patterns are specified using regular expression syntax. Regular expressions were invented in the 1950s and are common throughout the computing world. Each regular expression implementation has many things in common, and each has its own enhancements to the common subset. A web search of "regular expressions" yields a cornucopia of information.

 

Contents

  1  RXFIND
       1.1  Syntax
       1.2  Arguments
       1.3  Return
       1.4  Examples
  2  RXMATCH
       2.1  Syntax
       2.2  Arguments
       2.3  Return
       2.4  Examples
  3  Regular Expression Reference
       3.1  Regular Expression Characters
       3.2  Bracket Expressions
       3.3  Regular Expression Classes
       3.4  Regular Expression Escapes
       3.5  Regular Expression Quantifiers
       3.6  Regular Expression Groups and OR
       3.7  Regular Expression Options
       3.8  Regular Expression Examples
       3.9  Further Information

1 RXFIND

 


1.1 Syntax

    The syntax of RXFIND is:

      return = RXFIND(source-string, regular-expression)

1.2 Arguments

    RXFIND has the following two arguments:

      source-string
          String to be searched for a pattern matching the regular expression

      regular-expression
          The regular expression used to search the source string

1.3 Return

    RXFIND searches the source string for a series of characters that match the regular expression. If no match is found, zero (0) is returned. If a match is found, the index within the source string of the first matched character is returned; the first character of the source string has index 1.

    Note: In a regular expression, a leading circumflex (^) matches the beginning of the source string and a trailing dollar sign ($) matches the end of the source string (see section 3.1). Therefore, if the regular expression has both a leading circumflex and a trailing dollar sign, RXFIND can return only 0 (no match) or 1 (match), and so behaves like RXMATCH. Many examples on the internet use a leading circumflex and a trailing dollar sign to require that the whole string match. For example, a Microsoft web page gives the regular expression for checking a social security number as "^\d{3}-\d{2}-\d{4}$". You can either use this regular expression as is with RXFIND, or remove the circumflex and dollar sign and call RXMATCH with "\d{3}-\d{2}-\d{4}".
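    The equivalence described in the note can be tried outside Warehouse. The sketch below uses Python's re module purely as a stand-in for the Warehouse regular expression engine (an assumption; the two engines are not guaranteed to agree in every detail). re.search plays the role of RXFIND and re.fullmatch the role of RXMATCH; the helper names are hypothetical.

```python
import re

SSN_ANCHORED = r"^\d{3}-\d{2}-\d{4}$"  # pattern as commonly published
SSN_BARE = r"\d{3}-\d{2}-\d{4}"        # same pattern without ^ and $


def anchored_search(text):
    """Search-style check (RXFIND-like) using the anchored pattern."""
    return re.search(SSN_ANCHORED, text) is not None


def whole_match(text):
    """Whole-string check (RXMATCH-like) using the bare pattern."""
    return re.fullmatch(SSN_BARE, text) is not None


for sample in ["123-45-6789", "SSN: 123-45-6789", "123-45-67890"]:
    # The two checks agree on every sample; only the first one passes.
    print(sample, anchored_search(sample), whole_match(sample))
```

    Without the anchors, a search-style call still finds the number inside the longer string "SSN: 123-45-6789", which is why the anchors matter when RXFIND is used for validation.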


1.4 Examples

    RXFIND examples within a Warehouse expression:

    No.  Expression                     Result
    1    RXFIND("123abc", "abc")        4
    2    RXFIND("123abc", "xyz")        0
    3    RXFIND("123abc", "[a-z]")      4
    4    RXFIND("123abc", "az")         0
    5    RXFIND("123abc", "a*")         1
    6    RXFIND("123abc", "a+")         4
    7    RXFIND("123abc", "[0-9].a")    2

    Notes on examples:

    1    The regular expression "abc" is found at position 4 of the source string.
    2    The regular expression "xyz" is not found in the source string.
    3    The first lower case letter (specified with [a-z]) is at position 4.
    4    The regular expression "az" is not found in the source string.
    5    a* matches zero or more a's, so it matches zero a's (an empty match) at position 1.
    6    One or more a's (specified with a+) are found at position 4.
    7    A digit (specified with [0-9]), followed by any character (specified with .), then a, is found at position 2.
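    The return convention (1-based index of the first match, 0 for no match) can be sketched in Python. This is only an approximation built on Python's re module, not the Warehouse implementation; the function name rxfind is a hypothetical stand-in chosen to mirror the table.

```python
import re


def rxfind(source, pattern):
    """Return the 1-based index of the first match, or 0 if there is none."""
    match = re.search(pattern, source)
    # match.start() is 0-based, so add 1 to follow the RXFIND convention.
    return match.start() + 1 if match else 0


# Patterns from the examples table, all applied to "123abc".
for pattern in ["abc", "xyz", "[a-z]", "az", "a*", "a+", "[0-9].a"]:
    print(pattern, rxfind("123abc", pattern))
```

    Under this sketch the seven patterns give 4, 0, 4, 0, 1, 4 and 2, the same results as the table.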


2 RXMATCH

 


2.1 Syntax

    The syntax of RXMATCH is:

      return = RXMATCH(source-string, regular-expression)

2.2 Arguments

    RXMATCH has the following two arguments:

      source-string
          String to be matched against the regular expression

      regular-expression
          The regular expression that the entire source string must match

2.3 Return

    RXMATCH matches the source string against the regular expression. If the entire source string matches the regular expression, true ($TRUE) is returned. If the source string is not an exact match, false ($FALSE) is returned.


2.4 Examples

    RXMATCH examples within a Warehouse expression:

    No.  Expression                            Result
    1    RXMATCH("abc", "abc")                 True
    2    RXMATCH("123abc", "abc")              False
    3    RXMATCH("abc", "[a-z]")               False
    4    RXMATCH("abc", "[a-z]+")              True
    5    RXMATCH("123abc", "[0-9]+")           False
    6    RXMATCH("123abc", "[0-9]+[a-z]+")     True
    7    RXMATCH("123abc", "[0-9].+c")         True

    Notes on examples:

    1    The regular expression "abc" matches the entire source string.
    2    The regular expression "abc" does not match the entire source string, because of the leading "123".
    3    The source string does not match the single character regular expression "[a-z]".
    4    The source string matches one or more lower case letters (specified with [a-z]+).
    5    The source string does not match one or more digits (specified with [0-9]+), because the digits are followed by letters.
    6    The source string matches one or more digits ([0-9]+) followed by one or more lower case letters ([a-z]+).
    7    The source string matches one digit ([0-9]), then one or more characters (.+), followed by c.
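    The whole-string behaviour of RXMATCH can be approximated in Python with re.fullmatch. As before, this is a sketch under the assumption that Python's re engine behaves like the Warehouse engine for these simple patterns; rxmatch is a hypothetical helper name.

```python
import re


def rxmatch(source, pattern):
    """Return True only if the entire source string matches the pattern."""
    return re.fullmatch(pattern, source) is not None


# (source string, regular expression) pairs from the examples table.
examples = [
    ("abc", "abc"),
    ("123abc", "abc"),
    ("abc", "[a-z]"),
    ("abc", "[a-z]+"),
    ("123abc", "[0-9]+"),
    ("123abc", "[0-9]+[a-z]+"),
    ("123abc", "[0-9].+c"),
]
for source, pattern in examples:
    print(source, pattern, rxmatch(source, pattern))
```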



3 Regular Expression Reference

 

Regular expressions used with the RXFIND and RXMATCH functions are specified using a regular expression syntax whose features are common to many programming languages. Relatively simple regular expressions are standard and can be used without change in many programming environments.




3.1 Regular Expression Characters

    In a regular expression, alphanumeric characters match themselves, and matching is case-sensitive. For example, the regular expression "abc" matches the string "abc", but not "Abc". Special characters that do not have a special meaning must also match exactly. For example, the regular expression "a@b" matches the string "a@b".

    The characters with a special meaning are listed below. In each row, the source string matches the regular expression.

    Character  Description                                                                        Source string  Regular expression
    .          A period matches any one character                                                 A5c            A.c
    \          A backslash is used to escape special characters                                   A*B            A\*B
    ^          A circumflex (hat) matches the beginning of the string                             Abc            ^Abc
    $          A dollar sign matches the end of the string                                        Abc            Abc$
    |          A vertical bar is used to do a logical OR                                          Ab             Xy|Ab
    [ ]        Square brackets are used to match any one character from a list                    Ab             A[abc]
    ( )        Parentheses are used to specify a group                                            AdeF           A(bc|de)F
    ?          A question mark matches zero or one of the previous character or group             Ac             Ab?c
    +          A plus sign matches one or more of the previous character or group                 Abbbc          Ab+c
    *          An asterisk matches zero or more of the previous character or group                Ac             Ab*c
    {m,n}      Curly braces match between m and n occurrences of the previous character or group  Abbc           Ab{2,3}c


3.2 Bracket Expressions

    Bracket expressions are used to match or not match a single character. A list of matching characters is placed within square brackets. A range may be specified using a hyphen (-) between the low and high values. If the first character is a circumflex (^), the expression is negated: a match is made if the source character is not in the list. Examples:

      [.?!]

    Matches either a period (.), question mark (?) or exclamation point(!).

      [1-9]

    Matches any digit, except 0.

      [0-9A-F]

    Matches an upper case hexadecimal digit.

      [^/:]

    Matches any character, except a slash (/) or colon (:).

      [^0-9.]

    Matches any character, except a numeric digit or a period (.).
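    A few of the bracket expressions above can be tried with Python's re module, used here only as an approximation of the Warehouse engine. The helper first_match is hypothetical; it returns the first character in the text that the bracket expression accepts.

```python
import re


def first_match(pattern, text):
    """Return the first substring matched by pattern, or None."""
    match = re.search(pattern, text)
    return match.group(0) if match else None


print(first_match(r"[.?!]", "Really? Yes."))   # first sentence terminator
print(first_match(r"[1-9]", "007"))            # first non-zero digit
print(first_match(r"[^/:]", "://host"))        # first char that is not / or :
print(first_match(r"[^0-9.]", "3.14 kg"))      # first char that is not a digit or period
```

    Note that inside the brackets the period in [.?!] and [^0-9.] is an ordinary character, not the "any character" wildcard.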



3.3 Regular Expression Classes

    Certain classes of characters (such as numeric digits) have predefined representations within regular expressions. Many classes have more than one specification to provide more compatibility between regular expressions in different computer languages.

    Class       Description                                           Bracket equivalent
    [:alnum:]   Matches any alphanumeric character                    [0-9A-Za-z]
    [:alpha:]   Matches any alphabetic character                      [A-Za-z]
    [:ascii:]   Matches any ASCII character from 0 to 127             [\x00-\x7F]
    [:blank:]   Matches a tab or space character                      [\t ]
    [:cntrl:]   Matches any control character                         [\x00-\x1F]
    [:digit:]   Matches any numeric digit                             [0-9]
    [:graph:]   Matches any graphical (non-space printing) character  [!-~]
    [:lower:]   Matches any lower case alphabetic character           [a-z]
    [:print:]   Matches any printing character                        [ -~]
    [:punct:]   Matches any punctuation character                     [!-/:-@\[-`{-~]
    [:space:]   Matches any whitespace character                      [\t\n\v\f\r ]
    [:upper:]   Matches any upper case alphabetic character           [A-Z]
    [:word:]    Matches any word character                            [0-9A-Z_a-z]
    [:xdigit:]  Matches a hexadecimal digit                           [0-9A-Fa-f]
    \p{Alnum}   Matches any alphanumeric character                    [0-9A-Za-z]
    \p{Alpha}   Matches any alphabetic character                      [A-Za-z]
    \p{ASCII}   Matches any ASCII character from 0 to 127             [\x00-\x7F]
    \p{Blank}   Matches a tab or space character                      [\t ]
    \p{Cntrl}   Matches any control character                         [\x00-\x1F]
    \p{Digit}   Matches any numeric digit                             [0-9]
    \p{Graph}   Matches any graphical (non-space printing) character  [!-~]
    \p{Lower}   Matches any lower case alphabetic character           [a-z]
    \p{Print}   Matches any printing character                        [ -~]
    \p{Punct}   Matches any punctuation character                     [!-/:-@\[-`{-~]
    \p{Space}   Matches any whitespace character                      [\t\n\v\f\r ]
    \p{Upper}   Matches any upper case alphabetic character           [A-Z]
    \p{XDigit}  Matches a hexadecimal digit                           [0-9A-Fa-f]
    \d          Matches any numeric digit                             [0-9]
    \D          Matches any character, except a numeric digit         [^0-9]
    \s          Matches any whitespace character                      [\t\n\v\f\r ]
    \S          Matches any character, except a whitespace character  [^\t\n\v\f\r ]
    \w          Matches any word character                            [0-9A-Z_a-z]
    \W          Matches any character, except a word character        [^0-9A-Z_a-z]
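    The bracket equivalents of the shorthand classes can be checked mechanically. The sketch below does so with Python's re module, which is an assumption on two counts: it is not the Warehouse engine, and it supports only the backslash shorthands (\d, \s, \w and their negations), not the [:name:] or \p{Name} forms. The re.ASCII flag restricts Python's classes to ASCII so that they line up with the table.

```python
import re

# Shorthand class -> bracket equivalent, copied from the table.
EQUIVALENTS = {
    r"\d": r"[0-9]",
    r"\D": r"[^0-9]",
    r"\s": r"[\t\n\v\f\r ]",
    r"\S": r"[^\t\n\v\f\r ]",
    r"\w": r"[0-9A-Z_a-z]",
    r"\W": r"[^0-9A-Z_a-z]",
}


def same_on_ascii(shorthand, bracket):
    """True if both patterns accept exactly the same ASCII characters."""
    for code in range(128):
        ch = chr(code)
        by_shorthand = re.fullmatch(shorthand, ch, re.ASCII) is not None
        by_bracket = re.fullmatch(bracket, ch, re.ASCII) is not None
        if by_shorthand != by_bracket:
            return False
    return True


for shorthand, bracket in EQUIVALENTS.items():
    print(shorthand, bracket, same_on_ascii(shorthand, bracket))
```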


3.4 Regular Expression Escapes

    Escapes are elements of a regular expression that begin with a backslash (\). Escapes are used to match special characters or to treat characters with a special meaning as regular characters. For example, to match an asterisk, use \* in the regular expression. The supported escapes are:

      \0nnn

    Specifies a character value in octal using nnn.

      \a

    Matches an alert (bell) or ASCII 7

      \b

    Matches a backspace or ASCII 8

      \cX

    Matches the control character corresponding to X

      \d

    Matches any numeric digit

      \D

    Matches any character, except a numeric digit (See above)

      \e

    Matches an escape or ASCII 27

      \f

    Matches a form feed or ASCII 12

      \n

    Matches a line feed or ASCII 10

      \p{group}

    Matches the specified group (See above)

      \r

    Matches a carriage return or ASCII 13

      \s

    Matches any whitespace character (See above)

      \S

    Matches any character, except a whitespace character (See above)

      \t

    Matches a horizontal tab or ASCII 9

      \uhhhh

    Specifies a character value with exactly 4 hexadecimal digits using hhhh.

      \v

    Matches a vertical tab or ASCII 11

      \w

    Matches any word character (See above)

      \W

    Matches any character, except a word character (See above)

      \xhhhh

    Specifies a character value with 1 to 4 hexadecimal digits using hhhh.

      \\

    Matches a backslash (\)

      \special

    Matches any special character. e.g. \( matches (


    Note: Using backslashes in Warehouse scripts can be confusing because the Warehouse script processor converts backslashes and then the regular expression processes them again. That means that backslashes in a script must be doubled. For example, if you wish to check if a field called src_field matches "ab\cd(ef)", you would use this RXMATCH expression:

    RXMATCH(src_field, "ab\\\\cd\\(ef\\)")
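    The two rounds of backslash processing can be traced step by step. The Python sketch below models the script processor as simply collapsing each doubled backslash into a single one; that model is an assumption made for illustration, and Python's re module again stands in for the Warehouse regular expression engine.

```python
import re

# Text as typed between the quotes in the Warehouse script. A raw string is
# used so that Python itself leaves every backslash untouched.
script_text = r"ab\\\\cd\\(ef\\)"

# Assumed script-processor step: each doubled backslash becomes one backslash.
regex_seen_by_engine = script_text.replace("\\\\", "\\")

# Value of src_field to be matched: it contains one literal backslash.
field_value = r"ab\cd(ef)"

print(regex_seen_by_engine)  # ab\\cd\(ef\)
print(re.fullmatch(regex_seen_by_engine, field_value) is not None)  # True
```

    The regular expression engine then reads \\ as one literal backslash and \( and \) as literal parentheses, which is why four backslashes are needed in the script to match one in the data.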


3.5 Regular Expression Quantifiers

    Regular expression quantifiers are used to specify how many occurrences of a character or group are needed to make a match. The quantifiers are:

      *

    Zero or more occurrences

      +

    One or more occurrences

      ?

    Zero or one occurrences

      {m}

    Exactly m occurrences

      {m,}

    At least m occurrences, i.e. m occurrences or more

      {m,n}

    Between m and n occurrences
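    The counting behaviour of each quantifier can be seen by matching whole strings against small patterns. The sketch uses Python's re module as an approximation of the Warehouse engine; the helper name matches and the sample strings are made up for illustration.

```python
import re


def matches(pattern, text):
    """True if the entire text matches the pattern."""
    return re.fullmatch(pattern, text) is not None


cases = [
    ("Ab*c", "Ac"),          # zero b's is enough for *
    ("Ab+c", "Ac"),          # + needs at least one b
    ("Ab?c", "Abbc"),        # ? allows at most one b
    ("Ab{2}c", "Abbc"),      # exactly two b's
    ("Ab{2}c", "Abbbc"),     # three b's is too many for {2}
    ("Ab{2,}c", "Abbbbbc"),  # two or more b's
    ("Ab{2,3}c", "Abbbc"),   # between two and three b's
    ("Ab{2,3}c", "Abbbbc"),  # four b's is outside {2,3}
]
for pattern, text in cases:
    print(pattern, text, matches(pattern, text))
```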



3.6 Regular Expression Groups and OR

    Parentheses ( ) are used to create groups within a regular expression. The purpose of a group is either to apply a quantifier to it or to create more than one possible match using a vertical bar (|) as an OR operator.

    Examples:

    Source string  Regular expression  Matches  Comments
    AbAbAb         (Ab)+               Yes      Match Ab one or more times
    Adef           A(bc|de)f           Yes      Starts with A, then either bc or de, followed by f
    Acdf           A(bc|de)f           No       The cd does not match
    Abcdef         A(bc|de)f           No       Only one bc or de permitted between the A and f
    Abcdef         A(bc|de)*f          Yes      Two occurrences of bc or de between the A and f
    Af             A(bc|de)*f          Yes      Zero occurrences of bc or de between the A and f
    abe            (ab|cd*)e           Yes      ab matches first part of OR, followed by e
    cde            (ab|cd*)e           Yes      cd matches second part of OR, followed by e
    ce             (ab|cd*)e           Yes      c matches and d occurs zero or more times
    cabe           (ab|cd*)+e          Yes      c matches (with zero d's), then ab, then e
    abEcdEf        ((ab|cd)E)+f        Yes      Groups may be nested to create complex expressions
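    The rows of the table can be re-checked with a short script. This is a sketch that assumes Python's re module treats groups and alternation the same way as the Warehouse engine for these patterns; whole-string matching (as in RXMATCH) is used throughout.

```python
import re

# (source string, regular expression, expected result) from the table.
ROWS = [
    ("AbAbAb", "(Ab)+", True),
    ("Adef", "A(bc|de)f", True),
    ("Acdf", "A(bc|de)f", False),
    ("Abcdef", "A(bc|de)f", False),
    ("Abcdef", "A(bc|de)*f", True),
    ("Af", "A(bc|de)*f", True),
    ("abe", "(ab|cd*)e", True),
    ("cde", "(ab|cd*)e", True),
    ("ce", "(ab|cd*)e", True),
    ("cabe", "(ab|cd*)+e", True),
    ("abEcdEf", "((ab|cd)E)+f", True),
]


def check_rows(rows):
    """Return the rows whose actual result differs from the expected one."""
    return [
        (source, pattern)
        for source, pattern, expected in rows
        if (re.fullmatch(pattern, source) is not None) != expected
    ]


print(check_rows(ROWS))  # an empty list means every row behaves as listed
```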



3.7 Regular Expression Options

    Regular expressions may contain options, specified with a leading (? followed by the option, then ). Only two options are supported: i for case insensitive matching, and s to allow a dot (.) to match a newline character. Normally a dot (.) does not match a newline.

    Examples:

    Source string  Regular expression  Matches  Comments
    Abc            (?i)ABC             Yes      Case insensitive match
    Abc            (?i)[A-Z]+          Yes      Case insensitive match with 1 or more A to Z
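    Python's re module happens to accept the same (?i) and (?s) prefixes, so the effect of both options can be sketched with it. Treat this as an illustration under the assumption that the Warehouse engine behaves the same way; the strings containing a newline are made-up samples.

```python
import re


def matches(pattern, text):
    """True if the entire text matches the pattern."""
    return re.fullmatch(pattern, text) is not None


# Option i: case insensitive matching.
print(matches("ABC", "Abc"))         # False: matching is case-sensitive by default
print(matches("(?i)ABC", "Abc"))     # True
print(matches("(?i)[A-Z]+", "Abc"))  # True

# Option s: let a dot match a newline character.
print(matches("a.c", "a\nc"))        # False: a dot normally skips newlines
print(matches("(?s)a.c", "a\nc"))    # True
```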



3.8 Regular Expression Examples

    Here are some things that can be matched using regular expressions:

    Item           Regular expression
    Email address  [a-zA-Z0-9_.+-]+@([a-zA-Z0-9-]+\.)+[a-zA-Z]{2,5}
    Roman numeral  M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})
    Web address    https?://((.)+\.)+[A-Za-z]{2,5}(/.*)?
    Phone number   [01]?[- .]?(\([2-9]\d{2}\)|[2-9]\d{2})[- .]?\d{3}[- .]?\d{4}
    Real number    [+-]?\d+(\.\d*)?([eE][+-]?\d+)?
    SSN            \d{3}-\d{2}-\d{4}
    US Dollars     \$(\d{1,3}(\,\d{3})*|(\d+))(\.\d{2})?
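    Several of these patterns can be exercised as whole-string validators, in the manner of RXMATCH. The sketch below uses Python's re module as a stand-in for the Warehouse engine (an assumption), with hypothetical names and made-up sample values. In a Warehouse script the backslashes would additionally need to be doubled, as noted in section 3.4.

```python
import re

PATTERNS = {
    "email": r"[a-zA-Z0-9_.+-]+@([a-zA-Z0-9-]+\.)+[a-zA-Z]{2,5}",
    "roman": r"M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})",
    "phone": r"[01]?[- .]?(\([2-9]\d{2}\)|[2-9]\d{2})[- .]?\d{3}[- .]?\d{4}",
    "real": r"[+-]?\d+(\.\d*)?([eE][+-]?\d+)?",
    "ssn": r"\d{3}-\d{2}-\d{4}",
    "usd": r"\$(\d{1,3}(\,\d{3})*|(\d+))(\.\d{2})?",
}


def is_valid(kind, text):
    """True if the entire text matches the pattern registered for kind."""
    return re.fullmatch(PATTERNS[kind], text) is not None


samples = [
    ("email", "user@example.com"),
    ("email", "user@example"),      # no top-level domain
    ("roman", "MCMXCIV"),
    ("roman", "IIII"),              # at most three I's in a row
    ("phone", "(555) 123-4567"),
    ("real", "-3.14e10"),
    ("real", ".5"),                 # a digit is required before the point
    ("ssn", "123-45-6789"),
    ("usd", "$1,234.56"),
    ("usd", "$12.5"),               # cents need exactly two digits
]
for kind, text in samples:
    print(kind, text, is_valid(kind, text))
```

    These patterns check only the shape of the text. For example, the Roman numeral pattern also accepts an empty string, so a production check may need additional rules.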


3.9 Further Information

    The following websites contain more information about regular expressions. Keep in mind that each regular expression implementation is different and the information in these websites is not necessarily applicable to the Warehouse implementation.

      • JavaScript Regular Expressions
      • Much information about regular expressions
      • Microsoft .NET Regular Expression Reference
      • Open Group Regular Expression Standard
      • Oracle Java Regular Expression Reference
      • w3schools.com JavaScript Regular Expression Reference
      • Wikipedia information on regular expressions


Last updated 07/31/2017