Taurus Software - Regular Expressions



Regular Expressions in Warehouse and DataBridger

Regular expressions are available in Warehouse and DataBridger Studio through the RXFIND and RXMATCH functions. RXFIND searches a string for a pattern, and RXMATCH tests whether an entire string matches a pattern. The patterns are specified using regular expression syntax. Regular expressions were invented in the 1950s and are common throughout the computing world. Implementations share a large common subset, and each adds its own enhancements to it. A web search for "regular expressions" yields a cornucopia of information.

Contents

1	RXFIND
	1.1	Syntax
	1.2	Arguments
	1.3	Return
	1.4	Examples
2	RXMATCH
	2.1	Syntax
	2.2	Arguments
	2.3	Return
	2.4	Examples
3	Regular Expression Reference
	3.1	Regular Expression Characters
	3.2	Bracket Expressions
	3.3	Regular Expression Classes
	3.4	Regular Expression Escapes
	3.5	Regular Expression Quantifiers
	3.6	Regular Expression Groups and OR
	3.7	Regular Expression Options
	3.8	Regular Expression Examples
	3.9	Further Information

1 RXFIND

1.1 Syntax

The syntax of RXFIND is:

return = RXFIND(source-string, regular-expression)

1.2 Arguments

RXFIND has the following two arguments:

	source-string	String to be searched for a pattern matching the regular expression
	regular-expression	The regular expression used to search the source string

1.3 Return

RXFIND searches the source string for a series of characters that match the regular expression. If the regular expression is not found, zero (0) is returned. If the regular expression is found, the index within the source string of the first character matched is returned, with the first index being 1.

Note: In a regular expression, a leading circumflex (^) indicates a match with the beginning of the source string and a trailing dollar sign ($) indicates a match with the end of the source string (See 3.1 below). Therefore, if the regular expression has a leading circumflex and a trailing dollar sign, RXFIND returns only 0 for no match or 1 for a match, thus performing like RXMATCH. Many examples from the internet have a leading circumflex and a trailing dollar sign to indicate the RXMATCH functionality. For example, this Microsoft web page shows the regular expression to check a social security number is "^\d{3}-\d{2}-\d{4}$". You can either use this regular expression as is and call RXFIND, or remove the circumflex and dollar sign and call RXMATCH using "\d{3}-\d{2}-\d{4}".
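The equivalence described in the note can be tried outside Warehouse. The following is a minimal sketch in Python, not Warehouse script: rxfind and rxmatch are hypothetical stand-ins built on Python's re module, which is assumed to behave like the Warehouse engine for this simple pattern.

```python
import re

def rxfind(source, pattern):
    """Hypothetical stand-in for RXFIND: 1-based index of the first match, 0 if none."""
    m = re.search(pattern, source)
    return m.start() + 1 if m else 0

def rxmatch(source, pattern):
    """Hypothetical stand-in for RXMATCH: True only if the whole string matches."""
    return re.fullmatch(pattern, source) is not None

ANCHORED = r"^\d{3}-\d{2}-\d{4}$"   # form often seen on the internet
BARE = r"\d{3}-\d{2}-\d{4}"         # same pattern without ^ and $

# Note: Python's $ also matches just before a final newline, so the
# samples below deliberately contain no newline characters.
for s in ["123-45-6789", "SSN 123-45-6789", "123-45-67890"]:
    print(s, "| anchored find:", rxfind(s, ANCHORED),
          "| bare match:", rxmatch(s, BARE),
          "| bare find:", rxfind(s, BARE))
```

For each sample, the anchored find is 1 exactly when the bare match is true. The bare find can still report a position (5 for the second sample, 1 for the third) because it accepts a match anywhere in the string.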

1.4 Examples

RXFIND examples within a Warehouse expression:

No.	Expression	Result
1	RXFIND("123abc", "abc")	4
2	RXFIND("123abc", "xyz")	0
3	RXFIND("123abc", "[a-z]")	4
4	RXFIND("123abc", "az")	0
5	RXFIND("123abc", "a*")	1
6	RXFIND("123abc", "a+")	4
7	RXFIND("123abc", "[0-9].a")	2

Notes on examples:

	1	The regular expression "abc" is found in position 4 of the source string.
	2	The regular expression "xyz" is not found in the source string.
	3	The first lower case letter (specified with [a-z]) is in position 4.
	4	The regular expression "az" is not found in the source string.
	5	Zero a's (specified with a*) are found in position 1.
	6	One or more a's (specified with a+) are found in position 4.
	7	A digit (specified with [0-9]), followed by any character (specified with .), then a, is found in position 2.
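To experiment with these results outside Warehouse, the RXFIND behavior can be approximated in Python. This is a sketch under assumptions: rxfind is a hypothetical helper built on Python's re.search, and Python's engine is assumed to agree with Warehouse for these simple patterns.

```python
import re

def rxfind(source, pattern):
    """Hypothetical stand-in for RXFIND: 1-based index of the first match, 0 if none."""
    m = re.search(pattern, source)
    return m.start() + 1 if m else 0

# (pattern, result listed in the examples table)
EXAMPLES = [
    ("abc", 4),
    ("xyz", 0),
    ("[a-z]", 4),
    ("az", 0),
    ("a*", 1),       # zero a's match at the very first position
    ("a+", 4),
    ("[0-9].a", 2),
]

for pattern, expected in EXAMPLES:
    result = rxfind("123abc", pattern)
    print(f'RXFIND("123abc", "{pattern}") -> {result} (table says {expected})')
```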

2 RXMATCH

2.1 Syntax

The syntax of RXMATCH is:

return = RXMATCH(source-string, regular-expression)

2.2 Arguments

RXMATCH has the following two arguments:

	source-string	String to be searched for a pattern matching the regular expression
	regular-expression	The regular expression used to search the source string

2.3 Return

RXMATCH matches the source string against the regular expression. If the entire source string matches the regular expression, true ($TRUE) is returned. If the source string is not an exact match, false ($FALSE) is returned.

2.4 Examples

RXMATCH examples within a Warehouse expression:

No.	Expression	Result
1	RXMATCH("abc", "abc")	True
2	RXMATCH("123abc", "abc")	False
3	RXMATCH("abc", "[a-z]")	False
4	RXMATCH("abc", "[a-z]+")	True
5	RXMATCH("123abc", "[0-9]+")	False
6	RXMATCH("123abc", "[0-9]+[a-z]+")	True
7	RXMATCH("123abc", "[0-9].+c")	True

Notes on examples:

	1	The regular expression "abc" matches the source string.
	2	The regular expression "abc" does not match the entire source string.
	3	The source string does not match the single character regular expression "[a-z]".
	4	The source string matches one or more lower case letters (specified with [a-z]+).
	5	The source string does not consist solely of one or more digits (specified with [0-9]+).
	6	The source string matches one or more digits ([0-9]+) followed by one or more lower case letters ([a-z]+).
	7	The source string matches one digit ([0-9]), then one or more characters (.+), followed by c.
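The whole-string behavior of RXMATCH can be approximated in Python with re.fullmatch. This is a sketch under assumptions: rxmatch is a hypothetical helper, and Python's engine is assumed to agree with Warehouse for these simple patterns.

```python
import re

def rxmatch(source, pattern):
    """Hypothetical stand-in for RXMATCH: True only if the whole string matches."""
    return re.fullmatch(pattern, source) is not None

# (source string, pattern, result listed in the examples table)
EXAMPLES = [
    ("abc", "abc", True),
    ("123abc", "abc", False),      # pattern covers only part of the string
    ("abc", "[a-z]", False),       # pattern is a single character
    ("abc", "[a-z]+", True),
    ("123abc", "[0-9]+", False),
    ("123abc", "[0-9]+[a-z]+", True),
    ("123abc", "[0-9].+c", True),
]

for source, pattern, expected in EXAMPLES:
    result = rxmatch(source, pattern)
    print(f'RXMATCH("{source}", "{pattern}") -> {result} (table says {expected})')
```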

3 Regular Expression Reference

Regular expressions for the RXFIND and RXMATCH functions are specified using regular expression syntax with features common to many computer programming languages. Relatively simple regular expressions are standard and can be used without change in many programming environments.

3.1 Regular Expression Characters

In a regular expression all alphanumeric characters are matched as they are in a case-sensitive manner. For example, the regular expression "abc" matches string "abc", but not "Abc". Unless they have a special meaning, special characters also must match exactly. For example, the regular expression "a@b" matches string "a@b".

The characters with a special meaning are:

		Source		Regular
	Description	String		Expression
.	A period matches any one character	A5c	matches	A.c
\	A backslash is used to escape special characters	A*B	matches	A\*B
^	A circumflex (hat) matches the beginning of the string	Abc	matches	^Abc
$	A dollar sign matches the end of the string	Abc	matches	Abc$
|	A vertical bar is used to do a logical OR	Ab	matches	Xy|Ab
[ ]	Square brackets are used to match one character from a set	Ab	matches	A[abc]
( )	Parentheses are used to specify a group	AdeF	matches	A(bc|de)F
?	A question mark matches zero or one of the previous character or group	Ac	matches	Ab?c
+	A plus sign matches one or more of the previous character or group	Abbbc	matches	Ab+c
*	An asterisk matches zero or more of the previous character or group	Ac	matches	Ab*c
{m,n}	Curly braces are used to match a range of the previous character or group	Abbc	matches	Ab{2,3}c
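Each row of the table above can be checked as a whole-string match. The sketch below uses Python's re module as an assumed stand-in for the Warehouse engine; rxmatch is a hypothetical helper, and the vertical bar is written as a plain | inside the patterns.

```python
import re

def rxmatch(source, pattern):
    """Hypothetical stand-in for RXMATCH: True only if the whole string matches."""
    return re.fullmatch(pattern, source) is not None

# (source string, regular expression) pairs taken from the table
TABLE = [
    ("A5c", r"A.c"),
    ("A*B", r"A\*B"),
    ("Abc", r"^Abc"),
    ("Abc", r"Abc$"),
    ("Ab", r"Xy|Ab"),
    ("Ab", r"A[abc]"),
    ("AdeF", r"A(bc|de)F"),
    ("Ac", r"Ab?c"),
    ("Abbbc", r"Ab+c"),
    ("Ac", r"Ab*c"),
    ("Abbc", r"Ab{2,3}c"),
]

for source, pattern in TABLE:
    print(f"{source!r} matches {pattern!r}: {rxmatch(source, pattern)}")
```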

3.2 Bracket Expressions

Bracket expressions are used to match or not match a single character. A list of matching characters is placed within square brackets. A range may be specified using a hyphen (-) between the low and high values. If the first character is a circumflex (^), the expression is interpreted as NOT, meaning a match is made if the source character is not in the list. Examples:

	[.?!]	Matches either a period (.), question mark (?) or exclamation point(!).
	[1-9]	Matches any digit, except 0.
	[0-9A-F]	Matches an upper case hexadecimal digit.
	[^/:]	Matches any character, except a slash (/) or colon (:).
	[^0-9.]	Matches any character, except a numeric digit or a period (.).
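One way to see what a bracket expression accepts is to filter a sample string one character at a time. This is a sketch in Python, with the re module as an assumed stand-in for the Warehouse engine; matching_chars and the sample string are hypothetical.

```python
import re

def matching_chars(bracket, sample):
    """Return the characters of sample that the bracket expression accepts."""
    return "".join(ch for ch in sample if re.fullmatch(bracket, ch))

SAMPLE = "a.?!0159/:Z"

print(matching_chars(r"[.?!]", SAMPLE))    # .?!
print(matching_chars(r"[1-9]", SAMPLE))    # 159
print(matching_chars(r"[^/:]", SAMPLE))    # a.?!0159Z
print(matching_chars(r"[^0-9.]", SAMPLE))  # a?!/:Z
```

Inside the brackets, the period and question mark lose their special meaning and are matched literally.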

3.3 Regular Expression Classes

Certain classes of characters (such as numeric digits) have predefined representations within regular expressions. Many classes have more than one specification to provide more compatibility between regular expressions in different computer languages.

		Bracket
Class	Description	Equivalent
[:alnum:]	Matches any alphanumeric character	[0-9A-Za-z]
[:alpha:]	Matches any alphabetic character	[A-Za-z]
[:ascii:]	Matches any ASCII character from 0 to 127	[\x00-\x7F]
[:blank:]	Matches a tab or space character	[\t ]
[:cntrl:]	Matches any control character	[\x00-\x1F]
[:digit:]	Matches any numeric digit	[0-9]
[:graph:]	Matches any graphical (non-space printing) character	[!-~]
[:lower:]	Matches any lower case alphabetic character	[a-z]
[:print:]	Matches any printing character	[ -~]
[:punct:]	Matches any punctuation character	[!-/:-@\[-`{-~]
[:space:]	Matches any whitespace character	[\t\n\v\f\r ]
[:upper:]	Matches any upper case alphabetic character	[A-Z]
[:word:]	Matches any word character	[0-9A-Z_a-z]
[:xdigit:]	Matches a hexadecimal digit	[0-9A-Fa-f]
\p{Alnum}	Matches any alphanumeric character	[0-9A-Za-z]
\p{Alpha}	Matches any alphabetic character	[A-Za-z]
\p{ASCII}	Matches any ASCII character from 0 to 127	[\x00-\x7F]
\p{Blank}	Matches a tab or space character	[\t ]
\p{Cntrl}	Matches any control character	[\x00-\x1F]
\p{Digit}	Matches any numeric digit	[0-9]
\p{Graph}	Matches any graphical (non-space printing) character	[!-~]
\p{Lower}	Matches any lower case alphabetic character	[a-z]
\p{Print}	Matches any printing character	[ -~]
\p{Punct}	Matches any punctuation character	[!-/:-@\[-`{-~]
\p{Space}	Matches any whitespace character	[\t\n\v\f\r ]
\p{Upper}	Matches any upper case alphabetic character	[A-Z]
\p{XDigit}	Matches a hexadecimal digit	[0-9A-Fa-f]
\d	Matches any numeric digit	[0-9]
\D	Matches any character, except a numeric digit	[^0-9]
\s	Matches any whitespace character	[\t\n\v\f\r ]
\S	Matches any character, except a whitespace character	[^\t\n\v\f\r ]
\w	Matches any word character	[0-9A-Z_a-z]
\W	Matches any character, except a word character	[^0-9A-Z_a-z]
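The bracket equivalents of the backslash classes can be checked mechanically over the 128 ASCII characters. The sketch below uses Python's re module with the re.ASCII flag as an assumed stand-in. Python's re does not accept the [:name:] or \p{Name} forms, so only the \d, \s and \w families are compared; same_set is a hypothetical helper.

```python
import re

ASCII_CHARS = [chr(i) for i in range(128)]

# (class, bracket equivalent) pairs taken from the table
PAIRS = [
    (r"\d", r"[0-9]"),
    (r"\D", r"[^0-9]"),
    (r"\s", r"[\t\n\v\f\r ]"),
    (r"\S", r"[^\t\n\v\f\r ]"),
    (r"\w", r"[0-9A-Z_a-z]"),
    (r"\W", r"[^0-9A-Z_a-z]"),
]

def same_set(cls, bracket):
    """True if both patterns accept exactly the same ASCII characters."""
    a = re.compile(cls, re.ASCII)
    b = re.compile(bracket, re.ASCII)
    return all(bool(a.fullmatch(c)) == bool(b.fullmatch(c)) for c in ASCII_CHARS)

for cls, bracket in PAIRS:
    print(cls, "is equivalent to", bracket, ":", same_set(cls, bracket))
```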

3.4 Regular Expression Escapes

Escapes are elements of a regular expression that begin with a backslash (\). Escapes are used to match special characters or to treat characters with a special meaning as regular characters. For example, if you wished to match an asterisk, you would use \* in the regular expression. Here is a list of supported escapes:

	\0nnn	Specifies a character value in octal using nnn.
	\a	Matches an alert (bell) or ASCII 7
	\b	Matches a backspace or ASCII 8
	\cX	Matches the control character corresponding to X
	\d	Matches any numeric digit
	\D	Matches any character, except a numeric digit (See above)
	\e	Matches an escape or ASCII 27
	\f	Matches a form feed or ASCII 12
	\n	Matches a line feed or ASCII 10
	\p{group}	Matches the specified group (See above)
	\r	Matches a carriage return or ASCII 13
	\s	Matches any whitespace character (See above)
	\S	Matches any character, except a whitespace character (See above)
	\t	Matches a horizontal tab or ASCII 9
	\uhhhh	Specifies a character value with exactly 4 hexadecimal digits using hhhh.
	\v	Matches a vertical tab or ASCII 11
	\w	Matches any word character (See above)
	\W	Matches any character, except a word character (See above)
	\xhhhh	Specifies a character value with 1 to 4 hexadecimal digits using hhhh.
	\\	Matches a backslash (\)
	\special	Matches any special character. e.g. \( matches (

Note: Using backslashes in Warehouse scripts can be confusing because the Warehouse script processor converts backslashes and then the regular expression processes them again. That means that backslashes in a script must be doubled. For example, if you wish to check if a field called src_field matches "ab\cd(ef)", you would use this RXMATCH expression:

RXMATCH(src_field, "ab\\\\cd\\(ef\\)")
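The two processing passes can be simulated to see why the backslashes are doubled. This is a sketch in Python, not Warehouse script. script_unescape is a hypothetical simplification that assumes the script processor turns a backslash followed by any character into that character; the actual Warehouse rules may differ. The pattern text assumes the literal parentheses are escaped as \( and \) at the regular expression level.

```python
import re

def script_unescape(text):
    """Assumed model of the script pass: backslash + character -> character."""
    return re.sub(r"\\(.)", lambda m: m.group(1), text)

# Text typed between the quotes in the script (every backslash doubled).
script_text = r"ab\\\\cd\\(ef\\)"

# What the regular expression engine receives after the script pass.
regex = script_unescape(script_text)
print(regex)  # ab\\cd\(ef\)

# The value being tested: a, b, one backslash, c, d, then (ef) in parentheses.
target = r"ab\cd(ef)"
print(re.fullmatch(regex, target) is not None)  # True
```

Under this model, the single backslash in the data needs two backslashes in the regular expression and therefore four in the script. Each literal parenthesis needs one backslash in the regular expression and therefore two in the script.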

3.5 Regular Expression Quantifiers

Regular Expression quantifiers are used to specify how many occurrences of a character or group are needed to make a match. The quantifiers are:

	*	Zero or more occurrences
	+	One or more occurrences
	?	Zero or one occurrences
	{m}	Exactly m occurrences
	{m,}	At least m occurrences, i.e. m occurrences or more
	{m,n}	Between m and n occurrences
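The effect of each quantifier can be compared on a small set of strings that differ only in the number of b's. This sketch uses Python's re module as an assumed stand-in for the Warehouse engine; accepted and the sample strings are hypothetical.

```python
import re

SOURCES = ["ac", "abc", "abbc", "abbbc"]  # zero to three b's

def accepted(pattern):
    """Return the sample strings that the pattern matches in full."""
    return [s for s in SOURCES if re.fullmatch(pattern, s)]

print(accepted(r"ab*c"))      # ['ac', 'abc', 'abbc', 'abbbc']
print(accepted(r"ab+c"))      # ['abc', 'abbc', 'abbbc']
print(accepted(r"ab?c"))      # ['ac', 'abc']
print(accepted(r"ab{2}c"))    # ['abbc']
print(accepted(r"ab{2,}c"))   # ['abbc', 'abbbc']
print(accepted(r"ab{1,2}c"))  # ['abc', 'abbc']
```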

3.6 Regular Expression Groups and OR

Parentheses ( ) are used to create groups within a regular expression. The purpose of a group is either to use a quantifier or to create more than one possible match using a vertical bar (|) as an OR operator.

Examples:

Source	Regular
String	Expression	Matches	Comments
AbAbAb	(Ab)+	Yes	Match Ab one or more times
Adef	A(bc|de)f	Yes	Match start with A, then either bc or de, followed by f
Acdf	A(bc|de)f	No	The cd does not match
Abcdef	A(bc|de)f	No	Only one bc or de permitted between the A and f
Abcdef	A(bc|de)*f	Yes	Two occurrences of bc or de between the A and f
Af	A(bc|de)*f	Yes	Zero occurrences of bc or de between the A and f
abe	(ab|cd*)e	Yes	ab matches first part of OR, followed by e
cde	(ab|cd*)e	Yes	cd matches second part of OR, followed by e
ce	(ab|cd*)e	Yes	c matches and d occurs zero or more times.
cabe	(ab|cd*)+e	Yes	c matches, then ab, then e.
abEcdEf	((ab|cd)E)+f	Yes	Groups may be nested to create complex expressions.
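The rows above can be checked as whole-string matches. The sketch below uses Python's re module as an assumed stand-in for the Warehouse engine; rxmatch is a hypothetical helper, and the OR operator is written as a plain | inside the patterns.

```python
import re

def rxmatch(source, pattern):
    """Hypothetical stand-in for RXMATCH: True only if the whole string matches."""
    return re.fullmatch(pattern, source) is not None

# (source string, regular expression, Matches column from the table)
ROWS = [
    ("AbAbAb", r"(Ab)+", True),
    ("Adef", r"A(bc|de)f", True),
    ("Acdf", r"A(bc|de)f", False),
    ("Abcdef", r"A(bc|de)f", False),
    ("Abcdef", r"A(bc|de)*f", True),
    ("Af", r"A(bc|de)*f", True),
    ("abe", r"(ab|cd*)e", True),
    ("cde", r"(ab|cd*)e", True),
    ("ce", r"(ab|cd*)e", True),
    ("cabe", r"(ab|cd*)+e", True),
    ("abEcdEf", r"((ab|cd)E)+f", True),
]

for source, pattern, expected in ROWS:
    print(f"{source:8} {pattern:14} {rxmatch(source, pattern)} (table says {expected})")
```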

3.7 Regular Expression Options

Regular expressions may contain options specified with a leading (? followed by the option, then ). Only two options are supported: i to do case insensitive matching, and s to allow a dot (.) to match a newline character. Normally a dot (.) does not match a newline.

Examples:

Source	Regular
String	Expression	Matches	Comments
Abc	(?i)ABC	Yes	Case insensitive match
Abc	(?i)[A-Z]+	Yes	Case insensitive match with 1 or more A to Z
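Both options can be tried in Python, whose re module accepts the same (?i) and (?s) notation at the start of a pattern. This is a sketch under the assumption that the two engines agree on these options; rxmatch is a hypothetical helper.

```python
import re

def rxmatch(source, pattern):
    """Hypothetical stand-in for RXMATCH: True only if the whole string matches."""
    return re.fullmatch(pattern, source) is not None

# i option: case insensitive matching
print(rxmatch("Abc", r"ABC"))           # False
print(rxmatch("Abc", r"(?i)ABC"))       # True
print(rxmatch("Abc", r"(?i)[A-Z]+"))    # True

# s option: allow a dot to match a newline
print(rxmatch("a\nb", r"a.b"))          # False
print(rxmatch("a\nb", r"(?s)a.b"))      # True
```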

3.8 Regular Expression Examples

Here are some things that can be matched using regular expressions:

	Item	Regular Expression
	Email address	[a-zA-Z0-9_.+-]+@([a-zA-Z0-9-]+\.)+[a-zA-Z]{2,5}
	Roman numeral	M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})
	Web address	https?://((.)+\.)+[A-Za-z]{2,5}(/.*)?
	Phone number	[01]?[- .]?(\([2-9]\d{2}\)|[2-9]\d{2})[- .]?\d{3}[- .]?\d{4}
	Real number	[+-]?\d+(\.\d*)?([eE][+-]?\d+)?
	SSN	\d{3}-\d{2}-\d{4}
	US Dollars	\$(\d{1,3}(\,\d{3})*|(\d+))(\.\d{2})?
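The patterns above can be exercised against sample values before they are used in a script. The sketch below uses Python's re module as an assumed stand-in for the Warehouse engine. The sample values are made up for illustration, the OR operator is written as a plain |, and the phone pattern is assumed to use \( and \) for the literal parentheses around the area code.

```python
import re

PATTERNS = {
    "email": r"[a-zA-Z0-9_.+-]+@([a-zA-Z0-9-]+\.)+[a-zA-Z]{2,5}",
    "roman": r"M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})",
    "web": r"https?://((.)+\.)+[A-Za-z]{2,5}(/.*)?",
    "phone": r"[01]?[- .]?(\([2-9]\d{2}\)|[2-9]\d{2})[- .]?\d{3}[- .]?\d{4}",
    "real": r"[+-]?\d+(\.\d*)?([eE][+-]?\d+)?",
    "ssn": r"\d{3}-\d{2}-\d{4}",
    "dollars": r"\$(\d{1,3}(\,\d{3})*|(\d+))(\.\d{2})?",
}

def check(item, value):
    """True if value matches the named pattern in full (RXMATCH-style)."""
    return re.fullmatch(PATTERNS[item], value) is not None

# (item, sample value) pairs; the samples are hypothetical test data
SAMPLES = [
    ("email", "user.name+tag@example.com"), ("email", "user@localhost"),
    ("roman", "MCMXCIV"), ("roman", "IIII"),
    ("web", "http://www.example.com/index.html"), ("web", "ftp://example.com"),
    ("phone", "(555) 123-4567"), ("phone", "123-456-7890"),
    ("real", "-3.14e10"), ("real", ".5"),
    ("ssn", "123-45-6789"), ("ssn", "123456789"),
    ("dollars", "$1,234.56"), ("dollars", "$1,23"),
]

for item, value in SAMPLES:
    print(f"{item:8} {value!r}: {check(item, value)}")
```

In each pair the first sample matches and the second does not. The Roman numeral pattern also matches an empty string, so a separate length check may be needed where an empty value is not acceptable.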

3.9 Further Information

The following websites contain more information about regular expressions. Keep in mind that each regular expression implementation is different and the information in these websites is not necessarily applicable to the Warehouse implementation.

	JavaScript Regular Expressions
	Much information about regular expressions
	Microsoft .NET Regular Expression Reference
	Open Group Regular Expression Standard
	Oracle Java Regular Expression Reference
	w3schools.com JavaScript Regular Expression Reference
	Wikipedia information on regular expressions

Last updated 07/31/2017