  • Regular expressions can be a pain. Well, sometimes!

    Let’s learn about Regular Expressions and their patterns. We are going to look into such patterns that seem like a convoluted soup of characters. We will see what every character in a regular expression means.

    After reading this article, you will be able to create your own regular expressions and use them as you like. At the end, we also list some online RegEx testing tools so that you can build a RegEx for your requirement and test it.

    Introduction

    A regular expression, commonly known as RegEx, is a sequence of characters that can be used as a pattern to search for characters or strings.

    For example, to determine if a string or phrase contains the word “apple”, we can use the regex “/apple/” to search within the string. As another example, we can use “/[0-9]/” to check if a given string contains a digit between 0 and 9.
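As a minimal sketch, the two searches above could look like this in Python. The choice of Python and the sample strings are assumptions made for illustration. Python's re module takes the pattern as a plain string, so the slash delimiters are not needed.

```python
import re

# Sample inputs (hypothetical, chosen only for illustration).
phrase = "I ate an apple today"
order = "Order number 7 is ready"

# re.search scans the whole string and returns a match object, or None.
has_apple = re.search(r"apple", phrase) is not None
has_digit = re.search(r"[0-9]", order) is not None
no_digit = re.search(r"[0-9]", phrase) is not None

print(has_apple, has_digit, no_digit)  # True True False
```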

    Regular Expressions and their use

    Regular expressions are widely used for a variety of purposes in modern web-related operations. Validation of web forms, web search engines, lexical analyzers in IDEs, text editors, and document editors are a few examples where regular expressions are frequently used.

    We have all used “CTRL + F” many times to search within a document or a piece of code for a particular word, phrase, or expression. This is a very familiar example of pattern searching, and many editors let you enter a regular expression in that search box.

    Before going on any further, let’s have a look at a very commonly used regular expression.

    Can you guess 🤔 what the RegEx below is used for?

    ^([a-zA-Z0-9_\-\.]+)@([a-zA-Z0-9_\-\.]+)\.([a-zA-Z]{2,5})$

    Don’t worry if you can’t guess it. I am sure you will be able to by the end of this article.

    First, let’s get started with the A, B, C of RegEx.

    Tokens

    To start with, let’s look at the various symbols in the Regex shown above.

    ^([a-zA-Z0-9_\-\.]+)@([a-zA-Z0-9_\-\.]+)\.([a-zA-Z]{2,5})$

    If we look at the regex given above, we can see that it is composed of many symbols, characters, or tokens. Let’s find out what they mean:

    Token

    Meaning

    ^

    This token denotes the start of a string.

    (…)

    This denotes a group where everything that is given within (…) is captured.

    […]

    The [] encloses characters any of which can be matched. For example – [abc] will match either a or b or c.

    a-z

    The set of lowercase letters from a to z. We must keep in mind that Regex is case sensitive by default.

    A-Z

    The set of uppercase characters from A to Z.

    0-9

    The digits from 0 to 9.

    _

    This will match the character _.

    \

    This is the escape character.

    \.

    This matches the character “.” literally. This is used because the symbol “.” in regex is a token in itself which matches any character

    +

    This is a quantifier. It matches one or more occurrences of the token it follows. For example, a+ means one or more occurrences of the character a.

    \-

    This will match the “-” character.

    @

    This will match the “@” character.

    {}

    This is another quantifier. It is used to denote the number of occurrences of a character. For example, a{3} means exactly 3 a’s, and a{2,5} means between 2 and 5 a’s.

    $

    This denotes the end of a string.

    Break down of the given Regex pattern

    Now, armed with this preliminary knowledge of tokens, let’s try to decode the above regular expression:

    • ^([a-zA-Z0-9_\-\.]+) means we are looking for a string that starts with one or more uppercase or lowercase letters, digits, underscores, hyphens, or dots. For instance, anything that looks like user_name.01 will match the pattern. Remember that the string doesn’t need to include all of these symbols; any one character from [a-zA-Z0-9_\-\.] will do.
    • The @ character matches a single occurrence of @. Adding to the previous example, something like user_name.01@ will fit.
    • ([a-zA-Z0-9_\-\.]+) is similar to the first point. It too means that we are looking for one or more letters, digits, underscores, hyphens, or dots. Adding to the example, user_name.01@ followed by a domain name will fit here.
    • As you might have already guessed, we are hinting at an email pattern. Moving on, \. matches a single “.” character. If we continue with the ongoing example, the string so far is user_name.01@, a domain name, and a dot.
    • ([a-zA-Z]{2,5})$ means that the string should end with 2 to 5 letters, either uppercase or lowercase. If we add .com to the previous example, we get a string of the form user_name.01@domain.com, which is the common pattern of an email string.

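    Combining all of the above, we can see that we are searching for an email ID string. Now we can use this expression to check email IDs: if our test email ID matches this pattern, we can say it has a valid format.

    P.S. This is a pattern for the most common email IDs on the web. It does not cover every valid address; for example, it rejects addresses containing a + sign or ending in a top-level domain longer than five letters.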
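To see the three groups at work, here is a small sketch in Python. The helper name parse_email and the sample addresses are hypothetical and only serve the illustration; the pattern itself is the one broken down above.

```python
import re

# The email pattern from this article, compiled once for reuse.
EMAIL_RE = re.compile(r"^([a-zA-Z0-9_\-\.]+)@([a-zA-Z0-9_\-\.]+)\.([a-zA-Z]{2,5})$")

def parse_email(text):
    """Return (local part, domain, top-level domain) if text matches, else None."""
    match = EMAIL_RE.match(text)
    return match.groups() if match else None

# Hypothetical sample inputs.
print(parse_email("user_name.01@example.com"))  # ('user_name.01', 'example', 'com')
print(parse_email("user_name.01@example"))      # None (no dot and ending)
print(parse_email("not an email"))              # None
```

Each pair of parentheses in the pattern becomes one entry of the returned tuple, which is what "captured" means in the token table.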

    Types of Tokens

    Many tokens can be used in various combinations within a Regex to describe a wide variety of expressions. Below we are going to take a look at the various types of tokens that are used in regular expressions. Furthermore, we are also going to look at the most commonly used tokens in each category.

    Basic Tokens

    Let’s start with the basic tokens. These tokens are used with almost every regular expression. Hence, we must learn about them first.

    Token

    Meaning

    \r

    This matches the carriage return character.

    \0

    It matches the null character.

    \n

    This matches a newline character.

    \t

    This matches a tab character.

    Character classes

    Moving on, let’s look at the character tokens. They are used to match letters, digits, and other special characters.

    Token

    Meaning

    a

    This matches the character a literally. Similarly, every letter and digit, when used on its own, matches that specific character.

    abc

    It matches the string abc.

    [abc]

    This looks for a single character among a, b or c.

    [^abc]

    This matches any character except a or b or c.

    [a-z]

    A lowercase character in the range from a to z

    [^a-z]

    Any character not in the range from a to z. This includes uppercase characters as well.

    [A-Z]

    An uppercase character between A and Z.

    [^A-Z]

    A character not between A and Z.

    [0-9]

    Any digit in the range 0 to 9

    [^0-9]

    A character not in the range 0 to 9

    [a-zA-Z0-9]

    This matches a single character that is a lowercase letter from a to z, an uppercase letter from A to Z, or a digit from 0 to 9

    [^a-zA-Z0-9]

    Any character that doesn’t fall in the previous category.

    .

    Any single character (by default, anything except a line break)

    \s

    This is used to look for any whitespace character.

    \S

    This is used to look for any non-whitespace character.

    \d

    This matches for any digit

    \D

    This matches for any non-digit

    \w

    It matches any word character (a letter, a digit, or an underscore)

    \W

    It matches any non-word character

    $

    This denotes the end of a string

    \b

    This matches a word boundary

    \B

    This is used to match a non-word boundary
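A few of these classes can be tried out with Python's re module. This is only a sketch, and the sample text is made up for the purpose.

```python
import re

text = "Room 42, Floor 7!"  # hypothetical sample text

digits = re.findall(r"\d", text)                # every single digit
words = re.findall(r"\w+", text)                # runs of word characters
non_alnum = re.findall(r"[^a-zA-Z0-9]", text)   # everything that is not a letter or digit
whole_word = re.search(r"\bFloor\b", text) is not None         # "Floor" as a whole word
inside_word = re.search(r"\bFloor\b", "Floors 1-3") is not None  # no boundary after "Floor"

print(digits)      # ['4', '2', '7']
print(words)       # ['Room', '42', 'Floor', '7']
print(non_alnum)   # [' ', ',', ' ', ' ', '!']
print(whole_word, inside_word)  # True False
```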

    Quantifiers

    This special class of tokens is used to match the number of consecutive occurrences of a character or a string or a number. They are used in conjunction with the other tokens.

    Let’s look at a few common quantifiers.

    Tokens

    Meaning

    a?

    This matches for zero or one occurrence of a.

    a*

    This matches zero or more consecutive occurrences of a.

    a+

    This matches one or more consecutive occurrences of a.

    a{5}

    This matches exactly five consecutive occurrences of the letter a.

    a{5,}

    This matches five or more consecutive occurrences of a.

    a{5,7}

    This matches between five and seven consecutive a’s, inclusive. Most engines do not allow a space inside the braces.
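The quantifiers can be checked quickly in Python. In this sketch the helper full is a hypothetical convenience wrapper around re.fullmatch, which succeeds only if the whole string fits the pattern. The braces are written without spaces because Python's re module treats a brace expression containing a space as literal text rather than as a quantifier.

```python
import re

def full(pattern, text):
    """True if the entire text matches the pattern."""
    return re.fullmatch(pattern, text) is not None

results = {
    "a? on ''": full(r"a?", ""),               # zero occurrences is fine
    "a? on 'aa'": full(r"a?", "aa"),           # two is too many
    "a* on ''": full(r"a*", ""),               # zero or more
    "a+ on ''": full(r"a+", ""),               # needs at least one
    "a{5} on 5": full(r"a{5}", "a" * 5),       # exactly five
    "a{5} on 4": full(r"a{5}", "a" * 4),
    "a{5,} on 9": full(r"a{5,}", "a" * 9),     # five or more
    "a{5,7} on 6": full(r"a{5,7}", "a" * 6),   # between five and seven
    "a{5,7} on 8": full(r"a{5,7}", "a" * 8),
}

for label, ok in results.items():
    print(label, ok)
```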

    Groups

    These tokens group parts of a pattern together, as the name suggests.

    Tokens

    Meaning

    (…)

    This captures everything enclosed within the parentheses.

    (a|b)

    This matches either a or b.

    (?:…)

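    Groups everything enclosed within the parentheses without capturing it (a non-capturing group).

    (?(1)yes|no)

    This is a conditional: it matches the pattern yes if group 1 took part in the match, and the pattern no otherwise.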
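The four group forms behave quite differently, which a short Python sketch can make visible. The sample strings and the small tag-like conditional pattern are assumptions chosen for illustration; conditional groups are not available in every engine, but Python's re module supports them.

```python
import re

# Capturing groups: each parenthesised part is returned separately.
captured = re.match(r"(\w+)-(\d+)", "item-42").groups()

# Alternation inside a non-capturing group: findall returns whole matches.
pets = re.findall(r"(?:cat|dog)s?", "cats and dog")

# Non-capturing group used only to repeat "ab"; just (c) is captured.
only_c = re.match(r"(?:ab)+(c)", "ababc").groups()

# Conditional: if group 1 (an opening "<") matched, a closing ">" is required.
cond = re.compile(r"^(<)?\w+(?(1)>)$")
cond_results = [cond.match(s) is not None for s in ("<tag>", "tag", "<tag", "tag>")]

print(captured)      # ('item', '42')
print(pets)          # ['cats', 'dog']
print(only_c)        # ('c',)
print(cond_results)  # [True, True, False, False]
```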

    Flags

    These are special instructions given to the pattern matcher engine while searching for a match.

    Tokens

    Meaning

    g

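    Global match. The engine does not stop at the first match; it keeps searching until it reaches the end of the given string or group of strings.

    m

    Multiline match, i.e., ^ and $ match at the start and end of each line rather than only at the start and end of the whole string.

    x

    Tells the engine to ignore whitespace in the pattern, so that a long pattern can be spread out and commented.

    X

    This is used for extended matching.

    s

    Single-line mode: the dot also matches line breaks, so the input is treated as a single line.

    i

    This is used for case-insensitive matching.

    u

    Enables matching of Unicode characters.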
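How flags are written depends on the engine. As a sketch, here is how some of them map onto Python's re module, where they are passed as constants instead of single letters. There is no g flag in Python; calling re.findall plays that role. The sample text is hypothetical.

```python
import re

text = "Hello\nhello world"  # hypothetical two-line sample

# i -> re.IGNORECASE; findall returns every match, like the g flag.
any_case = re.findall(r"hello", text, re.IGNORECASE)

# m -> re.MULTILINE: ^ also matches right after each line break.
line_starts_default = re.findall(r"^hello", text)
line_starts_multi = re.findall(r"^hello", text, re.MULTILINE)

# s -> re.DOTALL: the dot is allowed to match the line break.
dot_default = re.search(r"Hello.hello", text) is not None
dot_all = re.search(r"Hello.hello", text, re.DOTALL) is not None

# x -> re.VERBOSE: whitespace and # comments in the pattern are ignored.
phone_like = re.compile(r"""
    \d{3}   # three digits
    -       # a literal hyphen
    \d{4}   # four digits
""", re.VERBOSE)
verbose_hit = phone_like.search("call 555-1234").group()

print(any_case, line_starts_default, line_starts_multi)  # ['Hello', 'hello'] [] ['hello']
print(dot_default, dot_all, verbose_hit)                 # False True 555-1234
```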

    Anchors

    Anchors match a position in the string rather than a character.

    Tokens

    Meaning

    ^

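    This denotes the start of a string (or of each line when the multiline flag is set)

    \A

    This also denotes the start of a string, regardless of the multiline flag

    \Z

    The token for the end of a string; in several engines it also matches just before a final line break.

    \z

    The token for the absolute end of a string.

    \G

    This denotes the position where the previous match ended, or the start of the string for the first match.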
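Anchor support varies between engines, so the following is only a sketch of how some of them behave in Python's re module. Python does not support \G, and its \Z matches only at the very end of the string, which corresponds to \z in PCRE. The sample text is made up.

```python
import re

text = "first line\nsecond line\n"  # hypothetical sample ending in a line break

# With the multiline flag, ^ matches at the start of every line ...
starts_multiline = re.findall(r"^\w+", text, re.MULTILINE)
# ... while \A still matches only at the start of the whole string.
starts_absolute = re.findall(r"\A\w+", text, re.MULTILINE)

# Without flags, $ matches at the end or just before a final line break.
ends_dollar = re.findall(r"line$", text)
# Python's \Z matches only at the absolute end, which here comes after "\n".
ends_absolute = re.findall(r"line\Z", text)

print(starts_multiline)  # ['first', 'second']
print(starts_absolute)   # ['first']
print(ends_dollar)       # ['line']
print(ends_absolute)     # []
```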

    Commonly used regular expressions

    Regular expressions are widely used over the Internet. From form validations to looking up data containing a particular keyword or keywords, regular expressions are almost inseparable from modern-day computing applications.

    Let’s look at some familiar examples of the use of regular expressions.

    Matching a phone number

    Let’s see what the pattern of a phone number used in India looks like. The country code comes first. It usually consists of a “+” character followed by the number 91, which is the country code for India. Indian mobile numbers generally start with 6, 7, 8, or 9, followed by 9 other digits.

    So a regex for an Indian cell phone number could be written as given below.

    ^(\+91[\-\s]?)?[0]?(91)?[6-9]\d{9}$
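A quick way to get a feel for this pattern is to run it against a few inputs. The Python sketch below does that; the function name is_indian_mobile and the sample numbers are hypothetical.

```python
import re

# The phone pattern from this article.
PHONE_RE = re.compile(r"^(\+91[\-\s]?)?[0]?(91)?[6-9]\d{9}$")

def is_indian_mobile(text):
    """True if text fits the pattern above."""
    return PHONE_RE.match(text) is not None

samples = [
    "+91 9876543210",   # country code, space, ten digits
    "+91-9876543210",   # country code, hyphen, ten digits
    "09876543210",      # leading zero
    "9876543210",       # bare ten-digit number
    "5876543210",       # starts with 5, so it is rejected
    "987654321",        # only nine digits, so it is rejected
]
phone_results = [is_indian_mobile(s) for s in samples]
print(phone_results)  # [True, True, True, True, False, False]
```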

    Testing the strength of passwords

    Most websites recommend that we choose a strong password that contains a combination of numbers, uppercase and lowercase characters, and symbols. There is also usually a minimum length, such as 6 or 8 characters. This is done so that the password becomes very hard to crack.

    A password can be validated against rules like these using a regular expression. The one below requires at least 6 characters that mix at least two of these three classes: lowercase letters, uppercase letters, and digits.

    ^(((?=.*[a-z])(?=.*[A-Z]))|((?=.*[a-z])(?=.*[0-9]))|((?=.*[A-Z])(?=.*[0-9])))(?=.{6,})
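The pattern consists only of lookaheads, so it checks conditions without consuming any text. As written, it accepts a password of six or more characters that combines at least two of lowercase letters, uppercase letters, and digits; it does not check for symbols. A minimal Python sketch, with a hypothetical helper name and made-up sample passwords:

```python
import re

# The password pattern from this article (lookaheads only).
STRENGTH_RE = re.compile(
    r"^(((?=.*[a-z])(?=.*[A-Z]))|((?=.*[a-z])(?=.*[0-9]))|((?=.*[A-Z])(?=.*[0-9])))(?=.{6,})"
)

def is_medium_strength(password):
    """True if the password satisfies the lookahead conditions above."""
    return STRENGTH_RE.match(password) is not None

samples = ["abcdef", "abcDEF", "abc123", "ABC123", "aB1", "Passw0rd!"]
password_results = [is_medium_strength(p) for p in samples]
print(password_results)  # [False, True, True, True, False, True]
```

"abcdef" fails because it uses a single character class, and "aB1" fails because it is shorter than six characters.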

    URL Matching

    URLs are the most common way to use the internet and quickly visit the webpage we want. Almost every website has a URL, and URLs follow a standardized pattern. A web URL starts with the HTTP or HTTPS protocol, followed by “://” and often “www”. Then comes the name of the website, followed by .com, .net, .org, etc.

    To test the validity of a URL we can use a regex like the one given below.

    https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&//=]*)
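Here is a small Python sketch of this URL check, with the first character class written out as [-a-zA-Z0-9@:%._\+~#=]. The helper name looks_like_url and the sample URLs are assumptions for illustration. Since the pattern has no ^ and $ anchors, the sketch uses re.fullmatch to require that the whole string is a URL.

```python
import re

# The URL pattern from this article, as a Python raw string.
URL_RE = re.compile(
    r"https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}"
    r"\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&//=]*)"
)

def looks_like_url(text):
    """True if the entire text fits the URL pattern."""
    return URL_RE.fullmatch(text) is not None

samples = [
    "https://www.example.com/path?x=1",
    "http://example.org",
    "ftp://example.com",   # wrong protocol
    "http://localhost",    # no dot followed by a suffix
]
url_results = [looks_like_url(s) for s in samples]
print(url_results)  # [True, True, False, False]
```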

    Date and Time formats

    Date and time formats are also very commonly used across the web. There are many date formats used by a variety of applications, software, and systems. Dates should always be written in a format that makes them usable for the user or the application that is trying to read them.

    A date in the format MM/dd/yyyy can be validated by using a regular expression like the one given below.

    ^(1[0-2]|0[1-9])/(3[01]|[12][0-9]|0[1-9])/[0-9]{4}$
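The pattern above expects a 01 to 12 field first, a 01 to 31 field second, and a four-digit year, separated by slashes. It checks the shape of the text only, so an impossible date such as 02/30/2023 still passes. One possible way to close that gap, sketched here in Python with hypothetical helper names, is to follow the regex check with datetime.strptime.

```python
import re
from datetime import datetime

# The date pattern from this article.
DATE_RE = re.compile(r"^(1[0-2]|0[1-9])/(3[01]|[12][0-9]|0[1-9])/[0-9]{4}$")

def looks_like_date(text):
    """Shape check only: two-digit field, two-digit field, four-digit year."""
    return DATE_RE.match(text) is not None

def is_real_date(text):
    """Shape check plus a calendar check via the standard library."""
    if not looks_like_date(text):
        return False
    try:
        datetime.strptime(text, "%m/%d/%Y")
    except ValueError:
        return False
    return True

print(looks_like_date("12/31/2023"))  # True
print(looks_like_date("31/12/2023"))  # False (first field must be 01 to 12)
print(looks_like_date("02/30/2023"))  # True (the regex cannot know February)
print(is_real_date("02/30/2023"))     # False
```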

    Now, let’s explore some of the online RegEx tools which can be handy to build and troubleshoot.

    If you want to learn more about regular expressions, their examples, and advanced usages, here is a list of websites that you can always refer to:

    Regex101

    Regex101 is an excellent reference guide and an interactive tool for creating your regular expressions. It can help you get started with regex very quickly.

    Using this, we can test RegEx for the languages below.

    • PCRE (PHP)
    • ECMAScript (JavaScript)
    • Python
    • GoLang

    It supports RegEx functionality such as matching, substitution, and unit tests. Apart from this, one can save previously tested RegEx.

    FreeFormatter

    FreeFormatter is JavaScript-based and uses the XRegExp library for enhanced features. It facilitates testing a RegEx against a match as well as replacing a match. It supports the flags below, which can be used depending on the requirement while testing a RegEx:

    • i – Case-insensitive
    • m – Multiline
    • g – Global (don’t stop at the first match)
    • s – Dot matches all INCLUDING line breaks (XRegExp only).

    Regex Crossword

    If Regex and puzzles interest you, this is the site to go to. It has a series of fun and interactive puzzles. They will definitely help you learn more about regular expressions.

    • Optimized for phones and solving RegEx puzzles on the go.
    • A step by step tutorial, teaching you the different symbols and RegEx patterns.
    • Bend your mind around cubistic 2D palindrome RegEx puzzles.
    • Wide range of RegEx puzzles with difficulties from beginner to expert.

    RegExr

    RegExr is a website for getting your hands dirty with Regex. You can write regex, match patterns, and have all the fun with this Codepen equivalent for Regular Expressions.

    Features

    • Supports JavaScript & PHP/PCRE RegEx.
    • Results update in real-time as you type.
    • Roll over a match or expression for details.
    • Validate patterns with suites of Tests.
    • Save & share expressions with others.
    • Full RegEx Reference with help & examples.

    Pythex

    It is a Python-based regular expression tester. Pythex is a quick way to test your Python regular expressions. It comes with four flags, namely:

    • Ignore Case
    • Multiline
    • DotAll
    • Verbose

    Rubular

    Rubular is a Ruby-based regular expression editor. It uses Ruby version 2.5.7 onwards.

    Debuggex

    It is JavaScript-based and supports RegEx for Python and Perl Compatible Regular Expressions (PCRE). Using this online tool, we can embed our RegEx on StackOverflow. It also lets us share a RegEx result by creating a unique link for each RegEx test.

    ExtendsClass

    ExtendsClass is a toolbox for developers. It provides RegEx testing support for the below languages.

    • JavaScript
    • Python (3.4)
    • Ruby (2.1)
    • Java (JDK 14)
    • PHP (7)

    RegEx Tester

    This free regular expression tester lets you test your regular expressions against any entry of your choice and clearly highlights all matches. Using this, we can save the old tested RegEx for future reference. Moreover, it supports JavaScript and PCRE RegEx.

    Web ToolKit

    Web Toolkit contains a set of utility tools, and the RegEx tester is one of them. We can input our RegEx here and test it against a value. It also provides a facility for replacing, matching, and copying the expressions. Apart from this, it provides a toggle to perform a case-sensitive and global match.

    Conclusion

    We learned about regular expressions, a few common examples, and some of the online testing tools. With this knowledge, we can create our own regular expressions and use them in our applications.