Lab – Regular Expressions

Objectives

After completing this lab, you should be able to:

  • Use an interactive tool to develop regular expression patterns for validation
  • Use the .NET Framework’s Regex class to validate user input strings
  • Discover guidance at the Microsoft Patterns & Practices website

Overview

In this lab, you’ll use an interactive regular expression parser to experiment with regular expression patterns and develop a pattern for use in your code. You’ll then implement that pattern by using the System.Text.RegularExpressions.Regex class in the .NET Framework. Along the way you’ll be introduced to a HOWTO at the Microsoft Patterns & Practices website that not only shows how to use the Regex class, but also includes several patterns you can use in your own code.

Scenario

In this scenario, you’re building a library of regular expression validation logic that will be reused in many applications.

Setup

Open the before\RegexLab.sln solution. This includes a skeletal application where you will do your work. There is also a fully completed lab in the after\ directory that you can use to compare with your work if you get stuck.

Part 1 - Interactively Developing Regular Expressions

1. The lab solution contains two projects.

  1. RegexBench is a Windows Forms application that’s been provided for you to help you interactively develop regular expressions for validating input.
  2. RegexPractice is where you’ll be working.

2. The first step to building regex validation is to determine the regular expression patterns that you’ll be using. One easy way to do this is with a program like RegexBench. Build and run RegexBench, which consists of a single form:

  1. The top input area is where you type in a regular expression (a pattern) that you’d like to test.
  2. In the upper right are all the options you can specify when you create a Regex object in .NET. These control whether the match will be case sensitive, and so on. You won’t use many of these during the lab, but the entire set of flags is there for you to experiment with. Hover your mouse over any of these checkboxes for the flag’s documentation.
  3. The input area in the middle of the form is where you type test input data to see if there’s a match or not.
  4. Below the test input area is where the result is shown. The result will either be a match that shows all captured groups, “No match”, or the details of any exception that was thrown (typically because there’s a syntax error in your pattern). The surrounding background of the result area will change color to help cue you to when you have a match. Green is a match, yellow is no match, and red indicates an exception.
  5. At the bottom of the form is a cheat sheet that shows things that the author of this lab gets tired of looking up all the time. Hover your mouse over anything on the cheat sheet for a quick description of the item.
  6. To the right of the cheat sheet is a checkbox called Automatch. When that box is checked, the pattern is matched against the input text at each keystroke in either input field, or whenever the RegexOptions are changed. If you turn off this feature, you’ll have to press the Match! button to try the match.

3. Start by building a regular expression that validates a simple U.S. zip code. There are a few different ways that zip codes can be represented, but start with the simplest, a five-digit number.

  1. Type the following pattern into RegexBench (I’ve highlighted each pattern in this lab manual with marching ants to make it clear where it begins and ends): \d{5}
  2. In the input data area, enter a valid zip code such as 12345. It should match.
  3. Now type in an invalid zip code like 123456. It should also match.

Note: This is a very common mistake – you need to include anchors that mark the start and end of the string; otherwise, any string containing five consecutive digits will be accepted.

  4. Here’s the corrected version: ^\d{5}$
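
The difference the anchors make can also be checked in code. The following is a minimal sketch, not part of the lab solution; the class and method names (AnchorDemo, ContainsFiveDigits, IsFiveDigitZip) are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class; not part of the RegexLab solution.
public static class AnchorDemo
{
    // Unanchored: succeeds if five consecutive digits appear anywhere in the input.
    public static bool ContainsFiveDigits(string input)
    {
        return Regex.IsMatch(input, @"\d{5}");
    }

    // Anchored: the five digits must span the whole input.
    // (In .NET, $ also tolerates a single trailing newline.)
    public static bool IsFiveDigitZip(string input)
    {
        return Regex.IsMatch(input, @"^\d{5}$");
    }

    public static void Main()
    {
        foreach (string input in new[] { "12345", "123456", "abc12345xyz" })
        {
            Console.WriteLine("{0,-12} unanchored={1,-5} anchored={2}",
                input, ContainsFiveDigits(input), IsFiveDigitZip(input));
        }
    }
}
```

Only the first input should pass the anchored check, while all three pass the unanchored one.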

4. Now write a completely new pattern that matches a 9-digit zip code like this one: 12345-1234.

  1. Don’t forget the anchors.
  2. Test your result. I’m going to include a few endnotes[1] to hide the answers.

Note: If you get stuck, just hover your mouse over the endnote reference to see the answer. Be sure to experiment a bit before giving up – this is the best way to learn!

5. Now put the two patterns together with alternation to get a pattern that supports both formats[2].

  1. Test your pattern with various inputs.
  2. Simplify your pattern by using an optional group, factoring out the \d{5}, which is always present in both formats.[3]
  3. Test your new pattern.
  4. As a final step, tweak your pattern so that it supports a nine-digit zip code that runs together: 123456789. Hint: you only need to make the hyphen optional![4]
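
To see how alternation and optional groups behave without giving away the zip code answers, here is a small sketch that uses an unrelated toy pattern. Everything in it (the AlternationDemo class and the cat/dog patterns) is invented for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class using toy patterns; not part of the RegexLab solution.
public static class AlternationDemo
{
    // Without grouping, each anchor binds to only one alternative:
    // "starts with cat" OR "ends with dog".
    public const string Loose = @"^cat|dog$";

    // With grouping, both anchors apply to every alternative.
    public const string Grouped = @"^(cat|catfish)$";

    // Same language as Grouped, with the common prefix factored out
    // and the remainder made optional.
    public const string Factored = @"^cat(fish)?$";

    public static void Main()
    {
        string[] inputs = { "cat", "catfish", "catalog", "hotdog" };
        foreach (string input in inputs)
        {
            Console.WriteLine("{0,-8} loose={1,-5} grouped={2,-5} factored={3}",
                input,
                Regex.IsMatch(input, Loose),
                Regex.IsMatch(input, Grouped),
                Regex.IsMatch(input, Factored));
        }
    }
}
```

The loose pattern accepts all four inputs, including "catalog" and "hotdog", while the grouped and factored patterns accept only "cat" and "catfish".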

6. Document your pattern.

  1. At this point your pattern might be getting a bit hard to read. You’re going to fix that by commenting it.
  2. Check the IgnorePatternWhitespace option.
  3. Add some whitespace and comments to make your pattern easier to understand.
  4. This will help you and your colleagues when you revisit your code a year from now!
  5. Here’s what the result might look like – note that it still matches even as you’re adding whitespace and comments!
  6. Leave RegexBench running – you’re going to need to copy your pattern over into your code in the next part of the lab.
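
The same whitespace-and-comments technique works from code when you pass RegexOptions.IgnorePatternWhitespace. The sketch below uses a made-up 24-hour time pattern (HH:MM) rather than the zip code pattern, and the CommentedPatternDemo class is hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class; the time pattern is an illustration only.
public static class CommentedPatternDemo
{
    // Compact form: hard to read at a glance.
    public const string Compact = @"^([01]\d|2[0-3]):[0-5]\d$";

    // Same pattern, spread out and commented. It only works when
    // RegexOptions.IgnorePatternWhitespace is supplied.
    public const string Commented = @"
        ^            # nothing before this
        (
          [01]\d     # hours 00 through 19
          |          # or
          2[0-3]     # hours 20 through 23
        )
        :            # literal colon
        [0-5]\d      # minutes 00 through 59
        $            # nothing after this";

    public static bool IsTime(string input)
    {
        return Regex.IsMatch(input, Commented, RegexOptions.IgnorePatternWhitespace);
    }

    public static void Main()
    {
        foreach (string input in new[] { "09:30", "23:59", "24:00", "9:30" })
        {
            Console.WriteLine("{0,-6} compact={1,-5} commented={2}",
                input, Regex.IsMatch(input, Compact), IsTime(input));
        }
    }
}
```

Keep in mind that with this option an unescaped space or # in the pattern no longer matches itself, so a pattern that needs a literal space or # has to escape it or put it in a character class.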

Part 2 - Writing Code to Use the .NET Regex Class

1. Examine the RegexPractice project.

  1. From Solution Explorer, open the RegexPractice.cs file in the RegexPractice project.
  2. The Validate class at the top of the source file is where you’ll be working. This class can be used anywhere you need to validate data, for example, in an ASP.NET web application.
  3. Below the Validate class is a Main function (RegexPractice is a console application). You’ll use Main to run a suite of tests against your Validate class to ensure it works the way you expect it to.

Note: In a production environment, this code would go into whatever unit testing framework you are using, such as that built into Visual Studio Team System.

2. Copy the pattern you just built into your code.

  1. Switch to RegexBench for a second and copy the regular expression to the clipboard, comments and all.

  2. Now switch back to RegexPractice.cs and paste the pattern into the body of the USZipCode method, surrounding it with an “@”-style literal string. You should end up with something like this:

public static void USZipCode(string userInput) {
    string pattern = @"
        ^       # nothing before this!
        \d{5}   # main zip code body
        (       # optional zip+4 extension
          -?    # may or may not have a hyphen
          \d{4} # here's the +4 part
        )?$     # nothing after this!";
}

Note: It may seem strange to have a single string literal span multiple lines in your code, but that’s one really useful reason for having the “@” literal string syntax in C#. The code above is perfectly good C#, and it’s wise to use this format for documenting regular expression patterns in your code.
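
To make the effect of the “@” prefix concrete, here is a short sketch comparing a verbatim literal with the equivalent regular literal. The VerbatimDemo class is invented for this illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class; not part of the RegexLab solution.
public static class VerbatimDemo
{
    // Verbatim literal: backslashes are passed through untouched.
    public const string Verbatim = @"^\d{5}$";

    // Regular literal: every backslash meant for the regex engine must be doubled.
    public const string Escaped = "^\\d{5}$";

    public static void Main()
    {
        // Both literals produce the same seven-character string.
        Console.WriteLine(Verbatim == Escaped);             // prints True
        Console.WriteLine(Regex.IsMatch("12345", Verbatim)); // prints True
    }
}
```

The two forms are interchangeable at run time; the verbatim form is simply easier to compare against what you typed into RegexBench.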

3. Call the static Regex.IsMatch method to check if the user input matches your pattern.

  1. Add a using statement to the top of the source file for System.Text.RegularExpressions.
  2. Call Regex.IsMatch, passing userInput, pattern, and the correct RegexOptions flags as the third argument (to remind yourself which flags you used, you may need to flip back to RegexBench and see which RegexOptions you chose).
  3. If the match fails, IsMatch returns false. In that case throw a ValidationException (the class is defined right below the Validate class). The exception’s message should be something like, “Malformed zip code”.
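
Put together, the steps above might look something like the following sketch. It is one possible shape, not necessarily what the after\ directory contains. The ValidationException class here is a stand-in so the example compiles on its own (the lab project already defines the real one), and the Program class is made up for the demonstration.

```csharp
using System;
using System.Text.RegularExpressions;

// Stand-in for the lab's ValidationException, which RegexPractice already defines.
public class ValidationException : Exception
{
    public ValidationException(string message) : base(message) { }
}

public static class Validate
{
    public static void USZipCode(string userInput)
    {
        string pattern = @"
            ^       # nothing before this!
            \d{5}   # main zip code body
            (       # optional zip+4 extension
              -?    # may or may not have a hyphen
              \d{4} # here's the +4 part
            )?$     # nothing after this!";

        // The commented pattern only works with IgnorePatternWhitespace.
        if (!Regex.IsMatch(userInput, pattern, RegexOptions.IgnorePatternWhitespace))
        {
            throw new ValidationException("Malformed zip code");
        }
    }
}

// Hypothetical driver, only here to show the method in use.
public static class Program
{
    public static void Main()
    {
        foreach (string input in new[] { "12345", "12345-1234", "1234" })
        {
            try
            {
                Validate.USZipCode(input);
                Console.WriteLine("{0}: ok", input);
            }
            catch (ValidationException ex)
            {
                Console.WriteLine("{0}: {1}", input, ex.Message);
            }
        }
    }
}
```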

4. Note that in Main there are already a couple of tests for your USZipCode method.

  1. Test.Good succeeds if the user input provided results in a match.
  2. Test.Bad succeeds if the user input causes your method to throw a ValidationException. As with any good unit test, you should always test both good and bad cases.
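
If you’re curious how helpers like these could work, here is a rough sketch. It is a guess at the general idea, not the lab’s actual Test class: it assumes the method name is looked up on Validate by reflection, and the Rejects helper, the stand-in ValidationException, and the simplified USZipCode are all invented so the example is self-contained.

```csharp
using System;
using System.Reflection;
using System.Text.RegularExpressions;

// Stand-in for the lab's ValidationException.
public class ValidationException : Exception
{
    public ValidationException(string message) : base(message) { }
}

// Simplified stand-in for the lab's Validate class.
public static class Validate
{
    public static void USZipCode(string userInput)
    {
        if (!Regex.IsMatch(userInput, @"^\d{5}(-?\d{4})?$"))
            throw new ValidationException("Malformed zip code");
    }
}

// Hypothetical sketch of the test helpers (assumes lookup by reflection).
public static class Test
{
    // Calls Validate.<methodName>(input); returns true if it threw ValidationException.
    public static bool Rejects(string input, string methodName)
    {
        MethodInfo method = typeof(Validate).GetMethod(methodName, new[] { typeof(string) });
        if (method == null)
            throw new ArgumentException("No such validation method: " + methodName);
        try
        {
            method.Invoke(null, new object[] { input });
            return false;
        }
        catch (TargetInvocationException ex)
        {
            // Reflection wraps the original exception in TargetInvocationException.
            if (ex.InnerException is ValidationException) return true;
            throw;
        }
    }

    // Complains only when valid input is rejected.
    public static void Good(string input, string methodName)
    {
        if (Rejects(input, methodName))
            Console.WriteLine("FAILED: {0} rejected valid input \"{1}\"", methodName, input);
    }

    // Complains only when invalid input is accepted.
    public static void Bad(string input, string methodName)
    {
        if (!Rejects(input, methodName))
            Console.WriteLine("FAILED: {0} accepted invalid input \"{1}\"", methodName, input);
    }
}

public static class Program
{
    public static void Main()
    {
        string testMethod = "USZipCode";
        Test.Good("12345", testMethod);
        Test.Bad("1234", testMethod);
    }
}
```

As in the lab, a run with no output means every test passed.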

5. Build and run the RegexPractice application.

  1. Running it outside the debugger with Ctrl+F5 will ensure that the console stays open until you press a key so that you can see the output.
  2. If you don’t see any output, you’ve done a good job – the Test methods will only complain if they fail.
  3. Add a few more tests to Main after the ones you’ve already got:

Test.Good("12345-1234", testMethod);
Test.Good("123451234", testMethod);
Test.Bad("ABCDE", testMethod);
Test.Bad("", testMethod);

  4. Build and run to make sure these new tests succeed as well.

6. Add the following code to test the other methods of your Validate class (consider this a bit of test-driven development – you’ve not yet implemented the other methods, but that’s OK):

testMethod = "USPhoneNumber";
Test.Good("4255550123", testMethod);
Test.Good("(425) 555-0123", testMethod);
Test.Good("425-555-0123", testMethod);
Test.Good("425 555 0123", testMethod);
Test.Good("1-425-555-0123", testMethod);
Test.Bad("425555012", testMethod);
Test.Bad("555-0123", testMethod);
Test.Bad("", testMethod);

testMethod = "Currency";
Test.Good("1.00", testMethod);
Test.Good("3343", testMethod);
Test.Bad("1234.567", testMethod);
Test.Bad("", testMethod);

testMethod = "EmailAddress";
Test.Good("", testMethod);
Test.Good("", testMethod);
Test.Bad("alice@bar.c", testMethod);
Test.Bad("", testMethod);
Test.Bad("'drop table customers--", testMethod);

7. Build and run your application. You should have many errors, but you’ll fix them in the next step.

8. Implement the other validation methods in a similar fashion.

  1. Make use of the resources at Microsoft Patterns & Practices to help you look up commonly used patterns.
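  2. Surf to the “How To: Use Regular Expressions to Constrain Input in ASP.NET” article (listed under Resources) and scroll down a bit to find some well-known patterns that you can use.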
  3. If you have extra time, you can explore these patterns using RegexBench – use the whitespace feature to break them apart and comment them. But for now, just paste them into your code as single-line strings. Don’t forget to use the ‘@’ sign before each string literal, because regex patterns often have backslashes that would normally be meaningful in a C# string.
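
As an illustration of a single-line pattern in an “@” literal, here is a sketch of a currency check. The pattern is an assumption written to satisfy the Currency tests in this lab; it is not taken from the Patterns & Practices article, and the CurrencySketch class and IsCurrency method are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical sketch; the pattern is an assumption, not the published one.
public static class CurrencySketch
{
    // One or more digits, optionally followed by a decimal point
    // and exactly two more digits.
    public const string Pattern = @"^\d+(\.\d{2})?$";

    public static bool IsCurrency(string input)
    {
        return Regex.IsMatch(input, Pattern);
    }

    public static void Main()
    {
        foreach (string input in new[] { "1.00", "3343", "1234.567", "" })
        {
            Console.WriteLine("\"{0}\" -> {1}", input, IsCurrency(input));
        }
    }
}
```

A production pattern would likely need to consider things this sketch ignores, such as currency symbols, thousands separators, and negative amounts.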

9. Build and test. You should now have a fully functional data validation class!

Conclusion

Congratulations, you’ve taken a big step toward writing more secure code. By validating user input with a regular expression, you’ve erected another barrier against an attacker who might try to compromise your application by submitting malformed data. Keep in mind that the quality of the defense is in direct proportion to the quality of the patterns you use, so consider how the data will be used and test your patterns accordingly!

RegexBench is yours to use as you like. Feel free to use it to develop your own patterns after you’re done with this lab. You should also take a look at some of the more fully featured products out there, such as The Regulator.

Resources

  • Patterns & Practices Security Guidance
  • How To: Use Regular Expressions to Constrain Input in ASP.NET
  • Writing Secure Code, 2nd Edition, Howard & LeBlanc
  • Mastering Regular Expressions, Friedl

[1] ^\d{5}-\d{4}$

[2] ^(\d{5}|\d{5}-\d{4})$ (Note that the outer parentheses are important; otherwise you’ll end up saying, “At the beginning of the string, find five consecutive digits, or (second alternative goes here).” The first alternative would then match 123456789!)

[3] ^\d{5}(-\d{4})?$

[4] ^\d{5}(-?\d{4})?$