Friday, January 27, 2012

Regular Expressions in Dart


Up tonight, I am going to explore regular expressions in Dart. As an old-time Perl hacker, I love regular expressions. Even though it has been years since I last coded Perl, I still think of its regular expressions as the best implementation that I have ever used. I do not care for the Ruby implementation--I never got the hang of the API and still have to refer to the documentation when using them. Javascript is better, but I forget whether match() or test() is a string method or a regular expression method.

Unfortunately for me, the Dart implementation looks to be closer to Ruby's than Javascript's. But I will give it the benefit of the doubt.

Last night, I was working with simple greetings (e.g. "Howdy, Bob!"). In the end, I stuck with string splitting in order to identify the two parts:
  set greeting(greeting) {
    var parts = greeting.split(',');

    if (parts.length == 2) {
      _greeting_opening = parts[0];
      name = parts[1].replaceAll(' ', '').replaceAll('!', '');
    }
    else {
      _full_greeting = greeting;
    }
  }
That is a little noisy (because it does not use regular expressions!), but fairly straightforward. I take the greeting (e.g. "Howdy, Bob!") and split on commas. In the default case, I am left with two parts: "Howdy" and " Bob!". This falls into the first conditional in which I set the _greeting_opening and name to the two parts. The second part, in particular, is painful since I have to strip spaces and punctuation.

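As noisy as this solution is, it does not cover even simple deviations from the default case. When I attempted to set a greeting of "Yo, Yo, Yo, Alice!", my scheme was thwarted: splitting on commas yields four parts rather than two, so the whole string falls through to the else branch.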

The regular expression that ought to cover both cases is: /^(.+)\s*,\s*([\w\s]+)\W?$/. The two pairs of parentheses indicate the portions of the string that I want to extract. The greeting opening is all characters starting at the beginning of the line (^) all the way up to the last comma, ignoring any space around the comma (\s*,\s*). The period in (.+) matches any character and the plus indicates one or more of them. Because the plus is greedy, it grabs as much as it can, which is why the opening runs up to the last comma rather than the first.

The second part of the string in which I am interested starts after the comma and any spaces (\s*,\s*). Here I try to match one or more word characters or spaces: ([\w\s]+). I match any number of those characters until the end of the string ($) with the exception of a possible non-word character at the end of the line (\W?).

So let's see how this works in Dart. Regular expressions in Dart are built with the RegExp class. I try this against my default case:
main() {
  var matcher = new RegExp('^(.+)\s*,\s*([\w\s]+)\W?$');

  matcher.allMatches("Howdy, Bob!").forEach((m) {
    print(m[0]);
  });
}
I am not quite sure about the allMatches() usage--I am just trying that as my first attempt. It hardly matters because I am greeted by:
$ dart regexp.dart
'/home/cstrom/repos/dart-book/samples/regexp.dart': Error: line 2 pos 54: illegal character after $ in string interpolation
  var matcher = new RegExp('^(.+)\s*,\s*([\w\s]+)\W?$');
                                                     ^
Well that is unfortunate. The dollar sign character in Dart strings is special--it signifies a string interpolation ("$name" might produce "Bob"). But here I am not attempting to interpolate a variable--I just want to match the end of the line. Ugh.
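
One way around the error without changing the kind of string literal is to escape the dollar sign. The following is only a minimal sketch, written in current Dart syntax (no new keyword), which postdates this post. It assumes that the backslashes need doubling as well, since an ordinary string literal treats a backslash as the start of an escape and would not pass \s through to the regular expression untouched. The variable name escaped is made up for illustration.

```dart
// Ordinary (non-raw) string: backslashes are doubled and the dollar sign is
// escaped so that both reach the regular expression engine intact.
final escaped = RegExp('^(.+)\\s*,\\s*([\\w\\s]+)\\W?\$');

void main() {
  // The pattern property shows what the regular expression engine received.
  print(escaped.pattern);
  print(escaped.hasMatch('Howdy, Bob!'));
}
```

All of that escaping makes the pattern much harder to read than the Perl-style original.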

The API reference for RegExp puts an at sign before the strings that it passes to RegExp. Perhaps that prevents interpolation? There is one way to find out:
main() {
  var matcher = new RegExp(@'^(.+)\s*,\s*([\w\s]+)\W?$');

  matcher.allMatches("Howdy, Bob!").forEach((m) {
    print(m[0]);
  });
}
That does appear to prevent interpolation because I now get output:
$ dart regexp.dart
Howdy, Bob!
Aha! I get allMatches() now. That is used when there is an expectation that there are multiple matches within a string. Since I do not have to worry about that here, I should use firstMatch. I also see that the zeroth match is the entire string that matched the whole regular expression. I want the first and second sub-expressions. Something like:
main() {
  var matcher = new RegExp(@'^(.+)\s*,\s*([\w\s]+)\W?$');

  matcher.firstMatch("Howdy, Bob!").tap((m) {
    print(m[1]);
    print(m[2]);
  });
}
This results in:
$ dart regexp.dart
Unhandled exception:
NoSuchMethodException - receiver: 'Instance of 'JSRegExpMatch'' function name: 'tap' arguments: [Closure]]
 0. Function: 'Object.noSuchMethod' url: 'bootstrap' line:360 col:3
 1. Function: '::main' url: '/home/cstrom/repos/dart-book/samples/regexp.dart' line:4 col:40
Aw, man! No tap() in Dart? Say it ain't so!

Ah well, I can manage with:
main() {
  final matcher = new RegExp(@'^(.+)\s*,\s*([\w\s]+)\W?$');

  var m = matcher.firstMatch("Howdy, Bob!");
  print(m[1]);
  print(m[2]);
}
This results in:
$ dart regexp.dart
Howdy
Bob
Trying this out against the very challenging "Yo, yo, yo, Alice!":
main() {
  final matcher = new RegExp(@'^(.+)\s*,\s*([\w\s]+)\W?$');

  var m = matcher.firstMatch("Howdy, Bob!");
  print(m[1]);
  print(m[2]);

  m = matcher.firstMatch("Yo, yo, yo, Alice!");
  print(m[1]);
  print(m[2]);
}
I find:
$ dart regexp.dart
Howdy
Bob
Yo, yo, yo
Alice
Hunh. I think I might like that. That makes sense and I can definitely see myself remembering how to use those.
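
With that in hand, last night's setter could be rewritten around firstMatch(). This is only a sketch under a few assumptions: the Greeter class is hypothetical (the post shows just the setter), the field names are copied from the setter above, and the syntax is current Dart (the r prefix for raw strings mentioned in the second comment below, plus nullable types), not the 2012 syntax used in the rest of this post.

```dart
// Hypothetical class wrapped around last night's setter.
class Greeter {
  static final _pattern = RegExp(r'^(.+)\s*,\s*([\w\s]+)\W?$');

  String? _greeting_opening;
  String? _full_greeting;
  String? name;

  set greeting(String greeting) {
    final m = _pattern.firstMatch(greeting);

    if (m != null) {
      // Group 1 is everything up to the last comma, group 2 is the name.
      _greeting_opening = m[1];
      name = m[2];
    } else {
      // No comma-separated name found: keep the whole string.
      _full_greeting = greeting;
    }
  }
}

void main() {
  final greeter = Greeter()..greeting = 'Yo, yo, yo, Alice!';
  print(greeter._greeting_opening);
  print(greeter.name);
}
```

The replaceAll() calls are gone because the second group never captures the space after the comma or the trailing punctuation in the first place.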

What about match vs. test from Javascript? It seems that Dart improves on that. Instead of test, Dart uses contains, which is a string method. Regular expressions themselves have a hasMatch() method. I try both out with a generalized version of my regular expression exploration code:
main() {
  final matcher = new RegExp(@'^(.+)\s*,\s*([\w\s]+)\W?$');

  try_regexp(matcher, "Howdy, Bob!");

  try_regexp(matcher, "Yo, yo, yo, Alice!");

  try_regexp(matcher, "Hey. You.");
}

try_regexp(regexp, str) {
  print('----');
  print(str);

  print("contains? ${str.contains(regexp)}");
  print("hasMatch? ${regexp.hasMatch(str)}");

  var m = regexp.firstMatch(str);
  print(m[1]);
  print(m[2]);
  print('');
}
I have also added a new string, "Hey. You.", which should not match. And indeed it does not (the exception at the end comes from indexing the null that firstMatch() returns when nothing matches):
$ dart regexp.dart
----
Howdy, Bob!
contains? true
hasMatch? true
Howdy
Bob

----
Yo, yo, yo, Alice!
contains? true
hasMatch? true
Yo, yo, yo
Alice

----
Hey. You.
contains? false
hasMatch? false
Unhandled exception:
NullPointerException
 0. Function: 'Object.noSuchMethod' url: 'bootstrap' line:360 col:3
 1. Function: '::try_regexp' url: '/home/cstrom/repos/dart-book/samples/regexp.dart' line:19 col:10
 2. Function: '::main' url: '/home/cstrom/repos/dart-book/samples/regexp.dart' line:8 col:13
(try.dartlang.org)
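
Guarding against a failed match keeps the exploration code from falling over on the last string. A minimal sketch, again in current Dart syntax and not the 2012 syntax above; the describe() helper and its output format are made up for illustration:

```dart
final matcher = RegExp(r'^(.+)\s*,\s*([\w\s]+)\W?$');

// Hypothetical helper: describes the two groups, or reports that there is
// no match, without ever indexing a null result.
String describe(String str) {
  final m = matcher.firstMatch(str);
  if (m == null) return 'no match';
  return '${m[1]} / ${m[2]}';
}

void main() {
  print(describe('Howdy, Bob!'));
  print(describe('Hey. You.'));
}
```

Checking hasMatch() or contains() first would also work, at the cost of running the regular expression twice.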


In the end, I am forced to admit that I rather like Dart's regular expressions. They will even improve my Javascript coding (I will always remember that Dart has a saner contains() method, which is analogous to the silly test()). It may not be Perl, and it would be nice if there were regular expression literals, but this is pretty darn nice.

Day #278

2 comments:

  1. @ before a string indicates that it's a "raw" string (similar to putting r before a string in Python), and therefore string interpolation etc. is disabled.

    I can't remember now where exactly I read this.

  2. Current versions of Dart do not use the @ to make a string "raw". Dart now uses the r prefix too.
