after playing with grep
March 27, 2008 – 6:45 am
Regular expressions are awesome, but ridiculous. I did not know there were so many implementations and syntax variants. I think it is time to read the book. For the uninitiated (and so that my parents can understand this post), regular expressions are powerful bits of syntax that make search operations (like grep) much cooler and more functional.
For instance, take… uh… google. I know my parents have used google. You can query google with words and phrases, like “cab” if you’re looking for websites about taxis. This will give you websites that use the word “cab” in them. (Similarly, if I’m writing a paper about taxis, I might search for all instances of the word “cab” in there.)
What if you want more complicated searches? You can do things like “cab” OR “taxi” or “cab” AND “taxi”, which give you the union and intersection of results containing those words, respectively - not a big deal.
Okay. But wait, why do our search results suddenly return things on cable television? Oh yeah. Cable television. But we just want the word ‘cab.’ So what if we did something like…
<the word has to start here!>cab<the word has to end here!>
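Or in some regular expression syntax variants, \bcab\b or \<cab\> (it’s faster to type those than the tags above). The better-known ^ and $ are close cousins: they pin a match to the start and end of a whole line rather than a single word, so ^cab$ only matches a line that contains “cab” and nothing else.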
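Here is a minimal sketch of that difference, assuming Python's re module (my pick for illustration; grep and other tools spell some of this differently). The sample sentence is made up.

```python
import re

text = "I took a cab home and watched cable television."

# Plain search: also hits the "cab" hiding inside "cable".
loose = re.findall(r"cab", text)

# \b marks a word boundary, so only the standalone word matches.
strict = re.findall(r"\bcab\b", text)

# Whole-string matching: the text must be "cab" and nothing else.
only_cab = re.fullmatch(r"cab", "cab") is not None
not_only_cab = re.fullmatch(r"cab", "cable") is not None

print(len(loose), len(strict), only_cab, not_only_cab)
```

The plain search finds two hits and the word-boundary search finds one, which is the "cable television" problem going away.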
What if we were studying chromatography and wanted to find references to color - I mean colour - I mean… actually, we don’t care how it’s spelled, only that it starts with “col” and ends with “r”? (So words like colander and colthisisacompletelymadeupwordr are also things we are, for some reason, looking for.)
We could search for something like this: col<stuff can go in the middle, we don’t care>r - or in shorter, more common syntax, col.*r, where the dot (.) means “any single character” and the asterisk (*) means “repeat the previous thing as many times as you like, including zero times.” Together, .* is shorthand for “stuff goes in here, but we don’t care what it is, or how long.” (On its own, col*r would only match “co”, then any number of l’s, then “r”.) There are special characters and a syntax where you can say things like “A single character goes here, but we don’t care what it is as long as it’s a character,” or “there should be whitespace between these two words - we don’t care how much or what kind, you can use tabs, a space, a bazillion spaces, whatever” and search for those.
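A small sketch of the wildcard idea, again assuming Python's re module; the word list is invented for the example. It matches each word as a whole, so "starts with col, ends with r" is checked literally.

```python
import re

words = ["color", "colour", "colander", "collar", "cooler", "colr", "collr"]

# "." is any single character and "*" repeats the previous item zero or
# more times, so ".*" is the "stuff goes here" wildcard.
wildcard = re.compile(r"col.*r")
starts_col_ends_r = [w for w in words if wildcard.fullmatch(w)]

# Without the dot, "*" only repeats the letter "l".
no_dot = re.compile(r"col*r")
l_repeats = [w for w in words if no_dot.fullmatch(w)]

# \s+ means "one or more whitespace characters of any kind".
has_gap = re.search(r"taxi\s+cab", "taxi \t   cab") is not None

print(starts_col_ends_r)
print(l_repeats)
print(has_gap)
```

Everything except "cooler" passes the first pattern, while the dot-less version only accepts "colr" and "collr".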
The list goes on. You can match…
palindromes (of up to 19 characters): ^(.?)(.?)(.?)(.?)(.?)(.?)(.?)(.?)(.?).?\9\8\7\6\5\4\3\2\1$
email addresses (roughly, and with case-insensitive matching switched on): ^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}$
and so on, and so forth, and hooray for simple ideas with great power. (See, mom and dad? This is why I get all excited about these kinds of things. And that’s just the tip of the iceberg. And there’s far more to it than regular expressions.)
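For the curious, here is a rough sketch that tries the palindrome and email patterns above, assuming Python's re module. The helper names and the test strings are my own inventions, and the email pattern is a simple approximation, not a full validator.

```python
import re

# Palindrome pattern from the post: up to nine captured characters, an
# optional middle character, then the captures again in reverse order.
PALINDROME = re.compile(
    r"^(.?)(.?)(.?)(.?)(.?)(.?)(.?)(.?)(.?).?\9\8\7\6\5\4\3\2\1$"
)

# Email pattern from the post. It only lists uppercase letters, so
# IGNORECASE is needed for ordinary lowercase addresses.
EMAIL = re.compile(r"^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}$", re.IGNORECASE)


def is_palindrome(s: str) -> bool:
    """Hypothetical helper: True if the palindrome pattern accepts s."""
    return PALINDROME.match(s) is not None


def looks_like_email(s: str) -> bool:
    """Hypothetical helper: True if the email pattern accepts s."""
    return EMAIL.match(s) is not None


print(is_palindrome("racecar"), is_palindrome("cab"))
# Nine captures + one middle character + nine backreferences = 19 at most.
print(is_palindrome("a" * 19), is_palindrome("a" * 20))
print(looks_like_email("mom@example.com"), looks_like_email("not an email"))
# The {2,4} limit rejects longer top-level domains.
print(looks_like_email("user@example.museum"))
```

The last two checks show the limits of each pattern: a 20-letter palindrome and a six-letter top-level domain both get rejected.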
3 Responses to “after playing with grep”
Last time I tried using regexps in Google, they didn’t work. Now it seems they do! Whoo!
There are various progs to translate between regexp syntaxes. I use RegexBuddy (non-free, I’m afraid, but powerful), but there are many others with different kinds of freedom, and some with web interfaces too, created by javascript demons (daemons?) with backslashes for fingers.
By Sam Kuper on Mar 27, 2008
But if you want to conform strictly to RFC 2822 for email addresses you can use
(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])
Although you really shouldn’t. Read more about it at http://www.regular-expressions.info/email.html
By Grant Hutchins on Mar 27, 2008
Regexps are older than previously thought:
http://geekandpoke.typepad.com/geekandpoke/2008/06/the-history-o-1.html
By Sam Kuper on Jun 15, 2008