Regular Expression

Javelin steps forward again with yet another enlightening lecture on an advanced topic. Regular Expressions put power into softcode.

Author: Javelin@M*U*S*H
Category: Softcode
Commands: @emit, @pemit.
Features: regexp commands.

MUSHCode for Regular Expression

Trispis stands and walks to the front of the room.
Trispis shuffles some papers around official-like.
Trispis clears his throat and sips from a glass of water, then wipes his brow
with a handkerchief.

Javelin poses invisibly.

Trispis says, "Welcome to yet another of MUSH 101's last minute lectures."

(TM)

Trispis says, "Tonight's topic is Regular Expressions."
Trispis says, "Tonight's lecturer will be our very own Javelin, current
maintainer of PennMUSH, god and chief brewer of M*U*S*H."
Trispis says, "We'll be using this classroom for the lecture (Jav speaks
here). I'll be the moderator and will notice hands being raised on the +101
channel (@chan/on 101, then raise your hand or something if you have a
question). Unless otherwise stipulated by Javelin, this will be the procedure
for tonight's lecture."
Trispis says, "Without further ado, please give a round of applause for
Javelin."
Trispis takes his seat once again in the front center.

mith phews and applauds

Trispis claps.

Javelin makes sure he's on +101. :)
Javelin says, "Good evening, folks. Before we begin, the obligatory
disclaimer. If you've never written a $command on an object before, this
lecture's probably too advanced for you. If you're a perl guru, this lecture's
probably too basic for you. Otherwise, you may be in the right place."

China smiles.

Javelin says, "I seem to be lagging a bit tonight, so I hope you'll bear with
me. As Tris said, tonight's topic is regular expressions. Let's see a fast
show of hands: how many people here know what a regular expression is? Raise
both hands if you've successfully written a regular expression. :)"

mith raises both hands

Rhysem raises both hands.
Raevnos raises both hands and feet.
Trispis raises one hand.
China doesn't know if she knows.
Xyrxwyrth raises an eyebrow..
Krevinek raises his hands, feet, hair, etc.

Javelin takes in the diversity of experience, and chuckles. "Ok. For those who
aren't sure if they know, a regular expression is basically a pattern of
characters -- a description, if you will, of a set of strings."
Javelin says, "For example: doggie is a regular expression. It's a pattern of
characters that describes the string d-o-g-g-i-e"

China listens.

Javelin says, "What makes regular expressions (or regexps, for short) more
exciting, of course, is that you can describe more than one string with a
regexp."
Javelin says, "For example, g+ is a regular expression that describes a string
with 1 or more g's."
Javelin says, "The great use for regexps is when you want to see if a string
matches a pattern, and you need that pattern to be flexible. For example, you
want to match the word dog or doggie or doggy, all with one pattern."
Javelin says, "Most of us are familiar with PennMUSH's simple wildcards (used
in functions like match()). In that system, dog* would match dog, doggy,
doggies, but also doggone, doge, dogcatcher, and lots of other things we might
not want it to."
(Jargon aside: that kind of matching is sometimes called 'globbing', btw)
Javelin says, "So far, so good?"

Trispis nods.
Mr.Ghost nods
China nods understanding so far.
Landus nods

Javelin says, "Ok, so how do we build up a regular expression? Well, a regexp
consists of a set of characters that have to match, and a set of modifiers
describing how those characters should match."
Javelin says, "All of the modifiers look like punctuation, so that means that
you can pretty much assume that anything that looks like a letter or number
represents a character that you're trying to match."

China says, "ah."

Javelin says, "What I'm going to go through here are the basic modifiers, and
how to use 'em. There are about 6 different common implementations of regular
expressions, most of which add additional modifiers and tricks, but what we'll
talk about tonight will work with any of them."
Javelin says, "So, let's go back to the dogs. We know that dog would match the
characters d-o-g, so that's a start."
Javelin says, "Since all of our patterns are going to have a dog in them."
Javelin says, "But dog will match the characters d-o-g anywhere. That is,
they'll match dog, doggie, hot dog, my dog has fleas."
Javelin says, "Try it yourself: think regmatch(hot dog,dog)"

China blinks.
China says, "we supposed to get a 1?"

Javelin says, "So the first thing we need to do is *anchor* our match --
indicate that we only want to match d-o-g when it appears at the beginning of
the string. The anchor character is '^' (a caret), and the anchored pattern
would be: ^dog"

<101> Trispis says, "Yes, China."

Javelin says, "Now if you try think regmatch(hot dog,^dog), you'll see that it
doesn't match. But regmatch(dog day afternoon,^dog) should."

Mr.Ghost says, "Cool"

Javelin says, "By the way, the other anchor (anchor at the *end* of the
string) is '$' (a dollar sign). As in dog$, which matches hot dog, but not dog
park."

Trispis reminds everyone to use 101 for questions and comments (it helps keep
the log clean).

<101> Mr.Ghost has joined this channel.

Javelin says, "(And, of course ^dog$ matches exactly one string: dog. Not hot
dog, not doge, not dog biscuit, just dog.)"
Javelin says, "Any questions so far?"
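
(Aside: the anchor examples above can also be tried outside a MUSH. What
follows is a minimal sketch in Python, assuming that re.search() behaves like
the lecture's regmatch() for these simple patterns. The helper name regmatch
below is made up for the sketch; it is not the PennMUSH function itself.)

```python
import re

def regmatch(string, pattern):
    """Hypothetical stand-in for the MUSH regmatch(): 1 on a match, else 0."""
    return 1 if re.search(pattern, string) else 0

print(regmatch("hot dog", "dog"))             # 1: unanchored, matches anywhere
print(regmatch("hot dog", "^dog"))            # 0: string does not start with dog
print(regmatch("dog day afternoon", "^dog"))  # 1
print(regmatch("hot dog", "dog$"))            # 1: string ends with dog
print(regmatch("dog park", "dog$"))           # 0
print(regmatch("dog", "^dog$"))               # 1: exactly dog
print(regmatch("dog biscuit", "^dog$"))       # 0
```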

<101> China has joined this channel.

Trispis has a question.

Javelin says, "Sure."

Trispis says, "Can you show me the syntax for a user command of 'dog' in both
normal globbing and regexp? I'm curious about how to build a regexp $command."

Javelin says, "Sure. Let me put down an object."
Javelin drops Demo.
Javelin says, "On that demo object, you'll see a typical glob-matched $command
to match 'dog'"

Demo(#6948V)
Type: Thing Flags: VISUAL
Owner: Javelin Zone: *NOTHING* Ducats: 10
Parent: *NOTHING*
Basic Lock: =Javelin(#7POWweACM)
Powers:
Warnings checked: none
Created: Mon Mar 27 19:59:09 2000
DOG_GLOB [#7]: $dog: @emit It's a globby dog
Home: Code Classroom(#1061RnJ)
Location: Code Lab(#1058RnJ)

Trispis nods.

Javelin says, "And now also a regexp version"

Demo(#6948V)
Type: Thing Flags: VISUAL
Owner: Javelin Zone: *NOTHING* Ducats: 10
Parent: *NOTHING*
Basic Lock: =Javelin(#7POWweACM)
Powers:
Warnings checked: none
Created: Mon Mar 27 19:59:09 2000
DOG_GLOB [#7]: $dog: @emit It's a globby dog
DOG_REGEXP [#7R]: $^dog$: @emit It's a regular dog.
Home: Code Classroom(#1061RnJ)
Location: Code Lab(#1058RnJ)

Trispis says, "Got it. Thanks."

Javelin says, "Beware! The initial $ in DOG_REGEXP isn't part of the regular
expression - it's the usual $ that means 'user-defined command coming'"

<101> China says, "how'd you see it Trispis?"
<101> Javelin says, "ex demo"
<101> Carbon-Based Landus says, "Point of interest..maybe want to point out
the difference between the $ at the beginning of the command, and at the end -
i.e., their different significance..."
<101> Mr.Ghost says, "He Did"
<101> Trispis says, "He just did."

Javelin says, "Whereas the $ right before the : (which ends the user-defined
command) is the 'end-of-string' anchor and part of the regexp."

<101> Carbon-Based Landus says, "oh..sorry. :p"

Javelin says, "Notice also that there's a big R attribute flag on the
dog_regexp attribute. If you want to use regexps for command-matching, you
must: @set demo/dog_regexp=regexp"

<101> Krevinek has joined this channel.
<101> Carbon-Based Landus says, "Oh, he said it while I was typing..ack."
<101> China ponders this.

Javelin says, "That tags that attribute as using regexp matching instead of
the usual globbing. This applies to ^listens, too."

Jinelle is so lost.

<101> Mr.Ghost says, "&listen_attr ^^dog$:bleh ?"
<101> Javelin grins, ex demo

Javelin says, "If you examine demo again, you'll see an example with
^listens."

Demo(#6948V)
Type: Thing Flags: VISUAL
Owner: Javelin Zone: *NOTHING* Ducats: 10
Parent: *NOTHING*
Basic Lock: =Javelin(#7POWweACM)
Powers:
Warnings checked: none
Created: Mon Mar 27 19:59:09 2000
DOG_GLOB [#7]: $dog: @emit It's a globby dog
DOG_REGEXP [#7R]: $^dog$: @emit It's a regular dog.
SOUND_REGEXP [#7R]: ^^Javelin says,: @emit Javelin said something.
Home: Code Classroom(#1061RnJ)
Location: Code Lab(#1058RnJ)

<101> Jinelle has joined this channel.
<101> China says, "aaah, the #7R thingy."

Javelin says, "Again, the initial ^ means 'this is a listen attribute', but
the second '^' is the beginning-of-string anchor in the regular expression."

<101> Trispis says, "Ah! I notice no * for the 'wildcard meaning anything
thereafter'."
<101> Krevinek says, "Question here, so you made the regexp flag so you could
be backwards compatible with current DBs/MUSHcode, and allow for this more
flexible matching?"
<101> Javelin nods to Tris and Krev.

Javelin says, "We know how to anchor a match. Let's take a look at a few other
things we can do."

<101> Mr.Ghost has a Q

Javelin says, "The | (pipe sign) in a regular expression means, basically,
'or'. It's used when you want a pattern to match this or that. Like:
this|that"

<101> Mr.Ghost says, "In DEMO\SOUND_REGEXP would %0 be everything after
Javelin or Javelin Says, ?"
<101> Trispis says, "Go ahead and ask here, Mr.G. Jav will reply when he has a
chance."
<101> Mr.Ghost says, "Okay"
<101> Javelin says, "No, Ghost, %0 wouldn't be set. We haven't gotten there
yet. :)"
<101> Mr.Ghost listens then

Javelin says, "So, now you should all be able to write a regular expression
that matches dog, doggy, or doggie, but nothing else. Right?"

<101> Trispis tries.

Javelin says, "Say 'em out loud. :)"

Trispis says, "$^dog$|^doggie$|^doggy$:@emit this is the dog or doggie or
doggy command"
Trispis sees his inefficiency.
Trispis rewrites (or tries)

Javelin says, "Close, but the good news is that the | binds tighter than the
anchors, so you don't repeat the anchors."

<101> Mr.Ghost says, "&reg.test1 me=^^Can|Canner|cannon:@emit Boom! ?"

Javelin suggests just writing the regexp, btw, not the command.

<101> Javelin says, "That also matches Candle, Ghost. Do you see why?"

Trispis says, "$^dog|doggie|doggy$:"

<101> Mr.Ghost does
<101> Yusif has joined this channel.

Trispis says, "There's gotta be a way to do it without repeating the dog part,
though."

<101> Javelin says, "And Cannery. and cannonade. And Can-Can"
<101> China raises an eyebrow at Javelin's remark.

Javelin nods. "What Tris said will work. Note that it's anchored at both front
and back, so we won't match hot doggie or doggy treat. And then it offers 3
options."

Raevnos says, "Actually, the anchors bind tighter than |."

Trispis says, "$^dog&''|gie|gy$ (or something like that)"

Javelin says, "Another way to do it might be: ^do(g|gy|gie)$"
Javelin says, "They do, Raev?"

Trispis says, "Yeah. Like that."

Raevnos says, "Yup."

Javelin dohs.

Raevnos says, "think regmatch(doggiesfoo, ^dog|doggie|doggy$)"
Raevnos says, "That matches."

Javelin amends Tris's version then: ^(dog|doggy|doggie)$

Raevnos says, "think regmatch(doggiesfoo, ^(dog|doggie|doggy)$)"

<101> Chili says, "&reg.test1 me=^Can\b|Canner\b|cannon\b:@emit boot"

Raevnos says, "Easy fix. :)"

Trispis says, "cuz of the unanchored doggie in the middle?"

Javelin chuckles. So you see how confusing it can be. :)

<101> Chili says, "err"
<101> Chili says, "&reg.test1 me=^^Can\b|Canner\b|cannon\b:@emit boot"

China says, "YES!"

<101> Chili says, "that way it'll make sure they're actual words"

Javelin says, "But the really good news, then, is that parentheses do what
you'd expect (and more, as we'll see later) - group things together."

<101> Javelin doesn't think \b works in MUSH regexps.
<101> Professor Raevnos says, "Except the regexp engine used right now doesn't
support \b."
<101> Chili says, "well.. actually i'm not sure how the MUSH regexp matching
is working right now"
<101> Professor Raevnos says, "Wait for 1.7.3. =)"
<101> Trispis says, "Question."
<101> Chili ohs
<101> Kyieren says, "Is someone logging this class?"
<101> Trispis says, "Since Raev pointed out the binding thing... this brings
in a question about the closing anchor."
<101> Trispis says, "Can I do this? (coming)"

Javelin says, "So, ^(dog|doggy|doggie)$ works. So does ^do(g|gy|gie)$. So does
^dog(|y|ie)$, btw (using a null pattern as an option)"

<101> Javelin says, "Yes, Ky."
<101> Trispis says, "$^dog($|gies$|gy$):"
<101> Javelin says, "Yep, that should work too."

Javelin says, "If you wanted to do this match w/o regexp, you'd either have to
write 3 different match patterns, or you'd have to match dog* and then check
to see if the * part matched nothing, gy, or gie, etc."

Trispis says, "Wait."

Javelin waits.

Trispis says, "You said ^(dog|doggy|doggies)$, but... as Raev demonstrated,
that would match doggypoo, wouldn't it?"

Javelin says, "No, ^dog|doggy|doggies$ would, but note the parens."
Javelin says, "That pattern is read as 'start-of-string, dog or doggy or
doggies, end-of-string'"

Trispis says, "Oh, so the parens are explicit?"

Javelin raises an eyebrow and smirks.

Trispis says, "er... I mean... the contents are explicit"

China sees that now.

Trispis says, "They don't need anchored inside?"

Javelin still isn't sure what you mean by explicit, actually.

<101> Carbon-Based Landus rotfl

Javelin says, "The parens group everything in them into a single unit."

Trispis says, "I don't know how to phrase my query, but I think I understand."

Javelin says, "So just as ^dog$ matches anchor-dog-anchor, ^(dog|doggy)$
matches anchor-(either dog or doggy)-anchor"
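
(Aside: the difference the parens make can be checked with a short Python
sketch. It assumes Python's re module treats | and the anchors the same way
the MUSH engine does for these patterns, which matches what Raevnos
demonstrated above. The names loose and tight are just labels for the sketch.)

```python
import re

# Without parens each anchor belongs to only one alternative:
# (^dog) or (doggie) or (doggy$).
loose = re.compile(r"^dog|doggie|doggy$")
# With parens the anchors apply to the whole group.
tight = re.compile(r"^(dog|doggy|doggie)$")

for s in ["dog", "doggy", "doggie", "doggiesfoo", "hot doggy"]:
    print(s, bool(loose.search(s)), bool(tight.search(s)))
```

The first three strings match both patterns. The last two match only the
version without parens, which is exactly the leak being discussed.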

Trispis nods.

Javelin grins. "At this point you should all be saying either 'oooh...aaahh'
or 'this is so easy, tell me something confusing!'"

China nodnodnods finally.

Trispis says, "oooh ahhh"

Javelin says, "So here it comes. The Most Confusing Modifier. * (asterisk)"

Kyieren says, "I'm saying WTF myself"
Kyieren :)

Javelin says, "The * means 'zero or more of whatever unit came before me'.
That is, the pattern a* matches zero or more letter a's."
Javelin says, "Try a few regmatches against that pattern."
Javelin says, "Like: think regmatch(testing,a*) or think regmatch(aaah!,a*)"

<101> mith says, "has a question"
<101> Kyieren claps with glee and gets a 1 for apple :)

Javelin says, "You may be surprised to find that it matches _anything_. Does
anyone see why?"

Kyieren says, "* is a wildcard?"

Raevnos does. :)

Trispis says, "Because you haven't anchored it anywhere."

Raevnos says, "But since I have to head home, I'll let someone else answer.
Night!"

Javelin says, "Not in the glob sense, Ky. After all, the regexp a* will match
'e'"
Javelin says, "Close, Tris. Remember, in regexps, * means zero or more. Every
string has either zero a's or more a's in it."

Trispis says, "But, by anchoring with ^a*$, it wouldn't match 'testing' would
it?"
Trispis tries it.

Javelin says, "No. That pattern would only match a blank string or a string of
a's."

Trispis says, "So, I was partially right, then?"

Javelin nods.
Javelin says, "The * is confusing because it doesn't work like we're used to
in globbing. It doesn't mean 'anything here'. It means, 'zero or more of the
unit before' (it rhymes. :)"

China thinks on this.

Kyieren hehs

Javelin says, "The unit before is usually the character before. So aa* matches
an 'a' followed by zero or more 'a's. Different than (aa)*, which matches zero
or more 'aa's"

<101> Trispis says, "so a+ == aa* ?"

Javelin likes bananas, but never knows when to stop. banana(na)*

<101> Mr.Ghost says, "So ^a*dog$ only matches naaaaaaadog, dog and adog ?"
<101> Javelin says, "Tris, correct. The + is 'one or more', so aa* is the same
as a+"
<101> Mr.Ghost says, "n being some number of as"
<101> Javelin says, "Right, Ghost!"

Javelin says, "That pattern matches banana, bananana, banananana,
banananananana, etc."

<101> Trispis says, "bana(na)+ ?"

Javelin says, "As you just heard on +101, the + (plus sign) is like *, but
means 'one or more'. So bana(na)+ is the same."
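
(Aside: a small Python sketch of the * and + behavior described above,
assuming Python's re module agrees with the MUSH engine on these basics.
re.fullmatch() is used for the banana patterns as a shorthand for anchoring
them with ^ and $.)

```python
import re

# a* is happy with zero a's, so it matches any string at all.
print(bool(re.search(r"a*", "testing")))    # True
# Anchored, it only accepts an empty string or a run of a's.
print(bool(re.search(r"^a*$", "testing")))  # False
print(bool(re.search(r"^a*$", "")))         # True
print(bool(re.search(r"^a*$", "aaaa")))     # True

# banana(na)* and bana(na)+ describe the same set of strings.
for word in ["banana", "bananana", "banan"]:
    star = bool(re.fullmatch(r"banana(na)*", word))
    plus = bool(re.fullmatch(r"bana(na)+", word))
    print(word, star, plus)
```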

<101> Javelin says, "GMTA"
<101> China sorta understands it for the most part but will ask the stupid
question. What zero? Zero has an 'e' in it.
<101> Trispis thinks he's catching on.
<101> Trispis says, "0"
<101> Javelin means none
<101> Chili has connected.
<101> China says, "OH! ya didn't say 'that'."

Javelin says, "There's also a modifier that means 0 or 1. The question mark
(?). So dog? matches a string containing d-o-(optional)g -- that is d-o or
d-o-g"

Kyieren puts a 'CONFUSED' sticker on his forehead and watches the rest of the
show.

Trispis says, "oooooh."

<101> Trispis says, "Tell me now, tell me now!"
<101> China cause it kicks off whatever letter is just before it?
<101> Javelin nods.
<101> China says, "that was a say. sorry."

Javelin says, "The ? is very handy, actually. What does this match? ^beers?$"
China beer
Mr.Ghost says, "beer and beers"

<101> Trispis says, "How do you build +activity to match +act[i[v[i[t[y]]]]]
?"
<101> Octavian has joined this channel.
<101> Kyieren says, "scream"
<101> Javelin says, "Frankly, I'd be inclined to just match the act part."
<101> China says, "'why' would you want to Tris?"
<101> Trispis wants to make smart matching.
<101> Kyieren says, "isn't that overkill tho?"
<101> Trispis says, "Of course. (:"

Javelin says, "Oh, that reminds me. If you need to match a literal *, ?, or +,
you've got to put a backslash in front of it: ^\+act$"

<101> Mr.Ghost says, "Huh?"
<101> Mr.Ghost says, "Okay"

Javelin says, "The same applies if you need to match a literal paren, or a
literal backslash."

Mr.Ghost says, "Define literal in this context please"

<101> Trispis says, "I think I have it. I'd use nested parentheses some how,
right?"

Javelin says, "An actual parenthesis in the string."

<101> Javelin nods.
<101> Javelin thinks it's act(i(v(i(t(y)?)?)?)?)?
<101> China crosses her eyes.
<101> Javelin says, "But that's really wicked. :)"
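
(Aside: a Python sketch of the ? modifier, the escaped + sign, and the nested
optional groups suggested on the channel. It assumes Python's re module
handles these constructs as described in the lecture. Adding the \+ and the
anchors to the act... pattern is a choice made for the sketch.)

```python
import re

# s? makes the final s optional.
beers = re.compile(r"^beers?$")
print([w for w in ["beer", "beers", "beersss", "bee"] if beers.search(w)])

# \+ is a literal plus sign; each nested (...)? lets the word stop early.
abbrev = re.compile(r"^\+act(i(v(i(t(y)?)?)?)?)?$")
for cmd in ["+act", "+acti", "+activ", "+activity", "+actvy", "+ac", "act"]:
    print(cmd, bool(abbrev.search(cmd)))
```

Only the true prefixes +act through +activity match. +actvy is rejected
because each later letter is only allowed once the earlier ones are present.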

Kyieren says, "Say if you need to match +bap and not just bap"
Kyieren says, "That's what he means"

Javelin says, "There's 3 more pieces of the standard regular expression
toolkit left."

<101> Trispis was gonna try act(i?(v?(i?(t?(y?)))))

Javelin says, "A period (.) matches any character. So ta.t matches tart, tact,
tait, etc."

<101> Javelin says, "That matches actvy, I think. :)"
<101> Trispis says, "Like the ? in globbing?"
<101> Javelin says, "Yeah, but the ? in globbing can match no character, too."
<101> Trispis says, "No it can't."
<101> Javelin says, "IT can't?"
<101> Trispis says, "Read my Code Basics lecture. (;"
<101> Javelin confuses MUSH and unix globbing. :)
<101> Kyieren pushes Javelin back to the lecture hall.
<101> Trispis says, "It must be one character in length."

Javelin says, "What that means, then, is that .* (dot-star) matches zero or
more of anything."

Trispis says, "Oooh. Okay."

Javelin says, "That's basically the equivalent of the globbing *. ba.*na
matches a string containing bana, batna, banana, barglearglena"

Trispis says, "Does the . match no character?"

Javelin says, "No, a . must match a character."
Javelin says, "But .? matches zero or one character."

Trispis nods.

Javelin says, "And .+ matches one or more characters. Everybody still
following along as I put these pieces together?"

Kyieren says, "I think so"

Trispis is following, but getting lost in his own imagination quickly.

Kyieren makes T watch Pokemon, draining all brain function.

<101> Trispis says, "Time for a practical example now, perhaps, using several
things together?"

Javelin says, "Example. I want to write a command like this:
fire.gun.<person's name - any string>"
Javelin says, "That is, f-i-r-e-period-g-u-n-period-one or more characters"
Javelin says, "What might that look like?"

Kyieren says, "FiredGunatSara would match that, right?"

Trispis says, "^fire\.gun\..*$"
Trispis guesses.

Javelin says, "Ky's right if that were the regexp, but I meant more like what
Tris is talking about. Writing a regexp to match that."

Kyieren says, "Oh, silly me"

Javelin says, "Tris has it almost right: That will match 'fire', a literal
period, 'gun', a literal period, and zero or more of any character."
Javelin says, "(almost, because I said one or more)"

mith says, "^fire.gun..+$ ?"

Trispis eeps and fixes.

^fire.gun.*

Javelin says, "Right, mith, if the backslashes are there."

Kyieren says, "That was me"

Trispis says, "^fire\.gun\..+$"
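
(Aside: why the backslashes matter here, as a Python sketch. It assumes
Python's re module treats . and \. the way the lecture describes. The sample
strings are made up for illustration.)

```python
import re

escaped = re.compile(r"^fire\.gun\..+$")   # literal periods
unescaped = re.compile(r"^fire.gun..+$")   # each . matches any character

for s in ["fire.gun.Sara", "fire.gun.", "firedgunatSara"]:
    print(s, bool(escaped.search(s)), bool(unescaped.search(s)))
```

Both patterns reject fire.gun. because .+ needs at least one character, but
only the unescaped one accepts firedgunatSara.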

Kyieren says, "Doh"

Javelin says, "You can do it on the channel so it doesn't eval, if it's
easier."

China crosses her eyes again.

Javelin says, "Ok, now really practical! What's a regexp that would match an
IP address?"

mith says, "^fire\.gun\..+$ dang, eval"

Kyieren says, "^....*"

Javelin says, "An ip address, for our purposes, is 1-3 numbers, a dot, 1-3
numbers, a dot, 1-3 numbers, a dot, 1-3 numbers."

Kyieren says, "ack!"
Kyieren says, "^\.\.\..*"
Kyieren says, "?"

Javelin says, "That requires ...(followed by anything)"

Kyieren says, "oh, wrong"

Javelin says, "Nice regexp for matching ellipsis... :)"

Mr.Ghost says, "+++\.+++\.+++"

Kyieren says, "^\.+\.+\.+.*"
Kyieren says, "Man am I stupid"

Javelin says, "Ghost, the + must modify something. What comes before it?"

mith says, "We don't know how to match just numbers yet d;"

Mr.Ghost has a clue

Kyieren replaces his CONFUSED label with a STUPID one :)

Mr.Ghost hasn't rather

Javelin says, "A number matches itself."
Javelin says, "You might start by thinking about this:
(0|1|2|3|4|5|6|7|8|9)+\."

mith says, "(1|2|3|4|5|6|7|8|9|0)\. .. yeah"

Kyieren says, "...\...\...\..."
Kyieren says, "^...\...\...\..."

Trispis was about to do like Ky.

Kyieren says, "?"
Kyieren says, "wait, that's only two numbers"

Yusif hmms. But the + allows more than 3, so we have to limit to 3 and only 3.
Right?

Kyieren says, "please hold"
Kyieren says, "^...\....\....\...."

Javelin says, "Ah, but that'll match letters, too, won't it?"
Javelin says, "a.b.c.d isn't an IP address."

Kyieren says, "ack!"
Kyieren slaps

Javelin grins at Yusif. Yeah, we should, but let's not worry about that just
yet.

Yusif says, "Okay. :)"
Yusif says,
"(0|1|2|3|4|5|6|7|8|9)+.(0|1|2|3|4|5|6|7|8|9)+.(0|1|2|3|4|5|6|7|8|9)+.(0|1|2|3|4|5|6|7|8|9)+"

Kyieren says, "^.?.?.?\..?.?.?\..?.?.?\..?.?.?"

(Aside: Some of the fancier extended regular expression systems have a way of
saying '3 or more' or 'no more than 5' or '7-9 of those', but Penn's doesn't
in 1.7.2)

Kyieren screams and must have been way off, I'm still matching letters

Javelin says, "Again, Ky, pretty close, but those .'s will match letters, not
just numbers. Yusif's pretty much got it, if those .'s are really \.'s"

Yusif says,
"(0|1|2|3|4|5|6|7|8|9)+\.(0|1|2|3|4|5|6|7|8|9)+\.(0|1|2|3|4|5|6|7|8|9)+\.(0|1|2|3|4|5|6|7|8|9)+"

Yusif meant that, but got evaled on me.

Javelin says, "Right-o."

Yusif wows, just found a use for one of my univ math courses. <sigh>

Trispis has a problem with that on principle.
Trispis says, "999.999.999.999 isn't a valid IP."

Kyieren says, "Hey wait, ip's only go up to 255, minus two points to Yusif :)"
Kyieren spanks also. :)

Javelin says, "That's a semantic problem, not a syntactic one, but good - you
can work that out for homework. :)"
Javelin says, "Now, that's an awful lot to type, and life's too short, so
there's a faster way!"

Kyieren cheers

Mr.Ghost says,
"$0?(|1|2)+0?(|1|2|3|4|5)+(0|1|2|3|4|5|6|7|8|9)+\.0?(1|2)+0?(1|2|3|4|5)+0?(1|2|3|4|5|6|7|8|9)+\.0?(1|2)+0?(1|2|3|4|5)+0?(1|2|3|4|5|6|7|8|9)+"

Kyieren says, "I bet it's .# or some such"

Javelin says, "[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+"

Trispis says, "And doesn't Yusif's allow for 4 digit numbers?"

Kyieren oooooooo

Yusif nods to trispis and 10 digit ones too

Javelin says, "Scrollback, T, I said not to worry about that yet. :)"

Kyieren ahhhhhh
Kyieren == peanut gallery

Trispis says, "Oh. Okay."

Javelin says, "Using []'s, you can enclose a set of characters that are ok to
match, and that set can include ranges of characters."

Kyieren says, "[A-Z]?"

Trispis says, "oooh."

Mr.Ghost says, "[a-h] is a range?"

Yusif so... "0-21-91-9\.0-21-91-9\.0-21-91-9\.0-21-91-9"

Javelin says, "Some handy things: [A-Z] (match all cap letters), [a-zA-Z]
(match all letters)"

Yusif acks.

Kyieren cheers

Yusif so...
"[0-2][1-9][1-9]\.[0-2][1-9][1-9]\.[0-2][1-9][1-9]\.[0-2][1-9][1-9]"

Kyieren says, "We should have done this on the 101 channel, eval is annoying
:)"

Trispis says, "[a-Z]?"

Kyieren says, "I was about to ask that too, T"

Javelin says, "[a-zA-Z0-9] (match all alphanumeric). [a-zA-Z\.\,\?\;\!\: ]
(match typical characters in a sentence, including space)."
Javelin says, "No, [a-Z] isn't really safe."
Javelin says, "On an ascii system, there's lots of characters you don't want
between lowercase a and capital Z"

Kyieren says, "Will the MUSH rise up against you and bring the wrath of God
down upon you (server crash) if you do?"

Javelin says, "No."
Javelin says, "But you might match punctuation."

Yusif blinks. "So you are basically automatically removing pesky things like [
and ( and such... handy."

Javelin says, "Trickier still, you can match all characters *except* a certain
character, but we'll save that one."

<101> Trispis says, "It doesn't use their octal order?"
<101> Javelin says, "It does."
<101> Javelin knows there's chars between them, but forgets what they are.
Aren't there?
<101> Trispis doesn't recall any.
<101> Mr.Ghost wonders if [1-23] is a valid range

mith says, "What about A-z? caps come first?"

<101> Trispis says, "If I remembered my C, I could compile a program to find
out."

Javelin says, "See, that's why you should just write [A-Za-z] :)"
Javelin says, "Safe, no matter what."

<101> Javelin says, "It's valid, but it means 1-2, and 3, so it's the same as
1-3"
<101> Mr.Ghost understands
<101> Trispis says, "digits only, then?"
<101> Javelin says, "Those are characters in there, not numbers, so '23' is
the character '2', then '3'"
<101> Trispis says, "er, characters. yeah."

Javelin says, "So, now, if we wanted to write a pattern for an ip address, and
we don't want to match letters, and we don't want to match 4-digit or longer
numbers, how do we do it?"

Kyieren says, "oo"

Mr.Ghost types away

Javelin recommends using the chat channel, or say/noeval

Kyieren says, "^[0-9]\.[0-9]\.[0-9]\.[0-9]?"
Kyieren says, "Hmm I guess that question shouldn't be there"
Kyieren says, "wait"

Javelin says, "That matches 1.1.1.1, but does it match 12.12.12.12?"

Kyieren says, "that's wrong"

mith says, "^[1-2]?[0-9]?[0-9]?\. then repeat?"

<101> Trispis says,
"[0-9][0-9]?[0-9]?\.[0-9][0-9]?[0-9]?\.[0-9][0-9]?[0-9]?\.[0-9][0-9]?[0-9]?"

Kyieren dies

Mr.Ghost says, "0-2?0-5?1-9+\.0-2?0-5?1-9+\.0-2?0-5?1-9"

Javelin says, "mith, that matches just a period. :)"

<101> Trispis intentionally left out one ? from each.

Mr.Ghost bah

mith says, "wait, that could match .... also couldn't it?"

<101> Mr.Ghost says,
"[0-2]?[0-5]?[1-9]+\\.[0-2]?[0-5]?[1-9]+\\.[0-2]?[0-5]?[1-9]"

Kyieren rofl at Javelin!!!

<101> Javelin says, "Doesn't match 100.100.100.100, Ghost. :)"
<101> Mr.Ghost says, "Only one \\"
<101> Kyieren says, "259.259.259.259?"
<101> Javelin says, "Don't worry about the semantics. :)"
<101> Kyieren says, "Yeah I know :)"
<101> Trispis says, "Because I want at least ONE occurrence each set."
<101> Mr.Ghost okies
<101> Trispis says, "The second two being optional."

Javelin, not considering semantics, would repeat this: [0-9][0-9]?[0-9]?\.
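
(Aside: a Python sketch of the pattern just described, built by repeating the
one-to-three-digit piece four times. It assumes Python's re module matches the
MUSH engine for these basic constructs. The variable names are made up for the
sketch.)

```python
import re

octet = r"[0-9][0-9]?[0-9]?"   # one digit, then up to two optional digits
ip = re.compile(r"^" + r"\.".join([octet] * 4) + r"$")

tests = ["1.1.1.1", "12.12.12.12", "100.100.100.100", "999.999.999.999",
         "1234.1.1.1", "a.b.c.d", "1.2.3.3.4.5.6", "...."]
for s in tests:
    print(s, bool(ip.search(s)))
```

The first four strings match, including 999.999.999.999, since the pattern
checks only syntax. The last four are rejected.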

<101> Trispis says, "I got it right! (:"
<101> Mr.Ghost says,
"([0-2]?[0-5]?[0-9])+\.([0-2]?[0-5]?[1-9])+\.([0-2]?[0-5]?[1-9])+"
<101> Kyieren says, "I can't get a programming language that doesn't exist, my
future career is looking extremely bleak"
<101> Trispis says, "You're thinking right, Mr.G, but there's some if/then
involved for the leading zeros."

Javelin says, "ex demo again."

<101> Trispis says, "012.012.012.012 isn't a valid IP."
<101> Kyieren says, "stop worrying about semantics"
<101> Kyieren says, "for the 50th time :)"

Javelin fixes some ?'s in it.

Demo(#6948V)
Type: Thing Flags: VISUAL
Owner: Javelin Zone: *NOTHING* Ducats: 10
Parent: *NOTHING*
Basic Lock: =Javelin(#7POWweACM)
Powers:
Warnings checked: none
Created: Mon Mar 27 19:59:09 2000
DOG_GLOB [#7]: $dog: @emit It's a globby dog
DOG_REGEXP [#7R]: $^dog$: @emit It's a regular dog.
IPTEST [#7R]: $^ip [0-9][0-9]?[0-9]?\.[0-9][0-9]?[0-9]?\.[0-9][0-9]?[0-9]?\.[0-9][0-9]?[0-9]?$: @emit %N entered %0
SOUND_REGEXP [#7R]: ^^Javelin says,: @emit Javelin said something.
Home: Code Classroom(#1061RnJ)
Location: Code Lab(#1058RnJ)

Javelin says, "Ok. Now, let's say we wanted to check semantics, too."

Trispis says, "Then it gets ugly."

Javelin says, "For example, let's say we wanted to assure ourselves that the
first number in the address is between 0 and 255."

<101> Mr.Ghost says, "actually 012.012.012.012 is 10.10.10.10 but that's a
different story"

Yusif thinks each number could be (1-9|1-90-9|10-90-9|20-50-9)
Yusif keeps doing that.

<101> Yusif says, "([1-9]|[1-9][0-9]|1[0-9][0-9]|2[0-5][0-9])"

Javelin says, "What would be great would be if, in place of that @emit, we
could write: @switch lt(FIRSTNUMBER,256)=0, @emit That's too big, @emit Ok"

Javelin says, "More generally, it would be nice to capture parts of the string
that matched the pattern."

Javelin says, "(Maybe you want to capture the first 3 octets, which could be a
class-C address, or something - techie aside)"
Javelin says, "This turns out to be really easy!"
Javelin says, "ex demo again. I've made one small change to the regexp
pattern."

Demo(#6948V)
Type: Thing Flags: VISUAL
Owner: Javelin Zone: *NOTHING* Ducats: 10
Parent: *NOTHING*
Basic Lock: =Javelin(#7POWweACM)
Powers:
Warnings checked: none
Created: Mon Mar 27 19:59:09 2000
DOG_GLOB [#7]: $dog: @emit It's a globby dog
DOG_REGEXP [#7R]: $^dog$: @emit It's a regular dog.
IPTEST [#7R]: $^ip ([0-9][0-9]?[0-9]?)\.[0-9][0-9]?[0-9]?\.[0-9][0-9]?[0-9]?\.[0-9][0-9]?[0-9]?$: @emit %N entered %0, and the first octet was: %1
SOUND_REGEXP [#7R]: ^^Javelin says,: @emit Javelin said something.
Home: Code Classroom(#1061RnJ)
Location: Code Lab(#1058RnJ)

Javelin says, "I've enclosed the pattern for the first octet in parens."
Javelin says, "We saw earlier that parens just group things into a unit."
Javelin says, "So it doesn't change the meaning of anything."
Javelin says, "But parens *also* have a handy side effect - they capture the
part of the string that matches what's in the parens!"
Javelin turns demo on. Try typing: 'ip <valid or invalid address>' and ex demo
to see how it works.

Javelin entered IP 128.248.90.226, and the first octet was: 128

Krevinek entered IP 266.266.266.266, and the first octet was: 266

Trispis entered IP 999.254.254.254, and the first octet was: 999

Javelin says, "Try 1.2.3.a, too. :)"

Mr.Ghost hrms
Trispis hrms.

China entered IP 123.123.12.123, and the first octet was: 123

Javelin says, "As you can see from the code, if the pattern matched at all, %0
contains the entire matched pattern (including the word IP!), and %1 contains
whatever matched what's in the first set of parens."
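
(Aside: the same capture idea as a Python sketch, with the "too big" check
Javelin wished for a moment ago. Here m.group(0) plays the role of %0 and
m.group(1) the role of %1. The function name check is made up for the sketch,
and Python's re module is only assumed to mirror the MUSH engine for this
pattern.)

```python
import re

pattern = re.compile(
    r"^ip ([0-9][0-9]?[0-9]?)\.[0-9][0-9]?[0-9]?"
    r"\.[0-9][0-9]?[0-9]?\.[0-9][0-9]?[0-9]?$"
)

def check(command):
    """Syntax is checked by the regexp, semantics by ordinary code afterwards."""
    m = pattern.search(command)
    if m is None:
        return "no match"
    first_octet = int(m.group(1))   # what the first set of parens captured
    return "Ok" if first_octet < 256 else "That's too big"

print(check("ip 128.248.90.226"))   # Ok
print(check("ip 999.254.254.254"))  # That's too big
print(check("ip 1.2.3.a"))          # no match
```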

Trispis says, "%0 is my 'to match' (the overall glob being regmatched), and %1
is the first grouped paren set?"

Javelin says, "If you had more parens, %2 would match the second set, and so
on, up to %9"

Mr.Ghost hrms

Trispis says, "So, we can do errorchecking after the : ?"

Javelin says, "BTW, parens are counted by counting left parens, from left to
right. So (a(bc)), if matched by the string 'abc', would set %1 to 'abc' and
%2 to 'bc'"
Javelin says, "Right."
Javelin says, "If you have a pet shop and you want to implement a 'tickle'
command for all the animals, you might be able to do it globally, with:
^tickle (puppy|kitty|birdie|fishie|snakie)$: and now %1 tells you what they
tickled so you can customize the reaction of the animal."
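
The two points above (counting left parens, and using a captured alternative
to decide what happened) can be sketched in Python as follows. This is an
illustration under the assumption that Python's re numbers groups the same
way, which it does for these patterns; the function name tickled is
hypothetical.

```python
import re

# Groups are numbered by the position of their left paren, left to right.
nested = re.match(r"^(a(bc))$", "abc")
print(nested.group(1), nested.group(2))  # abc bc

# The pet shop pattern: group 1 says which animal was tickled.
TICKLE = re.compile(r"^tickle (puppy|kitty|birdie|fishie|snakie)$",
                    re.IGNORECASE)

def tickled(typed):
    """Return the animal named in a tickle command, or None if no match."""
    m = TICKLE.match(typed)
    return m.group(1) if m else None

print(tickled("tickle kitty"))   # kitty
print(tickled("tickle dragon"))  # None
```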

Trispis says, "Okay... this is where you get to do part of your errorchecking
in the command and the other part in the process. So, instead of *.*.*.*, we'd
use the current code to eliminate ANY processing of letters, and then
errorcheck the numbers afterward?"

Javelin says, "Right. Depending on the pattern, you've done a big part of the
job syntactically."
Javelin says, "Actually, globbing *.*.*.* also matches 1.2.3.3.4.5.6, which
the regexp doesn't. :)"
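
A rough Python sketch of that division of labour: the pattern rejects
anything that isn't four dotted groups of one to three digits, and a small
numeric check afterward rejects octets like 266. The names OCTET, DOTTED and
valid_ip are invented for this example, and fnmatchcase is used as a
stand-in for MUSH globbing, which it only approximates.

```python
import re
from fnmatch import fnmatchcase

# One octet, written the same way as in the lecture, with a capture group.
OCTET = r"([0-9][0-9]?[0-9]?)"
DOTTED = re.compile("^" + r"\.".join([OCTET] * 4) + "$")

def valid_ip(text):
    """Syntax check by regexp first, then range-check each octet."""
    m = DOTTED.match(text)
    if m is None:
        return False  # rejected by the pattern alone
    return all(int(octet) <= 255 for octet in m.groups())

print(fnmatchcase("1.2.3.3.4.5.6", "*.*.*.*"))  # True: the glob is too loose
print(DOTTED.match("1.2.3.3.4.5.6"))            # None
print(valid_ip("128.248.90.226"))               # True
print(valid_ip("266.266.266.266"))              # False
```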

Mr.Ghost says, "neat"
Trispis nods.

Javelin says, "The regmatch() function, which can do regexp matching for you
too, allows you to assign the parts matched in parens to %q1-%q9 registers,
and to choose which register is used for which part. Read its help for
details."

REGMATCH()
(Help text from TinyMUSH 2.2.4, with permission)
regmatch(<string>,<regexp>[,<register list>])
regmatchi(<string>,<regexp>[,<register list>])

This function matches the regular expression <regexp> against the
entirety of <string>, returning 1 if it matches and 0 if it does not.
regmatchi() does the same thing, but case-insensitively.

If <register list> is specified, there is a side-effect: any
parenthesized substrings within the regular expression will be set
into the specified local registers, in the order they were specified
in the list. <register list> can be a list of one through nine numbers.
If the specified register is -1, the substring is not copied into a
register. Under regmatchi, case of the substring may be modified.

For example, if <string> is 'cookies=30', and <regexp> is '(.+)=([0-9]*)'
(parsed; note that escaping may be necessary), then the 0th substring
matched is 'cookies=30', the 1st substring is 'cookies', and the 2nd
substring is '30'. If <register list> is '0 3 5', then %q0 will become
"cookies=30", %q3 will become "cookies", and %q5 will become "30".
If <register list> was '0 -1 5', then the "cookies" substring would
simply be discarded.

See 'help regexp syntax' for an explanation of regular expressions.
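
To make the register list concrete, here is a small Python imitation of the
behaviour described in the help text above. It is a sketch, not the MUSH
implementation: the function and its registers dictionary are hypothetical,
and it assumes an unanchored search, so add ^ and $ to the pattern if you
need the whole string to match.

```python
import re

def regmatch(string, regexp, register_list="", registers=None,
             ignore_case=False):
    """Return 1 or 0; optionally copy substrings into a registers dict."""
    m = re.search(regexp, string, re.IGNORECASE if ignore_case else 0)
    if m is None:
        return 0
    if registers is not None:
        # Substring 0 is the whole match, then each paren group in order.
        substrings = [m.group(0)] + list(m.groups())
        for name, text in zip(register_list.split(), substrings):
            if name != "-1":  # -1 means: do not copy this substring
                registers[name] = text or ""
    return 1

q = {}
regmatch("cookies=30", "(.+)=([0-9]*)", "0 3 5", q)
print(q)  # {'0': 'cookies=30', '3': 'cookies', '5': '30'}

q = {}
regmatch("cookies=30", "(.+)=([0-9]*)", "0 -1 5", q)
print(q)  # {'0': 'cookies=30', '5': '30'}
```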

REGEXP SYNTAX
Topic: REGEXP SYNTAX

The following explanation is taken from Henry Spencer's regexp(3)
package, the regular expression library used in PennMUSH:

A regular expression is zero or more branches, separated by
`|'. It matches anything that matches one of the branches.

A branch is zero or more pieces, concatenated. It matches a
match for the first, followed by a match for the second,
etc.

A piece is an atom possibly followed by `*', `+', or `?'.
An atom followed by `*' matches a sequence of 0 or more
matches of the atom. An atom followed by `+' matches a
sequence of 1 or more matches of the atom. An atom followed
by `?' matches a match of the atom, or the null string.

Continued in 'help regexp syntax2'.


REGEXP SYNTAX2
An atom is a regular expression in parentheses (matching a
match for the regular expression), a range (see below), `.'
(matching any single character), `^' (matching the null
string at the beginning of the input string), `$' (matching
the null string at the end of the input string), a `\'
followed by a single character (matching that character), or
a single character with no other significance (matching that
character).

A range is a sequence of characters enclosed in `[]'. It
normally matches any single character from the sequence. If
the sequence begins with `^', it matches any single
character not from the rest of the sequence. If two
characters in the sequence are separated by `-', this is
shorthand for the full list of ASCII characters between them
(e.g. `[0-9]' matches any decimal digit). To include a
literal `]' in the sequence, make it the first character
(following a possible `^'). To include a literal `-', make
it the first or last character.

Continued in 'help regexp ambiguity' and 'help regexp examples'.
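
A few of the atoms and ranges described above, tried out in Python. This is
an illustrative sketch only; Python's re accepts the same bracket forms used
here, but it is a different library from the one the help text documents,
and the helper name matches_one is made up.

```python
import re

def matches_one(char_class, ch):
    """True if the bracket expression matches the single character ch."""
    return re.fullmatch(char_class, ch) is not None

print(matches_one("[0-9]", "7"))    # True: a decimal digit
print(matches_one("[^0-9]", "7"))   # False: leading ^ negates the range
print(matches_one("[]x]", "]"))     # True: ] placed first is literal
print(matches_one("[0-9-]", "-"))   # True: - placed last is literal

# `.' matches any single character; `\.' matches only a literal dot.
print(re.fullmatch(r"a.b", "axb") is not None)   # True
print(re.fullmatch(r"a\.b", "axb") is not None)  # False
```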


Javelin says, "And note that by default, regmatch() is case-sensitive
(regmatchi isn't). But matching on $commands and ^patterns is case-insensitive
unless you also set the CASE flag on that attribute."

REGEXP AMBIGUITY
Topic: REGEXP AMBIGUITY

If a regular expression could match two different parts of
the input string, it will match the one which begins
earliest. If both begin in the same place but match
different lengths, or match the same length in different
ways, life gets messier, as follows.

In general, the possibilities in a list of branches are
considered in left-to-right order, the possibilities for
`*', `+', and `?' are considered longest-first, nested
constructs are considered from the outermost in, and
concatenated constructs are considered leftmost-first. The
match that will be chosen is the one that uses the earliest
possibility in the first choice that has to be made. If
there is more than one choice, the next will be made in the
same manner (earliest possibility) subject to the decision
on the first choice. And so forth.

Continued in 'help regexp ambiguity2'.


REGEXP AMBIGUITY2
For example, `(ab|a)b*c' could match `abc' in one of two
ways. The first choice is between `ab' and `a'; since `ab'
is earlier, and does lead to a successful overall match, it
is chosen. Since the `b' is already spoken for, the `b*'
must match its last possibility (the empty string), since it
must respect the earlier choice.

In the particular case where no `|'s are present and there
is only one `*', `+', or `?', the net effect is that the
longest possible match will be chosen. So `ab*', presented
with `xabbbby', will match `abbbb'. Note that if `ab*' is
tried against `xabyabbbz', it will match `ab' just after
`x', due to the begins-earliest rule. (In effect, the
decision on where to start the match is the first choice to
be made, hence subsequent choices must respect it even if
this leads them to less-preferred alternatives.)
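
The examples from the help text can be checked in Python, whose re module
makes the same choices on these particular patterns (an assumption worth
re-testing on the MUSH itself, since the libraries differ):

```python
import re

# (ab|a)b*c against 'abc': the earlier branch `ab' wins, so b* is empty.
m = re.search(r"(ab|a)b*c", "abc")
first_branch = m.group(1)
print(first_branch)  # ab

# One `*' and no `|': the longest possible match is chosen.
longest = re.search(r"ab*", "xabbbby").group(0)
print(longest)  # abbbb

# Begins-earliest beats longest: the short match after `x' is taken.
earliest = re.search(r"ab*", "xabyabbbz")
print(earliest.group(0), earliest.start())  # ab 1
```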


Javelin can show you a monster function using regexps here. :)

<101> Mr.Ghost says, "How do you use parens and brackets in regmatch()? Escape
them with \ ?"
<101> Javelin says, "I think you can use parens as you'd expect, and you must
escape brackets, yeah."
<101> Javelin says, "But that's tricky, so test it."

Javelin says, "Well, let me just describe an interesting application, rather
than kill you with code."

<101> Sylvia says, "Too bad Talek's not around, he'd be able to give exact
rules =)"
<101> Mr.Ghost says, "Yeah, cause if you need to eval in regmatch() etc etc
... you're stuck"
<101> Javelin sticks out his tongue. :)
<101> mith says, "hrm. that's why my think regmatch()'s weren't working (:"

Javelin says, "Lots of people connect to this mush from different sites."
Javelin says, "LASTSITE only stores their last site. For some people, I'd like
to know all of the sites they use."

<101> Sylvia giggles, and responds in kind. "Well the RegEx stuff, was, at one
time his pride and joy you know =)"
<101> Javelin says, "Not the MUSH regexp stuff."
<101> Javelin hacked that in, from TinyMUSH, actually. :)

Javelin says, "But here's the thing. These days, lots of people connect from
dialups which have names like dialup-1234-342-23.twink.net"
Javelin says, "So if I just collected all their sites, I'd have an attribute
full of useless information."
Javelin says, "What I really want is to decide if I should add their site or
not, based on whether it 'looks like' a site I've seen already or not."
Javelin says, "That is, whether it's the same, except for some numbers near
the front being different."
Javelin says, "Regexps are very handy for this kind of problem (which is hard,
even with regexps, but nearly impossible without)"
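
One possible way to approach the 'looks like' test, sketched in Python. This
is not Javelin's function (the log doesn't show it); it is a guess at the
idea, under the assumption that only the digits in the first part of the
hostname vary. The names site_shape and looks_new are hypothetical, and
every hostname except the twink.net one from the lecture is invented.

```python
import re

def site_shape(hostname):
    """Replace each run of digits in the first label with '#'."""
    first, dot, rest = hostname.partition(".")
    return re.sub(r"[0-9]+", "#", first) + dot + rest

def looks_new(hostname, known_sites):
    """True if no known site has the same shape as hostname."""
    shape = site_shape(hostname.lower())
    return all(site_shape(site.lower()) != shape for site in known_sites)

known = ["dialup-1234-342-23.twink.net"]
print(site_shape(known[0]))                          # dialup-#-#-#.twink.net
print(looks_new("dialup-9-8-7.twink.net", known))    # False: same shape
print(looks_new("ppp12.example.org", known))         # True: worth recording
```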
Javelin says, "Well, this has gone on way longer than I expected. I'll hang
out for questions and to look at code if you're working on something with
regexps, but we'll quit the Javelin talks part here. I hope you've found this
at least a bit enlightening. Read the help on regexp and regmatch and
associated topics for lots more detail!"
Javelin says, "Oh! One last thing, that I can't forget..."

Kyieren says, "Everyone in this room wins a wizbit?"

China blinks.

Trispis says, "On Ky's MUSH?"

Yusif fathoms regexps now. yay.

Kyieren says, "No, here :)"

Javelin says, "If any of you are hardcode hackers, the standard unix utility
'egrep' lets you search files for lines that match a regular expression, so
next time you need to know where to find fun_element or fun_item, you can:
egrep 'fun_(element|item)' src/*.c"
Javelin is now really done. Thanks for coming!

Kyieren says, "You ended the log?"

Xyrxwyrth cheers.

Javelin says, "Tris is in charge of the log."

Kyieren says, "Oh"
Kyieren says, "t< YOU ended the loG?"
Kyieren says, "Whoa"

Trispis says, "Log is still running for followup questions."

Kyieren says, "Ah"

Trispis intends to practice by rewriting all of his toolkit commands in
regexp.

Mr.Ghost has a question but it was from earlier in the day on a buffer gone to
the bit void. :P

Kyieren says, "Uh, so every command will have a tk- prefix? :P"

Javelin grins.

Mr.Ghost says, "What you showed me earlier, Jav, I might understand now, drat
:)"

China sighs.

Kyieren says, "I could make a grand, spammy exit that would leave you all
oooh'ing and ahhhh'ing at my softcode ability, but I think I'll settle for a
subtler exit :)"

Javelin says, "Why the sigh, China? If you continue to sigh, I'll speak in
haiku."

Trispis says, "Don't feel bad, China. I'm still a bit confused, but intend to
work on it."

China says, "I followed it up until it got complicated."

Mr.Ghost woos and found it.

Trispis says, "There seems to be a lot of logic left of the colon with
regexp."

Javelin says, "All things start tricky. By the thousand-and-tenth time,
they're easier"

Trispis says, "Instead of just globbing and narrowing the input after
receiving it, now I have to think about restricting the input to specifics."

Krevinek says, "That is the point, the logic is handled by the rather ugly
stuff to left, and so you do less on the right."

Octavian hmms. l33t h@1k00.

Trispis nods to Krev.

Kyieren laughs at Oct.

Javelin says, "By left, right, or both, when there's tough parsing to do,
something is hellish."

Krevinek nods.

Trispis parsed that, believe it or not.
Trispis says, "I still wanna know about the 'smart matching'. I'm gonna have
to write a test command."

Mr.Ghost says, "If I had a string 234Dogs, how would I get the 234 bit and the
Dogs bit in the registers? I see the regmatch(2Dogs,^([0-9])(.*),0 1) logic,
but what about (# Dynamic)Dogs, 3Dogs, 7578253Dogs?"

Krevinek says, "Hmm... I think I know :)"

Javelin says, "What exactly do you want to match and reject?"

Mr.Ghost says, "I want the numbers to go to %q0 and Chars to go to %q1, should
I separate them with something like "|" and if so, why not use before() and
after() ?"

Javelin says, "But the string is any number of numbers, followed by any number
of chars?"

mith says, "([0-9]+)"

Mr.Ghost says, "Yeah, it spawns from that vsub() thing you expressed earlier
today"

Javelin offers: regmatch(string,^([0-9]+)([^0-9].*),0 1)
Javelin says, "This is tricky, and uses something I didn't teach you. Here's
the problem."

Mr.Ghost smacks head

<101> Trispis says, "Explain why the ?'s come where they do, please.
&reg.smart tk=$^t(e(s(t)?)?)?$:@pemit %#=success"

Javelin says, "The naive solution is ([0-9]+)(.*)"

Mr.Ghost says, "I see, too easy"

<101> Trispis thought they'd come immediately after each letter.

Javelin says, "But this solution won't work right in many cases, because it's
ambiguous. Should 12dog be parsed as 12 - dog, or 1 - 2dog?"
Javelin says, "I think Penn's regexp matcher gives 12 - dog, but I'm not sure
it always will. So I prefer to match 'one or more numbers', followed by 'a
non-number and then zero or more anything'."
Javelin says, "[^0-9] means 'any character but not 0-9'"
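
The difference between the naive and the stricter pattern can be seen in a
short Python sketch. Python's re happens to give 12 - dog for the naive
pattern as well, but the stricter one doesn't rely on that. The names NAIVE,
STRICT and split_count are made up for this example.

```python
import re

NAIVE = re.compile(r"^([0-9]+)(.*)")
STRICT = re.compile(r"^([0-9]+)([^0-9].*)")

def split_count(text, pattern=STRICT):
    """Return (digits, rest) or None if the pattern does not match."""
    m = pattern.match(text)
    return m.groups() if m else None

print(split_count("234Dogs"))      # ('234', 'Dogs')
print(split_count("7578253Dogs"))  # ('7578253', 'Dogs')
print(split_count("12", NAIVE))    # ('12', '')
print(split_count("12"))           # None: a non-number must follow
```

Going by the regmatch() help quoted earlier, a register list of '0 1' would
put the whole match in %q0 and the digits in %q1; a list like '-1 0 1' looks
like what would put the digits in %q0 and the rest in %q1. Test it before
relying on it.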

<101> Javelin says, "Ok, we read t(e(s(t?)?)?)? as 't', followed by an
optional e(s(t?)?)?"

Mr.Ghost may just end up going with a static line of values :P

<101> Trispis saw it for a brief flash of a second.
<101> Javelin says, "Which means either t, or te, or te followed by an
optional (st?)?"
<101> Javelin is screwing it up, but the logic is right.
<101> Javelin says, "Which means either t, or te, or tes followed by a t?"
<101> Javelin says, "Which means either t, or te, or tes, or test."
<101> Trispis sees it as a forest. I'll see the trees in a minute.... or a
week.
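
A quick way to convince yourself of that reading is to throw candidate words
at the pattern. The sketch below does it in Python, on the assumption that
nested optional groups behave the same there; the word list is arbitrary.

```python
import re

# Trispis's 'smart matching' pattern: each ? makes the rest optional.
SMART = re.compile(r"^t(e(s(t)?)?)?$")

words = ["t", "te", "tes", "test", "ts", "tet", "tests", ""]
accepted = [w for w in words if SMART.match(w)]
print(accepted)  # ['t', 'te', 'tes', 'test']

# te? and t(e)? accept the same strings; the parens only add a capture.
same = all(
    bool(re.fullmatch(r"te?", w)) == bool(re.fullmatch(r"t(e)?", w))
    for w in words
)
print(same)  # True
```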

Javelin drats, and lost his haiku roll.

<101> Trispis says, "so, te? is the same as t(e)?, correct?"
<101> Trispis tries to build it.
<101> Javelin says, "Right."
<101> Trispis says, "te? == t(e)?"
<101> Javelin says, "You know there's an optional t at the end, so build it
backward. t? is on the right somewhere."
<101> Trispis says, "t(e)?, t(e(s)?)?"
<101> Trispis says, "t(e)?, t(e(s)?)?, t(e(s(t)?)?)?"
<101> Javelin says, "But if there's a final t, there must have been an s
before it, so (st?)?"
<101> Javelin nods.
<101> Sylvia says, "Start with the innermost one, then work back out -- it's
easier than trying to understand it left to right."
<101> Trispis says, "Ah. Okay. A logical shortcut there. Yeah. For the final
character."

Javelin is heading out for bed. :)
Javelin waves. Have fun.

Trispis waves. Thanks, Jav.

China says, "thank you, sir."