Intermediate regex use

March 2nd, 2009

Now regexes are a wonderful thing. Personally I can’t get enough of them. They are terse, compact, to-the-point and have an extremely high geek value. Below I will show an example of how you can use simple regular expression rules to accomplish a non-trivial task.

Shortening a text

Often we will want to show a short excerpt of a full text, for instance on a page that lists articles. A common but ugly way to do this is via substr(), which cuts the text wherever the character count runs out, even mid-word. A much more elegant way is via a regular expression.

An easy way to do this is via the following one-liner.

preg_replace('/(.{1,250}\.)\s.*/ms','\1', $text );

What this does is match the text through to its end, with a sub-selection that captures the longest run of 1 to 250 characters that is followed by a dot. It then replaces the matched text with that sub-selection. The result is an excerpt of at most 251 characters (250 plus the closing dot) that ends on a full sentence.

There is still the case where no dot followed by whitespace occurs within that range. The expression is not anchored to the start of the text, so it may then match further into the text or not match at all, and the result can come out longer than intended. So you will have to test the output of the preg_replace. But more often than not, the above will work.
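Testing the output could look something like the sketch below. The function name make_excerpt, the $max parameter, the early return for short texts and the substr() fallback are all assumptions made for this illustration; only the expression itself comes from the one-liner.

```php
<?php
// Hypothetical helper: the name, the $max parameter and both fallbacks
// are illustrative assumptions, not part of the original one-liner.
function make_excerpt(string $text, int $max = 250): string
{
    // Short texts need no trimming. Lengths are counted in bytes here,
    // because the expression does not use the "u" modifier.
    if (strlen($text) <= $max) {
        return $text;
    }

    $excerpt = preg_replace('/(.{1,' . $max . '}\.)\s.*/ms', '\1', $text);

    // preg_replace() returns null on error, and hands back a text that is
    // still too long when no sentence ends within the first $max characters.
    if ($excerpt === null || strlen($excerpt) > $max + 1) {
        return rtrim(substr($text, 0, $max)) . '...';
    }

    return $excerpt;
}

$text = str_repeat('This is a sentence. ', 20); // 400 characters
echo make_excerpt($text, 50), "\n"; // This is a sentence. This is a sentence.
```

The early return matters more than it looks: a text shorter than the limit that ends on a dot with nothing after it would otherwise lose its last sentence, because the expression insists on whitespace after the dot.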

Now let’s look at what is happening.

/(.{1,250}

Right at the start we open a sub-expression (a capturing group). The expression is not anchored, but the engine tries it from the beginning of the text first. The dot character gets interpreted as “any possible character except newline” by default; the “s” modifier discussed further down lifts that restriction. We then use curly braces to set a quantifier of 1 to 250 on the dot character, which means we want to match any run of characters that is between 1 and 250 characters long.

Since the default behaviour for quantifiers is to be greedy, it will try to match the maximum amount of characters.
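To see the greedy behaviour in isolation, here is a small sketch that sets it against a lazy quantifier. The lazy form (a “?” after the closing brace) is not used in the one-liner; it is only here for contrast, and the sample text is made up.

```php
<?php
$text = "One. Two. Three.";

// Greedy: takes as many characters as it can, then backs off
// until the next character is a dot.
preg_match('/(.{1,250}\.)/s', $text, $greedy);

// Lazy (note the ? after the quantifier): takes as few characters
// as it can, so it stops at the first dot.
preg_match('/(.{1,250}?\.)/s', $text, $lazy);

echo $greedy[1], "\n"; // One. Two. Three.
echo $lazy[1], "\n";   // One.
```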

\.)\s

Now that we have defined the amount of text we want to match, we are going to define the ending of this match. This, as stated before, is simply a dot. We have to escape that dot so that it matches a literal dot rather than any character. Then we end the sub-expression, since this is everything we want to keep.

However, it’s not everything we want to test for. So right after the sub-expression I ask to match a whitespace character (\s). This ensures that the dot is actually the ending of a sentence and not, for instance, a dot inside a URL or domain name.
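A quick sketch of what the whitespace test buys you. The sample sentence is made up, and the limit is lowered from 250 to 35 purely to keep the example small.

```php
<?php
$text = "Read the manual. It lives on php.net these days, apparently.";

// Without \s the greedy match backs off to the dot inside "php.net".
$without = preg_replace('/(.{1,35}\.).*/s', '\1', $text);

// With \s that dot is rejected (an "n" follows it, not whitespace),
// so the match backs off further to the end of the first sentence.
$with = preg_replace('/(.{1,35}\.)\s.*/s', '\1', $text);

echo $without, "\n"; // Read the manual. It lives on php.
echo $with, "\n";    // Read the manual.
```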

.*/ms','\1', $text );

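Now that we are done with the desired selection we simply state “.*”, which will match any character zero or more times. Thanks to the “s” modifier this swallows the rest of the text, newlines included, so everything after the excerpt is part of the match and gets replaced.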

Then in the replacement section of preg_replace we state that we want to backreference the result of the first sub-expression with \1 (PHP also accepts $1 here), and we are done.

As you might have noticed, I added an “m” and an “s” to the end of my matching expression. These are pattern modifiers that change the behaviour of the expression.

The “m” stands for multiline. It makes the ^ and $ anchors match at the start and end of every line, instead of only at the start and end of the whole text. Since the expression above contains no anchors, the “m” is not strictly needed here. The “s” modifies the behaviour of the “.” character to also match newlines, which is what lets the expression run across an entire text. You will often see these two modifiers combined, because without the “s” modifier an expression like /.*/m would still only match up to the first newline.
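The difference between the two modifiers is easy to check on a two-line string. This is just a sketch with made-up sample text and variable names:

```php
<?php
$text = "first line\nsecond line";

preg_match('/.*/', $text, $plain);   // the dot stops at the newline
preg_match('/.*/m', $text, $multi);  // "m" does not change the dot
preg_match('/.*/s', $text, $dotall); // "s" lets the dot cross the newline

echo strlen($plain[0]), "\n";  // 10
echo strlen($multi[0]), "\n";  // 10
echo strlen($dotall[0]), "\n"; // 22

// "m" affects the anchors instead: ^ now also matches after each newline.
$countPlain = preg_match_all('/^\w+/', $text);  // 1 ("first")
$countMulti = preg_match_all('/^\w+/m', $text); // 2 ("first", "second")
```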

Now the above is certainly not watertight. It’s just to illustrate how versatile regular expressions can be and how you can take advantage of their more advanced uses.

If you are interested in learning more about regular expressions in PHP I would advise you to go read the PCRE pattern modifiers page and the regexp reference page.
