Nothing To Lose

If you don’t have it, how can you lose it!

Regular Expression — Use of \b (Word Edges) Part 1

March 18, 2009 By: Dexter Category: Regular Expressions, Tutorial

The actual post became a bit too long, so I have chopped it into 3 parts.

Here is a small explanation of what the ‘backslash characters’ \b and \B do when used in a regular expression. I will try to demonstrate it using the grep command under Linux.

In most places you will find that the official explanation says:
\b Match the empty string at the edge of a word.
\B Match the empty string provided it’s not at the edge of a word.

Before you proceed, you should be clear about how a word is defined with respect to a regex, or rather what will be considered a word when you use options, meta characters, etc. that work on words. Check out the earlier entry Regular Expression “Word Boundary”.

If you are already clear in your understanding of how a word is defined, we can proceed.

Take the following sentences/strings:

This is a line that has a cat.
cat command is used to concatenate two or more files.
old weapon used to hit birds is catapult.
Be unique do not be a copycat.

We will use the string ‘cat’ as the reference for understanding how \b and \B work. Open your favorite text editor, copy the above strings into a file and save it as, say, ‘wordedge.txt’.

Before we proceed with grep, let’s make sure grep is set to display the match in a different color.
Run the following command: $ alias grep='grep --color=always'

First run the command directly for the string ‘cat’ on the file wordedge.txt. Your output should be similar to the following:

$ grep 'cat' wordedge.txt
This is a line that has a cat.
cat command is used to concatenate two or more files.
old weapon used to hit birds is catapult.
Be unique do not be a copycat.

In the above case grep looks for the character ‘c’ followed by ‘a’ followed by ‘t’ and selects that sequence wherever it occurs, even inside a longer word such as ‘concatenate’.
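To see every individual match rather than whole lines, a small sketch like the following may help. It assumes GNU grep (for the -o option) and recreates the sample file in a temporary location; the variable names f and hits are only illustrative.

```shell
# Recreate the sample file from this post in a temporary location.
f=$(mktemp)
cat > "$f" <<'EOF'
This is a line that has a cat.
cat command is used to concatenate two or more files.
old weapon used to hit birds is catapult.
Be unique do not be a copycat.
EOF

# -o prints each matched piece on its own line; wc -l counts them.
hits=$(grep -o 'cat' "$f" | wc -l | tr -d ' ')
echo "occurrences of cat: $hits"

rm -f "$f"
```

The second sample line contributes two matches, one for the word cat and one inside concatenate, which is why the count is higher than the number of lines.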

\b is used to match the pattern at the edge(s) of the word.
\B is used to match the pattern which is not at the edge of the word.

Using \b: Finding pattern at the left edge of words.

Enter the following command (note the \b at the beginning of the pattern):

$ grep '\bcat' wordedge.txt
This is a line that has a cat.
cat command is used to concatenate two or more files.
old weapon used to hit birds is catapult.

Notice that in the first and second lines of the output ‘cat’ is selected where it is a complete word. This is because the word cat stands alone, so the pattern ‘cat’ begins at the left edge of the word. The ‘cat’ inside ‘concatenate’ is not selected, and the line with ‘copycat’ is not printed at all, because in both cases ‘cat’ does not start a word.

The third line of the output makes it clear: ‘cat’ is selected from ‘catapult’ because it is at the left edge of the word.
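As a rough check of the left-edge behaviour, the sketch below lists every word that begins with cat. It assumes GNU grep, since \b, \w and -o are GNU extensions; the names f and left are illustrative only.

```shell
# Recreate the sample file from this post in a temporary location.
f=$(mktemp)
cat > "$f" <<'EOF'
This is a line that has a cat.
cat command is used to concatenate two or more files.
old weapon used to hit birds is catapult.
Be unique do not be a copycat.
EOF

# \bcat anchors cat to the left edge of a word; \w* then takes the rest of that word.
left=$(grep -o '\bcat\w*' "$f")
printf '%s\n' "$left"

rm -f "$f"
```

Words such as concatenate and copycat do not appear in the result because cat is not at their left edge.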

Using \b: Finding pattern at the right edge of words.

Enter the following command (note the \b at the end of the pattern):

$ grep 'cat\b' wordedge.txt
This is a line that has a cat.
cat command is used to concatenate two or more files.
Be unique do not be a copycat.

Notice that in the first and second lines of the output ‘cat’ is selected where it is a complete word. This is because the word cat stands alone, so the pattern ‘cat’ ends at the right edge of the word.

So one point to understand is that if the pattern occurs as a standalone word, it matches at both the right and the left edge.

The third line of the output makes it clear: ‘cat’ is selected from ‘copycat’ because it is at the right edge of the word.
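The right-edge case can be checked the same way. This is a minimal sketch, again assuming GNU grep for \b, \w and -o; the names f and right are illustrative.

```shell
# Recreate the sample file from this post in a temporary location.
f=$(mktemp)
cat > "$f" <<'EOF'
This is a line that has a cat.
cat command is used to concatenate two or more files.
old weapon used to hit birds is catapult.
Be unique do not be a copycat.
EOF

# \w* takes any leading word characters; cat\b requires cat to end the word.
right=$(grep -o '\w*cat\b' "$f")
printf '%s\n' "$right"

rm -f "$f"
```

Here catapult and concatenate drop out, since in both of them cat is followed by another word character.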

Using \bpattern\b

So what happens when we put ‘\b’ on both sides of our pattern?
Enter the following command (note the \b at both ends of the pattern):

$ grep '\bcat\b' wordedge.txt
This is a line that has a cat.
cat command is used to concatenate two or more files.

So when ‘\b’ is used on both sides, only the whole word is selected.
It can be interpreted as: cat should be at the beginning of the word as well as at the end of the word, which means it has to be exactly that word.
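For this sample file, the whole-word form can also be written with \< and \> or with the -w option, both of which are covered in the earlier “Word Boundary” entry. The sketch below compares the three spellings; it assumes GNU grep, and the variable names a, b and c are illustrative.

```shell
# Recreate the sample file from this post in a temporary location.
f=$(mktemp)
cat > "$f" <<'EOF'
This is a line that has a cat.
cat command is used to concatenate two or more files.
old weapon used to hit birds is catapult.
Be unique do not be a copycat.
EOF

a=$(grep '\bcat\b' "$f")   # edge on both sides
b=$(grep '\<cat\>' "$f")   # explicit start-of-word and end-of-word
c=$(grep -w 'cat' "$f")    # whole-word option

printf '%s\n' "$a"
if [ "$a" = "$b" ] && [ "$a" = "$c" ]; then
  echo "all three forms select the same lines"
fi

rm -f "$f"
```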

Next About:
Regular Expression — Use of \B (NOT Word Edge)

Regular Expression “Word Boundary”

January 28, 2009 By: Dexter Category: Regular Expressions, Tutorial

Word Boundary:

(The explanation below is for grep under Linux using Bash.)

Before you can use regular expressions effectively, you need to clearly understand what part of a given string will be treated as a word.

Here are a few points:

  • Any run of characters from the set [a-zA-Z0-9_] is considered to be a word.
  • Any other character between runs of characters from the above set is a word separator.
  • A run of [a-zA-Z0-9_] that continues to the end of the string or line is a word.
  • A run of [a-zA-Z0-9_] from the beginning of the string until any other character is encountered is a word.
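The points above can be sketched by splitting a string into its words. The example uses one of the sample lines that appears later in this post and assumes a grep that supports -o and -E (GNU grep does); the variable names line and words are illustrative.

```shell
line='Some are mixedmice, catmicecat and micecat.'

# Print every maximal run of word characters on its own line.
words=$(printf '%s\n' "$line" | grep -oE '[a-zA-Z0-9_]+')
printf '%s\n' "$words"
```

The comma, the spaces and the full stop act as separators, and mice never shows up on its own because it is always glued to other word characters in this line.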

Take the following sentences/strings:


This line contains mice.
mice in the beginning.
is there one in the end! mice
Some mice are free.
Some are trapped like this !mice. !mice!  (mice) .
Some are mixedmice, catmicecat and micecat.


Notice I have put ‘mice’ in many places in the strings. Open your favourite text editor, copy the above strings into a file and save it as, say, ‘mice.txt’.

Before we proceed with grep, let’s make sure grep is set to display the match in a different color.
Run the following command:

$ alias grep='grep --color=always'

Running the grep command directly on the file mice.txt for ‘mice’, you will notice that all the lines that contain mice are selected. With the color alias set, each match is highlighted in the terminal:

$ grep 'mice' mice.txt
This line contains mice.
mice in the beginning.
is there one in the end! mice
Some mice are free.
Some are trapped like this !mice. !mice!  (mice) .
Some are mixedmice, catmicecat and micecat.

Notice that wherever the letters ‘mice’ appear in sequence, they are selected.

To understand how words are differentiated, run either of the following commands to search for the word ‘mice’:

$ grep '\<mice\>' mice.txt
OR
$ grep -w 'mice' mice.txt

This line contains mice.
mice in the beginning.
is there one in the end! mice
Some mice are free.
Some are trapped like this !mice. !mice!  (mice) .

Let’s understand the output one line at a time:

Output line 0) mice in the beginning.
The zeroth line of the output contains ‘mice’ at the beginning, followed by a character that is not in the set [a-zA-Z0-9_].
So any run of characters from [a-zA-Z0-9_] that starts the string and is followed by a character outside the set is a word.

Output line 1) is there one in the end! mice
The first line of the output contains ‘mice’ at the end, and just before the ‘m’ there is a space, which is not in the set [a-zA-Z0-9_]. This makes it a word.

Output line 2) Some mice are free.
The second line contains ‘mice’ with a space before and after it. Since a space is not part of the word character set, ‘mice’ is considered a word. This is also the usual everyday way of separating words.

Output line 3) Some are trapped like this !mice. !mice!  (mice) .
The third line is interesting since three entities are selected. The first is !mice. Notice that the characters before and after mice are not in the set [a-zA-Z0-9_]. This makes it a word. Similarly, for the other two, !mice! and (mice), the surrounding characters are not in the set [a-zA-Z0-9_]. This makes them words.
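The line-by-line reading above can be double-checked by counting. The sketch below assumes GNU grep (for \<, \> and -o) and rebuilds mice.txt in a temporary file; the variable names f, lines and words are illustrative.

```shell
# Recreate mice.txt from this post in a temporary location.
f=$(mktemp)
cat > "$f" <<'EOF'
This line contains mice.
mice in the beginning.
is there one in the end! mice
Some mice are free.
Some are trapped like this !mice. !mice!  (mice) .
Some are mixedmice, catmicecat and micecat.
EOF

# -c counts lines that contain at least one whole-word match.
lines=$(grep -c '\<mice\>' "$f")
# -o prints each whole-word match separately, so wc -l counts the matches.
words=$(grep -o '\<mice\>' "$f" | wc -l | tr -d ' ')
echo "matching lines: $lines, whole-word matches: $words"

rm -f "$f"
```

The “trapped” line accounts for three of the matches, and the last line of the file contributes none.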

So in strings like the following, the mice in between will be treated as a word:
#@$$$mice##
abcde!mice.ran
can.you.see.the.mice.here

In the following mice will not be treated as a word:
Three blindmice
micemice
redmice
mincemice
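Both lists can be tried out directly. The following is only a sketch, assuming GNU grep for \< and \>; is_word is a hypothetical helper name introduced here for illustration, not something from grep itself.

```shell
# Hypothetical helper: succeeds when its argument contains mice as a standalone word.
is_word() {
  printf '%s\n' "$1" | grep -q '\<mice\>'
}

result=""
# Single quotes keep $, ! and # literal.
for s in '#@$$$mice##' 'abcde!mice.ran' 'can.you.see.the.mice.here' \
         'Three blindmice' 'micemice' 'redmice' 'mincemice'; do
  if is_word "$s"; then r=word; else r=not-a-word; fi
  echo "$s -> $r"
  result="$result $r"
done
```

The first three strings are reported as containing the word, and the last four are not.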

Hope this helped you understand what is considered a word from the point of view of a regex.