Nothing To Lose

If you don’t have it, how can you lose it!
Archive for March, 2009

Counting occurrences of a word/pattern using grep and wc

March 29, 2009 By: Dexter Category: Regular Expressions, Tutorial

Sometimes you need to count how many times a word/pattern occurs in a file (text).
Here is a simple way to do that with the commands grep and wc.
Let's use the following file as an example (I have named it sample.txt):

Linux is a nice operating system.
Many people thing Linux just has a text based interface.
When people see the GUI on Linux they are really amazed.
Linux was developed by Linus Torvalds.
Linux is to UNIX, so if you have worked on Linux you will be able to work on UNIX also.

Now to count how many times “Linux” occurs in the text you can use:

grep -o 'Linux' sample.txt | wc -l

Explanation:
The grep command with the -o option prints only the part of each line that matches the pattern in quotes, in this case ‘Linux’, with every match on a line of its own. If you run just that command you will get output like this:

grep -o 'Linux' sample.txt
Linux
Linux
Linux
Linux
Linux
Linux

Now in the actual command

grep -o 'Linux' sample.txt | wc -l

this output is piped to wc -l. wc is a word count tool, and with the -l option it counts the number of lines it receives. Since grep -o prints each match on its own line, the number of lines equals the number of occurrences of the text/pattern/word.
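
If you want to try the whole thing in one go, here is a small sketch that recreates the sample file and runs the pipeline. The scratch directory and the variable names tmp and total are my own choices for illustration, nothing more.

```shell
# Recreate the sample file from this post in a scratch directory.
tmp=$(mktemp -d)
cat > "$tmp/sample.txt" <<'EOF'
Linux is a nice operating system.
Many people thing Linux just has a text based interface.
When people see the GUI on Linux they are really amazed.
Linux was developed by Linus Torvalds.
Linux is to UNIX, so if you have worked on Linux you will be able to work on UNIX also.
EOF

# grep -o prints one match per line; wc -l counts those lines.
total=$(grep -o 'Linux' "$tmp/sample.txt" | wc -l)
echo "occurrences of Linux: $total"
```

Note that ‘Linus’ in the fourth line is not counted, and the last line contributes two matches.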

Remember that grep is case sensitive, so use the -i option to ignore case:

grep -io 'Linux' sample.txt | wc -l

Otherwise only ‘Linux’ with exactly that capitalisation is counted, and variants such as ‘linux’ or ‘LINUX’ are ignored.
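
The sample file only contains the spelling ‘Linux’, so the difference does not show up there. Here is a minimal sketch with a made-up line of text (the text variable is hypothetical, it is not part of sample.txt) that shows what -i changes.

```shell
# Hypothetical input with mixed capitalisation (not from sample.txt).
text='Linux, linux and LINUX are the same word to a reader.'

# Without -i only the exact spelling is counted.
exact=$(printf '%s\n' "$text" | grep -o 'Linux' | wc -l)

# With -i every capitalisation is counted.
anycase=$(printf '%s\n' "$text" | grep -io 'Linux' | wc -l)

echo "case sensitive: $exact, ignoring case: $anycase"
```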

Of course, if you only want to know how many lines contain a particular word/pattern, and not the number of occurrences of the word/pattern itself, use

grep -c 'Linux' sample.txt
OR
grep -ic 'Linux' sample.txt   # to ignore case
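
The two counts differ as soon as a line contains the word more than once. A small sketch using only the last line of the sample file (the variable names are just for illustration):

```shell
# The last line of sample.txt mentions Linux twice.
line='Linux is to UNIX, so if you have worked on Linux you will be able to work on UNIX also.'

# -c counts matching lines, so a line with two matches still counts once.
matching_lines=$(printf '%s\n' "$line" | grep -c 'Linux')

# -o | wc -l counts every single match.
occurrences=$(printf '%s\n' "$line" | grep -o 'Linux' | wc -l)

echo "lines: $matching_lines, occurrences: $occurrences"
```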

If you are familiar with pattern matching, you can replace ‘Linux’ with a regular expression to look for occurrences of a particular pattern.

e.g. ‘(Linux|UNIX|AIX)’ will look for occurrences of Linux or UNIX or AIX.

Note that plain grep uses basic regular expressions, so do not forget to escape the brackets and the pipe symbol, i.e. write '\(Linux\|UNIX\|AIX\)'. Alternatively use grep -E, which accepts the unescaped form.
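
Here is a rough sketch of the three spellings side by side, again on the last line of the sample file. It assumes GNU grep, because the escaped pipe in a basic regular expression is a GNU extension; the variable names are mine.

```shell
line='Linux is to UNIX, so if you have worked on Linux you will be able to work on UNIX also.'

# Unescaped in basic mode: brackets and pipe are literal characters, so nothing matches.
literal=$(printf '%s\n' "$line" | grep -o '(Linux|UNIX|AIX)' | wc -l)

# Escaped in basic mode (GNU grep): real grouping and alternation.
escaped=$(printf '%s\n' "$line" | grep -o '\(Linux\|UNIX\|AIX\)' | wc -l)

# Extended mode: the unescaped form works as a regular expression.
extended=$(printf '%s\n' "$line" | grep -Eo '(Linux|UNIX|AIX)' | wc -l)

echo "literal: $literal, escaped: $escaped, extended: $extended"
```

The escaped and the extended form should give the same count (two for Linux plus two for UNIX here), while the unescaped basic form finds nothing.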

Hope that was useful.

NOTE: This explanation is with respect to BASH Shell (GNU bash, version 3.1.17(2)) with grep (GNU grep) 2.5

Regular Expression — Using of \b and \B together. Part 3

March 18, 2009 By: Dexter Category: Regular Expressions, Tutorial

Using \B and \b together: Finding the pattern at one edge of a longer word.

Continues from Previous Post: Regular Expression — Use of \B (Not Word Edges) Part 2

For ‘\B’ at the beginning of the pattern and ‘\b’ at the end of the pattern


$ grep '\Bcat\b' wordedge.txt
Be unique do not be a copycat.


When ‘\B’ is at the beginning of the pattern and ‘\b’ is at the end, ‘\B’ makes sure that the pattern does not start at the beginning of a word and ‘\b’ makes sure that the pattern ends at the right edge of the word.
So in this case the pattern is always found at the right edge of a longer word.
This is different from using just ‘\b’ at the end of the pattern, because ‘\b’ alone also selects the standalone word ‘cat’.
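
To see the difference in numbers, here is a small sketch that recreates wordedge.txt and counts the matching lines for both patterns. It assumes GNU grep (where \b and \B are available); the scratch directory and variable names are only for illustration.

```shell
# Recreate wordedge.txt in a scratch directory.
tmp=$(mktemp -d)
cat > "$tmp/wordedge.txt" <<'EOF'
This is a line that has a cat.
cat command is used to concatenate two or more files.
old weapon used to hit birds is catapult.
Be unique do not be a copycat.
EOF

# 'cat\b' also accepts the standalone word cat.
right_edge=$(grep -c 'cat\b' "$tmp/wordedge.txt")

# '\Bcat\b' only accepts cat at the end of a longer word.
longer_only=$(grep -c '\Bcat\b' "$tmp/wordedge.txt")

echo "cat\\b: $right_edge lines, \\Bcat\\b: $longer_only line"
```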

Now vice-versa:

For ‘\b’ at the beginning of the pattern and ‘\B’ at the end of the pattern.


$ grep '\bcat\B' wordedge.txt
old weapon used to hit birds is catapult.


When ‘\b’ is at the beginning of the pattern and ‘\B’ is at the end, ‘\b’ makes sure that the pattern starts at the beginning of a word and ‘\B’ makes sure that the pattern does NOT end at the right edge of the word.
So in this case the pattern is always found at the left edge of a longer word.
This is different from using just ‘\b’ at the beginning of the pattern, because ‘\b’ alone also selects the standalone word ‘cat’.
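
Another way to look at it is to feed grep one word per line. The sketch below does that with the four ‘cat’ words from wordedge.txt; it assumes GNU grep, and the variable names are my own.

```shell
# One word per line, so every output line is exactly one matching word.
words='cat
concatenate
catapult
copycat'

# '\bcat' accepts any word that starts with cat, including cat itself.
starts_with=$(printf '%s\n' "$words" | grep '\bcat')

# '\bcat\B' additionally demands that the word goes on after cat.
longer_only=$(printf '%s\n' "$words" | grep '\bcat\B')

printf 'starts with cat:\n%s\n' "$starts_with"
printf 'starts with cat and continues:\n%s\n' "$longer_only"
```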

Hope this was useful.

Regular Expression — Use of \B (Not Word Edges) Part 2

March 18, 2009 By: Dexter Category: Regular Expressions, Tutorial

Using \B: Finding the pattern when it is NOT at the LEFT edge of words.

Continues from Previous Post: Regular Expression — Use of \b (Word Edges) Part 1

Enter the following command (note the \B at the start of the pattern)


$ grep '\Bcat' wordedge.txt
cat command is used to concatenate two or more files.
Be unique do not be a copycat.


Notice the first line of the output. The cat in ‘concatenate’ is selected as it is not at the (left) edge of the word. The standalone ‘cat’ at the start of that line is not selected.

The second line has cat selected in ‘copycat’ as it is not at the left edge of the word.

So if you use ‘\B’ at the beginning of the pattern, grep looks for the pattern only after the first character of a word.

Using \B: Finding the pattern when it is NOT at the RIGHT edge of words.

Enter the following command (note the \B at the end of the pattern)


$ grep 'cat\B' wordedge.txt
cat command is used to concatenate two or more files.
old weapon used to hit birds is catapult.


Notice the first line of the output. The cat in ‘concatenate’ is selected as it is not at the (right) edge of the word.
The second line has cat selected in ‘catapult’ as it is not at the right edge of the word.
So if you use ‘\B’ at the end of the pattern, grep looks for the pattern only before the last character of a word.
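
Since grep prints whole lines, it is not always obvious which word produced the match. The sketch below first splits wordedge.txt into one word per line with tr and then applies both \B patterns. It assumes GNU grep; the tr step and the variable names are my own addition.

```shell
# Recreate wordedge.txt in a scratch directory.
tmp=$(mktemp -d)
cat > "$tmp/wordedge.txt" <<'EOF'
This is a line that has a cat.
cat command is used to concatenate two or more files.
old weapon used to hit birds is catapult.
Be unique do not be a copycat.
EOF

# Turn every run of non-alphanumeric characters into a newline: one word per line.
words=$(tr -cs '[:alnum:]' '\n' < "$tmp/wordedge.txt")

# Words where cat is NOT at the left edge.
not_left=$(printf '%s\n' "$words" | grep '\Bcat')

# Words where cat is NOT at the right edge.
not_right=$(printf '%s\n' "$words" | grep 'cat\B')

printf 'not at left edge:\n%s\n' "$not_left"
printf 'not at right edge:\n%s\n' "$not_right"
```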

Using \Bpattern\B

So what happens when we put ‘\B’ on both sides of our pattern?
Enter the following command (note the \B at both ends of the pattern)


$ grep '\Bcat\B' wordedge.txt
cat command is used to concatenate two or more files.


So when ‘\B’ is used on both sides, the pattern is only found inside a word.

It can be interpreted as: ‘cat’ should NOT be at the beginning of the word and also NOT at the end of the word, which means it has to be inside some longer word. That is why the line is selected because of ‘concatenate’ and not because of the standalone ‘cat’ at its start.
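
If you want to check single words in a script, grep -q (quiet, only the exit status matters) is handy. Here is a minimal sketch, assuming GNU grep, that tests the four ‘cat’ words one by one; the loop and the inside variable are only for illustration.

```shell
inside=''
for word in cat concatenate catapult copycat; do
  # grep -q prints nothing; the exit status says whether the pattern matched.
  if printf '%s\n' "$word" | grep -q '\Bcat\B'; then
    echo "$word: cat is strictly inside the word"
    inside="${inside:+$inside }$word"
  else
    echo "$word: no match"
  fi
done
```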

NEXT: Using \B and \b together

Regular Expression — Use of \b (Word Edges) Part 1

March 18, 2009 By: Dexter Category: Regular Expressions, Tutorial

The actual post became a bit too long, so I have chopped it into 3 parts.

Here is a small explanation of what the ‘Backslash Characters’ \b and \B do when used in a regular expression. I will try to demonstrate it using the grep command under Linux.

In most places you will find that the official explanation says:
\b Match the empty string at the edge of a word.
\B Match the empty string provided it’s not at the edge of a word.
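
“Matching the empty string” sounds abstract, so here is a rough way to make those empty positions visible. This sketch uses sed instead of grep and assumes GNU sed, which supports \b and \B as extensions; the marker characters and variable names are arbitrary.

```shell
# Put a | at every word edge (every position where \b matches).
edges=$(echo 'cat copycat' | sed 's/\b/|/g')

# Put a - at every position inside the word that is not an edge.
inner=$(echo 'cat' | sed 's/\B/-/g')

echo "$edges"
echo "$inner"
```

The markers land between characters, not on them, which is exactly what the empty string in the definition refers to.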

Before you proceed, you should be clear about how a word is defined with respect to a regex, or rather what will be considered a word when you use options, meta characters etc. that work on words. Check out the earlier entry Regular Expression “Word Boundary”.

If you are already clear in your understanding of how a word is defined, we can go ahead.

Take the following sentences/strings:

This is a line that has a cat.
cat command is used to concatenate two or more files.
old weapon used to hit birds is catapult.
Be unique do not be a copycat.

We will use the string ‘cat’ as reference for understanding how \b and \B work. Open your favorite text editor, copy the above strings into a file and save it as, say, ‘wordedge.txt’.

Before we proceed with grep, let's make sure grep is set to display the match in a different color.
Run the following command: $ alias grep='grep --color=always'

First run the command directly for the string ‘cat’ on the file wordedge.txt. Your output should be similar to the one below:

$ grep 'cat' wordedge.txt
This is a line that has a cat.
cat command is used to concatenate two or more files.
old weapon used to hit birds is catapult.
Be unique do not be a copycat.

In the above case grep looks for the character ‘c’ followed by ‘a’ followed by ‘t’ and selects that sequence wherever it occurs.

\b is used to match the pattern at the edge(s) of the word.
\B is used to match the pattern which is not at the edge of the word.

Using \b: Finding the pattern at the left edge of words.

Enter the following command (note the \b at the beginning of the pattern)

$ grep '\bcat' wordedge.txt
This is a line that has a cat.
cat command is used to concatenate two or more files.
old weapon used to hit birds is catapult.

Notice that in the first and second lines of the output ‘cat’ is selected as a complete word. This is because the word cat stands alone, so the pattern ‘cat’ begins at the left edge of the word.

Line three of the output makes it clear: ‘cat’ is selected from ‘catapult’ because it is at the left edge of the word.

Using \b: Finding the pattern at the right edge of words.

Enter the following command (note the \b at the end of the pattern)

$ grep 'cat\b' wordedge.txt
This is a line that has a cat.
cat command is used to concatenate two or more files.
Be unique do not be a copycat.

Notice that in the first and second lines of the output ‘cat’ is selected as a complete word. This is because the word cat stands alone, so the pattern ‘cat’ ends at the right edge of the word.

So one point to understand is that if the pattern is present as a standalone word, it matches for both the right and the left edge.

Line three of the output makes it clear: ‘cat’ is selected from ‘copycat’ because it is at the right edge of the word.
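
To see which words grep actually picked for the left and the right edge, you can let -o print the whole word around the match. The sketch below assumes GNU grep, because it uses \w (a word character) in addition to \b; the variable names are mine.

```shell
# Recreate wordedge.txt in a scratch directory.
tmp=$(mktemp -d)
cat > "$tmp/wordedge.txt" <<'EOF'
This is a line that has a cat.
cat command is used to concatenate two or more files.
old weapon used to hit birds is catapult.
Be unique do not be a copycat.
EOF

# cat at the left edge, followed by the rest of the word.
left_words=$(grep -o '\bcat\w*' "$tmp/wordedge.txt")

# The start of the word, followed by cat at the right edge.
right_words=$(grep -o '\w*cat\b' "$tmp/wordedge.txt")

printf 'left edge:\n%s\n' "$left_words"
printf 'right edge:\n%s\n' "$right_words"
```

The standalone ‘cat’ shows up in both lists, once for each line that contains it.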

Using \bpattern\b

So what happens when we put ‘\b’ on both sides of our pattern?
Enter the following command (note the \b at both ends of the pattern)

$ grep '\bcat\b' wordedge.txt
This is a line that has a cat.
cat command is used to concatenate two or more files.

So when ‘\b’ is used on both sides, only the whole word is selected.
It can be interpreted as: cat should be at the beginning of the word and also at the end of the word, which means it has to be the whole word.
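
For a plain word like this, grep also has the -w option, which asks for whole-word matches without writing \b yourself. A small sketch, assuming GNU grep, that compares the two on wordedge.txt (the variable names are only for illustration):

```shell
# Recreate wordedge.txt in a scratch directory.
tmp=$(mktemp -d)
cat > "$tmp/wordedge.txt" <<'EOF'
This is a line that has a cat.
cat command is used to concatenate two or more files.
old weapon used to hit birds is catapult.
Be unique do not be a copycat.
EOF

# Whole word via explicit word edges.
with_edges=$(grep -c '\bcat\b' "$tmp/wordedge.txt")

# Whole word via the -w option.
with_w=$(grep -cw 'cat' "$tmp/wordedge.txt")

echo "\\bcat\\b: $with_edges lines, -w: $with_w lines"
```

On this file both give the same count. I would not rely on them being identical for every possible pattern, so treat -w as a shortcut for the simple cases.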

Next About:
Regular Expression — Use of \B (NOT Word Edge)