Nothing To Lose

If you don’t have it, how can you lose it!
Archive for the ‘Regular Expressions’

Catching email ids from file(s) using grep and other utils

April 18, 2009 By: Dexter Category: BASH, Linux Commands, Regular Expressions, Shell Scripting, Tutorial

Here is a simple mechanism that you can use to collect all the email ids from one or more files into a single file. To do this we will be using the following commands: cat, grep, sort and uniq.

This one-liner should do the work:

cat file | grep -io '\<[^-.][0-9A-Za-z._-]\+@[0-9A-Za-z.]\+\>' | sort | uniq

If you want all the ids in a file, redirect the output of the above command to that file:

cat file | grep -io '\<[^-.][0-9A-Za-z._-]\+@[0-9A-Za-z.]\+\>' | sort | uniq > mailid.txt
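
As a rough sketch, the same pipeline can be wrapped in a small function so that it works on several files at once. The function name extract_emails and the sample text are made up for illustration, and GNU grep is assumed (for -o, \< and \>). The character list is written as [0-9A-Za-z._-] because grep takes a backslash inside square brackets literally.

```shell
#!/bin/bash
# extract_emails: hypothetical helper, prints the unique email-like
# strings found in the files given as arguments (assumes GNU grep).
extract_emails() {
    # -h: no file names in the output, -i: ignore case, -o: print only the match
    grep -hio '\<[^-.][0-9A-Za-z._-]\+@[0-9A-Za-z.]\+\>' -- "$@" | sort | uniq
}

# Small demonstration on a throwaway sample file
sample=$(mktemp)
cat > "$sample" <<'EOF'
Contact: john.doe@example.com or JANE_ROE@example.org
Duplicate entry: john.doe@example.com
No address on this line.
EOF

# Prints the two distinct addresses, one per line
extract_emails "$sample"
rm -f "$sample"
```

Calling it as extract_emails *.txt > mailid.txt would collect the ids of all matching files in one go.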

Now let's convert this into a shell script that accepts a directory name from the user. This directory is the one containing the files that hold the email ids.
You can download the script from here: Script to retrieve Email ids from files in a directory
If copying and pasting the code below does not work, the usual cause is formatting characters such as curly quotes, which have to be replaced by plain ones.

#!/bin/bash
clear
echo -n "Enter the name of a DIRECTORY from where you want to pick up email id's: "
read -r dirname

# check if the entered name is a directory
if [ -d "$dirname" ]; then
    cd "$dirname" || exit 1  # if it exists change to the directory
else
    echo "+============================+"
    echo "| Check your directory name! |"
    echo "+============================+"
    exit 1
fi

# temporary file in the user's home dir, named after the PID of this script
tmpfile=~/$$
: > "$tmpfile"  # create it empty so that sort always has a file to read

# Loop through all files in the given directory
for file in *
do
    if [ ! -d "$file" ]; then
        # process each file and append the matches to the temporary file
        echo "Processing file $file"
        grep -Eio '\<[^-.][0-9A-Za-z._-]+@[0-9A-Za-z.]+\>' -- "$file" >> "$tmpfile"
        echo "Processed"
    fi
done

cd - > /dev/null  # get back to the previous working dir

# sort the email ids in the file, remove duplicates and store them in the final file
sort "$tmpfile" | uniq > ~/emailids.$$

# remove the temporary file
rm "$tmpfile"

# tell the user where the mail ids are stored
echo
echo "+=================================================+"
echo " Your email ids are available in ~/emailids.$$ "
echo "+=================================================+"
exit 0

I should warn you that the regular expression will catch anything that looks like an email id, so the result may contain strings that are not valid email addresses at all.
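
One possible way to cut down the noise is to pass the results through a second, stricter filter. The sketch below keeps only candidates whose domain ends in a dot followed by at least two letters. The name filter_plausible is made up, and the rule is only a heuristic, not a real address validator.

```shell
#!/bin/bash
# filter_plausible: hypothetical post-filter. Reads one candidate per line
# on stdin and keeps those whose domain ends in ".<two or more letters>".
filter_plausible() {
    grep -E '^[^@]+@[0-9A-Za-z-]+(\.[0-9A-Za-z-]+)*\.[A-Za-z]{2,}$'
}

# "user@localhost" and "ab@c" look like email ids to the loose pattern
# but have no dotted domain, so only "user@example.com" is printed.
printf '%s\n' 'user@example.com' 'user@localhost' 'ab@c' | filter_plausible
```

It could be appended to the one-liner just before sort, at the cost of dropping ids on hosts without a dotted domain name.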

Counting occurrences of a word/pattern using grep and wc

March 29, 2009 By: Dexter Category: Regular Expressions, Tutorial

Sometimes you will need to count how many times a word/pattern occurs in a file (text).
Here is a simple use of the commands grep and wc to do that.
Let's use the following file as an example. (I have named it sample.txt)

Linux is a nice operating system.
Many people think Linux just has a text based interface.
When people see the GUI on Linux they are really amazed.
Linux was developed by Linus Torvalds.
Linux is to UNIX, so if you have worked on Linux you will be able to work on UNIX also.

Now to count how many times “Linux” occurs in the text you can use:

grep -o 'Linux' sample.txt | wc -l

Explanation:
The grep command with the -o option prints only the text that matches the pattern in quotes, in this case ‘Linux’, with each match on its own line. If you run just that command you will get output like

grep -o 'Linux' sample.txt
Linux
Linux
Linux
Linux
Linux
Linux

Now in the actual command

grep -o 'Linux' sample.txt | wc -l

this output is piped to wc, which is a word count tool. With the -l option it counts the number of lines that it receives, and since grep -o prints each match on a separate line, the number of lines is equal to the number of occurrences of the text/pattern/word.

Remember that grep is case sensitive, so use the -i option to ignore case:

grep -io 'Linux' sample.txt | wc -l

Otherwise only the exact spelling ‘Linux’ will be selected, and occurrences such as ‘linux’ or ‘LINUX’ will be ignored.

Of course, if you only want to know how many lines contain a particular pattern/word, and not the count of the word/pattern itself, use

grep -c 'Linux' sample.txt
OR
grep -ic 'Linux' sample.txt   # to ignore case
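
The difference between the two counts can be checked with a short sketch like the one below. It recreates the sample file in a temporary directory; the helper names count_matches and count_lines are made up for this example, and GNU grep is assumed for the -o option.

```shell
#!/bin/bash
# Recreate the sample.txt of this post in a temporary directory
workdir=$(mktemp -d)
cat > "$workdir/sample.txt" <<'EOF'
Linux is a nice operating system.
Many people think Linux just has a text based interface.
When people see the GUI on Linux they are really amazed.
Linux was developed by Linus Torvalds.
Linux is to UNIX, so if you have worked on Linux you will be able to work on UNIX also.
EOF

# count_matches: hypothetical helper, counts every occurrence of a pattern
count_matches() {
    grep -o -- "$1" "$2" | wc -l
}

# count_lines: hypothetical helper, counts the lines that contain the pattern
count_lines() {
    grep -c -- "$1" "$2"
}

# The last line contains "Linux" twice, so the two numbers differ:
# 6 occurrences spread over 5 lines.
echo "occurrences: $(count_matches 'Linux' "$workdir/sample.txt")"
echo "lines:       $(count_lines 'Linux' "$workdir/sample.txt")"
rm -rf "$workdir"
```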

If you are familiar with pattern matching, you can replace ‘Linux’ with a regular expression to look for occurrences of a particular pattern.

e.g. ‘(Linux|UNIX|AIX)’ will look for occurrences of Linux or UNIX or AIX.

Note: if you are going to use plain grep, do not forget to escape the brackets and the pipe symbol, i.e. write ‘\(Linux\|UNIX\|AIX\)’. With grep -E (or egrep) the pattern works as written.
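
The two spellings of the alternation can be compared with a small sketch. The sample sentence and the function names count_bre and count_ere are assumptions made for this example; the escaped \| form relies on GNU grep.

```shell
#!/bin/bash
# Hypothetical one-line sample; any text would do
text='Linux is similar to UNIX, and AIX is a UNIX too.'

# Plain grep (basic regular expressions): escape the brackets and the pipe.
# \| is a GNU grep extension to basic regular expressions.
count_bre() {
    printf '%s\n' "$1" | grep -o '\(Linux\|UNIX\|AIX\)' | wc -l
}

# grep -E (extended regular expressions): no escaping needed
count_ere() {
    printf '%s\n' "$1" | grep -Eo '(Linux|UNIX|AIX)' | wc -l
}

# Both report the same number of matches (Linux, UNIX, AIX, UNIX)
echo "plain grep: $(count_bre "$text")"
echo "grep -E:    $(count_ere "$text")"
```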

Hope that was useful.

NOTE: This explanation is with respect to BASH Shell (GNU bash, version 3.1.17(2)) with grep (GNU grep) 2.5


Regular Expression — Using of \b and \B together. Part 3

March 18, 2009 By: Dexter Category: Regular Expressions, Tutorial

Using \B and \b together: Finding a pattern at the RIGHT edge of words, but NOT as a whole word.

Continues from Previous Post: Regular Expression — Use of \B (Not Word Edges) Part 2

For ‘\B’ at the beginning of the pattern and ‘\b’ at the end of the pattern


$ grep '\Bcat\b' wordedge.txt
Be unique do not be a copycat.


When ‘\B’ is at the beginning of the pattern and ‘\b’ at the end of the pattern, ‘\B’ makes sure that the pattern is not at the beginning of the word and ‘\b’ at the end makes sure that the pattern is at the right edge of the word.
So in this case you will find the pattern only at the right edge of a longer word.
This is different from using just ‘\b’ at the end of the pattern, because ‘\b’ alone also selects the whole word itself if it is the pattern.

Now vice-versa:

For ‘\b’ at the beginning of the pattern and ‘\B’ at the end of the pattern.


$ grep '\bcat\B' wordedge.txt
old weapon used to hit birds is catapult.


When ‘\b’ is at the beginning of the pattern and ‘\B’ at the end of the pattern, ‘\b’ makes sure that the pattern is at the beginning of the word and ‘\B’ at the end makes sure that the pattern is NOT at the right edge of the word.
So in this case you will find the pattern only at the left edge of a longer word.
This is different from using just ‘\b’ at the beginning of the pattern, because ‘\b’ alone also selects the whole word itself if it is the pattern.
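
To see the difference side by side, here is a minimal sketch that runs the four patterns against a made-up word list (one word per line, not the wordedge.txt used in this series). The helper show_matches is hypothetical, and \b and \B are assumed to be available as in GNU grep.

```shell
#!/bin/bash
# Hypothetical word list, one word per line
words=$(mktemp)
printf '%s\n' cat copycat catapult concatenate > "$words"

# show_matches: hypothetical helper, prints the words selected by a
# pattern ($1) from a file ($2) on a single line, separated by spaces
show_matches() {
    grep -- "$1" "$2" | paste -sd ' ' -
}

show_matches 'cat\b'   "$words"   # cat copycat  (whole word is allowed)
show_matches '\Bcat\b' "$words"   # copycat      (right edge, never the whole word)
show_matches '\bcat'   "$words"   # cat catapult (whole word is allowed)
show_matches '\bcat\B' "$words"   # catapult     (left edge, never the whole word)
rm -f "$words"
```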

Hope this was useful.


Regular Expression — Use of \B (Not Word Edges) Part 2

March 18, 2009 By: Dexter Category: Regular Expressions, Tutorial

Using \B: Finding a pattern that is NOT at the LEFT edge of words.

Continues from Previous Post: Regular Expression — Use of \b (Word Edges) Part 1

Enter the following command (note the \B at the start of the pattern)


$ grep '\Bcat' wordedge.txt
cat command is used to concatenate two or more files.
Be unique do not be a copycat.


Notice the first line of the output: the cat in ‘concatenate’ is selected as it is not at the (left) edge of the word. The word ‘cat’ at the start of that line is not what matched.

The second line has cat selected in ‘copycat’ as it is not at the left edge of the word.

So if you use ‘\B’ at the beginning of the pattern, the pattern is matched only when it starts after the first character of a word.

Using \B: Finding a pattern that is NOT at the RIGHT edge of words.

Enter the following command (note the \B at the end of the pattern)


$ grep 'cat\B' wordedge.txt
cat command is used to concatenate two or more files.
old weapon used to hit birds is catapult.


Notice the first line of the output: the cat in ‘concatenate’ is selected as it is not at the (right) edge of the word.
The second line has cat selected in ‘catapult’ as it is not at the right edge of the word.
So if you use ‘\B’ at the end of the pattern, the pattern is matched only when it ends before the last character of a word.

Using \Bpattern\B

So what happens when we put ‘\B’ on both sides of our pattern?
Enter the following command (note the \B at both ends of the pattern)


$ grep '\Bcat\B' wordedge.txt
cat command is used to concatenate two or more files.


So when ‘\B’ is used on both sides, the whole pattern is searched for within a word.

It can be interpreted as: ‘cat’ should NOT be at the beginning of the word and NOT at the end of the word either, which means it has to be inside some other word.
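
Since plain grep prints the whole line, it is not always obvious which word was hit. As a sketch, the pattern can be wrapped in \w* on both sides and combined with -o so that only the surrounding word is printed. The helper matched_words is made up for this example, the three lines are taken from the outputs shown above, and \w, \B and -o are assumed to behave as in GNU grep.

```shell
#!/bin/bash
# Three lines taken from the outputs shown in this post
sample=$(mktemp)
cat > "$sample" <<'EOF'
cat command is used to concatenate two or more files.
Be unique do not be a copycat.
old weapon used to hit birds is catapult.
EOF

# matched_words: hypothetical helper. Wraps the pattern ($1) in \w* on both
# sides so that grep -o prints the whole word around each match, one per line.
matched_words() {
    grep -o -- "\w*$1\w*" "$2"
}

matched_words '\Bcat'   "$sample"   # concatenate, copycat
matched_words 'cat\B'   "$sample"   # concatenate, catapult
matched_words '\Bcat\B' "$sample"   # concatenate
rm -f "$sample"
```

In none of the three cases is the standalone word ‘cat’ at the start of the first line printed.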

NEXT: Using \B and \b together