Saturday, December 22, 2007

Extracting email addresses from multiple files

A few days back, we conducted an LTSP (Linux Terminal Server Project) ToT (Training of Trainers). We invited applications (in .txt format) for the training. We needed 25 applicants but received around 90 applications.
Notifying everyone of their selection status was now a problem. One option was to open the .txt files one by one and copy and paste each email address, which was far too manual and time-consuming (no less than 2 hours). Hence, we thought of an alternative.

I placed all the .txt files in a folder named 'application'. Then I extracted the lines that contained email addresses. The following simple yet powerful command did this for me.

jitendra@jitendra:~/application$ cat *.* | grep '@' > email_lines


cat *.* concatenated every file in the directory whose name contains a dot (here, all the .txt files), and grep extracted the lines containing @. The output was redirected to a file named email_lines. It contained lines like the one shown below:
(g) Email address: abc@gef.com
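As a rough sketch of this step on throwaway data, the snippet below builds two sample application files in a temporary directory and runs the same cat and grep pipeline on them. The file names app1.txt and app2.txt and the addresses in them are made up for illustration and are not from the real applications.

```shell
# Scratch directory with two made-up application files (sample data only).
workdir=$(mktemp -d)
printf '(a) Name: Example One\n(g) Email address: one@example.com\n' > "$workdir/app1.txt"
printf '(a) Name: Example Two\n(g) Email address: two@example.org\n' > "$workdir/app2.txt"

# Concatenate the files and keep only the lines that contain an @.
cat "$workdir"/*.txt | grep '@' > "$workdir/email_lines"

# Show what was collected.
cat "$workdir/email_lines"
```

Running grep -h '@' on the files directly should give the same lines without the cat; the -h option stops grep from prefixing each line with the file name.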

This command alone was enough to simplify our task, and I could have finished with about 15 more minutes of manual processing. But I wanted to experiment further and make the task even simpler.

Now what I wanted was only the part after the : on each line. This was accomplished using the cut command. I went about it like this.

jitendra@jitendra:~/application$ cut -d: -f2 email_lines > email_list

The cut command extracts selected sections from each line of a file. Here -d defines : as the delimiter and -f selects the field. 2 specifies the second field, that is, the text to the right of the delimiter : . 1 would have specified the field to the left of the delimiter. email_lines is the file from which the field is extracted. I redirected the output to email_list.
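To see how the fields are split, here is a small sketch that runs cut on the sample line from earlier. One detail worth noting is that field 2 keeps the space that follows the colon; piping through tr -d ' ' is one possible way to drop it, which is an addition here and not part of the original command.

```shell
# The sample line from the grep output.
line='(g) Email address: abc@gef.com'

# Field 1 is the text before the colon, field 2 the text after it.
field1=$(printf '%s\n' "$line" | cut -d: -f1)
field2=$(printf '%s\n' "$line" | cut -d: -f2)

# Field 2 still starts with a space; tr -d ' ' deletes every space in it.
trimmed=$(printf '%s\n' "$line" | cut -d: -f2 | tr -d ' ')

# Print each value in brackets so the leading space is visible.
printf '[%s]\n' "$field1" "$field2" "$trimmed"
```

This relies on each line containing exactly one colon before the address; a line with an extra colon would need a different field number or a different tool.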

Now I had a list containing only email addresses. I chose to add the commas manually to create a comma-separated list of email addresses for sending the notification.
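The commas could probably be added from the command line as well. The sketch below is not what I did at the time; it assumes a list with one address per line and joins the lines with the paste command. The second address is made up for illustration.

```shell
# A made-up list with one address per line, standing in for email_list.
list=$(mktemp)
printf 'abc@gef.com\nxyz@example.org\n' > "$list"

# -s joins all lines into a single line, -d, uses a comma as the separator.
emails=$(paste -sd, "$list")
echo "$emails"
```

If the addresses still carry a leading space from the cut step, it would have to be removed before joining.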

The task, which would otherwise have taken more than 2 hours, was over in less than 10 minutes.