Text File Parsing

It’s no secret to anyone that I’m a Unix guy. I’ve worked on Unix-based systems for over 20 years now. If I’m forced to use a Windows machine, the first thing I do is install software to give me a more Unix-like experience. I could give dozens of reasons why I love Unix: programming environments, robust command-line utilities, stability, and so on. But one of the most useful to me is the ability to perform complex text manipulations from the command line. Parsing log files, editing text files on the fly, creating reports: all of it is easy from a Unix command line. Of all the commands I use, some of my favorites are sed, tr, grep, and cut. With just those four commands I can make magic happen.

For instance, today I had a configuration file for a gitolite server from which I needed to generate a list of repositories. I could open the file in a text editor and edit it… but that’s work. As a programmer, I program things so I don’t have to do trivial work. Besides, given the large number of repositories in the config file, it would take too much time and be prone to error. Knowing the format of the config file, I opened up a command prompt and started stringing together commands to transform the data one step at a time. At the end, I had a list of every repository, nicely sorted alphabetically. This list could then be fed into another process that might perform some maintenance task on each repository. In the end, my solution was a long string of small commands strung together:

cat gitolite.conf | grep 'repos/' | cut -d'=' -f2 | tr ' ' '\n' | tr -d ' ' | grep -v '^$' | grep -v '@' | grep 'repos' | sort
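
To make the explanation below concrete, here is a hypothetical gitolite.conf fragment; the group and repository names are invented purely for illustration and are not from my actual config:

@developers = alice bob
@repolist1  = repos/website repos/api
@repolist2  = repos/build-tools repos/deploy

repo @repolist1 @repolist2
    RW+ = @developers

Fed through the pipeline above, that fragment would produce:

repos/api
repos/build-tools
repos/deploy
repos/website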

While it may look cryptic to the uninitiated, I’ll explain each command and why it was used. First, I use the cat command to display the contents of the gitolite config file. The cat command is useful for displaying text data from a command prompt in Unix, and is almost always the starting point for any Unix text processing. Next, grep finds all lines containing the text 'repos/', which I know is the starting point for all the repository names in the configuration file. Grep is another commonly used Unix command; it searches input for a text string and displays the matching lines. Numerous versions of grep exist providing all kinds of advanced functionality, but basic grep is still the most commonly used.

Now that I have all lines containing repository names, I can begin to process that list. I start by using cut to remove the variable names. Since gitolite allows variables to be defined (@var = value), and I only want the value, I tell cut to split on the equals-sign delimiter (-d'=') and give me only the second field (-f2). Since multiple repository names may be on a single line, I next need to split the data on spaces so that each repository is on its own line. The tr command translates one character to another, so in this instance I change ' ' to '\n' (\n is a newline character, like hitting the return key on the keyboard). Next, I delete any remaining spaces using the -d flag of tr.

At this point, my output contains empty lines, which I want to remove. The -v flag for grep inverts the match, printing only the lines that do not match the supplied pattern. Here the cryptic pattern '^$' is used, where ^ matches the beginning of the line and $ matches the end (these are standard regular-expression anchors), so it matches only empty lines. Next, I run through a few more grep commands to clean up the data, and then pipe the result to sort. Now I have my list of repositories, ready for whatever follow-on processing I want. The last step is to copy the command into a shell script so that I don’t have to type it all in again.
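
For reference, here is a minimal sketch of what that script might look like; the script name and the optional file argument are my own additions, not part of the original command:

#!/bin/sh
# list-repos.sh - print a sorted list of repository names from a gitolite config.
# Usage: ./list-repos.sh [/path/to/gitolite.conf]
conf="${1:-gitolite.conf}"

cat "$conf" | grep 'repos/' | cut -d'=' -f2 | tr ' ' '\n' | tr -d ' ' \
    | grep -v '^$' | grep -v '@' | grep 'repos' | sort

Saved once, the whole transformation becomes a single command the next time I need the list.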

Throughout my career, I have used the above process innumerable times. I can extract any manner of text data from a file quickly and easily, simply by using a string of commands to incrementally transform the data into my desired output. If you’re a programmer and you’re not familiar with the above commands, you’re probably working too hard.