Text File Parsing

It’s no secret to anyone that I’m a Unix guy. I’ve worked on Unix-based systems for over 20 years now. If I’m forced to use a Windows machine, the first thing I do is install software to provide me a more Unix-like experience. I could give dozens of reasons why I love Unix such as programming environments, robust command line utilities, stability, etc. But one of the most useful to me is the ability to perform complex text manipulations from the command line. Parsing log files, editing text files on the fly, creating reports, all easy from a Unix command line. Of all the commands I use, some of my favorites are sed, tr, grep, and cut. With just those four commands I can make magic happen. For instance, today I had a configuration file for a git-o-lite server that I needed to use to generate a list of repositories from. I could open the file in a text editor and edit it… but that’s work. As a programmer, I program things so I don’t have to do trivial work. Besides, given the large number of repositories in the config file it would take too much time and be prone to error. Knowing the format of the config file, I opened up a command prompt and started stringing together commands to transform the data one step at a time.  At the end, I had a list of every repository nicely sorted alphabetically. This list could then be fed into another process that might perform some maintenance task on each repository or perform some other task. In the end, my command was a long string of small commands strung together.

cat gitolite.conf | grep 'repos/' | cut -d'=' -f2 | tr ' ' '\n' | tr -d ' ' | grep -v '^$' | grep -v '@' | grep 'repos' | sort

While it may look cryptic to the uninitiated, I’ll explain each command and why it was used. First, I used the cat file to display the contents of the config file for git-o-lite. The cat command is useful for displaying text data from a command prompt in Unix, and is almost always the starting point for any Unix text-processing. Next, the grep command is executed to find all lines containing the text ‘repos/’ which I know is the starting point for all the repository names in the configuration file. Grep is another commonly used Unix command that is used to search a file for a text string and display matching rows. Numerous versions of grep exist providing all kinds of advanced functionality, but basic grep is still the most commonly used. Now that I have all lines containing repository names, I can begin to process that list.  I start by using cut to remove the variable names. Since Git-o-lite allows variables to defined  (@var = value), and I only want the value, I will tell cut to split on the equal sign delimiter  (-d’=’) and give me only the second field (-f2). Since multiple repository names may be on a single line, I next need to split the data on the space so that each repository is on it’s own line. The tr command will translate one character to another. So, in this instance, I will change ‘ ‘ to ‘\n’ (\n is a newline character – like hitting the return key on the keyboard). Next, I delete any remaining spaces using using the -d flag for the tr command. At this point, my output contains empty lines which I want to remove. The -v argument for grep will remove lines containing the supplied search string. Here, the cryptic ‘^$’ is used where ^ is the beginning of the line and $ is the end of the line (these are standard grep patterns). Next, I run through a few more grep commands to cleanup the data and then pipe the content to the sort function. Now, I have my list of repositories ready for whatever follow on processing I want. Last step, copy the above commands to a shell script so that I don’t have to type all those commands in again.

Throughout my career, I have used the above process innumerable times. I can extract any manner of text data from a file quickly and easily simply by using a string of commands to incrementally transform the data to my desired output. If you are a programmer, and you’re not familiar with the above commands, you’re probably working too hard.

Counting Lines of Code

Often times, while writing software, it’s useful to know how many lines of code exist in a project. While this is a poor metric for gauging a developer’s output, it can be useful in getting a feel for the size of a project. It may also be useful in comparing different languages. For example, how much more concise is Kotlin than Java? Unfortunately, there are no built-in tools for doing this calculation. So, how can you easily count lines of code? I use a simple shell script – which will work on Linux or Mac, to count my code.

#!/bin/bash

extensions=(bas c cc cob cpp cs fth f90 go h html java js jsp m pas php pl py sc sh sql xhtml)

for extension in ${extensions[@]}
do
  lines=`find . -name "*.$extension" -type f -exec cat {} \; | tr -d '[:blank:]' 2> /dev/null | grep -v '^$' | wc -l `
  if [ $lines -ne 0 ]
  then 
    echo $extension: $lines
  fi
done

You can add additional language support by adding the file extension to the extensions list. Then, run this script from the root directory for the project to see the code count.

Python Web Server

As a software developer, I often write HTML files or JavaScript code that I would like to see in my web browser. I could just open the file in my browser, but if I’m testing JavaScript I would rather run from a real server. I could start up an Apache server, or create an instance of NGINX using Docker. But both of those options require too much effort for simply testing an HTML file.  What I really want is just a command that will allow me to start an HTTP server in the folder where the HTML or JavaScript files are saved. Well, thanks to Python, such a option exists. Simply go to the directory where you want the server to run from and execute the command:

python -m SimpleHTTPServer 8080

This will start up a simple HTTP server running on port 8080. Now, point your web browser to localhost:8080 and you can see your web page! Nothing could be simpler for quick testing of web pages or websites!