Practical Stata Programming
Applied Statistics Using Stata®

Tabmiss is a user-written Stata program by Marcelo Coca-Perraillon. It counts the number of missing and non-missing observations for the given variables (or for all variables by default) and prints a table with the counts and their frequencies (as percentages). In this article I go through the program and explain how it works and how it builds the output table.

In the early steps of data preparation, we need to know how many missing observations exist in the data set. **Tabmiss** is a user-written ado program, authored by Marcelo Coca-Perraillon, that counts the number of missing and non-missing observations as well as their frequencies. I used to use this package a lot, although I now summarize missing observations with `misstable summarize` more often because it returns a smaller and more useful table. You can find the tabmiss package by typing `findit tabmiss`. Here is an example of what the command returns using the **auto.dta** dataset.

| Variable | Obs | Missings | Feq.Missings | NonMiss | Feq.NonMiss |
|---|---|---|---|---|---|
| price | 22 | 0 | 0 | 22 | 100 |
| mpg | 22 | 0 | 0 | 22 | 100 |
| rep78 | 22 | 1 | 4.545 | 21 | 95.45 |

| Variable | Obs | Missings | Feq.Missings | NonMiss | Feq.NonMiss |
|---|---|---|---|---|---|
| make | 74 | 0 | 0 | 74 | 100 |
| price | 74 | 0 | 0 | 74 | 100 |
| mpg | 74 | 0 | 0 | 74 | 100 |
| rep78 | 74 | 5 | 6.757 | 69 | 93.24 |
| headroom | 74 | 0 | 0 | 74 | 100 |
| trunk | 74 | 0 | 0 | 74 | 100 |
| weight | 74 | 0 | 0 | 74 | 100 |
| length | 74 | 0 | 0 | 74 | 100 |
| turn | 74 | 0 | 0 | 74 | 100 |
| displacement | 74 | 0 | 0 | 74 | 100 |
| gear_ratio | 74 | 0 | 0 | 74 | 100 |
| foreign | 74 | 0 | 0 | 74 | 100 |
| rep79 | 74 | 5 | 6.757 | 69 | 93.24 |
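For comparison, the built-in `misstable summarize` command mentioned above can be run on the same data. This is only a short usage sketch; the variable names come from auto.dta.

```stata
* Load the example data set that ships with Stata
sysuse auto, clear

* Built-in report: by default only variables with missing values are listed
misstable summarize

* Restrict the report to selected variables; the all option also lists
* variables that have no missing values
misstable summarize price mpg rep78, all
```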

**Tabmiss** is an interesting package to look into. First of all, it is a simple program with very basic calculations, i.e. counting and calculating frequencies. In addition, the program is only one paragraph long. It also has a simple syntax and a loop, which makes it more interesting. And finally, the program returns a table, which requires some knowledge of SMCL (Stata Markup and Control Language). Understanding the algorithm and the program can therefore be very instructive for advanced Stata users who wish to practice Stata programming. Still, I consider the program to be beginner-level, which means that a Stata user who is not experienced with Stata programming will be able to make sense of it.

As shown in the example above, **tabmiss** returns a table where the variable list appears in the first column, followed by columns for the number of observations, the number of missing observations, the frequency of missing observations, the number of non-missing observations, and the frequency of non-missing observations. The algorithm of the program can be simplified as described below:

- Define the syntax of the program to take a varlist and the optional `if` and `in` qualifiers
- Make the varlist optional, so that if no variable is given to the program, tabmiss returns the results for all of the variables in the loaded data set
- Print the first row of the table, which holds the title of each column (Variable, Obs, Missings, etc.)
- Send each variable in the variable list through a loop that:
  - Counts the number of observations
  - Counts the number of missing observations
  - Counts the number of non-missing observations
  - Calculates the frequency of missing observations (frequency of missings = (Missings / Observations)*100)
  - Calculates the frequency of non-missing observations (frequency of non-missings = (nonMissings / Observations)*100)
  - Prints the values corresponding to each variable in the columns
- End the program
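The steps above can be sketched as a short program. This is a minimal illustration and not the original tabmiss code: the program name `tabmiss_sketch`, the `version 11` line, the column positions, and the use of `count` with `missing()` (rather than the temporary-variable approach tabmiss takes) are all assumptions made for this sketch.

```stata
* tabmiss_sketch is a hypothetical name; this illustrates the algorithm only
capture program drop tabmiss_sketch
program define tabmiss_sketch
    version 11                      // assumed version number
    syntax [varlist] [if] [in]      // varlist is optional and defaults to all variables
    marksample touse, novarlist     // honor if and in without dropping missing rows

    * Header row and separator line (SMCL directives draw the lines)
    display as text "Variable" _col(14) "{c |}" ///
        _col(19) "Obs" _col(25) "Missings" _col(36) "Freq.Miss" ///
        _col(49) "NonMiss" _col(57) "Freq.NonMiss"
    display as text "{hline 13}{c +}{hline 54}"

    foreach var of local varlist {
        * Counts (this sketch uses count instead of summarize)
        quietly count if `touse'
        local obs = r(N)
        quietly count if missing(`var') & `touse'
        local miss    = r(N)
        local nonmiss = `obs' - `miss'

        * Frequencies in percent
        local fmiss = 100 * `miss' / `obs'
        local fnon  = 100 * `nonmiss' / `obs'

        * One table row per variable
        display as text %-12s abbrev("`var'", 12) _col(14) "{c |}" ///
            as result %7.0f `obs' %11.0f `miss' %12.4g `fmiss' ///
            %11.0f `nonmiss' %13.4g `fnon'
    }
end
```

After running `sysuse auto, clear`, typing `tabmiss_sketch` with no arguments should list every variable, while `tabmiss_sketch price mpg rep78 if foreign` restricts both the variables and the rows.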


Next, I explain the program code step by step to show how the program works.

The program begins by declaring the name of the program, i.e. **tabmiss**, and the version of Stata that the program should be run with, i.e. version

The first row of the returned output is the table header, which gives the title of each column. It should therefore be printed before the results of the calculations. Here is how the table and the column titles are defined. For creating the table,

You might wonder that what does the ^{th} character. Let's explain this in more details to make sure you understand it clearly. The table begins with displaying the

So overall this string is 22 characters long. The

The difference between the two commands in the example above is that the first command adds 7 empty characters between

To complete the table with the counts and relative frequencies of the missing and non-missing observations, **tabmiss** runs all the variables through a loop and prints the values in the table. In the loop, the


So how does the program count the number of missing observations in each variable? There are many solutions to this problem, but tabmiss creates a temporary binary variable as an indicator of missingness. The variable is generated with the `egen` command. Therefore, the temporary variable will have value of

Once the temporary variable is generated, there are again different ways of counting it. **Tabmiss** obtains the count by running the `summarize` command on the observations flagged as missing and reading the number of observations that `summarize` returns (type `return list` after the summarize command to see all the scalars that the command returns).
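The counting step can be sketched as follows. This is an illustration under assumptions: the source does not show which `egen` function tabmiss uses, so `rowmiss()` is a guess that produces a 0/1 indicator for a single variable, and the macro names `miss`, `nmiss`, and `nobs` are placeholders. Run it as a do-file so that the temporary variable and the local macros stay in scope.

```stata
sysuse auto, clear

* Temporary indicator: 1 where rep78 is missing, 0 otherwise
* (rowmiss() is an assumed choice of egen function)
tempvar miss
egen `miss' = rowmiss(rep78)

* r(N) after summarize is the number of observations used,
* here only those where the indicator equals 1
quietly summarize `miss' if `miss' == 1
local nmiss = r(N)

* Without the condition, r(N) is the total number of observations
quietly summarize `miss'
local nobs = r(N)

display "missing: `nmiss'   total: `nobs'"
```

For `rep78` in auto.dta this gives 5 missing out of 74 observations, in line with the table shown earlier.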

Tabmiss saves the value of the **total number of missing observations** in a local macro named

● Note that `quietly sum` means `quietly summarize`; it does not sum up the variable!
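A quick way to see the difference is to inspect what `summarize` leaves behind in `r()`. The sketch below uses `rep78` from auto.dta purely as an example.

```stata
sysuse auto, clear

* sum is an abbreviation of summarize; quietly suppresses the printed table
quietly sum rep78

* List everything summarize stored: r(N), r(mean), r(sd), r(sum), and so on
return list

* r(N) is the number of non-missing observations used, not a total of the values
display r(N)

* The actual sum of the values is stored separately in r(sum)
display r(sum)
```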

To calculate the total number of observations (both missing and non-missing), tabmiss uses the same procedure that was used for counting the number of missing observations, i.e. using the `summarize` command to return the total number of observations in the

The `summarize` command uses the temporary variable i.e.

To count the number of non-missing observations, the total number of observations which is stored in local

Once you understand that the what


So far, we have calculated the number of missing and non-missing observations as well as their frequencies. All we have to do to finish the program is to print these values in the columns that we defined at the outset of the program. Since this part is inside the loop, the program will loop over the variables and complete the table.
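As a rough sketch of this last step, the lines below compute the counts and frequencies for one variable and print a single table row. The macro names and column widths are assumptions chosen for this example, and `count` with `missing()` stands in for the temporary-variable approach described above. Run it as a do-file, since it relies on `///` line continuation and local macros.

```stata
sysuse auto, clear

* Counts for one variable
quietly count
local obs = r(N)
quietly count if missing(rep78)
local miss    = r(N)
local nonmiss = `obs' - `miss'

* Frequencies in percent, following the formulas in the algorithm
local fmiss = 100 * `miss' / `obs'
local fnon  = 100 * `nonmiss' / `obs'

* One row: name, vertical bar at column 14, then the five values
display as text %-12s "rep78" _col(14) "{c |}" ///
    as result %7.0f `obs' %11.0f `miss' %12.4g `fmiss' ///
    %11.0f `nonmiss' %13.4g `fnon'
```

Inside the actual loop, the literal `"rep78"` would be replaced by the loop variable's name.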

The output of the `display` command can be reformatted. The format syntax differs depending on the content. For string content, the format is

This example makes it clear that when the string variable is shorter than

So we learned how to keep the variables organized in the table, but what if the string (variable name) is longer than **Variables** was 13 characters wide (**tabmiss** considers a limit for the maximum width of a variable name as well, which is 12 characters. The

tabmiss also reformats the counted and calculated numbers. These are all numeric formats, because each local macro holds either an integer (a count) or a frequency. The distance between the numbers is also added using empty strings. To make sure that you are creating a readable table, you should practice reformatting the results very carefully. For a detailed explanation, I refer you to the **[U] manual** (User's Guide), section **Formats: Controlling how data are displayed**.
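The following lines are a small sketch for experimenting with string and numeric display formats. The values and widths are arbitrary examples and are not taken from tabmiss.

```stata
* String formats: %-12s left-aligns in 12 columns, %12s right-aligns
display "[" %-12s "mpg" "]"
display "[" %12s  "mpg" "]"

* abbrev() shortens a name so that it fits in at most 12 characters
display abbrev("a_very_long_variable_name", 12)

* Numeric formats: fixed with no decimals, general with 4 significant
* digits, and fixed with 2 decimals
display %8.0f 74
display %8.4g 6.756757
display %8.2f 93.24324
```

The brackets in the first two lines only make the padding visible: the first prints `[mpg         ]` and the second prints `[         mpg]`.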

Here are a few tasks to begin playing with the program and changing its algorithm as well as its output.

- Add a horizontal line at the end of the table. Also, make the column titles (Variables, Obs, etc.) appear in **bold** text.
- Change the syntax of the program so that the variable list **must be given** to the program.
- Suggest alternative solutions for calculating the number and frequency of non-missing observations.
- Change the output of the program so that it returns only the number of observations, the number of missings, and the frequency of missings in three columns, and name that program **tabmiss2**.

The example ado file below is the commented version of tabmiss.ado that you can download.

tabmiss.ado