Haghish, E. F. (2014). Practical Stata Programming: Analyzing Tabmiss Stata ado program.
Updated on November 23rd, 2014

Tabmiss


Quick Tips

Tabmiss is a user-written Stata program by Marcelo Coca-Perraillon. It counts the number of missing and non-missing observations for the given variables (or for all variables by default) and prints a table with the counts and their relative frequencies (percentages). In this article I go through the program and explain how it works and how it creates the output table.

Introduction

In the early steps of data preparation, we need to know how many missing observations exist in the data set. Tabmiss is a user-written ado program, authored by Marcelo Coca-Perraillon, that counts the number of missing and non-missing observations as well as their relative frequencies. I used to use this package a lot, although these days I more often summarize missing observations with misstable summarize, because it returns a smaller and more useful table. You can find the tabmiss package by typing findit tabmiss. Here is an example of what the command returns using the auto.dta data set.

sysuse auto, clear /* loading the auto data set */
tabmiss price mpg rep78 if foreign == 1

    Variable |     Obs       Missings   Feq.Missings    NonMiss   Feq.NonMiss
-------------+---------------------------------------------------------------
       price |      22           0            0             22          100
         mpg |      22           0            0             22          100
       rep78 |      22           1        4.545             21        95.45

tabmiss /* tabmiss for all variables */
    Variable |     Obs       Missings   Feq.Missings    NonMiss   Feq.NonMiss
-------------+---------------------------------------------------------------
        make |      74           0            0             74          100
       price |      74           0            0             74          100
         mpg |      74           0            0             74          100
       rep78 |      74           5        6.757             69        93.24
    headroom |      74           0            0             74          100
       trunk |      74           0            0             74          100
      weight |      74           0            0             74          100
      length |      74           0            0             74          100
        turn |      74           0            0             74          100
displacement |      74           0            0             74          100
  gear_ratio |      74           0            0             74          100
     foreign |      74           0            0             74          100
       rep79 |      74           5        6.757             69        93.24
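
For comparison, the built-in misstable summarize command mentioned above can be run on the same data. This is only a quick sketch and assumes a Stata release recent enough to ship misstable; the final count is added here purely as a cross-check of the rep78 row.

```stata
* load the example data and summarize missing values with the built-in command
sysuse auto, clear
misstable summarize

* cross-check: count the missing values of rep78 directly
count if missing(rep78)
```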

Tabmiss is an interesting package to look into. First of all, it is a simple program with very basic calculations, i.e. counting and calculating relative frequencies. In addition, the program is very short. It also has a simple syntax and a loop, which makes it more interesting. And finally, the program returns a table, which requires some knowledge of the SMCL language. Therefore, understanding the algorithm and the program can be very instructive for experienced Stata users who wish to start practicing Stata programming. Still, I consider the program beginner-level, which means that a Stata user with no programming experience will be able to make sense of it.

Algorithm

As shown in the example above, tabmiss returns a table where the variable names appear in the first column, followed by columns with the number of observations, the number of missing observations, the frequency of missing observations, the number of non-missing observations, and the frequency of non-missing observations. The algorithm of the program can be simplified as described below:

  • Define the syntax of the program to take a varlist and the if and in qualifiers
  • Make the varlist optional so that if no variable is given to the program, tabmiss will return the results for all of the variables in the loaded data set
  • Print the first row of the table that includes the title of each column (Variable, Obs, Missings, etc.)
  • Each variable of the variable list that is given to the program should go through a loop that:
    • Counts number of observations
    • Counts number of missing observations
    • Counts number of non-missing observations
    • Calculates frequency of missing observations (frequency of missings = (Missings / Observations)*100)
    • Calculates frequency of non-missing observations (frequency of non-missings = (nonMissings / Observations)*100)
    • Prints the values corresponding to each variable in the columns
  • End the program
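
The steps above can be sketched for a single variable as follows. This is not the tabmiss code itself: it swaps in the count command for the counting steps, and the local names obs, miss, nonmiss, and pct are placeholders chosen for this illustration. Run the lines together as a do-file, since local macros do not survive between separately executed selections.

```stata
sysuse auto, clear

* count all observations, then only those where rep78 is missing
quietly count
local obs = r(N)
quietly count if missing(rep78)
local miss = r(N)

* non-missing count and the percentage of missing observations
local nonmiss = `obs' - `miss'
local pct = (`miss' / `obs') * 100

display "Obs: `obs'  Missings: `miss'  NonMiss: `nonmiss'"
display "Percent missing: " %6.2f `pct'
```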

Tabmiss

program define tabmiss
    version 7.0
    syntax [varlist] [if] [in]

    * print the header row and the line beneath it
    di in text "    Variable {c |}     Obs" /*
        */ _col(30) "Missings   Feq.Missings    NonMiss   Feq.NonMiss"
    di in text "{hline 13}{c +}{hline 63}"

    foreach i of local varlist {
        * contar is 1 where the current variable is missing, 0 otherwise
        tempvar contar
        quietly egen `contar' = rmiss(`i') `if' `in'
        * number of missing observations
        quietly sum `contar' if `contar' == 1
        local faltan = r(N)
        * total number of observations in the selected sample
        quietly sum `contar' `if' `in'
        local obser = r(N)
        local nomiss = `obser' - `faltan'
        local feqmiss = (`faltan' / `obser')*100
        local feqno = (`nomiss' / `obser')*100
        * print one row of the table
        display in text %12s abbrev("`i'", 12) " {c |}" /*
            */ as result /*
            */ %8.0g `obser' "   " /*
            */ %9.0g `faltan' "       " %6.0g `feqmiss' "      " /*
            */ %9.0g `nomiss' "       " %6.0g `feqno'
    }

end

Analysis of tabmiss package

Next, I explain the program code step by step to show how the program works.


Syntax

program define tabmiss
version 7.0
syntax [varlist] [if] [in]

The program begins by defining its name, i.e. tabmiss, and the Stata version under which the code should be interpreted, i.e. version 7.0. The syntax command declares that the program accepts a variable list together with the if and in qualifiers for selecting observations. The square brackets make each element optional: if the user does not specify any variables, the program uses all the variables in the data set currently loaded in Stata, and the if and in qualifiers may likewise be omitted.
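
To see what the syntax command hands over to the rest of a program, a tiny throwaway program can print the local macros it creates. The program name showvars is hypothetical and exists only for this sketch; it is not part of tabmiss.

```stata
capture program drop showvars
program define showvars
    version 7.0
    syntax [varlist] [if] [in]
    * syntax stores what the user typed in local macros of the same names
    display "varlist: `varlist'"
    display "if:      `if'"
    display "in:      `in'"
end

sysuse auto, clear
* explicit variable list with an if qualifier
showvars price mpg if foreign == 1
* no variable list: varlist expands to all variables in the data set
showvars
```

When no variables are given, the first line printed by showvars should list every variable in auto.dta, which is the behavior tabmiss relies on for its default.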


Output table

di in text "    Variable {c |}     Obs" /*
*/ _col(30) "Missings   Feq.Missings    NonMiss   Feq.NonMiss"
di in text "{hline 13}{c +}{hline 63}"

The first row of the output is the header that gives the title of each column, so it must be printed before the results of the calculations. The commands above show how the header and the line beneath it are defined. {hline #} draws a horizontal line, where # is the width of the line in characters. {c |} and {c +} draw the vertical divider of the table: {c |} is the SMCL directive for a tall vertical line, and {c +}, which is placed directly under the {c |}, draws the junction where the vertical line started by {c |} crosses the horizontal line. In other words, {c +} connects the vertical and horizontal lines.
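
A smaller table makes the role of each directive easier to see. The sketch below is an illustration only; the column titles and the value 21.3 are made up for the example.

```stata
* header: 7 characters of text, then the vertical line
display in text "  Name {c |} Value"
* rule: 7 characters of line, the junction, then 10 more characters of line
display in text "{hline 7}{c +}{hline 10}"
* one data row, with the value shown in the result style
display in text "  mpg  {c |}" as result "    21.3"
```

Because the text before {c |} and the first {hline} are both 7 characters wide, the junction drawn by {c +} should land directly under the vertical line.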

You might wonder what _col(30) does. It simply prints the second string starting at column 30, i.e. at the 30th character of the line. Let's go through this in more detail. The header begins with the string "    Variable {c |}     Obs", which consists of 4 blank characters + Variable + 1 blank character + {c |} + 5 blank characters + Obs. Note that {c |} counts as a single character. In the line below, the blank characters are shown as hash signs.

"####Variable#{c |}#####Obs"

So overall this string is 22 characters long. The _col(30) directive of the display command moves to column 30 before printing the rest, and therefore skips 7 more characters (columns 23 to 29). The two commands below should therefore yield identical results:

*** Example ***
display "    Variable {c |}     Obs       X"
display "    Variable {c |}     Obs" _col(30) "X"

The difference between the two commands is that the first one puts 7 blank characters between Obs and X inside a single string, while the second uses _col(30) to add the space between two separate strings. The header command of tabmiss shown at the top of this section works the same way.


Analysis Loop

foreach i of local varlist {

To fill the table with the counts and relative frequencies of missing and non-missing observations, tabmiss runs every variable through a loop and prints its values as one row of the table. Here varlist is a local macro, created by the syntax command, that holds the variable list, and `i' refers to the current variable in each pass. The example below demonstrates how to loop over a local macro.

*** Example ***
local drinks "coffee milk soda tea water"
foreach i of local drinks {
display "`i'"
}

Counting number of missing observations

tempvar contar
quietly egen `contar' = rmiss(`i') `if' `in'
quietly sum `contar' if `contar' == 1
local faltan = r(N)

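So how does the program count the number of missing observations in each variable? There are many solutions to this problem, but tabmiss creates a temporary binary variable as an indicator of missingness. The variable, which is named contar, is generated with the rmiss() function of the egen command (called rowmiss() in newer Stata releases), which counts the missing values in each row across the listed variables. Since only one variable is passed, the temporary variable takes the value 1 for missing observations (rows) and 0 for non-missing observations. Because `if' and `in' are passed to egen, contar is left missing for observations outside the selected sample.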

Once the temporary variable is generated, there are again different ways to count the missing observations. Tabmiss obtains the count by running the summarize command on the observations where contar equals 1 and then reading the r(N) scalar, which summarize returns automatically (type return list after the summarize command to see all the scalars it returns).

Tabmiss saves the number of missing observations in a local macro named faltan. The name has no special meaning to Stata; you could change it to anything (call it Mr_Carrot if you want!) and the program would work the same. faltan is used later for calculating the frequencies.

Note that quietly sum is short for quietly summarize; it does not sum up the variable!
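
The counting step can be tried on its own outside the program. The sketch below assumes a Stata release that provides rowmiss(), the newer name of rmiss(), and uses a hypothetical temporary variable name flag; run it as a do-file so the temporary variable and r() results stay available.

```stata
sysuse auto, clear

* build a 0/1 indicator of missingness for rep78
tempvar flag
quietly egen `flag' = rowmiss(rep78)

* r(N) after summarize holds the number of observations that were used
quietly summarize `flag' if `flag' == 1
display "Missing values in rep78: " r(N)

* list everything summarize left behind in r()
return list
```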


Counting number of observations (missing and non-missing)

quietly sum `contar' `if' `in'
local obser = r(N)

To calculate the total number of observations (both missing and non-missing), tabmiss uses the same procedure that was used for counting the number of missing observations, i.e. using the summarize command to return the total number of observations in the r(N) scalar. The difference is that the condition `contar' == 1 is dropped, so the command is restricted only by the user's if and in qualifiers and not to missing observations.

The summarize command is run on the temporary variable contar rather than on the actual variable `i', and this matters. summarize stores in r(N) only the observations that have a non-missing value, so summarizing `i' would return the number of non-missing observations instead of the total (and 0 for a string variable). Because contar is never missing inside the selected sample, its r(N) equals the total number of observations. The total is saved in a local macro named obser.
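
A quick check shows what r(N) holds after summarize compared with the count command. This is a side illustration on auto.dta, not part of tabmiss.

```stata
sysuse auto, clear

* summarize uses only observations with a non-missing value of rep78
quietly summarize rep78
display "r(N) after summarize rep78: " r(N)

* count uses every observation in the data set
quietly count
display "r(N) after count: " r(N)
```

With the 5 missing values of rep78 reported in the table earlier, the first number should be 69 and the second 74.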


Counting number of non-missing observations

local nomiss = `obser' - `faltan'

To count the number of non-missing observations, the number of missing observations, stored in local `faltan', is subtracted from the total number of observations, stored in local `obser'. The resulting count of non-missing observations is stored in local nomiss.

Counting the frequency of missing and non-missing observations

local feqmiss = (`faltan' / `obser')*100
local feqno = (`nomiss' / `obser')*100

Once you understand what the `faltan', `nomiss', and `obser' macros hold, these commands become very simple. The locals contain numbers and can be used in arithmetic operations. As explained in the algorithm, the frequency of missing observations is calculated by dividing the number of missing observations by the total number of observations and multiplying the result by 100. The same procedure is used for the frequency of non-missing observations.
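
As a worked example, the arithmetic can be replayed with the numbers from the rep78 row of the first table (22 foreign cars, 1 of them missing rep78). The values are typed in by hand here instead of being counted from the data.

```stata
* counts taken from the rep78 row of the first example table
local obser = 22
local faltan = 1

* same arithmetic as in tabmiss
local nomiss = `obser' - `faltan'
local feqmiss = (`faltan' / `obser')*100
local feqno = (`nomiss' / `obser')*100

display "NonMiss: " `nomiss'
display "Feq.Missings: " %6.0g `feqmiss'
display "Feq.NonMiss:  " %6.0g `feqno'
```

The two percentages should agree with the rep78 row shown earlier and always add up to 100.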


Printing the results of each variable

display in text %12s abbrev("`i'", 12) " {c |}" /*
*/ as result /*
*/ %8.0g `obser' "   " /*
*/ %9.0g `faltan' "       " %6.0g `feqmiss' "      " /*
*/ %9.0g `nomiss' "       " %6.0g `feqno'

}
end

So far, we have calculated the number of missing and non-missing observations as well as their frequencies. All that is left to finish the program is to print these values in the columns that were defined at the outset of the program. Since this part is inside the loop, the program loops over the variables and completes the table row by row.

So what does display in text %12s mean? The content of a display command can be formatted, and the format syntax depends on the type of content. For string content, the format is % + number + s. Formatting a string gives the column a constant width. In this example, the minimum width of each variable name is set to 12 characters. Therefore, a variable name that is 5 characters long, such as "price", is preceded by 7 blank spaces, which aligns the names to the right side of the column, where the line is drawn. For example, try the following commands in Stata:

*** Example ***
display in text %20s "coffee"
display in text %20s "coffee milk"
display in text %20s "coffee milk soda tea water"

This example makes it clear that when the string is shorter than 20 characters, it is aligned to the right by adding blank characters to its left. However, the format does not truncate strings that are longer than 20 characters.

So we learned how to keep the variable names aligned in the table, but what if a name is longer than the 12 characters specified by %12s? If you remember, the column of variable names is 13 characters wide (####Variable#) before the vertical line. What happens if we have a variable whose name is longer than that? Obviously, it will deform the table. To avoid that problem, tabmiss also sets a maximum width for the variable name, which is 12 characters. The abbrev() string function is used to abbreviate names that are longer than 12 characters. abbrev(s,n) is usually used for abbreviating variable names, although it can abbreviate any string to n characters. In tabmiss this number is set to 12.
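
The effect of abbrev() is easy to try interactively. In the sketch below, a_very_long_variable_name is a made-up name used only for illustration.

```stata
* a 12-character name fits and is returned unchanged
display abbrev("displacement", 12)

* a longer name is shortened to 12 characters
display abbrev("a_very_long_variable_name", 12)

* combined with %12s, short and long names fill the same column width
display in text %12s abbrev("price", 12) " {c |}"
display in text %12s abbrev("a_very_long_variable_name", 12) " {c |}"
```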

tabmiss also formats the counts and the calculated values. These all use numeric formats, because each local macro holds either an integer (a count) or a relative frequency. The distance between the numbers is added with strings of blank characters. To make sure that you create a readable table, you should practice formatting results carefully. For a detailed explanation, see the section "Formats: Controlling how data are displayed" in the Stata User's Guide ([U]).
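
A few display commands are enough to experiment with numeric formats. The square brackets in this sketch are added only to make the field width visible, and the %6.2f line is shown as a possible alternative for percentages, not as something tabmiss uses.

```stata
* general format, width 8: the number is right-aligned in 8 columns
display "[" %8.0g 22 "]"

* the same value in a narrower and in a wider field
display "[" %4.0g 22 "]"
display "[" %12.0g 22 "]"

* fixed format with two decimals
display "[" %6.2f 4.5454545 "]"
```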

Exercise

Here are a few tasks to begin playing with the program and changing its algorithm as well as its output.

  • Add a horizontal line at the end of the table. Also, make the titles of the columns (Variable, Obs, etc.) appear in bold text.
  • Change the syntax of the program so that the variable list must be given to the program.
  • Suggest alternative solutions for calculating the number and frequency of non-missing observations.
  • Change the output of the program so that it returns only the number of observations, the number of missings, and the frequency of missings in three columns, and name that program tabmiss2.

Download commented ado

The example ado file below is the commented version of tabmiss.ado that you can download.
tabmiss.ado




