Your one-stop multiple missing value imputation shop: R 2.15.0 with the rrp package.
Link to the last RSS article here: Introduction to basic Text Mining in R. -- Ed.
By Dr. Jon Starkweather, Research and Statistical Support Consultant
This month we provide a recommendation for dealing with multiple missing value imputation. Every researcher must deal with missing values at some point. The first key issue is determining whether the values are missing at random or whether there is some discernible pattern to the missingness. For a thorough treatment of that issue, please see Little and Rubin (1987). If there is no discernible pattern among the missing values and a decision has been made to impute (i.e. estimate) those missing values, a choice must be made among the many imputation techniques available. Some methods for identifying, displaying, and imputing missing values have been previously discussed in this column (see Starkweather, 2010). The present article, however, deals exclusively with the rrp.impute function of the rrp package (Iacus, 2012); rrp stands for Random Recursive Partitioning (Iacus & Porro, 2007, 2009).
The rrp package is only available from R-Forge for older versions of R (e.g. R version 2.15.0). Therefore, this article provides instructions for downloading and installing R version 2.15.0 and for installing the rrp package into it. We will be using a Windows 7 PC (please note: you must have administrator privileges in order to install software). The article then shows how to use the rrp.impute function to impute multiple missing values in a simulated data set.
Installing R 2.15.0 and rrp
The first thing we need to do is determine where on your machine you want to install the old version of R; from here on we will refer to this version as R 2.15.0. Generally, RSS personnel recommend creating a specific directory (i.e. folder) on your machine’s hard drive for all R installations. The file path location of such a directory should look something like:
However, we recognize some people have R installed in the default location, in which case your R directory will be located inside the Program Files directory. Inside the R directory there should be at least one installation of R, typically the most recent version, which as of this writing is R 3.0.2 (as can be seen in the image below).
The location shown above will be referred to as the R directory; this is where we will install R 2.15.0 (and it should already contain the latest version of R [e.g. R 3.0.2]). Next, we need to retrieve R 2.15.0 from the CRAN archives and download it to the R directory on our machine (the location shown in the image above). Old versions of R can be accessed from CRAN (http://cran.us.r-project.org/) by clicking on the R Binaries link on the left side of the main CRAN page (see image below with the binaries link marked with the red rectangle).
Once you click on the R Binaries link, select the operating system on which you want to install R (“windows” marked below with a red rectangle);
then click the “base” distribution from the Subdirectories as shown below;
then click “Previous releases” (marked with the red rectangle in the image below).
Then click on R 2.15.0 (marked with the red rectangle in the image below).
Then click on “Download R 2.15.0 for Windows” (marked with the red rectangle in the image below). This will allow you to save the installation, or executable, file to the R directory on your machine as located and discussed above.
Then, you simply navigate to your R directory and double click the installation file to install R 2.15.0. Once you have finished installing R 2.15.0, your R directory should look something like what is below.
At this point, we can open the (R 2.15.0) console in preparation for installing the rrp package (Iacus, 2012).
Next, we need to point our favorite browser to the rrp package page of R-Forge (https://r-forge.r-project.org/R/?group_id=1480). Once on that page (displayed below), you will need to copy the installation script line (marked below with a red rectangle) and paste it into your R 2.15.0 console in order to install the rrp package (as shown further below).
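For readers without the screenshot, the installation line can be sketched as follows. This is a minimal sketch, not a copy of the line on the R-Forge page: the repository URL is assumed to be the general R-Forge repository address, and the version guard and the object names (rforge.repo, is.old.r) are additions for illustration only. If the line shown on the rrp project page differs, use that one.

```r
# Install rrp from R-Forge, but only from inside the old R installation.
# Assumption: the URL below is the general R-Forge repository address;
# copy the exact line from the rrp project page if it differs.
rforge.repo <- "http://R-Forge.R-project.org"

# Confirm which version of R this console is running.
print(R.version.string)

# The version guard is an addition for illustration: it keeps the install
# from being attempted by accident in a newer version of R.
is.old.r <- getRversion() < "3.0.0"
if (is.old.r) {
  install.packages("rrp", repos = rforge.repo)
} else {
  message("This is not the R 2.15.0 console; rrp was not installed.")
}
```

Printing R.version.string first is a quick way to make sure the console you pasted into is the R 2.15.0 one and not the latest version.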
It is very important that you never update this version of R and that you never ‘update packages’ associated with this version of R (if you do, you’ll need to uninstall R 2.15.0 and start over). This way, you keep this old version of R for the dedicated purpose of multiple missing value imputation, and it consumes only a small amount of space on your hard drive because it should have only the rrp package installed. The latest version of R remains the version you should use for all other operations.
Using rrp to impute missing values
The first thing we need to do is import our simulated data (rrp.ex.data.txt) from the RSS webserver and get a summary of it. We name the data “data.1” for this example, and we notice from the summary that the data contain 158 cases (n = 158) and 8 columns: id, sex, age, Q1, Q2, Q3, Q4, Q5. We also notice from the summary that there are missing values among the responses to the sex, age, Q2, and Q4 variables.
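The location of rrp.ex.data.txt is not reproduced here, so the sketch below builds a hypothetical stand-in with the same shape (158 rows, the same 8 column names, and missing values in the same four columns). All values, the value ranges, and the number of NAs per column are invented for illustration; only the column names and the row count come from the article.

```r
# Hypothetical stand-in for rrp.ex.data.txt: same shape (158 rows, 8 columns)
# and NAs in the same four columns, but every value is invented here.
set.seed(1)
n <- 158
data.1 <- data.frame(
  id  = 1:n,
  sex = factor(sample(c("female", "male"), n, replace = TRUE)),
  age = sample(18:65, n, replace = TRUE),
  Q1  = sample(1:5, n, replace = TRUE),
  Q2  = sample(1:5, n, replace = TRUE),
  Q3  = sample(1:5, n, replace = TRUE),
  Q4  = sample(1:5, n, replace = TRUE),
  Q5  = sample(1:5, n, replace = TRUE)
)

# Knock out ten values in each column the article reports as having NAs.
for (v in c("sex", "age", "Q2", "Q4")) {
  data.1[sample(n, 10), v] <- NA
}

# With the real file, a read.table() call pointing at the file location
# (probably with header = TRUE) would replace the simulation above.
print(dim(data.1))
print(summary(data.1))  # NA counts appear at the bottom of affected columns
```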
Next, we remove the (arbitrary) identification column (“id”) because it contains no meaningful information (i.e. it is not related at all to any of the other columns of data).
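Dropping the identification column might look like the sketch below. The small data frame is a hypothetical stand-in for data.1 with invented values; selecting columns by name is one of several equivalent ways to do this and is not necessarily the form used in the original script.

```r
# Small hypothetical frame standing in for data.1 (invented values).
data.1 <- data.frame(id  = 1:4,
                     sex = factor(c("female", NA, "male", "male")),
                     age = c(23, 31, NA, 45),
                     Q1  = c(3, 4, 2, 5))

# Keep every column except "id"; selecting by name avoids relying on the
# column's position in the data frame.
data.2 <- data.1[, names(data.1) != "id"]
print(names(data.2))
```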
Next, we need to load the package (rrp) which contains the imputation function (rrp.impute). We also need to set the seed (set.seed) so that we can exactly replicate the resulting imputations. Notice below that we simply use the 8-digit date at the time of writing as the seed number (2014 Jan. 14th = 20140114). Then, we can submit the data (data.2) to the ‘rrp.impute’ function. Notice that we have “$new.data” tacked onto the end of the function call; this returns just the imputed data frame (rather than the two-object list which the function naturally returns). We assign the imputed data frame to a new object (data.3). You can gain a better understanding of the arguments of the ‘rrp.impute’ function by referring to the help files and/or the package documentation (Iacus, 2009). See also Iacus and Porro (2007, 2009), listed at the bottom of this document.
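The steps just described can be sketched as follows. The seed, the function name, and the “$new.data” element come from the article; the stand-in data frame (invented values), the have.rrp check, and the use of default arguments are assumptions for illustration. The check is there because rrp is only expected to load inside the R 2.15.0 installation.

```r
# Hypothetical stand-in for data.2 (invented values, mixed column types).
set.seed(1)
n <- 158
data.2 <- data.frame(
  sex = factor(sample(c("female", "male"), n, replace = TRUE)),
  age = sample(18:65, n, replace = TRUE),
  Q1  = sample(1:5, n, replace = TRUE),
  Q2  = sample(1:5, n, replace = TRUE)
)
data.2[sample(n, 10), "sex"] <- NA
data.2[sample(n, 10), "age"] <- NA

# require() returns FALSE (with a warning) when the package is missing.
have.rrp <- suppressWarnings(require(rrp, quietly = TRUE))

if (have.rrp) {
  set.seed(20140114)                     # seed used in the article
  data.3 <- rrp.impute(data.2)$new.data  # keep only the imputed data frame
  print(summary(data.3))
} else {
  message("rrp is not installed in this version of R; skipping imputation.")
}
```

Setting the seed immediately before the call matters: the partitioning is random, so a seed set earlier in the session and followed by other random draws would not reproduce the same imputed values.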
We can see the missing values (NA) have been imputed by comparing the summaries of each data frame.
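Besides reading the two summaries, the comparison can be made more compact by counting the NAs in each column. The helper name na.counts and the tiny example frame below are hypothetical and only show the shape of the output.

```r
# Count the missing values in each column of a data frame.
na.counts <- function(df) colSums(is.na(df))

# Tiny hypothetical example (invented values) to show the output shape.
before <- data.frame(sex = factor(c("female", NA, "male")),
                     age = c(23, NA, NA))
print(na.counts(before))
```

Applied to the article's objects, rbind(before = na.counts(data.2), after = na.counts(data.3)) should show zeros in the "after" row if every missing value was imputed.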
As you may have noticed above, we did not need to restrict the ‘rrp.impute’ function to only the numeric vectors (i.e. columns) of the data. This is one reason why RSS personnel recommend going through the (minor) trouble of having an old version of R installed on our machines. Having the old version (R 2.15.0) and the rrp package installed allows us to impute missing data quickly, because ‘rrp.impute’ is the only function we are aware of which imputes both numeric and categorical variables with one run of a function.
The other main reason we recommend ‘rrp.impute’ is that we have run simulations comparing its performance (in terms of bias and variability of the estimated/imputed values) to a number of other highly recommended imputation strategies (e.g. maximum likelihood multiple imputation [package norm, package Amelia], Iterative Robust Model-based Imputation [package VIM], and sequential k-nearest neighbors [package SeqKnn, package rrcovNA]). Our results suggest the random recursive partitioning (rrp) method provides estimates with very low bias and low variability, approximately the same amounts one would get from the maximum likelihood method; and of the methods tested, ‘rrp.impute’ is the only one which imputes both numeric and categorical values. Keep in mind that all these methods assume the missing values are missing at random (i.e. there is no discernible pattern to the missing values).
Until next time; make sure you get a retainer and keep your ear to the grindstone…
References / Resources
Iacus, S. M. (2012). Package rrp. Available at: https://r-forge.r-project.org/R/?group_id=1480
Iacus, S. M. (2009). Package rrp manual. Available at: http://www.unt.edu/rss/class/Jon/R_SC/Module4/rrp.pdf
Iacus, S. M., & Porro, G. (2009). Random Recursive Partitioning: A matching method for the estimation of the average treatment effect. Journal of Applied Econometrics, 24, 163–185.
Iacus, S. M., & Porro, G. (2007). Missing data imputation, matching and other applications of random recursive partitioning. Computational Statistics and Data Analysis, 52(2), 773–789.
Little, R. J. A., & Rubin, D. B. (1987). Statistical analysis with missing data. New York: John Wiley & Sons.
Starkweather, J. (2010). How to identify and impute multiple missing values using R. Benchmarks Online, November 2010. Available at: http://web3.unt.edu/benchmarks/issues/2010/11/rss-matters
Originally published February 2014 in Benchmarks Online.