http://it.unt.edu/benchmarks/issues/2015/02/rss-matters
<div class="field field-type-date field-field-pubdate">
<div class="field-label">Publication Date: </div>
<div class="field-items">
<div class="field-item odd">
<span class="date-display-single">2015-02</span> </div>
</div>
</div>
<h3 style="text-align: center;"><img src="/benchmarks/sites/default/files/rssunt.gif" alt="RSS Logo" width="506" height="101" /></h3>
<h3><strong>Data Reduction for Making Comparisons: Principal Component Scores</strong></h3>
<p><em>Link to the last RSS article here: <a href="http://it.unt.edu/benchmarks/issues/2015/01/rss-matters">Explicit Bayes: Working Concrete Examples to Introduce the Bayesian Perspective.</a> -- Ed.</em></p>
<p><strong>By <a href="mailto:Jonathan.Starkweather@unt.edu">Dr. Jon Starkweather</a>, Research and Statistical Support Consultant Team </strong></p>
<p><span style="font-size: medium;"><strong>T</strong></span>hree years ago this column addressed one way of creating composite or indicator scores using Factor Analysis (Starkweather, <a href="http://www.unt.edu/rss/class/Jon/Benchmarks/CompositeScores_JDS_Feb2012.pdf">2012</a>). That article approached composite score creation from a measurement modeling perspective in which each composite score represented a latent variable. The current article approaches composite score creation from a non-measurement modeling perspective.</p>
<p>This month’s article discusses how to create composite scores from many variables using the ultimate data reduction technique: Principal Component Analysis (PCA). PCA is not measurement-model based; it is a linear-model-based data reduction technique used to reduce the number of observed variables down to their <em>principal components</em> while maximizing the <em>total</em> amount of variance explained in the observed variables. PCA assumes linear relationships among the observed variables (i.e. it is not appropriate if curvilinear relationships are present among the observed variables). For a more detailed explanation of the differences between PCA and FA, please consult Starkweather (<a href="http://www.unt.edu/rss/class/Jon/Benchmarks/PCAvsFAvsAA_JDS_July2010.pdf">2010</a>).</p>
<p>Occasionally, a data analyst is called upon to take many observed variables and combine or reduce them to one variable or a few variables. The observed variables may or may not be directly related to one another, and they may or may not be on the same scale. The one or few resulting variables are weighted linear composite scores which can then be used to compare organizational units (e.g. departments within a larger organizational structure). In this situation, it is critically important to realize we are not interested in creating, assessing, or confirming a measurement model with latent variables and error. We are not assuming a classical test theory model of measurement. We are solely interested in reducing many variables to one variable (or a few variables) so we can compare units. Those units may be individuals or organizations.</p>
<h4><strong>The Situation: <em>General Hospital</em></strong></h4>
<p>Our example this month concerns a (<em>fictional</em>) General Hospital. The hospital board has asked the director, Annabelle Lecter, M.D., to compare each Service Department. Each service department (Informational Services [IS], Therapeutic Services [TS], Diagnostic Services [DS], and Support Services [SS]) contains various disparate organizational structures (see pages 1 – 3 <a href="http://www.quia.com/files/quia/users/kkacher/OrganizStHsp/Org-St-Lesson-Pln">here</a>). The service departments do not initially seem comparable because each has its own specific tasks, budgets, number and status of personnel, degree of patient interaction, physical supply needs (weekly, monthly, yearly), and so forth. The director has access to a variety of these types of variables for each department and wants to reduce all of this information down to a single variable on which to compare the departments. Some departments have very small values on some variables, and others have very large values, by design or purpose of the specific department.</p>
<p>At first, the director thinks it might be best to transform all these variables to <em>Z</em>-scores (i.e. standardize them) so they are all on the same scale and then simply add or average all the <em>Z</em>-scores to get one number for each department. The director quickly realizes this is not tenable because <em>Z</em>-scores, although used to compare individuals across two (or more) variables, are not meant to be combined; if <em>Z</em>-scores are averaged, the mean should be at or very near zero. Furthermore, creating a composite score using either of these two techniques (sum or mean) implicitly assumes each variable is equally important and essentially interchangeable (with respect to the resulting composite score).</p>
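<p>For contrast, the director's rejected first idea -- standardize, then average -- can be sketched in R as follows (the data frame and variable names here are hypothetical, invented only for illustration):</p>

```r
# Hypothetical department-level data: one row per service department
dept.df <- data.frame(budget   = c(120, 45, 300, 80),
                      staff    = c(15, 60, 22, 90),
                      patients = c(0, 800, 650, 400))

# Standardize each variable (column) to Z-scores...
z <- scale(dept.df)

# ...then average across variables within each department.
# Note this weights every variable equally -- the assumption
# the director ultimately rejects.
naive.score <- rowMeans(z)
naive.score
```

<p>Because each standardized column has a mean of zero, these naive composite scores are also centered at zero, and every variable contributes with the same weight regardless of how much information it actually carries.</p>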
<p>What the director really needs is a technique which creates a composite score (for each department) in such a way that each observed variable is weighted by its ability to account for variance in all the observed variables (combined). The <em>variance in all observed variables</em> is represented by the variance-covariance matrix or correlation matrix of observed variables. By submitting the observed variables’ data (i.e. variance-covariance matrix or correlation matrix) to PCA, specifying the computation of Principal Component Scores (PCS), and then saving the scores of the <em>first</em> component, the director will have achieved her goal. Keep in mind, with PCA the first component is the one which accounts for the most variance; each subsequent component accounts for variance <em>left over</em> after the variance accounted for by the previous components has been removed. So, to be clear: if the first component accounts for 48% of the total variance of the observed variables, any later component can only draw on the remaining 52%; if the second component then accounts for 25% of the total variance, the two components together account for 73%. Each subsequent component (i.e. component 3 through component <em>J</em>, where <em>J</em> is the number of observed variables) accounts for less and less of the observed variables’ total variance.</p>
<p>Now you may be asking the question: “but what does the component score <em>mean</em>?” To determine that, one would evaluate the direction and magnitude of each observed variable’s loading on the first component. The variables with the largest absolute loadings are those contributing most to the component (i.e. accounting for the most variance in all the observed variables). Loadings are interpreted just like correlation coefficients – positive vs. negative and ranging between -1 and +1. If more than one component is evaluated, it is very likely the observed variables will coalesce decisively on one component or the other, with each component’s definition (or name) becoming apparent based on which observed variables load most on it. For example, say the observed variables 1, 3, 5, 7, and 9 load most on the first component, while the observed variables 2, 4, 6, 8, and 10 load most on the second component. Then we would name the first component based on the content or meaning of observed variables 1, 3, 5, 7, and 9, and likewise name the second component based on the content of observed variables 2, 4, 6, 8, and 10.</p>
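<p>A minimal sketch of this approach in R, using the built-in prcomp() function on simulated data (in a real application the director would substitute her departments' observed variables):</p>

```r
set.seed(42)
# Simulated data: 30 units measured on 6 observed variables
obs <- data.frame(matrix(rnorm(30 * 6), nrow = 30))

# PCA on the correlation matrix (scale. = TRUE standardizes the
# variables first -- important when they are on different scales)
pca <- prcomp(obs, scale. = TRUE)

# Proportion of total variance accounted for by each component
summary(pca)

# Loadings of each observed variable on the first component,
# interpreted like correlation coefficients
pca$rotation[, 1]

# The composite: each unit's score on the first principal component
composite <- pca$x[, 1]
```

<p>Sorting the units by this first-component score gives the single comparison variable described above.</p>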
<p>Tutorials using PCA (with and without saving component scores) are available for each of the three most popular statistical software packages through the Research and Statistical Support instructional / tutorial websites (links provided directly below).</p>
<ul>
<li>For users of the statistical programming language environment R, please see: <a href="http://www.unt.edu/rss/class/Jon/R_SC/Module7/M7_PCAandFA.R">http://www.unt.edu/rss/class/Jon/R_SC/Module7/M7_PCAandFA.R</a></li>
<li>For users of the SAS programming suite, please see: <a href="http://www.unt.edu/rss/class/Jon/SAS_SC/SAS_Module7.htm">http://www.unt.edu/rss/class/Jon/SAS_SC/SAS_Module7.htm</a></li>
<li>For users of the SPSS program, please see: <a href="http://www.unt.edu/rss/class/Jon/SPSS_SC/Module9/M9_PCA/SPSS_M9_PCA1.htm">http://www.unt.edu/rss/class/Jon/SPSS_SC/Module9/M9_PCA/SPSS_M9_PCA1.htm</a></li>
</ul>
<h4 style="text-align: left;" align="center"><strong>References</strong></h4>
<p>Hospital Organizational Structure: <a href="http://www.quia.com/files/quia/users/kkacher/OrganizStHsp/Org-St-Lesson-Pln">http://www.quia.com/files/quia/users/kkacher/OrganizStHsp/Org-St-Lesson-Pln</a></p>
<p>Starkweather, J. (2010). Principal Components Analysis vs. Factor Analysis…and Appropriate Alternatives. Available in original form at Benchmarks: <a href="http://it.unt.edu/benchmarks/issues/2010/07/rss-matters">http://it.unt.edu/benchmarks/issues/2010/07/rss-matters</a> and available as an Adobe.pdf <a href="http://www.unt.edu/rss/class/Jon/Benchmarks/PCAvsFAvsAA_JDS_July2010.pdf">here</a>.</p>
<p>Starkweather, J. (2012). How to Calculate Empirically Derived Composite or Indicator Scores. Available in original form at Benchmarks: <a href="http://web3.unt.edu/benchmarks/issues/2012/02/rss-matters">http://web3.unt.edu/benchmarks/issues/2012/02/rss-matters</a> and available as an Adobe.pdf <a href="http://www.unt.edu/rss/class/Jon/Benchmarks/CompositeScores_JDS_Feb2012.pdf">here</a>.</p>
<p> </p>
<p><strong style="font-size: 1em;"><span style="font-size: xx-small;">Originally published February 2015 -- Please note that information published in <em>Benchmarks Online</em> is likely to degrade over time, especially links to various Websites. To make sure you have the most current information on a specific topic, it may be best to search the UNT Website - <a href="http://www.unt.edu/">http://www.unt.edu</a> . You can also consult the UNT Helpdesk - <a href="http://www.unt.edu/helpdesk/">http://www.unt.edu/helpdesk/</a>. Questions and comments should be directed to <a href="mailto:benchmarks@unt.edu">benchmarks@unt.edu</a>.</span></strong></p>
http://it.unt.edu/benchmarks/issues/2015/01/rss-matters
<div class="field field-type-date field-field-pubdate">
<div class="field-label">Publication Date: </div>
<div class="field-items">
<div class="field-item odd">
<span class="date-display-single">2015-01</span> </div>
</div>
</div>
<h3 style="text-align: center;"><img src="/benchmarks/sites/default/files/rssunt.gif" alt="RSS Logo" width="506" height="101" /></h3>
<h3><strong>Explicit Bayes: Working Concrete Examples to Introduce the Bayesian Perspective.</strong></h3>
<p><em>Link to the last RSS article here: <a href="http://it.unt.edu/benchmarks/issues/2014/12/rss-matters">Identifying or Verifying the Number of Factors to Extract using Very Simple Structure.</a></em> <em>-- Ed.</em></p>
<p><strong>By <a href="mailto:Jonathan.Starkweather@unt.edu">Dr. Jon Starkweather</a>, Research and Statistical Support Consultant Team </strong></p>
<p><span style="font-size: medium;"><strong>W</strong></span>e use the term <em>explicit</em> because we are going to calculate these examples <em>by hand</em> with programming rather than simply loading a package and using its functions to estimate parameters. The purpose of using these explicit methods is to hopefully convey a better understanding of what it means to <em>do</em> Bayesian statistics. </p>
<p>First, we must present a little bit about Bayesian statistics. Very briefly, Bayesian statistics requires three elements: a prior, a likelihood, and a posterior. The prior is a distribution specified by the researcher which represents all <em>prior</em> information regarding the parameter the researcher is attempting to estimate. The prior represents an educated best guess at the parameter (e.g. the mean of the prior) and the degree of certainty or confidence in that guess (e.g. the variance and shape of the prior distribution). The prior is specified before (i.e. <em>prior</em> to) data collection. The prior is then combined with the likelihood (a representation of the data at hand) to create a more informed, empirical distribution of the parameter being estimated. We call this last distribution the <em>posterior</em> distribution. The mean of the posterior is our estimate of the parameter. Interval estimates can then be calculated from the posterior; we call these <em>credible intervals</em>, and unlike confidence intervals they do tell you the probability that the actual population parameter is contained in the interval.</p>
<p>Let's say we want to estimate <strong>the mean</strong> IQ score on the Wechsler Adult Intelligence Scale (WAIS) of a small town, X.Town, which has a population of 10000 individuals. Let's start by importing the X.Town data.</p>
<p><span style="color: #ff0000;">x.town.df <- read.table("http://www.unt.edu/rss/class/Jon/ExampleData/X.Town.sample.txt",</span></p>
<p><span style="color: #ff0000;"> header = TRUE, sep = ",", dec = ".", na.strings = "NA")</span></p>
<p><span style="color: #ff0000;">nrow(x.town.df)</span></p>
<p><span style="color: #0000ff;">[1] 10000</span></p>
<p>We know from a mountain of normative data and prior research that the U.S. population distribution of WAIS scores has a mean (µ) of 100 and a standard deviation (σ) of 15. This information represents a best case scenario, where we <em>know</em> the population distribution and that distribution is normally distributed with an identified mean and standard deviation. Generally, we would not have such great prior information; so consider an alternative where we have virtually no prior information except knowing that the WAIS questions / procedures allow possible scores to range from 1 to 200. In such a case, our prior would treat each score in that range as equally likely -- which prompts us to specify a <em>uniform</em> distribution (i.e. a distribution in which each value has an equal probability of being represented). A uniform prior is also known as an un-informative or un-informed prior. In both examples below we are using a population of 10000 individuals. </p>
<p><span style="color: #ff0000;">uninformed.prior <- rep(1:200, 50)</span></p>
<p><span style="color: #ff0000;">length(uninformed.prior)</span></p>
<p><span style="color: #0000ff;">[1] 10000</span></p>
<p><span style="color: #ff0000;">summary(uninformed.prior)</span></p>
<p> <span style="color: #0000ff;">Min. 1st Qu. Median Mean 3rd Qu. Max.</span></p>
<p><span style="color: #0000ff;"> 1.00 50.75 100.50 100.50 150.20 200.00</span></p>
<p><span style="color: #ff0000;">hist(uninformed.prior)</span></p>
<p><img src="/benchmarks/sites/default/files/EB_002.png" alt="Histogram of Uninformed.prior" width="451" height="480" /> </p>
<p>However, with the WAIS and the knowledge of the U.S. population, we can specify a Gaussian (i.e. normal) distribution as our prior.</p>
<p><span style="color: #ff0000;">informed.prior <- rnorm(10000, mean = 100, sd = 15)</span></p>
<p><span style="color: #ff0000;">length(informed.prior)</span></p>
<p><span style="color: #0000ff;">[1] 10000</span></p>
<p><span style="color: #ff0000;">summary(informed.prior)</span></p>
<p> <span style="color: #0000ff;">Min. 1st Qu. Median Mean 3rd Qu. Max.</span></p>
<p><span style="color: #0000ff;"> 37.51 89.93 100.10 100.10 110.40 157.30</span></p>
<p><span style="color: #ff0000;">hist(informed.prior)</span></p>
<p><span style="color: #ff0000;"><img src="/benchmarks/sites/default/files/EB_001.png" alt="Histogram of informed.prior" width="454" height="480" /></span></p>
<p>Clearly, the two example priors above are extremes (i.e. worst case and best case); there are a variety of other distributions which can be specified as priors (e.g. Cauchy, Poisson, beta, etc.), and the prior <strong>is not</strong> required to be symmetrical. For more information on the variety of distributions, see: <a href="http://en.wikipedia.org/wiki/List_of_probability_distributions">http://en.wikipedia.org/wiki/List_of_probability_distributions</a></p>
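<p>For instance, a heavy-tailed or a skewed prior can be simulated in the same way as the two priors above (the location, scale, and shape values here are arbitrary, chosen only for illustration):</p>

```r
set.seed(1)
# A heavy-tailed prior centered on 100 (Cauchy)
cauchy.prior <- rcauchy(10000, location = 100, scale = 10)

# A skewed prior confined to the 1 - 200 score range (rescaled beta)
beta.prior <- 1 + 199 * rbeta(10000, shape1 = 2, shape2 = 5)

summary(beta.prior)
hist(beta.prior)
```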
<p>Our research questions are as follows: What is the mean WAIS score of the population (<em>N</em> = 10000) of X.Town; and, does that mean differ from the larger (U.S.) population? In more precise terms, what is the population mean of X.Town WAIS scores, and is that mean <em>larger</em> than the known U.S. population mean? To be clear, there are two populations we are referring to here: the population of X.Town (<em>N</em> = 10000) and the larger population of the U.S.</p>
<p>It is unrealistic to think we would have all 10000 adult citizens' data from X.Town; we would generally have a sample of that town's data. Note: the 7th column of our X.Town data file contains the WAIS scores. Here we randomly sample (<em>n</em> = 1000) cases from the entire X.Town data (<em>N</em> = 10000):</p>
<p><span style="color: #ff0000;">wais.sample <- sample(x.town.df[,7], 1000, replace = FALSE)</span></p>
<p><span style="color: #ff0000;">length(wais.sample)</span></p>
<p><span style="color: #0000ff;">[1] 1000</span></p>
<h4><strong>Traditional Frequentist Perspective: Null Hypothesis Significance Testing (NHST).</strong></h4>
<p>In a traditional <em>frequentist</em> setting, we would begin by simply calculating the sample mean as our best estimate of the entire X.Town population mean WAIS score:</p>
<p><span style="color: #ff0000;">M <- mean(wais.sample)</span></p>
<p><span style="color: #ff0000;">M</span></p>
<p><span style="color: #0000ff;">[1] 107.6305</span></p>
<p>and the standard error of that mean if we wanted confidence intervals for that estimate (of the entire X.Town's mean):</p>
<p><span style="color: #ff0000;">std.err <- sqrt(15^2 / length(wais.sample))</span></p>
<p><span style="color: #ff0000;">std.err</span></p>
<p><span style="color: #0000ff;">[1] 0.4743416</span></p>
<p>Then, using an alpha value (e.g. 0.05), we look up the associated critical value (i.e. +/-1.96) in a table and calculate the lower and upper bounds of the confidence interval for our estimate (i.e. the confidence interval for the estimated mean of X.Town).</p>
<p><span style="color: #ff0000;">lower.bound <- (-1.96*std.err) + M</span></p>
<p><span style="color: #ff0000;">lower.bound</span></p>
<p><span style="color: #0000ff;">[1] 106.7008</span></p>
<p><span style="color: #ff0000;">upper.bound <- (+1.96*std.err) + M</span></p>
<p><span style="color: #ff0000;">upper.bound</span></p>
<p><span style="color: #0000ff;">[1] 108.5602</span></p>
<p>Then, we would run a one sample t-test using our random sample of X.Town adults' WAIS scores, comparing <strong>the mean</strong> of the sample scores (<em>M</em>; as our best estimate of the entire X.Town's mean) to the mean of the U.S. population (mu<em>:</em> µ); using the standard error of the mean (<em>std.err</em>) and some pre-designated probability cutoff (e.g. 0.05) to determine statistical significance.</p>
<p><span style="color: #ff0000;">t.test(wais.sample, alternative = 'greater', mu = 100, conf.level = .95)</span></p>
<p> <span style="color: #0000ff;">One Sample t-test</span></p>
<p><span style="color: #0000ff;">data: wais.sample</span></p>
<p><span style="color: #0000ff;">t = 17.0653, df = 999, p-value < 2.2e-16</span></p>
<p><span style="color: #0000ff;">alternative hypothesis: true mean is greater than 100</span></p>
<p><span style="color: #0000ff;">95 percent confidence interval:</span></p>
<p><span style="color: #0000ff;"> 106.8944 Inf</span></p>
<p><span style="color: #0000ff;">sample estimates:</span></p>
<p><span style="color: #0000ff;">mean of x</span></p>
<p><span style="color: #0000ff;"> 107.6305</span></p>
<p>It is important to recall (or review) what the above test is doing. We have drawn a random sample of data from X.Town and we are testing <strong>the mean</strong> of that sample against a known (U.S.) population mean to determine if the sample indeed comes from that population (i.e. the null hypothesis). Notice we are using the sample mean (<em>n</em> = 1000) as a representation of the entire X.Town's WAIS scores (<em>N</em> = 10000).</p>
<h4><strong>Bayesian Perspective: Bayesian Statistics; Bayesian Inference; Bayesian Parameter Estimation.</strong></h4>
<p>All three of the above terms are often used to refer to Bayesian data analysis. The examples below were all adapted from Kaplan (2014). Our example explores the normal prior for the normal sampling model in which the variance σ² (sigma squared) is assumed to be known. Thus, the problem is one of estimating <strong>the mean</strong> µ (mu). Let <em>y</em> denote a data vector of size <em>n </em>(<em>y</em> = the sample of 1000 WAIS scores). We assume that <em>y</em> follows a normal distribution shown with the equation below:</p>
<p style="text-align: center;"> <em>p</em>(<em>y</em>|µ, σ²) = (1/sqrt(2*π*σ²)) * exp(-((<em>y</em> - µ)²) / (2*σ²))</p>
<p>To clarify and show an example in R, we use the following:</p>
<p><span style="color: #ff0000;">mu <- 100</span></p>
<p><span style="color: #ff0000;">o <- 15</span></p>
<p><span style="color: #ff0000;">y <- wais.sample</span></p>
<p>We use the word ‘output’ to refer to <em>p</em>(<em>y</em>|µ, σ²) from above; which is read as the probability of <em>y</em>, given a mean of mu (µ), and variance of sigma squared (σ²).</p>
<p><span style="color: #ff0000;">output <- (1/sqrt(2*pi*o^2)) * exp(-((y - mu)^2) / (2*o^2))</span></p>
<p><span style="color: #ff0000;">summary(output)</span></p>
<p> <span style="color: #0000ff;">Min. 1st Qu. Median Mean 3rd Qu. Max.</span></p>
<p><span style="color: #0000ff;">0.0000746 0.0122980 0.0202940 0.0179940 0.0248800 0.0265940</span></p>
<p>Next, we specify the prior. We are confident that our prior distribution of the mean is normal with its own mean and variance hyper-parameters, <em>k</em> and <em>t</em>² (using <em>t</em> in the R code to refer to tau: τ), respectively, which for this example are treated as known. The prior distribution can be written as: </p>
<p align="center"><em>p</em>(µ|<em>k</em>,<em>t</em>²) = (1/sqrt(2*π*<em>t</em>²)) * exp(-((µ - <em>k</em>)²) / (2*<em>t</em>²))</p>
<p>The term, <em>p</em>(µ|<em>k</em>,<em>t</em>²), can be read as the probability of µ given <em>k</em> and <em>t</em>².</p>
<p><span style="color: #ff0000;">k <- mean(y); k</span></p>
<p><span style="color: #0000ff;">[1] 107.6305</span></p>
<p><span style="color: #ff0000;">t <- sd(y); t</span></p>
<p><span style="color: #0000ff;">[1] 14.13976</span></p>
<p><span style="color: #ff0000;">n <- length(y); n</span></p>
<p><span style="color: #0000ff;">[1] 1000</span></p>
<p><span style="color: #ff0000;">prior.mean <- (1/sqrt(2*pi*t^2)) * exp(-((mu - k)^2) / (2*t^2))</span></p>
<p><span style="color: #ff0000;">prior.mean</span></p>
<p><span style="color: #0000ff;">[1] 0.02439102</span></p>
<p>Combine the prior information with the likelihood of the data (given the population variance; sigma squared [σ²] and the sample size [<em>n</em>]) to create the posterior distribution. Using some algebra, the posterior distribution can be obtained as: </p>
<p align="center"><em>p</em>(µ|<em>y</em>)~<em>N</em>[ ((<em>k</em>/<em>t</em>²)+(<em>n</em>*mean(<em>y</em>)/σ²)) / ((1/<em>t</em>²)+(<em>n</em>/σ²)), (<em>t</em>²*σ²)/(σ²+(<em>n</em>*<em>t</em>²)) ]</p>
<p>Thus, the posterior distribution of mu (µ) is normal with a mean:</p>
<p><span style="color: #ff0000;">posterior.mu <- ((k/t^2)+(n*mean(y)/o^2)) / ((1/t^2)+(n/o^2))</span></p>
<p><span style="color: #ff0000;">posterior.mu</span></p>
<p><span style="color: #0000ff;">[1] 107.6305</span></p>
<p>and variance:</p>
<p><span style="color: #ff0000;">posterior.o2 <- (t^2*o^2)/(o^2+(n*t^2))</span></p>
<p><span style="color: #ff0000;">posterior.o2</span></p>
<p><span style="color: #0000ff;">[1] 0.2247471</span></p>
<p>So, the posterior distribution can be simulated using these two parameters (and <em>n </em>= 1000); which in R, should be:</p>
<p><span style="color: #ff0000;">posterior <- rnorm(n = length(y), mean = posterior.mu,</span></p>
<p><span style="color: #ff0000;"> sd = sqrt(posterior.o2))</span></p>
<p><span style="color: #ff0000;">hist(posterior)</span></p>
<p> <img src="/benchmarks/sites/default/files/EB_004.png" alt="Histogram of Posterior" width="454" height="480" /></p>
<p>In a traditional frequentist analysis, one would be required to report both the estimated mean (i.e. mean of the sample) and a confidence interval with lower and upper bounds for that mean. However, a frequentist confidence interval only tells us that if this same study were repeated 100 times, we would expect about 95 of the resulting intervals to contain the true population mean (if using a 95% confidence interval). It <strong>does not</strong> tell us the probability of the population parameter being included in <em>this</em> interval. Here in the Bayesian setting, we use the posterior distribution and simply take quantiles to compute the lower and upper bounds of a <em>credible interval</em> (the 5th and 95th percentiles below give a 90% credible interval) – which does give us the probability that the actual population parameter is included in the interval.</p>
<p><span style="color: #ff0000;">quantile(posterior, c(.05,.95))</span></p>
<p> <span style="color: #0000ff;">5% 95%</span></p>
<p><span style="color: #0000ff;">106.8662 108.4625</span></p>
<p>It is critically important to recognize that the above example is <strong>only</strong> interested in estimating the mean of X.Town's WAIS scores. The example is NOT attempting to estimate the entire X.Town distribution of WAIS scores. So let's compare the actual mean of X.Town's WAIS scores to the sample mean and the mean of the posterior distribution (of course, in a real research situation you would not have the 'actual' parameter -- i.e. the mean of the entire population of X.Town).</p>
<p><span style="color: #ff0000;">mean(x.town.df$wais)</span></p>
<p><span style="color: #0000ff;">[1] 107.8662</span></p>
<p><span style="color: #ff0000;">mean(wais.sample)</span></p>
<p><span style="color: #0000ff;">[1] 107.6305</span></p>
<p><span style="color: #ff0000;">mean(posterior)</span></p>
<p><span style="color: #0000ff;">[1] 107.6389</span> </p>
<p>Undoubtedly, readers will notice the virtually identical estimates provided by the mean of the posterior (i.e. the Bayesian estimate) and the mean of the sample (i.e. the frequentist estimate); and both of those are very, very close to the X.Town population mean. There are two very important reasons for this. First, the Bayesian and frequentist methods will result in virtually the same parameter estimate(s) with large samples. The prior is weighted very lightly and the likelihood (a representation of the data at hand) contributes the bulk of the weight to the estimation when large samples are used in a Bayesian analysis. Second, the data used in the examples above are simulated, and a truly random sample (<em>n</em> = 1000) was taken from the entire population (<em>N</em> = 10000). Therefore, our results here have very low bias as a result of the truly random sample and the fact that 10% of the population was contained in the sample. Most research is not conducted on a truly random sample, and very few research endeavors include 10% of the population in the sample.</p>
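<p>The first point can be seen directly in the conjugate-normal posterior mean formula used above. A small sketch (the hyper-parameter values here, k = 100 and t = 15, are illustrative -- an informed prior centered on the U.S. mean, rather than the sample-based values used earlier):</p>

```r
# Posterior mean as a function of sample size, using the same
# conjugate-normal formula as in the example above
post.mu <- function(n, ybar, k = 100, t = 15, o = 15) {
  ((k / t^2) + (n * ybar / o^2)) / ((1 / t^2) + (n / o^2))
}

post.mu(n = 5,    ybar = 107.6)   # small n: pulled toward the prior mean (~106.33)
post.mu(n = 1000, ybar = 107.6)   # large n: essentially the sample mean (~107.59)
```

<p>With only 5 observations the prior pulls the estimate noticeably toward 100; with 1000 observations the likelihood dominates and the posterior mean is almost exactly the sample mean.</p>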
<p>Lastly, hypothesis testing and statistical significance are not foreign to the Bayesian perspective. For example, if one were interested in conducting a Bayesian <em>t</em>-test, you would use something called Bayes Factors which has been covered on the RSS <a href="http://www.unt.edu/rss/class/Jon/R_SC/">Do-it-yourself Introduction to R</a> web site and specifically <a href="http://www.unt.edu/rss/class/Jon/R_SC/Module10/BayesFactor.R">here</a> in Module 11. Bayes Factors were also discussed in a previous <a href="http://web3.unt.edu/benchmarks/issues/2011/03/rss-matters">RSS Matters</a> article (<a href="http://www.unt.edu/rss/class/Jon/Benchmarks/BayesFactors_JDS_Mar2011.pdf">Adobe.pdf version</a>). </p>
<p>Until next time, “<em>knowledge is freedom and ignorance is slavery.”</em></p>
<p>-- The above quote is attributed to Miles Dewey Davis III (1926 – 1991): <a href="http://www.goodreads.com/author/quotes/54761.Miles_Davis">http://www.goodreads.com/author/quotes/54761.Miles_Davis</a></p>
<p> </p>
<h4 style="text-align: left;" align="center">Highly Recommended Reference</h4>
<p>Kaplan, D. (2014). <em>Bayesian Statistics for the Social Sciences</em>. New York: The Guilford Press. </p>
<h4 style="text-align: left;" align="center">Other Important Resources</h4>
<p>Albert, J. (2007). <em>Bayesian Computation with R.</em> New York: Springer Science + Business Media, LLC.</p>
<p>Berry, D. A. (1996). <em>Statistics: A Bayesian Perspective.</em> Belmont, CA: Wadsworth Publishing Company.</p>
<p>Berry, S. M., Carlin, B. P., Lee, J. J., & Muller, P. (2011). <em>Bayesian Adaptive Methods for Clinical Trials. </em>Boca Raton, FL: Taylor & Francis Group, LLC.</p>
<p>Bolker, B. M. (2008). <em>Ecological Models and Data in R.</em> Princeton, NJ: Princeton University Press.</p>
<p>Bolstad, W. M. (2004). <em>Introduction to Bayesian Statistics.</em> Hoboken NJ: John Wiley & Sons, Inc.</p>
<p>Broemeling, L. D. (2007). <em>Bayesian Biostatistics and Diagnostic Medicine. </em>Boca Raton, FL: Taylor & Francis Group, LLC.</p>
<p>Congdon, P. (2005). <em>Bayesian Models for Categorical Data. </em>West Sussex, UK: John Wiley & Sons, Ltd.</p>
<p>Congdon, P. (2006). <em>Bayesian Statistical Modeling. </em>West Sussex, UK: John Wiley & Sons, Ltd.</p>
<p>Dey, D. K., Ghosh, S., & Mallick, B. K. (2011). <em>Bayesian Modeling in Bioinformatics. </em>Boca Raton, FL: Taylor & Francis Group, LLC.</p>
<p>Efron, B. (1986). Why isn’t everyone a Bayesian? <em>The American Statistician, 40</em>, 1 – 5.</p>
<p>Gelman, A., & Hill, J. (2007). <em>Data Analysis Using Regression and Multilevel/Hierarchical Models. </em>New York: Cambridge University Press.</p>
<p>Gelman, A., & Meng, X. (2004). <em>Applied Bayesian Modeling and Causal Inference from Incomplete-Data Perspectives. </em>West Sussex, UK: John Wiley & Sons, Ltd.</p>
<p>Gelman, A., & Shalizi, C. R. (2013). Philosophy and the practice of Bayesian statistics. <em>British Journal of Mathematical and Statistical Psychology, 66</em>, 8 – 38.</p>
<p>Gelman, A., Carlin, J. B., Stern, H. S., & Rubin, D. B. (2004). <em>Bayesian Data Analysis </em>(2<sup>nd</sup> ed.). Boca Raton, FL: Chapman & Hall/CRC.</p>
<p>Geweke, J. (2005). <em>Contemporary Bayesian Econometrics and Statistics.</em> Hoboken, NJ: John Wiley & Sons, Inc.</p>
<p>Ghosh, J. K., Delampady, M., & Samanta, T. (2006). <em>An Introduction to Bayesian Analysis: Theory and Methods. </em>New York: Springer Science + Business Media, LLC.</p>
<p>Hoff, P. D. (2009). <em>A First Course in Bayesian Statistical Methods.</em> New York: Springer Science + Business Media, LLC.</p>
<p>Jackman, S. (2009). <em>Bayesian Analysis for the Social Sciences.</em> West Sussex, UK: John Wiley & Sons, Ltd.</p>
<p>Jeffreys, H. (1939). <em>Theory of Probability </em>(1<sup>st</sup> ed.). London: Oxford University Press.</p>
<p>Jeffreys, H. (1948). <em>Theory of Probability </em>(2<sup>nd</sup> ed.). London: Oxford University Press.</p>
<p>Koop, G. (2003). <em>Bayesian Econometrics.</em> Hoboken, NJ: John Wiley & Sons, Inc.</p>
<p>Koop, G., Poirier, D., & Tobias, J. (2007). <em>Bayesian Econometric Methods. </em>New York: Cambridge University Press.</p>
<p>Kruschke, J. K. (2011). <em>Doing Bayesian Data Analysis. </em>Burlington, MA: Academic Press.</p>
<p>Lancaster, T. (2004). <em>An Introduction to Modern Bayesian Econometrics. </em>Malden, MA: Blackwell Publishing.</p>
<p>Lee, P. M. (2004). <em>Bayesian Statistics: An Introduction </em>(3<sup>rd</sup> ed.). New York: Oxford University Press Inc.</p>
<p>Link, W. A., & Barker, R. J. (2010). <em>Bayesian Inference with Ecological Applications. </em>London: Academic Press (Elsevier Ltd.).</p>
<p>Lynch, S. M. (2007). <em>Introduction to Applied Bayesian Statistics and Estimation for Social Scientists. </em>New York: Springer Science + Business Media, LLC.</p>
<p>Mallick, B., Gold, D. L., & Baladandayuthapani, V. (2009). <em>Bayesian Analysis of Gene Expression. </em>West Sussex, UK: John Wiley & Sons, Ltd.</p>
<p>Marin, J., & Robert, C. P. (2007). <em>Bayesian Core: A Practical Approach to Computational Bayesian Statistics. </em>New York: Springer Science + Business Media, LLC.</p>
<p>McGrayne, S. B. (2011). <em>The Theory that Would Not Die: How Bayes’ Rule Cracked the Enigma Code, Hunted Down Russian Submarines, and Emerged Triumphant from Two Centuries of Controversy.</em> New Haven, CT: Yale University Press.</p>
<p>Rossi, P. E., Allenby, G. M., & McCulloch, R. (2005). <em>Bayesian Statistics and Marketing. </em>West Sussex, UK: John Wiley & Sons, Ltd.</p>
<p>Rouder, J. N., Speckman, P. L., Sun, D., & Morey, R. (2009). Bayesian <em>t</em> tests for accepting and rejecting the null hypothesis. <em>Psychonomic Bulletin & Review, 16</em>(2), 225 – 237.</p>
<p>Sorensen, D., & Gianola, D. (2002). <em>Likelihood, Bayesian, and MCMC Methods in Quantitative Genetics. </em>New York: Springer Science + Business Media, LLC.</p>
<p>Stone, L., D. (1975). <em>Theory of Optimal Search. </em>Mathematics in Science and Engineering, Vol. 118. New York: Academic Press, Inc.</p>
<p>Tanner, M. A. (1996). <em>Tools for Statistical Inference: Methods for the Exploration of Posterior Distributions and Likelihood Functions. </em>New York: Springer-Verlag, Inc.</p>
<p>Taroni, F., Bozza, S., Biedermann, A., Garbolino, P., & Aitken, C. (2010). <em>Data Analysis in Forensic Science: A Bayesian Decision Perspective. </em>West Sussex, UK: John Wiley & Sons, Ltd.</p>
<p>Williamson, J. (2010). <em>In Defense of Objective Bayesianism. </em>Oxford, UK: Oxford University Press.</p>
<p>Woodworth, G. G. (2004). <em>Biostatistics: A Bayesian Introduction.</em> Hoboken, NJ: John Wiley & Sons, Inc.</p>
<p><strong style="font-size: 1em;"><span style="font-size: xx-small;">Originally published January 2015 -- Please note that information published in <em>Benchmarks Online</em> is likely to degrade over time, especially links to various Websites. To make sure you have the most current information on a specific topic, it may be best to search the UNT Website - <a href="http://www.unt.edu/">http://www.unt.edu</a> . You can also consult the UNT Helpdesk - <a href="http://www.unt.edu/helpdesk/">http://www.unt.edu/helpdesk/</a>. Questions and comments should be directed to <a href="mailto:benchmarks@unt.edu">benchmarks@unt.edu</a>.</span></strong></p>RSS Matters
http://it.unt.edu/benchmarks/issues/2014/12/rss-matters
<div class="field field-type-date field-field-pubdate">
<div class="field-label">Publication Date: </div>
<div class="field-items">
<div class="field-item odd">
<span class="date-display-single">2014-12</span> </div>
</div>
</div>
<h3 style="text-align: center;"><img src="/benchmarks/sites/default/files/rssunt.gif" alt="RSS Logo" width="506" height="101" /></h3>
<h3><strong>Identifying or Verifying the Number of Factors to Extract using Very Simple Structure.</strong></h3>
<p><em>Link to the last RSS article here: <a href="http://it.unt.edu/benchmarks/issues/2014/11/rss-matters">Statistical Resources (update; version 3).</a></em> <em>-- Ed.</em></p>
<p><strong>By <a href="mailto:Jonathan.Starkweather@unt.edu">Dr. Jon Starkweather</a>, Research and Statistical Support Consultant Team </strong></p>
<p><span style="font-size: medium;"><strong>F</strong></span>actor analysis is perhaps one of the most frequently used analyses. It is versatile and flexible: it can be applied to a variety of data situations and types, and in a variety of ways. However, conducting factor analysis generally requires the analyst to make several decisions. Analysts often run several factor analyses, even when attempting to <em>confirm</em> an established factor structure, in order to assess the fit of the data to several factor models (e.g. a one-factor model, a two-factor model, a three-factor model, etc.). In the century since Spearman (1904) developed factor analysis, many criteria have been proposed for determining the number of factors to extract (e.g. eigenvalues greater than one, Horn’s [1965] parallel analysis, Cattell’s [1966] scree test, Velicer’s [1976] Minimum Average Partial [MAP] criterion, etc.). Each of these criteria has strengths and weaknesses, and they occasionally conflict with one another, which makes relying on any single criterion a risky proposition. This month’s article demonstrates a very handy method for comparing multiple criteria when choosing the appropriate number of factors to extract during factor analysis. </p>
<p>In popular culture it is not uncommon to hear someone say, “There’s an <em>app</em> for that,” meaning an <em>application</em> exists (for a smart phone) which does the task being discussed. Likewise, here at RSS we very frequently find “There’s a <em>pack</em> for that”: there is a virtual certainty of finding an R <em>package</em> with a function devoted to whatever analysis or technique we are discussing. The primary package we will be using here is the ‘psych’ package (Revelle, 2014), which contains a great many useful functions and, as a result, is very often <em>the</em> package we end up using for a variety of analyses. The ‘psych’ package has grown substantially over the last few years; if you have not taken a look at it recently, you might want to check it out. </p>
<p>Our examples below will actually require two packages, the ‘psych’ package and the ‘GPArotation’ package (Bernaards & Jennrich, 2014). The ‘GPArotation’ package should be familiar to anyone with experience doing factor analysis – it provides functions for several rotation strategies. The primary function we demonstrate below is the ‘vss’ function from the ‘psych’ package. The <em>Very Simple Structure</em> (VSS; Revelle & Rocklin, 1979) function provides a nice output of criteria for varying levels of factor model complexity (i.e. number of factors to extract). The Very Simple Structure (VSS) terminology refers to the idea that all loadings which are less than the maximum loading (of an item to a factor) are suppressed to zero – thus forcing a particular factor model to be much more interpretable and more clearly distinguished. Then, the fit of several models of increasing rank complexity (i.e. more and more factors specified) can be assessed using the residual matrix of each model (i.e. the original matrix minus the reproduced matrix of each model). We will also be using both the ‘fa’ function (from the ‘psych’ package) and the ‘factanal’ function (from the ‘stats’ package – included with all installations of R) to fit factor analysis models to the data structures. </p>
<h4><strong>Examples</strong></h4>
<p>The first two examples used here can easily be duplicated using the scripts provided below (i.e. the data file is available at the URL in the script / screen capture image). The third example is the example contained in the help file of the ‘vss’ function and can be accessed using the script below. First, load the two packages we will be using.</p>
<p> <img src="/benchmarks/sites/default/files/VSS_001.png" alt="Load packages" width="640" height="287" /></p>
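<p>The screenshot shows the two library calls; as a plain-text sketch (assuming both packages have already been installed from CRAN):</p>

```r
# Install once if needed (commented out so the sketch runs offline):
# install.packages(c("psych", "GPArotation"))
library(psych)        # provides vss() and fa()
library(GPArotation)  # provides oblimin() and other rotation functions
```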
<p>Next, we will import the comma delimited text (.txt) file from the RSS server using the URL and file name (vss_df.txt) contained in the script / image below. We also run a simple ‘summary’ on the data frame to make sure it was imported correctly.</p>
<p><img src="/benchmarks/sites/default/files/VSS_002.png" alt="Import the comma delimited text (.txt) file from the RSS server " width="475" height="480" /> </p>
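<p>The import command itself is visible only in the screenshot; the sketch below reproduces the pattern with a small local stand-in file, since the RSS server address is not given in the text (substitute the URL shown in the image for the local path):</p>

```r
# The data file is comma delimited despite its .txt extension, so
# read.csv() imports it directly. A tiny stand-in file is written here;
# the article reads vss_df.txt from the RSS server instead.
tmp <- tempfile(fileext = ".txt")
writeLines(c("s.id,group,age,sex",
             "1,1,19,female",
             "2,2,22,male"), tmp)
vss.df <- read.csv(tmp, header = TRUE)
summary(vss.df)  # quick sanity check that the import worked
```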
<p>The simulated data includes a sample identification number for each participant (s.id), a grouping variable (group 1 or group 2), age of each participant (age in years), sex of each participant (female or male), class standing of each participant (freshman, sophomore, junior, or senior), and 30 item scores. Next, we will identify which participants belong to group 1 and which belong to group 2; as well as the number of participants in each group.</p>
<p><img src="/benchmarks/sites/default/files/VSS_003.png" alt="Identify group participants " width="640" height="98" /> </p>
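<p>The group bookkeeping in the screenshot can be sketched with ‘which’ and ‘length’; the miniature data frame below is a hypothetical stand-in for the imported one:</p>

```r
# Stand-in data frame; the article's data yield 418 group 1 rows
# and 982 group 2 rows.
vss.df <- data.frame(group = c(1, 1, 2, 2, 2))
g1 <- which(vss.df$group == 1)  # row numbers of group 1 participants
g2 <- which(vss.df$group == 2)  # row numbers of group 2 participants
length(g1)  # size of group 1
length(g2)  # size of group 2
```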
<p>So, we have 418 participants in group 1 and 982 participants in group 2. Generally, when analysts intend to do factor analysis they have an idea of how many factors they believe the appropriate factor model contains, and often an idea of whether an orthogonal or oblique rotation strategy is warranted. For this first example (i.e. group 1), looking at the 30 item scores (i.e. columns 6 through 35), we believe there are two factors; therefore, we specify 3 factors (<em>n</em> = 3) in the ‘vss’ function. We also believe the factors are likely to be meaningfully related and consequently specify an oblimin rotation strategy. Next, we apply the ‘vss’ function to group 1. Also note, we specified Maximum Likelihood Estimation as the factor method (fm = “mle”) because this is the method used by default by the ‘factanal’ (i.e. factor analysis) function of the ‘stats’ package. We specified the number of observations (i.e. number of rows, cases, or participants) using the length of the group 1 vector (g1). Recall from above, the group 1 vector contains the row numbers of all the participants from group 1.</p>
<p> <img src="/benchmarks/sites/default/files/VSS_004.png" alt="“Very Simple Structure” table" width="640" height="300" /></p>
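<p>A sketch of the ‘vss’ call shown in the screenshot. Because the article’s data file lives on the RSS server, the ‘psych’ package’s built-in ‘bfi’ items are substituted here as an assumption; the article applies the same call to columns 6 through 35 of its own data frame, subset to the group 1 rows:</p>

```r
library(psych)
library(GPArotation)

# 25 personality items from the built-in bfi data, complete cases only
items <- na.omit(bfi[, 1:25])

# Up to n = 3 candidate factors, oblique (oblimin) rotation, maximum
# likelihood estimation; n.obs supplied explicitly, as in the article,
# where length(g1) gives the group 1 sample size.
vss.out <- vss(items, n = 3, rotate = "oblimin", fm = "mle",
               n.obs = nrow(items))
vss.out  # prints the "Very Simple Structure" tables
```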
<p>The first few rows of output (i.e. “Very Simple Structure” table) show the function called and the <em>maximum</em> complexity values. This is a good example because the VSS complexity rows are conflicting; VSS complexity 1 shows a 2-factor model is best while VSS complexity 2 indicates a 3-factor model is best. The VSS complexity 2 is a bit misleading because both the 2-factor model and 3-factor model display a VSS complexity 2 of 0.80; as can be seen in the first column of output under the “Statistics by number of factors” table. So, in fact both complexity 1 and complexity 2 are in agreement. Furthermore, the Velicer MAP <em>minimum</em> is reached with the 2-factor model; which can also be seen in the third column of the “Statistics by number of factors” table. The Bayesian Information Criterion (BIC) <em>minimum</em> is reached with the 2-factor model; as well as the Sample Size adjusted BIC (SABIC) – shown in columns 10 and 11 respectively of the “Statistics by number of factors” table. The ‘vss’ function also produces a plot (by default) which shows the number of factors on the x-axis and the VSS (complexity) Fit along the y-axis with lines and numbers in the Cartesian plane representing the (3) different factor models (see below).</p>
<p><img style="display: block; margin-left: auto; margin-right: auto;" src="/benchmarks/sites/default/files/VSSChart.png" alt="“Very Simple Structure” chart" width="453" height="480" /> </p>
<p>To interpret the graph, focus on the model (1, 2, or 3 factor models) which has the highest line (and numerals) in relation to the y-axis; but also note any transitions of the model lines. In this example, the transitions are all very nearly flat but a later example will better demonstrate the utility of this type of plot.</p>
<p>Next, we can verify the fit of our 2-factor model using either the ‘fa’ function (from the ‘psych’ package) and / or the ‘factanal’ function (of the ‘stats’ package).</p>
<p><img src="/benchmarks/sites/default/files/VSS_005_1.png" alt="Verify the fit of our 2-factor model" width="475" height="480" /> </p>
<p>*Note: the last few lines of output from the ‘fa’ function are cut off (i.e. not shown).</p>
<p><img src="/benchmarks/sites/default/files/VSS_006_0.png" alt="Verify the fit of our 2-factor model " width="476" height="480" /></p>
<p>*Note: last few lines of output from the ‘factanal’ function are cut off (i.e. not shown).</p>
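<p>The two verification runs in the screenshots follow this pattern (again substituting the built-in ‘bfi’ items for the article’s data, an assumption; the 2-factor specification mirrors the group 1 result above):</p>

```r
library(psych)
library(GPArotation)

items <- na.omit(bfi[, 1:25])

# 'fa' from the psych package: maximum likelihood, oblimin rotation
fa.out <- fa(items, nfactors = 2, rotate = "oblimin", fm = "ml")

# 'factanal' from the stats package; with GPArotation attached, the
# rotation argument can name its oblimin function directly
factanal.out <- factanal(items, factors = 2, rotation = "oblimin")

fa.out$loadings        # inspect the loadings from each fit
factanal.out$loadings
```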
<p>We will now assess the group 2 (g2) data. This group is believed to be best served with a 3-factor model; so we specify 4 factors (<em>n </em>= 4) in the ‘vss’ function call; again with the factor method set to Maximum Likelihood Estimation (fm = “mle”) and an oblique rotation strategy (rotate = “oblimin”).</p>
<p><img src="/benchmarks/sites/default/files/VSS_007.png" alt="VSS 3-factor model supported" width="640" height="267" /> </p>
<p>In this example all of the indices in the top table (“Very Simple Structure”) are in agreement, although both VSS complexity metrics display the same <em>maximum</em> for the 3-factor model and the 4-factor model. Looking at the first two columns of the “Statistics by number of factors” table shows the identical complexity <em>maximums</em> (0.84) for both the 3-factor model (row 3) and the 4-factor model (row 4) under complexities 1 and 2 (columns 1 and 2). But, given the other indices’ agreement in support of the 3-factor model, it is the most appropriate model. The plot (below) reinforces the interpretation of the tabular output above.</p>
<p><img style="display: block; margin-left: auto; margin-right: auto;" src="/benchmarks/sites/default/files/VSSPlot.png" alt="VSS Plot" width="453" height="480" /></p>
<p>The plot (above) shows that the 3-factor model is meaningfully better than the 1-factor or 2-factor models and the 4-factor model does not show any improvement over the 3-factor model – which is evident because the number 4 in the plot is not [further] above the line associated with the 3-factor model (i.e. no gain or transition upward; as is the case from 1-factor to 2-factors and to 3-factors). Therefore, we fit the 3-factor model to our data using the ‘fa’ function (of the ‘psych’ package) and / or the ‘factanal’ function of the ‘stats’ package.</p>
<p><img src="/benchmarks/sites/default/files/VSS_008_0.png" alt="Fit the 3-factor model to our data using the ‘fa’ function " width="474" height="480" /> </p>
<p>*Note: the last few lines of output from the ‘fa’ function are cut off (i.e. not shown).</p>
<p><img src="/benchmarks/sites/default/files/VSS_009.png" alt="Fit the 3-factor model to our data using the ‘fa’ function " width="475" height="480" /> </p>
<p>*Note: last few lines of output from the ‘factanal’ function are cut off (i.e. not shown).</p>
<p>The next example is taken straight from the help file of the ‘vss’ function and is discussed here because it demonstrates a situation in which the tables of output from the ‘vss’ function are not in agreement. When this situation occurs, one must rely upon the plot produced by the ‘vss’ function rather than the textual output. First, open the help file (here the plain text version is shown).</p>
<p><img src="/benchmarks/sites/default/files/VSS_012.png" alt="Open the help file " width="619" height="480" /> </p>
<p>Next, scroll to the bottom of the help file and copy / paste the relevant lines of script into the R console.</p>
<p><img src="/benchmarks/sites/default/files/VSS_013.png" alt="Scroll to the bottom of the help file and copy / paste the relevant lines of script into the R console. " width="640" height="330" /> </p>
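<p>Instead of copying the lines by hand, the code bundled with the help page can also be run directly; ‘example’ executes the example code from the ‘vss’ documentation (a sketch, assuming the installed package’s example matches the one discussed):</p>

```r
library(psych)
library(GPArotation)

# Open the help page with ?vss, then run its bundled example code:
example(vss, package = "psych", ask = FALSE)
```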
<p>As mentioned previously, the tables of statistics do not provide a clear answer to the question of which factor model is best (i.e. how many factors should be extracted). However, if we review the associated plot, we can clearly see the 4-factor model is the best (i.e. highest; even when embedded within models with more than 4 factors, with good separation from previous models).</p>
<p><img style="display: block; margin-left: auto; margin-right: auto;" src="/benchmarks/sites/default/files/VSS_014.png" alt="Review the associated plot" width="452" height="480" /> </p>
<h4> <strong>Conclusions</strong></h4>
<p>The intent of this article was to raise awareness of the dangers of using only one criterion or method for deciding upon the number of factors to extract when conducting factor analysis. This article also demonstrated the ease with which an analyst can compute and evaluate several such criteria to reach a more informed decision. More extensive examples of the data analysis solutions are available at the RSS <a href="http://www.unt.edu/rss/class/Jon/R_SC/">Do-it-yourself Introduction to R</a> course page. Lastly, a copy of the script file used for the above examples is available <a href="http://www.unt.edu/rss/class/Jon/Benchmarks/VerySimpleStructure.R">here</a>.</p>
<p>Until next time; remember what George Carlin said: <em>“just ‘cause you got the monkey off your back doesn’t mean the circus left town</em>.”</p>
<h4 style="text-align: left;" align="center"><strong>References / Resources</strong></h4>
<p>Bernaards, C., & Jennrich, R. (2014). The ‘GPArotation’ package. Documentation available at <a href="http://cran.r-project.org/web/packages/GPArotation/index.html">CRAN</a>; the package <a href="http://cran.r-project.org/web/packages/GPArotation/GPArotation.pdf">manual</a> and the package <a href="http://cran.r-project.org/web/packages/GPArotation/vignettes/Guide.pdf">vignette</a>.</p>
<p>Carlin, G. (1937 – 2008). <em>Just One-Liners</em>. <a href="http://www.just-one-liners.com/ppl/george-carlin">http://www.just-one-liners.com/ppl/george-carlin</a></p>
<p>Cattell, R. B. (1966). The scree test for the number of factors. <em>Multivariate Behavioral Research, 1</em>(2), 245 – 276.</p>
<p>Horn, J. (1965). A rationale and test for the number of factors in factor analysis. <em>Psychometrika, 30</em>(2), 179 – 185.</p>
<p>Horn, J. L., & Engstrom, R. (1979). Cattell's scree test in relation to Bartlett's chi-square test and other observations on the number of factors problem. <em>Multivariate Behavioral Research, 14</em>(3), 283 – 300.</p>
<p>McDonald, R. P. (1999). <em>Test Theory: A Unified Treatment.</em> Mahwah, NJ: Erlbaum.</p>
<p>Pearson, K. (1901). On lines and planes of closest fit to systems of points in space. <em>Philosophical Magazine, 2</em>, 559 – 572.</p>
<p>Revelle, W. (2014). The ‘psych’ package. Documentation available at <a href="http://cran.r-project.org/web/packages/psych/index.html">CRAN</a>; the package <a href="http://cran.r-project.org/web/packages/psych/psych.pdf">manual</a> and the package <a href="http://cran.r-project.org/web/packages/psych/vignettes/overview.pdf">vignette</a>.</p>
<p>Revelle, W., & Rocklin, T. (1979). Very simple structure: An alternative procedure for estimating the optimal number of interpretable factors. <em>Multivariate Behavioral Research, 14</em>, 403 – 414. Available at: <a href="http://personality-project.org/revelle/publications/vss.pdf">http://personality-project.org/revelle/publications/vss.pdf</a></p>
<p>Spearman, C. (1904). General Intelligence: Objectively Determined and Measured. <em>American Journal of Psychology, 15</em>, 201 – 292.</p>
<p>Statistics Canada. (2010). <em>Survey Methods and Practices</em>. Ottawa, Canada: Minister of Industry. <a href="http://www.statcan.gc.ca/bsolc/olc-cel/olc-cel?lang=eng&catno=12-587-X">http://www.statcan.gc.ca/bsolc/olc-cel/olc-cel?lang=eng&catno=12-587-X</a></p>
<p>Thompson, B. (2004). <em>Exploratory and confirmatory factor analysis: Understanding concepts and applications</em>. Washington, DC: American Psychological Association.</p>
<p>Velicer, W. (1976). Determining the number of components from the matrix of partial correlations. <em>Psychometrika, 41</em>(3), 321 – 327.</p>
<p><strong style="font-size: 1em;"><span style="font-size: xx-small;">Originally published December 2014 -- Please note that information published in <em>Benchmarks Online</em> is likely to degrade over time, especially links to various Websites. To make sure you have the most current information on a specific topic, it may be best to search the UNT Website - <a href="http://www.unt.edu/">http://www.unt.edu</a> . You can also consult the UNT Helpdesk - <a href="http://www.unt.edu/helpdesk/">http://www.unt.edu/helpdesk/</a>. Questions and comments should be directed to <a href="mailto:benchmarks@unt.edu">benchmarks@unt.edu</a>.</span></strong></p>RSS Matters
http://it.unt.edu/benchmarks/issues/2014/11/rss-matters
<div class="field field-type-date field-field-pubdate">
<div class="field-label">Publication Date: </div>
<div class="field-items">
<div class="field-item odd">
<span class="date-display-single">2014-11</span> </div>
</div>
</div>
<h3 style="text-align: center;"><img src="/benchmarks/sites/default/files/rssunt.gif" alt="RSS Logo" width="506" height="101" /></h3>
<h3><strong>Statistical Resources (update; version 3).</strong></h3>
<p><em>Link to the last RSS article here: <a href="http://it.unt.edu/benchmarks/issues/2014/10/rss-matters">BOOtstrapping the Generalized Linear Model.</a></em> <em>-- Ed.</em></p>
<p><strong>By <a href="mailto:Jonathan.Starkweather@unt.edu">Dr. Jon Starkweather</a>, Research and Statistical Support Consultant Team </strong></p>
<p><span style="font-size: medium;"><strong>T</strong></span>his month’s article first appeared in November of 2011, but periodically it is necessary to update it with more current resources. The original article was motivated by a Research and Statistical Support (RSS) workshop given for graduate students and contains much the same content as was presented in that workshop: Statistical Resources. The following materials are, for the most part, freely available through the World Wide Web. The resources mentioned below fall, generally, into three categories: the resources we at RSS maintain, the resources available to UNT community members, and the resources available to the general public with access to the web.</p>
<h4><strong>RSS Resources</strong></h4>
<p>The main <a href="http://www.unt.edu/rss/">RSS website</a> offers several resources, both specific resources aimed at particular software and more general resources (e.g., <a href="http://www.unt.edu/ACS/datamanage.htm">Data Management Services</a>). One of the key resources available to members of the UNT community is the opportunity to set up a <a href="http://www.unt.edu/rss/Consulting.htm">consulting</a> appointment with RSS staff. The <a href="https://untsitsm.saasit.com/Login.aspx?ProviderName=UNT&Scope=SelfService&role=SelfService&CommandId=SearchOffering&SearchString=rss_apt">link</a> to contact RSS staff for consultation is prominently displayed on each of the pages associated with RSS. The link guides clients to a web interface, known as the Front Range system, which forwards the service request to RSS staff, who then contact the requestor directly (generally through email). Please read the frequently asked questions (<a href="http://www.unt.edu/rss/FAQ.htm">FAQ</a>) prior to submitting a Front Range request. It is also important to note that RSS staff maintains a rather extensive collection of digital and paper copies of articles, book chapters and whole books. RSS staff members often lend copies of these (in whole or part) to clients so clients can research various analytic or methodological concepts to their own satisfaction (and often the satisfaction of their colleagues, advisors, or committees, etc.).</p>
<p>A second frequently used resource RSS offers consists of the <a href="http://www.unt.edu/rss/Instructional.htm">instructional</a> services for RSS supported software. These were initially short courses offered in a classroom twice per semester; however, they have been migrated to the online format so that they may reach a wider audience and allow self-paced learning. These pages were designed to show how a particular software package can be used (e.g., <a href="http://www.unt.edu/rss/class/Jon/R_SC/">R</a>, <a href="http://www.unt.edu/rss/class/Jon/SPSS_SC/">SPSS</a>, <a href="http://www.unt.edu/rss/class/Jon/SAS_SC/">SAS</a>), they are not designed to teach statistics or how to interpret statistics (although some interpretation is offered among the many pages). In fact, some of the software supported by RSS is not directly related to statistics (e.g., <a href="http://www.unt.edu/rss/SURVEYclasslinks.html">survey technology</a> such as <a href="http://www.unt.edu/rss/class/survey/QSurvey.html">Zope and QSurvey</a>). On each of the R, SPSS, SAS short course pages you will also find links to resources specific to those software packages; from user manuals provided by the software producer (e.g., <a href="http://www.unt.edu/rss/class/Jon/SPSS_SC/Manuals/SPSS_Manuals.htm">SPSS Manuals</a>, <a href="http://cran.r-project.org/web/views/">CRAN Task Views</a>) to other users’ user guides or websites (e.g. <a href="http://www.statmethods.net/">Quick-R</a>, <a href="http://lists.mcgill.ca/archives/stat-l.html">STAT-L</a>). There is even an R specific search engine available called, <a href="http://www.rseek.org/">RSeek</a>.</p>
<p>Another resource RSS offers is displayed right here; the contributions by RSS staff to the <a href="http://web3.unt.edu/benchmarks/"><em>Benchmarks</em></a> online publication in the <em>RSS Matters</em> column. Each article in the <em>RSS Matters</em> column is linked to the previous article and an <a href="http://www.unt.edu/rss/rssmattersindex.htm">index of <em>RSS Matters</em></a> articles is maintained on the RSS website. The index is quite handy for finding particular topics (e.g., canonical correlation), rather than clicking back through the years of articles available through the column links.</p>
<p>RSS has recently introduced a new service for instructors at UNT in which we can provide a randomly sampled data set from a fictional population named <a href="http://www.unt.edu/rss/class/Jon/Example_Sim/">Examplonia</a>. Examplonia is a fictional country which provides a meaningful context for statistical analysis examples. The population data for Examplonia was generated to provide a statistical population from which random samples could be drawn for the completion of example statistical analysis problems. The current version of the Examplonia population contains a variety of univariate, bivariate, and multivariate effects; including random effects based on hierarchical structure. If you are an instructor for a statistics course, you may be interested in obtaining some simulated data for your class (i.e. data for in-class demonstrations, homework assignments, etc.). Learn more about the population by visiting the <a href="http://www.unt.edu/rss/class/Jon/Example_Sim/">Examplonia</a> webpage.</p>
<p>RSS has also implemented some new services this year, all of which are focused on making software available to researchers through a web browser and relieving them of the need to download and install software. Meaning, <a href="http://www.sagemath.org/">Sage Mathematics</a> and <a href="http://www.rstudio.org/">RStudio</a>, along with the other services, can be accessed through a web browser. Sage Mathematics is mathematical computing software which can integrate the use of <strong>R</strong>. A brief introduction can be found at the Sage link above. RStudio is an integrated development environment for running the <strong>R</strong> statistical package. A brief introduction can be found <a href="http://web3.unt.edu/benchmarks/issues/2012/05/rss-matters">here</a>. Another new service is called <a href="http://rss.unt.edu:8083/tiki-index.php">Tiki Wiki</a>; an open source, freely available, content management system (CMS). More information can be found <a href="https://info.tiki.org/">here</a>. The final new service introduced this year is called <a href="http://rss.unt.edu:8082/">Galaxy Server</a>, also open source and freely available. Galaxy is “a web-based platform for data intensive biomedical research” (for more information, see: <a href="https://usegalaxy.org/">here</a>). These servers/services are available to faculty and advanced graduate students; however, those interested need to submit a request for an access account for each service. Once a user has set up an account, they can simply visit the servers using their preferred web browser and conduct analyses using the software without having to install the software on their local machines. RSS is also working on implementing a Concerto Server; however, as of this writing we are still learning about it and are not ready to fulfill requests for access yet (more information can be found <a href="http://www.concerto-signage.org/deploy">here</a>).</p>
<h4><strong>Online Statistical Textbooks</strong></h4>
<p>The <a href="http://onlinestatbook.com/rvls/">Rice Virtual Lab in Statistics</a> is a valuable site for anyone interested in learning or teaching some of the basics of traditional (i.e. frequentist) statistics. The site offers several <a href="http://onlinestatbook.com/stat_sim/index.html">animations</a> for understanding concepts which are often difficult for newcomers to statistics (e.g., <a href="http://onlinestatbook.com/stat_sim/sampling_dist/index.html">sampling distribution characteristics</a> & the <a href="http://onlinestatbook.com/stat_sim/normal_approx/index.html">Central Limit Theorem</a>). The Rice Virtual Lab in Statistics also offers an online (free; no registration required) introductory statistics textbook. The textbook is called <a href="http://davidmlane.com/hyperstat/index.html">HyperStat</a> and contains chapters which cover the usual contents such as describing univariate and bivariate data, elementary probability, the normal distribution, point estimation, interval estimation, Null Hypothesis testing, statistical power, t-tests, Analysis of Variance (ANOVA), prediction, chi-square, non-parametric tests, and effect size estimates.</p>
<p>Another online repository of statistical resources is the site maintained by Michael Friendly at York University. The <a href="http://www.math.yorku.ca/SCS/StatResource.html">site</a> offers a variety of links to resources for a variety of software, tutorials for specific analyses, and sections of links for statistical societies, associations, and academic departments; as well as links to support more general computing resources (e.g., using Unix). A similar <a href="http://www.claviusweb.net/statistics.shtml">site</a> listing various statistical resources on the web is maintained by Clay Helberg.</p>
<p><a href="http://www.statsoft.com/textbook/">Statsoft</a>, the company behind the statistical software <a href="http://www.statsoft.com/">Statistica</a>, also offers web surfers a textbook covering a variety of statistical topics. The Statsoft site covers topics ranging from <a href="http://www.statsoft.com/textbook/elementary-statistics-concepts/button/1/">elementary concepts</a>, <a href="http://www.statsoft.com/textbook/basic-statistics/?button=1">basic statistics</a>, and <a href="http://www.statsoft.com/textbook/anova-manova/?button=1">ANOVA/MANOVA</a> to multivariate topics such as <a href="http://www.statsoft.com/textbook/principal-components-factor-analysis/?button=1">principal components and factor analysis</a>, <a href="http://www.statsoft.com/textbook/multidimensional-scaling/?button=2">multidimensional scaling</a>, and <a href="http://www.statsoft.com/textbook/structural-equation-modeling/?button=2">structural equation modeling</a>. The Statsoft site does not offer software output or interpretation, although graphs and tables are often used. However, one handy feature of the Statsoft site is the interactive glossary; each hyperlinked word sends the user to the definition/entry for that word in the glossary. The Statsoft textbook is also <a href="https://www.statsoft.com/products/statistics-methods-and-applications-book/order/">available</a> in printed form for $80.00 plus shipping.</p>
<h4><strong>Miscellaneous Other Resources</strong></h4>
<p>Another resource option for members of the UNT community, which is often overlooked, is the <a href="http://www.library.unt.edu/">UNT library system</a>. The library’s <a href="http://iii.library.unt.edu/">general catalog</a> contains a monumental collection of resources, from textbooks being used in current courses to books focusing on the statistical analyses used in particular fields and authoritative books devoted to specific types of analysis (e.g., searching “logistic regression” yielded 66 returns). Furthermore, the electronic resources offer access to thousands of periodicals (i.e. journals) from a variety of databases (e.g. EBSCOHost, Medline, ERIC, LexisNexis, & JSTOR). One of the databases most frequently used by RSS staff is JSTOR, which contains many of the most prominent methodological and statistical journals – with almost all articles available (through the UNT portal) in full text (i.e. Adobe .pdf format). Another commonly used resource is the <a href="http://www.jstatsoft.org/">Journal of Statistical Software</a>, which contains articles on a variety of statistical computing applications/software, as well as articles covering statistical methods. One more frequently consulted resource is the <em>little green books</em>, actually a series published by <a href="http://www.sagepub.com/home.nav">Sage</a>. The <a href="http://www.sagepub.com/productSearch.nav?seriesId=Series486">Quantitative Applications in the Social Sciences</a> series is a collection of thin, soft-covered books, each dealing with a specific research or statistical topic. The UNT library carries approximately 145 of the series’ editions and the RSS staff has collected most of the series as well. There are approximately 170 books in the series, and a typical researcher would be hard-pressed not to find something of value among them. 
Of course, there are more general resources, such as <a href="http://www.google.com/">Google</a>, <a href="http://www.scholarpedia.org/">Scholarpedia</a>, <a href="http://www.wikipedia.org/">Wikipedia</a>, and even <a href="http://www.youtube.com/watch?v=mL27TAJGlWc">Youtube</a>; all of which can be useful.</p>
<p>Until next time, remember; GIYF – Google is your friend.</p>
<p><strong style="font-size: 1em;"><span style="font-size: xx-small;">Originally published November 2014 -- Please note that information published in <em>Benchmarks Online</em> is likely to degrade over time, especially links to various Websites. To make sure you have the most current information on a specific topic, it may be best to search the UNT Website - <a href="http://www.unt.edu/">http://www.unt.edu</a> . You can also consult the UNT Helpdesk - <a href="http://www.unt.edu/helpdesk/">http://www.unt.edu/helpdesk/</a>. Questions and comments should be directed to <a href="mailto:benchmarks@unt.edu">benchmarks@unt.edu</a>.</span></strong></p>
http://it.unt.edu/benchmarks/issues/2014/10/rss-matters
<div class="field field-type-date field-field-pubdate">
<div class="field-label">Publication Date: </div>
<div class="field-items">
<div class="field-item odd">
<span class="date-display-single">2014-10</span> </div>
</div>
</div>
<h3 style="text-align: center;"><img src="/benchmarks/sites/default/files/rssunt.gif" alt="RSS Logo" width="506" height="101" /></h3>
<h3><em>BOO</em>tstrapping the Generalized Linear Model</h3>
<p><em>Link to the last RSS article here: <a href="http://it.unt.edu/benchmarks/issues/2014/09/rss-matters">Factor Analysis with Binary items: A quick review with examples.</a></em> <em>-- Ed.</em></p>
<p><strong>By <a href="mailto:richard.herrington@unt.edu">Dr. Richard Herrington</a>, Research and Statistical Support Consultant Team </strong></p>
<p><span style="font-size: medium;"><strong>R</strong></span>esearchers do not need to be afraid: fast computers and public-domain software such as R and the R package <em>boot</em> make forays into <em>bootstrap confidence interval estimation</em> reasonably straightforward. The <em>boot</em> package was designed to be general enough to let the data analyst simulate the empirical sampling distribution of most estimators (and then some) and to calculate corresponding confidence intervals for those estimators. There are a few tricks to learn when using the package, but once those small hurdles have been cleared, the lessons learned can be applied to other estimation settings.</p>
<p>The <em>boot</em> package comprises a set of functions that are well documented, both with theory and examples, in the book <em>Bootstrap Methods and Their Application</em> by A. C. Davison and D. V. Hinkley (1997). The purpose of this short note is to demonstrate how to approximate nonparametric confidence intervals for the <em>generalized linear model</em> (glm), using resampling methods and the R package <em>boot</em>.</p>
<p>We’ll start off by simulating a data set from the following probability regression model: </p>
<table cellspacing="0" cellpadding="0">
<tbody>
<tr>
<td bgcolor="white" width="411" height="356">
<table cellspacing="0" cellpadding="0">
<tbody>
<tr>
<td>
<div class="shape">
<p>samp.size<-5000</p>
<p> </p>
<p>x1 <- rnorm(samp.size) </p>
<p>x2 <- rnorm(samp.size)</p>
<p>x3 <- rnorm(samp.size)</p>
<p>x4 <- rnorm(samp.size)</p>
<p> </p>
<p># True Model</p>
<p># x0 x1 x2 x3 x4 x1*x2</p>
<p>z <- 1 + 2*x1 + 3*x2 + 4*x3 + 5*x4 + 10*x1*x2 </p>
<p>pr <- 1/(1+exp(-z)) </p>
<p>y <- rbinom(samp.size,1,pr) </p>
<p> </p>
<p>sim.data.df <- data.frame(y=y, x1=x1, x2=x2, x3=x3, x4=x4, x5=x1*x2)</p>
<pre>> head(sim.data.df)</pre>
<pre> y x1 x2 x3 x4 x5</pre>
<pre>1 0 0.9632201 -1.0871521 -2.0283342 0.5727080 -1.0471668</pre>
<pre>2 0 2.8738768 -1.4818353 0.1265646 1.9195807 -4.2586121</pre>
<pre>3 1 -0.5552309 0.8576629 1.1878977 -0.7940654 -0.4762010</pre>
<pre>4 0 -0.7519217 0.7630796 -0.7534080 -0.6768429 -0.5737761</pre>
<pre>5 0 0.6789053 -1.6454898 0.5337027 -0.9163869 -1.1171318</pre>
<pre>6 0 1.4138792 -0.3052833 1.0388294 -0.9189572 -0.4316337</pre>
<p>.</p>
<p>.</p>
<p>.</p>
</div>
</td>
</tr>
</tbody>
</table>
<br /><br /></td>
</tr>
</tbody>
</table>
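<p>For readers working outside R, the same data-generating process can be sketched in Python. This is a hedged translation, not part of the original script: numpy is assumed, and the variable names simply mirror the R code above.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
samp_size = 5000

# Four independent standard-normal predictors, as in the R script.
x1, x2, x3, x4 = rng.standard_normal((4, samp_size))

# True model: linear predictor including the x1*x2 interaction.
z = 1 + 2*x1 + 3*x2 + 4*x3 + 5*x4 + 10*x1*x2
pr = 1 / (1 + np.exp(-z))      # inverse logit turns log-odds into P(y = 1)
y = rng.binomial(1, pr)        # one Bernoulli draw per case
```

<p>Each simulated case is one row of (y, x1, ..., x4, x1*x2); the glm step that follows tries to recover the true coefficients 1, 2, 3, 4, 5, and 10.</p>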
<p>Using the R function <em>glm </em>we can estimate the model coefficients using a binomial probability model for the <em>y</em> outcome variable:</p>
<table cellspacing="0" cellpadding="0">
<tbody>
<tr>
<td bgcolor="white" width="531" height="266">
<table style="width: 100%;" cellspacing="0" cellpadding="0">
<tbody>
<tr>
<td>
<div class="shape">
<p>glm.fit<-glm(y~x1+x2+x3+x4+x1*x2,</p>
<p> data=sim.data.df,</p>
<p> family="binomial")</p>
<p>glm.fit</p>
<p> </p>
<pre>> glm.fit</pre>
<pre> </pre>
<pre>Call: glm(formula = y ~ x1 + x2 + x3 + x4 + x1 * x2, family = "binomial", </pre>
<pre> data = sim.data.df)</pre>
<pre> </pre>
<pre>Coefficients:</pre>
<pre>(Intercept) x1 x2 x3 x4 x1:x2 </pre>
<pre> 1.009 1.973 3.101 4.081 5.113 10.144 </pre>
<pre> </pre>
<pre>Degrees of Freedom: 4999 Total (i.e. Null); 4994 Residual</pre>
<pre> Null Deviance: 6910 </pre>
<pre> Residual Deviance: 1265 AIC: 1277</pre>
<p> </p>
</div>
</td>
</tr>
</tbody>
</table>
</td>
</tr>
</tbody>
</table>
<p><br />R function <em>glm </em>does a reasonably good job of recovering the population regression coefficients – although we did use a very large sample size in comparison to the number of variables in the model.</p>
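<p>As a cross-check on the idea (not on R's implementation), the same maximum-likelihood fit can be reproduced with a hand-rolled Newton-Raphson loop, which is what Fisher scoring reduces to under the canonical logit link. A sketch in Python with numpy assumed; the data are regenerated from the same true model:</p>

```python
import numpy as np

# Regenerate data from the article's true model.
rng = np.random.default_rng(0)
n = 5000
x1, x2, x3, x4 = rng.standard_normal((4, n))
z = 1 + 2*x1 + 3*x2 + 4*x3 + 5*x4 + 10*x1*x2
y = rng.binomial(1, 1 / (1 + np.exp(-np.clip(z, -30, 30))))

# Design matrix: intercept, four main effects, x1:x2 interaction.
D = np.column_stack([np.ones(n), x1, x2, x3, x4, x1 * x2])

# Newton-Raphson for the binomial log-likelihood.
beta = np.zeros(D.shape[1])
for _ in range(100):
    eta = np.clip(D @ beta, -30, 30)          # clip to avoid overflow in exp
    p = 1 / (1 + np.exp(-eta))
    grad = D.T @ (y - p)                      # score vector
    hess = D.T @ (D * (p * (1 - p))[:, None]) # information matrix
    step = np.linalg.solve(hess, grad)
    beta += step
    if np.max(np.abs(step)) < 1e-8:
        break
```

<p>With this sample size the estimates land close to the population values, echoing the glm output above up to sampling error.</p>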
<p>R package <em>caret</em> provides a useful helper function for displaying kernel density estimated histograms for the predictors as a function of the two level outcome variable <em>y</em>:</p>
<p> </p>
<table cellspacing="0" cellpadding="0">
<tbody>
<tr>
<td bgcolor="white" width="431" height="149">
<table style="width: 100%;" cellspacing="0" cellpadding="0">
<tbody>
<tr>
<td>
<div class="shape">
<p>library(caret)</p>
<p>featurePlot(x = sim.data.df[,c(2:6)],</p>
<p> y = as.factor(sim.data.df$y),</p>
<p> plot = "density",</p>
<p> scales = list(x = list(relation="free"),</p>
<p> y = list(relation="free")),</p>
<p> adjust = 1.5,</p>
<p> pch = "|",</p>
<p> layout = c(3, 3),</p>
<p> auto.key = list(columns = 2))</p>
</div>
</td>
</tr>
</tbody>
</table>
</td>
</tr>
</tbody>
</table>
<p>The resulting plot is returned: </p>
<p> <img src="/benchmarks/sites/default/files/graphs.png" alt="Graphic output" width="609" height="423" /></p>
<p>The chosen population coefficients separate the two outcome groups (y = 1 versus y = 0) widely on the predictor variables. We can calculate the marginal probabilities implied by the estimated coefficients to see how far a unit change in each predictor moves the predicted probability of being in group 1 away from the 50% baseline: </p>
<table cellspacing="0" cellpadding="0">
<tbody>
<tr>
<td bgcolor="white" width="504" height="106">
<table style="width: 100%;" cellspacing="0" cellpadding="0">
<tbody>
<tr>
<td>
<div class="shape">
<p>library(arm)</p>
<p>glm.coefs<-coef(glm.fit)</p>
<p>invlogit(glm.coefs) - .50</p>
<pre>(Intercept) x1 x2 x3 x4 x1:x2 </pre>
<pre> 0.2327767 0.3779851 0.4569387 0.4833883 0.4940197 0.4999607 </pre>
<p> </p>
</div>
</td>
</tr>
</tbody>
</table>
</td>
</tr>
</tbody>
</table>
<p>We have chosen very large predictor <em>effect sizes</em> for the simulation. Essentially, predictors <em>x4</em> and <em>x5</em> (the x1*x2 interaction) maximally predict the probability of <em>y=1</em> membership: knowledge of these predictors moves the predicted marginal probability of y=1 from .50 (absent their information) to roughly .99.</p>
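<p>The invlogit arithmetic above is easy to verify by hand. A minimal Python sketch, with numpy assumed and the coefficient values copied from the glm output above:</p>

```python
import numpy as np

def invlogit(z):
    """Inverse logit: maps a log-odds value to a probability in (0, 1)."""
    return 1 / (1 + np.exp(-z))

# Estimated coefficients, copied from the glm output above.
coefs = np.array([1.009, 1.973, 3.101, 4.081, 5.113, 10.144])

# How far each effect moves the predicted probability of y = 1
# away from the 50% baseline.
marginal = invlogit(coefs) - 0.50
```

<p>The first entry works out to roughly 0.233 and the x1:x2 entry to essentially 0.50, matching the table above.</p>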
<p>Now on to the bootstrap confidence intervals: first we need to create a wrapper function that will pass the resampled data, and their corresponding indices, to the <em>glm</em> function:</p>
<table cellspacing="0" cellpadding="0">
<tbody>
<tr>
<td bgcolor="white" width="504" height="178">
<table style="width: 100%;" cellspacing="0" cellpadding="0">
<tbody>
<tr>
<td>
<div class="shape">
<p>glm.coefs<-function (dataset, index)</p>
<p>{</p>
<p> sim.data.df<-dataset[index,]</p>
<p> </p>
<p> glm.fit <- try(glm(y~x1+x2+x3+x4+x1*x2,</p>
<p> data=sim.data.df,</p>
<p> family="binomial"), silent = TRUE)</p>
<p> </p>
<p> coefs<-try(coef(glm.fit), silent=TRUE)</p>
<p> print(coefs)</p>
<p> </p>
<p> return(coefs)</p>
<p>}</p>
</div>
</td>
</tr>
</tbody>
</table>
</td>
</tr>
</tbody>
</table>
<p>The vector containing the indices of the resampled data (<em>index</em>) is used to select the rows passed to the <em>glm</em> function. Our wrapper function for glm – <em>glm.coefs</em> – then returns the estimated coefficients back to the <em>boot</em> function for tabulation and post-processing. Additionally, we have wrapped the calls in the <em>try</em> function so that if a resampled data set fails <em>glm</em> estimation, <em>glm.coefs</em> and <em>boot</em> will not abort with an error, but will instead continue with missing values for the coefficients. Lastly, we have put a print statement within the body of glm.coefs so that we can monitor the estimated coefficient values as they are produced.</p>
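<p>The data-plus-index calling convention is the one piece of <em>boot</em> that tends to trip people up. A language-neutral sketch of the same pattern in Python; the estimator here is a simple mean standing in for the glm fit, and the names are illustrative rather than from any package:</p>

```python
import numpy as np

def boot_statistic(data, index):
    """boot-style wrapper: receive the full data plus resampled row
    indices, estimate on the resample, return NaN if estimation fails."""
    resample = data[index]
    try:
        # Stand-in estimator; the article uses a glm fit here.
        return np.mean(resample)
    except Exception:
        # Mirrors R's try(): continue with a missing value rather
        # than aborting the whole bootstrap run.
        return np.nan

rng = np.random.default_rng(1)
data = rng.standard_normal(100)

# One bootstrap resample: n row indices drawn with replacement.
index = rng.integers(0, len(data), size=len(data))
stat = boot_statistic(data, index)
```

<p>Calling the wrapper with the identity index simply reproduces the original-sample estimate, which is a handy sanity check.</p>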
<p>Our last bit of R script sends the data and glm.coefs function to boot for processing: </p>
<table cellspacing="0" cellpadding="0">
<tbody>
<tr>
<td bgcolor="white" width="567" height="146">
<table style="width: 100%;" cellspacing="0" cellpadding="0">
<tbody>
<tr>
<td>
<div class="shape">
<p>boot.fit<-boot(sim.data.df, glm.coefs, R=1000)</p>
<p>boot.fit</p>
<p> </p>
<p>for(ii in 1:length(boot.fit$t0))</p>
<p> {</p>
<p> cat(rep("\n",5))</p>
<p> print(names(boot.fit$t0[ii]))</p>
<p> cat(rep("\n",2))</p>
<p> print(boot.ci(boot.fit, conf = 0.95, type = c("norm","perc","basic"),index = ii))</p>
<p> }</p>
</div>
</td>
</tr>
</tbody>
</table>
</td>
</tr>
</tbody>
</table>
<p>The for loop in this script isn’t necessary; it is merely a short-cut for printing out the results of three different types of confidence intervals (CI) for the six estimated parameters (the intercept, x1 through x4, and the x1:x2 interaction). Notice that each of the three CI types captures the true population parameter. This is simply a consequence of having used few predictors, a large initial sample size, and 1000 bootstrap samples in the bootstrap CI estimation. </p>
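<p>For readers curious what the three interval types actually compute, here is a hedged Python sketch of the normal, basic, and percentile formulas applied to bootstrap replicates of a sample mean. numpy is assumed, and this mirrors the standard definitions behind boot.ci's <em>type</em> argument, not its code:</p>

```python
import numpy as np

def boot_cis(t0, t, conf=0.95):
    """Normal, basic, and percentile bootstrap CIs for an estimate t0
    given a vector of bootstrap replicates t."""
    alpha = 1 - conf
    lo, hi = np.quantile(t, [alpha / 2, 1 - alpha / 2])
    bias = np.mean(t) - t0
    se = np.std(t, ddof=1)
    zcrit = 1.959963984540054              # 97.5th standard-normal percentile
    return {
        "normal":     (t0 - bias - zcrit * se, t0 - bias + zcrit * se),
        "basic":      (2 * t0 - hi, 2 * t0 - lo),  # quantiles reflected about t0
        "percentile": (lo, hi),
    }

rng = np.random.default_rng(2)
sample = rng.standard_normal(200) + 1.0
t0 = sample.mean()

# 1000 bootstrap replicates of the sample mean.
t = np.array([rng.choice(sample, size=len(sample)).mean() for _ in range(1000)])
cis = boot_cis(t0, t)
```

<p>As in the glm output above, the three intervals differ slightly but tell the same story when the bootstrap distribution is roughly symmetric.</p>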
<table cellspacing="0" cellpadding="0">
<tbody>
<tr>
<td bgcolor="white" width="567" height="416">
<table style="width: 100%;" cellspacing="0" cellpadding="0">
<tbody>
<tr>
<td>
<div class="shape">
<pre>> boot.fit</pre>
<pre> </pre>
<pre>ORDINARY NONPARAMETRIC BOOTSTRAP</pre>
<pre> </pre>
<pre> </pre>
<pre>Call:</pre>
<pre>boot(data = sim.data.df, statistic = glm.coefs, R = 1000)</pre>
<pre> </pre>
<pre> </pre>
<pre>Bootstrap Statistics :</pre>
<pre> original bias std. error</pre>
<pre>t1* 1.008756 0.007386088 0.08582566</pre>
<pre>t2* 1.973487 0.011373649 0.12787464</pre>
<pre>t3* 3.101113 0.027926437 0.15442723</pre>
<pre>t4* 4.080900 0.027597606 0.17447659</pre>
<pre>t5* 5.113291 0.036752067 0.21991954</pre>
<pre>t6* 10.144203 0.074247504 0.42935352</pre>
<pre>> for(ii in 1:length(boot.fit$t0))</pre>
<pre>+ {</pre>
<pre>+ cat(rep("\n",5))</pre>
<pre>+ print(names(boot.fit$t0[ii]))</pre>
<pre>+ cat(rep("\n",2))</pre>
<pre>+ print(boot.ci(boot.fit, conf = 0.95, type = c("norm","perc","basic"), index = ii))</pre>
<pre>+ }</pre>
<p> </p>
</div>
</td>
</tr>
</tbody>
</table>
</td>
</tr>
</tbody>
</table>
<p> </p>
<table cellspacing="0" cellpadding="0">
<tbody>
<tr>
<td bgcolor="white" width="567" height="810">
<table style="width: 100%;" cellspacing="0" cellpadding="0">
<tbody>
<tr>
<td>
<div class="shape">
<table style="width: 889px;" border="0" cellspacing="0" cellpadding="0">
<tbody>
<tr>
<td valign="top">
<p> [1] "(Intercept)"</p>
<p> </p>
<p> </p>
<p>BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS</p>
<p>Based on 1000 bootstrap replicates</p>
<p> </p>
<p>CALL :</p>
<p>boot.ci(boot.out = boot.fit, conf = 0.95, type = c("norm", "perc",</p>
<p> "basic"), index = ii)</p>
<p> </p>
<p>Intervals :</p>
<p>Level Normal Basic Percentile </p>
<p>95% ( 0.833, 1.170 ) ( 0.824, 1.164 ) ( 0.854, 1.194 ) </p>
<p>Calculations and Intervals on Original Scale</p>
<p> </p>
<p>[1] "x1"</p>
<p> </p>
<p> </p>
<p>BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS</p>
<p>Based on 1000 bootstrap replicates</p>
<p> </p>
<p>CALL :</p>
<p>boot.ci(boot.out = boot.fit, conf = 0.95, type = c("norm", "perc",</p>
<p> "basic"), index = ii)</p>
<p> </p>
<p>Intervals :</p>
<p>Level Normal Basic Percentile </p>
<p>95% ( 1.711, 2.213 ) ( 1.704, 2.191 ) ( 1.756, 2.243 ) </p>
<p>Calculations and Intervals on Original Scale</p>
<p> </p>
<p> [1] "x2"</p>
<p> </p>
<p> </p>
<p>BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS</p>
<p>Based on 1000 bootstrap replicates</p>
<p> </p>
<p>CALL :</p>
<p>boot.ci(boot.out = boot.fit, conf = 0.95, type = c("norm", "perc",</p>
<p> "basic"), index = ii)</p>
<p> </p>
<p>Intervals :</p>
<p>Level Normal Basic Percentile </p>
<p>95% ( 2.771, 3.376 ) ( 2.731, 3.369 ) ( 2.833, 3.471 ) </p>
<p>Calculations and Intervals on Original Scale</p>
<p> </p>
<p> [1] "x3"</p>
<p> </p>
<p> </p>
<p>BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS</p>
<p>Based on 1000 bootstrap replicates</p>
<p> </p>
<p>CALL :</p>
<p>boot.ci(boot.out = boot.fit, conf = 0.95, type = c("norm", "perc",</p>
<p> "basic"), index = ii)</p>
<p> </p>
<p>Intervals :</p>
<p>Level Normal Basic Percentile </p>
<p>95% ( 3.711, 4.395 ) ( 3.704, 4.369 ) ( 3.793, 4.457 ) </p>
<p>Calculations and Intervals on Original Scale</p>
<p> </p>
<p> </p>
</td>
</tr>
</tbody>
</table>
<p> </p>
</div>
</td>
</tr>
</tbody>
</table>
</td>
</tr>
</tbody>
</table>
<p> </p>
<table cellspacing="0" cellpadding="0">
<tbody>
<tr>
<td bgcolor="white" width="618" height="459">
<table style="width: 100%;" cellspacing="0" cellpadding="0">
<tbody>
<tr>
<td>
<div class="shape">
<pre>[1] "x4"</pre>
<pre> </pre>
<pre> </pre>
<pre>BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS</pre>
<pre>Based on 1000 bootstrap replicates</pre>
<pre> </pre>
<pre>CALL : </pre>
<pre>boot.ci(boot.out = boot.fit, conf = 0.95, type = c("norm", "perc", </pre>
<pre> "basic"), index = ii)</pre>
<pre> </pre>
<pre>Intervals : </pre>
<pre>Level Normal Basic Percentile </pre>
<pre>95% ( 4.646, 5.508 ) ( 4.621, 5.498 ) ( 4.728, 5.606 ) </pre>
<pre>Calculations and Intervals on Original Scale</pre>
<pre> </pre>
<pre>[1] "x1:x2"</pre>
<pre> </pre>
<pre>BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS</pre>
<pre>Based on 1000 bootstrap replicates</pre>
<pre> </pre>
<pre>CALL : </pre>
<pre>boot.ci(boot.out = boot.fit, conf = 0.95, type = c("norm", "perc", </pre>
<pre> "basic"), index = ii)</pre>
<pre> </pre>
<pre>Intervals : </pre>
<pre>Level Normal Basic Percentile </pre>
<pre>95% ( 9.23, 10.91 ) ( 9.15, 10.84 ) ( 9.45, 11.13 ) </pre>
<pre>Calculations and Intervals on Original Scale</pre>
<p> </p>
</div>
</td>
</tr>
</tbody>
</table>
</td>
</tr>
</tbody>
</table>
<p><strong style="font-size: 1em;"><span style="font-size: xx-small;">Originally published October 2014 -- Please note that information published in <em>Benchmarks Online</em> is likely to degrade over time, especially links to various Websites. To make sure you have the most current information on a specific topic, it may be best to search the UNT Website - <a href="http://www.unt.edu/">http://www.unt.edu</a> . You can also consult the UNT Helpdesk - <a href="http://www.unt.edu/helpdesk/">http://www.unt.edu/helpdesk/</a>. Questions and comments should be directed to <a href="mailto:benchmarks@unt.edu">benchmarks@unt.edu</a>.</span></strong></p>
http://it.unt.edu/benchmarks/issues/2014/09/rss-matters
<div class="field field-type-date field-field-pubdate">
<div class="field-label">Publication Date: </div>
<div class="field-items">
<div class="field-item odd">
<span class="date-display-single">2014-09</span> </div>
</div>
</div>
<h3 style="text-align: center;"><img src="/benchmarks/sites/default/files/rssunt.gif" alt="RSS Logo" width="506" height="101" /></h3>
<h3><strong>Factor Analysis with Binary items: A quick review with examples.</strong></h3>
<p><em>Link to the last RSS article here: <a href="http://it.unt.edu/benchmarks/issues/2014/08/rss-matters">Call to Create a UNT R Users Group</a>.</em> <em>-- Ed.</em></p>
<p><strong>By <a href="mailto:Jonathan.Starkweather@unt.edu">Dr. Jon Starkweather</a>, Research and Statistical Support Consultant Team </strong></p>
<p><span style="font-size: medium;"><strong>T</strong></span>here have been several clients in recent weeks who have come to us with binary survey data that they would like to factor analyze. The current article was written to provide a simple resource for others who may find themselves in a similar situation.</p>
<p>Of course, our professional conscience requires that we mention at the outset: if you are creating a survey (online, paper & pencil, or any other format), you should create the items and response choices in such a way that the responses may be considered interval or ratio, or at the very least ordinal – not nominal categories and particularly not binary categories. We also feel compelled to advise you against the use of two other types of items. Please do not use any type of contingency or dependent items (e.g. if you answered ‘yes’ to item 6, go to item 6a; if you answered ‘no’ to item 6, please move forward to item 7). Also, please do not use any type of multiple response items (e.g. ‘choose all those which apply’). If you would like more information on why we make the recommendations above, please consult the substantial literature on survey development (e.g. McDonald, 1999; OECD, 2008; Statistics Canada, 2010).</p>
<h4><strong>Examples</strong></h4>
<p>First, import some (simulated) example data. The data used here are available at the URL given in the ‘read.table’ function below. The data contain eight binary items (x1, x2, x3, x4, x5, x6, x7, & x8) and 1000 cases (i.e. rows), and support two orthogonal factors.</p>
<p><span style="color: #ff0000;">df.1 <- read.table(</span></p>
<p><span style="color: #ff0000;"> "http://www.unt.edu/rss/class/Jon/Benchmarks/BinaryDataFA.txt",</span></p>
<p><span style="color: #ff0000;"> header = TRUE, sep = ",", na.strings = "NA", dec = ".",</span></p>
<p><span style="color: #ff0000;"> strip.white = TRUE)</span></p>
<p><span style="color: #ff0000;">head(df.1)</span></p>
<p> <span style="color: #0000ff;">x1 x2 x3 x4 x5 x6 x7 x8</span></p>
<p><span style="color: #0000ff;">1 0 0 0 0 0 0 0 0</span></p>
<p><span style="color: #0000ff;">2 0 1 1 1 0 0 0 0</span></p>
<p><span style="color: #0000ff;">3 0 0 0 0 0 0 0 0</span></p>
<p><span style="color: #0000ff;">4 0 0 0 0 0 0 0 0</span></p>
<p><span style="color: #0000ff;">5 0 0 0 0 0 0 0 0</span></p>
<p><span style="color: #0000ff;">6 0 1 0 0 0 0 0 0</span></p>
<p><span style="color: #ff0000;">nrow(df.1)</span></p>
<p><span style="color: #0000ff;">[1] 1000</span></p>
<p>Notice above, the data is numeric; this is important because if you simply supply this data to a factor analysis function, that function will (by default) calculate the matrix of association assuming those numbers are interval or ratio – which would be incorrect or potentially very biased. Therefore, what is really needed is a way to calculate the correct matrix of association (for the factor analysis) using the appropriate correlation statistic for each pair of variables in our data. Fortunately, the ‘polycor’ package (Fox, 2014) contains a function called ‘hetcor’ for doing just that. The ‘hetcor’ function basically looks at each pair of variables in a data frame and computes the appropriate <em>heterogeneous correlation</em> for each pair based on the type of variables which make up each pair. Recall that with categorical variables, the polychoric correlation is appropriate, and the tetrachoric correlation is a special case of the polychoric correlation (for when both variables being correlated are binary). The ‘hetcor’ function is capable of calculating Pearson correlations (for numeric data), polyserial correlations (for numeric and ordinal data), and polychoric correlations (for ordered or non-ordered factors) – from a single data frame with all of the above mentioned types of variables.</p>
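<p>To build intuition for what a tetrachoric correlation recovers, consider the special case where both binary items are median splits of standard-normal latent variables. There the tetrachoric correlation has a textbook closed form, rho = sin(2&#960;(p11 &#8722; 1/4)), where p11 is the proportion of cases with both items equal to 1. A small Python simulation (numpy assumed); note this closed form is a special case for illustration, not the maximum-likelihood algorithm ‘hetcor’ uses for arbitrary thresholds:</p>

```python
import numpy as np

def tetrachoric_median_split(x, y):
    """Closed-form tetrachoric correlation for two 0/1 items created by
    median-splitting standard-normal latents:
    rho = sin(2*pi*(p11 - 1/4)), where p11 = P(x = 1 and y = 1)."""
    p11 = np.mean((x == 1) & (y == 1))
    return np.sin(2 * np.pi * (p11 - 0.25))

# Simulate correlated standard-normal latents, then dichotomize at zero.
rng = np.random.default_rng(3)
rho_true = 0.6
z = rng.multivariate_normal([0.0, 0.0], [[1.0, rho_true], [rho_true, 1.0]],
                            size=200_000)
x, y = (z > 0).astype(int).T

rho_hat = tetrachoric_median_split(x, y)  # recovers the latent correlation
phi = np.corrcoef(x, y)[0, 1]             # Pearson on the raw 0/1 codes
```

<p>The Pearson correlation of the raw 0/1 codes (the phi coefficient) comes out well below the latent correlation, which is exactly why feeding raw binary data to a default factor analysis routine is biased.</p>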
<p>So, because the data is imported as numeric, we must first recode it as factor (i.e. categorical); which can be done very easily using the ‘sapply’ function. There are other packages and functions which allow more precise control over recoding variables; such as the ‘recode’ function in the ‘car’ package (Fox, et al., 2014).</p>
<p><span style="color: #ff0000;">df.2 <- sapply(df.1, as.factor)</span></p>
<p><span style="color: #ff0000;">head(df.2)</span></p>
<p> <span style="color: #0000ff;">x1 x2 x3 x4 x5 x6 x7 x8</span></p>
<p><span style="color: #0000ff;">[1,] "0" "0" "0" "0" "0" "0" "0" "0"</span></p>
<p><span style="color: #0000ff;">[2,] "0" "1" "1" "1" "0" "0" "0" "0"</span></p>
<p><span style="color: #0000ff;">[3,] "0" "0" "0" "0" "0" "0" "0" "0"</span></p>
<p><span style="color: #0000ff;">[4,] "0" "0" "0" "0" "0" "0" "0" "0"</span></p>
<p><span style="color: #0000ff;">[5,] "0" "0" "0" "0" "0" "0" "0" "0"</span></p>
<p><span style="color: #0000ff;">[6,] "0" "1" "0" "0" "0" "0" "0" "0"</span></p>
<p>Once the <em>numeric</em> data have been recoded <em>as factor</em>, we can proceed by loading the ‘polycor’ package which contains the ‘hetcor’ function.</p>
<p><span style="color: #ff0000;">library(polycor)</span></p>
<p><span style="color: #0000ff;">Loading required package: mvtnorm</span></p>
<p><span style="color: #0000ff;">Loading required package: sfsmisc</span></p>
<p>Now we can compute the appropriate correlation matrix and assign that matrix to a new object (het.mat). Notice below, we are extracting only the correlation matrix ($cor) from the output of the ‘hetcor’ function.</p>
<p><span style="color: #ff0000;">het.mat <- hetcor(df.2)$cor</span></p>
<p><span style="color: #0000ff;">Warning messages:</span></p>
<p><span style="color: #0000ff;">1: In polychor(x, y, ML = ML, std.err = std.err) :</span></p>
<p><span style="color: #0000ff;"> inadmissible correlation set to 1</span></p>
<p><span style="color: #0000ff;">2: In hetcor.data.frame(dframe, ML = ML, std.err = std.err, bins = bins, :</span></p>
<p><span style="color: #0000ff;"> the correlation matrix has been adjusted to make it positive-definite</span></p>
<p><span style="color: #ff0000;">het.mat</span></p>
<p> <span style="color: #0000ff;">x1 x2 x3 x4 x5</span></p>
<p><span style="color: #0000ff;">x1 1.000000000 0.910975550 0.844483311 0.691731074 -0.002245134</span></p>
<p><span style="color: #0000ff;">x2 0.910975550 1.000000000 0.859541108 0.808750265 0.037625262</span></p>
<p><span style="color: #0000ff;">x3 0.844483311 0.859541108 1.000000000 0.723304581 -0.026716610</span></p>
<p><span style="color: #0000ff;">x4 0.691731074 0.808750265 0.723304581 1.000000000 -0.001185206</span></p>
<p><span style="color: #0000ff;">x5 -0.002245134 0.037625262 -0.026716610 -0.001185206 1.000000000</span></p>
<p><span style="color: #0000ff;">x6 -0.039424602 -0.004851113 -0.046661991 -0.001214029 0.993573475</span></p>
<p><span style="color: #0000ff;">x7 0.002335945 0.005438252 -0.014930707 -0.009831874 0.879110898</span></p>
<p><span style="color: #0000ff;">x8 -0.036916591 -0.054512229 0.006043798 0.031313650 0.794959194</span></p>
<p> <span style="color: #0000ff;">x6 x7 x8</span></p>
<p><span style="color: #0000ff;">x1 -0.039424602 0.002335945 -0.036916591</span></p>
<p><span style="color: #0000ff;">x2 -0.004851113 0.005438252 -0.054512229</span></p>
<p><span style="color: #0000ff;">x3 -0.046661991 -0.014930707 0.006043798</span></p>
<p><span style="color: #0000ff;">x4 -0.001214029 -0.009831874 0.031313650</span></p>
<p><span style="color: #0000ff;">x5 0.993573475 0.879110898 0.794959194</span></p>
<p><span style="color: #0000ff;">x6 1.000000000 0.849171046 0.781588616</span></p>
<p><span style="color: #0000ff;">x7 0.849171046 1.000000000 0.703973732</span></p>
<p><span style="color: #0000ff;">x8 0.781588616 0.703973732 1.000000000</span></p>
<p>Although there are two warnings listed above, the function does in fact return the appropriate correlation matrix. Now we can proceed with the factor analysis using this ‘het.mat’ correlation matrix as the matrix of association for the factor analysis.</p>
<p><span style="color: #ff0000;">fa.1 <- factanal(covmat = het.mat, factors = 2, rotation = "varimax")</span></p>
<p><span style="color: #ff0000;">fa.1</span></p>
<p><span style="color: #0000ff;">Call:</span></p>
<p><span style="color: #0000ff;">factanal(factors = 2, covmat = het.mat, rotation = "varimax")</span></p>
<p><span style="color: #0000ff;">Uniquenesses:</span></p>
<p><span style="color: #0000ff;"> x1 x2 x3 x4 x5 x6 x7 x8</span></p>
<p><span style="color: #0000ff;">0.164 0.005 0.252 0.345 0.005 0.008 0.243 0.368</span></p>
<p><span style="color: #0000ff;">Loadings:</span></p>
<p><span style="color: #0000ff;"> Factor1 Factor2</span></p>
<p><span style="color: #0000ff;">x1 0.913</span></p>
<p><span style="color: #0000ff;">x2 0.997</span></p>
<p><span style="color: #0000ff;">x3 0.863</span></p>
<p><span style="color: #0000ff;">x4 0.809</span></p>
<p><span style="color: #0000ff;">x5 0.997 </span></p>
<p><span style="color: #0000ff;">x6 0.996 </span></p>
<p><span style="color: #0000ff;">x7 0.870 </span></p>
<p><span style="color: #0000ff;">x8 0.794 </span> </p>
<p> <span style="color: #0000ff;"> Factor1 Factor2</span></p>
<p><span style="color: #0000ff;">SS loadings 3.378 3.232</span></p>
<p><span style="color: #0000ff;">Proportion Var 0.422 0.404</span></p>
<p><span style="color: #0000ff;">Cumulative Var 0.422 0.826</span></p>
<p><span style="color: #0000ff;">The degrees of freedom for the model is 13 and the fit was 12.2084</span></p>
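<p>‘factanal’ fits by maximum likelihood, but the broader idea of pulling factors out of a correlation matrix can be illustrated with the simpler iterated principal-axis method. The Python sketch below (numpy assumed) is illustrative only: it builds a clean two-factor correlation matrix with loadings echoing the example, then recovers them; it is not factanal’s algorithm.</p>

```python
import numpy as np

# A correlation matrix with an exact two-factor structure: items 1-4 load
# on factor 1, items 5-8 on factor 2 (values echo the example's loadings).
L_true = np.zeros((8, 2))
L_true[:4, 0] = [0.91, 0.99, 0.86, 0.81]
L_true[4:, 1] = [0.99, 0.99, 0.87, 0.79]
R = L_true @ L_true.T
np.fill_diagonal(R, 1.0)

# Iterated principal-axis factoring: put communality estimates on the
# diagonal, eigendecompose, keep the top two factors, and repeat.
h2 = np.full(8, 0.8)                      # crude starting communalities
for _ in range(200):
    Rr = R.copy()
    np.fill_diagonal(Rr, h2)              # "reduced" correlation matrix
    vals, vecs = np.linalg.eigh(Rr)
    top = np.argsort(vals)[::-1][:2]      # two largest eigenvalues
    loadings = vecs[:, top] * np.sqrt(vals[top])
    h2_new = np.sum(loadings ** 2, axis=1)
    if np.max(np.abs(h2_new - h2)) < 1e-10:
        h2 = h2_new
        break
    h2 = h2_new
```

<p>With this clean structure the recovered loadings reproduce the off-diagonal correlations essentially exactly; real data, as in the ‘fa’ output above, leave residuals that the fit indices summarize.</p>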
<p>Another equally effective way to factor analyze binary data (or any other type of data), using a correlation matrix, is with the ‘fa’ function from the ‘psych’ package (Revelle, 2014). Again, we use the correlation matrix we generated with the ‘hetcor’ function. Please note, the default method of extraction for the ‘fa’ function is minimum residuals (method = minres) and not maximum likelihood (method = ml).</p>
<p><span style="color: #ff0000;">library(psych)</span></p>
<p><span style="color: #ff0000;">fa.2 <- fa(r = het.mat, nfactors = 2, n.obs = nrow(df.2), rotate = "varimax")</span></p>
<p><span style="color: #0000ff;">Loading required package: MASS</span></p>
<p><span style="color: #0000ff;">Loading required package: GPArotation</span></p>
<p><span style="color: #0000ff;">Loading required package: parallel</span></p>
<p><span style="color: #ff0000;">fa.2</span></p>
<p><span style="color: #0000ff;">Factor Analysis using method = minres</span></p>
<p><span style="color: #0000ff;">Call: fa(r = het.mat, nfactors = 2, n.obs = nrow(df.2), rotate = "varimax")</span></p>
<p><span style="color: #0000ff;">Standardized loadings (pattern matrix) based upon correlation matrix</span></p>
<p> <span style="color: #0000ff;"> MR1 MR2 h2 u2 com</span></p>
<p><span style="color: #0000ff;">x1 -0.11 0.93 0.87 0.128 1</span></p>
<p><span style="color: #0000ff;">x2 -0.09 0.96 0.94 0.062 1</span></p>
<p><span style="color: #0000ff;">x3 -0.11 0.92 0.86 0.141 1</span></p>
<p><span style="color: #0000ff;">x4 -0.07 0.86 0.75 0.250 1</span></p>
<p><span style="color: #0000ff;">x5 0.98 0.10 0.96 0.036 1</span></p>
<p><span style="color: #0000ff;">x6 0.97 0.08 0.94 0.058 1</span></p>
<p><span style="color: #0000ff;">x7 0.91 0.09 0.84 0.160 1</span></p>
<p><span style="color: #0000ff;">x8 0.87 0.07 0.76 0.242 1</span></p>
<p> <span style="color: #0000ff;">MR1 MR2</span></p>
<p><span style="color: #0000ff;">SS loadings 3.51 3.41</span></p>
<p><span style="color: #0000ff;">Proportion Var 0.44 0.43</span></p>
<p><span style="color: #0000ff;">Cumulative Var 0.44 0.87</span></p>
<p><span style="color: #0000ff;">Proportion Explained 0.51 0.49</span></p>
<p><span style="color: #0000ff;">Cumulative Proportion 0.51 1.00</span></p>
<p><span style="color: #0000ff;">Mean item complexity = 1</span></p>
<p><span style="color: #0000ff;">Test of the hypothesis that 2 factors are sufficient.</span></p>
<p><span style="color: #0000ff;">The degrees of freedom for the null model are 28 and the objective function was 23.3 with Chi Square of 23199.31</span></p>
<p><span style="color: #0000ff;">The degrees of freedom for the model are 13 and the objective function was 13.77</span></p>
<p><span style="color: #0000ff;">The root mean square of the residuals (RMSR) is 0.04</span></p>
<p><span style="color: #0000ff;">The df corrected root mean square of the residuals is 0.06</span></p>
<p><span style="color: #0000ff;">The harmonic number of observations is 1000 with the empirical chi square 99.24 with prob < 2.3e-15</span></p>
<p><span style="color: #0000ff;">The total number of observations was 1000 with MLE Chi Square = 13694.45 with prob < 0</span></p>
<p><span style="color: #0000ff;">Tucker Lewis Index of factoring reliability = -0.273</span></p>
<p><span style="color: #0000ff;">RMSEA index = 1.029 and the 90 % confidence intervals are 1.011 1.04</span></p>
<p><span style="color: #0000ff;">BIC = 13604.65</span></p>
<p><span style="color: #0000ff;">Fit based upon off diagonal values = 0.99</span></p>
<p><span style="color: #0000ff;">Measures of factor score adequacy </span></p>
<p><span style="color: #0000ff;"> MR1 MR2</span></p>
<p><span style="color: #0000ff;">Correlation of scores with factors 1 1</span></p>
<p><span style="color: #0000ff;">Multiple R square of scores with factors 1 1</span></p>
<p><span style="color: #0000ff;">Minimum correlation of possible factor scores 1 1</span></p>
<h4><strong>Conclusions</strong></h4>
<p>As demonstrated above, using binary data for factor analysis in R is no more difficult than using continuous data. Although not demonstrated here, if one has polytomous or otherwise mixed variables to factor analyze, one can use the ‘hetcor’ function (i.e. heterogeneous correlations) located in the ‘polycor’ package (Fox, 2014). More extensive examples of the use of the ‘hetcor’ function are available at the RSS <a href="http://www.unt.edu/rss/class/Jon/R_SC/">Do-it-yourself Introduction to R</a> course page, where many other examples (not just factor analysis) are provided. Lastly, a copy of the script file used for the above examples is available <a href="http://www.unt.edu/rss/class/Jon/Benchmarks/BinaryFA.R">here</a>.</p>
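<p>A minimal sketch of the ‘hetcor’ approach is given below (it assumes the ‘polycor’ package is installed; the data frame and variable names are made up for illustration). The ‘hetcor’ function chooses the appropriate correlation type for each pair of variables: Pearson (numeric with numeric), polyserial (numeric with ordinal), or polychoric (ordinal with ordinal).</p>

```r
# Sketch only: hetcor() on a small mixed data frame (hypothetical data).
library(polycor)
set.seed(1)
mixed.df <- data.frame(x = rnorm(200),
                       y = ordered(sample(1:4, 200, replace = TRUE)))
het <- hetcor(mixed.df)       # picks Pearson/polyserial/polychoric per pair
het$correlations              # the heterogeneous correlation matrix
```

<p>The resulting correlation matrix can then be passed to a factoring function in the same way the tetrachoric matrix was used above.</p>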
<p>Until next time; remember what George Carlin said: <em>“Inside every cynical person, there is a disappointed idealist.”</em></p>
<h4><span style="font-size: 1em;">References / Resources</span></h4>
<p>Carlin, G. (1937 – 2008). <a href="http://www.just-one-liners.com/ppl/george-carlin">http://www.just-one-liners.com/ppl/george-carlin</a></p>
<p>Fox, J. (2014). The ‘polycor’ package. Documentation available at CRAN: <a href="http://cran.r-project.org/web/packages/polycor/index.html">http://cran.r-project.org/web/packages/polycor/index.html</a></p>
<p>Fox, J., et al. (2014). The ‘car’ package. Documentation available at CRAN: <a href="http://cran.r-project.org/web/packages/car/index.html">http://cran.r-project.org/web/packages/car/index.html</a></p>
<p>McDonald, R. P. (1999). <em>Test Theory: A Unified Treatment.</em> Mahwah, NJ: Erlbaum.</p>
<p>Organization for Economic Co-operation and Development (OECD). (2008). <em>Handbook on Constructing Composite Indicators</em>. <a href="http://www.oecd.org/std/42495745.pdf">http://www.oecd.org/std/42495745.pdf</a></p>
<p>Revelle, W. (2014). The ‘psych’ package. Documentation available at CRAN: <a href="http://cran.r-project.org/web/packages/psych/index.html">http://cran.r-project.org/web/packages/psych/index.html</a></p>
<p>Statistics Canada. (2010). Survey Methods and Practices. Ottawa, Canada: Minister of Industry. <a href="http://www.statcan.gc.ca/bsolc/olc-cel/olc-cel?lang=eng&catno=12-587-X">http://www.statcan.gc.ca/bsolc/olc-cel/olc-cel?lang=eng&catno=12-587-X</a></p>
<h4 style="text-align: left;"><strong><span style="font-size: xx-small;">Originally published September 2014 -- Please note that information published in <em>Benchmarks Online</em> is likely to degrade over time, especially links to various Websites. To make sure you have the most current information on a specific topic, it may be best to search the UNT Website - <a href="http://www.unt.edu/">http://www.unt.edu</a> . You can also consult the UNT Helpdesk - <a href="http://www.unt.edu/helpdesk/">http://www.unt.edu/helpdesk/</a>. Questions and comments should be directed to <a href="mailto:benchmarks@unt.edu">benchmarks@unt.edu</a>.</span></strong></h4>
http://it.unt.edu/benchmarks/issues/2014/08/rss-matters
<div class="field field-type-date field-field-pubdate">
<div class="field-label">Publication Date: </div>
<div class="field-items">
<div class="field-item odd">
<span class="date-display-single">2014-08</span> </div>
</div>
</div>
<h3 style="text-align: center;"><img src="/benchmarks/sites/default/files/rssunt.gif" alt="RSS Logo" width="506" height="101" /></h3>
<h3><strong>Call to Create a UNT R Users Group</strong></h3>
<p><em>Link to the last RSS article here: </em><a href="http://it.unt.edu/benchmarks/issues/2014/07/rss-matters">A <em>new</em> recommended way of dealing with multiple missing values: Using missForest for all your imputation needs.</a> <em>-- Ed.</em></p>
<p><strong>By <a href="mailto:richard.herrington@unt.edu">Dr. Richard Herrington</a> and <a href="mailto:Jonathan.Starkweather@unt.edu">Dr. Jon Starkweather</a>, RSS Team </strong></p>
<p><span style="font-size: medium;"><strong>R</strong></span> has noticeably gained visibility at UNT over the last few years. Recently I strolled through UNT’s campus bookstore and noticed that a number of courses are using R – a recent development; not too long ago, only a handful of people on campus used R regularly. For those who have not heard of the R statistical system, Wikipedia’s R entry provides a nice overview of the history and specifics of the <a href="http://en.wikipedia.org/wiki/R_(programming_language)">R system</a>.</p>
<p>Given this increased popularity, we believe it might be time to form an R <a href="http://en.wikipedia.org/wiki/Users%27_group">users’ group</a> here on campus (RUG, perhaps?). To our knowledge no such user group exists on campus; the closest one we are aware of is on the University of Texas at Dallas campus. Others might exist in the surrounding area, but we have been unable to find them using the <a href="http://r-users-group.meetup.com/"><em>R User’s Group Meetup Search Tool</em></a>.</p>
<h4><strong>UNT R Users Group Poll</strong></h4>
<p>To facilitate the organization of this group, we have created an online poll to: i) gauge interest in such a group, and ii) collect contact information regarding a first meetup time. If you are interested in being part of such a group, please <a href="https://unt.az1.qualtrics.com/SE/?SID=SV_71ApFQtHvqAaccJ">provide us some contact information through this poll</a>. If you are not sure, browse through our favorite R news feed aggregator, <a href="http://www.r-bloggers.com/">R-Bloggers</a>, to get a sense of what this user group <em>could</em> be about.</p>
<h4 style="text-align: left;"><strong><span style="font-size: xx-small;">Originally published August 2014 -- Please note that information published in <em>Benchmarks Online</em> is likely to degrade over time, especially links to various Websites. To make sure you have the most current information on a specific topic, it may be best to search the UNT Website - <a href="http://www.unt.edu/">http://www.unt.edu</a> . You can also consult the UNT Helpdesk - <a href="http://www.unt.edu/helpdesk/">http://www.unt.edu/helpdesk/</a>. Questions and comments should be directed to <a href="mailto:benchmarks@unt.edu">benchmarks@unt.edu</a>.</span></strong></h4>
http://it.unt.edu/benchmarks/issues/2014/07/rss-matters
<div class="field field-type-date field-field-pubdate">
<div class="field-label">Publication Date: </div>
<div class="field-items">
<div class="field-item odd">
<span class="date-display-single">2014-07</span> </div>
</div>
</div>
<h3 style="text-align: center;"><img src="/benchmarks/sites/default/files/rssunt.gif" alt="RSS Logo" width="506" height="101" /></h3>
<h3><strong>A <em>new</em> recommended way of dealing with multiple missing values: Using missForest for all your imputation needs.</strong></h3>
<p><em>Link to the last RSS article here: <a href="https://it.unt.edu/benchmarks/issues/2014/06/rss-matters">Basic Graph Creation and Manipulation in R</a></em>. <em>-- Ed.</em></p>
<p><strong>By <a href="mailto:Jonathan.Starkweather@unt.edu">Dr. Jon Starkweather</a>, Research and Statistical Support Consultant</strong></p>
<p><span style="font-size: medium;"><strong>A</strong></span> couple of months ago we provided a tutorial article on using the ‘rrp’ package for multiple missing value imputation. The ‘rrp’ package has consistently been our most recommended tool for dealing with missing values. However, the ‘rrp’ package has not been updated (i.e. adapted) to new versions of R since the release of R-2.15.1 (over a year ago). This has presented challenges to its utility – basically necessitating the installation of R-2.15.0 in order to use the ‘rrp’ package. Very recently it was discovered that the ‘rrp’ package is no longer available (even from R-forge) for any Windows install of R. This prompted us to find a new <em>go-to</em> package for missing value imputation.</p>
<p>The good news is this: we have now found a satisfactory replacement for the beloved ‘rrp’ package. The ‘missForest’ package (Stekhoven, 2013; Stekhoven, 2012) provides not only a function for conducting multiple imputation of mixed data (numeric and factor variables in one data frame), but also a utility to parallelize the process of doing such imputations. Below we offer a quick example of how to use the function with a simple data set. Please keep in mind, the function is not terribly fast, and when applied to large data sets it may take a considerable amount of time to complete the imputations (even when using the parallelize argument).</p>
<p>First, import some (simulated) example data. Notice we are importing the same data set twice; one version with no missing values and one version with missing values (Missing Completely At Random [MCAR]). Note the data files can be imported directly from the RSS URLs provided (i.e. simply copy the script and paste into your R console to follow along).</p>
<p><span style="color: #ff0000;">no.miss <- read.table(</span></p>
<p><span style="color: #ff0000;"> "http://www.unt.edu/rss/class/Jon/R_SC/Module4/missForest_noMiss.txt",</span></p>
<p><span style="color: #ff0000;"> header=TRUE, sep=",", na.strings="NA", dec=".", strip.white=TRUE)</span></p>
<p><span style="color: #ff0000;">wi.miss <- read.table(</span></p>
<p><span style="color: #ff0000;"> "http://www.unt.edu/rss/class/Jon/R_SC/Module4/missForest_Miss.txt",</span></p>
<p><span style="color: #ff0000;"> header=TRUE, sep=",", na.strings="NA", dec=".", strip.white=TRUE)</span></p>
<p>Next, we may want to take a look at the proportion of missing values (cells) present in the data file (i.e. the number of missing cells divided by the number of rows times the number of columns). Below, we see only 4.37% (1181) of the 27000 total cells (6 columns * 4500 rows) are missing values (i.e. 1181 / 27000 = .0437).</p>
<p><span style="color: #ff0000;">ncol(wi.miss); nrow(wi.miss)</span></p>
<p><span style="color: #0000ff;">[1] 6</span></p>
<p><span style="color: #0000ff;">[1] 4500</span></p>
<p><span style="color: #ff0000;">length(which(is.na(wi.miss) == "TRUE")) / (nrow(wi.miss)*ncol(wi.miss))</span></p>
<p><span style="color: #0000ff;">[1] 0.04374074</span></p>
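<p>The same proportion can be computed with a more compact base-R idiom, shown below on a small made-up data frame: ‘is.na’ applied to a data frame returns a logical matrix, and the mean of a logical matrix is the proportion of TRUE values, i.e. the proportion of missing cells.</p>

```r
# Toy data frame (illustrative only): 2 missing cells out of 6.
toy <- data.frame(x = c(1, NA, 3), y = c("a", "b", NA))
mean(is.na(toy))   # proportion of missing cells
```

<p>Applied to the imported data, <em>mean(is.na(wi.miss))</em> returns the same .0437 computed above.</p>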
<p>Next, we need to load the required package (missForest) and its dependencies (i.e. randomForest, foreach, itertools, & iterators).</p>
<p><span style="color: #ff0000;">library(missForest)</span></p>
<p><span style="color: #0000ff;">Loading required package: randomForest</span></p>
<p><span style="color: #0000ff;">randomForest 4.6-7</span></p>
<p><span style="color: #0000ff;">Type rfNews() to see new features/changes/bug fixes.</span></p>
<p><span style="color: #0000ff;">Loading required package: foreach</span></p>
<p><span style="color: #0000ff;">foreach: simple, scalable parallel programming from Revolution Analytics</span></p>
<p><span style="color: #0000ff;">Use Revolution R for scalability, fault tolerance and more.</span></p>
<p><span style="color: #0000ff;">http://www.revolutionanalytics.com</span></p>
<p><span style="color: #0000ff;">Loading required package: itertools</span></p>
<p><span style="color: #0000ff;">Loading required package: iterators</span></p>
<p>Apply the ‘missForest’ function with all arguments set to default values. The function returns a list object with 3 elements: “ximp”, which is the imputed data; “OOBerror”, which is the estimated (out-of-bag) imputation error; and “error”, which is the true imputation error (“error” is only returned when an ‘xtrue’ value is provided). Please note: the function <strong>does</strong> accept a data frame; the package documentation states that the data must be in a matrix (all numeric), but that is not the case.</p>
<p><span style="color: #ff0000;">im.out.1 <- missForest(xmis = wi.miss, maxiter = 10, ntree = 100,</span></p>
<p><span style="color: #ff0000;"> variablewise = FALSE,</span></p>
<p><span style="color: #ff0000;"> decreasing = FALSE, verbose = FALSE,</span></p>
<p><span style="color: #ff0000;"> mtry = floor(sqrt(ncol(wi.miss))), replace = TRUE,</span></p>
<p><span style="color: #ff0000;"> classwt = NULL, cutoff = NULL, strata = NULL,</span></p>
<p><span style="color: #ff0000;"> sampsize = NULL, nodesize = NULL, maxnodes = NULL,</span></p>
<p><span style="color: #ff0000;"> xtrue = NA, parallelize = "no")</span></p>
<p> <span style="color: #0000ff;">missForest iteration 1 in progress...done!</span></p>
<p><span style="color: #0000ff;"> missForest iteration 2 in progress...done!</span></p>
<p><span style="color: #0000ff;"> missForest iteration 3 in progress...done!</span></p>
<p><span style="color: #0000ff;"> missForest iteration 4 in progress...done!</span></p>
<p><span style="color: #0000ff;"> missForest iteration 5 in progress...done!</span></p>
<p><span style="color: #0000ff;"> missForest iteration 6 in progress...done!</span></p>
<p><span style="color: #0000ff;"> missForest iteration 7 in progress...done!</span></p>
<p><span style="color: #0000ff;"> missForest iteration 8 in progress...done!</span></p>
<p><span style="color: #0000ff;"> missForest iteration 9 in progress...done!</span></p>
<p><span style="color: #0000ff;"> missForest iteration 10 in progress...done!</span></p>
<p>To extract only the imputed data from the output (list), we use the familiar “$” operator to index the output object and retrieve the ‘ximp’ data frame. We can then compare the summaries of the original (no missing) data to the missing data and the imputed data.</p>
<p><span style="color: #ff0000;">im.miss.1 <- im.out.1$ximp</span></p>
<p><span style="color: #ff0000;">summary(no.miss)</span></p>
<p> <span style="color: #0000ff;"> id region city.names gender </span></p>
<p><span style="color: #0000ff;"> Min. : 858 I :1713 New Jork : 457 female:2241 </span></p>
<p><span style="color: #0000ff;"> 1st Qu.: 245659 II :1167 Los Angelinas: 438 male :2259 </span></p>
<p><span style="color: #0000ff;"> Median : 499423 III:1620 San Francis : 393 </span></p>
<p><span style="color: #0000ff;"> Mean : 501929 Bahston : 356 </span></p>
<p><span style="color: #0000ff;"> 3rd Qu.: 758180 Astin : 352 </span></p>
<p><span style="color: #0000ff;"> Max. :1012027 Carlot : 346 </span></p>
<p><span style="color: #0000ff;"> (Other) :2158 </span> </p>
<p> <span style="color: #0000ff;"> age education </span></p>
<p><span style="color: #0000ff;"> Min. :18.00 Min. : 2.00 </span></p>
<p><span style="color: #0000ff;"> 1st Qu.:29.00 1st Qu.: 9.00 </span></p>
<p> <span style="color: #0000ff;">Median :33.00 Median :11.00 </span></p>
<p><span style="color: #0000ff;"> Mean :34.69 Mean :11.12 </span></p>
<p><span style="color: #0000ff;"> 3rd Qu.:39.00 3rd Qu.:13.00 </span></p>
<p><span style="color: #0000ff;"> Max. :82.00 Max. :22.00</span></p>
<p><span style="color: #ff0000;">summary(wi.miss)</span></p>
<p> <span style="color: #0000ff;"> id region city.names gender </span></p>
<p><span style="color: #0000ff;"> Min. : 858 I :1627 New Jork : 434 female:2119 </span></p>
<p><span style="color: #0000ff;"> 1st Qu.: 245659 II :1090 Los Angelinas: 415 male :2143 </span></p>
<p><span style="color: #0000ff;"> Median : 499423 III :1541 San Francis : 373 NA's : 238 </span></p>
<p><span style="color: #0000ff;"> Mean : 501929 NA's: 242 Bahston : 334 </span></p>
<p><span style="color: #0000ff;"> 3rd Qu.: 758180 Astin : 332 </span></p>
<p><span style="color: #0000ff;"> Max. :1012027 (Other) :2387 </span></p>
<p><span style="color: #0000ff;"> NA's : 225 </span> </p>
<p> <span style="color: #0000ff;">age education </span></p>
<p><span style="color: #0000ff;"> Min. :18.00 Min. : 2.00 </span></p>
<p><span style="color: #0000ff;"> 1st Qu.:29.00 1st Qu.: 9.00 </span></p>
<p><span style="color: #0000ff;"> Median :33.00 Median :11.00 </span></p>
<p><span style="color: #0000ff;"> Mean :34.66 Mean :11.12 </span></p>
<p><span style="color: #0000ff;"> 3rd Qu.:39.00 3rd Qu.:13.00 </span></p>
<p><span style="color: #0000ff;"> Max. :80.00 Max. :22.00 </span></p>
<p><span style="color: #0000ff;"> NA's :234 NA's :242</span></p>
<p><span style="color: #ff0000;">summary(im.miss.1)</span></p>
<p> <span style="color: #0000ff;">id region city.names gender </span></p>
<p><span style="color: #0000ff;"> Min. : 858 I :1713 New Jork : 457 female:2239 </span></p>
<p><span style="color: #0000ff;"> 1st Qu.: 245659 II :1167 Los Angelinas: 438 male :2261 </span></p>
<p><span style="color: #0000ff;"> Median : 499423 III:1620 San Francis : 393 </span></p>
<p><span style="color: #0000ff;"> Mean : 501929 Bahston : 356 </span></p>
<p><span style="color: #0000ff;"> 3rd Qu.: 758180 Astin : 352 </span></p>
<p><span style="color: #0000ff;"> Max. :1012027 Carlot : 346 </span></p>
<p><span style="color: #0000ff;"> (Other) :2158 </span></p>
<p><span style="color: #0000ff;"> age education </span></p>
<p> <span style="color: #0000ff;">Min. :18.00 Min. : 2.00 </span></p>
<p><span style="color: #0000ff;"> 1st Qu.:29.00 1st Qu.: 9.00 </span></p>
<p><span style="color: #0000ff;"> Median :34.00 Median :11.00 </span></p>
<p><span style="color: #0000ff;"> Mean :34.67 Mean :11.11 </span></p>
<p><span style="color: #0000ff;"> 3rd Qu.:39.00 3rd Qu.:13.00 </span></p>
<p><span style="color: #0000ff;"> Max. :80.00 Max. :22.00</span></p>
<p>The OOBerror rates are returned as two statistics; the first number returned is the normalized root mean squared error (NRMSE; for continuous variables) and the second is the proportion falsely classified (PFC; for categorical variables). The OOBerror rates can be retrieved by using the familiar “$” from the output object. Others (Waljee, et al., 2013) have compared the ‘missForest’ function to other imputation methods and found “it [missForest] had the least imputation error for both continuous and categorical variables … and it had the smallest prediction difference [error]…” (p. 1).</p>
<p><span style="color: #ff0000;">im.out.1$OOBerror</span></p>
<p> <span style="color: #0000ff;">NRMSE PFC</span></p>
<p><span style="color: #0000ff;">0.0000187039 0.1652550716</span></p>
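<p>These two statistics have simple definitions, sketched below on hypothetical toy vectors (not the article’s data): NRMSE is the root mean squared error normalized by the variance of the true values, following the definition in Stekhoven (2012), and PFC is simply the share of mismatched categories.</p>

```r
# Hypothetical toy vectors illustrating the NRMSE and PFC definitions.
true.num <- c(1.2, 3.4, 2.2, 5.0)
imp.num  <- c(1.1, 3.6, 2.0, 5.1)
nrmse <- sqrt(mean((true.num - imp.num)^2) / var(true.num))

true.cat <- factor(c("a", "b", "a", "b"))
imp.cat  <- factor(c("a", "b", "b", "b"))
pfc <- mean(true.cat != imp.cat)   # 1 of 4 misclassified = 0.25
```
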
<p>One of the major benefits of the ‘missForest’ function is that it has an argument for utilizing multiple cores (i.e. processors) of a computer in <em>parallel</em>. Below we repeat the example from above showing how to exploit this functionality. Keep in mind, the larger the data set, the greater the benefit achieved by parallelizing the imputation. First, we need to load the ‘doParallel’ package and its dependency (i.e. the ‘parallel’ package).</p>
<p><span style="color: #ff0000;">library(doParallel)</span></p>
<p><span style="color: #0000ff;">Loading required package: parallel</span></p>
<p>Next, we need to register the number of cores (or processors) of our computer.</p>
<p><span style="color: #ff0000;">registerDoParallel(cores = 2)</span></p>
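<p>If you do not know how many cores your machine has, the ‘parallel’ package (which ships with R) can report it; leaving one core free for the operating system is a common convention. The value computed below could then replace the hard-coded 2 in the ‘registerDoParallel’ call above.</p>

```r
# Query the machine's core count (base-R 'parallel' package) and
# reserve one core for the system.
library(parallel)
n.cores <- max(1, detectCores() - 1)
n.cores
```
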
<p>Now we can apply the ‘missForest’ function while dividing the work evenly among the cores, either by 'variables' or by 'forests' (here we divide by variables).</p>
<p><span style="color: #ff0000;">im.out.2 <- missForest(xmis = wi.miss, maxiter = 10, ntree = 100,</span></p>
<p><span style="color: #ff0000;"> variablewise = FALSE,</span></p>
<p><span style="color: #ff0000;"> decreasing = FALSE, verbose = FALSE,</span></p>
<p><span style="color: #ff0000;"> mtry = floor(sqrt(ncol(wi.miss))), replace = TRUE,</span></p>
<p><span style="color: #ff0000;"> classwt = NULL, cutoff = NULL, strata = NULL,</span></p>
<p><span style="color: #ff0000;"> sampsize = NULL, nodesize = NULL, maxnodes = NULL,</span></p>
<p><span style="color: #ff0000;"> xtrue = NA, parallelize = "variables")</span></p>
<p> <span style="color: #0000ff;">missForest iteration 1 in progress...done!</span></p>
<p><span style="color: #0000ff;"> missForest iteration 2 in progress...done!</span></p>
<p><span style="color: #0000ff;"> missForest iteration 3 in progress...done!</span></p>
<p><span style="color: #0000ff;"> missForest iteration 4 in progress...done!</span></p>
<p>Again, extract only the imputed data from the output (list) using the familiar “$” operator to index the ‘ximp’ data frame. We can then compare the summaries of the original (no missing) data to the missing data and the imputed data.</p>
<p><span style="color: #ff0000;">im.miss.2 <- im.out.2$ximp</span></p>
<p><span style="color: #ff0000;">summary(no.miss)</span></p>
<p> <span style="color: #0000ff;"> id region city.names gender </span></p>
<p><span style="color: #0000ff;"> Min. : 858 I :1713 New Jork : 457 female:2241 </span></p>
<p><span style="color: #0000ff;"> 1st Qu.: 245659 II :1167 Los Angelinas: 438 male :2259 </span></p>
<p><span style="color: #0000ff;"> Median : 499423 III:1620 San Francis : 393 </span></p>
<p><span style="color: #0000ff;"> Mean : 501929 Bahston : 356 </span></p>
<p><span style="color: #0000ff;"> 3rd Qu.: 758180 Astin : 352 </span> </p>
<p> <span style="color: #0000ff;">Max. :1012027 Carlot : 346 </span></p>
<p><span style="color: #0000ff;"> (Other) :2158 </span></p>
<p><span style="color: #0000ff;"> age education </span></p>
<p><span style="color: #0000ff;"> Min. :18.00 Min. : 2.00 </span></p>
<p><span style="color: #0000ff;"> 1st Qu.:29.00 1st Qu.: 9.00 </span></p>
<p><span style="color: #0000ff;"> Median :33.00 Median :11.00 </span></p>
<p><span style="color: #0000ff;"> Mean :34.69 Mean :11.12 </span></p>
<p><span style="color: #0000ff;"> 3rd Qu.:39.00 3rd Qu.:13.00 </span></p>
<p><span style="color: #0000ff;"> Max. :82.00 Max. :22.00</span></p>
<p><span style="color: #ff0000;">summary(wi.miss)</span></p>
<p> <span style="color: #0000ff;">id region city.names gender </span></p>
<p><span style="color: #0000ff;"> Min. : 858 I :1627 New Jork : 434 female:2119 </span></p>
<p><span style="color: #0000ff;"> 1st Qu.: 245659 II :1090 Los Angelinas: 415 male :2143 </span></p>
<p><span style="color: #0000ff;"> Median : 499423 III :1541 San Francis : 373 NA's : 238 </span></p>
<p><span style="color: #0000ff;"> Mean : 501929 NA's: 242 Bahston : 334 </span></p>
<p><span style="color: #0000ff;"> 3rd Qu.: 758180 Astin : 332 </span></p>
<p><span style="color: #0000ff;"> Max. :1012027 (Other) :2387 </span> </p>
<p><span style="color: #0000ff;"> NA's : 225 </span></p>
<p><span style="color: #0000ff;"> age education </span></p>
<p><span style="color: #0000ff;"> Min. :18.00 Min. : 2.00 </span></p>
<p><span style="color: #0000ff;"> 1st Qu.:29.00 1st Qu.: 9.00 </span></p>
<p><span style="color: #0000ff;"> Median :33.00 Median :11.00 </span></p>
<p><span style="color: #0000ff;"> Mean :34.66 Mean :11.12 </span></p>
<p><span style="color: #0000ff;"> 3rd Qu.:39.00 3rd Qu.:13.00 </span></p>
<p> <span style="color: #0000ff;">Max. :80.00 Max. :22.00 </span></p>
<p><span style="color: #0000ff;"> NA's :234 NA's :242 </span></p>
<p><span style="color: #ff0000;">summary(im.miss.2)</span></p>
<p> <span style="color: #0000ff;">id region city.names gender </span></p>
<p><span style="color: #0000ff;"> Min. : 858 I :1713 New Jork : 457 female:2244 </span></p>
<p><span style="color: #0000ff;"> 1st Qu.: 245659 II :1167 Los Angelinas: 438 male :2256 </span></p>
<p><span style="color: #0000ff;"> Median : 499423 III:1620 San Francis : 393 </span></p>
<p><span style="color: #0000ff;"> Mean : 501929 Bahston : 356 </span></p>
<p><span style="color: #0000ff;"> 3rd Qu.: 758180 Astin : 352 </span></p>
<p><span style="color: #0000ff;"> Max. :1012027 Carlot : 346 </span></p>
<p><span style="color: #0000ff;"> (Other) :2158 </span></p>
<p><span style="color: #0000ff;"> age education </span></p>
<p><span style="color: #0000ff;"> Min. :18.00 Min. : 2.00 </span></p>
<p><span style="color: #0000ff;"> 1st Qu.:29.00 1st Qu.: 9.00 </span></p>
<p><span style="color: #0000ff;"> Median :34.00 Median :11.00 </span></p>
<p><span style="color: #0000ff;"> Mean :34.67 Mean :11.12 </span></p>
<p><span style="color: #0000ff;"> 3rd Qu.:39.00 3rd Qu.:13.00 </span></p>
<p><span style="color: #0000ff;"> Max. :80.00 Max. :22.00</span></p>
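<p>Because the complete data set happens to be available in this example, the <em>realized</em> imputation error can also be computed directly. The sketch below (assuming the ‘missForest’ package is installed; the toy data are made up) uses two package utilities: ‘prodNA’ to punch random holes in a complete data set, and the ‘xtrue’ argument so ‘missForest’ returns the true error alongside the out-of-bag estimate.</p>

```r
# Sketch: supplying xtrue makes missForest() return the true NRMSE and
# PFC (the "error" element) in addition to OOBerror.
library(missForest)
set.seed(7)
complete <- data.frame(a = rnorm(50),
                       b = factor(sample(c("x", "y"), 50, replace = TRUE)))
holey <- prodNA(complete, noNA = 0.1)   # introduce 10% missing cells
out <- missForest(xmis = holey, xtrue = complete)
out$error                               # realized NRMSE and PFC
```
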
<h3><strong>Conclusions</strong></h3>
<p>RSS has previously recommended the use of the ‘rrp’ package for multiple missing value imputation, primarily because, unlike alternatives, the ‘rrp.impute’ function could accept and impute categorical variables as well as continuous (or nearly continuous) variables. However, the ‘rrp’ package has not been consistently maintained since the release of R-2.15.1. The ‘missForest’ package, which recently came to our attention, offers the same benefit and is available for the most recent release of R (R-3.1.0). Furthermore, the ‘missForest’ function contains the added benefit of parallelization for larger data sets. It is for these reasons (i.e. benefits) that RSS recommends its use in any missing value situation. For those interested in learning more about what R can do, please visit the RSS <a href="http://www.unt.edu/rss/class/Jon/R_SC/">Do-it-yourself Introduction to R</a> course page.</p>
<p>Until next time: document everything; audio is good, but audio with video is better.</p>
<h3><span style="font-size: 1.17em;">References / Resources</span></h3>
<p>Little, R. J. A., & Rubin, D. B. (1987). <em>Statistical Analysis with Missing Data.</em> New York: John Wiley & Sons.</p>
<p>Stekhoven, D., J. (2013). Package missForest: Nonparametric missing value imputation using random forest. Package documentation available at CRAN: <a href="http://cran.r-project.org/web/packages/missForest/index.html">http://cran.r-project.org/web/packages/missForest/index.html</a></p>
<p>Stekhoven, D., J. (2012). MissForest – Non-parametric missing value imputation for mixed-type data. <em>Bioinformatics, 28</em>(1), 112 – 118.</p>
<p>Waljee, A. K., Mukherjee, A., Singal, A. G., Zhang, Y., Warren, J., Balis, U., Marrero, J., Zhu, J., & Higgins, P. D. R. (2013). Comparison of imputation methods for missing laboratory data in medicine. <em>BMJ Open, 3</em>(8), 1 – 7. DOI: 10.1136/bmjopen-2013-002847</p>
<h4 style="text-align: left;"><strong><span style="font-size: xx-small;">Originally published July 2014 -- Please note that information published in <em>Benchmarks Online</em> is likely to degrade over time, especially links to various Websites. To make sure you have the most current information on a specific topic, it may be best to search the UNT Website - <a href="http://www.unt.edu/">http://www.unt.edu</a> . You can also consult the UNT Helpdesk - <a href="http://www.unt.edu/helpdesk/">http://www.unt.edu/helpdesk/</a>. Questions and comments should be directed to <a href="mailto:benchmarks@unt.edu">benchmarks@unt.edu</a>.</span></strong></h4>
http://it.unt.edu/benchmarks/issues/2014/06/rss-matters
<div class="field field-type-date field-field-pubdate">
<div class="field-label">Publication Date: </div>
<div class="field-items">
<div class="field-item odd">
<span class="date-display-single">2014-06</span> </div>
</div>
</div>
<h3 style="text-align: center;"><img src="/benchmarks/sites/default/files/rssunt.gif" alt="RSS Logo" width="506" height="101" /></h3>
<h3><strong>Basic Graph Creation and Manipulation in R.</strong></h3>
<p><em>Link to the last RSS article here: <a href="https://it.unt.edu/benchmarks/issues/2014/05/rss-matters">Know where you are going before you get there: Statistical Action Plans work. </a> -- Ed.</em></p>
<p><strong>By <a href="mailto:Jonathan.Starkweather@unt.edu">Dr. Jon Starkweather</a>, Research and Statistical Support Consultant</strong></p>
<p><span style="font-size: medium;"><strong>T</strong></span>he purpose of this article is to provide some key information for creating and manipulating graphs in R. Only the basics will be covered here because there have been entire books published regarding the graphical capabilities of R with and without additional packages (e.g. Andrews, 2012; Deepayan, 2008). Furthermore, those who have already mastered the basics covered in this article are encouraged to explore the CRAN Task View for graphics (Lewin-Koh, 2014). Some examples are provided below for what might be considered necessary skills for anyone working with data. The focus of the examples below is oriented toward initial data analysis (i.e. graphs which display basic descriptive and relational properties of one or two variables). </p>
<p>First, we import some (fictional) example data. The data can be imported directly from the RSS URL as provided in the script below (i.e. simply copy the script provided and paste into the R console to follow along):</p>
<p><span style="color: #ff0000;">df.1 <- read.table("http://www.unt.edu/rss/class/Jon/Benchmarks/BasicPlotData.txt", header = TRUE, sep = ",", na.strings = "NA", dec = ".", strip.white = TRUE)</span></p>
<p><span style="color: #ff0000;">summary(df.1)</span></p>
<p><span style="color: #ff0000;">nrow(df.1)</span></p>
<p><span style="color: #ff0000;">ncol(df.1)</span></p>
<p><span style="color: #ff0000;">names(df.1)</span></p>
<p>Before we actually create any graphs, it might be sensible to take a look at the graphics window and its menus first. Actually, although the habit here is to refer to it as a ‘graphics window’ it is really a graphics ‘device’ (i.e. dev). To create a blank graphics window, or graphics device, we use the simple function below:</p>
<p><span style="color: #ff0000;">dev.new()</span></p>
<p>which produces the empty graphics window below.</p>
<p align="center"><img src="/benchmarks/sites/default/files/BasicPlots_001.png" alt="Empty graphics window" width="451" height="480" /></p>
<p>Notice in the above image, there are three menu items: File, History, and Resize. The File tab allows you to Save the image (in a variety of popular formats), Copy the image (in either of two formats), Print the image, or Close the device (i.e. window). Of course, you can also copy, save, or print the contents of the graphics window by right-clicking on it with your mouse and selecting one of those operations. The History tab allows us to turn on (or off) the recorder, which will keep a record of subsequent graphs produced in this window and allow us to page up or page down to scroll through the previous and next graphs produced. We can also save or clear the history from this tab. The Resize tab simply allows us to resize the graphics window using some basic commands; of course, you can also click-and-drag from one corner with the mouse to resize the graphics device. For now, we will close the graphics window using script (rather than “File”, “Close Device”).</p>
<p><span style="color: #ff0000;">graphics.off()</span></p>
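<p>The File menu’s Save operation can also be scripted, which is handy for reproducible output: open a file-backed device, draw, then close it (these are standard base-R device functions; the file name below is made up for illustration).</p>

```r
# Script-based equivalent of File > Save: draw a histogram of random
# data directly into a PNG file in the working directory.
png("example_hist.png", width = 480, height = 480)  # hypothetical file name
hist(rnorm(100))
dev.off()   # closing the device writes the file
```

<p>Analogous functions exist for other formats, e.g. ‘pdf’, ‘jpeg’, and ‘bmp’.</p>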
<p>It is often desirable to ‘attach’ a data frame if one is going to be repeatedly calling specific variables of it. We do so here in order to simplify the indexing of the data frame we imported.</p>
<p><span style="color: #ff0000;">attach(df.1)</span></p>
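<p>Two related base-R idioms are worth knowing, sketched below on a stand-in data frame: ‘with’ evaluates an expression inside a data frame without attaching it, and ‘detach’ undoes a previous ‘attach’ when you are finished.</p>

```r
# Illustrative stand-in data frame (not the imported df.1).
toy <- data.frame(age = c(21, 34, 28))
attach(toy)
mean(age)              # 'age' is found via the attached search path
detach(toy)
with(toy, mean(age))   # same result, no attach required
```
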
<h3><strong>Defaults</strong></h3>
<p>One of the most commonly used graphic displays is the simple histogram; which is used to display a distribution of values (i.e. continuous or nearly continuous variable). Below we create a simple histogram of the ‘age’ variable (of the ‘df.1’ data frame); supplying only the variable name (because we attached the data frame) and omitting all other arguments (which provides a default histogram):</p>
<p><span style="color: #ff0000;">hist(age)</span></p>
<p>which produces the following simple histogram.</p>
<p align="center"> <img src="/benchmarks/sites/default/files/BasicPlots_002.png" alt="Histogram of Age" width="451" height="480" /></p>
<p>The default graph produced by the ‘hist’ function provides a most basic histogram based on default options for the function’s many arguments. The histogram above conveys the necessary information, but it is very plain; some might even consider it boring.</p>
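<p>A livelier version is easy to produce by supplying a few of the standard optional arguments. The sketch below uses simulated data as a stand-in for the ‘age’ variable; the argument names shown (main title, x-axis label, fill color, suggested bin count) are ordinary base-R graphics parameters.</p>

```r
# Illustrative only: hist() with a few common optional arguments,
# on simulated ages standing in for df.1's age variable.
set.seed(42)
sim.age <- round(rnorm(500, mean = 35, sd = 9))
hist(sim.age, main = "Distribution of Age", xlab = "Age (years)",
     col = "gray80", breaks = 20)
```
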
<p>By far the most commonly used graphical function is the simple ‘plot’ function. The ‘plot’ function can be used to display single variables (e.g. categorical variables’ frequency counts) or multiple variables’ relationships (e.g. scatterplots & scatterplot matrices). To use the ‘plot’ function in its most primitive form, we will supply it with the gender variable of our data frame:</p>
<p><span style="color: #ff0000;">plot(gender)</span></p>
<p>which produces the following graph (i.e. an example of a bar chart or bar graph).</p>
<p align="center"><img src="/benchmarks/sites/default/files/BasicPlots_003.png" alt="Bar graph" width="452" height="480" /> </p>
<p>Notice in the above we supplied only the variable, without specifying optional parameters for the other arguments of the ‘plot’ function. There is no main title to the graph, as there was with the histogram previously; nor is there an x-axis line. Furthermore, the x-axis labels (female, male) are taken directly from the variable supplied, as was the x-axis label (age) in the histogram previously. Both default graphs are produced in grayscale, which makes printing them easy; some publications also require manuscripts to contain only grayscale (i.e. no colors). Therefore, the default graphs can be quite handy, even if they tend to be a little boring. Below is an example of the ‘plot’ function producing a simple scatterplot when supplied with only two variables:</p>
<p><span style="color: #ff0000;">plot(neuroticism, extroversion)</span></p>
<p>which produces the following graph (i.e. scatterplot). </p>
<p align="center"><img src="/benchmarks/sites/default/files/BasicPlots_004.png" alt="Scatterplot" width="454" height="480" /> </p>
<p>The ‘plot’ function does not supply a main title when given two (or more) variables. Also notice that the x-axis and y-axis labels (neuroticism, extroversion) are supplied exactly as they appear in the data frame. And again, the graph is displayed in grayscale by default. To create a scatterplot matrix, you can simply supply the ‘plot’ function with several columns of a data frame (or matrix); such as:</p>
<p><span style="color: #ff0000;">plot(df.1[,22:24])</span></p>
<p>which produces the following graph. </p>
<p align="center"> <img src="/benchmarks/sites/default/files/BasicPlots_005.png" alt="graph" width="452" height="480" /></p>
<p>Keep in mind that, depending on the number of cases, more than a few variables in a single scatterplot matrix can defeat the purpose of displaying the data this way (i.e. more than 4 or 5 variables often makes each cell of the matrix too small to interpret). Again, notice there is no main title, x-axis label, y-axis label, or color in the default scatterplot matrix.</p>
<p>It should also be noted that there are other graphics functions available in a base install of R which may occasionally be useful; such as the ‘boxplot’ function which produces a box and whisker plot;</p>
<p><span style="color: #ff0000;">boxplot(age ~ gender)</span></p>
<p>the ‘coplot’ function which produces a conditional plot;</p>
<p><span style="color: #ff0000;">coplot(income ~ age | gender)</span></p>
<p>and the ‘pairs’ function which produces a scatterplot or scatterplot matrix; in fact, the ‘plot’ command from above calls this function to produce these types of graphs;</p>
<p><span style="color: #ff0000;">pairs(df.1[,22:24])</span></p>
<p>For a demonstration of some of the base functionality of R (in terms of graphics), use the ‘demo’ function:</p>
<p><span style="color: #ff0000;">graphics.off()</span></p>
<p><span style="color: #ff0000;">demo(graphics)</span></p>
<p>For more complex data, the ‘persp’ function provides the ability to produce a 3-dimensional perspective plot.</p>
<p><span style="color: #ff0000;">graphics.off()</span></p>
<p><span style="color: #ff0000;">demo(persp)</span></p>
<h3><strong>Non-Defaults</strong></h3>
<p>Like most things in R, however, everything displayed by the ‘hist’ function and the ‘plot’ function can be manipulated using optional arguments. Below we provide a few examples using some common non-default options for arguments of the ‘hist’ and ‘plot’ functions. To start fresh, we will again close the graphics device:</p>
<p><span style="color: #ff0000;">graphics.off()</span></p>
<p>Revisiting the histogram of age from above, but adding some color (col), providing a specific main title (main), x-axis label (xlab), y-axis label (ylab), x-axis limits (i.e. zero to 80), and y-axis limits (i.e. zero to 600):</p>
<p><span style="color: #ff0000;">hist(age, col = "lightgreen", main = "Histogram of Age",</span></p>
<p><span style="color: #ff0000;"> xlab = "Age in years", ylab = "Frequency count",</span></p>
<p><span style="color: #ff0000;"> xlim = c(0,80), ylim = c(0,600)) </span></p>
<p>which produces the following graph. </p>
<p align="center"><img src="/benchmarks/sites/default/files/BasicPlots_006.png" alt="Histogram of Age 2" width="453" height="480" /> </p>
<p>Revisiting our bar graph from earlier, but supplying some color, a main title, and axis labels:</p>
<p><span style="color: #ff0000;">plot(gender, col = "lightblue", main = "Bar graph of Gender",</span></p>
<p><span style="color: #ff0000;"> xlab = "Gender", ylab = "Frequency count")</span></p>
<p>which produces the following graph.</p>
<p align="center"><img src="/benchmarks/sites/default/files/BasicPlots_007.png" alt="Bar Graph of Gender" width="453" height="480" /></p>
<p>And finally, we revisit our scatterplot with some color, main and axis titles, as well as specific limits for both axes. Also notice we used the ‘pch’ argument to specify a particular character for the points in the scatterplot. We also used the ‘cex’ argument to specify the size of those characters; where smaller numbers produce smaller points or characters in the scatterplot. </p>
<p><span style="color: #ff0000;">plot(neuroticism, extroversion, col = "purple",</span></p>
<p><span style="color: #ff0000;"> main = "Scatterplot of Neuroticism and Extroversion",</span></p>
<p><span style="color: #ff0000;"> xlab = "Neuroticism", ylab = "Extroversion",</span></p>
<p><span style="color: #ff0000;"> xlim = c(5,15), ylim = c(5,15), pch = "*", cex = 2)</span></p>
<p>which produces the following graph. </p>
<p align="center"> <img src="/benchmarks/sites/default/files/BasicPlots_008.png" alt="Scatterplot 2" width="452" height="480" /></p>
<h3><strong>Supplemental Graphics Functions</strong></h3>
<p>There are other useful graphics device functions which can be used in conjunction with what we have done above. Some of the popular operations involve adding a line of some type to histograms, such as what is done below with the ‘lines’ function:</p>
<p><span style="color: #ff0000;">hist(age, col = "lightblue", main = "Histogram of Age",</span></p>
<p><span style="color: #ff0000;"> xlab = "Age in years", ylab = "Density proportion",</span></p>
<p><span style="color: #ff0000;"> xlim = c(0,80), prob = TRUE)</span></p>
<p><span style="color: #ff0000;">lines(density(age), col = "blue", lty = 2, lwd = 2)</span></p>
<p>which produces the following graph. Notice in the script directly above, we used ‘prob = TRUE’ to indicate that the histogram should reflect probability densities rather than frequency counts, and our y-axis label (ylab) has been changed accordingly. The ‘lines’ function can be used in a variety of ways to add a line to a graphic; here we specify a density line using a dashed line type (lty = 2) and a moderately thick line width (lwd = 2). The defaults for both line type and line width are 1, which correspond to a solid, thin line.</p>
<p align="center"><img src="/benchmarks/sites/default/files/BasicPlots_009.png" alt="Histogram of Age 2" width="452" height="480" /> </p>
<p>The ‘abline’ function is an alternative to the ‘lines’ function for placing a line over an existing graph. As an example, we apply a line which represents a linear model (lm) to our previous scatterplot.</p>
<p><span style="color: #ff0000;">plot(neuroticism, extroversion, col = "purple",</span></p>
<p><span style="color: #ff0000;"> main = "Scatterplot of Neuroticism and Extroversion",</span></p>
<p><span style="color: #ff0000;"> xlab = "Neuroticism", ylab = "Extroversion",</span></p>
<p><span style="color: #ff0000;"> xlim = c(5,15), ylim = c(5,15), pch = "*", cex = 2)</span></p>
<p><span style="color: #ff0000;">abline(lm(extroversion ~ neuroticism), col = "red",</span></p>
<p><span style="color: #ff0000;"> lty = 3, lwd = 3)</span></p>
<p>which produces the following graph. Notice here we chose a dotted line type (lty = 3) and a thick line width (lwd = 3).</p>
<p align="center"> <img src="/benchmarks/sites/default/files/BasicPlots_010.png" alt="Scatterplot 3" width="454" height="480" /></p>
<h3><strong>Graphics Parameters</strong></h3>
<p>The ‘par’ function is used to specify graphics parameters; most commonly applied to the base ‘plot’ function and resulting graphics devices (i.e. graphics windows). One of the most useful applications of the ‘par’ function is the ability to place more than one graph in a graphics device. For example, you may want to place one graph below another:</p>
<p><span style="color: #ff0000;">gen <- which(df.1[,4] == "female")</span></p>
<p><span style="color: #ff0000;">females <- df.1[gen,5]</span></p>
<p><span style="color: #ff0000;">males <- df.1[-gen,5]</span></p>
<p> </p>
<p><span style="color: #ff0000;">par(mfrow = c(2,1))</span></p>
<p><span style="color: #ff0000;">hist(females, xlim = c(0,80), col = "pink")</span></p>
<p><span style="color: #ff0000;">hist(males, xlim = c(0,80), col = "lightblue")</span></p>
<p>Notice in the ‘par’ function we are specifying rows and columns of a single graphics device display. The example here specifies two rows and one column (within the graphics device). The resulting graphics device displays a histogram of female participants’ age over a histogram of male participants’ age. </p>
<p align="center"> <img src="/benchmarks/sites/default/files/BasicPlots_011.png" alt="Histogram of males & females" width="453" height="480" /></p>
<p>A second example uses the ‘par’ function to specify 2 rows and 2 columns – keep in mind you can put any graph in each row / column (i.e. cell); this example simply uses histograms.</p>
<p><span style="color: #ff0000;">par(mfrow = c(2,2))</span></p>
<p><span style="color: #ff0000;">hist(age)</span></p>
<p><span style="color: #ff0000;">hist(education)</span></p>
<p><span style="color: #ff0000;">hist(income)</span></p>
<p><span style="color: #ff0000;">hist(bmi)</span></p>
<p>The above script produces the matrix of four graphs in the single graphics device displayed below. Notice how the order of each histogram function corresponds to each <em>cell</em> of the graphics device window (i.e. age in the upper left, education in the upper right, income in the lower left, and BMI in the lower right). </p>
<p align="center"><img src="/benchmarks/sites/default/files/BasicPlots_012.png" alt="Histogram of age, education, etc." width="454" height="480" /> </p>
<p>There are two ways to reset the rows and columns of your graphics device: close the graphics device (as was shown earlier), or re-specify the rows and columns as 1 each (which is the default when a graphics device is first opened):</p>
<p><span style="color: #ff0000;">par(mfrow = c(1,1))</span></p>
<p>Keep in mind, you can have multiple graphics devices open at the same time by using the ‘dev.new’ function:</p>
<p><span style="color: #ff0000;">plot(gender)</span></p>
<p><span style="color: #ff0000;">dev.new()</span></p>
<p><span style="color: #ff0000;">hist(age)</span></p>
<p>which produces two graphs, each in its own device (i.e. window). Notice each device is numbered, starting with 2. This allows you to reference each device (active versus inactive) individually if so desired. </p>
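<p>As a brief sketch of that device referencing, the base install also provides the ‘dev.list’, ‘dev.set’, and ‘dev.cur’ functions:</p>
<p><span style="color: #ff0000;">dev.list()</span></p>
<p><span style="color: #ff0000;">dev.set(2)</span></p>
<p><span style="color: #ff0000;">dev.cur()</span></p>
<p>Here ‘dev.list’ returns the numbers of all open devices, ‘dev.set’ makes device 2 the active device (so subsequent plotting commands draw there), and ‘dev.cur’ reports which device is currently active.</p>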
<p><img style="display: block; margin-left: auto; margin-right: auto;" src="/benchmarks/sites/default/files/BasicPlots_013.png" alt="Two Graphs" width="640" height="335" /></p>
<p>Keep in mind we have only used one argument (of more than 70) of the ‘par’ function. If you would like to see all the available graphics device parameter (‘par’) arguments, please take a look at the ‘help’ file; which can be accessed with the following:</p>
<p><span style="color: #ff0000;">help(par)</span></p>
<p>Of course, each of the functions used in this article has an associated ‘help’ file which can be accessed from the R console and each function is included in the base install of R (no additional packages were used to produce any of the above graphs). Lastly, you will surely have noticed the limited number and hue of colors used in this article. At the time of writing (June 5, 2014), there were 657 colors available for use in R. To see the entire list of available colors, use the ‘colors’ function:</p>
<p><span style="color: #ff0000;">colors()</span></p>
<h3><strong>Conclusions</strong></h3>
<p>Clearly there are many, many other ways to graphically display data using R; as the CRAN Task View for graphics (Lewin-Koh, 2014) shows. There are two especially popular packages for graphical display of multivariate data; the ‘lattice’ package (Deepayan, 2014) and the ‘latticist’ package (Andrews, 2014) which provides a Graphical User Interface (GUI) for the functions of ‘lattice’. There are two other packages worth looking into. The ‘car’ package (Fox & Weisberg, 2014) provides very good ‘scatterplot’ and ‘scatterplotMatrix’ functions which fit linear model (regression) lines, smoothed (e.g. lowess) lines, and boxplots for each axis by default. Also, the ‘scatterplot3d’ package (Ligges & Mächler, 2003), as one might expect from the name, provides a function for producing 3-dimensional scatterplots which look very good. Please visit Module 12 of the RSS <a href="http://www.unt.edu/rss/class/Jon/R_SC/">Do-it-yourself Introduction to R</a> course page for examples of everything mentioned in this article (including the more complex data displays mentioned in this paragraph).</p>
<p>Until next time, remember what George Carlin said: <em>“Intelligence tests are biased toward the literate.”</em></p>
<h3>References / Resources</h3>
<p>Andrews, F. (2014). Package latticist. Documentation available at CRAN: <a href="http://cran.r-project.org/web/packages/latticist/index.html">http://cran.r-project.org/web/packages/latticist/index.html</a></p>
<p>Andrews, F. (2012). latticist: A Graphical User Interface for Exploratory Visualisation R package version 0.9-44. URL <a href="http://latticist.googlecode.com/">http://latticist.googlecode.com/</a></p>
<p>Baron, J. (2014). R Reference Card. Available at CRAN Contributed Documents: <a href="http://cran.r-project.org/doc/contrib/refcard.pdf">http://cran.r-project.org/doc/contrib/refcard.pdf</a></p>
<p>Carlin, G. (1937 – 2008). <a href="http://www.just-one-liners.com/ppl/george-carlin">http://www.just-one-liners.com/ppl/george-carlin</a></p>
<p>Deepayan, S. (2014). Package lattice. Documentation available at CRAN: <a href="http://cran.r-project.org/web/packages/lattice/index.html">http://cran.r-project.org/web/packages/lattice/index.html</a></p>
<p>Deepayan, S. (2008). Lattice: Multivariate Data Visualization with R. New York: Springer Science+Business Media, LLC.</p>
<p>Fox, J., & Weisberg, S. (2014). Package car. Documentation available at CRAN: <a href="http://cran.r-project.org/web/packages/car/index.html">http://cran.r-project.org/web/packages/car/index.html</a></p>
<p>Fox, J., & Weisberg, S. (2011). An {R} Companion to Applied Regression, Second Edition. Thousand Oaks CA: Sage. URL: <a href="http://socserv.socsci.mcmaster.ca/jfox/Books/Companion">http://socserv.socsci.mcmaster.ca/jfox/Books/Companion</a></p>
<p>Lewin-Koh, N. (2014). CRAN Task View: Graphic Displays & Dynamic Graphics & Graphic Devices & Visualization. Available at CRAN: <a href="http://cran.r-project.org/web/views/Graphics.html">http://cran.r-project.org/web/views/Graphics.html</a></p>
<p>Ligges, U., & Mächler, M. (2003). Scatterplot3d - an R Package for Visualizing Multivariate Data. Journal of Statistical Software 8(11), 1-20. Documentation available at CRAN: <a href="http://cran.r-project.org/web/packages/scatterplot3d/index.html">http://cran.r-project.org/web/packages/scatterplot3d/index.html</a></p>
<p>Short, T. (2004). R Reference Card. Available at CRAN Contributed Documents: <a href="http://cran.r-project.org/doc/contrib/Short-refcard.pdf">http://cran.r-project.org/doc/contrib/Short-refcard.pdf</a> </p>
<p> </p>
<h4 style="text-align: left;"><strong><span style="font-size: xx-small;">Originally published June 2014 -- Please note that information published in <em>Benchmarks Online</em> is likely to degrade over time, especially links to various Websites. To make sure you have the most current information on a specific topic, it may be best to search the UNT Website - <a href="http://www.unt.edu/">http://www.unt.edu</a> . You can also consult the UNT Helpdesk - <a href="http://www.unt.edu/helpdesk/">http://www.unt.edu/helpdesk/</a>. Questions and comments should be directed to <a href="mailto:benchmarks@unt.edu">benchmarks@unt.edu</a>.</span></strong></h4>
http://it.unt.edu/benchmarks/issues/2014/05/rss-matters
<div class="field field-type-date field-field-pubdate">
<div class="field-label">Publication Date: </div>
<div class="field-items">
<div class="field-item odd">
<span class="date-display-single">2014-05</span> </div>
</div>
</div>
<h3 style="text-align: center;"><img src="/benchmarks/sites/default/files/rssunt.gif" alt="RSS Logo" width="506" height="101" /></h3>
<h3><strong>Know where you are going before you get there: Statistical Action Plans work.</strong></h3>
<p><em>Link to the last RSS article here: </em><a href="http://it.unt.edu/benchmarks/issues/2014/04/rss-matters"><em>RSS’s Top R Package and Software List</em></a><em> -- Ed.</em></p>
<p><strong>By <a href="mailto:Jonathan.Starkweather@unt.edu">Dr. Jon Starkweather</a>, Research and Statistical Support Consultant</strong></p>
<p><span style="font-size: medium;"><strong>C</strong></span>lients often come to RSS wondering what analysis they should do, or what analysis they should do next. Clients are often looking for some remedy, fix, or contingency because they have realized their data, for one reason or another, do not meet the assumptions of the analysis they were expecting to conduct. Many of these clients have successfully proposed the research to grant committees, colleagues, or dissertation / thesis committees. However, the proposed analyses were chosen based on what the researcher or student knew at the time (prior to proposing the study). Often those analyses carry assumptions which were not carefully considered, and only upon collection of the data is it realized that the data do not meet those assumptions. The problem becomes apparent only after the data have been collected, when the researcher realizes they are going to need to learn an unfamiliar analysis while under a publication or dissertation / thesis defense deadline.</p>
<p>It is for these reasons and occasions that we (RSS) generally recommend creating a Statistical Action Plan (SAP) prior to proposing the study. The current article describes the general SAP format and offers suggestions for why it can be helpful. The SAP is nothing more than a document which contains the planned analyses and the order in which they are likely to be conducted. It should also contain alternative analyses in case the data are discovered to violate the assumptions of a planned traditional analysis. It is both a reference for the analyses chosen to answer the research questions (i.e. hypotheses) and a way to plan contingencies in case the data do not meet standard parametric assumptions. Furthermore, it offers student researchers a guide to analyses they may not be familiar with and need to learn prior to proposal or data collection. Along those same lines, the SAP should include which software will be used to conduct the planned analyses; especially considering the limitations of some software packages (Starkweather, 2013a).</p>
<h3><strong>SAP Format</strong></h3>
<p>Generally speaking, quantitative data analysis follows a four stage process. The first stage in the process is <em>Initial Data Analysis</em> (IDA). During this stage it is critical to know thy data, to become intimately familiar with it, so that decisions about subsequent analyses can be made appropriately. Generally the focus of this stage is on univariate descriptive statistics and associated plots. Some of the most important operations during IDA include some of the most mundane tasks of a data analyst: simple operations such as creating a frequency distribution and the appropriate graph for every variable (e.g. bar charts for categorical variables, histograms for continuous or nearly continuous variables). Here the focus should be on the shape of the distribution of each variable (e.g. Are the continuous variables normally distributed? Are the categorical variables evenly balanced, or do you have severely unbalanced categories? Do your survey items display floor or ceiling effects?). Other necessary steps in this process follow from those simple charts and graphs. Inspection for data entry errors, missing values, and outliers will need to be completed – and these should be easy to identify from those charts and graphs. Data entry errors need to be corrected by reviewing and comparing the actual data collection materials (e.g. recording devices, surveys and responses, etc.) to the electronic data file values. For example, if your dataset contains a gender or sex variable and you create a bar chart with the bars representing the gender of each participant (male and female), and one participant has a value other than those two, you know you have some sort of data entry error, coding error, or missing value.</p>
<p>Missing values need to be investigated further when identified (Little & Rubin, 1987). What percentage of the data matrix (i.e. the number of rows multiplied by the number of columns) is missing? A determination must be made as to whether the missing values are missing at random (MAR) or not missing at random. In other words, MAR means that, given the observed data, the missingness mechanism does not depend on the unobserved data. Once that determination has been made, an imputation strategy can be decided upon. If the values are missing at random, then there is a multitude of missing value imputation procedures available for single or multiple imputation (e.g. random recursive partitioning, maximum likelihood imputation, sequential nearest neighbors’ imputation, etc.; see Starkweather, 2014 and Starkweather, 2010). If the values are not missing at random, then some strategy must be devised to account for the pattern of missingness while imputing those values; in other words, a model must be estimated which controls for the relationships among the variables and imputes values estimated to contain the least bias.</p>
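<p>The IDA steps just described can be sketched with base R functions alone. The calls below reuse the ‘df.1’ data frame (with its ‘gender’ and ‘age’ variables) from the graphics examples earlier in this issue purely for illustration; substitute your own data:</p>
<p><span style="color: #ff0000;">table(df.1$gender)</span></p>
<p><span style="color: #ff0000;">plot(df.1$gender)</span></p>
<p><span style="color: #ff0000;">hist(df.1$age)</span></p>
<p><span style="color: #ff0000;">summary(df.1$age)</span></p>
<p><span style="color: #ff0000;">mean(is.na(df.1))</span></p>
<p>The last line returns the proportion of the entire data matrix (rows by columns) which is missing, which speaks directly to the question posed above; ‘colSums(is.na(df.1))’ would give the same information variable by variable.</p>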
<p>The second stage generally involves <em>preliminary data analysis</em>. This stage is primarily concerned with assumption checking and making sure you measured what you think you measured. Bivariate linearity, multivariate normality, and multivariate outliers should be assessed. Again, a good place to start is with relatively simple tables and graphs, such as scatterplots and scatterplot matrices along with the associated correlations and correlation matrices. Keep in mind, there is more than one type of correlation, and the type used is largely determined by the types of variables being correlated (e.g. Pearson product moment correlation, Spearman’s rho, Kendall’s tau, point biserial correlation, polychoric correlation, tetrachoric correlation, etc.). One goal at this stage is to understand the nature of the relationships among the variables – not just the variables of most interest to the hypotheses, but also the auxiliary, confounding, demographic, and any other types of variables as well. It is important to realize there are three key aspects of any relationship between two (or more) variables: significance (which can be meaningless with large sample sizes), direction (i.e. positive, negative), and magnitude. The level of magnitude which indicates importance varies with each study and / or field, as does the effect size (e.g. <em>R<sup>2</sup></em> or percentage of variance shared / accounted for). Obviously, this highlights the importance of a thorough literature review and of becoming familiar with acceptable effect sizes within the field or subject of study. Similarly, different fields often use (or are only familiar with) different metrics; for example, some fields rely upon (and expect) Mahalanobis’ distance for assessing multivariate normality and multivariate outliers (Starkweather, 2013b); others might choose Cook’s distance or some other measure of multivariate distance or leverage (also called influence).</p>
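<p>A minimal base-R sketch of these preliminary checks, again borrowing the hypothetical continuous variables in columns 22 through 24 of ‘df.1’ from the graphics examples above, might look like:</p>
<p><span style="color: #ff0000;">pairs(df.1[,22:24])</span></p>
<p><span style="color: #ff0000;">cor(df.1[,22:24], use = "pairwise.complete.obs")</span></p>
<p><span style="color: #ff0000;">cor(df.1[,22:24], method = "spearman", use = "pairwise.complete.obs")</span></p>
<p><span style="color: #ff0000;">mahalanobis(df.1[,22:24], colMeans(df.1[,22:24]), cov(df.1[,22:24]))</span></p>
<p>The first two ‘cor’ calls produce Pearson and Spearman correlation matrices respectively, while the ‘mahalanobis’ function returns a squared Mahalanobis’ distance for each case, which can then be screened for multivariate outliers.</p>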
<p>As mentioned briefly at the beginning of the previous paragraph, this second stage may also involve more complex analyses, such as Item Response Theory (IRT) or factor analysis, to make sure you measured the variables of interest appropriately. This step is required if you are using a survey to assess the primary variables of interest. Does the factor structure (or item difficulty, item discrimination, etc.) of your sample conform to what has been established for the items as reported in the literature? Also, in regard to surveys: do not use Cronbach’s alpha unless your data meet the assumptions associated with it; most survey data do not, and there are more appropriate statistics for reliability (Starkweather, 2012b). Another goal of the second stage in complex research designs might be variable or model selection analysis. For example, if the study is primarily exploratory and involves collection of extremely large data (e.g. genetics), then a variable or model selection technique might be used to reduce the variables down to a set containing the most important ones (e.g. Relative Importance, Bayesian Model Selection; see Starkweather, 2011). This second stage might also contain strategies for developing weights in order to correct for imbalances in the data or to statistically control confounding variables. Weighting strategies (e.g. propensity scores) should not be avoided; they are often very effective (see Kish, 1990), but choosing the <em>right</em> weights is essential.</p>
<p>The third stage of the data analysis process generally involves the <em>primary data analysis</em>; this is the stage in which the major analyses required to answer the hypothesis or hypotheses of the study are conducted. This is the stage in which the theoretical model is fit to the data – that model may be something as simple as a factorial Analysis of Variance (ANOVA) or it may be very complex, such as a Structural Equation Model (SEM). The main goal of this stage is to determine whether the data and model fit well. There are often many measures of model fit (e.g. Root Mean Square Error of Approximation [RMSEA], Normed Fit Index [NFI], Non-Normed Fit Index [NNFI], Akaike Information Criterion [AIC], Bayesian Information Criterion [BIC], etc.). Therefore, it is again important to have completed a thorough literature review in order to understand what represents appropriate fit in your discipline. Keep in mind, whenever fitting a model is required or hypothesized, it is generally a good idea to fit some competing models in order to give the goodness-of-fit metrics some context. In other words, if you are hypothesizing one model, you had better fit at least one (and probably two) more models in order to have some empirical evidence for the hypothesized model being the <em>best</em> model. Something else to keep in mind at this stage – with respect to goodness-of-fit measures – is that a chi-square statistic is virtually meaningless when fitting even a moderately complex model. This is because even moderately complex models require large sample sizes, and chi-square becomes less and less meaningful as sample size increases. Last, but not least, this stage should include extensive evaluation of residual values. All models are capable of producing residuals – the difference between the actual values of the data and the predicted values of the model (e.g. Y minus predicted Y [y – ŷ] in regression, or the matrix of association minus the reproduced or predicted matrix in many multivariate analyses). One assumption of common parametric analyses is normally distributed residual values – obviously this cannot be checked until the model has been fit, and it should be checked carefully.</p>
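<p>As an illustration of that residual evaluation, a simple linear model can be fit and its residuals inspected with base R functions (the variable names are borrowed from the scatterplot examples earlier in this issue and are purely illustrative):</p>
<p><span style="color: #ff0000;">mod.1 <- lm(extroversion ~ neuroticism, data = df.1)</span></p>
<p><span style="color: #ff0000;">hist(resid(mod.1), prob = TRUE)</span></p>
<p><span style="color: #ff0000;">lines(density(resid(mod.1)))</span></p>
<p><span style="color: #ff0000;">qqnorm(resid(mod.1))</span></p>
<p><span style="color: #ff0000;">qqline(resid(mod.1))</span></p>
<p>The histogram with a density line and the normal quantile-quantile plot both offer quick visual checks of the normally-distributed-residuals assumption.</p>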
<p>The fourth stage of the process involves <em>secondary </em>or <em>subsequent data analysis</em>. This stage involves analyses for testing secondary hypotheses or individual hypotheses nested within, or of lesser importance than, the larger goal (hypothesis) of the study. Consider something as simple as a one-way ANOVA. The model would consist of two variables: one categorical variable with more than two categories (often called an independent variable) predicting one continuous or nearly continuous outcome variable (often called a dependent variable). Evaluating the effect of the independent variable on the dependent variable would entail interpretation of the omnibus <em>F</em> test, which would inform whether or not the model fit well and a main effect was present. However, the <em>F</em> test does not inform us where the significant differences lie. In order to identify which group or groups were significantly (and meaningfully – with effect sizes) different from which other group or groups, planned contrasts or post hoc testing would be necessary. These planned contrasts or post hoc tests would be done as the fourth stage of the process or SAP. In a regression setting, the model fit is evaluated with a combination of <em>R<sup>2</sup></em> type statistics (and often an ANOVA summary table with an <em>F</em> test), while the simple effects of the fourth stage are evaluated with the individual predictor coefficients (often with <em>t</em> tests of each predictor’s standardized coefficient). In more complex settings, such as path models or SEM, the fit of the model is evaluated in the third stage and the individual path or structure coefficients are evaluated in the fourth stage (again, often with <em>t</em> tests of the standardized coefficients). The fourth stage might also be the stage in which confounding variables are controlled or mediation and / or moderation are evaluated. This stage may also include post-stratification, as in multilevel regression (also known as Hierarchical Linear Modeling [HLM]) with post-stratification.</p>
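<p>Continuing the one-way ANOVA example, the third-stage omnibus test and the fourth-stage post hoc comparisons can be sketched with base R functions (the grouping variable ‘group’ and outcome variable ‘score’ here are hypothetical placeholders for your own data):</p>
<p><span style="color: #ff0000;">aov.1 <- aov(score ~ group, data = df.1)</span></p>
<p><span style="color: #ff0000;">summary(aov.1)</span></p>
<p><span style="color: #ff0000;">TukeyHSD(aov.1)</span></p>
<p>The ‘summary’ call reports the omnibus <em>F</em> test, while ‘TukeyHSD’ produces pairwise group comparisons with adjusted confidence intervals, identifying where the significant differences lie.</p>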
<h3><strong>Conclusions</strong></h3>
<p>Obviously, the main idea of this article is to help researchers, primarily graduate students, better prepare for data collection. It is important to note that although the stages of a Statistical Action Plan are listed and described above as sequential, a researcher may need to return to previous stages throughout the process. Again, this is one of the benefits of forcing oneself to create such a plan – it necessitates thinking about what type of data is needed to answer the research question or specific hypotheses, and it motivates consideration of alternative analyses as a contingency if the resulting data do not conform to the assumptions of the planned primary analysis strategy. As many people have recognized over the historical course of science, more effort spent in planning research pays substantial benefits as the study is conducted and analyzed. In essence, it is much better to plan potential contingencies and learn about them (i.e. unfamiliar analyses) prior to data collection than after data collection, when one is facing a thesis / dissertation defense or publication deadline. Lastly, a PDF version of this article can be found <a href="http://www.unt.edu/rss/rssmattersindex.htm">here</a> (along with several of the resources listed below). Other potentially useful resources are located <a href="http://www.unt.edu/rss/class/Jon/ResourcesWkshp/">here</a>. </p>
<p>Until next time: “<em>a failure to plan at the beginning [of the semester] on </em>your part <em>does not represent a crisis at the end [of the semester] on </em>my part<em>.</em>” – Kevin J. Armstrong, PhD.</p>
<h3><strong>References / Resources</strong></h3>
<p>Kish, L. (1990). Weighting: Why, when, and how? Paper presented at the Proceedings of the Survey Research Methods section of the American Statistical Association. Available at: <a href="https://www.amstat.org/sections/SRMS/Proceedings/papers/1990_018.pdf">https://www.amstat.org/sections/SRMS/Proceedings/papers/1990_018.pdf</a></p>
<p>Little, R. J. A., & Rubin, D. B. (1987). <em>Statistical analysis with missing data. </em>New York: John Wiley & Sons. [Still one of the most important resources for understanding missing values].</p>
<p>Starkweather, J. (2014). Your one-stop multiple missing value imputation shop: R 2.15.0 with the rrp package. <em>Benchmarks Online</em>, February 2014. Available at: <a href="http://it.unt.edu/benchmarks/issues/2014/02/rss-matters">http://it.unt.edu/benchmarks/issues/2014/02/rss-matters</a></p>
<p>Starkweather, J. (2013a). Why R; it’s not a question, it’s an answer. <em>Benchmarks Online</em>, October 2013. Available at: <a href="http://it.unt.edu/benchmarks/issues/2013/10/rss-matters">http://it.unt.edu/benchmarks/issues/2013/10/rss-matters</a></p>
<p>Starkweather, J. (2013b). Multivariate outlier detection with Mahalanobis’ distance. <em>Benchmarks Online</em>, July 2013. Available at: <a href="http://it.unt.edu/benchmarks/issues/2013/07/rss-matters">http://it.unt.edu/benchmarks/issues/2013/07/rss-matters</a></p>
<p>Starkweather, J. (2012a). Statistical resources (updated). <em>Benchmarks Online</em>, July 2012. Available at: <a href="http://it.unt.edu/benchmarks/issues/2012/07/rss-matters">http://it.unt.edu/benchmarks/issues/2012/07/rss-matters</a></p>
<p>Starkweather, J. (2012b). Step out of the past: Stop using coefficient alpha; there are better ways to calculate reliability. <em>Benchmarks Online</em>, June 2012. Available at: <a href="http://it.unt.edu/benchmarks/issues/2012/06/rss-matters">http://it.unt.edu/benchmarks/issues/2012/06/rss-matters</a></p>
<p>Starkweather, J. (2011). Sharpening Occam’s razor: Using Bayesian model averaging in R to separate the wheat from the chaff. <em>Benchmarks Online</em>, February 2011. Available at: <a href="http://it.unt.edu/benchmarks/issues/2011/02/rss-matters">http://it.unt.edu/benchmarks/issues/2011/02/rss-matters</a></p>
<p>Starkweather, J. (2010). How to identify and impute multiple missing values using R. <em>Benchmarks Online</em>, November 2010. Available at: <a href="http://web3.unt.edu/benchmarks/issues/2010/11/rss-matters">http://web3.unt.edu/benchmarks/issues/2010/11/rss-matters</a></p>
<h4 style="text-align: left;"><strong><span style="font-size: xx-small;">Originally published May 2014 -- Please note that information published in <em>Benchmarks Online</em> is likely to degrade over time, especially links to various Websites. To make sure you have the most current information on a specific topic, it may be best to search the UNT Website - <a href="http://www.unt.edu/">http://www.unt.edu</a> . You can also consult the UNT Helpdesk - <a href="http://www.unt.edu/helpdesk/">http://www.unt.edu/helpdesk/</a>. Questions and comments should be directed to <a href="mailto:benchmarks@unt.edu">benchmarks@unt.edu</a>.</span></strong></h4>