By Jonathan Starkweather, Ph.D., consultant
Simulating data can be a very useful learning tool. It can help each of us better understand the 'real world' data we collect by allowing us to mimic the structure of data we hope to collect or data we have collected. It can help us better understand the analyses and models we wish to use, as well as the algorithms upon which those rely. Simulation can also be extremely useful in a classroom setting, whether to help students understand the things mentioned above, or simply to provide quick data examples for exercises, e.g., homework. The output is not shown in the examples provided in this article because, it is hoped readers will work through the examples using R to demonstrate them.
Univariate data simulation
There are a large number of theoretical distributions which can be simulated using basic R functions, i.e., functions available without additional packages. For example, the function 'rnorm' creates random deviates from a Normal distribution, given a mean and standard deviation.
If no mean or standard deviation are supplied, the default values for them will be a mean of zero and a standard deviation of one, i.e., the Standard Normal distribution or standardized; Z-scores.
There are associated functions for doing the same thing with other theoretical distributions; such as: 'rbeta' (Beta distribution), 'rexp' (Exponential distribution), 'rchisq' (non-central Chi-Squared distribution), 'rpois' (Poisson distribution), 'rf' (F distribution), 'rt' (Student's t distribution), etc. Each of the distribution functions create random realizations of a theoretical distribution, given the parameters for each distribution (e.g. mean and standard deviation for Normal distributions). Each distribution also has three other functions which provide density, distribution function, and quantile. So, for the Normal distribution(s) there are four standard functions: 'rnorm', 'dnorm' (density), 'pnorm' (distribution function), and 'qnorm' (for quantiles).
Common Regression-style simulation
The section heading printed above was chosen because we wanted to demonstrate a couple of commonly used single-outcome prediction / explanation models. First, Ordinary Least Squares, OLS, linear regression. In the example below, we create textbook perfect realizations of such models. These examples are "perfect" because the assumptions of the models are met: only linear relationships, homoscedasticity of variances, normal distribution of residuals, no multicollinearity, no measurement error, and only the correct variables are included in the model, i.e. no model misspecifications [errors of omission or errors of inclusion]. Starting simple, with a bivariate example, i.e. simple correlation, we demonstrate how there are multiple ways to simulate this type of data. It is important to note that because the functions used below are drawing *random* deviates from theoretical distributions—given user supplied the parameters; the statistics returned will not be *exactly* the same as those specified during data generation. However, they will be very close to those specified.
Zx <- scale(x)
Zy <- .8*Zx + rnorm(100, 0, sqrt(1 - (.8^2)))
Check to see if we can recover the parameter(s) used to create the data....
summary(lm(Zy ~ 0 + Zx))
To see the more familiar non-standardized version, simply use the original 'x' variable as the predictor and apply the same 'scaling' or parameters (i.e. mean & standard deviation) to the outcome, 'y'.
summary(lm(y ~ x))
Of course, it is then simple to add complexity, such as a second predictor and an interaction term.
Zx1 <- scale(x1)
x2 <-rnorm(100, 10, 1.5)
Zx2 <- scale(x2)
x3 <- x1*x2
Zx3 <- scale(x3)
Zy <- (.7*Zx1) + (.4*Zx2) + (.2*Zx3) +rnorm(100, 0, sqrt(1 –
(.49 + .16 + .04)))
y <- (1.5*Zy) + 10
df.1 <- data.frame(x1, x2, x3, y)
rm(x, Zx, Zx1, x1, Zx2, x2, Zx3, x3, Zy, y)
summary(lm(y ~ x1 + x2 + x3, data = df.1))
summary(lm(scale(y) ~ 0 + scale(x1) + scale(x2) + scale(x1)*scale(x2), data = df.1))
The second common single outcome model is the binary logistic regression model, which will be demonstrated below using the 'arm' package (Gelman, et al., 2016); which contains the 'invlogit' function.
Here we are dealing with a binary outcome variable and so, we use the 'rbinom' function.
Checking results with both 'numeric' and 'factor' (i.e. categorical) versions of the outcome, 'y'.
Detaching the package 'arm' and its dependent packages.
Simulating multivariate data structures
We can use the 'mvrnorm' function from the 'MASS' package (Ripley, et al., 2017; Venables & Ripley, 2002) to create multivariate normal deviates; given means and correlations among the variables. This function can be useful for a variety of data structures, such as simulating multicollinear predictor variables in a regression-style model, simulating components, simulating factors (there are better ways of doing this, covered further below), simulating canonical correlations, etc. To demonstrate the function, we create four multivariate normal variables with specified correlations (Sigma = sig), specified means (mu = 0 for all four), and use 'empirical = TRUE' to replicate the exact correlations among the simulated variables.
Check the correlations of the matrix supplied by ‘sig’ versus the correlation matrix of the 1000 cases of the four variables.
We could then use the first two columns of the data frame as factor scores to recreate a data structure for factor analysis with two related factors (r = 0.8). Again, there are other ways to do this; mentioned further below.
Check the factor structure (i.e. loadings) with an oblique rotation strategy using the ‘GPArotation’ package (Bernaards & Jennrich, 2005; 2014).
factanal(x = df.5, factors = 2, rotation = "oblimin")
Although the above example is nice for gaining some insight into multivariate normal data; it is recommended that if one is interested in simulating common specific types of multivariate data structures there are two packages which we (R&SS) find invaluable. The 'psych' package (Revelle, 2017a; 2017b) contains functions for simulating ANOVA / linear models, multilevel models, factor structures (hierarchical models, bi-factor models), simplex and circumplex structures; as well as others. The second package we (R&SS) find invaluable is the 'lavaan' package (Rosseel, et al., 2012; 2017) which has functions for simulating data for structural models (e.g., structural equation models [SEM]) and the model syntax (for 'lavaan' functions) is very intuitive and easy to learn. To see the full (and frequently updated) functionality of these two packages, please visit their respective web pages on CRAN and on the package authors' web pages. As one example, below we create a structural model using the 'lavaan' package.
Specify the population's structural model with coefficients.
Simulate the data from the specified model.
To check the model, specify the SEM structural model...
...and then fit that model to the data.
Lastly, it is important to note that many functions in R rely upon a random number generator to properly conduct random operations, e.g., rnorm(n = 1000, mean = 10, sd = 1.5). If replicability and/or reproducibility are needed, then one must `set the seed'. Setting the seed at the beginning of a script allows randomized functions to return the exact same results. Setting the seed is done with the following: set.seed(12345678), with any numeric combination inside the parentheses. Obviously, the specific numeric combination used allows the same result to be produced, i.e., changing the seed number will produce differing results.
A version of the R script used in this article can be found on the Do-It-Yourself Introduction to R website in the Module 9 section. Data Science and Analytics and High-Performance Computing services are available to help facilitate quality research.
Until next time; It is pitch black. You are likely to be eaten by a grue.
Note: If you found these materials useful, please participate in the DSA Client Feedback Survey.
References and Resources
Bernaards, C. A. & Jennrich, R. I. (2005). Gradient Projection Algorithms and Software for Arbitrary Rotation Criteria in Factor Analysis. Educational and Psychological Measurement, 65, 676 – 696. < http://www.stat.ucla.edu/research/gpa >
Bernaards, C. A. & Jennrich, R. I. (2014). Package ‘GPArotation’ (many rotations available) documentation and user manual are available at CRAN: https://cran.r-project.org/web/packages/GPArotation/index.html
Gelman, A., Su, Y., Yajima, M., Hill, J., Pittau, M. G., Kerman, J., Zheng, T., & Dorie, V. (2016). Package ‘arm’ documentation and manual are available at CRAN: https://cran.r-project.org/web/packages/arm/index.html
Revelle, W. (2017a). psych: Procedures for Personality and Psychological Research. Northwestern University press: Evanston, IL.
Revelle, W. (2017b). Package ‘psych’ documentation and manual are available at CRAN: https://cran.r-project.org/web/packages/psych/index.html
Ripley, B. D., et al., (2017). Package ‘MASS’ (‘mvrnorm’ function) documentation and manual are available at CRAN: https://cran.r-project.org/web/packages/MASS/index.html
Rosseel, Y., et al. (2012). lavaan: An R Package for Structural Equation Modeling. Journal of Statistical Software, 48(2), 1 – 36. < http://www.jstatsoft.org/v48/i02/ >
Rosseel, Y., et al. (2017). Package ‘lavaan’ documentation and user manual are available at CRAN: https://cran.r-project.org/web/packages/lavaan/index.html
Venables, W. N. & Ripley, B. D. (2002). Modern Applied Statistics with S (4th ed.). Springer: New York. ISBN 0-387-95457-0