<div class="field field-type-date field-field-pubdate">
<div class="field-label">Publication Date: </div>
<div class="field-items">
<div class="field-item odd">
<span class="date-display-single">2015-05</span> </div>
</div>
</div>
<h3 style="text-align: center;"><img src="/benchmarks/sites/default/files/rssunt.gif" alt="RSS Logo" width="506" height="101" /></h3>
<h3><strong>You may be (stuck) here! And here are some potential reasons why.</strong></h3>
<p><em>Link to the last RSS article here: <a href="http://it.unt.edu/benchmarks/issues/2015/04/rss-matters">Time Series Analysis: Basic Forecasting.</a> </em><em>-- Ed.</em></p>
<p><strong>By <a href="mailto:Jonathan.Starkweather@unt.edu">Dr. Jon Starkweather</a>, Research and Statistical Support Consultant Team </strong></p>
<p><span style="font-size: medium;"><strong>I </strong></span>often read <a href="http://www.r-bloggers.com/">R-bloggers</a> (Galili, 2015) to see new and exciting things users are doing in the wonderful world of R. Recently I came across Norm Matloff’s (2014) <a href="http://matloff.wordpress.com/2014/09/15/why-are-we-still-teaching-about-t-tests/">blog post</a> with the title “Why are we still teaching <em>t</em>-tests?” To be honest, many RSS personnel have echoed Norm’s sentiments over the years. There do seem to be some fields which are perpetually stuck in decades long past – in terms of the statistical methods they teach and use. Reading Norm’s post got me thinking it might be good to offer some explanations, or at least opinions, on why some fields tend to be stubbornly behind the analytic times. This month’s article will offer some of my own thoughts on the matter. I offer these opinions having been academically raised in one such <em>Rip Van Winkle</em> (Irving, 1819) field and having subsequently realized how much of what I was taught has very little practical utility with real world research problems and data.</p>
<h4><strong>The Lady, Her Tea, and the Reverend</strong></h4>
<p>It is extremely beneficial to review the history of statistics in order to understand why some fields seem to be slow in adopting contemporary methods and analyses. There are very few books I would consider *required* reading for anyone with a serious interest in applied statistical analysis. Two such books will be briefly discussed here. The first is <em>The lady tasting tea: How statistics revolutionized science in the twentieth century</em> by David Salsburg (2001), which is a history book, not a statistics textbook. Salsburg’s book provides a very good review of the creation and application of statistical analyses, and of the persons associated with their creation, during what Salsburg refers to as <em>the statistical revolution</em>. Salsburg goes into detail about the persons and personalities behind each breakthrough in the field of statistics: early pioneers like Karl Pearson, Charles Spearman, Egon Pearson, Jerzy Neyman, and Sir Ronald Fisher; more recent trail blazers like David Cox, George Box, Donald Rubin, and Bradley Efron; and many more between. However, Salsburg’s book only covers one perspective of statistics: the <em>Frequentist</em> perspective, which includes the ubiquitous Null Hypothesis Significance Testing (NHST) and its associated <em>p</em>-values. Very briefly, this perspective treats the model parameters as fixed (though unknown) quantities and the data as essentially random; for instance, if the null hypothesis is true, what is the probability of this data? These types of problems can be stated in the general form: what is the probability of the data, given a hypothesis? In symbols, this translates to: </p>
<p align="center"><em>P(D|H)</em></p>
<p>The other book I consider *required* reading for anyone with a serious interest in applied statistical analysis covers the other perspective of statistics: the <em>Bayesian</em> perspective. The Bayesian perspective differs from traditional Frequentist inference by assuming that the data are fixed and the model parameters are described by probability distributions, which sets up problems in the form: what is the probability of a hypothesis (or parameter), given the data at hand? These types of problems can be stated with symbols as: </p>
<p align="center"><em>P(H|D)</em></p>
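<p>The two expressions are connected by Bayes’ rule, which converts the probability of the data given a hypothesis into the probability of the hypothesis given the data by weighting with the prior probability of the hypothesis:</p>
<p align="center"><em>P(H|D) = P(D|H) P(H) / P(D)</em></p>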
<p>Sharon McGrayne’s (2011) book, <em>The theory that would not die: How Bayes’ rule cracked the enigma code, hunted down Russian submarines, and emerged triumphant from two centuries of controversy</em> is similar to Salsburg’s (2001) book in that both are history books, not statistical textbooks. McGrayne’s book, obviously, begins with the Reverend Thomas Bayes’ ideas from the 1740s. The book tracks the origins of Bayes’ Rule as a theory and concept which for many years was only theoretical because the complex computations required to actually put it into practice were impossible. The book charts the history of the resurgence of Bayes’ Rule as computers emerged in the twentieth century which allowed scientists to apply Bayes’ Rule to a variety of (often top secret) complex, practical, real world problems.</p>
<h4><strong>The Desire to be Quantitative</strong></h4>
<p>The importance of the histories mentioned above is critical to understanding how some fields have been slow to adopt more modern methods and analyses. As history can show us, much of the previous 100 years of statistical analysis has been dominated by the Frequentist perspective. Most of the methods and analysis of the Frequentist perspective are designed for use in strictly experimental or quasi-experimental research designs. Therefore, as new scientific disciplines emerged and developed with a desire to be empirically grounded, the only methods available were the traditional analyses – what I refer to as the <em>usual suspects</em>. These usual suspects include all the things presented in the vast majority of first year applied statistics courses in departments such as Psychology, Sociology, Education, etc. In fact, it has been my experience that the many, many textbooks used for these classes contain the exact same content and it is often presented in the exact same order. The content begins with definitions (e.g. population, sample, the scales of measurement [Stevens, 1946], independent variable, dependent variable, etc.), then descriptive statistics are covered (e.g. measures of central tendency, variability, shape, & relationship), followed by a discussion of the normal distribution and properties of the Standard Normal Distribution (e.g. Z<em>-</em>scores, also called standard scores), then a brief discussion of NHST and statistical power, then the <em>Z</em>-test is discussed, then the <em>t</em>-tests are discussed (e.g. one-sample, independent samples, dependent samples), then oneway analysis of variance [ANOVA] with perhaps a light treatment of factorial ANOVA, then regression – mostly with only one predictor, then subsequent chapters / syllabi cover several non-parametric analogues for the methods previously discussed (e.g. Mann-Whitney <em>U</em>, Wilcoxon signed-ranks test, Kruskal-Wallis oneway ANOVA, Chi-square tests, etc.). 
Now, there is nothing inherently wrong with these methods; they work very well for research designs which provide the types of data they were designed to handle. Unfortunately, each of these usual suspect analyses carries fairly extensive assumptions, and when an analysis is applied to data which fail to meet those assumptions, the resulting statistics can be heavily biased or even invalid. Again, most of these methods were developed for research situations which are truly experimental (i.e. random sampling from a well-defined population of interest, random assignment of cases to conditions of an independent variable, and experimental manipulation of that independent variable while controlling all other variables as much as possible). Unfortunately, true experimental designs are not possible for most of the research done in the emerging or younger scientific disciplines (e.g. Psychology, Sociology, Education, etc.). </p>
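<p>As a small illustration of how such assumption violations can matter, the following R sketch (base R only, with simulated data, so the specific numbers are purely illustrative) checks the normality assumption before running an independent-samples <em>t</em>-test, alongside its non-parametric analogue:</p>

```r
# Simulated example: two heavily skewed (log-normal) samples.
set.seed(42)
g1 <- rlnorm(30, meanlog = 0, sdlog = 1)
g2 <- rlnorm(30, meanlog = 0, sdlog = 1.5)

# Check the normality assumption before reaching for the t-test;
# a small p-value casts doubt on normality.
shapiro.test(g1)
shapiro.test(g2)

# The independent-samples t-test assumes normality (and, in its
# classic form, equal variances); the Mann-Whitney U test does not.
t.test(g1, g2)
wilcox.test(g1, g2)
```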
<h4><strong>Intergenerational Momentum</strong></h4>
<p>The previous section hinted at what I mean by <em>Intergenerational Momentum</em>. It shows how, initially, the younger sciences had limited options when it came to data analysis – the Frequentist perspective was the only perspective and therefore only the usual suspects were available. However, intergenerational momentum is responsible for the fact that the vast majority of young science researchers are still using those usual suspects even though more effective methods have been developed. Max Planck (1950) said, “a scientific truth does not triumph by convincing its opponents and making them see the light, but rather because its opponents eventually die and a new generation grows up that is familiar with it” (p. 33 – 34). Unfortunately, even Planck’s mechanism for the advancement of science fails in some fields, because some mentors stubbornly stick with one or a few analyses. Worse still, some of these mentors use their authority, or power, as the gate-keepers of a successful thesis or dissertation to pressure their graduate students to use the mentors’ preferred analysis or analyses. Therefore, the <em>intergenerational</em> reliance upon outdated, and potentially inadequate, analyses continues in a self-replicating stagnation. One of the most frequent examples of an analysis stubbornly being used despite its creator and namesake attempting to enlighten researchers to its limitations is <em>Cronbach’s alpha coefficient</em> (Cronbach, 1951): “I no longer regard the alpha formula as the most appropriate way to examine most data” (Cronbach & Shavelson, 2004, p. 403). Alpha has three critical assumptions, two of which (tau equivalence & uncorrelated errors) are virtually never satisfied by data resulting from most surveys (for more on this topic, see Starkweather, 2012). Like many of the usual suspects (i.e. traditional Frequentist analyses), the assumptions are often not assessed adequately or are simply ignored – meaning an untold number of research conclusions are likely based upon very biased or simply invalid statistical results.</p>
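<p>For readers who want to move beyond alpha, a rough R sketch is shown below using the ‘psych’ package (a real, widely used package); the simulated six-item survey is purely illustrative, and McDonald’s omega is the alternative recommended in Starkweather (2012):</p>

```r
# install.packages("psych")  # if not already installed
library(psych)

# Illustrative stand-in for survey data: six items driven by one
# latent trait plus noise.
set.seed(1)
latent <- rnorm(200)
my_survey <- as.data.frame(replicate(6, latent + rnorm(200)))

alpha(my_survey)   # Cronbach's alpha, with item-level statistics
omega(my_survey)   # McDonald's omega, which does not assume tau equivalence
```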
<h4><strong>Looking Toward the Future</strong></h4>
<p>The primary unit of analysis, for many of the newer or young sciences, is the human being or some aspect of human experience. Unfortunately, from a research perspective, human beings are extremely complex entities and they are constantly interacting with other complex entities (e.g. other humans, social / cultural systems, political systems, economic systems, etc.). Therefore, researchers whose primary units of analysis are human beings should be collecting data which will allow them to fit, compare, and revise complex statistical models capable of accurately representing the complexity of the researcher’s subjects and their numerous interactions with other complex entities (e.g. other humans & other complex systems mentioned above). It is well past the time to recognize that our forebears’ General Linear Model [GLM] statistics (e.g. <em>t</em>-tests, ANOVAs, regressions, etc.) should no longer be the default modeling solutions. After all, how many current researchers generate their reports, or manuscripts, using a 1921 – 1940 Corona typewriter<a href="#Footnote1"><sup>1</sup></a>?</p>
<p align="center"><img src="/benchmarks/sites/default/files/YouAreHere_001.png" alt="Smith Corona typewriter" width="640" height="426" /> </p>
<p>The above typewriter, beautiful as it is, also highlights another area of stagnation among many contemporary researchers. Statistical software has advanced at an incredible rate over the last two decades. Yes, my zealously <a href="http://www.r-project.org/">R</a>-centric eyes are looking at you, SPSS and SAS users. There are two important factors, among many, for recommending <a href="http://www.r-project.org/">R</a> over the other two software packages. First, <a href="http://www.r-project.org/">R</a> is completely free, like the air you breathe is free. It seems to me almost irresponsible to continue using expensive software (e.g. SPSS & SAS) in this economic climate when free alternatives exist. Second, <a href="http://www.r-project.org/">R</a> has all the capabilities of SPSS and SAS, but the reverse is not true. <a href="http://www.r-project.org/">R</a> contains the most cutting-edge functionality due to its regular, rapid update schedule and the continued expansion of its functionality through new procedures developed by theoretical and applied statisticians and submitted as packages (for more on this topic, see Starkweather, 2013).</p>
<p align="center"> <img src="/benchmarks/sites/default/files/YouAreHere_002.png" alt="Round peg, square hole ..." width="442" height="450" /></p>
<p>Lastly, the image<a href="#Footnote2"><sup>2</sup></a> above reflects the idea that far too many research analysts are using Frequentist methods when Bayesian methods are much better suited for the types of hypotheses and data of the new or young sciences. The problems with the Frequentist perspective, and in particular NHST, have been thoroughly discussed for many years (Efron, 1986; Cohen, 1994; Krantz, 1999; Hubbard & Bayarri, 2003; Gigerenzer, Krauss, & Vitouch, 2004; Gelman & Stern, 2006). The bottom line is this: Bayesian methods are not a cure-all, but they are likely much better for the vast majority of research situations in the new or young sciences. There are many ‘introduction to Bayesian statistics’ textbooks available in a variety of fields (see Starkweather, 2011). Furthermore, there are alternatives to both Frequentist and Bayesian methods, such as machine learning techniques, computational artificial intelligence methods, soft modeling methods, and stochastic search and optimization methods (swarm algorithms, MCMC methods, genetic algorithms, ant colony optimization, etc.). Additionally, there are wrapper techniques which can be applied to almost any analysis to improve the precision of estimates, such as resampling methods like the bootstrap, boosting, bagging, and model averaging (e.g. ensemble averaging). It’s time to de-emphasize the usual suspects of NHST and integrate Bayesian and / or other more current methods into curricula to break the stagnation which severely limits these new or young sciences.</p>
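<p>As one small illustration of the wrapper techniques just mentioned, a percentile bootstrap of a sample mean takes only a few lines of base R (the data here are simulated, so the interval itself is illustrative):</p>

```r
set.seed(7)
x <- rlnorm(50)   # a skewed sample, where normal-theory intervals struggle

# Resample with replacement 10,000 times and collect the means.
boot_means <- replicate(10000, mean(sample(x, replace = TRUE)))

# A 95% percentile bootstrap confidence interval for the mean.
quantile(boot_means, c(0.025, 0.975))
```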
<p>Until next time, here’s a gentle reminder that May 4<sup>th</sup> is not *only* <a href="http://en.wikipedia.org/wiki/Star_Wars_Day">Star Wars Day</a>…“<em>Tin soldiers and Nixon coming…</em>”</p>
<h4><span style="font-size: 1em;">References and Resources</span></h4>
<p>Cohen, J. (1994). The Earth is round (<em>p </em>< .05). <em>American Psychologist, 49</em>(12), 997 – 1003.</p>
<p>Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. <em>Psychometrika, 16</em>(3), 297 – 334. doi:10.1007/bf02310555</p>
<p>Cronbach, L. J., & Shavelson, R. J. (2004). My current thoughts on coefficient alpha and successor procedures. <em>Educational and Psychological Measurement, 64</em>(3), 391 – 418.</p>
<p>Efron, B. (1986). Why isn’t everyone a Bayesian? <em>The American Statistician, 40</em>(1), 1 – 5.</p>
<p>Fisher, R. A. (1935). <em>The design of experiments.</em> New York: Hafner Press.</p>
<p>Galili, T. (2015). R-bloggers: R news and tutorials contributed by (563) R bloggers [A WordPress blog]. Available at: <a href="http://www.r-bloggers.com/">http://www.r-bloggers.com/</a></p>
<p>Gelman, A., & Stern, H. (2006). The difference between “Significant” and “Not Significant” is not itself statistically significant. <em>The American Statistician, 60</em>(4), 328 – 331.</p>
<p>Gigerenzer, G., Krauss, S., & Vitouch, O. (2004). The null ritual: What you always wanted to know about significance testing but were afraid to ask. In D. Kaplan (Ed.), <em>The Sage handbook of quantitative methodology for the social sciences</em>, pp 391 – 408. Thousand Oaks, CA: Sage Publications.</p>
<p>Hubbard, R., & Bayarri, M. J. (2003). Confusion over measures of evidence (<em>p</em>’s) versus errors (α’s) in classical statistical testing. <em>The American Statistician, 57</em>(3), 171 – 182.</p>
<p>Irving, W. (1819). Rip Van Winkle. In <em>The sketchbook of Geoffrey Crayon, Gent.</em> [Geoffrey Crayon was Irving’s pseudonym]. New York: Ebenezer Washington & Henry Brevoort.</p>
<p>Krantz, D. H. (1999). The null hypothesis testing controversy in Psychology. <em>Journal of the American Statistical Association, 94</em>(448), 1372 – 1381.</p>
<p>Matloff, N. (2014). Mad (Data) Scientist: Musings, useful code, etc. on R and data science [A WordPress blog]. Available at: <a href="http://matloff.wordpress.com/2014/09/15/why-are-we-still-teaching-about-t-tests/">http://matloff.wordpress.com/2014/09/15/why-are-we-still-teaching-about-t-tests/</a></p>
<p>McGrayne, S. B. (2011). <em>The theory that would not die: How Bayes’ rule cracked the enigma code, hunted down Russian submarines, and emerged triumphant from two centuries of controversy</em>. New Haven, CT: Yale University Press.</p>
<p>Pearson, K. (1901). On lines and planes of closest fit to systems of points in space. <em>Philosophical Magazine, 2</em>, 559 – 572.</p>
<p>Planck, M. (1932). <em>Where is science going?</em> New York: W. W. Norton & Company, Inc. Freely available in several formats (e.g. Adobe.pdf) at: <a href="https://archive.org/details/whereissciencego00plan_0">https://archive.org/details/whereissciencego00plan_0</a></p>
<p>Planck, M. (1950). <em>Scientific autobiography and other papers</em>. London: Williams & Norgate LTD.</p>
<p>Salsburg, D. (2001). <em>The lady tasting tea: How statistics revolutionized science in the twentieth century</em>. New York: W. H. Freeman and Company.</p>
<p>Spearman, C. (1904). General intelligence: Objectively determined and measured. <em>American Journal of Psychology, 15</em>(2), 201 – 292.</p>
<p>Starkweather, J. (2011). Go Forth and Propagate: Book Recommendations for Learning and Teaching Bayesian Statistics. Benchmarks: <em>RSS Matters, </em>September 2011. Available at: <a href="http://www.unt.edu/rss/class/Jon/Benchmarks/BayesBooks_L_JDS_Sep2011.pdf">http://www.unt.edu/rss/class/Jon/Benchmarks/BayesBooks_L_JDS_Sep2011.pdf</a></p>
<p>Starkweather, J. (2012). Step out of the past: Stop using coefficient alpha; there are better ways to calculate reliability. Benchmarks: <em>RSS Matters, </em>June 2012. Available at: <a href="http://www.unt.edu/rss/class/Jon/Benchmarks/Omega_JDS_Jun2012.pdf">http://www.unt.edu/rss/class/Jon/Benchmarks/Omega_JDS_Jun2012.pdf</a></p>
<p>Starkweather, J. (2013). Why R; it’s not a question, it’s an answer. Benchmarks: <em>RSS Matters, </em>October 2013. Available at: <a href="http://www.unt.edu/rss/class/Jon/Benchmarks/WhyR_L_JDS_Oct2013.pdf">http://www.unt.edu/rss/class/Jon/Benchmarks/WhyR_L_JDS_Oct2013.pdf</a></p>
<p>Stevens, S. S. (1946). On the theory of scales of measurement. <em>Science, 103</em>(2684), 677 – 680.</p>
<p>Stevens, S. S. (1951). Mathematics, measurement, and psychophysics. In S. S. Stevens (Ed.), <em>Handbook of Experimental Psychology, </em>pp 1 – 49. New York: Wiley.</p>
<h4 style="text-align: left;" align="center">Footnotes</h4>
<p style="text-align: left;" align="center">Footnote<sup><a name="Footnote1"></a>1</sup>: Image found at the Smith Corona virtual museum (gallery for 1<sup>st</sup> generation typewriters, specifically the Corona #3 model): <a href="http://www.smithcorona.com/wp-content/tn3/0/1915CoronaTypewriterCompanyInc.Corona3.jpg">http://www.smithcorona.com/wp-content/tn3/0/1915CoronaTypewriterCompanyInc.Corona3.jpg</a></p>
<p>Footnote<sup><a name="Footnote2"></a>2</sup>: Image found at the TribePad blog: <a href="http://tribepad.com/2012/01/the-round-peg-round-hole-approach-to-talent-technology/">http://tribepad.com/2012/01/the-round-peg-round-hole-approach-to-talent-technology/</a></p>
<p><strong style="font-size: 1em;"><span style="font-size: xx-small;">Originally published May 2015 -- Please note that information published in <em>Benchmarks Online</em> is likely to degrade over time, especially links to various Websites. To make sure you have the most current information on a specific topic, it may be best to search the UNT Website - <a href="http://www.unt.edu/">http://www.unt.edu</a> . You can also consult the UNT Helpdesk - <a href="http://www.unt.edu/helpdesk/">http://www.unt.edu/helpdesk/</a>. Questions and comments should be directed to <a href="mailto:benchmarks@unt.edu">benchmarks@unt.edu</a>.</span></strong></p>
<div class="field field-type-date field-field-pubdate">
<div class="field-label">Publication Date: </div>
<div class="field-items">
<div class="field-item odd">
<span class="date-display-single">2015-04</span> </div>
</div>
</div>
<h3 style="text-align: center;"><img src="/benchmarks/sites/default/files/rssunt.gif" alt="RSS Logo" width="506" height="101" /></h3>
<h3><strong>Time Series Analysis: Basic Forecasting.</strong></h3>
<p><em>Link to the last RSS article here: <a href="http://it.unt.edu/benchmarks/issues/2015/03/rss-matters">Confirmatory Factor Analysis and Structural Equation Modeling Group Differences: Measurement Invariance.</a></em> <em>-- Ed.</em></p>
<p><strong>By <a href="mailto:Jonathan.Starkweather@unt.edu">Dr. Jon Starkweather</a>, Research and Statistical Support Consultant Team </strong></p>
<p><span style="font-size: medium;"><strong>T</strong></span>his month’s article will provide a <em>very</em> gentle introduction to basic time series analysis. The primary reference for this article is Hyndman and Athanasopoulos (2015) and it is highly recommended, not least because it is completely <a href="https://www.otexts.org/book/fpp">free</a> and regularly updated at OTexts. If you are unfamiliar, there is a growing group of academics and researchers using <a href="http://www.otexts.org/">www.OTexts.org</a> [Online, Open Access Textbooks] to remove barriers to learning – a most honorable endeavor. The book by Hyndman and Athanasopoulos also has a companion R package; ‘<a href="http://cran.r-project.org/web/packages/fpp/index.html">fpp</a>’ (Hyndman, 2013) which, obviously, makes working through the examples presented in the book much easier.</p>
<p>As with any introduction, this one includes some necessary notation and terms which must be defined prior to actually learning any of the data analysis techniques. Say we have a vector of time series data, <em>y</em>, and there are nine values in this time series (<em>t</em> = 9). The most recent value is referred to as <em>y<sub>t</sub></em> and the oldest value as <em>y<sub>t-</sub></em><sub>8</sub>. Continuing the notation, <em>y<sub>t</sub></em><sub>+1</sub> is used when referring to a forecast value (i.e. the predicted next value of the time series). Next, there are a few terms worth noting. The term <em>trend</em> refers to a general pattern (e.g. increase or decrease) in the time series over the course of the series. Hyndman and Athanasopoulos (2015) define a trend as follows: “a trend exists when there is a long-term increase or decrease in the data” (p. 28). The term <em>seasonal</em> refers to patterns in the series which occur at regular intervals (e.g. season of the year, semesters of an academic year, days of the week, or even times of a day). Another term widely used is <em>cycle</em>, which “occurs when the data exhibit rises and falls that are not of a fixed period” (Hyndman and Athanasopoulos, p. 28). Basic time series are conceptually composed of either an additive model:</p>
<p align="center"><em>y<sub>t</sub> = S<sub>t</sub> + T<sub>t</sub> + E<sub>t</sub></em></p>
<p>or a multiplicative model:</p>
<p align="center"><em>y<sub>t</sub> = S<sub>t</sub> </em>*<em> T<sub>t</sub> </em>*<em> E<sub>t</sub></em></p>
<p>In both models, <em>y<sub>t</sub></em> is the datum at period <em>t</em>, <em>S<sub>t</sub></em> refers to the seasonal component at time <em>t</em>, <em>T<sub>t</sub></em> refers to the trend (or cycle) component at time <em>t</em>, and <em>E<sub>t</sub></em> refers to everything else (i.e. error) at time <em>t</em>. Hyndman and Athanasopoulos (2015) state that “the additive model is most appropriate if the magnitude of the seasonal fluctuations or the variation around the trend (or cycle) does not vary with the level of the time series” (p. 147). Alternatively, Hyndman and Athanasopoulos state that “when the variation in the seasonal pattern, or the variation around the trend (or cycle), appears to be proportional to the level of the time series, then the multiplicative model is more appropriate” (p. 147 – 148).</p>
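<p>Base R can estimate these components directly with the ‘decompose’ function from the ‘stats’ package (not used in the examples below, but a useful companion to them). The classic AirPassengers data set ships with R and shows a clearly multiplicative pattern:</p>

```r
# AirPassengers: monthly airline passenger counts, 1949-1960 (ships with R).
comp <- decompose(AirPassengers, type = "multiplicative")

# The returned object holds the seasonal, trend, and random (error)
# components; plot() draws all of them at once.
str(comp, max.level = 1)
plot(comp)
```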
<p>Below we will be using R Commander (package: ‘<a href="http://cran.r-project.org/web/packages/Rcmdr/index.html">Rcmdr</a>’) and the epack R Commander plugin (package: ‘<a href="http://cran.r-project.org/web/packages/RcmdrPlugin.epack/index.html">RcmdrPlugin.epack</a>’), as well as one of the main time series packages (package: ‘<a href="http://cran.r-project.org/web/packages/tseries/index.html">tseries</a>’). Keep in mind, the R Commander package has several dependent packages; as does the epack plugin (including package: ‘tseries’). The examples will be using monthly average stock price for BP PLC from January 1<sup>st</sup>, 2010 until January 1<sup>st</sup>, 2011. If you are wondering why only these 12 data points were chosen, please see <a href="http://en.wikipedia.org/wiki/Deepwater_Horizon_oil_spill">here</a>. Below we load R Commander and the epack plugin so we can import the data from Yahoo Finance. The function used to retrieve the data is the ‘histprice2’ function from the epack plugin.</p>
<p><img src="/benchmarks/sites/default/files/BasicTS_001.png" alt="Basic Forecasting Example 1" width="580" height="480" /> </p>
<p align="center"> </p>
<p>It is worth noting that the ‘histprice2’ function is actually calling a function from the ‘tseries’ package called ‘get.hist.quote’ which, by default, uses Yahoo Finance data. This is important because although we are using monthly stock prices, the ‘get.hist.quote’ function is capable of retrieving daily historical data. As a quick example, consider the data imported below, which contains the daily closing price of the S&P 500 from January 1964 until January 2014.</p>
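<p>For readers who prefer to script the retrieval directly, bypassing the R Commander menus, a rough sketch of the underlying ‘tseries’ call is shown below. Note that Yahoo’s historical-quote service has changed over the years, so this exact call may require an alternative data provider today:</p>

```r
library(tseries)

# Monthly ("m" compression) BP closing prices for 2010; 'get.hist.quote'
# is the function that the epack 'histprice2' menu item wraps.
bp_close <- get.hist.quote(instrument = "BP", start = "2010-01-01",
                           end = "2011-01-01", quote = "Close",
                           compression = "m")
```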
<p><img src="/benchmarks/sites/default/files/BasicTS_002.png" alt="Basic Forecasting Example 2" width="640" height="176" /> </p>
<p align="center"> </p>
<p>It is generally a good idea to begin with a graph of the data, while keeping in mind those terms from above (e.g. trend, seasonality, cycle). First, take a look at the S&P 500 data.</p>
<p><img src="/benchmarks/sites/default/files/BasicTS_003.png" alt="Basic Forecasting Example 3" width="640" height="64" /> </p>
<p> </p>
<p style="text-align: left;" align="center"><img src="/benchmarks/sites/default/files/BasicTS_004.png" alt="Basic Forecasting Example 4" width="452" height="480" /> </p>
<p style="text-align: left;" align="center">Now imagine you were attempting to forecast where the S&P 500 would be if you were in 1980. You would likely have very narrow prediction intervals (similar to confidence intervals). Contrast those imagined intervals with the intervals you would imagine based on the complete data in the graph, with those two ominous bunny ear spikes…very foreboding. Do you see any trends, seasonality, or cycles in the series? One way to get a better idea of those types of patterns is to plot segments of the data one decade at a time, perhaps using four decades.</p>
<p><img src="/benchmarks/sites/default/files/BasicTS_005.png" alt="Basic Forecasting Example 5" width="640" height="108" /> </p>
<p><img src="/benchmarks/sites/default/files/BasicTS_006.png" alt="Basic Forecasting Example 6" width="453" height="480" /></p>
<p>As you can see above, there do not seem to be any discernible patterns among the four segments. Of course, part of the problem above is that each of the four panels has a different y-axis scale, which means they are not directly comparable in terms of the variance displayed in each series. We could remedy that by forcing each graph to have the same scale, but let’s turn our attention to a much smaller series of data: BP PLC average closing stock price for each month in 2010.</p>
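<p>The decade-by-decade view above can be scripted with base R’s ‘window’ function. The sketch below builds a random-walk stand-in for the S&P 500 series (so it runs on its own; substitute the imported data to reproduce the actual plots):</p>

```r
# Stand-in for the real S&P 500 series: a random walk, for illustration only.
set.seed(1)
sp500 <- ts(cumsum(rnorm(600)) + 100, start = 1964, frequency = 12)

op <- par(mfrow = c(2, 2))                 # four panels, one per decade
for (decade in c(1964, 1974, 1984, 1994)) {
  # window() extracts the sub-series for one decade.
  plot(window(sp500, start = decade, end = decade + 10),
       ylab = "Close", main = paste0(decade, " - ", decade + 10))
}
par(op)
```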
<p><img src="/benchmarks/sites/default/files/BasicTS_007.png" alt="Basic Forecasting Example 7" width="640" height="80" /> </p>
<p><img src="/benchmarks/sites/default/files/BasicTS_008.png" alt="Basic Forecasting Example 8" width="452" height="480" /> </p>
<p>In the above graph we can see the stock price lost nearly half its value between March and June. Most analysts likely would have predicted BP’s stock to stay between $50 and $60 throughout 2010 – even when using complex multivariate models which will not be covered here. However, on April 20<sup>th</sup>, 2010 the Deepwater Horizon oil rig exploded, and thus began one of the worst petroleum-related oceanic environmental disasters. The point in highlighting this particular data set is to remind all of us that no matter how sophisticated the model, there is always uncertainty. Statistics is a tool for helping to make informed decisions in the presence of uncertainty; but models are not reality and no model is perfect. However, the more data available and analyzed, the less uncertainty one is likely to have in estimates resulting from a model. As the current article is meant to be an introduction, let’s return to some of the more basic concepts of time series analysis.</p>
<h4><strong>Autocorrelation</strong></h4>
<p>Autocorrelation can be considered a measure of the momentum of a time series. In most time series, it is reasonable to suspect that the most recent data points are likely to exert the most influence on the next (i.e. future) data point. Autocorrelation is a type of correlation statistic specifically for correlating the most recent data point to other data points in the series. Recall, the most recent point is notated <em>y<sub>t</sub></em> and subsequently older points are labeled <em>y<sub>t-1</sub></em>, <em>y<sub>t-2</sub></em>, <em>y<sub>t-3</sub></em>…<em>y<sub>t-k</sub></em> (where <em>k </em>= <em>t </em>– 1). The maximum number of autocorrelations which can be calculated is the number of data points minus one (i.e. <em>k</em>). Each autocorrelation represents a different <em>lagged value</em> – which refers to the number of points between the most recent data and the older data. Our BP data contains 12 values and therefore we can compute 11 autocorrelations (<em>r<sub>1</sub></em>, <em>r<sub>2</sub></em>, <em>r<sub>3</sub></em>, … <em>r<sub>11</sub></em>). Graphing is generally the preferred method of inspecting the autocorrelations of a time series. The function used is simply ‘acf’ and by default it produces the desired graph. However, we can simply print the autocorrelations by changing ‘plot = TRUE’ (the default) to ‘plot = FALSE’ as seen below. One can also get the partial autocorrelations by specifying ‘type = “partial”’.</p>
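<p>In script form, the ‘acf’ calls look like the sketch below. A 12-point stand-in series is created so the example runs on its own; the values only roughly echo the BP pattern described above and are not the actual prices:</p>

```r
# Illustrative stand-in for the 12 monthly BP closing prices of 2010.
bp_close <- ts(c(58, 56, 57, 52, 43, 32, 36, 38, 41, 41, 43, 44),
               start = c(2010, 1), frequency = 12)

# Print the autocorrelations r1..r11 instead of plotting them.
acf(bp_close, lag.max = 11, plot = FALSE)

# Partial autocorrelations, which remove the influence of intermediate lags.
acf(bp_close, lag.max = 11, type = "partial", plot = FALSE)

# The default call draws the correlogram with 95% confidence bounds.
acf(bp_close, ci = 0.95)
```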
<p><img src="/benchmarks/sites/default/files/BasicTS_009.png" alt="Basic Forecasting Example 9" width="640" height="154" /> </p>
<p><img src="/benchmarks/sites/default/files/BasicTS_010.png" alt="Basic Forecasting Example 10" width="452" height="480" /></p>
<p>The blue dotted lines, in the plot above, represent a default confidence interval of 95% (i.e. ‘ci = 0.95’) which can be changed with the ‘ci’ argument (e.g. ‘ci = 0.80’ for 80%).</p>
<h4><strong>Stationarity and Non-Stationarity</strong></h4>
<p>An important concept in time series analysis is stationarity and particularly the recognition of non-stationarity in a particular time series. Stationarity refers to the idea that the time series fluctuates around a constant mean and the variance around that mean remains constant. As one might expect, most time series exhibit non-stationarity; in other words, most time series do not fluctuate around a fixed mean and they do not fluctuate uniformly. Fortunately, there is a function available in R to test for stationarity; the Kwiatkowski, Phillips, Schmidt, and Shin (KPSS; 1992) test for the null hypothesis that the time series is level or trend stationary. Using the ‘kpss.test’ function actually requires running the function twice; once to test if the series is level and once to test if the series displays trend.</p>
<p><img src="/benchmarks/sites/default/files/BasicTS_011.png" alt="Basic Forecasting Example 11" width="640" height="206" /> </p>
<p>Both tests above indicate our time series is non-stationary (i.e. both <em>p-</em>values are less than 0.05); our series does not vary uniformly and does not vary around a constant mean. When a series is non-stationary (as the tests above indicate is the case with our example), forecasting is much more difficult. However, a <em>differencing operation</em> can be used to transform a non-stationary series into a series that meets the assumptions of stationarity. For example, given the above result we would apply a <em>d = </em>1 differencing operation first and then apply the KPSS tests to the resulting differences. The differencing operation simply takes the difference between each pair of successive values (<em>d = </em>1); a second-order operation (<em>d = </em>2) takes the differences of those differences, and so on. If the resulting differenced time series shows stationarity (i.e. <em>p > </em>0.05) then the models discussed below (e.g. ARIMA) are appropriate. The differencing operation is not applied here due to the space and scope constraints of this document.</p>
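<p>A minimal sketch of the two KPSS tests and a <em>d</em> = 1 differencing pass, assuming the ‘tseries’ package is installed; the series values are illustrative stand-ins, not the article’s actual BP data:</p>

```r
# Illustrative stand-in for the BP series (not the article's actual data).
bp_close <- ts(c(58.1, 56.9, 57.3, 52.6, 43.0, 28.9,
                 38.5, 35.3, 38.0, 40.7, 40.4, 44.2),
               start = c(2010, 1), frequency = 12)

library(tseries)                     # provides kpss.test
kpss.test(bp_close, null = "Level")  # H0: the series is level stationary
kpss.test(bp_close, null = "Trend")  # H0: the series is trend stationary

# A d = 1 differencing pass: differences between successive values.
bp_diff <- diff(bp_close)
kpss.test(bp_diff, null = "Level")   # re-test the differenced series
```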
<h4><strong>Auto-Regressive Model</strong></h4>
<p>The Auto-Regressive model (AR) is nothing more complex than a linear regression model for time series. The AR model essentially assumes the current time point is most strongly related to the previous point and progressively less related to each earlier time point included in the model. When specifying an AR model, you must specify the <em>order</em>, which indicates how many lags to use. The function for fitting an AR model in R is simply ‘ar(data)’, as can be seen below with no order specified – by default ‘ar’ selects the order by minimizing the AIC (to force an AR-1 model, specify ‘order.max = 1’ and ‘aic = FALSE’). In the example below, the ‘names’ function is used to display the named objects which are returned by the ‘ar’ function. Of particular use is the ‘partialacf’ element, which returns the partial autocorrelations (simply type “ar(bp_close)$partialacf” into the console to return these values). Also typically informative are the residuals of the model (simply type “ar(bp_close)$resid” into the console to return these values); larger residuals indicate poorer fit.</p>
<p><img src="/benchmarks/sites/default/files/BasicTS_012.png" alt="Basic Forecasting Example 12" width="640" height="214" /></p>
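<p>A sketch of the ‘ar’ calls just described; the series values are illustrative stand-ins, not the article’s actual BP data:</p>

```r
# Illustrative stand-in for the BP series (not the article's actual data).
bp_close <- ts(c(58.1, 56.9, 57.3, 52.6, 43.0, 28.9,
                 38.5, 35.3, 38.0, 40.7, 40.4, 44.2),
               start = c(2010, 1), frequency = 12)

fit.ar <- ar(bp_close)   # by default 'ar' picks the order via AIC
names(fit.ar)            # the named elements returned by 'ar'
fit.ar$partialacf        # partial autocorrelations
fit.ar$resid             # residuals; larger values indicate poorer fit

# To force a first-order model (AR-1) explicitly:
ar(bp_close, order.max = 1, aic = FALSE)
```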
<h4><strong>Autoregressive-Moving-Average Models</strong></h4>
<p>Autoregressive-moving-average (ARMA) models are based on two polynomial functions: one for the autoregression (AR) and a second for the moving-average (MA). The basic function for fitting an ARMA model to a univariate time series is simply the ‘arma’ function (from the ‘tseries’ package) as demonstrated below. The function defaults to ‘order = c(1, 1)’ – an AR order of one and an MA order of one – where each order reflects the number of lag periods for that component of the model.</p>
<p><img src="/benchmarks/sites/default/files/BasicTS_013.png" alt="Basic Forecasting Example 13" width="640" height="461" /> </p>
<p>As can be seen in the summary above, only the autoregressive portion of the ARMA model is significant. As should be expected, the autoregressive coefficient is fairly close to what was observed in the AR model from the previous section.</p>
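<p>A minimal sketch of the ARMA fit, assuming the ‘tseries’ package; the series values are illustrative stand-ins, not the article’s actual BP data:</p>

```r
library(tseries)  # provides the 'arma' function

# Illustrative stand-in for the BP series (not the article's actual data).
bp_close <- ts(c(58.1, 56.9, 57.3, 52.6, 43.0, 28.9,
                 38.5, 35.3, 38.0, 40.7, 40.4, 44.2),
               start = c(2010, 1), frequency = 12)

fit.arma <- arma(bp_close)  # default order = c(1, 1): AR-1 and MA-1
summary(fit.arma)           # coefficients, standard errors, and p-values
```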
<h4><strong>Autoregressive Integrated Moving Average Models</strong></h4>
<p>The autoregressive integrated moving average (ARIMA) model is an appropriate choice (over the ARMA model) when non-stationarity is suspected or observed in the time series. The application of the ‘Arima’ function below uses default values for the arguments. The ‘Arima’ function is supplied the time series data, and two other components can be specified. The ‘order’ argument is the non-seasonal part of the ARIMA model; its three components (p, d, q) are the AR order, the degree of differencing (to correct for non-stationarity, as discussed at the beginning of this document), and the MA order. The ‘seasonal’ argument allows one to specify the seasonal part of the ARIMA model plus the period (which defaults to ‘frequency(x)’); it should be a list with components ‘order’ and ‘period’, but a plain numeric vector of length 3 will be turned into a suitable list with that vector as the order.</p>
<p><img src="/benchmarks/sites/default/files/BasicTS_014.png" alt="Basic Forecasting Example 14" width="640" height="371" /> </p>
<p>Listing the output object, or retrieving a summary of it, provides the basic information – primarily the coefficients (and their standard errors [s.e.]). To see all the elements of the output, in case one wants to extract specific parts for further computation, we can use the ‘names’ function and then index each element with ‘$’ and its name.</p>
<p><img src="/benchmarks/sites/default/files/BasicTS_015.png" alt="Basic Forecasting Example 15" width="640" height="132" /> </p>
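<p>A sketch of fitting an ARIMA model and indexing its elements, assuming the ‘forecast’ package (which provides the capital-A ‘Arima’); the series values and the (p, d, q) order here are illustrative stand-ins, not necessarily those used in the article:</p>

```r
library(forecast)  # provides Arima()

# Illustrative stand-in for the BP series (not the article's actual data);
# the (p, d, q) order below is likewise just an illustration.
bp_close <- ts(c(58.1, 56.9, 57.3, 52.6, 43.0, 28.9,
                 38.5, 35.3, 38.0, 40.7, 40.4, 44.2),
               start = c(2010, 1), frequency = 12)

fit <- Arima(bp_close, order = c(1, 1, 0))  # AR order 1, d = 1, MA order 0
fit          # listing the object prints the coefficients and s.e.'s
names(fit)   # all elements of the fitted object
fit$coef     # index a specific element with '$' and its name
fit$aic
```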
<p>We can also use our ARIMA model to forecast – providing an estimate of the expected next time series point(s) – and show that prediction graphically. The ‘forecast’ function provides 80% and 95% confidence intervals as well as a point estimate. Both functions used below request only one-step-ahead predictions; however, both are capable of forecasting multiple future time points.</p>
<p><img src="/benchmarks/sites/default/files/BasicTS_016.png" alt="Basic Forecasting Example 16" width="640" height="129" /> </p>
<p><img src="/benchmarks/sites/default/files/BasicTS_017.png" alt="Basic Forecasting Example 17" width="454" height="480" /> </p>
<p>The rather large interval (shaded area) around the point estimate is reasonable given the large fluctuation of the existing series (i.e. the drastic decrease in price once the oil spill was made public). In the graph above, the blue shading represents the 80% confidence interval, while the gray shading represents the 95% interval.</p>
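<p>A sketch of the forecasting step, assuming the ‘forecast’ package; the series values and model order are illustrative stand-ins, not necessarily those used in the article:</p>

```r
library(forecast)

# Illustrative stand-in for the BP series (not the article's actual data).
bp_close <- ts(c(58.1, 56.9, 57.3, 52.6, 43.0, 28.9,
                 38.5, 35.3, 38.0, 40.7, 40.4, 44.2),
               start = c(2010, 1), frequency = 12)
fit <- Arima(bp_close, order = c(1, 1, 0))

fc <- forecast(fit, h = 1)  # one-step-ahead point estimate with 80%/95% intervals
fc
plot(fc)                    # point estimate with shaded interval bands
predict(fit, n.ahead = 1)   # base-R style prediction: point estimate + s.e.
```

<p>Increasing ‘h’ (or ‘n.ahead’) extends the forecast further into the future, with correspondingly wider intervals.</p>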
<p>There are also ways of filtering the time series. Below, the series is filtered with exponential smoothing and then (separately) filtered as a non-seasonal model.</p>
<p><img src="/benchmarks/sites/default/files/BasicTS_018.png" alt="Basic Forecasting Example 18" width="640" height="394" /> </p>
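<p>One way to sketch both filters is with base R’s ‘HoltWinters’ function (the article’s exact filtering calls are not visible in the screenshot, so this is an assumption); the series values are illustrative stand-ins:</p>

```r
# Illustrative stand-in for the BP series (not the article's actual data).
bp_close <- ts(c(58.1, 56.9, 57.3, 52.6, 43.0, 28.9,
                 38.5, 35.3, 38.0, 40.7, 40.4, 44.2),
               start = c(2010, 1), frequency = 12)

# Simple exponential smoothing: level only (no trend, no seasonal component).
hw.exp <- HoltWinters(bp_close, beta = FALSE, gamma = FALSE)

# A non-seasonal model: level and trend, but no seasonal component.
hw.ns <- HoltWinters(bp_close, gamma = FALSE)

plot(hw.exp)  # observed series with the fitted (filtered) values overlaid
```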
<p>Lastly, much of the above has been covered on the RSS <a href="http://www.unt.edu/rss/class/Jon/R_SC/">Do-it-yourself Introduction to R</a> web site and specifically <a href="http://www.unt.edu/rss/class/Jon/R_SC/Module9/BP_TimeSeries.R">here</a> in Module 10.</p>
<p>Until next time, <em>why are you wearing that stupid man suit?</em></p>
<h4><strong>References & Resources</strong></h4>
<p>Fox, J. (2005). The R Commander: A basic statistics graphical user interface to R. <em>Journal of Statistical Software, 14</em>(9), 1 – 42. Documentation available at CRAN: <a href="http://cran.r-project.org/web/packages/Rcmdr/index.html">http://cran.r-project.org/web/packages/Rcmdr/index.html</a></p>
<p>Hodgess, E. (2012). Package RcmdrPlugin.epack. Documentation available at CRAN: <a href="http://cran.r-project.org/web/packages/RcmdrPlugin.epack/index.html">http://cran.r-project.org/web/packages/RcmdrPlugin.epack/index.html</a></p>
<p>Hyndman, R. J. (2013). Package fpp. Documentation available CRAN: <a href="http://cran.r-project.org/web/packages/fpp/index.html">http://cran.r-project.org/web/packages/fpp/index.html</a></p>
<p>Hyndman, R. J., & Athanasopoulos, G. (2015). <em>Forecasting: principles and practice</em>. Freely available online at: <a href="https://www.otexts.org/fpp">https://www.otexts.org/fpp</a></p>
<p>Hyndman, R. J. (2014). CRAN Task View: Time Series Analysis. Available at CRAN: <a href="http://cran.r-project.org/web/views/TimeSeries.html">http://cran.r-project.org/web/views/TimeSeries.html</a></p>
<p>Hyndman, R. J., Koehler, A. B., Ord, J. K., & Snyder, R. D. (2008). <em>Forecasting with exponential smoothing: The state space approach</em>. Springer.</p>
<p>Kwiatkowski, D., Phillips, P. C. B., Schmidt, P., & Shin, Y. (1992). Testing the null hypothesis of stationarity against the alternative of a unit root. <em>Journal of Econometrics, 54</em>(1), 159 – 178. doi:10.1016/0304-4076(92)90104-Y. A free pdf of the 1992 paper is available at: <a href="http://www.deu.edu.tr/userweb/onder.hanedar/dosyalar/kpss.pdf">http://www.deu.edu.tr/userweb/onder.hanedar/dosyalar/kpss.pdf</a></p>
<p>Makridakis, S., Wheelwright, S. C., & Hyndman, R. J. (1998). <em>Forecasting: Methods and applications </em>(3<sup>rd</sup> ed.). Wiley.</p>
<p>Trapletti, A., & Hornik, K. (2015). Package tseries. Documentation available at CRAN: <a href="http://cran.r-project.org/web/packages/tseries/index.html">http://cran.r-project.org/web/packages/tseries/index.html</a></p>
<p> </p>
<p><strong style="font-size: 1em;"><span style="font-size: xx-small;">Originally published April 2015 -- Please note that information published in <em>Benchmarks Online</em> is likely to degrade over time, especially links to various Websites. To make sure you have the most current information on a specific topic, it may be best to search the UNT Website - <a href="http://www.unt.edu/">http://www.unt.edu</a> . You can also consult the UNT Helpdesk - <a href="http://www.unt.edu/helpdesk/">http://www.unt.edu/helpdesk/</a>. Questions and comments should be directed to <a href="mailto:benchmarks@unt.edu">benchmarks@unt.edu</a>.</span></strong></p>
http://it.unt.edu/benchmarks/issues/2015/03/rss-matters
<div class="field field-type-date field-field-pubdate">
<div class="field-label">Publication Date: </div>
<div class="field-items">
<div class="field-item odd">
<span class="date-display-single">2015-03</span> </div>
</div>
</div>
<h3 style="text-align: center;"><img src="/benchmarks/sites/default/files/rssunt.gif" alt="RSS Logo" width="506" height="101" /></h3>
<h3><strong>Confirmatory Factor Analysis and Structural Equation Modeling Group Differences: Measurement Invariance.</strong></h3>
<p><em>Link to the last RSS article here: <a href="http://it.unt.edu/benchmarks/issues/2015/02/rss-matters">Data Reduction for making Comparisons: Principal Component Scores.</a></em> <em>-- Ed.</em></p>
<p><strong>By <a href="mailto:Jonathan.Starkweather@unt.edu">Dr. Jon Starkweather</a>, Research and Statistical Support Consultant Team </strong></p>
<p><span style="font-size: medium;"><strong>T</strong></span>his month’s article focuses on an explanation of measurement invariance. It is specifically oriented toward the context of detecting group differences among latent variables in confirmatory factor analysis (CFA) models or structural equation models (SEM). Social scientists are often concerned with identifying group differences (e.g. differences between genders, ethnicities, locations, etc.). SEM is often applied in an effort to model the complex relationships of latent variables between groups for CFA-type models. Therefore, it is likely that many social scientists would find this article useful as a means to evaluate group differences among complex latent variable model structures. Attempting to evaluate or discover group differences among latent variables is necessarily complex due to the underlying factor models which support the latent variable structures (i.e. SEM). So, it is necessary to recognize such complexity and evaluate the sequentially imposed constraints on the group differences – which leads naturally to a discussion of <em>measurement invariance</em>. An excellent reference for this material is a <em>relatively</em> new book by Beaujean (2014), particularly chapter 4.</p>
<p>Measurement invariance is not a single unified concept; although generally we can define measurement invariance as stable measurement parameters across multiple groups, settings, and time periods. Commonly, the parameters referred to in the previous sentence refer to the factor structure (i.e. specific observed variables to latent variables, etc.), factor loadings, intercepts, and the latent variable means of a measurement model (i.e. factor model). Typically, there are a series of sequentially imposed measurement constraints, ranked as level 1 (configural invariance), level 2 (weak invariance), level 3 (strong invariance), and level 4 (strict invariance). Configural invariance refers to the <em>configuration </em>or structure of the factor model (i.e. which observed variables go with which latent factors). Weak invariance refers to factor loadings (and configuration) being the same between two groups, settings, or time periods. Strong invariance refers to the intercepts (configuration, and loadings) of the factor model and strict invariance refers to the latent variable means (configuration, loadings, and intercepts) being the same between two groups, settings, or time periods.</p>
<p>Testing for measurement invariance consists of a series of statistical hypotheses that assume population group factor parameters are equal between the groups. Fortunately, there is (of course) a function in R for testing measurement invariance in CFA and SEM models. The package ‘semTools’ (Pornprasertmanit, et al., 2015) contains the function ‘measurementInvariance’ which will be demonstrated below. The ‘measurementInvariance’ function takes a ‘lavaan’ package (Rosseel, et al., 2015) model object and raw data and tests the fit of the object while checking for chi-square (and fit indices) differences between two (or more) groups.</p>
<h4><strong>The examples</strong></h4>
<p>First, we import some (simulated) data. Keep in mind, the data is available so readers can duplicate what is done in this article using the script shown (script also available <a href="http://www.unt.edu/rss/class/Jon/Benchmarks/BenchmarksFeb2015.R">here</a>; data available <a href="http://www.unt.edu/rss/class/Jon/ExampleData/measInvar_df.txt">here</a>). The data includes two groups (<em>n<sub>1</sub> = 500 </em>& <em>n<sub>2</sub> = 502</em>; <em>N = 1002</em>) with responses on (<em>j = 24</em>) variables (x1, x2, x3, … x24).</p>
<p><span style="color: #ff0000;">df.1 <- read.table("http://www.unt.edu/rss/class/Jon/ExampleData/measInvar_df.txt",</span></p>
<p><span style="color: #ff0000;"> header = TRUE, sep = ",", na.strings = "NA", dec = ".",</span></p>
<p><span style="color: #ff0000;"> strip.white = TRUE)</span></p>
<p><span style="color: #ff0000;">summary(df.1)</span></p>
<p> <span style="color: #0000ff;">group x1 x2 </span></p>
<p><span style="color: #0000ff;"> Min. :1.000 Min. :-3.703924 Min. :-4.24310 </span></p>
<p><span style="color: #0000ff;"> 1st Qu.:1.000 1st Qu.:-0.843175 1st Qu.:-0.88901 </span></p>
<p><span style="color: #0000ff;"> Median :2.000 Median : 0.076019 Median :-0.06182 </span></p>
<p><span style="color: #0000ff;"> Mean :1.501 Mean : 0.001051 Mean :-0.05534 </span></p>
<p><span style="color: #0000ff;"> 3rd Qu.:2.000 3rd Qu.: 0.853784 3rd Qu.: 0.80925 </span></p>
<p><span style="color: #0000ff;"> Max. :2.000 Max. : 3.579749 Max. : 3.77787 </span></p>
<p><span style="color: #0000ff;"> x3 x4 x5 </span></p>
<p> <span style="color: #0000ff;">Min. :-4.015567 Min. :-3.88353 Min. :-3.86466 </span></p>
<p><span style="color: #0000ff;"> 1st Qu.:-0.886522 1st Qu.:-0.89705 1st Qu.:-0.85205 </span></p>
<p><span style="color: #0000ff;"> Median : 0.046421 Median :-0.07672 Median :-0.02942 </span></p>
<p><span style="color: #0000ff;"> Mean : 0.004654 Mean :-0.05154 Mean :-0.02075 </span></p>
<p><span style="color: #0000ff;"> 3rd Qu.: 0.876326 3rd Qu.: 0.82199 3rd Qu.: 0.84022 </span></p>
<p><span style="color: #0000ff;"> Max. : 3.503825 Max. : 3.60557 Max. : 2.94853 </span></p>
<p><span style="color: #0000ff;"> x6 x7 x8 </span></p>
<p> <span style="color: #0000ff;">Min. :-4.82883 Min. :-3.415288 Min. :-3.56686 </span></p>
<p><span style="color: #0000ff;"> 1st Qu.:-0.86454 1st Qu.:-0.847181 1st Qu.:-0.80358 </span></p>
<p><span style="color: #0000ff;"> Median : 0.01619 Median : 0.042244 Median : 0.03872 </span></p>
<p><span style="color: #0000ff;"> Mean : 0.02247 Mean : 0.005208 Mean : 0.05161 </span></p>
<p><span style="color: #0000ff;"> 3rd Qu.: 0.90802 3rd Qu.: 0.853102 3rd Qu.: 0.89892 </span></p>
<p><span style="color: #0000ff;"> Max. : 4.06204 Max. : 3.199517 Max. : 4.16097 </span></p>
<p><span style="color: #0000ff;"> x9 x10 x11 x12 </span> </p>
<p> <span style="color: #0000ff;">Min. : 6.656 Min. : 6.187 Min. : 6.298 Min. : 6.081 </span></p>
<p><span style="color: #0000ff;"> 1st Qu.: 9.251 1st Qu.: 9.261 1st Qu.: 9.213 1st Qu.: 9.257 </span></p>
<p><span style="color: #0000ff;"> Median :10.085 Median :10.058 Median :10.041 Median :10.107 </span></p>
<p><span style="color: #0000ff;"> Mean :10.057 Mean :10.038 Mean :10.041 Mean :10.059 </span></p>
<p><span style="color: #0000ff;"> 3rd Qu.:10.834 3rd Qu.:10.873 3rd Qu.:10.850 3rd Qu.:10.831 </span></p>
<p><span style="color: #0000ff;"> Max. :13.628 Max. :13.615 Max. :13.949 Max. :13.481 </span></p>
<p><span style="color: #0000ff;"> x13 x14 x15 x16 </span> </p>
<p> <span style="color: #0000ff;">Min. : 6.077 Min. : 6.471 Min. : 6.450 Min. : 6.463 </span></p>
<p><span style="color: #0000ff;"> 1st Qu.: 9.202 1st Qu.: 9.210 1st Qu.: 9.171 1st Qu.: 9.223 </span></p>
<p><span style="color: #0000ff;"> Median :10.010 Median :10.049 Median :10.022 Median : 9.990 </span></p>
<p><span style="color: #0000ff;"> Mean :10.004 Mean :10.008 Mean : 9.979 Mean : 9.991 </span></p>
<p><span style="color: #0000ff;"> 3rd Qu.:10.796 3rd Qu.:10.795 3rd Qu.:10.808 3rd Qu.:10.785 </span></p>
<p><span style="color: #0000ff;"> Max. :13.692 Max. :13.386 Max. :13.386 Max. :14.251 </span></p>
<p><span style="color: #0000ff;"> x17 x18 x19 x20 </span> </p>
<p> <span style="color: #0000ff;">Min. : 6.154 Min. : 6.854 Min. : 6.687 Min. : 5.959 </span></p>
<p><span style="color: #0000ff;"> 1st Qu.: 9.190 1st Qu.: 9.233 1st Qu.: 9.227 1st Qu.: 9.190 </span></p>
<p><span style="color: #0000ff;"> Median :10.020 Median :10.033 Median : 9.988 Median : 9.945 </span></p>
<p><span style="color: #0000ff;"> Mean : 9.999 Mean :10.019 Mean :10.002 Mean : 9.957 </span></p>
<p><span style="color: #0000ff;"> 3rd Qu.:10.729 3rd Qu.:10.795 3rd Qu.:10.784 3rd Qu.:10.741 </span></p>
<p><span style="color: #0000ff;"> Max. :13.122 Max. :13.044 Max. :13.510 Max. :12.746 </span></p>
<p><span style="color: #0000ff;"> x21 x22 x23 x24 </span></p>
<p><span style="color: #0000ff;"> Min. : 6.657 Min. : 6.466 Min. : 6.111 </span> <span style="color: #0000ff;">Min. : 6.468 </span></p>
<p><span style="color: #0000ff;"> 1st Qu.: 9.309 1st Qu.: 9.250 1st Qu.: 9.281 1st Qu.: 9.318 </span></p>
<p><span style="color: #0000ff;"> Median :10.036 Median :10.022 Median : 9.984 Median :10.040 </span></p>
<p><span style="color: #0000ff;"> Mean :10.025 Mean :10.002 Mean :10.003 Mean :10.050 </span></p>
<p><span style="color: #0000ff;"> 3rd Qu.:10.742 3rd Qu.:10.735 3rd Qu.:10.760 3rd Qu.:10.787 </span></p>
<p><span style="color: #0000ff;"> Max. :13.497 Max. :13.164 Max. :12.962 Max. :13.449</span></p>
<p>Upon initial inspection, the two groups appear to be virtually identical in terms of how the factor model fits each group’s data.</p>
<p><span style="color: #ff0000;">factanal(df.1[1:500, 2:9], factors = 2) # Group 1.</span></p>
<p><span style="color: #0000ff;">Call:</span></p>
<p><span style="color: #0000ff;">factanal(x = df.1[1:500, 2:9], factors = 2)</span></p>
<p><span style="color: #0000ff;">Uniquenesses:</span></p>
<p><span style="color: #0000ff;"> x1 x2 x3 x4 x5 x6 x7 x8</span></p>
<p><span style="color: #0000ff;">0.338 0.401 0.323 0.348 0.507 0.485 0.556 0.572</span></p>
<p><span style="color: #0000ff;">Loadings:</span></p>
<p><span style="color: #0000ff;"> Factor1 Factor2</span></p>
<p><span style="color: #0000ff;">x1 0.812 </span></p>
<p><span style="color: #0000ff;">x2 0.774 </span></p>
<p><span style="color: #0000ff;">x3 0.823 </span></p>
<p><span style="color: #0000ff;">x4 0.807 </span></p>
<p><span style="color: #0000ff;">x5 0.702</span></p>
<p><span style="color: #0000ff;">x6 0.716</span></p>
<p><span style="color: #0000ff;">x7 0.666</span></p>
<p><span style="color: #0000ff;">x8 0.654</span></p>
<p> <span style="color: #0000ff;"> Factor1 Factor2</span></p>
<p><span style="color: #0000ff;">SS loadings 2.588 1.882</span></p>
<p><span style="color: #0000ff;">Proportion Var 0.323 0.235</span></p>
<p><span style="color: #0000ff;">Cumulative Var 0.323 0.559</span></p>
<p><span style="color: #0000ff;">Test of the hypothesis that 2 factors are sufficient.</span></p>
<p><span style="color: #0000ff;">The chi square statistic is 21.21 on 13 degrees of freedom.</span></p>
<p><span style="color: #0000ff;">The p-value is 0.0689</span></p>
<p><span style="color: #ff0000;">factanal(df.1[501:1002,2:9], factors = 2) # Group 2.</span></p>
<p><span style="color: #0000ff;">Call:</span></p>
<p><span style="color: #0000ff;">factanal(x = df.1[501:1002, 2:9], factors = 2)</span></p>
<p><span style="color: #0000ff;">Uniquenesses:</span></p>
<p><span style="color: #0000ff;"> x1 x2 x3 x4 x5 x6 x7 x8</span></p>
<p><span style="color: #0000ff;">0.371 0.359 0.363 0.317 0.519 0.515 0.541 0.498</span></p>
<p><span style="color: #0000ff;">Loadings:</span></p>
<p><span style="color: #0000ff;"> Factor1 Factor2</span></p>
<p><span style="color: #0000ff;">x1 0.793 </span></p>
<p><span style="color: #0000ff;">x2 0.801 </span></p>
<p><span style="color: #0000ff;">x3 0.798 </span></p>
<p><span style="color: #0000ff;">x4 0.826 </span></p>
<p><span style="color: #0000ff;">x5 0.691</span></p>
<p><span style="color: #0000ff;">x6 0.696</span></p>
<p><span style="color: #0000ff;">x7 0.677</span></p>
<p><span style="color: #0000ff;">x8 0.708</span></p>
<p><span style="color: #0000ff;"> Factor1 Factor2</span></p>
<p><span style="color: #0000ff;">SS loadings 2.594 1.923</span></p>
<p><span style="color: #0000ff;">Proportion Var 0.324 0.240</span></p>
<p><span style="color: #0000ff;">Cumulative Var 0.324 0.565</span></p>
<p><span style="color: #0000ff;">Test of the hypothesis that 2 factors are sufficient.</span></p>
<p><span style="color: #0000ff;">The chi square statistic is 16.2 on 13 degrees of freedom.</span></p>
<p><span style="color: #0000ff;">The p-value is 0.238</span></p>
<p>Next, we load the ‘lavaan’ and ‘semTools’ packages in order to specify the CFA model and test for the levels of measurement invariance formally.</p>
<p><span style="color: #ff0000;">library(lavaan)</span></p>
<p><span style="color: #0000ff;">This is lavaan 0.5-17</span></p>
<p><span style="color: #0000ff;">lavaan is BETA software! Please report any bugs.</span></p>
<p><span style="color: #ff0000;">library(semTools)</span></p>
<p><span style="color: #0000ff;">###############################################################################</span></p>
<p><span style="color: #0000ff;">This is semTools 0.4-6</span></p>
<p><span style="color: #0000ff;">All users of R (or SEM) are invited to submit functions or ideas for functions.</span></p>
<p><span style="color: #0000ff;">###############################################################################</span></p>
<p><span style="color: #ff0000;">cfa.model <- '</span></p>
<p><span style="color: #ff0000;"> f1 =~ x1 + x2 + x3 + x4</span></p>
<p><span style="color: #ff0000;"> f2 =~ x5 + x6 + x7 + x8</span></p>
<p><span style="color: #ff0000;"> f1 ~~ 0*f2</span></p>
<p><span style="color: #ff0000;"> '</span></p>
<p><span style="color: #ff0000;">measurementInvariance(cfa.model, data = df.1, group = "group")</span></p>
<p><span style="color: #0000ff;">Measurement invariance tests:</span></p>
<p><span style="color: #0000ff;">Model 1: configural invariance:</span></p>
<p><span style="color: #0000ff;"> chisq df pvalue cfi rmsea bic</span></p>
<p><span style="color: #0000ff;"> 48.209 40.000 0.175 0.997 0.020 19980.029</span></p>
<p><span style="color: #0000ff;">Model 2: weak invariance (equal loadings):</span></p>
<p><span style="color: #0000ff;"> chisq df pvalue cfi rmsea bic</span></p>
<p><span style="color: #0000ff;"> 51.489 46.000 0.268 0.998 0.015 19941.851</span></p>
<p><span style="color: #0000ff;">[Model 1 versus model 2]</span></p>
<p><span style="color: #0000ff;"> delta.chisq delta.df delta.p.value delta.cfi</span></p>
<p><span style="color: #0000ff;"> 3.280 6.000 0.773 -0.001</span></p>
<p><span style="color: #0000ff;">Model 3: strong invariance (equal loadings + intercepts):</span></p>
<p><span style="color: #0000ff;"> chisq df pvalue cfi rmsea bic</span></p>
<p><span style="color: #0000ff;"> 56.353 52.000 0.315 0.999 0.013 19905.257</span></p>
<p><span style="color: #0000ff;">[Model 1 versus model 3]</span></p>
<p><span style="color: #0000ff;"> delta.chisq delta.df delta.p.value delta.cfi</span></p>
<p><span style="color: #0000ff;"> 8.145 12.000 0.774 -0.001</span></p>
<p><span style="color: #0000ff;">[Model 2 versus model 3]</span></p>
<p><span style="color: #0000ff;"> delta.chisq delta.df delta.p.value delta.cfi</span></p>
<p><span style="color: #0000ff;"> 4.864 6.000 0.561 0.000</span></p>
<p><span style="color: #0000ff;">Model 4: equal loadings + intercepts + means:</span></p>
<p><span style="color: #0000ff;"> chisq df pvalue cfi rmsea bic</span></p>
<p><span style="color: #0000ff;"> 1222.336 54.000 0.000 0.622 0.208 21057.420</span></p>
<p><span style="color: #0000ff;">[Model 1 versus model 4]</span></p>
<p><span style="color: #0000ff;"> delta.chisq delta.df delta.p.value delta.cfi</span></p>
<p><span style="color: #0000ff;"> 1174.127 14.000 0.000 0.375</span></p>
<p><span style="color: #0000ff;">[Model 3 versus model 4]</span></p>
<p><span style="color: #0000ff;"> delta.chisq delta.df delta.p.value delta.cfi</span></p>
<p><span style="color: #0000ff;"> 1165.983 2.000 0.000 0.376</span></p>
<p>Evaluating the output of the ‘measurementInvariance’ function necessarily starts with configural invariance (model 1), which assumes the factor pattern is equal for both groups. Next, the second hypothesis is evaluated: weak invariance (model 2), which evaluates the chi-square change (or delta: Δ) and associated <em>p-</em>value, as well as the change in the Comparative Fit Index (CFI). The output for the comparison between model 1 and model 2 indicates no statistically significant change in the chi-square value, and the CFI changes very little – which indicates the loadings of the two groups are <em>close enough</em>. When the loadings are essentially the same, weak measurement invariance is supported. The next hypothesis, strong invariance (model 3), is then evaluated. Model 3 tests the hypothesis that the loadings <em>and intercepts</em> are the same, or statistically equivalent, for both groups. The output shows that the first comparison, model 1 to model 3, is not statistically significant (<em>p </em>= 0.774), meaning the chi-square value is not significantly different between those two models. The second comparison, model 2 to model 3, is also not statistically significant (<em>p </em>= 0.561). In other words, when the loadings and intercepts are constrained to be equal, the model fit is not significantly different from the unconstrained fit across the two groups. Therefore, strong measurement invariance is supported. However, when we evaluate the final hypothesis of measurement invariance, strict invariance (model 4), we find that the latent variable <em>means</em> appear to be different – based on the chi-square change, indicating a significant difference between the groups’ fit. Several pieces of output show this difference. First, compare the chi-square values for model 3 (χ² = 56.353, <em>df </em>= 52, <em>p </em>= 0.315) and model 4 (χ² = 1222.336, <em>df </em>= 54, <em>p </em>< 0.001); that is a substantial change in chi-square. Also, notice how much the CFI changed from model 3 (<em>cfi</em> = 0.999) to model 4 (<em>cfi</em> = 0.622), while model 2 (<em>cfi</em> = 0.998) and model 1 (<em>cfi</em> = 0.997) are both very close to model 3. These differences (in chi-square & CFI) are also revealed in the two model comparisons. Comparing the change in fit between model 1 and model 4, we observe a significant chi-square change (χ²<sub>Δ</sub> = 1174.127, <em>df</em><sub>Δ</sub> = 14, <em>p</em><sub>Δ</sub> < 0.001). Furthermore, comparing the change in fit between model 3 and model 4, we observe another significant chi-square change (χ²<sub>Δ</sub> = 1165.983, <em>df</em><sub>Δ</sub> = 2, <em>p</em><sub>Δ</sub> < 0.001). The appropriate conclusion is that we do not have strict measurement invariance.</p>
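<p>The same sequence of nested models can also be fit one at a time with ‘lavaan’ alone: the ‘group.equal’ argument of ‘cfa’ imposes each set of equality constraints, and ‘anova’ produces the chi-square difference tests. A sketch, using the data URL and the two-factor model specification from earlier in the article:</p>

```r
library(lavaan)

# Data and two-factor model as specified earlier in the article.
df.1 <- read.table("http://www.unt.edu/rss/class/Jon/ExampleData/measInvar_df.txt",
                   header = TRUE, sep = ",", na.strings = "NA", dec = ".",
                   strip.white = TRUE)
cfa.model <- '
  f1 =~ x1 + x2 + x3 + x4
  f2 =~ x5 + x6 + x7 + x8
  f1 ~~ 0*f2
  '

fit.config <- cfa(cfa.model, data = df.1, group = "group")            # model 1
fit.weak   <- cfa(cfa.model, data = df.1, group = "group",
                  group.equal = "loadings")                           # model 2
fit.strong <- cfa(cfa.model, data = df.1, group = "group",
                  group.equal = c("loadings", "intercepts"))          # model 3
fit.means  <- cfa(cfa.model, data = df.1, group = "group",
                  group.equal = c("loadings", "intercepts", "means")) # model 4

anova(fit.config, fit.weak, fit.strong, fit.means)  # chi-square difference tests
```

<p>This reproduces, one constraint set at a time, the nested comparisons that the ‘measurementInvariance’ output reports all at once.</p>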
<p>The utility of the ‘measurementInvariance’ function extends beyond straightforward CFA; it can be applied in SEM settings as well. For instance, following the Anderson and Gerbing (1988) two stage approach to SEM, we can specify the measurement model of a SEM and use the ‘measurementInvariance’ function to check the levels (or models) of measurement invariance.</p>
<p><span style="color: #ff0000;">cfa.model <- '</span></p>
<p><span style="color: #ff0000;"> f1 =~ x1 + x2 + x3 + x4</span></p>
<p><span style="color: #ff0000;"> f2 =~ x5 + x6 + x7 + x8</span></p>
<p><span style="color: #ff0000;"> f3 =~ x9 + x10 + x11 + x12 + x13 + x14 + x15</span></p>
<p><span style="color: #ff0000;"> f4 =~ x16 + x17 + x18 + x19 + x20</span></p>
<p><span style="color: #ff0000;"> f5 =~ x21 + x22 + x23 + x24</span></p>
<p><span style="color: #ff0000;"> f1 ~~ 0*f2</span></p>
<p><span style="color: #ff0000;"> f1 ~~ f3</span></p>
<p><span style="color: #ff0000;"> f1 ~~ f4</span></p>
<p><span style="color: #ff0000;"> f1 ~~ f5</span></p>
<p><span style="color: #ff0000;"> f2 ~~ f3</span></p>
<p><span style="color: #ff0000;"> f2 ~~ f4</span></p>
<p><span style="color: #ff0000;"> f2 ~~ f5</span></p>
<p><span style="color: #ff0000;"> f3 ~~ f4</span></p>
<p><span style="color: #ff0000;"> f3 ~~ f5</span></p>
<p><span style="color: #ff0000;"> f4 ~~ f5</span></p>
<p><span style="color: #ff0000;"> '</span></p>
<p><span style="color: #ff0000;">measurementInvariance(cfa.model, data = df.1, group = "group")</span></p>
<p><span style="color: #0000ff;">Measurement invariance tests:</span></p>
<p><span style="color: #0000ff;">Model 1: configural invariance:</span></p>
<p><span style="color: #0000ff;"> chisq df pvalue cfi rmsea bic</span></p>
<p><span style="color: #0000ff;"> 492.800 486.000 0.406 0.999 0.005 61241.402</span></p>
<p><span style="color: #0000ff;">Model 2: weak invariance (equal loadings):</span></p>
<p><span style="color: #0000ff;"> chisq df pvalue cfi rmsea bic</span></p>
<p><span style="color: #0000ff;"> 508.302 505.000 0.450 1.000 0.004 61125.619</span></p>
<p><span style="color: #0000ff;">[Model 1 versus model 2]</span></p>
<p><span style="color: #0000ff;"> delta.chisq delta.df delta.p.value delta.cfi</span></p>
<p><span style="color: #0000ff;"> 15.502 19.000 0.690 0.000</span></p>
<p><span style="color: #0000ff;">Model 3: strong invariance (equal loadings + intercepts):</span></p>
<p><span style="color: #0000ff;"> chisq df pvalue cfi rmsea bic</span></p>
<p><span style="color: #0000ff;"> 528.129 524.000 0.441 1.000 0.004 61014.161</span></p>
<p><span style="color: #0000ff;">[Model 1 versus model 3]</span></p>
<p><span style="color: #0000ff;"> delta.chisq delta.df delta.p.value delta.cfi</span></p>
<p><span style="color: #0000ff;"> 35.329 38.000 0.594 0.000</span></p>
<p><span style="color: #0000ff;">[Model 2 versus model 3]</span></p>
<p><span style="color: #0000ff;"> delta.chisq delta.df delta.p.value delta.cfi</span></p>
<p><span style="color: #0000ff;"> 19.827 19.000 0.405 0.000</span></p>
<p><span style="color: #0000ff;">Model 4: equal loadings + intercepts + means:</span></p>
<p><span style="color: #0000ff;"> chisq df pvalue cfi rmsea bic</span></p>
<p><span style="color: #0000ff;"> 1732.314 529.000 0.000 0.855 0.067 62183.796</span></p>
<p><span style="color: #0000ff;">[Model 1 versus model 4]</span></p>
<p><span style="color: #0000ff;"> delta.chisq delta.df delta.p.value delta.cfi</span></p>
<p><span style="color: #0000ff;"> 1239.513 43.000 0.000 0.145</span></p>
<p><span style="color: #0000ff;">[Model 3 versus model 4]</span></p>
<p><span style="color: #0000ff;"> delta.chisq delta.df delta.p.value delta.cfi</span></p>
<p><span style="color: #0000ff;"> 1204.184 5.000 0.000 0.145</span></p>
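 1204.184">
Although the semTools measurementInvariance function automates the whole sequence above, the same nested models can be fit and compared by hand with lavaan's cfa function. The sketch below uses lavaan's built-in HolzingerSwineford1939 example data (not this article's df.1), with school as the grouping variable; the model is a small stand-in for cfa.model:

```r
library(lavaan)

# Two-factor CFA on lavaan's built-in example data; model, data, and
# grouping variable are stand-ins for the article's cfa.model / df.1 / group
model <- 'visual  =~ x1 + x2 + x3
          textual =~ x4 + x5 + x6'

configural <- cfa(model, data = HolzingerSwineford1939, group = "school")
weak <- cfa(model, data = HolzingerSwineford1939, group = "school",
            group.equal = "loadings")
strong <- cfa(model, data = HolzingerSwineford1939, group = "school",
              group.equal = c("loadings", "intercepts"))

# Chi-square difference tests between the nested models
anova(configural, weak, strong)
```

The group.equal argument is what adds the equality constraints defining each invariance level; anova() then performs the same chi-square difference tests that appear in the output above.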
<p>It is also possible to specify the structural model of a SEM and check it for measurement invariance, as shown below.</p>
<p><span style="color: #ff0000;">str.model <- '</span></p>
<p><span style="color: #ff0000;"> f1 =~ x1 + x2 + x3 + x4</span></p>
<p><span style="color: #ff0000;"> f2 =~ x5 + x6 + x7 + x8</span></p>
<p><span style="color: #ff0000;"> f3 =~ x9 + x10 + x11 + x12 + x13 + x14 + x15</span></p>
<p><span style="color: #ff0000;"> f4 =~ x16 + x17 + x18 + x19 + x20</span></p>
<p><span style="color: #ff0000;"> f5 =~ x21 + x22 + x23 + x24</span></p>
<p><span style="color: #ff0000;"> f4 ~ f1</span></p>
<p><span style="color: #ff0000;"> f3 ~ f2</span></p>
<p><span style="color: #ff0000;"> f5 ~ f2 + f3</span></p>
<p><span style="color: #ff0000;"> f1 ~~ 0*f2</span></p>
<p><span style="color: #ff0000;"> f1 ~~ f3</span></p>
<p><span style="color: #ff0000;"> f1 ~~ f5</span></p>
<p><span style="color: #ff0000;"> f2 ~~ f4</span></p>
<p><span style="color: #ff0000;"> f3 ~~ f4</span></p>
<p><span style="color: #ff0000;"> f4 ~~ f5</span></p>
<p><span style="color: #ff0000;"> '</span></p>
<p><span style="color: #ff0000;">measurementInvariance(str.model, data = df.1, group = "group")</span></p>
<p><span style="color: #0000ff;">Measurement invariance tests:</span></p>
<p><span style="color: #0000ff;">Model 1: configural invariance:</span></p>
<p><span style="color: #0000ff;"> chisq df pvalue cfi rmsea bic</span></p>
<p><span style="color: #0000ff;"> 492.800 486.000 0.406 0.999 0.005 61241.402</span></p>
<p><span style="color: #0000ff;">Model 2: weak invariance (equal loadings):</span></p>
<p><span style="color: #0000ff;"> chisq df pvalue cfi rmsea bic</span></p>
<p><span style="color: #0000ff;"> 508.302 505.000 0.450 1.000 0.004 61125.619</span></p>
<p><span style="color: #0000ff;">[Model 1 versus model 2]</span></p>
<p><span style="color: #0000ff;"> delta.chisq delta.df delta.p.value delta.cfi</span></p>
<p><span style="color: #0000ff;"> 15.502 19.000 0.690 0.000</span></p>
<p><span style="color: #0000ff;">Model 3: strong invariance (equal loadings + intercepts):</span></p>
<p><span style="color: #0000ff;"> chisq df pvalue cfi rmsea bic</span></p>
<p><span style="color: #0000ff;"> 528.129 524.000 0.441 1.000 0.004 61014.161</span></p>
<p><span style="color: #0000ff;">[Model 1 versus model 3]</span></p>
<p><span style="color: #0000ff;"> delta.chisq delta.df delta.p.value delta.cfi</span></p>
<p><span style="color: #0000ff;"> 35.329 38.000 0.594 0.000</span></p>
<p><span style="color: #0000ff;">[Model 2 versus model 3]</span></p>
<p><span style="color: #0000ff;"> delta.chisq delta.df delta.p.value delta.cfi</span></p>
<p><span style="color: #0000ff;"> 19.827 19.000 0.405 0.000</span></p>
<p><span style="color: #0000ff;">Model 4: equal loadings + intercepts + means:</span></p>
<p><span style="color: #0000ff;"> chisq df pvalue cfi rmsea bic</span></p>
<p><span style="color: #0000ff;"> 1732.314 529.000 0.000 0.855 0.067 62183.796</span></p>
<p><span style="color: #0000ff;">[Model 1 versus model 4]</span></p>
<p><span style="color: #0000ff;"> delta.chisq delta.df delta.p.value delta.cfi</span></p>
<p><span style="color: #0000ff;"> 1239.513 43.000 0.000 0.145</span></p>
<p><span style="color: #0000ff;">[Model 3 versus model 4]</span></p>
<p><span style="color: #0000ff;"> delta.chisq delta.df delta.p.value delta.cfi</span></p>
<p><span style="color: #0000ff;"> 1204.184 5.000 0.000 0.145</span></p>
<p>The output above for both the measurement model and the structural model of the SEM shows results very similar to those observed with the initial CFA measurement invariance tests. This is because only the first two latent factors (f1 & f2) contain group differences; the remaining elements of the SEM (i.e. the f3, f4, & f5 measurement structures) do not. For those interested in duplicating everything done in this article (and seeing the results of the SEM fit with groups specified), please see the RSS <a href="http://www.unt.edu/rss/class/Jon/R_SC/">Do-it-yourself Introduction to R</a> web site, specifically <a href="http://www.unt.edu/rss/class/Jon/R_SC/Module9/MeasurementInvariance.R">here</a> in Module 9.</p>
<p>Lastly, it is very important to realize the example above used simulated data, and a relatively small data set (<em>n </em>= 1002), in order to demonstrate many aspects of measurement invariance. The large sample sizes typically seen when conducting SEM are likely to produce statistically significant chi-square change statistics, because chi-square is very sensitive to sample size; this reduces the utility of the chi-square test. The implication is that with large samples it would be very unlikely to establish measurement invariance using the chi-square change statistic alone. Therefore, Vandenberg and Lance (2000) recommend using a CFI change of 0.2 as representative of a meaningful difference between model fits (p. 47).</p>
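A CFI-change check of this kind can be computed directly from two fitted lavaan models via fitMeasures(). A sketch, again using lavaan's built-in HolzingerSwineford1939 data rather than this article's simulated data:

```r
library(lavaan)

# Configural and weak (equal loadings) models on lavaan's example data;
# the model itself is illustrative, not the article's cfa.model
model <- 'visual  =~ x1 + x2 + x3
          textual =~ x4 + x5 + x6'

configural <- cfa(model, data = HolzingerSwineford1939, group = "school")
weak <- cfa(model, data = HolzingerSwineford1939, group = "school",
            group.equal = "loadings")

# Change in CFI between the configural and weak invariance models;
# compare the drop against whatever cutoff you have adopted
delta.cfi <- fitMeasures(configural, "cfi") - fitMeasures(weak, "cfi")
delta.cfi
```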
<p>Until next time, <em>have I told you about <span style="color: #000000;">Sammy Jankis</span>?</em></p>
<h4><span style="font-size: 1em;">References and Resources</span></h4>
<p>Anderson, J. C., & Gerbing, D. W. (1988). Structural equation modeling in practice: A review and recommended two-step approach. <em>Psychological Bulletin, 103,</em> 411 - 423.</p>
<p>Beaujean, A. A. (2014). <em>Latent variable modeling in R: A step by step guide</em>. New York: Routledge.</p>
<p>Milfont, T. L., & Fischer, R. (2010). Testing measurement invariance across groups: Applications in cross-cultural research. <em>International Journal of Psychological Research, 3</em>(1), 111 – 121.</p>
<p>Pornprasertmanit, S., et al. (2015). Package ‘semTools’.</p>
<p> Documentation available at CRAN: <a href="http://cran.r-project.org/web/packages/semTools/index.html">http://cran.r-project.org/web/packages/semTools/index.html</a></p>
<p>Rosseel, Y., et al. (2015). Package ‘lavaan’.</p>
<p> Documentation available at CRAN: <a href="http://cran.r-project.org/web/packages/lavaan/index.html">http://cran.r-project.org/web/packages/lavaan/index.html</a></p>
<p>Schmitt, N., & Kuljanin, G. (2008). Measurement invariance: Review of practice and implications. <em>Human Resource Management Review, 18</em>, 210 – 222. DOI:10.1016/j.hrmr.2008.03.003</p>
<p>Vandenberg, R. J., & Lance, C. E. (2000). A review and synthesis of the measurement invariance literature: Suggestions, practices, and recommendations for organizational research. <em>Organizational Research Methods, 3</em>(1), 4 – 70.</p>
<p>van de Schoot, R., Lugtig, P., & Hox, J. (2012). A checklist for testing measurement invariance. <em>European Journal of Developmental Psychology, 1</em>, 1 – 7. DOI:10.1080/17405629.2012.686740. </p>
<p><strong style="font-size: 1em;"><span style="font-size: xx-small;">Originally published March 2015 -- Please note that information published in <em>Benchmarks Online</em> is likely to degrade over time, especially links to various Websites. To make sure you have the most current information on a specific topic, it may be best to search the UNT Website - <a href="http://www.unt.edu/">http://www.unt.edu</a> . You can also consult the UNT Helpdesk - <a href="http://www.unt.edu/helpdesk/">http://www.unt.edu/helpdesk/</a>. Questions and comments should be directed to <a href="mailto:benchmarks@unt.edu">benchmarks@unt.edu</a>.</span></strong></p>
http://it.unt.edu/benchmarks/issues/2015/02/rss-matters
<div class="field field-type-date field-field-pubdate">
<div class="field-label">Publication Date: </div>
<div class="field-items">
<div class="field-item odd">
<span class="date-display-single">2015-02</span> </div>
</div>
</div>
<h3 style="text-align: center;"><img src="/benchmarks/sites/default/files/rssunt.gif" alt="RSS Logo" width="506" height="101" /></h3>
<h3><strong>Data Reduction for Making Comparisons: Principal Component Scores.</strong></h3>
<p><em>Link to the last RSS article here: <a href="http://it.unt.edu/benchmarks/issues/2015/01/rss-matters">Explicit Bayes: Working Concrete Examples to Introduce the Bayesian Perspective.</a> -- Ed.</em></p>
<p><strong>By <a href="mailto:Jonathan.Starkweather@unt.edu">Dr. Jon Starkweather</a>, Research and Statistical Support Consultant Team </strong></p>
<p><span style="font-size: medium;"><strong>T</strong></span>wo years ago this column addressed one way of creating composite or indicator scores using Factor Analysis (Starkweather, <a href="http://www.unt.edu/rss/class/Jon/Benchmarks/CompositeScores_JDS_Feb2012.pdf">2012</a>). That article approached composite score creation from a measurement modeling perspective in which each composite score represented a latent variable. The current article approaches composite score creation from a non-measurement modeling perspective.</p>
<p>This month’s article discusses how to create composite scores from many variables using the ultimate data reduction technique: Principal Component Analysis (PCA). PCA is not measurement-model based; it is a linear-model-based data reduction technique used to reduce the number of observed variables down to their <em>principal components</em> while maximizing the <em>total </em>amount of variance explained in the observed variables. PCA assumes linear relationships among the observed variables (i.e. it is not appropriate if curvilinear relationships are discovered among the observed variables). For a more detailed explanation of the differences between PCA and FA, please consult Starkweather (<a href="http://www.unt.edu/rss/class/Jon/Benchmarks/PCAvsFAvsAA_JDS_July2010.pdf">2010</a>).</p>
<p>Occasionally, a data analyst is called upon to take many observed variables and combine or reduce them to one variable or a few variables. The observed variables may or may not be directly related to one another, and they may or may not be on the same scale. The resulting variable (or few variables) is a weighted linear composite score which can then be used to compare organizational units (e.g. departments within a larger organizational structure). In this situation, it is critically important to realize we are not interested in creating, assessing, or confirming a measurement model with latent variables and error; we are not assuming a classical test theory model of measurement. We are solely interested in reducing many variables to one variable (or a few variables) so we can compare units, whether those units are individuals or organizations.</p>
<h4><strong>The Situation: <em>General Hospital</em></strong></h4>
<p>Our example this month concerns a (<em>fictional</em>) General Hospital. The hospital board has asked the director, Annabelle Lecter, M.D., to compare the Service Departments. Each service department (Informational Services [IS], Therapeutic Services [TS], Diagnostic Services [DS], and Support Services [SS]) contains various disparate organizational structures (see pages 1 – 3 <a href="http://www.quia.com/files/quia/users/kkacher/OrganizStHsp/Org-St-Lesson-Pln">here</a>). The service departments do not initially seem comparable because each has specific tasks, budgets, numbers and statuses of personnel, degrees of patient interaction, physical supply needs (weekly, monthly, yearly), and so forth. The director has access to a variety of these types of variables for each department and wants to reduce all of this information down to a single variable on which to compare the departments. Some departments have very small values on some variables, and others very large values, by the design or purpose of the specific department.</p>
<p>At first, the director thinks it might be best to transform all these variables to <em>Z</em>-scores (i.e. standardize them) so they are all on the same scale, and then simply add or average the <em>Z</em>-scores to get one number for each department. The director quickly realizes this is not tenable because <em>Z</em>-scores, although used to compare individuals across two (or more) variables, are not meant to be combined. If <em>Z</em>-scores are averaged, the mean should be at or very near zero. Furthermore, creating a composite score using either of these techniques (sum or mean) explicitly assumes each variable is equally important and essentially interchangeable (with respect to the resulting composite score).</p>
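The director's first idea is easy to try out in base R. The department-level variables below are made up purely for illustration; note that the resulting equal-weight composite is forced to average to (essentially) zero across units, regardless of how the variables actually interrelate:

```r
set.seed(42)  # hypothetical department-level data, for illustration only
dept <- data.frame(budget   = rnorm(20, mean = 5e6, sd = 1e6),
                   staff    = rnorm(20, mean = 120, sd = 30),
                   patients = rnorm(20, mean = 900, sd = 200))

z <- scale(dept)          # standardize each variable (mean 0, sd 1)
composite <- rowMeans(z)  # equal-weight composite of the Z-scores

mean(composite)           # essentially zero across the 20 units
```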
<p>What the director really needs is a technique which creates a composite score (for each department) in such a way that each observed variable is weighted by its ability to account for variance in all the observed variables (combined). The <em>variance in all observed variables</em> is represented by the variance-covariance matrix or correlation matrix of the observed variables. By submitting the observed variables’ data (i.e. the variance-covariance or correlation matrix) to PCA, specifying the computation of Principal Component Scores (PCS), and then saving the scores of the <em>first</em> component, the director will have achieved her goal. Keep in mind, with PCA the first component is the one which accounts for the most variance, and each subsequent component accounts for variance <em>left over</em> after the variance accounted for by the previous components has been removed. So, just to be clear: if the first component accounts for 48% of the variance of the observed variables, then that is 48% of 100% of the variance of the observed variables. If the second component accounts for 25% of the variance, then that is 25% of the remaining 52% of the total variance (i.e. whatever is left after the first component has been extracted). So each subsequent component (i.e. component 3 through component <em>J</em>, where <em>J</em> is the number of observed variables) accounts for less and less of the observed variables’ total variance.</p>
<p>Now you may be asking the question: “but what does the component score <em>mean</em>?” In order to determine that, one would evaluate the direction and magnitude of the loadings of each observed variable on the first component. The variables with the largest absolute loadings are those contributing most to the component (i.e. accounting for the most variance in all the observed variables). Loadings are interpreted just like correlation coefficients – positive vs. negative and between -1 and +1. If more than one component is evaluated, it is very likely the observed variables will coalesce decisively on one component or the other, with each component’s definition (or name) becoming apparent based on which observed variables load most on it. For example, say that observed variables 1, 3, 5, 7, and 9 load most on the first component, while observed variables 2, 4, 6, 8, and 10 load most on the second component. Then we would name the first component based on the content or meaning of observed variables 1, 3, 5, 7, and 9; likewise, we would name the second component based on the content of observed variables 2, 4, 6, 8, and 10.</p>
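In R, one way to obtain the first component's variance explained, loadings, and scores is base R's prcomp function (the RSS tutorials linked below cover other routes). A minimal sketch on simulated data, where x1 and x2 share a common source and x3 is pure noise:

```r
set.seed(7)  # simulated data: x1 and x2 share a common source, x3 is noise
n <- 100
core <- rnorm(n)
obs <- data.frame(x1 = core + rnorm(n, sd = 0.5),
                  x2 = core + rnorm(n, sd = 0.5),
                  x3 = rnorm(n))

pca <- prcomp(obs, center = TRUE, scale. = TRUE)

summary(pca)          # proportion of variance accounted for by each component
pca$rotation[, 1]     # variable weights (eigenvector) for the first component
scores <- pca$x[, 1]  # first principal component score for each unit
```

The vector `scores` is the single composite variable on which units could then be compared.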
<p>Tutorials using PCA (with and without saving component scores) are available for each of the three most popular statistical software packages through the Research and Statistical Support instructional / tutorial websites (links provided directly below).</p>
<ul>
<li>For users of the statistical programming language environment R, please see: <a href="http://www.unt.edu/rss/class/Jon/R_SC/Module7/M7_PCAandFA.R">http://www.unt.edu/rss/class/Jon/R_SC/Module7/M7_PCAandFA.R</a></li>
<li>For users of the SAS programming suite, please see: <a href="http://www.unt.edu/rss/class/Jon/SAS_SC/SAS_Module7.htm">http://www.unt.edu/rss/class/Jon/SAS_SC/SAS_Module7.htm</a></li>
<li>For users of the SPSS program, please see: <a href="http://www.unt.edu/rss/class/Jon/SPSS_SC/Module9/M9_PCA/SPSS_M9_PCA1.htm">http://www.unt.edu/rss/class/Jon/SPSS_SC/Module9/M9_PCA/SPSS_M9_PCA1.htm</a></li>
</ul>
<h4 style="text-align: left;" align="center"><strong>References</strong></h4>
<p>Hospital Organizational Structure: <a href="http://www.quia.com/files/quia/users/kkacher/OrganizStHsp/Org-St-Lesson-Pln">http://www.quia.com/files/quia/users/kkacher/OrganizStHsp/Org-St-Lesson-Pln</a></p>
<p>Starkweather, J. (2010). Principal Components Analysis vs. Factor Analysis…and Appropriate Alternatives. Available in original form at Benchmarks: <a href="http://it.unt.edu/benchmarks/issues/2010/07/rss-matters">http://it.unt.edu/benchmarks/issues/2010/07/rss-matters</a> and available as an Adobe.pdf <a href="http://www.unt.edu/rss/class/Jon/Benchmarks/PCAvsFAvsAA_JDS_July2010.pdf">here</a>.</p>
<p>Starkweather, J. (2012). How to Calculate Empirically Derived Composite or Indicator Scores. Available in original form at Benchmarks: <a href="http://web3.unt.edu/benchmarks/issues/2012/02/rss-matters">http://web3.unt.edu/benchmarks/issues/2012/02/rss-matters</a> and available as an Adobe.pdf <a href="http://www.unt.edu/rss/class/Jon/Benchmarks/CompositeScores_JDS_Feb2012.pdf">here</a>.</p>
<p> </p>
<p><strong style="font-size: 1em;"><span style="font-size: xx-small;">Originally published February 2015 -- Please note that information published in <em>Benchmarks Online</em> is likely to degrade over time, especially links to various Websites. To make sure you have the most current information on a specific topic, it may be best to search the UNT Website - <a href="http://www.unt.edu/">http://www.unt.edu</a> . You can also consult the UNT Helpdesk - <a href="http://www.unt.edu/helpdesk/">http://www.unt.edu/helpdesk/</a>. Questions and comments should be directed to <a href="mailto:benchmarks@unt.edu">benchmarks@unt.edu</a>.</span></strong></p>
http://it.unt.edu/benchmarks/issues/2015/01/rss-matters
<div class="field field-type-date field-field-pubdate">
<div class="field-label">Publication Date: </div>
<div class="field-items">
<div class="field-item odd">
<span class="date-display-single">2015-01</span> </div>
</div>
</div>
<h3 style="text-align: center;"><img src="/benchmarks/sites/default/files/rssunt.gif" alt="RSS Logo" width="506" height="101" /></h3>
<h3><strong>Explicit Bayes: Working Concrete Examples to Introduce the Bayesian Perspective.</strong></h3>
<p><em>Link to the last RSS article here: <a href="http://it.unt.edu/benchmarks/issues/2014/12/rss-matters">Identifying or Verifying the Number of Factors to Extract using Very Simple Structure.</a></em> <em>-- Ed.</em></p>
<p><strong>By <a href="mailto:Jonathan.Starkweather@unt.edu">Dr. Jon Starkweather</a>, Research and Statistical Support Consultant Team </strong></p>
<p><span style="font-size: medium;"><strong>W</strong></span>e use the term <em>explicit</em> because we are going to calculate these examples <em>by hand</em> with programming rather than simply loading a package and using functions to estimate parameters. The purpose of using these explicit methods is to hopefully convey a better understanding of what it means to <em>do</em> Bayesian statistics. </p>
<p>First, we must present a little bit about Bayesian statistics. Very briefly, Bayesian statistics requires three elements: a prior, a likelihood, and a posterior. The prior is a distribution specified by the researcher which represents all <em>prior</em> information regarding the parameter the researcher is attempting to estimate. The prior represents an educated best guess at the parameter (e.g. the mean of the prior) and the degree of certainty or confidence in that guess (e.g. the variance and shape of the prior distribution). The prior is specified before (i.e. <em>prior</em> to) data collection. The prior is then combined with the likelihood (a representation of the data at hand) to create a more informed, empirical distribution of the parameter being estimated; we call this last distribution the <em>posterior</em> distribution. The mean of the posterior is our estimate of the parameter. Interval estimates can then be calculated from the posterior, and these truly represent an interval with a stated probability of containing the actual population parameter; we call them <em>credible intervals</em> (rather than confidence intervals, which <em>do not</em> tell you the probability of the population parameter being contained in the interval).</p>
<p>Let's say we want to estimate <strong>the mean</strong> IQ scores on the Weschler Adult Intelligence Scale (WAIS) of a small town, X.Town, which has a population of 10000 individuals. Let's start by importing the X.Town data.</p>
<p><span style="color: #ff0000;">x.town.df <- read.table("http://www.unt.edu/rss/class/Jon/ExampleData/X.Town.sample.txt",</span></p>
<p><span style="color: #ff0000;"> header = TRUE, sep = ",", dec = ".", na.strings = "NA")</span></p>
<p><span style="color: #ff0000;">nrow(x.town.df)</span></p>
<p><span style="color: #0000ff;">[1] 10000</span></p>
<p>We know from a mountain of normative data and prior research that the U.S. population distribution of WAIS scores has a mean (µ) of 100 and a standard deviation (σ) of 15. This information represents a best case scenario, where we <em>know</em> the population distribution and that distribution is normally distributed with an identified mean and standard deviation. Generally, we would not have such great prior information; so consider an alternative where we have virtually no prior information except knowing that the WAIS questions / procedures allow a possible score to range from 1 to 200. In such a case, our specification of a prior distribution would mean each score in that range is equally likely -- which prompts us to specify a <em>uniform</em> distribution (i.e. a distribution in which each value has an equal probability of being represented). A uniform prior is also known as an un-informative or un-informed prior. In both examples below we are using a population of 10000 individuals. </p>
<p><span style="color: #ff0000;">uninformed.prior <- rep(1:200, 50)</span></p>
<p><span style="color: #ff0000;">length(uninformed.prior)</span></p>
<p><span style="color: #0000ff;">[1] 10000</span></p>
<p><span style="color: #ff0000;">summary(uninformed.prior)</span></p>
<p> <span style="color: #0000ff;">Min. 1st Qu. Median Mean 3rd Qu. Max.</span></p>
<p><span style="color: #0000ff;"> 1.00 50.75 100.50 100.50 150.20 200.00</span></p>
<p><span style="color: #ff0000;">hist(uninformed.prior)</span></p>
<p><img src="/benchmarks/sites/default/files/EB_002.png" alt="Histogram of Uninformed.prior" width="451" height="480" /> </p>
<p>However, with the WAIS and the knowledge of the U.S. population, we can specify a Gaussian (i.e. normal) distribution as our prior.</p>
<p><span style="color: #ff0000;">informed.prior <- rnorm(10000, mean = 100, sd = 15)</span></p>
<p><span style="color: #ff0000;">length(informed.prior)</span></p>
<p><span style="color: #0000ff;">[1] 10000</span></p>
<p><span style="color: #ff0000;">summary(informed.prior)</span></p>
<p> <span style="color: #0000ff;">Min. 1st Qu. Median Mean 3rd Qu. Max.</span></p>
<p><span style="color: #0000ff;"> 37.51 89.93 100.10 100.10 110.40 157.30</span></p>
<p><span style="color: #ff0000;">hist(informed.prior)</span></p>
<p><span style="color: #ff0000;"><img src="/benchmarks/sites/default/files/EB_001.png" alt="Histogram of uninformed.prior" width="454" height="480" /></span></p>
<p>Clearly; the two example priors above are extremes (i.e. worst case and best case); there are a variety of other distributions which can be specified as priors (e.g. Cauchy, Poisson, beta, etc.) and the prior <strong>is not</strong> required to be symmetrical. For more information on the variety of distributions, see: <a href="http://en.wikipedia.org/wiki/List_of_probability_distributions">http://en.wikipedia.org/wiki/List_of_probability_distributions</a></p>
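For instance, a prior sitting between those two extremes could be drawn from a Cauchy distribution: still centered on the best guess, but with much heavier tails than the normal. The location and scale values below simply mirror the normal prior above and are illustrative only:

```r
# Cauchy prior centered at 100; far heavier tails than the normal prior
heavy.prior <- rcauchy(10000, location = 100, scale = 15)
length(heavy.prior)
summary(heavy.prior)  # expect occasional extreme values in both tails
```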
<p>Our research questions are as follows: What is the mean WAIS score of the population (<em>N</em> = 10000) of X.Town; and, does that mean differ from the larger (U.S.) population? In more precise terms, what is the population mean of X.Town WAIS scores and is that mean <em>larger</em> than the known U.S. population mean? To be clear, there are two populations we are referring to here; the population of X.Town (<em>N</em> = 10000) and the larger population of the U.S.</p>
<p>It is unrealistic to think we would have all 10000 adult citizens' data from X.Town; we would generally have a sample of that town's data. Note; the 7th column of our X Town data file contains the WAIS scores. Here we randomly sample (<em>n</em> = 1000) cases from the entire X.Town data (<em>N</em> = 10000):</p>
<p><span style="color: #ff0000;">wais.sample <- sample(x.town.df[,7], 1000, replace = FALSE)</span></p>
<p><span style="color: #ff0000;">length(wais.sample)</span></p>
<p><span style="color: #0000ff;">[1] 1000</span></p>
<h4><strong>Traditional Frequentist Perspective: Null Hypothesis Significance Testing (NHST).</strong></h4>
<p>In a traditional <em>frequentist</em> setting, we would begin by simply calculating the sample mean as our best estimate of the entire X.Town population mean WAIS score:</p>
<p><span style="color: #ff0000;">M <- mean(wais.sample)</span></p>
<p><span style="color: #ff0000;">M</span></p>
<p><span style="color: #0000ff;">[1] 107.6305</span></p>
<p>and the standard error of that mean if we wanted confidence intervals for that estimate (of the entire X.Town's mean):</p>
<p><span style="color: #ff0000;">std.err <- sqrt(15^2 / length(wais.sample))</span></p>
<p><span style="color: #ff0000;">std.err</span></p>
<p><span style="color: #0000ff;">[1] 0.4743416</span></p>
<p>Then, using an alpha value (e.g. 0.05), we look up the associated critical value (i.e. +/-1.96) in a table and calculate the lower and upper bounds of the confidence interval for our estimate (i.e. the confidence interval for the estimated mean of X.Town).</p>
<p><span style="color: #ff0000;">lower.bound <- (-1.96*std.err) + M</span></p>
<p><span style="color: #ff0000;">lower.bound</span></p>
<p><span style="color: #0000ff;">[1] 106.7008</span></p>
<p><span style="color: #ff0000;">upper.bound <- (+1.96*std.err) + M</span></p>
<p><span style="color: #ff0000;">upper.bound</span></p>
<p><span style="color: #0000ff;">[1] 108.5602</span></p>
<p>Then, we would run a one sample t-test using our random sample of X.Town adults' WAIS scores, comparing <strong>the mean</strong> of the sample scores (<em>M</em>; as our best estimate of the entire X.Town's mean) to the mean of the U.S. population (mu<em>:</em> µ); using the standard error of the mean (<em>std.err</em>) and some pre-designated probability cutoff (e.g. 0.05) to determine statistical significance.</p>
<p><span style="color: #ff0000;">t.test(wais.sample, alternative = 'greater', mu = 100, conf.level = .95)</span></p>
<p> <span style="color: #0000ff;">One Sample t-test</span></p>
<p><span style="color: #0000ff;">data: wais.sample</span></p>
<p><span style="color: #0000ff;">t = 17.0653, df = 999, p-value < 2.2e-16</span></p>
<p><span style="color: #0000ff;">alternative hypothesis: true mean is greater than 100</span></p>
<p><span style="color: #0000ff;">95 percent confidence interval:</span></p>
<p><span style="color: #0000ff;"> 106.8944 Inf</span></p>
<p><span style="color: #0000ff;">sample estimates:</span></p>
<p><span style="color: #0000ff;">mean of x</span></p>
<p><span style="color: #0000ff;"> 107.6305</span></p>
<p>It is important to recall (or review) what the above test is doing. We have drawn a random sample of data from X.Town and we are testing <strong>the mean</strong> of that sample against a known (U.S.) population mean to determine if the sample indeed comes from that population (i.e. the null hypothesis). Notice we are using the sample mean (<em>n</em> = 1000) as a representation of the entire X.Town's WAIS scores (<em>N</em> = 10000).</p>
<h4><strong>Bayesian Perspective: Bayesian Statistics; Bayesian Inference; Bayesian Parameter Estimation.</strong></h4>
<p>All three of the above terms are often used to refer to Bayesian data analysis. The examples below were all adapted from Kaplan (2014). Our example explores the normal prior for the normal sampling model in which the variance σ² (sigma squared) is assumed to be known. Thus, the problem is one of estimating <strong>the mean</strong> µ (mu). Let <em>y</em> denote a data vector of size <em>n </em>(<em>y</em> = the sample of 1000 WAIS scores). We assume that <em>y</em> follows a normal distribution shown with the equation below:</p>
<p style="text-align: center;"> <em>p</em>(<em>y</em>|µ, σ²) = (1/sqrt(2*π*σ²)) * exp(-((<em>y</em> - µ)²) / (2*σ²))</p>
<p>To clarify and show an example in R, we use the following:</p>
<p><span style="color: #ff0000;">mu <- 100</span></p>
<p><span style="color: #ff0000;">o <- 15</span></p>
<p><span style="color: #ff0000;">y <- wais.sample</span></p>
<p>We use the word ‘output’ to refer to <em>p</em>(<em>y</em>|µ, σ²) from above; which is read as the probability of <em>y</em>, given a mean of mu (µ), and variance of sigma squared (σ²).</p>
<p><span style="color: #ff0000;">output <- (1/sqrt(2*pi*o^2)) * exp(-((y - mu)^2) / (2*o^2))</span></p>
<p><span style="color: #ff0000;">summary(output)</span></p>
<p> <span style="color: #0000ff;">Min. 1st Qu. Median Mean 3rd Qu. Max.</span></p>
<p><span style="color: #0000ff;">7.462e-05 1.230e-02 2.029e-02 1.799e-02 2.488e-02 2.660e-02</span></p>
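As a cross-check, the normal density written out by hand should agree with R's built-in dnorm(); a quick sketch with a few made-up scores:

```r
mu <- 100
o <- 15
y <- c(85, 100, 115, 130)  # a few hypothetical WAIS scores

# Hand-coded normal density versus R's built-in version
by.hand <- (1/sqrt(2*pi*o^2)) * exp(-((y - mu)^2) / (2*o^2))
all.equal(by.hand, dnorm(y, mean = mu, sd = o))  # TRUE
```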
<p>Next, we specify the prior. We have plenty of confidence that our prior distribution of the mean is normal with its own mean and variance hyper-parameters, <em>k</em> and <em>t</em>² (using <em>t</em> in R code to refer to tau: τ), respectively, which for this example are known. The prior distribution can be written as: </p>
<p align="center"><em>p</em>(µ|<em>k</em>,<em>t</em>²) = (1/sqrt(2*π*<em>t</em>²)) * exp(-((µ - <em>k</em>)²) / (2*<em>t</em>²))</p>
<p>The term, <em>p</em>(µ|<em>k</em>,<em>t</em>²), can be read as the probability of µ given <em>k</em> and <em>t</em>².</p>
<p><span style="color: #ff0000;">k <- mean(y); k</span></p>
<p><span style="color: #0000ff;">[1] 107.6305</span></p>
<p><span style="color: #ff0000;">t <- sd(y); t</span></p>
<p><span style="color: #0000ff;">[1] 14.13976</span></p>
<p><span style="color: #ff0000;">n <- length(y); n</span></p>
<p><span style="color: #0000ff;">[1] 1000</span></p>
<p><span style="color: #ff0000;">prior.mean <- (1/sqrt(2*pi*t^2)) * exp(-((mu - k)^2) / (2*t^2))</span></p>
<p><span style="color: #ff0000;">prior.mean</span></p>
<p><span style="color: #0000ff;">[1] 0.02439102</span></p>
<p>Combine the prior information with the likelihood of the data (given the population variance, sigma squared [σ²], and the sample size [<em>n</em>]) to create the posterior distribution. Using some algebra, the posterior distribution can be obtained as: </p>
<p align="center"><em>p</em>(µ|<em>y</em>)~<em>N</em>[ ((<em>k</em>/<em>t</em>²)+(<em>n</em>*mean(<em>y</em>)/σ²)) / ((1/<em>t</em>²)+(<em>n</em>/σ²)), (<em>t</em>²*σ²)/(σ²+(<em>n</em>*<em>t</em>²)) ]</p>
<p>Thus, the posterior distribution of mu (µ) is normal with a mean:</p>
<p><span style="color: #ff0000;">posterior.mu <- ((k/t^2)+(n*mean(y)/o^2)) / ((1/t^2)+(n/o^2))</span></p>
<p><span style="color: #ff0000;">posterior.mu</span></p>
<p><span style="color: #0000ff;">[1] 107.6305</span></p>
<p>and variance:</p>
<p><span style="color: #ff0000;">posterior.o2 = (t^2*o^2)/(o^2+(n*t^2))</span></p>
<p><span style="color: #ff0000;">posterior.o2</span></p>
<p><span style="color: #0000ff;">[1] 0.2247471</span></p>
<p>So, the posterior distribution can be simulated using these two parameters (and <em>n </em>= 1000); which in R, should be:</p>
<p><span style="color: #ff0000;">posterior <- rnorm(n = length(y), mean = posterior.mu,</span></p>
<p><span style="color: #ff0000;"> sd = sqrt(posterior.o2))</span></p>
<p><span style="color: #ff0000;">hist(posterior)</span></p>
<p> <img src="/benchmarks/sites/default/files/EB_004.png" alt="Histogram of Posterior" width="454" height="480" /></p>
<p>In a traditional frequentist analysis, one would report both the estimated mean (i.e., the sample mean) and a confidence interval with lower and upper bounds for that mean. However, a frequentist confidence interval only tells us that if the same study were repeated many times, with a 95% confidence interval computed each time, roughly 95 of every 100 such intervals would contain the true population mean. It <strong>does not</strong> give the probability that the population parameter falls within any one computed interval. In the Bayesian setting, we instead take quantiles of the posterior distribution to obtain the bounds of a <em>credible interval</em>, which does give the probability that the population parameter lies within that interval. Here the 5th and 95th percentiles yield a 90% credible interval.</p>
<p><span style="color: #ff0000;">quantile(posterior, c(.05,.95))</span></p>
<p> <span style="color: #0000ff;">5% 95%</span></p>
<p><span style="color: #0000ff;">106.8662 108.4625</span></p>
<p>It is critically important to recognize that the above example is <strong>only</strong> interested in estimating the mean of X.Town's WAIS scores; it is NOT attempting to estimate X.Town's entire distribution of WAIS scores. So let's compare the actual mean of X.Town's WAIS scores to the sample mean and to the mean of the posterior distribution (of course, in a real research situation you would not have the 'actual' parameter, i.e. the mean of the entire population of X.Town).</p>
<p><span style="color: #ff0000;">mean(x.town.df$wais)</span></p>
<p><span style="color: #0000ff;">[1] 107.8662</span></p>
<p><span style="color: #ff0000;">mean(wais.sample)</span></p>
<p><span style="color: #0000ff;">[1] 107.6305</span></p>
<p><span style="color: #ff0000;">mean(posterior)</span></p>
<p><span style="color: #0000ff;">[1] 107.6389</span> </p>
<p>Undoubtedly, readers will notice the virtually identical estimates provided by the mean of the posterior (i.e. the Bayesian estimate) and the mean of the sample (i.e. the frequentist estimate); and both of those are very, very close to the X.Town population mean. There are two very important reasons for this. First, Bayesian and frequentist methods will produce virtually the same parameter estimate(s) with large samples: the prior is weighted very lightly, and the likelihood (a representation of the data at hand) contributes the bulk of the weight to the estimation when large samples are used in a Bayesian analysis. Second, the data used in the examples above were simulated, and a truly random sample (<em>n </em>= 1000) was taken from the entire population (<em>N = </em>10000). Our results therefore have very low bias, both because the sample was truly random and because 10% of the population was contained in the sample. Most research is not conducted on a truly random sample, and very few research endeavors include 10% of the population in the sample.</p>
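<p>The first point can be made concrete by rewriting the posterior mean as a weighted average of the prior mean <em>k</em> and the sample mean: the weight on the prior is (1/<em>t</em>²) / (1/<em>t</em>² + <em>n</em>/σ²), which shrinks toward zero as <em>n</em> grows. A short Python sketch, with hypothetical values standing in for τ and σ:</p>

```python
# Weight on the prior mean k in the conjugate-normal posterior mean:
#   posterior mean = w * k + (1 - w) * sample mean,
#   where w = (1 / t^2) / (1 / t^2 + n / o^2).
# Hypothetical hyper-parameter values standing in for the article's.
t2 = 14.14**2   # tau^2, the prior variance
o2 = 15.0**2    # sigma^2, the known population variance

def prior_weight(n, t2=t2, o2=o2):
    """Fraction of the posterior mean contributed by the prior mean."""
    return (1 / t2) / (1 / t2 + n / o2)

weights = {n: prior_weight(n) for n in (1, 10, 100, 1000)}
```

<p>With a single observation the prior and the data share the weight almost evenly; by <em>n</em> = 1000 the prior contributes roughly a tenth of one percent, which is why the Bayesian and frequentist estimates above are virtually identical.</p>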
<p>Lastly, hypothesis testing and statistical significance are not foreign to the Bayesian perspective. For example, if one were interested in conducting a Bayesian <em>t</em>-test, you would use something called Bayes Factors which has been covered on the RSS <a href="http://www.unt.edu/rss/class/Jon/R_SC/">Do-it-yourself Introduction to R</a> web site and specifically <a href="http://www.unt.edu/rss/class/Jon/R_SC/Module10/BayesFactor.R">here</a> in Module 11. Bayes Factors were also discussed in a previous <a href="http://web3.unt.edu/benchmarks/issues/2011/03/rss-matters">RSS Matters</a> article (<a href="http://www.unt.edu/rss/class/Jon/Benchmarks/BayesFactors_JDS_Mar2011.pdf">Adobe.pdf version</a>). </p>
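<p>The 'BayesFactor' package computes Jeffreys-Zellner-Siow (JZS) Bayes factors for the <em>t</em>-test. As a rough, self-contained illustration of the idea only, the BIC approximation to the Bayes factor for a two-group mean comparison is sketched below in Python; this is an approximation, not the package's method, and the data are simulated for the example.</p>

```python
import numpy as np

rng = np.random.default_rng(7)
g1 = rng.normal(100, 15, size=60)   # simulated scores, group 1
g2 = rng.normal(115, 15, size=60)   # group 2, shifted mean

y = np.concatenate([g1, g2])
n = len(y)

# Null model: one common mean.  Alternative: a separate mean per group.
sse0 = ((y - y.mean())**2).sum()
sse1 = ((g1 - g1.mean())**2).sum() + ((g2 - g2.mean())**2).sum()

# BIC from residual sums of squares; k counts the mean parameters.
bic0 = n * np.log(sse0 / n) + 1 * np.log(n)
bic1 = n * np.log(sse1 / n) + 2 * np.log(n)

# BIC-approximate Bayes factor: values above 1 favor a group difference.
bf10 = float(np.exp((bic0 - bic1) / 2))
```

<p>A true mean difference of one pooled standard deviation, as simulated here, yields strong evidence for the two-mean model over the common-mean model.</p>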
<p>Until next time, “<em>knowledge is freedom and ignorance is slavery.”</em></p>
<p>-- The above quote is attributed to Miles Dewey Davis III (1926 – 1991): <a href="http://www.goodreads.com/author/quotes/54761.Miles_Davis">http://www.goodreads.com/author/quotes/54761.Miles_Davis</a></p>
<p> </p>
<h4 style="text-align: left;" align="center">Highly Recommended Reference</h4>
<p>Kaplan, D. (2014). <em>Bayesian Statistics for the Social Sciences</em>. New York: The Guilford Press. </p>
<h4 style="text-align: left;" align="center">Other Important Resources</h4>
<p>Albert, J. (2007). <em>Bayesian Computation with R.</em> New York: Springer Science + Business Media, LLC.</p>
<p>Berry, D. A. (1996). <em>Statistics: A Bayesian Perspective.</em> Belmont, CA: Wadsworth Publishing Company.</p>
<p>Berry, S. M., Carlin, B. P., Lee, J. J., & Muller, P. (2011). <em>Bayesian Adaptive Methods for Clinical Trials. </em>Boca Raton, FL: Taylor & Francis Group, LLC.</p>
<p>Bolker, B. M. (2008). <em>Ecological Models and Data in R.</em> Princeton, NJ: Princeton University Press.</p>
<p>Bolstad, W. M. (2004). <em>Introduction to Bayesian Statistics.</em> Hoboken NJ: John Wiley & Sons, Inc.</p>
<p>Broemeling, L. D. (2007). <em>Bayesian Biostatistics and Diagnostic Medicine. </em>Boca Raton, FL: Taylor & Francis Group, LLC.</p>
<p>Congdon, P. (2005). <em>Bayesian Models for Categorical Data. </em>West Sussex, UK: John Wiley & Sons, Ltd.</p>
<p>Congdon, P. (2006). <em>Bayesian Statistical Modeling. </em>West Sussex, UK: John Wiley & Sons, Ltd.</p>
<p>Dey, D. K., Ghosh, S., & Mallick, B. K. (2011). <em>Bayesian Modeling in Bioinformatics. </em>Boca Raton, FL: Taylor & Francis Group, LLC.</p>
<p>Efron, B. (1986). Why isn’t everyone a Bayesian? <em>The American Statistician, 40</em>, 1 – 5.</p>
<p>Gelman, A., & Hill, J. (2007). <em>Data Analysis Using Regression and Multilevel/Hierarchical Models. </em>New York: Cambridge University Press.</p>
<p>Gelman, A., & Meng, X. (2004). <em>Applied Bayesian Modeling and Causal Inference from Incomplete-Data Perspectives. </em>West Sussex, UK: John Wiley & Sons, Ltd.</p>
<p>Gelman, A., & Shalizi, C. R. (2013). Philosophy and the practice of Bayesian statistics. <em>British Journal of Mathematical and Statistical Psychology, 66</em>, 8 – 38.</p>
<p>Gelman, A., Carlin, J. B., Stern, H. S., & Rubin, D. B. (2004). <em>Bayesian Data Analysis </em>(2<sup>nd</sup> ed.). Boca Raton, FL: Chapman & Hall/CRC.</p>
<p>Geweke, J. (2005). <em>Contemporary Bayesian Econometrics and Statistics.</em> Hoboken, NJ: John Wiley & Sons, Inc.</p>
<p>Ghosh, J. K., Delampady, M., & Samanta, T. (2006). <em>An Introduction to Bayesian Analysis: Theory and Methods. </em>New York: Springer Science + Business Media, LLC.</p>
<p>Hoff, P. D. (2009). <em>A First Course in Bayesian Statistical Methods.</em> New York: Springer Science + Media, LLC.</p>
<p>Jackman, S. (2009). <em>Bayesian Analysis for the Social Sciences.</em> West Sussex, UK: John Wiley & Sons, Ltd.</p>
<p>Jeffreys, H. (1939). <em>Theory of Probability </em>(1<sup>st</sup> ed.). London: Oxford University Press.</p>
<p>Jeffreys, H. (1948). <em>Theory of Probability </em>(2<sup>nd</sup> ed.). London: Oxford University Press.</p>
<p>Koop, G. (2003). <em>Bayesian Econometrics.</em> Hoboken, NJ: John Wiley & Sons, Inc.</p>
<p>Koop, G., Poirier, D., & Tobias, J. (2007). <em>Bayesian Econometric Methods. </em>New York: Cambridge University Press.</p>
<p>Kruschke, J. K. (2011). <em>Doing Bayesian Data Analysis. </em>Burlington, MA: Academic Press.</p>
<p>Lancaster, T. (2004). <em>An Introduction to Modern Bayesian Econometrics. </em>Malden, MA: Blackwell Publishing.</p>
<p>Lee, P. M. (2004). <em>Bayesian Statistics: An Introduction </em>(3<sup>rd</sup> ed.). New York: Oxford University Press Inc.</p>
<p>Link, W. A., & Barker, R. J. (2010). <em>Bayesian Inference with Ecological Applications. </em>London: Academic Press (Elsevier Ltd.).</p>
<p>Lynch, S. M. (2007). <em>Introduction to Applied Bayesian Statistics and Estimation for Social Scientists. </em>New York: Springer Science + Business Media, LLC.</p>
<p>Mallick, B., Gold, D. L., & Baladandayuthapani, V. (2009). <em>Bayesian Analysis of Gene Expression. </em>West Sussex, UK: John Wiley & Sons, Ltd.</p>
<p>Marin, J., & Robert, C. P. (2007). <em>Bayesian Core: A Practical Approach to Computational Bayesian Statistics. </em>New York: Springer Science + Business Media, LLC.</p>
<p>McGrayne, S. B. (2011). <em>The Theory that Would Not Die: How Bayes’ Rule Cracked the Enigma Code, Hunted Down Russian Submarines, and Emerged Triumphant from Two Centuries of Controversy.</em> New Haven, CT: Yale University Press.</p>
<p>Rossi, P. E., Allenby, G. M., & McCulloch, R. (2005). <em>Bayesian Statistics and Marketing. </em>West Sussex, UK: John Wiley & Sons, Ltd.</p>
<p>Rouder, J. N., Speckman, P. L., Sun, D., & Morey, R. (2009). Bayesian <em>t</em> tests for accepting and rejecting the null hypothesis. <em>Psychonomic Bulletin & Review, 16</em>(2), 225 – 237.</p>
<p>Sorensen, D., & Gianola, D. (2002). <em>Likelihood, Bayesian, and MCMC Methods in Quantitative Genetics. </em>New York: Springer Science + Business Media, LLC.</p>
<p>Stone, L. D. (1975). <em>Theory of Optimal Search. </em>Mathematics in Science and Engineering, Vol. 118. New York: Academic Press, Inc.</p>
<p>Tanner, M. A. (1996). <em>Tools for Statistical Inference: Methods for the Exploration of Posterior Distributions and Likelihood Functions. </em>New York: Springer-Verlag, Inc.</p>
<p>Taroni, F., Bozza, S., Biedermann, A., Garbolino, P., & Aitken, C. (2010). <em>Data Analysis in Forensic Science: A Bayesian Decision Perspective. </em>West Sussex, UK: John Wiley & Sons, Ltd.</p>
<p>Williamson, J. (2010). <em>In Defense of Objective Bayesianism. </em>Oxford, UK: Oxford University Press.</p>
<p>Woodworth, G. G. (2004). <em>Biostatistics: A Bayesian Introduction.</em> Hoboken, NJ: John Wiley & Sons, Inc.</p>
<p><strong style="font-size: 1em;"><span style="font-size: xx-small;">Originally published January 2015 -- Please note that information published in <em>Benchmarks Online</em> is likely to degrade over time, especially links to various Websites. To make sure you have the most current information on a specific topic, it may be best to search the UNT Website - <a href="http://www.unt.edu/">http://www.unt.edu</a> . You can also consult the UNT Helpdesk - <a href="http://www.unt.edu/helpdesk/">http://www.unt.edu/helpdesk/</a>. Questions and comments should be directed to <a href="mailto:benchmarks@unt.edu">benchmarks@unt.edu</a>.</span></strong></p>
http://it.unt.edu/benchmarks/issues/2014/12/rss-matters
<div class="field field-type-date field-field-pubdate">
<div class="field-label">Publication Date: </div>
<div class="field-items">
<div class="field-item odd">
<span class="date-display-single">2014-12</span> </div>
</div>
</div>
<h3 style="text-align: center;"><img src="/benchmarks/sites/default/files/rssunt.gif" alt="RSS Logo" width="506" height="101" /></h3>
<h3><strong>Identifying or Verifying the Number of Factors to Extract using Very Simple Structure.</strong></h3>
<p><em>Link to the last RSS article here: <a href="http://it.unt.edu/benchmarks/issues/2014/11/rss-matters">Statistical Resources (update; version 3).</a></em> <em>-- Ed.</em></p>
<p><strong>By <a href="mailto:Jonathan.Starkweather@unt.edu">Dr. Jon Starkweather</a>, Research and Statistical Support Consultant Team </strong></p>
<p><span style="font-size: medium;"><strong>F</strong></span>actor analysis is perhaps one of the most frequently used analyses. It is versatile and flexible; meaning, it can be applied to a variety of data situations and types, and it can be applied in a variety of ways. However, conducting factor analysis generally requires the data analyst to make several decisions. Analysts often run several factor analyses, even when attempting to <em>confirm</em> an established factor structure, in order to assess the fit of the data to several factor models (e.g. a one-factor model, a two-factor model, a three-factor model, etc.). In the 100 years since Spearman (1904) developed factor analysis, many criteria have been proposed for determining the number of factors to extract (e.g. eigenvalues greater than one, Horn’s [1965] parallel analysis, Cattell’s [1966] scree plot or test, Velicer’s [1976] Minimum Average Partial [MAP] criterion, etc.). Each of these proposed criteria has strengths and weaknesses, and they occasionally conflict with one another, which makes relying on any one criterion a risky proposition. This month’s article demonstrates a very handy method for comparing multiple criteria in the pursuit of choosing to extract the appropriate number of factors during factor analysis. </p>
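<p>One of the criteria just listed, Horn's (1965) parallel analysis, is simple enough to sketch in a few lines: retain factors only while the observed eigenvalues exceed the average eigenvalues obtained from random noise of the same dimensions. The sketch below is Python with simulated two-factor data; in R, the 'psych' package's 'fa.parallel' function does this (and more).</p>

```python
import numpy as np

rng = np.random.default_rng(42)
n_obs, n_items = 500, 10

# Simulate two correlated blocks of items (a two-factor structure).
f1, f2 = rng.normal(size=(2, n_obs))
X = np.empty((n_obs, n_items))
X[:, :5] = f1[:, None] + rng.normal(scale=0.8, size=(n_obs, 5))
X[:, 5:] = f2[:, None] + rng.normal(scale=0.8, size=(n_obs, 5))

# Observed eigenvalues of the item correlation matrix, descending.
obs_eig = np.linalg.eigvalsh(np.corrcoef(X, rowvar=False))[::-1]

# Benchmark: mean eigenvalues from correlation matrices of pure noise.
sims = [np.linalg.eigvalsh(np.corrcoef(
            rng.normal(size=(n_obs, n_items)), rowvar=False))[::-1]
        for _ in range(50)]
rand_eig = np.mean(sims, axis=0)

# Count leading factors whose observed eigenvalue beats the noise benchmark.
n_factors = int(np.argmin(obs_eig > rand_eig))
```

<p>For this simulated structure the first two observed eigenvalues clearly exceed their noise benchmarks and the rest do not, so parallel analysis recovers the two-factor structure.</p>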
<p>In popular culture it is not uncommon to hear someone say, “There’s an <em>app</em> for that.” The phrase generally refers to the idea that an <em>application</em> exists (for a smart phone) which does the task being discussed. Likewise, here at RSS we very frequently find “There’s a <em>pack</em> for that.” This phrase refers to the virtual certainty of finding an R <em>package</em> with a function devoted to some analysis or technique we are discussing. The primary package we will be using here is the ‘psych’ package (Revelle, 2014), which contains a great many useful functions and as a result is very often <em>the</em> package we end up using for a variety of analyses. It has grown substantially over the last few years; if you have not taken a look at it recently, you might want to check it out. </p>
<p>Our examples below will actually require two packages, the ‘psych’ package and the ‘GPArotation’ package (Bernaards & Jennrich, 2014). The ‘GPArotation’ package should be familiar to anyone with experience doing factor analysis – it provides functions for several rotation strategies. The primary function we demonstrate below is the ‘vss’ function from the ‘psych’ package. The <em>Very Simple Structure </em>(VSS; Revelle & Rocklin, 1979) function provides a nice output of criteria for varying levels of factor model complexity (i.e. number of factors to extract). The Very Simple Structure (VSS) terminology refers to the idea that all loadings which are less than the maximum loading (of an item to a factor) are suppressed to zero – thus forcing a particular factor model to be much more interpretable, or more clearly distinguished. Then, the fit of several models of increasing rank complexity (i.e. more and more factors specified) can be assessed using the residual matrix of each model (i.e. the original matrix minus the reproduced matrix of each model). We will also be using both the ‘fa’ function (from the ‘psych’ package) and the ‘factanal’ function (from the ‘stats’ package – included with all installations of R) to fit factor analysis models to the data structures. </p>
<h4><strong>Examples</strong></h4>
<p>The first two examples used here can easily be duplicated using the scripts provided below (i.e. the data file is available at the URL in the script / screen capture image). The third example is the example contained in the help file of the ‘vss’ function and can be accessed using the script below. First, load the two packages we will be using.</p>
<p> <img src="/benchmarks/sites/default/files/VSS_001.png" alt="Load packages" width="640" height="287" /></p>
<p>Next, we will import the comma delimited text (.txt) file from the RSS server using the URL and file name (vss_df.txt) contained in the script / image below. We also run a simple ‘summary’ on the data frame to make sure it was imported correctly.</p>
<p><img src="/benchmarks/sites/default/files/VSS_002.png" alt="Import the comma delimited text (.txt) file from the RSS server " width="475" height="480" /> </p>
<p>The simulated data includes a sample identification number for each participant (s.id), a grouping variable (group 1 or group 2), age of each participant (age in years), sex of each participant (female or male), class standing of each participant (freshman, sophomore, junior, or senior), and 30 item scores. Next, we will identify which participants belong to group 1 and which belong to group 2; as well as the number of participants in each group.</p>
<p><img src="/benchmarks/sites/default/files/VSS_003.png" alt="Identify group participants " width="640" height="98" /> </p>
<p>So, we have 418 participants in group 1 and 982 participants in group 2. Generally when analysts intend to do factor analysis they have an idea of how many factors they believe the appropriate factor model contains; and often they have an idea of whether an orthogonal or oblique rotation strategy is warranted. For this first example (i.e. group 1) looking at the 30 item scores (i.e. columns 6 through 35), we believe there are two factors and therefore; we specify 3 factors (<em>n</em> = 3) in the ‘vss’ function. We also believe the factors are likely to be meaningfully related and consequently, we specify an oblimin rotation strategy. Next, we apply the ‘vss’ function to group 1. Also note, we specified Maximum Likelihood Estimation as the Factor Method (fm = “mle”) because this is the method used by default with the ‘factanal’ (i.e. factor analysis) function of the ‘stats’ package. We specified the number of observations (i.e. number of rows, cases, or participants) using the length of the group 1 vector (g1). Recall from above, the group 1 vector contains the row numbers of all the participants from group 1.</p>
<p> <img src="/benchmarks/sites/default/files/VSS_004.png" alt="“Very Simple Structure” table" width="640" height="300" /></p>
<p>The first few rows of output (i.e. “Very Simple Structure” table) show the function called and the <em>maximum</em> complexity values. This is a good example because the VSS complexity rows are conflicting; VSS complexity 1 shows a 2-factor model is best while VSS complexity 2 indicates a 3-factor model is best. The VSS complexity 2 is a bit misleading because both the 2-factor model and 3-factor model display a VSS complexity 2 of 0.80; as can be seen in the first column of output under the “Statistics by number of factors” table. So, in fact both complexity 1 and complexity 2 are in agreement. Furthermore, the Velicer MAP <em>minimum</em> is reached with the 2-factor model; which can also be seen in the third column of the “Statistics by number of factors” table. The Bayesian Information Criterion (BIC) <em>minimum</em> is reached with the 2-factor model; as well as the Sample Size adjusted BIC (SABIC) – shown in columns 10 and 11 respectively of the “Statistics by number of factors” table. The ‘vss’ function also produces a plot (by default) which shows the number of factors on the x-axis and the VSS (complexity) Fit along the y-axis with lines and numbers in the Cartesian plane representing the (3) different factor models (see below).</p>
<p><img style="display: block; margin-left: auto; margin-right: auto;" src="/benchmarks/sites/default/files/VSSChart.png" alt="“Very Simple Structure” chart" width="453" height="480" /> </p>
<p>To interpret the graph, focus on the model (1, 2, or 3 factor models) which has the highest line (and numerals) in relation to the y-axis; but also note any transitions of the model lines. In this example, the transitions are all very nearly flat but a later example will better demonstrate the utility of this type of plot.</p>
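<p>The BIC comparison in the “Statistics by number of factors” table amounts to a simple rule: compute BIC = χ² − <em>df</em> · ln(<em>N</em>) for each model (the form the ‘psych’ package's documentation describes) and prefer the minimum. A Python sketch with hypothetical chi-square and degrees-of-freedom values, not the ones from the output above:</p>

```python
import math

N = 418  # group-1 sample size from the example

# Hypothetical (chi-square, df) pairs for 1-, 2-, and 3-factor models.
models = {1: (900.0, 405), 2: (420.0, 376), 3: (400.0, 348)}

# BIC = chi-square - df * ln(N); smaller (more negative) is better.
bic = {k: chi2 - df * math.log(N) for k, (chi2, df) in models.items()}
best = min(bic, key=bic.get)
```

<p>With these illustrative values the 2-factor model attains the minimum BIC, mirroring the conclusion drawn from the table above.</p>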
<p>Next, we can verify the fit of our 2-factor model using either the ‘fa’ function (from the ‘psych’ package) and / or the ‘factanal’ function (of the ‘stats’ package).</p>
<p><img src="/benchmarks/sites/default/files/VSS_005_1.png" alt="Verify the fit of our 2-factor model" width="475" height="480" /> </p>
<p>*Note: the last few lines of output from the ‘fa’ function are cut off (i.e. not shown).</p>
<p><img src="/benchmarks/sites/default/files/VSS_006_0.png" alt="Verify the fit of our 2-factor model " width="476" height="480" /></p>
<p>*Note: last few lines of output from the ‘factanal’ function are cut off (i.e. not shown).</p>
<p>We will now assess the group 2 (g2) data. This group is believed to be best served with a 3-factor model; so we specify 4 factors (<em>n </em>= 4) in the ‘vss’ function call; again with the factor method set to Maximum Likelihood Estimation (fm = “mle”) and an oblique rotation strategy (rotate = “oblimin”).</p>
<p><img src="/benchmarks/sites/default/files/VSS_007.png" alt="VSS 3-factor model supported" width="640" height="267" /> </p>
<p>In this example all of the indices in the top table (“Very Simple Structure”) are in agreement; although both VSS complexity metrics display the same <em>maximum</em> for a 3-factor model and a 4-factor model. Looking at the first two columns of the “Statistics by number of factors” table shows the identical complexity <em>maximums</em> (0.84) for both the 3-factor model (row 3) and the 4-factor model (row 4) with both complexities 1 and 2 (columns 1 and 2). But, given the other indices’ agreement in support of the 3-factor model, that is the most appropriate model. The plot (below) reinforces the interpretation of the tabular output above.</p>
<p><img style="display: block; margin-left: auto; margin-right: auto;" src="/benchmarks/sites/default/files/VSSPlot.png" alt="VSS Plot" width="453" height="480" /></p>
<p>The plot (above) shows that the 3-factor model is meaningfully better than the 1-factor or 2-factor models and the 4-factor model does not show any improvement over the 3-factor model – which is evident because the number 4 in the plot is not [further] above the line associated with the 3-factor model (i.e. no gain or transition upward; as is the case from 1-factor to 2-factors and to 3-factors). Therefore, we fit the 3-factor model to our data using the ‘fa’ function (of the ‘psych’ package) and / or the ‘factanal’ function of the ‘stats’ package.</p>
<p><img src="/benchmarks/sites/default/files/VSS_008_0.png" alt="Fit the 3-factor model to our data using the ‘fa’ function " width="474" height="480" /> </p>
<p>*Note: the last few lines of output from the ‘fa’ function are cut off (i.e. not shown).</p>
<p><img src="/benchmarks/sites/default/files/VSS_009.png" alt="Fit the 3-factor model to our data using the ‘fa’ function " width="475" height="480" /> </p>
<p>*Note: last few lines of output from the ‘factanal’ function are cut off (i.e. not shown).</p>
<p>The next example is straight from the help file of the ‘vss’ function and is discussed here because it demonstrates a situation when the tables of output from the ‘vss’ function are not in agreement. When this situation occurs, one must rely upon the plot produced by the ‘vss’ function rather than the textual output. First, open the help file (here the plain text version is shown).</p>
<p><img src="/benchmarks/sites/default/files/VSS_012.png" alt="Open the help file " width="619" height="480" /> </p>
<p>Next, scroll to the bottom of the help file and copy / paste the relevant lines of script into the R console.</p>
<p><img src="/benchmarks/sites/default/files/VSS_013.png" alt="Scroll to the bottom of the help file and copy / paste the relevant lines of script into the R console. " width="640" height="330" /> </p>
<p>As mentioned previously, the tables of statistics do not provide a clear answer to the question of which factor model is best (i.e. how many factors should be extracted). However, if we review the associated plot, we can clearly see the 4-factor model is the best (i.e. highest; even when embedded within models with more than 4 factors, with good separation from previous models).</p>
<p><img style="display: block; margin-left: auto; margin-right: auto;" src="/benchmarks/sites/default/files/VSS_014.png" alt="Review the associated plot" width="452" height="480" /> </p>
<h4> <strong>Conclusions</strong></h4>
<p>The intent of this article was to raise awareness of the dangers of using only one criterion or method for deciding upon the number of factors to extract when conducting factor analysis. This article also demonstrated the ease with which an analyst can compute and evaluate several such criteria to reach a more informed decision. More extensive examples of the data analysis solutions are available at the RSS <a href="http://www.unt.edu/rss/class/Jon/R_SC/">Do-it-yourself Introduction to R</a> course page. Lastly, a copy of the script file used for the above examples is available <a href="http://www.unt.edu/rss/class/Jon/Benchmarks/VerySimpleStructure.R">here</a>.</p>
<p>Until next time; remember what George Carlin said: <em>“just ‘cause you got the monkey off your back doesn’t mean the circus left town</em>.”</p>
<h4 style="text-align: left;" align="center"><strong>References / Resources</strong></h4>
<p>Bernaards, C., & Jennrich, R. (2014). The ‘GPArotation’ package. Documentation available at <a href="http://cran.r-project.org/web/packages/GPArotation/index.html">CRAN</a>; the package <a href="http://cran.r-project.org/web/packages/GPArotation/GPArotation.pdf">manual</a> and the package <a href="http://cran.r-project.org/web/packages/GPArotation/vignettes/Guide.pdf">vignette</a>.</p>
<p>Carlin, G. (1937 – 2008). <em>Just One-Liners</em>. <a href="http://www.just-one-liners.com/ppl/george-carlin">http://www.just-one-liners.com/ppl/george-carlin</a></p>
<p>Cattell, R. B. (1966). The scree test for the number of factors. <em>Multivariate Behavioral </em><em>Research, 1</em>(2), 245 – 276.</p>
<p>Horn, J. (1965). A rationale and test for the number of factors in factor analysis. <em>Psychometrika, 30</em>(2), 179 – 185.</p>
<p>Horn, J. L., & Engstrom, R. (1979). Cattell's scree test in relation to bartlett's chi-square test and other observations on the number of factors problem. <em>Multivariate Behavioral </em><em>Research, 14</em>(3), 283 – 300.</p>
<p>McDonald, R. P. (1999). <em>Test Theory: A Unified Treatment.</em> Mahwah, NJ: Erlbaum.</p>
<p>Pearson, K. (1901). On lines and planes of closest fit to systems of points in space. <em>Philosophical Magazine, 2</em>, 559 – 572.</p>
<p>Revelle, W. (2014). The ‘psych’ package. Documentation available at <a href="http://cran.r-project.org/web/packages/psych/index.html">CRAN</a>; the package <a href="http://cran.r-project.org/web/packages/psych/psych.pdf">manual</a> and the package <a href="http://cran.r-project.org/web/packages/psych/vignettes/overview.pdf">vignette</a>.</p>
<p>Revelle, W., & Rocklin, T. (1979). Very simple structure: An alternative procedure for estimating the optimal number of interpretable factors. <em>Multivariate Behavioral Research, 14</em>, 403 – 414. Available at: <a href="http://personality-project.org/revelle/publications/vss.pdf">http://personality-project.org/revelle/publications/vss.pdf</a></p>
<p>Spearman, C. (1904). General Intelligence: Objectively Determined and Measured. <em>American Journal of Psychology, 15</em>, 201 – 292.</p>
<p>Statistics Canada. (2010). <em>Survey Methods and Practices</em>. Ottawa, Canada: Minister of Industry. <a href="http://www.statcan.gc.ca/bsolc/olc-cel/olc-cel?lang=eng&catno=12-587-X">http://www.statcan.gc.ca/bsolc/olc-cel/olc-cel?lang=eng&catno=12-587-X</a></p>
<p>Thompson, B. (2004). <em>Exploratory and confirmatory factor analysis: Understanding concepts and applications</em>. Washington, DC: American Psychological Association.</p>
<p>Velicer, W. (1976). Determining the number of components from the matrix of partial correlations. <em>Psychometrika, 41</em>(3), 321 – 327.</p>
<p><strong style="font-size: 1em;"><span style="font-size: xx-small;">Originally published December 2014 -- Please note that information published in <em>Benchmarks Online</em> is likely to degrade over time, especially links to various Websites. To make sure you have the most current information on a specific topic, it may be best to search the UNT Website - <a href="http://www.unt.edu/">http://www.unt.edu</a> . You can also consult the UNT Helpdesk - <a href="http://www.unt.edu/helpdesk/">http://www.unt.edu/helpdesk/</a>. Questions and comments should be directed to <a href="mailto:benchmarks@unt.edu">benchmarks@unt.edu</a>.</span></strong></p>
http://it.unt.edu/benchmarks/issues/2014/11/rss-matters
<div class="field field-type-date field-field-pubdate">
<div class="field-label">Publication Date: </div>
<div class="field-items">
<div class="field-item odd">
<span class="date-display-single">2014-11</span> </div>
</div>
</div>
<h3 style="text-align: center;"><img src="/benchmarks/sites/default/files/rssunt.gif" alt="RSS Logo" width="506" height="101" /></h3>
<h3><strong>Statistical Resources (update; version 3).</strong></h3>
<p><em>Link to the last RSS article here:<a href="http://it.unt.edu/benchmarks/issues/2014/10/rss-matters"> BOOtstrapping the Generalized Linear Model.</a></em> <em>-- Ed.</em></p>
<p><strong>By <a href="mailto:Jonathan.Starkweather@unt.edu">Dr. Jon Starkweather</a>, Research and Statistical Support Consultant Team </strong></p>
<p><span style="font-size: medium;"><strong>T</strong></span>his month’s article first appeared in November of 2011, but periodically it is necessary to update it with more current resources. The original article was motivated by a Research and Statistical Support (RSS) workshop given for graduate students and contains much the same content as was presented in that workshop: Statistical Resources. The following materials are, for the most part, freely available through the World Wide Web. The resources mentioned below fall, generally, into three categories: the resources we at RSS maintain, the resources available to UNT community members, and resources available to the general public with access to the web.</p>
<h4><strong>RSS Resources</strong></h4>
<p>The main <a href="http://www.unt.edu/rss/">RSS website</a> offers several resources, both specific resources aimed at particular software and more general resources (e.g., <a href="http://www.unt.edu/ACS/datamanage.htm">Data Management Services</a>). One of the key resources available to members of the UNT community is the opportunity to set up a <a href="http://www.unt.edu/rss/Consulting.htm">consulting</a> appointment with RSS staff. The <a href="https://untsitsm.saasit.com/Login.aspx?ProviderName=UNT&Scope=SelfService&role=SelfService&CommandId=SearchOffering&SearchString=rss_apt">link</a> to contact RSS staff for consultation is prominently displayed on each of the pages associated with RSS. The link guides clients to a web interface, known as the Front Range system, which forwards the service request to RSS staff, who then contact the requestor directly (generally through email). Please, read the frequently asked questions (<a href="http://www.unt.edu/rss/FAQ.htm">FAQ</a>) prior to submitting a Front Range request. It is also important to note that RSS staff maintains a rather extensive collection of digital and paper copies of articles, book chapters and whole books. RSS staff members often lend copies of these (in whole or part) to clients so clients can research various analytic or methodological concepts to their own satisfaction (and often the satisfaction of their colleagues, advisors, or committees, etc.).</p>
<p>A second frequently used resource RSS offers consists of the <a href="http://www.unt.edu/rss/Instructional.htm">instructional</a> services for RSS supported software. These were initially short courses offered in a classroom twice per semester; however, they have been migrated to the online format so that they may reach a wider audience and allow self-paced learning. These pages were designed to show how a particular software package can be used (e.g., <a href="http://www.unt.edu/rss/class/Jon/R_SC/">R</a>, <a href="http://www.unt.edu/rss/class/Jon/SPSS_SC/">SPSS</a>, <a href="http://www.unt.edu/rss/class/Jon/SAS_SC/">SAS</a>), they are not designed to teach statistics or how to interpret statistics (although some interpretation is offered among the many pages). In fact, some of the software supported by RSS is not directly related to statistics (e.g., <a href="http://www.unt.edu/rss/SURVEYclasslinks.html">survey technology</a> such as <a href="http://www.unt.edu/rss/class/survey/QSurvey.html">Zope and QSurvey</a>). On each of the R, SPSS, SAS short course pages you will also find links to resources specific to those software packages; from user manuals provided by the software producer (e.g., <a href="http://www.unt.edu/rss/class/Jon/SPSS_SC/Manuals/SPSS_Manuals.htm">SPSS Manuals</a>, <a href="http://cran.r-project.org/web/views/">CRAN Task Views</a>) to other users’ user guides or websites (e.g. <a href="http://www.statmethods.net/">Quick-R</a>, <a href="http://lists.mcgill.ca/archives/stat-l.html">STAT-L</a>). There is even an R specific search engine available called, <a href="http://www.rseek.org/">RSeek</a>.</p>
<p>Another resource RSS offers is displayed right here: the contributions by RSS staff to the <a href="http://web3.unt.edu/benchmarks/"><em>Benchmarks</em></a> online publication in the <em>RSS Matters</em> column. Each article in the <em>RSS Matters</em> column is linked to the previous article, and an <a href="http://www.unt.edu/rss/rssmattersindex.htm">index of <em>RSS Matters</em></a> articles is maintained on the RSS website. The index is quite handy for finding particular topics (e.g., canonical correlation), rather than clicking back through the years of articles available through the column links.</p>
<p>RSS has recently introduced a new service for instructors at UNT in which we can provide a randomly sampled data set from a fictional population named <a href="http://www.unt.edu/rss/class/Jon/Example_Sim/">Examplonia</a>. Examplonia is a fictional country which provides a meaningful context for statistical analysis examples: its population data were generated to serve as a statistical population from which random samples can be drawn for example statistical analysis problems. The current version of the Examplonia population contains a variety of univariate, bivariate, and multivariate effects, including random effects based on hierarchical structure. If you are an instructor for a statistics course, you may be interested in obtaining some simulated data for your class (i.e., data for in-class demonstrations, homework assignments, etc.). Learn more about the population by visiting the <a href="http://www.unt.edu/rss/class/Jon/Example_Sim/">Examplonia</a> webpage.</p>
<p>RSS has also implemented some new services this year, all of which are focused on making software available to researchers through a web browser and relieving them of the need to download and install software. This means <a href="http://www.sagemath.org/">Sage Mathematics</a> and <a href="http://www.rstudio.org/">RStudio</a>, along with the other services, can be accessed through a web browser. Sage Mathematics is mathematical computing software which can integrate the use of <strong>R</strong>. A brief introduction can be found at the Sage link above. RStudio is an integrated development environment for running the <strong>R</strong> statistical package. A brief introduction can be found <a href="http://web3.unt.edu/benchmarks/issues/2012/05/rss-matters">here</a>. Another new service is called <a href="http://rss.unt.edu:8083/tiki-index.php">Tiki Wiki</a>, an open source, freely available content management system (CMS). More information can be found <a href="https://info.tiki.org/">here</a>. The final new service introduced this year is called <a href="http://rss.unt.edu:8082/">Galaxy Server</a>, also open source and freely available. Galaxy is “a web-based platform for data intensive biomedical research” (for more information, see: <a href="https://usegalaxy.org/">here</a>). These servers/services are available to faculty and advanced graduate students; however, those interested need to submit a request for an access account for each service. Once a user has set up an account, they can simply visit the servers using their preferred web browser and conduct analyses without having to install the software on their local machines. RSS is also working on implementing a Concerto Server; however, as of this writing we are still learning about it and are not ready to fulfill requests for access yet (more information can be found <a href="http://www.concerto-signage.org/deploy">here</a>).</p>
<h4><strong>Online Statistical Textbooks</strong></h4>
<p>The <a href="http://onlinestatbook.com/rvls/">Rice Virtual Lab in Statistics</a> is a valuable site for anyone interested in learning or teaching some of the basics of traditional (i.e. frequentist) statistics. The site offers several <a href="http://onlinestatbook.com/stat_sim/index.html">animations</a> for understanding concepts which are often difficult for newcomers to statistics (e.g., <a href="http://onlinestatbook.com/stat_sim/sampling_dist/index.html">sampling distribution characteristics</a> & the <a href="http://onlinestatbook.com/stat_sim/normal_approx/index.html">Central Limit Theorem</a>). The Rice Virtual Lab in Statistics also offers a free online introductory statistics textbook (no registration required). The textbook, called <a href="http://davidmlane.com/hyperstat/index.html">HyperStat</a>, contains chapters which cover the usual contents such as describing univariate and bivariate data, elementary probability, the normal distribution, point estimation, interval estimation, null hypothesis testing, statistical power, t-tests, Analysis of Variance (ANOVA), prediction, chi-square, non-parametric tests, and effect size estimates.</p>
<p>Another online repository of statistical resources is the site maintained by Michael Friendly at York University. The <a href="http://www.math.yorku.ca/SCS/StatResource.html">site</a> offers links to resources for a variety of software, tutorials for specific analyses, and sections of links for statistical societies, associations, and academic departments, as well as links to more general computing resources (e.g., using Unix). A similar <a href="http://www.claviusweb.net/statistics.shtml">site</a> listing various statistical resources on the web is maintained by Clay Helberg.</p>
<p><a href="http://www.statsoft.com/textbook/">Statsoft</a>, the company behind the statistical software <a href="http://www.statsoft.com/">Statistica</a>, also offers web surfers a textbook covering a variety of statistical topics. The Statsoft site covers topics ranging from <a href="http://www.statsoft.com/textbook/elementary-statistics-concepts/button/1/">elementary concepts</a>, <a href="http://www.statsoft.com/textbook/basic-statistics/?button=1">basic statistics</a>, and <a href="http://www.statsoft.com/textbook/anova-manova/?button=1">ANOVA/MANOVA</a> to multivariate topics such as <a href="http://www.statsoft.com/textbook/principal-components-factor-analysis/?button=1">principal components and factor analysis</a>, <a href="http://www.statsoft.com/textbook/multidimensional-scaling/?button=2">multidimensional scaling</a>, and <a href="http://www.statsoft.com/textbook/structural-equation-modeling/?button=2">structural equation modeling</a>. Unlike Statnotes, mentioned above, the Statsoft site does not offer software output or interpretation (although graphs and tables are often used). However, one handy feature of the Statsoft site is the interactive glossary; each hyperlinked word sends the user to the definition/entry for that word in the glossary. The Statsoft textbook is also <a href="https://www.statsoft.com/products/statistics-methods-and-applications-book/order/">available</a> in printed form for $80.00 plus shipping.</p>
<h4><strong>Miscellaneous Other Resources</strong></h4>
<p>Another resource option for members of the UNT community, which is often overlooked, is the <a href="http://www.library.unt.edu/">UNT library system</a>. The library’s <a href="http://iii.library.unt.edu/">general catalog</a> contains a monumental collection of resources, from textbooks being used in current courses to books which focus on the statistical analyses used in particular fields and authoritative books devoted to specific types of analysis (e.g., searching “logistic regression” yielded 66 returns). Furthermore, the electronic resources offer access to thousands of periodicals (i.e. journals) from a variety of databases (e.g. EBSCOHost, Medline, ERIC, LexisNexis, & JSTOR). One of the databases most frequently used by RSS staff is JSTOR, which contains many of the most prominent methodological and statistical journals – with almost all articles available (through the UNT portal) in full text (i.e. Adobe.pdf format). Another commonly used resource is the <a href="http://www.jstatsoft.org/">Journal of Statistical Software</a>, which contains articles on a variety of statistical computing applications/software, as well as articles covering statistical methods. One more often-consulted resource is the <em>little green books</em>, which are actually a series published by <a href="http://www.sagepub.com/home.nav">Sage</a>. The <a href="http://www.sagepub.com/productSearch.nav?seriesId=Series486">Quantitative Applications in the Social Sciences</a> series is a collection of thin, soft-covered books, each dealing with a specific research or statistical topic. The UNT library carries approximately 145 of the series’ editions and the RSS staff has collected most of the series as well. There are approximately 170 books in the series, and a typical researcher would be hard pressed not to find something of value among them.</p>
<p>Of course, there are more general resources, such as <a href="http://www.google.com/">Google</a>, <a href="http://www.scholarpedia.org/">Scholarpedia</a>, <a href="http://www.wikipedia.org/">Wikipedia</a>, and even <a href="http://www.youtube.com/watch?v=mL27TAJGlWc">Youtube</a>; all of which can be useful.</p>
<p>Until next time, remember; GIYF – Google is your friend.</p>
<p><strong style="font-size: 1em;"><span style="font-size: xx-small;">Originally published November 2014 -- Please note that information published in <em>Benchmarks Online</em> is likely to degrade over time, especially links to various Websites. To make sure you have the most current information on a specific topic, it may be best to search the UNT Website - <a href="http://www.unt.edu/">http://www.unt.edu</a> . You can also consult the UNT Helpdesk - <a href="http://www.unt.edu/helpdesk/">http://www.unt.edu/helpdesk/</a>. Questions and comments should be directed to <a href="mailto:benchmarks@unt.edu">benchmarks@unt.edu</a>.</span></strong></p>
http://it.unt.edu/benchmarks/issues/2014/10/rss-matters
<div class="field field-type-date field-field-pubdate">
<div class="field-label">Publication Date: </div>
<div class="field-items">
<div class="field-item odd">
<span class="date-display-single">2014-10</span> </div>
</div>
</div>
<h3 style="text-align: center;"><img src="/benchmarks/sites/default/files/rssunt.gif" alt="RSS Logo" width="506" height="101" /></h3>
<h3><em>BOO</em>tstrapping the Generalized Linear Model</h3>
<p><em>Link to the last RSS article here: <a href="http://it.unt.edu/benchmarks/issues/2014/09/rss-matters">Factor Analysis with Binary items: A quick review with examples.</a></em> <em>-- Ed.</em></p>
<p><strong>By <a href="mailto:richard.herrington@unt.edu">Dr. Richard Herrington</a>, Research and Statistical Support Consultant Team </strong></p>
<p><span style="font-size: medium;"><strong>R</strong></span>esearchers do not need to be afraid: the availability of fast computers and public domain software libraries such as R and the R package <em>boot</em> makes forays into <em>bootstrap confidence interval estimation</em> reasonably straightforward. The R package <em>boot</em> was designed to be general enough to allow the data analyst to simulate the empirical sampling distribution of most estimators (and then some), and to calculate corresponding confidence intervals for that estimator. There are a few tricks to learn when using package <em>boot</em>, but once those small hurdles have been navigated, the lessons learned can be applied more generally to other estimation settings.</p>
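<p>Before turning to the glm, the calling convention that <em>boot</em> expects can be sketched with a simpler estimator. The sketch below (the data and function names are illustrative, not from this article) bootstraps a sample mean; the key point is that the statistic function must accept the data and a vector of resampled row indices:</p>

```r
# Minimal sketch of the boot calling convention, bootstrapping a sample
# mean (names here are illustrative)
library(boot)

set.seed(1)
x <- rnorm(100)

# boot requires a function of (data, index); 'index' holds the resampled
# positions for the current bootstrap replicate
mean.fun <- function(data, index) mean(data[index])

boot.mean <- boot(data = x, statistic = mean.fun, R = 999)
boot.ci(boot.mean, conf = 0.95, type = "perc")
```

<p>The glm wrapper function later in this article follows the same (data, index) pattern.</p>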
<p>The R package <em>boot</em> comprises a set of functions that are well documented, both with theory and examples, in the book <em>Bootstrap Methods and Their Application</em> by A.C. Davison and D.V. Hinkley (1997). The purpose of this short note is to demonstrate how to approximate nonparametric confidence intervals, using resampling methods, for the <em>generalized linear model</em> (glm) using the R package <em>boot</em>.</p>
<p>We’ll start off by simulating a data set from the following probability regression model: </p>
<table cellspacing="0" cellpadding="0">
<tbody>
<tr>
<td bgcolor="white" width="411" height="356">
<table cellspacing="0" cellpadding="0">
<tbody>
<tr>
<td>
<div class="shape">
<p>samp.size<-5000</p>
<p> </p>
<p>x1 <- rnorm(samp.size) </p>
<p>x2 <- rnorm(samp.size)</p>
<p>x3 <- rnorm(samp.size)</p>
<p>x4 <- rnorm(samp.size)</p>
<p> </p>
<p># True Model</p>
<p># x0 x1 x2 x3 x4 x1*x2</p>
<p>z <- 1 + 2*x1 + 3*x2 + 4*x3 + 5*x4 + 10*x1*x2 </p>
<p>pr <- 1/(1+exp(-z)) </p>
<p>y <- rbinom(samp.size,1,pr) </p>
<p> </p>
<p>sim.data.df <- data.frame(y=y, x1=x1, x2=x2, x3=x3, x4=x4,<br /> x5=x1*x2)</p>
<pre>> head(sim.data.df)</pre>
<pre> y x1 x2 x3 x4 x5</pre>
<pre>1 0 0.9632201 -1.0871521 -2.0283342 0.5727080 -1.0471668</pre>
<pre>2 0 2.8738768 -1.4818353 0.1265646 1.9195807 -4.2586121</pre>
<pre>3 1 -0.5552309 0.8576629 1.1878977 -0.7940654 -0.4762010</pre>
<pre>4 0 -0.7519217 0.7630796 -0.7534080 -0.6768429 -0.5737761</pre>
<pre>5 0 0.6789053 -1.6454898 0.5337027 -0.9163869 -1.1171318</pre>
<pre>6 0 1.4138792 -0.3052833 1.0388294 -0.9189572 -0.4316337</pre>
<p>.</p>
<p>.</p>
<p>.</p>
</div>
</td>
</tr>
</tbody>
</table>
<br /><br /></td>
</tr>
</tbody>
</table>
<p>Using the R function <em>glm </em>we can estimate the model coefficients using a binomial probability model for the <em>y</em> outcome variable:</p>
<table cellspacing="0" cellpadding="0">
<tbody>
<tr>
<td bgcolor="white" width="531" height="266">
<table style="width: 100%;" cellspacing="0" cellpadding="0">
<tbody>
<tr>
<td>
<div class="shape">
<p>glm.fit<-glm(y~x1+x2+x3+x4+x1*x2,</p>
<p> data=sim.data.df,</p>
<p> family="binomial")</p>
<p>glm.fit</p>
<p> </p>
<pre>> glm.fit</pre>
<pre> </pre>
<pre>Call: glm(formula = y ~ x1 + x2 + x3 + x4 + x1 * x2, family = "binomial", </pre>
<pre> data = sim.data.df)</pre>
<pre> </pre>
<pre>Coefficients:</pre>
<pre>(Intercept) x1 x2 x3 x4 x1:x2 </pre>
<pre> 1.009 1.973 3.101 4.081 5.113 10.144 </pre>
<pre> </pre>
<pre>Degrees of Freedom: 4999 Total (i.e. Null); 4994 Residual</pre>
<pre> Null Deviance: 6910 </pre>
<pre> Residual Deviance: 1265 AIC: 1277</pre>
<p> </p>
</div>
</td>
</tr>
</tbody>
</table>
</td>
</tr>
</tbody>
</table>
<p><br />R function <em>glm </em>does a reasonably good job of recovering the population regression coefficients – although we did use a very large sample size in comparison to the number of variables in the model.</p>
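<p>As a quick check, the population coefficients written into the simulation can be placed beside the glm estimates (a sketch assuming the glm.fit object from above; the exact estimates will vary with the simulated sample):</p>

```r
# True coefficients from the data-generating model, beside the glm
# estimates (assumes glm.fit from the code above)
true.coefs <- c("(Intercept)" = 1, x1 = 2, x2 = 3, x3 = 4, x4 = 5,
                "x1:x2" = 10)
round(cbind(true = true.coefs, estimated = coef(glm.fit)), 3)
```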
<p>R package <em>caret</em> provides a useful helper function for displaying kernel density estimates of the predictors as a function of the two-level outcome variable <em>y</em>:</p>
<p> </p>
<table cellspacing="0" cellpadding="0">
<tbody>
<tr>
<td bgcolor="white" width="431" height="149">
<table style="width: 100%;" cellspacing="0" cellpadding="0">
<tbody>
<tr>
<td>
<div class="shape">
<p>library(caret)</p>
<p>featurePlot(x = sim.data.df[,c(2:6)],</p>
<p> y = as.factor(sim.data.df$y),</p>
<p> plot = "density",</p>
<p> scales = list(x = list(relation="free"),</p>
<p> y = list(relation="free")),</p>
<p> adjust = 1.5,</p>
<p> pch = "|",</p>
<p> layout = c(3, 3),</p>
<p> auto.key = list(columns = 2))</p>
</div>
</td>
</tr>
</tbody>
</table>
</td>
</tr>
</tbody>
</table>
<p>The resulting plot is returned: </p>
<p> <img src="/benchmarks/sites/default/files/graphs.png" alt="Graphic output" width="609" height="423" /></p>
<p>The chosen population coefficients produce a large separation between the two groups (y = 1 vs. y = 0) on the predictor variables. We can calculate the marginal probabilities of the estimated predictors to see how large the average probability change is in moving from a 50% probability of being in group 1 to the estimated probability of being in group 1, given a unit change in the predictors: </p>
<table cellspacing="0" cellpadding="0">
<tbody>
<tr>
<td bgcolor="white" width="504" height="106">
<table style="width: 100%;" cellspacing="0" cellpadding="0">
<tbody>
<tr>
<td>
<div class="shape">
<p>library(arm)</p>
<p>glm.coefs<-coef(glm.fit)</p>
<p>invlogit(glm.coefs) - .50</p>
<pre>(Intercept) x1 x2 x3 x4 x1:x2 </pre>
<pre> 0.2327767 0.3779851 0.4569387 0.4833883 0.4940197 0.4999607 </pre>
<p> </p>
</div>
</td>
</tr>
</tbody>
</table>
</td>
</tr>
</tbody>
</table>
<p>We have chosen very large predictor <em>effect sizes</em> for the simulation. Essentially, predictors <em>x4</em> and <em>x5</em> maximally predict the probability of <em>y=1</em> membership: knowledge of predictors <em>x4</em> and <em>x5</em> moves our predicted marginal probability of <em>y=1</em> from .50 (absent the information from <em>x4</em> and <em>x5</em>) to .99 (given the information provided by <em>x4</em> and <em>x5</em>).</p>
<p>Now on to the bootstrap confidence intervals: first we need to create a wrapper function that will pass the resampled data, and their corresponding indices, to the <em>glm</em> function:</p>
<table cellspacing="0" cellpadding="0">
<tbody>
<tr>
<td bgcolor="white" width="504" height="178">
<table style="width: 100%;" cellspacing="0" cellpadding="0">
<tbody>
<tr>
<td>
<div class="shape">
<p>glm.coefs<-function (dataset, index)</p>
<p>{</p>
<p> sim.data.df<-dataset[index,]</p>
<p> </p>
<p> glm.fit <- try(glm(y~x1+x2+x3+x4+x1*x2,</p>
<p> data=sim.data.df,</p>
<p> family="binomial"), silent = TRUE)</p>
<p> </p>
<p> coefs<-try(coef(glm.fit), silent=TRUE)</p>
<p> print(coefs)</p>
<p> </p>
<p> return(coefs)</p>
<p>}</p>
</div>
</td>
</tr>
</tbody>
</table>
</td>
</tr>
</tbody>
</table>
<p>The vector containing the indices of the resampled rows (<em>index</em>) is used to subset the data before it is passed to the <em>glm</em> function. Our wrapper function for glm (<em>glm.coefs</em>) then returns the estimated coefficients back to the <em>boot</em> function for tabulation and post-processing. Additionally, we have used the <em>try</em> function so that if a resampled data set fails <em>glm</em> estimation, <em>glm.coefs</em> and <em>boot</em> will not break out with an error, but will instead continue with missing values for the coefficients. Lastly, we have put a print statement within the body of glm.coefs so that we can monitor the estimated coefficient values as they are produced.</p>
<p>Our last bit of R script sends the data and glm.coefs function to boot for processing: </p>
<table cellspacing="0" cellpadding="0">
<tbody>
<tr>
<td bgcolor="white" width="567" height="146">
<table style="width: 100%;" cellspacing="0" cellpadding="0">
<tbody>
<tr>
<td>
<div class="shape">
<p>boot.fit<-boot(sim.data.df, glm.coefs, R=1000)</p>
<p>boot.fit</p>
<p> </p>
<p>for(ii in 1:length(boot.fit$t0))</p>
<p> {</p>
<p> cat(rep("\n",5))</p>
<p> print(names(boot.fit$t0[ii]))</p>
<p> cat(rep("\n",2))</p>
<p> print(boot.ci(boot.fit, conf = 0.95, type = c("norm","perc","basic"),index = ii))</p>
<p> }</p>
</div>
</td>
</tr>
</tbody>
</table>
</td>
</tr>
</tbody>
</table>
<p>The for loop in this script isn’t necessary, but is merely a short-cut for printing out the results of three different types of confidence intervals (CI) for the six estimated parameters (the intercept, x1 through x4, and the x1:x2 interaction). Notice that each of the three CI types captures the true population parameter. This is simply a consequence of having used few predictors, a large initial sample size, and 1000 bootstrap samples in the bootstrap CI estimation. </p>
<table cellspacing="0" cellpadding="0">
<tbody>
<tr>
<td bgcolor="white" width="567" height="416">
<table style="width: 100%;" cellspacing="0" cellpadding="0">
<tbody>
<tr>
<td>
<div class="shape">
<pre>> boot.fit</pre>
<pre> </pre>
<pre>ORDINARY NONPARAMETRIC BOOTSTRAP</pre>
<pre> </pre>
<pre> </pre>
<pre>Call:</pre>
<pre>boot(data = sim.data.df, statistic = glm.coefs, R = 1000)</pre>
<pre> </pre>
<pre> </pre>
<pre>Bootstrap Statistics :</pre>
<pre> original bias std. error</pre>
<pre>t1* 1.008756 0.007386088 0.08582566</pre>
<pre>t2* 1.973487 0.011373649 0.12787464</pre>
<pre>t3* 3.101113 0.027926437 0.15442723</pre>
<pre>t4* 4.080900 0.027597606 0.17447659</pre>
<pre>t5* 5.113291 0.036752067 0.21991954</pre>
<pre>t6* 10.144203 0.074247504 0.42935352</pre>
<pre>> for(ii in 1:length(boot.fit$t0))</pre>
<pre>+ {</pre>
<pre>+ cat(rep("\n",5))</pre>
<pre>+ print(names(boot.fit$t0[ii]))</pre>
<pre>+ cat(rep("\n",2))</pre>
<pre>+ print(boot.ci(boot.fit, conf = 0.95, type = c("norm","perc","basic"), index = ii))</pre>
<pre>+ }</pre>
<p> </p>
</div>
</td>
</tr>
</tbody>
</table>
</td>
</tr>
</tbody>
</table>
<p> </p>
<table cellspacing="0" cellpadding="0">
<tbody>
<tr>
<td bgcolor="white" width="567" height="810">
<table style="width: 100%;" cellspacing="0" cellpadding="0">
<tbody>
<tr>
<td>
<div class="shape">
<table style="width: 889px;" border="0" cellspacing="0" cellpadding="0">
<tbody>
<tr>
<td valign="top">
<p> [1] "(Intercept)"</p>
<p> </p>
<p> </p>
<p>BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS</p>
<p>Based on 1000 bootstrap replicates</p>
<p> </p>
<p>CALL :</p>
<p>boot.ci(boot.out = boot.fit, conf = 0.95, type = c("norm", "perc",</p>
<p> "basic"), index = ii)</p>
<p> </p>
<p>Intervals :</p>
<p>Level Normal Basic Percentile </p>
<p>95% ( 0.833, 1.170 ) ( 0.824, 1.164 ) ( 0.854, 1.194 ) </p>
<p>Calculations and Intervals on Original Scale</p>
<p> </p>
<p>[1] "x1"</p>
<p> </p>
<p> </p>
<p>BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS</p>
<p>Based on 1000 bootstrap replicates</p>
<p> </p>
<p>CALL :</p>
<p>boot.ci(boot.out = boot.fit, conf = 0.95, type = c("norm", "perc",</p>
<p> "basic"), index = ii)</p>
<p> </p>
<p>Intervals :</p>
<p>Level Normal Basic Percentile </p>
<p>95% ( 1.711, 2.213 ) ( 1.704, 2.191 ) ( 1.756, 2.243 ) </p>
<p>Calculations and Intervals on Original Scale</p>
<p> </p>
<p> [1] "x2"</p>
<p> </p>
<p> </p>
<p>BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS</p>
<p>Based on 1000 bootstrap replicates</p>
<p> </p>
<p>CALL :</p>
<p>boot.ci(boot.out = boot.fit, conf = 0.95, type = c("norm", "perc",</p>
<p> "basic"), index = ii)</p>
<p> </p>
<p>Intervals :</p>
<p>Level Normal Basic Percentile </p>
<p>95% ( 2.771, 3.376 ) ( 2.731, 3.369 ) ( 2.833, 3.471 ) </p>
<p>Calculations and Intervals on Original Scale</p>
<p> </p>
<p> [1] "x3"</p>
<p> </p>
<p> </p>
<p>BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS</p>
<p>Based on 1000 bootstrap replicates</p>
<p> </p>
<p>CALL :</p>
<p>boot.ci(boot.out = boot.fit, conf = 0.95, type = c("norm", "perc",</p>
<p> "basic"), index = ii)</p>
<p> </p>
<p>Intervals :</p>
<p>Level Normal Basic Percentile </p>
<p>95% ( 3.711, 4.395 ) ( 3.704, 4.369 ) ( 3.793, 4.457 ) </p>
<p>Calculations and Intervals on Original Scale</p>
<p> </p>
<p> </p>
</td>
</tr>
</tbody>
</table>
<p> </p>
</div>
</td>
</tr>
</tbody>
</table>
</td>
</tr>
</tbody>
</table>
<p> </p>
<table cellspacing="0" cellpadding="0">
<tbody>
<tr>
<td bgcolor="white" width="618" height="459">
<table style="width: 100%;" cellspacing="0" cellpadding="0">
<tbody>
<tr>
<td>
<div class="shape">
<pre>[1] "x4"</pre>
<pre> </pre>
<pre> </pre>
<pre>BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS</pre>
<pre>Based on 1000 bootstrap replicates</pre>
<pre> </pre>
<pre>CALL : </pre>
<pre>boot.ci(boot.out = boot.fit, conf = 0.95, type = c("norm", "perc", </pre>
<pre> "basic"), index = ii)</pre>
<pre> </pre>
<pre>Intervals : </pre>
<pre>Level Normal Basic Percentile </pre>
<pre>95% ( 4.646, 5.508 ) ( 4.621, 5.498 ) ( 4.728, 5.606 ) </pre>
<pre>Calculations and Intervals on Original Scale</pre>
<pre> </pre>
<pre>[1] "x1:x2"</pre>
<pre> </pre>
<pre>BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS</pre>
<pre>Based on 1000 bootstrap replicates</pre>
<pre> </pre>
<pre>CALL : </pre>
<pre>boot.ci(boot.out = boot.fit, conf = 0.95, type = c("norm", "perc", </pre>
<pre> "basic"), index = ii)</pre>
<pre> </pre>
<pre>Intervals : </pre>
<pre>Level Normal Basic Percentile </pre>
<pre>95% ( 9.23, 10.91 ) ( 9.15, 10.84 ) ( 9.45, 11.13 ) </pre>
<pre>Calculations and Intervals on Original Scale</pre>
<p> </p>
</div>
</td>
</tr>
</tbody>
</table>
</td>
</tr>
</tbody>
</table>
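<p>For readers who prefer a tidier summary than the printed output above, the percentile intervals can be gathered into a single matrix (a sketch assuming the boot.fit object created earlier):</p>

```r
# Collect the 95% percentile intervals for every coefficient into one
# matrix (assumes boot.fit from the earlier boot() call); boot.ci's
# $percent component stores the lower and upper limits in columns 4-5
ci.tab <- t(sapply(seq_along(boot.fit$t0), function(ii) {
  boot.ci(boot.fit, conf = 0.95, type = "perc", index = ii)$percent[4:5]
}))
dimnames(ci.tab) <- list(names(boot.fit$t0), c("lower", "upper"))
round(ci.tab, 3)
```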
<p><strong style="font-size: 1em;"><span style="font-size: xx-small;">Originally published October 2014 -- Please note that information published in <em>Benchmarks Online</em> is likely to degrade over time, especially links to various Websites. To make sure you have the most current information on a specific topic, it may be best to search the UNT Website - <a href="http://www.unt.edu/">http://www.unt.edu</a> . You can also consult the UNT Helpdesk - <a href="http://www.unt.edu/helpdesk/">http://www.unt.edu/helpdesk/</a>. Questions and comments should be directed to <a href="mailto:benchmarks@unt.edu">benchmarks@unt.edu</a>.</span></strong></p>
http://it.unt.edu/benchmarks/issues/2014/09/rss-matters
<div class="field field-type-date field-field-pubdate">
<div class="field-label">Publication Date: </div>
<div class="field-items">
<div class="field-item odd">
<span class="date-display-single">2014-09</span> </div>
</div>
</div>
<h3 style="text-align: center;"><img src="/benchmarks/sites/default/files/rssunt.gif" alt="RSS Logo" width="506" height="101" /></h3>
<h3><strong>Factor Analysis with Binary items: A quick review with examples.</strong></h3>
<p><em>Link to the last RSS article here: <a href="http://it.unt.edu/benchmarks/issues/2014/08/rss-matters">Call to Create a UNT R Users Group</a>.</em> <em>-- Ed.</em></p>
<p><strong>By <a href="mailto:Jonathan.Starkweather@unt.edu">Dr. Jon Starkweather</a>, Research and Statistical Support Consultant Team </strong></p>
<p><span style="font-size: medium;"><strong>T</strong></span>here have been several clients in recent weeks who have come to us with binary survey data which they would like to factor analyze. The current article was written to provide a simple resource for others who may find themselves in a similar situation.</p>
<p>Of course, our professional conscience requires that we mention at the outset: if you are creating a survey (online, paper & pencil, or any other format) you should create the items and response choices in such a way that the responses may be considered interval or ratio, or at the very least ordinal – not nominal categories and particularly not binary categories. We also feel compelled to advise you against the use of two other types of items. Please do not use any type of contingency or dependent items (e.g. if you answered ‘yes’ to item 6, go to item 6a; if you answered ‘no’ to item 6, please move forward to item 7). Also, please do not use any type of multiple response items (e.g. ‘choose all those which apply’). If you would like more information on why we make the recommendations above, please consult the substantial literature on survey development (e.g. McDonald, 1999; OECD, 2008; Statistics Canada, 2010).</p>
<h4><strong>Examples</strong></h4>
<p>First, import some (simulated) example data. The data used here is available at the URL given in the ‘read.table’ function below. The data contains eight binary items (x1, x2, x3, x4, x5, x6, x7, & x8) with 1000 cases (i.e. rows) which support two orthogonal factors.</p>
<p><span style="color: #ff0000;">df.1 <- read.table(</span></p>
<p><span style="color: #ff0000;"> "http://www.unt.edu/rss/class/Jon/Benchmarks/BinaryDataFA.txt",</span></p>
<p><span style="color: #ff0000;"> header = TRUE, sep = ",", na.strings = "NA", dec = ".",</span></p>
<p><span style="color: #ff0000;"> strip.white = TRUE)</span></p>
<p><span style="color: #ff0000;">head(df.1)</span></p>
<p> <span style="color: #0000ff;">x1 x2 x3 x4 x5 x6 x7 x8</span></p>
<p><span style="color: #0000ff;">1 0 0 0 0 0 0 0 0</span></p>
<p><span style="color: #0000ff;">2 0 1 1 1 0 0 0 0</span></p>
<p><span style="color: #0000ff;">3 0 0 0 0 0 0 0 0</span></p>
<p><span style="color: #0000ff;">4 0 0 0 0 0 0 0 0</span></p>
<p><span style="color: #0000ff;">5 0 0 0 0 0 0 0 0</span></p>
<p><span style="color: #0000ff;">6 0 1 0 0 0 0 0 0</span></p>
<p><span style="color: #ff0000;">nrow(df.1)</span></p>
<p><span style="color: #0000ff;">[1] 1000</span></p>
<p>Notice above, the data is numeric; this is important because if you simply supply this data to a factor analysis function, that function will (by default) calculate the matrix of association assuming those numbers are interval or ratio – which would be incorrect or potentially very biased. Therefore, what is really needed is a way to calculate the correct matrix of association (for the factor analysis) using the appropriate correlation statistic for each pair of variables in our data. Fortunately, the ‘polycor’ package (Fox, 2014) contains a function called ‘hetcor’ for doing just that. The ‘hetcor’ function basically looks at each pair of variables in a data frame and computes the appropriate <em>heterogeneous correlation</em> for each pair based on the type of variables which make up each pair. Recall that with categorical variables, the polychoric correlation is appropriate, and the tetrachoric correlation is a special case of the polychoric correlation (for when both variables being correlated are binary). The ‘hetcor’ function is capable of calculating Pearson correlations (for numeric data), polyserial correlations (for numeric and ordinal data), and polychoric correlations (for ordered or non-ordered factors) – from a single data frame with all of the above mentioned types of variables.</p>
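<p>For a single pair of binary items, the tetrachoric correlation can also be computed directly with the ‘polychor’ function from the same package. The sketch below uses simulated items (not the article’s data): two binary variables are created by dichotomizing latent normal variables that share a common component, so the true latent correlation is 0.5:</p>

```r
# Tetrachoric correlation for one pair of simulated binary items;
# the two latent variables share a common component (latent r = 0.5)
library(polycor)
set.seed(2)
z <- rnorm(500)
a <- factor(as.numeric(z + rnorm(500) > 0))
b <- factor(as.numeric(z + rnorm(500) > 0))
polychor(a, b)  # with two binary variables this is the tetrachoric r
```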
<p>So, because the data is imported as numeric, we must first recode it as factor (i.e. categorical); which can be done very easily using the ‘sapply’ function. There are other packages and functions which allow more precise control over recoding variables; such as the ‘recode’ function in the ‘car’ package (Fox, et al., 2014).</p>
<p><span style="color: #ff0000;">df.2 <- sapply(df.1, as.factor)</span></p>
<p><span style="color: #ff0000;">head(df.2)</span></p>
<p> <span style="color: #0000ff;">x1 x2 x3 x4 x5 x6 x7 x8</span></p>
<p><span style="color: #0000ff;">[1,] "0" "0" "0" "0" "0" "0" "0" "0"</span></p>
<p><span style="color: #0000ff;">[2,] "0" "1" "1" "1" "0" "0" "0" "0"</span></p>
<p><span style="color: #0000ff;">[3,] "0" "0" "0" "0" "0" "0" "0" "0"</span></p>
<p><span style="color: #0000ff;">[4,] "0" "0" "0" "0" "0" "0" "0" "0"</span></p>
<p><span style="color: #0000ff;">[5,] "0" "0" "0" "0" "0" "0" "0" "0"</span></p>
<p><span style="color: #0000ff;">[6,] "0" "1" "0" "0" "0" "0" "0" "0"</span></p>
<p>Once the <em>numeric</em> data have been recoded <em>as factor</em>, we can proceed by loading the ‘polycor’ package which contains the ‘hetcor’ function.</p>
<p><span style="color: #ff0000;">library(polycor)</span></p>
<p><span style="color: #0000ff;">Loading required package: mvtnorm</span></p>
<p><span style="color: #0000ff;">Loading required package: sfsmisc</span></p>
<p>Now we can compute the appropriate correlation matrix and assign that matrix to a new object (het.mat). Notice below, we are extracting only the correlation matrix ($cor) from the output of the ‘hetcor’ function.</p>
<p><span style="color: #ff0000;">het.mat <- hetcor(df.2)$cor</span></p>
<p><span style="color: #0000ff;">Warning messages:</span></p>
<p><span style="color: #0000ff;">1: In polychor(x, y, ML = ML, std.err = std.err) :</span></p>
<p><span style="color: #0000ff;"> inadmissible correlation set to 1</span></p>
<p><span style="color: #0000ff;">2: In hetcor.data.frame(dframe, ML = ML, std.err = std.err, bins = bins, :</span></p>
<p><span style="color: #0000ff;"> the correlation matrix has been adjusted to make it positive-definite</span></p>
<p><span style="color: #ff0000;">het.mat</span></p>
<p> <span style="color: #0000ff;">x1 x2 x3 x4 x5</span></p>
<p><span style="color: #0000ff;">x1 1.000000000 0.910975550 0.844483311 0.691731074 -0.002245134</span></p>
<p><span style="color: #0000ff;">x2 0.910975550 1.000000000 0.859541108 0.808750265 0.037625262</span></p>
<p><span style="color: #0000ff;">x3 0.844483311 0.859541108 1.000000000 0.723304581 -0.026716610</span></p>
<p><span style="color: #0000ff;">x4 0.691731074 0.808750265 0.723304581 1.000000000 -0.001185206</span></p>
<p><span style="color: #0000ff;">x5 -0.002245134 0.037625262 -0.026716610 -0.001185206 1.000000000</span></p>
<p><span style="color: #0000ff;">x6 -0.039424602 -0.004851113 -0.046661991 -0.001214029 0.993573475</span></p>
<p><span style="color: #0000ff;">x7 0.002335945 0.005438252 -0.014930707 -0.009831874 0.879110898</span></p>
<p><span style="color: #0000ff;">x8 -0.036916591 -0.054512229 0.006043798 0.031313650 0.794959194</span></p>
<p> <span style="color: #0000ff;">x6 x7 x8</span></p>
<p><span style="color: #0000ff;">x1 -0.039424602 0.002335945 -0.036916591</span></p>
<p><span style="color: #0000ff;">x2 -0.004851113 0.005438252 -0.054512229</span></p>
<p><span style="color: #0000ff;">x3 -0.046661991 -0.014930707 0.006043798</span></p>
<p><span style="color: #0000ff;">x4 -0.001214029 -0.009831874 0.031313650</span></p>
<p><span style="color: #0000ff;">x5 0.993573475 0.879110898 0.794959194</span></p>
<p><span style="color: #0000ff;">x6 1.000000000 0.849171046 0.781588616</span></p>
<p><span style="color: #0000ff;">x7 0.849171046 1.000000000 0.703973732</span></p>
<p><span style="color: #0000ff;">x8 0.781588616 0.703973732 1.000000000</span></p>
<p>Although there are two warnings listed above, the function does in fact return the appropriate correlation matrix. Now we can proceed with the factor analysis using this ‘het.mat’ correlation matrix as the matrix of association for the factor analysis.</p>
<p><span style="color: #ff0000;">fa.1 <- factanal(covmat = het.mat, factors = 2, rotation = "varimax")</span></p>
<p><span style="color: #ff0000;">fa.1</span></p>
<p><span style="color: #0000ff;">Call:</span></p>
<p><span style="color: #0000ff;">factanal(factors = 2, covmat = het.mat, rotation = "varimax")</span></p>
<p><span style="color: #0000ff;">Uniquenesses:</span></p>
<p><span style="color: #0000ff;"> x1 x2 x3 x4 x5 x6 x7 x8</span></p>
<p><span style="color: #0000ff;">0.164 0.005 0.252 0.345 0.005 0.008 0.243 0.368</span></p>
<p><span style="color: #0000ff;">Loadings:</span></p>
<p><span style="color: #0000ff;"> Factor1 Factor2</span></p>
<p><span style="color: #0000ff;">x1 0.913</span></p>
<p><span style="color: #0000ff;">x2 0.997</span></p>
<p><span style="color: #0000ff;">x3 0.863</span></p>
<p><span style="color: #0000ff;">x4 0.809</span></p>
<p><span style="color: #0000ff;">x5 0.997 </span></p>
<p><span style="color: #0000ff;">x6 0.996 </span></p>
<p><span style="color: #0000ff;">x7 0.870 </span></p>
<p><span style="color: #0000ff;">x8 0.794 </span> </p>
<p> <span style="color: #0000ff;"> Factor1 Factor2</span></p>
<p><span style="color: #0000ff;">SS loadings 3.378 3.232</span></p>
<p><span style="color: #0000ff;">Proportion Var 0.422 0.404</span></p>
<p><span style="color: #0000ff;">Cumulative Var 0.422 0.826</span></p>
<p><span style="color: #0000ff;">The degrees of freedom for the model is 13 and the fit was 12.2084</span></p>
<p>Another equally effective way to factor analyze binary data (or any other type of data), using a correlation matrix, is with the ‘fa’ function from the ‘psych’ package (Revelle, 2014). Again, we use the correlation matrix we generated with the ‘hetcor’ function. Please note, the default method of extraction for the ‘fa’ function is minimum residuals (method = minres) and not maximum likelihood (method = ml).</p>
<p><span style="color: #ff0000;">library(psych)</span></p>
<p><span style="color: #ff0000;">fa.2 <- fa(r = het.mat, nfactors = 2, n.obs = nrow(df.2), rotate = "varimax")</span></p>
<p><span style="color: #0000ff;">Loading required package: MASS</span></p>
<p><span style="color: #0000ff;">Loading required package: GPArotation</span></p>
<p><span style="color: #0000ff;">Loading required package: parallel</span></p>
<p><span style="color: #ff0000;">fa.2</span></p>
<p><span style="color: #0000ff;">Factor Analysis using method = minres</span></p>
<p><span style="color: #0000ff;">Call: fa(r = het.mat, nfactors = 2, n.obs = nrow(df.2), rotate = "varimax")</span></p>
<p><span style="color: #0000ff;">Standardized loadings (pattern matrix) based upon correlation matrix</span></p>
<p> <span style="color: #0000ff;"> MR1 MR2 h2 u2 com</span></p>
<p><span style="color: #0000ff;">x1 -0.11 0.93 0.87 0.128 1</span></p>
<p><span style="color: #0000ff;">x2 -0.09 0.96 0.94 0.062 1</span></p>
<p><span style="color: #0000ff;">x3 -0.11 0.92 0.86 0.141 1</span></p>
<p><span style="color: #0000ff;">x4 -0.07 0.86 0.75 0.250 1</span></p>
<p><span style="color: #0000ff;">x5 0.98 0.10 0.96 0.036 1</span></p>
<p><span style="color: #0000ff;">x6 0.97 0.08 0.94 0.058 1</span></p>
<p><span style="color: #0000ff;">x7 0.91 0.09 0.84 0.160 1</span></p>
<p><span style="color: #0000ff;">x8 0.87 0.07 0.76 0.242 1</span></p>
<p> <span style="color: #0000ff;">MR1 MR2</span></p>
<p><span style="color: #0000ff;">SS loadings 3.51 3.41</span></p>
<p><span style="color: #0000ff;">Proportion Var 0.44 0.43</span></p>
<p><span style="color: #0000ff;">Cumulative Var 0.44 0.87</span></p>
<p><span style="color: #0000ff;">Proportion Explained 0.51 0.49</span></p>
<p><span style="color: #0000ff;">Cumulative Proportion 0.51 1.00</span></p>
<p><span style="color: #0000ff;">Mean item complexity = 1</span></p>
<p><span style="color: #0000ff;">Test of the hypothesis that 2 factors are sufficient.</span></p>
<p><span style="color: #0000ff;">The degrees of freedom for the null model are 28 and the objective function was 23.3 with Chi Square of 23199.31</span></p>
<p><span style="color: #0000ff;">The degrees of freedom for the model are 13 and the objective function was 13.77</span></p>
<p><span style="color: #0000ff;">The root mean square of the residuals (RMSR) is 0.04</span></p>
<p><span style="color: #0000ff;">The df corrected root mean square of the residuals is 0.06</span></p>
<p><span style="color: #0000ff;">The harmonic number of observations is 1000 with the empirical chi square 99.24 with prob < 2.3e-15</span></p>
<p><span style="color: #0000ff;">The total number of observations was 1000 with MLE Chi Square = 13694.45 with prob < 0</span></p>
<p><span style="color: #0000ff;">Tucker Lewis Index of factoring reliability = -0.273</span></p>
<p><span style="color: #0000ff;">RMSEA index = 1.029 and the 90 % confidence intervals are 1.011 1.04</span></p>
<p><span style="color: #0000ff;">BIC = 13604.65</span></p>
<p><span style="color: #0000ff;">Fit based upon off diagonal values = 0.99</span></p>
<p><span style="color: #0000ff;">Measures of factor score adequacy </span></p>
<p><span style="color: #0000ff;"> MR1 MR2</span></p>
<p><span style="color: #0000ff;">Correlation of scores with factors 1 1</span></p>
<p><span style="color: #0000ff;">Multiple R square of scores with factors 1 1</span></p>
<p><span style="color: #0000ff;">Minimum correlation of possible factor scores 1 1</span></p>
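<p>One optional way to check that the two solutions agree is Tucker’s congruence coefficient, available in the ‘psych’ package as ‘factor.congruence’; absolute values near 1 indicate the two extractions recovered the same factors. A sketch using the ‘fa.1’ and ‘fa.2’ objects fit above (note the factor order and signs can differ across methods):</p>

```r
# Compare the factanal and psych::fa loading matrices; the result is
# a matrix of congruence coefficients between pairs of factors.
library(psych)
factor.congruence(fa.1$loadings, fa.2$loadings)
```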
<h4><strong>Conclusions</strong></h4>
<p>As demonstrated above, factor analyzing binary data in R is no more difficult than factor analyzing continuous data. Although not demonstrated here, the ‘hetcor’ function (i.e., heterogeneous correlations) from the ‘polycor’ package (Fox, 2014) can also handle polytomous and other mixed variable types one may want to factor analyze. More extensive examples of the use of the ‘hetcor’ function are available at the RSS <a href="http://www.unt.edu/rss/class/Jon/R_SC/">Do-it-yourself Introduction to R</a> course page, where many other examples (not just factor analysis) are provided. Lastly, a copy of the script file used for the above examples is available <a href="http://www.unt.edu/rss/class/Jon/Benchmarks/BinaryFA.R">here</a>.</p>
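<p>To illustrate that point about mixed variables, ‘hetcor’ applied to a mixed data frame chooses the appropriate correlation type (Pearson, polyserial, or polychoric) per pair of variables. The data below are simulated and the variable names are hypothetical:</p>

```r
# Sketch: 'hetcor' on a data frame with one continuous and one
# ordered-factor column; the $type component reports which kind of
# correlation was computed for each pair of variables.
library(polycor)
set.seed(42)
z <- rnorm(200)
mixed <- data.frame(
  cont = z,                                             # continuous
  ord  = cut(z + rnorm(200), breaks = 3,
             ordered_result = TRUE))                    # ordinal
hetcor(mixed)$type
```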
<p>Until next time; remember what George Carlin said: <em>“Inside every cynical person, there is a disappointed idealist.”</em></p>
<h4><span style="font-size: 1em;">References / Resources</span></h4>
<p>Carlin, G. (1937 – 2008). <a href="http://www.just-one-liners.com/ppl/george-carlin">http://www.just-one-liners.com/ppl/george-carlin</a></p>
<p>Fox, J. (2014). The ‘polycor’ package. Documentation available at CRAN: <a href="http://cran.r-project.org/web/packages/polycor/index.html">http://cran.r-project.org/web/packages/polycor/index.html</a></p>
<p>Fox, J., et al. (2014). The ‘car’ package. Documentation available at CRAN: <a href="http://cran.r-project.org/web/packages/car/index.html">http://cran.r-project.org/web/packages/car/index.html</a></p>
<p>McDonald, R. P. (1999). <em>Test Theory: A Unified Treatment.</em> Mahwah, NJ: Erlbaum.</p>
<p>Organization for Economic Co-operation and Development (OECD). (2008). <em>Handbook on Constructing Composite Indicators</em>. <a href="http://www.oecd.org/std/42495745.pdf">http://www.oecd.org/std/42495745.pdf</a></p>
<p>Revelle, W. (2014). The ‘psych’ package. Documentation available at CRAN: <a href="http://cran.r-project.org/web/packages/psych/index.html">http://cran.r-project.org/web/packages/psych/index.html</a></p>
<p>Statistics Canada. (2010). Survey Methods and Practices. Ottawa, Canada: Minister of Industry. <a href="http://www.statcan.gc.ca/bsolc/olc-cel/olc-cel?lang=eng&catno=12-587-X">http://www.statcan.gc.ca/bsolc/olc-cel/olc-cel?lang=eng&catno=12-587-X</a></p>
<h4 style="text-align: left;"><strong><span style="font-size: xx-small;">Originally published September 2014 -- Please note that information published in <em>Benchmarks Online</em> is likely to degrade over time, especially links to various Websites. To make sure you have the most current information on a specific topic, it may be best to search the UNT Website - <a href="http://www.unt.edu/">http://www.unt.edu</a> . You can also consult the UNT Helpdesk - <a href="http://www.unt.edu/helpdesk/">http://www.unt.edu/helpdesk/</a>. Questions and comments should be directed to <a href="mailto:benchmarks@unt.edu">benchmarks@unt.edu</a>.</span></strong></h4>
http://it.unt.edu/benchmarks/issues/2014/08/rss-matters
<div class="field field-type-date field-field-pubdate">
<div class="field-label">Publication Date: </div>
<div class="field-items">
<div class="field-item odd">
<span class="date-display-single">2014-08</span> </div>
</div>
</div>
<h3 style="text-align: center;"><img src="/benchmarks/sites/default/files/rssunt.gif" alt="RSS Logo" width="506" height="101" /></h3>
<h3><strong>Call to Create a UNT R Users Group</strong></h3>
<p><em>Link to the last RSS article here: </em><a href="http://it.unt.edu/benchmarks/issues/2014/07/rss-matters">A <em>new</em> recommended way of dealing with multiple missing values: Using missForest for all your imputation needs.</a> <em>-- Ed.</em></p>
<p><strong>By <a href="mailto:richard.herrington@unt.edu">Dr. Richard Herrington</a> and <a href="mailto:Jonathan.Starkweather@unt.edu">Dr. Jon Starkweather</a>, RSS Team </strong></p>
<p><span style="font-size: medium;"><strong>R</strong></span> has noticeably gained visibility at UNT over the last few years. Recently I strolled through UNT’s campus bookstore and noticed a number of courses using R – a recent development; not too long ago, there were only a handful of people on campus using R regularly. For those who may not have heard of the R statistical system, Wikipedia’s R entry provides a nice overview of the history and specifics of the <a href="http://en.wikipedia.org/wiki/R_(programming_language)">R system</a>.</p>
<p>Given this increased popularity, we believe it might be time to form an R <a href="http://en.wikipedia.org/wiki/Users%27_group">users’ group</a> here on campus (RUG, perhaps?). To our knowledge no such user group exists on campus; the closest one we are aware of is on the University of Texas at Dallas campus. Others might exist in the surrounding area, but we have been unable to find them using the <a href="http://r-users-group.meetup.com/"><em>R User’s Group Meetup Search Tool</em></a>.</p>
<h4><strong>R UNT User Group Poll</strong></h4>
<p>To facilitate the organization of this group, we have created an online poll to: i) gauge interest in such a group, and ii) collect contact information regarding a first meetup time. If you are interested in being part of such a group, please <a href="https://unt.az1.qualtrics.com/SE/?SID=SV_71ApFQtHvqAaccJ">provide us some contact information through this poll</a>. If you are not sure, browse through our favorite R news feed aggregator, <a href="http://www.r-bloggers.com/">R-Bloggers</a>, to get a sense of what this user group <em>could</em> be about.</p>
<h4 style="text-align: left;"><strong><span style="font-size: xx-small;">Originally published August 2014 -- Please note that information published in <em>Benchmarks Online</em> is likely to degrade over time, especially links to various Websites. To make sure you have the most current information on a specific topic, it may be best to search the UNT Website - <a href="http://www.unt.edu/">http://www.unt.edu</a> . You can also consult the UNT Helpdesk - <a href="http://www.unt.edu/helpdesk/">http://www.unt.edu/helpdesk/</a>. Questions and comments should be directed to <a href="mailto:benchmarks@unt.edu">benchmarks@unt.edu</a>.</span></strong></h4>