Method of correlation analysis: an example. Correlation analysis is ...
In scientific research, there is often a need to find a link between outcome (resultant) and factor variables (crop yield and the amount of precipitation, the height and weight of people in groups homogeneous by sex and age, pulse rate and body temperature, etc.).
Factor variables are the characteristics that contribute to changes in the outcome variables associated with them.
The concept of correlation analysis
There are many definitions of the term. Proceeding from the foregoing, correlation analysis can be described as a method used to test a hypothesis about the statistical significance of the relationship between two or more variables when the researcher can measure them but cannot change them.
There are other definitions of the concept. Correlation analysis is a method of processing statistical data that consists of studying the correlation coefficients between variables. The correlation coefficients for one pair or a set of pairs of characteristics are compared to establish statistical relationships between them. Correlation analysis is also defined as a method for studying the statistical dependence between random variables, which need not be strictly functional, in which a change in one random variable leads to a change in the mathematical expectation of the other.
The concept of false correlation
When conducting a correlation analysis, keep in mind that it can be applied to any set of characteristics, including ones that are absurd to compare and that have no causal relationship with each other.
In this case, one speaks of a false (spurious) correlation.
The tasks of correlation analysis
Based on the above definitions, the following tasks of the described method can be formulated: to obtain information about one unknown variable by means of another, and to determine the strength of the relationship between the variables being studied.
Correlation analysis involves determining the relationship between the features being studied, and therefore the tasks of correlation analysis can be supplemented with the following:
- identification of factors that have the greatest impact on the result;
- identification of previously unexplored causes of relationships;
- construction of a correlation model and analysis of its parameters;
- study of the significance of the relationship parameters and their interval estimation.
The relationship between correlation analysis and regression
Conditions of use
An outcome indicator depends on one or several factors. The method of correlation analysis can be used when there is a large number of observations of the outcome and factor indicators, and the factors studied must be quantitative and recorded in specific sources. If the variables follow the normal distribution, the result of the correlation analysis is the Pearson correlation coefficient; if they do not follow this law, the Spearman rank correlation coefficient is used.
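The difference between the two coefficients can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names (`pearson_r`, `average_ranks`, `spearman_rho`) and the sample data are assumptions made for the example. Spearman's coefficient is computed here as Pearson's coefficient applied to ranks, with tied values sharing their average rank.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson linear correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def average_ranks(values):
    """Rank values from 1 to n; tied values get the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1  # positions are 0-based, ranks are 1-based
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks

def spearman_rho(x, y):
    """Spearman rank correlation: Pearson's r computed on the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))

# Hypothetical data: y grows monotonically with x, but not linearly.
x = [1, 2, 3, 4, 5, 6]
y = [v ** 3 for v in x]

r = pearson_r(x, y)
rho = spearman_rho(x, y)
print(f"Pearson r = {r:.3f}, Spearman rho = {rho:.3f}")
```

For such monotonic but non-linear data the rank coefficient reaches 1, while the Pearson coefficient stays below 1 because it measures only the linear component of the relationship.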
Rules for the selection of correlation analysis factors
When applying this method, it is necessary to determine the factors that affect the outcome indicators. They are selected on the basis that a causal relationship should exist between the indicators. When building a multifactor correlation model, the factors that significantly affect the outcome indicator are selected. It is preferable not to include interdependent factors whose pair correlation coefficient is greater than 0.85, nor factors whose relationship with the outcome indicator is non-linear or functional.
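One possible way to apply the 0.85 rule is sketched below. The text only says that strongly interdependent factors are preferably left out; the choice of which factor of a pair to drop (here, the one less related to the outcome) is an assumption of this sketch, as are the function name `select_factors` and the sample data.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson linear correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def select_factors(factors, outcome, threshold=0.85):
    """Greedy filter (an assumed strategy, not prescribed by the text).

    Factors are visited in order of decreasing |r| with the outcome; a factor
    is skipped if its |r| with any already kept factor exceeds the threshold.
    """
    ranked = sorted(
        factors,
        key=lambda name: abs(pearson_r(factors[name], outcome)),
        reverse=True,
    )
    kept = []
    for name in ranked:
        if all(abs(pearson_r(factors[name], factors[k])) <= threshold for k in kept):
            kept.append(name)
    return kept

# Hypothetical data: x2 is almost exactly 2 * x1, so the pair is interdependent.
factors = {
    "x1": [1, 2, 3, 4, 5, 6],
    "x2": [2.1, 3.9, 6.2, 8.0, 9.9, 12.1],
    "x3": [5, 3, 6, 2, 7, 4],
}
outcome = [10, 12, 15, 19, 24, 30]

kept = select_factors(factors, outcome)
print("Factors kept for the model:", kept)
```

Only one of `x1` and `x2` survives the filter, while `x3` is kept because it is only weakly correlated with either of them. The sketch does not check whether a kept factor significantly affects the outcome or whether the relationship is linear; those conditions from the text would need separate checks.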
Presenting results
The results of the correlation analysis can be presented in text and graphic forms. In the first case, they are given as a correlation coefficient; in the second, as a scatter diagram.
If there is no correlation between the parameters, the points on the diagram are scattered chaotically. A medium degree of relationship shows more order, with the points lying at a more or less uniform distance from the median line. A strong relationship tends toward a straight line, and for r = 1 the scatter diagram is a perfectly straight line. An inverse correlation runs from the upper left to the lower right, and a direct correlation runs from the lower left to the upper right.
Three-dimensional representation of the scatter (scattering) diagram
In addition to the traditional 2D scatter diagram, a 3D graphical representation of the correlation analysis is now also used.
A scatter diagram matrix is also used, which displays all the pairwise plots in one figure in matrix format. For n variables, the matrix contains n rows and n columns. The diagram at the intersection of the i-th row and the j-th column plots the variable Xi against Xj. Thus, each row and column corresponds to one dimension, and a single cell displays a scatter diagram of two dimensions.
Estimating the strength of the relationship
The strength of the correlation is determined from the correlation coefficient (r): strong - r = ±0.7 to ±1, medium - r = ±0.3 to ±0.699, weak - r = 0 to ±0.299. This classification is not strict. The figure shows a slightly different scheme.
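The classification above, together with the direction of the relationship, can be written as a small helper. The thresholds are the ones given in the text; the function name `describe_correlation` and the returned labels are assumptions made for this sketch.

```python
def describe_correlation(r):
    """Return (strength, direction) for a correlation coefficient r.

    Thresholds follow the non-strict scheme in the text:
    |r| >= 0.7 is strong, 0.3 <= |r| < 0.7 is medium, |r| < 0.3 is weak.
    """
    if not -1.0 <= r <= 1.0:
        raise ValueError("a correlation coefficient must lie between -1 and 1")

    magnitude = abs(r)
    if magnitude >= 0.7:
        strength = "strong"
    elif magnitude >= 0.3:
        strength = "medium"
    else:
        strength = "weak"

    # The sign gives the direction: positive is direct, negative is inverse.
    if r > 0:
        direction = "direct"
    elif r < 0:
        direction = "inverse"
    else:
        direction = "none"
    return strength, direction

for value in (0.716, -0.45, 0.1):
    print(value, describe_correlation(value))
```

Because the classification is not strict, the boundaries are best kept as parameters or constants that can be adjusted if a different scheme is adopted.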
An example of the application of the method of correlation analysis
A notable study was undertaken in the UK. It examined the relationship between smoking and lung cancer and was carried out by means of correlation analysis. The data are presented below.
Occupational group | Smoking | Mortality |
Farmers, foresters and fishermen | 77 | 84 |
Miners and quarry workers | 137 | 116 |
Manufacturers of gas, coke and chemicals | 117 | 123 |
Manufacturers of glass and ceramics | 94 | 128 |
Workers of furnaces, forge, casting and rolling mills | 116 | 155 |
Workers of electrical engineering and electronics | 102 | 101 |
Engineering and related professions | 111 | 118 |
Woodworking production | 93 | 113 |
Leather goods | 88 | 104 |
Textile workers | 102 | 88 |
Manufacturers of work clothes | 91 | 104 |
Food, drink and tobacco workers | 104 | 129 |
Manufacturers of paper and printing | 107 | 86 |
Manufacturers of other products | 112 | 96 |
Builders | 113 | 144 |
Artists and Decorators | 110 | 139 |
Drivers of stationary engines, cranes, etc. | 125 | 113 |
Workers not included elsewhere | 133 | 146 |
Transport and Communications Workers | 115 | 128 |
Warehouse workers, storekeepers, packers and workers of filling machines | 105 | 115 |
Office workers | 87 | 79 |
Sellers | 91 | 85 |
Employees of the sport and recreation service | 100 | 120 |
Administrators and managers | 76 | 60 |
Professionals, technicians and artists | 66 | 51 |
We begin the correlation analysis. For clarity, it is best to start with the graphical method, for which we construct a scatter diagram.
It shows a direct relationship. However, it is difficult to draw an unambiguous conclusion from the graphical method alone, so we continue the correlation analysis. An example of calculating the correlation coefficient is presented below.
Using software (for example, MS Excel, described below), we determine the correlation coefficient, which is 0.716, indicating a strong relationship between the parameters studied. We check the statistical reliability of this value against the corresponding table of critical values. To do so, we subtract 2 from the 25 pairs of values, which gives 23, and in that row of the table we find r critical for p = 0.01 (a stricter level is used because these are medical data; in other cases p = 0.05 is sufficient), which is 0.51 for this analysis. Since r is greater than r critical, the value of the correlation coefficient is considered statistically reliable.
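The same calculation can be reproduced without a spreadsheet. The sketch below uses the smoking and mortality values from the table and takes the critical value 0.51 from the text rather than computing it; the variable and function names are assumptions made for the example.

```python
from math import sqrt

# Values from the table above, in row order.
smoking = [77, 137, 117, 94, 116, 102, 111, 93, 88, 102, 91, 104, 107,
           112, 113, 110, 125, 133, 115, 105, 87, 91, 100, 76, 66]
mortality = [84, 116, 123, 128, 155, 101, 118, 113, 104, 88, 104, 129, 86,
             96, 144, 139, 113, 146, 128, 115, 79, 85, 120, 60, 51]

def pearson_r(x, y):
    """Pearson linear correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

n = len(smoking)
r = pearson_r(smoking, mortality)
degrees_of_freedom = n - 2  # 25 pairs minus 2

# Critical value quoted in the text for 23 degrees of freedom and p = 0.01.
R_CRITICAL = 0.51
significant = abs(r) > R_CRITICAL

print(f"n = {n}, df = {degrees_of_freedom}, r = {r:.3f}, significant: {significant}")
# prints: n = 25, df = 23, r = 0.716, significant: True
```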
The use of software in conducting a correlation analysis
The described type of statistical data processing can be carried out using software, in particular MS Excel. Correlation analysis in Excel involves calculating the following parameters using functions:
1. The correlation coefficient is determined using the function CORREL(array1, array2), where array1 and array2 are the cell ranges containing the values of the outcome and factor variables.
The linear correlation coefficient is also called the Pearson correlation coefficient, and therefore, starting with Excel 2007, you can use the PEARSON function with the same arrays.
A graphical representation of the correlation analysis in Excel is made using the "Charts" panel by selecting the "Scatter" chart.
After specifying the source data, we obtain a graph.
2. Evaluation of the significance of the pair correlation coefficient using Student's t-test. The calculated value of the t-statistic is compared with the tabular (critical) value taken from the table of Student's distribution for a given significance level and number of degrees of freedom. The critical value is obtained using the function TINV(probability, deg_freedom).
3. Matrix of pair correlation coefficients. The analysis is performed using the "Data Analysis" tool, in which "Correlation" is selected. The statistical significance of a pair correlation coefficient is assessed by comparing its absolute value with the tabular (critical) value. If the calculated pair correlation coefficient exceeds the critical one, we can say, with the given degree of probability, that the null hypothesis of no linear relationship is rejected, that is, the linear relationship is significant.
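Steps 2 and 3 can also be sketched in Python. The example below builds a matrix of pair correlation coefficients and tests each pair with the t-statistic t = r * sqrt(n - 2) / sqrt(1 - r^2). The data are hypothetical, the function names are assumptions, and the critical value 2.447 is the standard two-tailed Student value for 6 degrees of freedom at the 0.05 level, entered by hand rather than computed.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson linear correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def t_statistic(r, n):
    """Student's t for H0 'no linear relationship', with n - 2 degrees of freedom.

    Not defined for |r| = 1, so it is only applied to off-diagonal pairs here.
    """
    return r * sqrt(n - 2) / sqrt(1 - r * r)

def correlation_matrix(columns):
    """Nested dict: matrix[a][b] is the pair correlation of columns a and b."""
    names = list(columns)
    return {a: {b: pearson_r(columns[a], columns[b]) for b in names} for a in names}

# Hypothetical observations (8 per variable), for illustration only.
columns = {
    "rainfall": [30, 45, 50, 62, 70, 85, 90, 100],
    "crop_yield": [2.1, 2.6, 2.9, 3.1, 3.8, 4.0, 4.4, 4.9],
    "temperature": [18, 22, 17, 21, 19, 23, 18, 20],
}
n_obs = 8
T_CRITICAL = 2.447  # tabular value: two-tailed, alpha = 0.05, df = n_obs - 2 = 6

matrix = correlation_matrix(columns)
names = list(columns)
significant_pairs = []
for i, a in enumerate(names):
    for b in names[i + 1:]:
        t = t_statistic(matrix[a][b], n_obs)
        if abs(t) > T_CRITICAL:  # H0 of no linear relationship is rejected
            significant_pairs.append((a, b))
        print(f"{a} vs {b}: r = {matrix[a][b]:.3f}, t = {t:.2f}")

print("Significant pairs:", significant_pairs)
```

Comparing |t| with the critical t is equivalent to comparing |r| with a critical r for the same significance level and degrees of freedom, so either table can be used.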
Conclusion
Using the method of correlation analysis in scientific research makes it possible to determine the relationship between various factors and outcome indicators. It should be borne in mind that a high correlation coefficient can be obtained from an absurd pair or set of data, and that this kind of analysis must be performed on a sufficiently large data set.
After obtaining the calculated value of r, it is advisable to compare it with r critical to confirm its statistical reliability. Correlation analysis can be performed manually using formulas or with software tools, in particular MS Excel, where a scatter diagram can also be constructed to visualize the relationship between the factors studied and the outcome variable.