Logistic regression

Suggested pre-reading

The concept of logistic regression

Logistic regression is a kind of regression where the dependent variable (Y) is not continuous (does not have an order with equidistant scale steps) but categorical. Instead of modelling Y directly, the natural logarithm of the odds of the outcome (the log-odds, or logit) is modelled as a linear combination of the independent variables. A predicted log-odds can be transformed back to a probability using the inverse of the logit function. The beta coefficients produced by the logistic regression are often transformed to odds ratios by taking the exponential of each coefficient. Logistic regression creates equations similar to the equations created by standard linear regression:

Standard linear regression equation: Y = a + b1x1 + b2x2 + b3x3

Binary logistic equation: Log-odds = a + b1x1 + b2x2 + b3x3
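
To make the link between these two equations explicit, the log-odds can be written out and inverted (same notation as above):

Log-odds = ln(p / (1 - p)), where p is the probability that the dependent variable equals 1

p = 1 / (1 + e^-(a + b1x1 + b2x2 + b3x3))

Odds ratio for x1 = e^b1, i.e. the factor by which the odds change when x1 increases by one unit while the other variables are held constant.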

Each x represents an independent variable and each b its respective beta coefficient. The possible situations within the concept of logistic regression are:

  • The dependent variable is binary and the sample consists of independently chosen observations ⇒ Unconditional binary logistic regression
  • The dependent variable is binary and the sample consists of independently chosen matched pairs ⇒ Conditional binary logistic regression
  • The dependent variable is categorical with more than two categories/values and the sample consists of independently chosen observations ⇒ Multinomial logistic regression (= multiclass logistic regression)
  • The dependent variable is ordinal (has an order but not equidistant scale steps) and the sample consists of independently chosen observations ⇒ Ordered logistic regression (= ordinal regression)

The independent variables (X) can be categorical (no order), ordinal (ordered) or continuous (ordered with equidistant scale steps). The rest of this page will focus on unconditional binary logistic regression. First have a look at this video by Steve Grambow explaining what unconditional binary logistic regression is:

History

(Coming)

Reasons why unconditional binary logistic regression is very useful

Binary logistic regression is well suited for evaluating the association between any variable and a dichotomous dependent variable. Many group comparisons are preferably analysed using logistic regression rather than a chi-square or t-test. The reasons why logistic regression has an advantage over simpler group comparisons, such as the chi-square or t-test, are:

  • The ability to adjust for other covariates. The importance of each variable can be estimated while simultaneously adjusting for the influence of other variables. This usually gives a better reflection of the true importance of a variable.
  • Univariate regression followed by a stepwise multivariate regression. The outcome of the unadjusted regression is not a result; it is only a sorting mechanism to eliminate insignificant variables. The following stepwise regression further reduces the number of variables. Hence the problem of multiple testing is reduced significantly.

Let me explain the latter using an example. Assume we want to know which variables differ between two groups, those who have experienced an illness compared to those who have not (or it could be those who say yes to a question compared to those who say no). Also assume that we want to investigate 50 different variables, some being categorical while others are continuous. Using a chi-square or t-test to compare the two groups would result in 50 p-values, some below and some above 0.05. However, a p-value below 0.05 can occur by chance. To compensate for the possibility of getting statistical significance by pure chance we need to lower the limit at which we consider a statistical finding significant. There are multiple ways of doing this. As an example, using the simple Bonferroni adjustment with 50 tests means that only p-values below 0.001 (0.05/50) should be considered statistically significant. A consequence is that very few or none of the calculated p-values would be considered statistically significant.

Using logistic regression as described below reduces this problem. We would first use unadjusted (univariate = bivariate) logistic regression as a sorting mechanism to decide which variables to put into the second step, a stepwise multivariate regression. Thus, the p-values in the univariate regression are not to be considered as results and do not even have to be commented on when discussing results. The final result consists only of the variables surviving the multivariate stepwise regression (= adjusted logistic regression). A common scenario is to start with 50 independent variables. Perhaps 10-20 of these are statistically significant in the univariate regression. If we enter these 10-20 variables into the forward stepwise logistic regression it is reasonable to believe that 2-5 variables remain statistically significant in the final model. Assume that we finally get 5 p-values (with odds ratios). According to Bonferroni, any p-value below 0.01 (0.05/5) can now be considered statistically significant. Consequently we have substantially reduced the problem of multiple testing.
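
The arithmetic behind the Bonferroni thresholds above is simple enough to show in a few lines (a sketch only; the variable counts are the hypothetical numbers from the example):

    # Bonferroni-adjusted significance thresholds for the example above
    alpha = 0.05
    tests_without_screening = 50   # one test per candidate variable
    tests_in_final_model = 5       # variables surviving the stepwise regression

    print(alpha / tests_without_screening)  # 0.001, threshold if all 50 tests are reported
    print(alpha / tests_in_final_model)     # 0.01, threshold for the 5 reported p-values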

The two arguments for using multivariate binary logistic regression (when the dependent variable is dichotomous) rather than a simple group comparison would also apply to multivariate linear regression (if the dependent variable is continuous) and multivariate Cox regression (if a time factor is the dependent variable).

Prerequisites for unconditional binary logistic regression

  • All your observations are chosen independently. This means that your observations should not be grouped in a way that can have an influence on the outcome. (You can have grouped observations and adjust for this. However, that is a bit complicated in logistic regression and cannot be done in all statistical software)
  • You have enough observations to investigate your question. The number of observations you need should be estimated in advance by doing a sample size calculation.
  • The dependent variable is a binary class variable (only two options). Usually this variable describes if a phenomenon exists (coded as 1) versus does not exist (coded as 0). An example can be if a patient has a disease or if a specified event happened.
  • Categorical variables must be mutually exclusive. This means that it must be clear if an observation should be coded as a 0 or as a 1 (or anything else if the categorical variable has more than two values).
  • There should be a reasonable variation between the dependent variable and categorical independent variables. You can check this by doing a cross-table between the outcome variable and each categorical independent variable (see the sketch after this list). Logistic regression might be unsuitable if there are cells in the cross-table with few or no observations.
  • The independent variables do not correlate too much with each other in case you have more than one independent variable. This phenomenon is labelled multicollinearity. You should test for this when you do a multiple binary logistic regression.
  • Any continuous independent variable must have a linear correlation with the logit of the binary dependent variable. The model constructed will not fit well if there is a non-linear correlation such as a polynomial correlation. You do not need to test for this before doing the logistic regression, but checking it should be a part of the procedure when you do the regression.
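
A minimal sketch of how the cross-table and multicollinearity checks could be done outside SPSS is shown below (assuming Python with pandas and statsmodels; the file name and variable names are made up for illustration):

    import pandas as pd
    import statsmodels.api as sm
    from statsmodels.stats.outliers_influence import variance_inflation_factor

    df = pd.read_csv("mydata.csv")  # hypothetical data set

    # Cross-table between the binary outcome and a categorical independent variable;
    # look for cells with few or no observations
    print(pd.crosstab(df["outcome"], df["smoker"]))

    # One common multicollinearity diagnostic is the variance inflation factor (VIF);
    # values well above 5-10 suggest that the independent variables overlap too much
    X = sm.add_constant(df[["age_decades", "systolic_bp", "smoker"]])
    for i, name in enumerate(X.columns):
        if name != "const":
            print(name, variance_inflation_factor(X.values, i))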

Ensure that all variables are coded so that interpretation of the outcome is facilitated. If you have a class variable (categorical variable), ensure that the coding is sensible. If one variable is ethnicity, the question is whether you should have one option for each possible alternative or merge them into fewer groups. This depends on your focus and how the individuals relate to the different options. It is often worth considering whether merging several alternatives of a categorical (class) variable into only two values/groups would facilitate interpretation of the outcome.

Preparations before performing logistic regression

  1. Data cleaning: Do a frequency analysis for each variable, one at a time. You are likely to find some surprises, such as a few individuals with a third gender, a person with an unreasonable age, or more missing data than you expect. Go back to the source and correct all errors. Recheck all affected variables after correction by doing a new frequency analysis. This must be done properly before proceeding.
  2. Investigate traces of potential bias: Have a look at the proportion of missing data for each variable. There are almost always some missing data. Is there a lot of missing data in some variables? Do you have a reasonable explanation for why? Can it be a sign that there is some built in bias (systematic error in the selection of observations/individuals) in your study that may influence the outcome?
  3. Adjust independent variables to facilitate interpretation: Quite often a few variables need to be transformed into a new variable (see the sketch after this list). It is usually easier to interpret the result of a logistic regression if each independent variable is either binary or continuous. A class variable with several options might be better off either being merged into a binary variable or split into several binary variables. For continuous variables it may sometimes be easier to interpret the result if you change the scale. A typical example of the latter is that it is often better to transform age in years to a new variable showing age in decades. When you want to interpret the influence of age, an increase of one year is often quite small while an increase of one decade is usually more relevant. Other examples are systolic blood pressure or laboratory investigations such as S-Creatinine (a blood test to measure kidney function).
  4. Decide strategy in case of multiple regression: If you have only one independent variable then do a bivariate (=univariate) binary logistic regression. However, if you have multiple independent variables then you need to choose a strategy to include independent variables in your analysis. There are a few different possibilities to do this:
    1. Decide on a combination of independent variables based on logical reasons/theories and use them. The number of independent variables should not be too many, preferably fewer than 10. This would be the preferred method if you have a good theory of how the variables link to each other.
    2. If you have many independent variables and no theory of which ones are useful you may let the computer suggest which variables are relevant to retain in the final model by doing one regression for all possible combinations of independent variables. This is only feasible if you don’t have too many independent variables. A few statistical software packages can automate this process. However, even with a fast computer the number of independent variables that can be analysed in this way is limited (more than 15-20 would be difficult).
    3. If you have many independent variables and no theory of which ones are useful you may let the computer suggest which variables are relevant to include using a stepwise procedure. This procedure can start with many independent variables (no direct upper limit) and gradually eliminate the ones of lesser importance. This procedure is also called a fishing expedition. There is nothing wrong with that, but it is important to maintain a suspicion towards statistically significant findings that seem peculiar. If you have many independent variables (>15), do forward stepwise regression. If you have a moderate number of independent variables (fifteen or less), do backward stepwise regression.
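
The data preparation steps above can be sketched briefly (assuming Python with pandas; the file name and variable names are made up for illustration):

    import pandas as pd

    df = pd.read_csv("mydata.csv")  # hypothetical data set

    # Step 1: frequency analysis, one variable at a time, to find surprises
    for column in df.columns:
        print(df[column].value_counts(dropna=False))

    # Step 2: proportion of missing data per variable, as a trace of potential bias
    print(df.isna().mean())

    # Step 3: rescale a continuous variable to ease interpretation,
    # e.g. age in years -> age in decades (the odds ratio then refers to a 10-year increase)
    df["age_decades"] = df["age_years"] / 10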

Building a multivariate binary logistic regression model – predetermined variables

This describes how to build a model as described in alternative 1 under item 4 above.

Commands to do a logistic regression in SPSS:

Viewing and interpreting the outcome of a logistic regression in SPSS:
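
If you do not have access to SPSS, the same analysis can be done in most statistical environments. Below is a minimal sketch (assuming Python with statsmodels; the file name and variables are hypothetical) of fitting a binary logistic regression with a predetermined set of independent variables and converting the coefficients to odds ratios:

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm

    df = pd.read_csv("mydata.csv")                  # hypothetical data set
    variables = ["age_decades", "smoker", "systolic_bp"]
    df = df.dropna(subset=["outcome"] + variables)  # simple listwise deletion of missing data

    y = df["outcome"]                               # binary dependent variable (0/1)
    X = sm.add_constant(df[variables])              # predetermined independent variables

    model = sm.Logit(y, X).fit()
    print(model.summary())                          # coefficients, p-values, confidence intervals

    # Odds ratios with 95% confidence intervals (exponentiated coefficients)
    print(np.exp(model.params))
    print(np.exp(model.conf_int()))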

Building a multivariate binary logistic regression model – stepwise

This describes how to build a model as described in alternative 3 under item 4 above:

  1. Do a logistic regression with one independent variable at a time and save the output (see the sketch after this list). (There is some confusion about whether a regression with only one independent variable should be labelled unadjusted, univariate or bivariate. They are all correct, but “unadjusted” will probably cause the least confusion so I recommend using that.) This step is just to sift out variables of potential interest for further investigation. Hence, the outcome of this unadjusted regression is not part of our final results, it is just a sorting mechanism. There is no given rule for where to put the cut-off that lets variables proceed to the next step in the analysis. A p-value cut-off between 0.05 and 0.2 should work.
  2. If more than one independent variable is put forward from the preceding step for further analysis, you must check these for zero order correlations. This means checking whether any of the independent variables of potential interest correlate strongly with each other. If that is the case, you must make a choice before proceeding. There is no clear-cut definition of a “strong correlation”. I suggest that two independent variables having a Pearson (or Spearman) correlation coefficient above +0.7 or below -0.7 with a p-value <0.05 should be considered too correlated. If that is the case, you need to omit one of them from further analysis. This choice should be influenced by what is most practical to keep in the further analysis (what is likely to be more useful). The best way to do this in SPSS is to do a standard multivariate linear regression and, under the Statistics button, tick that you want Covariance matrix and Collinearity diagnostics. Ignore all output except these two outputs.
  3. The next step is to do a forward or backward stepwise multivariate logistic regression. When you do this, remember to tick the casewise listing of residuals under the Options button (if you are using SPSS). Carefully investigate whether a lot of observations are excluded in this analysis. The reason for such exclusions is usually a lot of missing data in one or a few variables. This may represent a potential bias (data are not missing randomly but for a reason).
  4. In the SPSS output, look for a table labelled Casewise list. This table lists observations that do not fit the model well. Their raw data should be checked. You may want to consider removing extreme observations with a ZResid value above +2.5 or below -2.5.
  5. When you have the final model you need to see how well it performs. This is done by looking at different measures such as Nagelkerke R-square and Area under Curve (AUC = C-statistics). See below.
  6. These data are normally presented in a table where the first column lists all independent variables evaluated, the second column presents the outcome of the bivariate regression and the third column presents the outcome of the multivariate regression (see examples below).
  7. Data should always be presented in a table as described in item 6. However, if data are suitable they may also lend themselves to being presented as a user friendly look up table or probability nomogram. Look up tables might be a good option if you only have dichotomous independent variables remaining in the final model. Probability nomograms are often a good option if you have only one continuous variable remaining in the final model (see examples below). If you end up with a final model containing multiple continuous variables the options are:
    1. Simply present data in a table as described in item 6 above. This is often the best choice.
    2. Create a complicated probability nomogram (unlikely to be user friendly).
    3. Create a web based calculator where the user puts in information and a probability is calculated.
    4. Make a phone app that does all calculations when the user puts in data.
    5. Investigate the consequences of dichotomising all continuous independent variables except one. Sometimes the explanatory power of the model (measured by Nagelkerke R-square and AUC) only goes down marginally, by a few per cent. This price might be acceptable to enable presenting a complicated finding in a much more user friendly way. If that is the case it enables you to create a user friendly probability nomogram.
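
A rough sketch of the screening and modelling steps above, for readers who want to try them outside SPSS, is shown below (assuming Python with pandas and statsmodels; file name and variable names are hypothetical, cut-offs as in the text). Note that statsmodels has no built-in stepwise selection, so the last step simply enters all retained variables:

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm

    df = pd.read_csv("mydata.csv").dropna()   # hypothetical data set
    y = df["outcome"]                         # binary dependent variable (0/1)
    candidates = [c for c in df.columns if c != "outcome"]

    # Step 1: unadjusted regressions, one independent variable at a time,
    # used only as a sorting mechanism (cut-off here: p < 0.2)
    kept = []
    for var in candidates:
        m = sm.Logit(y, sm.add_constant(df[[var]])).fit(disp=0)
        if m.pvalues[var] < 0.2:
            kept.append(var)

    # Step 2: zero order correlations between the retained variables;
    # for pairs with a coefficient above +0.7 or below -0.7, drop one of the two
    print(df[kept].corr())

    # Step 3: multivariate model with the retained variables (stepwise selection
    # itself would be done in SPSS or with a dedicated package)
    final = sm.Logit(y, sm.add_constant(df[kept])).fit()
    print(final.summary())
    print(np.exp(final.params))               # odds ratios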

Evaluating and validating your new model

  • The Hosmer and Lemeshow test investigates how well the predictions of the new model agree with the observations (goodness of fit). A low p-value indicates a poor fit while a high p-value gives no evidence of poor fit. This test is too sensitive if you have large data sets. Many statistical software packages deliver this if you tick the right box when you command the software to do logistic regression.
  • Nagelkerke R square describes how well variation in the dependent variable is described by variation in the independent variables. Zero means that the new model is useless and 1.0 that it is absolutely perfect. Thus, the higher the better. Many statistical software packages deliver this if you tick the right box when you command the software to do logistic regression.
  • Test your model in the same set of observations with a ROC analysis and look at the value for the Area under the curve (AUC). This usually goes from 0.5 (the model is as good as pure chance) to 1.0 (the model is perfect). ROC analysis is a separate analysis requiring you to save the estimated probability of the event for each observation when you do the logistic regression (see the sketch after this list).
  • If you have many observations then you can use 70-75% of them to construct your model and 25-30% of them to test your model with a ROC analysis.
  • The ultimate test is to apply your model to another set of observations obtained from another site / context. Calculating AUC in this new set of observations estimates how useful your model is.
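
A minimal sketch of the ROC analysis with a train/test split described above (assuming Python with statsmodels and scikit-learn; the file name and variable names are hypothetical):

    import pandas as pd
    import statsmodels.api as sm
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import roc_auc_score

    df = pd.read_csv("mydata.csv").dropna()
    predictors = ["age_decades", "smoker", "systolic_bp"]  # variables in the final model

    # Use 70% of the observations to construct the model and 30% to test it
    train, test = train_test_split(df, test_size=0.3, random_state=1)

    model = sm.Logit(train["outcome"], sm.add_constant(train[predictors])).fit(disp=0)

    # Estimated probability of the event for each observation in the test set,
    # then the area under the ROC curve (AUC, also called the C-statistic)
    predicted = model.predict(sm.add_constant(test[predictors]))
    print(roc_auc_score(test["outcome"], predicted))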

Examples of presenting the outcome of logistic regression

Results from multivariate binary logistic regression should always be presented in a table where the first column lists all independent variables evaluated, the second column presents the outcome of the bivariate regression and the third column presents the outcome of the multivariate regression (see examples below). However, if data are suitable they may also lend themselves to being presented as a user friendly look up table or probability nomogram. Look up tables might be a good option if you only have dichotomous independent variables remaining in the final model. Probability nomograms are often a good option if you have only one continuous variable remaining in the final model (see examples below). If you end up with a final model containing multiple continuous variables the options are:

  • Simply present data in a table as described above. This must be done anyway and it is not always a good idea to do more.
  • Investigate the consequences of dichotomising all continuous independent variables except one. Sometimes the explanatory power of the model (measured by Nagelkerke R-square and AUC) only goes down marginally, by a few per cent. This price might be acceptable to enable presenting a complicated finding in a much more user friendly way. If that is the case it enables you to create a user friendly probability nomogram. This is often worth investigating.
  • Create a complicated probability nomogram (unlikely to be user friendly).
  • Create a web based calculator where the user puts in information and a probability is calculated (see the sketch after this list).
  • Make a phone app that does all calculations when the user puts in data.
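
As an illustration of what such a calculator or phone app would compute internally, here is a sketch with made-up coefficients (the real values would come from your final model):

    import math

    # Hypothetical coefficients from a final model, on the log-odds scale
    intercept = -4.2
    b_age_decades = 0.35
    b_smoker = 0.9

    def predicted_probability(age_decades, smoker):
        # Probability of the event for one individual, using the inverse logit
        log_odds = intercept + b_age_decades * age_decades + b_smoker * smoker
        return 1 / (1 + math.exp(-log_odds))

    # Example: a 70-year-old smoker (age in decades = 7.0, smoker coded as 1)
    print(predicted_probability(7.0, 1))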

Practical examples of presenting the outcome from multivariate logistic regression:

More information

You should cite this article if you use its information in other circumstances. An example of citing this article is:
Ronny Gunnarsson. Logistic regression [in Science Network TV]. Available at: http://science-network.tv/logistic-regression/. Accessed September 25, 2017.
