Case Study: Logistic Regression and Regularized Logistic Regression Applied to Estimating the Probability of Cesarean Section

Case study background and problem formulations

Instructions for optimization with PSG Run-File, PSG MATLAB Toolbox, PSG MATLAB Subroutines and PSG R.

——————————————————————–
Estimating the probability of Cesarean Section and the probability of Cephalopelvic Disproportion/Failure to Progress (CPD): Chen, G., Uryasev, S., and T. Young. On Prediction of the Cesarean Delivery Risk in a Large Private Practice, American Journal of Obstetrics and Gynecology, 191/2, 2004, 624-632
——————————————————————–

Problem 1: maximizing log-likelihood
maximize logexp_sum (maximizing log-likelihood)

Value: logistic
——————————————————————–
logexp_sum = log-likelihood function for logistic regression (Logarithms Exponents Sum)
logistic = Logistic calculates values of logistic function for every observation (scenario)
——————————————————————–
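The objective of Problem 1 can be illustrated outside PSG with a minimal Python sketch. It is not the PSG implementation: the data are synthetic, the function names `logistic`, `avg_log_likelihood`, and `fit_logistic` are hypothetical, Newton's method is a swapped-in solver, and averaging the log-likelihood over scenarios is an assumption suggested by the reported objective values near -0.5.

```python
import numpy as np

def logistic(X, beta):
    """Value of the logistic function for every observation (row of X)."""
    return 1.0 / (1.0 + np.exp(-(X @ beta)))

def avg_log_likelihood(X, y, beta):
    """Average log-likelihood of labels y in {0, 1} under the logistic model."""
    z = X @ beta
    # y*log(p) + (1-y)*log(1-p) simplifies to y*z - log(1 + e^z);
    # logaddexp(0, z) evaluates log(1 + e^z) without overflow.
    return np.mean(y * z - np.logaddexp(0.0, z))

def fit_logistic(X, y, iters=25):
    """Maximize the average log-likelihood with Newton's method (assumed solver)."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = logistic(X, beta)
        grad = X.T @ (y - p) / len(y)                        # gradient of the objective
        hess = (X * (p * (1 - p))[:, None]).T @ X / len(y)   # negative Hessian
        beta += np.linalg.solve(hess, grad)
    return beta

# Synthetic stand-in for the case-study data: intercept plus 2 factors.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(2000), rng.normal(size=(2000, 2))])
true_beta = np.array([-1.0, 0.8, -0.5])
y = (rng.random(2000) < logistic(X, true_beta)).astype(float)

beta_hat = fit_logistic(X, y)
print("coefficients:", beta_hat)
print("objective:", avg_log_likelihood(X, y, beta_hat))
```

The fitted coefficients feed `logistic` to obtain the estimated probability for each observation, which corresponds to the "Value: logistic" output listed above.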
Data and solution in Run-File Environment

Problem Datasets | # of Variables | # of Scenarios | Objective Value | Solving Time, PC 3.14GHz (sec)
Dataset1: Problem Statement, Data, Solution | 6 | 12,690 | -0.495793 | 0.08

Data and solution in MATLAB Environment

Problem Datasets | # of Variables | # of Scenarios | Objective Value | Solving Time, PC 3.50GHz (sec)
Dataset1: Matlab code, Data, Solution | 6 | 12,690 | -0.495793 | 0.05

Data and solution in R Environment

Problem Datasets | # of Variables | # of Scenarios | Objective Value | Solving Time, PC 3.50GHz (sec)
Dataset1: R code, Data | 6 | 12,690 | -0.495793 | 0.05

Problem 2: maximizing regularized log-likelihood
maximize logexp_sum - polynom_abs (maximizing regularized log-likelihood)

Value: logistic
——————————————————————–
logexp_sum = log-likelihood function for logistic regression (Logarithms Exponents Sum)
polynom_abs = Polynomial Absolute
logistic = Logistic calculates values of logistic function for every observation (scenario)
——————————————————————–
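A minimal Python sketch of a regularized objective of this kind follows. It assumes that the penalty is a weighted sum of absolute coefficient values (an L1-type term) and solves the problem with proximal gradient ascent, which is a swapped-in technique and not the PSG solver. The data, the penalty weights, and the names `penalized_objective` and `fit_l1_logistic` are hypothetical.

```python
import numpy as np

def avg_log_likelihood(X, y, beta):
    """Average log-likelihood of labels y in {0, 1} under the logistic model."""
    z = X @ beta
    return np.mean(y * z - np.logaddexp(0.0, z))

def penalized_objective(X, y, beta, weights):
    """Log-likelihood minus a weighted sum of absolute coefficient values."""
    return avg_log_likelihood(X, y, beta) - np.sum(weights * np.abs(beta))

def fit_l1_logistic(X, y, weights, iters=2000):
    """Proximal gradient ascent (ISTA) on the penalized objective."""
    n = len(y)
    # The gradient of the average log-likelihood is Lipschitz with
    # constant at most ||X||_2^2 / (4n); the step size is its inverse.
    step = 4.0 * n / np.linalg.norm(X, 2) ** 2
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))
        beta = beta + step * (X.T @ (y - p)) / n   # gradient step on the likelihood
        # Soft-thresholding handles the absolute-value penalty exactly.
        beta = np.sign(beta) * np.maximum(np.abs(beta) - step * weights, 0.0)
    return beta

# Synthetic data: intercept plus 4 factors, two of which have no effect.
rng = np.random.default_rng(1)
n = 2000
X = np.column_stack([np.ones(n), rng.normal(size=(n, 4))])
true_beta = np.array([-1.0, 1.0, -0.7, 0.0, 0.0])
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-(X @ true_beta)))).astype(float)

weights = np.array([0.0, 0.05, 0.05, 0.05, 0.05])   # hypothetical; intercept unpenalized
beta_l1 = fit_l1_logistic(X, y, weights)
beta_plain = fit_l1_logistic(X, y, np.zeros(5))     # zero weights: plain logistic regression
print("regularized:  ", beta_l1)
print("unregularized:", beta_plain)
```

The penalty shrinks the factor coefficients toward zero relative to the unregularized fit, trading a slightly lower in-sample log-likelihood for a simpler model.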
Data and solution in Run-File Environment

Problem Datasets | # of Variables | # of Scenarios | Objective Value | Solving Time, PC 3.14GHz (sec)
Dataset1: Problem Statement, Data, Solution | 6 | 12,690 | -0.498204 | 0.05

Data and solution in MATLAB Environment

Problem Datasets | # of Variables | # of Scenarios | Objective Value | Solving Time, PC 3.50GHz (sec)
Dataset1: Matlab code, Data, Solution | 6 | 12,690 | -0.496348 | 0.04

Data and solution in R Environment

Problem Datasets | # of Variables | # of Scenarios | Objective Value | Solving Time, PC 3.50GHz (sec)
Dataset1: R code, Data | 6 | 12,690 | -0.496348 | 0.04

Problem 3: maximizing log-likelihood under cardinality constraint
maximize logexp_sum (maximizing log-likelihood)
Constraint: cardn <= 4
Solver: precision = 9

Value: logistic
——————————————————————–
logexp_sum = log-likelihood function for logistic regression (Logarithms Exponents Sum)
cardn = Cardinality
logistic = Logistic calculates values of logistic function for every observation (scenario)
——————————————————————–
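With only 6 factors, a cardinality constraint can be illustrated by exhaustive search. The Python sketch below is one possible reading and not how PSG handles `cardn`: it enumerates factor subsets, fits a logistic regression on each, and keeps the best. The synthetic data, the always-included intercept that is not counted toward the limit, and the name `best_subset` are assumptions.

```python
import itertools
import numpy as np

def fit_logistic(X, y, iters=25):
    """Maximize the log-likelihood with Newton's method (assumed solver)."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))
        grad = X.T @ (y - p)
        hess = (X * (p * (1 - p))[:, None]).T @ X
        beta += np.linalg.solve(hess, grad)
    return beta

def avg_log_likelihood(X, y, beta):
    z = X @ beta
    return np.mean(y * z - np.logaddexp(0.0, z))

def best_subset(X, y, max_card):
    """Search factor subsets with at most max_card nonzero coefficients.

    Column 0 is an intercept that is always kept and not counted.
    """
    n_factors = X.shape[1] - 1
    size = min(max_card, n_factors)
    best = (-np.inf, None, None)
    # Adding a factor never lowers the maximized likelihood, so it is
    # enough to check subsets of the largest allowed size.
    for subset in itertools.combinations(range(1, n_factors + 1), size):
        cols = [0, *subset]
        beta_sub = fit_logistic(X[:, cols], y)
        value = avg_log_likelihood(X[:, cols], y, beta_sub)
        if value > best[0]:
            beta = np.zeros(X.shape[1])
            beta[cols] = beta_sub
            best = (value, subset, beta)
    return best

# Synthetic data: intercept plus 6 factors, three of which matter.
rng = np.random.default_rng(2)
n = 3000
X = np.column_stack([np.ones(n), rng.normal(size=(n, 6))])
true_beta = np.array([-1.0, 1.0, 0.0, -0.8, 0.0, 0.6, 0.0])
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-(X @ true_beta)))).astype(float)

value, subset, beta = best_subset(X, y, max_card=4)
print("selected factors:", subset)
print("objective:", value)
```

Exhaustive search is practical here because choosing 4 of 6 factors gives only 15 candidate models; for many factors a dedicated solver is needed.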
Data and solution in Run-File Environment

Problem Datasets | # of Variables | # of Scenarios | Objective Value | Solving Time, PC 3.14GHz (sec)
Dataset1: Problem Statement, Data, Solution | 6 | 12,690 | -0.497135 | 0.35

Data and solution in MATLAB Environment

Problem Datasets | # of Variables | # of Scenarios | Objective Value | Solving Time, PC 3.50GHz (sec)
Dataset1: Matlab code, Data, Solution | 6 | 12,690 | -0.497134 | <0.1

Data and solution in R Environment

Problem Datasets | # of Variables | # of Scenarios | Objective Value | Solving Time, PC 3.50GHz (sec)
Dataset1: R code, Data | 6 | 12,690 | -0.497134 | <0.1

Problem 4: 4-fold cross-validation (4 in-sample and 4 out-of-sample datasets) for maximization of the log-likelihood function
4-fold crossvalidation
Maximize logexp_sum

Value:
logistic (function Logistic on the in-sample data)
logistic (function Logistic on the out-of-sample data)
——————————————————————–
crossvalidation(N,Matrix) = matrix operation splits input Matrix into N pairs of complementary sub-matrices
logexp_sum = log-likelihood function for logistic regression (Logarithms Exponents Sum)
logistic = Logistic calculates values of logistic function for every observation (scenario)
——————————————————————–
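The splitting into complementary sub-matrices can be sketched in Python as follows. This is a stand-in for illustration, not the PSG `crossvalidation` operation: the contiguous row blocks, the synthetic data, the outcome stored in the last column, and the Newton solver are all assumptions.

```python
import numpy as np

def crossvalidation(n_folds, matrix):
    """Split matrix rows into n_folds (in-sample, out-of-sample) pairs."""
    folds = np.array_split(np.arange(matrix.shape[0]), n_folds)
    pairs = []
    for k in range(n_folds):
        out_idx = folds[k]
        in_idx = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        pairs.append((matrix[in_idx], matrix[out_idx]))
    return pairs

def fit_logistic(X, y, iters=25):
    """Maximize the log-likelihood with Newton's method (assumed solver)."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))
        hess = (X * (p * (1 - p))[:, None]).T @ X
        beta += np.linalg.solve(hess, X.T @ (y - p))
    return beta

def avg_log_likelihood(X, y, beta):
    z = X @ beta
    return np.mean(y * z - np.logaddexp(0.0, z))

# Synthetic data matrix: intercept, 2 factors, and the 0/1 outcome in the last column.
rng = np.random.default_rng(3)
n = 1000
factors = rng.normal(size=(n, 2))
prob = 1.0 / (1.0 + np.exp(-(-0.5 + factors @ np.array([1.0, -1.0]))))
labels = (rng.random(n) < prob).astype(float)
data = np.column_stack([np.ones(n), factors, labels])

results = []
for k, (in_sample, out_sample) in enumerate(crossvalidation(4, data), start=1):
    beta = fit_logistic(in_sample[:, :-1], in_sample[:, -1])            # calibrate in-sample
    in_ll = avg_log_likelihood(in_sample[:, :-1], in_sample[:, -1], beta)
    out_ll = avg_log_likelihood(out_sample[:, :-1], out_sample[:, -1], beta)
    results.append((len(in_sample), len(out_sample), in_ll, out_ll))
    print(f"fold {k}: in-sample {in_ll:.4f}, out-of-sample {out_ll:.4f}")
```

Comparing the in-sample and out-of-sample values per fold shows how much of the fitted likelihood carries over to data the model has not seen.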
Data and solution in Run-File Environment

Problem Datasets | # of Variables | # of Scenarios | Objective Value | Solving Time, PC 3.14GHz (sec)
Dataset1: Cycle statement, Data, Solution | 6 | 9,517 | -0.496 | 0.15
Dataset2 | 6 | 9,517 | -0.495 | 0.18
Dataset3 | 6 | 9,517 | -0.498 | 0.05
Dataset4 | 6 | 9,517 | -0.494 | 0.08

Data and solution in MATLAB Environment

Problem Datasets | # of Variables | # of Scenarios | Objective Value | Solving Time, PC 3.50GHz (sec)
Dataset1: Matlab code, Data, Solution | 6 | 9,517 | -0.496 | 0.11
Dataset2 | 6 | 9,517 | -0.495 | 0.14
Dataset3 | 6 | 9,517 | -0.498 | 0.05
Dataset4 | 6 | 9,517 | -0.494 | 0.07

Data and solution in R Environment

Problem Datasets | # of Variables | # of Scenarios | Objective Value | Solving Time, PC 3.50GHz (sec)
Dataset1: R code, Data | 6 | 9,517 | -0.496 | 0.11
Dataset2 | 6 | 9,517 | -0.495 | 0.14
Dataset3 | 6 | 9,517 | -0.498 | 0.05
Dataset4 | 6 | 9,517 | -0.494 | 0.07

CASE STUDY SUMMARY
This case study finds an optimal estimate of the Cesarean section rate in a population of women. The risk of difficult labor is described by a probabilistic model that depends on measurable demographic factors, and we evaluated the effect of these factors on the probability of Cesarean section. The case study considers 6 primary factors: age, height, weight, maternal weight gain, gestational age, and birth weight. The background is described in Chen et al. (2004).

We considered four formulations of the logistic regression optimization problem:

• Problem 1. Maximization of the log-likelihood function (“plain vanilla” logistic regression).
• Problem 2. Maximization of the log-likelihood function minus an additional regularization term (regularized logistic regression).
• Problem 3. Maximization of the log-likelihood function subject to a constraint on cardinality.
• Problem 4. Cross-validation applied to maximization of the log-likelihood function.
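
In standard notation, the first three formulations can be written as below. The symbols are introduced here for illustration and are not PSG notation: x_i and y_i (0 or 1) are the factors and outcome of scenario i, J is the number of scenarios, and c_k are nonnegative penalty coefficients. Averaging over scenarios and the weighted absolute-value form of the penalty are assumptions about how "logexp_sum" and "polynom_abs" are defined.

```latex
\begin{aligned}
p_i(\beta) &= \frac{1}{1 + e^{-\beta^\top x_i}}, \\
\ell(\beta) &= \frac{1}{J}\sum_{i=1}^{J}\Bigl[\, y_i \log p_i(\beta) + (1 - y_i)\log\bigl(1 - p_i(\beta)\bigr) \Bigr], \\
\text{Problem 1:}\quad & \max_{\beta}\ \ell(\beta), \\
\text{Problem 2:}\quad & \max_{\beta}\ \ell(\beta) - \sum_{k} c_k\,|\beta_k|, \\
\text{Problem 3:}\quad & \max_{\beta}\ \ell(\beta) \quad \text{s.t.}\quad \#\{k : \beta_k \neq 0\} \le 4.
\end{aligned}
```

Problem 4 solves Problem 1 four times, each time with the sum restricted to the scenarios of one in-sample dataset.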

Problem 1 was implemented in PSG by maximizing the log-likelihood function, which is a standard PSG function (“logexp_sum”). This problem formulation was considered in Chen et al. (2004).

The regularization term in Problem 2 was subtracted from the log-likelihood function to improve the out-of-sample performance of the regression model. Regularization is widely used in data-mining applications; see, for instance, Shi et al. (2008). For regularization we used the “polynom_abs” function, which is a standard PSG function. The coefficients of this polynomial absolute function were obtained with a steepest descent algorithm that optimizes out-of-sample performance.

The cardinality constraint in Problem 3 was used to reduce the number of factors and improve the out-of-sample performance of the regression model.

Problem 4 is the 4-fold cross-validation of the log-likelihood maximization done in Problem 1. In each pass, we selected ¾ of the data as the in-sample dataset on which we calibrated the model. We then tested the model on the remaining ¼ of the data (out-of-sample) to observe how well it predicts the probability of Cesarean section.

 
References
• Chen, G., Uryasev, S., and T.K. Young (2004): On the prediction of the cesarean delivery risk in a large private practice. American Journal of Obstetrics and Gynecology, 191, 617-25.
• Shi, W., Wahba, G., Wright, S., Lee, K., Klein, R., and B. Klein (2008): LASSO-Patternsearch algorithm with application to ophthalmology and genomic data. Statistics and Its Interface, 1(1), 137-153.