Case Study: Logistic Regression and Regularized Logistic Regression Applied to Estimating the Probability of Cesarean Section

Case study background and problem formulations

Instructions for optimization with PSG Run-File, PSG MATLAB Toolbox, PSG MATLAB Subroutines and PSG R.

——————————————————————–
Estimating the probability of Cesarean Section and the probability of Cephalopelvic Disproportion/Failure to Progress (CPD): Chen, G., Uryasev, S. and T. Young. On Prediction of the Cesarean Delivery Risk in a Large Private Practice, American Journal of Obstetrics and Gynecology, 191/2, 2004, 624-632
——————————————————————–

Problem 1: maximizing log-likelihood
maximize logexp_sum (maximizing log-likelihood)

Value: logistic
——————————————————————–
logexp_sum = log-likelihood function for logistic regression (Logarithms Exponents Sum)
logistic = Logistic calculates values of logistic function for every observation (scenario)
——————————————————————–
Data and solution in Run-File Environment

Problem Datasets				# of Variables	# of Scenarios	Objective Value	Solving Time, PC 3.14GHz (sec)
Dataset1	Problem Statement	Data	Solution	6	12,690	-0.495793	0.08

Data and solution in MATLAB Environment

Problem Datasets				# of Variables	# of Scenarios	Objective Value	Solving Time, PC 3.50GHz (sec)
Dataset1	Matlab code	Data	Solution	6	12,690	-0.495793	0.05

Data and solution in R Environment

Problem Datasets				# of Variables	# of Scenarios	Objective Value	Solving Time, PC 3.50GHz (sec)
Dataset1	R code	Data		6	12,690	-0.495793	0.05

Problem 2: maximizing regularized log-likelihood
maximize logexp_sum - polynom_abs (maximizing regularized log-likelihood)

Value: logistic
——————————————————————–
logexp_sum = log-likelihood function for logistic regression (Logarithms Exponents Sum)
polynom_abs = Polynomial Absolute
logistic = Logistic calculates values of logistic function for every observation (scenario)
——————————————————————–
Data and solution in Run-File Environment

Problem Datasets				# of Variables	# of Scenarios	Objective Value	Solving Time, PC 3.14GHz (sec)
Dataset1	Problem Statement	Data	Solution	6	12,690	-0.498204	0.05

Data and solution in MATLAB Environment

Problem Datasets				# of Variables	# of Scenarios	Objective Value	Solving Time, PC 3.50GHz (sec)
Dataset1	Matlab code	Data	Solution	6	12,690	-0.496348	0.04

Data and solution in R Environment

Problem Datasets				# of Variables	# of Scenarios	Objective Value	Solving Time, PC 3.50GHz (sec)
Dataset1	R code	Data		6	12,690	-0.496348	0.04

Problem 3: maximizing log-likelihood under cardinality constraint
maximize logexp_sum (maximizing log-likelihood)
Constraint: cardn <= 4
Solver: precision = 9

Value: logistic
——————————————————————–
logexp_sum = log-likelihood function for logistic regression (Logarithms Exponents Sum)
cardn = Cardinality
logistic = Logistic calculates values of logistic function for every observation (scenario)
——————————————————————–
Data and solution in Run-File Environment

Problem Datasets				# of Variables	# of Scenarios	Objective Value	Solving Time, PC 3.14GHz (sec)
Dataset1	Problem Statement	Data	Solution	6	12,690	-0.497135	0.35

Data and solution in MATLAB Environment

Problem Datasets				# of Variables	# of Scenarios	Objective Value	Solving Time, PC 3.50GHz (sec)
Dataset1	Matlab code	Data	Solution	6	12,690	-0.497134	<0.1

Data and solution in R Environment

Problem Datasets				# of Variables	# of Scenarios	Objective Value	Solving Time, PC 3.50GHz (sec)
Dataset1	R code	Data		6	12,690	-0.497134	<0.1

Problem 4: 4-fold cross-validation (4 in-sample and 4 out-of-sample datasets) for maximization of the log-likelihood function
4-fold crossvalidation
Maximize logexp_sum

Value:
logistic (function Logistic on the in-sample data)
logistic (function Logistic on the out-of-sample data)
——————————————————————–
crossvalidation(N,Matrix) = matrix operation splits input Matrix into N pairs of complementary sub-matrices
logexp_sum = log-likelihood function for logistic regression (Logarithms Exponents Sum)
logistic = Logistic calculates values of logistic function for every observation (scenario)
——————————————————————–
Data and solution in Run-File Environment

Problem Datasets				# of Variables	# of Scenarios	Objective Value	Solving Time, PC 3.14GHz (sec)
Dataset1	Cycle statement	Data	Solution	6	9,517	-0.496	0.15
Dataset2				6	9,517	-0.495	0.18
Dataset3				6	9,517	-0.498	0.05
Dataset4				6	9,517	-0.494	0.08

Data and solution in MATLAB Environment

Problem Datasets				# of Variables	# of Scenarios	Objective Value	Solving Time, PC 3.50GHz (sec)
Dataset1	Matlab code	Data	Solution	6	9,517	-0.496	0.11
Dataset2				6	9,517	-0.495	0.14
Dataset3				6	9,517	-0.498	0.05
Dataset4				6	9,517	-0.494	0.07

Data and solution in R Environment

Problem Datasets			# of Variables	# of Scenarios	Objective Value	Solving Time, PC 3.50GHz (sec)
Dataset1	R code	Data	6	9,517	-0.496	0.11
Dataset2			6	9,517	-0.495	0.14
Dataset3			6	9,517	-0.498	0.05
Dataset4			6	9,517	-0.494	0.07

CASE STUDY SUMMARY
This case study finds an optimal estimate of the Cesarean section rate in a population of women. The risk of difficult labor is described by a probabilistic model that depends on measurable demographic factors. We evaluated the effects of these factors on the probability of Cesarean section. The case study considers 6 primary factors: age, height, weight, maternal weight gain, gestational age, and birth weight. The background for this case study is described in Chen et al. (2004).

We considered four formulations of the logistic regression optimization problem:

• Problem 1. Maximization of the log-likelihood function (“plain vanilla” logistic regression).
• Problem 2. Maximization of the log-likelihood function minus an additional regularization term (regularized logistic regression).
• Problem 3. Maximization of the log-likelihood function subject to a constraint on cardinality.
• Problem 4. Cross-Validation applied to Maximization of the log-likelihood function.

Problem 1 was implemented in PSG by maximizing the log-likelihood function, which is a standard PSG function (“logexp_sum”). This problem formulation was considered in Chen et al. (2004).
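As a rough illustration of the quantity maximized in Problem 1, the sketch below evaluates a logistic log-likelihood in plain Python. It is not PSG code: the names `logistic`, `log1p_exp`, and `mean_log_likelihood`, the intercept term, the averaging over observations, and the toy data are all assumptions made for this example. The exact definition of “logexp_sum” should be taken from the PSG documentation.

```python
import math

def logistic(score):
    """Logistic function 1 / (1 + exp(-score)), written to avoid overflow."""
    if score >= 0:
        return 1.0 / (1.0 + math.exp(-score))
    e = math.exp(score)
    return e / (1.0 + e)

def log1p_exp(score):
    """Numerically stable log(1 + exp(score))."""
    return max(score, 0.0) + math.log1p(math.exp(-abs(score)))

def mean_log_likelihood(coeffs, intercept, rows, labels):
    """Average Bernoulli log-likelihood of a logistic model.

    rows   : list of factor vectors, one per observation (scenario)
    labels : list of 0/1 outcomes (1 = Cesarean section in this toy setup)
    Averaging instead of summing is an assumption of this sketch.
    """
    total = 0.0
    for x, y in zip(rows, labels):
        score = intercept + sum(c * v for c, v in zip(coeffs, x))
        # log-likelihood of one observation: y * score - log(1 + exp(score))
        total += y * score - log1p_exp(score)
    return total / len(rows)

# Tiny made-up dataset (NOT the case-study data): two factors, four observations.
rows = [[0.5, 1.0], [1.5, -0.5], [-1.0, 0.3], [2.0, 1.2]]
labels = [0, 1, 0, 1]

# With all coefficients at zero every predicted probability is 0.5,
# so the average log-likelihood equals -log(2).
value_at_zero = mean_log_likelihood([0.0, 0.0], 0.0, rows, labels)
```

Any fitted model should score above the zero-coefficient baseline of -log(2), roughly -0.693, on the data it was fitted to, which gives a quick sanity check on an objective value.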

The regularization term in Problem 2 was subtracted from the log-likelihood function to improve the out-of-sample performance of the regression model. Regularization is widely used in data-mining applications; see, for instance, Shi et al. (2008). For regularization we used the “polynom_abs” function, which is a standard PSG function. The coefficients of this polynomial absolute function were obtained with a steepest descent algorithm that optimizes out-of-sample performance.
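To make the effect of the absolute-value penalty concrete, here is a minimal sketch that maximizes an average log-likelihood minus a weighted sum of absolute coefficient values. It uses proximal gradient ascent, a textbook method swapped in for illustration and not the PSG solver. The function names, the unpenalized intercept, the penalty weights, and the synthetic data are all assumptions. The weights here are picked by hand rather than tuned for out-of-sample performance.

```python
import math
import random

def logistic(score):
    """Numerically stable logistic function."""
    if score >= 0:
        return 1.0 / (1.0 + math.exp(-score))
    e = math.exp(score)
    return e / (1.0 + e)

def soft_threshold(value, threshold):
    """Proximal step for threshold * |value|: shrink the value toward zero."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0

def fit_l1_logistic(rows, labels, weights, step=0.5, iterations=500):
    """Maximize mean log-likelihood minus sum_i weights[i] * |coeffs[i]|.

    The intercept is left unpenalized (an assumption of this sketch).
    """
    n, k = len(rows), len(rows[0])
    coeffs, intercept = [0.0] * k, 0.0
    for _ in range(iterations):
        grad, grad0 = [0.0] * k, 0.0
        for x, y in zip(rows, labels):
            score = intercept + sum(c * v for c, v in zip(coeffs, x))
            resid = y - logistic(score)  # gradient of one log-likelihood term
            grad0 += resid / n
            for j in range(k):
                grad[j] += resid * x[j] / n
        intercept += step * grad0
        # Gradient step on the smooth part, then shrinkage for the penalty.
        coeffs = [soft_threshold(c + step * g, step * w)
                  for c, g, w in zip(coeffs, grad, weights)]
    return coeffs, intercept

# Synthetic data (NOT the case-study data): 3 standardized factors, 200 observations.
random.seed(0)
rows = [[random.gauss(0, 1) for _ in range(3)] for _ in range(200)]
labels = [1 if random.random() < logistic(1.5 * x[0] - 1.0 * x[1]) else 0
          for x in rows]

plain, _ = fit_l1_logistic(rows, labels, weights=[0.0, 0.0, 0.0])
shrunk, _ = fit_l1_logistic(rows, labels, weights=[0.05, 0.05, 0.05])
zeroed, _ = fit_l1_logistic(rows, labels, weights=[10.0, 10.0, 10.0])
```

Comparing `plain`, `shrunk`, and `zeroed` shows the trade-off: moderate weights pull coefficients toward zero, and sufficiently large weights remove every factor from the model.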

The constraint on cardinality in Problem 3 was used to reduce the number of factors and improve the out-of-sample performance of the regression model.
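With only 6 factors and at most 4 allowed, the cardinality-constrained problem can be sketched by brute force: fit the model on each of the 15 four-factor subsets and keep the best in-sample log-likelihood. This illustrates the problem statement only and is not how PSG handles “cardn”. The function names, the gradient-ascent fit, and the synthetic data (in which only factors 0 and 1 matter) are assumptions.

```python
import itertools
import math
import random

def logistic(score):
    """Numerically stable logistic function."""
    if score >= 0:
        return 1.0 / (1.0 + math.exp(-score))
    e = math.exp(score)
    return e / (1.0 + e)

def fit_subset(rows, labels, subset, step=0.5, iterations=200):
    """Gradient ascent on the mean log-likelihood using only factors in `subset`."""
    n = len(rows)
    coeffs = {j: 0.0 for j in subset}
    intercept = 0.0
    for _ in range(iterations):
        grad = {j: 0.0 for j in subset}
        grad0 = 0.0
        for x, y in zip(rows, labels):
            resid = y - logistic(intercept + sum(coeffs[j] * x[j] for j in subset))
            grad0 += resid / n
            for j in subset:
                grad[j] += resid * x[j] / n
        intercept += step * grad0
        for j in subset:
            coeffs[j] += step * grad[j]
    # Mean log-likelihood at the fitted point, in a numerically stable form.
    total = 0.0
    for x, y in zip(rows, labels):
        score = intercept + sum(coeffs[j] * x[j] for j in subset)
        total += y * score - (max(score, 0.0) + math.log1p(math.exp(-abs(score))))
    return total / n, coeffs

def best_subset(rows, labels, max_factors):
    """Try every subset with exactly max_factors factors and keep the best fit.

    Smaller subsets are skipped: adding a factor never lowers the maximum
    in-sample log-likelihood.
    """
    best = None
    for subset in itertools.combinations(range(len(rows[0])), max_factors):
        value, coeffs = fit_subset(rows, labels, subset)
        if best is None or value > best[0]:
            best = (value, subset, coeffs)
    return best

# Synthetic data (NOT the case-study data): 6 factors, 200 observations.
random.seed(1)
rows = [[random.gauss(0, 1) for _ in range(6)] for _ in range(200)]
labels = [1 if random.random() < logistic(2.0 * x[0] - 1.5 * x[1]) else 0
          for x in rows]

value, subset, coeffs = best_subset(rows, labels, max_factors=4)
```

Enumeration is practical here only because the number of subsets is small; for many factors the subset count grows combinatorially and a dedicated solver is needed.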

Problem 4 is the 4-fold cross-validation of the log-likelihood maximization from Problem 1. In each pass we selected ¾ of the data as the in-sample dataset, on which we calibrated the model. We then tested the model on the remaining ¼ of the data (the out-of-sample dataset) to observe how well it predicts the probability of Cesarean section.
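The splitting step can be sketched as follows. The function `crossvalidation_pairs` is a hypothetical helper written for this example, loosely modeled on the description of the PSG `crossvalidation(N,Matrix)` operation. How PSG actually assigns rows to folds is not stated here, so the contiguous blocks and the handling of the remainder are assumptions.

```python
def crossvalidation_pairs(n_folds, rows):
    """Split rows into n_folds pairs (in_sample, out_of_sample).

    Each pair is complementary: the out-of-sample part is one contiguous
    block of rows and the in-sample part is everything else.
    """
    base, extra = divmod(len(rows), n_folds)
    pairs, start = [], 0
    for fold in range(n_folds):
        # The first `extra` folds receive one additional row.
        stop = start + base + (1 if fold < extra else 0)
        out_of_sample = rows[start:stop]
        in_sample = rows[:start] + rows[stop:]
        pairs.append((in_sample, out_of_sample))
        start = stop
    return pairs

# Row indices stand in for the 12,690 observations of the case study.
rows = list(range(12690))
pairs = crossvalidation_pairs(4, rows)
sizes = [(len(in_sample), len(out_of_sample)) for in_sample, out_of_sample in pairs]
```

With 12,690 rows this split gives in-sample sets of 9,517 or 9,518 rows, in line with the 9,517 scenarios reported in the tables. In each pass a model would be calibrated on the in-sample part and its log-likelihood evaluated on the out-of-sample part.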

References
• Chen, G., Uryasev, S., and T.K. Young (2004): On the prediction of the cesarean delivery risk in a large private practice. American Journal of Obstetrics and Gynecology, 191, 617-25.
• Shi, W., Wahba, G., Wright, S., Lee, K., Klein, R., and B. Klein (2008): LASSO-Patternsearch algorithm with application to ophthalmology and genomic data. Statistics and Its Interface, 1(1), 137-153.

Dr. Stan Uryasev
