# Case Study: Logistic Regression and Regularized Logistic Regression Applied to Estimating the Probability of Cesarean Section

**Case study background and problem formulations**

——————————————————————–

Calculator of the Risk of Cesarean Section

The calculator linked above estimates the probability of Cesarean Section and the probability of Cephalopelvic Disproportion/Failure to Progress (CPD). It is based on the paper: Chen, G., Uryasev, S., and T. Young. On Prediction of the Cesarean Delivery Risk in a Large Private Practice. American Journal of Obstetrics and Gynecology, 191/2, 2004, 624-632.

——————————————————————–

**Problem 1: maximizing log-likelihood**

maximize logexp_sum (maximizing log-likelihood)

Value: logistic

——————————————————————–

logexp_sum = log-likelihood function for logistic regression (Logarithms Exponents Sum)

logistic = Logistic calculates values of logistic function for every observation (scenario)
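The two functions defined above can be sketched in plain Python. This is a minimal illustration, not PSG's implementation: the function names, the toy data, and the convention that the first column is a constant 1.0 (intercept) are all assumptions. The sketch also averages the log-likelihood over scenarios, which the magnitude of the reported objective values suggests but the source does not state.

```python
import math

def logistic(z):
    """Logistic function 1 / (1 + exp(-z)), evaluated without overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def softplus(z):
    """log(1 + exp(z)), evaluated without overflow."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def avg_log_likelihood(beta, X, y):
    """Average log-likelihood of 0/1 outcomes y under a logistic model.

    Each row of X holds the factor values of one scenario. The first entry
    is assumed to be 1.0 so that beta[0] acts as the intercept.
    """
    total = 0.0
    for row, label in zip(X, y):
        z = sum(b * v for b, v in zip(beta, row))
        # Equals label*log(p) + (1 - label)*log(1 - p) with p = logistic(z).
        total += label * z - softplus(z)
    return total / len(y)

# Hypothetical toy data: intercept column plus one factor.
X = [[1.0, -1.2], [1.0, 0.3], [1.0, 0.8], [1.0, 2.1]]
y = [0, 0, 1, 1]
beta = [-0.5, 1.5]

# Fitted probability for every scenario, and the objective at beta.
probabilities = [logistic(sum(b * v for b, v in zip(beta, row))) for row in X]
objective = avg_log_likelihood(beta, X, y)
```

With all coefficients set to zero, every scenario gets probability 0.5 and the average log-likelihood equals -log(2); any useful fit must score above that baseline.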

——————————————————————–

Data and solution in Run-File Environment

| Dataset | Files | # of Variables | # of Scenarios | Objective Value | Solving Time, PC 3.14GHz (sec) |
|---|---|---|---|---|---|
| Dataset1 | Problem Statement, Data, Solution | 6 | 12,690 | -0.495793 | 0.08 |

| Dataset | Files | # of Variables | # of Scenarios | Objective Value | Solving Time, PC 3.50GHz (sec) |
|---|---|---|---|---|---|
| Dataset1 | Matlab code, Data, Solution | 6 | 12,690 | -0.495793 | 0.05 |

| Dataset | Files | # of Variables | # of Scenarios | Objective Value | Solving Time, PC 3.50GHz (sec) |
|---|---|---|---|---|---|
| Dataset1 | R code, Data | 6 | 12,690 | -0.495793 | 0.05 |

**Problem 2: maximizing regularized log-likelihood**

maximize logexp_sum – polynom_abs (maximizing regularized log-likelihood)

Value: logistic

——————————————————————–

logexp_sum = log-likelihood function for logistic regression (Logarithms Exponents Sum)

polynom_abs = Polynomial Absolute

logistic = Logistic calculates values of logistic function for every observation (scenario)
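As a rough illustration of the regularized objective, the sketch below treats the penalty as a weighted sum of absolute values of the coefficients (an assumption based on the name "Polynomial Absolute") and maximizes the penalized average log-likelihood with a proximal gradient step. The solver, function names, penalty weights, and toy data are all stand-ins chosen for the example; the source does not describe how PSG solves the problem.

```python
import math

def logistic(z):
    # tanh form of 1 / (1 + exp(-z)); stable for large |z|.
    return 0.5 * (1.0 + math.tanh(0.5 * z))

def avg_log_likelihood(beta, X, y):
    total = 0.0
    for row, t in zip(X, y):
        z = sum(b * v for b, v in zip(beta, row))
        total += t * z - (max(z, 0.0) + math.log1p(math.exp(-abs(z))))
    return total / len(y)

def gradient(beta, X, y):
    """Gradient of the average log-likelihood with respect to beta."""
    g = [0.0] * len(beta)
    for row, t in zip(X, y):
        r = t - logistic(sum(b * v for b, v in zip(beta, row)))
        for i, v in enumerate(row):
            g[i] += r * v
    return [gi / len(y) for gi in g]

def penalty(beta, weights):
    """Weighted sum of absolute values (assumed form of the penalty)."""
    return sum(w * abs(b) for w, b in zip(weights, beta))

def soft_threshold(v, t):
    """Shrink v toward zero by t; values within t of zero become zero."""
    return math.copysign(max(abs(v) - t, 0.0), v)

def fit_l1(X, y, weights, step=0.5, iters=500):
    """Proximal gradient ascent on avg_log_likelihood - penalty."""
    beta = [0.0] * len(X[0])
    for _ in range(iters):
        g = gradient(beta, X, y)
        beta = [soft_threshold(b + step * gi, step * w)
                for b, gi, w in zip(beta, g, weights)]
    return beta

# Hypothetical toy data: intercept column plus two factors.
X = [[1, -1.0, 0.5], [1, -0.5, -0.4], [1, 0.2, 0.9], [1, 0.4, -0.7],
     [1, 0.9, 0.1], [1, 1.0, -0.3], [1, -0.8, 0.2], [1, 0.6, 0.6]]
y = [0, 0, 1, 0, 1, 1, 1, 0]

light = [0.0, 0.05, 0.05]   # intercept left unpenalized (a design choice)
heavy = [0.0, 10.0, 10.0]
beta_light = fit_l1(X, y, light)
beta_heavy = fit_l1(X, y, heavy)
```

A large enough weight forces the corresponding coefficient to exactly zero, which is why this kind of penalty is often used to simplify a model as well as to stabilize it.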

——————————————————————–

Data and solution in Run-File Environment

| Dataset | Files | # of Variables | # of Scenarios | Objective Value | Solving Time, PC 3.14GHz (sec) |
|---|---|---|---|---|---|
| Dataset1 | Problem Statement, Data, Solution | 6 | 12,690 | -0.498204 | 0.05 |

| Dataset | Files | # of Variables | # of Scenarios | Objective Value | Solving Time, PC 3.50GHz (sec) |
|---|---|---|---|---|---|
| Dataset1 | Matlab code, Data, Solution | 6 | 12,690 | -0.496348 | 0.04 |

| Dataset | Files | # of Variables | # of Scenarios | Objective Value | Solving Time, PC 3.50GHz (sec) |
|---|---|---|---|---|---|
| Dataset1 | R code, Data | 6 | 12,690 | -0.496348 | 0.04 |

**Problem 3: maximizing log-likelihood under cardinality constraint**

maximize logexp_sum (maximizing log-likelihood)

Constraint: cardn <= 4

Solver: precision = 9

Value: logistic

——————————————————————–

logexp_sum = log-likelihood function for logistic regression (Logarithms Exponents Sum)

cardn = Cardinality

logistic = Logistic calculates values of logistic function for every observation (scenario)
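A cardinality constraint limits the number of nonzero coefficients. One simple way to see what that means is to try every allowed set of active coefficients, fit the model on each, and keep the best. The sketch below does this by brute force with a basic gradient ascent fit. The function names, toy data, and solver are assumptions made for illustration, and the sketch counts every coefficient (including the intercept column) toward the limit, which may differ from how PSG applies cardn.

```python
import math
from itertools import combinations

def logistic(z):
    return 0.5 * (1.0 + math.tanh(0.5 * z))

def avg_log_likelihood(beta, X, y):
    total = 0.0
    for row, t in zip(X, y):
        z = sum(b * v for b, v in zip(beta, row))
        total += t * z - (max(z, 0.0) + math.log1p(math.exp(-abs(z))))
    return total / len(y)

def cardinality(beta, tol=1e-9):
    """Number of coefficients that are nonzero (up to a tolerance)."""
    return sum(1 for b in beta if abs(b) > tol)

def fit(X, y, active, step=0.5, iters=300):
    """Gradient ascent that updates only the coefficients in `active`."""
    beta = [0.0] * len(X[0])
    for _ in range(iters):
        grad = {i: 0.0 for i in active}
        for row, t in zip(X, y):
            r = t - logistic(sum(beta[i] * row[i] for i in active))
            for i in active:
                grad[i] += r * row[i]
        for i in active:
            beta[i] += step * grad[i] / len(y)
    return beta

def best_subset(X, y, max_card):
    """Try every support of size <= max_card and keep the best fit."""
    best_beta, best_value = None, -math.inf
    for k in range(max_card + 1):
        for active in combinations(range(len(X[0])), k):
            beta = fit(X, y, active)
            value = avg_log_likelihood(beta, X, y)
            if value > best_value:
                best_beta, best_value = beta, value
    return best_beta, best_value

# Hypothetical toy data with three coefficients, at most two allowed.
X = [[1, -1.0, 0.5], [1, -0.5, -0.4], [1, 0.2, 0.9], [1, 0.4, -0.7],
     [1, 0.9, 0.1], [1, 1.0, -0.3], [1, -0.8, 0.2], [1, 0.6, 0.6]]
y = [0, 0, 1, 0, 1, 1, 1, 0]
best_beta, best_value = best_subset(X, y, max_card=2)
```

Enumeration is practical only for small problems. With 6 coefficients and at most 4 nonzero there are 1 + 6 + 15 + 20 + 15 = 57 supports to try; the count grows quickly as the number of factors increases.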

——————————————————————–

Data and solution in Run-File Environment

| Dataset | Files | # of Variables | # of Scenarios | Objective Value | Solving Time, PC 3.14GHz (sec) |
|---|---|---|---|---|---|
| Dataset1 | Problem Statement, Data, Solution | 6 | 12,690 | -0.497135 | 0.35 |

| Dataset | Files | # of Variables | # of Scenarios | Objective Value | Solving Time, PC 3.50GHz (sec) |
|---|---|---|---|---|---|
| Dataset1 | Matlab code, Data, Solution | 6 | 12,690 | -0.497134 | <0.1 |

| Dataset | Files | # of Variables | # of Scenarios | Objective Value | Solving Time, PC 3.50GHz (sec) |
|---|---|---|---|---|---|
| Dataset1 | R code, Data | 6 | 12,690 | -0.497134 | <0.1 |

**Problem 4: 4-fold cross-validation (4 in-sample and 4 out-of-sample datasets) for maximization of the log-likelihood function**

4-fold crossvalidation

Maximize logexp_sum

Value:

logistic (function Logistic on the in-sample data)

logistic (function Logistic on the out-of-sample data)

——————————————————————–

crossvalidation(N,Matrix) = matrix operation splits input Matrix into N pairs of complementary sub-matrices

logexp_sum = log-likelihood function for logistic regression (Logarithms Exponents Sum)

logistic = Logistic calculates values of logistic function for every observation (scenario)
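The splitting operation described above can be sketched as follows. How PSG assigns rows to the sub-matrices is not stated in the source, so the use of contiguous blocks here is an assumption, and the Python function is only an illustration that borrows the operation's name.

```python
def crossvalidation(n_folds, matrix):
    """Split the rows of `matrix` into n_folds complementary pairs.

    Returns a list of (in_sample, out_of_sample) pairs. Within each pair
    the two parts are disjoint and together contain every row.
    """
    n = len(matrix)
    # Integer fold boundaries; fold sizes differ by at most one row.
    bounds = [(k * n) // n_folds for k in range(n_folds + 1)]
    pairs = []
    for lo, hi in zip(bounds, bounds[1:]):
        out_of_sample = matrix[lo:hi]
        in_sample = matrix[:lo] + matrix[hi:]
        pairs.append((in_sample, out_of_sample))
    return pairs

# Hypothetical 10-row matrix split into 4 pairs.
rows = [[i, i * i] for i in range(10)]
pairs = crossvalidation(4, rows)
sizes = [(len(a), len(b)) for a, b in pairs]
```

Each row lands in exactly one out-of-sample part, so across the four passes every observation is used once for testing and three times for calibration.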

——————————————————————–

Data and solution in Run-File Environment

| Dataset | Files | # of Variables | # of Scenarios | Objective Value | Solving Time, PC 3.14GHz (sec) |
|---|---|---|---|---|---|
| Dataset1 | Cycle statement, Data, Solution | 6 | 9,517 | -0.496 | 0.15 |
| Dataset2 | | 6 | 9,517 | -0.495 | 0.18 |
| Dataset3 | | 6 | 9,517 | -0.498 | 0.05 |
| Dataset4 | | 6 | 9,517 | -0.494 | 0.08 |

| Dataset | Files | # of Variables | # of Scenarios | Objective Value | Solving Time, PC 3.50GHz (sec) |
|---|---|---|---|---|---|
| Dataset1 | Matlab code, Data, Solution | 6 | 9,517 | -0.496 | 0.11 |
| Dataset2 | | 6 | 9,517 | -0.495 | 0.14 |
| Dataset3 | | 6 | 9,517 | -0.498 | 0.05 |
| Dataset4 | | 6 | 9,517 | -0.494 | 0.07 |

| Dataset | Files | # of Variables | # of Scenarios | Objective Value | Solving Time, PC 3.50GHz (sec) |
|---|---|---|---|---|---|
| Dataset1 | R code, Data | 6 | 9,517 | -0.496 | 0.11 |
| Dataset2 | | 6 | 9,517 | -0.495 | 0.14 |
| Dataset3 | | 6 | 9,517 | -0.498 | 0.05 |
| Dataset4 | | 6 | 9,517 | -0.494 | 0.07 |

**CASE STUDY SUMMARY**

This case study finds an optimal estimate of the cesarean section rate in a population of women. The risk of difficult labor is described by a probabilistic model that depends on measurable demographic factors, and we evaluated the effects of these factors on the probability of Cesarean section. The case study considers six primary factors: age, height, weight, maternal weight gain, gestational age, and birth weight. The background for this case study is described in Chen et al. (2004).

We considered four formulations of the logistic regression optimization problem:

• Problem 1. Maximization of the log-likelihood function (“plain vanilla” logistic regression).

• Problem 2. Maximization of the log-likelihood function minus an additional regularization term (regularized logistic regression).

• Problem 3. Maximization of the log-likelihood function subject to a constraint on cardinality.

• Problem 4. Cross-validation applied to maximization of the log-likelihood function.

Problem 1 was implemented in PSG by maximizing the log-likelihood function, which is a standard PSG function (“logexp_sum”). This problem formulation was considered in Chen et al. (2004).

The regularization term in Problem 2 was subtracted from the log-likelihood function to improve the out-of-sample performance of the regression model. Regularization is widely used in data-mining applications; see, for instance, Shi et al. (2008). For regularization we used the “polynom_abs” function, which is a standard PSG function. The coefficients of this polynomial absolute function were obtained with a steepest descent algorithm that optimizes out-of-sample performance.

The constraint on cardinality in Problem 3 was used to reduce the number of factors and improve the out-of-sample performance of the regression model.

Problem 4 is the 4-fold cross-validation of the log-likelihood maximization from Problem 1. In each pass we selected ¾ of the data as the in-sample dataset, on which we calibrated the model. We then tested the calibrated model on the remaining ¼ of the data (the out-of-sample dataset) to observe how well it predicts the probability of Cesarean section.
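The calibrate-then-test cycle can be sketched end to end as below. Everything specific here is an assumption made for the example: the synthetic data, the contiguous fold layout, the gradient ascent solver standing in for PSG, and the use of the average log-likelihood as the out-of-sample score.

```python
import math
import random

def logistic(z):
    return 0.5 * (1.0 + math.tanh(0.5 * z))

def avg_log_likelihood(beta, X, y):
    total = 0.0
    for row, t in zip(X, y):
        z = sum(b * v for b, v in zip(beta, row))
        total += t * z - (max(z, 0.0) + math.log1p(math.exp(-abs(z))))
    return total / len(y)

def fit(X, y, step=0.5, iters=300):
    """Plain gradient ascent on the average log-likelihood."""
    beta = [0.0] * len(X[0])
    for _ in range(iters):
        grad = [0.0] * len(beta)
        for row, t in zip(X, y):
            r = t - logistic(sum(b * v for b, v in zip(beta, row)))
            for i, v in enumerate(row):
                grad[i] += r * v
        beta = [b + step * g / len(y) for b, g in zip(beta, grad)]
    return beta

def cross_validate(X, y, n_folds=4):
    """Return (in-sample, out-of-sample) average log-likelihood per fold."""
    n = len(y)
    bounds = [(k * n) // n_folds for k in range(n_folds + 1)]
    results = []
    for lo, hi in zip(bounds, bounds[1:]):
        X_in, y_in = X[:lo] + X[hi:], y[:lo] + y[hi:]   # about 3/4 of rows
        X_out, y_out = X[lo:hi], y[lo:hi]               # remaining 1/4
        beta = fit(X_in, y_in)                          # calibrate
        results.append((avg_log_likelihood(beta, X_in, y_in),
                        avg_log_likelihood(beta, X_out, y_out)))
    return results

# Synthetic data: intercept plus two factors, outcomes drawn from a
# logistic model with made-up coefficients.
rng = random.Random(0)
X, y = [], []
for _ in range(80):
    row = [1.0, rng.uniform(-1, 1), rng.uniform(-1, 1)]
    p = logistic(0.3 + 1.5 * row[1] - 1.0 * row[2])
    X.append(row)
    y.append(1 if rng.random() < p else 0)

results = cross_validate(X, y)
for k, (v_in, v_out) in enumerate(results, start=1):
    print(f"fold {k}: in-sample {v_in:.3f}, out-of-sample {v_out:.3f}")
```

Comparing the two numbers per fold gives a rough sense of overfitting: an out-of-sample value far below the in-sample value suggests the calibrated model does not generalize well.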

**References**

• Chen, G., Uryasev, S., and T.K. Young (2004): On the prediction of the cesarean delivery risk in a large private practice. American Journal of Obstetrics and Gynecology, 191, 617-25.

• Shi, W., Wahba, G., Wright, S., Lee, K., Klein, R., and Klein, B. (2008): LASSO-Patternsearch algorithm with application to ophthalmology and genomic data. Statistics and Its Interface, 1(1), 137-153.