Case Study: Classification by Maximizing Area Under ROC Curve (AUC)

Case study background and problem formulations

Instructions for optimization with PSG Run-File, PSG MATLAB Toolbox, PSG MATLAB Subroutines and PSG R.

PROBLEM 1: problem_one_Loss
Minimize Pr_pen (minimizing probability that Loss function is below zero)
Box constraints (constraints on decision variables)
——————————————————————

Dataset: # of Variables = 6; # of Scenarios = 10,000; Objective Value = 0.22352; Solving Time (sec, PC 3.14GHz) = 0.38
Environments:
Run-File: Problem Statement, Data, Solution
Matlab Toolbox: Data
Matlab Subroutines: Matlab Code, Data
R: R Code, Data

PROBLEM 2: problem_diff_of_two_Losses
Minimize Pr_pen (minimizing probability that difference of two Loss functions is below zero)
Box constraints (constraints on decision variables)
——————————————————————

Dataset: # of Variables = 6; # of Scenarios = 100; Objective Value = 0.2354; Solving Time (sec, PC 3.14GHz) = 0.01
Environments:
Run-File: Problem Statement, Data, Solution
Matlab Toolbox: Data
Matlab Subroutines: Matlab Code, Data
R: R Code, Data

 

PROBLEM 3 SMALL: maximizing log-likelihood in logistic regression (small data set)
Maximize Logexp_sum (maximizing log-likelihood of logistic regression)
Calculate Pr_pen (calculating probability of difference of two random linear functions on optimal logistic regression point)
——————————————————————

Dataset: # of Variables = 7; # of Scenarios = 2,538; Objective Value = -0.494775; Solving Time (sec, PC 3.14GHz) = 0.02
Environments:
Run-File: Problem Statement, Data, Solution
Matlab Toolbox: Data
Matlab Subroutines: Matlab Code, Data
R: R Code, Data

 

PROBLEM 4 SMALL: problem_diff_of_two_Losses (small data set)
Minimize Pr_pen (minimizing probability that difference of two Loss functions is below zero)
subject to
Linear = b (linear constraint)
Initial point = optimal logistic regression point (obtained in the Problem 3 small)
——————————————————————

Dataset: # of Variables = 6; # of Scenarios = 552*1,986 = 1,096,272; Objective Value = 0.3298798; Solving Time (sec, PC 3.14GHz) = 0.57
Environments:
Run-File: Problem Statement, Data, Solution
Matlab Toolbox: Data
Matlab Subroutines: Matlab Code, Data
R: R Code, Data

 

PROBLEM 5 SMALL: problem_one_Loss (small data set)
Minimize Pr_pen (minimizing probability that Loss function is below zero)
subject to
Linear = b (linear constraint)
Initial point = optimal logistic regression point (obtained in the Problem 3 small)
——————————————————————

Dataset: # of Variables = 6; # of Scenarios = 1,096,272; Objective Value = 0.32975; Solving Time (sec, PC 3.14GHz) = 355.45
Environments:
Run-File: Problem Statement, Data, Solution
Matlab Toolbox: Data
Matlab Subroutines: Matlab Code, Data

 

PROBLEM 3 LARGE: maximizing log-likelihood in logistic regression (large data set)
Maximize Logexp_sum (maximizing log-likelihood of logistic regression)
Calculate Pr_pen (calculating probability of difference of two random linear functions on optimal logistic regression point)
——————————————————————

Dataset: # of Variables = 7; # of Scenarios = 12,690; Objective Value = -0.495793; Solving Time (sec, PC 3.14GHz) = 0.05
Environments:
Run-File: Problem Statement, Data, Solution
Matlab Toolbox: Data
Matlab Subroutines: Matlab Code, Data
R: R Code, Data

 

PROBLEM 4 LARGE: problem_diff_of_two_Losses (large data set)
Minimize Pr_pen (minimizing probability that difference of two Loss functions is below zero)
subject to
Linear = b (linear constraint)
Initial point = optimal logistic regression point (obtained in the Problem 3 large)
——————————————————————

Dataset: # of Variables = 6; # of Scenarios = 2,789*9,901 = 27,613,889; Objective Value = 0.331084; Solving Time (sec, PC 3.14GHz) = 4.34
Environments:
Run-File: Problem Statement, Data, Solution
Matlab Toolbox: Data
Matlab Subroutines: Matlab Code, Data
R: R Code, Data

 

PROBLEM 5 AGGREGATED: problem_one_Loss (aggregated data set)
Minimize Pr_pen (minimizing probability that Loss function is below zero)
subject to
Linear = b (linear constraint)
Initial point = optimal logistic regression point (obtained in the Problem 3 large)
Calculate Pr_pen (calculating probability of difference of two random linear functions on optimal point)
——————————————————————

Dataset: # of Variables = 6; # of Scenarios = 185,944; Objective Value = 0.332107; Solving Time (sec, PC 3.14GHz) = 6.86
Environments:
Run-File: Problem Statement, Data, Solution
Matlab Toolbox: Data
Matlab Subroutines: Matlab Code, Data
R: R Code, Data

 

CASE STUDY SUMMARY

In classification problems, AUC (Area Under the ROC Curve) is a popular measure for evaluating the quality of classifiers: a classifier that attains a higher AUC is preferable to one with a lower AUC. This motivates maximizing AUC directly to obtain a classifier. Such direct maximization is rarely applied because of numerical difficulties. Usually, computationally tractable methods (e.g., logistic regression) are used to fit the classifier, and its quality is then measured by AUC. This case study maximizes AUC directly. A related approach is used in [1], which maximizes a surrogate of AUC obtained by replacing the 0-1 step function that appears in a representation of AUC with a sigmoid function, so that the derivative can be calculated. In contrast with [1], this case study maximizes AUC itself. The associated maximization problem has two difficulties: (a) the objective function is nonconvex; (b) the number of scenarios may be so large that a standard PC cannot hold all of them in memory at once. We represent AUC as follows:

AUC = 1 - probability((difference of two independent random linear functions) <= 0).
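
As a minimal illustration of this representation (plain Python, not PSG code), the sketch below estimates AUC for a fixed linear classifier. The names `linear_score`, `auc_from_matrices`, `w`, `positives`, and `negatives` are hypothetical, and the toy data is made up. It assumes every pair of one row from each matrix is an equally likely scenario, and it counts ties as `<= 0`, as in the formula above.

```python
def linear_score(w, x):
    """Value of the linear function w . x for one scenario row x."""
    return sum(wi * xi for wi, xi in zip(w, x))


def auc_from_matrices(w, positives, negatives):
    """Estimate AUC = 1 - P(score(positive) - score(negative) <= 0).

    Each (positive row, negative row) pair is treated as one equally
    likely scenario, i.e. the two rows are drawn independently.
    """
    pos_scores = [linear_score(w, x) for x in positives]
    neg_scores = [linear_score(w, y) for y in negatives]
    bad_pairs = sum(1 for p in pos_scores for n in neg_scores if p - n <= 0)
    return 1.0 - bad_pairs / (len(pos_scores) * len(neg_scores))


# Toy data (assumed for illustration): 3 positive rows and 2 negative rows.
w = [1.0, -0.5]
positives = [[2.0, 0.0], [1.0, 1.0], [0.0, 0.0]]
negatives = [[0.5, 1.0], [1.0, 0.0]]
auc = auc_from_matrices(w, positives, negatives)
```

With this toy data, 3 of the 6 pairs have a nonpositive difference, so the estimate is 0.5. Minimizing the probability term over `w` is the part that PSG handles in the problems above.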

PSG can calculate and minimize the probability of the difference of two linear functions with independent random vectors of coefficients. This case study maximizes AUC by minimizing the PSG probability function pr_pen. Two equivalent variants are considered: 1) Problem 1 uses the difference of two independent random linear functions represented by two different matrices of scenarios (with the same column headers); 2) Problem 2 uses a single matrix of scenarios generated manually by taking differences of linear functions from two different matrices (this matrix can be created only for small dimensions, because its number of rows equals the product of the numbers of rows in the first and second matrices). Problems 1 and 2 are mathematically equivalent but have different data inputs.
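
The equivalence of the two data inputs can be sketched in plain Python (this is an illustration, not PSG code). The function names and toy matrices below are hypothetical. Integer toy data is used so that both variants give exactly the same floating-point result.

```python
def difference_matrix(first, second):
    """All row differences first[i] - second[j]: len(first) * len(second) rows."""
    return [[a - b for a, b in zip(x, y)] for x in first for y in second]


def prob_two_matrices(w, first, second):
    """P(w.x - w.y <= 0) with x and y drawn independently from the two matrices."""
    s1 = [sum(wi * xi for wi, xi in zip(w, x)) for x in first]
    s2 = [sum(wi * yi for wi, yi in zip(w, y)) for y in second]
    return sum(1 for a in s1 for b in s2 if a - b <= 0) / (len(s1) * len(s2))


def prob_one_matrix(w, rows):
    """P(w.d <= 0) over the rows d of a single matrix of scenarios."""
    values = [sum(wi * di for wi, di in zip(w, d)) for d in rows]
    return sum(1 for v in values if v <= 0) / len(values)


# Toy data (assumed for illustration).
w = [2, -1]
first = [[1, 0], [0, 1], [1, 1]]
second = [[0, 0], [1, 2]]

diff_rows = difference_matrix(first, second)   # 3 * 2 = 6 rows
p_two = prob_two_matrices(w, first, second)    # two-matrix input
p_one = prob_one_matrix(w, diff_rows)          # single difference-matrix input

# Row count of the difference matrix for the small data set in this case study.
rows_small = 552 * 1986
```

Both variants give the same probability, but the single-matrix input grows multiplicatively: the 552 and 1,986 rows of the small data set already produce 1,096,272 difference rows.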

Problem 3 solves a logistic regression problem by maximizing the PSG logexp_sum function, using a Matrix of Scenarios with a binary “benchmark” (dependent variable). This Matrix of Scenarios is then divided into two Matrices of Scenarios: the first contains the rows with benchmark value 1, and the second contains the rows with benchmark value 0. In both new matrices the benchmark values are set to zero. We then evaluate the AUC of the logistic regression classifier by calculating the probability (pr_pen) of the difference of two random linear functions represented by these two matrices of scenarios. The optimal logistic regression point is used as the initial point in Problems 4 and 5.
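
The two data steps in Problem 3 can be sketched in plain Python as follows. This is a hedged illustration, not the PSG implementation: the function names are hypothetical, the explicit intercept is an assumption, and averaging the log-likelihood over scenarios is also an assumption, since the exact scaling of logexp_sum is not stated here.

```python
import math


def split_by_benchmark(rows, benchmarks):
    """Split scenario rows into the benchmark-1 group and the benchmark-0 group."""
    ones = [x for x, b in zip(rows, benchmarks) if b == 1]
    zeros = [x for x, b in zip(rows, benchmarks) if b == 0]
    return ones, zeros


def mean_log_likelihood(w, intercept, rows, benchmarks):
    """Average logistic log-likelihood of b*z - log(1 + exp(z)), z = intercept + w.x.

    Averaging (rather than summing) is an assumption made for this sketch.
    """
    total = 0.0
    for x, b in zip(rows, benchmarks):
        z = intercept + sum(wi * xi for wi, xi in zip(w, x))
        # log(1 + exp(z)), computed in a way that avoids overflow for large |z|
        softplus = max(z, 0.0) + math.log1p(math.exp(-abs(z)))
        total += b * z - softplus
    return total / len(rows)


# Toy data (assumed for illustration).
rows = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
benchmarks = [1, 0, 1]
ones, zeros = split_by_benchmark(rows, benchmarks)
ll_at_zero = mean_log_likelihood([0.0, 0.0], 0.0, rows, benchmarks)
```

At the zero point every scenario has predicted probability 1/2, so `ll_at_zero` equals -log(2), a useful baseline when reading the log-likelihood values of a fitted model.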

Problem 4 maximizes AUC by minimizing the PSG probability function pr_pen, using the difference of two random linear functions represented by two different matrices of scenarios, subject to a linear constraint. It uses the optimal logistic regression point (obtained in Problem 3) as the initial point.

Problem 5 maximizes AUC by minimizing the PSG probability function pr_pen, using a single matrix of scenarios generated by taking differences of linear functions from two different matrices. The problem is solved subject to a linear constraint. It uses the optimal logistic regression point (obtained in Problem 3) as the initial point.

Problems 3-5 are solved for 3 datasets generated from one large data set.

References
[1] Miura K, Yamashita S, Eguchi S. (2010) Area Under the Curve Maximization Method in Credit Scoring. The Journal of Risk Model Validation, 4(2), pp.3-25.