Octave : logistic regression : difference between fmincg and fminunc
I often use fminunc
for a logistic regression problem. I have read on web that Andrew Ng uses fmincg
instead of fminunc
, with same arguments. The results are different, and often fmincg
is more exact, but not too much. (I am comparing the results of fmincg function fminunc against the same data)
So, my question is : what is the difference between these two functions? What algorithm does each function have implemented? (Now, I just use these functions without knowing exactly how they work).
Thanks :)
You will have to look inside the code of fmincg
because it is not part of Octave. After some search I found that it's a function file provided by the Machine Learning class of Coursera as part of the homework. Read the comments and answers on this question for a discussion about the algorithms.
In contrast to other answers here suggesting the primary difference between fmincg and fminunc is accuracy or speed, perhaps the most important difference for some applications is memory efficiency. In programming exercise 4 (i.e., Neural Network Training) of Andrew Ng's Machine Learning class at Coursera, the comment in ex4.m about fmincg is
%% =================== Part 8: Training NN ===================
% You have now implemented all the code necessary to train a neural
% network. To train your neural network, we will now use "fmincg", which
% is a function which works similarly to "fminunc". Recall that these
% advanced optimizers are able to train our cost functions efficiently as
% long as we provide them with the gradient computations.
Like the original poster, I was also curious about how the results of ex4.m might differ using fminunc instead of fmincg. So I tried to replace the fmincg call
options = optimset('MaxIter', 50);
[nn_params, cost] = fmincg(costFunction, initial_nn_params, options);
with the following call to fminunc
options = optimset('GradObj', 'on', 'MaxIter', 50);
[nn_params, cost, exit_flag] = fminunc(costFunction, initial_nn_params, options);
but got the following error message from a 32-bit build of Octave running on Windows:
error: memory exhausted or requested size too large for range of Octave's index type -- trying to return to prompt
A 32-bit build of MATLAB running on Windows provides a more detailed error message:
Error using find
Out of memory. Type HELP MEMORY for your options.
Error in spones (line 14)
[i,j] = find(S);
Error in color (line 26)
J = spones(J);
Error in sfminbx (line 155)
group = color(Hstr,p);
Error in fminunc (line 408)
[x,FVAL,~,EXITFLAG,OUTPUT,GRAD,HESSIAN] = sfminbx(funfcn,x,l,u, ...
Error in ex4 (line 205)
[nn_params, cost, exit_flag] = fminunc(costFunction, initial_nn_params, options);
The MATLAB memory command on my laptop computer reports:
Maximum possible array: 2046 MB (2.146e+09 bytes) *
Memory available for all arrays: 3402 MB (3.568e+09 bytes) **
Memory used by MATLAB: 373 MB (3.910e+08 bytes)
Physical Memory (RAM): 3561 MB (3.734e+09 bytes)
* Limited by contiguous virtual address space available.
** Limited by virtual address space available.
I previously was thinking that Professor Ng chose to use fmincg to train the ex4.m neural network (that has 400 input features, 401 including the bias input) to increase the training speed. However, now I believe his reason for using fmincg was to increase the memory efficiency enough to allow the training to be performed on 32-bit builds of Octave/MATLAB. A short discussion about the necessary work to get a 64-bit build of Octave that runs on Windows OS is here.
According to Andrew Ng himself, fmincg
is used not to get a more accurate result (remember, your cost function will be same in either case, and your hypothesis no simpler or more complex) but because it is more efficient at doing gradient descent for especially complex hypotheses. He himself seems to use fminunc
where the hypothesis have few features, but fmincg
where it has hundreds.
Why does fmincg work?
Here is a copy of the source code with comments explaining the various algorithms used. It's a doozy, as it does the same thing a child's brain does when learning to discriminate between dog and chair.
This is the Octave source for fmincg.m.
function [X, fX, i] = fmincg(f, X, options, P1, P2, P3, P4, P5)
% Minimize a continuous differentialble multivariate function. Starting point
% is given by "X" (D by 1), and the function named in the string "f", must
% return a function value and a vector of partial derivatives. The Polack-
% Ribiere flavour of conjugate gradients is used to compute search directions,
% and a line search using quadratic and cubic polynomial approximations and the
% Wolfe-Powell stopping criteria is used together with the slope ratio method
% for guessing initial step sizes. Additionally a bunch of checks are made to
% make sure that exploration is taking place and that extrapolation will not
% be unboundedly large. The "length" gives the length of the run: if it is
% positive, it gives the maximum number of line searches, if negative its
% absolute gives the maximum allowed number of function evaluations. You can
% (optionally) give "length" a second component, which will indicate the
% reduction in function value to be expected in the first line-search (defaults
% to 1.0). The function returns when either its length is up, or if no further
% progress can be made (ie, we are at a minimum, or so close that due to
% numerical problems, we cannot get any closer). If the function terminates
% within a few iterations, it could be an indication that the function value
% and derivatives are not consistent (ie, there may be a bug in the
% implementation of your "f" function). The function returns the found
% solution "X", a vector of function values "fX" indicating the progress made
% and "i" the number of iterations (line searches or function evaluations,
% depending on the sign of "length") used.
%
% Usage: [X, fX, i] = fmincg(f, X, options, P1, P2, P3, P4, P5)
%
% See also: checkgrad
%
% Copyright (C) 2001 and 2002 by Carl Edward Rasmussen. Date 2002-02-13
%
%
% (C) Copyright 1999, 2000 & 2001, Carl Edward Rasmussen
%
% Permission is granted for anyone to copy, use, or modify these
% programs and accompanying documents for purposes of research or
% education, provided this copyright notice is retained, and note is
% made of any changes that have been made.
%
% These programs and documents are distributed without any warranty,
% express or implied. As the programs were written for research
% purposes only, they have not been tested to the degree that would be
% advisable in any important application. All use of these programs is
% entirely at the user's own risk.
%
% [ml-class] Changes Made:
% 1) Function name and argument specifications
% 2) Output display
%
% Read options
if exist('options', 'var') && ~isempty(options) && isfield(options, 'MaxIter')
length = options.MaxIter;
else
length = 100;
end
RHO = 0.01; % a bunch of constants for line searches
SIG = 0.5; % RHO and SIG are the constants in the Wolfe-Powell conditions
INT = 0.1; % don't reevaluate within 0.1 of the limit of the current bracket
EXT = 3.0; % extrapolate maximum 3 times the current bracket
MAX = 20; % max 20 function evaluations per line search
RATIO = 100; % maximum allowed slope ratio
argstr = ['feval(f, X']; % compose string used to call function
for i = 1:(nargin - 3)
argstr = [argstr, ',P', int2str(i)];
end
argstr = [argstr, ')'];
if max(size(length)) == 2, red=length(2); length=length(1); else red=1; end
S=['Iteration '];
i = 0; % zero the run length counter
ls_failed = 0; % no previous line search has failed
fX = [];
[f1 df1] = eval(argstr); % get function value and gradient
i = i + (length<0); % count epochs?!
s = -df1; % search direction is steepest
d1 = -s'*s; % this is the slope
z1 = red/(1-d1); % initial step is red/(|s|+1)
while i < abs(length) % while not finished
i = i + (length>0); % count iterations?!
X0 = X; f0 = f1; df0 = df1; % make a copy of current values
X = X + z1*s; % begin line search
[f2 df2] = eval(argstr);
i = i + (length<0); % count epochs?!
d2 = df2'*s;
f3 = f1; d3 = d1; z3 = -z1; % initialize point 3 equal to point 1
if length>0, M = MAX; else M = min(MAX, -length-i); end
success = 0; limit = -1; % initialize quanteties
while 1
while ((f2 > f1+z1*RHO*d1) | (d2 > -SIG*d1)) & (M > 0)
limit = z1; % tighten the bracket
if f2 > f1
z2 = z3 - (0.5*d3*z3*z3)/(d3*z3+f2-f3); % quadratic fit
else
A = 6*(f2-f3)/z3+3*(d2+d3); % cubic fit
B = 3*(f3-f2)-z3*(d3+2*d2);
z2 = (sqrt(B*B-A*d2*z3*z3)-B)/A; % numerical error possible - ok!
end
if isnan(z2) | isinf(z2)
z2 = z3/2; % if we had a numerical problem then bisect
end
z2 = max(min(z2, INT*z3),(1-INT)*z3); % don't accept too close to limits
z1 = z1 + z2; % update the step
X = X + z2*s;
[f2 df2] = eval(argstr);
M = M - 1; i = i + (length<0); % count epochs?!
d2 = df2'*s;
z3 = z3-z2; % z3 is now relative to the location of z2
end
if f2 > f1+z1*RHO*d1 | d2 > -SIG*d1
break; % this is a failure
elseif d2 > SIG*d1
success = 1; break; % success
elseif M == 0
break; % failure
end
A = 6*(f2-f3)/z3+3*(d2+d3); % make cubic extrapolation
B = 3*(f3-f2)-z3*(d3+2*d2);
z2 = -d2*z3*z3/(B+sqrt(B*B-A*d2*z3*z3)); % num. error possible - ok!
if ~isreal(z2) | isnan(z2) | isinf(z2) | z2 < 0 % num prob or wrong sign?
if limit < -0.5 % if we have no upper limit
z2 = z1 * (EXT-1); % the extrapolate the maximum amount
else
z2 = (limit-z1)/2; % otherwise bisect
end
elseif (limit > -0.5) & (z2+z1 > limit) % extraplation beyond max?
z2 = (limit-z1)/2; % bisect
elseif (limit < -0.5) & (z2+z1 > z1*EXT) % extrapolation beyond limit
z2 = z1*(EXT-1.0); % set to extrapolation limit
elseif z2 < -z3*INT
z2 = -z3*INT;
elseif (limit > -0.5) & (z2 < (limit-z1)*(1.0-INT)) % too close to limit?
z2 = (limit-z1)*(1.0-INT);
end
f3 = f2; d3 = d2; z3 = -z2; % set point 3 equal to point 2
z1 = z1 + z2; X = X + z2*s; % update current estimates
[f2 df2] = eval(argstr);
M = M - 1; i = i + (length<0); % count epochs?!
d2 = df2'*s;
end % end of line search
if success % if line search succeeded
f1 = f2; fX = [fX' f1]';
fprintf('%s %4i | Cost: %4.6e\r', S, i, f1);
s = (df2'*df2-df1'*df2)/(df1'*df1)*s - df2; % Polack-Ribiere direction
tmp = df1; df1 = df2; df2 = tmp; % swap derivatives
d2 = df1'*s;
if d2 > 0 % new slope must be negative
s = -df1; % otherwise use steepest direction
d2 = -s'*s;
end
z1 = z1 * min(RATIO, d1/(d2-realmin)); % slope ratio but max RATIO
d1 = d2;
ls_failed = 0; % this line search did not fail
else
X = X0; f1 = f0; df1 = df0; % restore point from before failed line search
if ls_failed | i > abs(length) % line search failed twice in a row
break; % or we ran out of time, so we give up
end
tmp = df1; df1 = df2; df2 = tmp; % swap derivatives
s = -df1; % try steepest
d1 = -s'*s;
z1 = 1/(1-d1);
ls_failed = 1; % this line search failed
end
if exist('OCTAVE_VERSION')
fflush(stdout);
end
end
fprintf('\n');
fmincg uses the conjugate gradient method
If you look at the picture at this link, you will see that the CG method converges much faster than fminunc, but it assumes a number of constraints which I think are not required in the fminunc method (BFGS) (conjugate vectors vs non-conjugate vectors).
In other words, the fmincg method is faster but more coarse than fminunc, so it works better for huge matrices (many features, like thousands) vs smaller ones with up to hundreds of features. Hope this helps.