In our article How robust is logistic regression? we pointed out some basic yet deep limitations of the traditional full-step Newton-Raphson or Iteratively Reweighted Least Squares methods of solving logistic regression problems (such as in R‘s standard glm() implementation). In fact in the comments we exhibit a well posed data fitting problem that can not be fit using the traditional methods starting at the traditional (0,0) start point. And we cited an example where the traditional methods fail to compute the average from a non-zero start. The question remained: can we prove the standard methods always compute the average correctly if started at zero? It turns out they can, and the proof isn’t as messy as I anticipated.Throughout this article let q be a real number strictly between 1/2 and 1. Consider a data set of n-items (n a large integer) with `floor(n*q)`

of the items marked positive and `n-floor(n*q)`

marked negative. Fitting a logistic model as a function of only a constant is essentially using logistic regression to compute an average in a very roundabout manner.

The R code to do this is as follows:

```
```> n <- 100
> q <- 0.9
> d <- data.frame(y=rep(c(T,F),c(floor(q*n),n-floor(q*n))))
> glm(y~1,data=d,family=binomial(link='logit'))
Call: glm(formula = y ~ 1, family = binomial(link = "logit"), data = d)
Coefficients:
(Intercept)
2.197
Degrees of Freedom: 99 Total (i.e. Null); 99 Residual
Null Deviance: 65.02
Residual Deviance: 65.02 AIC: 67.02
> exp(2.197)/(1+exp(2.197))
[1] 0.8999798

And we see we correctly compute the logit() of the desired density. logit(p) is defined as log(p/(1-p)) and we have exp(logit(p))/(1+exp(logit(p))) = p for p in the open interval (0.1). Hence our check that exp(2.197)/(1+exp(2.197)) = 0.9 (to 4 decimal places). We can force this fit to fail by giving a start away from zero (but closer to zero than you would think!):

```
```> glm(y~1,data=d,family=binomial(link='logit'),start=-5)
Call: glm(formula = y ~ 1, family = binomial(link = "logit"), data = d,
start = -5)
Coefficients:
(Intercept)
3.153e+15
Degrees of Freedom: 99 Total (i.e. Null); 99 Residual
Null Deviance: 65.02
Residual Deviance: 720.9 AIC: 722.9
Warning message:
glm.fit: fitted probabilities numerically 0 or 1 occurred

And we see a fitting failure (warning messages, residual deviance greater than null deviance and astronomical coefficients).

As we showed in the comments of How robust is logistic regression? there are bounded data problems that fail from a start of zero (which is bad news, as starting at zero is standard practice). So there are definitely easy examples the standard logistic regression solvers (Newton-Raphson or Iteratively Reweighted Least Squares) will fail on (for reasons of their own; not due to rounding, linear dependencies, overflow or unbounded situations).

The starting from zero failing example is as follows (showing bad start (0,0) and good start (-4,-5)):

```
```d <- data.frame(x=c(0,0,0.001,100,-1,-1),
y=c(F,T,F,F,F,T),
wt=c(50,1,50,1,5,10))
glm(y~x,data=d,family=binomial(link='logit'),
weights=d$wt)
glm(y~x,data=d,family=binomial(link='logit'),
weights=d$wt,start=c(-4,-5))

This example is counter to common experience that: for a bounded problem the start (0,0) should work (it in fact does not work here).

Empirically practitioners don’t run into this problem as much as they run into problems due to separated or quasi-separated data (which can not be fixed by merely changing the starting point). However, the question remains: is there a non-trivial class of problems we can prove the standard methods always succeed for? For example can the standard methods always compute the average when started from zero (the traditional start)? This is an important question for two reasons. First: we want to know can we prove anything at all after so much bad news. Second: being able to compute averages has some useful consequences we will indicate later.

Let’s prove we can in fact compute an average. We take q as defined earlier and adapt the math from The Simpler Derivation of Logistic Regression to define the following function f(x):

```
```f(x) = x + (q-p)/(p*(1-p))
where p = sigmoid(x) = exp(x)/(1+exp(x))

f(x) is in fact the Newton-Raphson algorithm adapted to solve for only the constant term of a logistic regression model. We start with our first coefficient guess as zero and our improved estimates of the logit of the average are: f(0), f(f(0)), f(f(f(0))) and so on. For k=0,1,2… define f(k,x) = f(x) if k=0 and f(f(k-1,x)) otherwise. We would like to show f(k,0) converges to logit(q) as k goes to infinity.

We call out a few facts about f() (using the fact 1/2 < q < 1):

- f(x) > x for all x in the half open interval [0,logit(q)).
- f(0) = 4*(q-1/2).
- f(logit(q)) = logit(q).
- f'(x) = (2p-1)*(q-p)/(p*(1-p)) > 0 where p = exp(x)/(1+exp(x)) for all x in the half-open interval [0,logit(q)).
- f(x) – x ≥ 4 (q – exp(x)/(1+exp(x))) for all x in the half-open interval [0,logit(q)).

Facts 1 through 3 are useful, but not enough to prove much. Facts 4 and 5 not only give us enough tools to show f(k,0) converges to logit(q) (for 1/2 < q < 1), but tools enough to show it does this monotonely: f(k,0) < f(k+1,0) ≤ logit(q). The proof is now simple: consider the original function f(x) as x varies from 0 to logit(q) (remember, q > 1/2 implies logit(q) > 0). f'(x) is positive for all points in the interior of the interval [0,logit(q)]. Therefore any maximum value of f(x) on the interval [0,logit(x)] must occur at one of the endpoints (since an interior maximum would have f'(x) = 0, which is not the case). The two candidates for max(f(x)) on the inteval are therefore f(0) and f(logit(q)). f(logit(q)) = logit(q) and we just have to check that f(0) ≤ logit(q) (or 4*(q-1/2) ≤ logit(q) for all q in the open interval (1/2,1)- an exercise left to the reader) and we are done. So 0 ≤ f(x) ≤ logit(q) for all x in [0,logit(q)]. We now know inductively: f(k,0) < f(k+1,0) ≤ logit(q). Fact 5 implies f(k,0) approaches logit(q) as k goes to infinity at a reasonable rate (the rate of iteration motion isn’t too small before we get to x=logit(q)). Also, once we have x near logit(q) we would expect the usual very fast quadratic convergence that Newton-Raphson methods are chosen for.

Now we seem to have worked hard to prove very little (as is often the case with iterated systems). However, I think we can not speculate why the standard optimizers (full-step Newton-Raphson and Iteratively Reweighted Least Squares) tend to work so well in practice. We can prove that these methods can always compute averages for q > 1/2 when starting at zero (the cases q=1/2 and q < 1/2 follow by inspection and symmetry). Thus these methods can solve any one dimensional problem over indicators (variables that are always zero and one). We can then imagine that any problem that can be approximated as a system of nearly conditionally independent indicators (so many variables, but all zero one and all nearly independent of each other when conditioned on the outcome) can also be solved (as the Newton Raphson iterations will update model coefficients nearly independently). So the intuition is that the standard implementation logistic regression will be unlikely to run into difficulties for problems that have bounded range nearly conditionally independent inputs. Or in other terms: for problems that Naive Bayes does well (which it often does, even though it depends only on independently computed averages) the Newton-Raphson optimizer should also perform well (and not spoil the logistic regression). The logistic regression model can in many cases be preferred over the Naive Bayes model due to its better handling of common dependencies.

Though, it must be emphasized it would make sense for more of the common implementations to be toughened up to defend against Newton-Raphson failures without user intervention.

Categories: Mathematics Statistics

### jmount

Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.

Basically I am calling this some hard work to prove a very small result (that the standard “big hammer” logistic regression can in fact compute simple averages). The work was kind of fun because it brought me back to Polya’s “conjecture, generalization, specialization and analogy” methods (or Zeilberger’s “undetermined generalization and specialization” methods). Roughly you guess something that would make your work easy and then see if you can prove that.