### Numerical derivatives in parallel

My last post was about a fairly esoteric new language feature in python: coroutines. I wasn't reading up on them because they're whiz-bang cool (okay, not only because they're whiz-bang cool) but because I have an actual problem that they might help with. For the triple system, I find I need to do very slow numerical optimization problems. That is, the objective function takes seconds to evaluate. It can't be parallelized internally, and numerical minimization is largely a sequential operation: evaluate the function, figure out which way is downhill, go there, repeat. So this code currently uses just one core, and is almost as slow as the highly parallel Markov-Chain Monte Carlo code that does hundreds of times as much work. So obviously it would be nice to find some parallelism. I have a way to do that: parallelize gradient evaluations.

The other reason for this post is to make a point about code robustness. Getting numerical derivatives right is a notoriously difficult process, and getting them both efficient and right is even harder. The package I have been using, MINUIT, has been used by the particle physicists for decades, and they have hammered out all sorts of corner cases that a naive implementation might miss. So I'm not going to use a naive implementation, not least because my problem is numerically difficult. This means the code is not simple, and I don't want to try to restructure it manually to be more concurrent; I want the computer to handle the concurrency for me.

MINUIT does its numerical optimization using a "quasi-Newton method". That is, it tries to approximate the function as a quadratic and jump straight to the minimum. The "quasi-" is because it builds the quadratic estimate by taking gradients at successive points and updating the Hessian matrix based on the measured gradients. This is a reasonably effective method, and it spends almost all its time evaluating numerical gradients.

The function I am using takes 20-30 input parameters and returns a goodness-of-fit value (actually a log-likelihood). So computing the gradient is actually a matter of computing derivatives independently in 20-30 directions. If this could be done in parallel, a substantial speedup would be possible. Importantly, the (highly non-trivial) MINUIT code is built to permit "analytic" gradient functions to be passed in, so this is a change that can be made without reimplementing the entire algorithm.

The numerical derivative algorithm is in principle simple: in dimension d, choose a step size h and compute (f(x+h)-f(x-h))/2h. The trick is choosing the right value for h - too large and the function doesn't look like a parabola, too small and round-off error wrecks the result. MINUIT uses several tricks to find the right h. First, if you also have f(x), you can estimate the second derivative, and use it to work out how much truncation error there is in the (worse) first-order estimate (f(x+h)-f(x))/h. This error is a pessimistic estimate for the truncation error in the symmetric calculation, and it can be used to balance truncation error and round-off. Of course, if your initial h was wildly wrong, your corrected value might not be okay either, so you might have to repeat this until it stabilizes. And obviously having a good initial guess matters, so you want to keep around estimates of the gradient and second derivative from previous function evaluations. Plus you have to be careful about floating-point misbehaviour. So the whole function is kind of complicated. But here it is, converted from MINUIT code to parallel, coroutine-using form:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import concurrent.futures
import asyncio
import time
import numpy as np

@asyncio.coroutine
def pmap(f, xs, loop=None):
fs = [asyncio.async(f(*x), loop=loop) for x in xs]
yield from asyncio.wait(fs)
return [f.result() for f in fs]

class Scalar:
def __init__(self, f,
step_initial=1e-8,
f_eps=None,
step_tolerance=0.3,
dtype=float,
pool=None):
self.f = f
self.x = None
self.f_x = None
self.dtype = dtype
try:
self.step_initial = self.dtype(step_initial)
except TypeError:
self.step_initial = np.asarray(step_initial)
if f_eps is None:
self.f_eps = np.finfo(self.dtype).eps
else:
self.f_eps = self.dtype(f_eps)
self.up = 1.
self.eps = 8*np.finfo(self.dtype).eps
self.eps2 = 2*np.sqrt(np.finfo(self.dtype).eps)
self.step_tolerance = step_tolerance
if pool is None:
self.pool = concurrent.futures.ProcessPoolExecutor()
else:
self.pool = pool
self.loop = asyncio.get_event_loop()

def __call__(self, x):
if self.x is None or not np.all(x==self.x):
self.x = x
self.f_x = self.loop.run_until_complete(self._f(x))
return self.f_x

@asyncio.coroutine
def _f(self, x):
return (yield from self.loop.run_in_executor(self.pool, self.f, x))

# FIXME: make sure dirin doesn't get too small
dirin = 2*self.step_initial
if isinstance(dirin, self.dtype):
dirin = np.array([dirin]*self.n)

x = np.asarray(x)
self(x)
self.n = len(x)
raise ValueError("Dimensionality of input appears to have changed "
"from %d to %d" % (len(self._gradient_steps),
len(x)))
if False:
return np.array([
for d in range(len(x))])
else:
return np.array(self.loop.run_until_complete(

@asyncio.coroutine
dfmin = 8*self.eps2*(np.abs(self.f_x)+self.up)
vrysml = 8*self.eps**2

xtf = self.x[d]
stepb4 = 0
step = max(step, vrysml)
step = max(step, 8*np.abs(self.eps2*self.x[d]))
if np.abs((step-stepb4)/step) < self.step_tolerance:
break
stepb4 = step

x1 = self.x.copy()
x1[d] = xtf+step
x2 = self.x.copy()
x2[d] = xtf-step
fs1, fs2 = yield from pmap(self._f, [(x1,), (x2,)])

step = (x1[d]-x2[d])/2 # in case of round-off

break

if __name__=='__main__':
k = np.array([1e-6,1e-3,1,1e3,1e6])
def f(x):
return np.sin(np.dot(k,x))
def df(x):
return np.cos(np.dot(k,x))*k

F = Scalar(f)
for i in range(20):
print()
x = np.random.randn(len(k))
print(df(x))


Unfortunately, coroutines require python 3.4, while the rest of the system uses python 2. So I did a conversion to a thread-based approach, which was gratifyingly simple:
#!/usr/bin/env python
# -*- coding: utf-8 -*-

import concurrent.futures
import time
import numpy as np

def pmap(f, xs, pool):
ts = [pool.submit(f,*x) for x in xs]
return [t.result() for t in ts]

class Scalar:
def __init__(self, f,
step_initial=1e-8,
f_eps=None,
step_tolerance=0.3,
dtype=float,
pool=None):
self.f = f
self.x = None
self.f_x = None
self.dtype = dtype
try:
self.step_initial = self.dtype(step_initial)
except TypeError:
self.step_initial = np.asarray(step_initial)
if f_eps is None:
self.f_eps = np.finfo(self.dtype).eps
else:
self.f_eps = self.dtype(f_eps)
self.up = 1.
self.eps = 8*np.finfo(self.dtype).eps
self.eps2 = 2*np.sqrt(np.finfo(self.dtype).eps)
self.step_tolerance = step_tolerance
if pool is None:
self.pool = concurrent.futures.ProcessPoolExecutor()
else:
self.pool = pool

def __call__(self, x):
if self.x is None or not np.all(x==self.x):
self.x = x
self.f_x = self.pool.submit(self.f,x).result()
return self.f_x

# FIXME: make sure dirin doesn't get too small
dirin = 2*self.step_initial
if isinstance(dirin, self.dtype):
dirin = np.array([dirin]*self.n)

x = np.asarray(x)
self(x)
self.n = len(x)
raise ValueError("Dimensionality of input appears to have changed "
"from %d to %d" % (len(self._gradient_steps),
len(x)))
return np.array(pmap(self._gradient1d,[(d,) for d in range(len(x))],

dfmin = 8*self.eps2*(np.abs(self.f_x)+self.up)
vrysml = 8*self.eps**2

xtf = self.x[d]
stepb4 = 0
step = max(step, vrysml)
step = max(step, 8*np.abs(self.eps2*self.x[d]))
if np.abs((step-stepb4)/step) < self.step_tolerance:
break
stepb4 = step

x1 = self.x.copy()
x1[d] = xtf+step
x2 = self.x.copy()
x2[d] = xtf-step
fs1, fs2 = pmap(self, [(x1,), (x2,)],

step = (x1[d]-x2[d])/2 # in case of round-off

break

if __name__=='__main__':
k = np.array([1e-6,1e-3,1,1e3,1e6])
def f(x):
return np.sin(np.dot(k,x))
def df(x):
return np.cos(np.dot(k,x))*k

F = Scalar(f)
for i in range(20):
print()
x = np.random.randn(len(k))
print(df(x))


Applying this to the actual triple system problem is a work in progress. There's also the very promising pure-python nested-sampling package nestle, which I would like to convert to use futures to do parallel computation, again because I want to apply it to the triple system. That's a longer-term problem, because my objective function currently crashes if you give it sufficiently-bogus orbital parameters.