Perturbation confusion confusion
Update: I'm making a bit of a turnabout here.
Firstly, I have to point out that at no point have I disagreed with S&P on any purely technical issue. We're all working with the same code and agree on what it does. The issue is a human one: is it possible to get wrong results by writing code that *looks* correct? I was sent this example:
d (\x -> (d (x*) 2)) 1
Read naively, it produces a different result from what you might expect. We can see why by looking at this:
d (\x -> (d (x*) 2)) (1::Integer)
It fails to typecheck! That "1" is actually of type D Integer. So the semantics of that code are entirely different to what you might expect if you read it like an ordinary mathematical expression and ignored the types of all the subexpressions.
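To make the overloading concrete, here is a small illustration (my addition; the names one and one' are mine, and D is the type defined later in this post):

> -- The same literal "1" elaborates differently at different types:
> one  = 1 :: Integer      -- ordinary fromInteger
> one' = 1 :: D Integer    -- fromInteger at D Integer, i.e. D 1 0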
So I agree that it is possible for a programmer to misread that code. I still don't consider the algorithm to have failed in this case. It is standard in Haskell that the appearance of an integer in code doesn't necessarily mean it's of Integer type, so when we write Haskell code we always need to be aware of the types of all of our terms. When we write d (\x -> (d (x*) 2)) 1, we're asking for the wrong thing before the code has started executing. But I have been convinced that there are dangers here for people reading the code naively.
However, I am now completely convinced the situation is more complex than this and I'll have to address it again.
Anyway, I suggest using AD regardless. It's an awesomely powerful technique. But don't capture variables in a lambda or function without immediately wrapping them in a lift, and if you multiply nest, you need to multiply lift. Essentially lift is a signal to the compiler that you desire the variable to be held constant with respect to the derivative. It also makes the code more readable. It's analogous to using fmap the correct number of times in order to push a function down into nested lists.
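Here is a sketch of that rule of thumb (my addition; the names example4, example5 and nested are mine, and d and lift are the definitions given later in this post):

> -- One level of nesting: x crosses one lambda, so one lift.
> example4 = d (\x -> d (\y -> lift x * y) 2) 1
> -- Two levels: x crosses two lambdas (two lifts), y crosses one.
> example5 = d (\x -> d (\y -> d (\z -> lift (lift x) * lift y * z) 3) 2) 1
> -- The fmap analogy: pushing (+1) two levels down needs two fmaps.
> nested = fmap (fmap (+1)) [[1,2],[3]]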
So I'll leave the following in place because removing it would be bad form and some of it still stands.
Introduction
In a recent paper, Siskind and Pearlmutter warn that the use of differentiation operators in functional programming languages is "fraught with danger" and discuss a problem common to "all attempts to integrate a forward-mode AD operator into Haskell".
This is curious. I have had great success using automatic differentiation code both in Haskell and functional-style C++ and failed to notice the danger. Clearly I needed to take better notice of what I was doing.
So let's go ahead and implement AD and try to reproduce the problem they point out.
Automatic Differentiation
> data D a = D { real :: a, infinitesimal :: a } deriving (Eq, Show)

> instance Num a => Num (D a) where
>     fromInteger n = D (fromInteger n) 0
>     D a a' + D b b' = D (a+b) (a'+b')
>     D a a' * D b b' = D (a*b) (a*b'+a'*b)
>     -- negate, abs and signum are omitted; they aren't needed below
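As a quick sanity check (my addition; the name checkProduct is mine), the product rule falls out of the multiplication on D:

> -- (2+ε)(3+ε) = 6+5ε, ignoring ε², so the infinitesimal part is 5.
> checkProduct = D 2 1 * D 3 1 == D 6 5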
We can now define a differentiation operator:

> -- Perturb x by one unit of ε and read off the coefficient of ε.
> d f x = infinitesimal (f (D x 1))
We can use d to differentiate a function like f(x) = x³+2x²+x+1 at 2:

> example0 = d (\x -> x^3+2*x^2+x+1) 2
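We can even assert the expected value (my addition; the name checkExample0 is mine):

> -- The derivative is 3x²+4x+1, which is 21 at x = 2.
> checkExample0 = example0 == 21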
Imagine you were confused enough by AD to write Siskind and Pearlmutter's example exactly as described in equation (2) of the paper:

example1 = d (\x -> x*(d (\y -> x+y) 1)) 1
We don't get an incorrect result. Instead, we get this error message:

    Occurs check: cannot construct the infinite type: a0 = D a0
    Expected type: D (D a0)
      Actual type: D a0
    In the first argument of `(+)', namely `x'
    In the expression: x + y

The Haskell type checker identifies precisely where the problem is: x and y don't have the same types, so they can't be added. Rather than immediately analyse how to fix this, let's try a completely different approach to calculus: symbolic differentiation. We'll define an expression type and write code to differentiate it.
Symbolic Differentiation
> data E a = X | Const a | E a :+: E a | E a :*: E a deriving (Eq, Show)

> diff X = 1
> diff (Const _) = 0
> diff (a :+: b) = diff a + diff b
> diff (a :*: b) = a :*: diff b + diff a :*: b

We want to be able to evaluate these expressions:
> eval X x = x
> eval (Const a) x = a
> eval (a :+: b) x = eval a x + eval b x
> eval (a :*: b) x = eval a x * eval b x

We can make this easier to use by making E a an instance of Num:
> instance Num a => Num (E a) where
>     fromInteger n = Const (fromInteger n)
>     a + b = a :+: b
>     a * b = a :*: b

And now we can write an alternative to d:
> d' f x = eval (diff (f X)) x

> example1 = d' (\x -> x^3+2*x^2+x+1) 2

We of course get the same result.
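In fact we can check the agreement directly (my addition; the name checkSymbolic is mine):

> -- Symbolic and automatic differentiation agree on the polynomial.
> checkSymbolic = example1 == example0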
So let's try the example from the paper again:
example1 = d' (\x -> x*(d' (\y -> x+y) 1)) 1

We don't get an incorrect result. Instead, we get this error message:
    Occurs check: cannot construct the infinite type: t0 = E t0
    Expected type: E (E t0)
      Actual type: E t0
    In the first argument of `(+)', namely `x'
    In the expression: x + y

An almost identical error message. So what's going on?
Look at the type signature to d':
d' :: Num t => (E a -> E t) -> t -> t

When we evaluate
d' (\x -> x*(d' (\y -> x+y) 1)) 1

the outer d' differentiates symbolically, so internally it uses an expression type. This means that x is an expression, not a numerical value. The inner d' is therefore being asked to evaluate its result at a value that is itself an expression, so internally it uses expressions of expressions: x is of type E a while y is of type E (E a). It's no surprise we can't add them. Our bug is easily fixed. We define a lifting function:
> lift' x = Const x

And now we can correctly evaluate:
> example2 = d' (\x -> x*(d' (\y -> lift' x+y) 1)) 1

The analogous lift for the AD code wraps a value in a D with zero infinitesimal part:

> lift x = D x 0

> example3 = d (\x -> x*(d (\y -> lift x+y) 1)) 1

Much the same discussion applies to the AD code. In that case, D a is the type of a's that have been infinitesimally perturbed. The type D (D a) is a value whose perturbation is itself perturbed. But Siskind and Pearlmutter say that we might be subject to perturbation confusion and could get 2. Curious. There is no way we can 'confuse' these distinct perturbations. Not no how. Not no way. They correspond to data of entirely different types. Far from being fraught with danger, the Haskell type system keeps us safe. Perturbation confusion in Haskell is about as likely as expression tree confusion.
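And as a final check (my addition; the name checkLift is mine), both corrected versions compute the value the calculus predicts, namely 1:

> -- d/dx [x · d/dy(x + y) at y=1] = d/dx [x · 1] = 1 at x = 1.
> checkLift = example2 == 1 && example3 == 1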
Let's take a look at the analysis in section 2. They introduce a perturbation ε and show how we might have computed 2 instead of 1. But this derivation doesn't correspond to any computation we could possibly have made using the function d in Haskell. The only issue we have identified is that we have to write our code using correct types. This has nothing to do with automatic differentiation or perturbations. It applies equally well to the symbolic differentiation code. In fact, it has nothing to do with differentiation as it applies equally well to my code to compute the even and odd parts of a function.
We can press on with the paper. Section 3 tries to sell us a remedy: tagging. But we have no need for tagging. The Haskell compiler already deduced that the perturbation inside the inner d is of a different type to that in the outer d. The only explanation I can come up with for this section is that the authors have some experience with functional programming languages that are dynamically typed and are trying to apply that experience to Haskell.
Section 4 reiterates the point in the abstract that implementations of AD "fail to preserve referential transparency". I think we can safely ignore this claim. The AD code above isn't using any unsafe Haskell operations. It clearly is referentially transparent.
Confusion lifted
So now to section 6. They talk about the lift function and how correctly inserting it requires "sophisticated non-local analysis". Now everything becomes clear. Siskind and Pearlmutter don't consider this algorithm to be "automatic differentiation" unless they can leave out the lift operations. But it is not common practice to write Haskell code by deliberately writing a significant body of code that doesn't type check and then expecting to automatically insert the missing pieces. You simply write the code correctly to start with. In fact, when I first found myself inserting a lift in my own code I didn't even think of myself as having solved a problem - I just wrote the code that was analogous to code that Haskell programmers write every day. If I had forgotten the lift, the compiler would have set me straight anyway.
Conclusion
"Perturbation confusion" is a non-existent problem in Haskell AD. Siskind and Pearlmutter correctly point out that Haskell code using AD is not exactly analogous to mathematical notation for derivatives. This is unsurprising, differentiation is not computable (for deep reasons reasons that are similar to the reason that intermediate values are not computable). But functions like lift are an everyday part of typed functional programming. This is not a feature of AD, it applies equally as well to symbolic differentiation (and indeed many other parts of Haskell such as using monad transformers and nested functors). The algorithm is no less a form of AD because of the use of lift.
I have been advocating AD for many applications over recent years. It has been an uphill struggle. It is often incorrectly assumed that I am talking about numerical or symbolic differentiation. Programmers will often assert that what I am describing is completely impossible. But when they grasp what I am explaining, the response often changes to "why didn't anyone tell me about this years ago?" It is disappointing to see two major contributors to computer science whose work I greatly respect scaring away potential users with FUD about a non-existent problem.
(And if you got here but skipped the update at the top you really need to go to the top as I now think Siskind and Pearlmutter have a good point! I'm leaving the above just for the sake of posterity.)