Perturbation confusion confusion
Update: I'm making a bit of a turnabout here.
Firstly, I have to point out that at no point have I disagreed with S&P on any purely technical issue. We're all working with the same code and agree on what it does. The issue is a human one: is it possible to get wrong results by writing code that *looks* correct. I was sent this example:
d (\x -> (d (x*) 2)) 1
Read naively it produces a different result to what you might expect. We can see why it fails by looking at this:
d (\x -> (d (x*) 2)) (1::Integer)
It fails to typecheck! That "1" is actually of type D Integer. So the semantics of that code are entirely different to what you might expect if you read it like an ordinary mathematical expression and ignored the types of all the subexpressions.
So I agree that it is possible for a programmer to misread that code. I still don't consider the algorithm to have failed in this case. It is standard in Haskell that the appearance of an integer in code doesn't necessarily mean it's of Integer type, so that when we write Haskell code we always need to be aware of the types of all of our terms. When we write d (\x -> (d (x*) 2)) 1, we're asking for the wrong thing before the code has started executing. But I have been convinced that there are dangers here for people reading the code naively.
However I am now completely convinced the situation is more complex than this and I'll have to address it again.
Anyway, I suggest using AD regardless. It's an awesomely powerful technique. But don't capture variables in a lambda or function without immediately wrapping them in a lift, and if you multiply nest, you need to multiply lift. Essentially lift is a signal to the compiler that you want the variable to be held constant with respect to the derivative. It also makes the code more readable. It's analogous to using fmap the correct number of times in order to push a function down into nested lists.
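To make that concrete, here's a minimal sketch using the D, d and lift defined below (the names once and twice are just for illustration, and the expected values are worked out by hand):

> once  = d (\x -> x * d (\y -> lift x + y) 1) 1
> twice = d (\x -> d (\y -> lift x * y + d (\z -> lift (lift x) * lift y * z) 1) 1) 1

In once, x is captured one lambda deep, so it gets one lift (and once evaluates to 1). In twice, x is captured two lambdas deep inside the innermost d, so there it needs lift (lift x), while the once-captured y needs a single lift (and twice evaluates to 2).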
So I'll leave the following in place because removing it would be bad form and some of it still stands.
Introduction
In a recent paper, Siskind and Pearlmutter warn how the use of differentiation operators in functional programming languages is "fraught with danger" and discuss a problem common to "all attempts to integrate a forward-mode AD operator into Haskell".
This is curious. I have had great success using automatic differentiation code both in Haskell and functional-style C++ and failed to notice the danger. Clearly I needed to take better notice of what I was doing.
So let's go ahead and implement AD and try to reproduce the problem they point out.
Automatic Differentiation
> data D a = D { real :: a, infinitesimal :: a } deriving (Eq, Show)

> instance Num a => Num (D a) where
>     fromInteger n = D (fromInteger n) 0
>     D a a' + D b b' = D (a+b) (a'+b')
>     D a a' * D b b' = D (a*b) (a*b'+a'*b)

We can now define a differentiation operator:
> d f x = infinitesimal (f (D x 1))
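Why does this work? A value D a a' represents a dual number a + a'ε, where ε² = 0, and the Num instance above is just ordinary arithmetic on dual numbers. For example, (a + a'ε)(b + b'ε) = ab + (ab' + a'b)ε + a'b'ε² = ab + (ab' + a'b)ε, which is exactly the product rule. So applying f to x + ε, i.e. to D x 1, produces f(x) + f'(x)ε, and infinitesimal extracts the derivative.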
We can use d to differentiate a function like f(x) = x³+2x²+x+1 at 2 to get:

> example0 = d (\x -> x^3+2*x^2+x+1) 2

The derivative is 3x²+4x+1, so example0 evaluates to 21. Imagine you were confused enough by AD to write Siskind and Pearlmutter's example exactly as described in equation (2) of the paper:
example1 = d (\x -> x*(d (\y -> x+y) 1)) 1

We don't get an incorrect result. Instead, we get this error message:
Occurs check: cannot construct the infinite type: a0 = D a0
Expected type: D (D a0)
  Actual type: D a0
In the first argument of `(+)', namely `x'
In the expression: x + y

The Haskell type checker identifies precisely where the problem is. x and y don't have the same types so they can't be added. Rather than immediately analyse how to fix this, let's try a completely different approach to calculus: symbolic differentiation. We'll define an expression type and write code to differentiate it:
Symbolic Differentiation
> data E a = X | Const a | E a :+: E a | E a :*: E a deriving (Eq, Show)

> diff X = 1
> diff (Const _) = 0
> diff (a :+: b) = diff a + diff b
> diff (a :*: b) = a :*: diff b + diff a :*: b

We want to be able to evaluate these expressions:
> eval X x = x
> eval (Const a) x = a
> eval (a :+: b) x = eval a x + eval b x
> eval (a :*: b) x = eval a x * eval b x

We can make this easier to use by making E a an instance of Num:
> instance Num a => Num (E a) where
>     fromInteger n = Const (fromInteger n)
>     a + b = a :+: b
>     a * b = a :*: b

And now we can write an alternative to d:
> d' f x = eval (diff (f X)) x
> example1 = d' (\x -> x^3+2*x^2+x+1) 2

We of course get the same result.
So let's try the example from the paper again:
example1 = d' (\x -> x*(d' (\y -> x+y) 1)) 1

We don't get an incorrect result. Instead, we get this error message:
Occurs check: cannot construct the infinite type: t0 = E t0
Expected type: E (E t0)
  Actual type: E t0
In the first argument of `(+)', namely `x'
In the expression: x + y

An almost identical error message. So what's going on?
Look at the type signature to d':
d' :: Num t => (E a -> E t) -> t -> t

When we evaluate
d' (\x -> x*(d' (\y -> x+y) 1)) 1

the outer d' differentiates symbolically so internally it uses an expression type. This means that x is an expression, not a numerical value. But this means that the inner d' is being asked to evaluate its result at something that is itself an expression, not a numerical value. So internally it uses expressions of expressions. So x is of type E a and y is of type E (E a). It's no surprise we can't add them. Our bug is easily fixed. We define a lifting function:
> lift' x = Const x

and now we can correctly evaluate:
> example2 = d' (\x -> x*(d' (\y -> lift' x+y) 1)) 1

> lift x = D x 0
> example3 = d (\x -> x*(d (\y -> lift x+y) 1)) 1

Much the same discussion applies to the AD code. In that case, D a is the type of a's that have been infinitesimally perturbed. A value of type D (D a) is one whose perturbation is itself perturbed. But Siskind and Pearlmutter say that we might be subject to perturbation confusion and could get 2. Curious. There is no way we can 'confuse' these distinct perturbations. Not no how. Not no way. They correspond to data of entirely different types. Far from being fraught with danger, the Haskell type system keeps us safe. Perturbation confusion in Haskell is about as likely as expression tree confusion.
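Incidentally, the distinctness of those types doesn't stop us nesting d on purpose. A quick sketch (example4 is my own illustration, with the value worked out by hand): applying d twice instantiates the lambda at type D (D a) -> D (D a), and the two perturbations, having different types, stay separate:

> example4 = d (d (\x -> x^3)) 2

This evaluates to 12, the second derivative of x³ at 2.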
Let's take a look at the analysis in section 2. They introduce a perturbation ε and show how we might have computed 2 instead of 1. But this derivation doesn't correspond to any computation we could possibly have made using the function d in Haskell. The only issue we have identified is that we have to write our code using correct types. This has nothing to do with automatic differentiation or perturbations. It applies equally well to the symbolic differentiation code. In fact, it has nothing to do with differentiation as it applies equally well to my code to compute the even and odd parts of a function.
We can press on with the paper. Section three tries to sell us a remedy - tagging. But we have no need for tagging. The Haskell compiler already deduced that the perturbation inside the inner d is of a different type to that in the outer d. The only explanation I can come up with for this section is that the authors have some experience with functional programming languages that are dynamically typed and are trying to apply that experience to Haskell.
Section 4 reiterates the point in the abstract that implementations of AD "fail to preserve referential transparency". I think we can safely ignore this claim. The AD code above isn't using any unsafe Haskell operations. It clearly is referentially transparent.
Confusion lifted
So now to section 6. They talk about the lift function and how correctly inserting it requires "sophisticated non-local analysis". Now everything becomes clear. Siskind and Pearlmutter don't consider this algorithm to be "automatic differentiation" unless they can leave out the lift operations. But it is not common practice to write Haskell code by deliberately writing a significant body of code that doesn't type check, and then expect to automatically insert the missing pieces. You simply write the code correctly to start with. In fact, when I first found myself inserting a lift in my own code I didn't even think of myself as having solved a problem - I just wrote the code that was analogous to code that Haskell programmers write every day. If I had forgotten the lift, the compiler would have set me straight anyway.
Conclusion
"Perturbation confusion" is a non-existent problem in Haskell AD. Siskind and Pearlmutter correctly point out that Haskell code using AD is not exactly analogous to mathematical notation for derivatives. This is unsurprising, differentiation is not computable (for deep reasons reasons that are similar to the reason that intermediate values are not computable). But functions like lift are an everyday part of typed functional programming. This is not a feature of AD, it applies equally as well to symbolic differentiation (and indeed many other parts of Haskell such as using monad transformers and nested functors). The algorithm is no less a form of AD because of the use of lift.
I have been advocating AD for many applications over recent years. It has been an uphill struggle. It is often incorrectly assumed that I am talking about numerical or symbolic differentiation. Programmers will often assert that what I am describing is completely impossible. But when they grasp what I am explaining the response often changes to "why didn't anyone tell me about this years ago?" It is disappointing to see two major contributors to computer science whose work I greatly respect scaring away potential users with FUD about a non-existent problem.
(And if you got here but skipped the update at the top you really need to go to the top as I now think Siskind and Pearlmutter have a good point! I'm leaving the above just for the sake of posterity.)
18 Comments:
I was also puzzled by the problem claimed in that paper. But then at the end of the paper S&P presented Haskell code that they claimed would demonstrate the problem.
I didn't try out this code. Does their presented code actually demonstrate a problem? Or does it fail with a type error such as the ones you showed? Or something else?
Where can I find a nice accessible introduction to AD? It intrigues me.
Here's an example (by Barak Pearlmutter, who provided it to me some time ago) where the Haskell code really does give an incorrect answer.
d (\x -> (d (x*) 2)) 1
This gives an incorrect answer when you keep the rank 1 type assigned by type inference to d. However, if you write the rank 2 type that d should logically have:
dd :: Num t => (forall a. Num a => a -> a) -> t -> t
then it fails with an error as well.
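Spelled out, that looks something like this (a sketch using the D type from the post; RankNTypes is needed for the signature):

{-# LANGUAGE RankNTypes #-}

dd :: Num t => (forall a. Num a => a -> a) -> t -> t
dd f x = infinitesimal (f (D x 1))

-- dd (\x -> dd (x*) 2) 1
-- is now rejected: inside the lambda, x has some fixed type a, so the
-- section (x*) isn't polymorphic enough to be passed to the inner dd.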
Anonymous, I wrote what I think is a very accessible introduction some 4 or so years ago.
http://cdsmith.wordpress.com/2007/11/29/some-playing-with-derivatives/
If you write some Haskell code that requires lifts (that includes code like that using monad transformers or using fmaps with functors) but deliberately leave out the lifts and then at the end hunt around at random to find places where you can insert lifts to keep the type checker happy, then you can expect buggy code.
S&P exhibit the function called constant_one with type signature: Num a => D a -> a. The type signature should make you suspicious. In fact constant_one (D 10 1), say, is not 1. You don't fix the code by inserting a lift at "at the point of type violation". I'm not sure what to call that kind of debugging. When I debug code I think about the semantics of what I'm doing.
Note that you can write an analogous version of should_be_one for the symbolic differentiator too.
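For concreteness, here's a reconstruction of constant_one from that type signature (my reconstruction, not code copied from the paper), using the d and D above:

constant_one x = d (\y -> x + y) 1

-- inferred type: Num a => D a -> a
-- constant_one (D 10 1) = infinitesimal (D 10 1 + D 1 1)
--                       = infinitesimal (D 11 2) = 2, not 1

The perturbation carried by the captured x leaks straight into the answer.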
Chris,
> the Haskell code really does give an incorrect answer.
That's not an incorrect answer. You've asked to evaluate the derivative at a weird value, not the integer 1. But I agree that it could be misread as 1, especially by someone not familiar with the Num typeclass. So you've convinced me that there are indeed dangers here. I've modified my position with an update.
I get a slightly different error message for example1:
*Main> d (\x -> x*(d (\y -> x+y) 1)) 1
<interactive>:1:12:
Occurs check: cannot construct the infinite type: a = D a
Expected type: D a
Inferred type: a
In the second argument of `(*)', namely `(d (\ y -> x + y) 1)'
In the expression: x * (d (\ y -> x + y) 1)
And now something curious happens: based solely on this error message my initial, arguably naive guess would be to insert lift around the second argument of (*), like this:
*Main> d (\x -> x * lift (d (\y -> x+y) 1)) 1
2
This even typechecks, but produces the wrong answer. How do you know where to insert lift? The type system does not help you here.
@Oleksandr Just inserting lifts anywhere to make the expression typecheck isn't good practice. It's easy to write code that typechecks but isn't correct.
The easiest rule I can suggest for now is that you need to place it around any real-valued variable involved in the computation that the lambda captures from its environment. In this case 'lift x'. It'd be nice if the type system could spot when we need this but I don't know if it can. I've never had any difficulty applying this rule but now I'm concerned that there may be situations where it's hard to apply it. I can't think of one.
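To make the contrast explicit (values worked out by hand):

wrong = d (\x -> x * lift (d (\y -> x + y) 1)) 1  -- typechecks, gives 2
right = d (\x -> x * d (\y -> lift x + y) 1) 1    -- gives 1, as expected

In wrong the inner derivative is computed with x's perturbation still attached and only then frozen by lift; in right the lift freezes x before the inner differentiation, which is what "holding x constant" means.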
Is the problem Haskell's overloading of numeric literals? I.e. would explicitly annotating them always fix things?
CC Shan gave a pretty good overview of the issue and presented (even if he did not invent) the approach used to resolve it that I currently use, which is to use universal quantification to keep you from mixing up your infinitesimals, here:
http://conway.rutgers.edu/~ccshan/wiki/blog/posts/Differentiation/
In http://hackage.haskell.org/package/ad, I make that quantified infinitesimal do double duty as the differentiation mode. (That said, with hindsight, I would like to have figured out a way to quantify over the mode separately to enable fewer things to have to dispatch with the dictionary.)
Dan, thank you for your reply. In fact, I knew the answer to my question; I only wanted to emphasize that the type system of Haskell does not rule out perturbation confusions -- you still need to carefully insert lifts in appropriate places. I must admit that I haven't written any substantial program in Haskell that uses AD, so it may be that in practice lifts don't present a real problem. I guess it is time to write such a program.
By the way, I happen to be a PhD student of Barak, and one of the things I've been thinking about recently is the typology of AD. There are a few interesting ideas floating around. For example, dual numbers are really the tangent bundle over the real line. What if we try to extend the tangent bundle construction to other types? The tangent bundle is a functor T from the category of smooth manifolds to itself. Furthermore, from the point of view of synthetic differential geometry, it is a representable functor represented by the "set" D of nilsquare elements. In particular, it enjoys some nice properties. Obviously, it commutes with products, so one can define T (a, b) = (T a, T b) for types. More interestingly, it follows from the representability that T (a -> b) is isomorphic to a -> T b. Furthermore, the representability of T implies that it admits the structure of a monad, of which lift is the unit. This structure can also be derived directly, without appealing to synthetic differential geometry. I can go on.
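In terms of the D type from the post, that monad structure looks something like this (a sketch of the coordinate description; unitD and joinD are my names, and joinD follows the usual tangent-monad recipe of adding the two first-order perturbations and dropping the second-order part):

unitD :: Num a => a -> D a
unitD x = D x 0                        -- this is the post's lift

joinD :: Num a => D (D a) -> D a
joinD (D (D x dx) (D dy _)) = D x (dx + dy)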
Ultimately I would like to have these ideas embodied in the form of a type system that would allow us to reason about AD. This dream is still under construction...
@Anonymous No, it's not just about literals.
@Oleksandr You did a good job of convincing me :-)
More generally, we can also build jet bundles. It's also fun to look at Lie algebras built this way from Lie groups: http://blog.sigfpe.com/2008/04/infinitesimal-rotations-and-lie.html
It seems a bit glib to say that the type of diff is "Num a=>(D a->D a)->(a->a)" and use that to justify the needed lifting etc. Because the type of diff is, mathematically speaking, "Num a=>(a->a)->(a->a)". The dual numbers are an implementation technique being inappropriately exposed in the signature of the API. This is the root cause of the modularity and nesting problems that make AD so brittle in most current systems.
Barak,
If by glib you mean that more could be said on the subject, then of course you are correct. There is no computable function for differentiation of type Num a=>(a->a)->(a->a) and this ultimately reflects the fact that classically, differentiation is not uniformly continuous, and constructively, there is no differentiation function at all.
A mathematical proof that there is no way to implement differentiation in a way that it is also a function (in the mathematical sense) seems like a good justification for using lifting to me.
Dan, can you point me to a proof that there is no way to implement differentiation in a way that is also a function, or perhaps name the theorem in question?
Thanks in advance!
@Anonymous I don't know a canonical source of the theorem but here's one I found through web searching: http://eccc.hpi-web.de/resources/pdf/ica.pdf
This book goes into quite a bit of detail: http://www.amazon.com/Computable-Analysis-Introduction-Theoretical-Computer/dp/3540668179/
One way to think of it is that differentiation is itself not uniformly continuous. You can make arbitrarily small perturbations to a function and wildly change its derivative.
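A standard illustrating example (the usual textbook one, not from the links above): f_n(x) = (1/n) sin(n²x) converges uniformly to the zero function, since sup |f_n| = 1/n → 0, yet f_n'(x) = n cos(n²x) oscillates with unbounded amplitude as n grows.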
To be fair, that "impossible to compute derivative" proof relies on assumptions under which it is also impossible to compute, say, 1/x, or to test if x is positive. These are interesting results, but I don't think they actually imply that it is impossible to compute derivative functions automatically, in the sense we desire.