Comments on "Two Papers and a Presentation" (A Neighborhood of Infinity, by Dan Piponi) — 11 comments

sigfpe (2008-11-14 17:09):
barak,

I've decided I like your point of view after all, and I think I'll mention it if (1) my paper is accepted and (2) I get a chance to revise it. If this happens, I'd like to acknowledge you. What name should I use? (I think I know your full name, but I could be wrong.)

sigfpe (2008-11-14 13:56):
barak,

But this is what I mean by trivial:

    (linear f) => Jf f x = f

Anyway, I think we understand each other; we're just using different names. Everything in my paper could be carried out by passing linear filters into a reverse-mode automatic differentiator. (So on one level, of course it is differentiation.) But I see myself as extracting the parts of the algorithm that aren't explicitly about derivatives, because when your function is linear you don't need to think about calculus, just about multiplying matrices.

barak (2008-11-14 11:24):
> You're only computing derivatives in
> the sense that any linear map is the
> derivative of some function.

No, that's really not what I'm doing.

AD, at least in the form appropriate for consideration here, does not compute the derivative as a number. Rather, it computes a linear function, namely a product with the Jacobian: that product being (J f x)*v for forward AD, and (J f x)^T*v for reverse AD, where J f x is the Jacobian matrix of f at x.

Here is another way to look at it. The Jacobian J f x is a matrix of numbers. But we can abstract away the representation of this matrix, instead considering it merely as a linear map. (In differential geometry this is called the push-forward.) That is what is done in AD. If we denote the forward-mode AD transform operator by Jf, then we have a numerical equality

    Jf f x v = (J f x)*v

but calculated efficiently. And with Jr being the reverse-mode operator, we have a numerical equality

    Jr f x v = (J f x)^T*v

but calculated efficiently.

When f is linear, Jf f x v = (J f x)*v = f v for any vectors x:R^n and v:R^n. These have the same extension, so we can say

    (linear f) => Jf f x = f

Similarly,

    (linear f) => Jr f x = f^*

where f^* is the adjoint of f.

Note that I'm not constructing any quadratic functions, or any other function whose derivative is f, or anything like that.

> ... representing linear functions
> as quadratic ...

From the linear function f:R^n -> R^m we can construct a bilinear form g x y = y^T (f x). Appropriate cascaded application of AD operators to g can yield f or f^*. This is made cleaner if we note that in f x, the vector x lives in the input space of f; but in g x y, it is more natural to consider x as living instead in the tangent space of some point in the input space of f, and to consider y as living not in the output space of f, or even the tangent space of some point in the output space of f, but rather in the cotangent space of some point in the output space of f. This renders the dot product in the definition of g meaningful.

sigfpe (2008-11-13 09:34):
barak,

You're only computing derivatives in the sense that any linear map is the derivative of some function. But actually, I have considered this. At one point I toyed with the idea of representing linear functions as quadratic functions in this way: you get a single representation (as sparse as your implementation of the quadratic) from which I think a compiler can extract both the linear function and its adjoint. I think that's right.

barak (2008-11-13 08:30):
Thanks for a wonderful blog. Most of your posts are great fun, but there is a technical connection here I feel obliged to point out.

The manuscript "Two Tricks for the Price of One: Linear Filters and their Adjoints" (http://www.sigfpe.com/adjoint.pdf) says:

    "It is closely related to adjoint mode automatic differentiation [5] although the principle under discussion here is more general as we are not in fact computing derivatives."

This isn't quite accurate. If we let

    y = f x

where f : R^n -> R^m, then we can define the Jacobian matrix operator J by

    (J f x)_ij = (d/dx_j) (f_i x)

meaning that the (i,j)-th element of J f x is the derivative of the i-th component of f x with respect to the j-th component of x, evaluated at the given x.

Forward accumulation mode automatic differentiation efficiently calculates the matrix-vector product

    (J f x) v

for any vector v:R^n, and reverse accumulation mode automatic differentiation efficiently calculates the matrix-vector product

    (J f x)^T v

for any vector v:R^m.

In the case treated in the above manuscript, f is linear, so we could write

    f x = F x

where F is some matrix, so

    J f x = F
    (J f x)^T = F^T
    (J f x)^T v = F^T v = f^* v

where f^* is the adjoint of f.

In fact, the transformation described in the above manuscript is precisely reverse-mode AD, in this case. (Note that J f x is constant in x, so some simplifications occur. These are precisely the optimizations a good compiler would make in eliminating dead code or in simplifying a + 0*b to a.)

sigfpe (2008-09-18 11:20):
bob c,

Steganographic quines? Hmmm... there might be some interesting ideas to play with there.

Bob C (2008-09-17 14:45):
Wow, the quine is very cool. I once did one that took the cheating approach of sticking the source in an image (http://bobcopeland.com/tile.html), but that was in Perl, where quines are really easy to write. A table of all the ASCII values is a nice approach.

sigfpe (2008-09-16 15:22):
Twan,

I actually wrote that paper a few years ago, before I'd started playing with languages with immutable values (like Haskell), so it reflects my C++ background. My procedural-programming colleagues didn't blink at it because it corresponds to the notation in almost all of the languages they use (from Matlab and Mathematica to Fortran and C++). I suspect the reviewers won't blink at it either.

But fundamentally I do agree with your criticism, and if I get a chance I may change my notation.

Twan van Laarhoven (2008-09-16 15:14):
In the adjoint paper, why do you write (a,b) = <matrix>(a,b) when you mean an update? To me this is very confusing; using (a',b') = <matrix>(a,b) would have been clearer.

sigfpe (2008-09-16 12:58):
Fixed now.

Keegan (2008-09-16 12:03):
Hey, the links to your papers don't seem to be working.
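barak's point above, that forward-mode AD computes the Jacobian-vector product (J f x) v in one pass and that for linear f this collapses to applying f itself, can be sketched with dual numbers. This is a minimal illustration, not code from either paper; the names Dual and jvp and the example matrix F are all invented for the sketch.

```python
# Minimal forward-mode AD via dual numbers: one pass computes (J f x) v.
class Dual:
    def __init__(self, primal, tangent=0.0):
        self.primal = primal
        self.tangent = tangent

    def _lift(x):
        return x if isinstance(x, Dual) else Dual(float(x))

    def __add__(self, other):
        other = Dual._lift(other)
        return Dual(self.primal + other.primal, self.tangent + other.tangent)

    __radd__ = __add__

    def __mul__(self, other):
        # Product rule: d(uv) = u dv + v du.
        other = Dual._lift(other)
        return Dual(self.primal * other.primal,
                    self.primal * other.tangent + self.tangent * other.primal)

    __rmul__ = __mul__


def jvp(f, x, v):
    """Compute (J f x) v without ever forming the Jacobian matrix."""
    outs = f([Dual(xi, vi) for xi, vi in zip(x, v)])
    return [o.tangent for o in outs]


# A linear map f x = F x, as in the paper's setting.
F = [[1.0, 2.0],
     [3.0, 4.0],
     [5.0, 6.0]]

def f(xs):
    return [sum(Fij * xj for Fij, xj in zip(row, xs)) for row in F]
```

For this linear f, jvp(f, x, v) equals f(v) regardless of the base point x, which is exactly the identity (linear f) => Jf f x = f from the discussion above.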
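Similarly, the linear-case identity (J f x)^T v = F^T v = f^* v, and barak's remark that both f and f^* can be recovered by differentiating the bilinear form g x y = y^T (f x), can be checked numerically. The helper names below are illustrative, not from the manuscript; forward differences are exact here up to rounding because g is linear in each argument separately.

```python
# For linear f x = F x, the adjoint f* is multiplication by F^T, and both
# f and f* fall out of the bilinear form g x y = y^T (f x) by differentiation.
F = [[1.0, 2.0],
     [3.0, 4.0],
     [5.0, 6.0]]

def apply_mat(M, v):
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in M]

def transpose(M):
    return [list(col) for col in zip(*M)]

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def g(x, y):
    """Bilinear form g x y = y^T (f x)."""
    return dot(y, apply_mat(F, x))

def grad(h, z, eps=1e-6):
    """Forward-difference gradient of scalar h at z (exact for linear h)."""
    out = []
    for j in range(len(z)):
        zp = list(z)
        zp[j] += eps
        out.append((h(zp) - h(z)) / eps)
    return out

x = [1.0, -1.0]
y = [1.0, 0.0, 2.0]

fx = apply_mat(F, x)                    # f x
f_star_y = apply_mat(transpose(F), y)   # f* y = F^T y

# Differentiating g in each argument recovers both maps:
grad_y = grad(lambda yy: g(x, yy), y)   # should equal f x
grad_x = grad(lambda xx: g(xx, y), x)   # should equal f* y
```

The defining property of the adjoint, dot(f x, y) == dot(x, f* y), also holds for these values, which is what makes the cotangent-space reading of y meaningful.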
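On the quine thread: the table-of-ASCII-values trick mentioned above is one standard way for a program to reproduce its own text. In a language whose strings have a printable representation, the same effect needs only a template that re-inserts itself. A minimal sketch (this is a generic textbook quine, not the one from the blog post):

```python
import io
from contextlib import redirect_stdout

# A classic self-reproducing program: the string s is a template for the whole
# source, and %r splices the string back into itself, quotes and escapes included.
quine_src = "s = 's = %r\\nprint(s %% s)'\nprint(s % s)"

# Run the quine and capture its output; it should print its own source.
buf = io.StringIO()
with redirect_stdout(buf):
    exec(quine_src)

assert buf.getvalue().rstrip("\n") == quine_src
```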
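Twan's notational point, i.e. that (a,b) = <matrix>(a,b) reads as an update while (a',b') = <matrix>(a,b) reads as a fresh binding, is easy to see in a quick sketch (values invented for illustration):

```python
# Fresh-name reading, (a', b') = M (a, b): both components use the OLD a and b.
a, b = 1.0, 2.0
a2, b2 = 2 * a + b, a - b   # a2 = 4.0, b2 = -1.0

# Sequential in-place reading, as procedural code would execute it:
a, b = 1.0, 2.0
a = 2 * a + b               # a is overwritten: 4.0
b = a - b                   # uses the NEW a: 2.0, not -1.0
```

The two readings disagree on b, which is exactly why the update-style notation in the paper is easy to misread.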