NHacker Next
▲Derivatives, Gradients, Jacobians and Hessiansblog.demofox.org
244 points by ibobev 19 hours ago | 58 comments
GistNoesis 11 hours ago [-]
The way that really made me understand gradients and derivatives was visualizing them as Arrow Maps. I even made a small tool: https://github.com/GistNoesis/VisualizeGradient . This visualization helps in understanding optimization algorithms.

Jacobians can be understood as a collection of gradients, one for each coordinate of the output considered independently.

My mental picture for the Hessian is to associate each point with the shape of the parabola (or saddle) that best matches the function locally. It's easy to visualize once you realize it's the shape of what you see when you zoom in on the point. (Technically this mental picture is the Hessian plus the gradient's tangent plane together, i.e. the full second-order multivariate Taylor expansion, but I find it hard to mentally separate the slope from the curvature.)
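A minimal, dependency-light sketch of the arrow-map idea (not the linked tool; the bowl function and the coarse grid are made up for illustration): sample the numerical gradient at grid points, which is exactly the data a quiver plot would draw as arrows.

```python
import numpy as np

def grad(f, p, eps=1e-6):
    """Central-difference estimate of the gradient of scalar f at point p."""
    p = np.asarray(p, dtype=float)
    g = np.zeros_like(p)
    for i in range(p.size):
        e = np.zeros_like(p)
        e[i] = eps
        g[i] = (f(p + e) - f(p - e)) / (2 * eps)
    return g

f = lambda p: p[0]**2 + p[1]**2  # a simple bowl with its minimum at the origin

# Each (point -> arrow) pair is one arrow of the arrow map; every arrow
# points uphill, away from the minimum, and shrinks to zero length there.
arrows = {(x, y): grad(f, [x, y]) for x in (-1, 0, 1) for y in (-1, 0, 1)}
print(arrows[(1, 0)])  # ≈ [2. 0.]
```

Feeding these pairs to `matplotlib.pyplot.quiver` would reproduce the arrow-map picture.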

uoaei 4 hours ago [-]
I'm also a visual learner and my class on dynamical systems really put a lot into perspective, particularly the parts about classifying stable/unstable/saddle points by finding eigenvectors/values of Jacobians.

A lot of optimization theory becomes intuitive once you work through a few of those and compare your understanding to arrow maps like you suggest.

fouronnes3 11 hours ago [-]
There's something that's always been deeply confusing to me about comparing the Jacobian and the Hessian because their nature is very different.

The Hessian shouldn't have been called a matrix.

The Jacobian describes all the first order derivatives of a vector valued function (of multiple inputs), while the Hessian is all the second order derivatives of a scalar valued function (of multiple inputs). Why doesn't the number of dimensions of the array increase by one as the derivation order increases? It does! The object that fully describes the second order derivatives of a vector valued function of multiple inputs is actually a 3-dimensional tensor: one dimension for the original vector valued output, and one for each order of derivation. Mathematicians are afraid of tensors of more than 2 dimensions for some reason and want everything to be a matrix.

In other words, given a function R^n -> R^m:

Order 0: Output value: 1d array of shape (m) (a vector)

Order 1: First order derivative: 2d array of shape (m, n) (Jacobian matrix)

Order 2: Second order derivative: 3d array of shape (m, n, n) (array of Hessian matrices)

It all makes sense!

Talking about "Jacobian and Hessian" matrices as if they are both naturally matrices is highly misleading.
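The shape bookkeeping above is easy to check numerically. A sketch using finite differences (the function f: R² → R³ is an arbitrary example): each differentiation appends one axis of size n, so the orders come out as (m), (m, n), (m, n, n).

```python
import numpy as np

def jacobian(f, x, eps=1e-5):
    """Finite-difference derivative: appends one axis of size len(x)."""
    x = np.asarray(x, dtype=float)
    cols = []
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = eps
        cols.append((f(x + e) - f(x - e)) / (2 * eps))
    return np.stack(cols, axis=-1)

f = lambda x: np.array([x[0] * x[1], x[0]**2, np.sin(x[1])])  # R^2 -> R^3
x = np.array([1.0, 2.0])

value = f(x)                               # order 0: shape (3,)
J = jacobian(f, x)                         # order 1: shape (3, 2), the Jacobian
H = jacobian(lambda y: jacobian(f, y), x)  # order 2: shape (3, 2, 2)
print(value.shape, J.shape, H.shape)  # (3,) (3, 2) (3, 2, 2)
```

Slicing `H[m]` recovers the ordinary Hessian matrix of the m-th output component, e.g. `H[0]` is the Hessian of x₀·x₁, namely [[0, 1], [1, 0]].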

ndriscoll 9 hours ago [-]
At least in my undergrad multivariate real analysis class, I remember the professor arranging things to strongly suggest that the Hessian should be thought of as ∇⊗∇, and that this was the second term in a higher dimensional Taylor series, so that the third derivative term would be ∇⊗∇⊗∇ etc. Things like tensor products or even quotient spaces weren't assumed knowledge, so it wasn't explicitly covered, but I remember feeling the connection was obvious enough at the time. Then an introductory differential geometry class got into (n,m) tensors. So I'm quite sure mathematicians are fine dealing with tensors. My experience was undergrad engineering math tries to avoid even covectors though, so that will stay well clear of a coherent picture of multi-variable calculus. e.g. my engineering professors would talk of dirac δ as an infinite spike/spooky doesn't-really-exist thing that makes integrals work or whatever. My analysis professor just said δ(f) = f(0) is a linear functional.
tho2342i342342 4 hours ago [-]
∇⊗∇ would be more like ∂ᵢf · ∂ⱼf, not ∂ᵢ∂ⱼf
setopt 2 hours ago [-]
I disagree: if you apply it in the order (∇⊗∇)f then you should get ∂ᵢ∂ⱼ as the elements of a rank-2 operator that is then applied to the function f. That is, presumably, what you mean by ∂ᵢⱼf.
kandel 1 hours ago [-]
Well for me the Hessian is the second order derivative in the special case where the co-domain is of dim 1. It's just very easy to work with...
cdavid 10 hours ago [-]
I agree it is confusing: starting from the notation will confuse you. I personally don't like the partial-derivative-first definition of these concepts, as it all sounds a bit arbitrary.

What made sense to me is to start from the definition of the derivative (the best linear approximation, in some sense), and then everything else is about how to represent it: vectors, matrices, etc. are all vectors in the appropriate vector space, and the derivative always has the same functional form.

E.g. you want the derivative of f(M)? Just write f(M+h) - f(M), and then look for the terms in h / h^2 / etc. Apply the chain rule etc. for more complicated cases. This is IMO a much better way to learn about this.
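A concrete instance of that recipe (the choice f(M) = M² is my own example): expanding f(M+h) − f(M) = Mh + hM + h², the part linear in h is the derivative, here the linear map h ↦ Mh + hM, and the leftover is quadratically small.

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((3, 3))
h = 1e-6 * rng.standard_normal((3, 3))  # a small matrix perturbation

f = lambda A: A @ A

# f(M+h) - f(M) = M@h + h@M + h@h; the linear-in-h part is the derivative.
Df_at_M = lambda h: M @ h + h @ M

residual = f(M + h) - f(M) - Df_at_M(h)  # exactly h@h, which is O(|h|^2)
print(np.abs(residual).max())            # quadratically small in |h|
```

The same expansion with a scalar x recovers the familiar (x+h)² − x² = 2xh + h², i.e. derivative 2x.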

As for notation, you use vec/kronecker product for complicated cases: https://janmagnus.nl/papers/JRM093.pdf

mcabbott 10 hours ago [-]
This doesn't really help with programming, but in physics it's traditional to use up- and down-stairs indices, which makes the distinction you want very clear.

If input x has components xⁿ, and output f(x) components fᵐ, then the Jacobian is ∂ₙfᵐ which has one index upstairs and one downstairs. The derivative has a downstairs index... because x is in the denominator of d/dx, roughly? If x had units seconds, then d/dx has units per second.

Whereas if g(x) is a number, the gradient is ∂ₙg, and the Hessian is ∂ₙ₁∂ₙ₂g with two downstairs indices. You might call this a (0,2) tensor, while the Jacobian is (1,1). Most of the matrices in ordinary linear algebra are (1,1) tensors.

flufluflufluffy 8 hours ago [-]
We always referred to them as super/sub-scripts. So like xₙ is read “x sub n”

Upstairs/downstairs is kinda cute tho xD

mcabbott 5 hours ago [-]
Covariant and contravariant indices would be the formal terms. I'm not really sure whether I've seen "upstairs" written down.

Sub/superscript... strike me as the typographical terms, not the meaning? Like $x_\mathrm{alice}$ is certainly a subscript, and footnote 2 is a superscript, but neither is an index.

imtringued 40 minutes ago [-]
You're confusing too many things.

The Hessian is defined as the second order partial derivative of a scalar function. Therefore it will always give you a matrix.

What you're doing with the shape (m,n,n) isn't actually guaranteed at all since the output shape of an arbitrary function can be any tensor and you can apply the Hessian to each scalar value in the tensor to get another arbitrary tensor that has two dimensions more.

It's the Jacobian that is weird, since it is just a vector of gradients and therefore its partial derivative must also be a vector of Hessians.

sestep 18 hours ago [-]
A bit more advanced than this post, but for calculating Jacobians and Hessians, the Julia folks have done some cool work recently building on classical automatic differentiation research: https://iclr-blogposts.github.io/2025/blog/sparse-autodiff/
flerovium114 11 hours ago [-]
Have you tried using Enzyme (https://enzyme.mit.edu/)? It operates on the LLVM IR, so it's available in any language that breaks down into LLVM (e.g., Julia, where I've used it for surface gradients) and it produces highly optimized AD code. Pretty cool stuff.
sestep 9 hours ago [-]
Yeah I've used it (cool project indeed!), albeit mostly just in a project I and others in the autodiff community maintain which benchmarks many different autodiff tools against each other: https://github.com/gradbench/gradbench
tired_and_awake 10 hours ago [-]
About a decade ago I was interviewed for Apple's self driving car project and an exec on the project asked me to define these exact 4 things in great detail and provide examples. Shrugs.
vismit2000 3 hours ago [-]
This is a fantastic video on Jacobian [Mathemaniac]: https://www.youtube.com/watch?v=wCZ1VEmVjVo
ziofill 17 hours ago [-]
Mmh, this is a bit sloppy. The derivative of a function f::a -> b is a function Df::a -> a -o b where the second funny arrow indicates a linear function. I.e. the derivative Df takes a point in the domain and returns a linear approximation of f (the jacobian) at that point. And it’s always the jacobian, it’s just that when f is R -> R we conflate the jacobian (a 1x1 matrix in this case) with the number inside of it.
matheist 15 hours ago [-]
Sorry to actually your actually, but the derivative of a function f from a space A to a space B at the point a is a linear function Df_a from the tangent space of A at a to the tangent space of B at b = f(a).

When the spaces are Euclidean spaces then we conflate the tangent space with the space itself because they're identical.

By the way, this makes it easy to remember the chain rule formula in 1 dimension. There's only one logical thing it could be between spaces of arbitrary dimensions m, n, p: composition of linear transformations from T_a A to T_f(a) B to T_g(f(a)) C. Now let m = n = p = 1, and composition of linear transformations just becomes multiplication.

(Only half kidding)
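That "composition of linear transformations" picture is easy to verify numerically for arbitrary m, n, p. A sketch (the two functions are invented examples) checking that the Jacobian of g∘f equals the matrix product of the two Jacobians:

```python
import numpy as np

def jacobian(f, x, eps=1e-6):
    """Central-difference Jacobian of f at x, one column per input coordinate."""
    cols = []
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = eps
        cols.append((f(x + e) - f(x - e)) / (2 * eps))
    return np.stack(cols, axis=-1)

f = lambda x: np.array([x[0] * x[1], x[0] + x[1], np.sin(x[0])])  # R^2 -> R^3
g = lambda y: np.array([y[0]**2 + y[1], y[2]])                    # R^3 -> R^2
a = np.array([0.5, 1.5])

chain = jacobian(lambda x: g(f(x)), a)         # derivative of the composite
composed = jacobian(g, f(a)) @ jacobian(f, a)  # (2,3) @ (3,2): compose the linear maps
print(np.allclose(chain, composed, atol=1e-4))  # True
```

With m = n = p = 1 the two 1×1 matrices multiply as plain numbers, which is the familiar one-dimensional chain rule.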

btilly 10 hours ago [-]
The distinction between the space A and the tangent space of A becomes visually clear if we consider a function whose domain is a sphere. The derivative is properly defined on the tangent plane, which touches the sphere only at a single point. In the neighborhood of that point the plane and the sphere are very, very close together, but they are inevitably pulled apart by the curvature of the sphere.

Of course that picture is not formally correct. We formally define the tangent space without having to embed the manifold in Euclidean space. But that picture is a correct description of an embedding of both the sphere and the tangent space at a single point.

ziofill 6 hours ago [-]
Oh I appreciate you actualling my actually ^^ but isn't this a special case of the one I wrote, i.e. when a and b are manifolds and admit tangent bundles?
beng-nl 13 hours ago [-]
Why, I’m sure you could come up with a succinct explanation of a monad :-)
ndriscoll 17 hours ago [-]
A perhaps nicer way to look at things[0] is to hold onto your base points explicitly and say Df :: a -> (b, a -o b), with Df(p) = (f(p), A(p)) where f(p+v) ≈ f(p) + A(p)v. Then you retain the information you need to define composition, Dg∘Df = D(g∘f) = (Dg._1∘Df._1, Dg(Df._1)._2 ∘ Df._2), i.e. the chain rule.

[0] which I learned from this talk https://youtube.com/watch?v=17gfCTnw6uE
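A small sketch of that "value plus linear map" representation (the names and the two hand-written example derivatives are mine): a derivative at a point is a pair (value, matrix), and composing two such derivatives is exactly the chain rule — values chain through, matrices multiply.

```python
import numpy as np

# D(f) at p returns (f(p), A) where A is the matrix of the local linear map.
def compose(Dg, Df):
    def D(p):
        fp, A = Df(p)    # value and linear map of the inner function at p
        gfp, B = Dg(fp)  # value and linear map of the outer function at f(p)
        return gfp, B @ A  # chain rule: compose the linear maps
    return D

# Hand-written derivatives of f(p) = (p0^2, p1) and g(q) = q0 + q1:
Df = lambda p: (np.array([p[0]**2, p[1]]),
                np.array([[2 * p[0], 0.0], [0.0, 1.0]]))
Dg = lambda q: (np.array([q[0] + q[1]]),
                np.array([[1.0, 1.0]]))

value, J = compose(Dg, Df)(np.array([3.0, 4.0]))
print(value, J)  # [13.] [[6. 1.]]
```

Keeping the base point's value in the pair is what makes `compose` well defined: without f(p) there would be nowhere to evaluate Dg.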

esafak 12 hours ago [-]
It's deplorable that we can't write in Latex or something similar here in 2025, and have to resort to the gobbledygook above.
HeckFeck 13 minutes ago [-]
But at least we FINALLY have an AI Assistant in... WhatsApp??
ziofill 16 hours ago [-]
Yes! I love Conal Elliott's work. What you wrote is the compositional derivative, which augments the regular derivative by also returning the function's value (otherwise composition won't work well). For anyone interested, look up "The Simple Essence of Automatic Differentiation".
dbacar 15 hours ago [-]
I respect the time you spent to write such a post with all those limited input alternatives (bows).
ndriscoll 15 hours ago [-]
You can do ≈ by long holding = on Android/Gboard. The only way I know to get ∘ is to copy/paste it from a Unicode reference. Likewise with ⊸, which I was too lazy to look up and didn't know the name of, but now I know is MULTIMAP (U+22B8).
tomsmeding 14 hours ago [-]
It's also \multimap in TeX. The name never made sense to me because while I've seen it used for a variety of linear functions in math, I've never seen it used for a multimap, and indeed the math name in common use for it seems to be "lollipop".
flufluflufluffy 17 hours ago [-]
Fantastic post! As short as it needs to be while still communicating its points effectively. I love walking up the generalization levels in math.
divbzero 14 hours ago [-]
Would love to see div and curl added to this post.
nickpsecurity 13 hours ago [-]
"What I just described is an iterative optimization method that is similar to gradient descent. Gradient descent simulates a ball rolling down hill to find the lowest point that we can, adjusting step size, and even adding momentum to try and not get stuck in places that are not the true minimum."

That is so much easier to understand than most descriptions. The whole opening was.

whatever1 18 hours ago [-]
I can look around me and find the minimum of anything without tracing its surface and following the gradient. I can also immediately identify global minima instead of local ones.

We all can do it in 2-3D. But our algorithms don’t do it. Even in 2D.

Sure if I was blindfolded, feeling the surface and looking for minimization direction would be the way to go. But when I see, I don’t have to.

What are we missing?

ks2048 18 hours ago [-]
When you look at a 2D surface, you directly observe all the values on that surface.

For a loss-function, the value at each point must be computed.

You can compute them all and "look at" the surface and just directly choose the lowest - that is called a grid search.

For high dimensions, there's just way too many "points" to compute.

samsartor 17 hours ago [-]
And remember, optimization problems can be _incredibly_ high-dimensional. A 7B-parameter LLM is a 7-billion-dimensional optimization landscape. A grid search with a resolution of 10 (i.e. 10 samples per dimension) would require evaluating the loss function 10^(7*10^9) times. That is, the number of evaluations is a number with 7 billion digits.
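The arithmetic behind that claim, spelled out (the numbers follow the comment above):

```python
import math

def grid_search_evals(dims, resolution=10):
    """Loss evaluations needed to grid-search `dims` dimensions."""
    return resolution ** dims

print(grid_search_evals(2))   # 100: a laptop does this instantly
print(grid_search_evals(10))  # 10_000_000_000: already painful

# For 7B parameters the count itself is unprintable, so count its digits:
dims, resolution = 7_000_000_000, 10
digits = int(dims * math.log10(resolution)) + 1
print(digits)  # 7000000001 -- the count has 7 billion digits
```

The cost is exponential in the dimension, which is why anything beyond a handful of dimensions forces gradient-based (or otherwise informed) search.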
Chinjut 18 hours ago [-]
You're thinking of situations where you are able to see a whole object at once. If you were dealing with an object too large to see all of, you'd have to start making decisions about how to explore it.
3eb7988a1663 16 hours ago [-]
The mental image I like: imagine you are lost in a hilly region with incredibly dense fog such that you can only see one foot directly in front of you. How do you find the base of the valley?

Gradient descent: take a step in the steepest downward direction. Look around and repeat. When you reach a level area, how do you know you are at the lowest point?
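That fog picture in code: a gradient-descent sketch on a made-up bowl-shaped valley, where a numerical gradient stands in for "feeling the slope one foot around you" and each iteration steps in the steepest downward direction.

```python
import numpy as np

def grad(f, p, eps=1e-6):
    """Feel the local slope: central-difference gradient at p."""
    g = np.zeros_like(p)
    for i in range(p.size):
        e = np.zeros_like(p)
        e[i] = eps
        g[i] = (f(p + e) - f(p - e)) / (2 * eps)
    return g

f = lambda p: (p[0] - 1)**2 + (p[1] + 2)**2  # valley floor at (1, -2)

p = np.array([5.0, 5.0])      # dropped somewhere in the fog
for _ in range(200):
    p = p - 0.1 * grad(f, p)  # one step in the steepest downward direction

print(np.round(p, 3))  # [ 1. -2.]
```

On this convex bowl the walk converges to the unique minimum; the commenter's closing question is exactly the catch — on a bumpy landscape the same walk stops at whatever level spot it reaches first.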

jpeloquin 17 hours ago [-]
Evaluating a function using a densely spaced grid and plotting it does work. This is brute-force search. You will see the global minima immediately in the way you describe, provided your grid is dense enough to capture all local variation.

It's just that when the function is implemented on the computer, evaluating so many points takes a long time, and using a more sophisticated optimization algorithm that exploits information like the gradient is almost always faster. In physical reality all the points already exist, so if they can be observed cheaply the brute force approach works well.

Edit: Your question was good. Asking superficially-naive questions like that is often a fruitful starting point for coming up with new tricks to solve seemingly-intractable problems.
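The dense-grid "just look at it" approach described above, as a sketch (the wiggly function is invented for illustration — many local minima plus a gentle bowl so that one of them is global):

```python
import numpy as np

# Many local minima from the sin*cos ripples; the 0.1*r^2 bowl makes the
# ripple nearest the origin the global minimum.
f = lambda x, y: np.sin(3 * x) * np.cos(3 * y) + 0.1 * (x**2 + y**2)

# Evaluate every grid point and directly pick the lowest, like an eye would.
xs = np.linspace(-3, 3, 601)
X, Y = np.meshgrid(xs, xs)
Z = f(X, Y)
i, j = np.unravel_index(Z.argmin(), Z.shape)
print(round(X[i, j], 2), round(Y[i, j], 2), round(Z[i, j], 3))
```

This is 601² ≈ 361k evaluations for two dimensions; the surrounding comments explain why the same brute force stops being an option as the dimension grows.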

whatever1 16 hours ago [-]
Thanks!

It does feel to me that we do some sort of sampling; it's definitely not a naive grid search.

Also I find it easier to find the minima in specific directions (up, down, left, right) rather than let’s say a 42 degree one. So some sort of priors are probably used to improve sample efficiency.

zoogeny 15 hours ago [-]
People here are giving you mathematical answers which is what you are asking for, but I want to challenge your intuition here.

In construction, grading a site for building is a whole process involving surveying. If you dropped a person on a random patch of earth that hasn't previously been levelled and gave them no tools, it would be a significant challenge for that person to level the ground correctly.

What I'm saying is, your intuition that "I can look around me and find the minimum of anything" is almost certainly wrong, unless you have a superpower that no other person has.

whatever1 14 hours ago [-]
That is true: we are only good at doing it for specific directions of the objective function, the ones we perceive as the minimizing direction. If you tell me to find the minimum along a direction of 53 degrees, I will likely fail, because I can't easily visualize where that direction points.
nwallin 15 hours ago [-]
When you look at, for instance, a bowl, or even one of those egg carton mattress things, and you want to find the global minimum, you are looking at a surface which is 2 dimensions in and 1 dimension out. It's easy enough for your brain to process several thousand points and say ok the bottom of the bowl is right here.

When a computer has a surface which is 2 dimensions in and 1 dimension out, you can actually just do the same thing. Check like 100 values in the x/y directions and you only have to check like 10000 values. A computer can do that easy peasy.

When a computer does ML with a deep neural network, you don't have 2 dimensions in and 1 dimension out. You have thousands to millions of dimensions in and thousands to millions of dimensions out. If you have 100000 inputs, and you check 1000 values for each input, the total number of combinations is 1000^100000. Then remember that you also have 100000 outputs. You ain't doin' that much math. You ain't.

So we need fancy stuff like Jacobians and backtracking.

whatever1 15 hours ago [-]
I don’t think it’s that simple. For the egg carton your eye will not spend almost any time looking at its top. You will spend most of the time sampling the bottom. I don’t know what we do, but it does not feel like a naive grid search.
cvoss 14 hours ago [-]
I really don't think you have the ability to use self-reflection to discern an algorithm that occurs in your unconscious visual cortex in a split second. You wouldn't feel like you were doing a naive grid search even if a naive grid search is exactly what you were doing.

You have suggested that the process in your mind to find a global minimum is immediate, apparently to contrast this with a standard computational algorithm. But such comparison fails. I don't know whether you mean "with few computational steps" or "in very little time"; the former is not knowable to you; the latter is not relevant since the hardware is not the same.

shoo 11 hours ago [-]
Many practical optimisation problems are less like "let's go hiking and climb a literal hill which we can see in front of us" and more like "find the best design in this space of possible designs that maximises some objective"

Here are some alternative example problems, that are a lot more high dimensional, and also where the dimensions are not spatial dimensions so your eyes give you absolutely no benefit.

(a) Your objective is to find a recipe that produces a maximally tasty meal, using the ingredients you have in your kitchen cupboard. To sample one point in recipe-space, you need to (1) devise a recipe, (2) prep and cook a candidate meal following the recipe, and (3) evaluate the candidate recipe, say by serving it to a bunch of your friends and family. That gets you one sample point. Maybe there are 1 trillion possible "recipes" you could make. Are you going to brute-force cook and serve them all to find a meal that maximises tastiness, or is there a more efficient way that requires fewer plan recipe->prep&cook->serve->evaluate cycles?

(b) Your objective is to find the most efficient design of a bridge, that can support the required load and stresses, while minimising the construction cost.

GuB-42 17 hours ago [-]
Your eyes compute gradients, as part of the shitton of visual processing your brain does to get an estimate of where the local and global minima are.

It is not perfect though, see the many optical illusions.

But we follow gradients all the time, consciously or not. You know you are at the bottom of the hole when all the paths go up for instance.

dcanelhas 30 minutes ago [-]
It has been suggested [citation needed] that the optical illusions of movement caused by gradients exist to compensate for the time it takes to process visual input: they let you perceive what is going on in the world around you right now, based on what happened on your retinas a few milliseconds ago.

it's not a bug - it's a feature :D

i_am_proteus 18 hours ago [-]
Without looking up the answer (because someone has already computed this for you), how would you find the highest geographic point (highest elevation) in your country?
raffael_de 14 hours ago [-]
Well, first of all... you can't. It is very easy to come up with all sorts of (not even special) cases where you simply couldn't, for literally obvious reasons. What you are imagining is some sort of stereoscopic ray tracing, and that is anyway much more compute intensive than calculating a derivative.
cinntaile 17 hours ago [-]
What if you're trying to find the minimum of something that you can't see? Or what if the differences are so small that you can't perceive them with your eyes even though you can see?
adrianN 18 hours ago [-]
The inputs you can process visually are of trivial size even for naive algorithms, and probably also simple instances. I certainly can’t find global minima in 2d for any even slightly adversarial function.
hackinthebochs 18 hours ago [-]
You're ignoring all the calculations that go on unconsciously that realize your conscious experience of "immediately" apprehending the global minima.
fancyfredbot 18 hours ago [-]
Your visual cortex is a massively parallel processor.
pestatije 18 hours ago [-]
Touch and sight sense essentially the same thing... the difference is in the magnitudes involved.
amelius 17 hours ago [-]
> (...) The derivative of w with respect to x. Another way of saying that is “If you added 1 to x before plugging it into the function, this is how much w would change

Incorrect!

dang 12 hours ago [-]
Ok, but a good HN post should explain what is correct, so those who don't know can learn.
wiosnaintel 6 hours ago [-]
> (...) if the function was a straight line

There is a bit more to go from a local linearization to a complete view of the derivative, but it's not exactly incorrect.

amelius 1 hours ago [-]
Looks like the author fixed it. This was the original page:

https://web.archive.org/web/20250817152111/https://blog.demo...

throwpr 17 hours ago [-]
[dead]