Solution 1:

$ \def\bbR#1{{\mathbb R}^{#1}} \def\a{\alpha}\def\b{\beta}\def\g{\gamma}\def\t{\theta} \def\l{\lambda}\def\s{\sigma}\def\e{\varepsilon} \def\n{\nabla}\def\o{{\tt1}}\def\p{\partial} \def\E{R}\def\F{{\cal F}}\def\L{{\cal L}} \def\L{\left}\def\R{\right} \def\LR#1{\L(#1\R)} \def\BR#1{\Big(#1\Big)} \def\vec#1{\operatorname{vec}\LR{#1}} \def\diag#1{\operatorname{diag}\LR{#1}} \def\Diag#1{\operatorname{Diag}\LR{#1}} \def\trace#1{\operatorname{Tr}\LR{#1}} \def\qiq{\quad\implies\quad} \def\grad#1#2{\frac{\p #1}{\p #2}} \def\hess#1#2#3{\frac{\p^2 #1}{\p #2\,\p #3}} \def\c#1{\color{red}{#1}} \def\m#1{\left[\begin{array}{r}#1\end{array}\right]} $It will be convenient to use $\{\e_k,\o,J,I\}$ to denote the standard basis vectors, the all-ones vector, the all-ones matrix, and the identity matrix, respectively, and to use $\{\odot,\oslash\}$ to denote elementwise multiplication and division.

Further, the indexed column vectors can be collected into matrices, e.g. $$\eqalign{ X &= \m{x_\o&\ x_2&\ldots&\ x_N} &\qiq x_i = X\e_i \\ Y &= \m{y_\o&\ y_2&\ldots&\ y_N} &\qiq y_j = Y\e_j \\ B &= \m{\b_\o&\b_2&\ldots&\b_K} &\qiq \b_k = B\e_k \\ }$$

The following functions of a vector argument $u$ will be needed. $$\eqalign{ e &= \exp(u) &\qiq de = e\odot du \\ s &= \operatorname{Softmax}(u) \\ &= \frac{e}{\o:e} = e \oslash Je &\qiq ds = \BR{\Diag{s}-ss^T}\,du \\ \l &= \operatorname{log-sum-exp}(u) \\ &= \log(\o:e) &\qiq d\l = s:du \\ }$$ $$\eqalign{ \log(s) &= \log(e)-\log(Je) \;&=\; u -\l\o \qquad\qquad\qquad\qquad\qquad\qquad \\ \ell(y,s) &= -y:\log(s) \;&=\; \l\LR{y:\o} - y:u \;=\; \l - y:u \\ }$$ where a colon denotes the trace/Frobenius product, i.e. $$\eqalign{ A:B &= \sum_{i=1}^m\sum_{j=1}^n A_{ij}B_{ij} \;=\; \trace{A^TB} \\ A:A &= \big\|A\big\|^2_F \\ }$$ The properties of the underlying trace function allow the terms in a Frobenius product to be rearranged in many different ways, e.g. $$\eqalign{ A:B &= B:A \\ A:B &= A^T:B^T \\ C:AB &= CB^T:A = A^TC:B \\ }$$ Furthermore, it commutes with elementwise multiplication and division $$\eqalign{ A:(B\odot C) &= (A\odot B):C &= \sum_{i=1}^m\sum_{j=1}^n A_{ij} B_{ij} C_{ij} \\ A:(B\oslash C) &= (A\oslash C):B &= \sum_{i=1}^m\sum_{j=1}^n A_{ij} B_{ij} / C_{ij} \\ }$$
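As a quick numerical sanity check of these differentials, here is a small NumPy sketch (the sizes and helper names are my own, not part of the derivation) comparing the claimed Softmax Jacobian $\Diag{s}-ss^T$ and log-sum-exp gradient $s$ against central finite differences.

```python
import numpy as np

rng = np.random.default_rng(0)
u = rng.normal(size=5)

def softmax(u):
    e = np.exp(u)
    return e / e.sum()

def logsumexp(u):
    return np.log(np.exp(u).sum())

s = softmax(u)
J_stated = np.diag(s) - np.outer(s, s)       # claimed Jacobian ds/du
g_stated = s                                 # claimed gradient of log-sum-exp

eps = 1e-6
I = np.eye(u.size)
J_fd = np.column_stack([(softmax(u + eps*ei) - softmax(u - eps*ei)) / (2*eps) for ei in I])
g_fd = np.array([(logsumexp(u + eps*ei) - logsumexp(u - eps*ei)) / (2*eps) for ei in I])

print(np.allclose(J_stated, J_fd, atol=1e-6))   # True
print(np.allclose(g_stated, g_fd, atol=1e-6))   # True
```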

The Kronecker product $(\otimes)$ can also be used to good effect since $$\eqalign{ M_i &= \LR{I\otimes x_i^T} \\ M_i\b &= \LR{I\otimes x_i^T} \vec{B} = \vec{\e_i^TX^TB} = {B^TX\e_i} \\ y_i &\approx \operatorname{Softmax}(M_i\b) = \exp\LR{B^TX\e_i}\oslash J\exp\LR{B^TX\e_i} \\ Y &\approx \exp\LR{B^TX}\oslash J\exp\LR{B^TX} \;\doteq\; S \\ }$$ The last equation is the Matrix Model that must be optimized for $B$ with respect to the cross-entropy loss function. Note that we now have matrix analogs of all of the vector quantities $$\eqalign{ u &\to U=B^TX \\ e &\to E=\exp(U) \\ s &\to S=\operatorname{Softmax}(U) &= E\oslash JE \\ &&= E\cdot\c{\Diag{E^T\o}^{-1}} \\ &&= E\c{D^{-1}} \\ y &\to Y,\quad Y^T\o=\o,\;&JY=J,\quad Y:J=N \\ }$$ Calculate the gradient of the loss function. $$\eqalign{ {\ell} &= -\BR{Y:\log(S)} \\ &= Y:\log(JE) - Y:U \\ \\ d{\ell} &= Y:d\log(JE) - Y:dU \\ &= Y:(J\,dE\oslash JE) - Y:dB^TX \\ &= (Y\oslash JE):(J\,dE) - Y^T:X^TdB \\ &= \LR{Y{D^{-1}}}:\BR{J\,dE} - XY^T:dB \\ &= JY{D^{-1}}:dE - XY^T:dB \\ &= JD^{-1}:(E\odot dU) - XY^T:dB \\ &= (E\odot JD^{-1}):dB^TX - XY^T:dB \\ &= \LR{ED^{-1}}^T:X^TdB - XY^T:dB \\ &= X\LR{ED^{-1}}^T:dB - XY^T:dB \\ \\ \grad{\ell}{B} &= X\LR{ED^{-1} - Y}^T \;\doteq\; G \\ }$$ This $G$ matrix can be used in any standard gradient descent algorithm.
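For concreteness, here is a minimal NumPy sketch of this matrix-form gradient (all sizes and names are illustrative assumptions, with one-hot columns for $Y$); it evaluates $G = X\LR{ED^{-1}-Y}^T$ and checks it against finite differences of the summed loss $\ell = -Y:\log(S)$.

```python
import numpy as np

rng = np.random.default_rng(0)
m, K, N = 4, 3, 6                                # m features, K classes, N samples
X = rng.normal(size=(m, N))                      # columns are the x_i
Y = np.eye(K)[rng.integers(K, size=N)].T         # one-hot labels, K x N
B = rng.normal(size=(m, K))

def loss(B):
    E = np.exp(B.T @ X)                          # E = exp(U),  U = B^T X
    S = E / E.sum(axis=0)                        # S = E D^{-1} (columnwise softmax)
    return -(Y * np.log(S)).sum()                # ell = -Y : log(S)

E = np.exp(B.T @ X)
S = E / E.sum(axis=0)
G = X @ (S - Y).T                                # G = X (E D^{-1} - Y)^T

# central finite-difference check of d(ell)/dB
eps = 1e-6
G_fd = np.zeros_like(B)
for i in range(m):
    for k in range(K):
        dB = np.zeros_like(B); dB[i, k] = eps
        G_fd[i, k] = (loss(B + dB) - loss(B - dB)) / (2 * eps)
print(np.allclose(G, G_fd, atol=1e-5))           # True
```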
The Hessian will be a fourth-order tensor and much uglier.

Update

Here is an attempt at a Hessian calculation (with flattening via vectorization).

First, recall some identities involving the Diag() operator $$\eqalign{ &E\odot JP = E\cdot\Diag{P^T\o} \\ &H = \Diag{h} \iff h = H\o \\ &\Diag{H\o} = \Diag{h} = H \qiq &\Diag{H^n\o} = H^n \\ }$$ Calculate the differential of $G$. $$\eqalign{ G &= X\LR{ED^{-1}-Y}^T \\ dG &= X\LR{dE\;D^{-1}}^T - X\LR{E\,dD\,D^{-2}}^T \\ &= XD^{-1}\,dE^T - XD^{-2}\,dD\,E^T \\ &= XD^{-1}\LR{E^T\odot dU^T} - XD^{-2}\Diag{dE^T\o}\,E^T \\ &= XD^{-1}\LR{\c{E^T\odot X^TdB}} - XD^{-2}\Diag{(\c{E^T\odot X^TdB})\o}\,E^T \\ }$$ Note the following vec expansion of the $\c{\rm red}$ term $$\eqalign{ &\E = \Diag{\vec{E^T}} \\ &\vec{\c{E^T\odot X^TdB}} = \E\cdot\vec{X^TdB} = \E \LR{I\otimes X^T} d\b \\ }$$ and use this to vectorize the differential relationship and recover the Hessian. $$\eqalign{ dg &= \vec{dG} \\ &= \LR{I\otimes XD^{-1}} \vec{\c{E^T\odot X^TdB}} - \LR{E\boxtimes XD^{-2}} \LR{\c{E^T\odot X^TdB}}\o \\ &= \BR{\LR{I\otimes XD^{-1}} - \o^T\otimes\LR{E\boxtimes XD^{-2}}} \;\vec{\c{E^T\odot X^TdB}} \\ &= \BR{\LR{I\otimes XD^{-1}} - \o^T\otimes\LR{E\boxtimes XD^{-2}}}\,\E\LR{I\otimes X^T} d\b \\ \grad{g}{\b} &= \BR{\LR{I\otimes XD^{-1}} - \o^T\otimes\LR{E\boxtimes XD^{-2}}}\,\E\LR{I\otimes X^T} \;=\; \hess{\ell}{\b\,}{\b^T} \\ }$$ where $\{\boxtimes\}$ denotes the Khatri-Rao product, which can be expanded in terms of Hadamard and Kronecker products $$\eqalign{ B\boxtimes A &= (B\otimes\o_m)\odot(\o_p\otimes A) \;\in \bbR{(pm)\times n} \\ \o_p &\in \bbR{p\times\o} \qquad A \in \bbR{m\times n} \qquad B \in \bbR{p\times n} \\\\ }$$ Note that the gradient can also be vectorized, if that is preferable $$\vec{G} \;=\; g \;=\; \grad{\ell}{\b}$$
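Here is a small NumPy sketch that assembles this Hessian exactly as written, using column-major vectorization (the convention implied by $\vec{X^TdB}=\LR{I\otimes X^T}d\b$) and building the Khatri-Rao product from its Hadamard/Kronecker expansion; the sizes and names are my own illustrative choices, and the result is compared with a finite-difference Jacobian of $g$.

```python
import numpy as np

rng = np.random.default_rng(1)
m, K, N = 4, 3, 6
X = rng.normal(size=(m, N))
Y = np.eye(K)[rng.integers(K, size=N)].T         # one-hot labels, K x N
B = rng.normal(size=(m, K))

def vec(A):                                      # column-major vectorization
    return A.reshape(-1, order='F')

def grad_vec(beta):                              # g = vec(G) as a function of b = vec(B)
    Bm = beta.reshape(m, K, order='F')
    E = np.exp(Bm.T @ X)
    return vec(X @ (E / E.sum(axis=0) - Y).T)

E = np.exp(B.T @ X)
d = E.sum(axis=0)                                # diagonal of D = Diag(E^T 1)

# Khatri-Rao product E ⊠ (X D^-2): column n is (E e_n) ⊗ (X D^-2 e_n)
KR = np.einsum('kn,mn->kmn', E, X / d**2).reshape(K * m, N)

# Hessian: [ (I ⊗ X D^-1) - 1^T ⊗ (E ⊠ X D^-2) ] Diag(vec(E^T)) (I ⊗ X^T)
I_K = np.eye(K)
H = (np.kron(I_K, X / d) - np.kron(np.ones((1, K)), KR)) \
    @ np.diag(vec(E.T)) @ np.kron(I_K, X.T)

# compare with a central finite-difference Jacobian of g(b)
eps, b = 1e-6, vec(B)
H_fd = np.column_stack([(grad_vec(b + eps * e) - grad_vec(b - eps * e)) / (2 * eps)
                        for e in np.eye(m * K)])
print(np.allclose(H, H_fd, atol=1e-4))           # True
```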

Solution 2:

Notice that $$ L(\beta) = \frac{1}{N} \sum_{i=1}^N h_i(M_i \beta) $$ where $h_i: \mathbb R^K \to \mathbb R$ is the function defined by $$ h_i(u) = \ell(y_i, S(u)). $$ The softmax function and the cross-entropy loss function are natural companions, and they are happy to be combined together into the function $h_i$. By the chain rule, $$ L'(\beta) = \frac{1}{N} \sum_{i=1}^N h_i'(M_i \beta) M_i. $$ If we use the convention that the gradient is a column vector, then we have $$ \nabla L(\beta) = L'(\beta)^T = \frac{1}{N} \sum_{i=1}^N M_i^T \nabla h_i(M_i \beta). $$ So, to compute $\nabla L(\beta)$, we only need to figure out how to evaluate $\nabla h_i(u)$. We are nearly done already.

Before we compute $\nabla h_i(u)$, let's first simplify $h_i(u)$ as much as possible. The components of the label vector $y_i \in \mathbb R^K$ will be denoted $y_{i1}, \ldots, y_{iK}$. Note that $y_{i1} + \cdots + y_{iK} = 1$. Doing some high school algebra and using some properties of logarithms, we find that \begin{align} h_i(u) &= \sum_{k=1}^K - y_{ik} \log\left(\frac{e^{u_k}}{e^{u_1} + \cdots + e^{u_K}}\right) \\ &= \sum_{k=1}^K -y_{ik} u_k + y_{ik} \log(e^{u_1} + \cdots + e^{u_K}) \\ &= -\langle y_i, u \rangle + \sum_{k=1}^K y_{ik} \underbrace{\log(e^{u_1} + \cdots + e^{u_K})}_{\text{common factor}} \\ &= - \langle y_i, u \rangle + \log(e^{u_1} + \cdots + e^{u_K})\underbrace{\sum_{k=1}^K y_{ik}}_1 \\ &= \underbrace{\log(e^{u_1} + \cdots + e^{u_K})}_{\text{LogSumExp function}} - \langle y_i, u \rangle. \end{align} Look how much $h_i(u)$ simplified! And look how beautiful it is that our old friend the LogSumExp function has appeared. This goes to show that the softmax function and the cross-entropy loss function are meant to be together. This is why PyTorch's loss function torch.nn.functional.cross_entropy takes "logits" as input. (In other words, PyTorch's cross_entropy loss function assumes that you have not yet applied the softmax function.)
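As a tiny numerical illustration of this identity (the variable names are mine), the cross-entropy of the softmax really does equal LogSumExp minus the inner product:

```python
import numpy as np
from scipy.special import logsumexp, softmax

rng = np.random.default_rng(0)
K = 4
u = rng.normal(size=K)                  # logits
y = rng.dirichlet(np.ones(K))           # any probability vector (sums to 1)

lhs = -(y * np.log(softmax(u))).sum()   # cross-entropy applied to softmax(u)
rhs = logsumexp(u) - y @ u              # LogSumExp(u) - <y, u>
print(np.isclose(lhs, rhs))             # True
```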

Now we are ready to compute the gradient. The gradient of the LogSumExp function is the softmax function, which is a beautiful fact: differentiating with respect to $u_k$ gives $$ \frac{\partial}{\partial u_k} \log(e^{u_1} + \cdots + e^{u_K}) = \frac{e^{u_k}}{e^{u_1} + \cdots + e^{u_K}} = S_k(u). $$ It follows that $$ \nabla h_i(u) = S(u) - y_i. $$ This is a beautiful and delightfully simple formula. Interpretation: If the predicted probability vector $S(u)$ agrees perfectly with the ground truth probability vector $y_i$, then $\nabla h_i(u)$ is zero, suggesting that no change to the value of $u$ is necessary.

Putting the above pieces together, we see that $$ \nabla L(\beta) = \frac{1}{N} \sum_{i=1}^N M_i^T (S(M_i \beta) - y_i). $$

We are now done with computing the gradient of $L$, but it's helpful to make one further observation. Let's look more closely at the $i$th term in the sum above, which we shall call $g_i$. The vector $\beta$ has block structure, since it is the concatenation of the vectors $\beta_1, \ldots, \beta_K$. Let $S_k$ be the $k$th component function of $S$. Notice that $$ g_i = M_i^T(S(M_i \beta) - y_i) = \underbrace{\begin{bmatrix} \hat x_i & & & \\ & \hat x_i & & \\ & & \ddots & \\ & & & \hat x_i \end{bmatrix}}_{K(d+1) \times K} \underbrace{\begin{bmatrix} S_1(M_i \beta) - y_{i1} \\ S_2(M_i \beta) - y_{i2} \\ \vdots \\ S_K(M_i \beta) - y_{iK}\end{bmatrix}}_{K \times 1}. $$ We can see that the $k$th block of $g_i$ is $(S_k(M_i \beta) - y_{ik}) \hat x_i$. This is a very nice formula. Interpretation: If the predicted probability $S_k(M_i \beta)$ agrees perfectly with the ground truth probability $y_{ik}$, then the $k$th block of $g_i$ is $0$, suggesting that no change needs to be made to $\beta_k$.


The above observation makes life easier when implementing multiclass logistic regression from scratch in Python. Due to the block structure of the vectors $\beta$ and $g_i \in \mathbb R^{K(d+1)}$, it is convenient to store $\beta$ and $g_i$ as NumPy arrays of shape $(d+1) \times K$. Then $M_i \beta$ can be computed with the NumPy command MiBeta = beta.T @ xiHat, and $g_i$ can be computed with the NumPy command gi = np.outer(xiHat, S(MiBeta) - yi). This makes it possible to implement multiclass logistic regression from scratch in Python very concisely, particularly when using stochastic gradient descent with a batch size of 1, as in the sketch below.
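Here is a minimal sketch of such an implementation, assuming the rows of a hypothetical array Xhat are the augmented vectors $\hat x_i$ and the rows of Y are the one-hot labels $y_i$; the function names, learning rate, and epoch count are my own illustrative choices.

```python
import numpy as np

def S(u):                                   # softmax, shifted for numerical stability
    e = np.exp(u - u.max())
    return e / e.sum()

def train_sgd(Xhat, Y, num_epochs=100, lr=0.1, seed=0):
    """Multiclass logistic regression via SGD with batch size 1.

    Xhat : (N, d+1) array whose rows are the augmented feature vectors.
    Y    : (N, K) array whose rows are the one-hot label vectors.
    """
    rng = np.random.default_rng(seed)
    N, dp1 = Xhat.shape
    K = Y.shape[1]
    beta = np.zeros((dp1, K))               # beta stored as a (d+1) x K array
    for _ in range(num_epochs):
        for i in rng.permutation(N):
            xiHat, yi = Xhat[i], Y[i]
            MiBeta = beta.T @ xiHat                    # M_i beta, a vector in R^K
            gi = np.outer(xiHat, S(MiBeta) - yi)       # the (d+1) x K gradient block g_i
            beta -= lr * gi                            # SGD step
    return beta
```

Each inner iteration is exactly the update $\beta \leftarrow \beta - \text{lr}\cdot g_i$, with $g_i$ stored in the same $(d+1)\times K$ layout as $\beta$.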


Next let's compute the Hessian of $L$. The Hessian matrix $H L(\beta)$ is the derivative of the function $$g(\beta) = \nabla L(\beta) = \frac{1}{N} \sum_{i=1}^N M_i^T (S(M_i \beta) - y_i). $$ To compute the derivative of $g$, it is helpful to know that $$ S'(u) = \text{diag}(S(u)) - S(u) S(u)^T, $$ which is analogous to the formula for the derivative of the sigmoid function. Again using the chain rule, we find that the derivative of $g$ is \begin{align} g'(\beta) &= \frac{1}{N} \sum_{i=1}^N M_i^T S'(M_i \beta) M_i. \end{align} This formula simplifies nicely: since $M_i = I \otimes \hat x_i^T$, the mixed-product property of the Kronecker product gives $$ M_i^T S'(M_i \beta) M_i = S'(M_i \beta) \otimes \hat x_i \hat x_i^T, $$ so each term of the sum is the $K \times K$ softmax Jacobian at $M_i \beta$, Kronecker-multiplied by the rank-one matrix $\hat x_i \hat x_i^T$.
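To make this concrete, here is a short NumPy sketch (array and function names are my own) that forms each $M_i = I \otimes \hat x_i^T$ explicitly with np.kron and accumulates the average; the commented line shows the equivalent Kronecker form noted above.

```python
import numpy as np

def S(u):                                           # softmax
    e = np.exp(u - u.max())
    return e / e.sum()

def hessian(Xhat, beta):
    """Hessian of L at beta.  Xhat: (N, d+1), rows are x_i-hat;  beta: (d+1, K) array."""
    N, dp1 = Xhat.shape
    K = beta.shape[1]
    H = np.zeros((K * dp1, K * dp1))
    for xiHat in Xhat:
        Mi = np.kron(np.eye(K), xiHat[None, :])     # M_i = I ⊗ x_i-hat^T, shape K x K(d+1)
        s = S(beta.T @ xiHat)
        Sprime = np.diag(s) - np.outer(s, s)        # S'(M_i beta)
        H += Mi.T @ Sprime @ Mi
        # equivalently: H += np.kron(Sprime, np.outer(xiHat, xiHat))
    return H / N
```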

So we have found our Hessian matrix $HL(\beta) = g'(\beta)$. Having computed the gradient and Hessian of $L$, it is now straightforward to train a multiclass logistic regression model from scratch using gradient descent or Newton's method in a language such as Python.
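For instance, a single Newton update could look like the following self-contained sketch; the helper names, the choice to stack $\beta$ class-block by class-block, and the small ridge term (added so the linear solve stays well posed, since the Hessian of the softmax model is only positive semidefinite) are my own choices rather than part of the solution above.

```python
import numpy as np

def S_cols(U):                                  # columnwise softmax of a K x N array
    E = np.exp(U - U.max(axis=0))
    return E / E.sum(axis=0)

def newton_step(Xhat, Y, beta, ridge=1e-8):
    """One Newton update.  Xhat: (N, d+1), Y: (N, K) one-hot, beta: (d+1, K)."""
    N, dp1 = Xhat.shape
    K = Y.shape[1]
    P = S_cols(beta.T @ Xhat.T).T               # (N, K) predicted probabilities
    grad = Xhat.T @ (P - Y) / N                 # (d+1, K), same block layout as beta
    H = np.zeros((K * dp1, K * dp1))
    for s, x in zip(P, Xhat):                   # (1/N) sum_i S'(M_i beta) ⊗ x_i x_i^T
        H += np.kron(np.diag(s) - np.outer(s, s), np.outer(x, x))
    H = H / N + ridge * np.eye(K * dp1)
    delta = np.linalg.solve(H, grad.T.reshape(-1))   # solve H * delta = vec(grad)
    return beta - delta.reshape(K, dp1).T
```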