linear least squares minimizing distance from points to rays - is it possible?
ad 1) Because any linear function of T can be written in the form ε = N · T for some vector N, there is little hope of improving your method if we are restricted to linear least squares.
EDIT It is actually possible to formulate it as a linear least squares problem with linear constraints, which makes it possible to see that just computing the distance from the ray, as you proposed, should work fine and is easy to implement; see (*).
So it is better not to implement what I suggested under "ad 2)", because it is more complicated than necessary.
ad 2) Formulate it as a nonlinear least squares problem. Both the Gauss-Newton algorithm (GNA) and the Levenberg–Marquardt algorithm (LMA) are well suited to this type of problem. There are many free implementations of LMA available. It may be a good idea to generate a sufficiently large number of random initial values to avoid problems with local minima or non-convergence.
I would compute the residuals as follows: for each bearing measurement, apply a rotation matrix $R$ to the analytic expression for T such that the positive x-axis becomes the bearing ray. Then use ε = atan2(y-coord, x-coord) as the measurement error whose sum of squares should be minimized.
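Here is a rough sketch of that residual in Python (assuming the 2D constant-velocity model $T_i = A + Vt_i - P_i$ with sensor positions $P_i$ and unit bearing vectors $B_i$; the names `t`, `P`, `B` and the use of `scipy.optimize.least_squares` are my own choices, not necessarily what you have):

```python
import numpy as np
from scipy.optimize import least_squares

def angular_residuals(params, t, P, B):
    """Angular error per bearing: atan2 of T_i rotated so that B_i becomes
    the positive x-axis (rows of R_i are B_i and perp(B_i))."""
    A, V = params[:2], params[2:]
    T = A + np.outer(t, V) - P                   # analytic T_i = A + V*t_i - P_i
    x = np.einsum('ij,ij->i', B, T)              # B_i . T_i        (rotated x-coord)
    y = B[:, 0] * T[:, 1] - B[:, 1] * T[:, 0]    # cross(B_i) . T_i (rotated y-coord)
    return np.arctan2(y, x)

# Restart from several random initial values for (A, V), as suggested above:
# best = min((least_squares(angular_residuals, np.random.randn(4), args=(t, P, B))
#             for _ in range(20)), key=lambda r: r.cost)
```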
A problem with this approach can occur when the target passes closely by the sensor, because the errors introduced by the assumption that the target travels at constant speed are not modeled well (and because atan2 has a singularity at (0,0), which was not there in your linear approach). This can be fixed by treating T for each bearing measurement as unknown and minimizing its deviation from the analytic expression for T. However, the random initial values should only be generated for A and V; the initial values for T should just be computed with the analytic expression. (If this should really lead to convergence problems, then the initial values for T could also be randomly perturbed a bit.)
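A sketch of that augmented formulation, reusing the names from the previous snippet; the weight `w` on the constant-velocity deviation is a placeholder of my own (see the next edit for why this weighting actually matters):

```python
def augmented_residuals(params, t, P, B, w=10.0):
    """Unknowns: A, V, and one free T_i per bearing. Residuals: the angular
    error evaluated at the free T_i, plus w * (T_i - (A + V*t_i - P_i))."""
    n = len(t)
    A, V = params[:2], params[2:4]
    T = params[4:].reshape(n, 2)                 # free T_i, one per measurement
    x = np.einsum('ij,ij->i', B, T)
    y = B[:, 0] * T[:, 1] - B[:, 1] * T[:, 0]
    dev = T - (A + np.outer(t, V) - P)           # deviation from the analytic T_i
    return np.concatenate([np.arctan2(y, x), w * dev.ravel()])

# Initial values: random for (A, V), analytic for the T_i, as suggested above:
# A0, V0 = np.random.randn(2), np.random.randn(2)
# x0 = np.concatenate([A0, V0, (A0 + np.outer(t, V0) - P).ravel()])
```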
EDIT(*) When we treat T as unknown, we actually have to decide how we want to weight the errors of the measurements and the errors of our assumption against each other. In case we assume that the errors of the bearing measurements are negligible with respect to the assumption of constant velocity, we basically arrive again at the initial linear least squares formulation. It now reads $0 = \operatorname{cross}(B) \cdot T$ and $0 \leq B \cdot T$, together with the fact that $|T - (A+Vt-P)|$ should be minimized. This is identical to the initial formulation in case $0 < B \cdot T$. Otherwise, we have to replace the distance from the line with the distance from the origin (which is equivalent to your initial idea to use the distance from the ray for the objective function). The theory of quadratic programming has more to say about how the switching between the different cases has to be done correctly, but I hope it shouldn't be too difficult for this specific case.
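To make the case switching explicit (assuming $B$ is a unit vector): the point of the ray $\{sB : s \geq 0\}$ closest to $q = A+Vt-P$ is $\max(0,\, B \cdot q)\,B$, hence the squared distance from the ray is
$$|q - \max(0,\, B \cdot q)\,B|^2 = \begin{cases} [\operatorname{cross}(B) \cdot q]^2 & \text{if } B \cdot q \geq 0 \\ |q|^2 = [\operatorname{cross}(B) \cdot q]^2 + [B \cdot q]^2 & \text{if } B \cdot q < 0, \end{cases}$$
which is exactly the switching described above and leads to the compact formula in the next edit.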
EDIT 2 The squared distance from the bearing ray can be written in the form $[\operatorname{cross}(B)\cdot T]^2 + [\max(0,-B\cdot T)]^2$. I would probably use this together with my favorite Levenberg-Marquardt implementation, because then I don't have to worry about solving the normal equations myself. If I instead solved the "local" normal equations until they no longer change, this would be equivalent to using the Gauss-Newton method. I'm just not sure whether this is guaranteed to work; that's why I pointed to the quadratic programming solution (which is guaranteed to work).
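In residual form, so that a Levenberg-Marquardt routine squares and sums the terms itself, this could look as follows (same assumed names as before; note that $\max(0,\cdot)$ makes the second residual non-smooth at $B \cdot T = 0$, which is exactly the guarantee issue mentioned above):

```python
def ray_distance_residuals(params, t, P, B):
    """Two residuals per bearing: cross(B_i).T_i and max(0, -B_i.T_i),
    whose sum of squares is the squared distance from the bearing ray."""
    A, V = params[:2], params[2:]
    T = A + np.outer(t, V) - P
    r1 = B[:, 0] * T[:, 1] - B[:, 1] * T[:, 0]           # cross(B_i) . T_i
    r2 = np.maximum(0.0, -np.einsum('ij,ij->i', B, T))   # max(0, -B_i . T_i)
    return np.concatenate([r1, r2])

# result = least_squares(ray_distance_residuals, np.random.randn(4), args=(t, P, B))
```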