5.2 Some things to remember

1.
Principle of optimality, by Bellman: "An optimal policy has the property that whatever the initial state and initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision." The proof is by contradiction; see page 54 of Optimal Control Theory by Donald Kirk, the simplest proof I've seen. An important case is when the performance index \(J\) is quadratic, as with LQR. In this course we only looked at the case where there is no coupling term in the LQR: \(J=\min _{u}\sum _{k=0}^{\infty }x^{T}Qx+u^{T}Ru\). This is solved for steady state by solving the Riccati equation; for the discrete case, use the Matlab dare() function. See also Introduction to Dynamic Programming, International Series in Modern Applied Mathematics and Computer Science, Volume 1 (Pergamon).
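As a sketch of the discrete steady-state LQR computation, here is the SciPy equivalent of Matlab's dare(), on an assumed scalar system \(x_{k+1}=x_{k}+u_{k}\) with \(Q=R=1\) (the numbers are illustrative, not from the course):

```python
import numpy as np
from scipy.linalg import solve_discrete_are

# Assumed scalar system x[k+1] = A x[k] + B u[k], cost sum x'Qx + u'Ru
A = np.array([[1.0]])
B = np.array([[1.0]])
Q = np.array([[1.0]])
R = np.array([[1.0]])

# Steady-state solution P of the discrete algebraic Riccati equation
P = solve_discrete_are(A, B, Q, R)

# Optimal state feedback gain: u = -K x
K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
print(P[0, 0], K[0, 0])
```

For this particular scalar case the Riccati equation reduces to \(p^{2}=1+p\), so \(p\) comes out as the golden ratio.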
2.
Remember the difference between state variables and decision variables. There can be more than one state variable in a problem, and the number of decisions to make at each state can differ as well. See problem 1, HW 7 for an example: the fire stations allocation problem. In that problem we had one state variable, the number of stations still available. The more state variables there are, the harder the problem is to solve by hand.
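A minimal sketch of this kind of allocation DP in Python (the benefit table below is hypothetical, not the actual HW 7 data): the single state variable is the number of stations remaining, and the decision at each stage is how many stations to give the current district.

```python
import math

def allocate(values, total):
    """values[d][k] = benefit of assigning k stations to district d."""
    n = len(values)
    # best[d][s] = max benefit over districts d..n-1 with s stations left
    best = [[0.0] * (total + 1) for _ in range(n + 1)]
    choice = [[0] * (total + 1) for _ in range(n)]
    for d in range(n - 1, -1, -1):
        for s in range(total + 1):
            b_best, k_best = -math.inf, 0
            for k in range(s + 1):          # decision: give k stations to d
                v = values[d][k] + best[d + 1][s - k]
                if v > b_best:
                    b_best, k_best = v, k
            best[d][s], choice[d][s] = b_best, k_best
    # recover the optimal allocation by replaying the choices
    alloc, s = [], total
    for d in range(n):
        k = choice[d][s]
        alloc.append(k)
        s -= k
    return best[0][total], alloc

values = [[0, 5, 7], [0, 4, 9], [0, 6, 8]]   # hypothetical benefit table
best_value, alloc = allocate(values, 2)
print(best_value, alloc)
```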
3.
\[ \lim _{\lambda \rightarrow 0}\frac{J\left ( \mathbf{u}+\lambda \mathbf{d}\right ) -J\left ( \mathbf{u}\right ) }{\lambda }=\left [ \nabla J\left ( \mathbf{u}\right ) \right ] ^{T}\cdot \mathbf{d}\]

Remember, \(\nabla J\left ( \mathbf{u}\right ) \) is a column vector. \(\nabla J\left ( \mathbf{u}\right ) =\begin{bmatrix} \frac{\partial J\left ( u\right ) }{\partial u_{1}}\\ \vdots \\ \frac{\partial J\left ( u\right ) }{\partial u_{n}}\end{bmatrix} \). This vector is the direction along which the function \(J\left ( \mathbf{u}\right ) \) increases the most, among all directions, at the point where it is evaluated.
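The limit above can be checked numerically with a finite difference: for small \(\lambda \), \(\left ( J(\mathbf{u}+\lambda \mathbf{d})-J(\mathbf{u})\right ) /\lambda \approx \left [ \nabla J(\mathbf{u})\right ] ^{T}\mathbf{d}\). A sketch on an assumed quadratic \(J\) (the matrix and vectors are made up for illustration):

```python
import numpy as np

# Assumed example function J(u) = u'Au + b'u with symmetric A
A = np.array([[2.0, 0.5],
              [0.5, 1.0]])
b = np.array([1.0, -1.0])

def J(u):
    return u @ A @ u + b @ u

def gradJ(u):
    # column vector of partials; for this J it is 2Au + b
    return 2 * A @ u + b

u = np.array([0.3, -0.7])
d = np.array([1.0, 2.0])
lam = 1e-6

# finite-difference approximation of the directional derivative
fd = (J(u + lam * d) - J(u)) / lam
print(fd, gradJ(u) @ d)
```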

4.
For a polytope, this is a useful trick. \begin{align*} \mathbf{u} & =\sum _{i=1}^{m}\lambda _{i}v^{i}\\ \left \Vert \mathbf{u}\right \Vert & =\left \Vert \sum _{i=1}^{m}\lambda _{i}v^{i}\right \Vert \\ & \leq \sum _{i=1}^{m}\left \Vert \lambda _{i}v^{i}\right \Vert =\sum _{i=1}^{m}\lambda _{i}\left \Vert v^{i}\right \Vert \end{align*}

The inequality is the triangle inequality; the final equality uses \(\lambda _{i}\geq 0\).

5.
Definition of continuity: if \(u^{k}\rightarrow u^{\ast }\) then \(J\left ( u^{k}\right ) \rightarrow J\left ( u^{\ast }\right ) \). We write \(\lim _{k\rightarrow \infty }J\left ( u^{k}\right ) =J\left ( u^{\ast }\right ) \). This must hold for every sequence \(u^{k}\rightarrow u^{\ast }\). See the real analysis handout. Note that \(u^{k}\rightarrow u^{\ast }\) is the same as \(\lim _{k\rightarrow \infty }\left \Vert u^{k}-u^{\ast }\right \Vert =0\).
6.
A closed set is one which includes all its limit points (it includes its boundary). Use \([0,1]\) as an example of a closed set and \((0,1)\) as an example of an open set. A set can be both open and closed at the same time (isn't math fun? wish life was this flexible).
7.
The intersection of closed sets is also a closed set. If the sets are convex, then the intersection is convex. But the union of convex sets need not be convex (for example, the union of two disjoint circles).
8.
Bolzano-Weierstrass (B-W) tells us that even if a sequence \(u^{k}\) does not converge, as long as it lies in a compact set it contains at least one subsequence \(u^{k_{i}}\) which does converge to some \(u^{\ast }\) in the set. So in a compact set, we can always extract a convergent subsequence, even from a non-converging sequence.
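A quick numerical illustration (my own toy example): \(u^{k}=(-1)^{k}(1-1/k)\) lives in the compact set \([-1,1]\) and does not converge, yet its even-index subsequence converges to \(1\).

```python
# u^k = (-1)^k (1 - 1/k): bounded in [-1, 1], non-convergent,
# but the even-index subsequence u^{2k} = 1 - 1/(2k) converges to 1
u = [(-1) ** k * (1 - 1 / k) for k in range(1, 10001)]

sub = u[1::2]          # entries with even k (k = 2, 4, 6, ...)
print(sub[-1])         # close to 1
```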
9.
If a set is not compact, then not all is lost. Assume the set is closed but unbounded, hence not compact. What we do is pick some \(R\) large enough and restrict attention to the elements with \(\left \Vert u\right \Vert \leq R\). The new set is closed and bounded, hence compact.
10.
\(J\left ( u\right ) =au^{2}+bu+c\) is coercive for \(a>0\). Note that for \(J\left ( u\right ) \) to be coercive, it has to blow up in all directions. For example, \(e^{u}\) is not coercive (it goes to zero as \(u\rightarrow -\infty \)). If \(A\) is a positive definite matrix, \(\mathbf{b}\in \Re ^{n}\) and \(c\in \Re \), then \(J\left ( u\right ) =\mathbf{u}^{T}A\mathbf{u}+\mathbf{b}^{T}\mathbf{u}+c\) is a coercive function. To establish this, reduce to the scalar case. Use \(\lambda _{\min }\left \Vert \mathbf{u}\right \Vert ^{2}\leq \mathbf{u}^{T}A\mathbf{u}\) and \(\mathbf{b}^{T}\mathbf{u}\geq -\left \Vert \mathbf{b}\right \Vert \left \Vert \mathbf{u}\right \Vert \) (Cauchy-Schwarz), so \(J\left ( u\right ) \geq \lambda _{\min }\left \Vert \mathbf{u}\right \Vert ^{2}-\left \Vert \mathbf{b}\right \Vert \left \Vert \mathbf{u}\right \Vert +c\). Since \(A\) is P.D., \(\lambda _{\min }>0\), so the lower bound has the same form as \(au^{2}+bu+c\) with \(a>0\), which blows up as \(\left \Vert \mathbf{u}\right \Vert \rightarrow \infty \). So \(J\) is coercive.
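A numerical sanity check of the lower bound \(J\left ( u\right ) \geq \lambda _{\min }\left \Vert \mathbf{u}\right \Vert ^{2}-\left \Vert \mathbf{b}\right \Vert \left \Vert \mathbf{u}\right \Vert +c\), on a randomly generated positive definite \(A\) (the data is arbitrary, just for the check):

```python
import numpy as np

rng = np.random.default_rng(0)

# An assumed positive definite A: M'M + I is always P.D.
M = rng.standard_normal((3, 3))
A = M.T @ M + np.eye(3)
b = rng.standard_normal(3)
c = -2.0

lam_min = np.linalg.eigvalsh(A).min()

def J(u):
    return u @ A @ u + b @ u + c

# check J(u) >= lam_min ||u||^2 - ||b|| ||u|| + c over random samples
ok = True
for _ in range(1000):
    u = 10 * rng.standard_normal(3)
    nu = np.linalg.norm(u)
    ok &= J(u) >= lam_min * nu ** 2 - np.linalg.norm(b) * nu + c - 1e-9
print(ok)
```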
11.
If in \(J\left ( u\right ) =\mathbf{u}^{T}A\mathbf{u}+\mathbf{b}^{T}\mathbf{u}+c\) the matrix \(A\) is not symmetric, write it as \(J\left ( u\right ) =\frac{1}{2}\mathbf{u}^{T}\left ( A^{T}+A\right ) \mathbf{u}+\mathbf{b}^{T}\mathbf{u}+c\). This works because \(\mathbf{u}^{T}A\mathbf{u}=\frac{1}{2}\mathbf{u}^{T}\left ( A^{T}+A\right ) \mathbf{u}\), and now the matrix in the quadratic form is symmetric.
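A one-line numeric check of the symmetrization identity, on an arbitrary non-symmetric matrix:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 4))      # deliberately non-symmetric
u = rng.standard_normal(4)

S = 0.5 * (A + A.T)                  # symmetric part of A
print(u @ A @ u, u @ S @ u)          # same value
```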
12.
\(\sum _{i,j}x_{i}x_{j}=\left ( \sum _{i}x_{i}\right ) ^{2}\)
13.
If \(\bar{\alpha }=\frac{1}{n}\sum _{i}\alpha _{i}\) then \(\sum _{i}\left ( \alpha _{i}-\bar{\alpha }\right ) ^{2}\geq 0\). Used to show Hessian is P.D. for given \(J\left ( u\right ) \). See HW 2, last problem.
14.
\(x^{T}Ax=\sum _{ij}A_{ij}x_{i}x_{j}\)
15.
To find a basic solution \(x_{B}\) which is not feasible, just find a basic solution with at least one negative entry. Since feasibility also requires \(x\geq 0\), such an \(x_{B}\) solves \(Ax=b\) but is not feasible. I.e. \(\begin{bmatrix} I & B \end{bmatrix}\begin{bmatrix} x_{B}\\ 0 \end{bmatrix} =b\) with some elements of \(x_{B}\) negative. For a basic solution to also be feasible, all its entries have to be nonnegative. (verify).
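A sketch of the construction with made-up data: pick basic columns of \(A\), set the non-basic variables to zero, solve for \(x_{B}\), then check the sign condition.

```python
import numpy as np

# Hypothetical Ax = b with A of size 2x4; the first two columns form I
A = np.array([[1.0, 0.0, 2.0, 1.0],
              [0.0, 1.0, 1.0, 3.0]])
b = np.array([1.0, -2.0])

basis = [0, 1]                        # indices of the chosen basic columns
xB = np.linalg.solve(A[:, basis], b)  # basic variables
x = np.zeros(4)
x[basis] = xB                         # non-basic variables stay zero

solves = np.allclose(A @ x, b)        # it solves Ax = b ...
feasible = bool(np.all(x >= 0))       # ... but x >= 0 fails (x[1] = -2)
print(solves, feasible)
```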
16.
Solutions to \(Ax=b\) are all of the form \(x_{p}+x_{h}\), where \(x_{h}\) is a solution to \(Ax=0\) and \(x_{p}\) is a particular solution to \(Ax=b\). This is similar to solving an ODE: we find the homogeneous solution (RHS zero) and add to it a particular solution of the original ODE (RHS not zero) to obtain the general solution.
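A sketch with SciPy on an assumed underdetermined system: any particular solution plus any null-space combination still solves \(Ax=b\).

```python
import numpy as np
from scipy.linalg import null_space

# Assumed 2x3 system with full row rank (so b is in the range of A)
A = np.array([[1.0, 2.0, 1.0],
              [0.0, 1.0, 1.0]])
b = np.array([3.0, 1.0])

xp = np.linalg.lstsq(A, b, rcond=None)[0]   # one particular solution
N = null_space(A)                           # basis for solutions of Ax = 0

# every x = xp + N t solves Ax = b, for any coefficients t
t = np.array([2.5])
x = xp + N @ t
print(np.allclose(A @ x, b))
```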
17.
The difference between Newton's method and conjugate gradient is that CG works well from a distance, since it does not need the Hessian. CG will converge in at most \(N\) steps (in exact arithmetic) if \(J(u)\) is a quadratic function in \(N\) variables. Newton converges in one step for a quadratic, but it works well only close to the optimum, since it uses the Hessian as per above.
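A sketch of both claims on an assumed \(3\times 3\) quadratic \(J(u)=\frac{1}{2}u^{T}Au-b^{T}u\): CG finishes in at most \(n=3\) steps, while Newton's single step is just solving \(Au=b\).

```python
import numpy as np

# Assumed symmetric positive definite A and vector b
A = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])
b = np.array([1.0, 2.0, 3.0])

def cg(A, b, x0):
    """Conjugate gradient on a quadratic: at most n steps."""
    x = x0.copy()
    r = b - A @ x          # residual = -gradient of J
    p = r.copy()           # first search direction
    steps = 0
    while np.linalg.norm(r) > 1e-12 and steps < len(b):
        Ap = A @ p
        alpha = (r @ r) / (p @ Ap)       # optimal step along p
        x = x + alpha * p
        r_new = r - alpha * Ap
        p = r_new + ((r_new @ r_new) / (r @ r)) * p  # conjugate direction
        r = r_new
        steps += 1
    return x, steps

x_cg, steps = cg(A, b, np.zeros(3))
x_newton = np.linalg.solve(A, b)     # Newton: one step for a quadratic
print(steps, np.allclose(x_cg, x_newton))
```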
18.
CG has superlinear convergence. This does not apply to steepest descent.
19.
The difference between steepest descent and conjugate direction is this: in SD, we use \(\nabla J(u^{k})\) as the direction to move at each step, i.e. \[ u^{k+1} = u^{k} - h \frac{ \nabla J(u^{k}) }{\| \nabla J(u^{k}) \|} \] where \(h\) is either a fixed step or an optimal one. But in CD we move along \(v^{k}\), which is the mutually conjugate vector to all previous \(v^{i}\). See my summary table below, as the methods can be confusing to tell apart.
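To see the contrast, here is a sketch of SD with the optimal step on an assumed ill-conditioned quadratic \(J(u)=\frac{1}{2}u^{T}Au\) (minimum at the origin); even with exact line search it zig-zags for many more than \(n=2\) steps, whereas a conjugate direction method would finish in \(n\).

```python
import numpy as np

# Assumed quadratic J(u) = 1/2 u'Au with an ill-conditioned A.
# For a quadratic, the optimal step along -g is h = (g'g)/(g'Ag).
A = np.diag([1.0, 10.0])
u = np.array([10.0, 1.0])

steps = 0
while np.linalg.norm(A @ u) > 1e-8 and steps < 1000:
    g = A @ u                        # gradient of J at u
    h = (g @ g) / (g @ A @ g)        # exact line-search step size
    u = u - h * g
    steps += 1
print(steps)                         # many more than n = 2 steps
```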
20.
To use dynamic programming, the problem should have optimal substructure and overlapping subproblems. This can sometimes be hard to see or check.
21.
Steepest descent with optimal step size has only linear convergence, not quadratic (quadratic convergence is a property of Newton's method).
22.
For symmetric \(Q\), then \(\frac{\partial \left ( x^{T}Qx\right ) }{\partial x}=2Qx\)