Dynamic Programming
Dynamic programming is a general type of approach to problem solving, and particular equations used for making a sequence of interrelated decisions must be developed to fit each situation.
Characteristics of dynamic programming problems
- The problem can be divided into stages, with a policy decision required at each stage. Dynamic programming problems require making a sequence of interrelated decisions, where each decision corresponds to one stage of the problem.
- Each stage has a number of states associated with the beginning of that stage. The states are the various possible conditions in which the system might be at that stage of the problem. The number of states may be either finite or infinite.
- The effect of the policy decision at each stage is to transform the current state to a state associated with the beginning of the next stage (possible according to a probability distribution). Dynamic programming problems can be interpreted in terms of networks. Each node corresponds to a state. The network would consist of columns of nodes, with each column corresponding to a stage, so that the flow from a node can go only to a node in the next column to the right. The links from a node to nodes in the next column correspond to the possible policy decisions on which state to go to next. The value assigned to each link usually can be interpreted as the immediate contribution to the objective function from making that policy decision. In most cases, the objective corresponds to finding either the shortest or the longest path through the network.
- The solution procedure is designed to find an optimal policy for the overall problem, i.e. a prescription of the optimal policy decision (x*n) at each stage (n) for each of the possible states (s). Dynamic programming provides the kind of policy prescription of what to do under every possible circumstance (which is why the actual decision made upon reaching a particular state at a given stage is referred to as a policy decision). Providing this additional information beyond simply specifying an optimal solution (optimal sequence of decisions) can be helpful in a variety of ways, including sensitivity analysis.
- Given the current state, an optimal policy for the remaining stages is independent of the policy decisions adopted in previous stages. Therefore, the optimal immediate decision depends on only the current state and not on how you got there. This is the principle of optimality for dynamic programming. For dynamic programming problems, knowledge of the current state of the system conveys all the information about its previous behaviour necessary for determining the optimal policy henceforth.
- The solution procedure begins often by finding the optimal policy for the last stage. The optimal policy for the last stage prescribes the optimal policy decision for each of he possible states at that stage. The solution of this one-stage problem is usually trivial.
- A recursive relationship that identifies the optimal policy for stage n, given the optimal policy for stage n + 1, is available. Finding the optimal policy decision when you start in state s at stage n requires finding the minimizing (or maximizing) value of xn.
Used notation:
N = number of stages.
n = label for current stage (n = 1, 2, …, N).
sn = current state for stage n.
xn = decision variable for stage n.
x*n = optimal value of xn (given sn).
fn(sn, xn) = contribution of stages n, n +1, …, N to objective function if system starts in state sn at stage n, immediate decision is xn, and optimal decisions are made hereafter. Further, we use the notation
f*n(sn) = fn(sn, x*n).
The recursive relationship will always be of the form
f*n(sn) = max {fn(sn, xn)} or f*n(sn) = min {fn(sn, xn)},
xn xn
where fn(sn, xn) would be written in terms of sn, xn, f*n+1(sn+1), and probably some measure of the immediate contribution of xn to the objective function. It is the inclusion off*n+1(sn+1) on the right-hand side, so that f*n(sn) is defined in terms of f*n+1(sn+1) that makes the expression for f*n(sn) a recursive relationship.
The recursive relationship keeps recurring as we move backward stage by stage. When the current stage number n is decreased by 1, the new f*n(sn) function is derived by using the f*n+1(sn+1) function that was just derived during the preceding iteration, and then this process keeps repeating. This property is emphasized in th next (and final) characteristic of dynamic programming.
- When we use the recursive relationship, the solution procedure starts at the end and moves backward stage by stage – each time finding the optimal policy for that stage – until it finds the optimal policy starting at the initial stage. This optimal policy immediately yields an optimal solution for the entire problem, namely, x*1 for the initial state s1, then x*2 for the resulting state s2, thenx*3 for the resulting state s3, and so forth to x*1 for the resulting state sN.
For all dynamic problems, a table such as the following would be obtained for each stage (n = N, N – 1, …, 1).
fn(sn, xn)
xn
sn f*n(sn) x*n
When this table is finally obtained for the initial stage (n = 1), the problem of interest is solved. Because the initial state is known, the initial decision is specified by x*1 in this table. The optimal value of the other decision variables is then specified by the other tables in turn according to the state of the system that results from the preceding decisions.
Deterministic Dynamic Programming
The dynamic programming approach to deterministic problems: the state at the next stage is completely determined by the state and policy decision at the current stage.
One categorization of deterministic dynamic programming problems is by the form of the objective function: the objective might be to minimize (or minimize) the sum (or the product) of the contributions from the individual stages.
Another categorization is in terms of the nature of the set of states for the respective stages: states sn might be represented by a discrete state variable or by a continuous state variable, or a state vector (more than one variable) is required.
The underlying basic structure for deterministic dynamic programming always remains the same:
Stage n Stage n+1
xn
State: sn → sn+1
Value: fn(sn, xn) Contribution of xn f*n+1(sn+1)
Example 1
The Build-Em-Fast Company has agreed to supply its best customer with three widgits during each of the next 3 weeks, even though producing them will require some overtime work. The relevant production data are as follows:
Week / Maximum Production,Regular Time / Maximum Production,
Overtime / Production Cost per Unit
Regular Time
1 / 2 / 2 / $300
2 / 3 / 2 / $500
3 / 1 / 2 / $400
The cost per unit produced with overtime for each week is $100 more than for regular time. The cost of storage is $50 per unit for each week it is stored. There is already an inventory of two widgets on hand currently, but the company does not want to retain any widgets in inventory after the 3 weeks.
Management wants to know how many units should be produced in each week to minimize the total cost of meeting the delivery schedule.
Let us solve this by using dynamic programming.
Solution
The decisions that need to be made are the number of units to produce each week, so the decision variables are
xn = number of widgets to produce in week n, for n = 1, 2, 3.
To choose the value of xn, it is necessary to know the number of widgets already on hand at the beginning of week n. Therefore, the “state of the system” then is
sn= number of widgets on hand at the beginning of week n, for n = 1, 2, 3.
Because of the shipment of three widgets to the customer each week, note that
sn+1 = sn + xn -3.
Also note that two widgets are on hand at the beginning of week 1, so
s1 = 2.
To introduce symbols for the data given in the above table for each week n, let
cn= unit production cost in regular time,
rn= maximum regular time production,
mn= maximum total production.
The corresponding data are given in the following table.
n / rn / mn / cn1 / 2 / 4 / 300
2 / 3 / 5 / 500
3 / 1 / 3 / 400
We wish to minimize the total cost of meeting the delivery schedule, so our measure of performance is total cost. For n = 1, 2, 3, let
pn(sn, xn) = cost incurred in week n when the state is sn and the decision is xn.
Thus,
.
Let
(sn) = optimal cost for week n onward (through the end of week 3) when starting week n in state sn, for n = 1, 2, 3.
Given sn, the feasible values of xn are the integers such that xn ≤ mn and xn ≥ 3 - sn (assuming sn ≤ 3) in order to provide the customer with 3 widgets in week n. Thus, because sn+1 = sn + xn - 3, the optimal value of xn (denoted by) is obtained from the following recursive relationship.
,
where
(s4) = 0for s4 = 0.
Recall that the company does not want to retain any widgets in inventory after the three weeks, so s4 = 0. Therefore, the optimal policy for week 3 obviously is to produce just enough widgets to have a total of three to ship to the customer, so
= 3 - s3.
Given and (s3), we can use the recursive relationship to solve for the optimal policy for week 2 and then for week 1. These calculations are shown below.
For n = 3:
s3 / f3*(s3) / x3*0 / 1,400 / 3
1 / 900 / 2
2 / 400 / 1
3 s3 / 0 / 0
For n = 2:
x2 / f2(s2, x2) = p2(s2, x2)+ f3*(s2+x2-3) / f2*(s2) / x2*s2 / 0 / 1 / 2 / 3 / 4 / 5
0 / 2,900 / 3,050 / 3,200 / 2,900 / 3
1 / 2,400 / 2,450 / 2,600 / 2,850 / 2,400 / 2
2 / 1,900 / 1,950 / 2,000 / 2,250 / 2,900 / 1,900 / 1
3 / 1,400 / 1,450 / 1,500 / 1,650 / 2,300 / 2,950 / 1,400 / 0
For n = 1:
x1 / f1(s1, x1) = p1(s1, x1)+ f2*(s1+x1-3) / f1*(s1) / x1*s1 / 0 / 1 / 2 / 3 / 4
2 / 3,200 / 3,050 / 3,000 / 2,950 / 2,950 / 4
Hence, the optimal plan is to produce four widgets in period 1, store three of them until period 2, and produce three in period 3, with a total cost of $2,950.
Example 2
Consider the following nonlinear programming problem:
MaximizeZ = ,
subject to
≤ 2.
(There are no nonnegativity constraints.) Let us use dynamic programming to solve this problem.
Solution
Let x1 be the decision variable at stage 1 and x2 be the decision variable at stage 2.
The key to applying dynamic programming to this problem is to interpret the right-hand side of the constraint as the amount of a resource being made available to activities 1 and 2 (whose levels are x1 and x2). Then the state of the system entering stage 1 (before choosing x1 and x2) and entering stage 2 (before choosing x2) is the amount of the resource still available for allocation to the remaining activities. Consequently, letting sn denote the state of the system entering stage n, we have
s1 = 2 and s2 = 2 –.
For n = 2:
At stage 2, we solve the following problem with variable x2:
Maximize f2(s2, x2) = x2,
subject to
x2 s2.
The optimal solution is, with.
For n = 1:
At stage 1, we maximize f1(2, x1) = ,
subject to
≤ 2.
x1= 1, -1are local maxima.
< 0 for x1 = 1, -1.
Hence, is locally concave around x1= 1 and x1= -1.
Since = = 1, whereas f(2, ) = 0 at the endpoints of the feasible region, we have both= 1 and = -1 as global maximizers and
Hence, the optimal solutions are = (1, 1),and = (1, -1) with an optimal objective function value of Z = 1.
Probabilistic Dynamic Programming
Probabilistic dynamic programming differs from deterministic dynamic programming in that the state at the next stage is not completely determined by the state and policy decision at the current stage. Rather, there is a probability distribution for what the next state will be, and this probability distribution is completely determined by the state and policy decision at the current stage.
Let S denote the number of possible states at stage n+1 and label these states as 1, 2, …, S. The system goes to state i with probability pi (i= 1, 2, …, S) given state sn and decision xn at stage n. If the system goes to state i, Ci is the contribution of stage n to the objective function. The resulting basic structure for probabilistic dynamic programming can be described diagrammatically.
Stage n Probability Stage n+1
p1 ….. C1 ….. state 1
. f*n+1(1)
State: sn decision xn .
.
pS ….. CS ….. state S
f*n+1(S)
Value: fn(sn, xn)
When such diagram is expanded to include all the possible states and decisions at all the stages, it is sometimes referred to as a decision tree. If the decision tree is not too large, it provides a useful way of summarizing the various possibilities.
Because of the probabilistic structure, the objective is to minimize (or maximize) the expected sum (or product) of the contributions from the individual stages, and the relationship between fn(sn, xn) and f*n+1(sn+1) necessarily is somewhat more complicated than for the deterministic dynamic programming. For example,
fn(sn, xn) = ∑Si=1 pi[Ci + f*n+1(i)],
with
f*n+1(i) = min fn+1(i, xn+1),
where this minimization is taken over the feasible values of xn+1.