Learning Blackjack with ANN (Artificial Neural Network)

ECE 539

Final Project

Ip Kei Sam

ID: 9012828100

Abstract

Blackjack is a card game in which the player attempts to beat the dealer by obtaining a hand whose total is higher than the dealer's but no greater than 21. The probabilistic nature of Blackjack makes it an illustrative application for learning algorithms. The learning system needs to explore different strategies and favor those with a higher probability of winning. This project explores the use of an Artificial Neural Network (ANN) for learning Blackjack strategies, with reinforcement learning as the learning algorithm. Reinforcement learning maps situations to actions when the learner is not told which actions to take, but must discover through trial and error which actions yield the highest reward. The trained ANN is used to play Blackjack without being explicitly taught the rules of the game. Furthermore, the efficiency of the ANN and the results of the learning algorithm are investigated and interpreted as different strategies for playing Blackjack and other games of chance.

Background

Each player initially receives two cards; the object of Blackjack is to draw cards from a deck of 52 cards toward a total value of 21. The player can choose from the following actions:

  • Stand: stay with the current hand and take no more cards.
  • Hit: add a card to the hand to bring the total closer to 21.
  • Double Down: while holding exactly 2 cards, the player may double his bet, take exactly one more card, and then stand.
  • Split: holding a pair of cards of the same value, the player may split his hand into two hands. The player may split up to 3 times in a game.

For simplicity, Double Down and Split will not be considered in this project. The value of a hand is the sum of the values of its cards, where cards 2 through 10 are valued accordingly and J, Q, and K are each worth 10. Aces can count as either 1 or 11. Each player plays against the dealer, and the goal is to obtain a hand with a greater value than the dealer's hand but less than or equal to 21. A player may hit as many times as he wishes as long as his total does not exceed 21. The player can also win by holding 5 cards with a total of less than 21. The dealer hits while his hand is less than 17 and stands at 17 or above. When the player is dealt 21 points in the first 2 cards, he automatically wins his bet unless the dealer is also dealt 21. If the dealer has blackjack (21 points), the game is over and the dealer wins all bets, or ties with any player who also has blackjack (21 points).

Figures 1 and 2 show a demo of the Matlab Blackjack program (blackjack.m). The player presses the hit or stand button each turn, and the dealer's and player's total points are shown at the bottom, along with the balance remaining in the player's account and the amount the player has bet on this game. In this example, the player bet $0. The program exits when the player's balance drops below zero.
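For concreteness, the hand value described above can be computed as in the following short sketch (handvalue is a hypothetical helper written for illustration, not a function from blackjack.m):

    function v = handvalue(hand)
    % HANDVALUE  Total value of a Blackjack hand.
    % hand is a vector of card values 1..10, where 1 denotes an ace and
    % J, Q, and K have already been mapped to 10.
    numAces = sum(hand == 1);        % aces are stored with value 1
    v = sum(hand) + 10*numAces;      % first count every ace as 11
    while v > 21 && numAces > 0      % demote aces from 11 back to 1 while busting
        v = v - 10;
        numAces = numAces - 1;
    end
    end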

Figure 1: the initial state of the game; the first 2 cards are dealt to the player and to the dealer.

To measure the efficiency of the rules in Blackjack, I simulated the program playing against the dealer for 1000 games, where the dealer follows the 17-point rule. The efficiency can be observed from the percentage of wins and the percentage of drawn games. The comparison of the player's random moves versus the dealer's 17-point rule is shown in Figure 2a.
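For comparison, the dealer's fixed rule amounts to the one-line policy sketched below (dealer17 is a hypothetical helper that reuses the handvalue sketch above, not a function from the project code):

    function action = dealer17(hand)
    % DEALER17  Fixed dealer policy: hit below 17, stand at 17 or above.
    if handvalue(hand) < 17
        action = 1;    % hit
    else
        action = 0;    % stand
    end
    end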

Figure 2: the player chooses to stand; the dealer chooses to hit but goes bust. The player wins.

Strategy / Win % / Tie %
Player’s random moves / 31% / 8%
Dealer’s 17 points rule / 61% / 8%

Figure 2a: Efficiency of different strategies in Blackjack. Each strategy is played for 1000 games.

One goal of this project is to use an ANN to develop a better strategy that beats the dealer's 17-point rule, that is, a new strategy with a higher winning percentage. Different MLP configurations and preprocessing of the input training data sets will also be experimented with later in this paper. Finally, some Blackjack strategies interpreted from the experimental results will be explored.

Applying Reinforcement Learning to Blackjack

Reinforcement learning is a process of mapping situations to actions so that a reward value is maximized. The learning algorithm decides which actions to take by discovering, through trial and error, the actions that yield the highest reward. The actions taken affect not only the immediate reward but also the subsequent rewards. In Blackjack, given a set of dealer's and player's cards, if the probability of winning from each outcome were known, the player could always make the best decision (hit or stand) by taking the action that leads to the next state with the highest winning probability.

For each final state, the probability of winning is either 1 (if the player wins or draws) or 0 (if the player loses). In this project, the initial winning probability of each intermediate state is set to 0.5, and the learning parameter α is also initialized to 0.5. The winning probability of each state is updated for the dealer and the player after each game. The winning probability of the previous state is moved closer to that of the current state based on the equation P(s) = P(s) + α*[P(s') - P(s)], where α is the learning parameter, s is the current state, and s' is the next state. For example, Figure 3 shows the first 3 rows taken from the result table produced when the Matlab program genlookuptable.m plays against the dealer using random decisions.
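A minimal sketch of this end-of-game backup is shown below; prob holds the estimates of the states visited in one game, in order, and the names are illustrative rather than those used in genlookuptable.m. The initial values are chosen to reproduce the worked example that follows:

    % Back up the winning probabilities after one game (illustrative sketch).
    prob = [0.5 0.5 0.5];              % estimates of the visited states; the last entry is the final state
    playerWon = 0;                     % 1 for a win or draw, 0 for a loss; here the player lost
    alpha = 0.5;                       % learning parameter
    prob(end) = playerWon;             % reward the final state
    for k = numel(prob)-1 : -1 : 1     % sweep backwards through the visited states
        prob(k) = prob(k) + alpha*(prob(k+1) - prob(k));
    end
    % prob is now [0.3750 0.2500 0], matching the updates of P(2) and P(1) computed below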

2.0000 5.0000 0 0 0 6.0000 6.0000 0 0 0 0.3700 1.0000 0

2.0000 5.0000 0 0 0 4.0000 6.0000 6.0000 0 0 0.2500 1.0000 0

2.0000 5.0000 10.0000 0 0 4.0000 6.0000 6.0000 7.0000 0 0 1.0000 1.0000

Figure 3: the result table (lpxx) from the Matlab program output after one game.

The first 5 columns represent the dealer's cards and the next 5 columns represent the player's cards. By the game rule, the dealer and the player can each have a maximum of 5 cards. The card values in each hand are sorted in ascending order before they are inserted into the table. Column 11 is the winning probability of each state. Columns 12 and 13 represent the action taken by the player, where [1 0] denotes a "hit", [0 1] denotes a "stand", and [1 1] denotes an end state where no action is required. In the first row, the dealer has 2 and 5 (a total of 7) and the player has 6 and 6 (a total of 12). Based on a random decision, the player decides to hit. In the next round the player draws a 4 for a total of 16, while the dealer draws a 10 for a total of 17. The player decides to hit again. This time the player is busted with a total of 23 in the final state (row 3). Since the player lost, this state is rewarded with P(3) = 0. The learning parameter is α = 0.5. The winning probability of taking a "hit" in the previous state is:

P(2) = P(2) + α*[P(3)-P(2)] = 0.5 + 0.5 * (0-0.5) = 0.2500

Similarly, P(1) = P(1) + α*[P(2)-P(1)] = 0.5 + 0.5 * (0.25-0.5) = 0.3750

After playing a sequence of games, the result table contains, for every intermediate state, the dealer's cards, the player's cards, the action taken by the player, and the winning probability of that state.

When playing a new game, if the same card pattern is encountered, for example in row 2,

2.0000 5.0000 0 0 0 4.0000 6.0000 6.0000 0 0 0.2500 1.0000 0

The player will choose to "stand" this time, since taking a "hit" has a low winning probability. After the new game, the winning probability is again updated based on the game result. If the player hits an additional card and loses again, the winning probability becomes: P(2) = P(2) + α*[P(3)-P(2)] = 0.25 + 0.5 * (0-0.25) = 0.125

Otherwise, if the player wins this time, the probability is:

P(2) = P(2) + α*[P(3)-P(2)] = 0.25 + 0.5 * (1-0.25) = 0.625

The winning probability of each state converges after a large number of games. Of course, the more games are played, the more accurate the winning probabilities become.

Originally, the dealer follows the 17-point rule of Blackjack. To make the dealer more intelligent, reinforcement learning can be applied to the dealer as well.

Similarly, for the dealer, a result table can be obtained after each game. Let’s take a look at the following dealer result table:

2.0000 5.0000 0 0 0 6.0000 6.0000 0 0 0 0.7500 1.0000 0

2.0000 5.0000 10.0000 0 0 4.0000 6.0000 6.0000 0 0 1.0000 1.0000 1.0000

In this game, the dealer was initially dealt 2 and 5 for a total of 7, while the player was dealt 6 and 6 for a total of 12. The dealer chose to hit, resulting in a total of 17 points in the final state. The player also decided to hit, but ended with a total of only 16 points. The dealer's final state has P(2) = 1, so P(1) = P(1) + α*[P(2)-P(1)] = 0.5 + 0.5 * (1-0.5) = 0.7500

When the same card pattern (in row 1) is encountered again in a new game, the dealer will choose to hit again, since a "hit" gives a higher probability of winning. Again, the probability value will be updated based on the result after each new game.

The learning parameter α is initially set to 0.5, and it decreases after each game based on this equation: α = α * (# of games remaining) / (total # of games).

Therefore, as all test games are finished, α becomes 0. If the player or the dealer always takes the action that leads to the state with the highest winning probability, some intermediate states, still at the initial winning probability of 0.5, will always be valued below other states and will never be visited. Therefore, random moves are necessary in between in order to explore these states. The winning probabilities of states reached by random moves are not updated at the end of the game. The total number of games simulated in this program is 1000. In each game, 20% of the moves are randomly generated. Since α decreases after each game, the winning probability value of each state converges as the number of games played increases.
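A minimal sketch of this move selection (greedy with respect to the stored winning probabilities, with an ε fraction of random exploratory moves, ε = 0.2 in these experiments) is given below; the function and variable names are illustrative, not those of the actual program:

    function [action, exploring] = chooseMove(pHit, pStand, epsilon)
    % CHOOSEMOVE  Pick hit (1) or stand (0) from the stored winning probabilities.
    % pHit and pStand are the table values of the two candidate next states;
    % "exploring" flags a random move, whose value is not updated after the game.
    exploring = rand < epsilon;
    if exploring
        action = randi([0 1]);   % uniformly random move, used to visit unseen states
    elseif pHit > pStand
        action = 1;              % hit: higher estimated winning probability
    else
        action = 0;              % stand otherwise
    end
    end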

Let’s look at the previous example one more time:

2.0000 5.0000 0 0 0 6.0000 6.0000 0 0 0 0.7500 1.0000 0

2.0000 5.0000 10.0000 0 0 4.0000 6.0000 6.0000 0 0 1.0000 1.0000 1.0000

The P(1) value starts at the default value of 0.5; the first time the above pattern is encountered and the game is won, P(1) is updated to 0.7259. Over 5000 simulated games, this pattern is encountered 9 times, and the P(1) value converges to 0.76 as shown in Figure 4 below. This suggests that when the player holds two 6's, taking a "hit" gives a better chance of winning.

Figure 4: the convergence of winning probability in 5000 games.

Let's look at another example. This time the player holds 8 and 10:

7.0000 8.0000 0 0 0 8.0000 10.0000 0 0 0 0.2500 1.0000 0

3.0000 7.0000 8.0000 0 0 8.0000 10.0000 5.0000 0 0 0 1.0000 1.0000

The P(1) value starts at the default value of 0.5; the first time the above pattern is encountered and the game is lost, P(1) is updated to 0.25. Over 5000 simulated games, this pattern is encountered 11 times, and the P(1) value converges to 0.35 as shown in Figure 5. This suggests that when the player holds 8 and 10, taking a "hit" is very likely to lose. In this case, the player should only "hit" when the dealer's hand is higher than the player's hand.

Figure 5: the convergence of winning probability in 5000 games.

Both experiments show the convergence of the winning probability value of each state as the learning parameter decreases after each game.

P(s) = P(s) + α*[P(s') - P(s)]     (1)

The update method based on equation (1) performs quite well in this experiment. As α is reduced over time, the method converges to the true winning probability of each state given optimal play. Except for the random moves, each action taken (hit or stand) is in fact the optimal move against the dealer, since the method converges to an optimal policy for playing Blackjack. If α does not decrease to 0 by the end of training, the player also plays well against a dealer that slowly changes its way of playing over time. Applying reinforcement learning to the dealer likewise clearly beats the dealer's original 17-point rule.

Matlab Implementation

As the Matlab program (blackjackN.m) plays against the dealer, each time a new card pattern is encountered it is assigned a default probability value of 0.5; its winning probability is updated after each game based on the final result and added to the result table at the end. States that the program has not yet encountered are not available in the result table. When the program encounters a card pattern that is not found in the result table, it makes its decision with a random move. There are approximately 10^6 different states for the dealer's and player's card patterns. Many of these states do not need to be considered, since they go well beyond 21 points. For example, the state of having five 10's in hand is not possible, because the hand is already busted at the third 10. Moreover, it is not practical to simulate all of these states because of the slow iterative operations in Matlab.
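A minimal sketch of this lookup behavior is given below, assuming each table row stores the five sorted, zero-padded dealer cards, the five player cards, and the winning probability in column 11 as in Figure 3 (the function and variable names are illustrative, not those of blackjackN.m):

    function p = lookupProb(tbl, dealerCards, playerCards)
    % LOOKUPPROB  Stored winning probability of a card pattern, or the
    % default of 0.5 if the pattern has not been encountered yet.
    key = [sort(dealerCards), zeros(1, 5 - numel(dealerCards)), ...
           sort(playerCards), zeros(1, 5 - numel(playerCards))];
    idx = find(ismember(tbl(:, 1:10), key, 'rows'), 1);
    if isempty(idx)
        p = 0.5;              % unseen pattern: default value, the decision is made at random
    else
        p = tbl(idx, 11);     % stored estimate for this pattern
    end
    end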

Matlab Simulation

All cards in one hand are sorted before they are inserted into the table. The dealer and the player each have their own table. Each table serves as the input data for training the ANN, one network trained from the dealer's table and one from the player's table. Given a card pattern, the corresponding winning probability is returned.

The Matlab program (genlookuptable.m) plays against the dealer, starting with empty dealer and player tables, to explore the different combinations of the dealer's and the player's cards. 15000 games were simulated, generating 28273 entries for the dealer's table and 32372 entries for the player's table. In the end, the program encountered less than 10% of all possible states. However, the states it did encounter are the most common ones; in fact, fewer than 10% of all distinct states commonly occur in actual play. For example, hands of 4 or 5 cards occur quite infrequently, and holding 4 or 5 tens is impossible in Blackjack.

Both the dealer and the player learn to play better as the number of games played increases. To test the accuracy of both the dealer table and the player table, I set up another sequence of games in which the dealer plays against the player using the generated lookup tables.

The following tables show the summary of the experiments:

When reinforcement learning is applied to the player playing against a dealer that follows the 17-point rule, the player shows a significant improvement in winning percentage. Out of 1000 games, the player won 512 times.

Strategy / Win % / Tie %
Player (with learning) / 51.2% / 7%

When reinforcement learning is applied to the dealer (so that the dealer no longer follows the 17-point rule) playing against a player making random moves, the dealer improves its winning percentage to 67%.

Strategy / Win % / Tie %
Dealer (with learning) / 67% / 5%

When reinforcement learning is applied to both the dealer and the player playing against each other, the percentage of tie games increases, because both are likely using much the same set of strategies learned through reinforcement learning.

Strategy / Win % / Tie %
Player (with learning) / 43% / 12%
Dealer (with learning) / 45% / 12%

Finally, I played against the dealer using the dealer lookup table. I lost 11 of 20 games and had 1 tie.

Strategy / Win % / Tie %
Human Player / 45% / 5%

Using the dealer and player tables greatly increased the number of games won by both the dealer and the player. The tables generated for the dealer and the player are therefore good enough to use in the subsequent experiments.

Just by looking at the values in the lookup tables, the game strategy can be interpreted. For the player, if the total is higher than 14, in most cases the table suggests not to "hit".

Similarly, the dealer's table suggests not to "hit" when the total is over 15. Without knowing the rules of Blackjack, reinforcement learning with end-of-game updates is able to determine the threshold of 14 for the player and 15 for the dealer, even though these are a little more conservative than the dealer's 17-point rule. When the player plays against the dealer with both using lookup tables, the dealer has a slightly higher winning percentage. This suggests that the dealer table, which advises not to "hit" at 15, plays a little better than the player table, which advises not to "hit" at 14.
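One way such a threshold can be read off a table is sketched below, comparing, for each player total, the average stored winning probability of the "hit" rows against the "stand" rows (tbl is assumed to hold the player lookup table with the column layout of Figure 3, aces are simply counted as 1, and the code is illustrative rather than taken from the project):

    % Illustrative only: read a rough "stand threshold" out of the player table.
    totals = sum(tbl(:, 6:10), 2);                 % player hand totals (aces counted as 1)
    for t = 12:20
        rows   = totals == t;
        pHit   = mean(tbl(rows & tbl(:,12)==1 & tbl(:,13)==0, 11));
        pStand = mean(tbl(rows & tbl(:,12)==0 & tbl(:,13)==1, 11));
        fprintf('total %2d: P(win|hit) = %.2f, P(win|stand) = %.2f\n', t, pHit, pStand);
    end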

Applying ANN to Blackjack

The next step is to prepare the input data for the ANN by using the result lookup tables. Once the dealer and player lookup tables are ready, they can be converted into training data for the ANN. The dealer's cards and the player's cards from the lookup tables are the inputs to the ANN. The output contains two neurons, each corresponding to one action (hit or stand); the action taken depends on which output neuron has the higher value. This is a classification problem. To simplify the problem, I set up several MLP structures in different stages to better classify the data. The different MLPs needed for the dealer and the player, as well as the overall game flow chart, are shown in Figure 6 below. Note that the rules of Blackjack (including the 17-point dealer rule) are not implemented in the project; all decisions are based on the MLP outputs.
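A sketch of this conversion, and of training one candidate MLP on it, is given below. The use of patternnet from the Neural Network Toolbox, the hidden-layer size of 20, and the decision to drop the terminal [1 1] rows are illustrative assumptions, not necessarily the choices made in the project:

    % Convert a lookup table (layout of Figure 3) into ANN training data.
    X = tbl(:, 1:10)';                          % inputs: dealer's and player's sorted cards
    T = tbl(:, 12:13)';                         % targets: [1 0] = hit, [0 1] = stand
    isDecision = ~(T(1,:) == 1 & T(2,:) == 1);  % drop [1 1] end states (no action to learn)
    X = X(:, isDecision);
    T = T(:, isDecision);

    net = patternnet(20);                       % one hidden layer of 20 neurons (assumed size)
    net = train(net, X, T);                     % train the classifier

    Y = net(X);                                 % two output neurons: hit and stand
    [~, chosen] = max(Y, [], 1);                % take the action with the larger output (1 = hit, 2 = stand)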