{"title": "Approximate Dynamic Programming Finally Performs Well in the Game of Tetris", "book": "Advances in Neural Information Processing Systems", "page_first": 1754, "page_last": 1762, "abstract": "Tetris is a popular video game that has been widely used as a benchmark for various optimization techniques including approximate dynamic programming (ADP) algorithms. A close look at the literature of this game shows that while ADP algorithms that have been (almost) entirely based on approximating the value function (value function based) have performed poorly in Tetris, the methods that search directly in the space of policies by learning the policy parameters using an optimization black box, such as the cross entropy (CE) method, have achieved the best reported results. This makes us conjecture that Tetris is a game in which good policies are easier to represent, and thus learn, than their corresponding value functions. So, in order to obtain a good performance with ADP, we should use ADP algorithms that search in a policy space, instead of the more traditional ones that search in a value function space. In this paper, we put our conjecture to test by applying such an ADP algorithm, called classification-based modified policy iteration (CBMPI), to the game of Tetris. Our extensive experimental results show that for the first time an ADP algorithm, namely CBMPI, obtains the best results reported in the literature for Tetris in both small $10\times 10$ and large $10\times 20$ boards. 
Although the CBMPI's results are similar to those achieved by the CE method in the large board, CBMPI uses considerably fewer (almost 1/6) samples (calls to the generative model of the game) than CE.", "full_text": "Approximate Dynamic Programming Finally Performs Well in the Game of Tetris\n\nVictor Gabillon\nINRIA Lille - Nord Europe, Team SequeL, FRANCE\nvictor.gabillon@inria.fr\n\nMohammad Ghavamzadeh∗\nINRIA Lille - Team SequeL & Adobe Research\nmohammad.ghavamzadeh@inria.fr\n\nBruno Scherrer\nINRIA Nancy - Grand Est, Team Maia, FRANCE\nbruno.scherrer@inria.fr\n\nAbstract\n\nTetris is a video game that has been widely used as a benchmark for various optimization techniques including approximate dynamic programming (ADP) algorithms. A look at the literature of this game shows that while ADP algorithms that have been (almost) entirely based on approximating the value function (value function based) have performed poorly in Tetris, the methods that search directly in the space of policies by learning the policy parameters using an optimization black box, such as the cross entropy (CE) method, have achieved the best reported results. This makes us conjecture that Tetris is a game in which good policies are easier to represent, and thus learn, than their corresponding value functions. So, in order to obtain a good performance with ADP, we should use ADP algorithms that search in a policy space, instead of the more traditional ones that search in a value function space. In this paper, we put our conjecture to test by applying such an ADP algorithm, called classification-based modified policy iteration (CBMPI), to the game of Tetris. Our experimental results show that for the first time an ADP algorithm, namely CBMPI, obtains the best results reported in the literature for Tetris in both small 10 × 10 and large 10 × 20 boards. 
Although the CBMPI's results are similar to those of the CE method in the large board, CBMPI uses considerably fewer (almost 1/6) samples (calls to the generative model) than CE.\n\n1 Introduction\nTetris is a popular video game created by Alexey Pajitnov in 1985. The game is played on a grid originally composed of 20 rows and 10 columns, where pieces of 7 different shapes fall from the top – see Figure 1. The player has to choose where to place each falling piece by moving it horizontally and rotating it. When a row is filled, it is removed and all the cells above it move one line down. The goal is to remove as many rows as possible before the game is over, i.e., when there is no space available at the top of the grid for the new piece.\nIn this paper, we consider the variation of the game in which the player knows only the current falling piece, and not the next several coming pieces. This game constitutes an interesting optimization benchmark in which the goal is to find a controller (policy) that maximizes the average (over multiple games) number of lines removed in a game (score).1 This optimization problem is known to be computationally hard. It contains a huge number of board configurations (about 2^200 ≈ 1.6 × 10^60), and even in the case that the sequence of pieces is known in advance, finding the optimal strategy is an NP-hard problem [4].\nApproximate dynamic programming (ADP) and reinforcement learning (RL) algorithms have been used in Tetris. 
These algorithms formulate Tetris as a Markov decision process (MDP) in which the state is defined by the current board configuration plus the falling piece, the actions are the possible orientations of the piece and the possible locations that it can be placed on the board,2 and the reward is defined such that maximizing the expected sum of rewards from each state coincides with maximizing the score from that state.\n\nFigure 1: A screen-shot of the game of Tetris with its seven pieces (shapes).\n\n∗Mohammad Ghavamzadeh is currently at Adobe Research, on leave of absence from INRIA.\n1Note that this number is finite because it was shown that Tetris is a game that ends with probability one [3].\n\nSince the state space is large in Tetris, these methods use value function approximation schemes (often linear approximation) and try to tune the value function parameters (weights) from game simulations. The first application of ADP in Tetris seems to be by Tsitsiklis and Van Roy [22]. They used the approximate value iteration algorithm with two state features: the board height and the number of holes in the board, and obtained a low score of 30 to 40. Bertsekas and Ioffe [1] proposed the λ-Policy Iteration (λ-PI) algorithm (a generalization of value and policy iteration) and applied it to Tetris. They approximated the value function as a linear combination of a more elaborate set of 22 features and reported the score of 3,200 lines. The exact same empirical study was revisited recently by Scherrer [16], who corrected an implementation bug in [1], and reported more stable learning curves and the score of 4,000 lines. At least three other ADP and RL papers have used the same set of features, which we refer to as the “Bertsekas features”, in the game of Tetris. Kakade [11] applied a natural policy gradient method to Tetris and reported a score of about 6,800 lines. 
Farias and Van Roy [6] applied a linear programming algorithm to the game and achieved the score of 4,700 lines. Furmston and Barber [8] proposed an approximate Newton method to search in a policy space and were able to obtain a score of about 14,000.\nDespite all the above applications of ADP in Tetris (and possibly more), for a long time, the best Tetris controller was the one designed by Dellacherie [5]. He used a heuristic evaluation function to give a score to each possible strategy (in a way similar to a value function in ADP), and eventually returned the one with the highest score. Dellacherie's evaluation function is made of 6 high-quality features with weights chosen by hand, and achieved a score of about 5,000,000 lines [19]. Szita and Lőrincz [18] used the “Bertsekas features” and optimized the weights by running a black box optimizer based on the cross entropy (CE) method [15]. They reported the score of 350,000 lines averaged over 30 games, outperforming the ADP and RL approaches that used the same features. More recently, Thiery and Scherrer [20] selected a set of 9 features (including those of Dellacherie's) and optimized the weights with the CE method. This led to the best publicly known controller (to the best of our knowledge) with the score of around 35,000,000 lines.\nDue to the high variance of the score and its sensitivity to some implementation details [19], it is difficult to have a precise evaluation of Tetris controllers. However, our brief tour d'horizon of the literature, and in particular the work by Szita and Lőrincz [18] (optimizing the “Bertsekas features” by CE), indicates that ADP algorithms, even with relatively good features, have performed far worse than the methods that directly search in the space of policies (such as CE and genetic algorithms). 
It is important to note that almost all these ADP methods are value function based algorithms that first define a value function representation (space) and then search in this space for a good function, which later gives us a policy.\nThe main motivation of our work comes from the above observation. It makes us conjecture that Tetris is a game whose policy space is easier to represent, and as a result to search in, than its value function space. Therefore, in order to obtain a good performance with ADP algorithms in this game, we should use those ADP methods that search in a policy space, instead of the more traditional ones that search in a value function space. Fortunately, a class of such ADP algorithms, called classification-based policy iteration (CbPI), has recently been developed and analyzed [12, 7, 13, 9, 17]. These algorithms differ from the standard value function based ADP methods in how the greedy policy is computed. Specifically, at each iteration CbPI algorithms approximate the entire greedy policy as the output of a classifier, while in the standard methods, at every given state, the required action from the greedy policy is individually calculated based on the approximation of the value function of the current policy. Since CbPI methods search in a policy space (defined by a classifier) instead of a value function space, we believe that they should perform better than their value function based counterparts in problems in which good policies are easier to represent than their corresponding value functions. In this paper, we put our conjecture to test by applying an algorithm in this class, called classification-based modified policy iteration (CBMPI) [17], to the game of Tetris, and compare its performance with the CE method and the λ-PI algorithm. 
The choice of CE and λ-PI is because the former has achieved the best known results in Tetris and the latter's performance is among the best reported for value function based ADP algorithms. Our extensive experimental results show that for the first time an ADP algorithm, namely CBMPI, obtains the best results reported in the literature for Tetris in both small 10 × 10 and large 10 × 20 boards. Although the CBMPI's results are similar to those achieved by the CE method in the large board, CBMPI uses considerably fewer (almost 1/6) samples (calls to the generative model of the game) than CE. In Section 2, we briefly describe the algorithms used in our experiments. \n\n2The total number of actions at a state depends on the falling piece, with the maximum of 32, i.e., |A| ≤ 32.\n\nInput: parameter space Θ, number of parameter vectors n, proportion ρ ≤ 1, noise η\nInitialize: Set the parameter µ = 0̄ and σ² = 100I (I is the identity matrix)\nfor k = 1, 2, . . . do\n  Generate a random sample of n parameter vectors {θi}_{i=1}^n ∼ N(µ, σ²I)\n  For each θi, play L games and calculate the average number of rows removed (score) by the controller\n  Select the ⌈ρn⌉ parameters with the highest score, θ′1, . . . , θ′⌈ρn⌉\n  Update µ and σ: µ(j) = (1/⌈ρn⌉) Σ_{i=1}^{⌈ρn⌉} θ′i(j) and σ²(j) = (1/⌈ρn⌉) Σ_{i=1}^{⌈ρn⌉} [θ′i(j) − µ(j)]² + η\nFigure 2: The pseudo-code of the cross-entropy (CE) method used in our experiments.\n\n
In Section 3, we outline the setting of each algorithm in our experiments and report our results, followed by a discussion.\n2 Algorithms\nIn this section, we briefly describe the algorithms used in our experiments: the cross entropy (CE) method, classification-based modified policy iteration (CBMPI) [17] and its slight variation direct policy iteration (DPI) [13], and λ-policy iteration (see [16] for a description of λ-PI). We begin by defining some terms and notations. A state s in Tetris consists of two components: the description of the board b and the type of the falling piece p. All controllers rely on an evaluation function that gives a value to each possible action at a given state. Then, the controller chooses the action with the highest value. In ADP, algorithms aim at tuning the weights such that the evaluation function approximates well the optimal expected future score from each state. Since the total number of states is large in Tetris, the evaluation function f is usually defined as a linear combination of a set of features φ, i.e., f(·) = φ(·)θ. We can think of the parameter vector θ as a policy (controller) whose performance is specified by the corresponding evaluation function f(·) = φ(·)θ. The features used in Tetris for a state-action pair (s, a) may depend on the description of the board b′ resulting from taking action a in state s, e.g., the maximum height of b′. Computing such features requires the knowledge of the game's dynamics, which is known in Tetris.\n2.1 Cross Entropy Method\nCross-entropy (CE) [15] is an iterative method whose goal is to optimize a function f parameterized by a vector θ ∈ Θ by direct search in the parameter space Θ. Figure 2 contains the pseudo-code of the CE algorithm used in our experiments [18, 20]. 
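To make the loop concrete before walking through it, here is a minimal Python sketch of the procedure in Figure 2. It is our illustration, not the authors' code; the `score` callback (standing in for playing L games with the controller θ and returning the average number of removed rows) and all names are assumptions of the sketch.

```python
import numpy as np

def cross_entropy_search(score, dim, n=100, rho=0.1, eta=4.0,
                         iterations=10, seed=None):
    """Sketch of the CE loop of Figure 2 for a score function over R^dim."""
    rng = np.random.default_rng(seed)
    mu = np.zeros(dim)
    sigma2 = 100.0 * np.ones(dim)            # diagonal of the covariance
    for _ in range(iterations):
        # sample n parameter vectors from N(mu, sigma^2 I)
        thetas = rng.normal(mu, np.sqrt(sigma2), size=(n, dim))
        scores = np.array([score(theta) for theta in thetas])
        # keep the ceil(rho * n) elite vectors with the highest score
        k = int(np.ceil(rho * n))
        elite = thetas[np.argsort(scores)[-k:]]
        # refit the Gaussian to the elite, adding noise eta to the variance
        mu = elite.mean(axis=0)
        sigma2 = elite.var(axis=0) + eta
    return mu
```

On a toy objective such as a negated quadratic, the mean µ quickly concentrates near the maximizer; in the Tetris experiments the same loop is driven by actual game play, which is what makes each CE iteration costly in samples.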
At each iteration k, we sample n parameter vectors {θi}_{i=1}^n from a multivariate Gaussian distribution N(µ, σ²I). At the beginning, the parameters of this Gaussian have been set to cover a wide region of Θ. For each parameter θi, we play L games and calculate the average number of rows removed by this controller (an estimate of the evaluation function). We then select the ⌈ρn⌉ of these parameters with the highest score, θ′1, . . . , θ′⌈ρn⌉, and use them to update the mean µ and variance σ² of the Gaussian distribution, as shown in Figure 2. This updated Gaussian is used to sample the n parameters at the next iteration. The goal of this update is to sample more parameters from the promising part of Θ at the next iteration, and eventually converge to a global maximum of f.\n2.2 Classification-based Modified Policy Iteration (CBMPI)\nModified policy iteration (MPI) [14] is an iterative algorithm to compute the optimal policy of an MDP that starts with initial policy π1 and value v0, and generates a sequence of value-policy pairs\n\nvk = (Tπk)^m vk−1   (evaluation step),        πk+1 = G[(Tπk)^m vk−1]   (greedy step),\n\nwhere Gvk is a greedy policy w.r.t. vk, Tπk is the Bellman operator associated with the policy πk, and m ≥ 1 is a parameter. MPI generalizes the well-known value and policy iteration algorithms for the values m = 1 and m = ∞, respectively. CBMPI [17] is an approximation of MPI that uses an explicit representation for the policies πk, in addition to the one used for the value functions vk. The idea is similar to the classification-based PI algorithms [12, 7, 13] in which we search for the greedy policy in a policy space Π (defined by a classifier) instead of computing it from the estimated value function (as in the standard implementation of MPI). 
Input: value function space F, policy space Π, state distribution µ\nInitialize: Set π1 ∈ Π and v0 ∈ F to an arbitrary policy and value function\nfor k = 1, 2, . . . do\n  • Perform rollouts:\n  Construct the rollout set Dk = {s(i)}_{i=1}^N, s(i) iid∼ µ\n  for all states s(i) ∈ Dk do\n    Perform a rollout and return v̂k(s(i))   (using Equation 1)\n  Construct the rollout set D′k = {s(i)}_{i=1}^{N′}, s(i) iid∼ µ\n  for all states s(i) ∈ D′k and actions a ∈ A do\n    for j = 1 to M do\n      Perform a rollout and return R^j_k(s(i), a)   (using Equation 4)\n    Q̂k(s(i), a) = (1/M) Σ_{j=1}^M R^j_k(s(i), a)\n  • Approximate value function: vk ∈ argmin_{v∈F} L̂^F_k(µ̂; v)   (regression, see Equation 2)\n  • Approximate greedy policy: πk+1 ∈ argmin_{π∈Π} L̂^Π_k(µ̂; π)   (classification, see Equation 3)\nFigure 3: The pseudo-code of the CBMPI algorithm.\n\nAs described in Figure 3, CBMPI begins with an arbitrary initial policy π1 ∈ Π and value function v0 ∈ F.3 At each iteration k, a new value function vk is built as the best approximation of the m-step Bellman operator (Tπk)^m vk−1 in F (evaluation step). This is done by solving a regression problem whose target function is (Tπk)^m vk−1. To set up the regression problem, we build a rollout set Dk by sampling N states i.i.d. from a distribution µ. For each state s(i) ∈ Dk, we generate a rollout (s(i), a(i)_0, r(i)_0, s(i)_1, . . . , a(i)_{m−1}, r(i)_{m−1}, s(i)_m) of size m, where a(i)_t = πk(s(i)_t), and r(i)_t and s(i)_{t+1} are the reward and next state induced by this choice of action. \n\n3Note that the function space F and policy space Π are defined by the choice of the regressor and classifier.\n\n
From this rollout, we compute an unbiased estimate v̂k(s(i)) of [(Tπk)^m vk−1](s(i)) as\n\nv̂k(s(i)) = Σ_{t=0}^{m−1} γ^t r(i)_t + γ^m vk−1(s(i)_m),   (γ is the discount factor)   (1)\n\nand use it to build a training set {(s(i), v̂k(s(i)))}_{i=1}^N. This training set is then used by the regressor to compute vk as an estimate of (Tπk)^m vk−1. The regressor finds a function v ∈ F that minimizes the empirical error\n\nL̂^F_k(µ̂; v) = (1/N) Σ_{i=1}^N (v̂k(s(i)) − v(s(i)))².   (2)\n\nThe greedy step at iteration k computes the policy πk+1 as the best approximation of G[(Tπk)^m vk−1] by minimizing the cost-sensitive empirical error (cost-sensitive classification)\n\nL̂^Π_k(µ̂; π) = (1/N′) Σ_{i=1}^{N′} [max_{a∈A} Q̂k(s(i), a) − Q̂k(s(i), π(s(i)))].   (3)\n\nTo set up this cost-sensitive classification problem, we build a rollout set D′k by sampling N′ states i.i.d. from a distribution µ. For each state s(i) ∈ D′k and each action a ∈ A, we build M independent rollouts of size m + 1, i.e., (s(i), a, r(i,j)_0, s(i,j)_1, a(i,j)_1, . . . , a(i,j)_m, r(i,j)_m, s(i,j)_{m+1}), j = 1, . . . , M, where for t ≥ 1, a(i,j)_t = πk(s(i,j)_t), and r(i,j)_t and s(i,j)_{t+1} are the reward and next state induced by this choice of action. 
From these rollouts, we compute an unbiased estimate of Qk(s(i), a) as Q̂k(s(i), a) = (1/M) Σ_{j=1}^M R^j_k(s(i), a), where each rollout estimate is defined as\n\nR^j_k(s(i), a) = Σ_{t=0}^m γ^t r(i,j)_t + γ^{m+1} vk−1(s(i,j)_{m+1}).   (4)\n\nIf we remove the regressor from CBMPI and only use the m-truncated rollouts R^j_k(s(i), a) = Σ_{t=0}^m γ^t r(i,j)_t to compute Q̂k(s(i), a), then CBMPI becomes the direct policy iteration (DPI) algorithm [13] that we also use in our experiments (see [17] for more details on the CBMPI algorithm).\n\nIn our implementation of CBMPI (DPI) in Tetris (Section 3), we use the same rollout set (Dk = D′k) and rollouts for the classifier and regressor. This is mainly to be more sample efficient. Fortunately, we observed that this does not affect the overall performance of the algorithm. We set the discount factor γ = 1. Regressor: We use linear function approximation for the value function, i.e., v̂k(s(i)) = φ(s(i))w, where φ(·) and w are the feature and weight vectors, and minimize the empirical error L̂^F_k(µ̂; v) using the standard least-squares method. Classifier: We use policies of the form πu(s) = argmax_a ψ(s, a)u, where ψ is the policy feature vector (possibly different from the value function feature vector φ) and u is the policy parameter vector. The training set of the classifier is of size N, with s(i) ∈ D′k as input and (max_a Q̂k(s(i), a) − Q̂k(s(i), a1), . . . , max_a Q̂k(s(i), a) − Q̂k(s(i), a|A|)) as output. 
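The rollout estimates behind this training set can be sketched in a few lines of Python. The `step(state, action) -> (reward, next_state)` interface to the generative model and all names here are our illustrative assumptions, not the paper's code.

```python
def rollout_return(step, policy, v_prev, state, action, m, gamma=1.0):
    """One rollout estimate R_k(s, a) of Eq. (4): take `action` in `state`,
    then follow the current policy pi_k for m steps, and close the tail with
    the value function v_{k-1} of the previous iteration."""
    # t = 0: the evaluated action itself
    reward, state = step(state, action)
    total, discount = reward, 1.0
    # t = 1 .. m: actions chosen by the current policy
    for _ in range(m):
        discount *= gamma
        reward, state = step(state, policy(state))
        total += discount * reward
    # tail term gamma^{m+1} v_{k-1}(s_{m+1})
    return total + discount * gamma * v_prev(state)

def q_estimate(step, policy, v_prev, state, action, m, M=1, gamma=1.0):
    """Average of M independent rollouts: the estimate of Q_k(s, a)."""
    return sum(rollout_return(step, policy, v_prev, state, action, m, gamma)
               for _ in range(M)) / M
```

With γ = 1, as in the experiments, the estimate is simply the sum of the m + 1 rewards along the rollout plus the value of the state where it was truncated; dropping that final term gives the m-truncated estimate used by DPI.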
We compute the next policy πk+1 by minimizing the empirical error L̂^Π_k(µ̂; πu), defined by (3), using the covariance matrix adaptation evolution strategy (CMA-ES) algorithm [10]. In order to evaluate a policy u in CMA-ES, we only need to compute L̂^Π_k(µ̂; πu), and given the training set, this procedure does not require any simulation of the game. This is in contrast with policy evaluation in CE, which requires playing several games, and it is the main reason that we obtain the same performance as CE with CBMPI using almost 1/6 of the samples (see Section 3.2).\n\n3 Experimental Results\nIn this section, we evaluate the performance of CBMPI (DPI) and compare it with CE and λ-PI. CE is the state-of-the-art method in Tetris, with a huge performance advantage over ADP/RL methods [18, 19, 20]. In our experiments, we show that for a well-selected set of features, CBMPI improves over all the previously reported ADP results. Moreover, its performance is comparable to that of the CE method, while using considerably fewer samples (calls to the generative model of the game).\n3.1 Experimental Setup\nIn our experiments, the policies learned by the algorithms are evaluated by their score (average number of rows removed in a game) averaged over 200 games in the small 10 × 10 board and over 20 games in the large 10 × 20 board. The performance of each algorithm is represented by a learning curve whose value at each iteration is the average score of the policies learned by the algorithm at that iteration in 100 separate runs of the algorithm. In addition to their score, we also evaluate the algorithms by the number of samples they use. In particular, we show that CBMPI/DPI use 6 times fewer samples than CE. 
As discussed in Section 2.2, this is due to the fact that although the classifier in CBMPI/DPI uses a direct search in the space of policies (for the greedy policy), it evaluates each candidate policy using the empirical error of Eq. 3, and thus does not require any simulation of the game (other than those used to estimate the Q̂k's in its training set). In fact, the budget B of CBMPI/DPI is fixed in advance by the number of rollouts N and M and the rollout length m as B = (m + 1)NM|A|. In contrast, CE evaluates a candidate policy by playing several games, a process that can be extremely costly (sample-wise), especially for good policies in the large board.\nIn our CBMPI/DPI experiments, we set the number of rollouts per state-action pair to M = 1, as this value has shown the best performance. Thus, we only study the behavior of CBMPI/DPI as a function of m and N. In CBMPI, the parameter m balances between the errors in evaluating the value function and the policy. For large values of m, the size of the rollout set decreases as N = O(B/m), which in turn decreases the accuracy of both the regressor and classifier. This leads to a trade-off between long rollouts and the number of states in the rollout set. The solution to this trade-off (a bias/variance trade-off in the estimation of the Q̂k's) strictly depends on the capacity of the value function space F. A rich value function space allows solving the trade-off at small values of m, while a poor space, or no space in the case of DPI, suggests large values of m, but not too large, so as to still guarantee a large enough N. We sample the rollout states in CBMPI/DPI from the trajectories generated by a very good policy for Tetris, namely the DU controller [20]. Since the DU policy is good, this rollout set is biased towards boards with small height. 
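As a quick illustration of this budget accounting (the helper name is ours, not the paper's), the rollout set size N that a fixed per-iteration budget B = (m + 1)NM|A| allows can be computed as:

```python
def rollout_set_size(budget, m, M, num_actions):
    """Largest N such that (m + 1) * N * M * num_actions <= budget."""
    return budget // ((m + 1) * M * num_actions)

# Small-board setting of the experiments: B = 8,000,000 per iteration,
# M = 1 rollout per state-action pair, and |A| <= 32 actions.
n_short = rollout_set_size(8_000_000, m=2, M=1, num_actions=32)
n_long = rollout_set_size(8_000_000, m=20, M=1, num_actions=32)
```

This makes the trade-off described above explicit: growing the rollout length m by an order of magnitude shrinks the rollout set N by roughly the same factor.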
We noticed from our experiments that the performance can be significantly improved if we use boards with different heights in the rollout sets. This means that better performance can be achieved with a more uniform sampling distribution, which is consistent with what we can learn from the CBMPI and DPI performance bounds. We set the initial value function parameter to w = 0̄ and select the initial policy π1 (policy parameter u) randomly. We also set the CMA-ES parameters (classifier parameters) to ρ = 0.5, η = 0, and n equal to 15 times the number of features.\n\nIn the CE experiments, we set ρ = 0.1 and η = 4, the best parameters reported in [20]. We also set n = 1000 and L = 10 in the small board and n = 100 and L = 1 in the large board.\nSet of Features: We use the following features, plus a constant offset feature, in our experiments:4\n(i) Bertsekas features: First introduced by [2], this set of 22 features has been mainly used in the ADP/RL community and consists of: the number of holes in the board, the height of each column, the difference in height between two consecutive columns, and the maximum height of the board.\n(ii) Dellacherie-Thiery (D-T) features: This set consists of the six features of Dellacherie [5], i.e., the landing height of the falling piece, the number of eroded piece cells, the row transitions, the column transitions, the number of holes, and the number of board wells; plus 3 additional features proposed in [20], i.e., the hole depth, the number of rows with holes, and the pattern diversity feature. Note that the best policies reported in the literature have been learned using this set of features.\n(iii) RBF height features: These new 5 features are defined as exp(−|c − ih/4|² / (2(h/5)²)), i = 0, . . . , 4, where c is the average height of the columns and h = 10 or 20 is the total number of rows in the board.\n\n3.2 Experiments\nWe first run the algorithms on the small board to study the role of their parameters and to select the best features and parameters (Section 3.2.1). We then use the selected features and parameters and apply the algorithms to the large board (Figure 5 (d)). Finally, we compare the best policies found in our experiments with the best controllers reported in the literature (Tables 1 and 2).\n3.2.1 Small (10 × 10) Board\nHere we run the algorithms with two different feature sets: Dellacherie-Thiery (D-T) and Bertsekas.\nD-T features: Figure 4 shows the learning curves of the CE, λ-PI, DPI, and CBMPI algorithms. Here we use the D-T features for the evaluation function in CE, the value function in λ-PI, and the policy in DPI and CBMPI. We ran CBMPI with different feature sets for the value function, and “D-T plus the 5 RBF features” achieved the best performance (Figure 4 (d)).5 The budget of CBMPI and DPI is set to B = 8,000,000 per iteration. The CE method reaches the score 3000 after 10 iterations using an average budget B = 65,000,000. λ-PI with the best value of λ only manages to score 400. In Figure 4 (c), we report the performance of DPI for different values of m. DPI achieves its best performance for m = 5 and m = 10 by removing 3400 lines on average. As explained in Section 3.1, having short rollouts (m = 1) in DPI leads to poor action-value estimates Q̂, while having too long rollouts (m = 20) decreases the size N of the training set of the classifier. CBMPI outperforms the other algorithms, including CE, by reaching the score of 4300 for m = 2. The value of m = 2 corresponds to N = 8,000,000/((2 + 1) × 32) ≈ 84,000. Note that unlike DPI, CBMPI achieves good performance with very short rollouts m = 1. 
This indicates that CBMPI is able to approximate the value function well, and as a result, to build a more accurate training set for its classifier than DPI.\nThe results of Figure 4 show that an ADP algorithm, namely CBMPI, outperforms the CE method using a similar budget (80 vs. 65 million samples after 10 iterations). Note that CBMPI takes fewer iterations to converge than CE. More generally, Figure 4 confirms the superiority of the policy search and classification-based PI methods over value function based ADP algorithms (λ-PI). This suggests that the D-T features are more suitable to represent the policies than the value functions in Tetris.\nBertsekas features: Figures 5 (a)-(c) show the performance of the CE, λ-PI, DPI, and CBMPI algorithms. Here all the approximations in the algorithms are with the Bertsekas features. CE achieves the score of 500 after about 60 iterations and outperforms λ-PI with a score of 350. It is clear that the Bertsekas features lead to much weaker results than those obtained by the D-T features in Figure 4 for all the algorithms. We may then conclude that the D-T features are more suitable than the Bertsekas features to represent both value functions and policies in Tetris. In DPI and CBMPI, we managed to obtain results similar to CE only after multiplying the per iteration budget B used in the D-T experiments by 10. However, CBMPI and CE use the same number of samples, 150,000,000, when they converge after 2 and 60 iterations, respectively (see Figure 5). Note that DPI and CBMPI obtain the same performance, which means that the use of a value function approximation by CBMPI does not lead to a significant performance improvement over DPI. At the end, we tried several values of m in this setting, among which m = 10 achieved the best performance for both DPI and CBMPI.\n\n4For a precise definition of the features, see [19] or the documentation of their code [21].\n5Note that we use D-T+5 features only for the value function of CBMPI, and thus, we have a fair comparison between CBMPI and DPI. To have a fair comparison with λ-PI, we ran this algorithm with D-T+5 features, and it only raised its performance to 800, still far from the CBMPI's performance.\n\n(a) The cross-entropy (CE) method. (b) λ-PI with λ = {0, 0.4, 0.7, 0.9}. (c) DPI with budget B = 8,000,000 per iteration and m = {1, 2, 5, 10, 20}. (d) CBMPI with budget B = 8,000,000 per iteration and m = {1, 2, 5, 10, 20}.\nFigure 4: Learning curves of CE, λ-PI, DPI, and CBMPI algorithms using the 9 Dellacherie-Thiery (D-T) features on the small 10 × 10 board. The results are averaged over 100 runs of the algorithms.\n\n3.2.2 Large (10 × 20) Board\nWe now use the best parameters and features from the small board experiments, run the CE, DPI, and CBMPI algorithms in the large board, and report their results in Figure 5 (d). The per iteration budget of DPI and CBMPI is set to B = 16,000,000. 
While λ-PI with a per iteration budget of 620,000, at its best, achieves a score of 2500 (due to space limitations, we do not report these results here), DPI and CBMPI, with m = 10, reach scores of 12,000,000 and 21,000,000 after 3 and 6 iterations, respectively. CE matches the performance of CBMPI with a score of 20,000,000 after 8 iterations. However, this is achieved with almost 6 times more samples: after 8 iterations, CBMPI and CE have used 256,000,000 and 1,700,000,000 samples, respectively.
Comparison of the best policies: So far, the reported scores for each algorithm were averaged over the policies learned in 100 separate runs. Here we select the best policies observed in all our experiments and compute their scores more accurately by averaging over 10,000 games. We then compare these results with the best policies reported in the literature, i.e., DU and BDU [20], in both the small and large boards in Table 1. The DT-10 and DT-20 policies, whose weights and features are given in Table 2, are the policies learned by CBMPI with the D-T features in the small and large boards, respectively. As shown in Table 1, DT-10 removes 5000 lines and outperforms DU, BDU, and DT-20 in the small board.
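The policies compared here (DU, BDU, DT-10, DT-20) are all one-step greedy controllers: every legal placement of the current piece is scored by a fixed linear combination of board features, and the best-scoring placement is played. Below is a minimal sketch of such a controller, using two simplified stand-in features (holes and maximum column height) with illustrative weights rather than the nine D-T features and the learned weights of Table 2.

```python
def column_heights(board):
    """Height of each column; `board` is a list of rows (top row first),
    cells are 0/1."""
    n_rows, n_cols = len(board), len(board[0])
    heights = []
    for c in range(n_cols):
        h = 0
        for r in range(n_rows):
            if board[r][c]:
                h = n_rows - r
                break
        heights.append(h)
    return heights

def holes(board):
    """Empty cells with at least one full cell above them in the same column."""
    n_rows, n_cols = len(board), len(board[0])
    count = 0
    for c in range(n_cols):
        seen_full = False
        for r in range(n_rows):
            if board[r][c]:
                seen_full = True
            elif seen_full:
                count += 1
    return count

# Illustrative weights (negative: fewer holes and lower stacks are preferred).
WEIGHTS = {"holes": -4.0, "max_height": -1.0}

def evaluate(board):
    """Linear score of a board under the feature weights."""
    return (WEIGHTS["holes"] * holes(board)
            + WEIGHTS["max_height"] * max(column_heights(board)))

def best_placement(candidate_boards):
    """One-step greedy: index of the successor board with the highest score."""
    return max(range(len(candidate_boards)),
               key=lambda i: evaluate(candidate_boards[i]))
```

Swapping in the full D-T feature set and the weights of Table 2 turns this skeleton into the DT-10/DT-20 controllers; the candidate boards would be generated by simulating every legal rotation and translation of the current piece.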
Note that DT-10 is the only policy among these four that was learned in the small board. In the large board, DT-20 obtains a score of 51,000,000 and not only outperforms the other three policies, but also achieves, to the best of our knowledge, the best result reported in the literature.

[Figure 5: Learning curves (averaged lines removed vs. iterations). (a)-(c) The CE, λ-PI, DPI, and CBMPI algorithms using the 22 Bertsekas features on the small 10 × 10 board: (a) the cross-entropy (CE) method; (b) λ-PI with λ = {0, 0.4, 0.7, 0.9}; (c) DPI (dash-dotted line) and CBMPI (dashed line) with budget B = 80,000,000 per iteration and rollout size m = 10. (d) The CE (solid line), DPI (dash-dotted line), and CBMPI (dashed line) algorithms using the 9 Dellacherie-Thiery (D-T) features on the large 10 × 20 board, with m = {5, 10}; lines removed in units of 10^6.]

Table 1: Average (over 10,000 games) score of the DU, BDU, DT-10, and DT-20 policies.

Boards \ Policies        DU          BDU         DT-10       DT-20
Small (10 × 10) board    3800        4200        5000        4300
Large (10 × 20) board    31,000,000  36,000,000  29,000,000  51,000,000

Table 2: The weights of the 9 Dellacherie-Thiery features in the DT-10 and DT-20 policies.

feature               DT-10   DT-20
landing height        -2.18   -2.68
eroded piece cells     2.42    1.38
row transitions       -2.17   -2.41
column transitions    -3.31   -6.32
holes                  0.95    2.03
board wells           -2.22   -2.71
hole depth            -0.81   -0.43
rows with holes       -9.65   -9.48
diversity              1.27    0.89

4 Conclusions

The game of Tetris has always been challenging for approximate dynamic programming (ADP) algorithms. Surprisingly, much simpler black box optimization methods, such as the cross entropy (CE) method, have produced controllers far superior to those learned by ADP algorithms. In this paper, we applied a relatively novel ADP algorithm, called classification-based modified policy iteration (CBMPI), to Tetris. Our results showed that, for the first time, an ADP algorithm (CBMPI) performed extremely well in both the small 10 × 10 and large 10 × 20 boards, achieving performance either better (in the small board) or equal with considerably fewer samples (in the large board) than the state-of-the-art CE methods. In particular, the best policy learned by CBMPI removes 51,000,000 lines on average, a new record for the large Tetris board.

References

[1] D. Bertsekas and S. Ioffe. Temporal differences-based policy iteration and applications in neuro-dynamic programming.
Technical report, MIT, 1996.
[2] D. Bertsekas and J. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, 1996.
[3] H. Burgiel. How to Lose at Tetris. Mathematical Gazette, 81:194–200, 1997.
[4] E. Demaine, S. Hohenberger, and D. Liben-Nowell. Tetris is hard, even to approximate. In Proceedings of the Ninth International Computing and Combinatorics Conference, pages 351–363, 2003.
[5] C. Fahey. Tetris AI, Computer plays Tetris, 2003. http://colinfahey.com/tetris/tetris.html.
[6] V. Farias and B. Van Roy. Tetris: A study of randomized constraint sampling. Springer-Verlag, 2006.
[7] A. Fern, S. Yoon, and R. Givan. Approximate Policy Iteration with a Policy Language Bias: Solving Relational Markov Decision Processes. Journal of Artificial Intelligence Research, 25:75–118, 2006.
[8] T. Furmston and D. Barber. A unifying perspective of parametric policy search methods for Markov decision processes. In Proceedings of the Advances in Neural Information Processing Systems, pages 2726–2734, 2012.
[9] V. Gabillon, A. Lazaric, M. Ghavamzadeh, and B. Scherrer. Classification-based policy iteration with a critic. In Proceedings of ICML, pages 1049–1056, 2011.
[10] N. Hansen and A. Ostermeier. Completely derandomized self-adaptation in evolution strategies. Evolutionary Computation, 9:159–195, 2001.
[11] S. Kakade. A natural policy gradient. In Proceedings of the Advances in Neural Information Processing Systems, pages 1531–1538, 2001.
[12] M. Lagoudakis and R. Parr. Reinforcement Learning as Classification: Leveraging Modern Classifiers. In Proceedings of ICML, pages 424–431, 2003.
[13] A. Lazaric, M. Ghavamzadeh, and R. Munos. Analysis of a Classification-based Policy Iteration Algorithm. In Proceedings of ICML, pages 607–614, 2010.
[14] M. Puterman and M. Shin. Modified policy iteration algorithms for discounted Markov decision problems. Management Science, 24(11), 1978.
[15] R. Rubinstein and D. Kroese. The cross-entropy method: A unified approach to combinatorial optimization, Monte-Carlo simulation, and machine learning. Springer-Verlag, 2004.
[16] B. Scherrer. Performance Bounds for λ-Policy Iteration and Application to the Game of Tetris. Journal of Machine Learning Research, 14:1175–1221, 2013.
[17] B. Scherrer, M. Ghavamzadeh, V. Gabillon, and M. Geist. Approximate modified policy iteration. In Proceedings of ICML, pages 1207–1214, 2012.
[18] I. Szita and A. Lőrincz. Learning Tetris Using the Noisy Cross-Entropy Method. Neural Computation, 18(12):2936–2941, 2006.
[19] C. Thiery and B. Scherrer. Building Controllers for Tetris. International Computer Games Association Journal, 32:3–11, 2009.
[20] C. Thiery and B. Scherrer. Improvements on Learning Tetris with Cross Entropy. International Computer Games Association Journal, 32, 2009.
[21] C. Thiery and B. Scherrer. MDPTetris features documentation, 2010. http://mdptetris.gforge.inria.fr/doc/feature_functions_8h.html.
[22] J. Tsitsiklis and B. Van Roy. Feature-based methods for large scale dynamic programming. Machine Learning, 22:59–94, 1996.