Chapter 1
Introduction
This introductory chapter covers basic background on regret minimization and connections to game-theoretic equilibrium concepts. In particular, we begin by introducing and motivating the notion of \(\Phi\)-regret and its associated solution concept in games—\(\Phi\)-equilibrium. In the second part, we will introduce the canonical algorithmic template for minimizing \(\Phi\)-regret due to [GGM08[GGM08] G. J. Gordon, A. Greenwald and C. Marks. (2008). No-regret learning in convex games. International Conference on Machine Learning.].
1.1 Online learning and regret
We consider a learner who makes a sequence of decisions over \(T\) rounds. The learner interacts repeatedly with an environment. In each round \(t \in [T]\), the learner specifies a mixed strategy \(\boldsymbol{x}^{(t)} \in \mathcal{X}\), where \(\mathcal{X}\) is a convex and compact set. (A canonical case arises when \(\mathcal{X}\) is a probability simplex over a finite set of actions, but our focus here is on the general setting.) The environment then selects a utility vector \(\boldsymbol{u}^{(t)}\), so that the utility obtained by the learner at that round is \(\left\langle \boldsymbol{x}^{(t)},\boldsymbol{u}^{(t)} \right\rangle\). In the full-feedback setting, which is the focus of these notes, \(\boldsymbol{u}^{(t)}\) is revealed to the learner after the end of the round. An online algorithm produces a sequence of strategies based on the feedback observed up to that point.
What’s a sensible way of measuring the performance of the learner in this online environment? There are different notions of hindsight rationality. Perhaps the most common performance benchmark is external regret, defined as
The second term in the right-hand side of (1) is the cumulative utility obtained by the learner through the \(T\) rounds, whereas the first term is the optimal utility that could have been obtained in hindsight through a fixed strategy. We will soon introduce more powerful notions of hindsight rationality beyond external regret.
1.2 Games and solution concepts
We will often need to analyze what happens when multiple no-regret players repeatedly interact in a game. To do so, we begin by introducing the canonical normal-form representation of games. While any finite game can be cast in normal form, that representation is often inefficient. This will motivate introducing more compact game representations, as we shall do in the sequel.
Formally, we have a set of \(n\) players. In a normal-form game, each player \(i \in [n]\) has a finite set of available actions \(\mathcal{A}_{i}\); we will use the shorthand notation \(m_{i} \coloneqq |\mathcal{A}_{i}|\) for the number of actions. Every player \(i \in [n]\) has a utility function \(u_{i}\) mapping a joint action profile \((a_{1}, \dots , a_{n}) \in \mathcal{A}_{1} \times \dots \times \mathcal{A}_{n}\) to a real value \(u_{i} (a_{1}, \dots , a_{n})\). Players can randomize by specifying a probability distribution over their available actions, so that the strategy set of each player is the probability simplex \(\mathcal{X}_{i} = \Delta (\mathcal{A}_{i})\). Under a joint strategy \((x_{1}, \dots , x_{n}) \in \Delta (\mathcal{A}_{1}) \times \dots \times \Delta (\mathcal{A}_{n})\), the expected utility of player \(i \in [n]\) reads
Each player is trying to maximize its expected utility.
1.2.1 Correlated and coarse correlated equilibria
A major criticism of the Nash equilibrium is that, even though one always exists—as guaranteed by the famous theorem of John Nash [Nas50[Nas50] J. Nash. (1950). Equilibrium points in N-person games. Proceedings of the National Academy of Sciences, 36, 48–49.]—it is computationally intractable to find one [DGP08[DGP08] C. Daskalakis, P. Goldberg and C. Papadimitriou. (2008). The Complexity of Computing a Nash Equilibrium. SIAM Journal on Computing.]—let alone a welfare-optimal one [GZ89[GZ89] I. Gilboa and E. Zemel. (1989). Nash and Correlated Equilibria: Some Complexity Considerations. Games and Economic Behavior, 1, 80–93.]. As result, we shouldn’t expect simple, computationally bounded learning algorithms to converge to Nash equilibria; this raises the question: what are no-regret dynamics converging to in general-sum games?
It turns out that no-regret learning is inherently tied to coarse correlated equilibria [MV78[MV78] H. Moulin and J.-P. Vial. (1978). Strategically zero-sum games: The class of games whose completely mixed equilibria cannot be improved upon. International Journal of Game Theory, 7, 201–221.]. Let’s begin by recalling the basic definition and start building some intuition about this solution concept; for now, we restrict our attention to normal-form games.
Definition 1.1 (Coarse correlated equilibrium) .
A correlateddistribution \(\mu \in \Delta (\mathcal{A}_{1} \times \dots \times \mathcal{A}_{n})\) is an \(\epsilon\)-coarse correlated equilibrium (CCE) if for any player \(i \in [n]\) and deviation \(a_{i'} \in \mathcal{A}_{i}\),
This definition mirrors Nash equilibria, but with a critical difference: the underlying distribution \(\mu\) can be correlated; by contrast, in a Nash equilibrium \(\mu\) has to be a product distribution, reflecting the fact that players randomize independently. To explain this, let’s consider the following two distributions with respect to some \(2 \times 2\) bimatrix game (meaning that each of the two players has two available actions):
Both are distributions over \(\mathcal{A}_{1} \times \mathcal{A}_{2} = \{1, 2\} \times \{1, 2\}\), but only \(\mu'\) is a product distribution. Indeed, if Player 1—the row player—plays \((\frac{1}{3}, \frac{2}{3})\) and Player 2 plays \((\frac{1}{2}, \frac{1}{2})\), the induced distribution over the \(4\) outcomes matches \(\mu'\). In contrast, no pair of strategies gives rise to \(\mu\).
A Nash equilibrium is always a CCE; a Nash equilibrium is basically an uncorrelated (coarse) correlated equilibrium. But the set of CCEs can unlock new outcomes. Before we examine a concrete example, we also introduce the stronger notion of a correlated equilibrium, famously put forward by [Aum74[Aum74] R. Aumann. (1974). Subjectivity and Correlation in Randomized Strategies. Journal of Mathematical Economics, 1, 67–96.].
Definition 1.2 (Correlated equilibrium) .
A correlated distribution \(\mu \in \Delta (\mathcal{A}_{1} \times \dots \times \mathcal{A}_{n})\) is an \(\epsilon\)-correlated equilibrium (CE) if for any player \(i \in [n]\) and deviation function \(\phi_{i} : \mathcal{A}_{i} \to \mathcal{A}_{i}\), we have
Both correlated and coarse correlated equilibria can be interpreted through the use of a trusted third party—a mediator or correlation device—who samples a joint action profile \((a_{1}, \dots , a_{n})\) from the correlated distribution \(\mu\) and then provides the corresponding action \(a_{i}\) to each player \(i \in [n]\) as a recommendation. From this point of view, a distribution is a CCE or a CE if no player has an incentive to deviate from the recommendation, but for CEs the set of possible deviations is richer: in a CE, a player can decide whether to deviate after observing the recommendation, while in a CCE the decision has to be made in advance. This makes a CCE harder to justify in some applications, as one would need some binding mechanism to ensure the player would not be able to deviate after observing the recommendation.
We now go over a concrete example to further elucidate these concepts.
Example 1.3 .
We consider the “game of chicken.” This is a \(2 \times 2\) game—played between two drivers who are rapidly approaching an intersection from different streets—whose utilities are tabulated in Table 1. Each player can either play “Stop” or “Go.” If both players elect to “Go” a crash ensues—a bad outcome for both players. If a player chooses to “Stop” it gets no utility from the game, whereas if it proceeds while the other player chooses to “Stop” it gets a utility of \(1\) for safely crossing the intersection.
This game has exactly three Nash equilibria: i) (Go, Stop), ii) (Stop, Go), and iii) \(((\frac{5}{6}, \frac{1}{6}), (\frac{5}{6}, \frac{1}{6}))\), meaning that both players play “Stop” with probability \(\frac{5}{6}\). From these three outcomes, the first two are not equitable in that they favor one player over the other. The third outcome is even worse: it leads to a crash with some positive probability.
(C)CEs address these issues by unlocking new outcomes. In particular, let’s consider the correlated distribution \(\frac{1}{2} (\text{Go}, \text{Stop}) + \frac{1}{2} (\text{Stop}, \text{Go})\). It’s easy to verify that this is a CE, and thus a CCE. Under that distribution, both players get in expectation a utility of \(\frac{1}{2}\). Focusing CEs, there is a natural interpretation of this outcome through a traffic light, which provides a signal to each player. If Player 1 is recommended “Stop,” it means that Player 2 will play “Go” with probability \(1\), so stopping is in Player 1′s interest. On the other hand, if Player 1 is recommended “Go,” it means that Player 2 will play “Stop” with probability \(1\), so crossing is safe for Player 1. That is, in a CE, the signal a player observes updates that player’s beliefs concerning the behavior of the other players
Example 1.4 .
Our next example clarifies the difference between CCEs and CEs. We consider a \(4 \times 4\) bimatrix game described with the payoff matrices
for the row and column player, respectively. We label each player’s actions as “1,” “2,” “3,” and “4.” We claim that the distribution \(\mu = \frac{1}{2} (1, 1) + \frac{1}{2} (2, 2)\) is an exact CCE of this game, whereas the CE gap of \(\mu\) is large—namely, \(1\). Specifically, the swap deviation \(\phi\) that results in a large deviation gain is \(1 \mapsto 3\) and \(2 \mapsto 4\); the mapping for the rest of the actions is moot, as they are never played under \(\mu\). While each player obtains a utility of \(2\) under \(\mu\), deviating per \(\phi\) gives a utility of \(3\). At the same time, \(\mu\) is a CCE as it is robust with respect to constant deviations.
1.2.2 Computational properties
We have seen that CCEs and CEs give rise to new outcomes that are not attainable under independent randomization. Moreover, correlated equilibrium concepts have better computational properties than Nash equilibria. Specifically, a key property is that the set of (C)CEs is convex and can be described through a linear program.
Proposition 1.5 .
While the number of swap deviations of each player \(i \in [n]\) is \(|\mathcal{A}_{i}|^{|\mathcal{A}_{i}|}\), it’s enough to consider only a certain subset of swap deviations—ones that only change a single action; such deviations are called internal—with size \(|\mathcal{A}_{i}| (|\mathcal{A}_{i}| - 1)\); the simple proof is left as an exercise. This means that, for normal-form games with a constant number of players, a correlated equilibrium can be computed in polynomial time.
One caveat of this characterization is that the size of the LP grows exponentially with the number of players; the basic reason why this happens is that a correlated distribution in multi-player games is an exponential objective—one needs to specify the value of \(\prod_{i=1}^{n} |\mathcal{A}_{i}| - 1\) coordinates. As we shall see in the sequel, there is an ingenious algorithm for addressing this issue. For now, it is reassuring to know that one can compute a CE in normal-form games with a constant number of players. It is worth noting that one can also incorporate any linear objective function into the linear program, such as the social welfare—the sum of the players’ utilities.
1.2.3 Connection to no-regret learning
As we have alluded to, (coarse) correlated equilibria are closely tied to the framework of regret minimization in online learning.
\(\Phi\)-regret. To formalize this connection in its full generality, we will now introduce the important concept of \(\Phi\)-regret. It is a measure of the learner’s performance parameterized by a family of strategy deviations \(\Phi\). For a set of deviations \(\Phi\) comprising functions \(\phi : \mathcal{X} \to \mathcal{X}\), \(\Phi\)-regret is defined as
We covered a moment ago the special case where \(\Phi\) comprises only constant deviations: \(\Phi_{\text{const}} = \{\phi : \exists \boldsymbol{x}' \in \mathcal{X} \text{ such that } \phi (\boldsymbol{x}) = \boldsymbol{x}'\}\); this is indeed the most standard definition of regret in online learning, referred to as external regret. The key point about (5) is that the richer the set of deviations \(\Phi\), the stronger the induced notion of hindsight rationality. The other end of the spectrum where \(\Phi\) consists of all possible deviations \(\mathcal{X} \to \mathcal{X}\) gives rise to swap regret. The next example shows that an algorithm can experience large swap regret even when its external regret is small.
Example 1.6 .
Let’s say the learner picks a distribution over three actions, “1,” “2,” and “3.” Suppose further that the sequence of utilities and selected actions follow the pattern shown in Table 2, where we can assume \(T = 0 \operatorname{mod} 3\). In this example, the learner obtains overall a utility of \(\frac{T}{3}\). In fact, this matches the optimal strategy in hindsight. So, the external regret of the learner is \(0\) in this example. At the same time, consider the swap deviation
Under that deviation, the learner would be able to collect maximal utility, implying that the swap regret of the learner is \(\Omega (T)\).
This example shows that external regret—and the associated solution concept of a CCE—is a weak benchmark. In fact, a CCE can be supported on strictly dominated actions [VZ13[VZ13] Y. Viossat and A. Zapechelnyuk. (2013). No-regret dynamics and fictitious play. Journal of Economic Theory, 148, 825–842.]! This motivates considering tighter equilibrium concepts revolving around the notion of \(\Phi\)-regret.
Now, the techniques we have covered so far for minimizing external regret will not be enough to minimize swap regret. For example, multiplicative weights update (MWU) can have linear swap regret [CL06[CL06] N. Cesa-Bianchi and G. Lugosi. (2006). Prediction, learning, and games. Cambridge university press.]. As a result, we will need new algorithmic ideas to go beyond external regret and coarse correlated equilibria.
Before we proceed, we formalize the connection between minimizing \(\Phi\)-regret and correlated equilibrium concepts. We now extend our scope to general multilinear games. Here, each player \(i \in [n]\) selects a strategy \(\boldsymbol{x}_{i} \in \mathcal{X}_{i}\) from a convex and compact set \(\mathcal{X}_{i}\), so that for any joint strategy \((\boldsymbol{x}_{1}, \dots , \boldsymbol{x}_{n}) \in \mathcal{X}_{1} \times \dots \times \mathcal{X}_{n}\), the utility can be expressed as \(u_{i} (\boldsymbol{x}_{1}, \dots , \boldsymbol{x}_{n}) = \left\langle \boldsymbol{x}_{i},\boldsymbol{u}_{i} (\boldsymbol{x}_{-i}) \right\rangle\) for some utility vector \(\boldsymbol{u}_{i}\) that does not depend on \(\boldsymbol{x}_{i}\). This is a useful abstraction for encompassing both normal- and extensive-form games, the latter under the so-called sequence-form representation.
Definition 1.7 (\(\Phi\)-equilibrium) .
A correlated distribution \(\mu \in \Delta (\mathcal{X}_{1} \times \dots \times \mathcal{X}_{n})\) is an \(\epsilon\)-\(\Phi\)-equilibrium if for any player \(i \in [n]\) and deviation function \(\phi_{i}: \mathcal{X}_{i} \to \mathcal{X}_{i}\),
Theorem 1.8 .
Above, \(\boldsymbol{x}_{1}^{(t)} ⊗ \dots ⊗ \boldsymbol{x}_{n}^{(t)}\) is the product distribution induced by \((\boldsymbol{x}_{1}^{(t)}, \dots , \boldsymbol{x}_{n}^{(t)})\); \(⊗\) denotes the tensor product. The distribution \(\mu\) produced by Theorem 1.8 is a mixture of \(T\) product distributions. Correlation arises by playing multiple iterations of the game. As a special case, Theorem 1.8 implies that players minimizing swap regret converge—in terms of the average correlated distribution of play—to correlated equilibria, whereas external regret is associated with coarse correlated equilibria.
Proof.
For any player \(i \in [n]\), we have
by multilinearity. Let \(\mu = \frac{1}{T} \sum_{t=1}^{T} ⨂_{i=1}^{n} \boldsymbol{x}_{i}^{(t)}\). Continuing from (6),
1.3 A framework for minimizing \(\Phi\)-regret
Having connected \(\Phi\)-regret with \(\Phi\)-equilibria, we now introduce the elegant framework of [GGM08] to minimize \(\Phi\)-regret in the online learning setting; by virtue of Theorem 1.8, one can then compute a \(\Phi\)-equilibrium by having each player employ such algorithms.
Reducing \(\Phi\)-regret to external regret. The key idea in the construction of [GGM08] is that one can reduce minimizing \(\Phi\)-regret to minimizing external regret. The framework of [GGM08] provides a general template based on two basic subroutines.
- A fixed-point oracle: for any deviation \(\phi \in \Phi\), it outputs a fixed point \(\boldsymbol{x} = \phi (\boldsymbol{x})\).
- An online algorithm \(R_{\Phi}\) minimizing external regret with respect to the set \(\Phi\).
Regarding the fixed-point oracle, we will assume that \(\Phi\) consists of continuous functions mapping \(\mathcal{X}\) to \(\mathcal{X}\), so that the existence of a fixed point is guaranteed by Brouwer’s fixed-point theorem; whether such a fixed point can be computed efficiently is a different story. (As we shall see in the sequel, computing approximate fixed points of general functions is known to be equivalent to finding Nash equilibria [DGP08], which defeats our purpose.) For now, we can assume that \(\Phi\) is structured enough so that it admits an efficient fixed-point oracle; for example, this is the case when \(\Phi\) contains only linear deviations. A final point is that it will be enough if one has instead an approximate fixed-point oracle, in that \(\Vert \boldsymbol{x} - \phi (\boldsymbol{x})\Vert \le \epsilon\).
Assuming access to a fixed-point oracle, the reduction of [GGM08] reduces \(\Phi\)-regret to external regret, but with an important catch: the algorithm minimizing external regret needs to operate over the set of deviations \(\Phi\). This is a significantly more complex set than the one we started with, and will be the critical step in establishing efficient \(\Phi\)-regret minimizers.
Assuming access to these oracles, the algorithm of [GGM08] produces a \(\Phi\)-regret minimizer \(R\) as follows.
- In every time \(t \in [T]\), it obtains the next strategy \(\phi^{(t)}\) of \(R_{\Phi}\). \(R\) then produces as the next strategy \(\boldsymbol{x}^{(t)} \in \mathcal{X}\) any fixed point of \(\phi^{(t)}\) through the fixed-point oracle.
- Next, upon observing \(\boldsymbol{u}^{(t)}\), \(R\) feeds to \(R_{\Phi}\) the utility function \(u_{\Phi}^{(t)}: \phi \mapsto \left\langle \phi (\boldsymbol{x}^{(t)}),\boldsymbol{u}^{(t)} \right\rangle\).
NextStrategy():
NextStrategy\(()\)
ObserveUtility(\(\boldsymbol{u}^{(t)}\)):
ObserveUtility\((u_{\Phi}^{(t)})\)Theorem 1.9 ([GGM08]) .
Proof.
We have
since \(\boldsymbol{x}^{(t)} = \phi^{(t)} (\boldsymbol{x}^{(t)})\). Continuing from (8),
1.3.1 The algorithm of Blum and Mansour
Finally, we will see how to make use of the previous framework to minimize swap regret when the underlying strategy set is the probability simplex—corresponding to normal-form games. The resulting algorithm was first developed by [BM07[BM07] A. Blum and Y. Mansour. (2007). From External to Internal Regret. Journal of Machine Learning Research, 8, 1307–1324.] (we also refer to a closely related algorithm due to [SL05[SL05] G. Stoltz and G. Lugosi. (2005). Internal regret in on-line portfolio selection. Machine Learning, 59, 125–159.]). As alluded to above, the key to applying the framework of [GGM08] is to understand the structure of the set of deviations \(\Phi\). In this special case, it is enough to consider only linear functions mapping \(\Delta (\mathcal{A})\) to \(\Delta (\mathcal{A})\), for which there is a simple combinatorial characterization in terms of (column)-stochastic matrices.
Lemma 1.10 .
In proof, since \(\phi\) is linear it can be expressed as \(\boldsymbol{x} \mapsto \mathbf{M} \boldsymbol{x}\) for some matrix \(\mathbf{M}\). Now, every column of \(\mathbf{M}\) is equal to the output of \(\phi\) for the probability distribution that places all the probability in the corresponding action profile. Since \(\phi\) maps \(\Delta (\mathcal{A})\) to \(\Delta (\mathcal{A})\), it follows that every column of \(\mathbf{M}\) is a probability distribution.
Armed with the characterization of Lemma 1.10, we now see how to implement the two oracles required in the framework of [GGM08]. First, a stochastic matrix induces a Markov chain over \(\mathcal{A}\). Any stationary distribution of that Markov chain is a fixed point, and can be computed efficiently as it boils down to solving a linear system. (More broadly, when the underlying set \(\mathcal{X}\) is a polytope and \(\Phi\) comprises linear deviations, computing a fixed point amounts to solving a linear program, which can be done in polynomial time.)
The next step is to minimize regret with respect to the set of stochastic matrices. By definition, the set of stochastic matrices is a product of simplices—one probability distribution for each column:
where, for \(\boldsymbol{x}, \boldsymbol{x}' \in \mathbb{R}^{\mathcal{A}}\), \([(\boldsymbol{x}, \boldsymbol{x}')]\) denotes the matrix with columns \(x\) and \(x'\). Minimizing (external) regret over such a set can be accomplished by simply having an independent regret minimizer for each column.
Lemma 1.11 .
The overall construction is given below.
NextStrategy():
NextStrategy\(()\)
ObserveUtility(\(\boldsymbol{u}^{(t)} \in \mathbb{R}^{\mathcal{A}}\)):
ObserveUtility\((\boldsymbol{u}_{a}^{(t)})\)It consists of \(|\mathcal{A}|\) separate regret minimizers, \((R_{a})_{a \in \mathcal{A}}\), each of which operates over \(\Delta (\mathcal{A})\). To obtain the next strategy, we create the stochastic matrix \(\mathbf{M}^{(t)}\) in which each column is given by the strategy of the corresponding regret minimizer, and then output any fixed point of \(\mathbf{M}^{(t)}\). To explain the second part of the algorithm, let’s first note that the utility observed by \(R_{\Phi}\), per the construction in Theorem 1.9, can be cast as
where we used that \(\phi (\boldsymbol{x}^{(t)}) = \mathbf{M} \boldsymbol{x}^{(t)}\). In other words, \(R_{\Phi}\) observes the utility vector \(\boldsymbol{u}^{(t)} ⊗ \boldsymbol{x}^{(t)}\). Moreover, one should forward to each \(R_{a}\) its corresponding component, which is \(\boldsymbol{x}^{(t)}[a] \boldsymbol{u}^{(t)}\). If we instantiate each \(R_{a}\) with MWU and invoke Theorem 1.9, we arrive at the following result.
Theorem 1.12 ([BM07]) .
The naive argument here would only yield a swap regret bound of \(O(|\mathcal{A}| \sqrt{T \operatorname{log} |\mathcal{A}|})\) since each MWU algorithm incurs an external regret bounded by \(O(\sqrt{T \operatorname{log} |\mathcal{A}|})\). However, one can make use of the structure of the utilities to obtain the improved bound claimed in Theorem 1.12. In particular, we observe that for any \(t \in [T]\),
So, using the regret bound of MWU together with Theorem 1.9,
Optimizing the learning rate \(η\) gives the claim.
Bibliography for this chapter
| [GGM08] | G. J. Gordon, A. Greenwald and C. Marks. (2008). No-regret learning in convex games. International Conference on Machine Learning. |
| [Nas50] | J. Nash. (1950). Equilibrium points in N-person games. Proceedings of the National Academy of Sciences, 36, 48–49. |
| [DGP08] | C. Daskalakis, P. Goldberg and C. Papadimitriou. (2008). The Complexity of Computing a Nash Equilibrium. SIAM Journal on Computing. |
| [GZ89] | I. Gilboa and E. Zemel. (1989). Nash and Correlated Equilibria: Some Complexity Considerations. Games and Economic Behavior, 1, 80–93. |
| [MV78] | H. Moulin and J.-P. Vial. (1978). Strategically zero-sum games: The class of games whose completely mixed equilibria cannot be improved upon. International Journal of Game Theory, 7, 201–221. |
| [Aum74] | R. Aumann. (1974). Subjectivity and Correlation in Randomized Strategies. Journal of Mathematical Economics, 1, 67–96. |
| [VZ13] | Y. Viossat and A. Zapechelnyuk. (2013). No-regret dynamics and fictitious play. Journal of Economic Theory, 148, 825–842. |
| [CL06] | N. Cesa-Bianchi and G. Lugosi. (2006). Prediction, learning, and games. Cambridge university press. |
| [BM07] | A. Blum and Y. Mansour. (2007). From External to Internal Regret. Journal of Machine Learning Research, 8, 1307–1324. |
| [SL05] | G. Stoltz and G. Lugosi. (2005). Internal regret in on-line portfolio selection. Machine Learning, 59, 125–159. |