Conditional Punishment: Descriptive Social Norms Drive Negative Reciprocity

Peer punishment is widely considered a key mechanism supporting cooperation in human groups. Although much research shows that human behaviour is shaped by the prevailing social norms, little is known about how punishment decisions are impacted by the social context. Here we show that people’s willingness to punish free riders strongly depends on descriptive social norms of cooperation and punishment. Participants in a large-scale experiment (N=999) could punish their partner conditional on the level of cooperation or the level of punishment displayed by others who previously interacted in the same setting. We find that many people punish free riding more severely when cooperation is more common ('norm enforcement'), and when free riding is more severely punished by others ('conformist punishment'). With a dynamic model we demonstrate that these conditional punishment strategies can substantially promote cooperation. In particular, conformist punishment helps cooperation to gain a foothold in a population, and norm enforcement helps to maintain cooperation at high levels. Our results provide solid empirical evidence of conditional punishment strategies and illustrate their possible implications for the dynamics of human cooperation.


Introduction
For organizations, communities, and society as a whole to function, individuals often have to engage in activities that are costly for themselves, but beneficial for others. Peer punishment is considered to be one of the key mechanisms to explain why humans often cooperate in situations where private and collective incentives do not align: many people are willing to punish those who free ride on the cooperation of others, even if punishment is costly and cannot lead to future benefits [1][2][3][4][5][6][7][8] . The threat of punishment makes free riding less attractive and can thereby help maintain cooperation at high levels 4,5,[9][10][11][12][13][14][15][16][17][18] .
Given that peer punishment can play a pivotal role in sustaining cooperation, it is critical to understand what factors influence people's willingness to punish. When studying the drivers of peer punishment, laboratory studies typically focus on aspects specific to the interaction at hand, such as the cooperation decisions of the interaction partner, the cost and impact of punishment, or the potential for future interaction or retaliation 4,10,15,[19][20][21] . In doing so, these studies generally abstract away from the broader social context in which an interaction takes place. Cross-cultural experiments, however, show that social context matters for the effectiveness of punishment to support cooperation: people from different societies use peer punishment in systematically different ways 3,[22][23][24][25][26][27][28] . Because societies differ from each other in myriad ways, such cross-cultural comparisons have limited ability to identify exactly which aspects of the social context underlie any observed differences.
In this paper, we investigate an important way in which the social context may influence punishment of free riding: through indicating 'descriptive norms' specifying what behavior is typical in the current interaction setting [29][30][31] . Studies from across the social sciences have shown that people tend to conform to descriptive norms 29,32-37 . In social dilemmas, it has been established that many people are more willing to cooperate if they believe that others will do so as well 8,20,28,31,[38][39][40] . Whether, and if so how, descriptive norms influence peer punishment, however, remains unclear. Here, we first provide experimental evidence that many people condition their punishment of a free-riding partner on descriptive norms of cooperation and punishment. With a simple dynamic model, we then show that such conditional punishment strategies can have pronounced implications for the emergence and maintenance of cooperation in groups.
Electronic copy available at: https://ssrn.com/abstract=3571220
For the decision to punish a free riding peer, two descriptive norms may be important. First, punishment might be guided by the descriptive norm of cooperation: is free riding the typical action in the population? It has been shown that people often infer injunctive norms (what one ought to do) from descriptive norms (what most people actually do): people tend to judge behaviors that are less common in a population to be less socially appropriate (or 'moral') and consequently more deserving of punishment [41][42][43][44][45][46][47][48] . If people use descriptive norms of cooperation to form moral judgments in this manner, they will judge free riding more harshly when it is atypical, which will increase their willingness to punish. Second, punishment might be guided by a descriptive norm of punishment: is punishment a typical reaction to free riding?
Descriptive norms of punishment can signal a 'principle of social proof' 49 that free riding is disapproved of, and that punishment is an appropriate and legitimate reaction. Conformity to these norms would lead people to punish free riding if others do so as well. Examining the impact of these two descriptive norms on sanctioning behavior increases our understanding of how the social context can affect individuals' punishment of free riding and thereby influence the emergence and maintenance of cooperation.
To investigate whether descriptive norms of cooperation and punishment impact peer punishment, we conduct a large-scale decision-making experiment. Participants are randomly paired and play a prisoner's dilemma with punishment. Our implementation consists of two stages. In the first stage, participants decide to either 'cooperate' or 'defect'. In the second stage, they decide how severely they want to punish their partner if their partner chose to defect. We add minimal social context by allowing participants to condition their punishment decisions on the levels of cooperation and punishment displayed by participants who previously interacted in the same setting (hereafter, the 'reference group'). In two between-subject treatments, participants can either condition their punishment decisions on (i) the level of cooperation, or (ii) the level of punishment in the reference group. Importantly, the decisions of members of the reference group do not affect payoffs of the focal participants.
Our setup enables us to classify individual participants according to how their punishment decisions respond to descriptive norms, thereby deepening empirical understanding of individual differences in (conditional) punishment. Individual differences in conditional cooperation have received considerable attention in prior research, indicating that the dynamics of cooperation in groups strongly depend on the interplay of individuals' conditional strategies and their beliefs about others' cooperativeness 40, 50,51 . In sharp contrast, little is known about individual differences in conditional punishment and the way in which these differences may affect the emergence of cooperation. Our experimental design allows us to isolate the possible effects of descriptive norms on punishment from related considerations such as a preference for coordinated punishment or positive reciprocity towards other punishers 7,[52][53][54][55] . Finally, by creating controlled conditions that systematically differ in terms of descriptive norms of cooperation and punishment, our setup complements cross-cultural experiments on punishment that rely on natural variation in social context 3,22,24-28, 56 .
Our results demonstrate that on aggregate, people's willingness to punish their free riding partner increases both with the level of cooperation and with the level of punishment in the reference group. Importantly, we observe substantial heterogeneity in how people react to the level of cooperation and the level of punishment. Among punishers, we find that three strategies predominate: 'independent punishment', applying the same punishment intensity irrespective of the descriptive norm, 'norm enforcement', increasing punishment intensity with the fraction of cooperators in the reference group, and 'conformist punishment', increasing punishment intensity with punishment levels in the reference group.
To examine the possible long-term implications of the experimentally observed conditional punishment strategies, we develop a simple dynamic model in which a population of agents recurrently interact in a social dilemma game with punishment similar to our experiment. We use analytical methods and agent-based simulations to evaluate how the experimentally observed punishment strategies can shape cooperation in a population.
The model captures key qualitative features of social norm dynamics, involving prolonged periods of stability and sudden shifts. Moreover, the model shows that, in conjunction with independent punishers, norm enforcement and conformist punishment can effectively support cooperation. Importantly, we find that norm enforcement and conformist punishment play markedly different roles in promoting cooperation: conformist punishment can effectively promote the establishment of cooperation in a population, whereas norm enforcement is particularly effective at maintaining cooperation at high levels. Overall, our model shows that the experimentally observed conditional punishment strategies can have a strong and positive impact on the dynamics of cooperation.

Experimental design
We randomly matched participants in pairs to play a two-stage game in which they could earn points (which were converted into dollars at the end of the game). In the first stage, the two players simultaneously choose to cooperate or defect. Joint payoffs are highest when both partners cooperate, with both earning 18 points. However, each individual can increase their personal payoffs in this stage by choosing to defect: unilateral defection leads to 25 points for self and 9 points for the other. Mutual defection leads to 16 points for each. In the second stage, participants have the opportunity to punish their interaction partner if their partner chose to defect (by design excluding 'antisocial punishment'; see 26), by assigning up to 10 deduction points to them. Each assigned deduction point reduces the participant's payoffs with [...] to defect in the first stage, and to never assign any deduction points in the second stage.
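The stage-1 payoff structure described above can be written down and checked in a few lines. The following is a minimal sketch that encodes only the four stage-1 outcomes stated in the text; the names `STAGE1_PAYOFFS`, `stage1_payoff`, `defection_gain`, and `joint_payoff` are illustrative and not taken from the experimental software. The second stage is left out because the cost and impact of a deduction point are not recoverable from this text.

```python
# Stage-1 payoffs in points, as stated in the text.
# Keys are (own choice, partner's choice); values are (own payoff, partner's payoff).
STAGE1_PAYOFFS = {
    ("cooperate", "cooperate"): (18, 18),
    ("cooperate", "defect"): (9, 25),
    ("defect", "cooperate"): (25, 9),
    ("defect", "defect"): (16, 16),
}


def stage1_payoff(own_choice, partner_choice):
    """Return the focal player's stage-1 payoff in points."""
    return STAGE1_PAYOFFS[(own_choice, partner_choice)][0]


def defection_gain(partner_choice):
    """Points gained in stage 1 by defecting rather than cooperating."""
    return (stage1_payoff("defect", partner_choice)
            - stage1_payoff("cooperate", partner_choice))


def joint_payoff(choice_a, choice_b):
    """Sum of both players' stage-1 payoffs."""
    return sum(STAGE1_PAYOFFS[(choice_a, choice_b)])


if __name__ == "__main__":
    for partner in ("cooperate", "defect"):
        print(partner, defection_gain(partner))
    print(joint_payoff("cooperate", "cooperate"))
```

With these numbers, defecting raises one's own stage-1 payoff by the same amount whatever the partner does, while mutual cooperation yields the highest joint payoff, which is what makes the first stage a social dilemma.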
We report on two separate treatments (total N=999), in which participants could condition their punishment on descriptive norms of cooperation (CC treatment; N=498) or descriptive norms of punishment (CP treatment; N=501). We operationalized these descriptive norms as behaviour in a reference group of individuals who previously interacted in the same setting, but who were irrelevant for the payoffs in the current interaction. Participants had to indicate how many deduction points they would assign to their partner (if the partner chose to defect) for a set of situations that vary with respect to the reference group's levels of cooperation or punishment. The actual behavior in the reference group determined which of the situations was implemented and used to calculate payoffs (see Methods for details; the SI shows the experimental materials in full). On aggregate, behaviour in the reference group impacted the participants' punishment decisions: both the fraction of cooperators and the average intensity of punishment had a significantly positive effect on the average number of deduction points that participants assigned to their free riding partners (ordinary least squares regression: P<0.01 for both treatments; Table S1; Fig. S1). We interpret this as evidence that the social context impacts peer punishment, with both descriptive norms of cooperation and descriptive norms of punishment modulating people's overall willingness to punish defectors.

Results
Participants substantially differed in their punishment behaviour (Fig. 1). Among participants who punished at least once (64% and 55% for CC and CP, respectively), three distinct punishment strategies predominate (Fig. 1a,b): (i) 'independent punishment', applying the same punishment intensity irrespective of the behaviour in the reference group (Fig. 1a,b; orange bars), (ii) 'norm enforcement', monotonically increasing punishment with the level of cooperation in the reference group (Fig. 1a; green bar), and (iii) 'conformist punishment', monotonically increasing punishment with the level of punishment in the reference group (Fig. 1b; green bar). In the CC treatment, a smaller portion of participants decreased their punishment of free riders as cooperation became more common in the reference group (Fig. 1a; blue bar); in the CP treatment, such 'decreasing punishment' was virtually absent (Fig. 1b; blue bar). These results indicate that people substantially vary in how they condition punishment of free riders on the levels of cooperation and punishment in the social environment.
Participants who engaged in norm enforcement (Fig. 1c; green) and participants who engaged in conformist punishment (Fig. 1d; green) reacted strongly to the level of cooperation and punishment in the reference group, respectively. On average, 'norm enforcing' participants assigned 1.6 deduction points when the percentage of cooperators in the reference group was less than 5%. Their punishment increased to 6.3 deduction points when more than 95% of the participants in the (payoff-irrelevant) reference group cooperated (Fig. 1c; green line). Similarly, in the CP treatment, participants who punished conformistically assigned about 0.8 deduction points when participants in the reference group assigned 0 deduction points on average. Their punishment increased sharply to 6.5 deduction points when the average number of deduction points assigned by members of the reference group was 10.
Taken together, these results show that the punishment behaviour of participants who use conditional strategies is strongly affected by the social environment.
For participants who punished independently and cooperated in stage 1, the modal behaviour in both treatments was to assign 8 deduction points (Fig. 2a,b). By contrast, assigning 8 deduction points is very rare among independent punishers who defected in stage 1 (see Fig. S3-5 for a full breakdown of punishment decisions by cooperators and defectors in each treatment). This level of punishment equalizes the earnings between a cooperator and their free-riding partner, suggesting that some participants do not punish to reciprocate the unkind action, but rather to eliminate disadvantageous inequality 21,57 .
The dynamics are stochastic: with probability ε > 0, an agent makes a mistake and behaves randomly; with complementary probability 1 − ε the agent behaves according to its strategy 58,59 ; see Methods for more details.
Our goal is to assess how the relative frequencies of independent punishment, norm enforcement, and conformist punishment affect the dynamics of cooperation. First, we derive analytical results about the stationary distribution of the dynamic when the observation sample is large (m=n) and the mistake probability is vanishingly small (ε → 0).
The stationary distribution reflects the relative frequencies of different population states in the long run (t → ∞). We show that if [...]
Figure 3 shows the dynamics of cooperation in situations where independent punishment is not sufficiently frequent to sustain cooperation by itself. We first confirm that, if independent punishers are too rare to support cooperation on their own and neither of the conditional punishment strategies is present in the population, cooperation never emerges in our simulations (Fig. 3a,b). Next, we consider cases where independent punishment is complemented with conditional punishment strategies, raising the overall frequency of punishers. The presence of norm enforcement has a strong stabilizing effect once high levels of cooperation have been achieved (Fig. 3c). However, it might take considerable time for cooperation to emerge (Fig. 3d). These dynamics are driven by a positive feedback loop between norm enforcement and cooperation, locking a population into a state of either high or low cooperation, making it hard to transition from one state to the other.
By contrast, in the presence of conformist punishers cooperation readily emerges, but is not stable (Fig. 3e,f). The population alternates between states with low and high levels of cooperation, with rapid shifts between these states. These dynamics are driven by another positive feedback loop: when levels of cooperation and punishment are low, some agents may punish their free riding partner due to mistakes or, in the case of conformist punishers, due to sampling bias. In turn, these stochastic events may prompt other conformist punishers to punish too in the next period, thereby increasing the levels of cooperation and punishment even more, and possibly tipping the population to high levels of cooperation and punishment.
However, similar stochastic processes may also cause cooperation to break down suddenly: conformist punishers stop punishing when they happen to underestimate the level of punishment in the population.
When both conformist punishment and norm enforcement are present in the population (keeping the overall frequency of conditional punishment the same), cooperation rapidly emerges and remains stable at high levels (Fig. 3g,h). Conformist punishers still amplify the impact of stochasticity when cooperation is low, facilitating the emergence of cooperation.
Subsequently, norm enforcement locks the population into a state of high cooperation. This result highlights that the concerted action of conformist punishment and norm enforcement can efficiently support cooperation.
These results indicate that different conditional punishment strategies can promote cooperation in different ways: conformist punishment facilitates the emergence of cooperation; norm enforcement helps to maintain it after its emergence. Figure 4 confirms these insights.
When a population starts from a state of low cooperation, the presence of conformist punishment, rather than norm enforcement, can strongly increase the rate at which it shifts to a state of high cooperation (Fig. 4a). Conversely, the presence of norm enforcement can substantially extend the time that a population remains in a state of high cooperation (Fig. 4b).
In the Supplementary Information we examine the generalizability and robustness of our model results. We confirm that our main model results hold across different ranges of relative frequencies of the various (conditional) punishment strategies and different initial beliefs about cooperation and punishment in the population (Fig. S9-10). Furthermore, we show that the presence of agents who decrease their punishment of free riding as cooperation becomes more common, as observed in the CC treatment ('decreasing punishment' in Fig. 1a), destabilizes the non-cooperative equilibrium. By itself, decreasing punishment cannot support high levels of cooperation. However, in conjunction with other conditional punishment strategies, norm enforcement in particular, decreasing punishment can boost the likelihood that a population reaches high and stable levels of cooperation (Fig. S11).

Discussion
Our experiment provides large-scale behavioural evidence that punishment of free riding in social dilemmas is shaped both by descriptive norms of cooperation ("is free riding a typical [...]
Our finding that people punish free riding more when cooperation is more common provides novel behavioral evidence for the idea that people infer injunctive norms (what is 'moral') from descriptive norms (what is 'common') [41][42][43][44][45][46]48,63 . In doing so, we complement existing research that largely relied on (non-incentivized) moral judgments 43,45,46,48 . Our behavioural approach, however, does not allow us to pin down the psychological mechanisms underlying the different punishment strategies. Previous evidence suggests that norm enforcement may be driven by increased disapproval of free riding when cooperation is common 47 [...]
[...] insights into the different roles that norm enforcement and conformist punishment play in this dynamic. Norm enforcers punish free riders when the cooperation rate in the population is relatively high, which makes them effective in maintaining cooperation. However, they do not punish when free riding predominates and are therefore of little help for cooperation to emerge from scratch. In contrast, conformist punishers sanction free riders as long as sufficiently many others do, irrespective of the cooperation rate, and can therefore play a valuable role in helping cooperation gain a foothold in a population.
Whereas our experiment shows that the behaviour of an individual can be influenced by what the collective is doing, our model illustrates how these individual strategies can subsequently impact collective dynamics. We deliberately employ a simple stylized model to illustrate the basic effects of conditional punishment strategies on the dynamics of cooperation. Despite its simplifying assumptions (e.g., mutually exclusive punishment strategies, binary punishment and cooperation choices, random re-matching after every interaction), our model produces intuitive and robust results. Moreover, the model is able to capture key qualitative features of the dynamics of social norms: prolonged periods of stability which are punctuated by tipping points, where one norm is rapidly replaced by another (Fig. 3; 63). In line with the results of an existing project 47 , we find that the positive social feedback provided by norm enforcers is especially critical for capturing these patterns in norm dynamics.
Our simulations illustrate how conformist punishment can amplify stochastic events, leading to rapid alternation between the emergence and breakdown of cooperation in a population (Fig. 3, 4). In contrast, norm enforcement can engender a process of positive feedback with cooperation, locking a population into a state of either high or low levels of cooperation, making it hard to transition to the other state (Fig. 3). These results give pointers for efficiently promoting desirable behaviours, such as voting, tax compliance, or energy conservation. In particular, facilitating the observability of (or accessibility to) information about other people's behaviour may be effective when the majority of the population displays the desired behaviour: this information can boost norm enforcement, ensuring that adherence to the present norm remains high. Conversely, when a majority of the population shows the undesired behaviour, it may be more effective to provide people with information that many others disapprove of the undesirable behaviour. Such information may trigger conformist punishment and shift the system to the more desirable outcome.
After reading the instructions and passing compulsory control questions, participants entered stage 1 and made their binary cooperation decisions. In stage 2, participants completed another set of compulsory control questions (see Figs. S6 and S7 for details), before we asked them to provide their punishment responses to descriptive norms of cooperation and punishment. We used the strategy method 70 to obtain a full punishment profile for each individual 55,[71][72][73] . In the CC treatment, we operationalized the descriptive norm of cooperation as the fraction of cooperative choices in a payoff-irrelevant reference group (sampled from a pre-recorded pool; details below). We presented participants with eleven situations regarding the proportion of cooperators in this reference group, spanning the full range of possible outcomes. For each of these situations, participants had to indicate how many deduction points they would assign to their current interaction partner. In the CP treatment, we operationalized the descriptive norm of punishment as the average intensity of punishment in the reference group, and participants indicated for each possible situation how many deduction points they would assign to their current interaction partner.

Methods
The pre-recorded pool consisted of a total of 273 MTurkers who played a prisoner's dilemma with punishment mirroring our experiment (cooperation rate: 69%; average punishment of free riding partners: 2.7 deduction points). For each dyad in the main experiment, we independently sampled 50 participants from the pre-recorded pool to form the reference group. The behaviour of the reference group defined the situation that was used to calculate participants' earnings. Since participants did not know which situation was the actual one beforehand, they were incentivized to consider each situation as if it was real.
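The construction of a reference group can be illustrated with a short sketch. It assumes sampling without replacement (the text does not say), and it builds a hypothetical pool of 273 binary cooperation choices whose cooperation rate rounds to the reported 69%; the actual pool data are not reproduced here. The function names are illustrative.

```python
import random


def build_illustrative_pool(size=273, cooperators=188):
    """Hypothetical pool of cooperation choices (True = cooperated).

    188 of 273 is about 69%, matching the reported rate; the real data differ.
    """
    return [True] * cooperators + [False] * (size - cooperators)


def sample_reference_group(pool, group_size=50, rng=None):
    """Draw a reference group from the pool (assumed: without replacement)."""
    rng = rng or random.Random()
    return rng.sample(pool, group_size)


def cooperation_fraction(group):
    """Fraction of cooperators in a reference group."""
    return sum(group) / len(group)


if __name__ == "__main__":
    pool = build_illustrative_pool()
    group = sample_reference_group(pool, rng=random.Random(1))
    print(len(group), cooperation_fraction(group))
```

The resulting fraction would then select which of the eleven situations is implemented; how fractions are mapped to situations is not specified in this text, so that step is omitted.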
Once participants had completed the two decision making stages of the experiment, they were placed in a lobby, in which they were matched with another participant as soon as that participant had also completed their decisions. Excluding the time spent in the lobby, our experiment on average lasted 9.9 minutes. In our experiment, participants could earn points which were converted to US dollars at the end of the experiment (20 points were worth $1.00). Average earnings were $1.96 (range $0.41 to $2.51), which translates to an hourly wage of approximately $12.00.
We defined independent punishment as using the same (non-zero) level of punishment across all situations. We defined the conditional punishment strategies of norm enforcement and conformist punishment as showing a weakly monotonic increase in punishment in response to increasing levels of cooperation (CC treatment) and punishment (CP treatment) in the reference group, respectively. This approach based on monotonicity is a conservative way to identify conditional punishment strategies: an alternative classification method based on linear regression models would lead all individuals with non-monotonic response patterns (cf. Fig. 1) to be identified as using either independent, increasing, or decreasing punishment strategies.
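The monotonicity-based classification can be sketched as follows. This is an illustration of the rule as described above, not the authors' analysis code. The function name and the labels "non-punisher" and "non-monotonic" are chosen here for illustration, and a profile is assumed to be a list of deduction points ordered by increasing level of cooperation (CC) or punishment (CP) in the reference group.

```python
def classify_punishment_profile(profile):
    """Classify a punishment profile by weak monotonicity.

    profile: deduction points for each situation, ordered from the lowest
    to the highest reference-group level.
    """
    if all(points == 0 for points in profile):
        return "non-punisher"
    if len(set(profile)) == 1:
        # Same non-zero punishment in every situation.
        return "independent"
    steps = list(zip(profile, profile[1:]))
    if all(a <= b for a, b in steps):
        # Norm enforcement (CC treatment) or conformist punishment (CP treatment).
        return "increasing"
    if all(a >= b for a, b in steps):
        return "decreasing"
    return "non-monotonic"


if __name__ == "__main__":
    print(classify_punishment_profile([0, 0, 1, 2, 2, 3, 4, 5, 6, 6, 7]))
    print(classify_punishment_profile([8] * 11))
```

Because constant profiles are caught before the monotonicity checks, a flat non-zero profile is labelled independent rather than increasing or decreasing.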
Dynamic model. In the first period of the simulations, agents are endowed with initial beliefs about the norms of cooperation and punishment, and respond to the beliefs according to their specified strategies. For Starting High in Fig. 3 and Fig. 4b, agents initially believe that 75% of the agents in the population will cooperate and that 75% of the agents would punish free riding (both beliefs equal 0.75). For all agents, the payoff maximizing response to these beliefs is to cooperate; independent punishers, norm enforcers, and conformist punishers punish their defecting partner when holding these beliefs. For Starting Low in Fig. 3 and Fig. 4a, agents have initial beliefs of 0.25 for both the cooperation rate and the punishment rate. For all agents, the payoff maximizing response to these beliefs is to defect; only independent punishers punish defectors when holding these beliefs.
In each subsequent period, each agent updates their beliefs with probability u by sampling m agents from the population. We set u = 0.5 in our simulations; our analytical results apply to any u with 0 < u < 1. Assuming u < 1 prevents all agents from simultaneously updating in a period with probability one 58 [...]
Table S1.
[...] Fig. 2a,b), vertical axes show counts. Note that, in contrast to cooperators, defectors do not equalize payoffs between themselves and their partners by assigning 8 deduction points (potentially explaining why this response was much more frequent among cooperators than among defectors).

[...] blue bar) on cooperation dynamics. Agents with this strategy punish free riders if they believe cooperation rates are lower than 50%. Triangles show outcomes of simulations that vary the relative proportion of norm enforcement, conformist punishment, and decreasing punishment, with independent punishment fixed at 30%. The top row shows the percentage of periods for which cooperation was higher than 75% for each combination of strategies, whereas the bottom row shows the frequency of cooperation over all periods. The three frequencies varied are those of decreasing punishment (DP), norm enforcement (NE), and conformist punishment (CP). Results are the average outcome of simulations where cooperation either started high (initial beliefs about cooperation and punishment both equal to 0.75) or low (both equal to 0.25). In particular, for each possible combination of these three frequencies, averages are based on 10 simulations (5 with high and 5 with low initial beliefs). Each simulation runs for 10,000 (10^5) periods. Further simulation settings: n = 100, m = 10, u = 0.5, ε = 0.05.
Here we use analytical methods to evaluate our model, addressing how the experimentally identified punishment strategies interact to shape the dynamics of cooperation in the long run.

Supplementary Tables
Section 1 describes and formalizes the interaction setting. Section 2 describes the strategies we consider. Sections 3 and 4 analyse the effects of conditional punishment strategies on cooperation in the short run and in the long run, respectively.

Setting
We consider the following decision setting, which is similar to the task used in the experiment.
Two agents, A and B, are randomly drawn from a large population to play a two-stage game.
In Stage 1, they can either cooperate or defect. Table S2 shows how the Stage 1 material payoffs for both agents depend on their choices.
We focus on binary punishment decisions for the sake of exposition and tractability. Compared with the task in our experiment, focusing on binary punishment decisions in our model is not without loss of generality. Binary punishment excludes the possibility that an individual's punishment is not weakly monotonic (i.e., that it is neither independent, nor weakly increasing or weakly decreasing) in response to an increasing cooperation rate or punishment rate in the population. Our experimental results, however, suggest that non-monotonic punishment behaviour is much less common than independent punishment, norm enforcement, and conformist punishment (Fig. 1 in the main text). Furthermore, the group of participants who show non-monotonic punishment behaviour becomes very small (less than 10 percent) if we exclude participants who had difficulty answering the nine compulsory control questions (Fig. S7), suggesting that such non-monotonic behaviour is likely to be the result of inattentive choice behaviour, rather than a real preference.

Strategies
Cooperation. We assume that an agent's choice to cooperate or defect depends on which choice generates the highest expected material payoffs. An agent holds a belief in [0,1] about the cooperation rate in the population and a belief in [0,1] about the punishment rate. From Table S2 we can see that the expected payoff from choosing cooperate is [...] (1). The expected payoff from choosing defect is [...] (2). An agent cooperates if and only if (1) ≥ (2) (assuming they cooperate if expected payoffs are the same). Rearranging the terms leads to the condition [...]
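The logic of this cooperation rule can be illustrated numerically. The sketch below does not use the payoffs of Table S2, which are not reproduced in this text; it substitutes the stage-1 payoffs of the experiment (18, 9, 25, 16) and a hypothetical parameter `punishment_loss`, the payoff reduction suffered by a defector who is punished. The symbols `p` (belief about the cooperation rate) and `q` (belief about the punishment rate) are names chosen here. Any cost of the agent's own punishing is ignored, on the assumption that it does not depend on the agent's own stage-1 choice.

```python
def expected_payoff_cooperate(p):
    """Expected stage-1 payoff from cooperating, given cooperation belief p."""
    return 18 * p + 9 * (1 - p)


def expected_payoff_defect(p, q, punishment_loss):
    """Expected payoff from defecting, given beliefs p and q.

    punishment_loss is a hypothetical parameter: points lost when punished.
    """
    return 25 * p + 16 * (1 - p) - q * punishment_loss


def cooperation_threshold(punishment_loss):
    """Smallest punishment belief q at which cooperating is a best response.

    Defecting gains 7 points in stage 1 whatever the partner does, so the
    agent cooperates if and only if q * punishment_loss >= 7.
    """
    return 7 / punishment_loss


def cooperates(q, punishment_loss):
    """Best response to the punishment belief q (ties resolved as cooperate)."""
    return q >= cooperation_threshold(punishment_loss)


if __name__ == "__main__":
    loss = 14  # hypothetical value, chosen so that the threshold is 0.5
    print(cooperation_threshold(loss))
    print(cooperates(0.75, loss), cooperates(0.25, loss))
```

Under these assumed payoffs the belief about the cooperation rate cancels out of the comparison, so the decision to cooperate depends only on the believed punishment rate. This is in line with the model's rule that an agent cooperates when the punishment belief reaches a threshold.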

Short-run (Nash) equilibrium
Proposition 1 shows the conditions under which cooperation can be sustained in the short run.
Agents do not know the type of agent with whom they are matched. As is standard in the economic literature, we assume that agents have a common prior on the population composition, which corresponds to the actual frequencies of the strategy types in the population. In the next section 'Long-run equilibrium', we will address the problem of how agents form and update beliefs over time.
Exogenous payoff parameters of the game determine the equilibria through their effects on the thresholds θ_C, θ_NE, and θ_CP.
(2) We show the contrapositive: Suppose there is an equilibrium in which an agent defects. Then for the agent, q_P < θ_C, where q_P is at least x_IP. Hence x_IP < θ_C. Q.E.D.

Long-run equilibrium
In this section we examine the long-run effects of conditional and unconditional punishment strategies on cooperation. We aim to delineate the conditions under which conditional punishment strategies (norm enforcement and conformist punishment) will, in the long run, cause the population to be in or around the cooperation equilibrium for most of the time. Our analysis builds on 1-3 .
We consider discrete time periods: t = 0, 1, 2, …. In each period, agents are randomly matched and interact in the two-stage game described in Section 1 above. An agent's punishment strategy and the population composition (x_IP, x_NE, x_CP, x_0) are fixed over time, but agents may update their cooperation and punishment decisions as their beliefs q_C and q_P change. In each period agents react to their beliefs 'myopically' to maximise their expected payoffs in that period.
To be more precise, each period involves two successive classes of events:

I. Updating beliefs. In each period t ≥ 1, each agent updates their beliefs with probability u, with 0 < u < 1. Belief updating works as follows. The agent randomly samples m agents from the population, with 0 < m ≤ n. They count how many agents in the sample cooperated and how many would punish according to their strategies in the previous period, and divide the counts by m. The results become their beliefs q_C and q_P in the current period.

II. Responding myopically to beliefs. An agent cooperates in a period if and only if they have belief q_P ≥ θ_C. Punishment decisions are determined according to the agents' types (as specified in Section 2 above).
With high probability, an agent's decisions are implemented according to the rules stated above. With small probability ε ≥ 0, however, an agent makes a mistake ("tremble"). A mistake implies that the agent randomly selects a cooperative action or a punishment action. We assume that mistakes are independent across periods, agents, and across cooperation and punishment decisions. Following 1-3 , we refer to the dynamic with ε > 0 as the stochastic dynamic, and the dynamic with ε = 0 as the best-response dynamic.
We first analyse the stochastic dynamic in the case of m = n, t → ∞, and ε → 0 using analytical methods. As previous studies of the same class of stochastic dynamics show 1-3 , whether m < n or m = n does not affect the stationary distributions of the dynamics. Later, we also conduct simulations to explore the cases of small sample size m, finite t, and non-negligible ε.
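As an illustration of the dynamic described above, the following minimal Python sketch simulates belief updating, myopic best responses, and mistakes. It is not the simulation code used for the main text. The type labels ("IP", "NE", "CP", "NP"), the function and parameter names, the initial beliefs, and the equal-odds random choice after a mistake are assumptions made for this sketch. The punishment rules are likewise an assumed reading of the four strategies: independent punishers always punish, norm enforcers punish when their cooperation belief reaches their threshold, conformist punishers punish when their punishment belief reaches theirs, and non-punishers never punish.

```python
import random

def would_punish(agent_type, q_c, q_p, theta_ne, theta_cp):
    """Assumed punishment rule for each of the four types."""
    if agent_type == "IP":          # independent punisher
        return True
    if agent_type == "NE":          # norm enforcer: conditions on cooperation
        return q_c >= theta_ne
    if agent_type == "CP":          # conformist punisher: conditions on punishment
        return q_p >= theta_cp
    return False                    # "NP": non-punisher

def simulate(types, periods, m, u, eps,
             theta_c=0.5, theta_ne=0.5, theta_cp=0.5, seed=0):
    """Return the realised cooperation rate in each period."""
    rng = random.Random(seed)
    n = len(types)
    # Assumed start: nobody cooperates and only independent punishers punish.
    share_ip = types.count("IP") / n
    beliefs = [(0.0, share_ip) for _ in range(n)]
    coop = [False] * n
    punish = [t == "IP" for t in types]
    coop_rates = []
    for _ in range(periods):
        prev_coop, prev_punish = coop[:], punish[:]
        for i in range(n):
            # I. With probability u, resample m agents and update beliefs.
            if rng.random() < u:
                sample = rng.sample(range(n), m)
                q_c = sum(prev_coop[j] for j in sample) / m
                q_p = sum(prev_punish[j] for j in sample) / m
                beliefs[i] = (q_c, q_p)
            # II. Respond myopically to current beliefs.
            q_c, q_p = beliefs[i]
            coop[i] = q_p >= theta_c
            punish[i] = would_punish(types[i], q_c, q_p, theta_ne, theta_cp)
            # Independent mistakes for the two decisions.
            if rng.random() < eps:
                coop[i] = rng.random() < 0.5
            if rng.random() < eps:
                punish[i] = rng.random() < 0.5
        coop_rates.append(sum(coop) / n)
    return coop_rates
```

With eps set to 0 the sketch reduces to the best-response dynamic; for example, simulate(["IP"] * 60 + ["NP"] * 40, periods=20, m=100, u=0.5, eps=0.0) keeps everyone cooperating because the share of independent punishers alone exceeds the cooperation threshold of one half.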
Our analytical results aim to characterize the set of long-run equilibria. These are the equilibria that have a positive frequency in the stationary distribution of the stochastic dynamic when the probability of mistakes is vanishingly small. A long-run equilibrium is formally defined as follows.
Let z be a population state specifying the cooperation decision and punishment decision of each agent in the population. Let Z denote the set of all population states. Let μ^ε ∈ Δ(Z) denote the stationary distribution of the stochastic dynamic under ε > 0 and m = n. The stochastic dynamic is an irreducible Markov chain on the finite state space Z. Hence μ^ε exists and is unique for each ε. We obtain μ^ε by taking t → ∞. Let μ* ≡ lim_{ε→0} μ^ε denote the limit distribution as ε approaches zero. A state z is a long-run equilibrium if μ*(z) > 0 1-3 . If a state is a unique long-run equilibrium for sufficiently large n, then it is a generically unique long-run equilibrium.
The proofs for Proposition 2 are provided at the end of this section. Fig. S12 illustrates the proposition, which says that, together with independent punishment, conditional punishment can support cooperation as the generically unique long-run equilibrium. If the frequencies of independent punishment and conditional punishment are both low, then the cooperation equilibrium cannot be sustained in the long run. Nevertheless, the characterization by Proposition 2 is incomplete: it is silent about the intermediate cases in which x_IP + x_NE or x_IP + x_CP lies strictly between the bounds stated in the proposition. Proposition 3 below provides precise cut-off conditions for the long-run equilibrium for the special case where θ_C = θ_NE = θ_CP = 1/2, which is also the set of parameters we use in our simulations presented in the main text. The assumption that all thresholds are equal to a half is somewhat arbitrary. As stated in Section 2 of this Supplement, the threshold θ_C is determined by exogenous payoff parameters. Hence, setting it equal to a half comes down to considering a subset of the potential payoff space. For θ_NE and θ_CP, however, a threshold of a half makes intuitive sense. For norm enforcement, it is in line with the idea that people will judge the more common behaviour as the more moral one, and act to enforce it 4 . For conformist punishment, it states that these agents follow the behaviour of the majority. Furthermore, our focus here is not on the comparative statics with respect to these thresholds, but rather on how the population composition (with respect to punishment strategies) affects cooperation dynamics. In this regard, the proposition below is illuminating. That is, the average frequency of norm enforcement and conformist punishment is important to support cooperation in the long run; it is as important as the role played by independent punishment.

Remarks.
Economists have used myopic best-response stochastic dynamics to study bargaining norms 5 , customs in economic contracts 6 , evolution of altruism 7 , the selection of coordination actions in social networks 8,9 , diffusion of innovations 10,11 , and the evolution of cooperation strategies in repeated games 12 . In particular, studies [1][2][3] show that many details of these dynamics do not affect their stationary distributions when ε → 0. Specifically, the stationary distribution is not affected by the value of the updating probability u as long as 0 < u < 1, by the sample size m as long as m does not become so small that it affects the tipping thresholds, or by the probability distribution used to pick actions when making mistakes.
Assuming u < 1 ensures that agents do not all update simultaneously in every period. If u = 1 and ε is small, then besides the cooperation equilibrium and the defection equilibrium, the population can also be trapped in a loop of jumping back and forth between two states: in one, all agents defect and all punish defectors except for the non-punishers; in the other, all agents cooperate but no one would punish defectors except for the independent punishers.
We exclude this possibility to focus on the transitions between the cooperation equilibrium and the defection equilibrium characterized by Proposition 1.

Proof of Proposition 2.
Preliminaries. First, we introduce necessary terminology for our proof (see, e.g., Young 3 for a more extensive discussion). An absorbing set (of the best-response dynamic) is a subset of states A ⊂ Z such that (i) if the best-response dynamic starts from a state in A then it stays within A with probability 1, and (ii) for any z, z′ ∈ A, there is a positive probability of transiting from z to z′ within a finite number of periods. If an absorbing set contains only one state, then we call the state an absorbing state. A transition path from z to z′ is a finite sequence of states z_1, z_2, …, z_k ∈ Z, with z_1 = z, z_k = z′, and z_i ≠ z_{i+1} for each 1 ≤ i < k. The cost of a transition path, denoted by c(z_1, z_2, …, z_k), is the number of mistakes (choices that are not best responses) that occur along the path.
We use stochastic trees to represent minimum transition costs between absorbing sets. A stochastic tree is a directed tree with each absorbing set as a vertex. The directed edges in a stochastic tree represent transitions among absorbing sets. Each edge is weighted by the minimum number of mistakes required to transit from one absorbing set to another. An absorbing set is said to be at the root of a stochastic tree if there is no edge (with positive weight) leading from it to other absorbing sets in the tree. The cost of a stochastic tree is the sum of the weights of all its edges. Our proof applies the following theorem: Young's Theorem 2 .

1. A state is a long-run equilibrium only if it is contained in an absorbing set.
2. If an absorbing state is at the root of the stochastic tree that strictly minimizes the cost among all stochastic trees, then the state is the unique long-run equilibrium.
The best-response dynamic in our model has only two absorbing sets: one consisting of the defection equilibrium, and the other consisting of the cooperation equilibrium. With abuse of notation, we denote them by D and C, respectively. By Young's theorem, D and C are the only candidates for a long-run equilibrium.
We can construct two stochastic trees: C → D (a directed line with C and D as its two vertices connected by a unique edge leading from C to D) and D → C. Let c_{C→D} denote the minimum number of mistakes required to transit from C to D. More precisely, c_{C→D} is the minimum value of c(z_1, z_2, …, z_k) among the set of all paths z_1, z_2, …, z_k with z_1 = C and z_k = D.
Likewise, c_{D→C} is the minimum value of c(z_1, z_2, …, z_k) among the set of all paths z_1, z_2, …, z_k with z_1 = D and z_k = C. By Young's theorem, it suffices to compare c_{C→D} with c_{D→C} to determine the long-run equilibrium.
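To see why comparing the two minimum costs is enough, consider a stylized two-state reduction. This is an assumption made for illustration, not the full Markov chain on the state space: the population is either in the defection or the cooperation equilibrium, leaves the defection equilibrium with probability equal to the mistake probability raised to the number of mistakes required, and leaves the cooperation equilibrium analogously. The sketch below, with hypothetical function and parameter names, computes the stationary probability of the cooperation equilibrium in this two-state chain.

```python
def stationary_prob_cooperation(eps, cost_d_to_c, cost_c_to_d):
    """Stationary probability of C in a two-state chain on {D, C}.

    eps          -- mistake probability, assumed to lie in (0, 1)
    cost_d_to_c  -- mistakes needed to leave D for C
    cost_c_to_d  -- mistakes needed to leave C for D
    """
    p_dc = eps ** cost_d_to_c      # per-period probability of D -> C
    p_cd = eps ** cost_c_to_d      # per-period probability of C -> D
    # For a two-state chain the stationary mass on C is p_dc / (p_dc + p_cd).
    return p_dc / (p_dc + p_cd)

if __name__ == "__main__":
    for eps in (0.1, 0.01, 0.001):
        print(eps, stationary_prob_cooperation(eps, 2, 3))
```

When leaving the defection equilibrium requires fewer mistakes than leaving the cooperation equilibrium, the stationary mass on cooperation tends to one as the mistake probability shrinks, and it tends to zero in the opposite case.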
Transition paths. Now we examine transition paths with minimum costs between D and C.
Three paths are relevant to determine the minimum cost of transitions from D to C:
• Path E1 ('E' for Emergence of cooperation): Starting from D at time t = 0, if θ_C − x_IP < x_0, then let ⌈(θ_C − x_IP)n⌉ non-punishers punish defectors by mistake at t = 1 (for any real number x, ⌈x⌉ is the lowest integer equal to or greater than x). If θ_C − x_IP ≥ x_0, then let all non-punishers and ⌈(θ_C − x_IP − x_0)n⌉ conformist punishers punish by mistake at t = 1. At t = 2, let all agents update their cooperation decision. Then they all cooperate (for brevity, if we do not explicitly mention that agents update their cooperation or punishment decision, then the agents do not update from the last period, and do not make any mistakes). At t = 3, let all norm enforcement agents update their punishment decision.
Then by requiring all agents to update both cooperation and punishment decisions at t = 6, we reach C. Counting the number of mistakes, we obtain the cost of path E1.
Correspondingly, three paths are relevant to compute the minimum cost of transiting from C to D.
Electronic copy available at: https://ssrn.com/abstract=3571220
Simplifying observations. We need to determine the path with minimum cost among the six paths above. This requires solving a set of linear inequalities. Two observations simplify our calculations. First, since we are only concerned with generically unique long-run equilibria, it is both sufficient and necessary for the minimum cost path to have strictly lower cost than all transition paths of the opposite direction for infinitely many n. A sufficient and necessary condition for this is that there is a finite n under which all relevant inequalities for pairwise cost comparisons hold strictly. This condition is equivalent to having all relevant inequalities holding strictly when we ignore all "⌈·⌉" brackets, i.e., when we ignore the "least integer greater than or equal to" operator. To see the equivalence, first, suppose ⌈an⌉ < ⌈bn⌉ for some positive n. Then obviously an < bn. Conversely, suppose an < bn for some positive n. Then a < b, and there is some large enough integer N such that (b − a)N > 1, implying aN + 1 < bN. Thus ⌈aN⌉ < ⌈bN⌉, and ⌈an⌉ < ⌈bn⌉ for all n > N.
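The ceiling argument can also be checked numerically. The sketch below uses exact rational arithmetic to avoid floating-point rounding; the names a, b, first_safe_n, and strict_for_range are chosen for this sketch. It finds the smallest integer N with (b − a)N > 1 for two numbers a < b and verifies that the strict inequality between the rounded-up products holds from N onwards, even though it can fail for smaller population sizes.

```python
from fractions import Fraction
from math import ceil

def first_safe_n(a, b):
    """Smallest integer N with (b - a) * N > 1, assuming a < b."""
    if not a < b:
        raise ValueError("requires a < b")
    return int(1 / (b - a)) + 1

def strict_for_range(a, b, start, count=1000):
    """Check ceil(a*n) < ceil(b*n) for n = start, ..., start + count - 1."""
    return all(ceil(a * n) < ceil(b * n) for n in range(start, start + count))

if __name__ == "__main__":
    a, b = Fraction(1, 3), Fraction(2, 5)
    N = first_safe_n(a, b)
    print("N =", N, "strict from N onwards:", strict_for_range(a, b, N))
    # For n = 1 both products round up to the same integer, so the
    # strict inequality fails even though a < b.
    print("n = 1:", ceil(a * 1), ceil(b * 1))
```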
Second, after removing all "⌈·⌉" brackets, the costs of the paths listed above are all multiples of n. Taking the two observations together, it suffices to consider their relative costs c†(·) ≡ c(·)/n and ignore all "⌈·⌉" operators. Henceforth we will focus on c†(·) and remove all "⌈·⌉" operators. The six transition paths and their relative costs c† are summarized in Table S3 below.
Taken together, we have the claimed properties.
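As a worked illustration of counting mistakes and dividing by the population size, the sketch below follows the two cases described for the first period of path E1. It counts only the mistakes that the description places at t = 1 and makes no claim about the full cost of the path. The function name and the parameter names (theta_c for the cooperation threshold, x_ip and x_0 for the shares of independent punishers and non-punishers, n for the population size) are hypothetical, and returning zero mistakes when independent punishers alone reach the threshold is an assumption of this sketch.

```python
from fractions import Fraction
from math import ceil

def e1_mistakes_at_t1(theta_c, x_ip, x_0, n):
    """Return (non-punisher mistakes, conformist-punisher mistakes) at t = 1."""
    shortfall = theta_c - x_ip      # punishment share still missing
    if shortfall <= 0:
        return (0, 0)               # assumed: no mistakes are needed
    if shortfall < x_0:
        # Enough non-punishers exist to close the gap on their own.
        return (ceil(shortfall * n), 0)
    # All non-punishers err, and conformist punishers cover the remainder.
    return (ceil(x_0 * n), ceil((shortfall - x_0) * n))

if __name__ == "__main__":
    half = Fraction(1, 2)
    n = 100
    np_m, cp_m = e1_mistakes_at_t1(half, Fraction(1, 10), Fraction(1, 5), n)
    # Dividing by n gives the relative count, free of the ceiling operator
    # whenever the shares times n are integers.
    print(np_m, cp_m, Fraction(np_m + cp_m, n))
```

In this example the relative count equals the gap between the cooperation threshold and the share of independent punishers, which is the kind of quantity that remains once the rounding operators are dropped.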
Observe that # , ‚ ≤ K n , but † ( 3), † ( 3) ≥ K n . Hence, 3 and 3 are never the paths with strictly minimum costs. Therefore, by Young's theorem, 2% > M , implying # < ‚ , is both necessary and sufficient for to be the generically unique long-run equilibrium. And 2% < M , implying # > ‚ , is necessary and sufficient for to be the generically unique longrun equilibrium. Q.E.D.

Experimental Procedures and Materials
Participants were recruited from Amazon Mechanical Turk (MTurk), which has been shown to provide good quality data in various settings [13][14][15] , including social dilemma games with punishment 16 .
After reading instructions, participants were placed in a 'lobby' until another participant arrived.
Once two participants were in the lobby, they were matched and directed to the first decision stage of the experiment. In case no match could be made within 5 minutes, participants could choose to leave and receive a fixed bonus payment of $1.00, or to wait for another 2 minutes for a possible matching partner (as in 16 ). Participants were informed that from the point of reaching the lobby onwards, they did not have to make any further decisions.
Below we show on-screen instructions as displayed to participants. We start with the CC treatment in which participants could condition punishment of their interaction partner on descriptive norms of cooperation. Then we show the CP treatment, in which participants could condition punishment of their interaction partner on descriptive norms of punishment. The experiment was programmed in LIONESS Lab 17 . Participants could not navigate the experimental pages at will. Each time they pressed a button, the browser history was automatically overwritten.