Haplotype Reconstruction for Diploid Populations

The inference of haplotype pairs directly from unphased genotype data is a key step in the analysis of genetic variation in relation to disease and pharmacogenetically relevant traits. Popular methods such as Phase and PL require either the coalescence assumption or the assumption of linkage between the single-nucleotide polymorphisms (SNPs). We have developed novel approaches that are independent of these assumptions. First, we introduce a new optimization criterion in combination with a block-wise evolutionary Monte Carlo algorithm. Based on this criterion, the 'haplotype likelihood', we develop two kinds of estimators, the maximum haplotype-likelihood (MHL) estimator and its empirical Bayesian (EB) version. Using both real and simulated data sets, we demonstrate that our proposed estimators allow substantial improvements over both the expectation-maximization (EM) algorithm and Clark's procedure in terms of scalability and error rate. Thus, hundreds of ambiguous loci or more, and potentially very large sample sizes, can be processed. Moreover, applying our proposed EB estimator can result in significant reductions of the error rate in the case of unlinked or only weakly linked SNPs.


Introduction
In all phases of disease gene discovery, it is of paramount importance to correctly determine the haplotypes, the specific combinations of given sequence variants for each of the two chromosomes of an individual [1][2][3]. These haplotypes may relate to ancestral chromosomal segments and/or defined candidate genes [1]. The currently used standard experimental approaches to the analysis of genetic variation, sequencing and genotyping, rely on the analysis of diploid (mixed) genomic DNA. Applying these technologies does not allow the direct determination of the underlying molecular haplotypes [4]. This means that we usually have to depend on unphased genotypes. In principle, phase information can be obtained by genotyping the family members of each individual; in many cases, however, their genotypes simply are not available or not sufficient to resolve haplotype ambiguity completely [5]. Existing methods for molecular haplotyping such as allele-specific long-range PCR [6,7], carbon nanotube probing [8], or the construction of somatic cell hybrids [9] are to date too cost and labour intensive and not amenable to automation. Recently developed, novel and promising technologies such as polony haplotyping [10], single-copy DNA genotyping in conjunction with the MassARRAY system [11], or (fosmid/cosmid) clone-based systematic haplotyping [12] have not yet become available for efficient routine use.
Thus, in the past decade, several in silico methods have been developed that allow the prediction of haplotypes from unphased genotype data. Two of them have become particularly popular. One is Clark's approach [6,13], the other the EM approach based on the genotype likelihood [14][15][16][17][18]. The first approach is motivated by the fact that for any haplotype that is common enough such that homozygotes can be found in the sample, the sample is expected to have several heterozygotes bearing one copy of that haplotype [19]. This approach focuses on the prediction of haplotype pairs instead of estimating haplotype frequencies. However, it does not take full advantage of the count information (i.e. the information from the repeated observations) of some genotypes. This limitation can strongly affect the conclusions drawn by the subsequent haplotype-based statistical analyses (see Results). In contrast, the second approach has a clear statistical background and is not restricted to biallelic loci. EM is applied first to estimate haplotype frequencies and then to assign to each individual genotype the haplotype pair with the highest frequency among all possible pairs consistent with the genotype under consideration. Unfortunately, application of EM is limited by the size of the problem it can tackle [20].
In an effort to contend with these limitations, Stephens et al. [20] and Niu et al. [17] developed new statistical methods, called Phase and PL, respectively. Both Phase and PL require the assumption that the SNPs are linked and consider the unknown haplotypes as unobserved random quantities, which avoids the requirement of storing estimated haplotype frequencies for every possible haplotype in the sample, a limitation of the EM approach. This allows both methods to handle many more ambiguous loci than EM. Moreover, the error-rate improvement of Phase over Clark's method and EM results from detailed model assumptions, specifically, the assumption of a coalescent model that has been made to infer the conditional distribution of the unknown haplotypes underlying the genotype data. In contrast, PL attempts to improve on the error-rate performance of Clark's method and EM by using partition-ligation and prior-annealing techniques in order to regularise the conditional distribution of the unknown haplotypes, rather than imposing a specific model on this conditional distribution. This explains why the performance of Phase is better than PL under the coalescent assumption and worse than PL when this assumption is invalid.
A major motivation of our work has been to develop an empirical Bayesian procedure, termed EB, in order to overcome the limitation of PL, i.e. the deterioration of its performance when the SNPs under consideration become weakly linked or unlinked. The proposed EB procedure represents an extension of, and improvement over, PL in the sense that the same Bayes model and the same haplotype likelihood framework are used; however, the requirement of tightly linked loci has been removed by introducing empirical pseudo-counts in the prior. It has been demonstrated by Niu et al. [17] that PL may provide better results than Phase when the coalescence assumption is invalid; this conclusion is therefore expected to hold for our EB as well. The improvement described above is achieved through the use of a new optimization criterion, an empirical posterior, as well as a block-wise evolutionary Markov chain Monte Carlo algorithm. The empirical posterior is based on the so-called profile haplotype likelihood, which usually assigns a relatively higher likelihood to a haplotype that occurs in both homozygotes and heterozygotes. Relying on this profile likelihood, we present a new haplotype estimation procedure as an alternative to EM, called the maximum haplotype likelihood (MHL) procedure. We show that both MHL and EB can be derived by optimization of the profile haplotype likelihood or of a posterior of the haplotypes only. This property also helps to overcome the limitation of the EM approach, that is, the need to store the estimated haplotype frequencies for every possible haplotype in the sample. In contrast to PL, we address the issue of how to specify the prior by using the Excoffier-Slatkin assumption [14]: we use the haplotype frequencies in the space of haplotypes that are compatible with the genotypes in order to specify the pseudo-counts in the prior.
In order to evaluate the performance of our proposed solutions, we first carried out several simulation studies considering two scenarios, one characterized by tightly linked SNPs and the other by weakly linked or unlinked SNPs. While the first scenario has been well motivated in the literature [17,20], the second one has attracted attention only recently [21]. The rationale of the second scenario may be outlined as follows: SNPs may, in practice, not reside on the same chromosome and therefore be physically unlinked. However, some of them may still interact with each other and thereby confer genetic risk to complex disease [21]. This implies that two physically unlinked haplotypes may have the same parental identity if they are coupled with each other as the result of potential interactions of disease genes in the gene network. It has been demonstrated by Zhang et al. [21] that phasing two unphased genotype blocks together allows prediction of these interactions. Moreover, considering the potentially interacting SNPs as one block, that is, in practice, combining all SNPs and phasing them as a whole, was shown to improve the accuracy of haplotype prediction.
In order to extend the assessment of the performance of our proposed procedures, we applied them consecutively to two real data sets, the human opioid receptor gene (OPRM1) and the angiotensin converting enzyme (ACE) data sets [2,22]. Regarding the first one, our motivation was to examine whether the risk pattern extracted by Hoehe et al. [2] could be derived by our proposed procedures as compared to the other methods outlined, Clark, EM, Phase and PL, respectively. The ACE data set, on the other hand, was used because it allowed a much more detailed investigation of the specific mechanisms involved in the performance of the different methods under comparison.
Upon application to both simulated and real data sets, we conclude that our EB procedure not only can tackle large problems, but also performs significantly better than the existing methods in terms of error rate in many instances. This improvement in performance might be due to the use of the efficient evolutionary Monte Carlo algorithm and the adoption of the above described empirical Bayesian method to regularize the model. Note that haplotyping is an ill-posed problem, in which the dimension of the parameter space is usually much higher than the sample size. Based on our simulations we conclude further that, under a coalescent model with tightly linked SNPs, our method performs slightly worse than Phase and very similarly on average to PL. However, in the case where the SNPs are unlinked or only weakly linked, our EB method clearly yields better results than PL.

Haplotype Likelihood
Consider a chromosomal region of u_0 loci specified by an allelic vector (r_1, r_2, ..., r_{u_0})^T in the reference genomic sequence. Let G = (G_1, ..., G_n) denote the observed genotypes for the n individuals, where G_i = (g_{i1}, ..., g_{iu_0})^T, (·)^T is the transpose, and g_{ij} is the genotype of individual i at locus j. Let g_{ij} take the value 0, 1, or 2 according to whether the genotype at locus j is homozygous and identical to the reference sequence, homozygous but different from the reference sequence, or heterozygous. A genotype is called ambiguous if it has at least 2 heterozygous sites. Let H_i = (H_{i1}, H_{i2}) denote the unobserved haplotype pair of G_i, and Θ_i the set of all possible haplotype pairs compatible with G_i (called the candidate haplotype set for G_i). Set H = (H_1, ..., H_n), one possible way to decompose G into haplotypes. Let p = (p_1, ..., p_{m_0}) denote the population frequencies of all possible haplotypes compatible with G, where m_0 is the number of these candidate haplotypes. Then, given G, under the assumption of Hardy-Weinberg equilibrium, we derive the 'haplotype likelihood'

L(G | p, H) = ∏_{i=1}^{n} P(H_i | p) = 2^c ∏_{k=1}^{m_0} p_k^{s_k(H)},   (1)

where s_k(H) is the number of times the k-th haplotype occurs in H, c is the number of genotypes whose two haplotypes differ, and P(H_i | p) equals 2 p_j p_k for a heterozygous pair (h_j, h_k) and p_j^2 for a homozygous pair. Note that the constant 2^c does not affect the estimation of (p, H). This point can be shown in the following example:

Example 1. Suppose that G = {(0,0,0,1)^T, (1,0,0,1)^T, (2,2,0,1)^T, (1,1,2,2)^T} and that these genotypes have single count (i.e., multiplicity = 1). Then, the candidate haplotype sets of these genotypes involve the seven haplotypes h_1 = (0,0,0,1)^T, h_2 = (1,0,0,1)^T, h_3 = (1,1,0,1)^T, h_4 = (0,1,0,1)^T, h_5 = (1,1,0,0)^T, h_6 = (1,1,1,1)^T and h_7 = (1,1,1,0)^T, with Θ_1 = {(h_1, h_1)}, Θ_2 = {(h_2, h_2)}, Θ_3 = {(h_1, h_3), (h_4, h_2)} and Θ_4 = {(h_5, h_6), (h_3, h_7)}. Denote their unknown population frequencies by p_1, p_2, p_3, p_4, p_5, p_6, p_7, respectively. Then, there are four ways (so-called assignments) to decompose G into haplotypes, namely

H_1 = {(h_1,h_1), (h_2,h_2), (h_1,h_3), (h_5,h_6)},
H_2 = {(h_1,h_1), (h_2,h_2), (h_1,h_3), (h_3,h_7)},
H_3 = {(h_1,h_1), (h_2,h_2), (h_4,h_2), (h_5,h_6)},
H_4 = {(h_1,h_1), (h_2,h_2), (h_4,h_2), (h_3,h_7)},

and their likelihoods can be expressed as follows:

L(G | p, H_1) = 4 p_1^3 p_2^2 p_3 p_5 p_6,
L(G | p, H_2) = 4 p_1^3 p_2^2 p_3^2 p_7,
L(G | p, H_3) = 4 p_1^2 p_2^3 p_4 p_5 p_6,
L(G | p, H_4) = 4 p_1^2 p_2^3 p_3 p_4 p_7.

Here, we are mainly interested in the selection of the best among all possible ways H_k, 1 ≤ k ≤ 4, to decompose G. The selection leads to an estimator of p = (p_1, ..., p_7). Note that the constant 4 = 2^2 (two ambiguous genotypes) does not affect the maximization of these haplotype likelihoods with respect to (p, H).
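The enumeration of candidate haplotype sets such as Θ_3 and Θ_4 above can be automated. The following Python sketch is our illustration, not part of the original method; the function name candidate_pairs is ours.

```python
from itertools import product

def candidate_pairs(genotype):
    """Enumerate all haplotype pairs compatible with a genotype.

    genotype: sequence over {0, 1, 2}, where 0/1 denote the two
    homozygous states and 2 marks a heterozygous site. The phase of
    the first heterozygous site is fixed, so each unordered pair is
    listed exactly once; a genotype with s heterozygous sites thus
    has 2**(s-1) candidate pairs (one pair if s <= 1).
    """
    het = [j for j, g in enumerate(genotype) if g == 2]
    pairs = []
    for bits in product([0, 1], repeat=max(len(het) - 1, 0)):
        h1, h2 = list(genotype), list(genotype)
        phases = (0,) + bits if het else ()
        for j, b in zip(het, phases):
            h1[j], h2[j] = b, 1 - b
        pairs.append((tuple(h1), tuple(h2)))
    return pairs

# Theta_3 of Example 1: two heterozygous sites, hence two pairs.
print(candidate_pairs((2, 2, 0, 1)))
# [((0, 0, 0, 1), (1, 1, 0, 1)), ((0, 1, 0, 1), (1, 0, 0, 1))]
```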

EM Approach
Note that the genotype likelihood in [14,23] can be viewed as the marginal likelihood of p in L(G | p, H), namely

L(G | p) = Σ_H L(G | p, H) = ∏_{i=1}^{n} Σ_{(h_j, h_k) ∈ Θ_i} P((h_j, h_k) | p),

where P((h_j, h_k) | p) equals 2 p_j p_k if h_j ≠ h_k and p_j^2 otherwise. Let log(L(G | p))/(2n) denote the log-genotype-likelihood. Within the EM approach, we attempt to find the p that maximizes the above marginal likelihood. Haplotype pairs can then be reconstructed by choosing the most probable ones, given the genotype data and the estimated population haplotype frequencies p.
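As an illustration of the EM iteration just described, here is a minimal sketch (a textbook EM for haplotype frequencies, not the authors' implementation), reusing candidate_pairs from the previous sketch; heterozygous pairs receive the Hardy-Weinberg factor 2.

```python
from collections import defaultdict

def em_haplotype_freqs(genotypes, n_iter=100):
    """Estimate haplotype frequencies p by EM from unphased genotypes.

    genotypes: list of genotype tuples over {0, 1, 2}.
    Returns a dict mapping each candidate haplotype to its estimated
    population frequency.
    """
    cand = {g: candidate_pairs(g) for g in set(genotypes)}
    haps = {h for prs in cand.values() for pr in prs for h in pr}
    p = {h: 1.0 / len(haps) for h in haps}        # uniform start
    for _ in range(n_iter):
        counts = defaultdict(float)
        for g in genotypes:
            # E-step: posterior weight of each compatible pair given p.
            w = [(1 if h1 == h2 else 2) * p[h1] * p[h2]
                 for h1, h2 in cand[g]]
            total = sum(w)
            for (h1, h2), wk in zip(cand[g], w):
                counts[h1] += wk / total
                counts[h2] += wk / total
        # M-step: expected haplotype counts, normalised by 2n.
        p = {h: counts[h] / (2.0 * len(genotypes)) for h in haps}
    return p
```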
Clark's Approach

Clark [6] proposed an algorithm for haplotype assignment, which consists of two steps: First, the initial set of haplotypes is formed from the 'self-resolved' genotypes (i.e., those genotypes with at most one heterozygous position). Second, a known haplotype is chosen in order to see whether any of the unresolved genotypes is the composite of that known haplotype and a complementary haplotype; if this is the case, the known haplotype set is updated by adding the complementary haplotype. The second step is repeated until all unresolved genotypes are resolved or the remaining genotypes cannot be resolved any further. Obviously, the solution depends on the order in which the known haplotypes are chosen in the second step. The larger the number of resolved ambiguous genotypes, i.e. the resolution, the better the solution [13].
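The following sketch implements one order-dependent pass of Clark's procedure as described above (our simplification; the function name clark_resolve is ours, and duplicate genotypes are collapsed, mirroring the fact that count information is ignored). It reuses candidate_pairs from the first sketch.

```python
def clark_resolve(genotypes):
    """Clark's parsimony algorithm (order-dependent sketch).

    genotypes: iterable of genotype tuples over {0, 1, 2}.
    Returns (assignments, unresolved), where assignments maps each
    resolved genotype to a haplotype pair.
    """
    genotypes = list(dict.fromkeys(genotypes))   # counts are ignored
    known, assignments = set(), {}
    # Step 1: self-resolved genotypes (at most one heterozygous site).
    for g in genotypes:
        if sum(x == 2 for x in g) <= 1:
            h1, h2 = candidate_pairs(g)[0]
            known.update([h1, h2])
            assignments[g] = (h1, h2)
    # Step 2: resolve ambiguous genotypes via known haplotypes,
    # adding each complementary haplotype to the known set.
    changed = True
    while changed:
        changed = False
        for g in genotypes:
            if g in assignments:
                continue
            for h1, h2 in candidate_pairs(g):
                if h1 in known or h2 in known:
                    known.update([h1, h2])
                    assignments[g] = (h1, h2)
                    changed = True
                    break
    unresolved = [g for g in genotypes if g not in assignments]
    return assignments, unresolved
```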
Example 2. Consider the genotypes of Example 1 again, but now allow each genotype to occur with a count; in particular, let n_2 denote the count of (1,0,0,1)^T. Using Clark's approach, we obtain a unique optimal solution, in which h_2 has no heterozygous descendant. We first take (h_1, h_2) as the initial known haplotype set, from which we choose h_1 in order to resolve (2,2,0,1)^T. The known haplotype set is then updated by adding the complementary haplotype h_3. At last, (1,1,2,2)^T is resolved by h_3. Note that the count information is completely ignored in this algorithm. This is in contradiction to the rationale of the approach as outlined in the 'Introduction'. For example, if we set n_2 = 10, the haplotype h_2 already has a much higher frequency than the other known haplotype in the initial haplotype set, h_1. According to coalescent theory in population genetics, the expected rank of a haplotype by age is the same as the rank by its frequency, and older haplotypes will tend to have more mutational connections than younger ones [24]. This implies that (2,2,0,1)^T is more probably resolved by h_2 than by h_1 according to the rationale of Clark's method. However, choosing h_2, we fail to resolve all genotypes according to Clark's concept. This motivated us to develop a new procedure in the next section, taking the frequency information into account.

Maximum Haplotype Likelihood Approach
In order to make these improvements, we first introduce the maximum likelihood (ML) estimator of (p, H), denoted by (p̂, Ĥ), obtained by maximizing the haplotype likelihood L(G | p, H) in (1). For each particular assignment H, let s_1, ..., s_{k_0} be the numbers of times that the k_0 distinct haplotypes of H appear in H, and let p_k, 1 ≤ k ≤ k_0, be the unknown population frequencies of these haplotypes. It is obvious that Σ_k s_k = 2n. Maximizing L(G | p, H) with respect to these population frequencies under the constraints Σ_k p_k = 1, p_k ≥ 0, 1 ≤ k ≤ m_0, we have p̂_k(H) = s_k/(2n). Substituting these estimates into the log-likelihood and dropping constants yields the profile log-haplotype-likelihood

l(G | H) = Σ_k (s_k/(2n)) log(s_k/(2n)).   (2)

Then, it is directly shown that Ĥ attains the maximum of the above profile log-haplotype-likelihood. Note that -l(G | H) is the entropy of p̂. Furthermore, for a haplotype Z, the ML estimator of its population frequency is the frequency of Z in Ĥ. We use Example 2 to show the advantage of MHL over Clark's method, namely that MHL chooses the haplotypes with higher frequencies when resolving the unresolved genotypes. Example 2 serves moreover to elaborate the mathematical details of the derivation of the above profile log-likelihood in a reader-friendly manner. Any additional mathematical details will be made available on request to J.Zhang@kent.ac.uk.

Example 2 (continued). To maximize the first likelihood L(G | p, H_1) = 4 p_1^3 p_2^2 p_3 p_5 p_6 with respect to p_k, 1 ≤ k ≤ 7, we consider the following Lagrangian:

l(p, λ) = log(L(G | p, H_1)) + λ (Σ_{k=1}^{7} p_k - 1),

where λ is the Lagrange coefficient. Setting all partial derivatives of l(p, λ) to zero and solving the resulting simultaneous equations, we obtain the estimate p̂(H_1) = (3/8, 2/8, 1/8, 0, 1/8, 1/8, 0). Substituting this estimate into log(L(G | p, H_1)) and dropping constants, we obtain the profile log-haplotype likelihood at H_1:

l(G | H_1) = (3/8) log(3/8) + (2/8) log(2/8) + (3/8) log(1/8).

Proceeding in the same way for H_2, H_3 and H_4, the maximum is attained at H_2 = {(h_1,h_1), (h_2,h_2), (h_1,h_3), (h_3,h_7)}, which also attains the maximum resolution. However, if n_2 = 10, then there are two MHL assignments, H_3 and H_4, both of which resolve (2,2,0,1)^T by the frequent haplotype h_2. Moreover, the log-genotype-likelihood attains the same value of -0.7083934 at these two solutions. However, neither attains the maximum resolution. This is not surprising, because the haplotype h_2 has a greater frequency than h_1. Note that EM provides the same solution as MHL in this toy example. However, MHL does produce a better result than EM for the OPRM1 data, as shown in the section 'Results'.
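The profile log-haplotype-likelihood (2) is straightforward to compute. The sketch below (ours; profile_loglik is our name) evaluates it for the four decompositions of Example 1 and reproduces the behaviour discussed above: with single counts the criterion selects H_2, while repeating the pair (h_2, h_2) ten times makes H_3 and H_4 tie as MHL assignments.

```python
from collections import Counter
from math import log

def profile_loglik(assignment):
    """l(G|H) = sum_k (s_k/2n) log(s_k/2n) for an assignment.

    assignment: list of (h1, h2) pairs, one per genotype *copy*,
    so a genotype with count n contributes n pairs.
    """
    s = Counter(h for pair in assignment for h in pair)
    total = sum(s.values())                       # equals 2n
    return sum(c / total * log(c / total) for c in s.values())

# Haplotypes h1..h7 of Example 1:
h1, h2, h3, h4 = (0,0,0,1), (1,0,0,1), (1,1,0,1), (0,1,0,1)
h5, h6, h7 = (1,1,0,0), (1,1,1,1), (1,1,1,0)
H = {
    "H1": [(h1,h1), (h2,h2), (h1,h3), (h5,h6)],
    "H2": [(h1,h1), (h2,h2), (h1,h3), (h3,h7)],
    "H3": [(h1,h1), (h2,h2), (h4,h2), (h5,h6)],
    "H4": [(h1,h1), (h2,h2), (h4,h2), (h3,h7)],
}
print(max(H, key=lambda k: profile_loglik(H[k])))      # -> H2
# With count n2 = 10 for (1,0,0,1)^T, H3 and H4 tie:
H10 = {k: [v[0]] + [(h2, h2)] * 10 + v[2:] for k, v in H.items()}
for k in H:
    print(k, round(profile_loglik(H10[k]), 4))
```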

Empirical Bayesian Approach
As pointed out in the 'Introduction', the dimension of the space of candidate haplotypes can be much larger than the sample size. This can cause problems such as bias due to over-fitting. Even if the space dimension is lower than the sample size, the MHL estimator may not be unique, and the surface of the haplotype likelihood can then be flat around the assignments of some genotypes (for instance, genotype (1,1,2,2)^T in Example 2). We call these unidentified genotypes orphans. In this situation, we adopt the following (empirical) Bayesian approach to the problem: we place the Dirichlet distribution D(α_1, ..., α_{m_0}) as a prior on p and take the mode Ĥ_B of the resulting marginal posterior as the estimated haplotype assignment to G, which can be written, up to a constant not depending on H, as

P(H | G) ∝ 2^c ∏_{k=1}^{m_0} Γ(s_k(H) + α_k) / Γ(α_k),

where Γ(·) denotes the gamma function. The details of deriving the above marginal posterior will be made available on personal request to J.Zhang@kent.ac.uk.
To specify the pseudo-counts α_k, k = 1, ..., m_0, we first follow Excoffier and Slatkin [14], assuming that all genotypes are equally important and that for each genotype all possible haplotype pairs are equally likely. This assumption is based on the fact that the individuals are identically and independently distributed. Furthermore, we assume that the two haplotypes of each haplotype pair are equally important. Then, we take the total weights for the different haplotypes in the space of all haplotypes compatible with G as the pseudo-counts. We call this assumption the structural information in the genotypes, which leads to the following scheme: We first assign the same weight 2d to the observed different genotypes, where d is a constant with a default value of 1. For each of these genotypes, we distribute the weight 2d equally among all its candidate haplotypes. Then, for k = 1, ..., m_0, let α_k be the sum of all the weights placed on the k-th haplotype across the n candidate haplotype sets Θ_i, i = 1, ..., n. The resulting Bayesian estimator is called EB.
We shall now use the toy Example 2 again in order to lay out the mechanism behind EB step by step.

Example 2 (continued). To apply EB, we set d = 1 in the above scheme and adopt the notation of Example 1.
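A minimal sketch (ours; pseudo_counts is our name) of the pseudo-count scheme just described, applied to the Example 1 genotypes with d = 1: the two unambiguous genotypes place weight 2 on h_1 and h_2, and each ambiguous genotype spreads weight 2 over its four candidate haplotypes, giving α = (2.5, 2.5, 1, 0.5, 0.5, 0.5, 0.5) for h_1, ..., h_7.

```python
from collections import defaultdict

def pseudo_counts(genotypes, d=1.0):
    """Empirical pseudo-counts alpha_k from the structural information.

    Each distinct observed genotype carries total weight 2d, shared
    equally among the haplotypes in its candidate set.
    """
    alpha = defaultdict(float)
    for g in set(genotypes):                 # distinct genotypes only
        haps = {h for pair in candidate_pairs(g) for h in pair}
        for h in haps:
            alpha[h] += 2.0 * d / len(haps)
    return dict(alpha)

G = [(0,0,0,1), (1,0,0,1), (2,2,0,1), (1,1,2,2)]
for h, a in sorted(pseudo_counts(G).items()):
    print(h, a)
```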

Evolutionary Tree
As pointed out before, the analysis of haplotypes can be useful in both population and disease studies. Here we focus on the second issue. In order to test for association with complex disease, we estimate the frequencies of given haplotypes in cases and controls and examine whether significant differences between cases and controls exist. In practice, the number of different haplotypes is often unfeasibly large, resulting in a sparse contingency table, so that ordinary tests of association like the χ² test do not have sufficient power to detect an association with any single haplotype. In order to cope with this problem, several methods have been suggested, which rely on the classification of haplotypes into evolutionarily related ones [25] or functionally related (ideally functionally equivalent) ones, based on sequence-structure-function similarity [2]. In the present work, we build such a tree via a modified UPGMA procedure, where two modifications have been made. First, we use an information distance instead of the traditional Hamming distance. For this purpose, we calculate the pairwise percentage identities for the m_0 different haplotypes defined in (1), say q_ij, 1 ≤ i, j ≤ m_0. Then, the information distance between the i-th and j-th haplotypes is defined as the Kullback-Leibler distance between the probability vectors (q_i1, ..., q_im_0) and (q_j1, ..., q_jm_0), namely

d(i, j) = Σ_{k=1}^{m_0} q_ik log(q_ik / q_jk).

Evolutionary Monte Carlo Algorithm

To compute our estimators, we encode each possible assignment H by a binary vector z over the heterozygous sites of the ambiguous genotypes; there is then a one-to-one correspondence between z and the assignment H in (2). Moreover, for each z, we identify the corresponding haplotypes and calculate their counts. Then, the log-haplotype likelihood in (2) can be written in the form l(G | z), as a function of z. Now, the optimal assignment can be obtained by maximizing l(G | z). Although the number of operational variables is significantly reduced compared to the original optimization problem, the new problem remains a very hard optimization problem in a high-dimensional space. In particular, the new objective function l(G | z), with many subtle local maxima, is not a convex function. Here, we apply a recently developed MCMC algorithm, called the evolutionary Monte Carlo algorithm [26], in order to solve the problem.
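To make the correspondence between z and H concrete, the following sketch (our encoding; the paper's may differ in detail, and the names assignment_from_z and l_G_z are ours) uses one bit per heterozygous site beyond the first in each ambiguous genotype and evaluates l(G|z) via profile_loglik from the earlier sketch.

```python
def assignment_from_z(genotypes, z):
    """Decode a binary vector z into a haplotype assignment H."""
    assignment, pos = [], 0
    for g in genotypes:
        het = [j for j, x in enumerate(g) if x == 2]
        nbits = max(len(het) - 1, 0)
        bits = tuple(z[pos:pos + nbits])
        pos += nbits
        h1, h2 = list(g), list(g)
        # Fix the first heterozygous site to break the pair symmetry.
        for j, b in zip(het, (0,) + bits):
            h1[j], h2[j] = b, 1 - b
        assignment.append((tuple(h1), tuple(h2)))
    return assignment

def l_G_z(genotypes, z):
    """Objective l(G|z): profile log-likelihood of the decoded H."""
    return profile_loglik(assignment_from_z(genotypes, z))

# Example 1: two ambiguous genotypes with two heterozygous sites each,
# hence z has two components; z = (0, 0) corresponds to assignment H1.
G = [(0,0,0,1), (1,0,0,1), (2,2,0,1), (1,1,2,2)]
print(l_G_z(G, (0, 0)))
```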
The evolutionary Monte Carlo algorithm works by simulating a population of Markov chains in parallel, where a different temperature is attached to each chain. The population is supplied with mutation, crossover and exchange operators, and the updates are accepted or rejected according to the Metropolis rule. More specifically, given the current population Z = {z_1, ..., z_N} and a temperature ladder t = {t_1, ..., t_N}, we construct a Boltzmann distribution for the population Z by

f(Z) ∝ ∏_{i=1}^{N} exp(l(G | z_i) / t_i).

We sample the next population by taking the following two steps: (1) apply the mutation or the crossover operator to Z with probability p_m and 1 - p_m, respectively; (2) exchange z_i with z_j for N pairs (i, j), with i sampled uniformly on {1, ..., N} and j = i ± 1. Note that in the mutation operator, a new vector y is generated by randomly selecting a member, say z_k, from the population Z and randomly mutating some components of z_k from 0 to 1 or from 1 to 0. Z is replaced by the proposed population Y = {z_1, ..., y, ..., z_N} with probability min{1, r_m}, where r_m is the Metropolis-Hastings ratio, r_m = f(Y)/f(Z). In the crossover operator, a new pair of vectors, say (y_i, y_j), are two 'offspring' of a pair (z_i, z_j) (i ≠ j) selected from Z according to a roulette-wheel procedure; for details see [26]. Z is updated by the proposed population Y = {z_1, ..., y_i, ..., y_j, ..., z_N} with probability min{1, r_c}, where r_c is the Metropolis-Hastings ratio

r_c = [f(Y) T(Z | Y)] / [f(Z) T(Y | Z)],

with T(· | ·) the transition probability of the crossover proposal. In the exchange operator, we exchange two randomly selected z_i and z_j (without changing the order of t_i and t_j) with probability min{1, r_e}, where

r_e = exp{(l(G | z_i) - l(G | z_j)) (1/t_j - 1/t_i)}.

In this paper, the temperature ladder is set to decrease from the highest temperature t_h to the lowest temperature t_l. As in [24], we choose t_h and t_l such that Var(l(G | z_i))(t_h - t_l)^2 = O(1), or simply by checking whether the overall acceptance rates of the mutation, crossover and exchange operations are around 0.50.
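The following skeleton (ours; emc_search is our name) illustrates the core of the sampler under simplifying assumptions: only single-bit mutation and neighbour exchange are implemented, the crossover operator and the tuning parameters p_0, p_1, p_2 are omitted, and the temperature ladder is spaced linearly between t_h and t_l (the paper fixes only the endpoints). It reuses l_G_z from the previous sketch.

```python
import math
import random

def emc_search(genotypes, n_chains=20, t_h=0.5, t_l=0.01, n_iter=5000):
    """Simplified evolutionary Monte Carlo maximisation of l(G|z)."""
    nbits = sum(max(sum(x == 2 for x in g) - 1, 0) for g in genotypes)
    temps = [t_h + i * (t_l - t_h) / (n_chains - 1)
             for i in range(n_chains)]
    chains = [[random.randint(0, 1) for _ in range(nbits)]
              for _ in range(n_chains)]
    ll = [l_G_z(genotypes, z) for z in chains]
    best_ll = max(ll)
    best_z = list(chains[ll.index(best_ll)])
    for _ in range(n_iter):
        # Mutation: flip one random bit of one random chain (Metropolis).
        k = random.randrange(n_chains)
        y = list(chains[k])
        if nbits:
            y[random.randrange(nbits)] ^= 1
        ll_y = l_G_z(genotypes, y)
        delta = (ll_y - ll[k]) / temps[k]
        if delta >= 0 or random.random() < math.exp(delta):
            chains[k], ll[k] = y, ll_y
            if ll_y > best_ll:
                best_ll, best_z = ll_y, list(y)
        # Exchange: swap two neighbouring chains on the ladder.
        i = random.randrange(n_chains - 1)
        j = i + 1
        r = (ll[i] - ll[j]) * (1.0 / temps[j] - 1.0 / temps[i])
        if r >= 0 or random.random() < math.exp(r):
            chains[i], chains[j] = chains[j], chains[i]
            ll[i], ll[j] = ll[j], ll[i]
    return best_z, best_ll
```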
Our experience shows that the EMC algorithm is often efficient when the number of ambiguous loci in each genotype is less than 20. However, when a large number of ambiguous loci are involved, the idea of segmentation [17] is very useful in speeding up the convergence of our EMC algorithm. Like Niu et al. [17], we first divide each genotype into several blocks, say m_p blocks. We run the EMC algorithm for each block to find a list of partial haplotype pairs with the first m_0 highest likelihood values. We then slightly modify our EMC algorithm and apply it to the restricted haplotype space generated by all possible combinations of these partial haplotypes. More specifically, in this space we need to consider the n × m_p table (S_ij), where S_ij (a subset of the integers) represents the set of different partial haplotypes for the j-th segment of the i-th genotype. The vectors m_i = (m_i1, ..., m_im_p) with m_ij ∈ S_ij, 1 ≤ i ≤ n, give an assignment of haplotype pairs to the genotypes. We view a segment with |S_ij| ≥ 2 as an ambiguous segment. Assume that there are c_i ambiguous segments in the i-th genotype. Then, similarly to the case of the unrestricted haplotype space, L(G | H) can be rewritten as a function of z* = (z*_k), which has Σ_i c_i components, one for each ambiguous segment, with z*_k belonging to some integer subset S*_k. For the new objective function, we only need to change the mutation operator in the EMC algorithm by randomly mutating some components of z*, say z*_k, from the current value to z*_k + 1 or z*_k - 1. We replace z*_k + 1 by the left boundary of S*_k when it exceeds the right boundary of S*_k. Similarly, when z*_k - 1 falls below the left boundary of S*_k, we replace it by the right boundary. This strategy can be applied repeatedly until the size of the restricted haplotype space becomes moderate. The resulting algorithm is called the block-wise EMC algorithm.

Simulation Studies
Comparison of Procedures Operating under the Assumption of Linkage between SNPs

First, the performances of the proposed procedures were examined under the assumption of linkage between SNPs. Following Stephens et al. [20] and using a coalescent-based program kindly provided by R. Hudson, we simulated 100 independent data sets, each containing 100 haplotypes, for (θ, R) = (4,0), (4,4), and (4,40), respectively. Here θ = 4N_e μ and R = 4N_e r, where N_e is the effective population size, μ the total per-generation mutation rate across the region sequenced, and r the length (in Morgans) of the region sequenced. The lengths of these haplotypes varied from 10 to 40 and depended on (θ, R). For each data set, the haplotypes were randomly paired to form 50 genotypes. Then we applied the MHL, EB, and PL methods, respectively, to reconstruct these haplotypes from the resulting genotypes, in which phase information was ignored. In our algorithm, we set the population size N = 20, the highest and lowest temperatures (t_h, t_l) = (0.02, 0.001) in MHL and (t_h, t_l) = (0.0, 0.003) in EB, and the mutation and crossover parameters p_m = 0.2, p_0 = 0.004, p_1 = 0.008 and p_2 = 0.01. We set the round parameter equal to 20 in PL, as suggested by Niu et al. [17]. The performance of each method on each data set was measured by the error rate, i.e. the proportion of ambiguous genotypes whose haplotypes were incorrectly assigned. For (θ, R) = (4,0), (4,4), (4,40), we compared the Clark, EM, EB, MHL, Phase and PL methods in terms of estimated error rates in substantial numbers of simulated data sets (see table 1). In table 1, ME and SE denote the sample mean and standard error of the error rate; for EB, MHL and PL, each value was derived from the 100 simulated data sets, whereas the values for the EM, Clark and Phase methods, kindly provided by Stephens et al. [20], were derived from 90 to 100 simulated data sets generated under the same coalescent model, since the EM method required discarding those data sets for which the total number of possible haplotypes was too large. Although the results of the EM, Clark and Phase methods in table 1 thus rest on slightly different numbers of data sets, the comparison remains informative. Note that for (θ, R) = (4,40) recombination is frequent and the loci are therefore only weakly linked. Table 1 also shows that MHL has almost the same error rate as EM, whereas, in contrast to EM, MHL can handle genotypes with a large number of ambiguous loci in these simulated data sets. Although Clark's method can also cope with genotypes with large numbers of ambiguous loci, it has a considerably higher error rate than our MHL method.

Comparison of Procedures Operating under the Assumption of Weak Linkage or No Linkage between SNPs
We partitioned each genotype in the previously simulated 100 data sets into two blocks of approximately equal length and recombined blocks from different genotypes to form new genotypes G*_i, 1 ≤ i ≤ 30, for each data set. Note that these new genotypes consist of two independent parts, which, however, share the same population parameters (θ, R). We phased {G*_i, i = 1, 2, ..., 30} directly by using EB, MHL and PL. The differences between the resulting error rates of the three methods are represented by the last three plots in figure 1. This figure clearly demonstrates that EB generally performs better than PL. In particular, reductions of the error rate of up to 50% can be obtained in the case of unlinked or only weakly linked SNPs.

OPRM1 Data Set
The OPRM1 data set used included 172 African-American cases and controls; 25 of the identified variants were non-unique [2]. We first applied the MHL approach to this data set, running M = 10^6 iterations, a long run, to obtain sufficient samples for inferring the maximum value of the log-haplotype likelihood l(G | z). 46 different haplotypes, numbered 1 to 46, were inferred. These haplotypes were either consistent with or slightly different from the 52 different haplotypes predicted by Hoehe et al. [2] using the EM method. In a second step, an evolutionary tree was constructed from these haplotypes by use of the modified UPGMA procedure described in 'Methods'.
Based on this tree, the inferred haplotypes could be classified into two groups. Group one was almost the same as the group found associated with substance dependence by Hoehe et al. [2] by means of similarity clustering: it included 13 haplotypes, 11 of which were common, significantly more frequent in the substance-dependent individuals, and shared the same pattern of sequence variants characterized by a unique combination of 5 polymorphic sites [2]. These 11 haplotypes occurred in 22 genotypes. Notably, the above described risk pattern has been experimentally validated by Hoehe et al. [2]. Thus, the phases of this subset of the SNPs are known.
EB yielded altogether 41 different haplotypes; Phase with the settings (burn-in, iteration) = (10^4, 10^4) and (10^4, 10^5) yielded 35 and 36 haplotypes, respectively; PL with two settings, round = 20 and round = 1000, resulted in 39 and 37 haplotypes, respectively. Although these seven sets of haplotypes may differ considerably in size, they all contain a similar subset, which features the above described risk pattern.
It is obvious that both our solutions and that of Hoehe et al. [2] attain maximum resolution. In fact, Clark's method identifies a solution which can be divided into two groups, but neither of these groups carried the above described risk pattern. The relatively weaker performance of Clark's method in this example is due to the fact that it does not take the count information into consideration.
As a local optimisation algorithm, EM can converge to a local maximum of the genotype likelihood. To demonstrate this for the OPRM1 data set, we report the log-haplotype-likelihood and log-genotype-likelihood values of the solutions derived from the five different haplotyping methods in table 2. In table 2, Phase is run with burn-in = 10,000 and 100,000 iterations; MHL is combined with the block-wise EMC; EM is based on the implementation described by Hoehe et al. [2]; PL uses round = 1000; HL denotes the log-haplotype-likelihood and GL the log-genotype-likelihood. These calculations show that among the five procedures, MHL provides the maximum value of the genotype likelihood. This implies that the EM algorithm fails to attain the global maximum of its objective function. In contrast, the evolutionary Markov chain Monte Carlo algorithm used in MHL tends to provide a solution to the problem of maximizing the genotype likelihood that is closer to the unknown globally optimal solution. Therefore, the application of the EM algorithm by Hoehe et al. [2] seems less favourable than the use of MHL.

ACE Data Set
This data set is composed of 11 genotypes, each with 52 non-unique varying sites [22]. The 11 genotypes defined by these bi-allelic loci were resolved into combinations of 13 distinct haplotypes H_1, ..., H_13, which had been determined by application of molecular genetic techniques [22]; for instance, G_1 = (H_1, H_6). Here G_i, 1 ≤ i ≤ 11, denote the genotypes and H_j, 1 ≤ j ≤ 13, the haplotypes. The log-haplotype likelihood of this assignment is -2.323397. This assignment could only partially be recovered by Clark's method because, under this assignment, there were three orphan genotypes, G_8, G_9 and G_11. Moreover, the haplotype likelihood attained the same value when we changed the true assignment by assigning any compatible haplotype pair to these orphans. When we randomly assigned compatible haplotype pairs to these orphans while running Clark's algorithm, the average number of erroneous phase calls was about 2.7.
In order to apply MHL, we divided the genotype table into 5 blocks, the first four of width 10 and the last of width 12. In our EMC algorithm, we empirically set the population size N = 20, the highest and lowest temperatures t_h = 0.5 and t_l = 0.01, and the mutation and crossover parameters p_m = 0.2, p_0 = 0.004, p_1 = 0.008 and p_2 = 0.01. We then performed this two-stage block-wise EMC algorithm 100 times. The average number of erroneous phase calls was 2.4, with a standard error of 0.19. Applying this procedure, G_1, ..., G_7 and G_10 were always correctly resolved; that is, all erroneous phase calls resulted from G_8, G_9 and G_11.
Note that our empirical Bayesian method contains two components: the haplotype likelihood and the empirical prior. We have chosen an ad-hoc approach in order to demonstrate why our EB improves the accuracy of phase prediction. Naturally, we could also apply EB directly; then, however, we would not be able to extract the specific mechanism underlying the described improvement. The ad-hoc approach involved two steps: first, dividing the genotype table into two blocks, one of width 30 and the other of width 22, and performing a two-stage MHL analysis, in which all genotypes were resolved except for the orphan genotypes G_8, G_9 and G_11; second, reconstructing the haplotypes for these orphan genotypes using the empirical prior.
Note that the surface of the haplotype likelihood around the assignments for these orphan genotypes is flat. Therefore, the haplotype likelihood alone fails to gather enough information to resolve these orphan genotypes unambiguously. In this case, the empirical structural information on these genotypes can be useful: unlike for G_8, the empirical pseudo-counts favour one of the candidate haplotype pairs for G_9 and for G_11, so that haplotype pairs can be assigned to these two genotypes. Our Bayesian approach does not help us to resolve G_8, because the pseudo-counts of its two candidate options are the same. In summary, combining the haplotype likelihood with the empirical Bayesian prior resolved all genotypes in the ACE data set with a number of erroneous phase calls less than or equal to 2. Note that the average numbers of erroneous phase calls by PL and Phase were shown by Niu et al. [17] to be 2.09 and 3.96, with standard errors of 0.033 and 0.044, respectively. EM is excluded from the comparison because it is limited with regard to the number of heterozygous loci allowed for each genotype [17]. Thus, the combination of our MHL and Bayesian approaches seems to perform slightly better than the other methods, at least in this example.

Discussion

We have shown that haplotypes can be reconstructed from ambiguous genotypes constituted by many heterozygous sites by application of the block-wise evolutionary Monte Carlo algorithm. On the other hand, for ambiguous genotypes comprising a lower number of heterozygous sites, MHL has almost the same mean error rates as EM under a coalescent model. Furthermore, we have shown that among the existing methods, Phase is the best one when the population conforms to the coalescence assumption, whereas PL and EB perform very similarly under these conditions. EB is, in contrast, more accurate than PL when some SNPs are unlinked or weakly linked. The EMC algorithm can improve on the EM algorithm and Gibbs sampling because it combines the features of three different algorithms, simulated annealing, the genetic algorithm and the Metropolis algorithm, all of which are designed to address the problem of approximate global optimization. Compared with the objective function used in Phase, ours are simple but still powerful. Thus, our approach might also provide a novel basis for haplotype block decomposition using unphased genotype data, where the coalescent, model-based method is not easily extended.
Finally, we would like to point out that our algorithms have also been generalized to handle missing genotype data for some individuals at some loci. Like Niu et al. [17], we have considered three types of missing data: both alleles are missing; only allele 0 is missing; and only allele 1 is missing (details available on request). Although in this paper we have discussed haplotypes consisting of bi-allelic loci only, both our methods and algorithms can easily be modified to cope with other types of loci, such as microsatellite loci. Like Rohde and Fuerst [18], we are also able to include nuclear family information in our procedures.
To summarize, the main advantage of the proposed procedure (EB) is that it allows the analysis of weakly linked or physically unlinked SNPs and, thereby, improved analysis of certain candidate gene data sets or of SNP interactions that may confer genetic risk to complex disease. Extension and solidification of the performance results in this additional respect would, however, require extensive, physically validated data sets from the same sets of individuals for a number of loci resolved at high depth. Such data sets are hardly accessible at present. Thus, the ultimate standard that would allow informed, sensible statistical comparisons of different methods is not yet available. Because the underlying model can greatly affect the conclusions drawn from different in silico haplotyping methods in a simulation study [17], it seems at this point preferable to apply all currently available methods. This will help to avoid the possible bias introduced by any single method when tackling real data, as has been illustrated by the examples of some concrete genotype data sets.