{"title": "An Empirical Analysis of Domain Adaptation Algorithms for Genomic Sequence Analysis", "book": "Advances in Neural Information Processing Systems", "page_first": 1433, "page_last": 1440, "abstract": "We study the problem of domain transfer for a supervised classification task in mRNA splicing. We consider a number of recent domain transfer methods from machine learning, including some that are novel, and evaluate them on genomic sequence data from model organisms of varying evolutionary distance. We find that in cases where the organisms are not closely related, the use of domain adaptation methods can help improve classification performance.", "full_text": "An Empirical Analysis of Domain Adaptation\n\nAlgorithms for Genomic Sequence Analysis\n\nGabriele Schweikert1\nMax Planck Institutes\n\nSpemannstr. 35-39, 72070 T\u00a8ubingen, Germany\nGabriele.Schweikert@tue.mpg.de\n\nChristian Widmer1\n\nFriedrich Miescher Laboratory\n\nSpemannstr. 39, 72070 T\u00a8ubingen, Germany\n\nZBIT, T\u00a8ubingen University\n\nSand 14, 72076 T\u00a8ubingen, Germany\n\nChristian.Widmer@tue.mpg.de\n\nBernhard Sch\u00a8olkopf\n\nMax Planck Institute for biol. Cybernetics\nSpemannstr. 38, 72070 T\u00a8ubingen, Germany\nBernhard.Schoelkopf@tue.mpg.de\n\nGunnar R\u00a8atsch\n\nFriedrich Miescher Laboratory\n\nSpemannstr. 39, 72070 T\u00a8ubingen, Germany\n\nGunnar.Raetsch@tue.mpg.de\n\nAbstract\n\nWe study the problem of domain transfer for a supervised classi\ufb01cation task in\nmRNA splicing. We consider a number of recent domain transfer methods from\nmachine learning, including some that are novel, and evaluate them on genomic\nsequence data from model organisms of varying evolutionary distance. We \ufb01nd\nthat in cases where the organisms are not closely related, the use of domain adap-\ntation methods can help improve classi\ufb01cation performance.\n\n1 Introduction\n\nTen years ago, an eight-year lasting collaborative effort resulted in the \ufb01rst completely sequenced\ngenome of a multi-cellular organism, the free-living nematode Caenorhabditis elegans. Today, a\ndecade after the accomplishment of this landmark, 23 eukaryotic genomes have been completed and\nmore than 400 are underway. The genomic sequence builds the basis for a large body of research on\nunderstanding the biochemical processes in these organisms. Typically, the more closely related the\norganisms are, the more similar the biochemical processes. It is the hope of biological research that\nby analyzing a wide spectrum of model organisms, one can approach an understanding of the full\nbiological complexity. For some organisms, certain biochemical experiments can be performed more\nreadily than for others, facilitating the analysis of particular processes. This understanding can then\nbe transferred to other organisms, for instance by verifying or re\ufb01ning models of the processes\u2014at\na fraction of the original cost. This is but one example of a situation where transfer of knowledge\nacross domains is fruitful.\n\nIn machine learning, the above information transfer is called domain adaptation, where one aims\nto use data or a model of a well-analyzed source domain to obtain or re\ufb01ne a model for a less\nanalyzed target domain. For supervised classi\ufb01cation, this corresponds to the case where there are\nample labeled examples (xi, yi), i = 1, . . . , m for the source domain, but only few such examples\n(xi, yi), i = m + 1, . . . , m + n for the target domain (n \u226a m). The examples are assumed to be\ndrawn independently from the joint probability distributions PS(X, Y ) and PT (X, Y ), respectively.\nThe distributions PS(X, Y ) = PS(Y |X) \u00b7 PS(X) and PT (X, Y ) = PT (Y |X) \u00b7 PT (X) can differ\nin several ways:\n\n(1)\nIn the classical covariate shift case, it is assumed that only the distributions of the input features\nP (X) varies between the two domains: PS(X) 6= PT (X). The conditional, however, remains\n\n1These authors contributed equally.\n\n\finvariant, PS(Y |X) = PT (Y |X). For a given feature vector x the label y is thus independent of\nthe domain from which the example stems. An example thereof would be if a function of some\nbiological material is conserved between two organisms, but its composition has changed (e.g. a\npart of a chromosome has been duplicated).\n\nIn a more dif\ufb01cult scenario the conditionals differ between domains, PS(Y |X) 6= PT (Y |X),\n(2)\nwhile P (X) may or may not vary. This is the more common case in biology. Here, two organisms\nmay have evolved from a common ancestor and a certain biological function may have changed\ndue to evolutionary pressures. The evolutionary distance may be a good indicator for how well\nthe function is conserved. If this distance is small, we have reason to believe that the conditionals\nmay not be completely different, and knowledge of one of them should then provide us with some\ninformation also about the other one.\n\nWhile such knowledge transfer is crucial for biology, and performed by biologists on a daily basis,\nsurprisingly little work has been done to exploit it using machine learning methods on biological\ndatabases. The present paper attempts to \ufb01ll this gap by studying a realistic biological domain\ntransfer problem, taking into account several of the relevant dimensions in a common experimental\nframework:\n\n\u2022 methods \u2014 over the last years, the \ufb01eld of machine learning has seen a strong increase in\ninterest in the domain adaptation problem, re\ufb02ected for instance by a recent NIPS workshop\n\u2022 domain distance \u2014 ranging from close organisms, where simply combining training sets\ndoes the job, to distant organisms where more sophisticated methods can potentially show\ntheir strengths\n\n\u2022 data set sizes \u2014 whether or not it is worth transferring knowledge from a distant organism\n\nis expected to depend on the amount of data available for the target system\n\nWith the above in mind, we selected the problem of mRNA splicing (see Figure A1 in the Appendix\nfor more details) to assay the above dimensions of domain adaptation on a task which is relevant to\nmodern biology. The paper is organized as follows: In Section 2, we will describe the experimen-\ntal design including the datasets, the underlying classi\ufb01cation model, and the model selection and\nevaluation procedure. In Section 3 we will brie\ufb02y review a number of known algorithms for domain\nadaptation, and propose certain variations. In Section 4 we show the results of our comparison with\na brief discussion.\n\n2 Experimental Design\n\n2.1 A Family of Classi\ufb01cation Problems\n\nWe consider the task of identifying so-called acceptor splice sites within a large set of potential\nsplice sites based on a sequence window around a site. The idea is to consider the recognition\nof splice sites in different organisms: In all cases, we used the very well studied model organism\nC. elegans as the source domain. As target organisms we chose two additional nematodes, namely,\nthe close relative C. remanei, which diverged from C. elegans 100 million years ago [10], and the\nmore distantly related P. paci\ufb01cus, a lineage which has diverged from C. elegans more than 200\nmillion years ago [7]. As a third target organism we used D. melanogaster, which is separated\nfrom C. elegans by 990 million years [11]. Finally, we consider the plant A. thaliana, which has\ndiverged from the other organisms more than 1,600 million years ago. It is assumed that a larger\nevolutionary distance will likely also have led to an accumulation of functional differences in the\nmolecular splicing machinery. We therefore expect that the differences of classi\ufb01cation functions\nfor recognizing splice sites in these organisms will increase with increasing evolutionary distance.\n\n2.2 The Classi\ufb01cation Model\n\nIt has been demonstrated that Support Vector Machines (SVMs) [1] are well suited for the task of\nsplice site predictions across a wide range of organisms [9]. In this work, the so-called Weighted\nDegree kernel has been used to measure the similarity between two example sequences x and x\u2032 of\n\n\f\ufb01xed length L by counting co-occurring substrings in both sequences at the same position:\n\nkwd\n\u2113 (x, x\u2032) =\n\n1\nL\n\nL\u2212l+1\n\n\u2113\n\nXl=1\n\nXd=1\n\n\u03b2dI(cid:16)x[l:l+d] = x\u2032\n\n[l:l+d](cid:17)\n\n(1)\n\nwhere x[l:l+d] is the substring of length d of x at position l and \u03b2d = 2 \u2113\u2212d+1\n\u21132+\u2113\nsubstring lengths.\n\nis the weighting of the\n\nIn our previous study we have used sequences of length L = 140 and substrings of length \u2113 = 22\nfor splice site detection [9]. With the four-letter DNA sequence alphabet {A, C, G, T} this leads to\na very high dimensional feature space (> 1013 dimensions). Moreover, to archive the best classi\ufb01-\ncation performance, a large number of training examples is very helpful ([9] used up to 10 million\nexamples).\n\nFor the designed experimental comparison we had to run all algorithms many times for different\ntraining set sizes, organisms and model parameters. We chose the source and target training set as\nlarge as possible\u2013in our case at most 100,000 examples per domain. Moreover, not for all algorithms\nwe had ef\ufb01cient implementations available that can make use of kernels. Hence, in order to perform\nthis study and to obtain comparable results, we had to restrict ourselves to a case were we can\nexplicitly work in the feature space, if necessary (i.e. \u2113 not much larger than two). We chose \u2113 =\n1. Note, that this choice does not limit the generality of this study, as there is no strong reason,\nwhy ef\ufb01cient implementations that employ kernels could not be developed for all methods. The\ndevelopment of large scale methods, however, was not the main focus of this study.\n\nNote that the above choices required an equivalent of about 1500 days of computing time on state-of-\nthe-art CPU cores. We therefore refrained from including more methods, examples or dimensions.\n\n2.3 Splits and Model Selection\n\nIn the \ufb01rst set of experiments we randomly selected a source dataset of 100,000 examples from\nC. elegans, while data sets of sizes 2,500, 6,500, 16,000, 40,000 and 100,000 were selected for\neach target organism. Subsequently we performed a second set of experiments where we combined\nseveral sources. For our comparison we used 25,000 labeled examples from each of four remaining\norganisms to predict on a target organism. We ensured that the positives to negatives ratio is at\n1/100 for all datasets. Two thirds of each target set were used for training, while one third was used\nfor evaluation in the course of hyper-parameter tuning.1 Additionally, test sets of 60,000 examples\nwere set aside for each target organism. All experiments were repeated three times with different\ntraining splits (source and target), except the last one which always used the full data set. Reported\nwill be the average area under the precision-recall-curve (auPRC) and its standard deviation, which\nis considered a sensible measure for imbalanced classi\ufb01cation problems. The data and additional\ninformation will be made available for download on a supplementary website.2\n\n3 Methods for Domain Adaptation\n\nRegarding the distributional view that was presented in Section 1, the problem of splice site pre-\ndiction can be affected by both evils simultaneously, namely PS(X) 6= PT (X) and PS(Y |X) 6=\nPT (Y |X), which is also the most realistic scenario in the case of modeling most biological pro-\ncesses. In this paper, we will therefore drop the classical covariate shift assumption, and allow for\ndifferent predictive functions PS(Y |X) 6= PT (Y |X).\n3.1 Baseline Methods (SVMS and SVMT )\n\nAs baseline methods for the comparison we consider two methods: (a) training on the source data\nonly (SVMS) and (b) training on the target data only (SVMT ). For SVMS we use the source data\nfor training however we tune the hyper-parameter on the available target data. For SVMT we use\nthe available target data for training (67%) and model selection (33%). The resulting functions are\n\nfS(x) = h\u03a6(x), wSi + bS\n\nfT (x) = h\u03a6(x), wTi + bT .\n1Details on the hyper-parameter settings and tuning are shown in Table A2 in the appendix.\n2http://www.fml.mpg.de/raetsch/projects/genomedomainadaptation\n\nand\n\n\f3.2 Convex Combination (SVMS +SVMT )\n\nThe most straightforward idea for domain adaptation is to reuse the two optimal functions fT and\nfS as generated by the base line methods SVMS and SVMT and combine them in a convex manner:\n\nHere, \u03b1 \u2208 [0, 1] is the convex combination parameter that is tuned on the evaluation set (33%) of\nthe target domain. A great bene\ufb01t of this approach is its ef\ufb01ciency.\n\nF (x) = \u03b1fT (x) + (1 \u2212 \u03b1)fS(x).\n\n3.3 Weighted Combination (SVMS+T )\n\nAnother simple idea is to train the method on the union of source and target data. The relative\nimportance of each domain is integrated into the loss term of the SVM and can be adjusted by\nsetting domain-dependent cost parameters CS and CT for the m and n training examples from the\nsource and target domain, respectively:\n\nmin\nw,\u03be\n\ns.t.\n\nm\n\nm+n\n\n\u03bei + CT\n\nXi=1\n\n1\n2kwk2 + CS\nyi(hw, \u03a6(xi)i + b) \u2265 1 \u2212 \u03bei \u2200i \u2208 [1, m + n]\n\u03bei \u2265 0 \u2200i \u2208 [1, m + n]\n\nXi=m+1\n\n\u03bei\n\n(2)\n\nThis method has two model parameters and requires training on the union of the training sets. Since\nthe computation time of most classi\ufb01cation methods increases super-linearly and full model se-\nlection may require to train many parameter combinations, this approach is computationally quite\ndemanding.\n\n3.4 Dual-task Learning (SVMS,T )\n\nOne way of extending the weighted combination approach is a variant of multi-task learning [2].\nThe idea is to solve the source and target classi\ufb01cation problems simultaneously and couple the two\nsolutions via a regularization term. This idea can be realized by the following optimization problem:\n\nmin\n\nwS ,wT ,\u03be\n\ns.t.\n\nm+n\n\n\u03bei\n\nXi=1\n\n1\n2kwS \u2212 wTk2 + C\nyi(hwS, \u03a6(xi)i + b) \u2265 1 \u2212 \u03bei\nyi(hwT , \u03a6(xi)i + b) \u2265 1 \u2212 \u03bei\n\u03bei \u2265 0\n\n(3)\n\n\u2200i \u2208 1, . . . , m\n\u2200i \u2208 m + 1, . . . , m + n\n\u2200i \u2208 1, . . . , m + n\n\nPlease note that now wS and wT are optimized. The above optimization problem can be solved us-\ning a standard QP-solver. In a preliminary experiment we used the optimization package CPLEX to\nsolve this problem, which took too long as the number of variables is relatively large. Hence, we de-\ncided to approximate the soft-margin loss using the logistic loss l(f (x), y) = log(1+exp(\u2212yf (x)))\nand to use a conjugate gradient method3 to minimize the resulting objective function in terms of wS\nand wT .\n\n3.5 Kernel Mean Matching (SVMS\u2192T )\n\nKernel methods map the data into a reproducing kernel Hilbert space (RKHS) by means of a map-\n\nping \u03a6 : X \u2192 H related to a positive de\ufb01nite kernel via k(x, x\u2032) = h\u03a6(x), \u03a6(x\u2032)i. Depending\non the choice of kernel, the space of H may be spanned by a large number of higher order features\nof the data. In such cases, higher order statistics for a set of input points can be computed in H by\nsimply taking the mean (i.e., the \ufb01rst order statistics). In fact, it turns out that for a certain class of\nkernels, the mapping\n\n\u00b5 : (x1, . . . , xn) 7\u2192\n\n1\nn\n\nn\n\nXi=1\n\n\u03a6(xi)\n\n3We used Carl Rasmussen\u2019s minimize function.\n\n\fis injective [5] \u2014 in other words, given knowledge of (only) the mean (the right hand side), we\ncan completely reconstruct the set of points. For a characterization of this class of kernels, see for\ninstance [4]. It is often not necessary to retain all information (indeed, it may be useful to specify\nwhich information we want to retain and which one we want to disregard, see [8]). Generally\nspeaking, the higher dimensional H, the more information is contained in the mean.\nIn [6] it was proposed that one could use this for covariate shift adaptation, moving the mean of a\nsource distribution (over the inputs only) towards the mean of a target distribution by re-weighting\nthe source training points. We have applied this to our problem, but found that a variant of this\napproach performed better. In this variant, we do not re-weight the source points, but rather we\ntranslate each point towards the mean of the target inputs:\nm+n\n\nm\n\n\u02c6\u03a6(xj) = \u03a6(xj) \u2212 \u03b1 1\n\nm\n\n\u03a6(xi) \u2212\n\nXi=1\n\n1\nn\n\nXi=m+1\n\n\u03a6(xi)!\n\n\u2200j = 1, . . . , m.\n\nThis also leads to a modi\ufb01ed source input distribution which is statistically more similar to the\ntarget distribution and which can thus be used to improve performance when training the target task.\nUnlike [6], we do have a certain amount of labels also for the target distribution. We make use of\nthem by performing the shift separately for each class y \u2208 {\u00b11}:\n\n\u02c6\u03a6(xj) = \u03a6(xj) \u2212 \u03b1 1\n\nmy\n\nm\n\nXi=1\n\n[[yi = y]]\u03a6(xi) \u2212\n\n1\nny\n\nm+n\n\nXi=m+1\n\n[[yi = y]]\u03a6(xi)!\n\nfor all j = m + 1, . . . , m + n with yj = y, where my and ny are the number of source and target\nexamples with label y, respectively. The shifted examples can now be used in different ways to\nobtain a \ufb01nal classi\ufb01er. We decided to use the weighted combination with CS = CT for comparison.\n\n3.6 Feature Augmentation (SVMS\u00d7T )\n\nIn [3] a method was proposed that augments the features of source and target examples in a domain-\nspeci\ufb01c way:\n\n\u02c6\u03a6(x) = (\u03a6(x), \u03a6(x), 0)\u22a4 for i = 1, . . . , m\n\u02c6\u03a6(x) = (\u03a6(x), 0, \u03a6(x))\u22a4 for i = m + 1, . . . , m + n.\n\nThe intuition behind this idea is that there exist one set of parameters that models the properties\ncommon to both sets and two additional sets of parameters that model the speci\ufb01cs of the two\ndomains. It can easily be seen that the kernel for the augmented feature space can be computed as:\n\nkAU G(xi, xi) =(cid:26)2h\u03a6(xi), \u03a6(xj)i\nh\u03a6(xi), \u03a6(xj)i\n\nif [[i \u2264 m]] = [[j \u2264 m]]\n\notherwise\n\nThis means that the \u201csimilarity\u201d between two examples is two times as high, if the examples were\ndrawn from the same domain, as if they were drawn from different domains. Instead of the factor 2,\nwe used a hyper-parameter B in the following.\n\n3.7 Combination of Several Sources\n\nMost of the above algorithms can be extended in one way or another to integrate several source do-\nmains. In this work we consider only three possible algorithms: (a) convex combinations of several\ndomains, (b) KMM on several domains and (c) an extension of the dual-task learning approach to\nmulti-task learning. We brie\ufb02y describe these methods below:\n\nMultiple Convex Combinations (M-SVMS +SVMT ) The most general version would be to op-\ntimize all convex combination coef\ufb01cients independently. If done in a grid-search-like manner, it\nbecomes prohibitive for more than say three source domains. In principle, one can optimize these\ncoef\ufb01cients also by solving a linear program. In preliminary experiments we tried both approaches\nand they typically did not lead to better results than the following combination:\n\nF (x) = \u03b1fT (x) + (1 \u2212 \u03b1)\n\n1\n\n|S| XS\u2208S\n\nfS(x),\n\nwhere S is the set of all considered source domains. We therefore only considered this way of\ncombining the predictions.\n\n\fMultiple KMM (M-SVMS\u2192T ) Here, we shift the source examples of each domain independently\ntowards the target examples, but by the same relative distance (\u03b1). Then we train one classi\ufb01er on\nthe shifted source examples as well as the target examples.\n\nMulti-task Learning (M-SVMS,T ) We consider the following version of multi-task learning:\n\nmin\n\n{wD}D\u2208D ,\u03be\n\ns.t.\n\n1\n\n\u03b3D1,D2kwD1 \u2212 wD2k2 +Xi\n\n2 XD1\u2208D XD2\u2208D\nyi(hwDj , \u03a6(xi)i + b) \u2265 1 \u2212 \u03bei\n\u03bei \u2265 0\n\n\u03bei\n\n(4)\n\n(5)\n\nfor all examples (xi, yi) in domain Dj \u2208 D, where D is the set of all considered domains. \u03b3 is a set\nof regularization parameters, which we parametrized by two parameters CS and CT in the following\nway: \u03b3D1,D2 = CS if D1 and D2 are source domains and CT otherwise.\n\n4 Experimental Results\n\nWe considered two different settings for the comparison. For the \ufb01rst experiment we assume that\nthere is one source domain with enough data that should be used to improve the performance in\nthe target domain. In the second setting we analyze whether one can bene\ufb01t from several source\ndomains.\n\n4.1 Single Source Domain\n\nDue to space constraints, we restrict ourselves to presenting a summary of our results with a fo-\ncus on best and worst performing methods. The detailed results are given in Figure A2 in the\nappendix, where we show the median auPRC of the methods SVMT , SVMS, SVMS\u2192T , SVMS+T ,\nSVMS+SVMT , SVMS\u00d7T and SVMS,T for the considered tasks. The summary is given in Fig-\nure 1, where we illustrate which method performed best (green), similarly well (within a con\ufb01dence\ninterval of \u03c3/\u221an) as the best (light green), considerably worse than the best (yellow), not signi\ufb01-\ncantly better than the worst (light red) or worst (red). From these results we can make the following\nobservations:\n\n1. Independent of the task, if there is very little target data available, the training on source\ndata performs much better than training on the target data. Conversely, if there is much\ntarget data available then training on it easily outperforms training the source data.\n\n2. For a larger evolutionary distance of the target organisms to source organism C. elegans, a\nrelatively small number of target training examples for the SVMT approach is suf\ufb01cient to\nachieve similar performance to the SVMS approach, which is always trained on 100,000\nexamples. We call the number of target examples with equal source and target performance\nthe break-even point. For instance, for the closely related organism C. remanei one needs\nnearly as many target data as source data to achieve the same performance. For the most\ndistantly related organism A. thaliana, less than 10% target data is suf\ufb01cient to outperform\nthe source model.\n\n3. In almost all cases, the performance of domain adaption algorithms is considerably higher\nthan source (SVMS) and target only (SVMT ). This is most pronounced near the break-even\npoint, e.g. 3% improvement for C. remanei and 14% for D. melanogaster.\n\n4. Among the domain adaptation algorithms, the dual-task learning approach (SVMS,T ) per-\nformed most often best (12/20 cases). Second most often best (5/20) performed the convex\ncombination approach (SVMS+SVMT ).\n\nFrom our observations we can conclude that the simple convex combination approach works surpris-\ningly well. It is only outperformed by the dual-task learning algorithm which performs consistently\nwell for all organisms and target training set sizes.\n\n\f\f\f", "award": [], "sourceid": 510, "authors": [{"given_name": "Gabriele", "family_name": "Schweikert", "institution": null}, {"given_name": "Gunnar", "family_name": "R\u00e4tsch", "institution": null}, {"given_name": "Christian", "family_name": "Widmer", "institution": null}, {"given_name": "Bernhard", "family_name": "Sch\u00f6lkopf", "institution": null}]}