Decompose the multi-class problem into N binary classification problems and train N binary classifiers. For the i-th class, all samples belonging to class i are treated as positive samples and all other samples as negative samples, so that each binary classifier separates the samples of class i from those of the other classes.
The core of problem transformation methods is to adapt the sample data to existing learning algorithms: the multi-label training samples are processed so that they fit existing learning algorithms, i.e., the multi-label learning problem is converted into an existing learning problem and solved there.
Representative algorithms include the first-order method Binary Relevance, which converts the multi-label learning problem into a set of binary classification problems; the second-order method Calibrated Label Ranking, which converts it into a label ranking problem; and the high-order method Random k-Labelsets, which converts it into multi-class classification problems.
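To make Binary Relevance concrete, the sketch below trains one independent binary classifier per label on a toy multi-label data set. It is a minimal sketch; the choice of scikit-learn's LogisticRegression as the base binary learner is an assumption for illustration, not something prescribed by the method.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy multi-label data: 6 samples, 4 features, 3 labels (Y is a binary indicator matrix).
X = np.array([[1, 0, 2, 1], [0, 1, 1, 3], [2, 1, 0, 0],
              [1, 2, 1, 0], [0, 0, 3, 2], [2, 2, 0, 1]], dtype=float)
Y = np.array([[1, 0, 1], [0, 1, 1], [1, 0, 0],
              [1, 1, 0], [0, 0, 1], [1, 1, 0]])

# Binary Relevance: for each label j, samples with Y[:, j] == 1 are positive and
# all others are negative; an independent binary classifier is trained per label.
classifiers = [LogisticRegression().fit(X, Y[:, j]) for j in range(Y.shape[1])]

def predict(x):
    # The predicted label set is the union of the labels whose classifier fires.
    x = np.asarray(x, dtype=float).reshape(1, -1)
    return np.array([clf.predict(x)[0] for clf in classifiers])

print(predict([1, 1, 1, 1]))  # binary vector of predicted labels for the new sample
```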
Representative algorithms of the other family, algorithm adaptation methods, include the first-order method ML-kNN, which adapts the lazy learning algorithm k-nearest neighbors to multi-label data; the second-order method Rank-SVM, which adapts the kernel learning algorithm SVM to multi-label data; and the high-order method LEAD, which adapts the Bayesian learning algorithm (Bayesian networks) to multi-label data.
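The core idea of adapting k-nearest neighbors to multi-label data can be illustrated with the simplified sketch below, which predicts each label by a vote among the k nearest neighbors. Note that this is only a stand-in: the actual ML-kNN additionally estimates prior and conditional label probabilities and applies a maximum a posteriori rule, which is omitted here.

```python
import numpy as np

def knn_multilabel_predict(X_train, Y_train, x, k=3):
    """Simplified multi-label kNN: a label is predicted for x when more than half
    of the k nearest training samples carry that label (ML-kNN would replace this
    majority vote with a MAP estimate)."""
    dists = np.linalg.norm(X_train - x, axis=1)   # Euclidean distances to all training points
    nn = np.argsort(dists)[:k]                    # indices of the k nearest neighbors
    votes = Y_train[nn].sum(axis=0)               # per-label count among the neighbors
    return (votes > k / 2).astype(int)            # majority vote per label

# Toy data: 5 samples, 2 features, 3 labels.
X_train = np.array([[0., 0.], [0., 1.], [1., 0.], [5., 5.], [5., 6.]])
Y_train = np.array([[1, 0, 0], [1, 0, 0], [1, 1, 0], [0, 1, 1], [0, 0, 1]])
print(knn_multilabel_predict(X_train, Y_train, np.array([0.2, 0.2]), k=3))  # -> [1 0 0]
```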
In this type of learning (multi-instance learning), the training set consists of bags that carry concept labels, and each bag contains several instances without concept labels of their own. If a bag contains at least one positive instance, the bag is labeled positive; if all instances in a bag are negative, the bag is labeled negative. By learning from the training bags, the learning system is expected to predict the concept labels of bags outside the training set as accurately as possible.
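A minimal sketch of this bag-labeling convention and of the corresponding prediction rule (a bag is predicted positive when at least one of its instances is scored positive). The instance-level scorer here is a hand-crafted stand-in for illustration, not any specific multi-instance algorithm.

```python
import numpy as np

def bag_label(instance_labels):
    # A bag is positive if it contains at least one positive instance,
    # and negative only if every instance is negative.
    return int(any(instance_labels))

def predict_bag(instances, instance_scorer, threshold=0.5):
    # Prediction-time reduction: score every instance and mark the bag
    # positive when any instance score exceeds the threshold.
    scores = np.array([instance_scorer(x) for x in instances])
    return int(scores.max() > threshold)

# Toy example with an assumed instance scorer: "positive" when the first feature is large.
scorer = lambda x: 1.0 / (1.0 + np.exp(-(x[0] - 2.0)))
bag = np.array([[0.5, 1.0], [1.2, 0.3], [3.5, 0.8]])   # one instance has a large first feature
print(bag_label([0, 0, 1]), predict_bag(bag, scorer))   # -> 1 1
```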
Multi-task learning is a machine learning approach that contrasts with single-task learning. In machine learning, the standard algorithmic setting is to learn one task at a time, i.e., the system produces a single real-valued output. A complex learning problem is first decomposed into theoretically independent subproblems, each subproblem is learned separately, and the results of the subproblems are then combined to build a mathematical model of the complex problem. Multi-task learning is instead a form of joint learning: multiple tasks are learned in parallel and their results influence one another.
For a simple comparison, take the commonly used school data set, a regression data set for predicting student exam scores that contains 15,362 students from 139 secondary schools, where each school can be treated as a separate prediction task. Single-task learning either ignores any possible relationships between tasks and learns 139 regression functions separately, or pools the data of all 139 schools and learns a single regression function. Multi-task learning, by contrast, emphasizes the relationships between tasks: through joint learning it learns a different regression function for each of the 139 tasks simultaneously, taking into account both the differences and the connections between tasks. This is one of the most important ideas in multi-task learning.
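The two single-task strategies just described can be sketched as follows on synthetic stand-in data (the real school data set is not loaded here, and the task sizes and noise levels are assumed values); a multi-task counterpart built on a shared center is sketched after the paper summary further below.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
T, d = 139, 5                                   # 139 schools (tasks), 5 features per student
tasks = []                                       # synthetic stand-in for the school data
w_shared = rng.normal(size=d)                    # tasks are related: weights share a common part
for t in range(T):
    n_t = rng.integers(50, 200)                  # schools have different numbers of students
    X_t = rng.normal(size=(n_t, d))
    w_t = w_shared + 0.3 * rng.normal(size=d)    # plus a small task-specific deviation
    y_t = X_t @ w_t + 0.1 * rng.normal(size=n_t)
    tasks.append((X_t, y_t))

# Strategy 1: ignore task relations and fit one regressor per school.
per_task_models = [Ridge(alpha=1.0).fit(X_t, y_t) for X_t, y_t in tasks]

# Strategy 2: pool all schools together and fit a single regressor.
X_all = np.vstack([X_t for X_t, _ in tasks])
y_all = np.concatenate([y_t for _, y_t in tasks])
pooled_model = Ridge(alpha=1.0).fit(X_all, y_all)
```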
Early work on multi-task learning grew out of research on an important problem in machine learning, namely inductive bias. The machine learning process can be viewed as analyzing empirical data related to a problem and inducing from it a model that captures the essence of the problem. The role of the inductive bias is to guide how the learning algorithm searches the model space; the quality of the resulting model is directly affected by the inductive bias, and a learning system without any inductive bias cannot learn effectively. Different learning algorithms (e.g., decision trees, neural networks, support vector machines) embody different inductive biases, and when solving a real problem one must manually decide which learning algorithm to use, which in effect means subjectively choosing an inductive bias. A natural idea is to ask whether the choice of inductive bias can itself be automated through learning, i.e., the idea of "learning to learn". Multi-task learning provides a feasible route to realizing this idea: the useful information contained in related tasks supplies a stronger inductive bias for learning the task of interest.
Current multi-task learning methods fall roughly into two categories: one shares common parameters across different tasks, the other mines latent features hidden across different tasks.
=================================================================================================
---------------------------------------------------------------- Personal understanding below ----------------------------------------------------------------
My own understanding of multi-task learning is as follows. Take event extraction as an example: subtask 1 is trigger identification and subtask 2 is relation extraction. For event extraction we can use a pipeline approach, first identifying the triggers and then classifying the relation between each trigger and its arguments; or we can use a joint approach, in which trigger identification and relation classification are performed at the same time and the two subtasks influence each other.
Multi-task learning is a machine learning approach that contrasts with single-task learning. Using the school data set described above as a simple comparison: single-task learning either ignores the possible relationships between tasks and learns 139 separate regression functions, or pools the data of all 139 schools and learns one regression function, whereas multi-task learning emphasizes the relationships between tasks and jointly learns a different regression function for each of the 139 tasks, accounting for both the differences and the connections between them. This is one of the most important ideas in multi-task learning.
Research on multi-task learning has been going on for roughly 20 years. Although it has received less attention than single-task learning, solid results still appear from time to time. Compared with single-task learning, what are its advantages? As noted above, single-task learning ignores the relationships between tasks, whereas real-world learning tasks are usually related in many ways; multi-label image classification and face recognition, for example, can each be divided into multiple subtasks. The strength of multi-task learning lies in its ability to uncover the relationships among these subtasks while still distinguishing the differences between them.
Current multi-task learning methods fall roughly into two categories: one shares common parameters across different tasks, the other mines latent features hidden across different tasks. Below, several classic multi-task learning papers and their algorithmic ideas are briefly introduced.
This paper is well worth reading. It proposes a multi-task learning method based on regularized minimization and, taking the SVM as an example, derives a multi-task support vector machine, connecting multi-task learning with the classical single-task SVM; it gives a detailed solution procedure and the relationship between the two formulations, and the experimental results demonstrate the advantage of the multi-task SVM. The key assumption in the paper is that the separating hyperplanes of all tasks share a common central hyperplane, and each task's hyperplane is obtained by shifting away from it; the offset together with the central hyperplane determines the task's final separating hyperplane.
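That shared-center assumption can be written as w_t = w0 + v_t. The sketch below uses it for regularized least-squares regression rather than the paper's SVM hinge loss, and plain gradient descent rather than the paper's solver; it is only an illustration of the shared-plus-offset idea, with hyperparameter values assumed.

```python
import numpy as np

def mtl_shared_center(tasks, lam_v=1.0, lam_w0=0.1, lr=0.01, iters=2000):
    """Fit w_t = w0 + v_t for each task by gradient descent on
    sum_t MSE_t + lam_v * sum_t ||v_t||^2 + lam_w0 * ||w0||^2."""
    d = tasks[0][0].shape[1]
    T = len(tasks)
    w0 = np.zeros(d)                              # shared central weight vector
    V = np.zeros((T, d))                          # task-specific offsets
    for _ in range(iters):
        grad_w0 = 2 * lam_w0 * w0
        for t, (X, y) in enumerate(tasks):
            resid = X @ (w0 + V[t]) - y
            g = 2 * X.T @ resid / len(y)          # gradient of the task's mean squared error
            grad_w0 += g                          # every task pulls on the shared center
            V[t] -= lr * (g + 2 * lam_v * V[t])   # task-specific offset update
        w0 -= lr * grad_w0                        # shared center update
    return w0, V

# Usage with the synthetic school-style tasks built earlier (or any list of (X_t, y_t) pairs):
# w0, V = mtl_shared_center(tasks)
```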
This paper can be read as a fairly comprehensive survey. It discusses four settings: feature selection, kernel selection, adaptive pooling, and graphical model structure, and describes four multi-task learning methods in detail; it is a very useful reference.
[2] T. Jebara. Multitask Sparsity via Maximum Entropy Discrimination. Journal of Machine Learning Research, 12:75-110, 2011.
[3] A. Argyriou, T. Evgeniou and M. Pontil. Convex multi-task feature learning. Machine Learning, 73(3):243-272, 2008.
=====================================================================================
Multitask Learning is an approach to inductive transfer that improves generalization by using the domain information contained in the training signals of related tasks as an inductive bias. It does this by learning tasks in parallel while using a shared representation; what is learned for each task can help other tasks be learned better.[3]
In the classification context, MTL aims to improve the performance of multiple classification tasks by learning them jointly. One example is a spam-filter, which can be treated as distinct but related classification tasks across different users. To make this more concrete, consider that different people have different distributions of features which distinguish spam emails from legitimate ones, for example an English speaker may find that all emails in Russian are spam, not so for Russian speakers. Yet there is a definite commonality in this classification task across users, for example one common feature might be text related to money transfer. Solving each user's spam classification problem jointly via MTL can let the solutions inform each other and improve performance.[4] Further examples of settings for MTL include multiclass classification and multi-label classification.[5]
Multi-task learning works because the regularization induced by requiring an algorithm to perform well on a related task can be superior to regularization that prevents overfitting by penalizing all complexity uniformly. One situation where MTL may be particularly helpful is when the tasks share significant commonalities and are generally slightly undersampled.[4] However, as discussed below, MTL has also been shown to be beneficial for learning unrelated tasks.[6]
Methods
Task grouping and overlap
Within the MTL paradigm, information can be shared across some or all of the tasks. Depending on the structure of task relatedness, one may want to share information selectively across the tasks. For example, tasks may be grouped or exist in a hierarchy, or be related according to some general metric. Suppose, as developed more formally below, that the parameter vector modeling each task is a linear combination of some underlying basis. Similarity in terms of this basis can indicate the relatedness of the tasks. For example with sparsity, overlap of nonzero coefficients across tasks indicates commonality. A task grouping then corresponds to those tasks lying in a subspace generated by some subset of basis elements, where tasks in different groups may be disjoint or overlap arbitrarily in terms of their bases.[7] Task relatedness can be imposed a priori or learned from the data.[5][8]
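A minimal numpy sketch of this picture, with all sizes, basis vectors and coefficient supports assumed for illustration: each task's parameter vector is a sparse combination of a shared basis, and task relatedness can be read off from the overlap of the nonzero coefficients.

```python
import numpy as np

# Each task's parameter vector is a sparse linear combination of shared basis vectors:
# w_t = B @ s_t. Tasks whose coefficient supports overlap share basis elements (a "group").
rng = np.random.default_rng(0)
d, k, T = 10, 4, 5                      # feature dim, number of basis elements, number of tasks
B = rng.normal(size=(d, k))             # shared basis (columns are latent components)
S = np.zeros((k, T))                    # sparse per-task coefficients
S[0:2, 0] = S[0:2, 1] = 1.0             # tasks 0 and 1 use basis elements {0, 1}  -> one group
S[2:4, 2] = S[2:4, 3] = S[2:4, 4] = 1.0 # tasks 2, 3, 4 use basis elements {2, 3} -> another group
W = B @ S                               # task parameter vectors as columns

# Relatedness read off from the coefficients: two tasks are related if their supports intersect.
supports = [set(np.flatnonzero(S[:, t])) for t in range(T)]
related = [[bool(supports[t] & supports[s]) for s in range(T)] for t in range(T)]
```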
Exploiting unrelated tasks
One can attempt learning a group of principal tasks using a group of auxiliary tasks, unrelated to the principal ones. In many applications, joint learning of unrelated tasks which use the same input data can be beneficial. The reason is that prior knowledge about task relatedness can lead to sparser and more informative representations for each task grouping, essentially by screening out idiosyncrasies of the data distribution. Novel methods that build on a prior multitask methodology by favoring a shared low-dimensional representation within each task grouping have been proposed. The programmer can impose a penalty on tasks from different groups which encourages the two representations to be orthogonal. Experiments on synthetic and real data have indicated that incorporating unrelated tasks can result in significant improvements over standard multi-task learning methods.[6]
Transfer of knowledge
Related to multi-task learning is the concept of knowledge transfer. Whereas traditional multi-task learning implies that a shared representation is developed concurrently across tasks, transfer of knowledge implies a sequentially shared representation. Large scale machine learning projects such as the deep convolutional neural network GoogLeNet,[9] an image-based object classifier, can develop robust representations which may be useful to further algorithms learning related tasks. For example, the pre-trained model can be used as a feature extractor to perform pre-processing for another learning algorithm. Or the pre-trained model can be used to initialize a model with similar architecture which is then fine-tuned to learn a different classification task.[10]
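A minimal sketch of these two reuse patterns with a pretrained torchvision model. It is an assumption-laden illustration: ResNet-18 stands in for the GoogLeNet mentioned above, the number of new classes is hypothetical, and the exact weights argument may differ across torchvision versions (older releases use pretrained=True).

```python
import torch.nn as nn
import torchvision

# Load a convolutional network pretrained on ImageNet (architecture chosen for brevity).
model = torchvision.models.resnet18(weights=torchvision.models.ResNet18_Weights.DEFAULT)

# Option A: use the body as a fixed feature extractor by freezing its parameters.
for p in model.parameters():
    p.requires_grad = False

# Option B (fine-tuning): replace the final classification layer for the new task,
# then train it (optionally unfreezing earlier layers as well).
num_new_classes = 10                     # hypothetical number of classes in the new task
model.fc = nn.Linear(model.fc.in_features, num_new_classes)
# ... then train with a standard optimizer on the new task's data.
```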
Mathematics
Reproducing kernel Hilbert space of vector-valued functions (RKHSvv)
The MTL problem can be cast within the context of RKHSvv (a complete inner product space of vector-valued functions equipped with a reproducing kernel). In particular, recent focus has been on cases where task structure can be identified via a separable kernel, described below. The presentation here derives from Ciliberto et al, 2015.[5]
RKHSvv concepts
Suppose the training data set is $\mathcal{S}_t = \{(x_i^t, y_i^t)\}_{i=1}^{n_t}$, with $x_i^t \in \mathcal{X}$ and $y_i^t \in \mathcal{Y}$, where $t$ indexes the task and $t \in \{1,\dots,T\}$. Let $n = \sum_{t=1}^T n_t$. In this setting there is a consistent input and output space and the same loss function $\mathcal{L}: \mathbb{R} \times \mathbb{R} \rightarrow \mathbb{R}_+$ for each task. This results in the regularized machine learning problem:

$$\min_{f \in \mathcal{H}} \sum_{t=1}^T \frac{1}{n_t} \sum_{i=1}^{n_t} \mathcal{L}(y_i^t, f_t(x_i^t)) + \lambda \|f\|_{\mathcal{H}}^2 \qquad (1)$$

where $\mathcal{H}$ is a vector-valued reproducing kernel Hilbert space with functions $f: \mathcal{X} \rightarrow \mathcal{Y}^T$ having components $f_t: \mathcal{X} \rightarrow \mathcal{Y}$.

The reproducing kernel for the space $\mathcal{H}$ of functions $f: \mathcal{X} \rightarrow \mathbb{R}^T$ is a symmetric matrix-valued function $\Gamma: \mathcal{X} \times \mathcal{X} \rightarrow \mathbb{R}^{T \times T}$ such that $\Gamma(\cdot, x)c \in \mathcal{H}$ and the following reproducing property holds:

$$\langle f(x), c \rangle_{\mathbb{R}^T} = \langle f, \Gamma(x, \cdot)c \rangle_{\mathcal{H}} \qquad (2)$$

The reproducing kernel gives rise to a representer theorem showing that any solution to equation (1) has the form:

$$f(x) = \sum_{t=1}^T \sum_{i=1}^{n_t} \Gamma(x, x_i^t)\, c_i^t \qquad (3)$$
Separable kernels

The form of the kernel $\Gamma$ both induces the representation of the feature space and structures the output across tasks. A natural simplification is to choose a separable kernel, which factors into separate kernels on the input space $\mathcal{X}$ and on the tasks $\{1,\dots,T\}$. In this case the kernel relating the scalar components $f_t$ and $f_s$ is given by $\gamma((x_i,t),(x_j,s)) = k(x_i,x_j)\, k_T(s,t) = k(x_i,x_j)\, A_{s,t}$. For vector-valued functions $f \in \mathcal{H}$ we can write $\Gamma(x_i,x_j) = k(x_i,x_j)\, A$, where $k$ is a scalar reproducing kernel and $A$ is a symmetric positive semi-definite $T \times T$ matrix. Henceforth denote $S_+^T = \{\text{PSD matrices}\} \subset \mathbb{R}^{T \times T}$.

This factorization property, separability, implies that the input feature space representation does not vary by task; that is, there is no interaction between the input kernel and the task kernel. The structure on tasks is represented solely by $A$. Methods for non-separable kernels $\Gamma$ are a current field of research.

For the separable case, the representer theorem reduces to $f(x) = \sum_{i=1}^n k(x, x_i)\, A c_i$. The model output on the training data is then $KCA$, where $K$ is the $n \times n$ empirical kernel matrix with entries $K_{i,j} = k(x_i,x_j)$, and $C$ is the $n \times T$ matrix whose rows are the $c_i$.

With the separable kernel, equation (1) can be rewritten as

$$\min_{C \in \mathbb{R}^{n \times T}} V(Y, KCA) + \lambda\, \mathrm{tr}(KCAC^\top) \qquad (P)$$

where $V$ is a (weighted) average of $\mathcal{L}$ applied entry-wise to $Y$ and $KCA$ (the weight is zero if $Y_i^t$ is a missing observation).

Note that the second term in (P) can be derived as follows:

$$\|f\|_{\mathcal{H}}^2 = \Big\langle \sum_{i=1}^n k(\cdot, x_i) A c_i,\ \sum_{j=1}^n k(\cdot, x_j) A c_j \Big\rangle_{\mathcal{H}}$$
$$= \sum_{i,j=1}^n \langle k(\cdot, x_i) A c_i,\ k(\cdot, x_j) A c_j \rangle_{\mathcal{H}} \quad \text{(bilinearity)}$$
$$= \sum_{i,j=1}^n \langle k(x_i, x_j) A c_i,\ c_j \rangle_{\mathbb{R}^T} \quad \text{(reproducing property)}$$
$$= \sum_{i,j=1}^n k(x_i, x_j)\, c_i^\top A c_j = \mathrm{tr}(KCAC^\top)$$
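To make these separable-kernel quantities concrete, here is a minimal numerical sketch; the Gaussian input kernel, the task matrix $A$ and the coefficient matrix $C$ are all assumed toy values, chosen only so that the training output $KCA$, the penalty $\mathrm{tr}(KCAC^\top)$ and the representer-theorem prediction can be computed.

```python
import numpy as np

def gaussian_kernel(X, sigma=1.0):
    # Empirical kernel matrix K with K[i, j] = k(x_i, x_j) for a Gaussian (RBF) scalar kernel.
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma ** 2))

rng = np.random.default_rng(0)
n, d, T = 8, 3, 2                     # n training points, d features, T tasks
X = rng.normal(size=(n, d))
A = np.array([[1.0, 0.5],             # assumed task-similarity matrix (symmetric PSD)
              [0.5, 1.0]])
C = rng.normal(size=(n, T))           # coefficient matrix with rows c_i

K = gaussian_kernel(X)
F_train = K @ C @ A                   # model outputs on the training data (n x T)
penalty = np.trace(K @ C @ A @ C.T)   # the regularization term tr(KCAC^T)

def f(x_new):
    # Representer-theorem prediction f(x) = sum_i k(x, x_i) A c_i for a new input (sigma = 1).
    k_vec = np.exp(-((X - x_new) ** 2).sum(-1) / 2.0)
    return A @ (C.T @ k_vec)

print(F_train.shape, round(penalty, 3), f(np.zeros(d)))
```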
Known task structure

Task structure representations

There are three largely equivalent ways to represent task structure: through a regularizer, through an output metric, and through an output mapping.

Regularizer - With the separable kernel it can be shown (below) that $\|f\|_{\mathcal{H}}^2 = \sum_{s,t=1}^T A_{t,s}^\dagger \langle f_s, f_t \rangle_{\mathcal{H}_k}$, where $A_{t,s}^\dagger$ is the $(t,s)$ element of the pseudoinverse of $A$, $\mathcal{H}_k$ is the RKHS based on the scalar kernel $k$, and $f_t(x) = \sum_{i=1}^n k(x, x_i) A_t^\top c_i$. This formulation shows that $A_{t,s}^\dagger$ controls the weight of the penalty associated with $\langle f_s, f_t \rangle_{\mathcal{H}_k}$. (Note that $\langle f_s, f_t \rangle_{\mathcal{H}_k}$ arises from $\|f_t\|_{\mathcal{H}_k}^2 = \langle f_t, f_t \rangle_{\mathcal{H}_k}$.)

Proof:

$$\|f\|_{\mathcal{H}}^2 = \Big\langle \sum_{i=1}^n \gamma((x_i, t_i), \cdot)\, c_i^{t_i},\ \sum_{j=1}^n \gamma((x_j, t_j), \cdot)\, c_j^{t_j} \Big\rangle_{\mathcal{H}}$$
$$= \sum_{i,j=1}^n c_i^{t_i} c_j^{t_j}\, \gamma((x_i, t_i), (x_j, t_j))$$
$$= \sum_{i,j=1}^n \sum_{s,t=1}^T c_i^t c_j^s\, k(x_i, x_j)\, A_{s,t}$$
$$= \sum_{i,j=1}^n k(x_i, x_j)\, \langle c_i, A c_j \rangle_{\mathbb{R}^T}$$
$$= \sum_{i,j=1}^n k(x_i, x_j)\, \langle c_i, A A^\dagger A c_j \rangle_{\mathbb{R}^T}$$
$$= \sum_{i,j=1}^n k(x_i, x_j)\, \langle A c_i, A^\dagger A c_j \rangle_{\mathbb{R}^T}$$
$$= \sum_{i,j=1}^n \sum_{s,t=1}^T (A c_i)^t (A c_j)^s\, k(x_i, x_j)\, A_{s,t}^\dagger$$
$$= \sum_{s,t=1}^T A_{s,t}^\dagger\ \Big\langle \sum_{i=1}^n k(x_i, \cdot)(A c_i)^t,\ \sum_{j=1}^n k(x_j, \cdot)(A c_j)^s \Big\rangle_{\mathcal{H}_k}$$
$$= \sum_{s,t=1}^T A_{s,t}^\dagger\, \langle f_t, f_s \rangle_{\mathcal{H}_k}$$

Output metric - An alternative output metric on $\mathcal{Y}^T$ can be induced by the inner product $\langle y_1, y_2 \rangle_\Theta = \langle y_1, \Theta y_2 \rangle_{\mathbb{R}^T}$. With the squared loss there is an equivalence between the separable kernel $k(\cdot,\cdot)\, I_T$ under the alternative metric and the kernel $k(\cdot,\cdot)\, \Theta$ under the canonical metric.

Output mapping - Outputs can be mapped as $L: \mathcal{Y}^T \rightarrow \tilde{\mathcal{Y}}$ to a higher-dimensional space to encode complex structures such as trees, graphs and strings. For linear maps $L$, with an appropriate choice of separable kernel, it can be shown that $A = L^\top L$.
Task structure examples

Via the regularizer formulation, one can represent a variety of task structures easily.

- Letting $A^\dagger = \gamma I_T + (\gamma - \lambda)\frac{1}{T}\mathbf{1}\mathbf{1}^\top$ (where $I_T$ is the $T \times T$ identity matrix and $\mathbf{1}\mathbf{1}^\top$ is the $T \times T$ matrix of ones) is equivalent to letting $\gamma$ control the variance $\sum_t \|f_t - \bar{f}\|_{\mathcal{H}_k}$ of the tasks around their mean $\bar{f} = \frac{1}{T}\sum_t f_t$. For example, blood levels of some biomarker may be taken on $T$ patients at $n_t$ time points during the course of a day, and interest may lie in regularizing the variance of the predictions across patients.
- Letting $A^\dagger = \alpha I_T + (\alpha - \lambda) M$, where $M_{t,s} = \frac{1}{|G_r|}\,\mathbb{I}(t,s \in G_r)$, is equivalent to letting $\alpha$ control the variance measured with respect to a group mean: $\sum_r \sum_{t \in G_r} \|f_t - \frac{1}{|G_r|}\sum_{s \in G_r} f_s\|$. (Here $|G_r|$ is the cardinality of group $r$ and $\mathbb{I}$ is the indicator function.) For example, people in different political parties (groups) might be regularized together with respect to predicting the favorability rating of a politician. Note that this penalty reduces to the first one when all tasks are in the same group.
- Letting $A^\dagger = \delta I_T + (\delta - \lambda) L$, where $L = D - M$ is the Laplacian of the graph with adjacency matrix $M$ giving pairwise similarities of tasks, is equivalent to giving a larger penalty to the distance separating tasks $t$ and $s$ when they are more similar (according to the weight $M_{t,s}$), i.e. $\delta$ regularizes $\sum_{t,s} \|f_t - f_s\|_{\mathcal{H}_k}^2 M_{t,s}$.

All of the above choices of $A$ also induce the additional regularization term $\lambda \sum_t \|f_t\|_{\mathcal{H}_k}^2$, which penalizes complexity in $f$ more broadly.
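These regularizer choices can be assembled directly as matrices; the following is a minimal numpy sketch in which the values of $\gamma$, $\alpha$, $\delta$, $\lambda$, the task grouping and the task adjacency graph are all assumed for illustration.

```python
import numpy as np

T, lam = 4, 0.1                                    # number of tasks and the lambda from (1)
ones = np.ones((T, T))

# (a) Mean-regularized: penalize the variance of the tasks around their mean.
gamma = 1.0
A_dag_mean = gamma * np.eye(T) + (gamma - lam) * ones / T

# (b) Group-regularized: penalize variance around group means.
groups = [[0, 1], [2, 3]]                          # assumed disjoint task groups G_r
M = np.zeros((T, T))
for G in groups:
    for t in G:
        for s in G:
            M[t, s] = 1.0 / len(G)                 # M[t, s] = 1/|G_r| when t, s are in the same group
alpha = 1.0
A_dag_group = alpha * np.eye(T) + (alpha - lam) * M

# (c) Graph-regularized: penalize differences between tasks that the graph says are similar.
W = np.array([[0, 1, 0, 0],                        # assumed task adjacency (pairwise similarity)
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
L = np.diag(W.sum(axis=1)) - W                     # graph Laplacian L = D - M
delta = 1.0
A_dag_graph = delta * np.eye(T) + (delta - lam) * L
```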
Learning tasks together with their structure
The learning problem (P) can be generalized to admit learning the task matrix $A$ as follows:

$$\min_{C \in \mathbb{R}^{n \times T},\ A \in S_+^T} V(Y, KCA) + \lambda\, \mathrm{tr}(KCAC^\top) + F(A) \qquad (Q)$$

The choice of $F: S_+^T \rightarrow \mathbb{R}_+$ must be designed to learn matrices $A$ of a given type; see "Special cases" below.

Optimization of Q

Restricting to the case of convex losses and coercive penalties, Ciliberto et al. have shown that although (Q) is not jointly convex in $C$ and $A$, a related problem is jointly convex.

Specifically, on the convex set $\mathcal{C} = \{(C,A) \in \mathbb{R}^{n \times T} \times S_+^T \mid \mathrm{Range}(C^\top K C) \subseteq \mathrm{Range}(A)\}$, the equivalent problem

$$\min_{C, A \in \mathcal{C}} V(Y, KC) + \lambda\, \mathrm{tr}(A^\dagger C^\top K C) + F(A) \qquad (R)$$

is convex with the same minimum value, and if $(C_R, A_R)$ is a minimizer of (R) then $(C_R A_R^\dagger, A_R)$ is a minimizer of (Q).

(R) may be solved by a barrier method on a closed set by introducing the following perturbation:

$$\min_{C \in \mathbb{R}^{n \times T},\ A \in S_+^T} V(Y, KC) + \lambda\, \mathrm{tr}\big(A^\dagger (C^\top K C + \delta^2 I_T)\big) + F(A) \qquad (S)$$

The perturbation via the barrier $\delta^2\, \mathrm{tr}(A^\dagger)$ forces the objective function to be equal to $+\infty$ on the boundary of $\mathbb{R}^{n \times T} \times S_+^T$.

(S) can be solved with a block coordinate descent method, alternating in $C$ and $A$. This results in a sequence of minimizers $(C_m, A_m)$ of (S) that converges to the solution of (R) as $\delta_m \rightarrow 0$, and hence gives the solution to (Q).
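A minimal sketch of that alternating scheme, under extra assumptions that are not made in the text: a squared loss for $V$, the penalty $F(A) = \lambda\,\mathrm{tr}(A)$ (so that the A-step has a simple closed form), and an invertible kernel matrix $K$. This is only an illustration, not the algorithm as analyzed by Ciliberto et al.

```python
import numpy as np
from scipy.linalg import solve_sylvester, sqrtm

def mtl_structure_bcd(K, Y, lam=0.1, delta=1e-2, iters=50):
    """Block coordinate descent for a simplified instance of problem (S), assuming
    V(Y, KC) = (1/n)||Y - KC||_F^2, F(A) = lam * tr(A), and invertible K."""
    n, T = Y.shape
    A = np.eye(T)                                  # initial task matrix
    C = np.zeros((n, T))
    for _ in range(iters):
        # C-step: the stationarity condition K C + n*lam * C A^{-1} = Y
        # is a Sylvester equation in C.
        C = solve_sylvester(K, n * lam * np.linalg.inv(A), Y)
        # A-step: minimizing tr(A^{-1} M) + tr(A) over PSD A, with
        # M = C^T K C + delta^2 I, has the closed form A = M^{1/2}.
        M = C.T @ K @ C + delta ** 2 * np.eye(T)
        A = np.real(sqrtm(M))
    return C, A

# Usage with an n x n kernel matrix K and an n x T target matrix Y (e.g. from the setup above):
# C, A = mtl_structure_bcd(K, Y)
```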
Special cases

Spectral penalties - Dinuzzo et al.[11] suggested setting $F$ to the Frobenius norm $\sqrt{\mathrm{tr}(A^\top A)}$. They optimized (Q) directly using block coordinate descent, not accounting for difficulties at the boundary of $\mathbb{R}^{n \times T} \times S_+^T$.

Clustered tasks learning - Jacob et al.[12] suggested learning $A$ in the setting where $T$ tasks are organized in $R$ disjoint clusters. In this case let $E \in \{0,1\}^{T \times R}$ be the matrix with $E_{t,r} = \mathbb{I}(\text{task } t \in \text{group } r)$. Setting $M = I - E^\dagger E^\top$ and $U = \frac{1}{T}\mathbf{1}\mathbf{1}^\top$, the task matrix $A^\dagger$ can be parameterized as a function of $M$: $A^\dagger(M) = \epsilon_M U + \epsilon_B (M - U) + \epsilon (I - M)$, with terms that penalize, respectively, the average, the between-cluster variance and the within-cluster variance of the task predictions. $M$ is not convex, but there is a convex relaxation $\mathcal{S}_c = \{M \in S_+^T : I - M \in S_+^T \land \mathrm{tr}(M) = r\}$. In this formulation, $F(A) = \mathbb{I}(A(M) \in \{A : M \in \mathcal{S}_c\})$.
Generalizations

Non-convex penalties - Penalties can be constructed such that $A$ is constrained to be a graph Laplacian, or such that $A$ has a low-rank factorization. However, these penalties are not convex, and the analysis of the barrier method proposed by Ciliberto et al. does not go through in these cases.

Non-separable kernels - Separable kernels are limited; in particular they do not account for structures in the interaction space between the input and output domains jointly. Future work is needed to develop models for these kernels.
Applications

Spam filtering

Using the principles of MTL, techniques for collaborative spam filtering that facilitate personalization have been proposed. In large-scale, open-membership email systems, most users do not label enough messages for an individual local classifier to be effective, while the data is too noisy to be used for a global filter across all users. A hybrid global/individual classifier can be effective at absorbing the influence of users who label emails very diligently from the general public. This can be accomplished while still providing sufficient quality to users with few labeled instances.[13]

Web search

Using boosted decision trees, one can enable implicit data sharing and regularization. This learning method can be used on web-search ranking data sets. One example is to use ranking data sets from several countries. Here, multitask learning is particularly helpful, as data sets from different countries vary largely in size because of the cost of editorial judgments. It has been demonstrated that learning various tasks jointly can lead to significant improvements in performance with surprising reliability.[14]

RoboEarth

In order to facilitate the transfer of knowledge, IT infrastructure is being developed. One such project, RoboEarth, aims to set up an open-source internet database that can be accessed and continually updated from around the world. The goal is to facilitate a cloud-based interactive knowledge base, accessible to technology companies and academic institutions, which can enhance the sensing, acting and learning capabilities of robots and other artificial intelligence agents.[15]
Software package

The Multi-Task Learning via StructurAl Regularization (MALSAR) Matlab package[16] implements the following multi-task learning algorithms:

- Mean-Regularized Multi-Task Learning[17][18]
- Multi-Task Learning with Joint Feature Selection[19]
- Robust Multi-Task Feature Learning[20]
- Trace-Norm Regularized Multi-Task Learning[21]
- Alternating Structural Optimization[22][23]
- Incoherent Low-Rank and Sparse Learning[24]
- Robust Low-Rank Multi-Task Learning
- Clustered Multi-Task Learning[25][26]
- Multi-Task Learning with Graph Structures

See also

- Artificial intelligence
- Artificial neural network
- Evolutionary computation
- Human-based genetic algorithm
- Kernel methods for vector output
- Machine learning
- Robotics

References
[1] Baxter, J. (2000). A model of inductive bias learning. Journal of Artificial Intelligence Research, 12:149-198.
[2] Thrun, S. (1996). Is learning the n-th thing any easier than learning the first? In Advances in Neural Information Processing Systems 8, pp. 640-646. MIT Press.
[3] Caruana, R. (1997). Multi-task learning. Machine Learning. doi:10.1023/A:1007379606734.
[4] Weinberger, Kilian. Multi-task Learning.
[5] Ciliberto, C. (2015). Convex Learning of Multiple Tasks and their Structure. arXiv:1504.03101.
[6] Romera-Paredes, B., Argyriou, A., Bianchi-Berthouze, N., & Pontil, M. (2012). Exploiting Unrelated Tasks in Multi-Task Learning. http://jmlr.csail.mit.edu/proceedings/papers/v22/romera12/romera12.pdf
[7] Kumar, A., & Daume III, H. (2012). Learning Task Grouping and Overlap in Multi-Task Learning. http://icml.cc/2012/papers/690.pdf
[8] Jawanpuria, P., & Saketha Nath, J. (2012). A Convex Feature Learning Formulation for Latent Task Structure Discovery. http://icml.cc/2012/papers/90.pdf
[9] Szegedy, C. (2014). Going Deeper with Convolutions. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015. arXiv:1409.4842. doi:10.1109/CVPR.2015.7298594.
[10] Roig, Gemma. Deep Learning Overview (PDF).
[11] Dinuzzo, Francesco (2011). Learning output kernels with block coordinate descent. Proceedings of the 28th International Conference on Machine Learning (ICML-11).
[12] Jacob, Laurent (2009). Clustered multi-task learning: A convex formulation. Advances in Neural Information Processing Systems.
[13] Attenberg, J., Weinberger, K., & Dasgupta, A. Collaborative Email-Spam Filtering with the Hashing-Trick. http://www.cse.wustl.edu/~kilian/papers/ceas2009-paper-11.pdf
[14] Chappelle, O., Shivaswamy, P., & Vadrevu, S. Multi-Task Learning for Boosting with Application to Web Search Ranking. http://www.cse.wustl.edu/~kilian/papers/multiboost2010.pdf
[15] Description of the RoboEarth Project.
[16] Zhou, J., Chen, J., & Ye, J. (2012). MALSAR: Multi-tAsk Learning via StructurAl Regularization. Arizona State University. http://www.public.asu.edu/~jye02/Software/MALSAR
[17] Evgeniou, T., & Pontil, M. (2004). Regularized multi-task learning. Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 109-117.
[18] Evgeniou, T., Micchelli, C., & Pontil, M. (2005). Learning multiple tasks with kernel methods. Journal of Machine Learning Research, 6, 615.
[19] Argyriou, A., Evgeniou, T., & Pontil, M. (2008). Convex multi-task feature learning. Machine Learning, 73, 243-272.
[20] Chen, J., Zhou, J., & Ye, J. (2011). Integrating low-rank and group-sparse structures for robust multi-task learning. Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
[21] Ji, S., & Ye, J. (2009). An accelerated gradient method for trace norm minimization. Proceedings of the 26th Annual International Conference on Machine Learning, pp. 457-464.
[22] Ando, R., & Zhang, T. (2005). A framework for learning predictive structures from multiple tasks and unlabeled data. Journal of Machine Learning Research, 6, 1817-1853.
[23] Chen, J., Tang, L., Liu, J., & Ye, J. (2009). A convex formulation for learning shared structures from multiple tasks. Proceedings of the 26th Annual International Conference on Machine Learning, pp. 137-144.
[24] Chen, J., Liu, J., & Ye, J. (2010). Learning incoherent sparse and low-rank patterns from multiple tasks. Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1179-1188.
[25] Jacob, L., Bach, F., & Vert, J. (2008). Clustered multi-task learning: A convex formulation. Advances in Neural Information Processing Systems.
[26] Zhou, J., Chen, J., & Ye, J. (2011). Clustered multi-task learning via alternating structure optimization. Advances in Neural Information Processing Systems.