1 Introduction
In many applications, for example image or text classification, gathering unlabeled data is easier than gathering labeled data. Semi-supervised methods try to extract information from the unlabeled data to improve classification results over purely supervised methods. A well-known technique to incorporate unlabeled data into a learning process is manifold regularization (MR) [6, 16]. This procedure adds a data-dependent penalty term to the loss function that penalizes classification rules that behave non-smoothly with respect to the data distribution. This paper presents a sample complexity and a Rademacher complexity analysis for this procedure, and illustrates how our Rademacher complexity bounds can be used to choose a suitable manifold regularization parameter. We organize our paper as follows. In Sections 2 and 3 we discuss related work and introduce the semi-supervised setting. In Section 4 we formalize the idea of adding a distribution-dependent penalty term to a loss function. Algorithms such as manifold, entropy, or co-regularization [6, 13, 18] follow this idea. Our formalization of this idea is inspired by Balcan and Blum [3] and allows for a similar sample complexity analysis.
Section 5 reviews the work of Balcan and Blum [3] and generalizes a sample complexity bound from their paper. We then show how this bound can be used to derive sample complexity bounds for the proposed framework, and thus in particular for MR. In the same section we discuss what our findings mean for the performance gap between supervised and semi-supervised methods. We conclude that if our hypothesis set has finite pseudo-dimension, any semi-supervised learner (SSL) that falls in our framework can achieve at most a constant improvement in terms of sample complexity when compared to all supervised learners. This relates to the work of Darnstädt et al. [11] and Ben-David et al. [7], and we argue that we generalize their results.
While the results from the previous sections hold uniformly over all distributions and are thus distribution-independent, we show in Section 6 how one can obtain distribution-dependent complexity bounds for MR. We review a kernel formulation of MR [19] and show how this can be used to estimate Rademacher complexities for specific datasets. In Section 7 we illustrate on an artificial dataset how the distribution-dependent bounds can be used to choose the regularization parameter of MR. This is particularly useful because the analysis does not need an additional labeled validation set, and in semi-supervised settings labeled samples are often assumed to be scarce.
In Section 8 we discuss our results.
We summarize the three core contributions of this paper.

We formalize the concept of distribution-dependent regularization for semi-supervised learning and derive distribution-independent sample complexity bounds for it. In particular, this includes sample complexity bounds for MR.

We compare the derived upper bounds with lower bounds for purely supervised learners. We show that for all SSLs that fall in our framework, the performance gap compared to their supervised counterparts can be at most a constant, which depends on the hypothesis class used.

We derive a computationally feasible method to compute distribution-dependent Rademacher complexity bounds for MR. We illustrate how these complexity bounds can be used to find a suitable regularization parameter for MR, removing the need for a labeled validation set.
2 Related Work
There are currently two related analyses of MR that show, to some extent, that an SSL can learn efficiently if it knows the true underlying manifold, while a fully supervised learner may not. Globerson et al. [12] investigate a setting where they essentially restrict themselves to input distributions supported on unions of a fixed number of irreducible algebraic sets, where each algebraic set carries a single label. An SSL that knows the true input distribution can identify the algebraic sets and reduce the hypothesis space to all possible label combinations on those sets. As we are then left with finitely many hypotheses, we can learn efficiently, while they show that every supervised learner is left with a hypothesis space of infinite VC-dimension.
A similar type of analysis was done by Niyogi [16], who considers manifolds that arise as embeddings of a circle, where the labeling over the circle is (up to the decision boundary) smooth. Niyogi then shows that a learner with knowledge of the manifold can learn efficiently, while for every fully supervised learner one can find an embedding and a distribution for which this is not possible.
The relation to our paper is as follows. These works provide examples where the sample complexity gap between a semi-supervised and a supervised learner can be arbitrarily large, while we explore the general difference in sample complexity between a supervised method and MR.
3 The Semi-Supervised Setting
We work in the statistical learning framework, i.e. we assume we are given a feature domain X and a label space Y, together with an unknown probability distribution P over X × Y. Additionally we are given a loss function φ, which is convex in the first argument and in practice often a surrogate for the 0-1 loss, as for example the hinge loss. A hypothesis is a function h: X → Y. We let (X, Y) be a random variable distributed according to P, while lowercase x and y denote elements of X and Y respectively. Our goal is to find a hypothesis h, within a restricted class H, such that the expected loss Q(h) := E[φ(h(X), Y)] is small. Throughout this paper we call such models discriminative, as the loss only depends on the output h(x) of the model and not on the input x itself. In the standard supervised setting we choose a hypothesis based on an i.i.d. sample S_n = ((x_1, y_1), ..., (x_n, y_n)) drawn from P. With that we define the empirical risk of a model h with respect to φ, measured on the sample S_n, as
Q_n(h, S_n) = (1/n) Σ_{i=1}^n φ(h(x_i), y_i).
For ease of notation we sometimes omit S_n and just write Q_n(h). Given a learning problem defined by (P, φ, H) and a labeled sample S_n, one way to choose a hypothesis is the empirical risk minimization principle
(1) h_sup = argmin_{h ∈ H} Q_n(h, S_n).
We refer to h_sup as the supervised solution, as it does not depend on the unlabeled sample. In SSL we additionally have samples with unknown labels, so we assume we are given n + m samples drawn independently according to P, where the label has not been observed for the last m samples. We call the first n samples the labeled samples and the last m samples the unlabeled samples. We furthermore set U = {x_1, ..., x_{n+m}}, so U is the set that contains all our available information about the feature distribution.
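As a minimal illustration of the empirical risk minimization principle (1), the following sketch (our code, with hypothetical helper names) minimizes the empirical risk over a finite candidate class of threshold classifiers, using the 0-1 loss rather than a convex surrogate:

```python
import numpy as np

def empirical_risk(h, X, y, loss):
    # average supervised loss of hypothesis h on the labeled sample
    return float(np.mean([loss(h(x), yi) for x, yi in zip(X, y)]))

def erm(H, X, y, loss):
    # empirical risk minimization (1) over a finite candidate set H
    return min(H, key=lambda h: empirical_risk(h, X, y, loss))

# toy problem: threshold classifiers on the real line
zero_one = lambda pred, label: float(pred != label)
H = [lambda x, t=t: int(x > t) for t in np.linspace(-1, 1, 21)]
X = np.array([-0.8, -0.3, 0.2, 0.7])
y = np.array([0, 0, 1, 1])
h_sup = erm(H, X, y, zero_one)  # picks a threshold consistent with the sample
```

Here the sample is separable, so the chosen threshold attains zero empirical risk; with a surrogate loss the same `erm` skeleton applies unchanged.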
4 Proposed Framework for Semi-Supervised Learning
We propose to include the unlabeled sample in the following manner. We first introduce a second, convex, loss function ψ(h, x) that only depends on the input feature x and a hypothesis h. We refer to ψ as the unsupervised loss, as it does not depend on any labels. We propose to add the unlabeled data through this loss function as a penalty term on top of the supervised loss, to obtain the semi-supervised solution
(2) h_semi = argmin_{h ∈ H} Q_n(h, S_n) + λ Ψ_U(h), with Ψ_U(h) = (1/(n+m)) Σ_{i=1}^{n+m} ψ(h, x_i),
where λ controls the trade-off between the supervised and the unsupervised loss. We do not claim any novelty for the idea of adding an unsupervised loss for regularization; a related framework can be found in [10, Chapter 10]. We are, however, not aware of this particular formulation of the framework or of a sample complexity analysis as presented here. As we are particularly interested in the class of MR schemes, we first remark that this method indeed fits our framework.
Example: Manifold Regularization
Overloading the notation, we now write P for the distribution restricted to X. In MR one assumes that the input distribution has support on a compact manifold M and that the predictor h varies smoothly in the geometry of M [6]. There are several regularization terms that can enforce this smoothness, one of which is the squared norm of the gradient of h along M, integrated with respect to P. Belkin and Niyogi [5] show that this quantity may be approximated with a finite sample x_1, ..., x_N drawn from P. Given such a sample, one first defines a weight matrix W whose entries W_ij measure the proximity of x_i and x_j, for example through a Gaussian kernel. We then set L as the Laplacian matrix L = D − W, where D is a diagonal matrix with D_ii = Σ_j W_ij. Let furthermore f = (h(x_1), ..., h(x_N))^T be the evaluation vector of h on the sample. The quantity f^T L f then converges, with appropriate scaling, to the smoothness penalty [5]. Thus we can set the unsupervised loss so that its average over the sample equals f^T L f, and this is indeed a convex function in h.

5 Analysis of the Framework
In this section we analyze the properties of the semi-supervised solution found in Equation (2). We give sample complexity bounds for this procedure and compare them to sample complexities for the supervised case. We derive our results from Balcan and Blum [3], who introduced the concept of unsupervised loss functions. The use of the unsupervised loss function in their work, however, differs from ours: while they use it to restrict the hypothesis space directly, we use it as a regularization tool in empirical risk minimization, as is usually done in practice. To switch between the constrained optimization view and our formulation (2) we use the following classical result from convex optimization [14, Theorem 1].
Lemma 1.
Let f(h) and g(h) be real-valued functions that are convex in h. Then the following two optimization problems are equivalent:
(3) min_h f(h) subject to g(h) ≤ τ
(4) min_h f(h) + λ g(h)
Here equivalence means that for each τ we can find a λ ≥ 0 such that both problems have the same solution, and vice versa.
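The correspondence between (3) and (4) can be checked numerically in a simple special case. The sketch below (our construction, not from the paper) uses ridge regression, where the penalized problem (4) has a closed form, and recovers the multiplier belonging to a given constraint level τ by bisection:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(30, 3))
b = rng.normal(size=30)

def x_of(lam):
    # minimizer of the penalized problem (4): ||Ax - b||^2 + lam * ||x||^2
    return np.linalg.solve(A.T @ A + lam * np.eye(3), A.T @ b)

# pick a penalized solution and read off its constraint level tau = ||x||^2
lam0 = 0.7
tau = float(x_of(lam0) @ x_of(lam0))

# the constrained problem (3) with budget tau is solved by the penalized
# problem at the multiplier where the norm hits tau; ||x(lam)||^2 is
# decreasing in lam, so bisection recovers that multiplier
lo, hi = 1e-6, 1e6
for _ in range(200):
    mid = (lo + hi) / 2
    if x_of(mid) @ x_of(mid) > tau:
        lo = mid  # norm too large: increase the multiplier
    else:
        hi = mid
# mid is (numerically) the multiplier lam0 we started from
```

The same monotone relation between λ and τ is what Lemma 1 exploits in general.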
The next subsection introduces the sample complexity bound and shows how it can be used to give theoretical guarantees for the presented framework.
5.1 Sample Complexity Bounds
Sample complexity bounds for supervised learning typically use a notion of complexity of the hypothesis space to bound the worst-case difference between the estimated and the true risk. One well-known capacity notion is the VC-dimension [20], which allows for a distribution-free, worst-case analysis of the complexity of the hypothesis space. Balcan and Blum [3] use this notion to derive sample complexity bounds for SSL. While their bound was given for the classification error and Boolean unsupervised loss functions, we adapt it to be valid for all bounded supervised loss functions and all bounded unsupervised loss functions. To do that we use the notion of pseudo-dimension, an extension of the VC-dimension to real-valued loss functions and hypotheses [20, 15]. We briefly review this concept.
For a hypothesis h ∈ H, a threshold t ∈ R and a labeled point z = (x, y), let g_{h,t}(z) = 1 if φ(h(x), y) > t and g_{h,t}(z) = 0 otherwise. Ignoring the label, we can define g_{h,t} in the same way for the unsupervised loss ψ. The function g_{h,t} is the 0-1 error we observe if we try to classify the point z with the hypothesis h by thresholding the loss function at t. For a given sample z_1, ..., z_n the following set captures all possible 0-1 errors (and thus all possible labellings) when using the functions g_{h,t} for classification:
(5) B = {(g_{h,t}(z_1), ..., g_{h,t}(z_n)) : h ∈ H, t ∈ R}.
The pseudo-dimension of a hypothesis class H and a loss function φ is then defined as the largest n for which some sample z_1, ..., z_n realizes all 2^n possible 0-1 vectors in (5).
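To make the set of 0-1 patterns (5) concrete, the following toy check (our construction, not from the paper) enumerates the patterns realized by affine predictors with the squared loss on a two-point sample; all four patterns appear, so the two points are shattered in the sense of the definition above:

```python
import numpy as np

def patterns(sample, params, thresholds):
    # collect the 0-1 vectors (g_{h,t}(z_1), ..., g_{h,t}(z_n)) from (5)
    # for affine predictors f(x) = a*x + b with squared loss
    seen = set()
    for a, b in params:
        for t in thresholds:
            seen.add(tuple(int((a * x + b - y) ** 2 > t) for x, y in sample))
    return seen

sample = [(0.0, 0.0), (1.0, 0.0)]
grid = [(a, b) for a in np.linspace(-2, 2, 21) for b in np.linspace(-2, 2, 21)]
ts = np.linspace(0.0, 2.0, 21)
pats = patterns(sample, grid, ts)
# all four 0-1 patterns are realized: the two points are pseudo-shattered
```

For instance (a, b) = (0, 2) makes both losses large, (0, 0) makes both zero, and (−2, 2) or (2, 0) mix the two, so with t = 1 every pattern occurs.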
These definitions let us state our first main theorem, which is a generalization of Balcan and Blum [3, Theorem 10] to bounded loss functions.
Theorem 1.
Let ε > 0 and δ ∈ (0, 1). Assume φ is a measurable loss function such that there exists a constant B with φ(h(x), y) ≤ B for all h ∈ H and all (x, y), and let P be a distribution. Furthermore let h* be the expected-risk minimizer among the hypotheses satisfying the unsupervised constraint. Then an unlabeled sample and a labeled sample, of sizes polynomial in 1/ε, ln(1/δ), B and the pseudo-dimensions of the unsupervised and the supervised loss class respectively, are sufficient to ensure that with probability at least 1 − δ the classifier ĥ that minimizes the empirical risk subject to the empirical unsupervised constraint satisfies
(6) Q(ĥ) ≤ Q(h*) + ε.
Proof.
The proof can be found in the supplementary material.
∎
In the next subsection we show how to use this theorem to derive sample complexity bounds for MR. Before that, we want to make one remark about the assumption that the loss function is globally bounded. If we assume that H is a reproducing kernel Hilbert space, there exists a κ such that for all h ∈ H and x ∈ X it holds that |h(x)| ≤ κ ||h||_H. If we restrict the norm of h by introducing a regularization term with respect to that norm, and assume that φ is continuous, we can conclude that φ is globally bounded over the resulting class. In a way this can also be seen as a justification for using an intrinsic regularization of the norm of h in addition to the regularization by the unsupervised loss, as only then do the guarantees of Theorem 1 apply. Using this bound together with Lemma 1 we can state the following corollary to give a PAC-style guarantee for our proposed framework.
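Spelled out via the reproducing property (with k the kernel of H, k_x the reproducing element at x, κ = sup_x √k(x, x) assumed finite, and Λ a bound on ||h||_H, all names ours):

```latex
|h(x)| \;=\; |\langle h, k_x \rangle_{\mathcal{H}}|
\;\le\; \|h\|_{\mathcal{H}} \, \|k_x\|_{\mathcal{H}}
\;=\; \|h\|_{\mathcal{H}} \sqrt{k(x,x)}
\;\le\; \Lambda \kappa .
```

Continuity of φ in its first argument then yields a finite global bound B on φ over this norm-restricted class, as required by Theorem 1.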
Corollary 1.
As our paper focuses mostly on MR, there are two notes to make. First, recall that in this setting the unsupervised loss is defined through pairs of points, so we effectively gather unlabeled samples from X × X instead of X, and thus need correspondingly more unlabeled samples for the same bound. Second, the unsupervised loss does not necessarily map to [0, 1]. But it is clear that we can adapt the bounds for bounded unsupervised loss functions the same way we did for bounded supervised loss functions.
5.2 Comparison to the Supervised Solution
In the SSL community it is well known that using SSL does not come without risk [10, Chapter 4]. It is thus of particular interest how semi-supervised methods compare to purely supervised schemes. There are, however, many potential supervised methods to consider. In the work of Ben-David et al. [7], Darnstädt et al. [11] and Globerson et al. [12] this problem is avoided by comparing to all possible supervised schemes. The framework introduced in this paper allows for a more fine-grained analysis, as the semi-supervision happens on top of an already existing supervised method. For our framework it is thus natural to compare the sample complexity of the semi-supervised solution (2) with that of the supervised solution (1). To do so we use again the language of sample complexities and state a simple lower bound on the sample complexity for the semi-supervised solution.
Theorem 2 ([17, p. 73]).
Let ε > 0 and δ ∈ (0, 1). Assume φ is a measurable loss function such that there exists a constant B with φ(h(x), y) ≤ B for all h ∈ H and all (x, y). Furthermore let h* be defined as in Theorem 1. Then there exists a constant c such that a labeled sample of size at least c times the pseudo-dimension of the constrained loss class divided by ε² is necessary to ensure that for all distributions P, with probability at least 1 − δ, the classifier that minimizes the empirical risk subject to the empirical unsupervised constraint satisfies the guarantee (6).
We want to point out two things about this lower bound. First, note that Theorem 1 provided an upper bound on the sample complexity, i.e. a sample size which is sufficient for the guarantees, while the lower bound from Theorem 2 is the number of labeled samples that is necessary for the same guarantees. Second, we note that we place no requirement on the unlabeled sample size. This is because Theorem 2 is stated under the assumption that we have infinitely many unlabeled samples; as the theorem holds in that case, it certainly holds when we only have a finite unlabeled sample. The following corollary captures the comparison to the supervised solution.
Corollary 2.
Assume the setting of Theorem 2 and assume additionally that φ and ψ are convex. Then there exist universal constants such that, to guarantee the same performance as the supervised solution found in (1), the semi-supervised solution found in (2) needs a labeled sample whose size is smaller by at most a multiplicative factor depending only on those constants and on the pseudo-dimensions of the constrained and the unconstrained loss class.
Proof.
We can also interpret the corollary as follows. The worst-case difference in labeled sample consumption between the supervised and the semi-supervised solution depends only on two universal constants and on the pseudo-dimensions of the constrained and the unconstrained loss class. In particular, the sample complexity of the supervised method exceeds that of the semi-supervised method by at most this constant factor, which is finite whenever the pseudo-dimension of the hypothesis class is finite.
5.3 The Limits of Manifold Regularization
We now relate our result to the conjecture published by Shalev-Shwartz and Ben-David [17]: an SSL cannot learn faster by more than a constant (which may depend on the hypothesis class and the loss) than the supervised learner. Darnstädt et al. [11, Theorems 2 and 3] showed that this conjecture is essentially true for classes with finite VC-dimension and SSLs that do not make any distributional assumptions. Corollary 2 and the discussion thereafter show that this statement also holds for every SSL that falls in our proposed framework, and thus extends this result. Notably, our result holds explicitly for SSLs that do make assumptions about the distribution: MR assumes that the labeling function behaves smoothly with respect to the underlying manifold.
6 Rademacher Complexity for Manifold Regularization
The analysis of the previous sections was distribution-independent. We saw that the sample complexity difference between manifold-regularized methods and purely supervised methods is at most a constant, which depends on the hypothesis class and the regularization parameter. We believe that, in order to find out in which scenarios semi-supervised learning can help, it is useful to also look at distribution-dependent complexity measures. To this end we derive in this section computationally feasible upper and lower bounds on the Rademacher complexity of MR. We first review the work of Sindhwani et al. [19], who construct a kernel that corresponds to MR. Given this kernel we can use standard upper and lower bounds on the Rademacher complexity of RKHSs, as found for example in [9]. The analysis is thus similar to that of [18], who consider a co-regularization setting. In particular, [19] show the following, here informally stated, theorem.
Theorem 3 ([19, Propositions 2.1, 2.2]).
Let H be an RKHS with inner product ⟨·, ·⟩_H. As before, let U = {x_1, ..., x_{n+m}}, and for f ∈ H write f_U = (f(x_1), ..., f(x_{n+m}))^T. Furthermore let ⟨·, ·⟩_U be any inner product on R^{n+m}. Let H̃ be the same space of functions as H, but with the newly defined inner product ⟨f, g⟩_{H̃} = ⟨f, g⟩_H + ⟨f_U, g_U⟩_U. Then H̃ is an RKHS.
Assume now that M is a positive definite (n+m)-dimensional matrix and we set the inner product ⟨f_U, g_U⟩_U = μ f_U^T M g_U. By setting M as the Laplacian matrix L explained in Section 4, the norm of H̃ now automatically regularizes with respect to the data manifold given by U. We furthermore know the exact form of the kernel of H̃.
Theorem 4 ([19, Proposition 2.2]).
Let k be the kernel of H, K the gram matrix given by K_ij = k(x_i, x_j) for x_i, x_j ∈ U, and k_x = (k(x_1, x), ..., k(x_{n+m}, x))^T. Finally let I be the (n+m)-dimensional identity matrix. The kernel of H̃ is then given by
k̃(x, z) = k(x, z) − μ k_x^T (I + μ M K)^{−1} M k_z.
This interpretation of MR is useful for deriving computationally feasible upper and lower bounds on the empirical Rademacher complexity, i.e. distribution-dependent complexity bounds. First recall that the empirical Rademacher complexity of a hypothesis class H, measured on the labeled input features x_1, ..., x_n, is defined as
Rad_n(H) = E_σ [ sup_{h ∈ H} (1/n) Σ_{i=1}^n σ_i h(x_i) ],
where σ_1, ..., σ_n are i.i.d. Rademacher random variables, i.e. P(σ_i = −1) = P(σ_i = 1) = 1/2.
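The kernel of Theorem 4 can be evaluated purely from gram matrices. The sketch below (our code; `mu` is the regularization weight and `M` a positive semi-definite regularization matrix such as the Laplacian) computes the deformed gram matrix on the points of U:

```python
import numpy as np

def deformed_gram(K, M, mu):
    # gram matrix of the deformed kernel on the same points:
    # K_tilde = K - mu * K (I + mu * M K)^{-1} M K,
    # which collapses to K when mu = 0
    n = K.shape[0]
    return K - mu * K @ np.linalg.solve(np.eye(n) + mu * M @ K, M @ K)

# sanity check on random positive semi-definite K and M
rng = np.random.default_rng(0)
A, B = rng.normal(size=(2, 8, 8))
K, M = A @ A.T, B @ B.T
K_tilde = deformed_gram(K, M, 0.5)
```

For positive semi-definite K and M, the difference K − K̃ is itself positive semi-definite, reflecting that the deformed norm only shrinks the hypothesis ball.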
Theorem 5 ([9, p. 333]).
Let H be an RKHS with kernel k and H_r = {h ∈ H : ||h||_H ≤ r}. Given a sample x_1, ..., x_n we can bound the empirical Rademacher complexity of H_r by
(7) (r / (√2 n)) √(Σ_{i=1}^n k(x_i, x_i)) ≤ Rad_n(H_r) ≤ (r / n) √(Σ_{i=1}^n k(x_i, x_i)).
With the previous two theorems we can give upper bounds on the Rademacher complexity of MR; in particular, we can also derive a bound on the maximal complexity reduction over supervised learning.
Corollary 3.
Let H be an RKHS and for f, g ∈ H define the inner product ⟨f, g⟩_{H̃} = ⟨f, g⟩_H + μ f_U^T M g_U, where M is a positive definite matrix and μ ≥ 0 is a regularization parameter. Let H̃_r = {h ∈ H̃ : ||h||_{H̃} ≤ r}. Then
(8) Rad_n(H̃_r) ≤ (r / n) √(Σ_{i=1}^n k̃(x_i, x_i)),
where k̃ is the kernel from Theorem 4. Similarly we can also obtain a lower bound in line with Inequality (7).
This corollary allows us to compute upper bounds on the Rademacher complexity for MR, and shows in particular that the difference between the Rademacher complexity bounds of the supervised and the semi-supervised method is given by the difference of the trace terms Σ_i k(x_i, x_i) and Σ_i k̃(x_i, x_i). This can be used, for example, to compute generalization bounds [15, Chapter 3]. We could also use the kernel to compute local Rademacher complexities, which may yield tighter generalization bounds [4]. We, however, will use the Rademacher complexity bounds for choosing the regularization parameter μ, without the need for an additional labeled validation set.
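As a sketch of how the bound (8) behaves as a function of the regularization parameter, the following illustrative code (our own; bandwidths are arbitrary, the radius r is set to 1, the bound is evaluated on all points, and the data is a small toy version of the concentric-circles set of the next section) sweeps over μ:

```python
import numpy as np

def gaussian_gram(X, s):
    # gram matrix of a Gaussian kernel with bandwidth s
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2 * s ** 2))

def laplacian(X, s):
    # graph Laplacian L = D - W with Gaussian weights
    W = gaussian_gram(X, s)
    np.fill_diagonal(W, 0.0)
    return np.diag(W.sum(axis=1)) - W

# two concentric circles, a small stand-in for the dataset of Section 7
rng = np.random.default_rng(0)
t = rng.uniform(0, 2 * np.pi, size=50)
X = np.r_[np.c_[np.cos(t[:25]), np.sin(t[:25])],
          2 * np.c_[np.cos(t[25:]), np.sin(t[25:])]]

# a small ridge term keeps the base gram matrix invertible
K = gaussian_gram(X, 0.5) + 1e-3 * np.eye(len(X))
M = laplacian(X, 0.5)

bounds = []
for mu in [0.0, 0.1, 1.0, 10.0, 100.0]:
    # deformed gram matrix of the MR kernel (Theorem 4) ...
    Kt = K - mu * K @ np.linalg.solve(np.eye(len(X)) + mu * M @ K, M @ K)
    # ... plugged into the trace bound (8) with r = 1
    bounds.append(np.sqrt(np.trace(Kt)) / len(X))
# bounds decreases in mu; the flattening ('elbow') suggests a choice of mu
```

Since the deformed gram matrix shrinks (in the positive semi-definite order) as μ grows, the bound sequence is monotonically decreasing, which is exactly the curve examined in the next section.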
7 Experiment: Concentric circles
We now illustrate how one can use the upper bound (8) on the Rademacher complexity for model selection; in particular, we can get an initial idea of how to choose the regularization parameter μ. The idea is to plot the Rademacher complexity bound against μ, as in Figure 1. For increasing μ the Rademacher complexity decreases, so how should one choose the optimal μ? We propose to use a heuristic which is often used in clustering, the so-called 'elbow criterion' [8]: we essentially look for a μ such that increasing μ further does not result in much additional reduction of the complexity. We test this idea on a dataset consisting of two concentric circles with 500 data points in R², 250 per circle, see also Figure 2. We use a Gaussian base kernel, and for the MR matrix we use the Laplacian matrix as before, whose weights are also computed with a Gaussian kernel. Note that the bandwidths have to be set carefully in order to capture the structure of the dataset, but this is not the concern of this paper; we assume that a reasonable choice has already been found. We furthermore add a small L2-regularization that ensures that the radius r from Inequality (8) is finite. The precise value of r plays a secondary role here, as the behavior of the curve in Figure 1 remains the same. Details on the data generation process can be found in our code. We are now interested in how to choose the regularization parameter μ. Looking at Figure 1 we observe that for small values of μ the curve still drops steeply, while later it starts to flatten out. We thus plot the resulting kernels for a small and a larger choice of μ in Figure 2, showing the isolines of the kernel around a point of class one (the red dot in the figure). We indeed observe that for the small μ the kernel does not capture much structure yet, while for the larger μ the two concentric circles are almost completely separated by the kernel.

8 Discussion
We presented a distribution-dependent and a distribution-independent complexity analysis for manifold regularization. We note that improvements in terms of sample or Rademacher complexity do not immediately imply an improvement in terms of performance, because the difference in performance between supervised and semi-supervised learning depends on two things. First, on how the approximation error of the constrained class compares to the approximation error of the unconstrained class. Second, on how much we reduce the complexity by switching from the unconstrained to the constrained hypothesis class. In our analysis we discussed the second part, and we argue that our proposed framework is well suited to address it, as the resulting SSL schemes have a natural supervised counterpart. The first part depends on a notion the literature often refers to as a semi-supervised assumption, which basically states that we can learn as well with the constrained class as with the unconstrained one. Regarding our example of the two concentric circles, this would mean that each circle indeed corresponds to a class. The problem is that it is typically not known whether one can efficiently test if the assumption is true, without prior knowledge. The issue becomes clearer when one realizes that the assumption is always a property of the distribution P, so one would need labeled samples to verify it. We speculate that this verification process would negate the sample savings one can achieve with SSL. We are not aware of much research in this direction, besides the work of Balcan et al. [2], which discusses the sample consumption needed to test the so-called cluster assumption, and the work of Azizyan et al. [1], which analyzes the overhead of cross-validating the extra hyperparameter of their proposed semi-supervised learner.
Furthermore, we note that one condition for the main finding in Corollary 1 is that the unsupervised loss function is convex. This is not the case, for example, for entropy regularization, which in fact has a concave unsupervised loss function. It is an open question whether we can give similar bounds for such functions; we are not aware of any simple way to adapt the presented theory to non-convex losses.
References
 Azizyan et al. [2012] M. Azizyan, A. Singh, and L. A. Wasserman. Density-sensitive semi-supervised inference. CoRR, abs/1204.1685, 2012.
 Balcan et al. [2011] M.-F. Balcan, E. Blais, A. Blum, and L. Yang. Active testing. CoRR, abs/1111.0897, 2011.
 Balcan and Blum [2010] M.-F. Balcan and A. Blum. A discriminative model for semi-supervised learning. J. ACM, 57(3):19:1–19:46, Mar. 2010.
 Bartlett et al. [2005] P. L. Bartlett, O. Bousquet, and S. Mendelson. Local Rademacher complexities. Ann. Statist., 33(4):1497–1537, 2005.
 Belkin and Niyogi [2008] M. Belkin and P. Niyogi. Towards a theoretical foundation for Laplacian-based manifold methods. Journal of Computer and System Sciences, 74(8):1289–1308, 2008.
 Belkin et al. [2006] M. Belkin, P. Niyogi, and V. Sindhwani. Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. J. Mach. Learn. Res., 7:2399–2434, Dec. 2006.
 Ben-David et al. [2008] S. Ben-David, T. Lu, and D. Pál. Does unlabeled data provably help? Worst-case analysis of the sample complexity of semi-supervised learning. In Proceedings of the 21st Annual Conference on Learning Theory (COLT 2008), Helsinki, Finland, 2008.
 Bholowalia and Kumar [2014] P. Bholowalia and A. Kumar. EBK-means: A clustering technique based on elbow method and k-means in WSN. International Journal of Computer Applications, 105(9):17–24, November 2014.
 Boucheron et al. [2005] S. Boucheron, O. Bousquet, and G. Lugosi. Theory of classification: A survey of some recent advances. ESAIM: Probability and Statistics, 9:323–375, 2005.
 Chapelle et al. [2010] O. Chapelle, B. Schölkopf, and A. Zien. Semi-Supervised Learning. The MIT Press, 1st edition, 2010.
 Darnstädt et al. [2013] M. Darnstädt, H. U. Simon, and B. Szörényi. Unlabeled data does provably help. In STACS, pages 185–196, 2013.
 Globerson et al. [2017] A. Globerson, R. Livni, and S. Shalev-Shwartz. Effective semi-supervised learning on manifolds. In COLT, volume 65 of JMLR Workshop and Conference Proceedings, pages 978–1003. JMLR.org, 2017.
 Grandvalet and Bengio [2005] Y. Grandvalet and Y. Bengio. Semi-supervised learning by entropy minimization. In CAP, pages 281–296. PUG, 2005.
 Kloft et al. [2009] M. Kloft, U. Brefeld, P. Laskov, K.-R. Müller, A. Zien, and S. Sonnenburg. Efficient and accurate lp-norm multiple kernel learning. In Advances in Neural Information Processing Systems 22, pages 997–1005. Curran Associates, Inc., 2009.
 Mohri et al. [2012] M. Mohri, A. Rostamizadeh, and A. Talwalkar. Foundations of Machine Learning. The MIT Press, 2012.
 Niyogi [2013] P. Niyogi. Manifold regularization and semi-supervised learning: Some theoretical analyses. J. Mach. Learn. Res., 14(1):1229–1250, May 2013.
 Shalev-Shwartz and Ben-David [2014] S. Shalev-Shwartz and S. Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, New York, NY, USA, 2014.
 Sindhwani and Rosenberg [2008] V. Sindhwani and D. S. Rosenberg. An RKHS for multi-view learning and manifold co-regularization. In Proceedings of the 25th International Conference on Machine Learning, ICML '08, pages 976–983, New York, NY, USA, 2008. ACM.
 Sindhwani et al. [2005] V. Sindhwani, P. Niyogi, and M. Belkin. Beyond the point cloud: From transductive to semi-supervised learning. In Proceedings of the 22nd International Conference on Machine Learning, ICML '05, pages 824–831, New York, NY, USA, 2005. ACM.
 Vapnik [1998] V. N. Vapnik. Statistical Learning Theory. WileyInterscience, 1998.