Algorithm for Matrix Completion
A Singular Value Thresholding Algorithm for Matrix Completion
Jian-Feng Cai, Emmanuel J. Candès, Zuowei Shen
Temasek Laboratories, National University of Singapore, Singapore 117543
Applied and Computational Mathematics, Caltech, Pasadena, CA 91125
Department of Mathematics, National University of Singapore, Singapore 117543
September 2008
Abstract
This paper introduces a novel algorithm to approximate the matrix with minimum nuclear norm among all matrices obeying a set of convex constraints. This problem may be understood as the convex relaxation of a rank minimization problem, and arises in many important applications, as in the task of recovering a large matrix from a small subset of its entries (the famous Netflix problem). Off-the-shelf algorithms such as interior point methods are not directly amenable to large problems of this kind with over a million unknown entries.

This paper develops a simple first-order and easy-to-implement algorithm that is extremely efficient at addressing problems in which the optimal solution has low rank. The algorithm is iterative and produces a sequence of matrices {X^k, Y^k}; at each step, it mainly performs a soft-thresholding operation on the singular values of the matrix Y^k. There are two remarkable features making this approach attractive for low-rank matrix completion problems. The first is that the soft-thresholding operation is applied to a sparse matrix; the second is that the rank of the iterates {X^k} is empirically nondecreasing. Both of these facts allow the algorithm to make use of very minimal storage space and keep the computational cost of each iteration low. On the theoretical side, we provide a convergence analysis showing that the sequence of iterates converges. On the practical side, we provide numerical examples in which 1,000 × 1,000 matrices are recovered in less than a minute on a modest desktop computer. We also demonstrate that our approach is amenable to very large scale problems by recovering matrices of rank about 10 with nearly a billion unknowns from just about 0.4% of their sampled entries. Our methods are connected with the recent literature on linearized Bregman iterations for ℓ1 minimization, and we develop a framework in which one can understand these algorithms in terms of well-known Lagrange multiplier algorithms.
Keywords. Nuclear norm minimization, matrix completion, singular value thresholding, Lagrange dual function, Uzawa's algorithm.
1 Introduction
1.1 Motivation
There is a rapidly growing interest in the recovery of an unknown low-rank or approximately low-
rank matrix from very limited information. This problem occurs in many areas of engineering and
applied science such as machine learning [1, 3, 4], control [42] and computer vision, see [48]. As a motivating example, consider the problem of recovering a data matrix from a sampling of its entries. This routinely comes up whenever one collects partially filled out surveys, and one would like to infer the many missing entries. In the area of recommender systems, users submit ratings on a subset of entries in a database, and the vendor provides recommendations based on the users' preferences. Because users only rate a few items, one would like to infer their preference for unrated items; this is the famous Netflix problem [2]. Recovering a rectangular matrix from a sampling of its entries is known as the matrix completion problem. The issue is of course that this problem is extraordinarily ill posed since with fewer samples than entries, we have infinitely many completions. Therefore, it is apparently impossible to identify which of these candidate solutions is indeed the correct one without some additional information.

In many instances, however, the matrix we wish to recover has low rank or approximately low rank. For instance, the Netflix data matrix of all user-ratings may be approximately low-rank because it is commonly believed that only a few factors contribute to anyone's taste or preference. In computer vision, inferring scene geometry and camera motion from a sequence of images is a well-studied problem known as the structure-from-motion problem. This is an ill-conditioned problem for objects may be distant with respect to their size, or especially for missing data which occur because of occlusion or tracking failures. However, when properly stacked and indexed, these images form a matrix which has very low rank (e.g. rank 3 under orthography) [21, 48]. Other examples of low-rank matrix fitting abound; e.g. in control (system identification), machine learning (multi-class learning) and so on.
Having said this, the premise that the unknown has (approximately) low rank radically changes the problem, making the search for solutions feasible since the lowest-rank solution now tends to be the right one. In a recent paper [13], Candès and Recht showed that matrix completion is not as ill-posed as people thought. Indeed, they proved that most low-rank matrices can be recovered exactly from most sets of sampled entries even though these sets have surprisingly small cardinality, and more importantly, they proved that this can be done by solving a simple convex optimization problem. To state their results, suppose to simplify that the unknown matrix M ∈ R^{n×n} is square, and that one has available m sampled entries {M_ij : (i, j) ∈ Ω} where Ω is a random subset of cardinality m. Then [13] proves that most matrices M of rank r can be perfectly recovered by solving the optimization problem

minimize  ‖X‖_*
subject to  X_ij = M_ij, (i, j) ∈ Ω, (1.1)
provided that the number of samples obeys
m ≥ C n^{6/5} r log n (1.2)
for some positive numerical constant C.¹ In (1.1), the functional ‖X‖_* is the nuclear norm of the matrix X, which is the sum of its singular values. The optimization problem (1.1) is convex and can be recast as a semidefinite program [29, 30]. In some sense, this is the tightest convex relaxation of the NP-hard rank minimization problem
minimize rank(X)
subject to  X_ij = M_ij, (i, j) ∈ Ω, (1.3)
¹ Note that an n × n matrix of rank r depends upon r(2n − r) degrees of freedom.
since the nuclear ball {X : ‖X‖_* ≤ 1} is the convex hull of the set of rank-one matrices with spectral norm bounded by one. Another interpretation of Candès and Recht's result is that under suitable conditions, the rank minimization program (1.3) and the convex program (1.1) are formally equivalent in the sense that they have exactly the same unique solution.
1.2 Algorithm outline
Because minimizing the nuclear norm both provably recovers the lowest-rank matrix subject to constraints (see [45] for related results) and gives generally good empirical results in a variety of situations, it is understandably of great interest to develop numerical methods for solving (1.1). In [13], this optimization problem was solved using one of the most advanced semidefinite programming solvers, namely, SDPT3 [47]. This solver and others like SeDuMi are based on interior-point methods, and are problematic when the size of the matrix is large because they need to solve huge systems of linear equations to compute the Newton direction. In fact, SDPT3 can only handle n × n matrices with n ≤ 100. Presumably, one could resort to iterative solvers such as the method of conjugate gradients to solve for the Newton step, but this is problematic as well since it is well known that the condition number of the Newton system increases rapidly as one gets closer to the solution. In addition, none of these general purpose solvers use the fact that the solution may have low rank. We refer the reader to [40] for some recent progress on interior-point methods concerning some special nuclear norm-minimization problems.

This paper develops the singular value thresholding algorithm for approximately solving the nuclear norm minimization problem (1.1) and, by extension, problems of the form
minimize  ‖X‖_*
subject to  A(X) = b, (1.4)
where A is a linear operator acting on the space of n₁ × n₂ matrices and b ∈ R^m. This algorithm is a simple first-order method, and is especially well suited for problems of very large sizes in which the solution has low rank. We sketch this algorithm in the special matrix completion setting. Let P_Ω be the orthogonal projector onto the span of matrices vanishing outside of Ω so that the (i, j)th component of P_Ω(X) is equal to X_ij if (i, j) ∈ Ω and zero otherwise. Our problem may be expressed as

minimize  ‖X‖_*
subject to  P_Ω(X) = P_Ω(M), (1.5)
with optimization variable X ∈ R^{n₁×n₂}. Fix τ > 0 and a sequence {δ_k}_{k≥1} of scalar step sizes. Then starting with Y^0 = 0 ∈ R^{n₁×n₂}, the algorithm inductively defines

X^k = shrink(Y^{k−1}, τ),
Y^k = Y^{k−1} + δ_k P_Ω(M − X^k) (1.6)
until a stopping criterion is reached. In (1.6), shrink(Y, τ) is a nonlinear function which applies a soft-thresholding rule at level τ to the singular values of the input matrix; see Section 2 for details. The key property here is that for large values of τ, the sequence {X^k} converges to a solution which very nearly minimizes (1.5). Hence, at each step, one only needs to compute at most one singular value decomposition and perform a few elementary matrix additions. Two important remarks are in order:
- Sparsity. For each k ≥ 0, Y^k vanishes outside of Ω and is, therefore, sparse, a fact which can be used to evaluate the shrink function rapidly.
- Low-rank property. The matrices X^k turn out to have low rank, and hence the algorithm has minimum storage requirement since we only need to keep principal factors in memory.
Our numerical experiments demonstrate that the proposed algorithm can solve problems, in Matlab, involving matrices of size 30,000 × 30,000 having close to a billion unknowns in 17 minutes on a standard desktop computer with a 1.86 GHz CPU (dual core with Matlab's multithreading option enabled) and 3 GB of memory. As a consequence, the singular value thresholding algorithm may become a rather powerful computational tool for large scale matrix completion.
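The iteration (1.6) is short enough to sketch directly. Below is a minimal NumPy sketch (ours, not the authors' Matlab implementation): the step size follows the 1.2·n₁n₂/m heuristic mentioned later in the paper, capped below 2 to satisfy the step-size condition of Theorem 4.2, and a dense SVD is used for simplicity even though Section 2 recommends a sparse partial SVD.

```python
import numpy as np

def shrink(Y, tau):
    """The shrink of (1.6): soft-threshold the singular values of Y at level tau."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def svt(M, mask, tau, delta, n_iter=1000):
    """Iteration (1.6): X^k = shrink(Y^{k-1}, tau); Y^k = Y^{k-1} + delta*P_Omega(M - X^k)."""
    Y = np.zeros_like(M)
    for _ in range(n_iter):
        X = shrink(Y, tau)
        Y += delta * mask * (M - X)   # the mask implements the projector P_Omega
    return X

# Recover a 30 x 30 rank-2 matrix from roughly 60% of its entries.
rng = np.random.default_rng(0)
n, r = 30, 2
M = rng.standard_normal((n, r)) @ rng.standard_normal((r, n))
mask = rng.random((n, n)) < 0.6                   # the sampled set Omega
tau = 5 * n                                       # large tau, per the discussion above
delta = min(1.9, 1.2 * n * n / mask.sum())        # heuristic step, kept below 2
X = svt(M, mask, tau, delta)
rel_err = np.linalg.norm(X - M, 'fro') / np.linalg.norm(M, 'fro')
```

With far more samples than the r(2n − r) degrees of freedom, the iterate settles close to the true matrix.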
1.3 General formulation
The singular value thresholding algorithm can be adapted to deal with other types of convex constraints. For instance, it may address problems of the form
minimize  ‖X‖_*
subject to  f_i(X) ≤ 0, i = 1, …, m, (1.7)
where each f_i is a Lipschitz convex function (note that one can handle linear equality constraints by considering pairs of affine functionals). In the simpler case where the f_i's are affine functionals, the general algorithm goes through a sequence of iterations which greatly resemble (1.6). This is useful because it enables the development of numerical algorithms which are effective for recovering matrices from a small subset of sampled entries possibly contaminated with noise.
1.4 Contents and notations
The rest of the paper is organized as follows. In Section 2, we derive the singular value thresholding (SVT) algorithm for the matrix completion problem, and recast it in terms of a well-known Lagrange multiplier algorithm. In Section 3, we extend the SVT algorithm and formulate a general iteration which is applicable to general convex constraints. In Section 4, we establish the convergence results for the iterations given in Sections 2 and 3. We demonstrate the performance and effectiveness of the algorithm through numerical examples in Section 5, and review additional implementation details. Finally, we conclude the paper with a short discussion in Section 6.

Before continuing, we provide here a brief summary of the notations used throughout the paper. Matrices are bold capital, vectors are bold lowercase and scalars or entries are not bold. For instance, X is a matrix and X_ij its (i, j)th entry. Likewise, x is a vector and x_i its ith component. The nuclear norm of a matrix is denoted by ‖X‖_*, the Frobenius norm by ‖X‖_F and the spectral norm by ‖X‖_2; note that these are respectively the 1-norm, the 2-norm and the sup-norm of the vector of singular values. The adjoint of a matrix X is X* and similarly for vectors. The notation diag(x), where x is a vector, stands for the diagonal matrix with {x_i} as diagonal elements. We denote by ⟨X, Y⟩ = trace(X*Y) the standard inner product between two matrices (so that ‖X‖_F² = ⟨X, X⟩). The Cauchy-Schwarz inequality gives ⟨X, Y⟩ ≤ ‖X‖_F ‖Y‖_F, and it is well known that we also have ⟨X, Y⟩ ≤ ‖X‖_* ‖Y‖_2 (the spectral and nuclear norms are dual to one another); see e.g. [13, 45].
2 The Singular Value Thresholding Algorithm
This section introduces the singular value thresholding algorithm and discusses some of its basic properties. We begin with the definition of a key building block, namely, the singular value thresholding operator.
2.1 The singular value shrinkage operator
Consider the singular value decomposition (SVD) of a matrix X ∈ R^{n₁×n₂} of rank r,

X = UΣV*,  Σ = diag({σ_i}_{1≤i≤r}), (2.1)

where U and V are respectively n₁ × r and n₂ × r matrices with orthonormal columns, and the singular values σ_i are positive (unless specified otherwise, we will always assume that the SVD of a matrix is given in the reduced form above). For each τ ≥ 0, we introduce the soft-thresholding operator D_τ defined as follows:

D_τ(X) := U D_τ(Σ) V*,  D_τ(Σ) = diag({(σ_i − τ)_+}), (2.2)
where t_+ is the positive part of t, namely, t_+ = max(0, t). In words, this operator simply applies a soft-thresholding rule to the singular values of X, effectively shrinking these towards zero. This is the reason why we will also refer to this transformation as the singular value shrinkage operator. Even though the SVD may not be unique, it is easy to see that the singular value shrinkage operator is well defined, and we do not elaborate further on this issue. In some sense, this shrinkage operator is a straightforward extension of the soft-thresholding rule for scalars and vectors. In particular, note that if many of the singular values of X are below the threshold τ, the rank of D_τ(X) may be considerably lower than that of X, just like the soft-thresholding rule applied to vectors leads to sparser outputs whenever some entries of the input are below threshold. The singular value thresholding operator is the proximity operator associated with the nuclear norm. Details about the proximity operator can be found in e.g. [35].
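The rank-reducing effect of D_τ is easy to see numerically. The sketch below (ours, for illustration) applies (2.2) to a rank-3 matrix corrupted by small full-rank noise: with τ above the noise level, the thresholded output is exactly rank 3 again.

```python
import numpy as np

def D(tau, X):
    """The singular value shrinkage operator (2.2)."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

rng = np.random.default_rng(1)
# rank-3 signal plus small full-rank noise
X = rng.standard_normal((50, 3)) @ rng.standard_normal((3, 40))
X += 0.01 * rng.standard_normal((50, 40))

rank_before = np.linalg.matrix_rank(X)          # full rank because of the noise
rank_after = np.linalg.matrix_rank(D(1.0, X))   # tau = 1.0 kills the noise directions
```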
Theorem 2.1 For each τ ≥ 0 and Y ∈ R^{n₁×n₂}, the singular value shrinkage operator (2.2) obeys

D_τ(Y) = arg min_X { (1/2)‖X − Y‖_F² + τ‖X‖_* }. (2.3)
Proof. Since the function h₀(X) := τ‖X‖_* + (1/2)‖X − Y‖_F² is strictly convex, it is easy to see that there exists a unique minimizer, and we thus need to prove that it is equal to D_τ(Y). To do this, recall the definition of a subgradient of a convex function f : R^{n₁×n₂} → R. We say that Z is a subgradient of f at X₀, denoted Z ∈ ∂f(X₀), if

f(X) ≥ f(X₀) + ⟨Z, X − X₀⟩ (2.4)
for all X. Now X̂ minimizes h₀ if and only if 0 is a subgradient of the functional h₀ at the point X̂, i.e.

0 ∈ X̂ − Y + τ ∂‖X̂‖_*, (2.5)
where ∂‖X̂‖_* is the set of subgradients of the nuclear norm. Let X ∈ R^{n₁×n₂} be an arbitrary matrix and UΣV* be its SVD. It is known [13, 37, 49] that

∂‖X‖_* = { UV* + W : W ∈ R^{n₁×n₂}, U*W = 0, WV = 0, ‖W‖_2 ≤ 1 }. (2.6)
Set X̂ := D_τ(Y) for short. In order to show that X̂ obeys (2.5), decompose the SVD of Y as

Y = U₀Σ₀V₀* + U₁Σ₁V₁*,

where U₀, V₀ (resp. U₁, V₁) are the singular vectors associated with singular values greater than τ (resp. smaller than or equal to τ). With these notations, we have

X̂ = U₀(Σ₀ − τI)V₀*

and, therefore,

Y − X̂ = τ(U₀V₀* + W),  W = τ^{−1} U₁Σ₁V₁*.

By definition, U₀*W = 0, WV₀ = 0 and since the diagonal elements of Σ₁ have magnitudes bounded by τ, we also have ‖W‖_2 ≤ 1. Hence Y − X̂ ∈ τ ∂‖X̂‖_*, which concludes the proof.
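Theorem 2.1 can also be probed numerically: the objective in (2.3) evaluated at D_τ(Y) should never exceed its value at perturbations of that point. A small sanity check of ours (not from the paper):

```python
import numpy as np

def D(tau, Y):
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def prox_objective(X, Y, tau):
    """The functional of (2.3): (1/2)||X - Y||_F^2 + tau*||X||_*."""
    return 0.5 * np.linalg.norm(X - Y, 'fro')**2 + tau * np.linalg.norm(X, 'nuc')

rng = np.random.default_rng(2)
Y, tau = rng.standard_normal((8, 6)), 1.5
Xhat = D(tau, Y)
base = prox_objective(Xhat, Y, tau)

# A global minimizer beats every perturbation of itself.
is_minimizer = all(
    prox_objective(Xhat + 0.1 * rng.standard_normal((8, 6)), Y, tau) >= base - 1e-10
    for _ in range(200)
)
```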
2.2 Shrinkage iterations
We are now in the position to introduce the singular value thresholding algorithm. Fix τ > 0 and a sequence {δ_k} of positive step sizes. Starting with Y^0 = 0, inductively define for k = 1, 2, …,

X^k = D_τ(Y^{k−1}),
Y^k = Y^{k−1} + δ_k P_Ω(M − X^k) (2.7)
until a stopping criterion is reached (we postpone the discussion of this stopping criterion and of the choice of step sizes). This shrinkage iteration is very simple to implement. At each step, we only need to compute an SVD and perform elementary matrix operations. With the help of a standard numerical linear algebra package, the whole algorithm can be coded in just a few lines. Before addressing further computational issues, we would like to make explicit the relationship between this iteration and the original problem (1.1). In Section 4, we will show that the sequence {X^k} converges to the unique solution of an optimization problem closely related to (1.1), namely,

minimize  τ‖X‖_* + (1/2)‖X‖_F²
subject to  P_Ω(X) = P_Ω(M). (2.8)
Furthermore, it is intuitive that the solution to this modified problem converges to that of (1.5) as τ → ∞, as shown in Section 3. Thus by selecting a large value of the parameter τ, the sequence of iterates converges to a matrix which nearly minimizes (1.1). As mentioned earlier, there are two crucial properties which make this algorithm ideally suited for matrix completion.
- Low-rank property. A remarkable empirical fact is that the matrices in the sequence {X^k} have low rank (provided, of course, that the solution to (2.8) has low rank). We use the word empirical because all of our numerical experiments have produced low-rank sequences but we cannot rigorously prove that this is true in general. The reason for this phenomenon is, however, simple: because we are interested in large values of τ (as to better approximate the solution to (1.1)), the thresholding step happens to kill most of the small singular values and produces a low-rank output. In fact, our numerical results show that the rank of X^k is nondecreasing with k, and the maximum rank is reached in the last steps of the algorithm, see Section 5.
Thus, when the rank of the solution is substantially smaller than either dimension of the matrix, the storage requirement is low since we could store each X^k in its SVD form (note that we only need to keep the current iterate and may discard earlier values).
- Sparsity. Another important property of the SVT algorithm is that the iteration matrix Y^k is sparse. Since Y^0 = 0, we have by induction that Y^k vanishes outside of Ω. The fewer entries available, the sparser Y^k. Because the sparsity pattern Ω is fixed throughout, one can then apply sparse matrix techniques to save storage. Also, if |Ω| = m, the computational cost of updating Y^k is of order m. Moreover, we can call subroutines supporting sparse matrix computations, which can further reduce computational costs. One such subroutine is the SVD. However, note that we do not need to compute the entire SVD of Y^k to apply the singular value thresholding operator. Only the part corresponding to singular values greater than τ is needed. Hence, a good strategy is to apply the iterative Lanczos algorithm to compute the first few singular values and singular vectors. Because Y^k is sparse, Y^k can be applied to arbitrary vectors rapidly, and this procedure offers a considerable speedup over naive methods.
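The Lanczos strategy just described is available in standard packages. For instance, SciPy's `svds` computes only the leading singular triplets of a sparse matrix by repeated matrix-vector products, without ever forming a dense SVD; this is our illustrative choice, not the paper's (the paper's experiments use Matlab).

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import svds

rng = np.random.default_rng(3)
n = 200
# Y^k is supported on Omega only, so store it as a sparse matrix.
Y = sp.random(n, n, density=0.05, format='csr', random_state=3)

# Lanczos-type computation of the 5 leading singular triplets.
U, s, Vt = svds(Y, k=5)
s = np.sort(s)[::-1]                     # svds returns ascending order

# Agreement with the top of the full (dense) SVD.
s_full = np.linalg.svd(Y.toarray(), compute_uv=False)[:5]
max_gap = float(np.max(np.abs(s - s_full)))
```

In the SVT setting one would increase k until the smallest computed singular value drops below τ, so that all values above the threshold are in hand.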
2.3 Relation with other works
Our algorithm is inspired by recent work in the area of ℓ1 minimization, and especially by the work on linearized Bregman iterations for compressed sensing; see [9-11, 23, 44, 51] for linearized Bregman iterations and [14-17, 26] for some information about the field of compressed sensing. In this line of work, linearized Bregman iterations are used to find the solution to an underdetermined system of linear equations with minimum ℓ1 norm. In fact, Theorem 2.1 asserts that the singular value thresholding algorithm can be formulated as a linearized Bregman iteration. Bregman iterations were first introduced in [43] as a convenient tool for solving computational problems in the imaging sciences, and a later paper [51] showed that they were useful for solving ℓ1-norm minimization problems in the area of compressed sensing. Linearized Bregman iterations were proposed in [23] to improve performance of plain Bregman iterations, see also [51]. Additional details together with a technique for improving the speed of convergence called kicking are described in [44]. On the practical side, the paper [11] applied Bregman iterations to solve a deblurring problem while on the theoretical side, the references [9, 10] gave a rigorous analysis of the convergence of such iterations. New developments keep on coming out at a rapid pace and recently, [32] introduced a new iteration, the split Bregman iteration, to extend Bregman-type iterations (such as linearized Bregman iterations) to problems involving the minimization of ℓ1-like functionals such as total-variation norms, Besov norms, and so forth. When applied to ℓ1-minimization problems, linearized Bregman iterations are sequences of soft-thresholding rules operating on vectors.
Iterative soft-thresholding algorithms in connection with ℓ1 or total-variation minimization have quite a bit of history in signal and image processing, and we would like to mention the works [12, 39] for total-variation minimization, [24, 25, 31] for ℓ1 minimization, and [5, 7, 8, 19, 20, 27, 28, 46] for some recent applications in the area of image inpainting and image restoration. Just as iterative soft-thresholding methods are designed to find sparse solutions, our iterative singular value thresholding scheme is designed to find a sparse vector of singular values. In classical problems arising in the areas of compressed sensing, and signal or image processing, the sparsity is expressed in a known transformed domain and soft-thresholding is applied to transformed coefficients. In contrast, the shrinkage operator D_τ is adaptive. The SVT not
only discovers a sparse singular vector but also the bases in which we have a sparse representation. In this sense, the SVT algorithm is an extension of earlier iterative soft-thresholding schemes. Finally, we would like to contrast the SVT iteration (2.7) with the popular iterative soft-thresholding algorithm used in many papers in image processing and perhaps best known under the name of Proximal Forward-Backward Splitting method (PFBS), see [8, 22, 24, 31, 33] for example. The constrained minimization problem (1.5) may be relaxed into

minimize  λ‖X‖_* + (1/2)‖P_Ω(X) − P_Ω(M)‖_F² (2.9)
for some λ > 0. Theorem 2.1 asserts that D_λ is the proximity operator of λ‖X‖_*, and Proposition 3.1(iii) in [22] gives that the solution to this unconstrained problem is characterized by the fixed point equation X = D_{λδ}(X + δP_Ω(M − X)) for each δ > 0. One can then apply a simplified version of the PFBS method (see (3.6) in [22]) to obtain iterations of the form

X^k = D_{λδ_{k−1}}(X^{k−1} + δ_{k−1} P_Ω(M − X^{k−1})).

Introducing an intermediate matrix Y^k, this algorithm may be expressed as

X^k = D_{λδ_{k−1}}(Y^{k−1}),
Y^k = X^k + δ_k P_Ω(M − X^k). (2.10)
The difference with (2.7) may seem subtle at first (replacing X^k in (2.10) with Y^{k−1} and setting δ_k = δ gives (2.7) with τ = λδ) but has enormous consequences as this gives entirely different algorithms. First, they have different limits: while (2.7) converges to the solution of the constrained minimization (2.8), (2.10) converges to the solution of (2.9) provided that the sequence of step sizes is appropriately selected. Second, selecting a large λ (or a large value of τ = λδ) in (2.10) gives a low-rank sequence of iterates and a limit with small nuclear norm. The limit, however, does not fit the data and this is why one has to choose a small or moderate value of λ (or of τ = λδ). However, when λ is not sufficiently large, the X^k may not have low rank even though the solution has low rank (and one may need to compute many singular vectors), and Y^k is not sufficiently sparse to make the algorithm computationally attractive. Moreover, the limit does not necessarily have a small nuclear norm. These are the reasons why (2.10) is not suitable for matrix completion.
2.4 Interpretation as a Lagrange multiplier method
In this section, we recast the SVT algorithm as a type of Lagrange multiplier algorithm known as Uzawa's algorithm. An important consequence is that this will allow us to extend the SVT algorithm to other problems involving the minimization of the nuclear norm under convex constraints, see Section 3. Further, another contribution of this paper is that this framework actually recasts linear Bregman iterations as a very special form of Uzawa's algorithm, hence providing fresh and clear insights about these iterations. In what follows, we set f_τ(X) = τ‖X‖_* + (1/2)‖X‖_F² for some fixed τ > 0 and recall that we wish to solve (2.8)

minimize  f_τ(X)
subject to  P_Ω(X) = P_Ω(M).
The Lagrangian for this problem is given by

L(X, Y) = f_τ(X) + ⟨Y, P_Ω(M − X)⟩,

where Y ∈ R^{n₁×n₂}. Strong duality holds, and X* and Y* are primal-dual optimal if (X*, Y*) is a saddlepoint of the Lagrangian L(X, Y), i.e. a pair obeying

sup_Y inf_X L(X, Y) = L(X*, Y*) = inf_X sup_Y L(X, Y). (2.11)
(The function g₀(Y) = inf_X L(X, Y) is called the dual function.) Uzawa's algorithm approaches the problem of finding a saddlepoint with an iterative procedure. From Y^0 = 0, say, inductively define

L(X^k, Y^{k−1}) = min_X L(X, Y^{k−1}),
Y^k = Y^{k−1} + δ_k P_Ω(M − X^k), (2.12)
where {δ_k}_{k≥1} is a sequence of positive step sizes. Uzawa's algorithm is, in fact, a subgradient method applied to the dual problem, where each step moves the current iterate in the direction of the gradient or of a subgradient. Indeed, observe that

∂_Y g₀(Y) = ∂_Y L(X̃, Y) = P_Ω(M − X̃), (2.13)

where X̃ is the minimizer of the Lagrangian for that value of Y, so that a gradient ascent update for Y is of the form

Y^k = Y^{k−1} + δ_k ∂_Y g₀(Y^{k−1}) = Y^{k−1} + δ_k P_Ω(M − X^k).
It remains to compute the minimizer of the Lagrangian (2.12), and note that

arg min_X { f_τ(X) + ⟨Y, P_Ω(M − X)⟩ } = arg min_X { τ‖X‖_* + (1/2)‖X − P_Ω(Y)‖_F² }. (2.14)
However, we know that the minimizer is given by D_τ(P_Ω(Y)), and since Y^k = P_Ω(Y^k) for all k ≥ 0, Uzawa's algorithm takes the form

X^k = D_τ(Y^{k−1}),
Y^k = Y^{k−1} + δ_k P_Ω(M − X^k),

which is exactly the update (2.7). This point of view brings to bear many different mathematical tools for proving the convergence of the singular value thresholding iterations. For an early use of Uzawa's algorithm minimizing an ℓ1-like functional, the total-variation norm, under linear inequality constraints, see [12].
3 General Formulation
This section presents a general formulation of the SVT algorithm for approximately minimizing the nuclear norm of a matrix under convex constraints.
3.1 Linear equality constraints
Set the objective functional f_τ(X) = τ‖X‖_* + (1/2)‖X‖_F² for some fixed τ > 0, and consider the following optimization problem:
minimize  f_τ(X)
subject to  A(X) = b, (3.1)
where A is a linear transformation mapping n₁ × n₂ matrices into R^m (A* is the adjoint of A). This more general formulation is considered in [13] and [45] as an extension of the matrix completion problem. Then the Lagrangian for this problem is of the form
L(X, y) = f_τ(X) + ⟨y, b − A(X)⟩, (3.2)
where X ∈ R^{n₁×n₂} and y ∈ R^m, and starting with y^0 = 0, Uzawa's iteration is given by

X^k = D_τ(A*(y^{k−1})),
y^k = y^{k−1} + δ_k (b − A(X^k)). (3.3)
The iteration (3.3) is of course the same as (2.7) in the case where A is a sampling operator extracting m entries with indices in Ω out of an n₁ × n₂ matrix. To verify this claim, observe that in this situation, A*A = P_Ω, and let M be any matrix obeying A(M) = b. Then defining Y^k = A*(y^k) and substituting this expression in (3.3) gives (2.7).
3.2 General convex constraints
One can also adapt the algorithm to handle general convex constraints. Suppose we wish to minimize f_τ(X) defined as before over a convex set X ∈ C. To simplify, we will assume that this convex set is given by

C = {X : f_i(X) ≤ 0, i = 1, …, m},
where the f_i's are convex functionals (note that one can handle linear equality constraints by considering pairs of affine functionals). The problem of interest is then of the form
minimize  f_τ(X)
subject to  f_i(X) ≤ 0, i = 1, …, m. (3.4)
Just as before, it is intuitive that as τ → ∞, the solution to this problem converges to a minimizer of the nuclear norm under the same constraints (1.7), as shown in Theorem 3.1 at the end of this section. Put F(X) := (f₁(X), …, f_m(X)) for short. Then the Lagrangian for (3.4) is equal to
L(X, y) = f_τ(X) + ⟨y, F(X)⟩,
where X ∈ R^{n₁×n₂} and y ∈ R^m is now a vector with nonnegative components denoted, as usual, by y ≥ 0. One can apply Uzawa's method just as before, with the only modification that we will use a subgradient method with projection to maximize the dual function, since we need to make sure that the successive updates y^k belong to the nonnegative orthant. This gives

X^k = arg min_X { f_τ(X) + ⟨y^{k−1}, F(X)⟩ },
y^k = [y^{k−1} + δ_k F(X^k)]_+. (3.5)
Above, x_+ is of course the vector with entries equal to max(x_i, 0). When F is an affine mapping of the form F(X) = b − A(X), so that one solves

minimize  f_τ(X)
subject to  A(X) ≥ b,
this simplifies to

X^k = D_τ(A*(y^{k−1})),
y^k = [y^{k−1} + δ_k (b − A(X^k))]_+, (3.6)
and thus the extension to linear inequality constraints is straightforward.
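The only new ingredient relative to (3.3) is the projection [·]_+ onto the nonnegative orthant, which in code is a componentwise maximum with zero. A minimal sketch of the dual update in (3.6); the residual here is synthetic, standing in for b − A(X^k):

```python
import numpy as np

rng = np.random.default_rng(4)
m, delta = 6, 0.9
y = np.zeros(m)                                 # y^0 = 0

for _ in range(25):
    residual = rng.standard_normal(m)           # stand-in for b - A(X^k) at step k
    y = np.maximum(y + delta * residual, 0.0)   # [.]_+ keeps y in the nonnegative orthant

stays_nonneg = bool((y >= 0).all())
```

Whatever the residuals, the clipped update never leaves the feasible dual region, which is exactly what the projected subgradient method requires.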
3.3 Example
An interesting example concerns the extension of the Dantzig selector [18] to matrix problems. Suppose we have available linear measurements about a matrix M of interest

b = A(M) + z, (3.7)
where z ∈ R^m is a noise vector. Then under these circumstances, one might want to find the matrix which minimizes the nuclear norm among all matrices which are consistent with the data b. Inspired by the work on the Dantzig selector, which was originally developed for estimating sparse parameter vectors from noisy data, one could approach this problem by solving
minimize  ‖X‖_*
subject to  |vec(A*(r))| ≤ vec(E),  r := b − A(X), (3.8)
where E is an array of tolerances, which is adjusted to fit the noise statistics [18]. Above, vec(A) ≤ vec(B), for any two matrices A and B, means componentwise inequalities; that is, A_ij ≤ B_ij for all indices i, j. We use this notation so as not to confuse the reader with the positive semidefinite ordering. In the case of the matrix completion problem where A extracts sampled entries indexed by Ω, one can always see the data vector as the sampled entries of some matrix B obeying P_Ω(B) = A*(b); the constraint is then natural, for it may be expressed as

|B_ij − X_ij| ≤ E_ij, (i, j) ∈ Ω.
If z is white noise with standard deviation σ, one may want to use a multiple of σ for E_ij. In words, we are looking for a matrix with minimum nuclear norm under the constraint that all of its sampled entries do not deviate too much from what has been observed. Let Y₊ ∈ R^{n₁×n₂} (resp. Y₋ ∈ R^{n₁×n₂}) be the Lagrange multiplier associated with the componentwise linear inequality constraints vec(A*(r)) ≤ vec(E) (resp. vec(A*(r)) ≥ −vec(E)). Then starting with Y₊^0 = Y₋^0 = 0, the SVT iteration for this problem is of the form

X^k = D_τ(A*A(Y₊^{k−1} − Y₋^{k−1})),
Y±^k = [Y±^{k−1} ± δ_k (A*(r^k) ∓ E)]_+,  r^k = b − A(X^k), (3.9)
where again [·]_+ is applied componentwise. We conclude by noting that in the matrix completion problem where A*A = P_Ω and one observes P_Ω(B), one can check that this iteration simplifies to

X^k = D_τ(Y₊^{k−1} − Y₋^{k−1}),
Y±^k = [Y±^{k−1} ± δ_k P_Ω((B − X^k) ∓ E)]_+. (3.10)
Again, this is easy to implement and whenever the solution has low rank, the iterates X^k have low rank as well.
3.4 When the proximal problem gets close
We now show that minimizing the proximal objective f_τ(X) = τ‖X‖_* + (1/2)‖X‖_F² is the same as minimizing the nuclear norm in the limit of large τ's. The theorem below is general and covers the special case of linear equality constraints as in (2.8).
Theorem 3.1 Let X_τ be the solution to (3.4) and X_∞ be the minimum Frobenius-norm solution to (1.7) defined as

X_∞ := arg min_X {‖X‖_F² : X is a solution of (1.7)}. (3.11)
Assume that the f_i(X)'s, 1 ≤ i ≤ m, are convex and lower semi-continuous. Then

lim_{τ→∞} ‖X_τ − X_∞‖_F = 0. (3.12)
Proof. It follows from the definition of X_τ and X_∞ that

τ‖X_τ‖_* + (1/2)‖X_τ‖_F² ≤ τ‖X_∞‖_* + (1/2)‖X_∞‖_F²,  and  ‖X_∞‖_* ≤ ‖X_τ‖_*. (3.13)
Summing these two inequalities gives

‖X_τ‖_F² ≤ ‖X_∞‖_F², (3.14)
which implies that ‖X_τ‖_F² is bounded uniformly in τ. Thus, we would prove the theorem if we could establish that any convergent subsequence {X_τ_k}_{k≥1} must converge to X_∞. Consider an arbitrary converging subsequence {X_τ_k} and set X_c := lim_{k→∞} X_τ_k. Since for each 1 ≤ i ≤ m, f_i(X_τ_k) ≤ 0 and f_i is lower semi-continuous, X_c obeys

f_i(X_c) ≤ 0, i = 1, …, m. (3.15)
Furthermore, since ‖X_τ‖_F² is bounded, (3.13) yields

lim sup_{τ→∞} ‖X_τ‖_* ≤ ‖X_∞‖_*,  ‖X_∞‖_* ≤ lim inf_{τ→∞} ‖X_τ‖_*.

An immediate consequence is lim_{τ→∞} ‖X_τ‖_* = ‖X_∞‖_* and, therefore, ‖X_c‖_* = ‖X_∞‖_*. This shows that X_c is a solution to (1.7). Now it follows from the definition of X_∞ that ‖X_c‖_F ≥ ‖X_∞‖_F, while we also have ‖X_c‖_F ≤ ‖X_∞‖_F because of (3.14). We conclude that ‖X_c‖_F = ‖X_∞‖_F and thus X_c = X_∞ since X_∞ is unique.
4 Convergence Analysis
This section establishes the convergence of the SVT iterations. We begin with the simpler proof of the convergence of (2.7) in the special case of the matrix completion problem, and then present the argument for the more general constraints (3.5). We hope that this progression will make the second and more general proof more transparent.
4.1 Convergence for matrix completion
We begin by recording a lemma which establishes the strong convexity of the objective f_τ.
Lemma 4.1 Let Z ∈ ∂f_τ(X) and Z' ∈ ∂f_τ(X'). Then

    ⟨Z − Z', X − X'⟩ ≥ ‖X − X'‖²_F.    (4.1)
Proof. An element Z of ∂f_τ(X) is of the form Z = τZ₀ + X, where Z₀ ∈ ∂‖X‖_*, and similarly for Z'. This gives

    ⟨Z − Z', X − X'⟩ = τ⟨Z₀ − Z₀', X − X'⟩ + ‖X − X'‖²_F,

and it thus suffices to show that the first term of the right-hand side is nonnegative. From (2.6), we have that any subgradient of the nuclear norm at X obeys ‖Z₀‖_2 ≤ 1 and ⟨Z₀, X⟩ = ‖X‖_*. In particular, this gives

    |⟨Z₀, X'⟩| ≤ ‖Z₀‖_2 ‖X'‖_* ≤ ‖X'‖_*,  |⟨Z₀', X⟩| ≤ ‖Z₀'‖_2 ‖X‖_* ≤ ‖X‖_*.

Whence,

    ⟨Z₀ − Z₀', X − X'⟩ = ⟨Z₀, X⟩ + ⟨Z₀', X'⟩ − ⟨Z₀, X'⟩ − ⟨Z₀', X⟩
                       = ‖X‖_* + ‖X'‖_* − ⟨Z₀, X'⟩ − ⟨Z₀', X⟩ ≥ 0,
which proves the lemma.
This lemma is key in showing that the SVT algorithm (2.7) converges.
Theorem 4.2 Suppose that the sequence of step sizes obeys 0 < inf δ_k ≤ sup δ_k < 2. Then the sequence {X^k} obtained via (2.7) converges to the unique solution of (2.8).
Proof. Let (X^⋆, Y^⋆) be primal-dual optimal for the problem (2.8). The optimality conditions give

    0 = Z^k − P_Ω(Y^{k−1}),
    0 = Z^⋆ − P_Ω(Y^⋆),

for some Z^k ∈ ∂f_τ(X^k) and some Z^⋆ ∈ ∂f_τ(X^⋆). We then deduce that

    Z^k − Z^⋆ − P_Ω(Y^{k−1} − Y^⋆) = 0

and, therefore, it follows from Lemma 4.1 that

    ⟨X^k − X^⋆, P_Ω(Y^{k−1} − Y^⋆)⟩ = ⟨Z^k − Z^⋆, X^k − X^⋆⟩ ≥ ‖X^k − X^⋆‖²_F.    (4.2)
We continue and observe that because P_Ω(X^⋆) = P_Ω(M),

    ‖P_Ω(Y^k − Y^⋆)‖_F = ‖P_Ω(Y^{k−1} − Y^⋆) + δ_k P_Ω(X^⋆ − X^k)‖_F.

Therefore, setting r_k := ‖P_Ω(Y^k − Y^⋆)‖_F,

    r_k² = r_{k−1}² − 2δ_k ⟨P_Ω(Y^{k−1} − Y^⋆), X^k − X^⋆⟩ + δ_k² ‖P_Ω(X^⋆ − X^k)‖²_F
         ≤ r_{k−1}² − 2δ_k ‖X^k − X^⋆‖²_F + δ_k² ‖X^k − X^⋆‖²_F    (4.3)

since for any matrix X, ‖P_Ω(X)‖_F ≤ ‖X‖_F. Under our assumptions about the size of δ_k, we have 2δ_k − δ_k² ≥ β for all k ≥ 1 and some β > 0, and thus

    r_k² ≤ r_{k−1}² − β ‖X^k − X^⋆‖²_F.    (4.4)
Two properties follow from this:
- The sequence {‖P_Ω(Y^k − Y^⋆)‖_F} is nonincreasing and, therefore, converges to a limit.
- As a consequence, ‖X^k − X^⋆‖²_F → 0 as k → ∞.
The theorem is established.
4.2 General convergence theorem
Our second result is more general and establishes the convergence of the SVT iterations to the solution of (3.4) under general convex constraints. From now on, we will only assume that the function F(X) is Lipschitz in the sense that

    ‖F(X) − F(Y)‖ ≤ L(F) ‖X − Y‖_F,    (4.5)

for some nonnegative constant L(F). Note that if F is affine, F(X) = b − A(X), we have L(F) = ‖A‖_2, where ‖A‖_2 is the spectral norm of the linear transformation A defined as ‖A‖_2 := sup{‖A(X)‖ : ‖X‖_F = 1}. We also recall that F(X) = (f_1(X), ..., f_m(X)), where each f_i is convex, and that the Lagrangian for the problem (3.4) is given by

    L(X, y) = f_τ(X) + ⟨y, F(X)⟩,  y ≥ 0.
We will assume to simplify that strong duality holds, which is automatically true if the constraints obey constraint qualifications such as Slater's condition [6]. We first establish the following preparatory lemma.
Lemma 4.3 Let (X^⋆, y^⋆) be a primal-dual optimal pair for (3.4). Then for each δ > 0, y^⋆ obeys

    y^⋆ = [y^⋆ + δ F(X^⋆)]_+.    (4.6)
Proof. Recall that the projection x₀ of a point x onto a convex set C is characterized by

    x₀ ∈ C,  ⟨y − x₀, x − x₀⟩ ≤ 0 for all y ∈ C.

In the case where C = R^m_+ = {x ∈ R^m : x ≥ 0}, this condition becomes x₀ ≥ 0 and

    ⟨y − x₀, x − x₀⟩ ≤ 0,  for all y ≥ 0.

Now because y^⋆ is dual optimal, we have

    L(X^⋆, y) ≤ L(X^⋆, y^⋆),  for all y ≥ 0.

Substituting the expression for the Lagrangian, this is equivalent to

    ⟨y − y^⋆, F(X^⋆)⟩ ≤ 0,  for all y ≥ 0,

which is the same as

    ⟨y − y^⋆, (y^⋆ + δ F(X^⋆)) − y^⋆⟩ ≤ 0,  for all y ≥ 0, δ ≥ 0.

Hence it follows that y^⋆ must be the projection of y^⋆ + δ F(X^⋆) onto the nonnegative orthant R^m_+. Since the projection of an arbitrary vector x onto R^m_+ is given by x_+, our claim follows.
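The two facts used at the end of this proof, the variational characterization of a projection and the observation that projecting onto R^m_+ takes the componentwise positive part, are easy to sanity-check numerically. The snippet below is our own illustration, not part of the paper.

```python
import numpy as np

def project_nonneg(x):
    """Projection onto the nonnegative orthant R^m_+: the componentwise positive part x_+."""
    return np.maximum(x, 0.0)

# Check the characterization: x0 = P_C(x) iff x0 in C and <y - x0, x - x0> <= 0 for all y in C.
rng = np.random.default_rng(1)
x = rng.standard_normal(8)
x0 = project_nonneg(x)
for _ in range(200):
    y = np.abs(rng.standard_normal(8))  # an arbitrary point of the orthant
    assert np.dot(y - x0, x - x0) <= 1e-12
```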
We are now in the position to state our general convergence result.
Theorem 4.4 Suppose that the sequence of step sizes obeys 0 < inf δ_k ≤ sup δ_k < 2/L(F)², where L(F) is the Lipschitz constant in (4.5). Then assuming strong duality, the sequence {X^k} obtained via (3.5) converges to the unique solution of (3.4).
Proof. Let (X^⋆, y^⋆) be primal-dual optimal for the problem (3.4). We claim that the optimality conditions give that for all X,

    ⟨Z^k, X − X^k⟩ + ⟨y^{k−1}, F(X) − F(X^k)⟩ ≥ 0,
    ⟨Z^⋆, X − X^⋆⟩ + ⟨y^⋆, F(X) − F(X^⋆)⟩ ≥ 0,    (4.7)

for some Z^k ∈ ∂f_τ(X^k) and some Z^⋆ ∈ ∂f_τ(X^⋆). We justify this assertion by proving one of the two inequalities since the other is exactly similar. For the first, X^k minimizes L(X, y^{k−1}) over all X and, therefore, there exist Z^k ∈ ∂f_τ(X^k) and Z_i^k ∈ ∂f_i(X^k), 1 ≤ i ≤ m, such that

    Z^k + Σ_{i=1}^m y_i^{k−1} Z_i^k = 0.
Now because each f_i is convex,

    f_i(X) − f_i(X^k) ≥ ⟨Z_i^k, X − X^k⟩

and, therefore,

    ⟨Z^k, X − X^k⟩ + Σ_{i=1}^m y_i^{k−1} (f_i(X) − f_i(X^k)) ≥ ⟨Z^k + Σ_{i=1}^m y_i^{k−1} Z_i^k, X − X^k⟩ = 0.

This is (4.7). Now write the first inequality in (4.7) for X^⋆, the second for X^k, and sum the two inequalities. This gives

    ⟨Z^k − Z^⋆, X^k − X^⋆⟩ + ⟨y^{k−1} − y^⋆, F(X^k) − F(X^⋆)⟩ ≤ 0.
The rest of the proof is essentially the same as that of Theorem 4.2. It follows from Lemma 4.1 that

    ⟨y^{k−1} − y^⋆, F(X^k) − F(X^⋆)⟩ ≤ −⟨Z^k − Z^⋆, X^k − X^⋆⟩ ≤ −‖X^k − X^⋆‖²_F.    (4.8)
We continue and observe that because y^⋆ = [y^⋆ + δ_k F(X^⋆)]_+ by Lemma 4.3, we have

    ‖y^k − y^⋆‖ = ‖[y^{k−1} + δ_k F(X^k)]_+ − [y^⋆ + δ_k F(X^⋆)]_+‖
               ≤ ‖y^{k−1} − y^⋆ + δ_k (F(X^k) − F(X^⋆))‖

since the projection onto the convex set R^m_+ is a contraction. Therefore,

    ‖y^k − y^⋆‖² ≤ ‖y^{k−1} − y^⋆‖² + 2δ_k ⟨y^{k−1} − y^⋆, F(X^k) − F(X^⋆)⟩ + δ_k² ‖F(X^k) − F(X^⋆)‖²
                 ≤ ‖y^{k−1} − y^⋆‖² − 2δ_k ‖X^k − X^⋆‖²_F + δ_k² L² ‖X^k − X^⋆‖²_F,
where we have put L instead of L(F) for short. Under our assumptions about the size of δ_k, we have 2δ_k − δ_k² L² ≥ β for all k ≥ 1 and some β > 0. Then

    ‖y^k − y^⋆‖² ≤ ‖y^{k−1} − y^⋆‖² − β ‖X^k − X^⋆‖²_F,    (4.9)
and the conclusion is as before.
The problem (3.1) with linear constraints can be reduced to (3.4) by choosing

    F(X) = [b; −b] − [A; −A](X),

i.e., by expressing the equality constraint A(X) = b as the pair of inequality constraints b − A(X) ≤ 0 and A(X) − b ≤ 0,
and we have the following corollary:
Corollary 4.5 Suppose that the sequence of step sizes obeys 0 < inf δ_k ≤ sup δ_k < 2/‖A‖_2². Then the sequence {X^k} obtained via (3.3) converges to the unique solution of (3.1).
Let ‖A‖_2 := sup{‖A(X)‖ : ‖X‖_F = 1}. With F(X) given as above, we have L(F)² = 2‖A‖_2² and thus, Theorem 4.4 guarantees convergence as long as 0 < inf δ_k ≤ sup δ_k < 1/‖A‖_2². However, an argument identical to the proof of Theorem 4.2 would remove the extra factor of two. We omit the details.
5 Implementation and Numerical Results
This section provides implementation details of the SVT algorithm that make it practically effective for matrix completion, such as the numerical evaluation of the singular value thresholding operator, the selection of the step size δ_k, the selection of a stopping criterion, and so on. This section also presents several numerical simulation results which demonstrate the performance and effectiveness of the SVT algorithm. We show that 30,000 × 30,000 matrices of rank 10 are recovered from just about 0.4% of their sampled entries in a matter of a few minutes on a modest desktop computer with a 1.86 GHz CPU (dual core with Matlab's multithreading option enabled) and 3 GB of memory.
5.1 Implementation details
5.1.1 Evaluation of the singular value thresholding operator
To apply the singular value thresholding operator at level τ to an input matrix, it suffices to know those singular values and corresponding singular vectors above the threshold. In the matrix completion problem, the singular value thresholding operator is applied to sparse matrices {Y^k} since the number of sampled entries is typically much lower than the number of entries in the unknown matrix M, and we are hence interested in numerical methods for computing the dominant singular values and singular vectors of large sparse matrices. The development of such methods is a relatively mature area in scientific computing and numerical linear algebra in particular. In fact, many high-quality packages are readily available. Our implementation uses PROPACK; see [36] for documentation and availability. One reason for this choice is convenience: PROPACK comes in a Matlab and a Fortran version, and we find it convenient to use the well-documented Matlab version. More importantly, PROPACK computes the singular values and singular vectors directly, by using the Lanczos bidiagonalization algorithm with partial reorthogonalization. In particular, PROPACK does not compute the eigenvalues and eigenvectors of (Y^k)^* Y^k and Y^k (Y^k)^*, or of an augmented matrix as in the Matlab built-in function svds, for example. Consequently, PROPACK is an efficient (both in terms of number of flops and storage requirement) and stable package for computing the dominant singular values and singular vectors of a large sparse matrix. For information, the available documentation [36] reports a speedup factor of about ten over Matlab's svds. Furthermore, the Fortran version of PROPACK is about 3–4 times faster than the Matlab version. Despite this significant speedup, we have only used the Matlab version, but since the singular value shrinkage operator is by-and-large the dominant cost in the SVT algorithm, we expect that a Fortran implementation would run about 3 to 4 times faster.

As for most SVD packages, though one can specify the number of singular values to compute, PROPACK cannot automatically compute only those singular values exceeding the threshold τ. One must instead specify the number s of singular values ahead of time, and the software will compute the s largest singular values and corresponding singular vectors. To use this package, we must then determine the number s_k of singular values of Y^{k−1} to be computed at the kth iteration. We use the following simple method. Let r_{k−1} = rank(X^{k−1}) be the number of nonzero singular values of X^{k−1} at the previous iteration. Set s_k = r_{k−1} + 1 and compute the first s_k singular values of Y^{k−1}. If some of the computed singular values are already smaller than τ, then s_k is a right choice. Otherwise, increment s_k by a predefined integer ℓ repeatedly until some of the singular values fall below τ. In the experiments, we choose ℓ = 5. Another rule might be to repeatedly multiply s_k by a positive number, e.g. 2, until our criterion is met. Incrementing s_k by a fixed integer works very well in practice; in our experiments, we very rarely need more than one update. We note that it is not necessary to rerun the Lanczos iterations for the first s_k vectors since they have been already computed; only a few new singular values (ℓ of them) need to be numerically evaluated. This can be done by modifying the PROPACK routines. We have not yet modified PROPACK, however. Had we done so, our run times would be decreased.
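The selection rule just described (start from s_k = r_{k−1} + 1 and grow by ℓ until a computed singular value drops below τ) can be sketched as follows. This is our own illustration, with a dense numpy SVD standing in for PROPACK's partial Lanczos SVD; in a real implementation the (partial) SVD call would sit inside the loop and compute only s_k triplets.

```python
import numpy as np

def threshold_with_rank_guess(Y, tau, r_prev, incr=5):
    """Evaluate X = D_tau(Y), asking for s_k = r_prev + 1 singular triplets first
    and growing s_k by `incr` until a computed singular value falls below tau.
    A full SVD stands in here for PROPACK's partial SVD."""
    n = min(Y.shape)
    U, sig, Vt = np.linalg.svd(Y, full_matrices=False)  # sig sorted in decreasing order
    s = min(r_prev + 1, n)
    while s < n and sig[s - 1] > tau:   # all s computed values exceed tau: ask for more
        s = min(s + incr, n)
    d = np.maximum(sig[:s] - tau, 0.0)
    rank = int((sig[:s] > tau).sum())   # r_k = max{j : sigma_j > tau}
    return (U[:, :s] * d) @ Vt[:s], rank
```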
5.1.2 Step sizes
There is a large literature on ways of selecting a step size, but for simplicity, we shall use step sizes that are independent of the iteration count; that is, δ_k = δ for k = 1, 2, .... From Theorem 4.2, convergence of (2.7) for the completion problem is guaranteed provided that 0 < δ < 2. This choice is, however, too conservative and the convergence is typically slow. In our experiments, we use instead

    δ = 1.2 n₁n₂/m,    (5.1)
i.e. 1.2 times the undersampling ratio n₁n₂/m. We give a heuristic justification below.

Consider a fixed matrix A ∈ R^{n₁×n₂}. Under the assumption that the column and row spaces of A are not well aligned with the vectors taken from the canonical bases of R^{n₁} and R^{n₂} respectively (the incoherence assumption in [13]), then with very large probability over the choices of Ω, we have

    (1 − ε) p ‖A‖²_F ≤ ‖P_Ω(A)‖²_F ≤ (1 + ε) p ‖A‖²_F,  p := m/(n₁n₂),    (5.2)

provided that the rank of A is not too large. The probability model is that Ω is a set of sampled entries of cardinality m sampled uniformly at random so that all the choices are equally likely. In (5.2), we want to think of ε as a small constant, e.g. smaller than 1/2. In other words, the energy of A on Ω (the set of sampled entries) is just about proportional to the size of Ω. The near isometry (5.2) is a consequence of Theorem 4.1 in [13], and we omit the details. Now returning to the proof of Theorem 4.2, we see that a sufficient condition for the convergence of (2.7) is that there exists β > 0 with

    −2δ ‖X^⋆ − X^k‖²_F + δ² ‖P_Ω(X^⋆ − X^k)‖²_F ≤ −β ‖X^⋆ − X^k‖²_F,
compare (4.4), which is equivalent to

    0 < δ < 2 ‖X^⋆ − X^k‖²_F / ‖P_Ω(X^⋆ − X^k)‖²_F.
Since ‖P_Ω(X)‖_F ≤ ‖X‖_F for any matrix X ∈ R^{n₁×n₂}, it is safe to select δ < 2. But suppose that we could apply (5.2) to the matrix A = X^⋆ − X^k. Then we could take δ inversely proportional to p; e.g. with ε = 1/4, we could take δ ≤ 1.6 p^{−1}. Below, we shall use the value δ = 1.2 p^{−1}, which allows us to take large steps and still provides convergence, at least empirically. The reason why this is not a rigorous argument is that (5.2) cannot be applied to A = X^⋆ − X^k even though this matrix difference may obey the incoherence assumption. The issue here is that X^⋆ − X^k is not a fixed matrix, but rather depends on Ω since the iterates {X^k} are computed with the knowledge of the sampled set Ω.
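As a tiny transcription (the function name is ours), the heuristic (5.1) reads:

```python
def svt_step_size(n1, n2, m):
    """Heuristic step size (5.1): delta = 1.2 n1 n2 / m = 1.2 / p,
    where p = m / (n1 * n2) is the fraction of observed entries."""
    return 1.2 * n1 * n2 / m
```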
5.1.3 Initial steps
The SVT algorithm starts with Y^0 = 0, and we want to choose a large τ to make sure that the solution of (2.8) is close enough to a solution of (1.1). Define k₀ as that integer obeying

    τ/(δ ‖P_Ω(M)‖_2) ∈ (k₀ − 1, k₀].    (5.3)

Since Y^0 = 0, it is not difficult to see that

    X^k = 0,  Y^k = kδ P_Ω(M),  k = 1, ..., k₀.
To save work, we may simply skip the computations of X^1, ..., X^{k₀}, and start the iteration by computing X^{k₀+1} from Y^{k₀}. This strategy is a special case of a kicking device introduced in [44]; the main idea of such a kicking scheme is that one can jump over a few steps whenever possible. Just like in the aforementioned reference, we can develop similar kicking strategies here as well. Because in our numerical experiments the kicking is rarely triggered, we forgo the description of such strategies.
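A short sketch of this initialization (our own code, not the paper's; `PM` is P_Ω(M) stored as a dense array for simplicity):

```python
import math
import numpy as np

def kicked_start(PM, tau, delta):
    """Return k0 and Y^{k0} = k0 * delta * P_Omega(M), skipping the first k0
    iterations, during which X^k = D_tau(Y^{k-1}) = 0.
    k0 is the integer with tau / (delta * ||P_Omega(M)||_2) in (k0 - 1, k0], see (5.3)."""
    spec = np.linalg.norm(PM, 2)          # spectral norm of the sampled matrix
    k0 = math.ceil(tau / (delta * spec))
    return k0, k0 * delta * PM
```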
5.1.4 Stopping criteria
Here, we discuss stopping criteria for the sequence of SVT iterations (2.7), and present two possibilities. The first is motivated by the first-order optimality conditions, or KKT conditions, tailored to the minimization problem (2.8). By (2.14) and letting ∂_Y g₀(Y) = 0 in (2.13), we see that the solution X^⋆_τ to (2.8) must also verify

    X = D_τ(Y),  P_Ω(X − M) = 0,    (5.4)

where Y is a matrix vanishing outside of Ω. Therefore, to make sure that X^k is close to X^⋆_τ, it is sufficient to check how close (X^k, Y^{k−1}) is to obeying (5.4). By definition, the first equation in (5.4) is always true. Therefore, it is natural to stop (2.7) when the error in the second equation is below a specified tolerance. We suggest stopping the algorithm when

    ‖P_Ω(X^k − M)‖_F / ‖P_Ω(M)‖_F ≤ ε,    (5.5)
where ε is a fixed tolerance, e.g. 10^{−4}. We provide a short heuristic argument justifying this choice below.

In the matrix completion problem, we know that under suitable assumptions

    ‖P_Ω(M)‖²_F ≈ p ‖M‖²_F,

which is just (5.2) applied to the fixed matrix M (the symbol ≈ here means that there is a constant ε as in (5.2)). Suppose we could also apply (5.2) to the matrix X^k − M (which we rigorously cannot since X^k depends on Ω); then we would have

    ‖P_Ω(X^k − M)‖²_F ≈ p ‖X^k − M‖²_F,    (5.6)

and thus

    ‖P_Ω(X^k − M)‖_F / ‖P_Ω(M)‖_F ≈ ‖X^k − M‖_F / ‖M‖_F.
In words, one would control the relative reconstruction error by controlling the relative error on the set of sampled locations.

A second stopping criterion comes from duality theory. Firstly, the iterates X^k are generally not feasible for (2.8), although they become asymptotically feasible. One can construct a feasible point from X^k by projecting it onto the affine space {X : P_Ω(X) = P_Ω(M)} as follows:

    X̄^k = X^k + P_Ω(M − X^k).

As usual, let f_τ(X) = τ‖X‖_* + ½‖X‖²_F and denote by p^⋆ the optimal value of (2.8). Since X̄^k is feasible, we have

    p^⋆ ≤ f_τ(X̄^k) := b_k.
Secondly, using the notations of Section 2.4, duality theory gives that

    a_k := g₀(Y^{k−1}) = L(X^k, Y^{k−1}) ≤ p^⋆.
Therefore, b_k − a_k is an upper bound on the duality gap, and one can stop the algorithm when this quantity falls below a given tolerance. For very large problems in which one holds X^k in reduced SVD form, one may not want to compute the projection X̄^k since this matrix would not have low rank and would require significant storage space (presumably, one would not want to spend much time computing this projection either). Hence, the second method only makes practical sense when the dimensions are not prohibitively large, or when the iterates do not have low rank.
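At small scale, the feasible projection and the upper bound b_k are straightforward to compute; the sketch below is ours and, for the reasons just given, is only sensible when a dense X̄^k fits in memory.

```python
import numpy as np

def upper_bound_bk(Xk, M_obs, mask, tau):
    """Project X^k onto {X : P_Omega(X) = P_Omega(M)} and evaluate b_k = f_tau(Xbar^k).
    M_obs holds the sampled entries (zeros off Omega); mask is the indicator of Omega."""
    Xbar = Xk + mask * (M_obs - Xk)   # Xbar^k = X^k + P_Omega(M - X^k): feasible by construction
    nuclear = np.linalg.svd(Xbar, compute_uv=False).sum()
    bk = tau * nuclear + 0.5 * np.linalg.norm(Xbar, 'fro') ** 2
    return Xbar, bk
```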
5.1.5 Algorithm
We conclude this section by summarizing the implementation details and give the SVT algorithm for matrix completion below (Algorithm 1). Of course, one would obtain a very similar structure for the more general problems of the form (3.1) and (3.4) with linear inequality constraints. For convenience, define for each nonnegative integer s ≤ min{n₁, n₂},

    [U^k, Σ^k, V^k]_s,  k = 1, 2, ...,

where U^k = [u₁^k, ..., u_s^k] and V^k = [v₁^k, ..., v_s^k] are the first s singular vectors of the matrix Y^k, and Σ^k is a diagonal matrix with the first s singular values σ₁^k, ..., σ_s^k on the diagonal.
Algorithm 1: Singular Value Thresholding (SVT) Algorithm
Input: sampled set Ω and sampled entries P_Ω(M), step size δ, tolerance ε, parameter τ, increment ℓ, and maximum iteration count k_max
Output: X^opt
Description: Recover a low-rank matrix M from a subset of sampled entries

1   Set Y^0 = k₀ δ P_Ω(M)  (k₀ is defined in (5.3))
2   Set r₀ = 0
3   for k = 1 to k_max
4       Set s_k = r_{k−1} + 1
5       repeat
6           Compute [U^{k−1}, Σ^{k−1}, V^{k−1}]_{s_k}
7           Set s_k = s_k + ℓ
8       until σ^{k−1}_{s_k−ℓ} ≤ τ
9       Set r_k = max{j : σ^{k−1}_j > τ}
10      Set X^k = Σ_{j=1}^{r_k} (σ^{k−1}_j − τ) u^{k−1}_j (v^{k−1}_j)^*
11      if ‖P_Ω(X^k − M)‖_F / ‖P_Ω(M)‖_F ≤ ε then break
12      Set Y^k_{ij} = 0 if (i, j) ∉ Ω, and Y^k_{ij} = Y^{k−1}_{ij} + δ(M_{ij} − X^k_{ij}) if (i, j) ∈ Ω
13  end for k
14  Set X^opt = X^k
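For concreteness, here is a minimal dense-numpy transcription of Algorithm 1 (our sketch, not the paper's Matlab/PROPACK code): a full SVD replaces the partial SVD and the s_k bookkeeping of lines 4–8, so it is only meant for small examples.

```python
import numpy as np

def svt_complete(M_obs, mask, tau, delta, eps=1e-4, kmax=500):
    """SVT for matrix completion: M_obs holds the sampled entries P_Omega(M)
    (zeros elsewhere) and mask is the 0/1 indicator of Omega."""
    norm_obs = np.linalg.norm(M_obs, 'fro')
    # line 1, with the kicking device: Y^0 = k0 * delta * P_Omega(M), see (5.3)
    k0 = int(np.ceil(tau / (delta * np.linalg.norm(M_obs, 2))))
    Y = k0 * delta * M_obs
    X = np.zeros_like(M_obs)
    for _ in range(kmax):
        U, s, Vt = np.linalg.svd(Y, full_matrices=False)
        X = (U * np.maximum(s - tau, 0.0)) @ Vt          # X^k = D_tau(Y^{k-1})
        if np.linalg.norm(mask * (X - M_obs), 'fro') <= eps * norm_obs:
            break                                        # stopping rule (5.5)
        Y = Y + delta * mask * (M_obs - X)               # Y^k is supported on Omega
    return X

# Recover a small low-rank matrix from about half of its entries.
rng = np.random.default_rng(0)
n, r = 40, 2
M = rng.standard_normal((n, r)) @ rng.standard_normal((r, n))
mask = (rng.random((n, n)) < 0.5).astype(float)
Xopt = svt_complete(mask * M, mask, tau=5 * n, delta=1.2 / mask.mean())
```

With the parameter choices of this section (τ = 5n, δ = 1.2 p^{−1}), the small example above typically stops well before k_max with a reconstruction error on the order of the tolerance.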
5.2 Numerical results
5.2.1 Linear equality constraints
Our implementation is in Matlab, and all the computational results we are about to report were obtained on a desktop computer with a 1.86 GHz CPU (dual core with Matlab's multithreading option enabled) and 3 GB of memory. In our simulations, we generate n × n matrices of rank r by sampling two n × r factors M_L and M_R independently, each having i.i.d. Gaussian entries, and setting M = M_L M_R^* as it is suggested in [13]. The set of observed entries Ω is sampled uniformly at random among all sets of cardinality m. The recovery is performed via the SVT algorithm (Algorithm 1), and we use

    ‖P_Ω(X^k − M)‖_F / ‖P_Ω(M)‖_F < 10^{−4}    (5.7)

as a stopping criterion. As discussed earlier, the step sizes are constant and we set δ = 1.2 p^{−1}. Throughout this section, we denote the output of the SVT algorithm by X^opt. The parameter τ is chosen empirically and set to τ = 5n. A heuristic argument is as follows. Clearly, we would like the term τ‖M‖_* to dominate the other, namely, ½‖M‖²_F. For products of Gaussian matrices as above, standard random matrix theory asserts that the Frobenius norm of M concentrates around n√r, and that the nuclear norm concentrates around about nr (this should be clear in the simple case where r = 1 and is generally valid). The value τ = 5n makes sure that on the average, the
Unknown M                                  Computational results
size (n × n)      rank (r)   m/d_r   m/n²    time(s)   # iters   relative error
1,000 × 1,000     10         6       0.12    23        117       1.64 × 10^−4
1,000 × 1,000     50         4       0.39    196       114       1.59 × 10^−4
1,000 × 1,000     100        3       0.57    501       129       1.68 × 10^−4
5,000 × 5,000     10         6       0.024   147       123       1.73 × 10^−4
5,000 × 5,000     50         5       0.10    950       108       1.61 × 10^−4
5,000 × 5,000     100        4       0.158   3,339     123       1.72 × 10^−4
10,000 × 10,000   10         6       0.012   281       123       1.73 × 10^−4
10,000 × 10,000   50         5       0.050   2,096     110       1.65 × 10^−4
10,000 × 10,000   100        4       0.080   7,059     127       1.79 × 10^−4
20,000 × 20,000   10         6       0.006   588       124       1.73 × 10^−4
20,000 × 20,000   50         5       0.025   4,581     111       1.66 × 10^−4
30,000 × 30,000   10         6       0.004   1,030     125       1.73 × 10^−4

Table 1: Experimental results for matrix completion. The rank r is the rank of the unknown matrix M, m/d_r is the ratio between the number of sampled entries and the number of degrees of freedom in an n × n matrix of rank r (oversampling ratio), and m/n² is the fraction of observed entries. All the computational results on the right are averaged over five runs.
value of τ‖M‖_* is about 10 times that of ½‖M‖²_F as long as the rank is bounded away from the dimension n.

Our computational results are displayed in Table 1. There, we report the run time in seconds, the number of iterations it takes to reach convergence (5.7), and the relative error of the reconstruction

    relative error = ‖X^opt − M‖_F / ‖M‖_F,    (5.8)

where M is the real unknown matrix. All of these quantities are averaged over five runs. The table also gives the percentage of entries that are observed, namely, m/n², together with a quantity that we may want to think of as the information oversampling ratio. Recall that an n × n matrix of rank r depends upon d_r := r(2n − r) degrees of freedom. Then m/d_r is the ratio between the number of sampled entries and the true dimensionality of an n × n matrix of rank r.

The first observation is that the SVT algorithm performs extremely well in these experiments. In all of our experiments, it takes fewer than 200 SVT iterations to reach convergence. As a consequence, the run times are short. As indicated in the table, we note that one recovers a 1,000 × 1,000 matrix of rank 10 in less than a minute. The algorithm also recovers 30,000 × 30,000 matrices of rank 10 from about 0.4% of their sampled entries in just about 17 minutes. In addition, higher-rank matrices are also efficiently completed: for example, it takes between one and two hours to recover 10,000 × 10,000 matrices of rank 100 and 20,000 × 20,000 matrices of rank 50. We would like to stress that these numbers were obtained on a modest CPU (1.86 GHz). Furthermore, a Fortran implementation is likely to cut down on these numbers by a multiplicative factor typically between three and four.

We also check the validity of the stopping criterion (5.7) by inspecting the relative error defined in (5.8). The table shows that the heuristic and nonrigorous analysis of Section 5.1 holds in practice since the relative reconstruction error is of the same order as ‖P_Ω(X^opt − M)‖_F / ‖P_Ω(M)‖_F ≈ 10^{−4}. Indeed, the overall relative errors reported in Table 1 are all less than 2 × 10^{−4}.
We emphasized all along an important feature of the SVT algorithm, which is that the matrices X^k have low rank. We demonstrate this fact empirically in Figure 1, which plots the rank of X^k versus the iteration count k, and does this for unknown matrices of size 5,000 × 5,000 with different ranks. The plots reveal an interesting phenomenon: in our experiments, the rank of X^k is nondecreasing so that the maximum rank is reached in the final steps of the algorithm. In fact, the rank of the iterates quickly reaches the value r of the true rank. After these few initial steps, the SVT iterations search for that matrix with rank r minimizing the objective functional. As mentioned earlier, the low-rank property is crucial for making the algorithm run fast.
Figure 1: Rank of X^k as a function of k when the unknown matrix M is of size 5,000 × 5,000 and of rank r; the three panels correspond to r = 10, 50, 100.

Finally, we demonstrate the results of the SVT algorithm for matrix completion from noisy sampled entries. Suppose we observe data from the model

    B_{ij} = M_{ij} + Z_{ij},  (i, j) ∈ Ω,    (5.9)

where Z is a zero-mean Gaussian white noise with standard deviation σ. We run the SVT algorithm but stop early, as soon as X^k is consistent with the data and obeys

    ‖P_Ω(X^k − B)‖²_F ≤ (1 + ε) m σ²,    (5.10)

where ε is a small parameter. Our reconstruction M̂ is the first X^k obeying (5.10). The results are shown in Table 2 (the quantities are averages of 5 runs). Define the noise ratio as ‖P_Ω(Z)‖_F / ‖P_Ω(M)‖_F, and the relative error by (5.8). From Table 2, we see that the SVT algorithm works well as the relative error between the recovered and the true data matrix is just about equal to the noise ratio.

The theory of low-rank matrix recovery from noisy data is nonexistent at the moment, and is obviously beyond the scope of this paper. Having said this, we would like to conclude this section with an intuitive and nonrigorous discussion, which may explain why the observed recovery error is within the noise level. Suppose again that M̂ obeys (5.6), namely,

    ‖P_Ω(M̂ − M)‖²_F ≈ p ‖M̂ − M‖²_F.    (5.11)
              Unknown matrix M                       Computational results
noise ratio   size (n × n)    rank (r)   m/d_r   m/n²   time(s)   # iters   relative error
10^−2         1,000 × 1,000   10         6       0.12   10.8      51        0.78 × 10^−2
10^−2         1,000 × 1,000   50         4       0.39   87.7      48        0.95 × 10^−2
10^−2         1,000 × 1,000   100        3       0.57   216       50        1.13 × 10^−2
10^−1         1,000 × 1,000   10         6       0.12   4.0       19        0.72 × 10^−1
10^−1         1,000 × 1,000   50         4       0.39   33.2      17        0.89 × 10^−1
10^−1         1,000 × 1,000   100        3       0.57   85.2      17        1.01 × 10^−1
1             1,000 × 1,000   10         6       0.12   0.9       3         0.52
1             1,000 × 1,000   50         4       0.39   7.8       3         0.63
1             1,000 × 1,000   100        3       0.57   34.8      3         0.69

Table 2: Simulation results for noisy data. The computational results are averaged over five runs.
As mentioned earlier, one condition for this to happen is that M and M̂ have low rank. This is the reason why it is important to stop the algorithm early, as we hope to obtain a solution which is both consistent with the data and has low rank (the limit of the SVT iterations, lim_{k→∞} X^k, will not generally have low rank since there may be no low-rank matrix matching the noisy data). From

    ‖P_Ω(M̂ − M)‖_F ≤ ‖P_Ω(M̂ − B)‖_F + ‖P_Ω(B − M)‖_F,

and the fact that both terms on the right-hand side are on the order of √(mσ²), we would have p‖M̂ − M‖²_F = O(mσ²) by (5.11). In particular, this would give that the relative reconstruction error is on the order of the noise ratio since ‖P_Ω(M)‖²_F ≈ p‖M‖²_F, as observed experimentally.
5.2.2 Linear inequality constraints
We now examine the speed at which one can solve similar problems with linear inequality constraints instead of linear equality constraints. We assume the model (5.9), where the matrix M of rank r is sampled as before, and solve the problem (3.8) by using (3.10). We formulate the inequality constraints in (3.8) with E_{ij} = σ so that one searches for a solution M̂ with minimum nuclear norm among all those matrices whose sampled entries deviate from the observed ones by at most the noise level σ.^2 In this experiment, we adjust σ to be one tenth of a typical absolute entry of M, i.e. σ = 0.1 Σ_{(i,j)∈Ω} |M_{ij}|/m, and the noise ratio as defined earlier is 0.0780. We set n = 1,000, r = 10, and the number m of sampled entries is five times the number of degrees of freedom, i.e. m = 5d_r. Just as before, we set τ = 5n, and choose a constant step size δ = 1.2 p^{−1}.

The results, reported in Figure 2, show that the algorithm behaves just as well with linear inequality constraints. To make this point, we compare our results with those obtained from noiseless data (same unknown matrix and sampled locations). In the noiseless case, it takes about 150 iterations to reach the tolerance ε = 10^{−4}, whereas in the noisy case, convergence occurs in about 200 iterations (Figure 2(a)). In addition, just as in the noiseless problem, the rank of the iterates is nondecreasing and quickly reaches the true value r of the rank of the unknown matrix
^2 This may not be conservative enough from a statistical viewpoint, but this works well in this case, and our emphasis here is on computational rather than statistical issues.
Figure 2: Computational results of the algorithm applied to noisy (linear inequality constraints as in (3.8)) and noiseless data (equality constraints). The blue (resp. red) color is used for the noisy (resp. noiseless) experiment. (a) Plot of the reconstruction errors from noisy and noiseless data as a function of the iteration count. The thin line is the residual relative error ‖P_Ω(X^k − M)‖_F / ‖P_Ω(M)‖_F and the thick line is the overall relative error ‖X^k − M‖_F / ‖M‖_F. (b) Rank of the iterates as a function of the iteration count. (c) Time it takes to compute the singular value thresholding operation as a function of the iteration count. The computer here is a single-core 3.00 GHz Pentium 4 running Matlab 7.2.0.

M we wish to recover (Figure 2(b)). As a consequence, the SVT iterations take about the same amount of time as in the noiseless case (Figure 2(c)), so that the total running time of the algorithm does not appear to be substantially different from that in the noiseless case. We close by pointing out that from a statistical point of view, the recovery of the matrix M from undersampled and noisy entries by the matrix equivalent of the Dantzig selector appears to be accurate since the relative error obeys ‖M̂ − M‖_F / ‖M‖_F = 0.0769 (recall that the noise ratio is about 0.08).
6 Discussion
This paper introduced a novel algorithm, namely, the singular value thresholding algorithm for matrix completion and related nuclear norm minimization problems. This algorithm is easy to implement and surprisingly effective both in terms of computational cost and storage requirement
when the minimum nuclear-norm solution is also the lowest-rank solution. We would like to close this paper by discussing a few open problems and research directions related to this work.

Our algorithm exploits the fact that the sequence of iterates {X^k} has low rank when the minimum nuclear-norm solution has low rank. An interesting question is whether one can prove (or disprove) that in a majority of the cases, this is indeed the case.

It would be interesting to explore other ways of computing D_τ(Y), in other words, the action of the singular value shrinkage operator. Our approach uses the Lanczos bidiagonalization algorithm with partial reorthogonalization, which takes advantage of sparse inputs, but other approaches are possible. We mention two of them.

- A series of papers have proposed the use of randomized procedures for the approximation of a matrix Y with a matrix Z of rank r [38, 41]. When this approximation consists of the truncated SVD retaining the part of the expansion corresponding to singular values greater than τ, this can be used to evaluate D_τ(Y). Some of these algorithms are efficient when the input Y is sparse [41], and it would be interesting to know whether these methods are fast and accurate enough to be used in the SVT iteration (2.7).

- A wide range of iterative methods for computing matrix functions of the general form f(Y) are available today; see [34] for a survey. A valuable research direction is to investigate whether some of these iterative methods, or others to be developed, would provide powerful ways for computing D_τ(Y).

In practice, one would like to solve (2.8) for large values of τ. However, a larger value of τ generally means a slower rate of convergence. A good strategy might be to start with a value of τ which is large enough so that (2.8) admits a low-rank solution, and at the same time for which the algorithm converges rapidly. One could then use a continuation method as in [50] to increase the value of τ sequentially according to a schedule τ₀, τ₁, ..., and use the solution to the previous problem with τ = τ_{i−1} as an initial guess for the solution to the current problem with τ = τ_i (warm starting). We hope to report on this in a separate paper.
Acknowledgments
J-F. C. is supported by the Wavelets and Information Processing Programme under a grant from DSTA, Singapore. E. C. is partially supported by the Waterman Award from the National Science Foundation and by an ONR grant N00014-08-1-0749. Z. S. is supported in part by Grant R-146-000-113-112 from the National University of Singapore. E. C. would like to thank Benjamin Recht and Joel Tropp for fruitful conversations related to this project, and Stephen Becker for his help in preparing the computational results of Section 5.2.2.
References
[1] J. Abernethy, F. Bach, T. Evgeniou, and J.-P. Vert. Low-rank matrix factorization with attributes. Technical Report N24/06/MM, Ecole des Mines de Paris, 2006.
[2] ACM SIGKDD and Netflix. Proceedings of KDD Cup and Workshop, 2007. Proceedings available online at http://www.cs.uic.edu/liub/KDD-cup-2007/proceedings.html.
[3] Y. Amit, M. Fink, N. Srebro, and S. Ullman. Uncovering shared structures in multiclass classification. In Proceedings of the Twenty-fourth International Conference on Machine Learning, 2007.
[4] A. Argyriou, T. Evgeniou, and M. Pontil. Multi-task feature learning. In Neural Information Processing Systems, 2007.
[5] J. Bect, L. Blanc-Féraud, G. Aubert, and A. Chambolle. A ℓ¹-unified variational framework for image restoration. In Proc. Eighth Europ. Conf. Comput. Vision, 2004.
[6] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
[7] J.-F. Cai, R. Chan, L. Shen, and Z. Shen. Restoration of chopped and nodded images by framelets. SIAM J. Sci. Comput., 30(3):1205–1227, 2008.
[8] J.-F. Cai, R. H. Chan, and Z. Shen. A framelet-based image inpainting algorithm. Appl. Comput. Harmon. Anal., 24(2):131–149, 2008.
[9] J.-F. Cai, S. Osher, and Z. Shen. Convergence of the Linearized Bregman Iteration for ℓ₁-norm Minimization, 2008. UCLA CAM Report (08-52).
[10] J.-F. Cai, S. Osher, and Z. Shen.Linearized Bregman Iterations for Compressed Sensing, 2008. Math. Comp., to appear; see also UCLA CAM Report (08-06).
[11] J.-F. Cai, S. Osher, and Z. Shen. Linearized Bregman Iterations for Frame-Based Image Deblurring,
- preprint.
[12] E. J. Candès and F. Guo. New multiscale transforms, minimum total variation synthesis: Applications to edge-preserving image reconstruction. Signal Processing, 82:1519–1543, 2002.
[13] E. J. Candès and B. Recht. Exact Matrix Completion via Convex Optimization, 2008.
[14] E. J. Candès and J. Romberg. Sparsity and incoherence in compressive sampling. Inverse Problems, 23(3):969–985, 2007.
[15] E. J. Candès, J. Romberg, and T. Tao. Robust uncertainty principles: exact signal reconstruction from highly incomplete frequency information. IEEE Trans. Inform. Theory, 52(2):489–509, 2006.
[16] E. J. Candès and T. Tao. Decoding by linear programming. IEEE Trans. Inform. Theory, 51(12):4203–4215, 2005.
[17] E. J. Candès and T. Tao. Near-optimal signal recovery from random projections: universal encoding strategies? IEEE Trans. Inform. Theory, 52(12):5406–5425, 2006.
[18] E. J. Candès and T. Tao. The Dantzig selector: statistical estimation when p is much larger than n. Annals of Statistics, 35:2313–2351, 2007.
[19] A. Chai and Z. Shen. Deconvolution: A wavelet frame approach. Numer. Math., 106(4):529–587, 2007.
[20] R. H. Chan, T. F. Chan, L. Shen, and Z. Shen. Wavelet algorithms for high-resolution image reconstruction. SIAM J. Sci. Comput., 24(4):1408–1432 (electronic), 2003.
[21] P. Chen and D. Suter. Recovering the missing components in a large noisy low-rank matrix: application to SFM. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(8):1051–1063, 2004.
[22] P. L. Combettes and V. R. Wajs. Signal recovery by proximal forward-backward splitting. Multiscale Model. Simul., 4(4):1168–1200 (electronic), 2005.
[23] J. Darbon and S. Osher. Fast discrete optimization for sparse approximations and deconvolutions, 2007. preprint.
[24] I. Daubechies, M. Defrise, and C. De Mol. An iterative thresholding algorithm for linear inverse problems with a sparsity constraint. Comm. Pure Appl. Math., 57(11):1413–1457, 2004.
[25] I. Daubechies, G. Teschke, and L. Vese. Iteratively solving linear inverse problems under general convex constraints. Inverse Probl. Imaging, 1(1):29–46, 2007.
[26] D. L. Donoho. Compressed sensing. IEEE Trans. Inform. Theory, 52(4):1289–1306, 2006.
[27] M. Elad, J.-L. Starck, P. Querre, and D. L. Donoho. Simultaneous cartoon and texture image inpainting using morphological component analysis (MCA). Appl. Comput. Harmon. Anal., 19(3):340–358, 2005.
[28] M. J. Fadili, J.-L. Starck, and F. Murtagh. Inpainting and zooming using sparse representations. The Computer Journal, to appear.
[29] M. Fazel. Matrix Rank Minimization with Applications. PhD thesis, Stanford University, 2002.
[30] M. Fazel, H. Hindi, and S. Boyd. Log-det heuristic for matrix rank minimization with applications to Hankel and Euclidean distance matrices. In Proc. Am. Control Conf., June 2003.
[31] M. Figueiredo and R. Nowak. An EM algorithm for wavelet-based image restoration. IEEE Transactions on Image Processing, 12(8):906–916, 2003.
[32] T. Goldstein and S. Osher. The Split Bregman Algorithm for L1 Regularized Problems, 2008. UCLA CAM Report (08-29).
[33] E. T. Hale, W. Yin, and Y. Zhang. Fixed-point continuation for ℓ1-minimization: methodology and convergence, 2008. preprint.
[34] N. J. Higham. Functions of Matrices: Theory and Computation. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 2008.
[35] J.-B. Hiriart-Urruty and C. Lemaréchal. Convex analysis and minimization algorithms. I, volume 305 of Grundlehren der Mathematischen Wissenschaften [Fundamental Principles of Mathematical Sciences]. Springer-Verlag, Berlin, 1993. Fundamentals.
[36] R. M. Larsen. PROPACK: Software for large and sparse SVD calculations. Available from http://sun.stanford.edu/~rmunk/PROPACK/.
[37] A. S. Lewis. The mathematics of eigenvalue optimization. Math. Program., 97(1-2, Ser. B):155–176, 2003. ISMP, 2003 (Copenhagen).
[38] E. Liberty, F. Woolfe, P.-G. Martinsson, V. Rokhlin, and M. Tygert. Randomized algorithms for the low-rank approximation of matrices. Proc. Natl. Acad. Sci. USA, 104(51):20167–20172, 2007.
[39] S. Lintner and F. Malgouyres. Solving a variational image restoration model which involves ℓ∞ constraints. Inverse Problems, 20:815–831, 2004.
[40] Z. Liu and L. Vandenberghe. Interior-point method for nuclear norm approximation with application to system identification. Submitted to Mathematical Programming, 2008.
[41] P.-G. Martinsson, V. Rokhlin, and M. Tygert. A randomized algorithm for the approximation of matrices. Department of Computer Science, Yale University, New Haven, CT, Technical Report 1361, 2006.
[42] M. Mesbahi and G. P. Papavassilopoulos. On the rank minimization problem over a positive semidefinite linear matrix inequality. IEEE Transactions on Automatic Control, 42(2):239–243, 1997.
[43] S. Osher, M. Burger, D. Goldfarb, J. Xu, and W. Yin. An iterative regularization method for total variation-based image restoration. Multiscale Model. Simul., 4(2):460–489 (electronic), 2005.
[44] S. Osher, Y. Mao, B. Dong, and W. Yin. Fast Linearized Bregman Iteration for Compressed Sensing and Sparse Denoising, 2008. UCLA CAM Report (08-37).
[45] B. Recht, M. Fazel, and P. Parrilo. Guaranteed minimum rank solutions of matrix equations via nuclear norm minimization. Submitted to SIAM Review, 2007.
[46] J.-L. Starck, D. L. Donoho, and E. J. Candès. Astronomical image representation by the curvelet transform. Astronom. and Astrophys., 398:785–800, 2003.
[47] K. C. Toh, M. J. Todd, and R. H. Tütüncü. SDPT3: a MATLAB software package for semidefinite-quadratic-linear programming. Available from http://www.math.nus.edu.sg/~mattohkc/sdpt3.html.
[48] C. Tomasi and T. Kanade. Shape and motion from image streams under orthography: a factorization method. International Journal of Computer Vision, 9(2):137–154, 1992.
[49] G. A. Watson. Characterization of the subdifferential of some matrix norms. Linear Algebra Appl., 170:33–45, 1992.
[50] S. J. Wright, R. Nowak, and M. Figueiredo. Sparse reconstruction by separable approximation. Submitted for publication, 2007.
[51] W. Yin, S. Osher, D. Goldfarb, and J. Darbon. Bregman iterative algorithms for 1 -minimization with applications to compressed sensing.SIAM J. Imaging Sci., 1(1):143168, 2008.