A Group-Specific Recommender System

Xuan Bi, Annie Qu, Junhui Wang and Xiaotong Shen
Abstract
In recent years, there has been a growing demand to develop efficient recommender systems which track users' preferences and recommend potential items of interest to users. In this paper, we propose a group-specific method to utilize dependency information from users and items which share similar characteristics under the singular value decomposition framework. The new approach is effective for the cold-start problem, where, in the testing set, the majority of responses are obtained from new users or for new items, and their preference information is not available from the training set. One advantage of the proposed model is that we are able to incorporate information from the missing mechanism and group-specific features through clustering based on the numbers of ratings from each user and other variables associated with missing patterns. In addition, since this type of data involves large-scale customer records, traditional algorithms are not computationally scalable. To implement the proposed method, we propose a new algorithm that embeds a back-fitting algorithm into alternating least squares, which avoids large matrix operations and large memory storage, and therefore makes it feasible to achieve scalable computing. Our simulation studies and MovieLens data analysis both indicate that the proposed group-specific method improves prediction accuracy significantly compared to existing competitive recommender system approaches.
Key words: Cold-start problem, group-specific latent factors, non-random missing observations, personalized prediction.

Xuan Bi is a Ph.D. student, Department of Statistics, University of Illinois at Urbana-Champaign, Champaign, IL 61820 (E-mail: [email protected]). Annie Qu is Professor, Department of Statistics, University of Illinois at Urbana-Champaign, Champaign, IL 61820 (E-mail: [email protected]). Junhui Wang is Associate Professor, Department of Mathematics, City University of Hong Kong, Hong Kong, China (E-mail: [email protected]). Xiaotong Shen is Professor, School of Statistics, University of Minnesota, Minneapolis, MN 55455 (E-mail: [email protected]). Research is supported in part by National Science Foundation Grants DMS-1207771, DMS-1415500, DMS-1415308, DMS-1308227, DMS-1415482, and HK GRF-11302615. The authors thank Yunzhang Zhu for providing the program code for the method of Zhu et al. (2015).

1 Introduction

Recommender systems have drawn great attention since they can be applied to many areas, such as movie reviews, restaurant and hotel selection, financial services, and even identifying gene therapies. Therefore there is a great demand to develop efficient recommender systems which track users' preferences and recommend potential items of interest to users.

However, developing competitive recommender systems brings new challenges, as information from both users and items could grow exponentially, and the corresponding utility matrix representing users' preferences over items is sparse and high-dimensional. Standard methods and algorithms which are not scalable in practice may suffer from rapid deterioration in recommendation accuracy as the volume of data increases. In addition, it is important to incorporate dynamic features of the data instead of one-time usage only, as data could stream in over time and grow exponentially. For example, in the MovieLens 10M data, 96% of the most recent ratings are either from new users or on new items which did not exist before. This implies that the information collected at an early time may not be representative of future users and items. This phenomenon is also called the cold-start problem, where, in the testing set, the majority of responses are obtained from new users or for new items, and their preference information is not available from the training set. Another important feature of this type of data is that the missingness is likely nonignorable, where the missing mechanism is associated with unobserved responses. For instance, items with fewer and lower rating scores are less likely to attract other users. Existing recommender systems typically assume missing completely at random, which may lead to estimation bias.

Content-based filtering and collaborative filtering are two of the most prevalent approaches for recommender systems. Content-based filtering methods (e.g., Lang, 1995; Mooney and Roy, 2000; Blanco-Fernandez et al., 2008) recommend items by comparing the content of the items with a user's profile, which has the advantage that new items can be recommended upon release.

However, domain knowledge is often required to establish a transparent profile for each user (Lops et al., 2011), which entails pre-processing tasks to formulate information vectors for items (Pazzani and Billsus, 2007). In addition, content-based filtering suffers from the cold-start problem as well when a new user is recruited (Adomavicius and Tuzhilin, 2005). For collaborative filtering, the key idea is to borrow information from similar users to predict their future actions. One significant advantage is that domain knowledge for items is not required. Popular collaborative filtering approaches include, but are not limited to, singular value decomposition (SVD; Funk, 2006; Mazumder et al., 2010), restricted Boltzmann machines (RBM; Salakhutdinov et al., 2007), and nearest neighbor methods (kNN; Bell and Koren, 2007). It is well known that an ensemble of these methods can further enhance prediction accuracy (see Cacheda et al. (2011) and Feuerverger et al. (2012) for extensive reviews). However, most existing collaborative filtering approaches do not effectively solve the cold-start problem, although various attempts have been made. For example, Park et al. (2006) suggest adding artificial users or items with pre-defined characteristics, while Goldberg et al. (2001), Melville et al. (2002), and Nguyen et al. (2007) consider imputing pseudo ratings. Most recently, hybrid systems incorporating content-based auxiliary information have been proposed (e.g., Agarwal and Chen, 2009; Nguyen and Zhu, 2013; Zhu et al., 2015). Nevertheless, the cold-start problem imposes great challenges, and has not been effectively solved.

In this paper, we propose a group-specific singular value decomposition method that generalizes the SVD model by incorporating between-subject dependency and utilizing information about missingness. Specifically, we cluster users or items based on their missingness-related characteristics. We assume that individuals within the same cluster are correlated, while individuals from different clusters are independent. The cluster correlation is incorporated through mixed-effects modeling, assuming that users or items from the same cluster share the same group effects, along with latent-factor modeling using singular value decomposition.

The proposed method has two significant contributions. First, it solves the cold-start problem effectively through incorporating group effects. Most collaborative filtering methods rely on subject-specific parameters to predict users' and items' future ratings. However, for a new user or item, the training samples provide no information to estimate such parameters. In contrast, we are able to incorporate additional group information for new users and items to achieve higher prediction accuracy. Second, our clustering strategy takes nonignorable missingness into consideration. In the MovieLens data, we notice that individuals' rating behaviors are highly associated with their missing patterns: movies with higher average rating scores attract more viewers, while frequent viewers tend to be more critical and give low ratings. We cluster individuals into groups based on their non-random missingness, and this allows us to capture individuals' latent characteristics which are not utilized in other approaches.

To implement the proposed method, we propose a new algorithm that embeds a back-fitting algorithm into alternating least squares, which avoids large matrix operations and large memory storage, and makes it feasible to achieve scalable computing in practice. Our numerical studies indicate that the proposed method is effective in terms of prediction accuracy. For example, for the MovieLens 1M and 10M data, the proposed method improves prediction accuracy significantly compared to existing competitive recommender system approaches (e.g., Agarwal and Chen, 2009; Koren et al., 2009; Mazumder et al., 2010; Zhu et al., 2015).

This article is organized as follows. Section 2 provides the background of the singular value decomposition model and introduces the proposed model framework. Section 3 presents the proposed method, a new algorithm and its implementation. Section 4 establishes the theoretical foundation of the proposed method. In Section 5 we illustrate the performance and robustness of the proposed method through simulation studies. The MovieLens 1M and 10M data are analyzed in Section 6. Section 7 provides concluding remarks and discussion.

2 Background and Model Framework

2.1 Background

We provide the background of the singular value decomposition method (Funk, 2006) as follows. Let R = (r_{ui})_{n \times m} be the utility matrix, where n is the number of users, m is the number of items, and each r_{ui} is an explicit rating from user u for item i (u = 1, ..., n; i = 1, ..., m). The SVD method decomposes the utility matrix R as

R = P Q^\top,

where R is assumed to be low-rank, P = (p_1, ..., p_n)^\top is an n \times K user preference matrix, Q = (q_1, ..., q_m)^\top is an m \times K item preference matrix, and K is the pre-specified upper bound of the number of latent factors, which corresponds to the rank of R. Here q_i and p_u are K-dimensional latent factors associated with item i and user u, respectively, which explain the variability in R. The predicted value of r_{ui} given by the SVD method is \hat{r}_{ui} = \hat{p}_u^\top \hat{q}_i, where q_i and p_u are estimated iteratively by

\hat{q}_i = \arg\min_{q_i} \sum_{u \in U_i} (r_{ui} - p_u^\top q_i)^2 + \lambda \|q_i\|_2^2,

and

\hat{p}_u = \arg\min_{p_u} \sum_{i \in I_u} (r_{ui} - p_u^\top q_i)^2 + \lambda \|p_u\|_2^2.

Here U_i denotes the set of all users who rate item i, and I_u is the set of all items rated by user u. Different penalty functions can be applied; for example, Zhu et al. (2015) suggest L_0 and L_1 penalties to achieve sparsity of P and Q. In addition, some SVD methods (e.g., Koren, 2010; Mazumder et al., 2010; Nguyen and Zhu, 2013) are applied to residuals after a baseline fit, such as linear regression or ANOVA, rather than to the raw ratings r_{ui} directly.

The SVD method can be carried out through several algorithms, for example, alternating least squares (ALS; Carroll and Chang, 1970; Harshman, 1970; Koren et al., 2009), gradient descent approaches (Wu, 2007), and one-feature-at-a-time ALS (Funk, 2006).
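As a concrete illustration, the following is a minimal R sketch of the item-side ridge update above (a sketch only, not the authors' code); the data frame ratings and the objects P, Q and lambda are assumed placeholder names.

# Item-side ALS update for the regularized SVD model r_ui ~ p_u' q_i.
# Assumed inputs: `ratings` has columns user, item, rating; P is n x K; Q is m x K.
als_update_items <- function(ratings, P, Q, lambda) {
  K <- ncol(Q)
  for (i in seq_len(nrow(Q))) {
    idx <- which(ratings$item == i)              # ratings on item i, i.e., the set U_i
    if (length(idx) == 0) next                   # skip items without observed ratings
    X <- P[ratings$user[idx], , drop = FALSE]    # rows p_u for u in U_i
    y <- ratings$rating[idx]
    # Closed-form ridge solution: q_i = (X'X + lambda I_K)^{-1} X'y
    Q[i, ] <- solve(crossprod(X) + lambda * diag(K), crossprod(X, y))
  }
  Q
}

The user-side update is symmetric: swap the roles of P and Q and loop over users instead of items.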

2.2 Model Framework

The general framework of the proposed method is constructed as follows. Suppose x_{ui} is a covariate vector corresponding to user u and item i. In the rest of this article, we consider r_{ui} - x_{ui}^\top \beta as the new response, where \beta is the linear regression coefficient of x_{ui} for fitting r_{ui}. To simplify our notation, we still use r_{ui} to denote this residual. In case covariate information is not available, we apply the ANOVA-type model, where the grand mean, the user main effects and the item main effects are subtracted, and replace r_{ui} by its residual. Let \mu_{ui} = E(r_{ui}). We generalize the SVD model and formulate each \mu_{ui} as

\mu_{ui} = (p_u + s_{v_u})^\top (q_i + t_{j_i}),    (1)

where s_{v_u} and t_{j_i} are K-dimensional group effects that are identical across members of the same cluster. We denote users from the v-th cluster as V_v = {u : v_u = v} (v = 1, ..., N), and items from the j-th cluster as J_j = {i : j_i = j} (j = 1, ..., M), where \sum_{v=1}^{N} |V_v| = n and \sum_{j=1}^{M} |J_j| = m, |\cdot| is the cardinality of a set, and N and M are the total numbers of clusters for users and items, respectively. Details about selecting N and M are provided in Section 3.3.

In matrix form, we use S = (s_1, ..., s_N)^\top and T = (t_1, ..., t_M)^\top to denote the user and item group-effect matrices, respectively. However, the dimensions of S and T are N \times K and M \times K, which are not compatible with the dimensions of P and Q, respectively. Therefore, we alternatively define S_c = (s_1 1_{|V_1|}^\top, ..., s_N 1_{|V_N|}^\top)^\top and T_c = (t_1 1_{|J_1|}^\top, ..., t_M 1_{|J_M|}^\top)^\top, corresponding to the group effects of all users and items, where 1_k is a k-dimensional vector of 1's, and the subscript c in S_c and T_c denotes the complete forms of the matrices. Let (\mu_{ui})_{n \times m} denote the mean matrix; then we have

(\mu_{ui})_{n \times m} = (P + S_c)(Q + T_c)^\top,

and if there are no group effects, this degenerates to (\mu_{ui})_{n \times m} = P Q^\top, which is the same as the SVD model. Here the users or items can be formed into clusters based on their similar characteristics. For example, we can use missingness-related information such as the number of ratings from each user and each item. Users or items within the same cluster are correlated with each other through the group effects s_{v_u} or t_{j_i}, while observations from different clusters are assumed to be independent. In Sections 3, 4 and 5.1, we assume N and M are known, and that members in each cluster are correctly labeled.

Remark 1. For easy operation, one could use users' and items' covariate information for clustering. In fact, (1) is still a generalization of the SVD method even if N = M = 1, because s_{v_u}^\top t_{j_i}, p_u^\top t_{j_i} and s_{v_u}^\top q_i correspond to the grand mean, the user main effects and the item main effects, analogous to the ANOVA-type SVD model. Note that covariate information might not be collected from new users and new items. However, missingness-related information is typically available for clustering, and therefore s_{v_u} and t_{j_i} can be utilized for new users and new items. This is crucial for solving the cold-start problem.
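To make the cold-start mechanism explicit, here is a minimal R sketch of prediction under model (1) (an added illustration with hypothetical object names, not the authors' code): for a new user or item the subject-specific factor is unavailable and only the group effect is used.

# Predict the (residual) rating of user u for item i under model (1).
# Assumed inputs: P (n x K), Q (m x K), Sg (N x K), Tg (M x K); v and j are the
# cluster labels of user u and item i, obtained, e.g., from missingness information.
predict_rating <- function(u, i, v, j, P, Q, Sg, Tg,
                           new_user = FALSE, new_item = FALSE) {
  p <- if (new_user) rep(0, ncol(P)) else P[u, ]   # no user-specific factor for a new user
  q <- if (new_item) rep(0, ncol(Q)) else Q[i, ]   # no item-specific factor for a new item
  sum((p + Sg[v, ]) * (q + Tg[j, ]))               # (p_u + s_v)'(q_i + t_j)
}

For a new user, the prediction reduces to s_v^\top (q_i + t_j), which is still item-specific; this is how the group effects allow information to be borrowed from existing users and items.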

3 The General Method

3.1 Parameter Estimation

In this subsection, we illustrate how to estimate the model parameters from the training data. In addition, we develop a new algorithm that embeds back-fitting into alternating least squares. This enables us to circumvent large-scale matrix operations through a two-step iteration, and hence significantly improves computational speed and scalability.

Let \theta be the vectorization of (P, Q, S, T), \Omega be the set of user-item pairs associated with observed ratings, and R^o = {r_{ui} : (u, i) \in \Omega} be the set of observed ratings. We define the loss function as

L(\theta | R^o) = \sum_{(u,i) \in \Omega} (r_{ui} - \mu_{ui})^2 + \lambda \Big( \sum_{u=1}^{n} \|p_u\|_2^2 + \sum_{v=1}^{N} \|s_v\|_2^2 + \sum_{i=1}^{m} \|q_i\|_2^2 + \sum_{j=1}^{M} \|t_j\|_2^2 \Big),    (2)

where \mu_{ui} is given by (1) and \lambda is a tuning parameter. We can estimate \theta via

\hat{\theta} = \arg\min_{\theta} L(\theta | R^o).

Then the predicted value of \mu_{ui} can be obtained as \hat{\mu}_{ui} = (\hat{p}_u + \hat{s}_{v_u})^\top (\hat{q}_i + \hat{t}_{j_i}). The estimation procedure consists of updating (p_u + s_{v_u}) and (q_i + t_{j_i}) iteratively. Following the strategy of alternating least squares, the latent factors and the group effects associated with item cluster j are estimated by

(\{\hat{q}_i\}_{i \in J_j}, \hat{t}_j) = \arg\min_{\{q_i\}_{i \in J_j}, t_j} \sum_{i \in J_j} \sum_{u \in U_i} (r_{ui} - \mu_{ui})^2 + \lambda \Big( \sum_{i \in J_j} \|q_i\|_2^2 + \|t_j\|_2^2 \Big).    (3)

Similarly, we estimate the latent factors and group effects associated with user cluster v:

(\{\hat{p}_u\}_{u \in V_v}, \hat{s}_v) = \arg\min_{\{p_u\}_{u \in V_v}, s_v} \sum_{u \in V_v} \sum_{i \in I_u} (r_{ui} - \mu_{ui})^2 + \lambda \Big( \sum_{u \in V_v} \|p_u\|_2^2 + \|s_v\|_2^2 \Big).    (4)

However, directly solving (3) and (4) by alternating least squares involves large matrices; in the MovieLens 10M data, it could involve matrices with more than 100,000 rows. We therefore develop a new algorithm which embeds back-fitting into alternating least squares, and minimizes each of (3) and (4) iteratively. Specifically, for each item cluster J_j (j = 1, ..., M), we fix P and S, and minimize (3) by estimating q_i and t_j iteratively:

\hat{q}_i = \arg\min_{q_i} \sum_{u \in U_i} (r_{ui} - \mu_{ui})^2 + \lambda \|q_i\|_2^2, \quad i \in J_j,    (5)

\hat{t}_j = \arg\min_{t_j} \sum_{i \in J_j} \sum_{u \in U_i} (r_{ui} - \mu_{ui})^2 + \lambda \|t_j\|_2^2.    (6)

For each user cluster V_v (v = 1, ..., N), we fix Q and T and minimize (4) by estimating p_u and s_v iteratively:

\hat{p}_u = \arg\min_{p_u} \sum_{i \in I_u} (r_{ui} - \mu_{ui})^2 + \lambda \|p_u\|_2^2, \quad u \in V_v,    (7)

\hat{s}_v = \arg\min_{s_v} \sum_{u \in V_v} \sum_{i \in I_u} (r_{ui} - \mu_{ui})^2 + \lambda \|s_v\|_2^2.    (8)
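Each of (5)-(8) is a ridge regression once the other blocks are held fixed. The following R sketch of the item-cluster step (5)-(6) is an added illustration (assumed object names, simplified bookkeeping), not the authors' implementation.

# Back-fitting for one item cluster J_j: alternate (5) and (6) with P and S fixed.
# Assumed inputs: `ratings` has columns user, item, rating; `items_j` indexes J_j;
# PS is the n x K matrix with rows p_u + s_{v_u}; t_j is the current group effect.
backfit_item_cluster <- function(ratings, items_j, PS, Q, t_j, lambda, tol = 1e-5) {
  K <- ncol(Q)
  repeat {
    Q_old <- Q; t_old <- t_j
    # (5): ridge regression of r_ui - (p_u + s_v)' t_j on (p_u + s_v), for each i in J_j.
    for (i in items_j) {
      idx <- which(ratings$item == i)
      if (length(idx) == 0) next
      X <- PS[ratings$user[idx], , drop = FALSE]
      y <- ratings$rating[idx] - X %*% t_j
      Q[i, ] <- solve(crossprod(X) + lambda * diag(K), crossprod(X, y))
    }
    # (6): ridge regression of r_ui - (p_u + s_v)' q_i on (p_u + s_v), pooled over J_j.
    idx <- which(ratings$item %in% items_j)
    X <- PS[ratings$user[idx], , drop = FALSE]
    y <- ratings$rating[idx] - rowSums(X * Q[ratings$item[idx], , drop = FALSE])
    t_j <- as.vector(solve(crossprod(X) + lambda * diag(K), crossprod(X, y)))
    if (sum((Q[items_j, ] - Q_old[items_j, ])^2) / (length(items_j) * K) +
        sum((t_j - t_old)^2) / K < tol) break
  }
  list(Q = Q, t_j = t_j)
}

The user-cluster step (7)-(8) has exactly the same structure with the roles of users and items exchanged.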
3.2 Algorithm

In this section, we provide the detailed algorithm as follows.

Algorithm 1: Parallel Computing for the Proposed Method

  1. (Initialization) Set l = 1. Set initial values for (P^{(0)}, Q^{(0)}, S^{(0)}, T^{(0)}) and the tuning parameter \lambda.
  2. (Item Effects) Estimate Q^{(l)} and T^{(l)} iteratively. (i) Set Q^{(l)} ← Q^{(l-1)} and T^{(l)} ← T^{(l-1)}. (ii) For each item i = 1, ..., m, calculate q_i^{(l),new} using (5). (iii) For each item cluster J_j, j = 1, ..., M, calculate t_j^{(l),new} based on (6). (iv) Stop the iteration if (1/mK)\|Q^{(l),new} - Q^{(l)}\|_F^2 + (1/MK)\|T^{(l),new} - T^{(l)}\|_F^2 < 10^{-5}; otherwise assign Q^{(l)} ← Q^{(l),new} and T^{(l)} ← T^{(l),new}, and go to step 2(ii).
  3. (User Effects) Estimate P^{(l)} and S^{(l)} iteratively. (i) Set P^{(l)} ← P^{(l-1)} and S^{(l)} ← S^{(l-1)}. (ii) For each user u = 1, ..., n, calculate p_u^{(l),new} using (7). (iii) For each user cluster V_v, v = 1, ..., N, calculate s_v^{(l),new} based on (8). (iv) Stop the iteration if (1/nK)\|P^{(l),new} - P^{(l)}\|_F^2 + (1/NK)\|S^{(l),new} - S^{(l)}\|_F^2 < 10^{-5}; otherwise assign P^{(l)} ← P^{(l),new} and S^{(l)} ← S^{(l),new}, and go to step 3(ii).
  4. (Stopping criterion) Stop if (1/nK)\|P^{(l)} + S_c^{(l)} - P^{(l-1)} - S_c^{(l-1)}\|_F^2 + (1/mK)\|Q^{(l)} + T_c^{(l)} - Q^{(l-1)} - T_c^{(l-1)}\|_F^2 < 10^{-3}; otherwise set l ← l + 1 and go to step 2.
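The following R skeleton (an added sketch, not the authors' code) shows how steps 2-4 fit together; the two helpers update_item_effects() and update_user_effects() are assumed placeholders for steps 2 and 3, e.g., loops over clusters that call the cluster-level back-fitting sketched after (8).

# Outer loop of Algorithm 1 (sketch). The helper functions are hypothetical
# wrappers for steps 2 and 3; all names are illustrative.
fit_group_svd <- function(ratings, P, Q, S_c, T_c, lambda, tol = 1e-3, max_iter = 100) {
  n <- nrow(P); m <- nrow(Q); K <- ncol(P)
  for (l in seq_len(max_iter)) {
    P_old <- P; Q_old <- Q; S_old <- S_c; T_old <- T_c
    res <- update_item_effects(ratings, P, Q, S_c, T_c, lambda)   # step 2: Q, T given P, S
    Q <- res$Q; T_c <- res$T_c
    res <- update_user_effects(ratings, P, Q, S_c, T_c, lambda)   # step 3: P, S given Q, T
    P <- res$P; S_c <- res$S_c
    # Step 4: stop once the fitted user and item effects stabilize.
    delta <- sum((P + S_c - P_old - S_old)^2) / (n * K) +
             sum((Q + T_c - Q_old - T_old)^2) / (m * K)
    if (delta < tol) break
  }
  list(P = P, Q = Q, S_c = S_c, T_c = T_c)
}

The cluster-level updates in steps 2(ii)-(iii) and 3(ii)-(iii) are independent across clusters, which is what makes the parallel computing mentioned below straightforward.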
Note that alternating least squares is performed by conducting steps 2 and 3 iteratively, while the back-fitting algorithm is carried out within steps 2 and 3. Parallel computing can be implemented in steps 2(ii), (iii) and 3(ii), (iii).

Algorithm 1 does not require large computational and storage costs. We denote I_{B1}, I_{B2} and I_{ALS} as the numbers of iterations for back-fitting in steps 2 and 3, and for the ALS, respectively, and C_{Ridge} as the computational complexity of solving a ridge regression with K variables and max{|V_1|, ..., |V_N|, |J_1|, ..., |J_M|} observations. Then the computational complexity of Algorithm 1 is no greater than {(m + M) I_{B1} + (n + N) I_{B2}} C_{Ridge} I_{ALS}. Since both ridge regression and the Lasso have the same computational complexity as ordinary least squares (Efron et al., 2004), the computational cost of the proposed method is indeed no greater than that of Zhu et al. (2015). For the storage cost, Algorithm 1 requires storage of only item-specific or user-specific information to solve (5) or (7), and of item and user information of size not exceeding max{|J_1|, ..., |J_M|} and max{|V_1|, ..., |V_N|} to solve (6) or (8), respectively.

We also establish the convergence property of Algorithm 1 as follows. Let \theta^* = vec(P^*, Q^*, S^*, T^*) be a stationary point of L(\theta | R^o) corresponding to two blocks. That is,

vec(P^*, S^*) = \arg\min_{P, S} L(vec(P, Q^*, S, T^*) | R^o),

and

vec(Q^*, T^*) = \arg\min_{Q, T} L(vec(P^*, Q, S^*, T) | R^o).

The following lemma shows the convergence of Algorithm 1 to a stationary point.

Lemma 1. The estimate \hat{\theta} = vec(\hat{P}, \hat{Q}, \hat{S}, \hat{T}) from Algorithm 1 is a stationary point of the loss function L(\theta | R^o) in (2).

3.3 Implementation

In this subsection we address some implementation issues for the proposed method. Our algorithm is implemented in the R environment, which requires the packages foreach and doParallel for parallel computing, and bigmemory and bigalgebra for big-matrix storage and operation. All the reported numerical studies are implemented under the Linux system on cluster computers. We can further enhance computation speed through C++ programming with OpenMP.

To select the tuning parameter \lambda, we search over grid points and choose the value which minimizes the root mean square error (RMSE) on the validation set. In selecting the number of latent factors K, we choose K such that it is sufficiently large and leads to stable estimation. In general, K needs to be larger than the rank of the utility matrix R, but not so large as to intensify the computation. Regarding the selection of the numbers of clusters N and M, Corollary 2 of Section 4 provides a lower bound on the order of N and M. Note that too small N and M may not have the power to distinguish the proposed method from the SVD method. In practice, if clustering is based on categorical variables, then we can apply the existing categories, and N and M are known. However, if clustering is based on a continuous variable, we can apply the quantiles of the continuous variable to determine N and M and categorize users and items evenly. We then select the number of clusters through a grid search, similar to the selection of \lambda and K. See Wang (2010) for a consistent selection of the number of clusters in more general settings.

In particular, for our numerical studies, we split the dataset into 60% training, 15% validation and 25% testing sets based on the time of the ratings (timestamps; Zhu et al., 2015). That is, we use historical data to predict future data. If time information is not available, we use a random split to determine the training, validation and testing sets instead.
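Two of the implementation choices above, the chronological 60/15/25 split and quantile-based clustering of a continuous variable, are sketched below in R as an added illustration; the column names timestamp and user are assumed placeholders.

# Chronological 60% / 15% / 25% split into training, validation and testing sets.
split_by_time <- function(ratings) {
  o <- order(ratings$timestamp)
  n <- nrow(ratings)
  list(train      = ratings[o[1:floor(0.60 * n)], ],
       validation = ratings[o[(floor(0.60 * n) + 1):floor(0.75 * n)], ],
       test       = ratings[o[(floor(0.75 * n) + 1):n], ])
}

# Cluster users into N roughly equal-sized groups by the quantiles of their
# numbers of ratings (a missingness-related continuous variable).
cluster_users_by_activity <- function(ratings, N = 12) {
  counts <- as.numeric(table(ratings$user))
  brks <- unique(quantile(counts, probs = seq(0, 1, length.out = N + 1)))
  cut(counts, breaks = brks, labels = FALSE, include.lowest = TRUE)
}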

4 Theory

In this section, we provide the theoretical foundation of the proposed method in a general setting. That is, we allow r_{ui} to follow a general class of distributions. In particular, we derive an upper bound for the prediction error in probability, and show that existing approaches which do not utilize group effects lead to a larger value of the loss function, and therefore are less efficient than the proposed method. Furthermore, we establish a lower bound on the number of clusters which guarantees that the group effects can be detected effectively.

Suppose the expected value of each rating is formulated via a known mean function g(\cdot). That is,

E(r_{ui}) = g(\mu_{ui}),

and \mu_{ui} is defined as in (1). For example, if r_{ui} is a continuous variable, then g(\mu_{ui}) = \mu_{ui}; and if the r_{ui}'s are binary, then g(\mu_{ui}) = exp(\mu_{ui}) / {1 + exp(\mu_{ui})}. We let f_{ui} = f(r_{ui} | \mu_{ui}) be the probability density function of r_{ui}. Since each r_{ui} is associated with \theta only through \mu_{ui}, we denote f_{ui}(r, \theta) = f(r_{ui} | \mu_{ui}). We define the likelihood-based loss function as

L(\theta | R^o) = -\sum_{(u,i) \in \Omega} \log f_{ui} + \lambda_{|\Omega|} |\Omega| D(\theta),

where \lambda_{|\Omega|} is the penalization coefficient, |\Omega| is the total number of observed ratings, and D(\theta) is a non-negative penalty function of \theta. For example, D(\theta) = \|\theta\|_2^2 for the L_2-penalty. Since, in practice, the ratings are typically non-negative finite values, it is sensible to assume \|\theta\|_\infty \le L, where L is a positive constant. We define the parameter vector space as

S(k) = {\theta : \|\theta\|_\infty \le L, D(\theta) \le k^2}.

Notice that the dimension of \theta is dim(\theta) = (n + m + N + M)K, which goes to infinity as either n or m increases. Therefore, we assume k = O((n + m + N + M)K). Similarly, we define the parameter space for each \mu_{ui}: S_\mu(k) = {\mu : |\mu| \le L, D(\mu) \le k^2}.

Assumption 1. For some constant \bar{G} \ge 0, and \mu_{ui}, \tilde{\mu}_{ui} \in S_\mu(k),

| f^{1/2}(r_{ui} | \mu_{ui}) - f^{1/2}(r_{ui} | \tilde{\mu}_{ui}) | \le G(r_{ui}) \, \|\mu_{ui} - \tilde{\mu}_{ui}\|_2,

where E G^2(r_{ui}) \le \bar{G}^2 for u = 1, ..., n, i = 1, ..., m.
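As an added illustration (not part of the original text), Assumption 1 is easy to verify in the Gaussian case r_{ui} ~ N(\mu_{ui}, 1) mentioned above, where f^{1/2}(r | \mu) = (2\pi)^{-1/4} \exp\{-(r - \mu)^2 / 4\}. By the mean value theorem and the elementary bound (|x|/2) e^{-x^2/4} \le e^{-1/2}/\sqrt{2} for all x,

| f^{1/2}(r_{ui} | \mu_{ui}) - f^{1/2}(r_{ui} | \tilde{\mu}_{ui}) |
\le \sup_{\mu^*} \frac{|r_{ui} - \mu^*|}{2} (2\pi)^{-1/4} e^{-(r_{ui} - \mu^*)^2/4} \, |\mu_{ui} - \tilde{\mu}_{ui}|
\le (2\pi)^{-1/4} \frac{e^{-1/2}}{\sqrt{2}} \, |\mu_{ui} - \tilde{\mu}_{ui}|,

so Assumption 1 holds with G(r_{ui}) taken to be the constant (2\pi)^{-1/4} e^{-1/2} / \sqrt{2}.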

The Hellinger metric h(\cdot, \cdot) on S_\mu(k) is defined as

h(\mu_{ui}, \tilde{\mu}_{ui}) = \Big[ \int \{ f^{1/2}(r_{ui} | \mu_{ui}) - f^{1/2}(r_{ui} | \tilde{\mu}_{ui}) \}^2 \, d\nu(r_{ui}) \Big]^{1/2},

where \nu(\cdot) is a probability measure. Based on Assumption 1, h(\mu_{ui}, \tilde{\mu}_{ui}) is bounded by \|\mu_{ui} - \tilde{\mu}_{ui}\|_2. We now define the Hellinger metric h_S(\cdot, \cdot) on S(k). For \theta, \tilde{\theta} \in S(k), let

h_S(\theta, \tilde{\theta}) = \Big\{ \frac{1}{nm} \sum_{i=1}^{m} \sum_{u=1}^{n} h^2(\mu_{ui}, \tilde{\mu}_{ui}) \Big\}^{1/2}.

It is straightforward to show that h_S is still a metric. In the rest of this article, we suppress the subscript and use h(\cdot, \cdot) to denote the Hellinger metric on S(k). In the following, we show that h(\theta, \tilde{\theta}) can be bounded by \|\theta - \tilde{\theta}\|_2.

Lemma 2. Under Assumption 1, there exists a constant d_0 \ge 0 such that for \theta, \tilde{\theta} \in S(k),

h(\theta, \tilde{\theta}) \le d_0 \sqrt{\frac{n+m}{nm}} \, \|\theta - \tilde{\theta}\|_2.

Suppose \hat{\theta} = \arg\min_{\theta \in S(k)} L(\theta | R^o) is a penalized maximum likelihood estimator of \theta. Theorem 1 indicates that \hat{\theta} converges to \theta exponentially in probability, with a convergence rate of \varepsilon_{|\Omega|}.

Theorem 1. Under Assumption 1 and supposing \lambda_{|\Omega|} < 2^{-1} k^{-2} \varepsilon^2_{|\Omega|}, the best possible convergence rate of \hat{\theta} is

\varepsilon_{|\Omega|} \asymp \Big\{ \frac{(n+m)K}{|\Omega|} \Big\}^{1/2} \Big\{ \log\Big( \frac{|\Omega|}{\sqrt{nmK}} \Big) \Big\}^{1/2},

and there exists a constant c > 0 such that

P\big( h(\hat{\theta}, \theta) \ge \varepsilon_{|\Omega|} \big) \le 7 \exp\big( -c |\Omega| \varepsilon^2_{|\Omega|} \big).

Remark 2. Theorem 1 is quite general in terms of the rates of n and m. If we assume O(n) = O(m) = O(n+m), such as in the MovieLens data, then \varepsilon_{|\Omega|} converges faster than \varepsilon^{SAJ}_{|\Omega|}, where

\varepsilon^{SAJ}_{|\Omega|} \asymp \Big\{ \frac{(n+m)K}{|\Omega|} \Big\}^{1/2} \Big\{ \log\Big( \frac{|\Omega|}{(n+m)k} \Big) \Big\}^{1/2} \{ \log(mk) \}^{1/2}

is the convergence rate provided by the collaborative prediction method with binary ratings (Srebro et al., 2005). The exact rate comparison is not available here.

Remark 3. If we impose \|\theta - \theta^*\|_2 \le d_{n,m} with radius d_{n,m} = 2 \sqrt{nm / \{d_0^2 (n+m)\}} \, \varepsilon_{|\Omega|}, then the entropy of S(k) under Assumption 1 also satisfies the condition of local entropy (Wong and Shen, 1995). That is,

S^*(k) = S(k) \cap \Big\{ \frac{1}{nm} \sum_{i=1}^{m} \sum_{u=1}^{n} \| f^{1/2}(r_{ui}, \theta) - f^{1/2}(r_{ui}, \theta^*) \|_2^2 \le 2 s^2 \Big\}, \quad \text{for all } s \ge \varepsilon_{|\Omega|}.

Consequently, the convergence rate \varepsilon_{|\Omega|} is log(|\Omega|) times faster than the convergence rate calculated by using global entropy.

We now assume that the density function f_{ui} is a member of the exponential family in its canonical form. That is,

f(r_{ui} | \mu_{ui}) = H(r_{ui}) \exp\{ \mu_{ui} T(r_{ui}) - A(\mu_{ui}) \}.

In fact, the following results still hold if f is in the over-dispersed exponential family. Suppose \theta \in S(k) and \mu_{ui} \in S_\mu(k) are the true parameters. Then Theorem 2 indicates that if misspecified \tilde{\mu}_{ui}'s are not close to the \mu_{ui}'s, then the loss function of the corresponding \tilde{\theta} cannot be closer to the loss function of \theta than a given threshold, in probability.

Theorem 2. Under Assumption 1 and \lambda_{|\Omega|} < 2^{-1} k^{-2} \varepsilon^2_{|\Omega|}, there exist c_i > 0, i = 1, 2, such that for any \varepsilon_{|\Omega|} > 0 there exists a \delta_{|\Omega|} > 0, and \min_{1 \le u \le n, 1 \le i \le m} |\tilde{\mu}_{ui} - \mu_{ui}| > \delta_{|\Omega|} implies that

P^*\Big( \frac{1}{|\Omega|} \{ L(\theta | R^o) - L(\tilde{\theta} | R^o) \} \ge -c_1 \varepsilon^2_{|\Omega|} \Big) \le 7 \exp\big( -c_2 |\Omega| \varepsilon^2_{|\Omega|} \big),

where P^* denotes the outer measure (Pollard, 2012).

Remark 4. Theorems 1 and 2 still hold if the loss function L(\theta | \cdot) is not likelihood-based but is a general criterion function. For such an L(\theta | \cdot), we can replace h(\cdot, \cdot) by \rho(\cdot, \cdot) = K^{1/2}(\cdot, \cdot) as the new measure of convergence, where K(\theta, \tilde{\theta}) = E\{ L(\theta | R) - L(\tilde{\theta} | R) \}. Note that K(\cdot, \cdot) is the Kullback-Leibler information if L(\theta | \cdot) is a log-likelihood, which dominates the Hellinger distance h(\cdot, \cdot), and hence the convergence is stronger under K(\cdot, \cdot). See Shen (1998) for more details about regularity conditions for a more general criterion function.

Suppose \theta^0 \in S(k) is the vectorization of (P, Q, 0, 0), which corresponds to models with no group effects. The following Corollary 1 shows that if the true group effects are not close to 0, then existing methods ignoring group effects, such as the SVD model (\mu^0_{ui} = p_u^\top q_i), lead to a larger loss in probability than the proposed method.

Corollary 1. Under Assumption 1 and \lambda_{|\Omega|} < 2^{-1} k^{-2} \varepsilon^2_{|\Omega|}, there exist c_i > 0, i = 1, 2, and a constant \gamma \in (0, 1], such that for \gamma^{-1/2} \varepsilon_{|\Omega|} > 0 there exists a \delta_{|\Omega|} > 0. Assume that at least \gamma(nm) pairs of (u, i) satisfy |\mu^0_{ui} - \mu_{ui}| > \delta_{|\Omega|}. Then

P^*\Big( \frac{1}{|\Omega|} \{ L(\theta | R^o) - L(\theta^0 | R^o) \} \ge -c_1 \varepsilon^2_{|\Omega|} \Big) \le 7 \exp\big( -c_2 |\Omega| \varepsilon^2_{|\Omega|} \big).

The following corollary provides the minimal rate of N and M in terms of n, m, K and |\Omega|. This implies that the number of clusters should be sufficiently large so that the group effects can be detected.

Corollary 2. Under the assumptions of Theorem 1, the rate of N and M satisfies

O(N + M) \ge \frac{nm}{|\Omega|} \log\Big( \frac{|\Omega|}{\sqrt{nmK}} \Big).

If we further assume that the number of ratings is proportional to the size of the utility matrix, that is, O(|\Omega|) = O(nm), then O(N + M) \ge \log(|\Omega|^{1/2} / K^{1/2}). The lower bound on O(N + M) is useful in determining the minimal number of clusters. For example, for the MovieLens 10M data, where |\Omega| = 10,000,000, we have the lower bound \log(|\Omega|^{1/2} / K^{1/2}) \approx 7 if K \le 10.
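As a quick arithmetic check of this bound (an added worked step), taking |\Omega| = 10^7 and K = 10 gives

\log\Big( \frac{|\Omega|^{1/2}}{K^{1/2}} \Big) = \frac{1}{2} \log\Big( \frac{10^7}{10} \Big) = 3 \log 10 \approx 6.9,

so roughly seven user and item clusters in total are needed before the group effects become detectable at this rate; smaller K only increases the bound.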

5 Simulation Studies

In this section, we provide simulation studies to investigate the numerical performance of the proposed method in finite samples. Specifically, we compare the proposed method with four matrix factorization methods in Section 5.1 under a dynamic setting where new users and new items appear at later times. In Section 5.2, we test the robustness of the proposed model under various degrees of cluster misspecification.

5.1 Comparison under the cold-start problem

In this simulation study, we simulate the cold-start problem, where new users' and new items' information is not available in the training set. In addition, we simulate the situation where users' behavior is affected by other users' behavior, and therefore the missingness is not missing completely at random. Here users and items from the same group are generated to be dependent on each other.

We set n = 650 and m = 660, and generate p_u, q_i ~iid N(0, I_K) for u = 1, ..., n, i = 1, ..., m, where I_K is a K-dimensional identity matrix, with K = 3 or 6. To simulate group effects, we let s_v = (-3.5 + 0.5v) 1_K, v = 1, ..., N, and t_j = (-3.6 + 0.6j) 1_K, j = 1, ..., M, where N = 13 and M = 11. We set the cluster sizes |V_1| = ... = |V_N| = 50 and |J_1| = ... = |J_M| = 60. Without loss of generality, we assume that covariate information is not available in this simulation.

In contrast to other simulation studies, we do not generate the entire utility matrix R. Instead, we mimic the real-data case where only a small percentage of ratings is collected. We choose the total number of ratings to be |\Omega| = (1 - \pi) nm, where \pi = 0.7, 0.8, 0.9 or 0.95 is the missing rate. The following procedure is used to generate these ratings. We first select the l-th user-item pair (u_l, i_l), where l = 1, ..., |\Omega| indicates the sequence of ratings from the earliest to the latest. If item i_l's current average rating is greater than 0.5, then for user u_l we assign a rating r_{u_l i_l} with probability 0.85; otherwise we assign r_{u_l i_l} with probability 0.2.

The rating r_{u_l i_l} is generated by (p_{u_l} + s_{v_{u_l}})^\top (q_{i_l} + t_{j_{i_l}}) / 3 + \epsilon, where \epsilon ~iid N(0, 1). That is, we simulate a setting where users tend to rate highly-rated items. Here u_l and i_l are sampled from 1, ..., n and 1, ..., m independently, but with weights proportional to the densities of the normal distributions N(nl/|\Omega|, (0.2n)^2) and N(ml/|\Omega|, (0.2m)^2), respectively. That is, ratings appearing at a later time are more likely to correspond to newer users or newer items. If we fail to assign r_{u_l i_l} a value, we re-draw (u_l, i_l) and restart this procedure. The selection of r_{u_l i_l} is based on observed information, so the missing mechanism is missing at random (Rubin, 1976).

We compare the performance of the proposed method with four competitive matrix factorization models, namely, the regularized singular value decomposition method solved by the alternating least squares algorithm (RSVD; Funk, 2006; Koren et al., 2009), a regression-based latent factor model (Agarwal and Chen, 2009), a nuclear-norm matrix completion method (Soft-Impute; Mazumder et al., 2010), and a latent factor model with sparsity pursuit (Zhu et al., 2015). For the last three methods, we apply the code at https://github.com/beechung/latent-factor-models, the R package softImpute, and the code of Zhu et al. (2015), respectively. For the proposed method, we apply the loss function (2). The tuning parameter \lambda for the proposed method and the RSVD is selected from grid points ranging from 1 to 29 to minimize the RMSE on the validation set. For Agarwal and Chen (2009), we use the default of 10 iterations, while for Mazumder et al. (2010), the default \lambda = 0 is chosen to achieve convergence to the local minimum; and for Zhu et al. (2015), the tuning parameter selection is integrated in their code.

We generate simulation settings with the number of latent factors K = 3 and 6, and the missing rate \pi = 0.7, 0.8, 0.9, 0.95. The means and standard errors of the RMSEs on the testing set are reported in Table 1. The simulation results are based on 500 replications. Table 1 indicates that the proposed method performs the best across all settings. Overall, the proposed method is relatively robust against different missing rates and different numbers of latent factors, and has the smallest standard error in most settings.

In the most extreme case, with K = 6 and \pi = 0.95, the proposed method is still more than 100% better than the best of the four existing methods in terms of the RMSE. The RSVD method performs well when both \pi and K are small, but performs poorly when either \pi or K increases. By contrast, Agarwal and Chen (2009), Mazumder et al. (2010) and Zhu et al. (2015) are able to provide small standard errors when K = 6 and \pi = 0.95, but have large RMSEs across all settings.
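The data-generating mechanism above can be summarized in the following R sketch (an added illustration with simplified bookkeeping; cluster labels are assigned in consecutive blocks, and all names are illustrative).

# Simulation of Section 5.1: group-specific ratings with informative missingness.
set.seed(1)
n <- 650; m <- 660; K <- 3; N <- 13; M <- 11; miss_rate <- 0.9
P <- matrix(rnorm(n * K), n, K); Q <- matrix(rnorm(m * K), m, K)
user_cl <- rep(1:N, each = n / N); item_cl <- rep(1:M, each = m / M)
Sg <- outer(-3.5 + 0.5 * (1:N), rep(1, K))   # s_v = (-3.5 + 0.5 v) 1_K
Tg <- outer(-3.6 + 0.6 * (1:M), rep(1, K))   # t_j = (-3.6 + 0.6 j) 1_K
n_obs <- round((1 - miss_rate) * n * m)
ratings <- data.frame(user = integer(n_obs), item = integer(n_obs), rating = numeric(n_obs))
item_sum <- rep(0, m); item_cnt <- rep(0, m)
l <- 1
while (l <= n_obs) {
  # Later pairs are more likely to involve "newer" (larger-index) users and items.
  u <- sample(1:n, 1, prob = dnorm(1:n, mean = n * l / n_obs, sd = 0.2 * n))
  i <- sample(1:m, 1, prob = dnorm(1:m, mean = m * l / n_obs, sd = 0.2 * m))
  avg <- if (item_cnt[i] > 0) item_sum[i] / item_cnt[i] else 0
  p_keep <- if (avg > 0.5) 0.85 else 0.2     # users tend to rate highly-rated items
  if (runif(1) > p_keep) next                # failed assignment: re-draw (u, i)
  r <- sum((P[u, ] + Sg[user_cl[u], ]) * (Q[i, ] + Tg[item_cl[i], ])) / 3 + rnorm(1)
  ratings[l, ] <- list(u, i, r)
  item_sum[i] <- item_sum[i] + r; item_cnt[i] <- item_cnt[i] + 1
  l <- l + 1
}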

5.2 Robustness against cluster misspecification

In this simulation study, we test the robustness of the proposed method when the clusters are misspecified. We follow the same data-generating process as in the previous study, but allow the cluster assignment to be misspecified. Specifically, we misassign users and items to adjacent clusters with 10%, 30% and 50% chance. Here adjacent clusters are defined as the clusters with the closest group effects. This definition of adjacent clusters reflects the real-data situation: for example, a horror movie might be misclassified as a thriller, but is less likely to be misclassified as a romantic movie.

The simulation results based on 500 replications are summarized in Table 2. In general, the proposed method is robust against the misspecification of clusters. In comparison with the previous results in Table 1, the proposed method performs better than the other four methods in all settings, even when 50% of the cluster members are misclassified. On the other hand, the misspecification rate affects the performance of the proposed method to different degrees under different settings of \pi and K. For example, under the 50% misspecification rate the proposed method is 2.7% worse, in terms of the RMSE, than the proposed method with no misspecification when K = 3 and \pi = 0.7, and becomes 18.8% worse than the one with no misspecification when K = 6 and \pi = 0.95.

6 MovieLens Data

We apply the proposed method to the MovieLens 1M and 10M data. The two datasets are collected by GroupLens Research and are available at http://grouplens.org/datasets/movielens. The MovieLens 1M data contain 1,000,209 ratings of 3,883 movies by 6,040 users, and rating scores range from 1 to 5. In addition, the 1M dataset provides demographic information for the users (age, gender, occupation, zipcode), and the genres and release dates of the movies. In the MovieLens 10M data, we have 10,000,054 ratings collected from 71,567 users on 10,681 items, and 99% of the movie ratings are actually missing. Rating scores range over 0.5, 1, ..., 5, but no user information is available.

Figure 1 illustrates the missing pattern of the MovieLens 1M data. Both graphs indicate that the missing mechanism is possibly missing not at random. In the left panel, the right-skewed distribution across users indicates that only a few users rated a large number of movies; while the median number of ratings is 96, the maximum reaches 2,314. The right panel shows that popular movies attract more viewers. That is, the number of ratings for each movie is positively associated with its average rating score, indicating nonignorable missingness.

For the proposed method, we take advantage of missingness information from each user and item for clustering. We observe that users who give a large number of ratings tend to assign low rating scores; therefore we classify users based on the quantiles of the numbers of their ratings. For the items, we notice that old movies that are still being rated are usually classics with high average rating scores; therefore the items are clustered based on their release dates. We use N = 12 and M = 10 as the numbers of clusters for users and items in both datasets. The means of the ratings from different clusters are significantly different based on pairwise two-sample t-tests. In addition, we also try a large range of N's and M's, but they do not affect the results very much.

The proposed method is compared with the four matrix factorization methods described in Section 5.1.

Tuning parameters for each method are selected from grid points to minimize the RMSE on the validation set. For the proposed method, we apply the loss function (2) and select K = 2 and \lambda = 12 for the 1M data, and K = 6 and \lambda = 16 for the 10M data. For Agarwal and Chen (2009), we select K = 1 for both the 1M and 10M data, which requires 25 and 10 iterations of the EM algorithm, respectively, to guarantee convergence. For Mazumder et al. (2010), K = 4 and K = 9 are selected for the 1M and 10M data; since using different \lambda's does not influence the RMSE very much, we apply \lambda = 0 to estimate the theoretical local minimum. For Zhu et al. (2015), the tuning and the selection of K are provided automatically in their code, and the L_0-penalty function is applied. For the RSVD, K = 4 and \lambda = 7.5 are selected for the 1M data, and K = 4 and \lambda = 6 for the 10M data. In addition, we also compare the proposed method with the grand mean imputation approach, which predicts each rating by the mean of the training and validation sets, and with the linear regression approach, which regresses the ratings from the training and validation sets on all available covariates of users and items.

Table 3 provides the prediction results on the testing set, which indicate that the proposed method outperforms the other methods quite significantly. For example, for the 1M data, the RMSE of the proposed method is 8.6% less than the RSVD, 19.5% less than Agarwal and Chen (2009), 10.2% less than Mazumder et al. (2010), 9.3% less than Zhu et al. (2015), and 13.2% and 11.6% less than grand mean imputation and linear regression, respectively. For the 10M data, the proposed method improves on grand mean imputation, linear regression, the RSVD, Agarwal and Chen (2009), Mazumder et al. (2010) and Zhu et al. (2015) by 8.7%, 7.1%, 6.7%, 4.5%, 8.7% and 9.6% in terms of the RMSE, respectively. In addition, while some of the matrix factorization methods are worse than the linear regression method, the proposed method always beats linear regression.

We also investigate the cold-start problem in the MovieLens 10M data, where 96% of the ratings in the testing set are either from new users or on new items which are not available in the training set. We call these ratings new ratings, in contrast to the old ratings given by existing users to existing items. In Table 4, we compare the proposed method with the four competitive methods on the old ratings, the new ratings, and the entire testing set.

On the one hand, the RSVD, Mazumder et al. (2010), Zhu et al. (2015) and the proposed method have similar RMSEs for the old ratings, indicating similar prediction accuracy for existing users and items. On the other hand, the proposed method has the smallest RMSE for the new ratings and for the entire testing set, indicating its superior performance for the cold-start problem.
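The evaluation protocol above amounts to computing the RMSE separately on the old and new portions of the testing set, as in the short R sketch below (an added illustration; the data frames train and test and the prediction vector pred are assumed placeholders).

# RMSE on old ratings (user and item both seen in training), new ratings, and all ratings.
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))
is_old <- test$user %in% unique(train$user) & test$item %in% unique(train$item)
c(old = rmse(test$rating[is_old],  pred[is_old]),
  new = rmse(test$rating[!is_old], pred[!is_old]),
  all = rmse(test$rating, pred))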

7 Discussion

We propose a new recommender system which improves prediction accuracy through incorporating dependency among users and items, in addition to utilizing information from the non-random missingness. In most collaborative filtering methods, the training data may not have sufficient information to estimate subject-specific parameters for new users and items; therefore, only baseline models such as ANOVA or linear regression can be applied. In contrast, the proposed method provides a prediction for a new user u through \hat{\mu}_{ui} = x_{ui}^\top \hat{\beta} + \hat{s}_{v_u}^\top (\hat{q}_i + \hat{t}_{j_i}). The interaction term \hat{s}_{v_u}^\top \hat{q}_i provides the average rating of the v_u-th cluster on the i-th item, and guarantees that \hat{\mu}_{ui} is item-specific. The same property also holds for new items. The group effects \hat{s}_{v_u} and \hat{t}_{j_i} allow us to borrow information from existing users and items, and provide more accurate recommendations for new subjects.

The proposed model also takes advantage of missingness information, as users or items may have missing patterns associated with their rating behaviors. Therefore, we propose clustering users and items based on the numbers of their ratings or other variables associated with the missingness. Thus the group effects (s_{v_u}, t_{j_i}) can provide unique latent information which is not available in x_{ui}, p_u or q_i. Note that if the group effects (s_{v_u}, t_{j_i}) are the only factors associated with the missing process, then the proposed method captures the entire missing-not-at-random mechanism. In other words, correctly estimating (s_{v_u}, t_{j_i}) enables us to achieve consistent and efficient estimation of \mu_{ui}, regardless of the missing mechanism.

Appendix

Proof of Lemma 1

By Theorem 2.1 of Ansley and Kohn (1994), each of (3) and (4) has a unique solution, and the back-fitting algorithm of (5) and (6) can be applied to (3), and that of (7) and (8) to (4). This guarantees convergence to the unique solution given any initial value. Therefore, Algorithm 1 is equivalent to minimizing (3) and (4) iteratively. Note that minimizing (3) and (4) iteratively is a special case of the MBI algorithm (Chen et al., 2012) with two blocks. Therefore, following Theorem 3.1 of Chen et al. (2012), Algorithm 1 converges to a stationary point. This completes the proof.

Proof of Lemma 2

Since \mu_{ui} = (p_u + s_{v_u})^\top (q_i + t_{j_i}) is a quadratic function of (p_u, q_i, s_{v_u}, t_{j_i}), and given that (p_u, q_i, s_{v_u}, t_{j_i}) and (\tilde{p}_u, \tilde{q}_i, \tilde{s}_{v_u}, \tilde{t}_{j_i}) are bounded by L, there exists a constant C_1 \ge 0 such that

\|\mu_{ui} - \tilde{\mu}_{ui}\|_2 \le C_1 \|(p_u, q_i, s_{v_u}, t_{j_i}) - (\tilde{p}_u, \tilde{q}_i, \tilde{s}_{v_u}, \tilde{t}_{j_i})\|_2.

Recall that f_{ui}(r, \theta) = f(r_{ui} | \mu_{ui}); then based on Assumption 1,

h^2(\theta, \tilde{\theta}) = \frac{1}{nm} \sum_{u=1}^{n} \sum_{i=1}^{m} \int | f_{ui}^{1/2}(r, \theta) - f_{ui}^{1/2}(r, \tilde{\theta}) |^2 \, d\nu(r)
\le \frac{1}{nm} \sum_{u=1}^{n} \sum_{i=1}^{m} \bar{G}^2 C_1^2 \|(p_u, q_i, s_{v_u}, t_{j_i}) - (\tilde{p}_u, \tilde{q}_i, \tilde{s}_{v_u}, \tilde{t}_{j_i})\|_2^2
\le \frac{1}{nm} \bar{G}^2 C_1^2 \big( \|P - \tilde{P}\|_F^2 + \|Q - \tilde{Q}\|_F^2 + \|S_c - \tilde{S}_c\|_F^2 + \|T_c - \tilde{T}_c\|_F^2 \big)
\le \frac{1}{nm} \bar{G}^2 C_1^2 \big\{ \|P - \tilde{P}\|_F^2 + \|Q - \tilde{Q}\|_F^2 + (n+m)(\|S - \tilde{S}\|_F^2 + \|T - \tilde{T}\|_F^2) \big\}
\le \frac{n+m}{nm} \bar{G}^2 C_1^2 \|\theta - \tilde{\theta}\|_2^2.

The second-to-last inequality results from the fact that

\|S_c - \tilde{S}_c\|_F^2 = \sum_{v=1}^{N} |V_v| \, \|s_v - \tilde{s}_v\|_2^2 \le \max_{v=1,...,N} \{|V_v|\} \, \|S - \tilde{S}\|_F^2 \le (n+m) \|S - \tilde{S}\|_F^2,

and similarly \|T_c - \tilde{T}_c\|_F^2 \le (n+m) \|T - \tilde{T}\|_F^2. The last inequality results from the fact that

\|\theta - \tilde{\theta}\|_2^2 = \|P - \tilde{P}\|_F^2 + \|Q - \tilde{Q}\|_F^2 + \|S - \tilde{S}\|_F^2 + \|T - \tilde{T}\|_F^2.

Define d_0 = \bar{G} C_1, and the result then follows. This completes the proof.
Proof of Theorem 1

We first verify the condition of Lemma 2.1 of Ossiander (1987). Based on Lemma 2,

\Big\{ \frac{1}{nm} \sum_{u=1}^{n} \sum_{i=1}^{m} E\Big( \sup_{\tilde{\theta} \in B_d(\theta)} | f_{ui}^{1/2}(r, \theta) - f_{ui}^{1/2}(r, \tilde{\theta}) |^2 \Big) \Big\}^{1/2}
= \Big\{ \frac{1}{nm} \sum_{u=1}^{n} \sum_{i=1}^{m} \int \sup_{\tilde{\theta} \in B_d(\theta)} | f_{ui}^{1/2}(r, \theta) - f_{ui}^{1/2}(r, \tilde{\theta}) |^2 \, d\nu(r) \Big\}^{1/2}
\le \Big\{ \frac{n+m}{nm} \bar{G}^2 C_1^2 \sup_{\tilde{\theta} \in B_d(\theta)} \|\theta - \tilde{\theta}\|_2^2 \Big\}^{1/2}
\le \sqrt{\frac{n+m}{nm}} \, d_0 d := g(d),

where B_d(\theta) denotes a ball of radius d around \theta. Hence for u > 0,

H_B(u, S(k), \|\cdot\|) \le H(g^{-1}(u/2), S(k), \|\cdot\|_2),

where H_B is the metric entropy of S(k) with bracketing of f^{1/2}, H is the ordinary metric entropy of S(k), and \|\cdot\| is the L_2-norm.

Next we provide an upper bound for H(g^{-1}(u/2), S(k), \|\cdot\|_2). Since g^{-1}(u/2) = \sqrt{nm/(n+m)} \, u / (2 d_0), N \le n, M \le m, and \|\theta\|_\infty \le L, we have

0 \le H_B(u, S(k), \|\cdot\|)
\le H(g^{-1}(u/2), S(k), \|\cdot\|_2)
\le \log \max\Big\{ \Big( \frac{L \sqrt{(n+m+N+M)K}}{\sqrt{nm} \, u / (2 d_0 \sqrt{n+m})} \Big)^{(n+m+N+M)K}, 1 \Big\}
\le \max\Big\{ (n+m+N+M)K \log\Big( \frac{2\sqrt{2} \sqrt{K} d_0 L (n+m)}{\sqrt{nm} \, u} \Big), 0 \Big\}
= \max\Big\{ (n+m+N+M)K \log\Big( \frac{\sqrt{K} C (n+m)}{\sqrt{nm} \, u} \Big), 0 \Big\}

for u \ge \varepsilon^2_{|\Omega|}, where C = 2\sqrt{2} d_0 L.

We now find the convergence rate \varepsilon_{|\Omega|}, which is the smallest \varepsilon that satisfies the conditions of Theorem 1 of Shen (1998). That is,

\sup_{k \ge k_0} \phi_1(\varepsilon, k) \le c_2 |\Omega|^{1/2}

for a constant k_0, where \phi_1(\varepsilon, k) = \int_{x}^{x^{1/2}} \{ H_B(u, \mathcal{F}(k)) \}^{1/2} \, du / x with x = c_1 \varepsilon^2 + \lambda_{|\Omega|}(k - k_0), and \mathcal{F}(k) = \{ f^{1/2}(r, \theta) : \theta \in S(k) \}. Note that \phi_1 \le 0 \le c_2 |\Omega|^{1/2} when x \ge 1, so we only consider the case 0 < x < 1. Notice that K \ge 1 and n + m \le nm. Then, with a sufficiently large L, we have

\max\Big\{ \log\Big( \frac{\sqrt{K} C (n+m)}{\sqrt{nm} \, u} \Big), 0 \Big\} = \log\Big( \frac{\sqrt{K} C (n+m)}{\sqrt{nm} \, u} \Big)

for u \in [x, x^{1/2}]. Then

\phi_1(\varepsilon, k) = \int_{x}^{x^{1/2}} \{ H_B(u, \mathcal{F}(k)) \}^{1/2} \, du / x
\le \{ (n+m+N+M)K \}^{1/2} \int_{x}^{x^{1/2}} \Big\{ \log\Big( \frac{\sqrt{K} C (n+m)}{\sqrt{nm}} \Big) - \log u \Big\}^{1/2} du / x
\le \{ (n+m+N+M)K \}^{1/2} (x^{-1/2} - 1) \Big\{ \log\Big( \frac{\sqrt{K} C (n+m)}{\sqrt{nm}} \Big) + \log(x^{-1}) \Big\}^{1/2}.

Since \lambda_{|\Omega|} < 2^{-1} k^{-2} \varepsilon^2_{|\Omega|} and k = O((n+m+N+M)K), we have \lambda_{|\Omega|} = o(\varepsilon^2_{|\Omega|}). Therefore, we solve

\sup_{k \ge k_0} \phi_1(\varepsilon, k) = \phi_1(\varepsilon, k_0)
\le \{ (n+m+N+M)K \}^{1/2} \, \varepsilon^{-1}_{|\Omega|} \Big\{ \log\Big( \frac{\sqrt{K} (n+m)}{\varepsilon^2_{|\Omega|} \sqrt{nm}} \Big) \Big\}^{1/2} = c_2 |\Omega|^{1/2}.

Then the smallest rate \varepsilon_{|\Omega|} is determined by

\frac{1}{\varepsilon_{|\Omega|}} \Big\{ \log\Big( \frac{\sqrt{K} (n+m)}{\varepsilon^2_{|\Omega|} \sqrt{nm}} \Big) \Big\}^{1/2} \asymp \frac{|\Omega|^{1/2}}{\{ (n+m+N+M)K \}^{1/2}}.

Note that N \le n and M \le m; then we have

\varepsilon_{|\Omega|} \asymp \Big\{ \frac{(n+m)K}{|\Omega|} \Big\}^{1/2} \Big\{ \log\Big( \frac{|\Omega|}{\sqrt{nmK}} \Big) \Big\}^{1/2}.

For \varepsilon_{|\Omega|} and \lambda_{|\Omega|}, the conditions of Corollary 1 of Shen (1998) are now satisfied. The result then follows. This completes the proof.

Proof of Theorem 2

Based on Theorem 1 and Theorem 1 of Shen (1998), there exist c_i > 0, i = 1, 2, such that

P^*\Big( \sup_{\{\tilde{\theta} \in S(k), \, h(\theta, \tilde{\theta}) \ge \varepsilon\}} \{ L(\theta | R^o) - L(\tilde{\theta} | R^o) \} \ge -c_1 |\Omega| \varepsilon^2 \Big) \le 7 \exp(-c_2 |\Omega| \varepsilon^2).

Therefore, if there exists \tilde{\theta} \in S(k) such that h(\theta, \tilde{\theta}) \ge \varepsilon, then

P^*\big( \{ L(\theta | R^o) - L(\tilde{\theta} | R^o) \} \ge -c_1 |\Omega| \varepsilon^2 \big) \le 7 \exp(-c_2 |\Omega| \varepsilon^2).

We suppress the subscripts, write h(\mu_{ui}, \tilde{\mu}_{ui}) as h(\mu, \tilde{\mu}), and write f(r_{ui} | \mu_{ui}) as f(r | \mu). We now lower-bound h(\mu, \tilde{\mu}) by a function of |\mu_{ui} - \tilde{\mu}_{ui}|:

h^2(\mu, \tilde{\mu}) = E\big\{ f^{1/2}(r | \mu) - f^{1/2}(r | \tilde{\mu}) \big\}^2
= \Big( \int_{\{f(r|\mu) > f(r|\tilde{\mu})\}} + \int_{\{f(r|\tilde{\mu}) \ge f(r|\mu)\}} \Big) \big\{ f^{1/2}(r | \mu) - f^{1/2}(r | \tilde{\mu}) \big\}^2 \, d\nu(r)
:= I_1 + I_2,

where

I_1 = \int_{\{f(r|\mu) > f(r|\tilde{\mu})\}} f(r | \mu) \Big( 1 - \exp\Big[ \tfrac{1}{2} \big\{ (\tilde{\mu} - \mu) T(r) - (A(\tilde{\mu}) - A(\mu)) \big\} \Big] \Big)^2 d\nu(r),

and

I_2 = \int_{\{f(r|\tilde{\mu}) \ge f(r|\mu)\}} f(r | \tilde{\mu}) \Big( 1 - \exp\Big[ \tfrac{1}{2} \big\{ (\mu - \tilde{\mu}) T(r) - (A(\mu) - A(\tilde{\mu})) \big\} \Big] \Big)^2 d\nu(r).

For I_1, since f(r | \mu) > f(r | \tilde{\mu}), we have Z := (\tilde{\mu} - \mu) T(r) - (A(\tilde{\mu}) - A(\mu)) \le 0. Since \|\theta\|_\infty \le L, \mu is bounded in a closed set, and hence A'(\mu) = E[T(r)] is bounded. Let L_A = \sup_\mu |E \, T(r)|; then |A(\tilde{\mu}) - A(\mu)| \le L_A |\tilde{\mu} - \mu|. Then -Z = |Z| \ge (|T(r)| - L_A) |\tilde{\mu} - \mu|. That is,

1 - \exp\{ -\tfrac{1}{2} |Z| \} \ge \max\Big\{ 1 - \exp\Big[ \tfrac{1}{2} (L_A - |T(r)|) |\mu - \tilde{\mu}| \Big], 0 \Big\},

and

I_1 = \int_{\{f(r|\mu) > f(r|\tilde{\mu})\}} f(r | \mu) \big( 1 - \exp\{ -\tfrac{1}{2} |Z| \} \big)^2 d\nu(r)
\ge \int_{\{f(r|\mu) > f(r|\tilde{\mu})\}} f(r | \mu) \max\Big\{ 1 - \exp\Big[ \tfrac{1}{2} (L_A - |T(r)|) |\mu - \tilde{\mu}| \Big], 0 \Big\}^2 d\nu(r).

In a similar way,

I_2 \ge \int_{\{f(r|\tilde{\mu}) \ge f(r|\mu)\}} f(r | \tilde{\mu}) \max\Big\{ 1 - \exp\Big[ \tfrac{1}{2} (L_A - |T(r)|) |\mu - \tilde{\mu}| \Big], 0 \Big\}^2 d\nu(r)
\ge \int_{\{f(r|\tilde{\mu}) \ge f(r|\mu)\}} f(r | \mu) \max\Big\{ 1 - \exp\Big[ \tfrac{1}{2} (L_A - |T(r)|) |\mu - \tilde{\mu}| \Big], 0 \Big\}^2 d\nu(r).

Notice that 1 - \exp\big\{ \tfrac{1}{2} (L_A - |T(r)|) |\mu - \tilde{\mu}| \big\} \ge 0 if and only if |T(r)| \ge L_A. Hence,

h^2(\mu, \tilde{\mu}) = I_1 + I_2 \ge \int_{\{|T(r)| \ge L_A\}} f(r | \mu) \Big[ 1 - \exp\Big\{ \tfrac{1}{2} (L_A - |T(r)|) |\mu - \tilde{\mu}| \Big\} \Big]^2 d\nu(r),

which is a non-decreasing function of |\mu - \tilde{\mu}|. Therefore, for each \mu_{ui}, and given the \varepsilon_{|\Omega|} in Theorem 1, there exists a \delta_{|\Omega|}(\mu_{ui}) such that |\tilde{\mu}_{ui} - \mu_{ui}| > \delta_{|\Omega|}(\mu_{ui}) implies h(\mu_{ui}, \tilde{\mu}_{ui}) \ge \varepsilon_{|\Omega|}. Let \delta_{|\Omega|} = \max_{1 \le u \le n, 1 \le i \le m} \sup_{\mu_{ui}} \delta_{|\Omega|}(\mu_{ui}); then |\tilde{\mu}_{ui} - \mu_{ui}| > \delta_{|\Omega|} for each (u, i) implies h(\theta, \tilde{\theta}) \ge \varepsilon_{|\Omega|}, and the result follows. This completes the proof.

Proof of Corollary 1

Define \tilde{\Omega} = \{(u, i) : |\mu^0_{ui} - \mu_{ui}| > \delta_{|\Omega|}\}. Then the cardinality of \tilde{\Omega} satisfies |\tilde{\Omega}| \ge \gamma nm. From Theorem 2, for (u, i) \in \tilde{\Omega}, we have h(\mu^0_{ui}, \mu_{ui}) \ge \gamma^{-1/2} \varepsilon_{|\Omega|}. Hence, by the definition of \theta^0, we have

h(\theta, \theta^0) = \Big\{ \frac{1}{nm} \sum_{i=1}^{m} \sum_{u=1}^{n} h^2(\mu^0_{ui}, \mu_{ui}) \Big\}^{1/2} \ge \Big\{ \frac{1}{nm} \, \gamma(nm) \, \gamma^{-1} \varepsilon^2_{|\Omega|} \Big\}^{1/2} = \varepsilon_{|\Omega|}.

This completes the proof.

Proof of Corollary 2

In the proof of Corollary 1, we verify that h(\theta, \theta^0) \ge \varepsilon_{|\Omega|}. Then by Lemma 2, we have

\varepsilon^2_{|\Omega|} \le h^2(\theta, \theta^0) \le \frac{d_0^2 (n+m)}{nm} \|\theta - \theta^0\|_2^2.

Then \|S\|_F^2 + \|T\|_F^2 = \|\theta - \theta^0\|_2^2 \ge \frac{nm}{d_0^2 (n+m)} \varepsilon^2_{|\Omega|}. Meanwhile, each entry of S and T is bounded by L, and hence \|S\|_F^2 + \|T\|_F^2 \le (N+M) K L^2. Therefore,

N + M \ge \frac{nm}{d_0^2 L^2 K (n+m)} \varepsilon^2_{|\Omega|}.

By the rate of \varepsilon_{|\Omega|} provided in Theorem 1, we have

O(N + M) \ge \frac{nm}{|\Omega|} \log\Big( \frac{|\Omega|}{\sqrt{nmK}} \Big).

This completes the proof.

References

Adomavicius, G. and Tuzhilin, A. (2005). Toward the next generation of recommender systems: A survey of the state-of-the-art and possible extensions. IEEE Transactions on Knowledge and Data Engineering, 17(6):734–749.

Agarwal, D. and Chen, B.-C. (2009). Regression-based latent factor models. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 19–28. ACM.

Ansley, C. F. and Kohn, R. (1994). Convergence of the backfitting algorithm for additive models. Journal of the Australian Mathematical Society (Series A), 57(03):316–329.

Bell, R. M. and Koren, Y. (2007). Scalable collaborative filtering with jointly derived neighborhood interpolation weights. In Proceedings of the 2007 7th IEEE International Conference on Data Mining, 43–52. IEEE.

Blanco-Fernandez, Y., Pazos-Arias, J. J., Gil-Solla, A., Ramos-Cabrer, M., and Lopez-Nores, M. (2008). Providing entertainment by content-based filtering and semantic reasoning in intelligent recommender systems. IEEE Transactions on Consumer Electronics, 54(2):727.

Cacheda, F., Carneiro, V., Fernández, D., and Formoso, V. (2011). Comparison of collaborative filtering algorithms: Limitations of current techniques and proposals for scalable, high-performance recommender systems. ACM Transactions on the Web (TWEB), 5(1):2.

Carroll, J. D. and Chang, J.-J. (1970). Analysis of individual differences in multidimensional scaling via an N-way generalization of Eckart-Young decomposition. Psychometrika, 35(3):283–319.

Chen, B., He, S., Li, Z., and Zhang, S. (2012). Maximum block improvement and polynomial optimization. SIAM Journal on Optimization, 22(1):87–107.

Efron, B., Hastie, T., Johnstone, I., Tibshirani, R., et al. (2004). Least angle regression. The Annals of Statistics, 32(2):407–499.

Feuerverger, A., He, Y., and Khatri, S. (2012). Statistical significance of the Netflix challenge. Statistical Science, 27(2):202–231.

Funk, S. (2006). Netflix update: Try this at home. URL http://sifter.org/~simon/journal/20061211.html.

Goldberg, K., Roeder, T., Gupta, D., and Perkins, C. (2001). Eigentaste: A constant time collaborative filtering algorithm. Information Retrieval, 4(2):133–151.

Harshman, R. A. (1970). Foundations of the PARAFAC procedure: Models and conditions for an "explanatory" multi-modal factor analysis. UCLA Working Papers in Phonetics, 16:1–84.

Koren, Y. (2010). Factor in the neighbors: Scalable and accurate collaborative filtering. ACM Transactions on Knowledge Discovery from Data (TKDD), 4(1):1.

Koren, Y., Bell, R., and Volinsky, C. (2009). Matrix factorization techniques for recommender systems. Computer, 42(8):30–37.

Lang, K. (1995). Newsweeder: Learning to filter netnews. In Proceedings of the 12th International Conference on Machine Learning, 331–339.

Lops, P., De Gemmis, M., and Semeraro, G. (2011). Content-based recommender systems: State of the art and trends. In Recommender Systems Handbook, 73–105. Springer.

Mazumder, R., Hastie, T., and Tibshirani, R. (2010). Spectral regularization algorithms for learning large incomplete matrices. The Journal of Machine Learning Research, 11:2287–2322.

Melville, P., Mooney, R. J., and Nagarajan, R. (2002). Content-boosted collaborative filtering for improved recommendations. In Proceedings of the 18th National Conference on Artificial Intelligence, 187–192.

Mooney, R. J. and Roy, L. (2000). Content-based book recommending using learning for text categorization. In Proceedings of the 5th ACM Conference on Digital Libraries, 195–204. ACM.

Nguyen, A.-T., Denos, N., and Berrut, C. (2007). Improving new user recommendations with rule-based induction on cold user data. In Proceedings of the 2007 ACM Conference on Recommender Systems, 121–128. ACM.

Nguyen, J. and Zhu, M. (2013). Content-boosted matrix factorization techniques for recommender systems. Statistical Analysis and Data Mining: The ASA Data Science Journal, 6(4):286–301.

Ossiander, M. (1987). A central limit theorem under metric entropy with L2 bracketing. The Annals of Probability, 15(3):897–919.

Park, S.-T., Pennock, D., Madani, O., Good, N., and DeCoste, D. (2006). Naïve filterbots for robust cold-start recommendations. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 699–705. ACM.

Pazzani, M. J. and Billsus, D. (2007). Content-based recommendation systems. In The Adaptive Web, 325–341. Springer.

Pollard, D. (2012). Convergence of Stochastic Processes. Springer Science & Business Media.

Rubin, D. B. (1976). Inference and missing data. Biometrika, 63(3):581–592.

Salakhutdinov, R., Mnih, A., and Hinton, G. (2007). Restricted Boltzmann machines for collaborative filtering. In Proceedings of the 24th International Conference on Machine Learning, 791–798. ACM.

Shen, X. (1998). On the method of penalization. Statistica Sinica, 8(2):337–357.

Srebro, N., Alon, N., and Jaakkola, T. S. (2005). Generalization error bounds for collaborative prediction with low-rank matrices. In Advances in Neural Information Processing Systems 17, 527.

Wang, J. (2010). Consistent selection of the number of clusters via crossvalidation. Biometrika, 97(4):893–904.

Wong, W. H. and Shen, X. (1995). Probability inequalities for likelihood ratios and convergence rates of sieve MLEs. The Annals of Statistics, 23(2):339–362.

Wu, M. (2007). Collaborative filtering via ensembles of matrix factorizations. In Proceedings of KDD Cup and Workshop.

Zhu, Y., Shen, X., and Ye, C. (2015). Personalized prediction and sparsity pursuit in latent factor models. Journal of the American Statistical Association, (to appear).

Table 1: RMSE (standard error) of the proposed method compared with four existing methods, with missing rate \pi = 70%, 80%, 90% and 95%, and the number of latent factors K = 3 or 6, where RSVD, AC, MHT and ZSY stand for the regularized singular value decomposition, the regression-based latent factor model (Agarwal and Chen, 2009), Soft-Impute (Mazumder et al., 2010), and the latent factor model with sparsity pursuit (Zhu et al., 2015), respectively.

No. of latent factors   Missing rate   The proposed method   RSVD            AC              MHT             ZSY
K = 3                   70%            1.232 (0.029)         1.823 (0.324)   4.218 (0.089)   3.591 (0.178)   2.384 (0.077)
                        80%            1.329 (0.042)         2.574 (0.506)   4.190 (0.091)   4.064 (0.140)   2.574 (0.085)
                        90%            1.521 (0.070)         4.002 (0.689)   4.109 (0.095)   4.581 (0.116)   2.982 (0.095)
                        95%            1.800 (0.103)         4.526 (0.172)   4.087 (0.096)   4.774 (0.123)   3.288 (0.100)
K = 6                   70%            1.461 (0.035)         3.728 (0.188)   7.164 (0.132)   7.126 (0.294)   5.844 (0.656)
                        80%            1.634 (0.058)         4.926 (0.274)   6.962 (0.134)   8.038 (0.267)   5.885 (0.145)
                        90%            2.032 (0.136)         7.048 (0.270)   6.805 (0.136)   8.931 (0.172)   6.019 (0.420)
                        95%            2.839 (0.388)         8.316 (0.270)   6.846 (0.149)   9.142 (0.176)   6.207 (0.151)

Table 2: RMSE (standard error) of the proposed method when the missing rate \pi is 70%, 80%, 90% or 95%, the number of latent factors K = 3 or 6, and the cluster misspecification rate is 0%, 10%, 30% or 50%.

No. of latent factors   Missing rate   Misspecification rate
                                       0%              10%             30%             50%
K = 3                   70%            1.232 (0.029)   1.237 (0.029)   1.250 (0.032)   1.265 (0.038)
                        80%            1.329 (0.042)   1.340 (0.052)   1.359 (0.051)   1.380 (0.049)
                        90%            1.521 (0.070)   1.544 (0.180)   1.591 (0.162)   1.626 (0.255)
                        95%            1.800 (0.103)   1.810 (0.116)   1.869 (0.102)   1.920 (0.093)
K = 6                   70%            1.461 (0.035)   1.502 (0.049)   1.560 (0.048)   1.623 (0.059)
                        80%            1.634 (0.058)   1.698 (0.070)   1.815 (0.074)   1.911 (0.092)
                        90%            2.032 (0.136)   2.229 (0.198)   2.428 (0.146)   2.648 (0.150)
                        95%            2.839 (0.388)   3.041 (0.302)   3.245 (0.238)   3.373 (0.178)

Table 3: RMSE of the proposed method compared with six existing methods for the MovieLens 1M and 10M data, where RSVD, AC, MHT and ZSY stand for the regularized singular value decomposition, the regression-based latent factor model (Agarwal and Chen, 2009), Soft-Impute (Mazumder et al., 2010), and the latent factor model with sparsity pursuit (Zhu et al., 2015), respectively.

Method                   MovieLens 1M   MovieLens 10M
The proposed method      0.9644         0.9295
Grand mean imputation    1.1112         1.0185
Linear regression        1.0905         1.0007
RSVD                     1.0552         0.9966
AC                       1.1974         0.9737
MHT                      1.0737         1.0177
ZSY                      1.0635         1.0287

Table 4: RMSE of the proposed method compared with four existing methods on the MovieLens 10M data to study the cold-start problem: "old ratings" and "new ratings" stand for ratings in the testing set given by existing users to existing items, and by new users or to new items, respectively. Here RSVD, AC, MHT and ZSY stand for the regularized singular value decomposition, the regression-based latent factor model (Agarwal and Chen, 2009), Soft-Impute (Mazumder et al., 2010), and the latent factor model with sparsity pursuit (Zhu et al., 2015), respectively.

                         The proposed method   RSVD     AC       MHT      ZSY
Old ratings              0.7971                0.8062   1.3324   0.8160   0.7959
New ratings              0.9348                1.0039   0.9553   1.0252   1.0375
Entire testing set       0.9295                0.9966   0.9737   1.0177   1.0287

Figure 1: Missing pattern analysis for the MovieLens 1M data. Left: Most users rated a small number of movies, while few users rated a large number of movies. Right: Movies with a high average rating attract more users.
