Differences

This shows you the differences between two versions of the page.

--- data_analysis:recommendation_bpr [2020/04/27 05:56] – prgram
+++ data_analysis:recommendation_bpr [2025/07/07 14:12] (current) – external edit 127.0.0.1
@@ Line 1: / Line 1: @@
-====== recommendation bpr ======
-<!--
+<wrap hide>
-	$theme: default
+<style>
-	$size: 4:3
+  .slide {
-	page_number: true
+  	font-size: 25px;
-	footer: this is footer
+  }
--->
+</style>
-# Slide 1
+<!-- $theme: default -->
+</wrap>
-# ![](images/marp.png)
+====== Review on BPR: Bayesian Personalized Ranking from Implicit Feedback, Rendle et al. ======
-------
-# Slide 2
+---
+====# Recommendation of item====
+### Explicit Feedback : user's preference of item itself
+> eg. '1-5 score' of movies, 'like/dislike' in video streaming service
+### Implicit Feedback : user behavior which reflects their interest 'indirectly'
+> eg. purchase history, click, view, etc.
+- only positive feedback observations are available.
+- non-observed user-item pairs = real negative feedback + missing value
+---
+====# BPR : Bayesian Personalized Ranking from Implicit Feedback====
+ Algorithm = model + **objective function + optimization**
+###
+> ### objective function = BPR-OPT
+> ### optimization = LEARN_BPR
+> #
+> ### model = ?
+---
+====# Formalization====
+* $U$ : set of all users, $I$ : set of all items
+* $S \subseteq U \times I$ : implicit feedback
+* $>_u \subset I^2$ : personalied total ranking of all items
+>  properties of total order:  $\forall i,j,k \in I$
+> (totality) : $\text{if } i \neq j \text{, then } i >_u j \text{ or } j >_u i$
+> (antisymmetry) : $\text{if } i>_uj \text{ and } j>_ui \text{, then } i = j$
+> (transitivity) : $\text{if } i>_uj \text{ and } j>_uk \text{, then } i>_u k$
-------
+* $I_u^+:=\{i \in I : (u,i) \in S\}$, $U_i^+:=\{u \in U : (u,i) \in S\}$
-# A Review on
+---
+====# Approach====
-> # BRP: Bayesian Personalized Ranking from Implicit Feedback, Rendle et al.
+* use item pairs as training data
+* optimize for correctly ranking item pairs instead of scroing single items
-------
+### Training data
+> $D_s := \{ (u,i,j) | i \in I_u^+ \text{ and } j \in I\setminus I_u^+ \}$  : training set
+ $\Rightarrow$ $(u,i,j)\in D_s$ means user u is assumed to prefer i over j : $i>_uj$
-### Algorithm = model + _loss function_ + _optimization_
+### Advantages
+. can treat both positive and negative pairs and missing values
+. training data is created for the actual objective of ranking
-- Loss function = BPR-OPT ( → approximates AUC)
+---
-- Optimization = stochastic gradient descent
+====# BPR optimization criterion (1/3)====
+Bayesian formulation : maximize posterior probability
+> $$ P(\Theta|>_u) \propto p(>_u|\Theta)p(\Theta) $$
+> $\Theta$ : parameter vector of an arbitrary model class
-------
+---
+====# BPR optimization criterion (2/3)====
+set $X_{uij} = I(i>_uj)$,
+then $X_{uij} \sim Bernoulli(p_{uij})$ where $p_{uij} = p(i>_uj)$
-. choose a rank metric
+for all user analysis, assume that each $(u,i,j) \in D_s$ are independent.
-. identify parts that are not smooth
+> $$\prod_{u\in U}p(>_u|\Theta) = \prod_{(u,i,j)\in D_s}p_{uij}^{x_{uij}}(1-p_{uij})^{1-x_{uij}} =\prod_{(u,i,j)\in D_s}p_{uij}$$
-. replace with "close enough" approximations
-   - often logistic function is used here
-. learn with gradient descent
-------
+---
+====# BPR optimization criterion (3/3)====
+with setting $p(i>_uj|\Theta) := \sigma(\hat{x}_{uij}(\Theta))$, where $\sigma (x) := {1 \over 1+ \exp(-x)}$
+and prior $p(\Theta)$ as normal distribution $N(0,\lambda_{\Theta} ^{-1}I)$
-- Explicit :
+>
-- Implicit : only positive observations are available
+> $$
-  $$ \text{non-observed data} = \text{real negative feedback} + \text{missing value} $$
+\begin{matrix}
+\text{BPR-OPT}
+ &:=& \ln \prod_{u\in U}p(\Theta|>_u) \\
+ &=& \ln \prod_{u\in U}(p(>_u|\Theta)p(\Theta) \\
+ &=& \sum_{(u,i,j)\in D_s} \ln\sigma(\hat{x}_{uij}(\Theta)) + \ln p(\Theta) \\
+ &=& \sum_{(u,i,j)\in D_s} \ln\sigma(\hat{x}_{uij}(\Theta)) - \lambda_{\Theta} \lVert\Theta\rVert^2
+ \end{matrix} $$
+> where $\lambda_{\Theta}$ are model specific regularization parameters.
-------
+â€» form of MLE with $l_2$-regularization
-### properties of a total order
+---
+====# Relation to AUC====
+> $$ \begin{matrix}
+AUC &=& {1 \over|U|}\sum_{u\in U} AUC(u) \\
+&=& {1 \over|U|}\sum_{u\in U} {1 \over |I_u^+||I\setminus I_u^+|} \sum_{i\in I_u^+} \sum_{j \in I\setminus I_u^+} \delta(\hat{x}_{uij}>0) \\
+&=& \sum_{(u,i,j)\in D_s}z_u \delta(\hat{x}_{uij}>0)
+\end{matrix} $$
+> where $z_u$ is normalizing constant.
-- totality : $\forall{i,j}\in{I},i\neq j \Rightarrow j >_{u}  \text{ or }  j >_{u} i$
+### compare to the BPR-OPT : they only differ in the loss function
-- antisymmetry :
+* non-diferentiable $\delta(x>0)$ is approximated by $\ln \sigma(x)$
-- transitivity :
-------
+---
+====# Optimization====
+### gradient descent
+> $$ {\partial \text{BPR-OPT} \over \partial \Theta} = \sum_{(u,i,j)\in D_s} {-\exp(-\hat{x}_{uij}) \over 1+\exp(-\hat{x}_{uij})} \cdot {\partial \over \partial\Theta}\hat{x}_{uij} - \lambda_\Theta \Theta$$
-BRP
+. Full GD
+	* too large training data : $O(|S||I|)$ triples
+	* the skewness in the training pairs : the dominating class affect largely
+	* regularization is difficult as the gradients differ much
+. Stochastic GD
+	* good approach for skew problem
+	* but, order of training data traversed is crucial
+		* in typical method (item-wise or user-wise), there are so many consecutive updates on the same user-item pair
+		$\Rightarrow$ bootstrap sampling with replacement
+---
+====# Model====
+Many model predict a real number $\hat{x}_{ul}$ per user-item pair $(u,l)$
-bayesian formulation
+To learn such model with BPR,
-$$\text{maximize} \ \ \  p(\Theta|>_u) \propto p(>_u|\Theta)p(\Theta) $$
+Decompose the estimator $\hat{x}_{uij}$
+> $$ \hat{x}_{uij} := \hat{x}_{ui}-\hat{x}_{uj} $$
-$$ X_{uij} := I(i >_u j) \sim Bern(p_{uij}) s.t. p_{uij} = P(i>_u j) $$
+â€» use same model, but optimize it against another criterion.
+(difference of two prediction instead of single number)
-with the assumption that each $(u,i,j) \in D_S$ is independent,
+---
+====# Model : Matrix Factorization====
+$X = [x_{ui}]$ is approximated by two low-rank matrices $W:|U| \times k$ and $H:|I|\times k$
+> $$ \hat{X} := WH^t $$
+> $$ \hat{x_{ui}} = \langle w_u, h_i \rangle= \sum_{f=1}^k w_{uf} \cdot h_{if} $$
+> $$ w_u : \text {row in }W \text{, describing user } u $$
+> $$ h_i : \text {row in }H \text{, describing item } i $$
+>
-$$ \prod_{u \in U} p( >_u | \Theta) = \prod_{(u,i,j) \in D_s} p (i>_u j|\Theta)$$
+---
-$$p(i>_uj|\Theta) := \sigma( \hat{x}_{uij}(\Theta) )$$
+$\Rightarrow \Theta = (W,H)$
-## $$\sigma(x) := {1\over 1+ \exp(-1) } $$
+> $${\partial \over \partial\theta}\hat{x}_{uij} =
+\begin{cases}
+(h_{if}-h_{jf}) & \text{if } \theta = w_{uf}, \\
+w_{uf} & \text{if } \theta = h_{if}, \\
+-w_{uf} & \text{if } \theta = h_{jf}, \\
+& \text{else}
+\end{cases}$$
-$$ \begin{matrix}
+$\lambda_W$ $\lambda_{H^+}$ $\lambda_{H^-}$ : regularization for each $\Theta$
-	BPR-OPT &:=& \ln p(\Theta|>_u)  \\
-    		&=& \ln p(>_u|\Theta)p(\Theta) \\
-            &=& \ln \prod_{(u,i,j) \in D_s} \sigma (\hat{x}_{uij})p(\Theta) \\
-            &=& \sum_{(u,i,j) \in D_s} \ln\sigma (\hat{x}_{uij}) + \ln p(\Theta) \\
-            &=& \sum_{(u,i,j) \in D_s} \ln\sigma (\hat{x}_{uij}) + \lambda_{\Theta} \lVert\Theta\rVert^2
-   \end{matrix}
-## $$
+---
+====# Model : Adaptive k-Nearest-Neighbor====
+> $$ \hat{x}_{ui} = \sum_{l \in I_u^+\text{ and }l \neq i} c_{il} $$
+> where $C:I \times I$ is the symmetric item-correlation/similarity matrix.
+>
-$$ \begin{matrix}
+$\Rightarrow \Theta = C$
-{\partial BPR-OPT \over \partial \Theta}
+> $${\partial \over \partial\theta}\hat{x}_{uij} =
-&=&
+\begin{cases}
-\sum_{(u,i,j) \in D_s} {\partial \over \partial\Theta }\ln\sigma (\hat{x}_{uij}) + \lambda_{\Theta} {\partial \over \partial\Theta }\lVert\Theta\rVert^2 \\
++1 & \text{if } \theta \in \{c_{il}, c_{li} \} \text{ and } l \in I_u^+\text{ and }l \neq i, \\
+-1 & \text{if } \theta \in \{c_{jl}, c_{lj} \} \text{ and } l \in I_u^+\text{ and }l \neq j, \\
+& \text{else}
+\end{cases}$$
-&\propto& \sum_{(u,i,j) \in D_s} {-\exp(-\hat{x}_{uij})\over{1+\exp(-\hat{x}_{uij})}} \centerdot {\partial \over \partial\Theta }\hat{x}_{uij} - \lambda_{\Theta} \Theta
+$\lambda_{+}$  $\lambda_{-}$ : regularization for each $\Theta$
-\end{matrix} $$
-------
-## full gradient descent
+---
+====# Relation to other methods ====
+### Weighted Regularized MF (WR-MF)
+implicit
+> $$ \sum_{u\in U} \sum_{i\in I} c_{ui}(\langle w_u, h_i \rangle - 1)^2 + \lambda\lVert W \rVert _f^2 + \lambda\lVert H \rVert _f^2  $$
+. this optimization is on instance level (one item) instead of pair level (two items) as BPR.
+. this optimization is a least square - correspond to the MLE for normally distributed random variables.
+	- But item prediction is actually not a regression(quantitative),
+but a classification (qualitative) one
+	- so the logistic optimization is more appropriate.
-- $\downarrow$ $O(\left|S\right|\left|I\right|)$ training triples
+---
-- $\downarrow$ skewness in the training pair lead to poor convergence
-- $\downarrow$ regularization is difficult
-## stochastic gradient descent
+### Maximum Margin MF (MMMF)
+explicit
+> $$ \sum_{(u,i,j) \in D_s} max(1- \langle w_u,h_i-h_j \rangle ) + + \lambda_w \lVert W \rVert _f^2 + \lambda_h \lVert H \rVert _f^2  $$
+. the error functions differ - loss in BPR is smooth and motivated by the MLE.
+. BPR-Opt criterion is generic and can be applied to several models, whereas their method is specific for MF.
+. MMMF is designed to work with sparse explicit data
+i.e. they assume that there are many missing values and thus they assume to have much less pairs than in an implicit setting.
+---
+====# Evaluation====
-- $\uparrow$ good approach for 'our' skew problem
+leave one out (remove one entry from $I_u^+$ per user $u$)
-- $\downarrow$ typical item-wise or user-wise order in training pairs have poor convergence because there are so many consecutive updates on the same user-item pair
-## $\therefore$ choose triples randomly, bootstrap sampling with replacement
+>$$ AUC = {1 \over |U|} \sum_u {1 \over |E(u)|} \sum_{(i,j) \in E(u) } \delta( \hat{x}_{ui} > \hat{x}_{uj})$$
+where the evaluation pairs per user $u$
+>$$ E(u) := \{(i,j) | (u,i) \in S_{test} \text{ and } (u,j) \notin (S_{test} \cup S_{train}) \} $$
-## $$ \Theta \leftarrow \Theta + \alpha\left( {\exp(-\hat{x}_{uij})\over{1+\exp(-\hat{x}_{uij})}} \centerdot {\partial \over \partial\Theta }\hat{x}_{uij} + \lambda_{\Theta} \Theta \right) $$
+higher value of AUC indicates a better quality.
+---
+##### \<Data\> MovieLens100K ( user 943 $\times$ item 1,682, 100K ratings )
+![](img3.png)![](img_netflix3.png)
+{{:data_analysis:pasted:20200429-203305.png}}
+{{:data_analysis:pasted:20200429-203230.png}}
+###### remove rating scores, repeated 5 times, $\lambda = 0.0025$, $\alpha = 0.05$
+---
+# END
+==== Code ====
 <code>
-code
+#install.packages("rrecsys", force=TRUE, dependencies=TRUE)
-</code>
-{{tag>data_analysis tag1 tag2}}
-~~DISCUSSION~~
+test_index<- function(data, na=TRUE) {
+  index_test<-c()
+  for (i in 1:dim(data)[1]) {
+    if(na) indeX_i <- which(!is.na(data[i,]))
+    else index_i<-which(data[i,]==0)
+    index_test[i]<-index_i[sample(1:length(index_i),1)]
+  }
+  test_index <- index_test
+}
+train <- function (data, index_test, na=TRUE) {
+  train_data<-data
+  for (i in 1:dim(data)[1]) {
+    train_data[i,index_test[i]]<-NA
+  }
+  train_data[is.na(train_data)]<-0
+  train<-train_data
+}
+calc_AUC <- function(data, pred, index_test, na=TRUE) {
+  AUC_u <- 0
+  for (i in 1:dim(data)[1]) {
+    i=1
+    if(na) index_j<-which(is.na(data[i,]))
+    else  index_j<-which(data[i,]==0)
+    AUC_u <- AUC_u + 1/length(index_j) * sum(pred[i,index_test[i]] > pred[i,index_j])
+  }
+  calc_AUC <- AUC_u / dim(data)[1]
+}
+############################################
+############################################
+############################################
+############################################
+############################################
+# check data
+which(apply(is.na(data2),1,sum)>1680)
+which(apply(is.na(data2),2,sum)<10)
+sum(!is.na(data2))
+############################################
+############################################
+ml100k_ori<-ml100k
+ml100k[ml100k>0]<-1
+index_test<- test_index(ml100k, na=FALSE)
+data2_train<-train(ml100k,index_test, na=FALSE)
+############################################
+############################################
+############################################
+############################################
+####BPR
+library("rrecsys")
+setwd("C:/Users/heeseoklee/Downloads/rrecsys-master/rrecsys-master/R")
+source("ALG_BPR.R")  #BPR2
+showDeltaError()
+#rrecsys(ratings, alg="BPR")
+ratings<-defineData(data2_train)
+AUC_total<-c()
+for (i in 1:10) {
+  AUC<-0
+  for( j in 1:5){
+    model_bpr<-BPR2(ratings,k=10*i)
+    pred_bpr<-predict(model_bpr)
+    AUC<-AUC+calc_AUC(ml100k,pred_bpr,index_test,na=FALSE)
+  }
+  AUC_total[i]<-AUC/5
+}
+AUC_total_bpr<-AUC_total
+setStoppingCriteria(autoConverge = FALSE,
+                    deltaErrorThreshold = 1e-05, nrLoops = NULL, minNrLoops = 10)
+showStoppingCriteria()
+############################################
+############################################
+####WRMF
+#devtools::install_github("dselivanov/reco")
+#library("reco")
+install.packages("devtools")
+####R CMD INSTALL rsparse-master/
+devtools::install_github("dselivanov/rsparse", force=TRUE)
+library(rsparse)
+data("movielens100k")
+library(Matrix)
+Matrix(movielens100k)
+ml100k<-as(movielens100k, "matrix")#[1:10,1:10]
+AUC_total<-c()
+for (i in 1:10) {
+  AUC<-0
+  for( j in 1:5){
+    fb<-"implicit"
+    #fb<-"explicit"
+    model = WRMF$new(rank = i*10, lambda = 0.0025,
+                     feedback = fb,#c("implicit", "explicit"),
+                     init_stdv = 0.01,
+                     n_threads = parallel::detectCores(),
+                     non_negative = TRUE,
+                     solver = "conjugate_gradient",#c("conjugate_gradient", "cholesky"),
+                     cg_steps = 5L,
+                     components = NULL)
+    x<-as(data2_train, "CsparseMatrix")
+    #x<-movielens100k
+    n_users_rank<-model$fit_transform(x, n_iter = 10L)
+    rank_n_items<-model$components
+    pred_mmmf<-n_users_rank %*% rank_n_items
+    AUC<-AUC+calc_AUC(ml100k,pred_mmmf,index_test,na=FALSE)
+  }
+  AUC_total[i]<-AUC/5
+}
+AUC_total_wrmf<-AUC_total
+############################################
+############################################
+####SVD-MF
+#install.packages("recommenderlab")
+library("recommenderlab")
+ml <- as(data2_train,"realRatingMatrix")
+r_svd <- Recommender(ml, method = "SVD",param=list(k=10))
+p.svd <- predict(r_svd, ml, type = "ratings")
+pred_svd<-getModel(r_svd)$svd$u %&% diag((getModel(r_svd)$svd)$d) %*% t(getModel(r_svd)$svd$v)
+AUC<-calc_AUC(ml100k,pred_svd,index_test,na=FALSE)
+r <- Recommender(data = ml, method = "ALS",
+                 parameter = list(normalize=NULL, lambda=0.0025, n_factors=10,
+                                  n_iterations=10, seed = NULL, verbose = FALSE))
+r
+recom_ratings <- predict(r, newdata = ml, type = "ratings")
+as(recom_ratings, "matrix")
+getModel(r)
+AUC_total<-c()
+for (i in 1:10) {
+  AUC<-0
+  for( j in 1:2){
+    fsvd<-funkSVD(ml, k = 10, gamma = 0.0025, lambda = 0.01,
+               min_improvement = 1e-06, min_epochs = 30, max_epochs = 50,
+               verbose = TRUE)
+    pred_svdf <- tcrossprod(fsvd$U, fsvd$V)
+    AUC<-AUC+calc_AUC(ml100k,pred_svdf,index_test,na=FALSE)
+  }
+  AUC_total[i]<-AUC/2
+}
+AUC_total_svdf<-AUC_total
+</code>