====== recommendation bpr ======

<wrap hide>
<style>
  font-size: 25px;
}
</style>
<!-- $theme: default -->
</wrap>

> BPR: Bayesian Personalized Ranking from Implicit Feedback, Rendle et al.

$>_u$ : the personalized total order of user $u$ over all items, satisfying
> (totality) : $\text{if } i \neq j \text{, then } i >_u j \text{ or } j >_u i$
> (antisymmetry) : $\text{if } i >_u j \text{ and } j >_u i \text{, then } i = j$
> (transitivity) : $\text{if } i >_u j \text{ and } j >_u k \text{, then } i >_u k$
* $I_u^+ := \{i \in I : (u,i) \in S\}$, $U_i^+ := \{u \in U : (u,i) \in S\}$

---
====# Approach====
* use item pairs as training data
* optimize for correctly ranking item pairs instead of scoring single items

### Training data
> $D_S := \{ (u,i,j) \mid i \in I_u^+ \text{ and } j \in I \setminus I_u^+ \}$ : training set
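
To make $D_S$ concrete, here is a minimal sketch (not from the paper; the toy matrix ''S'' and the enumeration are illustrative) that lists all training triples from a binary feedback matrix:

<code>
# toy implicit-feedback matrix S (rows = users, cols = items, 1 = observed)
S <- matrix(c(1, 0, 1,
              0, 1, 0), nrow = 2, byrow = TRUE)

# D_S: all triples (u, i, j) with i in I_u^+ and j in I \ I_u^+
D_S <- do.call(rbind, lapply(1:nrow(S), function(u) {
  pos <- which(S[u, ] == 1)   # I_u^+
  neg <- which(S[u, ] == 0)   # I \ I_u^+
  expand.grid(u = u, i = pos, j = neg)
}))
D_S
</code>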

### Advantages
1. can treat positive pairs, negative pairs, and missing values
2. training data is created for the actual objective of ranking

---
====# BPR optimization criterion (1/3)====
Bayesian formulation : maximize the posterior probability
> $$ p(\Theta \mid >_u) \propto p(>_u \mid \Theta)\, p(\Theta) $$
> $\Theta$ : parameter vector of an arbitrary model class

---
====# BPR optimization criterion (2/3)====
set $X_{uij} := I(i >_u j)$,
then $X_{uij} \sim Bernoulli(p_{uij})$ where $p_{uij} = p(i >_u j \mid \Theta)$

assuming each triple $(u,i,j) \in D_S$ is independent, the likelihood over all users factorizes as
> $$ \prod_{u \in U} p(>_u \mid \Theta) = \prod_{(u,i,j) \in D_S} p(i >_u j \mid \Theta) $$

---
====# BPR optimization criterion (3/3)====
with setting $p(i >_u j \mid \Theta) := \sigma(\hat{x}_{uij}(\Theta))$ where $\sigma(x) := {1 \over 1 + e^{-x}}$,
and prior $p(\Theta)$ as normal distribution $N(0, \Sigma_\Theta)$ with $\Sigma_\Theta = \lambda_\Theta I$,
> $$
\begin{matrix}
\text{BPR-OPT} &:=& \ln p(\Theta \mid >_u) \\
&=& \ln p(>_u \mid \Theta)\, p(\Theta) \\
&=& \ln \prod_{(u,i,j) \in D_S} \sigma(\hat{x}_{uij}) \cdot p(\Theta) \\
&=& \sum_{(u,i,j) \in D_S} \ln \sigma(\hat{x}_{uij}) + \ln p(\Theta) \\
&=& \sum_{(u,i,j) \in D_S} \ln \sigma(\hat{x}_{uij}) - \lambda_\Theta \lVert \Theta \rVert^2
\end{matrix}
$$
> where $\lambda_{\Theta}$ are model-specific regularization parameters.
※ the form of MLE with $l_2$-regularization
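
As a sanity check, BPR-OPT is straightforward to evaluate; a minimal sketch (function and argument names are illustrative, not from the paper):

<code>
# BPR-OPT = sum of ln sigma(x_uij) over D_S minus the l2 penalty;
# x_uij: vector of \hat{x}_{uij} scores, theta: parameter vector
bpr_opt <- function(x_uij, theta, lambda) {
  sum(log(plogis(x_uij))) - lambda * sum(theta^2)   # plogis = logistic sigma
}
</code>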

---
====# Relation to AUC====
> $$ \begin{matrix}
AUC &=& {1 \over |U|} \sum_{u \in U} AUC(u) \\
&=& {1 \over |U|} \sum_{u \in U} {1 \over |I_u^+||I \setminus I_u^+|} \sum_{i \in I_u^+} \sum_{j \in I \setminus I_u^+} \delta(\hat{x}_{uij} > 0) \\
&=& \sum_{(u,i,j) \in D_S} z_u\, \delta(\hat{x}_{uij} > 0)
\end{matrix} $$
> where $z_u$ is a normalizing constant.

### compare to BPR-OPT : they differ only in the loss function
* the non-differentiable Heaviside $\delta(\cdot)$ is replaced by the differentiable $\ln \sigma(\cdot)$
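
A quick way to see the substitution is to plot both loss terms; an illustrative snippet (not part of the paper):

<code>
# Heaviside delta(x > 0) used by AUC vs. the smooth surrogate ln sigma(x)
# maximized by BPR-OPT
curve(log(plogis(x)), from = -4, to = 4, ylab = "objective term")
curve(as.numeric(x > 0), add = TRUE, lty = 2)
</code>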

---
====# Optimization====
### gradient descent
> $$ {\partial\, \text{BPR-OPT} \over \partial \Theta} = \sum_{(u,i,j) \in D_S} {e^{-\hat{x}_{uij}} \over 1 + e^{-\hat{x}_{uij}}} \cdot {\partial \over \partial \Theta} \hat{x}_{uij} - \lambda_\Theta {\partial \over \partial \Theta} \lVert \Theta \rVert^2 $$

1. Full GD
  * training data is too large : $O(|S||I|)$ triples
  * skewness in the training pairs : the dominating class largely determines the gradient
  * regularization is difficult as the gradients differ greatly
2. Stochastic GD
  * a good approach for the skew problem
  * but the order in which training triples are traversed is crucial
  * in the typical order (item-wise or user-wise), there are many consecutive updates on the same user-item pair
$\Rightarrow$ draw triples by bootstrap sampling with replacement (see the sketch below)
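
A minimal sketch of the resulting learning loop (LearnBPR-style; ''x_hat()'' and ''grad_x_hat()'' are assumed model-specific helpers, not defined in the paper):

<code>
# stochastic gradient ascent on BPR-OPT with bootstrap-sampled triples
learn_bpr <- function(D_S, theta, alpha, lambda, n_updates) {
  for (t in 1:n_updates) {
    s <- D_S[sample(nrow(D_S), 1), ]          # draw (u,i,j) with replacement
    x <- x_hat(theta, s$u, s$i, s$j)          # \hat{x}_{uij}
    g <- grad_x_hat(theta, s$u, s$i, s$j)     # d \hat{x}_{uij} / d theta
    # e^{-x} / (1 + e^{-x}) = 1 - sigma(x)
    theta <- theta + alpha * ((1 - plogis(x)) * g - lambda * theta)
  }
  theta
}
</code>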

---
====# Model====
Many models predict a real number $\hat{x}_{ul}$ per user-item pair $(u,l)$.

To learn such a model with BPR, decompose the estimator:
> $$ \hat{x}_{uij} := \hat{x}_{ui} - \hat{x}_{uj} $$

※ use the same model, but optimize it against another criterion
(a difference of two predictions instead of a single number)

---
====# Model : Matrix Factorization====
$X = [x_{ui}]$ is approximated by two low-rank matrices $W : |U| \times k$ and $H : |I| \times k$
> $$ \hat{X} := WH^t $$
> $$ \hat{x}_{ui} = \langle w_u, h_i \rangle = \sum_{f=1}^k w_{uf} \cdot h_{if} $$
> $w_u$ : row in $W$, describing user $u$
> $h_i$ : row in $H$, describing item $i$
$\Rightarrow \Theta = (W, H)$
> $$ {\partial \over \partial \theta} \hat{x}_{uij} =
\begin{cases}
(h_{if} - h_{jf}) & \text{if } \theta = w_{uf}, \\
w_{uf} & \text{if } \theta = h_{if}, \\
-w_{uf} & \text{if } \theta = h_{jf}, \\
0 & \text{else}
\end{cases} $$

$\lambda_W$, $\lambda_{H^+}$, $\lambda_{H^-}$ : regularization parameters, one for each part of $\Theta$
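
Plugging these gradients into the update gives the following sketch of a single BPR step for MF (a single ''lambda'' is used instead of $\lambda_W$, $\lambda_{H^+}$, $\lambda_{H^-}$ for brevity):

<code>
# one bootstrap-sampled BPR update for matrix factorization
update_bpr_mf <- function(W, H, u, i, j, alpha, lambda) {
  w_u <- W[u, ]; h_i <- H[i, ]; h_j <- H[j, ]
  x_uij <- sum(w_u * (h_i - h_j))        # \hat{x}_{ui} - \hat{x}_{uj}
  c <- 1 - plogis(x_uij)                 # e^{-x} / (1 + e^{-x})
  W[u, ] <- w_u + alpha * (c * (h_i - h_j) - lambda * w_u)
  H[i, ] <- h_i + alpha * (c *  w_u        - lambda * h_i)
  H[j, ] <- h_j + alpha * (c * -w_u        - lambda * h_j)
  list(W = W, H = H)
}
</code>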

---
====# Model : Adaptive k-Nearest-Neighbor====
> $$ \hat{x}_{ui} = \sum_{l \in I_u^+ \text{ and } l \neq i} c_{il} $$
> where $C : I \times I$ is the symmetric item-correlation / item-similarity matrix
$\Rightarrow \Theta = C$
> $$ {\partial \over \partial \theta} \hat{x}_{uij} =
\begin{cases}
+1 & \text{if } \theta \in \{c_{il}, c_{li}\} \text{ and } l \in I_u^+ \text{ and } l \neq i, \\
-1 & \text{if } \theta \in \{c_{jl}, c_{lj}\} \text{ and } l \in I_u^+ \text{ and } l \neq j, \\
0 & \text{else}
\end{cases} $$

$\lambda_{+}$, $\lambda_{-}$ : regularization parameters for the positive and negative updates
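
The corresponding sketch for one adaptive-kNN step (a single ''lambda''; ''I_u_pos'' stands for $I_u^+$, names are illustrative):

<code>
# one bootstrap-sampled BPR update for the adaptive kNN model
update_bpr_knn <- function(C, I_u_pos, i, j, alpha, lambda) {
  L_i <- setdiff(I_u_pos, i)                  # l in I_u^+, l != i
  L_j <- setdiff(I_u_pos, j)
  x_uij <- sum(C[i, L_i]) - sum(C[j, L_j])
  c <- 1 - plogis(x_uij)                      # e^{-x} / (1 + e^{-x})
  C[i, L_i] <- C[i, L_i] + alpha * ( c - lambda * C[i, L_i])
  C[L_i, i] <- C[i, L_i]                      # keep C symmetric
  C[j, L_j] <- C[j, L_j] + alpha * (-c - lambda * C[j, L_j])
  C[L_j, j] <- C[j, L_j]
  C
}
</code>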

---
====# Relation to other methods====
### Weighted Regularized MF (WR-MF)
for implicit feedback :
> $$ \sum_{u \in U} \sum_{i \in I} c_{ui} (\langle w_u, h_i \rangle - 1)^2 + \lambda \lVert W \rVert_f^2 + \lambda \lVert H \rVert_f^2 $$
1. this optimization is on the instance level (one item) instead of the pair level (two items) as in BPR.
2. this optimization is least squares, which corresponds to the MLE for normally distributed random variables.
  - but item prediction is actually not a regression (quantitative) task but a classification (qualitative) one,
  - so the logistic optimization is more appropriate.

---
### Maximum Margin MF (MMMF)
for explicit feedback :
> $$ \sum_{(u,i,j) \in D_S} \max(0,\, 1 - \langle w_u, h_i - h_j \rangle) + \lambda_w \lVert W \rVert_f^2 + \lambda_h \lVert H \rVert_f^2 $$
1. the error functions differ : the loss in BPR is smooth and motivated by the MLE.
2. the BPR-OPT criterion is generic and can be applied to several models, whereas MMMF is specific to MF.
3. MMMF is designed to work with sparse explicit data,
   i.e. it assumes many missing values and thus far fewer pairs than in an implicit setting.

---
====# Evaluation====
leave one out (remove one entry from $I_u^+$ per user $u$)
> $$ AUC = {1 \over |U|} \sum_u {1 \over |E(u)|} \sum_{(i,j) \in E(u)} \delta(\hat{x}_{ui} > \hat{x}_{uj}) $$
where the evaluation pairs per user $u$ are
> $$ E(u) := \{(i,j) \mid (u,i) \in S_{test} \text{ and } (u,j) \notin (S_{test} \cup S_{train})\} $$
a higher value of AUC indicates better quality.

---
##### \<Results\>
(plots : AUC results of the BPR / WRMF / SVD runs)
###### remove rating scores, repeated 5 times, $\lambda = 0.0025$, $\alpha = 0.05$

---
# END

==== Code ====
<code>
#### helper functions ####

# choose one observed item per user as the held-out test item (leave one out)
test_index <- function(data, na = TRUE) {
  index_test <- rep(0, nrow(data))
  for (i in 1:nrow(data)) {
    if (na) index_i <- which(!is.na(data[i, ]))
    else    index_i <- which(data[i, ] != 0)
    index_test[i] <- sample(index_i, 1)
  }
  index_test
}

# remove the held-out entry per user, then code all missing values as 0
train <- function(data, index_test, na = TRUE) {
  train_data <- data
  for (i in 1:nrow(data)) {
    train_data[i, index_test[i]] <- NA
  }
  train_data[is.na(train_data)] <- 0
  train_data
}

# leave-one-out AUC: per user, the fraction of never-observed items j
# that are ranked below the held-out test item (cf. E(u) above)
calc_AUC <- function(data, index_test, pred, na = TRUE) {
  AUC_u <- 0
  for (i in 1:nrow(data)) {
    if (na) index_j <- which(is.na(data[i, ]))
    else    index_j <- which(data[i, ] == 0)
    AUC_u <- AUC_u + sum(pred[i, index_test[i]] > pred[i, index_j]) / length(index_j)
  }
  AUC_u / nrow(data)
}

############################################
#### check data ####

# rows (users) / columns (items) without any observation
which(apply(is.na(data2), 1, all))
which(apply(is.na(data2), 2, all))
# number of observed entries
sum(!is.na(data2))

############################################
#### data preparation ####

ml100k_ori <- ml100k          # keep the original rating scores
ml100k[ml100k > 0] <- 1       # remove rating scores (binarize the feedback)

# data2 : user x item rating matrix with NA for unobserved entries
index_test  <- test_index(data2, na = TRUE)
data2_train <- train(data2, index_test, na = TRUE)

############################################
#### BPR (rrecsys) ####
library("rrecsys")
setwd("...")     # working directory path omitted
source("...")    # helper script path omitted

showDeltaError()
ratings <- defineData(data2_train)

AUC_total <- rep(0, 10)
for (i in 1:10) {
  AUC <- 0
  for (j in 1:5) {
    # number of factors k = 10, 20, ..., 100; 5 repetitions each
    model_bpr <- rrecsys(ratings, alg = "BPR", k = i * 10)
    pred_bpr  <- predict(model_bpr)     # predicted score matrix
    AUC <- AUC + calc_AUC(data2, index_test, pred_bpr)
  }
  AUC_total[i] <- AUC / 5
}
AUC_total_bpr <- AUC_total

setStoppingCriteria(autoConverge = FALSE,
                    deltaErrorThreshold = 1e-05, nrLoops = NULL, minNrLoops = 10)
showStoppingCriteria()

############################################
#### WRMF (rsparse) ####

install.packages("devtools")
#### or: R CMD INSTALL rsparse-master/
devtools::install_github("rexyai/rsparse")
library(rsparse)
library(Matrix)

data("movielens100k")
ml100k <- as(movielens100k, "CsparseMatrix")

AUC_total <- rep(0, 10)
for (i in 1:10) {
  AUC <- 0
  for (j in 1:5) {
    fb <- "implicit"
    model <- WRMF$new(rank = i * 10, lambda = 0.0025, feedback = fb)
    x <- as(data2_train, "CsparseMatrix")
    n_users_rank <- model$fit_transform(x)   # |U| x rank user factors
    rank_n_items <- model$components         # rank x |I| item factors
    pred_wrmf <- n_users_rank %*% rank_n_items
    AUC <- AUC + calc_AUC(data2, index_test, pred_wrmf)
  }
  AUC_total[i] <- AUC / 5
}
AUC_total_wrmf <- AUC_total

############################################
#### SVD-MF (recommenderlab) ####
library("recommenderlab")

ml <- as(data2_train, "realRatingMatrix")

# plain SVD recommender
r_svd <- Recommender(ml, method = "SVD")
p.svd <- predict(r_svd, ml, type = "ratings")

pred_svd <- as(p.svd, "matrix")
AUC <- calc_AUC(data2, index_test, pred_svd)

# Funk-SVD recommender
r <- Recommender(data = ml, method = "SVDF",
                 parameter = list(k = 10, gamma = 0.015, lambda = 0.001))
r
recom_ratings <- predict(r, newdata = ml, type = "ratings")
as(recom_ratings, "matrix")
getModel(r)

AUC_total <- rep(0, 10)
for (i in 1:10) {
  AUC <- 0
  for (j in 1:2) {
    fsvd <- funkSVD(data2_train, k = i * 10, gamma = 0.015, lambda = 0.001,
                    verbose = TRUE)
    pred_svdf <- tcrossprod(fsvd$U, fsvd$V)   # U V^t
    AUC <- AUC + calc_AUC(data2, index_test, pred_svdf)
  }
  AUC_total[i] <- AUC / 2
}
AUC_total_svdf <- AUC_total
</code>