<!--
$theme: default
$size: 4:3
page_number: true
footer: this is footer
-->
# A Review on
# BPR: Bayesian Personalized Ranking from Implicit Feedback, Rendle et al.
### Hee-seok Lee
### Algorithm = model + _loss function_ + _optimization_
- Loss function = BPR-OPT ( → approximates AUC)
- Optimization = stochastic gradient descent
1. choose a rank metric
2. identify parts that are not smooth
3. replace with “close enough” approximations
- the logistic function is often used here (see the sketch after this list)
4. learn with gradient descent
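
A minimal sketch of steps 2 and 3, assuming AUC is the chosen rank metric: the non-smooth indicator $I(s_i > s_j)$ is swapped for the logistic function so the surrogate becomes differentiable. The function and variable names below are illustrative, not from the paper.

```python
import numpy as np

def sigmoid(x):
    # smooth, differentiable surrogate for the step indicator I(x > 0)
    return 1.0 / (1.0 + np.exp(-x))

def auc_exact(pos_scores, neg_scores):
    # step 2: the indicator is piecewise constant, so its gradient is zero almost everywhere
    return np.mean([float(si > sj) for si in pos_scores for sj in neg_scores])

def auc_smooth(pos_scores, neg_scores):
    # step 3: "close enough" approximation that gradient descent can optimize
    return np.mean([sigmoid(si - sj) for si in pos_scores for sj in neg_scores])
```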
- Explicit : both positive and negative feedback are observed (e.g. ratings)
- Implicit : only positive observations are available (e.g. clicks, purchases)
$$ \text{non-observed data} = \text{real negative feedback} + \text{missing value} $$
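
As a sketch of how the pairwise training data is built under this reading: an observed item is preferred over every non-observed item, and nothing is inferred between two observed or two non-observed items. The binary observation matrix `S` and the function name are assumptions for illustration.

```python
import numpy as np

def training_triples(S):
    """Yield (u, i, j) with S[u, i] = 1 (observed) and S[u, j] = 0 (non-observed)."""
    n_users, _ = S.shape
    for u in range(n_users):
        pos = np.flatnonzero(S[u] == 1)   # items with positive (observed) feedback
        neg = np.flatnonzero(S[u] == 0)   # non-observed: real negative or missing
        for i in pos:
            for j in neg:
                yield u, i, j
```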
### properties of a total order
- totality : $\forall i,j \in I: i \neq j \Rightarrow i >_u j \ \text{or} \ j >_u i$
- antisymmetry : $\forall i,j \in I: i >_u j \ \text{and} \ j >_u i \Rightarrow i = j$
- transitivity : $\forall i,j,k \in I: i >_u j \ \text{and} \ j >_u k \Rightarrow i >_u k$
## BPR: Bayesian formulation
$$\text{maximize} \ \ \ p(\Theta|>_u) \propto p(>_u|\Theta)p(\Theta) $$
$$ X_{uij} := I(i >_u j) \sim \text{Bern}(p_{uij}) \ \ \text{s.t.} \ \ p_{uij} = P(i >_u j) $$
with the assumption that each $(u,i,j) \in D_S$ is independent,
$$ \prod_{u \in U} p( >_u | \Theta) = \prod_{(u,i,j) \in D_S} p(i >_u j|\Theta)$$
$$p(i>_uj|\Theta) := \sigma( \hat{x}_{uij}(\Theta) )$$
$$\sigma(x) := {1\over 1+ \exp(-x) } $$
$$ \begin{matrix} \text{BPR-OPT} &:=& \ln p(\Theta|>_u) \\ &=& \ln p(>_u|\Theta)\,p(\Theta) \\ &=& \ln \prod_{(u,i,j) \in D_S} \sigma (\hat{x}_{uij})\,p(\Theta) \\ &=& \sum_{(u,i,j) \in D_S} \ln\sigma (\hat{x}_{uij}) + \ln p(\Theta) \\ &=& \sum_{(u,i,j) \in D_S} \ln\sigma (\hat{x}_{uij}) - \lambda_{\Theta} \lVert\Theta\rVert^2 \end{matrix} $$
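
A minimal sketch of evaluating BPR-OPT for a matrix-factorization model with $\hat{x}_{uij} = \langle w_u, h_i \rangle - \langle w_u, h_j \rangle$; the factor matrices `W`, `H` and the single regularization constant are illustrative assumptions.

```python
import numpy as np

def bpr_opt(W, H, triples, lam):
    """BPR-OPT = sum over D_S of ln sigma(x_uij) - lam * ||Theta||^2 (MF model)."""
    total = 0.0
    for u, i, j in triples:
        x_uij = W[u] @ (H[i] - H[j])          # x_ui - x_uj under matrix factorization
        total += -np.log1p(np.exp(-x_uij))    # ln sigma(x_uij), numerically stable
    return total - lam * (np.sum(W**2) + np.sum(H**2))
```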
$$ \begin{matrix} {\partial\, \text{BPR-OPT} \over \partial \Theta} &=& \sum_{(u,i,j) \in D_S} {\partial \over \partial\Theta }\ln\sigma (\hat{x}_{uij}) - \lambda_{\Theta} {\partial \over \partial\Theta }\lVert\Theta\rVert^2 \\ &\propto& \sum_{(u,i,j) \in D_S} {-\exp(-\hat{x}_{uij})\over{1+\exp(-\hat{x}_{uij})}} \centerdot {\partial \over \partial\Theta }\hat{x}_{uij} - \lambda_{\Theta} \Theta \end{matrix} $$
## full gradient descent
- $\downarrow$ $O(\left|S\right|\left|I\right|)$ training triples, so computing the full gradient in each step is infeasible
- $\downarrow$ skewness in the training pairs leads to poor convergence
- $\downarrow$ regularization is difficult
## stochastic gradient descent
- $\uparrow$ good approach for 'our' skew problem
- $\downarrow$ traversing the training pairs in the typical item-wise or user-wise order converges poorly, because it produces many consecutive updates on the same user-item combination
## $\therefore$ choose triples randomly, bootstrap sampling with replacement
$$ \Theta \leftarrow \Theta + \alpha\left( {\exp(-\hat{x}_{uij})\over{1+\exp(-\hat{x}_{uij})}} \centerdot {\partial \over \partial\Theta }\hat{x}_{uij} + \lambda_{\Theta} \Theta \right) $$
## code
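
A hedged sketch of the LearnBPR loop for BPR-MF: triples are drawn by bootstrap sampling with replacement and each draw triggers one stochastic ascent step. The factor size, learning rate, and update count are arbitrary illustrative defaults, and the regularization term is applied as weight decay ($-\lambda_\Theta\Theta$), consistent with maximizing $\sum \ln\sigma(\hat{x}_{uij}) - \lambda_\Theta\lVert\Theta\rVert^2$.

```python
import numpy as np

def learn_bpr_mf(S, k=16, alpha=0.05, lam=0.01, n_updates=100_000, seed=0):
    """LearnBPR sketch for BPR-MF: x_uij = <w_u, h_i - h_j>.

    S is a binary users x items implicit-feedback matrix.
    """
    rng = np.random.default_rng(seed)
    n_users, n_items = S.shape
    W = 0.1 * rng.standard_normal((n_users, k))   # user factors
    H = 0.1 * rng.standard_normal((n_items, k))   # item factors

    pos = [np.flatnonzero(S[u] == 1) for u in range(n_users)]
    neg = [np.flatnonzero(S[u] == 0) for u in range(n_users)]
    usable = [u for u in range(n_users) if len(pos[u]) and len(neg[u])]

    for _ in range(n_updates):
        # bootstrap sampling: draw a triple (u, i, j) uniformly, with replacement
        u = usable[rng.integers(len(usable))]
        i = pos[u][rng.integers(len(pos[u]))]
        j = neg[u][rng.integers(len(neg[u]))]

        wu, hi, hj = W[u].copy(), H[i].copy(), H[j].copy()
        x_uij = wu @ (hi - hj)
        g = np.exp(-x_uij) / (1.0 + np.exp(-x_uij))   # sigma(-x_uij)

        # stochastic gradient-ascent step; regularization applied as weight decay
        W[u] += alpha * (g * (hi - hj) - lam * wu)
        H[i] += alpha * (g * wu - lam * hi)
        H[j] += alpha * (-g * wu - lam * hj)

    return W, H
```

After training, items for a user `u` can be ranked by the scores `W[u] @ H.T`.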