# The aim of this class

- Basic concept of recommender systems
- Simple formula and theory
- Underlying intuition of each recommendation model
- Pros and cons

# What is a recommender system?

Basic assumption of recommender systems

- Information overload
- Users are unsure about what they are looking for (different from information retrieval)

Target of (traditional) recommender systems

- Develop a **mathematical model** of an **objective function** ($\mathcal{F}$) to predict how much a **user** will like an **item** in a given **context**.

# Basic Concept

Ratings

- Explicit ratings: 5-star scores, like/dislike *(additional effort from users)*
- Implicit ratings: page views, clicks, effective playback, purchase records, whether or not a music track was listened to *(easier to collect, less precise)*

Interaction matrix

- Matrix modelling users’ ratings on items
- User $u$’s rating towards item $i$ as $r_{u,i}$
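As a toy illustration (all numbers made up), the interaction matrix can be held as a NumPy array, with 0 marking a missing rating:

```python
import numpy as np

# Hypothetical 3-user x 4-item interaction matrix; 0 marks a missing rating.
R = np.array([
    [5, 3, 0, 1],   # user 0
    [4, 0, 0, 1],   # user 1
    [1, 1, 0, 5],   # user 2
])

r_01 = R[0, 1]  # user 0's rating of item 1, i.e. r_{u,i} with u=0, i=1
print(r_01)     # 3
```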

# Collaborative Filtering (“birds of a feather flock together”)

**Context-free** recommender system

Intuition: **the users who have agreed in the past tend to also agree in the future**

$\mathcal{F}$: aggregate ratings from the nearest neighbors

User-based CF: user neighbors (green block)

Item-based CF: item neighbors (blue block)

Hybrid: use both

## Similarity Calculation

- Cosine similarity
- Pearson correlation
- SimRank
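The first two measures can be sketched in a few lines (toy vectors, not from any real dataset); note that Pearson correlation is just cosine similarity after mean-centering:

```python
import numpy as np

def cosine_sim(a, b):
    # cos(a, b) = a.b / (|a| |b|)
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def pearson_sim(a, b):
    # Pearson correlation = cosine similarity of the mean-centered vectors
    return cosine_sim(a - a.mean(), b - b.mean())

# Two users' ratings over the same four items (hypothetical numbers)
u = np.array([5.0, 3.0, 4.0, 4.0])
v = np.array([3.0, 1.0, 2.0, 3.0])
print(cosine_sim(u, v))   # close to 1: raw vectors point the same way
print(pearson_sim(u, v))  # lower: centering removes the rating-scale offset
```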

## Improvement of Similarity Calculation

- Not all neighbors are equally “valuable”, i.e. agreement on commonly liked items is less informative than agreement on controversial items
- Give more weight to items that have a higher variance

- A low number of co-rated items may introduce bias
- Reduce “confidence” when the number of co-rated items is low

- Not all neighbors are truly “similar”
- Set similarity threshold
- Set maximum number of neighbors
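The “reduce confidence” idea above is often implemented as a shrinkage (significance-weighting) heuristic; a minimal sketch, where `beta` is a hypothetical tunable prior:

```python
def shrunk_similarity(raw_sim, n_corated, beta=25):
    # Discount the similarity when the number of co-rated items is small;
    # as n_corated grows, the factor n / (n + beta) approaches 1.
    return raw_sim * n_corated / (n_corated + beta)

print(shrunk_similarity(0.9, 2))    # few co-rated items -> heavily discounted
print(shrunk_similarity(0.9, 200))  # many co-rated items -> nearly unchanged
```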

## Pros and Cons

Pros

- Easy to implement
- Good explainability

Cons

- Large memory needed (full rating histories are kept)
- Sparsity issue

# Matrix Factorization

**Context-free** recommender system

Intuition: express **higher-level attributes**

Generate user and item **embeddings** (latent vectors)

- (A group of) similar users/items have similar embeddings

$\mathcal{F}$ : dot product of embeddings $r_{u,i}=p_u^T q_i$
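A minimal sketch of this scoring function, with randomly initialized embeddings standing in for learned ones:

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, k = 4, 5, 3      # toy sizes; k = embedding dimension

P = rng.normal(size=(n_users, k))  # user embeddings p_u (rows)
Q = rng.normal(size=(n_items, k))  # item embeddings q_i (rows)

# Predicted rating r_{u,i} = p_u^T q_i
u, i = 1, 2
r_ui = P[u] @ Q[i]

# The full predicted interaction matrix in one matrix product
R_hat = P @ Q.T
assert np.isclose(R_hat[u, i], r_ui)
```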

## Improvement of Matrix Factorization

$\mathcal{F}$ could be more complicated

Global bias

- Average ratings in the whole catalog

Item bias

- Users rate different categories in different ways

User bias

- Some users tend to give high ratings while others tend to give low ones
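Putting the three bias terms together with the dot product gives the commonly used biased-MF scoring function (standard notation, with $\mu$ the global average rating, $b_u$ the user bias, $b_i$ the item bias):

```latex
\hat{r}_{u,i} = \mu + b_u + b_i + p_u^T q_i
```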

## Pros and Cons

- Pros:
- Better generalization capability: even though two users haven’t rated any of the same movies, it’s still possible to find the similarity between them through embeddings
- Low memory: only need to store low-dimensional latent vectors, no need to store large user behaviors

- Cons
- Hard to explain
- Sparsity

# Logistic Regression

Intuition: add **context** information to our model

Recommendation as **classification**: $\mathcal{F}$ outputs the probability that the user likes the item

User, item, context => **categorical** feature vector

- Example: gender, time of day, show genre as features
- $[0, 1, 0, 0, 0, 1, 0]$ => female, view in the morning, target show is a *Horror Movie*

Different weights for the features, **learned by gradient descent**

Sigmoid projection
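The whole pipeline (one-hot features, learned weights, sigmoid) fits in a few lines; the weights below are made-up stand-ins for values gradient descent would learn:

```python
import numpy as np

def sigmoid(z):
    # Squash a real-valued score into a probability in (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical categorical feature vector from the example:
# female, view in the morning, target show is a horror movie.
x = np.array([0, 1, 0, 0, 0, 1, 0], dtype=float)

# Per-feature weights and a bias; in practice these are learned.
w = np.array([0.2, -0.1, 0.0, 0.3, 0.0, 0.8, -0.4])
b = -0.5

p_click = sigmoid(w @ x + b)  # predicted probability the user likes the item
print(p_click)
```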

## Logistic Regression: why?

Why categorical feature vector?

- Some non-numerical features, e.g. device: “roku”, “web”, “android”…
- Absolute value of feature does not make sense
- Feature crossing
- Fast computation of sparse vector

Why sigmoid projection?

- Output between 0 and 1
- Good mathematical form
- Maximum entropy model

## Pros and Cons

- Pros
- Good explainability (feature importance)
- Good parallelism, fast model training
- Low training cost
- Online training

- Cons
- Manually craft features
- Limited model expressiveness (linear model)

## Problems of Logistic Regression

Simpson’s Paradox
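A tiny numeric illustration of Simpson’s paradox (all counts hypothetical): variant B has the higher click rate inside every segment, yet variant A wins on the pooled totals, because the two variants were shown in very different proportions per segment:

```python
# (clicks, impressions) per variant and per time-of-day segment
data = {
    "A": {"morning": (80, 100), "evening": (20, 100)},
    "B": {"morning": (9, 10),   "evening": (40, 190)},
}

def rate(c, n):
    return c / n

# Segment level: B beats A in BOTH segments
assert rate(9, 10) > rate(80, 100)     # morning: 0.90 > 0.80
assert rate(40, 190) > rate(20, 100)   # evening: ~0.21 > 0.20

# Pooled level: A beats B overall
pooled_a = rate(80 + 20, 100 + 100)    # 0.50
pooled_b = rate(9 + 40, 10 + 190)      # 0.245
assert pooled_a > pooled_b
```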

# Factorization Machine (FM)

Intuition: **feature crossing**

$\mathcal{F}$ adds pairwise interaction terms $\sum_{i<j} w_{ij} x_i x_j$

Let $w_{ij} = \langle v_i, v_j \rangle$ to deal with parameter explosion (from $n^2$ down to $nk$ parameters)

- Equivalent to a low-rank factorization of the parameter matrix $W = VV^T$
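The factorized form also admits the well-known $O(nk)$ rewrite of the pairwise interaction term, $\frac{1}{2}\sum_f \big[(\sum_i v_{i,f} x_i)^2 - \sum_i v_{i,f}^2 x_i^2\big]$; a sketch, checked against the brute-force double sum:

```python
import numpy as np

def fm_interaction(x, V):
    # Pairwise term sum_{i<j} <v_i, v_j> x_i x_j computed in O(nk):
    # 0.5 * sum_f [ (sum_i v_{i,f} x_i)^2 - sum_i v_{i,f}^2 x_i^2 ]
    s = V.T @ x                    # shape (k,)
    s2 = (V ** 2).T @ (x ** 2)     # shape (k,)
    return 0.5 * np.sum(s ** 2 - s2)

rng = np.random.default_rng(0)
n, k = 6, 3
V = rng.normal(size=(n, k))            # one k-dim latent vector per feature
x = np.array([1, 0, 1, 0, 0, 1.0])     # sparse categorical feature vector

# Brute-force check against the O(n^2 k) double sum
brute = sum(V[i] @ V[j] * x[i] * x[j]
            for i in range(n) for j in range(i + 1, n))
assert np.isclose(fm_interaction(x, V), brute)
```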

## Pros and Cons

- Pros
- Good expressiveness
- Good generalization
- Relatively low training cost

- Cons
- Hard to make higher level feature crossing

# Gradient Boosting Decision Tree (GBDT)

$\mathcal{F}$: an additive ensemble of regression trees

Training samples $(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)$

Boosting: each new model is fit to the errors left by the ensemble so far

$f$: regression tree -> Boosted Regression (Decision) Tree

Generalize from residuals to gradients of an arbitrary loss function -> Gradient Boosting Decision Tree (GBDT)
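The boosting recursion above, in the textbook gradient-boosting form: each new tree $f_m$ is fit to the negative gradient of the loss at the current prediction (which reduces to the plain residuals for squared loss):

```latex
F_m(x) = F_{m-1}(x) + f_m(x), \qquad
f_m \approx \arg\min_f \sum_{i=1}^{n}
\left( -\frac{\partial L\big(y_i, F_{m-1}(x_i)\big)}{\partial F_{m-1}(x_i)} - f(x_i) \right)^2
```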

## XGBoost & LightGBM

Same $\mathcal{F}$; change the optimization goal

Loss function: second-order Taylor expansion of the loss (keep terms up to second order)

- Regularization
- Prevent too complicated trees
- Prevent extreme parameters

- LightGBM (engineering optimization)

## Pros and Cons

- Pros
- Good expressiveness
- Low effort of feature engineering

- Cons
- Hard to parallelize
- Unable to do incremental training

# GBDT + LR

- Intuition: **model-based feature engineering** (the trees learn feature combinations, LR does the final scoring)
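A minimal sketch of the idea (the leaf indices below are made up; in practice they come from a trained GBDT): the leaf each sample lands in, per tree, is one-hot encoded and concatenated to form the LR input features:

```python
import numpy as np

def leaf_one_hot(leaf_ids, n_leaves):
    # One tree's leaf index per sample -> one-hot block of width n_leaves
    out = np.zeros((len(leaf_ids), n_leaves))
    out[np.arange(len(leaf_ids)), leaf_ids] = 1.0
    return out

# Hypothetical output of a 3-tree GBDT with 4 leaves each:
# for every sample, the index of the leaf it falls into in each tree.
leaves = np.array([
    [0, 3, 1],
    [2, 3, 0],
    [0, 1, 1],
])

# Concatenate one one-hot block per tree -> features for logistic regression
X_lr = np.hstack([leaf_one_hot(leaves[:, t], 4) for t in range(3)])
print(X_lr.shape)   # (3, 12): 3 samples, 3 trees x 4 leaves
```

Each row has exactly one active entry per tree, so the LR sees which leaf (i.e. which learned feature combination) every sample fell into.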