
The One Epoch Phenomenon in Recommendation Algorithms

· 3 min read

In Click-Through Rate (CTR) prediction and recommendation systems, the "one-epoch phenomenon" refers to a form of overfitting in which a deep model reaches its best generalization performance after a single epoch of training, and continued training causes a sharp drop in test-set performance.

Phenomenon Characteristics

  • Performance During Training

The model's test-set AUC peaks at the end of the first epoch and declines rapidly once the second epoch begins. The overfitting shows up as a training loss that keeps decreasing while validation metrics (such as AUC) deteriorate markedly after the first epoch.

  • Relevance to Industrial Practice

This phenomenon explains why many industrial recommendation systems train on only a single pass over the data rather than the multiple epochs typical of traditional deep learning.

Key Causes

  • Feature Sparsity

ID feature fields (such as user IDs and product IDs) contain huge numbers of distinct values, so each individual ID appears only rarely; the model therefore tends to memorize sparse samples instead of generalizing. Over multiple epochs, the embedding rows of these sparse features overfit to low-frequency IDs.

  • Model Optimization Characteristics

Strong optimizers (such as Adam) and relatively large learning rates accelerate convergence, so the model completes most of its effective learning within one epoch, and further training drifts into localized overfitting. High-capacity architectures (such as deep networks) further exacerbate the memorization of noise and sparse features.

Mitigation Strategies

  • Feature Engineering Optimization

Reduce sparsity: merge low-frequency IDs, replace rare IDs with a default value, and shrink the hash space (a minimal sketch of low-frequency ID merging follows at the end of this list). Dynamic feature filtering: adjust how strongly sparse features participate in training depending on the training stage.

  • Training Parameter Adjustment

Limit training epochs: stop training after a single epoch. Learning rate control: use a smaller initial learning rate or a learning-rate decay schedule.

  • Model Structure Improvement

Simplify the Embedding layer or introduce regularization (such as Dropout) to suppress overfitting.
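
As a concrete illustration of the feature-engineering strategy above, here is a minimal Python sketch of merging low-frequency IDs into a shared default bucket before building the embedding vocabulary; the threshold and the "<RARE>" token are illustrative assumptions, not values used in any particular system.

from collections import Counter

def merge_low_freq_ids(ids, min_count=10, default_id="<RARE>"):
    """Map IDs occurring fewer than `min_count` times to a shared bucket,
    so rare IDs no longer get individual embedding rows that can be
    memorized over repeated epochs."""
    counts = Counter(ids)
    return [i if counts[i] >= min_count else default_id for i in ids]

# Example: IDs below the threshold collapse into "<RARE>".
raw_ids = ["u1", "u1", "u2", "u3", "u3", "u3", "u4"]
print(merge_low_freq_ids(raw_ids, min_count=2))
# ['u1', 'u1', '<RARE>', 'u3', 'u3', 'u3', '<RARE>']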

Insights from Industrial Practice

  • Strong Correlation with Data Distribution

This phenomenon is particularly significant in high-sparsity business scenarios (such as advertising recommendations, e-commerce CTR prediction), but rarely occurs in dense data tasks.

  • Priority of Optimization Directions

Industry practice tends to sidestep the problem through feature engineering and early-stopping strategies rather than by modifying the model structure.

Theoretical Hypotheses

Some studies suggest that the one-epoch phenomenon stems from a dynamic balance between sample memorization and generalization: during the first epoch the model completes coarse-grained learning of the key features, and subsequent training degrades generalization by over-refining local patterns.

Further Reading

Towards Understanding the Overfitting Phenomenon of Deep Click-Through Rate Prediction Models

Ali's OneEpoch VS KuaiShou's MultiEpoch

How RecGo Optimizes Memory Management: Consolidating Fragments into Whole Blocks

· 6 min read

Overview

In the core architecture of the RecGo recommendation system, memory management for item features has always been a focus of performance optimization (see "RecGo Feature System Design Principles" for details). We adopted an in-memory residency scheme to achieve sub-millisecond feature access, but in production we hit a new challenge: dynamically loading massive volumes of item features created large numbers of small, fragmented objects, which caused a severe performance bottleneck in the Go garbage collector. This article explains how we solved the problem with a novel memory-allocation scheme.

Problem Analysis: the GC Storm Caused by Small Objects

Data Scale and Memory Profile

At a scale of tens of millions of items, each item carries 50-100 heterogeneous features, composed as follows:

| Feature type | Share of memory | Average size |
| --- | --- | --- |
| int64 | 25% | 8 B |
| float32 | 20% | 4 B |
| string | 30% | 32 B |
| []int64 | 15% | 128 B |
| []float32 | 5% | 64 B |
| []string | 5% | 256 B |

Each full refresh of the item pool creates roughly 120 million heap objects, 90% of which are smaller than 512 bytes.

Measured GC Bottleneck

Stress tests on Go 1.21 produced the following key metrics:

| Metric | Regular allocation | Arena scheme |
| --- | --- | --- |
| GC pause time (P99) | 420 ms | 8 ms |
| Allocations per second | 5.2 M | 0.8 M |
| Memory fragmentation | 38% | 4% |

The results show that high-frequency small-object allocation drives the GC workload up sharply, frequently triggering stop-the-world (STW) pauses during peak traffic and severely affecting service stability.

Solution Selection and Technical Evaluation

Feasibility of a CGO-Based Approach

Our initial idea was to implement the feature-management module in C++ and bridge it via CGO. For performance measurements, see the article CGO Performance In Go 1.21:

Conclusions
So at this point we’ve measured performance overhead of Cgo, at least in terms of wall clock time (note that we haven’t looked at memory/thread count/battery usage/etc). We know that the overhead is on the order of 2 mutex operations and that it does scale with number of cores up to around 16. We’ve also seen that with 16 cores we can do around 4ns/op or close to 250 million Cgo ops/s. So if I was looking at using Cgo in 2023 I’d definitely use it outside of very hot loops. There’s many reasons I wouldn’t use Cgo in 2023 (see disclaimer), but performance is unlikely to be one of them.
I’ll end with this little Cgo version of “latency numbers every programmer should know” table:

Go/Cgo latency

| Benchmark Name | 1 Core | 16 Cores |
| --- | --- | --- |
| Inlined Empty func | 0.271 ns | 0.02489 ns |
| Empty func | 1.5 ns | 0.135 ns |
| cgo | 40 ns | 4.281 ns |
| encoding/json int parse | 52.89 ns | 5.518 ns |

Based on these results, we concluded that CGO is a poor fit for feature storage: feature reads are extremely frequent, and at very large scale the constant Go-to-C stack switching would add latency and hurt overall performance. We therefore did not choose this approach.

Arena Memory Pooling

Because the item pool is effectively read-only once built, we settled on the following design principles:

  1. Contiguous allocation: pre-allocate large blocks of memory (4 KB pages)
  2. Lifecycle binding: feature objects share the lifecycle of the item pool they belong to
  3. Batch release: whole blocks are released together when the item pool is refreshed

Layout comparison:

Traditional allocation:
[obj1][padding][obj2][obj3][padding]... -> memory fragmentation

Arena allocation:
[page1: obj1|obj2|obj3...][page2: obj4|obj5...] -> contiguous memory

Implementation Details and Core Algorithm

Allocator Design

const (
    pageSize      = 4096 // 4 KB, aligned to modern CPU cache lines
    maxSmallAlloc = 512  // objects above this threshold bypass the page pool
)

type Arena struct {
    mu     sync.RWMutex
    pages  [][]byte // pool of memory pages
    curPtr uintptr  // offset into the current page
    curCap uintptr  // capacity of the current page
}

// Allocation algorithm (simplified pseudocode; newPage and allocLarge are omitted)
func (a *Arena) Alloc(size, align uintptr) []byte {
    a.mu.Lock()
    defer a.mu.Unlock()

    // Large objects are allocated outside the page pool.
    if size > maxSmallAlloc {
        return a.allocLarge(size)
    }

    // Round the current offset up to the requested alignment.
    offset := (a.curPtr + align - 1) &^ (align - 1)

    // Start a new page if the aligned request does not fit in the current one.
    if offset+size > a.curCap {
        a.newPage()
        offset = 0
    }

    ptr := a.pages[len(a.pages)-1][offset : offset+size]
    a.curPtr = offset + size
    return ptr
}

Serialization of Composite Types

Taking []string as an example, the in-arena memory layout is:

+-----------------------+
| sliceHeader (24B) |
| Data *stringHeader |
| Len int |
| Cap int |
+-----------------------+
| stringHeader (16B) |
| Data *byte |
| Len int |
+-----------------------+
| stringHeader |
| ... |
+-----------------------+
| raw string data |
| "value1"|"value2"|... |
+-----------------------+

Serialization steps:

  1. Pre-compute the total memory required (including all alignment padding)
  2. Allocate one contiguous block
  3. Write the raw string data in order
  4. Build the stringHeader array
  5. Initialize the sliceHeader

// Pseudocode
func MarshalStrings(arena *Arena, values []string) []string {
    // 1. Compute the total size required
    total := len(values)*16 + 24 // space for the headers
    for _, s := range values {
        total += len(s)
    }

    // 2. Allocate one contiguous block
    buf := arena.Alloc(total, 8)

    // 3. Write the string contents into buf
    // ... (details omitted)

    // 4. Build the string headers
    // ... (details omitted)

    return *(*[]string)(unsafe.Pointer(&sliceHeader))
}

Inverted-Index Recall System Design

· 7 min read

Inverted Index

Definition

An inverted index maps content to the documents that contain it; in essence it is the reverse mapping "term → documents", complementary to the forward index (document → content).

Core Components

  • Term dictionary: the sorted set of all unique terms
  • Posting list: the sequence of document IDs associated with each term
  • Positions: optionally, the exact positions of the term within each document

Index Structure Types

1. Basic inverted index

[term A] → [doc 1, doc 3, doc 5]
[term B] → [doc 2, doc 4]

2. Weighted inverted index

{
  "machine learning": [
    {"doc_id": 101, "weight": 0.95},
    {"doc_id": 205, "weight": 0.87}
  ]
}

3. Multi-dimensional inverted index

An index keyed by composite values:

"price-range_category" → [list of item IDs]

Recall Mechanisms

Common Recall Types and Their Trigger Logic

I2I (Item-to-Item, e.g. ICF) recall

{
  "key": "item_sim_123",                        // similarity list for item 123
  "value": [{"456": 0.85}, {"789": 0.76}, ...]  // stored in descending order of similarity
}

User-profile recall

  • Core logic

The core of a user profile is labeling: a tag system gives user characteristics a structured representation so that user attributes can be described precisely. Interest tags are only one dimension; a complete profile usually also covers natural attributes (e.g. age, gender), social attributes (e.g. occupation, region), business attributes (e.g. purchase frequency, average order value), and device attributes (e.g. device model, operating system). Once the tag system is in place, profile tags serve as query keys into the inverted index, turning user features into a precisely recalled candidate set.
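
A minimal Python sketch of this tag-keyed lookup (purely illustrative; the tag format and item IDs below are made up and not RecGo's actual data model):

from collections import defaultdict

# Inverted index: profile tag -> posting list of candidate item IDs (toy data).
inverted_index = defaultdict(list, {
    "gender=female_cat=beauty": [101, 205, 309],
    "age=18-24_cat=games": [404, 101, 512],
})

def recall_by_profile(tags, limit=10):
    """Look up each profile tag and merge the posting lists, deduplicated."""
    seen, results = set(), []
    for tag in tags:
        for item_id in inverted_index[tag]:
            if item_id not in seen:
                seen.add(item_id)
                results.append(item_id)
    return results[:limit]

print(recall_by_profile(["gender=female_cat=beauty", "age=18-24_cat=games"]))
# [101, 205, 309, 404, 512]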

Summary

The common core of these recall flows is generating query keys from user features and then probing the inverted index. Based on this principle, we designed the following system.

System Architecture

Core Processing Pipeline

Key Components

Query-Key Generation Tool: Minia

Minia is an efficient C++ feature-processing library focused on consistent, flexible, expression-driven feature transformation. Users define expression rules in a configuration file, and the same rules are applied automatically to recall, model training, and inference, so the data-processing logic stays strictly consistent and feature skew is avoided.

  • Core strength: expression processing. A custom expression syntax built on ANTLR supports complex mathematical, statistical, string, and date operations (e.g. sqrt(x), bucket(age, [0,18,60])).

  • Smart optimization: automatic constant folding, common-subexpression elimination, and other optimizations reduce redundant computation and improve execution efficiency.

  • Dynamic type conversion: operations such as cast() handle multiple data types flexibly (numbers, vectors, strings, and so on).

  • Concise configuration, unified pipeline

TOML configuration: feature transforms are defined with declarative rules, for example:

[transform]
expressions = ['user_gender_cat1 = concat(gender, cat1_prefer)']
features = ["user_gender_cat1"]

Round-Robin Deduplicating Merge

Elements are taken from each recall list in turn, so the merged result is both diverse and ordered.

  • Deduplication first: elements already present in the result are skipped automatically.

  • Balanced rotation: each recall source gets an equal opportunity per round, which preserves diversity.

  • Order preservation: subject to global deduplication, the internal ordering of each list is kept as far as possible.

Suppose there are three recall channels:

recall 1: [A, B, C]
recall 2: [B, D]
recall 3: [A, E]


Merge process:
1. Round 1:
- recall 1 takes A
- recall 2 takes B
- recall 3 takes A (skipped, already present)
→ result: [A, B]

2. Round 2:
- recall 1 takes B (skipped)
- recall 2 takes D
- recall 3 takes E
→ result: [A, B, D, E]

3. Round 3:
- recall 1 takes C
- recall 2 is exhausted
- recall 3 is exhausted
→ final result: [A, B, D, E, C]
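
A minimal Python sketch of this round-robin deduplicating merge (an illustration of the algorithm described above, not RecGo's production code):

def round_robin_merge(*recall_lists):
    """Merge recall lists round-robin, skipping items already taken."""
    sentinel = object()
    seen, merged = set(), []
    iters = [iter(lst) for lst in recall_lists]
    while iters:
        remaining = []
        for it in iters:
            item = next(it, sentinel)
            if item is sentinel:        # this channel is exhausted
                continue
            remaining.append(it)
            if item not in seen:        # deduplication first
                seen.add(item)
                merged.append(item)
        iters = remaining
    return merged

print(round_robin_merge(["A", "B", "C"], ["B", "D"], ["A", "E"]))
# ['A', 'B', 'D', 'E', 'C']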

Application of DeepFM Model in AppS

· 5 min read

In the field of recommendation systems, efficiently combining low-order and high-order feature interactions to improve prediction accuracy has always been a key challenge. The DeepFM model offers a solution that combines memorization and generalization by integrating Factorization Machines (FM) with Deep Neural Networks (DNN). This article introduces the application and effectiveness of DeepFM in the AppS business.

Introduction

DeepFM (Deep Factorization Machine) is a recommendation system model that combines factorization machines (FM) with deep learning. It aims to capture both low-order and high-order feature interactions simultaneously. The architecture of DeepFM consists of two components: the FM component and the Deep component. The FM component is used to capture low-order feature interactions, while the Deep component learns high-order feature interactions through a multi-layer perceptron (MLP).

FM Module

  • Function: The FM module focuses on capturing second-order interactions between features. It leverages feature embeddings to compute interaction terms and efficiently represent relationships between sparse features.
  • Advantage: By modeling low-order interactions, the FM module effectively handles sparse data, making it particularly suitable for scenarios with a large number of sparse features.

DNN Module

  • Function: The DNN module is used to learn high-order feature combinations. Through a multi-layer neural network, DNN can capture complex nonlinear feature interactions.
  • Customization Capability: Users can design the network structure of the DNN according to specific needs, including the number of layers, the number of neurons in each layer, activation functions, and regularization strategies.
  • Advantage: With a flexible structure design, the DNN module can generalize to new feature combinations and improve the model's adaptability to different data distributions.

Benefits of DeepFM over FM

  1. Comprehensive Feature Interaction Capability: Traditional FM models mainly focus on second-order interactions between features, whereas DeepFM can effectively capture high-order feature interactions by introducing a deep learning component, thus improving recommendation accuracy.

  2. No Need for Manual Feature Engineering: DeepFM can automatically learn feature interactions, reducing the reliance on manual feature engineering, which is particularly useful for handling complex, large-scale datasets.

  3. Shared Feature Embeddings: The feature embedding layer in DeepFM is shared between the FM and Deep components, making the model more efficient in capturing feature interactions while reducing the number of model parameters.

Advantages of DeepFM

  • Comprehensive Capability: DeepFM combines the strengths of FM and DNN, allowing it to learn both low-order and high-order feature interactions without the need for feature engineering.
  • Model Simplicity: Compared to training FM and DNN separately and then combining them, DeepFM maintains model compactness and efficiency by sharing the feature embedding layer.
  • Wide Applicability: Due to its flexibility and strong expressive power, DeepFM is widely used in fields such as ad click-through rate prediction and recommendation systems.

Example Code for Developing DeepFM with PyTorch

In the following example, we will develop the FM and DNN modules separately and then combine them into a complete DeepFM model.

import torch
import torch.nn as nn

class DeepFM(nn.Module):
    def __init__(self, field_dims, embed_dim, mlp_dims):
        super(DeepFM, self).__init__()
        # One embedding table per field, shared by the FM and DNN components.
        self.embeddings = nn.ModuleList([
            nn.Embedding(dim, embed_dim) for dim in field_dims
        ])
        # First-order (linear) term: one scalar weight per feature value.
        self.linear = nn.ModuleList([
            nn.Embedding(dim, 1) for dim in field_dims
        ])
        self.fm = FM()
        self.dnn = DNN(len(field_dims) * embed_dim, mlp_dims)

    def forward(self, x):
        # x: LongTensor of shape (batch, num_fields) with one index per field.
        x_emb = torch.stack(
            [emb(x[:, i]) for i, emb in enumerate(self.embeddings)], dim=1
        )  # (batch, num_fields, embed_dim)
        x_linear = sum(lin(x[:, i]) for i, lin in enumerate(self.linear))
        x_fm = self.fm(x_emb)
        x_dnn = self.dnn(x_emb.view(x_emb.size(0), -1))
        return x_linear + x_fm + x_dnn

class FM(nn.Module):
    def forward(self, x):
        # x: (batch, num_fields, embed_dim); second-order interaction term.
        square_of_sum = torch.sum(x, dim=1) ** 2
        sum_of_square = torch.sum(x ** 2, dim=1)
        return 0.5 * torch.sum(square_of_sum - sum_of_square, dim=1, keepdim=True)

class DNN(nn.Module):
    def __init__(self, input_dim, dims):
        super(DNN, self).__init__()
        layers = []
        for dim in dims:
            layers.append(nn.Linear(input_dim, dim))
            layers.append(nn.ReLU())
            input_dim = dim
        layers.append(nn.Linear(input_dim, 1))  # project to a single logit
        self.layers = nn.Sequential(*layers)

    def forward(self, x):
        return self.layers(x)

# Example usage:
field_dims = [10, 10, 10]  # example per-field vocabulary sizes
embed_dim = 10
mlp_dims = [64, 32]
model = DeepFM(field_dims, embed_dim, mlp_dims)

# Dummy input: batch of 4 samples, one index per field
x = torch.randint(0, 10, (4, len(field_dims)))
output = model(x)
print(output)

Application

1. Feature Embedding Configuration

In our DeepFM model, the embedding dimension for each feature is set to 10. This configuration effectively captures low-order feature interactions and provides a solid foundation for subsequent high-order feature learning through the deep neural network.

2. Model Training and Optimization

Building on our experience with FM model training, the DeepFM model excels in combining memory and generalization. The FM component captures low-order feature interactions, while the DNN component learns high-order feature combinations. This combination achieves excellent results in the current business scenario.

  • Memory Capability: DeepFM uses the FM component's low-order interactions to capture known, stable feature combinations.

  • Generalization Capability: Through the DNN component, DeepFM can discover new, potential high-order feature combinations, enhancing the prediction of user behavior.

3. AB Testing Results

In the "Guess You Like" module, deploying the DeepFM model led to a 4.66% increase in average distribution per user. This result indicates that DeepFM significantly enhances the quality of personalized recommendations for users.

Further Reading

A Factorization-Machine based Neural Network for CTR Prediction - arXiv

Deep Factorization Machines — Dive into Deep Learning

DeepFM for recommendation systems explained with codes

Application of the ESMM Model in AppS

· 4 min read

In modern recommendation systems, particularly within the AppS business environment, predicting user behaviors such as Click-Through Rate (CTR) and Conversion Rate (CVR) is crucial for enhancing user satisfaction and driving business growth. The ESMM model, with its unique architecture and efficient multi-task learning capability, offers an outstanding solution for the AppS business.

Introduction

ESMM, short for Entire Space Multi-task Model, is a multi-task learning model specifically designed to tackle problems related to ad recommendations and user behavior prediction. The core idea of ESMM is to enhance the overall performance of the model by simultaneously learning multiple related tasks. This approach not only shares potential information between different tasks but also effectively alleviates the issue of data sparsity.

The ESMM model is typically applied to predict CTR and CVR. Traditional methods often train two separate models to predict CTR and CVR, whereas ESMM simultaneously performs these two prediction tasks within a unified framework, thereby capturing the correlation between them more effectively.

To learn more about the foundational concepts of ESMM, you can read this academic paper on ESMM.

Major Advantages of ESMM

  • Data Efficiency: By sharing the feature space, ESMM can better utilize data, especially in sparse data scenarios.
  • Performance Enhancement: By jointly learning multiple tasks, ESMM can better capture the mutual influences between related tasks, improving the accuracy of predictions.
  • Simplified Architecture: Compared to training multiple models independently, ESMM provides a more streamlined and efficient solution.

Differences Between ESMM and MMOE

In multi-task learning, besides ESMM, there is another popular model known as MMOE (Multi-gate Mixture-of-Experts). Both MMOE and ESMM aim to enhance the performance of multiple tasks by sharing information, but they exhibit significant differences in architecture and application scenarios:

Architectural Differences

  • ESMM: ESMM conducts multi-task learning by sharing the entire feature space. It primarily uses a unified network structure to simultaneously predict multiple tasks (such as CTR and CVR) and enhances overall performance by sharing underlying features.

  • MMOE: MMOE employs a more complex structure by introducing multiple expert networks and gating mechanisms to dynamically select suitable features and model paths for each task. Each task has its own gating network to select the most relevant information from multiple experts.

Application Scenarios

  • ESMM: Suitable for scenarios where tasks are highly related and require extensive information sharing, particularly when data is sparse and efficient information utilization is needed.

  • MMOE: More flexible and applicable to scenarios where task correlations are weaker or personalized feature selection is required. Due to its complex selection mechanism, MMOE performs better in situations with conflicting task requirements.

Performance Aspects

  • ESMM: Provides stable performance improvements between related tasks through its simplified network architecture and efficient feature sharing.

  • MMOE: Capable of offering higher prediction accuracy in complex task environments through flexible expert selection mechanisms, especially when task requirements are diverse.

Application

Similarity in Basic Structure Between ESMM and MMOE

The ESMM model shares many structural similarities with the traditional MMOE model. Both employ a multi-task learning framework to enhance the performance of different tasks by sharing information. However, ESMM takes a different approach to the final conversion prediction: it models the click-through-conversion probability (pCTCVR) as the product of the two towers' outputs, pCTR × pCVR, a design aimed at fully capturing the interaction between CTR and CVR.
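
As a rough illustration of this product formulation, here is a minimal PyTorch sketch (not the exact AppS architecture; the tower sizes, feature dimension, and loss setup are assumptions):

import torch
import torch.nn as nn

class ESMMSketch(nn.Module):
    """Two towers over a shared input; pCTCVR is supervised as pCTR * pCVR."""
    def __init__(self, input_dim, hidden_dim=64):
        super().__init__()
        def tower():
            return nn.Sequential(
                nn.Linear(input_dim, hidden_dim), nn.ReLU(),
                nn.Linear(hidden_dim, 1),
            )
        self.ctr_tower = tower()
        self.cvr_tower = tower()

    def forward(self, x):
        p_ctr = torch.sigmoid(self.ctr_tower(x))
        p_cvr = torch.sigmoid(self.cvr_tower(x))
        p_ctcvr = p_ctr * p_cvr  # product of the two towers
        return p_ctr, p_cvr, p_ctcvr

# Training supervises pCTR (click labels) and pCTCVR (conversion labels),
# both defined over the full exposure space; pCVR itself has no direct label.
model = ESMMSketch(input_dim=32)
x = torch.randn(4, 32)
click = torch.randint(0, 2, (4, 1)).float()
convert = torch.zeros(4, 1)
p_ctr, p_cvr, p_ctcvr = model(x)
loss = (nn.functional.binary_cross_entropy(p_ctr, click)
        + nn.functional.binary_cross_entropy(p_ctcvr, convert))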

Key Components of the ESMM Model

Two Expert Networks

When applied to the AppS business, ESMM uses two expert networks. These expert networks are responsible for handling features related to CTR and CVR tasks, respectively. Through specialized network structures, ESMM can better extract and utilize task-specific information, thereby enhancing prediction accuracy.

Two Gating Mechanisms

In addition to expert networks, ESMM also employs two gating mechanisms to control the CTR and CVR tasks separately. These gating mechanisms dynamically adjust the selection and utilization of features for each task, ensuring that each task receives the most suitable information flow. Through optimization of gating mechanisms, ESMM provides more precise results in complex user behavior predictions.

Experimental Results and Effects

In practical applications within the AppS business, the ESMM model has demonstrated significant results through A/B testing. In the "Guess You Like" module, the ESMM model successfully achieved a 6.45% increase in average distribution per user.

Further Reading

Entire Space Multi-Task Model: An Effective Approach for Estimating ... - arXiv

ESMM — easy_rec 0.8.5 documentation

GitHub - dai08srhg/ESMM: PyTorch implementation of Entire Space Multitask Model (ESMM)

Application of FM Model in AppS

· 5 min read

Introduction

Factorization Machines (FM) are powerful machine learning models, especially widely used in recommendation systems and advertising click-through rate prediction. FM models can effectively capture the cross information between features and are highly efficient and easy to implement in engineering.

1. Advantages of FM over Linear Regression (LR)

Linear Regression (LR) is a simple and intuitive model, but it cannot capture the cross and nonlinear relationships between features. The FM model introduces latent vectors to factorize features, effectively capturing second-order interactions between features. Compared to LR, FM has the following advantages:

  • Feature Interaction: FM can automatically learn interactions between features without manually constructing cross-features, leading to better performance on complex datasets.
  • Model Flexibility: FM performs well in highly sparse datasets and is suitable for scenarios with a large number of categorical features.
  • Strong Generalization: By factorizing features, FM can avoid overfitting, especially when the data is high-dimensional but with a small sample size.

2. Time Complexity Analysis of FM

A significant advantage of the FM model is its efficient computational capability. Although FM considers interactions between all pairs of features, its time complexity remains $O(N)$ rather than $O(N^2)$. This is because FM simplifies the calculation of feature interactions through factorization as follows:

$\hat{y}(x) = w_0 + \sum_{i=1}^{N}w_ix_i + \sum_{i=1}^{N}\sum_{j=i+1}^{N} \langle v_i, v_j \rangle x_ix_j$

where $\langle v_i, v_j \rangle$ denotes the inner product of two feature latent vectors, significantly reducing computational cost. The steps are as follows:

  • Linear Term: $\sum_{i=1}^{N}w_ix_i$, with a time complexity of $O(N)$.
  • Interaction Term: Through factorization, $\sum_{i=1}^{N}\sum_{j=i+1}^{N} \langle v_i, v_j \rangle x_ix_j$ can be computed in $O(N)$ time.

This efficiency allows FM to maintain fast computation speeds even when handling large-scale datasets.
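
The $O(N)$ claim for the interaction term comes from the standard reformulation below, where $k$ is the latent-vector dimension (so the exact cost is $O(kN)$):

$\sum_{i=1}^{N}\sum_{j=i+1}^{N}\langle v_i, v_j\rangle x_ix_j = \frac{1}{2}\sum_{f=1}^{k}\left[\left(\sum_{i=1}^{N} v_{i,f}\,x_i\right)^2 - \sum_{i=1}^{N} v_{i,f}^2\,x_i^2\right]$

The right-hand side is exactly what the interaction_part in the PyTorch example below computes.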

3. Ease of Engineering Implementation

The FM model not only has theoretical advantages but is also relatively simple to implement in practice, especially when developed and deployed using deep learning frameworks like PyTorch. Below is a brief explanation and example code of implementing FM using PyTorch:

Developing and Training FM Model with PyTorch

Developing an FM model with PyTorch is very straightforward, as PyTorch's flexibility and ease of use make customizing model structures and training processes simple. Here is a simplified implementation example of an FM model:

import torch
import torch.nn as nn
import torch.optim as optim

class FactorizationMachine(nn.Module):
    def __init__(self, n_features, k):
        super(FactorizationMachine, self).__init__()
        self.linear = nn.Linear(n_features, 1)
        self.v = nn.Parameter(torch.randn(n_features, k))

    def forward(self, x):
        linear_part = self.linear(x)
        interaction_part = 0.5 * torch.sum(
            torch.pow(torch.mm(x, self.v), 2) - torch.mm(torch.pow(x, 2), torch.pow(self.v, 2)),
            dim=1, keepdim=True
        )
        return linear_part + interaction_part

# Example usage
n_features = 10  # Assuming 10 features
k = 5            # Dimension of latent vectors

model = FactorizationMachine(n_features, k)
criterion = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.01)

# Assuming we have some training data
X_train = torch.randn(100, n_features)  # 100 samples
y_train = torch.randn(100, 1)

# Training process
model.train()
for epoch in range(100):  # Train for 100 epochs
    optimizer.zero_grad()
    outputs = model(X_train)
    loss = criterion(outputs, y_train)
    loss.backward()
    optimizer.step()

    if (epoch + 1) % 10 == 0:
        print(f'Epoch [{epoch+1}/100], Loss: {loss.item():.4f}')

Through the above code example, it is evident that implementing the FM model with PyTorch is intuitive and flexible. Users can easily adjust model structures, optimizers, and loss functions to quickly adapt to different business needs.

Application

1. Feature Embedding Configuration

In our FM model, the embedding dimension for each feature is set to 10. This configuration allows us to fully capture the interactions between features without adding excessive computational overhead. Choosing the right embedding dimension is a critical step in balancing model complexity and computational efficiency.

2. Addressing Sample Imbalance Issues

During the model training process, we found a significant imbalance in the ratio of positive to negative samples, primarily due to the substantial differences in homepage exposure click data. To address this issue, we conducted two experiments:

  • Random Negative Sample Dropping: This straightforward method aims to balance the sample ratio by reducing the number of negative samples. However, our experiments showed that while this approach can somewhat alleviate the imbalance issue, it does not significantly improve model performance.

  • Positive Sample Weighting: In contrast, we applied weighting to positive samples, giving them higher learning importance. This method better emphasizes genuine user interest behaviors. By adjusting sample weights, we effectively increased the model's focus on positive samples, leading to a significant improvement in overall model performance.

The experiments demonstrated that positive sample weighting outperforms random negative sample dropping. This is mainly because weighting allows us to more accurately capture user interest preferences, avoiding model bias caused by an excess of negative samples.
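
In PyTorch this kind of positive-sample weighting can be expressed with the pos_weight argument of BCEWithLogitsLoss; the sketch below is illustrative only, and the weight value is an assumption, not the one used in production:

import torch
import torch.nn as nn

# Up-weight positive (click) samples relative to negatives.
# pos_weight > 1 means each positive contributes more to the loss.
pos_weight = torch.tensor([5.0])  # hypothetical value
criterion = nn.BCEWithLogitsLoss(pos_weight=pos_weight)

logits = torch.randn(8, 1)                    # raw model outputs
labels = torch.randint(0, 2, (8, 1)).float()  # click labels
loss = criterion(logits, labels)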

3. Data Issues and Model Optimization

During the initial model training, we encountered a critical issue: training with the full dataset did not achieve the expected results. Upon thorough investigation, we discovered that many homepage exposures used frontend caching due to network issues. These cached data were not user-initiated actions, thus interfering with the model's learning process.

To resolve this issue, we differentiated the reported data on the first screen, removing the cached data. This approach ensured that the model was trained using genuine user behavior data, ultimately ensuring the effectiveness of the model once deployed.

4. AB Testing Results

The effect of online weighting was verified through AB testing. Specific AB test screenshots will be presented here, further proving the effectiveness of our optimization strategies in practical applications. In the Guess You Like section on the homepage, the average distribution per person increased by 14.8%.

Further Reading

Factorization Machine models in PyTorch - GitHub

Factorization Machines

Application of the MMOE Model in AppS

· 5 min read

In the AppS business, recommendation systems need not only to improve user Click-Through Rate (CTR) but also to enhance Conversion Rate (CVR) to achieve comprehensive user engagement and business growth. The Multi-gate Mixture-of-Experts (MMOE) model offers an efficient solution by simultaneously optimizing multiple objectives to meet these business needs.

Introduction

In the field of recommendation systems and advertising, models often need to optimize multiple objectives simultaneously, such as Click-Through Rate (CTR) and Conversion Rate (CVR). The Multi-gate Mixture-of-Experts (MMOE) model provides an effective solution by achieving better goal synergy optimization within a multi-task learning framework.

1. What is MMOE?

MMOE (Multi-gate Mixture-of-Experts) is a deep learning architecture for multi-task learning, designed to simultaneously optimize multiple related but distinct objectives. It introduces multiple "expert" networks and "gating" mechanisms to dynamically select and combine different expert outputs to meet the needs of different tasks.

Core Components

  • Expert Networks: Multiple sub-networks, each responsible for learning different representations of input features to meet the needs of different tasks.
  • Gating Networks: For each task, MMOE introduces an independent gating network responsible for selecting the appropriate combination of experts for the input sample. The gating network dynamically allocates weights to each expert based on input features.

Working Principle

MMOE combines expert networks through the gating mechanism, enabling the model to flexibly select the most suitable feature combinations for each task while sharing basic features. This mechanism allows for the sharing of information between tasks while mitigating negative transfer effects.
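
A minimal PyTorch sketch of this expert/gating mechanism (a generic MMOE forward pass for illustration; the layer sizes and number of experts are assumptions, not the AppS configuration):

import torch
import torch.nn as nn

class MMOESketch(nn.Module):
    def __init__(self, input_dim, num_experts=2, num_tasks=2, hidden_dim=32):
        super().__init__()
        # Expert networks: each learns its own representation of the input.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(input_dim, hidden_dim), nn.ReLU())
            for _ in range(num_experts)
        ])
        # One gate per task: softmax weights over the experts.
        self.gates = nn.ModuleList([
            nn.Linear(input_dim, num_experts) for _ in range(num_tasks)
        ])
        # One prediction tower per task.
        self.towers = nn.ModuleList([
            nn.Linear(hidden_dim, 1) for _ in range(num_tasks)
        ])

    def forward(self, x):
        expert_out = torch.stack([e(x) for e in self.experts], dim=1)  # (B, E, H)
        outputs = []
        for gate, tower in zip(self.gates, self.towers):
            w = torch.softmax(gate(x), dim=-1).unsqueeze(-1)           # (B, E, 1)
            mixed = (w * expert_out).sum(dim=1)                        # (B, H)
            outputs.append(torch.sigmoid(tower(mixed)))                # e.g. pCTR, pCVR
        return outputs

model = MMOESketch(input_dim=16)
p_ctr, p_cvr = model(torch.randn(4, 16))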

2. Why Use MMOE?

In business scenarios, especially in advertising and recommendation systems, it is often necessary to optimize multiple key metrics simultaneously. For example, improving Click-Through Rate (CTR) and Conversion Rate (CVR) are common business objectives. Traditional single-task models often struggle to balance these goals, while MMOE offers an ideal solution for multi-objective optimization.

Business Goals: Considering CTR and CVR

  • Improving CTR (Click-Through Rate): CTR is a metric that measures the ability of advertisements or recommendations to attract user clicks. Increasing CTR can directly enhance user interaction and engagement.

  • Improving CVR (Conversion Rate): CVR is a metric that measures the ability of users to complete target behaviors (such as purchases, registrations, etc.). Enhancing CVR can directly impact the business's final revenue.

Why Choose MMOE?

  • Task Synergy Optimization: MMOE allows for the simultaneous optimization of CTR and CVR objectives by sharing feature representations of expert networks and using independent gating mechanisms.
  • Reducing Negative Transfer: Through the combination of experts and gating, MMOE effectively reduces negative transfer effects between tasks, ensuring that optimizing one objective does not significantly harm the other.
  • Dynamic Adaptability: The dynamic gating mechanism of MMOE enables the model to adjust expert combinations in real-time based on input features, adapting to different user behavior patterns and preferences.

MMOE demonstrates its strong adaptability and optimization performance in multi-task learning. For scenarios that require simultaneous consideration of multiple business goals, such as optimizing ad CTR and CVR, MMOE provides an efficient and flexible solution. By designing expert and gating structures wisely, MMOE can achieve more refined recommendation and advertising strategies in complex business environments.

Application

Model Architecture and Strategy

1. Retaining FM Cross Strategy

In the MMOE architecture, we continue to retain the original Factorization Machine (FM) cross strategy. This strategy excels in modeling low-order feature interactions, effectively capturing the basic relationship between users and content. We integrate FM's output into the final output of MMOE to ensure full utilization of basic feature interaction information.

2. Expert Network Design

The MMOE model employs two expert networks, each focusing on learning different feature combinations. Through diversified expert network design, the model can capture user behavior patterns from different perspectives and enhance adaptability to complex data.

3. Gating Mechanism

We designed two independent gating networks responsible for optimizing CTR and CVR, respectively:

  • CTR Gating: Controls the expert combination related to click-through rate features, ensuring that recommended content can attract user clicks.
  • CVR Gating: Optimizes the feature combination related to download conversion, increasing the probability of users downloading applications.

4. Training Strategy

During model training, we use click and download behaviors as task labels and design a fixed-weight loss function:

  • Loss Weight Allocation:
    • The loss weight for PCTR is set to 0.95, emphasizing the optimization of click behavior.
    • The loss weight for PCTCVR is set to 0.05, ensuring that download behavior receives appropriate attention.

This weight allocation ensures that CTR is the primary optimization direction while also considering the CVR objective.

5. Online Inference and Ranking

During online inference, we apply the same weights to PCTR and PCVR and rank them based on the weighted scores. Through this strategy, we can balance the priorities of clicks and downloads in recommendation ranking, optimizing user experience and business metrics.
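
For concreteness, the fixed-weight training loss and the weighted ranking score might look like the sketch below (illustrative only; beyond the stated 0.95/0.05 split, the loss form and ranking weights are assumptions):

import torch
import torch.nn as nn

bce = nn.BCELoss()

def training_loss(p_ctr, p_ctcvr, click, download):
    """Fixed-weight multi-task loss: 0.95 on the click task and 0.05 on the
    click-through-conversion task, as described above."""
    return 0.95 * bce(p_ctr, click) + 0.05 * bce(p_ctcvr, download)

def ranking_score(p_ctr, p_cvr, w_ctr=0.5, w_cvr=0.5):
    """Online ranking: weight pCTR and pCVR equally and sort by the score."""
    return w_ctr * p_ctr + w_cvr * p_cvr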

AB Testing Results

By applying the MMOE model in the "Guess You Like" module, our AB testing results showed a 13.1% increase in average distribution per user. This significant improvement validates the effectiveness of the MMOE model in simultaneously optimizing CTR and CVR, bringing higher user engagement and conversion rates to the AppS business.

Conclusion

The MMOE model achieves comprehensive optimization of CTR and CVR in the AppS business through its flexible expert and gating mechanisms. Combined with the FM cross strategy, MMOE not only enhances the predictive ability of recommendation systems but also improves multi-objective synergy optimization of user behavior, providing strong support for business development.

Further Reading

Modeling Task Relationships in Multi-task Learning with Multi-gate Mixture-of-Experts

The Annotated Multi-Task Ranker: An MMoE Code Example

Application Store (AppS) Introduction

· 2 min read

AppS functions as a digital distribution platform similar to Apple's App Store or Google Play, focusing on providing users with a variety of applications and games. Our primary key performance indicator is the number of user downloads and installations, which directly reflects the platform's usage and user satisfaction.

Current Status

  1. Network Adaptability
    In situations of no network or weak network, this platform utilizes cached information to ensure that the user interface remains fully populated, avoiding blank screens. This feature enhances user experience under various network conditions.

  2. App and Game Distribution
    The platform primarily distributes apps and games. With the expertise of our operational staff, these are configured for placement on different leaderboards. However, as these leaderboards are manually operated, the configuration tasks are heavy, with the homepage alone featuring more than 20 leaderboards. This results in longer onboarding cycles, typically requiring two weeks for review. This manual configuration approach leads to slow leaderboard updates and poor performance of new apps during the cold start phase, reducing overall distribution efficiency.

  3. Performance of "Guess You Like" Section
    The "You May Like" section of the platform is a relatively high-traffic area, accounting for 45% of the platform's daily active users, yet it only contributes 5% of the total downloads. Despite the high traffic, the limited exposure slots result in download volumes not matching the expected contribution levels.

Future Goals

To enhance the number of downloads and installations, the platform plans to integrate an advanced recommendation system. Through intelligent recommendations, we aim to effectively expedite leaderboard updates and improve the exposure of new applications, thereby enhancing overall distribution efficiency and optimizing user experience.