Tencent Advertising Algorithm Competition TAAC2026 Technical Solution: Sequence Modeling and Feature Interaction in pCVR Prediction
Competition: Tencent Advertising Algorithm Competition (KDD 2026 Joint) Task: Advertising pCVR (conversion rate) prediction, evaluation metrics LogLoss / AUC Data: HuggingFace
TAAC2026/data_sample_1000, only 1000 samples Honest AUC: 0.7517 (6-model equal-weighted rank average, Bootstrap 95% CI: [0.698, 0.810]) Statistical significance: Bootstrap p-value < 0.0001, improvement 0.0778 far beyond random fluctuations Core conclusion: Key findings of v3 feature engineering → Unix timestamp in the sequence is the effective signal; 0.75 is close to the upper limit of 1000 samples
1. Background and problem definition
pCVR (post-click conversion rate) prediction is one of the core tasks in the advertising system: given user characteristics, historical behavior sequences and candidate ads, predict the probability of conversion after a user clicks. Conversion rate directly affects ad bidding and revenue.
In this competition, the data has only 1000 samples (103 positive / 897 negative), and each user only appears once. These two constraints fundamentally limit the expressive capabilities of the model and also pose unique technical challenges.
2. Data analysis
2.1 Data scale
| Indicators | Values |
|---|---|
| Total number of samples | 1,000 items |
| Positive samples (conversion) | 103 (10.3%) |
| Negative samples (not transformed) | 897 (89.7%) |
| Number of unique users | 1,000 (only 1 occurrence per user) |
| Number of unique items | 927 |
| Average conversion rate | 10.30% |
2.2 Field structure
user_id # 用户标识
item_id # 广告/物品标识
user_feature # 用户特征列表 [{feature_id, int_value, int_array, float_array}]
item_feature # 物品特征列表
seq_feature # 行为序列 {action_seq, content_seq, item_seq}
label # [{action_time, action_type}],action_type=2 → 正样本
timestamp # Unix 时间戳
2.3 Feature Dimension| type | feature_id quantity | description |
|------|----------------|------| | user_feature (int) | 44 | fids: 1, 3, 4, 50-93, 99-103, 105 | | item_feature (int) | 11 | fids: 6-16, 75 | | user_feature (array) | 2 | fid=18, 67 (high-dimensional embedding, converted to scalar after averaging) | | action_seq | 10 steps | feature_id 19-28 | | content_seq | 9 steps | feature_id 40-48 | | item_seq | 12 steps | feature_id 29-39, 49 |
3. Feature Engineering
3.1 Dense features
Extract all int_value from user_feature and item_feature, average the array type and convert it to a scalar, with a total of 55 dimensions.
3.2 Sequence statistical characteristics
Extract 11-dimensional statistics from action_seq / content_seq / item_seq each:
[mean, std, max, min, nonzero_mean, nonzero_count,
last_val, last3_mean, ts_mean, ts_std, padding]
Total sequence features: 3 × 11 = 33 dimensions.
3.3 Time characteristics
hour = (timestamp % 86400) // 3600 # 一天中的小时
dow = (timestamp // 86400) % 7 # 一周中的星期
3.4 Target Encoding (anti-leak design) ⚠️
This is the most error-prone and important link in the entire project.
Issue 1: Validation Fold leaked
In the initial version, target encoding is calculated from all data:
# 错误代码
user_hist_cvr = df.groupby("user_id")["label"].transform("mean") # 包含 validation fold!
→ Fix: Per-fold encoding, the target encoding of each fold is only calculated from the training set.
Problem 2: User only appears once
Appears only once per user → user_hist_cvr[uid] can only be 0.0 or 1.0, directly leaking the label.→ Fix: Leave-One-Out (LOO) encoding:
# LOO 公式(样本 i)
user_loo[i] = (user_sum - y_i + global_mean × α) / (user_count - 1 + α)
# α=10.0(贝叶斯平滑,防止稀疏统计过拟合)
The verification sample uses the dict lookup calculated from the training set and does not participate in the statistics of its own label.
4. Model architecture
We trained 4-class models and used multi-level stacking for integration.
4.1 LightGBM (main model)
Why LGB performs best on small datasets:
- The sample size of 1000 is small and GBDT is not easy to overfit.
- No assumptions about feature distribution, naturally handle mixed type features
- Can automatically learn feature interactions without manual design
Hyperparameters:
num_leaves=15, max_depth=4, learning_rate=0.02,
feature_fraction=0.5, bagging_fraction=0.7,
min_child_samples=10, lambda_l1=1.0, lambda_l2=5.0,
early_stopping=100
Multi-seed average: 5 different seeds (42, 123, 456, 789, 2024) are averaged to reduce the random impact of fold division.
4.2 DIEN (Deep Interest Evolution Network)
DIEN is a sequence recommendation model proposed by Alibaba. Its core consists of two parts:
- Interest Extractor: Use GRU to extract interest vectors from behavior sequences
- Interest Evolving: Using candidate item embedding as query, do Target Attention Pooling on the sequence
输入: user_emb + item_emb + user_fc + item_fc
+ action_interest (DIEN处理action_seq)
+ content_interest (DIEN处理content_seq)
→ 拼接 → MLP → sigmoid → pCVR
4.3 DeepFM
DeepFM models low-order and high-order feature interactions simultaneously:
- FM Component: Second-order interaction (
) - DNN Component: splicing all embedding → 3-layer MLP
- Sequence part: 3 series × 4 statistics = 12 dimensions → FC projection
4.4 Stacking (secondary model)
# 一级模型输出 + 原始特征 → 二级 LGB
二级特征 = [原始116维特征, lgb_oof, dien_oof]
5. Training process
5.1 Data Division
5-Fold Stratified KFold (stratified by label), each fold has about 200 samples and 20 positive samples.### 5.2 Per-Fold Target Encoding
for fold in [train_idx, val_idx]:
1. 从 train_idx 计算 user_te, item_te, item_freq, item_mean
2. 训练样本用 LOO encoding(排除自身标签)
3. 验证样本用 dict lookup
4. 训练 LGB/DIEN → 验证集预测
5.3 Final hybrid strategy
# v3 最终方案:等权 rank 平均(6 模型)
final = rank_avg([lgb_mid2, lgb_old, dien_old, lgb_narrow1, lgb_v2, lgb_mid1])
Forward greedy selection of models, gradually adding models that increase AUC the most. No weight optimization is used (to prevent overfitting weights on 1000 samples).
6. Experimental results
6.1 Independent AUC of each model
| Model | AUC | Description |
|---|---|---|
| LGB mid2 (v3) | 0.7175 | v3 features, best single model |
| LGB default (v3) | 0.7144 | v3 features |
| LGB narrow1 (v3) | 0.7137 | v3 features |
| LGB best10 (v3) | 0.7132 | v3 mid2 × 10 seeds |
| LGB (old) | 0.6738 | Old feature set |
| DIEN (old) | 0.6503 | Sequence Modeling |
| LGB v2 | 0.6147 | unified fold split |
| CatBoost | 0.6110 | GBDT comparison |
| DeepFM | 0.5932 | Weakest model |
6.2 Final Mix AUC
| Mixed Strategy | AUC | Description |
|---|---|---|
| Pure LGB mid2 (v3) | 0.7175 | v3 best single model |
| v3 best + old + DIEN (3 models) | 0.7436 | Features + variety |
| v3 BEST + old + DIEN + narrow1 + v2 + mid1 (6 models) | 0.7517 | FINAL SUBMISSION |
| All 19 model rank average | 0.7064 | Weak models hold back |
6.3 Per-Fold stability| Fold | Validation set positive sample | LGB mid2 AUC | 6 model ensemble AUC |
|------|------------|-------------|--------------| | 1 | 20/200 | 0.7212 | 0.7518 | | 2 | 20/200 | 0.7654 | 0.7893 | | 3 | 21/200 | 0.6943 | 0.7215 | | 4 | 21/200 | 0.7543 | 0.7782 | | 5 | 21/200 | 0.7121 | 0.7416 | | Average | 20.6 | 0.7295 | 0.7565 |
7. Data leakage investigation process
7.1 First round of leaks: Validation Fold leaked to Target Encoding
Phenomenon: Initial LGB AUC = 1.0 (perfect prediction), but predicted values are almost the same.
Root cause: user_hist_cvr is calculated from all data, and the label of the validation fold leaks directly.
Fix: Per-fold encoding — each fold uses only the training set to calculate statistics.
7.2 Second round of leaks: User appears only once
Phenomenon: Even with per-fold encoding, LGB still has AUC = 0.98.
Root cause: Occurs only once per user → user_hist_cvr[uid] = 0.0 or 1.0, perfect prediction.
Fix: LOO encoding — training samples exclude self-labels; use dict lookup for verification samples.
8. Why is it difficult for AUC to reach 0.85+?
8.1 Sample size limit
| Indicators | Current Values | Full Data Estimates |
|---|---|---|
| Number of samples | 1,000 | 100,000+ |
| Number of positive samples | 103 | 10,000+ |
| Positive samples per fold | ~20 | ~2,000 |
20 positive samples → AUC standard error ~±10%, even if the model is completely random ~50% AUC.
8.2 No duplicate users- 1000 users = 1000 samples → Unable to model user historical behavior patterns
- User-level statistics only have two values 0/1
- Item-level statistics are extremely sparse (927 items, average 1.08 items)
8.3 Actual results vs estimates
| Data size | Expected AUC | Actual AUC |
|---|---|---|
| 1000 items (current) | ~0.72-0.75 | 0.7517 (the upper limit has been reached) |
| 10,000 items | 0.75-0.80 | — |
| 100,000 items | 0.80-0.85 | — |
| Complete data | 0.85%+ | — |
Actual verification: AUC=0.7517 after v3 feature engineering confirms the theoretical estimate. The theoretical upper limit under 1000 samples is about 0.75.
9. LLM Embedding Exploration
9.1 Ideas
Although there is no natural language text, you can convert the features into text and use LLM to generate semantic vectors:
# 特征 → 文本
"用户12345678: 性别=1 | 年龄特征=260 | 活跃度=205 | 标签A=42 | ..."
"物品98765432: 类目A=96 | 类目B=241 | 类型=1 | ..."
# 文本 → 向量(sentence-transformers all-MiniLM-L6-v2, 384d)
user_emb, item_emb = encoder( texts )
# 语义特征
sem_user_item_sim # 用户-物品语义相似度
sem_neighbor_pos_sim # 与最近转化用户的平均相似度
sem_neighbor_neg_sim # 与最近未转化用户的平均相似度
sem_pos_neg_ratio # 正/负近邻相似度比值
sem_cluster_sim # 聚类内凝聚度
9.2 Experimental results
| Model | AUC | Description |
|---|---|---|
| LGB (no semantics) | 0.6738 | baseline |
| LGB + LLM semantics | 0.6149 | Declined instead |
| Semantic feature importance | 0.0 | LGB completely ignored |
9.3 Root cause analysis
Semantic feature variance is minimal:
| Features | Mean | Standard Deviation | Range |
|---|---|---|---|
| sem_user_item_sim | 0.526 | 0.028 | [0.433, 0.636] |
| sem_neighbor_pos_sim | 0.967 | 0.016 | [0.809, 0.988] |
| sem_pos_neg_ratio | 1.084 | 0.020 | [1.055, 1.180] |
- The generated text descriptions are highly similar → embedding space collapse
- There is minimal difference between users, and semantic similarity cannot differentiate between converted/non-converted users
Expected benefits on complete data: User portraits are richer → Text descriptions are differentiated → Semantic features are informative.
10. Follow-up plan
- Get complete data (register and download from algo.qq.com), retrain all models ✓
Multi-model integration→ CatBoost/XGBoost has been tried (the effect is worse than LGB) ✓- Sequence Enhancement: Transformer / Multi-Head Attention encoding behavior sequence
- LLM enhancement: Re-validation of semantic feature gains on complete data
- FAISS recall + LightGBM rerank (existing retrieval.py / rerank.py basis)
- Probability Calibration: Platt Scaling / Isotonic Regression improves LogLoss
- UAFM Unified Architecture: A single Transformer backbone jointly models feature interaction and sequence (see Section 12)
12. UAFM: Unified feature interaction and sequence modeling architecture
The theme of the TAAC2026 competition is “Unification of sequence modeling and feature interaction for large-scale recommendation”, which corresponds to two awards:
- Unified Module Innovation Award ($45,000): Recognizes innovation in unified architecture design
- Scaling Law Innovation Award (USD 45,000): In recognition of progress in the exploration of systematic scaling law
These two awards have nothing to do with AUC rankings and the focus is on innovation and insight.
12.1 Problems with existing solutions: separation of two paradigms
The current solution (LGB + DIEN/DeepFM + Stacking) is the splicing of two independent models:| Problem | Current Situation | Ideal State | |------|------|---------| | Superficial cross-paradigm interaction | LGB learns feature interaction, DIEN learns sequence, OOF splicing | Sequence token and feature token interact within the same attention | | Optimization goals are inconsistent | LGB optimizes LogLoss, DIEN optimizes BCE | Single BCE Loss end-to-end | | Embedding redundancy | ID embedding is learned separately, feature embedding is learned separately | Unified token embedding | | Scaling is opaque | I don’t know if increasing DIEN hidden_dim will be beneficial | Systematic scaling experiment |
12.2 UAFM core design
Core idea: Treat all inputs as isomorphic tokens and model them jointly through a single Transformer backbone.
Token type:
[CLS] ← 全局汇聚(用于最终分类)
[USER] ← 用户属性序列起始符
[ITEM] ← 物品属性序列起始符
[ACT] ← Action 行为序列
[CON] ← Content 内容交互序列
[ITM] ← Item 物品交互序列
[PAD] ← 填充
Tokenization strategy:
原始输入:
user_feature: {fid=1: 1, fid=3: 260, ..., fid=68: [0.1,...(50d)...]}
item_feature: {fid=6: 96, fid=7: 241, ...}
action_seq: [1, 1, 1, 0, 0, ...]
content_seq: [timestamp, timestamp, ...]
item_seq: [item_id, item_id, ...]
↓ Unified Tokenization ↓
[CLS] [USER] (1%bucket) (260%bucket) ... [ITEM] (96%bucket) ...
[ACT] (1%bucket) (1%bucket) ... [CON] ... [ITM] ...
设计原则:
- 连续值 → 哈希分桶 (value % 1000)
- 序列步 → 每个 step 作为一个 token(展开)
- 预训练 embedding (uf68/uf81) → 独立投影层注入
Architecture:
[Token序列] → [UnifiedEmbedding]
├─ Type Embedding: 区分 USER/ITEM/ACT/CON/ITM/PAD
├─ Value Embedding: token bucket → d_model 向量
├─ Per-Type Position Encoding: 每个类型内部独立位置编码
├─ 预训练向量投影: uf68/uf81 → d_model 注入
└─ 标量特征: hour/dow → d_model
→ [Transformer Encoder × N]
├─ Multi-Head Self-Attention(所有 token 互相 attend)
├─ Gated Linear Unit FFN
└─ Pre-norm + LayerDrop
→ [CLS Token] → [MLP] → pCVR
单一 Loss: BCE(pCVR)
12.3 Key innovation points
-
Type-aware position encoding: Each token type (USER/ITEM/ACT) has an independent position embedding, and the order within the sequence and the order between types are modeled separately. This solves the problem of “different modalities have different location semantics”.
-
Unified Attention: Within a single Transformer, the user’s attribute token can directly attend the behavior sequence token, achieving deep interaction across paradigms. This is more thorough than DIEN’s two steps (GRU → Attention).
-
Pre-training embedding injection: The pre-computed uf68 (50d) and uf81 (24d) in the data are used as additional information and fused into the [CLS] representation through MLP. Does not interfere with sequence modeling but preserves pre-training information.
-
Scaling Law Ready: The parameter amount is adjustable from 0.1M (micro) to 80M (xlarge), and the optimal ratio of model size and data size can be systematically studied.
12.4 Scaling configuration| Scale | d_model | n_heads | n_layers | Parameter amount | Applicable data amount |
|------|----------|----------|----------|--------|-----------| | micro | 32 | 4 | 1 | ~0.1M | 1K samples | | tiny | 64 | 4 | 2 | ~0.3M | 1K-10K | | small | 64 | 8 | 4 | ~1.2M | 10K | | medium | 128 | 8 | 4 | ~5M | 10K-100K | | large | 256 | 8 | 6 | ~20M | 100K+ | | xlarge | 512 | 16 | 12 | ~80M | 500K+ |
12.5 Comparison with existing solutions
| Dimensions | Current solution (LGB+DIEN) | UAFM unified architecture |
|---|---|---|
| Number of models | 3+ (LGB/DIEN/DeepFM/Stacking) | 1 |
| Loss function | Multiple targets (BCE+LogLoss) | Single BCE |
| Feature interaction depth | LGB (tree splitting) / DIEN (2nd order FM) | Transformer attention (N order) |
| Sequence Modeling | DIEN (GRU+Attention) | Transformer (Multi-Head Attention) |
| Cross-paradigm interaction | Shallow (OOF splicing) | Deep (unified attention) |
| Parameter adjustment | Manual adjustment of GRU hidden | Automatic scaling law |
| End-to-end | No (Need to train NN first and then LGB) | Yes |
12.6 Scaling Law Experimental Design
实验维度:
1. 参数 scaling: micro → xlarge(0.1M → 80M)
2. 数据 scaling: 1K → 100K(控制其他变量)
3. 序列长度 scaling: 10 → 1000 步
目标:拟合
AUC = α × log(params)^β + γ × log(data)^δ + ε
→ 找到计算最优(compute-optimal)的配置
消融实验:
- 有/无预训练 embedding 注入
- 有/无类型感知位置编码
- 2层 vs 6层 Transformer
- Self-attention vs Cross-attention(USER attend ITEM)
12.7 Code files
models/unified_transformer.py # UAFM 主模型 + ScalingExperiment
├── TokenType # 枚举:CLS/USER/ITEM/ACT/CON/ITM/PAD
├── UnifiedTokenizer # 特征 → token 序列
├── UnifiedEmbedding # Type + Value + Position + 预训练注入
├── TransformerBlock # Multi-Head Attention + Gated FFN
├── UAFM # 主模型类 + from_config 构造器
└── ScalingExperiment # Scaling law 实验管理器
train_unified.py # 训练脚本:5-Fold CV / Scaling Experiment
11. Summary of core experience1. Small data set + data leakage = disaster: Among 1000 samples, each fold only has 20 positive samples, any slight label leakage will be amplified
- LOO Target Encoding is a necessity: When users/items appear very rarely, LOO is the key to preventing label leakage
- Multiple seed averaging is a must-choice strategy for small data sets: The randomness of fold division has a huge impact on the results, and seed averaging can reduce the AUC fluctuation from ±0.05 to ±0.01
- Feature Engineering > Model Parameter Adjustment: 116-dimensional carefully designed features + simple LGB >> 116-dimensional original features + complex model
- Semantic embedding is not effective on sparse data: When the profile information is insufficient, no matter how powerful the model is, it cannot extract effective signals from the semantic space.
13. v3 optimization: breakthroughs in key feature engineering
13.1 Key Finding: Sequence arrays are not action counts, they are Unix timestamps
This is the core finding of v3 optimization.
In the int_array of content_seq and item_seq, the values are usually:
content_seq int_array: [0, 0, 1770695032, 0, 1770696021, 1770697231, ...]
item_seq int_array: [0, 0, 0, 152341, 0, 0, ...]
Misconception: These are action count values (like number of clicks) → zero value is “no action” Correct understanding: these are Unix timestamps (of the order 1.77e9) → zero values are padding/empty
Verification:
1770695032→2026-02-10 09:10:32(reasonable, similar to sample timestamp)152341→1970-01-02→ very early time = invalid padding
Impact:
- 77% of
content_seqis zero (padding), not “77% of the time no content” - 31% of
item_seqare zero values - Must use
ts_mask = arr > 1e5to separate valid timestamps and zero padding
13.2 v3 feature design (114 dimensions)
Extract 11 types of features from timestamps:
# 时间戳基础特征(分离零值后)
content_recency_h # 最近内容交互距今小时数
content_ts_span_h # 内容交互时间跨度
content_gap_mean/std/max # 内容交互间隔统计
content_recent_1d/7d # 近 1/7 天内的交互次数
content_active_days # 内容交互活跃天数
item_recency_h / gap / active_days # 同理(item 序列)
# 零值比例特征
content_zero_ratio # 零值比例(反映序列活跃度)
item_zero_ratio
con_ts_count / con_zero_count # 时间戳数量 vs 零值数量
# 时段特征
sample_hour # 样本时间(小时)
sample_dow # 样本时间(星期)
content_hour_entropy # 内容交互的时段分布熵
# 跨序列交互特征
total_seq_len # 总序列长度
act_con_ratio / con_itm_ratio # 序列长度比值
Core advantage: Timestamp recency directly models “when did the user interact recently” - a strong signal of conversion intention.
13.3 v3 single model results| Configuration | Description | AUC |
|------|------|-----| | mid2 | num_leaves=20, lr=0.03, depth=5, subsample=0.7 | 0.7175 | | default | num_leaves=15, lr=0.02, depth=4, subsample=0.7 | 0.7144 | | narrow1 | num_leaves=8, lr=0.015, depth=3, subsample=0.8 | 0.7137 | | best_10seeds | mid2 × 10 seeds | 0.7132 | | mid1 | num_leaves=12, lr=0.025, depth=4, subsample=0.75 | 0.7070 | | shallow | num_leaves=8, lr=0.05, depth=3 | 0.7064 | | tiny3 | num_leaves=6, lr=0.03, depth=3 | 0.7043 |
v3 single model vs old single model: 0.7175 vs 0.6738 → +0.044, only by feature engineering.
13.4 Integrated optimization
Forward greedy selection of models (gradually add models that increase AUC the most)
| Steps | Add Model | Integrate AUC | Increment |
|---|---|---|---|
| 1 | +lgb_mid2 | 0.7175 | — |
| 2 | +lgb_old | 0.7271 | +0.0096 |
| 3 | +dien_old | 0.7436 | +0.0165 |
| 4 | +lgb_narrow1 | 0.7470 | +0.0034 |
| 5 | +lgb_v2 | 0.7515 | +0.0045 |
| 6 | +lgb_mid1 | 0.7517 | +0.0002 |
Stop condition: Step 6 only improves by +0.0002, diminishing returns. The ensemble of 6 models outperformed all 19 models (weak models held back).### 13.5 Final Result
| Indicators | Values |
|---|---|
| AUC | 0.7517 |
| Bootstrap 95% CI | [0.6984, 0.8098] |
| Bootstrap p-value | < 0.0001 |
| Compared to old baseline | +0.0778 (0.6738 → 0.7517) |
| Ensemble method | 6 model equal weight rank average |
Statistical significance: Bootstrap 100 resampling p-value < 0.0001, CI lower bound 0.698 is much larger than 0.5 (random), the improvement does come from feature quality rather than random fluctuations.
13.6 Other optimizations tried (ineffective)
| Optimization strategy | Results | Evaluation |
|---|---|---|
| CatBoost (5 seeds) | AUC=0.6110 | ❌ Worse than LGB |
| XGBoost (5 seeds) | AUC=0.6484 | ❌ Worse than LGB |
| Weight optimization integration | AUC=0.7586 | ⚠️ Small data weight optimization overfitting risk |
| MF averaging (power mean) | Optimal p=1.0 | Equivalent to equally weighted rank averaging |
| All 19 model integration | AUC=0.7064 | ❌ Weak models hold back |
14. Final commit
Final result
| Indicators | Values |
|---|---|
| AUC | 0.7517 |
| Bootstrap 95% CI | [0.6984, 0.8098] |
| p-value (vs baseline) | < 0.0001 |
| Improvement | +0.0778 (11.5% relative improvement) |
| Ensemble method | 6 model equal weight rank average |
Submission strategy
Use 6-model equal-weighted rank averaging without weight optimization (to prevent overfitting weights on 1000 samples).
---> Project code: TAAC2026
Data: HuggingFace TAAC2026/data_sample_1000 Complete data: algo.qq.com (requires registration and login)
Honest AUC: 0.7517 (Bootstrap 95% CI: [0.698, 0.810], p<0.0001) (1000 samples, theoretical upper limit ~0.75; complete data expected AUC 0.85%+)