Large-Scale Recommendation Systems

State-of-the-art ML R&D solutions and production-grade ML infrastructure that serve billions of users

We build production-grade ML infrastructure and algorithmic AI solutions for large-scale recommendation systems that serve billions of users and generate large volumes of data (e.g. more than 1B requests per day).

To navigate the vast amount of content on the internet, users rely either on search queries or on content recommendations powered by algorithms.

Our RecSys team builds state-of-the-art ML algorithms and pushes the research boundaries of recommender systems, while designing and deploying scalable, production-grade ML infrastructure pipelines that make offline and online ML research and experimentation fast and easy.

The Team

The team consists of exceptional ML engineers, research scientists, and software engineers.


Stavros Theodorakis

Group Lead


Stefanos Angelidis

Team Lead


Aris Konidaris


Giannis Gkinis


Petros Katsileros

Team Lead


Nikiforos Mandilaras


Alexandros Georgogiannis

Selected Projects


Incremental Learning for Large-scale CTR Prediction

An incremental learning framework for large-scale CTR prediction, applied in Taboola's massive-scale recommendation service. Our approach captures emerging trends rapidly by warm-starting from previously deployed models and fine-tuning on "fresh" data only. Past knowledge is maintained via a teacher-student paradigm, in which distillation from the teacher mitigates catastrophic forgetting. The framework enables significantly faster training and deployment cycles (a 12x speedup) and brings a consistent Revenue Per Mille (RPM) lift across multiple traffic segments, as well as a significant CTR increase on newly introduced items.


Exploration vs. Exploitation and Uncertainty Estimation

Building robust online content recommendation systems requires learning complex interactions between user preferences and content features. Despite progress, the dynamic nature of online recommendations still poses great challenges, such as finding the delicate balance between exploration and exploitation. In this project we use uncertainty estimates to drive an optimistic exploration/exploitation strategy that explores new recommendations more efficiently. We propose a novel hybrid deep neural network model, Deep Density Networks (DDN), which integrates content-based deep learning models with a collaborative scheme and is able to robustly model and estimate uncertainty.
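As a rough illustration of optimistic exploration, the sketch below ranks items by predicted CTR plus a bonus proportional to the model's uncertainty. It is a minimal sketch and not the DDN implementation: the function names, the `alpha` trade-off parameter, and the item values are all hypothetical, and the mean and standard deviation are assumed to come from some model that outputs both.

```python
def optimistic_score(mean_ctr, std_ctr, alpha=1.0):
    """Predicted CTR plus an uncertainty bonus (alpha is a hypothetical knob)."""
    return mean_ctr + alpha * std_ctr


def rank_items(predictions, alpha=1.0):
    """Return item ids sorted by optimistic score, highest first.

    predictions maps item id -> (predicted mean CTR, predicted std).
    """
    return sorted(
        predictions,
        key=lambda item: optimistic_score(*predictions[item], alpha=alpha),
        reverse=True,
    )


# Hypothetical model outputs: a well-known item and a new, uncertain one.
predictions = {
    "established_item": (0.050, 0.002),
    "new_item": (0.040, 0.020),
}

greedy = rank_items(predictions, alpha=0.0)      # pure exploitation
optimistic = rank_items(predictions, alpha=1.0)  # uncertainty bonus applied
```

With `alpha=0.0` the established item ranks first; with `alpha=1.0` the new item's larger uncertainty lifts it to the top, so it gets shown and its estimate can be refined.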


Self-Supervised Learning between users and context

Neural recommender systems have achieved tremendous success in recent years. However, they still fall short of expectations when dealing with highly sparse data. Self-supervised learning (SSL), an emerging technique for learning from unlabeled data, has recently drawn considerable attention in many fields, and a growing body of research applies SSL to recommendations to mitigate the data sparsity issue. In this project we use SSL to learn rich representations for both users and context. We adopt a two-tower network architecture, in which the first tower takes user-related features and the second takes context-related features, and train it with a contrastive loss.
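One common form of contrastive loss for a two-tower setup is an in-batch softmax (InfoNCE-style) objective. The sketch below assumes this form; the project's actual loss is not specified here. It also assumes the two towers have already produced embeddings, that row i of each list forms a positive pair, and that all other rows in the batch act as negatives. The function names and the `temperature` value are illustrative.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def l2_normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def info_nce_loss(user_embs, context_embs, temperature=0.1):
    """Mean in-batch contrastive loss over (user_i, context_i) positive pairs."""
    users = [l2_normalize(u) for u in user_embs]
    contexts = [l2_normalize(c) for c in context_embs]
    total = 0.0
    for i, user in enumerate(users):
        # Cosine similarity of this user to every context in the batch.
        logits = [dot(user, c) / temperature for c in contexts]
        # Numerically stable log-sum-exp over the batch.
        max_logit = max(logits)
        log_sum_exp = max_logit + math.log(
            sum(math.exp(l - max_logit) for l in logits)
        )
        # Negative log-probability of the matching context.
        total += log_sum_exp - logits[i]
    return total / len(users)


# Toy embeddings: matched pairs give a low loss, mismatched pairs a high one.
aligned = info_nce_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = info_nce_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

Minimizing this loss pulls each user embedding toward its own context embedding and pushes it away from the other contexts in the batch.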


Mixture of Experts in Recommender Systems

Neural-based learning has been successfully used in many real-world large-scale applications, such as recommendation systems. Mixture-of-Experts (MoE), a type of conditional computation in which parts of the network are activated on a per-example basis, has been proposed as a way of dramatically increasing model capacity without a proportional increase in computation. In this project we combine the two, employing MoE-type deep learning models for CTR and CVR estimation and incorporating prior knowledge to bootstrap the manager's routing to the experts. In addition, a load-balancing scheme serves as regularization to avoid the collapse phenomenon.
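The routing and load-balancing ideas can be sketched as follows. This is a simplified illustration under stated assumptions, not the project's model: it uses top-k softmax gating and, as the balancing regularizer, the squared coefficient of variation of per-expert importance (one well-known choice from the MoE literature). Expert outputs are plain numbers here, and all names are hypothetical.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def moe_predict(gate_logits, expert_outputs, top_k=2):
    """Combine the outputs of the top_k experts selected by the gate."""
    top = sorted(
        range(len(gate_logits)), key=lambda i: gate_logits[i], reverse=True
    )[:top_k]
    # Renormalize the gate over the selected experts only.
    weights = softmax([gate_logits[i] for i in top])
    return sum(w * expert_outputs[i] for w, i in zip(weights, top))


def load_balancing_loss(batch_gate_logits):
    """Squared coefficient of variation of per-expert importance in a batch.

    Importance of an expert is the sum of its gate probabilities over the
    batch; the loss is zero when all experts are used equally.
    """
    n_experts = len(batch_gate_logits[0])
    importance = [0.0] * n_experts
    for logits in batch_gate_logits:
        for i, p in enumerate(softmax(logits)):
            importance[i] += p
    mean = sum(importance) / n_experts
    variance = sum((x - mean) ** 2 for x in importance) / n_experts
    return variance / (mean ** 2)


balanced = load_balancing_loss([[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]])
collapsed = load_balancing_loss([[10.0, 0.0, 0.0], [10.0, 0.0, 0.0]])
```

Adding a term like `load_balancing_loss` to the training objective penalizes batches in which the gate sends nearly all traffic to one expert, which is the collapse the regularization is meant to avoid.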


Knowledge Distillation for Transfer Learning

Recent KD-based methods are developed not only for model compression but also to produce more accurate models, and they are mainly applied to public datasets, with state-of-the-art results. In this project we propose "xt-KD", a novel framework for both cross-data and temporal Knowledge Distillation, which was applied in large-scale production systems to address the major challenges of serving billions of requests per day. xt-KD is a transfer-learning-inspired approach that builds on Knowledge Distillation (KD), extending and reformulating it efficiently and adapting it to huge data volumes, so as to leverage both accumulated past knowledge and diverse data partitions (hence "cross-data"). With xt-KD we transfer knowledge between different cohorts of our data corpus, reuse and distill knowledge from previously deployed models into new ones, and evaluate several scenarios and ablations of cross-data and temporal knowledge-transfer setups.

An Incremental Learning framework for Large-scale CTR Prediction

In this talk we present an incremental learning framework for Click-Through-Rate (CTR) prediction and demonstrate its effectiveness for Taboola’s massive-scale recommendation service. As a leading content recommendation service, Taboola serves more than a billion requests to millions of unique users each day. Under this industrial setting, hundreds of CTR prediction models are deployed daily (on distinct segments of incoming traffic). Each model is trained from scratch for several hours on historical user impressions (collected over the previous two weeks). Thus, long training times introduce challenges with regard to (1) model freshness as new trends constantly emerge (that are not captured by historical data) and (2) the scale of required computational resources.


To address these challenges, we introduce Taboola’s incremental learning framework for CTR prediction. Instead of training each model from scratch on historical user impressions, we opt for initialising from past-deployed models through a warm-start step. New models are trained on fresh data only, which comprises only a small fraction of the original dataset. During training, we also employ a teacher-student paradigm, where the teacher (trained daily) acts as an implicit regularizer, enabling the student to maintain previously acquired knowledge.
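The warm-start and teacher-student steps described above can be sketched with a toy logistic-regression CTR model. This is a minimal sketch under assumptions, not Taboola's implementation: the blending weight `alpha`, the learning rate, the plain SGD loop, and all names are hypothetical. The loss is assumed to be a weighted sum of cross-entropy against the click label and cross-entropy against the teacher's prediction, which is equivalent to training on a blended target.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, features):
    return sigmoid(sum(w * x for w, x in zip(weights, features)))


def incremental_fit(prev_weights, teacher_weights, fresh_data,
                    alpha=0.5, lr=0.1, epochs=50):
    """Warm-start from prev_weights and fine-tune on fresh data only.

    Each target blends the observed click label with the teacher's
    prediction, so the teacher acts as a regularizer on the student.
    """
    weights = list(prev_weights)  # warm start: copy instead of re-initialising
    for _ in range(epochs):
        for features, label in fresh_data:
            teacher_p = predict(teacher_weights, features)
            target = (1.0 - alpha) * label + alpha * teacher_p
            # Gradient of the blended cross-entropy w.r.t. the logit.
            error = predict(weights, features) - target
            weights = [w - lr * error * x for w, x in zip(weights, features)]
    return weights


# Hypothetical example: two features, a handful of fresh impressions.
previous_model = [0.3, -0.2]
teacher_model = [0.3, -0.2]
fresh_data = [([1.0, 1.0], 1), ([1.0, -1.0], 0)]
student = incremental_fit(previous_model, teacher_model, fresh_data, alpha=0.5)
```

With `alpha=0.0` the student simply fine-tunes on the fresh labels; larger values keep it closer to what the teacher already knows.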


The presented approach drastically reduces the required training time, thus allowing the deployment of "fresher" models with lower computational requirements. We demonstrate a significant CTR increase on newly introduced items and an overall improvement in recommendation performance, measured in terms of Revenue Per Mille (RPM) (a lift of more than 0.5%). Note that given the massive scale of Taboola's recommendation service, even a minor RPM increase can be significant.
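For readers unfamiliar with the metric, RPM is revenue per thousand impressions, and lift is the relative change against a baseline. The short sketch below shows the arithmetic; the revenue and impression figures are made up purely for illustration and are not Taboola data.

```python
def rpm(revenue, impressions):
    """Revenue Per Mille: revenue per 1,000 impressions."""
    return 1000.0 * revenue / impressions


def relative_lift(treatment, baseline):
    """Relative change of treatment over baseline (0.005 means 0.5%)."""
    return (treatment - baseline) / baseline


# Hypothetical numbers for one traffic segment.
baseline_rpm = rpm(revenue=2000.0, impressions=1_000_000)
treatment_rpm = rpm(revenue=2010.0, impressions=1_000_000)
lift = relative_lift(treatment_rpm, baseline_rpm)  # about 0.5%
```

Because the lift is relative, the same 0.5% translates into a larger absolute revenue change the more impressions are served.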


Petros Katsileros

Team Lead


Nikiforos Mandilaras


Dimitris Mallis

Knowledge Distillation: Cross Data and Temporal Knowledge Distillation for improved Click-Through-Rate

In this talk we present a novel transfer learning framework for both cross-dataset and temporal Knowledge Distillation, termed "cross-temporal-KD" (xt-KD for short). Motivated by the challenges of large-scale production systems, it builds upon the Student-Teacher (ST) paradigm to address the main challenges in the development, training, and inference cycle of recommendation systems: the method has to satisfy the performance requirements for fast inference and efficient training.

Our solution exploits supplementary knowledge from diverse datasets and transfers it through the teacher model. In other words, we transfer knowledge from different segments of our data corpus by training the teacher on a dataset that is both larger than and different from the student's, namely a superset of the student's data. Additionally, to take advantage of already trained models, we reuse models trained on previous timestamps as teachers instead of training new models from scratch again and again. With this simple yet elegant methodological setup and the corresponding experimental production pipeline, we are able to transfer knowledge both across time and across different datasets in a model-agnostic way, building upon the ST paradigm.


Petros Katsileros

Team Lead


P. Katsileros, N. Mandilaras, D. Mallis, V. Pitsikalis, S. Theodorakis, G. Chamiel, "An Incremental Learning framework for Large-scale CTR Prediction", RecSys 2022

Y. Zeldes, S. Theodorakis, E. Solodnik, A. Rotman, G. Chamiel, D. Friedman, “Deep density networks and uncertainty in recommender systems”, KDD 2017