Data at the core of all the cool ML

Dmitry Kan
2 min read · Feb 5, 2023


We often talk about neural networks, vector search, MLOps, search scalability and cost. But one topic that deserves more attention is where this journey begins: data.

🎙 In this episode (recorded right before Christmas) with Evgeniya Sukhodolskaya (Jenny), Data Advocate at Toloka, you’ll dive into data labeling for Search and ML: from setting up a project and evaluating an annotator’s skill level to interpreting the results and leveraging them in your algorithms. We also spoke about the very important topic of bias in data. I’ve learnt a ton about data labeling by chatting with Jenny! 🤩

Be sure to check out the links for grant support if you are an educator or otherwise in academia!

Research grants and educator partnerships:

These are pages leading to them:

💡 Topics:

00:00 Intro
01:25 Jenny’s path from graduating in ML to a Data Advocate role
07:50 What goes into the labeling process with Toloka
11:27 How to prepare data for labeling and design tasks
16:01 Jenny’s take on why Relevancy needs more data in addition to clicks in Search
18:23 Dmitry plays the Devil’s Advocate for a moment
22:41 Implicit signals vs user behavior and offline A/B testing
26:54 Dmitry goes back to advocating for good search practices
27:42 Flower search as a concrete example of labeling for relevancy
39:12 NDCG, ERR as ranking quality metrics
44:27 Cross-annotator agreement, perfect list for NDCG and Aggregations
47:17 On measuring and ensuring the quality of annotators with honeypots
54:48 Deep-dive into aggregations
59:55 Bias in data, SERP, labeling and A/B tests
1:16:10 Is unbiased data attainable?
1:23:20 Announcements



Dmitry Kan

Founder and host of Vector Podcast, tech team lead, software engineer, manager, but also: cat lover and cyclist. Host: