Large-Scale Text Corpus Deduplication and Dataset Enhancement

Developed and deployed text corpus deduplication using a suffix array algorithm on MapReduce, boosting assessor F1-score from 0.77 to 0.82.
Trained a classifier to improve benchmark coverage, enhancing dataset relevance for pretraining tasks.
Created a dataset augmentation pipeline with Back Translation, increasing pretraining robustness and generalization.
Enhanced document parsing quality, resulting in 3% faster model convergence and improved resource efficiency.
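
The suffix-array idea behind the deduplication bullet can be illustrated on a single machine. The following is a minimal sketch under stated assumptions, not the production MapReduce job: it concatenates the documents, builds a naive suffix array by sorting, and reports document pairs that share a substring of at least min_len characters. The function name find_duplicate_pairs, the separator character, and the threshold are all hypothetical.

```python
from bisect import bisect_right

SEP = "\x00"  # hypothetical separator, assumed absent from every document


def find_duplicate_pairs(docs, min_len=20):
    """Return pairs of document indices sharing a substring of >= min_len chars."""
    text = SEP.join(docs)

    # starts[k] is the offset of document k inside the concatenated text.
    starts, offset = [], 0
    for doc in docs:
        starts.append(offset)
        offset += len(doc) + 1  # +1 for the separator

    # Naive suffix array: sort suffix start positions by the suffix itself.
    # Fine for a toy example; real corpora need a linear-time construction.
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])

    pairs = set()
    # Suffixes sharing a long prefix are contiguous in the sorted order,
    # so comparing neighbours is enough to detect each shared run.
    for a, b in zip(suffix_array, suffix_array[1:]):
        n = 0
        while (n < min_len
               and a + n < len(text) and b + n < len(text)
               and text[a + n] == text[b + n]
               and text[a + n] != SEP):  # never match across documents
            n += 1
        if n >= min_len:
            doc_a = bisect_right(starts, a) - 1
            doc_b = bisect_right(starts, b) - 1
            if doc_a != doc_b:
                pairs.add((min(doc_a, doc_b), max(doc_a, doc_b)))
    return pairs


docs = [
    "the quick brown fox jumps over the lazy dog",
    "a quick brown fox jumps high",
    "totally unrelated sentence",
]
print(find_duplicate_pairs(docs, min_len=15))
```

Only neighbouring suffixes are compared, so for groups of three or more near-duplicates the reported pairs form a chain rather than every combination; merging them with a union-find structure would give full duplicate groups.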

Andrey worked on this case as an ML engineer at Yandex GPT.

ML engineer

Natural Language Processing

Large Language Models (LLM)

Benchmark Coverage Optimisation

Ranking Optimisation

Back Translation

Global

Research and Development

LLM Pretraining Pipeline

Data Pipeline

Python

Pandas

NumPy

MapReduce

TensorBoard

Git

Internal product

Andrey

Machine Learning Engineer at Yandex Zen

Andrey's cases

Natural Language Processing

Entertainment

Global

Enterprise

News & Content Platform

Web App

Python

Multi-Document Summarization and Ranking Optimisation

Engineered a state-of-the-art multi-document summarization model for news aggregation, achieving an 87% human acceptance rate with a critical-error rate below 1%.
Designed automated news timelines for evolving events, resulting in a 2.8% increase in content depth and a 1.4% boost in user time spent.
Optimized ranking algorithms with a CTR prediction model, increasing daily active users by 4.3% and user engagement by 2.1%.
Created a novel similarity scoring formula, improving F1-score from 0.91 to 0.95 for news clustering.
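
The source does not say how the clustering F1-score was measured. One common choice is pairwise F1, where every pair of items counts as a positive if both items share a cluster. The sketch below assumes that variant; the function name pairwise_f1 and the toy labels are hypothetical.

```python
from itertools import combinations


def pairwise_f1(predicted, gold):
    """Pairwise F1 between two clusterings given as {item: cluster_label} dicts."""
    true_pos = false_pos = false_neg = 0
    for a, b in combinations(sorted(gold), 2):
        same_pred = predicted[a] == predicted[b]
        same_gold = gold[a] == gold[b]
        if same_pred and same_gold:
            true_pos += 1      # pair correctly placed together
        elif same_pred:
            false_pos += 1     # pair merged but should be apart
        elif same_gold:
            false_neg += 1     # pair split but should be together
    if true_pos == 0:
        return 0.0
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)


# Toy example: four news items, reference clusters vs. predicted clusters.
gold = {"n1": "A", "n2": "A", "n3": "A", "n4": "B"}
predicted = {"n1": "x", "n2": "x", "n3": "y", "n4": "y"}
print(pairwise_f1(predicted, gold))
```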

ML engineer, Yandex Zen

Language Translation Systems

Telecom

Global

Research and Development

C to Eolang Compiler

Compiler

C++

C to Eolang Compiler Development

Implemented processing of all basic data types from Clang AST to equivalent constructions in Eolang.
Developed a mechanism for translating multidimensional arrays to Eolang.
Developed a mechanism for translating Enum types to Eolang.
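
C stores multidimensional arrays in row-major order, so one plausible way to lower them is to flatten each access into a single linear offset. The source does not describe the actual translation mechanism, so the sketch below only illustrates that index arithmetic; row_major_offset is a hypothetical helper and says nothing about the Eolang constructs the compiler emits.

```python
def row_major_offset(dims, index):
    """Linear offset of `index` in a row-major array with extents `dims`."""
    if len(dims) != len(index):
        raise ValueError("index rank must match array rank")
    offset = 0
    for size, i in zip(dims, index):
        if not 0 <= i < size:
            raise IndexError("index out of bounds")
        # Horner-style accumulation: each step shifts by the current extent.
        offset = offset * size + i
    return offset


# For `int a[2][3][4]`, element a[1][2][3] is the last of the 24 elements.
print(row_major_offset((2, 3, 4), (1, 2, 3)))
```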

Product Search and Logistics Automation

Developed a search robot for products, increasing cold client conversions by 4%.
Created a reporting system that lets clients monitor item positions.
Developed an advanced route-planning algorithm for courier logistics, increasing daily pickup points by 13%.
Added invoice generation for product stock, improving the speed and accuracy of the product acceptance process by 15%.
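
The route-planning algorithm itself is not described in the source. As a point of reference only, the sketch below shows a nearest-neighbour heuristic, a simple baseline that such planners are often compared against. The function names, the straight-line distance metric, and the coordinates are all assumptions.

```python
import math


def nearest_neighbor_route(depot, points):
    """Greedy route: from the current stop, always visit the closest unvisited point."""
    remaining = list(points)
    route = []
    current = depot
    while remaining:
        nxt = min(remaining, key=lambda p: math.dist(current, p))
        remaining.remove(nxt)
        route.append(nxt)
        current = nxt
    return route


def route_length(depot, route):
    """Total straight-line distance travelled from the depot along the route."""
    total, current = 0.0, depot
    for point in route:
        total += math.dist(current, point)
        current = point
    return total


depot = (0.0, 0.0)
pickups = [(5.0, 0.0), (1.0, 0.0), (2.0, 0.0)]  # hypothetical pickup points
route = nearest_neighbor_route(depot, pickups)
print(route, route_length(depot, route))
```

The greedy choice is fast but not optimal in general; a production planner would typically add time windows, capacity limits, and a local-search improvement step.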

ML engineer, WBprod

Similar cases

Language Translation Systems

Telecom

Global

Research and Development

C to Eolang Compiler

Compiler

C++

C to Eolang Compiler Development

Implemented processing of all basic data types from Clang AST to equivalent constructions in Eolang.
Developed a mechanism for translating multidimensional arrays to Eolang.
Developed a mechanism for translating Enum types to Eolang.

ML engineer, HUAWEI

Image Classification

Global

Research and Development

Image Classification System

Python

GANs

Adapting GANs for Image Classification

Used the StarGAN architecture to classify images across multiple domains.
Achieved an F1-score of 0.83, precision of 0.9917, and recall of 0.8051 on the CelebA dataset.
Trained the model for 500,000 iterations using PyTorch and the Adam optimizer.
Applied adversarial loss, classification loss, and gradient penalty for balanced training.
Tuned hyperparameters to keep training stable.
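
CelebA is annotated with multiple binary attributes, and the source does not state how the reported scores were averaged across them. When scores are macro-averaged, the averaged F1 is generally not the harmonic mean of the averaged precision and recall. The sketch below illustrates this with made-up per-attribute counts; the helper names and every number in it are hypothetical.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1 for one attribute from its confusion counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def macro_scores(counts):
    """Average precision, recall, and F1 over attributes (macro averaging)."""
    scores = [precision_recall_f1(*c) for c in counts]
    n = len(scores)
    return tuple(sum(s[k] for s in scores) / n for k in range(3))


# Made-up (tp, fp, fn) counts for two attributes, for illustration only.
counts = [(90, 10, 10), (10, 0, 40)]
macro_p, macro_r, macro_f1 = macro_scores(counts)
harmonic = 2 * macro_p * macro_r / (macro_p + macro_r)

print("macro precision, recall, F1:", macro_p, macro_r, macro_f1)
print("harmonic mean of macro precision and recall:", harmonic)
```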

AI Developer, Freelance

Customer development

Worldwide

MVP

Neural Networks for Computer Vision and NLP

Website

Python

Applying ML to facial pore recognition for a beauty application.

As a computer vision expert, I was consulted to address the challenge of facial pore recognition for a beauty application. The problem involved accurately identifying and measuring individual pores on various skin types and tones. To tackle this, I designed an algorithm that could adaptively capture high-resolution images and extract relevant features. Leveraging machine learning, I trained a model on a comprehensive dataset that I helped gather, comprising different face images under various lighting conditions. The resulting algorithm could recognize and measure individual pores with high precision, providing personalized skincare recommendations to users, and thereby enhancing the app's user experience.

Computer Vision Engineer