
Large-Scale Text Corpus Deduplication and Dataset Enhancement

  • Developed and deployed text corpus deduplication using a suffix array algorithm on MapReduce, boosting assessor F1-score from 0.77 to 0.82.
  • Trained a classifier to improve benchmark coverage, enhancing dataset relevance for pretraining tasks.
  • Created a dataset augmentation pipeline based on back-translation, increasing pretraining robustness and generalization.
  • Enhanced document parsing quality, resulting in 3% faster model convergence and improved resource efficiency.
Andrey worked on this case as the ML engineer at Yandex GPT.
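
As an illustration of the suffix-array idea behind the deduplication step, here is a minimal single-machine sketch. It is not the production pipeline: the MapReduce distribution is left out, the suffix array is built naively by sorting suffixes (fine for toy inputs only), and all function names and the `min_len` threshold are hypothetical. It assumes documents never contain the NUL character used as a separator. Documents that share a long substring end up next to each other in the sorted suffix order, so comparing neighbouring suffixes is enough to surface near-duplicate candidates.

```python
import bisect


def build_suffix_array(text: str) -> list[int]:
    # Naive construction: sort suffix start offsets by the suffix text itself.
    return sorted(range(len(text)), key=lambda i: text[i:])


def common_prefix_len(text: str, i: int, j: int) -> int:
    # Length of the longest common prefix of the suffixes starting at i and j.
    n, k = len(text), 0
    while i + k < n and j + k < n and text[i + k] == text[j + k]:
        k += 1
    return k


def shared_substring_pairs(docs: list[str], min_len: int) -> list[tuple[int, int]]:
    """Return pairs of document indices whose neighbouring suffixes share
    at least `min_len` characters (hypothetical helper for illustration)."""
    sep = "\x00"  # assumed to be absent from every document
    text = sep.join(docs)

    # starts[d] is the offset of document d inside the concatenated text.
    starts, pos = [], 0
    for doc in docs:
        starts.append(pos)
        pos += len(doc) + 1

    def doc_of(offset: int) -> int:
        return bisect.bisect_right(starts, offset) - 1

    sa = build_suffix_array(text)
    pairs = set()
    for a, b in zip(sa, sa[1:]):
        da, db = doc_of(a), doc_of(b)
        if da == db:
            continue
        # Cap the match so it never runs across a document boundary.
        limit = min(starts[da] + len(docs[da]) - a,
                    starts[db] + len(docs[db]) - b)
        k = min(common_prefix_len(text, a, b), limit)
        if k >= min_len:
            pairs.add((min(da, db), max(da, db)))
    return sorted(pairs)


docs = [
    "the quick brown fox jumps",
    "a quick brown fox sleeps",
    "totally different",
]
pairs = shared_substring_pairs(docs, min_len=10)
```

Because only neighbouring suffixes are compared, three or more documents sharing the same passage are reported as a chain of pairs rather than every pair; grouping them (for example with union-find) and choosing which copy to keep would be a separate step.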
ML engineer
Natural Language Processing
Data Processing
Artificial Intelligence
Global
Research and Development
Data Pipeline
Python
Pandas
NumPy
MapReduce
Git
Internal product

Andrey

Machine Learning Engineer at Yandex Zen
