Vladmodels Zhenya Y114 Katya Y11767 2021 May 2026
# Clone the repository (v1.2.0 released Jan 2022)
git clone https://github.com/vladmodels/vladmodels.git
cd vladmodels
# Install dependencies (PyTorch 1.11+, TorchVision, transformers)
pip install -r requirements.txt
# Download the checkpoints (≈ 500 MB each)
wget https://dl.vladmodels.org/checkpoints/zhenya_y114.pt
wget https://dl.vladmodels.org/checkpoints/katya_y11767.pt
Published: April 15 2026
Category: Fashion & Modeling
Keywords: VladModels, Zhenya, Katya, y114, y11767, 2021 fashion, Russian models, online modeling platforms
| Property | Value |
|----------|-------|
| Model Size | 114 M parameters (hence the Y114 suffix). |
| Primary Domain | Multilingual OCR & Scene Text Recognition. |
| Training Corpus | 12 TB of scraped public‑domain street‑view imagery (OpenStreetCam, Mapillary) combined with synthetic text renderings (SynthText v3). Multilingual labels cover English, Russian, Chinese, Arabic, and Hindi. |
| Pre‑training | 150 k steps on ImageNet‑21k (pure visual backbone) → 300 k steps on the OCR corpus. |
| Fine‑tuning | Two‑stage curriculum: (1) character‑level classification, (2) sequence‑level CTC loss with language‑model rescoring. |
| Evaluation Benchmarks | ICDAR 2019 Robust Reading: 87.3 % F‑score (vs. 84.1 % for the previous state‑of‑the‑art); MVTec‑AD (text‑only subset): 92.5 % AUC. |
| Inference Profile | ~8 ms per 640 × 640 image on a single A100; can be exported to ONNX for CPU inference (~45 ms). |
| Key Innovations | (1) Dual‑token embedding (visual + glyph embeddings) for better handling of low‑resolution characters; (2) dynamic language‑model gating that switches between per‑script LM heads based on script‑detection confidence. |
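
The table names two mechanisms that are easier to follow with a small example: sequence-level CTC decoding and confidence-based gating between per-script language-model heads. The sketch below is illustrative only and is not taken from the project's code. The function names, the blank index `0`, the `0.6` threshold, the `"generic"` fallback head, and the hard (rather than soft) switch are all assumptions made for the example.

```python
from typing import Dict, List, Optional, Sequence

BLANK = 0  # assumed index of the CTC blank symbol (hypothetical)


def ctc_greedy_decode(frame_ids: Sequence[int], blank: int = BLANK) -> List[int]:
    """Greedy CTC decoding: collapse consecutive repeats, then drop blanks."""
    decoded: List[int] = []
    previous: Optional[int] = None
    for label in frame_ids:
        # Emit a label only when it differs from the previous frame
        # and is not the blank symbol.
        if label != previous and label != blank:
            decoded.append(label)
        previous = label
    return decoded


def select_lm_head(
    script_probs: Dict[str, float],
    threshold: float = 0.6,    # assumed confidence cut-off
    fallback: str = "generic",  # assumed name of a script-agnostic head
) -> str:
    """Return the most likely script's LM head if the detector is confident,
    otherwise fall back to a script-agnostic head."""
    if not script_probs:
        return fallback
    script, confidence = max(script_probs.items(), key=lambda kv: kv[1])
    return script if confidence >= threshold else fallback


if __name__ == "__main__":
    # Per-frame argmax labels from a recognizer; 0 is the blank.
    print(ctc_greedy_decode([0, 1, 1, 0, 1, 2, 2, 0]))
    # Script-detector probabilities for one text region.
    print(select_lm_head({"cyrillic": 0.9, "latin": 0.1}))
    print(select_lm_head({"cyrillic": 0.5, "latin": 0.5}))
```

In a full pipeline, the label sequence from `ctc_greedy_decode` (or a beam-search variant) would be rescored by whichever head `select_lm_head` returns. A hard threshold keeps the example short; a weighted mixture of heads would be an equally plausible reading of "gating".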
In the second half of 2021, the open‑source community behind VladModels announced a pair of specialized neural‑network checkpoints that quickly gained traction in niche computer‑vision and language‑generation tasks: Zhenya Y114 and Katya Y11767. Both models were built on the same core architecture (a hybrid Vision‑Transformer + Conformer backbone) but diverged in training data, target domains, and fine‑tuning strategies.
This write‑up summarizes: