Experience

2 positions

Makora

GPU Kernel & AI Research Engineer

Internship → Full-time Poland (3 months) → Remote
Jun 2025 - Present · 10 mo
  • Achieved 15x speedup over the CK fallback on Flash Attention MLA generation for Kimi, DeepSeek V3, and a custom 8B MLA model, using HIP and ISA-level analysis on AMD MI300X/MI355X
  • Built autonomous kernel generation agent producing optimized HIP, CUDA, and Triton kernels with iterative profiling and rewriting loops
  • Optimized GEMM FP8 (10x speedup), MXFP4 (2.4x speedup), GEMV (1.5x speedup), and Conv (10x speedup) kernels in HIP, guided by profiler output and ISA dumps
  • Built custom H100 GEMM agent using inline wgmma PTX, matching cuBLAS at the 4096×4096×4096 (M×N×K) baseline and exceeding it on specific shapes
  • Shipped CLI endpoints and backend features exposing cleaner data APIs to the web platform UI
CUDA · HIP · ROCm · GPU Kernels · AMD · NVIDIA · H100 · MI300X · MI355X · PTX · Triton · Flash Attention · FP8 · PyTorch · AI Agents

Autonomous Systems

ML Engineer

Full-time Romania
Sep 2024 - Present · 1 yr 7 mo
  • Researched and trained custom YOLO backbones and heads for Coral TPU deployment via INT8 quantization; achieved 30 mAP on VisDrone with a 2.1M-parameter model optimized for 1024×1024 input
  • Built vehicle detection system for Romanian forest monitoring that detects vehicles, payload, and quantity; deployed on Kubernetes and currently in production
  • Designed data pipeline with Kafka (1M+ requests/day), Redis, MinIO, and PostgreSQL, with Grafana observability
  • Built multiple custom datasets ranging from 2K to 150K images (largest for Re-ID) using synthetic generation via diffusion models and manual annotation; leveraged knowledge distillation for model compression
  • Built custom ML observability and data curation platform for experiment tracking and dataset management
  • Researched and developed Re-ID framework supporting multi-dataset training, custom losses, configurable backbones, and automated overnight sweeps
Computer Vision · YOLO · Coral TPU · Kubernetes · Kafka · Redis · PyTorch · MLOps · Re-ID · Diffusion · PostgreSQL · Grafana