Makora
GPU Kernel & AI Research Engineer
Internship → Full-time · Poland (3 months) → Remote
Jun 2025 - Present · 10 mo
- Achieved a 15x speedup on Flash Attention MLA generation over the CK fallback for Kimi, DeepSeek V3, and a custom 8B MLA model, optimized with HIP and ISA-level analysis on AMD MI300X/MI355X
- Built an autonomous kernel-generation agent that produces optimized HIP, CUDA, and Triton kernels through iterative profiling-and-rewrite loops
- Optimized FP8 GEMM (10x speedup), MXFP4 (2.4x), GEMV (1.5x), and convolution (10x) kernels in HIP, guided by profiler output and ISA dumps
- Built a custom H100 GEMM agent using inline PTX `wgmma` instructions, matching cuBLAS at the 4096×4096×4096 (M×N×K) baseline and exceeding it on specific shapes
- Shipped CLI endpoints and backend features exposing cleaner data APIs to the web platform's UI
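The profiling-and-rewrite loop behind the kernel-generation agent can be sketched as below; `profile_kernel` and `rewrite_kernel` are hypothetical stand-ins for the real profiler and kernel-rewriting steps, and the stub implementations exist only to keep the sketch runnable:

```python
# Minimal sketch of an iterative kernel-optimization loop: profile a
# candidate kernel, keep the best variant, and stop once a rewrite no
# longer improves the measured runtime. The two helpers below are
# hypothetical stand-ins, not the agent's actual profiler/rewriter.

def profile_kernel(src: str) -> float:
    """Stub profiler: returns runtime in ms (lower is better).
    Faked here by counting an optimization marker in the source."""
    return 10.0 - src.count("unrolled")

def rewrite_kernel(src: str) -> str:
    """Stub rewrite step: emits the next candidate kernel variant."""
    return src + " unrolled"

def optimize(src: str, max_iters: int = 5) -> tuple[str, float]:
    """Keep the fastest variant seen; stop when no rewrite improves it."""
    best_src, best_ms = src, profile_kernel(src)
    for _ in range(max_iters):
        candidate = rewrite_kernel(best_src)
        ms = profile_kernel(candidate)
        if ms >= best_ms:  # no measured improvement: terminate the loop
            break
        best_src, best_ms = candidate, ms
    return best_src, best_ms
```

In practice the profile step would shell out to a GPU profiler and the rewrite step would be model-driven, but the accept-only-if-faster control flow is the same.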
CUDA · HIP · ROCm · GPU Kernels · AMD · NVIDIA · H100 · MI300X · MI355X · PTX · Triton · Flash Attention · FP8 · PyTorch · AI Agents