Inference Optimization with Pruna AI
Save the Date!
June 17th, 2025 19:00-20:00 – Munich🥨NLP Discord Server.
About this Event
Scaling has fueled the latest breakthroughs in language, image, and video models. But as model sizes increase, so do the computational and energy costs of running them. Fortunately, we can do something about it.
In this talk, we’ll explore compression techniques such as quantization, compilation, caching, and distillation to optimize model performance during inference. For a hands-on example, we will combine some of these techniques to reduce model size and computational load while maintaining quality, thus making AI more accessible and environmentally sustainable.
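To give a flavor of one of these techniques, the sketch below shows the core idea behind post-training quantization: storing weights as 8-bit integers plus a scale factor instead of 32-bit floats, cutting memory roughly 4x at the cost of small rounding error. This is a minimal illustration with hypothetical helper names, not Pruna's actual implementation, which applies such methods per layer with calibration.

```python
def quantize_int8(weights):
    """Map float weights to int8 values plus a scale factor (symmetric).

    Hypothetical helper for illustration only; real quantization
    toolkits work per tensor/channel with calibration data.
    """
    scale = max(abs(w) for w in weights) / 127 or 1.0
    q = [round(w / scale) for w in weights]  # each value fits in one byte
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the int8 representation."""
    return [v * scale for v in q]

weights = [0.5, -1.27, 0.03, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
# Storage drops from 4 bytes per weight (float32) to 1 byte (int8);
# the reconstruction error is bounded by half the scale step.
```

Combining such a step with compilation, caching, or distillation is the kind of stacking the talk's hands-on example will demonstrate.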
More at Pruna AI
Speakers
Nils Fleischmann is a research engineer at Pruna AI, a German-French startup with the mission to make AI models faster, cheaper, smaller, and greener.
He primarily works on Pruna's optimization engine, which lets developers seamlessly apply and integrate compression algorithms. He also helps customers get the best performance out of their models.