Running EXAONE on iPhone with CoreML-based Inference

Deployment context

Deploying a small language model (sLM) on a mobile device becomes practical only when the model, runtime, and target device are considered together. Production deployment requires memory-aware graph changes, precision mapping, and hardware-specific execution planning.

Model → Adaptation → Quantization → Optimization → Device

Engineering constraints

Static shapes reduce runtime ambiguity on constrained accelerators.
Memory layout and KV cache behavior determine whether inference can be sustained over long sessions.
Mixed precision must be calibrated against task accuracy and device throughput.
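
One common way to obtain static shapes is to pad every prompt to one of a few fixed lengths, so that the accelerator only ever sees a small set of precompiled input shapes. The sketch below illustrates that idea in plain Python. The bucket lengths, the padding token id, and the helper name `pad_to_bucket` are assumptions made for illustration, not values taken from the EXAONE deployment.

```python
from bisect import bisect_left

# Hypothetical bucket lengths; a real deployment would choose these by profiling.
BUCKETS = (64, 128, 256, 512)
PAD_ID = 0  # assumed padding token id


def pad_to_bucket(token_ids, buckets=BUCKETS, pad_id=PAD_ID):
    """Pad token_ids to the smallest bucket that fits; return (ids, mask)."""
    n = len(token_ids)
    idx = bisect_left(buckets, n)  # index of the first bucket with length >= n
    if idx == len(buckets):
        raise ValueError(f"{n} tokens exceed the largest bucket ({buckets[-1]})")
    padding = buckets[idx] - n
    ids = list(token_ids) + [pad_id] * padding
    mask = [1] * n + [0] * padding  # 1 marks real tokens, 0 marks padding
    return ids, mask


ids, mask = pad_to_bucket(list(range(1, 101)))  # a 100-token prompt
print(len(ids), sum(mask))  # prints "128 100"
```

Fewer buckets mean fewer compiled graph variants to keep in memory, while more buckets waste less compute on padding, so the bucket set is itself a tuning parameter.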

Constraint	Why it matters	Optimization path
Memory	Limits context length and batch strategy	Cache adaptation and layout planning
Latency	Controls product usability	Operation fusion and accelerator scheduling
Power	Determines how long inference can be sustained	NPU-first execution and precision mapping
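
The link between memory and context length in the table can be made concrete with back-of-the-envelope KV cache arithmetic: each token stores one key and one value vector per layer and per KV head. The following is a minimal sketch under stated assumptions. The `DecoderShape` dimensions and the 256 MiB budget are illustrative placeholders, not EXAONE's published configuration or a measured iPhone limit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DecoderShape:
    """Illustrative decoder dimensions (assumed, not EXAONE's actual config)."""
    num_layers: int
    num_kv_heads: int
    head_dim: int


def kv_cache_bytes(shape, seq_len, bytes_per_elem=2, batch=1):
    """Size of the KV cache in bytes for seq_len cached tokens."""
    # The factor 2 covers the separate key and value tensors in every layer.
    return (2 * shape.num_layers * shape.num_kv_heads * shape.head_dim
            * seq_len * bytes_per_elem * batch)


def max_context(shape, budget_bytes, bytes_per_elem=2, batch=1):
    """Largest number of cached tokens that fits in budget_bytes."""
    per_token = kv_cache_bytes(shape, 1, bytes_per_elem, batch)
    return budget_bytes // per_token


shape = DecoderShape(num_layers=32, num_kv_heads=8, head_dim=128)
print(kv_cache_bytes(shape, 4096) / 2**20)  # 512.0 (MiB at 2 bytes per element)
print(max_context(shape, 256 * 2**20))      # 2048 tokens within a 256 MiB budget
```

Under these assumed dimensions, halving `bytes_per_elem` (for example, an 8-bit cache instead of fp16) doubles the context that fits in the same budget, which is why cache precision and layout appear in the optimization path for memory.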

Related articles

What Makes On-device AI Hard?

CPU vs GPU vs NPU: Why Dedicated AI Acceleration Matters