Apple has been making significant strides in the field of generative AI. The company’s unique position, with full control over its hardware and software stack, allows it to optimize generative models for on-device inference. Apple has released several research papers detailing its progress in this area. One such paper, “LLM in a flash,” describes a technique for running large language models (LLMs) on memory-constrained devices. The technique keeps model weights in flash memory and loads them into DRAM on demand, cutting DRAM requirements while keeping inference latency low.
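To make the idea concrete, here is a minimal Python sketch of the basic pattern the paper builds on: storing most layer weights in flash and caching only a few layers in DRAM at a time. The names used here (FlashWeightCache, layer_{id}.npy) are illustrative placeholders, not Apple's implementation, which layers further optimizations on top of this simple on-demand loading scheme.

```python
import numpy as np
from collections import OrderedDict

class FlashWeightCache:
    """Keeps only a few layers' weights in DRAM; the rest stay in flash."""

    def __init__(self, max_layers_in_dram: int):
        self.max_layers = max_layers_in_dram
        self.cache = OrderedDict()  # layer_id -> weight matrix held in DRAM

    def load_layer_from_flash(self, layer_id: int) -> np.ndarray:
        # Stand-in for a real flash read; assumes each layer's weights were
        # saved ahead of time as layer_{id}.npy (a hypothetical layout).
        return np.load(f"layer_{layer_id}.npy")

    def get(self, layer_id: int) -> np.ndarray:
        if layer_id in self.cache:
            self.cache.move_to_end(layer_id)    # mark as most recently used
        else:
            if len(self.cache) >= self.max_layers:
                self.cache.popitem(last=False)  # evict least recently used layer
            self.cache[layer_id] = self.load_layer_from_flash(layer_id)
        return self.cache[layer_id]

def forward(x: np.ndarray, num_layers: int, cache: FlashWeightCache) -> np.ndarray:
    # Toy "model": each layer is a matrix multiply whose weights are
    # fetched on demand, so DRAM holds at most `max_layers` layers at once.
    for layer_id in range(num_layers):
        w = cache.get(layer_id)
        x = np.maximum(x @ w, 0.0)
    return x
```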
Apple has also released several open-source generative models. One of these, Ferret, is a multi-modal LLM that comes in two sizes: 7 billion and 13 billion parameters. Built on top of Vicuna, an open-source LLM, and LLaVA, a vision-language model (VLM), Ferret can understand references to specific regions of an image and ground its responses in them.
These on-device inference optimizations are crucial as more developers explore building apps with small LLMs that can fit on consumer devices. A few hundredths of a second can significantly affect the user experience, and Apple is ensuring that its devices provide the best balance between speed and quality.
Read more: https://venturebeat.com