Rise of Open LLMs

The recent rise of open-source large language models (LLMs) is a welcome development for the field. Open LLMs have been around for some time now, but with Meta's announcement of Llama 2 there has been a lot of enthusiasm in the developer community. Llama 2 comes in three parameter sizes, 7B, 13B, and 70B, with a context length of 4096 tokens. While the largest model is competitive across benchmarks, the 7B model significantly outperforms other similarly sized open-source LLMs. A nice read on the Llama 2 models can be found here.

Quantizing the 7B model reduces the compute and memory requirements for running the LLM even further, enabling CPU-based usage of Llama 2.

What is quantization?

Quantization reduces the numerical precision of a model's weights, for example from 16-bit floating point down to 8-bit integers, shrinking the model's size and memory footprint at a small cost in accuracy. Open-source quantized LLMs that can run on CPUs relieve the burden of GPU compute demands and allow for more experimentation on commodity hardware. Small, quantized models certainly will not perform at the level of full-precision LLMs, let alone SOTA models such as OpenAI's GPT.

While OpenAI's GPT has become immensely popular and has seen its use skyrocket in myriad applications, developers are still constrained by API endpoint deployments, which can often run into context length limits or rate-limit throttling. These problems are not resolved easily, especially rate-limit throttling, and the level of support for small teams or individual innovators seems lacking. Moreover, data privacy and diffusion continue to be a factor: it remains unclear how much private data stays private when using closed-source LLMs deployed via API endpoints. In light of this, open-source, lightweight LLMs are likely to find a niche of their own for tasks where they are appropriate.
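To make the idea concrete, here is a toy sketch of symmetric 8-bit quantization in NumPy. It is illustrative only: the GGML q8_0 format used later in this post works on the same principle but quantizes weights in small blocks, each with its own scale factor.

```python
import numpy as np

def quantize_q8(weights: np.ndarray):
    """Toy symmetric 8-bit quantization: map float weights to int8
    using a single per-tensor scale (GGML q8_0 uses per-block scales)."""
    scale = np.abs(weights).max() / 127.0   # largest weight maps to +/-127
    q = np.round(weights / scale).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights for use at inference time."""
    return q.astype(np.float32) * scale

# A float32 tensor (4 bytes/value) becomes int8 (1 byte/value): a 4x saving.
w = np.random.randn(4, 4).astype(np.float32)
q, scale = quantize_q8(w)
w_approx = dequantize(q, scale)
print("max rounding error:", np.abs(w - w_approx).max())
```

The rounding error is the accuracy cost mentioned above; in exchange, the quantized weights take a quarter of the memory of their float32 counterparts.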


In this post we will explore a quantized version of the 7B-parameter Llama 2 model. We will try a simple use case of document Q&A using the quantized chat model, llama-2-7b-chat-ggmlv3.q8_0.bin, which is available from Hugging Face as a free download. The document Q&A pipeline was built on a local machine with an Intel quad-core i5 CPU and 16 GB of RAM. The choice of low-end commodity hardware was deliberate, to probe the lower bounds of performance and quality when using Llama 2.
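Before getting to the full Q&A pipeline, here is a minimal sketch of loading the quantized model on CPU, assuming the ctransformers library (one of several libraries that can run GGML-format models). The Hugging Face repository name, generation parameters, and prompt below are illustrative assumptions, not the exact configuration used in this post.

```python
# Minimal sketch: load a GGML-quantized Llama 2 chat model on CPU.
# Assumes `pip install ctransformers`; the repo name below is an
# assumption for illustration -- point model_file at the q8_0 file
# you actually downloaded from Hugging Face.
from ctransformers import AutoModelForCausalLM

llm = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Llama-2-7B-Chat-GGML",
    model_file="llama-2-7b-chat-ggmlv3.q8_0.bin",
    model_type="llama",
    context_length=4096,   # Llama 2's context window
    max_new_tokens=256,
    temperature=0.1,       # low temperature for factual Q&A
)

# Simple prompt in the Llama 2 chat instruction format.
prompt = "[INST] Summarize the key idea of quantization in one sentence. [/INST]"
print(llm(prompt))
```

On hardware like the machine described above, generation is noticeably slower than on a GPU, which is exactly the trade-off this post sets out to examine.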

To be continued…