Running Llama 2 and other Open-Source LLMs on CPU Inference Locally for Document Q&A
Updated Nov 6, 2023 - Python
Runs LLaMA at extremely high speed
RAM Coffers: Conditional Memory via NUMA-Distributed Weight Banking - O(1) lookup routing for LLM inference (Dec 16, 2025 - predates DeepSeek Engram by 27 days)
LLM inference in Fortran
AltiVec/VSX optimized llama.cpp for IBM POWER8
Krasis is a hybrid LLM runtime focused on efficiently running larger models on consumer-grade, VRAM-limited hardware
Speaker diarization for Python — "who spoke when?" CPU-only, no API keys, Apache 2.0. ~10.8% DER on VoxConverse, 8x faster than real-time.
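The DER (diarization error rate) figure quoted above is the fraction of reference speech time that is misattributed: missed speech, false-alarm speech, and speaker confusion, divided by total reference speech. A minimal illustrative sketch over frame-level labels (not the library's own scorer, which works on timed segments with an optimal speaker mapping, e.g. NIST md-eval):

```python
# Hypothetical frame-level DER sketch: labels are per-frame speaker IDs,
# with None meaning silence. Real scorers operate on time segments and
# search for the best reference-to-hypothesis speaker mapping first.

def der(reference, hypothesis):
    """DER = (miss + false alarm + confusion) / total reference speech."""
    assert len(reference) == len(hypothesis)
    speech = sum(1 for r in reference if r is not None)   # reference speech frames
    miss = sum(1 for r, h in zip(reference, hypothesis)
               if r is not None and h is None)            # speech scored as silence
    false_alarm = sum(1 for r, h in zip(reference, hypothesis)
                      if r is None and h is not None)     # silence scored as speech
    confusion = sum(1 for r, h in zip(reference, hypothesis)
                    if r is not None and h is not None and r != h)  # wrong speaker
    return (miss + false_alarm + confusion) / speech

ref = ["A", "A", "B", "B", None, "B"]
hyp = ["A", "B", "B", "B", "B", None]
print(der(ref, hyp))  # 0.6: one miss, one false alarm, one confusion over 5 speech frames
```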
Running Mixture of Agents on CPU: LFM2.5 Brain (1.2B) + Falcon-R Reasoner (600M) + Tool Caller (90M). CPU-only, 16GB RAM. Lightweight AI Legion.
A GPU defined in software. Runs Llama 3.2 1B at 3.6 tok/sec. Zero dependencies.
The bare metal in my basement
Pure C inference engine for Qwen3-TTS text-to-speech. No Python, no PyTorch — just C and BLAS. Supports 0.6B and 1.7B models, 9 voices, 10 languages.
eLLM infers LLMs on CPUs in real time
Portable LLM: a Rust library for LLM inference
A wrapper that simplifies using Llama 2 GGUF quantized models.
Non-bijunctive attention collapse for LLM inference — POWER8 hardware AES (vcipher) + AltiVec vec_perm. Hebbian path selection, cross-head diffusion, O(1) KV prefiltering.
V-lang API wrapper for llm-inference chatllm.cpp
Privacy-focused RAG chatbot for network documentation. Chat with your PDFs locally using Ollama, Chroma & LangChain. CPU-only, fully offline.
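RAG chatbots like the one above follow a retrieve-then-generate pattern: embed the document chunks, pick the chunks nearest the question, and pass them to the local model as context. A dependency-free sketch of the retrieval step, with toy bag-of-words vectors standing in for the real embeddings and vector store (Ollama/Chroma in the repo above):

```python
import math
from collections import Counter

# Toy retrieval step of a local RAG pipeline. Real pipelines use learned
# embeddings and a vector store; a bag-of-words Counter stands in here so
# the example stays self-contained and runnable.

def embed(text):
    """Hypothetical stand-in embedding: word-count vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(question, chunks, k=1):
    """Return the k chunks most similar to the question."""
    q = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]

chunks = [
    "VLANs segment a switch into isolated broadcast domains.",
    "OSPF floods link-state advertisements to build its topology.",
    "DHCP leases IP addresses to hosts automatically.",
]
print(retrieve("how does ospf build a topology", chunks))  # picks the OSPF chunk
```

The retrieved chunks would then be concatenated into the prompt sent to the local model, which is what keeps the whole pipeline offline.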
Simple large language model playground app
VB.NET API wrapper for llm-inference chatllm.cpp
C# API wrapper for llm-inference chatllm.cpp