Running Llama 2 and other Open-Source LLMs on CPU Inference Locally for Document Q&A
Updated Nov 6, 2023 - Python
Krasis is a hybrid LLM runtime focused on efficiently running larger models on consumer-grade, VRAM-limited hardware.
Runs LLaMA at extremely high speed.
Face verification in the browser. 74 KB WebAssembly. No server, no cloud, no dependencies. Also runs natively in 3 ms on CPU.
LLM inference in Fortran
Speaker diarization for Python — "who spoke when?" CPU-only, no API keys, Apache 2.0. ~10.8% DER on VoxConverse, 8x faster than real-time.
Pure C inference engine for Qwen3-TTS text-to-speech. No Python, no PyTorch — just C and BLAS. Supports 0.6B and 1.7B models, 9 voices, 10 languages.
eLLM runs LLM inference on CPUs faster than on GPUs.
Running Mixture of Agents on CPU: LFM2.5 Brain (1.2B) + Falcon-R Reasoner (600M) + Tool Caller (90M). CPU-only, 16GB RAM. Lightweight AI Legion.
A GPU defined in software. Runs Llama 3.2 1B at 3.6 tok/sec. Zero dependencies.
The bare metal in my basement
Non-bijunctive attention collapse for LLM inference — POWER8 hardware AES (vcipher) + AltiVec vec_perm. Hebbian path selection, cross-head diffusion, O(1) KV prefiltering.
Minimal, zero-dependency LLM inference in pure C11. CPU-first with NEON/AVX2 SIMD. Flash MoE (pread + LRU expert cache). TurboQuant 3-bit KV compression (8.9x less memory per session). 20+ GGUF quant formats. Compiles to WASM.
Portable LLM - A rust library for LLM inference
A V-lang API wrapper for LLM inference with chatllm.cpp.
A wrapper for simplified use of Llama 2 GGUF quantized models.
A FastAPI server for querying Google's Gemma Translate AI models for translations
Lightning-fast RAG for AI agents. ONNX-powered, 4-layer fusion, MCP server. No PyTorch.
PlantAi is a ResNet-based CNN model trained on the PlantVillage dataset to classify plant leaf images as healthy or diseased. This repository includes PyTorch training code, tools to convert the model to TensorFlow Lite (TFLite) for deployment, and an Android app integrating the model for real-time leaf disease detection from camera images.
Wheels & Docker images for running vLLM on CPU-only systems, optimized for different CPU instruction sets
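Several of the projects above pair CPU-side retrieval with a local LLM for document Q&A: documents are chunked, the chunks most relevant to a question are retrieved, and only those are passed to the model as context. A minimal, dependency-free sketch of the retrieval step is below — the chunking size, scoring method, and all function names are illustrative assumptions, not taken from any listed repository:

```python
import math
from collections import Counter

def chunk_text(text, size=40):
    """Split a document into chunks of roughly `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def score(query, chunk):
    """Cosine similarity over term-frequency vectors (stdlib only)."""
    q = Counter(query.lower().split())
    c = Counter(chunk.lower().split())
    dot = sum(q[t] * c[t] for t in q)
    norm = (math.sqrt(sum(v * v for v in q.values()))
            * math.sqrt(sum(v * v for v in c.values())))
    return dot / norm if norm else 0.0

def top_chunk(query, docs):
    """Return the best-matching chunk to feed as context to a local LLM."""
    chunks = [ch for d in docs for ch in chunk_text(d)]
    return max(chunks, key=lambda ch: score(query, ch))
```

In a real pipeline the returned chunk would be embedded in a prompt ("Answer using only this context: …") and sent to a CPU-hosted model such as a GGUF-quantized Llama 2; production systems typically replace the term-overlap scorer with dense embeddings.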