
Serving

  • LMCache - Supercharge Your LLM with the Fastest KV Cache Layer
  • llm-d - Kubernetes-Native Distributed LLM Inference with vLLM
  • ollama - Get up and running with Llama 3, Mistral, Gemma, and other large language models
  • sglang - SGLang is a high-performance serving framework for large language models and multimodal models.
  • vllm - A high-throughput and memory-efficient inference and serving engine for LLMs

Guardrails

  • Guardrails - NeMo Guardrails is an open-source toolkit for easily adding programmable guardrails to LLM-based conversational systems.

API Gateways

  • bifrost - Fastest LLM gateway (50x faster than LiteLLM) with adaptive load balancing
  • litellm - Python SDK, Proxy Server (AI Gateway) to call 100+ LLM APIs in OpenAI (or native) format, with cost tracking, guardrails, loadbalancing and logging. [Bedrock, Azure, OpenAI, VertexAI, Cohere, Anthropic, Sagemaker, HuggingFace, VLLM, NVIDIA NIM]

Coding Agents

Serialization

  • toon - 🎒 Token-Oriented Object Notation (TOON) – Compact, human-readable, schema-aware JSON for LLM prompts. Spec, benchmarks, TypeScript SDK.

Utilities

  • TokenCost - Easy token price estimates for 400+ LLMs. A TokenOps project.
  • opencommit - the #1 most feature-rich GPT wrapper for git — generate commit messages with an LLM in 1 sec — works with Claude, GPT, and every other provider; supports local Ollama models too

Models

| Creator | Name                 | Hugging Face | Ollama |
| ------- | -------------------- | ------------ | ------ |
| Alibaba | Qwen3-ASR            | HF           |        |
| Alibaba | Qwen3-VL             |              | Ollama |
| Alibaba | Qwen3.5              | HF           |        |
| BAAI    | bge-m3               | HF           | Ollama |
| Google  | TranslateGemma       | HF           |        |
| SCB 10X | typhoon-ocr-3b       |              | Ollama |
| SCB 10X | typhoon-translate-4b |              | Ollama |

llama-server

bash
~/ggml-org/llama.cpp/build-cuda/bin/llama-server --fim-qwen-7b-default --host 0.0.0.0 --port 8080
bash
~/ggml-org/llama.cpp/build-cuda/bin/llama-server --gpt-oss-120b-default --host 0.0.0.0 --port 8080

lemonade-server

bash
lemonade-server pull user.gemma-3-12b \
  --checkpoint unsloth/gemma-3-12b-it-GGUF:Q4_K_M  \
  --recipe llamacpp
bash
lemonade-server pull user.qwen3-30b \
  --checkpoint unsloth/Qwen3-30B-A3B-Instruct-2507-GGUF:Q4_K_M  \
  --recipe llamacpp
bash
lemonade-server pull user.qwen3-next-80b \
  --checkpoint Qwen/Qwen3-Next-80B-A3B-Instruct-GGUF:Q4_K_M  \
  --recipe llamacpp
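
In the pulls above, the `--checkpoint` value packs two pieces of information into one string: a Hugging Face repo and a quantization variant, separated by a colon. A minimal sketch of splitting that format (the parsing here is illustrative, not lemonade-server's actual code):

```python
# "--checkpoint" values look like "<Hugging Face repo>:<quantization variant>".
checkpoint = "unsloth/Qwen3-30B-A3B-Instruct-2507-GGUF:Q4_K_M"

# Split on the last colon so repo paths containing ":" would still work.
repo, quant = checkpoint.rsplit(":", 1)

print(repo)   # unsloth/Qwen3-30B-A3B-Instruct-2507-GGUF
print(quant)  # Q4_K_M
```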

TTS

Request

Curl

bash
curl -u "username:password" -X POST https://example.com/api/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen3-1.7B-GGUF", "messages": [{"role": "user", "content": "Hello!"}]}'

Basic auth can also be provided as a request header:

bash
echo -n "username:password" | base64
# -> dXNlcm5hbWU6cGFzc3dvcmQ=

curl -X POST https://example.com/v1/chat/completions \
  -H "Authorization: Basic dXNlcm5hbWU6cGFzc3dvcmQ=" \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen3-1.7B-GGUF", "messages": [{"role": "user", "content": "Hello!"}]}'
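
The same header value can be computed programmatically. A minimal Python sketch — this mirrors what curl's `-u` flag and httpx's `auth=` do under the hood:

```python
import base64

username, password = "username", "password"

# Basic auth is "Basic " + base64("<username>:<password>").
token = base64.b64encode(f"{username}:{password}".encode()).decode()
headers = {"Authorization": f"Basic {token}"}

print(headers["Authorization"])  # Basic dXNlcm5hbWU6cGFzc3dvcmQ=
```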

Python

python
import os

import httpx
from openai import OpenAI

http_client = httpx.Client(
    auth=(os.getenv("BASICAUTH_USERNAME"), os.getenv("BASICAUTH_PASSWORD"))
)

client = OpenAI(
    base_url=os.getenv("OPENAI_BASE_URL"),
    api_key="dummy",  # openai sdk requires a dummy API key
    http_client=http_client,
)

completion = client.chat.completions.create(
    model=os.getenv("MODEL_NAME"),
    messages=[
        {"role": "user", "content": "Write a short poem about Python programming."}
    ],
)

print(completion.choices[0].message.content)

Streaming

python
import os

import httpx
from openai import OpenAI

http_client = httpx.Client(
    auth=(os.getenv("BASICAUTH_USERNAME"), os.getenv("BASICAUTH_PASSWORD"))
)

client = OpenAI(
    base_url=os.getenv("OPENAI_BASE_URL"),
    api_key="dummy",  # openai sdk requires a dummy API key
    http_client=http_client,
)

stream = client.chat.completions.create(
    model=os.getenv("MODEL_NAME"),
    messages=[
        # {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Write a short poem about Python programming."}
    ],
    stream=True,
    # max_tokens=500,
    # temperature=0.7
)

with open("response.txt", "w") as f:
    for chunk in stream:
        if chunk.choices[0].delta.content is not None:
            content = chunk.choices[0].delta.content
            f.write(content)
            print(content, end="", flush=True)

AzureOpenAI Endpoint

python
from openai import AzureOpenAI

client = AzureOpenAI(
    api_key="",  # set your Azure OpenAI key here
    api_version="2024-05-01-preview",  # use the version shown in Azure
    azure_endpoint="https://$SUBSCRIPTION_NAME.cognitiveservices.azure.com",
)
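
For reference, the SDK resolves requests against a URL of roughly this shape; the deployment name below is a hypothetical placeholder, not something from this page:

```python
azure_endpoint = "https://my-resource.cognitiveservices.azure.com"
deployment = "gpt-4o-mini"  # hypothetical Azure deployment name
api_version = "2024-05-01-preview"

# Azure routes by deployment name, with the API version as a query parameter.
url = (
    f"{azure_endpoint}/openai/deployments/{deployment}"
    f"/chat/completions?api-version={api_version}"
)
print(url)
```

Note that, unlike the plain `OpenAI` client, the `model` argument you pass to `client.chat.completions.create(...)` is the deployment name, not the underlying model name.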

Security

Misc

Hardware

Setting up NVIDIA DGX Spark with ggml

bash
bash <(curl -s https://ggml.ai/dgx-spark.sh)

Vendors

Google

OpenAI

Apps

  • gallery - A gallery that showcases on-device ML/GenAI use cases and allows people to try and use models locally.

Resources