LLM
Specs
- A2A
- Agent Skills
- agents.md - A simple, open format for guiding coding agents, used by over 20k open-source projects
Tools
Frameworks
Serving
- LMCache - Supercharge Your LLM with the Fastest KV Cache Layer
- llm-d - Kubernetes-Native Distributed LLM Inference with vLLM
- ollama - Get up and running with Llama 3, Mistral, Gemma, and other large language models
- sglang - SGLang is a high-performance serving framework for large language models and multimodal models.
- vllm - A high-throughput and memory-efficient inference and serving engine for LLMs
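vLLM and Ollama both expose OpenAI-compatible HTTP APIs out of the box; a minimal quick-start sketch (model names are illustrative):

```bash
# vLLM: serve a Hugging Face model behind an OpenAI-compatible API on port 8000
vllm serve Qwen/Qwen3-1.7B --host 0.0.0.0 --port 8000

# Ollama: pull a model and start an interactive chat
ollama run llama3
```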
Guardrails
- NeMo-Guardrails - NeMo Guardrails is an open-source toolkit for easily adding programmable guardrails to LLM-based conversational systems.
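A minimal sketch of trying NeMo Guardrails from its CLI, assuming a `./config` directory containing a `config.yml` and Colang rail definitions:

```bash
pip install nemoguardrails

# interactive chat session using the rails defined in ./config
nemoguardrails chat --config=./config
```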
API Gateways
- bifrost - Fastest LLM gateway (50x faster than LiteLLM) with adaptive load balancing
- litellm - Python SDK, Proxy Server (AI Gateway) to call 100+ LLM APIs in OpenAI (or native) format, with cost tracking, guardrails, load balancing and logging. [Bedrock, Azure, OpenAI, VertexAI, Cohere, Anthropic, Sagemaker, HuggingFace, VLLM, NVIDIA NIM]
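A minimal sketch of running the LiteLLM proxy in front of a local Ollama model and calling it through the OpenAI-compatible route (model name illustrative):

```bash
pip install 'litellm[proxy]'

# start the gateway (defaults to port 4000)
litellm --model ollama/llama3

# call it like any OpenAI-compatible endpoint
curl http://localhost:4000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "ollama/llama3", "messages": [{"role": "user", "content": "Hello!"}]}'
```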
Coding Agents
Serialization
- toon - 🎒 Token-Oriented Object Notation (TOON) – Compact, human-readable, schema-aware JSON for LLM prompts. Spec, benchmarks, TypeScript SDK.
Utilities
- TokenCost - Easy token price estimates for 400+ LLMs. One of the TokenOps projects.
- opencommit - Feature-rich GPT wrapper for git that generates commit messages with an LLM in one second; works with Claude, GPT and every other provider, and supports local Ollama models too
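A minimal opencommit setup sketch (the exact config key can vary between versions; check `oco config` for yours):

```bash
npm install -g opencommit

# point it at a provider (key name per the current README)
oco config set OCO_API_KEY=<your_api_key>

# stage changes, then generate and apply a commit message
git add .
oco
```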
Models
| Creator | Name | Hugging Face | Ollama |
|---|---|---|---|
| Alibaba | Qwen3-ASR | HF | |
| Alibaba | Qwen3-VL | | Ollama |
| Alibaba | Qwen3.5 | HF | |
| BAAI | bge-m3 | HF | Ollama |
| Google | TranslateGemma | HF | |
| SCB 10X | typhoon-ocr-3b | | Ollama |
| SCB 10X | typhoon-translate-4b | | Ollama |
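An entry in the Ollama column means the model can be pulled straight from the Ollama library; for example, for the bge-m3 embedding model above:

```bash
# pull the embedding model, then embed a string via the local Ollama API
ollama pull bge-m3
curl http://localhost:11434/api/embed -d '{"model": "bge-m3", "input": "Hello!"}'
```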
llama-server
```bash
~/ggml-org/llama.cpp/build-cuda/bin/llama-server --fim-qwen-7b-default --host 0.0.0.0 --port 8080
```

```bash
~/ggml-org/llama.cpp/build-cuda/bin/llama-server --gpt-oss-120b-default --host 0.0.0.0 --port 8080
```

lemonade-server
```bash
lemonade-server pull user.gemma-3-12b \
  --checkpoint unsloth/gemma-3-12b-it-GGUF:Q4_K_M \
  --recipe llamacpp
```

```bash
lemonade-server pull user.qwen3-30b \
  --checkpoint unsloth/Qwen3-30B-A3B-Instruct-2507-GGUF:Q4_K_M \
  --recipe llamacpp
```

```bash
lemonade-server pull user.qwen3-next-80b \
  --checkpoint Qwen/Qwen3-Next-80B-A3B-Instruct-GGUF:Q4_K_M \
  --recipe llamacpp
```

TTS
- Kokoro
- See available voices: https://huggingface.co/onnx-community/Kokoro-82M-v1.0-ONNX
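A sketch of requesting speech, assuming an OpenAI-compatible speech server such as Kokoro-FastAPI on its default port 8880; the voice name comes from the model card linked above:

```bash
# request synthesized speech and save it to a file
curl -X POST http://localhost:8880/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"model": "kokoro", "voice": "af_heart", "input": "Hello from Kokoro!"}' \
  --output hello.mp3
```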
Request
Curl
```bash
curl -u "username:password" -X POST https://example.com/api/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen3-1.7B-GGUF", "messages": [{"role": "user", "content": "Hello!"}]}'
```

Basic auth can also be provided as a request header:
```bash
# base64-encode the credentials and use the output in place of xxxx
echo -n "username:password" | base64

curl -X POST https://example.com/v1/chat/completions \
  -H "Authorization: Basic xxxx"
```

Python
```python
import os

import httpx
from openai import OpenAI

# pass basic-auth credentials via a custom httpx client
http_client = httpx.Client(
    auth=(os.getenv("BASICAUTH_USERNAME"), os.getenv("BASICAUTH_PASSWORD"))
)

client = OpenAI(
    base_url=os.getenv("OPENAI_BASE_URL"),
    api_key="dummy",  # the openai SDK requires a non-empty API key
    http_client=http_client,
)

completion = client.chat.completions.create(
    model=os.getenv("MODEL_NAME"),
    messages=[
        {"role": "user", "content": "Write a short poem about Python programming."}
    ],
)
print(completion.choices[0].message.content)
```

Streaming
```python
import os

import httpx
from openai import OpenAI

# pass basic-auth credentials via a custom httpx client
http_client = httpx.Client(
    auth=(os.getenv("BASICAUTH_USERNAME"), os.getenv("BASICAUTH_PASSWORD"))
)

client = OpenAI(
    base_url=os.getenv("OPENAI_BASE_URL"),
    api_key="dummy",  # the openai SDK requires a non-empty API key
    http_client=http_client,
)

stream = client.chat.completions.create(
    model=os.getenv("MODEL_NAME"),
    messages=[
        # {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Write a short poem about Python programming."}
    ],
    stream=True,
    # max_tokens=500,
    # temperature=0.7
)

# write the streamed chunks to a file while echoing them to stdout
with open("response.txt", "w") as f:
    for chunk in stream:
        if chunk.choices[0].delta.content is not None:
            content = chunk.choices[0].delta.content
            f.write(content)
            print(content, end="", flush=True)
```

AzureOpenAI Endpoint
```python
from openai import AzureOpenAI

client = AzureOpenAI(
    api_key="",
    api_version="2024-05-01-preview",  # use the version shown in Azure
    azure_endpoint="https://$SUBSCRIPTION_NAME.cognitiveservices.azure.com",
)

# model is the *deployment* name configured in Azure (placeholder below)
completion = client.chat.completions.create(
    model="my-deployment-name",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(completion.choices[0].message.content)
```

Security
Misc
Hardware
Setting up NVIDIA DGX Spark with ggml
```bash
bash <(curl -s https://ggml.ai/dgx-spark.sh)
```

Vendors
Google
OpenAI
Apps
- gallery - A gallery that showcases on-device ML/GenAI use cases and allows people to try and use models locally.