Ollama API - Integrating self-hosted AI into web applications

One of Ollama's biggest strengths is not just how easy it makes running models locally, but also how simple its API is. Ollama also exposes an OpenAI-compatible endpoint, so in many cases you can move from OpenAI to a self-hosted model by changing little more than the base URL and the model name. This post walks through the most important endpoints, integration with Python and Node.js, streaming, function calling, and finally building a simple chatbot widget.

The main Ollama API endpoints

By default, Ollama runs its API server at http://localhost:11434. Below are the endpoints you will use most often.

POST /api/generate – Text completion

The most basic endpoint: send a prompt and receive a response:

curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Explain what Docker is in 3 sentences",
  "stream": false
}'

The response is a JSON object whose response field holds the generated text, along with metadata such as total_duration and eval_count (the number of output tokens).
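
The metadata is handy for quick performance checks. Below is a minimal sketch that derives generation speed from a parsed response. It assumes the response also carries an eval_duration field measured in nanoseconds, as described in the Ollama API docs; the helper name tokens_per_second and the sample values are made up for illustration.

```python
def tokens_per_second(result: dict) -> float:
    """Estimate generation speed from a parsed, non-streaming /api/generate response."""
    eval_count = result.get("eval_count", 0)           # number of output tokens
    eval_duration_ns = result.get("eval_duration", 0)  # assumed to be nanoseconds
    if eval_duration_ns <= 0:
        return 0.0
    return eval_count / (eval_duration_ns / 1e9)


# Made-up sample shaped like the metadata described above.
sample = {"response": "...", "eval_count": 120, "eval_duration": 4_000_000_000}
print(f"{tokens_per_second(sample):.1f} tokens/s")  # 30.0 tokens/s
```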

POST /api/chat – Chat with conversation history

This endpoint is closer to the ChatGPT API: it takes a messages array with the roles system, user, and assistant:

curl http://localhost:11434/api/chat -d '{
  "model": "llama3.2",
  "messages": [
    {"role": "system", "content": "You are a technical assistant. Answer concisely."},
    {"role": "user", "content": "How do Nginx and Apache differ?"}
  ],
  "stream": false
}'

To maintain conversation context, append each new message to the messages array on every call. Ollama keeps no state between requests, so the entire history has to be sent again each time.
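
One way to keep that bookkeeping in one place is a small wrapper that owns the messages list. This is only a sketch: the Conversation class and the fake_send stand-in are hypothetical, and in a real script send would be something like a function that posts the payload to /api/chat with requests and returns the parsed JSON.

```python
from typing import Callable

Message = dict[str, str]


class Conversation:
    """Keeps the full message history and resends it on every call."""

    def __init__(self, model: str, system_prompt: str = "") -> None:
        self.model = model
        self.messages: list[Message] = []
        if system_prompt:
            self.messages.append({"role": "system", "content": system_prompt})

    def payload(self) -> dict:
        # The whole history goes into every request body.
        return {"model": self.model, "messages": list(self.messages), "stream": False}

    def ask(self, text: str, send: Callable[[dict], dict]) -> str:
        self.messages.append({"role": "user", "content": text})
        reply = send(self.payload())["message"]["content"]
        # Store the assistant turn so the next request includes it.
        self.messages.append({"role": "assistant", "content": reply})
        return reply


def fake_send(payload: dict) -> dict:
    # Stand-in for a real HTTP call; it only reports how much history it received.
    count = len(payload["messages"])
    return {"message": {"role": "assistant", "content": f"seen {count} messages"}}


chat = Conversation("llama3.2", system_prompt="Answer briefly.")
print(chat.ask("What is Nginx?", fake_send))  # seen 2 messages
print(chat.ask("And Apache?", fake_send))     # seen 4 messages
```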

GET /api/tags – List installed models

curl http://localhost:11434/api/tags

Returns a list of all models available on the machine, with details such as size, modified date, and parameter count. Useful when you want to build a UI that lets users choose a model.
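
For a model picker, the parsed response can be reduced to a list of display labels. The sketch below assumes the response shape from the Ollama API docs (a models list whose entries have name, size in bytes, and details.parameter_size); model_choices is a hypothetical helper and the sample data is invented.

```python
def model_choices(tags: dict) -> list[str]:
    """Turn a parsed /api/tags response into sorted labels for a dropdown."""
    choices = []
    for model in tags.get("models", []):
        size_gb = model.get("size", 0) / 1e9
        params = model.get("details", {}).get("parameter_size", "?")
        choices.append(f"{model['name']} ({params}, {size_gb:.1f} GB)")
    return sorted(choices)


# Invented sample; real values come from GET /api/tags.
sample_tags = {
    "models": [
        {"name": "nomic-embed-text:latest", "size": 274_000_000,
         "details": {"parameter_size": "137M"}},
        {"name": "llama3.2:latest", "size": 2_000_000_000,
         "details": {"parameter_size": "3.2B"}},
    ]
}
for label in model_choices(sample_tags):
    print(label)
```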

POST /api/embed – Vector embedding

curl http://localhost:11434/api/embed -d '{
  "model": "nomic-embed-text",
  "input": "Ollama supports embeddings"
}'

This endpoint returns the vector embedding of the text, for use in RAG (Retrieval-Augmented Generation), semantic search, or clustering. Popular embedding models include nomic-embed-text and mxbai-embed-large.
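
To illustrate the semantic search use case, here is a minimal sketch that ranks documents by cosine similarity to a query vector. The 3-dimensional vectors are toy values standing in for real embeddings (which have hundreds of dimensions), and the names cosine_similarity and rank are illustrative only.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank(query_vec: list[float], docs: dict[str, list[float]]) -> list[str]:
    """Return document names ordered from most to least similar to the query."""
    return sorted(docs, key=lambda name: cosine_similarity(query_vec, docs[name]),
                  reverse=True)


# Toy vectors; in practice each one would come from a call to /api/embed.
docs = {
    "docker": [0.9, 0.1, 0.0],
    "nginx": [0.1, 0.9, 0.0],
    "cooking": [0.0, 0.1, 0.9],
}
query = [0.8, 0.2, 0.0]
print(rank(query, docs))  # ['docker', 'nginx', 'cooking']
```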

OpenAI-compatible endpoint

This is one of Ollama's nicest features. Besides its native API, Ollama also exposes a /v1/chat/completions endpoint that follows the OpenAI API format:

curl http://localhost:11434/v1/chat/completions -d '{
  "model": "llama3.2",
  "messages": [
    {"role": "user", "content": "Hello!"}
  ]
}'

This means that for an application already using the OpenAI SDK, pointing base_url at Ollama and choosing a local model name is usually enough to get it running, with no logic changes and no new library. Frameworks and tools such as LangChain, LlamaIndex, and Open WebUI can talk to Ollama this way.

The /v1/chat/completions endpoint supports most common parameters, such as temperature, top_p, max_tokens, stream, and stop. However, some OpenAI-specific parameters (such as logprobs) may not be supported yet.

Integrating with Python

Option 1: Using requests (the simplest)

import requests
response = requests.post("http://localhost:11434/api/chat", json={
    "model": "llama3.2",
    "messages": [
        {"role": "user", "content": "Write a Python function that implements bubble sort"}
    ],
    "stream": False
})
data = response.json()
print(data["message"]["content"])

Nothing to install besides requests. A good fit for small scripts, automation, or quick prototypes.

Option 2: Using the OpenAI SDK (drop-in replacement)

from openai import OpenAI
client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama"  # Ollama does not check the key, but the SDK requires one
)
response = client.chat.completions.create(
    model="llama3.2",
    messages=[
        {"role": "system", "content": "Answer in Vietnamese."},
        {"role": "user", "content": "Compare PostgreSQL and MySQL"}
    ]
)
print(response.choices[0].message.content)

This approach is very convenient if you already have a codebase built on OpenAI. Change base_url (and the model name) and you can switch back and forth between OpenAI and Ollama as needed. The api_key value can be anything because Ollama does not check it, but the SDK requires the field to be non-empty.
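
If you switch backends often, the choice can live in configuration rather than in code. The following is a sketch under stated assumptions: the environment variable names LLM_BACKEND, LLM_MODEL, and OLLAMA_BASE_URL are invented for this example, and gpt-4o-mini is just a placeholder default for the OpenAI side.

```python
import os
from typing import Mapping, Optional


def client_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Pick base_url, api_key, and model from (hypothetical) environment variables."""
    env = os.environ if env is None else env
    if env.get("LLM_BACKEND", "ollama") == "openai":
        return {
            "base_url": "https://api.openai.com/v1",
            "api_key": env["OPENAI_API_KEY"],
            "model": env.get("LLM_MODEL", "gpt-4o-mini"),
        }
    return {
        "base_url": env.get("OLLAMA_BASE_URL", "http://localhost:11434/v1"),
        "api_key": "ollama",  # ignored by Ollama, required by the SDK
        "model": env.get("LLM_MODEL", "llama3.2"),
    }


settings = client_settings({})  # empty mapping: falls back to local Ollama
print(settings["base_url"], settings["model"])
```

The returned base_url and api_key can be passed straight to the OpenAI constructor shown above, and model to chat.completions.create.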

Integrating with Node.js

Option 1: Using the fetch API

const response = await fetch("http://localhost:11434/api/chat", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    model: "llama3.2",
    messages: [{ role: "user", content: "Hello from Node.js!" }],
    stream: false,
  }),
});
const data = await response.json();
console.log(data.message.content);

Node.js 18+ ships with fetch built in, so no extra package is needed. Compact, lightweight, and enough for most cases.

Option 2: Using the openai npm package

import OpenAI from "openai";
const client = new OpenAI({
  baseURL: "http://localhost:11434/v1",
  apiKey: "ollama",
});
const completion = await client.chat.completions.create({
  model: "llama3.2",
  messages: [{ role: "user", content: "Explain what a REST API is" }],
});
console.log(completion.choices[0].message.content);

As with Python, changing baseURL is all it takes. SDK methods that map to endpoints Ollama implements (such as chat completions) work as usual, while OpenAI-only features may not be available.

Streaming responses

By default, Ollama streams the response (token by token), much like ChatGPT reveals text gradually. This is a far better experience than waiting for the whole response before showing anything.

Streaming with Python

import json
import requests
response = requests.post("http://localhost:11434/api/chat", json={
    "model": "llama3.2",
    "messages": [{"role": "user", "content": "Tell me a short story"}],
    "stream": True
}, stream=True)
for line in response.iter_lines():
    if line:
        chunk = json.loads(line)
        # The final chunk (done: true) may carry an empty content string
        print(chunk.get("message", {}).get("content", ""), end="", flush=True)

Streaming with JavaScript (browser)

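const response = await fetch("http://localhost:11434/api/chat", {
  method: "POST",
  body: JSON.stringify({
    model: "llama3.2",
    messages: [{ role: "user", content: "Hello!" }],
    stream: true,
  }),
});
const reader = response.body.getReader();
const decoder = new TextDecoder();
let buffer = ""; // holds a partial line left over from the previous read
while (true) {
  const { done, value } = await reader.read();
  if (done) break;
  // A single read may contain several JSON lines, or only part of one
  buffer += decoder.decode(value, { stream: true });
  const lines = buffer.split("\n");
  buffer = lines.pop(); // the last piece may be an incomplete line
  for (const line of lines) {
    if (!line.trim()) continue;
    const chunk = JSON.parse(line);
    // process.stdout does not exist in browsers, so append to the page instead
    document.body.append(chunk.message?.content ?? "");
  }
}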

Ollama streams newline-delimited JSON (NDJSON): each line is a JSON object carrying the next piece of the response. The final object has done: true, which marks the response as complete.
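
Because network reads do not necessarily line up with line boundaries, an NDJSON consumer generally needs to buffer. Here is a minimal Python sketch of that idea; iter_ndjson and collect_text are hypothetical helper names, and the raw byte chunks are hand-written to simulate a stream split in awkward places.

```python
import json
from typing import Iterable, Iterator


def iter_ndjson(chunks: Iterable[bytes]) -> Iterator[dict]:
    """Yield one parsed object per complete line, buffering partial lines."""
    buffer = b""
    for chunk in chunks:
        buffer += chunk
        while b"\n" in buffer:
            line, buffer = buffer.split(b"\n", 1)
            if line.strip():
                yield json.loads(line)
    if buffer.strip():  # trailing object without a final newline
        yield json.loads(buffer)


def collect_text(chunks: Iterable[bytes]) -> str:
    """Concatenate message content until an object with done: true arrives."""
    parts = []
    for obj in iter_ndjson(chunks):
        parts.append(obj.get("message", {}).get("content", ""))
        if obj.get("done"):
            break
    return "".join(parts)


# Simulated stream: the second JSON object is split across two reads.
raw = [
    b'{"message": {"content": "Hel"}, "done": false}\n{"message": {"con',
    b'tent": "lo"}, "done": false}\n',
    b'{"message": {"content": ""}, "done": true}\n',
]
print(collect_text(raw))  # Hello
```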

Function calling / Tool use

In version 0.5 and later, Ollama supports function calling (tool use) through the /api/chat endpoint. This lets the model “call” functions you define in advance, for example looking up the weather, querying a database, or calling an external API.

curl http://localhost:11434/api/chat -d '{
  "model": "llama3.2",
  "messages": [
    {"role": "user", "content": "What is the weather like in Hanoi today?"}
  ],
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get weather information for a city",
        "parameters": {
          "type": "object",
          "properties": {
            "city": {
              "type": "string",
              "description": "City name"
            }
          },
          "required": ["city"]
        }
      }
    }
  ],
  "stream": false
}'

When the model decides it needs to call a function, the response contains tool_calls instead of regular text. Your application executes that function, then sends the result back to the model with the role tool so it can compose the final answer.
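
The application side of that round trip can be sketched roughly as follows. This assumes the native /api/chat response shape, where each entry in tool_calls has a function object with a name and an arguments dictionary. The get_weather stub, the TOOLS registry, and the sample message are all invented for illustration.

```python
import json


def get_weather(city: str) -> str:
    # Hypothetical stub; a real implementation would call a weather service.
    return json.dumps({"city": city, "forecast": "sunny", "temp_c": 30})


# Registry mapping tool names (as declared in the request) to Python callables.
TOOLS = {"get_weather": get_weather}


def run_tool_calls(assistant_message: dict) -> list[dict]:
    """Execute each requested tool and build the role=tool messages to send back."""
    results = []
    for call in assistant_message.get("tool_calls", []):
        fn = call["function"]
        handler = TOOLS.get(fn["name"])
        if handler is None:
            content = f"unknown tool: {fn['name']}"
        else:
            content = handler(**fn["arguments"])
        results.append({"role": "tool", "content": content})
    return results


# Invented example of an assistant message that requests a tool call.
sample_message = {
    "role": "assistant",
    "content": "",
    "tool_calls": [{"function": {"name": "get_weather", "arguments": {"city": "Hanoi"}}}],
}
follow_up = run_tool_calls(sample_message)
print(follow_up[0]["content"])
```

The next request would then send the earlier history, followed by the assistant message and the entries in follow_up, back to /api/chat.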

Not every model handles tool use well. Models recommended for function calling: llama3.2, qwen2.5, mistral. Check the model card in the Ollama library to see which models support it.

Building a simple chatbot widget

Below is a complete chatbot widget written in plain HTML + JavaScript that calls the Ollama API directly. You can embed it in any web page.

<!DOCTYPE html>
<html>
<head>
  <style>
    #chat-box {
      width: 400px; height: 500px; border: 1px solid #ddd;
      border-radius: 8px; display: flex; flex-direction: column;
      font-family: system-ui, sans-serif;
    }
    #messages {
      flex: 1; overflow-y: auto; padding: 16px;
    }
    .msg { margin: 8px 0; padding: 8px 12px; border-radius: 12px; max-width: 80%; }
    .user { background: #007bff; color: white; margin-left: auto; }
    .bot { background: #f1f1f1; }
    #input-area { display: flex; padding: 8px; border-top: 1px solid #ddd; }
    #input-area input { flex: 1; padding: 8px; border: 1px solid #ddd; border-radius: 4px; }
    #input-area button { margin-left: 8px; padding: 8px 16px; background: #007bff; color: white; border: none; border-radius: 4px; cursor: pointer; }
  </style>
</head>
<body>
<div id="chat-box">
  <div id="messages"></div>
  <div id="input-area">
    <input id="user-input" placeholder="Type a message..." />
    <button onclick="sendMessage()">Send</button>
  </div>
</div>
<script>
const OLLAMA_URL = "http://localhost:11434/api/chat";
const MODEL = "llama3.2";
let history = [
  { role: "system", content: "You are a friendly assistant. Answer concisely in Vietnamese." }
];
function addMessage(text, sender) {
  const div = document.createElement("div");
  div.className = `msg ${sender}`;
  div.textContent = text;
  document.getElementById("messages").appendChild(div);
  div.scrollIntoView({ behavior: "smooth" });
}
async function sendMessage() {
  const input = document.getElementById("user-input");
  const text = input.value.trim();
  if (!text) return;
  input.value = "";
  addMessage(text, "user");
  history.push({ role: "user", content: text });
  try {
    const res = await fetch(OLLAMA_URL, {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ model: MODEL, messages: history, stream: false }),
    });
    const data = await res.json();
    const reply = data.message.content;
    addMessage(reply, "bot");
    history.push({ role: "assistant", content: reply });
  } catch (err) {
    addMessage("Error connecting to the Ollama API", "bot");
  }
}
// Send on Enter
document.getElementById("user-input").addEventListener("keydown", (e) => {
  if (e.key === "Enter") sendMessage();
});
</script>
</body>
</html>

The widget covers the basics: it displays user and bot messages, keeps the conversation history, sends on Enter, and auto-scrolls. As long as Ollama is running and the model has been pulled, it works right away.

If the frontend and Ollama are on different domains or ports, you need to set the OLLAMA_ORIGINS environment variable to allow CORS. For example: OLLAMA_ORIGINS="*" ollama serve.

Tips for working with the Ollama API

Timeout and retry

Large models can take 30-60 seconds to generate a response, especially on the first run when the model has not been loaded into RAM yet. Set a generous timeout (at least 120 seconds) and implement retry logic:

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
session = requests.Session()
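# urllib3 does not retry POST on these status codes by default; allowed_methods (urllib3 >= 1.26) opts in
retries = Retry(total=3, backoff_factor=1, status_forcelist=[502, 503, 504], allowed_methods=["POST"])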
session.mount("http://", HTTPAdapter(max_retries=retries))
response = session.post(
    "http://localhost:11434/api/chat",
    json={"model": "llama3.2", "messages": [{"role": "user", "content": "Hi"}], "stream": False},
    timeout=120
)

Preload model

The first time you call the API with a given model, Ollama has to load it into memory (which can take 10-30 seconds depending on size). To spare users that delay, you can preload the model when your application starts:

# Preload the model into memory
curl http://localhost:11434/api/chat -d '{
  "model": "llama3.2",
  "messages": []
}'

Error handling

Some common errors when working with the Ollama API:

Connection refused: Ollama is not running. Start it with ollama serve, or check the service with systemctl status ollama.
Model not found: The model has not been pulled. Run ollama pull <model> first.
Out of memory: The model is too large for the available RAM/VRAM. Switch to a smaller model or use a quantized version.
CORS error: OLLAMA_ORIGINS must be set when calling from a browser.
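
These cases can be folded into one helper that turns a failure into an actionable hint. The sketch below rests on assumptions that should be checked against your Ollama version: that a missing model produces HTTP 404 and an error text containing "not found", and that memory failures mention "memory" in the error text. The diagnose function and its convention of passing status=None for a failed connection are invented for this example.

```python
from typing import Optional


def diagnose(status: Optional[int], error_text: str = "") -> str:
    """Map an HTTP status and error text to a hint (status=None: no connection)."""
    text = error_text.lower()
    if status is None:
        return "Cannot reach Ollama: start it with `ollama serve` or check the service."
    if status == 404 or "not found" in text:
        return "Model not found: run `ollama pull <model>` first."
    if "memory" in text:
        return "Out of memory: try a smaller or quantized model."
    return f"Unexpected error (HTTP {status}): {error_text}"


print(diagnose(None))
print(diagnose(404, "model 'llama3.2' not found"))
print(diagnose(500, "model requires more system memory than is available"))
```

CORS failures are not covered here because the browser blocks them before any response reaches your code.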

Keep-alive

By default, Ollama keeps a model in memory for 5 minutes after the last request. You can change this with the keep_alive parameter in the request (a number of seconds, or a string such as "30m" or "1h"). Set keep_alive: 0 to unload the model as soon as the response is returned, which saves RAM on resource-constrained servers.
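
As a small illustration, keep_alive is just another field in the request body. The with_keep_alive helper below is hypothetical; it only builds payloads and sends nothing.

```python
from typing import Union


def with_keep_alive(payload: dict, keep_alive: Union[int, str]) -> dict:
    """Return a copy of an Ollama request body with keep_alive set."""
    return {**payload, "keep_alive": keep_alive}


base = {
    "model": "llama3.2",
    "messages": [{"role": "user", "content": "Hi"}],
    "stream": False,
}
keep_warm = with_keep_alive(base, "30m")  # stay loaded for 30 minutes after this request
free_now = with_keep_alive(base, 0)       # unload right after the response
print(keep_warm["keep_alive"], free_now["keep_alive"])
```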

Summary

The Ollama API is capable enough to build production-ready AI applications that are fully self-hosted. The key takeaways:

The API is simple; a handful of endpoints covers most use cases.
The /v1/chat/completions endpoint is OpenAI-compatible, which makes migrating code very fast.
Streaming, function calling, and embeddings are supported for more complex applications.
It integrates with any programming language over HTTP.

You do not need a heavyweight framework to get started. A single HTML file with a few lines of JavaScript is enough for a chatbot that runs entirely on your own machine. From here you can go further: add RAG with embeddings, use function calling to connect to a database, or build a full SaaS application running its own models.
