30. TML and Generative AI

TML uses privateGPT containers (discussed below) for secure, fast, and distributed AI.

Attention

These containers depend on NVIDIA GPU cards, up to the 5090.
Containers are compatible with CUDA versions up to 12.8.
Containers run on AMD64 and ARM64 chip architectures.
They also require the Qdrant vector DB. The following Docker run command installs Qdrant locally with TML integration:
docker run -d -p 6333:6333 -v $(pwd)/qdrant_storage:/qdrant/storage:z qdrant/qdrant

Note

TML solution developers have several privateGPT container options, ranging from small to large models with varying GPU memory requirements.

These models and containers are listed in the table below.

Tip

These models all follow a llama2 prompt style. See here for more details.

31. PrivateGPT Special Containers

TML-privateGPT Container and GPU Suggested Requirements

AMD64: Basic Model Version 1
ARM64 container

LLM: TheBloke/Mistral-7B-Instruct-v0.1-GGUF

Embedding: BAAI/bge-small-en-v1.5

Vector Dimension: 384

Docker Run Command for AMD64 Container:
docker run -d -p 8001:8001 --net=host --gpus all \
--env PORT=8001 --env TSS=0 --env GPU=1 \
--env COLLECTION=tml --env WEB_CONCURRENCY=2 \
--env CUDA_VISIBLE_DEVICES=0 --env TOKENIZERS_PARALLELISM=false \
--env temperature=0.1 --env vectorsearchtype=cosine \
--env contextwindowsize=4096 --env vectordimension=384 \
--env mainmodel="TheBloke/Mistral-7B-Instruct-v0.1-GGUF" \
--env mainembedding="BAAI/bge-small-en-v1.5" \
-v /var/run/docker.sock:/var/run/docker.sock:z \
maadsdocker/tml-privategpt-with-gpu-nvidia-amd64:latest
Suggested VRAM/GPU should be around 20GB

SSD 2-3 TB

Suggested Machine: On-demand 1x NVIDIA A10

Suggested Cost GPU/Hour: $0.75/GPU/h

AMD64: Mid Model Version 2
ARM64 container

LLM: mistralai/Mistral-7B-Instruct-v0.2

Embedding: BAAI/bge-small-en-v1.5

Vector Dimension: 384

Docker Run Command for AMD64 Container:
docker run -d -p 8001:8001 --net=host --gpus all \
--env PORT=8001 --env TSS=0 --env GPU=1 \
--env COLLECTION=tml --env WEB_CONCURRENCY=2 \
--env CUDA_VISIBLE_DEVICES=0 --env TOKENIZERS_PARALLELISM=false \
--env temperature=0.1 --env vectorsearchtype=cosine \
--env contextwindowsize=4096 --env vectordimension=384 \
--env mainmodel="mistralai/Mistral-7B-Instruct-v0.2" \
--env mainembedding="BAAI/bge-small-en-v1.5" \
-v /var/run/docker.sock:/var/run/docker.sock:z \
maadsdocker/tml-privategpt-with-gpu-nvidia-amd64-v2:latest
Suggested VRAM/GPU should be around 24GB

SSD 2-3 TB

Suggested Machine: On-demand 1x NVIDIA A10

Suggested Cost GPU/Hour: $0.75/GPU/h

AMD64: Advanced Model Version 3
ARM64 container

LLM: mistralai/Mistral-7B-Instruct-v0.3

Embedding: BAAI/bge-base-en-v1.5

Vector Dimension: 768

Docker Run Command for AMD64 Container:
docker run -d -p 8001:8001 --net=host --gpus all \
--env PORT=8001 --env TSS=0 --env GPU=1 \
--env COLLECTION=tml --env WEB_CONCURRENCY=2 \
--env CUDA_VISIBLE_DEVICES=0 --env TOKENIZERS_PARALLELISM=false \
--env temperature=0.1 --env vectorsearchtype=cosine \
--env contextwindowsize=4096 --env vectordimension=768 \
--env mainmodel="mistralai/Mistral-7B-Instruct-v0.3" \
--env mainembedding="BAAI/bge-base-en-v1.5" \
-v /var/run/docker.sock:/var/run/docker.sock:z \
maadsdocker/tml-privategpt-with-gpu-nvidia-amd64-v3
Suggested VRAM/GPU should be around 24GB

SSD 2-3 TB

Suggested Machine: On-demand 1x NVIDIA A10

Suggested Cost GPU/Hour: $0.75/GPU/h

AMD64: Large Advanced Model Version 3
ARM64 container

LLM: mistralai/Mistral-7B-Instruct-v0.3

Embedding: BAAI/bge-m3

Vector Dimension: 1024

Docker Run Command for AMD64 Container:
docker run -d -p 8001:8001 --net=host --gpus all \
--env PORT=8001 --env TSS=0 --env GPU=1 \
--env COLLECTION=tml --env WEB_CONCURRENCY=2 \
--env CUDA_VISIBLE_DEVICES=0 --env TOKENIZERS_PARALLELISM=false \
--env temperature=0.1 --env vectorsearchtype=cosine \
--env contextwindowsize=4096 --env vectordimension=1024 \
--env mainmodel="mistralai/Mistral-7B-Instruct-v0.3" \
--env mainembedding="BAAI/bge-m3" \
-v /var/run/docker.sock:/var/run/docker.sock:z \
maadsdocker/tml-privategpt-with-gpu-nvidia-amd64-v3-large
Suggested VRAM/GPU should be around 40GB

SSD 2-3 TB

Suggested Machine: On-demand 1x NVIDIA A6000 or A100

Suggested Cost GPU/Hour: $0.80 - $1.30/GPU/h
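The run commands for these containers differ only in the image name, the LLM, the embedding model, and the matching vector dimension. As a minimal sketch (the function name build_run_command and the lookup table are illustrative, not part of TML), the command can be assembled in Python so that the vector dimension always agrees with the chosen embedding:

```python
import shlex

# Embedding model -> vector dimension, as listed for the containers above.
EMBEDDING_DIMENSIONS = {
    "BAAI/bge-small-en-v1.5": 384,
    "BAAI/bge-base-en-v1.5": 768,
    "BAAI/bge-m3": 1024,
}

def build_run_command(image, mainmodel, mainembedding, port=8001):
    """Return the docker run argument list for one privateGPT container."""
    # Raises KeyError for an embedding that is not listed above.
    dimension = EMBEDDING_DIMENSIONS[mainembedding]
    env = {
        "PORT": port, "TSS": 0, "GPU": 1, "COLLECTION": "tml",
        "WEB_CONCURRENCY": 2, "CUDA_VISIBLE_DEVICES": 0,
        "TOKENIZERS_PARALLELISM": "false", "temperature": 0.1,
        "vectorsearchtype": "cosine", "contextwindowsize": 4096,
        "vectordimension": dimension,
        "mainmodel": mainmodel, "mainembedding": mainembedding,
    }
    cmd = ["docker", "run", "-d", "-p", f"{port}:{port}", "--net=host", "--gpus", "all"]
    for key, value in env.items():
        cmd += ["--env", f"{key}={value}"]
    cmd += ["-v", "/var/run/docker.sock:/var/run/docker.sock:z", image]
    return cmd

command = build_run_command(
    "maadsdocker/tml-privategpt-with-gpu-nvidia-amd64-v3-large",
    "mistralai/Mistral-7B-Instruct-v0.3",
    "BAAI/bge-m3",
)
print(shlex.join(command))  # printable, shell-quoted form of the command
```

The returned list can be passed to subprocess.run directly; printing it first makes it easy to compare against the commands listed above.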

32. TML and Agentic AI Special Container

For TML and Agentic AI solutions, users must use the following container:

AMD64: Agentic AI Llama3 with Ollama Server
ARM64 container

LLM: Llama 3.1, Llama 3.2, or any other "tools" model

Embedding: nomic-embed-text

Vector Dimension: n/a

Docker Run Command for AMD64 Container:
docker run -d -p 8001:8001 --net=host --gpus all --env PORT=8001 \
--env TSS=0 \
--env GPU=1 \
--env COLLECTION=tml \
--env WEB_CONCURRENCY=2 \
--env CUDA_VISIBLE_DEVICES=0 \
--env TOKENIZERS_PARALLELISM=false \
--env temperature=0.1 \
--env vectorsearchtype=cosine \
--env contextwindowsize=4096 \
--env vectordimension=384 \
--env mainembedding="nomic-embed-text" \
-v /var/run/docker.sock:/var/run/docker.sock:z \
--env LLAMAMODEL=llama3.2 \
--env OLLAMASERVERPORT="http://localhost:11434" \
maadsdocker/tml-privategpt-with-gpu-nvidia-amd64-llama3-tools
Suggested VRAM/GPU should be around 20GB

SSD 2-3 TB

Suggested Machine: On-demand 1x NVIDIA A10

Suggested Cost GPU/Hour: $0.75/GPU/h

33. Test The Ollama Model for GPU

To test your model - at the Linux prompt type:

curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Hello, how are you?",
  "stream": false,
  "keep_alive": -1
}'
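The same request can also be sent from Python. The sketch below uses only the standard library and mirrors the curl payload; the function names are illustrative, and the host and model defaults assume the container settings shown earlier.

```python
import json
import urllib.request

def build_generate_request(prompt, model="llama3.2", host="http://localhost:11434"):
    """Build the POST request that mirrors the curl call to /api/generate."""
    body = {"model": model, "prompt": prompt, "stream": False, "keep_alive": -1}
    return urllib.request.Request(
        host + "/api/generate",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def generate(prompt, **kwargs):
    """Send the prompt to the Ollama server and return the generated text."""
    request = build_generate_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=120) as resp:
        return json.loads(resp.read())["response"]
```

With the container running, generate("Hello, how are you?") should return the model's reply as a string.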

To check whether Ollama is using the GPU or the CPU, type the following:

ollama ps

# You should see something like this if Ollama is using the GPU (note that your ID may be different):
#NAME               ID              SIZE      PROCESSOR    CONTEXT    UNTIL
#llama3.2:latest    a80c4f17acd5    2.5 GB    100% GPU     4096       Forever

34. Ollama LLM Model for CPU

If you do not have access to an NVIDIA GPU, you can configure Ollama for CPU-only use on Linux (Ubuntu). Follow these steps:

Note

If you are a Mac user, replace:

maadsdocker/tml-privategpt-with-cpu-amd64-llama3-tools

with: maadsdocker/tml-privategpt-with-cpu-arm64-llama3-tools

Run the Ollama Container

docker run -d -p 8001:8001 --net=host --env PORT=8001 \
--env TSS=0 \
--env GPU=0 \
--env COLLECTION=tml \
--env WEB_CONCURRENCY=2 \
--env CUDA_VISIBLE_DEVICES=0 \
--env TOKENIZERS_PARALLELISM=false \
--env temperature=0.1 \
--env vectorsearchtype=cosine \
--env contextwindowsize=4096 \
--env vectordimension=384 \
--env mainembedding="nomic-embed-text" \
-v /var/run/docker.sock:/var/run/docker.sock:z \
--env LLAMAMODEL=llama3.2 \
--env OLLAMASERVERPORT="http://localhost:11434" \
maadsdocker/tml-privategpt-with-cpu-amd64-llama3-tools

Confirm the Ollama server is running with LLAMAMODEL=llama3.2 by typing:

 # Note you may need to install ollama snap
 #sudo snap install ollama

 ollama ps

# You should see something like this if Ollama is using the CPU (note that your ID may be different):
#NAME               ID              SIZE      PROCESSOR    CONTEXT    UNTIL
#llama3.2:latest    a80c4f17acd5    2.5 GB    100% CPU     4096       Forever
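The PROCESSOR column can also be checked programmatically. The sketch below assumes the Ollama REST endpoint /api/ps, which reports each loaded model with a total size and the portion held in GPU memory (size_vram). The split calculation is an approximation of what ollama ps prints, not a copy of its implementation, and the function names are illustrative.

```python
import json
import urllib.request

def processor_split(model):
    """Describe CPU/GPU placement for one entry of the /api/ps "models" list."""
    size = model["size"]
    size_vram = model.get("size_vram", 0)
    if size_vram == 0:
        return "100% CPU"
    if size_vram >= size:
        return "100% GPU"
    gpu = round(100 * size_vram / size)
    return f"{100 - gpu}%/{gpu}% CPU/GPU"

def running_models(host="http://localhost:11434"):
    """Return the list of models currently loaded by the Ollama server."""
    with urllib.request.urlopen(host + "/api/ps", timeout=10) as resp:
        return json.loads(resp.read())["models"]
```

For a CPU-only container, processor_split should report "100% CPU" for every entry returned by running_models().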

Test the CPU LLM Model For a Response:

 curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Hello from CPU!",
  "stream": false
}'

RESPONSE FROM AI (You should see something similar): {"model":"llama3.2","created_at":"2026-02-11T00:20:48.600152776Z","response":"Hello from the other side... of the digital realm!\n\nCPU (Central Processing Unit) is a bit unconventional, but I'll play along. So, what's on your digital mind today? Want to discuss some computational conundrums or just swap some bytes?","done":true,"done_reason":"stop","context":[128006,9125,128007,271,38766,1303,33025,2696,25,6790,220,2366,18,271,128009,128006,882,128007,271,9906,505,14266,0,128009,128006,78191,128007,271,9906,505,279,1023,3185,1131,315,279,7528,22651,2268,32715,320,44503,29225,8113,8,374,264,2766,73978,11,719,358,3358,1514,3235,13,2100,11,1148,596,389,701,7528,4059,3432,30,24133,311,4358,1063,55580,390,1263,81,6370,477,1120,14626,1063,5943,30],"total_duration":3222377627,"load_duration":103360250,"prompt_eval_count":29,"prompt_eval_duration":1223374672,"eval_count":54,"eval_duration":1853177251}

If you receive a response, llama3.2 is now running successfully on the CPU.
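The timing fields in the response are reported by Ollama in nanoseconds, so throughput can be derived directly from them. The sketch below uses the values from the sample response above; the helper name tokens_per_second is illustrative.

```python
def tokens_per_second(count, duration_ns):
    """Convert a token count and a duration in nanoseconds to tokens/second."""
    return count / duration_ns * 1e9

# Timing fields copied from the sample CPU response above.
sample = {
    "total_duration": 3222377627,
    "load_duration": 103360250,
    "prompt_eval_count": 29,
    "prompt_eval_duration": 1223374672,
    "eval_count": 54,
    "eval_duration": 1853177251,
}

generation_rate = tokens_per_second(sample["eval_count"], sample["eval_duration"])
prompt_rate = tokens_per_second(sample["prompt_eval_count"], sample["prompt_eval_duration"])

print(f"generation: {generation_rate:.1f} tokens/s")
print(f"prompt:     {prompt_rate:.1f} tokens/s")
print(f"total:      {sample['total_duration'] / 1e9:.2f} s")
```

Comparing the generation rate between the CPU and GPU containers gives a quick measure of how much the GPU helps for a given model.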

Tip

You can switch between the Llama 3.1 and Llama 3.2 models by updating:

--env LLAMAMODEL=llama3.2
You can also use any other tools model from Ollama.com (see figure below).

The Ollama server host and port can be changed by updating:

--env OLLAMASERVERPORT="http://localhost:11434"

To use other models, go to Ollama.com and search for "tools".

35. TML and Vision Models

You can use the Llava vision models by setting --env LLAMAMODEL to one of the following:

--env LLAMAMODEL=llava:7b

--env LLAMAMODEL=llava:13b

--env LLAMAMODEL=llava:34b

The general reference architecture shows how TML connects to Ollama server container and Video ChatGPT in real-time to process images and Videos:

Note

VideoChatGPT uses Vicuna v1.1

Note

All images must be base64 encoded - see code below in section TML and Vision Models: Sample Code

You must have the Ollama server container running:

docker run -d -p 8001:8001 --net=host --gpus all --env PORT=8001 \
--env TSS=0 \
--env GPU=1 \
--env COLLECTION=tml \
--env WEB_CONCURRENCY=2 \
--env CUDA_VISIBLE_DEVICES=0 \
--env TOKENIZERS_PARALLELISM=false \
--env temperature=0.1 \
--env vectorsearchtype=cosine \
--env contextwindowsize=4096 \
--env vectordimension=384 \
--env mainembedding="nomic-embed-text" \
-v /var/run/docker.sock:/var/run/docker.sock:z \
--env LLAMAMODEL=llava:7b \
--env OLLAMASERVERPORT="http://localhost:11434" \
maadsdocker/tml-privategpt-with-gpu-nvidia-amd64-llama3-tools

35.1. TML and Vision Models: Sample Code

import base64
import requests

def base64encodeimage(imagefile):
    # Read the image file and return its contents as base64-encoded bytes
    with open(imagefile, "rb") as image_file:
        data = base64.b64encode(image_file.read())

    return data

def base64ToString(b):
    # Convert base64 bytes to a string so it can be placed in the JSON payload
    return b.decode("utf-8")


def describeimage(imgname):

    imgdata=base64encodeimage(imgname)

    payload = {
        "model": "llava:7b",
        "prompt": "What is in this picture?",
        "stream": False,
        "images": [base64ToString(imgdata)],
    }

    # json= serializes the payload and sets the Content-Type to application/json
    response = requests.post('http://localhost:11434/api/generate', json=payload)

    print(response.text)

    return response

describeimage("./image2.png")

36. TML and Video ChatGPT

Users can call the Video ChatGPT container as follows:

36.1. Docker Run Command

docker run --gpus all -d -p 7900:7900 \
--net=host --env CUDA_VISIBLE_DEVICES=0 \
--env VIDEOGPTPORT=7900 \
-v /mnt/c/sample_videos:/VideoChatGPT/videofile:z \
--env VIDEOGPTFOLDER=/VideoChatGPT/videofile maadsdocker/tml-videochatgpt-nvidia-gpu-amd64

Note

Details on the Docker run command:

-p 7900:7900: Forwards host port 7900 to container port 7900
VIDEOGPTPORT=7900: This enables the API to connect to Video ChatGPT on port 7900
-v /mnt/c/sample_videos:/VideoChatGPT/videofile:z: All your video files need to be stored on the host machine; Docker maps this host folder to the container folder for video retrieval
VIDEOGPTFOLDER=/VideoChatGPT/videofile: This is the container video folder
You need to drop the mp4 files in the host folder that is mapped to the container folder.
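Because the container only sees files through the mapped folder, it can help to list what is actually in the host folder before sending requests. The following is a small sketch with illustrative helper names; the default container folder is the one from the run command above.

```python
from pathlib import Path

def list_videos(host_folder):
    """Return the names of the .mp4 files directly inside host_folder, sorted."""
    return sorted(
        p.name for p in Path(host_folder).iterdir()
        if p.is_file() and p.suffix.lower() == ".mp4"
    )

def container_path(filename, container_folder="/VideoChatGPT/videofile"):
    """Where a host file appears inside the container through the -v mapping."""
    return f"{container_folder}/{filename}"
```

For example, list_videos("/mnt/c/sample_videos") returns the file names that can be passed to the sample code in the next section.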

36.2. Video ChatGPT Sample Code

import maadstml # import the maadstml python library: pip install maadstml

################### NOTE: This will only work if Video ChatGPT is running on a machine with an NVIDIA GPU and the CUDA toolkit installed

def videochat(videofilename):

    url='http://127.0.0.1'            # IP Video ChatGPT is listening on

    port='7900'                       # Port Video ChatGPT is listening on

    filename=videofilename            # Video file name

    responsefolder='sample-videos'    # folder Video ChatGPT will write the responses to

    temperature=0.1                   # temperature - varies between 0-1; the closer to 0, the more conservative the responses

    max_output_tokens=512             # max tokens returned

    prompt='What is this video about? Is there anything strange about this video?'  # prompt to ask Video ChatGPT about the video

    # Load the video into Video ChatGPT and return its response
    ret=maadstml.videochatloadresponse(url,port,filename,prompt,responsefolder,temperature,max_output_tokens)
    return ret

# Call the Video ChatGPT container - you can put this in a loop and analyse several videos at once with multiple containers
videofilename = 'sample_6.mp4'

ret = videochat(videofilename) # returns the response file name
print(ret)
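The comment in the sample code mentions analysing several videos at once with multiple containers. One possible way to do that, sketched below, is to spread the file names round-robin over the container ports and run the calls in a thread pool. The function analyse_videos and its analyse callback are hypothetical; the callback is supplied by the caller and could, for instance, wrap maadstml.videochatloadresponse with the given port.

```python
from concurrent.futures import ThreadPoolExecutor

def analyse_videos(filenames, ports, analyse):
    """Spread filenames over container ports round-robin and run them in parallel.

    analyse(filename, port) is provided by the caller and performs the actual
    request against the Video ChatGPT container listening on that port.
    Returns a dict mapping each filename to the value analyse returned.
    """
    jobs = [(name, ports[i % len(ports)]) for i, name in enumerate(filenames)]
    # One worker per container, so each container handles one video at a time.
    with ThreadPoolExecutor(max_workers=len(ports)) as pool:
        results = pool.map(lambda job: analyse(*job), jobs)
        return dict(zip(filenames, results))
```

With two containers started on ports 7900 and 7901, analyse_videos(names, ["7900", "7901"], my_call) would send the first video to 7900, the second to 7901, the third to 7900 again, and so on.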

37. TML API for GenAI Using MAADSTML Python Library

TML solutions can be built to access GPT technology in real-time using the MAADSTML python library functions:

MAADSTML Python Function	Description
pgptingestdocs	Set Context for PrivateGPT by ingesting PDFs or text documents. All responses will then use these documents for context.
pgptgetingestedembeddings	After documents are ingested, you can retrieve the embeddings for the ingested documents. These embeddings allow you to filter the documents for specific context.
pgptchat	Send any prompt to privateGPT (with or without context) and get back a response.
pgptdeleteembeddings	Delete embeddings.
pgpthealth	Check the health of the privateGPT http server.

37.1. GenAI With STEP 9

Several powerful, real-time AI analyses can be performed with STEP 9: PrivateGPT and Qdrant Integration: tml-system-step-9-privategpt_qdrant-dag

These are the following:

Perform post-analysis on TML output with GenAI

Use the Qdrant vector DB to query local documents with GenAI

Scale GenAI with privateGPT for secure, local, and quality AI analysis.

Tip

See TML, PrivateGPT and Qdrant Example Scenarios for more information.

37.2. TML and RAG: A Powerful Combination

TML using STEP 9: PrivateGPT and Qdrant Integration: tml-system-step-9-privategpt_qdrant-dag can perform RAG (Retrieval-augmented Generation) with a few simple configurations.

Below is a figure showing an Advanced RAG model (inspired by a Hugging Face blog) that ingests engineering documents for real-time prompting using one of the privateGPT containers.

Important

This is especially useful for cybersecurity use cases where you want to cross-reference a source IP address with web log files to determine whether the log files contain any "authentication failures" or "wrong passwords" associated with that source IP address.

Together with the Qdrant vector DB, users can analyse local files with TML in real-time, in a few seconds, with no code: only configuration of Step 9.

The incorporation of RAG with TML for real-time cybersecurity analysis of log files is demonstrated in Cybersecurity Solution with PrivateGPT, MQTT, HiveMQ.
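To make the retrieve-then-prompt idea concrete, here is a deliberately simplified sketch of the cybersecurity example. It is not how Step 9 is implemented: it uses plain keyword matching as a stand-in for the Qdrant vector search, the log lines are made up, and all function names are illustrative.

```python
FAILURE_PHRASES = ("authentication failure", "wrong password")

def retrieve_log_lines(source_ip, log_lines, phrases=FAILURE_PHRASES):
    """Return log lines that mention source_ip and any of the failure phrases.

    Note: this is a plain substring match, so "10.0.0.7" would also match
    "10.0.0.70"; a real solution would parse the address properly.
    """
    hits = []
    for line in log_lines:
        lowered = line.lower()
        if source_ip in line and any(p in lowered for p in phrases):
            hits.append(line)
    return hits

def build_prompt(source_ip, evidence):
    """Combine the retrieved lines and the question into one prompt."""
    context = "\n".join(evidence) if evidence else "(no matching log lines)"
    return (
        f"Log lines for source IP {source_ip}:\n{context}\n\n"
        "Do these lines show authentication failures or wrong passwords?"
    )

# Made-up sample log lines for illustration only.
logs = [
    "Jan 10 10:01:02 sshd: authentication failure from 10.0.0.7",
    "Jan 10 10:01:05 sshd: session opened for 10.0.0.9",
    "Jan 10 10:02:11 web: wrong password for admin from 10.0.0.7",
]
evidence = retrieve_log_lines("10.0.0.7", logs)
print(build_prompt("10.0.0.7", evidence))
```

The resulting prompt is what would be sent to the privateGPT container; in the TML solution the retrieval step is handled by Qdrant and the Step 9 configuration rather than by hand-written code.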

37.3. Private GPT Container

More privateGPT containers can be found here: PrivateGPT Special Containers. The container requires an NVIDIA GPU.

docker pull maadsdocker/tml-privategpt-with-gpu-nvidia-amd64

docker run -d -p 8001:8001 --net=host --gpus all \
--env PORT=8001 --env TSS=0 --env GPU=1 \
--env COLLECTION=tml --env WEB_CONCURRENCY=2 \
--env CUDA_VISIBLE_DEVICES=0 --env TOKENIZERS_PARALLELISM=false \
--env temperature=0.1 --env vectorsearchtype=cosine \
--env contextwindowsize=4096 --env vectordimension=384 \
--env mainmodel="TheBloke/Mistral-7B-Instruct-v0.1-GGUF" \
--env mainembedding="BAAI/bge-small-en-v1.5" \
maadsdocker/tml-privategpt-with-gpu-nvidia-amd64:latest

Tip

To check if privateGPT is running enter this in your browser: http://localhost:8001

You should see the private GPT website below.
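The check can also be scripted. The sketch below assumes the container exposes the upstream PrivateGPT /health endpoint, which returns a small JSON body with a status field; if the TML image changes that endpoint, the check will simply report False. Function names are illustrative.

```python
import json
import urllib.error
import urllib.request

def is_healthy(body):
    """True when a /health response body reports status ok."""
    try:
        return json.loads(body).get("status") == "ok"
    except (ValueError, AttributeError):
        # Not JSON, or JSON that is not an object.
        return False

def check_privategpt(base_url="http://localhost:8001"):
    """Return True if the privateGPT server answers its health check."""
    try:
        with urllib.request.urlopen(base_url + "/health", timeout=5) as resp:
            return is_healthy(resp.read())
    except (urllib.error.URLError, OSError):
        return False
```

Calling check_privategpt() returns False rather than raising when the container is not reachable, which makes it convenient to use in a startup wait loop.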

Note

If you set WEB_CONCURRENCY greater than 1, you will need Qdrant Vector DB running (see below)

37.4. PrivateGPT Container With NO GPU

Tip

If you do not have an NVIDIA GPU you can use the docker container with NO GPU:

docker run -d -p 8001:8001 --env PORT=8001 --env GPU=0 --env CUDA_VISIBLE_DEVICES=0 maadsdocker/tml-privategpt-no-gpu-amd64

37.4.1. Installing CUDA For NVIDIA GPU

Important

It is highly recommended that users run the privateGPT container using the NVIDIA GPU for FASTER performance.

If you have an NVIDIA GPU, you must install the CUDA Software Development Kit in your Linux environment.

To confirm your GPU card is recognized in Linux, type: nvidia-smi. You should see output similar to the image below.
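The same check can be made from a script, for example before deciding whether to start the GPU or the no-GPU container. This sketch calls nvidia-smi with its CSV query options and parses the result; the function names are illustrative.

```python
import shutil
import subprocess

QUERY = ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"]

def parse_gpus(output):
    """Turn 'name, memory' CSV lines into (name, memory) tuples."""
    gpus = []
    for line in output.strip().splitlines():
        if not line.strip():
            continue  # skip blank lines
        name, memory = (part.strip() for part in line.rsplit(",", 1))
        gpus.append((name, memory))
    return gpus

def detect_gpus():
    """Return the GPUs reported by nvidia-smi, or an empty list if none."""
    if shutil.which("nvidia-smi") is None:
        return []  # driver tools are not installed
    result = subprocess.run(QUERY, capture_output=True, text=True, check=False)
    return parse_gpus(result.stdout) if result.returncode == 0 else []
```

An empty list from detect_gpus() suggests using the container without GPU described above.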

37.4.2. NVIDIA Common Issues

Important

If you run Docker or Minikube with the --gpus all flag and see an ERROR message like:

docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]].

Then run the following:

sudo nvidia-ctk runtime configure --runtime=docker

sudo systemctl restart docker

Attention

Make sure to STOP the TSS Container and other containers before running Kubernetes/Minikube.

If you get the following WARNING from Kubernetes:

Warning FailedScheduling 13m default-scheduler 0/1 nodes are available: 1 Insufficient nvidia.com/gpu. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod.

Issue the commands below:

sudo apt update && sudo apt install -y nvidia-docker2

sudo nvidia-ctk runtime configure --runtime=docker

sudo systemctl restart docker

37.5. To Enable GPU in Kubernetes

You can apply the following YML file to the Kubernetes cluster to enable GPU support.

kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.12.3/nvidia-device-plugin.yml
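After the device plugin is running, each GPU node should advertise an allocatable nvidia.com/gpu resource; a count of zero is what leads to the "Insufficient nvidia.com/gpu" scheduling warning. As a sketch, the output of kubectl get nodes -o json can be inspected like this (the function name is illustrative):

```python
import json

def allocatable_gpus(nodes_json):
    """Map node name -> allocatable nvidia.com/gpu count.

    nodes_json is the text printed by: kubectl get nodes -o json
    """
    counts = {}
    for node in json.loads(nodes_json)["items"]:
        allocatable = node["status"].get("allocatable", {})
        # Kubernetes reports resource quantities as strings, e.g. "1".
        counts[node["metadata"]["name"]] = int(allocatable.get("nvidia.com/gpu", 0))
    return counts
```

The JSON text can be captured with subprocess.run(["kubectl", "get", "nodes", "-o", "json"], capture_output=True, text=True).stdout and passed to allocatable_gpus.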

Also see section: NVIDIA GPU On Windows WSL

37.6. Accessing PrivateGPT With MAADSTML Python API

Once you have the PrivateGPT container running you can access it using the maadstml API. Here is some sample Python code to access the privateGPT container:

Note

Because PrivateGPT exposes a REST API, you can call it from any programming language and take advantage of free, fast AI.

import maadstml
import json

def sendpromptgpt(prompt,pgptip,pgptport):
    pgptendpoint="/v1/completions"
    includesources=False   # do not return source documents with the answer
    docfilter=""           # no document filter
    context=False          # answer without ingested-document context

    try:
        response=maadstml.pgptchat(prompt,context,docfilter,pgptport,includesources,pgptip,pgptendpoint)
        jb=json.loads(response)
        response=jb['choices'][0]['message']['content']
    except Exception as e:
        print("ERROR: connecting to PrivateGPT=",e)
        return ""

    return response

def setupprompt():
    pgptip="http://127.0.0.1"
    pgptport="8001"

    prompt="Who is the prime minister of Canada?"
    message=sendpromptgpt(prompt,pgptip,pgptport)
    print(message)
    return message

setupprompt()
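For comparison, the same prompt can be sent without the maadstml library by calling the REST endpoint directly. This is a sketch using only the Python standard library; the request fields (prompt, use_context, include_sources, stream) follow the upstream PrivateGPT completions API and are assumed to be unchanged in the TML container, and the function names are illustrative.

```python
import json
import urllib.request

def build_completion_request(prompt, base_url="http://127.0.0.1:8001"):
    """Build a POST request for the privateGPT /v1/completions endpoint."""
    body = {
        "prompt": prompt,
        "use_context": False,      # answer without ingested documents
        "include_sources": False,  # do not return source chunks
        "stream": False,
    }
    return urllib.request.Request(
        base_url + "/v1/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def extract_answer(response_body):
    """Pull the generated text out of a completions response body."""
    return json.loads(response_body)["choices"][0]["message"]["content"]

def ask(prompt, **kwargs):
    """Send the prompt and return the answer text."""
    request = build_completion_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=120) as resp:
        return extract_answer(resp.read())
```

The response is read from choices[0].message.content, the same path the maadstml sample above uses.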

Details of LLM Used in privateGPT Container

llm_load_print_meta: format = GGUF V2

llm_load_print_meta: arch = llama

llm_load_print_meta: vocab type = SPM

llm_load_print_meta: n_vocab = 32000

llm_load_print_meta: n_merges = 0

llm_load_print_meta: n_ctx_train = 32768

llm_load_print_meta: n_embd = 4096

llm_load_print_meta: n_head = 32

llm_load_print_meta: n_head_kv = 8

llm_load_print_meta: n_layer = 32

llm_load_print_meta: n_rot = 128

llm_load_print_meta: n_gqa = 4

llm_load_print_meta: f_norm_eps = 0.0e+00

llm_load_print_meta: f_norm_rms_eps = 1.0e-05

llm_load_print_meta: f_clamp_kqv = 0.0e+00

llm_load_print_meta: f_max_alibi_bias = 0.0e+00

llm_load_print_meta: n_ff = 14336

llm_load_print_meta: rope scaling = linear

llm_load_print_meta: freq_base_train = 10000.0

llm_load_print_meta: freq_scale_train = 1

llm_load_print_meta: n_yarn_orig_ctx = 32768

llm_load_print_meta: rope_finetuned = unknown

llm_load_print_meta: model type = 7B

llm_load_print_meta: model ftype = mostly Q4_K - Medium

llm_load_print_meta: model params = 7.24 B

llm_load_print_meta: model size = 4.07 GiB (4.83 BPW)

llm_load_print_meta: general.name = mistralai_mistral-7b-instruct-v0.2

llm_load_print_meta: BOS token = 1 '<s>'

llm_load_print_meta: EOS token = 2 '</s>'

llm_load_print_meta: UNK token = 0 '<unk>'

llm_load_print_meta: LF token = 13 '<0x0A>'

llm_load_tensors: ggml ctx size = 0.11 MB

llm_load_tensors: mem required = 4165.47 MB
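Several of the values in this log can be cross-checked against each other, which is a quick way to confirm that the model loaded is the one you expect. The short calculation below uses only numbers from the log above; the variable names are illustrative.

```python
GIB = 2 ** 30

# Values copied from the log above.
n_embd, n_head, n_head_kv = 4096, 32, 8
model_params = 7.24e9
model_size_gib = 4.07

# Grouped-query attention: query heads per key/value head (log: n_gqa = 4).
n_gqa = n_head // n_head_kv

# Per-head dimension, which matches the rotary dimension (log: n_rot = 128).
head_dim = n_embd // n_head

# Average bits stored per weight (log: 4.83 BPW).
bits_per_weight = model_size_gib * GIB * 8 / model_params

print(n_gqa, head_dim, round(bits_per_weight, 2))
```

The roughly 4.8 bits per weight is consistent with the Q4_K - Medium quantization reported in the log, and explains why a 7B model fits in about 4 GiB.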

37.7. Qdrant Vector Database

privateGPT is also integrated with the Qdrant vector DB. Start Qdrant with:

docker run -d -p 6333:6333 -v $(pwd)/qdrant_storage:/qdrant/storage:z qdrant/qdrant

Tip

After running the container, to access the Qdrant dashboard enter the following URL in your browser:

http://localhost:6333/dashboard
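Besides the dashboard, Qdrant can be queried over its REST API. The sketch below lists the existing collections through the GET /collections endpoint, which is a simple way to confirm that the collection named in the COLLECTION setting (tml in the run commands above) has been created. Function names are illustrative.

```python
import json
import urllib.error
import urllib.request

def collection_names(body):
    """Extract collection names from a GET /collections response body."""
    payload = json.loads(body)
    return [c["name"] for c in payload["result"]["collections"]]

def list_collections(base_url="http://localhost:6333"):
    """Return the collection names, or None if Qdrant is not reachable."""
    try:
        with urllib.request.urlopen(base_url + "/collections", timeout=5) as resp:
            return collection_names(resp.read())
    except (urllib.error.URLError, OSError):
        return None
```

A return value of None means the Qdrant container is not reachable on that address; an empty list means it is running but no documents have been ingested yet.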