29. TML and Generative AI

TML uses privateGPT containers (discussed below) for secure, fast, and distributed, AI.

Attention

  1. These containers are dependent on the NVidia GPU cards up to 5090.

  2. Containers are compatible with CUDA versions upto 12.8.

  3. Containers will run on AMD64 and ARM64 chip architectures.

  4. They also require Qdrant Vector DB - Here is the Qdrant Docker Run Command to Install Qdrant Vector DB locally with TML integration:

    docker run -d -p 6333:6333 -v $(pwd)/qdrant_storage:/qdrant/storage:z qdrant/qdrant
    

Note

TML Solution developers have several privateGPT container options. These range from the small to large models with varying GPU memory requirements.

These models and containers are listed in the table below.

Tip

These models all follow a llama2 prompt style. See here for more details.

30. PrivateGPT Special Containers

TML-privateGPT Container

GPU Suggested Requirements

AMD64: Basic Model Version 1

docker run -d -p 8001:8001 --net=host --gpus all \
--env PORT=8001 --env TSS=0 --env GPU=1 \
--env COLLECTION=tml --env WEB_CONCURRENCY=2 \
--env CUDA_VISIBLE_DEVICES=0 --env TOKENIZERS_PARALLELISM=false \
--env temperature=0.1 --env vectorsearchtype=cosine \
--env contextwindowsize=4096 --env vectordimension=384 \
--env mainmodel="TheBloke/Mistral-7B-Instruct-v0.1-GGUF" \
--env mainembedding="BAAI/bge-small-en-v1.5" \
-v /var/run/docker.sock:/var/run/docker.sock:z \
maadsdocker/tml-privategpt-with-gpu-nvidia-amd64:latest
  1. Suggested VRAM/GPU should be around 20GB

  2. SSD 2-3 TB

  3. Suggested Machine: On-demand 1x NVIDIA A10

  4. Suggested Cost GPU/Hour: $0.75/GPU/h

AMD64: Mid Model Version 2

docker run -d -p 8001:8001 --net=host --gpus all \
--env PORT=8001 --env TSS=0 --env GPU=1 \
--env COLLECTION=tml --env WEB_CONCURRENCY=2 \
--env CUDA_VISIBLE_DEVICES=0 --env TOKENIZERS_PARALLELISM=false \
--env temperature=0.1 --env vectorsearchtype=cosine \
--env contextwindowsize=4096 --env vectordimension=384 \
--env mainmodel="mistralai/Mistral-7B-Instruct-v0.2" \
--env mainembedding="BAAI/bge-small-en-v1.5" \
-v /var/run/docker.sock:/var/run/docker.sock:z \
maadsdocker/tml-privategpt-with-gpu-nvidia-amd64-v2:latest
  1. Suggested VRAM/GPU should be around 24GB

  2. SSD 2-3 TB

  3. Suggested Machine: On-demand 1x NVIDIA A10

  4. Suggested Cost GPU/Hour: $0.75/GPU/h

AMD64: Advanced Model Version 3

docker run -d -p 8001:8001 --net=host --gpus all \
--env PORT=8001 --env TSS=0 --env GPU=1 \
--env COLLECTION=tml --env WEB_CONCURRENCY=2 \
--env CUDA_VISIBLE_DEVICES=0 --env TOKENIZERS_PARALLELISM=false \
--env temperature=0.1 --env vectorsearchtype=cosine \
--env contextwindowsize=4096 --env vectordimension=768 \
--env mainmodel="mistralai/Mistral-7B-Instruct-v0.3" \
--env mainembedding="BAAI/bge-base-en-v1.5" \
-v /var/run/docker.sock:/var/run/docker.sock:z \
maadsdocker/tml-privategpt-with-gpu-nvidia-amd64-v3
  1. Suggested VRAM/GPU should be around 24GB

  2. SSD 2-3 TB

  3. Suggested Machine: On-demand 1x NVIDIA A10

  4. Suggested Cost GPU/Hour: $0.75/GPU/h

AMD64: Large Advanced Model Version 3

docker run -d -p 8001:8001 --net=host --gpus all \
--env PORT=8001 --env TSS=0 --env GPU=1 \
--env COLLECTION=tml --env WEB_CONCURRENCY=2 \
--env CUDA_VISIBLE_DEVICES=0 --env TOKENIZERS_PARALLELISM=false \
--env temperature=0.1 --env vectorsearchtype=cosine \
--env contextwindowsize=4096 --env vectordimension=1024 \
--env mainmodel="mistralai/Mistral-7B-Instruct-v0.3" \
--env mainembedding="BAAI/bge-m3" \
-v /var/run/docker.sock:/var/run/docker.sock:z \
maadsdocker/tml-privategpt-with-gpu-nvidia-amd64-v3-large
  1. Suggested VRAM/GPU should be around 40GB

  2. SSD 2-3 TB

  3. Suggested Machine: On-demand 1x NVIDIA A6000 or A100

  4. Suggested Cost GPU/Hour: $0.80 - $1.30/GPU/h

31. TML and Agentic AI Special Container

For TML and Agentic AI solutions users must you the following container

    • AMD64: Agentic AI Llama3 with Ollama Server

      docker run -d -p 8001:8001 --net=host --gpus all --env PORT=8001 \
      --env TSS=0 \
      --env GPU=1 \
      --env COLLECTION=tml \
      --env WEB_CONCURRENCY=2 \
      --env CUDA_VISIBLE_DEVICES=0 \
      --env TOKENIZERS_PARALLELISM=false \
      --env temperature=0.1 \
      --env vectorsearchtype=cosine \
      --env contextwindowsize=4096 \
      --env vectordimension=384 \
      --env mainembedding="nomic-embed-text" \
      -v /var/run/docker.sock:/var/run/docker.sock:z \
      --env LLAMAMODEL=llama3.2 \
      --env OLLAMASERVERPORT="http://localhost:11434" \
      maadsdocker/tml-privategpt-with-gpu-nvidia-amd64-llama3-tools
      
      1. Suggested VRAM/GPU should be around 20GB

      2. SSD 2-3 TB

      3. Suggested Machine: On-demand 1x NVIDIA A10

      4. Suggested Cost GPU/Hour: $0.75/GPU/h

32. Test The Ollama Model for GPU

To test your model - at the Linux prompt type:

curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Hello, how are you?",
  "stream": false,
  "keep_alive": -1
}'

To check whether Ollama is using GPU or CPU type the following:

ollama ps

# You should see something like this: if you have CPU (Note your ID may be different):
#NAME               ID              SIZE      PROCESSOR    CONTEXT    UNTIL
#llama3.2:latest    a80c4f17acd5    2.5 GB    100% GPU     4096       Forever

33. Ollama LLM Model for CPU

Sometimes you may not have access to a NVidia GPU - in this case you can configure Ollama for CPU ONLY use running in Linux (Ubuntu). Follow these steps:

Note

If MAC user replace:
  • maadsdocker/tml-privategpt-with-cpu-amd64-llama3-tools

  • with this one: maadsdocker/tml-privategpt-with-cpu-arm64-llama3-tools

  1. Run the Ollama Container

docker run -d -p 8001:8001 --net=host --gpus all --env PORT=8001 \
--env TSS=0 \
--env GPU=0 \
--env COLLECTION=tml \
--env WEB_CONCURRENCY=2 \
--env CUDA_VISIBLE_DEVICES=0 \
--env TOKENIZERS_PARALLELISM=false \
--env temperature=0.1 \
--env vectorsearchtype=cosine \
--env contextwindowsize=4096 \
--env vectordimension=384 \
--env mainembedding="nomic-embed-text" \
-v /var/run/docker.sock:/var/run/docker.sock:z \
--env LLAMAMODEL=llama3.2 \
--env OLLAMASERVERPORT="http://localhost:11434" \
maadsdocker/tml-privategpt-with-cpu-amd64-llama3-tools
  1. Confirm the Ollam server is running with the LLAMAMODEL=llama3.2 by typing:

 # Note you may need to install ollama snap
 #sudo snap install ollama

 ollama ps

# You should see something like this: if you have CPU (Note your ID may be different):
#NAME               ID              SIZE      PROCESSOR    CONTEXT    UNTIL
#llama3.2:latest    a80c4f17acd5    2.5 GB    100% CPU     4096       Forever
  1. Test the CPU LLM Model For a Response:

     curl http://localhost:11434/api/generate -d '{
      "model": "llama3.2",
      "prompt": "Hello from CPU!",
      "stream": false
    }'
    

RESPONSE FROM AI (You should see something similar): {“model”:”llama3.2”,”created_at”:”2026-02-11T00:20:48.600152776Z”,”response”:”Hello from the other side… of the digital realm!nnCPU (Central Processing Unit) is a bit unconventional, but I’ll play along. So, what’s on your digital mind today? Want to discuss some computational conundrums or just swap some bytes?”,”done”:true,”done_reason”:”stop”,”context”: [128006,9125,128007,271,38766,1303,33025,2696,25,6790,220,2366,18,271,128009,128006, 882,128007,271,9906,505,14266,0,128009,128006,78191, 128007,271,9906,505,279,1023,3185,1131,315,279,7528,22651,2268,32715,320,44503,29225, 8113,8,374,264,2766,73978,11,719,358,3358,1514,3235,13,2100, 11,1148,596,389,701,7528,4059,3432,30,24133,311,4358,1063,55580,390,1263,81,6370, 477,1120,14626,1063,5943,30],”total_duration”:3222377627,”load_duration”: 103360250,”prompt_eval_count”:29,”prompt_eval_duration”:1 223374672,”eval_count”:54,”eval_duration”:1853177251}

  1. If you receive a response - you now have a CPU LLM successfully running: llama3.2!

Tip

You can switch between Llama 3.1 and Llama 3.2 models by updating the:

  • –env LLAMAMODEL=llama3.2

  • You can also use ANY other TOOLS models from Ollama.com (see figure below)

Ollama server host and port can be updated by updating the:

To use models other models go to Ollama.com and search tools

_images/agentic5.png

34. TML and Vision Models

You can use the Llava vision models by setting the --env LLAMAMODEL= with the following:

  • --env LLAMAMODEL=llava:7b

  • --env LLAMAMODEL=llava:13b

  • --env LLAMAMODEL=llava:34b

The general reference architecture shows how TML connects to Ollama server container and Video ChatGPT in real-time to process images and Videos:

Note

VideoChatGPT uses Vicuna v1.1

_images/ollama2.png

Note

All images must be base64 decoded - see code below in section TML and Vision Models: Sample Code

You must have the Ollama server container running:

docker run -d -p 8001:8001 --net=host --gpus all --env PORT=8001 \
--env TSS=0 \
--env GPU=1 \
--env COLLECTION=tml \
--env WEB_CONCURRENCY=2 \
--env CUDA_VISIBLE_DEVICES=0 \
--env TOKENIZERS_PARALLELISM=false \
--env temperature=0.1 \
--env vectorsearchtype=cosine \
--env contextwindowsize=4096 \
--env vectordimension=384 \
--env mainembedding="nomic-embed-text" \
-v /var/run/docker.sock:/var/run/docker.sock:z \
--env LLAMAMODEL=llava:7b \
--env OLLAMASERVERPORT="http://localhost:11434" \
maadsdocker/tml-privategpt-with-gpu-nvidia-amd64-llama3-tools

34.1. TML and Vision Models: Sample Code

import base64
import requests

def base64encodeimage(imagefile):
     with open(imagefile, "rb") as image_file:
         data = base64.b64encode(image_file.read())

     return data

def base64ToString(b):
    return b.decode("utf-8")


def describeimage(imgname):

    imgdata=base64encodeimage(imgname)

    headers = {
        'Content-Type': 'application/x-www-form-urlencoded',
    }

    data = '{\n  "model": "llava:7b",\n  "prompt":"What is in this picture?",\n  "stream": false,\n  "images": ["'+base64ToString(imgdata)+'"]\n}'

    response = requests.post('http://localhost:11434/api/generate', headers=headers, data=data)

    print(response.text)

    return response

describeimage("./image2.png")

35. TML and Video ChatGPT

Users can call video chatGPT container as follows:

35.1. Docker Run Command

docker run --gpus all -d -p 7900:7900 \
--net=host --env CUDA_VISIBLE_DEVICES=0 \
--env VIDEOGPTPORT=7900 \
-v /mnt/c/sample_videos:/VideoChatGPT/videofile:z \
--env VIDEOGPTFOLDER=/VideoChatGPT/videofile maadsdocker/tml-videochatgpt-nvidia-gpu-amd64

Note

NOTE: Details on the Docker run command:

  • -p 7900:7900: This is port forwarding port 7900 host port to the container port 7900

  • VIDEOGPTPORT=7900: This enables the API to connect to video chatgpt on port 7900

  • -v /mnt/c/sample_videos:/VideoChatGPT/videofile:z: All your video files need to be stored on the host machine, the Docker container maps this host folder to the container folder for video retrieval

  • VIDEOGPTFOLDER=/VideoChatGPT/videofile : This is the container video folder

  • NOTE: You need to drop the mp4 files on your host folder that is mapped to the container folder.

35.2. Video ChatGPT Sample Code

import maadstml # import the maadstml python library: pip install maadstml

################### NOTE: This will only work if Video Chatgpt is running on a machine with NVidia GPU and Cuda toolkit installed

def videochat(videofilename):

    url='http://127.0.0.1'            # IP video chatgpt is listening on

    port='7900'                       # Port Video chatgpt is listening on

    filename=videofilename          # Video file name

    responsefolder='sample-videos'   # folder, video chatgpt will write out the responses to

    temperature=0.1                   # temperature - varies between 0-1, closer to 0 more conservative the responses

    max_output_tokens=512             #  max tokens or words returned

    prompt='What is this video about? Is there anything strange about this video?'  # prompts to ask video chatgot about the video

#Load video chatgpt
    ret=maadstml.videochatloadresponse(url,port,filename,prompt,responsefolder,temperature,max_output_tokens)
    print(ret)

#CALL Video chat gpt container - you can put this in a loop and analyse several videos at once with multiple containers
videofilename = 'sample_6.mp4'

ret = videochat(videofilename) # returns the response file name
print(ret)

36. TML API for GenAI Using MAADSTML Python Library

TML solutions can be built to access GPT technology in real-time using the MAADSTML python library functions:

MAADSTML Python Function

Description

pgptingestdocs

Set Context for PrivateGPT by ingesting PDFs

or text documents. All responses will then use

these documents for context.

pgptgetingestedembeddings

After documents are ingested, you can retrieve

the embeddings for the ingested documents. These

embeddings allow you to filter the documents

for specific context.

pgptchat

Send any prompt to privateGPT

(with or without context) and get back a response.

pgptdeleteembeddings

Delete embeddings.

pgpthealth

Check the health of the privateGPT http server.

36.1. GenAI With STEP 9

Several powerful, real-time, AI analysis can be performed with STEP 9: PrivateGPT and Qdrant Integration: tml-system-step-9-privategpt_qdrant-dag

These are the following:

  1. Perform post-analyis on TML output with GenAI

  2. Use Qdrant vector DB, to use local documents, for querying with GenAI

  3. Scale GenAI with privateGPT for secure, local, and quality AI analysis.

Tip

Take a look here TML, PrivateGPT and Qdrant Example Scenarios for more information.

36.2. TML and RAG: A Powerful Combination

TML using STEP 9: PrivateGPT and Qdrant Integration: tml-system-step-9-privategpt_qdrant-dag can perform RAG (Retrieval-augmented Generation) with a few simple configurations.

Below is a figure to show Advanced RAG model (inspiration from huggingface blog) to ingest Engineering documents for real-time prompting using one of the privateGPT containers. Together with Qdrant vector DB, users can analyse local files with TML in real-time with no-code just configurations of Step 9.

Important

This would be very useful especially for Cybersecurity uses cases where you want to cross-reference source IP address with web log files to determine if there are any “authentication failures” or “wrong passwords” in the log files associated to the source IP address.

Together with Qdrant vector DB, users can analyse local files with TML in real-time with no-code just configurations of Step 9, in few seconds.

_images/rag.png

The incorporation of RAG with TML for real-time cybersecurity analysis of log files is demonstrated in Cybersecurity Solution with PrivateGPT, MQTT, HiveMQ

36.3. Private GPT Container

More privateGPT containers can be found here: PrivateGPT Special Containers. The container will require a NVIDIA GPU.

docker pull maadsdocker/tml-privategpt-with-gpu-nvidia-amd64
docker run -d -p 8001:8001 --net=host --gpus all \
--env PORT=8001 --env TSS=0 --env GPU=1 \
--env COLLECTION=tml --env WEB_CONCURRENCY=2 \
--env CUDA_VISIBLE_DEVICES=0 --env TOKENIZERS_PARALLELISM=false \
--env temperature=0.1 --env vectorsearchtype=cosine \
--env contextwindowsize=4096 --env vectordimension=384 \
--env mainmodel="TheBloke/Mistral-7B-Instruct-v0.1-GGUF" \
--env mainembedding="BAAI/bge-small-en-v1.5" \
maadsdocker/tml-privategpt-with-gpu-nvidia-amd64:latest

Tip

To check if privateGPT is running enter this in your browser: http://localhost:8001

You should see the private GPT website below.

_images/pgpt1.png

Note

If you set WEB_CONCURRENCY greater than 1, you will need Qdrant Vector DB running (see below)

36.4. PrivateGPT Container With NO GPU

Tip

If you do not have a Nvidia GPU you can use the docker container with NO GPU:

docker run -d -p 8001:8001 –env PORT=8001 –env GPU=0 –env CUDA_VISIBLE_DEVICES=0 maadsdocker/tml-privategpt-no-gpu-amd64

36.4.1. Installing CUDA For NVIDIA GPU

Important

It is highly recommended that users run the privateGPT container using the NVIDIA GPU for FASTER performance.

If you have a NVIDIA GPU you must install the CUDA Software Development Kit in your Linux environment.

To confirm your GPU card is recognized in Linux type: nvidia-smi - You should see an image similar to below.

_images/nvidia.png

36.4.2. NVIDIA Common Issues

Important

If you run Docker or Minikube with the --gpus all flag and see an ERROR message like:

docker: Error response from daemon: could not select device driver “” with capabilities: [[gpu]].

Then run the following:

sudo nvidia-ctk runtime configure --runtime=docker

sudo systemctl restart docker

Attention

Make sure to STOP the TSS Container and other containers before running Kubernetes/Minikube.

If you get the following WARNING from Kubernetes:

Warning FailedScheduling 13m default-scheduler 0/1 nodes are available: 1 Insufficient nvidia.com/gpu. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod.

Issue the commands below:

sudo apt update && sudo apt install -y nvidia-docker2

sudo nvidia-ctk runtime configure --runtime=docker

sudo systemctl restart docker

36.5. To Enable GPU in Kubernetes

You can apply the following YML file to the Kubernetes cluster to enable GPU support.

kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.12.3/nvidia-device-plugin.yml

Also see section: NVIDIA GPU On Windows WSL

36.6. Accessing PrivateGPT With MAADSTML Python API

Once you have the PrivateGPT container running you can access it using the maadstml API. Here is some sample Python code to access the privateGPT container:

Note

Since PrivateGPT is compatible with REST API, you can use any programming language, and take advantage of free, and fast AI.

import maadstml
import json

def sendpromptgpt(prompt,pgptip,pgptport):
  pgptendpoint="/v1/completions"
  includesources=False
  docfilter=""
  context=False

  try:
    response=maadstml.pgptchat(prompt,context,docfilter,pgptport,includesources,pgptip,pgptendpoint)
    jb=json.loads(response)
    response=jb['choices'][0]['message']['content']

  except Exception as e:
   print("ERROR: connecting to PrivateGPT=",e)
   return ""

  return response

def setupprompt():
     pgptip="http://127.0.0.1"
     pgptport="8001"

     prompt="Who is the prime minister of Canada?"
     message=sendpromptgpt(prompt,pgptip,pgptport)

Details of LLM Used in privateGPT Container

llm_load_print_meta: format = GGUF V2

llm_load_print_meta: arch = llama

llm_load_print_meta: vocab type = SPM

llm_load_print_meta: n_vocab = 32000

llm_load_print_meta: n_merges = 0

llm_load_print_meta: n_ctx_train = 32768

llm_load_print_meta: n_embd = 4096

llm_load_print_meta: n_head = 32

llm_load_print_meta: n_head_kv = 8

llm_load_print_meta: n_layer = 32

llm_load_print_meta: n_rot = 128

llm_load_print_meta: n_gqa = 4

llm_load_print_meta: f_norm_eps = 0.0e+00

llm_load_print_meta: f_norm_rms_eps = 1.0e-05

llm_load_print_meta: f_clamp_kqv = 0.0e+00

llm_load_print_meta: f_max_alibi_bias = 0.0e+00

llm_load_print_meta: n_ff = 14336

llm_load_print_meta: rope scaling = linear

llm_load_print_meta: freq_base_train = 10000.0

llm_load_print_meta: freq_scale_train = 1

llm_load_print_meta: n_yarn_orig_ctx = 32768

llm_load_print_meta: rope_finetuned = unknown

llm_load_print_meta: model type = 7B

llm_load_print_meta: model ftype = mostly Q4_K - Medium

llm_load_print_meta: model params = 7.24 B

llm_load_print_meta: model size = 4.07 GiB (4.83 BPW)

llm_load_print_meta: general.name = mistralai_mistral-7b-instruct-v0.2

llm_load_print_meta: BOS token = 1 ‘’

llm_load_print_meta: EOS token = 2 ‘’

llm_load_print_meta: UNK token = 0 ‘’

llm_load_print_meta: LF token = 13 ‘<0x0A>’

llm_load_tensors: ggml ctx size = 0.11 MB

llm_load_tensors: mem required = 4165.47 MB

36.7. Qdrant Vector Database

The privateGPT is also integrated with Qdrant Vector DB

docker run -d -p 6333:6333 -v $(pwd)/qdrant_storage:/qdrant/storage:z qdrant/qdrant

Tip

After running the container, to access the Qdrant dashboard enter the following URL in your browser:

http://localhost:6333/dashboard