29. TML and Generative AI
TML uses privateGPT containers (discussed below) for secure, fast, and distributed AI.
Attention
These containers depend on NVIDIA GPU cards, up to the 5090.
Containers are compatible with CUDA versions up to 12.8.
Containers will run on AMD64 and ARM64 chip architectures.
They also require the Qdrant vector DB. The following Docker run command installs Qdrant locally for TML integration:
docker run -d -p 6333:6333 -v $(pwd)/qdrant_storage:/qdrant/storage:z qdrant/qdrant
Note
TML solution developers have several privateGPT container options. These range from small to large models, with varying GPU memory requirements.
These models and containers are listed in the table below.
Tip
These models all follow a llama2 prompt style. See here for more details.
30. PrivateGPT Special Containers
TML-privateGPT Container |
GPU Suggested Requirements |
docker run -d -p 8001:8001 --net=host --gpus all \
--env PORT=8001 --env TSS=0 --env GPU=1 \
--env COLLECTION=tml --env WEB_CONCURRENCY=2 \
--env CUDA_VISIBLE_DEVICES=0 --env TOKENIZERS_PARALLELISM=false \
--env temperature=0.1 --env vectorsearchtype=cosine \
--env contextwindowsize=4096 --env vectordimension=384 \
--env mainmodel="TheBloke/Mistral-7B-Instruct-v0.1-GGUF" \
--env mainembedding="BAAI/bge-small-en-v1.5" \
-v /var/run/docker.sock:/var/run/docker.sock:z \
maadsdocker/tml-privategpt-with-gpu-nvidia-amd64:latest
|
|
docker run -d -p 8001:8001 --net=host --gpus all \
--env PORT=8001 --env TSS=0 --env GPU=1 \
--env COLLECTION=tml --env WEB_CONCURRENCY=2 \
--env CUDA_VISIBLE_DEVICES=0 --env TOKENIZERS_PARALLELISM=false \
--env temperature=0.1 --env vectorsearchtype=cosine \
--env contextwindowsize=4096 --env vectordimension=384 \
--env mainmodel="mistralai/Mistral-7B-Instruct-v0.2" \
--env mainembedding="BAAI/bge-small-en-v1.5" \
-v /var/run/docker.sock:/var/run/docker.sock:z \
maadsdocker/tml-privategpt-with-gpu-nvidia-amd64-v2:latest
|
|
AMD64: Advanced Model Version 3
docker run -d -p 8001:8001 --net=host --gpus all \
--env PORT=8001 --env TSS=0 --env GPU=1 \
--env COLLECTION=tml --env WEB_CONCURRENCY=2 \
--env CUDA_VISIBLE_DEVICES=0 --env TOKENIZERS_PARALLELISM=false \
--env temperature=0.1 --env vectorsearchtype=cosine \
--env contextwindowsize=4096 --env vectordimension=768 \
--env mainmodel="mistralai/Mistral-7B-Instruct-v0.3" \
--env mainembedding="BAAI/bge-base-en-v1.5" \
-v /var/run/docker.sock:/var/run/docker.sock:z \
maadsdocker/tml-privategpt-with-gpu-nvidia-amd64-v3
|
|
AMD64: Large Advanced Model Version 3
docker run -d -p 8001:8001 --net=host --gpus all \
--env PORT=8001 --env TSS=0 --env GPU=1 \
--env COLLECTION=tml --env WEB_CONCURRENCY=2 \
--env CUDA_VISIBLE_DEVICES=0 --env TOKENIZERS_PARALLELISM=false \
--env temperature=0.1 --env vectorsearchtype=cosine \
--env contextwindowsize=4096 --env vectordimension=1024 \
--env mainmodel="mistralai/Mistral-7B-Instruct-v0.3" \
--env mainembedding="BAAI/bge-m3" \
-v /var/run/docker.sock:/var/run/docker.sock:z \
maadsdocker/tml-privategpt-with-gpu-nvidia-amd64-v3-large
|
|
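In the commands above, the vectordimension value changes together with the mainembedding model (384, 768, and 1024). The sketch below records that pairing and refuses to build the environment arguments when the two disagree. The helper name build_env_args is hypothetical, and only the three embedding models listed in this section are covered.

```python
# Embedding model -> vector dimension, copied from the docker run commands above.
EMBEDDING_DIMENSIONS = {
    "BAAI/bge-small-en-v1.5": 384,
    "BAAI/bge-base-en-v1.5": 768,
    "BAAI/bge-m3": 1024,
}

def build_env_args(mainmodel, mainembedding, vectordimension):
    """Return the model-related --env arguments, checking the dimension first."""
    expected = EMBEDDING_DIMENSIONS.get(mainembedding)
    if expected is None:
        raise ValueError(f"unknown embedding model: {mainembedding}")
    if expected != vectordimension:
        raise ValueError(
            f"{mainembedding} produces {expected}-dimensional vectors, "
            f"but vectordimension={vectordimension}"
        )
    return [
        "--env", f"mainmodel={mainmodel}",
        "--env", f"mainembedding={mainembedding}",
        "--env", f"vectordimension={vectordimension}",
    ]

args = build_env_args("mistralai/Mistral-7B-Instruct-v0.3", "BAAI/bge-m3", 1024)
print(" ".join(args))
```

A check like this is useful because a mismatched dimension usually only surfaces later, when vectors are written to the vector DB.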
31. TML and Agentic AI Special Container
For TML and Agentic AI solutions, users must use the following container:
AMD64: Agentic AI Llama3 with Ollama Server
Embedding: nomic-embed-text
Vector Dimension: n/a
Docker Run Command for AMD64 Container:
docker run -d -p 8001:8001 --net=host --gpus all --env PORT=8001 \
--env TSS=0 \
--env GPU=1 \
--env COLLECTION=tml \
--env WEB_CONCURRENCY=2 \
--env CUDA_VISIBLE_DEVICES=0 \
--env TOKENIZERS_PARALLELISM=false \
--env temperature=0.1 \
--env vectorsearchtype=cosine \
--env contextwindowsize=4096 \
--env vectordimension=384 \
--env mainembedding="nomic-embed-text" \
-v /var/run/docker.sock:/var/run/docker.sock:z \
--env LLAMAMODEL=llama3.2 \
--env OLLAMASERVERPORT="http://localhost:11434" \
maadsdocker/tml-privategpt-with-gpu-nvidia-amd64-llama3-tools
Suggested VRAM/GPU should be around 20GB
SSD 2-3 TB
Suggested Machine: On-demand 1x NVIDIA A10
Suggested Cost GPU/Hour: $0.75/GPU/h
32. Test The Ollama Model for GPU
To test your model - at the Linux prompt type:
curl http://localhost:11434/api/generate -d '{
"model": "llama3.2",
"prompt": "Hello, how are you?",
"stream": false,
"keep_alive": -1
}'
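The same request can be sent from Python. This is a minimal sketch using only the standard library; it assumes the Ollama server is listening on http://localhost:11434 as in the curl command, and the function names are illustrative. The request is built separately from the call that sends it, so the payload can be inspected without a running server.

```python
import json
import urllib.request

def build_generate_request(prompt, model="llama3.2", host="http://localhost:11434"):
    """Build (but do not send) the POST request for Ollama's /api/generate."""
    payload = {"model": model, "prompt": prompt, "stream": False, "keep_alive": -1}
    return urllib.request.Request(
        f"{host}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def generate(prompt, timeout=120):
    """Send the request and return the text in the 'response' field."""
    request = build_generate_request(prompt)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read())["response"]
```

Calling generate("Hello, how are you?") should return the model's reply as a string once the container is up.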
To check whether Ollama is using GPU or CPU type the following:
ollama ps
# You should see something like this if Ollama is using the GPU (note your ID may be different):
#NAME ID SIZE PROCESSOR CONTEXT UNTIL
#llama3.2:latest a80c4f17acd5 2.5 GB 100% GPU 4096 Forever
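If you want to check the PROCESSOR column from a script, the output of ollama ps can be parsed. The sketch below is an assumption-laden example: it only recognises the "100% GPU" form shown above, treats any other value (including a mixed CPU/GPU split) as not fully on GPU, and the function name is hypothetical.

```python
import re

def fully_on_gpu(ps_output, model="llama3.2"):
    """Return True if the model's row shows 100% GPU, False if not, None if the model is not listed."""
    for line in ps_output.splitlines():
        line = line.lstrip("#").strip()
        if line.startswith(model):
            return re.search(r"\b100%\s+GPU\b", line) is not None
    return None

# Sample rows in the same shape as the output shown above.
gpu_sample = "NAME ID SIZE PROCESSOR CONTEXT UNTIL\nllama3.2:latest a80c4f17acd5 2.5 GB 100% GPU 4096 Forever"
cpu_sample = "NAME ID SIZE PROCESSOR CONTEXT UNTIL\nllama3.2:latest a80c4f17acd5 2.5 GB 100% CPU 4096 Forever"

print(fully_on_gpu(gpu_sample))
print(fully_on_gpu(cpu_sample))
```

In practice the text would come from running ollama ps, for example through subprocess.run with capture_output=True and text=True.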
33. Ollama LLM Model for CPU
Sometimes you may not have access to an NVIDIA GPU. In this case you can configure Ollama for CPU-only use on Linux (Ubuntu). Follow these steps:
Note
- If you are a Mac user, replace:
maadsdocker/tml-privategpt-with-cpu-amd64-llama3-tools
with this one: maadsdocker/tml-privategpt-with-cpu-arm64-llama3-tools
Run the Ollama Container
docker run -d -p 8001:8001 --net=host --gpus all --env PORT=8001 \
--env TSS=0 \
--env GPU=0 \
--env COLLECTION=tml \
--env WEB_CONCURRENCY=2 \
--env CUDA_VISIBLE_DEVICES=0 \
--env TOKENIZERS_PARALLELISM=false \
--env temperature=0.1 \
--env vectorsearchtype=cosine \
--env contextwindowsize=4096 \
--env vectordimension=384 \
--env mainembedding="nomic-embed-text" \
-v /var/run/docker.sock:/var/run/docker.sock:z \
--env LLAMAMODEL=llama3.2 \
--env OLLAMASERVERPORT="http://localhost:11434" \
maadsdocker/tml-privategpt-with-cpu-amd64-llama3-tools
Confirm the Ollama server is running with LLAMAMODEL=llama3.2 by typing:
# Note you may need to install ollama snap
#sudo snap install ollama
ollama ps
# You should see something like this if Ollama is using the CPU (note your ID may be different):
#NAME ID SIZE PROCESSOR CONTEXT UNTIL
#llama3.2:latest a80c4f17acd5 2.5 GB 100% CPU 4096 Forever
Test the CPU LLM Model For a Response:
curl http://localhost:11434/api/generate -d '{ "model": "llama3.2", "prompt": "Hello from CPU!", "stream": false }'
RESPONSE FROM AI (You should see something similar):
{"model":"llama3.2","created_at":"2026-02-11T00:20:48.600152776Z","response":"Hello from the other side… of the digital realm!\n\nCPU (Central Processing Unit) is a bit unconventional, but I'll play along. So, what's on your digital mind today? Want to discuss some computational conundrums or just swap some bytes?","done":true,"done_reason":"stop","context":[128006,9125,128007,271,38766,1303,33025,2696,25,6790,220,2366,18,271,128009,128006,882,128007,271,9906,505,14266,0,128009,128006,78191,128007,271,9906,505,279,1023,3185,1131,315,279,7528,22651,2268,32715,320,44503,29225,8113,8,374,264,2766,73978,11,719,358,3358,1514,3235,13,2100,11,1148,596,389,701,7528,4059,3432,30,24133,311,4358,1063,55580,390,1263,81,6370,477,1120,14626,1063,5943,30],"total_duration":3222377627,"load_duration":103360250,"prompt_eval_count":29,"prompt_eval_duration":1223374672,"eval_count":54,"eval_duration":1853177251}
If you receive a response - you now have a CPU LLM successfully running: llama3.2!
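The eval_count and eval_duration fields in the response give a rough generation speed. Ollama reports durations in nanoseconds, so tokens per second is eval_count divided by eval_duration, multiplied by 1e9. The short sketch below applies this to the two values from the sample response; the function name is illustrative.

```python
def tokens_per_second(response):
    """Generation speed from an Ollama /api/generate response (durations are in nanoseconds)."""
    return response["eval_count"] / response["eval_duration"] * 1e9

# Values copied from the sample CPU response above.
sample = {"eval_count": 54, "eval_duration": 1853177251}
speed = tokens_per_second(sample)
print(f"{speed:.1f} tokens/s")  # prints "29.1 tokens/s"
```

Running the same calculation on a GPU response gives a simple way to compare the CPU and GPU containers on your own hardware.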
Tip
You can switch between Llama 3.1 and Llama 3.2 models by updating:
--env LLAMAMODEL=llama3.2
You can also use ANY other TOOLS model from Ollama.com (see figure below).
The Ollama server host and port can be changed by updating:
--env OLLAMASERVERPORT="http://localhost:11434"
To use other models, go to Ollama.com and search for tools.
34. TML and Vision Models
You can use the Llava vision models by setting the --env LLAMAMODEL= with the following:
--env LLAMAMODEL=llava:7b
--env LLAMAMODEL=llava:13b
--env LLAMAMODEL=llava:34b
The general reference architecture shows how TML connects to the Ollama server container and Video ChatGPT in real time to process images and videos:
Note
VideoChatGPT uses Vicuna v1.1
Note
All images must be base64 encoded - see code below in section TML and Vision Models: Sample Code
You must have the Ollama server container running:
docker run -d -p 8001:8001 --net=host --gpus all --env PORT=8001 \
--env TSS=0 \
--env GPU=1 \
--env COLLECTION=tml \
--env WEB_CONCURRENCY=2 \
--env CUDA_VISIBLE_DEVICES=0 \
--env TOKENIZERS_PARALLELISM=false \
--env temperature=0.1 \
--env vectorsearchtype=cosine \
--env contextwindowsize=4096 \
--env vectordimension=384 \
--env mainembedding="nomic-embed-text" \
-v /var/run/docker.sock:/var/run/docker.sock:z \
--env LLAMAMODEL=llava:7b \
--env OLLAMASERVERPORT="http://localhost:11434" \
maadsdocker/tml-privategpt-with-gpu-nvidia-amd64-llama3-tools
34.1. TML and Vision Models: Sample Code
import base64
import requests

def base64encodeimage(imagefile):
    # Read the image file and return its contents as base64-encoded bytes
    with open(imagefile, "rb") as image_file:
        data = base64.b64encode(image_file.read())
    return data

def base64ToString(b):
    # Convert the base64 bytes to a string so they can be embedded in JSON
    return b.decode("utf-8")

def describeimage(imgname):
    imgdata=base64encodeimage(imgname)
    headers = {
        'Content-Type': 'application/json',  # the request body below is JSON
    }
    data = '{\n "model": "llava:7b",\n "prompt":"What is in this picture?",\n "stream": false,\n "images": ["'+base64ToString(imgdata)+'"]\n}'
    # Send the prompt and the encoded image to the Ollama server
    response = requests.post('http://localhost:11434/api/generate', headers=headers, data=data)
    print(response.text)
    return response

describeimage("./image2.png")
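As a variation on the sample above, the request body can be built with json.dumps instead of string concatenation, which takes care of quoting if the prompt itself contains quotes or newlines. This is a sketch only; build_vision_payload is a hypothetical helper, and the field names are the ones used in the sample code.

```python
import base64
import json

def build_vision_payload(image_path, prompt="What is in this picture?", model="llava:7b"):
    """Return the JSON body for /api/generate with the image embedded as base64 text."""
    with open(image_path, "rb") as image_file:
        encoded = base64.b64encode(image_file.read()).decode("utf-8")
    return json.dumps({
        "model": model,
        "prompt": prompt,
        "stream": False,
        "images": [encoded],
    })
```

The returned string can be passed as the data argument of requests.post in place of the hand-built string.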
35. TML and Video ChatGPT
Users can call the Video ChatGPT container as follows:
35.1. Docker Run Command
docker run --gpus all -d -p 7900:7900 \
--net=host --env CUDA_VISIBLE_DEVICES=0 \
--env VIDEOGPTPORT=7900 \
-v /mnt/c/sample_videos:/VideoChatGPT/videofile:z \
--env VIDEOGPTFOLDER=/VideoChatGPT/videofile maadsdocker/tml-videochatgpt-nvidia-gpu-amd64
Note
Details on the Docker run command:
-p 7900:7900: Forwards host port 7900 to container port 7900
VIDEOGPTPORT=7900: This enables the API to connect to Video ChatGPT on port 7900
-v /mnt/c/sample_videos:/VideoChatGPT/videofile:z: All your video files need to be stored on the host machine; Docker maps this host folder to the container folder for video retrieval
VIDEOGPTFOLDER=/VideoChatGPT/videofile: This is the container video folder
NOTE: You need to drop the mp4 files in the host folder that is mapped to the container folder.
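To make the folder mapping concrete, the sketch below lists the mp4 files in the host folder and shows the path each file has inside the container. The two folder constants are copied from the -v flag above; the helper names are hypothetical and not part of maadstml.

```python
from pathlib import Path, PurePosixPath

HOST_FOLDER = "/mnt/c/sample_videos"          # left side of the -v flag
CONTAINER_FOLDER = "/VideoChatGPT/videofile"  # right side of -v, same as VIDEOGPTFOLDER

def container_path(filename):
    """Path of a video as seen from inside the container."""
    return str(PurePosixPath(CONTAINER_FOLDER) / Path(filename).name)

def list_videos(host_folder=HOST_FOLDER):
    """Names of the mp4 files currently in the host folder (empty list if the folder is missing)."""
    folder = Path(host_folder)
    if not folder.is_dir():
        return []
    return sorted(p.name for p in folder.glob("*.mp4"))

for name in list_videos():
    print(name, "->", container_path(name))
```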
35.2. Video ChatGPT Sample Code
import maadstml # import the maadstml python library: pip install maadstml

################### NOTE: This will only work if Video ChatGPT is running on a machine with an NVIDIA GPU and the CUDA toolkit installed
def videochat(videofilename):
    url='http://127.0.0.1' # IP Video ChatGPT is listening on
    port='7900' # Port Video ChatGPT is listening on
    filename=videofilename # Video file name
    responsefolder='sample-videos' # folder Video ChatGPT will write the responses to
    temperature=0.1 # temperature - varies between 0-1; the closer to 0, the more conservative the responses
    max_output_tokens=512 # max tokens returned
    prompt='What is this video about? Is there anything strange about this video?' # prompt to ask Video ChatGPT about the video

    # Load the video and send the prompt to Video ChatGPT
    ret=maadstml.videochatloadresponse(url,port,filename,prompt,responsefolder,temperature,max_output_tokens)
    return ret

# CALL the Video ChatGPT container - you can put this in a loop and analyse several videos at once with multiple containers
videofilename = 'sample_6.mp4'
ret = videochat(videofilename) # returns the response file name
print(ret)
36. TML API for GenAI Using MAADSTML Python Library
TML solutions can be built to access GPT technology in real-time using the MAADSTML python library functions:
MAADSTML Python Function | Description
pgptingestdocs | Set context for privateGPT by ingesting PDFs or text documents. All responses will then use these documents for context.
pgptgetingestedembeddings | After documents are ingested, you can retrieve the embeddings for the ingested documents. These embeddings allow you to filter the documents for specific context.
pgptchat | Send any prompt to privateGPT (with or without context) and get back a response.
pgptdeleteembeddings | Delete embeddings.
pgpthealth | Check the health of the privateGPT HTTP server.
36.1. GenAI With STEP 9
Several powerful, real-time AI analyses can be performed with STEP 9: PrivateGPT and Qdrant Integration: tml-system-step-9-privategpt_qdrant-dag
These are the following:
Perform post-analysis on TML output with GenAI
Use the Qdrant vector DB to query local documents with GenAI
Scale GenAI with privateGPT for secure, local, and quality AI analysis.
Tip
Take a look here TML, PrivateGPT and Qdrant Example Scenarios for more information.
36.2. TML and RAG: A Powerful Combination
TML using STEP 9: PrivateGPT and Qdrant Integration: tml-system-step-9-privategpt_qdrant-dag can perform RAG (Retrieval-augmented Generation) with a few simple configurations.
The figure below shows an advanced RAG model (inspired by a Hugging Face blog post) that ingests engineering documents for real-time prompting using one of the privateGPT containers.
Important
This is especially useful for cybersecurity use cases where you want to cross-reference a source IP address with web log files to determine whether the log files contain any "authentication failures" or "wrong passwords" associated with that source IP address.
Together with the Qdrant vector DB, users can analyse local files with TML in real time, in a few seconds, with no code: only configuration of Step 9.
The incorporation of RAG with TML for real-time cybersecurity analysis of log files is demonstrated in Cybersecurity Solution with PrivateGPT, MQTT, HiveMQ
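The cross-referencing idea can be illustrated in a few lines of Python. This is not how Step 9 is implemented: in TML the retrieval is handled by privateGPT and Qdrant through configuration, using embeddings rather than keywords. The sketch below swaps in a plain keyword filter as a stand-in for the retrieval step, and the log lines, function names, and prompt wording are all invented for the example.

```python
import re

# Phrases taken from the scenario described above.
FAILURE_MARKERS = ("authentication failure", "wrong password")

def mentions_ip(line, source_ip):
    """True if the line contains exactly this IP (10.0.0.5 must not match 10.0.0.50)."""
    return re.search(rf"(?<!\d){re.escape(source_ip)}(?!\d)", line) is not None

def retrieve_log_lines(log_lines, source_ip):
    """Retrieval stand-in: keep lines that mention the IP and a failure marker."""
    return [line for line in log_lines
            if mentions_ip(line, source_ip)
            and any(marker in line.lower() for marker in FAILURE_MARKERS)]

def build_prompt(source_ip, evidence):
    """Augmentation step: place the retrieved lines in front of the question."""
    context = "\n".join(evidence) if evidence else "(no matching log lines)"
    return (f"Context:\n{context}\n\n"
            f"Question: Do these log lines suggest repeated failed logins from {source_ip}?")

# Hypothetical log lines, invented for this example.
logs = [
    "2024-01-01 10:00:01 sshd: authentication failure from 10.0.0.5",
    "2024-01-01 10:00:03 sshd: accepted login from 10.0.0.50",
    "2024-01-01 10:00:07 web: wrong password for admin from 10.0.0.5",
]
evidence = retrieve_log_lines(logs, "10.0.0.5")
print(build_prompt("10.0.0.5", evidence))
```

The resulting prompt is what would be sent to the LLM, so the model answers from the retrieved log lines rather than from its training data alone.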
36.3. Private GPT Container
More privateGPT containers can be found here: PrivateGPT Special Containers. The container requires an NVIDIA GPU.
docker pull maadsdocker/tml-privategpt-with-gpu-nvidia-amd64
docker run -d -p 8001:8001 --net=host --gpus all \
--env PORT=8001 --env TSS=0 --env GPU=1 \
--env COLLECTION=tml --env WEB_CONCURRENCY=2 \
--env CUDA_VISIBLE_DEVICES=0 --env TOKENIZERS_PARALLELISM=false \
--env temperature=0.1 --env vectorsearchtype=cosine \
--env contextwindowsize=4096 --env vectordimension=384 \
--env mainmodel="TheBloke/Mistral-7B-Instruct-v0.1-GGUF" \
--env mainembedding="BAAI/bge-small-en-v1.5" \
maadsdocker/tml-privategpt-with-gpu-nvidia-amd64:latest
Tip
To check if privateGPT is running enter this in your browser: http://localhost:8001
You should see the private GPT website below.
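The same check can be scripted. The sketch below assumes the container exposes privateGPT's GET /health endpoint and that it answers with a JSON body containing "status": "ok"; both are assumptions about the container, and the function names are illustrative. The maadstml pgpthealth function listed earlier serves the same purpose.

```python
import json
import urllib.request

def health_url(ip="http://127.0.0.1", port="8001"):
    """URL of the (assumed) privateGPT health endpoint."""
    return f"{ip}:{port}/health"

def is_healthy(ip="http://127.0.0.1", port="8001", timeout=5):
    """Return True only if the server answers and reports status 'ok'."""
    try:
        with urllib.request.urlopen(health_url(ip, port), timeout=timeout) as resp:
            return json.loads(resp.read()).get("status") == "ok"
    except (OSError, ValueError, AttributeError):
        # Connection refused, timeout, non-JSON body, or unexpected JSON shape
        return False
```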
Note
If you set WEB_CONCURRENCY greater than 1, you will need Qdrant Vector DB running (see below)
36.4. PrivateGPT Container With NO GPU
Tip
If you do not have an NVIDIA GPU you can use the docker container with NO GPU:
docker run -d -p 8001:8001 --env PORT=8001 --env GPU=0 --env CUDA_VISIBLE_DEVICES=0 maadsdocker/tml-privategpt-no-gpu-amd64
36.4.1. Installing CUDA For NVIDIA GPU
Important
It is highly recommended that users run the privateGPT container using an NVIDIA GPU for FASTER performance.
If you have an NVIDIA GPU you must install the CUDA Software Development Kit in your Linux environment.
To confirm your GPU card is recognized in Linux, type: nvidia-smi - You should see an image similar to the one below.
36.4.2. NVIDIA Common Issues
Important
If you run Docker or Minikube with the --gpus all flag and see an ERROR message like:
docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]].
Then run the following:
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
Attention
Make sure to STOP the TSS Container and other containers before running Kubernetes/Minikube.
If you get the following WARNING from Kubernetes:
Warning FailedScheduling 13m default-scheduler 0/1 nodes are available: 1 Insufficient nvidia.com/gpu. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod.
Issue the commands below:
sudo apt update && sudo apt install -y nvidia-docker2
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
36.5. To Enable GPU in Kubernetes
You can apply the following YML file to the Kubernetes cluster to enable GPU support.
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.12.3/nvidia-device-plugin.yml
Also see section: NVIDIA GPU On Windows WSL
36.6. Accessing PrivateGPT With MAADSTML Python API
Once you have the PrivateGPT container running you can access it using the maadstml API. Here is some sample Python code to access the privateGPT container:
Note
Since PrivateGPT exposes a REST API, you can use any programming language and take advantage of free, fast AI.
import maadstml
import json

def sendpromptgpt(prompt,pgptip,pgptport):
    pgptendpoint="/v1/completions"
    includesources=False
    docfilter=""
    context=False
    try:
        # Send the prompt to privateGPT and extract the text of the first choice
        response=maadstml.pgptchat(prompt,context,docfilter,pgptport,includesources,pgptip,pgptendpoint)
        jb=json.loads(response)
        response=jb['choices'][0]['message']['content']
    except Exception as e:
        print("ERROR: connecting to PrivateGPT=",e)
        return ""
    return response

def setupprompt():
    pgptip="http://127.0.0.1"
    pgptport="8001"
    prompt="Who is the prime minister of Canada?"
    message=sendpromptgpt(prompt,pgptip,pgptport)
    return message  # return the answer so the caller can use it
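The sample code reads the answer from choices[0].message.content. The sketch below isolates that step so the response shape is easy to see and to test without a running container. The sample JSON is a made-up body that only mirrors the keys used in the code above; a real response may carry more fields.

```python
import json

def extract_content(raw_response):
    """Return the assistant text from a privateGPT completion response, or '' if it is missing."""
    try:
        body = json.loads(raw_response)
        return body["choices"][0]["message"]["content"]
    except (ValueError, KeyError, IndexError, TypeError):
        return ""

# Hypothetical response body, limited to the keys the sample code relies on.
sample = json.dumps({"choices": [{"message": {"content": "Example answer"}}]})
print(extract_content(sample))
```

Catching only the parsing and lookup errors here keeps connection problems visible to the caller instead of folding them into an empty string.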
Details of LLM Used in privateGPT Container |
llm_load_print_meta: format = GGUF V2 |
llm_load_print_meta: arch = llama |
llm_load_print_meta: vocab type = SPM |
llm_load_print_meta: n_vocab = 32000 |
llm_load_print_meta: n_merges = 0 |
llm_load_print_meta: n_ctx_train = 32768 |
llm_load_print_meta: n_embd = 4096 |
llm_load_print_meta: n_head = 32 |
llm_load_print_meta: n_head_kv = 8 |
llm_load_print_meta: n_layer = 32 |
llm_load_print_meta: n_rot = 128 |
llm_load_print_meta: n_gqa = 4 |
llm_load_print_meta: f_norm_eps = 0.0e+00 |
llm_load_print_meta: f_norm_rms_eps = 1.0e-05 |
llm_load_print_meta: f_clamp_kqv = 0.0e+00 |
llm_load_print_meta: f_max_alibi_bias = 0.0e+00 |
llm_load_print_meta: n_ff = 14336 |
llm_load_print_meta: rope scaling = linear |
llm_load_print_meta: freq_base_train = 10000.0 |
llm_load_print_meta: freq_scale_train = 1 |
llm_load_print_meta: n_yarn_orig_ctx = 32768 |
llm_load_print_meta: rope_finetuned = unknown |
llm_load_print_meta: model type = 7B |
llm_load_print_meta: model ftype = mostly Q4_K - Medium |
llm_load_print_meta: model params = 7.24 B |
llm_load_print_meta: model size = 4.07 GiB (4.83 BPW) |
llm_load_print_meta: general.name = mistralai_mistral-7b-instruct-v0.2 |
llm_load_print_meta: BOS token = 1 '<s>' |
llm_load_print_meta: EOS token = 2 '</s>' |
llm_load_print_meta: UNK token = 0 '<unk>' |
llm_load_print_meta: LF token = 13 ‘<0x0A>’ |
llm_load_tensors: ggml ctx size = 0.11 MB |
llm_load_tensors: mem required = 4165.47 MB |
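Several of the values in this log are related by simple arithmetic, which is a quick way to sanity-check a model load. The sketch below recomputes three of them from the other numbers in the table; the variable names follow the log fields.

```python
GIB = 1024 ** 3  # bytes in one GiB

# Values copied from the log above.
model_size_gib = 4.07
model_params = 7.24e9
n_embd, n_head, n_head_kv = 4096, 32, 8

# Bits per weight: total model bits divided by the number of parameters.
bits_per_weight = model_size_gib * GIB * 8 / model_params
# Grouped-query attention factor: query heads per key/value head.
n_gqa = n_head // n_head_kv
# Per-head dimension, which equals n_rot in this log.
head_dim = n_embd // n_head

print(round(bits_per_weight, 2))  # 4.83, matches "4.83 BPW"
print(n_gqa)                      # 4, matches n_gqa
print(head_dim)                   # 128, matches n_rot
```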
36.7. Qdrant Vector Database
privateGPT is also integrated with the Qdrant vector DB. Run it with:
docker run -d -p 6333:6333 -v $(pwd)/qdrant_storage:/qdrant/storage:z qdrant/qdrant
Tip
After running the container, to access the Qdrant dashboard enter the following URL in your browser:
http://localhost:6333/dashboard
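Besides the dashboard, Qdrant can be queried over its REST API. The sketch below lists the collection names by calling GET /collections on the port published above. It assumes Qdrant is reachable at http://localhost:6333; whether a collection named after the COLLECTION=tml setting appears there depends on the privateGPT container and is not confirmed here.

```python
import json
import urllib.request

def collection_names(body):
    """Extract collection names from the JSON body returned by GET /collections."""
    return [collection["name"] for collection in body["result"]["collections"]]

def list_collections(host="http://localhost:6333", timeout=5):
    """Ask a running Qdrant instance for its collections."""
    with urllib.request.urlopen(f"{host}/collections", timeout=timeout) as resp:
        return collection_names(json.loads(resp.read()))
```

Calling list_collections() after documents have been ingested is a quick way to confirm that vectors are being stored in Qdrant.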