
AI assistant for BeagleBone using an LLM

  • Writer: Anup Halarnkar
  • Mar 23
  • 3 min read

Introduction

For this project, I have used llama.cpp as the local inference engine and TinyLlama-1.1B-Chat-v1.0 as the language model.


llama.cpp is a lightweight C/C++ inference framework designed to run LLMs locally with minimal setup across CPUs and GPUs. It is well suited for embedded and edge-oriented workflows because it supports efficient local execution without depending on cloud APIs.


The TinyLlama model used here is the chat-tuned 1.1B-parameter variant published on Hugging Face under the Apache 2.0 license. The model is available in GGUF format, which was introduced by the llama.cpp team as the successor to GGML, a format llama.cpp no longer supports.



For controlled tool-planning and summarization tasks, I launch the model in non-conversation mode so that it behaves like a constrained completion engine rather than an interactive chatbot.


In the llama.cpp tooling, the completion-style path is intended for prompt-to-output generation, while the chat-oriented tools enable conversation behavior and chat templates.

In my workflow, I use two prompt templates: one for the tool planner (planner_prompt.txt) and one for the summarizer (summarizer_prompt.txt).

[Image: Prompt template]


A typical launch pattern looks like this:

llama-cli.exe ^
 -m .\tinyllama.gguf ^
 -f .\planner_prompt.txt ^
 -jf .\planner_schema.json ^
 -n 160 ^
 --temp 0.1 ^
 --top-p 0.9 ^
 --simple-io ^
 -no-cnv ^
 --no-display-prompt ^
 --no-warmup

Here, the important options are:

  • -m loads the GGUF model
  • -f passes a prompt template from a file
  • -jf constrains the output with a JSON schema
  • -no-cnv disables conversation mode
  • --simple-io makes subprocess integration cleaner
  • --no-warmup reduces startup overhead during rapid testing
  • --temp 0.1 makes the model highly deterministic, choosing the most likely token almost every time (great for summarization)
  • --top-p 0.9 is a sampling safeguard that limits generation to the smallest set of tokens whose cumulative probability reaches 0.9

For summarization, the same pattern can be reused with a different template file:

llama-cli.exe ^
 -m .\tinyllama.gguf ^
 -f .\summarizer_prompt.txt ^
 -n 80 ^
 --temp 0.1 ^
 --top-p 0.9 ^
 --simple-io ^
 -no-cnv ^
 --no-display-prompt ^
 --no-warmup
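
Since --simple-io keeps stdin and stdout plain, both invocations are easy to drive from a host-side script. Here is a minimal Python sketch, assuming llama-cli.exe is on the PATH (run_llama is an illustrative helper, not part of llama.cpp):

import subprocess

def run_llama(prompt_file, schema_file=None, n_predict=160):
    """Run llama-cli as a one-shot completion engine and return its stdout."""
    cmd = [
        "llama-cli.exe",
        "-m", r".\tinyllama.gguf",
        "-f", prompt_file,
        "-n", str(n_predict),
        "--temp", "0.1",
        "--top-p", "0.9",
        "--simple-io",
        "-no-cnv",
        "--no-display-prompt",
        "--no-warmup",
    ]
    if schema_file:
        cmd += ["-jf", schema_file]  # constrain output with a JSON schema
    result = subprocess.run(cmd, capture_output=True, text=True, timeout=120)
    return result.stdout.strip()

plan_json = run_llama(r".\planner_prompt.txt", r".\planner_schema.json", n_predict=160)
summary = run_llama(r".\summarizer_prompt.txt", n_predict=80)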

This approach works well because the model is not asked to “know everything” about the target device. Instead, it performs two narrower jobs:

  1. Study the user query and select the correct tools

  2. Summarize the evidence returned by those tools

That makes the system much more reliable than allowing free-form generation. The ground truth about the Linux system still comes from deterministic commands executed over SSH, while TinyLlama is used mainly for interpretation and orchestration.



  1. System Architecture

[Image: System architecture]

  2. Key Components

2.1 Llama Builder


The builder uses an LLM to decide which diagnostic tools should run.

Example prompt:

User question: Is HDMI connected?

Allowed tools:
hdmi.status

Planner output:
{
 "tools": [
   {
     "name": "hdmi.status",
     "args": {}
   }
 ],
 "confidence": 0.95
}

The builder does not generate commands. It only selects from predefined tools. This prevents unsafe command execution.
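
On the host side, that guarantee can be enforced with a small validation step that parses the planner JSON and drops anything outside the allow-list. A sketch of the idea (ALLOWED_TOOLS and validate_plan are illustrative names, not the project's actual code):

import json

ALLOWED_TOOLS = {"hdmi.status"}  # extended with every registered tool

def validate_plan(raw_json):
    """Parse the planner's JSON output, keeping only allow-listed tools."""
    plan = json.loads(raw_json)
    tools = [t for t in plan.get("tools", []) if t.get("name") in ALLOWED_TOOLS]
    if not tools:
        raise ValueError("planner selected no known tool")
    return tools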


2.2 Tool Registry


Each diagnostic tool is defined in the engine.

Example: hdmi.status

Command executed on the device:

for f in /sys/class/drm/*HDMI*/status; do
 printf "%s: %s\n" "$f" "$(cat "$f")"
done

The registry maps the tool name to the command and output parser.
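
One natural shape for that mapping is a dictionary keyed by tool name, where each entry carries the shell snippet and a parser callable. A sketch of the idea (TOOL_REGISTRY and parse_hdmi_status are illustrative names; the parser is fleshed out in section 2.4):

def parse_hdmi_status(raw):
    """Turns raw hdmi.status output into structured evidence (see section 2.4)."""
    ...

TOOL_REGISTRY = {
    "hdmi.status": {
        # shell snippet streamed to the device over SSH
        "command": 'for f in /sys/class/drm/*HDMI*/status; do\n'
                   ' printf "%s: %s\\n" "$f" "$(cat "$f")"\n'
                   'done\n',
        # callable that converts raw output into structured evidence
        "parser": parse_hdmi_status,
    },
}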


2.3 SSH Execution Engine


The engine connects to the device using SSH.

Instead of passing long shell commands through SSH arguments, the command is sent via standard input.


Example:

ssh debian@192.168.1.11 bash -s --

The command script is then streamed to the remote shell.

This approach avoids complex quoting issues across:

  • Windows

  • SSH

  • Bash
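
In Python this amounts to launching ssh with a fixed argument list and writing the script to standard input, so the command body never has to survive argv quoting. A sketch reusing the registry from section 2.2 (run_on_device is an illustrative name; the host and user are the example values above):

import subprocess

def run_on_device(script, host="debian@192.168.1.11"):
    """Stream a bash script to the remote shell over stdin and capture stdout."""
    proc = subprocess.run(
        ["ssh", host, "bash", "-s", "--"],
        input=script,        # the script travels over stdin, not argv
        capture_output=True,
        text=True,
        timeout=30,
    )
    return proc.stdout

raw = run_on_device(TOOL_REGISTRY["hdmi.status"]["command"])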


2.4 Parsing


Raw Linux output is converted into structured data.


Example raw output:

/sys/class/drm/card0-HDMI-A-1/status: disconnected

Parsed result:
{
 "connected": false,
 "entries": [
   {
     "path": "/sys/class/drm/card0-HDMI-A-1/status",
     "status": "disconnected"
   }
 ]
}

Structured evidence makes the results easier to analyze and summarize.
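
For hdmi.status, the parser only needs to split each "path: status" line and derive the connected flag. A sketch of parse_hdmi_status, the function referenced in the registry above:

def parse_hdmi_status(raw):
    """Turn "path: status" lines from sysfs into structured evidence."""
    entries = []
    for line in raw.strip().splitlines():
        path, _, status = line.partition(": ")
        entries.append({"path": path, "status": status.strip()})
    return {
        "connected": any(e["status"] == "connected" for e in entries),
        "entries": entries,
    }

Running it on the raw output above yields exactly the parsed result shown.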


2.5 LLM Response

Finally, the LLM converts evidence into a concise explanation.


Example output:
- HDMI is disconnected.
- Checked DRM HDMI status from sysfs.
- Connector /sys/class/drm/card0-HDMI-A-1/status reported disconnected.

The response is constrained to use only the evidence and not invent information.
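
That constraint is carried by the summarizer template itself. An illustrative shape for summarizer_prompt.txt, with the evidence presumably written into the template by the engine before each launch (the exact wording used in the project may differ):

You are a device diagnostics assistant.
Use ONLY the evidence below. Do not invent facts that are not in the evidence.
Answer in at most three short bullet points.

Evidence:
{"connected": false, "entries": [{"path": "/sys/class/drm/card0-HDMI-A-1/status", "status": "disconnected"}]}

Answer: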


  3. Example Questions the System Can Answer


3.1 Display diagnostics


Is HDMI connected?

3.2 Sensor discovery


Are any sensors detected?

3.3 Performance analysis


Is the system overloaded?

3.4 Networking


What is the IP address of the device?

3.5 Service health


Are any system services failing?

  4. GUI Screenshots



