[VicPiMakers Projects] Running the new llamafile (llama.cpp) app

Craig Miller cvmiller at gmail.com
Mon Apr 27 08:30:12 PDT 2026


Good article, Greg,

Clearly, you aren't using enough tokens (to be tokenmaxxing).

BTW, the llamafile (which is built on llama.cpp) will kick out token 
statistics. I'm not sure what all of it means:

srv   prompt_save:  - saving prompt with length 631, total state size = 34.516 MiB
srv          load:  - looking for better prompt, base f_keep = 0.022, sim = 0.500
srv        update:  - cache state: 5 prompts, 74.066 MiB (limits: 8192.000 MiB, 128000 tokens, 149758 est)
srv        update:    - prompt 0x7fc12c15f5b0:     431 tokens, checkpoints:  0,    23.576 MiB
srv        update:    - prompt 0x7fc12c1634a0:      75 tokens, checkpoints:  0,     4.103 MiB
srv        update:    - prompt 0x7fc12c15fb40:      75 tokens, checkpoints:  0,     4.103 MiB
srv        update:    - prompt 0x7fc15781b190:     142 tokens, checkpoints:  0,     7.768 MiB
srv        update:    - prompt 0x7fc12c1631b0:     631 tokens, checkpoints:  0,    34.516 MiB
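
If I'm reading it right, 34.516 MiB / 631 tokens works out to roughly 
56 KiB of saved state per token, and 8192 MiB / 56 KiB is about 
149,760 tokens, which looks like where the "149758 est" figure comes 
from (back-of-the-envelope math on my part, not anything from the docs).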

Craig....


On 4/27/26 07:49, Greg H wrote:
> Thanks Craig. I have it on my list to experiment more with 
> self-hosting LLMs. I think there will be calls for self-hosting once 
> AI fervor has peaked and labs have to show profitability.
>
> Not on topic, but related to our NetSIG discussion on odd industry 
> behaviours around LLM resource consumption:
>
> https://newsletter.pragmaticengineer.com/p/the-pulse-tokenmaxxing-as-a-weird-6b2 
>
>
> We're back to the days of "more K-LOCs!"
>
>
> On Sun, Apr 26, 2026 at 8:21 AM Craig Miller <cvmiller at gmail.com> wrote:
>
>     Hi Greg,
>
>     No, I haven't. I think you could run 'strace' to see what the model
>     was doing at the time, but it would be slow, and I am not sure it
>     would tell you much.
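>
>     If you did want to poke at it, something like this would capture a
>     timestamped syscall trace to a file (untested sketch; adjust the
>     model file and port to match your setup):
>
>         strace -f -tt -o llama.strace ./llamafile-0.10.0 -m model.gguf --server --port 8080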
>
>     I don't think it was a RAM issue, since the container I am running
>     the LLMs in is unrestricted (it can use all the host's memory, which
>     is 32 GB), and the kernel is fairly recent (6.18.19-0-lts).
>
>     I didn't spend much time on it, because my objective was to get a
>     local LLM running, not to debug the model.
>
>     Craig...
>
>     On 4/26/26 07:56, Greg H wrote:
>>     I was curious if you do any troubleshooting for the models that
>>     core dump. I don't have any experience with this and I'm
>>     wondering if there's much that you can do other than increase the
>>     resources (i.e. more RAM). Maybe upgrade the kernel? Guessing
>>     some models need the latest / greatest kernel versions to do
>>     their thing.
>>
>>     On Sun, Apr 26, 2026 at 7:34 AM Craig Miller <cvmiller at gmail.com>
>>     wrote:
>>
>>         Hi Deid,
>>
>>         Looking at the gguf models on HuggingFace:
>>
>>         https://huggingface.co/models?library=gguf
>>
>>         There were a few parameters I was looking at:
>>
>>          1. Not too big, somewhere between 5 and 10 GB in size
>>          2. Relatively recent
>>          3. Doesn't core dump right away
>>
>>         I had the best luck running the Qwen models. I am running
>>         Qwen2.5-VL-7B-Instruct-abliterated.Q4_K_M.gguf on my PN-50,
>>         and it seems to run reasonably fast. Some of the other models
>>         were quite slow on the PN-50.
>>
>>         Have fun!
>>
>>         Craig...
>>
>>         On 4/26/26 07:13, Deid Reimer wrote:
>>>         Hey Craig,
>>>
>>>         Why did you pick that particular LLM?
>>>
>>>         Deid   VA7REI
>>>         On Apr 25, 2026, at 8:32 a.m., Craig Miller
>>>         <cvmiller at gmail.com> wrote:
>>>
>>>             Hi All,
>>>
>>>             We were chatting before the most recent NetSIG about the
>>>             new Llamafile app, which has excellent support for IPv6.
>>>             The app runs a web server (which is IPv6 accessible).
>>>             The new llamafile takes a -m parameter that points to
>>>             the gguf LLM model.
>>>
>>>             *Old way*
>>>                 ./google_gemma-3-4b-it-Q6_K.llamafile --server -v2 --host lxcllama.example.com
>>>             *New way*
>>>                 llamafile -m model.gguf --server --port 8080
>>>
>>>             Find the new llamafile at:
>>>
>>>             https://github.com/mozilla-ai/llamafile/releases/tag/0.10.0
>>>
>>>             You can find gguf (LLM models) at:
>>>
>>>             https://huggingface.co/models?library=gguf
>>>
>>>             I start my llamafile using this command:
>>>
>>>                 ./llamafile-0.10.0 -m Qwen3.5-9B.Q4_K_M.gguf --server --port 8080 --host lxcllama.example.com
>>>
>>>             This way, any web browser in my house can access the LLM.
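>>>
>>>             It's not just browsers, either. If I remember right, the
>>>             built-in server exposes the same HTTP API as llama.cpp's
>>>             server, so you can also hit it from the command line (the
>>>             exact endpoint may differ by version):
>>>
>>>                 curl http://lxcllama.example.com:8080/v1/chat/completions \
>>>                     -H "Content-Type: application/json" \
>>>                     -d '{"messages": [{"role": "user", "content": "Hello"}]}'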
>>>
>>>             Happy LLM-ing,
>>>
>>>             Craig...
>>>
-- 
IPv6 is the future, the future is here
http://ipv6hawaii.org/