It’s been a long time since I posted on my blog, and while many of you will expect something related to Windows vulnerability research or exploit development, I have to manage expectations: this one will be about LLMs. Not because everyone is jumping on the AI bandwagon and I felt the urge to do so, but because over the past few years I’ve transitioned from a purely technical role into something more hybrid, overlapping many different aspects of the company I work for. Time is limited, and this blog has always been about things I like and experiment with. So, despite being nowhere near an expert in this field, and after some discussion with friends of mine, I decided to post about it anyway: not because it’s groundbreaking, but simply because I’ve spent some time on it and I believe it might be interesting to others.
It’s no secret that, lately, I’ve spent a good amount of time trying to understand how to use local LLMs for the team, to help with filling in reports, code completion, and building a knowledge base from all the documentation the team has written over the years.
Today I’d like to offer a sneak peek into the high-level process of setting up a homemade local LLM platform, which we’re currently running as an internal pilot project: how we did it, which setup and technologies we used, and so on.
This blog post is a re-post of the original article “Home-made LLM Recipe” that I wrote for Crowdfense on their blog.
While deciding which hardware to use for our pet project, we reviewed the options: dedicated LLM hardware, GPU racks, and anything else capable of running such models. Our research workstations have limited dedicated GPUs, so we were unable to run large models, and performance was nowhere near acceptable in terms of speed and token generation.
While GPUs seem to be the standard, dedicated racks are expensive, and for a pilot project we couldn’t justify new purchases. So we opted to test a 2022 Mac Studio that was sitting idle in the office.
Dell XPS:
Mac Studio 2022:
When testing the hardware, we used the following two prompts across the different models:
Both are vague enough to show, where present, the models’ “thinking” process.
For the local setup of the models, we opted for Ollama, which can be installed directly from its main website without further complications. We then spent some time selecting the models we were interested in testing, picking some small enough to be compared against our workstation and some bigger ones that the Mac Studio’s RAM allowed us to run.
We selected the following:
Models can be pulled via the command line: `ollama pull llama3.1:8b`
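For reference, here is a minimal sketch of how the models can be pulled in bulk; the names match the ones that show up in the benchmark tables below:

```bash
# Pull the models used in the benchmarks below (names as published on the Ollama registry)
for m in llama3.1:8b qwen3:8b qwen2.5-coder:3b deepseek-r1:8b; do
  ollama pull "$m"
done

# Larger models, only pulled on the Mac Studio
for m in deepseek-r1:70b gpt-oss:120b llama3.3:70b qwen3-coder:30b; do
  ollama pull "$m"
done

ollama list   # verify what is available locally
```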
A rough rule of thumb is 1GB of RAM per billion parameters, give or take.
GB of RAM for Q8-quantised models: Q8 stores 8 bits (one byte) per weight, so roughly 1 GB per billion parameters, not counting the other components needed to run inference. Q4 is half that usage, FP16 is double, and so on (see the quick back-of-the-envelope sketch after the table).
| LLM size | Q8 RAM (GB) |
| --- | --- |
| 3B | 3.3 |
| 8B | 7.7 |
| 33B | 36.3 |
| 70B | 77.0 |
| 123B | 135.3 |
| 205B | 225.5 |
| 405B | 445.5 |
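To make the rule of thumb concrete, the estimate boils down to parameters × bits-per-weight ÷ 8, plus some headroom for the KV cache and the runtime. The tiny `est_ram` helper below is purely illustrative and not part of Ollama:

```bash
# Hypothetical helper: rough weight-memory estimate, excluding KV cache and runtime overhead
est_ram() {  # usage: est_ram <params_in_billions> <bits_per_weight>
  awk -v p="$1" -v b="$2" 'BEGIN { printf "~%.1f GB for the weights alone\n", p * b / 8 }'
}

est_ram 8 8    # 8B model at Q8   -> ~8.0 GB
est_ram 70 4   # 70B model at Q4  -> ~35.0 GB
est_ram 8 16   # 8B model at FP16 -> ~16.0 GB
```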
Running each prompt with `ollama run <model> --verbose`, we can retrieve a handful of statistics about the provided prompt and its evaluation, specifically: total duration, load duration, prompt eval count, prompt eval duration, prompt eval rate, eval count, eval duration, and eval rate.
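The same figures can also be collected programmatically, which makes scripted benchmarking easier: Ollama’s REST API returns them (in nanoseconds) in the `/api/generate` response. A minimal sketch, assuming the default endpoint on localhost:11434 and `jq` installed; the prompt is just a placeholder:

```bash
# Query the local Ollama server and extract the timing statistics from the JSON response
curl -s http://localhost:11434/api/generate \
  -d '{"model": "llama3.1:8b", "prompt": "Explain unified memory in one paragraph.", "stream": false}' \
  | jq '{total_duration, load_duration, prompt_eval_count, prompt_eval_duration, eval_count, eval_duration}'
```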
Comparison
| Metric | Dell XPS | Mac Studio 2022 | Delta |
| --- | --- | --- | --- |
| llama3.1:8b | | | |
| total duration (s) | 15.3619781 | 3.29049775 | -12.0715 |
| load duration (ms) | 106.8099 | 101.352916 | -5.45698 |
| prompt eval count (tokens) | 17 | 17 | |
| prompt eval duration (ms) | 876.9471 | 622.781666 | -254.165 |
| prompt eval rate (tokens/s) | 19.39 | 27.3 | 7.91 |
| eval count (tokens) | 156 | 184 | 28 |
| eval duration (s) | 14.2318621 | 2.495499444 | -11.7364 |
| eval rate (tokens/s) | 10.96 | 73.73 | 62.77 |
| qwen3:8b | | | |
| total duration (s) | 61.4240039 | 8.48243875 | -52.9416 |
| load duration (ms) | 177.8828 | 91.713958 | -86.1688 |
| prompt eval count (tokens) | 17 | 17 | |
| prompt eval duration (ms) | 677.5161 | 262.631042 | -414.885 |
| prompt eval rate (tokens/s) | 25.09 | 64.73 | 39.64 |
| eval count (tokens) | 701 | 573 | -128 |
| eval duration (s) | 60.1850603 | 8.019101906 | -52.166 |
| eval rate (tokens/s) | 11.65 | 71.45 | 59.8 |
| qwen2.5-coder:3b | | | |
| total duration (s) | 7.867204 | 9.042249625 | 1.175046 |
| load duration (ms) | 125.2532 | 106.480875 | -18.7723 |
| prompt eval count (tokens) | 38 | 38 | |
| prompt eval duration (ms) | 300.7288 | 145.467917 | -155.261 |
| prompt eval rate (tokens/s) | 126.36 | 261.23 | 134.87 |
| eval count (tokens) | 541 | 513 | -28 |
| eval duration (s) | 6.4093473 | 7.911069649 | 1.501722 |
| eval rate (tokens/s) | 84.41 | 64.85 | -19.56 |
| deepseek-r1:8b | | | |
| total duration (s) | 72.0693476 | 95.18973533 | 23.12039 |
| load duration (ms) | 92.4653 | 101.297667 | 8.832367 |
| prompt eval count (tokens) | 11 | 11 | |
| prompt eval duration (ms) | 2088.7423 | 475.781125 | -1612.961 |
| prompt eval rate (tokens/s) | 5.27 | 23.12 | 17.85 |
| eval count (tokens) | 619 | 6166 | 5547 |
| eval duration (s) | 69.6013878 | 93.4009973 | 23.79961 |
| eval rate (tokens/s) | 8.89 | 66.02 | 57.13 |
Larger models evaluated on Mac Studio only:
| Metric | deepseek-r1:70b | gpt-oss:120b | llama3.3:70b | qwen3-coder:30b |
| --- | --- | --- | --- | --- |
| total duration (s) | 228.4271719 | 39.81334175 | 39.09012222 | 16.55513133 |
| load duration (ms) | 99.152208 | 157.683958 | 101.430583 | 88.789084 |
| prompt eval count (tokens) | 12 | 74 | 17 | 17 |
| prompt eval duration (ms) | 977.806583 | 4502.564917 | 2720.329166 | 393.031417 |
| prompt eval rate (tokens/s) | 12.27 | 16.44 | 6.25 | 43.25 |
| eval count (tokens) | 2086 | 329 | 185 | 720 |
| eval duration (s) | 226.6527123 | 35.01930734 | 36.19925901 | 15.86309835 |
| eval rate (tokens/s) | 9.2 | 9.39 | 5.11 | 45.39 |
Based on the collected performance metrics, the Mac Studio 2022 significantly outperformed our workstation across every meaningful measurement: while model load times are similar, the total duration drops sharply. Prompt evaluation (at least ~40% faster, often much more) and generation throughput (~6x faster on the 8B models) clearly favour the Mac Studio for real-time workloads and interactive development. The Mac Studio is also the only machine able to run the larger models at all, thanks to its unified memory capacity and Apple’s high memory bandwidth.
By default, our out-of-the-box setup doesn’t handle concurrency. Specifically, the unified memory doesn’t leave room for several models to run simultaneously, and each submitted request must be fully fulfilled before the next one can be processed (inference time plays a huge role here, as a model stuck “thinking” blocks the queue for everyone else). For us that’s not a big problem, given our team size and the fact that we don’t use the LLMs constantly, but it might become a pain the more we rely on them, especially for code completion.
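For what it’s worth, recent Ollama versions expose a few environment variables to tune this behaviour. The sketch below is something we haven’t adopted yet, the values are only examples, and on a single unified-memory machine you’re still bound by how much RAM the resident models consume:

```bash
# Concurrency-related knobs for the Ollama server (set before starting `ollama serve` or its service)
export OLLAMA_NUM_PARALLEL=2        # parallel requests per loaded model
export OLLAMA_MAX_LOADED_MODELS=2   # how many models may stay resident at the same time
export OLLAMA_KEEP_ALIVE=30m        # how long a model stays loaded after the last request
ollama serve
```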
While we were lucky to already have the hardware, we cannot ignore the costs. A new Mac Studio costs around 5,500 USD in its 96 GB RAM configuration, up to 10,000 USD for the 512 GB one (which, in theory, should let you load pretty much any model, and performance should still be competitive a couple of years from now, especially given the cost).
A single consumer Nvidia GPU alone can range from 2,500 to 5,000 USD, without considering the rest of the hardware, and enterprise GPU pricing remains staggering, with single units close to 30,000-40,000 USD.
A very viable alternative is second-hand marketplaces (e.g., Facebook Marketplace, eBay) and refurbished Apple hardware (especially if you live in the US, where the market seems much larger than in Europe), possibly connecting multiple units together with a setup similar to this one: Mac Studio Cluster via MLX.
I admit that I haven’t spent much time researching which platforms are available to perform all the tasks we had in mind, but for its ease of use, out-of-the-box setup, multi-user support, knowledge-base features, and overall feature set, we settled on Open WebUI.
The setup is pretty straightforward via brew and uvx, and we followed this nicely put-together guide. The only modification was to the automatic startup, for which we daemonised the services so they run at boot.
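For the curious, the moving parts roughly look like the sketch below, assuming the Homebrew and uvx route from the guide (exact flags, versions, and the daemonisation details may differ from what the guide prescribes):

```bash
# Ollama as a background service managed through Homebrew (launchd under the hood)
brew install ollama
brew services start ollama     # keeps the Ollama server running and restarts it at login

# Open WebUI served from an isolated environment via uvx
# (by default it listens on http://localhost:8080)
uvx --python 3.11 open-webui serve
```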
While we initially used Cursor for quick scripting and prototyping, we’re trying to replace it with Continue, configured to use our local LLMs. Though, maybe just because of its ease of use and the fact that I’ve gotten used to it, Cursor still feels way better in terms of usability and results.
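Pointing Continue at the local models looks roughly like this; it’s a sketch based on Continue’s JSON configuration format in `~/.continue/config.json` (newer releases use a YAML config instead), and the model choices are just examples taken from the ones we pulled:

```bash
# Hedged example: make Continue (continue.dev) talk to the local Ollama server
cat > ~/.continue/config.json <<'EOF'
{
  "models": [
    { "title": "Qwen3 Coder (local)", "provider": "ollama", "model": "qwen3-coder:30b" }
  ],
  "tabAutocompleteModel": {
    "title": "Qwen2.5 Coder (local)", "provider": "ollama", "model": "qwen2.5-coder:3b"
  }
}
EOF
```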
I’ve tested the “Knowledge Base” features of Open WebUI and the related “Document Embedding” of Anything LLM, both of which promise an easy way to build internal company knowledge bases and index our documents.
However, after some testing, I’m not particularly impressed by either. Both seem to struggle to pull data from our company documents, mixing it up with the model’s internal knowledge and, most of the time, giving mixed results, partial answers, or failing to address the question at all, especially when dates past the model’s cut-off are involved.
I’m not sure whether that’s because both are essentially RAG wrappers around an LLM, where the underlying model isn’t sandboxed from its own knowledge and is only steered through system prompts, or because both rely on chunking, which loses hierarchical structure, references, and temporal context. Either way, the biggest issue for me is inconsistent retrieval.
IMHO, these features are not yet robust enough for reliable information retrieval, but I hope they keep improving, as I think they could be a game-changer in the future.