Home-made LLM Recipe

Reading Time: 9 minutes

It’s been a long time since I posted on my blog, and while many of you might expect something related to Windows vulnerability research or exploit development, I have to manage expectations: this one will be about LLMs. Not because everyone is jumping on the AI bandwagon and I felt the urge to do the same, but because over the past few years I’ve transitioned from a purely technical role into something more hybrid, overlapping many different aspects of the company I’m working for. Given that my time is limited, that this blog has always been about things I like and experiment with, and despite being nowhere near an expert in this field, after some discussion with friends of mine I still decided to post about it; not because it’s groundbreaking, but simply because I’ve spent some time on it and I believe it might be interesting to others.

It’s no secret that, lately, I’ve spent a good amount of time trying to understand how to use local LLMs for the team, aiding in report filling, code completion, and the general knowledge base of all the documentation the team has written over the past years.

Today I would like to give you a peek into the high-level process of setting up a home-made local LLM platform, which we’re currently using as an internal pilot project: how we did it, what setup and technologies we used, and so on.

This blog post is a re-post of the original article “Home-made LLM Recipe” that I have written for Crowdfense on their blog.

Hardware

While deciding which hardware to use for our pet project, we started by reviewing dedicated hardware for LLMs, GPU racks, and any kind of hardware capable enough to run such models. Given that our research workstations only have limited dedicated GPUs, we were unable to run large models, and performance was nowhere near acceptable in terms of speed and token-generation output.

While GPUs seem to be the standard, dedicated racks are somewhat expensive, and for a pilot project we didn’t want to have to justify new purchases. We opted to test a 2022 Mac Studio that was sitting idle in the office.

Dell XPS:

  • CPU: 13th Gen Intel Core i9-13900H
  • RAM: 64 GB
  • GPU: NVIDIA GeForce RTX 4070
  • OS: Windows 11 Pro x64

Mac Studio 2022:

  • CPU: M1 Ultra
  • RAM: 128 GB
  • GPU: 64-core
  • OS: macOS 26.1

When testing the hardware, we used the following two prompts across different models:

  • “Write me a poem about the moon”
  • “Write a Python binary tree lookup function with a sample”

Both are vague enough to see, where present, the “thinking” process of the models.

For the local setup of the models, we opted to use Ollama, which can be directly installed from its main website without further complications. Then we spent some time selecting models we were interested in testing, opting for some small enough to be compared with our workstation and some bigger ones, as the Mac Studio’s RAM allowed us to do so.

We selected the following:

  1. llama3.1:8b
  2. qwen3:8b
  3. qwen2.5-coder:3b
  4. deepseek-r1:8b
  5. deepseek-r1:70b
  6. gpt-oss:120b
  7. llama3.3:70b
  8. qwen3-coder:30b

Models can be pulled via the command line: ollama pull llama3.1:8b
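
If you want to fetch the whole list in one go, a minimal Python sketch along these lines works too (it assumes the ollama CLI is already on your PATH and simply loops over the tags listed above):

```python
import subprocess

# Models we selected for testing (same tags as listed above).
MODELS = [
    "llama3.1:8b",
    "qwen3:8b",
    "qwen2.5-coder:3b",
    "deepseek-r1:8b",
    "deepseek-r1:70b",
    "gpt-oss:120b",
    "llama3.3:70b",
    "qwen3-coder:30b",
]

for model in MODELS:
    # "ollama pull" is idempotent: already-downloaded models are just verified.
    print(f"Pulling {model}...")
    subprocess.run(["ollama", "pull", model], check=True)
```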

RAM Requirements per Billion Parameters

A rough rule of thumb is 1GB of RAM per billion parameters, give or take.

GB of RAM for Q8-quantised models: Q8 uses 8 bits (one byte) per parameter, so roughly 1 GB per billion parameters, not counting the other components needed to run inference; Q4 is half that usage, FP16 is double, and so on (a small helper sketch follows the table below).

| LLM size | Q8 RAM (GB) |
|----------|-------------|
| 3B       | 3.3         |
| 8B       | 7.7         |
| 33B      | 36.3        |
| 70B      | 77.0        |
| 123B     | 135.3       |
| 205B     | 225.5       |
| 405B     | 445.5       |
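
Turning the rule of thumb into numbers is a one-liner; the sketch below is only an estimate, and the ~10% overhead factor is an assumption (actual usage depends on context length, KV cache, and the runtime):

```python
def estimate_ram_gb(params_billion: float, bits_per_param: int = 8,
                    overhead: float = 1.10) -> float:
    """Rough RAM estimate for a quantised model.

    Q8 -> 8 bits (one byte) per parameter, Q4 -> 4 bits, FP16 -> 16 bits.
    'overhead' is a hypothetical ~10% fudge factor for inference buffers.
    """
    bytes_per_param = bits_per_param / 8
    return params_billion * bytes_per_param * overhead


if __name__ == "__main__":
    for size in (3, 8, 33, 70, 123, 205, 405):
        print(f"{size}B @ Q8 ~ {estimate_ram_gb(size):.1f} GB")
```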

Benchmark

Using ollama run <model> --verbose, we can retrieve some information and statistics about the provided prompt and its evaluation (a small sketch for collecting the same metrics via the HTTP API follows the list below), specifically:

  • total duration (s): total time from when the request is issued to when the model finishes generating the output (smaller is better)
  • load duration (ms): how long it took to load the model into memory (smaller is better)
  • prompt eval count (tokens): how many tokens were processed from the provided prompt
  • prompt eval duration (ms): how long the model took to read and process the prompt (smaller is better)
  • prompt eval rate (tokens/s): how fast the model processed the tokens of the prompt (higher is better)
  • eval count (tokens): how many tokens the model generated in its response
  • eval duration (s): the total time it took to generate the output tokens
  • eval rate (tokens/s): speed of generation (higher is better)
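
The same counters are also exposed by Ollama’s HTTP API: a non-streaming call to the /api/generate endpoint returns them in its JSON response, with durations reported in nanoseconds. A minimal benchmarking sketch, assuming Ollama is listening on its default port (11434):

```python
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"  # default Ollama endpoint


def benchmark(model: str, prompt: str) -> dict:
    """Run one non-streaming generation and return the timing metrics."""
    resp = requests.post(
        OLLAMA_URL,
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=600,
    )
    resp.raise_for_status()
    data = resp.json()

    ns = 1e9  # Ollama reports durations in nanoseconds
    return {
        "total duration (s)": data["total_duration"] / ns,
        "load duration (ms)": data["load_duration"] / 1e6,
        "prompt eval count (tokens)": data["prompt_eval_count"],
        "prompt eval duration (ms)": data["prompt_eval_duration"] / 1e6,
        "prompt eval rate (tokens/s)": data["prompt_eval_count"] / (data["prompt_eval_duration"] / ns),
        "eval count (tokens)": data["eval_count"],
        "eval duration (s)": data["eval_duration"] / ns,
        "eval rate (tokens/s)": data["eval_count"] / (data["eval_duration"] / ns),
    }


if __name__ == "__main__":
    for metric, value in benchmark("llama3.1:8b", "Write me a poem about the moon").items():
        print(f"{metric}: {value}")
```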

Comparison

llama3.1:8b

| Metric | Dell XPS | Mac Studio 2022 | Delta |
|---|---|---|---|
| total duration (s) | 15.3619781 | 3.29049775 | -12.0715 |
| load duration (ms) | 106.8099 | 101.352916 | -5.45698 |
| prompt eval count (tokens) | 17 | 17 | |
| prompt eval duration (ms) | 876.9471 | 622.781666 | -254.165 |
| prompt eval rate (tokens/s) | 19.39 | 27.3 | 7.91 |
| eval count (tokens) | 156 | 184 | 28 |
| eval duration (s) | 14.2318621 | 2.495499444 | -11.7364 |
| eval rate (tokens/s) | 10.96 | 73.73 | 62.77 |

qwen3:8b

| Metric | Dell XPS | Mac Studio 2022 | Delta |
|---|---|---|---|
| total duration (s) | 61.4240039 | 8.48243875 | -52.9416 |
| load duration (ms) | 177.8828 | 91.713958 | -86.1688 |
| prompt eval count (tokens) | 17 | 17 | |
| prompt eval duration (ms) | 677.5161 | 262.631042 | -414.885 |
| prompt eval rate (tokens/s) | 25.09 | 64.73 | 39.64 |
| eval count (tokens) | 701 | 573 | -128 |
| eval duration (s) | 60.1850603 | 8.019101906 | -52.166 |
| eval rate (tokens/s) | 11.65 | 71.45 | 59.8 |

qwen2.5-coder:3b

| Metric | Dell XPS | Mac Studio 2022 | Delta |
|---|---|---|---|
| total duration (s) | 7.867204 | 9.042249625 | 1.175046 |
| load duration (ms) | 125.2532 | 106.480875 | -18.7723 |
| prompt eval count (tokens) | 38 | 38 | |
| prompt eval duration (ms) | 300.7288 | 145.467917 | -155.261 |
| prompt eval rate (tokens/s) | 126.36 | 261.23 | 134.87 |
| eval count (tokens) | 541 | 513 | -28 |
| eval duration (s) | 6.4093473 | 7.911069649 | 1.501722 |
| eval rate (tokens/s) | 84.41 | 64.85 | -19.56 |

deepseek-r1:8b

| Metric | Dell XPS | Mac Studio 2022 | Delta |
|---|---|---|---|
| total duration (s) | 72.0693476 | 95.18973533 | 23.12039 |
| load duration (ms) | 92.4653 | 101.297667 | 8.832367 |
| prompt eval count (tokens) | 11 | 11 | |
| prompt eval duration (ms) | 2.0887423 | 475.781125 | 473.6924 |
| prompt eval rate (tokens/s) | 5.27 | 23.12 | 17.85 |
| eval count (tokens) | 619 | 6166 | 5547 |
| eval duration (s) | 69.6013878 | 93.4009973 | 23.79961 |
| eval rate (tokens/s) | 8.89 | 66.02 | 57.13 |

Larger models evaluated on Mac Studio only:

| Metric | deepseek-r1:70b | gpt-oss:120b | llama3.3:70b | qwen3-coder:30b |
|---|---|---|---|---|
| total duration (s) | 228.4271719 | 39.81334175 | 39.09012222 | 16.55513133 |
| load duration (ms) | 99.152208 | 157.683958 | 101.430583 | 88.789084 |
| prompt eval count (tokens) | 12 | 74 | 17 | 17 |
| prompt eval duration (ms) | 977.806583 | 4.502564917 | 2.720329166 | 393.031417 |
| prompt eval rate (tokens/s) | 12.27 | 16.44 | 6.25 | 43.25 |
| eval count (tokens) | 2086 | 329 | 185 | 720 |
| eval duration (s) | 226.6527123 | 35.01930734 | 36.19925901 | 15.86309835 |
| eval rate (tokens/s) | 9.2 | 9.39 | 5.11 | 45.39 |

Based on the collected performance metrics, the Mac Studio 2022 significantly outperformed our workstation across every meaningful measurement: while model load times are similar, total duration drops significantly, and both prompt evaluation (at least ~40% faster) and generation throughput (~6x faster) clearly favour the Mac Studio for real-time workloads and interactive development. The Mac Studio is also the only machine of the two able to run the larger models, thanks to its 128 GB of unified memory and Apple’s high memory bandwidth.

Concurrency Limitation

By default, our setup with out-of-the-box tools doesn’t handle concurrency. Specifically, the unified memory doesn’t allow multiple models to run simultaneously, and each submitted request must be fully fulfilled before the next one can be processed (reasoning models play a huge role here, as a model stuck “thinking” blocks the queue for everyone else). For us, that’s not a big problem, as our team size allows it and we don’t constantly use LLMs, but it might become a pain the more we rely on them, especially for code completion.
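
A quick way to observe the queueing behaviour is to fire two requests at the API at the same time and compare wall-clock times; a minimal sketch (reusing the endpoint from the benchmark snippet above, and assuming the default single-request configuration):

```python
import time
from concurrent.futures import ThreadPoolExecutor

import requests

OLLAMA_URL = "http://localhost:11434/api/generate"  # default Ollama endpoint


def timed_request(prompt: str) -> float:
    """Send one non-streaming request and return its wall-clock duration."""
    start = time.monotonic()
    requests.post(
        OLLAMA_URL,
        json={"model": "llama3.1:8b", "prompt": prompt, "stream": False},
        timeout=600,
    ).raise_for_status()
    return time.monotonic() - start


if __name__ == "__main__":
    prompts = ["Write me a poem about the moon", "Write me a poem about the sun"]
    with ThreadPoolExecutor(max_workers=2) as pool:
        durations = list(pool.map(timed_request, prompts))
    # With requests handled one at a time, the second duration includes
    # the time spent waiting for the first one to finish.
    for prompt, duration in zip(prompts, durations):
        print(f"{duration:6.1f}s  {prompt}")
```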

Cost

While we were lucky to already have the hardware, we cannot ignore the costs. A new Mac Studio costs around 5,500 USD with 96 GB of RAM, and up to roughly 10,000 USD with 512 GB of RAM (which, in theory, should let you load almost any openly available model, and performance should still be competitive after a couple of years, especially given the cost).

A single consumer NVIDIA GPU alone can cost from 2,500 to 5,000 USD, without considering the rest of the hardware, and enterprise GPU pricing remains staggering, with units close to 30,000-40,000 USD.

A very viable alternative is local marketplaces (e.g., Facebook Marketplace, eBay) and refurbished Apple hardware (especially if you live in the US, where that market seems much larger than in Europe), possibly connecting multiple units together in a setup similar to a Mac Studio cluster via MLX.

Platform

I admit that I haven’t spent much time researching which platforms are available to perform all the tasks we intended, but for its ease of use, out-of-the-box setup, multi-user support, knowledge-base functionality, and overall feature set, we relied on Open WebUI.

The setup is pretty straightforward via brew and uvx, and we followed this nicely put-together guide. The only modification concerned automatic startup: we daemonised the services so they run at boot.

Code Completion

While we initially used Cursor for fast scripting and prototyping, we’re trying to replace it with Continue, configured to use our local LLMs. Though perhaps just because of its ease of use and the fact that I’ve gotten used to it, Cursor still feels way better in terms of usability and results.

Knowledge Base

I’ve tested the “Knowledge Base” features of Open WebUI and the related “Document Embedding” of Anything LLM, both of which promise an easy way to build internal company knowledge bases and index our documents.

However, after some testing, I’m not particularly impressed by either. Both seem to struggle to pull data from our company knowledge base, mixing it with the model’s internal knowledge and, most of the time, giving mixed results, partial answers, or failing to address the question, especially when dates past the model’s cut-off are involved.

I’m not sure whether that’s because both are essentially RAG wrappers around an LLM, where the underlying model is not sandboxed from its own knowledge but merely steered by system prompts, or because both rely on chunking, which loses the hierarchical structure, references, and temporal context of the documents; either way, the biggest issue for me is inconsistent retrieval.
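
For context, the retrieval side of these tools boils down to something like the naive sketch below; this is an illustration of the general RAG pattern, not Open WebUI’s or Anything LLM’s actual implementation, and it assumes Ollama’s /api/embeddings endpoint with an embedding model such as nomic-embed-text already pulled. Documents are split into fixed-size chunks, embedded, and the best-scoring chunks are pasted into the chat prompt, which is exactly where hierarchy, references, and temporal context get lost:

```python
import requests

OLLAMA_EMBED_URL = "http://localhost:11434/api/embeddings"  # Ollama embeddings endpoint
EMBED_MODEL = "nomic-embed-text"  # assumption: any embedding model pulled via Ollama


def embed(text: str) -> list[float]:
    """Embed a piece of text with a local embedding model."""
    resp = requests.post(OLLAMA_EMBED_URL, json={"model": EMBED_MODEL, "prompt": text})
    resp.raise_for_status()
    return resp.json()["embedding"]


def chunk(text: str, size: int = 500) -> list[str]:
    """Naive fixed-size chunking: this is where structure and context are lost."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
    return dot / norm if norm else 0.0


def retrieve(question: str, documents: list[str], top_k: int = 3) -> list[str]:
    """Return the top_k chunks most similar to the question."""
    chunks = [c for doc in documents for c in chunk(doc)]
    q_vec = embed(question)
    return sorted(chunks, key=lambda c: cosine(q_vec, embed(c)), reverse=True)[:top_k]


# The retrieved chunks are then pasted into the prompt of the chat model,
# which is still free to blend them with (or ignore them in favour of) its
# own pre-trained knowledge.
```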

IMHO, these features are not yet robust enough for information retrieval, but I hope they will be updated, as I think they might be a game-changer in the future.

Technology Stack

  • Hardware: Mac Studio 2022
  • Software: Open WebUI, Ollama

Source: https://voidsec.com/home-made-llm-recipe/