It has been more than a year since the use of large models. From the initial chatgbt3.5/4 to the current domestic large models, open source large models are becoming more and more powerful, such as Llama, SD, etc. Today, I will introduce two tools that have been used for a long time, ollama and open-webui.

Ollama

Ollama is an open source deep learning framework designed for easy deployment and operation of large language models (LLM) on local machines. It provides a complete set of deep learning tool chains, including data preprocessing, model building, training, evaluation and deployment.

Project address: https://github.com/ollama/ollama

Main features of ollama:

  • Local deployment and operation: One of the main goals of Ollama is to simplify the process of deploying large language models in Docker containers. This makes it easy for non-professional users to manage and run these complex models.

  • Lightweight and scalability: As a lightweight framework, Ollama maintains a small resource footprint and has good scalability. This allows users to adjust the configuration according to the scale and hardware conditions of the project.

  • API support: Ollama provides a simple API that enables developers to easily create, run, and manage large language model instances. This lowers the technical threshold for interacting with models.

  • Pre-built model library: Ollama contains a series of pre-trained large language models. Users can directly select these models and apply them to their own applications without having to train from scratch or find the model source themselves.

  • Model import and customization: (1) Import from GGUF: Supports importing existing large language models from specific platforms (such as GGUF). (2) Import from PyTorch or Safetensors: Compatible with these two deep learning frameworks, allowing users to integrate models trained based on these frameworks into Ollama. (3) Custom prompts: Allow users to add or modify prompts for the model to guide the model to generate text output of a specific type or style.

  • Cross-platform support: Provides installation guides for macOS, Windows (preview version), Linux, and Docker to ensure that users can successfully deploy and use Ollama in multiple operating system environments.

  • Command line tools and environment variables: Command line startup: The Ollama service can be started through the command ollamaserve or its aliases serve and start.

  • Environment variable configuration: such as OLLAMA_HOST, which is used to specify the host address and port bound to the service. Users can modify it as needed.

In addition, Ollama also provides rich API interfaces, community and documentation support, making it a powerful tool for developers and individual users to run and manage large language models locally.

ollama quick installation

linux ollama installation command:

1
curl -fsSL https://ollama.com/install.sh | sh

ollama basic commands

1
2
3
4
ollama pull llama3 # pull a large model
ollama run llama3 # run a large model directly,
ollama rm llama3 # remove the local llama3 large model
ollama list # list the local large models

Talk to llama3 Chinese version:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
# ollama pull wangshenzhi/llama3-8b-chinese-chat-ollama-q8:latest
# ollama run wangshenzhi/llama3-8b-chinese-chat-ollama-q8:latest
>>> Who are you?
I am Llama3-8B Chinese chatbot, fine-tuned by ORPO alignment algorithm on Llama3-8B-Instruct with Chinese-English mixed data. My developer is Wang Shenzhi (Chinese name). I am a helpful, respectful, honest and reliable assistant, committed to providing you with safe and accurate information and answers while promoting meaningful conversations.

>>> Please give a detailed introduction on how to fine-tune the large model llama3 and what tools are used for fine-tuning large models.
LLaMA 3 is a powerful text generation model developed by OpenAI based on the transformer architecture. It performs well in various natural language processing (NLP) tasks, such as text classification, machine translation, and dialogue systems. Fine-tuning large pre-trained models, such as LLaMA 3, can significantly improve their performance and adapt them to specific tasks or domains.

The fine-tuning process involves a small amount of additional training on the pre-trained model to fine-tune it to better adapt to the new target dataset. This process can be achieved by adding new data, adjusting hyperparameters, or changing the optimization algorithm. The main purpose of fine-tuning is to enable the model to learn from tasks with specific structure and context and perform well on these tasks.

Fine-tuning of LLaMA 3 usually involves the following steps:

1. **Data preparation**: Collect and preprocess new data for fine-tuning. The data should be relevant to the target task, specifically the chatbot task.

2. **Model selection**: Choose an architecture that is suitable for the task, such as LLaMA 3. Adjustments can be made based on the dataset and task requirements, such as increasing or decreasing the number of layers, number of hidden units, etc.

3. **Hyperparameter setting**: Adjust hyperparameters during fine-tuning, such as learning rate, batch size, and number of training iterations, to optimize model performance.

4. **Training**: Fine-tune the model using the new dataset. Usually done on a GPU, which can significantly increase the speed of fine-tuning.

Benefits of fine-tuning include:

* **Task-specific performance**: Fine-tuning can significantly improve the performance of the model on the target task, especially when the target dataset has unique structure and context.
* **Better generalization**: Fine-tuning can help the model learn patterns in new datasets, enabling it to make more accurate predictions on previously unseen inputs.
* **Resource efficiency**: Since only a small amount of additional training is required, fine-tuning can be performed on smaller computing resources, saving time and cost.

In summary, fine-tuning large pre-trained models, such as LLaMA 3, can significantly improve their performance on target tasks and adapt them to specific domains. By fine-tuning hyperparameters, datasets, and architectures, fine-tuning can help create more targeted models, leading to better
results.
>>> /?
Available Commands:
/set Set session variables
/show Show model information
/load <model> Load a session or model
/save <model> Save your current session
/bye Exit
/?, /help Help for a command
/? shortcuts Help for keyboard shortcuts

Use """ to begin a multi-line message.

>>> /bye

Use ollama to adjust the prompt words and parameters of the large model

Here, based on Alibaba’s open source Qianwen large model, create Modelfile

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
FROM qwen:7b

# Set the temperature to 1 (the higher the more creative, the lower the more coherent)
PARAMETER temperature 0.8

# Set the context token size
PARAMETER num_ctx 8192

# Set the system message
SYSTEM """
You are an artificial intelligence assistant named wanzi developed and provided by wnote.
You are good at speaking English and jokes.
"""
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
# ollama create wanzi -f ./Modelfile
transferring model data
reading model metadata
creating system layer
creating parameters layer
creating config layer
using already created layer sha256:87f26aae09c7f052de93ff98a2282f05822cc6de4af1a2a159c5bd1acbd10ec4 using already created layer sha256:7c7b8e244f6aa1ac8c32b74f56d42c41a0364dd2dabed8d9c6030a86 2e805b54 using already created layer sha256:1da0581fd4ce92dcf5a66b1da737cf215d8dcf25aa1b98b44443aaf7173155f5 writing layer sha256:1744dbff14b6e7fb2a846c118720f4627cdece7180bdb9d6f176c44bc6308718
writing layer sha256:02d401d3225083c307561b27d81137360bf39c01259dddbb267897cdcde6efec
writing layer sha256:d052490c542f2b7c49e4a770cc3239a58ca7261f75bc1730dd2b4ce5999de0b0
writing manifest
success
# ollama run wanzi
>>> Who are you?
I am Wanzi, an artificial intelligence assistant developed by wnote. I am good at communicating in English and occasionally tell some humorous jokes to lighten the mood. Is there anything I can help you with?
>>> Tell me a very funny joke.
Of course, let me tell you a funny joke:

One day, a penguin walked into a bar. It was wearing a formal tuxedo, which looked funny and mysterious.

The bartender looked at the penguin and exclaimed: "Wow, are you going to have a party?"

The penguin thought for a while, shrugged his round shoulders, and said in his unique low voice: "No, I just want to order a glass of 'Zero Degree Loneliness'."

Everyone was stunned, and then burst into laughter. From then on, there was a veritable "joke" on the bar menu-Zero Degree Loneliness.

I hope this joke can make you happy!

REST API request

Generate response:

1
2
3
4
curl http://localhost:11434/api/generate -d '{
"model": "qwen:7b",
"prompt":"Why are large models so popular now?"
}'

Model dialogue:

1
2
3
4
5
6
curl http://localhost:11434/api/chat -d '{
"model": "qwen:7b",
"messages": [
{ "role": "user", "content": "Why are large models so popular now?" }
]
}'

ollama supports many large models. If you need more experience, you can move to: https://ollama.com/library

Open-webui

Open-webui is an extensible, feature-rich, user-friendly self-hosted web UI developed with the Svelte front-end framework and designed to run completely offline. It supports various LLM runners, including APIs compatible with Ollama and OpenAI.

Previously, you can operate the Ollma command line, but now you can operate it completely on the web.

open-webui features

  • Easy installation: Seamless installation with Docker or Kubernetes (kubectl, kustomize or helm), support for :ollama and :cuda tag images, and get a worry-free experience.

  • Ollama/OpenAI API integration: Easily integrate OpenAI compatible APIs for versatile conversations with Ollama models. Custom OpenAI API URLs link to LMStudio, GroqCloud, Mistral, OpenRouter, etc.

  • Pipelines, Open web plugin support: Seamlessly integrate custom logic and Python libraries into Open web with the Pipelines Plugin Framework. Start a pipeline instance, set the OpenAI URL as the pipeline URL, and explore endless possibilities. Examples include function calls, user rate limits to control access, usage monitoring with tools such as Langfuse, real-time translation with multi-language support using LibreTranslate, poisonous message filtering, and more.

  • Responsive Design: Enjoy a seamless experience on desktop, laptop, and mobile.

  • Mobile Progressive Web App (PWA): Enjoy a native app-like experience on your mobile device with our PWA, providing offline access on localhost and a seamless user interface.

  • Full Markdown and LaTeX Support: Elevate your LLM experience with comprehensive Markdown and LaTeX capabilities to enrich interactions.

  • ️ Model Generator: Easily create [recommended] models through the web UI. Easily create and add custom roles/agents, custom chat elements, and import models with Open web Community integration.

  • Native RAG Integration: Dive into the future of chat interactions with groundbreaking Retrieval Augmented Generation (RAG) support. This feature seamlessly integrates document interactions into your chat experience. You can load documents directly into the chat or add files to the document library to easily access them using the # command before querying.

  • Web Search for RAG: Perform web searches using providers like SearXNG, Google PSE, Brave Search, serstack, and serper, and inject results directly into your chat experience.

  • Web Browsing: Seamlessly integrate websites into your chat experience using the # command followed by a URL. This feature allows you to incorporate web content directly into your conversations, enhancing the richness and depth of your interactions.

  • Image Generation Integration: Seamlessly incorporate image generation capabilities using the AUTOMATIC1111 API or options like ComfyUI (local) and OpenAI’s DALL-E (external) to enrich your chat experience with dynamic visual content.

  • ️Many Model Conversations: Effortlessly engage with a variety of models simultaneously, leveraging their unique strengths to get the best responses. Enhance your experience by leveraging a diverse set of models in parallel.

  • Role-Based Access Control (RBAC): Ensure secure access with restricted permissions; only authorized individuals can access your Ollama, and exclusive model creation/extraction permissions are reserved for administrators.

  • Multi-language support: Experience the Open web in your preferred language with our internationalization (i18n) support. Join us in expanding the languages ​​we support! We are actively looking for contributors!

  • Continuous updates: We are committed to improving the Open web with regular updates, fixes, and new features.

Deploy open-webui

In order to use with ollama, here we use docker-compose to quickly deploy open-webui:

You can refer to: https://github.com/valiantlynx/ollama-docker

Since I use the Nvidia Tesla T4 card locally, I directly use the GPU configuration (the driver needs to be installed in advance, which is ignored here)

docker-compose-ollama-gpu.yaml

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
version: '3.8'

services:
app:
build: .
ports:
- 8000:8000
- 5678:5678
volumes:
- .:/code
command: uvicorn src.main:app --host 0.0.0.0 --port 8000 --reload
restart: always
depends_on:
- ollama
- ollama-webui
networks:
- ollama-docker

ollama:
volumes:
- ./ollama/ollama:/root/.ollama container_name: ollama pull_policy: always tty: true restart: unless-stopped image: ollama/ollama:latest ports: - 11434:11434 networks: - ollama-docker deploy: resources: reservations: devices: - driver: nvidia count: 1 capabilities: [gpu] ollama-webui: image: ghcr.io/ollama-webui /ollama-webui:main container_name: ollama-webui volumes: - ./ollama/ollama-webui:/app/backend/data depends_on: - ollama ports: - 8080:8080 environment: - '/ollama/api=http://ollama:11434/api' extra_hosts:
- host.docker.internal:host-gateway
restart: unless-stopped
networks:
- ollama-docker

networks:
ollama-docker:
external: false

Start the service:

1
docker-compose -f docker-compose-ollama-gpu.yaml up -d

Visit https://xxx.xxx.xxx.xxx:8080/ and register for a management account to use it.

model001.png

Summary

The combination of Ollama and Open WebUI brings many advantages. First, through the intuitive UI interface, non-professional readers can also easily build a deep learning platform and train models. Second, Open WebUI provides Ollama with rich functions, such as data visualization, model performance monitoring, etc., so that users can more easily understand the training status and performance of the model. Finally, the open source nature of Ollama and Open WebUI allows users to customize and extend them according to their needs.