How to build and quantize a GGUF model with llama.cpp, then install and run it with Ollama

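I. Introduction

llama.cpp is a simple, efficient C/C++ inference engine. It converts trained models into a form that can run on a CPU, speeding up inference and reducing memory use.

llama.cpp is the underlying implementation of Ollama, LM Studio, and many other popular projects, and it is one of the inference engines supported by GPUStack. It also provides the GGUF model file format.

GGUF is a file format for storing models for inference. It is designed to be optimized for inference, so models load and run quickly.

llama.cpp also supports model quantization, which reduces a model's storage and compute requirements while keeping accuracy relatively high. This lets large models be deployed efficiently on desktops, embedded devices, and other resource-constrained environments, and it speeds up inference.

This tutorial shows how to build and quantize a GGUF model and then install and run it through Ollama.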

II. Procedure

1. Install a conda environment

On an Android device, install the ARM64 (aarch64) build of conda:

wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-aarch64.sh

bash Miniconda3-latest-Linux-aarch64.sh -b

On some Android tablets the installation may stop with an error mentioning: "$CONDA_EXEC" constructor --prefix "$PREFIX" --extract-conda-pkgs

In that case, install Miniforge3 instead. Download links for all versions are collected at https://conda-forge.org/miniforge/ (be sure to choose aarch64):

wget https://github.com/conda-forge/miniforge/releases/download/24.1.2-0/Miniforge3-24.1.2-0-Linux-aarch64.sh

bash Miniforge3-24.1.2-0-Linux-aarch64.sh -b

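On a server (x86_64), install conda:

cd /home/malata/Downloads

wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh

bash Miniconda3-latest-Linux-x86_64.sh -b

Then initialize conda, using the path where conda was actually installed (for example /home/malata/miniconda3/bin/conda init):

/root/miniconda3/bin/conda init

Then start a new shell:

bash

A prompt such as (base)root@localhost: means the installation is OK.

Note: conda deactivate exits the currently active Conda environment, including the base environment.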
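
The Android and server steps above differ only in the architecture suffix of the installer, so the choice can be scripted. The following is a minimal sketch rather than part of the original steps: the function name pick_installer is hypothetical, and it assumes the two Miniconda installer file names used above.

```shell
# Hypothetical helper: map a machine architecture (as printed by `uname -m`)
# to the Miniconda installer file name used in the steps above.
pick_installer() {
    case "$1" in
        aarch64|arm64) echo "Miniconda3-latest-Linux-aarch64.sh" ;;
        x86_64|amd64)  echo "Miniconda3-latest-Linux-x86_64.sh" ;;
        *)             echo "unsupported architecture: $1" >&2; return 1 ;;
    esac
}

# Print the download URL for the current machine without downloading anything.
if installer="$(pick_installer "$(uname -m)")"; then
    echo "https://repo.anaconda.com/miniconda/${installer}"
fi
```

The printed URL can then be passed to wget as in the manual steps.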

# Use mirrors in China as the package source

conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main/

conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge/

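2. Install llama.cpp

git clone https://github.com/ggerganov/llama.cpp.git

cd llama.cpp/

pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple

# Install CMake with your system's package manager: brew on macOS, or on Debian/Ubuntu for example sudo apt install cmake

brew install cmake

make

# If make reports that the Makefile build is no longer supported (newer llama.cpp versions), build with CMake instead: cmake -B build && cmake --build build --config Release

# Check that the quantization tool was built

./llama-quantize --help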
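
Where the built binaries end up depends on the build system: a make build, as above, leaves llama-quantize in the repository root, while a CMake build is assumed here to place it under build/bin. The sketch below (the function name find_llama_bin is hypothetical) checks both locations so that later steps can use whichever one exists.

```shell
# Hypothetical helper: print the path of a llama.cpp binary, checking the
# repository root (make build) first and build/bin (assumed CMake layout) second.
find_llama_bin() {
    repo="$1"
    name="$2"
    for candidate in "${repo}/${name}" "${repo}/build/bin/${name}"; do
        if [ -x "${candidate}" ]; then
            echo "${candidate}"
            return 0
        fi
    done
    echo "${name} not found under ${repo}; has llama.cpp been built?" >&2
    return 1
}
```

For example, find_llama_bin "$PWD" llama-quantize run from the llama.cpp directory prints the path to use, or an error if the build has not produced the binary yet.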

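3. Download the original model

# Download the model from Hugging Face with the huggingface-cli command. First install the dependency:

pip install -U huggingface_hub

# Use a mirror in China for the download:

export HF_ENDPOINT=https://hf-mirror.com

# This example downloads Qwen/Qwen2.5-1.5B-Instruct. The model does not require authentication, so it can be downloaded directly. To download a model that does require authentication, see the tutorial at https://juejin.cn/post/7434201140294860826

mkdir ~/huggingface.co

cd ~/huggingface.co/

huggingface-cli download Qwen/Qwen2.5-1.5B-Instruct --local-dir Qwen2.5-1.5B-Instruct

# The download takes a while, so please be patient. When it finishes, continue with the next step.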
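
Before converting, it can help to confirm that the download is complete. The following is a minimal sketch, assuming the model directory should contain config.json and at least one .safetensors weight file. That file layout is an assumption about the Hugging Face repository, and check_model_dir is a hypothetical name.

```shell
# Hypothetical helper: sanity-check a downloaded Hugging Face model directory.
check_model_dir() {
    dir="$1"
    if [ ! -f "${dir}/config.json" ]; then
        echo "missing ${dir}/config.json" >&2
        return 1
    fi
    # ls exits non-zero when the glob matches no file
    if ! ls "${dir}"/*.safetensors >/dev/null 2>&1; then
        echo "no .safetensors files in ${dir}" >&2
        return 1
    fi
    echo "ok: ${dir}"
}
```

For the download above, the call would be check_model_dir ~/huggingface.co/Qwen2.5-1.5B-Instruct.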

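4. Create the script that converts the model to GGUF and quantizes it:

cd ~/huggingface.co/

vim quantize.sh

# Paste in the script below, and change the llama.cpp and huggingface.co directory paths to the actual paths in your environment. They must be absolute paths.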

#!/usr/bin/env bash

# Absolute paths to the llama.cpp checkout and to the model download directory
llama_cpp="/home/malata/Ollama/llama.cpp"
b="/home/malata/huggingface.co"

export PATH="$PATH:${llama_cpp}"

# $1 is the model directory under ${b}, for example Qwen2.5-1.5B-Instruct
s="$1"
# Model name: the second "/"-separated field, or the whole argument when it contains no "/"
n="$(echo "${s}" | cut -d'/' -f2)"
d="gpustack/${n}-GGUF"

# prepare
mkdir -p ${b}/${d} 1>/dev/null 2>&1
pushd ${b}/${d} 1>/dev/null 2>&1
git init . 1>/dev/null 2>&1

# Copy repository metadata from the source model if it is not already present
if [[ ! -f .gitattributes ]]; then
    cp -f ${b}/${s}/.gitattributes . 1>/dev/null 2>&1 || true
    echo "*.gguf filter=lfs diff=lfs merge=lfs -text" >> .gitattributes
fi
if [[ ! -d assets ]]; then
    cp -rf ${b}/${s}/assets . 1>/dev/null 2>&1 || true
fi
if [[ ! -d images ]]; then
    cp -rf ${b}/${s}/images . 1>/dev/null 2>&1 || true
fi
if [[ ! -d imgs ]]; then
    cp -rf ${b}/${s}/imgs . 1>/dev/null 2>&1 || true
fi
if [[ ! -f README.md ]]; then
    cp -f ${b}/${s}/README.md . 1>/dev/null 2>&1 || true
fi

# From here on, stop at the first failing command
set -e

pushd ${llama_cpp} 1>/dev/null 2>&1

# convert
[[ -f venv/bin/activate ]] && source venv/bin/activate
echo "#### convert_hf_to_gguf.py ${b}/${s} --outfile ${b}/${d}/${n}-FP16.gguf"
python3 convert_hf_to_gguf.py ${b}/${s} --outfile ${b}/${d}/${n}-FP16.gguf

# quantize
qs=(
    "Q8_0"
    "Q6_K"
    "Q5_K_M"
    "Q5_0"
    "Q4_K_M"
    "Q4_0"
    "Q3_K"
    "Q2_K"
)
for q in "${qs[@]}"; do
    echo "#### llama-quantize ${b}/${d}/${n}-FP16.gguf ${b}/${d}/${n}-${q}.gguf ${q}"
    llama-quantize ${b}/${d}/${n}-FP16.gguf ${b}/${d}/${n}-${q}.gguf ${q}
    ls -lth ${b}/${d}
    sleep 3
done

popd 1>/dev/null 2>&1

set +e
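
The script derives the output name from its argument with cut. The short sketch below shows how that behaves for the two argument forms (the function name model_name is hypothetical): with an org/name argument the second field is taken, and with no slash cut passes the whole string through.

```shell
# Hypothetical helper mirroring the name derivation in quantize.sh
model_name() {
    echo "$1" | cut -d'/' -f2
}

model_name "Qwen/Qwen2.5-1.5B-Instruct"   # prints Qwen2.5-1.5B-Instruct
model_name "Qwen2.5-1.5B-Instruct"        # prints Qwen2.5-1.5B-Instruct

# The output directory is then built from that name
echo "gpustack/$(model_name "Qwen2.5-1.5B-Instruct")-GGUF"
```

Either form therefore produces the same output directory, gpustack/Qwen2.5-1.5B-Instruct-GGUF, as long as the argument matches the directory the model was downloaded into.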

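5. Convert the model to GGUF

# Convert the model to an FP16 GGUF model, then quantize it with each of the Q8_0, Q6_K, Q5_K_M, Q5_0, Q4_K_M, Q4_0, Q3_K, and Q2_K methods:

bash quantize.sh Qwen2.5-1.5B-Instruct

# After the script finishes, list the converted FP16 GGUF model and the quantized GGUF models:

ls -lh gpustack/Qwen2.5-1.5B-Instruct-GGUF/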
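
To check the result without reading the listing by eye, the expected file names can be generated from the same list of quantization types. The following is a minimal sketch: the function names are hypothetical, and the naming pattern is the one quantize.sh uses.

```shell
# FP16 conversion output plus the quantization types used by quantize.sh
QUANTS="FP16 Q8_0 Q6_K Q5_K_M Q5_0 Q4_K_M Q4_0 Q3_K Q2_K"

# Print one expected file name per type; $1 = model name
expected_files() {
    for q in ${QUANTS}; do
        echo "$1-${q}.gguf"
    done
}

# Print the expected files that do not exist; $1 = output directory, $2 = model name
missing_files() {
    for f in $(expected_files "$2"); do
        [ -f "$1/${f}" ] || echo "${f}"
    done
}
```

Running missing_files ~/huggingface.co/gpustack/Qwen2.5-1.5B-Instruct-GGUF Qwen2.5-1.5B-Instruct prints nothing when all nine files are present, and otherwise lists the ones that still need to be produced.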

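6. Install Ollama:

# Ollama is a command-line tool that runs a variety of language models locally on Linux, including Gemma. Installation instructions and packages are available on the Ollama website and its GitHub page.

# Option 1: needs a connection to GitHub; from mainland China this sometimes works and is fairly fast

curl -fsSL https://ollama.com/install.sh | sh

# Option 2: download the script first, then run it; from mainland China this needs a proxy and is slower

curl -L -o install.sh https://ollama.com/install.sh

bash install.sh

# After installation, verify that Ollama is installed correctly:

ollama --version

# If a version number is printed, the installation succeeded

ollama version is 0.3.14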
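
In a setup script, the version number can be extracted from that output line for logging or comparison. The following is a sketch, assuming the output is a single line in the format shown above (parse_ollama_version is a hypothetical name).

```shell
# Hypothetical helper: keep the last whitespace-separated word of the line
parse_ollama_version() {
    echo "$1" | awk '{ print $NF }'
}

parse_ollama_version "ollama version is 0.3.14"   # prints 0.3.14
```

With Ollama installed, the same function can be fed the live output: parse_ollama_version "$(ollama --version)".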

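# Common Ollama commands:

ollama serve    # start the Ollama server
ollama create   # create a model from a Modelfile
ollama show     # show information about a model
ollama run      # run a model
ollama pull     # pull a model from a registry
ollama push     # push a model to a registry
ollama list     # list models
ollama cp       # copy a model
ollama rm       # remove a model
ollama help     # get help about any command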

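7. Import the GGUF model file into Ollama

cd ~/huggingface.co/

# Write a one-line Modelfile that points at the quantized GGUF file

echo "FROM ./gpustack/Qwen2.5-1.5B-Instruct-GGUF/Qwen2.5-1.5B-Instruct-Q4_K_M.gguf" > Qwen2.5-1.5B-Instruct-Q4_K_M.modelfile

ollama create Qwen2.5-1.5B-Instruct-Q4_K_M -f Qwen2.5-1.5B-Instruct-Q4_K_M.modelfile

# On success, output like the following appears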

transferring model data 100%

using existing layer sha256:5669c912f8ef5abf7e99bd50afa74ed0d14ce46bc9da5d681f64eb1312e3f673

creating new layer sha256:073cd096f2ecfb321ce0044a93c47ad6752718444b314662992ac186ffbf8c31

writing manifest

success
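
A Modelfile can carry more than the FROM line. The sketch below writes one with an optional sampling parameter. PARAMETER temperature is a standard Modelfile instruction, while the function name write_modelfile and the default value 0.7 are illustrative assumptions, not settings from this tutorial.

```shell
# Hypothetical helper: write a Modelfile for a local GGUF file.
# $1 = path to the GGUF file, $2 = Modelfile to write, $3 = temperature (optional)
write_modelfile() {
    gguf="$1"
    out="$2"
    temperature="${3:-0.7}"
    {
        echo "FROM ${gguf}"
        echo "PARAMETER temperature ${temperature}"
    } > "${out}"
}
```

The resulting file is passed to ollama create with -f in the same way as the one-line Modelfile above.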

# Run the model

ollama run Qwen2.5-1.5B-Instruct-Q4_K_M

# The following prompt means the model is running

>>> Send a message (/? for help)

Tags: llama.cpp, GGUF models
Source: 智答专家
Published: 2024-11-11 14:16