Quickly Set Up Your Own LLM API Service

OnethingAI

Published: 2025-06-12

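0. Overview:

sgl_router: the load-balancing module developed for the sglang engine. On top of round-robin it adds cache-aware load balancing, which substantially improves the KV cache hit rate (for details, see sglang 0.4).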
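To make the idea of cache-aware routing concrete, here is a toy sketch in Python. It is not sglang's implementation (which is considerably more sophisticated); the class name, the `min_match_ratio` threshold, and the worker names are all hypothetical. The router sends a request to the worker that has already seen the longest matching prompt prefix, and falls back to round-robin when no prefix matches well enough.

```python
from itertools import cycle


def common_prefix_len(a: str, b: str) -> int:
    """Length of the shared leading substring of a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


class CacheAwareRouter:
    """Toy router: prefer the worker that has seen the longest matching prefix."""

    def __init__(self, workers, min_match_ratio=0.5):
        self.workers = list(workers)
        self.min_match_ratio = min_match_ratio  # hypothetical threshold
        self._round_robin = cycle(self.workers)
        # Prompts previously sent to each worker (a stand-in for its KV cache).
        self._history = {w: [] for w in self.workers}

    def route(self, prompt: str) -> str:
        best_worker, best_len = None, 0
        for worker, prompts in self._history.items():
            for seen in prompts:
                match = common_prefix_len(prompt, seen)
                if match > best_len:
                    best_worker, best_len = worker, match
        if best_worker is None or best_len < self.min_match_ratio * len(prompt):
            # No sufficiently long shared prefix: fall back to round-robin.
            best_worker = next(self._round_robin)
        self._history[best_worker].append(prompt)
        return best_worker
```

With two workers, two prompts that share the same system prompt land on the same worker, so the second one can reuse the cached prefix, while an unrelated prompt goes to the next worker in round-robin order.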

Asynchronous HTTP handling: the Python aiohttp module.

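Different models are good at different things, so choose one according to your task. For example:

1. llama 3.1 8B and llama 3.3 70B are good at role-play;
2. llava and phi-4 are good at vision;
3. qwen2.5-coder and deepseek-coder v2 are good at writing code.

1. Go to onethingai.com and create a vLLM instance (choose the number of GPUs according to the size of the target model: for example, one 4090 is enough for an 8B model, while a 70B model needs eight 4090s)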
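The GPU counts above can be sanity-checked with some rough arithmetic. The sketch below is only an estimate under stated assumptions: weights stored in 16-bit precision (2 bytes per parameter), 24 GB of memory per 4090, about 90% of that memory usable, and the GPU count rounded up to a power of two, which is a common choice for tensor parallelism. It counts weights only and ignores the KV cache, so real requirements may be higher. The function name and parameters are hypothetical.

```python
import math


def estimate_gpu_count(params_billion, gpu_mem_gb=24.0,
                       bytes_per_param=2, usable_fraction=0.9):
    """Rough number of GPUs needed to hold the model weights alone."""
    # 1e9 parameters * bytes per parameter is approximately that many GB.
    weights_gb = params_billion * bytes_per_param
    usable_gb = gpu_mem_gb * usable_fraction
    minimum = max(1, math.ceil(weights_gb / usable_gb))
    # Round up to the next power of two (assumption: typical tensor-parallel sizes).
    return 1 << (minimum - 1).bit_length()


print(estimate_gpu_count(8))   # 8B model: about 16 GB of weights
print(estimate_gpu_count(70))  # 70B model: about 140 GB of weights
```

Under these assumptions the estimate agrees with the recommendation: one 4090 for an 8B model and eight for a 70B model. On a multi-GPU instance, vLLM would also need to be told to shard the model, for example with its `--tensor-parallel-size` option.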

2. Download the target language model

This example downloads the model from https://modelscope.cn/

pip install modelscope

Taking Llama 3.1 8B Instruct as an example:

modelscope download --model LLM-Research/Meta-Llama-3.1-8B-Instruct

3. Start the vLLM service

Check the path of the model downloaded by modelscope:

Using the example model from step 2:

ls /root/.cache/modelscope/hub/LLM-Research/
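Note that the directory name on disk is not identical to the model ID: in the start command used in this guide, the `.` in `3.1` appears as `___`. The helper below is a small sketch of that mapping, inferred from the path in this guide rather than from ModelScope's documentation; the function name is hypothetical, and the cache layout may differ between ModelScope versions, so the `ls` output remains the authoritative check.

```python
from pathlib import PurePosixPath


def modelscope_cache_dir(model_id, cache_root="/root/.cache/modelscope/hub"):
    """Guess the local directory for a model ID of the form 'owner/name'."""
    owner, name = model_id.split("/", 1)
    # Assumption: dots in the model name are stored as three underscores.
    return PurePosixPath(cache_root) / owner / name.replace(".", "___")


print(modelscope_cache_dir("LLM-Research/Meta-Llama-3.1-8B-Instruct"))
```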

Example start command:

nohup vllm serve /root/.cache/modelscope/hub/LLM-Research/Meta-Llama-3___1-8B-Instruct/ \
--host 0.0.0.0 --port 6006 --max-model-len 16384 \
--served-model-name meta-llama/Llama-3.1-8B-Instruct &
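Because the command runs in the background and the model takes a while to load, requests sent immediately may be refused. A minimal sketch of a wait loop, assuming the port 6006 used above; the function name and timeout values are hypothetical, and an open port is only a rough readiness signal, not a guarantee that the model can serve requests.

```python
import socket
import time


def wait_for_port(host, port, timeout=600.0, interval=2.0):
    """Return True once host:port accepts TCP connections, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Not listening yet: wait and retry.
            time.sleep(interval)
    return False


# Example (not executed here): wait_for_port("127.0.0.1", 6006)
```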

Once it is running, verify it locally:

curl -v http://127.0.0.1:6006/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
  "model": "meta-llama/Llama-3.1-8B-Instruct",
  "messages": [
    {"role": "user", "content": "Say this is a test!"}
  ],
  "temperature": 0.7
}'

On success, the service returns a JSON chat completion response.
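The same check can be scripted in Python using only the standard library. This is a sketch, not part of the original setup: the function names are hypothetical, and `extract_reply` assumes the OpenAI-style response layout (`choices[0].message.content`) that the `/v1/chat/completions` endpoint follows.

```python
import json
import urllib.request


def build_chat_request(base_url, model, user_message, temperature=0.7):
    """Build the same POST request that the curl command above sends."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "temperature": temperature,
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(response_body):
    """Pull the assistant's text out of an OpenAI-style chat completion."""
    data = json.loads(response_body)
    return data["choices"][0]["message"]["content"]


def chat(base_url, model, user_message):
    """Send the request and return the model's reply (requires a running server)."""
    request = build_chat_request(base_url, model, user_message)
    with urllib.request.urlopen(request, timeout=60) as response:
        return extract_reply(response.read())


# Example (not executed here):
# chat("http://127.0.0.1:6006", "meta-llama/Llama-3.1-8B-Instruct", "Say this is a test!")
```

Pointing `base_url` at your public domain instead of `127.0.0.1:6006` exercises the full path through nginx and the load balancer.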

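4. Configure the public-facing API

Start a VM at a cloud provider and install nginx (if you have not installed nginx before, look up the installation steps for your system).

load_balancer listens on port 8000 by default, so the nginx configuration is as follows: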

upstream backend {
    server localhost:8000;
}

server {
    listen       80 default_server;
    listen       [::]:80 default_server;
    server_name  llms.onethingai.com;
    root         /usr/share/nginx/html;

    # Load configuration files for the default server block.
    include /etc/nginx/default.d/*.conf;

    location / {
        proxy_pass http://backend;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
    }

    error_page 404 /404.html;
    location = /404.html {
    }

    error_page 500 502 503 504 /50x.html;
    location = /50x.html {
    }
}

Set up and start load_balancer

Download the code and create a Python environment:

git clone https://github.com/OneThingAI/simple-loadbalancer.git
cd simple-loadbalancer
conda create -n loadbalance python=3.10
conda activate loadbalance
pip install -r requirements.txt

Edit endpoints_config.yaml so that it points to your backend vLLM service(s).

Start load_balancer:

nohup python load_balancer.py &

Test

Here we assume your service domain is your_domain:

curl -v http://your_domain/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
  "model": "meta-llama/Llama-3.1-8B-Instruct",
  "messages": [
    {"role": "user", "content": "Say this is a test!"}
  ],
  "temperature": 0.7
}'

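Test result: a successful request should return the same kind of JSON response as the local check in step 3.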
