Table of Contents
Update
> Note: 1-Click Deploy of HUGS to GPU cloud servers was deprecated on September 30, 2025. Hugging Face no longer offers HUGS deployments, as the experiment was discontinued. More information can be found in the HUGS documentation. > > For current AI inference solutions, explore the cloud provider's Serverless Inference offering.
Introduction
Hugging Face’s Generative AI Services (HUGS) makes deploying and managing LLMs easier and faster. Now, with the cloud provider’s 1-Click deployment for HUGS on GPU Droplets, you can set up, scale, and optimize LLMs on a cloud infrastructure tailored for high performance. This guide walks you through deploying HUGS on a GPU cloud servers and integrating it with Open WebUI. It also explains why this setup is ideal for seamless, scalable LLM inference.
Prerequisites
- A the cloud provider Cloud account.
- A GPU Droplet deployed and running, and another Droplet up and running to deploy and run the Open WebUI docker container.
- Familiarity with SSH and basic Docker commands.
- An SSH key for logging into your Droplet.
Step 1 - Create and Access Your GPU Droplet
Go to the cloud provider’s Droplets page and create a new GPU Droplet. Under the Choose an Image tab, please select 1-Click Models and use one of the available Hugging Face images.
- Set up the Droplet:
Once your Droplet is ready, click on its name in the Droplets section and select Launch Web Console.
- Access the Console:
- Please note the Message of the Day (MOTD): This contains the bearer token and inference endpoint for API access, which you’ll need later.
Step 2 - Start Hugging Face HUGS
Hugging Face HUGS will automatically start after the Droplet setup. To verify, check the status of the Caddy service managing the inference API:
sudo systemctl status caddy
[secondary_label Output
● caddy.service - Caddy
Loaded: loaded (/lib/systemd/system/caddy.service; enabled; vendor preset: enabled)
Drop-In: /etc/systemd/system/caddy.service.d
└─override.conf
Active: <^>active<^> (running) since Wed 2024-10-30 10:27:10 UTC; 2min 58s ago
Docs: https://caddyserver.com/docs/
Main PID: 8239 (caddy)
Tasks: 17 (limit: 629145)
Memory: 48.8M
CPU: 73ms
CGroup: /system.slice/caddy.service
└─8239 /usr/bin/caddy run --config /etc/caddy/Caddyfile
Allow 5-10 minutes for the model to fully load.
Step 3 - Start Open WebUI
Launch Open WebUI using Docker on another Droplet. Please use the below docker command to run the Open WebUI docker container.
docker run -d -p 3000:8080 --add-host=host.docker.internal:host-gateway -v open-webui:/app/backend/data --name open-webui --restart always ghcr.io/open-webui/open-webui:main
Once Open WebUI runs, access it at http://<your_droplet_ip>:3000.
Step 4 - Integrate HUGS with Open WebUI
To connect Open WebUI with Hugging Face HUGS:
- Open Settings:
- In Open WebUI, click your user icon at the bottom left, then click Settings.
- Go to Admin:
- Navigate to the Admin tab, then select Connections.
- Set the Inference Endpoint:
- In the API link field, enter your Droplet’s IP followed by
/v1. If a specific port is required, include it, e.g.,http://<your_droplet_ip>/v1. - Use the API token from the MOTD for authentication.
- Verify Connection:
- Click Verify Connection. A green light confirms a successful connection. Open WebUI will then auto-detect available models, such as
hfhgus/Meta-Llama.
—
Step 5: Start Chatting with the Model
With HUGS integrated into Open WebUI, you’re ready to interact with your LLM:
- Ask questions like “What is the cloud provider?”
- Monitor requests logs from the container while asking a follow-up question:
Does the cloud provider offer object storage?:
sudo docker ps
sudo docker logs <your-container-ID> -f
Why Choose HUGS on GPU cloud servers?
Deploying HUGS with the cloud provider’s one-click setup is straightforward. No need for manual configurations—the cloud provider and Hugging Face handle the backend, allowing you to focus on scaling.
- Ease of Deployment and Simplified Management
HUGS on the cloud provider GPUs ensures optimal performance, running LLMs efficiently on GPU hardware without manual tuning.
- Optimized Performance for Large-Scale Inference
the cloud provider’s infrastructure supports scalable deployments with load balancers for high availability, letting you serve users globally with low latency.
- Scalability and Flexibility
By using Hugging Face HUGS on GPU cloud servers, you not only benefit from high-performance LLM inference but also gain the flexibility to scale and manage the deployment effortlessly. This combination of optimized hardware, scalability, and simplicity makes the cloud provider an excellent choice for production-level AI workloads.
Conclusion
With HUGS deployed on the cloud provider’s GPU Droplet and Open WebUI, you can efficiently manage, scale, and optimize LLM inference. This setup eliminates hardware optimization concerns and provides a ready-to-scale solution for delivering fast, reliable responses across multiple regions.