
Deploying Llama 3.1 on GCP Vertex AI

Faizan Khan

@faizan10114

Published on Jan 4, 2025

This guide walks you through deploying Llama 3.1 on Google Cloud's Vertex AI platform using Python from your terminal. We'll cover everything from environment setup to handling compute quotas.

Prerequisites

  • Google Cloud account

  • Python 3.10+

  • Access to Llama 3.1 (request from Meta)

Compute Requirements

Llama 3.1 comes in several sizes (8B, 70B, and 405B) with very different resource needs. This guide focuses on the 8B model:

  • Llama 3.1 8B: at least ~16GB of GPU memory for the fp16 weights, i.e. a single NVIDIA A100 on an A2 instance

Recommended instance type (you can check availability in your region as shown below):

  • 8B: a2-highgpu-2g
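
If you are unsure whether A2 instances are offered in your region, you can list them from the CLI. A quick check, using us-east1-b as an example zone:

# List A2 machine types available in a zone
gcloud compute machine-types list \
    --zones=us-east1-b \
    --filter="name~a2-highgpu"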

Step 1: Environment Setup

1.1 Install Google Cloud SDK

Install the Google Cloud SDK and authenticate using:
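
# If the gcloud CLI is not installed yet, follow https://cloud.google.com/sdk/docs/install first
gcloud auth login

# Application-default credentials are what the Python SDK picks up
gcloud auth application-default login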

1.2 Initialize Google Cloud
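
Initialize the CLI, point it at your project, and enable the Vertex AI API (your-project-id is a placeholder for your own project ID):

gcloud init

# Set the active project
gcloud config set project your-project-id

# Enable the Vertex AI API
gcloud services enable aiplatform.googleapis.com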


1.3 Set Up Python Environment
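
Create an isolated environment and install the Vertex AI Python SDK (the environment name llama-env is just an example):

python3 -m venv llama-env
source llama-env/bin/activate
pip install --upgrade google-cloud-aiplatform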


Step 2: Request Quota Increase

Before deploying, ensure you have sufficient quota for GPU instances:

  1. Visit the Google Cloud Console: https://console.cloud.google.com

  2. Go to IAM & Admin → Quotas

  3. Filter for "GPUs (all regions)"

  4. Select the quota for your region (e.g., "GPUs (us-east1)")

  5. Click "EDIT QUOTAS"

  6. Enter new quota limit:

    • For the 8B model: request at least 2 A100 GPUs (enough for a2-highgpu-2g)

    • For the 70B model: request at least 8 A100 GPUs (e.g., a2-highgpu-8g)

  7. Fill in the request form with a short justification for the GPU usage (e.g., deploying Llama 3.1 for inference on Vertex AI) and your contact details

Note: Quota approval can take 24-48 hours.
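
You can also check your current GPU quota from the terminal before filing the request. A quick way, assuming the us-east1 region used in this guide:

# Print the region's quotas and pick out the GPU entries
gcloud compute regions describe us-east1 | grep -B1 -A1 GPUS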

Step 3: Deployment Code

Create a new file deploy_llama.py:

from google.cloud import aiplatform
import os

def deploy_hf_model(
    project_id: str,
    location: str,
    model_id: str,
    machine_type: str = "a2-highgpu-2g",
    accelerator_count: int = 2,
):
    """
    Deploy a Hugging Face model using a pre-built serving container.
    """
    # Initialize the Vertex AI SDK
    aiplatform.init(project=project_id, location=location)

    # Environment variables read by the Text Generation Inference container
    env_vars = {
        "MODEL_ID": model_id,
        "MAX_INPUT_LENGTH": "512",
        "MAX_TOTAL_TOKENS": "1024",
        "MAX_BATCH_PREFILL_TOKENS": "2048",
        # Shard the model across all GPUs on the instance
        "NUM_SHARD": str(accelerator_count),
        # "HF_TOKEN": ""  # Add your Hugging Face token if the model is gated
    }


    # Create model using pre-built container
    model = aiplatform.Model.upload(
        display_name=f"hf-{model_id.replace('/', '-')}",
        # Using the official container for Hugging Face models
        serving_container_image_uri="us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-text-generation-inference-cu121.2-2.ubuntu2204.py310",
        serving_container_environment_variables=env_vars
    )
  
    print(f"model uploaded: {model}")

    # Deploy model to endpoint
    endpoint = model.deploy(
        machine_type=machine_type,
        min_replica_count=1,
        max_replica_count=1,
        accelerator_type="NVIDIA_TESLA_A100",
        accelerator_count=1,
        sync=True
    )
    
    print(f"Model deployed to endpoint: {endpoint.resource_name}")
    return endpoint


def create_completion(
    endpoint: aiplatform.Endpoint,
    prompt: str,
    max_tokens: int = 100,
    temperature: float = 0.7,
):
    """
    Generate text using the deployed model.
    """
    # The TGI container expects each instance to carry "inputs" and "parameters"
    response = endpoint.predict(
        instances=[
            {
                "inputs": prompt,
                "parameters": {
                    "max_new_tokens": max_tokens,
                    "temperature": temperature,
                    "top_p": 0.95,
                    "top_k": 40,
                },
            }
        ]
    )
    return response

if __name__ == "__main__":
    # Placeholders: replace with your own project ID and preferred region
    endpoint = deploy_hf_model(
        project_id="your-project-id",
        location="us-east1",
        model_id="meta-llama/Meta-Llama-3.1-8B-Instruct",  # gated repo; requires Meta approval
    )
    print(create_completion(endpoint, "Tell me a joke"))

Step 4: Deploy the Model

  1. Run the deployment:
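
python deploy_llama.py

When the script finishes it prints the endpoint resource name; the trailing number is the ENDPOINT_ID used in the commands below.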


Step 5: Monitor the Deployment

Using gcloud CLI:

# List endpoints
gcloud ai endpoints list --region=us-east1

# Get endpoint details
gcloud ai endpoints describe ENDPOINT_ID --region=us-east1

# Get endpoint predictions (--json-request expects a local JSON file)
echo '{"instances": [{"inputs": "Tell me a joke"}]}' > request.json
gcloud ai endpoints predict ENDPOINT_ID \
    --region=us-east1 \
    --json-request=request.json

Using Google Cloud Console:

  1. Go to Vertex AI → Models

  2. Find your model in the list

  3. Click on the Endpoints tab

  4. Monitor metrics:

    • Prediction requests

    • Latency

    • Error rate

Cost Optimization

To minimize costs:

  1. Delete endpoints when not in use:
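
One way to do this from the CLI (ENDPOINT_ID and DEPLOYED_MODEL_ID are placeholders; both appear in the output of gcloud ai endpoints describe):

# Undeploy the model from the endpoint (stops GPU billing)
gcloud ai endpoints undeploy-model ENDPOINT_ID \
    --region=us-east1 \
    --deployed-model-id=DEPLOYED_MODEL_ID

# Delete the endpoint itself
gcloud ai endpoints delete ENDPOINT_ID --region=us-east1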

Troubleshooting

Common issues and solutions:

  1. Quota Exceeded

    • Check current quota: gcloud compute regions describe REGION

    • Request increase as described above

  2. Out of Memory

    • Reduce batch size in environment variables

    • Use larger instance type

    • Reduce sequence length

  3. Model Access Error

    • Ensure Hugging Face token is set

    • Verify Meta approval for Llama 3.1


For more information, see the Vertex AI documentation: https://cloud.google.com/vertex-ai/docs


If you want to chat with any GitHub codebase, please visit CodesALot

If you want to chat with your data and generate visualizations, please visit SirPlotsAlot


©2025 – Made with ❤️ & ☕️ in Montreal

©2025 – Made with ❤️ & ☕️ in Montreal

©2025 – Made with ❤️ & ☕️ in Montreal