Table of Contents
URL: https://www.progressiverobot.com/image-processing-using-llama-huggingface/
Introduction
Extracting insights from images has long been a challenge across industries like finance, healthcare, and law. Traditional methods, such as Optical Character Recognition (OCR), have struggled with complex layouts and contextual understanding.
Llama 3.2 Vision, an advanced AI model, enhances image processing capabilities like Visual Question Answering and OCR. By integrating this model with the cloud provider's cloud infrastructure, this tutorial provides a scalable and efficient way to implement AI-powered image processing.
In this tutorial, you will learn to set up Llama 3.2 Vision with the cloud provider's cloud infrastructure, and demonstrate how to use it for AI-powered image processing for extracting employee IDs and names from images. We will cover the installation and configuration steps, as well as provide examples of how to use the model for Visual Question Answering and OCR. By the end of this tutorial, you will have a solid understanding of how to leverage Llama 3.2 Vision for your image processing needs.
Prerequisites
Before proceeding, ensure you have:
- A GPU cloud servers with Python 3.10+ installed.
- A object storage account with an access key and secret key.
- A the cloud provider Managed MySQL database.
- A HuggingFace Token.
Step 1 - Set Up the Environment
SSH into Your GPU Droplet
Connect to your server via SSH:
ssh root@your_server_ip
Install Python & Create a Virtual Environment
Run the following commands to set up a Python virtual environment:
apt install python3.10-venv -y
python3.10 -m venv llama-env
Activate the Virtual Environment
source llama-env/bin/activate
Step 2 - Install Required Dependencies
Install PyTorch & Hugging Face CLI
pip install torch torchvision torchaudio
pip install -U "huggingface_hub[cli]"
huggingface-cli login
Install the Transformers Library
pip install --upgrade transformers
Install Flask & AWS SDK (Boto3)
Boto3 is required to interact with object storage, which is S3-compatible.
pip install flask boto3
Install MySQL Connector for Python
pip install mysql-connector-python
Step 3 - Install & Configure Nginx
Install Nginx to serve your Flask application:
sudo apt install nginx -y
Step 4 - Set Up the Flask Web Application
Application Folder Structure
Organize your project as follows:
llama-webapp/
├── app.py # Main Flask app file
├── static/
│ └── styles.css # Optional: CSS file for styling
└── templates/
└── index.html # HTML template for the web page
Python Code for the Application
Below is the Flask app (app.py) that loads the Llama 3.2 model, processes uploaded images, and extracts employee details.
import os
import json
import requests
from PIL import Image
from flask import Flask, request, render_template, session
from transformers import MllamaForConditionalGeneration, AutoProcessor
import boto3
import torch
import mysql.connector
import re
# Initialize Flask app
app = Flask(__name__)
app.secret_key = "your_secret_key" # Replace with a strong secret key
# Load the Llama 3.2 model and processor
model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"
model = MllamaForConditionalGeneration.from_pretrained(
model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)
# object storage setup
SPACE_NAME = "your_space_name"
SPACE_REGION = "your_region"
ACCESS_KEY = "your_access_key"
SECRET_KEY = "your_secret_key"
s3 = boto3.client(
"s3",
region_name=SPACE_REGION,
endpoint_url=f"images/image-processing-using-llama-huggingface-section-1.png",
aws_access_key_id=ACCESS_KEY,
aws_secret_access_key=SECRET_KEY
)
# the cloud provider Managed MySQL setup
DB_HOST = "your_mysql_host"
DB_PORT = 25060
DB_NAME = "your_database_name"
DB_USER = "your_username"
DB_PASSWORD = "your_password"
# Function to establish a database connection
def get_db_connection():
try:
conn = mysql.connector.connect(
host=DB_HOST, port=DB_PORT, database=DB_NAME, user=DB_USER, password=DB_PASSWORD
)
print("✅ Database connection successful!")
return conn
except Exception as e:
print(f"❌ Error connecting to the database: {e}")
return None
# Function to extract Employee ID & Name from an image
def extract_employee_details(image_path):
"""Extracts Employee Name and ID using Llama 3.2 Vision AI."""
try:
image = Image.open(image_path)
prompt = (
"Extract the Employee ID and Name from the given image. "
"Provide output in valid JSON format with keys: 'employee_id' and 'employee_name'."
)
messages = [{"role": "user", "content": [{"type": "image"}, {"type": "text", "text": prompt}]}]
input_text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(image, input_text, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=1024)
raw_result = processor.decode(output[0])
json_match = re.search(r"\{.*?\}", raw_result, re.DOTALL)
extracted_data = json.loads(json_match.group(0)) if json_match else {}
employee_id = extracted_data.get("employee_id", "").strip()
employee_name = extracted_data.get("employee_name", "").strip()
return employee_id, employee_name
except Exception as e:
return None, None
# Flask Routes
@app.route("/", methods=["GET", "POST"])
def index():
"""Handles image uploads, extracts Employee Name & ID, and stores it in MySQL."""
result = None
image_url = session.get("image_url")
if request.method == "POST":
image_file = request.files.get("image")
if not image_file:
return "Error: Please upload an image.", 400
filename = image_file.filename
image_path = os.path.join("/tmp", filename)
image_file.save(image_path)
employee_id, employee_name = extract_employee_details(image_path)
return render_template("index.html", result=result, image_url=image_url)
if __name__ == "__main__":
app.run(host="0.0.0.0", port=5000)
Step 5: Run & Access the Application
- Start the Flask application:
python app.py
- Open your browser and visit:
http://your_server_ip:5000
- Upload an image, extract employee details, and verify data storage in the database.
FAQs
1. What is Llama 3.2, and how does it differ from previous versions?
Llama 3.2 is a state-of-the-art AI model developed by Meta (Facebook) that builds upon its predecessor, Llama 3. It offers improved natural language understanding, better performance in multimodal tasks (including image processing), and enhanced efficiency when integrated with Hugging Face Transformers.
2. Can Llama 3.2 process images directly?
Yes, Llama 3.2 introduces vision models (11B and 90B) that enable it to process and understand images directly, allowing for tasks like image captioning, object recognition, and scene interpretation
3. What are some common use cases of Llama 3.2 in image processing?
Llama 3.2 can assist in image processing tasks such as:
- Image Captioning: Generating descriptive text from images.
- Object Recognition: Identifying objects within an image when combined with a vision model.
- Text Extraction (OCR): Helping interpret extracted text from an image.
- Style Transfer & Image Editing: Assisting in AI-powered image generation and modification.
4. How do I set up Hugging Face Transformers to work with Llama 3.2?
You can install the required libraries and load the model using the following steps:
pip install transformers torch torchvision
Then, load the model with:
from transformers import AutoModelForCausalLM, AutoTokenizer
model_name = "meta-llama/llama-3-2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
If working with images, you may also need transformers's vision models like CLIP:
from transformers import CLIPProcessor, CLIPModel
clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
5. How can I use Llama 3.2 for image captioning?
Llama 3.2’s vision model can generate high-quality captions for images:
from transformers import AutoProcessor, AutoModelForVision2Seq
from PIL import Image
model_name = "meta-llama/llama-3.2-vision"
processor = AutoProcessor.from_pretrained(model_name)
model = AutoModelForVision2Seq.from_pretrained(model_name)
image = Image.open("your_image.jpg")
inputs = processor(images=image, return_tensors="pt")
output = model.generate(**inputs)
caption = processor.batch_decode(output, skip_special_tokens=True)[0]
print("Generated Caption:", caption)
6. Can I fine-tune Llama 3.2 Vision on my own dataset?
Yes, you can fine-tune Llama 3.2 Vision using Hugging Face’s transformers library with LoRA (Low-Rank Adaptation).
Example fine-tuning setup:
from peft import LoraConfig, get_peft_model
config = LoraConfig(
r=8, lora_alpha=32, target_modules=["q_proj", "v_proj"]
)
fine_tuned_model = get_peft_model(model, config)
This allows efficient fine-tuning without retraining the entire model.
Conclusion
In this tutorial, you learned how to extract employee IDs and names from images using the Llama 3.2 Vision model. We integrated object storage for storing images and used a managed MySQL database for structured data storage. This solution provides an automated way to process and manage employee verification data with AI-powered efficiency.