> Introduction
Giving a robot the ability to recognize objects in real time is one of the most impactful upgrades you can make to its autonomy. Whether the robot needs to pick up a specific item, avoid pedestrians, or follow a person, object detection is the perceptual foundation.
YOLO (You Only Look Once) has become the go-to family of models for robotics applications because it delivers excellent accuracy at speeds that matter on embedded hardware. In this guide we will set up YOLOv8 inside a ROS2 node, integrate it with a robot camera topic, add depth-based 3D localization via OpenCV, and optimize inference for deployment on a Jetson or similar edge device.
> YOLOv8 Model Overview
Ultralytics YOLOv8 is an anchor-free, single-stage detector that unifies detection, segmentation, pose estimation, and classification under one API. It comes in five size variants: nano (n), small (s), medium (m), large (l), and extra-large (x).
Model Size vs. Speed on Jetson Orin
- YOLOv8n: ~120 FPS — suitable for high-frequency control loops
- YOLOv8s: ~85 FPS — good balance for mobile robots
- YOLOv8m: ~45 FPS — preferred for manipulation arms with slower cycles
- YOLOv8l/x: <25 FPS — use only when accuracy is paramount and latency is acceptable
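As a rough illustration (throughput numbers taken from the list above; the helper name is hypothetical), the variant choice can be automated against a frame-rate budget:

```python
# Approximate Jetson Orin throughput from the list above (FPS)
VARIANT_FPS = {'yolov8n.pt': 120, 'yolov8s.pt': 85, 'yolov8m.pt': 45, 'yolov8l.pt': 25}

def pick_variant(required_fps: float) -> str:
    """Return the largest (most accurate) variant that still meets the FPS budget."""
    # Iterate from most accurate to fastest; the first one fast enough wins
    for name in ('yolov8l.pt', 'yolov8m.pt', 'yolov8s.pt', 'yolov8n.pt'):
        if VARIANT_FPS[name] >= required_fps:
            return name
    raise ValueError('No variant meets the requested frame rate')

print(pick_variant(30))   # -> yolov8m.pt (manipulation-arm cycle)
print(pick_variant(100))  # -> yolov8n.pt (high-frequency control loop)
```

In practice you would benchmark on your own hardware, since these numbers shift with JetPack version, power mode, and input resolution.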
Task Variants Available
from ultralytics import YOLO
# Detection
model = YOLO('yolov8n.pt')
# Instance Segmentation
model = YOLO('yolov8n-seg.pt')
# Pose Estimation (human keypoints)
model = YOLO('yolov8n-pose.pt')
# OBB (Oriented Bounding Box)
model = YOLO('yolov8n-obb.pt')
> Environment Setup
We need ROS2 Humble, Ultralytics, OpenCV with Python bindings, and the cv_bridge package that converts between ROS image messages and NumPy arrays.
# System dependencies
sudo apt update
sudo apt install ros-humble-cv-bridge
sudo apt install ros-humble-image-transport
sudo apt install ros-humble-vision-msgs
sudo apt install python3-pip
# Python packages
pip install ultralytics
pip install opencv-python-headless
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu118
# Note: on Jetson, install the NVIDIA-provided PyTorch wheel for your JetPack release instead
# Verify GPU is visible
python3 -c "import torch; print(torch.cuda.is_available())"
ROS2 Package Setup
cd ~/ros2_ws/src
ros2 pkg create --build-type ament_python yolo_detector \
--dependencies rclpy sensor_msgs vision_msgs cv_bridge
cd ~/ros2_ws
colcon build --packages-select yolo_detector
source install/setup.bash
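For `ros2 run` to find the node later, the package's setup.py also needs a console-script entry point. A sketch, assuming the node code below is saved as yolo_detector/yolo_node.py inside the package:

```python
# setup.py (fragment) -- entry point assuming yolo_detector/yolo_node.py
entry_points={
    'console_scripts': [
        'yolo_node = yolo_detector.yolo_node:main',
    ],
},
```

After adding this and rebuilding with colcon, the node can be started with, e.g., `ros2 run yolo_detector yolo_node --ros-args -p model:=yolov8s.pt -p confidence:=0.4`.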
> Writing the ROS2 Detection Node
The core idea is simple: subscribe to a camera image topic, run YOLO inference, publish detection results as vision_msgs/Detection2DArray, and re-publish an annotated image for visualization.
yolo_node.py
import rclpy
from rclpy.node import Node
from sensor_msgs.msg import Image
from vision_msgs.msg import Detection2DArray, Detection2D, ObjectHypothesisWithPose
from cv_bridge import CvBridge
from ultralytics import YOLO
import cv2
class YoloDetectorNode(Node):
    def __init__(self):
        super().__init__('yolo_detector')
        # Parameters
        self.declare_parameter('model', 'yolov8n.pt')
        self.declare_parameter('confidence', 0.5)
        self.declare_parameter('device', 'cuda:0')
        model_path = self.get_parameter('model').value
        self.conf = self.get_parameter('confidence').value
        device = self.get_parameter('device').value
        # Load model
        self.model = YOLO(model_path)
        self.model.to(device)
        self.bridge = CvBridge()
        # I/O
        self.sub = self.create_subscription(
            Image, '/camera/image_raw', self.image_callback, 10)
        self.det_pub = self.create_publisher(
            Detection2DArray, '/detections', 10)
        self.img_pub = self.create_publisher(
            Image, '/detections/image', 10)
        self.get_logger().info('YoloDetectorNode ready.')

    def image_callback(self, msg: Image):
        frame = self.bridge.imgmsg_to_cv2(msg, 'bgr8')
        results = self.model(frame, conf=self.conf, verbose=False)[0]
        det_array = Detection2DArray()
        det_array.header = msg.header
        for box in results.boxes:
            det = Detection2D()
            det.header = msg.header
            hyp = ObjectHypothesisWithPose()
            hyp.hypothesis.class_id = str(int(box.cls[0]))
            hyp.hypothesis.score = float(box.conf[0])
            det.results.append(hyp)
            x1, y1, x2, y2 = box.xyxy[0].tolist()
            det.bbox.center.position.x = (x1 + x2) / 2
            det.bbox.center.position.y = (y1 + y2) / 2
            det.bbox.size_x = x2 - x1
            det.bbox.size_y = y2 - y1
            det_array.detections.append(det)
        self.det_pub.publish(det_array)
        # Annotated image
        annotated = results.plot()
        self.img_pub.publish(
            self.bridge.cv2_to_imgmsg(annotated, 'bgr8'))

def main():
    rclpy.init()
    node = YoloDetectorNode()
    rclpy.spin(node)
    node.destroy_node()
    rclpy.shutdown()

if __name__ == '__main__':
    main()
> OpenCV Integration and Visualization
Beyond the raw bounding boxes, we can assign persistent tracking IDs via YOLO's built-in trackers, use OpenCV to draw trajectory trails, and estimate object distance from the camera's intrinsic parameters.
YOLO Tracking with ByteTrack
# Persistent tracking across frames
# (requires self.trails = {} initialized in __init__)
results = self.model.track(
    frame,
    conf=self.conf,
    tracker='bytetrack.yaml',  # or 'botsort.yaml'
    persist=True,
    verbose=False
)[0]
if results.boxes.id is not None:
    track_ids = results.boxes.id.int().tolist()
    for tid, box in zip(track_ids, results.boxes.xyxy):
        x1, y1, x2, y2 = map(int, box.tolist())
        cx, cy = (x1 + x2) // 2, (y1 + y2) // 2
        # Draw trail
        if tid not in self.trails:
            self.trails[tid] = []
        self.trails[tid].append((cx, cy))
        self.trails[tid] = self.trails[tid][-30:]  # keep last 30 pts
        for i in range(1, len(self.trails[tid])):
            cv2.line(frame, self.trails[tid][i-1],
                     self.trails[tid][i], (0, 255, 255), 2)
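One detail the snippet above leaves open: entries in self.trails persist even after a track disappears, so the dictionary grows without bound on long runs. A minimal pruning helper (hypothetical name, pure Python) keeps only the trails the tracker still reports:

```python
def prune_trails(trails: dict, active_ids) -> dict:
    """Drop trail history for track IDs no longer reported by the tracker."""
    active = set(active_ids)
    return {tid: pts for tid, pts in trails.items() if tid in active}

# Example: tracks 1 and 2 were seen before, but only 2 is still active
trails = {1: [(10, 10)], 2: [(50, 60), (52, 61)]}
trails = prune_trails(trails, active_ids=[2])
print(trails)  # {2: [(50, 60), (52, 61)]}
```

Calling this once per frame with the current track_ids list keeps memory flat regardless of how many objects pass through the scene.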
Monocular Distance Estimation
# Known real-world width W of the object (meters)
# Camera focal length fx (pixels) from calibration
# Measured pixel width pw of the bounding box
def estimate_distance(pw_pixels, real_width_m, focal_length_px):
    """Pinhole similar-triangles relation: D = (W * fx) / pw"""
    return (real_width_m * focal_length_px) / pw_pixels

# Example: person ~0.5 m wide, fx=600, bbox width from det.bbox.size_x
dist = estimate_distance(
    pw_pixels=bbox_size_x,
    real_width_m=0.5,
    focal_length_px=600.0
)
print(f'Estimated distance: {dist:.2f} m')
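The relation is easy to sanity-check with synthetic numbers: at fx = 600 px, a 0.5 m wide person whose bounding box spans 150 px should sit 2 m away, and inverting the formula should recover the pixel width.

```python
def estimate_distance(pw_pixels, real_width_m, focal_length_px):
    # Same pinhole relation as above: D = (W * fx) / pw
    return (real_width_m * focal_length_px) / pw_pixels

d = estimate_distance(pw_pixels=150, real_width_m=0.5, focal_length_px=600.0)
print(f'{d:.2f} m')  # 2.00 m

# Inverting: an object at 2 m should project back to 150 px
pw = (0.5 * 600.0) / d
assert abs(pw - 150.0) < 1e-9
```

Keep in mind this only holds when the assumed real-world width matches the actual object, which is why a depth camera (next section) is preferred when available.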
> 3D Object Localization with Depth Camera
With an RGB-D camera (RealSense D435, ZED 2, etc.) we can project detected bounding box centers into 3D using the aligned depth stream and camera intrinsics.
from sensor_msgs.msg import CameraInfo
import numpy as np

class YoloDepthNode(Node):
    def __init__(self):
        super().__init__('yolo_depth')
        self.bridge = CvBridge()
        self.K = None  # camera intrinsic matrix
        self.info_sub = self.create_subscription(
            CameraInfo, '/camera/depth/camera_info',
            self.info_callback, 1)
        self.depth_sub = self.create_subscription(
            Image, '/camera/aligned_depth_to_color/image_raw',
            self.depth_callback, 10)

    def info_callback(self, msg):
        self.K = np.array(msg.k).reshape(3, 3)

    def depth_callback(self, msg):
        if self.K is None:
            return  # intrinsics not received yet
        depth_image = self.bridge.imgmsg_to_cv2(msg, '16UC1')
        depth_m = depth_image.astype(np.float32) / 1000.0  # mm -> m
        # For each detection center (cx, cy) -- example values shown:
        self.project_to_3d(cx=320, cy=240, depth_m=depth_m)

    def project_to_3d(self, cx, cy, depth_m):
        z = float(depth_m[int(cy), int(cx)])
        if z == 0 or np.isnan(z):
            return None  # no valid depth at this pixel
        fx = self.K[0, 0]; fy = self.K[1, 1]
        ppx = self.K[0, 2]; ppy = self.K[1, 2]
        x = (cx - ppx) * z / fx
        y = (cy - ppy) * z / fy
        return np.array([x, y, z])  # in camera frame
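The back-projection math can be exercised standalone with synthetic intrinsics before wiring it to a live camera (the K values below are illustrative, not from a real calibration):

```python
import numpy as np

def project_to_3d(K: np.ndarray, cx: float, cy: float, z: float) -> np.ndarray:
    """Back-project pixel (cx, cy) at metric depth z into the camera frame."""
    fx, fy = K[0, 0], K[1, 1]
    ppx, ppy = K[0, 2], K[1, 2]
    return np.array([(cx - ppx) * z / fx, (cy - ppy) * z / fy, z])

# fx = fy = 500 px, principal point at (320, 240)
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
p = project_to_3d(K, cx=420, cy=340, z=2.0)
print(p)  # x = 0.4, y = 0.4, z = 2.0 (camera frame)
```

A pixel 100 px right and below the principal point, at 2 m depth and 500 px focal length, lands 0.4 m off-axis in each direction, which matches the similar-triangles intuition.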
> Edge Optimization: TensorRT and ONNX
On a Jetson Orin, exporting to TensorRT can double or triple inference throughput compared to vanilla PyTorch, and is essential for running multiple perception pipelines in parallel.
Export and Run with TensorRT
from ultralytics import YOLO
# Step 1: Export to TensorRT engine (run once on target hardware)
model = YOLO('yolov8n.pt')
model.export(
    format='engine',
    device=0,
    half=True,    # FP16 for Jetson
    imgsz=640,
    workspace=4   # GB of GPU memory for TRT optimization
)
# Step 2: Load the engine for inference
trt_model = YOLO('yolov8n.engine')
results = trt_model('image.jpg') # now runs via TensorRT
Throughput Benchmarks (640px)
- PyTorch FP32: ~35 ms/frame
- ONNX Runtime: ~22 ms/frame
- TensorRT FP16: ~8 ms/frame
- TensorRT INT8: ~5 ms/frame
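Per-frame numbers like these are easy to reproduce with a small timing harness (a sketch; swap the dummy workload for your actual model call, e.g. `lambda: trt_model(frame)`):

```python
import time

def ms_per_frame(infer, n_warmup: int = 3, n_runs: int = 20) -> float:
    """Average wall-clock milliseconds per call of `infer`."""
    for _ in range(n_warmup):   # warm-up runs excluded from timing
        infer()
    t0 = time.perf_counter()
    for _ in range(n_runs):
        infer()
    return (time.perf_counter() - t0) / n_runs * 1000.0

# Dummy CPU workload standing in for a model call
latency = ms_per_frame(lambda: sum(range(10_000)))
print(f'{latency:.2f} ms/frame')
```

When timing GPU inference, also call torch.cuda.synchronize() inside the loop before reading the clock, otherwise you measure kernel launch time rather than execution time.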
Key Optimization Tips
- Use half=True for FP16 on all Jetson platforms
- Pin host memory (e.g. Tensor.pin_memory()) to speed up host-to-device copies
- Batch multiple camera feeds when possible
- Reduce input resolution when objects fill most of the frame; keep it high for small, far-field targets
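The "batch multiple camera feeds" tip can be sketched as a small accumulator that stacks frames before a single model call (class name hypothetical; the model call itself is stubbed out):

```python
import numpy as np

class FrameBatcher:
    """Collect frames from several cameras and flush them as one batch."""
    def __init__(self, batch_size: int):
        self.batch_size = batch_size
        self.frames = []

    def add(self, frame: np.ndarray):
        """Buffer a frame; return a stacked batch when full, else None."""
        self.frames.append(frame)
        if len(self.frames) == self.batch_size:
            batch = np.stack(self.frames)  # (B, H, W, C)
            self.frames = []
            return batch                   # e.g. pass list(batch) to the model
        return None

batcher = FrameBatcher(batch_size=2)
assert batcher.add(np.zeros((480, 640, 3), np.uint8)) is None
batch = batcher.add(np.zeros((480, 640, 3), np.uint8))
print(batch.shape)  # (2, 480, 640, 3)
```

Batching amortizes kernel launch overhead, but it adds up to one frame period of latency per buffered camera, so it suits monitoring pipelines better than tight reactive loops.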
> Conclusion
Deploying YOLOv8 in a ROS2 node gives your robot a powerful, real-time perception backbone. With depth integration and TensorRT optimization you can run multiple detection pipelines on a single edge board while still publishing results fast enough for reactive behaviors.
The next step is fine-tuning on domain-specific data — industrial parts, household objects, or outdoor pedestrians — so the model's class vocabulary matches exactly what your robot needs to recognize.
Ready to See It in Action?
Grab the full node code from the repository, connect it to your robot's camera topic, and watch detections stream in RViz2. Feel free to open an issue if you run into hardware-specific quirks.