> Introduction
Giving a robot the ability to recognize objects in real time is one of the most impactful upgrades you can make to its autonomy. Whether the robot needs to pick up a specific item, avoid pedestrians, or follow a person, object detection is the perceptual foundation.
YOLO (You Only Look Once) has become the go-to family of models for robotics applications because it delivers excellent accuracy at speeds that matter on embedded hardware. In this guide we will set up YOLOv8 inside a ROS2 node, integrate it with a robot camera topic, add depth-based 3D localization via OpenCV, and optimize inference for deployment on a Jetson or similar edge device.
> YOLOv8 Model Overview
Ultralytics YOLOv8 is an anchor-free, single-stage detector that unifies detection, segmentation, pose estimation, and classification under one API. It comes in five size variants: nano (n), small (s), medium (m), large (l), and extra-large (x).
Model Size vs. Speed on Jetson Orin
- YOLOv8n: ~120 FPS — suitable for high-frequency control loops
- YOLOv8s: ~85 FPS — good balance for mobile robots
- YOLOv8m: ~45 FPS — preferred for manipulation arms with slower cycles
- YOLOv8l/x: <25 FPS — use only when accuracy is paramount and latency is acceptable
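As a rough illustration (throughput numbers taken from the list above; the helper name is hypothetical), the variant choice can be automated against a frame-rate budget:

```python
# Approximate Jetson Orin throughput from the list above (FPS)
VARIANT_FPS = {'yolov8n.pt': 120, 'yolov8s.pt': 85, 'yolov8m.pt': 45, 'yolov8l.pt': 25}

def pick_variant(required_fps: float) -> str:
    """Return the largest (most accurate) variant that still meets the FPS budget."""
    # Iterate from most accurate to fastest; the first one fast enough wins
    for name in ('yolov8l.pt', 'yolov8m.pt', 'yolov8s.pt', 'yolov8n.pt'):
        if VARIANT_FPS[name] >= required_fps:
            return name
    raise ValueError('No variant meets the requested frame rate')

print(pick_variant(30))   # -> yolov8m.pt (manipulation-arm cycle)
print(pick_variant(100))  # -> yolov8n.pt (high-frequency control loop)
```

In practice you would benchmark on your own hardware, since these numbers shift with JetPack version, power mode, and input resolution.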
Task Variants Available
from ultralytics import YOLO
# Detection
model = YOLO('yolov8n.pt')
# Instance Segmentation
model = YOLO('yolov8n-seg.pt')
# Pose Estimation (human keypoints)
model = YOLO('yolov8n-pose.pt')
# OBB (Oriented Bounding Box)
model = YOLO('yolov8n-obb.pt')
> Environment Setup
We need ROS2 Humble, Ultralytics, OpenCV with Python bindings, and the cv_bridge package that converts between ROS image messages and NumPy arrays.
# System dependencies
sudo apt update
sudo apt install ros-humble-cv-bridge
sudo apt install ros-humble-image-transport
sudo apt install ros-humble-vision-msgs
sudo apt install python3-pip
# Python packages
pip install ultralytics
pip install opencv-python-headless
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu118
# Note: on Jetson, install the NVIDIA-provided PyTorch wheel for your JetPack release instead
# Verify GPU is visible
python3 -c "import torch; print(torch.cuda.is_available())"
ROS2 Package Setup
cd ~/ros2_ws/src
ros2 pkg create --build-type ament_python yolo_detector \
--dependencies rclpy sensor_msgs vision_msgs cv_bridge
cd ~/ros2_ws
colcon build --packages-select yolo_detector
source install/setup.bash
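For `ros2 run` to find the node later, the package's setup.py also needs a console-script entry point. A sketch, assuming the node code below is saved as yolo_detector/yolo_node.py inside the package:

```python
# setup.py (fragment) -- entry point assuming yolo_detector/yolo_node.py
entry_points={
    'console_scripts': [
        'yolo_node = yolo_detector.yolo_node:main',
    ],
},
```

After adding this and rebuilding with colcon, the node can be started with, e.g., `ros2 run yolo_detector yolo_node --ros-args -p model:=yolov8s.pt -p confidence:=0.4`.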
> Writing the ROS2 Detection Node
The core idea is simple: subscribe to a camera image topic, run YOLO inference, publish detection results as vision_msgs/Detection2DArray, and re-publish an annotated image for visualization.
yolo_node.py
import rclpy
from rclpy.node import Node
from sensor_msgs.msg import Image
from vision_msgs.msg import Detection2DArray, Detection2D, ObjectHypothesisWithPose
from cv_bridge import CvBridge
from ultralytics import YOLO
import cv2
class YoloDetectorNode(Node):
    def __init__(self):
        super().__init__('yolo_detector')
        # Parameters
        self.declare_parameter('model', 'yolov8n.pt')
        self.declare_parameter('confidence', 0.5)
        self.declare_parameter('device', 'cuda:0')
        model_path = self.get_parameter('model').value
        self.conf = self.get_parameter('confidence').value
        device = self.get_parameter('device').value
        # Load model
        self.model = YOLO(model_path)
        self.model.to(device)
        self.bridge = CvBridge()
        # I/O
        self.sub = self.create_subscription(
            Image, '/camera/image_raw', self.image_callback, 10)
        self.det_pub = self.create_publisher(
            Detection2DArray, '/detections', 10)
        self.img_pub = self.create_publisher(
            Image, '/detections/image', 10)
        self.get_logger().info('YoloDetectorNode ready.')

    def image_callback(self, msg: Image):
        frame = self.bridge.imgmsg_to_cv2(msg, 'bgr8')
        results = self.model(frame, conf=self.conf, verbose=False)[0]
        det_array = Detection2DArray()
        det_array.header = msg.header
        for box in results.boxes:
            det = Detection2D()
            det.header = msg.header
            hyp = ObjectHypothesisWithPose()
            hyp.hypothesis.class_id = str(int(box.cls[0]))
            hyp.hypothesis.score = float(box.conf[0])
            det.results.append(hyp)
            x1, y1, x2, y2 = box.xyxy[0].tolist()
            det.bbox.center.position.x = (x1 + x2) / 2
            det.bbox.center.position.y = (y1 + y2) / 2
            det.bbox.size_x = x2 - x1
            det.bbox.size_y = y2 - y1
            det_array.detections.append(det)
        self.det_pub.publish(det_array)
        # Annotated image
        annotated = results.plot()
        self.img_pub.publish(
            self.bridge.cv2_to_imgmsg(annotated, 'bgr8'))

def main():
    rclpy.init()
    node = YoloDetectorNode()
    rclpy.spin(node)
    node.destroy_node()
    rclpy.shutdown()

if __name__ == '__main__':
    main()
> OpenCV Integration and Visualization
Beyond the raw bounding boxes, we can assign persistent tracking IDs via YOLO's built-in trackers, use OpenCV to draw trajectory trails, and estimate object distance from the camera's intrinsic parameters.
YOLO Tracking with ByteTrack
# Persistent tracking across frames
# (requires self.trails = {} initialized in __init__)
results = self.model.track(
    frame,
    conf=self.conf,
    tracker='bytetrack.yaml',  # or 'botsort.yaml'
    persist=True,
    verbose=False
)[0]
if results.boxes.id is not None:
    track_ids = results.boxes.id.int().tolist()
    for tid, box in zip(track_ids, results.boxes.xyxy):
        x1, y1, x2, y2 = map(int, box.tolist())
        cx, cy = (x1 + x2) // 2, (y1 + y2) // 2
        # Draw trail
        if tid not in self.trails:
            self.trails[tid] = []
        self.trails[tid].append((cx, cy))
        self.trails[tid] = self.trails[tid][-30:]  # keep last 30 pts
        for i in range(1, len(self.trails[tid])):
            cv2.line(frame, self.trails[tid][i-1],
                     self.trails[tid][i], (0, 255, 255), 2)
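One detail the snippet above leaves open: entries in self.trails persist even after a track disappears, so the dictionary grows without bound on long runs. A minimal pruning helper (hypothetical name, pure Python) keeps only the trails the tracker still reports:

```python
def prune_trails(trails: dict, active_ids) -> dict:
    """Drop trail history for track IDs no longer reported by the tracker."""
    active = set(active_ids)
    return {tid: pts for tid, pts in trails.items() if tid in active}

# Example: tracks 1 and 2 were seen before, but only 2 is still active
trails = {1: [(10, 10)], 2: [(50, 60), (52, 61)]}
trails = prune_trails(trails, active_ids=[2])
print(trails)  # {2: [(50, 60), (52, 61)]}
```

Calling this once per frame with the current track_ids list keeps memory flat regardless of how many objects pass through the scene.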
Monocular Distance Estimation
# Known real-world width W of the object (meters)
# Camera focal length fx (pixels) from calibration
# Measured pixel width pw of the bounding box
def estimate_distance(pw_pixels, real_width_m, focal_length_px):
    """Pinhole similar-triangles relation: D = (W * fx) / pw"""
    return (real_width_m * focal_length_px) / pw_pixels

# Example: person ~0.5 m wide, fx=600, bbox width from det.bbox.size_x
dist = estimate_distance(
    pw_pixels=bbox_size_x,
    real_width_m=0.5,
    focal_length_px=600.0
)
print(f'Estimated distance: {dist:.2f} m')
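The relation is easy to sanity-check with synthetic numbers: at fx = 600 px, a 0.5 m wide person whose bounding box spans 150 px should sit 2 m away, and inverting the formula should recover the pixel width.

```python
def estimate_distance(pw_pixels, real_width_m, focal_length_px):
    # Same pinhole relation as above: D = (W * fx) / pw
    return (real_width_m * focal_length_px) / pw_pixels

d = estimate_distance(pw_pixels=150, real_width_m=0.5, focal_length_px=600.0)
print(f'{d:.2f} m')  # 2.00 m

# Inverting: an object at 2 m should project back to 150 px
pw = (0.5 * 600.0) / d
assert abs(pw - 150.0) < 1e-9
```

Keep in mind this only holds when the assumed real-world width matches the actual object, which is why a depth camera (next section) is preferred when available.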
> 3D Object Localization with Depth Camera
With an RGB-D camera (RealSense D435, ZED 2, etc.) we can project detected bounding box centers into 3D using the aligned depth stream and camera intrinsics.
from sensor_msgs.msg import CameraInfo
import numpy as np

class YoloDepthNode(Node):
    def __init__(self):
        super().__init__('yolo_depth')
        self.bridge = CvBridge()
        self.K = None  # camera intrinsic matrix
        self.info_sub = self.create_subscription(
            CameraInfo, '/camera/depth/camera_info',
            self.info_callback, 1)
        self.depth_sub = self.create_subscription(
            Image, '/camera/aligned_depth_to_color/image_raw',
            self.depth_callback, 10)

    def info_callback(self, msg):
        self.K = np.array(msg.k).reshape(3, 3)

    def depth_callback(self, msg):
        if self.K is None:
            return  # intrinsics not received yet
        depth_image = self.bridge.imgmsg_to_cv2(msg, '16UC1')
        depth_m = depth_image.astype(np.float32) / 1000.0  # mm -> m
        # For each detection center (cx, cy) -- example values shown:
        self.project_to_3d(cx=320, cy=240, depth_m=depth_m)

    def project_to_3d(self, cx, cy, depth_m):
        z = float(depth_m[int(cy), int(cx)])
        if z == 0 or np.isnan(z):
            return None  # no valid depth at this pixel
        fx = self.K[0, 0]; fy = self.K[1, 1]
        ppx = self.K[0, 2]; ppy = self.K[1, 2]
        x = (cx - ppx) * z / fx
        y = (cy - ppy) * z / fy
        return np.array([x, y, z])  # in camera frame
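The back-projection math can be exercised standalone with synthetic intrinsics before wiring it to a live camera (the K values below are illustrative, not from a real calibration):

```python
import numpy as np

def project_to_3d(K: np.ndarray, cx: float, cy: float, z: float) -> np.ndarray:
    """Back-project pixel (cx, cy) at metric depth z into the camera frame."""
    fx, fy = K[0, 0], K[1, 1]
    ppx, ppy = K[0, 2], K[1, 2]
    return np.array([(cx - ppx) * z / fx, (cy - ppy) * z / fy, z])

# fx = fy = 500 px, principal point at (320, 240)
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
p = project_to_3d(K, cx=420, cy=340, z=2.0)
print(p)  # x = 0.4, y = 0.4, z = 2.0 (camera frame)
```

A pixel 100 px right and below the principal point, at 2 m depth and 500 px focal length, lands 0.4 m off-axis in each direction, which matches the similar-triangles intuition.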
> Edge Optimization: TensorRT and ONNX
On a Jetson Orin, exporting to TensorRT can double or triple inference throughput compared to vanilla PyTorch, and is essential for running multiple perception pipelines in parallel.
Export and Run with TensorRT
from ultralytics import YOLO
# Step 1: Export to TensorRT engine (run once on target hardware)
model = YOLO('yolov8n.pt')
model.export(
    format='engine',
    device=0,
    half=True,    # FP16 for Jetson
    imgsz=640,
    workspace=4   # GB of GPU memory for TRT optimization
)
# Step 2: Load the engine for inference
trt_model = YOLO('yolov8n.engine')
results = trt_model('image.jpg') # now runs via TensorRT
Throughput Benchmarks (640px)
- PyTorch FP32: ~35 ms/frame
- ONNX Runtime: ~22 ms/frame
- TensorRT FP16: ~8 ms/frame
- TensorRT INT8: ~5 ms/frame
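Per-frame numbers like these are easy to reproduce with a small timing harness (a sketch; swap the dummy workload for your actual model call, e.g. `lambda: trt_model(frame)`):

```python
import time

def ms_per_frame(infer, n_warmup: int = 3, n_runs: int = 20) -> float:
    """Average wall-clock milliseconds per call of `infer`."""
    for _ in range(n_warmup):   # warm-up runs excluded from timing
        infer()
    t0 = time.perf_counter()
    for _ in range(n_runs):
        infer()
    return (time.perf_counter() - t0) / n_runs * 1000.0

# Dummy CPU workload standing in for a model call
latency = ms_per_frame(lambda: sum(range(10_000)))
print(f'{latency:.2f} ms/frame')
```

When timing GPU inference, also call torch.cuda.synchronize() inside the loop before reading the clock, otherwise you measure kernel launch time rather than execution time.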
Key Optimization Tips
- Use half=True for FP16 on all Jetson platforms
- Pin host memory (e.g. Tensor.pin_memory()) to speed up host-to-device copies
- Batch multiple camera feeds when possible
- Reduce input resolution when objects fill most of the frame; keep it high for small, far-field targets
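The "batch multiple camera feeds" tip can be sketched as a small accumulator that stacks frames before a single model call (class name hypothetical; the model call itself is stubbed out):

```python
import numpy as np

class FrameBatcher:
    """Collect frames from several cameras and flush them as one batch."""
    def __init__(self, batch_size: int):
        self.batch_size = batch_size
        self.frames = []

    def add(self, frame: np.ndarray):
        """Buffer a frame; return a stacked batch when full, else None."""
        self.frames.append(frame)
        if len(self.frames) == self.batch_size:
            batch = np.stack(self.frames)  # (B, H, W, C)
            self.frames = []
            return batch                   # e.g. pass list(batch) to the model
        return None

batcher = FrameBatcher(batch_size=2)
assert batcher.add(np.zeros((480, 640, 3), np.uint8)) is None
batch = batcher.add(np.zeros((480, 640, 3), np.uint8))
print(batch.shape)  # (2, 480, 640, 3)
```

Batching amortizes kernel launch overhead, but it adds up to one frame period of latency per buffered camera, so it suits monitoring pipelines better than tight reactive loops.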
> Conclusion
Deploying YOLOv8 in a ROS2 node gives your robot a powerful, real-time perception backbone. With depth integration and TensorRT optimization you can run multiple detection pipelines on a single edge board while still publishing results fast enough for reactive behaviors.
The next step is fine-tuning on domain-specific data — industrial parts, household objects, or outdoor pedestrians — so the model's class vocabulary matches exactly what your robot needs to recognize.
Ready to See It in Action?
Grab the full node code from the repository, connect it to your robot's camera topic, and watch detections stream in RViz2. Feel free to open an issue if you run into hardware-specific quirks.