An Introduction to YOLO (You Only Look Once)

This blog post gives a short, beginner-friendly introduction to the YOLO algorithm. Please read the post, share your views in the comments, and subscribe to the blog 🙂

YOLO (You Only Look Once) is an amazingly fast object detection architecture in computer vision. It was presented at CVPR 2016.

Check out this amazing video by the authors of the YOLO paper.


Introduction to YOLO

YOLO is an object detection algorithm. It detects multiple objects in an image and draws a bounding box around each of them.


YOLO frames object detection as a regression problem instead of a classification problem. It brings a unified neural network architecture to the table: a single network that predicts bounding boxes and also outputs class probabilities.

Other architectures, such as R-CNN, first generate potential bounding boxes in an image and then run a classifier on these proposed boxes. After classification, post-processing refines the bounding boxes, eliminates duplicate detections, and rescores the boxes based on other objects in the scene. These complex pipelines are slow and hard to optimize because each individual component must be trained separately.

In YOLO, a single convolutional network simultaneously predicts multiple bounding boxes and the class probabilities for those boxes. Because the whole detector is one network, it can be optimized end-to-end directly on detection performance. YOLO is fast, and it reasons about the image globally while making predictions. For example, it makes fewer than half as many background errors as Fast R-CNN.

However, vanilla YOLO lags behind the state of the art in terms of accuracy.

NOTE – There are many variants of YOLO available, such as YOLOv3 and Tiny YOLO.

How to run YOLO on your computer?

First, we clone and build the repository. Interestingly, Darknet, the framework YOLO is implemented in, is written in C.

> git clone
> cd darknet
> make

Now we download the weights of a trained YOLO model –


You are all set to try YOLO now –

> ./darknet detect cfg/yolov3.cfg yolov3.weights data/dog.jpg

I tried YOLO on an image that was taken when I went to Bangalore to attend the Google TensorFlow Roadshow 2018.


How does the unified architecture work?

YOLO uses features from the entire image to predict each bounding box. It also predicts all bounding boxes across all classes for an image simultaneously.

YOLO divides an image into an S × S grid. If the center of an object falls into a grid cell, that grid cell is responsible for detecting that object. Each grid cell predicts B bounding boxes and confidence scores for those boxes. These confidence scores reflect how confident the model is that the box contains an object and how accurate it thinks the predicted box is.
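
To make the grid idea concrete, here is a minimal Python sketch (not the Darknet code). It finds the grid cell responsible for an object and computes the confidence target described in the paper, which is the intersection over union (IoU) between the predicted box and the ground-truth box when an object is present, and zero otherwise. The function names, the (x1, y1, x2, y2) box format and the 448 × 448 image size in the example are my own assumptions for illustration.

```python
def responsible_cell(cx, cy, img_w, img_h, S=7):
    """Return (row, col) of the grid cell that contains the object centre.

    cx, cy are the centre coordinates in pixels; S is the grid size.
    """
    # Scale the centre into grid units and clamp so a centre lying exactly
    # on the right or bottom edge still maps to the last cell.
    col = min(int(cx / img_w * S), S - 1)
    row = min(int(cy / img_h * S), S - 1)
    return row, col


def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    # Width and height of the overlap are zero when the boxes do not intersect.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def confidence_target(object_present, predicted_box, truth_box):
    """Confidence target: IoU with the ground truth if an object is present, else 0."""
    return iou(predicted_box, truth_box) if object_present else 0.0


# Example: an object centred in the middle of a 448 x 448 image, 7 x 7 grid.
print(responsible_cell(224, 224, 448, 448))  # (3, 3), the middle cell
print(confidence_target(True, (0, 0, 2, 2), (1, 1, 3, 3)))
```

Only the cell returned by `responsible_cell` is asked to detect that object during training, which is why the centre of the object matters more than its extent.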

The detection network has 24 convolutional layers followed by 2 fully connected layers. Alternating 1 × 1 convolutional layers reduce the feature space from preceding layers. The final output of the network is a 7 × 7 × 30 tensor of predictions.
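
Where does the 30 come from? In the paper, each cell predicts B boxes with 5 numbers each (x, y, w, h, confidence) plus C class probabilities, so the output is S × S × (B × 5 + C). With the PASCAL VOC settings used there (S = 7, B = 2, C = 20) that gives 7 × 7 × 30. The Python sketch below checks the arithmetic and splits one cell's 30 numbers. The ordering inside the vector (boxes first, then classes) and the function names are assumptions for illustration and may not match Darknet's actual memory layout.

```python
def output_shape(S, B, C):
    """Shape of the prediction tensor: S x S cells, B*5 + C values per cell."""
    return (S, S, B * 5 + C)


def split_cell(cell_vector, B=2, C=20):
    """Split one cell's prediction vector into boxes and class probabilities.

    Assumed layout (hypothetical): B boxes of (x, y, w, h, confidence),
    followed by C class probabilities.
    """
    assert len(cell_vector) == B * 5 + C
    boxes = [tuple(cell_vector[i * 5:(i + 1) * 5]) for i in range(B)]
    class_probs = list(cell_vector[B * 5:])
    return boxes, class_probs


def class_scores(cell_vector, B=2, C=20):
    """Class-specific score per box: box confidence times class probability."""
    boxes, class_probs = split_cell(cell_vector, B, C)
    return [[box[4] * p for p in class_probs] for box in boxes]


print(output_shape(7, 2, 20))  # (7, 7, 30)

# One made-up cell: first box has confidence 0.5, first class has probability 0.8.
cell = [0.0] * 30
cell[4] = 0.5
cell[10] = 0.8
scores = class_scores(cell)
print(len(scores), len(scores[0]))  # 2 boxes, 20 class scores each
```

Multiplying the box confidence by the class probabilities gives one score per class for every box, and those scores are what get thresholded at test time.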

Limitations of YOLO algorithm

  • It struggles with small objects that appear in groups, such as flocks of birds.
  • It struggles to generalize to objects in new or unusual aspect ratios or configurations.
  • The architecture in the paper does not achieve state-of-the-art accuracy.

Further Reads –
