Firebase ML Kit 9: Object Detection & Tracking

I always thought Adobe After Effects was so cool just for the ability to track moving objects. I never got around to learning After Effects myself, but what I do know is how to deploy an ML model that does the same job... minus the effects, of course.

Firebase ML Kit’s Object Detection & Tracking model takes in an image and returns the positions of the objects it sees, along with the coarse category each one belongs to (place, fashion goods, etc.).

Tracking ID: 0
Bounds: (95, 45), (496, 45), (496, 240), (95, 240)
Category: PLACE
Classification confidence: 0.9296875

Image and stats taken from Firebase Docs


Settings and Modes

The model has 2 object detection settings:

Most Prominent Object – detects the most prominent object in the image

Multiple Objects – detects up to 5 objects the model can identify

As well as 2 modes:

STREAM_MODE (default) – Tracks the most prominent object in the camera viewfinder with low latency. Tracking IDs are also provided for tracking objects across frames.

SINGLE_IMAGE_MODE – Runs detection on a single bitmap with slightly higher latency but more accuracy.

As you may guess, STREAM_MODE works in real time, so it’s quick and well optimized for doing that real After Effects stuff.


The model is still new, so we have to point out what we can’t do with it yet.

1. STREAM_MODE can’t track objects other than the most prominent. If you want to track a more subtle object in the background, good luck to you.

2. The first few invocations of STREAM_MODE may produce incomplete results (e.g. unspecified bounding boxes or category labels).

3. The Multiple Objects setting can only detect up to 5 objects. This should be plenty for most use cases, but there’s always that if.

Now that that’s out of the way, let’s get coding.



As with all Firebase products, you’ll need to connect your app to Firebase first. Then add these dependencies to your app/build.gradle file:

dependencies {
  implementation ''
  implementation ''
}

Configure the Object Detector

Configure your options according to what you need:

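// Live detection and tracking
val options = FirebaseVisionObjectDetectorOptions.Builder()
        .setDetectorMode(FirebaseVisionObjectDetectorOptions.STREAM_MODE)
        .enableClassification()  // Optional
        .build()

// Multiple object detection in static images
val options = FirebaseVisionObjectDetectorOptions.Builder()
        .setDetectorMode(FirebaseVisionObjectDetectorOptions.SINGLE_IMAGE_MODE)
        .enableMultipleObjects()
        .enableClassification()  // Optional
        .build()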

Instantiate the Object Detector

val objectDetector = FirebaseVision.getInstance().getOnDeviceObjectDetector(options)


Set up the Firebase Vision Image

This next bunch may seem convoluted. It’s really not, trust me. There are just many ways to set up the image.

A FirebaseVisionImage object prepares the image for ML Kit processing. You can make one from a bitmap, media.Image, ByteBuffer, byte array, or a file on the device.

From Bitmap

The simplest way to do it: call FirebaseVisionImage.fromBitmap(bitmap). This works as long as your image is upright.

From media.Image

Use this when the image comes from your device’s camera. You’ll need the angle by which the image must be rotated to be upright. That angle depends on the device’s orientation when the photo was taken and on the camera sensor’s default orientation (90 degrees on most devices, but it can differ).

private static final SparseIntArray ORIENTATIONS = new SparseIntArray();
static {
    ORIENTATIONS.append(Surface.ROTATION_0, 90);
    ORIENTATIONS.append(Surface.ROTATION_90, 0);
    ORIENTATIONS.append(Surface.ROTATION_180, 270);
    ORIENTATIONS.append(Surface.ROTATION_270, 180);
}

private int getRotationCompensation(String cameraId) throws CameraAccessException {
    // Look up the angle that compensates for the device's current rotation.
    int deviceRotation = getWindowManager().getDefaultDisplay().getRotation();
    int rotationCompensation = ORIENTATIONS.get(deviceRotation);

    // Factor in how the camera sensor is mounted on this device.
    CameraManager cameraManager = (CameraManager) getSystemService(CAMERA_SERVICE);
    int sensorOrientation = cameraManager
            .getCameraCharacteristics(cameraId)
            .get(CameraCharacteristics.SENSOR_ORIENTATION);
    rotationCompensation = (rotationCompensation + sensorOrientation + 270) % 360;

    // Return the corresponding FirebaseVisionImageMetadata rotation value.
    int result;
    switch (rotationCompensation) {
        case 0:
            result = FirebaseVisionImageMetadata.ROTATION_0;
            break;
        case 90:
            result = FirebaseVisionImageMetadata.ROTATION_90;
            break;
        case 180:
            result = FirebaseVisionImageMetadata.ROTATION_180;
            break;
        case 270:
            result = FirebaseVisionImageMetadata.ROTATION_270;
            break;
        default:
            result = FirebaseVisionImageMetadata.ROTATION_0;
            Log.e(LOG_TAG, "Bad rotation value: " + rotationCompensation);
    }
    return result;
}

private void someOtherMethod() throws CameraAccessException {
    int rotation = getRotationCompensation(cameraId);
    FirebaseVisionImage image = FirebaseVisionImage.fromMediaImage(mediaImage, rotation);
}

Long method to make all those calculations, but it’s pretty copy-pastable. Then you can pass in the mediaImage and the rotation to generate your FirebaseVisionImage.
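To see what that arithmetic actually produces, here is a minimal sketch of the same formula with the Android classes stripped out. RotationMath and compensate are names made up for this illustration; the indices 0 to 3 stand in for Surface.ROTATION_0 through Surface.ROTATION_270, and the lookup values mirror the ORIENTATIONS table.

```java
// Illustration only: the rotation formula from getRotationCompensation,
// without any Android dependencies. All names here are hypothetical.
final class RotationMath {
    // Index = Surface.ROTATION_0..ROTATION_270 (0..3); value = degrees from the ORIENTATIONS table.
    private static final int[] ORIENTATION_DEGREES = {90, 0, 270, 180};

    /** Returns the compensation angle in degrees (0, 90, 180 or 270). */
    static int compensate(int surfaceRotation, int sensorOrientation) {
        int base = ORIENTATION_DEGREES[surfaceRotation];
        return (base + sensorOrientation + 270) % 360;
    }

    public static void main(String[] args) {
        // Common case: sensor mounted at 90 degrees.
        for (int r = 0; r < 4; r++) {
            System.out.println("surface rotation " + r + " -> " + compensate(r, 90));
        }
        // Less common case: sensor mounted at 270 degrees, device upright.
        System.out.println("sensor 270, upright -> " + compensate(0, 270));
    }
}
```

With a 90-degree sensor, an upright phone (ROTATION_0) needs 90 degrees of compensation, while a phone in ROTATION_90 needs none.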

From ByteBuffer

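FirebaseVisionImageMetadata metadata = new FirebaseVisionImageMetadata.Builder()
        .setWidth(width).setHeight(height)
        .setFormat(FirebaseVisionImageMetadata.IMAGE_FORMAT_NV21).setRotation(rotation)
        .build();

FirebaseVisionImage image = FirebaseVisionImage.fromByteBuffer(buffer, metadata);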

You’ll need the rotation method from the media.Image section here as well, on top of building the metadata for your image (width, height, format, and rotation).

From File

FirebaseVisionImage image = FirebaseVisionImage.fromFilePath(context, uri);

It fits on one line here, but in practice you’ll wrap it in a try-catch block, since fromFilePath can throw an IOException.

Run the Object Detection

After all that, it’s as simple as calling processImage. If it succeeds, you get a list of FirebaseVisionObjects.

objectDetector.processImage(image)
        .addOnSuccessListener { detectedObjects ->
            // Task completed successfully
            // ...
        }
        .addOnFailureListener { e ->
            // Task failed with an exception
            // ...
        }

Each FirebaseVisionObject has properties like the example shown way above (bounding box, tracking ID, category, confidence).
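As a rough idea of what you might do with those properties, here is a small sketch using a stand-in class. Detection and its fields are hypothetical and are not the Firebase API; the sketch assumes the tracking ID and confidence can be absent, so it checks for null before trusting them. The numbers come from the sample result near the top of this post.

```java
// Stand-in for a detection result; names are illustrative, not the Firebase API.
final class Detection {
    final Integer trackingId;   // nullable: assumed to be absent in some results
    final int left, top, right, bottom;
    final String category;
    final Float confidence;     // nullable: assumed to be absent in some results

    Detection(Integer trackingId, int left, int top, int right, int bottom,
              String category, Float confidence) {
        this.trackingId = trackingId;
        this.left = left;
        this.top = top;
        this.right = right;
        this.bottom = bottom;
        this.category = category;
        this.confidence = confidence;
    }

    int width()  { return right - left; }
    int height() { return bottom - top; }

    /** True only when a confidence is present and meets the threshold. */
    boolean isConfident(float threshold) {
        return confidence != null && confidence >= threshold;
    }

    public static void main(String[] args) {
        Detection d = new Detection(0, 95, 45, 496, 240, "PLACE", 0.9296875f);
        System.out.println(d.width() + "x" + d.height()); // 401x195
        System.out.println(d.isConfident(0.5f));          // true
    }
}
```

The 0.5 threshold is arbitrary; pick whatever cut-off suits your app.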

Optimizing Usability & Performance

There are some guidelines and best practices in the Firebase Docs that will result in higher performance and reliability of the model. I’ll relist them here.

UX Stuffs

  • Objects with a small number of visual features might need to take up a larger part of the image to be detected. You may want to let your users know this so they don’t get frustrated.
  • If you’re handling objects based on their classifications, add special handling for unknown objects.
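The second point could look something like the sketch below. The Category enum here is a stand-in written for this example, not the Firebase constants, and the label strings are made up. The part that matters is the explicit fallback branch for unknown objects.

```java
// Hypothetical sketch: map coarse categories to user-facing text,
// with a deliberate fallback for anything unknown.
final class CategoryLabels {
    // Stand-in for the detector's categories; not the Firebase API.
    enum Category { UNKNOWN, HOME_GOOD, FASHION_GOOD, FOOD, PLACE, PLANT }

    static String describe(Category category) {
        switch (category) {
            case HOME_GOOD:    return "Home goods";
            case FASHION_GOOD: return "Fashion";
            case FOOD:         return "Food";
            case PLACE:        return "Place";
            case PLANT:        return "Plant";
            case UNKNOWN:
            default:
                // Don't guess: show a neutral label instead.
                return "Object detected";
        }
    }

    public static void main(String[] args) {
        System.out.println(describe(Category.PLACE));
        System.out.println(describe(Category.UNKNOWN));
    }
}
```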

Streaming Mode Stuffs for best performance

  • Using the Multiple Objects setting in STREAM_MODE is most likely to produce a crappy framerate. Don’t do it.
  • Disable classification if you don’t need it.
  • If a new video frame becomes available while the detector is running, drop the frame.
  • Capture images in the ImageFormat.YUV_420_888 format when using the Camera2 API, or ImageFormat.NV21 for the older Camera API.
  • If you are using the output of the detector to overlay graphics on the input image, first get the result from ML Kit, then render the image and overlay in a single step. By doing so, you render to the display surface only once for each input frame. See the CameraSourcePreview and GraphicOverlay classes in the Firebase quickstart sample app for an example.
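One way to drop frames while the detector is busy is a simple atomic flag. This is a sketch under my own assumptions, not something from the Firebase docs; FrameGate, tryEnter, and leave are hypothetical names. The idea is to call tryEnter when a frame arrives, skip the frame if it returns false, and call leave from both the success and failure listeners.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical helper: lets one frame through at a time and drops the rest.
final class FrameGate {
    private final AtomicBoolean busy = new AtomicBoolean(false);

    /** Returns true if the caller may process this frame; false means drop it. */
    boolean tryEnter() {
        return busy.compareAndSet(false, true);
    }

    /** Call when detection finishes, whether it succeeded or failed. */
    void leave() {
        busy.set(false);
    }

    public static void main(String[] args) {
        FrameGate gate = new FrameGate();
        System.out.println(gate.tryEnter()); // first frame is accepted
        System.out.println(gate.tryEnter()); // detector still busy, frame is dropped
        gate.leave();                        // detection finished
        System.out.println(gate.tryEnter()); // next frame is accepted
    }
}
```

compareAndSet makes the check and the flag flip a single atomic step, so two camera callbacks can’t both slip through.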


Most of the info and snippets above are taken directly from the official Firebase documentation. This post exists mainly to make the topic easier for other developers to digest. Check out the docs for more detail.

