Video Processing With Cloud ML Services: Part 1

If you have ever worked in media & entertainment, digital asset management, or any industry with a large library of video content, you’ve probably heard someone talk about the need for manual “video metadata tagging.” The reason for video tagging usually comes down to the business needing to classify clips within a video for quick search and retrieval, not just by asset managers but also by business end users such as editors, marketers, ad sales, or production managers, to name a few. As a customer once said to us, “if you can’t find it, it doesn’t exist.” Nothing is worse than knowing you have content that could have been monetized or reused but not knowing exactly where it is.

You might be wondering: are we stuck tagging videos manually forever, or is there some way to automate this process? Enter Artificial Intelligence and Machine Learning. Using the not-so-new but finally accessible concept of a Convolutional Neural Network, it’s possible to train an AI to classify scenes and detect everyday objects in videos. All three of the big cloud providers (Google, Amazon, and Microsoft) have cloud APIs that provide this service. In this first chapter of a three-part series, we’re going to focus on exploring the Google Cloud Platform. So let’s get started.

Google Video Intelligence API

The Google Video Intelligence API provides access to pre-trained machine learning models which can identify a large number of objects, places, products, logos, and inappropriate content, and can even transcribe speech. You can call this API from one of many programming languages, including C++, Java, Go, and Python. For all of our code examples we’ll be using Python.

Signing up for Google Cloud

To use this API, the first thing you’re going to need is a Google Cloud project. Signing up for Google Cloud requires that you have a Google managed identity (like a Gmail account) and a credit card. Once you create your first Google Cloud project, you will have access to the myriad of API services provided by Google along with access to the Google Cloud Console. The console is a web-based GUI which allows you to explore, configure, and monitor the various cloud services provided by Google. You must set up billing in order to use Google’s Video Intelligence.


By default the Video Intelligence API is not activated, meaning that if you attempt to authenticate and use it, you will get an error. Additionally, you will need to set up what is called a Service Account with permissions to access the Video Intelligence API, and export the credentials file for use in your code. Finally, and this depends on how much content you want to run through, you may have to open a ticket with Google to increase the quotas which guard the API from accidental overuse or misuse. If everything has been working and then one day stops working, there is a 99% chance that you’ve exceeded your quota.

We’re not going to walk you through every step in this blog post because Google has documented the process of setting up your project for Video Intelligence.

Data preparation

Now that we’re all set up, we finally get to use the API. But first we need to think about preparing our video files for processing. In terms of video formats, Video Intelligence supports MOV, MP4, and AVI, but that assumes you are not using a proprietary codec such as Apple ProRes.

If your videos are very high resolution or high bit rate, then I do suggest transcoding them to something smaller, a lot smaller. AI does not usually benefit from high-resolution images, so you are unlikely to see any loss in accuracy if you lower the resolution, and you will likely see faster processing as well.


First things first, we have to configure our credentials file. The easiest and most trouble-free way to do this is using the credentials environment variable.

import os

from google.cloud import videointelligence

# point this env var towards the credentials file
# for the GCP service account you want to use.
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = "gcp-credentials.json"

# Now create an instance of a VideoIntelligenceServiceClient,
# which will allow us to make the API calls.
client = videointelligence.VideoIntelligenceServiceClient()

Crafting the Request

Now it’s time to build the Video Intelligence request.

First we have to pick the type of analysis we want to do. Unlike a lot of services, Video Intelligence allows you to analyze the file in multiple ways with a single API call. We’re going to analyze our movie three different ways: text detection for OCR, logo recognition, and person detection.

Next, we have to choose the content we want to analyze. That can be done by either loading a local file or providing a ‘gs:’ URL. In this case we’ll provide a URL.

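features = [
    videointelligence.Feature.TEXT_DETECTION,
    videointelligence.Feature.LOGO_RECOGNITION,
    videointelligence.Feature.PERSON_DETECTION,
]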

uri = "gs://zorroa-public/test/ford.mp4"

request = {
    "features": features,
    "input_uri": uri
}
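
If you would rather load a local file than point at a ‘gs:’ URL, the request carries the raw bytes in an "input_content" field instead of "input_uri". The sketch below shows one way to build either form. The helper name build_request is hypothetical, and the string in demo_features is only a placeholder so the snippet runs without the Google library installed; in real code you would pass the features list of videointelligence.Feature values.

```python
def build_request(features, uri=None, path=None):
    """Return an annotate_video request dict for a gs:// URI or a local file."""
    if (uri is None) == (path is None):
        raise ValueError("Provide exactly one of uri or path")
    request = {"features": features}
    if uri is not None:
        request["input_uri"] = uri
    else:
        # Local files are sent inline as raw bytes.
        with open(path, "rb") as handle:
            request["input_content"] = handle.read()
    return request


# Placeholder feature list (an assumption for this demo only).
demo_features = ["LOGO_RECOGNITION"]
demo_request = build_request(demo_features, uri="gs://zorroa-public/test/ford.mp4")
print(sorted(demo_request))
```

Sending bytes inline is convenient for small clips, but for large files a bucket URL keeps the video out of your application's memory.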

Making the Request

Now that we’ve made our request dictionary we can make the request. Making the request returns a LongRunningOperation instance which is used to poll the Google servers for the result.

operation = client.annotate_video(request=request)
# The result() method will block until a result exists.
# You can also poll manually using the operation.done() method.
results = operation.result().annotation_results[0]
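
If blocking on result() does not suit your application, manual polling might look like the sketch below. The helper wait_for_operation and its default intervals are assumptions for illustration; FakeOperation is a stand-in that mimics only the done() and result() methods so the example runs without a Google Cloud project.

```python
import time


def wait_for_operation(operation, poll_seconds=10.0, timeout_seconds=1800.0):
    """Poll operation.done() until it finishes, then return operation.result()."""
    deadline = time.monotonic() + timeout_seconds
    while not operation.done():
        if time.monotonic() >= deadline:
            raise TimeoutError("Video annotation did not finish in time")
        time.sleep(poll_seconds)
    return operation.result()


class FakeOperation:
    """Stand-in for the real operation object (demo only)."""

    def __init__(self, polls_until_done):
        self._remaining = polls_until_done

    def done(self):
        # Report "not done" for the first polls_until_done calls.
        self._remaining -= 1
        return self._remaining < 0

    def result(self):
        return "finished"


print(wait_for_operation(FakeOperation(2), poll_seconds=0.01))
```

Polling yourself leaves room to update a progress indicator or give up after a deadline instead of waiting indefinitely.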

The result that comes back will contain large amounts of metadata which describe specific shots, frames, and segments of the video. You can dump the entire result to your screen simply by printing it.

print(results)


You can also iterate over the results. For example, here we print each logo detected as well as the section of the movie where the logo is visible.

for annotation in results.logo_recognition_annotations:
    # The entity describes the logo that was recognized.
    print(annotation.entity.description)
    for track in annotation.tracks:
        start = track.segment.start_time_offset.total_seconds()
        stop = track.segment.end_time_offset.total_seconds()
        print(f"Clip Start: {start:.2f}")
        print(f"Clip Stop: {stop:.2f}")

In this case our result will look like:

Ford Motor Company
Clip Start: 0.00
Clip Stop: 2.63

What to know before setting up GCP Video Intelligence

In terms of cloud APIs for processing video with artificial intelligence, Google Video Intelligence is the best available option. The API is designed to be efficient and easy to use. The results returned come pre-organized into different tracks, which makes them easy to convert into WebVTT or various other types of timeline formats.
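
To give a feel for that conversion, here is a minimal sketch that turns clips into WebVTT text. It assumes you have already reduced the tracks to (label, start_seconds, stop_seconds) tuples, as in the logo loop earlier; the function names are our own and not part of the Google library.

```python
def format_timestamp(seconds):
    """Format seconds as a WebVTT timestamp (HH:MM:SS.mmm)."""
    total_ms = round(seconds * 1000)
    hours, remainder = divmod(total_ms, 3_600_000)
    minutes, remainder = divmod(remainder, 60_000)
    secs, millis = divmod(remainder, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{millis:03d}"


def to_webvtt(clips):
    """Convert (label, start_seconds, stop_seconds) tuples to WebVTT text."""
    lines = ["WEBVTT", ""]
    for label, start, stop in clips:
        lines.append(f"{format_timestamp(start)} --> {format_timestamp(stop)}")
        lines.append(label)
        lines.append("")  # a blank line ends each cue
    return "\n".join(lines)


print(to_webvtt([("Ford Motor Company", 0.0, 2.63)]))
```

A file in this form can be loaded as a text track by most HTML5 video players, which is a quick way to review predictions against the footage.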

If there were one thing I’d change about Video Intelligence, it’s the default traffic quotas: they are too low and can be a confusing roadblock for people new to Google Cloud.


As you can see, for those new to Google Cloud, there is some work to do before you can start running movies through the API, but that is really just the beginning. The hard part is yet to come, which is making the results of these APIs usable from within your own application.

For example, you may want to:

  • Store the results in a database so they can be searched and retrieved by multiple teams
  • Create thumbnails for each clip found by Video Intelligence
  • Build a UI for reviewing predictions made by Video Intelligence
  • Store the raw result from Google in case you want to glean more information at a later time
  • Augment the metadata with predictions from other ML libraries or manual tagging.
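
To make the first item on that list concrete, here is a minimal sketch of storing clips in SQLite so they can be searched by label. The table layout and the helper names create_store, add_clip, and find_clips are illustrative assumptions; a production system would more likely use a shared database or a search index.

```python
import sqlite3


def create_store(path=":memory:"):
    """Open a SQLite database and make sure the clips table exists."""
    connection = sqlite3.connect(path)
    connection.execute(
        "CREATE TABLE IF NOT EXISTS clips ("
        "video_uri TEXT, label TEXT, start_seconds REAL, stop_seconds REAL)"
    )
    return connection


def add_clip(connection, video_uri, label, start, stop):
    """Insert one detected clip."""
    connection.execute(
        "INSERT INTO clips VALUES (?, ?, ?, ?)",
        (video_uri, label, start, stop),
    )
    connection.commit()


def find_clips(connection, text):
    """Return clips whose label contains text (case-insensitive for ASCII)."""
    cursor = connection.execute(
        "SELECT video_uri, label, start_seconds, stop_seconds FROM clips "
        "WHERE label LIKE ? ORDER BY video_uri, start_seconds",
        (f"%{text}%",),
    )
    return cursor.fetchall()


store = create_store()
add_clip(store, "gs://zorroa-public/test/ford.mp4", "Ford Motor Company", 0.0, 2.63)
print(find_clips(store, "ford"))
```

Keeping the video URI and time offsets alongside each label is what lets a search result point straight at the clip rather than just the file.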

My team and I have experienced first-hand the additional work and complexity that vendor onboarding can place on product development teams. The Zorroa platform can provide all of these additional features for you out of the box, along with access to a vast array of other tools and algorithms to help classify and automatically tag your video library.