AI generated text-to-video

Here is an example of how one can use a text prompt to generate a series of frames that are then stitched together into a video.

The prompt I used was: “a man walking in the parking lot with a miniature poodle”. The final video generated is shown below.

AI generated video from a text prompt of a man walking in a parking lot with a miniature poodle

What is interesting is how it morphs from one frame to the next, and in some cases, the human starts out looking more like a poodle. It reminds me of the old days of morphing we did in C and C++ (Computer Science theory).

For this I am playing with the latest build of #StableDiffusion and used a maximum of 100 frames, with 30 samplings and 200 inference steps for each frame.

The video below shows how each of those frames is generated, and it is quite fascinating.

A video showing how AI is generating one frame.

The rise of prompt engineering

I have said this before – with the advent of large AI models, Prompt Engineering is critical and is the next challenge for us to master.

What is Prompt engineering?

Prompt engineering is the process of crafting and refining the prompts we give to large models. Prompts are often written in natural language and outline the intention of the user. Prompt engineering is a key element that allows the output to be accurate and reflect the needs of the user. A prompt should not be thought of as one single explicit input to the model; instead, it can describe multiple tasks for the model.

We use large language models (#LLM) such as #GPT3, or #Text2Image models like #DALLE and #StableDiffusion, via a text prompt. The prompt is a string and is our way of asking the model to do what it is meant to do. It is also our way of providing hints and directions on what we need, ultimately helping the model understand the patterns that are important to us and should be represented in the output.

The way we write a prompt is important – including the phrases, the order of the words, hints, etc. Prompts also need to be in the context of the use case (see the screenshot below of GPT3 use case examples). For example, language generation prompts would be different from code generation, summarization, or image generation prompts. The prompts are closely tied to the intended use cases.

Screenshot showing GPT3 example use cases.
GPT3 use case examples
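
To make the link between prompt and use case concrete, here is a minimal sketch in Python. The template wording, the use-case names, and the build_prompt helper are all hypothetical and only illustrate the idea; they are not taken from GPT3's documentation.

```python
# Hypothetical templates, one per use case; the wording is illustrative only.
PROMPT_TEMPLATES = {
    "summarize": "Summarize the following text in one paragraph:\n\n{text}\n\nSummary:",
    "code": "# Python 3\n# {text}\n",
    "image": "{text}, highly detailed, atmospheric",
}


def build_prompt(use_case, text):
    """Fill the template for the given use case with the user's text."""
    if use_case not in PROMPT_TEMPLATES:
        raise ValueError(f"unknown use case: {use_case}")
    return PROMPT_TEMPLATES[use_case].format(text=text)


if __name__ == "__main__":
    # The same request produces a different prompt for each use case.
    for use_case in PROMPT_TEMPLATES:
        print(repr(build_prompt(use_case, "a castle next to a dark forest")))
```

The same user text ends up wrapped very differently depending on the use case, which is the point made above.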

Examples of prompt engineering

We start out with a couple of examples related to language generation. I figured, what better way to show prompt engineering than by asking GPT3 about prompt engineering?

In this first screenshot below, we use GPT3’s davinci model and ask for a paragraph on prompt engineering. The first sentence is the prompt that was the input, and the text with the green background is what was generated.

Screenshot of the generated output of a GPT3 model
GPT3 screen shot showing a paragraph prompt

And in this second example, it is mostly the same prompt but we ask for a blog post instead of a paragraph. As we can see the output of course is quite different, but the essence of it is still quite the same.

Screenshot of the generated output of a GPT3 model
GPT3 screen shot showing a blog post prompt

And finally another example, same as before, but in this case we specify that it be for a 5 year old child (ignoring the question of whether a 5 year old would understand the notion of AI and models).

Screenshot of the generated output of a GPT3 model
GPT3 screen shot showing a paragraph prompt for a child to understand

Even though the changes might seem subtle in the examples shown earlier – consider them as toy examples.

Small changes to the prompt can lead to significant changes in the output. To show this, below are two examples using #StableDiffusion – which is an open source text-to-image model. I used Harry Potter for inspiration, with Hogwarts and the dark forest where the first graders were forbidden to go.

For the first prompt example: a beautiful view of hogwarts school of witchcraft and wizardry and the dark forest, by Laurie Lipton, Impressionist Mosaic, Diya Lamp architecture, atmospheric, sense of awe and scale

And for the second example, the prompt was: a beautiful view of hogwarts school of witchcraft and wizardry and the dark forest, by Laurie Lipton, Impressionist Mosaic, atmospheric, sense of awe and scale.

The only difference between the two prompts was removing “Diya Lamp architecture”, resulting in dramatically different outputs. My guess is that, this being image generation, the changes are more dramatic and easier to comprehend.
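
One way to keep track of such experiments is to treat a prompt as a subject plus a list of modifiers. The sketch below is only an illustration; compose_prompt and removed_modifiers are hypothetical helpers, not part of Stable Diffusion.

```python
def compose_prompt(subject, modifiers):
    """Join a subject and its style modifiers into one comma-separated prompt."""
    return ", ".join([subject] + list(modifiers))


def removed_modifiers(before, after):
    """Return the modifiers present in `before` but missing from `after`."""
    return [modifier for modifier in before if modifier not in after]


subject = "a beautiful view of hogwarts school of witchcraft and wizardry and the dark forest"
first = [
    "by Laurie Lipton",
    "Impressionist Mosaic",
    "Diya Lamp architecture",
    "atmospheric",
    "sense of awe and scale",
]
# The second prompt drops a single modifier.
second = [modifier for modifier in first if modifier != "Diya Lamp architecture"]

if __name__ == "__main__":
    print(compose_prompt(subject, first))
    print(compose_prompt(subject, second))
    print("Removed:", removed_modifiers(first, second))
```

Keeping the modifiers in a list makes it easy to see exactly what changed between two runs.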

Prompts also are not universal and are very dependent on the models being used – what is considered a good prompt for one model (from one institution) won’t transfer to another model from another institution. For example, the same prompt as above (a beautiful view of hogwarts school of witchcraft and wizardry and the dark forest, by Laurie Lipton, Impressionist Mosaic, atmospheric, sense of awe and scale), when used with OpenAI’s DALLE model, generates the image shown below – which is of course very different.

And if I want to tweak the same prompt specifically for DALLE here is another example using the prompt: Beautiful view of Hogwarts school of witchcraft and wizardry and the dark forest with a sense of awe and scale, Awesome, Highly Detailed.

As a side note, I particularly like this one:

This has also given rise to a number of tools that help us craft prompts, given that many of us don’t quite understand the options and styles that can go in there. Some, like promptoMANIA, can cover multiple large models (images in this case) and can get very sophisticated themselves. And other simpler ones, like this DALLE prompt generator by Adam Brown and more like it, allow for tweaking and fine-tuning of prompts, effectively creating templates for GPT3.

Prompt engineering is a brand new and fascinating space for the industry and I for one am quite intrigued to see where it will lead us.

AI writing AI code?

It is 2021. And we have #AI writing #AI code. It is quite interesting, but can also become quite boring once you get beyond the initial technology and just think of it as one of the tools in your arsenal. And getting to that point is a good thing.

As part of a thing at work I recently started playing with GitHub Copilot, which uses GPT3 to be your pair programmer – helping write code. GPT3 has multiple models (called engines), and Copilot uses one of these families of engines, called Codex. Codex is a derivative of the base GPT3 engine that is trained on billions of lines of code.

Using Copilot is quite simple; you install the GitHub Copilot extension, and it shows up in your IDE (VSCode in my example). We need to make sure we decompose the problem we are trying to solve – we should not think of this as helping write the complete program or all parts, but as help with different functions and pieces of code. To do this, we need to tell it what we are trying to do – this is done via prompts (code comments). For GPT models, prompt engineering is quite critical, and it is worth getting into the details and understanding it.

Starting simple, I created an empty Python file and entered a prompt that outlines what I want to try and do. In this case, as you can see in the screenshot below, I want to load an image from a file, run an image analysis using our Vision Cognitive Services, and auto-generate a caption for that image.

I started typing the definition of a function, and Copilot (via the add-in) understands the prompt I outlined, and the context of the code on what I am doing. Remember Codex builds on the base GPT3 and does have all that NLU capability.

Taking all of this in, it suggests completing the function for me. In terms of using this as an end-user (i.e. the developer) – the suggested code shows up as auto-complete and you can see it in the grey color. If I like that suggestion, I press tab and have it added to the file.

In this case you can see how it is reading the file from disk, calling a function called get_caption() and printing the caption to the stdout (console in this example).

There is also an option to cycle through different suggestions and then pick another one as shown in the screenshot below.

This variant of the suggested code is creating a function called image_caption() which takes the path to the image file to load. This also expects other required things for the Vision cognitive service to work – such as the subscription key to authenticate, the API end-point details to call, etc.

Typically, Copilot can synthesize up to 10 code options (Copilot calls these Solutions) that one can cycle through to see if there is a better variant for the task at hand. The screenshot below shows this experience in VS Code.

The ask, whilst simple, still involves a bit of code which needs to be written – reading from a file, setting up the subscription details, wiring that up to call the service, etc. And it is really in these cases where Copilot shines – it is your copilot picking up the ‘gunk work’ – freeing up your bandwidth and cognitive capacity for the more interesting and higher order bits of code and value to your business.

To get a flavor of our AI writing AI code, below is the full set of the nine suggestions that Copilot came back with for what I was trying to do.

Synthesizing 9/10 solutions


def generate_caption(image_path):
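    from azure.cognitiveservices.vision.computervision import ComputerVisionClient
    from msrest.authentication import CognitiveServicesCredentials
    from azure.cognitiveservices.vision.computervision.models import TextOperationStatusCodes
    from azure.cognitiveservices.vision.computervision.models import TextRecognitionMode
    from azure.cognitiveservices.vision.computervision.models import VisualFeatureTypes
    from msrest.exceptions import CognitiveServicesError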
    from os.path import exists
    from os import remove
    from os import makedirs
    from os import listdir
    from os.path import isfile, join
    from time import sleep
    from sys import exit
    from random import randint
    from datetime import datetime

    # set up the connection to the cognitive services account
    subscription_key = "YOUR KEY HERE"
    endpoint = "https://YOUR ENDPOINT HERE"
    group_id = "YOUR GROUP ID HERE"
    client = ComputerVisionClient(endpoint, CognitiveServicesCredentials(subscription_key))

    # set up the directory where the images will be downloaded
    download_dir = "./downloads/"
    if not exists(download_dir):
        makedirs(download_dir)

    # set up the directory where the images will be saved
    save_dir = "./images/"
    if not exists(save_dir):
        makedirs(save_dir)

    # set up the directory where the images will be saved
    caption_dir = "./captions/"
    if not exists(caption_dir):
        makedirs(caption_dir)

    # get a list of the files in the download directory
    files = [f for f in listdir(download_dir) if isfile(join(download_dir, f))]
    if len(files) == 0:
        print("No files found in the download directory")

    # get the first file in the list
    file_name = files[0]


def generate_caption(image_path):
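    from os import environ
    from azure.cognitiveservices.vision.computervision import ComputerVisionClient
    from msrest.authentication import CognitiveServicesCredentials
    from azure.cognitiveservices.vision.computervision.models import TextOperationStatusCodes
    from azure.cognitiveservices.vision.computervision.models import TextRecognitionMode
    from azure.cognitiveservices.vision.computervision.models import VisualFeatureTypes
    from msrest.exceptions import CognitiveServicesError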

    # Add your Computer Vision subscription key to your environment variables.
    if 'COMPUTER_VISION_SUBSCRIPTION_KEY' in environ:
        subscription_key = environ['COMPUTER_VISION_SUBSCRIPTION_KEY']
    else:
        print("\nSet the COMPUTER_VISION_SUBSCRIPTION_KEY environment variable.\n**Restart your shell or IDE for changes to take effect.**")

    # Add your Computer Vision endpoint to your environment variables.
    if 'COMPUTER_VISION_ENDPOINT' in environ:
        endpoint = environ['COMPUTER_VISION_ENDPOINT']
    else:
        print("\nSet the COMPUTER_VISION_ENDPOINT environment variable.\n**Restart your shell or IDE for changes to take effect.**")

    # Create an authenticated Computer Vision client.
    client = ComputerVisionClient(endpoint, CognitiveServicesCredentials(subscription_key))

    # Open the local image file.
    with open(image_path, 'rb') as image_file:
        # Read the image into a byte array.
        image_data = image_file.read()

    # Set the features to return.
    features = [VisualFeatureTypes.categories, VisualFeatureTypes.description, VisualFeatureTypes.tags]

    # Analyze the local image.
    result = client.analyze_image(image=image_data, visual_features=features)


def generate_caption(image_file, subscription_key):
    from azure.cognitiveservices.vision.computervision import ComputerVisionClient
    from msrest.authentication import CognitiveServicesCredentials
    from PIL import Image
    from io import BytesIO

    client = ComputerVisionClient(ComputerVisionClient.SUBSCRIPTION_KEY_ENDPOINT,
                                  CognitiveServicesCredentials(subscription_key))

    image_stream = open(image_file, 'rb')
    image = image_stream.read()
    image_bytes = BytesIO(image)

    analysis = client.analyze_image(image_bytes, ComputerVisionClient.DESCRIPTION)
    return analysis.description.captions[0].text


def generate_caption(image_filename):
    import os
    import json
    from azure.cognitiveservices.vision.computervision import ComputerVisionClient
    from msrest.authentication import CognitiveServicesCredentials
    from PIL import Image
    from io import BytesIO
    import sys
    import requests

    # Set the environment variables for your subscription key and region
    # For example, for US West
    # "COMPUTER_VISION_SUBSCRIPTION_KEY=<subscription key>"

    # For example, for China East
    # "COMPUTER_VISION_SUBSCRIPTION_KEY=<subscription key>"

    # Set your subscription key and region
    subscription_key = os.environ['COMPUTER_VISION_SUBSCRIPTION_KEY']
    endpoint = os.environ['COMPUTER_VISION_ENDPOINT']

    # Set your image path
    image_path = image_filename

    # Set your region
    region = "eastus"

    # Create an instance of the Computer Vision client
    computer_vision_client = ComputerVisionClient(endpoint, CognitiveServicesCredentials(subscription_key))

    # Load the image from disk
    with open(image_path, 'rb') as image_file:
        image_content = image_file.read()

    # Add the image to the list of images to analyze
    image_list = [image_content]

    # Make the analysis asynchronous
    analyze_result = computer_vision_client.analyze_image_in_stream(image_list, visual_features=['Categories', 'Description', 'Color'])

    # Get the analysis result
    analysis = analyze_result.result()

    # Display the image captions
    print("\nImage captions:")
    for caption in analysis.description.


def generate_caption(image_path):
    # load the image
    image = open(image_path, "rb").read()
    # encode the image into base64
    image = base64.b64encode(image).decode("utf-8")
    # create a dict with the image
    body = {"image": image}
    # create a dict with the subscription key
    headers = {"Ocp-Apim-Subscription-Key": "0d5a5f7f1e534d05a8e8d6f9e6e9c6d3"}
    # create a url for the api
    url = ",Tags"
    # send the request
    response = requests.post(url, headers=headers, json=body)
    # get the response
    # get the caption
    caption = response.json()["description"]["captions"][0]["text"]
    # return the caption
    return caption


def generate_caption(image_path):
    from os import environ
    from os.path import join
    from urllib.parse import urlparse
    from urllib.parse import parse_qs
    from http.client import HTTPSConnection
    from json import loads
    from base64 import b64encode

    # Load the image into a byte array
    image_data = open(image_path, "rb").read()
    image_data = b64encode(image_data).decode('utf-8')

    # Read the subscription key from the environment variables
    subscription_key = environ["AZURE_SUBSCRIPTION_KEY"]

    # Define the vision API endpoint
    vision_base_url = ""

    # Set the vision API endpoint to analyze the image
    vision_analyze_url = vision_base_url + "analyze"

    # Set the parameters
    parameters = {'visualFeatures': 'Categories,Description,Color'}

    # Prepare the request
    headers = {'Ocp-Apim-Subscription-Key': subscription_key,
               'Content-Type': 'application/octet-stream'}
    params = parse_qs(urlparse(vision_analyze_url).query)
    params['visualFeatures'] = 'Categories,Description,Color'
    params['language'] = 'en'

    # Send the request
    conn = HTTPSConnection('')
    conn.request("POST", vision_analyze_url, image_data, headers)
    response = conn.getresponse()
    data = response.read()

    # Load the JSON response into a Python dictionary
    parsed = loads(data)

    return parsed


def generate_caption(image_path):
    """
    Generate a caption for an image.
    :param image_path: path to an image
    :return: the caption
    """

    # Load the image
    from os.path import isfile
    from PIL import Image
    import requests

    if not isfile(image_path):
        raise FileNotFoundError(image_path)

    image = Image.open(image_path)

    # Encode the image
    image_data = open(image_path, "rb").read()

    # Computer Vision parameters
    params = {'visualFeatures': 'Categories,Description,Color'}

    # Computer Vision API URL
    url = ""

    # Connect to Computer Vision
    headers = {'Ocp-Apim-Subscription-Key': 'd3f3e1b8e7e14d0b9d7d3f8c8f7e0b6f', 'Content-Type': 'application/octet-stream'}
    response = requests.post(url, headers=headers, params=params, data=image_data)

    # Load the response as JSON
    analysis = response.json()

    # Extract the word tags
    categories = [(category['name'], category['score'])
                  for category in analysis['categories']]
    for category in categories:
        print(category)

    # Extract the description
    descriptions = [(description['text'], description['score'])
                    for description in analysis['description']['captions']]
    for description in descriptions:
        print(description)

    # Display the color
    try:
        dominant_color = analysis['color']['dominantColorBackground']
    except KeyError:
        dominant_color = 'N/A'
    print('Dominant color: {}'.format(dominant_color))

    # Display the caption
    caption = '{}'.format(


def generate_caption(filename):
    import os
    import sys
    import json
    import requests
    import cognitive_services_keys
    import cognitive_services_key_names

    # create the header for the request
    headers = {'Ocp-Apim-Subscription-Key': cognitive_services_keys.get_key(cognitive_services_key_names.BING_KEY)}

    # the URL for the bing endpoint
    bing_url = ''

    # the parameters for the request
    params = {'q': 'image', 'count': 1, 'offset': 0, 'mkt': 'en-us', 'safeSearch': 'Moderate'}

    # read the contents of the file
    with open(filename, 'rb') as f:
        data = f.read()

    # build the request
    request = requests.post(bing_url, headers=headers, params=params, data=data)

    # check for errors
    if request.status_code != 200:
        print('Error code: {}'.format(request.status_code))

    # process the response
    response = json.loads(request.text)
    if 'value' in response:
        caption = response['value'][0]['name']
        print('The caption is: {}'.format(caption))
    else:
        print('No caption found.')


def generate_caption(image_path):
    import os
    import json
    import requests

    subscription_key = "e9e9c8a8d1d9400e8d0b7f2d8f8e7c6e"

    assert subscription_key

    vision_base_url = ""

    analyze_url = vision_base_url + "analyze"

    image_data = open(image_path, "rb").read()

    headers    = {'Ocp-Apim-Subscription-Key': subscription_key,
                  'Content-Type': 'application/octet-stream'}
    params     = {'visualFeatures': 'Categories,Description,Color'}
    response = requests.post(
        analyze_url, headers=headers, params=params, data=image_data)

    analysis = response.json()
    image_caption = analysis["description"]["captions"][0]["text"].capitalize()
    return image_caption
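
Some suggestions come back cut off mid-statement, so a quick first pass when cycling through them is to check which ones even parse. The sketch below is a minimal illustration using Python's standard ast module; the two sample strings are hypothetical stand-ins for suggestions, and parsing says nothing about whether the code is correct.

```python
import ast


def is_complete_python(source):
    """Return True if the source parses as Python, False on a syntax error."""
    try:
        ast.parse(source)
    except SyntaxError:
        return False
    return True


# Hypothetical stand-ins: one complete suggestion, one cut off mid-statement.
suggestions = [
    "def generate_caption(image_path):\n    return image_path\n",
    "def generate_caption(image_path):\n    caption = '{}'.format(\n",
]

if __name__ == "__main__":
    for number, source in enumerate(suggestions, start=1):
        status = "parses" if is_complete_python(source) else "incomplete"
        print(f"Solution {number}: {status}")
```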

Reinforcement Learning – An Introduction

Reinforcement Learning is learning from experience – it is how most of us learn. Reinforcement Learning (#RL) is a different approach to ML – it is a set of techniques that allows AI algorithms to experiment and learn from experience. RL falls in between supervised and unsupervised learning – there isn’t any labeled data, but at the same time it isn’t unsupervised either. In its simplest form, RL is a computational approach for automating goal-oriented decision making and learning.

Inherent in RL is the ability to operate in a dynamic, uncertain environment. RL can be more formally defined as the study, science, and problem of intelligence in the form of an agent that interacts with an environment. At the end of the day, almost all RL problems can be formalized as Markov decision processes (MDPs).

The problem is represented by an environment – such as a world in which an agent is based. The steps in RL are quite clear – the agent takes actions that have some effect on the environment. The environment reacts to those actions and gives back an observation to the agent – what it sees and senses.

One special signal the environment gives back to the agent is called a reward signal. This signal is what an agent uses to figure out how well it is doing. The RL problem is to take actions over time to maximize the reward signals. And this notion of maximizing is what the agent is learning from the environment, without any explicit supervision. This construct helps an agent achieve a goal, even in an uncertain environment, factoring in delayed and indirect consequences of actions.

Reinforcement Learning Overview
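
The action, observation, and reward loop can be sketched in a few lines of Python. The corridor environment and the always-right policy below are hypothetical toy examples made up for illustration; they do not come from any RL library.

```python
class CorridorEnvironment:
    """Toy environment: a corridor of cells 0..goal; the episode ends at the goal."""

    def __init__(self, goal=3):
        self.goal = goal
        self.position = 0

    def step(self, action):
        """Apply an action (-1 = left, +1 = right); return (observation, reward, done)."""
        self.position = max(0, min(self.goal, self.position + action))
        done = self.position == self.goal
        reward = 1.0 if done else 0.0  # reward signal: only reaching the goal pays off
        return self.position, reward, done


def always_right_policy(observation):
    """A fixed policy that ignores the observation and always moves right."""
    return +1


def run_episode(environment, policy, max_steps=100):
    """Run the agent-environment loop and return the total reward collected."""
    observation, total_reward = environment.position, 0.0
    for _ in range(max_steps):
        action = policy(observation)                           # agent acts
        observation, reward, done = environment.step(action)  # environment responds
        total_reward += reward
        if done:
            break
    return total_reward


if __name__ == "__main__":
    print(run_episode(CorridorEnvironment(goal=3), always_right_policy))
```

A learning agent would replace the fixed policy with one that is updated from the rewards it observes.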

An agent can have many actions (i.e., choices); it uses a ‘reward’ signal to determine which of those actions is considered ‘good’ vs. ‘bad’. Of course, this determination is in the context of the outcome that we want to achieve.

Some examples of rewards in different industries and use cases:

  • Maneuvering a UAV – positive for following a chosen trajectory; negative for deviating from that trajectory.
  • Managing an investment portfolio – positive for each dollar earned; negative for each dollar lost.
  • Controlling a power station – As one can imagine, this control would typically constitute a few things in the environment – a sequence of controls, motors, batteries, power sources, etc. In optimizing the throughput of a power station, we can think of positive rewards for producing power; negative for exceeding a safety threshold.
  • Playing a game – positive for increasing score; negative for decreasing score.

Core concepts that make up RL:

Agent – The ‘thing’ that is acting on behalf of a user or another program. This can be a program executing a business process, an embedded process, the arm of a robot, actuators on a self-driving car controlling the wheels, etc.

Policy – A policy outlines how an agent would behave at certain times and can be thought of as the problem we are trying to solve. This is an agent’s behavior function and is a mapping of the business outcome that we are after.

Reward – A reward is a special feedback signal that outlines what is considered good (or bad), and is correlated with the agent’s current action and the current state of the environment. All goals can be described as maximizing the cumulative reward. The reward is not a binary number but a scalar; in a normalized setup it sits between 0 and 1 – with zero being ‘bad’ and one being the best reward attainable for that action.

Value function – A value function represents how good it is to be in a particular state, taking into account the actions that follow from it. Where a reward signal specifies what is good in an immediate sense (the current step), the value function represents the notion of good overall. At an abstract level, rewards are primary, and value functions, as predictions of rewards, are secondary. In the end, though, when making decisions we are more concerned with reaching states of higher value than with collecting the highest immediate reward.

Model – A model is an agent’s view of the environment and mimics its behavior. This allows us to make inferences on how the environment will behave and is often used for planning. Think of the model as the strategy to use in solving the problem at hand.
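
To illustrate the difference between an immediate reward and long-run value, here is a minimal sketch that sums a sequence of rewards with a discount factor. The discount factor gamma is an assumption on my part (a common convention in RL, not something defined above), and discounted_return is a hypothetical helper name.

```python
def discounted_return(rewards, gamma=0.9):
    """Sum rewards from the current step onward, discounting each later step by gamma."""
    total = 0.0
    # Work backwards so each step adds its reward to the discounted future total.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


if __name__ == "__main__":
    # No immediate reward, but a payoff two steps later still gives the state some value.
    print(discounted_return([0.0, 0.0, 1.0], gamma=0.5))
```

A value function is, roughly, the expected value of this quantity from a given state.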

Taxonomy of RL Algorithms

There are many types of RL algorithms (as we can see in the figure below), but these can broadly be classified in the following two categories.

  • Model free: A model-free algorithm can be thought of as an explicit trial and error algorithm. In a model-free approach, the agent doesn’t have, or ignores, a model of the environment; instead, the agent uses experience and tries to optimize a Policy.
  • Model based: On the other hand, a model-based algorithm reflects how an environment works, factors in the associated reward function, and tries to maximize that. Technically, this is the optimization of the transition probability distribution of the MDP.

The main difference between the two – in one the algorithm optimizes using a model of the environment, and in the other it optimizes a policy directly from experience. There is no one right or wrong algorithm – a lot of it depends on the situation at hand and what one is trying to optimize for.
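
As one concrete instance of the model-free family, here is a minimal sketch of a single tabular Q-learning update. Q-learning is my choice of example and is not named above; the states, actions, learning rate alpha, and discount factor gamma are all illustrative assumptions.

```python
from collections import defaultdict


def q_learning_update(q_table, state, action, reward, next_state, actions,
                      alpha=0.5, gamma=0.9):
    """Nudge the value estimate for (state, action) toward the observed outcome."""
    best_next = max(q_table[(next_state, a)] for a in actions)
    target = reward + gamma * best_next
    q_table[(state, action)] += alpha * (target - q_table[(state, action)])
    return q_table[(state, action)]


if __name__ == "__main__":
    q_table = defaultdict(float)  # unseen (state, action) pairs start at 0.0
    actions = ["left", "right"]
    # The agent only needs an observed transition, not a model of the environment.
    print(q_learning_update(q_table, 0, "right", 1.0, 1, actions))
```

Note that nothing here describes how the environment works; the estimate is built purely from experienced transitions.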

As we can see below each of these categories can be further broken down – we won’t go into the details of those quite yet, maybe that is for another post. One of the most important components of most RL algorithms is a method to efficiently estimate values – at the end of the day, this is all about value estimation.

Chart showing the taxonomy of RL algorithms.
Taxonomy of RL Algorithms

Exploration and Exploitation

Exploration and exploitation are two concepts that are at odds with each other, and for a given situation we should aim to strike a balance between them. In simple terms, RL is sequential decision making – one selects actions to maximize future rewards, and we need to plan long term – rewards might be delayed and not immediate, and we cannot be greedy. Sometimes, we need to sacrifice the immediate reward to gain more (or better) longer term rewards.

This can be thought of as a trial-and-error learning loop – a stream of experiences made up of loops of actions, rewards, and observations. At the end of the day, this loop is what matters.

Exploration finds more information about the environment, and in doing so gives up rewards. Exploitation, on the other hand, exploits the information it already has to maximize rewards. If we don’t explore, we might be stuck in a sub-optimal place, and how would we know if there are better rewards without trying?

When we are in the trial-and-error loop we might be losing rewards, and yet the agent needs to discover a good policy to maximize the rewards – this tension is like the opposite ends of a string pulling against each other.

It is important to balance both exploring and exploiting.
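
A common and simple way to strike this balance is an epsilon-greedy rule: explore with a small probability, exploit otherwise. The sketch below is a minimal illustration; the function name and the reward estimates are made up for the example.

```python
import random


def epsilon_greedy(estimates, epsilon, rng=random):
    """Pick an action index: random with probability epsilon, else the best known."""
    if rng.random() < epsilon:
        return rng.randrange(len(estimates))  # explore: try any action
    # exploit: choose the action with the highest estimated reward
    return max(range(len(estimates)), key=lambda index: estimates[index])


if __name__ == "__main__":
    estimated_rewards = [0.1, 0.9, 0.3]
    picks = [epsilon_greedy(estimated_rewards, epsilon=0.1) for _ in range(10)]
    print(picks)  # mostly the best-known action, with occasional exploration
```

With epsilon at 0 the agent never explores, and with epsilon at 1 it never exploits; useful values sit somewhere in between.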

GPT-3 vs other AI powered assistants

I have been kicking the tires with OpenAI’s #GPT-3. Based on the screenshot below, it might be easy to think “oh boy does the model think highly of itself”, but as with most things in life – the devil is in the details. The screenshot below was a forked version of the davinci engine and follows the Q&A structure.

OpenAI's GPT3 answering questions when compared to other AI powered assistants.
GPT-3 vs other AI assistants

Using OpenAI’s API is quite simple; perhaps too simple! It is quite easy to unleash the beast, as the code snippet below shows. If you are new to using GPT3, I would highly recommend you start with the use case model guidelines first.

In the context of a toy example, getting to a simple Q&A chatbot like the one in the earlier screenshot is quite simple. The API is powerful and simple to use, and getting started is easy, as the code below shows.

import os
import openai

openai.api_key = os.getenv("OPENAI_API_KEY")
response = openai.Completion.create(
    engine="davinci",  # assumption: the davinci engine mentioned earlier; the original snippet is cut off after the prompt
    prompt="I am a highly intelligent question answering bot. If you ask me a question that is rooted in truth, I will give you the answer. If you ask me a question that is nonsense, trickery, or has no clear answer, I will respond with \"Unknown\".\n\nQ: What is human life expectancy in the United States?\nA: Human life expectancy in the United States is 78 years.\n\nQ: Who was president of the United States in 1955?\nA: Dwight D. Eisenhower was president of the United States in 1955.\n\nQ: Which party did he belong to?\nA: He belonged to the Republican Party.\n\nQ: What is the square root of banana?\nA: Unknown\n",
)

There are three core concepts when using GPT-3: Prompt, Completion, and Tokens.

To start using the API, we need to give it some prompts – these provide some context to the engine on what we are expecting. Without them, the surface area is too broad and we get into nonsensical situations. This is part of the task-specific fine-tuning required.

When we give examples as part of the prompt, we are essentially “programming” the model, providing guidance and some hints on context and pattern matching. Note that the training data cuts off in late 2019, so the model in production today doesn’t have access to data and events after that (e.g., Covid).
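
This kind of “programming by example” can be sketched as a small helper that assembles a few-shot Q&A prompt. The build_few_shot_prompt function is hypothetical and only does string assembly; it makes no API call.

```python
def build_few_shot_prompt(instructions, examples, question):
    """Assemble instructions, worked Q/A examples, and a new question into one prompt."""
    parts = [instructions, ""]
    for example_question, example_answer in examples:
        parts.append(f"Q: {example_question}")
        parts.append(f"A: {example_answer}")
        parts.append("")
    # End on an open "A:" so the model's completion is the answer.
    parts.append(f"Q: {question}")
    parts.append("A:")
    return "\n".join(parts)


if __name__ == "__main__":
    prompt = build_few_shot_prompt(
        "I am a highly intelligent question answering bot.",
        [
            ("Who was president of the United States in 1955?",
             "Dwight D. Eisenhower was president of the United States in 1955."),
            ("What is the square root of banana?", "Unknown"),
        ],
        "Which party did he belong to?",
    )
    print(prompt)
```

The examples carry the pattern; changing them changes what the model treats as a good answer.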

Completion is the output that GPT3 generates based on the prompt. To be clear, this is not the full text but the predicted completion; think of it as “autocomplete” in Word, Outlook, or a search engine. The API has the flexibility to return more than one predicted completion, along with the probabilities of alternative tokens at each position (to me it seems just like the wave function in quantum mechanics).

Finally, think of Tokens as the smaller Lego blocks that combine to make words. The API, which is nothing but a wrapper around GPT-3, breaks up the text into tokens before processing it. The GPT-3 model understands the statistical relationships between these tokens and uses this to produce the next token in a sequence of tokens.

For example, if we are curious about Tokens, we can see in the screenshot below how the API “tokenizes” this paragraph and get the details of the tokens. This paragraph contains 207 characters and 43 tokens.

Token text that GPT-3 API converts to before using.
GPT-3 Tokens – Text
Token ID's that GPT-3 API converts to before using
GPT-3 Token – IDs

At a high level, think of one token == ~4 characters of text, which is ¾ of a word; so, 100 tokens ~= 75 words.
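
That rule of thumb is easy to turn into a rough estimator. The sketch below is only an approximation built from the two ratios above; it is not the tokenizer the API uses, and the helper names are made up.

```python
import math

CHARS_PER_TOKEN = 4      # rule of thumb: one token is roughly 4 characters
WORDS_PER_TOKEN = 0.75   # rule of thumb: one token is roughly 3/4 of a word


def estimate_tokens(text):
    """Roughly estimate the token count of a text from its character count."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def tokens_to_words(tokens):
    """Roughly estimate how many words a given number of tokens represents."""
    return round(tokens * WORDS_PER_TOKEN)


if __name__ == "__main__":
    print(estimate_tokens("x" * 207))  # character count of the example paragraph
    print(tokens_to_words(100))
```

For the 207-character paragraph above, the estimate comes out at 52 tokens against the 43 the API actually reported, which shows how rough the rule of thumb is.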

This is just dipping our toes in the beast that is GPT-3; the APIs which wrap up and expose the engines (more on that in another post) make it simple to use without getting too much into the weeds of 175 billion parameters. 🙂

ML algorithm cheat sheet

An #ML algorithm cheat sheet – helping narrow down to a certain set of #algorithm groupings depending on the problem at hand and what we are trying to solve from a business perspective.

Cheat sheet showing different #ML algorithms to choose from depending on the task at hand
Figure 1

Figure 2 shows what additional characteristics we need to consider when choosing the right ML algorithm for the situation at hand. This is something that cannot be generic and is very situational.

Flow diagram showing how to select a ML algorithm and additional characteristics we need to consider as we select a ML algorithm
Figure 2 – Characteristics in selecting ML algorithms

If you find this useful, I would also recommend reading “How to select algorithms” which is detailed as part of Azure ML designer.

bfloat16 – how it improves AI chip designs

Floating point calculations are slow for computers (specifically CPUs); possibly representing the same struggle for many humans. 🙂

I remember a time when an FPU (floating point unit) was an upgrade and one had to pay extra to get one. Very useful when you needed that extra precision in computing – and in my head, it always seemed like the Turbo button. 🙂

For most #ML workloads and computations, precision isn’t the most important criterion; with ever-increasing data and parameters (looking at you GPT-3 with 45 TB of data and 175 billion parameters!), what most ML needs today is speed and dynamic range.

This is where bfloat16 (Brain floating-point format with 16 bits) – a new floating-point format – comes in handy, and in the context of #AI improves on IEEE 754 – the current floating-point arithmetic standard.

As per IEEE 754, a single-precision floating-point number always takes up 32 bits (see Figure 1 below) – irrespective of the size of the number. One bit holds the sign. The exponent (8 bits) tells us how many places we shift (left or right) to place the binary point. The fraction (23 bits), also called the mantissa, holds the actual number – i.e. the data.

Figure 1 – IEEE 754 Floating point representation

bfloat16 cuts the data size in half, from 32 bits to 16 (see Figure 2), with the fraction truncated from 23 to 7 bits. This of course means bfloat16 isn’t as precise. However, because bfloat16 has the same number of exponent bits as single-precision IEEE 754, it can represent a similar range (small to large) and, more importantly, it is easy to convert between bfloat16 and IEEE 754.

Figure 2 – bfloat16 representation
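To make the bit layout concrete, here is a minimal Python sketch of the conversion. It assumes the simplest possible scheme, plain truncation of the low 16 bits of a 32-bit float; real hardware may round instead of truncating. The function names are mine and purely illustrative.

```python
import struct


def float32_to_bfloat16_bits(x: float) -> int:
    """Return the 16-bit bfloat16 pattern of x by dropping the low 16 bits
    of its 32-bit IEEE 754 encoding (truncation, an assumed simplification)."""
    (bits32,) = struct.unpack(">I", struct.pack(">f", x))
    return bits32 >> 16  # keeps sign (1), exponent (8) and top 7 fraction bits


def bfloat16_bits_to_float32(bits16: int) -> float:
    """Widen a bfloat16 pattern back to a float by padding 16 zero bits."""
    (value,) = struct.unpack(">f", struct.pack(">I", (bits16 & 0xFFFF) << 16))
    return value


def round_trip(x: float) -> float:
    """Value that survives a float32 -> bfloat16 -> float32 round trip."""
    return bfloat16_bits_to_float32(float32_to_bfloat16_bits(x))


if __name__ == "__main__":
    import math
    print(round_trip(math.pi))  # 3.140625: only 7 fraction bits survive
    print(round_trip(1e38))     # still a huge number: the exponent is untouched
```

Because the sign and exponent bits sit in the same place in both formats, the conversion is just a shift, which is what makes moving between the two so cheap.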

Less precision doesn’t impact matrix multiplication much, so in the context of ML training and inference these chips are more efficient at scale: not only are they faster, they also use less power and memory bandwidth.

What is interesting is that in some neural nets, such as a DNN, the less precise bfloat16 can produce more accurate results than 32-bit IEEE 754! This is because the regularization and quantization of the weights cannot make use of the finer precision offered by IEEE 754, and adapt better to bfloat16. 🙂

Finally, bfloat16 is not a universal standard (yet), although most AI chips support it. ARM, Intel, and AMD have started adding support for it in their chipsets.

Getting DonkeyCar working on a Mac

I have been playing with a #selfdriving car for a while, and that is super exciting. From an #AI and #ML perspective it is small scale, but it allows one to explore all aspects of the tech stack and also appreciate the limitations of not only the software, but also the hardware.

With this, you run a NN on a Raspberry Pi that uses TensorFlow and Keras and runs inference on the edge. The Pi doesn’t have enough power to train, so you need to do that on a beefier machine and then deploy the model back to the Pi to run it.

Now, I didn’t have any issues getting this running on Windows, but getting it on a Mac was a different story. The documentation outlines all the steps, but even if you follow it to the T, it breaks right at the end.

When I tried to create a car using the createcar command (this essentially creates the buckets where you would save the training images, the model, and the configuration of the car when you connect to it from your machine), it failed with the error shown below. The actual file paths would probably be different for you, but essentially it is the same thing.

(donkey) AMAC02XN1T9JGH5:donkeycar amit.bahree$ donkey createcar ~/mycar
Traceback (most recent call last):
  File "/anaconda3/envs/donkey/lib/python3.6/site-packages/setuptools-27.2.0-py3.6.egg/pkg_resources/", line 660, in _build_master
  File "/anaconda3/envs/donkey/lib/python3.6/site-packages/setuptools-27.2.0-py3.6.egg/pkg_resources/", line 968, in require
  File "/anaconda3/envs/donkey/lib/python3.6/site-packages/setuptools-27.2.0-py3.6.egg/pkg_resources/", line 859, in resolve
pkg_resources.ContextualVersionConflict: (imageio 2.4.1 (/anaconda3/envs/donkey/lib/python3.6/site-packages), Requirement.parse('imageio<3.0,>=2.5'), {'moviepy'})

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/anaconda3/envs/donkey/bin/donkey", line 6, in <module>
    from pkg_resources import load_entry_point
  File "<frozen importlib._bootstrap>", line 961, in _find_and_load
  File "<frozen importlib._bootstrap>", line 950, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 646, in _load_unlocked
  File "<frozen importlib._bootstrap>", line 616, in _load_backward_compatible
  File "/anaconda3/envs/donkey/lib/python3.6/site-packages/setuptools-27.2.0-py3.6.egg/pkg_resources/", line 2985, in <module>
  File "/anaconda3/envs/donkey/lib/python3.6/site-packages/setuptools-27.2.0-py3.6.egg/pkg_resources/", line 2971, in _call_aside
  File "/anaconda3/envs/donkey/lib/python3.6/site-packages/setuptools-27.2.0-py3.6.egg/pkg_resources/", line 2998, in _initialize_master_working_set
  File "/anaconda3/envs/donkey/lib/python3.6/site-packages/setuptools-27.2.0-py3.6.egg/pkg_resources/", line 662, in _build_master
  File "/anaconda3/envs/donkey/lib/python3.6/site-packages/setuptools-27.2.0-py3.6.egg/pkg_resources/", line 675, in _build_from_requirements
  File "/anaconda3/envs/donkey/lib/python3.6/site-packages/setuptools-27.2.0-py3.6.egg/pkg_resources/", line 854, in resolve
pkg_resources.DistributionNotFound: The 'imageio<3.0,>=2.5' distribution was not found and is required by moviepy

The key thing to focus on is the last line of each of those two tracebacks – the main thing causing the issue is MoviePy, which asks for a newer imageio than the one installed.

MoviePy is a Python library for video editing: cutting, concatenations, title insertions, video compositing (a.k.a. non-linear editing), video processing, and creation of custom effects.

It seems that when you go through the steps (clone the repo, set up Anaconda, install TensorFlow, and get the car configured) there is a mismatch in the MoviePy dependencies which it doesn’t like. The way to work around the issue is outlined below.
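To see why the traceback complains, here is a rough sketch of the version check behind the 'imageio<3.0,>=2.5' requirement. This is a deliberately simplified comparison of dotted numbers that I wrote for illustration; pip's real version matching follows fuller rules, and the helper names are hypothetical.

```python
def parse_version(text):
    """Turn a dotted version such as '2.4.1' into (2, 4, 1) so that
    versions compare numerically rather than as strings."""
    return tuple(int(part) for part in text.split("."))


def satisfies(installed, minimum, below):
    """True when minimum <= installed < below (simplified range check)."""
    return parse_version(minimum) <= parse_version(installed) < parse_version(below)


if __name__ == "__main__":
    # The installed imageio from the traceback versus what moviepy asks for.
    print(satisfies("2.4.1", minimum="2.5", below="3.0"))  # False, hence the error
    print(satisfies("2.5.0", minimum="2.5", below="3.0"))  # True
```

The installed imageio 2.4.1 sits just below the 2.5 floor, which is exactly what the ContextualVersionConflict line reports.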

Skip MoviePy

MoviePy is not something you need right away; you only need it later when making a movie (using the makemovie command, which creates a movie file from the images in a Tub), so it is not essential. The easiest workaround is to remove (or, my suggestion, comment out) the moviepy dependency from the file.

This should be line 33 in the file, which you will find in the folder where you cloned the git repo. As an example, the updated file is below, with the moviepy dependency commented out. Once you save this and go about creating the car, it should work. Of course, you cannot use the makemovie option later.

from setuptools import setup, find_packages

import os

with open("", "r") as fh:
    long_description =

      description='Self driving library for python.',
      author='Will Roscoe',
          'console_scripts': [

                      'tf': ['tensorflow>=1.9.0'],
                      'tf_gpu': ['tensorflow-gpu>=1.9.0'],
                      'pi': [
                      'dev': [
                      'ci': ['codecov']


          # How mature is this project? Common values are
          #   3 - Alpha
          #   4 - Beta
          #   5 - Production/Stable
          'Development Status :: 3 - Alpha',

          # Indicate who your project is intended for
          'Intended Audience :: Developers',
          'Topic :: Scientific/Engineering :: Artificial Intelligence',

          # Pick your license as you wish (should match "license" above)
          'License :: OSI Approved :: MIT License',

          # Specify the Python versions you support here. In particular, ensure
          # that you indicate whether you support Python 2, Python 3 or both.

          'Programming Language :: Python :: 3.5',
          'Programming Language :: Python :: 3.6',
      keywords='selfdriving cars donkeycar diyrobocars',

      packages=find_packages(exclude=(['tests', 'docs', 'site', 'env'])),

Once you have saved the file, you need to run the installation again and then run the createcar command. Both commands are outlined below.

pip install -e .
donkey createcar ~/mycar

Once you run these, you should see a successful installation, as shown by the output below. Note – your output might be a little different depending on the state of your conda packages.

(donkey) AMAC02XN1T9JGH5:donkeycar amit.bahree$ pip install -e .
Obtaining file:///Users/amit.bahree/CloudStation/Documents/Code/donkeycar
Requirement already satisfied: numpy in /anaconda3/envs/donkey/lib/python3.6/site-packages (from donkeycar==2.5.7) (1.14.5)
Requirement already satisfied: pillow in /anaconda3/envs/donkey/lib/python3.6/site-packages (from donkeycar==2.5.7) (4.2.1)
Requirement already satisfied: docopt in /anaconda3/envs/donkey/lib/python3.6/site-packages (from donkeycar==2.5.7) (0.6.2)
Collecting tornado==4.5.3 (from donkeycar==2.5.7)
Requirement already satisfied: requests in /anaconda3/envs/donkey/lib/python3.6/site-packages (from donkeycar==2.5.7) (2.18.4)
Requirement already satisfied: h5py in /anaconda3/envs/donkey/lib/python3.6/site-packages (from donkeycar==2.5.7) (2.7.1)
Collecting python-socketio (from donkeycar==2.5.7)
  Using cached
Collecting flask (from donkeycar==2.5.7)
  Using cached
Collecting eventlet (from donkeycar==2.5.7)
  Using cached
Collecting pandas (from donkeycar==2.5.7)
  Using cached
Requirement already satisfied: olefile in /anaconda3/envs/donkey/lib/python3.6/site-packages (from pillow->donkeycar==2.5.7) (0.44)
Requirement already satisfied: chardet<3.1.0,>=3.0.2 in /anaconda3/envs/donkey/lib/python3.6/site-packages (from requests->donkeycar==2.5.7) (3.0.4)
Requirement already satisfied: certifi>=2017.4.17 in /anaconda3/envs/donkey/lib/python3.6/site-packages (from requests->donkeycar==2.5.7) (2017.7.27.1)
Requirement already satisfied: idna<2.7,>=2.5 in /anaconda3/envs/donkey/lib/python3.6/site-packages (from requests->donkeycar==2.5.7) (2.6)
Requirement already satisfied: urllib3<1.23,>=1.21.1 in /anaconda3/envs/donkey/lib/python3.6/site-packages (from requests->donkeycar==2.5.7) (1.22)
Requirement already satisfied: six in /anaconda3/envs/donkey/lib/python3.6/site-packages (from h5py->donkeycar==2.5.7) (1.10.0)
Collecting python-engineio>=3.2.0 (from python-socketio->donkeycar==2.5.7)
  Using cached
Collecting click>=5.1 (from flask->donkeycar==2.5.7)
  Using cached
Collecting itsdangerous>=0.24 (from flask->donkeycar==2.5.7)
  Using cached
Collecting Werkzeug>=0.14 (from flask->donkeycar==2.5.7)
  Using cached
Collecting Jinja2>=2.10 (from flask->donkeycar==2.5.7)
  Using cached
Collecting monotonic>=1.4 (from eventlet->donkeycar==2.5.7)
  Using cached
Collecting greenlet>=0.3 (from eventlet->donkeycar==2.5.7)
Collecting dnspython>=1.15.0 (from eventlet->donkeycar==2.5.7)
  Using cached
Collecting pytz>=2011k (from pandas->donkeycar==2.5.7)
  Using cached
Collecting python-dateutil>=2.5.0 (from pandas->donkeycar==2.5.7)
  Using cached
Collecting MarkupSafe>=0.23 (from Jinja2>=2.10->flask->donkeycar==2.5.7)
  Using cached
Installing collected packages: tornado, python-engineio, python-socketio, click, itsdangerous, Werkzeug, MarkupSafe, Jinja2, flask, monotonic, greenlet, dnspython, eventlet, pytz, python-dateutil, pandas, donkeycar
  Found existing installation: tornado 4.5.1
    Uninstalling tornado-4.5.1:
      Successfully uninstalled tornado-4.5.1
  Found existing installation: Werkzeug 0.12.2
    Uninstalling Werkzeug-0.12.2:
      Successfully uninstalled Werkzeug-0.12.2
  Running develop for donkeycar
Successfully installed Jinja2-2.10 MarkupSafe-1.1.1 Werkzeug-0.14.1 click-7.0 dnspython-1.16.0 donkeycar eventlet-0.24.1 flask-1.0.2 greenlet-0.4.15 itsdangerous-1.1.0 monotonic-1.5 pandas-0.24.1 python-dateutil-2.8.0 python-engineio-3.4.3 python-socketio-4.0.0 pytz-2018.9 tornado-4.5.3

And when I run createcar, you can see it worked as expected, in my case creating the ‘mycar’ folder in my home directory. Of course, you can create this wherever you prefer.

(donkey) AMAC02XN1T9JGH5:donkeycar amit.bahree$ donkey createcar ~/mycar
using donkey version: 2.5.7 ...
Creating car folder: /Users/amit.bahree/mycar
making dir  /Users/amit.bahree/mycar
Creating data & model folders.
making dir  /Users/amit.bahree/mycar/models
making dir  /Users/amit.bahree/mycar/data
making dir  /Users/amit.bahree/mycar/logs
Copying car application template: donkey2
Copying car config defaults. Adjust these before starting your car.
Donkey setup complete.

It is interesting to see this is more stable on Windows than on a Mac. Also, one last thing to leave you with – when I first ran the installation, the hint that something was wrong was in the output, but I didn’t pay too much attention to it. See the red line highlighted in the output below.

moviepy failure – donkeycar installation

I don’t know at this time what the proper solution for moviepy is – luckily it’s not a big deal at the moment.

Azure Cognitive Services in containers is the smart way to go

{Cross posted from my post on Avanade}

Containers just got smarter.

That’s the news from Microsoft, which announced recently that Azure Cognitive Services now supports containers. The marriage of AI and containers is a technology story, of course, but it’s a potentially even bigger business story, one that affects where and how you can do business and gain competitive advantage.

First, the technology story
Containers aren’t new, of course. They’re an increasingly popular technology with a big impact on business. That’s because they boost the agility and flexibility with which a business can roll out new tools to employees and new products and services to customers.

With containers, a business can get software releases and changes out faster and more frequently, increasing its competitive advantage. Because containers abstract applications from their underlying operating systems and other services—like virtual machines abstracted from hardware—those applications can run anywhere: in the cloud, on a laptop, in a kiosk or in an intelligent Internet-of-Things (IoT) edge device in the field.

In many respects this frees up the application’s developer, who can focus on creating the best, most useful software for the business. With Microsoft’s announcement, that software can now more easily include object detection, vision recognition, text and language understanding.

At Avanade, we take containers a step further by including support for them in our modern engineering platform, a key part of our overall approach to intelligent IT. So, you can automate your creation and management of containers—including AI-enabled containers—for a faster, easier, more seamless DevOps process. You can take greater advantage of IoT capabilities and move technologies such as AI closer to the edge, where they can reduce latency and boost performance.

What AI containers do for business
And you can do much more, which is where the business story gets interesting. With the greater agility and adaptability that comes with container-based AI services, you can respond more quickly to new competition, regulatory environments and business models. That contrasts with the more limited responses that have been possible with traditional, cloud-based AI. 

For example, data sovereignty laws and GDPR requirements generally restrict the transfer of data to the cloud, where cloud-based cognitive services can interact with it. Now, with containers that support cognitive services, you can avoid those restrictions by running your services locally.

A retail bank might use containerized AI to identify customers, address their needs, process payments and offer additional services, boosting customer satisfaction and bank revenue—all without sending private financial data outside the region (or even outside the bank) in accordance with GDPR.

Similarly, regional medical centers and clinics subject to HIPAA privacy laws in the US can process protected information on site with containerized AI to cut patient wait times and deliver better health outcomes.

Or, think about limited-connectivity or disconnected environments—such as manufacturing shop floors, remote customer sites or oil rigs or tankers—that can’t count on accessing AI that resides in the always-on cloud. Previously, these sites might have had to batch their data to process it during narrow periods of cloud connectivity, with the delays greatly limiting the timeliness and usefulness of AI.

Now, these sites can combine IoT and AI to anticipate and respond to manufacturing disruptions before they occur, increasing safety, productivity and product quality while reducing errors and costs.

If you can’t bring your data to your AI, now you can bring your AI to your data. That’s the message of container-hosted AI and the modern engineering platform. Together, they optimize your ability to bring AI into environments where you can’t count on the cloud. Using AI where you couldn’t before makes innovative solutions possible—and innovative solutions deliver competitive advantage. 

Boost ROI and scale
If you’re already using Azure Cognitive Services, you’ve invested time and money to train the models that support your use cases. Because those models are now portable, you can take advantage of them in regulated, limited-connectivity and disconnected environments, increasing your return on that investment. 

You can also scale your use of AI with a combination of cloud- and container-based architectures. That enables you to apply the most appropriate architectural form for any given environment or use. At the same time, you’re deploying consistent AI technology across the enterprise, increasing reliability while decreasing your operating cost.

Keep in mind…

Here are three things to keep in mind as you think about taking advantage of this important news:

  1. Break the barriers between your data scientists and business creatives. Containerized cognitive services is about far more than putting AI where you couldn’t before. It’s about using it in exciting new ways to advance the business. Unless you have heterogeneous teams bringing diverse perspectives to the table, you may miss some of the most important innovation possibilities for your business.
  2. You need a cloud strategy that’s not just about the cloud. If you don’t yet have a cloud strategy, you’re behind the curve. But if your cloud strategy is limited to the cloud, you may be about to fall behind the next curve. Microsoft’s announcement is further proof that the cloud is crucial to the enterprise—and also part of a larger environment, including both legacy and edge platforms, with which it must integrate.
  3. Be prepared for the ethics issues. Putting cognitive services in places you couldn’t before could raise new ethics issues. After all, we’re talking about the ability to read people’s expressions and even their emotions. This shouldn’t put you off—but it should put you on alert. Plug your ethics committee into these discussions when appropriate. If you don’t already have an ethics committee, create one. But that’s another post. 🙂

Want to learn more?

Microsoft’s announcement furthers the democratization of AI: the use of AI in more places and in more ways throughout the enterprise and beyond. Whether you turn to us for your AI solutions or look to us to assist you in developing your own, we’re ready to help with the greatest concentration of Microsoft expertise outside of Microsoft itself.

Roots of #AI

The naming is unfortunate when talking about #AI. There isn’t anything about intelligence in it – not as we humans know it. If we could rewind back to the 50’s, we could perhaps rename it to something like Computational Intelligence, which is more accurate. And although I have outlined the difference between some of the elements of AI in the past, I wanted to get back to what the intent was and how this area started.

Can machines think? Some say the origins of #AI go back to Turing and started with his paper “Computing machinery and intelligence” (PDF), published in 1950. Whilst Turing might have planted the seed, it was a program called Logic Theorist, created by Allen Newell, Cliff Shaw, and Herbert Simon, which was the first #ArtificialIntelligence program. Of course, it wasn’t called #AI then.

That started back in 1956, when Logic Theorist was presented at a conference at Dartmouth College called the “Dartmouth Summer Research Project on Artificial Intelligence (DSRPAI)” (PDF). The term “#AI” was coined at the conference.

Since then, AI has had a roller coaster of a ride over the decades – from colder than hell (I presume) winters, to hotter than lava with it being everywhere. As someone said, time will heal all wounds.

#AI Timeline

Today, many of us use #AI, #DeepLearning, and #MachineLearning interchangeably. Over the course of the last couple of years, I have learned to ignore that, but fundamentally the distinction is important.

AI, we would say is more computational intelligence – allowing computers to do tasks that would be difficult for humans to do, certainly at scale. And these tasks are accomplished using different mechanisms and techniques, using “intelligent agents”.

Machine learning is a subset of AI, where the program or algorithm can learn from previous outputs and improve based on that data – hence the “learning” part. It is akin to learning from experience, but isn’t the same thing as what we humans comprehend and understand. Some of us think the program is rewriting itself, which technically isn’t an accurate description.

Deep Learning is a set of machine learning techniques and algorithms inspired by how the neurons in our brain connect together and work. This set of techniques is also called Neural Networks, and is essentially nothing but a type of machine learning.

For any of this AI “magic” to work, the one thing it needs to feed on is data. Without data, none of this would be possible. This data is classified into two categories – features and labels.

  • Features – these are aspects of whatever we are interested in. For example, if we are interested in vehicles, features could be the colour, make, and model of the vehicle.
  • Labels – these are the buckets or categories we put the things we are interested in into. Using the same vehicle example, we can have labels such as SUV, Sedan, Sports Car, Truck, etc. that categorize vehicles.
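As a small illustration of the features and labels split, here is a sketch using the vehicle example. The rows are made up by me purely for illustration, and the helper name is hypothetical; it is not tied to any particular ML library.

```python
# Illustrative, made-up rows: three features plus one label per vehicle.
vehicles = [
    {"colour": "red", "make": "Ford", "model": "Explorer", "label": "SUV"},
    {"colour": "blue", "make": "Honda", "model": "Accord", "label": "Sedan"},
    {"colour": "black", "make": "Porsche", "model": "911", "label": "Sports Car"},
]


def split_features_and_labels(rows, label_key="label"):
    """Separate each row into its features (everything except the label)
    and its label, keeping the two lists in the same order."""
    features = [{k: v for k, v in row.items() if k != label_key} for row in rows]
    labels = [row[label_key] for row in rows]
    return features, labels


if __name__ == "__main__":
    X, y = split_features_and_labels(vehicles)
    print(X[0])  # features of the first vehicle
    print(y)     # the labels, one per vehicle
```

A model is trained to go from what is in the first list to what is in the second.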

One key principle to remember when it comes to #AI – all the outcomes are described in terms of probabilities and not absolutes. All it suggests is the likelihood of something happening, and most things cannot be predicted with total certainty. One should remember this fundamental aspect when making decisions.

There isn’t a universal definition of AI, which sometimes doesn’t help. Everyone has their own perception. I have gotten over it; I come to terms with their definition and ensure we are talking the same lingo and meaning. It doesn’t help to get academic about it. 🙂

For example, the definitions of AI from three leading analysts (Gartner, IDC, and Forrester), outlined below, are a good indicator of how this can get confusing.

  • Gartner – At its core, AI is about solving business problems in novel ways. It stretches across any organization from innovation, R&D and IT to data science.
  • IDC defines cognitive/Artificial Intelligence (AI) systems as a set of technologies that use deep natural language processing and understanding to answer questions and provide recommendations and direction. IDC’s coverage of cognitive/AI systems examines:
    • Digital assistants
    • Automated advisors
    • Artificial intelligence, deep learning and machine learning
    • Automated recommendation systems
  • Forrester defines AI as a liberatory technology at its core, and businesses that integrate it will free workers to become more innovative, creative, and adaptive than ever before. But these technologies are still in early stages.

And the field is just exploding now – not just with new research around #DeepLearning or #MachineLearning, but also net new aspects from a business perspective; things like:

  • Digital Ethics
  • Conversational AI
  • Democratization of AI
  • Data Engineering (OK, not new, but certainly key)
  • Model Management
  • RPA (or #IntelligentAutomation)
  • AI Strategy

It is a new and exciting world that spans multiple spectrums. Don’t try to drink from the fire-hose; take it in slowly, appreciate the nuances and the value each one brings, and discuss in terms of outcomes.

#ML concepts – Regularization, a primer

Regularization is a fundamental concept in Machine Learning (#ML) and is generally used with activation functions. It is the key technique that helps with overfitting.

Overfitting is when an algorithm or model ‘fits’ the training data too well – it seems too good to be true. Essentially, overfitting is when a model being trained learns the noise in the data instead of ignoring it. If we allow overfitting, then the network only uses (or is more heavily influenced by) a subset of the input (the larger peaks), and doesn’t factor in all the input.

The worry is that outside of the training data, it might not work as well for ‘real world’ data. For example, the model represented by the green line in the image below (credit: Wikipedia) follows the sample data too closely and seems too good. The model represented by the black line, on the other hand, is better.

Overfitting example

Regularization helps with overfitting by (artificially) penalizing the weights in the neural network. These weights are represented as peaks, and this reduces the peaks in the data. This ensures that the higher weights (peaks) don’t overshadow the rest of the data and hence cause the model to overfit. This diffusion of the weight vectors is sometimes also called weight decay.

Although there are a few techniques for preventing overfitting (outlined below), these days in Deep Learning, L1 and L2 regularization are favored over the others.

  • Cross validation: This is a method for finding the best hyperparameters for a model. E.g. in gradient descent, this would be to figure out the stopping criteria. There are various ways to do this, such as the holdout method, k-fold cross validation, leave-one-out cross validation, etc.
  • Step-wise regression: This method essentially is a serial step-by-step regression where one removes the weakest variable. Step-wise regression essentially does multiple regression a number of times, each time removing the most weakly correlated variable. At the end you are left with the variables that explain the distribution best. The only requirements are that the data is normally distributed, and that there is no correlation between the independent variables.

  • L1 regularization: In this method, we modify the cost function by adding the sum of the absolute values of the weights as the penalty. In L1 regularization the weights shrink by a constant amount towards zero. When applied to linear regression, L1 regularization is also called Lasso regression.

  • L2 regularization: In L2 regularization, on the other hand, we re-scale the weight by a constant factor – it shrinks by an amount that is proportional to the weight (as outlined in the image below). This shrinking makes the weight smaller and is also sometimes called weight decay. To make this shrinking proportional, the penalty is the sum of the squared weights, instead of the sum of their absolute values. At face value it might seem that the weights eventually get to zero, but that is not true; typically other terms cause the weights to increase. When applied to linear regression, L2 regularization is also called Ridge regression.

  • Max-norm: This enforces an upper bound on the magnitude of the weight vector. The one area this helps is that a network cannot ‘explode’ when the learning rate gets very high, as the weights are bounded. This is typically enforced using projected gradient descent.

  • Dropout: This is very simple and efficient, and is used in conjunction with one of the previous techniques. Essentially, it assigns each neuron a probability of staying active; otherwise the neuron ‘drops out’ by having its output set to zero. Dropout doesn’t modify the cost function; it modifies the network itself, as shown in the image below.

  • Increase training data: Whilst artificially expanding the training set is theoretically possible, in reality it won’t work in most cases, especially in more complex networks. And even where one might consider artificially expanding the dataset, it is typically not cost effective to get a representative dataset.
L1 Regularization
L2 Regularization
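The dropout idea from the list above can be sketched in a few lines of Python. This is a minimal sketch, not how any particular framework implements it; the function name is mine, and the rescaling of the surviving values by 1/keep_prob (often called inverted dropout) is an assumption I have added so the expected output stays roughly the same.

```python
import random


def dropout(activations, keep_prob, rng=None):
    """Keep each activation with probability keep_prob, otherwise set it to 0.
    Survivors are scaled by 1/keep_prob (assumed 'inverted dropout' variant)."""
    if not 0.0 < keep_prob <= 1.0:
        raise ValueError("keep_prob must be in (0, 1]")
    rng = rng or random.Random()
    result = []
    for value in activations:
        if rng.random() < keep_prob:
            result.append(value / keep_prob)  # neuron stays active
        else:
            result.append(0.0)                # neuron 'drops out'
    return result


if __name__ == "__main__":
    layer_output = [0.5, 1.2, -0.3, 0.8]
    print(dropout(layer_output, keep_prob=0.5, rng=random.Random(42)))
```

Note that nothing here touches the cost function; only the values flowing through the network change, and only during training.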

Between L1 and L2 regularization, many say that L2 is preferred, but I think it depends on the problem statement. Say, in a network, a weight has a large magnitude: L2 regularization shrinks the weight more than L1 and will be better. Conversely, if the weight is small, then L1 shrinks the weight more than L2, and is better as it tends to concentrate the weight in fewer but more important connections in the network.
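Here is a rough sketch of that difference, looking only at the penalty term and ignoring the gradient of the actual loss (an assumption made to keep the example small). The learning rate, the penalty strength `lam`, and the function names are illustrative choices of mine.

```python
def l1_penalty(weights, lam):
    """L1 penalty: lam times the sum of absolute weights."""
    return lam * sum(abs(w) for w in weights)


def l2_penalty(weights, lam):
    """L2 penalty: lam times the sum of squared weights."""
    return lam * sum(w * w for w in weights)


def sign(w):
    return (w > 0) - (w < 0)


def l1_step(w, lr, lam):
    """One gradient step on the L1 penalty: shrinks w by the constant lr * lam."""
    return w - lr * lam * sign(w)


def l2_step(w, lr, lam):
    """One gradient step on the L2 penalty: shrinks w in proportion to w."""
    return w - lr * 2.0 * lam * w


if __name__ == "__main__":
    for w in (5.0, 0.1):
        print(w, "L1 ->", l1_step(w, lr=0.1, lam=0.1), "L2 ->", l2_step(w, lr=0.1, lam=0.1))
```

With these settings the large weight loses more under L2, while the small weight loses more under L1, which matches the reasoning above.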

In closing, the key aspect to appreciate – the small weights (peaks) in a regularized network essentially mean that as our input changes randomly (i.e. noise), it doesn’t have a huge impact on the network and its output. So this makes it difficult for the network to learn the noise and respond to it. Conversely, in an unregularized network, which has higher weights (peaks), small random changes can have a larger impact on the behavior of the network and the information it carries.

Neural Network – Cheat Sheet

Neural Networks, today, help with a great set of tasks that until very recently weren’t possible at all – from computer vision, to medical diagnosis, to speech translation – and form a key cornerstone of a lot of the ‘magic’ that Machine Learning and AI offer today.

I did blog about Neural Network types (and MarI/O) sometime back; I surely cannot take credit for creating these three cheat sheets but they are awesome and hope you get to use and enjoy them too.

Neural Network Graphs

Neural network basics–Activation functions

Neural networks have a very interesting aspect – they can be viewed as a simple mathematical model that defines a function. For a given function f(x), which can take any input value of x, there will be some neural network that approximates that function. This was proven in 1989 (“Approximation by Superpositions of a Sigmoidal Function” and “Multilayer feedforward networks are universal approximators”) and forms the basis of many of the #AI and #ML use cases possible today.

It is this aspect of neural networks that allows us to map any process and generate a corresponding function. Unlike a function in Computer Science, this function isn’t deterministic; instead it is a confidence score of an approximation (i.e. a probability). In general, the more layers in a neural network, the better this approximation can be.

In a neural network, typically there is one input layer, one output layer, and one or more layers in the middle. To the external system, only the input layer (values of x), and the final output (output of the function f(x)) are visible, and the layers in the middle are not and essentially hidden.

Each layer contains nodes, which are modeled after how the neurons in the brain work. The output of each node gets propagated along to the next layer. This output is the defining character of the node, and activates the node to pass on its value to the next node; this is very similar to how a neuron in the brain fires and passes the signal on to the next neuron.

Neural Network

For the generalization of function f(x) outlined above to hold, that function needs to be a continuous function. A continuous function is one where small changes to the input value x create small changes to the output of f(x). If these changes are not small and the value jumps a lot, then the function is not continuous, and it is difficult for a neural network to achieve the approximation required.

For a neural network to ‘learn’, the network essentially has to try different weights and biases, each of which has a corresponding change to the output, hopefully bringing it closer to the result we desire. Ideally, small changes to these weights and biases correspond to small changes in the output of the function. But one isn’t sure until we train and test the result, to see that small changes don’t cause bigger shifts that drastically move away from the desired result. It isn’t uncommon to see that one aspect of the result has improved while others have not, overall skewing the results.

In simple terms, an activation function is a function attached to the output of a node in a neural network, and maps the resulting value into a set range, for example between 0 and 1. It is also used to connect two neural networks together.

An activation function can be linear or non-linear. A linear function isn’t terribly effective, as its range is infinite. A non-linear function with a finite range is more useful, as it can be mapped as a curve, and changes on this curve can then be used to calculate the difference between two points on the curve.

There are many types of activation functions, each with its own strengths. In this post, we discuss the following six:

  • Sigmoid
  • Tanh
  • ReLU
  • Leaky ReLU
  • ELU
  • Maxout

1. Sigmoid function

A sigmoid function can map any input value into a probability, i.e., a value between 0 and 1. A sigmoid function is typically written using a sigma (\sigma) and is also called the logistic function. For any given input value x, the formal definition of the sigmoid function is as follows:

\sigma(x) \equiv \frac{1}{1+e^{-x}}

If our inputs are x_1, x_2,\ldots, with corresponding weights w_1, w_2,\ldots and a bias b, then the previous sigmoid definition becomes:

\frac{1}{1+\exp(-\sum_j w_j x_j-b)}
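The two formulas above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the names `sigmoid` and `neuron_output` are my own, and the input values are made up for the example.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic sigmoid: 1 / (1 + e^(-x)).

    Note: math.exp overflows for very large negative x (around x < -709),
    so this simple version is only meant for moderate inputs.
    """
    return 1.0 / (1.0 + math.exp(-x))


def neuron_output(inputs, weights, bias):
    """Sigmoid applied to the weighted sum of the inputs plus the bias."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return sigmoid(z)


# Illustrative values only: the weighted sum here is 0.5 - 0.5 + 0 = 0.
print(sigmoid(0.0))                                   # 0.5
print(neuron_output([1.0, 2.0], [0.5, -0.25], 0.0))   # 0.5
print(sigmoid(10.0), sigmoid(-10.0))                  # close to 1, close to 0
```

Whatever the weighted sum is, the result always lands strictly between 0 and 1.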

When plotted, the sigmoid function looks like the curve below. When we use it in a neural network, we essentially end up with a smoothed-out function, unlike a binary function (also called a step function), which is either 0 or 1.

For the sigmoid function \sigma(x), as x \rightarrow \infty, \sigma(x) tends towards 1. And as x \rightarrow -\infty, \sigma(x) tends towards 0.

Sigmoid function

This smoothness of \sigma is what creates the small changes in the output that we desire: small changes to the weights (\Delta w_j) and small changes to the bias (\Delta b) produce a small change in the output (\Delta output).

Fundamentally, how the output responds when we change these weights and biases is what distinguishes a step function from one that changes gradually. We can show this as follows:

\Delta \mbox{output} \approx \sum_j \frac{\partial \, \mbox{output}}{\partial w_j} \Delta w_j + \frac{\partial \, \mbox{output}}{\partial b} \Delta b
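As a quick numerical check of this approximation, here is a small sketch for a single sigmoid neuron. The inputs, weights, and deltas are made-up values, and the partial derivatives use the standard sigmoid identity \sigma'(z) = \sigma(z)(1 - \sigma(z)), which is not derived in this post.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def output(w, x, b):
    """Output of one sigmoid neuron with weights w, inputs x and bias b."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


# Illustrative values (assumptions, not from any real model).
x = [1.0, 2.0]
w = [0.5, -0.3]
b = 0.1
dw = [0.001, -0.002]   # small changes to the weights
db = 0.0005            # small change to the bias

out = output(w, x, b)

# For a sigmoid neuron: d(output)/dw_j = out * (1 - out) * x_j
#                       d(output)/db   = out * (1 - out)
slope = out * (1.0 - out)
predicted = sum(slope * xj * dwj for xj, dwj in zip(x, dw)) + slope * db

# The actual change, computed by re-evaluating the neuron.
w_new = [wj + dwj for wj, dwj in zip(w, dw)]
actual = output(w_new, x, b + db) - out

print(predicted, actual)
```

The two printed numbers should agree to many decimal places, which is exactly what the approximation promises for small \Delta w_j and \Delta b.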


One thing to be aware of is that the sigmoid function suffers from the vanishing gradient problem: learning becomes very slow after a certain point, because the neurons in earlier layers learn much more slowly than the neurons in later layers. Because of this, a sigmoid is generally avoided.
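To get a feel for why the gradient vanishes, here is a rough sketch. The derivative of the sigmoid is never larger than 0.25, and backprop multiplies one such factor per layer. The code below deliberately ignores the weights (a simplifying assumption), so it shows only the best-case contribution from the activation itself.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_derivative(x):
    """Slope of the sigmoid: s * (1 - s), where s = sigmoid(x)."""
    s = sigmoid(x)
    return s * (1.0 - s)


# The slope peaks at x = 0, where it equals 0.25.
peak = sigmoid_derivative(0.0)

# Best-case product of sigmoid slopes across n layers (weights ignored).
for n_layers in (1, 5, 10, 20):
    print(n_layers, peak ** n_layers)
```

Even in this best case, the factor drops below one in a million after ten layers, and it is smaller still whenever the inputs sit away from zero.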

2. Tanh (hyperbolic tangent function)

Tanh is a variant of the sigmoid function and is quite similar to it: it is a rescaled version that ranges from -1 to 1 instead of 0 to 1. As a result, its optimization is easier and it is preferred over the sigmoid function. The formula for tanh is:

\tanh(x) \equiv \frac{e^x-e^{-x}}{e^x+e^{-x}}

Using this, we can show that:

\sigma(x) = \frac{1 + \tanh(x/2)}{2}.
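This relationship is easy to verify numerically. The sketch below compares both sides at a handful of arbitrary sample points, using `math.tanh` from the standard library.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


# Arbitrary sample points chosen for illustration.
points = (-3.0, -0.5, 0.0, 1.0, 4.0)

max_gap = 0.0
for x in points:
    lhs = sigmoid(x)
    rhs = (1.0 + math.tanh(x / 2.0)) / 2.0
    max_gap = max(max_gap, abs(lhs - rhs))
    print(x, lhs, rhs)

# Any difference should be down at the level of floating-point rounding.
print("largest difference:", max_gap)
```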

Sigmoid vs Tanh

Tanh also suffers from the vanishing gradient problem. Both tanh and sigmoid are used in feedforward neural networks (FNNs), i.e., networks in which information always moves forward from input to output and there aren’t any feedback loops.


3. Rectified Linear Unit (ReLU)

A rectified linear unit (ReLU) is the most popular activation function in use these days.

\sigma(x) = \begin{cases} x & x > 0\\ 0 & x \leq 0 \end{cases}


ReLUs are quite popular for a couple of reasons. One, from a computational perspective, they are more efficient and simpler to execute, as there aren’t any exponential operations to perform. And two, they don’t suffer from the vanishing gradient problem in the way sigmoid and tanh do, since the gradient is a constant 1 for positive inputs.
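In code, a ReLU and its slope are about as simple as an activation function gets. This is a minimal sketch; the function names are my own, and treating the slope at exactly x = 0 as 0 is a common convention I am assuming here.

```python
def relu(x: float) -> float:
    """Rectified linear unit: x for positive inputs, 0 otherwise."""
    return x if x > 0 else 0.0


def relu_derivative(x: float) -> float:
    """Slope of the ReLU: 1 for positive inputs, 0 otherwise.

    The slope at exactly x = 0 is undefined; using 0 there is an assumption.
    """
    return 1.0 if x > 0 else 0.0


for value in (-2.0, 0.0, 3.5):
    print(value, relu(value), relu_derivative(value))
```

There is only a comparison and no exponential, and the slope stays at 1 for every positive input rather than shrinking as it does for a sigmoid.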


The one limitation of ReLUs is that their output isn’t in the probability space (i.e., it can be >1), so they can’t be used in the output layer when a probability is needed.

As a result, when we use ReLUs, we have to use a softmax function in the output layer. The outputs of a softmax function sum to 1, so we can treat the output as a probability distribution.

\sum_j a^L_j = \frac{\sum_j e^{z^L_j}}{\sum_k e^{z^L_k}} = 1.
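Here is a small softmax sketch that shows the outputs summing to 1. Subtracting the maximum before exponentiating is a common numerical-stability trick that I have added; it does not change the result. The input scores are made up for illustration.

```python
import math


def softmax(z):
    """Softmax over a list of scores: e^(z_j) / sum_k e^(z_k)."""
    m = max(z)                            # shift for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


# Illustrative output-layer scores (assumed values).
probs = softmax([2.0, 1.0, 0.1])
print(probs)
print(sum(probs))   # should be 1, up to floating-point rounding
```

The largest score gets the largest probability, and every value lies strictly between 0 and 1.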


Another issue that can affect ReLUs is the dead neuron problem (also called a dying ReLU). This can happen when the input to a node is negative, for example when some features in the training dataset have a negative value. When the ReLU is applied, those negative values become zero (as per the definition). If this happens at a large enough scale, the gradient will always be zero, and that node is never adjusted again (its bias and weights never get changed), essentially making it dead! The solution? Use a variation of the ReLU called a Leaky ReLU.

4. Leaky ReLU

A Leaky ReLU allows a small slope \alpha on the negative side, i.e., a negative value isn’t changed to zero but is instead scaled by something like 0.01. You can probably see the ‘leak’ in the image below. This ‘leak’ extends the range to negative values and helps avoid the dying ReLU issue.

ReLU vs. Leaky ReLU
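A Leaky ReLU is a one-line change from a plain ReLU. In this sketch, the default \alpha = 0.01 is the example value mentioned above, and the function names are my own.

```python
def leaky_relu(x: float, alpha: float = 0.01) -> float:
    """Leaky ReLU: x for positive inputs, alpha * x otherwise."""
    return x if x > 0 else alpha * x


def leaky_relu_derivative(x: float, alpha: float = 0.01) -> float:
    """Slope of the Leaky ReLU: 1 for positive inputs, alpha otherwise."""
    return 1.0 if x > 0 else alpha


for value in (-5.0, -1.0, 2.0):
    print(value, leaky_relu(value), leaky_relu_derivative(value))
```

Because the slope on the negative side is \alpha rather than 0, a node that receives negative inputs still gets a small gradient and can keep adjusting.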

5. Exponential Linear Unit (ELU)

Sometimes a ReLU isn’t fast enough: a ReLU’s mean output isn’t zero, and over time this positive mean can add a bias for the next layer in the neural network; all this bias adds up and can slow the learning.

An Exponential Linear Unit (ELU) can address this by using an exponential function, which ensures that the mean activation is closer to zero. What this means is that for a positive value an ELU acts like a ReLU, and for a negative value its output is bounded below by -1 (for \alpha = 1), which puts the mean activation closer to zero.

\sigma(x) = \begin{cases} x & x \geqslant 0\\ \alpha (e^x - 1) & x < 0\end{cases}
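The sketch below implements the ELU definition above and compares its mean output with a ReLU's over a small, symmetric set of sample inputs. The sample values are arbitrary and only meant to illustrate the point about the mean activation.

```python
import math


def elu(x: float, alpha: float = 1.0) -> float:
    """ELU: x for x >= 0, alpha * (e^x - 1) for x < 0."""
    return x if x >= 0 else alpha * (math.exp(x) - 1.0)


def relu(x: float) -> float:
    return x if x > 0 else 0.0


# Arbitrary inputs, symmetric around zero, for illustration only.
samples = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]

mean_relu = sum(relu(v) for v in samples) / len(samples)
mean_elu = sum(elu(v) for v in samples) / len(samples)

print("mean ReLU output:", mean_relu)
print("mean ELU output: ", mean_elu)
print("ELU at a very negative input:", elu(-100.0))  # approaches -alpha
```

On these inputs the ELU's mean is noticeably closer to zero than the ReLU's, because the negative inputs contribute negative outputs instead of being clamped to zero.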


When learning, the derivative (slope) of the activation function is what is fed back (backprop), so for this to be efficient, both the function and its derivative need to have a low computation cost.


6. Maxout

And finally, there is another variant, called the Maxout function, that generalizes both the ReLU and the Leaky ReLU.
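As a rough sketch of the idea: a Maxout unit takes the maximum over several linear pieces, max_i(w_i \cdot x + b_i). In a real network the weights and biases of each piece are learned; here I hard-code two pieces (an assumption for illustration) to show how a ReLU and a Leaky ReLU fall out as special cases.

```python
def maxout(x, pieces):
    """Maxout over k linear pieces: max_i(w_i . x + b_i).

    x      -- list of input values
    pieces -- list of (weights, bias) tuples, one per linear piece
    """
    return max(
        sum(w * v for w, v in zip(weights, x)) + bias
        for weights, bias in pieces
    )


# Hand-picked pieces for a single input (illustrative, not learned).
relu_pieces = [([1.0], 0.0), ([0.0], 0.0)]     # max(x, 0)      -> ReLU
leaky_pieces = [([1.0], 0.0), ([0.01], 0.0)]   # max(x, 0.01x)  -> Leaky ReLU

for value in (-2.0, 3.0):
    print(value, maxout([value], relu_pieces), maxout([value], leaky_pieces))
```

The trade-off is that each Maxout unit carries k sets of weights instead of one, so the number of parameters grows accordingly.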

So, how do I pick one?

Choosing the ‘right’ activation function would of course depend on the data and the problem at hand. My suggestion is to default to a ReLU as a starting step, and remember that ReLUs are applied to hidden layers only. Use a simple dataset and see how that performs. If you see dead neurons, then use a Leaky ReLU or Maxout instead. It doesn’t make much sense to use sigmoid or tanh for deep learning models these days, but they are still useful for classifiers.

In summary, activation functions are a key aspect that fundamentally influences a neural network’s behavior and output. Having an appreciation and understanding of some of these functions is key to any successful ML implementation.