Machine learning is a complex discipline, but implementing machine learning models is far less daunting than it used to be. Machine learning frameworks like Google’s TensorFlow ease the process of acquiring data, training models, serving predictions, and refining future results.
Created by the Google Brain team and initially released to the public in 2015, TensorFlow is an open source library for numerical computation and large-scale machine learning. TensorFlow bundles together a slew of machine learning and deep learning (aka neural network) models and algorithms and makes them useful by way of common programmatic metaphors. A convenient front-end API lets developers build applications using Python or JavaScript, while the underlying platform executes those applications in high-performance C++. TensorFlow also provides libraries for many other languages, although Python tends to dominate.
TensorFlow, which competes with frameworks such as PyTorch and Apache MXNet, can train and run deep neural networks for handwritten digit classification, image recognition, word embeddings, recurrent neural networks, sequence-to-sequence models for machine translation, natural language processing, and PDE (partial differential equation)-based simulations. Best of all, TensorFlow supports production prediction at scale, with the same models used for training.
TensorFlow also has a broad library of pre-trained models available for use in your projects. Code from the TensorFlow Model Garden provides examples of best practices for training your own models.
How TensorFlow works
TensorFlow allows developers to create dataflow graphs—structures that describe how data moves through a graph, or a series of processing nodes. Each node in the graph represents a mathematical operation, and each connection or edge between nodes is a multidimensional data array, or tensor.
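To make the graph idea concrete, here is a minimal sketch that traces a small Python function into a dataflow graph and lists its nodes. The function name `affine` and the sample values are invented for illustration; `tf.function`, `get_concrete_function`, and `graph.get_operations` are the TensorFlow 2.x calls assumed here.

```python
import tensorflow as tf


@tf.function  # traces the Python function into a TensorFlow dataflow graph
def affine(x, w, b):
    # Each call below becomes a node (operation) in the graph;
    # the tensors flowing between the nodes are the edges.
    return tf.nn.relu(tf.matmul(x, w) + b)


# Illustrative tensors (multidimensional arrays) with made-up values.
x = tf.constant([[1.0, 2.0]])       # shape (1, 2)
w = tf.constant([[1.0, -1.0],
                 [0.5, 2.0]])       # shape (2, 2)
b = tf.constant([0.5, -4.0])        # shape (2,)

# Trace the function once for these inputs and inspect the resulting graph.
graph = affine.get_concrete_function(x, w, b).graph
op_types = [op.type for op in graph.get_operations()]

result = affine(x, w, b)
print(op_types)        # the list includes MatMul and Relu nodes, among others
print(result.numpy())  # [[2.5 0. ]]
```

The printed operation list is the graph's set of nodes; the values passed between them, such as the output of the matrix multiply, are the tensors.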
TensorFlow applications can be run on almost any target that’s convenient: a local machine, a cluster in the cloud, iOS and Android devices, CPUs or GPUs. If you use Google’s own cloud, you can run TensorFlow on Google’s custom Tensor Processing Unit (TPU) silicon for further acceleration. Models created by TensorFlow can be deployed on almost any device to serve predictions.
TensorFlow 2.0, released in September 2019, revamped the framework significantly based on user feedback. The result is a machine learning framework that is more performant and easier to work with, for example by using the relatively simple Keras API for model training. Distributed training is easier to run thanks to a new API, and support for TensorFlow Lite makes it possible to deploy models on a greater variety of platforms. However, code written for earlier versions of TensorFlow must be rewritten, sometimes significantly, to take maximum advantage of new TensorFlow 2.0 features.
A trained model can be used to deliver predictions as a service via a Docker container using REST or gRPC APIs. For more advanced serving scenarios, you can use Kubernetes.
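As a rough sketch of the REST side of this, the snippet below builds a prediction request in the shape TensorFlow Serving's REST API expects: a POST to `/v1/models/<name>:predict` with an `instances` list in the JSON body. The host `localhost:8501`, the model name `my_model`, and the helper names are assumptions made for the example, not values from any particular deployment.

```python
import json
import urllib.request


def build_predict_request(host, model_name, instances):
    """Build (but do not send) a TensorFlow Serving REST predict request."""
    url = f"http://{host}/v1/models/{model_name}:predict"
    body = json.dumps({"instances": instances}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(host, model_name, instances, timeout=5.0):
    """Send the request and return the "predictions" list from the response."""
    request = build_predict_request(host, model_name, instances)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["predictions"]


# Hypothetical host and model name; nothing is sent until predict() is called.
request = build_predict_request("localhost:8501", "my_model", [[1.0, 2.0, 3.0, 4.0]])
print(request.full_url)
```

Calling `predict("localhost:8501", "my_model", [[1.0, 2.0, 3.0, 4.0]])` would only succeed with a serving container actually listening at that address and a model loaded under that name.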
TensorFlow with Python
Many programmers access TensorFlow by way of the Python programming language. Python is easy to learn and work with, and it provides convenient ways to express and couple high-level abstractions. Recent TensorFlow releases support Python versions in the 3.7 through 3.11 range, depending on the release, and while TensorFlow may work on earlier versions of Python, it’s not guaranteed to do so.
Nodes and tensors in TensorFlow are Python objects, and TensorFlow applications are themselves Python applications. The actual math operations, however, are not performed in Python. The libraries of transformations that are available through TensorFlow are written as high-performance C++ binaries. Python just directs traffic between the pieces and provides the programming abstractions to hook them together.
High-level work in TensorFlow, such as creating nodes and layers and linking them together, relies on the Keras library. The Keras API is outwardly simple; you can define a basic model with three layers in fewer than 10 lines of code, and the training code for the same takes just a few more lines. But if you want to “lift the hood” and do more fine-grained work, such as writing your own training loop, you can do that.
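Here is one way such a three-layer Keras model and its training code might look. The layer sizes, the optimizer, and the synthetic data (200 random samples with 4 features and 3 classes) are all assumptions chosen to keep the sketch small, not recommendations.

```python
import numpy as np
import tensorflow as tf

# Synthetic stand-in data: 200 samples, 4 features, 3 classes (assumption).
rng = np.random.default_rng(0)
features = rng.normal(size=(200, 4)).astype("float32")
labels = rng.integers(0, 3, size=200)

# A basic model with three layers.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(4,)),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(3, activation="softmax"),
])

# The training code takes only a few more lines.
model.compile(
    optimizer="adam",
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)
history = model.fit(features, labels, epochs=2, batch_size=32, verbose=0)

# Class probabilities for the first five samples.
probabilities = model.predict(features[:5], verbose=0)
print(probabilities.shape)  # (5, 3)
```

Because the data is random, the accuracy here is meaningless; the point is only how little code the define, compile, and fit sequence requires.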
TensorFlow with JavaScript
JavaScript is also a first-class language for TensorFlow, and one of JavaScript’s massive advantages is that it runs anywhere there’s a web browser.
TensorFlow.js, as the JavaScript TensorFlow library is called, uses the WebGL API to accelerate computations by way of whatever GPUs are available in the system. It’s also possible to use a WebAssembly back end for execution. WebAssembly is faster than the regular JavaScript back end if you’re only running on a CPU, but it’s best to use GPUs whenever possible. Pre-built models help you get up and running with simple projects, giving you an idea of how things work.
TensorFlow Lite
Trained TensorFlow models can also be deployed on edge computing or mobile devices, such as iOS or Android systems. The TensorFlow Lite toolset optimizes TensorFlow models to run well on such devices, by letting you choose tradeoffs between model size and accuracy. A smaller model (for example, 12MB versus 25MB, or even 100+MB) is less accurate, but the loss is generally small, and it’s more than offset by the model’s speed and energy efficiency.
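The size side of that tradeoff can be sketched with the TensorFlow Lite converter. The example below converts the same toy model twice, once as-is and once with the converter's default optimizations, which typically quantize large weight tensors. The `TinyModel` class, its 256x256 weight matrix, and the `convert` helper are invented for illustration, and the exact byte counts will vary by TensorFlow version.

```python
import tensorflow as tf


class TinyModel(tf.Module):
    """A single dense layer with enough weights for quantization to matter."""

    def __init__(self):
        super().__init__()
        self.w = tf.Variable(tf.random.normal([256, 256], seed=0), name="w")
        self.b = tf.Variable(tf.zeros([256]), name="b")

    @tf.function(input_signature=[tf.TensorSpec([1, 256], tf.float32)])
    def __call__(self, x):
        return tf.nn.relu(tf.matmul(x, self.w) + self.b)


def convert(module, quantize):
    """Convert to a TensorFlow Lite flatbuffer, optionally optimizing for size."""
    concrete = module.__call__.get_concrete_function()
    converter = tf.lite.TFLiteConverter.from_concrete_functions([concrete], module)
    if quantize:
        # Default optimizations trade a little accuracy for a smaller model.
        converter.optimizations = [tf.lite.Optimize.DEFAULT]
    return converter.convert()


module = TinyModel()
float_model = convert(module, quantize=False)
quantized_model = convert(module, quantize=True)

# Sizes in bytes of the two serialized models.
print(len(float_model), len(quantized_model))
```

Measuring the accuracy cost requires evaluating both converted models on real validation data, which this sketch leaves out.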
Why developers use TensorFlow
TensorFlow’s biggest advantage for machine learning development is abstraction. Instead of dealing with the nitty-gritty details of implementing algorithms, or figuring out proper ways to hitch the output of one function to the input of another, you can focus on the overall application logic. TensorFlow takes care of the details behind the scenes.
TensorFlow offers additional conveniences for developers who need to debug and gain introspection into TensorFlow apps. Each graph operation can be evaluated and modified separately and transparently, instead of constructing the entire graph as a single opaque object and evaluating it all at once. This so-called “eager execution mode,” provided as an option in older versions of TensorFlow, is now standard.
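A small sketch of what eager execution looks like in practice: each operation returns a concrete value as soon as it runs, so intermediate results can be printed or inspected like ordinary Python objects. The tensors and their values are made up for the example.

```python
import tensorflow as tf

print(tf.executing_eagerly())  # True by default in TensorFlow 2.x

a = tf.constant([[1.0, 2.0], [3.0, 4.0]])
b = tf.constant([[10.0], [20.0]])

# Each operation runs immediately and returns a concrete value.
product = tf.matmul(a, b)
print(product.numpy())   # intermediate result: [[ 50.] [110.]]

# Later steps can be inspected one at a time in the same way.
scaled = product / 10.0
total = tf.reduce_sum(scaled)
print(float(total))      # 16.0
```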
The TensorBoard visualization suite lets you inspect and profile how graphs run by way of an interactive, web-based dashboard. The open source TensorBoard project, which you run locally or host yourself, replaces the discontinued TensorBoard.dev service for viewing and sharing machine learning experiments.
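As a minimal sketch of how data gets into that dashboard, the snippet below writes a few scalar values with the `tf.summary` API. The temporary log directory, the metric name `loss`, and the made-up values are assumptions for the example.

```python
import glob
import os
import tempfile

import tensorflow as tf

# Throwaway directory for the demo; a real project would use a stable path.
log_dir = tempfile.mkdtemp(prefix="tb_logs_")
writer = tf.summary.create_file_writer(log_dir)

with writer.as_default():
    for step in range(5):
        # A made-up, steadily falling "loss" curve.
        tf.summary.scalar("loss", 1.0 / (step + 1), step=step)
writer.flush()

# TensorBoard reads the event files written into the log directory.
event_files = glob.glob(os.path.join(log_dir, "events.out.tfevents.*"))
print(log_dir, len(event_files))
```

Pointing the dashboard at the printed directory, for instance with `tensorboard --logdir <log_dir>`, should then show the logged curve.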
TensorFlow also gains many advantages from the backing of an A-list commercial outfit in Google. Google has fueled the rapid pace of development behind the project and created many significant offerings that make TensorFlow easier to deploy and use. The TPU silicon for accelerated performance in Google’s cloud is just one example.
Deterministic model training with TensorFlow
A few details of TensorFlow’s implementation make it hard to obtain totally deterministic model-training results for some training jobs. Sometimes, a model trained on one system will vary slightly from a model trained on another, even when they are fed the exact same data. The reasons for this variance are slippery: one is how and where random numbers are seeded; another is related to non-deterministic behaviors when using GPUs. Recent releases in TensorFlow’s 2.x branch have an option to enable determinism across an entire workflow, which you can do with a couple of lines of code. This feature comes at a performance cost, however, and should only be used when debugging a workflow.
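The couple of lines in question might look like the sketch below, assuming a TensorFlow 2.x release recent enough to provide `tf.keras.utils.set_random_seed` and `tf.config.experimental.enable_op_determinism`. The helper names and the tiny `Dense` layer used to check the effect are invented for the example.

```python
import numpy as np
import tensorflow as tf


def make_reproducible(seed):
    # Seeds Python's random module, NumPy, and TensorFlow in one call.
    tf.keras.utils.set_random_seed(seed)
    # Asks TensorFlow ops to run deterministically (at some performance cost).
    tf.config.experimental.enable_op_determinism()


def initial_weights(seed):
    """Build a small layer after seeding and return its random initial kernel."""
    make_reproducible(seed)
    layer = tf.keras.layers.Dense(8)
    layer.build((None, 4))
    return layer.get_weights()[0]


first = initial_weights(42)
second = initial_weights(42)
print(np.array_equal(first, second))
```

This only checks that weight initialization repeats for the same seed; a full training run also depends on data ordering and hardware, which is where the op-determinism setting comes in.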
TensorFlow vs. PyTorch, CNTK, and MXNet
TensorFlow competes with a variety of other machine learning frameworks. PyTorch, CNTK, and MXNet are three major competitors that address many of the same needs. Let’s take a quick look at where each one stands out and comes up short against TensorFlow:
- PyTorch is built with Python and has many other similarities to TensorFlow: hardware-accelerated components under the hood, a highly interactive development model that allows for design-as-you-go work, and many useful components already included. PyTorch is generally a better choice for projects that need to be up and running in a short time, but TensorFlow wins out for larger projects and more complex workflows.
- CNTK, the Microsoft Cognitive Toolkit, is like TensorFlow in using a graph structure to describe dataflow, but it focuses mostly on creating deep learning neural networks. CNTK handles many neural network jobs faster, and has a broader set of APIs (Python, C++, C#, Java). But it isn’t as easy to learn or deploy as TensorFlow. Licensing is not a differentiator: CNTK is available under the MIT license and TensorFlow under the Apache 2.0 license, both of which are permissive. And CNTK isn’t as aggressively developed; the last major release was in 2019.
- Apache MXNet, adopted by Amazon as the premier deep learning framework on AWS, can scale almost linearly across multiple GPUs and machines. MXNet also supports a broad range of language APIs—Python, C++, Scala, R, JavaScript, Julia, Perl, Go—although its native APIs aren’t as pleasant to work with as TensorFlow’s. It also has a far smaller community of users and developers.