TensorFlow Serving for Machine Learning Models
Inception-v3 framework pre-installed and configured.
Inception-v3 was developed for classifying entire images into 1,000 classes (such as llama, zebra, aircraft carrier, and electric fan) as part of the ImageNet Large Scale Visual Recognition Challenge.
This enables image classification out of the box, while also allowing users to add or develop new machine learning frameworks.
TensorFlow Serving is an open source system for serving a wide variety of machine learning models.
Developed by the Google Brain team and released as open source in 2016, the system uses a standard architecture and set of APIs for new and existing machine learning algorithms and frameworks.
- Admin Package included: OpenVPN, SSH, SFTP, OS root access
TensorFlow Serving is a flexible, high-performance serving system for machine learning models, designed for production environments.
TensorFlow Serving makes it easy to deploy new algorithms and experiments, while keeping the same server architecture and APIs.
TensorFlow Serving provides out-of-the-box integration with TensorFlow models, but can be easily extended to serve other types of models and data.
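For instance, a served model can be queried over TensorFlow Serving's REST API (available on port 8501 when the server is started with --rest_api_port=8501). The sketch below uses only the Python standard library; the host, port, and model name "inception" are assumptions for illustration, not part of this machine's fixed configuration:

```python
import json
import urllib.request

SERVER = "http://localhost:8501"   # assumed host/port for the REST API
MODEL = "inception"                # assumed model name

def build_predict_request(instances, model=MODEL, version=None):
    """Build the (url, body) pair for the REST predict endpoint:
    /v1/models/<model>[/versions/<id>]:predict"""
    path = f"/v1/models/{model}"
    if version is not None:
        path += f"/versions/{version}"  # pin a specific version id
    url = SERVER + path + ":predict"
    body = json.dumps({"instances": instances}).encode("utf-8")
    return url, body

def predict(instances, model=MODEL, version=None):
    """POST the request and return the server's predictions."""
    url, body = build_predict_request(instances, model, version)
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["predictions"]
```

With a model actually loaded, `predict([[...image tensor...]])` would return the class scores; without a running server, only the request-building half is exercisable.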
To understand the architecture of TensorFlow Serving, you need to understand the following key concepts:
Servables are the central abstraction in TensorFlow Serving.
Servables are the underlying objects that clients use to perform computation (for example, a lookup or inference).
The size and granularity of a Servable is flexible.
A single Servable might include anything from a single shard of a lookup table to a single model to a tuple of inference models.
Servables can be of any type and interface, enabling flexibility and future improvements such as:
- streaming results
- experimental APIs
- asynchronous modes of operation
Servables do not manage their own lifecycle.
Typical servables include the following:
- a TensorFlow SavedModelBundle (tensorflow::Session)
- a lookup table for embedding or vocabulary lookups
TensorFlow Serving can handle one or more versions of a servable over the lifetime of a single server instance.
This enables fresh algorithm configurations, weights, and other data to be loaded over time.
Versions enable more than one version of a servable to be loaded concurrently, supporting gradual rollout and experimentation.
At serving time, clients may request either the latest version or a specific version id for a particular model.
A servable stream is the sequence of versions of a servable, sorted by increasing version numbers.
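As an illustrative sketch of that idea (not TensorFlow Serving's actual implementation), resolving a client's request against a servable stream might look like:

```python
# Sketch: a servable stream modeled as a sorted list of version ids.
def resolve(stream, requested=None):
    """Return the version id to serve: the latest loaded version when
    none is requested, otherwise the specific pinned version."""
    stream = sorted(stream)        # ascending version order
    if requested is None:
        return stream[-1]          # latest = highest version number
    if requested in stream:
        return requested
    raise KeyError(f"version {requested} is not in the stream")
```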
TensorFlow Serving represents a model as one or more servables.
A machine-learned model may include one or more algorithms (including learned weights) and lookup or embedding tables.
You can represent a composite model as either of the following:
- multiple independent servables
- single composite servable
A servable may also correspond to a fraction of a model.
For example, a large lookup table could be sharded across many TensorFlow Serving instances.
Loaders manage a servable's life cycle.
The Loader API enables common infrastructure that is independent of the specific learning algorithms, data, or product use cases involved.
Specifically, Loaders standardize the APIs for loading and unloading a servable.
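The real Loader API is a C++ interface; the following Python sketch only illustrates the shape of that contract, a standardized load/unload lifecycle around an otherwise opaque servable. The DictLookupLoader class and its dict-backed lookup table are invented for illustration:

```python
from abc import ABC, abstractmethod

class Loader(ABC):
    """Illustrative sketch of the Loader contract: it standardizes how a
    servable is loaded and unloaded, regardless of what it actually is."""

    @abstractmethod
    def load(self):
        """Acquire resources and make the servable ready to serve."""

    @abstractmethod
    def unload(self):
        """Release the servable's resources."""

    @abstractmethod
    def servable(self):
        """Return the underlying servable object."""

class DictLookupLoader(Loader):
    """Toy loader for a lookup-table servable backed by a dict."""

    def __init__(self, entries):
        self._entries = entries  # in reality, e.g. a file path to read
        self._table = None

    def load(self):
        self._table = dict(self._entries)

    def unload(self):
        self._table = None

    def servable(self):
        return self._table
```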
Sources are plugin modules that find and provide servables.
Each Source provides zero or more servable streams.
For each servable stream, a Source supplies one Loader instance for each version it makes available to be loaded.
(A Source is actually chained together with zero or more SourceAdapters, and the last item in the chain emits the Loaders.)
TensorFlow Serving’s interface for Sources can discover servables from arbitrary storage systems.
TensorFlow Serving includes common reference Source implementations.
For example, Sources may access mechanisms such as RPC and can poll a file system.
Sources can maintain state that is shared across multiple servables or versions.
This is useful for servables that use delta (diff) updates between versions.
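A polling file-system Source can be sketched as follows. The convention of one numeric subdirectory per version (e.g. /models/inception/1, /models/inception/2) mirrors how TensorFlow Serving's standard file-system source lays out models, but the function itself is illustrative; the listdir/isdir parameters are injectable only to keep the sketch easy to test:

```python
import os

def poll_versions(base_dir, listdir=os.listdir, isdir=os.path.isdir):
    """Sketch of a polling file-system Source: every numeric
    subdirectory of base_dir is treated as one version in the
    servable stream."""
    versions = [
        int(name)
        for name in listdir(base_dir)
        if name.isdigit() and isdir(os.path.join(base_dir, name))
    ]
    return sorted(versions)  # the servable stream, ascending
```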
Aspired versions represent the set of servable versions that should be loaded and ready.
Sources communicate this set of servable versions for a single servable stream at a time.
When a Source gives a new list of aspired versions to the Manager, it supersedes the previous list for that servable stream.
The Manager unloads any previously loaded versions that no longer appear in the list.
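This reconciliation step can be sketched as a simple set difference (a minimal illustration, not the actual Manager logic, which also applies resource checks and version policies):

```python
def reconcile(loaded, aspired):
    """Given the versions currently loaded and the new aspired list for
    one servable stream, return (to_load, to_unload): aspired versions
    not yet loaded, and loaded versions no longer aspired."""
    to_load = sorted(set(aspired) - set(loaded))
    to_unload = sorted(set(loaded) - set(aspired))
    return to_load, to_unload
```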
Managers handle the full lifecycle of Servables, including:
- loading Servables
- serving Servables
- unloading Servables
Managers listen to Sources and track all versions.
The Manager tries to fulfill Sources' requests, but may refuse to load an aspired version if, say, required resources aren't available.
Managers may also postpone an "unload".
For example, a Manager may wait to unload until a newer version finishes loading, based on a policy to guarantee that at least one version is loaded at all times.
TensorFlow Serving Managers provide a simple, narrow interface -- GetServableHandle() -- for clients to access loaded servable instances.
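The actual interface is C++; this toy Python Manager is invented only to show how narrow the client-facing surface is:

```python
class Manager:
    """Toy sketch of the client-facing side of a Manager: clients only
    ask for a handle to a loaded servable, by name and optional version."""

    def __init__(self):
        # (servable_name, version) -> loaded servable object
        self._loaded = {}

    def get_servable_handle(self, name, version=None):
        """Return the servable for a specific version, or the latest
        loaded version when no version is requested."""
        versions = [v for (n, v) in self._loaded if n == name]
        if not versions:
            raise KeyError(f"no loaded versions of servable {name!r}")
        chosen = max(versions) if version is None else version
        return self._loaded[(name, chosen)]
```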
TensorFlow Serving Core manages (via standard TensorFlow Serving APIs) the following aspects of servables:
- lifecycle
- metrics
TensorFlow Serving Core treats servables and loaders as opaque objects.
Your Virtual Machine Specs
Your TensorFlow Serving will be running on an isolated and secure Virtual Machine with the following configuration [1]:
- CPU: 1 vCPU on 7th Generation Intel® Core™ i5-7260U Physical Processor(s)
- Base Frequency: 2.20 GHz
- Max Turbo Frequency: 3.40 GHz
- Memory: 1024 MB on 32 GB DDR4-2133 Physical Memory Chip(s)
- DDR4-2133 1.2V SO-DIMM
- Max Memory Bandwidth: 34.1 GB/s
- Disk Size: 16.06 GB on 1 TB M.2 SSD Physical Storage Chip(s)
- M.2 Solid-State Drive (SSD)
- Sequential Read: 530 MB/s
- Sequential Write: 510 MB/s
- Random Read IOPS: 92 K
- Random Write IOPS: 83 K
Note [1]: Virtual Machine resources are already optimized for performance. Under heavy usage or exceptional circumstances, more resources can easily be acquired via our Add-ons section.