The article, Machine learning for Java developers: Algorithms for machine learning, introduced setting up a machine learning algorithm and developing a prediction function in Java. Readers learned the inner workings of a machine learning algorithm and walked through the process of developing and training a model. This article picks up where that one left off. You’ll get a quick introduction to Weka, a machine learning framework for Java. Then, you’ll see how to set up a machine learning data pipeline, with a step-by-step process for taking your machine learning model from development into production. We’ll also briefly discuss how to use Docker containers and REST to deploy a trained ML model in a Java-based production environment.
What to expect from this article
Deploying a machine learning model is not the same as developing one. These are different parts of the software development lifecycle, and often implemented by different teams. Developing a machine learning model requires understanding the underlying data and having a good grasp of mathematics and statistics. Deploying a machine learning model in production is typically a job for someone with both software engineering and operations experience.
This article is about how to make a machine learning model available in a highly scalable production environment. It is assumed that you have some development experience and a basic understanding of machine learning models and algorithms; otherwise, you may want to start by reading Machine learning for Java developers: Algorithms for machine learning.
Let’s start with a quick refresher on supervised learning, including the example application we’ll use to train, deploy, and process a machine learning model for use in production.
Supervised machine learning: A refresher
A simple, supervised machine learning model will illustrate the ML deployment process. The model shown in Figure 1 can be used to predict the expected sale price of a house.
Recall that a machine learning model is a function with internal, learnable parameters that map inputs to outputs. In the above diagram, a linear regression function, hθ(x), is used to predict the sale price for a house based on a variety of features. The x variables of the function represent the input data. The θ (theta) variables represent the internal, learnable model parameters.
To predict the sale price of a house, you must first create an input data array of x variables. This array contains features such as the size of the lot or the number of rooms in a house. This array is called the feature vector.
Because most machine learning functions require a numerical representation of features, you will likely have to perform some data transformations in order to build a feature vector. For instance, a feature specifying the location of the garage could include labels such as “attached to home” or “built-in,” which have to be mapped to numerical values. When you execute the house-price prediction, the machine learning function will be applied with this input feature vector as well as the internal, trained model parameters. The function’s output is the estimated house price. This output is called a label.
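A minimal sketch of such a transformation, using hypothetical feature names and label-to-number mappings, might look like this:

// Minimal sketch (hypothetical names): mapping a categorical feature to a
// number and assembling a numeric feature vector for the model.
import java.util.Arrays;
import java.util.Map;

public class FeatureVectorExample {
    // hypothetical mapping for the "garage location" feature
    private static final Map<String, Double> GARAGE_LOCATION = Map.of(
            "attached to home", 1.0,
            "built-in", 2.0,
            "detached", 3.0);

    public static double[] toFeatureVector(double lotSize, int rooms, String garageLocation) {
        // unknown labels fall back to a default value of -1
        double garage = GARAGE_LOCATION.getOrDefault(garageLocation, -1.0);
        return new double[] { lotSize, rooms, garage };
    }

    public static void main(String[] args) {
        double[] x = toFeatureVector(8450.0, 7, "attached to home");
        System.out.println(Arrays.toString(x));   // [8450.0, 7.0, 1.0]
    }
}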
Training the model
Internal, learnable model parameters (θ) are the part of the model that is learned from training data. The learnable parameters will be set during the training process. A supervised machine learning model like the one shown below has to be trained in order to make useful predictions.
Typically, the training process starts with an untrained model where all the learnable parameters are set with an initial value such as zero. The model consumes data about various house features along with real house prices. Gradually, it identifies correlations between house features and house prices, as well as the weight of these relationships. The model adjusts its internal, learnable model parameters and uses them to make predictions.
After the training process, the model will be able to estimate the sale price of a house by assessing its features.
Machine learning algorithms in Java code
The HousePriceModel provides two methods. One of them implements the learning algorithm to train (or fit) the model. The other method is used for predictions.
The fit() method
The fit() method is used to train the model. It consumes house features as well as house-sale prices as input parameters but returns nothing. This method requires the correct “answer” to be able to adjust its internal model parameters. Using housing listings paired with sale prices, the learning algorithm looks for patterns in the training data. From these, it produces model parameters that generalize from those patterns. As the input data becomes more accurate, the model’s internal parameters are adjusted.
Listing 1. The fit() method is used to train a machine learning model
// load training data
// ...
// e.g. [{MSSubClass=60.0, LotFrontage=65.0, ...}, {MSSubClass=20.0, ...}]
List<Map<String, Double>> houses = ...;
// e.g. [208500.0, 181500.0, 223500.0, 140000.0, 250000.0, ...]
List<Double> prices = ...;
// create and train the model
var model = new HousePriceModel();
model.fit(houses, prices);
Note that the house features are typed as double in the code. This is because the machine learning algorithm used to implement the fit() method requires numbers as input. All house features must be represented numerically so that they can be used as x parameters in the linear regression formula, as shown here:
hθ(x) = θ0 * x0 + ... + θn * xn
The trained house price prediction model could look like what you see below:
price = -490130.8527 * 1 + -241.0244 * MSSubClass + -143.716 * LotFrontage + … * …
Here, the input house features such as MSSubClass or LotFrontage are represented as x variables. The learnable model parameters (θ) are set with values like -490130.8527 or -241.0244, which were learned during the training process.
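To make the formula above concrete, the following minimal sketch evaluates hθ(x) as a weighted sum in plain Java. The theta values are the illustrative ones above and only cover the first few features, so the result is not a realistic price:

// Minimal sketch: evaluating the linear hypothesis h(x) = θ0*x0 + ... + θn*xn.
// The theta values below are illustrative and incomplete, not trained parameters.
public class LinearHypothesis {

    public static double predict(double[] theta, double[] x) {
        double result = 0.0;
        for (int i = 0; i < theta.length; i++) {
            result += theta[i] * x[i];
        }
        return result;
    }

    public static void main(String[] args) {
        double[] theta = { -490130.8527, -241.0244, -143.716 };   // θ0, θ for MSSubClass, θ for LotFrontage
        double[] x     = { 1.0, 60.0, 65.0 };                     // x0 = 1, MSSubClass, LotFrontage
        // only a partial sum: the real model has many more feature terms
        System.out.println(predict(theta, x));
    }
}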
This example uses a simple machine learning algorithm, which requires just a few model parameters. A more complex algorithm, such as for a deep neural network, could require millions of model parameters; that is one of the main reasons why the process of training such algorithms requires high computation power.
The predict() method
Once you have finished training the model, you can use the predict() method to determine the estimated sale price of a house. This method consumes data about house features and produces an estimated sale price. In practice, an agent of a real estate company could enter features such as the size of a lot (lot-area), the number of rooms, or the overall house quality in order to receive an estimated sale price for a given house.
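A call could look like the following sketch; the exact predict() signature is an assumption, modeled on the batch-style API used by the REST service later in this article:

// continuing Listing 1 (hypothetical signature: the model is assumed to accept
// a batch of feature maps and to return one estimate per record)
List<Map<String, Double>> candidates = List.of(
        Map.of("MSSubClass", 60.0, "LotFrontage", 65.0, "LotArea", 8450.0));
List<Double> estimatedPrices = model.predict(candidates);
System.out.println("estimated sale price: " + estimatedPrices.get(0));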
Transforming non-numeric values
You will often be faced with datasets that contain non-numeric values. For instance, the Ames Housing dataset used for the Kaggle House Prices competition includes both numeric and textual listings of house features:
To make things more complicated, the Kaggle dataset also includes empty values (marked NA), which cannot be processed by the linear regression algorithm shown in Listing 1.
Real-world data records are often incomplete, inconsistent, lacking in desired behaviors or trends, and may contain errors. This typically occurs in cases where the input data has been joined using different sources. Input data must be converted into a clean data set before being fed into a model.
To improve the data, you would need to replace the missing (NA) numeric LotFrontage value. You would also need to replace textual values, such as the MSZoning labels “RL” or “RM,” with numeric values. These transformations are necessary to convert the raw data into a syntactically correct format that can be processed by your model.
Once you’ve converted your data to a generally readable format, you may still need to make additional changes to improve the quality of input data. For instance, you might remove values not following the general trend of the data, or place infrequently occurring categories into a single umbrella category.
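The following sketch illustrates two such clean-up steps under simplified assumptions: replacing missing LotFrontage values with the median of the known values, and folding rare MSZoning categories into a single umbrella category. The helper names are hypothetical:

// Hypothetical clean-up helpers: median imputation and umbrella categories.
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class DataCleaning {

    // replace a missing LotFrontage value (null or non-numeric) with the median
    // of the known values; assumes at least one record has a numeric value
    static void imputeLotFrontage(List<Map<String, Object>> houses) {
        double[] known = houses.stream()
                               .map(h -> h.get("LotFrontage"))
                               .filter(v -> v instanceof Double)
                               .mapToDouble(v -> (Double) v)
                               .sorted()
                               .toArray();
        double median = known[known.length / 2];
        houses.forEach(h -> {
            if (!(h.get("LotFrontage") instanceof Double)) {
                h.put("LotFrontage", median);
            }
        });
    }

    // fold MSZoning categories that occur fewer than minCount times into "other"
    static void groupRareZoning(List<Map<String, Object>> houses, long minCount) {
        Map<Object, Long> counts = houses.stream()
                                         .filter(h -> h.get("MSZoning") instanceof String)
                                         .collect(Collectors.groupingBy(h -> h.get("MSZoning"),
                                                                        Collectors.counting()));
        houses.forEach(h -> {
            if (counts.getOrDefault(h.get("MSZoning"), 0L) < minCount) {
                h.put("MSZoning", "other");
            }
        });
    }
}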
Java-based machine learning with Weka
As you’ve seen, developing and testing a target function requires well-tuned configuration parameters, such as the proper learning rate or iteration count. The example code you’ve seen so far reflects a very small set of the possible configuration parameters, and the examples were simplified to keep the code readable. In practice, you will likely rely on machine learning frameworks, libraries, and tools.
Most frameworks or libraries implement an extensive collection of machine learning algorithms. Additionally, they provide convenient high-level APIs to train, validate, and process data models. Weka is one of the most popular frameworks for the JVM.
Weka provides a Java library for programmatic usage, as well as a graphical workbench to train and validate data models. In the code below, the Weka library is used to create a training data set, which includes features and a label. The setClassIndex() method is used to mark the label column. In Weka, the label is defined as a class:
// define the feature and label attributes
ArrayList<Attribute> attributes = new ArrayList<>();
Attribute sizeAttribute = new Attribute("sizeFeature");
attributes.add(sizeAttribute);
Attribute squaredSizeAttribute = new Attribute("squaredSizeFeature");
attributes.add(squaredSizeAttribute);
Attribute priceAttribute = new Attribute("priceLabel");
attributes.add(priceAttribute);
// create and fill the features list with 5000 examples
Instances trainingDataset = new Instances("trainData", attributes, 5000);
trainingDataset.setClassIndex(trainingDataset.numAttributes() - 1);
Instance instance = new DenseInstance(3);
instance.setValue(sizeAttribute, 90.0);
instance.setValue(squaredSizeAttribute, Math.pow(90.0, 2));
instance.setValue(priceAttribute, 249.0);
trainingDataset.add(instance);
instance = new DenseInstance(3);
instance.setValue(sizeAttribute, 101.0);
...
The data set, or Instances object, can also be stored and loaded as a file. Weka uses ARFF (Attribute-Relation File Format), which is supported by the graphical Weka workbench. This data set is used to train the target function, known as a classifier in Weka.
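If you want to persist the data set programmatically, Weka's converter classes can write and read ARFF files; a short sketch (checked exceptions omitted):

// persisting the data set as an ARFF file
// imports: weka.core.converters.ArffSaver, weka.core.converters.ConverterUtils.DataSource
ArffSaver saver = new ArffSaver();
saver.setInstances(trainingDataset);
saver.setFile(new File("houses.arff"));
saver.writeBatch();

// loading it again later, e.g. in another JVM or in the graphical workbench
Instances loaded = DataSource.read("houses.arff");
loaded.setClassIndex(loaded.numAttributes() - 1);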
Recall that in order to train a target function, you have to first choose the machine learning algorithm. The code below creates an instance of the LinearRegression classifier. This classifier is trained by calling the buildClassifier() method, which tunes the theta parameters based on the training data to find the best-fitting model. Using Weka, you do not have to worry about setting a learning rate or iteration count. Weka also does the feature scaling internally:
Classifier targetFunction = new LinearRegression();
targetFunction.buildClassifier(trainingDataset);
Once it’s established, the target function can be used to predict the price of a house, as shown here:
Instances unlabeledInstances = new Instances("predictionset", attributes, 1);
unlabeledInstances.setClassIndex(unlabeledInstances.numAttributes() - 1);
Instance unlabeled = new DenseInstance(3);
unlabeled.setValue(sizeAttribute, 1330.0);
unlabeled.setValue(squaredSizeAttribute, Math.pow(1330.0, 2));
unlabeledInstances.add(unlabeled);
double prediction = targetFunction.classifyInstance(unlabeledInstances.get(0));
Weka provides an Evaluation class to validate the trained classifier or model. In the code below, a dedicated validation data set is used to avoid biased results. Measures such as the cost or error rate will be printed to the console. Typically, evaluation results are used to compare models that have been trained using different machine learning algorithms, or variants of the same algorithm:
Evaluation evaluation = new Evaluation(trainingDataset);
evaluation.evaluateModel(targetFunction, validationDataset);
System.out.println(evaluation.toSummaryString("Results", false));
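If no dedicated validation data set is available, the Evaluation class also supports k-fold cross-validation, which trains and evaluates the classifier once per fold:

// 10-fold cross-validation on the training data (the untrained classifier is
// copied and trained internally for each fold)
Evaluation crossValidation = new Evaluation(trainingDataset);
crossValidation.crossValidateModel(new LinearRegression(), trainingDataset, 10, new java.util.Random(1));
System.out.println(crossValidation.toSummaryString("Cross-validation results", false));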
The examples above use linear regression, which predicts a continuous, numeric output, such as a house price, based on the input values. To predict binary yes/no classes instead, you could use a machine learning algorithm such as a decision tree, a neural network, or logistic regression:
// using logistic regression
Classifier targetFunction = new Logistic();
targetFunction.buildClassifier(trainingSet);
You might use one of these algorithms to predict whether an email was spam or ham, or to predict whether a house for sale could be a top-seller or not. If you wanted to train your algorithm to predict whether a house was likely to sell quickly, you would need to label your example records with a new classifying label such as topseller:
// using the topseller label attribute instead of the price label attribute
ArrayList<String> classVal = new ArrayList<>();
classVal.add("true");
classVal.add("false");
Attribute topsellerAttribute = new Attribute("topsellerLabel", classVal);
attributes.add(topsellerAttribute);
This training set could be used to train a new prediction classifier: topseller. Once trained, the prediction call will return the class label index, which can be used to get the predicted value:
int idx = (int) targetFunction.classifyInstance(unlabeledInstances.get(0));
String prediction = classVal.get(idx);
Building a machine learning data pipeline
Often, the data preparation or preprocessing steps of training a machine learning model are arranged as a pipeline. For instance, the simplified house prediction pipeline in Figure 6 arranges a set of preprocessing transformer components with a final house prediction model.
The transformer components clean the raw data and transform it into a format the model is able to consume. The data becomes more suitable for the model after each stage in the transformation.
The pipeline pattern allows you to organize your transformation code so that each transformer component has a single responsibility. For instance, the CategoryToNumberTransformer class below replaces all textual feature values with numeric ones. Because this transformer implementation does not handle null values, the transformer has to be processed after applying an AddMissingValuesTransformer. Internally, the CategoryToNumberTransformer holds a map using textual feature values as the key, and unique, generated numbers as values. The mapping of the MSZoning feature might look as follows:
{FV=1, RH=2, RM=3, C=5, …, RL=8, «default»=-1}
When you call the transform() method, textual values will be detected and transformed into numbers using the mapping collection, as shown in Listing 2.
Listing 2. Replace textual feature values with numeric ones
public class CategoryToNumberTransformer implements Transformer<Object, Double, Double> {
    private final CategoryToNumberResolver categoryToNumber = new CategoryToNumberResolver();

    public List<Map<String, Double>> transform(List<Map<String, Object>> houses) {
        return houses.stream().map(this::transform).collect(Collectors.toList());
    }

    private Map<String, Double> transform(Map<String, Object> house) {
        return house.entrySet()
                    .stream()
                    .collect(Collectors.toMap(feature -> feature.getKey(),
                                              feature -> (feature.getValue() instanceof String)
                                                             ? categoryToNumber.map(feature)
                                                             : (Double) feature.getValue()));
    }

    public void fit(List<Map<String, Object>> houses, List<Double> prices) {
        houses.forEach(house -> house.entrySet()
                                     .stream()
                                     .filter(feature -> feature.getValue() instanceof String)
                                     .forEach(categoryToNumber::add));
    }

    private static final class CategoryToNumberResolver {
        private final Map<String, Double> categoryToNumberMapping = Maps.newHashMap();

        void add(Map.Entry<String, Object> feature) {
            // ..
        }

        Double map(Map.Entry<String, Object> feature) {
            // ..
        }
    }
}
There are two ways to create the internal category-to-number map. To do it manually, you would add all possible entries to the map during development time. To do it dynamically, as shown in Listing 2, you would scan all the available records at training time. In this example, the fit() training method dynamically builds the category-to-number map. First it extracts a set of all textual values, then it uses the value set to build a map, which includes the newly generated numbers that are associated with the unique textual values.
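The resolver's method bodies are elided in Listing 2. One possible implementation, shown below purely as an illustration and not as the author's actual code, keys the map on the textual value and assigns the next free number to each newly seen category:

// Illustrative implementation of the elided CategoryToNumberResolver methods
private static final class CategoryToNumberResolver {
    private final Map<String, Double> categoryToNumberMapping = new HashMap<>();

    void add(Map.Entry<String, Object> feature) {
        // assign the next free number to a category seen for the first time
        String category = feature.getValue().toString();
        if (!categoryToNumberMapping.containsKey(category)) {
            categoryToNumberMapping.put(category, (double) (categoryToNumberMapping.size() + 1));
        }
    }

    Double map(Map.Entry<String, Object> feature) {
        // unseen categories fall back to the default value -1
        return categoryToNumberMapping.getOrDefault(feature.getValue().toString(), -1.0);
    }
}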
Configuring the machine learning data pipeline
In most cases, preprocessing logic is specific to the model, so updating the logic of the preprocessing components requires re-training the model. For this reason, the preprocessing code and the model code are often packaged together, as shown in Listing 3. Here, a generic Pipeline class is used to arrange the transformers together with a final house prediction model.
Listing 3. A generic Pipeline class
var pipeline = Pipeline.add(new DropNumericOutliners("LotArea", 10))
.add(new AddMissingValuesTransformer())
.add(new CategoryToNumberTransformer())
.add(new AddComputedFeatureTransformer())
.add(new DropUnnecessaryFeatureTransformer("YrSold", "YearRemodAdd"))
.add(new HousePriceModel());
pipeline.fit(houses, prices);
// …
Some machine learning libraries provide pipeline abstractions similar to the example above. Others provide only configurable and customizable preprocessing components.
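A heavily simplified sketch of what such a pipeline abstraction could look like is shown below; the Pipeline class used in this article is more elaborate and strongly typed, so treat this only as an outline of the idea:

// Minimal sketch of a pipeline abstraction: stages are fitted in order,
// and the records flow through the chain of stages.
import java.util.ArrayList;
import java.util.List;

interface Stage {
    void fit(List<?> records, List<Double> labels);
    List<?> transform(List<?> records);   // for the final model, this returns predictions
}

class SimplePipeline {
    private final List<Stage> stages = new ArrayList<>();

    SimplePipeline add(Stage stage) {
        stages.add(stage);
        return this;
    }

    void fit(List<?> records, List<Double> labels) {
        List<?> current = records;
        for (Stage stage : stages) {
            stage.fit(current, labels);
            current = stage.transform(current);
        }
    }

    List<?> predict(List<?> records) {
        List<?> current = records;
        for (Stage stage : stages) {
            current = stage.transform(current);
        }
        return current;
    }
}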
Training the machine learning data pipeline
Calling the pipeline fit() method trains all of the included transformers and the final model. Typically, the required raw training dataset is provided by a data acquisition component. This component collects data from a variety of sources and prepares the data for ingestion into the machine learning pipeline. For instance, the Housedata Ingestion component shown in Figure 7 encapsulates data sourcing and produces raw house and price data records, which are fed into the estimation pipeline.
Internally, the Housedata Ingestion component may access a database storing sales transactions as well as other data sources such as a database storing geographical area data. Using an ingestion component separates the machine learning pipeline from the data source, so that changes in the data source will not impact the pipeline.
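A minimal sketch of such a seam, using a hypothetical interface rather than the author's actual code, could look like this:

// Hypothetical ingestion seam: the pipeline depends only on this interface,
// not on the databases and data sources behind it.
import java.util.List;
import java.util.Map;

public interface HousedataIngestion {
    // raw house records (feature name -> raw value, possibly textual or missing)
    List<Map<String, Object>> loadHouses();

    // matching sale prices, in the same order as the house records
    List<Double> loadPrices();
}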
During the development process, different versions or variants of the pipeline may be trained and evaluated. For example, you might apply different thresholds to gradually weed outliers from the data. Working with machine learning data pipelines is a highly iterative process; it is common to test many pipeline versions or variants during development, eventually selecting the most consistently accurate pipeline for production usage.
Machine learning models in production
When you deploy the selected trained pipeline in production, you will be faced with new requirements. In order to manage production requirements such as reliability or maintainability, the packaging and deploying processes have to be reproducible. You should be able to re-package or re-deploy the pipeline with no change to its behavior, even if the training data changes. You also have to be able to test or roll back to older trained pipeline versions in case of erroneous system behavior in production.
Making the machine learning data pipeline reproducible
Ensuring that your machine learning pipeline is reproducible is easier said than done. Over time, your training dataset will change. It may increase in size as it gains more labeled data records, or it may decrease as data becomes unavailable due to external factors. Even if you use the exact same pipeline code, changes to your training dataset will produce different settings of the internal learnable pipeline parameters.
As an example, say you add a house record with a new MSZoning category, “A,” which was not in the older dataset. In this case, although the transformation code is untouched, the internal CategoryToNumberTransformer map will include an additional entry for this new, unseen category:
{FV=1, RH=2, RM=3, C=5, …, RL=8, A=9, «default»=-1}
The newly trained pipeline’s behavior differs from its previous iteration.
Use version control
To support reproducibility, all pipeline code and trained instances must be under strict version control. In keeping with a traditional software development process, the data ingestion component should be versioned, released, and uploaded into a repository along with the untrained and trained pipeline components. Typically, you would use a build system such as Maven. In this case, we could store the results of the build-and-release process (component binaries such as ingest-housedata-2.2.3.jar and pipeline-estimate-houseprice-1.0.3.jar) in a repository like JFrog Artifactory or Sonatype Nexus.
CI/CD in the machine learning data pipeline
It is important to note that machine learning data pipelines and CI/CD pipelines are not the same. A machine learning data pipeline controls the data flow to transform input data into output data or predictions. A CI/CD pipeline is used to build, integrate, deliver, and deploy software artifacts in different stages. The diagram below illustrates the difference between the two pipeline types.
If we wanted to integrate CI/CD into the machine learning data pipeline, we could build our .jar artifacts during the CI/CD development stage. We could also extend the CI/CD pipeline to trigger the training process and provide the trained, serialized pipeline, which could then be deployed into the production environment.
As shown in Listing 4, the appropriate version of the ingestion and pipeline components would be loaded from the repository to train a production-ready pipeline. In this example, the downloaded executable .jar files contain the compiled Java classes, as well as a main class. When you execute ingestion.jar, the Ames Housing dataset is loaded internally and the raw house and price record files are produced.
Listing 4. A script to train and upload a machine learning data pipeline in a CI/CD context
#!/bin/bash
# define the pipeline version to train
groupId=eu.redzoo.ml
artifactId=pipeline-estimate-houseprice
version=1.0.3
echo task 1: copying ingestion jar to local dir
ingest_app_uri="https://github.com/grro/ml_deploy/raw/master/example-repo/lib-releases/eu/redzoo/ml/ingestion-housedata/2.2.3/ingestion-housedata-2.2.3-jar-with-dependencies.jar"
curl -s -L $ingest_app_uri --output ingestion.jar
echo task 2: copying pipeline jar to local dir
pipeline_app_uri="https://github.com/grro/ml_deploy/raw/master/example-repo/lib-releases/${groupId//.//}/${artifactId//.//}/$version/${artifactId//.//}-$version-jar-with-dependencies.jar"
curl -s -L $pipeline_app_uri --output pipeline.jar
echo task 3: running ingestion.jar to produce houses.json and prices.json. Internally http://jse.amstat.org/v19n3/decock/AmesHousing.xls will be fetched
java -jar ingestion.jar train.csv houses.json prices.json
echo task 4: running pipeline.jar to create and train a pipeline consuming houses.json and prices.json
version_with_timestamp=$version-$(date +%s)
pipeline_instance=$artifactId-$version_with_timestamp.ser
java -jar pipeline.jar houses.json prices.json $pipeline_instance
echo task 5: uploading trained pipeline
echo curl -X PUT --data-binary "@$pipeline_instance" "https://github.com/grro/ml_deploy/blob/master/example-repo/model-releases/${groupId//.//}/${artifactId//.//}/$version_with_timestamp/$pipeline_instance"
Note that most shops use a platform like Gitlab CI, TravisCI, CircleCI, Jenkins, or GoCD for CI/CD. All of these tools use a custom DSL (domain-specific language) to define CI/CD tasks. To keep the code examples simple, Listing 4 uses bash scripts instead of tool-specific CI/CD task definitions. When using a CI/CD platform, you would typically embed a stripped version of the example code within the CI/CD tasks.
After performing the ingest step shown in Listing 4 (task 3), the raw dataset files are used by the executable pipeline.jar to produce a trained pipeline instance. Internally, the pipeline’s HousePricePipelineBuilder main class creates a new instance of the estimation pipeline. The newly created instance will be trained and serialized into an output file like pipeline-estimate-houseprice-1.0.3-1568611516.ser. This file contains the serialized state of the pipeline instance as a byte sequence and the names of the used Java classes.
To support reproducibility, the output filename includes the component version ID and a training timestamp. A new timestamp is generated for each training run. As a last step, the serialized trained pipeline file is uploaded into a model repository (Listing 4, task 5). The HousePricePipelineBuilder used in that script is shown in Listing 5.
Listing 5. Helper class to train a house price prediction pipeline
public class HousePricePipelineBuilder {

    public static void main(String[] args) throws IOException {
        new HousePricePipelineBuilder().train(args[0], args[1], args[2]);
    }

    public void train(String housesFilename, String pricesFilename, String instanceFilename) throws IOException {
        var houses = List.of(new ObjectMapper().readValue(new File(housesFilename), Map[].class));
        var prices = List.of(new ObjectMapper().readValue(new File(pricesFilename), Double[].class));

        var pipeline = newPipeline();
        pipeline.fit(houses, prices);
        pipeline.save(new File(instanceFilename));
    }

    public Pipeline<Object, Double> newPipeline() {
        return Pipeline.add(new DropNumericOutliners("LotArea", 10))
                       .add(new AddMissingValuesTransformer())
                       .add(new CategoryToNumberTransformer())
                       .add(new AddComputedFeatureTransformer())
                       .add(new DropUnnecessaryFeatureTransformer("YrSold", "YearRemodAdd"))
                       .add(new HousePriceModel());
    }
}
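Listing 5 calls pipeline.save(), and the REST service shown later calls Pipeline.load(). One plausible way to implement these two methods is plain Java object serialization, sketched below under the assumption that the pipeline and all of its stages implement java.io.Serializable:

// Sketch of save() and load() inside the Pipeline class, based on standard
// Java serialization (imports from java.io; an assumption, not the author's code)
public void save(File file) throws IOException {
    try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(file))) {
        out.writeObject(this);   // writes the trained stages, including learned parameters
    }
}

@SuppressWarnings("unchecked")
public static <I, O> Pipeline<I, O> load(InputStream is) throws IOException {
    try (ObjectInputStream in = new ObjectInputStream(is)) {
        return (Pipeline<I, O>) in.readObject();
    } catch (ClassNotFoundException cnfe) {
        throw new IOException(cnfe);   // a required pipeline class is missing on the classpath
    }
}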
REST and Docker in the machine learning data pipeline
In order to make your newly trained pipeline instance available to end users and other systems, you will have to make it available in a production environment. How you integrate the trained pipeline into the production environment will strongly depend on your target infrastructure, which could be a datacenter, an IoT device, a mobile device, and so on.
As one example, integrating the pipeline into a classic batch-oriented big data production environment requires providing a batch interface to train machine learning models and perform predictions. In a batch-oriented approach, you would process your data in bulk, typically reading from and writing to shared databases or filesystems, using a framework like Apache Spark.
In most cases, a pipeline can be trained offline, so a batch-oriented approach is often ideal for this purpose. For example, I used the batch-oriented approach for the HousePricePipelineBuilder, where input files are read from the filesystem. The downside of this approach is the time delay. In batch processing, data records are collected over a period of time and then processed together, all at once.
In contrast to training, processing a trained pipeline in production often requires a more real-time approach. Processing incoming data as it arrives means that predictions will be available immediately, without delay. To support real-time requirements, you could extend a big data infrastructure like Hadoop with a messaging or streaming platform like Apache Kafka. In this case, the pipeline would have to be connected to the streaming system and listen for incoming records.
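As a rough illustration, a streaming front end could use the Apache Kafka consumer API to feed incoming house records into the trained pipeline. The topic name, the JSON message format, and the predict() signature in the sketch below are assumptions:

// Sketch: feeding incoming house records from a Kafka topic into the trained
// pipeline (error handling, offset management, and result publishing omitted)
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

import java.io.FileInputStream;
import java.time.Duration;
import java.util.List;
import java.util.Map;
import java.util.Properties;

public class StreamingEstimator {

    public static void main(String[] args) throws Exception {
        var pipeline = Pipeline.load(new FileInputStream("pipeline-estimate-houseprice-1.0.3-1568611516.ser"));

        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "houseprice-estimator");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        var mapper = new ObjectMapper();
        try (var consumer = new KafkaConsumer<String, String>(props)) {
            consumer.subscribe(List.of("incoming-houses"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    // each message is assumed to carry one house record as a JSON object
                    Map<String, Object> house = mapper.readValue(record.value(), Map.class);
                    var predictions = pipeline.predict(List.of(house));
                    System.out.println("estimated sale price: " + predictions.get(0));
                }
            }
        }
    }
}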
Machine learning with REST
An alternative to streaming or messaging would be to use an RPC-based infrastructure. Instead of consuming incoming records from a stream, the pipeline listens for incoming remote calls such as HTTP requests. The machine learning pipeline is accessed via a REST interface, as shown in the example below. Here, a minimal REST service handles incoming HTTP requests and uses the trained pipeline instance to perform predictions and send the HTTP response message. The trained pipeline instance is loaded during the REST service’s initialization procedure. To be able to deserialize the pipeline, its classes have to be available in the REST service’s classpath.
Listing 6. A REST interface for the machine learning pipeline
@SpringBootApplication
@RestController
public class RestfulEstimator {
    private final Estimator estimator;

    RestfulEstimator(@Value("${filename}") String pipelineInstanceFilename) throws IOException {
        this.estimator = Pipeline.load(new ClassPathResource(pipelineInstanceFilename).getInputStream());
    }

    @RequestMapping(value = "/predictions", method = RequestMethod.POST)
    public List<Object> batchPredict(@RequestBody ArrayList<HashMap<String, Object>> records) {
        return estimator.predict(records);
    }

    public static void main(String[] args) {
        SpringApplication.run(RestfulEstimator.class, args);
    }
}
Typically, all artifacts required to run the server are packaged within a server .jar file. A server .jar file such as server-pipeline-estimate-houseprice-1.0.3-1568611516.jar could include the pipeline-estimate-houseprice-1.0.3.jar, the serialized pipeline pipeline-estimate-houseprice-1.0.3-1568611516.ser, and all required third-party libraries.
To build such an executable server .jar file, you could use a CI/CD pipeline as shown in Listing 7. The simplified bash script clones the source code of the generic REST service and adds the dependencies of the Houseprice pipeline, as well as the serialized, trained pipeline file. In this case, the Maven build tool is used to compile and package the code. Maven resolves and merges the third-party library dependencies of the generic REST server and the Houseprice pipeline during the build, making it easier to detect and avoid version conflicts between the generic REST server code and the pipeline code.
Note that the bash script below includes an additional step after providing the executable server .jar file. A Docker container image is built in task 6. The script provides an executable server .jar file as well as a Docker container image.
Listing 7. Bash script to build a RESTful machine learning data pipeline
#!/bin/bash
groupId=eu.redzoo.ml
artifactId=pipeline-estimate-houseprice
version=1.0.3
timestamp=1568611516
mkdir build
cd build
echo task 1: copying framework-rest source to local dir
git clone --quiet -b 1.0.3.4 https://github.com/grro/ml_deploy.git
cd ml_deploy/module-pipeline-rest
echo task 2: download trained pipeline to pipeline-rest/src/main/resources dir
pipeline_instance=$artifactId-$version-$timestamp".ser"
pipeline_instance_uri="https://github.com/grro/ml_deploy/raw/master/example-repo/model-releases/"${groupId//.//}/${artifactId//.//}/$version-$timestamp/$pipeline_instance
mkdir src/main/resources
curl -s -L $pipeline_instance_uri --output src/main/resources/$pipeline_instance
echo "filename: $pipeline_instance" > src/main/resources/application.yml
echo task 3: adding the pipeline artefact id to framework-rest pom.xml file
pom=$(<pom.xml)
additional_dependency="<dependency><groupId>"$groupId"</groupId><artifactId>"$artifactId"</artifactId><version>"$version"</version></dependency>"
new_pom=${pom/"<!-- PLACEHOLDER -->"/$additional_dependency}
echo $new_pom > pom.xml
echo task 4: build rest server jar including the specific pipeline artifacts
mvn -q clean install package
echo task 5: copying the newly created jar file into the root of the build dir
server_jar=server-$artifactId"-"$version"-"$timestamp.jar
cp target/pipeline-rest-1.0.3.jar ../../$server_jar
cd ../../..
echo task 6: build docker image
docker build --build-arg arg_server_jar=$server_jar -t $groupId"/"$artifactId":"$version"-"$timestamp .
rm -rf build
Machine learning with Docker containers
Although the newly created executable server .jar is a deployable and runnable artifact, devops engineers and system administrators often prefer Docker containers over executable .jars. You could view a Docker container as a sort of customized software stack, including a virtual operating system running on top of a host operating system. This allows you to package up an application with all of its required parts, system components, and configurations. In contrast to traditional virtual machine solutions, a Docker container uses the same kernel as the host system that it runs on, which reduces the overhead of virtualization.
For instance, you could create a Docker container image including a slim Debian Linux distribution, the newest OpenJDK runtime, as well as your executable server .jar. In contrast to a .jar-based deployment, Docker makes it easy to implement a customized configuration, such as specific JVM garbage collector settings, or to install custom certificates as part of your deployment unit. Instead of delivering an executable .jar with a more or less lengthy list of installation prerequisites, you would provide a self-contained Docker container without having to install anything else.
To assemble a new Docker container image, you have to define a DOCKERFILE containing a collection of Docker commands instructing Docker as to how it should build your image. In Listing 8, a new Docker image will be built based on an OpenJDK/buster base image, including the Debian Linux distribution and OpenJDK 13. With the exception of the last command, all commands are executed at Docker image build time. Essentially, the DOCKERFILE copies the server .jar file located in the build directory into the container’s file system. When the Docker container is started, the last command runs the Java-based REST service.
Listing 8. DOCKERFILE to build the machine learning data pipeline
FROM openjdk:13-jdk-slim-buster
# build time params (provided by 'docker build --build-arg arg_server_jar=server-pipeline-estimate-houseprice-1.0.3-1568611516.jar')
ARG arg_server_jar
# copy the executable server jar file into the docker container
COPY build/$arg_server_jar /etc/restserver/$arg_server_jar
# copy the build time param to a runtime param (required for runtime command CMD below)
ENV server_jar=$arg_server_jar
# default command, which will be executed on runtime by performing 'docker run'
CMD java -jar /etc/restserver/$server_jar
When you execute the image build process, Docker will read the DOCKERFILE in the local directory. In the example below, a Docker container image will be created and tagged with a unique Docker identifier: eu.redzoo.ml/pipeline-estimate-houseprice:1.0.3-1568611516. By default, the newly created Docker image is stored in your local Docker environment.
docker build --build-arg arg_server_jar=server-pipeline-estimate-houseprice-1.0.3-1568611516.jar -t eu.redzoo.ml/pipeline-estimate-houseprice:1.0.3-1568611516 .
The docker run command will be used as shown below to load the image and start the container.
docker run -p 9090:8080 eu.redzoo.ml/pipeline-estimate-houseprice:1.0.3-1568611516
In most cases, additional parameters such as -p will be set. The -p parameter makes port 8080 of the Java server inside the Docker container available to services outside of the container. In this example, port 9090 of the host system is mapped to the container’s internal Java server port 8080.
Additional parameters limit the resource consumption of the Docker container. For instance, the -m parameter limits the container’s access to memory. Typically, such resource limiting parameters will be used to implement a Bulkhead stability pattern. The Bulkhead pattern helps to protect systems against cascading errors. For instance, a buggy Java server inside the container may start to consume more and more memory and CPU power. If the consumption is limited by using Docker’s resource parameters, other containers running on the same host will not be negatively affected by running out of memory or waiting for CPU time.
Conclusion
This article has introduced a generalized process for training, deploying, and processing a machine learning model in a production environment. In practice, numerous requirements and conditions will weigh on the approach you use to put a machine learning model into production. Depending on your business requirements, machine learning models may have to be executed using a real-time solution such as a streams-based architecture, or a batch-oriented architecture that prioritizes throughput for heavy data loads. Additional factors to consider are the communication patterns, which may favor a database/filesystem-based pipeline API, a streams-based pipeline API, or a REST-based pipeline API. The whole pipeline may be packaged as a single deployment unit, or parts of the preprocessing components may be packaged as dedicated deployment units. Furthermore, the pipeline may be deployed as a self-contained Docker container, or you could use a central model repository with serving nodes that load and process models dynamically, as TensorFlow Serving does.
In contrast to traditional software development, these approaches all require that you handle an additional dimension of complexity. In traditional programming, you hardcode the behavior of the program. In a machine learning pipeline, you also write code, but the code you write will be trained and adjusted based on production data, which adapts the behavior of the program. In contrast to traditional programming, the unit of deployment is a trained frozen instance, which makes deployment and software maintenance more complex. Key to handling this additional complexity is to make things reproducible. With comprehensive version and release management, you will be able to re-train and re-deploy a pipeline instance, such that given the same raw data as input it will return the exact same output. This gives you the ability to deploy and run your machine learning pipelines in production environments in a way that is controllable, transparent, and maintainable.