Scalable Prediction Services in #RStats
Lessons learned at Socure deploying real-time prediction web services based on fraud models developed in R. We leverage AWS IaaS and Docker containerization to deploy auto-scaling, redundant deployment units with a clear separation of modeling and dev-ops concerns.
Presented at the NYC R Conference
Socure provides a real-time fraud detection service that uses offline, social media, and online sources to provide lift over traditional credit models. Our service is integrated with financial institutions and credit agencies at various points in their on-boarding workflow. These consumer sign-up and application processes, for products such as credit cards, person-to-person payment accounts and online lending, carry stringent latency SLAs. Typical transactions, which in this case are consumer authentications, need to complete within a few seconds.
A service call involves collecting third party data, extracting features and generating a prediction score. Much of this data can be collected in parallel; the prediction phase, however, cannot be parallelized in the same way, so time spent there adds directly to overall latency. A white paper describing details of the scoring process is available on our website.
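The shape of such a service call can be sketched as follows. This is a minimal illustration in Python, not Socure's implementation: the three source functions, the feature names and the scoring formula are all hypothetical stand-ins. It only shows that data collection can fan out concurrently while scoring must wait for the merged features.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for third party data sources (assumptions, not real APIs).
def fetch_email_signals(applicant):
    return {"email_age_days": 400}

def fetch_social_signals(applicant):
    return {"profile_count": 3}

def fetch_address_signals(applicant):
    return {"address_match": 1}

SOURCES = [fetch_email_signals, fetch_social_signals, fetch_address_signals]

def collect_features(applicant):
    """Call every source concurrently and merge the results into one feature dict."""
    features = {}
    with ThreadPoolExecutor(max_workers=len(SOURCES)) as pool:
        for result in pool.map(lambda fetch: fetch(applicant), SOURCES):
            features.update(result)
    return features

def score(features):
    """Placeholder model: it can only run once all features have been collected."""
    return (0.001 * features["email_age_days"]
            + 0.1 * features["profile_count"]
            + 0.2 * features["address_match"])

features = collect_features({"email": "someone@example.com"})
fraud_score = score(features)
```

With real network calls, the collection step takes roughly as long as the slowest source, while the scoring time is added on top of it in full.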
Challenges
Third party prediction platforms usually offer a restricted set of models and are not conducive to experimentation or exploration with different model types. These services are expensive and a large commitment to maintain. For startups and nascent services with unproven ROI, this is a large impediment to building scalable prediction services. Also, models developed on third party systems are not necessarily serializable. This is an issue from a vendor lock-in perspective as well as for governance in the financial industry.
Our solution was to leverage existing models already constructed in R. Models were deployed within the Apache web server using the RApache extension. This provides a dev-ops friendly interface with which system engineers are familiar. Further, we encapsulate all required components and dependencies into a portable Docker container, which abstracts the entire application down to the OS level and requires no knowledge of the internal implementation to deploy and operate. These containers can be deployed locally or to cloud services, in our case AWS. The resulting deployment process is familiar to dev-ops and follows established processes for testing, monitoring and scaling.
At startup, the R process loads serialized models from a model directory and adds them to an internal map keyed on a name and version attribute embedded within the model objects themselves. The model objects are expected to support a predict function, which makes them compatible with most popular R model types but also allows the implementation of custom model types or of preprocessing hooks.
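The registry idea can be illustrated with a short sketch. It is written in Python purely for illustration (the service itself is in R), and `ThresholdModel`, `build_registry` and the `amount` input are hypothetical names. The point is only that any object carrying a name, a version and a predict function can be indexed and looked up the same way.

```python
class ThresholdModel:
    """Hypothetical model type; anything with name, version and predict() would do."""

    def __init__(self, name, version, threshold):
        self.name = name
        self.version = version
        self.threshold = threshold

    def predict(self, inputs):
        # A custom model can do any preprocessing it needs before scoring.
        return {"score": 1.0 if inputs["amount"] > self.threshold else 0.0}

def build_registry(models):
    """Index models by the (name, version) pair carried on each object."""
    registry = {}
    for model in models:
        if not callable(getattr(model, "predict", None)):
            raise TypeError(f"{model!r} does not support predict()")
        registry[(model.name, model.version)] = model
    return registry

# In the real service the objects would be deserialized from the model directory.
registry = build_registry([
    ThresholdModel("fraud", "1", threshold=100),
    ThresholdModel("fraud", "2", threshold=250),
])
result = registry[("fraud", "2")].predict({"amount": 300})
```

Keying on both name and version lets several versions of the same model be served side by side.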
The web endpoint, exposed by the Rook package, parses the request URL path and extracts the model name and version, which are used to select the correct model from the internal model map. The Rook controller delegates to the registered R request handler method. Prediction inputs are parsed from the request body and passed to the model’s predict function. Model predictions are serialized and returned as JSON objects.
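The request flow described above can be sketched without any web framework. The following Python illustration assumes a path layout of `/predict/<name>/<version>`, which the talk does not specify, and uses a hypothetical `constant_model` in place of a real model. It is not the Rook handler itself.

```python
import json

def constant_model(inputs):
    # Hypothetical predict function standing in for a real model.
    return {"score": 0.42, "n_inputs": len(inputs)}

# Model map keyed on (name, version), as built at startup.
MODELS = {("fraud", "1"): constant_model}

def handle_request(path, inputs):
    """Return (status, json_body) for a path of the assumed form /predict/<name>/<version>."""
    parts = [p for p in path.split("/") if p]
    if len(parts) != 3 or parts[0] != "predict":
        return 400, json.dumps({"error": "expected /predict/<name>/<version>"})
    predict = MODELS.get((parts[1], parts[2]))
    if predict is None:
        return 404, json.dumps({"error": "unknown model"})
    # Predictions are serialized as a JSON object for the caller.
    return 200, json.dumps(predict(inputs))

status, body = handle_request("/predict/fraud/1", {"amount": "300"})
```

Returning a distinct status for an unknown model or a malformed path keeps routing errors separate from prediction errors.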
Our implementation uses HTTP POST parameter formatting for encoding input variables but we are seeing third party services favour a JSON format, which is more flexible and can be used to implement a typing system between caller and service.
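The difference between the two encodings can be seen in a small sketch. It uses only the Python standard library, and `parse_inputs` and the field names are hypothetical. Form-encoded values always arrive as strings, whereas JSON preserves numbers, booleans and nesting, which is what makes a typed contract between caller and service possible.

```python
import json
from urllib.parse import parse_qs

def parse_inputs(body, content_type):
    """Decode a request body into a dict of model inputs."""
    if content_type == "application/json":
        return json.loads(body)
    if content_type == "application/x-www-form-urlencoded":
        # parse_qs returns a list per key; keep the first value, which is always a string.
        return {key: values[0] for key, values in parse_qs(body).items()}
    raise ValueError(f"unsupported content type: {content_type}")

form_inputs = parse_inputs("amount=300&country=US", "application/x-www-form-urlencoded")
json_inputs = parse_inputs('{"amount": 300, "country": "US"}', "application/json")
```

Here `form_inputs["amount"]` is the string `"300"` and must be coerced by the service, while `json_inputs["amount"]` is already the integer `300`.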
The Rook application can be used as-is to expose the R model as a web endpoint and is also useful for local testing, since it can be run directly from an R session. However, to implement a robust multi-threaded web server, we embedded our Rook application within the Apache web server via the RApache module. This module connects the Apache request delegation stack with an R interpreter.
Concurrency is achieved using the Apache prefork MPM, which runs a single thread per child process; each process hosts its own copy of the Rook-R prediction application described above. More efficient MPM configurations are possible, but prefork eliminates any concerns about R model thread safety, and the R code needs no special handling for concurrent requests.
On startup, Apache forks StartServers child processes, keeps at least MinSpareServers of them idle to absorb new requests, and scales up to MaxRequestWorkers depending on load. Apache can also be configured to recycle child processes periodically (for example with MaxConnectionsPerChild), which is useful for live updating of models, since the contents of the model directory are reloaded each time the prediction application initializes.
We are currently prototyping a system similar to the one described above that uses PMML (Predictive Model Markup Language), an open specification for defining the implementation of predictive models.
The PMML package in R generates PMML definitions for most common model types. These specifications can be re-instantiated using, amongst other tools, JPMML, an open source Java PMML evaluator. The resulting Java model object is embedded in a Servlet and deployed as a Java web application on Apache Tomcat.
The application described up to this point, whether R or Java, can be deployed as-is on on-premises or cloud-based servers. To further facilitate deployment, we used Docker containerization to encapsulate the prediction service into a portable, self-contained unit.
A Dockerfile build script builds a portable application image. This image can be used locally or pushed to a public Docker repository or, in our case, a private Docker repository within AWS Elastic Container Services. From there the images can be pulled by developers and testers, or deployed directly into AWS Elastic Beanstalk.
Another useful aspect of the Docker repository is that it allows for image versioning and change management. Although Docker is native to the Linux OS, images can be created and executed on other platforms using virtualization tools like Docker Machine.
Once the application image is deployed to AWS Elastic Beanstalk, it can be distributed across multiple geographic zones as a matter of dev-ops configuration. Requests are load balanced across the zones and across the instances within each zone. Auto-scaling rules can be configured to automatically increase the available resources within each zone in response to system level metrics like CPU or memory usage, network throughput, or request rate.
In AWS the prediction service looks very much like any of our other web services. The management console can be used for lifecycle management and AWS CloudWatch metrics for monitoring, alarming, and performance testing.
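The logic of a threshold-based auto-scaling rule can be sketched as a pure function. This Python illustration is only a model of the idea, not the AWS API. The function name, the CPU thresholds and the instance limits are hypothetical values chosen for the example.

```python
def desired_instances(current, cpu_percent, low=30.0, high=70.0, minimum=2, maximum=8):
    """Return the instance count a simple threshold rule would ask for.

    Scale out by one when the metric is above `high`, scale in by one when it is
    below `low`, and always stay within [minimum, maximum].
    """
    if cpu_percent > high:
        target = current + 1
    elif cpu_percent < low:
        target = current - 1
    else:
        target = current
    return max(minimum, min(maximum, target))
```

For example, `desired_instances(2, 85.0)` asks for a third instance, while `desired_instances(2, 10.0)` stays at the floor of two so that redundancy is preserved.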
- Rapid deployment of R models in a scalable robust environment.
- Directly leverage R models developed by data scientists and analysts.
- Apply existing dev-ops processes for testing, monitoring, scaling, and alerting of predictive models.
- Possible future use of PMML to serialize models for compliance.
For more information find the Socure R module on GitHub!