Here’s how we proactively monitor XVigil’s 50+ Machine Learning models
Machine Learning (ML) models are essential for identifying patterns and making reliable predictions. At CloudSEK, our models are trained to make such predictions on data collected from more than 1,000 sources. With over 50 different models running in production, monitoring them is a daunting task, yet indispensable.
The ML development life cycle consists of training and testing models, deploying them to production, and monitoring them to improve accuracy. Without adequate monitoring, predictions become inaccurate, models grow obsolete, and bugs go unnoticed.
CloudSEK’s Data Engineering team works with Data Scientists to deploy ML models and continuously track their performance. To achieve this, we ensure that the requirements described below are fulfilled.
At CloudSEK, a document is classified across various stages by different Machine Learning models, each of which may have multiple versions in operation. The client is alerted only to the most accurate results, taken either from the best-performing model or from an ensemble that combines the outputs of different versions.
At its core, every machine learning module is composed of two parts, and its output depends on both: the trained model itself (the architecture and its learned weights) and the code that surrounds it (pre- and post-processing logic, static rules, and other glue code).
As a rule of thumb, any significant modification to either of these components qualifies as a version upgrade. Minor changes, bug fixes, or additions of static rules, however, do not lead to an upgrade; they are treated as regular code updates, which we track via Git.
Our Machine Learning models are generally hosted in stateless Docker containers. As soon as a container starts, the models it hosts begin listening to message queues. Each container maintains a configuration file with information about the type of models it runs, their versions, and whether those models are meant for production.
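For illustration, the configuration such a container might read could look like the sketch below. The field names (model_name, versions, production_version) and the loader are assumptions for this example, not the exact schema used in production.

# A hypothetical sketch of the per-container configuration described above.
# Field names are illustrative and may differ from the real schema.
import json

SAMPLE_CONFIG = {
    "model_name": "stage_2_clf",      # determines which queue/data the container consumes
    "versions": ["v0", "v1"],         # model versions served by this container
    "production_version": "v0",       # only this version's answer is surfaced to clients
}

def load_config(path="model_config.json"):
    """Read the container's configuration file at startup."""
    with open(path) as f:
        return json.load(f)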
When the Docker image is built, the latest commit hash of the Git repository is passed to it and set as an environment variable. The diagram below illustrates the data flow between ML models and their different versions.
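As a sketch of how that commit hash might be injected and read back at runtime (the build argument name GIT_COMMIT_HASH and the helper below are assumptions, not the exact mechanism used in production):

# Hypothetical build-time injection of the commit hash, e.g.:
#   docker build --build-arg GIT_COMMIT_HASH=$(git rev-parse --short HEAD) -t stage-2-clf .
# with matching "ARG GIT_COMMIT_HASH" and "ENV GIT_COMMIT_HASH=$GIT_COMMIT_HASH" lines
# in the Dockerfile, so the hash is available inside the running container.
import os

def current_commit_hash():
    """Return the commit hash baked into the container, e.g. '6f8e8033'."""
    return os.environ.get("GIT_COMMIT_HASH")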
When the container runs, data is consumed from a message queue; the model name in the configuration file determines which data is consumed. Once a document is processed, the predictions are returned as a dictionary, which is then persisted to a database.
The ML modules can also return optional metadata, such as the raw prediction scores or the functions triggered internally.
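The post does not name the queue or database technologies (the ISODate values in the sample below suggest MongoDB), so the following is only a minimal sketch, assuming RabbitMQ via pika for the queue and MongoDB via pymongo for persistence; the queue name, field names, and run_model() are illustrative stand-ins for the real components.

# A minimal consumer sketch under assumed technologies: RabbitMQ (via pika) for
# the message queue and MongoDB (via pymongo) for persistence.
import json
import os
from datetime import datetime, timezone

import pika
from pymongo import MongoClient

RESULT_FIELD = "classifications_stage_2_clf_v0"  # model- and version-specific key
collection = MongoClient("mongodb://localhost:27017")["xvigil"]["documents"]

def run_model(text):
    """Placeholder for the real inference call; returns (answer, optional metadata)."""
    return {"type": "nonthreat", "reason": None, "severity": None}, None

def on_message(channel, method, properties, body):
    doc = json.loads(body)
    answer, meta = run_model(doc.get("text", ""))
    result = {
        "answer": answer,
        "content_meta": meta,                              # optional metadata, if any
        "hit_time": datetime.now(timezone.utc),
        "commit_hash": os.environ.get("GIT_COMMIT_HASH"),  # ties the prediction to the code state
    }
    # Persist the prediction dictionary under the model-and-version-specific key.
    collection.update_one(
        {"document_id": doc["document_id"]},
        {"$set": {RESULT_FIELD: result}},
        upsert=True,
    )

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue="stage_2_clf", durable=True)
channel.basic_consume(queue="stage_2_clf", on_message_callback=on_message, auto_ack=True)
channel.start_consuming()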
Given below is a sample document after the results from all the models have been recorded:
{
    "document_id" : "root-001#96bfac5a46",
    "classifications_stage_1_clf_v0" : {
        "answer" : "suspected-defaced-webpage",
        "content_meta" : null,
        "hit_time" : ISODate("2019-12-24T14:54:09.892Z"),
        "commit_hash" : "6f8e8033"
    },
    "classifications_stage_2_clf_v0" : {
        "answer" : {
            "reason" : null,
            "type" : "nonthreat",
            "severity" : null
        },
        "content_meta" : null,
        "hit_time" : ISODate("2019-12-24T15:40:46.245Z"),
        "commit_hash" : null
    },
    "classifications_stage_2_clf_v1" : {
        "answer" : {
            "reason" : null,
            "type" : "nonthreat",
            "severity" : null
        },
        "content_meta" : null,
        "hit_time" : ISODate("2019-12-24T15:40:46.245Z"),
        "commit_hash" : null
    }
}
This process allows us to find, for any given document, the exact state of every model that classified it. We can roll back between model versions, and a minor change to a value in the configuration file is enough to distinguish the main production model from the test models.
A Metabase instance can be used to visualize key metrics and the performance of each classifier on a dashboard. The dashboard may also show details about the documents processed by each model or, in the case of classification tasks, how many documents were assigned to category X, category Y, and so on.
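As an example of the kind of metric such a dashboard could surface, the aggregation below counts documents per predicted type for one classifier version; it assumes the MongoDB layout of the sample document above and is not Metabase's own query syntax.

# Hypothetical metric behind a dashboard card: documents per predicted type
# for one model version, assuming the document layout shown earlier.
from pymongo import MongoClient

collection = MongoClient("mongodb://localhost:27017")["xvigil"]["documents"]

pipeline = [
    {"$match": {"classifications_stage_2_clf_v1": {"$exists": True}}},
    {"$group": {
        "_id": "$classifications_stage_2_clf_v1.answer.type",
        "count": {"$sum": 1},
    }},
    {"$sort": {"count": -1}},
]

for row in collection.aggregate(pipeline):
    print(row["_id"], row["count"])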
Since the output of every version is retrieved and stored, monitoring also allows data scientists to study and compare the results of the various model versions. This gives them a set of documents whose outputs may have been influenced by a new model version, and these documents are then added to the training data to recalibrate the models.
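A minimal sketch of how such documents could be pulled, assuming the same MongoDB layout as above: the query selects documents where two versions of the stage-2 classifier disagree on the predicted type, which are natural candidates for review and for enriching the training set.

# Illustrative query: find documents where v0 and v1 of the stage-2 classifier
# disagree on the predicted type.
from pymongo import MongoClient

collection = MongoClient("mongodb://localhost:27017")["xvigil"]["documents"]

disagreements = collection.find({
    "classifications_stage_2_clf_v0": {"$exists": True},
    "classifications_stage_2_clf_v1": {"$exists": True},
    "$expr": {
        "$ne": [
            "$classifications_stage_2_clf_v0.answer.type",
            "$classifications_stage_2_clf_v1.answer.type",
        ]
    },
})

for doc in disagreements:
    print(doc["document_id"])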