
How does CloudSEK’s XVigil detect rogue, fake applications?

 

Brand monitoring is one of the premium services offered by CloudSEK’s flagship digital risk monitoring platform – XVigil. This functionality covers a wide range of use cases, such as:

  • Fake domain monitoring
  • Rogue/fake application detection
  • VIP monitoring

Threat actors deploy fake or rogue apps that masquerade as our clients’ official applications by infringing on their trademarks and copyrighted material. Upon seeing the familiar trademark, customers are tricked into installing such apps on their devices, thereby running malicious code that allows threat actors to exfiltrate data. This year alone, XVigil has reported and alerted our clients to over 2.4 lakh (240,000) fake apps from various third-party app stores.

 

Classification through similarity scoring

Classification of such threats forms a major part of XVigil’s threat monitoring framework. The platform identifies and classifies an app as fake or rogue based on whether it impersonates our client’s apps and whether the uploaded APK files differ from our client’s official APKs.

Typically, a machine learning problem requires training data, with the expectation that the test data will be similar to it. In this case, however, the data differs for each client, so a conventional approach would require a separate model for every new client.

We instead approach this as a similarity-scoring problem: we compare the suspected app with all of the client’s official apps and measure how similar they are in terms of the app title, description, screenshots, logos, etc. The main challenge arises when we don’t have all the information about the client to compare against the suspicious app.

 

CloudSEK’s Knowledge Extraction Module

To gather more information about the client, we built a knowledge extraction module. When a new client signs up, this module is triggered and collects every piece of information it can about the client. It was built as a generic module, and the client’s name and their primary domain are the only inputs it requires.

With these details, the knowledge extraction module can identify the industry our client operates in, their various products and services, their competitors, the main tech stack/technologies they work with, their official apps, and so on. These details are sourced from the Google Play Store, by crawling and parsing the client’s website, from job listings posted by the clients themselves, etc. The gathered information is then passed through custom Named Entity Recognition (NER) models or static rules to produce the client details in a structured manner.
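As a rough illustration of this structuring step, the minimal Python sketch below uses an off-the-shelf spaCy pipeline as a stand-in for our custom NER models; the pipeline name, the entity labels kept, and the example text are assumptions for demonstration only.

import spacy

# Load a general-purpose English pipeline as a stand-in for the custom NER models.
nlp = spacy.load("en_core_web_sm")

def extract_client_entities(crawled_text: str) -> dict:
    """Group named entities found in crawled text (website, job listings, etc.)."""
    doc = nlp(crawled_text)
    entities = {"organizations": set(), "products": set(), "locations": set()}
    for ent in doc.ents:
        if ent.label_ == "ORG":
            entities["organizations"].add(ent.text)
        elif ent.label_ == "PRODUCT":
            entities["products"].add(ent.text)
        elif ent.label_ in ("GPE", "LOC"):
            entities["locations"].add(ent.text)
    return {key: sorted(values) for key, values in entities.items()}

# Example: structured details extracted from a client's "About us" page.
print(extract_client_entities("Acme Corp builds the AcmePay app and hires Django developers in Bengaluru."))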

When we monitor for malicious applications, we run text/image similarity models on the gathered information (client’s logo, official app, competitors, etc.) against the information present in the suspicious app. For example, the text similarity module checks how contextually similar the client’s app description is to that of the suspicious app, while another module checks whether the client’s logos appear in the screenshots provided with the suspicious app. The individual scores are then combined into an ensemble score.
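To make the idea concrete, here is a simplified sketch of such an ensemble, using TF-IDF cosine similarity for the text comparison; the weights, threshold, and example scores are illustrative assumptions rather than the values used in production.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def text_similarity(official_description: str, suspicious_description: str) -> float:
    """Cosine similarity between TF-IDF vectors of the two app descriptions."""
    vectors = TfidfVectorizer().fit_transform([official_description, suspicious_description])
    return float(cosine_similarity(vectors[0], vectors[1])[0][0])

def ensemble_score(title_sim: float, description_sim: float, logo_sim: float) -> float:
    """Weighted combination of the individual similarity signals.
    The weights here are illustrative assumptions, not the production values."""
    weights = {"title": 0.3, "description": 0.4, "logo": 0.3}
    return (weights["title"] * title_sim
            + weights["description"] * description_sim
            + weights["logo"] * logo_sim)

desc_sim = text_similarity("Official banking app of Acme Bank", "Acme Bank official mobile banking")
score = ensemble_score(title_sim=0.9, description_sim=desc_sim, logo_sim=0.8)
print(score > 0.75)  # flag the app for review if the ensemble score crosses a threshold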

 

Finally…

If the similarity score is greater than a certain threshold, we can safely say that the app resembles our client’s brand/app. We then need to check whether the uploaded APK is a modified APK or the client’s original APK. The challenge here is maintaining the APKs released by the client for every version of their apps, across all devices and regions. To work around this, we compare the suspicious app’s signing certificate with the certificates of the client’s official app on the Google Play Store. This filters out APKs signed by the client’s official developers, leaving only the malicious apps, which are then reported to our clients.
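A minimal sketch of this certificate check is shown below, assuming v1 (JAR) signature blocks and the cryptography library; tools like androguard or apksigner handle the v2/v3 signature schemes more robustly in practice, and the file names here are hypothetical.

import zipfile
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.serialization import pkcs7

def signing_cert_fingerprints(apk_path: str) -> set:
    """SHA-256 fingerprints of the certificates in an APK's v1 signature blocks."""
    fingerprints = set()
    with zipfile.ZipFile(apk_path) as apk:
        for name in apk.namelist():
            # v1 (JAR) signature blocks are PKCS#7 files under META-INF/.
            if name.startswith("META-INF/") and name.endswith((".RSA", ".DSA", ".EC")):
                for cert in pkcs7.load_der_pkcs7_certificates(apk.read(name)):
                    fingerprints.add(cert.fingerprint(hashes.SHA256()).hex())
    return fingerprints

# Hypothetical file names: compare the suspicious APK against the client's official APK.
official = signing_cert_fingerprints("client_official.apk")
suspicious = signing_cert_fingerprints("suspicious_upload.apk")

if suspicious and suspicious <= official:
    print("Signed with the client's own certificate -- likely an official build")
else:
    print("Unknown signing certificate -- flag as a modified/rogue APK")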


Here’s how we proactively monitor XVigil’s 50+ Machine Learning models

 

Machine Learning (ML) models are essential for identifying patterns and making reliable predictions. At CloudSEK, our models derive such predictions from data collected from more than 1,000 sources. With over 50 different models running in production, monitoring these Machine Learning models is a daunting yet indispensable task.

The ML development life cycle consists of training and testing models, deploying them to production, and monitoring them for improved accuracy. A lack of adequate monitoring could lead to inaccurate predictions, obsolete models, and unnoticed bugs.

CloudSEK’s Data Engineering team works together with Data Scientists to deploy ML models and track their performance continuously. To achieve this, we ensure that the following requirements are fulfilled:

  • Model versioning: enabling multiple versions of the same model
  • Initializing observer patterns using incoming data
  • Comparing results from different versions

At CloudSEK, different Machine Learning models, and multiple versions of each, classify a document across various stages. The client is then alerted only to the most accurate results, either from the best-performing model or from an ensemble that combines the outputs of different versions.

 

What constitutes a version upgrade? 

At its core, every machine learning module is composed of two parts, and its output depends on both of these components:

  • The core ML model weights file which is generated upon training a model.
  • The surrounding code that handles preprocessing, feature extraction, post-processing, etc.

As a rule of thumb, any significant modification to either of these components qualifies as a version upgrade. Minor changes, bug fixes, or even additions of static rules do not lead to an upgrade; they are simply treated as regular code updates, which we track via Git.

 

Deploying and Monitoring Models

Generally, Machine Learning models are hosted in stateless Docker containers. As soon as a container starts, the containerized models listen to queues for messages. The container maintains a configuration file with information about the type of each model, its version, and whether it is meant for production.

When the Docker image is built, the latest Git commit hash of the repository can be passed to it and set as an environment variable. The diagram below explains the data flow between ML models and their different versions:

Machine Learning models diagram
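For illustration, the sketch below shows what the configuration file and the commit-hash environment variable described above might look like at container start-up; the file name, keys, and values are assumptions, not the exact production schema.

import json
import os

# Illustrative contents of the container's model configuration file.
config = {
    "model_name": "stage_2_clf",
    "version": "v1",
    "queue": "stage_2_documents",
    "production": True,
}
with open("model_config.json", "w") as fp:
    json.dump(config, fp, indent=2)

# At run time, the container reads the config and the commit hash baked in at build time.
with open("model_config.json") as fp:
    runtime_config = json.load(fp)
runtime_config["commit_hash"] = os.environ.get("GIT_COMMIT_HASH")

print(f"Running {runtime_config['model_name']} {runtime_config['version']} "
      f"(production={runtime_config['production']}, commit={runtime_config['commit_hash']})")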

 

When the container runs, data is consumed from a message queue. The model name present in the configuration file determines which data is consumed. Once the data is processed, the predictions are returned as a dictionary, which is then persisted to a database.

The ML modules can also return optional metadata containing information such as the raw prediction scores, the functions triggered internally, etc.
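A rough sketch of this consume-predict-persist loop is given below, assuming RabbitMQ (via pika) as the message queue and MongoDB (via pymongo) as the database; the queue, collection, and field names mirror the sample document that follows but are otherwise assumptions.

import json
from datetime import datetime, timezone

import pika
from pymongo import MongoClient

# Assumed database and collection for classification results.
collection = MongoClient("mongodb://localhost:27017")["xvigil"]["classifications"]

def classify(text: str) -> dict:
    """Placeholder for the actual model inference."""
    return {"answer": "nonthreat", "content_meta": None}

def on_message(channel, method, properties, body):
    document = json.loads(body)
    prediction = classify(document["text"])
    prediction["hit_time"] = datetime.now(timezone.utc)
    prediction["commit_hash"] = None  # would come from the GIT_COMMIT_HASH environment variable
    # Persist the result under a field keyed by model name and version.
    collection.update_one(
        {"document_id": document["document_id"]},
        {"$set": {"classifications_stage_2_clf_v1": prediction}},
        upsert=True,
    )

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue="stage_2_documents")
channel.basic_consume(queue="stage_2_documents", on_message_callback=on_message, auto_ack=True)
channel.start_consuming()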

Given below is a sample document after the results from all the models have been processed:

 

{
    "document_id" : "root-001#96bfac5a46",
    "classifications_stage_1_clf_v0" : {
        "answer" : "suspected-defaced-webpage",
        "content_meta" : null,
        "hit_time" : ISODate("2019-12-24T14:54:09.892Z"),
        "commit_hash" : "6f8e8033"
    },
    "classifications_stage_2_clf_v0" : {
        "answer" : {
            "reason" : null,
            "type" : "nonthreat",
            "severity" : null
        },
        "content_meta" : null,
        "hit_time" : ISODate("2019-12-24T15:40:46.245Z"),
        "commit_hash" : null
    },
    "classifications_stage_2_clf_v1" : {
        "answer" : {
            "reason" : null,
            "type" : "nonthreat",
            "severity" : null
        },
        "content_meta" : null,
        "hit_time" : ISODate("2019-12-24T15:40:46.245Z"),
        "commit_hash" : null
    }
}

 

How this helps us

This process allows us to find, for any given document, the exact state of all the models that classified it. We can roll back between model versions, and a minor change to a value in the configuration file lets us set the main production model apart from the test models.

A Metabase instance can be leveraged to visualize key metrics and the performance of each classifier on a dashboard. It may also contain details about the documents processed by each model, or how many documents were classified as category X, category Y, etc. (in the case of classification tasks), and more.

 

 

Screenshot of the internal dashboard which helps in visualizing key metrics

Monitoring also allows data scientists to study and compare the results of the various versions of the models, since the particulars of each version’s outputs are retained. This provides them with a set of documents whose outputs may have been influenced by a new model; this data is then added to the training data to calibrate the models.
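As an example of such a comparison, the sketch below pulls out documents where two versions of the stage-2 classifier disagree; the connection details and collection name are assumptions, while the field names follow the sample document above.

from pymongo import MongoClient

collection = MongoClient("mongodb://localhost:27017")["xvigil"]["classifications"]

# Documents where the stage-2 classifier's v0 and v1 disagree on the predicted type.
disagreements = collection.find({
    "classifications_stage_2_clf_v0": {"$exists": True},
    "classifications_stage_2_clf_v1": {"$exists": True},
    "$expr": {
        "$ne": [
            "$classifications_stage_2_clf_v0.answer.type",
            "$classifications_stage_2_clf_v1.answer.type",
        ]
    },
})

for doc in disagreements:
    print(doc["document_id"],
          doc["classifications_stage_2_clf_v0"]["answer"]["type"],
          doc["classifications_stage_2_clf_v1"]["answer"]["type"])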