Machine Learning (ML) models are essential to identifying patterns and making reliable predictions. At CloudSEK, our models are trained to make such predictions across data collected from more than 1,000 sources. With over 50 different models running in production, monitoring these Machine Learning models is a daunting yet indispensable task.
The ML development life cycle consists of training and testing models, their deployment to production, and monitoring them for improved accuracy. A lack of adequate monitoring could lead to inaccurate predictions, obsolete models, and the presence of unnoticed bugs in them.
CloudSEK’s Data Engineering team works together with Data Scientists to deploy ML models and track their performances continuously. To achieve this, we ensure that the following requirements are fulfilled:
Model versioning: Enable multiple versions of the same models
Initializing observer patterns using incoming data
Comparing results from different versions
At CloudSEK, different Machine Learning models and their multiple versions classify a document across various stages. The client is then alerted only to the most accurate results from the most efficient models, or to an ensemble of results obtained by combining different versions.
What constitutes a version upgrade?
At its core, every machine learning module is composed of two parts, and the output of the module depends on both of these components:
The core ML model weights file which is generated upon training a model.
The surrounding code statements that represent preprocessing, feature extraction, post-processing, etc.
As a rule of thumb, any significant modification made to these two components qualifies as a version upgrade. However, minor changes, bug fixes, or even additions of static rules don't lead to an upgrade; they are simply considered regular code updates, which we track via Git.
Deploying and Monitoring Models
Generally, Machine Learning models are hosted in stateless Docker containers. As soon as a container runs on a system, the containerized models listen to queues for messages. The container maintains a configuration file with information about the type of models, their versions, and whether these models are meant for production.
When the Docker container is built, the latest commit hash of the Git repository can be passed to it and set as an environment variable. The diagram explains the data flow between ML models and their different versions:
When the container is run, data is consumed from a message queue. The model name present in the configuration file determines the data that is consumed. Once it is processed, the predictions are returned as a dictionary which is then persisted into a database.
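The flow described above can be sketched in a few lines of Python. Everything here is illustrative: the model name, versions, in-memory queue, and "database" are stand-ins for the real infrastructure, not CloudSEK's actual implementation.

```python
# Hypothetical configuration, mirroring the container's config file:
# the model name, the versions to run, and which version serves production.
CONFIG = {
    "model_name": "phishing_classifier",
    "versions": ["v1", "v2"],
    "production_version": "v2",
}

def classify(message, version):
    # Stand-in for real model inference; returns a prediction dictionary.
    return {"version": version, "prediction": "phishing", "score": 0.92}

def consume(queue, config):
    """Consume messages, run every configured version, persist the results."""
    database = []  # stand-in for the real document store
    for message in queue:
        results = {v: classify(message, v) for v in config["versions"]}
        document = {
            "raw": message,
            "model_name": config["model_name"],
            "results": results,
            "production_result": results[config["production_version"]],
        }
        database.append(document)
    return database

docs = consume(["suspicious-domain.example"], CONFIG)
```

Because every version's result is persisted alongside the production result, switching production to a different version is a one-line config change.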
The ML modules can also possibly return optional metadata that contains information such as the actual prediction scores, functions triggered inside, etc.
Given below is a sample of a document after processing the results from all the models:
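As a stand-in for the original sample, a processed document might look roughly like the following; all field names and values here are hypothetical, not CloudSEK's actual schema.

```python
# Hypothetical shape of a document after all model versions have run on it.
processed_doc = {
    "doc_id": "abc123",
    "models": {
        "phishing_classifier": {
            "v1": {"prediction": "benign", "score": 0.41},
            "v2": {"prediction": "phishing", "score": 0.88},
        }
    },
    "git_commit": "d34db33f",      # commit hash baked into the container
    "production_version": "v2",    # which version's result the client sees
}
```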
This process allows us to find, for any given document, the exact state of every model that classified it. We can roll back between model versions, and a minor change to a value in the configuration file lets us set the main production model apart from the test models.
A Metabase instance can be leveraged to visualize key metrics and the performance of each classifier on a dashboard. It may also contain details about the documents processed by each model, how many documents were classified as category X, category Y, etc. (in the case of classification tasks), and more.
Monitoring also allows data scientists to study and compare the results of the various versions of the models, since the particulars of each version's output are retained. This provides them with a set of documents whose classification may have been influenced by a new model. Such documents are then added to the training data to recalibrate the models.
Deep learning models are often criticized for being complex and opaque. They are called black boxes because they deliver predictions and insights, but the logic behind their outputs is hard to comprehend. Because of the complex, multilayer, nonlinear neural networks involved, data scientists find it difficult to ascertain the factors or reasons behind a certain prediction.
This unintelligibility makes people wary of taking important decisions based on the models’ outputs. As human beings, we trust what we understand; what we can verify. And over time, this has served us well. So, being able to show how models go about solving a problem to produce insights, will help build trust, even among people with cursory knowledge of data science.
To achieve this it is imperative to develop computational methods that can interpret, audit, and debug such models. Debugging is essential to understanding how models identify patterns and generate predictions. This will also help us identify and rectify bugs and defects.
In this article we delve into the different methods used to debug machine learning models.
Permutation feature importance is an algorithm that computes the sensitivity of a model to permutations/ alterations in a feature's values. In essence, feature importance evaluates each feature of your data and scores it based on its relevance or importance to the output, while permutation feature importance measures each feature after its values have been shuffled, and scores it based on how much that shuffling degrades the output.
For instance, let us randomly permute or shuffle the values of a single column in the validation dataset with all the other columns intact. If the model’s accuracy drops substantially and causes an increase in error, that feature is considered “important”. On the other hand, a feature is considered ‘unimportant’ if shuffling its values doesn’t affect the model’s accuracy.
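The shuffle-and-measure procedure can be sketched with a toy model whose output depends only on the first feature; all the data here is made up for illustration.

```python
import random

# Toy "model": y depends strongly on feature 0 and not at all on feature 1.
def model(row):
    return 3.0 * row[0]

def mse(rows, targets):
    return sum((model(r) - t) ** 2 for r, t in zip(rows, targets)) / len(rows)

random.seed(0)
rows = [[random.random(), random.random()] for _ in range(200)]
targets = [model(r) for r in rows]
baseline = mse(rows, targets)  # 0.0 for this noise-free toy

def permutation_importance(col):
    """Shuffle one column, keep the rest intact, and measure the error increase."""
    shuffled = [r[col] for r in rows]
    random.shuffle(shuffled)
    permuted = [r[:col] + [v] + r[col + 1:] for r, v in zip(rows, shuffled)]
    return mse(permuted, targets) - baseline

imp0 = permutation_importance(0)  # important feature -> large error increase
imp1 = permutation_importance(1)  # unused feature -> no error increase
```

Shuffling feature 0 destroys the model's accuracy, so it scores as "important"; shuffling feature 1 changes nothing, so it scores as "unimportant".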
Debugging ML models using ELI5
ELI5 is a Python library that supports several ML frameworks and helps to easily visualize and debug black boxes using a unified API. It can compute permutation importance, but it should be noted that permutation importance is only computed on test data, after the model is built.
After our model is ready, we import ELI5 to calculate permutation importance.
The output of the above code is shown below:
The features at the top are the most important: any alterations made to their values will reduce the accuracy of the model significantly. The features at the bottom of the list are unimportant, in that permuting their values will not reduce the model's accuracy. In this example, OverallQual was the most important feature.
Debugging CNN-based models using Grad-CAM (Gradient-weighted Class Activation Mapping)
Grad-CAM is a technique that produces visual explanations for outputs to deliver transparent Convolutional Neural Network (CNN) based models. It examines the gradient information that flows into the final layer of the neural network to understand the output. Grad-CAM can be used for image classification, image captioning, and visual question answering. The output provided by Grad-CAM is a heatmap visualization, which is used to visually verify that your model is trained to look at the right patterns in an image.
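The core combination step of Grad-CAM can be sketched in numpy: the gradients flowing into the last convolutional layer are global-average-pooled into per-channel weights, and a ReLU of the weighted sum of the feature maps gives the heatmap. The feature maps and gradients below are random placeholders standing in for a real CNN's activations and backpropagated class-score gradients.

```python
import numpy as np

rng = np.random.default_rng(0)
K, H, W = 4, 7, 7
feature_maps = rng.random((K, H, W))        # activations A_k of the last conv layer
gradients = rng.standard_normal((K, H, W))  # d(class score)/d(A_k), from backprop

# 1. Global-average-pool the gradients: one importance weight per channel.
weights = gradients.mean(axis=(1, 2))       # shape (K,)

# 2. Weighted sum of feature maps; ReLU keeps only positive influence.
cam = np.maximum((weights[:, None, None] * feature_maps).sum(axis=0), 0.0)

# 3. Normalise to [0, 1] so the map can be overlaid on the input image.
if cam.max() > 0:
    cam = cam / cam.max()
```

In a real pipeline the heatmap is then upsampled to the input image's resolution and overlaid on it for visual inspection.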
Debugging ML models using SHAP (SHapley Additive exPlanations)
SHAP is a game theoretic approach that aims to explain a prediction by calculating the importance of each feature towards that prediction. The SHAP library uses Shapley values at its core to explain individual predictions. Lloyd Shapley introduced the concept of Shapley values in 1953, and it was later applied to the field of machine learning.
Shapley values are derived from Game theory, where each feature in the data is a player, and the final reward is the prediction. Depending on their contribution towards the reward, Shapley values tell us how to distribute this reward fairly among the players.
We use SHAP quite often, especially for the models in which explainability is critical. The results are really quite accurate.
SHAP can explain:
General feature importance of the model by using all the data
Why a model calculates a particular score for a specific row/ record
The more dominant features for a segment/ group of data
Shapley values calculate the feature importance by comparing two predictions, one with the feature included and the other without it. The positive SHAP values affect the prediction/ target variable positively whereas the negative SHAP values affect the target negatively.
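This comparison over all orderings of features can be computed exactly for a small toy model. The "model" below is a made-up additive function over three wine-style feature names, so each feature's Shapley value should come out equal to its fixed contribution — a sanity check on the definition, not the SHAP library itself.

```python
from itertools import permutations

# Toy additive model: v(S) is the model's output when only the features
# in S are "included" (the others contribute nothing).
contributions = {"alcohol": 0.5, "sulphates": 0.3, "chlorides": -0.2}

def v(subset):
    return sum(contributions[f] for f in subset)

def shapley(feature):
    """Average the feature's marginal contribution over every ordering."""
    players = list(contributions)
    total, n = 0.0, 0
    for order in permutations(players):
        before = order[: order.index(feature)]
        total += v(list(before) + [feature]) - v(before)
        n += 1
    return total / n

phi = {f: shapley(f) for f in contributions}
```

Note that the Shapley values sum exactly to the full model output, which is what makes the attribution "fair": the prediction is fully distributed among the features.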
Here is an example to illustrate this. For this purpose, I am using the red wine quality dataset from Kaggle.
Now, we produce a variable importance plot, which lists the most significant variables in descending order: the variables at the top contribute the most to the model.
In the above figure, all the variables are plotted in descending order of importance. The color of a variable represents its feature value in that observation: high (in red) or low (in blue). A high "sulphates" content has a strong, positive impact on the quality rating; the x-axis shows this impact as the SHAP value. Similarly, we can say that "chlorides" is negatively correlated with the target variable.
Now, I would like to show you how SHAP values are computed for individual cases. We compute such values for several observations and choose a few of the observations at random.
After choosing random observations, we initialize our notebook with shap.initjs().
Explanation for some of the terms seen in the plot above:
Output value: Prediction for that observation, which in this case is 5.35
Base value: The mean prediction, or mean(yhat); here it is 5.65
Blue/ red: Features that push the prediction higher are shown in red, and those that push it lower are shown in blue.
For more information and to test your skills, check out kaggle.
Since the coinage of the term in 1956, Artificial Intelligence (AI) has evolved considerably. From its metaphorical reference in Mary Shelley's Frankenstein, to its most popular recent application in autonomous cars, AI has made a progressive shift over the years. It influences all the major industries such as transportation, communication, banking, education, healthcare, media, etc.
When it comes to cybersecurity, AI is changing how we detect and respond to threats. However, with the benefits, comes the risk of the potential misuse of AI capabilities. Is the primary catalyst for cybersecurity, also a threat to it?
How do we use AI in our daily life?
Social media users encounter AI on a daily basis and probably don’t recognize it at all. Online shopping recommendations, image recognition, personal assistants such as Siri and Alexa, and smart email replies, are the most popular examples.
For instance, Facebook identifies individual faces in a photo, and helps users “tag” and notify them. Businesses often embed chatbots in their websites and applications. These AI-driven chatbots detect words in the questions entered by customers, to predict and deliver prompt responses.
How do malicious actors abuse and weaponize AI?
To orchestrate attacks, cyber criminals often tinker with existing AI systems, instead of developing new AI programs and tools. Some common attacks that exploit Artificial Intelligence include:
Misusing the nature of AI algorithms/ systems: AI capabilities such as efficiency, speed and accuracy can be used to devise precise and undetectable attacks like targeted phishing attacks, delivering fake news, etc.
Input attacks/ adversarial attacks: Attackers can feed altered inputs into AI systems, to trigger unexpected/incorrect results.
Data Poisoning: Malicious actors corrupt AI training data sets by poisoning them with bad data, affecting the system’s accuracy.
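The "input attack" idea above can be illustrated with the classic fast gradient sign method (FGSM): each input feature is nudged a small step in the direction that increases the model's loss. The logistic-regression weights and inputs below are made up purely for illustration.

```python
import numpy as np

w = np.array([1.5, -2.0, 0.5])   # made-up model weights
b = 0.1

def predict(x):
    # P(class = 1) under a simple logistic model.
    return 1.0 / (1.0 + np.exp(-(x @ w + b)))

x = np.array([0.2, -0.4, 0.3])
y = 1  # true label

# Gradient of the log-loss w.r.t. the input, for a logistic model.
grad = (predict(x) - y) * w

# FGSM: step each feature by epsilon in the sign of the gradient.
epsilon = 0.3
x_adv = x + epsilon * np.sign(grad)
```

Each feature moves by at most epsilon, yet the model's confidence in the true class drops noticeably; against image classifiers the same trick flips predictions with perturbations invisible to a human.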
Examples of how AI can be weaponized
GPT-2 text generator/ language models
In November 2019, OpenAI released the latest and largest version of GPT-2 (Generative Pretrained Transformer 2). This language model is trained to generate unique textual content based on a given input, and it even tailors the output's style and subject to that input. So, if you input a specific topic or theme, GPT-2 will yield a few lines of text on it. GPT-2 is exceptional in that it doesn't reproduce pre-existing strings, but generates singular content that didn't exist before the model created it.
Drawbacks of GPT-2
The language model is built with 1.5 billion parameters and has a "credibility score" of 6.9 out of 10. It was trained on 8 million text documents. As a result, OpenAI claims that "GPT-2 outperforms other language models." The text generated by GPT-2 can be as good as text composed by a human. Since detecting this synthetic text is challenging, it becomes easier to create spam emails and messages or fake news, or to perform targeted phishing attacks.
Image recognition software
Image recognition is the process of identifying pixels and patterns to detect objects in digital images. The latest smartphones (for biometric authentication), social networking platforms, Google reverse image search, etc. use facial recognition. AI-based face recognition software detects faces in the camera's field of vision. Given its multiple uses across industries and domains, researchers expect the image recognition software market to reach a whopping USD 39 billion by 2021.
Drawbacks of image recognition software
Major smartphone brands are now using facial recognition instead of fingerprint recognition, in their biometric authentication systems. Since this cutting-edge technology is popular among consumers, cyber criminals have found ways to exploit it.
Tricking facial recognition: It has been demonstrated that Apple’s Face ID can be duped using 3D masks. There are also other instances of deceiving facial recognition with infrared lights, glasses, etc. Identical twins, such as myself, can swap our smartphones to trick even the most efficient algorithms, currently available.
Blocking automated facial recognition: As facial recognition depends on key features of the face, an alteration made to the features can block automated facial recognition. Similarly, researchers are exploring various ways by which automated facial recognition can be blocked.
For example, researchers found that minor modifications to a stop sign confuse autonomous cars. If carried out in real life, such attacks could have severe consequences.
Poisoned training sets
Machine learning algorithms that power Artificial Intelligence, learn from data sets (training sets) or by extracting patterns from data sets.
Drawbacks of Machine Learning algorithms
Attackers can poison training sets with bad data to degrade a system's accuracy. They can even "teach" the model to behave differently, through a backdoor or otherwise. As a result, the model fails to work in the intended way and remains corrupted.
In the most unusual of ways, Microsoft's AI chatbot, Tay, was corrupted by Twitter trolls. The smart chatbot was released on an experimental basis, to engage people in "playful conversations." However, Twitter users deluged the chatbot with racist, misogynistic, and anti-semitic tweets, turning Tay into a mouthpiece for a terrifying ideology in under a day.
AI is here to stay. So, as we build Artificial Intelligence systems that can efficiently detect and respond to cyber threats, we should take small steps to ensure they are not exploited:
Focus on basic cybersecurity hygiene including network security and anti-malware systems.
Ensure there is some human monitoring/ intervention even for the most advanced AI systems.
Teach AI systems to detect foreign data based on timestamps, data quality etc.
Text classification is an important task in Natural Language Processing in which predefined categories are assigned to text documents. In this article, we will explore recent approaches for text classification that consider document structure as well as sentence-level attention.
In general, a text classification workflow is like this:
You collect a lot of labeled sample texts for each class (your dataset). Then you extract some features from these text samples. The text features from the previous step along with the labels are then fed into a machine learning algorithm. After the learning process, you’ll save your classifier model for future predictions.
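The workflow above can be sketched end to end with scikit-learn (assuming it is installed); the four sample texts and their labels below are made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny made-up dataset: labeled sample texts for each class.
texts = [
    "the match was a thrilling game of cricket",
    "the striker scored a stunning goal",
    "the new phone has a fast processor",
    "this laptop ships with a bigger battery",
]
labels = ["sports", "sports", "tech", "tech"]

# Feature extraction (tf-idf) and the classifier, chained together.
clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(texts, labels)

prediction = clf.predict(["a fast new processor for laptops"])[0]
```

In practice the fitted pipeline would be serialized (e.g. with joblib) and saved for future predictions, as the workflow describes.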
One difference between classical machine learning and deep learning when it comes to classification is that in deep learning, feature extraction and classification are carried out together, whereas in classical machine learning they're usually separate tasks.
Proper feature extraction is an important part of machine learning for document classification, perhaps more important than choosing the right classification algorithm. If you don't pick good features, you can't expect your model to work well. Before discussing feature extraction further, let's talk about some methods of representing text for feature extraction.
A text document is made of sentences, which in turn are made of words. Now the question is: how do we represent them in a way that features can be extracted efficiently?
Document representation can be based on two models:
Standard Boolean models: These models use Boolean logic and set theory for information retrieval from text. They are inefficient compared to the vector space models below for many reasons, and are not our focus.
Vector space models: Almost all of the existing text representation methods are based on VSMs. Here, documents are represented as vectors.
Let’s look at two techniques by which vector space models can be implemented for feature extraction.
Bag of words (BoW) with TF-IDF
In the Bag of Words model, the text is represented as the bag of its words. The example below will make this clear.
Here are two sample text documents:
(i) Bofin likes to watch movies. Rahul likes movies too.
(ii) Bofin also likes to play tabla.
Based on the above two text documents, we can construct a list of the words in the documents as follows: Bofin, likes, to, watch, movies, Rahul, too, also, play, tabla.
Now if you consider a simple Bag of words model with term frequency (the number of times a term appears in the text), feature lists for the above two examples will be,
(i) [1, 2, 1, 1, 2, 1, 1, 0, 0, 0]
(ii) [1, 1, 1, 0, 0, 0, 0, 1, 1, 1]
A simple term frequency like this is not always a good way to characterize text. For large text documents, we use something called term frequency – inverse document frequency (tf-idf). tf-idf is the product of two statistics: term frequency (tf) and inverse document frequency (idf). We've just seen from the above example what tf is; now let's understand idf. In simple terms, idf is a measure of how common a word is across all the documents. If a word occurs frequently inside a document, that word will have a high term frequency; if it also occurs frequently in the majority of the documents, it will have a low inverse document frequency. Basically, idf helps us filter out words like the, i, and an, which occur frequently but are not important for determining a document's distinctiveness.
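The term counts and idf values for the two sample documents can be reproduced in plain Python; the vocabulary order below is the one implied by the feature lists above, and punctuation is dropped for simplicity.

```python
import math

docs = [
    "Bofin likes to watch movies Rahul likes movies too".lower().split(),
    "Bofin also likes to play tabla".lower().split(),
]

vocab = ["bofin", "likes", "to", "watch", "movies", "rahul", "too",
         "also", "play", "tabla"]

def tf(word, doc):
    # Raw term frequency: how many times the word appears in the document.
    return doc.count(word)

def idf(word):
    # Words present in every document get idf = log(1) = 0.
    df = sum(1 for doc in docs if word in doc)
    return math.log(len(docs) / df)

counts_doc1 = [tf(w, docs[0]) for w in vocab]  # the first feature list above
tfidf_doc1 = [tf(w, docs[0]) * idf(w) for w in vocab]
```

Words shared by both documents (bofin, likes, to) get a tf-idf of zero, while distinctive words like movies keep a high score, which is exactly the filtering effect idf provides.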
Word embeddings with word2vec
Word2vec is a popular technique to produce word embeddings. It is based on the idea that a word's meaning can be inferred from its surrounding words (like the proverb: "Tell me who your friends are and I'll tell you who you are"). It produces a vector space from a large corpus, with each unique word in the corpus assigned a corresponding vector in that space. The heart of word2vec is a two-layer neural network trained to model the linguistic context of words, such that words that share common contexts in a document end up in close proximity inside the vector space.
Word2vec uses two architectures for representation. You can (loosely) think of word2vec model creation as the process of handling every word in a document by either of these methods:
Continuous bag-of-words (CBOW): In this architecture, the model predicts the current word from its context (surrounding words within a specified window size)
Skip-gram: In this architecture, the model uses the current word to predict the context.
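The context windows that both architectures train on can be illustrated by generating (target, context) pairs, which is how skip-gram's training examples are built; CBOW uses the same windows in the opposite direction, predicting the target from its context words.

```python
def skipgram_pairs(tokens, window=1):
    """Generate (target, context) training pairs for skip-gram.

    For each position, every word within `window` positions on either
    side is paired with the current (target) word.
    """
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

pairs = skipgram_pairs("bofin likes to play tabla".split())
```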
Word2vec models trained on large text corpora (like the entire English Wikipedia) have been shown to capture some interesting relations among words, such as the classic analogy king − man + woman ≈ queen.
Following these vector space models, approaches using deep learning made progress in text representations. They can be broadly categorized as either convolutional neural network based approaches or recurrent neural network (and its successors LSTM/GRU) based approaches.
Convolutional neural networks (ConvNets) have proven impressive for computer vision applications, especially image classification. Recent research exploring ConvNets for natural language processing has shown promising results for text classification, such as charCNN, in which text is treated as a kind of raw signal at the character level and temporal (one-dimensional) ConvNets are applied to it.
Recurrent neural networks and their derivatives (LSTM/GRU) are perhaps the most celebrated neural architectures at the intersection of deep learning and natural language processing. RNNs can use their internal memory to process input sequences, making them a good architecture for several natural language processing tasks.
Although straight neural-network-based approaches to text classification have been quite effective, it has been observed that better representations can be obtained by including knowledge of document structure in the model architecture. This idea comes from the common-sense observations that:
Not all parts of a document are equally relevant for answering a query about it.
Finding the relevant sections in a document involves modeling the interactions of the words, not just their presence in isolation.
We’ll explore one such approach where word and sentence level attention mechanisms are incorporated for better document classification.
APPLYING HIERARCHICAL ATTENTION TO TEXTS
Two basic insights of hierarchical attention methods, in contrast to the traditional methods we have discussed so far, are:
Words form sentences, sentences form documents. And essentially, documents have a hierarchical structure and a representation capturing this structure can be more effective.
Different words and sentences in a document are informative to different extents.
To make this clear, let’s look at the example below:
In this restaurant review, the third sentence delivers strong meaning (positive sentiment), and the words amazing and superb contribute the most in defining the sentiment of the sentence.
Now let’s look at how hierarchical attention networks are designed for document classification. As I said earlier, these models include two levels of attention, one at the word level and one at the sentence level. This allows the model to pay less or more attention to individual words and sentences accordingly when constructing the representation of the document.
The hierarchical attention network (HAN) consists of several parts,
a word sequence encoder
a word-level attention layer
a sentence encoder
a sentence-level attention layer
Before exploring them one by one, let's understand a bit about the GRU-based sequence encoder, which is the core of both the word encoder and the sentence encoder in this architecture.
Gated Recurrent Units (GRUs) are a variation of LSTMs (Long Short-Term Memory networks), which are, in fact, a kind of Recurrent Neural Network. If you are not familiar with LSTMs, I suggest you read this wonderful article.
Unlike the LSTM, the GRU uses a gating mechanism to track the state of sequences without using separate memory cells. There are two types of gates, the reset gate and the update gate, which together control how information is updated in the state.
Refer the above diagram for the notations used in the following content.
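A single GRU step can be sketched in numpy to show how the two gates interact; the weight matrices below are random placeholders, and the bias terms are omitted for brevity.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_cell(x, h_prev, params):
    """One GRU step: the update gate z decides how much of the old state
    to keep; the reset gate r decides how much of it feeds the candidate."""
    Wz, Uz, Wr, Ur, Wh, Uh = params
    z = sigmoid(Wz @ x + Uz @ h_prev)              # update gate
    r = sigmoid(Wr @ x + Ur @ h_prev)              # reset gate
    h_tilde = np.tanh(Wh @ x + Uh @ (r * h_prev))  # candidate state
    return (1 - z) * h_prev + z * h_tilde          # new state

rng = np.random.default_rng(0)
d = 4
params = [rng.standard_normal((d, d)) * 0.1 for _ in range(6)]
h = gru_cell(rng.standard_normal(d), np.zeros(d), params)
```

The new state is a gated blend of the previous state and the candidate, which is what lets the GRU remember or forget information without an LSTM's separate memory cell.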
1. Word Encoder
A bidirectional GRU is used to get annotations of words by summarising information from both directions, thereby incorporating the contextual information. Each word $w_{it}$ is first embedded as $x_{it} = W_e w_{it}$, and two GRUs read the sentence in opposite directions:

$\overrightarrow{h}_{it} = \overrightarrow{\mathrm{GRU}}(x_{it}), \quad t \in [1, T]$
$\overleftarrow{h}_{it} = \overleftarrow{\mathrm{GRU}}(x_{it}), \quad t \in [T, 1]$

where $x_{it}$ is the word vector corresponding to the word $w_{it}$ and $W_e$ is the embedding matrix. We obtain an annotation for a given word $w_{it}$ by concatenating the forward hidden state and the backward hidden state, $h_{it} = [\overrightarrow{h}_{it}, \overleftarrow{h}_{it}]$.
2. Word Attention
Not all words contribute equally to a sentence’s meaning. Therefore we need an attention mechanism to extract such words that are important to the meaning of the sentence and aggregate the representation of those informative words to form a sentence vector.
At first, we feed the word annotation $h_{it}$ through a one-layer MLP to get $u_{it} = \tanh(W_w h_{it} + b_w)$ as a hidden representation of $h_{it}$. Then we measure the importance of the word as the similarity of $u_{it}$ with a word-level context vector $u_w$, and obtain the normalized importance weights $\alpha_{it} = \exp(u_{it}^\top u_w) / \sum_t \exp(u_{it}^\top u_w)$. The sentence vector is then the weighted sum $s_i = \sum_t \alpha_{it} h_{it}$. The context vector $u_w$ is basically a high-level representation of how informative a word is in the given sentence, and it is learned during the training process.
3. Sentence Encoder
Similar to the word encoder, here we use a bidirectional GRU to encode sentences.
The forward and backward hidden states are computed as in the word encoder, and the hidden state $h_i$ is obtained by concatenating them, $h_i = [\overrightarrow{h}_i, \overleftarrow{h}_i]$.
Now the hidden state $h_i$ summarises the neighboring sentences around sentence $i$, but still with the focus on sentence $i$.
4. Sentence Attention
To reward sentences that are clues to correctly classify a document, we again use attention mechanism at the sentence level.
Similar to the word context vector, here we also introduce a sentence-level context vector $u_s$: we compute $u_i = \tanh(W_s h_i + b_s)$, the importance weights $\alpha_i = \exp(u_i^\top u_s) / \sum_i \exp(u_i^\top u_s)$, and the document vector $v = \sum_i \alpha_i h_i$.
Now the document vector v is a high-level representation of the document and can be used as features for document classification.
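The attention pooling used at both levels of the network can be sketched in numpy; all the weights below are random placeholders, whereas in the real network they are learned during training.

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def attention_pool(H, W_att, b_att, context):
    """HAN-style attention pooling: score each annotation against a learned
    context vector, softmax the scores into importance weights, and return
    the weighted sum of annotations (a sentence or document vector)."""
    U = np.tanh(H @ W_att + b_att)   # hidden representations (u_it or u_i)
    alpha = softmax(U @ context)     # normalized importance weights
    return alpha @ H, alpha

rng = np.random.default_rng(1)
T, d = 5, 8                          # e.g. 5 words with 8-dim annotations
H = rng.standard_normal((T, d))
vec, alpha = attention_pool(H, rng.standard_normal((d, d)),
                            rng.standard_normal(d), rng.standard_normal(d))
```

At the word level H holds the word annotations and the result is a sentence vector; at the sentence level H holds the sentence annotations and the result is the document vector v.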
So we've seen how document structure and attention can be used for classification. In the benchmark studies, this method outperformed the existing popular methods by a decent margin on popular datasets such as Yelp Reviews, IMDB, and Yahoo Answers.
Let us know what you think of attention networks in the discussion section below and feel free to ask your queries.