Internet service providers generally face the risk of authentication-related attacks, spam, Denial-of-Service attacks, and data mining bots. Completely Automated Public Turing test, to tell Computers and Humans apart, popularly known as CAPTCHA, is a challenge-response test created to selectively restrict access to computer systems. As a type of Human Interaction Proof, or a human authentication mechanism, CAPTCHA generates challenges to identify users. In essence, a CAPTCHA test can tell machines/ computers and humans apart. This has caused a heightened adoption of CAPTCHAs across various online businesses and services.

The concept of CAPTCHA depends on human sensory and cognitive skills. These skills enable humans to read a distorted text image or choose specific images from several different images. Generally, computers and computer programs such as bots are not capable of interpreting a CAPTCHA as they generate distorted images with text or numbers, which most Optical Character Recognition (OCR) technologies fail to make sense of. However, with the help of Artificial Intelligence, algorithms are getting smarter and bots are now capable of cracking these tests. For instance, there are bots that are capable of solving a text CAPTCHA through letter segmentation mechanisms. That said, there aren’t a lot of automated CAPTCHA solving algorithms available.

This article outlines the various methods of generating and verifying CAPTCHAs, their application, and multiple ways to bypass CAPTCHAs.

Reasons for using CAPTCHA

Web developers deploy CAPTCHAs on websites to ensure that they are protected against bots. CAPTCHAs are generally used to prevent:

Bots from registering for services such as free email.
Scraper bots from gathering your credentials or personal information, upon logging in or while making online payments.
Bots from submitting online responses.
Brute-force bot attacks.
Search engine bots from indexing pages with personal/ sensitive information.

General flow of CAPTCHA generation and verification

The image below represents the common method of generating and verifying CAPTCHAs:

Application of different types of CAPTCHA and how to bypass them

I. reCAPTCHA and the protection of websites

recaptcha

Google reCAPTCHA is a free service offered to prevent spam and abuse of websites. It uses advanced risk analysis techniques and allows only valid users to proceed.

How to bypass reCAPTCHA?

Verification using browser extensions

Browser extensions such as Buster help solve CAPTCHA verification challenges. Buster, for instance, uses speech recognition software to bypass reCAPTCHA audio challenges. reCAPTCHA allows users to download audio files. Once it is downloaded, Google’s own Speech Recognition API can be used to solve the audio challenge.

CAPTCHA solving services

Online CAPTCHA solving services offer human based services. Such services involve actual human beings hired to solve CAPTCHAs.

II. Real person CAPTCHA and automated form submissions

The jQuery real person CAPTCHA plugin prevents automated form submissions by bots. These plugins offer text-based CAPTCHAs in a dotted font. This solves the problem of fake form submissions.

How to bypass real person CAPTCHA?

The following steps can be used to solve real person CAPTCHAs:

A. Create data set

In this one-time process:

Collect texts from real person HTML tags
Group the texts based on the words
Create data set model for A-Z words (training data)

B. Testing to predict the solutions

After successfully completing process A, set up a process to:

Collect texts from real person HTML tags
Group the texts based on the words
Fetch the word from the data set model created in process A.

Example:



from selenium import webdriver
import time


dataset = {'     *       *      *      *      *      ******* ': 'J',
           '*******      *      *      *      *      *      *': 'L',
           '********  *  **  *  **  *  **  *  **  *  * ** ** ': 'B',
           '*       *       *       ****  *     *     *      ': 'Y',
           '*      *      *      ********      *      *      ': 'T',
           ' ***** *     **     **     **     **     * *   * ': 'C',
           '********  *  **  *  **  *  **     **     **     *': 'E',
           '********     **     **     **     **     * ***** ': 'D',
           '*     **     **     *********     **     **     *': 'I',
           ' ***** *     **     **     **     **     * ***** ': 'O',
           '******* *       *       *     *     *     *******': 'M',
           '******* *       *       *       *       * *******': 'N',
           '********  *   *  *   *  *   *      *      *      ': 'F',
           ' **  * *  *  **  *  **  *  **  *  **  *  * *  ** ': 'S',
           ' ***** *     **     **     **   * **    *  **** *': 'Q',
           '*******   *     * *    * *   *   *  *   * *     *': 'K',
           '     **   **   ** *  *   *   ** *     **       **': 'A',
           '******       *      *      *      *      ******* ': 'U',
           '*******   *      *      *      *      *   *******': 'H',
           '**       **       **       *    **   **   **     ': 'V',
           '*     **    ***   * **  *  ** *   ***    **     *': 'Z',
           '********  *   *  *   *  *   *  *   *  *    **    ': 'P',
           '*     * *   *   * *     *     * *   *   * *     *': 'X',
           ' ***** *     **     **     **   * **   * * *  ** ': 'G',
           '********  *   *  *   *  *   *  **  *  * *  **   *': 'R',
           '*******     *     *     *       *       * *******': 'W'}


def group_captcha_string(word_pos):
    captcha_string = ''
    for i in range(len(word_pos[0])):
        temp_list = []
        temp_string = ''

        for j in range(len(word_pos)):
            val = word_pos[j][i]
            temp_string += val

            if val.strip():
                temp_list.append(val)

        if temp_list:
            captcha_string += temp_string
        else:
            captcha_string += 'sp'

    return captcha_string.split("spsp")


# create client
client = webdriver.Chrome()
client.get("http://keith-wood.name/realPerson.html")
time.sleep(3)

# indexing text
_get = lambda _in: {index: val for index, val in enumerate(_in)}

# get text from html tag
captcha = client.find_element_by_css_selector('form [class="realperson-text"]').text.split('\n')

word_pos = list(map(_get, captcha))

# group text
text = group_captcha_string(word_pos)

# get text(test)
captcha_text = ''.join(list(map(lambda x: dataset[x] if x else '', text)))
print("captcha:", captcha_text)

III. Text-in-image CAPTCHA

Text-based/ text-in-image CAPTCHAs are the most commonly deployed kind and they use distorted text rendered in an image. There are two types of text-based CAPTCHAs:

Simple CAPTCHA

Simple CAPTCHAs can be bypassed using the Optical Character Recognition (OCR) technology that recognizes the text inside images, such as scanned documents and photographs. This technology converts images containing written text into machine-readable text data.

simple

Example:



import pytesseract
import sys
import argparse
try:
    import Image
except ImportError:
    from PIL import Image
from subprocess import check_output


def resolve(path):
    print("Resampling the Image")
    check_output(['convert', path, '-resample', '600', path])
    return pytesseract.image_to_string(Image.open(path))


if __name__=="__main__":
    argparser = argparse.ArgumentParser()
    argparser.add_argument('path', help = 'Captcha file path')
    args = argparser.parse_args()
    path = args.path
    print('Resolving Captcha')
    captcha_text = resolve(path)
    print('Extracted Text', captcha_text)


# command to run script
python3 captcha_resolver.py cap.jpg

Complicated CAPTCHA

These text-in-image CAPTCHAs are too complex to be solved using the OCR technology. Instead the following measures can be considered:

Build machine learning models such as Convolutional Neural Network (CNN) or Recurrent Neural Network (RNN)
Resort to CAPTCHA solving services

IV. Sum of integers or logical operations

This unique challenge involves solving mathematical problems, particularly, finding the sum of integers.

logical captcha

To bypass this challenge, one can:

Extract text from HTML tags or images
Identify the operator
Perform the logic
Get the result

V. Mitigating DDoS attacks using CAPTCHAs

In distributed denial-of-service attacks, cyber criminals target network resources and render them inaccessible to users. These attacks temporarily or indefinitely slows down the target resource by flooding the target with incoming traffic from several hosts. To prevent such attacks, businesses use CAPTCHAs.

DDoS

The following methods or programs can be used to bypass DDoS protected sites:

JavaScript supported browsers (Chrome/ Firefox)
Deriving logic to generate DDoS answers
Fetch the DDoS problem on the site and execute it using node.js

Author

Sellamani C.

He is a Senior Software Engineer working as a part of the Data Acquisition team at CloudSEK. In his role, he is responsible for writing reusable codes and scalable web crawlers for XVigil. In his spare time, Sellamani loves to take on new challenges and find solutions to real-time problems.