Internet service providers generally face the risk of authentication-related attacks, spam, Denial-of-Service attacks, and data mining bots. The Completely Automated Public Turing test to tell Computers and Humans Apart, popularly known as CAPTCHA, is a challenge-response test created to selectively restrict access to computer systems. As a type of Human Interaction Proof, or human authentication mechanism, CAPTCHA generates challenges to verify that a user is human. In essence, a CAPTCHA test tells computers and humans apart, which has led to heightened adoption of CAPTCHAs across various online businesses and services.
The concept of CAPTCHA depends on human sensory and cognitive skills. These skills enable humans to read a distorted text image or pick specific images out of a set. Generally, computer programs such as bots cannot interpret a CAPTCHA, because the test presents distorted images of text or numbers that most Optical Character Recognition (OCR) technologies fail to make sense of. However, with the help of Artificial Intelligence, algorithms are getting smarter and bots are now capable of cracking these tests. For instance, some bots can solve a text CAPTCHA through letter segmentation mechanisms. That said, there aren’t many automated CAPTCHA solving algorithms available.
This article outlines the various methods of generating and verifying CAPTCHAs, their application, and multiple ways to bypass CAPTCHAs.
Reasons for using CAPTCHA
Web developers deploy CAPTCHAs on websites to ensure that they are protected against bots. CAPTCHAs are generally used to prevent:
- Bots from registering for services such as free email.
- Scraper bots from gathering your credentials or personal information, upon logging in or while making online payments.
- Bots from submitting online responses.
- Brute-force bot attacks.
- Search engine bots from indexing pages with personal or sensitive information.
General flow of CAPTCHA generation and verification
The image below represents the common method of generating and verifying CAPTCHAs:
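The details depend on the implementation, but a typical flow can be sketched in a few lines: the server generates a random answer, renders it as a distorted image (omitted here), and later checks the submitted text against a signed token. This is a minimal sketch under assumptions, not the flow of any particular CAPTCHA product. `SECRET_KEY`, `generate_challenge`, and `verify_response` are hypothetical names, and signing the answer with HMAC is only one possible design (keeping the answer in a server-side session is another).

```python
import hashlib
import hmac
import secrets
import string
import time

# Hypothetical server-side secret; in practice it would come from configuration.
SECRET_KEY = secrets.token_bytes(32)
ALPHABET = string.ascii_uppercase + string.digits


def _sign(answer, issued_at):
    """Sign the (case-insensitive) answer together with its issue time."""
    message = f"{answer.upper()}:{issued_at}".encode()
    return hmac.new(SECRET_KEY, message, hashlib.sha256).hexdigest()


def generate_challenge(length=6):
    """Return (answer, token).

    The answer would be rendered as a distorted image for the user;
    only the token travels with the form.
    """
    answer = "".join(secrets.choice(ALPHABET) for _ in range(length))
    issued_at = int(time.time())
    token = f"{issued_at}:{_sign(answer, issued_at)}"
    return answer, token


def verify_response(user_input, token, max_age=120):
    """Check the submitted text against the signed token."""
    try:
        issued_at_text, signature = token.split(":", 1)
        issued_at = int(issued_at_text)
    except ValueError:
        return False  # malformed token
    if time.time() - issued_at > max_age:
        return False  # challenge expired
    expected = _sign(user_input.strip(), issued_at)
    return hmac.compare_digest(expected, signature)
```

A real deployment would also need to reject tokens that have already been used once, which this sketch does not attempt.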
Application of different types of CAPTCHA and how to bypass them
I. reCAPTCHA and the protection of websites
Google reCAPTCHA is a free service offered to prevent spam and abuse of websites. It uses advanced risk analysis techniques and allows only valid users to proceed.
How to bypass reCAPTCHA?
Verification using browser extensions
Browser extensions such as Buster help solve CAPTCHA verification challenges. Buster, for instance, uses speech recognition software to bypass reCAPTCHA audio challenges. reCAPTCHA allows users to download the audio file of a challenge. Once the file is downloaded, a speech recognition service such as Google’s own Speech Recognition API can be used to solve the audio challenge.
CAPTCHA solving services
Online CAPTCHA solving services are human-based: they hire actual people to solve CAPTCHAs on the customer's behalf.
II. Real person CAPTCHA and automated form submissions
The jQuery Real Person CAPTCHA plugin is designed to prevent automated form submissions by bots. It presents a text-based CAPTCHA rendered in a dotted font, which is intended to address the problem of fake form submissions.
How to bypass real person CAPTCHA?
The following steps can be used to solve real person CAPTCHAs:
A. Create data set
In this one-time process:
- Collect the dotted text from the Real Person HTML tags
- Group the text into individual characters
- Create a data set that maps each character pattern to its letter, A-Z (training data)
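The one-time process above can be sketched roughly as follows. This is an illustration under assumptions rather than the exact procedure used here: `split_into_glyphs` and `build_dataset` are hypothetical helpers, the three-row font in the usage example is invented for brevity, and the sketch assumes that characters are separated by at least one completely blank column and contain no blank columns themselves.

```python
def split_into_glyphs(rows):
    """Read the dotted text column by column and cut it at blank columns.

    Each glyph is returned as one string: its columns, top to bottom,
    concatenated from left to right.
    """
    width = max(len(row) for row in rows)
    # Pad rows so that stripped trailing spaces do not shorten any column.
    padded = [row.ljust(width) for row in rows]
    glyphs, current = [], []
    for col in range(width):
        column = "".join(row[col] for row in padded)
        if column.strip():
            current.append(column)
        elif current:
            glyphs.append("".join(current))
            current = []
    if current:
        glyphs.append("".join(current))
    return glyphs


def build_dataset(samples):
    """Map each glyph pattern to its letter.

    `samples` is an iterable of (rows, solution) pairs, where `solution`
    is the manually confirmed text of that CAPTCHA.
    """
    dataset = {}
    for rows, solution in samples:
        glyphs = split_into_glyphs(rows)
        if len(glyphs) != len(solution):
            continue  # segmentation did not match the label; skip the sample
        for pattern, letter in zip(glyphs, solution):
            dataset.setdefault(pattern, letter)
    return dataset


# Invented three-row sample spelling "LIT"; dots stand in for spaces
# so that the spacing stays visible.
sample_rows = [r.replace(".", " ") for r in ["*...*..***",
                                             "*...*...*",
                                             "**..*...*"]]
training_data = build_dataset([(sample_rows, "LIT")])
```

Collecting more labelled samples until every letter from A to Z has an entry yields the lookup table used in the prediction step.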
B. Testing to predict the solutions
After successfully completing process A, set up a process to:
- Collect the dotted text from the Real Person HTML tags
- Group the text into individual characters
- Look up each character in the data set created in process A.
Example:
from selenium import webdriver
import time
dataset = {' * * * * * ******* ': 'J',
'******* * * * * * *': 'L',
'******** * ** * ** * ** * ** * * ** ** ': 'B',
'* * * **** * * * ': 'Y',
'* * * ******** * * ': 'T',
' ***** * ** ** ** ** * * * ': 'C',
'******** * ** * ** * ** ** ** *': 'E',
'******** ** ** ** ** * ***** ': 'D',
'* ** ** ********* ** ** *': 'I',
' ***** * ** ** ** ** * ***** ': 'O',
'******* * * * * * *******': 'M',
'******* * * * * * *******': 'N',
'******** * * * * * * * * ': 'F',
' ** * * * ** * ** * ** * ** * * * ** ': 'S',
' ***** * ** ** ** * ** * **** *': 'Q',
'******* * * * * * * * * * * *': 'K',
' ** ** ** * * * ** * ** **': 'A',
'****** * * * * ******* ': 'U',
'******* * * * * * *******': 'H',
'** ** ** * ** ** ** ': 'V',
'* ** *** * ** * ** * *** ** *': 'Z',
'******** * * * * * * * * * ** ': 'P',
'* * * * * * * * * * * * *': 'X',
' ***** * ** ** ** * ** * * * ** ': 'G',
'******** * * * * * * ** * * * ** *': 'R',
'******* * * * * * *******': 'W'}
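def group_captcha_string(word_pos):
    # word_pos holds one {column index: character} dict per text row
    captcha_string = ''
    for i in range(len(word_pos[0])):
        temp_list = []
        temp_string = ''
        # read column i from top to bottom
        for j in range(len(word_pos)):
            val = word_pos[j][i]
            temp_string += val
            if val.strip():
                temp_list.append(val)
        if temp_list:
            captcha_string += temp_string
        else:
            # completely blank column: insert a separator marker
            captcha_string += 'sp'
    # two blank columns in a row separate one character from the next
    return captcha_string.split("spsp")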
# create client
client = webdriver.Chrome()
client.get("http://keith-wood.name/realPerson.html")
time.sleep(3)  # crude wait for the page and the plugin to load
# indexing text: map each row to {column index: character}
_get = lambda _in: {index: val for index, val in enumerate(_in)}
# get text from html tag
# ("css selector" is the locator string behind By.CSS_SELECTOR; the
# find_element_by_* helpers were removed in Selenium 4)
captcha = client.find_element("css selector", 'form [class="realperson-text"]').text.split('\n')
word_pos = list(map(_get, captcha))
# group text
text = group_captcha_string(word_pos)
# get text(test)
captcha_text = ''.join(dataset[x] if x else '' for x in text)
print("captcha:", captcha_text)
client.quit()
III. Text-in-image CAPTCHA
Text-based, or text-in-image, CAPTCHAs are the most commonly deployed kind; they use distorted text rendered in an image. There are two types of text-based CAPTCHAs:
Simple CAPTCHA
Simple CAPTCHAs can be bypassed using the Optical Character Recognition (OCR) technology that recognizes the text inside images, such as scanned documents and photographs. This technology converts images containing written text into machine-readable text data.
Example:
import argparse
from subprocess import check_output

import pytesseract
from PIL import Image


def resolve(path):
    # Resample the image in place with ImageMagick's `convert` before OCR
    print("Resampling the Image")
    check_output(['convert', path, '-resample', '600', path])
    return pytesseract.image_to_string(Image.open(path))


if __name__ == "__main__":
    argparser = argparse.ArgumentParser()
    argparser.add_argument('path', help='Captcha file path')
    args = argparser.parse_args()
    path = args.path
    print('Resolving Captcha')
    captcha_text = resolve(path)
    print('Extracted Text', captcha_text)
# command to run script
python3 captcha_resolver.py cap.jpg
Complicated CAPTCHA
These text-in-image CAPTCHAs are too complex to be solved with OCR technology alone. Instead, the following measures can be considered:
- Build machine learning models such as Convolutional Neural Network (CNN) or Recurrent Neural Network (RNN)
- Resort to CAPTCHA solving services
IV. Sum of integers or logical operations
This type of challenge asks the user to solve a simple mathematical problem, most often finding the sum of integers.
To bypass this challenge, one can:
- Extract text from HTML tags or images
- Identify the operator
- Perform the logic
- Get the result
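Once the challenge text has been extracted, the remaining steps above amount to a small parser. The sketch below assumes the challenge is a single binary expression such as "7 + 5" embedded in a sentence. `solve_arithmetic` is a hypothetical helper, and treating `x` as multiplication and `/` as integer division are assumptions about how a given site phrases its questions.

```python
import operator
import re

# Supported operator symbols; the set is an assumption, extend as needed.
OPERATORS = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
    "x": operator.mul,
    "/": operator.floordiv,
}

# integer, operator symbol, integer, with optional spaces in between
PATTERN = re.compile(r"(-?\d+)\s*([+\-*x/])\s*(-?\d+)")


def solve_arithmetic(challenge):
    """Find the first 'a <op> b' expression in the text and evaluate it."""
    match = PATTERN.search(challenge)
    if match is None:
        raise ValueError(f"no arithmetic expression found in {challenge!r}")
    left, symbol, right = match.groups()
    return OPERATORS[symbol](int(left), int(right))
```

Challenges that spell out numbers in words, or chain several operators, would need a more capable parser than this one.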
V. Mitigating DDoS attacks using CAPTCHAs
In distributed denial-of-service (DDoS) attacks, cyber criminals target network resources and render them inaccessible to users. These attacks temporarily or indefinitely slow down the target resource by flooding it with incoming traffic from several hosts. To help prevent such attacks, businesses use CAPTCHAs.
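One way a site might decide when to present a CAPTCHA is to challenge clients whose request rate exceeds a threshold. The sketch below is an assumption about how such a gate could look, not a description of any particular DDoS protection product. `ChallengeGate`, its parameters, and the sliding-window rule are all hypothetical.

```python
import time
from collections import defaultdict, deque


class ChallengeGate:
    """Decide per client whether to serve a CAPTCHA before the real page."""

    def __init__(self, max_requests=20, window_seconds=10.0):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # client id -> recent request times
        self._verified = set()           # clients that solved a challenge

    def needs_challenge(self, client_id, now=None):
        """Record one request and report whether it should be challenged."""
        now = time.monotonic() if now is None else now
        hits = self._hits[client_id]
        hits.append(now)
        # Drop timestamps that have fallen out of the sliding window.
        while hits and now - hits[0] > self.window_seconds:
            hits.popleft()
        if client_id in self._verified:
            return False
        return len(hits) > self.max_requests

    def mark_verified(self, client_id):
        """Remember that this client solved its CAPTCHA."""
        self._verified.add(client_id)
```

In this simplified form a verified client is never challenged again; a production system would more likely let the verification expire.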
The following methods or programs can be used to bypass DDoS protected sites:
- Using JavaScript-supported browsers (Chrome/Firefox)
- Deriving the logic needed to generate the answers to the DDoS challenge
- Fetching the DDoS challenge from the site and executing it using Node.js