Let us assume search engines, like Google, never existed! How would you find what you need from across 4.2 billion web pages? Web crawlers are programs that browse the internet to gather information, then index and parse the collected data to facilitate quick searches. Crawlers are thus a smart solution to handling big data sets, and a catalyst for major advancements in the field of cyber security.
In this article, we will learn:
What is crawling?
Applications of crawling
Python modules used for crawling
Use-case: Fetching downloadable URLs from YouTube using crawlers
How do CloudSEK Crawlers work?
What is crawling?
Crawling refers to the process of scraping or extracting data from websites using web crawlers. For instance, Google uses spider bots (crawlers) to read the content of billions of web pages and posts. It then gathers data from these sites and arranges it in the Google Search index.
Basic stages of crawling:
Scrape data from the source
Parse the collected data
Clean the data of any noise or duplicate entries
Structure the data as per requirement
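The four stages above can be sketched with the standard library alone (in practice, Requests and Beautifulsoup, introduced below, would handle the fetching and parsing):

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    # Parse stage: pull every href out of the scraped HTML.
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self.links.extend(v for k, v in attrs if k == 'href' and v)

def crawl(html):
    # 1. scrape (here the page is passed in; urllib.request would fetch it)
    parser = LinkExtractor()
    parser.feed(html)                  # 2. parse the collected data
    links = sorted(set(parser.links))  # 3. clean noise / duplicate entries
    return {'links': links}            # 4. structure as required

page = '<a href="/a">x</a><a href="/b">y</a><a href="/a">x again</a>'
print(crawl(page))  # {'links': ['/a', '/b']}
```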
Applications of crawling
Organizations crawl and scrape data off of web pages for various reasons that may benefit them or their customers. Here are some lesser known applications of crawling:
Comparing data for market analysis
Monitoring data leaks
Preparing data sets for Machine Learning algorithms
Fact-checking information on social media
Python modules used for crawling
Requests – Allows you to send HTTP requests to web pages
Beautifulsoup – Python library that retrieves data from HTML and XML files and parses their elements into the required format
Selenium – An open source testing suite for web applications, which can also drive browser actions to retrieve data
Use-case: Fetching downloadable URLs from YouTube using crawlers
A single YouTube video may have several downloadable URLs, based on: its content, resolution, bitrate, range and VR/3D. Here is a sample API and CLI code to get downloadable URLs on YouTube along with their Itags:
The project will contain three files:
app.py: For the API interface, using the Flask micro framework
cli.py: For the command-line interface, using the argparse module
core.py: Contains all the core (common) functionality, which acts as helper functions for app.py and cli.py
import flask
from flask import jsonify, request
import core  # helper module with the common functionality

app = flask.Flask(__name__)
app.config["DEBUG"] = True

@app.route('/downloadable-urls', methods=['GET'])
def downloadable_urls():
    if 'url' not in request.args:
        return "Error: No url field provided. Please specify a YouTube URL."
    url = request.args['url']
    urls = core.get_downloadable_urls(url)
    return jsonify(urls)
The Flask interface code to get downloadable URLs through the API.
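The cli.py counterpart can be sketched with argparse (a minimal sketch; the exact argument names are assumptions):

```python
import argparse

def build_parser():
    # Command-line interface described above, built with argparse.
    parser = argparse.ArgumentParser(
        description='Fetch downloadable URLs and their itags for a YouTube video')
    parser.add_argument('url', help='YouTube video URL')
    return parser

if __name__ == '__main__':
    args = build_parser().parse_args()
    # cli.py would pass this on to core.get_downloadable_urls(args.url)
    print(args.url)
```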
ITag gives us more details about the video such as the type of video content, resolution, bitrate, range and VR/3D. A comprehensive list of YouTube format code ITags can be found here.
How do CloudSEK Crawlers work?
CloudSEK’s digital risk monitoring platform, XVigil, scours the internet across surface web, dark web, and deep web, to automatically detect threats and alert customers. After configuring a list of keywords suggested by the clients, CloudSEK Crawlers:
Fetch data from various sources on the internet
Push the gathered data to a centralized queue
ML Classifiers group the data into threats and non-threats
Threats are immediately reported to clients as alerts, via XVigil. Non-threats are simply ignored.
Threading allows a program to execute multiple streams of instructions at once; only languages with multi-threading support, such as Python, offer this technique. Several I/O operations running consecutively decelerate a program, so threading helps to achieve concurrency.
In this article, we will explore:
Types of concurrency in Python
Global Interpreter Lock
Need for Global Interpreter Lock
Thread execution model in Python 2
Global Interpreter Lock in Python 3
Types of Concurrency In Python
In general, concurrency is the parallel execution of different units of a program, which helps optimize and speed up the overall process. In Python, there are three ways to achieve concurrency: threading, multiprocessing, and asyncio.
We will be discussing the fundamentals of thread-execution model. Before going into the concepts directly, we shall first discuss Python’s Global Interpreter Lock (GIL).
Global Interpreter Lock
Python threads are real system threads (POSIX threads). The host operating system fully manages the POSIX threads, also known as p-threads.
In multi-core operating systems, the Global Interpreter Lock (GIL) prevents the parallel execution of p-threads of a multi-threaded Python process. Thus, ensuring that only one thread runs in the interpreter, at any given time.
Why do we need Global Interpreter Lock?
The GIL helps to simplify the implementation of the interpreter and its memory management. To understand how the GIL does this, we need to understand reference counting.
For example: In the code below, ‘b’ is not a new list. It is just a reference to the previous list ‘a.’
>>> a = []
>>> b = a
Python uses reference counting variables to track the number of references that point to an object. The memory occupied by the object is released when the reference count drops to zero. If threads of a process sharing the same memory try to increment and decrement this variable simultaneously, the result can be leaked memory that is never released, or memory that is released incorrectly, and this leads to crashes.
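Reference counts are visible from Python itself via sys.getrefcount (note that the count includes the temporary reference created by the function call):

```python
import sys

a = []
b = a  # 'b' is another reference to the same list object, not a new list
# At least three references exist here: a, b, and getrefcount's argument.
print(sys.getrefcount(a))
```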
One solution is to lock the reference counting variable of each object using semaphores, so that it is not modified simultaneously. However, adding locks to all objects increases the performance overhead of acquiring and releasing locks. Hence, Python has a GIL, which gives access to all resources of a process to only one thread at a time. Apart from the GIL there are other solutions for memory management, such as the garbage collection used in the Jython interpreter.
So, the primary outcome of the GIL is that, instead of parallel computing, you get pre-emptive multitasking (threading) and co-operative multitasking (asyncio).
Thread Execution Model in Python 2
seq.py:

import time

def countdown(n):
    while n > 0:
        n -= 1

count = 50000000
start = time.time()
countdown(count)
end = time.time()
print('Time taken in seconds -', end - start)

par.py:

import time
from threading import Thread

def countdown(n):
    while n > 0:
        n -= 1

COUNT = 50000000
t1 = Thread(target=countdown, args=(COUNT//2,))
t2 = Thread(target=countdown, args=(COUNT//2,))
start = time.time()
t1.start()
t2.start()
t1.join()
t2.join()
end = time.time()
print('Time taken in seconds -', end - start)
~/Pythonpractice/blog❯ python seq.py
('Time taken in seconds -', 1.2412900924682617)
~/Pythonpractice/blog❯ python par.py
('Time taken in seconds -', 1.8751227855682373)
Ideally, the par.py execution time should be half of the seq.py execution time. However, in the above example we can see that the par.py execution time is slightly higher than that of seq.py. To understand this reduction in performance, despite sharing the work between two threads which run in parallel, we need to first discuss CPU-bound and I/O-bound threads.
CPU-bound threads are threads performing CPU intense operations such as matrix multiplication or nested loop operations. Here, the speed of program execution depends on CPU performance.
I/O-bound threads are threads performing I/O operations such as listening to a socket or waiting for a network connection. Here, the speed of program execution depends on factors including external file systems and network connections.
Scenarios in a Python multi-threaded program
When all threads are I/O-bound
If a thread is running, it holds the GIL. When the thread hits the I/O operation, it releases the GIL, and another thread acquires it to get executed. This alternate execution of threads is called multi-tasking.
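This hand-off is easy to observe. In the sketch below, time.sleep stands in for a blocking I/O call (it, too, releases the GIL), so four 0.2-second waits overlap instead of adding up:

```python
import time
from threading import Thread

def io_task():
    time.sleep(0.2)  # stands in for blocking I/O; releases the GIL

start = time.time()
threads = [Thread(target=io_task) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.time() - start
# elapsed is close to 0.2 s, not 4 * 0.2 s: the sleeping threads overlap
print(round(elapsed, 1))
```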
Where one thread is CPU bound, and another thread is IO-bound:
A CPU-bound thread is a unique case in thread execution. A CPU-bound thread releases the GIL after every few ticks and tries to acquire it again (a tick loosely corresponds to an interpreter instruction). When releasing the GIL, it also signals the next thread in the execution queue (the operating system's ready queue) that the GIL has been released. Now, the CPU-bound and I/O-bound threads race to acquire the GIL, and the operating system decides which thread gets executed. This model of executing the next thread in the execution queue, before the previous thread has completed, is called pre-emptive multitasking.
In most cases, the operating system gives preference to the CPU-bound thread and allows it to reacquire the GIL, leaving the I/O-bound thread starving. In the below diagram, the CPU-bound thread has released the GIL and signaled thread 2. But even before thread 2 tries to acquire the GIL, the CPU-bound thread has reacquired it. This issue has been resolved in the Python 3 interpreter's GIL.
In a single-core operating system, if the CPU-bound thread reacquires the GIL, it pushes the second thread back to the ready queue, assigning it some priority. This is because Python has no control over the priorities assigned by the operating system.
In a multi-core operating system, if the CPU-bound thread reacquires the GIL, it does not push back the second thread; instead, the second thread continuously tries to acquire the GIL using another core. This behaviour is known as thrashing, and it degrades performance when many threads try to acquire the GIL using different cores. Python 3 also addresses the issue of thrashing.
Global Interpreter Lock in Python 3
The Python 3 thread execution model has a new GIL. If there is only one thread running, it continues to run until it hits an I/O operation or another thread requests that it drop the GIL. A global variable (gil_drop_request) helps to implement this.
If gil_drop_request = 0, running thread can continue until it hits I/O
If gil_drop_request = 1, running thread is forced to give up the GIL
Instead of the CPU-bound thread checking after every few ticks, the second thread sends a GIL drop request by setting gil_drop_request = 1 after reaching a timeout. The first thread then drops the GIL immediately. Additionally, to suspend the first thread's execution, the second thread sends a signal. This helps to avoid thrashing. This check is not available in Python 2.
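In CPython 3 this timeout is exposed as the interpreter's "switch interval", which defaults to 5 milliseconds and can be tuned at runtime:

```python
import sys

# The switch interval is the timeout after which a waiting thread sets
# gil_drop_request; the default is 0.005 seconds (5 ms).
print(sys.getswitchinterval())

# A shorter interval favours I/O-bound responsiveness at the cost of
# more frequent GIL hand-offs.
sys.setswitchinterval(0.001)
```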
Missing Bits in the New GIL
While the new GIL does address issues such as thrashing, it still has some areas of improvement:
Waiting for timeout
Waiting for the timeout can make I/O responses slow, especially when there are multiple recurrent I/O operations. I/O-bound Python programs spend considerable time waiting out the timeout to reacquire the GIL, after every I/O operation and before the next input/output is ready.
Unfair GIL acquiring
As seen below, the thread that makes the GIL drop request is not the one that gets the GIL. This type of situation can reduce the performance of I/O, where response time is critical.
There is a need for the GIL to distinguish between CPU-bound and I/O-bound threads and then assign priorities. High priority threads must be able to immediately preempt low priority threads. This will help improve the response time considerably. This issue has already been resolved in operating systems. Operating systems use timeout to automatically adjust task priorities. If a thread is preempted by a timeout, it is penalized with a low priority. Conversely, if a thread suspends early, it is rewarded with raised priority. Incorporating this in Python will help improve thread performance.
What makes Android applications irresistible targets?
The ease with which an attacker can acquire the entire source code of an Android application.
The source codes often contain API keys, secret tokens, sensitive credentials, and endpoints, which developers forget to remove after staging.
Developers often lack awareness about attack vectors on a mobile app, owing to a false sense of security provided by the sandboxed, permission-oriented operating system. However, mobile apps have the same attack vectors as web apps, albeit with different exploitation techniques.
When an attacker meets an Android app
After getting their hands on an Android app’s source code, attackers first decompile it, analyse it for weaknesses, and then exploit it.
Decompiling the source code
The attacker first extracts files and folders from the apk file of an Android app, which is similar to unzipping a zip file. However, the files are compiled. To read the source code, the files are decompiled using:
Apktool: To read the AndroidManifest.xml file. It also disassembles the app to smali code and allows repacking the apk after modifications.
Dex2jar: To convert the Dex file into a Jar file.
Jadx/Jd-gui: To read the code in GUI format.
Analysing the source code
The attacker analyses the source code using two methods: static analysis and dynamic analysis.
Static analysis
In this approach, the attacker examines the source code for secret tokens, API keys, credentials, and secret paths. They also study the code, checking for activities, content providers, broadcast receivers, vulnerable permissions, local storage, sensitive files, etc.
Here are some examples of what attackers look for during static analysis:
As we can see, the strings.xml file below exposes sensitive information such as API keys, bucket name, Firebase database URL, etc.
Often, usernames and passwords are hidden in the source code, because developers forget to remove them.
Content Providers allow applications to access data from other applications. And in order to access a Content Provider, you need its URI.
Attackers check if the Content Provider attribute is ‘exported=true,’ which implies that it can be accessed by third-party applications.
Below, we access the content provider declared by the vulnerable application through the ‘content’ tool, which acts as a third-party application.
An activity implements a screen/window in an app. So, an app usually invokes only an activity i.e. a particular screen in another app, and not the app as a whole.
Attackers check for weaknesses in activities in the AndroidManifest.xml file: if an activity is marked ‘exported=true,’ a third-party app can initiate that activity.
For example, the screen below has a functionality to submit a password. And only on submitting the password, we can see the dashboard.
However, if the activity responsible for showing the dashboard is marked ‘exported= true,’ an attacker can use an Activity Manager (AM) tool to run it.
And by doing this the attacker can access the dashboard without a password.
Broadcast receivers listen for system-generated or custom generated events from other applications or from the system itself.
Attackers check for weaknesses in Broadcast Receivers in the AndroidManifest.xml file. If the intent-filter tag is declared ‘exported=true,’ a third-party application can easily access it. To prevent this, we need to explicitly declare ‘exported=false.’
After checking for the parameters that the receiver can accept, attackers can write a command that will trigger the receiver on behalf of the application.
For example: The following command triggers the receiver to send a message to a phone number.
Following which, the message is delivered to the phone number:
Dynamic analysis
In this approach, the attacker uses a binary instrumentation toolkit such as Frida to hook into a targeted application and change its implementation. It also allows the attacker to bypass root detection and SSL pinning.
Bypassing root detection
The following application has root detection. Hence, to access all the capabilities of the application, the attacker needs to bypass root detection.
First, the attacker needs to identify the function responsible for root detection, which they can find in the source code:
If either of the functions doesSuperuserApkExist() or doesSuExist() returns True, the device is identified as rooted.
So, the attacker needs to make these two functions return False, in order to bypass the root detection. And this is where Frida comes into use.
By injecting the following JS payload, the attacker changes the implementation of the functions responsible for root detection.
With the help of use(), the attacker accesses the PostLogin class and specifies the function to hook, i.e. doesSuExist or doesSuperuserApkExist. Once hooked, the function starts returning False instead of True.
The Frida client then sends this JS payload to the server. The server wraps it in a thread and hands it to the JS engine. So, whenever the application calls this function, Frida invokes the hook instead of the actual function declared in the source code.
Given the increasing sophistication of cyber-attacks, it is important that Android apps undergo proper vulnerability assessments before publishing. Also, developers should not leave sensitive information such as API keys and credentials exposed in the source code.
In the recent past, several security vulnerabilities have been discovered in widely used software products. Since these products are installed on a significant number of internet-connected devices, they entice threat actors to build botnets, steal sensitive data, and more.
In this article we explore:
Vulnerabilities detected in some popular products.
Target identification and exploitation techniques employed by intrusive threat actors.
Threat actors’ course of action in the event of identifying a flaw in widely used internet products/technology.
Popular Target Vulnerabilities and their Exploitation
Ghostcat: Apache Tomcat Vulnerability
All unpatched Apache Tomcat server versions are vulnerable to Local File Inclusion and potential RCE, a flaw dubbed Ghostcat (CVE-2020-1938). The issue resides in the AJP protocol, an optimised version of the HTTP protocol: the years-old flaw stems from a component that handled a request attribute improperly. The AJP connector, enabled by default, listens on TCP port 8009. Multiple scanners, exploit scripts, and honeypots surfaced within days of the original disclosure by Apache.
Stats published by researchers indicate a large number of affected systems, the numbers being much greater than originally predicted.
Citrix ADC and Gateway Vulnerabilities
Recently, Directory Traversal and RCE vulnerabilities in Citrix ADC and Gateway products affected at least 80,000 systems. Shortly after the disclosure, multiple entities (ProjectZeroIndia, TrustedSec) publicly released PoC scripts, which engendered a slew of exploit attempts from multiple actors in the wild.
Jira Sensitive Data Exposure
A few months ago, researchers found Jira Instances leaking sensitive information such as names, roles, email IDs of employees. Additionally, internal project details, such as milestones, current projects, owner and subscriber details, etc., were also accessible to anyone making a request to the following unauthenticated JIRA endpoints:
Avinash Jain, from Grofers, tested the vulnerability on multiple targets, and discovered a large number of vulnerable Jira instances, revealing sensitive data belonging to various companies, such as NASA, Google and Yahoo, and its employees.
Spring Boot Data Leakage via Actuators
Spring Boot is an open source Java-based MVC framework. It enables developers to quickly set up routes to serve data over HTTP. Most apps using the Spring MVC framework now also use the Boot utility, which helps developers configure which components to add and set up the framework faster.
An added feature of the tool called Actuator, enables developers to monitor and manage their applications/REST API, by storing and serving request dumps, metrics, audit details, and environment settings.
In the event of a misconfiguration, these Actuators could be a back door to the servers, making exposed applications susceptible to breaches. The misconfiguration in Spring Boot Versions 1 to 1.4 granted access to Actuator endpoints without authentication. Although later versions secure these endpoints by default, and allow access only after authentication, developers still tend to ignore the misconfiguration before deploying the application.
The following actuator endpoints leak sensitive data:
/dump – performs a thread dump and returns the dump
/trace – returns the dump of HTTP requests received by the app
/logfile – returns the app-logged content
/shutdown – commands the app to shut down gracefully
/mappings – returns a list of all the @RequestMapping paths
/env – exposes all of Spring’s ConfigurableEnvironment values
/health – returns the application’s health information
There are other such defective Actuator endpoints that allow attackers to:
Gain system information
Send requests as authenticated users (by leveraging session values obtained from the request dumps)
Execute critical commands, etc.
Webmin RCE via backdoored functionality
Webmin is a popular web-based system configuration tool. A zero-day pre-auth RCE vulnerability affects some of its versions, between 1.882 and 1.921, in which the remote password change functionality is enabled. The Webmin code repository on SourceForge was backdoored with malicious code, granting remote command execution (RCE) capability on an affected endpoint.
The attacker sends commands piped with the password-change parameters through `password_change.cgi` on the vulnerable host running Webmin. And if the Webmin app is hosted with root privileges, the adversary can execute malicious commands as an administrator.
Why do threat actors exploit vulnerabilities?
Breach user/company data: Data exfiltration of Sensitive/PII data
Computing power: Infecting systems to mine Cryptocurrency, serve malicious files
Botnets, serving malicious files: Exploits targeted at adding more bots to a larger botnet
Service disruption and eventually Ransom: Locking users out of the devices
Political reasons, cyber war, angry user, etc.
How do adversaries exploit vulnerabilities?
On disclosure of such vulnerabilities, adversaries probe the internet for technical details and exploit codes to launch attacks. The RAND Corporation's research and analysis on zero-day vulnerabilities states that, after a vulnerability disclosure, it takes 6 to 37 days, with a median of 22 days, to develop a fully functional exploit. But when an exploit disclosure comes with a patch, developers and administrators immediately patch the vulnerable software. Auto updates, regular security updates, and large-scale coverage of such disclosures help to contain attacks. However, several systems run unpatched versions of a software or application and become easy targets for such attacks.
Steps involved in vulnerability exploitation
Once a bad actor decides to exploit a vulnerability they have to:
Obtain a working exploit or develop an exploit (in case of a zero-day vulnerability)
Utilize Proof of Concept (PoC) attached to a bug report (in case of a bug disclosure)
Identify as many hosts as possible that are vulnerable to the exploit
Maximise the number of targets to maximise profits.
Even though the respective vendors patch reported vulnerabilities, searching GitHub or specific CVEs on ExploitDB still turns up PoC scripts for the issues. Usually a PoC script takes a host/URL as input and reports whether the exploit succeeds.
Adversaries identify a vulnerable host through its signatures/behaviour, to generate a list of exploitable hosts. The following components carry signatures that reveal whether a host is vulnerable:
Default ports
Default installation paths
Subdomains
Indexed content/URLs
Most commonly used software has a specific default installation port. If a port is not explicitly configured, the software installs on a pre-set port, and in most cases it stays on that default. For example, most systems use the default port 3306 for MySQL and port 9200 for Elasticsearch. So, by curating a list of all servers with port 9200 open, a threat actor can identify systems likely running Elasticsearch. However, port 9200 can be used by other services/software as well.
Using port scans to discover targets to exploit the Webmin RCE vulnerabilities
Determine that the default port where Webmin listens after installation is port 10000.
Get a working PoC for the Webmin exploit.
Execute a port scan on all hosts connected to the internet for port 10000.
This leads to the discovery of all possible Webmin installations that could be vulnerable to the exploit.
In addition, tools like Shodan make port-based target discovery effortless. If Shodan has not indexed the target port, attackers leverage tools like MassScan or Zenmap and run an internet-wide scan. The latter approach hardly takes a day if the attacker has enough resources.
Similarly, an attacker in search of an easy way to find a list of systems affected by Ghostcat, will port scan all the target IPs and narrow down on machines with port 8009 open.
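The port checks described above amount to a TCP connect probe, sketched below for a single host (real internet-wide scans use asynchronous tooling such as MassScan):

```python
import socket

def is_port_open(host, port, timeout=1.0):
    # Attempt a TCP connection; success suggests a service listens there.
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# e.g. flag hosts that may run Elasticsearch on its default port 9200
print(is_port_open('127.0.0.1', 9200))
```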
Software/services are commonly installed on a distinct default path, so the software can be fingerprinted by observing that signature path. For instance, WordPress installations can be identified if the path ‘wp-login.php’ is detected on the server, which also makes the service easy to locate through a web browser.
For example, when the phpMyAdmin utility is installed, it resides by default at the path ‘/phpmyadmin’, and users access it through this path. In this case, a port scan won’t help, because the utility doesn’t install on a specific port.
Using distinct paths to discover targets to exploit Spring Boot Data Leakage
Gather a list of hosts that run Spring Boot. Since default Spring Boot applications start on port 8080, a list of hosts with this port open helps threat actors spot a pattern.
Hit specific endpoints like ‘/trace’, ‘/env’ on the hosts and check the response for sensitive content.
Web path scanners and web fuzzer tools such as Dirsearch or Ffuf facilitate this process.
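A simplified version of this endpoint probing, using only the standard library (the path list mirrors the actuator endpoints discussed earlier):

```python
import urllib.error
import urllib.request

ACTUATOR_PATHS = ['/env', '/trace', '/health']

def check_actuators(base_url):
    # Probe well-known Spring Boot actuator paths and collect those
    # that answer HTTP 200 without authentication.
    exposed = []
    for path in ACTUATOR_PATHS:
        try:
            with urllib.request.urlopen(base_url + path, timeout=5) as resp:
                if resp.status == 200:
                    exposed.append(path)
        except (urllib.error.URLError, OSError):
            pass  # 4xx/5xx or unreachable host: not exposed
    return exposed
```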
Though responses may include false positives, actors can use techniques such as signature matching or static rule checks to narrow down the list of vulnerable hosts. As this method operates on HTTP requests and responses, it can be much slower than mass-scale port scans. Shodan can also fetch hosts from its index based on HTTP responses.
Software is commonly installed on a specific subdomain, since that is an easier, standard, and convenient way to operate it.
For example, Jira is commonly found on a subdomain as in ‘jira.domain.com’ or ‘bug-jira.domain.com’. Even though there are no rules when it comes to subdomains, adversaries can identify certain patterns. Similar services, usually installed on a subdomain, are Gitlab, Ftp, Webmail, Redmine, Jenkins, etc.
Security Trails, Circl.lu, and Rapid7 Open Data hold passive DNS records. Other sites that maintain such records are Crt.sh and Censys, which collect SSL certificate records regularly and offer an add-on feature that supports queries.
The content published by a service is generally unique. If we employ search engines such as Google to find pages serving specific content, based on particular signatures, the results yield a list of URLs running a particular service. This is one of the most common techniques to hunt down targets easily.
This technique is commonly known as ‘Google Dorking’. For instance, adversaries can quickly curate a short list of all cPanel login pages using the following dork in Google Search: “site:cpanel.*.* intitle:”login” -site:forums.cpanel.net”. The Google Hacking Database contains numerous such dorks, and after understanding the search mechanism, it is easy to write such search queries.
There have been multiple honeypot experiments to study mass-scale exploration and exploitation in the wild. Setting up honeypots is not only a good way of understanding attack patterns; it also helps identify the malicious actors trying to exploit systems in the wild. The IPs/networks caught enumerating targets or exploiting vulnerable systems end up in various public blacklists. Various research attempts have set up diverse honeypots and studied the techniques used to gain access; most attempts rely on default credentials and originate mainly from blacklisted IP addresses.
Another interesting observation is that most honeypot-detected traffic seems to originate from China. It is also very common to see honeypots specific to a zero-day surface on GitHub soon after the release of an exploit. The Citrix ADC vulnerability (CVE-2019-19781) also saw a few honeypots published on GitHub within a short time after the first exploit PoC was released.
Research carried out by Sophos used honeypots to highlight the high rate of activity against exposed targets. As reported in the research paper, the first attack on an exposed target came anywhere from under a minute to 2 hours after exposure. Therefore, if an accidental misconfiguration leaves a system exposed to the internet, even for a short period of time, it should not be assumed that the system was not exploited.
The purpose of content, regardless of the choice of font, size, colour, and design, is to communicate one's thoughts. The aesthetic of the text is just as important as the composition of sentences and paragraphs in conveying ideas to the reader with clarity. So, it's not just what, but also how, a writer presents their content that determines its impact on the reader. Better typography produces better content.
How does typography affect content?
Typography is crucial in invoking and sustaining a reader’s interest from the first line to the last.
It can enhance the overall readability of the text.
It makes the reading experience less effortful, thus allowing the reader to absorb the content with ease.
90% of the content we consume on the web, in books, posters, and emails, is text. So, typography is not a mere afterthought. It is a primary factor that determines the reach and impact of the content it presents.
In essence, typography is the art and technique of arranging type, where type refers to letters and other characters that are arranged to form textual content.
How to choose a Typeface?
Writers are often confronted with the task of choosing the appropriate font to deliver their content.
Here are a few factors to consider when choosing a typeface:
Choosing a typeface with multiple weights to create textual contrast
Varying the text weights will add contrast to the content. Apart from beautifying the text, it also helps to distinguish key text fragments from the rest of the content.
Tailor the content to suit your audience
While creating content, the writer should have the intended audience in mind. The readers’ age, interests, and cultural backgrounds determine what they read.
For instance, children and adults prefer different types of fonts.
To stimulate an interest in reading, young readers need well-shaped and legible letters, such as Sassoon Primary or Gill Sans. Similarly, commonly used font styles such as Roboto or Futura are more appealing to adults.
How to add variety and flair to text content?
Contrast helps to draw readers' focus to certain words or phrases. It also establishes hierarchy within the content and, by building relationships between sections, creates a seamless content flow. This makes it easy to navigate the different sections of the content.
Tips: To reflect hierarchy in the content, font size of the text can be adjusted by 1.5x to 2x. This together with a variation in the weight of the text, will help the writer create visible gradations within, and between, sections.
How to make text more engrossing?
Creating a style guide for typography will help writers standardize type and enhance legibility across their content.
Keeping the right line height for the text content
The vertical space between lines is crucial to the visual impact of the content. Narrow line spacing makes the content look crammed and proves tedious to the reader. Increasing line spacing improves readability drastically, though it adds to the content's real estate.
Tips: For better legibility, the line height of the text should be between 1.2 to 1.5 times the size of the font.
Typography plays a significant role in the process of writing textual content. Good typography makes reading effortless, whereas poor typography is off-putting. Since typography is as much an art as it is a technique, the scope for experimentation is unlimited. Writers should play with different styles and patterns before settling on the one that suits their content best.
Payment gateways, such as Wibmo, CCAvenue, and PayUbiz, facilitate payments on thousands of online portals. And customers implicitly trust them to secure their transactions. But, as reported by a security researcher, a flaw in the logical design of a previous version of Wibmo payment gateway put its customers at risk. This was because the payment gateway did not distinguish between transactions initiated within the same time frame.
Payment gateways serve as a channel of communication between merchants and banks, to conduct secure transactions. The gateway encrypts the transaction information, which includes the credit/debit card number, CVV, expiry date, etc., and passes it on to the payment processor, which acts as the link between the user’s bank and the merchant’s bank. If the information is valid, the gateway confirms the payment, and the processor settles it with the merchant’s bank.
One Time Passwords for gateways
In order to secure transactions, payment gateways that implement 3-D Secure add time-based One Time Passwords (OTPs) as an additional layer of authentication. The payment gateway only accepts OTPs submitted within the permitted time frame; after that, the OTP is no longer valid. Even though this additional layer of authentication should secure transactions, a vulnerable gateway could reduce its efficacy. A payment gateway that cannot distinguish between transactions could permit unauthorized ones.
Flaw in the design of Wibmo Payment Gateway
Wibmo fails to distinguish between transactions processed during a single 180-second time frame.
So, the OTP generated for one transaction is valid for any other transaction in the same time period, irrespective of the amount or geo-location.
This vulnerability increases the possibility of a man-in-the-middle (MITM) attack, in which the attacker forges the request.
And if the OTP remains unused for the first few seconds or minutes, attackers can conduct fraudulent transactions within the validity period of the OTP.
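A gateway closes this gap by binding each OTP to the specific transaction it was issued for. The sketch below is a simplified, hypothetical illustration of that design; the function names and in-memory store are invented for this example, not a description of Wibmo’s actual implementation:

```python
import secrets
import time

# Hypothetical sketch: each OTP is bound to the transaction it was issued
# for, is single-use, and expires after 180 seconds.
OTP_VALIDITY_SECONDS = 180
_pending = {}  # otp -> (txn_id, amount, issued_at)

def issue_otp(txn_id, amount):
    otp = f"{secrets.randbelow(1_000_000):06d}"
    _pending[otp] = (txn_id, amount, time.time())
    return otp

def verify_otp(otp, txn_id, amount, now=None):
    now = time.time() if now is None else now
    entry = _pending.get(otp)
    if entry is None:
        return False
    bound_txn, bound_amount, issued_at = entry
    if now - issued_at > OTP_VALIDITY_SECONDS:
        del _pending[otp]
        return False  # expired
    # The check the vulnerable gateway lacked: the OTP only authorizes
    # the exact transaction (and amount) it was issued for.
    if (txn_id, amount) != (bound_txn, bound_amount):
        return False
    del _pending[otp]  # single use
    return True
```

With this binding, an intercepted OTP is useless for a different transaction, even within the 180-second window.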
Explaining the flaw through a scenario
A user initiates a legitimate transaction for Re.1.
They receive an OTP, on their registered mobile number, which is valid for 180 seconds.
Before the user applies the OTP to that transaction, an attacker intercepts the OTP and uses it to process a transaction for Rs.1000. Irrespective of the attacker’s location and transaction amount, the fraudulent transaction is considered legitimate, and the attacker successfully receives the amount.
Verification of the Wibmo Payment Gateway flaw
CloudSEK’s research team tested Wibmo with various banking systems to confirm the flaw. We found that the same OTP is valid for 180 seconds or more, for any transaction, provided the OTP has not been used already. The screenshots below prove the same:
With the increasing number of online transactions, flaws such as Wibmo’s make users vulnerable to threat actors. Apart from financial losses, it could impact the reputation of the payment gateway, and the online portals using it.
Note: Wibmo became aware of this flaw on the 3rd of August, 2019. The security team at Wibmo closed the issue and marked it as known functionality on August 12, 2019, and the flaw was publicly disclosed on August 25, 2019. Wibmo recommends that portals using its payment gateway fix the vulnerability, to avoid security incidents.
FASTag Phishing Campaigns Flourish on Social Media
With FASTag, toll collection is the latest of our everyday services to go digital. And, as is their wont, cyber criminals have already figured out ways to exploit it. FASTag, an Electronic Toll Collection (ETC) instrument, is mandated by the Government of India for all vehicles passing through toll booths across the country. Considering the growing adoption, combined with users’ limited experience, it is not surprising that scammers are launching phishing campaigns using simple social engineering approaches.
In this article, we explore the different types of phishing campaigns and the channels that facilitate them.
FASTag Phishing Campaigns
Though FASTag is a straightforward service, there are several avenues, ranging from distribution to after-sales support, through which scammers can exploit it.
Scammers are defrauding people in the following ways:
Selling fake FASTags
Recruiting other scammers
Selling FASTag distributor rights
Operating fake helpline numbers
Providing unblocking services for blacklisted FASTags
Scammers are delivering these campaigns via:
Deep web sites
Surface web sites
We will investigate each of these scamming methods and the channels used to facilitate them. While FASTag scammers are present across the internet, they are especially active on social media because of how easy it is to create accounts and conceal their identities.
Selling Fake FASTags
There are social media profiles personally promoting the “FASTag” project implementation (especially in local languages), even though they are not officially authorized by, or connected to, the project.
Some accounts are also offering services on behalf of authorized FASTag banking partners, by advertising the bank’s name along with their personal contact numbers. Since we cannot verify whether such individuals are authorized to act on behalf of these financial institutions, it is best to avoid responding to their posts or availing their services.
There are also social media posts that are promising free FASTags and FASTag services, even though the actual price is INR 500. However, they appear trustworthy to the general public because some of these campaigns include genuine images.
Since FASTag became mandatory on 1st December 2019, we have observed phishing emails, delivered from various networks, to personal email IDs. Many of these campaigns use the classical approach of furnishing lookalike “from” names. In this case, ‘FASTag’, in some form, appears in the name of the sender. The domain name of the email is only visible when we purposely expand the ‘from’ address. This allows scammers to mislead receivers of the emails, since we don’t generally inspect the sender’s complete email address.
As seen above, the sender’s name is ‘Axis FASTag’, and only on closer inspection do we notice that the email id is [email protected] and the domain name is indiafamous.info. The website’s location is listed as Bihar. It is safe to assume that this email is a phishing attempt. (We have noticed that previous phishing campaigns targeting NPCI were also mapped to the same location.)
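Expanding the ‘from’ address can also be done programmatically, with Python’s standard email utilities. The address below is invented for illustration; the point is that the display name can claim anything, while only the domain of the actual address reveals the sender:

```python
from email.utils import parseaddr

# Split a raw "From" header into its display name and real address,
# then pull out the domain, which is what actually identifies the sender.
def sender_domain(from_header):
    display_name, address = parseaddr(from_header)
    domain = address.rsplit("@", 1)[-1] if "@" in address else ""
    return display_name, domain

# Hypothetical phishing header: the name says "Axis FASTag",
# but the domain is unrelated to the bank.
name, domain = sender_domain('Axis FASTag <offers@example-phish.info>')
```

A mail client hides everything except the display name by default, which is exactly what these campaigns rely on.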
Given the size of the targeted audience, scammers will not spare any platform through which they can prey on the public.
Here is a case of an OLX listing that is advertising Axis Bank’s FASTag service.
Further investigation threw up listings like the ones below, in which the prices have been inflated. By inflating and then reducing the price of the tags, scammers are trying to make their proposition more attractive. This is a major red flag that is indicative of a phishing campaign.
We also observed that some of the vendors are offering free GPS along with the tags. And the tags themselves are listed at prices lower than the actual cost of INR 500. But, it is not clear from the listing, if a standalone GPS comes free with the purchase of a FASTag.
As seen in the post below, a vendor ‘Vivek Shukla’ from UP has listed FASTag as “Fastage”, along with a GPS app that is not officially associated with FASTag.
Deep web campaigns
We have spotted a series of phishing campaigns on various blogs and deep web sites. These advertisements offer FASTag services by using the names of popular banks such as Axis Bank, HDFC Bank, etc.
These campaigns are also being widely spread through chat platforms such as Sharechat.
On clicking the link, the user is redirected to an ad-hosted campaign that is not connected with Axis Bank FASTag services. Visiting these malicious links makes the visitor’s device vulnerable to malicious software, such as adware or other PUPs (Potentially Unwanted Programs). This, in turn, creates a backdoor to all vital information on the device and helps scammers fund other malicious campaigns they run.
Moreover, on analysing the page through VirusTotal, it was found to be listed as spam.
Ad campaigns on other sites
We spotted ad campaigns on other unrelated websites, such as a music download service, through which unwary users can be clickjacked onto phishing sites.
Surface web sites
The official ways to buy FASTags are via NPCI, authorized banking partners such as ICICI or HDFC, wallet partners such as UPI or Airtel Payments, or authorized vendors. Yet there are similar-looking domains, registered to individuals, that masquerade as official vendors of FASTag.
Though the above mentioned sites are not functional at the moment, there is a chance that they may become available at any time, to host phishing campaigns, by assuming an air of legitimacy.
These are only a few examples of domains that use some version of “fastag” in their name. There are many more, yet to be listed or found. Some of these domain names, which have not been bought yet, are available at cheap prices.
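Screening newly registered domains for such keywords can be scripted. The sketch below uses a hypothetical allowlist and a naive substring check; real monitoring (as a product like XVigil would do) also has to handle typosquats, homoglyphs, and subdomain tricks:

```python
# Hypothetical allowlist of legitimate domains; a real deployment would
# maintain a vetted list of official FASTag partner domains.
OFFICIAL_DOMAINS = {"fastag.org", "netc.org.in"}

def is_suspicious(domain, keyword="fastag"):
    """Flag a domain that trades on the keyword but is not allowlisted."""
    domain = domain.lower().strip(".")
    if domain in OFFICIAL_DOMAINS:
        return False
    # Ignore hyphens so that e.g. "free-fastag.in" is still caught.
    return keyword in domain.replace("-", "")
```

Run over a feed of new domain registrations, even this crude filter surfaces candidates for manual review.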
Recruiting other scammers
While scammers directly exploit new FASTag users, they also attempt to recruit other people to carry out such campaigns. Here are examples of such posts, from a private Facebook group, in which a scammer has advertised FASTag as an opportunity to make money.
Selling FASTag distributor rights
Authorized sales and service providers/vendors employ agents to sell and top-up FASTags. However, we have observed the presence of unauthorized people, on closed Facebook groups, who are selling free agent IDs. This is why FASTags procured from 3rd-party agents may or may not be genuine.
Here are some examples of Facebook posts offering free Agents IDs.
Operating fake FASTag helpline numbers
There are posts on social networking sites that are advertising phone numbers and email ids that are not the official FASTag support contacts. They offer to set up FASTags or provide other related support. Calling such numbers is a sure-fire way to get defrauded.
We also found several unofficial social media accounts listing email ids that mimic the official email contact and vendor names. For example, [email protected] contains “FASTag” and “HDFC”, thus setting up a honeypot for unsuspecting people looking for genuine support.
As observed from the above post, threat actors are advertising FASTag at a discounted price of INR 300, even though the original price is INR 500. People tempted by such offers call these numbers and become easy victims.
A well-crafted poster appeals to the general public as an advertisement for a reliable/legitimate service. Upon investigation, we found that the service provider ‘Aveon’ does not have an official website.
Providing unblocking services for blacklisted FASTags
As with any new service, FASTag has a few ongoing issues. Some tags appear as ‘blacklisted,’ while passing through the toll gate, even though there is sufficient balance in the owner’s wallet. Consequently, scammers are exploiting this loophole in the system, by launching a campaign that offers unblocking of “blacklisted” tags.
How do we avoid becoming victims?
In conclusion, these examples are just the tip of the iceberg of ongoing scams. But they clearly show that if we, as end users of FASTag, are not vigilant, we can become easy victims of these malicious campaigns.
End user precautions:
Don’t rely on individual vendors. Instead, buy FASTags from NETC or from other official banks.
Don’t reveal OTPs received on your phone to anyone via call, or in person.
Never fill forms found on blogs or websites with look-alike domains that include the keyword “fastag”.
Never click on hyperlinks provided in phishing emails, especially with subject lines such as “Free FASTag” etc.
Avoid calling random toll-free numbers, especially those flashed on third-party websites/blogs. Instead, reach out to NETC or official bank helplines for support.
Above all, don’t post or tweet any of your personal/transaction details (if you have not received your FASTag after applying for it), as this would help fraudsters customize their approach based on your specific problem.
During the 2019–2020 holiday season, XVigil identified several phishing sites targeting popular eCommerce companies. Many of the domains were registered in December and were subsequently taken down after Christmas or New Year. This indicates that the sites’ main targets were shoppers eager to avail holiday discounts.
Detection of phishing sites
XVigil’s fake domain finder monitors the web for fake or similar looking domains that might infringe on a brand. When we calibrated XVigil to monitor Indian eCommerce companies, we detected a wide range of phishing domains.
Firstly, we ascertained the phishing sites’ domain details, including the server, IP, registrant, and admin.
Prima facie, we were able to determine that the sites had certain similarities:
Irrespective of the eCommerce site being targeted, the most common payment platform was Paytm payment gateway.
Many of the sites, including 2 Paytm phishing sites (hxxp://paytm-megaoffer.com*, hxxp://wowbuzz4.com/pytm_mall*), were hosted on the same IP. So, both sites could be the work of the same scammer/group of scammers.
Some sites, though not hosted on the same server, share overall website design, look and feel, site navigation, and data input methods.
Paytm phishing analysis
The sites appear familiar and trustworthy because:
The look and feel of the sites are similar to the official Paytm site.
Usage of Paytm logo.
Transacting through the widely trusted Paytm payment gateway.
The sites list a limited number of products, but at highly discounted prices. For example: the listed price of the iPhone 11 is INR 5999. And there is a countdown that indicates the offer is valid only for the next few minutes. These factors make it tempting, for even the most discerning of customers, to make hasty purchases.
The following characteristics of the sites are proof of the scammers’ rudimentary technical skills:
Presence of default or dummy content.
Poor web design features such as blurred images and grammatical errors.
Poor coding practices such as the absence of validation of details entered in the phone number and pin code fields.
The conspicuous lack of HTTPS certificates.
Limited product catalogue.
Unbelievably low pricing.
How the phishing sites work
The shopper browses the site and adds the product to the cart.
The billing section collects the customer’s personal details including phone number, email id, and address. The scammers could use these details to devise other fraudulent schemes.
The customer is directed to the payment page.
The customer then lands on the Paytm payment gateway to complete the transaction.
Paytm Payment Gateway Analysis
Many phishing sites, irrespective of the eCommerce company they are targeting, use the Paytm payment gateway. It is notable that there are merchants registered with fake names such as ‘for’. One of the merchants goes by ‘One Communications’. The name closely mimics One97 Communications, which is Paytm’s parent company; lending the site an air of legitimacy.
From the source code of the payment pages we identified the following merchant details:
hxxp://paytm-megaoffer.com* Merchant: One Communications MID: kRdXWH24078674748775
hxxp://paytmmallcart.com* Merchant Name: for MID: GPZvOS78323169981271
hxxp://flipkart-loot-offers.com* Merchant: Online Mobile Shop MID: kLJwiy42558605770665
hxxp://newyearflipkart.com* Merchant: Lucky Mobile And Lamination MID: nixGaL07658395498481
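Merchant details like these can be recovered from saved page source with a simple pattern search. The snippet below is a sketch: the HTML is a mock-up, and the MID pattern (6 letters followed by 14 digits) is merely inferred from the samples listed above, not a documented Paytm format:

```python
import re

# Pattern inferred from observed MID samples: 6 letters + 14 digits.
MID_RE = re.compile(r'"?MID"?\s*[:=]\s*"?([A-Za-z]{6}\d{14})"?')

# Mocked-up payment-page source; in practice this would be the saved
# HTML of a phishing site's checkout page.
html = '''
<script>
  var config = {"MID": "kRdXWH24078674748775", "ORDER_ID": "OID1"};
</script>
'''

mids = MID_RE.findall(html)  # ['kRdXWH24078674748775']
```

Extracted MIDs can then be reported to the payment gateway to get the merchant accounts blocked.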
Source Code Analysis
We analysed the source codes of both the sites and discovered that hxxp://paytm-megaoffer.com* was importing the hxxp://wowbuzz4.com/pytm_mall* source code.
It was found that hxxp://paytm-megaoffer.com* and hxxp://wowbuzz4.com/pytm_mall* have the same Google Analytics ID (UA-131481750-1). It is uncommon for 2 unrelated sites to have the same Google Analytics ID.
This indicates that both the sites belong to the same scammer/ group of scammers.
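Comparing analytics IDs across suspect pages is easy to script. The sketch below uses invented page snippets standing in for fetched page source:

```python
import re

# Legacy Universal Analytics IDs have the form UA-<account>-<property>.
GA_ID_RE = re.compile(r"UA-\d{4,10}-\d{1,4}")

def analytics_ids(page_source):
    return set(GA_ID_RE.findall(page_source))

# Mock page sources; real pages would be fetched and saved separately.
page_a = "ga('create', 'UA-131481750-1', 'auto');"
page_b = "gtag('config', 'UA-131481750-1');"

# A non-empty intersection links the two sites to one operator's account.
shared = analytics_ids(page_a) & analytics_ids(page_b)
```

Shared tracking IDs are a classic pivot for attributing multiple phishing sites to one actor.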
The contact details used to register hxxp://paytmmallcart.com* are not available, and that of hxxp://wowbuzz4.com/pytm_mall* cannot be traced back to any person or organization. However, hxxp://paytm-megaoffer.com* can be traced back to Parate Traders, a business in Nagpur.
Despite having different name servers, hxxp://wowbuzz4.com/pytm_mall* and hxxp://paytm-megaoffer.com* are hosted on the same IP. Therefore, whoever runs hxxp://paytm-megaoffer.com*, is likely responsible for hxxp://wowbuzz4.com/pytm_mall* also.
Impact of phishing
Phishing scams are among the oldest and most rampant types of cyber threats. They are fairly simple to orchestrate, but have the potential to severely impact a company’s reputation and revenue.
Apart from the targeted eCommerce companies, phishing also damages the reputation of the payment gateway that facilitates the fraud. Paytm for Business enables a variety of online and offline transactions. Hence its reputation, among shoppers and legitimate merchants, will be tarnished by the concerted misuse.
We found a social media post from a user who claims to have lost money to a Paytm phishing site. Beyond the immediate loss of money, users could become victims of other scams that leverage the personal details collected via the phishing sites.
Considering how easy it is to buy a domain, phishing cannot be tackled by taking down pages or sites. Also, companies often detect phishing sites, only after users have been affected. To begin with, eCommerce companies should proactively monitor and take down phishing sites. In addition, Paytm should also disable/block the scammers’ Paytm for Business accounts. This will hinder transactions on all phishing sites that use the same merchant accounts.
In the long term, eCommerce companies should identify and counteract the servers that host these phishing sites. Furthermore, they should also take action against scammers, whom they can identify, by leveraging the domain details and MIDs.
Phishing sites such as hxxp://paytm-megaoffer.com*, hxxp://wowbuzz4.com/pytm_mall*, and hxxp://paytmmallcart.com* are not anomalies. When combined with the misuse of the Paytm payment gateway, these scams indicate a concerted effort to exploit Paytm and its users.
A company’s brand image is the fruit of sustained effort and strategic planning. However, it takes only one malicious attack to undo the hard-won trust and goodwill of their customers. And any damage to this intangible asset can have serious and far-reaching consequences.
A continuous monitoring tool, such as CloudSEK’s XVigil, helps companies sustain continuous brand monitoring, to effectively combat fake pages, impostors, rogue applications, and domains.
*Note: All http links have been obfuscated to hxxp to avoid spam alerts.
The dark web, which is a component of the deep web, is the nesting ground of online, as well as offline criminal activities. Though most of us have a general understanding of the dark web, we are still unaware of the specific activities it facilitates, and how it affects us on a daily basis.
ATMs are a common part of our everyday lives, yet we know little about how ATMs can be exploited, by even the most novice of attackers. At CloudSEK, we have unearthed a range of techniques and devices, that are used and sold on the dark web, for the purpose of hacking ATMs.
There used to be a time when hacking an ATM required sophisticated skills and tools. Not anymore. We have encountered amateurs with rudimentary skills, who have hacked ATMs, using the tools and tutorials available on dark web marketplaces. This is possible because the devices sold on the dark web come with detailed instruction manuals. And most of these devices can be operated remotely, using an Antenna, to target systems that run on basic Windows XP.
ATM Malware Card
On the dark web, anybody can buy an ATM Malware Card that comes with a PIN Descriptor, Trigger Card, and an Instruction Guide. The manual provides step-by-step instructions on how to use the card to dispense cash from ATMs. Once the ATM Malware Card is installed in the ATM, it captures the card details of all the customers who subsequently use the ATM. The Trigger Card is then used to dispense cash from the ATM.
The image above shows the product description provided on dark web marketplaces to advertise the features and benefits. This malware mainly targets ATMs that run on Windows XP. The card is capable of drawing out all the money available in the affected machine, which could amount to as much as $500,000. The product description is so detailed that even a layman can use it to hack an ATM.
USB ATM Malware
Another prevalent method to fraudulently dispense cash from ATM Machines, is by infecting them with a Malware hosted USB drive. This method also targets machines that run on Windows XP.
This image describes the product in simple words, with details about what files are contained in the USB drive, and instructions on how to use it to orchestrate an attack.
ATM SKIMMER SHOP (ALL IN ONE)
Apart from individual sellers, there are also online shops that sell such products. One such shop is the ATM Skimmer Shop (all in one), that offers ATM hacking appliances such as EMV Skimmers, GSM Receivers, ATM Skimmers, PoSs, Gas Pumps, Deep Inserts, etc.
The same shop also offers prepaid credit cards with high balances at different price points. The shop updates its stock with the latest cracking devices released in the market, such as POS terminals, upgraded antennas, custom-made ATM skimmers, RFID readers/writers, etc. This shop was previously available on the surface web, but is now available only on the dark web. Here, hacking devices that need to be physically attached to ATMs, such as the ATM Insert Skimmer or Deep Insert, are also sold.
The image above describes the benefits of using an insert skimmer to hack an ATM. It is advertised as a “plug and play” product, implying that it is a ready-to-use product.
Anyone who has access to the dark web and this shop can order any of their products, hassle-free. Another such online shopping site is Undermarket, which claims to sell bank fullz and physical bank cards on its platform.
There are underground hacking forums that discuss and sell tutorials on how to hack bank accounts using Botnets, and other such topics. Forums such as Optimus Store, sell these malicious files for $100.
A recently uncovered, active ATM jackpotting method that uses malware is called Ploutus-D. It works by compromising components of a well-known multivendor ATM software, to gain control over hardware devices such as dispensers, card readers, and PIN pads. It allows the hacker to dispense all the cash from affected machines in a few minutes. The source code for this malware, along with instructions on how to use it, is sold on the dark web.
As hacking tools and techniques become ubiquitous, it is important to be aware and vigilant, by understanding new and sophisticated trends in hacking, and how you can defend yourself against them.
XVigil Solutions provide organizations unified supervision across the internet, their brand, and their infrastructure. It yields analytics and actionable intelligence, needed to tackle external threats, by deploying comprehensive security scans and monitors.
Text classification is an important task in Natural Language Processing in which predefined categories are assigned to text documents. In this article, we will explore recent approaches for text classification that consider document structure as well as sentence-level attention.
In general, a text classification workflow is like this:
You collect a lot of labeled sample texts for each class (your dataset). Then you extract some features from these text samples. The text features from the previous step along with the labels are then fed into a machine learning algorithm. After the learning process, you’ll save your classifier model for future predictions.
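The workflow above can be sketched in a few lines of toy Python. The classifier here is a deliberately naive count-overlap scorer, and the tiny training set is invented purely for illustration; a real system would use a proper library and far more data:

```python
from collections import Counter

# Step 1: labeled sample texts for each class (a toy, invented dataset).
train = [
    ("the movie was wonderful and moving", "pos"),
    ("a wonderful superb performance", "pos"),
    ("dull plot and terrible acting", "neg"),
    ("terrible boring and dull", "neg"),
]

# Step 2: feature extraction — here, simple word counts.
def features(text):
    return Counter(text.lower().split())

# Step 3: "training" — accumulate word counts per class.
centroids = {}
for text, label in train:
    centroids.setdefault(label, Counter()).update(features(text))

# Step 4: the saved "model" predicts by scoring word overlap per class.
def predict(text):
    feats = features(text)
    return max(centroids,
               key=lambda lbl: sum(centroids[lbl][w] * c
                                   for w, c in feats.items()))
```

For example, `predict("a superb and wonderful film")` scores highest against the positive class.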
One difference between classical machine learning and deep learning, when it comes to classification, is that in deep learning, feature extraction and classification are carried out together, whereas in classical machine learning they are usually separate tasks.
Proper feature extraction is an important part of machine learning for document classification, perhaps more important than choosing the right classification algorithm. If you don’t pick good features, you can’t expect your model to work well. Before discussing further the feature extraction, let’s talk about some methods of representing text for feature extraction.
A text document is made of sentences, which in turn are made of words. Now the question is: how do we represent them in a way that features can be extracted efficiently?
Document representation can be based on two models:
Standard Boolean models: These models use Boolean logic and set theory for information retrieval from text. They are inefficient compared to the vector space models below, for many reasons, and are not our focus.
Vector space models: Almost all of the existing text representation methods are based on VSMs. Here, documents are represented as vectors.
Let’s look at two techniques by which vector space models can be implemented for feature extraction.
Bag of words (BoW) with TF-IDF
In the Bag of Words model, the text is represented as the bag (multiset) of its words. The example below will make it clear.
Here are two sample text documents:
(i) Bofin likes to watch movies. Rahul likes movies too.
(ii) Bofin also likes to play tabla.
Based on the above two text documents, we can construct a list of words in the documents as follows.
Now if you consider a simple Bag of words model with term frequency (the number of times a term appears in the text), feature lists for the above two examples will be,
(i) [1, 2, 1, 1, 2, 1, 1, 0, 0, 0]
(ii) [1, 1, 1, 0, 0, 0, 0, 1, 1, 1]
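These feature lists can be reproduced in a few lines of Python, using the same vocabulary order as the word list constructed above:

```python
# Vocabulary in the order the article lists the words.
vocab = ["bofin", "likes", "to", "watch", "movies", "rahul",
         "too", "also", "play", "tabla"]

def bow_vector(text):
    # Lowercase, strip periods, and count each vocabulary word.
    tokens = text.lower().replace(".", "").split()
    return [tokens.count(word) for word in vocab]

doc1 = "Bofin likes to watch movies. Rahul likes movies too."
doc2 = "Bofin also likes to play tabla."

bow_vector(doc1)  # [1, 2, 1, 1, 2, 1, 1, 0, 0, 0]
bow_vector(doc2)  # [1, 1, 1, 0, 0, 0, 0, 1, 1, 1]
```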
A simple term frequency like this is not always a good way to characterize text. For large text documents, we use something called term frequency–inverse document frequency (tf-idf). tf-idf is the product of two statistics: term frequency (tf) and inverse document frequency (idf). We’ve just seen from the above example what tf is; now let’s understand idf. In simple terms, idf is a measure of how common a word is across all the documents. If a word occurs frequently inside a document, that word will have a high term frequency; if it also occurs frequently in the majority of the documents, it will have a low inverse document frequency. Basically, idf helps us filter out words like the, i, an that occur frequently but are not important for determining a document’s distinctiveness.
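Here is a minimal sketch of the tf-idf computation over an invented three-document corpus; real pipelines would typically use a library such as scikit-learn:

```python
import math

# Toy corpus, invented for illustration.
corpus = [
    "the cat sat on the mat",
    "the cat sat on the log",
    "the cat chased the dog",
]
docs = [doc.split() for doc in corpus]
N = len(docs)

def idf(term):
    # How many documents contain the term; log-scale the inverse ratio.
    df = sum(term in doc for doc in docs)
    return math.log(N / df)

def tf_idf(term, doc):
    # tf (raw count in this document) weighted by idf.
    return doc.count(term) * idf(term)
```

As expected, a word like “the” that appears in every document gets an idf of zero, so its tf-idf vanishes, while a word like “mat” that appears in only one document carries real weight.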
Word embeddings with word2vec
Word2vec is a popular technique to produce word embeddings. It is based on the idea that a word’s meaning can be perceived by its surrounding words (Like the proverb: “Tell me who your friends are and I’ll tell you who you are” ). It produces a vector space from a large corpus, with each unique word in the corpus being assigned a corresponding vector in the vector space. The heart of word2vec is a two-layer neural network which is trained to model linguistic context of words such that words that share common contexts in a document are in close proximity inside the vector space.
Word2vec uses two architectures for representation. You can (loosely) think of word2vec model creation as the process of processing every word in a document using either of these methods.
Continuous bag-of-words (CBOW): In this architecture, the model predicts the current word from its context (surrounding words within a specified window size)
Skip-gram: In this architecture, the model uses the current word to predict the context.
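Both architectures train on (word, context) pairs taken from a sliding window; they differ only in the direction of prediction. The sketch below shows pair extraction for a window size of 1 on a toy sentence. It illustrates only the pair generation, not the neural network training itself; CBOW would use the same windows with the prediction direction reversed:

```python
def skipgram_pairs(tokens, window=1):
    """(target, context) pairs: the current word predicts each neighbor."""
    pairs = []
    for i, target in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

sentence = "bofin likes to play".split()
skipgram_pairs(sentence)
# [('bofin', 'likes'), ('likes', 'bofin'), ('likes', 'to'),
#  ('to', 'likes'), ('to', 'play'), ('play', 'to')]
```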
Word2vec models trained on large text corpora (like the entire English Wikipedia) have shown to grasp some interesting relations among words as shown below.
Following these vector space models, approaches using deep learning made progress in text representations. They can be broadly categorized as either convolutional neural network based approaches or recurrent neural network (and its successors LSTM/GRU) based approaches.
Convolutional neural networks (ConvNets) have proven impressive for computer vision applications, especially image classification. Recent research that explores ConvNets for natural language processing tasks has shown promising results for text classification, like charCNN, in which text is treated as a kind of raw signal at the character level, and temporal (one-dimensional) ConvNets are applied to it.
Recurrent neural network and its derivatives might be perhaps the most celebrated neural architectures at the intersection of deep learning and natural language processing. RNNs can use their internal memory to process input sequences, making them a good architecture for several natural language processing tasks.
Although plain neural network based approaches to text classification have been quite effective, it has been observed that better representations can be obtained by including knowledge of document structure in the model architecture. This idea stems from the common-sense observations that:
Not all parts of a document are equally relevant for answering a query about it.
Finding the relevant sections in a document involves modeling the interactions of the words, not just their presence in isolation.
We’ll explore one such approach where word and sentence level attention mechanisms are incorporated for better document classification.
APPLYING HIERARCHICAL ATTENTION TO TEXTS
Two basic insights of hierarchical attention methods, in contrast to the traditional methods discussed so far, are:
Words form sentences, sentences form documents. And essentially, documents have a hierarchical structure and a representation capturing this structure can be more effective.
Different words and sentences in a document are informative to different extents.
To make this clear, let’s look at the example below:
In this restaurant review, the third sentence delivers strong meaning (positive sentiment), and the words amazing and superb contribute the most in defining the sentiment of the sentence.
Now let’s look at how hierarchical attention networks are designed for document classification. As I said earlier, these models include two levels of attention, one at the word level and one at the sentence level. This allows the model to pay less or more attention to individual words and sentences accordingly when constructing the representation of the document.
The hierarchical attention network (HAN) consists of several parts:
a word sequence encoder
a word-level attention layer
a sentence encoder
a sentence-level attention layer
Before exploring them one by one, let’s understand a bit about the GRU-based sequence encoder, which is the core of both the word encoder and the sentence encoder in this architecture.
Gated Recurrent Units (GRU) are a variation of LSTMs (Long Short-Term Memory networks), which are, in fact, a kind of Recurrent Neural Network. If you are not familiar with LSTMs, I suggest you read this wonderful article.
Unlike the LSTM, the GRU uses a gating mechanism to track the state of sequences without using separate memory cells. There are two types of gates: the reset gate and the update gate. Together they control how information is updated in the state.
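As a rough illustration of those two gates, here is a scalar (single-unit) GRU step with made-up weights. For brevity all three computations share the same weights here; a real GRU learns separate weight matrices for the update gate, the reset gate, and the candidate state:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gru_step(x, h_prev, w=0.5, u=0.5, b=0.0):
    z = sigmoid(w * x + u * h_prev + b)                 # update gate
    r = sigmoid(w * x + u * h_prev + b)                 # reset gate
    # Candidate state: the reset gate decides how much past state to use.
    h_tilde = math.tanh(w * x + u * (r * h_prev) + b)
    # Update gate interpolates between the old state and the candidate.
    return (1 - z) * h_prev + z * h_tilde

# Run the cell over a short input sequence.
h = 0.0
for x in [1.0, -1.0, 0.5]:
    h = gru_step(x, h)
```

Because the state is a convex combination of the old state and a tanh-squashed candidate, it always stays in (-1, 1).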
Refer to the diagram above for the notations used in the following content.
1. Word Encoder
A bidirectional GRU is used to get annotations of words by summarising information from both directions for words and thereby incorporating the contextual information.
where xit is the word vector corresponding to the word wit, and We is the embedding matrix.
We obtain an annotation for a given word wit by concatenating the forward hidden state and backward hidden state,
2. Word Attention
Not all words contribute equally to a sentence’s meaning. Therefore we need an attention mechanism to extract such words that are important to the meaning of the sentence and aggregate the representation of those informative words to form a sentence vector.
At first, we feed the word annotation hit through a one-layer MLP to get uit, a hidden representation of hit. Then we compute the normalized importance weight αit through a softmax over the similarity of uit with a word-level context vector uw, as shown in the above equation. The context vector is basically a high-level representation of how informative a word is in the given sentence, and it is learned during the training process.
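A toy version of this word-attention step is shown below, with fixed stand-in values in place of the learned parameters (the annotations and context vector are invented, and the MLP projection is omitted for brevity):

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    exps = [math.exp(x - max(xs)) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def word_attention(annotations, u_w):
    # Similarity of each word annotation with the context vector u_w.
    scores = [sum(a * b for a, b in zip(h, u_w)) for h in annotations]
    alphas = softmax(scores)
    # Sentence vector: attention-weighted sum of the word annotations.
    sent = [sum(alpha * h[d] for alpha, h in zip(alphas, annotations))
            for d in range(len(annotations[0]))]
    return alphas, sent

# Invented annotations h_it for a 3-word sentence, in 2 dimensions.
annotations = [[0.1, 0.0], [0.9, 0.2], [0.8, 0.1]]
alphas, sentence_vec = word_attention(annotations, u_w=[1.0, 1.0])
```

The weights sum to 1, and the word whose annotation best matches the context vector dominates the sentence vector. The sentence-level attention described below reuses the same computation over sentence vectors.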
3. Sentence Encoder
Similar to the word encoder, here we use a bidirectional GRU to encode sentences.
The forward and backward hidden states are calculated similarly to the word encoder, and the hidden state hi is obtained by concatenating them.
Now the hidden state hi summarises the neighboring sentences around sentence i, while still focusing on sentence i.
4. Sentence Attention
To reward sentences that are clues to correctly classifying a document, we again use an attention mechanism, at the sentence level.
Similar to the word context vector, here also we introduce a sentence level context vector us.
Now the document vector v is a high-level representation of the document and can be used as features for document classification.
So we’ve seen how document structure and attention can be used as features for classification. In the benchmark studies, this method outperformed the existing popular methods by a decent margin on popular datasets such as Yelp Reviews, IMDB, and Yahoo Answers.
Let us know what you think of attention networks in the discussion section below and feel free to ask your queries.