How to Use Prometheus to Monitor Complex Applications and Infrastructure

Prometheus is an open-source monitoring system designed to monitor containerized applications in a microservice architecture. Given the increasing complexity of modern infrastructure, it is not easy to track an application’s state, health, and performance, and the failure of one application might affect other applications’ ability to run correctly. Prometheus tracks the condition, performance, or any custom metric of an application over time, which allows an engineer to avoid unexpected situations and improve the overall functionality of a microservice infrastructure. That said, engineers can also use Prometheus to monitor monolithic infrastructure.

Before we delve into the nitty-gritty of Prometheus, we should define some important terms:

  • Metric: A named numeric measurement. For example, “cpu_usage = 1000” is a metric.
  • Target: Any containerized application exporting metrics at the ‘/metrics’ HTTP endpoint in Prometheus format.
  • Exporter: A library or piece of code that converts existing metrics into the Prometheus format.
  • Scrape: Pulling metrics from a target by making an HTTP request.

Prometheus Architecture

Prometheus has three major components:

  • Retrieval: This component scrapes metrics from all targets at configured intervals of time.
  • Time Series DB: Stores the scraped metrics as time-series data.
  • HTTP Server: Accepts queries over the time-series data as HTTP requests and returns the results as HTTP responses. The query language used here is PromQL.

 

Prometheus Architecture

Prometheus Metric Types

Prometheus provides different metric types, of which Counter, Gauge, Summary, and Histogram cover most situations. It is the job of the application to expose metrics in a predefined format that Prometheus understands. It is easy to publish these metrics from your application using the provided client libraries. Currently, libraries exist for popular languages such as Go, Python, Java, and Ruby.

In this blog, we will be using the Python client. However, it is easy to translate these concepts into other languages.

Counter

Any value that only increases with time, such as HTTP request count or HTTP error response count, can use a counter metric. A metric whose value can decrease should never use a counter. Counters have the advantage that you can query the rate at which the value increases using the rate() function.

In the example below we are counting the number of calls to a Python function:

from prometheus_client import start_http_server, Counter
import random
import time

COUNTER = Counter('function_calls', 'number of times the function is called', ['module'])

def process_request(t):
    """A dummy function that takes some time."""
    COUNTER.labels('counter_demo').inc()
    time.sleep(t)

if __name__ == '__main__':
    start_http_server(8001)
    while True:
        process_request(random.random())

This code generates a counter metric named function_calls_total. We increment function_calls_total every time the function process_request is called. Prometheus metrics have labels to distinguish similar metrics generated from different applications. In the above code, function_calls_total has a label named module with a value equal to counter_demo.

The output of

curl http://localhost:8001/metrics

will be:

# HELP function_calls_total number of times the function is called
# TYPE function_calls_total counter
function_calls_total{module="counter_demo"} 22.0
# HELP function_calls_created number of times the function is called
# TYPE function_calls_created gauge
function_calls_created{module="counter_demo"} 1.6061945420716858e+09
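
Once Prometheus is scraping this target (see the configuration section below), you can query how fast the counter grows. As a minimal sketch, the following PromQL expression returns the per-second rate of function calls averaged over the last five minutes:

rate(function_calls_total{module="counter_demo"}[5m])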

Gauge

Any value that can increase or decrease with time, such as CPU usage, memory usage, or processing time, uses a gauge metric.

In the example below we are recording the time taken by the latest call to the Python function process_request:

 

from prometheus_client import start_http_server, Gauge
import random
import time

TIME = Gauge('process_time', 'time taken for each function call', ['module'])

def process_request(t):
    """A dummy function that takes some time."""
    TIME.labels('gauge_demo').set(t)
    time.sleep(t)

if __name__ == '__main__':
    start_http_server(8002)
    while True:
        process_request(random.random())

The gauge metric supports the set(x), inc(x), and dec(x) methods to set, increment, and decrement the metric by x respectively.

The output of curl http://localhost:8002/metrics will be:

# HELP python_info Python platform information
# TYPE python_info gauge
python_info{implementation="CPython",major="3",minor="9",patchlevel="0",version="3.9.0"} 1.0
# HELP process_time time taken for each function call
# TYPE process_time gauge
process_time{module="gauge_demo"} 0.4709189918033123

Histogram

A histogram has predefined buckets with upper bounds ranging from 0.005 to 10 (seconds, by default). For example, if you want to observe the duration of every HTTP request coming to your application, each request duration falls into the matching buckets. Buckets are cumulative: if a request’s duration is ten, the value of every bucket with an upper bound of ten or more is incremented by one. The histogram also tracks the sum of all observed durations and the total number of observations.

In the example below we are observing the duration of calls to the Python function process_request:

from prometheus_client import Histogram
import time

h = Histogram('request_duration', 'Description of histogram', ['module'])
HIST = h.labels('histogram_demo')

def process_request(t):
    """A dummy function that takes some time."""
    HIST.observe(t)
    time.sleep(t)

# As in the earlier examples, start_http_server(8003) exposes the metrics
# and process_request() is called in a loop with random durations.

The output of curl http://localhost:8003/metrics will be:

# HELP request_duration Description of histogram
# TYPE request_duration histogram
request_duration_bucket{le="0.005",module="histogram_demo"} 0.0
request_duration_bucket{le="0.01",module="histogram_demo"} 0.0
request_duration_bucket{le="0.025",module="histogram_demo"} 0.0
request_duration_bucket{le="0.05",module="histogram_demo"} 2.0
request_duration_bucket{le="0.075",module="histogram_demo"} 3.0
request_duration_bucket{le="0.1",module="histogram_demo"} 3.0
request_duration_bucket{le="0.25",module="histogram_demo"} 5.0
request_duration_bucket{le="0.5",module="histogram_demo"} 9.0
request_duration_bucket{le="0.75",module="histogram_demo"} 11.0
request_duration_bucket{le="1.0",module="histogram_demo"} 16.0
request_duration_bucket{le="2.5",module="histogram_demo"} 16.0
request_duration_bucket{le="5.0",module="histogram_demo"} 16.0
request_duration_bucket{le="7.5",module="histogram_demo"} 16.0
request_duration_bucket{le="10.0",module="histogram_demo"} 16.0
request_duration_bucket{le="+Inf",module="histogram_demo"} 16.0
request_duration_count{module="histogram_demo"} 16.0
request_duration_sum{module="histogram_demo"} 7.188765686771258
# HELP request_duration_created Description of histogram
# TYPE request_duration_created gauge
request_duration_created{module="histogram_demo"} 1.60620555290144e+09

  • request_duration_bucket holds the cumulative bucket counts, with upper bounds (le) ranging from 0.005 to +Inf.
  • request_duration_sum is the sum of durations of each function call.
  • request_duration_count is the total number of function calls.
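
The default buckets suit latencies between a few milliseconds and ten seconds. If your observations fall outside this range, the Python client lets you override the boundaries; here is a minimal sketch with hypothetical bucket values:

h = Histogram('request_duration', 'Description of histogram', ['module'],
              buckets=[0.1, 0.5, 1, 5, 10, 30, 60])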

 

Summary

A summary is very similar to a histogram, except that it doesn’t store bucket information; it only keeps the sum of the observations and the count of observations.

In the example below we are observing the duration of calls to the Python function process_request:

from prometheus_client import Summary
import time

s = Summary('request_processing_seconds', 'Time spent processing request', ['module'])
SUMM = s.labels('pymo')

@SUMM.time()
def process_request(t):
    """A dummy function that takes some time."""
    time.sleep(t)

# As before, start_http_server(8004) exposes the metrics and
# process_request() is called in a loop with random durations.

The output of curl http://localhost:8004/metrics will be:

# HELP request_processing_seconds Time spent processing request
# TYPE request_processing_seconds summary
request_processing_seconds_count{module="pymo"} 20.0
request_processing_seconds_sum{module="pymo"} 8.590708531
# HELP request_processing_seconds_created Time spent processing request
# TYPE request_processing_seconds_created gauge
request_processing_seconds_created{module="pymo"} 1.606206054827252e+09

Configuring Prometheus

So far we have seen the different types of metrics and how to generate them. Now let us see how to configure Prometheus to monitor the metrics we have developed. Prometheus uses a YAML file to define the configuration.


prometheus.yaml
global:
  scrape_interval: 15s 
  evaluation_interval: 15s 
  scrape_timeout: 10s 
scrape_configs:
  - job_name: 'prometheus_demo'
    static_configs:
      - targets: ['localhost:8001', 'localhost:8002', 'localhost:8003', 'localhost:8004']

The four primary configuration options used in most situations are:

  1. scrape_interval defines the interval at which Prometheus scrapes the targets.
  2. evaluation_interval determines how frequently Prometheus evaluates the rules defined in the rules file.
  3. job_name is added as a label to any time series scraped from this config.
  4. targets defines the list of HTTP endpoints to scrape for metrics.

Starting Prometheus

tar xvfz prometheus-*.tar.gz
./prometheus --config.file=prometheus.yaml

Docker

docker run \
    -p 9090:9090 \
    -v /path/to/prometheus.yml:/etc/prometheus/prometheus.yml \
    prom/prometheus

Accessing UI

http://localhost:9090/graph

You can see time-series data as graphs as well.
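
As a starting point, here are two illustrative queries for the metrics built earlier that you can paste into the expression box (assuming the targets above are being scraped):

process_time{module="gauge_demo"}
histogram_quantile(0.95, rate(request_duration_bucket{module="histogram_demo"}[5m]))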

Alerting

If you need an alert on function_calls_total > 100, you need to set up alert rules and the Alertmanager. Prometheus uses another YAML file to define alert rules.

alertrules.yaml
groups:
- name: example
  rules:
  - alert: ExampleAlertName
    expr: function_calls_total{module="counter_demo"} > 100
    for: 10s
    labels:
      severity: low
    annotations:
      summary: example summary

prometheus.yaml
global:
  scrape_interval: 15s 
  evaluation_interval: 15s 
  scrape_timeout: 10s 
rule_files:
  - alertrules.yaml
scrape_configs:
  - job_name: 'prometheus_demo'
    static_configs:
      - targets: ['localhost:8001', 'localhost:8002', 'localhost:8003', 'localhost:8004']

Once you restart Prometheus with the new configuration, you can see the alerts listed in the Alerts section of the Prometheus UI. But these alerts reach your Slack channel or email only if you set up and configure the Alertmanager.
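
For completeness, here is a minimal sketch of how that wiring could look. The Alertmanager address and the Slack webhook URL below are placeholders, not values from this setup:

prometheus.yaml (additional section)
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']

alertmanager.yaml
route:
  receiver: 'slack-notifications'
receivers:
  - name: 'slack-notifications'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/XXX/YYY/ZZZ'
        channel: '#alerts'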

Leveraging STIX and TAXII for better Cyber Threat Intelligence (Part 1)

The modern cyberspace, with its increasingly complex attack scenarios and sophisticated modus operandi, is becoming more and more difficult to defend and secure. Given the evolving complexity of the threat landscape, the speed at which events occur, and the vast quantities of data involved, the need of the hour is a machine-readable and easily automatable system for sharing Cyber Threat Intelligence (CTI) data.

This is where STIX and TAXII come into the picture.

STIX is a structured representation of threat information that is expressive, flexible, extensible, automatable, and readable. Using STIX feeds with TAXII enables organizations to exchange cyber threat intelligence in a more structured and standardized manner, allowing for deeper collaboration against threats.

In this article, we will explore the basics of STIX and TAXII and some of their applications in the cybersecurity space.

What is STIX?

As per the OASIS guide, “Structured Threat Information Expression (STIX™) is a language and serialization format used to exchange cyber threat intelligence (CTI)”.

It is essentially a community-defined standard for sharing threat intel across organizations. Using STIX, all aspects of a potential threat, such as suspicion, compromise, and attack attribution, can be represented clearly with objects and descriptive relationships. STIX is easy to read and consume because it is in JSON format, and it can also be integrated with popular threat intel platforms such as QRadar and ThreatConnect.
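
For illustration, a minimal STIX 2.1 Indicator object for a suspicious IP address might look like the following (the identifier and values are made up for this example):

{
  "type": "indicator",
  "spec_version": "2.1",
  "id": "indicator--d81f86b9-975b-4c0b-875e-810c5ad45a4f",
  "created": "2020-11-20T09:00:00.000Z",
  "modified": "2020-11-20T09:00:00.000Z",
  "name": "Command-and-control IP address",
  "indicator_types": ["malicious-activity"],
  "pattern": "[ipv4-addr:value = '198.51.100.23']",
  "pattern_type": "stix",
  "valid_from": "2020-11-20T09:00:00Z"
}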

Applications of STIX

(UC1) Analyzing Cyber Threats

A security analyst analyses a variety of cyber threats from different sources every day. In doing so, it is important to analyse various factors of a threat such as its behaviour, modes of operation, capabilities, and threat actors. STIX objects make it easy to represent all the data required for this analysis.

(UC2) Specifying Indicator Patterns for Cyber Threats

An analyst often looks for patterns in a cyber attack or a threat feed. This includes assessing the characteristics of the threat, the relevant set of observables (Indicators of Compromise (IOCs), attachments, files, IP addresses, etc.), and the suggested course of action. This data, too, can be represented well by assigning the required STIX objects to a threat.

(UC3) Managing Cyber Threat Response Activities

Remediating or preventing a cyber attack is the most important role of a security professional. After analysing the threat data, the analyst is expected to draw up a proper remedial action plan to safeguard against future attacks. STIX enables analysts to plan such remedial action.

 

What is TAXII?

As per the OASIS guide, “Trusted Automated Exchange of Intelligence Information (TAXII™) is an application protocol for exchanging CTI over HTTPS”.

TAXII is a standard that defines a set of protocols and a RESTful API (a set of services and message exchanges) for clients and servers to exchange CTI.

TAXII defines two primary services to support a variety of common sharing models:

Collection: A server-provided repository of objects from which TAXII Clients and Servers exchange information in a request-response model.

Channel: Used when there is more than one producer; producers publish objects onto Channels, which are then consumed by TAXII Clients in a publish-subscribe model.

The TAXII 2.1 specification reserves the keywords required for Channels but does not specify Channel services. Channels and their services will be defined in a later version of TAXII.


TAXII was specifically designed to support the exchange of CTI represented in STIX, including STIX 2.1 content. It is important to note, however, that STIX and TAXII are independent standards, and TAXII can be used to transport non-STIX data.

The three principal sharing models for TAXII are:

1. Hub and spoke – one repository of information

2. Source/subscriber – one single source of information

3. Peer-to-peer – multiple groups share information

Upcoming…

In Part 2 we will delve deeper into STIX architecture, implementation, and usage, and dissect to get a deeper understanding of the different versions of TAXII, and their Client and Server implementations.

References: 

  1. https://oasis-open.github.io/cti-documentation/taxii/intro.html
  2. https://oasis-open.github.io/cti-documentation/stix/intro 
  3. https://www.first.org/resources/papers/munich2016/wunder-stix-taxii-Overview.pdf
  4. https://stixproject.github.io
The Rise of Cybercrime on Telegram and Discord and the Need for Continuous Monitoring

Instant messaging, popularly called IM or IM’ing, is the exchange of near real-time messages through a stand-alone application or embedded software. Unlike chat rooms with many users engaging in multiple and overlapping conversations, IM sessions usually take place between two users in private.

One of the core features of many instant messenger clients is the ability to see whether a friend or co-worker is online on the service — a capability known as presence. As the technology has evolved, many IM clients have added support for exchanging more than just text-based messages, allowing actions like file transfers and image sharing within the IM session.

Popular messaging services

The top three messaging apps by number of users are WhatsApp with 2 billion users, Facebook Messenger with 1.3 billion, and WeChat with 1.12 billion. Messenger is the top messaging app in the US. In 2017, approximately 260 million new conversations were taking place each day on the app, and in total, 7 billion conversations were occurring daily.

Most widely used messaging apps

The power of social media platforms lies in their ability to connect users and create unique avenues for interaction. For individuals, enterprises, and governments, they facilitate new ways of reaching their audience, promoting a product, and fostering communities.

The growing presence of cybercriminals on social media platforms

The universal appeal of social media platforms makes it equally attractive to cybercriminals. Yet the growing range of criminal risks encountered across social media remains significantly under-researched.

Cybercriminals, it seems, aren’t that different from consumers and enterprise users—they want tools that are easy to use and widely available. They prefer services that are simple, have a clean graphical user interface, are intuitive to use, and are not buggy. Localization and language support also make a difference. Cybercriminals are very careful about who they let into their exclusive club, but they also don’t want to jump through excessive (and costly) hoops to communicate with each other.

The rise of Telegram and Discord

The point of serious concern is that many Telegram and Discord groups are being used by cybercriminals to perform illegal activities such as selling exploits and botnets, offering hacking services, and advertising stolen data.

The double-edged sword of Telegram’s end-to-end encryption

The end-to-end encryption provided by Telegram has paved the way for a host of illegal activities, turning coveted online privacy into a breeding ground for crime. Telegram claims to be more secure than mass market messengers such as WhatsApp and Line. It allows, among other things, anonymous forwards, which means your forwarded messages will no longer lead back to your account. You can also unsend messages and delete entire chats from not just your own phone, but also the other person’s. In addition, it allows you to set up usernames through which you can talk to Telegram users without revealing your phone number. These two features differentiate it from WhatsApp.

Telegram’s secret chat option

Telegram’s secret chats feature uses end-to-end encryption, which means it leaves no trace on servers, it supports self-destructing messages, and it doesn’t allow forwarding. Voice calls are end-to-end encrypted as well. It even allows bots to be set up for specific tasks. Due to its rich feature set and rapid adoption, Telegram has become a sought after tool on the fraud scene. According to Telegram’s website, the app allows users to create private groups containing up to 200,000 members as well as public channels that can be accessed by anyone who has the app.

The double-edged sword of Telegram

Telegram reported in April 2020 that it was logging 1.5 million new users daily. It added that it was the most-downloaded social app in 20 markets globally. The platform has been widely adopted globally and is available in 13 languages.

How threat actors are exploiting Telegram

Until recently, fraudsters mainly utilized Telegram groups and channels to organize their communities. Groups can be best described as chat rooms in which all members can read, comment, and post. This is where fraudsters advertise, connect, and share knowledge and compromised information, akin to dark web forums. Channels, on the other hand, are groups in which only the administrator is authorized to post and regular members have access to view, similar to blogs. Fraudsters mainly use Telegram channels to advertise fraud services and products in bulk. In this way Telegram serves as a ‘Dark Web lite’ for shady businesses.

 

Examples of a Telegram channel being used as ‘Dark Web lite’

The discovery of an exploit is not in itself illegal. Indeed, it is often rewarded by software companies or related businesses that may be affected. But if an exploit is sold in the knowledge that it is going to be used to commit a crime, then there is a possibility of being charged as an accomplice. The legal ambiguities have generated another grey economy in the trading of exploits. Several accounts on social media platforms have been found to be openly vending exploits, including Injector exploits database and Exploit Packs.

Telegram channels selling exploits

Unprotected databases are one of the primary reasons for the rise in exposed user records. Reports indicate that the data posted by hackers contains authentic databases that could lead to serious concerns for affected individuals and organizations. After a data breach is disclosed, threat actors discuss it on Telegram channels and hacking forums, and an attacker can further use the data to gather sensitive information and facilitate further attacks.

Threat actors trading information and databases via Telegram

Around 30-40% of the social media sites inspected offered some form of hacking service. Very often there was an emphasis on ‘ethical’ hacking services, though there were no obvious ways to corroborate this. These groups offer tools for hacking websites, hackers for hire, and hacking tutorials. Cybercriminals in fact offer everything necessary to arrange a fraud or to conduct a personal attack. The offers are usually very specific and include malicious code and software that can help get access to personal accounts.

Hacking services being advertised and solicited on Telegram

Discord is now where the world hangs out

Discord is a real-time messaging platform that bills itself as an “all-in-one voice and text chat for gamers,” due to its slick interface, ease of use, and extensive features. The platform has made it easy to communicate with friends and create and sustain communities via text, voice, and video.

The app allows users to set up their own servers where they can chat with their friends or with others who share their interests. Discord was originally created for gamers to collaborate and communicate, but has now been widely adopted by other groups and communities ranging from local hiking clubs to art communities and study groups.

The rising popularity of Discord as a communication channel

Discord has garnered 100 million active users per month, 13.5 million active servers per week, and 4 billion servers with people talking for upwards of 4 hours per day. Discord is now where the world talks, hangs out, and builds relationships with their communities and friends. There are servers set up to function as online book clubs, fan groups for television shows or podcasts, and science discussions, to name a few. All this sounds harmless, but does Discord have a dark side? Yes, there are servers that promote illegal activities using the platform.

A convenient way to create and sustain communities and friendships

How cybercriminals are leveraging Discord

Being an encrypted service, Discord hosts numerous chat channels that promote illicit practices. Besides the obvious gaming chats, Discord is exploited to carry out other nefarious activities, like selling credit and loyalty cards, drugs, hacker resources, and doxing services. Much of the popularity has to do with the secure, encrypted, peer-to-peer communications available on the platform, allowing criminals to transact openly while avoiding scrutiny from law enforcement.

Is Discord the new dark web?

Illicit markets on Discord work much like “conventional” dark web markets on TOR. First, a buyer must locate a seller, join their network, and pay in bitcoin. Among the most popular goods on Discord marketplaces are credit cards and loyalty points. Some of these markets, having been kicked off TOR by law enforcement, have relocated their services to Discord.

Stolen credit card data, when sold on Discord or across other dark web sites, often include other identifying information such as names, email addresses, phone numbers, physical addresses, and passwords. These cards can be used to make purchases online and offline, or to buy untraceable gift cards. Loyalty points, another very popular item on Discord, can be purchased for just a few dollars (paid in bitcoin) and these can be exchanged for cash, or for goods and gift cards.

Discord being used to sell credit cards

Besides credit cards and loyalty points, some powerful hacking tools have found their way to Discord, making it possible for buyers to compromise accounts directly. One prominent example is OpenBullet, released on Microsoft’s GitHub code platform. Originally intended as a testing tool for security professionals, it was modified by hackers and spread quickly. It is easy to use, configure, and deploy, and helps server owners set up DDoS services, carding methods, and malware sales.

DDoS botnets being sold on Discord

It is easy to monitor paste websites like Pastebin because we know the structure of the sites and the type of data that is pasted. Monitoring discussions on Discord, while not as simple, is critical for organizations. And time is of the essence when it comes to detecting, and alerting organizations to, information being exchanged or discussed that pertains to their data and assets.

Cybercriminals using Discord to communicate and exchange data and services

Cybercriminals also tend to use these platforms to share news, exchange vulnerability and exploit information, and cite research work from within the cybersecurity community.

Exploits for sale on Discord

The need for continuous monitoring

This is just the beginning of cybercriminals using instant messaging platforms to further their businesses. And with the rising popularity of encrypted messaging apps, we can expect illegal activities to flourish on these platforms. Given the quick turnaround time on IM platforms, as opposed to forums where criminals first post their needs/services and then have to wait for a reply, it is only a matter of time before cybercriminals shift their transactions to these platforms. And tools like chatbots allow for automated replies and advertising, helping threat actors achieve more in less time.

This is why real-time monitoring of dark web markets, Telegram channels, and Discord servers is no longer a luxury but a basic requirement for organizations to secure their data and assets. And this is where CloudSEK’s proprietary digital risk monitoring platform XVigil can help you stay ahead of cybercriminals and their increasingly sophisticated schemes. XVigil scours the internet, including surface websites, dark web marketplaces, and messaging platforms like Telegram and Discord. It detects malicious mentions and exchanges pertaining to your organization’s digital assets and provides real-time alerts, giving you enough time to take proactive measures to prevent costly breaches and attacks.

MongoDB Sharding 101: Creating a Sharded Cluster

What is Sharding?

Sharding is a process in MongoDB that divides large data sets into smaller volumes of data and distributes them across multiple MongoDB instances. When the data within MongoDB is huge, queries run against such large data sets can lead to high CPU usage. MongoDB leverages sharding to tackle this. During this process, the large database is split into shards (subsets of data) that together logically function as one collection. Shards are implemented through clusters that consist of:

  1. Shard(s): MongoDB instances that host subsets of data. In production, every shard needs to be part of a replica set that maintains the same data. Replica sets provide redundancy and high availability, and are the basis for all production deployments.
  2. Config server: A mongod instance that holds metadata about the various MongoDB instances in the cluster. Config servers run small mongod processes. In a production environment there are usually 3 config servers, each holding a copy of the metadata, kept in sync. As long as one config server is alive, the cluster remains active.
  3. Query router: A mongos instance that is responsible for redirecting the client’s commands to the right servers. It doesn’t have a persistent state, and acts as the interface between the client application and the relevant shard. The query router gathers the information needed to answer a query, reduces false positives and false negatives in the data, and routes the query to the most accurate information sources.

MongoDB Sharding explained

In a two-part series about sharding in MongoDB, we explain:

  • How to create a sharded cluster
  • How to select a shard key

In the first part we provide a tutorial on how to create a sharded cluster, explaining all the basics of sharding.

Sharded Cluster Tutorial

To create a cluster, we need to be clear about:

  1. The number of shards required initially
  2. Replication factor and other parameters for the replica set

mongod and mongos are two different processes in MongoDB. While mongod is the primary daemon that handles data requests, data access and performs background management operations, mongos is a shard utility that processes and routes user queries and determines the location of data in the sharded cluster. The shards and config servers run the mongod process, whereas the query router server runs the mongos process.

For our illustration, let’s take 4 shards (namely a, b, c, and d) and a replication factor of 3. So in total there will be 12 shard servers (mongod processes). We’ll also run 4 query routers (mongos processes) and 3 config servers (small mongod processes), as one would in a production environment, running all of this on one server to simulate a cluster.

In production, these shards along with their replica sets would run on 12 different machines. However, config servers are lightweight processes, and with only 4 shards the load on them will be considerably low. This means you can run the config servers on any of the shard servers. The mongos processes can be run either on a shard server or directly on a client application machine.

The benefit of running mongos on a client application machine is that it runs on a local interface without having to go outside of the network. However, you should remember to enable the right security options. On the other hand, if you run it on the shard server or any other server, the client will not have to go through the process of setting the cluster up.

 

  • Data directories

The first step in the process is to create data directories for our shard servers, config servers, and logs. 

Tue Oct 27 piyushgoyal mongo_tutorial $ mkdir -p data/a0
Tue Oct 27 piyushgoyal mongo_tutorial $ mkdir -p data/a1
Tue Oct 27 piyushgoyal mongo_tutorial $ mkdir -p data/a2

Tue Oct 27 piyushgoyal mongo_tutorial $ mkdir -p data/b0
Tue Oct 27 piyushgoyal mongo_tutorial $ mkdir -p data/b1
Tue Oct 27 piyushgoyal mongo_tutorial $ mkdir -p data/b2

Tue Oct 27 piyushgoyal mongo_tutorial $ mkdir -p data/c0
Tue Oct 27 piyushgoyal mongo_tutorial $ mkdir -p data/c1
Tue Oct 27 piyushgoyal mongo_tutorial $ mkdir -p data/c2

Tue Oct 27 piyushgoyal mongo_tutorial $ mkdir -p data/d0
Tue Oct 27 piyushgoyal mongo_tutorial $ mkdir -p data/d1
Tue Oct 27 piyushgoyal mongo_tutorial $ mkdir -p data/d2

Tue Oct 27 piyushgoyal mongo_tutorial $ mkdir -p config/cfg0
Tue Oct 27 piyushgoyal mongo_tutorial $ mkdir -p config/cfg1
Tue Oct 27 piyushgoyal mongo_tutorial $ mkdir -p config/cfg2

Tue Oct 27 piyushgoyal mongo_tutorial $ mkdir logs

 

  • Config server and Shard server

Use the mongod command, as shown below, to start the config servers. We pass the configsvr parameter, which declares that the instance is a config server for the cluster.

Tue Oct 27 piyushgoyal mongo_tutorial $ mongod --configsvr --replSet cfg --dbpath config/cfg0 --port 26050 --fork --logpath logs/log.cfg0 --logappend
Tue Oct 27 piyushgoyal mongo_tutorial $ mongod --configsvr --replSet cfg --dbpath config/cfg1 --port 26051 --fork --logpath logs/log.cfg1 --logappend
Tue Oct 27 piyushgoyal mongo_tutorial $ mongod --configsvr --replSet cfg --dbpath config/cfg2 --port 26052 --fork --logpath logs/log.cfg2 --logappend

Now let’s get on with our shard servers. Here, we run the mongod command with the shardsvr parameter, which declares that the instance is part of a sharded cluster. We also name each replica set after its shard.

 

Tue Oct 27 piyushgoyal mongo_tutorial $ mongod --shardsvr --replSet a --dbpath data/a0 --port 27000 --fork --logpath logs/log.a0 --logappend

Tue Oct 27 piyushgoyal mongo_tutorial $ mongod --shardsvr --replSet a --dbpath data/a1 --port 27001 --fork --logpath logs/log.a1 --logappend

Tue Oct 27 piyushgoyal mongo_tutorial $ mongod --shardsvr --replSet a --dbpath data/a2 --port 27002 --fork --logpath logs/log.a2 --logappend


Tue Oct 27 piyushgoyal mongo_tutorial $ mongod --shardsvr --replSet b --dbpath data/b0 --port 27100 --fork --logpath logs/log.b0 --logappend

Tue Oct 27 piyushgoyal mongo_tutorial $ mongod --shardsvr --replSet b --dbpath data/b1 --port 27101 --fork --logpath logs/log.b1 --logappend

Tue Oct 27 piyushgoyal mongo_tutorial $ mongod --shardsvr --replSet b --dbpath data/b2 --port 27102 --fork --logpath logs/log.b2 --logappend


Tue Oct 27 piyushgoyal mongo_tutorial $ mongod --shardsvr --replSet c --dbpath data/c0 --port 27200 --fork --logpath logs/log.c0 --logappend

Tue Oct 27 piyushgoyal mongo_tutorial $ mongod --shardsvr --replSet c --dbpath data/c1 --port 27201 --fork --logpath logs/log.c1 --logappend

Tue Oct 27 piyushgoyal mongo_tutorial $ mongod --shardsvr --replSet c --dbpath data/c2 --port 27202 --fork --logpath logs/log.c2 --logappend


Tue Oct 27 piyushgoyal mongo_tutorial $ mongod --shardsvr --replSet d --dbpath data/d0 --port 27300 --fork --logpath logs/log.d0 --logappend

Tue Oct 27 piyushgoyal mongo_tutorial $ mongod --shardsvr --replSet d --dbpath data/d1 --port 27301 --fork --logpath logs/log.d1 --logappend

Tue Oct 27 piyushgoyal mongo_tutorial $ mongod --shardsvr --replSet d --dbpath data/d2 --port 27302 --fork --logpath logs/log.d2 --logappend

 

  • Mongos Server

In the next step, we run the mongos command to start the mongos processes. To tell mongos where the config servers are, we pass the config server addresses using the --configdb parameter. We’ll run one of the mongos processes on the standard mongo TCP port (27017) to adhere to best practices.

Tue Oct 27 piyushgoyal mongo_tutorial $ mongos --configdb cfg/127.0.0.1:26050,127.0.0.1:26051,127.0.0.1:26052 --logpath logs/log.mongos0 --port 27017 --bind_ip 0.0.0.0 --fork --logappend
Tue Oct 27 piyushgoyal mongo_tutorial $ mongos --configdb cfg/127.0.0.1:26050,127.0.0.1:26051,127.0.0.1:26052 --logpath logs/log.mongos1 --port 26061 --bind_ip 0.0.0.0 --fork --logappend
Tue Oct 27 piyushgoyal mongo_tutorial $ mongos --configdb cfg/127.0.0.1:26050,127.0.0.1:26051,127.0.0.1:26052 --logpath logs/log.mongos2 --port 26062 --bind_ip 0.0.0.0 --fork --logappend
Tue Oct 27 piyushgoyal mongo_tutorial $ mongos --configdb cfg/127.0.0.1:26050,127.0.0.1:26051,127.0.0.1:26052 --logpath logs/log.mongos3 --port 26063 --bind_ip 0.0.0.0 --fork --logappend

 

Note that we never run the config servers or the shard servers on the standard mongo TCP port, which is reserved for mongos.

Now, check if all the processes are running smoothly.

Tue Oct 27 piyushgoyal mongo_tutorial $ ps -A | grep mongo

26745 ??         0:00.95 mongod --configsvr --replSet cfg --dbpath config/cfg0 --port 26050 --fork --logpath logs/log.cfg0 --logappend

26748 ??         0:00.90 mongod --configsvr --replSet cfg --dbpath config/cfg1 --port 26051 --fork --logpath logs/log.cfg1 --logappend

26754 ??         0:00.90 mongod --configsvr --replSet cfg --dbpath config/cfg2 --port 26052 --fork --logpath logs/log.cfg2 --logappend

26757 ??         0:00.77 mongod --shardsvr --replSet a --dbpath data/a0 --port 27000 --fork --logpath logs/log.a0 --logappend

26760 ??         0:00.77 mongod --shardsvr --replSet a --dbpath data/a1 --port 27001 --fork --logpath logs/log.a1 --logappend

26763 ??         0:00.76 mongod --shardsvr --replSet a --dbpath data/a2 --port 27002 --fork --logpath logs/log.a2 --logappend

26766 ??         0:00.76 mongod --shardsvr --replSet b --dbpath data/b0 --port 27100 --fork --logpath logs/log.b0 --logappend

26769 ??         0:00.77 mongod --shardsvr --replSet b --dbpath data/b1 --port 27101 --fork --logpath logs/log.b1 --logappend

26772 ??         0:00.75 mongod --shardsvr --replSet b --dbpath data/b2 --port 27102 --fork --logpath logs/log.b2 --logappend

26775 ??         0:00.73 mongod --shardsvr --replSet c --dbpath data/c0 --port 27200 --fork --logpath logs/log.c0 --logappend

26784 ??         0:00.75 mongod --shardsvr --replSet c --dbpath data/c1 --port 27201 --fork --logpath logs/log.c1 --logappend

26791 ??         0:00.74 mongod --shardsvr --replSet c --dbpath data/c2 --port 27202 --fork --logpath logs/log.c2 --logappend

26794 ??         0:00.77 mongod --shardsvr --replSet d --dbpath data/d0 --port 27300 --fork --logpath logs/log.d0 --logappend

26797 ??         0:00.75 mongod --shardsvr --replSet d --dbpath data/d1 --port 27301 --fork --logpath logs/log.d1 --logappend

26800 ??         0:00.71 mongod --shardsvr --replSet d --dbpath data/d2 --port 27302 --fork --logpath logs/log.d2 --logappend

26803 ??         0:00.00 mongos --configdb cfg/127.0.0.1:26050,127.0.0.1:26051,127.0.0.1:26052 --logpath logs/log.mongos0 --port 27017 --bind_ip 0.0.0.0 --fork --logappend

26804 ??         0:00.24 mongos --configdb cfg/127.0.0.1:26050,127.0.0.1:26051,127.0.0.1:26052 --logpath logs/log.mongos0 --port 27017 --bind_ip 0.0.0.0 --fork --logappend

76826 ??         8:58.30 /usr/local/opt/mongodb-community/bin/mongod --config /usr/local/etc/mongod.conf

26817 ttys009    0:00.00 grep mongo

26801 ttys016    0:00.01 mongos --configdb cfg/127.0.0.1:26050,127.0.0.1:26051,127.0.0.1:26052 --logpath logs/log.mongos0 --port 27017 --bind_ip 0.0.0.0 --fork --logappend

 

  • Status and Initiation of Replica Sets

Connect to one of the shard servers. For instance, let’s connect to the first member of shard a:

Tue Oct 27 piyushgoyal mongo_tutorial $ mongo --port 27000

 

Run the following command to check the status of the replica set

> rs.status()
{
        "operationTime" : Timestamp(0, 0),
        "ok" : 0,
        "errmsg" : "no replset config has been received",
        "code" : 94,
        "codeName" : "NotYetInitialized",
        "$clusterTime" : {
                "clusterTime" : Timestamp(0, 0),
                "signature" : {
                        "hash" : BinData(0,"AAAAAAAAAAAAAAAAAAAAAAAAAAA="),
                        "keyId" : NumberLong(0)
                }
        }
}

 

The error message indicates that rs.initiate() has not yet been run for this set. To initiate the replica set, run the following command:

> rs.initiate()
{
        "info2" : "no configuration specified. Using a default configuration for the set",
        "me" : "localhost:27000",
        "ok" : 1,
        "$clusterTime" : {
                "clusterTime" : Timestamp(1604637446, 1),
                "signature" : {
                        "hash" : BinData(0,"AAAAAAAAAAAAAAAAAAAAAAAAAAA="),
                        "keyId" : NumberLong(0)
                }
        },
        "operationTime" : Timestamp(1604637446, 1)
}

 

If you run rs.status() once again, you’ll see that a single member has been added. Now, let’s add others as well to this replica set.

a:PRIMARY> rs.add("127.0.0.1:27001")
{
        "ok" : 1,
        "$clusterTime" : {
                "clusterTime" : Timestamp(1604637486, 1),
                "signature" : {
                        "hash" : BinData(0,"AAAAAAAAAAAAAAAAAAAAAAAAAAA="),
                        "keyId" : NumberLong(0)
                }
        },
        "operationTime" : Timestamp(1604637486, 1)
}

a:PRIMARY> rs.add("127.0.0.1:27002")
{
        "ok" : 1,
        "$clusterTime" : {
                "clusterTime" : Timestamp(1604637491, 1),
                "signature" : {
                        "hash" : BinData(0,"AAAAAAAAAAAAAAAAAAAAAAAAAAA="),
                        "keyId" : NumberLong(0)
                }
        },
        "operationTime" : Timestamp(1604637491, 1)
}

 

Run rs.status() to check the status of the initiation. Once it has been initiated, repeat the same process for other shards and the config server.

 

  • Add Shards

Connect to the mongos process so as to add shards to the cluster. 

Tue Oct 27 piyushgoyal mongo_tutorial $ mongo --port 27017

 

Run the command below to add shards:

mongos> sh.addShard("a/127.0.0.1:27000")
{
        "shardAdded" : "a",
        "ok" : 1,
        "operationTime" : Timestamp(1604637907, 8),
        "$clusterTime" : {
                "clusterTime" : Timestamp(1604637907, 8),
                "signature" : {
                        "hash" : BinData(0,"AAAAAAAAAAAAAAAAAAAAAAAAAAA="),
                        "keyId" : NumberLong(0)
                }
        }
}

mongos> sh.addShard("b/127.0.0.1:27100")
{
        "shardAdded" : "b",
        "ok" : 1,
        "operationTime" : Timestamp(1604638045, 6),
        "$clusterTime" : {
                "clusterTime" : Timestamp(1604638045, 6),
                "signature" : {
                        "hash" : BinData(0,"AAAAAAAAAAAAAAAAAAAAAAAAAAA="),
                        "keyId" : NumberLong(0)
                }
        }
}

mongos> sh.addShard("c/127.0.0.1:27200")
{
        "shardAdded" : "c",
        "ok" : 1,
        "operationTime" : Timestamp(1604638065, 4),
        "$clusterTime" : {
                "clusterTime" : Timestamp(1604638065, 4),
                "signature" : {
                        "hash" : BinData(0,"AAAAAAAAAAAAAAAAAAAAAAAAAAA="),
                        "keyId" : NumberLong(0)
                }
        }
}

mongos> sh.addShard("d/127.0.0.1:27300")
{
        "shardAdded" : "d",
        "ok" : 1,
        "operationTime" : Timestamp(1604638086, 6),
        "$clusterTime" : {
                "clusterTime" : Timestamp(1604638086, 6),
                "signature" : {
                        "hash" : BinData(0,"AAAAAAAAAAAAAAAAAAAAAAAAAAA="),
                        "keyId" : NumberLong(0)
                }
        }
}

Note: When you run sh.addShard("a/127.0.0.1:27000"), if you get an output as shown below, try running sh.addShard("a/127.0.0.1:27001") or sh.addShard("a/127.0.0.1:27002").

{
 "ok" : 0,
 "errmsg" : "in seed list a/127.0.0.1:27000, host 127.0.0.1:27000 does not belong to replica set a; found { hosts: [ \"localhost:27000\", \"127.0.0.1:27001\", \"127.0.0.1:27002\" ], setName: \"a\", setVersion: 3, ismaster: true, secondary: false, primary: \"localhost:27000\", me: \"localhost:27000\", electionId: ObjectId('7fffffff0000000000000001'), lastWrite: { opTime: { ts: Timestamp(1604637886, 1), t: 1 }, lastWriteDate: new Date(1604637886000), majorityOpTime: { ts: Timestamp(1604637886, 1), t: 1 }, majorityWriteDate: new Date(1604637886000) }, maxBsonObjectSize: 16777216, maxMessageSizeBytes: 48000000, maxWriteBatchSize: 100000, localTime: new Date(1604637894239), logicalSessionTimeoutMinutes: 30, connectionId: 21, minWireVersion: 0, maxWireVersion: 8, readOnly: false, compression: [ \"snappy\", \"zstd\", \"zlib\" ], ok: 1.0, $clusterTime: { clusterTime: Timestamp(1604637886, 1), signature: { hash: BinData(0, 0000000000000000000000000000000000000000), keyId: 0 } }, operationTime: Timestamp(1604637886, 1) }",
 "code" : 96,
 "codeName" : "OperationFailed",
 "operationTime" : Timestamp(1604637888, 1),
 "$clusterTime" : {
 "clusterTime" : Timestamp(1604637888, 1),
 "signature" : {
 "hash" : BinData(0,"AAAAAAAAAAAAAAAAAAAAAAAAAAA="),
 "keyId" : NumberLong(0)
 }
 }
}

 

Sharding a collection

Run sh.status() to check if the shards have been added. Now that we have set up the cluster, let’s test it.

When you shard a collection, you have to choose a shard key; without one, sharding is impossible. The shard key determines how documents are distributed among the members of the cluster. A contiguous range of shard-key values stored together is called a chunk, and the chunk size can be configured to be between 1 and 1024 megabytes.

For example, let’s assume a collection of the following documents:

{
    "_id": ObjectId("5f97d97eb7a0a940f157ebc8"),
    "x": 1
}

 

And we use key x to shard the collection. The data is distributed as shown below:

x_low   x_high   shard

0       1000     S2    ----> chunk
1000    2000     S0    ----> chunk
2000    3000     S1    ----> chunk

 

The collection’s documents are distributed based on the shard key assigned. Documents with the same shard-key value always reside on the same shard.

Sharding can be done in one of two ways (see the sketch after this list):

  1. Range-based sharding
  2. Hashed sharding
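
As a sketch, the two strategies correspond to two forms of the shardCollection call, shown here against the tutorial.foo collection and key x used later in this tutorial:

// Range-based: documents are grouped into contiguous ranges of x
sh.shardCollection("tutorial.foo", { x: 1 })

// Hashed: documents are distributed by a hash of x (requires a hashed index on x)
sh.shardCollection("tutorial.foo", { x: "hashed" })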

To keep this tutorial simple, we will use range-based sharding, which is helpful for queries involving a range of values. Alongside sharding, the system also carries out two key operations:

  • Split
  • Migration (handled by the cluster balancer)

For example, if a chunk exceeds the configured size, the system will, on its own, attempt to find the appropriate median key to divide the chunk into 2 or more parts. This is known as a split, which is an inexpensive metadata change. Migration, on the other hand, is responsible for maintaining the balance by moving chunks from one shard to another. This is an expensive operation, since it can involve transferring up to a chunk’s worth of data. You can still read and write to the database while a migration is in progress; there is a liveness property to these migrations which keeps the database available, so a big lock doesn’t occur.

For our example, we’re creating a database named tutorial with a collection foo, and we are going to insert some documents.

Tue Oct 27 piyushgoyal mongo_tutorial $ mongo --port 27017
mongos> use tutorial
mongos> for(var i=0; i<999999; i++) { db.foo.insert({x:i}) }

 

To allow sharding for the database, we have to manually enable it. Connect to the mongos instance and run sh.enableSharding("<dbname>").

mongos> sh.enableSharding("tutorial")
{
        "ok" : 1,
        "operationTime" : Timestamp(1604638168, 21),
        "$clusterTime" : {
                "clusterTime" : Timestamp(1604638168, 21),
                "signature" : {
                        "hash" : BinData(0,"AAAAAAAAAAAAAAAAAAAAAAAAAAA="),
                        "keyId" : NumberLong(0)
                }
        }
}

 

If you run sh.status(), it now shows "partitioned: true" for the tutorial database.

Now we shard our collection using key x. To shard a collection, we first create an index on the key and then run sh.shardCollection("<db_name>.<collection_name>", <key>).

mongos> db.foo.ensureIndex({x: 1})
{
        "raw" : {
                "b/127.0.0.1:27101,127.0.0.1:27102,localhost:27100" : {
                        "createdCollectionAutomatically" : false,
                        "numIndexesBefore" : 1,
                        "numIndexesAfter" : 2,
                        "ok" : 1
                }
        },
        "ok" : 1,
        "operationTime" : Timestamp(1604638185, 9),
        "$clusterTime" : {
                "clusterTime" : Timestamp(1604638185, 9),
                "signature" : {
                        "hash" : BinData(0,"AAAAAAAAAAAAAAAAAAAAAAAAAAA="),
                        "keyId" : NumberLong(0)
                }
        }
}

mongos> sh.shardCollection("tutorial.foo", {x: 1})
{
        "collectionsharded" : "tutorial.foo",
        "collectionUUID" : UUID("b6506a90-dc0f-48d2-ba22-c15bbc94c0d6"),
        "ok" : 1,
        "operationTime" : Timestamp(1604638203, 39),
        "$clusterTime" : {
                "clusterTime" : Timestamp(1604638203, 39),
                "signature" : {
                        "hash" : BinData(0,"AAAAAAAAAAAAAAAAAAAAAAAAAAA="),
                        "keyId" : NumberLong(0)
                }
        }
}

 

The collection is now sharded. To check the shard distribution and the number of chunks (nchunks), run the following commands:

mongos> use tutorial
mongos> db.foo.stats()

 

To know the distribution of chunks across all databases, run the command below:

mongos> sh.status()

 

Let’s run some queries.

mongos> use tutorial
mongos> db.foo.find({x: 55})

To view the query plan and understand how the cluster queries different servers to get the document, append .explain() at the end of the query.

mongos> db.foo.find({x: 55}).explain()
{
        "queryPlanner" : {
                "mongosPlannerVersion" : 1,
                "winningPlan" : {
                        "stage" : "SINGLE_SHARD",
                        "shards" : [
                                {
                                        "shardName" : "b",
                                        "connectionString" : "b/127.0.0.1:27101,127.0.0.1:27102,localhost:27100",
                                        "serverInfo" : {
                                                "host" : "Piyushs-MacBook-Pro.local",
                                                "port" : 27100,
                                                "version" : "4.2.8",
                                                "gitVersion" : "43d25964249164d76d5e04dd6cf38f6111e21f5f"
                                        },
                                        "plannerVersion" : 1,
                                        "namespace" : "tutorial.foo",
                                        "indexFilterSet" : false,
                                        "parsedQuery" : {
                                                "x" : {
                                                        "$eq" : 55
                                                }
                                        },
                                        "queryHash" : "716F281A",
                                        "planCacheKey" : "0FA0E5FD",
                                        "winningPlan" : {
                                                "stage" : "FETCH",
                                                "inputStage" : {
                                                        "stage" : "SHARDING_FILTER",
                                                        "inputStage" : {
                                                                "stage" : "IXSCAN",
                                                                "keyPattern" : {
                                                                        "x" : 1
                                                                },
                                                                "indexName" : "x_1",
                                                                "isMultiKey" : false,
                                                                "multiKeyPaths" : {
                                                                        "x" : [ ]
                                                                },
                                                                "isUnique" : false,
                                                                "isSparse" : false,
                                                                "isPartial" : false,
                                                                "indexVersion" : 2,
                                                                "direction" : "forward",
                                                                "indexBounds" : {
                                                                        "x" : [
                                                                                "[55.0, 55.0]"
                                                                        ]
                                                                }
                                                        }
                                                }
                                        },
                                        "rejectedPlans" : [ ]
                                }
                        ]
                }
        },
        "serverInfo" : {
                "host" : "Piyushs-MacBook-Pro.local",
                "port" : 27017,
                "version" : "4.2.8",
                "gitVersion" : "43d25964249164d76d5e04dd6cf38f6111e21f5f"
        },
        "ok" : 1,
        "operationTime" : Timestamp(1604638250, 30),
        "$clusterTime" : {
                "clusterTime" : Timestamp(1604638255, 27),
                "signature" : {
                        "hash" : BinData(0,"AAAAAAAAAAAAAAAAAAAAAAAAAAA="),
                        "keyId" : NumberLong(0)
                }
        }
}

 

In the second part of this two-part series, we will explore various methods by which we can select a shard key.

Contemporary Single-Page Applications and Frontend Security

 

The past decade witnessed a meteoric rise in the number of applications adopting the Single-Page Application (SPA) model. Such applications render all of their content on a single web page, without having to load new HTML pages. Single-Page Applications leverage JavaScript capabilities to manipulate Document Object Model (DOM) elements, allowing all of the content to be updated on the same page.

A more traditional web page architecture loads a different page each time the user attempts to open a new web page. In this case, each HTML page is usually linked to other pages, and upon each page load, the browser fetches and displays a fresh page.

Single-Page Applications, on the other hand, are enabled by JavaScript frameworks such as React, Angular, and VueJS. These frameworks conceal the complex functions that SPAs perform and provide additional benefits such as modular reusable components, state management, etc. Such modern frameworks help SPAs render pages effortlessly compared to multi-page applications built with vanilla JavaScript, where it is difficult to keep the UI updated with dynamic state changes.

The emergence of such frameworks changes the security implications on the frontend. Therefore, it is necessary to understand their internal machinery when disassembling client-side code, and how modern frameworks change the attack vectors.

 

Build Process

Traditionally, a developer creates a web page by defining the page structure in an HTML file and its styling in a CSS file. The stylesheet is then linked in the HTML using tags such as <style> or <link>, and JavaScript code is embedded with <script> tags. However, building Single-Page Applications is more complicated than this.

Frameworks such as React and VueJS provide a virtual DOM, which allows you to steer clear of raw HTML. A virtual DOM, unlike the normal DOM, keeps a representation of the UI in memory instead of re-rendering it in the browser on every change, which allows faster changes to the DOM. The code is also written in modular JavaScript files, which are then processed in the build process.

Here’s a diagram overviewing the build process:

 

Single-page application build process

 

Transpiling

JavaScript is a dialect of ECMAScript, which isn’t a fixed standard; new features are added to ECMAScript with newer standards every few years. While newer standards are released frequently, no browser JavaScript engine (Chromium V8, Safari JavaScriptCore, Firefox SpiderMonkey) fully implements all the ECMA specifications, and each has certain differences in the features it supports. Your code also needs to be compatible with older browsers that only support older specifications.

Transpiling enables you to write modern code that is then converted to standards and features that are supported everywhere, e.g. ES6 -> ES5. You might use const and let to define your variables, but the transpiler converts them to var internally.
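
As a small illustrative example (the exact output varies by transpiler and target):

// Modern source (ES6)
const greet = (name) => `Hello, ${name}`;

// Transpiled output (ES5)
var greet = function (name) { return "Hello, " + name; };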

Transpiling is also used to convert code written in JavaScript dialects such as TypeScript and CoffeeScript to plain JavaScript. Each dialect has its own transpiler, such as tsc for TypeScript. Babel is the most prevalent transpiler.

 

Bundling

A typical modern app contains hundreds of external dependencies when you look at it recursively and it isn’t practical to load each of those separately in a script tag.

Through the process of bundling, you can take each import or require statement, find the source files and bundle them all into one single JavaScript file, applying appropriate scoping.

Through this process, all of the code that makes the application work, the business logic as well as the boilerplate, ends up in a single bundle.

 

Minification/ Obfuscation

The final JavaScript output can be very large due to extra whitespace and unnecessary, redundant data. Therefore, the final step in the build process is to minify the code. During minification, comments and whitespace are removed, identifiers are renamed to shorter variants, and certain routines are simplified. This leads to obfuscation, wherein the final code is unreadable and differs greatly from the source code.

Bundling and minification are usually done using Webpack.

 

Original code – easy to read

 

Transpiled, bundled, minified final output – unreadable

 

Reverse engineering code

Once the code has been minified in the build process, security researchers have to de-minify it to study the app. As most of the information, such as variable names, is lost during minification, there isn’t a straightforward way to do this. However, there are certain tools to aid you in the process, such as jsnice.org.

jsnice

This tool uses machine learning to restore minified variable and function names and to infer type information. It also formats the code and adds comments.

After this step, you are still left with bundled code, but the main logic will be readable.

To debundle it into modules, we need to know how Webpack or any other bundler works.

 

Debundling

A bundler starts with an entry point file – the root of your application logic – and traces import and require statements. It then builds a dependency graph: module A requires B, which requires C and D, and so on.

If you look through the Webpack chunks after they have been passed through jsnice, you’ll find a lot of calls to “__webpack_require__(<number>).”

__webpack_require__ is similar in functionality to the require JavaScript syntax, except that it contains the logic to load modules from the chunks.
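
To make that concrete, here is a heavily simplified, illustrative sketch of the runtime a Webpack-style bundle ships with; real bundles add caching details, chunk loading, and mangled names:

// Illustrative sketch of a Webpack-style bundle (not actual Webpack output)
(function (modules) {
  var installedModules = {};
  function __webpack_require__(moduleId) {
    if (installedModules[moduleId]) return installedModules[moduleId].exports;
    var module = (installedModules[moduleId] = { exports: {} });
    modules[moduleId](module, module.exports, __webpack_require__);
    return module.exports;
  }
  return __webpack_require__(0); // start from the entry module
})([
  /* module 0 - entry point */
  function (module, exports, __webpack_require__) {
    var greet = __webpack_require__(1);
    greet("world");
  },
  /* module 1 - a dependency */
  function (module, exports) {
    module.exports = function (name) { console.log("hello " + name); };
  },
]);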

The only way to unbundle a bundle is to construct the abstract syntax tree (AST) manually, as previous attempts at automated debundling are no longer maintained.

You could use these resources to study in depth how a bundle file works and to know the internals of Webpack. In this video, Webpack founder Tobias Koppers shows us how bundling is done manually.

 

Security Tip

Single-page application security

How do these frameworks change attack vectors?

Reduced XSS

React does not render dynamic HTML; it escapes content coming from variables before rendering, even if the variables do not contain dynamic content. Here XSS is all but eradicated unless the developer uses an unsafe function such as dangerouslySetInnerHTML.

So even if you find a data reflection, you would not be able to insert HTML.
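
As a rough illustration (userInput is a hypothetical variable holding attacker-controlled data):

// Safe: React escapes the value before rendering it
<div>{userInput}</div>

// Unsafe: raw HTML is injected into the DOM, re-enabling XSS
<div dangerouslySetInnerHTML={{ __html: userInput }} />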

 

CSP-Bypasses

You could use certain gadgets available in Angular to bypass Content Security Policy (CSP) in certain cases.

<img src="/" ng-on-error="$event.srcElement.ownerDocument.defaultView.alert($event.srcElement.ownerDocument.domain)" />

Here, if CSP blocks inline scripts but the page uses Angular, you could use the ng-on-error attribute to get Angular to execute the JavaScript. Gadgets of this type are patched regularly but continue to be discovered in VueJS and AngularJS.

 

Development of a Language-Independent Microservice Architecture

 

Modern programming languages are proliferating now more than ever. Nevertheless, development organizations tend to stick to one or two languages, allowing them to manage their stack properly and keep it clean.

Programming languages are essential tools that, if not chosen appropriately, could have a crippling effect on the business. For example, developers prefer Python over Java for machine learning, because Python has better libraries for machine learning and artificial intelligence applications.

Every language has a particular set of capabilities and properties. For instance, some languages are faster than others but may take time to write. Some languages consume more resources and yet make development faster. And, there are other languages that fall in between.

Microservice multilingual

Building a language-independent microservice architecture 

While it is not very common, a microservice architecture can be built to suit multiple languages. It gives developers the freedom to choose the right tool for their task. In the final phase of the software development lifecycle, Docker can deploy containers irrespective of the language they are written in; the same language-agnostic approach can be replicated in the earlier stages of building microservices as well.

However, managing multiple languages can be a huge task and a developer should definitely consider the following aspects when they plan to add more languages to their technology stack:

  • New boiler plate code, 
  • Defining models/data structures again, 
  • Documentation,
  • The environment it works on, 
  • New framework, etc.

 

Why a RESTful API is not the right choice

A REST API requires a new framework for every new language, such as Express for Node.js, Flask for Python, or Gin for Go. The developer may also have to code the same models/ data structures over and over for each service. This makes the whole process redundant and error-prone.

For example, it is very common to have one database model shared by multiple services. But when the database is updated, the model has to be changed in every service. To build a multilingual microservice architecture, we need a common framework that supports multiple languages and platforms, and one that is scalable.

 

Make way for gRPC and Protocol Buffers

gRPC 

gRPC is Google’s high-performance, open-source remote procedure call framework. It is compatible with multiple languages such as Go, Node.js, Python, Java, C#, and many more, and it uses RPC to provide communication between microservices. It is also a better, much faster alternative to REST: instead of JSON/ XML, the framework uses Protocol Buffers as the interface definition language, for serialization and communication. It works on a server-client model.

An in-depth analysis of gRPC is given here.

 

Protocol Buffers 

Protocol Buffers are a method of serializing data so that it can be transmitted over the wire or stored in files. In short, Protocol Buffers (Protobuf) are to gRPC what JSON is to REST: an alternative to JSON/ XML, albeit smaller and faster. Services and messages are defined in Protobuf and then compiled into multiple languages; the gRPC framework is then used to create the service.

gRPC uses HTTP/2 as its transport protocol, which supports bidirectional communication along with the traditional request/ response. In gRPC the server is loosely coupled with the client: in practice, the client opens a long-lived connection with the gRPC server, and a new HTTP/2 stream is opened for each RPC call.

 

How do we use Protobuf?

We first define our service in Protobuf and compile it to the languages of our choice. We then use the compiled files with the gRPC framework to create our server and clients, as shown in the diagram below.

Protobuffers for microservices

To explain the process:

  1. Define a service using Protocol Buffers
  2. Compile that service into other languages as per choice
  3. Generate boiler-plate code in each language 
  4. Use this to create gRPC server and clients

 

Advantages of using this architecture:

  • You can use multiple languages with a single framework.
  • Single definition across all the services, which is very useful in an organisation.
  • Any client can communicate with any server irrespective of the language.
  • gRPC allows two-way communication and uses HTTP/2 which is uber fast.

 

Creating a microservice with gRPC

Let’s create a simple microservice, HelloWorldService. This is a very basic service that exposes only one method, HelloWorld, which returns the string “hello world”.

These are the 4 steps to follow while creating a microservice: 

  1. Define the service (protocol buffer file)
  2. Select your languages
  3. Compile the defined service into selected languages
  4. Create a gRPC server and client, using the compiled files

For this simple service, we are opting for two languages: Go and Node.js. Since gRPC works on a client-server architecture, we shall use Go on the server (since it is fast and resource-efficient) and Node.js for the clients (since most front-end apps these days are React/ Angular).

**One can also decide to expose a gRPC server through a REST API, if they do not want to create gRPC clients. This is done by creating a REST proxy server. Although it sounds laborious, it is actually pretty simple and will be dealt with in the ‘Compiling’ section that follows.

 

Step 1:  Defining a service

Messages and services are defined using Protobuf in a proto file (.proto file).

The syntax is quite simple; the messages are defined and are used as Request and Response for the RPC calls, for each service. Here is a language guide for Protocol Buffers.

We can create a new folder for each service under the name of that particular service. And, for ease of accessibility, we can store all these services under the folder “services.”

The service we are defining here is “helloworld.” And each folder should contain 2 files, namely:

  • service.proto – contains definition 
  • .protolangs – contains the languages that this service should be generated in.

 

Folder structure

Now, let’s define our service:

gRPC hello world

 

As shown above, we have defined an empty Request, a string Response, and a HelloWorldService with a single RPC call, HelloWorld, which accepts a Request and returns a Response. You can see the full repo here.

 

Step 2: Selecting languages

Once the service has been defined, we have to choose the languages to compile the service into. This choice is made based on service requirements and usage, and also on the developer’s comfort. The languages one can choose from include Go, Node.js, Python, Java, C, C#, Ruby, Scala, PHP, etc. As mentioned earlier, in this example we’ll be using Go and Node.js. We then add these languages to the .protolangs file, one language per line.

adding languages for microservices

 

Step 3: Compiling

Compiling is the most interesting part of this entire process. In step 3, we’ll compile the .proto file to the selected languages, Go and Node.js.

Protocol Buffers come with a command-line tool called “protoc”, which compiles the service definition for use. But for each language we would have to download and install a separate plugin. This can be handled by using a Docker image.

Namely provides a publicly available Docker image, namely/protoc-all, which includes all of this. It bundles the proto compiler with support for all the languages, plus additional features like documentation generation, validators, and even the REST proxy server that we talked about in Step 1.

For example,


$ docker run -v `pwd`:/defs namely/protoc-all -f myproto.proto -l ruby

It accepts a .proto file and a language, both of which we already have. The following is a simple bash script that loops through our services folder, picks up each service.proto file, and compiles it to the languages listed in the .protolangs file.


#!/bin/bash
echo "Starting ... "
set -x

REPO="`pwd`/repo"

function enterDir {
  echo "Entering $1"
  pushd $1 > /dev/null
}

function leaveDir {
  echo "Exiting"
  popd > /dev/null
}

# Compile each service's service.proto for every language listed in its .protolangs file
function compileProto {
  for dir in */; do
    if [ -f $dir/.protolangs ]; then
      while read lang; do
        target=${dir%/*}
        mkdir -p $REPO/$lang
        rm -rf $REPO/$lang/$target
        mkdir -p $REPO/$lang/$target
        mkdir -p $REPO/$lang/$target/doc

        echo "  Compiling for $lang"
        docker run -v `pwd`:/defs namely/protoc-all -f $target/service.proto -l $lang --with-docs --lint $([ $lang == 'node' ] && echo "--with-typescript" || echo "--with-validator")

        # Copy the generated code and docs into repo/{language}/{service}
        cp -R gen/pb-$lang/$target/* $REPO/$lang/$target
        cp -R gen/pb-$lang/doc/* $REPO/$lang/$target/doc
        sudo rm -rf gen

      done < $dir/.protolangs
    fi
  done
}

function compile {
  echo "Starting the Build"
  mkdir -p $REPO
  for dir in services/; do
    enterDir $dir
    compileProto $dir
    leaveDir
  done
}

compile

 

  • The script runs a loop over each folder in /services.
  • It then picks up the .protolangs file and loops over each language listed in it. 
  • It compiles service.proto for that language. 
  • The Docker image generates the files in the gen/pb-{language} folder. 
  • We simply copy the contents to the repo/{language}/{servicename} folder.

We then run the script:

$ chmod +x generate.sh

$ ./generate.sh

The generated files appear in the /repo folder.

Tip: You can host these definitions in a repository and run the generation script in a CI/CD pipeline to automate this process.

Microservice node file

The successfully generated service_pb files for both Node.js and Go, along with some docs and validators, constitute the boiler-plate code for the server and clients that we are about to create.

**As discussed earlier, if you don’t want to use gRPC clients and want REST-JSON APIs instead, you can create a REST proxy by adding a single flag to the namely/protoc-all invocation, i.e. --with-gateway. For this we’ll have to add API paths in our proto files. Have a look at this for further information. Now run this gateway, and the REST proxy server will be ready to serve requests to the gRPC server.

 

Tip: You can also host this proto repo on GitHub as a repository. You can have a single repo for all the definitions in your organisation, in the same way Google does.

 

Step 4: gRPC Server and Client

Now that we have service_pb code for Go and Node.js, we can use it to create a server and a client. For each language the code will be slightly different, because of the differences in the languages. But the concept will remain the same.

For servers: We’ll have to implement RPC functions.

For clients: We’ll have to call RPC functions.

 

You can see the gRPC code for all languages here. With only a few lines of code we can create a server and with fewer lines of code we can create a client.

 

Server (Go): 

package main
 
import (
   "context"
   "fmt"
   "log"
   "net"
 
   helloworld "github.com/rohan-luthra/service-helloworld-go/helloworld"
   "google.golang.org/grpc"
)
 
type server struct {
}
 
func (*server) HelloWorld(ctx context.Context, request *helloworld.Request) (*helloworld.Response, error) {
   response := &helloworld.Response{
       Messsage: "hello world from go grpc",
   }
   return response, nil
}
 
func main() {
   address := "0.0.0.0:50051"
   lis, err := net.Listen("tcp", address)
   if err != nil {
       log.Fatalf("Error %v", err)
   }
   fmt.Printf("Server is listening on %v ...", address)
 
   s := grpc.NewServer()
   helloworld.RegisterHelloWorldServiceServer(s, &server{})
 
   s.Serve(lis)
}

 

 

 

As you can see, we have imported the service.pb.go that was generated by our shell script. Then we implemented the function HelloWorld, which returns the Response “hello world from go grpc”, thereby creating a gRPC server.

 

Client (Node.js):

// Import the message definitions and gRPC stubs generated by our shell script
var helloworld = require('./helloworld/service_pb');
var services = require('./helloworld/service_grpc_pb');
 
var grpc = require('grpc');
 
function main() {
  // Connect to the Go gRPC server started on port 50051
  var client = new services.HelloWorldServiceClient('localhost:50051',
                                          grpc.credentials.createInsecure());
  var request = new helloworld.Request();
  client.helloWorld(request, function(err, response) {
      if (err) console.log('Node Client Error:', err);
      else console.log('Node Client Message:', response.getMesssage());
  });
}
main();

 

We have imported the service_pb.js file generated by our shell script, added the gRPC server address, called the HelloWorld function, and logged the response to the console.

 

Test the code

Run the server and make sure that the code works. 

Testing code Microservice

Now that our server is running, let’s make a call from our Node.js client: 

Testing Node.js

 *Ignore the warning.

 

When we receive the console output “hello world from go grpc,” we can conclude that everything worked out as expected.

Thus, we have successfully created a gRPC server and client with a single proto definition and a shell script. This was a simple example of an RPC call returning a “hello world” text, but you can do almost anything with it. For example, you can create a CRUD microservice that performs add, get, and update RPC calls, or any operation of your choice. You just have to define it once and run the shell script. You can also call one service from another, by creating clients wherever you want, in any language you want. This is an example of a perfect microservice architecture.
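Because every client and server is generated from the same proto definition, a client in a third language can talk to the same Go server with no extra work. Below is a minimal sketch of what such a Python client could look like, assuming service.proto has also been compiled for Python; the module, stub, and class names follow the usual protoc output for the definition above, but they are assumptions rather than files generated in this repo.

# python_client.py - a hypothetical Python client for the same HelloWorldService.
# Assumes the proto was compiled for Python, producing service_pb2 and
# service_pb2_grpc (module and stub names are assumptions).
import grpc

from helloworld import service_pb2, service_pb2_grpc


def main():
    # Connect to the Go gRPC server started earlier on port 50051
    with grpc.insecure_channel('localhost:50051') as channel:
        stub = service_pb2_grpc.HelloWorldServiceStub(channel)
        # Call the HelloWorld RPC with an empty Request message
        response = stub.HelloWorld(service_pb2.Request())
        print('Python Client Response:', response)


if __name__ == '__main__':
    main()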

 

Summary

We hope this blog helped you build a microservice architecture with just Protocol Buffers, gRPC, a Docker image, and a shell script. This architecture is language-independent, which means it suits multiple programming languages. Moreover, it is almost 7 times faster than a traditional REST server, and it comes with the additional benefits of documentation and validation.

Not only is the architecture language-independent, it also allows you to manage all the data structures in your organisation under a single repository, which can be a game changer. Protocol Buffers also support import, inheritance, tags, etc.

You just have to follow these steps when you create a new microservice:

  1. Define your service
  2. Select languages
  3. Compile i.e. run the shell script (automated – can be done on CI/CD pipeline)
  4. Finally, code your servers and clients.
  5. Deploy however you want (Suggestion: use docker containers)

Now you are equipped to create multiple microservices, enable communication between them, or just use them as they are, and also add a REST proxy layer to build APIs. Remember that all services use a single framework irrespective of their programming language.

You can find the whole code repo here.

 

Bonus: 

You can also publish the generated proto files of a service as packages for each language, e.g. Node.js – npm, Python – pip, Golang – GitHub, Java – Maven, and so on. Run “npm install helloworld-proto” to pull in the .pb files. And if someone updates the proto definition, you just have to run “npm update helloworld-proto”.

fake apps brand monitor

How does CloudSEK’s XVigil detect rogue, fake applications

 

Brand monitoring is one of the premium services offered by CloudSEK’s flagship digital risk monitoring platform, XVigil. This functionality covers a wide range of use cases such as:

  • Fake domain monitoring
  • Rogue/ fake application detection
  • VIP monitoring

Threat actors deploy fake or rogue apps that masquerade as the official application by infringing on our client’s trademarks and copyrighted material. Upon seeing the familiar trademark, the client’s customers are tricked into installing such apps on their devices, thereby running malicious code that allows threat actors to exfiltrate data. Just this year, XVigil reported and alerted our clients to over 2.4 lakh (240,000) fake apps across various third-party app stores.

 

Classification through similarity scoring

Classification of such threats forms a major part of XVigil’s threat monitoring framework. The platform identifies and classifies the app as fake or rogue based on whether it impersonates our client’s apps, or if the uploaded APK files are different from our client’s official APKs. 

For any machine learning problem, you need training data, where you expect your test data to be similar to the training data. In this case, however, the data is different for each client, and you would need a separate model for every new client.

Instead, we approach it as a similarity scoring problem. We compare the suspected app with all the official apps of the client and see how similar it is in terms of the app title, description, screenshots, logos, etc. The bigger challenge arises when we don’t have all the information about the client to compare against the suspicious app.

 

CloudSEK’s Knowledge Extraction Module

To gather more information about the client, we built a knowledge extraction module. When a new client signs up, this module is triggered and it tries to collect all the information it can about the client. The knowledge extraction module was built as a generic module that tries to obtain every piece of information it can about the client. The client’s name and their primary domain are the only inputs required for the process. 

With these details, the knowledge extraction module can identify the industry that our client operates in, their various products and services, competitors of the client, the main tech-stack/ technologies the client works with, their official apps, and so on. These details are sourced from Google Play store, or by crawling and parsing the client’s website, various job listings posted by the clients themselves, etc. The gathered information is then passed through custom Named Entity Recognition (NER) models or static rules on them to get the client details in a structured manner. 

When we monitor for malicious applications, we run text/ image similarity models on the gathered information (client’s logo, official app, competitors, etc.) against the information present in the suspicious app. For example, text similarity checks how contextually similar the client’s app description is to that of the suspicious app. Another module tries to find whether the client’s logos appear in the screenshots provided with the malicious app. The different scores are then combined into an ensemble score.
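To illustrate just the text-similarity part of this pipeline, here is a small sketch that compares an official app description with a suspicious one using TF-IDF vectors and cosine similarity. This is only a toy example of the idea, not XVigil’s production model; the descriptions and the threshold are made up.

# Toy illustration of description similarity scoring (not the production model).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

official = "Official banking app to check balances, transfer money and pay bills securely."
suspicious = "Banking app to check balance, transfer money and pay bills. Enter card details to login."

# Vectorise both descriptions and compute their cosine similarity
vectorizer = TfidfVectorizer(stop_words='english')
vectors = vectorizer.fit_transform([official, suspicious])
score = cosine_similarity(vectors[0], vectors[1])[0][0]

print(f"Text similarity score: {score:.2f}")
if score > 0.5:  # assumed threshold, purely for illustration
    print("Description closely resembles the official app - flag for review")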

 

Finally…

If the similarity score is greater than a certain threshold, we can safely say that the app resembles our client’s brand/ app. Next, we need to check whether the uploaded APK is a modified APK or the client’s original APK itself. The challenge here is maintaining the APKs released by the client, for all versions of their apps, across all devices/ regions. To work around this, we compare the suspicious app’s certificate with the certificates of the client’s official app on the Google Play store. This filters out the APKs signed by the client’s official app developers, leaving us with the malicious apps, which are then reported to our clients.

How to build a secure AWS infrastructure

 

Every day more businesses migrate away from their traditional IT infrastructure, and the pandemic has only accelerated the adoption of cloud technologies among remote workforces. Cloud services such as Amazon Web Services (AWS) have been widely accepted as a channel for cloud computing and for delivering software and applications to a global marketplace, cost-effectively and securely. However, cloud consumers tend to wash their hands of the responsibility of securing their cloud infrastructure.

Cloud service providers and consumers share the responsibility of ensuring a safe and secure experience on the cloud. While service providers are liable for the underlying infrastructure that enables cloud, users are responsible for the data that goes on the cloud and who has access to it. 

AWS cloud service

The AWS Well-Architected Framework is a guide/ whitepaper issued by Amazon on AWS key concepts, design principles, and architectural best practices. Security is one of the five pillars that this Framework is based on, upholding the fact that protecting your data and improving security is crucial for AWS users. This blog intends to summarize the whitepaper on the security pillar and discuss:

  • Design principles for AWS
  • A few use case scenarios, and 
  • Recommended ways to implement a securely designed AWS infrastructure. 

 

AWS provides a variety of cloud services for computation, storage, database management, etc. A good architecture commonly focuses on efficient methods for reaching peak performance, scalable design, and cost-saving techniques. But quite often, these design aspects are given more importance than the security dimension.

The security of the cloud infrastructure can be divided into five phases:

  1. Identity verification and access management with respect to AWS resources.
  2. Attack detection, identification of potential threats and misconfigurations.
  3. Controlling access via defining trust boundaries, applying best practices in operation.
  4. Classifying all data, protecting data at all states: rest and transit.
  5. Incident response: Pre-defined mechanisms to respond and mitigate any surfacing security incident.

 

The Shared Responsibility Model

As I mentioned earlier, it is the collective responsibility of the user and the AWS service provider to secure the cloud infrastructure. It is important to keep this in mind while we explore the different implementation details and design principles. 

AWS provides plenty of monitoring, protection and threat identification tools to reduce the operational burden of its users, and it is very important to understand and choose an appropriate service to achieve a well secured environment.

AWS offers multiple services of differing natures and use cases, such as EC2 and Lambda. Each of these cloud services has a different level of abstraction, which enables users to focus on the problem to be solved instead of its operation. The share of each party’s responsibility varies accordingly: with higher levels of abstraction, the responsibility for security in the cloud shifts further to the service provider (with some exceptions).

AWS – Shared Responsibility Model

 

Management and Separation of User Accounts to Organise Workload

Based on the nature of the processes run on AWS, and the sensitivity of the data processed, workloads vary. They must be separated by a logical boundary and organised into multiple user accounts to make sure that different environments are isolated. For instance, the production environment commonly has stricter policies and more compliance requirements, and must be isolated from the development and test environments.

It is important to note that the AWS root user account must not be used for everyday operations. Using AWS Organizations, you can simplify this by creating multiple accounts under the same organisation, with different access policies and roles. It is also ideal to enable Multi-Factor Authentication, especially on the root account.
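For instance, a separate member account for an isolated environment can be created programmatically. The snippet below is a minimal sketch using boto3 (the AWS SDK for Python); the email address and account name are placeholders, and AWS Organizations must already be enabled.

# Sketch: create a member account for an isolated (e.g. production) environment.
import boto3

org = boto3.client('organizations')

response = org.create_account(
    Email='aws-prod@example.com',   # placeholder email
    AccountName='production',       # placeholder account name
)
print(response['CreateAccountStatus']['State'])  # e.g. IN_PROGRESS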

 

Managing Identity and Permissions

AWS resources can be accessed by humans (such as developers or app users) or machines (such as EC2 instances or Lambda functions). Setting up and managing an access control mechanism based on the identity of the requester is very important, as the individuals seeking access could be internal or external to the organization.

Each account should be granted access to resources and actions using IAM (Identity and Access Management) roles, with policies defining the access control rules. Based on the identity of the user account and the IAM policies attached to it, certain critical functionalities can be disabled: for example, denying certain changes from all user accounts with exceptions for the admin, or preventing all users from deleting Amazon VPC flow logs.
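As a sketch of the flow-logs example above, the snippet below uses boto3 to create a customer-managed IAM policy that explicitly denies deletion of VPC flow logs. The policy name and scope are illustrative, not a complete baseline.

# Sketch: an explicit-deny policy so identities it is attached to
# cannot delete VPC flow logs. Name and scope are illustrative.
import json

import boto3

iam = boto3.client('iam')

deny_flow_log_deletion = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Deny",
            "Action": ["ec2:DeleteFlowLogs"],
            "Resource": "*",
        }
    ],
}

iam.create_policy(
    PolicyName='DenyDeleteVpcFlowLogs',
    PolicyDocument=json.dumps(deny_flow_log_deletion),
)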

Each identity added to the AWS Organization should be given access only to the set of functions necessary to fulfil its tasks. This limits unintended access to functionality, and any unexpected behaviour arising from a single identity will only have a small impact.

 

Leveraging AWS Services to Monitor and Detect for Security Issues

Regular collection and analysis of the logs generated by each workload component is very important to detect any unexpected behaviour, misconfiguration, or potential threat. However, collecting and analysing logs is not quite enough: the volume of incoming logs can be huge, so an alerting and reporting flow should be set up along with an integrated ticketing system. AWS provides services such as the following to automate and simplify these processes:

  • CloudTrail: Provides the event history of the AWS account activity which includes all AWS services, Management console, SDKs, CLIs, etc.
  • Config: Enables automated assessment, auditing, and evaluation of the configuration of each AWS resource.
  • GuardDuty: Continuous security monitoring service that flags malicious activity surfacing within AWS environments by analysing log data and searching for patterns that may indicate any sort of privilege escalation, exposed credentials, established connections to malicious IPs, or domains.
  • Security Hub: Presents a comprehensive view of the security status of AWS infrastructure by enabling aggregation, prioritization, deduplication of security alerts from multiple AWS services and even third party products.
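As a small example of consuming these services programmatically, the sketch below pulls recent console-login events from CloudTrail’s event history with boto3; in practice such events would feed the alerting and ticketing flow described above.

# Sketch: query recent ConsoleLogin events from CloudTrail event history.
from datetime import datetime, timedelta

import boto3

cloudtrail = boto3.client('cloudtrail')

events = cloudtrail.lookup_events(
    LookupAttributes=[{'AttributeKey': 'EventName', 'AttributeValue': 'ConsoleLogin'}],
    StartTime=datetime.utcnow() - timedelta(days=1),
    EndTime=datetime.utcnow(),
)

for event in events['Events']:
    print(event['EventTime'], event.get('Username'), event['EventName'])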

 

Protecting the Infrastructure: Networks and Compute

Obsolete software programmes and outdated dependencies are not unusual and it is essential to patch all systems in the infrastructure. This can be done manually by system administrators, but it is better to use the AWS Systems Manager Patch Manager which basically automates the process of applying patches to the OS, applications and code dependencies.

It is crucial to set up AWS security groups in the right way, especially during phases when the infrastructure is growing at a fast rate. Things often go wrong when unorganized, messy security groups are added to the infrastructure. The creation and assignment of security groups should be handled with caution, as even a slight oversight can result in the exposure of critical assets and data stores on the internet. Security groups should clearly define both ingress and egress traffic rules, configured under the inbound and outbound rule settings.

If some assets have to be exposed on the internet, make sure your network is protected against DDoS attacks. AWS services such as CloudFront, WAF, and Shield help enable DDoS protection at multiple layers.

 

Protecting the Data

The classification of all data stored at the various locations inside the infrastructure is essential. Unless it is clear which data is most critical and which can be directly exposed on the internet, setting up protection mechanisms can be a bit of a task. Data resting inside all the different data stores must be classified in terms of sensitivity and criticality. If the data is sensitive enough that users should not access it directly, policies and mechanisms for ‘action at a distance’ should be put in place.

AWS provides multiple data storage services, the most common ones being S3 and EBS. Application data can usually be found lying around inside data stores self-hosted on EBS volumes. All sensitive data that goes into S3 buckets should be properly encrypted beforehand; in fact, it is better to enable encryption by default on these buckets.
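Default encryption can be turned on per bucket with a single API call. Here is a minimal sketch with boto3; the bucket name is a placeholder.

# Sketch: enable default server-side encryption (SSE-S3) on a bucket.
import boto3

s3 = boto3.client('s3')

s3.put_bucket_encryption(
    Bucket='my-sensitive-data-bucket',  # placeholder bucket name
    ServerSideEncryptionConfiguration={
        'Rules': [
            {'ApplyServerSideEncryptionByDefault': {'SSEAlgorithm': 'AES256'}}
        ]
    },
)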

Protecting data in transit is equally important. To do that, secure connections are required, which can be obtained using TLS encryption. Making sure that data is transferred only over such secure channels should be enough. AWS Certificate Manager is a good tool for managing SSL/ TLS certificates.

 

Preparing and Responding to Security Incidents the Right Way

Once all the automation has been set up and security controls are in place, designing incident response plans and playbooks becomes easier. A good plan must cover the response, communication, and recovery steps that follow any security incident. This is where logs, snapshots, backups, and GuardDuty findings play a critical role; they make the task far more efficient. Overall, the aim should be to prepare for an incident before it happens, and to iterate and train the entire team to thoroughly follow the incident response plan.

crawling

How are Python modules used for web crawling?

 

Let us assume search engines, like Google, never existed! How would you find what you need from across 4.2 billion web pages? Web crawlers are programs written to browse the internet, gather information, and index and parse the collected data to facilitate quick searches. Crawlers are thus a smart solution to handling big data sets and a catalyst for major advancements in the field of cyber security.

In this article, we will learn:

  1. What is crawling?
  2. Applications of crawling
  3. Python modules used for crawling
  4. Use-case: Fetching downloadable URLs from YouTube using crawlers
  5. How do CloudSEK Crawlers work?

web crawler

What is crawling?

Crawling refers to the process of scraping/ extracting data from websites/ the internet using web crawlers. For instance, Google uses spider bots (crawlers) to read the content of billions of web pages and posts. Then, it gathers data from these sites and arranges them in the Google Search index.

Basic stages of crawling: 
  1. Scrape data from the source
  2. Parse the collected data
  3. Clean the data of any noise or duplicate entries
  4. Structure the data as per requirement

 

Applications of crawling

Organizations crawl and scrape data off of web pages for various reasons that may benefit them or their customers. Here are some lesser known applications of crawling: 

  • Comparing data for market analysis
  • Monitoring data leaks
  • Preparing data sets for Machine Learning algorithms
  • Fact-checking information on social media

 

Python modules used for crawling

  • Requests – Allows you to send HTTP requests to web pages
  • BeautifulSoup – A Python library that extracts data from HTML and XML files and parses its elements into the required format
  • Selenium – An open-source testing suite for web applications; it can also perform browser actions to retrieve data.
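To see how these modules map onto the stages listed earlier, here is a minimal sketch that scrapes a page with Requests, parses it with BeautifulSoup, de-duplicates the links it finds, and structures the result. The URL is just an example.

# Minimal crawl sketch: scrape, parse, clean, and structure the data.
import requests
from bs4 import BeautifulSoup

url = 'https://example.com'                 # example source
html = requests.get(url, timeout=10).text   # 1. scrape data from the source

soup = BeautifulSoup(html, 'html.parser')   # 2. parse the collected data

links = {a.get('href') for a in soup.find_all('a') if a.get('href')}  # 3. de-duplicate

structured = [{'page': url, 'link': link} for link in sorted(links)]  # 4. structure it
print(structured)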

 

Use-case: Fetching downloadable URLs from YouTube using crawlers

A single YouTube video may have several downloadable URLs, based on its content, resolution, bitrate, range, and VR/3D. Here is a sample API and CLI code to get the downloadable URLs of a YouTube video along with their itags:

 

Project structure 

youtube
|
|---- app.py
|---- cli.py
`---- core.py

The project will contain three files:

app.py: For the API interface, using the Flask micro-framework

cli.py: For the command-line interface, using the argparse module

core.py: Contains all the core (common) functionality, which acts as helper functions for app.py and cli.py.

# youtube/app.py
import flask
from flask import jsonify, request

import core

app = flask.Flask(__name__)
app.config["DEBUG"] = True

@app.route('/', methods=['GET'])
def get_downloadable_urls():
    if 'url' not in request.args:
        return "Error: No url field provided. Please specify a youtube url."
    url = request.args['url']
    urls = core.get_downloadable_urls(url)
    return jsonify(urls)

app.run()

The Flask interface code to get downloadable URLs through an API.

Request url - localhost:<port>/?url=https://www.youtube.com/watch?v=FIVPlraNgXs

# youtube/cli.py
import argparse

import core

my_parser = argparse.ArgumentParser(description='Get youtube downloadable video from url')
my_parser.add_argument('-u', '--url', metavar='', required=True, help='youtube url')
args = my_parser.parse_args()

urls = core.get_downloadable_urls(args.url)
print(f'Got {len(urls)} urls\n')
for index, url in enumerate(urls, start=1):
    print(f'{index}. {url}\n')

Code snippet to get downloadable URLs through the command-line interface (using argparse to parse command-line arguments).

Command line interface - python cli.py -u 'https://www.youtube.com/watch?v=aWPYw7iVBg0'

# youtube/core.py
import json
import re

import requests

def get_downloadable_urls(url):
    html = requests.get(url).text
    RE = re.compile(r'ytplayer[.]config\s*=\s*(\{.*?\});')
    conf = json.loads(RE.search(html).group(1))
    player_response = json.loads(conf['args']['player_response'])
    data = player_response['streamingData']
    return [{'itag': frmt['itag'], 'url': frmt['url']} for frmt in data['adaptiveFormats']]

This is the core (common) function for both the API and CLI interfaces.

The execution of these commands will:

  1. Take YouTube url as an argument
  2. Gather page source using the Requests module
  3. Parse it and get streaming data
  4. Return response objects: url and itag

How to use these URLs?

  • Build your own YouTube downloader (web app)
  • Build an API to download YouTube video

Sample result 

[{
'itag': 251,
'Url': 'https://r2---sn-gwpa-h55k.googlevideo.com/videoplayback?expire=1585225812&ei=9Et8Xs6XNoHK4-EPjfyIiA8&ip=157.46.68.124&id=o-AGeDi3DVtAbmT5GiuGsDU7-NPLk23fOXNnY16gGQcHWu&itag=251&source=youtube&requiressl=yes&mh=Av&mm=31%2C26&mn=sn-gwpa-h55k%2Csn-cvh76ned&ms=au%2Conr&mv=m&mvi=1&pl=18&initcwndbps=112500&vprv=1&mime=audio%2Fwebm&gir=yes&clen=14933951&dur=986.761&lmt=1576518368612802&mt=1585204109&fvip=2&keepalive=yes&fexp=23882514&c=WEB&txp=5531432&sparams=expire%2Cei%2Cip%2Cid%2Citag%2Csource%2Crequiressl%2Cvprv%2Cmime%2Cgir%2Cclen%2Cdur%2Clmt&sig=ADKhkGMwRAIgK4L4VVHAlWMPVPEcmdkhnb2u8UM6eYhFz16kGruxZjUCIFXZJM9ejVK7OZJFqx7YwBqa3CrDvVakuU86vcIyMv-a&lsparams=mh%2Cmm%2Cmn%2Cms%2Cmv%2Cmvi%2Cpl%2Cinitcwndbps&lsig=ABSNjpQwRAIgKBhJytjv73-c7eMWbVkb-X8_rNb7_xApZvaPfw7wGcMCIHqJ405fQ3Kr-e_5fV8gokMUNi0rrrLG8T85sLGTQ17W'
}]

What is ITag?

ITag gives us more details about the video such as the type of video content, resolution, bitrate, range and VR/3D. A comprehensive list of YouTube format code ITags can be found here.

How do CloudSEK Crawlers work?

 

cloudsek crawlers

 

CloudSEK’s digital risk monitoring platform, XVigil, scours the internet across surface web, dark web, and deep web, to automatically detect threats and alert customers. After configuring a list of keywords suggested by the clients, CloudSEK Crawlers:

  1. Fetch data from various sources on the internet
  2. Push the gathered data to a centralized queue
  3. ML Classifiers group the data into threats and non-threats
  4. Threats are immediately reported to clients as alerts, via XVigil. Non-threats are simply ignored.

How do you achieve concurrency with Python threads?

Introduction

The process of threading allows multiple instructions of a program to be executed concurrently. Languages with multi-threading support, such as Python, provide this technique. Several I/O operations running consecutively slow a program down, so threading helps achieve concurrency.

In this article, we will explore:

  1. Types of concurrency in Python
  2. Global Interpreter Lock
  3. Need for Global Interpreter Lock
  4. Thread execution model in Python 2
  5. Global Interpreter Lock in Python 3

Types of Concurrency In Python

In general, concurrency is the overlapping execution of different units of a program, which helps optimize and speed up the overall process. In Python, there are three ways to achieve concurrency:

  1. Multi-Threading
  2. Asyncio
  3. Multi-processing

We will be discussing the fundamentals of the thread execution model. Before going into the concepts directly, we shall first discuss Python’s Global Interpreter Lock (GIL).

Global Interpreter Lock

Python threads are real system threads (POSIX threads). The host operating system fully manages the POSIX threads, also known as p-threads.

In multi-core operating systems, the Global Interpreter Lock (GIL) prevents the parallel execution of p-threads of a multi-threaded Python process. Thus, ensuring that only one thread runs in the interpreter, at any given time.

Why do we need Global Interpreter Lock?

The GIL helps simplify the implementation of the interpreter and its memory management. To understand how the GIL does this, we need to understand reference counting.

For example: In the code below, ‘b’ is not a new list. It is just a reference to the previous list ‘a.’

>>> a = []
>>> b = a
>>> b.append(1)
>>> a
[1]
>>> a.append(2)
>>> b
[1,2]

 

Python uses a reference counting variable per object to track the number of references that point to it. The memory occupied by the object is released when its reference count drops to zero. If threads of a process sharing the same memory try to increment and decrement this variable simultaneously, it can cause memory that is leaked and never released, or memory that is released incorrectly, and this leads to crashes.
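You can observe reference counts directly with sys.getrefcount; note that the count it reports is one higher than you might expect, because the argument passed to the function temporarily adds a reference of its own.

import sys

a = []
print(sys.getrefcount(a))  # 2: 'a' plus the temporary reference held by getrefcount

b = a                      # another reference to the same list
print(sys.getrefcount(a))  # 3

del b                      # dropping a reference decrements the count
print(sys.getrefcount(a))  # 2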

One solution is to protect each reference counting variable with a lock (a semaphore) so that it is not modified simultaneously. However, adding locks to all objects increases the performance overhead of acquiring and releasing locks. Hence, Python has a GIL, which gives access to all resources of a process to only one thread at a time. Apart from the GIL there are other solutions for memory management, such as the garbage collection used in the Jython interpreter.

So, the primary outcome of the GIL is that, instead of parallel computing, you get pre-emptive multitasking (threading) and co-operative multitasking (asyncio).

Thread Execution Model in Python 2

 

# seq.py
import time

def countdown(n):
    while n > 0:
        n -= 1

count = 50000000
start = time.time()
countdown(count)
end = time.time()
print('Time taken in seconds -', end - start)

 

# par.py
import time
from threading import Thread

COUNT = 50000000

def countdown(n):
    while n > 0:
        n -= 1

t1 = Thread(target=countdown, args=(COUNT//2,))
t2 = Thread(target=countdown, args=(COUNT//2,))

start = time.time()
t1.start()
t2.start()
t1.join()
t2.join()
end = time.time()
print('Time taken in seconds -', end - start)

 

~/Pythonpractice/blog❯ python seq.py
('Time taken in seconds -', 1.2412900924682617)
~/Pythonpractice/blog❯ python par.py
('Time taken in seconds -', 1.8751227855682373)

Ideally, the par.py execution time should be half of the seq.py execution time. However, in the above example we can see that the par.py execution time is slightly higher than that of seq.py. To understand this reduction in performance, despite sharing the work between two threads that should run in parallel, we first need to discuss CPU-bound and I/O-bound threads.

CPU-bound threads are threads performing CPU intense operations such as matrix multiplication or nested loop operations. Here, the speed of program execution depends on CPU performance.

 

 

I/O-bound threads are threads performing I/O operations such as listening to a socket or waiting for a network connection. Here, the speed of program execution depends on factors including external file systems and network connections.

 

Scenarios in a Python multi-threaded program

When all threads are I/O-bound

When a thread is running, it holds the GIL. When the thread hits an I/O operation, it releases the GIL, and another thread acquires it and gets executed. This alternating execution of threads is called multitasking.
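Because the GIL is released during blocking I/O, purely I/O-bound work does benefit from threads. The sketch below simulates two I/O-bound tasks with time.sleep; run in two threads they finish in roughly one second instead of two.

import time
from threading import Thread

def io_task():
    # time.sleep releases the GIL, just like waiting on a socket would
    time.sleep(1)

t1 = Thread(target=io_task)
t2 = Thread(target=io_task)

start = time.time()
t1.start()
t2.start()
t1.join()
t2.join()
print('Time taken in seconds -', time.time() - start)  # roughly 1 second, not 2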

 

 

Where one thread is CPU bound, and another thread is IO-bound:

A CPU-bound thread is a unique case in thread execution. A CPU-bound thread releases the GIL after every few ticks and tries to acquire it again; a tick roughly corresponds to an interpreter instruction. When releasing the GIL, it also signals the next thread in the execution queue (the ready queue of the operating system) that the GIL has been released. Now the CPU-bound and I/O-bound threads race to acquire the GIL, and the operating system decides which thread gets executed. This model of executing the next thread in the execution queue before the previous thread has completed is called pre-emptive multitasking.

 

 

 

In most cases, the operating system gives preference to the CPU-bound thread and allows it to reacquire the GIL, leaving the I/O-bound thread starving. In the diagram below, the CPU-bound thread has released the GIL and signalled thread 2. But even before thread 2 tries to acquire the GIL, the CPU-bound thread has reacquired it. This issue has been resolved in the Python 3 interpreter’s GIL.

 

 

On a single-core system, if the CPU-bound thread reacquires the GIL, it pushes the second thread back to the ready queue, where it is assigned some priority. This is because Python does not have control over the priorities assigned by the operating system.

On a multi-core system, if the CPU-bound thread reacquires the GIL, the second thread is not pushed back; instead it continuously tries to acquire the GIL from another core. This is known as thrashing, and it reduces performance when many threads try to acquire the GIL using different cores of the machine. Python 3 also addresses the issue of thrashing.

 

 

Global Interpreter Lock in Python 3

The Python 3 thread execution model has a new GIL. If there is only one thread running, it continues to run until it hits an I/O operation or another thread requests that it drop the GIL. A global variable (gil_drop_request) helps to implement this.

If gil_drop_request = 0, the running thread can continue until it hits I/O

If gil_drop_request = 1, the running thread is forced to give up the GIL

Instead of the CPU-bound thread checking after every few ticks, the waiting thread sends a GIL drop request by setting gil_drop_request = 1 once it reaches a timeout. The running thread then immediately drops the GIL. Additionally, to suspend the first thread’s execution, the waiting thread sends a signal. This helps to avoid thrashing. This mechanism is not available in Python 2.
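In CPython 3 this timeout is the interpreter’s switch interval, which defaults to 5 milliseconds and can be inspected or tuned at runtime:

import sys

# The interval after which a waiting thread sets gil_drop_request
print(sys.getswitchinterval())   # 0.005 seconds by default

# It can be tuned, e.g. a shorter interval for more responsive I/O threads
sys.setswitchinterval(0.001)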

 

 

Missing Bits in the New GIL

While the new GIL does address issues such as thrashing, it still has some areas for improvement:

Waiting for time out

Waiting for the timeout can make I/O response slow, especially when there are multiple, recurrent I/O operations. I/O-bound Python threads can take considerable time to reacquire the GIL because they must wait out the timeout; this happens after every I/O operation, before the next input/output is ready.

Unfair GIL acquiring

As seen below, the thread that makes the GIL drop request is not necessarily the one that gets the GIL. This situation can reduce I/O performance where response time is critical.

 

 

Prioritizing threads

The GIL needs to distinguish between CPU-bound and I/O-bound threads and assign priorities accordingly. High-priority threads must be able to immediately preempt low-priority threads, which would improve response time considerably. This problem has already been solved in operating systems: they use timeouts to automatically adjust task priorities. If a thread is preempted by a timeout, it is penalized with a lower priority; conversely, if a thread suspends early, it is rewarded with a raised priority. Incorporating this in Python would help improve thread performance.