Threat actors are constantly evolving, developing sophisticated tactics to evade detection or discovering new attack vectors altogether. One way or another, they sniff out sensitive, personal information buried beneath layers of networks and sell it for financial gain or other motives. Cybersecurity solutions will always be a step behind adversaries, since they are mostly defensive rather than offensive. Therefore, to keep pace with the unknown unknowns, it is vital to methodically identify, classify, and report exposure of an organization's data on the internet, to prevent or reduce the risk of potential attacks.
Security solutions like CloudSEK's XVigil scour the internet to identify potential cyber threats relevant to specific organizations and raise alerts to keep adversaries at bay. This process of identifying compromised data on the internet is also known as Data Acquisition. Using crawlers for this purpose helps to detect leaked information steadily and concurrently, from all parts of the internet, including the surface web, dark web marketplaces, forums, and social media platforms.
Data Acquisition with Selenium
At CloudSEK, we follow certain criteria to ensure that the solution we choose can run test cases that perform the following functions, so that the crawlers can acquire data successfully:
- Execute JavaScript to render all the dynamic content of a target site.
- Be programmable to perform user-interface actions, such as clicking a tab or replying to a post, from an automation script.
- Scroll pages and get past CAPTCHAs, login screens, and other screen walls with the help of automation scripts.
The solution that satisfies the above criteria and runs at scale uses Selenium Grid as the base, with AWS as the infrastructure provider. This lets us run numerous browser instances to assist crawling based on keywords or auto-discovered client assets, as the sketch below illustrates.
In this article, we discuss the various components of this architecture to help readers understand how the individual services interact with each other to acquire data.
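To make the criteria concrete, here is a minimal sketch of a crawler task submitted to a Selenium Grid hub from Python. The hub URL, target site, and CSS selector are illustrative placeholders rather than CloudSEK's actual configuration.

```python
# Minimal sketch: driving a remote browser through a Selenium Grid hub.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

options = Options()
options.add_argument("--headless=new")

# Connect to the grid hub; it schedules the session on an available node.
driver = webdriver.Remote(
    command_executor="http://selenium-hub.internal:4444/wd/hub",  # hypothetical internal DNS
    options=options,
)

try:
    driver.get("https://example-forum.test/search?q=keyword")

    # Execute JavaScript to scroll and trigger dynamically rendered content.
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

    # Perform a user-interface action, e.g. opening a thread.
    driver.find_element(By.CSS_SELECTOR, "a.thread-link").click()

    page_source = driver.page_source  # handed off to parsers downstream
finally:
    driver.quit()
```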
What is Selenium Grid?
Selenium Grid is a testing tool, part of the Selenium Suite, that specializes in running multiple Selenium tests in parallel across browsers, operating systems, and machines.
The two central components of Selenium Grid are the hub and the nodes. The hub acts as the brain that schedules the test requests it receives from WebDriver clients, whereas a node is the executor that runs those requests on browser instances hosted within it.
The number of browser instances a node can run is directly proportional to the CPU and memory resources allocated to it. A Selenium Grid architecture has exactly one hub and any number of nodes.
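For illustration, the hub's status endpoint can be polled to see the registered nodes and their browser slots. The snippet below is a rough sketch; the hub hostname is a placeholder and the field names follow Selenium Grid 4's /status payload.

```python
# Sketch: inspect how many nodes are registered and how many slots are in use.
import requests

GRID_STATUS_URL = "http://selenium-hub.internal:4444/status"  # hypothetical hub address

status = requests.get(GRID_STATUS_URL, timeout=10).json()["value"]
print("Grid ready:", status["ready"])

for node in status.get("nodes", []):
    slots = node.get("slots", [])
    busy = sum(1 for slot in slots if slot.get("session"))
    print(f"Node {node['id']}: {busy}/{len(slots)} slots in use")
```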
Selenium Grid Architecture
Being a cloud-based SaaS provider, CloudSEK leverages AWS' cloud computing services to build its Selenium Grid. Below, we discuss some of the AWS services and components used in this architecture.
Elastic Load Balancer
Initially, the crawlers go through an internal DNS managed by AWS Route 53, which maps the AWS-generated load balancer DNS name to a CNAME that is easier to manage. The CNAME points to the AWS Elastic Load Balancer (ELB). ELB provides a stable entry point, which becomes useful when we have to scale or reconfigure the hub, since the hub is recreated every time there is a configuration change.
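As a rough illustration of this mapping, the snippet below upserts a CNAME record in a Route 53 hosted zone pointing to the load balancer's AWS-generated DNS name. The hosted zone ID, record name, and ELB DNS name are hypothetical.

```python
# Sketch: map a friendly internal name to the load balancer's DNS name.
import boto3

route53 = boto3.client("route53")

route53.change_resource_record_sets(
    HostedZoneId="Z0EXAMPLE",  # hypothetical private hosted zone
    ChangeBatch={
        "Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": "selenium-hub.internal.",
                "Type": "CNAME",
                "TTL": 60,
                "ResourceRecords": [
                    {"Value": "internal-selenium-hub-1234567890.us-east-1.elb.amazonaws.com"}
                ],
            },
        }]
    },
)
```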
Elastic Container Service
The Selenium hub and nodes are hosted as separate containers managed by AWS Elastic Container Service (ECS), which enables us to run tests in parallel on Selenium Grid. ECS helps to orchestrate, monitor, and configure both the hub and the nodes.
ECS service auto scaling also gives us more control over the worker nodes: we can hold them at a desired count, or scale up to a maximum and down to a minimum based on thresholds set on the CPU and memory utilization of the worker nodes, as the sketch below shows.
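The following sketch shows what such a target-tracking scaling setup could look like with boto3, assuming a hypothetical ECS service named selenium-node in a cluster named selenium-grid.

```python
# Sketch: target-tracking auto scaling for the worker-node service.
import boto3

autoscaling = boto3.client("application-autoscaling")
resource_id = "service/selenium-grid/selenium-node"  # hypothetical cluster/service

# Register the worker-node service with minimum and maximum task counts.
autoscaling.register_scalable_target(
    ServiceNamespace="ecs",
    ResourceId=resource_id,
    ScalableDimension="ecs:service:DesiredCount",
    MinCapacity=2,
    MaxCapacity=50,
)

# Scale on average CPU utilization; ECS adjusts the node count around the target.
autoscaling.put_scaling_policy(
    PolicyName="selenium-node-cpu-target",
    ServiceNamespace="ecs",
    ResourceId=resource_id,
    ScalableDimension="ecs:service:DesiredCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 70.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ECSServiceAverageCPUUtilization"
        },
        "ScaleOutCooldown": 60,
        "ScaleInCooldown": 300,
    },
)
```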
CloudWatch
The AWS CloudWatch service lets us configure thresholds on these metrics; a CloudWatch alarm goes off when a metric exceeds its threshold. CloudWatch also gathers log streams from the worker nodes and the hub, which are used to troubleshoot errors.
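As an example, a memory-utilization alarm on the node service might be configured as below; the cluster, service, and SNS topic names are placeholders.

```python
# Sketch: alarm when the node service's memory utilization stays high.
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="selenium-node-memory-high",
    Namespace="AWS/ECS",
    MetricName="MemoryUtilization",
    Dimensions=[
        {"Name": "ClusterName", "Value": "selenium-grid"},
        {"Name": "ServiceName", "Value": "selenium-node"},
    ],
    Statistic="Average",
    Period=60,
    EvaluationPeriods=3,
    Threshold=80.0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:grid-alerts"],  # hypothetical topic
)
```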
Fargate
The AWS services mentioned above allow us to build a comprehensive Selenium Grid architecture. However, to make it more cost efficient, we also utilize AWS Fargate. Fargate offers metered, pay-per-use compute and removes the need to maintain and manage servers. It also isolates applications from one another, thereby improving security.
Fargate capacity providers define a strategy for how the underlying infrastructure is used when running the worker nodes. Fargate also offers an alternative known as Fargate Spot, whose capacity providers supply spare capacity at a discounted rate. However, Spot capacity can be interrupted at any point, affecting even tasks that are already running; interrupted tasks are either terminated or replaced on new Spot capacity.
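A mixed capacity provider strategy could look like the following sketch: a small guaranteed baseline on Fargate, with the remaining tasks weighted toward cheaper Fargate Spot. The cluster, service, task definition, and network identifiers are illustrative.

```python
# Sketch: create the worker-node service with a Fargate + Fargate Spot mix.
import boto3

ecs = boto3.client("ecs")

ecs.create_service(
    cluster="selenium-grid",
    serviceName="selenium-node",
    taskDefinition="selenium-node:1",
    desiredCount=10,
    capacityProviderStrategy=[
        {"capacityProvider": "FARGATE", "base": 2, "weight": 1},       # guaranteed baseline
        {"capacityProvider": "FARGATE_SPOT", "base": 0, "weight": 4},  # discounted, interruptible
    ],
    networkConfiguration={
        "awsvpcConfiguration": {
            "subnets": ["subnet-0example"],        # hypothetical subnet
            "securityGroups": ["sg-0example"],     # hypothetical security group
            "assignPublicIp": "DISABLED",
        }
    },
)
```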
CloudMap and Spot Replacement
To address the above problem, we use another AWS service, CloudMap, a fully managed service discovery tool for cloud resources and application components. It allows applications to connect to the correct endpoints, enabling safer deployments and reducing the complexity of infrastructure code. With CloudMap, we can also define custom names for our application resources and keep the locations of these dynamically changing resources up to date. At CloudSEK, we use this capability to dynamically locate the worker nodes and configure DNS for them, so that they can then be registered with the hub.
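A simplified sketch of how such a lookup might work is shown below, using the service discovery API to resolve healthy worker-node endpoints. The namespace and service names are assumptions, not CloudSEK's actual values.

```python
# Sketch: resolve the current set of healthy worker-node endpoints via CloudMap.
import boto3

servicediscovery = boto3.client("servicediscovery")

response = servicediscovery.discover_instances(
    NamespaceName="selenium.internal",   # hypothetical namespace
    ServiceName="selenium-node",         # hypothetical service
    HealthStatus="HEALTHY",
)

for instance in response["Instances"]:
    attrs = instance["Attributes"]
    # AWS_INSTANCE_IPV4 / AWS_INSTANCE_PORT are the standard attributes CloudMap
    # records for ECS tasks registered through service discovery.
    print(f"Node at {attrs.get('AWS_INSTANCE_IPV4')}:{attrs.get('AWS_INSTANCE_PORT')}")
```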
Conclusion
Selenium Grid offers the convenience of executing concurrent crawlers on multiple browsers, browser versions, and machines. AWS services, on the other hand, provide an economical way to run these workloads at scale. With Selenium Grid and AWS services in place, CloudSEK's Data Acquisition team runs hundreds of crawlers every second, scraping data from sources using multi-platform browser instances, ten times faster than with multiple monolithic machines. This, in turn, reduces the operational expense of maintaining large on-demand servers to run the same number of individual Selenium browser instances and crawler services.