Imagine that search engines like Google never existed. How would you find what you need across 4.2 billion web pages? Web crawlers are programs that browse the internet, gather information, and then parse and index the collected data so that it can be searched quickly. Crawlers are thus a practical answer to very large data sets, and a catalyst for major advancements in the field of cyber security.

In this article, we will learn:

What is crawling?
Applications of crawling
Python modules used for crawling
Use-case: Fetching downloadable URLs from YouTube using crawlers
How do CloudSEK Crawlers work?

What is crawling?

Crawling refers to the process of scraping or extracting data from websites using web crawlers. For instance, Google uses spider bots (crawlers) to read the content of billions of web pages and posts. It then gathers data from these sites and arranges it in the Google Search index.

Basic stages of crawling:

Scrape data from the source
Parse the collected data
Clean the data of any noise or duplicate entries
Structure the data as per requirement
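The four stages above can be sketched in a few lines of Python. This is only an illustration, not a production crawler: the SAMPLE_HTML string stands in for a downloaded page, and the names LinkParser, scrape, parse, clean and structure are made up for this example. It uses only the standard library so that it runs without a network connection.

```python
from html.parser import HTMLParser

# Stand-in for a fetched page; a real crawler would download this instead.
SAMPLE_HTML = """
<html><body>
  <a href="/pricing">Pricing</a>
  <a href="/pricing">Pricing</a>
  <a href="/blog">  Blog </a>
  <a>No link here</a>
</body></html>
"""


class LinkParser(HTMLParser):
    """Collects (href, text) pairs for every <a> tag that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text)))
            self._href = None


def scrape():
    # Stage 1: get raw data from the source (here, the hard-coded sample).
    return SAMPLE_HTML


def parse(html):
    # Stage 2: pull the interesting pieces out of the raw markup.
    parser = LinkParser()
    parser.feed(html)
    return parser.links


def clean(links):
    # Stage 3: strip whitespace noise and drop duplicates, keeping order.
    seen, result = set(), []
    for href, text in links:
        item = (href.strip(), text.strip())
        if item not in seen:
            seen.add(item)
            result.append(item)
    return result


def structure(links):
    # Stage 4: shape the data as the consumer needs it (a list of dicts here).
    return [{"url": href, "title": text} for href, text in links]


records = structure(clean(parse(scrape())))
print(records)
```

In a real crawler, scrape() would be replaced by an HTTP request, and the other three stages would stay largely the same.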

Applications of crawling

Organizations crawl and scrape data from web pages for various reasons that may benefit them or their customers. Here are some lesser-known applications of crawling:

Comparing data for market analysis
Monitoring data leaks
Preparing data sets for Machine Learning algorithms
Fact-checking information on social media

Python modules used for crawling

Requests – Allows you to send HTTP requests to web pages
Beautiful Soup – Python library that parses HTML and XML documents and extracts their elements in the required format
Selenium – Open source testing suite for web applications. It can also drive a browser to retrieve data.

Use-case: Fetching downloadable URLs from YouTube using crawlers

A single YouTube video may have several downloadable URLs, depending on its content type, resolution, bitrate, range, and VR/3D support. Here is sample API and CLI code that gets the downloadable URLs of a YouTube video along with their itags:

Project structure

youtube
|
|---- app.py
|---- cli.py
`---- core.py

The project will contain three files:

app.py: For the API interface, using the Flask microframework

cli.py: For command line interface, using argparse module

core.py: Contains all the core (common) functionalities which act as helper functions for app.py and cli.py.

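    # youtube/app.py
    import flask
    from flask import jsonify, request

    import core

    app = flask.Flask(__name__)
    app.config["DEBUG"] = True


    @app.route('/', methods=['GET'])
    def get_downloadable_urls():
        # The YouTube URL is passed as the "url" query parameter.
        if 'url' not in request.args:
            # Report a missing parameter as a client error rather than 200 OK.
            return "Error: No url field provided. Please specify a YouTube url.", 400
        url = request.args['url']
        urls = core.get_downloadable_urls(url)
        return jsonify(urls)


    if __name__ == '__main__':
        # Start the development server only when run as a script.
        app.run()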

The Flask code above exposes the downloadable URLs through an API.

Request url - localhost:<port>/?url=https://www.youtube.com/watch?v=FIVPlraNgXs
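The request URL above carries the video URL inside its own query string. That works for a simple watch link, but a video URL with extra parameters joined by & could be split by the server, so percent-encoding the value is safer. Here is a minimal sketch of building such a request URL. The helper name build_request_url is made up for this example, and port 5000 is assumed because it is the default for Flask's development server.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_request_url(video_url, host="localhost", port=5000):
    # urlencode percent-encodes ":", "/", "?" and "=" inside the value,
    # so the video URL cannot be confused with the API's own parameters.
    return f"http://{host}:{port}/?{urlencode({'url': video_url})}"


request_url = build_request_url("https://www.youtube.com/watch?v=FIVPlraNgXs")
print(request_url)

# Decoding the query string gives back the original video URL.
print(parse_qs(urlsplit(request_url).query)["url"][0])
```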

    # youtube/cli.py
    import argparse

    import core

    my_parser = argparse.ArgumentParser(description='Get youtube downloadable video from url')
    my_parser.add_argument('-u', '--url', metavar='', required=True, help='youtube url')
    args = my_parser.parse_args()

    urls = core.get_downloadable_urls(args.url)
    print(f'Got {len(urls)} urls\n')
    for index, url in enumerate(urls, start=1):
        print(f'{index}. {url}\n')

Code snippet to get downloadable URLs through the command line interface (using argparse to parse command line arguments).

Command line interface - python cli.py -u 'https://www.youtube.com/watch?v=aWPYw7iVBg0'

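    # youtube/core.py
    import json
    import re

    import requests

    # Matches the inline "ytplayer.config = {...};" assignment in the page source.
    CONFIG_RE = re.compile(r'ytplayer[.]config\s*=\s*(\{.*?\});')


    def get_downloadable_urls(url):
        html = requests.get(url).text
        match = CONFIG_RE.search(html)
        if match is None:
            # Fail with a clear message instead of an AttributeError on None.
            raise ValueError('ytplayer.config not found in page source')
        conf = json.loads(match.group(1))
        # player_response is itself a JSON string nested inside the config.
        player_response = json.loads(conf['args']['player_response'])
        data = player_response['streamingData']
        # Skip any format entry that does not carry a plain "url" field.
        return [{'itag': frmt['itag'], 'url': frmt['url']}
                for frmt in data['adaptiveFormats'] if 'url' in frmt]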

This is the core function shared by the API and CLI interfaces.

The execution of these commands will:

Take a YouTube URL as an argument
Gather the page source using the Requests module
Parse it and get the streaming data
Return a list of response objects, each with a url and an itag
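The parsing step can be tried offline, without contacting YouTube. The sketch below applies the same regular expression and the same two rounds of JSON decoding to a made-up page snippet. The page content, the example.invalid URLs, and the names CONFIG_RE and extract_formats are assumptions for illustration only; a real page is far larger and its layout may differ.

```python
import json
import re

CONFIG_RE = re.compile(r'ytplayer[.]config\s*=\s*(\{.*?\});')

# Build a tiny fake page with the same nesting the core function expects:
# player_response is a JSON string stored inside the JSON config object.
fake_player_response = {
    "streamingData": {
        "adaptiveFormats": [
            {"itag": 251, "url": "https://example.invalid/audio"},
            {"itag": 137, "url": "https://example.invalid/video"},
        ]
    }
}
fake_config = {"args": {"player_response": json.dumps(fake_player_response)}}
fake_html = f"<script>ytplayer.config = {json.dumps(fake_config)};</script>"


def extract_formats(html):
    match = CONFIG_RE.search(html)
    if match is None:
        raise ValueError("ytplayer.config not found in page source")
    conf = json.loads(match.group(1))                      # first decode
    player_response = json.loads(conf["args"]["player_response"])  # second decode
    formats = player_response["streamingData"]["adaptiveFormats"]
    return [{"itag": f["itag"], "url": f["url"]} for f in formats]


for entry in extract_formats(fake_html):
    print(entry["itag"], entry["url"])
```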

How to use these URLs?

Build your own YouTube downloader (web app)
Build an API to download YouTube video

Sample result

[{
  'itag': 251,
  'url': 'https://r2---sn-gwpa-h55k.googlevideo.com/videoplayback?expire=1585225812&ei=9Et8Xs6XNoHK4-EPjfyIiA8&ip=157.46.68.124&id=o-AGeDi3DVtAbmT5GiuGsDU7-NPLk23fOXNnY16gGQcHWu&itag=251&source=youtube&requiressl=yes&mh=Av&mm=31%2C26&mn=sn-gwpa-h55k%2Csn-cvh76ned&ms=au%2Conr&mv=m&mvi=1&pl=18&initcwndbps=112500&vprv=1&mime=audio%2Fwebm&gir=yes&clen=14933951&dur=986.761&lmt=1576518368612802&mt=1585204109&fvip=2&keepalive=yes&fexp=23882514&c=WEB&txp=5531432&sparams=expire%2Cei%2Cip%2Cid%2Citag%2Csource%2Crequiressl%2Cvprv%2Cmime%2Cgir%2Cclen%2Cdur%2Clmt&sig=ADKhkGMwRAIgK4L4VVHAlWMPVPEcmdkhnb2u8UM6eYhFz16kGruxZjUCIFXZJM9ejVK7OZJFqx7YwBqa3CrDvVakuU86vcIyMv-a&lsparams=mh%2Cmm%2Cmn%2Cms%2Cmv%2Cmvi%2Cpl%2Cinitcwndbps&lsig=ABSNjpQwRAIgKBhJytjv73-c7eMWbVkb-X8_rNb7_xApZvaPfw7wGcMCIHqJ405fQ3Kr-e_5fV8gokMUNi0rrrLG8T85sLGTQ17W'
}]

What is an itag?

The itag tells us more about a stream, such as its content type, resolution, bitrate, range, and VR/3D support. A comprehensive list of YouTube format code itags can be found here.
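Once the list of itag and url pairs is available, picking one stream is a simple lookup, and the URL's own query string repeats some of the format details (the sample result contains itag=251 and mime=audio%2Fwebm). The sketch below shows both. The shortened example.invalid URLs and the helper names pick_by_itag and describe_stream are made up for this illustration.

```python
from urllib.parse import parse_qs, urlsplit

# Shortened stand-ins for entries returned by get_downloadable_urls().
entries = [
    {"itag": 251, "url": "https://example.invalid/videoplayback?itag=251&mime=audio%2Fwebm"},
    {"itag": 137, "url": "https://example.invalid/videoplayback?itag=137&mime=video%2Fmp4"},
]


def pick_by_itag(items, itag):
    # Return the first entry with the requested itag, or None if absent.
    return next((item for item in items if item["itag"] == itag), None)


def describe_stream(entry):
    # parse_qs decodes percent-escapes, so "audio%2Fwebm" becomes "audio/webm".
    params = parse_qs(urlsplit(entry["url"]).query)
    return {"itag": entry["itag"], "mime": params.get("mime", [None])[0]}


audio = pick_by_itag(entries, 251)
print(describe_stream(audio))
```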

How do CloudSEK Crawlers work?

CloudSEK’s digital risk monitoring platform, XVigil, scours the surface web, deep web, and dark web to automatically detect threats and alert customers. Once a list of keywords suggested by the client has been configured, the process works as follows:

CloudSEK Crawlers fetch data from various sources on the internet
The crawlers push the gathered data to a centralized queue
ML classifiers group the data into threats and non-threats
Threats are immediately reported to clients as alerts via XVigil, while non-threats are ignored
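The overall flow (fetch, queue, classify, alert) can be illustrated with a toy pipeline. This is not CloudSEK's implementation: the keywords, the sample documents, and every function name below are invented for this sketch, and a plain keyword match stands in for the ML classifiers.

```python
import queue

# Hypothetical client keywords; in the real platform these are configured per client.
KEYWORDS = {"acme-corp", "acme.example"}


def fetch_documents():
    # Stand-in for crawlers collecting data from different sources.
    return [
        {"source": "paste-site", "text": "Leaked credentials for acme-corp admin panel"},
        {"source": "forum", "text": "Weekend football results"},
        {"source": "marketplace", "text": "Selling database dump from acme.example"},
    ]


def is_threat(doc):
    # Stand-in for an ML classifier: flag documents that mention a keyword.
    text = doc["text"].lower()
    return any(keyword in text for keyword in KEYWORDS)


def run_pipeline():
    central_queue = queue.Queue()

    # Crawlers push everything they gather to one central queue.
    for doc in fetch_documents():
        central_queue.put(doc)

    # The classifier drains the queue; threats become alerts, the rest is dropped.
    alerts = []
    while not central_queue.empty():
        doc = central_queue.get()
        if is_threat(doc):
            alerts.append({"source": doc["source"], "text": doc["text"]})
    return alerts


for alert in run_pipeline():
    print(f"ALERT from {alert['source']}: {alert['text']}")
```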