Imagine that search engines like Google never existed. How would you find what you need across 4.2 billion web pages? Web crawlers are programs that browse the internet, gather information, and parse and index the collected data to enable quick searches. Crawlers are thus a smart way to handle big data sets, and a catalyst for major advancements in the field of cyber security.
In this article, we will learn:
- What is crawling?
- Applications of crawling
- Python modules used for crawling
- Use-case: Fetching downloadable URLs from YouTube using crawlers
- How do CloudSEK Crawlers work?
What is crawling?
Crawling refers to the process of scraping or extracting data from websites using web crawlers. For instance, Google uses spider bots (crawlers) to read the content of billions of web pages and posts. It then gathers data from these sites and arranges it in the Google Search index.
Basic stages of crawling:
- Scrape data from the source
- Parse the collected data
- Clean the data of any noise or duplicate entries
- Structure the data as per requirement
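The four stages above can be sketched end to end in a few lines. This is a minimal illustration, not a production crawler: the page is an inline HTML string (SAMPLE_HTML, made up for this example) instead of a live download, and Python's built-in html.parser does the parsing. The function names scrape, parse, clean and structure are hypothetical and simply mirror the stage names.

```python
from html.parser import HTMLParser

# Stand-in for a fetched page; a real crawler would download this over HTTP.
SAMPLE_HTML = """
<html><body>
  <a href="/pricing">Pricing</a>
  <a href="/pricing">Pricing (footer)</a>
  <a href="https://example.com/blog"> Blog </a>
  <a>No link here</a>
</body></html>
"""


class LinkParser(HTMLParser):
    """Collects the href attribute of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def scrape():
    # Stage 1: scrape data from the source (simulated here).
    return SAMPLE_HTML


def parse(html):
    # Stage 2: parse the collected data.
    parser = LinkParser()
    parser.feed(html)
    return parser.links


def clean(links):
    # Stage 3: strip whitespace and drop duplicates, keeping first-seen order.
    return list(dict.fromkeys(link.strip() for link in links))


def structure(links):
    # Stage 4: structure the data as required (here, a list of records).
    return [{"id": i, "link": link} for i, link in enumerate(links, start=1)]


records = structure(clean(parse(scrape())))
print(records)
```

Keeping each stage as its own function makes it easy to swap one out, for example replacing scrape with a real HTTP fetch, without touching the others.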
Applications of crawling
Organizations crawl and scrape data from web pages for various reasons that may benefit them or their customers. Here are some lesser-known applications of crawling:
- Comparing data for market analysis
- Monitoring data leaks
- Preparing data sets for Machine Learning algorithms
- Fact-checking information on social media
Python modules used for crawling
- Requests – Allows you to send HTTP requests to web pages
- BeautifulSoup – Python library that pulls data out of HTML and XML files and parses their elements into the required format
- Selenium – Open source testing suite for web applications. It can also automate browser actions to retrieve data.
Use-case: Fetching downloadable URLs from YouTube using crawlers
A single YouTube video may have several downloadable URLs, depending on its content, resolution, bitrate, range and VR/3D. Here is sample API and CLI code to get the downloadable URLs of a YouTube video along with their itags:
Project structure
youtube
|
|---- app.py
|---- cli.py
`---- core.py
The project will contain three files:
app.py: For the API interface, using the Flask microframework
cli.py: For the command line interface, using the argparse module
core.py: Contains the core (common) functionality, which acts as helper functions for app.py and cli.py.
# youtube/app.py
import flask
from flask import jsonify, request

import core

app = flask.Flask(__name__)
app.config["DEBUG"] = True


@app.route('/', methods=['GET'])
def get_downloadable_urls():
    # The YouTube watch URL is passed in the `url` query parameter.
    if 'url' not in request.args:
        return "Error: No url field provided. Please specify a YouTube URL.", 400
    url = request.args['url']
    urls = core.get_downloadable_urls(url)
    return jsonify(urls)


if __name__ == '__main__':
    app.run()
Flask interface code to get downloadable URLs through the API.
Request URL - localhost:<port>/?url=https://www.youtube.com/watch?v=FIVPlraNgXs
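One caveat when building this request URL: if the video URL itself contains an ampersand (for example a timestamp parameter), everything after it would be read as a separate query parameter of the API call. A small sketch of percent-encoding the value with the standard library is shown below. The host and port in BASE and the extra t=10s parameter are assumptions made for illustration.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Hypothetical host and port; Flask's development server defaults to 5000.
BASE = "http://localhost:5000/"

# Illustrative video URL with a second query parameter.
video_url = "https://www.youtube.com/watch?v=FIVPlraNgXs&t=10s"

# urlencode percent-encodes '?', '&' and '=' inside the value.
request_url = BASE + "?" + urlencode({"url": video_url})
print(request_url)

# Decoding the query string gives back the original video URL intact.
decoded = parse_qs(urlsplit(request_url).query)["url"][0]
```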
# youtube/cli.py
import argparse

import core

my_parser = argparse.ArgumentParser(description='Get youtube downloadable video from url')
my_parser.add_argument('-u', '--url', metavar='', required=True, help='youtube url')
args = my_parser.parse_args()

urls = core.get_downloadable_urls(args.url)
print(f'Got {len(urls)} urls\n')
for index, url in enumerate(urls, start=1):
    print(f'{index}. {url}\n')
Code snippet to get downloadable URLs through the command line interface (using argparse to parse command line arguments)
Command line interface - python cli.py -u 'https://www.youtube.com/watch?v=aWPYw7iVBg0'
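# youtube/core.py
import json
import re

import requests

# Matches the player config object embedded in the watch page source.
CONFIG_RE = re.compile(r'ytplayer[.]config\s*=\s*(\{.*?\});')


def get_downloadable_urls(url):
    """Return the itag and URL of each adaptive format of a video."""
    html = requests.get(url).text
    match = CONFIG_RE.search(html)
    if match is None:
        raise ValueError('ytplayer.config not found in page source')
    conf = json.loads(match.group(1))
    # player_response is itself a JSON-encoded string inside the config.
    player_response = json.loads(conf['args']['player_response'])
    data = player_response['streamingData']
    return [{'itag': frmt['itag'], 'url': frmt['url']} for frmt in data['adaptiveFormats']]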
This is the core (common) function used by both the API and the CLI.
The execution of these commands will:
- Take a YouTube URL as an argument
- Gather page source using the Requests module
- Parse it and get streaming data
- Return response objects: url and itag
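The parsing steps can be tried offline against a made-up page fragment. The sketch below assumes the page embeds a "ytplayer.config = {...};" assignment, as core.py expects. SAMPLE_PAGE and the function name extract_formats are hypothetical. Instead of a non-greedy regex, which can cut the match short if "};" ever appears inside the JSON, it uses json.JSONDecoder.raw_decode to read exactly one JSON value from the position where the config starts.

```python
import json
import re

# Made-up page fragment shaped like the structure core.py expects.
player_response = {
    "streamingData": {
        "adaptiveFormats": [
            {"itag": 251, "url": "https://example.com/audio"},
            {"itag": 137, "url": "https://example.com/video"},
        ]
    }
}
config = {"args": {"player_response": json.dumps(player_response)}}
SAMPLE_PAGE = "<script>ytplayer.config = " + json.dumps(config) + ";</script>"


def extract_formats(html):
    """Return [{'itag': ..., 'url': ...}] from an embedded player config."""
    marker = re.search(r"ytplayer[.]config\s*=\s*", html)
    if marker is None:
        raise ValueError("player config not found in page source")
    # raw_decode parses one JSON value starting at the given index and
    # ignores whatever follows it, so no closing pattern is needed.
    conf, _ = json.JSONDecoder().raw_decode(html, marker.end())
    response = json.loads(conf["args"]["player_response"])
    formats = response["streamingData"]["adaptiveFormats"]
    return [{"itag": f["itag"], "url": f["url"]} for f in formats]


formats = extract_formats(SAMPLE_PAGE)
print(formats)
```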
How to use these URLs?
- Build your own YouTube downloader (web app)
- Build an API to download YouTube video
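As a starting point for a downloader, the sketch below picks one entry by itag and streams it to disk. It is only an outline under assumptions: the formats list is hypothetical data in the shape returned by core.get_downloadable_urls, and the helper names pick_format and download are made up for this example. The actual download call is left commented out because it needs network access and a URL that has not expired.

```python
import shutil
import urllib.request


def pick_format(formats, itag):
    """Return the entry whose itag matches, or None if there is none."""
    return next((f for f in formats if f["itag"] == itag), None)


def download(url, path):
    """Stream the response body to disk without loading it all in memory."""
    with urllib.request.urlopen(url) as response, open(path, "wb") as out:
        shutil.copyfileobj(response, out)


# Hypothetical entries in the shape returned by core.get_downloadable_urls.
formats = [
    {"itag": 251, "url": "https://example.com/audio.webm"},
    {"itag": 137, "url": "https://example.com/video.mp4"},
]
chosen = pick_format(formats, 251)
# download(chosen["url"], "audio.webm")  # needs network access
```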
Sample result
[{
'itag': 251,
'url': 'https://r2---sn-gwpa-h55k.googlevideo.com/videoplayback?expire=1585225812&ei=9Et8Xs6XNoHK4-EPjfyIiA8&ip=157.46.68.124&id=o-AGeDi3DVtAbmT5GiuGsDU7-NPLk23fOXNnY16gGQcHWu&itag=251&source=youtube&requiressl=yes&mh=Av&mm=31%2C26&mn=sn-gwpa-h55k%2Csn-cvh76ned&ms=au%2Conr&mv=m&mvi=1&pl=18&initcwndbps=112500&vprv=1&mime=audio%2Fwebm&gir=yes&clen=14933951&dur=986.761&lmt=1576518368612802&mt=1585204109&fvip=2&keepalive=yes&fexp=23882514&c=WEB&txp=5531432&sparams=expire%2Cei%2Cip%2Cid%2Citag%2Csource%2Crequiressl%2Cvprv%2Cmime%2Cgir%2Cclen%2Cdur%2Clmt&sig=ADKhkGMwRAIgK4L4VVHAlWMPVPEcmdkhnb2u8UM6eYhFz16kGruxZjUCIFXZJM9ejVK7OZJFqx7YwBqa3CrDvVakuU86vcIyMv-a&lsparams=mh%2Cmm%2Cmn%2Cms%2Cmv%2Cmvi%2Cpl%2Cinitcwndbps&lsig=ABSNjpQwRAIgKBhJytjv73-c7eMWbVkb-X8_rNb7_xApZvaPfw7wGcMCIHqJ405fQ3Kr-e_5fV8gokMUNi0rrrLG8T85sLGTQ17W'
}]
What is an itag?
The itag is YouTube's numeric format code. It tells us more about a stream, such as the type of video content, resolution, bitrate, range and VR/3D. A comprehensive list of YouTube format code itags can be found here.
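The itag also travels inside the download URL itself, next to other query parameters. The sketch below reads a few of them from a shortened copy of the sample URL using urllib.parse. Treating expire as a Unix timestamp and dur as a duration in seconds is an assumption based on how the values look, not on documented behaviour.

```python
from datetime import datetime, timezone
from urllib.parse import parse_qs, urlsplit

# Shortened copy of the sample URL; most parameters are omitted.
url = (
    "https://r2---sn-gwpa-h55k.googlevideo.com/videoplayback"
    "?expire=1585225812&itag=251&mime=audio%2Fwebm"
    "&clen=14933951&dur=986.761"
)

# parse_qs returns a list per key; keep the first value of each.
params = {key: values[0] for key, values in parse_qs(urlsplit(url).query).items()}

itag = int(params["itag"])
mime = params["mime"]  # parse_qs decodes %2F to "/"
duration_s = float(params["dur"])  # assumed to be seconds
# Assumed to be a Unix timestamp after which the URL stops working.
expires = datetime.fromtimestamp(int(params["expire"]), tz=timezone.utc)

print(itag, mime, duration_s, expires.isoformat())
```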
How do CloudSEK Crawlers work?
CloudSEK’s digital risk monitoring platform, XVigil, scours the surface web, deep web, and dark web to automatically detect threats and alert customers. Once a list of keywords suggested by the client has been configured, the process runs as follows:
- CloudSEK Crawlers fetch data from various sources on the internet
- The crawlers push the gathered data to a centralized queue
- ML classifiers group the data into threats and non-threats
- Threats are immediately reported to clients as alerts via XVigil, and non-threats are ignored
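The flow of that pipeline can be illustrated with a toy version. This is not CloudSEK's implementation: the keywords, the crawled items and every function name below are hypothetical, and a plain keyword match stands in for the ML classifiers purely to keep the example self-contained.

```python
import queue

# Hypothetical client keywords and crawled snippets, for illustration only.
KEYWORDS = {"acme", "acme-corp.example"}
CRAWLED = [
    {"source": "paste-site", "text": "dump of ACME employee passwords"},
    {"source": "forum", "text": "weekend hiking photos"},
    {"source": "marketplace", "text": "selling acme-corp.example database"},
]


def crawl():
    """Stand-in for crawlers fetching data from many sources."""
    yield from CRAWLED


def is_threat(item):
    """Keyword match used as a placeholder for an ML classifier."""
    text = item["text"].lower()
    return any(keyword in text for keyword in KEYWORDS)


def run_pipeline():
    central_queue = queue.Queue()
    for item in crawl():          # 1. fetch data
        central_queue.put(item)   # 2. push to a centralized queue

    alerts = []
    while not central_queue.empty():
        item = central_queue.get()
        if is_threat(item):       # 3. classify as threat or non-threat
            alerts.append(item)   # 4. report threats; ignore the rest
    return alerts


alerts = run_pipeline()
print([alert["source"] for alert in alerts])
```

The queue decouples fetching from classification, so in a real system the two sides could run and scale independently.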