Imagine that search engines like Google never existed. How would you find what you need across 4.2 billion web pages? Web crawlers are programs that browse the internet to gather information, then parse and index the collected data so that it can be searched quickly. Crawlers are thus a practical way to handle big data sets and a catalyst for major advances in the field of cyber security.
In this article, we will learn:
Crawling refers to the process of extracting (scraping) data from websites using web crawlers. For instance, Google uses spider bots (crawlers) to read the content of billions of web pages and posts. It then gathers data from these sites and arranges it in the Google Search index.
Organizations crawl and scrape data from web pages for various reasons that may benefit them or their customers. Here are some lesser-known applications of crawling:
A single YouTube video may have several downloadable URLs, depending on its content, resolution, bitrate, range and VR/3D. Here is sample API and CLI code to get the downloadable URLs of a YouTube video along with their ITags:
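To make the idea concrete, here is a minimal sketch of one step a crawler typically performs: pulling the links out of a page so that they can be visited next. It is an illustration only, not a description of how Google's crawlers work. The names `LinkExtractor`, `extract_links` and `SAMPLE_HTML` are hypothetical, and a hard-coded HTML string stands in for a page that a real crawler would download.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects absolute URLs from the href attribute of <a> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the URL of the page itself.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return every link found in html, in document order."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


# Hypothetical page content, used here instead of a live HTTP request.
SAMPLE_HTML = '<p><a href="/about">About</a> <a href="https://example.org/x">X</a></p>'

print(extract_links(SAMPLE_HTML, "https://example.com/index.html"))
# ['https://example.com/about', 'https://example.org/x']
```

A full crawler would repeat this step for each newly discovered link, while keeping track of the pages it has already visited.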
youtube
|
|---- app.py
|---- cli.py
`---- core.py
The project will contain three files:
app.py: For the API interface, using the flask micro framework
cli.py: For the command line interface, using the argparse module
core.py: Contains all the core (common) functionalities which act as helper functions for app.py and cli.py.
# youtube/app.py
import flask
from flask import jsonify, request

import core

app = flask.Flask(__name__)
app.config["DEBUG"] = True


@app.route('/', methods=['GET'])
def get_downloadable_urls():
    # The YouTube watch URL is passed in the "url" query parameter.
    if 'url' not in request.args:
        return "Error: No url field provided. Please specify a YouTube url.", 400
    url = request.args['url']
    urls = core.get_downloadable_urls(url)
    return jsonify(urls)


if __name__ == '__main__':
    app.run()
The Flask interface code to get downloadable URLs through the API.
Request URL - localhost:<port>/?url=https://www.youtube.com/watch?v=FIVPlraNgXs
# youtube/cli.py
import argparse

import core

my_parser = argparse.ArgumentParser(description='Get youtube downloadable video from url')
my_parser.add_argument('-u', '--url', metavar='', required=True, help='youtube url')
args = my_parser.parse_args()

urls = core.get_downloadable_urls(args.url)
print(f'Got {len(urls)} urls\n')
# Print each result as a numbered entry, starting from 1.
for index, url in enumerate(urls, start=1):
    print(f'{index}. {url}\n')
Code snippet to get downloadable URLs through the command line interface (using argparse to parse command line arguments)
Command line interface - python cli.py -u 'https://www.youtube.com/watch?v=aWPYw7iVBg0'
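# youtube/core.py
import json
import re

import requests

# Matches the JSON object assigned to ytplayer.config in the page source.
PLAYER_CONFIG_RE = re.compile(r'ytplayer[.]config\s*=\s*(\{.*?\});')


def get_downloadable_urls(url):
    html = requests.get(url).text
    match = PLAYER_CONFIG_RE.search(html)
    if match is None:
        # The page did not contain the expected player config.
        return []
    conf = json.loads(match.group(1))
    # player_response is itself a JSON string nested inside the config.
    player_response = json.loads(conf['args']['player_response'])
    data = player_response['streamingData']
    return [{'itag': frmt['itag'], 'url': frmt['url']} for frmt in data['adaptiveFormats']]
This is the core (common) function for both the API and CLI interfaces.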
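The parsing steps in core.py can be tried without a network connection. The sketch below applies the same regular expression to a tiny synthetic page. `SAMPLE_HTML` and `parse_formats` are hypothetical names, and the page content is a heavily trimmed stand-in that only assumes the layout the core function expects (a `ytplayer.config` object whose `args.player_response` is a JSON string); a real watch page is far larger and may not follow this layout.

```python
import json
import re

# Same pattern as in core.py: capture the object assigned to ytplayer.config.
PLAYER_CONFIG_RE = re.compile(r'ytplayer[.]config\s*=\s*(\{.*?\});')

# Build a synthetic page. player_response is a JSON string nested in JSON.
inner = json.dumps({
    "streamingData": {
        "adaptiveFormats": [{"itag": 251, "url": "https://example.com/audio"}]
    }
})
SAMPLE_HTML = (
    "<script>ytplayer.config = "
    + json.dumps({"args": {"player_response": inner}})
    + ";</script>"
)


def parse_formats(html):
    """Return the itag/url pairs found in html, or [] if none are found."""
    match = PLAYER_CONFIG_RE.search(html)
    if match is None:
        return []
    conf = json.loads(match.group(1))
    player_response = json.loads(conf["args"]["player_response"])
    formats = player_response.get("streamingData", {}).get("adaptiveFormats", [])
    # Skip entries without a direct url instead of raising KeyError.
    return [{"itag": f["itag"], "url": f["url"]} for f in formats if "url" in f]


print(parse_formats(SAMPLE_HTML))
# [{'itag': 251, 'url': 'https://example.com/audio'}]
```

Note that the non-greedy pattern stops at the first `};` it meets, so it only captures the whole object when that sequence does not occur earlier inside the config.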
The execution of these commands will:
Sample result
[{
    'itag': 251,
    'url': 'https://r2---sn-gwpa-h55k.googlevideo.com/videoplayback?expire=1585225812&ei=9Et8Xs6XNoHK4-EPjfyIiA8&ip=157.46.68.124&id=o-AGeDi3DVtAbmT5GiuGsDU7-NPLk23fOXNnY16gGQcHWu&itag=251&source=youtube&requiressl=yes&mh=Av&mm=31%2C26&mn=sn-gwpa-h55k%2Csn-cvh76ned&ms=au%2Conr&mv=m&mvi=1&pl=18&initcwndbps=112500&vprv=1&mime=audio%2Fwebm&gir=yes&clen=14933951&dur=986.761&lmt=1576518368612802&mt=1585204109&fvip=2&keepalive=yes&fexp=23882514&c=WEB&txp=5531432&sparams=expire%2Cei%2Cip%2Cid%2Citag%2Csource%2Crequiressl%2Cvprv%2Cmime%2Cgir%2Cclen%2Cdur%2Clmt&sig=ADKhkGMwRAIgK4L4VVHAlWMPVPEcmdkhnb2u8UM6eYhFz16kGruxZjUCIFXZJM9ejVK7OZJFqx7YwBqa3CrDvVakuU86vcIyMv-a&lsparams=mh%2Cmm%2Cmn%2Cms%2Cmv%2Cmvi%2Cpl%2Cinitcwndbps&lsig=ABSNjpQwRAIgKBhJytjv73-c7eMWbVkb-X8_rNb7_xApZvaPfw7wGcMCIHqJ405fQ3Kr-e_5fV8gokMUNi0rrrLG8T85sLGTQ17W'
}]
What is ITag?
ITag gives us more details about the video, such as the type of video content, resolution, bitrate, range and VR/3D. A comprehensive list of YouTube format code ITags can be found here.
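The sample result above also shows that each download URL carries its ITag and a few other details as query parameters (for example `itag=251` and `mime=audio%2Fwebm`). As a small sketch, the standard library can pull these out of a URL. `describe_stream_url` and `SAMPLE_URL` are hypothetical names, the URL is a trimmed version of the sample result, and the parameter names are simply those visible in that sample, not a documented contract.

```python
from urllib.parse import parse_qs, urlparse


def describe_stream_url(url):
    """Extract a few query parameters from a download URL."""
    query = parse_qs(urlparse(url).query)
    return {
        "itag": int(query["itag"][0]) if "itag" in query else None,
        # parse_qs decodes percent-escapes, so audio%2Fwebm becomes audio/webm.
        "mime": query.get("mime", [None])[0],
        "expire": int(query["expire"][0]) if "expire" in query else None,
    }


# Trimmed version of the URL in the sample result (most parameters removed).
SAMPLE_URL = (
    "https://r2---sn-gwpa-h55k.googlevideo.com/videoplayback"
    "?expire=1585225812&itag=251&mime=audio%2Fwebm"
)

print(describe_stream_url(SAMPLE_URL))
# {'itag': 251, 'mime': 'audio/webm', 'expire': 1585225812}
```

Parameters that are absent from a URL come back as `None`, so the helper does not fail on URLs that lack them.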
CloudSEK’s digital risk monitoring platform, XVigil, scours the surface web, deep web, and dark web to automatically detect threats and alert customers. After a list of keywords suggested by the clients has been configured, CloudSEK Crawlers: