🚀 CloudSEK has raised $19M Series B1 Round – Powering the Future of Predictive Cybersecurity
Read More
Protect your organization from external threats like data leaks, brand threats, dark web originated threats and more. Schedule a demo today!
Schedule a Demo
Let us assume search engines, like Google, never existed! How would you find what you need from across 4.2 billion web pages? Web crawlers are programs written to browse the internet, to gather information, index and parse the collected data, to facilitate quick searches. Crawlers are, thus, a smart solution to big data sets and a catalyst to major advancements in the field of cyber security.
In this article, we will learn:
Crawling refers to the process of scraping/ extracting data from websites/ the internet using web crawlers. For instance, Google uses spider bots (crawlers) to read the content of billions of web pages and posts. Then, it gathers data from these sites and arranges them in the Google Search index.
Organizations crawl and scrape data off of web pages for various reasons that may benefit them or their customers. Here are some lesser known applications of crawling:
A single YouTube video may have several downloadable URLs, based on: its content, resolution, bitrate, range and VR/3D. Here is a sample API and CLI code to get downloadable URLs on YouTube along with their Itags:
youtube
|
|---- app.py
|---- cli.py
`---- core.py
The project will contain three files:
app.py: For the api interface, using flask micro framework
cli.py: For command line interface, using argparse module
core.py: Contains all the core (common) functionalities which act as helper functions for app.py and cli.py.
# youtube/app.py
import flask
from flask import jsonify, request
import core
app = flask.Flask(__name__)
app.config["DEBUG"] = True
@app.route('/', methods=['GET'])
def get_downloadable_urls():
if 'url' not in request.args:
return "Error: No url field provided. Please specify an youtube url."
url = request.args['url']
urls = core.get_downloadable_urls(url)
return jsonify(urls)
app.run()
The flask interface code to get downloadable URLs through API.
Request url - localhost:<port>/?url=https://www.youtube.com/watch?v=FIVPlraNgXs
# youtube/cli.py
import argparse
import core
my_parser = argparse.ArgumentParser(description='Get youtube downloadable video from url')
my_parser.add_argument('-u', '--url', metavar='', required=True, help='youtube url')
args = my_parser.parse_args()
urls = core.get_downloadable_urls(args.url)
print(f'Got {len(urls)} urls\n')
for index, url in enumerate(urls, start=1):
print(f'{index}. {url}\n')
Code snippet to get downlodable urls through comand line interface (using argparse to parse command like arguments)
Command line interface - python cli.py -u 'https://www.youtube.com/watch?v=aWPYw7iVBg0'
# youtube/core.py
import json
import re
import requests
def get_downloadable_urls(url):
html = requests.get(url).text
RE = re.compile(r'ytplayer[.]config\s*=\s*(\{.*?\});')
conf = json.loads(RE.search(html).group(1))
player_response = json.loads(conf['args']['player_response'])
data = player_response['streamingData']
return [{'itag': frmt['itag'],'url': frmt['url']} for frmt in data['adaptiveFormats']]
This is the core (common) function for both API and CLI interface.
The execution of these commands will:
Sample result
[{
'itag': 251,
'Url': 'https://r2---sn-gwpa-h55k.googlevideo.com/videoplayback?expire=1585225812&ei=9Et8Xs6XNoHK4-EPjfyIiA8&ip=157.46.68.124&id=o-AGeDi3DVtAbmT5GiuGsDU7-NPLk23fOXNnY16gGQcHWu&itag=251&source=youtube&requiressl=yes&mh=Av&mm=31%2C26&mn=sn-gwpa-h55k%2Csn-cvh76ned&ms=au%2Conr&mv=m&mvi=1&pl=18&initcwndbps=112500&vprv=1&mime=audio%2Fwebm&gir=yes&clen=14933951&dur=986.761&lmt=1576518368612802&mt=1585204109&fvip=2&keepalive=yes&fexp=23882514&c=WEB&txp=5531432&sparams=expire%2Cei%2Cip%2Cid%2Citag%2Csource%2Crequiressl%2Cvprv%2Cmime%2Cgir%2Cclen%2Cdur%2Clmt&sig=ADKhkGMwRAIgK4L4VVHAlWMPVPEcmdkhnb2u8UM6eYhFz16kGruxZjUCIFXZJM9ejVK7OZJFqx7YwBqa3CrDvVakuU86vcIyMv-a&lsparams=mh%2Cmm%2Cmn%2Cms%2Cmv%2Cmvi%2Cpl%2Cinitcwndbps&lsig=ABSNjpQwRAIgKBhJytjv73-c7eMWbVkb-X8_rNb7_xApZvaPfw7wGcMCIHqJ405fQ3Kr-e_5fV8gokMUNi0rrrLG8T85sLGTQ17W'
}]
What is ITag?
ITag gives us more details about the video such as the type of video content, resolution, bitrate, range and VR/3D. A comprehensive list of YouTube format code ITags can be found here.
CloudSEK’s digital risk monitoring platform, XVigil, scours the internet across surface web, dark web, and deep web, to automatically detect threats and alert customers. After configuring a list of keywords suggested by the clients, CloudSEK Crawlers:
Discover how CloudSEK's comprehensive takedown services protect your brand from online threats.
How to bypass CAPTCHAs easily using Python and other methods
What is shadow IT and how do you manage shadow IT risks associated with remote work?
Take action now
CloudSEK Platform is a no-code platform that powers our products with predictive threat analytic capabilities.
Digital Risk Protection platform which gives Initial Attack Vector Protection for employees and customers.
Software and Supply chain Monitoring providing Initial Attack Vector Protection for Software Supply Chain risks.
Creates a blueprint of an organization's external attack surface including the core infrastructure and the software components.
Instant Security Score for any Android Mobile App on your phone. Search for any app to get an instant risk score.
min read
How are Python modules used for web crawling?
Let us assume search engines, like Google, never existed! How would you find what you need from across 4.2 billion web pages? Web crawlers are programs written to browse the internet, to gather information, index and parse the collected data, to facilitate quick searches. Crawlers are, thus, a smart solution to big data sets and a catalyst to major advancements in the field of cyber security.
In this article, we will learn:
Crawling refers to the process of scraping/ extracting data from websites/ the internet using web crawlers. For instance, Google uses spider bots (crawlers) to read the content of billions of web pages and posts. Then, it gathers data from these sites and arranges them in the Google Search index.
Organizations crawl and scrape data off of web pages for various reasons that may benefit them or their customers. Here are some lesser known applications of crawling:
A single YouTube video may have several downloadable URLs, based on: its content, resolution, bitrate, range and VR/3D. Here is a sample API and CLI code to get downloadable URLs on YouTube along with their Itags:
youtube
|
|---- app.py
|---- cli.py
`---- core.py
The project will contain three files:
app.py: For the api interface, using flask micro framework
cli.py: For command line interface, using argparse module
core.py: Contains all the core (common) functionalities which act as helper functions for app.py and cli.py.
# youtube/app.py
import flask
from flask import jsonify, request
import core
app = flask.Flask(__name__)
app.config["DEBUG"] = True
@app.route('/', methods=['GET'])
def get_downloadable_urls():
if 'url' not in request.args:
return "Error: No url field provided. Please specify an youtube url."
url = request.args['url']
urls = core.get_downloadable_urls(url)
return jsonify(urls)
app.run()
The flask interface code to get downloadable URLs through API.
Request url - localhost:<port>/?url=https://www.youtube.com/watch?v=FIVPlraNgXs
# youtube/cli.py
import argparse
import core
my_parser = argparse.ArgumentParser(description='Get youtube downloadable video from url')
my_parser.add_argument('-u', '--url', metavar='', required=True, help='youtube url')
args = my_parser.parse_args()
urls = core.get_downloadable_urls(args.url)
print(f'Got {len(urls)} urls\n')
for index, url in enumerate(urls, start=1):
print(f'{index}. {url}\n')
Code snippet to get downlodable urls through comand line interface (using argparse to parse command like arguments)
Command line interface - python cli.py -u 'https://www.youtube.com/watch?v=aWPYw7iVBg0'
# youtube/core.py
import json
import re
import requests
def get_downloadable_urls(url):
html = requests.get(url).text
RE = re.compile(r'ytplayer[.]config\s*=\s*(\{.*?\});')
conf = json.loads(RE.search(html).group(1))
player_response = json.loads(conf['args']['player_response'])
data = player_response['streamingData']
return [{'itag': frmt['itag'],'url': frmt['url']} for frmt in data['adaptiveFormats']]
This is the core (common) function for both API and CLI interface.
The execution of these commands will:
Sample result
[{
'itag': 251,
'Url': 'https://r2---sn-gwpa-h55k.googlevideo.com/videoplayback?expire=1585225812&ei=9Et8Xs6XNoHK4-EPjfyIiA8&ip=157.46.68.124&id=o-AGeDi3DVtAbmT5GiuGsDU7-NPLk23fOXNnY16gGQcHWu&itag=251&source=youtube&requiressl=yes&mh=Av&mm=31%2C26&mn=sn-gwpa-h55k%2Csn-cvh76ned&ms=au%2Conr&mv=m&mvi=1&pl=18&initcwndbps=112500&vprv=1&mime=audio%2Fwebm&gir=yes&clen=14933951&dur=986.761&lmt=1576518368612802&mt=1585204109&fvip=2&keepalive=yes&fexp=23882514&c=WEB&txp=5531432&sparams=expire%2Cei%2Cip%2Cid%2Citag%2Csource%2Crequiressl%2Cvprv%2Cmime%2Cgir%2Cclen%2Cdur%2Clmt&sig=ADKhkGMwRAIgK4L4VVHAlWMPVPEcmdkhnb2u8UM6eYhFz16kGruxZjUCIFXZJM9ejVK7OZJFqx7YwBqa3CrDvVakuU86vcIyMv-a&lsparams=mh%2Cmm%2Cmn%2Cms%2Cmv%2Cmvi%2Cpl%2Cinitcwndbps&lsig=ABSNjpQwRAIgKBhJytjv73-c7eMWbVkb-X8_rNb7_xApZvaPfw7wGcMCIHqJ405fQ3Kr-e_5fV8gokMUNi0rrrLG8T85sLGTQ17W'
}]
What is ITag?
ITag gives us more details about the video such as the type of video content, resolution, bitrate, range and VR/3D. A comprehensive list of YouTube format code ITags can be found here.
CloudSEK’s digital risk monitoring platform, XVigil, scours the internet across surface web, dark web, and deep web, to automatically detect threats and alert customers. After configuring a list of keywords suggested by the clients, CloudSEK Crawlers: