Privacy and Security Risks of URL Scanning and Archiving Services

Ramadan, Tarek ORCID: https://orcid.org/0009-0009-9600-5503 (2026) Privacy and Security Risks of URL Scanning and Archiving Services. Masters thesis, Concordia University.

Text (application/pdf)
RAMADAN_MA_S2026.pdf - Accepted Version
Restricted to Repository staff only until 1 October 2026.
Available under License Spectrum Terms of Access.
2MB

Abstract

Private data is frequently exposed on the web through publicly accessible URLs that were
meant to be private. These URLs are routinely submitted to scanners or captured by web crawlers,
archiving services, and paste-sharing platforms. Measuring the prevalence and nature of such exposure
at scale remains challenging. Prior work relies on keyword inference and ad-hoc detection
approaches, leaving URL-based privacy leaks underrepresented. We address this gap by designing
novel and efficient detection systems, relying on selective LLM reasoning and rule-based validation.
Our systems detect private exposure through URL parameters, rendered content, PDFs, visual
artifacts via OCR, and execution-time artifacts captured during scans or rendered pages, including
network requests, cookies, and encoded visual elements. We evaluate these systems’ false positive/
negative rates and their comparative runtime upon analyzing a corpus of 15 million publicly
accessible URLs. Our results suggest that a hybrid LLM–heuristics approach achieves the best
efficacy, with an F1 score of 0.90, compared to F1 scores of 0.79 for the heuristics-only approach
and 0.65 for the LLM-only approach, while remaining feasible to deploy at scale. The hybrid system
identified an alarming number of privacy leaks in our dataset: 181,204 leaks in 139,388 URLs,
largely dominated by authentication and financial artifacts, including account-recovery links and
publicly accessible e-signed contracts. Our findings highlight that the detection methodology influences
the measured leak prevalence, error behavior, and scalability trade-offs. We then investigate
in depth the root causes behind such systemic (versus accidental) URL-based privacy leaks, and
shed light on the main drivers of such exposure: scanner defaults and tiered user accounts. We
propose scanner-side policy changes and improved integration with end-user software as urgently
needed mitigations.
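
As an illustration only, the hybrid idea summarized in the abstract (cheap rule-based checks first, with an LLM consulted selectively for ambiguous cases) might be sketched as follows. This is not the thesis's implementation: the parameter-name lists, the `HIGH_ENTROPY` pattern, and the `llm_judge` callback are all hypothetical stand-ins, and only URL query parameters are inspected here. The `f1_score` helper shows the standard F1 definition used to compare such detectors.

```python
import re
from urllib.parse import urlsplit, parse_qsl

# Hypothetical rule set; the abstract does not list the actual rules.
SENSITIVE_KEYS = {"token", "access_token", "reset_token", "password",
                  "sessionid", "api_key", "auth"}
AMBIGUOUS_KEYS = {"id", "key", "code", "sig"}
# Assumed proxy for "looks like a secret": a long URL-safe string.
HIGH_ENTROPY = re.compile(r"^[A-Za-z0-9_\-]{16,}$")


def heuristic_verdict(url):
    """Return 'leak', 'clean', or 'unsure' using URL parameters alone."""
    params = parse_qsl(urlsplit(url).query, keep_blank_values=True)
    verdict = "clean"
    for key, value in params:
        name = key.lower()
        if name in SENSITIVE_KEYS and value:
            return "leak"  # explicit secret-bearing parameter
        if name in AMBIGUOUS_KEYS and HIGH_ENTROPY.match(value):
            verdict = "unsure"  # plausible secret, rules cannot decide
    return verdict


def classify(url, llm_judge):
    """Hybrid decision: consult the (hypothetical) LLM only when unsure."""
    verdict = heuristic_verdict(url)
    if verdict == "unsure":
        return "leak" if llm_judge(url) else "clean"
    return verdict


def f1_score(tp, fp, fn):
    """Standard F1: harmonic mean of precision and recall."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

In this sketch the LLM is called only for URLs the rules cannot settle, which is one plausible way a hybrid design could keep runtime manageable on a large corpus; `llm_judge` can be any callable that takes a URL and returns a boolean.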

Divisions: Concordia University > Gina Cody School of Engineering and Computer Science > Concordia Institute for Information Systems Engineering
Item Type: Thesis (Masters)
Authors: Ramadan, Tarek
Institution: Concordia University
Degree Name: M.A. Sc.
Program: Information Systems Security
Date: 25 March 2026
Thesis Supervisor(s): Mannan, Mohammad and Youssef, Amr
ID Code: 996930
Deposited By: Tarek Ramadan
Deposited On: 29 Jun 2026 14:44
Last Modified: 29 Jun 2026 14:44
All items in Spectrum are protected by copyright, with all rights reserved. The use of items is governed by Spectrum's terms of access.
