The internet provides a world of knowledge and opportunity; thanks to it, we have unprecedented access to information. On the other hand, it is also a space for malicious activity, including cybercrime. One of the most common tools criminals use is the malicious URL, which typically reaches the victim through phishing. Phishing is a type of social engineering attack in which attackers send malicious emails that look like legitimate messages. The aim is to access an organization’s assets or steal sensitive data such as login credentials and credit card numbers.
Phishing is among the top five crime types: the number of victims rose from 26,379 in 2018 to 300,497 in 2022 [1]. On average, phishing-related breaches took 295 days to detect and contain, which made phishing the third-slowest cyber threat to detect and contain in 2022 [2]. In 2022, 710 million phishing emails were blocked per week [3], and phishing contributed to over $44.2 million in losses in 2021 [4].
The purpose of Url-ex is to detect these malicious URLs using deep learning. The model is trained on a dataset that currently contains about 500k URLs, each labeled either “bad” or “good”.
Url-ex breaks a URL into six basic parts: protocol, sub-domain, domain name, second-level domain, top-level domain, and path.
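To make the six parts concrete, here is a minimal sketch that splits a URL with Python's standard `urllib.parse`. It is not Url-ex's actual parser: the function name `split_url` and the dictionary keys are hypothetical, and I am assuming that "domain name" means the second-level domain plus the top-level domain. The host is split naively on dots, so multi-label suffixes such as `co.uk` are not handled, and the URL is assumed to include a scheme.

```python
from urllib.parse import urlsplit


def split_url(url: str) -> dict:
    """Split a URL into six parts (illustrative, naive host handling)."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    labels = host.split(".")

    if len(labels) >= 2:
        # Everything left of the last two labels is treated as the sub-domain.
        sub_domain = ".".join(labels[:-2])
        second_level = labels[-2]
        top_level = labels[-1]
    else:
        # Single-label hosts such as "localhost" have no TLD.
        sub_domain, second_level, top_level = "", host, ""

    domain_name = f"{second_level}.{top_level}" if top_level else second_level

    return {
        "protocol": parts.scheme,
        "sub_domain": sub_domain,
        "domain_name": domain_name,
        "second_level_domain": second_level,
        "top_level_domain": top_level,
        "path": parts.path,
    }


example = split_url("https://mail.example.com/a/b?x=1")
```

For `https://mail.example.com/a/b?x=1`, the sketch reports `mail` as the sub-domain, `example.com` as the domain name, and `/a/b` as the path. A production parser would more likely consult a public suffix list instead of counting labels.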
When a URL is entered, the deep learning model’s investigation focuses on the following questions:
- Which protocol does the URL use?
- Is the URL shortened?
- Is there a suspicious redirection?
- Is there an XSS attempt?
- How many segments does the path have?
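The checks in this list could be turned into numeric or boolean features along the following lines. This is a hedged sketch and not the feature extraction Url-ex really performs: the `SHORTENERS` set, the `XSS_PATTERN` regular expression, and the function name `extract_features` are all assumptions made for illustration, and the heuristics are deliberately crude.

```python
import re
from urllib.parse import parse_qsl, unquote, urlsplit

# Hypothetical, far-from-complete list of URL shortener hosts.
SHORTENERS = {"bit.ly", "tinyurl.com", "t.co", "goo.gl"}

# Crude markers of a reflected XSS attempt (assumption, not exhaustive).
XSS_PATTERN = re.compile(
    r"<\s*script|javascript:|onerror\s*=|onload\s*=", re.IGNORECASE
)


def extract_features(url: str) -> dict:
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    decoded = unquote(url)  # undo percent-encoding before pattern matching

    # A query value that is itself an absolute URL hints at a redirect.
    redirect_in_query = any(
        value.lower().startswith(("http://", "https://", "//"))
        for _, value in parse_qsl(parts.query)
    )

    return {
        "uses_https": parts.scheme.lower() == "https",
        "is_shortened": host in SHORTENERS,
        "has_redirect": redirect_in_query or "//" in parts.path,
        "has_xss": bool(XSS_PATTERN.search(decoded)),
        # Number of non-empty segments in the path.
        "path_depth": len([seg for seg in parts.path.split("/") if seg]),
    }


suspicious = extract_features(
    "http://bit.ly/a/b/c?next=https%3A%2F%2Fevil.example"
    "&q=%3Cscript%3Ealert(1)%3C/script%3E"
)
benign = extract_features("https://example.com/")
```

In a learned model, such hand-written flags would at most serve as inputs next to the raw URL; the network itself decides how much weight each one deserves.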
As an additional feature, Url-ex also flags a URL as malicious if it contains an attack payload targeting the server side, such as SQL injection or directory traversal.
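As an illustration of what a server-side payload check might look like, the sketch below scans a decoded URL for a few well-known SQL injection and directory traversal patterns. The patterns and the function name `find_payloads` are assumptions chosen for this example; the article does not say how Url-ex implements this check, and a handful of regular expressions will miss many real payloads.

```python
import re
from urllib.parse import unquote

# A few classic SQL injection shapes: ' OR 1=1, UNION SELECT, ; DROP TABLE.
SQLI_PATTERN = re.compile(
    r"['\"]\s*(or|and)\s+\d+\s*=\s*\d+|union\s+select|;\s*drop\s+table",
    re.IGNORECASE,
)

# "../" or "..\" sequences used to climb out of the web root.
TRAVERSAL_PATTERN = re.compile(r"\.\./|\.\.\\")


def find_payloads(url: str) -> list:
    """Return the names of the payload types found in the URL."""
    # Decode twice so that double-encoded payloads (%252e -> %2e -> .) surface.
    decoded = unquote(unquote(url))

    found = []
    if SQLI_PATTERN.search(decoded):
        found.append("sql_injection")
    if TRAVERSAL_PATTERN.search(decoded):
        found.append("directory_traversal")
    return found


sqli_hit = find_payloads("http://example.com/item?id=1%27%20OR%201%3D1--")
traversal_hit = find_payloads("http://example.com/view?file=..%2F..%2Fetc%2Fpasswd")
clean = find_payloads("https://example.com/docs/intro")
```

Decoding before matching matters because attackers commonly percent-encode quotes, dots, and slashes to slip past naive string checks.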
Technical Details of Deep Learning Model
Url-ex consists of Artificial Neural Networks (ANNs) built with the Keras Sequential model. The reasons for using the Sequential model are as follows:
- The model has just one input and one output
- No layer has multiple inputs or multiple outputs
- I did not need layer sharing
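The three conditions above amount to saying that the network is a plain stack: each layer consumes exactly the output of the previous one. The dependency-free sketch below mimics that idea in pure Python. It is not Keras and not Url-ex's architecture: `TinySequential`, the layer sizes, and all weights are made up for illustration. In Keras, the corresponding construct is `keras.Sequential([...])` with a list of layers.

```python
import math


def relu(z: float) -> float:
    return max(0.0, z)


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def dense(weights, biases, activation):
    """Build a fully connected layer: one list in, one list out."""
    def layer(inputs):
        return [
            activation(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)
        ]
    return layer


class TinySequential:
    """A linear stack of layers: no branches, no shared layers."""

    def __init__(self, layers):
        self.layers = layers

    def predict(self, x):
        # Each layer receives only the previous layer's output.
        for layer in self.layers:
            x = layer(x)
        return x


# Made-up weights: 3 input features -> 2 hidden units -> 1 output score.
model = TinySequential([
    dense([[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]], [0.0, 0.1], relu),
    dense([[1.0, -1.0]], [0.0], sigmoid),
])

score = model.predict([1.0, 0.0, 1.0])[0]  # a value between 0 and 1
```

Because the final layer uses a sigmoid, the single output can be read as a score between 0 and 1, which suits a two-class "good" versus "bad" decision.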
To reduce the overall loss and improve accuracy, I chose the Adaptive Moment Estimation (Adam) optimizer. A fixed learning rate that is too large can cause the model to converge too quickly to a suboptimal solution, whereas a fixed learning rate that is too small can cause training to get stuck. For the best prediction accuracy in my case, I needed to start with a small learning rate without keeping it fixed for the whole of training. In addition, Adam has relatively low memory requirements and usually works well with little hyperparameter tuning.
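To show how Adam scales each step instead of using a fixed learning rate, here is a minimal single-parameter sketch of its update rule, applied to the toy function f(x) = (x - 3)^2. The hyperparameters are commonly used defaults plus a learning rate I picked for the toy problem; they are assumptions, not the values Url-ex was trained with, and `adam_minimize` is a hypothetical helper name.

```python
import math


def adam_minimize(grad, x0, lr=0.1, beta1=0.9, beta2=0.999,
                  eps=1e-8, steps=500):
    """Minimize a 1-D function with Adam, given its gradient."""
    x = x0
    m = 0.0  # running mean of gradients (first moment)
    v = 0.0  # running mean of squared gradients (second moment)

    for t in range(1, steps + 1):
        g = grad(x)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g

        # Bias correction: both averages start at zero and are too small early on.
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)

        # The step is scaled by the recent gradient magnitude.
        x -= lr * m_hat / (math.sqrt(v_hat) + eps)

    return x


# Gradient of f(x) = (x - 3)^2 is 2 * (x - 3); the minimum is at x = 3.
x_min = adam_minimize(lambda x: 2 * (x - 3), x0=0.0)
```

Only two extra numbers, `m` and `v`, are stored per parameter, which is where the modest memory footprint comes from.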
Web Application
Url-ex is available at https://url-ex.com/. Do not hesitate to test it! The dataset continues to expand and Url-ex keeps learning. If you think you have found a misclassification, please report it at https://url-ex.com/report.
References
[1] Internet Crime Report 2022 (2023), Internet Crime Complaint Center (IC3). https://www.ic3.gov/Media/PDF/AnnualReport/2022_IC3Report.pdf
[2] Cost of a Data Breach Report 2022 (2023), IBM. https://www.ibm.com/downloads/cas/3R8N1DZJ
[3] Microsoft Digital Defense Report 2022 (2023), Microsoft. https://query.prod.cms.rt.microsoft.com/cms/api/am/binary/RE5bUvv?culture=en-us
[4] Internet Crime Report 2021 (2022), Internet Crime Complaint Center (IC3). https://www.ic3.gov/Media/PDF/AnnualReport/2021_IC3Report.pdf