Phishing URL Detection Using CNN-LSTM and Random Forest Classifier
Hemant Gurung, Roshan Nepal and Sopnil Nepal
Abstract
This paper presents the classification of phishing URLs and legitimate URLs using machine learning and deep learning techniques. Phishing is the act of stealing private information by impersonating a legitimate entity. A machine learning model, a Random Forest classifier, is trained on features extracted from the URL's address bar, its domain, and the HTML and JavaScript of the page. In parallel, a hybrid CNN-LSTM model is trained to learn character-sequence features of the URL and classify it. The dataset is publicly available on Kaggle and contains 11,430 URLs: 5,715 legitimate and 5,715 phishing. The previously trained model is then used to classify the URL in the current address bar as legitimate or phishing. This paper thus focuses on the study and development of models for detecting phishing sites, so that the properties of URLs can be learned through feature extraction and the URLs classified as accurately as possible.
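As a rough illustration of the two input representations described above, the following Python sketch encodes a URL as a fixed-length sequence of character indices (the kind of input a character-level CNN-LSTM could consume) and computes a few address-bar features (the kind of input a Random Forest could consume). The vocabulary, the maximum length of 200, the function names, and the specific features are all assumptions made for illustration; they are not taken from the paper and do not represent its actual feature set or preprocessing.

```python
import ipaddress
import string
from urllib.parse import urlparse

# Assumed vocabulary: printable ASCII characters, numbered from 1.
# Index 0 is shared by padding and by characters outside the vocabulary.
VOCAB = {ch: i + 1 for i, ch in enumerate(string.printable)}
MAX_LEN = 200  # assumed fixed sequence length, not a value from the paper


def encode_url(url, max_len=MAX_LEN):
    """Map a URL to a list of integer character indices, truncated or zero-padded."""
    ids = [VOCAB.get(ch, 0) for ch in url[:max_len]]
    return ids + [0] * (max_len - len(ids))


def host_is_ipv4(host):
    """Return True if the host part is a literal IPv4 address."""
    try:
        ipaddress.IPv4Address(host)
        return True
    except ValueError:
        return False


def address_bar_features(url):
    """Compute a few illustrative address-bar features (hypothetical selection)."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    return {
        "url_length": len(url),
        "has_at_symbol": "@" in url,
        "num_dots_in_host": host.count("."),
        "has_hyphen_in_host": "-" in host,
        "uses_https": parsed.scheme == "https",
        "host_is_ipv4": host_is_ipv4(host),
    }


if __name__ == "__main__":
    example = "http://192.168.0.1/login@secure"
    print(encode_url(example)[:10])
    print(address_bar_features(example))
```

In a full pipeline, the encoded sequences would typically feed an embedding layer followed by convolutional and LSTM layers, while the feature dictionaries would be assembled into a table for the Random Forest classifier. Both steps are omitted here to keep the sketch free of external dependencies.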