Innovation Concept
COS70008 - Technology Innovation
Research and Project
Date: 14 April 2025
Innovation Concept Report
Executive Summary
This paper proposes an intelligent web-based security system that leverages hybrid machine learning to identify and assess cyber threats. The system was developed in collaboration with Swinburne University, DFAT, and the IoT Training Academy and combines behavior-based and data-driven analysis to address the limitations of conventional detection methods. Its approach is more accurate and adaptive because it uses LSTM networks to analyze sequential patterns and Random Forest to classify structured data.
The system handles both real and synthetic datasets to identify known and previously unseen threats. An interactive web interface lets users upload data, review predictions, and visually monitor threat indicators. Two access levels are provided: system administrators have full configuration control, while regular users focus on analysis and reporting.
Role-based authentication, JWT-secured access, input validation, and data encryption are crucial. Together, these components ensure safe and compliant data management. To promote transparency and accountability, the system anonymizes sensitive inputs and monitors user actions.
The platform was built to evolve, using modern technologies and a modular architecture. It adapts to new cybersecurity threats, scales across cloud environments, and integrates with other services. This research-backed approach can promote threat awareness and resilience in both academic and operational settings.
Part A
Project Overview
Demand for smart security systems that can proactively identify and react to harmful activity is rising as cyber threats become more complex and impactful. Conventional detection technologies can fall short, especially when examining sophisticated threats aimed at cyber-physical systems (CPS). To meet this challenge, Dr. Siva Chandrasekaran heads a joint effort started by the IoT Training Academy, DFAT, and Swinburne University. The aim is to create and run a web-based system driven by hybrid machine learning algorithms able to identify, classify, and forecast cyber-attacks in real time.
The system integrates two complementary models: Long Short-Term Memory (LSTM) networks, which are well suited to spotting anomalies in sequences over time, and Random Forest, known for its effectiveness on structured data. Taken together, these models help identify harmful behavior more completely across many attack types; a brief sketch of one way to combine them is shown below.
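As one illustrative way to realize this pairing (not the project's final implementation), the sketch below trains a Random Forest on a structured view of the data and an LSTM on a sequential view, then averages their predicted probabilities. The synthetic data, layer sizes, and the equal weighting are assumptions for illustration only.

```python
# Minimal sketch: combining a Random Forest (structured features) with an
# LSTM (sequential features) by averaging their predicted probabilities.
# Data, model sizes, and the 50/50 weighting are illustrative assumptions.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from tensorflow.keras import layers, models

# Synthetic stand-ins: 1000 samples, 20 structured features, sequences of length 50
X_struct = np.random.rand(1000, 20)
X_seq = np.random.rand(1000, 50, 1)
y = np.random.randint(0, 2, 1000)

# Random Forest handles the structured (tabular) view of each sample
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_struct, y)

# LSTM handles the sequential view (e.g. time-ordered events)
lstm = models.Sequential([
    layers.Input(shape=(50, 1)),
    layers.LSTM(32),
    layers.Dense(1, activation="sigmoid"),
])
lstm.compile(optimizer="adam", loss="binary_crossentropy")
lstm.fit(X_seq, y, epochs=2, batch_size=64, verbose=0)

# Simple late fusion: average the two probability estimates
p_rf = rf.predict_proba(X_struct)[:, 1]
p_lstm = lstm.predict(X_seq, verbose=0).ravel()
p_hybrid = (p_rf + p_lstm) / 2
labels = (p_hybrid >= 0.5).astype(int)
```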
A key feature is the system's capacity to examine behavior in CPS settings, where any abnormality in device or network activity could indicate a potential threat. By processing live data inputs, the technology can provide early warnings and support rapid mitigation.
Using tools such as Flask, React.js, Docker, and MongoDB, the application will be designed with scalability and adaptability in mind. Through a simple, secure interface, administrators will manage models and system operations, while regular users can run scans and view results; distinct user roles will be supported.
Client Requirements
Under the direction of Dr. Siva Chandrasekaran, the IoT Training Academy, the Department of Foreign Affairs and Trade (DFAT), and Swinburne University work together to launch the initiative. Using a hybrid machine learning technique, the client's main objective is to build a web-based system able to identify and examine several kinds of harmful cyber behavior. This method is also meant to enable predictive behavioral analysis of cyber-physical systems (CPS) to assist in future threat prevention.
The following essential requirements were identified from the client's brief, the project goals, and the broader implications of delivering an effective security solution.
1. Finding and Using Appropriate Datasets:
a. Finding and assessing appropriate datasets that can support malware identification and CPS behaviour analysis is essential. The client expects the system to handle real-world or simulated data from trustworthy, publicly available sources. These datasets should cover many kinds of cyber-attacks, such as denial-of-service, phishing, ransomware, and botnets.
b. To enable seamless integration into machine learning pipelines, the data should be appropriately prepared and cleansed. Ethical handling of the data also demands attention: any personally identifiable information (PII) should be anonymized or removed, keeping system development in line with ethical research practices and data governance principles. A brief data-cleaning sketch follows.
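The sketch below illustrates how a dataset could be cleaned and PII removed or pseudonymized with pandas before it enters the ML pipeline. The file name and column names (label, user_email, src_ip) are hypothetical; real datasets will differ.

```python
# Illustrative sketch of basic dataset cleaning and PII handling with pandas.
# File name and column names are hypothetical placeholders.
import hashlib
import pandas as pd

df = pd.read_csv("traffic_log.csv")          # hypothetical input file

# Drop rows with missing labels and obvious duplicates
df = df.dropna(subset=["label"]).drop_duplicates()

# Remove direct identifiers entirely
df = df.drop(columns=["user_email"], errors="ignore")

# Pseudonymize fields still needed for grouping (one-way hash)
df["src_ip"] = df["src_ip"].apply(
    lambda v: hashlib.sha256(str(v).encode()).hexdigest()[:16]
)

df.to_csv("traffic_log_clean.csv", index=False)
```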
2. Building Machine Learning Models for Threat Detection:
• The development of several machine learning models able to detect and categorize various kinds of harmful attacks is another crucial requirement. The client emphasizes the need for a hybrid approach, such as combining Random Forest and LSTM models, to improve detection reliability and adaptability.
• Relevant metrics (accuracy, precision, recall, F1-score) should guide the training and assessment of these models, which should be tested on known attack types as well as previously unseen patterns. The system should support multi-class classification and offer clear, interpretable results to help decision-making for technical and non-technical users alike; a brief evaluation sketch follows.
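As a small illustration of this evaluation step, the sketch below computes the named metrics with scikit-learn on placeholder multi-class labels; the values are illustrative only.

```python
# Minimal sketch of evaluating a multi-class classifier with the metrics named above.
# y_true / y_pred are placeholders for test labels and model outputs.
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score, classification_report)

y_true = [0, 1, 2, 2, 1, 0, 2, 1]   # illustrative ground truth
y_pred = [0, 1, 2, 1, 1, 0, 2, 2]   # illustrative predictions

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred, average="macro"))
print("recall   :", recall_score(y_true, y_pred, average="macro"))
print("f1-score :", f1_score(y_true, y_pred, average="macro"))

# Per-class breakdown, useful for multi-class attack categories
print(classification_report(y_true, y_pred))
```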
3. Study of Cyber-Physical System Behavior
a. Beyond detecting malware, the client wants the system to track and forecast cyber-physical system activity. This means analyzing sequences of events and sensor- or protocol-level data to find patterns that could indicate unusual or harmful behavior.
b. Models such as LSTM (Long Short-Term Memory) are preferred for this work since they can learn from time-series data. The system should be able to forecast trends in system behavior and raise alerts or insights when anomalies are detected. These predictive capabilities are vital for real-time applications where early identification can avert significant damage; a small forecasting sketch is shown below.
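The sketch below shows one assumed way an LSTM could support this kind of sequence forecasting: it learns to predict the next value of a synthetic sensor signal and flags points with unusually large prediction error as anomalies. The window size and threshold are illustrative choices, not the project's tuned values.

```python
# Sketch: LSTM predicts the next value of a sensor sequence; large prediction
# error is then treated as an anomaly signal. Signal, window, and threshold
# are illustrative assumptions.
import numpy as np
from tensorflow.keras import layers, models

signal = np.sin(np.linspace(0, 60, 1200)) + np.random.normal(0, 0.05, 1200)
window = 30
X = np.array([signal[i:i + window] for i in range(len(signal) - window)])
y = signal[window:]
X = X[..., np.newaxis]                      # shape: (samples, window, 1)

model = models.Sequential([
    layers.Input(shape=(window, 1)),
    layers.LSTM(32),
    layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=3, batch_size=32, verbose=0)

# Flag points whose prediction error exceeds a simple threshold
errors = np.abs(model.predict(X, verbose=0).ravel() - y)
anomalies = np.where(errors > errors.mean() + 3 * errors.std())[0]
```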
4. Web-Based Application for Real-Time Analysis:
a. A responsive and secure web-based platform will expose the full detection and prediction engine. This platform should provide interfaces for two user groups: administrators and general users. General users will engage with features including file uploads, scan findings, and real-time threat dashboards, while administrators will control system settings, track alerts, and assess model performance.
b. The application has to enable secure authentication using session tokens or JWT and apply access control to safeguard sensitive functionality. The UI should be mobile-compatible, user-friendly, and uncluttered. Visual components such as graphs, charts, and real-time warnings will be key to understanding system performance and identified threats. A minimal authentication sketch follows.
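One possible shape for JWT-protected, role-aware endpoints using Flask and the flask-jwt-extended package is sketched below. The secret key, routes, and hard-coded role are placeholders, and real credential checks are omitted; this is a sketch, not the project's implementation.

```python
# Minimal sketch of JWT-based authentication and role-aware access control
# with Flask and flask-jwt-extended. Routes, secret, and role are placeholders.
from flask import Flask, jsonify, request
from flask_jwt_extended import (JWTManager, create_access_token,
                                jwt_required, get_jwt)

app = Flask(__name__)
app.config["JWT_SECRET_KEY"] = "change-me"   # placeholder only
jwt = JWTManager(app)

@app.post("/login")
def login():
    data = request.get_json()
    # Real credential checks and role lookup would happen here
    token = create_access_token(identity=data["username"],
                                additional_claims={"role": "user"})
    return jsonify(access_token=token)

@app.get("/admin/metrics")
@jwt_required()
def admin_metrics():
    # Only tokens carrying the admin role may reach this endpoint
    if get_jwt().get("role") != "admin":
        return jsonify(error="admin access required"), 403
    return jsonify(status="ok")
```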
5. Ethical Compliance, Privacy, and Security Issues:
a. Though not always stated explicitly, the client expects the platform to follow standard practices for security and data protection. These include adopting secure storage techniques, logging actions for audit purposes, ensuring appropriate user role segregation, and applying encrypted communications.
b. Since the project deals with potentially sensitive or simulated real-world data, the client also expects the system to follow ethical standards, guaranteeing legal and safe processing of data without endangering people or organizations.
6. Future Enhancements Scalability and Flexibility:
a. The client sees this system as a foundation for eventual large-scale implementation or further investigation. It therefore has to be structured to allow future upgrades such as integration of more sophisticated machine learning models, inclusion of external APIs, or deployment to cloud platforms such as AWS.
b. Technologies such as Docker for containerization, Flask for backend functionality, and modular frontend development with tools like React or Tailwind CSS are recommended to remain flexible, simplify updates, and guarantee compatibility with various environments.
7. Research Documentation and Alignment:
a. Finally, the client wants a methodical approach, given that the project sits in an academic and research-oriented environment. Academic citations and tests should support all technological choices, including model selection and performance assessment. The completed system must be well documented, containing installation instructions, system architecture diagrams, and user manuals, so it can be reused or enhanced in future work.
Part B
Preliminary Design
Design Concept
Ensuring performance, security, and scalability, the URL-Based Malware Detection System seeks to offer a smooth platform for spotting malware across multiple URLs. The design gives a user-friendly experience top priority while preserving technical robustness and adaptability for future development. Following development best practices, it supports effective teamwork and maintainability.
A. Repository Structure: The repository will be organized to guarantee a clear separation between the frontend and backend, supporting modularity and simple navigation. This structure enables teamwork by letting different team members work on separate parts without interference. Changes will be tracked using Git version control, guaranteeing seamless parallel development.
B. Coding Standards: The project will use coding standards for Python (Flask backend) and JavaScript (React.js frontend). Tools such as PEP 8 for Python and ESLint for JavaScript will guarantee that the code is consistent, clear, and simple to maintain. These guidelines will keep the project simple and scalable for future improvements or debugging.
C. Commenting Standard: Consistent commenting practices will be applied across the project to enhance readability and cooperation. Every function or method will be accompanied by comments clarifying its purpose and logic. This makes the code accessible even to new developers joining the team and will help future debugging and development.
D. Environment Setup: The infrastructure will be set up to facilitate the seamless integration of the frontend (React.js) and backend (Flask). A consistent environment across development, testing, and production will be created using Docker. npm and pip will handle dependencies for the frontend and backend respectively, ensuring all developers work with identical setups to minimize problems.
E. Overall Prototype Architecture: Using React.js, the frontend will follow a component-based design guaranteeing modularity and reusability of UI components. Every component, including file upload forms and visualisation charts, will perform particular functions, enabling the system to be scalable and simple to modify. Using a microservice design, the backend will let services like URL analysis, machine learning processing, and authentication run separately, enabling upgrades or scalability without compromising the whole system.
F. Components: The frontend will be made up of reusable components handling tasks like file uploads and displaying malware detection findings. Redux will manage state effectively, guaranteeing seamless data flow between components. The backend will include microservices that run the machine learning models and process URL inputs in real time; a minimal scan-endpoint sketch is shown below. Google Cloud or Amazon S3 will handle file storage; JWT/OAuth will be used for secure user authentication.
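As a sketch of what one such backend microservice might look like, the snippet below exposes a URL-scan endpoint that extracts a few simple structural features and returns a probability from a pre-trained model. The model file, feature set, and route are hypothetical and stand in for the real pipeline.

```python
# Sketch of a standalone URL-analysis microservice endpoint.
# "url_model.joblib" and the feature list are hypothetical placeholders.
import joblib
from flask import Flask, jsonify, request

app = Flask(__name__)
model = joblib.load("url_model.joblib")      # hypothetical trained model

def extract_features(url: str) -> list[float]:
    # Illustrative structural features only
    return [len(url), url.count("."), url.count("-"), int("https" in url)]

@app.post("/api/scan")
def scan():
    url = request.get_json().get("url", "")
    score = model.predict_proba([extract_features(url)])[0][1]
    return jsonify(url=url, malicious_probability=round(float(score), 3))
```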
i. Methodology
Small, functional components were built, tested, and refined as part of an agile, iterative development process. After establishing fundamental interactions through the creation of user-facing websites, the team proceeded to construct administrative functionality. Development was expedited and visual consistency was guaranteed via a modular codebase with reusable templates. Regular testing of responsiveness and interaction was conducted, and modifications were made in response to user input and usability assessments. In-line documentation of the source made future maintenance simpler. In response to changing requirements, features like a roadmap page were added, emphasizing the project's adaptable, feedback-driven approach.
ii. Design Constraints
Commercial Constraints: Under budget and time limitations, the project must include only fundamental features, focused on malware identification and user administration.
Compliance Constraints: The system must also satisfy data privacy requirements so that personal information is safeguarded.
Functional Limitations: To help control system load and handle large files more quickly, only single-file uploads are permitted in the Admin dashboard.
Non-functional Constraints: The system should provide a timely indication of whether a URL is dangerous or not, handling small batches of URLs in seconds and larger ones within minutes. Ease of use is essential: dashboards are designed to be understandable by non-technical users.
Security Limitations: Strict access controls must be enforced to ensure the privacy and integrity of the data the system protects.
iii. Specifications
The web-based malware detection system provides two dashboards, one for Admins and one for Users, each customized to their responsibilities. A hybrid model combining LSTM and CNN aims to deliver accurate malware predictions.
Features:
• Admin Features:
o Login/logout (username/password)
o Upload dataset (file upload)
o Malware identification
o Malware classification (top 5, sorted list)
o Behavior/usage prediction
o User management (add, delete, update, read)
o Notification management (via pop-up message)
o View malware identified in the last 7/14/28 days
• User Features:
o Upload URL
o Detect malware (with useful info if detected)
o Predictive analysis
o Prediction history view
o Create account (registered users)
o Login/logout (registered users)
o Personalized info/pop-up message (registered users)
The hybrid model employs CNN for more detailed malware classification and LSTM for probabilistic analysis of URL data, ensuring both speed and accuracy in predictions.
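The sketch below outlines one plausible shape for such a character-level CNN + LSTM classifier in Keras. The vocabulary size, sequence length, layer sizes, and five output classes are assumptions for illustration rather than the project's final configuration.

```python
# Illustrative sketch of a character-level CNN + LSTM model for URL classification.
# Vocabulary size, sequence length, and layer sizes are assumptions.
from tensorflow.keras import layers, models

MAX_LEN, VOCAB = 200, 128          # URLs truncated/padded to 200 ASCII codes

model = models.Sequential([
    layers.Input(shape=(MAX_LEN,)),
    layers.Embedding(VOCAB, 32),               # learn character embeddings
    layers.Conv1D(64, 5, activation="relu"),   # CNN picks up local n-gram-like motifs
    layers.MaxPooling1D(2),
    layers.LSTM(64),                           # LSTM models longer-range order
    layers.Dense(5, activation="softmax"),     # e.g. benign + 4 malware families
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```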
Frontend:
• Framework: React.js
• UI Library: TailwindCSS
• State Management: Redux
• Visualization: Chart.js
Backend:
• Runtime: Flask
• Database: MySQL
• Authentication: JSON Web Tokens (JWT)
• ML Integration: LSTM & CNN (implemented in Python)
iv. Vulnerability Analysis
Identifying and mitigating potential risks helps guarantee the system stays stable and secure. The main vulnerabilities and their mitigations are listed below:
Input Validation: Users can submit URLs or data containing harmful material. Improper checks on this input could lead to code injection among other attacks. All user input has to be carefully vetted before processing; a minimal validation sketch follows.
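A simple server-side validation sketch is shown below; the length cap, allowed schemes, and rejected characters are illustrative choices rather than a complete defense.

```python
# Simple sketch of server-side URL validation before any further processing.
# The length cap and allowed schemes are illustrative choices.
from urllib.parse import urlparse

MAX_URL_LEN = 2048
ALLOWED_SCHEMES = {"http", "https"}

def is_acceptable_url(raw: str) -> bool:
    if not raw or len(raw) > MAX_URL_LEN:
        return False
    parsed = urlparse(raw.strip())
    if parsed.scheme not in ALLOWED_SCHEMES or not parsed.netloc:
        return False
    # Reject obvious injection attempts in the submitted value
    if any(ch in raw for ch in ("<", ">", '"', "'", ";")):
        return False
    return True

assert is_acceptable_url("https://example.com/page")
assert not is_acceptable_url("javascript:alert(1)")
```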
Data Protection: Sensitive information such as user data and prediction results should never travel over the network unprotected. The system must use HTTPS to encrypt all communications and prevent unauthorized access, safeguarding this data.
Model Bias: Inaccurate predictions may result if the machine learning model is trained on an unbalanced dataset, because the model may favor certain outcomes depending on the data it has seen. The model should be retrained on a balanced and varied dataset to increase accuracy; a small rebalancing sketch follows.
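The sketch below shows two common ways to reduce this kind of bias, using synthetic data: class weighting and oversampling of the minority class. The specific numbers are illustrative.

```python
# Sketch of two common ways to reduce bias from an imbalanced training set:
# class weighting and simple oversampling. Data here is synthetic.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils import resample

X = np.random.rand(1000, 10)
y = np.array([0] * 950 + [1] * 50)           # heavily imbalanced labels

# Option 1: let the classifier reweight classes automatically
clf = RandomForestClassifier(class_weight="balanced", random_state=0)
clf.fit(X, y)

# Option 2: oversample the minority class before training
X_min, y_min = X[y == 1], y[y == 1]
X_up, y_up = resample(X_min, y_min, n_samples=950, replace=True, random_state=0)
X_bal = np.vstack([X[y == 0], X_up])
y_bal = np.concatenate([y[y == 0], y_up])
```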
Explainability of Predictions: Users should be able to understand why the algorithm produced a particular prediction. For instance, the CNN model should indicate which elements affected its decision. This increases confidence in the system's malware detection results; one illustrative approach is sketched below.
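One illustrative approach, shown below, surfaces feature importances from the Random Forest side of a hybrid system; the feature names and data are hypothetical, and explaining a CNN would need a different technique such as saliency-based attribution.

```python
# Sketch of surfacing which features drove predictions via Random Forest
# feature importances. Feature names and data are hypothetical.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

feature_names = ["url_length", "num_dots", "num_hyphens", "has_https"]
X = np.random.rand(500, 4)
y = np.random.randint(0, 2, 500)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
for name, score in sorted(zip(feature_names, rf.feature_importances_),
                          key=lambda p: p[1], reverse=True):
    print(f"{name}: {score:.3f}")
```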
System Outage: Performance can be greatly affected if the system fails or becomes unavailable during peak use, for example while real-time malware analysis is underway. To manage this, the system should have failover mechanisms and backup plans so it keeps operating effectively even if something goes wrong.
Screen of Prototype:
User Dashboard:
Admin Dashboard:
Design 2
Design Concept
This innovative, research-backed tool detects and manages suspicious web-based activity. It stresses modular architecture, real-time engagement, and explainable machine learning to increase detection accuracy and system transparency. LSTM analyses behavioural patterns over time while Random Forest classifies structured features, balancing accuracy and interpretability. Interactive simulations, real-time monitoring, and role-specific dashboards make it a versatile tool for education, research, and operations across varied users.
i. Methodology
Guided by the principles of Design Science Research Methodology (DSRM), this initiative follows a systematic, theory-informed, and iterative development approach. It starts by recognizing the need to protect web systems against illegitimate activity enabled by poor interface validation. The main goal is to create a real-time system that detects and classifies anomalous activities using a hybrid machine learning technique. Built as a functional prototype, the system has an interactive user interface, a robust backend API, and an inference engine for analysis. Simulations are used to examine its efficacy, and technical performance criteria and user input help to evaluate it. Detailed documentation, user-friendly interfaces, and formal presentations share the final results, aligning research theory with practical application.
ii. Design Constraints
The system is designed with several key limitations in mind. Ethically, all input data is meticulously anonymised to fit institutional ethics policies, guaranteeing that no personally identifiable information is stored or processed. Technically, the scarcity of real-world examples drives the need for simulated and synthetically generated data for testing and development. Legally, the system follows data protection rules by using secure communication methods such as TLS and keeping thorough audit logs. It is also designed to operate effectively in low-resource settings, such as those offered by academic virtual machines. From a scalability perspective, the design is meant to accommodate future integration with more sophisticated tools or external services without needing significant structural modifications.
iii. Specifications
The system processes cleaned and organized input data obtained from publicly available logs and behavioral traces using a sequence of steps including labeling, padding, and conversion into vectorized formats suitable for machine learning models; a brief preprocessing sketch follows this section.

In the hybrid architecture, feature extraction is customized for each model. The LSTM component uses n-gram approaches to capture temporal dynamics via metrics such as request timing gaps, sequence entropy, and tokenized input patterns. The Random Forest model, in contrast, makes use of structural markers including query composition patterns, atypical length-based behaviors, and frequency distributions of certain parameters.

The user interface is built with React.js, styled with Tailwind CSS, and uses dynamic Recharts charts for real-time visualisation. A Flask-based API on the backend securely processes requests using JWT-based authentication and rate limiting to prevent abuse. The machine learning engine, wrapped in Python for seamless orchestration, combines Scikit-learn for classification tasks and TensorFlow for sequence modeling. MongoDB is the data storage choice thanks to its adaptable schema design and fast indexing features. Docker containers and GitHub Actions manage continuous integration and deployment, guaranteeing compatibility with cloud systems including AWS EC2. Administrators supervise user management, data validation, and the model lifecycle, while standard users can submit input, receive classification results, and monitor system responses via a simplified interface.
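The sketch below illustrates the two feature views described above on toy request strings: padded character sequences for the LSTM and character n-gram counts for the Random Forest. The sequence length, n-gram range, and feature count are assumptions.

```python
# Sketch of the two feature views: padded character sequences for the LSTM and
# n-gram frequency features for the Random Forest. Parameters are illustrative.
from sklearn.feature_extraction.text import CountVectorizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

samples = ["GET /index.html?id=1", "GET /login.php?user=admin'--"]

# Sequential view: character codes, padded to a fixed length for the LSTM
seqs = [[min(ord(c), 127) for c in s] for s in samples]
X_seq = pad_sequences(seqs, maxlen=100, padding="post")

# Structural view: character n-gram counts for the Random Forest
vectorizer = CountVectorizer(analyzer="char", ngram_range=(2, 3), max_features=500)
X_struct = vectorizer.fit_transform(samples)
```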
iv. Vulnerability Analysis
The system uses intelligent components to spot potentially dangerous input behavior. For inputs that attempt to influence backend queries, the LSTM model examines timing discrepancies and odd request patterns while the Random Forest model flags strings closely matching database-related syntax. For embedded scripts in form fields or redirection attempts, the LSTM warns of unexpected bursts of complex characters and the Random Forest model detects typical script indicators such as tags and event-driven attributes. To improve general defense, the system implements user session controls, uses a two-tiered input filtering technique, and applies IP-based request limiting, as sketched below. Moreover, thorough logging and tight control of user rights guarantee clear and traceable system interactions.
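As a sketch of the IP-based request limiting mentioned above, the snippet below keeps an in-memory sliding window of request timestamps per client IP in a Flask app. The window length and request cap are illustrative assumptions, and a production deployment would more likely use a shared store or a dedicated rate-limiting extension.

```python
# Minimal sketch of in-memory, IP-based request limiting for a Flask app.
# Window length and request cap are illustrative; not production-grade.
import time
from collections import defaultdict, deque
from flask import Flask, jsonify, request

app = Flask(__name__)
WINDOW_SECONDS, MAX_REQUESTS = 60, 30
_history = defaultdict(deque)                 # ip -> recent request timestamps

@app.before_request
def limit_by_ip():
    now = time.time()
    hits = _history[request.remote_addr]
    # Drop timestamps that fell out of the sliding window
    while hits and now - hits[0] > WINDOW_SECONDS:
        hits.popleft()
    if len(hits) >= MAX_REQUESTS:
        return jsonify(error="rate limit exceeded"), 429
    hits.append(now)
```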
Snapshots of Prototype: