Cyber threats are an ever-increasing concern for organizations of all sizes. With the rise of technology, the need for professionals who can understand and respond to cyber threats has also increased. Data scientists and data engineers are in a unique position to play a critical role in protecting organizations from cyber threats. With the ability to analyze and interpret large amounts of data, data scientists and data engineers can help organizations identify and respond to cyber threats.
This article will provide an overview of the field of Cyber Threat Intelligence (CTI) and its importance for data scientists and data engineers. It will also discuss the key concepts, tools, and techniques that data scientists and data engineers need to know to be effective in CTI.
1 – Introduction to Cyber Threat Intelligence
Cyber Threat Intelligence (CTI) is the process of collecting, analyzing, and interpreting information about cyber threats. CTI is used to understand the nature of cyber threats, who is behind them, and how they can be mitigated. This includes information about the tactics, techniques, and procedures (TTPs) used by attackers, technical indicators of their behaviour (Indicators of Compromise, IoCs), as well as information about the vulnerabilities, malware and exploits they may use. CTI is an essential part of an organization’s overall security strategy and can serve audiences ranging from technical to strategic.
1.1 – Threat intelligence use cases
CTI is a cyberdefense activity. It constantly watches for new threat developments, capitalizes on that knowledge, and disseminates risk mitigation measures to the Security Operations Center (SOC). It also feeds Threat Hunting with new IoCs or TTPs and the Red Team with new attack simulation scenarios. During an incident, it supports the Digital Forensics and Incident Response (DFIR) team by collecting their raw findings, analyzing them, and looping back with enriched information about the incident and recommendations. After an incident, it provides lessons learned ("retex") to the organization’s teams, which may also be shared externally. Some of the main use cases are the following:
- Threat hunting: Proactively searching for signs of potential threats within an organization’s network (a.k.a. “Proactive Incident Response”);
- Security Operations Center: Providing technical indicators and attack patterns to detect, and gathering SOC statistics to identify patterns and trends that improve detection rules;
- Incident response: Responding to and containing security incidents;
- Vulnerability management: Identifying and mitigating vulnerabilities in an organization’s systems and networks with threat-based prioritization;
- Compliance: Helping organizations meet regulatory requirements related to security and adjusting monitoring as the threat landscape evolves;
- Red Team: Providing simulation scenarios, attack patterns and IoCs to be tested, then capitalizing on the Red Team exercise to improve overall security;
- Risk management: Identifying and assessing the risk of potential threats to an organization after structuring it with Threat Modeling.
CTI is a knowledge-based and fact-driven process that feeds all of an organization’s lines of defense.
1.2 – Types of cyber threats
Cyber threats come in a variety of forms. There is no shared understanding of what a threat is across the cybersecurity community. Here we understand the threat as the source of the risk, in other words the individual or group performing an attack (Threat Actor, Threat Agent, Adversary). We can use the adversary model of Bruce Schneier:
- Individual hackers: These are individuals or small groups who engage in cyber attacks for personal gain, to cause disruption, or for the thrill of it. They usually have average technical skills and limited resources compared with more sophisticated attackers, but they can still cause significant damage.
- Hacktivists: Hacktivists are individuals or groups that engage in cyber attacks to promote a political or social agenda. They often target organizations that they perceive as being opposed to their cause. They can have a range of technical skills and resources, depending on the group or individual. Their main motivation is to raise awareness or to make a statement.
- Terrorists: Terrorist organizations and lone-wolf actors may use cyber attacks as a means to support their operations or to cause disruption. Their main motivation is to cause destruction, harm, and fear. They usually have limited technical skills, but they may have access to some resources.
- Cybercriminals: Cybercriminals are individuals or groups that engage in cyber attacks for financial gain. They often target organizations in order to steal sensitive information or to hold data for ransom. They can have a range of technical skills and resources, but their main motivation is financial gain.
- State-sponsored groups: State-sponsored groups are cyber adversaries that are sponsored by a nation-state. They are often highly skilled and well-resourced, and they may target organizations in order to steal sensitive information or to disrupt operations. Their main motivation is to gather intelligence, disrupt operations or to steal intellectual property.
- Competitor spies: These adversaries are companies or individuals who are looking to gain an economic advantage by stealing sensitive information from other organizations. They often have access to significant resources and may have a range of technical skills. Their main motivation is to gain an economic advantage over their competitors.
- Insider threats: Insider threats are individuals or groups that have legitimate access to an organization’s systems and networks but use that access for malicious purposes. They can have a range of technical skills and resources, depending on their role within the organization. Their main motivation is usually financial gain or to cause damage to the organization, often driven by revenge.
1.3 – Threat Intelligence Frameworks
Threat intelligence frameworks are used to organize and structure the process of collecting, analyzing, and interpreting cyber threat information. The most popular threat intelligence frameworks are:
- MITRE ATT&CK framework (source): The MITRE ATT&CK framework is a widely used knowledge base of the tactics, techniques, and procedures (TTPs) used by cyber adversaries. It includes a matrix that organizes techniques by tactic, that is, by the adversary’s goal at each step of an attack, such as initial access, persistence, lateral movement, command and control, and exfiltration. This framework can be used to identify gaps in an organization’s defenses and to develop threat-informed defense strategies.
- Diamond Model of Intrusion Analysis (source): The Diamond Model of Intrusion Analysis is a framework that provides a structured approach to analyzing cyber incidents. It describes every intrusion event through four core features placed at the vertices of a diamond: the adversary, the capability, the infrastructure, and the victim. This framework can be used to understand the nature of an incident, to pivot from one known element to the others, and to support a plan for responding to it. Despite the similar name, it is unrelated to the economist Michael Porter’s Diamond Model.
- Cyber Kill Chain (source): The Cyber Kill Chain is a framework developed by Lockheed Martin that describes the stages of a cyber attack, including reconnaissance, weaponization, delivery, exploitation, installation, command and control, and actions on objectives. This framework can be used to identify the stage of an attack and to develop a plan for responding to it.
These frameworks provide a structure that enables Cyber Threat Analysts to aggregate, organize and normalize their findings. Like any model, each has its own purpose, benefits and drawbacks.
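To illustrate how a framework helps normalize findings, here is a minimal Python sketch that tags analyst findings with Cyber Kill Chain phases. The `Finding` schema, the helper functions and the sample findings are hypothetical and exist only for illustration; a real platform would use its own data model.

```python
from dataclasses import dataclass
from enum import IntEnum


class KillChainPhase(IntEnum):
    """The seven Cyber Kill Chain phases, numbered in attack order."""
    RECONNAISSANCE = 1
    WEAPONIZATION = 2
    DELIVERY = 3
    EXPLOITATION = 4
    INSTALLATION = 5
    COMMAND_AND_CONTROL = 6
    ACTIONS_ON_OBJECTIVES = 7


@dataclass(frozen=True)
class Finding:
    """One analyst finding normalized to a phase (hypothetical schema)."""
    description: str
    phase: KillChainPhase


def furthest_phase(findings):
    """Return the latest phase the attacker is known to have reached, or None."""
    return max((f.phase for f in findings), default=None)


def unobserved_phases(findings):
    """Return phases without any finding: possible visibility gaps."""
    seen = {f.phase for f in findings}
    return [p for p in KillChainPhase if p not in seen]


# Invented sample findings, for illustration only.
findings = [
    Finding("Phishing email with a malicious attachment", KillChainPhase.DELIVERY),
    Finding("Macro executed on a workstation", KillChainPhase.EXPLOITATION),
    Finding("Beaconing to an external server", KillChainPhase.COMMAND_AND_CONTROL),
]

print(furthest_phase(findings).name)
print([p.name for p in unobserved_phases(findings)])
```

Because the phases are ordered, the same structure answers two common questions: how far the attack progressed, and at which stages no evidence was collected.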
2 – Threat Intelligence, Data Science & Data Engineering
The threat intelligence process is similar to the Knowledge Discovery in Databases (KDD) framework used in data mining. KDD is a powerful Data Science approach for data collection, knowledge discovery and fact-based business decision-making. Recent advances in Artificial Intelligence and Cloud Computing build on approaches that have been well established since the 1970s and 1980s, such as Extract-Transform-Load (ETL) technologies for data engineering or statistical inference for Data Science.
2.1 – Data Collection and Analysis using Data Mining Methodology
Intelligence, as a product, is operational and contextual knowledge, built from technical data, that answers a question. As a process, intelligence collects data from an environment, processes data into information, analyzes information into knowledge, and packages knowledge into intelligence. Both Intelligence and Data Mining focus on revealing what was previously unseen, and they follow a very similar methodology. In both, models are used to bridge the gap between data and knowledge. The main steps are:
- Data Acquisition: The first step in collecting cyber threat intelligence data is to acquire the data. This can be done through a variety of methods, such as network traffic monitoring, endpoint data collection, and external data feeds. It is important to identify the relevant data sources and to ensure that the data is being collected in a consistent and reliable manner.
- Data Preprocessing: Once the data has been acquired, it needs to be preprocessed in order to make it ready for analysis. This can include data cleaning, data transformation, and data integration. Data cleaning involves removing or correcting any errors or inconsistencies in the data. Data transformation involves converting the data into a format that is suitable for analysis, such as converting raw network logs into a structured format. Data integration involves combining multiple data sources to create a single, cohesive dataset.
- Data Reduction: After the data has been preprocessed, it may need to be reduced in order to focus on the most relevant information. This can include methods such as feature selection, which involves identifying the most important variables or attributes in the data, and dimensionality reduction, which involves reducing the number of variables or attributes in the data.
- Data Modeling: Once the data has been collected, preprocessed, and reduced, it can be projected against a model to generate useful information. This can include methods such as clustering, which groups similar data points together, and classification, which assigns data points to predefined categories.
- Data Interpretation: The final step in the KDD process is data interpretation, where the insights and patterns discovered through the data mining process are interpreted and used to generate actionable intelligence. It is important to note that data interpretation is an iterative process, where the data and the results of the data mining process are continuously evaluated and refined.
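The five steps above can be sketched end to end in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: the log format, the field names, the thresholds and the `password_spraying_suspect` label are all invented for the example, and a simple threshold rule stands in for a trained classifier.

```python
from collections import defaultdict

# Data acquisition: invented sample log lines standing in for a real feed.
RAW_LOGS = [
    "2024-01-01T10:00:00 src=203.0.113.5 user=alice action=login_failed",
    "2024-01-01T10:00:02 src=203.0.113.5 user=bob action=login_failed",
    "2024-01-01T10:00:03 src=203.0.113.5 user=carol action=login_failed",
    "2024-01-01T10:00:03 src=203.0.113.5 user=carol action=login_failed",
    "2024-01-01T10:01:00 src=198.51.100.7 user=dave action=login_ok",
    "malformed line without fields",
    "2024-01-01T10:02:00 src=198.51.100.7 user=dave action=login_failed",
]


def preprocess(lines):
    """Clean and transform: parse key=value fields, drop malformed and duplicate lines."""
    records, seen = [], set()
    for line in lines:
        timestamp, _, rest = line.partition(" ")
        fields = dict(part.split("=", 1) for part in rest.split() if "=" in part)
        if not fields.keys() >= {"src", "user", "action"} or line in seen:
            continue
        seen.add(line)
        records.append({"timestamp": timestamp, **fields})
    return records


def reduce_features(records):
    """Reduce: keep two features per source IP (failed logins, distinct users tried)."""
    failed = defaultdict(int)
    users = defaultdict(set)
    for record in records:
        if record["action"] == "login_failed":
            failed[record["src"]] += 1
            users[record["src"]].add(record["user"])
    return {src: (failed[src], len(users[src])) for src in failed}


def classify(features, min_failed=3, min_users=3):
    """Model: a threshold rule (assumed thresholds) that labels each source IP."""
    return {
        src: "password_spraying_suspect" if f >= min_failed and u >= min_users else "benign"
        for src, (f, u) in features.items()
    }


def interpret(labels):
    """Interpret: turn labels into a short list an analyst can act on."""
    return sorted(src for src, label in labels.items() if label != "benign")


records = preprocess(RAW_LOGS)
features = reduce_features(records)
labels = classify(features)
print(interpret(labels))
```

Keeping each step as a separate function mirrors the iterative nature of the process: thresholds, features or parsing rules can be refined independently as analysts review the results.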
Data collected and processed in Threat Intelligence can be technical indicators (such as a phishing email address or a malicious server IP address) or textual information (such as blog articles or whitepapers from antivirus companies). In the second case, a sub-field of Data Mining called Text Mining will probably be required.
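As a very simple form of text mining, technical indicators can be pulled out of a written report with regular expressions. The sketch below is an assumption-laden illustration: the patterns are deliberately simplified, the "defanging" conventions handled (`[.]`, `[@]`, `hxxp`) are only the most common ones, and the report text is invented.

```python
import ipaddress
import re

# Simplified patterns for illustration; production extractors are more thorough.
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
SHA256_RE = re.compile(r"\b[a-fA-F0-9]{64}\b")
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")


def refang(text):
    """Undo common 'defanging' used in reports to make indicators unclickable."""
    return text.replace("[.]", ".").replace("[@]", "@").replace("hxxp", "http")


def is_valid_ipv4(candidate):
    """Reject strings that look like an IP address but are not valid (e.g. 999.1.2.3)."""
    try:
        ipaddress.IPv4Address(candidate)
        return True
    except ValueError:
        return False


def extract_iocs(text):
    """Return deduplicated, sorted indicators found in free text."""
    text = refang(text)
    return {
        "ipv4": sorted({m for m in IPV4_RE.findall(text) if is_valid_ipv4(m)}),
        "sha256": sorted({m.lower() for m in SHA256_RE.findall(text)}),
        "email": sorted({m.lower() for m in EMAIL_RE.findall(text)}),
    }


# Invented report excerpt; the hash is a placeholder, not a real sample.
report = (
    "The dropper (SHA-256 " + "ab" * 32 + ") beacons to 203[.]0[.]113[.]10. "
    "Phishing emails were sent from billing[@]example[.]com. "
    "Version 999.1.2.3 of the product is unaffected."
)

iocs = extract_iocs(report)
print(iocs)
```

Validating candidates after matching, as done here for IP addresses, helps keep version numbers and similar look-alikes out of the indicator set.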
2.2 – Threat Detection and Response augmented with Data Science
Data scientists and data engineers can also play a critical role in threat detection and response. Understanding the cyber attack life cycle is essential for identifying and responding to cyber threats.
- Identifying and Prioritizing Threats: One of the first steps in developing a threat detection and response strategy is to identify and prioritize potential threats. This can be done by analyzing data from various sources, such as network logs, endpoint data, and threat intelligence feeds, then running a Threat Modeling workshop to triage the candidates, and finally looping back with business managers to identify the threats that correspond to real risks for the organization. Security teams can then focus their efforts on the most pressing issues.
- Developing a Threat-Informed Defense Strategy: Once relevant threats have been selected, organizations can use data science techniques to develop a threat-informed defense strategy. This modern approach to cyberdefense can involve identifying gaps in an organization’s defenses and implementing controls to close those gaps. It can also involve developing incident response plans and training employees on how to respond to potential threats.
- Implementing Continuous Monitoring and Response: Developing a threat detection and response strategy also involves implementing continuous monitoring and response. A common practice is to deploy automated Endpoint Detection and Response (EDR) systems to detect and respond to potential threats in near real time, and to conduct regular reviews of security logs and other data sources. The log review process, often manual or rule-based, can be improved with the machine-learning models and heuristics that Data Scientists bring.
- Measuring the effectiveness of the threat detection and response strategy: In order to evaluate the effectiveness of a threat detection and response strategy, organizations should regularly measure the performance of their threat detection and response functions, such as the accuracy and timeliness of their threat detection and response, the effectiveness of their incident response plans, and the overall reduction in the risk of a successful attack.
- Continual improvement: Continual improvement is a key element in developing a threat detection and response strategy, just like in Data Science. Organizations should regularly review and update their threat detection and response strategy to ensure that it is aligned with the current threat landscape and that it is providing the necessary level of protection. This can include reviewing and updating incident response plans, security controls, and threat intelligence feeds, as well as training employees on new threats and response procedures or conducting Red Team attack simulations.
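The measurement step in the list above can be made concrete with a few standard metrics. The sketch below computes precision and recall for reviewed alerts, plus a mean time to detect; the input shapes and the sample values are assumptions made for the example, not measurements from a real SOC.

```python
from datetime import datetime
from statistics import mean


def detection_metrics(outcomes):
    """Compute (precision, recall) from (alert_raised, confirmed_malicious) pairs."""
    tp = sum(1 for raised, malicious in outcomes if raised and malicious)
    fp = sum(1 for raised, malicious in outcomes if raised and not malicious)
    fn = sum(1 for raised, malicious in outcomes if not raised and malicious)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def mean_time_to_detect(incidents):
    """Mean delay in minutes between (estimated start, detection time) pairs."""
    return mean(
        (detected - started).total_seconds() / 60 for started, detected in incidents
    )


# Invented review outcomes: 3 true positives, 1 false positive,
# 1 missed attack and 5 correctly ignored events.
outcomes = (
    [(True, True)] * 3 + [(True, False)] + [(False, True)] + [(False, False)] * 5
)

# Invented incident timelines.
incidents = [
    (datetime(2024, 1, 1, 10, 0), datetime(2024, 1, 1, 10, 30)),
    (datetime(2024, 1, 2, 12, 0), datetime(2024, 1, 2, 13, 30)),
]

precision, recall = detection_metrics(outcomes)
print(precision, recall, mean_time_to_detect(incidents))
```

Tracking these figures over time, rather than as one-off values, is what makes them useful for the continual improvement loop.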
Data scientists and data engineers can use machine learning, data visualization, and other data science techniques to identify, qualify and produce detection rules that are more advanced than what a classical Cyber Threat Intelligence Analyst could produce alone.
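As a small example of a statistical heuristic complementing rule-based log review, the sketch below flags hosts whose failed-login count is far above a historical baseline using a z-score. The baseline values, host names and the threshold of 3 standard deviations are assumptions chosen for illustration; a z-score is only one of many possible techniques.

```python
from statistics import mean, stdev


def flag_anomalies(baseline, observed, threshold=3.0):
    """Return hosts whose count exceeds the baseline mean by `threshold` std devs."""
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        # Flat baseline: anything above the usual value is worth a look.
        return [host for host, count in observed.items() if count > mu]
    return [
        host for host, count in observed.items() if (count - mu) / sigma > threshold
    ]


# Invented hourly failed-login counts observed during normal operations.
baseline = [4, 6, 5, 7, 5, 6, 4, 5, 6, 5]

# Invented counts for the current hour, per host.
observed = {"web-01": 6, "db-02": 5, "vpn-03": 42}

flagged = flag_anomalies(baseline, observed)
print(flagged)
```

Unlike a fixed rule such as "more than 10 failures", the threshold here adapts to what is normal for the environment, which is the kind of improvement a data scientist can bring to detection engineering.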
2.3 – Building and Maintaining Threat Intelligence Systems using Data Engineering
Data engineers are responsible for building and maintaining threat intelligence systems. This includes designing and implementing data pipelines, data warehousing, big data analytics, and data governance.
- Data Warehousing: Data warehousing is a technique used to store and manage large amounts of data for reporting and analytics. It involves creating a centralized repository of data from various sources and using tools such as ETL (extract, transform, load) to clean, transform, and load the data into the warehouse. This technique can be used to store and manage large amounts of cyber threat intelligence data for reporting and analytics.
- Big Data Platforms: Big data platforms, such as Apache Hadoop and Apache Spark, are designed to process and analyze large amounts of data. These platforms can be used to process and analyze cyber threat intelligence data in batch or, with Spark’s streaming capabilities, in near real time.
- NoSQL Databases: NoSQL databases, such as MongoDB and Cassandra, are designed to handle large amounts of unstructured or semi-structured data. These databases can be used to store and manage large amounts of cyber threat intelligence data.
- DevOps and Infrastructure: DevOps tools and infrastructure management technologies such as Terraform, Kubernetes and GitLab make it possible to manage infrastructure as code. They help automate and manage the evolution of the infrastructure that supports the intelligence logic and databases, and they tie the intelligence process to data processing and software development best practices such as CI/CD.
- Data Quality and Data Management: Data quality and data management are critical considerations when building and maintaining threat intelligence systems. This includes ensuring that the data is accurate, complete, and consistent, as well as regularly reviewing and cleaning the data.
- Data Processing and Storage Architecture: The data processing and storage architecture of a threat intelligence system must be designed to handle the large amounts of data that are generated and must be able to scale to meet the changing needs of the organization. This can include implementing data pipelines and workflows, as well as designing the system for horizontal scalability.
Scaling the threat intelligence system is an unavoidable consideration for data engineers to produce intelligence in a timely manner.
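On a small scale, the pipeline, warehousing and data quality ideas above can be sketched with an ETL job that loads indicators from several feeds into a single table. The feed names, record fields and table schema are hypothetical, and an in-memory SQLite database stands in for a real warehouse.

```python
import ipaddress
import sqlite3

# Invented feeds: one duplicate with stray whitespace and one invalid value.
FEEDS = {
    "feed_a": [
        {"indicator": "203.0.113.10", "first_seen": "2024-01-02"},
        {"indicator": "not-an-ip", "first_seen": "2024-01-03"},
    ],
    "feed_b": [
        {"indicator": " 203.0.113.10 ", "first_seen": "2024-01-01"},
        {"indicator": "198.51.100.7", "first_seen": "2024-01-04"},
    ],
}


def extract(feeds):
    """Extract: yield (source, raw_record) pairs from every feed."""
    for source, rows in feeds.items():
        for row in rows:
            yield source, row


def transform(extracted):
    """Transform: trim, validate and deduplicate, keeping the earliest sighting."""
    cleaned = {}
    for source, row in extracted:
        value = row["indicator"].strip()
        try:
            ipaddress.ip_address(value)  # data quality check: must be a valid IP
        except ValueError:
            continue
        first_seen = row["first_seen"]
        if value not in cleaned or first_seen < cleaned[value][0]:
            cleaned[value] = (first_seen, source)
    return [(value, seen, source) for value, (seen, source) in cleaned.items()]


def load(conn, rows):
    """Load: write the cleaned indicators into the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS indicators ("
        "value TEXT PRIMARY KEY, first_seen TEXT NOT NULL, source TEXT NOT NULL)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO indicators (value, first_seen, source) VALUES (?, ?, ?)",
        rows,
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
load(conn, transform(extract(FEEDS)))
rows = conn.execute(
    "SELECT value, first_seen, source FROM indicators ORDER BY value"
).fetchall()
print(rows)
```

The same extract, transform and load boundaries carry over when the in-memory database is replaced by a warehouse and the functions by distributed jobs, which is where horizontal scalability becomes the main design concern.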
3 – Legal and Ethical Considerations
CTI also involves legal and ethical considerations. Data scientists and data engineers need to understand the legal and ethical considerations surrounding the collection, storage, and sharing of cyber threat intelligence data.
- Compliance: Organizations must comply with a variety of laws, regulations and industry standards when collecting, analyzing, and sharing cyber threat intelligence data. This can include laws related to data privacy, such as the General Data Protection Regulation (GDPR), or laws related to national security.
- Data Privacy: Data privacy is a critical consideration for any cyber threat intelligence program. Organizations must ensure that the personal data of individuals is protected and that it is being used in compliance with data protection laws and regulations. They must also ensure that any data that is shared with third parties is protected by appropriate security measures and has a lawful basis.
- Legal Investigations: Cyber threat intelligence can play a critical role in legal investigations, such as providing leads to identify and prosecute cyber criminals. However, organizations must ensure that they adhere to best practices, legal and ethical guidelines when collecting, analyzing, and sharing cyber threat intelligence data in order to support legal investigations.
- Ethical Considerations: The use of cyber threat intelligence raises a number of ethical considerations, such as the use of personal data, potential biases in the data and analysis, and the potential for unintended consequences. Organizations must ensure that they are using data science techniques in an ethical and responsible manner.
Without replacing a Data Protection Officer or an Ethics Committee, Data Scientists and Data Engineers can help develop policies and procedures for data management, implement data quality checks and data access controls, and assist during data privacy audits.
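One technical control a data engineer might contribute is pseudonymizing personal fields before intelligence is shared. The sketch below uses a keyed hash (HMAC-SHA256) so that the same value always maps to the same token without exposing the original. The field names, the record and the key are hypothetical; in practice the key would come from a secrets manager, and pseudonymized data may still count as personal data under regulations such as the GDPR, so this complements rather than replaces legal review.

```python
import hashlib
import hmac

# Hypothetical list of fields considered personal data in this schema.
PERSONAL_FIELDS = {"recipient_email", "username"}


def pseudonymize(value, key):
    """Return a keyed, deterministic token for a personal value."""
    return hmac.new(key, value.lower().encode("utf-8"), hashlib.sha256).hexdigest()


def prepare_for_sharing(record, key):
    """Replace personal fields with tokens and leave technical indicators intact."""
    return {
        field: pseudonymize(value, key) if field in PERSONAL_FIELDS else value
        for field, value in record.items()
    }


# Placeholder key for the example only; never hard-code a real key.
key = b"example-key-for-illustration-only"

# Invented phishing observation.
record = {
    "sender_ip": "203.0.113.10",
    "recipient_email": "Alice@example.com",
    "username": "alice",
}

shared = prepare_for_sharing(record, key)
print(shared)
```

Because the token is deterministic for a given key, partners can still correlate events about the same recipient without learning who that recipient is.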
Conclusion
Cyber Threat Intelligence is the practice of collecting, analyzing and disseminating contextualized information to support the whole cybersecurity, cyberdefense and cyber risk management effort. It is based on technical data left behind by attackers and observed during cyber attacks. This information then has to be analyzed and interpreted to support decision-making.
Data Science and Data Engineering are two disciplines that can support Cyber Threat Intelligence by providing an efficient and scalable infrastructure. They augment human analysts with consistent, repeatable and evolving techniques, tools and models to handle the huge amount of data and information that has to be processed.
In fact, we can see the Threat Intelligence Analyst as a Data Analyst specialized in adversary and threat event investigation. Working in tandem, the Analyst, the Engineer and the Scientist can make an invaluable contribution to the full chain of defense and to compliance, which ultimately increases the organization’s chances of achieving its strategic objectives.