Cloud security isn't just about setting up defenses; it's about keeping them running smoothly day in and day out. That's where CCSP Domain 5 comes in, focusing on Cloud Security Operations. This domain is all about the nuts and bolts of maintaining a secure cloud environment, from implementing robust infrastructure to managing incidents effectively.
While many operational security principles overlap with traditional IT security, the cloud introduces unique challenges and opportunities. We'll explore how to build and operate physical and logical infrastructure, implement crucial operational controls, and manage security operations in a cloud context. Whether you're dealing with virtualization, network security, or forensic investigations, Domain 5 equips you with the knowledge to keep your cloud environment secure and running like a well-oiled machine.
Let’s dig in!
5.1 Build and implement physical and logical infrastructure for cloud environment
The cloud provider is responsible for the physical infrastructure up to the hypervisor in all service models. In IaaS, the cloud customer has substantial control and responsibility over how its virtual infrastructure is configured as well as everything on top of it. Under PaaS, the customer has less control and responsibility, while SaaS gives them the least.
Hardware-specific security configuration requirements
We discuss hardware security modules (HSMs) and Trusted Platform Modules (TPMs) in Domain 4.6 under Key management.
BIOS and UEFI
In traditional computing, we need a layer of software that is responsible for initializing and configuring the hardware. It sits between the hardware and the OS. In legacy systems, this is taken care of by BIOS (Basic Input/Output System) firmware. In newer computers, it has been replaced by the Unified Extensible Firmware Interface (UEFI). Cloud providers are responsible for securing BIOS and UEFI on the physical hardware to prevent unauthorized access. Cloud providers can generally find details of the best practices in the manufacturer’s documentation.
Installation and configuration of management tools
The tools that we use to install and configure our virtual machines must be carefully secured, because these systems control vast numbers of VMs and handle a great deal of sensitive data. Domain 3.1 delves into some of this in our discussion of the management plane.
Virtual hardware-specific security configuration requirements
In the cloud, not only do we need to securely configure our systems, but we also need to back up the configuration data. If there is a disaster and we have backed up the configuration data for our virtual machines, we can completely rebuild them from a data center in another availability zone. We discussed hypervisor security in Domain 3.1.
Installation of guest operating system (OS) virtualization toolsets
In the cloud we can set up dashboards that give us visualizations of different systems. This can give us a large amount of information about the health of our systems at a glance. One of the ways that we can collect this data is by installing monitoring agents like Azure Monitor Agent onto VMs. Having a centralized monitor enables us to check patch levels and configuration settings, as well as allowing us to respond and make changes. The CCSP exam outline refers to monitoring agents and the centralized monitors as guest operating system virtualization toolsets.
Another option for monitoring is virtual machine introspection (VMI). The hypervisor has full visibility into the VM, which means that if you monitor the hypervisor, you are also able to monitor what’s happening in the VM.
The Difference in Monitoring Location Between a Guest OS Virtualization Toolset and VMI
5.2 Operate and maintain physical and logical infrastructure for cloud environment
Access controls for local and remote access
In this section, we will discuss techniques for local access, such as using the console or KVMs, as well as technologies for remote access, like RDP and SSH. VPNs can also play a role in remote access, but we will cover them later in this section under Secure network configuration. Some of the major options are:
Local access | Remote access |
---|---|
Console | RDP (Remote Desktop Protocol) |
KVM | SSH (Secure Shell) |
 | Jump boxes |
 | Virtual clients |
KVM stands for keyboard, video (monitor) and mouse. A KVM switch is a piece of hardware that lets us control more than one computer from a single keyboard, monitor and mouse. With a KVM and one set of peripherals, you can physically connect to all of the servers in a rack. Some KVMs also allow you to connect multiple keyboards, monitors and mice.
Jump boxes, also known as jump hosts or jump servers, are hardened servers that users must log in to prior to accessing the other systems. Jump boxes allow access to machines on a remote network from a local network.
A virtual client, also known as a virtual desktop, is an image of a desktop environment with the operating system and apps built in. By virtualizing a desktop, we separate it from the device that we would normally use to access it. The benefit of virtualizing a desktop is that we can then remotely access the desktop environment from any of our devices.
Remote access security controls
Below are some of the most important security controls for remote access.
Access control | We want to have appropriate authentication and authorization measures to ensure that only authorized users are able to gain remote access. |
Encrypted connection | When we are remotely accessing servers, we want to make sure that our session is encrypted so that attackers can’t intercept the commands we are sending or any sensitive data. |
Real-time monitoring and logging | We need to have real-time monitoring and logging as both a deterrent and detective security control. |
Secure network configuration
Virtual local area networks (VLAN)
We covered this in the Virtual local area networks (VLANs) section of Domain 3.1.
Transport Layer Security (TLS)
Transport Layer Security (TLS) is one of the most common protocols for securing data in transit. The S in HTTPS (Hypertext Transfer Protocol Secure) comes from the security protections provided by TLS as an extension to the HTTP protocol. TLS can also add a protective layer to other application layer protocols, such as FTP, SMTP and XMPP. TLS evolved from SSL (Secure Sockets Layer), so many people still refer to TLS as SSL.
TLS is responsible for securing client and server communications. The first step of setting up a secure TLS connection is for the client and server to perform a TLS handshake where they agree on the algorithms that they will use, perform authentication, and establish a shared secret. This happens over the handshake protocol. Once the two parties have completed the handshake, they will have a secure channel through which they can safely communicate. This happens over the record protocol, which offers encryption, authentication, integrity and compression for data transmission once the session has been established.
The Basic TLS Handshake
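To make the handshake and record protocol concrete, here is a minimal sketch using Python's standard ssl module (example.com is just a placeholder host). The wrap_socket call performs the handshake described above; everything sent afterwards travels over the record protocol.

```python
import socket
import ssl

HOSTNAME = "example.com"  # placeholder host for illustration

# A default context enables certificate and hostname verification,
# which is the authentication step of the handshake.
context = ssl.create_default_context()

with socket.create_connection((HOSTNAME, 443)) as sock:
    # wrap_socket performs the TLS handshake: algorithm negotiation,
    # server authentication, and establishment of a shared secret.
    with context.wrap_socket(sock, server_hostname=HOSTNAME) as tls:
        print("Negotiated protocol:", tls.version())  # e.g. TLSv1.3
        print("Cipher suite:", tls.cipher())
        # From here on, data sent over `tls` travels via the record
        # protocol, which provides encryption and integrity.
        tls.sendall(b"HEAD / HTTP/1.1\r\nHost: example.com\r\n\r\n")
        print(tls.recv(200))
```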
Dynamic Host Configuration Protocol (DHCP)
The Dynamic Host Configuration Protocol (DHCP) automatically assigns a valid IP address to a device when it first connects to a network.
Domain Name System Security Extensions (DNSSEC)
We use the Domain Name System (DNS) protocol to translate human-readable domain names into machine-readable IP addresses. However, DNS doesn’t have much built-in security, and it’s vulnerable to issues such as DNS cache poisoning. Domain Name System Security Extensions (DNSSEC) is a suite of extensions that aims to plug some of the security gaps in DNS while remaining backward compatible. DNSSEC domains are digitally signed, which allows resolvers to verify the integrity and authenticity of DNS responses.
Tunneling
Before we can cover virtual private network (VPN) solutions in depth, we need to take a look at the more general concept of tunneling, which is depicted in the figure below. A VPN involves tunneling plus encryption—without encryption, it is only a tunnel. Tunneling is simply the process of taking a packet and placing it inside the payload of another packet. We often use tunneling for things like communicating between private IP addresses that are non-routable.
How Tunneling Works
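As a toy, byte-level illustration of the idea (not a real tunneling protocol—the addresses and header format are invented), the entire inner packet simply becomes the payload of an outer packet:

```python
# Toy illustration of tunneling: the whole inner packet (header + data)
# becomes the payload of an outer packet with its own, routable header.

def build_packet(src: str, dst: str, payload: bytes) -> bytes:
    header = f"SRC={src};DST={dst};LEN={len(payload)}|".encode()
    return header + payload

# Inner packet uses non-routable private addresses.
inner = build_packet("10.0.0.5", "10.0.0.9", b"hello from the private network")

# The tunnel endpoint wraps the entire inner packet inside an outer packet
# whose header uses public, routable addresses.
outer = build_packet("203.0.113.10", "198.51.100.20", inner)

print(outer)
# A VPN would additionally encrypt `inner` before encapsulating it;
# without encryption, this is only a tunnel.
```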
Virtual private networks (VPNs)
VPNs are one of the most reliable and cost-effective ways to securely connect two networks together. VPNs are commonly used to connect to remote networks, such as when a remote employee is logging in to company resources from their home, or when an admin logs in to a secure trust zone. There are a range of different solutions that combine a variety of technologies, including PPTP, L2F, TLS, SSH, and IPsec.
IPsec
IPsec is a suite of security protocols that is natively supported in IPv6 and is therefore becoming a standard component of networking. It operates at the network layer. IPsec is made up of two major protocols, Authentication Header (AH) and Encapsulating Security Payload (ESP). AH provides integrity, data-origin authentication, and replay protection. However, AH isn’t used much these days. ESP provides all the functions AH does, in addition to ensuring confidentiality, because it provides payload encryption.
IPsec can be used in two different modes: transport mode and tunnel mode. Transport mode uses the header of the original packet, whereas in tunnel mode the original packet (the header and the data) is encapsulated and a new header is attached to it, as shown below.
Network security controls
Firewalls
Firewalls are preventive security controls that enforce security rules between two or more networks or network segments. They do this by performing traffic filtering and either blocking or allowing traffic based upon pre-defined rules. The basic types of firewalls are:
- Packet filtering
- Stateful packet filtering
- Circuit-level proxy
- Application-level proxy
CI/CD stands for continuous integration, continuous delivery, although some sources switch out “delivery” for “deployment”. Continuous integration involves automating many of the steps for committing code to a repository, as well as automating much of the testing. This allows code changes to be frequently integrated into the shared source code and ensures that testing gets done consistently and easily.

Continuous delivery also involves automating the integration and testing of code changes, but it adds delivery as well, automating the release of these validated changes into the repository. Continuous deployment takes things a step further and automatically releases the code changes into production so that they can be used by customers. With continuous deployment, code changes can be put into production without further human intervention, as long as they pass all of the testing and there are no issues. If there is an error in any of these steps, the changes are sent back to the developer. The diagram highlights how these three processes overlap.

The table below breaks down where firewalls are situated in the OSI model and the key characteristics of each firewall technology.
Characteristic | Simple packet filtering | Stateful packet filtering | Circuit-level proxy | Application-level proxy |
---|---|---|---|---|
OSI layer | Network (OSI layer 3). | Network and transport (OSI layers 3 and 4). | Session (OSI layer 5). | Application (OSI layer 7). |
Complexity | Simplest. | Complex. | More complex. | Very complex. |
Performance | Fastest. | Fast. | Higher latency. | Highest latency. |
How it works | Filters based on source and destination IP address, port and protocol of operation. | Maintains state table and makes decisions based on the state. | Filters sessions based on rules. | Filters based on data (the payload). |
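To illustrate the simplest of these technologies, here is a rough sketch of packet filtering: header fields are compared against an ordered rule base and the first match wins (the rules and addresses are invented for illustration). A stateful firewall would add a state table on top of this logic.

```python
# Minimal packet-filtering sketch: compare header fields against ordered
# rules and apply the first match, falling back to a default of deny.

RULES = [
    {"src": "any", "dst": "10.0.0.5", "port": 443, "proto": "tcp", "action": "allow"},
    {"src": "any", "dst": "any",      "port": 23,  "proto": "tcp", "action": "deny"},
]

def filter_packet(packet: dict, rules=RULES, default="deny") -> str:
    for rule in rules:
        # A rule field matches if it is "any" or equals the packet's value.
        if all(rule[k] in ("any", packet[k]) for k in ("src", "dst", "port", "proto")):
            return rule["action"]
    return default  # anything not explicitly matched falls to the default

print(filter_packet({"src": "198.51.100.7", "dst": "10.0.0.5", "port": 443, "proto": "tcp"}))  # allow
print(filter_packet({"src": "198.51.100.7", "dst": "10.0.0.5", "port": 23,  "proto": "tcp"}))  # deny
```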
Data Inspection
At a high level, data inspection involves monitoring and examining transmitted data and taking action according to a predefined set of security rules. Data inspection can involve the activities below:
Virus scanning | Files are scanned against known signatures for malware. |
Stateful inspection | Both the state (the status of an application or process) and the context (IP addresses, packets and other data) are monitored for potentially malicious activity. |
Content inspection | The content of packets is scanned and inspected for compliance with specific security rules. |
IDS and IPS
An intrusion detection system (IDS) examines traffic at the network level or the host level, specifically looking for malicious activity, policy violations, or other signs of suspicious activity. It can send alerts and log these events, but it cannot take direct action against malicious activity. An intrusion prevention system (IPS), can detect, prevent and take corrective action when necessary.
IDS and IPS network architecture
The sole purpose of an IDS is to detect suspicious activity. It therefore needs to be paired with other tools to provide the additional capabilities of prevention and correction. As shown in the top half of the diagram, an IDS is connected to a network via a mirror, SPAN or promiscuous port. This allows the IDS to get a copy of all network traffic. If potentially malicious traffic is identified, the IDS can communicate with the firewall, which can then take preventive and corrective actions.
IDS vs. IPS Architecture
In the lower half of the diagram, the IPS is placed in line with the network traffic, because it has the capability to detect, prevent, and correct. As traffic comes into the network, it passes through the IPS. If a rule is triggered for malicious activity, the IPS can act and prevent the suspicious packets from traversing the rest of the network.
In the context of the cloud, one of the best places to put an IDS/IPS is on the hypervisor because it has complete visibility into the VMs on top of it. If we monitor the SDN, we will be missing the traffic between VMs on the hypervisor. We can also install IDS/IPS on a single VM, or on our SDN. The yellow dots in the figure below indicate the different places we can put an IDS/IPS.
Placing an IDS/IPS in the Cloud
Port mirroring
When a network device, such as a switch, is port mirroring, it means that all of the traffic passing through a port (or multiple ports) is being replicated and sent via another port to a network monitoring connection. We can hook an IDS/IPS up to this network monitoring connection, so that the IDS can see the data that is traveling through the switch’s port (or multiple ports). Port mirroring is also sometimes called SPAN after Cisco’s Switched Port Analyzer feature. We often use the words mirroring, SPAN and promiscuous interchangeably in this context.
IDS/IPS detection methods
Two types of analysis engines are used in IDS and IPS. These are signature-based and anomaly-based:
Signature-based | Signature-based detection involves looking for known signatures of malicious activity, such as file hashes, suspicious email subject lines, malicious IP addresses or byte sequences. These signatures are added into the analysis engine. When an IDS/IPS sees one of these signatures in the network traffic, it can send an alert, which makes signature-based detection useful for finding known threats. |
Anomaly-based | Anomaly-based detection seeks to complement signature-based detection by first creating a baseline of normal behavior within the system, and then sending alerts when it detects anomalous or suspicious behavior. The major downsides are that it is computationally expensive and it can generate a lot of false positives. There are four major ways to detect anomalies:
|
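As a rough sketch of the anomaly-based approach, using a simple statistical baseline (the numbers and threshold are invented; real detection engines are far more sophisticated):

```python
import statistics

# Baseline of "normal" behaviour, e.g. logins per hour observed historically.
baseline = [12, 15, 11, 14, 13, 16, 12, 14, 15, 13]

mean = statistics.mean(baseline)
stdev = statistics.stdev(baseline)

def is_anomalous(observation: float, threshold: float = 3.0) -> bool:
    """Flag observations more than `threshold` standard deviations from the mean."""
    return abs(observation - mean) > threshold * stdev

for value in (14, 55):
    print(value, "-> anomaly" if is_anomalous(value) else "-> normal")
```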
Ingress and egress monitoring
Ingress monitoring involves monitoring incoming network traffic, while egress monitoring involves looking at outgoing network traffic. Both are shown below:
Alert statuses
Whenever our monitoring tools raise an alert (a positive) or stay silent (a negative), that determination can be either true or false, giving the four alert statuses shown in the table below:
 | True | False |
---|---|---|
Positive | True-positive | False-positive |
Negative | True-negative | False-negative |
In an ideal world, we only want true positives and true negatives. Basically, we just want to know when we are being attacked, and when we aren’t. But our security tools are far from perfect, and attackers are constantly coming up with new schemes to circumvent them. One thing that we can do is tune our security tools to be more sensitive. While this will make them more likely to pick up on true positives, it will also lead to a bunch more false positives, which will overwhelm and fatigue the security team. If we tune our tools down, we will get fewer false positives, but we also increase the risk of a false negative—we could be suffering an attack without realizing it.
Allow lists and deny lists
Allow lists (whitelists) and deny lists (blacklists) are important mechanisms for IDS/IPSs to detect and potentially block suspicious traffic. With an allow list, network traffic to listed IP addresses is allowed. Any other IP address is blocked by default. Deny lists are the exact opposite—any traffic to listed IP addresses is specifically blocked. Any other IP address is allowed by default.
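A minimal sketch of the two default behaviors (the addresses are invented):

```python
# Minimal sketch of allow-list vs deny-list logic.

ALLOW_LIST = {"10.0.0.5", "10.0.0.6"}   # only traffic to these destinations is permitted
DENY_LIST = {"203.0.113.66"}            # only traffic to these destinations is blocked

def allowed_by_allow_list(ip: str) -> bool:
    return ip in ALLOW_LIST             # default deny

def allowed_by_deny_list(ip: str) -> bool:
    return ip not in DENY_LIST          # default allow

print(allowed_by_allow_list("10.0.0.5"))     # True
print(allowed_by_allow_list("10.0.0.99"))    # False (not listed, so blocked)
print(allowed_by_deny_list("10.0.0.99"))     # True (not listed, so allowed)
print(allowed_by_deny_list("203.0.113.66"))  # False
```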
Honeypots
Honeypots are individual computers, usually running a server OS, that pose as interesting targets for an attacker but contain no real data or value to the organization that deploys them. Honeynets are two or more honeypots networked together. A sophisticated honeynet employs routers, switches, or gateways. Honeypots and honeynets contain deliberate vulnerabilities—usually unpatched systems, applications, open ports or running services—that aim to entice potential attackers into exploring further.
Network security groups
Network security groups (NSGs) can filter traffic over a virtual network. NSGs act as virtual firewalls that control traffic at the packet level. They have predefined security rules that are set to either allow or deny both inbound and outbound network traffic across the virtual network.
Bastion hosts
Organizations have applications and services that require access to the Internet, such as their webservers and email servers. These are necessary, but they often pose significant risks because attackers can use these as entry points.
The risks can be mitigated through the creation of a subnetwork, usually referred to as a demilitarized zone (DMZ), where services and applications that require public access can be segregated. Devices and applications within a DMZ are often referred to as bastion hosts and bastion applications. Given their vulnerable position, they are strengthened and hardened to defend against attacks. Between the DMZ and the Internet is a boundary router.
Operating system (OS) hardening through the application of baselines, monitoring and remediation
When we talk about hardening, we mean configuring a system to be in a secure state. The specifics of how we harden something depend on what software we are using and how we are using it. This means that we need to create a baseline for each type of system. A baseline is a documented, standard set of policies and configurations that we apply to every system of a certain class. It’s situation dependent, but the list below includes some steps that we often take to harden our systems, followed by a short sketch of how a baseline check might work:

- Disabling unnecessary services
- Installing security patches
- Changing default credentials
- Closing unneeded ports
- Installing anti-malware
- Installing a host-based firewall/IDS
- Using encryption
- Implementing strong authentication
- Configuring backups
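Here is a rough sketch of how a documented baseline might be checked against a system's actual state; the setting names and values are invented for illustration:

```python
# Compare a documented baseline against a system's actual configuration and
# report deviations for remediation. Settings and values are illustrative only.

BASELINE = {
    "telnet_service": "disabled",
    "default_admin_password_changed": True,
    "host_firewall": "enabled",
    "disk_encryption": "enabled",
}

actual_state = {
    "telnet_service": "enabled",             # deviation
    "default_admin_password_changed": True,
    "host_firewall": "enabled",
    "disk_encryption": "disabled",           # deviation
}

deviations = {
    setting: (expected, actual_state.get(setting))
    for setting, expected in BASELINE.items()
    if actual_state.get(setting) != expected
}

for setting, (expected, actual) in deviations.items():
    print(f"Remediate {setting}: expected {expected!r}, found {actual!r}")
```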
Patch management
Patch management is a process for maintaining an environment that is secure against known vulnerabilities. Patches fix security flaws and vulnerabilities in systems. Patches can also improve performance and add functionality.
Whenever deploying patches, it’s important to do so in a manner that leaves the operating environment consistently configured. Patches should be deployed to the entire environment. We then need to verify that they were deployed properly with everything configured consistently. Below is the full patch management life cycle:
Deploying Patches
Once the need is identified, patches can be deployed via manual or automated means. With a manual approach, somebody logs in to the target system and installs the software. With an automated approach, software is used to roll out the patches. Automated patching should be avoided for high-value, high-priority production systems because patching sometimes breaks things.
Infrastructure as Code (IaC) strategy
We discussed Infrastructure as code (IaC) in Domain 4.2.
Availability of clustered hosts
If we want to ensure a high level of availability for our VMs, we need systems in place that help to mitigate the many opportunities for failure. Cloud providers use clustering and redundancy to help provide this high availability (HA).
Clustering refers to a group of systems working together to handle a load. Redundancy also involves a group of systems, but unlike a cluster, where all the members work together, redundancy typically involves a primary system and a secondary system. The primary system does all the work, while the secondary system is in standby mode. If the primary system fails, activity can fail over to the secondary system.
Distributed resource scheduling
Distributed resource scheduling (DRS) helps us balance the available resources against our computing workloads. If there is high demand and a restricted pool of resources, it helps us decide how these will be distributed.
Reservations set a minimum amount of resources that you will receive from the provider. They can be set in terms of compute, RAM, disk, or network. For example, if you set a reservation of 2 GB of RAM for a VM, your cloud provider will make sure that the VM always gets at least 2 GB.

Shares are another important concept you need to understand. They matter when there is resource contention and there isn’t enough to go around between all parties. With shares, cloud customers can pay to be designated a certain portion of the service’s resources. The figure shows four VMs and their respective shares of storage. When there isn’t enough storage to go around, the VM on the right is entitled to 50% of the resources, and the other three VMs split the remaining 50% between them according to their own share allocations.
Limits set a maximum, and they are useful for keeping your cloud costs under control. With limits, you can set a limit for the amount of resources that you want to consume and know that you won’t go above that.
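To tie the shares example together, here is a rough sketch of proportional allocation under contention; the VM names and numbers are invented, but they mirror the 50/50 split described above:

```python
# Toy allocation under contention: each VM gets a slice of the available
# storage proportional to its shares. Numbers are illustrative only.

AVAILABLE_STORAGE_GB = 1000

vm_shares = {"vm_a": 1000, "vm_b": 500, "vm_c": 250, "vm_d": 250}  # vm_a holds 50% of all shares

total_shares = sum(vm_shares.values())
for vm, shares in vm_shares.items():
    allocation = AVAILABLE_STORAGE_GB * shares / total_shares
    print(f"{vm}: {allocation:.0f} GB ({shares / total_shares:.0%} of shares)")
```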
Dynamic optimization
Dynamic optimization is Microsoft’s name for a process of migrating workloads between hosts. As an example, if a host is running low on resources, it can live-migrate a workload away to another host that still has excess resources available.
Storage clusters
We discussed Storage clusters in Domain 3.1.
Maintenance mode
Maintenance mode takes a system like a compute node or a hypervisor offline, which disables customer access. It also stops automated logging alerts from being generated, but local logging is still enabled.
High availability (HA)
Many people use the terms uptime and availability interchangeably. When a distinction is made, uptime usually means that a system is powered on, including when it is in maintenance mode, whereas availability means not only that a system is up, but that customers can actually connect to it. We describe availability in terms of nines, which express the percentage of the year that a system is available. The table lists the maximum downtime per year for each nines value.
Availability | Nines | Downtime per year |
---|---|---|
90% | 1 nine | 36.5 days |
99% | 2 nines | 3.65 days |
99.9% | 3 nines | 8.76 hours |
99.99% | 4 nines | 52.56 minutes |
99.999% | 5 nines | 5.25 minutes |
99.9999% | 6 nines | 31.5 seconds |
99.99999% | 7 nines | 3.15 seconds |
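The downtime figures in the table follow directly from the availability percentage. A quick sketch to reproduce them:

```python
# Reproduce the downtime-per-year figures from the number of nines.
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000 (ignoring leap years)

for nines in range(1, 8):
    unavailability = 10 ** -nines              # e.g. 3 nines -> 0.001 of the year down
    downtime = unavailability * SECONDS_PER_YEAR
    print(f"{nines} nine(s): about {downtime:,.2f} seconds of downtime per year "
          f"({downtime / 86400:.2f} days)")
```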
Availability of guest operating system (OS)
The responsibility for guest OS availability is dependent on the service model. Under IaaS, the provider is responsible for the availability of the underlying infrastructure and the virtualization, but the cloud customer is responsible for everything above that.
Performance and capacity monitoring
It’s important to monitor the performance of our systems in the cloud, such as CPU, memory, disk and network. We use tools like the Simple Network Management Protocol (SNMP), as well as real user monitoring (RUM) and synthetic performance monitoring.
Network performance monitoring
The Simple Network Management Protocol (SNMP) is an important protocol for network performance management and monitoring. It allows us to remotely connect to systems and collect a range of performance data from them.
Real user monitoring (RUM) and synthetic performance monitoring
Real user monitoring is a passive monitoring technique that monitors user interactions and activity on a website or application. Synthetic performance monitoring essentially involves creating test scripts for each type of functionality so that we can run them at any time.
Along with these functional tests, synthetic performance monitoring can also test functionality and performance under load. We can run thousands of these tests at the same time, which gives us an idea of how well the system can handle loads, as well as its overall performance and response times.
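A minimal sketch of a synthetic check, assuming the third-party requests library is installed and using example.com as a placeholder endpoint with an invented performance threshold:

```python
import time
import requests  # third-party HTTP client, assumed to be installed

URL = "https://example.com/"   # placeholder endpoint to test
TIMEOUT_SECONDS = 5
MAX_ACCEPTABLE_SECONDS = 2.0   # invented performance threshold

def synthetic_check(url: str) -> None:
    """Simulate a user request and record availability and response time."""
    start = time.monotonic()
    try:
        response = requests.get(url, timeout=TIMEOUT_SECONDS)
        elapsed = time.monotonic() - start
        ok = response.status_code == 200 and elapsed <= MAX_ACCEPTABLE_SECONDS
        print(f"{url}: status={response.status_code}, {elapsed:.2f}s, {'PASS' if ok else 'FAIL'}")
    except requests.RequestException as exc:
        print(f"{url}: FAIL ({exc})")

synthetic_check(URL)
# In practice, a scheduler runs many of these scripts continuously and in
# parallel to measure behaviour under load.
```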
Performance monitoring thresholds
Service-level agreements (SLAs) set out the specifics of the service levels that a cloud provider will provide to the customer. Among other things, SLAs can include thresholds for:
- CPU
- Memory
- Disk
- Network
Hardware monitoring
The cloud provider is responsible for monitoring their hardware to ensure that it is functioning appropriately. Some of the things providers can monitor include:
- Temperature
- Fan speed
- CPU
- Disk I/O
Configuration of host and guest operating system (OS) backup and restore functions
If we want to ensure business continuity, then not only do we need to back up our guest operating systems, but we also need to back up the underlying VM configurations. These are shown in yellow in the diagram. Two common options are agent-based backups and agentless backups, which are discussed in the table.
Agent-based backups | Backup software (an agent) is installed inside each guest OS, and the agent handles backing up that machine’s data and configuration. |
Agentless backups | Backups are taken from outside the guest, typically at the hypervisor or virtualization layer (for example, via snapshots), with no software installed in the guest OS. |
Management plane
We discussed the management plane in Domain 3.1.
5.3 Implement operational controls and standards
The Information Technology Infrastructure Library (ITIL) is a framework for IT service management that helps us effectively deliver IT services to stakeholders. ISO/IEC 20000-1 is an important standard that covers service management system requirements. It covers similar territory to the ITIL framework, and it’s another critical document for aligning IT service management alongside business processes.
Change management
Our technologies need to change on a regular basis in order to move forward and keep up to date. When we want to change things that are in production, we need to implement a rigorous quality assurance process. Change management requires the right processes to ensure that our production systems remain stable.
Continuity management
Continuity management is about ensuring that services are maintained at appropriate levels, even if disaster strikes. It’s about planning and setting up resilient systems so that you can continue to provide services in adverse circumstances.
Information security management
Information security management is all about taking a risk-based approach to ensure we have security policies and controls that help to keep an organization safe, without substantially hindering the organization’s effectiveness. Security needs to be driven from the top of the organization down, with training and awareness campaigns to ensure that all employees understand their responsibilities.
Continual service improvement management
An important part of service management involves constantly improving our IT services. Continual service improvement management starts by focusing on business objectives and then coming up with metrics that clearly link to these objectives. Examples of metrics we can use to communicate our objectives to leadership include:
- Uptime and availability
- Support ticket responsiveness
Incident management
Incidents are issues that can impact our services. Incident management involves developing the capability to detect incidents, prioritize them, escalate them if necessary, resolve them if possible, and then close them.
When we talk about managing incidents, we are generally talking about managing things that have already gone bad. A similar practice is known as event management. This involves monitoring our services and their components and then reporting on observable occurrences or significant changes in state. We term these state changes “events”, and we use automated monitoring tools to alert us when they occur. While some events may also qualify as incidents, not all events are bad, nor do they all require action. Monitoring events and taking corrective actions early can help us to prevent incidents or resolve them more rapidly.
Events vs. Incidents
The incident management process
Problem management
A problem is the cause of an incident (or multiple incidents). Organizations should use incident data and trends to identify problems, and they should analyze the underlying causes to figure out ways to prevent incidents from occurring or recurring. Problems need to be:
- Recorded and classified
- Prioritized
- Escalated if needed
- Resolved if possible
- Closed
Release management
When we are releasing new services and features, it can involve many different components. All of these need to work together carefully for a smooth release, so the release management process can involve things like coordination, documentation and training, as well as updating our processes and components.
Deployment management
When we move new components to live environments, we need to have the appropriate deployment management practices in place. Deployment management is important for all new components, whether they are software, hardware, processes or documentation. Deployment management and release management often happen alongside one another, but they are separate practices.
Options for deployment include:
- Pull deployment
- Continuous delivery
- Phased deployment
- Big bang deployment
Configuration management
Configuration items (CIs) are the components that we have to manage and maintain in order to provide an IT service. CIs can range widely, including things as distinct as:
- Buildings
- The IT department
- Suppliers
- People
- Hardware
- Software
- Networks
- Documentation
Configuration management involves collecting information on each of these CIs and understanding how they work together to provide a service. Managing all of this configuration information can ultimately help an organization be more effective.
Service-level management
Organizations use service-level management to set targets for service levels and ensure that they deliver services according to these targets. Providers and customers sign service-level agreements (SLAs) that stipulate the levels of each service that must be delivered. Before signing a contract, cloud customers need to know what their service needs are so that they can find a suitable provider.
Privacy level agreements (PLAs) can be included as part of your contracts in addition to the SLA. These set out the privacy protections that the provider agrees to deliver as part of its service.
Availability management
We typically manage the availability of our services through clustering or redundancy (see 5.2 Availability of clustered hosts). We want our systems to be able to detect when something is wrong and then make the appropriate changes to ensure that our services stay up. Service-level agreements will generally include an availability level, often measured in nines, like we discussed in the high availability section of 5.2.
Capacity management
The performance of our services is dependent on their capacity, which is the maximum that a service can deliver. If we want to provide a service that meets our service-level agreement, then we need sufficient capacity and we need to manage this capacity effectively. We must do it in a cost-effective way if we want to stay in business.
5.4 Support digital forensics
In SP 800-86, NIST defines digital forensics as “…the application of science to the identification, collection, examination, and analysis of data while preserving the integrity of the information and maintaining a strict chain of custody for the data.”
A similar term is eDiscovery, which ISO/IEC defines as the “…process by which each party obtains information held by another party or non-party concerning a matter”, that “…includes the identification, preservation, collection, processing, review, analysis, or production of Electronically Stored Information (ESI).”
In practice, we often use the terms digital forensics and eDiscovery interchangeably. Both are rigorous processes for collecting, preserving, analyzing and reporting electronic information as evidence. The distinction between the two is that eDiscovery is generally seen as a process for collecting digital evidence from another party for a legal matter. Digital forensics is broader—a company could use the same tactics to analyze a cyber-incident, even when the authorities or the legal system aren’t involved.
An important part of the eDiscovery process is notification, where an organization is directed to preserve all information potentially relevant to a given case. This is also known as a legal hold. When we perform the eDiscovery process, we need to disclose all electronic information that we possess, have custody over, or control and that is relevant to the claims or defense involved.
Forensic data collection methodologies
One of the most important frameworks that covers digital forensic methodologies is the ISO/IEC 27050 series of publications. It delves into eDiscovery and the related topics of identification, preservation, collection, processing, review, analysis, and production of electronic data for investigations. The Cloud Security Alliance published the Cloud Forensics Capability Maturity Model, which is another key resource.
Evidence management
We discuss evidence management as part of the following section.
Collect, acquire, and preserve digital evidence
Forensic Investigation Process
Digital forensic processes can vary, but there are also a number of practices and standards that we use fairly consistently, regardless of context. First, we must identify and secure the scene. This is important to prevent potential evidence from being touched, removed, or otherwise contaminated until it can be properly examined.
This step also marks the beginning of the chain of custody. As we mentioned in Domain 2.8, the chain of custody gives us a way to prove that our evidence is legitimate. It’s essentially a set of documentation that records the chronological order of how evidence has been collected, preserved, analyzed and provided to the courts so that the evidence maintains its integrity, as below.
After a scene is identified and secured properly, the formal collection of evidence can take place. Once collected, evidence and data can be examined and analyzed via automated and manual means. Finally, the results of the analysis should be compiled in a report.
Sources of information and evidence
Sources of information and evidence as part of a computer security investigation often include oral and written statements, documents, audio and visual records, as well as computer systems.
Motive, opportunity and means (MOM)
The three traits of motive, opportunity and means (MOM) form a structure that is useful for investigations. Motive asks whether a suspect had a reason to commit the crime. Opportunity asks whether the suspect actually had the chance to commit it. Means asks whether the suspect was actually capable of committing it.
Locard’s exchange principle
Locard’s exchange principle can be understood as an exchange that occurs whenever two items make contact. As an example, when someone walks through mud, boot prints are left behind and some of the mud gets stuck to the boot. Similarly, when an insider threat logs in and steals company data (taking something), their access can be logged (leaving something behind). We can use this principle as a tool to aid our investigations: look for the traces that an attacker may have left behind. Once we narrow down the suspect, we can examine whether they have taken any evidence from the crime scene.
Digital forensics
One of the primary considerations of digital forensics is what’s referred to as live evidence, shown in the diagram below. Live evidence is data that is stored in a running system in places like random access memory (RAM), CPU, cache, buffers, etc. Cloud customers will generally not have access to a provider’s hardware, which means that it may not be feasible to gather certain types of live evidence.
Reporting and documentation
Documentation should be created at each step of the forensics process. Once the process is completed and all of the evidence has been analyzed, all of the findings and documentation should be collated into a report.
Artifacts
Forensic artifacts are remnants of a breach or an attempted breach of a system or network. Examples of artifacts include IP addresses, hashes, file names and types, registry keys (Windows), URLs, operating system information, etc. Another important type of artifact is logged information, like account updates, profile changes, file changes, etc. If there has been malicious activity, these logs can help you identify the culprit.
Five rules of evidence
If we want evidence to stand the best chance of surviving scrutiny, it should exhibit five characteristics, known as the five rules of evidence:
Authentic | We want to be able to show that evidence is not fabricated or planted. |
Accurate | We want to be able to prove that evidence has not been changed or modified—that it has integrity. |
Complete | Evidence must tell the whole story; all relevant evidence must be presented, even if it does not support the case. |
Convincing or reliable | Evidence must be conveyed in a manner that allows stakeholders to understand what is being presented, and it must display a high degree of veracity—it must be truthful and believable. |
Admissible | Evidence needs to be accepted as part of a case and allowed into the court proceedings. |
Investigative techniques
There are several different investigative techniques that we use for analysis. One of them is media analysis, which involves analyzing things like hard drives, flash drives, tapes, CDs, USB drives, etc. However, in the cloud, you will generally only be able to access a snapshot. We discuss these further in the Cloud forensics section. Another technique is software analysis, which focuses on code, especially malware. Network analysis attempts to understand how a network might have been penetrated, how the network was traversed, and what systems may have been breached.
Types of investigation
Type | Overview | Who drives the investigation? |
---|---|---|
Criminal | Criminal investigations deal with crimes and can result in legal punishment. | Primarily law enforcement, with support from the affected organization. |
Civil | These deal with disputes between individuals or organizations; the party found at fault usually pays monetary damages. | Organizations, individuals and their attorneys. |
Regulatory | These investigations deal with violations of regulated activities. | The associated regulatory body. |
Administrative | Administrative investigations focus on internal violations of organizational policies and other incidents identified by an organization. These could involve employee misconduct or violation of policies and procedures. Unless it’s determined that there was criminal activity, administrative investigations are opened and closed by the organization itself. | The organization. |
Cloud forensics
Public cloud environments involve multitenancy, with multiple customers sharing the same physical infrastructure, including hard drives. This means that physically accessing hardware that may contain relevant information is typically not possible because it could violate the privacy of other customers.
When a cloud provider receives an eDiscovery order for a legal case, it should notify the relevant customer immediately. Instead of accessing physical disks and systems, an investigator will most likely request copies—snapshots—of the virtual disk and VM images to obtain evidence and information pertinent to the investigation.
5.5 Manage communication with relevant parties
When we are using cloud environments as core parts of our organizations, we must effectively manage relationships and communications with key stakeholders, including:
- Vendors
- Customers
- Partners
- Regulators
- Other stakeholders
5.6 Manage security operations
Security operations center (SOC)
We want to monitor everything we do in the cloud, plus everything that’s happening in our on-premises environments. Security operations centers (SOCs) are critical for keeping track of it all. They operate around the clock and are staffed with security analysts who focus on monitoring, analyzing, responding to, reporting on, and preventing security incidents. The table below highlights some of the important elements of an SOC.
The right people | We need highly skilled staff who can monitor and respond to events appropriately. |
The right processes | We need policies and administrative controls to ensure that the SOC operates in a cohesive manner, without leaving any gaps that attackers can slip through |
The right technologies | We need the right technologies in place to detect and respond to events. Many of these tools automate the work, but there are still many manual processes that analysts must conduct. |
Intelligent monitoring of security controls
We discussed firewalls, IDS/IPSs, honeypots, and network security groups in section 5.2 Network security controls.
Log capture and analysis
Security information and event management (SIEM)
Security information and event management (SIEM) systems ingest logs from disparate systems throughout an organization. They aggregate and correlate these log entries and analyze them for interesting activity. They then report on these findings so that additional action can be taken if necessary. SIEM systems centralize logs, analyze trends, and even provide dashboards of relevant information. The diagram below shows how a SIEM system operates.
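As a rough illustration of the aggregate-and-correlate step, here is a toy sketch that counts failed logins per user across several sources; the log format, sources and threshold are all invented:

```python
from collections import Counter

# Toy log entries already normalized from different sources (format invented).
events = [
    {"source": "vpn",    "user": "alice", "event": "login_failed"},
    {"source": "webapp", "user": "alice", "event": "login_failed"},
    {"source": "ad",     "user": "alice", "event": "login_failed"},
    {"source": "webapp", "user": "bob",   "event": "login_success"},
]

# Correlate failed logins per user across all sources.
failed_per_user = Counter(e["user"] for e in events if e["event"] == "login_failed")

ALERT_THRESHOLD = 3  # invented rule: 3+ failures across sources raises an alert
for user, count in failed_per_user.items():
    if count >= ALERT_THRESHOLD:
        print(f"ALERT: {count} failed logins for {user} across multiple sources")
```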
Threat intelligence
Threat intelligence is an umbrella term encompassing threat research, analysis and emerging threat trends. It’s an important part of any organization’s digital security strategy. It equips security professionals to proactively anticipate, recognize, and respond to threats.
User and entity behavior analytics (UEBA)
User and entity behavior analytics (UEBA), which is also known as user behavior analytics (UBA), is often included with SIEM solutions or it may be added via subscription. As the name implies, UEBA focuses on the analysis of user and entity behavior. At its core, UEBA monitors the behavior and patterns of users and entities. It then logs and correlates the underlying data, analyzes the data, and triggers alerts when necessary.
Continuous monitoring
The SIEM system must be updated and monitored continuously, because:
- The threat environment is constantly changing.
- New vulnerabilities are constantly emerging.
- Assets in the organization are changing.
- New monitoring rules need to be configured and programmed.
- The balance between false positives and false negatives must be closely monitored and responded to accordingly.
Having the right people following the right processes can help to stop breaches before they do significant damage to an organization. The full continuous monitoring life cycle is shown in the diagram below.
Log management
Reviewing and analyzing log files can help an organization know if systems deployed in production are working properly. It’s also critical for deterring and detecting cyber incidents. However, it’s easy to get overwhelmed by too many logs, or neglect reviewing important logs, so you need to keep the considerations from the table below in mind:
Log what is relevant | Most systems produce a wealth of information, but not all of it is relevant. You should use your risk assessment as a guide for which assets are most valuable and face the most significant threats. This is a good starting place for figuring out what is relevant to log. |
Review the logs | Logs must be reviewed by either automated or manual means. |
Identify errors and anomalies | As log review is undertaken, you should focus on identifying errors or anomalies that may indicate attacks or suspicious activities. |
Logging event time
Ensuring consistent time stamps of log entries is very important. If an organization has deployed multiple servers and other network devices—like switches and firewalls—and each device is generating events that are logged, it’s critical that the system time, and therefore the event log time, for each device is the same. Otherwise, it can be hard to correlate activities when there is a breach or other incident. We can use the Network Time Protocol (NTP) to ensure that all system and device clocks are set to the exact same time.
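A small sketch of why consistent time matters: before events from different devices can be correlated, their timestamps need to be normalized to a common reference such as UTC (the timestamps and offsets here are invented):

```python
from datetime import datetime, timedelta, timezone

# Two devices logged the "same" moment, but their clocks use different offsets.
firewall_event = datetime(2024, 5, 1, 14, 3, 27, tzinfo=timezone(timedelta(hours=-5)))
switch_event = datetime(2024, 5, 1, 19, 3, 27, tzinfo=timezone.utc)

# Normalizing to UTC shows that they describe the same instant.
print(firewall_event.astimezone(timezone.utc) == switch_event)  # True
# NTP keeps the device clocks themselves in sync so offsets like these stay accurate.
```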
Limiting log sizes
We often use two methods to limit log sizes: circular overwrite and clipping levels. Circular overwrite works as the name suggests. For example, if the log file size is set to 100 MB, or perhaps ten thousand logged events, enabling circular overwrite means that once the log file reaches the maximum size or length, the oldest entries start being overwritten by the newest. This means that the maximum file size or number of entries will never be exceeded.
The other method, clipping levels, involves not logging every single bit of activity. Instead, logs only start being collected after a specific threshold has been crossed. As an example, logging every failed login attempt due to a wrong password makes no sense, because people mistype passwords all the time. However, if the wrong password is typed ten times or twenty times, this could be an indication of a system-related problem or a password-cracking attempt. In these cases, we would definitely want to collect logs. This is where clipping levels can be effectively used. A threshold can be set so that logs are only stored after that threshold has been reached.
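A rough sketch of both techniques (the thresholds are invented; a fixed-length deque stands in for circular overwrite):

```python
from collections import Counter, deque

# Circular overwrite: once the buffer is full, the oldest entry is discarded.
MAX_LOG_ENTRIES = 10_000            # invented maximum
log_buffer = deque(maxlen=MAX_LOG_ENTRIES)

def record(entry: str) -> None:
    log_buffer.append(entry)        # silently overwrites the oldest entry when full

# Clipping level: only start logging failed logins after a threshold is crossed.
CLIPPING_LEVEL = 10                 # invented threshold of failed attempts per user
failed_attempts = Counter()

def failed_login(user: str) -> None:
    failed_attempts[user] += 1
    if failed_attempts[user] >= CLIPPING_LEVEL:
        record(f"possible password-cracking attempt against {user} "
               f"({failed_attempts[user]} failures)")

for _ in range(12):
    failed_login("alice")
print(len(log_buffer), "entries logged")  # 3 entries (attempts 10, 11 and 12)
```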
Incident management
We discussed incident management in 5.3. Incident management.
Vulnerability assessments
We discussed vulnerability assessments in 4.4 under the Vulnerability assessment and penetration testing section.
CCSP Domain 5 key takeaways
5.1 Build and implement physical and logical infrastructure for cloud environment
Hardware-specific security configuration requirements
Installation of guest operating system (OS) virtualization toolsets
5.2 Operate and maintain physical and logical infrastructure for cloud environment
Access controls for local and remote access
Secure network configuration
Network security controls
Operating system (OS) hardening through the application of baselines, monitoring and remediation
Patch management
Availability of clustered hosts