Week 9

Compute Operations (Nova)

OPS3 - Virtualization and Cloud Infrastructure

1. Introduction to OpenStack Compute (Nova)

Section 1 Checkpoint

Summary:

Nova is the compute controller, comparable to Azure's Microsoft.Compute resource provider or AWS EC2.
Cellular Architecture: Partitions the cloud for scalability and resilience.
Hypervisor Agnostic: Nova manages the hypervisor (KVM) but is not the hypervisor itself.

Reflection:

Why does Nova need a "Cellular Architecture" for large-scale clouds?
What is the difference between Nova and KVM?

2. Nova Component Anatomy

2.1 The Global Components (Control Plane)

The entry point for all requests is nova-api.
This service accepts REST requests from users and other services.
It first validates the user's authentication token via Keystone before passing the request into the system.
Crucially, nova-api is stateless, meaning scaling it is as simple as running multiple copies behind a Load Balancer.

The decision-making heart of the cloud is nova-scheduler.
Its sole responsibility is to decide where a new virtual machine should be placed.
It does not create the VM or touch the hypervisor; it simply selects the most appropriate host from the pool of available resources and passes the message along.
It achieves this through a sophisticated Filter-and-Weight algorithm.

Finally, nova-conductor acts as a security guard for the database.
In a cloud environment, compute nodes are considered "untrusted" because they run user workloads that could potentially be malicious.
If an attacker were to escape a VM and gain control of the compute node, we must ensure they cannot corrupt the entire cloud database.
Therefore, compute nodes are not allowed to write to the database directly and hold no database credentials.
Instead, they send a message over the queue to the Conductor requesting an update, and the Conductor performs the write operation only after validating the request.
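
The proxy pattern described above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not Nova's code: the Conductor class, the ALLOWED_FIELDS set, and the in-memory db dictionary are hypothetical names chosen for illustration.

```python
# Toy model of the conductor pattern: compute nodes never touch the
# database; they ask a trusted service to apply a validated update.
# All names here are hypothetical, not Nova's real classes.

ALLOWED_FIELDS = {"vm_state", "power_state"}


class Conductor:
    def __init__(self, db):
        self.db = db  # only the conductor holds a database handle

    def instance_update(self, requesting_host, instance_id, changes):
        record = self.db[instance_id]
        # A node may only update instances that are placed on it.
        if record["host"] != requesting_host:
            raise PermissionError("instance is not on the requesting host")
        # Reject any field a compute node has no business changing.
        illegal = set(changes) - ALLOWED_FIELDS
        if illegal:
            raise ValueError(f"fields not allowed: {sorted(illegal)}")
        record.update(changes)
        return dict(record)


db = {"vm-1": {"host": "compute-01", "vm_state": "building",
               "project": "nebula_prod"}}
conductor = Conductor(db)
updated = conductor.instance_update("compute-01", "vm-1",
                                    {"vm_state": "active"})
print(updated["vm_state"])
```

A compromised node that tries to update another host's instance, or to rewrite a field such as project, gets an exception instead of a database write.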

2.2 The Node Components (Data Plane)

On every hypervisor server, the nova-compute service acts as the worker.
It continually listens for instructions from the message queue (RabbitMQ).
When it receives a command, such as "Run Instance," it does not execute it blindly; it follows a rigorous process to ensure the VM is built correctly on the physical hardware.

The Driver Layer (Libvirt)

nova-compute is designed to be hypervisor-agnostic.
It does not talk to the hypervisor directly; instead, it uses a pluggable driver.
In our Linux environment, it uses the Libvirt driver.
When you ask for a VM, nova-compute translates your request into a Libvirt XML document (a precise recipe describing the VM's CPU, RAM, and devices) and passes it to the Libvirt daemon, which ultimately spawns the QEMU/KVM process.
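
To make the "recipe" idea concrete, here is a heavily simplified sketch that builds a Libvirt-style domain document with Python's standard library. The function name build_domain_xml and the disk path are hypothetical, and the XML that Nova really generates contains many more elements.

```python
import xml.etree.ElementTree as ET


def build_domain_xml(name, vcpus, ram_mb, disk_path):
    """Return a minimal Libvirt-style domain definition as a string."""
    domain = ET.Element("domain", type="kvm")
    ET.SubElement(domain, "name").text = name
    ET.SubElement(domain, "memory", unit="MiB").text = str(ram_mb)
    ET.SubElement(domain, "vcpu").text = str(vcpus)

    os_el = ET.SubElement(domain, "os")
    ET.SubElement(os_el, "type", arch="x86_64").text = "hvm"

    # One virtio disk backed by a file on the compute node.
    devices = ET.SubElement(domain, "devices")
    disk = ET.SubElement(devices, "disk", type="file", device="disk")
    ET.SubElement(disk, "source", file=disk_path)
    ET.SubElement(disk, "target", dev="vda", bus="virtio")

    return ET.tostring(domain, encoding="unicode")


# The path below is a made-up example, not a path from the course lab.
xml_text = build_domain_xml("nebula_web_01", vcpus=1, ram_mb=512,
                            disk_path="/var/lib/nova/instances/example/disk")
print(xml_text)
```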

The Resource Tracker

Beyond creating VMs, nova-compute is responsible for auditing the physical server.
It runs a periodic task (typically every 60 seconds) called the Resource Tracker.
This task scans the available RAM, CPU cores, and Disk space on the host and compares them against the reserved resources.
It then reports this "Inventory" back to the control plane (in current releases, through the Placement service).
This ensures that the Scheduler always possesses an accurate, up-to-date map of the cloud's capacity, preventing it from sending a VM to a host that is already full.
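
The arithmetic behind that audit is simple enough to sketch. The example below is an illustration only: HostInventory and free_resources are hypothetical names, and the numbers are invented.

```python
from dataclasses import dataclass


@dataclass
class HostInventory:
    total_ram_mb: int
    reserved_ram_mb: int  # held back for the host operating system
    total_vcpus: int
    total_disk_gb: int


def free_resources(inv, instances):
    """Subtract reserved and already-allocated resources from the totals."""
    used_ram = sum(i["ram_mb"] for i in instances)
    used_vcpus = sum(i["vcpus"] for i in instances)
    used_disk = sum(i["disk_gb"] for i in instances)
    return {
        "free_ram_mb": inv.total_ram_mb - inv.reserved_ram_mb - used_ram,
        "free_vcpus": inv.total_vcpus - used_vcpus,
        "free_disk_gb": inv.total_disk_gb - used_disk,
    }


inv = HostInventory(total_ram_mb=32768, reserved_ram_mb=512,
                    total_vcpus=16, total_disk_gb=500)
running = [
    {"ram_mb": 2048, "vcpus": 2, "disk_gb": 20},
    {"ram_mb": 4096, "vcpus": 4, "disk_gb": 40},
]
report = free_resources(inv, running)
print(report)
```

In the real service this calculation would be repeated on every periodic run, so the figures the Scheduler sees are never more than one interval old.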

Section 2 Checkpoint

Summary:

Control Plane: nova-api (Entry), nova-scheduler (Decision), nova-conductor (DB Guard).
Data Plane: nova-compute (Hypervisor Worker).
Security: Compute nodes cannot talk directly to the DB; they go through Conductor.

Reflection:

Why is nova-api considered "stateless"?
Why do we need a "Conductor" to protect the database?

3. The Scheduling Algorithm (The Decision Process)

3.1 Pass 1: Filtering (Qualifying)

The first pass is designed to remove any hosts that are incapable of running the instance. It works like a sieve.

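RamFilter: Checks if the host has enough free RAM to satisfy the requested flavor. (Recent Nova releases perform this capacity check in the Placement service rather than in a scheduler filter.)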
ComputeFilter: Ensures the host service is actually alive and reporting.
AvailabilityZoneFilter: Ensures the VM lands in the requested physical location.
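ImagePropertiesFilter: Checks requirements declared in the image's properties, such as CPU architecture or hypervisor type. (Requests for a passthrough device such as a GPU are normally matched by PciPassthroughFilter.)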
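
The sieve can be modeled as a list of yes/no functions that every host must pass. This is a minimal sketch, not Nova's implementation: the host dictionaries, field names, and function names are assumptions made for the example.

```python
def compute_filter(host, request):
    # Host service must be alive and reporting.
    return host["service_up"]


def availability_zone_filter(host, request):
    wanted = request.get("availability_zone")
    return wanted is None or host["zone"] == wanted


def ram_filter(host, request):
    return host["free_ram_mb"] >= request["ram_mb"]


FILTERS = [compute_filter, availability_zone_filter, ram_filter]


def filter_hosts(hosts, request):
    """Keep only the hosts that pass every filter."""
    return [h for h in hosts if all(f(h, request) for f in FILTERS)]


hosts = [
    {"name": "compute-01", "free_ram_mb": 8192, "service_up": True, "zone": "nova"},
    {"name": "compute-02", "free_ram_mb": 256, "service_up": True, "zone": "nova"},
    {"name": "compute-03", "free_ram_mb": 16384, "service_up": False, "zone": "nova"},
    {"name": "compute-04", "free_ram_mb": 4096, "service_up": True, "zone": "az2"},
]
request = {"ram_mb": 512, "availability_zone": "nova"}
survivors = [h["name"] for h in filter_hosts(hosts, request)]
print(survivors)
```

Here compute-02 is dropped for lack of RAM, compute-03 because its service is down, and compute-04 because it sits in the wrong zone, leaving only compute-01.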

3.2 Pass 2: Weighting (Ranking)

Once the invalid hosts are removed, the second pass ranks the remaining candidates to find the "best" fit. The default RAMWeigher scores each host by its free RAM, scaled by the ram_weight_multiplier option.

Stacking Strategy: Fills up one server completely before moving to the next (negative multiplier). This saves power but creates hotspots.
Spreading Strategy (Default): Places the VM on the emptiest possible server (positive multiplier) to maximize performance and minimize the "noisy neighbor" effect.
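
The two strategies differ only in the sign of the multiplier, which the following sketch illustrates. It is a simplification: the real scheduler normalizes weights and combines several weighers, and pick_host is a hypothetical helper name.

```python
def pick_host(hosts, ram_weight_multiplier=1.0):
    """Return the host with the highest weight.

    A positive multiplier rewards free RAM (spreading);
    a negative multiplier rewards the fullest host (stacking).
    """
    return max(hosts, key=lambda h: ram_weight_multiplier * h["free_ram_mb"])


candidates = [
    {"name": "compute-01", "free_ram_mb": 2048},
    {"name": "compute-02", "free_ram_mb": 16384},
    {"name": "compute-03", "free_ram_mb": 8192},
]

spread = pick_host(candidates)["name"]
stack = pick_host(candidates, ram_weight_multiplier=-1.0)["name"]
print("spreading picks:", spread)
print("stacking picks:", stack)
```

With these sample hosts, spreading selects compute-02 (the emptiest) and stacking selects compute-01 (the fullest that still qualified).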

Section 3 Checkpoint

Summary:

Filtering: Removes invalid hosts (e.g., Not enough RAM).
Weighting: Ranks valid hosts (e.g., Emptiest first).
Goal: Select the single best host (Candidate) for the VM.

Reflection:

What is the difference between "Stacking" and "Spreading" strategies?
Which filter ensures a VM lands on a host with a GPU?

4. The Instance Lifecycle (State Machine)

Section 4 Checkpoint

Summary:

BUILD: Scheduling and Networking in progress.
ACTIVE: VM is running on the Hypervisor.
ERROR: Something went wrong (Check logs).
SHELVED: VM is stopped and its disk is saved as a snapshot image so the hypervisor's resources can be freed.
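
The four states above can be read as a small state machine. The sketch below covers only these states and a plausible set of transitions between them; Nova tracks many more states, and the TRANSITIONS table and advance function are assumptions made for illustration.

```python
# Allowed moves between the four states listed in the summary.
TRANSITIONS = {
    "BUILD": {"ACTIVE", "ERROR"},
    "ACTIVE": {"SHELVED", "ERROR"},
    "SHELVED": {"ACTIVE"},   # unshelving brings the VM back
    "ERROR": set(),          # needs operator attention in this model
}


def advance(current, new):
    """Move to a new state, refusing transitions the table does not allow."""
    if new not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current} -> {new}")
    return new


state = "BUILD"
for nxt in ("ACTIVE", "SHELVED", "ACTIVE"):
    state = advance(state, nxt)
print(state)
```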

Reflection:

What happens during the "Spawning" phase?
How does SHELVED differ from a simple Shutdown?

5. Operations Cookbook (CLI): Launching Nebula Inc.

In Week 8, we established the digital foundation for Nebula Inc.
We created the Project (nebula_prod), hired the User (nebula_admin), and wired the Office Network (nebula_net).
However, the data center currently sits empty.
To bring the company online, we must now define the virtual hardware standards (Flavors), issue security credentials (Keys & Groups), and finally press the "Power On" button for their first Web Server.

Below are the commands to execute this activation.

5.1 Defining Flavors (Capacity)

In a physical data center, you buy specific server models. In OpenStack, we abstract this capacity into what the platform calls Flavors (Instance Types in AWS, VM Sizes in Azure). A Flavor is a virtual hardware template that defines the resource limits (vCPU, RAM, Disk).

The Provider vs. Consumer Role:

Public Cloud (AWS/Azure): You are a Consumer. You cannot create new sizes; you can only Select from the menu the provider publishes (e.g., t2.micro or m5.large on AWS).
Private Cloud (OpenStack): You are the Provider. It is your job to Create the menu that your users will select from.

1. Listing Existing Flavors (The Menu)

Before creating new ones, check what is available.

2. Creating a Custom Flavor (The Chef)

For "Nebula Inc.", we need a custom "Micro" size for cheap testing. We will name it m1.nebula_micro.
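
Both steps can be sketched as OpenStack client commands, written here as Python argument lists so they can be printed for review or handed to subprocess.run. The resource sizes (1 vCPU, 512 MB RAM, 1 GB disk) are assumed example values; only the flavor name comes from the text above.

```python
import shlex

# Step 1: show the flavors that already exist.
list_flavors = ["openstack", "flavor", "list"]

# Step 2: add the custom flavor. The sizes are assumed example values,
# not figures taken from the course material.
create_flavor = [
    "openstack", "flavor", "create",
    "--vcpus", "1",
    "--ram", "512",   # megabytes
    "--disk", "1",    # gigabytes
    "m1.nebula_micro",
]

for command in (list_flavors, create_flavor):
    # Dry run: print the command. To execute it on a host with the
    # client installed and credentials loaded, use
    # subprocess.run(command, check=True) instead.
    print(shlex.join(command))
```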

Naming Convention Decoding:

m1: Generation/Class. (e.g., "m" for General Purpose, "1" for 1st Generation). This mirrors AWS naming (e.g., t2.micro = Burstable, 2nd Gen).
nebula: Family. Identifies this as a custom flavor for our organization.
micro: Size. Indicates relative capacity (Micro < Small < Medium).

Result: We have added a new item to the menu. Users can now select m1.nebula_micro when launching instances.

5.2 Securing Access (Keys & Groups)

Security in the cloud is a two-layered approach. First, we must secure Identity (proving who you are) using Keypairs. Second, we must secure the Network (controlling traffic flow) using Security Groups. You cannot access a VM unless both of these layers are correctly configured.

5.2.1 Keypairs (Login Access)

Unlike traditional servers where you set a root password, Cloud images (AWS, Azure, OpenStack) verify identity using Asymmetric Cryptography.
This mechanism leverages a "Lock and Key" relationship to secure access.
The Public Key acts as the "Lock"; you upload this to the cloud, and Nova injects it into the VM's .ssh/authorized_keys file during boot.
It is safe to share and visible to anyone.
The Private Key acts as the unique "Key"; you keep this securely on your laptop and must never share it.

Generating a Keypair
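
A minimal sketch of this step, again as a Python argument list. The keypair name nebula_key is an assumption inferred from the nebula_key.pem file name; the lock_down helper is a hypothetical name for the chmod step.

```python
import os
import shlex
import stat

# Ask Nova to generate the pair and save the private half locally.
# The keypair name "nebula_key" is assumed for this example.
create_key = [
    "openstack", "keypair", "create",
    "--private-key", "nebula_key.pem",
    "nebula_key",
]
print(shlex.join(create_key))


def lock_down(path):
    """Equivalent of `chmod 600 path`: owner may read and write, nobody else."""
    os.chmod(path, stat.S_IRUSR | stat.S_IWUSR)


# After the create command has written the file:
#   lock_down("nebula_key.pem")
```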

Explanation: This command generates the pair. It stores the Public Key in the Nova Database and writes the Private Key to nebula_key.pem on your disk. The chmod is critical; SSH will refuse to use a key if the file permissions are too open.

5.2.2 Security Groups (The Virtual Firewall)

In traditional networking, firewalls are physical appliances sitting at the edge of the network. In Cloud Computing, we use Security Groups. A Security Group is a virtual firewall that is applied directly to the network interface (vNIC) of an instance, regardless of where it runs in the data center.

Figure 3: Security Group Architecture - How the Open vSwitch Agent filters packets on the Hypervisor before they reach the VM

Concept (General Cloud)

Security groups operate on specific principles:

Stateful: If you allow a request out (e.g., download update), the return traffic is automatically allowed in.
Allow-List: The default policy is "Implicit Deny". All inbound traffic is blocked until you explicitly allow it.
Dynamic: Rules are applied immediately to all running instances without rebooting.
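
The first two principles can be demonstrated with a toy model. This sketch is not how Neutron implements filtering: the SecurityGroup class and its methods are hypothetical, outbound traffic is assumed to be permitted, and the IP addresses come from documentation ranges.

```python
class SecurityGroup:
    """Toy stateful allow-list firewall for a single instance."""

    def __init__(self):
        self.ingress_rules = set()  # (protocol, destination port) pairs
        self.tracked = set()        # connections the instance opened

    def allow_ingress(self, protocol, port):
        self.ingress_rules.add((protocol, port))

    def outbound(self, protocol, remote_ip, remote_port):
        # Remember the flow so that its replies are accepted later.
        self.tracked.add((protocol, remote_ip, remote_port))

    def permits_inbound(self, protocol, src_ip, src_port, dst_port):
        # Stateful: a reply to a connection we opened is always allowed.
        if (protocol, src_ip, src_port) in self.tracked:
            return True
        # Allow-list: anything without a matching rule is denied.
        return (protocol, dst_port) in self.ingress_rules


sg = SecurityGroup()
before_rule = sg.permits_inbound("tcp", "198.51.100.7", 50000, 22)

sg.allow_ingress("tcp", 22)
after_rule = sg.permits_inbound("tcp", "198.51.100.7", 50000, 22)

sg.outbound("tcp", "203.0.113.5", 443)  # e.g. downloading an update
reply = sg.permits_inbound("tcp", "203.0.113.5", 443, 51515)

print(before_rule, after_rule, reply)
```

The SSH attempt is rejected until a rule exists, while the reply to the outbound HTTPS request is accepted even though no inbound rule mentions it.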

OpenStack Implementation

When you create a rule, Neutron communicates with the Open vSwitch (OVS) agent on the Compute Node.
It translates your high-level rule (e.g., "Allow Port 80") into low-level OVS Flow Tables or iptables chains on the physical hypervisor.
This ensures malicious traffic is dropped at the hypervisor's virtual switch before it ever reaches your VM, providing a robust first line of defense.

CLI: Configuring the Firewall

We must explicitly open ports for SSH and Web access.
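
A sketch of the commands involved, as Python argument lists. The group name nebula_sg is hypothetical, and ports 22 (SSH) and 80 (HTTP) are the conventional choices for the two services named above.

```python
import shlex

GROUP = "nebula_sg"  # hypothetical group name for this example

# Create the group, then add one ingress rule per port.
commands = [["openstack", "security", "group", "create", GROUP]]
for port in (22, 80):  # SSH and HTTP
    commands.append([
        "openstack", "security", "group", "rule", "create",
        "--protocol", "tcp",
        "--dst-port", str(port),
        GROUP,
    ])

for command in commands:
    # Dry run; swap in subprocess.run(command, check=True) to execute.
    print(shlex.join(command))
```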

Result: The OVS Agent on the compute node intercepts traffic to nebula_web_01 and filters it against these rules.

5.3 Launching Instances

The server create command brings together the Flavor, Image, Network, Key, and Security Group to instantiate a VM.

Boot Command
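
A sketch of the boot command as a Python argument list. The flavor, network, and server names come from this cookbook; the image name (cirros), keypair name (nebula_key), and security group name (nebula_sg) are assumptions and should be replaced with the names used in your environment.

```python
import shlex

boot = [
    "openstack", "server", "create",
    "--flavor", "m1.nebula_micro",
    "--image", "cirros",               # assumed image name
    "--network", "nebula_net",
    "--key-name", "nebula_key",        # assumed keypair name
    "--security-group", "nebula_sg",   # assumed group name
    "nebula_web_01",
]

# Dry run; use subprocess.run(boot, check=True) to launch for real.
print(shlex.join(boot))
```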

Explanation:

--flavor: Defines the size.
--image: Defines the software (OS).
--network: Defines the wiring.

Result: Triggers the entire scheduling and build process seen in Section 4.

5.4 Day 2 Operations (Debugging & Access)

Floating IPs (Public Access)

To access the VM from the internet, map a public IP to it.
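
The mapping takes two commands: allocate an address from the external network, then attach it to the server. In this sketch the external network name (public) is hypothetical and the address is a placeholder from a documentation range; use the address the first command returns.

```python
import shlex

EXTERNAL_NET = "public"        # hypothetical external network name
FLOATING_IP = "203.0.113.10"   # placeholder; use the allocated address

allocate = ["openstack", "floating", "ip", "create", EXTERNAL_NET]
attach = ["openstack", "server", "add", "floating", "ip",
          "nebula_web_01", FLOATING_IP]

for command in (allocate, attach):
    print(shlex.join(command))
```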

Console Logs (Troubleshooting)

If a VM fails to become reachable (e.g., no network), check the boot logs.
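
A sketch of the log request as a Python argument list. Limiting the output to the last 50 lines is an arbitrary choice for the example.

```python
import shlex

# Fetch the tail of the instance's console output.
console_log = ["openstack", "console", "log", "show",
               "--lines", "50", "nebula_web_01"]

print(shlex.join(console_log))
```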

Explanation: Retrieves the instance's console output (kernel boot messages and cloud-init output) directly from the hypervisor. Use this to find kernel panics or DHCP failures.

Section 5 Checkpoint

Summary:

Flavor: Virtual hardware template (CPU/RAM). Provider defines, Consumer selects.
Security Group: Stateful virtual firewall. "Implicit Deny" by default.
Keypairs: SSH Keys for identity. Private Key never leaves your laptop.
Floating IP: Assigns a public address to reach the VM from outside.

Reflection:

Why must we use chmod 600 on the private key?
How does an "Allow-List" firewall differ from a traditional "Block-List"?

6. Industry Comparison: The "Polyglot" Cloud Engineer

6.1 Concept Mapping

Concept	OpenStack Term	AWS Term	Azure Term
Compute Provider	Nova	EC2 (Service)	Azure Compute (`Microsoft.Compute`)
Size Template	Flavor (e.g., `m1.small`)	Instance Type (e.g., `t2.micro`)	VM Size (e.g., `Standard_B1s`)
Firewall	Security Group	Security Group	Network Security Group (NSG)
Login Key	Keypair	Key Pair	SSH Key
Default User	`cirros`, `ubuntu`	`ec2-user`, `ubuntu`	`azureuser`

6.2 CLI Rosetta Stone

Below is the exact same "Launch Instance" workflow translated into the three major languages of the cloud.

1. Choose a "Flavor" (Size)

OpenStack: openstack flavor list (Selects m1.small)
AWS: aws ec2 describe-instance-types (Selects t2.micro)
Azure: az vm list-sizes --location <region> (Selects Standard_B1s)

2. Create a Firewall

OpenStack: openstack security group create web-sg
AWS: aws ec2 create-security-group --group-name web-sg --description "Web access"
Azure: az network nsg create --resource-group <resource-group> --name web-nsg

3. Launch the Instance (The "Hello World" of Cloud)

Notice how similar the flags are across all three platforms.

OpenStack (Nova)

AWS (EC2)

Azure (Compute)
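
The three launch commands can be compared side by side in the sketch below, written as Python argument lists. Names in capitals (IMAGE_NAME, AMI_ID, KEY_NAME, PUBLIC_KEY_FILE, RESOURCE_GROUP) are placeholders and the server name web-01 is hypothetical; the flags are ordered so that matching concepts line up.

```python
import shlex

# Placeholders in capitals are hypothetical; substitute real values.
launch = {
    "OpenStack (Nova)": [
        "openstack", "server", "create",
        "--flavor", "m1.small",
        "--image", "IMAGE_NAME",
        "--key-name", "KEY_NAME",
        "--security-group", "web-sg",
        "web-01",
    ],
    "AWS (EC2)": [
        "aws", "ec2", "run-instances",
        "--instance-type", "t2.micro",
        "--image-id", "AMI_ID",
        "--key-name", "KEY_NAME",
        "--security-groups", "web-sg",
    ],
    "Azure (Compute)": [
        "az", "vm", "create",
        "--size", "Standard_B1s",
        "--image", "IMAGE_NAME",
        "--ssh-key-values", "PUBLIC_KEY_FILE",
        "--nsg", "web-nsg",
        "--resource-group", "RESOURCE_GROUP",
        "--name", "web-01",
    ],
}

for platform, command in launch.items():
    print(f"{platform}: {shlex.join(command)}")
```

Each command names a size, an image, a login key, and a firewall; Azure additionally requires a resource group to hold the VM.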

Section 6 Checkpoint

Summary:

Concepts are universal; only terms change (Flavor -> Instance Type).
Nova = AWS EC2 = Azure Compute.
Security Group is the standard term across OpenStack and AWS.

Reflection:

Why is it valuable to learn the underlying concept rather than just the tool command?
How does "Infrastructure as Code" rely on these standardized CLI commands?

7. Summary and Next Steps

Preparing for Week 10

Next week, we tackle Storage and Persistence. A web server is useless if it loses all its data when its instance is deleted or rebuilt. We will explore Cinder (Block Storage) to give our instances persistent hard drives.

Checklist:

Ensure you can launch an instance from the CLI without looking at the manual.
Verify you can SSH into your instance using your keypair.
Review the "Instance Lifecycle" states (Build -> Active).

8. Additional Resources

9. Lab Exercises

Summary

Review the key concepts covered in this week's material

Questions?