Chapter 4: Testing, Quality, and CI/CD
“Testing shows the presence, not the absence of bugs.” — Edsger W. Dijkstra
Learning Objectives
By the end of this chapter, you will be able to:
- Explain the different levels of software testing and when to apply each.
- Write unit tests and integration tests in Python using pytest.
- Measure and interpret code coverage and understand its limitations.
- Configure a CI/CD pipeline using GitHub Actions.
- Apply static analysis and code review techniques to catch defects early.
- Critically evaluate AI-generated tests and understand why AI cannot replace a thoughtful testing strategy.
4.1 Why Testing Matters
Software testing is the process of executing software with the intent of finding defects. It is not an optional step at the end of development — it is a discipline that runs throughout the entire software development lifecycle.
Testing serves several purposes:
- Defect detection: Finding bugs before they reach users
- Regression prevention: Ensuring that new changes do not break existing functionality
- Design feedback: Tests that are hard to write often indicate design problems
- Documentation: A well-named test suite describes exactly what a system does
- Confidence: A passing test suite gives the team confidence to make changes
The question is not whether to test, but how to test effectively given limited time and resources.
4.2 The Testing Pyramid
The testing pyramid (Cohn, 2009) describes the ideal distribution of test types:
  ┌───────────┐
  │   E2E /   │        Few, slow, fragile: test critical paths only
  │ UI Tests  │
 ┌┴───────────┴┐
 │ Integration │       Some: test component interactions
 │    Tests    │
┌┴─────────────┴┐
│  Unit Tests   │      Many: fast, isolated, precise
└───────────────┘
Unit tests are the foundation: fast, isolated, numerous. They test individual functions or classes in isolation.
Integration tests verify that components work correctly together — services calling repositories, API handlers interacting with business logic.
End-to-end (E2E) tests exercise the system as a whole, simulating real user interactions. They are slow, brittle, and expensive to maintain — use them sparingly, for critical user journeys only.
A rough rule of thumb is a 1:10:100 ratio: for every E2E test, write about 10 integration tests and about 100 unit tests. The exact ratio varies by system, but the principle holds: favour fast, isolated tests over slow, coupled ones.
4.3 Black-Box and White-Box Testing
Testing approaches can be categorised by how much knowledge of the internal implementation the tester uses.
4.3.1 Black-Box Testing
In black-box testing, the tester has no knowledge of the internal implementation. Tests are derived entirely from the specification — inputs are provided and outputs are verified against expected behaviour.
Advantages: Tests are specification-driven; a new implementation can be tested without modifying the tests; tests reflect user-visible behaviour.
Techniques:
- Equivalence partitioning: Divide inputs into classes that the system should handle identically. Test one representative from each class.
- Boundary value analysis: Test at the boundaries of valid input ranges. Bugs cluster at boundaries (off-by-one errors, empty inputs, maximum values).
- Decision table testing: For systems with complex conditional logic, enumerate all combinations of conditions and expected outcomes.
Example — equivalence partitioning for task priority:
The system accepts priority values 1–4. Partitions:
- Valid: 1, 2, 3, 4
- Below range: 0, -1
- Above range: 5, 100
- Non-integer: “high”, 2.5, None
Test one value from each partition: priority=2 (valid), priority=0 (below), priority=5 (above), priority="high" (non-integer).
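The partitions above, together with the boundary values on either side of the 1–4 range, can be written down as data and checked mechanically. The following is a minimal sketch: validate_priority is a hypothetical stand-in for the priority rule, not the project's actual code, and the case tables are only one reasonable way to organise the checks.

```python
# Illustrative sketch: validate_priority is a hypothetical stand-in for the
# chapter's priority rule (integers 1 to 4), not the project's real code.


def validate_priority(value: object) -> bool:
    """Return True only for integers in the valid range 1..4."""
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(value, bool) or not isinstance(value, int):
        return False
    return 1 <= value <= 4


# One representative per equivalence partition: (input, expected result).
PARTITION_CASES = {
    "valid": (2, True),
    "below range": (0, False),
    "above range": (5, False),
    "non-integer": ("high", False),
}

# Boundary value analysis: the values on and just outside each edge of 1..4.
BOUNDARY_CASES = [(0, False), (1, True), (4, True), (5, False)]


def run_cases() -> list[str]:
    """Check every case and return a description of each failure."""
    failures = []
    for name, (value, expected) in PARTITION_CASES.items():
        if validate_priority(value) != expected:
            failures.append(f"partition {name!r}: {value!r}")
    for value, expected in BOUNDARY_CASES:
        if validate_priority(value) != expected:
            failures.append(f"boundary: {value!r}")
    return failures


if __name__ == "__main__":
    print(run_cases() or "all partition and boundary cases pass")
```

In a pytest suite, the same case tables would typically be fed to pytest.mark.parametrize rather than looped over by hand.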
4.3.2 White-Box Testing
In white-box testing (also called structural or glass-box testing), the tester has full knowledge of the internal implementation. Tests are derived from the source code, with the goal of exercising specific paths, branches, and conditions.
Techniques:
- Statement coverage: Every statement is executed by at least one test
- Branch coverage: Every branch (if/else, loop) is executed in both directions
- Path coverage: Every possible path through the code is executed (often infeasible for complex code)
White-box testing is particularly valuable for finding dead code, unreachable branches, and logic errors that black-box tests might miss.
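The difference between statement coverage and branch coverage is easiest to see on a function with an if that has no else. In the sketch below, apply_discount is a hypothetical example (prices are in whole cents to avoid floating-point noise); it is not part of the course project.

```python
def apply_discount(price_cents: int, is_member: bool) -> int:
    """Hypothetical pricing rule: members get 10% off."""
    discount_percent = 0
    if is_member:
        discount_percent = 10
    return price_cents * (100 - discount_percent) // 100


# This single check executes every statement in the function,
# so statement coverage is already 100%.
assert apply_discount(1000, is_member=True) == 900

# The False direction of the `if` has not been taken yet.
# Branch coverage requires a second check that skips the body:
assert apply_discount(1000, is_member=False) == 1000
```

With pytest-cov, passing --cov-branch alongside --cov reports branch coverage in addition to statement coverage.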
4.4 Unit Testing with pytest
Unit tests verify the behaviour of a single unit of code — typically a function or method — in isolation from its dependencies.
4.4.1 Writing Your First Tests
# src/task_service.py
from dataclasses import dataclass
from datetime import date
from uuid import UUID, uuid4


class TaskValidationError(ValueError):
    pass


@dataclass
class Task:
    id: UUID
    title: str
    priority: int  # 1–4
    due_date: date | None = None
    status: str = "open"


def create_task(title: str, priority: int, due_date: date | None = None) -> Task:
    """Create a new task with validation."""
    if not title or not title.strip():
        raise TaskValidationError("Title cannot be empty")
    if priority not in range(1, 5):
        raise TaskValidationError(f"Priority must be 1–4, got {priority}")
    if due_date and due_date < date.today():
        raise TaskValidationError("Due date cannot be in the past")
    return Task(id=uuid4(), title=title.strip(), priority=priority, due_date=due_date)

# tests/test_task_service.py
import pytest
from datetime import date, timedelta
from src.task_service import create_task, TaskValidationError


class TestCreateTask:
    def test_creates_task_with_valid_inputs(self) -> None:
        task = create_task("Write tests", priority=2)
        assert task.title == "Write tests"
        assert task.priority == 2
        assert task.status == "open"
        assert task.id is not None

    def test_strips_whitespace_from_title(self) -> None:
        task = create_task(" Write tests ", priority=1)
        assert task.title == "Write tests"

    def test_raises_for_empty_title(self) -> None:
        with pytest.raises(TaskValidationError, match="Title cannot be empty"):
            create_task("", priority=1)

    def test_raises_for_whitespace_only_title(self) -> None:
        with pytest.raises(TaskValidationError):
            create_task(" ", priority=1)

    @pytest.mark.parametrize("priority", [0, -1, 5, 100])
    def test_raises_for_invalid_priority(self, priority: int) -> None:
        with pytest.raises(TaskValidationError, match="Priority must be 1–4"):
            create_task("Valid title", priority=priority)

    @pytest.mark.parametrize("priority", [1, 2, 3, 4])
    def test_accepts_valid_priorities(self, priority: int) -> None:
        task = create_task("Valid title", priority=priority)
        assert task.priority == priority

    def test_raises_for_past_due_date(self) -> None:
        yesterday = date.today() - timedelta(days=1)
        with pytest.raises(TaskValidationError, match="Due date cannot be in the past"):
            create_task("Valid title", priority=1, due_date=yesterday)

    def test_accepts_future_due_date(self) -> None:
        tomorrow = date.today() + timedelta(days=1)
        task = create_task("Valid title", priority=1, due_date=tomorrow)
        assert task.due_date == tomorrow

    def test_accepts_no_due_date(self) -> None:
        task = create_task("Valid title", priority=1)
        assert task.due_date is None

Run the tests:
pytest tests/test_task_service.py -v
4.4.2 Fixtures
Fixtures are reusable setup functions that provide test dependencies. They replace repetitive setup code and enable dependency injection in tests.
# tests/conftest.py
import pytest
from uuid import uuid4
from datetime import date, timedelta
from src.task_service import Task
from src.repository import InMemoryTaskRepository


@pytest.fixture
def repository() -> InMemoryTaskRepository:
    return InMemoryTaskRepository()


@pytest.fixture
def sample_task() -> Task:
    return Task(
        id=uuid4(),
        title="Sample task",
        priority=2,
        due_date=date.today() + timedelta(days=7),
    )

# tests/test_repository.py
from uuid import uuid4
from src.task_service import Task
from src.repository import InMemoryTaskRepository


def test_save_and_retrieve_task(
    repository: InMemoryTaskRepository, sample_task: Task
) -> None:
    repository.save(sample_task)
    retrieved = repository.find_by_id(sample_task.id)
    assert retrieved == sample_task


def test_returns_none_for_missing_task(repository: InMemoryTaskRepository) -> None:
    result = repository.find_by_id(uuid4())
    assert result is None


def test_delete_removes_task(
    repository: InMemoryTaskRepository, sample_task: Task
) -> None:
    repository.save(sample_task)
    repository.delete(sample_task.id)
    assert repository.find_by_id(sample_task.id) is None

4.4.3 Mocking
When a unit under test depends on external systems (databases, email services, APIs), mocking replaces those dependencies with controlled substitutes.
# tests/test_assignment_service.py
from unittest.mock import MagicMock
from uuid import uuid4
from src.assignment_service import AssignmentService
from src.task_service import Task


def test_assign_task_sends_notification() -> None:
    # Arrange: replace the repository and notifier with mocks
    mock_repo = MagicMock()
    mock_notifier = MagicMock()
    service = AssignmentService(repo=mock_repo, notifier=mock_notifier)
    task_id = uuid4()
    mock_repo.find_by_id.return_value = Task(
        id=task_id, title="Test task", priority=1
    )

    # Act
    service.assign(task_id=task_id, assignee_email="alice@example.com")

    # Assert: the task was saved and exactly one notification was sent
    mock_repo.save.assert_called_once()
    mock_notifier.notify.assert_called_once_with(
        recipient="alice@example.com",
        subject="You have been assigned a task",
    )

4.5 Code Coverage
Code coverage measures how much of your source code is executed by your test suite. It is a useful indicator of untested areas, but it is not a measure of test quality.
pip install pytest-cov
pytest tests/ --cov=src --cov-report=term-missing
Sample output:
Name                        Stmts   Miss  Cover   Missing
---------------------------------------------------------
src/task_service.py            18      2    89%   34-35
src/repository.py              22      0   100%
src/assignment_service.py      15      3    80%   28, 41-42
---------------------------------------------------------
TOTAL                          55      5    91%
The Missing column shows which lines are not covered — useful for targeting additional tests.
Coverage targets: 80% is a common minimum threshold for production code. 100% coverage is neither necessary nor sufficient — you can have 100% coverage with tests that make no meaningful assertions.
What coverage cannot tell you:
- Whether the tests assert the right things
- Whether edge cases are tested (a line can be covered by a single happy-path test)
- Whether the system behaves correctly at the integration level
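The gap between coverage and test quality can be demonstrated in a few lines. In this sketch, classify_priority is a hypothetical function with a deliberate bug, and the "spec" it violates (priority 2 should map to "high") is assumed for the sake of the example. Both tests execute every line of the function, but only one of them notices the bug.

```python
from typing import Callable


def classify_priority(priority: int) -> str:
    """Hypothetical helper. Assumed spec: 1 -> urgent, 2 -> high, else low."""
    if priority == 1:
        return "urgent"
    if priority == 2:
        return "normal"  # deliberate bug: the assumed spec says "high"
    return "low"


def weak_test() -> None:
    # Executes every line of classify_priority, so line coverage is 100%,
    # yet the assertion would hold for almost any implementation.
    for priority in (1, 2, 3):
        assert classify_priority(priority) is not None


def strong_test() -> None:
    # Same coverage, but each assertion pins down a specific expected value.
    assert classify_priority(1) == "urgent"
    assert classify_priority(2) == "high"
    assert classify_priority(3) == "low"


def passes(test: Callable[[], None]) -> bool:
    """Run a test function and report whether it passed."""
    try:
        test()
    except AssertionError:
        return False
    return True


if __name__ == "__main__":
    print("weak test passes:", passes(weak_test))
    print("strong test passes:", passes(strong_test))
```

The coverage report is identical for both tests; only the assertions distinguish a suite that protects against the bug from one that does not.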
4.6 Code Quality and Static Analysis
Beyond testing, several automated tools catch quality issues before code review.
4.6.1 Linting with Ruff
Ruff (introduced in Chapter 1) enforces style rules and catches common programming errors:
ruff check src/
ruff format src/
Ruff covers most of the functionality of flake8, isort, and black in a single tool, and is significantly faster than any of them individually.
4.6.2 Type Checking with mypy
Type annotations in Python (since PEP 484, van Rossum et al., 2015) enable static analysis. mypy verifies that type annotations are consistent throughout the codebase, catching a class of bugs that tests can miss.
mypy src/ --strict
Common errors mypy catches:
- Passing None where a non-optional value is expected
- Calling a method that does not exist on a type
- Returning the wrong type from a function
- Missing return statements
4.6.3 Security Scanning with Bandit
Bandit (PyCQA, 2014) scans Python code for common security vulnerabilities:
pip install bandit
bandit -r src/
Bandit flags issues like SQL injection risks, hardcoded passwords, use of weak cryptographic algorithms, and unsafe deserialization. Security scanning is covered in depth in Chapter 9.
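To make the SQL injection case concrete, the sketch below contrasts a query built by string concatenation, which is the kind of pattern a scanner like Bandit is designed to flag, with a parameterised query. The tasks table and both function names are hypothetical, and sqlite3 is used only because it ships with Python.

```python
import sqlite3


def make_demo_db() -> sqlite3.Connection:
    """Build a throwaway in-memory database with two example rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE tasks (id INTEGER PRIMARY KEY, title TEXT)")
    conn.executemany(
        "INSERT INTO tasks (title) VALUES (?)",
        [("Write tests",), ("Ship release",)],
    )
    return conn


def find_tasks_unsafe(conn: sqlite3.Connection, title: str) -> list[tuple]:
    # Risky: user input is spliced directly into the SQL text.
    query = "SELECT id, title FROM tasks WHERE title = '" + title + "'"
    return conn.execute(query).fetchall()


def find_tasks(conn: sqlite3.Connection, title: str) -> list[tuple]:
    # Safer: the driver passes the value separately from the SQL text.
    return conn.execute(
        "SELECT id, title FROM tasks WHERE title = ?", (title,)
    ).fetchall()


if __name__ == "__main__":
    conn = make_demo_db()
    payload = "x' OR '1'='1"
    print("unsafe rows returned:", len(find_tasks_unsafe(conn, payload)))
    print("parameterised rows returned:", len(find_tasks(conn, payload)))
```

With the crafted input, the concatenated query matches every row in the table, while the parameterised query treats the whole input as a literal title and matches nothing.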
4.7 Code Review
Code review is the practice of having another developer read and evaluate your code before it is merged into the main branch. It is one of the most effective defect-detection techniques in software engineering (Fagan, 1976).
4.7.1 What to Look for in a Code Review
An effective reviewer checks:
- Correctness: Does the code do what it is supposed to do? Are there edge cases the author missed?
- Tests: Are there sufficient tests? Do they cover the important cases?
- Design: Does the change fit the existing architecture? Does it introduce unnecessary complexity?
- Security: Does the change introduce any security vulnerabilities?
- Readability: Can you understand the code without asking the author?
- Performance: Are there obvious performance issues (e.g., N+1 queries)?
4.7.2 Code Review Etiquette
Effective code review requires clear, respectful communication:
- Review the code, not the person: “This function is hard to follow” not “You wrote this poorly”
- Be specific: “Line 42: extracting this into a helper function would make it easier to test” not “this is messy”
- Distinguish must-fix from suggestions: prefix non-blocking suggestions with “nit:” or “optional:”
- Respond to all review comments, even if to say “agreed, fixed” or “disagree because…”
4.7.3 Automated Code Review
AI-powered tools (GitHub Copilot, CodeRabbit, Sourcery) can perform a first-pass review, catching mechanical issues before human reviewers see the code. These tools are most effective at:
- Identifying obvious bugs and null pointer issues
- Suggesting more idiomatic patterns
- Flagging inconsistency with the surrounding codebase
They are least effective at:
- Understanding business context and domain logic
- Evaluating architectural decisions
- Catching subtle security vulnerabilities that require domain knowledge
4.8 Continuous Integration and Continuous Delivery (CI/CD)
Continuous integration (CI) is the practice of merging all developer branches into the main branch frequently — at least daily — with each merge triggering an automated build and test run (Fowler, 2006).
Continuous delivery (CD) extends CI to ensure that the software is always in a deployable state. Every passing build is a release candidate.
4.8.1 GitHub Actions
GitHub Actions is a CI/CD platform built into GitHub. Workflows are defined as YAML files in .github/workflows/.
# .github/workflows/ci.yml
name: CI

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.11"

      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install -r requirements.txt

      - name: Run linter
        run: ruff check src/ tests/

      - name: Run type checker
        run: mypy src/ --strict

      - name: Run tests with coverage
        run: pytest tests/ --cov=src --cov-report=xml --cov-fail-under=80

      - name: Run security scan
        run: bandit -r src/ -ll

      - name: Upload coverage report
        uses: codecov/codecov-action@v4
        with:
          file: ./coverage.xml

This workflow runs on every push to main and on every pull request that targets main. It will fail if:
- The linter finds any issues
- The type checker finds any errors
- Any test fails
- Code coverage drops below 80%
- Bandit finds any medium or higher severity issues
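On its own, a failing run only marks the commit or pull request as failed. Once branch protection is enabled (Section 4.8.2), a failing CI pipeline blocks the pull request from being merged, enforcing quality standards automatically.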
4.8.2 Branch Protection
Configure your GitHub repository to require CI to pass before merging:
- Repository Settings → Branches → Branch protection rules
- Add a rule for main
- Enable: “Require status checks to pass before merging”
- Select the CI workflow checks
This ensures no code reaches the main branch without passing all automated checks.
4.9 AI-Generated Tests: Trust but Verify
AI tools can generate test cases quickly, but AI-generated tests require the same critical evaluation as AI-generated implementation code.
4.9.1 What AI Does Well
- Generating boilerplate test structure
- Suggesting parametrised test cases for boundary values
- Generating tests for simple, pure functions
- Identifying equivalence partitions given a function signature
4.9.2 What AI Does Poorly
- Asserting the right things: AI-generated tests often test that code runs without error rather than asserting specific output values.
- Edge cases in business logic: AI does not know that “a task cannot be assigned to a user who has left the project” unless you tell it.
- Integration behaviour: AI generates unit tests well but frequently misses the integration-level behaviours that cause production bugs.
- Security testing: AI rarely generates tests for injection, authentication bypass, or other security concerns.
4.9.3 Evaluating AI-Generated Tests
When reviewing AI-generated tests, ask:
- Does each test assert something meaningful? A test that calls a function and asserts result is not None provides almost no value.
- Are the boundary cases covered? Check that the tests cover the boundaries of input ranges, not just the happy path.
- Is the test isolated? A test that depends on external state (time, filesystem, database) is fragile.
- Is the test readable? The test name should describe exactly what scenario it tests.
- Does the test failure message help diagnose the problem? A test named test_task_1 that fails with AssertionError is useless; test_create_task_raises_for_empty_title that fails is immediately informative.
4.10 Tutorial: Full Testing and CI Setup for the Course Project
Project Structure
ai_native_project/
├── src/
│ ├── __init__.py
│ ├── task_service.py
│ ├── repository.py
│ └── assignment_service.py
├── tests/
│ ├── __init__.py
│ ├── conftest.py
│ ├── test_task_service.py
│ ├── test_repository.py
│ └── test_assignment_service.py
├── .github/
│ └── workflows/
│ └── ci.yml
├── pyproject.toml
├── requirements.txt
└── .pre-commit-config.yaml
Running the Full Quality Suite Locally
# Run all checks in order
ruff check src/ tests/                 # Linting
ruff format --check src/ tests/        # Formatting
mypy src/ --strict                     # Type checking
pytest tests/ -v --cov=src \
    --cov-report=term-missing \
    --cov-fail-under=80                # Tests + coverage
bandit -r src/ -ll                     # Security scan
Add a Makefile to run all checks with one command:
# Makefile
# Recipe lines must be indented with a tab character, not spaces.
.PHONY: check test lint typecheck security

check: lint typecheck test security

lint:
	ruff check src/ tests/
	ruff format --check src/ tests/

typecheck:
	mypy src/ --strict

test:
	pytest tests/ -v --cov=src --cov-report=term-missing --cov-fail-under=80

security:
	bandit -r src/ -ll

Then run every check with:
make check