Lecture 4 — Software Engineering for Data Science

Data & Code Management: From Collection to Application

Samuel Orso

2025-11-06

R packages — Structure, workflow, best practices

Why make an R package?

  • Distribute code to others — packaging nudges you to write documentation.
  • Enforces conventions (files, names, processes).
  • Increases stability via long-term maintenance + testing.
  • Improves usability as your function zoo grows.
  • Clear API, versioning, and easier onboarding for collaborators.

Setup

You will need (at least) the following packages:

install.packages(c("devtools", "knitr", "pkgdown", "roxygen2", "testthat", "usethis"))

Check your system toolchain:

devtools::has_devel()

If not ready: https://r-pkgs.org/setup.html

Demo

(We’ll live-code a tiny package: init → a function → docs → tests → pkgdown.)

Package anatomy

pkgtest/
├─ DESCRIPTION
├─ NAMESPACE          # auto-generated by roxygen2
├─ R/                 # your exported/internal functions
├─ man/               # *.Rd docs (generated)
├─ tests/testthat/    # unit tests
├─ vignettes/         # long-form docs (optional)
├─ data/              # .rda datasets (optional)
├─ inst/              # e.g., inst/examples/
└─ data-raw/          # raw data + scripts (ignored by build)

DESCRIPTION file

DESCRIPTION contains package metadata (authors, description, dependencies, contact, …). Example:

# Plain text (DCF) — shown here for reference
Package: pkgtest
Type: Package
Title: What the Package Does (Title Case)
Version: 0.1.0
Authors@R: person("John", "Doe", email = "john.doe@example.com",
  role = c("aut", "cre"))
Maintainer: John Doe <john.doe@example.com>
Description: More about what it does (maybe more than one line).
    Use four spaces when indenting paragraphs within the Description.
License: MIT + file LICENSE
Encoding: UTF-8
LazyData: true
URL: https://github.com/ptds2024/pkgtest
BugReports: https://github.com/ptds2024/pkgtest/issues
Roxygen: list(markdown = TRUE)
RoxygenNote: 7.3.1
Suggests:
    knitr,
    rmarkdown,
    testthat (>= 3.0.0)
Config/testthat/edition: 3

Tip: usethis::use_description() can scaffold this for you.

Authors and license

Use person() in Authors@R. Common roles:

  • "cre" = maintainer (creator)
  • "aut" = author (substantial contributions)
  • "ctb" = contributor (smaller contributions)
  • "cph" = copyright holder (institution/corporate)

Choose a license: https://choosealicense.com/licenses/

usethis::use_mit_license()
# Alternatives: usethis::use_gpl3_license(), use_apl2_license(), …

Dependencies

DESCRIPTION lists what your package needs.

Depends: R (>= 4.0.0)    # note the space!
Imports:
    dplyr (>= 1.1.4),
    ggplot2 (>= 3.5.1)
Suggests:
    knitr,
    rmarkdown,
    testthat (>= 3.0.0)

Guidelines:

  • Imports for packages whose functions are called in code under R/.
  • Depends for base R version requirement.
  • Suggests for docs, tests, vignettes.
  • Note: for selective namespace import, use roxygen tags such as: @importFrom dplyr mutate select.
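
usethis can record these dependencies for you (a quick sketch; the packages are just examples):

usethis::use_package("dplyr")                    # adds to Imports
usethis::use_package("knitr", type = "Suggests") # adds to Suggests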

Documenting your package

  • Docs live in man/ as *.Rd (generated).
  • We generate them with roxygen2 from inline comments.
  • Run devtools::document() to update NAMESPACE + man/.

Roxygen basics

Place roxygen just above the function:

#' Say hello
#'
#' @return Prints a friendly message.
#' @examples
#' hello()
#' @export
hello <- function() {
  print("Hello, world!")
}

Useful tags: @title, @description, @details, @param, @return, @examples, @seealso, @author, @references, @import, @importFrom, @export

Document all user-facing functions; export some of them.
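
For a function with arguments, a fuller skeleton looks like this (a sketch; add() is an invented example):

#' Add two numbers
#'
#' @param x A numeric vector.
#' @param y A numeric vector of the same length as x.
#' @return The elementwise sum of x and y.
#' @examples
#' add(1, 2)
#' @export
add <- function(x, y) {
  x + y
}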

Example: hello() (docs vs. help page)

#' @title hello world function
#' @details A super fancy function to print Hello World!
#' @return Prints a message.
#' @examples
#' hello()
#' @export
hello <- function() {
  print("Hello, world!")
}

Import external functions explicitly

Rule: If you call another package’s functions by their bare names, import them with roxygen’s @importFrom so NAMESPACE lists them.

(Qualifying every call as pkg::fun(), with pkg under Imports, also works; bare names without an import do not.)

Otherwise you’ll hit R CMD check NOTEs like “no visible global function definition for ‘select’” and users may get runtime errors.

Note: Functions from the base package (e.g., mean()) need no import; functions from other default packages such as stats or graphics (e.g., lm()) do.

Do (bare names + explicit imports):

#' Transform helper
#'
#' @description Example using dplyr without pkg:: qualification.
#' @importFrom dplyr select mutate
#' @export
foo <- function(df) {
  df |>
    select(x, y) |>
    mutate(z = x + y)
}

Adding data (binary .rda)

  • Place datasets in data/ as .rda.
  • Use usethis::use_data() to serialize R objects.
snipes <- read.csv("data-raw/snipes.csv")
usethis::use_data(snipes, overwrite = TRUE)  # creates data/snipes.rda

Preserve the origin story (data-raw)

  • Keep raw inputs + wrangling scripts under data-raw/.
  • Make it reproducible with usethis::use_data_raw() (auto-adds to .Rbuildignore).
usethis::use_data_raw("snipes")
# edit data-raw/snipes.R to read/clean/write snipes object, then use_data()

Reference: r-pkgs, “Preserve the origin story of package data”.

Documenting datasets

Two useful tags: @format and @source.

#' Snipes price data
#'
#' @format A data frame with 48 rows and 3 columns:
#' \describe{
#'   \item{discount}{Discounted price of sneakers}
#'   \item{brand}{Brand of sneakers}
#'   \item{price}{Original price of sneakers}
#' }
#' @source <https://www.snipes.ch/>
"snipes"

.Rbuildignore

Like .gitignore, but for package builds.

usethis::use_build_ignore(c("^.*\\.Rproj$", "^\\.Rproj\\.user$", "^LICENSE\\.md$", "^\\.github$", "^data-raw$"), escape = FALSE)

Or maintain manually:

^.*\.Rproj$
^\.Rproj\.user$
^LICENSE\.md$
^\.github$
^data-raw$

Vignettes

Long-form docs (articles, tutorials) built with R Markdown/Quarto.

usethis::use_vignette("getting-started")
# Write vignettes/getting-started.Rmd (or .qmd)

Remember to list knitr and rmarkdown under Suggests.
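
use_vignette() also scaffolds the required metadata; the header it writes looks like:

---
title: "getting-started"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{getting-started}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---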

Namespaces (how functions are found)

The namespace controls how R looks up variables: it searches your package namespace, then imports, then base, then the regular search path.

  • Generated automatically by roxygen2.
  • Use @export to expose a function; @importFrom pkg fun to bring symbols in.
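
For the examples in this lecture, the generated NAMESPACE would contain directives along these lines (a sketch):

export(hello)
importFrom(dplyr,mutate)
importFrom(dplyr,select)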

A toy function to test

In R/reg_coef.R:

#' Compute regression coefficients
#'
#' @param x Design \code{matrix} or vector.
#' @param y Response \code{vector}.
#' @details Uses \link[stats]{lm} then \link[stats]{coef}.
#' @importFrom stats lm coef
#' @seealso \code{\link[stats]{lm}}, \code{\link[stats]{coef}}
#' @example inst/examples/eg_reg_coef.R
#' @export
`%r%` <- function(y, x) {
  fit <- lm(y ~ x)
  coef(fit)
}

“Testing” via examples

  • Examples (in roxygen) are shown to users and run under R CMD check.
  • For bigger snippets, place files under inst/examples/ and reference them:
#' @example inst/examples/eg_reg_coef.R

inst/examples/eg_reg_coef.R:

## linear regression
cars$speed %r% cars$dist

What happens on check?

Intentional mistake (to see failure)

If inst/examples/eg_reg_coef.R contains:

cars$speed %r% cars   # wrong!

You’ll get a failing check — R CMD check errors while running the examples.

Testing with testthat

Examples are for users; tests are for you (broader, automated).

usethis::use_testthat()

“Whenever you are tempted to type something into a print statement or a debugger expression, write it as a test instead.” — Martin Fowler

testthat structure

  1. Expectation: a single check (expect_*()).
  2. Test: one or more expectations (test_that()).
  3. Test file: one or more tests (e.g., tests/testthat/test_reg_coef.R).

Example:

test_that("regression coefficient input check", {
  expect_error(cars$speed %r% cars)
})

test_that("regression coefficient output", {
  expect_type(cars$speed %r% cars$dist, "double")
})

Continuous Integration (GitHub Actions)

Run checks on multiple R versions/OS on every push/PR:

usethis::use_github_action_check_standard()

More examples: https://github.com/r-lib/actions/tree/master/examples

Keep CI within budget (class org)

  • We have only 3000 free Actions minutes/month for the class org.
  • Prefer public repos (or self-hosted runners) → no Actions minutes used.
  • For private repos, test on your personal repo first; then fork to class org.
  • Trigger CI on PRs/manual runs and use paths filters:
on:
  pull_request:
    paths:
      - 'R/**'
      - 'DESCRIPTION'
      - 'NAMESPACE'
      - '.github/workflows/**'
  workflow_dispatch:

Keep CI within budget (class org)

  • Manually skip CI with [skip ci]:
jobs:
  check:
    if: "!contains(github.event.head_commit.message, '[skip ci]')"
git add README.md
git commit -m "docs: fix typos [skip ci]"
git push

Code coverage (nice enhancement)

Measure what your tests execute:

install.packages("covr")
covr::report()           # local HTML report
usethis::use_coverage()  # set up Codecov (if desired)

Aim for high, but don’t chase 100% blindly — test the behavior that matters.

pkgdown (package website)

Quickly create a website from your docs/vignettes:

usethis::use_pkgdown()    # run once
pkgdown::build_site()     # build locally
usethis::use_github_action("pkgdown")  # CI deploy

Docs: https://pkgdown.r-lib.org/

Example site preview

Release checklist (SemVer)

  • Update NEWS.md (usethis::use_news_md()).
  • Bump version in DESCRIPTION (e.g., 0.1.0 → 0.2.0 for new features).
  • devtools::check() clean locally (no ERROR/WARNING/NOTE).
  • All CI green; coverage reasonable.
  • Rebuild site; tag the release; write a short changelog.
  • Consider CRAN policies if submitting (e.g., R CMD check --as-cran).
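
usethis can bump the version field for you (one option):

usethis::use_version("minor")  # 0.1.0 -> 0.2.0; also accepts "major", "patch", "dev"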

Install from GitHub (user)

For users (install the released tag):

install.packages("remotes")
remotes::install_github("your-org/yourpkg", ref = "v0.1.0", dependencies = TRUE)

# Repo with a subfolder package:
remotes::install_github("your-org/your-repo", subdir = "yourpkg", ref = "v0.1.0")

Note: Building from source may require Rtools (Windows) or Xcode CLT (macOS).

Typical workflow (big picture)

  1. usethis::create_package("pkgtest")
  2. Write a function in R/
  3. Add roxygen docs → devtools::document()
  4. Add tests → usethis::use_testthat() → write tests/testthat/
  5. Check → devtools::check()
  6. CI → usethis::use_github_action_check_standard()
  7. Site → usethis::use_pkgdown() → pkgdown::build_site()
  8. Release checklist and version bump (SemVer)

Tips & helpers (usethis)

usethis::use_readme_md()
usethis::use_lifecycle_badge("maturing")
usethis::use_github()              # init repo + remote
usethis::use_pipe()                # re-export the magrittr pipe (%>%)
usethis::use_tidy_description()    # sort/format DESCRIPTION

Resources

Creating a Python Library

Why a Python library (R parallels)

  • Distribute code, enforce conventions, improve stability.
  • R: DESCRIPTION, NAMESPACE, roxygen, testthat, pkgdown
  • Py: setup.py/wheel, __init__.py, docstrings, pytest, MkDocs

Set up a virtual environment (VS Code)

python3 -m venv venv
source venv/bin/activate     # Windows: venv\Scripts\activate
pip install -U pip setuptools wheel pytest
code .
  • VS Code → “Python: Select Interpreter” → pick ./venv

Minimal structure (parallel to R package anatomy)


pypkg/                    # ← package root (like R pkg root)
├─ setup.py               # ← like DESCRIPTION (+ build recipe)
├─ MANIFEST.in            # ← files to include in builds (roughly the inverse of .Rbuildignore)
├─ README.md
├─ pypkg/                 # ← source (like R/ folder)
│  ├─ __init__.py         # ← public API (like NAMESPACE role)
│  ├─ hello.py            # ← hello()
│  ├─ regression.py       # ← reg_coef()
│  ├─ data.py             # ← load_snipes()
│  └─ data/
│     └─ snipes.csv       # ← packaged dataset
├─ examples/
│  └─ eg_reg_coef.py      # ← like inst/examples/...
└─ tests/                 # ← pytest (like tests/testthat)
   ├─ test_hello.py
   ├─ test_regression.py
   └─ test_data.py

We need to create these files step-by-step.

Create your project directory (Terminal basics)

Open your terminal (VS Code: View → Terminal) and create a folder for your Python library.

# See where you are (present working directory)
pwd

# List folders/files
ls

# Go to the parent location where you want the project
cd path/to/parent

# Create and enter your project folder
mkdir mypythonlibrary
cd mypythonlibrary

# (Optional) Open the folder in VS Code
code .

hello() (R: the “Hello, world!” slide)

pypkg/hello.py

def hello() -> str:
    """Return a friendly greeting."""
    return "Hello, world!"

pypkg/__init__.py — expose the public API (like exporting in NAMESPACE)

from .hello import hello

__all__ = ["hello"]

Build a wheel (R: R CMD build)

setup.py (simple & minimal)

from setuptools import setup, find_packages

setup(
    name="pypkg",
    version="0.1.0",
    description="Tiny demo library",
    author="Your Name",
    packages=find_packages(include=["pypkg", "pypkg.*"]),
    install_requires=["numpy>=1.26"],
    python_requires=">=3.9",
)

Build:

python setup.py bdist_wheel
ls dist/
# pypkg-0.1.0-py3-none-any.whl

Install & try:

pip install dist/pypkg-0.1.0-py3-none-any.whl
python -c "import pypkg; print(pypkg.hello())"

Dataset: snipes (R: data/snipes.rda + data-raw)

Ship the CSV and provide a loader (no pandas required).

pypkg/data.py

from importlib import resources
import csv
from typing import Dict, List

def load_snipes() -> List[Dict[str, str]]:
    """Load the bundled 'snipes' dataset (list of dict rows)."""
    # resources.files() needs Python >= 3.9; data/ need not be a package
    path = resources.files("pypkg") / "data" / "snipes.csv"
    with path.open("r", encoding="utf-8") as fh:
        return list(csv.DictReader(fh))

MANIFEST.in — include data in the wheel

include pypkg/data/snipes.csv

R parallel: use_data() writes data/*.rda; here we ship a CSV + loader.

Preserve the origin story (R: data-raw/)

Keep raw files & scripts outside the wheel (tracked in Git):

data_raw/
  snipes_raw.csv
  make_snipes.py   # cleans raw → writes pypkg/data/snipes.csv

R parallel: usethis::use_data_raw() + .Rbuildignore; Py: keep data_raw/ out of MANIFEST.in.

Documenting dataset (R: @format, @source)

Use the loader’s docstring:

def load_snipes() -> List[Dict[str, str]]:
    """
    Load the bundled 'snipes' dataset.

    Fields
    ------
    discount : str
    brand    : str
    price    : str

    Source
    ------
    https://www.snipes.ch/
    """
    ...

MkDocs (later) will render this automatically.

Regression coefficients (R: %r% with stats::lm)

Implement a tiny OLS with NumPy (no heavy deps).

pypkg/regression.py

from typing import Iterable, Tuple
import numpy as np

def reg_coef(y: Iterable[float], x: Iterable[float]) -> Tuple[float, float]:
    """
    Compute OLS coefficients for y ~ 1 + x.

    Returns
    -------
    (intercept, slope)
    """
    y = np.asarray(y, dtype=float).reshape(-1, 1)
    x = np.asarray(x, dtype=float).reshape(-1, 1)
    if y.shape[0] != x.shape[0]:
        raise ValueError("x and y must have the same length")
    X = np.c_[np.ones_like(x), x]
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    b0, b1 = beta.ravel().tolist()
    return b0, b1

R parallel: returns coef(lm(y ~ x)).

Example script (R: inst/examples/eg_reg_coef.R)

examples/eg_reg_coef.py

from pypkg import reg_coef

y = [2, 4, 6, 8]
x = [1, 2, 3, 4]
print(reg_coef(y, x))   # → (intercept, slope)

Run:

python examples/eg_reg_coef.py

Update public API (R: NAMESPACE)

pypkg/__init__.py — expose the public API (like exporting in NAMESPACE)

from .hello import hello
from .regression import reg_coef
from .data import load_snipes

__all__ = ["hello", "reg_coef", "load_snipes"]

Re-Build a wheel (R: R CMD build)

setup.py (simple & minimal)

from setuptools import setup, find_packages

setup(
    name="pypkg",
    version="0.1.0",
    description="Tiny demo library",
    author="Your Name",
    packages=find_packages(include=["pypkg", "pypkg.*"]),
    install_requires=["numpy>=1.26"],
    include_package_data=True,   # needed with MANIFEST.in
    python_requires=">=3.9",
)

Build:

python setup.py bdist_wheel
ls dist/
# pypkg-0.1.0-py3-none-any.whl

Install & try:

pip install dist/pypkg-0.1.0-py3-none-any.whl
python -c "import pypkg; print(pypkg.hello()); print(pypkg.reg_coef([2,4,6,8],[1,2,3,4]))"

Tests (R: testthat)

Install pytest (already done) and add three tests.

tests/test_hello.py

from pypkg import hello
def test_hello():
    assert hello() == "Hello, world!"

tests/test_regression.py

import pytest
from pypkg import reg_coef

def test_reg_coef_values():
    y, x = [2, 4, 6, 8], [1, 2, 3, 4]
    b0, b1 = reg_coef(y, x)
    assert round(b0, 6) == 0.0
    assert round(b1, 6) == 2.0

def test_reg_coef_input_shape():
    with pytest.raises(ValueError):
        reg_coef([1, 2, 3], [1, 2])

Run:

pytest -q

“Explicit imports” (R: @importFrom rule)

  • In R, bare names require @importFrom pkg fun.
  • In Python, prefer explicit imports at the top:
# Good: explicit & narrow
from numpy.linalg import lstsq
import numpy as np

# Avoid: wildcard imports (hard to track & test)
from numpy.linalg import *

Clear imports → fewer undefined symbols, better tests & CI.

CI (GitHub Actions) — lite & budget-friendly

.github/workflows/python-ci.yml

name: Python CI (lite)
on:
  pull_request:
    paths:
      - "pypkg/**"
      - "tests/**"
      - "setup.py"
      - "MANIFEST.in"
      - ".github/workflows/**"
  workflow_dispatch:

concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true

jobs:
  test:
    if: "!contains(github.event.head_commit.message, '[skip ci]')"
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: "3.11", cache: "pip" }
      - run: python -m pip install -U pip
      - run: pip install -e . pytest
      - run: pytest -q

Docs (R: pkgdown) — optional but nice

1) Install MkDocs and the docstring renderer:

pip install mkdocs "mkdocstrings[python]" mkdocs-material
mkdocs new .

2) Configure

# mkdocs.yml
site_name: pypkg
theme: material
plugins:
  - mkdocstrings:
      handlers:
        python:
          options: { docstring_style: numpy, show_source: false }
nav:
  - Home: index.md
  - API: api.md

Docs

3) Write docs

Create docs/index.md (overview/quickstart) and docs/api.md for the API.

docs/index.md (example)

# pypkg

A tiny library with `hello()`, `reg_coef()`, and the `snipes` dataset loader.

## Quickstart
```bash
pip install -e .
python -c "import pypkg; print(pypkg.hello())"
```

docs/api.md (auto API from docstrings)

# API Reference

::: pypkg

4) Preview & build

mkdocs serve     # local preview
mkdocs build     # outputs site/
mkdocs gh-deploy # (optional) deploy to GitHub Pages

Release & SemVer (same rule as R slides)

  • 0.1.0 → new features: MINOR; bugfix: PATCH; breaking: MAJOR.
  • Tag: git tag v0.1.0 && git push --tags
  • Build wheel, test in a fresh venv, then (optionally) TestPyPI → PyPI with twine.

Install from GitHub

For users (install from GitHub):

# Option 1: pip install directly from GitHub (recommended)
pip install git+https://github.com/your-org/yourpkg@v0.1.0

# Option 2: Using `pip install -e .` for editable installs (locally cloned repo)
git clone https://github.com/your-org/yourpkg
cd yourpkg
pip install -e .

Quick recap (R ↔︎ Py)

  • hello() → R prints a message ⟷ Python returns a string.
  • snipes dataset → load_snipes() with a packaged CSV.
  • regression coefficients → reg_coef() via NumPy OLS.
  • examples → examples/eg_reg_coef.py (like inst/examples).
  • tests → pytest (like testthat).
  • CI → Actions (lite).
  • docs/release → optional MkDocs, SemVer tags + wheel upload.

When should you create a library?

  • Rule of Three: you’ve copied the same functions across ≥ 3 projects.
  • Audience > you: teammates/students need to reuse your code.
  • Stable core idea: function names/arguments won’t change every week.
  • Testing matters: you’re willing to add automated tests (testthat/pytest).
  • Versioning matters: you want SemVer to communicate changes.
  • Docs exist: you can write a README and minimal usage examples (vignette/doc page).
  • Multiple environments: others will run it on different machines/OS.

When not (yet)

  • One-off notebook/report: no planned reuse.
  • Spike/prototype: API and data shapes still change daily.
  • Entangled secrets/paths: credentials or local file paths mixed with logic.
  • Huge assets: gigabytes of data/binaries—ship loaders, not raw assets.
  • No maintainer bandwidth: can’t commit to fixes/docs/release bumps.

Heuristic: if you’d hesitate to fix a bug reported by someone else, don’t package yet.

Object-Oriented Programming

Why OOP?

  • Tame complexity. Large codebases drift into “spaghetti code” when everything touches everything. OOP groups data + behavior into cohesive modules (classes) with clear interfaces.
  • Reuse & extension. Add new features by creating new classes or methods—without rewriting callers (polymorphism).
  • Safer changes. Internals can evolve behind the interface (encapsulation), reducing ripple effects and regressions.

Note

Context: The “spaghetti code” idea is often cited in post-mortems of complex systems (e.g., discussions around large automotive software stacks). Clear module boundaries and interfaces are a first line of defense (see the Toyota 2013 case study).

Part I — R Functions & S3 OOP

Function

“Everything is a function call”

  • A function has three components: arguments, a body, and an environment.
  • Signalling conditions: errors (severe), warnings (mild), messages (informative).
  • Lexical scoping: dynamic lookup & name masking.
  • Environments: current, global, empty, execution, package.
  • Composition via nesting or piping (|>) — example below.
# A function is an object
f <- function(x) x + 1
typeof(f); formals(f); environment(f)
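
The composition bullet in action — the same computation nested and piped (the native pipe needs R >= 4.1):

sqrt(sum(1:10))          # nesting
1:10 |> sum() |> sqrt()  # piping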

Signalling conditions

message("This is a friendly message")
warning("This is a mild warning")
try(stop("This is an error"))
Error in try(stop("This is an error")) : This is an error

Lexical scoping

x <- 10
outer <- function() {
  x <- 5
  inner <- function() x
  inner()
}
outer()  # 5: name masking
x        # 10: global unchanged

S3 OOP system

  • Object-oriented programming (OOP) is a popular programming paradigm.
  • The type of an object is a class and a function implemented for a specific class is a method.
  • Polymorphism: function interface is separated from its implementation; behavior depends on class.
  • Encapsulation: object interface is separated from its internal structure; users don’t need to worry about details. Encapsulation helps avoid spaghetti code.
  • R has several OOP systems: S3, S4, R6, …
  • S3 is the first R OOP system; it is informal (easy to modify) and widespread.

Minimal S3 example

# Minimal S3 example: generic + methods
area <- function(x, ...) UseMethod("area")        # generic

# constructor for a 'circle'
new_circle <- function(radius) structure(list(radius = radius), class = "circle")
area.circle <- function(x, ...) pi * x$radius^2

# constructor for a 'rectangle'
new_rectangle <- function(w, h) structure(list(w = w, h = h), class = "rectangle")
area.rectangle <- function(x, ...) x$w * x$h

c1 <- new_circle(2)
r1 <- new_rectangle(3, 4)
area(c1); area(r1)

Why OOP?

  • Uniform interfaces: one verb (e.g., summary(), plot()) works across many data types.
  • Separation of concerns: analyses call verbs; classes handle how.
  • Extensibility: you can add behavior for new data types without touching old code.
  • Safer refactoring: callers don’t change; class-specific methods evolve independently.
  • Discoverability: “What happens if I call summary() on this object?” → predictable, documented.

In R, this is powered by S3: a lightweight dispatch system that maps a generic (like summary) to a method (like summary.lm) based on the object’s class.

S3 OOP — Motivation

  • Polymorphism: same function name, different behavior by object class.
  • Lightweight encapsulation: use interfaces, not internals.
  • Informal & flexible (no strict class declarations).

Polymorphism example

summary(cars$speed)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    4.0    12.0    15.0    15.4    19.0    25.0 
summary(lm(cars$speed ~ cars$dist))

Call:
lm(formula = cars$speed ~ cars$dist)

Residuals:
    Min      1Q  Median      3Q     Max 
-7.5293 -2.1550  0.3615  2.4377  6.4179 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  8.28391    0.87438   9.474 1.44e-12 ***
cars$dist    0.16557    0.01749   9.464 1.49e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.156 on 48 degrees of freedom
Multiple R-squared:  0.6511,    Adjusted R-squared:  0.6438 
F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12

What is happening? (dispatch)

sloop::s3_dispatch(summary(cars$speed))
   summary.double
   summary.numeric
=> summary.default
sloop::s3_dispatch(summary(lm(cars$speed ~ cars$dist)))
=> summary.lm
 * summary.default
class(cars$speed)
[1] "numeric"
class(lm(cars$speed ~ cars$dist))
[1] "lm"
  • * method exists; => method selected.

Peeking at generics (R)

summary
function (object, ...) 
UseMethod("summary")
<bytecode: 0x5568d595b0c8>
<environment: namespace:base>
head(summary.default)
                                                    
1 function (object, ..., digits, quantile.type = 7) 
2 {                                                 
3     if (is.factor(object))                        
4         return(summary.factor(object, ...))       
5     else if (is.matrix(object)) {                 
6         if (missing(digits))                      
head(summary.lm)
                                                               
1 function (object, correlation = FALSE, symbolic.cor = FALSE, 
2     ...)                                                     
3 {                                                            
4     z <- object                                              
5     p <- z$rank                                              
6     rdf <- z$df.residual                                     

... — forwarding extra arguments

f <- function(...){ list(...) }
f(a = 1, b = 2)
$a
[1] 1

$b
[1] 2
  • Useful for generics and passing options forward.

Write a generic + methods (R, S3)

my_new_generic <- function(x, ...) {
  UseMethod("my_new_generic")
}

my_new_generic.default <- function(x, ...){
  "this is default method"
}

my_new_generic.lm <- function(x, ...){
  "this is method for class `lm`"
}

my_new_generic(cars$speed)
my_new_generic(lm(cars$speed ~ cars$dist))

Check the dispatch & implicit class

sloop::s3_dispatch(my_new_generic(cars$speed))
sloop::s3_dispatch(my_new_generic(lm(cars$speed ~ cars$dist)))
sloop::s3_class(cars$speed)  # implicit "double"
typeof(cars$speed)

Inheritance (multiple classes)

class(glm(cars$speed ~ cars$dist))
sloop::s3_dispatch(summary(glm(cars$speed ~ cars$dist)))
sloop::s3_dispatch(plot(glm(cars$speed ~ cars$dist)))

If a method isn’t found for the 1st class, R tries the next, and so on.
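
A minimal sketch of this fall-through (the class and generic names are invented):

x <- structure(list(), class = c("student", "person"))
greet <- function(x) UseMethod("greet")
greet.person <- function(x) "Hello, person"
greet(x)  # no greet.student method, so greet.person is used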

Create your own S3 class (quick way)

set.seed(123)
image <- matrix(rgamma(100, shape = 2), 10, 10)
class(image) <- "pixel"
attributes(image)

Create your own S3 class (neater)

set.seed(123)
image <- structure(
  matrix(rgamma(100, shape = 2), 10, 10),
  class = "pixel"
)
plot(image)  # No plot() method yet for class 'pixel'

Create your own S3 class with a constructor

# define a constructor for the 'pixel' class
new_pixel <- function(mat) structure(mat, class = "pixel")
set.seed(123)
mat <- matrix(rgamma(100, shape = 2), 10, 10)
image <- new_pixel(mat)

Caution

How can we make this constructor more robust?

Validators

  • More complicated classes require more elaborate validity checks.
  • Rather than complicating the constructor, define a separate validator function for the checks.
# validator: enforce invariants
validate_pixel <- function(x) {
  mat <- unclass(x)
  if(!is.matrix(mat) ||  !is.numeric(mat)) {
    stop("A 'pixel' object must be a numeric matrix", call. = FALSE)
  }
  x
}
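
A common follow-up (a sketch; the helper name is ours) is a user-facing helper that validates on creation:

pixel <- function(mat) validate_pixel(new_pixel(mat))
image <- pixel(matrix(rgamma(100, shape = 2), 10, 10))  # checked at construction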

Extending an existing generic (with care)

plot.pixel <- function(x, ...) {
  heatmap(x, ...)
}
sloop::s3_dispatch(plot(image))
=> plot.pixel
 * plot.default

New plot for class pixel

plot(image)

Forwarding extra options with ...

plot(image, col = cm.colors(256), xlab = "x axis", ylab = "y axis")

To go further (R)

  • An Introduction to Statistical Programming Methods with R — Functions.
  • Advanced R (Hadley Wickham), Ch. 6–8 (functions), 12–16 (OOP: S3, S4, R6).

Part II — Python OOP

Why OOP in Python?

  • Class-based OOP: define classes (blueprints) and create instances (houses).
  • Encapsulation: group data (attributes) with behavior (methods).
  • Inheritance: share and extend behavior (code reuse).
  • Polymorphism: same method name, different behaviors across types.
  • Generic functions with @singledispatch provide an R-S3-like feel when helpful.

Classes & instances

class Dog:
    def __init__(self, name):
        self.name = name      # attribute
    def speak(self):          # method
        return f"{self.name} says woof!"

d = Dog("Milo")
d.speak()
type(d), isinstance(d, Dog)
  • class defines a new type.
  • __init__ is the constructor (like R’s new_*()).
  • self is the instance being operated on. It’s just a naming convention (not a keyword), but always the first parameter of instance methods.

Encapsulation (group data + behavior)

class Counter:
    def __init__(self, start=0):
        self._value = start   # underscore = “internal by convention”
    def inc(self, step=1):
        self._value += step
    def value(self):
        return self._value

c = Counter(); c.inc(); c.value()

Inheritance & overriding

class Animal:
    def speak(self): return "..."

class Dog(Animal):
    def speak(self): return "woof"

class Cat(Animal):
    def speak(self): return "meow"

pets = [Dog(), Cat(), Animal()]
[s.speak() for s in pets]     # polymorphism
Dog.__mro__                   # method resolution order

Polymorphism (same message, different response)

def make_it_speak(animal):
    print(animal.speak())

for a in [Dog(), Cat()]:
    make_it_speak(a)

Generic functions with @singledispatch

from functools import singledispatch
import statistics as stats

@singledispatch
def describe(x):
    return f"Generic object of type {type(x).__name__}"

@describe.register(list)
def _(x: list):
    return {"len": len(x), "mean": stats.mean(x)}

class LinReg:
    def __init__(self, coef, intercept): self.coef, self.intercept = coef, intercept

@describe.register(LinReg)
def _(m: LinReg):
    return {"coef": m.coef, "intercept": m.intercept}

describe([1,2,3]), describe(LinReg([0.5], 2.0))

Note

Rule of thumb in Python:

  • Use class methods when you control the class design.

  • Use @singledispatch when you want an external, pluggable generic (like S3) that different modules can extend without modifying the original class.

Composition & duck typing

  • Interfaces by behavior: “if it quacks like a duck…”
  • Prefer composition when inheritance isn’t needed.
class JsonWriter:
    def write(self, obj): 
        import json; return json.dumps(obj)

class XmlWriter:
    def write(self, obj):
        return "<root>...</root>"

def save(writer, obj):     # any object with .write() works
    return writer.write(obj)

save(JsonWriter(), {"a":1})
save(XmlWriter(), {"a":1})
  • save doesn’t care what the writer is—only that it has .write().
  • You can plug in a new writer (CSV, YAML, DB, HTTP…) without touching save.

Pass in behavior (composition), don’t inherit it, when all you need is a capability.

A tiny “generic plot” (analogy to R’s plot())

import numpy as np, matplotlib.pyplot as plt
from functools import singledispatch

class Pixel:
    def __init__(self, data): self.data = np.array(data)

@singledispatch
def plot(obj): 
    raise TypeError(f"Don't know how to plot {type(obj).__name__}")

@plot.register
def _(obj: Pixel):
    plt.imshow(obj.data); plt.title("Pixel"); plt.show()

rng = np.random.default_rng(123)
image = rng.gamma(shape=2.0, scale=1.0, size=(10,10))
px = Pixel(image)
plot(px)

To go further (Python)

  • Python tutorial on classes (docs.python.org)
  • functools.singledispatch for generic functions
  • inspect for introspection, abc for abstract base classes

Bridges — R S3 ↔︎ Python OOP

  • Generic function

    • R: summary(x) → summary.<class> via S3 dispatch.
    • Python: @singledispatch picks implementation by type.
  • Class identity

    • R S3: class(x) is an attribute; can be a vector (ordered inheritance).
    • Python: type(x) is the class; inheritance via MRO chain.
  • Extensibility

    • R: add generic.class <- function(x, ...) {} for known generics (mind ...).
    • Python: @generic.register(Type) or subclass and override methods.
  • Scoping / environments vs namespaces

    • R: function environments, lexical scoping.
    • Python: LEGB (Local–Enclosing–Global–Builtins).
  • Varargs

    • R: ... ↔︎ Python: *args, **kwargs.

Recap

  • OOP gives you uniform verbs, decoupled implementations, and extensibility.
  • In R, S3 makes this lightweight and idiomatic (summary, plot, coef, …).
  • In Python, classes + @singledispatch cover both class methods and generic functions.

Case study (R, S3): a tiny smoother model

Goal: Create a simple “model” object that computes a moving-average fit and immediately works with summary() and plot() via S3.

# Tiny moving average helper (base R)
movavg <- function(z, k = 5) {
  stopifnot(k %% 2 == 1, k >= 3)
  n <- length(z); out <- rep(NA_real_, n); h <- k %/% 2
  for (i in (1 + h):(n - h)) out[i] <- mean(z[(i - h):(i + h)])
  out
}

# Constructor returning an S3 object of class "my_smooth"
my_smooth <- function(x, y, k = 5) {
  stopifnot(length(x) == length(y))
  o <- order(x); x <- x[o]; y <- y[o]
  structure(
    list(x = x, y = y, k = k, fitted = movavg(y, k), call = match.call()),
    class = "my_smooth"
  )
}

Case study (R, S3): methods + usage

# Summary method
summary.my_smooth <- function(object, ...) {
  res <- object$y - object$fitted
  c(n = length(object$y),
    k = object$k,
    mse = mean(res^2, na.rm = TRUE))
}

# Plot method
plot.my_smooth <- function(x, ..., col_points = "grey40", pch = 19) {
  plot(x$x, x$y, col = col_points, pch = pch,
       xlab = "x", ylab = "y", main = "my_smooth fit", ...)
  lines(x$x, x$fitted, lwd = 2)
  legend("topleft", bty = "n", legend = paste0("k = ", x$k))
}

# Fit on 'cars' and use generic verbs
m <- my_smooth(cars$dist, cars$speed, k = 5)
summary(m)
plot(m)

Case study (R, S3): what dispatch did

sloop::s3_dispatch(summary(m))
sloop::s3_dispatch(plot(m))

Takeaways

  • We created a new class ("my_smooth") with a tiny constructor.
  • By adding summary.my_smooth and plot.my_smooth, existing verbs just work.
  • Analyses call generic verbs; class authors decide the behavior.

Case study (Python): the same idea with classes + @singledispatch

Goal: Mirror the R idea: a small smoother class, plus generic summarize() and plot() that dispatch on type.

import numpy as np

class Smooth:
    def __init__(self, x, y, k=5):
        assert k % 2 == 1 and k >= 3
        idx = np.argsort(x)
        self.x = np.array(x)[idx]
        self.y = np.array(y)[idx]
        self.k = k
        self.fitted = self._movavg(self.y, k)

    def _movavg(self, z, k):
        n = len(z); out = np.full(n, np.nan); h = k // 2
        for i in range(h, n - h):
            out[i] = np.mean(z[i - h:i + h + 1])
        return out

Case study (Python): generic functions + usage

from functools import singledispatch
import numpy as np, matplotlib.pyplot as plt

@singledispatch
def summarize(obj):
    raise TypeError(f"No summarize() for {type(obj).__name__}")

@summarize.register
def _(obj: Smooth):
    res = obj.y - obj.fitted
    return {"n": len(obj.y), "k": obj.k, "mse": float(np.nanmean(res**2))}

@singledispatch
def plot(obj):
    raise TypeError(f"No plot() for {type(obj).__name__}")

@plot.register
def _(obj: Smooth):
    plt.scatter(obj.x, obj.y, s=15)
    plt.plot(obj.x, obj.fitted, linewidth=2)
    plt.title(f"Smooth (k={obj.k})")
    plt.xlabel("x"); plt.ylabel("y"); plt.show()

# Example with 'cars'-like data (replace with real arrays in class)
dist = [d for d in range(2, 122, 2)]
rng = np.random.default_rng(123)
speed = (0.3*np.array(dist) + rng.normal(0, 3, len(dist))).tolist()

m = Smooth(dist, speed, k=5)
summarize(m), plot(m)

Case study

  • Same verb, different types

    • R: summary(m) → finds summary.my_smooth.
    • Python: summarize(m) → @singledispatch picks the Smooth implementation.
  • Extensibility

    • Add another class (e.g., my_tree) and define just two methods:

      • R: summary.my_tree(), plot.my_tree()
      • Python: @summarize.register(MyTree), @plot.register(MyTree)
  • Separation of concerns

    • Analysts keep calling the same verbs; class authors evolve internals safely.

OOP Exercise in R

Setup (given): you have my_smooth(x, y, k) (constructor) with
summary.my_smooth() and plot.my_smooth() from the case study.

A. Create and use an object
1) Build three models with different windows: k = 3, k = 7, k = 11.
2) Call summary() on each.
3) In 1–2 lines: which k fits best (lower MSE) on your data?

B. Add a tiny exporter
Add an S3 method to convert your object to a simple list:

as.list.my_smooth <- function(x, ...) {
  list(x = x$x, y = x$y, fitted = x$fitted, k = x$k, call = x$call)
}

Then run: d <- as.list(m7); names(d)
Why is exporting to a plain list handy (think: saving, APIs, tests)?

OOP Exercise in R (continued)

C. Make a simple prediction
Add a predict() S3 method using linear interpolation along the fitted curve:

predict.my_smooth <- function(object, xnew, ...) {
  approx(object$x, object$fitted, xout = xnew, rule = 2)$y
}

Try: predict(m7, c(10, 20, 30)) and print the results.

D. Plot (optional)
Plot your three models and briefly describe how increasing k changes smoothness:

par(mfrow = c(1,3))
plot(m3, main = "k=3");  plot(m7, main = "k=7");  plot(m11, main = "k=11")
par(mfrow = c(1,1))

OOP Exercise in Python

Setup (given): you have Smooth(x, y, k=5) from the case study, plus:

@singledispatch
def summarize(obj): ...
@summarize.register
def _(m: Smooth): ...

A. Create and use an object

  1. Make a Smooth with k = 3, k = 7, k = 11.
  2. Call summarize() on each.
  3. In 1–2 lines: which k fits best (lower MSE) on your data?

B. Add a tiny method
Add this method inside Smooth:

def to_dict(self):
    return {"x": self.x.tolist(),
            "y": self.y.tolist(),
            "fitted": np.nan_to_num(self.fitted).tolist(),
            "k": int(self.k)}

Then do: d = m.to_dict() and print the keys. Why is this useful?

OOP Exercise in Python (continued)

C. Make a simple prediction
Add this method inside Smooth:

def predict(self, x_new):
    # linear interpolation on the fitted curve
    return np.interp(x_new, self.x, np.nan_to_num(self.fitted))

Try m.predict([10, 20, 30]) and print the results.

D. Plot (optional)
Make a function:

def quick_plot(m):
    plt.scatter(m.x, m.y, s=12)
    plt.plot(m.x, m.fitted, linewidth=2)
    plt.title(f"Smooth (k={m.k})"); plt.xlabel("x"); plt.ylabel("y"); plt.show()

Plot your three models; in 1–2 lines, say how increasing k changes smoothness.

Functional Programming

Functional programming

  • A paradigm emphasizing pure functions (no side effects), immutability, and declarative style.

  • Benefits: maintainable, predictable, and scalable (parallelizable) code.

  • Key concepts:

    • Pure function: same output for same input; no side effects.
    • First-class functions: can be passed, returned, stored.
    • Higher-order functions: take/return functions.

Pure function

  • A pure function always produces the same output for the same input.
  • Is rnorm a pure function?
set.seed(123)
rnorm(5)

Pure function (continued)

set.seed(123)
rnorm(5)
set.seed(124)
rnorm(5)
  • Same explicit call but different internal RNG state → rnorm is not pure.

First-class functions — pass as argument

f <- function(g) g(rnorm(10))
f(sum); f(max); f(mean)

First-class functions — returning a function

makeMultiplier <- function(factor) {
  function(x) x * factor
}

timesFive <- makeMultiplier(5)
timesFive(10)

Function operators / higher-order

applyTwice <- function(func) {
  function(x) func(func(x))
}
addTwo <- function(x) x + 2
applyTwice(addTwo)(3)

From for loops to functionals — squares (R)

# Using a for loop to calculate squares
n <- 5
result <- vector("list", n)
for (i in 1:n) result[[i]] <- i^2
result
# Functional approach with purrr
library(purrr)
map(1:n, ~ .^2)

From for loops to functionals — sum

# for loop
a <- 0
for (i in 1:10) a <- a + i
a
# Reduce
Reduce(`+`, 1:10)

purrr::map() (R)

  • map(v, f) applies f to each element of v and returns a list.
library(purrr)
map(1:3, exp)
# base R equivalent
lapply(1:3, exp)

Returning atomic vectors

  • map / lapply return a list; sometimes you want an atomic vector.
  • Use map_lgl(), map_int(), map_dbl(), map_chr() or base sapply / vapply.
# purrr typed maps
map_dbl(1:4, exp)
# base R
sapply(1:4, exp)
# vapply needs a template
vapply(1:4, FUN = exp, FUN.VALUE = double(1))

Inline anonymous functions

There are situations where the function you want to pass does not exist yet. Use an inline anonymous function (aka lambda).

# purrr formula shortcut
map_dbl(1:4, ~ if (.x %% 2 == 0) .x^2 else .x^3)
# base R anonymous function
sapply(1:4, function(x) if (x %% 2 == 0) x^2 else x^3)

Variants to purrr::map()

map2 takes two vectors v1, v2 and a function f and returns f(v1[i], v2[i]).

wt <- c(5,5,4,1)/15
wtL <- list(wt, wt, wt)
x <- list(c(6,4.5,5,4), c(5.5,5,4.5,6), c(6,6,4,4))
map2_dbl(x, wtL, weighted.mean)

pmap generalization (R)

l1 <- as.list(1:3)
l2 <- as.list(4:6)
l3 <- as.list(7:9)
calculate_sum <- function(e1,e2,e3) e1+e2+e3
pmap(list(l1,l2,l3), calculate_sum)

Variants to sapply (R)

  • mapply generalizes sapply to many inputs.
  • Map vectorizes over all arguments (no extra non-vectorized input allowed).
# mapply weighted means
wt <- c(5,5,4,1)/15
wtL <- list(wt, wt, wt)
x <- list(c(6,4.5,5,4), c(5.5,5,4.5,6), c(6,6,4,4))
mapply(FUN = weighted.mean, x, wtL)
# Map over all arguments
Map(f = weighted.mean, x, wtL)

outer product (R)

outer(X = c("a","b","c"), Y = c("1","2","3","4"), FUN = paste0)

Common higher-order functions in FP (R)

  • Higher-order = take/return functions.
  • Reduce applies a binary function iteratively to elements of a vector.
Reduce(`+`, 1:10)
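
Reduce can also return the intermediate results:

Reduce(`+`, 1:10, accumulate = TRUE)  # running sums: 1 3 6 ... 55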

R’s vectorization

# Using map/lapply
library(purrr)
map(1:2, exp)
lapply(1:2, exp)
# But exp is already vectorized
exp(1:2)

Vectorizing a function (R): pitfall then fix

# return square if even, cube otherwise
f <- function(x) if (x %% 2 == 0) x^2 else x^3
f(1:4)  # error: non-scalar if

Vectorizing a function (R): Vectorize

vf <- Vectorize(FUN = f, vectorize.args = "x")
vf(1:4)

Vectorizing a function (R): ifelse

ifelse(1:4 %% 2 == 0, (1:4)^2, (1:4)^3)

Parallelism (R)

  • Benefits of FP include scalability and parallelism.
  • Many problems are embarrassingly parallel.
  • parallel ships with R and offers parallelized *apply variants.
library(parallel)
detectCores()
  • This is the total number of threads (Hyper-Threading), not physical cores.
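
To count physical cores instead, pass logical = FALSE (may return NA on some platforms):

detectCores(logical = FALSE)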

Forking with mclapply (R)

measure_time <- function(x){
  t1 <- Sys.time(); Sys.sleep(x); t2 <- Sys.time()
  difftime(t2,t1,units="secs")
}

Forking with mclapply (R)

library(parallel)
t1 <- Sys.time()
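# Note: mc.cores > 1 relies on forking, which is unavailable on Windows (use a socket cluster there)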
mclapply(1:5, measure_time, mc.cores = 5)
t2 <- Sys.time()
sprintf("In total, it took %.1f seconds to run", as.numeric(difftime(t2,t1,units="secs")))

Socket cluster with parLapply (R)

library(parallel)
cl <- makeCluster(5)
t1 <- Sys.time()
parLapply(cl, 1:5, measure_time)
t2 <- Sys.time()

Socket cluster with parLapply (R)

stopCluster(cl)
sprintf("In total, it took %.1f seconds to run", as.numeric(difftime(t2,t1,units="secs")))

Python counterparts

  • Python is multi-paradigm. It supports functional programming (FP) alongside OOP and imperative styles.
  • Functional tools: map/filter, comprehensions, itertools, functools.reduce/partial/lru_cache.
  • Vectorization is via NumPy (ufuncs). Parallelism via concurrent.futures / multiprocessing.
  • Defaults are eager at definition time (vs. R’s lazy promises); use the None pattern for dynamic defaults.

To go further