Data & Code Management: From Collection to Application
2025-11-06
You will need (at least) the following packages:
Check your system toolchain:
If not ready: https://r-pkgs.org/setup.html
(We’ll live-code a tiny package: init → a function → docs → tests → pkgdown.)
pkgtest/
├─ DESCRIPTION
├─ NAMESPACE # auto-generated by roxygen2
├─ R/ # your exported/internal functions
├─ man/ # *.Rd docs (generated)
├─ tests/testthat/ # unit tests
├─ vignettes/ # long-form docs (optional)
├─ data/ # .rda datasets (optional)
├─ inst/ # e.g., inst/examples/
└─ data-raw/ # raw data + scripts (ignored by build)
DESCRIPTION contains package metadata (authors, description, dependencies, contact, …). Example:
# Plain text (DCF) — shown here for reference
Package: pkgtest
Type: Package
Title: What the Package Does (Title Case)
Version: 0.1.0
Authors@R: person("John", "Doe", email = "john.doe@example.com",
role = c("aut", "cre"))
Maintainer: John Doe <john.doe@example.com>
Description: More about what it does (maybe more than one line).
Use four spaces when indenting paragraphs within the Description.
License: MIT + file LICENSE
Encoding: UTF-8
LazyData: true
URL: https://github.com/ptds2024/pkgtest
BugReports: https://github.com/ptds2024/pkgtest/issues
Roxygen: list(markdown = TRUE)
RoxygenNote: 7.3.1
Suggests:
knitr,
rmarkdown,
testthat (>= 3.0.0)
Config/testthat/edition: 3
Tip: usethis::use_description() can scaffold this for you.
Use person() in Authors@R. Common roles:
"cre" = maintainer (creator)
"aut" = author (substantial contributions)
"ctb" = contributor (smaller contributions)
"cph" = copyright holder (institution/corporate)
Choose a license: https://choosealicense.com/licenses/
DESCRIPTION lists what your package needs.
Guidelines:
Imports for fun() used in code under R/.
Depends for the base R version requirement.
Suggests for docs, tests, vignettes.
@importFrom dplyr mutate select.
Docs live in man/ as *.Rd (generated).
Run devtools::document() to update NAMESPACE + man/.
Place roxygen just above the function:
Useful tags: @title, @description, @details, @param, @return, @examples, @seealso, @author, @references, @import, @importFrom, @export
Document all user-facing functions; export some of them.
Rule: If you call functions from another package by bare name, import them with roxygen’s @importFrom so NAMESPACE lists them.
Qualified pkg::fun() calls also work without an @importFrom, provided the package is listed under Imports in DESCRIPTION; the convention followed here is bare names plus explicit imports.
Bare names without an import trigger R CMD check NOTEs like “no visible global function definition for ‘select’”, and users may get runtime errors.
Note: Base R functions (e.g., mean(), lm()) do not need imports.
Do (bare names + explicit imports):
Packaged data lives in data/ as .rda.
Use usethis::use_data() to serialize R objects.
Raw data and the scripts that process it go in data-raw/.
Create it with usethis::use_data_raw() (auto-adds to .Rbuildignore).
Reference: r-pkgs, “Preserve the origin story of package data”.
Two useful tags: @format and @source.
Like .gitignore, but for package builds.
Or maintain manually:
Long-form docs (articles, tutorials) built with R Markdown/Quarto.
Remember to list knitr and rmarkdown under Suggests.
The namespace controls how R looks up variables: it searches your package namespace, then imports, then base, then the regular search path.
@export to expose a function; @importFrom pkg fun to bring symbols in.
In R/reg_coef.R:
#' Compute regression coefficients
#'
#' @param x Design \code{matrix} or vector.
#' @param y Response \code{vector}.
#' @details Uses \link[stats]{lm} then \link[stats]{coef}.
#' @importFrom stats lm coef
#' @seealso \code{\link[stats]{lm}}, \code{\link[stats]{coef}}
#' @example inst/examples/eg_reg_coef.R
#' @export
`%r%` <- function(y, x) {
  fit <- lm(y ~ x)
  coef(fit)
}
Examples run during R CMD check.
Keep them in inst/examples/ and reference them:
inst/examples/eg_reg_coef.R:
If inst/examples/eg_reg_coef.R contains:
You’ll get a failing check:
Examples are for users; tests are for you (broader, automated).
“Whenever you are tempted to type something into a print statement or a debugger expression, write it as a test instead.” — Martin Fowler
Expectations (expect_*()).
Tests (test_that()).
Files (tests/testthat/test_reg_coef.R).
Example:
Run checks on multiple R versions/OS on every push/PR:
More examples: https://github.com/r-lib/actions/tree/master/examples
paths filters:
[skip ci]:
Measure what your tests execute:
Aim for high, but don’t chase 100% blindly — test the behavior that matters.
Quickly create a website from your docs/vignettes:
Update NEWS.md (usethis::use_news_md()).
Bump the version in DESCRIPTION (e.g., 0.1.0 → 0.2.0 for features).
Get devtools::check() clean locally (no ERROR/WARNING/NOTE).
Optionally run R CMD check --as-cran.
For users (install the released tag):
Note: Building from source may require Rtools (Windows) or Xcode CLT (macOS).
usethis::create_package("pkgtest")
R/
devtools::document()
usethis::use_testthat() → write tests/testthat/
devtools::check()
usethis::use_github_action_check_standard()
usethis::use_pkgdown() → pkgdown::build_site()
R: DESCRIPTION, NAMESPACE, roxygen, testthat, pkgdown
Python: setup.py/wheel, __init__.py, docstrings, pytest, MkDocs
./venv
pypkg/ # ← package root (like R pkg root)
├─ setup.py # ← like DESCRIPTION (+ build recipe)
├─ MANIFEST.in # ← include data files (like .Rbuildignore inverse)
├─ README.md
├─ pypkg/ # ← source (like R/ folder)
│ ├─ __init__.py # ← public API (like NAMESPACE role)
│ ├─ hello.py # ← hello()
│ ├─ regression.py # ← reg_coef()
│ ├─ data.py # ← load_snipes()
│ └─ data/
│ └─ snipes.csv # ← packaged dataset
├─ examples/
│ └─ eg_reg_coef.py # ← like inst/examples/...
├─ tests/ # ← pytest (like tests/testthat)
│ ├─ test_hello.py
│ ├─ test_regression.py
│ └─ test_data.py
We need to create these files step-by-step.
Open your terminal (VS Code: View → Terminal) and create a folder for your Python library.
pypkg/hello.py
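The code for this file is not reproduced above. As a minimal sketch, assuming hello() simply returns a greeting string (the argument name, default, and message are hypothetical, not taken from the slides), the module could look like this:

```python
"""pypkg/hello.py: a tiny function to confirm the package imports (hypothetical sketch)."""


def hello(name: str = "world") -> str:
    """Return a greeting for `name` (assumed behavior, for illustration only)."""
    return f"Hello, {name}!"


if __name__ == "__main__":
    print(hello())        # Hello, world!
    print(hello("PTDS"))  # Hello, PTDS!
```

Returning the string rather than printing it keeps the function easy to test.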
pypkg/__init__.py — expose the public API (like exporting in NAMESPACE)
(R parallel: R CMD build)
setup.py (simple & minimal)
Build:
Install & try:
(R parallel: data/snipes.rda + data-raw)
Ship the CSV and provide a loader (no pandas required).
pypkg/data.py
from importlib import resources
import csv, io
from typing import List, Dict

def load_snipes() -> List[Dict[str, str]]:
    """Load the bundled 'snipes' dataset (list of dict rows)."""
    # Anchor on the top-level package: pypkg/data/ has no __init__.py,
    # so "pypkg.data" is not a regular importable package on Python 3.9.
    path = resources.files("pypkg") / "data" / "snipes.csv"
    with path.open("rb") as fh:
        text = io.TextIOWrapper(fh, encoding="utf-8")
        return list(csv.DictReader(text))
MANIFEST.in: include data in the wheel
R parallel: use_data() writes data/*.rda; here we ship a CSV + loader.
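To make the loader's return type concrete, here is a small sketch of what csv.DictReader produces. It reads from an in-memory string instead of the packaged file, and the column names (id, weight) are hypothetical, since the real columns of snipes.csv are not shown here.

```python
import csv
import io

# Hypothetical stand-in for snipes.csv; the real column names are an assumption.
raw = "id,weight\n1,10.5\n2,11.0\n"

# DictReader yields one dict per row, keyed by the header line.
rows = list(csv.DictReader(io.StringIO(raw)))
print(rows[0])  # {'id': '1', 'weight': '10.5'}

# Every value arrives as a string, so convert explicitly when numbers are needed.
weights = [float(r["weight"]) for r in rows]
print(sum(weights) / len(weights))  # 10.75
```

Unlike an .rda file, a CSV carries no column types, so any type conversion belongs in the loader or in the calling code.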
(R parallel: data-raw/)
Keep raw files & scripts outside the wheel (tracked in Git):
R parallel: usethis::use_data_raw() + .Rbuildignore; Py: keep data_raw/ out of MANIFEST.in.
Use the loader’s docstring:
MkDocs (later) will render this automatically.
(R parallel: %r% with stats::lm)
Implement a tiny OLS with NumPy (no heavy deps).
pypkg/regression.py
from typing import Iterable, Tuple
import numpy as np

def reg_coef(y: Iterable[float], x: Iterable[float]) -> Tuple[float, float]:
    """
    Compute OLS coefficients for y ~ 1 + x.

    Returns
    -------
    (intercept, slope)
    """
    y = np.asarray(y, dtype=float).reshape(-1, 1)
    x = np.asarray(x, dtype=float).reshape(-1, 1)
    if y.shape[0] != x.shape[0]:
        raise ValueError("x and y must have the same length")
    X = np.c_[np.ones_like(x), x]  # design matrix: intercept column + x
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    b0, b1 = beta.ravel().tolist()
    return b0, b1
R parallel: returns coef(lm(y ~ x)).
(R parallel: inst/examples/eg_reg_coef.R)
examples/eg_reg_coef.py
Run:
pypkg/__init__.py — expose the public API (like exporting in NAMESPACE)
(R parallel: R CMD build)
setup.py (simple & minimal)
from setuptools import setup, find_packages

setup(
    name="pypkg",
    version="0.1.0",
    description="Tiny demo library",
    author="Your Name",
    packages=find_packages(include=["pypkg", "pypkg.*"]),
    install_requires=["numpy>=1.26"],
    include_package_data=True,  # needed with MANIFEST.in
    python_requires=">=3.9",
)
Build:
Install & try:
Install pytest (already done) and add three tests.
tests/test_hello.py
tests/test_regression.py
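The test file itself is not reproduced here. The sketch below shows the kind of checks it might contain, assuming reg_coef(y, x) returns (intercept, slope) as in pypkg/regression.py. So that the block runs on its own, it defines a pure-Python closed-form stand-in for reg_coef; inside the package you would instead write `from pypkg import reg_coef` (an assumed public API). The test names are hypothetical.

```python
import math


# Stand-in so the sketch is self-contained; in the package, import the real one.
def reg_coef(y, x):
    """Closed-form simple OLS: slope = cov(x, y) / var(x)."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


def test_recovers_exact_line():
    x = [0.0, 1.0, 2.0, 3.0]
    y = [1.0 + 2.0 * xi for xi in x]  # y = 1 + 2x, no noise
    b0, b1 = reg_coef(y, x)
    assert math.isclose(b0, 1.0, abs_tol=1e-9)
    assert math.isclose(b1, 2.0, abs_tol=1e-9)


def test_length_mismatch_raises():
    try:
        reg_coef([1.0, 2.0], [1.0])
    except ValueError:
        pass
    else:
        raise AssertionError("expected ValueError")
```

With pytest available, the second test is usually written with `pytest.raises(ValueError)`; the try/except form is used here only to avoid the extra import.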
Run:
(R parallel: @importFrom rule)
@importFrom pkg fun.
Clear imports → fewer undefined symbols, better tests & CI.
.github/workflows/python-ci.yml
name: Python CI (lite)
on:
  pull_request:
    paths:
      - "pypkg/**"
      - "tests/**"
      - "setup.py"
      - "MANIFEST.in"
      - ".github/workflows/**"
  workflow_dispatch:
concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true
jobs:
  test:
    if: "!contains(github.event.head_commit.message, '[skip ci]')"
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: "3.11", cache: "pip" }
      - run: python -m pip install -U pip
      - run: pip install -e . pytest
      - run: pytest -q
1) Install: Add MkDocs later to auto-render docstrings:
2) Configure
3) Write docs
Create docs/index.md (overview/quickstart) and docs/api.md for the API.
docs/index.md (example)
docs/api.md (auto API from docstrings)
4) Preview & build
0.1.0 → new features: MINOR; bugfix: PATCH; breaking: MAJOR.
git tag v0.1.0 && git push --tags
twine.
For users (install from GitHub):
load_snipes() with packaged CSV.
reg_coef() via NumPy OLS.
examples/eg_reg_coef.py (like inst/examples).
pytest (like testthat).
Heuristic: if you’d hesitate to fix a bug reported by someone else, don’t package yet.
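The version-bump rule from the release checklist (new features: MINOR; bugfix: PATCH; breaking: MAJOR) can be sketched as a small helper. The function name bump and the change labels are hypothetical, and the sketch assumes plain MAJOR.MINOR.PATCH strings with no pre-release suffix.

```python
def bump(version: str, change: str) -> str:
    """Return the next version string for a 'breaking', 'feature', or 'bugfix' change."""
    major, minor, patch = (int(part) for part in version.split("."))
    if change == "breaking":
        return f"{major + 1}.0.0"            # MAJOR resets MINOR and PATCH
    if change == "feature":
        return f"{major}.{minor + 1}.0"      # MINOR resets PATCH
    if change == "bugfix":
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown change type: {change!r}")


print(bump("0.1.0", "feature"))   # 0.2.0
print(bump("0.1.0", "bugfix"))    # 0.1.1
print(bump("0.1.0", "breaking"))  # 1.0.0
```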
Note
Context: The “spaghetti code” idea is often cited in post-mortems of complex systems (e.g., discussions around large automotive software stacks). Clear module boundaries and interfaces are a first line of defense (see the Toyota 2013 case study).

|>).
R has several OOP systems: S3, S4, R6, …
# Minimal S3 example: generic + methods
area <- function(x, ...) UseMethod("area") # generic
# constructor for a 'circle'
new_circle <- function(radius) structure(list(radius = radius), class = "circle")
area.circle <- function(x, ...) pi * x$radius^2
# constructor for a 'rectangle'
new_rectangle <- function(w, h) structure(list(w = w, h = h), class = "rectangle")
area.rectangle <- function(x, ...) x$w * x$h
c1 <- new_circle(2)
r1 <- new_rectangle(3, 4)
area(c1); area(r1)
The same verb (summary(), plot()) works across many data types.
summary() on this object?” → predictable, documented.
In R, this is powered by S3: a lightweight dispatch system that maps a generic (like summary) to a method (like summary.lm) based on the object’s class.
Min. 1st Qu. Median Mean 3rd Qu. Max.
4.0 12.0 15.0 15.4 19.0 25.0
Call:
lm(formula = cars$speed ~ cars$dist)
Residuals:
Min 1Q Median 3Q Max
-7.5293 -2.1550 0.3615 2.4377 6.4179
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 8.28391 0.87438 9.474 1.44e-12 ***
cars$dist 0.16557 0.01749 9.464 1.49e-12 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.156 on 48 degrees of freedom
Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12
summary.double
summary.numeric
=> summary.default
=> summary.lm
* summary.default
[1] "numeric"
[1] "lm"
* method exists; => method selected.
function (object, ...)
UseMethod("summary")
<bytecode: 0x5568d595b0c8>
<environment: namespace:base>
function (object, ..., digits, quantile.type = 7)
{
    if (is.factor(object))
        return(summary.factor(object, ...))
    else if (is.matrix(object)) {
        if (missing(digits))
function (object, correlation = FALSE, symbolic.cor = FALSE,
    ...)
{
    z <- object
    p <- z$rank
    rdf <- z$df.residual
...: forwarding extra arguments
If a method isn’t found for the 1st class, R tries the next, and so on.
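The same fallback idea can be sketched in Python, where functools.singledispatch walks the class's method resolution order (MRO) instead of an S3 class vector. The class names Model, LinearModel, and Other are hypothetical and exist only for this illustration.

```python
from functools import singledispatch


class Model:
    pass


class LinearModel(Model):  # inherits from Model, like class = c("LinearModel", "Model")
    pass


class Other:
    pass


@singledispatch
def summarize(obj):
    # Default implementation: used when no registered class matches.
    return "default summary"


@summarize.register
def _(obj: Model):
    return "Model summary"


# No implementation is registered for LinearModel, so dispatch falls back
# to the next class in its MRO that has one (Model).
print(summarize(LinearModel()))  # Model summary
print(summarize(Other()))        # default summary
print([c.__name__ for c in LinearModel.__mro__])  # ['LinearModel', 'Model', 'object']
```

The default function plays the role of an S3 `generic.default` method.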
Caution
How can we make this constructor more robust?
pixel...@singledispatch provide an R-S3-like feel when helpful.
class defines a new type.
__init__ is the constructor (like R’s new_*()).
self is the instance being operated on. It’s just a naming convention (not a keyword), but always the first parameter of instance methods.
@singledispatch
from functools import singledispatch
import statistics as stats

@singledispatch
def describe(x):
    return f"Generic object of type {type(x).__name__}"

@describe.register(list)
def _(x: list):
    return {"len": len(x), "mean": stats.mean(x)}

class LinReg:
    def __init__(self, coef, intercept):
        self.coef, self.intercept = coef, intercept

@describe.register(LinReg)
def _(m: LinReg):
    return {"coef": m.coef, "intercept": m.intercept}

describe([1,2,3]), describe(LinReg([0.5], 2.0))
Note
Rule of thumb in Python:
Use class methods when you control the class design.
Use @singledispatch when you want an external, pluggable generic (like S3) that different modules can extend without modifying the original class.
save doesn’t care what the writer is, only that it has .write().
Pass in behavior (composition), don’t inherit it, when all you need is a capability.
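The save function this slide refers to is not shown, so the following is only a sketch of the idea under assumptions: save takes a record and any object exposing .write(str). The JSON-lines format and the ListWriter class are hypothetical choices for illustration.

```python
import io
import json


def save(record: dict, writer) -> None:
    """Write `record` as one JSON line to anything that has a .write(str) method."""
    writer.write(json.dumps(record, sort_keys=True) + "\n")


class ListWriter:
    """Hypothetical writer that only collects what it is given (no inheritance needed)."""

    def __init__(self):
        self.lines = []

    def write(self, text: str) -> int:
        self.lines.append(text)
        return len(text)


buf = io.StringIO()          # a standard-library writer
save({"k": 5, "n": 50}, buf)
print(buf.getvalue(), end="")  # {"k": 5, "n": 50}

collector = ListWriter()     # a custom writer with the same capability
save({"k": 7}, collector)
print(collector.lines)       # ['{"k": 7}\n']
```

Because save depends only on the .write() capability, a file handle, a socket wrapper, or a test double can be passed in without changing it.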
(R parallel: plot())
import numpy as np, matplotlib.pyplot as plt
from functools import singledispatch

class Pixel:
    def __init__(self, data):
        self.data = np.array(data)

@singledispatch
def plot(obj):
    raise TypeError(f"Don't know how to plot {type(obj).__name__}")

@plot.register
def _(obj: Pixel):
    plt.imshow(obj.data); plt.title("Pixel"); plt.show()

rng = np.random.default_rng(123)
image = rng.gamma(shape=2.0, scale=1.0, size=(10,10))
px = Pixel(image)
plot(px)
functools.singledispatch for generic functions
inspect for introspection, abc for abstract base classes
Generic function
R: summary(x) → summary.<class> via S3 dispatch.
Python: @singledispatch picks implementation by type.
Class identity
R: class(x) is an attribute; can be a vector (ordered inheritance).
Python: type(x) is the class; inheritance via MRO chain.
Extensibility
R: generic.class <- function(x, ...) {} for known generics (mind ...).
Python: @generic.register(Type) or subclass and override methods.
Scoping / environments vs namespaces
Varargs
R: ... ↔ Python: *args, **kwargs.
summary, plot, coef, …).
@singledispatch cover both class methods and generic functions.
Goal: Create a simple “model” object that computes a moving-average fit and immediately works with summary() and plot() via S3.
# Tiny moving average helper (base R)
movavg <- function(z, k = 5) {
  stopifnot(k %% 2 == 1, k >= 3)
  n <- length(z); out <- rep(NA_real_, n); h <- k %/% 2
  for (i in (1 + h):(n - h)) out[i] <- mean(z[(i - h):(i + h)])
  out
}
# Constructor returning an S3 object of class "my_smooth"
my_smooth <- function(x, y, k = 5) {
  stopifnot(length(x) == length(y))
  o <- order(x); x <- x[o]; y <- y[o]
  structure(
    list(x = x, y = y, k = k, fitted = movavg(y, k), call = match.call()),
    class = "my_smooth"
  )
}
# Summary method
summary.my_smooth <- function(object, ...) {
  res <- object$y - object$fitted
  c(n = length(object$y),
    k = object$k,
    mse = mean(res^2, na.rm = TRUE))
}
# Plot method
plot.my_smooth <- function(x, ..., col_points = "grey40", pch = 19) {
  plot(x$x, x$y, col = col_points, pch = pch,
       xlab = "x", ylab = "y", main = "my_smooth fit", ...)
  lines(x$x, x$fitted, lwd = 2)
  legend("topleft", bty = "n", legend = paste0("k = ", x$k))
}
# Fit on 'cars' and use generic verbs
m <- my_smooth(cars$dist, cars$speed, k = 5)
summary(m)
Takeaways
A new class ("my_smooth") with a tiny constructor.
With summary.my_smooth and plot.my_smooth, existing verbs just work.
Goal: Mirror the R idea: a small smoother class, plus generic summarize() and plot() that dispatch on type.
import numpy as np

class Smooth:
    def __init__(self, x, y, k=5):
        assert k % 2 == 1 and k >= 3
        idx = np.argsort(x)
        self.x = np.array(x)[idx]
        self.y = np.array(y)[idx]
        self.k = k
        self.fitted = self._movavg(self.y, k)

    def _movavg(self, z, k):
        n = len(z); out = np.full(n, np.nan); h = k // 2
        for i in range(h, n - h):
            out[i] = np.mean(z[i - h:i + h + 1])
        return out
from functools import singledispatch
import numpy as np, matplotlib.pyplot as plt

@singledispatch
def summarize(obj):
    raise TypeError(f"No summarize() for {type(obj).__name__}")

@summarize.register
def _(obj: Smooth):
    res = obj.y - obj.fitted
    return {"n": len(obj.y), "k": obj.k, "mse": float(np.nanmean(res**2))}

@singledispatch
def plot(obj):
    raise TypeError(f"No plot() for {type(obj).__name__}")

@plot.register
def _(obj: Smooth):
    plt.scatter(obj.x, obj.y, s=15)
    plt.plot(obj.x, obj.fitted, linewidth=2)
    plt.title(f"Smooth (k={obj.k})")
    plt.xlabel("x"); plt.ylabel("y"); plt.show()

# Example with 'cars'-like data (replace with real arrays in class)
dist = [d for d in range(2, 122, 2)]
rng = np.random.default_rng(123)
speed = (0.3*np.array(dist) + rng.normal(0, 3, len(dist))).tolist()
m = Smooth(dist, speed, k=5)
summarize(m), plot(m)
Same verb, different types
R: summary(m) → finds summary.my_smooth.
Python: summarize(m) → @singledispatch picks the Smooth implementation.
Extensibility
Add another class (e.g., my_tree) and define just two methods:
R: summary.my_tree(), plot.my_tree()
Python: @summarize.register(MyTree), @plot.register(MyTree)
Separation of concerns
Setup (given): you have my_smooth(x, y, k) (constructor) with
summary.my_smooth() and plot.my_smooth() from the case study.
A. Create and use an object
1) Build three models with different windows: k = 3, k = 7, k = 11.
2) Call summary() on each.
3) In 1–2 lines: which k fits best (lower MSE) on your data?
B. Add a tiny exporter
Add an S3 method to convert your object to a simple list:
Then run: d <- as.list(m7); names(d)
Why is exporting to a plain list handy (think: saving, APIs, tests)?
C. Make a simple prediction
Add a predict() S3 method using linear interpolation along the fitted curve:
Try: predict(m7, c(10, 20, 30)) and print the results.
D. Plot (optional)
Plot your three models and briefly describe how increasing k changes smoothness:
Setup (given): you have Smooth(x, y, k=5) from the case study, plus:
A. Create and use an object
1) Build three Smooth models with k = 3, k = 7, k = 11.
2) Call summarize() on each.
3) In 1–2 lines: which k fits best (lower MSE) on your data?
B. Add a tiny method
Add this method inside Smooth:
Then do: d = m.to_dict() and print the keys. Why is this useful?
C. Make a simple prediction
Add inside Smooth:
Try predict([10, 20, 30]) and print the results.
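The method body for this exercise is not reproduced above. As one possible sketch of linear interpolation along the fitted curve, the standalone function below could be adapted into a method. The name interp_predict is hypothetical, and returning NaN outside the fitted range is an assumption rather than the course's stated behavior. It assumes xs is sorted ascending, as the Smooth constructor guarantees for self.x.

```python
import math
from bisect import bisect_right


def interp_predict(xs, fitted, new_x):
    """Linearly interpolate `fitted` at each value of `new_x` (NaN outside the fit)."""
    # The moving average is undefined at the edges, so keep only defined points.
    pts = [(x, f) for x, f in zip(xs, fitted) if not math.isnan(f)]
    px = [p[0] for p in pts]
    pf = [p[1] for p in pts]
    out = []
    for x0 in new_x:
        if x0 < px[0] or x0 > px[-1]:
            out.append(math.nan)       # no extrapolation in this sketch
            continue
        j = bisect_right(px, x0)       # px[j-1] <= x0 < px[j]
        if j == len(px):               # x0 sits exactly on the last knot
            out.append(pf[-1])
            continue
        w = (x0 - px[j - 1]) / (px[j] - px[j - 1])
        out.append(pf[j - 1] + w * (pf[j] - pf[j - 1]))
    return out


xs = [0.0, 1.0, 2.0, 3.0, 4.0]
fitted = [math.nan, 1.0, 2.0, 4.0, math.nan]
print(interp_predict(xs, fitted, [1.5, 2.5, 3.0]))  # [1.5, 3.0, 4.0]
```

With NumPy already a dependency, np.interp on the non-NaN points is a shorter alternative; the pure-Python version is shown to make each step visible.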
D. Plot (optional)
Make a function:
Plot your three models; in 1–2 lines, say how increasing k changes smoothness.
A paradigm emphasizing pure functions (no side effects), immutability, and declarative style.
Benefits: maintainable, predictable, and scalable (parallelizable) code.
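A short Python sketch of the distinction, using hypothetical function names: square is pure, noisy depends on hidden global random state (the same reason rnorm is not pure in R), and noisy_explicit makes the generator an explicit argument so that results are reproducible for identically seeded generators. map and functools.reduce stand in for the loops they replace.

```python
import random
from functools import reduce


def square(x):
    """Pure: the result depends only on x, and nothing else is touched."""
    return x * x


def noisy(x):
    """Impure: reads and advances the module-level random state."""
    return x + random.gauss(0, 1)


def noisy_explicit(x, rng):
    """Randomness passed in explicitly; reproducible given an equally seeded rng."""
    return x + rng.gauss(0, 1)


values = [1, 2, 3, 4]

# Functionals instead of for loops: apply a function to each element, then fold.
squares = list(map(square, values))
total = reduce(lambda acc, v: acc + v, values, 0)
print(squares)  # [1, 4, 9, 16]
print(total)    # 10

# Two generators with the same seed give the same draw.
same = noisy_explicit(1.0, random.Random(42)) == noisy_explicit(1.0, random.Random(42))
print(same)     # True
```

noisy_explicit still advances the generator it receives, so it is not pure in the strict sense; it only moves the hidden dependency into the signature.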
Key concepts:
Is rnorm a pure function?
rnorm is not pure.
From for loops to functionals: squares (R)
From for loops to functionals: sum
purrr::map() (R)
map / lapply return a list; sometimes you want an atomic vector.
Use map_lgl(), map_int(), map_dbl(), map_chr() or base sapply / vapply.
There are situations where the function you want to pass does not exist yet. Use an inline anonymous function (aka lambda).
purrr::map()
pmap generalization (R)
sapply (R)
mapply generalizes sapply to many inputs.
Map vectorizes over all arguments (no extra non-vectorized input allowed).
outer product (R)
Reduce applies a binary function iteratively to elements of a vector.
R’s vectorization
Vectorize
ifelse
parallel ships with R and offers parallelized *apply variants.
mclapply (R)
parLapply (R)
map/filter, comprehensions, itertools, functools.reduce/partial/lru_cache.
concurrent.futures / multiprocessing.
None pattern for dynamic defaults.
purrr cheatsheet.
HEC Lausanne · Business Analytics · Thu 9:00–12:00