Lecture 2 — Programming Foundations

Data & Code Management: From Collection to Application

Samuel Orso

2025-10-02

Big picture

  • LLMs (ChatGPT, Gemini, Claude, LLaMA): great for ideation, scaffolding code, refactoring; not ground truth.
  • Quarto = literate programming hub: one doc, code+text+figures → HTML/PDF/slides.
  • Git/GitHub: version control + collaboration (branches, PRs, reviews).
  • Reproducibility: lock environments, script everything, document decisions.

LLMs — use wisely

  • Strengths: productivity, learning, boilerplate, code review.
  • Risks: hallucinations, missing context, ethics/licensing.
  • Good habits: give clear prompts + context, ask for tests, verify outputs.

Quarto essentials

  • Markdown + YAML header (title, format, options).
  • Code chunks (R/Python), chunk options: echo, eval, warning, fig-*.
  • Figures & tables: knitr::kable(), kableExtra; equations (Mathpix helpful).
  • Live preview; keep content modular.

Git/GitHub workflow

  1. Small commits with clear messages.
  2. Branch → PR → review → merge.
  3. Handle conflicts; use issues/templates.
  4. “New habits”: automate repeated steps; write READMEs.

Reproducibility kit

  • R: renv::init(); snapshot(); restore()
  • Python: venv/Poetry/Conda + lockfile
  • Determinism: pin versions; set seeds; record data sources.
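The "set seeds" bullet can be made concrete with a minimal Python sketch (the same idea applies to R's set.seed()); this is an illustrative pattern, not a prescribed workflow:

```python
# Seed every source of randomness you use, once, at the top of the script.
import random

random.seed(42)                    # Python's stdlib RNG
draws_a = [random.random() for _ in range(3)]

random.seed(42)                    # re-seeding reproduces the same stream
draws_b = [random.random() for _ in range(3)]
assert draws_a == draws_b

# With NumPy, prefer an explicit generator: rng = np.random.default_rng(42)
```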

From last time — tasks

  • Install: Git, GitHub account, Quarto, R (+renv), Python (venv/Conda).
  • Create a starter repo with a Quarto doc; push to GitHub.
  • Try one mini-exercise (R / Python / Quarto / GitHub).
  • Optional bonus: present a short solution next practical.

Data structures

Note

Today: Primary types, vectors/lists, matrices/arrays, data frames/tibbles, pandas DataFrame, dates, subsetting, coercion rules, and common ops — in both R and Python.

Primary (scalar) types

  • integer: 2L, 12L
  • double (real): -4, 12.4532, 6
  • logical: TRUE, FALSE (also T, F)
  • character: "a", "Bonjour"
  • (also: complex, raw)
typeof(12.5); typeof(2L); typeof(TRUE); typeof("hi")
  • int: -4, 6
  • float: 12.4532
  • bool: True, False
  • str: 'a', "Bonjour"
  • (also: complex, bytes)
type(12.5), type(2), type(True), type("hi")

Reserved / special values

  • Missing: NA, typed variants NA_real_, NA_integer_, NA_character_, …
  • Infinity / NaN: Inf, NaN
  • Null object: NULL
  • Control words: if, else, repeat, while, function, for, in, next, break
c(NA, Inf, NaN, NULL)
  • Missing: None (the null object); numeric missing is typically math.nan / numpy.nan, or pandas.NA
  • Infinity / NaN: math.inf, float('inf'), math.nan
  • Keywords: if, else, for, while, def, in, break, continue, class, …
import math
[None, math.inf, math.nan]

Inspecting and coercing types

typeof(2L)              # "integer"
is.double(2L)           # FALSE
as.double(2L)           # 2
type(2) is int          # True
isinstance(2.0, float)  # True
int(2.9), float("3.1"), str(True)

Nondeterminism in LLMs

Floating-point arithmetic in GPUs exhibits non-associativity, meaning \((a+b)+c\neq a+(b+c)\) due to finite precision and rounding errors. This property directly impacts the computation of attention scores and logits in the transformer architecture, where parallel operations across multiple threads can yield different results based on execution order.
— Horace He and collaborators at Thinking Machines

Note

Read the full article here

Testing floating-point non-associativity

What do you obtain if you run this in Python/R?

(0.1 + 1e20) - 1e20
0.1 + (1e20 - 1e20)

If you replace 20 by another smaller integer, do you get the same result? At which integer do you start to get the same result? Why?

Testing floating-point non-associativity

options(digits = 22)
f <- function(n) (0.1 + 10^n) - 10^n
c(n20 = f(20), n16 = f(16), n15 = f(15), n14 = f(14))
0.1 + (10^20 - 10^20)
import math
def f(n): return (0.1 + 10.0**n) - 10.0**n
print({ 'n20': f(20), 'n16': f(16), 'n15': f(15), 'n14': f(14) })
print(0.1 + (10.0**20 - 10.0**20))

Key idea: near large magnitudes, the ULP is so big that small addends vanish or round to the nearest ULP, breaking associativity.
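The ULP claim can be checked directly in Python (math.ulp requires Python ≥ 3.9); a small sketch:

```python
import math

# One "unit in the last place" (ULP): the gap between adjacent doubles near 1e20.
ulp = math.ulp(1e20)
assert ulp == 16384.0            # far larger than 0.1
assert 0.1 < ulp / 2             # so 1e20 + 0.1 rounds back to 1e20 exactly
assert (0.1 + 1e20) - 1e20 == 0.0
assert 0.1 + (1e20 - 1e20) == 0.1
```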

Homogeneous vs heterogeneous structures

| Dimension | R homogeneous | R heterogeneous     | Python homogeneous | Python heterogeneous |
|-----------|---------------|---------------------|--------------------|----------------------|
| 1         | atomic vector | list                | NumPy array        | list, tuple          |
| 2         | matrix        | data.frame / tibble | NumPy ndarray (2D) | pandas DataFrame     |
| n         | array         | —                   | NumPy ndarray (nD) | —                    |

Rule of thumb: R vectors are homogeneous; Python lists are heterogeneous. For homogeneous numerics in Python, use NumPy.

Vectors (R) vs Lists / Arrays (Python)

x <- c(1, 2, 8, 10)
length(x); typeof(x)
x = [1, 2, 8, 10]
len(x), type(x)
import numpy as np
x = np.array([1, 2, 8, 10])
x.dtype, x.shape

Assignment

grand_slam_win <- c(16, 19, 20, 0, 0)
grand_slam_win = [16, 19, 20, 0, 0]

Subsetting — key differences

  • Index origin: R is 1-based, Python is 0-based.
  • Negative indices: R -i excludes element i; Python -i indexes from the end.
  • Logical indexing: R uses logical vectors; Python uses list comprehensions / boolean masks (NumPy/pandas).
x <- c(10, 20, 30, 40)
x[1]           # 10
x[-1]          # drop first
x[c(TRUE,FALSE,TRUE,FALSE)]  # logical mask
x = [10, 20, 30, 40]
x[0]           # 10
x[-1]          # last element (40)
# logical mask via comprehension
[xi for i, xi in enumerate(x) if i in (0,2)]
import numpy as np
x = np.array([10,20,30,40])
x[[0,2]]
mask = np.array([True, False, True, False])
x[mask]

Coercion in homogeneous containers

  • R vector coercion: logical < integer < double < character.
  • Python list: no automatic coercion.
  • NumPy array: coerces to common dtype.
c(TRUE, 12, 0.5)  # coerces to double
[True, 12, 0.5]   # stays mixed types
import numpy as np
np.array([True, 12, 0.5])    # dtype: float64

Attributes & names

grand_slam_win <- c(16, 19, 20, 0, 0)
names(grand_slam_win) <- c("Novak Djokovic","Rafael Nadal","Roger Federer","Daniil Medvedev","Dominic Thiem")
attr(grand_slam_win, "date") <- "2019-09-30"
attributes(grand_slam_win)
import pandas as pd
s = pd.Series([16,19,20,0,0], index=[
    "Novak Djokovic","Rafael Nadal","Roger Federer","Daniil Medvedev","Dominic Thiem"])
s.attrs["date"] = "2019-09-30"
s.attrs

Sequences

1:3
seq_len(3)
seq(1, 2.8, by = 0.4)
seq(1, 2.8, length.out = 6)
rep(c(1,2), times = 3, each = 1)
list(range(1,4))
list(range(3,0,-1))

# numpy linspace/arange
import numpy as np
np.linspace(1, 2.8, 6)
np.arange(1, 2.9, 0.4)
# repetition
[1,2]*3

Useful vector ops

length(grand_slam_win)
sum(grand_slam_win)
mean(grand_slam_win)
order(grand_slam_win)
sort(grand_slam_win)
x = [16,19,20,0,0]
len(x), sum(x)
sorted(x)

# mean via statistics or numpy
import statistics as st
st.mean(x)

Matrices / 2D arrays

M <- matrix(1:12, nrow = 3, ncol = 4)
M; is.matrix(M); dim(M); nrow(M); ncol(M)
t(M)
# Elementwise vs matrix mult
M * M
M %*% t(M)
import numpy as np
M = np.arange(1,13).reshape(3,4)
M, M.shape
M.T
M * M          # elementwise
M @ M.T        # matrix mult

Lists (R) vs dict/list (Python)

num_vec <- c(188, 140)
char_vec <- c("Height", "Weight", "Length")
logic_vec <- rep(TRUE, 8)
my_mat <- matrix(seq_len(10), nrow = 2, ncol = 5)
my_list <- list(number = num_vec, character = char_vec, logic = logic_vec, matrix = my_mat)
typeof(my_list)
my_list[["matrix"]][,3]
my_dict = {
  "number": [188, 140],
  "character": ["Height","Weight","Length"],
  "logic": [True]*8,
  "matrix": [[1,2,3,4,5],[6,7,8,9,10]]
}
my_dict["matrix"][0][2]

Data frames / tibbles vs pandas DataFrame

players <- c("Novak Djokovic","Rafael Nadal","Roger Federer","Daniil Medvedev","Dominic Thiem")
grand_slam_win <- c(16,19,20,0,0)
date_of_birth <- c("1987-05-22","1986-06-03","1981-08-08","1996-02-11","1993-09-03")
(tennis <- data.frame(date_of_birth, grand_slam_win, row.names = players))
is.data.frame(tennis)
str(tennis)
import pandas as pd
players = ["Novak Djokovic","Rafael Nadal","Roger Federer","Daniil Medvedev","Dominic Thiem"]
df = pd.DataFrame({
    "date_of_birth": ["1987-05-22","1986-06-03","1981-08-08","1996-02-11","1993-09-03"],
    "grand_slam_win": [16,19,20,0,0]
}, index=players)
df.info()
df.loc[:, ["grand_slam_win", "date_of_birth"]]

Subsetting data frames

# Like a list
tennis_cols <- tennis[c("grand_slam_win", "date_of_birth")]
# Like a matrix
same_cols <- tennis[, c("grand_slam_win", "date_of_birth")]
cols = df[["grand_slam_win", "date_of_birth"]]  # column subset
rowcol = df.loc[["Rafael Nadal", "Roger Federer"], ["grand_slam_win"]]

Dates

players_dob <- as.Date(c("22 May 1987","3 Jun 1986","8 Aug 1981","11 Feb 1996","3 Sep 1993"), format = "%d %b %Y")
players_dob
import pandas as pd
pd.to_datetime(["22 May 1987","3 Jun 1986","8 Aug 1981","11 Feb 1996","3 Sep 1993"], format="%d %b %Y")

Mini-exercises (quick wins)

  1. Indexing: Let x <- 3 * seq_len(4) (R) / x = [3*i for i in range(1,5)] (Py). Select the 2nd element via (a) positive, (b) negative, (c) logical (or comprehension) indexing.
  2. Sorting: Sort x in descending order using (a) indices and (b) a built-in sorter.
  3. Coercion: In R, evaluate c(TRUE, 2, "3") — what happens? In Python, compare [True, 2, "3"] and np.array([True, 2, "3"]).
  4. Matrix ops: Create a 3×3 matrix / array and compute its transpose and A %*% A (R) / A @ A (Py).

Optional: Show your answers in a short .qmd or notebook.

Challenge exercise (R & Python)

Tennis leaderboard

  • Build a table with players, DoB, wins.
  • Add a computed column: age in years (use Sys.Date() / pd.Timestamp.today()), then sort by age descending.
  • Plot wins vs age.

Hints:

  • R: mutate, lubridate::time_length(interval, 'years'), plot() or ggplot2.
  • Python: pd.to_datetime, age via dt, df.sort_values, and matplotlib.

Appendix — Python execution in knitr

If Python chunks don’t execute:

  1. Ensure reticulate sees the right interpreter:

    reticulate::py_config()
  2. Point to an environment that has numpy / pandas if used:

    reticulate::use_condaenv("dacm", required = TRUE)
  3. Or disable execution of Python chunks during render with chunk option #| eval: false.

Appendix — Cheatsheet

  • Indexing: R 1-based; Python 0-based.
  • Negative indices: R excludes; Python counts from end.
  • Homogeneous numerics: R vectors; Python NumPy arrays.
  • Tabular: R data.frame/tibble; Python pandas DataFrame.
  • Missing: R NA; Python None/NaN/pd.NA (context-dependent).
  • Matrix mult: R %*%; Python @ (NumPy/pandas).
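The "context-dependent" missing-value bullet can be made concrete; a small sketch (assuming pandas is available) contrasting None, math.nan, and pd.NA:

```python
import math
import pandas as pd

# NaN is not equal to itself; None is a singleton object.
assert math.nan != math.nan
assert None is None

# pandas' NA propagates through comparisons instead of returning a bool.
assert (pd.NA == 1) is pd.NA

# In a Series, all three are treated as missing by isna().
s = pd.Series([None, math.nan, pd.NA], dtype="object")
assert s.isna().all()
```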

Mini-exercise 1: Indexing

Task: Let x <- 3 * seq_len(4) (R) / x = [3*i for i in range(1,5)] (Py). Select the 2nd element via (a) positive, (b) negative, (c) logical/comprehension.

x <- 3 * seq_len(4)   # 3 6 9 12
x[2]                   # (a) positive
x[-c(1,3,4)]           # (b) negative (keep only 2)
x[c(FALSE, TRUE, FALSE, FALSE)]  # (c) logical
x = [3*i for i in range(1,5)]   # [3,6,9,12]
x[1]                             # (a) positive (0-based)
x[-3]                            # (b) negative index from end -> element 2
[xi for i, xi in enumerate(x) if i == 1]  # (c) comprehension

Mini-exercise 2: Sort descending

Task: Sort x in descending order using (a) indices and (b) a built-in sorter.

x <- 3 * seq_len(4)
x[order(x, decreasing = TRUE)]   # via indices
sort(x, decreasing = TRUE)       # via sorter
x = [3,6,9,12]
[xi for _, xi in sorted(enumerate(x), key=lambda t: t[1], reverse=True)]  # via indices
sorted(x, reverse=True)                                                 # via sorter

Mini-exercise 3: Coercion

Task: In R, evaluate c(TRUE, 2, "3") — what happens? In Python, compare [True, 2, "3"] and np.array([True, 2, "3"]).

val <- c(TRUE, 2, "3")
val
typeof(val)  # "character"
# Python list stays heterogeneous
val = [True, 2, "3"]
val, [type(v).__name__ for v in val]

# NumPy coerces to a common dtype
#| eval: false
import numpy as np
np.array([True, 2, "3"])  # dtype('<U...') (string)

Mini-exercise 4: Matrix ops

Task: Create a 3×3 matrix / array and compute its transpose and A %*% A (R) / A @ A (Py).

A <- matrix(1:9, 3, 3)
A; t(A); A %*% A
import numpy as np
A = np.arange(1,10).reshape(3,3)
A, A.T, A @ A

Challenge: Tennis leaderboard

Task: Build table (players, DoB, wins); add age (years); sort by age desc; plot wins vs age.

players <- c("Novak Djokovic","Rafael Nadal","Roger Federer","Daniil Medvedev","Dominic Thiem")
dob <- as.Date(c("1987-05-22","1986-06-03","1981-08-08","1996-02-11","1993-09-03"))
wins <- c(16,19,20,0,0)

df <- data.frame(player = players, dob = dob, wins = wins, stringsAsFactors = FALSE)
now <- Sys.Date()
df$age_years <- as.numeric(difftime(now, df$dob, units = "days")) / 365.2425

df2 <- df[order(-df$age_years), ]
print(df2)

plot(df2$age_years, df2$wins, xlab = "Age (years)", ylab = "Grand Slam wins")
import pandas as pd
from datetime import date
players = ["Novak Djokovic","Rafael Nadal","Roger Federer","Daniil Medvedev","Dominic Thiem"]
df = pd.DataFrame({
    "player": players,
    "dob": pd.to_datetime(["1987-05-22","1986-06-03","1981-08-08","1996-02-11","1993-09-03"]),
    "wins": [16,19,20,0,0]
})
now = pd.Timestamp.today().normalize()
df["age_years"] = (now - df["dob"]).dt.days / 365.2425

df2 = df.sort_values("age_years", ascending=False)
print(df2)

ax = df2.plot.scatter(x="age_years", y="wins")
ax.set_xlabel("Age (years)"); ax.set_ylabel("Grand Slam wins")

Challenge: Tennis leaderboard - 2nd solution

Task: Build table (players, DoB, wins); add age (years); sort by age desc; plot wins vs age.

library(dplyr)
library(tibble)
library(ggplot2)

ref_date <- as.Date("2025-10-08")

rankings <- tribble(
  ~rank, ~player,             ~country, ~points, ~dob,          ~gs_titles,
      1, "Carlos Alcaraz",    "ESP",     11540,  as.Date("2003-05-05"), 6,
      2, "Jannik Sinner",     "ITA",     10950,  as.Date("2001-08-16"), 4,
      3, "Alexander Zverev",  "GER",      5980,  as.Date("1997-04-20"), 0,
      4, "Taylor Fritz",      "USA",      4995,  as.Date("1997-10-28"), 0,
      5, "Novak Djokovic",    "SRB",      4830,  as.Date("1987-05-22"), 24
) |>
  mutate(age = floor(as.numeric(ref_date - dob) / 365.2425))

rankings

# Simple points bar chart
rankings |>
  ggplot(aes(x = reorder(player, points), y = points)) +
  geom_col() +
  coord_flip() +
  labs(x = NULL, y = "Ranking points",
       title = "ATP Singles — Top 5 (week of 2025‑10‑06)") +
  theme_minimal(base_size = 12)

# Optional: Grand Slam titles lollipop
rankings |>
  ggplot(aes(x = reorder(player, gs_titles), y = gs_titles)) +
  geom_segment(aes(xend = player, y = 0, yend = gs_titles)) +
  geom_point(size = 3) +
  coord_flip() +
  labs(x = NULL, y = "Grand Slam singles titles",
       title = "Grand Slam titles (singles)") +
  theme_minimal(base_size = 12)
import pandas as pd
import matplotlib.pyplot as plt
from datetime import datetime

ref_date = datetime(2025, 10, 8)

data = {
    "rank":   [1, 2, 3, 4, 5],
    "player": ["Carlos Alcaraz","Jannik Sinner","Alexander Zverev","Taylor Fritz","Novak Djokovic"],
    "country":["ESP","ITA","GER","USA","SRB"],
    "points": [11540, 10950, 5980, 4995, 4830],
    "dob":    pd.to_datetime(["2003-05-05","2001-08-16","1997-04-20","1997-10-28","1987-05-22"]),
    "gs_titles": [6, 4, 0, 0, 24]
}
rankings = pd.DataFrame(data)
rankings["age"] = ((ref_date - rankings["dob"]).dt.days / 365.2425).astype(int)
rankings

# Points bar chart (horizontal)
ax = rankings.sort_values("points").plot(kind="barh", x="player", y="points", legend=False)
ax.set_xlabel("Ranking points")
ax.set_title("ATP Singles — Top 5 (week of 2025‑10‑06)")
plt.tight_layout(); plt.show()

# Optional: Grand Slam titles lollipop-style
ordered = rankings.sort_values("gs_titles")
plt.figure()
for i, (p, v) in enumerate(zip(ordered["player"], ordered["gs_titles"])):
    plt.plot([0, v], [i, i])
    plt.plot(v, i, marker="o")
plt.yticks(range(len(ordered)), ordered["player"])
plt.xlabel("Grand Slam singles titles")
plt.title("Grand Slam titles (singles)")
plt.tight_layout(); plt.show()

Control structures

Note

Today: Booleans & logical ops, choices (if/else, switch/match), loops (for, while, repeat), short-circuiting, vectorised conditionals, comprehension/map/apply, performance tips — in R & Python.

Two families

  • Choices: decide a path based on conditions.
  • Loops: repeat a block of code.

We’ll contrast R and Python patterns side-by-side.

Logical operators — scalars

>, <, >=, <=, ==, !=, !, &&, ||

4 > 3
1 >= 1
!(2 > 1)
TRUE && TRUE
(1 > 1) || (2 < 3)

&& and || are short-circuit and require length-one operands: since R 4.3 a longer operand is an error (older versions silently used only the first element).

>, <, >=, <=, ==, !=, not, and, or

4 > 3
1 >= 1
not (2 > 1)
(True and True), ((1 > 1) or (2 < 3))

Logical operators — vectors/arrays

Use elementwise & and | for vectors/matrices; all(), any(), xor().

c(TRUE, FALSE) | c(TRUE, TRUE)
c(TRUE, TRUE) & c(TRUE, FALSE)
xor(TRUE, TRUE)
all(c(TRUE, FALSE, FALSE))
any(c(TRUE, FALSE, FALSE))
# Short-circuit vs elementwise
c(TRUE, FALSE) | c(TRUE, FALSE)
# NOTE: c(TRUE,FALSE) || c(TRUE,FALSE) errors in R >= 4.3 (older R used only the first element)

Use & and | for elementwise logic with parentheses; np.all, np.any.

import numpy as np
A = np.array([True, False])
B = np.array([True, True])
(A | B), (A & B)

In pandas/NumPy: do not use and/or with arrays; use &/| and wrap comparisons in parentheses.

Truthiness pitfalls

  • R: if(c(TRUE,FALSE)) → error in R ≥ 4.2 ("the condition has length > 1"); older versions warned and used only the first element.
  • Python (lists): any non-empty list is truthy: if [False]: → True.
  • NumPy/pandas: if np.array([True, False]) → error (ambiguous). Use .any() / .all().
import numpy as np
x = np.array([True, False])
# if x:  # ValueError: ambiguous truth value
if x.any():
    print("has a True")
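The list pitfall from the bullets above, executed directly:

```python
# Plain lists: truthiness means "non-empty", not "contains True".
assert bool([False]) is True     # non-empty list -> truthy, despite the False inside
assert bool([]) is False         # empty list -> falsy
assert any([False]) is False     # any()/all() inspect the elements instead
assert all([]) is True           # vacuous truth: all() of an empty iterable
```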

Selection operators

if — basic

x <- -4
if (x < 0) {
  x <- -x
}
if (x %% 2 == 0) {
  cat(x, "is even\n")
}
x = -4
if x < 0:
    x = -x
if x % 2 == 0:
    print(f"{x} is even")

if / else

x <- 3
if (x %% 2 == 0) {
  cat("even\n")
} else {
  cat("odd\n")
}
x = 3
if x % 2 == 0:
    print("even")
else:
    print("odd")

if / else if / else

x <- 0
if (x == 0) {
  cat("zero\n")
} else if (x %% 2 == 0) {
  cat("even\n")
} else {
  cat("odd\n")
}
x = 0
if x == 0:
    print("zero")
elif x % 2 == 0:
    print("even")
else:
    print("odd")

Vectorised conditionals

x <- 1:10
ifelse(x %% 2 == 0, "even", "odd")
import numpy as np, pandas as pd
x = np.arange(1,11)
np.where(x % 2 == 0, "even", "odd")
# pandas Series
s = pd.Series(range(1,11))
s.mod(2).map({0: "even", 1: "odd"})

switch (R) vs match/case (Python ≥3.10)

operator <- "+"
switch(operator,
  "+" = 20 + 5,
  "-" = 20 - 5,
  "*" = 20 * 5,
  "/" = 20 / 5,
  stop("Unknown operator")
)
operator = "+"
match operator:
    case "+":
        20 + 5
    case "-":
        20 - 5
    case "*":
        20 * 5
    case "/":
        20 / 5
    case _:
        raise ValueError("Unknown operator")
ops = {
    "+": lambda a,b: a+b,
    "-": lambda a,b: a-b,
    "*": lambda a,b: a*b,
    "/": lambda a,b: a/b
}
ops.get("+", lambda a,b: None)(20,5)

Loops — overview

  • R: for, while, repeat, with break, next.
  • Python: for (over iterables), while, with break, continue, and the lesser-known for … else / while … else.

for loops

for (number in 1:6) {
  print(number)
}
for (i in 1:10) {
  if (i %% 2 == 0) next
  print(i)
}
for number in range(1,7):
    print(number)
for i in range(1,11):
    if i % 2 == 0:
        continue
    print(i)

while loops

i <- 1
while (i <= 6) {
  print(i)
  i <- i + 1
}
i = 1
while i <= 6:
    print(i)
    i += 1

Infinite loops & safety

  • Easy to create by mistake.
  • Prefer clear termination; add guards and break.
counter <- 0
repeat {
  counter <- counter + 1
  if (counter >= 3) break
}
counter = 0
while True:
    counter += 1
    if counter >= 3:
        break

Python’s for … else

The else runs only if the loop did not break.

for n in range(2, 10):
    for x in range(2, n):
        if n % x == 0:
            break
    else:
        print(n, "is prime")

Vectorisation & apply-family

  • Prefer vectorised ops / builtins (rowMeans, colSums, …).
  • apply(X, MARGIN, FUN, …) for arrays; lapply/sapply for lists; or purrr::map*.
set.seed(321)
A <- matrix(rexp(30), ncol = 3, nrow = 10)
colMeans(A)
apply(A, 2, mean)
  • Prefer NumPy vectorisation; in pandas use column-wise ops; DataFrame.apply as last resort.
import numpy as np, pandas as pd
A = np.random.exponential(size=(10,3))
A.mean(axis=0)         # column means

Performance sketch

# install.packages("microbenchmark")  # if needed
# microbenchmark::microbenchmark(
#   for(i in 1:ncol(A)){},
#   apply(A, 2, mean),
#   colMeans(A)
# )

Rule: prefer builtins and vectorised code.

# import timeit
# timeit.timeit("sum(range(1000))", number=10000)

Rule: prefer vectorised NumPy/pandas ops over Python-level loops.
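A runnable version of the commented timeit sketch above; timings vary by machine, so only the agreement of results is asserted:

```python
import timeit
import numpy as np

x = np.arange(100_000, dtype=np.float64)

def loop_sum(a):
    total = 0.0
    for v in a:            # Python-level loop over a NumPy array: slow
        total += v
    return total

# Both give the same answer; np.sum runs in compiled code.
assert loop_sum(x) == float(np.sum(x))

t_loop = timeit.timeit(lambda: loop_sum(x), number=1)
t_vec = timeit.timeit(lambda: np.sum(x), number=1)
# On typical hardware t_vec << t_loop, but exact ratios vary by machine.
```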

Mini‑exercises (quick)

  1. Short‑circuit: What do these return? In R: c(T,F) | c(T,F) vs c(T,F) || c(T,F). In Python/NumPy: (A | B) vs (A or B) with boolean arrays.
  2. FizzBuzz: Print numbers 1–20; for multiples of 3 print Fizz, of 5 Buzz, of both FizzBuzz.
  3. Masking: Given x <- 1:10 / x = np.arange(1,11), replace odds by NA/np.nan using vectorised tools.
  4. Switch/match: Implement a tiny calculator using R switch and Python match.

Challenge — Find the first even square > 1000

Write two versions (R & Python):

  • A loop-based solution with a break when found.
  • A vectorised solution.

Stretch: benchmark both approaches.

Cheatsheet

  • R elementwise: &, |; short-circuit: &&, || (first element only).
  • Python booleans: and, or, not; elementwise with NumPy/pandas: &, | (+ parentheses!).
  • if/elif/else (Py) ~ if/else if/else (R).
  • Vectorised conditionals: ifelse (R), np.where/Series.where (Py).
  • Multi-way: switch (R), match/case or dict dispatch (Py).
  • Loops: for, while, repeat (R); for, while, for … else (Py).
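Why the "(+ parentheses!)" note matters: in Python, & and | bind tighter than comparisons, so an unparenthesised mask parses in a surprising way. A small sketch:

```python
import numpy as np

x = np.arange(1, 11)

# `&` binds tighter than `<`/`>`, so parentheses are required:
ok = (x > 2) & (x < 7)                 # elementwise AND, as intended
assert list(x[ok]) == [3, 4, 5, 6]

# Without parentheses, `x > 2 & x < 7` parses as `x > (2 & x) < 7`,
# a chained comparison on arrays, which raises an ambiguity error.
try:
    bad = x > 2 & x < 7
except ValueError:
    pass
```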

Mini-exercise 1: Short‑circuit vs elementwise

Task: What do these return? In R: c(T,F) | c(T,F) vs c(T,F) || c(T,F). In Python/NumPy: (A | B) vs (A or B) with boolean arrays.

c(T,F) | c(T,F)        # elementwise OR -> TRUE FALSE
c(T,F) || c(T,F)       # an error with recent R; previously used only first element -> TRUE
# Python lists
[True, False] or [True, False]

# Elementwise with NumPy
#| eval: false
import numpy as np
A = np.array([True, False]); B = np.array([True, False])
A | B, A & B

Mini-exercise 2: FizzBuzz (1–20)

Task: Print numbers 1–20; for multiples of 3 print Fizz, of 5 Buzz, of both FizzBuzz.

for (i in 1:20) {
  if (i %% 15 == 0) cat("FizzBuzz\n")
  else if (i %% 3 == 0) cat("Fizz\n")
  else if (i %% 5 == 0) cat("Buzz\n")
  else cat(i, "\n")
}
for i in range(1, 21):
    if i % 15 == 0: print("FizzBuzz")
    elif i % 3 == 0: print("Fizz")
    elif i % 5 == 0: print("Buzz")
    else: print(i)

Mini-exercise 3: Masking odds to NA/NaN

Task: Given x <- 1:10 / x = np.arange(1,11), replace odds by NA/np.nan using vectorised tools.

x <- 1:10
x[x %% 2 == 1] <- NA_integer_
x
import numpy as np
x = np.arange(1,11, dtype=float)
x[x % 2 == 1] = np.nan
x

Mini-exercise 4: Tiny calculator (switch/match)

Task: Implement a tiny calculator using R switch and Python match.

calc <- function(a, b, op) switch(op,
  "+" = a + b,
  "-" = a - b,
  "*" = a * b,
  "/" = a / b,
  stop("unknown op")
)
calc(20,5,"+")
# Python 3.10+
import sys
if sys.version_info >= (3,10):
    def calc(a,b,op):
        match op:
            case "+": return a + b
            case "-": return a - b
            case "*": return a * b
            case "/": return a / b
            case _: raise ValueError("unknown op")
    calc(20,5,"+")
else:
    ops = {"+": lambda a,b: a+b, "-": lambda a,b: a-b, "*": lambda a,b: a*b, "/": lambda a,b: a/b}
    ops["+"](20,5)

Challenge: First even square > 1000

Task: Write two versions (R & Python):

  • A loop-based solution with a break when found.
  • A vectorised solution.
# Loop solution
n <- 1
repeat {
  s <- n^2
  if (s > 1000 && s %% 2 == 0) { ans <- list(n=n, s=s); break }
  n <- n + 1
}
ans

# Vectorised solution
n <- seq_len(100)
sq <- n^2
cbind(n, sq)[which(sq > 1000 & sq %% 2 == 0)[1], ]
# Loop solution
n = 1
while True:
    s = n*n
    if s > 1000 and s % 2 == 0:
        ans = {"n": n, "s": s}
        break
    n += 1
ans

# Vectorised (NumPy)
#| eval: false
import numpy as np
n = np.arange(1, 101)
sq = n**2
idx = np.where((sq > 1000) & (sq % 2 == 0))[0][0]
int(n[idx]), int(sq[idx])

Functions

Note

Today: Function anatomy, arguments & matching, returns, errors/warnings/messages (vs Python exceptions), scope & environments (R) / LEGB (Python), higher‑order & closures, composition & piping, docs & tests — R & Python side‑by‑side.

“Everything is a function call” (and a first‑class object)

  • Evaluate a constant:
18.10
  • Names / quoted names:
sqrt
`+`
  • Function call:
1.1 + 2.1
`+`(1.1, 2.1)
  • Evaluate a constant, names, and call:
18.10
abs
(1.1).__add__(2.1)

Operators are methods/functions under the hood (see operator module), and functions are first‑class values.
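A small sketch of the operator-module point: every operator is available as a plain, first-class function.

```python
import operator

# Operators as plain functions (what `a + b` dispatches to):
assert operator.add(1.1, 2.1) == 1.1 + 2.1
assert (1.1).__add__(2.1) == operator.add(1.1, 2.1)

# Being first-class values, they can be passed around like any function:
ops = {"+": operator.add, "-": operator.sub, "*": operator.mul, "/": operator.truediv}
assert ops["*"](20, 5) == 100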

Function components

  • Arguments, body, environment.
my_div <- function(numerator, denominator) {
  div <- numerator / denominator
  return(div)
}
formals(my_div)
body(my_div)
environment(my_div)
  • Parameters, body, globals/closure.
def my_div(numerator, denominator):
    div = numerator / denominator
    return div
my_div.__code__.co_varnames, my_div.__defaults__, my_div.__closure__

Returns

Last expression is returned if return() omitted.

f <- function(x) { x + 1 }
f(2)

Must return explicitly (else returns None).

def f(x):
    x + 1
print(f(2))  # None

Passing arguments

  • Positional, exact by name, partial (prefix) matching.
my_div(1, 2)
my_div(numerator = 1, denominator = 2)
my_div(n = 1, d = 2)  # partial match (works but discouraged)
  • Positional, keyword, defaults, *args, **kwargs.
  • Keyword‑only params (after *), positional‑only (before /).
def g(a, b=1, *args, scale=1, **kw):
    return (a + b) * scale

g(1, 2, scale=3)

def h(x, /, y, *, z=0):  # pos‑only x; keyword‑only z
    return x + y + z
h(1, 2, z=3)

Assignment vs argument binding in R

  • Both = and <- assign values, but inside calls = binds arguments.
my_div(numerator = 2, denominator = 1)
numerator             # error: object 'numerator' not found
my_div(numerator <- 2, denominator = 1)  # assigns globally first
numerator             # now exists in workspace

Type checks & friendly failures

my_div <- function(numerator, denominator) {
  if (any(!is.numeric(numerator), !is.numeric(denominator))) {
    stop("`numerator` and `denominator` must be numeric")
  }
  numerator / denominator
}
my_div("numerator", "denominator")
class NotNumericError(TypeError):
    pass

def my_div(numerator, denominator):
    from numbers import Number
    if not isinstance(numerator, Number) or not isinstance(denominator, Number):
        raise NotNumericError("numerator and denominator must be numeric")
    return numerator / denominator
# my_div("numerator", "denominator")  # raises

Warnings / messages vs warnings / logging

warning("Size mismatch; recycling may occur")
message("Starting the division…")
import warnings, logging
warnings.warn("Size mismatch; check inputs")
logging.basicConfig(level=logging.INFO)
logging.info("Starting the division…")

Dimensions & vectorisation

A <- matrix(1:9, ncol = 3)
B <- matrix(10:18, ncol = 3)
A / B  # vectorised elementwise
import numpy as np
A = np.arange(1,10).reshape(3,3)
B = np.arange(10,19).reshape(3,3)
A / B  # elementwise

Robust my_div (shape checks)

my_div <- function(numerator, denominator) {
  if (any(!is.numeric(numerator), !is.numeric(denominator))) {
    stop("`numerator` and `denominator` must be numeric")
  }
  if (!identical(dim(numerator), dim(denominator))) {
    # fall back to length check for vectors
    if (is.null(dim(numerator)) && is.null(dim(denominator))) {
      if (length(numerator) != length(denominator))
        stop("Lengths must match for vectors")
    } else {
      stop("Dimensions must match for arrays/matrices")
    }
  }
  numerator / denominator
}
import numpy as np

def my_div(numerator, denominator):
    a = np.asarray(numerator)
    b = np.asarray(denominator)
    if a.shape != b.shape:
        raise ValueError("Shapes must match")
    return a / b

Scope: R environments vs Python LEGB

  • Dynamic lookup: names resolved when function runs.
  • Name masking: inner names shadow outer names.
f <- function() x * x
f()            # error, x not found
x <- 10
f()            # now 100
x <- 10
f <- function(){ x <- 1; x * x }
f()            # 1
  • Inspect environments:
environment()
globalenv()
emptyenv()
x = 10

def f():
    x = 1   # local shadows global
    return x * x
f()

y = 0

def bump():
    global y
    y += 1
bump(); y


def make_adder(k):
    def add(x):
        return x + k  # closes over k (enclosing)
    return add
add5 = make_adder(5)
add5(3)

Closures & higher‑order functions

make_adder <- function(k) {
  function(x) x + k
}
add5 <- make_adder(5)
add5(3)
# Map over a vector
lapply(1:5, add5)
from functools import partial

def add(x, y):
    return x + y
add5 = partial(add, 5)
list(map(add5, range(1,6)))

Composition & piping

  • Base pipe |> and magrittr %>%.
`%big%` <- function(x, y) 10 * x * y  # custom infix
1 %big% 2

# composition via nesting / piping
sqrt(log1p(9))
9 |>
  log1p() |>
  sqrt()
  • Nest calls or write a tiny compose.
from math import sqrt, log1p

def compose(f, g):
    return lambda x: f(g(x))

h = compose(sqrt, log1p)
h(9)

Documentation

#' Divide two numbers
#'
#' @param numerator,denominator Numeric scalars or vectors.
#' @return Numeric result.
#' @examples
#' my_div(4, 2)
my_div <- function(numerator, denominator) numerator / denominator
from typing import Union
Number = Union[int, float]

def my_div(numerator: Number, denominator: Number) -> float:
    """Divide two numbers.

    Args:
        numerator: dividend
        denominator: divisor
    Returns:
        The quotient as float.
    """
    return numerator / denominator

Testing (unit tests)

# install.packages("testthat")
# test_that("my_div works", {
#   expect_equal(my_div(4,2), 2)
#   expect_error(my_div("a", 2))
# })
# def test_my_div():
#     assert my_div(4,2) == 2
#     import pytest
#     with pytest.raises(Exception):
#         my_div("a", 2)
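A self-contained, runnable version of the commented pytest sketch above (plain asserts, no pytest needed; with pytest installed, a function named test_* in a test file is discovered automatically):

```python
def my_div(numerator, denominator):
    # bool is a subclass of int; accept ints and floats only for this sketch
    if not isinstance(numerator, (int, float)) or not isinstance(denominator, (int, float)):
        raise TypeError("numerator and denominator must be numeric")
    return numerator / denominator

def test_my_div():
    assert my_div(4, 2) == 2
    try:
        my_div("a", 2)
    except TypeError:
        pass
    else:
        raise AssertionError("expected a TypeError")

test_my_div()
```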

Mini‑exercises (quick)

  1. Safe divide: Write safe_div in R & Python that returns NA/math.nan when denominator is 0 and emits a warning.
  2. Prefix matching: In R, define f <- function(numerator, denominator) {...} and call it via partial names. Why is this risky? Rewrite to avoid it.

Appendix — Python in this deck & pitfalls

Render with R/knitr + reticulate and point to your env:

install.packages("reticulate")
reticulate::use_condaenv("dacm", required = TRUE)
reticulate::py_config()

Python gotcha — mutable defaults:

# Bad
# def append_item(x, lst=[]):
#     lst.append(x); return lst
# Good
# def append_item(x, lst=None):
#     if lst is None: lst = []
#     lst.append(x); return lst
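The mutable-default gotcha above, executed: the default list is created once, at definition time, and shared across calls.

```python
def append_bad(x, lst=[]):
    lst.append(x)
    return lst

assert append_bad(1) == [1]
assert append_bad(2) == [1, 2]   # surprise: state leaks across calls

def append_good(x, lst=None):
    if lst is None:
        lst = []                 # fresh list on every call
    lst.append(x)
    return lst

assert append_good(1) == [1]
assert append_good(2) == [2]
```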

R note — lazy defaults: evaluated when first used, not at definition.

Cheatsheet

  • R returns last expr; Python needs return.
  • R: positional/exact/partial arg matching; Python: positional/keyword, *args/**kwargs, / & * markers.
  • R errors/warnings/messages vs Python exceptions/warnings/logging.
  • R lexical scoping & environments vs Python LEGB with global/nonlocal.
  • Vectorise: R builtins; Python use NumPy/pandas.
  • Composition: R |>/%>% & custom %op%; Python nesting/compose/method chaining.

Mini-exercise 1: Safe divide

Task: Write safe_div in R & Python that returns NA/math.nan when denominator is 0 and emits a warning.

safe_div <- function(a, b) {
  ifelse(b == 0, {
    warning("denominator is 0; returning NA")
    NA_real_
  }, a / b)
}
safe_div(c(1,2,3), c(1,0,2))


# alternatively
safe_div <- function(a, b) {
  if(any(b == 0)) {
    warning("denominator is 0; returning NA")
  }
  
  x <- rep(NA_real_, length(a))
  x[b != 0] <- a[b != 0] / b[b != 0]
  
  return(x)
}
import math, warnings

def safe_div(a, b):
    try:
        return a / b
    except ZeroDivisionError:
        warnings.warn("denominator is 0; returning NaN")
        return math.nan

print(safe_div(1,1))
print(safe_div(2,0))

Mini-exercise 2: Prefix matching (R) & safer calls

Task: In R, define f <- function(numerator, denominator) {...} and call it via partial names. Why is this risky? Rewrite to avoid it.

f <- function(numerator, denominator) numerator / denominator
f(n = 1, d = 2)              # works via partial matching (risky)

# Make it safer:
options(warnPartialMatchArgs = TRUE)  # warn on partial matching
f(numerator = 1, denominator = 2)     # explicit names
# Not applicable (Python has no partial name matching for keywords).
pass