Lecture 2 — Programming Foundations

Data & Code Management: From Collection to Application

Samuel Orso

2025-10-02

Big picture

LLMs (ChatGPT, Gemini, Claude, LLaMA): great for ideation, scaffolding code, refactoring; not ground truth.
Quarto = literate programming hub: one doc, code+text+figures → HTML/PDF/slides.
Git/GitHub: version control + collaboration (branches, PRs, reviews).
Reproducibility: lock environments, script everything, document decisions.

LLMs — use wisely

Strengths: productivity, learning, boilerplate, code review.
Risks: hallucinations, missing context, ethics/licensing.
Good habits: give clear prompts + context, ask for tests, verify outputs.

Quarto essentials

Markdown + YAML header (title, format, options).
Code chunks (R/Python), chunk options: echo, eval, warning, fig-*.
Figures & tables: knitr::kable(), kableExtra; equations (Mathpix helpful).
Live preview; keep content modular.

Git/GitHub workflow

Small commits with clear messages.
Branch → PR → review → merge.
Handle conflicts; use issues/templates.
“New habits”: automate repeated steps; write READMEs.

Reproducibility kit

R: renv::init(); snapshot(); restore()
Python: venv/Poetry/Conda + lockfile
Determinism: pin versions; set seeds; record data sources.

From last time — tasks

Install: Git, GitHub account, Quarto, R (+renv), Python (venv/Conda).
Create a starter repo with a Quarto doc; push to GitHub.
Try one mini-exercise (R / Python / Quarto / GitHub).
Optional bonus: present a short solution next practical.

Data structures

Note

Today: Primary types, vectors/lists, matrices/arrays, data frames/tibbles, pandas DataFrame, dates, subsetting, coercion rules, and common ops — in both R and Python.

Primary (scalar) types

R
Python

integer: 2L, 12L
double (real): -4, 12.4532, 6
logical: TRUE, FALSE (also T, F)
character: "a", "Bonjour"
(also: complex, raw)

typeof(12.5); typeof(2L); typeof(TRUE); typeof("hi")

int: -4, 6
float: 12.4532
bool: True, False
str: 'a', "Bonjour"
(also: complex, bytes)

type(12.5), type(2), type(True), type("hi")

Reserved / special values

R
Python

Missing: NA, typed variants NA_real_, NA_integer_, NA_character_, …
Infinity / NaN: Inf, NaN
Null object: NULL
Control words: if, else, repeat, while, function, for, in, next, break

c(NA, Inf, NaN, NULL)

Missing: None (object), numerical missing often math.nan or numpy.nan / pandas.NA
Infinity / NaN: math.inf, float('inf'), math.nan
Keywords: if, else, for, while, def, in, break, continue, class, …

import math
[None, math.inf, math.nan]
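The special values above behave differently under comparison, which is a common source of bugs. A minimal sketch using only the standard library; the variable names are illustrative and not part of the lecture material:

```python
import math

values = [None, math.inf, math.nan]

# None is a singleton: test it with `is`, not `==`
none_flags = [v is None for v in values]

# NaN is not equal to anything, including itself; use math.isnan instead
nan_equals_itself = math.nan == math.nan
nan_flags = [isinstance(v, float) and math.isnan(v) for v in values]

# Infinity compares as larger than any finite float
inf_is_largest = math.inf > 1e308

print(none_flags, nan_equals_itself, nan_flags, inf_is_largest)
```

The same idea applies in R, where `is.na()` and `is.nan()` are used instead of `==`.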

Inspecting and coercing types

R: typeof, is.*, as.*
Python: type, isinstance, casting

typeof(2L)              # "integer"
is.double(2L)           # FALSE
as.double(2L)           # 2

type(2) is int          # True
isinstance(2.0, float)  # True
int(2.9), float("3.1"), str(True)  # (2, 3.1, 'True'): int() truncates toward zero

Nondeterminism in LLMs

Floating-point arithmetic in GPUs exhibits non-associativity, meaning \((a+b)+c\neq a+(b+c)\) due to finite precision and rounding errors. This property directly impacts the computation of attention scores and logits in the transformer architecture, where parallel operations across multiple threads can yield different results based on execution order.
Horace He in collaboration with others at Thinking Machines

Note

Read the full article here

Testing floating-point non-associativity

What do you obtain if you run this in Python/R?

(0.1 + 1e20) - 1e20
0.1 + (1e20 - 1e20)

If you replace 20 by another smaller integer, do you get the same result? At which integer do you start to get the same result? Why?

Testing floating-point non-associativity

R
Python

options(digits = 22)
f <- function(n) (0.1 + 10^n) - 10^n
c(n20 = f(20), n16 = f(16), n15 = f(15), n14 = f(14))
0.1 + (10^20 - 10^20)

import math
def f(n): return (0.1 + 10.0**n) - 10.0**n
print({ 'n20': f(20), 'n16': f(16), 'n15': f(15), 'n14': f(14) })
print(0.1 + (10.0**20 - 10.0**20))

Key idea: near large magnitudes the ULP (unit in the last place, the gap between adjacent doubles) is so large that a small addend either vanishes or is rounded to a multiple of the ULP, which breaks associativity.
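The size of the ULP can be inspected directly. A minimal sketch, assuming Python 3.9 or later (for `math.ulp`); the helper `f` mirrors the one used above but builds the power from an exact integer:

```python
import math

def f(n):
    """(0.1 + 10^n) - 10^n, evaluated left to right in double precision."""
    big = float(10 ** n)  # exact integer power, converted to a double
    return (0.1 + big) - big

for n in (14, 15, 16, 20):
    big = float(10 ** n)
    # math.ulp(big) is the gap between big and the next larger double
    print(n, math.ulp(big), f(n))
```

When 0.1 is smaller than half the ULP, the sum rounds back to the large number and the addend is lost; otherwise 0.1 is rounded to the nearest multiple of the ULP.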

Homogeneous vs heterogeneous structures

Dimension	R homogeneous	R heterogeneous	Python homogeneous	Python heterogeneous
1	atomic vector	list	NumPy array	list, tuple
2	matrix	data.frame / tibble	NumPy ndarray (2D)	pandas DataFrame
n	array	—	NumPy ndarray (nD)	—

Rule of thumb: R vectors are homogeneous; Python lists are heterogeneous. For homogeneous numerics in Python, use NumPy.

Vectors (R) vs Lists / Arrays (Python)

R — atomic vector
Python — list (heterogeneous)
Python — NumPy array (homogeneous)

x <- c(1, 2, 8, 10)
length(x); typeof(x)

x = [1, 2, 8, 10]
len(x), type(x)

import numpy as np
x = np.array([1, 2, 8, 10])
x.dtype, x.shape

Assignment

R
Python

grand_slam_win <- c(16, 19, 20, 0, 0)

grand_slam_win = [16, 19, 20, 0, 0]

Subsetting — key differences

Index origin: R is 1-based, Python is 0-based.
Negative indices: R -i excludes element i; Python -i indexes from the end.
Logical indexing: R uses logical vectors; Python uses list comprehensions / boolean masks (NumPy/pandas).

R
Python (list)
Python (NumPy)

x <- c(10, 20, 30, 40)
x[1]           # 10
x[-1]          # drop first
x[c(TRUE,FALSE,TRUE,FALSE)]  # logical mask

x = [10, 20, 30, 40]
x[0]           # 10
x[-1]          # last element (40)
# logical mask via comprehension
mask = [True, False, True, False]
[xi for xi, keep in zip(x, mask) if keep]

import numpy as np
x = np.array([10,20,30,40])
x[[0,2]]
mask = np.array([True, False, True, False])
x[mask]

Coercion in homogeneous containers

R vector coercion: logical < integer < double < character.
Python list: no automatic coercion.
NumPy array: coerces to common dtype.

R
Python (list)
Python (NumPy)

c(TRUE, 12, 0.5)  # coerces to double

[True, 12, 0.5]   # stays mixed types

import numpy as np
np.array([True, 12, 0.5])    # dtype: float64

Attributes & names

R
Python (pandas)

grand_slam_win <- c(16, 19, 20, 0, 0)
names(grand_slam_win) <- c("Novak Djokovic","Rafael Nadal","Roger Federer","Daniil Medvedev","Dominic Thiem")
attr(grand_slam_win, "date") <- "2019-09-30"
attributes(grand_slam_win)

import pandas as pd
s = pd.Series([16,19,20,0,0], index=[
    "Novak Djokovic","Rafael Nadal","Roger Federer","Daniil Medvedev","Dominic Thiem"])
s.attrs["date"] = "2019-09-30"
s.attrs

Sequences

R
Python

1:3
seq_len(3)
seq(1, 2.8, by = 0.4)
seq(1, 2.8, length.out = 6)
rep(c(1,2), times = 3, each = 1)

list(range(1,4))
list(range(3,0,-1))

# numpy linspace/arange
import numpy as np
np.linspace(1, 2.8, 6)
np.arange(1, 2.9, 0.4)
# repetition
[1,2]*3

Useful vector ops

R
Python

length(grand_slam_win)
sum(grand_slam_win)
mean(grand_slam_win)
order(grand_slam_win)
sort(grand_slam_win)

x = [16,19,20,0,0]
len(x), sum(x)
sorted(x)

# mean via statistics or numpy
import statistics as st
st.mean(x)
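R's `order()` has no single built-in counterpart for Python lists. One possible sketch, using only `sorted` with a key function; the names `order` and `by_order` are illustrative:

```python
x = [16, 19, 20, 0, 0]

# 0-based positions that would sort x ascending (ties keep their original order)
order = sorted(range(len(x)), key=lambda i: x[i])

# Applying the positions reproduces sorted(x)
by_order = [x[i] for i in order]

print(order)     # positions are 0-based, whereas R's order() is 1-based
print(by_order)
```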

Matrices / 2D arrays

R — matrix
Python — numpy.ndarray

M <- matrix(1:12, nrow = 3, ncol = 4)
M; is.matrix(M); dim(M); nrow(M); ncol(M)
t(M)
# Elementwise vs matrix mult
M * M
M %*% t(M)

import numpy as np
M = np.arange(1,13).reshape(3,4, order="F")  # column-major fill, like R's matrix()
M, M.shape
M.T
M * M          # elementwise
M @ M.T        # matrix mult

Lists (R) vs dict/list (Python)

R — list (heterogeneous)
Python — dict / list

num_vec <- c(188, 140)
char_vec <- c("Height", "Weight", "Length")
logic_vec <- rep(TRUE, 8)
my_mat <- matrix(seq_len(10), nrow = 2, ncol = 5)
my_list <- list(number = num_vec, character = char_vec, logic = logic_vec, matrix = my_mat)
typeof(my_list)
my_list[["matrix"]][,3]

my_dict = {
  "number": [188, 140],
  "character": ["Height","Weight","Length"],
  "logic": [True]*8,
  "matrix": [[1,3,5,7,9],[2,4,6,8,10]]   # rows of R's column-major 2x5 matrix
}
[row[2] for row in my_dict["matrix"]]    # third column, like my_list[["matrix"]][,3]

Data frames / tibbles vs pandas DataFrame

R
Python

players <- c("Novak Djokovic","Rafael Nadal","Roger Federer","Daniil Medvedev","Dominic Thiem")
grand_slam_win <- c(16,19,20,0,0)
date_of_birth <- c("1987-05-22","1986-06-03","1981-08-08","1996-02-11","1993-09-03")
(tennis <- data.frame(date_of_birth, grand_slam_win, row.names = players))
is.data.frame(tennis)
str(tennis)

import pandas as pd
players = ["Novak Djokovic","Rafael Nadal","Roger Federer","Daniil Medvedev","Dominic Thiem"]
df = pd.DataFrame({
    "date_of_birth": ["1987-05-22","1986-06-03","1981-08-08","1996-02-11","1993-09-03"],
    "grand_slam_win": [16,19,20,0,0]
}, index=players)
df.info()
df.loc[:, ["grand_slam_win", "date_of_birth"]]

Subsetting data frames

R
Python

# Like a list
tennis_cols <- tennis[c("grand_slam_win", "date_of_birth")]
# Like a matrix
same_cols <- tennis[, c("grand_slam_win", "date_of_birth")]

cols = df[["grand_slam_win", "date_of_birth"]]  # column subset
rowcol = df.loc[["Rafael Nadal", "Roger Federer"], ["grand_slam_win"]]

Dates

R
Python

players_dob <- as.Date(c("22 May 1987","3 Jun 1986","8 Aug 1981","11 Feb 1996","3 Sep 1993"), format = "%d %b %Y")
players_dob

import pandas as pd
pd.to_datetime(["22 May 1987","3 Jun 1986","8 Aug 1981","11 Feb 1996","3 Sep 1993"], format="%d %b %Y")

Mini-exercises (quick wins)

Indexing: Let x <- 3 * seq_len(4) (R) / x = [3*i for i in range(1,5)] (Py). Select the 2nd element via (a) positive, (b) negative, (c) logical (or comprehension) indexing.
Sorting: Sort x in descending order using (a) indices and (b) a built-in sorter.
Coercion: In R, evaluate c(TRUE, 2, "3") — what happens? In Python, compare [True, 2, "3"] and np.array([True, 2, "3"]).
Matrix ops: Create a 3×3 matrix / array and compute its transpose and A %*% A (R) / A @ A (Py).

Optional: Show your answers in a short .qmd or notebook.

Challenge exercise (R & Python)

Tennis leaderboard

Build a table with players, DoB, wins.
Add a computed column: age in years (use Sys.Date() / pd.Timestamp.today()), then sort by age descending.
Plot wins vs age.

Hints:

R: mutate, lubridate::time_length(interval, 'years'), plot() or ggplot2.
Python: pd.to_datetime, age via dt, df.sort_values, and matplotlib.

Appendix — Python execution in knitr

If Python chunks don’t execute:

Ensure reticulate sees the right interpreter:
```
reticulate::py_config()
```
Point to an environment that has numpy / pandas if used:
```
reticulate::use_condaenv("dacm", required = TRUE)
```
Or disable execution of Python chunks during render with chunk option #| eval: false.

Appendix — Cheatsheet

Indexing: R 1-based; Python 0-based.
Negative indices: R excludes; Python counts from end.
Homogeneous numerics: R vectors; Python NumPy arrays.
Tabular: R data.frame/tibble; Python pandas DataFrame.
Missing: R NA; Python None/NaN/pd.NA (context-dependent).
Matrix mult: R %*%; Python @ (NumPy/pandas).

Mini-exercise 1: Indexing

Task: Let x <- 3 * seq_len(4) (R) / x = [3*i for i in range(1,5)] (Py). Select the 2nd element via (a) positive, (b) negative, (c) logical/comprehension.

R
Python

x <- 3 * seq_len(4)   # 3 6 9 12
x[2]                   # (a) positive
x[-c(1,3,4)]           # (b) negative (keep only 2)
x[c(FALSE, TRUE, FALSE, FALSE)]  # (c) logical

x = [3*i for i in range(1,5)]   # [3,6,9,12]
x[1]                             # (a) positive (0-based)
x[-3]                            # (b) negative index from end -> element 2
[xi for i, xi in enumerate(x) if i == 1]  # (c) comprehension

Mini-exercise 2: Sort descending

Task: Sort x in descending order using (a) indices and (b) a built-in sorter.

R
Python

x <- 3 * seq_len(4)
x[order(x, decreasing = TRUE)]   # via indices
sort(x, decreasing = TRUE)       # via sorter

x = [3,6,9,12]
[xi for _, xi in sorted(enumerate(x), key=lambda t: t[1], reverse=True)]  # via indices
sorted(x, reverse=True)                                                 # via sorter

Mini-exercise 3: Coercion

Task: In R, evaluate c(TRUE, 2, "3") — what happens? In Python, compare [True, 2, "3"] and np.array([True, 2, "3"]).

R
Python

val <- c(TRUE, 2, "3")
val
typeof(val)  # "character"

# Python list stays heterogeneous
val = [True, 2, "3"]
val, [type(v).__name__ for v in val]

# NumPy coerces to a common dtype
#| eval: false
import numpy as np
np.array([True, 2, "3"])  # dtype('<U...') (string)

Mini-exercise 4: Matrix ops

Task: Create a 3×3 matrix / array and compute its transpose and A %*% A (R) / A @ A (Py).

R
Python

A <- matrix(1:9, 3, 3)
A; t(A); A %*% A

import numpy as np
A = np.arange(1,10).reshape(3,3, order="F")  # column-major fill, like R's matrix()
A, A.T, A @ A

Challenge: Tennis leaderboard

Task: Build table (players, DoB, wins); add age (years); sort by age desc; plot wins vs age.

R
Python

players <- c("Novak Djokovic","Rafael Nadal","Roger Federer","Daniil Medvedev","Dominic Thiem")
dob <- as.Date(c("1987-05-22","1986-06-03","1981-08-08","1996-02-11","1993-09-03"))
wins <- c(16,19,20,0,0)

df <- data.frame(player = players, dob = dob, wins = wins, stringsAsFactors = FALSE)
now <- Sys.Date()
df$age_years <- as.numeric(difftime(now, df$dob, units = "days")) / 365.2425

df2 <- df[order(-df$age_years), ]
print(df2)

plot(df2$age_years, df2$wins, xlab = "Age (years)", ylab = "Grand Slam wins")

import pandas as pd
from datetime import date
players = ["Novak Djokovic","Rafael Nadal","Roger Federer","Daniil Medvedev","Dominic Thiem"]
df = pd.DataFrame({
    "player": players,
    "dob": pd.to_datetime(["1987-05-22","1986-06-03","1981-08-08","1996-02-11","1993-09-03"]),
    "wins": [16,19,20,0,0]
})
now = pd.Timestamp.today().normalize()
df["age_years"] = (now - df["dob"]).dt.days / 365.2425

df2 = df.sort_values("age_years", ascending=False)
print(df2)

ax = df2.plot.scatter(x="age_years", y="wins")
ax.set_xlabel("Age (years)"); ax.set_ylabel("Grand Slam wins")
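Dividing a day count by 365.2425 gives a fractional age that can be slightly off around birthdays. When whole completed years are wanted, a calendar-based comparison is one alternative. A minimal sketch with the standard library; `age_in_years` is a hypothetical helper, not a pandas function, and the reference date is fixed only to keep the result reproducible:

```python
from datetime import date

def age_in_years(dob, on):
    """Completed years between dob and on (both datetime.date)."""
    # Subtract one if the birthday has not yet occurred in the year of `on`
    had_birthday = (on.month, on.day) >= (dob.month, dob.day)
    return on.year - dob.year - (0 if had_birthday else 1)

ref = date(2025, 10, 8)
print(age_in_years(date(1987, 5, 22), ref))
print(age_in_years(date(1996, 2, 11), ref))
```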

Challenge: Tennis leaderboard - 2nd solution

Task: Build table (players, DoB, wins); add age (years); sort by age desc; plot wins vs age.

R (dplyr / ggplot2)
Python (pandas / matplotlib)

library(dplyr)
library(tibble)
library(ggplot2)

ref_date <- as.Date("2025-10-08")

rankings <- tribble(
  ~rank, ~player,             ~country, ~points, ~dob,          ~gs_titles,
      1, "Carlos Alcaraz",    "ESP",     11540,  as.Date("2003-05-05"), 6,
      2, "Jannik Sinner",     "ITA",     10950,  as.Date("2001-08-16"), 4,
      3, "Alexander Zverev",  "GER",      5980,  as.Date("1997-04-20"), 0,
      4, "Taylor Fritz",      "USA",      4995,  as.Date("1997-10-28"), 0,
      5, "Novak Djokovic",    "SRB",      4830,  as.Date("1987-05-22"), 24
) |>
  mutate(age = floor(as.numeric(ref_date - dob) / 365.2425))

rankings

# Simple points bar chart
rankings |>
  ggplot(aes(x = reorder(player, points), y = points)) +
  geom_col() +
  coord_flip() +
  labs(x = NULL, y = "Ranking points",
       title = "ATP Singles — Top 5 (week of 2025‑10‑06)") +
  theme_minimal(base_size = 12)

# Optional: Grand Slam titles lollipop
rankings |>
  ggplot(aes(x = reorder(player, gs_titles), y = gs_titles)) +
  geom_segment(aes(xend = player, y = 0, yend = gs_titles)) +
  geom_point(size = 3) +
  coord_flip() +
  labs(x = NULL, y = "Grand Slam singles titles",
       title = "Grand Slam titles (singles)") +
  theme_minimal(base_size = 12)

import pandas as pd
import matplotlib.pyplot as plt
from datetime import datetime

ref_date = datetime(2025, 10, 8)

data = {
    "rank":   [1, 2, 3, 4, 5],
    "player": ["Carlos Alcaraz","Jannik Sinner","Alexander Zverev","Taylor Fritz","Novak Djokovic"],
    "country":["ESP","ITA","GER","USA","SRB"],
    "points": [11540, 10950, 5980, 4995, 4830],
    "dob":    pd.to_datetime(["2003-05-05","2001-08-16","1997-04-20","1997-10-28","1987-05-22"]),
    "gs_titles": [6, 4, 0, 0, 24]
}
rankings = pd.DataFrame(data)
rankings["age"] = (ref_date - rankings["dob"]).dt.days // 365.2425
rankings

# Points bar chart (horizontal)
ax = rankings.sort_values("points").plot(kind="barh", x="player", y="points", legend=False)
ax.set_xlabel("Ranking points")
ax.set_title("ATP Singles — Top 5 (week of 2025‑10‑06)")
plt.tight_layout(); plt.show()

# Optional: Grand Slam titles lollipop-style
ordered = rankings.sort_values("gs_titles")
plt.figure()
for i, (p, v) in enumerate(zip(ordered["player"], ordered["gs_titles"])):
    plt.plot([0, v], [i, i])
    plt.plot(v, i, marker="o")
plt.yticks(range(len(ordered)), ordered["player"])
plt.xlabel("Grand Slam singles titles")
plt.title("Grand Slam titles (singles)")
plt.tight_layout(); plt.show()

Control structures

Note

Today: Booleans & logical ops, choices (if/else, switch/match), loops (for, while, repeat), short-circuiting, vectorised conditionals, comprehension/map/apply, performance tips — in R & Python.

Two families

Choices: decide a path based on conditions.
Loops: repeat a block of code.

We’ll contrast R and Python patterns side-by-side.

Logical operators — scalars

R
Python

>, <, >=, <=, ==, !=, !, &&, ||

4 > 3
1 >= 1
!(2 > 1)
TRUE && TRUE
(1 > 1) || (2 < 3)

&& and || short-circuit and expect length-one operands: since R 4.3.0 a longer operand is an error (earlier versions silently used only its first element).

>, <, >=, <=, ==, !=, not, and, or

4 > 3
1 >= 1
not (2 > 1)
(True and True), ((1 > 1) or (2 < 3))

Logical operators — vectors/arrays

R
Python (NumPy/pandas)

Use elementwise & and | for vectors/matrices; all(), any(), xor().

c(TRUE, FALSE) | c(TRUE, TRUE)
c(TRUE, TRUE) & c(TRUE, FALSE)
xor(TRUE, TRUE)
all(c(TRUE, FALSE, FALSE))
any(c(TRUE, FALSE, FALSE))
# Short-circuit vs elementwise
c(TRUE, FALSE) | c(TRUE, FALSE)
# NOTE: c(TRUE,FALSE) || c(TRUE,FALSE) is an error since R 4.3.0 (earlier: first element only)

Use & and | for elementwise logic with parentheses; np.all, np.any.

import numpy as np
A = np.array([True, False])
B = np.array([True, True])
(A | B), (A & B)

In pandas/NumPy: do not use and/or with arrays; use &/| and wrap comparisons in parentheses.

Truthiness pitfalls

R: if (c(TRUE, FALSE)) → error since R 4.2.0 (the condition must have length 1); earlier versions warned and used only the first element.
Python (lists): any non-empty list is truthy: if [False]: → True.
NumPy/pandas: if np.array([True, False]) → error (ambiguous). Use .any() / .all().

import numpy as np
x = np.array([True, False])
# if x:  # ValueError: ambiguous truth value
if x.any():
    print("has a True")
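The list pitfall can be checked the same way. A small sketch with plain Python lists; the variable names are illustrative:

```python
flags = [False]

# A non-empty list is truthy regardless of its contents
list_is_truthy = bool(flags)

# any()/all() inspect the elements instead
has_true = any(flags)
all_true = all(flags)

# Edge case: on an empty list, any() is False but all() is True
empty_any, empty_all = any([]), all([])

print(list_is_truthy, has_true, all_true, empty_any, empty_all)
```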

Selection operators

`if` — basic

R
Python

x <- -4
if (x < 0) {
  x <- -x
}
if (x %% 2 == 0) {
  cat(x, "is even\n")
}

x = -4
if x < 0:
    x = -x
if x % 2 == 0:
    print(f"{x} is even")

`if` / `else`

R
Python

x <- 3
if (x %% 2 == 0) {
  cat("even\n")
} else {
  cat("odd\n")
}

x = 3
if x % 2 == 0:
    print("even")
else:
    print("odd")

`if` / `else if` / `else`

R
Python (elif)

x <- 0
if (x == 0) {
  cat("zero\n")
} else if (x %% 2 == 0) {
  cat("even\n")
} else {
  cat("odd\n")
}

x = 0
if x == 0:
    print("zero")
elif x % 2 == 0:
    print("even")
else:
    print("odd")

Vectorised conditionals

R — ifelse
Python — numpy.where / pandas

x <- 1:10
ifelse(x %% 2 == 0, "even", "odd")

import numpy as np, pandas as pd
x = np.arange(1,11)
np.where(x % 2 == 0, "even", "odd")
# pandas Series
s = pd.Series(range(1,11))
(s % 2 == 0).map({True: "even", False: "odd"})

`switch` (R) vs `match/case` (Python ≥3.10)

R — switch
Python — match
Python (pre-3.10) — dict dispatch

operator <- "+"
switch(operator,
  "+" = 20 + 5,
  "-" = 20 - 5,
  "*" = 20 * 5,
  "/" = 20 / 5,
  stop("Unknown operator")
)

operator = "+"
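match operator:
    case "+":
        result = 20 + 5
    case "-":
        result = 20 - 5
    case "*":
        result = 20 * 5
    case "/":
        result = 20 / 5
    case _:
        raise ValueError("Unknown operator")
result  # match is a statement, so the value must be stored explicitly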

ops = {
    "+": lambda a,b: a+b,
    "-": lambda a,b: a-b,
    "*": lambda a,b: a*b,
    "/": lambda a,b: a/b
}
ops.get("+", lambda a,b: None)(20,5)

Loops — overview

R: for, while, repeat, with break, next.
Python: for (over iterables), while, with break, continue, and the lesser-known for … else / while … else.

`for` loops

R
Python

for (number in 1:6) {
  print(number)
}
for (i in 1:10) {
  if (i %% 2 == 0) next
  print(i)
}

for number in range(1,7):
    print(number)
for i in range(1,11):
    if i % 2 == 0:
        continue
    print(i)

`while` loops

R
Python

i <- 1
while (i <= 6) {
  print(i)
  i <- i + 1
}

i = 1
while i <= 6:
    print(i)
    i += 1

Infinite loops & safety

Easy to create by mistake.
Prefer clear termination; add guards and break.

R — repeat
Python — guarded while

counter <- 0
repeat {
  counter <- counter + 1
  if (counter >= 3) break
}

counter = 0
while True:
    counter += 1
    if counter >= 3:
        break

Python’s `for … else`

The else runs only if the loop did not break.

for n in range(2, 10):
    for x in range(2, n):
        if n % x == 0:
            break
    else:
        print(n, "is prime")

Vectorisation & apply-family

R
Python

Prefer vectorised ops / builtins (rowMeans, colSums, …).
apply(X, MARGIN, FUN, …) for arrays; lapply/sapply for lists; or purrr::map*.

set.seed(321)
A <- matrix(rexp(30), ncol = 3, nrow = 10)
colMeans(A)
apply(A, 2, mean)

Prefer NumPy vectorisation; in pandas use column-wise ops; DataFrame.apply as last resort.

import numpy as np
rng = np.random.default_rng(321)   # seeded generator (values differ from R's set.seed(321))
A = rng.exponential(size=(10,3))
A.mean(axis=0)         # column means

Performance sketch

R
Python

# install.packages("microbenchmark")  # if needed
# microbenchmark::microbenchmark(
#   loop = for (i in seq_len(ncol(A))) mean(A[, i]),
#   apply = apply(A, 2, mean),
#   colMeans = colMeans(A)
# )

Rule: prefer builtins and vectorised code.

# import timeit
# timeit.timeit("sum(range(1000))", number=10000)

Rule: prefer vectorised NumPy/pandas ops over Python-level loops.
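The commented `timeit` call can be extended into a small comparison. A rough sketch using only the standard library; `loop_sum` is a hypothetical helper, and the timings depend on the machine, so no specific numbers are claimed:

```python
import timeit

def loop_sum(n):
    """Sum 0..n-1 with an explicit Python-level loop."""
    total = 0
    for i in range(n):
        total += i
    return total

n = 1000
# Both approaches must give the same answer before timing is meaningful
same = loop_sum(n) == sum(range(n))

t_loop = timeit.timeit(lambda: loop_sum(n), number=2000)
t_builtin = timeit.timeit(lambda: sum(range(n)), number=2000)
print(same, f"loop: {t_loop:.4f}s", f"builtin: {t_builtin:.4f}s")
```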

Mini‑exercises (quick)

Short‑circuit: What do these return? In R: c(T,F) | c(T,F) vs c(T,F) || c(T,F). In Python/NumPy: (A | B) vs (A or B) with boolean arrays.
FizzBuzz: Print numbers 1–20; for multiples of 3 print Fizz, of 5 Buzz, of both FizzBuzz.
Masking: Given x <- 1:10 / x = np.arange(1,11), replace odds by NA/np.nan using vectorised tools.
Switch/match: Implement a tiny calculator using R switch and Python match.

Challenge — Find the first even square > 1000

Write two versions (R & Python):

A loop-based solution with a break when found.
A vectorised solution.

Stretch: benchmark both approaches.

Cheatsheet

R elementwise: &, |; short-circuit: &&, || (length-one operands only; longer ones are an error since R 4.3.0).
Python booleans: and, or, not; elementwise with NumPy/pandas: &, | (+ parentheses!).
if/elif/else (Py) ~ if/else if/else (R).
Vectorised conditionals: ifelse (R), np.where/Series.where (Py).
Multi-way: switch (R), match/case or dict dispatch (Py).
Loops: for, while, repeat (R); for, while, for … else (Py).

Mini-exercise 1: Short‑circuit vs elementwise

Task: What do these return? In R: c(T,F) | c(T,F) vs c(T,F) || c(T,F). In Python/NumPy: (A | B) vs (A or B) with boolean arrays.

R
Python / NumPy

c(T,F) | c(T,F)        # elementwise OR -> TRUE FALSE
c(T,F) || c(T,F)       # an error with recent R; previously used only first element -> TRUE

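# Python lists
[True, False] or [True, False]   # returns the first list: any non-empty list is truthy

# Elementwise with NumPy
#| eval: false
import numpy as np
A = np.array([True, False]); B = np.array([True, False])
A | B, A & B                     # elementwise: both give array([ True, False])
# A or B                         # ValueError: truth value of an array is ambiguous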

Mini-exercise 2: FizzBuzz (1–20)

Task: Print numbers 1–20; for multiples of 3 print Fizz, of 5 Buzz, of both FizzBuzz.

R
Python

for (i in 1:20) {
  if (i %% 15 == 0) cat("FizzBuzz\n")
  else if (i %% 3 == 0) cat("Fizz\n")
  else if (i %% 5 == 0) cat("Buzz\n")
  else cat(i, "\n")
}

for i in range(1, 21):
    if i % 15 == 0: print("FizzBuzz")
    elif i % 3 == 0: print("Fizz")
    elif i % 5 == 0: print("Buzz")
    else: print(i)

Mini-exercise 3: Masking odds to NA/NaN

Task: Given x <- 1:10 / x = np.arange(1,11), replace odds by NA/np.nan using vectorised tools.

R
Python

x <- 1:10
x[x %% 2 == 1] <- NA_integer_
x

import numpy as np
x = np.arange(1,11, dtype=float)
x[x % 2 == 1] = np.nan
x

Mini-exercise 4: Tiny calculator (switch/match)

Task: Implement a tiny calculator using R switch and Python match.

R
Python

calc <- function(a, b, op) switch(op,
  "+" = a + b,
  "-" = a - b,
  "*" = a * b,
  "/" = a / b,
  stop("unknown op")
)
calc(20,5,"+")

# Python 3.10+ (match is syntax, so a runtime version check cannot guard it:
# older interpreters fail with a SyntaxError before any code runs)
def calc(a, b, op):
    match op:
        case "+": return a + b
        case "-": return a - b
        case "*": return a * b
        case "/": return a / b
        case _: raise ValueError("unknown op")
calc(20, 5, "+")

# Pre-3.10 alternative: dict dispatch
ops = {"+": lambda a,b: a+b, "-": lambda a,b: a-b, "*": lambda a,b: a*b, "/": lambda a,b: a/b}
ops["+"](20, 5)

Challenge: First even square > 1000

Task: Write two versions (R & Python):

A loop-based solution with a break when found.
A vectorised solution.

R
Python

# Loop solution
n <- 1
repeat {
  s <- n^2
  if (s > 1000 && s %% 2 == 0) { ans <- list(n=n, s=s); break }
  n <- n + 1
}
ans

# Vectorised solution
n <- seq_len(100)
sq <- n^2
cbind(n, sq)[which(sq > 1000 & sq %% 2 == 0)[1], ]

# Loop solution
n = 1
while True:
    s = n*n
    if s > 1000 and s % 2 == 0:
        ans = {"n": n, "s": s}
        break
    n += 1
ans

# Vectorised (NumPy)
#| eval: false
import numpy as np
n = np.arange(1, 101)
sq = n**2
idx = np.where((sq > 1000) & (sq % 2 == 0))[0][0]
int(n[idx]), int(sq[idx])

Functions

Note

Today: Function anatomy, arguments & matching, returns, errors/warnings/messages (vs Python exceptions), scope & environments (R) / LEGB (Python), higher‑order & closures, composition & piping, docs & tests — R & Python side‑by‑side.

“Everything is a function call” (and a first‑class object)

R
Python

Evaluate a constant:

18.10

Names / quoted names:

sqrt
`+`

Function call:

1.1 + 2.1
`+`(1.1, 2.1)

Evaluate a constant, names, and call:

18.10
abs
(1.1).__add__(2.1)

Operators are methods/functions under the hood (see operator module), and functions are first‑class values.
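The `operator` module mentioned above exposes the functions behind the operators. A short sketch; the dictionary `ops` and the name `results` are illustrative:

```python
import operator

# operator.add is the function behind the + operator
total = operator.add(1.1, 2.1)

# Functions are values: store them in a dict and call them later
ops = {"+": operator.add, "-": operator.sub, "*": operator.mul, "/": operator.truediv}
results = {symbol: fn(20, 5) for symbol, fn in ops.items()}

print(total == 1.1 + 2.1, results)
```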

Function components

R
Python

Arguments, body, environment.

my_div <- function(numerator, denominator) {
  div <- numerator / denominator
  return(div)
}
formals(my_div)
body(my_div)
environment(my_div)

Parameters, body, globals/closure.

def my_div(numerator, denominator):
    div = numerator / denominator
    return div
my_div.__code__.co_varnames, my_div.__defaults__, my_div.__closure__

Returns

R
Python

Last expression is returned if return() omitted.

f <- function(x) { x + 1 }
f(2)

Must return explicitly (else returns None).

def f(x):
    x + 1
print(f(2))  # None

Passing arguments

R — matching rules
Python — calling conventions

Positional, exact by name, partial (prefix) matching.

my_div(1, 2)
my_div(numerator = 1, denominator = 2)
my_div(n = 1, d = 2)  # partial match (works but discouraged)

Positional, keyword, defaults, *args, **kwargs.
Keyword‑only params (after *), positional‑only (before /).

def g(a, b=1, *args, scale=1, **kw):
    return (a + b) * scale

g(1, 2, scale=3)

def h(x, /, y, *, z=0):  # pos‑only x; keyword‑only z
    return x + y + z
h(1, 2, z=3)
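The `/` and `*` markers are easiest to understand by seeing which calls are rejected. A minimal sketch; `try_call` is a hypothetical helper that reports the exception type instead of stopping the script:

```python
def h(x, /, y, *, z=0):  # pos-only x; keyword-only z (same signature as above)
    return x + y + z

def try_call(*args, **kwargs):
    """Return the result of h(...) or the name of the exception it raises."""
    try:
        return h(*args, **kwargs)
    except TypeError as err:
        return type(err).__name__

print(try_call(1, 2, z=3))     # valid
print(try_call(1, y=2, z=3))   # valid: y may be passed by keyword
print(try_call(x=1, y=2))      # invalid: x is positional-only
print(try_call(1, 2, 3))       # invalid: z is keyword-only
```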

Assignment vs argument binding in R

Both = and <- assign values, but inside calls = binds arguments.

my_div(numerator = 2, denominator = 1)
numerator             # error: object 'numerator' not found
my_div(numerator <- 2, denominator = 1)  # <- assigns in the calling environment when the argument is evaluated
numerator             # now exists in workspace

Type checks & friendly failures

R
Python

my_div <- function(numerator, denominator) {
  if (any(!is.numeric(numerator), !is.numeric(denominator))) {
    stop("`numerator` and `denominator` must be numeric")
  }
  numerator / denominator
}
my_div("numerator", "denominator")

class NotNumericError(TypeError):
    pass

def my_div(numerator, denominator):
    from numbers import Number
    if not isinstance(numerator, Number) or not isinstance(denominator, Number):
        raise NotNumericError("numerator and denominator must be numeric")
    return numerator / denominator
# my_div("numerator", "denominator")  # raises

Warnings / messages vs warnings / logging

R
Python

warning("Size mismatch; recycling may occur")
message("Starting the division…")

import warnings, logging
warnings.warn("Size mismatch; check inputs")
logging.basicConfig(level=logging.INFO)
logging.info("Starting the division…")
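Warnings can also be captured rather than printed, which is useful when checking that a function warns as intended. A sketch using the standard `warnings` module; `noisy_div` is a hypothetical helper introduced only for this illustration:

```python
import warnings

def noisy_div(a, b):
    """Hypothetical helper: warn, then divide."""
    warnings.warn("Size mismatch; check inputs", UserWarning)
    return a / b

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")  # make sure the warning is not suppressed
    result = noisy_div(4, 2)

messages = [str(w.message) for w in caught]
categories = [w.category.__name__ for w in caught]
print(result, messages, categories)
```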

Dimensions & vectorisation

R
Python (NumPy)

A <- matrix(1:9, ncol = 3)
B <- matrix(10:18, ncol = 3)
A / B  # vectorised elementwise

import numpy as np
A = np.arange(1,10).reshape(3,3)
B = np.arange(10,19).reshape(3,3)
A / B  # elementwise

Robust `my_div` (shape checks)

R
Python (NumPy)

my_div <- function(numerator, denominator) {
  if (any(!is.numeric(numerator), !is.numeric(denominator))) {
    stop("`numerator` and `denominator` must be numeric")
  }
  if (is.null(dim(numerator)) && is.null(dim(denominator))) {
    # plain vectors have no dim attribute, so compare lengths instead
    if (length(numerator) != length(denominator))
      stop("Lengths must match for vectors")
  } else if (!identical(dim(numerator), dim(denominator))) {
    stop("Dimensions must match for arrays/matrices")
  }
  numerator / denominator
}

import numpy as np

def my_div(numerator, denominator):
    a = np.asarray(numerator)
    b = np.asarray(denominator)
    if a.shape != b.shape:
        raise ValueError("Shapes must match")
    return a / b

Scope: R environments vs Python LEGB

R — lexical scoping
Python — LEGB & global/nonlocal

Dynamic lookup: names resolved when function runs.
Name masking: inner names shadow outer names.

f <- function() x * x
f()            # error, x not found
x <- 10
f()            # now 100

x <- 10
f <- function(){ x <- 1; x * x }
f()            # 1

Inspect environments:

environment()
globalenv()
emptyenv()

x = 10

def f():
    x = 1   # local shadows global
    return x * x
f()

y = 0

def bump():
    global y
    y += 1
bump(); y


def make_adder(k):
    def add(x):
        return x + k  # closes over k (enclosing)
    return add
add5 = make_adder(5)
add5(3)
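The `nonlocal` keyword covers the remaining case: rebinding a variable of an enclosing function. A minimal sketch; `make_counter` and `bump` are illustrative names:

```python
def make_counter():
    count = 0
    def bump():
        nonlocal count  # rebind the enclosing variable instead of creating a local
        count += 1
        return count
    return bump

counter_a = make_counter()
counter_b = make_counter()  # each closure keeps its own `count`
print(counter_a(), counter_a(), counter_b())
```

Without `nonlocal`, the line `count += 1` would raise `UnboundLocalError`, because assignment makes `count` local to `bump`.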

Closures & higher‑order functions

R
Python

make_adder <- function(k) {
  function(x) x + k
}
add5 <- make_adder(5)
add5(3)
# Map over a vector
lapply(1:5, add5)

from functools import partial

def add(x, y):
    return x + y
add5 = partial(add, 5)
list(map(add5, range(1,6)))

Composition & piping

R
Python

Base pipe |> and magrittr %>%.

`%big%` <- function(x, y) 10 * x * y  # custom infix
1 %big% 2

# composition via nesting / piping
sqrt(log1p(9))
9 |>
  log1p() |>
  sqrt()

Nest calls or write a tiny compose.

from math import sqrt, log1p

def compose(f, g):
    return lambda x: f(g(x))

h = compose(sqrt, log1p)
h(9)
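Python has no pipe operator, but left-to-right application can be approximated with `functools.reduce`. A sketch under that assumption; `pipe` is a hypothetical helper, not a standard-library function:

```python
from functools import reduce
from math import log1p, sqrt

def pipe(value, *funcs):
    """Apply funcs to value from left to right, like x |> f() |> g() in R."""
    return reduce(lambda acc, fn: fn(acc), funcs, value)

print(pipe(9, log1p, sqrt) == sqrt(log1p(9)))
```

The functions are listed in the order they run, which reads like the R pipe rather than like nested calls.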

Documentation

R — roxygen2 style (snippet)
Python — docstring + type hints

#' Divide two numbers
#'
#' @param numerator,denominator Numeric scalars or vectors.
#' @return Numeric result.
#' @examples
#' my_div(4, 2)
my_div <- function(numerator, denominator) numerator / denominator

from typing import Union
Number = Union[int, float]

def my_div(numerator: Number, denominator: Number) -> float:
    """Divide two numbers.

    Args:
        numerator: dividend
        denominator: divisor
    Returns:
        The quotient as float.
    """
    return numerator / denominator

Testing (unit tests)

R — testthat (minimal)
Python — pytest (minimal)

# install.packages("testthat")
# test_that("my_div works", {
#   expect_equal(my_div(4,2), 2)
#   expect_error(my_div("a", 2))
# })

# import pytest
#
# def test_my_div():
#     assert my_div(4, 2) == 2
#     with pytest.raises(TypeError):   # str / int raises TypeError
#         my_div("a", 2)

Mini‑exercises (quick)

Safe divide: Write safe_div in R & Python that returns NA/math.nan when denominator is 0 and emits a warning.
Prefix matching: In R, define f <- function(numerator, denominator) {...} and call it via partial names. Why is this risky? Rewrite to avoid it.

Appendix — Python in this deck & pitfalls

Render with R/knitr + reticulate and point to your env:

install.packages("reticulate")
reticulate::use_condaenv("dacm", required = TRUE)
reticulate::py_config()

Python gotcha — mutable defaults:

# Bad
# def append_item(x, lst=[]):
#     lst.append(x); return lst
# Good
# def append_item(x, lst=None):
#     if lst is None: lst = []
#     lst.append(x); return lst

R note — lazy defaults: evaluated when first used, not at definition.

Cheatsheet

R returns last expr; Python needs return.
R: positional/exact/partial arg matching; Python: positional/keyword, *args/**kwargs, / & * markers.
R errors/warnings/messages vs Python exceptions/warnings/logging.
R lexical scoping & environments vs Python LEGB with global/nonlocal.
Vectorise: R builtins; Python use NumPy/pandas.
Composition: R |>/%>% & custom %op%; Python nesting/compose/method chaining.

Mini-exercise 1: Safe divide

Task: Write safe_div in R & Python that returns NA/math.nan when denominator is 0 and emits a warning.

R
Python

safe_div <- function(a, b) {
  ifelse(b == 0, {
    warning("denominator is 0; returning NA")
    NA_real_
  }, a / b)
}
safe_div(c(1,2,3), c(1,0,2))


# alternatively
safe_div <- function(a, b) {
  if(any(b == 0)) {
    warning("denominator is 0; returning NA")
  }
  
  x <- rep(NA_real_, length(a))
  x[b != 0] <- a[b != 0] / b[b != 0]
  
  return(x)
}

import math, warnings

def safe_div(a, b):
    try:
        return a / b
    except ZeroDivisionError:
        warnings.warn("denominator is 0; returning NaN")
        return math.nan

print(safe_div(1,1))
print(safe_div(2,0))

Mini-exercise 2: Prefix matching (R) & safer calls

Task: In R, define f <- function(numerator, denominator) {...} and call it via partial names. Why is this risky? Rewrite to avoid it.

R
Python

f <- function(numerator, denominator) numerator / denominator
f(n = 1, d = 2)              # works via partial matching (risky)

# Make it safer:
options(warnPartialMatchArgs = TRUE)  # warn on partial matching
f(numerator = 1, denominator = 2)     # explicit names

# Not applicable (Python has no partial name matching for keywords).
pass
