Primary (scalar) types
- integer: 2L, 12L
- double (real): -4, 12.4532, 6
- logical: TRUE, FALSE (also T, F)
- character: "a", "Bonjour"
- (also: complex, raw)
Data & Code Management: From Collection to Application
2025-10-02
- echo, eval, warning, fig-*
- knitr::kable(), kableExtra; equations (Mathpix helpful)
- renv::init(); snapshot(); restore()
Note
Today: Primary types, vectors/lists, matrices/arrays, data frames/tibbles, pandas DataFrame, dates, subsetting, coercion rules, and common ops — in both R and Python.
- integer: 2L, 12L
- double (real): -4, 12.4532, 6
- logical: TRUE, FALSE (also T, F)
- character: "a", "Bonjour"
- missing values: NA, typed variants NA_real_, NA_integer_, NA_character_, …
- special numeric values: Inf, NaN
- null object: NULL
- reserved words: if, else, repeat, while, function, for, in, next, break
Floating-point arithmetic in GPUs exhibits non-associativity, meaning \((a+b)+c\neq a+(b+c)\) due to finite precision and rounding errors. This property directly impacts the computation of attention scores and logits in the transformer architecture, where parallel operations across multiple threads can yield different results based on execution order.
Horace He in collaboration with others at Thinking Machines
Note
Read the full article here
What do you obtain if you run this in Python/R?
If you replace 20 by another smaller integer, do you get the same result? At which integer do you start to get the same result? Why?
Key idea: near large magnitudes, the ULP is so big that small addends vanish or round to the nearest ULP, breaking associativity.
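A minimal sketch of the key idea in Python. The values a = 1e20, b = -1e20 and c = 1.0 are assumptions chosen here for illustration, not necessarily the numbers used in class. Both expressions add the same three numbers; only the grouping differs.

```python
import math

a, b, c = 1e20, -1e20, 1.0  # illustrative values (assumption)

left = (a + b) + c   # a and b cancel exactly first, then c is added
right = a + (b + c)  # c is far below half an ULP of b, so b + c rounds back to b

print(left, right)
print(math.ulp(a))   # spacing between adjacent doubles near 1e20 (Python 3.9+)
```

The two results differ, and the printed ULP shows how coarse the grid of representable doubles is at that magnitude compared with an addend of 1.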
| Dimension | R homogeneous | R heterogeneous | Python homogeneous | Python heterogeneous |
|---|---|---|---|---|
| 1 | atomic vector | list | NumPy array | list, tuple |
| 2 | matrix | data.frame / tibble | NumPy ndarray (2D) | pandas DataFrame |
| n | array | — | NumPy ndarray (nD) | — |
Rule of thumb: R vectors are homogeneous; Python lists are heterogeneous. For homogeneous numerics in Python, use NumPy.
- In R, -i excludes element i; in Python, -i indexes from the end.
- Coercion (R): logical < integer < double < character.
- dtype.
players <- c("Novak Djokovic","Rafael Nadal","Roger Federer","Daniil Medvedev","Dominic Thiem")
grand_slam_win <- c(16,19,20,0,0)
date_of_birth <- c("1987-05-22","1986-06-03","1981-08-08","1996-02-11","1993-09-03")
(tennis <- data.frame(date_of_birth, grand_slam_win, row.names = players))
is.data.frame(tennis)
str(tennis)

import pandas as pd
players = ["Novak Djokovic","Rafael Nadal","Roger Federer","Daniil Medvedev","Dominic Thiem"]
df = pd.DataFrame({
"date_of_birth": ["1987-05-22","1986-06-03","1981-08-08","1996-02-11","1993-09-03"],
"grand_slam_win": [16,19,20,0,0]
}, index=players)
df.info()
df.loc[:, ["grand_slam_win", "date_of_birth"]]
- Let x <- 3 * seq_len(4) (R) / x = [3*i for i in range(1,5)] (Py). Select the 2nd element via (a) positive, (b) negative, (c) logical (or comprehension) indexing.
- Sort x in descending order using (a) indices and (b) a built-in sorter.
- In R, evaluate c(TRUE, 2, "3"): what happens? In Python, compare [True, 2, "3"] and np.array([True, 2, "3"]).
- Create a 3×3 matrix / array and compute its transpose and A %*% A (R) / A @ A (Py).
Optional: Show your answers in a short .qmd or notebook.
Tennis leaderboard
- Build a table (players, DoB, wins), add age in years (as.Date(Sys.Date()) / pd.Timestamp.today()), then sort by age descending.
Hints:
- R: mutate, lubridate::time_length(interval, 'years'), plot() or ggplot2.
- Python: pd.to_datetime, age via dt, df.sort_values, and matplotlib.
If Python chunks don’t execute:
Ensure reticulate sees the right interpreter:
Point to an environment that has numpy / pandas if used:
Or disable execution of Python chunks during render with chunk option #| eval: false.
- Missing values: R NA; Python None/NaN/pd.NA (context-dependent).
- Matrix multiplication: R %*%; Python @ (NumPy/pandas).
Task: Let x <- 3 * seq_len(4) (R) / x = [3*i for i in range(1,5)] (Py). Select the 2nd element via (a) positive, (b) negative, (c) logical/comprehension.
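One possible Python take on this task, as a sketch. Plain lists have no logical indexing, so a comprehension over a hand-written mask stands in for it; the variable names are illustrative.

```python
x = [3 * i for i in range(1, 5)]  # [3, 6, 9, 12]

second_pos = x[1]    # (a) positive index, zero-based
second_neg = x[-3]   # (b) negative index counts from the end, it does not exclude

# (c) comprehension acting as a logical mask
mask = [False, True, False, False]
second_mask = [v for v, keep in zip(x, mask) if keep]

print(second_pos, second_neg, second_mask)
```

Note the contrast with R, where x[-3] would drop the third element instead of selecting one.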
Task: Sort x in descending order using (a) indices and (b) a built-in sorter.
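A sketch of both routes in Python, using only built-ins. The index-based version mirrors what order() does in R: compute the permutation first, then subset with it.

```python
x = [3 * i for i in range(1, 5)]  # [3, 6, 9, 12]

# (a) via indices: positions sorted by the value they point to, largest first
order = sorted(range(len(x)), key=lambda i: x[i], reverse=True)
by_index = [x[i] for i in order]

# (b) via the built-in sorter
by_sorted = sorted(x, reverse=True)

print(order, by_index, by_sorted)
```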
Task: In R, evaluate c(TRUE, 2, "3") — what happens? In Python, compare [True, 2, "3"] and np.array([True, 2, "3"]).
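For the Python half of this task, a short sketch that makes the difference visible by inspecting types. It assumes NumPy is installed.

```python
import numpy as np

as_list = [True, 2, "3"]             # a list keeps each element's own type
as_array = np.array([True, 2, "3"])  # an array needs one common dtype

print([type(v).__name__ for v in as_list])
print(as_array.dtype.kind, as_array.tolist())
```

The array ends up with a string dtype, which parallels R's coercion of c(TRUE, 2, "3") to character.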
Task: Create a 3×3 matrix / array and compute its transpose and A %*% A (R) / A @ A (Py).
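A NumPy sketch of this task. The matrix entries 1 to 9 are an arbitrary choice for illustration.

```python
import numpy as np

A = np.arange(1, 10).reshape(3, 3)  # rows: [1 2 3], [4 5 6], [7 8 9]

A_t = A.T      # transpose
A_sq = A @ A   # matrix product; A * A would multiply elementwise instead

print(A_t)
print(A_sq)
```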
Task: Build table (players, DoB, wins); add age (years); sort by age desc; plot wins vs age.
players <- c("Novak Djokovic","Rafael Nadal","Roger Federer","Daniil Medvedev","Dominic Thiem")
dob <- as.Date(c("1987-05-22","1986-06-03","1981-08-08","1996-02-11","1993-09-03"))
wins <- c(16,19,20,0,0)
df <- data.frame(player = players, dob = dob, wins = wins, stringsAsFactors = FALSE)
now <- Sys.Date()
df$age_years <- as.numeric(difftime(now, df$dob, units = "days")) / 365.2425
df2 <- df[order(-df$age_years), ]
print(df2)
plot(df2$age_years, df2$wins, xlab = "Age (years)", ylab = "Grand Slam wins")

import pandas as pd
from datetime import date
players = ["Novak Djokovic","Rafael Nadal","Roger Federer","Daniil Medvedev","Dominic Thiem"]
df = pd.DataFrame({
"player": players,
"dob": pd.to_datetime(["1987-05-22","1986-06-03","1981-08-08","1996-02-11","1993-09-03"]),
"wins": [16,19,20,0,0]
})
now = pd.Timestamp.today().normalize()
df["age_years"] = (now - df["dob"]).dt.days / 365.2425
df2 = df.sort_values("age_years", ascending=False)
print(df2)
ax = df2.plot.scatter(x="age_years", y="wins")
ax.set_xlabel("Age (years)"); ax.set_ylabel("Grand Slam wins")
Task: Build table (players, DoB, wins); add age (years); sort by age desc; plot wins vs age.
library(dplyr)
library(tibble)
library(ggplot2)
ref_date <- as.Date("2025-10-08")
rankings <- tribble(
~rank, ~player, ~country, ~points, ~dob, ~gs_titles,
1, "Carlos Alcaraz", "ESP", 11540, as.Date("2003-05-05"), 6,
2, "Jannik Sinner", "ITA", 10950, as.Date("2001-08-16"), 4,
3, "Alexander Zverev", "GER", 5980, as.Date("1997-04-20"), 0,
4, "Taylor Fritz", "USA", 4995, as.Date("1997-10-28"), 0,
5, "Novak Djokovic", "SRB", 4830, as.Date("1987-05-22"), 24
) |>
mutate(age = floor(as.numeric(ref_date - dob) / 365.2425))
rankings
# Simple points bar chart
rankings |>
ggplot(aes(x = reorder(player, points), y = points)) +
geom_col() +
coord_flip() +
labs(x = NULL, y = "Ranking points",
title = "ATP Singles — Top 5 (week of 2025‑10‑06)") +
theme_minimal(base_size = 12)
# Optional: Grand Slam titles lollipop
rankings |>
ggplot(aes(x = reorder(player, gs_titles), y = gs_titles)) +
geom_segment(aes(xend = player, y = 0, yend = gs_titles)) +
geom_point(size = 3) +
coord_flip() +
labs(x = NULL, y = "Grand Slam singles titles",
title = "Grand Slam titles (singles)") +
theme_minimal(base_size = 12)

import pandas as pd
import matplotlib.pyplot as plt
from datetime import datetime
ref_date = datetime(2025, 10, 8)
data = {
"rank": [1, 2, 3, 4, 5],
"player": ["Carlos Alcaraz","Jannik Sinner","Alexander Zverev","Taylor Fritz","Novak Djokovic"],
"country":["ESP","ITA","GER","USA","SRB"],
"points": [11540, 10950, 5980, 4995, 4830],
"dob": pd.to_datetime(["2003-05-05","2001-08-16","1997-04-20","1997-10-28","1987-05-22"]),
"gs_titles": [6, 4, 0, 0, 24]
}
rankings = pd.DataFrame(data)
rankings["age"] = (ref_date - rankings["dob"]).dt.days // 365.2425
rankings
# Points bar chart (horizontal)
ax = rankings.sort_values("points").plot(kind="barh", x="player", y="points", legend=False)
ax.set_xlabel("Ranking points")
ax.set_title("ATP Singles — Top 5 (week of 2025‑10‑06)")
plt.tight_layout(); plt.show()
# Optional: Grand Slam titles lollipop-style
ordered = rankings.sort_values("gs_titles")
plt.figure()
for i, (p, v) in enumerate(zip(ordered["player"], ordered["gs_titles"])):
    plt.plot([0, v], [i, i])      # stem from 0 to the value
    plt.plot(v, i, marker="o")    # dot at the value
plt.yticks(range(len(ordered)), ordered["player"])
plt.xlabel("Grand Slam singles titles")
plt.title("Grand Slam titles (singles)")
plt.tight_layout(); plt.show()
Note
Today: Booleans & logical ops, choices (if/else, switch/match), loops (for, while, repeat), short-circuiting, vectorised conditionals, comprehension/map/apply, performance tips — in R & Python.
We’ll contrast R and Python patterns side-by-side.
Use elementwise & and | for vectors/matrices; all(), any(), xor().
- if(c(TRUE,FALSE)) → error in R ≥ 4.2 (condition has length > 1); earlier versions warned and used only the first element.
- if [False]: → True (a non-empty list is truthy).
- if np.array([True, False]) → error (ambiguous). Use .any() / .all().
if: basic
if / else
if / else if / else
switch (R) vs match/case (Python ≥3.10)
- R: for, while, repeat, with break, next.
- Python: for (over iterables), while, with break, continue, and the lesser-known for … else / while … else.
for loops
while loops
break.
for … else
The else runs only if the loop did not break.
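A small sketch of for … else in Python. The helper name first_multiple is hypothetical and only serves the illustration: the else branch is reached when the loop runs to completion, and skipped when break fires.

```python
def first_multiple(values, k):
    """Return the first element of values divisible by k, or None if there is none."""
    for v in values:
        if v % k == 0:
            found = v
            break         # leaving via break skips the else clause
    else:
        found = None      # runs only when the loop finished without break
    return found

print(first_multiple([3, 5, 8, 10], 4))
print(first_multiple([1, 3, 5], 2))
```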
- Built-in vectorised functions (rowMeans, colSums, …).
- apply(X, MARGIN, FUN, …) for arrays; lapply/sapply for lists; or purrr::map*.
Rule: prefer builtins and vectorised code.
- What do these return? In R: c(T,F) | c(T,F) vs c(T,F) || c(T,F). In Python/NumPy: (A | B) vs (A or B) with boolean arrays.
- Print numbers 1–20; for multiples of 3 print Fizz, of 5 Buzz, of both FizzBuzz.
- Given x <- 1:10 / x = np.arange(1,11), replace odds by NA/np.nan using vectorised tools.
- Implement a tiny calculator using R switch and Python match.
Write two versions (R & Python):
break when found.
Stretch: benchmark both approaches.
- R elementwise: &, |; short-circuit: &&, || (length-one operands; since R 4.3 longer vectors are an error, earlier versions used only the first element).
- Python: and, or, not; elementwise with NumPy/pandas: &, | (+ parentheses!).
- if/elif/else (Py) ~ if/else if/else (R).
- Vectorised conditionals: ifelse (R), np.where/Series.where (Py).
- switch (R), match/case or dict dispatch (Py).
- for, while, repeat (R); for, while, for … else (Py).
Task: What do these return? In R: c(T,F) | c(T,F) vs c(T,F) || c(T,F). In Python/NumPy: (A | B) vs (A or B) with boolean arrays.
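For the NumPy half of this task, a sketch with two small boolean arrays (the values of A and B are illustrative). The try/except is there only so the script keeps running after the failing expression.

```python
import numpy as np

A = np.array([True, False])
B = np.array([False, False])

elementwise = A | B   # one result per position

try:
    A or B            # `or` first asks for a single truth value of A
    raised = False
except ValueError:
    raised = True     # NumPy refuses: the truth value of a multi-element array is ambiguous

print(elementwise, raised)
```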
Task: Print numbers 1–20; for multiples of 3 print Fizz, of 5 Buzz, of both FizzBuzz.
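One way to write this in Python, as a sketch. The helper name fizzbuzz is illustrative; testing the combined case first keeps the branches mutually exclusive.

```python
def fizzbuzz(n):
    """Return the FizzBuzz label for a single integer."""
    if n % 15 == 0:       # multiple of both 3 and 5
        return "FizzBuzz"
    if n % 3 == 0:
        return "Fizz"
    if n % 5 == 0:
        return "Buzz"
    return str(n)

lines = [fizzbuzz(n) for n in range(1, 21)]
print("\n".join(lines))
```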
Task: Given x <- 1:10 / x = np.arange(1,11), replace odds by NA/np.nan using vectorised tools.
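A vectorised Python sketch using np.where, assuming NumPy is available. The result is a float array, because NaN has no integer representation.

```python
import numpy as np

x = np.arange(1, 11)

# Where the condition holds (odd values) take NaN, elsewhere keep x
result = np.where(x % 2 == 1, np.nan, x)

print(result)
print(result.dtype)
```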
Task: Implement a tiny calculator using R switch and Python match.
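# Python 3.10+
# Note: match is parsed before the version check runs, so on older interpreters
# this code fails with a SyntaxError; the guard only documents the requirement.
import sys
if sys.version_info >= (3,10):
    def calc(a,b,op):
        match op:
            case "+": return a + b
            case "-": return a - b
            case "*": return a * b
            case "/": return a / b
            case _: raise ValueError("unknown op")
    calc(20,5,"+")
else:
    # dict dispatch: map each operator symbol to a function
    ops = {"+": lambda a,b: a+b, "-": lambda a,b: a-b, "*": lambda a,b: a*b, "/": lambda a,b: a/b}
    ops["+"](20,5)
Task: Write two versions (R & Python):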
break when found.
Note
Today: Function anatomy, arguments & matching, returns, errors/warnings/messages (vs Python exceptions), scope & environments (R) / LEGB (Python), higher‑order & closures, composition & piping, docs & tests — R & Python side‑by‑side.
= and <- assign values, but inside calls = binds arguments.

class NotNumericError(TypeError):
    pass

def my_div(numerator, denominator):
    from numbers import Number
    if not isinstance(numerator, Number) or not isinstance(denominator, Number):
        raise NotNumericError("numerator and denominator must be numeric")
    return numerator / denominator

# my_div("numerator", "denominator")  # raises
my_div (shape checks)
my_div <- function(numerator, denominator) {
if (any(!is.numeric(numerator), !is.numeric(denominator))) {
stop("`numerator` and `denominator` must be numeric")
}
if (!identical(dim(numerator), dim(denominator))) {
# fall back to length check for vectors
if (is.null(dim(numerator)) && is.null(dim(denominator))) {
if (length(numerator) != length(denominator))
stop("Lengths must match for vectors")
} else {
stop("Dimensions must match for arrays/matrices")
}
}
numerator / denominator
}
- Native pipe |> and magrittr %>%.
- Write safe_div in R & Python that returns NA/math.nan when denominator is 0 and emits a warning.
- In R, define f <- function(numerator, denominator) {...} and call it via partial names. Why is this risky? Rewrite to avoid it.
Render with R/knitr + reticulate and point to your env:
Python gotcha — mutable defaults:
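A minimal sketch of the gotcha. The function names append_bad and append_good are hypothetical. Python evaluates a default value once, when the function is defined, so a mutable default is shared across calls; using None as a sentinel is the usual workaround.

```python
def append_bad(item, bucket=[]):      # the default list is created once, at definition time
    bucket.append(item)
    return bucket

def append_good(item, bucket=None):   # None as a sentinel: a fresh list per call
    if bucket is None:
        bucket = []
    bucket.append(item)
    return bucket

first_bad = list(append_bad(1))    # copied, so later calls do not alter what we recorded
second_bad = list(append_bad(2))   # the shared default still holds the 1 from before
first_good = append_good(1)
second_good = append_good(2)

print(first_bad, second_bad)
print(first_good, second_good)
```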
R note — lazy defaults: evaluated when first used, not at definition.
- return.
- *args/**kwargs, / & * markers.
- global/nonlocal.
- |>/%>% & custom %op%; Python nesting/compose/method chaining.
Task: Write safe_div in R & Python that returns NA/math.nan when denominator is 0 and emits a warning.
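A possible Python version of safe_div, as a sketch for scalar inputs only. It uses warnings.warn so the caller gets a warning instead of a ZeroDivisionError; the message text is an arbitrary choice.

```python
import math
import warnings

def safe_div(a, b):
    """Divide a by b; on a zero denominator, warn and return NaN instead of raising."""
    if b == 0:
        warnings.warn("denominator is 0; returning NaN")
        return math.nan
    return a / b

print(safe_div(6, 3))
print(safe_div(1, 0))
```

For array inputs, a vectorised variant would be needed; this sketch does not cover that case.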
safe_div <- function(a, b) {
ifelse(b == 0, {
warning("denominator is 0; returning NA")
NA_real_
}, a / b)
}
safe_div(c(1,2,3), c(1,0,2))
# alternatively
safe_div <- function(a, b) {
if(any(b == 0)) {
warning("denominator is 0; returning NA")
}
x <- rep(NA_real_, length(a))
x[b != 0] <- a[b != 0] / b[b != 0]
return(x)
}
Task: In R, define f <- function(numerator, denominator) {...} and call it via partial names. Why is this risky? Rewrite to avoid it.
HEC Lausanne · Business Analytics · Thu 9:00–12:00