Data & Code Management: From Collection to Application
2025-09-25
“Reproducibility is job #1 for modern data science.”
— everyone who has ever lost a script 😉
Note
Today’s goals
1. Understand course scope & expectations
2. See why reproducibility matters
3. Meet the core toolchain (R, Python, SQL, GitHub, Markdown/Jupyter/Quarto)
4. Try small, focused exercises
Tip: Slides include short activities you can try during/after class.
Tip
Success checklist
- Commit early, commit often
- Make small, reversible changes
- Automate what repeats
- Document decisions, not only code
- Prefer scripts & notebooks over manual clicks
Tools allowed: R, Python, SQL, Quarto, GitHub, AI copilots (with provenance & verification).
Definition: Large Language Models (LLMs) are AI models trained on vast amounts of text data to understand and generate human language.
Examples:
Capabilities:
Important
AI policy (short)
Use AI to brainstorm, outline, or lint code. Own the result: verify outputs, write your tests, and document AI assistance (what, why, where).
project/
├─ data/ # raw/ and processed/ (never overwrite raw)
├─ R/ or src/ # functions, modules
├─ notebooks/ # exploratory analysis
├─ reports/ # Quarto/Markdown outputs
├─ tests/ # unit tests
├─ renv/ or .venv/ # R or Python environment
├─ .gitignore
└─ README.md
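The layout above can be created by a script rather than by hand. This is a minimal sketch, assuming the Python variant of the tree (`src/` instead of `R/`); the function name `scaffold` is hypothetical, and the environment folder is left to the environment tool.

```python
from pathlib import Path
import tempfile

# Folder names follow the tree above; choosing src/ over R/ is an assumption.
LAYOUT = ["data/raw", "data/processed", "src", "notebooks", "reports", "tests"]


def scaffold(root: Path) -> list[Path]:
    """Create the project skeleton under `root` and return the created directories."""
    created = []
    for rel in LAYOUT:
        path = root / rel
        path.mkdir(parents=True, exist_ok=True)  # safe to re-run
        created.append(path)
    # Empty placeholder files; fill them in by hand afterwards.
    (root / "README.md").touch(exist_ok=True)
    (root / ".gitignore").touch(exist_ok=True)
    return created


if __name__ == "__main__":
    # Demo in a throwaway directory so nothing is left behind.
    with tempfile.TemporaryDirectory() as tmp:
        for directory in scaffold(Path(tmp) / "project"):
            print(directory.relative_to(tmp))
```

Because every call uses `exist_ok=True`, running the script twice leaves an existing project untouched.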
R snippet
Caution
Are these snippets reproducible? Why or why not?
Render:
Important
House rule: every analysis step appears in a script/notebook—no manual spreadsheet edits.
Markdown contrasts with markup languages (e.g. HTML), which require heavy syntax.
In a nutshell, Quarto builds on Pandoc and execution engines (Knitr for R, Jupyter for Python/Julia; Observable JS runs in the browser). It allows embedding code into Markdown documents, which can be rendered into multiple formats (HTML, PDF, Word, …).
In RStudio or VSCode, click File → New File → Quarto Document. Or simply create a .qmd file.
Three aspects:
- Metadata (YAML header between --- lines)
- Text (Markdown)
- Code (executable cells)
Example:
Core body, essential for explaining your analysis.
Markdown syntax:
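- Text formatting: *italics*, **bold**, `code style`
- Headings (#, ##, ###)
- Lists (*, -, +, 1.)
- Links: [Quarto](https://quarto.org)
- Blockquotes (> …)
- Tables (knitr::kable())
- Math: $...$ inline or $$...$$ display.
- Cross-references: @fig-label or @sec-label.
- Citations from .bib files ([@doe2023]).
Quarto supports R, Python, Julia, and ObservableJS.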
Code cell delimiters:
Quarto syntax uses YAML-like #| comments:
Options you’ll use most often:
- eval: run code?
- echo: show code?
- warning: show warnings?
- cache: reuse computations?
Options for plots:
A scatterplot
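As a rough illustration of the `#|` option comments, here is what the body of a Python cell might look like. In a .qmd file it would sit inside a ```{python} fence; to Python itself the `#|` lines are ordinary comments, so the snippet also runs on its own. The label `summary-stats` is a hypothetical name, and the numbers are the first five Sepal.Length values of iris.

```python
#| label: summary-stats
#| eval: true
#| echo: false
#| warning: false

# First five Sepal.Length values of iris, typed in by hand for this sketch.
sepal_lengths = [5.1, 4.9, 4.7, 4.6, 5.0]
mean_length = sum(sepal_lengths) / len(sepal_lengths)
print(f"Mean sepal length: {mean_length:.2f}")
```

With `echo: false` the rendered document would show only the printed result, not the code.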
knitr::kable()
| Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
|---|---|---|---|---|
| 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 5.0 | 3.6 | 1.4 | 0.2 | setosa |
For enhanced tables: kableExtra.
Mathpix Snip digitizes handwritten/printed math and pastes directly into Markdown/LaTeX/Word.
quarto preview: instant feedback without full rebuilds.
Tip
💡 Think of GitHub as both a time machine and a teamwork hub.
We will use GitHub to:
- Work in groups on projects
- Submit homeworks
- Develop programs
Git is a distributed version control system:
Important
Instead of dozens of “final” files, Git records every change in one clean history.
A file can be:
- Untracked – new, not in Git yet
- Modified – changed, not saved in history
- Staged – marked for next snapshot
- Committed – safely stored in the repository
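The first three states can be read off the two-character codes printed by `git status --porcelain`. The sketch below classifies such lines; the sample text and file names are hypothetical, and the function `classify` is an illustration, not part of Git. Committed, unchanged files do not appear in that output at all.

```python
# Sample `git status --porcelain` output (hypothetical file names).
# Column 1 is the staging-area status, column 2 the working-tree status.
SAMPLE = "?? notes.md\n M analysis.R\nM  README.md\nA  utils/plot.R\n"


def classify(porcelain_line: str) -> str:
    """Map one porcelain line to the file states listed above."""
    index_state, worktree_state = porcelain_line[0], porcelain_line[1]
    if porcelain_line.startswith("??"):
        return "untracked"
    if porcelain_line.startswith("!!"):
        return "ignored"  # only shown when --ignored is passed
    if worktree_state != " ":
        # Unstaged changes exist (this sketch reports them first,
        # even if an earlier version of the file is already staged).
        return "modified"
    if index_state != " ":
        return "staged"
    return "unknown"


if __name__ == "__main__":
    for line in SAMPLE.splitlines():  # do not strip: leading spaces matter
        print(f"{line[3:]:<15} {classify(line)}")
```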
Use a .gitignore file to tell Git which files to skip:
- Temporary files (.Rhistory, .DS_Store)
- Large data files
- Secrets or keys
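To make the three categories concrete, the sketch below builds the text of a small .gitignore and checks a few file names against it. The patterns `*.parquet`, `.env`, and `*.pem` are hypothetical examples of "large data" and "secrets", and `fnmatch` only approximates Git's matching rules (no directories, no `!` negation).

```python
from fnmatch import fnmatch

# Hypothetical patterns: temporary files, large data files, secrets or keys.
IGNORE_PATTERNS = [".Rhistory", ".DS_Store", "*.parquet", ".env", "*.pem"]

# The content you would save as .gitignore, one pattern per line.
GITIGNORE_TEXT = "\n".join(IGNORE_PATTERNS) + "\n"


def is_ignored(filename: str) -> bool:
    """Rough check: does the bare file name match any pattern?

    Real .gitignore matching has more rules; this covers simple cases only.
    """
    return any(fnmatch(filename, pattern) for pattern in IGNORE_PATTERNS)


if __name__ == "__main__":
    print(GITIGNORE_TEXT)
    for name in [".DS_Store", "survey.parquet", ".env", "analysis.R"]:
        print(name, "->", "ignored" if is_ignored(name) else "tracked")
```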
Tip
Always pull before you push to avoid conflicts!
- git pull --rebase
- PULL_REQUEST_TEMPLATE.md, ISSUE_TEMPLATE.md
- CHANGELOG.md
Note
Activity (2’): In pairs, outline a PR description for adding a new utils/plot.R with one function and one test.
PR Title
Add utils/plot.R with basic plotting function and test
Summary
- Add plot_scatter() in utils/plot.R
- Add test in tests/testthat/test-plot.R
Details
- Input validation (numeric vectors of equal length)
- Uses ggplot2
- Tests:
- Error if unequal lengths
- Output is "ggplot"
Checklist
- [x] Function implemented
- [x] Tests added
- [x] Documentation with roxygen2
- [ ] CI checks pass
Related Issues
Closes #12 (feature request: plotting utilities)
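The "one function and one test" shape of this PR can be sketched in Python as well. This is an analogue for illustration only: `scatter_points` is a hypothetical name, it is not the `plot_scatter()` from the PR, and it returns paired points instead of a ggplot2 object.

```python
def scatter_points(x, y):
    """Validate two numeric sequences and pair them up for plotting."""
    if len(x) != len(y):
        raise ValueError(f"x and y must have equal length, got {len(x)} and {len(y)}")
    for value in list(x) + list(y):
        # bool is a subclass of int, so exclude it explicitly
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise TypeError(f"non-numeric value: {value!r}")
    return list(zip(x, y))


def test_unequal_lengths_raise():
    """Mirrors the PR's first test: unequal lengths are an error."""
    try:
        scatter_points([1, 2, 3], [1, 2])
    except ValueError:
        return
    raise AssertionError("expected ValueError for unequal lengths")


def test_returns_pairs():
    """Mirrors the PR's second test: the output has the expected form."""
    assert scatter_points([1, 2], [3.0, 4.5]) == [(1, 3.0), (2, 4.5)]


if __name__ == "__main__":
    test_unequal_lengths_raise()
    test_returns_pairs()
    print("all tests passed")
```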
Git gets easier once you get the basic idea that branches are homeomorphic endofunctors mapping submanifolds of a Hilbert space.
—Isaac Wolkerstorfer (joking)
Note
Activity (think‑pair‑share):
What makes a good commit message? Write one for “fixed weird bug in script” that would help your future self.
Symptoms of non‑reproducible work (raise your hand if you’ve seen these):
- “It works on my machine.”
- “I changed nothing and it broke.”
- “Which file is the final_final_v3.R?”
Principles
- Deterministic environments
- Versioned code and data contracts
- Scripts, not clicks
- Single source of truth (parameters, config)
- Literate programming (Markdown/Quarto)
- Automated checks (CI later in course)
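Two of these principles fit in a few lines: keep all parameters in one place, and make random steps deterministic by seeding them. The sketch below assumes hypothetical parameter names; results repeat on the same Python version, which is one reason environments are pinned too.

```python
import random

# Single source of truth: every tunable value lives here (hypothetical names).
CONFIG = {"seed": 2025, "sample_size": 5, "population_max": 100}


def draw_sample(config: dict) -> list[int]:
    """Draw a sample using a private, seeded generator."""
    rng = random.Random(config["seed"])  # local RNG, no hidden global state
    population = range(1, config["population_max"] + 1)
    return rng.sample(population, config["sample_size"])


if __name__ == "__main__":
    first = draw_sample(CONFIG)
    second = draw_sample(CONFIG)
    print(first == second)  # same config, same sample
```

Passing the config in as an argument, instead of reading globals inside the function, keeps the function easy to test with other settings.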
R — renv
Python — Conda (recommended)
Note
Exercise (1′): List one package you rely on in R and in Python. Why lock its version?
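To see what "locking" records, the sketch below lists the exact versions installed in the current Python environment using only the standard library. It is an illustration of the idea, not a replacement for renv or Conda; the function name `snapshot_versions` is hypothetical.

```python
from importlib.metadata import distributions


def snapshot_versions() -> list[str]:
    """Return sorted 'name==version' lines for installed distributions."""
    pins = set()
    for dist in distributions():
        name = dist.metadata.get("Name")
        if name:  # skip broken installs without metadata
            pins.add(f"{name}=={dist.version}")
    return sorted(pins, key=str.lower)


if __name__ == "__main__":
    for pin in snapshot_versions():
        print(pin)
```

Comparing two such snapshots, taken on different machines, is a quick way to explain an "it works on my machine" difference.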
Reproducibility is a bedrock of scientific progress. However, it’s remarkably difficult to get reproducible results out of large language models.
Horace He in collaboration with others at Thinking Machines
Note
Read the full article here
- .gitignore to avoid committing large/secret files
- .qmd to HTML,
- renv or .venv.
Tip
If stuck: ask on Slack—show error, steps tried, and minimal example.
- File: exercise1.R.
Note
💡 Tip: Use c() for vectors, mean() for averages, and paste() for printing messages.
- File: exercise2.py.
Note
💡 Tip: Use sum() and len() to calculate an average.
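A minimal sketch of the tip, with made-up numbers (the list `scores` is hypothetical, not part of the exercise):

```python
# Hypothetical scores, for illustration only.
scores = [12, 15, 9, 20, 14]

# Average = total divided by the number of items.
average = sum(scores) / len(scores)
print("The average is", average)
```

An empty list would raise `ZeroDivisionError`, so check `len(scores)` first if the input can be empty.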
- File: exercise3.qmd.
- R chunk: "Hello R!".
- Python chunk: "Hello Python!".
Note
💡 Tip: Use ```{r} and ```{python} for chunks.
- Repository: first-exercises.
- Files: exercise1.R, exercise2.py, exercise3.qmd.
Note
💡 Tip: Use GitHub Desktop, git on the command line, or RStudio’s Git interface.
- Dataset: mtcars.
- Mean miles per gallon (mpg) by the number of cylinders (cyl).
- Scatterplot: mpg vs hp (horsepower).
- File: exercise5.R.
Note
💡 Tip: Look at aggregate() or dplyr::group_by() + summarise(). Use plot() for a quick scatterplot.
- Dataset: iris (you can import it from sklearn.datasets).
- File: exercise6.py.
Note
💡 Tip: pandas.DataFrame.groupby() is useful for summaries.
- File: exercise7.qmd.
- R summary (e.g. summary(mtcars)).
Note
💡 Tip: To render PDF, you may need LaTeX installed. Use quarto render exercise7.qmd --to pdf.
- File: exercise8.qmd.
- R: mean of mtcars$mpg.
- Python: loads iris and plots sepal length vs sepal width.
- Push the .qmd, the HTML output, and any supporting files to your GitHub repo first-exercises.
Note
💡 Goal: Practice combining R + Python code in Quarto, and making results reproducible via GitHub.
Note
🌟 Tip: The focus is on sharing your approach, not on having a perfect solution.
Thanks!
Optional: After class, try converting one old analysis to Quarto and push it to GitHub with a short README.
Example README.md:
# exercise5.R
# A. summary by cylinders
aggregate(mpg ~ cyl, data = mtcars, FUN = mean)
# Using dplyr (optional)
# library(dplyr)
# mtcars |>
# group_by(cyl) |>
# summarise(mean_mpg = mean(mpg))
# B. scatterplot mpg vs hp
plot(mtcars$hp, mtcars$mpg,
xlab = "Horsepower (hp)", ylab = "Miles per gallon (mpg)",
main = "mtcars: mpg vs hp")
# exercise6.py
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
# Load iris
iris_bunch = load_iris(as_frame=True)
df = iris_bunch.frame # columns: sepal length (cm), sepal width (cm), ...
df["species"] = df["target"].map(dict(enumerate(iris_bunch.target_names)))
# Average petal length per species
avg_petal = df.groupby("species")["petal length (cm)"].mean()
print(avg_petal)
# Histogram of sepal lengths
plt.figure()
df["sepal length (cm)"].hist()
plt.title("Histogram: Sepal Length")
plt.xlabel("Sepal length (cm)")
plt.ylabel("Count")
plt.tight_layout()
plt.show()
---
title: "Exercise 7"
author: "Your Name"
format:
html: default
pdf: default
---
This document shows basic R summaries and a Python plot.
```{r}
summary(mtcars)
```
```{python}
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
iris = load_iris(as_frame=True).frame
plt.figure()
plt.scatter(iris["sepal length (cm)"], iris["sepal width (cm)"])
plt.title("Iris: Sepal length vs width")
plt.xlabel("Sepal length (cm)")
plt.ylabel("Sepal width (cm)")
plt.tight_layout()
plt.show()
```
Render:
---
title: "Exercise 8 (Capstone)"
author: "Your Name"
format: html
---
This Quarto file combines R + Python and is tracked on GitHub.
```{r}
mean(mtcars$mpg)
```
```{python}
from sklearn.datasets import load_iris
import matplotlib.pyplot as plt
iris = load_iris(as_frame=True).frame
plt.figure()
plt.scatter(iris["sepal length (cm)"], iris["sepal width (cm)"])
plt.title("Iris: Sepal length vs width")
plt.xlabel("Sepal length (cm)")
plt.ylabel("Sepal width (cm)")
plt.tight_layout()
plt.show()
```
Push results:
HEC Lausanne · Business Analytics · Thu 9:00–12:00