Data & Code Management: From Collection to Application
2025-11-20
- Main responsibility: bring high-quality, analysis-ready data (`data_raw/`, `data/`, and scripts).
- Main responsibility: make the code reusable and documented (`DESCRIPTION` / `pyproject.toml` or similar).
- Main responsibility: communicate insights interactively.
Regardless of your lead role, everyone is responsible for:
All 3 students collaborate on the idea, the story, and the final presentation.
| Dimension | Points |
|---|---|
| Data gathering & wrangling | 10 |
| Package development & documentation | 10 |
| Dashboard / web app | 10 |
| Reproducibility & project management | 5 |
| Presentation & individual contribution | 5 |
| Total | 40 |
README with:
Tip
Shiny apps are just R code: a UI that declares outputs and inputs, and a server that reacts to them.
Source: https://shiny.posit.co/r/gallery
Source: https://shiny.posit.co/r/gallery
Source: https://deanattali.com/
Source: https://shiny.posit.co/
<div class="container-fluid">Hello, world!</div>
`shinyApp(ui, server)` (or `runApp()`).

We'll build a histogram of waiting time between eruptions (`faithful`).
Once the back-end logic is clear, design the UI.
| Function | Description |
|---|---|
| `fluidPage()` | Create a fluid page layout |
| `titlePanel()` | Application title |
| `sidebarLayout()` | Sidebar + main area layout |
| `sidebarPanel()` | Sidebar with inputs |
| `mainPanel()` | Main content (plots, tables, …) |
Variants of `fluidPage()`: `fixedPage()` (fixed width) and `fillPage()` (full height). `sidebarLayout()` is built on `fluidRow()`/`column()` for finer control.

Inputs are created with `someInput(inputId, …)` and accessed in the server as `input$inputId`.

| Function | Description |
|---|---|
| `numericInput()` | Number entry |
| `radioButtons()` | Radio selection |
| `selectInput()` | Dropdown menu |
| `sliderInput()` | Range/slider |
| `submitButton()` | Submission button |
| `textInput()` | Text box |
| `checkboxInput()` | Single checkbox |
| `dateInput()` | Date selector |
| `fileInput()` | Upload a file |
| `helpText()` | Describe input field |
Outputs declared in the UI are filled in the server via `output$outputId`.

| Output | Render | Description |
|---|---|---|
| `plotOutput()` | `renderPlot()` | Plots |
| `tableOutput()` | `renderTable()` | Simple tables |
| `textOutput()` | `renderText()` | Text |
| `uiOutput()` / `htmlOutput()` | `renderUI()` | HTML / dynamic UI |
| `verbatimTextOutput()` | `renderPrint()` | Console-like text |
| `imageOutput()` | `renderImage()` | Images |
# Define UI for application that draws a histogram
ui <- fluidPage(
titlePanel("Old Faithful Geyser Data"),
sidebarLayout(
sidebarPanel(
sliderInput(
inputId = "cells",
label = "Number of cells:",
min = 1, max = 50, value = 30
),
textInput(inputId = "label_x", label = "X-axis label:"),
textInput(inputId = "title", label = "Plot title:")
),
mainPanel(
plotOutput(outputId = "distPlot")
)
)
)
The server is `server <- function(input, output, session) { … }`; `input`/`output` are list-like and reactive.

ui <- fluidPage(
sliderInput(inputId = "cells", label = "Cells", min = 1, max = 50, value = 30),
textInput (inputId = "label_x", label = "X-axis label"),
textInput (inputId = "title", label = "Plot title"),
plotOutput(outputId = "distPlot")
)
server <- function(input, output, session) {
output$distPlot <- renderPlot({
x <- faithful[, 2]
breaks <- seq(min(x), max(x), length.out = input$cells + 1)
hist(x, breaks = breaks, col = 'darkgray', border = 'white',
xlab = input$label_x, main = input$title)
})
}

Every render recomputes `x` and `breaks`, even when only a label changes. Cache them with `reactive()`:

server <- function(input, output, session) {
x <- reactive(faithful[, 2]) # <- cached
breaks <- reactive(seq(min(x()), max(x()),
length.out = input$cells + 1))
output$distPlot <- renderPlot({
hist(x(), breaks = breaks(), col = 'darkgray', border = 'white',
xlab = input$label_x, main = input$title)
})
}

Note
Order of reactive expressions in the file does not matter (dependencies determine execution), but readability does.
`breaks()` updates only when `cells` (or `x()`) changes. Use `reactive()` to recompute only when needed.

To update only on demand, add an `actionButton` and use `eventReactive()` or `bindEvent()` to react only on button clicks.

With `eventReactive()`:

server <- function(input, output, session) {
x <- reactive(faithful[, 2])
breaks <- eventReactive(input$make_graph, {
seq(min(x()), max(x()), length.out = input$cells + 1)
})
xlab <- eventReactive(input$make_graph, input$label_x)
title <- eventReactive(input$make_graph, input$title)
output$distPlot <- renderPlot({
hist(x(), breaks = breaks(), col = 'darkgray', border = 'white',
xlab = xlab(), main = title())
})
}

With `bindEvent()` (Shiny ≥ 1.6.0):

library(magrittr)
server <- function(input, output, session) {
x <- reactive(faithful[, 2])
breaks <- reactive(seq(min(x()), max(x()), length.out = input$cells + 1)) %>%
bindEvent(input$make_graph)
xlab <- reactive(input$label_x) %>% bindEvent(input$make_graph)
title <- reactive(input$title) %>% bindEvent(input$make_graph)
output$distPlot <- renderPlot({
hist(x(), breaks = breaks(), col = 'darkgray', border = 'white',
xlab = xlab(), main = title())
})
}

`observeEvent()` works like `eventReactive()` but returns no value.

`shinyApp(ui = ui, server = server)` creates an app object and runs on print; you can also `runApp()` it. Save the app as `app.R` and run it from RStudio (Ctrl/Cmd + Shift + Enter). You can choose the view you prefer.
> Modules are one of the most powerful tools for building shiny applications in a maintainable and sustainable manner. (Engineering Production-Grade Shiny Apps)
library(shiny)
ui <- fluidPage(
selectInput("var", "Variable", names(mtcars)),
numericInput("bins", "bins", 10, min = 1),
plotOutput("hist")
)
server <- function(input, output, session) {
data <- reactive(mtcars[[input$var]])
output$hist <- renderPlot({
hist(data(), breaks = input$bins, main = input$var)
}, res = 96)
}
shinyApp(ui = ui, server = server)

- Modules are namespaced by an `id`.
- In the module UI: `ns <- NS(id)` and `ns("inputId")` for all IDs.
- `moduleServer(id, function(input, output, session) { ... })` defines the server logic.
- Use the same `id` in UI and server (`ns <- NS(id)`; wrap IDs with `ns()`).
- Namespacing means an outer `input$bins` (from the extra slider) does not affect the module's `input$bins`.
- A module can return a `reactive()` that you pass to another module.

histogramServer <- function(id, data, title = reactive("Histogram")) {
stopifnot(is.reactive(data))
moduleServer(id, function(input, output, session) {
output$hist <- renderPlot({
hist(data(), breaks = input$bins, main = title())
}, res = 96)
})
}
ui <- fluidPage(histogramUI("hist1"))
server <- function(input, output, session) {
data <- reactive(mtcars$mpg)
histogramServer("hist1", data)
}
shinyApp(ui, server)

- App-wide state can live in a `reactiveValues()` object passed around modules.
- In packaged apps, keep module code in `R/`, assets in `inst/app/`, etc.

shinyuieditor

`shinyuieditor` is an RStudio add-in to design Shiny UIs visually (drag & drop), then export clean UI code. Paste the generated `ui <- fluidPage(...)` code into your app and connect it to your server logic.

ChatGPT can scaffold a Shiny app from a description, e.g. "… `year` and `region`, and a main panel with a time-series plot and a summary table using my sales data."

- Keep your schema visible when prompting (`str()` / `glimpse()`).
- Ask for a single file (`app.R`) with `ui`, `server`, and `shinyApp(ui, server)`.
- Or: "Convert this to a `shinydashboard` layout."

Tip
Keep ChatGPT for boilerplate & layout; you remain responsible for data logic, validation, and interpretation.
Tip
Think of Streamlit as: Python script → reactive UI (without callbacks unless you need them).
You can explore examples here:
Streamlit App Gallery
Run: streamlit run app.py
- Text & media: `st.title`, `st.header`, `st.write`, `st.image`, `st.video`
- Inputs: `st.slider`, `st.selectbox`, `st.text_input`, `st.date_input`, `st.file_uploader`, …
- Data & charts: `st.table`, `st.dataframe`, `st.code`, `st.map`, `st.plotly_chart`, `st.pyplot`, …
- Layout: `st.sidebar`, `st.columns`, `st.container`, `st.tabs`, `st.expander`

Note
Each widget takes a key (optional) and writes to st.session_state[key].
import streamlit as st
import numpy as np
import matplotlib.pyplot as plt
st.title("Old Faithful (demo with random data)")
# Inputs
bins = st.slider("Number of bins", min_value=1, max_value=50, value=30, key="bins")
label_x = st.text_input("X-axis label", value="Value", key="label_x")
title = st.text_input("Plot title", value="Histogram", key="title")
# Data & plot
x = np.random.normal(size=272)
fig, ax = plt.subplots()
ax.hist(x, bins=bins, edgecolor="white")
ax.set_xlabel(label_x)
ax.set_title(title)
st.pyplot(fig)

Widget values are also mirrored in `st.session_state`.

Tip
Use caching (@st.cache_data, @st.cache_resource) to avoid recomputing expensive work on each rerun.
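The mechanism behind these decorators is memoization: results are stored keyed by the function's inputs, so a rerun with unchanged arguments becomes a lookup. A minimal plain-Python sketch of that idea (Streamlit's versions additionally hash arguments, serialize results, and honor `ttl`):

```python
from functools import lru_cache

CALLS = {"n": 0}  # counts how often the expensive body actually runs

@lru_cache(maxsize=None)
def expensive_summary(n: int) -> float:
    """Stand-in for slow work (loading a file, fitting a model, ...)."""
    CALLS["n"] += 1
    return sum(i * i for i in range(n)) / max(n, 1)

first = expensive_summary(1000)   # computed
second = expensive_summary(1000)  # same argument: served from the cache
```

Every Streamlit rerun executes the script top to bottom; caching turns repeated calls with identical arguments into cheap lookups instead of recomputation.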
- `cache_data`: memoize data/compute results; `ttl`/`show_spinner` options.
- `cache_resource`: memoize resources (clients, models, connections).
- `st.button`, `st.toggle` also gate heavy work.

session_state

- `on_change`/`on_click` run callbacks.
- `st.session_state` stores cross-rerun state.

import streamlit as st
st.sidebar.success("Filters go here")
left, right = st.columns([1, 2])
with left:
st.markdown("### Controls")
_ = st.selectbox("Metric", ["RMSE", "MAE", "R^2"])
with right:
st.markdown("### Chart")
st.line_chart({"y": [1, 3, 2, 4]})
with st.expander("Details"):
st.code("print('debug info')")

my_app/
app.py # Home page
pages/
1_Explore.py
2_Model.py
3_Report.py
Charts:

- Built-in: `st.line_chart`, `st.bar_chart`, `st.scatter_chart`.
- Library-specific: `st.pyplot(fig)`, `st.altair_chart(chart)`, `st.plotly_chart(fig)`.
- Custom `streamlit-components` (advanced).

Performance & structure:

- Cache with `@st.cache_data` and `@st.cache_resource`.
- Use `st.experimental_fragment` (if available) to isolate reruns on parts of the page.
- Limit `st.dataframe()` payloads; paginate or sample for previews.
- Stream long outputs with `st.write_stream` (if relevant) or background services.
- Test with `pytest`; type-check with `mypy`/`pyright`; lint with `ruff`.
- Organize code into packages (e.g. `app/`, `components/`, `services/`).

streamlit_app/
app.py
pages/
app/__init__.py
app/data.py # data loading (cached)
app/viz.py # plotting utilities
app/layout.py # column/tab builders
requirements.txt
pyproject.toml # optional (poetry/pdm)
- Docker: `FROM python:3.12-slim` → `pip install -r requirements.txt` → `streamlit run app.py --server.port 8501 --server.address 0.0.0.0`.
- Secrets: `st.secrets["key"]` for API tokens.
- Pitfall: reusing the same widget `key` → widgets clobber each other.

# app.py
import streamlit as st
import pandas as pd
import matplotlib.pyplot as plt
st.set_page_config(page_title="Template", layout="wide")
st.title("Project Title")
with st.sidebar:
st.header("Controls")
n = st.slider("N", 10, 5000, 500)
run = st.button("Run analysis")
@st.cache_data
def simulate(n):
import numpy as np
x = np.random.randn(n)
return pd.DataFrame({"x": x})
if run:
df = simulate(n)
st.dataframe(df.head())
fig, ax = plt.subplots()
ax.hist(df["x"], bins=30, edgecolor="white")
st.pyplot(fig)
else:
st.info("Adjust parameters and click **Run analysis**.")
Tip
Throughout, copy/paste prompt patterns from the “Cheatsheet” slides. Keep your schema visible when prompting.
Gather electricity prices (`price`) and inflation (`cpi`) for 3–4 countries, 2019–2025. Target schema: `date`, `country`, `metric`, `value` (month-end dates, units documented).

Where would you look for these data?
If you don’t have data: use the synthetic fallback in the next slides (identical schema).
Metrics: `price` (electricity price index) and `cpi` (inflation index). Deliverable: `data/golden.parquet` with columns `date`, `country`, `metric`, `value`, plus sanity checks.

# python/synth_and_tests.py
# Reproducible, realistic synthetic monthly data (2019-01 to 2025-12)
import os, numpy as np, pandas as pd
rng = np.random.default_rng(123)
countries = ["CH", "FR", "DE", "IT"]
dates = pd.period_range("2019-01", "2025-12", freq="M").to_timestamp(how="end").normalize()  # month-end dates
n = len(dates)
t = np.arange(n)
def price_series(idx: int) -> np.ndarray:
# Electricity price index: trend + seasonality + 2022 shock + partial 2023 unwind + mild AR-like noise
base = 100 + 0.8*idx
slope = 0.12 + 0.03*idx # ~0.12–0.21 index pts/month
season_amp = 2.0 + 0.6*idx # stronger seasonality for some countries
seasonal = season_amp * np.sin(2*np.pi*(t % 12)/12)
shock = np.zeros(n)
m2022 = (dates >= "2022-01-31") & (dates <= "2022-12-31")
ramp = np.linspace(0, 12 + 3*idx, m2022.sum()) # build-up through 2022
shock[m2022] = ramp
m2023 = (dates >= "2023-01-31") & (dates <= "2023-12-31")
shock_end_2022 = ramp[-1] if ramp.size else 0
unwind = np.linspace(shock_end_2022, 6 + 1.5*idx, m2023.sum()) # partial normalization through 2023
shock[m2023] = unwind
# 2024–2025: hold residual shock level
eps = rng.normal(0, 0.6, n) # small noise with persistence
noise = np.cumsum(eps) * 0.12
return base + slope*t + seasonal + shock + noise
def cpi_series(idx: int) -> np.ndarray:
# Inflation index: gradual drift, 2022 bump, 2023 cool-down, light seasonality/noise
base = 100 + 0.4*idx
drift = np.linspace(0, 10 + 2*idx, n) # ~+10–16 pts by 2025
bump = np.zeros(n)
m2022 = (dates >= "2022-01-31") & (dates <= "2022-12-31")
bump[m2022] = np.linspace(0, 2.8 + 0.5*idx, m2022.sum())
cool = np.zeros(n)
m2023 = (dates >= "2023-01-31") & (dates <= "2023-12-31")
cool[m2023] = -np.linspace(0, 1.2, m2023.sum())
seasonal = 0.5 * np.sin(2*np.pi*(t % 12)/12)
noise = np.cumsum(rng.normal(0, 0.15, n)) * 0.08
return base + drift + bump + cool + seasonal + noise
rows = []
for idx, c in enumerate(countries):
p = price_series(idx)
cp = cpi_series(idx)
rows += [(d, c, "price", float(v)) for d, v in zip(dates, p)]
rows += [(d, c, "cpi", float(v)) for d, v in zip(dates, cp)]
df = pd.DataFrame(rows, columns=["date", "country", "metric", "value"]).sort_values(["country","metric","date"])
os.makedirs("data", exist_ok=True)
df.to_parquet("data/golden.parquet", index=False)
# Sanity checks
assert set(df["country"]) == set(countries)
assert df["date"].nunique() == n
for c in countries:
for m in ["price", "cpi"]:
sub = df[(df.country == c) & (df.metric == m)]
assert len(sub) == n, f"Missing months for {c}-{m}"
assert np.isfinite(df["value"]).all()
print(f"OK: {len(df):,} rows · {len(countries)} countries × 2 metrics × {n} months")

# r/synth.R
# Reproducible, realistic synthetic monthly data (2019-01 to 2025-12)
library(dplyr); library(lubridate); library(purrr); library(arrow)
set.seed(123)
countries <- c("CH","FR","DE","IT")
dates <- seq(ymd("2019-01-31"), ymd("2025-12-31"), by = "1 month")
n <- length(dates)
t <- seq_len(n) - 1
price_series <- function(idx){
base <- 100 + 0.8*idx
slope <- 0.12 + 0.03*idx # ~0.12–0.21 per month
season_amp <- 2.0 + 0.6*idx
seasonal <- season_amp * sin(2*pi*((t) %% 12)/12)
shock <- rep(0, n)
m2022 <- dates >= ymd("2022-01-31") & dates <= ymd("2022-12-31")
ramp <- seq(0, 12 + 3*idx, length.out = sum(m2022))
shock[m2022] <- ramp
m2023 <- dates >= ymd("2023-01-31") & dates <= ymd("2023-12-31")
shock_end_2022 <- if (length(ramp)) tail(ramp, 1) else 0
unwind <- seq(shock_end_2022, 6 + 1.5*idx, length.out = sum(m2023))
shock[m2023] <- unwind
eps <- cumsum(rnorm(n, 0, 0.6)) * 0.12 # mild persistence
base + slope*t + seasonal + shock + eps
}
cpi_series <- function(idx){
base <- 100 + 0.4*idx
drift <- seq(0, 10 + 2*idx, length.out = n) # ~+10–16 by 2025
bump <- rep(0, n)
m2022 <- dates >= ymd("2022-01-31") & dates <= ymd("2022-12-31")
bump[m2022] <- seq(0, 2.8 + 0.5*idx, length.out = sum(m2022))
cool <- rep(0, n)
m2023 <- dates >= ymd("2023-01-31") & dates <= ymd("2023-12-31")
cool[m2023] <- -seq(0, 1.2, length.out = sum(m2023))
seasonal <- 0.5 * sin(2*pi*((t) %% 12)/12)
noise <- cumsum(rnorm(n, 0, 0.15)) * 0.08
base + drift + bump + cool + seasonal + noise
}
df <- map2_dfr(countries, seq_along(countries) - 1, function(c, idx){
tibble(
date = rep(dates, 2),
country = c,
metric = rep(c("price","cpi"), each = n),
value = c(price_series(idx), cpi_series(idx))
)
}) %>% arrange(country, metric, date)
dir.create("data", showWarnings = FALSE)
arrow::write_parquet(df, "data/golden.parquet")
# Sanity checks
stopifnot(length(unique(df$date)) == n)
stopifnot(setequal(unique(df$country), countries))
stopifnot(all(df %>% count(country, metric) %>% pull(n) == n))
stopifnot(all(is.finite(df$value)))
message(sprintf("OK: %s rows · %s countries × 2 metrics × %s months", nrow(df), length(countries), n))

A — Wireframes & KPIs
Act as a visualization TA. Given columns
`date, country, metric, value`, propose 3 dashboard wireframes (KPI strip + main trend + small multiples). Output: titles, annotation ideas, and accessibility notes.
B — Tidy & Validate (already done with the synthetic data)
Write code to: (1) coerce month‑end dates, (2) detect duplicates on
(date, country, metric), (3) export data/golden.parquet, and (4) print a concise validation summary.
C — Chart Critique
Given this code + chart description, propose 3 concrete improvements (titles/scales/labels) and show the revised code.
D — App Scaffold
Generate a one‑page Streamlit/Shiny app with country & metric selectors, a KPI card (last value + YoY%), main chart, and a download button. Keep it minimal and fast.
E — Publication Audit
Audit the chart/app against this checklist (titles, axes/units, color‑blind safety, footnotes, last updated). Return exact fixes (code + text) and a 2‑sentence executive summary.
# python/tidy.py
import pandas as pd, numpy as np, os
os.makedirs("data", exist_ok=True)
# Try to load; otherwise create synthetic fallback
try:
df = pd.read_parquet("data/golden.parquet")
except Exception:
dates = pd.period_range("2019-01", "2025-06", freq="M").to_timestamp(how="end").normalize()  # month-end dates
countries = ["CH","FR","DE"]
rng = np.random.default_rng(0)
rows = []
for c in countries:
base = 100 + rng.normal(0,1)
price = base + np.cumsum(rng.normal(0,1,len(dates))) + 0.15*np.arange(len(dates))
cpi = 100 + np.cumsum(np.clip(rng.normal(0,0.3,len(dates)), -0.2, 0.8))
rows += [(d,c,"price",p) for d,p in zip(dates,price)]
rows += [(d,c,"cpi",v) for d,v in zip(dates,cpi)]
df = pd.DataFrame(rows, columns=["date","country","metric","value"])
# Coerce to month‑end, check duplicates
df["date"] = pd.to_datetime(df["date"]).dt.to_period("M").dt.to_timestamp(how="end").dt.normalize()
dups = df.duplicated(["date","country","metric"]).sum()
print(f"Rows: {len(df):,} · Duplicates: {dups}")
# Save tidy table
df.to_parquet("data/golden.parquet", index=False)

# r/tidy.R
library(dplyr); library(lubridate); library(arrow)
if (!dir.exists("data")) dir.create("data")
if (!file.exists("data/golden.parquet")) {
dates <- seq(ymd("2019-01-31"), ymd("2025-06-30"), by="1 month")
mk <- function(n) cumsum(rnorm(n, 0, 1))
make_country <- function(c) {
tibble(date=dates, country=c, metric="price", value=100 + mk(length(dates)) + 0.15*seq_along(dates)) |>
bind_rows(tibble(date=dates, country=c, metric="cpi", value=100 + cumsum(pmax(rnorm(length(dates),0,0.3),-0.2))))
}
df <- bind_rows(lapply(c("CH","FR","DE"), make_country))
write_parquet(df, "data/golden.parquet")
}
df <- read_parquet("data/golden.parquet") |>
mutate(date = ceiling_date(as.Date(date), "month") - days(1))
message("Duplicates: ", sum(duplicated(df[c("date","country","metric")])))
write_parquet(df, "data/golden.parquet")

# python/chart.py
import os
import altair as alt, pandas as pd
os.makedirs("figs", exist_ok=True)  # chart.save() expects the output folder to exist
alt.data_transformers.disable_max_rows()
df = pd.read_parquet("data/golden.parquet")
price = df.query("metric=='price'")
chart = (alt.Chart(price)
.mark_line()
.encode(
x=alt.X("date:T", title=None),
y=alt.Y("value:Q", title="Electricity price index (2019-12=100)"),
color=alt.Color("country:N", legend=alt.Legend(orient="bottom"))
)
.properties(width=760, height=340, title="Electricity price index by country")
)
chart.save("figs/price_trend.svg")

Note
Ask ChatGPT for title/subtitle copy, annotation ideas, and accessibility tweaks (line styles, label size, legend placement).
# r/chart.R
library(ggplot2); library(dplyr); library(arrow)
df <- read_parquet("data/golden.parquet") |> filter(metric=="price")
ggplot(df, aes(date, value, color=country)) +
geom_line(linewidth=.9) +
labs(title="Electricity price index by country",
subtitle="Monthly, 2019–2025; baseline 2019-12=100",
x=NULL, y="Index (2019-12=100)") +
theme_minimal(base_size = 13) +
theme(legend.position="bottom")
dir.create("figs", showWarnings = FALSE)
ggsave("figs/price_trend.svg", width=8.5, height=4.2, dpi=300)

# app.py
import streamlit as st, pandas as pd, altair as alt
@st.cache_data
def load(): return pd.read_parquet("data/golden.parquet")
df = load()
st.title("Energy & Economy Brief")
country = st.selectbox("Country", sorted(df["country"].unique()))
metric = st.selectbox("Metric", sorted(df["metric"].unique()))
sub = df[(df.country==country) & (df.metric==metric)].sort_values("date")
last = sub.value.iloc[-1]
yoy = None
if len(sub) > 12:
yoy = (sub.value.iloc[-1]/sub.value.iloc[-13]-1)*100
st.metric("Last value", f"{last:.1f}", f"{yoy:+.1f}% YoY" if yoy is not None else None)
st.altair_chart(alt.Chart(sub).mark_line().encode(x="date:T", y="value:Q"), use_container_width=True)
st.download_button("Download CSV", sub.to_csv(index=False).encode(), file_name=f"{country}_{metric}.csv")
# run: streamlit run app.py

# app.R
library(shiny); library(dplyr); library(ggplot2); library(arrow)
df <- read_parquet("data/golden.parquet")
ui <- fluidPage(
titlePanel("Energy & Economy Brief"),
sidebarLayout(
sidebarPanel(
selectInput("country","Country", choices = sort(unique(df$country))),
selectInput("metric","Metric", choices = sort(unique(df$metric)))
),
mainPanel(
uiOutput("kpi"), plotOutput("main"), downloadButton("dl","Download CSV")
)
)
)
server <- function(input, output, session){
dat <- reactive(df |> filter(country==input$country, metric==input$metric) |> arrange(date))
output$kpi <- renderUI({
d <- dat(); if(nrow(d)<1) return(NULL)
last <- tail(d$value,1); yoy <- if(nrow(d)>12) round((last/d$value[nrow(d)-12]-1)*100,1) else NA
tagList(h4(sprintf("Last: %.1f", last)), if(!is.na(yoy)) h5(sprintf("YoY: %.1f%%", yoy)))
})
output$main <- renderPlot({ ggplot(dat(), aes(date,value)) + geom_line(linewidth=.9) + theme_minimal(13) })
output$dl <- downloadHandler(
filename = function() paste0(input$country,"_",input$metric,".csv"),
content = function(file) write.csv(dat(), file, row.names = FALSE)
)
}
shinyApp(ui, server)

Pin dependencies in `requirements.txt` / `{renv}` lock; record decisions in the README.

Prompt: Audit the app & chart against this checklist and output exact fixes with revised code blocks.
workshop/
├─ data/ # golden.parquet lives here
├─ python/ # tidy.py, chart.py
├─ r/ # tidy.R, chart.R
├─ figs/
├─ app.py # Streamlit app (or app.R for Shiny)
├─ README.md
├─ requirements.txt # pandas, altair, streamlit, pyarrow
└─ renv.lock # if using R
Run (Python): `python python/tidy.py` → `python python/chart.py` → `streamlit run app.py`
Run (R): `Rscript r/tidy.R` → `Rscript r/chart.R` → `Rscript app.R`
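Both apps compute the same KPI: the latest value and its year-over-year change on a monthly series. That logic can be checked in plain pandas, independent of any UI framework; `kpi_last_and_yoy` and the demo frame below are illustrative, using the golden schema's column names:

```python
import pandas as pd

def kpi_last_and_yoy(sub: pd.DataFrame):
    """Return (last value, YoY % or None) for one country/metric slice."""
    sub = sub.sort_values("date")
    last = float(sub["value"].iloc[-1])
    if len(sub) > 12:
        # Compare with the value 12 months earlier
        yoy = (sub["value"].iloc[-1] / sub["value"].iloc[-13] - 1) * 100
        return last, float(yoy)
    return last, None

# Demo: flat for 13 months, then doubled -> YoY of +100%
dates = pd.period_range("2023-01", "2024-02", freq="M").to_timestamp(how="end")
demo = pd.DataFrame({"date": dates, "value": [100.0] * 13 + [200.0]})
last, yoy = kpi_last_and_yoy(demo)
```

With fewer than 13 months of data the YoY figure is undefined, which is why both apps guard with `len(sub) > 12`.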
Same goal as Shiny and Streamlit: interactive dashboards and data apps.
Different positioning:
In many companies, Power BI is the default front-end for business dashboards.
Tip
In this course: Shiny & Streamlit for programmable apps, Power BI for enterprise dashboards and data wrangling.
Dashboards are visual displays of the most important information to reach one or more objectives, shown on a single screen for at-a-glance monitoring.
Strongly influenced by Stephen Few (Show Me the Numbers, Information Dashboard Design).
Core principles:
| Question | Power BI | Shiny | Streamlit |
|---|---|---|---|
| Target users | Business users, managers | Statisticians, R users | Data scientists, Python users |
| Main strength | Self-service BI, data models | Statistical workflows in R | ML & prototyping in Python |
| Governance & sharing | Very strong (Power BI Service) | Needs server / Shiny Server | Needs server / Streamlit Cloud |
| Level of coding | Low/medium (M, DAX, some R/Py) | Medium/high (R) | Medium/high (Python) |
| Best for | KPI dashboards, corporate views | Custom modelling tools | Quick interactive experiments |
In practice, you often combine them:
Power BI Desktop (what we use in class):
Power BI Service / Fabric (online):
We will focus on Desktop + Power Query and show where R / Python fit.
Power Query (M language)
DAX (Data Analysis Expressions)
Analogy:
Like `group_by` + `summarise`, but evaluated on-the-fly in visuals.

In many DA projects you will:
Import data (Excel, CSV, SQL, SAP extracts, folders of files, …).
Clean & transform:
Model the data (relations, star schemas).
Visualize with charts and tables.
Export / reuse the prepared data (Power BI visuals, CSVs, R/Python).
Power BI + Power Query cover steps 1–4, and we can integrate R/Python in 2, 4 & 5.
Use Home → Get data.
Common sources in this course:
- Text/CSV files (delimiters: `,`, `;`, tab, …).

After import you can:
The Power Query editor shows:
Key ideas:
Each column has a type (text, number, date, …).
Use the Column quality / distribution / profile options to:
Fix suspicious types early (e.g. IDs imported as numbers instead of text).
This is the Power BI counterpart of doing str() / summary() / skim() in R or df.info() / df.describe() in Python.
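For reference, the pandas side of that comparison on a tiny illustrative table: `dtypes` answers the type question and `describe()` the distribution question.

```python
import pandas as pd

df = pd.DataFrame({
    "vendor_id": ["0001", "0002", "0002"],  # kept as text so leading zeros survive
    "amount": [120.5, 80.0, 80.0],
})

schema = df.dtypes.astype(str).to_dict()  # column types, like Power Query's type row
summary = df["amount"].describe()         # count / mean / min / max, like column profiling
```

An ID column silently imported as a number (dropping leading zeros) is exactly the kind of issue both tools let you catch early.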
Use this when you have multiple files with the same structure:
Get data → Folder and select the folder with files.

Watch out:
Home → Merge queries to join tables.
Typical use case: add master data (e.g. customers, products) to a transaction table.
Choose:
Then expand the resulting column and select the fields you need.
Equivalent to left_join, inner_join, etc. in dplyr or merge in pandas.
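As a reference point, the same joins sketched in pandas on hypothetical transaction and master-data tables:

```python
import pandas as pd

transactions = pd.DataFrame({"customer_id": [1, 2, 2, 3], "amount": [10, 20, 5, 7]})
customers = pd.DataFrame({"customer_id": [1, 2], "name": ["Alice", "Bob"]})

# Left join: keep every transaction, pull in master data where it matches
enriched = transactions.merge(customers, on="customer_id", how="left")

# Inner join: keep only transactions with a known customer
matched = transactions.merge(customers, on="customer_id", how="inner")
```

Choosing the join kind in Power Query's Merge dialog corresponds exactly to the `how` argument here.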
Two main options:
In Power Query:
Transform → Group by (switch to Advanced).

In Power BI visuals:
Power Query summaries are static (pre-computed), DAX summaries are dynamic (reactive to filters).
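The pre-computed flavor corresponds to an explicit group-by in code; a pandas sketch on illustrative data:

```python
import pandas as pd

sales = pd.DataFrame({
    "country": ["CH", "CH", "FR", "FR"],
    "amount":  [100.0, 50.0, 70.0, 30.0],
})

# Static summary, computed once -- like Transform -> Group by in Power Query
summary = (sales.groupby("country", as_index=False)
                .agg(total=("amount", "sum"), n=("amount", "size")))
```

A DAX measure would instead re-evaluate the sum inside whatever filter context the visual and slicers apply.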
Pivot columns (wide): turn values in a column into separate columns.
Unpivot columns (long): turn multiple columns into key–value pairs.
These operations mirror pivot_wider / pivot_longer in tidyr or melt/pivot in pandas.
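Sketched in pandas with the course's long schema (values illustrative):

```python
import pandas as pd

long = pd.DataFrame({
    "date":   ["2024-01", "2024-01", "2024-02", "2024-02"],
    "metric": ["price", "cpi", "price", "cpi"],
    "value":  [110.0, 104.0, 112.0, 105.0],
})

# Pivot (long -> wide): one column per metric
wide = long.pivot(index="date", columns="metric", values="value").reset_index()

# Unpivot (wide -> long): back to key-value pairs
back = wide.melt(id_vars="date", var_name="metric", value_name="value")
```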
Options:
Export from a visual (up to ~30k rows):
… → Export data.

Use R or Python in Power Query to write larger tables out:
Transform → Run R script or Transform → Run Python script.

We will use this to bridge from Power BI to external R/Python analyses.
In Power Query, on a referenced query, add Run R script and use:
# 'dataset' holds the input data for this script
local_packages <- installed.packages()[, 1]
if (!"writexl" %in% local_packages) {
install.packages("writexl")
}
if (nrow(dataset) <= 500000) {
writexl::write_xlsx(dataset, "C:/temp/data.xlsx")
} else {
write.table(dataset, "C:/temp/data.csv", row.names = FALSE, sep = ",")
}

Make sure the target folder (`C:/temp/`) exists and is writable; the input table is available as `dataset`.

Similarly, using Run Python script in Power Query:
The input is again available as `dataset`.

You can integrate R and Python in Power BI in three main ways:
This allows you to:
Before you start:
Install R (e.g. from CRAN) and / or Python (e.g. from Anaconda or python.org).
In File → Options and settings → Options:
(Optional) Set up a dedicated virtual environment for Power BI.
Limitations:
Home → Get data → More… → Other → R script.

Example:
Home → Get data → More… → Other → Python script.

In Power Query: Transform → Run R script or Transform → Run Python script.
Power BI passes the current table as:
- `dataset` (R): a data frame.
- `dataset` (Python): a pandas DataFrame.

Pattern:
The table assigned to `output` is then used as the next step.

Equivalent to the VMD duplicate bank account example:
# 'dataset' contains vendor master data with BANKS, BANKL, BANKN
library(dplyr)
# count rows per bank account
counts <- dataset %>%
group_by(BANKS, BANKL, BANKN) %>%
summarise(n = n(), .groups = "drop") %>%
filter(n > 1)
# join back to keep only vendors with duplicated accounts
output <- dataset %>%
inner_join(counts, by = c("BANKS", "BANKL", "BANKN")) %>%
arrange(BANKN, BANKL, BANKS)

This reproduces the Power Query steps using R code.
# 'dataset' contains vendor master data with BANKS, BANKL, BANKN
import pandas as pd
counts = (
dataset
.groupby(["BANKS", "BANKL", "BANKN"], as_index=False)
.size()
.rename(columns={"size": "n"})
)
counts = counts[counts["n"] > 1]
output = (
dataset.merge(counts, on=["BANKS", "BANKL", "BANKN"], how="inner")
.sort_values(["BANKN", "BANKL", "BANKS"])
)

Again, `output` becomes the next Power Query step.
On the report canvas you can:
The selected fields arrive as `dataset` (data frame / DataFrame).

Use this to:
Note: visuals are rendered as static images; interactive HTML widgets are not supported.
This is very similar to doing a scatter plot in a Shiny app, but embedded directly in Power BI.
# 'dataset' contains columns: Date, Sales
import matplotlib.pyplot as plt
import pandas as pd
fig, ax = plt.subplots()
# Ensure Date is datetime and sorted
data = dataset.copy()
data["Date"] = pd.to_datetime(data["Date"])
data = data.sort_values("Date")
ax.plot(data["Date"], data["Sales"])
ax.set_xlabel("Date")
ax.set_ylabel("Sales")
ax.set_title("Sales over time")
fig.autofmt_xdate()
plt.show()

This mirrors a basic Streamlit time series plot.
A realistic project might look like:
Exploration & modelling:
Industrialization:
Dashboarding:
For a Power BI + R/Python project you can follow this template:
Create a .pbix file with:
Add one or two R/Python visuals for:
Optionally, add a Power Query R/Python step to export a cleaned dataset.
Document:
This chapter complements the Shiny and Streamlit chapters by showing how to operate within a corporate BI tool while still leveraging your R and Python skills.
Goal: Build one clear, single‑page dashboard from data/golden.parquet.
Schema (long): date (month end), country (CH/FR/DE/IT), metric (price/cpi), value (index).
Steps
Load data
- Get data → `data/golden.parquet`.
- Set column types: `date` (Date), `country` (Text), `metric` (Text), `value` (Decimal).

Add slicers (left rail)
- `country` → set to List → Don't summarize.
- `metric` → set to Dropdown → Don't summarize.

KPI Cards (no DAX yet)
- Card: `value` (Don't summarize) → Title: Current value.
- Card: `value` again → Title: Total selected period.
- Card: `value` → Title: (placeholder).

These are placeholders before enabling time intelligence in Steps 2–4.
Main line chart
- X-axis: `date` (set Type = Continuous).
- Y-axis: `value`.
- Legend: `country`.

Check that `date` is Date; add a `MonthEnd` column in Power Query if needed:

let
Source = Parquet.Document(File.Contents("data/golden.parquet")),
#"Changed Type" = Table.TransformColumnTypes(Source, {{"date", type date}, {"country", type text}, {"metric", type text}, {"value", type number}}),
#"Added MonthEnd" = Table.AddColumn(#"Changed Type", "MonthEnd", each Date.EndOfMonth([date]), type date)
in
#"Added MonthEnd"

Create a Date table in DAX:

Date =
VAR MinDate =
DATE ( YEAR ( MIN ( 'data'[date] ) ), MONTH ( MIN ( 'data'[date] ) ), 1 )
VAR MaxDate =
EOMONTH ( MAX ( 'data'[date] ), 0 )
RETURN
ADDCOLUMNS (
CALENDAR ( MinDate, MaxDate ),
"Year", YEAR ( [Date] ),
"MonthNo", MONTH ( [Date] ),
"Month", FORMAT ( [Date], "MMM" ),
"YearMonth", FORMAT ( [Date], "YYYY-MM" ),
"MonthEnd", EOMONTH ( [Date], 0 )
)

Value := SUM('data'[value])
Last Visible Date := MAX('Date'[Date])
Current Value := CALCULATE([Value], KEEPFILTERS('Date'[Date] = [Last Visible Date]))
YoY % :=
VAR Cur = [Value]
VAR Prev = CALCULATE([Value], DATEADD('Date'[Date], -1, YEAR))
RETURN DIVIDE(Cur - Prev, Prev)
Rolling 3M Avg :=
AVERAGEX(
DATESINPERIOD('Date'[Date], [Last Visible Date], -3, MONTH),
[Value]
)

Update the KPI cards to use the new measures:
Left column: Country slicer, Metric slicer. Top strip: Cards (Current Value, YoY %, 3‑month Avg). Main area: Line chart (Index, legend = country). Right column: Optional bar chart (YoY % latest by country).
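To sanity-check the DAX measures against raw data, the trailing 3-month average (the `DATESINPERIOD(..., -3, MONTH)` window) can be reproduced in pandas on an illustrative series:

```python
import pandas as pd

dates = pd.period_range("2024-01", "2024-06", freq="M").to_timestamp(how="end")
s = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0, 60.0], index=dates)

# Trailing window of 3 months including the current one
rolling_3m = s.rolling(window=3, min_periods=1).mean()
last_3m_avg = rolling_3m.iloc[-1]  # average of the Apr, May, Jun values
```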
Add an R visual with fields `data[date]`, `data[country]`, `data[value]` (all Don't summarize).

library(dplyr)
library(ggplot2)
df <- dataset %>%
rename(date = data.date,
country = data.country,
value = data.value) %>%
mutate(date = as.Date(date))
ggplot(df, aes(date, value, color = country)) +
geom_line(linewidth = 0.9) +
labs(title = "Electricity index by country",
subtitle = "Monthly values",
x = NULL, y = "Index") +
theme_minimal(base_size = 12)

Add a Python visual with fields `data[date]`, `data[country]`, `data[value]` (Don't summarize).

import pandas as pd
import matplotlib.pyplot as plt
df = dataset.copy()
df.columns = [c.replace('data.', '') for c in df.columns]
df['date'] = pd.to_datetime(df['date'])
plt.figure()
for key, grp in df.groupby('country'):
plt.plot(grp['date'], grp['value'], label=key, linewidth=1.2)
plt.title("Electricity index by country\nMonthly values")
plt.xlabel("")
plt.ylabel("Index")
plt.legend(loc="lower center", ncol=4)
plt.tight_layout()

When running the script outside Power BI, remove Power BI's prolog (e.g., `matplotlib.use('Agg')`) and add `plt.show()`.
HEC Lausanne · Business Analytics · Thu 9:00–12:00