
Data & Code Management: From Collection to Application
2025-10-16
Topics: SQL (SELECT, WHERE, ORDER BY, LIMIT, GROUP BY, HAVING, JOIN), APIs, and web scraping (R rvest, Python BeautifulSoup).

| Type | Description |
|---|---|
| Transactional | Your data: records of exchanges |
| Behavioral | Your data: events & interactions |
| Public Data | Open datasets |
| Premium Data | Paid datasets |
💳 Transactional (records of exchanges)
- What it looks like: order lines, invoices, payments
- Typical fields: order_id, customer_id, sku, qty, price, timestamp
- Example dataset: UCI Online Retail II (real e-commerce transactions)
https://archive.ics.uci.edu/datasets?search=Online+Retail
👣 Behavioral (events & interactions)
- What it looks like: page views, clicks, scrolls, add-to-cart
- Typical fields: user_id, session_id, event_name, event_params, ts
- Example datasets/docs: GA BigQuery sample (web/app) and Snowplow events model
https://developers.google.com/analytics/bigquery/web-ecommerce-demo-dataset/ • https://docs.snowplow.io/docs/fundamentals/events/
🏛️ Public Data (open gov & orgs)
- What it looks like: macro indicators, demographics, prices, geo
- Where to find: opendata.swiss, World Bank WDI, Eurostat
https://www.bfs.admin.ch/bfs/en/home/services/ogd/portal.html • https://data.worldbank.org/indicator • https://ec.europa.eu/eurostat/data
💼 Premium Data (licensed/commercial)
- What it looks like: real-time market data, fundamentals, consumer panels
- Providers (examples): Bloomberg Terminal, S&P Global Market Intelligence
https://www.bloomberg.com/professional/products/bloomberg-terminal/ • https://www.spglobal.com/market-intelligence/en
In an organization, data is typically stored in:
There exist many tools for Big Data analysis. Some of them are:

Most common data management platforms:
On-Premises or Cloud-Based


Suppose that:
- country = {India, …} is of length 1,000,
- pollutant = {NO2, …} is of length 100,
- particles = {100, 90, …} is of length 10,000.
Which filter in the WHERE clause would you run first?
The query scheduler takes care of it; we don't have to worry about this.
| Country | Pollutant | Particles |
|---|---|---|
| USA | NO2 | 90 |
| USA | CO2 | 100 |
| … | … | … |
| India | NO2 | 200 |
| India | NO2 | 210 |
| India | CO2 | 110 |
| … | … | … |
| Switzerland | CO2 | 20 |
| Switzerland | CO2 | 20 |
| … | … | … |
Before we begin:
In the Google Cloud console, select or create a Google Cloud project.
BigQuery has lots of public data available. To access it, click on ADD towards top left:

Enter “public dataset” in the search bar:

This gives access to the bigquery-public-data.openaq.global_air_quality dataset. To start querying the dataset, you can click on the “Home” button, then click on “SQL QUERY”:

Now you write SQL commands in the query window and click on “RUN” to execute.
We asked ChatGPT to comment on the SQL query:
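The query under discussion, reconstructed from the explanation below:

SELECT city
FROM `bigquery-public-data.openaq.global_air_quality`
WHERE country = 'CH'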
SELECT city: Specifies that we want to retrieve only the values in the city column.
FROM 'bigquery-public-data.openaq.global_air_quality': Indicates the source table where the data is stored. In this case, it’s the global_air_quality table, which is part of the openaq dataset within bigquery-public-data on Google BigQuery.
WHERE country = 'CH': This is a filter condition that limits the results to only those rows where the country column has the value ‘CH’ (the country code for Switzerland).

Query: Records in India with no2 value more than 100.
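A possible solution, sketched with the columns (country, pollutant, value) used elsewhere in this deck; check the exact pollutant spelling ('no2' vs 'NO2') against the table:

SELECT *
FROM `bigquery-public-data.openaq.global_air_quality`
WHERE country = 'IN'
  AND pollutant = 'no2'
  AND value > 100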
Query: What is the average no2 pollutant value in the US? In India?

What about Switzerland? What about the max?
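A sketch for the averages; swap the country code ('US', 'IN', 'CH') as needed, and replace AVG with MAX for the maximum:

SELECT AVG(value) AS avg_no2
FROM `bigquery-public-data.openaq.global_air_quality`
WHERE country = 'US'
  AND pollutant = 'no2'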
Query: Number of bitcoin transactions per date, sorted by date.
Query: What is the average no2 pollutant for each city in the US?
GROUP BY city: groups the rows by city so that aggregates are computed per city. Without GROUP BY, the AVG() function would calculate the average across all rows in the dataset, rather than for each individual city.
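A possible query, sketched with the same columns as above:

SELECT city, AVG(value) AS avg_no2
FROM `bigquery-public-data.openaq.global_air_quality`
WHERE country = 'US'
  AND pollutant = 'no2'
GROUP BY city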
Query: Which cities in India have an average no2 pollutant > 50?
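A sketch using HAVING to filter the groups produced by GROUP BY:

SELECT city, AVG(value) AS avg_no2
FROM `bigquery-public-data.openaq.global_air_quality`
WHERE country = 'IN'
  AND pollutant = 'no2'
GROUP BY city
HAVING AVG(value) > 50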
Query: What is the average rental time per station of the bike share program in San Francisco?
(Look for the San Francisco Bikeshare dataset)
Query: Which day of the week has the most fatal motor accidents in 2020?
(Look for the NHTSA Traffic Fatalities dataset)
Query: What is the average rental time per station of the bike share program in San Francisco?
SELECT stations.name AS station_name,
AVG(trips.duration_sec / 60) AS avg_rent_time_minutes
FROM `bigquery-public-data.san_francisco.bikeshare_trips` AS trips
JOIN `bigquery-public-data.san_francisco.bikeshare_stations` AS stations
ON trips.start_station_id = stations.station_id
GROUP BY station_name
ORDER BY avg_rent_time_minutes DESC

Notes: we need a JOIN operation to merge the two tables (we will see JOIN in the next slides); we could simply return the station_id instead. Note also that we use the bigquery-public-data.san_francisco dataset instead of the bigquery-public-data.san_francisco_bikeshare dataset.
Query: Which day of the week has the most fatal motor accidents in 2020?
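A sketch using EXTRACT; the table and column names (accident_2020, timestamp_of_crash) are assumptions to verify in the console, and the original slide's query (not shown) also used FORMAT_TIMESTAMP:

SELECT EXTRACT(DAYOFWEEK FROM timestamp_of_crash) AS day_of_week,
       COUNT(*) AS num_accidents
FROM `bigquery-public-data.nhtsa_traffic_fatalities.accident_2020`
GROUP BY day_of_week
ORDER BY num_accidents DESC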

The query uses functions such as FORMAT_TIMESTAMP, TIMESTAMP and EXTRACT. These might not always be available in all SQL databases.

Joins allow merging results from different tables. It is required to specify the “key” linking the tables.

Query: Find the top 5 longest bike rides in Austin, including information about stations.
Which field name should we use to join the two tables?
You can simplify the query by using the following syntax:
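The original snippet is not shown here; one common simplification, given as an assumption, is to use short table aliases (and USING (col) instead of ON when the key column carries the same name in both tables). Column names below are assumptions to check against the schema:

SELECT t.trip_id, s.name
FROM `bigquery-public-data.austin_bikeshare.bikeshare_trips` AS t
JOIN `bigquery-public-data.austin_bikeshare.bikeshare_stations` AS s
  ON t.start_station_id = s.station_id
LIMIT 5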
Query: Which bike start station has the longest duration rides?
Query: How many bikes were reported as stolen?
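Possible sketches for these two exercises; the column names (duration_minutes, start_station_name, end_station_name) and the 'Stolen' marker are assumptions, so inspect the schema and the DISTINCT values first:

-- Longest rides per start station (use AVG instead of MAX for typical length)
SELECT start_station_name, MAX(duration_minutes) AS longest_ride
FROM `bigquery-public-data.austin_bikeshare.bikeshare_trips`
GROUP BY start_station_name
ORDER BY longest_ride DESC
LIMIT 1

-- Stolen bikes, assuming they appear as a special end-station value
SELECT COUNT(DISTINCT bike_id) AS stolen_bikes
FROM `bigquery-public-data.austin_bikeshare.bikeshare_trips`
WHERE end_station_name = 'Stolen'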
To connect R to BigQuery, use the bigrquery package:
# Install the bigrquery package
install.packages("bigrquery")
# Load the library and authenticate
library(bigrquery)
bq_auth()
# Run a query
query <- "SELECT country, pollutant, value
FROM `bigquery-public-data.openaq.global_air_quality`
WHERE country = 'US'
LIMIT 5"
result <- bq_project_query("your-project-id", query)
data <- bq_table_download(result)
head(data)

  country pollutant       value
1 US no 0.007021218
2 US bc 0.410000000
3 US bc 0.500000000
4 US bc 2.100000000
5 US bc 0.160000000
query <- "WITH time AS (
SELECT DATE(block_timestamp) AS trans_date
FROM `bigquery-public-data.crypto_bitcoin.transactions`
)
SELECT COUNT(trans_date) AS transactions, trans_date
FROM time
GROUP BY trans_date
ORDER BY trans_date DESC"
result <- bq_project_query("your-project-id", query)
data <- bq_table_download(result)
library(ggplot2)
library(hrbrthemes)
p <- ggplot(data, aes(x=trans_date, y=transactions)) +
geom_line( color="#69b3a2") +
xlab("Time") + ylab("Number of Transactions") +
theme_ipsum() +
theme(axis.text.x=element_text(angle=60, hjust=1)) +
ggtitle("Number of Bitcoin Transactions Over Time")
p
To connect Python to BigQuery, use the google-cloud-bigquery client.
# TIP: Install once in your env: pip install google-cloud-bigquery pandas pyarrow
# Auth options:
# - gcloud: gcloud auth application-default login
# - or set GOOGLE_APPLICATION_CREDENTIALS to a service-account JSON
from google.cloud import bigquery
import pandas as pd
client = bigquery.Client(project="your-project-id")
# Run a query
query = """
SELECT country, pollutant, value
FROM `bigquery-public-data.openaq.global_air_quality`
WHERE country = 'US'
LIMIT 5
"""
df = client.query(query).to_dataframe()
df.head()

BigQuery is a highly scalable data warehouse designed to store and analyze terabytes of data using SQL.
Benefits:
Use the bigrquery package to connect R with BigQuery, or google-cloud-bigquery for Python.

We will create a small customers–orders database to practice.
library(DBI); library(RSQLite)
con <- dbConnect(SQLite(), ":memory:")
DBI::dbExecute(con, "
CREATE TABLE customers (
customer_id INTEGER PRIMARY KEY,
name TEXT NOT NULL,
country TEXT NOT NULL
);
")
DBI::dbExecute(con, "
CREATE TABLE orders (
order_id INTEGER PRIMARY KEY,
customer_id INTEGER NOT NULL,
order_date TEXT NOT NULL, -- ISO8601
amount REAL NOT NULL,
FOREIGN KEY(customer_id) REFERENCES customers(customer_id)
);
")
DBI::dbWriteTable(con, "customers",
data.frame(customer_id=1:5,
name=c("Ada","Bruno","Chloé","Dinesh","Elena"),
country=c("CH","FR","CH","IN","US")),
append=TRUE)
DBI::dbWriteTable(con, "orders",
data.frame(order_id=1:10,
customer_id=c(1,1,2,3,3,3,4,4,5,5),
order_date=as.character(as.Date("2025-09-20") + c(0,1,0,2,5,6,1,7,3,9)),
amount=c(120,75,300,42,88,150,60,500,20,220)),
append=TRUE)
DBI::dbListTables(con)

import sqlite3, pandas as pd, datetime as dt
con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("""
CREATE TABLE customers (
customer_id INTEGER PRIMARY KEY,
name TEXT NOT NULL,
country TEXT NOT NULL
);
""")
cur.execute("""
CREATE TABLE orders (
order_id INTEGER PRIMARY KEY,
customer_id INTEGER NOT NULL,
order_date TEXT NOT NULL,
amount REAL NOT NULL,
FOREIGN KEY(customer_id) REFERENCES customers(customer_id)
);
""")
customers = pd.DataFrame({
"customer_id": [1,2,3,4,5],
"name": ["Ada","Bruno","Chloé","Dinesh","Elena"],
"country": ["CH","FR","CH","IN","US"]
})
orders = pd.DataFrame({
"order_id": range(1,11),
"customer_id": [1,1,2,3,3,3,4,4,5,5],
"order_date": [(dt.date(2025,9,20)+dt.timedelta(days=d)).isoformat() for d in [0,1,0,2,5,6,1,7,3,9]],
"amount": [120,75,300,42,88,150,60,500,20,220]
})
customers.to_sql("customers", con, if_exists="append", index=False)
orders.to_sql("orders", con, if_exists="append", index=False)
con.execute("SELECT name FROM sqlite_master WHERE type='table'").fetchall()

SELECT specific columns → WHERE to filter → ORDER BY to sort → LIMIT to preview.
Tip: Start with SELECT * ... LIMIT 5 to inspect shape without scanning the full table.
Rules: every selected non-aggregated column must appear in GROUP BY. Use HAVING to filter groups.
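For instance, on the customers–orders practice database built above, this returns total spending per customer country and keeps only countries with totals above 300 (CH and IN with the sample data):

SELECT c.country, SUM(o.amount) AS total_amount
FROM orders AS o
JOIN customers AS c ON o.customer_id = c.customer_id
GROUP BY c.country
HAVING SUM(o.amount) > 300
ORDER BY total_amount DESC;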
Recap: combine results with UNION; preview with LIMIT 3; keep unmatched left-table rows with LEFT JOIN.
- SELECT col1, col2 FROM t WHERE cond ORDER BY col1 DESC LIMIT n;
- JOIN: A JOIN B ON A.key = B.key · left join keeps all of A
- GROUP BY cols + HAVING to filter groups
- R: DBI::dbGetQuery(con, sql); Python: pd.read_sql_query(sql, con)

Create a free key (account on openweathermap.org).
Try this URL in your browser (replace YOUR_KEY):
https://api.openweathermap.org/data/2.5/weather?q=Lausanne&units=metric&appid=YOUR_KEY
Look for main.temp, weather.description, wind.speed in the JSON.

# packages
library(httr2)
library(dplyr)
# store your key in an env var before class: Sys.setenv(OPENWEATHERMAP_API_KEY="...")
okey <- Sys.getenv("OPENWEATHERMAP_API_KEY")
resp <- request("https://api.openweathermap.org/data/2.5/weather") |>
req_url_query(q = "Lausanne", units = "metric", appid = okey) |>
req_perform()
wx <- resp_body_json(resp, simplifyVector = TRUE)
# extract a compact summary row
wx_row <- tibble::tibble(
city = wx$name,
country = wx$sys$country,
temperature = wx$main$temp,
feels_like = wx$main$feels_like,
humidity = wx$main$humidity,
wind_ms = wx$wind$speed,
condition = wx$weather$description,
timestamp = as.POSIXct(wx$dt, origin = "1970-01-01", tz = "UTC")
)
wx_row

# A tibble: 1 × 8
city country temperature feels_like humidity wind_ms condition
<chr> <chr> <dbl> <dbl> <int> <dbl> <chr>
1 Lausanne CH 1.87 -1.55 89 3.37 scattered clouds
# ℹ 1 more variable: timestamp <dttm>
import os, requests, pandas as pd
okey = os.environ.get("OPENWEATHERMAP_API_KEY")
params = {"q": "Lausanne", "units": "metric", "appid": okey}
r = requests.get("https://api.openweathermap.org/data/2.5/weather", params=params)
wx = r.json()
row = {
"city": wx.get("name"),
"country": wx.get("sys", {}).get("country"),
"temperature": wx.get("main", {}).get("temp"),
"feels_like": wx.get("main", {}).get("feels_like"),
"humidity": wx.get("main", {}).get("humidity"),
"wind_ms": wx.get("wind", {}).get("speed"),
"condition": (wx.get("weather", {}).get("description"),
}
pd.DataFrame([row])

Try this URL in your browser:
https://www.alphavantage.co/query?function=TIME_SERIES_DAILY&symbol=MSFT&apikey=demo
Look for Time Series (Daily) with open, high, low, close, volume by date.

library(httr2)
library(purrr)
library(dplyr)
akey <- Sys.getenv("ALPHAVANTAGE_API_KEY", unset = "demo")
resp <- request("https://www.alphavantage.co/query") |>
req_url_query(`function` = "TIME_SERIES_DAILY", symbol = "MSFT", apikey = akey) |>
req_perform()
js <- resp_body_json(resp, simplifyVector = TRUE)
daily <- js[["Time Series (Daily)"]]
prices <- imap_dfr(daily, ~ tibble::tibble(
date = as.Date(.y),
open = as.numeric(.x[["1. open"]]),
high = as.numeric(.x[["2. high"]]),
low = as.numeric(.x[["3. low"]]),
close = as.numeric(.x[["4. close"]]),
volume = as.numeric(.x[["5. volume"]])
)) |>
arrange(date)
prices

# A tibble: 100 × 6
date open high low close volume
<date> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2025-07-09 500. 507. 500. 504. 18659538
2 2025-07-10 503. 504. 498. 501. 16498740
3 2025-07-11 498. 505. 498. 503. 16459512
4 2025-07-14 502. 504. 501. 503. 12058848
5 2025-07-15 503. 508. 503. 506. 14927202
6 2025-07-16 505. 507. 502. 506. 15154374
7 2025-07-17 506. 513. 506. 512. 17503129
8 2025-07-18 514. 515. 507. 510. 21209666
9 2025-07-21 507. 512. 506. 510. 14066805
10 2025-07-22 511. 511. 505. 505. 13868644
# ℹ 90 more rows
import os, requests, pandas as pd
akey = os.environ.get("ALPHAVANTAGE_API_KEY", "demo")
params = {"function": "TIME_SERIES_DAILY", "symbol": "MSFT", "apikey": akey}
r = requests.get("https://www.alphavantage.co/query", params=params)
js = r.json()
daily = js.get("Time Series (Daily)", {})
df = (pd.DataFrame(daily).T
.rename(columns={
"1. open":"open","2. high":"high","3. low":"low",
"4. close":"close","5. volume":"volume"
})
.reset_index(names="date"))
df["date"] = pd.to_datetime(df["date"]).dt.date
df.sort_values("date").head()

Search for a title + year (replace YOUR_KEY):
https://www.omdbapi.com/?apikey=YOUR_KEY&s=titanic&y=1997
Then open details by IMDb id (Titanic):
https://www.omdbapi.com/?apikey=YOUR_KEY&i=tt0120338
Look for fields like Title, Year, Genre, Director, imdbRating.
library(httr2)
mkey <- Sys.getenv("OMDB_API_KEY")
# Search first
s <- request("https://www.omdbapi.com/") |>
req_url_query(apikey = mkey, s = "titanic", y = 1997) |>
req_perform() |>
resp_body_json(simplifyVector = TRUE)
head(s$Search)
# Then fetch details by IMDb id
movie <- request("https://www.omdbapi.com/") |>
req_url_query(apikey = mkey, i = "tt0120338") |>
req_perform() |>
resp_body_json(simplifyVector = TRUE)
unlist(movie[c("Title","Year","Genre","Director","imdbRating")])

import os, requests, pandas as pd
mkey = os.environ.get("OMDB_API_KEY")
# Search
s = requests.get("https://www.omdbapi.com/", params={"apikey": mkey, "s": "titanic", "y": 1997}).json()
pd.DataFrame(s.get("Search", []))
# Details
movie = requests.get("https://www.omdbapi.com/", params={"apikey": mkey, "i": "tt0120338"}).json()
{ k: movie.get(k) for k in ["Title","Year","Genre","Director","imdbRating"] }

import requests, datetime
import pandas as pd
yesterday = (datetime.date.today() - datetime.timedelta(days=1))
url = (
f"https://wikimedia.org/api/rest_v1/metrics/pageviews/top/"
f"en.wikipedia/all-access/{yesterday.year}/{yesterday:%m}/{yesterday:%d}"
)
resp = requests.get(url)
items = resp.json()["items"][0]["articles"]
# Build DataFrame and keep top 10
(df := pd.DataFrame(items))[['article','views']].head(10)

This is what we obtain on Google Translate:
But what a good translation of the name of a pipe!
# Requires: pip install openai
import os
from openai import OpenAI
client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))
resp = client.chat.completions.create(
model="gpt-4o-mini",
messages=[
{"role":"system","content":"You are a helpful assistant."},
{"role":"user","content":"What do you think about the Programming Tools in Data Science class in R?"}
]
)
print(resp.choices[0].message.content)

Exercise: replace MSFT with another symbol in Alpha Vantage and fetch the JSON. Bonus: Copy one of the R/Python snippets, set your env var locally, and print 3 fields.
Never hard‑code secrets in slides, repos, or notebooks.
Prefer environment variables:
- ~/.Renviron → OPENWEATHERMAP_API_KEY=...
- .env + python-dotenv locally.
Rotate / revoke keys if you accidentally leak them.
Watch the rate limits (Alpha Vantage is strict); implement simple retries if needed.
- macOS/Linux (bash/zsh)
- Windows (PowerShell)
- R session
- Python session
Query APIs from R using library(httr) or API wrappers.
- Endpoint: the base URL (e.g., https://api.example.com/resource).
- Parameters: the query string (e.g., ?q=Lausanne&units=metric).
API = Application Programming Interface — structured endpoints to request data.
Pros
Cons
If no API or missing fields → parse the HTML.
<!DOCTYPE html>
<html>
<body>
<h1 id='first'>Webscraping with R</h1>
<p> Basic experience with <a href="www.r-project.org">R</a> and
familiarity with the <em>Tidyverse</em> is recommended.</p>
<h2>Technologies</h2>
<ol>
<li>HTML: <em>Hypertext Markup Language</em></li>
<li>CSS: <em>Cascading Style Sheets</em></li>
</ol>
<h2>Packages</h2>
<ul>
<a href="https://github.com/tidyverse/rvest"><li>rvest</li></a>
</ul>
<p><strong>Note</strong>:
<em>rvest</em> is included in the <em>tidyverse</em></p>
</body>
</html>

- An HTML element starts with <tag> and ends with </tag>.
- Elements can carry attributes (e.g., the id attribute).
- Exercise: add style="background-color:DodgerBlue;" to h1 and try it.

| tag | meaning |
|---|---|
| p | Paragraph |
| h1 | Top-level heading |
| h2, h3, … | Lower level headings |
| ol | Ordered list |
| ul | Unordered list |
| li | List item |
| img | Image |
| a | Anchor (Hyperlink) |
| div | Section wrapper (block-level) |
| span | Text wrapper (in-line) |
Create a minimal HTML page for experimenting
html_page <- minimal_html('
<body>
<h1>Webscraping with R</h1>
<p> Basic experience with <a href="www.r-project.org">R</a> and
familiarity with the <em>Tidyverse</em> is recommended.</p>
<h2>Technologies</h2>
<ol>
<li>HTML: <em>Hypertext Markup Language</em></li>
<li>CSS: <em>Cascading Style Sheets</em></li>
</ol>
<h2>Packages</h2>
<ul>
<a href="https://github.com/tidyverse/rvest"><li>rvest</li></a>
</ul>
<p><strong>Note</strong>:
<em>rvest</em> is included in the <em>tidyverse</em></p>
</body>')

from bs4 import BeautifulSoup
html_page = BeautifulSoup('''
<body>
<h1>Webscraping with R</h1>
<p> Basic experience with <a href="www.r-project.org">R</a> and
familiarity with the <em>Tidyverse</em> is recommended.</p>
<h2>Technologies</h2>
<ol>
<li>HTML: <em>Hypertext Markup Language</em></li>
<li>CSS: <em>Cascading Style Sheets</em></li>
</ol>
<h2>Packages</h2>
<ul>
<a href="https://github.com/tidyverse/rvest"><li>rvest</li></a>
</ul>
<p><strong>Note</strong>:
<em>rvest</em> is included in the <em>tidyverse</em></p>
</body>''', 'html.parser')

| selector | meaning |
|---|---|
| tag | element by tag |
| .class | by class |
| #id | by id |
| a[href] | attribute present |
| a[href*="wiki"] | attr contains |
| parent > child | direct child |
| A + B | adjacent sibling |
| A ~ B | any following sibling |
| :first-child, :nth-child(2) | structural pseudo‑classes |
| :not(X) | negation |
Grouping selector (,): div, p selects all <div> elements and all <p> elements.
Example, select all li and em: the selector li, em matches the three <li> items and the five <em> elements of the page.

Descendant selector (space): div p selects all <p> elements inside <div> elements.
Example, em that are descendants of li: li em matches <em>Hypertext Markup Language</em> and <em>Cascading Style Sheets</em>.

Child selector (>): div > p selects all <p> elements that are children of a <div> element.
Example, em that are children of p: p > em matches <em>Tidyverse</em>, <em>rvest</em> and <em>tidyverse</em>.

Adjacent sibling selector (+): div + p selects the first <p> element situated immediately after a <div> element.
Example, em immediately after p: p + em matches nothing; no em is immediately after a p.
Example, em immediately after em: em + em matches <em>tidyverse</em> in the last paragraph.

General sibling selector (~): div ~ p selects all <p> elements that are preceded by a <div> element.
Example, em next sibling of a: a ~ em matches <em>Tidyverse</em>.
<em>rvest</em> is included in the <em>tidyverse</em></p>(Here, we would have obtained the same result with a + em)
Structural pseudo-class (:first-child): :first-child selects the specified element that is the first child of another element. For example, p:first-child selects all <p> elements that are the first child of any other element.
Example, li that are first children: li:first-child matches the first <li> of the ordered list and the <li> inside the rvest link.

Structural pseudo-class (:nth-child(n)): :last-child is completely symmetric to :first-child. :nth-child(n) selects the specified element that is the nth child of another element. For example, p:nth-child(2) selects all <p> elements that are the second child of any other element.
Example, li that are second children: li:nth-child(2) matches <li>CSS: Cascading Style Sheets</li>.

Attributes: attributes take the form name="value". In <a href="www.r-project.org">R</a>, href is an attribute of a that specifies a URL. Attribute values are extracted with the html_attr command.
Example, extract all href attributes: "www.r-project.org" and "https://github.com/tidyverse/rvest".

| tag | meaning |
|---|---|
| table | Table section |
| tr | Table row |
| td | Table cell |
| th | Table header |
In rvest, HTML tables are extracted with html_table(). In Python, use BeautifulSoup with pandas:

from bs4 import BeautifulSoup
import pandas as pd
basic_table_html = '''
<body>
<table>
<tr>
<th>Month</th>
<th>Savings</th>
</tr>
<tr>
<td>January</td>
<td>$100</td>
</tr>
<tr>
<td>February</td>
<td>$80</td>
</tr>
</table>
</body>
'''
soup = BeautifulSoup(basic_table_html, 'html.parser')
pd.read_html(str(soup))[0]

Exercises: extract the table with class="wikitable"; then the one with class="wikitable sortable"; then the table immediately after four p elements.

Always check the robots.txt and Terms of Service (ToS) of websites. Many sites publish a robots.txt (see https://cran.r-project.org/robots.txt for instance).

import time, random, requests
from urllib import robotparser
from tenacity import retry, wait_exponential, stop_after_attempt
import requests_cache
rp = robotparser.RobotFileParser()
rp.set_url("https://en.wikipedia.org/robots.txt"); rp.read()
rp.can_fetch("DaCM-course", "https://en.wikipedia.org/wiki/Main_Page")
session = requests_cache.CachedSession("scrape_cache", expire_after=3600)
@retry(wait=wait_exponential(multiplier=1, min=1, max=30), stop=stop_after_attempt(5))
def get(url):
time.sleep(random.uniform(0.5,1.5))
return session.get(url, headers={"User-Agent":"DaCM-course/1.0"}, timeout=20)
resp = get("https://httpbin.org/get")
resp.json()

Sometimes content cannot be scraped with read_html(). In many cases, this is because the website employs methods for dynamic data requests. In R, two options are the RSelenium package and the chromote package (developed by Posit), which focuses on the Chrome DevTools Protocol. The example below extracts elements with class="item".

# Only header-like elements typically appear; dynamic content isn't in the initial HTML.
library(chromote)
b <- ChromoteSession$new() # open a chromote session
url <- "https://data.worldbank.org/indicator/SP.ADO.TFRT"
b$Page$navigate(url) # navigate to the url
b$Runtime$evaluate("document.querySelector('html').outerHTML")$result$value %>%
read_html() %>% html_nodes(".item") %>% html_text() %>% head()
b$close() # close the session

# Example with Selenium (browser automation). Requires: pip install selenium webdriver-manager
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.firefox.service import Service
from webdriver_manager.firefox import GeckoDriverManager
options = Options()
options.add_argument("-headless")
# Selenium 4 removed executable_path; pass the driver path via a Service object
driver = webdriver.Firefox(service=Service(GeckoDriverManager().install()), options=options)
url = "https://data.worldbank.org/indicator/SP.ADO.TFRT"
driver.get(url)
html = driver.page_source
soup = BeautifulSoup(html, 'html.parser')
items = [x.get_text(strip=True) for x in soup.select('.item')][:6]
driver.quit()
itemsSome comments on the chromote command:
- b <- ChromoteSession$new() creates a new ChromoteSession object assigned to b.
- b$Page$navigate(url) navigates to the provided URL.
- The Runtime$evaluate command tells the browser to run JavaScript code.
- document.querySelector('html').outerHTML selects the <html> element from the current web page's Document Object Model (DOM), and then retrieves its entire HTML content, including the element itself and everything inside it.
- b$view() opens the session in an interactive viewer.
- chromote is for Chrome, Chromium and the like; Selenium is more general.
- RSelenium is currently not working properly, but the snippet above shows how a possible implementation would look.

When HTML is rendered by JavaScript, read_html()/requests won't see the final content.
Options:
- R: chromote (Chrome DevTools Protocol)
- Python: playwright (modern) or selenium (browser automation)

# playwright install must have been run once
from playwright.sync_api import sync_playwright
with sync_playwright() as p:
browser = p.chromium.launch(headless=True)
page = browser.new_page()
page.goto("https://data.worldbank.org/indicator/SP.ADO.TFRT", wait_until="networkidle")
html = page.content()
# parse with BeautifulSoup here
browser.close()

R has built-in functions to work with regular expressions:
- grep(): Search for matches of a pattern in a character vector.
- grepl(): Returns a logical vector indicating if there is a match.
- sub(), gsub(): Replace the first or all occurrences of a pattern in a string.
- regexpr(), gregexpr(): Find the position and length of matches.
- str_extract() and str_replace() from the stringr package for a tidy approach.

Tip
Python equivalents:
- re.search, re.findall, re.finditer for matching.
- re.sub for replacing.
- re.compile to pre-compile patterns.
- pandas.Series.str.contains / .extract / .replace for vectorized ops on Series.

Common symbols used in regular expressions:
- . : Any single character except newline.
- * : Zero or more repetitions of the preceding character.
- + : One or more repetitions of the preceding character.
- ? : Zero or one repetition of the preceding character.
- [] : A set of characters. For example, [abc] matches ‘a’, ‘b’, or ‘c’.
- ^ : In [], it inverts the match. For example, [^abc] matches everything except ‘a’, ‘b’, or ‘c’.
- ^ : Matches the start of a string.
- $ : Matches the end of a string.
- - : Defines a range of characters. For example, [a-z] matches any lowercase letter.
- \\ : Escape character.
- \\d : Any digit. \\D : Any non-digit.
- \\w : Any word character (alphanumeric + underscore). \\W : Any non-word character.
- \\s : Any whitespace character. \\S : Any non-whitespace character.
Quantifiers:
- {n}: Exactly n repetitions.
- {n,}: At least n repetitions.
- {n,m}: Between n and m repetitions.

library(stringr)
text <- "John's email is john.doe@example.com and Jane's email is jane_doe123@example.org"
# Extract all email addresses
# chatGPT solution:
# emails <- str_extract_all(text, "[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}")
emails <- str_extract_all(text, "\\S+@\\S+")
print(emails)

Use gsub() to clean text data by removing unwanted characters.

Scraping tips: target data-* attributes and semantic classes; check the response content-type and use encoding='utf-8'.

HEC Lausanne · Business Analytics · Thu 9:00–12:00