29 changes: 11 additions & 18 deletions _doc/articles/2026/2026-03-15-route2026-ml.rst
@@ -26,7 +26,7 @@ Séance 1 (6/2)

*présentation de modules*

* données : :epkg:`pandas`
* données : :epkg:`pandas`, :epkg:`skrub`
* graphes : :epkg:`matplotlib`, :epkg:`seaborn`, :epkg:`bokeh`, :epkg:`altair`
* cartes : :epkg:`geopandas`, :epkg:`folium`
* machine learning : :epkg:`scikit-learn`, :epkg:`skrub`, :epkg:`skore`, :epkg:`imbalanced-learn`, :epkg:`hazardous`, :epkg:`fairlearn`,
@@ -40,7 +40,7 @@ Séance 1 (6/2)

*problème*

Peut-on prédire le nombre de condidatures en 2026 pour chaque établissement ?
Peut-on prédire le nombre de candidatures en 2026 pour chaque établissement ?

:ref:`Données parcours-sup 2021-2025 <nbl-practice-years-2026-parcoursup_2026>`

@@ -73,21 +73,14 @@ Evaluation
Quelques jeux de données
========================

* `Parcoursup 2025 - vœux de poursuite d'études et de réorientation dans l'enseignement supérieur et réponses des établissements
<https://www.data.gouv.fr/datasets/parcoursup-2025-voeux-de-poursuite-detudes-et-de-reorientation-dans-lenseignement-superieur-et-reponses-des-etablissements>`_
* `Patrimoine immobilier des opérateurs de l’Enseignement supérieur
<https://www.data.gouv.fr/datasets/patrimoine-immobilier-des-operateurs-de-lenseignement-superieur>`_
* `Prix des carburants en France - Flux quotidien
<https://www.data.gouv.fr/datasets/prix-des-carburants-en-france-flux-quotidien-1>`_
* `Prix des carburants en France - Flux instantané - v2
<https://www.data.gouv.fr/datasets/prix-des-carburants-en-france-flux-instantane-v2-amelioree>`_
* `Séries sur les surfaces, rendements, production céréales
<https://visionet.franceagrimer.fr/Pages/SeriesChronologiques.aspx?menuurl=SeriesChronologiques/productions%20vegetales/grandes%20cultures/surfaces,productions,rendements>`_
* `Effectifs d'étudiants inscrits dans les établissements et les formations de l'enseignement supérieur - détail par établissements
<https://www.data.gouv.fr/datasets/effectifs-detudiants-inscrits-dans-les-etablissements-et-les-formations-de-lenseignement-superieur-detail-par-etablissements>`_
* `Résultats du contrôle sanitaire de l'eau distribuée commune par commune
<https://www.data.gouv.fr/datasets/resultats-du-controle-sanitaire-de-leau-distribuee-commune-par-commune>`_
* `Parcoursup 2025 - vœux de poursuite d'études et de réorientation dans l'enseignement supérieur et réponses des établissements <https://www.data.gouv.fr/datasets/parcoursup-2025-voeux-de-poursuite-detudes-et-de-reorientation-dans-lenseignement-superieur-et-reponses-des-etablissements>`_
* `Patrimoine immobilier des opérateurs de l'Enseignement supérieur <https://www.data.gouv.fr/datasets/patrimoine-immobilier-des-operateurs-de-lenseignement-superieur>`_
* `Prix des carburants en France - Flux quotidien <https://www.data.gouv.fr/datasets/prix-des-carburants-en-france-flux-quotidien-1>`_
* `Prix des carburants en France - Flux instantané - v2 <https://www.data.gouv.fr/datasets/prix-des-carburants-en-france-flux-instantane-v2-amelioree>`_
* `Séries sur les surfaces, rendements, production céréales <https://visionet.franceagrimer.fr/Pages/SeriesChronologiques.aspx?menuurl=SeriesChronologiques/productions%20vegetales/grandes%20cultures/surfaces,productions,rendements>`_
* `Effectifs d'étudiants inscrits dans les établissements et les formations de l'enseignement supérieur - détail par établissements <https://www.data.gouv.fr/datasets/effectifs-detudiants-inscrits-dans-les-etablissements-et-les-formations-de-lenseignement-superieur-detail-par-etablissements>`_
* `Résultats du contrôle sanitaire de l'eau distribuée commune par commune <https://www.data.gouv.fr/datasets/resultats-du-controle-sanitaire-de-leau-distribuee-commune-par-commune>`_
* `Résultats du contrôle sanitaire de l'eau du robinet <https://www.data.gouv.fr/datasets/resultats-du-controle-sanitaire-de-leau-du-robinet>`_
* `Données climatologiques de base - horaires <https://www.data.gouv.fr/datasets/donnees-climatologiques-de-base-horaires>`_
* `Données climatologiques de base - mensuelles <https://www.data.gouv.fr/datasets/donnees-climatologiques-de-base-mensuelles>`_

* `Données climatologiques de base - mensuelles <https://www.data.gouv.fr/datasets/donnees-climatologiques-de-base-mensuelles>`_
* `Base de donnée de surveillance de pesticides dans l air par les AASQA à partir de 2002 <https://www.data.gouv.fr/datasets/base-de-donnee-de-surveillance-de-pesticides-dans-l-air-par-les-aasqa-a-partir-de-2002>`_
124 changes: 124 additions & 0 deletions _doc/examples/ml/plot_template_data.py
@@ -0,0 +1,124 @@
"""
Copilot AI (Feb 26, 2026):

Missing the '# coding: utf-8' header that is present in all other example files in the _doc/examples directory. This header is particularly important for this file since it contains French text with special characters.
Données parcours-sup 2021-2025
==============================

"""

import pandas
from teachpyx.tools.pandas import read_csv_cached
from sklearn.metrics import mean_absolute_error
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.ensemble import HistGradientBoostingRegressor

# from skrub import TableReport


def get_data():
    urls = {
        "2021": "https://data.enseignementsup-recherche.gouv.fr/api/explore/v2.1/catalog/datasets/fr-esr-parcoursup_2021/exports/csv?lang=fr&timezone=Europe%2FBerlin&use_labels=true&delimiter=%3B",
        "2022": "https://data.enseignementsup-recherche.gouv.fr/api/explore/v2.1/catalog/datasets/fr-esr-parcoursup_2022/exports/csv?lang=fr&timezone=Europe%2FBerlin&use_labels=true&delimiter=%3B",
        "2023": "https://data.enseignementsup-recherche.gouv.fr/api/explore/v2.1/catalog/datasets/fr-esr-parcoursup_2023/exports/csv?lang=fr&timezone=Europe%2FBerlin&use_labels=true&delimiter=%3B",
        "2024": "https://data.enseignementsup-recherche.gouv.fr/api/explore/v2.1/catalog/datasets/fr-esr-parcoursup_2024/exports/csv?lang=fr&timezone=Europe%2FBerlin&use_labels=true&delimiter=%3B",
        "2025": "https://data.enseignementsup-recherche.gouv.fr/api/explore/v2.1/catalog/datasets/fr-esr-parcoursup/exports/csv?lang=fr&timezone=Europe%2FBerlin&use_labels=true&delimiter=%3B",
    }

    dfs = {}
    for k, url in urls.items():
        print(f"loading {k!r}")
        dfs[k] = read_csv_cached(url, sep=";")

    return pandas.concat(dfs.values(), axis=0)


def select_variables_and_clean(df):
    keys = [
        "Région de l’établissement",
        "Session",
        "Statut de l’établissement de la filière de formation (public, privé…)",
        "Sélectivité",
        "Code UAI de l'établissement",
        "Établissement",
        "Filière de formation détaillée bis",
        "Filière de formation très agrégée",
        "Filière de formation.1",
        "Académie de l’établissement",
        "Code départemental de l’établissement",
        "Commune de l’établissement",
        "Concours communs et banque d'épreuves",
    ]
    cible = "Effectif total des candidats pour une formation"
    columns = set(df.columns)
    assert set(keys) & set(columns) == set(
        keys
    ), f"Missing columns {set(keys) - set(keys) & set(columns)} in {sorted(df.columns)}"
Copilot AI (Feb 26, 2026):

The assertion message is more convoluted than it needs to be. Because & binds tighter than -, the expression set(keys) - set(keys) & set(columns) evaluates as set(keys) - (set(keys) & set(columns)), which is the same set as set(keys) - set(columns): removing the keys shared with columns leaves exactly the missing ones. The message therefore does show the missing columns, but the simpler form states the intent directly.

Suggested change
), f"Missing columns {set(keys) - set(keys) & set(columns)} in {sorted(df.columns)}"
), f"Missing columns {set(keys) - set(columns)} in {sorted(df.columns)}"
    groups = df[[*keys, cible]].groupby(keys).count()
    filtered = groups[groups[cible] > 1].reset_index(drop=False)

    mask = filtered.duplicated(subset=keys, keep=False)
    return filtered[~mask][[*keys, cible]], cible


def compute_oracle(table, cible):
    vars = [c for c in table.columns if c != cible]
    f2025 = table["Session"] == 2025
    f2024 = table["Session"] == 2024
    ftwo = table[f2025 | f2024]
    piv = (
        pandas.pivot_table(
            ftwo,
            index=[c for c in vars if c != "Session"],
            columns="Session",
            values=cible,
        )
        # .dropna(axis=0)  # fails
Collaborator (author):

@copilot why does it not work?
        .sort_index()
    )
Comment on lines +75 to +77

Copilot AI (Feb 26, 2026):

Regarding the question about why dropna fails: pivot_table creates one column per Session (2024 and 2025). Formations that appear in only one of the two years get NaN in the other column, so dropna(axis=0) can leave the DataFrame empty or nearly empty, and the mean_absolute_error call below then fails. Consider using fillna(0), or compare only formations that exist in both years.

Suggested change
        # .dropna(axis=0)  # fails
        .sort_index()
    )
        .sort_index()
    )
    # Keep only rows where both 2024 and 2025 have non-missing values
    piv = piv.dropna(axis=0, how="any")
    if piv.empty:
        raise ValueError(
            "Not enough overlapping data between 2024 and 2025 to compute oracle."
        )
    return mean_absolute_error(piv[2025], piv[2024])
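The NaN behaviour of pivot_table described in the comment above can be reproduced on a toy frame (hypothetical formation labels, not the real data):

```python
import pandas

# Toy data: "B" has only a 2024 value, "C" only a 2025 value.
df = pandas.DataFrame(
    {
        "formation": ["A", "A", "B", "C"],
        "Session": [2024, 2025, 2024, 2025],
        "effectif": [10, 12, 7, 5],
    }
)
piv = pandas.pivot_table(df, index="formation", columns="Session", values="effectif")
# "B" gets NaN under 2025 and "C" gets NaN under 2024, so dropping
# rows with any missing value keeps only "A".
both = piv.dropna(axis=0, how="any")
print(list(both.index))  # ['A']
```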


def split_train_test(table, cuble):
Copilot AI (Feb 26, 2026):

Parameter name has a typo: 'cuble' should be 'cible' to match the variable name used throughout the codebase and in the function body.

Suggested change
def split_train_test(table, cuble):
def split_train_test(table, cible):
    X, y = table.drop(cible, axis=1), table[cible]

    train_test = X["Session"] < 2025

    drop = ["Session", "Code UAI de l'établissement", "Établissement"]

    train_X = X[train_test].drop(drop, axis=1)
    train_y = y[train_test]
    test_X = X[train_test].drop(drop, axis=1)
    test_y = y[train_test]
Comment on lines +90 to +91

Copilot AI (Feb 26, 2026):

Both test_X and test_y are incorrectly using the training data filter. These lines should use ~train_test instead of train_test to select the test set (Session == 2025); otherwise the test set is identical to the training set.

Suggested change
    test_X = X[train_test].drop(drop, axis=1)
    test_y = y[train_test]
    test_X = X[~train_test].drop(drop, axis=1)
    test_y = y[~train_test]
    return train_X, test_X, train_y, test_y
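The fix suggested in the comment above relies on negating a boolean mask with ~; a minimal sketch with toy sessions:

```python
import pandas

X = pandas.DataFrame({"Session": [2023, 2024, 2025], "v": [1, 2, 3]})
train_mask = X["Session"] < 2025

train_X = X[train_mask]   # sessions 2023 and 2024
test_X = X[~train_mask]   # session 2025 only
print(len(train_X), len(test_X))  # 2 1
```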


def make_pipeline(table, cible):
    vars = [c for c in table.columns if c != "cible"]
Copilot AI (Feb 26, 2026):

The condition filters by the literal string "cible" instead of using the parameter variable. This should be c != cible to correctly filter out the target column.

Suggested change
    vars = [c for c in table.columns if c != "cible"]
    vars = [c for c in table.columns if c != cible]
    num_cols = ["Capacité de l’établissement par formation"]
    cat_cols = [c for c in vars if c not in num_cols]

    model = Pipeline(
        [
            (
                "preprocessing",
                ColumnTransformer(
                    [
                        ("num", StandardScaler(), num_cols),
                        ("cats", OneHotEncoder(handle_unknown="ignore"), cat_cols),
                    ]
                ),
Comment on lines +97 to +109

Copilot AI (Feb 26, 2026):

The column "Capacité de l'établissement par formation" is not present in the keys list defined in select_variables_and_clean, and therefore won't exist in the table. This will cause a KeyError when the pipeline applies StandardScaler to this non-existent column. Either add the column to the keys list in select_variables_and_clean, or use a column that is actually present in the filtered table.

Suggested change
    num_cols = ["Capacité de l’établissement par formation"]
    cat_cols = [c for c in vars if c not in num_cols]

    model = Pipeline(
        [
            (
                "preprocessing",
                ColumnTransformer(
                    [
                        ("num", StandardScaler(), num_cols),
                        ("cats", OneHotEncoder(handle_unknown="ignore"), cat_cols),
                    ]
                ),
    # Candidate numeric feature; include it only if it exists in the table
    # to avoid a KeyError.
    numeric_feature = "Capacité de l’établissement par formation"
    num_cols = [numeric_feature] if numeric_feature in table.columns else []
    cat_cols = [c for c in vars if c not in num_cols]

    transformers = []
    if num_cols:
        transformers.append(("num", StandardScaler(), num_cols))
    if cat_cols:
        transformers.append(("cats", OneHotEncoder(handle_unknown="ignore"), cat_cols))

    model = Pipeline(
        [
            (
                "preprocessing",
                ColumnTransformer(transformers),
            ),
            ("regressor", HistGradientBoostingRegressor()),
        ]
    )
    return model
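The guard suggested in the comment above can be exercised in isolation; a sketch on a toy table that lacks the capacity column (the column name comes from the file, everything else is illustrative):

```python
import pandas
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy table without the capacity column, to exercise the guard.
table = pandas.DataFrame({"Sélectivité": ["oui", "non"], "Statut": ["public", "privé"]})

candidate = "Capacité de l’établissement par formation"
num_cols = [candidate] if candidate in table.columns else []
cat_cols = [c for c in table.columns if c not in num_cols]

transformers = []
if num_cols:
    transformers.append(("num", StandardScaler(), num_cols))
if cat_cols:
    transformers.append(("cats", OneHotEncoder(handle_unknown="ignore"), cat_cols))

ct = ColumnTransformer(transformers)
out = ct.fit_transform(table)
print(out.shape)  # (2, 4): two one-hot columns per categorical feature
```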


data = get_data()
table, cible = select_variables_and_clean(data)
oracle = compute_oracle(table, cible)
print(f"oracle : {oracle}")

train_X, test_X, train_y, test_y = split_train_test(table, cible)
model = make_pipeline(table, cible)
model.fit(train_X, train_y)
2 changes: 1 addition & 1 deletion _doc/practice/index_algo.rst
@@ -70,7 +70,7 @@ angles d'approches.
algo-compose/vigenere
algo-compose/exercice_morse

Les exercices suivants fonctionnent par pair énoncé et correction.
Les exercices suivants fonctionnent par paire énoncé et correction.

.. toctree::
:maxdepth: 1
2 changes: 1 addition & 1 deletion _doc/practice/index_python.rst
@@ -68,7 +68,7 @@ Exercices sur le langage python
../auto_examples/prog/plot_gil_example
../auto_examples/prog/plot_lambda_function

Les exercices suivants fonctionnent par pair énoncé et correction.
Les exercices suivants fonctionnent par paire énoncé et correction.

.. toctree::
:maxdepth: 1
74 changes: 32 additions & 42 deletions _doc/practice/ml/pretraitement_image.ipynb

Large diffs are not rendered by default.
