skferm.smoothing package#

Submodules#

skferm.smoothing.core module#

apply_method_to_groups(df: DataFrame, x: str, y: str, method_func: Callable, groupby_col: str | None = None, **kwargs) DataFrame[source]#

Apply a method function to each group, or to the entire DataFrame when groupby_col is None.
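The group-dispatch pattern can be sketched with plain pandas. This is a hypothetical stand-in, not skferm's implementation; the function name `apply_method_sketch` and the reassembly strategy are assumptions:

```python
import pandas as pd

def apply_method_sketch(df, method_func, groupby_col=None, **kwargs):
    """Hypothetical stand-in: apply method_func per group, or to the whole frame."""
    if groupby_col is None:
        return method_func(df, **kwargs)
    # Apply the method within each group, then reassemble in the original row order.
    parts = [method_func(group, **kwargs) for _, group in df.groupby(groupby_col)]
    return pd.concat(parts).sort_index()

df = pd.DataFrame({"batch": ["A", "A", "B", "B"], "od": [1.0, 3.0, 2.0, 6.0]})

def add_group_mean(d):
    d = d.copy()
    d["od_mean"] = d["od"].mean()
    return d

out = apply_method_sketch(df, add_group_mean, groupby_col="batch")
# od_mean is 2.0 for batch A rows and 4.0 for batch B rows
```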

skferm.smoothing.methods module#

rolling_average(df: DataFrame, x: str, y: str, window: int = 5, center: bool = True, **kwargs) DataFrame[source]#

Rolling average smoothing.
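The underlying operation can be illustrated with pandas directly. How skferm handles window edges (e.g. via `min_periods`) is an assumption here, not documented behavior:

```python
import pandas as pd

df = pd.DataFrame({"time": range(6), "od": [1.0, 3.0, 2.0, 4.0, 3.0, 5.0]})
# Centered 3-point rolling mean; min_periods=1 keeps values at the edges.
df["od_smooth"] = df["od"].rolling(window=3, center=True, min_periods=1).mean()
```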

exponential_moving_average(df: DataFrame, x: str, y: str, span: int = 10, **kwargs) DataFrame[source]#

Exponential moving average smoothing.
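The same smoothing can be reproduced with pandas' `ewm`. Note that pandas defaults to `adjust=True`; `adjust=False` is used below only so the recursion `s_t = α·y_t + (1−α)·s_{t−1}` is easy to check by hand, and skferm's actual choice is not documented here:

```python
import pandas as pd

df = pd.DataFrame({"time": range(4), "od": [1.0, 3.0, 2.0, 4.0]})
# span=3 gives alpha = 2 / (span + 1) = 0.5
df["od_smooth"] = df["od"].ewm(span=3, adjust=False).mean()
# → [1.0, 2.0, 2.0, 3.0]
```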

savitzky_golay_smooth(df: DataFrame, x: str, y: str, window_length: int = 5, polyorder: int = 2, **kwargs) DataFrame[source]#

Savitzky-Golay smoothing.
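Savitzky-Golay smoothing fits a low-order polynomial within each window, so a polynomial of degree at most `polyorder` passes through unchanged. A sketch using SciPy (which skferm presumably wraps, though that is an assumption):

```python
import numpy as np
from scipy.signal import savgol_filter

x = np.arange(7, dtype=float)
y = x ** 2  # a pure quadratic, no noise
# With polyorder=2, the filter reproduces any quadratic exactly.
smoothed = savgol_filter(y, window_length=5, polyorder=2)
```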

skferm.smoothing.metrics module#

Metrics for evaluating smoothing quality and curve smoothness.

This module provides functions to quantify:

1. How smooth a curve is (total variation metric)
2. How well the smoothed curve fits the original data (RMSE and R²)

mean_squared_error(y_true, y_pred, *, sample_weight=None, multioutput='uniform_average')[source]#

Mean squared error regression loss.

Read more in the User Guide.

Parameters:
  • y_true (array-like of shape (n_samples,) or (n_samples, n_outputs)) – Ground truth (correct) target values.

  • y_pred (array-like of shape (n_samples,) or (n_samples, n_outputs)) – Estimated target values.

  • sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.

  • multioutput ({'raw_values', 'uniform_average'} or array-like of shape (n_outputs,), default='uniform_average') –

    Defines aggregating of multiple output values. Array-like value defines weights used to average errors.

    'raw_values':

    Returns a full set of errors in case of multioutput input.

    'uniform_average':

    Errors of all outputs are averaged with uniform weight.

Returns:

loss – A non-negative floating point value (the best value is 0.0), or an array of floating point values, one for each individual target.

Return type:

float or array of floats

Examples

>>> from sklearn.metrics import mean_squared_error
>>> y_true = [3, -0.5, 2, 7]
>>> y_pred = [2.5, 0.0, 2, 8]
>>> mean_squared_error(y_true, y_pred)
0.375
>>> y_true = [[0.5, 1],[-1, 1],[7, -6]]
>>> y_pred = [[0, 2],[-1, 2],[8, -5]]
>>> mean_squared_error(y_true, y_pred)
0.708...
>>> mean_squared_error(y_true, y_pred, multioutput='raw_values')
array([0.41666667, 1.        ])
>>> mean_squared_error(y_true, y_pred, multioutput=[0.3, 0.7])
0.825...

r2_score(y_true, y_pred, *, sample_weight=None, multioutput='uniform_average', force_finite=True)[source]#

\(R^2\) (coefficient of determination) regression score function.

Best possible score is 1.0 and it can be negative (because the model can be arbitrarily worse). In the general case when the true y is non-constant, a constant model that always predicts the average y disregarding the input features would get a \(R^2\) score of 0.0.

In the particular case when y_true is constant, the \(R^2\) score is not finite: it is either NaN (perfect predictions) or -Inf (imperfect predictions). To prevent such non-finite numbers from polluting higher-level experiments such as grid search cross-validation, by default these cases are replaced with 1.0 (perfect predictions) or 0.0 (imperfect predictions) respectively. You can set force_finite to False to prevent this fix from happening.

Note: when the prediction residuals have zero mean, the \(R^2\) score is identical to the Explained Variance score.

Read more in the User Guide.

Parameters:
  • y_true (array-like of shape (n_samples,) or (n_samples, n_outputs)) – Ground truth (correct) target values.

  • y_pred (array-like of shape (n_samples,) or (n_samples, n_outputs)) – Estimated target values.

  • sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.

  • multioutput ({'raw_values', 'uniform_average', 'variance_weighted'}, array-like of shape (n_outputs,) or None, default='uniform_average') –

    Defines aggregating of multiple output scores. Array-like value defines weights used to average scores. Default is 'uniform_average'.

    'raw_values':

    Returns a full set of scores in case of multioutput input.

    'uniform_average':

    Scores of all outputs are averaged with uniform weight.

    'variance_weighted':

    Scores of all outputs are averaged, weighted by the variances of each individual output.

    Changed in version 0.19: Default value of multioutput is 'uniform_average'.

  • force_finite (bool, default=True) –

    Flag indicating if NaN and -Inf scores resulting from constant data should be replaced with real numbers (1.0 if prediction is perfect, 0.0 otherwise). Default is True, a convenient setting for hyperparameter search procedures (e.g. grid search cross-validation).

    Added in version 1.1.

Returns:

z – The \(R^2\) score or ndarray of scores if ‘multioutput’ is ‘raw_values’.

Return type:

float or ndarray of floats

Notes

This is not a symmetric function.

Unlike most other scores, \(R^2\) score may be negative (it need not actually be the square of a quantity R).

This metric is not well-defined for single samples and will return a NaN value if n_samples is less than two.

Examples

>>> from sklearn.metrics import r2_score
>>> y_true = [3, -0.5, 2, 7]
>>> y_pred = [2.5, 0.0, 2, 8]
>>> r2_score(y_true, y_pred)
0.948...
>>> y_true = [[0.5, 1], [-1, 1], [7, -6]]
>>> y_pred = [[0, 2], [-1, 2], [8, -5]]
>>> r2_score(y_true, y_pred,
...          multioutput='variance_weighted')
0.938...
>>> y_true = [1, 2, 3]
>>> y_pred = [1, 2, 3]
>>> r2_score(y_true, y_pred)
1.0
>>> y_true = [1, 2, 3]
>>> y_pred = [2, 2, 2]
>>> r2_score(y_true, y_pred)
0.0
>>> y_true = [1, 2, 3]
>>> y_pred = [3, 2, 1]
>>> r2_score(y_true, y_pred)
-3.0
>>> y_true = [-2, -2, -2]
>>> y_pred = [-2, -2, -2]
>>> r2_score(y_true, y_pred)
1.0
>>> r2_score(y_true, y_pred, force_finite=False)
nan
>>> y_true = [-2, -2, -2]
>>> y_pred = [-2, -2, -2 + 1e-8]
>>> r2_score(y_true, y_pred)
0.0
>>> r2_score(y_true, y_pred, force_finite=False)
-inf
total_variation(y_values: ndarray, normalize: bool = True) float[source]#

Calculate total variation for a sequence of values.

Parameters:#

y_values : np.ndarray

Array of y-values

normalize : bool

Whether to normalize by the range of values

Returns:#

float

Total variation metric
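The metric can be sketched as the sum of absolute first differences. The normalization shown here (dividing by the value range) is an assumption about skferm's definition, as is the zero-range fallback:

```python
import numpy as np

def total_variation_sketch(y_values, normalize=True):
    """Hypothetical re-implementation: sum of absolute first differences."""
    tv = float(np.sum(np.abs(np.diff(y_values))))
    if normalize:
        value_range = float(np.ptp(y_values))  # max - min
        return tv / value_range if value_range > 0 else 0.0
    return tv

noisy = np.array([0.0, 1.0, 0.0, 1.0])     # oscillates: high total variation
monotone = np.array([0.0, 1.0, 2.0, 3.0])  # monotone: minimal total variation
```

A monotone curve achieves the minimum normalized value of 1.0, so larger values indicate a wigglier curve.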

fit_quality_metrics(original: ndarray, smoothed: ndarray) Dict[str, float][source]#

Calculate fit quality metrics between original and smoothed data.

Parameters:#

original : np.ndarray

Original data values

smoothed : np.ndarray

Smoothed data values

Returns:#

Dict[str, float]

Dictionary with 'rmse' and 'r2' keys
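A minimal sketch of the two metrics, computed directly with NumPy rather than via sklearn (the function name is hypothetical; skferm's actual implementation may differ):

```python
import numpy as np

def fit_quality_sketch(original, smoothed):
    """Hypothetical re-implementation of RMSE and R² between two arrays."""
    original = np.asarray(original, dtype=float)
    smoothed = np.asarray(smoothed, dtype=float)
    residuals = original - smoothed
    rmse = float(np.sqrt(np.mean(residuals ** 2)))
    ss_res = float(np.sum(residuals ** 2))
    ss_tot = float(np.sum((original - original.mean()) ** 2))
    r2 = 1.0 - ss_res / ss_tot
    return {"rmse": rmse, "r2": r2}

perfect = fit_quality_sketch([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
# → {"rmse": 0.0, "r2": 1.0}
```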

evaluate_smoothing_quality(df: DataFrame, x_col: str, original_col: str, smoothed_col: str, group_col: str | None = None) DataFrame | Series[source]#

Evaluate smoothing quality.

Parameters:#

df : pd.DataFrame

DataFrame containing the data

x_col : str

Column name for x-axis (for sorting)

original_col : str

Column name for original data

smoothed_col : str

Column name for smoothed data

group_col : Optional[str]

Column to group by (returns per-group metrics if provided)

Returns:#

pd.DataFrame or pd.Series

DataFrame with metrics if group_col is provided, else Series
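The grouped evaluation presumably amounts to computing the fit metrics within each group, which can be sketched with a plain groupby (column names and the RMSE-only metric here are illustrative, not skferm's API):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "batch": ["A"] * 3 + ["B"] * 3,
    "time": [0, 1, 2, 0, 1, 2],
    "od": [1.0, 2.0, 3.0, 2.0, 4.0, 6.0],
    "od_smooth": [1.0, 2.0, 3.0, 2.0, 5.0, 6.0],
})

def rmse(g):
    """RMSE between the original and smoothed columns of one group."""
    return float(np.sqrt(np.mean((g["od"] - g["od_smooth"]) ** 2)))

per_group = {batch: rmse(g) for batch, g in df.groupby("batch")}
# batch A matches exactly (rmse 0.0); batch B has one off-by-one point
```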

Module contents#

smooth(df: DataFrame, x: str, y: str, method: Literal['rolling', 'ema', 'savgol'] = 'rolling', groupby_col: str | None = None, **kwargs) DataFrame[source]#

Apply smoothing to data with pandas pipe support.

Parameters:
  • df – Input DataFrame

  • x – Column name for x-axis values

  • y – Column name for y-axis values

  • method – Smoothing method ('rolling', 'ema', or 'savgol')

  • groupby_col – Optional column to group by

  • **kwargs – Method-specific parameters

Returns:

DataFrame with smoothed values in the {y}_smooth column
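The pipe-friendly calling convention can be demonstrated with a stand-in function. `smooth_sketch` below is hypothetical and only mimics the documented interface (DataFrame first, result written to `{y}_smooth`); it is not skferm's implementation:

```python
import pandas as pd

def smooth_sketch(df, x, y, window=3, **kwargs):
    """Hypothetical stand-in for smooth(); writes the result to f"{y}_smooth"."""
    out = df.sort_values(x).copy()
    out[f"{y}_smooth"] = out[y].rolling(window, center=True, min_periods=1).mean()
    return out

df = pd.DataFrame({"time": [0, 1, 2, 3], "od": [1.0, 3.0, 2.0, 4.0]})
# pandas pipe support: the DataFrame flows in as the first argument.
result = df.pipe(smooth_sketch, x="time", y="od", window=3)
```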

smooth_sequential(df: DataFrame, x: str, y: str, stages: List[Tuple[str, Dict[str, Any]]], groupby_col: str | None = None, output_suffix: str = '_smooth') DataFrame[source]#

Apply multiple smoothing methods in sequence.

Parameters:
  • stages – List of (method_name, parameters) tuples

  • output_suffix – Suffix for the final smoothed column

Returns:

DataFrame with the final smoothed column named {y}{output_suffix}
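Sequential smoothing feeds each stage's output into the next. A minimal sketch of that chaining, with hypothetical stage functions standing in for skferm's method registry:

```python
import pandas as pd

def rolling_stage(values, window=3):
    return values.rolling(window, center=True, min_periods=1).mean()

def ema_stage(values, span=3):
    return values.ewm(span=span, adjust=False).mean()

# Hypothetical method registry keyed by stage name.
METHODS = {"rolling": rolling_stage, "ema": ema_stage}

def smooth_sequential_sketch(df, y, stages, output_suffix="_smooth"):
    """Hypothetical stand-in: feed each stage's output into the next."""
    values = df[y]
    for method_name, params in stages:
        values = METHODS[method_name](values, **params)
    out = df.copy()
    out[f"{y}{output_suffix}"] = values
    return out

df = pd.DataFrame({"od": [1.0, 3.0, 2.0, 4.0, 3.0]})
result = smooth_sequential_sketch(
    df, "od", stages=[("rolling", {"window": 3}), ("ema", {"span": 3})]
)
```

Only the final column is kept; intermediate stage outputs are discarded, matching the single `{y}{output_suffix}` column described above.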

total_variation(y_values: ndarray, normalize: bool = True) float[source]#

Calculate total variation for a sequence of values.

Parameters:#

y_values : np.ndarray

Array of y-values

normalize : bool

Whether to normalize by the range of values

Returns:#

float

Total variation metric

fit_quality_metrics(original: ndarray, smoothed: ndarray) Dict[str, float][source]#

Calculate fit quality metrics between original and smoothed data.

Parameters:#

original : np.ndarray

Original data values

smoothed : np.ndarray

Smoothed data values

Returns:#

Dict[str, float]

Dictionary with 'rmse' and 'r2' keys

evaluate_smoothing_quality(df: DataFrame, x_col: str, original_col: str, smoothed_col: str, group_col: str | None = None) DataFrame | Series[source]#

Evaluate smoothing quality.

Parameters:#

df : pd.DataFrame

DataFrame containing the data

x_col : str

Column name for x-axis (for sorting)

original_col : str

Column name for original data

smoothed_col : str

Column name for smoothed data

group_col : Optional[str]

Column to group by (returns per-group metrics if provided)

Returns:#

pd.DataFrame or pd.Series

DataFrame with metrics if group_col is provided, else Series