Curve Smoothing
===============

Fermentation data often contains noise that can obscure underlying biological trends. The ``skferm.smoothing`` module provides tools to reduce noise while preserving important signal characteristics.

This guide demonstrates how to smooth fermentation curves using the built-in datasets and various smoothing methods.

Overview of Smoothing Methods
-----------------------------

The ``skferm.smoothing`` module includes three main smoothing methods:

- **Rolling Mean** (``rolling``): Simple moving average over a fixed window
- **Exponential Moving Average** (``ema``): Weighted average giving more importance to recent values
- **Savitzky-Golay Filter** (``savgol``): Polynomial fitting method that preserves peaks and valleys

Basic Smoothing Example
-----------------------

Let's start with the Rheolaser dataset, which contains elasticity index measurements from fermentation experiments:

.. code-block:: python

    import matplotlib.pyplot as plt
    from skferm.datasets import load_rheolaser_data
    from skferm.plotting import plot_fermentation_curves
    from skferm.smoothing import smooth

    # Load the dataset
    rheolaser_data = load_rheolaser_data(clean=True)

    # Apply rolling mean smoothing (default method)
    smoothed_data = smooth(
        rheolaser_data,
        x="time",
        y="elasticity_index",
        groupby_col="sample_id"
    )

    # Plot the results
    fig = plot_fermentation_curves(
        smoothed_data,
        title="Smoothed Rheolaser Elasticity Index",
        x="time",
        y="elasticity_index_smooth",
        xlabel="Time (minutes)",
        ylabel="Smoothed Elasticity Index"
    )

When ``groupby_col`` is specified, the ``smooth()`` function applies the smoothing method to each group separately. It adds a new column, here ``elasticity_index_smooth``, containing the smoothed values. The new column name is always the original ``y`` name with ``_smooth`` appended.

.. figure:: _static/smoothed_rheolaser_example.png
   :width: 600px
   :align: center
   :alt: Smoothed Rheolaser elasticity index over time for different samples

Available Smoothing Methods
---------------------------

You can specify different smoothing methods and their parameters:

**Rolling Mean Smoothing**

.. code-block:: python

    smoothed_data = smooth(
        rheolaser_data,
        x="time",
        y="elasticity_index",
        method="rolling",
        window_size=10,
        groupby_col="sample_id"
    )

**Exponential Moving Average**

.. code-block:: python

    smoothed_data = smooth(
        rheolaser_data,
        x="time",
        y="elasticity_index",
        method="ema",
        span=15,
        groupby_col="sample_id"
    )

**Savitzky-Golay Filter**

.. code-block:: python

    smoothed_data = smooth(
        rheolaser_data,
        x="time",
        y="elasticity_index",
        method="savgol",
        window_length=7,
        polyorder=2,
        groupby_col="sample_id"
    )

Sequential Smoothing
--------------------

For very noisy data, you can apply multiple smoothing methods in sequence using ``smooth_sequential()``:

.. code-block:: python

    from skferm.smoothing import smooth_sequential

    # Apply multiple smoothing stages
    smoothed_data = smooth_sequential(
        rheolaser_data,
        x="time",
        y="elasticity_index",
        groupby_col="sample_id",
        stages=[
            ("rolling", {"window_size": 2}),
            ("rolling", {"window_size": 4}),
        ]
    )

Sequential smoothing applies each method in order, with each stage operating on the output of the previous stage. For the ``stages`` parameter, provide a list of tuples where each tuple contains the method name and a dictionary of its parameters.
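
To make the idea of chained stages concrete, here is a minimal pure-Python sketch of the same pattern. It is not the ``skferm`` implementation: the helper names ``rolling_mean``, ``ema``, ``METHODS`` and ``apply_stages`` are hypothetical, and the trailing window and the ``alpha = 2 / (span + 1)`` convention are assumptions made only for illustration.

```python
from statistics import fmean


def rolling_mean(values, window_size):
    """Trailing moving average (assumed convention, not necessarily skferm's).

    Early points average whatever samples are available so far.
    """
    out = []
    for i in range(len(values)):
        start = max(0, i - window_size + 1)
        out.append(fmean(values[start:i + 1]))
    return out


def ema(values, span):
    """Recursive exponential moving average with alpha = 2 / (span + 1)."""
    alpha = 2.0 / (span + 1)
    out = [float(values[0])]
    for v in values[1:]:
        out.append(alpha * v + (1 - alpha) * out[-1])
    return out


# Hypothetical registry mapping stage names to the toy functions above.
METHODS = {"rolling": rolling_mean, "ema": ema}


def apply_stages(values, stages):
    """Apply each (method, params) stage to the output of the previous one."""
    result = list(values)
    for name, params in stages:
        result = METHODS[name](result, **params)
    return result


noisy = [0.0, 2.0, 1.0, 3.0, 2.0, 4.0, 3.0, 5.0]
smoothed = apply_stages(
    noisy,
    [("rolling", {"window_size": 2}), ("rolling", {"window_size": 4})],
)
print(smoothed)
```

Because each stage in this sketch only looks backwards in time, every extra stage delays the curve a little more, which is one way the lag described later in this guide can build up.
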
**Benefits of Sequential Smoothing:**

- Improved noise reduction by targeting different frequency components
- Flexibility to combine complementary smoothing approaches
- Better control over edge preservation

**Risks of Sequential Smoothing:**

- Over-smoothing can remove important signal features
- Multiple passes can introduce phase distortion
- Results become sensitive to the order of the stages

MTP pH Dataset Example
----------------------

The MTP pH dataset contains measurements from multiple wells across different microtiter plates:

.. code-block:: python

    from skferm.datasets import load_mtp_ph_data
    from skferm.smoothing import smooth_sequential

    # Load MTP pH data
    mtp_ph_data = load_mtp_ph_data()

    # Select a random design for demonstration
    design_id = mtp_ph_data.design_id.sample(1).values[0]
    subset_data = mtp_ph_data[mtp_ph_data["design_id"] == design_id]

    # Apply sequential smoothing
    smoothed_mtp = smooth_sequential(
        subset_data,
        x="time",
        y="ph",
        groupby_col="sample_id",
        stages=[
            ("savgol", {"window_length": 7, "polyorder": 2}),
            ("rolling", {"window_size": 5}),
            ("rolling", {"window_size": 3}),
        ]
    )

When Smoothing Goes Wrong
-------------------------

Smoothing can introduce artifacts or remove important features if not applied carefully. Here are common problems:

**Problem 1: Curve Shifting**

Using an exponential moving average followed by a rolling window can shift curves along the time axis, because each stage adds lag:

.. code-block:: python

    # This combination causes unwanted curve shifting
    bad_smoothed = smooth_sequential(
        rheolaser_data,
        x="time",
        y="elasticity_index",
        groupby_col="sample_id",
        stages=[
            ("ema", {"span": 20}),            # EMA introduces lag
            ("rolling", {"window_size": 5}),  # Additional lag
        ]
    )

**Problem 2: Over-smoothing**

Overly aggressive smoothing removes important biological features:

.. code-block:: python

    # Over-smoothing example
    over_smoothed = smooth(
        rheolaser_data,
        x="time",
        y="elasticity_index",
        method="rolling",
        window_size=50,  # Too large window
        groupby_col="sample_id"
    )

Evaluating Smoothing Quality
----------------------------

Always evaluate your smoothing results using both visual inspection and quantitative metrics:

.. code-block:: python

    from skferm.smoothing import evaluate_smoothing_quality

    # Calculate smoothing quality metrics
    metrics = evaluate_smoothing_quality(
        smoothed_data,
        x_col="time",
        original_col="elasticity_index",
        smoothed_col="elasticity_index_smooth",
        group_col="sample_id"
    )

    print(metrics)

The ``evaluate_smoothing_quality()`` function returns several metrics:

- **Smoothness metrics**: Total variation of original vs smoothed data
- **Fit quality**: RMSE and R² between original and smoothed curves

**Good smoothing should:**

- Reduce total variation (smoother curve)
- Maintain high R² (preserve signal shape)
- Keep RMSE low (minimal distortion)

Best Practices
--------------

1. **Always visualize results**: Plot original and smoothed data together to check for artifacts
2. **Start with gentle smoothing**: Use small window sizes first, then increase if needed
3. **Consider your data characteristics**:

   - High-frequency noise → rolling mean or Savitzky-Golay
   - Trending data → exponential moving average
   - Preserve peaks → Savitzky-Golay

4. **Use metrics to guide decisions**: Monitor total variation and fit quality
5. **Test different methods**: What works for one dataset may not work for another
6. **Be cautious with sequential smoothing**: Each additional stage increases the risk of over-smoothing

Example: Complete Workflow
--------------------------

Here's a complete example showing the recommended workflow:

.. code-block:: python

    import matplotlib.pyplot as plt
    from skferm.datasets import load_rheolaser_data
    from skferm.plotting import plot_fermentation_curves
    from skferm.smoothing import smooth, evaluate_smoothing_quality

    # 1. Load and examine data
    data = load_rheolaser_data(clean=True)

    # 2. Try gentle smoothing first
    smoothed = smooth(
        data,
        x="time",
        y="elasticity_index",
        method="rolling",
        window_size=5,
        groupby_col="sample_id"
    )

    # 3. Evaluate quality
    metrics = evaluate_smoothing_quality(
        smoothed,
        x_col="time",
        original_col="elasticity_index",
        smoothed_col="elasticity_index_smooth",
        group_col="sample_id"
    )

    # 4. Visualize results
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))

    # Plot subset for clarity
    subset = data[data["sample_id"].isin(["A1", "E1"])]
    smoothed_subset = smoothed[smoothed["sample_id"].isin(["A1", "E1"])]

    # Original data
    plot_fermentation_curves(
        subset,
        x="time",
        y="elasticity_index",
        title="Original Data",
        ax=ax1
    )

    # Smoothed data
    plot_fermentation_curves(
        smoothed_subset,
        x="time",
        y="elasticity_index_smooth",
        title="Smoothed Data",
        ax=ax2
    )

    plt.tight_layout()
    plt.show()

    # 5. Check metrics and adjust if needed
    print("Smoothing Quality Metrics:")
    print(metrics.describe())

This workflow ensures you apply smoothing systematically while maintaining data quality.
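
As a sanity check on the metrics step, the quantities named in this guide (total variation, RMSE and R²) can also be computed by hand for a single curve. The sketch below uses the common textbook definitions. Whether ``evaluate_smoothing_quality()`` uses exactly these formulas is an assumption, and the function names ``total_variation``, ``rmse`` and ``r_squared`` are hypothetical helpers, not part of ``skferm``.

```python
import math


def total_variation(values):
    """Sum of absolute differences between consecutive points."""
    return sum(abs(b - a) for a, b in zip(values, values[1:]))


def rmse(original, smoothed):
    """Root mean squared error between the original and smoothed curves."""
    squared = [(o - s) ** 2 for o, s in zip(original, smoothed)]
    return math.sqrt(sum(squared) / len(squared))


def r_squared(original, smoothed):
    """Coefficient of determination, treating the smoothed curve as the fit."""
    mean_o = sum(original) / len(original)
    ss_res = sum((o - s) ** 2 for o, s in zip(original, smoothed))
    ss_tot = sum((o - mean_o) ** 2 for o in original)
    return 1.0 - ss_res / ss_tot


# Toy curve and a trailing 2-point moving average of it (illustrative data only).
original = [0.0, 2.0, 1.0, 3.0, 2.0, 4.0]
smoothed = [0.0, 1.0, 1.5, 2.0, 2.5, 3.0]

print("TV original:", total_variation(original))
print("TV smoothed:", total_variation(smoothed))
print("RMSE:", rmse(original, smoothed))
print("R^2:", r_squared(original, smoothed))
```

In this toy case the total variation falls from 8.0 to 3.0 while R² stays at about 0.65, so the curve becomes smoother at a visible cost in fit. That trade-off is what the metrics are meant to expose when you tune window sizes.
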