DSCOVR Inverse Problem + Regularization Methods @ Caltech / UCR

Sole author

Basic Info

Period: Mar 2021 — Mar 2022 (~11 months active work, with the most concentrated 4-month original-contribution window)
Role: Undergraduate remote research
Organization: Caltech (Division of Geological and Planetary Sciences) + UC Riverside (Environmental Sciences)
Advisors: Yuk L. Yung (primary PI, Caltech) + King-Fai Li (co-supervisor, UCR)
Paper draft corresponding author: (LMD Paris, former Yuk Yung postdoc)
Location: Remote (Zhenyu at PKU; Caltech / UCR teamwork remote)
Project entry: via Li Chong (PKU older labmate, Yuk Yung co-author on the Li C 2020 RS paper)

Core Problem

DSCOVR single-point Earth lightcurve inverse problem: from the single-pixel Earth reflectance time series observed by DSCOVR/EPIC at the Sun-Earth L1 Lagrange point, retrieve the land-ocean distribution map, serving as a “validation exoplanet” benchmark for future exoplanet characterization. Methodologically, the project extends the traditional Rodgers retrieval with MCMC and systematically studies regularization-parameter selection in ill-conditioned inverse problems ($A x = b$, condition number ~$10^{6}$; the Picard plot of the DSCOVR $W$ matrix decays much faster than that of the Kawahara toy model).
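To make the condition number and Picard plot concrete, here is a minimal sketch of the quantities a Picard plot displays. The Gaussian smoothing kernel, the test signal, and the noise level are hypothetical stand-ins, not the DSCOVR $W$ matrix or the Kawahara toy.

```python
import numpy as np

def picard_quantities(A, b):
    """Return singular values, |u_i^T b|, and their ratio (the three Picard-plot series)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    beta = np.abs(U.T @ b)
    with np.errstate(divide="ignore", invalid="ignore"):
        ratio = beta / s
    return s, beta, ratio

# Hypothetical ill-conditioned forward operator: a Gaussian smoothing kernel
n = 32
t = np.linspace(0.0, 1.0, n)
A = np.exp(-((t[:, None] - t[None, :]) ** 2) / (2 * 0.1 ** 2)) / n
x_true = np.sin(np.pi * t)
rng = np.random.default_rng(0)
b = A @ x_true + 1e-6 * rng.standard_normal(n)

s, beta, ratio = picard_quantities(A, b)
# Guard against a singular value that underflows to exactly zero
cond = s[0] / max(s[-1], np.finfo(float).tiny)
print(f"condition number ~ {cond:.2e}")
```

Plotting `s`, `beta`, and `ratio` against the index on a log axis gives the Picard plot. The faster `s` decays relative to `beta`, the more strongly the unregularized solution amplifies noise.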

Paper draft (unpublished): “An automatic pipeline for retrieval from DSCOVR data”; authors: co-authors (anonymized in public version), Zhenyu He¹,* (corresponding), King-Fai Li³, Yuk L. Yung⁴,⁵, Yongyun Hu¹ (multiple institutions). The 90-paragraph HBP mathematical section was authored primarily by Zhenyu.

Zhenyu’s 4 Original Contributions (per the proposal Example-6 4-original narrative)

Contribution (c): Hansen σ² vs σ⁴ textbook bug catch (2022-01-08, 80× deviation)

Core achievement: as an undergraduate, using independent SVD + filter-factor derivation, Zhenyu identified a formula-level bug in Hansen’s classic Discrete Inverse Problems (SIAM 2010), a foundational textbook — not a typo, but a substantive error that produces ~80× different numerical results.

Numerical comparison (same toy model + PC2 data):

Formula	Toy model λ	PC2 λ	Accuracy
Hansen-student correct version (σ⁴)	0.0251	0.0575	Consistent with naked-eye elbow + axis-image theoretical value
Hansen-textbook erroneous version (σ²)	1.995 (fail)	1.995 (fail)	~80× overestimate on the toy model (1.995 / 0.0251 ≈ 79.5; ≈ 35× on PC2), complete failure (λ pinned at upper bound, data-independent)

Innovative discovery — independently identified an L-curve axis-scale ambiguity that Hansen did not discuss (found 2022-01-08):

“Hansen didn’t talk about whether we should control the units of x-axis and y-axis of L-curve to be the same. In his book, he showed two L-curve figures, but both didn’t control the units to be the same.”

3-step self-correction cycle (11/29 → 12/06 → 01/08, the entire trajectory preserved):

11/29 discovered own formula was wrong (group-meeting slide 4)
12/06 hypothesized “accumulation of errors” — wrong attribution (slide 6)
01/08 refuted that hypothesis and correctly attributed the discrepancy to axis-scale ambiguity; slides 7-10 verbatim: “What if our naked eyes are wrong, but the calculated curvature is right?” (5-cell hand-computation demo; after MATLAB `axis image` forces equal units, the curvature elbow at $x = -1.337$ aligns with the visual elbow)
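The axis-scale effect can be illustrated on a toy curve, since curvature is not invariant when one axis is stretched. The sketch below uses the hypothetical curve $(c t, e^{t})$, not the project's L-curve. For this curve the maximum-curvature point is $t^{*} = \ln c - \tfrac{1}{2} \ln 2$, so stretching the x-axis by a factor $c$ moves the computed elbow by $\ln c$.

```python
import numpy as np

def curvature(x, y, t):
    """Signed curvature of the parametric curve (x(t), y(t)) by finite differences."""
    dx, dy = np.gradient(x, t), np.gradient(y, t)
    ddx, ddy = np.gradient(dx, t), np.gradient(dy, t)
    return (dx * ddy - dy * ddx) / (dx ** 2 + dy ** 2) ** 1.5

t = np.linspace(-4.0, 4.0, 8001)
y = np.exp(t)  # hypothetical toy curve, chosen because its elbow is known analytically

def elbow(c):
    """Parameter value of maximum curvature when the x-axis is stretched by c."""
    return t[np.argmax(curvature(c * t, y, t))]

e1 = elbow(1.0)   # equal units on both axes
e3 = elbow(3.0)   # x-axis stretched by a factor of 3
print(e1, e3)
```

The same curve yields two different "elbows" depending only on the relative axis units, which is why the plotted aspect ratio has to be fixed before comparing a visual elbow with a computed one.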

Physical-evidence discipline: the LCurve_HansenMistake_WeRectify_Test/ directory simultaneously preserves

Hansen自己文章_仍然mu二次方.pdf (Hansen’s own version, denominator σ²) (filename gloss: “Hansen own paper still mu squared.pdf”)
hansen学生mu四次方.pdf (Hansen-student correct version, denominator σ⁴) (filename gloss: “hansen student mu fourth power.pdf”)

Recorded side-by-side as reproducible numerical-science best practice — see Evidence Preservation Discipline (private companion).

Independent mathematical verification: Deductions of the curvature of log-log Lcurve.docx, a 36-paragraph symbolic derivation from the SVD $A = U Σ V^{T}$ and the filter factors $f_{i} = σ_{i}^{2} / (σ_{i}^{2} + λ^{2})$ to the second derivatives of $\ln ∥ x_{λ} ∥$ and $\ln ∥ A x_{λ} - b ∥$ with respect to $μ = \ln λ$; key step: “The denominator should have σ_i^4 instead of σ_i^2”.
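As a numerical cross-check of this kind of derivation, the log-log L-curve and its curvature with respect to $μ = \ln λ$ can be evaluated from the SVD and the filter factors. This is a minimal sketch: it uses finite differences in place of the symbolic second derivatives, and the test problem is a hypothetical Gaussian kernel, not the toy model or PC2 data above.

```python
import numpy as np

def lcurve_log_points(A, b, lambdas):
    """ln(residual norm) and ln(solution norm) of Tikhonov solutions via SVD."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    beta = U.T @ b
    # Squared norm of the part of b outside the span of U
    out_of_range = max(b @ b - beta @ beta, 0.0)
    rho, eta = [], []
    for lam in lambdas:
        denom = s ** 2 + lam ** 2
        x_coef = s / denom * beta           # f_i * beta_i / sigma_i
        r_coef = lam ** 2 / denom * beta    # (1 - f_i) * beta_i
        eta.append(np.log(np.linalg.norm(x_coef)))
        rho.append(0.5 * np.log(r_coef @ r_coef + out_of_range))
    return np.array(rho), np.array(eta)

def lcurve_curvature(rho, eta, mu):
    """Signed curvature of (rho(mu), eta(mu)) using finite differences in mu."""
    d_rho, d_eta = np.gradient(rho, mu), np.gradient(eta, mu)
    dd_rho, dd_eta = np.gradient(d_rho, mu), np.gradient(d_eta, mu)
    return (d_rho * dd_eta - dd_rho * d_eta) / (d_rho ** 2 + d_eta ** 2) ** 1.5

# Hypothetical test problem
n = 32
t = np.linspace(0.0, 1.0, n)
A = np.exp(-((t[:, None] - t[None, :]) ** 2) / (2 * 0.1 ** 2)) / n
rng = np.random.default_rng(0)
b = A @ np.sin(np.pi * t) + 1e-3 * rng.standard_normal(n)

lambdas = np.logspace(-6, 1, 400)
mu = np.log(lambdas)
rho, eta = lcurve_log_points(A, b, lambdas)
kappa = lcurve_curvature(rho, eta, mu)
lam_corner = lambdas[np.nanargmax(kappa)]
print(f"lambda at maximum curvature: {lam_corner:.3e}")
```

Because the curvature here comes from finite differences rather than a closed-form expression, it does not depend on which analytic formula is used and can serve as an independent reference for one.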

Community contribution (3 open questions): (1) Should L-curve axes be controlled to equal units? (2) Is Siteng’s elbow λ=1e-3 result a scale artifact? (3) When applying L-curve, under what conditions should one tune, and how?

See the L-Curve Axis-Scale Ambiguity (Hansen σ² vs σ⁴ bug catch) concept page.

Contribution (b): Pre-whitening L-curve innovation + refusing methodological-engineering pressure (2022-01-23)

Technical innovation: under an RBF kernel (non-diagonal prior covariance $S_{a}$ ), transform to whitened space:

$x_{1} = S_{a}^{-1/2} x, \quad y_{1} = S_{e}^{-1/2} y, \quad A_{1} = S_{e}^{-1/2} A S_{a}^{1/2}$

In the whitened space, plot $\ln ∥ x_{1} ∥$ vs $\ln ∥ y_{1} - A_{1} x_{1} ∥$ → “modified L-curve”.
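The whitening transform can be checked numerically. The sketch below assumes a hypothetical RBF prior covariance with a small diagonal jitter and a diagonal $S_{e}$. It verifies that $A_{1} x_{1} = S_{e}^{-1/2} A x$ and that the covariance-weighted cost equals $∥ y_{1} - A_{1} x_{1} ∥^{2} + ∥ x_{1} ∥^{2}$ in the whitened variables.

```python
import numpy as np

def sym_power(S, p):
    """S**p for a symmetric positive-definite matrix via eigendecomposition."""
    w, V = np.linalg.eigh(S)
    return (V * w ** p) @ V.T

rng = np.random.default_rng(1)
m, n = 12, 6
A = rng.standard_normal((m, n))          # hypothetical forward operator
idx = np.arange(n)
# Hypothetical RBF (non-diagonal) prior covariance, with jitter for stability
S_a = np.exp(-0.5 * (idx[:, None] - idx[None, :]) ** 2 / 1.5 ** 2) + 1e-6 * np.eye(n)
S_e = 0.01 * np.eye(m)                   # hypothetical noise covariance
x = rng.standard_normal(n)
y = A @ x + 0.1 * rng.standard_normal(m)

x1 = sym_power(S_a, -0.5) @ x
y1 = sym_power(S_e, -0.5) @ y
A1 = sym_power(S_e, -0.5) @ A @ sym_power(S_a, 0.5)

# Covariance-weighted cost in the original variables
r = y - A @ x
cost_original = r @ np.linalg.solve(S_e, r) + x @ np.linalg.solve(S_a, x)
# Same cost in the whitened variables
cost_whitened = np.sum((y1 - A1 @ x1) ** 2) + np.sum(x1 ** 2)
print(cost_original, cost_whitened)
```

The two costs agree to rounding error, so the modified L-curve inherits whatever is assumed about $S_{a}$ and $S_{e}$.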

Problem diagnosis: under a non-diagonal $S_{a}$, the L-curve’s x/y values do not vary monotonically. This is a structural limit: the L-curve corresponds to pure Tikhonov regularization, and a non-diagonal $S_{a}$ cannot be reduced to Tikhonov form. The “modified L-curve” results are also oversensitive to the $S_{a}$, $S_{e}$ priors: how well the priors are known directly influences the result.

Independent judgment — 2022-01-23 group-meeting verbatim (refusing methodological-engineering pressure):

Held to the methodology rather than tuning $S_{a}$ , $S_{e}$ priors to engineer a target λ.

Significance: a 21-year-old undergraduate made the self-disciplined call on the spot and declined to engineer a target λ. No opportunism; an honest admission that “the method I myself proposed only works under Tikhonov-equivalent covariance”. “I diagnose my own method’s limitation” is a rare sign of methodological maturity.

Corresponds to Moment 3 of Independent Judgment narrative (held until 2026-06-10).

Contribution (a): 90-paragraph mathematical exposition of the Hardened Balancing Principle (HBP)

Theoretical framework: starting from Bauer 2007 Definition 5.1’s $ψ^{+} = 1$ (Tikhonov), derive:

minimum-difference regularization parameter: $n^{+} = \min \{ n \in [0, N] : D(n, k) > D(n-1, k) \text{ for some } k \}$
balancing functional: $b(n, m) = ∥ x_{n} - x_{m} ∥ / [Φ(N, m) - Φ(N, n)]$
exponential concentration → HBP $λ_{n^{+}}$ automatic parameter selection

Systematic stress test (4 benchmarks):

Benchmark	Condition number / data	HBP performance
Hansen gravity problem (500-instance)	~2.88e28 vs Hansen-textbook reported ~1.54e5	Robust
Golub 1979 Laplace	2.88e28 (more ill-conditioned than Golub’s original 1.54e5)	Robust
3-point linear regression	Cross-validated against King-Fai Li’s Bayesian closed-form	α=9.57, σ=0.023, λ analytical vs MCMC differ by 18% (OK)
DSCOVR real PC2 time series	The W where Kawahara Bayesian cost fails	Converges to a range close to Siteng λ≈1e-3

Conclusion: HBP requires no prior knowledge of noise level δ or tuning constant κ (BP needs both); most robust across benchmarks.

Output: the 90-paragraph mathematical-section writeup in internal-only manuscript — full derivation from Tikhonov to HBP, equivalent in scope to an independent technical note / tutorial.

Downstream impact: HBP was extended by a subsequent collaborator (a co-author) to PC1 (low clouds) / PC4 (high clouds) retrievals (different from Zhenyu’s PC2 = land/ocean), so the methodological influence reaches the collaborator’s work.

See the Hardened Balancing Principle (HBP) concept page.

Contribution (d): Direction of Noise Vector discovery (2021-09-20)

Original finding (2021-09-20 verbatim):

“When I did numerical tests on L-Curve, GCV etc., I found out the direction of the noise vector plays a big role in the performance of statistical criteria. I did a literature search and it seems nobody has studied this topic.”

Geometric decomposition: in the column space of $A = [A_{1}, A_{2}]$, with $d = A x_{sol}$, decompose noise $dd$ into perpendicular + parallel; after decomposing the parallel component along eigenvectors, the EV2-direction component is amplified.
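A minimal sketch of this decomposition, assuming that EV1 and EV2 correspond to the left singular vectors of $A$ (the source does not say which matrix the eigenvectors belong to). The 3×2 matrix is hypothetical. The perpendicular component leaves the least-squares solution unchanged, while a unit of noise along the second direction perturbs the solution $σ_{1} / σ_{2}$ times more than a unit along the first.

```python
import numpy as np

# Hypothetical two-column forward operator A = [A_1, A_2]
A = np.array([[1.0, 0.9],
              [0.5, 0.6],
              [0.2, 0.1]])
U, s, Vt = np.linalg.svd(A, full_matrices=False)

def decompose(dd):
    """Split a noise vector into components parallel and perpendicular to col(A)."""
    coeffs = U.T @ dd              # coordinates along u_1 (EV1) and u_2 (EV2)
    parallel = U @ coeffs
    perpendicular = dd - parallel
    return coeffs, parallel, perpendicular

def solution_error(dd):
    """Perturbation of the least-squares solution caused by data noise dd."""
    return np.linalg.pinv(A) @ dd

rng = np.random.default_rng(2)
dd = rng.standard_normal(3)
coeffs, parallel, perpendicular = decompose(dd)

err_ev1 = np.linalg.norm(solution_error(U[:, 0]))   # unit noise along EV1
err_ev2 = np.linalg.norm(solution_error(U[:, 1]))   # unit noise along EV2
print(f"amplification ratio EV2/EV1: {err_ev2 / err_ev1:.2f}")
```

In this picture two noise vectors of identical norm can produce very different retrieval errors, depending only on their direction relative to the singular vectors.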

4-method failure-mode catalog:

Method	Failure condition
L-curve	$∥ x_{sol} ∥$ large, or $dd$ near EV1
Kawahara Bayesian	$∥ x_{sol} ∥$ large (5/5 fail for $x_{sol} = [10, 1]$, negative component)
GCV	$d$ near EV1 and $dd$ near EV2 (long flat cost region)
Discrepancy principle	Always works, but introduces systematic error (underestimates $∥ x ∥$ )
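For reference, two of the criteria in the table can be sketched for Tikhonov regularization on a grid of λ. This assumes the standard GCV function $G(λ) = ∥ A x_{λ} - b ∥^{2} / (m - \sum_{i} f_{i})^{2}$ and a discrepancy target $δ$ set to the true noise norm, which is known here only because the data are synthetic. The kernel problem is hypothetical and is not the two-column toy behind the table.

```python
import numpy as np

def tikhonov_scan(A, b, lambdas):
    """Residual norms and GCV values of Tikhonov solutions over a grid of lambda."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    beta = U.T @ b
    m = A.shape[0]
    out_of_range = max(b @ b - beta @ beta, 0.0)   # part of b outside col(A)
    res = np.empty(len(lambdas))
    gcv = np.empty(len(lambdas))
    for k, lam in enumerate(lambdas):
        denom = s ** 2 + lam ** 2
        r_coef = lam ** 2 / denom * beta           # (1 - f_i) * beta_i
        res[k] = np.sqrt(r_coef @ r_coef + out_of_range)
        gcv[k] = res[k] ** 2 / (m - np.sum(s ** 2 / denom)) ** 2
    return res, gcv

# Hypothetical overdetermined smoothing problem
m, n = 40, 20
tr, tc = np.linspace(0.0, 1.0, m), np.linspace(0.0, 1.0, n)
A = np.exp(-((tr[:, None] - tc[None, :]) ** 2) / (2 * 0.1 ** 2)) / n
rng = np.random.default_rng(0)
noise = 1e-3 * rng.standard_normal(m)
b = A @ np.sin(np.pi * tc) + noise

lambdas = np.logspace(-6, 1, 400)
res, gcv = tikhonov_scan(A, b, lambdas)

lam_gcv = lambdas[np.argmin(gcv)]      # GCV: minimize G(lambda)
delta = np.linalg.norm(noise)          # assumption: noise norm known exactly
lam_dp = lambdas[res <= delta][-1]     # discrepancy: largest lambda with residual <= delta
print(f"lambda_GCV = {lam_gcv:.3e}, lambda_DP = {lam_dp:.3e}")
```

Plotting `gcv` against `lambdas` for different noise realizations is one way to look for the long flat cost region mentioned in the GCV row.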

Source: incidental discovery → systematic discovery: when reproducing the Golub 1979 GCV paper (Zhenyu’s condition 2.88e28, more ill-conditioned than the paper’s 1.54e5), the 4th of 5 instances yielded a λ_GCV completely different from the other 4 → the Hansen gravity 500-instance independently rediscovered the same → two independent pieces of evidence converged → systematized as Contribution (d). Literature search confirmed the literature gap (“nobody has studied”) — signature of independent problem-formulation ability.

See Multi-Method Comparison Spine (2019-2025).

Key Achievements

Hansen textbook bug catch (80× overestimate, formula-level error in a foundational classic) — undergraduate finds a formula bug in a SIAM 2010 classic + side-by-side inconvenient-evidence retention
Pre-whitening L-curve innovation + 2022-01-23 refuse-methodological-engineering pressure verbatim — research-integrity signature
HBP 90-paragraph mathematical exposition + 4-benchmark stress test + subsequent collaborator’s PC1/PC4 extension — methodology-influence chain
Independent discovery of the Direction-of-Noise-Vector + literature-gap awareness — problem-formulation originality
Cross-language fluency: within a single 11-month project, 5 programming languages:

Language	Role
IDL	OOP MCMC class (inheriting Yuk Yung lab convention; learned scripts from Jing Li’s PKU group)
MATLAB	Hansen gravity replication + csvd + 5 nested `LCurve_HansenMistake_WeRectify_Test` subdirectories
Python	Kawahara CUDA library bootstrap (remote workstation) + emcee + healpy + scipy
Mathematica	Symbolic-curvature derivation (`calculation_4_cal_deriva.nb` + `calculation_5_cal_deriva.nb`)
R	Metropolis-Hastings baseline for cross-validation

29 group_meeting pptx sustained trajectory (Mar 2021 → Mar 2022, ~1 PPT / 2 weeks) — undergraduate, ~12 months, single project, 29 formal group-meeting reports
Self-study rigor: full reproduction of three classics — Hansen Discrete Inverse Problems (SIAM 2010) + Bishop PRML + Golub 1979 original GCV paper, with numerical experiments replicated one by one

Technical Depth (resume-grade specifics)

Self-study, full reproduction of 3 classics: Hansen Discrete Inverse Problems + Bishop PRML + Golub 1979 GCV paper. Not just reading, but replicating the Fig 5.9 gravity 500-instance histogram + Laplace condition 2.88e28 stress test + Ch 5.6 shaw test problem, obtaining histograms that nearly overlap with the originals
~3,000+ lines of code + study notes on the repo (parallel to the 4 Caltech contributions; pre-finalization commit 2021-09-13; subsequent research-wiki files no longer in this repo but conceptually unified)
remote workstation bootstrap: Caltech Anaconda + CUDA 11.2 + Kawahara library (CUDA library not supported under Windows → Zhenyu manually diagnosed + driver setup); 2021-08-09 group meeting “Finished Setting up Environments for Running codes on (remote workstation)” — non-trivial devops skill
5 nested stress-test subdirectories (LCurve_HansenMistake_WeRectify_Test/): test 1 Hansen formula failed at upper bound on toy / test 2 student formula success on toy / test 3 PC2 / test 4 Sa, Se whitening with own estimate / test 5 Sa, Se whitening with Kawahara posterior — each directory contains Hansen original + student-correct L-curve + curvature-vs-λ figure for visual diff
Mathematica symbolic ↔ MCMC cross-validation: 3-point regression closed-form (α=9.57, σ=0.023, λ=5.73e-5) vs MCMC (λ=6.76e-5) differ by 18% — double-check discipline
5×4 failure-mode matrix (5 $x_{sol}$ geometries × 4 regularization methods): [1,1] / [3,1] / [5,1] / [7,1] / [10,1]. For each case, the best λ and the retrieved x are given for L-curve / Kawahara / GCV / Discrepancy, structurally identifying the geometric condition under which each method fails: not “4 methods in parallel” but “4 methods’ capability diagnosis under 5 geometries”

Academic / Career Significance

Moment 4 (Caltech → Berkeley cross-paradigm switch, Aug 2022): gave up 11 months of IDL / MATLAB / Python / Mathematica / R tooling + 4 original-contribution momentum, and switched paradigm to David Romps DAM LES Fortran (single-point remote sensing → 3-D LES moist convection); an external factor played a key role, but the core is Zhenyu’s proactive choice to switch application direction (exoplanet habitability → nuclear-winter policy); see Independent Judgment narrative (held until 2026-06-10) Moment 4
Moment 3 (methodological restraint, 2022-01-23): when DSCOVR pre-whitening verification failed, the undergraduate made the self-disciplined call on the spot and declined to engineer a target λ; see Independent Judgment narrative (held until 2026-06-10) Moment 3
3-track parallel Sep-Dec 2021 senior fall: Caltech DSCOVR HBP maturation + Walker paper v3→v6 drafting (see Walker Circulation Dynamics @ PKU) + PKU Aerosol thesis drafting (see Aerosol Joint Retrieval Senior Thesis @ PKU) + nwp_hw 4 assignments ≈ 3-4 deliverable/week sustained intensity

Skills Used

MCMC & Bayesian Inversion (Round 3 update adding the Caltech 4 contributions)
Inverse Problems & Regularization (Hansen bug catch + pre-whitening; concept scaffold)
Gaussian Process Bayesian Inversion (Kawahara SOT method)
Python Scientific Computing
High Performance Computing (CUDA bootstrap on the remote workstation)

Undergraduate Research @ Peking University & Caltech/UCR (umbrella page)
Yuk L. Yung (Caltech primary PI)
King-Fai Li (UCR co-supervisor)
(LMD Paris, paper-draft corresponding author)
Li Chong (Caltech bridge via Li C 2020 RS paper)
Hardened Balancing Principle (HBP) (contribution a)
L-Curve Axis-Scale Ambiguity (Hansen σ² vs σ⁴ bug catch) (contribution c)
Evidence Preservation Discipline (private companion) (Hansen PDFs preserved side by side)
Multi-Method Comparison Spine (2019-2025) (4-method, 5-geometry failure-mode matrix)
Independent Judgment narrative (held until 2026-06-10) (Moments 3 + 4)
Walker Circulation Dynamics @ PKU (3-track parallel Sep-Dec 2021)
Aerosol Joint Retrieval Senior Thesis @ PKU
Inverse Problems & Regularization
Gaussian Process Bayesian Inversion

Sources

research_wiki/projects/mcmc_retrieval/overview.md (~35k words, full 1329 lines)
research_wiki/resume_angles/problem_solving.md §Project 3 DSCOVR
research_wiki/resume_angles/independent_judgment.md §Moments 3 + 4
raw/Caltech_task/ 1.5 GB folder (1,364 files / 126 dirs, 29 group_meeting pptx + internal-only manuscript + 10 technical docx + LCurve_HansenMistake_WeRectify_Test/ subdirectories)