DSCOVR Inverse Problem + Regularization Methods @ Caltech / UCR
Sole author
Basic Info
- Period: Mar 2021 to Mar 2022 (~11 months of active work; the original contributions are concentrated in a roughly 4-month window, Sep 2021 to Jan 2022)
- Role: Undergraduate remote research
- Organization: Caltech (Division of Geological and Planetary Sciences) + UC Riverside (Environmental Sciences)
- Advisors: Yuk L. Yung (primary PI, Caltech) + King-Fai Li (co-supervisor, UCR)
- Paper draft corresponding author: (LMD Paris, former Yuk Yung postdoc)
- Location: Remote (Zhenyu at PKU; Caltech / UCR teamwork remote)
- Project entry: via Li Chong (PKU older labmate, Yuk Yung co-author on the Li C 2020 RS paper)
Core Problem
DSCOVR single-point Earth lightcurve inverse problem: from the single-pixel Earth reflectance time series observed by DSCOVR/EPIC at the Sun-Earth L1 Lagrange point, retrieve the land-ocean distribution map, so that Earth serves as a “validation exoplanet” benchmark for future exoplanet characterization. Methodologically, the project extends the traditional Rodgers retrieval with MCMC and systematically studies regularization-parameter selection in ill-conditioned inverse problems (the DSCOVR matrix has a very large condition number, and its Picard plot decays much faster than that of the Kawahara toy model).
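As a rough illustration of what “ill-conditioned” and “Picard plot” mean in this setting, the sketch below computes the condition number and the Picard-plot quantities for a small synthetic smoothing kernel. The kernel, its width, the noise level, and all variable names are assumptions chosen for illustration; this is neither the DSCOVR matrix nor the Kawahara toy model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical forward matrix: a Gaussian smoothing kernel on n grid points.
n = 40
grid = np.arange(n)
K = np.exp(-0.5 * ((grid[:, None] - grid[None, :]) / 2.0) ** 2)

# Synthetic truth and noisy data (the noise level is an arbitrary choice).
x_true = np.sin(2 * np.pi * grid / n)
y = K @ x_true + 1e-3 * rng.standard_normal(n)

# Singular value decomposition K = U diag(s) V^T, with s in descending order.
U, s, Vt = np.linalg.svd(K)

# Condition number: ratio of the largest to the smallest singular value.
cond = s[0] / s[-1]

# Picard-plot quantities: singular values, |u_i^T y|, and their ratio.
# The discrete Picard condition asks that |u_i^T y| decay faster than s_i
# until the coefficients reach the noise floor.
coef = np.abs(U.T @ y)
ratio = coef / s

print(f"condition number ~ {cond:.2e}")
for i in range(0, n, 8):
    print(f"i={i:2d}  sigma={s[i]:.2e}  |u^T y|={coef[i]:.2e}  ratio={ratio[i]:.2e}")
```

Once the coefficients flatten at the noise level while the singular values keep decaying, the ratio column grows, which is the usual visual cue that regularization is needed.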
Paper draft (unpublished): “An automatic pipeline for retrieval from DSCOVR data”; co-authors (anonymized in public version), Zhenyu He¹, * (corresponding), King-Fai Li³, Yuk L. Yung⁴,⁵, Yongyun Hu¹ (multiple institutions). The 90-paragraph HBP mathematical section was authored primarily by Zhenyu.
Zhenyu’s 4 Original Contributions (per the proposal Example-6 4-original narrative)
Contribution (c): Hansen σ² vs σ⁴ textbook bug catch (2022-01-08, 80× deviation)
Core achievement: as an undergraduate, working from an independent SVD + filter-factor derivation, Zhenyu identified a formula-level bug in Hansen’s Discrete Inverse Problems (SIAM 2010), a foundational textbook. It is not a typo but a substantive error: on the toy model the two formulas give λ values that differ by roughly 80×.
Numerical comparison (same toy model + PC2 data):
| Formula | Toy model λ | PC2 λ | Accuracy |
|---|---|---|---|
| Hansen-student correct version (σ⁴) | 0.0251 | 0.0575 | Consistent with naked-eye elbow + axis-image theoretical value |
| Hansen-textbook erroneous version (σ²) | 1.995 (fail) | 1.995 (fail) | ~80× overestimate on the toy model (~35× on PC2); complete failure (λ pinned at upper bound, data-independent) |
Innovative discovery — independently identified an L-curve axis-scale ambiguity that Hansen did not discuss (found 2022-01-08):
“Hansen didn’t talk about whether we should control the units of x-axis and y-axis of L-curve to be the same. In his book, he showed two L-curve figures, but both didn’t control the units to be the same.”
3-step self-correction cycle (11/29 → 12/06 → 01/08, the entire trajectory preserved):
- 11/29 discovered own formula was wrong (group-meeting slide 4)
- 12/06 hypothesized “accumulation of errors” — wrong attribution (slide 6)
- 01/08 refuted that hypothesis and correctly attributed the mismatch to axis-scale ambiguity. Slides 7-10 verbatim: “What if our naked eyes are wrong, but the calculated curvature is right?” (5-cell hand-computation demo; once MATLAB’s axis image command forces equal units on both axes, the computed curvature elbow lines up with the visible corner)
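One way to see why the plotted aspect ratio matters: the point of maximum curvature of a planar curve is not invariant when one axis is stretched. The sketch below uses a toy curve, y = exp(-x), as a stand-in for an L-shaped curve (an assumption for illustration, not the project’s L-curve) and locates the maximum-curvature point before and after stretching the vertical axis by a factor of 10.

```python
import numpy as np

def signed_curvature(x, y, t):
    """Curvature of the parametric curve (x(t), y(t)) via finite differences."""
    dx = np.gradient(x, t, edge_order=2)
    dy = np.gradient(y, t, edge_order=2)
    ddx = np.gradient(dx, t, edge_order=2)
    ddy = np.gradient(dy, t, edge_order=2)
    return (dx * ddy - dy * ddx) / (dx**2 + dy**2) ** 1.5

# Toy curve y = exp(-x); it only stands in for an L-shaped curve.
t = np.linspace(0.0, 6.0, 6001)
x = t
y = np.exp(-t)

# Corner location when both axes use the same units.
corner_equal = t[np.argmax(signed_curvature(x, y, t))]

# Same data, but with the vertical axis stretched by a factor of 10,
# as happens when a plot window chooses its own aspect ratio.
stretch = 10.0
corner_stretched = t[np.argmax(signed_curvature(x, stretch * y, t))]

print(corner_equal, corner_stretched)
```

For this toy curve the corner can be worked out by hand: it sits at x = ln(c) + ln(2)/2 for a stretch factor c, so it moves from about 0.35 to about 2.65. A curvature computed in data units therefore matches the corner seen by eye only when the plot uses equal units on both axes.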
Physical-evidence discipline: the LCurve_HansenMistake_WeRectify_Test/ directory simultaneously preserves:
- Hansen自己文章_仍然mu二次方.pdf (filename gloss: “Hansen own paper still mu squared.pdf”): Hansen’s own version, denominator σ²
- hansen学生mu四次方.pdf (filename gloss: “hansen student mu fourth power.pdf”): Hansen-student correct version, denominator σ⁴
Recorded side-by-side as reproducible numerical-science best practice — see Evidence Preservation Discipline (private companion).
Independent mathematical verification: Deductions of the curvature of log-log Lcurve.docx, a 36-paragraph symbolic derivation from SVD + filter factors to the second derivatives of the log residual norm and the log solution norm with respect to λ; key step: “The denominator should have σ_i^4 instead of σ_i^2”.
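The quantities that enter such a derivation can be sketched numerically. The example below builds the Tikhonov filter factors from the SVD, evaluates the residual and solution norms on a grid of λ, and estimates the curvature of the log-log L-curve by finite differences. It deliberately uses numerical differentiation instead of either analytical curvature formula discussed above. The test kernel, noise level, λ grid, and the convention that the penalty is λ²‖x‖² are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical ill-conditioned forward matrix and noisy data.
n = 40
grid = np.arange(n)
K = np.exp(-0.5 * ((grid[:, None] - grid[None, :]) / 2.0) ** 2)
x_true = np.sin(2 * np.pi * grid / n)
y = K @ x_true + 1e-3 * rng.standard_normal(n)

U, s, Vt = np.linalg.svd(K)
beta = U.T @ y                      # data coefficients u_i^T y

lambdas = np.logspace(-6, 1, 400)
log_res = np.empty_like(lambdas)
log_sol = np.empty_like(lambdas)
for k, lam in enumerate(lambdas):
    f = s**2 / (s**2 + lam**2)                   # Tikhonov filter factors
    sol_norm2 = np.sum((f * beta / s) ** 2)      # ||x_lambda||^2
    res_norm2 = np.sum(((1.0 - f) * beta) ** 2)  # ||K x_lambda - y||^2 (square K)
    log_sol[k] = 0.5 * np.log(sol_norm2)
    log_res[k] = 0.5 * np.log(res_norm2)

# Numerical curvature of the log-log L-curve, parameterized by log(lambda).
t = np.log(lambdas)
d_res, d_sol = np.gradient(log_res, t), np.gradient(log_sol, t)
dd_res, dd_sol = np.gradient(d_res, t), np.gradient(d_sol, t)
kappa = (d_res * dd_sol - dd_res * d_sol) / (d_res**2 + d_sol**2) ** 1.5

lam_corner = lambdas[np.argmax(kappa)]
print(f"lambda at maximum curvature ~ {lam_corner:.3e}")
```

A finite-difference curvature like this gives an independent check on any closed-form expression, since it needs only the two norms and no hand-derived second derivatives.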
Community contribution (3 open questions): (1) Should L-curve axes be controlled to equal units? (2) Is Siteng’s elbow λ=1e-3 result a scale artifact? (3) When applying L-curve, under what conditions should one tune, and how?
See the L-Curve Axis-Scale Ambiguity (Hansen σ² vs σ⁴ bug catch) concept page.
Contribution (b): Pre-whitening L-curve innovation + refusing methodological-engineering pressure (2022-01-23)
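Technical innovation: under an RBF kernel (non-diagonal prior covariance Sa), transform the problem to whitened space using the prior and noise covariances (Sa, Se).
In the whitened space, plot the whitened residual norm and the whitened solution norm against each other → “modified L-curve”.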
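A minimal sketch of one common way to whiten such a problem is given below, assuming a zero-mean Gaussian prior with covariance Sa, Gaussian noise with covariance Se, and Cholesky factors as the whitening operators. The matrix sizes, the RBF length scale, the noise level, and the variable names are hypothetical; the project’s actual transformation may differ. The sketch checks only that, for one fixed pair of covariances, solving in whitened variables reproduces the usual MAP estimate.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical sizes and forward matrix (not the DSCOVR kernel).
m, n = 8, 6
K = rng.standard_normal((m, n))

# RBF prior covariance Sa (non-diagonal) and diagonal noise covariance Se.
idx = np.arange(n)
Sa = np.exp(-0.5 * ((idx[:, None] - idx[None, :]) / 1.5) ** 2) + 1e-6 * np.eye(n)
Se = 0.1**2 * np.eye(m)

# Cholesky factors: Sa = La La^T, Se = Le Le^T.
La = np.linalg.cholesky(Sa)
Le = np.linalg.cholesky(Se)

# Synthetic data drawn from the assumed prior and noise model.
y = K @ (La @ rng.standard_normal(n)) + Le @ rng.standard_normal(m)

# Whitened variables: x = La x_w, y_w = Le^{-1} y, K_w = Le^{-1} K La.
K_w = np.linalg.solve(Le, K @ La)
y_w = np.linalg.solve(Le, y)

# In whitened space the MAP estimate minimizes ||K_w x_w - y_w||^2 + ||x_w||^2.
x_w = np.linalg.solve(K_w.T @ K_w + np.eye(n), K_w.T @ y_w)
x_from_whitened = La @ x_w

# Reference: zero-mean Gaussian MAP estimate in the original variables.
x_direct = Sa @ K.T @ np.linalg.solve(K @ Sa @ K.T + Se, y)

# The two norms one could plot against each other in whitened space.
residual_norm_w = np.linalg.norm(K_w @ x_w - y_w)
solution_norm_w = np.linalg.norm(x_w)
print(residual_norm_w, solution_norm_w)
```

The check covers a single fixed (Sa, Se) pair and says nothing about how a curve of these two norms behaves as the prior is varied.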
Problem diagnosis: discovered that under a non-diagonal Sa, the L-curve’s x/y values do not vary monotonically → structural limit: the L-curve corresponds to pure Tikhonov, and a non-diagonal Sa cannot be reduced to Tikhonov form; the “modified L-curve” results are oversensitive to the priors, so how well the priors are known directly influences the result.
Independent judgment — 2022-01-23 group-meeting verbatim (refusing methodological-engineering pressure):
Held to the methodology rather than tuning the priors to engineer a target λ.
Significance: a 21-year-old undergraduate made the self-disciplined call on the spot (declined to engineer a target λ) with no opportunism, and honestly admitted that “the method I myself proposed only works under Tikhonov-equivalent covariance”. “I diagnose my own method’s limitation” is a rare sign of methodological maturity.
Corresponds to Moment 3 of Independent Judgment narrative (held until 2026-06-10).
Contribution (a): 90-paragraph mathematical exposition of the Hardened Balancing Principle (HBP)
Theoretical framework: starting from the Tikhonov case of Bauer 2007 Definition 5.1, derive:
- minimum-difference regularization parameter
- balancing functional
- exponential concentration → HBP λ_{n+} automatic parameter selection
Systematic stress test (4 benchmarks):
| Benchmark | Condition number / data | HBP performance |
|---|---|---|
| Hansen gravity problem (500-instance) | ~2.88e28 vs Hansen-textbook reported ~1.54e5 | Robust |
| Golub 1979 Laplace | 2.88e28 (more ill-conditioned than Golub’s original 1.54e5) | Robust |
| 3-point linear regression | Cross-validated against King-Fai Li’s Bayesian closed-form | α=9.57, σ=0.023, λ analytical vs MCMC differ by 18% (OK) |
| DSCOVR real PC2 time series | The W where Kawahara Bayesian cost fails | Converges to a range close to Siteng λ≈1e-3 |
Conclusion: HBP requires no prior knowledge of the noise level δ or the tuning constant κ (the plain balancing principle, BP, needs both), and it was the most robust method across the four benchmarks.
Output: the 90-paragraph mathematical-section writeup in internal-only manuscript — full derivation from Tikhonov to HBP, equivalent in scope to an independent technical note / tutorial.
Downstream impact: HBP was extended by a subsequent collaborator to PC1 (low clouds) / PC4 (high clouds) retrievals (different from Zhenyu’s PC2 = land/ocean), so the methodological influence reaches the collaborator’s work.
See the Hardened Balancing Principle (HBP) concept page.
Contribution (d): Direction of Noise Vector discovery (2021-09-20)
Original finding (2021-09-20 verbatim):
“When I did numerical tests on L-Curve, GCV etc., I found out the direction of the noise vector plays a big role in the performance of statistical criteria. I did a literature search and it seems nobody has studied this topic.”
Geometric decomposition: relative to the column space of the forward matrix, decompose the noise vector into perpendicular and parallel components; after decomposing the parallel component along the eigenvectors, the EV2-direction component is amplified.
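The decomposition can be sketched with a small example. The 3×2 matrix below is hypothetical, and the sketch takes the “eigenvector” directions to be the singular vectors of the forward matrix, which is an assumption about the original construction. It splits a noise vector into the part inside the column space and the part orthogonal to it, then shows that only the parallel part reaches the least-squares solution, with each coordinate divided by its singular value.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical 3x2 forward matrix: its column space is a plane in R^3.
K = np.array([[3.0, 1.0],
              [1.0, 1.0],
              [0.0, 1.0]])
e = rng.standard_normal(3)          # one noise realization

# Thin SVD: the columns of U span the column space of K.
U, s, Vt = np.linalg.svd(K, full_matrices=False)

# Split the noise into the part inside the column space and the remainder.
c = U.T @ e                         # coordinates along the left singular vectors
e_par = U @ c
e_perp = e - e_par

# Only e_par reaches the least-squares solution; each coordinate is
# divided by its singular value, so the small-sigma direction is amplified.
x_error = Vt.T @ (c / s)

print("singular values:", s)
print("amplification 1/sigma:", 1.0 / s)
print("solution error from noise:", x_error)
```

Because the second singular value is the smaller one, noise aligned with the second direction produces a larger solution error than noise of the same size along the first.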
4-method failure-mode catalog:
| Method | Failure condition |
|---|---|
| L-curve | large, or near EV1 |
| Kawahara Bayesian | large (5/5 fail for , negative component) |
| GCV | near EV1 and near EV2 (long flat cost region) |
| Discrepancy principle | Always works, but introduces systematic error (underestimates ) |
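For reference, two of the criteria in the table can be written compactly in SVD form. The sketch below evaluates the GCV function for Tikhonov regularization on a λ grid and solves the discrepancy-principle equation by bisection. The test problem, noise level, λ grid, safety factor τ = 1, and function names are assumptions for illustration, and the noise norm is treated as known, which is exactly the information the discrepancy principle requires. It is not intended to reproduce the failure cases catalogued above.

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical test problem.
n = 40
grid = np.arange(n)
K = np.exp(-0.5 * ((grid[:, None] - grid[None, :]) / 2.0) ** 2)
x_true = np.sin(2 * np.pi * grid / n)
noise = 1e-2 * rng.standard_normal(n)
y = K @ x_true + noise

U, s, Vt = np.linalg.svd(K)
beta = U.T @ y

def one_minus_f(lam):
    """1 - f_i for the Tikhonov filter factors f_i = s_i^2 / (s_i^2 + lam^2)."""
    return lam**2 / (s**2 + lam**2)

def residual_norm(lam):
    """||K x_lam - y|| for a square, full-rank K."""
    return np.linalg.norm(one_minus_f(lam) * beta)

def gcv(lam):
    """GCV function: squared residual over squared trace of (I - influence matrix)."""
    return residual_norm(lam) ** 2 / np.sum(one_minus_f(lam)) ** 2

lambdas = np.logspace(-8, 1, 400)
lam_gcv = lambdas[np.argmin([gcv(lam) for lam in lambdas])]

# Discrepancy principle: find lambda with ||K x_lam - y|| = tau * ||noise||.
target = 1.0 * np.linalg.norm(noise)   # tau = 1, noise norm assumed known
lo, hi = np.log(lambdas[0]), np.log(lambdas[-1])
for _ in range(60):                    # bisection in log(lambda)
    mid = 0.5 * (lo + hi)
    if residual_norm(np.exp(mid)) < target:
        lo = mid
    else:
        hi = mid
lam_dp = np.exp(0.5 * (lo + hi))

print(f"lambda_GCV ~ {lam_gcv:.3e}, lambda_discrepancy ~ {lam_dp:.3e}")
```

Bisection works here because the Tikhonov residual norm increases monotonically with λ, so the discrepancy equation has a single root inside the bracket.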
Source (incidental discovery → systematic discovery): when reproducing the Golub 1979 GCV paper (Zhenyu’s condition number 2.88e28, more ill-conditioned than the paper’s 1.54e5), the 4th of 5 instances yielded a λ_GCV completely different from the other 4 → the Hansen gravity 500-instance experiment independently showed the same behavior → the two independent pieces of evidence converged → systematized as Contribution (d). A literature search found no prior study of the topic (“it seems nobody has studied this topic”), a signature of independent problem-formulation ability.
See Multi-Method Comparison Spine (2019-2025).
Key Achievements
- Hansen textbook bug catch (80× overestimate, formula-level error in a foundational classic) — undergraduate finds a formula bug in a SIAM 2010 classic + side-by-side inconvenient-evidence retention
- Pre-whitening L-curve innovation + 2022-01-23 refuse-methodological-engineering pressure verbatim — research-integrity signature
- HBP 90-paragraph mathematical exposition + 4-benchmark stress test + subsequent collaborator’s PC1/PC4 extension — methodology-influence chain
- Independent discovery of the Direction-of-Noise-Vector + literature-gap awareness — problem-formulation originality
- Cross-language fluency: within a single 11-month project, 5 programming languages:
| Language | Role |
|---|---|
| IDL | OOP MCMC class (inheriting Yuk Yung lab convention; learned scripts from Jing Li’s PKU group) |
| MATLAB | Hansen gravity replication + csvd + 5 nested LCurve_HansenMistake_WeRectify_Test subdirectories |
| Python | Kawahara CUDA library bootstrap (remote workstation) + emcee + healpy + scipy |
| Mathematica | Symbolic-curvature derivation (calculation_4_cal_deriva.nb + calculation_5_cal_deriva.nb) |
| R | Metropolis-Hastings baseline for cross-validation |
- 29 group_meeting pptx sustained trajectory (Mar 2021 → Mar 2022, ~1 PPT / 2 weeks) — undergraduate, ~12 months, single project, 29 formal group-meeting reports
- Self-study rigor: full reproduction of three classics — Hansen Discrete Inverse Problems (SIAM 2010) + Bishop PRML + Golub 1979 original GCV paper, with numerical experiments replicated one by one
Technical Depth (resume-grade specifics)
- Self-study with full reproduction of 3 classic references: Hansen Discrete Inverse Problems + Bishop PRML + Golub 1979 GCV paper. Not just reading, but replicating the Fig 5.9 gravity 500-instance histogram + Laplace condition-number 2.88e28 stress test + Ch 5.6 shaw test problem, obtaining histograms that nearly overlap with the originals
- ~3,000+ lines of code + study notes on the repo (parallel to the 4 Caltech contributions; pre-finalization commit 2021-09-13; subsequent research-wiki files no longer in this repo but conceptually unified)
- remote workstation bootstrap: Caltech Anaconda + CUDA 11.2 + Kawahara library (CUDA library not supported under Windows → Zhenyu manually diagnosed + driver setup); 2021-08-09 group meeting “Finished Setting up Environments for Running codes on (remote workstation)” — non-trivial devops skill
- 5 nested stress-test subdirectories (LCurve_HansenMistake_WeRectify_Test/): test 1 Hansen formula failed at upper bound on toy / test 2 student formula success on toy / test 3 PC2 / test 4 Sa, Se whitening with own estimate / test 5 Sa, Se whitening with Kawahara posterior. Each directory contains Hansen original + student-correct L-curve + curvature-vs-λ figure for visual diff
- Mathematica symbolic ↔ MCMC cross-validation: 3-point regression closed-form (α=9.57, σ=0.023, λ=5.73e-5) vs MCMC (λ=6.76e-5) differ by 18%, a double-check discipline
- 5×4 failure-mode matrix (5 geometries × 4 regularization methods): [1,1] / [3,1] / [5,1] / [7,1] / [10,1] — for each case, give the best λ + retrieved x for L-curve / Kawahara / GCV / Discrepancy, structurally identifying the geometric condition under which each method fails — not “4 methods in parallel” but “4 methods’ capability diagnosis under 5 geometries”
Academic / Career Significance
- Moment 4 (Caltech → Berkeley cross-paradigm switch, Aug 2022): gave up 11 months of IDL / MATLAB / Python / Mathematica / R tooling + the momentum of 4 original contributions, and switched paradigm to David Romps DAM LES Fortran (single-point remote sensing → 3-D LES moist convection). An external factor played a key role, but the core is Zhenyu’s proactive choice to switch application direction (exoplanet habitability → nuclear-winter policy); see Independent Judgment narrative (held until 2026-06-10) Moment 4
- Moment 3 (methodological restraint, 2022-01-23): when DSCOVR pre-whitening verification failed, the undergraduate made the self-disciplined call on the spot and declined to engineer a target λ; see Independent Judgment narrative (held until 2026-06-10) Moment 3
- 3-track parallel Sep-Dec 2021 senior fall: Caltech DSCOVR HBP maturation + Walker paper v3→v6 drafting (see Walker Circulation Dynamics @ PKU) + PKU Aerosol thesis drafting (see Aerosol Joint Retrieval Senior Thesis @ PKU) + nwp_hw 4 assignments ≈ 3-4 deliverable/week sustained intensity
Skills Used
- MCMC & Bayesian Inversion (Round 3 update adding the Caltech 4 contributions)
- Inverse Problems & Regularization (Hansen bug catch + pre-whitening; concept scaffold)
- Gaussian Process Bayesian Inversion (Kawahara SOT method)
- Python Scientific Computing
- High Performance Computing (CUDA bootstrap on the remote workstation)
Related Pages
- Undergraduate Research @ Peking University & Caltech/UCR (umbrella page)
- Yuk L. Yung (Caltech primary PI)
- King-Fai Li (UCR co-supervisor)
- (LMD Paris, paper-draft corresponding author)
- Li Chong (Caltech bridge via Li C 2020 RS paper)
- Hardened Balancing Principle (HBP) (contribution a)
- L-Curve Axis-Scale Ambiguity (Hansen σ² vs σ⁴ bug catch) (contribution c)
- Evidence Preservation Discipline (private companion) (Hansen PDFs preserved side by side)
- Multi-Method Comparison Spine (2019-2025) (4-method, 5-geometry failure-mode matrix)
- Independent Judgment narrative (held until 2026-06-10) (Moments 3 + 4)
- Walker Circulation Dynamics @ PKU (3-track parallel Sep-Dec 2021)
- Aerosol Joint Retrieval Senior Thesis @ PKU
- Inverse Problems & Regularization
- Gaussian Process Bayesian Inversion
Sources
- research_wiki/projects/mcmc_retrieval/overview.md (~35k words, full 1329 lines)
- research_wiki/resume_angles/problem_solving.md §Project 3 DSCOVR
- research_wiki/resume_angles/independent_judgment.md §Moments 3 + 4
- raw/Caltech_task/ 1.5 GB folder (1,364 files / 126 dirs, 29 group_meeting pptx + internal-only manuscript + 10 technical docx + LCurve_HansenMistake_WeRectify_Test/ subdirectories)