DSCOVR Inverse Problem + Regularization Methods @ Caltech / UCR

Sole author

Basic Info

  • Period: Mar 2021 — Mar 2022 (~11 months active work, with the most concentrated 4-month original-contribution window)
  • Role: Undergraduate remote research
  • Organization: Caltech (Division of Geological and Planetary Sciences) + UC Riverside (Environmental Sciences)
  • Advisors: Yuk L. Yung (primary PI, Caltech) + King-Fai Li (co-supervisor, UCR)
  • Paper draft corresponding author: (LMD Paris, former Yuk Yung postdoc)
  • Location: Remote (Zhenyu at PKU; Caltech / UCR teamwork remote)
  • Project entry: via Li Chong (PKU older labmate, Yuk Yung co-author on the Li C 2020 RS paper)

Core Problem

DSCOVR single-point Earth lightcurve inverse problem: from the single-pixel Earth reflectance time series observed by DSCOVR/EPIC at the Sun-Earth L1 Lagrange point, retrieve the land-ocean distribution map, with Earth serving as a “validation exoplanet” benchmark for future exoplanet characterization. Methodologically, the project extends the traditional Rodgers retrieval with MCMC and systematically studies regularization-parameter selection in ill-conditioned inverse problems (the DSCOVR matrix’s Picard plot decays much faster than that of the Kawahara toy problem).

Paper draft (unpublished): “An automatic pipeline for retrieval from DSCOVR data”, co-authors (anonymized in public version), Zhenyu He¹, * (corresponding), King-Fai Li³, Yuk L. Yung⁴,⁵, Yongyun Hu¹ (multiple institutions). The 90-paragraph HBP mathematical section was authored primarily by Zhenyu.


Zhenyu’s 4 Original Contributions (per the proposal Example-6 4-original narrative)

Contribution (c): Hansen σ² vs σ⁴ textbook bug catch (2022-01-08, 80× deviation)

Core achievement: as an undergraduate, using independent SVD + filter-factor derivation, Zhenyu identified a formula-level bug in Hansen’s classic Discrete Inverse Problems (SIAM 2010), a foundational textbook — not a typo, but a substantive error that produces ~80× different numerical results.

Numerical comparison (same toy model + PC2 data):

Formula | Toy model λ | PC2 λ | Accuracy
Hansen-student correct version (σ⁴) | 0.0251 | 0.0575 | Consistent with naked-eye elbow + axis-image theoretical value
Hansen-textbook erroneous version (σ²) | 1.995 (fail) | 1.995 (fail) | 80× overestimate, complete failure (λ pinned at upper bound, data-independent)

Innovative discovery — independently identified an L-curve axis-scale ambiguity that Hansen did not discuss (found 2022-01-08):

“Hansen didn’t talk about whether we should control the units of x-axis and y-axis of L-curve to be the same. In his book, he showed two L-curve figures, but both didn’t control the units to be the same.”

3-step self-correction cycle (11/29 → 12/06 → 01/08, the entire trajectory preserved):

  1. 11/29 discovered own formula was wrong (group-meeting slide 4)
  2. 12/06 hypothesized “accumulation of errors” — wrong attribution (slide 6)
  3. 01/08 refuted that hypothesis, correctly attributed to axis-scale ambiguity; slides 7-10 verbatim: “What if our naked eyes are wrong, but the calculated curvature is right?” (5-cell hand-computation demo; after MATLAB axis image forces equal units, the curvature elbow and the visual elbow truly align)
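
The axis-scale point can be illustrated with a shape that has no corner at all. The sketch below is a hypothetical illustration, not the slide-deck demo: it computes finite-difference curvature for a unit circle and for the same circle with the y-axis stretched by a factor of 2. The function name `parametric_curvature` and the stretch factor are assumptions chosen for the example.

```python
import numpy as np

def parametric_curvature(x, y, t):
    """Signed curvature of the plane curve (x(t), y(t)) by finite differences."""
    dx, dy = np.gradient(x, t), np.gradient(y, t)
    ddx, ddy = np.gradient(dx, t), np.gradient(dy, t)
    return (dx * ddy - ddx * dy) / (dx**2 + dy**2) ** 1.5

# Parameter grid chosen so that t[500] = pi/2 and t[1000] = pi.
t = np.linspace(0.0, 2.0 * np.pi, 2001)
x, y = np.cos(t), np.sin(t)

kappa_equal = parametric_curvature(x, y, t)            # equal units: unit circle
kappa_stretched = parametric_curvature(x, 2.0 * y, t)  # y-axis stretched by 2

i_top, i_side = 500, 1000
print("equal units:    ", kappa_equal[i_top], kappa_equal[i_side])
print("stretched y-axis:", kappa_stretched[i_top], kappa_stretched[i_side])
```

With equal units the curvature is about 1 everywhere. After the stretch it is about 2 at the top of the curve and about 0.25 at the side, so the location of maximum curvature depends on the units chosen for each axis.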

Physical-evidence discipline: the LCurve_HansenMistake_WeRectify_Test/ directory simultaneously preserves

  • Hansen自己文章_仍然mu二次方.pdf (Hansen’s own version, denominator σ²) (filename gloss: “Hansen own paper still mu squared.pdf”)
  • hansen学生mu四次方.pdf (Hansen-student correct version, denominator σ⁴) (filename gloss: “hansen student mu fourth power.pdf”)

Recorded side-by-side as reproducible numerical-science best practice — see Evidence Preservation Discipline (private companion).

Independent mathematical verification: Deductions of the curvature of log-log Lcurve.docx, a 36-paragraph symbolic derivation from SVD + filter factors to the second derivatives of the log residual norm and log solution norm with respect to λ; key step: “The denominator should have σ_i^4 instead of σ_i^2”.
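
One way to check a closed-form curvature expression is to compare it against a purely numerical curvature. The sketch below assumes standard-form Tikhonov regularization with filter factors f_i = σ_i² / (σ_i² + λ²), builds the log-log L-curve from the SVD, and differentiates it by finite differences in log λ. It deliberately uses neither the σ² nor the σ⁴ closed form. The function names, the diagonal toy matrix, and the fixed noise pattern are illustrative assumptions, not the project's code or data.

```python
import numpy as np

def loglog_lcurve(A, b, lambdas):
    """Log residual norm and log solution norm of Tikhonov solutions, via SVD."""
    U, s, _ = np.linalg.svd(A, full_matrices=False)
    beta = U.T @ b
    # Part of b outside the column space of A adds a constant to the residual.
    r_perp2 = max(float(b @ b - beta @ beta), 0.0)
    lam2 = np.asarray(lambdas, dtype=float)[:, None] ** 2
    f = s**2 / (s**2 + lam2)              # filter factors, shape (n_lambda, n)
    one_minus_f = lam2 / (s**2 + lam2)    # computed directly to avoid cancellation
    eta = np.sqrt(np.sum((f * beta / s) ** 2, axis=1))
    rho = np.sqrt(np.sum((one_minus_f * beta) ** 2, axis=1) + r_perp2)
    return np.log(rho), np.log(eta)

def lcurve_curvature(A, b, lambdas):
    """Finite-difference curvature of the log-log L-curve, parameterized by log(lambda)."""
    rho_hat, eta_hat = loglog_lcurve(A, b, lambdas)
    t = np.log(lambdas)
    d_rho, d_eta = np.gradient(rho_hat, t), np.gradient(eta_hat, t)
    dd_rho, dd_eta = np.gradient(d_rho, t), np.gradient(d_eta, t)
    return (d_rho * dd_eta - dd_rho * d_eta) / (d_rho**2 + d_eta**2) ** 1.5

# Hypothetical toy problem: decaying singular values plus a fixed noise pattern.
sigma = 10.0 ** -np.arange(6)
A = np.diag(sigma)
noise = 1e-3 * np.array([1.0, -1.0, 1.0, -1.0, 1.0, -1.0])
b = A @ np.ones(6) + noise

lambdas = np.logspace(-5, 0, 201)
kappa = lcurve_curvature(A, b, lambdas)
lam_corner = lambdas[np.argmax(kappa)]
print("lambda at maximum numerical curvature:", lam_corner)
```

A closed-form curvature that disagrees with `kappa` by a large factor on such a toy is suspect. Curvature is unchanged by reparameterizing the curve, so the result does not depend on differentiating in log λ rather than in λ.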

Community contribution (3 open questions): (1) Should L-curve axes be controlled to equal units? (2) Is Siteng’s elbow λ=1e-3 result a scale artifact? (3) When applying L-curve, under what conditions should one tune, and how?

See the L-Curve Axis-Scale Ambiguity (Hansen σ² vs σ⁴ bug catch) concept page.


Contribution (b): Pre-whitening L-curve innovation + refusing methodological-engineering pressure (2022-01-23)

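Technical innovation: under an RBF kernel (non-diagonal prior covariance S_a), transform the problem to whitened space, in which the prior covariance becomes the identity.

In the whitened space, plot the solution norm against the residual norm → “modified L-curve”.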
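
A minimal sketch of one common whitening convention, using Cholesky factors S_a = L_a L_aᵀ and S_e = L_e L_eᵀ, so that x̃ = L_a⁻¹(x − x_a), ỹ = L_e⁻¹(y − K x_a), and K̃ = L_e⁻¹ K L_a. The transform actually used in the project is not reproduced on this page, so the convention, the squared-exponential form of the RBF kernel, the names `K`, `y`, `x_a`, and the random test data are all assumptions. The check at the end confirms that the covariance-weighted cost equals the plain least-squares cost in the whitened variables.

```python
import numpy as np

def rbf_covariance(n, length_scale=1.0, jitter=1e-8):
    """Squared-exponential prior covariance on an integer grid (assumed form)."""
    idx = np.arange(n, dtype=float)
    d2 = (idx[:, None] - idx[None, :]) ** 2
    return np.exp(-0.5 * d2 / length_scale**2) + jitter * np.eye(n)

def whiten(K, y, S_a, S_e, x_a):
    """Return the whitened forward matrix, whitened data, and prior factor L_a."""
    L_a = np.linalg.cholesky(S_a)            # S_a = L_a @ L_a.T
    L_e = np.linalg.cholesky(S_e)            # S_e = L_e @ L_e.T
    K_w = np.linalg.solve(L_e, K @ L_a)      # L_e^{-1} K L_a
    y_w = np.linalg.solve(L_e, y - K @ x_a)  # L_e^{-1} (y - K x_a)
    return K_w, y_w, L_a

# Hypothetical test data.
rng = np.random.default_rng(0)
m, n = 8, 5
K = rng.normal(size=(m, n))
y = rng.normal(size=m)
x = rng.normal(size=n)
x_a = np.zeros(n)
S_a = rbf_covariance(n)      # non-diagonal prior covariance
S_e = 0.1 * np.eye(m)        # measurement-error covariance

K_w, y_w, L_a = whiten(K, y, S_a, S_e, x_a)
x_w = np.linalg.solve(L_a, x - x_a)

r = y - K @ x
cost_original = r @ np.linalg.solve(S_e, r) + (x - x_a) @ np.linalg.solve(S_a, x - x_a)
cost_whitened = np.sum((y_w - K_w @ x_w) ** 2) + np.sum(x_w**2)
print(cost_original, cost_whitened)
```

The two printed costs agree to rounding error, which is what allows the residual norm and solution norm to be read off in the whitened variables.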

Problem diagnosis: discovered that under a non-diagonal S_a, the L-curve’s x/y values do not vary monotonically → structural limit: the L-curve corresponds to pure Tikhonov, and a non-diagonal S_a cannot be reduced to Tikhonov form; the “modified L-curve” results are oversensitive to the priors, so how well you know the priors directly influences the result.

Independent judgment — 2022-01-23 group-meeting verbatim (refusing methodological-engineering pressure):

Held to the methodology rather than tuning the priors to engineer a target λ.

Significance: a 21-year-old undergraduate made the self-disciplined call on the spot and declined to engineer a target λ: no opportunism, and an honest admission that “the method I myself proposed only works under Tikhonov-equivalent covariance”. Diagnosing the limitation of one’s own method is a rare mark of methodological maturity.

Corresponds to Moment 3 of Independent Judgment narrative (held until 2026-06-10).


Contribution (a): 90-paragraph mathematical exposition of the Hardened Balancing Principle (HBP)

Theoretical framework: starting from Bauer 2007 Definition 5.1’s (Tikhonov), derive:

  • minimum-difference regularization parameter:
  • balancing functional
  • exponential concentration → HBP λ_{n+} automatic parameter selection

Systematic stress test (4 benchmarks):

Benchmark | Condition number / data | HBP performance
Hansen gravity problem (500-instance) | ~2.88e28 vs Hansen-textbook reported ~1.54e5 | Robust
Golub 1979 Laplace | 2.88e28 (more ill-conditioned than Golub’s original 1.54e5) | Robust
3-point linear regression | Cross-validated against King-Fai Li’s Bayesian closed form | α=9.57, σ=0.023; analytical vs MCMC λ differ by 18% (OK)
DSCOVR real PC2 time series | The W where Kawahara Bayesian cost fails | Converges to a range close to Siteng’s λ≈1e-3

Conclusion: HBP requires no prior knowledge of noise level δ or tuning constant κ (BP needs both); most robust across benchmarks.

Output: the 90-paragraph mathematical-section writeup in the internal-only manuscript, a full derivation from Tikhonov to HBP, equivalent in scope to an independent technical note / tutorial.

Downstream impact: HBP was extended by a subsequent collaborator (a co-author) to PC1 (low clouds) / PC4 (high clouds) retrievals (different from Zhenyu’s PC2 = land/ocean), so the methodological influence reaches the collaborator’s work.

See the Hardened Balancing Principle (HBP) concept page.


Contribution (d): Direction of Noise Vector discovery (2021-09-20)

Original finding (2021-09-20 verbatim):

“When I did numerical tests on L-Curve, GCV etc., I found out the direction of the noise vector plays a big role in the performance of statistical criteria. I did a literature search and it seems nobody has studied this topic.”

Geometric decomposition: relative to the column space of the forward matrix, decompose the noise into perpendicular and parallel components; after decomposing the parallel component along the eigenvectors, the EV2-direction component is amplified.
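
The decomposition can be sketched with the SVD. This is an illustration under assumptions: the left singular vectors of a small made-up forward matrix `A` stand in for the EV1 / EV2 directions, and the noise vector `e` is arbitrary. In the unregularized least-squares solution, the noise component along each singular vector is divided by the corresponding singular value, so the direction with the smaller singular value is amplified more.

```python
import numpy as np

def decompose_noise(A, e):
    """Split e into parts parallel and perpendicular to the column space of A."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    coeffs = U.T @ e      # noise components along each left singular vector
    e_par = U @ coeffs    # projection onto the column space
    e_perp = e - e_par    # remainder, which the least-squares solution ignores
    return e_par, e_perp, coeffs, s, Vt

# Hypothetical 3 x 2 forward matrix with one weak direction, and arbitrary noise.
A = np.array([[1.0, 0.9],
              [0.9, 1.0],
              [0.5, 0.5]])
e = np.array([0.01, -0.02, 0.015])

e_par, e_perp, coeffs, s, Vt = decompose_noise(A, e)

# Error induced in the unregularized solution, assembled direction by direction.
solution_error = Vt.T @ (coeffs / s)
amplification = 1.0 / s

print("singular values:", s)
print("noise components along singular vectors:", coeffs)
print("amplification per direction:", amplification)
```

Only `e_par` reaches the solution, and how much damage it does depends on how it is distributed across the singular directions, not only on its length.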

4-method failure-mode catalog:

Method | Failure condition
L-curve | large, or near EV1
Kawahara Bayesian | large (5/5 fail for , negative component)
GCV | near EV1 and near EV2 (long flat cost region)
Discrepancy principle | Always works, but introduces systematic error (underestimates )
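
Two of the criteria in the table have compact SVD forms. The sketch below assumes standard-form Tikhonov regularization and a known noise norm `delta` for the discrepancy principle. The function names, the bisection bounds, the safety factor `tau`, and the toy problem are assumptions for illustration, not the project's implementation.

```python
import numpy as np

def svd_terms(A, b):
    U, s, _ = np.linalg.svd(A, full_matrices=False)
    beta = U.T @ b
    r_perp2 = max(float(b @ b - beta @ beta), 0.0)  # residual outside range(A)
    return s, beta, r_perp2

def residual_norm(lam, s, beta, r_perp2):
    w = lam**2 / (s**2 + lam**2)                    # 1 - filter factor
    return float(np.sqrt(np.sum((w * beta) ** 2) + r_perp2))

def discrepancy_lambda(A, b, delta, tau=1.0, lo=1e-12, hi=1e6, iters=200):
    """Bisection in log(lambda) for residual_norm(lambda) = tau * delta."""
    s, beta, r_perp2 = svd_terms(A, b)
    target = tau * delta
    r_lo = residual_norm(lo, s, beta, r_perp2)
    r_hi = residual_norm(hi, s, beta, r_perp2)
    if not (r_lo < target < r_hi):
        raise ValueError("tau * delta is not bracketed by the residual range")
    for _ in range(iters):
        mid = np.sqrt(lo * hi)
        if residual_norm(mid, s, beta, r_perp2) < target:
            lo = mid
        else:
            hi = mid
    return float(np.sqrt(lo * hi))

def gcv_lambda(A, b, lambdas):
    """Grid minimizer of the GCV function: residual^2 / trace(I - influence matrix)^2."""
    s, beta, r_perp2 = svd_terms(A, b)
    m = A.shape[0]
    scores = np.empty(len(lambdas))
    for k, lam in enumerate(lambdas):
        w = lam**2 / (s**2 + lam**2)
        trace_term = m - len(s) + np.sum(w)
        scores[k] = residual_norm(lam, s, beta, r_perp2) ** 2 / trace_term**2
    return float(lambdas[np.argmin(scores)]), scores

# Hypothetical 8 x 6 toy problem with a fixed noise pattern.
sigma = 10.0 ** -np.arange(6)
A = np.vstack([np.diag(sigma), np.zeros((2, 6))])
e = 1e-3 * np.array([1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0])
b = A @ np.ones(6) + e

delta = float(np.linalg.norm(e))
lam_dp = discrepancy_lambda(A, b, delta)
lam_gcv, gcv_scores = gcv_lambda(A, b, np.logspace(-6, 1, 141))
print("discrepancy lambda:", lam_dp, "GCV lambda:", lam_gcv)
```

The discrepancy principle needs the noise norm as an input while GCV does not, which is one reason the two can disagree on the same data.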

Source: incidental discovery → systematic discovery: when reproducing the Golub 1979 GCV paper (Zhenyu’s condition 2.88e28, more ill-conditioned than the paper’s 1.54e5), the 4th of 5 instances yielded a λ_GCV completely different from the other 4 → the Hansen gravity 500-instance independently rediscovered the same → two independent pieces of evidence converged → systematized as Contribution (d). Literature search confirmed the literature gap (“nobody has studied”) — signature of independent problem-formulation ability.

See Multi-Method Comparison Spine (2019-2025).


Key Achievements

  1. Hansen textbook bug catch (80× overestimate, formula-level error in a foundational classic) — undergraduate finds a formula bug in a SIAM 2010 classic + side-by-side inconvenient-evidence retention
  2. Pre-whitening L-curve innovation + 2022-01-23 refuse-methodological-engineering pressure verbatim — research-integrity signature
  3. HBP 90-paragraph mathematical exposition + 4-benchmark stress test + subsequent collaborator’s PC1/PC4 extension — methodology-influence chain
  4. Independent discovery of the Direction-of-Noise-Vector + literature-gap awareness — problem-formulation originality
  5. Cross-language fluency: within a single 11-month project, 5 programming languages:
     Language | Role
     IDL | OOP MCMC class (inheriting Yuk Yung lab convention; learned scripts from Jing Li’s PKU group)
     MATLAB | Hansen gravity replication + csvd + 5 nested LCurve_HansenMistake_WeRectify_Test subdirectories
     Python | Kawahara CUDA library bootstrap (remote workstation) + emcee + healpy + scipy
     Mathematica | Symbolic-curvature derivation (calculation_4_cal_deriva.nb + calculation_5_cal_deriva.nb)
     R | Metropolis-Hastings baseline for cross-validation
  6. 29 group_meeting pptx sustained trajectory (Mar 2021 → Mar 2022, ~1 PPT / 2 weeks): undergraduate, ~12 months, single project, 29 formal group-meeting reports
  7. Self-study rigor: full reproduction of three classics (Hansen Discrete Inverse Problems (SIAM 2010) + Bishop PRML + Golub 1979 original GCV paper), with numerical experiments replicated one by one

Technical Depth (resume-grade specifics)

  • Self-study, full reproduction of 3 classics: Hansen Discrete Inverse Problems + Bishop PRML + Golub 1979 GCV paper. Not just reading, but replicating the Fig 5.9 gravity 500-instance histogram + Laplace condition 2.88e28 stress test + Ch 5.6 shaw test problem, obtaining histograms that nearly overlap with the originals
  • ~3,000+ lines of code + study notes on the repo (parallel to the 4 Caltech contributions; pre-finalization commit 2021-09-13; subsequent research-wiki files no longer in this repo but conceptually unified)
  • remote workstation bootstrap: Caltech Anaconda + CUDA 11.2 + Kawahara library (CUDA library not supported under Windows → Zhenyu manually diagnosed + driver setup); 2021-08-09 group meeting “Finished Setting up Environments for Running codes on (remote workstation)” — non-trivial devops skill
  • 5 nested stress-test subdirectories (LCurve_HansenMistake_WeRectify_Test/): test 1 Hansen formula failed at upper bound on toy / test 2 student formula success on toy / test 3 PC2 / test 4 Sa, Se whitening with own estimate / test 5 Sa, Se whitening with Kawahara posterior — each directory contains Hansen original + student-correct L-curve + curvature-vs-λ figure for visual diff
  • Mathematica symbolic ↔ MCMC cross-validation: 3-point regression closed-form (α=9.57, σ=0.023, λ=5.73e-5) vs MCMC (λ=6.76e-5) differ by 18% — double-check discipline
  • 5×4 failure-mode matrix (5 geometries × 4 regularization methods): [1,1] / [3,1] / [5,1] / [7,1] / [10,1] — for each case, give the best λ + retrieved x for L-curve / Kawahara / GCV / Discrepancy, structurally identifying the geometric condition under which each method fails — not “4 methods in parallel” but “4 methods’ capability diagnosis under 5 geometries”

Academic / Career Significance

  • Moment 4 (Caltech → Berkeley cross-paradigm switch, Aug 2022): gave up 11 months of IDL / MATLAB / Python / Mathematica / R tooling + 4 original-contribution momentum, switched paradigm to David Romps DAM LES Fortran (single-point remote sensing → 3-D LES moist convection); an external factor played a key part, but the core is Zhenyu’s proactive choice to switch application direction (exoplanet habitability → nuclear-winter policy); see Independent Judgment narrative (held until 2026-06-10) Moment 4
  • Moment 3 (methodological restraint, 2022-01-23): when DSCOVR pre-whitening verification failed, the undergraduate made the self-disciplined call on the spot and declined to engineer a target λ; see Independent Judgment narrative (held until 2026-06-10) Moment 3
  • 3-track parallel Sep-Dec 2021 senior fall: Caltech DSCOVR HBP maturation + Walker paper v3→v6 drafting (see Walker Circulation Dynamics @ PKU) + PKU Aerosol thesis drafting (see Aerosol Joint Retrieval Senior Thesis @ PKU) + nwp_hw 4 assignments ≈ 3-4 deliverable/week sustained intensity


Sources

  • research_wiki/projects/mcmc_retrieval/overview.md (~35k words, full 1329 lines)
  • research_wiki/resume_angles/problem_solving.md §Project 3 DSCOVR
  • research_wiki/resume_angles/independent_judgment.md §Moments 3 + 4
  • raw/Caltech_task/ 1.5 GB folder (1,364 files / 126 dirs, 29 group_meeting pptx + internal-only manuscript + 10 technical docx + LCurve_HansenMistake_WeRectify_Test/ subdirectories)