
DSCOVR Inverse Problem + Regularization Methods @ Caltech / UCR

Sole author


Basic information

  • Period: 2021.03 to 2022.03 (~11 months of active work, including the most concentrated 4-month window of original contributions)
  • Role: Undergraduate remote research
  • Organizations: Caltech (Division of Geological and Planetary Sciences) + UC Riverside (Environmental Sciences)
  • Advisors: Yuk Yung (primary PI, Caltech) + King-Fai Li (co-supervisor, UCR)
  • Paper draft corresponding author: (LMD Paris, former Yuk Yung postdoc)
  • Location: remote (Zhenyu was at Peking University; collaboration with the Caltech / UCR team was remote)
  • Entry point: via Li Chong (李充; senior labmate at Peking University, Yuk Yung co-author on the Li C 2020 RS paper)

Core problem

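The DSCOVR single-point Earth light-curve inverse problem: from the single-point Earth reflectance time series observed by the DSCOVR/EPIC satellite at the Sun-Earth L1 Lagrange point, retrieve Earth's land-ocean distribution (land-ocean map retrieval), as a “validation exoplanet” benchmark for future exoplanet characterization. Methodologically, the project extends the traditional Rodgers retrieval with MCMC and systematically studies regularization-parameter selection for an ill-conditioned inverse problem (large condition number; the Picard plot of the DSCOVR matrix decays much faster than that of the Kawahara toy model).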

Paper draft (unpublished): “An automatic pipeline for retrieval from DSCOVR data”, co-authors (anonymized in public version), Zhenyu He¹, * (corresponding), King-Fai Li³, Yuk L. Yung⁴,⁵, Yongyun Hu¹ (multiple institutions). The 90-paragraph HBP mathematical section was written primarily by Zhenyu.


🌟 Zhenyu's 4 original contributions (following the 4-contribution narrative of proposal Example 6)

🌟 Contribution (c): Hansen σ² vs σ⁴ textbook bug catch (2022-01-08, 80× deviation)

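Core achievement: as an undergraduate, using an independent SVD + filter-factor derivation, found a formula-level bug in Hansen's classic Discrete Inverse Problems (SIAM 2010), a foundational textbook. It is not a typo but a substantive error that changes the numerical results by ~80×.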

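Numerical comparison (same toy model + PC2 data):

| Formula | Toy model λ | PC2 λ | Accuracy |
| --- | --- | --- | --- |
| Hansen's student's corrected version (σ⁴) | 0.0251 | 0.0575 | Agrees with the naked-eye elbow and with the theoretical value under axis image |
| Hansen textbook version with the error (σ²) | 1.995 | 1.995 | ~80× overestimate on the toy model (1.995 / 0.0251); fails completely (λ is stuck at the upper bound, independent of the data) |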

Novel finding: independently identified the L-curve axis-scale ambiguity that Hansen does not discuss (found 2022-01-08):

“Hansen didn’t talk about whether we should control the units of x-axis and y-axis of L-curve to be the same. In his book, he showed two L-curve figures, but both didn’t control the units to be the same.”

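3-step self-correction cycle (11/29 → 12/06 → 01/08, entire trajectory preserved):

  1. 11/29: found that his own formula was wrong (group-meeting slide 4)
  2. 12/06: hypothesized “accumulation of errors”, which was a misattribution (slide 6)
  3. 01/08: refuted that hypothesis and correctly attributed the mismatch to axis-scale ambiguity. Slides 7-10, verbatim: “What if our naked eyes are wrong, but the calculated curvature is right?” (5-cell hand-calculation demo; once MATLAB axis image was used to force equal units, the curvature elbow finally matched the visual one)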
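The geometric fact behind step 3 can be checked on a toy curve. The following is a minimal sketch, not the slides' 5-cell hand calculation: it assumes a hyperbola x·y = 1 as a stand-in for an L-shaped curve (a hypothetical choice) and uses finite differences for the curvature. It shows that the point of maximum curvature moves when one axis is stretched, which is why an elbow judged by eye on unequal axes can disagree with a curvature computed in data units.

```python
import numpy as np

def max_curvature_param(x, y, t):
    """Return the parameter value where the planar curve (x(t), y(t)) bends most."""
    dx, dy = np.gradient(x, t), np.gradient(y, t)
    ddx, ddy = np.gradient(dx, t), np.gradient(dy, t)
    # Unsigned curvature of a parametric planar curve
    kappa = np.abs(dx * ddy - dy * ddx) / (dx**2 + dy**2) ** 1.5
    return t[np.argmax(kappa)]

# Toy curve (assumption): hyperbola x*y = 1, parametrized by t
t = np.logspace(-2, 1, 4001)
x, y = t, 1.0 / t

# Equal units: by symmetry the elbow sits at t = 1
t_equal = max_curvature_param(x, y, t)

# Same curve with the x-axis stretched 100 times: analytically the elbow moves to t = 0.1
t_stretched = max_curvature_param(100.0 * x, y, t)

print(t_equal, t_stretched)
```

For this curve the elbow can be derived by hand: stretching x by a factor s moves the maximum-curvature point from t = 1 to t = 1/√s, so the "corner" is a property of the chosen axis scaling as well as of the curve.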

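Physical-evidence discipline: the LCurve_HansenMistake_WeRectify_Test/ directory keeps both:

  • Hansen自己文章_仍然mu二次方.pdf (Hansen's original, σ² in the denominator)
  • hansen学生mu四次方.pdf (Hansen's student's corrected version, σ⁴ in the denominator)

Keeping the two side by side is treated as a reproducible numerical-science best practice; see Evidence Preservation Discipline (private companion).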

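Independent mathematical verification: Deductions of the curvature of log-log Lcurve.docx, a 36-paragraph symbolic derivation that starts from the SVD + filter factors and works out the second derivatives needed for the log-log L-curve curvature. The key step: “The denominator should have σ_i^4 instead of σ_i^2”.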
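An analytic curvature formula of this kind can be cross-checked numerically. The sketch below is an illustration under stated assumptions, not the derivation in the .docx: it assumes the Tikhonov convention min ‖Ax − b‖² + λ²‖x‖², builds a hypothetical Gaussian-blur test problem, evaluates the L-curve from SVD filter factors, and replaces the analytic derivatives with finite differences in log λ. Because no closed-form derivative is used, it is independent of the σ² vs σ⁴ question.

```python
import numpy as np

def tikhonov_svd(A, b, lam):
    """Tikhonov solution of min ||Ax-b||^2 + lam^2 ||x||^2 via SVD filter factors."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    coeffs = s * (U.T @ b) / (s**2 + lam**2)   # f_i * beta_i / sigma_i
    return Vt.T @ coeffs

def lcurve_corner(A, b, lambdas):
    """Corner of the log-log L-curve, located by finite-difference curvature."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    beta = U.T @ b
    outside = max(b @ b - beta @ beta, 0.0)     # part of b outside range(A)
    lam2 = lambdas[:, None] ** 2
    res2 = np.sum((lam2 * beta / (s**2 + lam2)) ** 2, axis=1) + outside
    sol2 = np.sum((s * beta / (s**2 + lam2)) ** 2, axis=1)
    rho, eta = 0.5 * np.log(res2), 0.5 * np.log(sol2)   # log norms
    t = np.log(lambdas)
    d_rho, d_eta = np.gradient(rho, t), np.gradient(eta, t)
    dd_rho, dd_eta = np.gradient(d_rho, t), np.gradient(d_eta, t)
    # Signed curvature; with increasing lambda the corner is a positive (left) turn
    kappa = (d_rho * dd_eta - d_eta * dd_rho) / (d_rho**2 + d_eta**2) ** 1.5
    return lambdas[np.nanargmax(kappa)], rho, eta, kappa

# Hypothetical test problem: row-normalized Gaussian blur of a smooth signal
rng = np.random.default_rng(0)
n = 32
idx = np.arange(n)
A = np.exp(-0.5 * ((idx[:, None] - idx[None, :]) / 2.0) ** 2)
A /= A.sum(axis=1, keepdims=True)
x_true = np.sin(np.pi * idx / (n - 1))
b = A @ x_true + 1e-3 * rng.standard_normal(n)

lambdas = np.logspace(-6, 1, 400)
lam_corner, rho, eta, kappa = lcurve_corner(A, b, lambdas)
print("lambda at maximum curvature:", lam_corner)
```

Comparing the λ returned here with the λ from a closed-form curvature expression on the same data is one way to see whether a candidate formula reproduces the finite-difference corner.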

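Community contribution (3 open questions): (1) Should the L-curve axes be constrained to equal units? (2) Is a senior collaborator's elbow conclusion of λ=1e-3 a scale artifact? (3) When applying the L-curve, under what circumstances should the axes be adjusted, and how?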

See the L-Curve-Axis-Scale-Ambiguity concept page for details.


🌟 Contribution (b): Pre-whitening L-curve innovation + methodological restraint (2022-01-23)

Technical innovation: under an RBF kernel (non-diagonal prior covariance), transform the problem to whitened space and plot the L-curve quantities there, giving a “modified L-curve”.
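The change of variables can be sketched as follows. This is a minimal illustration under assumptions that are not taken from the source: Rodgers-style names S_a (prior covariance) and S_e (noise covariance), a zero prior mean, Cholesky factors as the whitening operators, and a random hypothetical forward matrix K. It only checks that, for a fixed prior, the whitened problem has the same minimizer as the original one; it says nothing about whether the resulting curve has a usable corner.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 15, 10
K = rng.standard_normal((m, n))      # hypothetical forward matrix
y = rng.standard_normal(m)           # hypothetical data

# Assumed non-diagonal prior covariance: RBF kernel on an integer grid, plus jitter
idx = np.arange(n)
S_a = np.exp(-0.5 * (idx[:, None] - idx[None, :]) ** 2) + 1e-8 * np.eye(n)
S_e = 0.1**2 * np.eye(m)             # assumed white measurement noise

L_a = np.linalg.cholesky(S_a)        # S_a = L_a L_a^T
L_e = np.linalg.cholesky(S_e)        # S_e = L_e L_e^T

# Whitened quantities: x = L_a x_w, y_w = L_e^{-1} y, K_w = L_e^{-1} K L_a
K_w = np.linalg.solve(L_e, K @ L_a)
y_w = np.linalg.solve(L_e, y)

# In whitened coordinates the cost is ||K_w x_w - y_w||^2 + ||x_w||^2
x_w = np.linalg.solve(K_w.T @ K_w + np.eye(n), K_w.T @ y_w)
x_from_whitened = L_a @ x_w

# Direct minimizer of (y-Kx)^T S_e^{-1} (y-Kx) + x^T S_a^{-1} x
S_a_inv, S_e_inv = np.linalg.inv(S_a), np.linalg.inv(S_e)
x_direct = np.linalg.solve(K.T @ S_e_inv @ K + S_a_inv, K.T @ S_e_inv @ y)

# Candidate axes of a modified L-curve, evaluated at this one solution
print(np.linalg.norm(K_w @ x_w - y_w), np.linalg.norm(x_w))
print(np.allclose(x_from_whitened, x_direct))
```

Every quantity on the whitened axes depends on L_a and L_e, so any curve drawn in this space inherits whatever uncertainty there is in the assumed covariances.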

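Problem diagnosis: with a non-diagonal covariance, the L-curve x/y values do not vary monotonically → structural limit: the L-curve corresponds to pure Tikhonov, and a non-diagonal covariance cannot be reduced to Tikhonov form. The “modified L-curve” result is overly sensitive to the priors, so how well the priors are known directly affects the result.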

Independent judgment, 2022-01-23 group meeting verbatim (methodological restraint):

Held to the methodology rather than tuning the priors to engineer a target λ.

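Significance: as a 21-year-old undergraduate, declined on the spot the advisor's implicit preference (tuning to hit the paper benchmark of λ=1e-3). No shortcuts, and an honest admission that “the method I proposed myself only works under a Tikhonov-equivalent covariance”. Diagnosing the limitation of one's own method is a rare discipline of methodological maturity.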

Corresponds to Moment 3 of the Independent Judgment narrative (held until 2026-06-10).


Contribution (a): Hardened Balancing Principle (HBP), 90-paragraph mathematical development

Theoretical framework: starting from the (Tikhonov) definition in Bauer 2007, Definition 5.1, derive:

  • minimum-difference regularization parameter
  • balancing functional
  • exponential concentration → HBP λ_{n+} automatic parameter selection

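Systematic stress test (4 benchmarks):

| Benchmark | Condition number / data | HBP performance |
| --- | --- | --- |
| Hansen gravity problem (500-instance) | ~2.88e28 vs ~1.54e5 reported in Hansen's book | Robust |
| Golub 1979 Laplace | 2.88e28 (more ill-conditioned than the 1.54e5 of Golub's original paper) | Robust |
| 3-point linear regression | Cross-validated against King-Fai Li's Bayesian closed-form solution | α=9.57, σ=0.023; analytic λ vs MCMC λ differ by 18% ✓ |
| DSCOVR real PC2 time series | The W for which the Kawahara Bayesian cost fails | Converges to approximately the range of Siteng's λ≈1e-3 |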

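Conclusion: HBP requires neither prior knowledge of the noise level δ nor the tuning constant κ (the plain balancing principle, BP, needs both), and it is the most robust method across the benchmarks.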

Output: a 90-paragraph mathematical section in the internal-only manuscript, covering the full derivation from Tikhonov to HBP; comparable in scale to a standalone technical note or tutorial.

Downstream impact: HBP was extended by a subsequent collaborator (a co-author) to PC1 (low clouds) / PC4 (high clouds) retrieval (distinct from Zhenyu's PC2 = land/ocean), so the methodological influence carries into collaborators' work.

See the Hardened-Balancing-Principle concept page for details.


Contribution (d): Direction of Noise Vector discovery (2021-09-20)

Original finding (2021-09-20 verbatim):

“When I did numerical tests on L-Curve, GCV etc., I found out the direction of the noise vector plays a big role in the performance of statistical criteria. I did a literature search and it seems nobody has studied this topic.”

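Geometric decomposition: relative to the column space of the forward matrix, the noise vector splits into a perpendicular and a parallel component; when the parallel component is further decomposed along the eigenvectors, its component in the EV2 direction is amplified.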
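The decomposition can be illustrated on a tiny least-squares problem. This is a sketch under assumptions not stated in the source: a hypothetical 3×2 line-fit design matrix, and the left singular vectors u₁, u₂ of that matrix standing in for the directions called EV1 and EV2. It checks two facts: noise perpendicular to the column space leaves the least-squares solution unchanged, and unit noise along uᵢ perturbs the solution by 1/σᵢ, so the second direction is amplified more.

```python
import numpy as np

# Hypothetical design matrix for a 3-point straight-line fit (intercept, slope)
A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
U, s, Vt = np.linalg.svd(A, full_matrices=False)   # columns of U span range(A)

rng = np.random.default_rng(2)
e = rng.standard_normal(3)                         # a noise vector in data space

# Split the noise into parts parallel and perpendicular to the column space
e_par = U @ (U.T @ e)
e_perp = e - e_par

A_pinv = np.linalg.pinv(A)

# Perpendicular noise does not move the least-squares solution
shift_from_perp = A_pinv @ e_perp

# Unit noise along u_i moves the solution by exactly 1/sigma_i
gains = np.array([np.linalg.norm(A_pinv @ U[:, i]) for i in range(2)])

print("solution shift from perpendicular noise:", np.linalg.norm(shift_from_perp))
print("amplification along u1, u2:", gains, "expected:", 1.0 / s)
```

With regularization the 1/σᵢ gains are damped by the filter factors, but the ordering of the directions stays the same, which is one reason the orientation of a given noise realization can matter for parameter-choice rules.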

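4-method failure-mode catalog:

| Method | Failure condition |
| --- | --- |
| L-curve | […] large, or […] near EV1 |
| Kawahara Bayesian | […] large (5/5 fail for […], negative component) |
| GCV | […] near EV1 and […] near EV2 (long flat cost region) |
| Discrepancy principle | Always works, but introduces a systematic error (underestimates […]) |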
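For reference, two of the rules in the catalog can be written in a few lines. This is a minimal sketch, not the project's MATLAB or IDL code: it assumes the Tikhonov convention min ‖Ax − b‖² + λ²‖x‖², a hypothetical square Gaussian-blur test problem, and, for the discrepancy principle, that the true noise norm δ is known exactly (available here only because the data are synthetic).

```python
import numpy as np

# Hypothetical test problem: row-normalized Gaussian blur of a smooth signal
rng = np.random.default_rng(0)
n = 32
idx = np.arange(n)
A = np.exp(-0.5 * ((idx[:, None] - idx[None, :]) / 2.0) ** 2)
A /= A.sum(axis=1, keepdims=True)
x_true = np.sin(np.pi * idx / (n - 1))
noise = 1e-3 * rng.standard_normal(n)
b = A @ x_true + noise

U, s, Vt = np.linalg.svd(A)
beta = U.T @ b

def residual_norm(lam):
    """||A x_lam - b|| from filter factors (A is square, so no out-of-range part)."""
    return np.linalg.norm(lam**2 * beta / (s**2 + lam**2))

def discrepancy_lambda(delta, lo=1e-12, hi=1e6, iters=200):
    """Bisection in log(lambda) for ||A x_lam - b|| = delta.

    Relies on the residual norm increasing monotonically with lambda.
    """
    for _ in range(iters):
        mid = np.sqrt(lo * hi)
        if residual_norm(mid) < delta:
            lo = mid
        else:
            hi = mid
    return np.sqrt(lo * hi)

def gcv_score(lam):
    """GCV function: squared residual over squared effective residual dof."""
    f = s**2 / (s**2 + lam**2)
    return residual_norm(lam) ** 2 / (n - np.sum(f)) ** 2

delta = np.linalg.norm(noise)            # assumption: noise level known exactly
lam_dp = discrepancy_lambda(delta)

lambdas = np.logspace(-6, 1, 400)
scores = np.array([gcv_score(lam) for lam in lambdas])
lam_gcv = lambdas[np.argmin(scores)]

print("discrepancy principle:", lam_dp, "GCV:", lam_gcv)
```

Rerunning the last lines with different seeds, or with the noise vector rotated toward a chosen singular direction, is one way to probe how strongly each rule depends on the noise realization.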

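From a chance observation to a systematic discovery: while reproducing the Golub 1979 GCV paper (Zhenyu's condition number 2.88e28, more ill-conditioned than the paper's 1.54e5), the λ_GCV of the 4th of 5 instances was completely different from the other 4 → the Hansen gravity 500-instance test independently showed the same effect → the two independent lines of evidence converged → systematized as contribution (d). The literature search that confirmed the gap (“nobody has studied”) is a signature of independent problem-formulation ability.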

See Multi-Method-Comparison-Spine for details.


Key achievements

  1. Hansen textbook bug catch (80× overestimate, a formula-level error in a foundational classic): an undergraduate found a formula bug in a SIAM 2010 classic, and the inconvenient evidence is preserved side by side
  2. Pre-whitening L-curve innovation + 2022-01-23 methodological restraint verbatim: a research-integrity signature
  3. HBP 90-paragraph mathematical development + 4-benchmark stress test + a subsequent collaborator's PC1/PC4 extension: a chain of methodological influence
  4. Direction of Noise Vector independent discovery + literature-gap awareness: originality in problem formulation
  5. Cross-language fluency: 5 programming languages within a single 11-month project:

| Language | Task |
| --- | --- |
| IDL | OOP MCMC class (following the Yuk Yung lab convention; scripts learned from Li Jing's (李婧) PKU group) |
| MATLAB | Hansen gravity problem replication + csvd + 5 nested subdirectories of LCurve_HansenMistake_WeRectify_Test |
| Python | Kawahara CUDA library bootstrap (remote workstation) + emcee + healpy + scipy |
| Mathematica | Symbolic curvature derivation (calculation_4_cal_deriva.nb + calculation_5_cal_deriva.nb) |
| R | Metropolis-Hastings baseline for cross-validation |

  6. 29 group_meeting pptx sustained trajectory (Mar 2021 → Mar 2022, ~1 PPT every 2 weeks): 29 formal group-meeting presentations by an undergraduate on a single ~12-month project
  7. Self-study rigor: all three classics reproduced: Hansen Discrete Inverse Problems (SIAM 2010) + Bishop PRML + the original Golub 1979 GCV paper, with the numerical experiments replicated one by one

Technical depth (resume-grade specifics)

  • Self-study with all three texts reproduced: Hansen Discrete Inverse Problems + Bishop PRML + Golub 1979 GCV paper. This was replication, not just reading: the Fig 5.9 gravity 500-instance histogram + the Laplace stress test at condition number 2.88e28 + the Ch 5.6 shaw test problem, yielding histograms that nearly coincide with the original figures
  • 3,000+ lines of code + study notes on repo (in parallel with the 4 Caltech contributions; pre-finalization commit 2021-09-13; later research-wiki files are no longer in this repo but are conceptually consistent)
  • Remote workstation bootstrap: Caltech Anaconda + CUDA 11.2 + Kawahara library (the CUDA library is not supported on Windows → Zhenyu diagnosed the problem manually and set up the driver); 2021-08-09 group meeting: “Finished Setting up Environments for Running codes on (remote workstation)”. A non-trivial devops skill
  • 5 nested stress-test subdirectories (LCurve_HansenMistake_WeRectify_Test/): 测试1 Hansen公式_失败于上界_toy / 测试2 学生公式_成功_toy / 测试3 PC2 / 测试4 SaSe白化_自己估计 / 测试5 SaSe白化_Kawahara后验 (test 1: Hansen formula, fails at the upper bound, toy / test 2: student formula, succeeds, toy / test 3: PC2 / test 4: SaSe whitening, own estimate / test 5: SaSe whitening, Kawahara posterior). Each directory contains the Hansen original + the student's corrected L-curve + a curvature-vs-λ figure, for easy visual diffing
  • Mathematica symbolic ↔ MCMC cross-validation: 3-point regression closed-form solution (α=9.57, σ=0.023, λ=5.73e-5) vs MCMC (λ=6.76e-5), an 18% difference; a double-check discipline
  • 5×4 failure-mode matrix (5 geometries × 4 regularization methods): for each case [1,1] / [3,1] / [5,1] / [7,1] / [10,1], L-curve / Kawahara / GCV / Discrepancy each give a best λ + retrieved x, structurally identifying the geometric condition under which each method fails. This is not “4 methods side by side” but “a capability diagnosis of 4 methods under 5 geometries”

Academic / career significance

  • Moment 4 (Caltech → Berkeley cross-paradigm switch, 2022-08): gave up 11 months of IDL / MATLAB / Python / Mathematica / R tooling + the momentum of 4 original contributions to cross paradigms to David Romps's DAM LES in Fortran (single-point remote sensing → 3-D LES moist convection). An external factor was key, but at the core Zhenyu actively chose to switch application direction (exoplanet habitability → nuclear winter policy). See Moment 4 of the Independent Judgment narrative (held until 2026-06-10)
  • Moment 3 (methodological restraint, 2022-01-23): when the DSCOVR pre-whitening validation failed, the undergraduate made a self-discipline call on the spot (declined to engineer the target λ). See Moment 3 of the Independent Judgment narrative (held until 2026-06-10)
  • 3 parallel tracks, Sep-Dec 2021 (first semester of the senior undergraduate year): Caltech DSCOVR HBP maturing + Walker paper v3→v6 drafting (see PKU-Walker-Circulation-Research) + PKU Aerosol thesis drafting (see PKU-Aerosol-Senior-Thesis) + 4 nwp_hw assignments ≈ 3-4 deliverables/week sustained intensity

Sources

  • research_wiki/projects/mcmc_retrieval/overview.md (~35k words, 1329 lines in total)
  • research_wiki/resume_angles/problem_solving.md §Project 3 DSCOVR
  • research_wiki/resume_angles/independent_judgment.md §Moment 3 + 4
  • raw/Caltech_task/ 1.5 GB folder (1,364 files / 126 dirs, 29 group_meeting pptx + internal-only manuscript + 10 technical docx + the LCurve_HansenMistake_WeRectify_Test/ subdirectory)