Convert National Lab Datasets to Evals
Research Log
We are building XRD-Bench: a high-quality AI evaluation benchmark derived from U.S. national laboratory X-ray datasets (SIMPOD, opXRD, LLNL CT, Materials Project XAS). The goal is to test whether models and agents can perform real materials science reasoning on powder X-ray diffraction data — from basic pattern perception through full characterization. We model our eval after MaCBench, ScienceAgentBench, MatSciBench, ChemBench, and GPQA, borrowing their best structural and methodological patterns. All eval data must trace back to real national lab datasets — no synthetic or made-up data.
Sessions
- Session 1 (2026-03-20): Built a 14-structure SIMPOD dataset covering all 7 crystal systems; created the Pillar 1 Perception question generator (6 question types, 83 questions total) and a scoring module with answer extraction; found and fixed a downsampled-diffractogram resolution bug. Ready for baseline model eval (question generation is sketched after this list). Full log
- Session 2 (2026-03-21): Ran the first baseline eval (Claude Sonnet 4): v1 scored 90.4% (75/83); trace analysis revealed 3 token-truncation bugs, 3 question-quality issues, and 1 ground-truth error; applied 5 targeted fixes (`max_tokens`, P1.2 prompt, P1.3 trigonal/hexagonal `accept_also`, P1.5 rewording, P1.6 margin check); v2 scored 98.8% (82/83), with 1 genuine model error remaining (a boundary numerical comparison; see the answer-extraction sketch after this list). Full log
- Session 3 (2026-03-22): Cross-model comparison: Claude 3 Haiku scored 74.7% vs Sonnet 4's 98.8% (a 24.1pp gap), validating Pillar 1's discriminative power; the gap is driven entirely by counting tasks (P1.2: -86pp, P1.6: -57pp), while the 4 lookup/comparison types sit at a 100% ceiling for both models; Haiku systematically undercounts table rows (mean ratio 0.45×) because it omits chain-of-thought reasoning (per-type gap computation sketched after this list). Full log
- Session 4 (2026-03-23): Added 4 harder question variants (P1.1b Nth-ranked peak, P1.3b crystal system without space group, P1.4b close-intensity comparison, P1.5b near-boundary presence), expanding v3 to 137 questions across 10 types; Haiku accuracy on the new types: P1.1b 38%, P1.4b 62%, P1.5b 86%, P1.3b 93%; discriminative question coverage roughly doubled from 33% to 69% while Sonnet holds 99.3% overall. Full log
- Session 5 (2026-03-24): Expanded the structure set from 14 to 34 (20 real minerals via pymatgen, balanced at 4-5 per crystal system), generating v4 with 335 questions across 10 types; Haiku scored 70.1% (235/335) on v4, stable vs v3's 71.5%, confirming the P1.2 peak-count cliff: 0% accuracy on tables with 4+ peaks. Sonnet v4 results pending. Full log
- Session 6 (2026-03-25): Completed v4 analysis: Sonnet 98.2% (329/335) vs Haiku 70.1% (235/335), a 28.1pp gap consistent with v3; established that the P1.2 failure is token-driven (Haiku 0% on tables with 4+ peaks, Sonnet 88%), not a question-design artifact; counting tasks (62-82pp gaps) remain the dominant discriminator; ✓ Pillar 1 approved for publication. Full log
- Session 7 (2026-03-26): Verified Sonnet v4 results (329/335 = 98.2%; all 6 errors are boundary cases); created `PILLAR1_FINAL_SIGNOFF.md` with a full validation checklist and psychometric analysis; Pillar 1 is ready for external review and publication; began Pillar 2 architecture design. Full log
- Session 8 (2026-03-27): Groundwork for Pillar 2: verified SIMPOD metadata completeness (all 34 structures have space groups and lattice parameters); researched the opXRD dataset (92k diffractograms, 900+ with full structural labels, public under a CC-BY license); confirmed Phase 1 data is ready (P2.1 space group, P2.2 lattice parameter, P2.8 code task); created `PILLAR2_IMPLEMENTATION_READINESS.md` with a detailed data assessment, a timeline (Sessions 9–13), and a mitigation plan for minor gaps; ✓ Ready for Phase 1 implementation pending user sign-off on the Pillar 2 architecture. Full log
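For reference, a minimal sketch of the Pillar 1 question-generation step from Sessions 1 and 5, using pymatgen's `XRDCalculator`. The question wording, answer schema, and tolerance here are illustrative assumptions, not the benchmark's actual spec.

```python
# Sketch: generate a P1.1-style "highest peak" question from a pymatgen Structure.
from pymatgen.core import Structure
from pymatgen.analysis.diffraction.xrd import XRDCalculator

def make_highest_peak_question(structure: Structure) -> dict:
    calc = XRDCalculator(wavelength="CuKa")  # Cu K-alpha, the usual powder-XRD source
    pattern = calc.get_pattern(structure, two_theta_range=(10, 80))
    # pattern.x is 2-theta in degrees, pattern.y is relative intensity (0-100).
    peak_angle, _ = max(zip(pattern.x, pattern.y), key=lambda p: p[1])
    return {
        "question": ("At what 2-theta angle (degrees) is the most intense "
                     "reflection in this diffractogram? One decimal place."),
        "answer": round(float(peak_angle), 1),
        "tolerance": 0.1,  # accept small rounding differences
    }
```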
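The answer-extraction and boundary-comparison fixes from Session 2 suggest a scorer along these lines. This is a sketch: the regex, the tolerance default, and the "last number wins" convention are assumptions about the harness, not its documented behavior.

```python
import re

def extract_numeric_answer(text: str) -> float | None:
    """Take the last number in a response as the final answer (sketch;
    the real harness may require an explicit final-answer tag instead)."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", text)
    return float(matches[-1]) if matches else None

def score_numeric(response: str, truth: float, tol: float = 0.1) -> bool:
    # Tolerance-based comparison avoids penalizing rounding at boundaries,
    # the failure mode behind Session 2's one remaining genuine error.
    pred = extract_numeric_answer(response)
    return pred is not None and abs(pred - truth) <= tol
```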
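The per-type gap analysis in Sessions 3 and 6 reduces to a small aggregation; a sketch assuming each graded result is a `(question_type, is_correct)` pair (the record shape is hypothetical).

```python
from collections import defaultdict

def per_type_accuracy(results):
    """results: iterable of (question_type, is_correct) pairs (assumed shape)."""
    tally = defaultdict(lambda: [0, 0])  # qtype -> [correct, total]
    for qtype, ok in results:
        tally[qtype][0] += int(ok)
        tally[qtype][1] += 1
    return {t: correct / total for t, (correct, total) in tally.items()}

def gap_pp(strong_results, weak_results):
    """Strong-minus-weak accuracy gap per question type, in percentage points."""
    strong, weak = per_type_accuracy(strong_results), per_type_accuracy(weak_results)
    return {t: (strong[t] - weak.get(t, 0.0)) * 100 for t in strong}
```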
Project totals
- Runtime: 2h 5m
- Cost: $18.33
- Files: 195
- Sessions: 9
- Models: haiku-4-5, opus-4-6