Operationalizing Expert Preferences for Model and Agent Evals