Langfuse Evaluation: Understanding the Scores Data Model
- Philip Moses
How do you measure the performance of an AI model in a structured, reliable way? That’s exactly where Langfuse comes in. Its Scores Data Model is the backbone of LLM evaluation, making it possible to capture, compare, and standardize results across interactions, sessions, and datasets.

👉 In this blog, we’ll break down:
What Scores are in Langfuse and how they work
How Score Configs help standardize and scale evaluations
By the end, you’ll know exactly how to structure evaluations in Langfuse so they are consistent, comparable, and future-proof.
What are Scores in Langfuse?
In Langfuse, Scores are the building blocks of evaluation. They store the results of evaluation methods, making it possible to measure the performance of traces, sessions, or dataset runs.
Think of a Score as the output of an evaluation method — whether numeric, categorical, or boolean — tied to a specific object in Langfuse.
How Scores Work
A Score can reference one of four core objects:
Trace → Used for evaluating a single interaction (most common).
Observation → Used for evaluating a specific step within a trace (e.g., an LLM call).
Session → Used for evaluating outcomes across multiple interactions.
Dataset Run → Used for assessing the performance of a dataset run.
👉 Importantly, each score is linked to exactly one of these objects, as the sketch below shows.
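For example, with the Langfuse Python SDK you can attach a score to a whole trace or to a single observation inside it. This is a minimal sketch: the score() call shown here follows the v2-style SDK, and all IDs are placeholders.

```python
from langfuse import Langfuse

# Reads LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY and LANGFUSE_HOST from the environment.
langfuse = Langfuse()

# Score a whole interaction (trace-level) -- the most common case.
langfuse.score(
    trace_id="trace-123",            # placeholder trace ID
    name="user_feedback",
    value=1,                         # e.g. a thumbs-up from the user
    comment="User confirmed the answer was helpful",
)

# Score one step inside the trace (observation-level), e.g. a single LLM call.
langfuse.score(
    trace_id="trace-123",
    observation_id="obs-456",        # placeholder observation ID
    name="hallucination_eval",
    value=0,
    comment="No unsupported claims found",
)
```

Each call creates exactly one score, attached to exactly one object: the first to the trace, the second to the observation within it.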
Anatomy of a Score Object
A Score object has multiple attributes to ensure clarity, traceability, and flexibility.
| Attribute | Type | Description |
| --- | --- | --- |
| name | string | Name of the score (e.g., user_feedback, hallucination_eval). |
| value | number | Numeric value (used for numeric and boolean scores; optional for categorical). |
| stringValue | string | String representation (commonly used for categorical and boolean data). |
| traceId | string | The trace this score relates to. |
| observationId | string | The observation this score relates to. |
| sessionId | string | The session this score relates to. |
| datasetRunId | string | The dataset run this score relates to. |
| comment | string | Optional note (e.g., evaluator comments or user feedback). |
| id | string | Unique identifier (auto-generated; can also be used as an idempotency key). |
| source | string | Source of the score: API, EVAL, or ANNOTATION. |
| dataType | string | One of NUMERIC, CATEGORICAL, or BOOLEAN. |
| configId | string | Optional link to a ScoreConfig schema. |
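To make the data types concrete, here is how the same attributes look for numeric, categorical, and boolean scores. These are illustrative payload shapes built from the table above, not verbatim API requests; check the ingestion API reference for the exact field set.

```python
# NUMERIC: the result lives in `value`.
numeric_score = {
    "traceId": "trace-123",
    "name": "relevance",
    "dataType": "NUMERIC",
    "value": 0.87,
    "source": "EVAL",
}

# CATEGORICAL: the label lives in `stringValue`; `value` is optional.
categorical_score = {
    "traceId": "trace-123",
    "name": "tone",
    "dataType": "CATEGORICAL",
    "stringValue": "formal",
    "source": "ANNOTATION",
}

# BOOLEAN: `value` is 0 or 1, with `stringValue` as its string representation.
boolean_score = {
    "traceId": "trace-123",
    "name": "is_grounded",
    "dataType": "BOOLEAN",
    "value": 1,
    "stringValue": "True",
    "comment": "All claims supported by the retrieved context",
    "source": "API",
}
```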
Why Use Score Configs?
While Scores are flexible, ensuring consistency across teams and evaluations is key. This is where Score Configs come in.
A Score Config acts as a schema for your evaluation metrics. It helps:
Standardize evaluation methods across your team.
Ensure comparable results for long-term analysis.
Prevent errors in score assignment (e.g., enforcing ranges or categories).
You can define ScoreConfigs either in the Langfuse UI or via API. They are immutable but can be archived (and restored later).
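As a sketch, defining configs via the public API could look like the following. The endpoint path and the shape of the categories list are assumptions here; the same definitions can be created in the UI.

```python
import os
import requests

def create_score_config(payload: dict) -> dict:
    """Post a ScoreConfig definition to the Langfuse public API (path assumed)."""
    response = requests.post(
        f"{os.environ['LANGFUSE_HOST']}/api/public/score-configs",
        auth=(os.environ["LANGFUSE_PUBLIC_KEY"], os.environ["LANGFUSE_SECRET_KEY"]),
        json=payload,
        timeout=10,
    )
    response.raise_for_status()
    return response.json()

# Numeric config with an enforced range.
create_score_config({
    "name": "relevance",
    "dataType": "NUMERIC",
    "minValue": 0,
    "maxValue": 1,
    "description": "How relevant the answer is to the question (0 = off-topic, 1 = fully on-topic).",
})

# Categorical config with a fixed label set (label/value pairs assumed).
create_score_config({
    "name": "tone",
    "dataType": "CATEGORICAL",
    "categories": [{"label": "formal", "value": 0}, {"label": "casual", "value": 1}],
    "description": "Writing style of the response.",
})
```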
Anatomy of a Score Config
A ScoreConfig defines the rules for how scores should behave.
| Attribute | Type | Description |
| --- | --- | --- |
| id | string | Unique identifier of the score config. |
| name | string | Name (e.g., hallucination_eval, user_feedback). |
| dataType | string | One of NUMERIC, CATEGORICAL, or BOOLEAN. |
| isArchived | boolean | Whether the config is archived. Defaults to false. |
| minValue | number | Minimum allowed value for numeric scores (defaults to -∞ if not set). |
| maxValue | number | Maximum allowed value for numeric scores (defaults to +∞ if not set). |
| categories | list | Defines the valid categories for categorical scores. |
| description | string | Optional description for clarity. |
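Putting the two models together: once a config exists, individual scores can reference it through configId, and Langfuse can then reject values that fall outside the defined range or category set. A hypothetical example, with made-up IDs:

```python
# A stored ScoreConfig (shape mirrors the table above).
relevance_config = {
    "id": "cfg-relevance-001",       # hypothetical config ID
    "name": "relevance",
    "dataType": "NUMERIC",
    "isArchived": False,
    "minValue": 0,
    "maxValue": 1,
    "description": "How relevant the answer is to the question.",
}

# A score that references the config: 0.92 is inside [0, 1], so it conforms.
valid_score = {
    "traceId": "trace-123",
    "name": "relevance",
    "dataType": "NUMERIC",
    "value": 0.92,
    "configId": "cfg-relevance-001",
}

# This one would be rejected: 1.4 exceeds the configured maxValue of 1.
invalid_score = {**valid_score, "value": 1.4}
```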
Why This Matters
By structuring evaluations with Scores and Score Configs, Langfuse provides a scalable, reliable, and transparent way to measure LLM performance.
Developers can track fine-grained evaluations at the trace or observation level.
Teams can ensure consistency by using configs across projects.
Organizations can build comparable benchmarks over time.
In short, Langfuse makes evaluations traceable, standardized, and analysis-ready.