Langfuse Evaluation: Understanding the Scores Data Model
- Philip Moses
How do you measure the performance of an AI model in a structured, reliable way? That’s exactly where Langfuse comes in. Its Scores Data Model is the backbone of LLM evaluation, making it possible to capture, compare, and standardize results across interactions, sessions, and datasets.

👉 In this blog, we’ll break down:
What Scores are in Langfuse and how they work
How Score Configs help standardize and scale evaluations
By the end, you’ll know exactly how to structure evaluations in Langfuse so they are consistent, comparable, and future-proof.
What are Scores in Langfuse?
In Langfuse, Scores are the building blocks of evaluation. They store the results of evaluation methods, making it possible to measure the performance of traces, sessions, or dataset runs.
Think of a Score as the output of an evaluation method — whether numeric, categorical, or boolean — tied to a specific object in Langfuse.
How Scores Work
A Score can reference one of four core objects:
Trace → Used for evaluating a single interaction (most common).
Observation → Used for evaluating a specific step within a trace (e.g., an LLM call).
Session → Used for evaluating outcomes across multiple interactions.
Dataset Run → Used for assessing the performance of a dataset run.
👉 Importantly, each score is linked to exactly one of these objects, as the sketch below shows.
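For example, with the Langfuse Python SDK you can attach a score to a whole trace or to a single observation inside it. This is a minimal sketch: the score() call shown here follows the v2-style SDK, and all IDs are placeholders.

```python
from langfuse import Langfuse

# Reads LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY and LANGFUSE_HOST from the environment.
langfuse = Langfuse()

# Score a whole interaction (trace-level) -- the most common case.
langfuse.score(
    trace_id="trace-123",            # placeholder trace ID
    name="user_feedback",
    value=1,                         # e.g. a thumbs-up from the user
    comment="User confirmed the answer was helpful",
)

# Score one step inside the trace (observation-level), e.g. a single LLM call.
langfuse.score(
    trace_id="trace-123",
    observation_id="obs-456",        # placeholder observation ID
    name="hallucination_eval",
    value=0,
    comment="No unsupported claims found",
)
```

Each call creates exactly one score, attached to exactly one object: the first to the trace, the second to the observation within it.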
Anatomy of a Score Object
A Score object has multiple attributes to ensure clarity, traceability, and flexibility.
| Attribute | Type | Description |
| --- | --- | --- |
| name | string | Name of the score (e.g., user_feedback, hallucination_eval). |
| value | number | Numeric value (used for numeric and boolean scores; optional for categorical). |
| stringValue | string | String representation (commonly used for categorical and boolean data). |
| traceId | string | The trace this score relates to. |
| observationId | string | The observation this score relates to. |
| sessionId | string | The session this score relates to. |
| datasetRunId | string | The dataset run this score relates to. |
| comment | string | Optional note (e.g., evaluator comments or user feedback). |
| id | string | Unique identifier (auto-generated; can also be used as an idempotency key). |
| source | string | Source of the score: API, EVAL, or ANNOTATION. |
| dataType | string | One of NUMERIC, CATEGORICAL, or BOOLEAN. |
| configId | string | Optional link to a ScoreConfig schema. |
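To make the data types concrete, here is how the same attributes look for numeric, categorical, and boolean scores. These are illustrative payload shapes built from the table above, not verbatim API requests; check the ingestion API reference for the exact field set.

```python
# NUMERIC: the result lives in `value`.
numeric_score = {
    "traceId": "trace-123",
    "name": "relevance",
    "dataType": "NUMERIC",
    "value": 0.87,
    "source": "EVAL",
}

# CATEGORICAL: the label lives in `stringValue`; `value` is optional.
categorical_score = {
    "traceId": "trace-123",
    "name": "tone",
    "dataType": "CATEGORICAL",
    "stringValue": "formal",
    "source": "ANNOTATION",
}

# BOOLEAN: `value` is 0 or 1, with `stringValue` as its string representation.
boolean_score = {
    "traceId": "trace-123",
    "name": "is_grounded",
    "dataType": "BOOLEAN",
    "value": 1,
    "stringValue": "True",
    "comment": "All claims supported by the retrieved context",
    "source": "API",
}
```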
Why Use Score Configs?
While Scores are flexible, ensuring consistency across teams and evaluations is key. This is where Score Configs come in.
A Score Config acts as a schema for your evaluation metrics. It helps:
Standardize evaluation methods across your team.
Ensure comparable results for long-term analysis.
Prevent errors in score assignment (e.g., enforcing ranges or categories).
You can define ScoreConfigs either in the Langfuse UI or via API. They are immutable but can be archived (and restored later).
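As a sketch, defining configs via the public API could look like the following. The endpoint path and the shape of the categories list are assumptions here; the same definitions can be created in the UI.

```python
import os
import requests

def create_score_config(payload: dict) -> dict:
    """Post a ScoreConfig definition to the Langfuse public API (path assumed)."""
    response = requests.post(
        f"{os.environ['LANGFUSE_HOST']}/api/public/score-configs",
        auth=(os.environ["LANGFUSE_PUBLIC_KEY"], os.environ["LANGFUSE_SECRET_KEY"]),
        json=payload,
        timeout=10,
    )
    response.raise_for_status()
    return response.json()

# Numeric config with an enforced range.
create_score_config({
    "name": "relevance",
    "dataType": "NUMERIC",
    "minValue": 0,
    "maxValue": 1,
    "description": "How relevant the answer is to the question (0 = off-topic, 1 = fully on-topic).",
})

# Categorical config with a fixed label set (label/value pairs assumed).
create_score_config({
    "name": "tone",
    "dataType": "CATEGORICAL",
    "categories": [{"label": "formal", "value": 0}, {"label": "casual", "value": 1}],
    "description": "Writing style of the response.",
})
```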
Anatomy of a Score Config
A ScoreConfig defines the rules for how scores should behave.
| Attribute | Type | Description |
| --- | --- | --- |
| id | string | Unique identifier of the score config. |
| name | string | Name (e.g., hallucination_eval, user_feedback). |
| dataType | string | One of NUMERIC, CATEGORICAL, or BOOLEAN. |
| isArchived | boolean | Whether the config is archived. Defaults to false. |
| minValue | number | Minimum allowed value for numeric scores (defaults to -∞ if not set). |
| maxValue | number | Maximum allowed value for numeric scores (defaults to +∞ if not set). |
| categories | list | Defines the valid categories for categorical scores. |
| description | string | Optional description for clarity. |
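Putting the two models together: once a config exists, individual scores can reference it through configId, and Langfuse can then reject values that fall outside the defined range or category set. A hypothetical example, with made-up IDs:

```python
# A stored ScoreConfig (shape mirrors the table above).
relevance_config = {
    "id": "cfg-relevance-001",       # hypothetical config ID
    "name": "relevance",
    "dataType": "NUMERIC",
    "isArchived": False,
    "minValue": 0,
    "maxValue": 1,
    "description": "How relevant the answer is to the question.",
}

# A score that references the config: 0.92 is inside [0, 1], so it conforms.
valid_score = {
    "traceId": "trace-123",
    "name": "relevance",
    "dataType": "NUMERIC",
    "value": 0.92,
    "configId": "cfg-relevance-001",
}

# This one would be rejected: 1.4 exceeds the configured maxValue of 1.
invalid_score = {**valid_score, "value": 1.4}
```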
Why This Matters
By structuring evaluations with Scores and Score Configs, Langfuse provides a scalable, reliable, and transparent way to measure LLM performance.
Developers can track fine-grained evaluations at the trace or observation level.
Teams can ensure consistency by using configs across projects.
Organizations can build comparable benchmarks over time.
In short, Langfuse makes evaluations traceable, standardized, and analysis-ready.