
Langfuse Evaluation: Understanding the Scores Data Model

  • Philip Moses

How do you measure the performance of an AI model in a structured, reliable way? That’s exactly where Langfuse comes in. Its Scores Data Model is the backbone of LLM evaluation, making it possible to capture, compare, and standardize results across interactions, sessions, and datasets.

👉 In this blog, we’ll break down:

  1. What Scores are in Langfuse and how they work

  2. How Score Configs help standardize and scale evaluations

By the end, you’ll know exactly how to structure evaluations in Langfuse so they are consistent, comparable, and future-proof.

What are Scores in Langfuse?

In Langfuse, Scores are the building blocks of evaluation. They store the results of evaluation methods, making it possible to measure the performance of traces, sessions, or dataset runs.

Think of a Score as the output of an evaluation method — whether numeric, categorical, or boolean — tied to a specific object in Langfuse.

How Scores Work

A Score can reference one of four core objects:

  • Trace → Used for evaluating a single interaction (most common).

  • Observation → Used for evaluating a specific step within a trace (e.g., an LLM call).

  • Session → Used for evaluating outcomes across multiple interactions.

  • Dataset Run → Used for assessing the performance of a dataset run.


👉 Importantly, each score is linked to only one of these objects.
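For example, attaching a score to a trace or to a single observation from code might look like this. This is a minimal sketch using the Langfuse Python SDK; exact method and parameter names can vary between SDK versions (newer versions expose create_score, for instance), and all IDs here are hypothetical:

```python
# Minimal sketch: writing scores with the Langfuse Python SDK (v2-style client).
# Method/parameter names may differ in other SDK versions; IDs are hypothetical.
from langfuse import Langfuse

langfuse = Langfuse()  # reads LANGFUSE_PUBLIC_KEY / LANGFUSE_SECRET_KEY / LANGFUSE_HOST

# Score a whole trace (a single interaction) with a numeric value.
langfuse.score(
    trace_id="trace_123",
    name="user_feedback",
    value=1,
    comment="Thumbs up from the end user",
)

# Score one observation (e.g., a single LLM call) inside that trace.
langfuse.score(
    trace_id="trace_123",
    observation_id="obs_456",
    name="hallucination_eval",
    value=0,
    data_type="BOOLEAN",
)
```

Session-level and dataset-run-level scores follow the same idea, just linked to a session or a dataset run instead of a trace.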

Anatomy of a Score Object

A Score object has multiple attributes to ensure clarity, traceability, and flexibility.

  • name (string): Name of the score (e.g., user_feedback, hallucination_eval).

  • value (number): Numeric value; used for numeric and boolean scores, optional for categorical scores.

  • stringValue (string): String representation, commonly used for categorical and boolean scores.

  • traceId (string): The trace this score relates to.

  • observationId (string): The observation this score relates to.

  • sessionId (string): The session this score relates to.

  • datasetRunId (string): The dataset run this score relates to.

  • comment (string): Optional note, e.g., evaluator comments or user feedback.

  • id (string): Unique identifier (auto-generated; can also be used as an idempotency key).

  • source (string): Source of the score: API, EVAL, or ANNOTATION.

  • dataType (string): One of NUMERIC, CATEGORICAL, or BOOLEAN.

  • configId (string): Optional link to a ScoreConfig schema.
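
Putting those fields together, a single categorical score might look roughly like this (illustrative values only; the dict simply mirrors the field names listed above):

```python
# Illustrative example of a categorical score, written as a plain Python dict.
# All IDs and values are made up; the field names follow the list above.
example_score = {
    "id": "score_abc",                     # auto-generated unique identifier
    "name": "hallucination_eval",
    "dataType": "CATEGORICAL",
    "stringValue": "minor_hallucination",  # categorical scores carry a string label
    "value": 1,                            # optional numeric mapping of the category
    "traceId": "trace_123",                # linked to exactly one object (a trace here)
    "source": "EVAL",
    "comment": "The answer cites a non-existent source.",
    "configId": "config_789",              # optional link to a ScoreConfig
}
```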

Why Use Score Configs?

While Scores are flexible, ensuring consistency across teams and evaluations is key. This is where Score Configs come in.

A Score Config acts as a schema for your evaluation metrics. It helps:

  • Standardize evaluation methods across your team.

  • Ensure comparable results for long-term analysis.

  • Prevent errors in score assignment (e.g., enforcing ranges or categories).


You can define ScoreConfigs either in the Langfuse UI or via API. They are immutable but can be archived (and restored later).
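
As a rough sketch, creating a categorical config through the public API could look like the following (the endpoint path and the exact shape of categories are assumptions based on the public API, so double-check the current API reference):

```python
# Rough sketch: creating a ScoreConfig via the Langfuse public API.
# Endpoint path, auth scheme, and the categories payload are assumptions;
# verify against the current API reference before relying on them.
import os
import requests

resp = requests.post(
    f"{os.environ['LANGFUSE_HOST']}/api/public/score-configs",
    auth=(os.environ["LANGFUSE_PUBLIC_KEY"], os.environ["LANGFUSE_SECRET_KEY"]),
    json={
        "name": "hallucination_eval",
        "dataType": "CATEGORICAL",
        "categories": [
            {"label": "none", "value": 0},
            {"label": "minor", "value": 1},
            {"label": "major", "value": 2},
        ],
        "description": "Does the answer contain unsupported claims?",
    },
)
resp.raise_for_status()
config = resp.json()
print(config["id"])  # use this as configId when writing scores against the schema
```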

Anatomy of a Score Config

A ScoreConfig defines the rules for how scores should behave.


  • id (string): Unique identifier of the score config.

  • name (string): Name (e.g., hallucination_eval, user_feedback).

  • dataType (string): One of NUMERIC, CATEGORICAL, or BOOLEAN.

  • isArchived (boolean): Whether the config is archived. Defaults to false.

  • minValue (number): Minimum allowed value for numeric scores (defaults to -∞ if not set).

  • maxValue (number): Maximum allowed value for numeric scores (defaults to +∞ if not set).

  • categories (list): Defines the valid categories for categorical scores.

  • description (string): Optional description for clarity.
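
Once a config exists, individual scores can reference it via configId so Langfuse can check them against the schema. A hedged sketch, reusing the client from the earlier example (parameter names may differ between SDK versions, and the IDs are hypothetical):

```python
# Sketch: a categorical score that references a ScoreConfig, so the label
# can be validated against the config's categories. IDs are hypothetical.
langfuse.score(
    trace_id="trace_123",
    name="hallucination_eval",
    value="minor",             # for categorical scores, the value is the category label
    data_type="CATEGORICAL",
    config_id="config_789",
)
```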

Why This Matters

By structuring evaluations with Scores and Score Configs, Langfuse provides a scalable, reliable, and transparent way to measure LLM performance.

  • Developers can track fine-grained evaluations at the trace or observation level.

  • Teams can ensure consistency by using configs across projects.

  • Organizations can build comparable benchmarks over time.


In short, Langfuse makes evaluations traceable, standardized, and analysis-ready.

 
 
 
