
How Langfuse Utilizes Inner Loop Offline and Outer Loop Online Evaluations for Enhanced Performance

  • Philip Moses
  • Jul 31
  • 4 min read

Updated: Aug 2

Creating a successful application using Large Language Models (LLMs) is an ongoing process of testing and improvement. In this blog post, we'll explore how to effectively build and maintain LLM applications using a two-part evaluation framework: offline evaluation (the inner loop) for pre-launch testing and online evaluation (the outer loop) for post-launch monitoring. We'll discuss the steps involved in each phase, including creating test datasets, running experiments, analyzing results, deploying the application, and continuously improving it based on real-world feedback.
By the end of this post, you'll understand how to use these processes to create robust and reliable LLM applications.
Offline Evaluation: Testing Before Launch

Offline evaluation is like a practice round. It happens in a controlled setting where you can test and tweak your application without real-world consequences.

Creating a Test Dataset

  1. Make a Dataset: Start by creating a dataset, which is a collection of examples your application will be tested against. Think of it as a set of practice questions.

  2. Add Examples to the Dataset: Fill your dataset with examples. Each example should have an input (a question or task) and, optionally, an expected output (the correct answer).

  3. Edit or Remove Examples: Over time, you might need to update or remove examples from your dataset to keep it relevant and useful.


Running Tests

  1. Test Your Application: Run your application using the examples in your dataset to see how well it performs and where it might need improvement.

  2. Score the Results: Score the results to compare different versions of your application. This helps you understand which changes make it better.


Analyzing the Results

  1. Check the Scores: After testing, look at the scores to see how well your application did. This helps you decide what needs to be improved before you launch; a small sketch of comparing scores across runs follows below.
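
For illustration only (this is plain Python, not a Langfuse API), here is a minimal sketch of how you might compare average scores across two experiment runs once you have collected them:

from statistics import mean

# Hypothetical per-example accuracy scores collected during offline testing.
runs = {
    "prompt_v1": [0.8, 0.9, 0.7, 1.0],
    "prompt_v2": [0.9, 1.0, 0.8, 1.0],
}

# Compare runs by their average score to decide which version to promote.
for run_name, scores in runs.items():
    print(f"{run_name}: mean accuracy = {mean(scores):.2f}")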


Moving to Production: Launching Your Application

Once your application has been tested and improved, it's time to launch it for real users to interact with.

  1. Launch: Deploy your application so that users can start using it. This is a big step because your application will now be tested in the real world.

Online Evaluation: Monitoring After Launch

After your application is launched, you need to keep an eye on how it's doing through online evaluation.

Keeping Track of Real-World Use

  1. Record Interactions: Keep a record of how users interact with your application to see what's working and what's not (see the tracing sketch after this list).

  2. Monitor Performance: Watch how your application performs in real-time to catch and fix any problems quickly.
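
As a sketch of what recording interactions can look like with the Langfuse Python SDK (in recent SDK versions the decorator is importable as "from langfuse import observe"; older versions expose it under langfuse.decorators), you can wrap your application's entry point so every real-world call is captured as a trace. The function name handle_user_request is just an example:

from langfuse import observe

# Wrapping the handler with @observe records each call (inputs, outputs,
# timing) as a trace in Langfuse for later inspection.
@observe()
def handle_user_request(question: str) -> str:
    # Replace this stub with your actual application / LLM call.
    return f"(answer to: {question})"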


Continuous Improvement

  1. Fix Problems: If you find any issues, fix them and test these fixes in your offline evaluation before updating the live application.

  2. Add Real-World Examples to Your Dataset: Use real-world examples to improve your test dataset, making your offline testing more realistic and helpful (see the sketch below).
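
For example, assuming the create_dataset_item call shown later in this post also accepts a source_trace_id parameter (available in recent SDK versions) to link the item back to its originating trace, promoting a production interaction into your test dataset looks roughly like this:

from langfuse import Langfuse

langfuse = Langfuse()  # reads credentials from environment variables

# Sketch: turn a verified real-world interaction into an offline test case.
# "abc123" is a placeholder for the ID of the production trace you want to keep.
langfuse.create_dataset_item(
    dataset_name="my_dataset",
    input={"text": "a real user question captured in production"},
    expected_output={"text": "the answer you verified as correct"},
    source_trace_id="abc123",
)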


Getting Started with Langfuse

Langfuse is a tool that can help you with both offline and online evaluation. Here’s how you can get started:
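
Before the snippets below will run, install the Python SDK and point it at your Langfuse project. A minimal setup sketch (the keys shown are placeholders; use the API keys from your own project settings):

# Install the SDK first, for example:
#   pip install langfuse

import os

# The SDK reads these environment variables when you create a client.
os.environ["LANGFUSE_PUBLIC_KEY"] = "pk-lf-..."  # placeholder
os.environ["LANGFUSE_SECRET_KEY"] = "sk-lf-..."  # placeholder
os.environ["LANGFUSE_HOST"] = "https://cloud.langfuse.com"  # or your self-hosted URL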

Creating a Dataset

You can create a dataset using Langfuse. Here’s a simple example:

from langfuse import Langfuse

langfuse = Langfuse()
dataset = langfuse.create_dataset(name="my_dataset")

Adding Examples to Your Dataset

Add examples to your dataset like this:

dataset_item = langfuse.create_dataset_item(
    dataset_name="my_dataset",
    input={"text": "hello world"},
    expected_output={"text": "hello world"},
)

Running Tests

Run your application on the dataset and record the results:

dataset = langfuse.get_dataset("my_dataset")

for item in dataset.items:
    # item.run() creates a trace for this item and links it to the run "my_run".
    with item.run(run_name="my_run") as root_span:
        output = my_llm_application.run(item.input)  # your application code
        # In practice, derive the score by comparing output with item.expected_output.
        root_span.score_trace(name="accuracy", value=0.95)

langfuse.flush()  # make sure all events are sent before the script exits

Analyzing Results

After running tests, compare the scores across runs (for example, in the Datasets section of the Langfuse UI) to see how each version of your application performed.

Online Evaluation with Langfuse

Online evaluation helps you monitor your application in real-time. You can set up different types of evaluations, such as:

  • User Feedback: Direct feedback from users, such as a thumbs-up or thumbs-down on an answer (see the sketch after this list).

  • Implicit Feedback: Indirect signals like how long users spend on a task.

  • Run-time Checks: Automated checks to ensure everything is working correctly.
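
As a sketch of the user-feedback case (assuming the create_score method available in recent versions of the Python SDK), you can attach a score to the trace that produced an answer:

from langfuse import Langfuse

langfuse = Langfuse()

# Sketch: record a thumbs-up from a user against the trace that produced
# the answer. "abc123" is a placeholder for the real trace ID.
langfuse.create_score(
    trace_id="abc123",
    name="user_feedback",
    value=1,  # e.g. 1 = thumbs up, 0 = thumbs down
    comment="User marked the answer as helpful",
)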

Conclusion

Building and maintaining a successful LLM application is a continuous journey of testing, learning, and improving. By leveraging both offline and online evaluation processes, you can ensure that your application not only meets but exceeds user expectations. Tools like Langfuse can significantly streamline these processes, making it easier to monitor performance, gather insights, and make data-driven improvements. Embrace this cycle of continuous evaluation, and you'll be well on your way to creating robust, reliable, and user-centric LLM applications.


🛠️ Want to Deploy Langfuse Without the Hassle?

That’s where House of FOSS steps in.

At House of FOSS, we make open-source tools like Langfuse plug-and-play for businesses of all sizes. Whether you're building an AI product, monitoring prompts, or evaluating LLM outputs — we help you deploy, scale, and manage Langfuse with zero friction.

✅ Why Choose House of FOSS?


🧩 Custom Setup – We tailor Langfuse to your exact observability and evaluation needs.

🕒 24/7 Support – We're here when you need us.

💰 Save up to 60% – Cut SaaS costs, not performance.

🛠️ Fully Managed – We handle security, scaling, and updates.


Bonus: With House of FOSS, deploying Langfuse is as easy as installing an app on your phone. No configs. No setup stress. Just click, install, and start monitoring.