How to Use Langfuse for Inner-Loop Offline and Outer-Loop Online Evaluation
- Philip Moses
- Jul 31
- 4 min read
Updated: Aug 2
Creating a successful application using Large Language Models (LLMs) is an ongoing process of testing and improvement. In this blog post, we'll explore how to effectively build and maintain LLM applications using a two-part evaluation framework: offline evaluation (the inner loop) for pre-launch testing and online evaluation (the outer loop) for post-launch monitoring. We'll discuss the steps involved in each phase, including creating test datasets, running experiments, analyzing results, deploying the application, and continuously improving it based on real-world feedback.
By the end of this post, you'll understand how to use these processes to create robust and reliable LLM applications.

Offline Evaluation: Testing Before Launch
Offline evaluation is like a practice round. It happens in a controlled setting where you can test and tweak your application without real-world consequences.
Creating a Test Dataset
Make a Dataset: Start by creating a dataset, which is a collection of examples your application will be tested against. Think of it as a set of practice questions.
Add Examples to the Dataset: Fill your dataset with examples. Each example should have an input (a question or task) and, optionally, an expected output (the correct answer); see the sketch after this list.
Edit or Remove Examples: Over time, you might need to update or remove examples from your dataset to keep it relevant and useful.
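To make this concrete, here is what a single example might look like for a question-answering task. This is purely a hypothetical item, not tied to any specific dataset or SDK:

# A hypothetical dataset example: the input the application will receive,
# paired with the output we expect it to produce.
example = {
    "input": {"question": "What is the capital of France?"},
    "expected_output": {"answer": "Paris"},
}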
Running Tests
Test Your Application: Run your application using the examples in your dataset to see how well it performs and where it might need improvement.
Score the Results: Score the results to compare different versions of your application. This helps you understand which changes make it better.
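A simple way to start scoring is an exact-match check before moving on to more nuanced metrics. The sketch below is illustrative only; the normalization rule and function names are assumptions, not part of any library:

def _normalize(text: str) -> str:
    return text.strip().lower()

def exact_match_score(output: str, expected: str) -> float:
    """Return 1.0 if the normalized output matches the expected answer, else 0.0."""
    return 1.0 if _normalize(output) == _normalize(expected) else 0.0

print(exact_match_score(" Paris ", "paris"))  # 1.0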
Analyzing the Results
Check the Scores: After testing, look at the scores to see how well your application did. This helps you decide what needs to be improved before you launch.
Moving to Production: Launching Your Application
Once your application has been tested and improved, it's time to launch it for real users to interact with.
Launch: Deploy your application so that users can start using it. This is a big step because your application will now be tested in the real world.
Online Evaluation: Monitoring After Launch
After your application is launched, you need to keep an eye on how it's doing through online evaluation.
Keeping Track of Real-World Use
Record Interactions: Keep a record of how users interact with your application to see what's working and what's not; a tracing sketch follows this list.
Monitor Performance: Watch how your application performs in real-time to catch and fix any problems quickly.
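One way to record these interactions is to instrument your application with Langfuse tracing. The sketch below assumes the Langfuse Python SDK's @observe() decorator (imported from langfuse in recent SDK versions) and a hypothetical answer_question function standing in for your application code:

from langfuse import observe

@observe()  # records inputs, outputs, and timings as a trace in Langfuse
def answer_question(question: str) -> str:
    # Call your LLM here; this placeholder simply echoes the question.
    return f"You asked: {question}"

answer_question("What is the capital of France?")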
Continuous Improvement
Fix Problems: If you find any issues, fix them and test these fixes in your offline evaluation before updating the live application.
Add Real-World Examples to Your Dataset: Use real-world examples to improve your test dataset, making your offline testing more realistic and helpful.
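For example, a noteworthy production interaction can be promoted into a dataset item using the same create_dataset_item call shown later in this post. The inputs here are hypothetical, and the source_trace_id parameter (linking the item back to its originating trace) is an assumption about the SDK that can be left out:

from langfuse import Langfuse

langfuse = Langfuse()

langfuse.create_dataset_item(
    dataset_name="my_dataset",
    input={"text": "a tricky question a real user asked"},
    expected_output={"text": "the answer we now know is correct"},
    source_trace_id="trace-id-from-production",  # assumed optional parameter
)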
Getting Started with Langfuse
Langfuse is an open-source LLM engineering platform that supports both offline and online evaluation. Here’s how you can get started:
Creating a Dataset
You can create a dataset with the Langfuse Python SDK. Here’s a simple example:
from langfuse import Langfuse

langfuse = Langfuse()
dataset = langfuse.create_dataset(name="my_dataset")

Adding Examples to Your Dataset
Add examples to your dataset like this:
dataset_item = langfuse.create_dataset_item(
    dataset_name="my_dataset",
    input={"text": "hello world"},
    expected_output={"text": "hello world"}
)

Running Tests
Run your application on the dataset and record the results:
dataset = langfuse.get_dataset("my_dataset")

for item in dataset.items:
    with item.run(run_name="my_run") as root_span:
        output = my_llm_application.run(item.input)  # your application under test
        # 0.95 is a placeholder; in practice, compute the score by comparing
        # output with item.expected_output
        root_span.score_trace(name="accuracy", value=0.95)

Analyzing Results
After running the tests, review the results: Langfuse lets you compare dataset runs side by side in its UI, so you can see how scores change from one version of your application to the next.
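You can also do a quick comparison locally. A minimal sketch, using hypothetical per-item accuracy scores for two runs of the dataset:

# Hypothetical per-item accuracy scores from two dataset runs.
run_scores = {
    "prompt_v1": [0.6, 0.8, 0.7, 0.9],
    "prompt_v2": [0.9, 0.8, 1.0, 0.9],
}

for run_name, scores in run_scores.items():
    print(f"{run_name}: average accuracy {sum(scores) / len(scores):.2f}")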
Online Evaluation with Langfuse
Online evaluation helps you monitor your application in real-time. You can set up different types of evaluations, such as:
User Feedback: Direct feedback from users, such as thumbs-up/thumbs-down ratings (see the sketch after this list).
Implicit Feedback: Indirect signals, such as how long users spend on a task.
Run-time Checks: Automated checks, such as model-based (LLM-as-a-judge) evaluations, to ensure everything is working correctly.
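For example, user feedback from the first category can be recorded as a score on the trace of that interaction. A minimal sketch, assuming the Langfuse Python SDK's create_score method and a hypothetical trace ID captured when the interaction was logged:

from langfuse import Langfuse

langfuse = Langfuse()

# Attach a user rating to an existing trace as a score.
langfuse.create_score(
    trace_id="trace-id-of-the-interaction",  # hypothetical ID captured at request time
    name="user_feedback",
    value=1,  # e.g. 1 = thumbs up, 0 = thumbs down
)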
Conclusion
Building and maintaining a successful LLM application is a continuous journey of testing, learning, and improving. By leveraging both offline and online evaluation processes, you can ensure that your application not only meets but exceeds user expectations. Tools like Langfuse can significantly streamline these processes, making it easier to monitor performance, gather insights, and make data-driven improvements. Embrace this cycle of continuous evaluation, and you'll be well on your way to creating robust, reliable, and user-centric LLM applications.
🛠️ Want to Deploy Langfuse Without the Hassle?
That’s where House of FOSS steps in.
At House of FOSS, we make open-source tools like Langfuse plug-and-play for businesses of all sizes. Whether you're building an AI product, monitoring prompts, or evaluating LLM outputs — we help you deploy, scale, and manage Langfuse with zero friction.
✅ Why Choose House of FOSS?
🧩 Custom Setup – We tailor Langfuse to your exact observability and evaluation needs.
🕒 24/7 Support – We're here when you need us.
💰 Save up to 60% – Cut SaaS costs, not performance.
🛠️ Fully Managed – We handle security, scaling, and updates.
⚡ Bonus: With House of FOSS, deploying Langfuse is as easy as installing an app on your phone. No configs. No setup stress. Just click, install, and start monitoring.

