Explanation

Experiments let you evaluate the performance of your fine-tunes on specific datasets. By running experiments, you can see how well your models perform relative to other models.

Prerequisites

You’ll need a dataset uploaded to Montelo to get started. Remember, an experiment is always associated with a dataset.

You’ll also need a runner function and an evaluator function. Here’s a simple example:

Node
import { Evaluator, Montelo, Runner } from "montelo";
import OpenAI from "openai";

const montelo = new Montelo();
const openai = new OpenAI();

type MetadataT = null;

type EvalT = {
  isCorrect: boolean;
};

const dataset = "dataset-id";

const gpt4Runner: Runner<MetadataT> = async (params) => {
  const completion = await openai.chat.completions.create({
    model: "gpt-4-turbo",
    messages: params.input.messages as OpenAI.Chat.Completions.ChatCompletionMessageParam[],
  });
  const message = completion.choices[0].message;
  return { messages: [message] };
};

const evaluator: Evaluator<EvalT, MetadataT> = async (params) => {
  const { expectedOutput, actualOutput } = params;

  const expectedOut = expectedOutput.messages[0].content;
  const actualOut = actualOutput.messages[0].content;

  return { isCorrect: expectedOut === actualOut };
};

await montelo.experiments.createAndRun({
  name: "GPT-4",
  dataset,
  runner: gpt4Runner,
  evaluator,
});

Runner

The runner function takes the datapoint’s input and returns an output. Here’s the most basic example:

const runner = async (sentence: string) => {
  // any LLM call (`llmCall` is a placeholder for your own code)
  const message = await llmCall(sentence);
  return { messages: [message] };
};

The runner can be any function. You could call an external service, an internal API, or even write a really simple function like the one above.
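
For instance, here’s a sketch of a runner that delegates to an internal HTTP API (the endpoint URL and response shape are hypothetical):

// A runner that calls an internal service instead of an LLM SDK directly.
// Whatever you return here is passed to your evaluator as `actualOutput`.
const apiRunner: Runner<MetadataT> = async (params) => {
  const response = await fetch("https://internal.example.com/generate", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ messages: params.input.messages }),
  });
  const data = await response.json();

  return { messages: [{ role: "assistant", content: data.text }] };
};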

Evaluator

The evaluator function takes the datapoint’s expected output and the output that the runner returned, and runs an evaluation on them. Again, you can run any evaluation you want!

Note that you must return an object, not a primitive, for evaluators. Here’s an example:

type MetadataT = null;

type EvalT = {
  isCorrect: boolean;
  error: boolean;
};

const evaluator: Evaluator<EvalT, MetadataT> = async (params) => {
  const { expectedOutput, actualOutput } = params;

  // Compare message content rather than the objects themselves,
  // since `===` on objects only checks reference equality.
  const expected = expectedOutput.messages[0]?.content;
  const actual = actualOutput.messages[0]?.content;

  return {
    isCorrect: expected === actual,
    // Flag datapoints where the runner produced no content.
    error: actual == null,
  };
};

Your evaluator can be any function!

  • Have an LLM act as your evaluator
  • Use string comparisons (as above)
  • Use Levenshtein distance
  • Use cosine similarity
  • Anything else!

We’ll take the object you return, run statistics on it, and create charts for you.
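
For example, here’s a sketch of a Levenshtein-distance evaluator (the `similarity` field name and the local `levenshtein` helper are just for illustration):

type SimilarityEvalT = {
  similarity: number;
};

// Plain Levenshtein (edit) distance between two strings.
const levenshtein = (a: string, b: string): number => {
  const dp = Array.from({ length: a.length + 1 }, (_, i) =>
    Array.from({ length: b.length + 1 }, (_, j) => (i === 0 ? j : j === 0 ? i : 0)),
  );
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      dp[i][j] = Math.min(dp[i - 1][j] + 1, dp[i][j - 1] + 1, dp[i - 1][j - 1] + cost);
    }
  }
  return dp[a.length][b.length];
};

const similarityEvaluator: Evaluator<SimilarityEvalT, MetadataT> = async (params) => {
  const expected = params.expectedOutput.messages[0]?.content ?? "";
  const actual = params.actualOutput.messages[0]?.content ?? "";

  // Normalize the edit distance into a 0–1 similarity score.
  const maxLen = Math.max(expected.length, actual.length) || 1;
  return { similarity: 1 - levenshtein(expected, actual) / maxLen };
};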

Running an Experiment

Now that you have a runner and an evaluator, call createAndRun to create an experiment and run it.

await montelo.experiments.createAndRun({
  name: "Experiment",
  dataset: "dataset-id",
  runner,
  evaluator,
});

And that’s all you need! Behind the scenes, Montelo will pull all of the dataset’s datapoints and run the runner and evaluator on each one.

We’ll inject each datapoint’s input, metadata, and expected output into your runner and evaluator, and track the data that you return.
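
For instance, a runner can adjust its behavior based on that metadata. Here’s a sketch that reuses the `openai` client from the first example and assumes the metadata is exposed as `params.metadata`, with a hypothetical `temperature` field:

// The `temperature` field is a hypothetical example of what your metadata might hold.
type RunMetadataT = {
  temperature: number;
};

const metadataRunner: Runner<RunMetadataT> = async (params) => {
  const completion = await openai.chat.completions.create({
    model: "gpt-4-turbo",
    // Read the per-datapoint metadata injected by Montelo.
    temperature: params.metadata.temperature,
    messages: params.input.messages as OpenAI.Chat.Completions.ChatCompletionMessageParam[],
  });
  return { messages: [completion.choices[0].message] };
};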

Options

You can pass in options to the createAndRun function:

parallelism (number)

How many datapoints to run concurrently, to reduce the time it takes to finish the experiment.

onlyTestSplit (boolean)

Only run the experiment on the test datapoints, skipping the train datapoints.
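
Putting it together, a call that uses both options might look like this (the values are just examples):

await montelo.experiments.createAndRun({
  name: "GPT-4 (test split only)",
  dataset: "dataset-id",
  runner,
  evaluator,
  options: {
    // Run 5 datapoints at a time instead of one by one.
    parallelism: 5,
    // Skip datapoints in the train split.
    onlyTestSplit: true,
  },
});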