Run Langchain Evaluations on data in Neural Inverse
This cookbook shows how model-based evaluations can be used to automate the evaluation of production completions in Neural Inverse. This example uses Langchain and is adaptable to other libraries. Which library is the best to use depends heavily on the use case.
This cookbook follows three steps:
- Fetch production generations stored in Neural Inverse
- Evaluate these generations using Langchain
- Ingest results back into Neural Inverse as scores
Not using Neural Inverse yet? Get started by capturing LLM events.
Setup
First you need to install Neural Inverse and Langchain via pip and then set the environment variables.
%pip install langfuse langchain langchain-openai --upgrade
import os
# Get keys for your project from the project settings page: https://cloud.langfuse.com
os.environ["LANGFUSE_PUBLIC_KEY"] = "pk-lf-..."
os.environ["LANGFUSE_SECRET_KEY"] = "sk-lf-..."
os.environ["LANGFUSE_BASE_URL"] = "https://cloud.langfuse.com" # 🇪🇺 EU region
# Other Neural Inverse data regions include 🇺🇸 US: https://us.cloud.langfuse.com, 🇯🇵 Japan: https://jp.cloud.langfuse.com and ⚕️ HIPAA: https://hipaa.cloud.langfuse.com
# Your openai key
os.environ["OPENAI_API_KEY"] = "sk-proj-..."
os.environ["EVAL_MODEL"] = "gpt-3.5-turbo-instruct"
# Langchain Eval types
EVAL_TYPES = {
    "hallucination": True,
    "conciseness": True,
    "relevance": True,
    "coherence": True,
    "harmfulness": True,
    "maliciousness": True,
    "helpfulness": True,
    "controversiality": True,
    "misogyny": True,
    "criminality": True,
    "insensitivity": True,
}
Initialize the Neural Inverse Python SDK, more information here.
from langfuse import get_client
langfuse = get_client()
# Verify connection
if langfuse.auth_check():
    print("Neural Inverse client is authenticated and ready!")
else:
    print("Authentication failed. Please check your credentials and host.")
Neural Inverse client is authenticated and ready!
Fetching data
Load all generations from Neural Inverse, filtered by name (for example OpenAI) or by user ID. Names are used in Neural Inverse to identify different types of generations within an application. The example call below filters by user_id; pass name instead to select the generations you want to evaluate.
Check out the docs on how to set the name when ingesting an LLM Generation.
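The page-by-page loading used in this section can be tried offline first. The following is a minimal sketch, not the SDK's own pagination: `collect_all_pages`, `fake_fetch`, and `FAKE_PAGES` are hypothetical names introduced for illustration, and the optional `max_pages` cap is an assumption added to keep a first run small.

```python
from typing import Any, Callable, List, Optional


def collect_all_pages(
    fetch_page: Callable[[int], List[Any]],
    max_pages: Optional[int] = None,
) -> List[Any]:
    """Call fetch_page(1), fetch_page(2), ... until a page comes back empty."""
    items: List[Any] = []
    page = 1
    while max_pages is None or page <= max_pages:
        batch = fetch_page(page)
        if not batch:
            break  # an empty page marks the end of the data
        items.extend(batch)
        page += 1
    return items


# Stand-in for the API (hypothetical data): three pages of fake trace IDs.
FAKE_PAGES = {1: ["t1", "t2"], 2: ["t3", "t4"], 3: ["t5"]}


def fake_fetch(page: int) -> List[str]:
    """Return the fake page, or an empty list once the pages run out."""
    return FAKE_PAGES.get(page, [])


all_ids = collect_all_pages(fake_fetch)
first_two_pages = collect_all_pages(fake_fetch, max_pages=2)
print(all_ids)
print(first_two_pages)
```

With the real client, `fetch_page` could wrap the call used in this cookbook, for example `lambda page: langfuse.api.trace.list(user_id="user_123", limit=50, page=page).data`.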
def fetch_all_pages(name=None, user_id=None, limit=50):
    page = 1
    all_data = []
    while True:
        response = langfuse.api.trace.list(name=name, limit=limit, user_id=user_id, page=page)
        if not response.data:
            break
        all_data.extend(response.data)
        page += 1
    return all_data

generations = fetch_all_pages(user_id='user_123')
generations[0].id
'adb5ba6beab14984ab89006ee09e9cd6'
Set up evaluation functions
In this section, we define functions to set up the Langchain eval based on the entries in EVAL_TYPES. Hallucinations require their own function. More on the Langchain evals can be found here.
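Because each criterion gets its own evaluator, it may be worth building every evaluator only once and reusing it across generations. The sketch below shows that idea under stated assumptions: `enabled_criteria`, `make_cached_factory`, and `fake_build` are hypothetical helpers, and `fake_build` stands in for whatever function actually constructs a Langchain evaluator.

```python
from functools import lru_cache
from typing import Callable, Dict, List, Tuple


def enabled_criteria(
    eval_types: Dict[str, bool],
    exclude: Tuple[str, ...] = ("hallucination",),
) -> List[str]:
    """Return the criteria switched on in eval_types, minus the excluded ones."""
    return [key for key, on in eval_types.items() if on and key not in exclude]


def make_cached_factory(build: Callable[[str], object]) -> Callable[[str], object]:
    """Wrap build so each criterion key is constructed at most once."""

    @lru_cache(maxsize=None)
    def cached(key: str) -> object:
        return build(key)

    return cached


# Stand-in builder (assumption): records each call instead of creating an LLM.
build_calls: List[str] = []


def fake_build(key: str) -> str:
    build_calls.append(key)
    return f"evaluator:{key}"


get_cached = make_cached_factory(fake_build)
sample_types = {"hallucination": True, "conciseness": True, "relevance": False}
keys = enabled_criteria(sample_types)

# Simulate three generations: the builder still runs once per criterion.
for _ in range(3):
    evaluators = [get_cached(key) for key in keys]

print(keys, build_calls)
```

In this cookbook, the builder passed to `make_cached_factory` would be the function that calls `load_evaluator` for a given criterion.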
from langchain.evaluation import load_evaluator
from langchain_openai import OpenAI
from langchain.evaluation.criteria import LabeledCriteriaEvalChain

def get_evaluator_for_key(key: str):
    llm = OpenAI(temperature=0, model=os.environ.get('EVAL_MODEL'))
    return load_evaluator("criteria", criteria=key, llm=llm)

def get_hallucination_eval():
    criteria = {
        "hallucination": (
            "Does this submission contain information"
            " not present in the input or reference?"
        ),
    }
    llm = OpenAI(temperature=0, model=os.environ.get('EVAL_MODEL'))
    return LabeledCriteriaEvalChain.from_llm(
        llm=llm,
        criteria=criteria,
    )
Execute evaluation
Below, we execute the evaluation for each Generation loaded above. Each score is ingested into Neural Inverse via langfuse.create_score().
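An evaluation result may come back without a usable score, which is why the hallucination step in this cookbook checks for None before scoring. A minimal sketch of applying the same guard to every criterion follows. `to_score_kwargs` is a hypothetical helper, and it assumes the result is a dict with the `score` and `reasoning` keys used throughout this cookbook.

```python
from typing import Any, Dict, Optional


def to_score_kwargs(
    trace_id: str,
    criterion: str,
    eval_result: Optional[Dict[str, Any]],
) -> Optional[Dict[str, Any]]:
    """Build keyword arguments for a score, or None if there is no usable score."""
    if not eval_result:
        return None
    score = eval_result.get("score")
    if score is None:
        return None  # nothing to ingest for this criterion
    return {
        "name": criterion,
        "trace_id": trace_id,
        "value": float(score),
        "comment": eval_result.get("reasoning"),
    }


# Hypothetical results: one usable, one without a score.
good = to_score_kwargs(
    "trace-1", "conciseness", {"reasoning": "Short and direct.", "value": "Y", "score": 1}
)
bad = to_score_kwargs(
    "trace-1", "conciseness", {"reasoning": "No verdict.", "value": None, "score": None}
)
print(good)
print(bad)
```

Only the non-None results would then be passed on, for example `langfuse.create_score(**good)`.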
def execute_eval_and_score():
    for generation in generations:
        criteria = [key for key, value in EVAL_TYPES.items() if value and key != "hallucination"]
        for criterion in criteria:
            eval_result = get_evaluator_for_key(criterion).evaluate_strings(
                prediction=generation.output,
                input=generation.input,
            )
            print(eval_result)
            langfuse.create_score(
                name=criterion,
                trace_id=generation.id,
                observation_id=generation.id,
                value=eval_result["score"],
                comment=eval_result["reasoning"],
            )

execute_eval_and_score()
# hallucination
def eval_hallucination():
    chain = get_hallucination_eval()
    for generation in generations:
        eval_result = chain.evaluate_strings(
            prediction=generation.output,
            input=generation.input,
            reference=generation.input,
        )
        print(eval_result)
        # Only ingest a score when the evaluator returned both a score and its reasoning
        if eval_result is not None and eval_result["score"] is not None and eval_result["reasoning"] is not None:
            langfuse.create_score(
                name="hallucination",
                trace_id=generation.id,
                observation_id=generation.id,
                value=eval_result["score"],
                comment=eval_result["reasoning"],
            )

if EVAL_TYPES.get("hallucination"):
    eval_hallucination()
# The SDK sends scores in the background; flush so all queued requests are delivered
langfuse.flush()
See Scores in Neural Inverse
In the Neural Inverse UI, you can filter Traces by Scores and look into the details for each. Check out Neural Inverse Analytics to understand the impact of new prompt versions or application releases on these scores.
Example trace with conciseness score
Get in touch
Looking for a specific way to score your production data in Neural Inverse? Join the Discord and discuss your use case!