Skip to main content

Conceptual guide

This section contains introductions to key parts of LangChain.

Architecture​

LangChain as a framework consists of several pieces. The below diagram shows how they relate.

Diagram outlining the hierarchical organization of the LangChain framework, displaying the interconnected parts across multiple layers.Diagram outlining the hierarchical organization of the LangChain framework, displaying the interconnected parts across multiple layers.

@langchain/core​

This package contains base abstractions of different components and ways to compose them together. The interfaces for core components like LLMs, vectorstores, retrievers and more are defined here. No third party integrations are defined here. The dependencies are kept purposefully very lightweight.

@langchain/community​

This package contains third party integrations that are maintained by the LangChain community. Key partner packages are separated out (see below). This contains all integrations for various components (LLMs, vectorstores, retrievers). All dependencies in this package are optional to keep the package as lightweight as possible.

Partner packages​

While the long tail of integrations are in @langchain/community, we split popular integrations into their own packages (e.g. langchain-openai, langchain-anthropic, etc). This was done in order to improve support for these important integrations.

langchain​

The main langchain package contains chains, agents, and retrieval strategies that make up an application's cognitive architecture. These are NOT third party integrations. All chains, agents, and retrieval strategies here are NOT specific to any one integration, but rather generic across all integrations.

LangGraph.js​

LangGraph.js is an extension of langchain aimed at building robust and stateful multi-actor applications with LLMs by modeling steps as edges and nodes in a graph.

LangGraph exposes high level interfaces for creating common types of agents, as well as a low-level API for composing custom flows.

LangSmith​

A developer platform that lets you debug, test, evaluate, and monitor LLM applications.

Installation​

If you want to work with high level abstractions, you should install the langchain package.

npm i langchain

If you want to work with specific integrations, you will need to install them separately. See here for a list of integrations and how to install them.

For working with LangSmith, you will need to set up a LangSmith developer account here and get an API key. After that, you can enable it by setting environment variables:

export LANGCHAIN_TRACING_V2=true
export LANGCHAIN_API_KEY=ls__...

LangChain Expression Language​

LangChain Expression Language, or LCEL, is a declarative way to easily compose chains together. LCEL was designed from day 1 to support putting prototypes in production, with no code changes, from the simplest β€œprompt + LLM” chain to the most complex chains (we’ve seen folks successfully run LCEL chains with 100s of steps in production). To highlight a few of the reasons you might want to use LCEL:

First-class streaming support When you build your chains with LCEL you get the best possible time-to-first-token (time elapsed until the first chunk of output comes out). For some chains this means eg. we stream tokens straight from an LLM to a streaming output parser, and you get back parsed, incremental chunks of output at the same rate as the LLM provider outputs the raw tokens.

Optimized parallel execution Whenever your LCEL chains have steps that can be executed in parallel (eg if you fetch documents from multiple retrievers) we automatically do it for the smallest possible latency.

Retries and fallbacks Configure retries and fallbacks for any part of your LCEL chain. This is a great way to make your chains more reliable at scale. We’re currently working on adding streaming support for retries/fallbacks, so you can get the added reliability without any latency cost.

Access intermediate results For more complex chains it’s often very useful to access the results of intermediate steps even before the final output is produced. This can be used to let end-users know something is happening, or even just to debug your chain. You can stream intermediate results, and it’s available on every LangServe server.

Input and output schemas Input and output schemas give every LCEL chain schemas inferred from the structure of your chain. This can be used for validation of inputs and outputs, and is an integral part of LangServe.

Seamless LangSmith tracing As your chains get more and more complex, it becomes increasingly important to understand what exactly is happening at every step. With LCEL, all steps are automatically logged to LangSmith for maximum observability and debuggability.

Interface​

To make it as easy as possible to create custom chains, we've implemented a "Runnable" protocol. Many LangChain components implement the Runnable protocol, including chat models, LLMs, output parsers, retrievers, prompt templates, and more. There are also several useful primitives for working with runnables, which you can read about below.

This is a standard interface, which makes it easy to define custom chains as well as invoke them in a standard way. The standard interface includes:

  • stream: stream back chunks of the response
  • invoke: call the chain on an input
  • batch: call the chain on an array of inputs

The input type and output type varies by component:

ComponentInput TypeOutput Type
PromptObjectPromptValue
ChatModelSingle string, list of chat messages or a PromptValueChatMessage
LLMSingle string, list of chat messages or a PromptValueString
OutputParserThe output of an LLM or ChatModelDepends on the parser
RetrieverSingle stringList of Documents
ToolSingle string or object, depending on the toolDepends on the tool

Components​

LangChain provides standard, extendable interfaces and external integrations for various components useful for building with LLMs. Some components LangChain implements, some components we rely on third-party integrations for, and others are a mix.

Chat models​

Language models that use a sequence of messages as inputs and return chat messages as outputs (as opposed to using plain text). These are traditionally newer models (older models are generally LLMs, see below). Chat models support the assignment of distinct roles to conversation messages, helping to distinguish messages from the AI, users, and instructions such as system messages.

Although the underlying models are messages in, message out, the LangChain wrappers also allow these models to take a string as input. This gives them the same interface as LLMs (and simpler to use). When a string is passed in as input, it will be converted to a HumanMessage under the hood before being passed to the underlying model.

LangChain does not host any Chat Models, rather we rely on third party integrations.

We have some standardized parameters when constructing ChatModels:

  • model: the name of the model

Chat Models also accept other parameters that are specific to that integration.

For specifics on how to use chat models, see the relevant how-to guides here.

LLMs​

caution

Pure text-in/text-out LLMs tend to be older or lower-level. Many popular models are best used as chat completion models, even for non-chat use cases.

You are probably looking for the section above instead.

Language models that takes a string as input and returns a string. These are traditionally older models (newer models generally are Chat Models, see above).

Although the underlying models are string in, string out, the LangChain wrappers also allow these models to take messages as input. This gives them the same interface as Chat Models. When messages are passed in as input, they will be formatted into a string under the hood before being passed to the underlying model.

LangChain does not host any LLMs, rather we rely on third party integrations.

For specifics on how to use LLMs, see the relevant how-to guides here.

Function/Tool Calling​

info

We use the term tool calling interchangeably with function calling. Although function calling is sometimes meant to refer to invocations of a single function, we treat all models as though they can return multiple tool or function calls in each message.

Tool calling allows a model to respond to a given prompt by generating output that matches a user-defined schema. While the name implies that the model is performing some action, this is actually not the case! The model is coming up with the arguments to a tool, and actually running the tool (or not) is up to the user - for example, if you want to extract output matching some schema from unstructured text, you could give the model an "extraction" tool that takes parameters matching the desired schema, then treat the generated output as your final result.

A tool call includes a name, arguments object, and an optional identifier. The arguments object is structured { argumentName: argumentValue }.

Many LLM providers, including Anthropic, Cohere, Google, Mistral, OpenAI, and others, support variants of a tool calling feature. These features typically allow requests to the LLM to include available tools and their schemas, and for responses to include calls to these tools. For instance, given a search engine tool, an LLM might handle a query by first issuing a call to the search engine. The system calling the LLM can receive the tool call, execute it, and return the output to the LLM to inform its response. LangChain includes a suite of built-in tools and supports several methods for defining your own custom tools.

There are two main use cases for function/tool calling:

Message types​

Some language models take an array of messages as input and return a message. There are a few different types of messages. All messages have a role, content, and response_metadata property.

The role describes WHO is saying the message. LangChain has different message classes for different roles.

The content property describes the content of the message. This can be a few different things:

  • A string (most models deal this type of content)
  • A List of objects (this is used for multi-modal input, where the object contains information about that input type and that input location)

HumanMessage​

This represents a message from the user.

AIMessage​

This represents a message from the model. In addition to the content property, these messages also have:

response_metadata

The response_metadata property contains additional metadata about the response. The data here is often specific to each model provider. This is where information like log-probs and token usage may be stored.

tool_calls

These represent a decision from an language model to call a tool. They are included as part of an AIMessage output. They can be accessed from there with the .tool_calls property.

This property returns an array of objects. Each object has the following keys:

  • name: The name of the tool that should be called.
  • args: The arguments to that tool.
  • id: The id of that tool call.

SystemMessage​

This represents a system message, which tells the model how to behave. Not every model provider supports this.

FunctionMessage​

This represents the result of a function call. In addition to role and content, this message has a name parameter which conveys the name of the function that was called to produce this result.

ToolMessage​

This represents the result of a tool call. This is distinct from a FunctionMessage in order to match OpenAI's function and tool message types. In addition to role and content, this message has a tool_call_id parameter which conveys the id of the call to the tool that was called to produce this result.

Prompt templates​

Prompt templates help to translate user input and parameters into instructions for a language model. This can be used to guide a model's response, helping it understand the context and generate relevant and coherent language-based output.

Prompt Templates take as input an object, where each key represents a variable in the prompt template to fill in.

Prompt Templates output a PromptValue. This PromptValue can be passed to an LLM or a ChatModel, and can also be cast to a string or an array of messages. The reason this PromptValue exists is to make it easy to switch between strings and messages.

There are a few different types of prompt templates:

String PromptTemplates​

These prompt templates are used to format a single string, and generally are used for simpler inputs. For example, a common way to construct and use a PromptTemplate is as follows:

import { PromptTemplate } from "@langchain/core/prompts";

const promptTemplate = PromptTemplate.fromTemplate(
"Tell me a joke about {topic}"
);

await promptTemplate.invoke({ topic: "cats" });

ChatPromptTemplates​

These prompt templates are used to format an array of messages. These "templates" consist of an array of templates themselves. For example, a common way to construct and use a ChatPromptTemplate is as follows:

import { ChatPromptTemplate } from "@langchain/core/prompts";

const promptTemplate = ChatPromptTemplate.fromMessages([
["system", "You are a helpful assistant"],
["user", "Tell me a joke about {topic}"],
]);

await promptTemplate.invoke({ topic: "cats" });

In the above example, this ChatPromptTemplate will construct two messages when called. The first is a system message, that has no variables to format. The second is a HumanMessage, and will be formatted by the topic variable the user passes in.

MessagesPlaceholder​

This prompt template is responsible for adding an array of messages in a particular place. In the above ChatPromptTemplate, we saw how we could format two messages, each one a string. But what if we wanted the user to pass in an array of messages that we would slot into a particular spot? This is how you use MessagesPlaceholder.

import {
ChatPromptTemplate,
MessagesPlaceholder,
} from "@langchain/core/prompts";
import { HumanMessage } from "@langchain/core/messages";

const promptTemplate = ChatPromptTemplate.fromMessages([
["system", "You are a helpful assistant"],
new MessagesPlaceholder("msgs"),
]);

promptTemplate.invoke({ msgs: [new HumanMessage({ content: "hi!" })] });

This will produce an array of two messages, the first one being a system message, and the second one being the HumanMessage we passed in. If we had passed in 5 messages, then it would have produced 6 messages in total (the system message plus the 5 passed in). This is useful for letting an array of messages be slotted into a particular spot.

An alternative way to accomplish the same thing without using the MessagesPlaceholder class explicitly is:

const promptTemplate = ChatPromptTemplate.fromMessages([
["system", "You are a helpful assistant"],
["placeholder", "{msgs}"], // <-- This is the changed part
]);

For specifics on how to use prompt templates, see the relevant how-to guides here.

Example Selectors​

One common prompting technique for achieving better performance is to include examples as part of the prompt. This gives the language model concrete examples of how it should behave. Sometimes these examples are hardcoded into the prompt, but for more advanced situations it may be nice to dynamically select them. Example Selectors are classes responsible for selecting and then formatting examples into prompts.

For specifics on how to use example selectors, see the relevant how-to guides here.

Output parsers​

note

The information here refers to parsers that take a text output from a model try to parse it into a more structured representation. More and more models are supporting function (or tool) calling, which handles this automatically. It is recommended to use function/tool calling rather than output parsing. See documentation for that here.

Responsible for taking the output of a model and transforming it to a more suitable format for downstream tasks. Useful when you are using LLMs to generate structured data, or to normalize output from chat models and LLMs.

There are two main methods an output parser must implement:

  • "Get format instructions": A method which returns a string containing instructions for how the output of a language model should be formatted.
  • "Parse": A method which takes in a string (assumed to be the response from a language model) and parses it into some structure.

And then one optional one:

  • "Parse with prompt": A method which takes in a string (assumed to be the response from a language model) and a prompt (assumed to be the prompt that generated such a response) and parses it into some structure. The prompt is largely provided in the event the OutputParser wants to retry or fix the output in some way, and needs information from the prompt to do so.

Output parsers accept a string or BaseMessage as input and can return an arbitrary type.

LangChain has many different types of output parsers. This is a list of output parsers LangChain supports. The table below has various pieces of information:

Name: The name of the output parser

Supports Streaming: Whether the output parser supports streaming.

Input Type: Expected input type. Most output parsers work on both strings and messages, but some (like OpenAI Functions) need a message with specific arguments.

Output Type: The output type of the object returned by the parser.

Description: Our commentary on this output parser and when to use it.

NameSupports StreamingInput TypeOutput TypeDescription
JSONβœ…string | BaseMessagePromise<T>Returns a JSON object as specified. You can specify a Zod schema and it will return JSON for that model.
XMLβœ…string | BaseMessagePromise<XMLResult>Returns a object of tags. Use when XML output is needed. Use with models that are good at writing XML (like Anthropic's).
CSVβœ…string | BaseMessageArray[string]Returns an array of comma separated values.
Structuredstring | BaseMessagePromise<TypeOf<T>>Parse structured JSON from an LLM response.
HTTPβœ…stringPromise<Uint8Array>Parse an LLM response to then send over HTTP(s). Useful when invoking the LLM on the server/edge, and then sending the content/stream back to the client.
Bytesβœ…string | BaseMessagePromise<Uint8Array>Parse an LLM response to then send over HTTP(s). Useful for streaming LLM responses from the server/edge to the client.
DatetimestringPromise<Date>Parses response into a Date.
RegexstringPromise<Record<string, string>>Parses the given text using the regex pattern and returns a object with the parsed output.

For specifics on how to use output parsers, see the relevant how-to guides here.

Chat History​

Most LLM applications have a conversational interface. An essential component of a conversation is being able to refer to information introduced earlier in the conversation. At bare minimum, a conversational system should be able to access some window of past messages directly.

The concept of ChatHistory refers to a class in LangChain which can be used to wrap an arbitrary chain. This ChatHistory will keep track of inputs and outputs of the underlying chain, and append them as messages to a message database Future interactions will then load those messages and pass them into the chain as part of the input.

Document​

A Document object in LangChain contains information about some data. It has two attributes:

  • pageContent: string: The content of this document. Currently is only a string.
  • metadata: Record<string, any>: Arbitrary metadata associated with this document. Can track the document id, file name, etc.

Document loaders​

These classes load Document objects. LangChain has hundreds of integrations with various data sources to load data from: Slack, Notion, Google Drive, etc.

Each DocumentLoader has its own specific parameters, but they can all be invoked in the same way with the .load method. An example use case is as follows:

import { CSVLoader } from "@langchain/community/document_loaders/fs/csv";

const loader = new CSVLoader();
// <-- Integration specific parameters here

const docs = await loader.load();

For specifics on how to use document loaders, see the relevant how-to guides here.

Text splitters​

Once you've loaded documents, you'll often want to transform them to better suit your application. The simplest example is you may want to split a long document into smaller chunks that can fit into your model's context window. LangChain has a number of built-in document transformers that make it easy to split, combine, filter, and otherwise manipulate documents.

When you want to deal with long pieces of text, it is necessary to split up that text into chunks. As simple as this sounds, there is a lot of potential complexity here. Ideally, you want to keep the semantically related pieces of text together. What "semantically related" means could depend on the type of text. This notebook showcases several ways to do that.

At a high level, text splitters work as following:

  1. Split the text up into small, semantically meaningful chunks (often sentences).
  2. Start combining these small chunks into a larger chunk until you reach a certain size (as measured by some function).
  3. Once you reach that size, make that chunk its own piece of text and then start creating a new chunk of text with some overlap (to keep context between chunks).

That means there are two different axes along which you can customize your text splitter:

  1. How the text is split
  2. How the chunk size is measured

For specifics on how to use text splitters, see the relevant how-to guides here.

Embedding models​

Embedding models create a vector representation of a piece of text. You can think of a vector as an array of numbers that captures the semantic meaning of the text. By representing the text in this way, you can perform mathematical operations that allow you to do things like search for other pieces of text that are most similar in meaning. These natural language search capabilities underpin many types of context retrieval, where we provide an LLM with the relevant data it needs to effectively respond to a query.

The Embeddings class is a class designed for interfacing with text embedding models. There are many different embedding model providers (OpenAI, Cohere, Hugging Face, etc) and local models, and this class is designed to provide a standard interface for all of them.

The base Embeddings class in LangChain provides two methods: one for embedding documents and one for embedding a query. The former takes as input multiple texts, while the latter takes a single text. The reason for having these as two separate methods is that some embedding providers have different embedding methods for documents (to be searched over) vs queries (the search query itself).

For specifics on how to use embedding models, see the relevant how-to guides here.

Vectorstores​

One of the most common ways to store and search over unstructured data is to embed it and store the resulting embedding vectors, and then at query time to embed the unstructured query and retrieve the embedding vectors that are 'most similar' to the embedded query. A vector store takes care of storing embedded data and performing vector search for you.

Most vector stores can also store metadata about embedded vectors and support filtering on that metadata before similarity search, allowing you more control over returned documents.

Vectorstores can be converted to the retriever interface by doing:

const vectorstore = new MyVectorStore();
const retriever = vectorstore.asRetriever();

For specifics on how to use vector stores, see the relevant how-to guides here.

Retrievers​

A retriever is an interface that returns relevant documents given an unstructured query. They are more general than a vector store. A retriever does not need to be able to store documents, only to return (or retrieve) them. Retrievers can be created from vector stores, but are also broad enough to include Exa search (web search) and Amazon Kendra.

Retrievers accept a string query as input and return an array of Documents as output.

For specifics on how to use retrievers, see the relevant how-to guides here.

Tools​

Tools are interfaces that an agent, chain, or LLM can use to interact with the world. They combine a few things:

  1. The name of the tool
  2. A description of what the tool is
  3. JSON schema of what the inputs to the tool are
  4. The function to call
  5. Whether the result of a tool should be returned directly to the user

It is useful to have all this information because this information can be used to build action-taking systems! The name, description, and JSON schema can be used to prompt the LLM so it knows how to specify what action to take, and then the function to call is equivalent to taking that action.

The simpler the input to a tool is, the easier it is for an LLM to be able to use it. Many agents will only work with tools that have a single string input.

Importantly, the name, description, and JSON schema (if used) are all used in the prompt. Therefore, it is really important that they are clear and describe exactly how the tool should be used. You may need to change the default name, description, or JSON schema if the LLM is not understanding how to use the tool.

For specifics on how to use tools, see the relevant how-to guides here.

Toolkits​

Toolkits are collections of tools that are designed to be used together for specific tasks. They have convenient loading methods.

All Toolkits expose a getTools method which returns an array of tools. You can therefore do:

// Initialize a toolkit
const toolkit = new ExampleTookit(...)

// Get list of tools
const tools = toolkit.getTools()

Agents​

By themselves, language models can't take actions - they just output text. A big use case for LangChain is creating agents. Agents are systems that use an LLM as a reasoning enginer to determine which actions to take and what the inputs to those actions should be. The results of those actions can then be fed back into the agent and it determine whether more actions are needed, or whether it is okay to finish.

LangGraph is an extension of LangChain specifically aimed at creating highly controllable and customizable agents. Please check out that documentation for a more in depth overview of agent concepts.

There is a legacy agent concept in LangChain that we are moving towards deprecating: AgentExecutor. AgentExecutor was essentially a runtime for agents. It was a great place to get started, however, it was not flexible enough as you started to have more customized agents. In order to solve that we built LangGraph to be this flexible, highly-controllable runtime.

If you are still using AgentExecutor, do not fear: we still have a guide on how to use AgentExecutor. It is recommended, however, that you start to transition to LangGraph. In order to assist in this we have put together a transition guide on how to do so.

Multimodal​

Some models are multimodal, accepting images, audio and even video as inputs. These are still less common, meaning model providers haven't standardized on the "best" way to define the API. Multimodal outputs are even less common. As such, we've kept our multimodal abstractions fairly light weight and plan to further solidify the multimodal APIs and interaction patterns as the field matures.

In LangChain, most chat models that support multimodal inputs also accept those values in OpenAI's content blocks format. So far this is restricted to image inputs. For models like Gemini which support video and other bytes input, the APIs also support the native, model-specific representations.

For specifics on how to use multimodal models, see the relevant how-to guides here.

Callbacks​

LangChain provides a callbacks system that allows you to hook into the various stages of your LLM application. This is useful for logging, monitoring, streaming, and other tasks.

You can subscribe to these events by using the callbacks argument available throughout the API. This argument is list of handler objects, which are expected to implement one or more of the methods described below in more detail.

Callback Events​

EventEvent TriggerAssociated Method
Chat model startWhen a chat model startshandleChatModelStart
LLM startWhen a llm startshandleLLMStart
LLM new tokenWhen an llm OR chat model emits a new tokenhandleLLMNewToken
LLM endsWhen an llm OR chat model endshandleLLMEnd
LLM errorsWhen an llm OR chat model errorshandleLLMError
Chain startWhen a chain starts runninghandleChainStart
Chain endWhen a chain endshandleChainEnd
Chain errorWhen a chain errorshandleChainError
Tool startWhen a tool starts runninghandleToolStart
Tool endWhen a tool endshandleToolEnd
Tool errorWhen a tool errorshandleToolError
Agent actionWhen an agent takes an actionhandleAgentAction
Agent finishWhen an agent endshandleAgentEnd
Retriever startWhen a retriever startshandleRetrieverStart
Retriever endWhen a retriever endshandleRetrieverEnd
Retriever errorWhen a retriever errorshandleRetrieverError
TextWhen arbitrary text is runhandleText

Callback handlers​

CallbackHandlers are objects that implement the CallbackHandler interface, which has a method for each event that can be subscribed to. The CallbackManager will call the appropriate method on each handler when the event is triggered.

Passing callbacks​

The callbacks property is available on most objects throughout the API (Models, Tools, Agents, etc.) in two different places:

  • Request callbacks: Passed at the time of the request in addition to the input data. Available on all standard Runnable objects. These callbacks are INHERITED by all children of the object they are defined on. For example, chain.invoke({foo: "bar"}, {callbacks: [handler]}).
  • Constructor callbacks: defined in the constructor, e.g. new ChatAnthropic({ callbacks: [handler], tags: ["a-tag"] }). In this case, the callbacks will be used for all calls made on that object, and will be scoped to that object only. For example, if you initialize a chat model with constructor callbacks, then use it within a chain, the callbacks will only be invoked for calls to that model.
danger

Constructor callbacks are scoped only to the object they are defined on. They are not inherited by children of the object.

If you're creating a custom chain or runnable, you need to remember to propagate request time callbacks to any child objects.

For specifics on how to use callbacks, see the relevant how-to guides here.

Techniques​

Streaming​

Individual LLM calls often run for much longer than traditional resource requests. This compounds when you build more complex chains or agents that require multiple reasoning steps.

Fortunately, LLMs generate output iteratively, which means it's possible to show sensible intermediate results before the final response is ready. Consuming output as soon as it becomes available has therefore become a vital part of the UX around building apps with LLMs to help alleviate latency issues, and LangChain aims to have first-class support for streaming.

Below, we'll discuss some concepts and considerations around streaming in LangChain.

.stream()​

Most modules in LangChain include the .stream() method as an ergonomic streaming interface. .stream() returns an iterator, which you can consume with a for await...of loop. Here's an example with a chat model:

import { ChatAnthropic } from "@langchain/anthropic";
import { concat } from "@langchain/core/utils/stream";
import type { AIMessageChunk } from "@langchain/core/messages";

const model = new ChatAnthropic({ model: "claude-3-sonnet-20240229" });

const stream = await model.stream("what color is the sky?");

let gathered: AIMessageChunk | undefined = undefined;

for await (const chunk of stream) {
console.log(chunk);
if (gathered === undefined) {
gathered = chunk;
} else {
gathered = concat(gathered, chunk);
}
}

console.log(gathered);

For models (or other components) that don't support streaming natively, this iterator would just yield a single chunk, but you could still use the same general pattern when calling them. Using .stream() will also automatically call the model in streaming mode without the need to provide additional config.

The type of each outputted chunk depends on the type of component - for example, chat models yield AIMessageChunks. Because this method is part of LangChain Expression Language, you can handle formatting differences from different outputs using an output parser to transform each yielded chunk.

You can check out this guide for more detail on how to use .stream().

.streamEvents()​

While the .stream() method is intuitive, it can only return the final generated value of your chain. This is fine for single LLM calls, but as you build more complex chains of several LLM calls together, you may want to use the intermediate values of the chain alongside the final output - for example, returning sources alongside the final generation when building a chat over documents app.

There are ways to do this using callbacks, or by constructing your chain in such a way that it passes intermediate values to the end with something like chained .assign() calls, but LangChain also includes an .streamEvents() method that combines the flexibility of callbacks with the ergonomics of .stream(). When called, it returns an iterator which yields various types of events that you can filter and process according to the needs of your project.

Here's one small example that prints just events containing streamed chat model output:

import { StringOutputParser } from "@langchain/core/output_parsers";
import { ChatPromptTemplate } from "@langchain/core/prompts";
import { ChatAnthropic } from "@langchain/anthropic";

const model = new ChatAnthropic({ model: "claude-3-sonnet-20240229" });

const prompt = ChatPromptTemplate.fromTemplate("tell me a joke about {topic}");
const parser = new StringOutputParser();
const chain = prompt.pipe(model).pipe(parser);

const eventStream = await chain.streamEvents(
{ topic: "parrot" },
{ version: "v2" }
);

for await (const event of eventStream) {
const kind = event.event;
if (kind === "on_chat_model_stream") {
console.log(event);
}
}

You can roughly think of it as an iterator over callback events (though the format differs) - and you can use it on almost all LangChain components!

See this guide for more detailed information on how to use .streamEvents().

Tokens​

The unit that most model providers use to measure input and output is via a unit called a token. Tokens are the basic units that language models read and generate when processing or producing text. The exact definition of a token can vary depending on the specific way the model was trained - for instance, in English, a token could be a single word like "apple", or a part of a word like "app".

When you send a model a prompt, the words and characters in the prompt are encoded into tokens using a tokenizer. The model then streams back generated output tokens, which the tokenizer decodes into human-readable text. The below example shows how OpenAI models tokenize LangChain is cool!:

You can see that it gets split into 5 different tokens, and that the boundaries between tokens are not exactly the same as word boundaries.

The reason language models use tokens rather than something more immediately intuitive like "characters" has to do with how they process and understand text. At a high-level, language models iteratively predict their next generated output based on the initial input and their previous generations. Training the model using tokens language models to handle linguistic units (like words or subwords) that carry meaning, rather than individual characters, which makes it easier for the model to learn and understand the structure of the language, including grammar and context. Furthermore, using tokens can also improve efficiency, since the model processes fewer units of text compared to character-level processing.

Callbacks​

The lowest level way to stream outputs from LLMs in LangChain is via the callbacks system. You can pass a callback handler that handles the handleLLMNewToken event into LangChain components. When that component is invoked, any LLM or chat model contained in the component calls the callback with the generated token. Within the callback, you could pipe the tokens into some other destination, e.g. a HTTP response. You can also handle the handleLLMEnd event to perform any necessary cleanup.

You can see this how-to section for more specifics on using callbacks.

Callbacks were the first technique for streaming introduced in LangChain. While powerful and generalizable, they can be unwieldy for developers. For example:

  • You need to explicitly initialize and manage some aggregator or other stream to collect results.
  • The execution order isn't explicitly guaranteed, and you could theoretically have a callback run after the .invoke() method finishes.
  • Providers would often make you pass an additional parameter to stream outputs instead of returning them all at once.
  • You would often ignore the result of the actual model call in favor of callback results.

Structured output​

LLMs are capable of generating arbitrary text. This enables the model to respond appropriately to a wide range of inputs, but for some use-cases, it can be useful to constrain the LLM's output to a specific format or structure. This is referred to as structured output.

For example, if the output is to be stored in a relational database, it is much easier if the model generates output that adheres to a defined schema or format. Extracting specific information from unstructured text is another case where this is particularly useful. Most commonly, the output format will be JSON, though other formats such as XML can be useful too. Below, we'll discuss a few ways to get structured output from models in LangChain.

.withStructuredOutput()​

For convenience, some LangChain chat models support a .withStructuredOutput() method. This method only requires a schema as input, and returns an object matching the requested schema. Generally, this method is only present on models that support one of the more advanced methods described below, and will use one of them under the hood. It takes care of importing a suitable output parser and formatting the schema in the right format for the model.

For more information, check out this how-to guide.

Raw prompting​

The most intuitive way to get a model to structure output is to ask nicely. In addition to your query, you can give instructions describing what kind of output you'd like, then parse the output using an output parser to convert the raw model message or string output into something more easily manipulated.

The biggest benefit to raw prompting is its flexibility:

  • Raw prompting does not require any special model features, only sufficient reasoning capability to understand the passed schema.
  • You can prompt for any format you'd like, not just JSON. This can be useful if the model you are using is more heavily trained on a certain type of data, such as XML or YAML.

However, there are some drawbacks too:

  • LLMs are non-deterministic, and prompting a LLM to consistently output data in the exactly correct format for smooth parsing can be surprisingly difficult and model-specific.
  • Individual models have quirks depending on the data they were trained on, and optimizing prompts can be quite difficult. Some may be better at interpreting JSON schema, others may be best with TypeScript definitions, and still others may prefer XML.

While we'll next go over some ways that you can take advantage of features offered by model providers to increase reliability, prompting techniques remain important for tuning your results no matter what method you choose.

JSON mode​

Some models, such as Mistral, OpenAI, Together AI and Ollama, support a feature called JSON mode, usually enabled via config.

When enabled, JSON mode will constrain the model's output to always be some sort of valid JSON. Often they require some custom prompting, but it's usually much less burdensome and along the lines of, "you must always return JSON", and the output is easier to parse.

It's also generally simpler and more commonly available than tool calling.

Here's an example:

import { JsonOutputParser } from "@langchain/core/output_parsers";
import { ChatPromptTemplate } from "@langchain/core/prompts";
import { ChatOpenAI } from "@langchain/openai";

const model = new ChatOpenAI({
model: "gpt-4o",
modelKwargs: {
response_format: { type: "json_object" },
},
});

const TEMPLATE = `Answer the user's question to the best of your ability.
You must always output a JSON object with an "answer" key and a "followup_question" key.

{question}`;

const prompt = ChatPromptTemplate.fromTemplate(TEMPLATE);

const chain = prompt.pipe(model).pipe(new JsonOutputParser());

await chain.invoke({ question: "What is the powerhouse of the cell?" });
{
answer: "The powerhouse of the cell is the mitochondrion.",
followup_question: "Would you like to learn more about the functions of mitochondria?"
}

For a full list of model providers that support JSON mode, see this table.

Function/tool calling​

info

We use the term tool calling interchangeably with function calling. Although function calling is sometimes meant to refer to invocations of a single function, we treat all models as though they can return multiple tool or function calls in each message.

Tool calling allows a model to respond to a given prompt by generating output that matches a user-defined schema. While the name implies that the model is performing some action, this is actually not the case! The model is coming up with the arguments to a tool, and actually running the tool (or not) is up to the user - for example, if you want to extract output matching some schema from unstructured text, you could give the model an "extraction" tool that takes parameters matching the desired schema, then treat the generated output as your final result.

For models that support it, tool calling can be very convenient. It removes the guesswork around how best to prompt schemas in favor of a built-in model feature. It can also more naturally support agentic flows, since you can just pass multiple tool schemas instead of fiddling with enums or unions.

Many LLM providers, including Anthropic, Cohere, Google, Mistral, OpenAI, and others, support variants of a tool calling feature. These features typically allow requests to the LLM to include available tools and their schemas, and for responses to include calls to these tools. For instance, given a search engine tool, an LLM might handle a query by first issuing a call to the search engine. The system calling the LLM can receive the tool call, execute it, and return the output to the LLM to inform its response. LangChain includes a suite of built-in tools and supports several methods for defining your own custom tools.

LangChain provides a standardized interface for tool calling that is consistent across different models.

The standard interface consists of:

  • ChatModel.bindTools(): a method for specifying which tools are available for a model to call. This method accepts LangChain tools.
  • AIMessage.toolCalls: an attribute on the AIMessage returned from the model for accessing the tool calls requested by the model.

The following how-to guides are good practical resources for using function/tool calling:

For a full list of model providers that support tool calling, see this table.

Retrieval​

LLMs are trained on a large but fixed dataset, limiting their ability to reason over private or recent information. Fine-tuning an LLM with specific facts is one way to mitigate this, but is often poorly suited for factual recall and can be costly. Retrieval is the process of providing relevant information to an LLM to improve its response for a given input. Retrieval augmented generation (RAG) is the process of grounding the LLM generation (output) using the retrieved information.

tip
  • See our RAG from Scratch video series. The code examples are in Python but is useful for a general overview of RAG concepts for visual learners.
  • For a high-level guide on retrieval, see this tutorial on RAG.

RAG is only as good as the retrieved documents’ relevance and quality. Fortunately, an emerging set of techniques can be employed to design and improve RAG systems. We've focused on taxonomizing and summarizing many of these techniques (see below figure) and will share some high-level strategic guidance in the following sections. You can and should experiment with using different pieces together. You might also find this LangSmith guide useful for showing how to evaluate different iterations of your app.

Query Translation​

First, consider the user input(s) to your RAG system. Ideally, a RAG system can handle a wide range of inputs, from poorly worded questions to complex multi-part queries. Using an LLM to review and optionally modify the input is the central idea behind query translation. This serves as a general buffer, optimizing raw user inputs for your retrieval system. For example, this can be as simple as extracting keywords or as complex as generating multiple sub-questions for a complex query.

NameWhen to useDescription
Multi-queryWhen you need to cover multiple perspectives of a question.Rewrite the user question from multiple perspectives, retrieve documents for each rewritten question, return the unique documents for all queries.
Decomposition (Python cookbook)When a question can be broken down into smaller subproblems.Decompose a question into a set of subproblems / questions, which can either be solved sequentially (use the answer from first + retrieval to answer the second) or in parallel (consolidate each answer into final answer).
Step-back (Python cookbook)When a higher-level conceptual understanding is required.First prompt the LLM to ask a generic step-back question about higher-level concepts or principles, and retrieve relevant facts about them. Use this grounding to help answer the user question.
HyDE (Python cookbook)If you have challenges retrieving relevant documents using the raw user inputs.Use an LLM to convert questions into hypothetical documents that answer the question. Use the embedded hypothetical documents to retrieve real documents with the premise that doc-doc similarity search can produce more relevant matches.
tip

See our Python RAG from Scratch videos for a few different specific approaches:

Routing​

Second, consider the data sources available to your RAG system. You want to query across more than one database or across structured and unstructured data sources. Using an LLM to review the input and route it to the appropriate data source is a simple and effective approach for querying across sources.

NameWhen to useDescription
Logical routingWhen you can prompt an LLM with rules to decide where to route the input.Logical routing can use an LLM to reason about the query and choose which datastore is most appropriate.
Semantic routingWhen semantic similarity is an effective way to determine where to route the input.Semantic routing embeds both query and, typically a set of prompts. It then chooses the appropriate prompt based upon similarity.
tip

See our Python RAG from Scratch video on routing.

Query Construction​

Third, consider whether any of your data sources require specific query formats. Many structured databases use SQL. Vector stores often have specific syntax for applying keyword filters to document metadata. Using an LLM to convert a natural language query into a query syntax is a popular and powerful approach. In particular, text-to-SQL, text-to-Cypher, and query analysis for metadata filters are useful ways to interact with structured, graph, and vector databases respectively.

NameWhen to UseDescription
Text to SQLIf users are asking questions that require information housed in a relational database, accessible via SQL.This uses an LLM to transform user input into a SQL query.
Text-to-CypherIf users are asking questions that require information housed in a graph database, accessible via Cypher.This uses an LLM to transform user input into a Cypher query.
Self QueryIf users are asking questions that are better answered by fetching documents based on metadata rather than similarity with the text.This uses an LLM to transform user input into two things: (1) a string to look up semantically, (2) a metadata filter to go along with it. This is useful because oftentimes questions are about the METADATA of documents (not the content itself).
tip

See our blog post overview and RAG from Scratch video on query construction, the process of text-to-DSL where DSL is a domain specific language required to interact with a given database. This converts user questions into structured queries.

Indexing​

Fouth, consider the design of your document index. A simple and powerful idea is to decouple the documents that you index for retrieval from the documents that you pass to the LLM for generation. Indexing frequently uses embedding models with vector stores, which compress the semantic information in documents to fixed-size vectors.

Many RAG approaches focus on splitting documents into chunks and retrieving some number based on similarity to an input question for the LLM. But chunk size and chunk number can be difficult to set and affect results if they do not provide full context for the LLM to answer a question. Furthermore, LLMs are increasingly capable of processing millions of tokens.

Two approaches can address this tension: (1) Multi Vector retriever using an LLM to translate documents into any form (e.g., often into a summary) that is well-suited for indexing, but returns full documents to the LLM for generation. (2) ParentDocument retriever embeds document chunks, but also returns full documents. The idea is to get the best of both worlds: use concise representations (summaries or chunks) for retrieval, but use the full documents for answer generation.

NameIndex TypeUses an LLMWhen to UseDescription
Vector storeVector storeNoIf you are just getting started and looking for something quick and easy.This is the simplest method and the one that is easiest to get started with. It involves creating embeddings for each piece of text.
ParentDocumentVector store + Document StoreNoIf your pages have lots of smaller pieces of distinct information that are best indexed by themselves, but best retrieved all together.This involves indexing multiple chunks for each document. Then you find the chunks that are most similar in embedding space, but you retrieve the whole parent document and return that (rather than individual chunks).
Multi VectorVector store + Document StoreSometimes during indexingIf you are able to extract information from documents that you think is more relevant to index than the text itself.This involves creating multiple vectors for each document. Each vector could be created in a myriad of ways - examples include summaries of the text and hypothetical questions.
Time-Weighted Vector storeVector storeNoIf you have timestamps associated with your documents, and you want to retrieve the most recent onesThis fetches documents based on a combination of semantic similarity (as in normal vector retrieval) and recency (looking at timestamps of indexed documents)
tip

Fifth, consider ways to improve the quality of your similarity search itself. Embedding models compress text into fixed-length (vector) representations that capture the semantic content of the document. This compression is useful for search / retrieval, but puts a heavy burden on that single vector representation to capture the semantic nuance / detail of the document. In some cases, irrelevant or redundant content can dilute the semantic usefulness of the embedding.

There are some additional tricks to improve the quality of your retrieval. Embeddings excel at capturing semantic information, but may struggle with keyword-based queries. Many vector stores offer built-in hybrid-search to combine keyword and semantic similarity, which marries the benefits of both approaches. Furthermore, many vector stores have maximal marginal relevance, which attempts to diversify the results of a search to avoid returning similar and redundant documents.

NameWhen to useDescription
Hybrid searchWhen combining keyword-based and semantic similarity.Hybrid search combines keyword and semantic similarity, marrying the benefits of both approaches.
Maximal Marginal Relevance (MMR)When needing to diversify search results.MMR attempts to diversify the results of a search to avoid returning similar and redundant documents.

Post-processing​

Sixth, consider ways to filter or rank retrieved documents. This is very useful if you are combining documents returned from multiple sources, since it can can down-rank less relevant documents and / or compress similar documents.

NameIndex TypeUses an LLMWhen to UseDescription
Contextual CompressionAnySometimesIf you are finding that your retrieved documents contain too much irrelevant information and are distracting the LLM.This puts a post-processing step on top of another retriever and extracts only the most relevant information from retrieved documents. This can be done with embeddings or an LLM.
EnsembleAnyNoIf you have multiple retrieval methods and want to try combining them.This fetches documents from multiple retrievers and then combines them.
Re-rankingAnyYesIf you want to rank retrieved documents based upon relevance, especially if you want to combine results from multiple retrieval methods.Given a query and a list of documents, Rerank indexes the documents from most to least semantically relevant to the query.
tip

See our Python RAG from Scratch video on RAG-Fusion, on approach for post-processing across multiple queries: Rewrite the user question from multiple perspectives, retrieve documents for each rewritten question, and combine the ranks of multiple search result lists to produce a single, unified ranking with Reciprocal Rank Fusion (RRF).

Generation​

Finally, consider ways to build self-correction into your RAG system. RAG systems can suffer from low quality retrieval (e.g., if a user question is out of the domain for the index) and / or hallucinations in generation. A naive retrieve-generate pipeline has no ability to detect or self-correct from these kinds of errors. The concept of "flow engineering" has been introduced in the context of code generation: iteratively build an answer to a code question with unit tests to check and self-correct errors. Several works have applied this RAG, such as Self-RAG and Corrective-RAG. In both cases, checks for document relevance, hallucinations, and / or answer quality are performed in the RAG answer generation flow.

We've found that graphs are a great way to reliably express logical flows and have implemented ideas from several of these papers using LangGraph, as shown in the figure below (red - routing, blue - fallback, green - self-correction):

  • Routing: Adaptive RAG (paper). Route questions to different retrieval approaches, as discussed above
  • Fallback: Corrective RAG (paper). Fallback to web search if docs are not relevant to query
  • Self-correction: Self-RAG (paper). Fix answers w/ hallucinations or don’t address question

NameWhen to useDescription
Self-RAGWhen needing to fix answers with hallucinations or irrelevant content.Self-RAG performs checks for document relevance, hallucinations, and answer quality during the RAG answer generation flow, iteratively building an answer and self-correcting errors.
Corrective-RAGWhen needing a fallback mechanism for low relevance docs.Corrective-RAG includes a fallback (e.g., to web search) if the retrieved documents are not relevant to the query, ensuring higher quality and more relevant retrieval.

Text splitting​

LangChain offers many different types of text splitters. These are available in the main langchain package, but can be used separately in the @langchain/textsplitters package.

Table columns:

  • Name: Name of the text splitter
  • Classes: Classes that implement this text splitter
  • Splits On: How this text splitter splits text
  • Adds Metadata: Whether or not this text splitter adds metadata about where each chunk came from.
  • Description: Description of the splitter, including recommendation on when to use it.
NameClassesSplits OnAdds MetadataDescription
RecursiveRecursiveCharacterTextSplitterA list of user defined charactersRecursively splits text. This splitting is trying to keep related pieces of text next to each other. This is the recommended way to start splitting text.
Codemany languagesCode (Python, JS) specific charactersSplits text based on characters specific to coding languages. 15 different languages are available to choose from.
Tokenmany classesTokensSplits text on tokens. There exist a few different ways to measure tokens.
CharacterCharacterTextSplitterA user defined characterSplits text based on a user defined character. One of the simpler methods.

Evaluation​

Evaluation is the process of assessing the performance and effectiveness of your LLM-powered applications. It involves testing the model's responses against a set of predefined criteria or benchmarks to ensure it meets the desired quality standards and fulfills the intended purpose. This process is vital for building reliable applications.

LangSmith helps with this process in a few ways:

  • It makes it easier to create and curate datasets via its tracing and annotation features
  • It provides an evaluation framework that helps you define metrics and run your app against your dataset
  • It allows you to track results over time and automatically run your evaluators on a schedule or as part of CI/Code

To learn more, check out this LangSmith guide.

Generative UI​

LangChain.js provides a few templates and examples showing off generative UI, and other ways of streaming data from the server to the client, specifically in React/Next.js.

You can find the template for generative UI in the official LangChain.js Next.js template.

For streaming agentic responses and intermediate steps, you can find the template and documentation here.

And finally, streaming tool calls and structured output can be found here.


Was this page helpful?


You can also leave detailed feedback on GitHub.