LlamaParse
LlamaParse is an API created by LlamaIndex to efficiently parse files. For example, it is great at converting PDF tables into markdown.
To use it, first log in and get an API key from https://cloud.llamaindex.ai. Then pass the key as the apiKey parameter or store it in the environment variable LLAMA_CLOUD_API_KEY.
Official documentation for LlamaParse can be found here.
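The two ways of supplying the key can be pictured with a small helper. This is a minimal sketch, not part of the library: resolveApiKey is a hypothetical function, and the precedence it applies (an explicit apiKey first, then LLAMA_CLOUD_API_KEY) is an assumption about how you might resolve the key in your own code. The environment is passed in as a plain object so the snippet runs without Node typings.

```typescript
// Hypothetical helper (not part of llamaindex): picks the LlamaCloud API key
// from an explicit value or from an environment-like object.
type Env = Record<string, string | undefined>;

function resolveApiKey(explicitKey: string | undefined, env: Env): string {
  // Assumption: an explicitly passed key wins over the environment variable.
  const key = explicitKey ?? env["LLAMA_CLOUD_API_KEY"];
  if (key === undefined || key.trim() === "") {
    throw new Error("No API key found: pass apiKey or set LLAMA_CLOUD_API_KEY");
  }
  return key;
}

// Example with a fake environment; in Node you would pass process.env instead.
const fakeEnv: Env = { LLAMA_CLOUD_API_KEY: "key-from-env" };
console.log(resolveApiKey(undefined, fakeEnv)); // falls back to the environment
console.log(resolveApiKey("key-from-code", fakeEnv)); // explicit key is used
```

The resolved value could then be handed to the reader through its apiKey option.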
Usage
You can then use the LlamaParseReader class to load local files and convert them into a parsed document that can be used by LlamaIndex.
See LlamaParseReader.ts for a list of supported file types:
import { LlamaParseReader, VectorStoreIndex } from "llamaindex";

async function main() {
  // Load PDF using LlamaParse
  const reader = new LlamaParseReader({ resultType: "markdown" });
  const documents = await reader.loadData("../data/TOS.pdf");

  // Split text and create embeddings. Store them in a VectorStoreIndex
  const index = await VectorStoreIndex.fromDocuments(documents);

  // Query the index
  const queryEngine = index.asQueryEngine();
  const response = await queryEngine.query({
    query: "What is the license grant in the TOS?",
  });

  // Output response
  console.log(response.toString());
}

main().catch(console.error);
Params
All options can be set with the LlamaParseReader constructor.
They can be divided into two groups.
General params:
apiKey is required. It can also be set through the environment variable LLAMA_CLOUD_API_KEY.
checkInterval is the interval in seconds between checks for whether the parsing is done. Default is 1.
maxTimeout is the maximum timeout to wait for parsing to finish. Default is 2000.
verbose shows the progress of the parsing. Default is true.
ignoreErrors controls error reporting. Set it to false to get errors while parsing. Default is true, in which case an empty array is returned on error.
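To make the interplay of checkInterval and maxTimeout concrete, the following sketch simulates a fixed-interval polling loop with a timeout. It is not the reader's actual implementation: simulatePolling and PollResult are hypothetical names, and the sketch assumes that maxTimeout is expressed in the same unit as checkInterval (seconds).

```typescript
// Hypothetical simulation of fixed-interval polling with a timeout.
// No real waiting happens; time is tracked with a counter.
interface PollResult {
  done: boolean; // true if the job finished before the timeout
  checks: number; // how many status checks were made
  waitedSeconds: number; // simulated time spent waiting
}

function simulatePolling(
  readyAfterSeconds: number, // simulated moment at which the job finishes
  checkInterval = 1, // documented default
  maxTimeout = 2000, // documented default; unit assumed to be seconds
): PollResult {
  if (checkInterval <= 0) {
    throw new Error("checkInterval must be greater than 0");
  }
  let waitedSeconds = 0;
  let checks = 0;
  while (waitedSeconds < maxTimeout) {
    checks += 1;
    if (waitedSeconds >= readyAfterSeconds) {
      return { done: true, checks, waitedSeconds };
    }
    waitedSeconds += checkInterval; // "sleep" until the next check
  }
  return { done: false, checks, waitedSeconds };
}

// A job that is ready after 3 seconds, checked every second.
console.log(simulatePolling(3, 1, 10)); // { done: true, checks: 4, waitedSeconds: 3 }
// A job that outlasts the timeout.
console.log(simulatePolling(100, 5, 20)); // { done: false, checks: 4, waitedSeconds: 20 }
```

A smaller checkInterval notices completion sooner at the cost of more status checks, while maxTimeout bounds the total wait.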
Advanced params:
resultType can be set to markdown, text or json. Defaults to text. More information about json mode on the next pages.
language primarily helps with OCR recognition. Defaults to en. Click here for a list of supported languages.
parsingInstruction? Optional. Can help with complicated document structures. See this LlamaIndex Blog Post for an example.
skipDiagonalText? Optional. Set to true to ignore diagonal text (text that is not rotated 0, 90, 180 or 270 degrees).
invalidateCache? Optional. Set to true to ignore the LlamaCloud cache. All documents are kept in cache for 48 hours after the job is completed to avoid processing the same document twice. Can be useful for testing when trying to re-parse the same document with, e.g., a different parsingInstruction.
doNotCache? Optional. Set to true to not cache the document.
fastMode? Optional. Set to true to use the fast mode. This mode skips OCR of images and table/heading reconstruction. Note: not compatible with gpt4oMode.
doNotUnrollColumns? Optional. Set to true to keep the text according to the document layout. This reduces reconstruction accuracy and LLM/embedding performance in most cases.
pageSeperator? Optional. The page separator to use. Default is \n---\n.
gpt4oMode set to true to use GPT-4o to extract content. Default is false.
gpt4oApiKey? Optional. Set the GPT-4o API key. Lowers the cost of parsing by using your own API key. Your OpenAI account will be charged. Can also be set in the environment variable LLAMA_CLOUD_GPT4O_API_KEY.
numWorkers as in the Python version, is set in SimpleDirectoryReader. Default is 1.
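Some of these options interact, most notably fastMode and gpt4oMode. The sketch below shows one way to check an options object before constructing the reader. Everything in it is hypothetical: ParseOptionsSketch covers only a subset of the options listed above and is not the library's own type, and findOptionProblems and withDocumentedDefaults are illustrative helpers that only mirror the defaults and the incompatibility stated in this section.

```typescript
// Hypothetical subset of the documented options; not the library's own type.
interface ParseOptionsSketch {
  resultType?: "markdown" | "text" | "json";
  language?: string;
  fastMode?: boolean;
  gpt4oMode?: boolean;
}

// Defaults as stated in the parameter list above.
const documentedDefaults = {
  resultType: "text" as const,
  language: "en",
  gpt4oMode: false,
};

// Returns a list of human-readable problems; an empty list means no conflict found.
function findOptionProblems(options: ParseOptionsSketch): string[] {
  const problems: string[] = [];
  if (options.fastMode && options.gpt4oMode) {
    problems.push("fastMode is not compatible with gpt4oMode");
  }
  return problems;
}

// Fills in the documented defaults for options the caller left out.
function withDocumentedDefaults(options: ParseOptionsSketch) {
  return { ...documentedDefaults, ...options };
}

const candidate: ParseOptionsSketch = { resultType: "markdown", fastMode: true, gpt4oMode: true };
console.log(findOptionProblems(candidate)); // lists the fastMode/gpt4oMode conflict
console.log(withDocumentedDefaults({ resultType: "json" })); // language and gpt4oMode come from the defaults
```

Running such a check up front surfaces the conflict in your own code instead of relying on how the service handles it.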
LlamaParse with SimpleDirectoryReader
Below is a full example of LlamaParse integrated in SimpleDirectoryReader with additional options.
import {
  LlamaParseReader,
  SimpleDirectoryReader,
  VectorStoreIndex,
} from "llamaindex";

async function main() {
  const reader = new SimpleDirectoryReader();

  const docs = await reader.loadData({
    directoryPath: "../data/parallel", // brk-2022.pdf split into 6 parts
    numWorkers: 2,
    // set LlamaParse as the default reader for all file types. Set apiKey here or in environment variable LLAMA_CLOUD_API_KEY
    overrideReader: new LlamaParseReader({
      language: "en",
      resultType: "markdown",
      parsingInstruction:
        "The provided files is Berkshire Hathaway's 2022 Annual Report. They contain figures, tables and raw data. Capture the data in a structured format. Mathematical equation should be put out as LATEX markdown (between $$).",
    }),
  });

  const index = await VectorStoreIndex.fromDocuments(docs);

  // Query the index
  const queryEngine = index.asQueryEngine();
  const response = await queryEngine.query({
    query:
      "What is the general strategy for shareholder safety outlined in the report? Use a concrete example with numbers",
  });

  // Output response
  console.log(response.toString());
}

main().catch(console.error);
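To build some intuition for what numWorkers: 2 means for a directory with six files, the sketch below splits a file list across workers in round-robin order. This is purely illustrative: assignToWorkers and the file names are hypothetical, and the scheduling SimpleDirectoryReader actually uses is not described here and may well differ (for example, workers pulling from a shared queue).

```typescript
// Hypothetical illustration: distribute files across workers round-robin.
// This is NOT how SimpleDirectoryReader is known to schedule work.
function assignToWorkers(files: string[], numWorkers: number): string[][] {
  if (!Number.isInteger(numWorkers) || numWorkers < 1) {
    throw new Error("numWorkers must be a positive integer");
  }
  // One bucket of files per worker.
  const buckets: string[][] = Array.from({ length: numWorkers }, () => []);
  files.forEach((file, i) => {
    buckets[i % numWorkers].push(file);
  });
  return buckets;
}

// Hypothetical file names standing in for the six parts of the report.
const parts = ["part-1.pdf", "part-2.pdf", "part-3.pdf", "part-4.pdf", "part-5.pdf", "part-6.pdf"];
console.log(assignToWorkers(parts, 2));
// [ [ 'part-1.pdf', 'part-3.pdf', 'part-5.pdf' ], [ 'part-2.pdf', 'part-4.pdf', 'part-6.pdf' ] ]
```

With the default of 1, all files end up in a single bucket, which corresponds to processing them one after another.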