Cohere Unstructured ChromaDB

In this article, we will walk through a step-by-step analysis of a MongoDB balance sheet using Python libraries such as Unstructured, Cohere, and Langchain.

When extracting text from an earnings report, a large language model (LLM) faces a limitation: it cannot readily interpret the intricate table structures in the document. The Unstructured library addresses this by converting those tables into HTML, a format that keeps the rows and columns explicit.

This transformation is significant because LLMs excel at comprehending and analyzing the row and column layout of tables when presented in HTML format.
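To make the point about row and column layout concrete, here is a minimal sketch using only Python's standard library. The `TableRows` class and the sample table are hypothetical illustrations (the figures are placeholders, not MongoDB's), and this is not how Unstructured or Cohere process tables internally. It only shows that HTML markup keeps cell boundaries recoverable, whereas flattened text would not.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the cell text of an HTML table as a list of rows."""

    def __init__(self):
        super().__init__()
        self.rows = []          # one list of cell strings per <tr>
        self._in_cell = False   # True while inside a <td> or <th>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th"):
            if not self.rows:   # tolerate a cell that appears before any <tr>
                self.rows.append([])
            self._in_cell = True
            self.rows[-1].append("")

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data.strip()


# Hypothetical table with placeholder figures.
html = (
    "<table>"
    "<tr><th>Item</th><th>2023</th><th>2022</th></tr>"
    "<tr><td>Cash</td><td>100</td><td>90</td></tr>"
    "<tr><td>Receivables</td><td>40</td><td>35</td></tr>"
    "</table>"
)

parser = TableRows()
parser.feed(html)
for row in parser.rows:
    print(row)
```

Each printed row keeps its cells aligned with the header, which is the structure a model can lean on when the table reaches it as HTML.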

Introduction

The goal of this analysis is to extract structured financial information from a MongoDB balance sheet in PDF format. We'll then process this data and perform text-based queries on it to gain valuable insights. To achieve this, we'll use the following Python libraries:

Unstructured: This library extracts structured information from unstructured sources such as PDFs. Here we use it to pull the tables out of the PDF as HTML.
Cohere: Cohere provides the embedding model and the LLM that we'll use for retrieval and question answering.
Langchain: Langchain is a framework for building LLM applications. Here it loads and chunks the text and wires the embeddings, the vector store, and the LLM into a retrieval QA chain.

Setup

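Before we dive into the analysis, we need to set up our environment. We'll install the necessary libraries using the following code. Note that pytesseract is only a Python wrapper; the Tesseract OCR engine itself is a system package and has to be installed separately.

!pip3 install unstructured unstructured-inference langchain numpy pytesseract cohere chromadb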

PDF Extraction and Processing

In this step, we'll extract table structures and other relevant elements from the MongoDB balance sheet PDF. Here's what we'll do:

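from unstructured.partition.pdf import partition_pdf
from unstructured.staging.base import elements_to_json

filename = "mongodb_balancesheet.pdf"

# Extraction Parameters
strategy = "hi_res"    # layout-aware strategy, needed for table structure
model_name = "yolox"   # layout detection model

# Extract Elements from PDF
elements = partition_pdf(
    filename=filename,
    strategy=strategy,
    infer_table_structure=True,  # keep each table as HTML in its metadata
    model_name=model_name
)

# Store Results in JSON
elements_to_json(elements, filename=f"{filename}.json")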

Install Libraries: We start by installing the required libraries with the pip command from the Setup section.
Specify the PDF File: We specify the path to the MongoDB balance sheet PDF file.
Extraction Parameters: We define the extraction strategy and the model to use. Here we choose the "hi_res" strategy with "yolox" as the layout detection model.
Extract Elements: We use the Unstructured library to extract elements from the PDF. Setting infer_table_structure=True makes it infer table structures as well.
Store Results: The extracted elements are stored in a JSON file for further processing.
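Before moving on, it can help to check what the extraction actually found. The sketch below counts elements by their "type" field. The helper name `count_element_types` and the sample entries are hypothetical stand-ins; a real JSON file holds many more entries and more keys than shown here.

```python
from collections import Counter


def count_element_types(entries):
    """Count how many elements of each type the extraction produced."""
    return Counter(entry.get("type", "Unknown") for entry in entries)


# Hypothetical stand-in for the parsed JSON file.
sample_entries = [
    {"type": "Title", "text": "Example heading"},
    {"type": "NarrativeText", "text": "Example paragraph."},
    {"type": "Table", "text": "Item 2023 2022",
     "metadata": {"text_as_html": "<table>...</table>"}},
    {"type": "Table", "text": "Item 2023 2022",
     "metadata": {"text_as_html": "<table>...</table>"}},
]

counts = count_element_types(sample_entries)
print(counts.most_common())
```

To run it on the real output, load the JSON file with json.load and pass the resulting list to the helper. A count of zero for "Table" would suggest the table inference step did not detect anything.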

Text Analysis

In this section, we process the JSON file containing extracted data. Here's what happens:

import json

# Read the JSON file written in the previous step
input_filename = f"{filename}.json"
with open(input_filename, 'r') as file:
    data = json.load(file)

# Extract Table Elements (the HTML rendering lives in the metadata)
extracted_elements = []
for entry in data:
    if entry["type"] == "Table":
        extracted_elements.append(entry["metadata"]["text_as_html"])

# Write to Text File, one table per block separated by a blank line
text_file = "mongodb_balancesheet.pdf.txt"
with open(text_file, 'w') as output_file:
    for element in extracted_elements:
        output_file.write(element + "\n\n")

Read JSON File: We read the JSON file generated in the previous step.
Extract Table Elements: We specifically extract the table elements from the JSON data.
Write to Text File: The extracted table HTML is written to a text file, one table per block, ready for text analysis.
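The extraction step can also be written defensively. The sketch below assumes that some Table entries might lack the "text_as_html" key, and falls back to the entry's plain text in that case. The function name `table_html` and the sample data are hypothetical, and whether such entries occur in practice depends on the Unstructured version and the PDF.

```python
def table_html(entries):
    """Return the HTML of each Table entry, falling back to plain text."""
    tables = []
    for entry in entries:
        if entry.get("type") != "Table":
            continue
        metadata = entry.get("metadata", {})
        # Prefer the HTML rendering; use the raw text if it is missing.
        content = metadata.get("text_as_html") or entry.get("text", "")
        if content:
            tables.append(content)
    return tables


# Hypothetical entries shaped like the JSON produced earlier.
sample_entries = [
    {"type": "Title", "text": "Example heading"},
    {"type": "Table", "text": "Item 2023",
     "metadata": {"text_as_html": "<table><tr><td>Item</td></tr></table>"}},
    {"type": "Table", "text": "Item 2022", "metadata": {}},
]

tables = table_html(sample_entries)
print(tables)
```

The design choice is to keep a table's plain text when no HTML is available, so the content is not lost even though its layout is.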

Questions and Answers

Now, let's get to the exciting part. We'll utilize the processed text data to answer specific questions related to the MongoDB balance sheet. Here's how:


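# Import paths below follow the older single-package Langchain layout used
# in this article; newer releases may expose these classes elsewhere.
from langchain.chains import RetrievalQA
from langchain.document_loaders import TextLoader
from langchain.embeddings import CohereEmbeddings
from langchain.llms import Cohere
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import Chroma

# Load Text Data
text_file = "mongodb_balancesheet.pdf.txt"
loader = TextLoader(text_file)
documents = loader.load()

# Text Chunking
text_splitter = CharacterTextSplitter(chunk_size=4500, chunk_overlap=200)
docs = text_splitter.split_documents(documents)

# Text Embeddings with Cohere
COHERE_API_KEY = '<YOUR_COHERE_API_KEY>'
embeddings = CohereEmbeddings(
    model="embed-english-light-v2.0", cohere_api_key=COHERE_API_KEY
)

# Retrieval-Based QA Model: Chroma stores the chunk embeddings,
# and the chain passes the retrieved chunks to the Cohere LLM
db = Chroma.from_documents(docs, embeddings)
llm = Cohere(cohere_api_key=COHERE_API_KEY, truncate="START", temperature=0)
qa_chain = RetrievalQA.from_chain_type(llm, retriever=db.as_retriever())

# Example Questions
questions = [
    "What are the current assets of MongoDB",
    "What is the total cash flow from investing activities of MongoDB?",
    "How many total Atlas Customers MongoDB has in July 2023?",
    "What was the gross margin in July 2023?"
]

# Store responses in output_list
output_list = []

for query in questions:
    response = qa_chain({"query": query})
    output_list.append(response)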

Load Text Data: We load the table HTML from the text file.
Text Chunking: We split the text into chunks of at most 4,500 characters with a 200-character overlap, so each piece fits comfortably in the model's context.
Text Embeddings with Cohere: We use Cohere's embedding model to turn each chunk into a vector, so that chunks can be compared with a question by similarity.
Retrieval-Based QA Model: Using Langchain, we store the embeddings in Chroma and build a retrieval-based question-answering chain. For each question, the chain retrieves the most relevant chunks and asks the Cohere LLM to answer from them.
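To illustrate what the chunk size and overlap settings mean, here is a simplified sliding-window sketch. It is not Langchain's CharacterTextSplitter, which splits on a separator and then merges the pieces; the function `chunk_text` is a hypothetical helper that only demonstrates how consecutive chunks share their boundary characters.

```python
def chunk_text(text, chunk_size, chunk_overlap):
    """Split text into fixed-size windows that overlap by chunk_overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


text = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_text(text, chunk_size=10, chunk_overlap=3)
for chunk in chunks:
    print(chunk)
```

With these toy settings, each chunk starts with the last three characters of the previous one. The overlap reduces the chance that a table row cut at a chunk boundary is lost to both neighbours.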

Example Questions:

"What are the current assets of MongoDB?"
"What is the total cash flow from investing activities of MongoDB?"
"How many total Atlas Customers did MongoDB have in July 2023?"
"What was the gross margin in July 2023?"

The notebook generates answers to these questions and provides them as output.

[   {   'query': 'What are the current assets of MongoDB',
        'result': ' As of July 31, 2023, MongoDB had total current assets of '
                  '$2,648,234. This includes cash and cash equivalents of '
                  '$722,190, accounts receivable of $61,206, prepaid expenses '
                  'and other current assets of $124, and other current assets '
                  'of $2,548,114.'},
    {   'query': 'What is the total cash flow from investing activities of '
                 'MongoDB?',
        'result': ' The total cash flow from investing activities of MongoDB '
                  'is $110,195.'},
    {   'query': 'How many total Atlas Customers MongoDB has in July 2023?',
        'result': ' MongoDB has 29,000+ Atlas Customers as of July 2023.'},
    {   'query': 'What was the gross margin in July 2023?',
        'result': ' The gross margin in July 2023 was 74%.'}]
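Each answer above is generated from the chunks the retriever judged closest to the question. The sketch below shows that idea with hand-made three-dimensional vectors and cosine similarity. The vectors, the chunk names, and the helpers `cosine_similarity` and `top_k` are all hypothetical; real Cohere embeddings have many more dimensions, and Chroma's actual index and distance metric may differ.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=1):
    """Return the names of the k vectors most similar to the query."""
    ranked = sorted(
        doc_vecs.items(),
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:k]]


# Toy embeddings, made up for illustration only.
doc_vecs = {
    "balance_sheet_table": [0.9, 0.1, 0.0],
    "cash_flow_table": [0.1, 0.9, 0.0],
    "customer_metrics": [0.0, 0.1, 0.9],
}
query_vec = [0.8, 0.2, 0.1]  # stands in for an embedded question

best = top_k(query_vec, doc_vecs, k=2)
print(best)
```

Only the top-ranked chunks are passed to the LLM, so an answer can only be as good as the chunks that were retrieved for it.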

In conclusion, combining Unstructured, Cohere, and Langchain lets us extract, process, and query financial data from unstructured sources like PDFs. Because the tables reach the LLM as HTML, we can ask plain-language questions about complex documents such as balance sheets, which makes this approach useful for financial analysis and reporting.

Analyzing MongoDB Balance Sheet with Unstructured, Cohere, and Langchain
