Retrieval Augmented Generation

An applied example using LlamaIndex
large language models (LLM)
evidence synthesis
methods
notes
Author

Luke Heley

Published

November 19, 2023

Introduction

This example shows how to load data from a PDF, create embeddings, and then query the data.

Load PDF

We load the Major Projects Report 2015 into Python using [smart_pdf_loader](https://llamahub.ai/l/smart_pdf_loader).

SmartPDFLoader is a super fast PDF reader that understands the layout structure of PDFs such as nested sections, nested lists, paragraphs and tables. It uses layout information to smartly chunk PDFs into optimal short contexts for LLMs.

Code
from llama_hub.smart_pdf_loader import SmartPDFLoader

llmsherpa_api_url = "https://readers.llmsherpa.com/api/document/developer/parseDocument?renderFormat=all"
pdf_url = "https://www.nao.org.uk/wp-content/uploads/2015/10/Major-Projects-Report-2015-and-the-Equipment-Plan-2015-2025.pdf" # also allowed is a file path e.g. /home/downloads/xyz.pdf

pdf_loader = SmartPDFLoader(llmsherpa_api_url=llmsherpa_api_url)
documents = pdf_loader.load_data(pdf_url)
documents[0:2] # look at the first 2 chunks
[Document(id_='7adbabd4-1df8-4ea6-a5cb-ff56f550e0c6', embedding=None, metadata={'chunk_type': 'para'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, hash='7b2a550370ad85d6de2541a641003a5b2971d0115121e67f6ff2a3183be0b7c8', text='Report\nby the Comptroller and Auditor General', start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\n\n{content}', metadata_template='{key}: {value}', metadata_seperator='\n'),
 Document(id_='3762c995-6cbd-4f15-9e16-491bcc9f1dad', embedding=None, metadata={'chunk_type': 'list_item'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, hash='bb8a396323dd4d9c61c79ed4882caa34151071ed1e48e554c8da9f00a2491810', text='Major Projects Report 2015 and the Equipment Plan 2015 to 2025\nHC 488-I', start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\n\n{content}', metadata_template='{key}: {value}', metadata_seperator='\n')]
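
The chunk_type metadata records the layout element each chunk came from, which gives a quick check on how the loader split the report. A small sketch, assuming every chunk carries that key as the two above do:

Code
from collections import Counter

# Tally the layout types SmartPDFLoader assigned to the chunks
print(Counter(doc.metadata.get("chunk_type") for doc in documents))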

Create a Vector Store Index

The documentation on VectorStoreIndex feels a little light. My understanding (not least from the OpenAI billing this step generates) is that it embeds each chunk with OpenAI's ada model (text-embedding-ada-002, the default at the time of writing) and stores the resulting vectors in an in-memory index.

Code
from llama_index import VectorStoreIndex
index = VectorStoreIndex.from_documents(documents)
index
<llama_index.indices.vector_store.base.VectorStoreIndex at 0x233c568edc0>
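
To make that explicit, and to avoid paying for the same embeddings on every run, the embedding model can be pinned and the index persisted to disk. A sketch, assuming the ServiceContext API current at the time of writing; ./storage is a directory name of my own choosing:

Code
from llama_index import (
    ServiceContext,
    StorageContext,
    VectorStoreIndex,
    load_index_from_storage,
)
from llama_index.embeddings import OpenAIEmbedding

# Pin the embedding model rather than relying on the default
service_context = ServiceContext.from_defaults(
    embed_model=OpenAIEmbedding(model="text-embedding-ada-002")
)
index = VectorStoreIndex.from_documents(documents, service_context=service_context)

# Persist so the embeddings are only generated (and billed) once
index.storage_context.persist(persist_dir="./storage")

# On a later run, reload without re-embedding
index = load_index_from_storage(StorageContext.from_defaults(persist_dir="./storage"))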

Query

We then use GPT-3.5, llama_index's default LLM, to query the data.

Code
query_engine = index.as_query_engine()
response = query_engine.query("Summarise this document.")
print(response)
response = query_engine.query("what are the main causes of schedule variation?")
print(response)
The document is titled "Major Projects Report 2015 and the Equipment Plan 2015 to 2025." It contains an executive project summary and an overview of cost, time, and performance. The report discusses the Department's ability to fund the Equipment Plan and suggests that the Affordability Statement should provide clearer information about uncertainties in costs and the range of possible cost outcomes. It also mentions the need to quantify risks not included in cost forecasts. The document is printed on Evolution Digital Satin paper, which is sourced from responsibly managed and sustainable forests certified by the FSC.
The main causes of schedule variation are the net 52-month deferment of the final stage of the Core Production Capability project and the net variation of 8 months in the remaining projects.
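
As with the embeddings, the query-time model can be pinned rather than assumed. A sketch, again using the ServiceContext API; the temperature of 0 and similarity_top_k of 5 are my own choices:

Code
from llama_index import ServiceContext, VectorStoreIndex
from llama_index.llms import OpenAI

# Pin the query-time LLM instead of relying on the library default
service_context = ServiceContext.from_defaults(
    llm=OpenAI(model="gpt-3.5-turbo", temperature=0)
)
index = VectorStoreIndex.from_documents(documents, service_context=service_context)

# Retrieve five chunks per query rather than the default two
query_engine = index.as_query_engine(similarity_top_k=5)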

Next Steps

  1. Being able to search over multiple documents (sketched below).
  2. Being able to cite sources (also sketched below).
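
A rough first pass at both, reusing the pdf_loader, pdf_url, and VectorStoreIndex from above; the second PDF URL is a placeholder of my own, not a document used in this post:

Code
# Hypothetical second document; any mix of URLs or file paths should work
pdf_urls = [
    pdf_url,  # the Major Projects Report 2015 loaded above
    "https://example.org/another-report.pdf",  # placeholder URL
]

# Pool the chunks from every PDF into a single index
documents = []
for url in pdf_urls:
    documents += pdf_loader.load_data(url)

index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()

response = query_engine.query("What are the main causes of schedule variation?")
print(response)

# The response keeps the retrieved chunks, which can double as citations
for source in response.source_nodes:
    print(source.score, source.node.metadata)
    print(source.node.get_text()[:200])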