App template:

Dynamic Enterprise RAG with SharePoint

Saksham Goel · July 15, 2024

Retrieval Augmented Generation (RAG) applications empower you to deliver context-specific answers based on private knowledge bases using LLMs/Gen AI.

SharePoint, offered through Microsoft 365, is a common data source on which you might want to build your RAG applications. It combines workflow applications, "list" databases, web parts, and security features to help business teams collaborate effectively, and Microsoft Office users widely rely on it to share files in SharePoint document libraries.

Pathway, in turn, helps you build successful Enterprise RAG systems and manage dynamic data sources like Microsoft SharePoint while maintaining high accuracy and reliability.

What is Dynamic RAG?

In practical scenarios, files in data repositories are dynamic, i.e., frequently added, deleted, or modified. These ongoing changes require real-time synchronization and efficient incremental indexing to ensure the most current information is always available.
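To make the idea of incremental synchronization concrete, here is a minimal sketch that compares two snapshots of a repository and reports which files were added, deleted, or modified. It is an illustration only and is not how Pathway implements synchronization internally; the `diff_snapshots` helper, the `ChangeSet` type, and the sample paths are all hypothetical.

```python
from typing import Dict, List, NamedTuple


class ChangeSet(NamedTuple):
    added: List[str]
    deleted: List[str]
    modified: List[str]


def diff_snapshots(previous: Dict[str, int], current: Dict[str, int]) -> ChangeSet:
    """Compare two {path: modified_at} snapshots and report what changed."""
    added = sorted(p for p in current if p not in previous)
    deleted = sorted(p for p in previous if p not in current)
    modified = sorted(
        p for p in current if p in previous and current[p] != previous[p]
    )
    return ChangeSet(added, deleted, modified)


# Hypothetical snapshots taken at two consecutive scans of a repository.
before = {"reports/q1.pdf": 1718810417, "reports/q2.pdf": 1718810500}
after = {"reports/q2.pdf": 1718899999, "reports/q3.pdf": 1718900000}

changes = diff_snapshots(before, after)
print(changes)
```

Only the files listed in the change set need to be re-indexed or dropped from the index, which is what makes incremental indexing cheaper than rebuilding the whole index on every scan.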

Dynamic Enterprise RAG Applications help you build RAG applications that are in permanent sync with your dynamic data sources.

This app template will help you build a Dynamic Enterprise RAG application that integrates with Microsoft SharePoint as a data source. Your application will always provide up-to-date knowledge, synchronized with any file insertions, deletions, or changes at any point in time. It also avoids the need for constant ETL (Extract, Transform, Load) adjustments to account for such changes.

You can easily run this app template in minutes using Docker containers while ensuring the best practices needed in an enterprise setup.

Features of Dynamic Enterprise RAG with SharePoint

Real-Time Synchronization

Dynamic RAG Apps must stay in sync with your data repositories to provide relevant responses.

  • Pathway's SharePoint connector supports both static and streaming modes, enabling real-time synchronization of SharePoint data.
  • The connector ensures that your app continuously indexes documents from SharePoint, maintaining an up-to-date knowledge base.

Imagine senior executives making strategic decisions based on last month's financial reports or outdated project statuses. This lag in information leads to misinformed decisions, missed opportunities, or significant financial losses. Real-time synchronization ensures your app delivers the most current and accurate information, preventing such scenarios.

Detailed Metadata Handling

Enterprise RAG applications include comprehensive metadata such as file paths, modification times, and creation times in the output table. This additional context is crucial for effectively tracking and managing documents.

  • Pathway's streaming mode ensures that this metadata is always up-to-date.

High Security with Certificate-Based Authentication

Enterprise workflows must ensure high security and compliance with enterprise standards.

  • Pathway's certificate-based authentication future-proofs your system against the potential deprecation of simpler authentication methods by SharePoint.
  • For enhanced security, locally deployed LLMs can be set up within an isolated environment, like a Faraday cage, that protects against external interference. This setup ensures that sensitive data remains secure and private, adhering to the highest security standards.

While this template uses the OpenAI API as an example, you can easily swap it with private RAG setups using the additional resources provided at the end.

Scalable and Production-Ready Deployment

Enterprise applications handle vast and ever-growing data sources, often increasing as many users within a company work on them.

  • Pathway provides fast, built-in, and persistent vector indexing for up to millions of pages of documents, eliminating the need for complex ETL processes.
  • Pathway is built for scale, and it offers an integrated solution where the server and endpoints are part of the same application.
  • The easy Docker setup ensures consistency across different environments.

High Accuracy and Enhanced Query Capabilities

Pathway's SharePoint connector allows you to easily query and manage your datasets stored in SharePoint, providing flexible and powerful options for accessing your data.

  • You can configure the connector to read data from specific directories or entire subsites, with options for both recursive and non-recursive scans.
  • Starting with a basic RAG pipeline provides initial accuracy, but leveraging more advanced methods such as hybrid indexing and multimodal search can raise accuracy to 98% or more.

By using this app template, you will leverage Pathway and Microsoft SharePoint to build a dynamic, secure, and efficient Enterprise RAG system tailored to your specific needs.

Prerequisites for the Enterprise RAG App Template

  1. Docker Desktop: You can download it from the Docker website.
  2. OpenAI API Key: Sign up on the OpenAI website and generate an API key from the API Key Management page. Keep this key secure as you will need to use it in your configuration.
  3. Pathway License Key: Get your free license key here.
  4. Certificate-Based Authentication Setup for SharePoint Integration

For better security, we use certificate-based authentication to access data from SharePoint. For this we use Azure AD, which has been renamed Microsoft Entra ID.

You can follow the steps in the video below to create and upload your SSL certificate to obtain necessary parameters for Pathway's SharePoint connector.

How to Create and Register an Application on Microsoft Entra ID (previously Azure AD)

Once done, you will use these parameters to update the config.yaml file to successfully build and deploy your Dynamic Enterprise RAG application using Microsoft SharePoint and Pathway.
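If you want to double-check the thumbprint before pasting it into the configuration, the sketch below computes it from a PEM file. It assumes that the thumbprint shown for your certificate is the uppercase hexadecimal SHA-1 digest of the DER-encoded certificate, which matches the 40-character format used in this template; the `pem_thumbprint` helper is hypothetical and not part of Pathway.

```python
import base64
import hashlib
import re


def pem_thumbprint(pem_text: str) -> str:
    """Return the uppercase hex SHA-1 digest of the first certificate in a PEM string."""
    match = re.search(
        r"-----BEGIN CERTIFICATE-----(.*?)-----END CERTIFICATE-----",
        pem_text,
        re.DOTALL,
    )
    if match is None:
        raise ValueError("no CERTIFICATE block found in PEM text")
    # The PEM body is the base64 encoding of the DER-encoded certificate.
    der_bytes = base64.b64decode("".join(match.group(1).split()))
    return hashlib.sha1(der_bytes).hexdigest().upper()


# Usage (assumes a certificate.pem file in the current directory):
# with open("certificate.pem") as f:
#     print(pem_thumbprint(f.read()))
```

The regular expression picks out only the certificate block, so the helper also works when the .pem file contains the private key alongside the certificate.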

Components of your RAG Pipeline

  • app.py, the application code using Pathway and written in Python.
  • config.yaml, the file containing configuration stubs for the data sources, the LLM model, and the web server.
  • requirements.txt, the dependencies for the pipeline.
  • Dockerfile, the Docker configuration for running the pipeline in the container.
  • .env, an environment variable configuration file where you will store the OpenAI API key.

Step-by-Step Process to Implement Production-Ready Enterprise RAG Pipeline

Step 1: Clone the Pathway LLM App Repository

Clone the llm-app repository from GitHub. This repository contains all the files you’ll need.

git clone https://github.com/pathwaycom/llm-app.git

If you have previously cloned an older version, update it using a pull command.

git pull

Step 2: Navigate to the Project Directory

Make sure you are in the right directory.

cd llm-app/examples/pipelines/demo-question-answering

Step 3: Create a .env File and Add Your OpenAI API Key

Configure your key in a .env file.

OPENAI_API_KEY=sk-*******

Save the file as .env in the demo-question-answering folder.
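Before building the image, you may want to confirm that the file is well formed. The following is a minimal sketch of such a check, assuming the simple KEY=VALUE format shown above; the `parse_env` and `has_openai_key` helpers are hypothetical and are not part of the template.

```python
from typing import Dict


def parse_env(text: str) -> Dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values: Dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def has_openai_key(text: str) -> bool:
    """Return True when OPENAI_API_KEY is present and non-empty."""
    return bool(parse_env(text).get("OPENAI_API_KEY"))


# Placeholder content; in practice, read the text from your .env file.
sample = "# local secrets\nOPENAI_API_KEY=sk-example\n"
print(has_openai_key(sample))
```

To check the real file, pass `open(".env").read()` to `has_openai_key` from inside the demo-question-answering folder.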

Step 4: Update your Pathway license key in app.py

Replace the demo license key in the application code with your free license key. The rest of the code is already configured for you.

pw.set_license_key("demo-license-key-with-telemetry")

Step 5: Update the config.yaml File

Add the SharePoint configuration to the config.yaml file as shown below. You can change the model specification from gpt-3.5-turbo to other OpenAI models like GPT-4 or GPT-4o as needed. Additionally, you can use 300+ LLMs via Pathway's LiteLLM class or use open-source models hosted locally.

llm_config:
  model: "gpt-3.5-turbo"
host_config:
  host: "0.0.0.0"
  port: 8000
cache_options:
  with_cache: True
  cache_folder: "./Cache"
sources:
  - sharepoint_folder:
    kind: sharepoint
    config:
      # The SharePoint connector is part of Pathway's commercial offering; please contact us for a demo.
      # Contact: `contact@pathway.com`
      root_path: ROOT_PATH
      url: SHAREPOINT_URL
      tenant: SHAREPOINT_TENANT
      client_id: SHAREPOINT_CLIENT_ID
      cert_path: SHAREPOINT.pem
      thumbprint: SHAREPOINT_THUMBPRINT
      refresh_interval: 5

Mandatory Parameters:

  • url: The SharePoint site URL, including the site's path. For example: https://company.sharepoint.com/sites/MySite.
  • tenant: The ID of the SharePoint tenant, typically a GUID.
  • client_id: The Client ID of the SharePoint application with the required grants to access the data.
  • cert_path: The path to the certificate (typically a .pem file) added to the application for authentication.
  • thumbprint: The thumbprint for the specified certificate.
  • root_path: The path for a directory or file within the SharePoint space to be read.
  • refresh_interval: Time in seconds between scans if the mode is set to "streaming".

For more details on additional configurations, visit Pathway's SharePoint Connector page.
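A quick way to catch a forgotten parameter is to check the SharePoint settings before starting the app. The sketch below works on a plain dictionary, as you would get after loading config.yaml with a YAML parser; the `unfilled_sharepoint_keys` helper is hypothetical, and `refresh_interval` is left out of the check on the assumption that it only matters in streaming mode.

```python
from typing import Any, Dict, List

# Connection parameters described above.
REQUIRED_KEYS = ("url", "tenant", "client_id", "cert_path", "thumbprint", "root_path")

# Placeholder values from the config.yaml stub that must be replaced.
PLACEHOLDERS = {
    "ROOT_PATH",
    "SHAREPOINT_URL",
    "SHAREPOINT_TENANT",
    "SHAREPOINT_CLIENT_ID",
    "SHAREPOINT.pem",
    "SHAREPOINT_THUMBPRINT",
}


def unfilled_sharepoint_keys(source_config: Dict[str, Any]) -> List[str]:
    """Return required keys that are missing, empty, or still set to a placeholder."""
    problems = []
    for key in REQUIRED_KEYS:
        value = source_config.get(key)
        if not value or value in PLACEHOLDERS:
            problems.append(key)
    return problems


# Example: everything is filled in except the thumbprint.
config = {
    "url": "https://company.sharepoint.com/sites/MySite",
    "tenant": "c2efaf1f-8add-4334-b1ca-32776acb61ea",
    "client_id": "f521a53a-0b36-4f47-8ef7-60dc07587eb2",
    "cert_path": "certificate.pem",
    "thumbprint": "SHAREPOINT_THUMBPRINT",
    "root_path": "Shared Documents/Data",
    "refresh_interval": 5,
}
print(unfilled_sharepoint_keys(config))
```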

Example Configuration:

To illustrate the utility of this connector, consider a scenario where you need to access a dataset stored in the Shared Documents/Data directory of the SharePoint site Datasets. Below is a basic example demonstrating how to configure the connector for reading this dataset in streaming mode:

import pathway as pw

# Read the files stored under Shared Documents/Data on the Datasets site.
t = pw.xpacks.connectors.sharepoint.read(
    url="https://company.sharepoint.com/sites/Datasets",
    tenant="c2efaf1f-8add-4334-b1ca-32776acb61ea",
    client_id="f521a53a-0b36-4f47-8ef7-60dc07587eb2",
    cert_path="certificate.pem",
    thumbprint="33C1B9D17115E848B1E956E54EECAF6E77AB1B35",
    root_path="Shared Documents/Data",
)

In this setup, the connector targets the Shared Documents/Data directory and recursively scans all subdirectories. This method ensures that no file is overlooked, providing comprehensive access to all pertinent data within the specified path.
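The difference between a recursive and a non-recursive scan can be pictured with a small, self-contained sketch. It only illustrates which paths each mode would select under a given root directory and does not call the connector; the `select_paths` helper and the file names are hypothetical.

```python
from pathlib import PurePosixPath
from typing import Iterable, List


def select_paths(paths: Iterable[str], root_path: str, recursive: bool) -> List[str]:
    """Return the paths a scan of root_path would cover in the given mode."""
    root = PurePosixPath(root_path)
    selected = []
    for raw in paths:
        path = PurePosixPath(raw)
        if recursive:
            # Any depth below the root directory counts.
            keep = root in path.parents
        else:
            # Only files sitting directly in the root directory count.
            keep = path.parent == root
        if keep:
            selected.append(raw)
    return selected


files = [
    "Shared Documents/Data/sales.csv",
    "Shared Documents/Data/2024/q1.csv",
    "Shared Documents/Other/notes.txt",
]
print(select_paths(files, "Shared Documents/Data", recursive=True))
print(select_paths(files, "Shared Documents/Data", recursive=False))
```

In this illustration the recursive scan picks up both files under Shared Documents/Data, while the non-recursive scan skips the one in the 2024 subdirectory.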

Step 6: Build the Docker Image for your Enterprise RAG App

This step might take a few minutes. Ensure you have enough space on your device (approximately 8 GB).

docker build -t ragshare .

Step 7: Run the Docker Container

Run the Docker container and expose the app on a port (8000 in this template). If port 8000 is already in use, you can pick another free port instead.

docker run -p 8000:8000 ragshare

Open up another terminal window to follow the next steps.

Step 8: Check the List of Files

Check if your files in SharePoint are indexed for information retrieval for LLMs. To test it, query to get the list of available inputs and associated metadata using curl:

curl -X 'POST' \
  'http://localhost:8000/v1/pw_list_documents' \
  -H 'accept: */*' \
  -H 'Content-Type: application/json'

This returns the list of indexed files. For example, if you start with this file uploaded to your SharePoint site, the response will be as follows:

[{"created_at": null, "modified_at": 1718810417, "owner": "root", "path":"data/IdeanomicsInc_20160330_10-K_EX-10.26_9512211_EX-10.26_Content License Agreement.pdf", "seen_at": 1718902304}]
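The metadata in this response is easier to read once the timestamps are converted to dates. The sketch below parses the sample response shown above, assuming that `modified_at` and `seen_at` are Unix timestamps in seconds; the `summarize` helper is hypothetical.

```python
import json
from datetime import datetime, timezone

# Sample response copied from the output above.
response_text = (
    '[{"created_at": null, "modified_at": 1718810417, "owner": "root", '
    '"path":"data/IdeanomicsInc_20160330_10-K_EX-10.26_9512211_EX-10.26_'
    'Content License Agreement.pdf", "seen_at": 1718902304}]'
)


def to_iso(timestamp):
    """Convert a Unix timestamp in seconds to an ISO 8601 UTC string."""
    if timestamp is None:
        return None
    return datetime.fromtimestamp(timestamp, tz=timezone.utc).isoformat()


def summarize(documents):
    """Return (path, modified_at, seen_at) tuples with readable timestamps."""
    return [
        (doc["path"], to_iso(doc.get("modified_at")), to_iso(doc.get("seen_at")))
        for doc in documents
    ]


for path, modified, seen in summarize(json.loads(response_text)):
    print(f"{path}\n  modified: {modified}\n  seen:     {seen}")
```

Comparing `seen_at` with `modified_at` is a simple way to confirm that the index has picked up the latest version of a file.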

Step 9: Last Step – Run the RAG Service

You can now run the RAG service. Start by asking a simple question. For example:

curl -X 'POST' \
  'http://localhost:8000/v1/pw_ai_answer' \
  -H 'accept: */*' \
  -H 'Content-Type: application/json' \
  -d '{
  "prompt": "What is the start date of the contract?"
}'

This will return the following answer:

December 21, 2015
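If you prefer to query the service from Python rather than curl, the same request can be sketched with the standard library. The endpoint, headers, and payload mirror the curl command above; the `build_answer_request` helper and the default `base_url` are assumptions for illustration.

```python
import json
import urllib.request


def build_answer_request(
    prompt: str, base_url: str = "http://localhost:8000"
) -> urllib.request.Request:
    """Build a POST request for the /v1/pw_ai_answer endpoint."""
    body = json.dumps({"prompt": prompt}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/v1/pw_ai_answer",
        data=body,
        headers={"accept": "*/*", "Content-Type": "application/json"},
        method="POST",
    )


request = build_answer_request("What is the start date of the contract?")
print(request.get_method(), request.full_url)

# Uncomment once the container from Step 7 is running:
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```

Building the request separately from sending it keeps the example runnable even when the service is not up yet.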

Conclusions

In this app template, you:

  • Learned about Dynamic RAG and key considerations for Enterprise RAG applications.
  • Successfully created and deployed a Dynamic Enterprise RAG application using Pathway with Microsoft SharePoint as a data source.

By leveraging the combined power of Pathway and Microsoft SharePoint, you built a secure, efficient, and scalable Enterprise RAG system tailored to your specific needs. Pathway enables you to rapidly deliver production-ready, enterprise-grade LLM projects at a fraction of the cost. This traditional RAG setup can be refined with rerankers, adaptive RAG, multimodal RAG, and other techniques.

Additional Resources on Enterprise RAG

  • Slides AI Search: Set up high accuracy multimodal RAG pipelines for presentations and PDFs on the Slides AI Search GitHub repo. This template helps you build a multi-modal search service using GPT-4o with Metadata Extraction and Vector Index. You can also try out the hosted demo here.
  • Private RAG with Connected Data Sources using Mistral, Ollama, and Pathway: Set up a private RAG pipeline with adaptive retrieval using Pathway, Mistral, and Ollama. This app template allows you to run the entire application locally while ensuring low costs without compromising on accuracy, making it ideal for production use-cases with sensitive data and explainable AI needs. Get started with the app template here.
  • Multimodal RAG for PDFs with Text, Images, and Charts: This showcase demonstrates how you can launch a MultiModal RAG pipeline that utilizes GPT-4o in the parsing stage. Pathway extracts information from unstructured financial documents in your folders, updating results as documents change or new ones arrive. Learn more here.

Need More Help?

Pathway is trusted by thousands of developers and enterprises, including Intel, Formula 1 Teams, CMA CGM, and more. Reach out for assistance with your enterprise applications. Contact us here to discuss your project needs or to request a demo.

Troubleshooting

To provide feedback or report a bug, please raise an issue on our issue tracker. You can also join the Pathway Discord server (#get-help) and let us know how the Pathway community can help you.