
Real-time Enterprise RAG with SharePoint

Saksham Goel

Published July 15, 2024 · Updated July 15, 2024

Real-time Enterprise RAG with SharePoint and Pathway

Retrieval Augmented Generation (RAG) applications empower you to deliver context-specific answers based on private knowledge bases using LLMs/Gen AI.

SharePoint offered via Microsoft 365 is a common data source on which you might want to build your RAG applications. Microsoft SharePoint leverages workflow applications, "list" databases, and other web parts and security features to enable business teams to collaborate effectively and is widely used by Microsoft Office users for sharing files in a SharePoint document library.

Pathway is a framework for building Enterprise RAG systems that work with live enterprise data. It manages dynamic data sources like Microsoft SharePoint while maintaining high accuracy, real-time synchronization, and reliability.

What is Real-time RAG?

In practical scenarios, files in data repositories are dynamic, i.e., frequently added, deleted, or modified. These ongoing changes require real-time synchronization and efficient incremental indexing to ensure the most current information is always available.

Real-time Enterprise RAG Applications stay in permanent sync with your dynamic data sources.
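
To make the idea of incremental indexing concrete, here is a minimal Python sketch that compares two snapshots of a document library and works out which files need to be indexed, re-indexed, or dropped. It is an illustration only, not how Pathway implements synchronization; the function name, the snapshot format (a mapping from path to modification time), and the sample paths are assumptions made for this example.

```python
from typing import Dict, List, NamedTuple


class SyncPlan(NamedTuple):
    added: List[str]
    modified: List[str]
    removed: List[str]


def plan_incremental_sync(previous: Dict[str, int], current: Dict[str, int]) -> SyncPlan:
    """Compare two {path: modified_at} snapshots and list what changed."""
    added = sorted(p for p in current if p not in previous)
    removed = sorted(p for p in previous if p not in current)
    # A file counts as modified when it exists in both snapshots
    # but its modification time differs.
    modified = sorted(
        p for p in current if p in previous and current[p] != previous[p]
    )
    return SyncPlan(added, modified, removed)


# Hypothetical snapshots taken at two consecutive scans.
previous = {"reports/q1.pdf": 1718810417, "reports/q2.pdf": 1718810500}
current = {"reports/q2.pdf": 1718902304, "reports/q3.pdf": 1718902400}

plan = plan_incremental_sync(previous, current)
print(plan)
```

Only the files listed in the plan need to be touched, which is what keeps the index current without reprocessing the whole repository on every change.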

This app template will help you build a Real-time Enterprise RAG application that integrates with Microsoft SharePoint as a data source. Your application will always provide up-to-date knowledge, synchronized with any file insertions, deletions, or changes as they happen. It also removes the need to keep adjusting ETL (Extract, Transform, and Load) jobs to handle these inevitable changes.

You can easily run this app template in minutes using Docker containers while ensuring the best practices needed in an enterprise setup.

Real-time RAG with SharePoint

Real-time RAG with SharePoint refers to an approach where RAG is integrated with Microsoft SharePoint as the data source and enhanced by real-time AI capabilities. In this setup, the application can:

Continuously index documents as they're added, changed, or removed in SharePoint.
Securely authenticate and manage documents behind enterprise-grade permissions and certificate-based authentication.
Provide up-to-date answers with minimal latency, thanks to real-time synchronization.

Features of Real-time Enterprise RAG with SharePoint

Real-Time Synchronization

Real-time RAG Apps must stay in sync with your data repositories to provide relevant responses.

Pathway's SharePoint connector supports both static and streaming modes.
In streaming mode, the connector ensures that your app continuously indexes documents from SharePoint, maintaining an up-to-date knowledge base.

Imagine senior executives making strategic decisions based on last month's financial reports or outdated project statuses. This lag in information leads to misinformed decisions, missed opportunities, or significant financial losses. Real-time synchronization ensures your app delivers the most current and accurate information, preventing such scenarios.

Detailed Metadata Handling

Enterprise RAG applications include comprehensive metadata such as file paths, modification times, and creation times in the output table. This additional context is crucial for effectively tracking and managing documents.

Pathway's streaming mode ensures that this metadata is always up-to-date.

High Security with Certificate-Based Authentication

Enterprise workflows must ensure high security and compliance with enterprise standards.

Pathway's certificate-based authentication future-proofs your system against the potential deprecation of simpler authentication methods by SharePoint.
For enhanced security, locally deployed LLMs can be set up within an isolated environment, like a Faraday cage, that protects against external interference. This setup ensures that sensitive data remains secure and private, adhering to the highest security standards.

While this template uses the OpenAI API as an example, you can easily swap it with private RAG setups using the additional resources provided at the end.

Scalable and Production-Ready Deployment

Enterprise applications handle vast and ever-growing data sources, often increasing as many users within a company work on them.

Pathway provides fast, built-in, and persistent vector indexing for up to millions of pages of documents, eliminating the need for complex ETL processes.
Pathway is built for scale, and it offers an integrated solution where the server and endpoints are part of the same application.
The easy Docker setup ensures consistency across different environments.

High Accuracy and Enhanced Query Capabilities

Pathway's SharePoint connector allows you to easily query and manage your datasets stored in SharePoint, providing flexible and powerful options for accessing your data.

You can configure the connector to read data from specific directories or entire subsites, with options for both recursive and non-recursive scans.
Starting with a basic RAG pipeline provides initial accuracy, but leveraging more advanced methods such as hybrid indexing and multimodal search can significantly increase accuracy.

Step-by-Step Process to Implement a Production-Ready RAG with SharePoint Connector

This template guides you in connecting Pathway with SharePoint to build a real-time Enterprise RAG app.

Important: The SharePoint connector requires a Pathway license key. If you haven't already, request your free license key to unlock the SharePoint connector and other enterprise features. You will add this key to the application in Step 5.

Prerequisites for the Enterprise RAG App Template

Docker Desktop: You can download it from the Docker website.
OpenAI API Key: Sign up on the OpenAI website and generate an API key from the API Key Management page. Keep this key secure as you will need to use it in your configuration.
Pathway License Key: Get your free license key here.

Certificate-Based Authentication Setup for SharePoint Integration

For better security, we use certificate-based authentication to access data from SharePoint. For this we use Azure AD, which has been renamed to Microsoft Entra ID.

You can follow the steps in the video below to create and upload your SSL certificate to obtain necessary parameters for Pathway's SharePoint connector.

Once done, you will use these parameters to update the app.yaml file to successfully build and deploy your Real-time Enterprise RAG with Microsoft SharePoint and Pathway.

Components of your Real-time RAG Pipeline

The template folder contains several files:

app.py, the application code using Pathway and written in Python;
app.yaml, the file containing configuration of the pipeline, like LLM models, data sources or server address;
requirements.txt, the dependencies for the pipeline. It can be passed to pip install -r ... to install everything that is needed to launch the pipeline locally;
Dockerfile, the Docker configuration for running the pipeline in the container;
.env, a short environment variables configuration file where the OpenAI key must be stored;
ui/, a simple UI written in Streamlit for asking questions.

Step 1: Clone the Pathway LLM App Repository

Clone the llm-app repository from GitHub. This repository contains all the files you’ll need.

git clone https://github.com/pathwaycom/llm-app.git

If you have previously cloned an older version, update it using a pull command.

git pull

Step 2: Navigate to the Question-Answering RAG Directory

Change to the directory where the example is located:

cd llm-app/templates/question_answering_rag

Step 3: Create a `.env` File and Add Your OpenAI API Key

Rename the .env.example file in the project directory to .env and update it with your OpenAI API key:

OPENAI_API_KEY=sk-*******

Save the file after making the changes.

Step 4: Modify the `app.yaml` File

By default, the YAML configuration reads documents from a local data folder. If files need to be pulled from external repositories—such as SharePoint, Google Drive, or Amazon S3—Pathway provides seamless integration through dedicated connectors. In this configuration, the !pw.xpacks.connectors.sharepoint.read block replaces the default local source, allowing documents to be directly ingested from SharePoint with metadata enrichment and periodic refresh intervals.

For the LLM service, the configuration below uses gpt-4o, but you can switch to other OpenAI models such as gpt-3.5-turbo or GPT-4 as needed. Additionally, Pathway supports 300+ LLMs through its LiteLLM integration, offering flexibility in model selection. You can also integrate open-source models hosted locally, which gives you full control over inference and deployment and helps with privacy and cost efficiency.

app.yaml

$sources:
  - !pw.xpacks.connectors.sharepoint.read 
    url: $SHAREPOINT_URL
    tenant: $SHAREPOINT_TENANT
    client_id: $SHAREPOINT_CLIENT_ID
    cert_path: sharepointcert.pem
    thumbprint: $SHAREPOINT_THUMBPRINT
    root_path: $SHAREPOINT_ROOT
    with_metadata: true
    refresh_interval: 30

$llm: !pw.xpacks.llm.llms.OpenAIChat
  model: "gpt-4o"
  retry_strategy: !pw.udfs.ExponentialBackoffRetryStrategy
    max_retries: 6
  cache_strategy: !pw.udfs.DefaultCache {}
  temperature: 0
  capacity: 8

$embedder: !pw.xpacks.llm.embedders.OpenAIEmbedder
  model: "text-embedding-ada-002"
  cache_strategy: !pw.udfs.DefaultCache {}

$splitter: !pw.xpacks.llm.splitters.TokenCountSplitter
  max_tokens: 400

$parser: !pw.xpacks.llm.parsers.UnstructuredParser
  cache_strategy: !pw.udfs.DefaultCache {}

$retriever_factory: !pw.stdlib.indexing.BruteForceKnnFactory
  reserved_space: 1000
  embedder: $embedder
  metric: !pw.stdlib.indexing.BruteForceKnnMetricKind.COS
  dimensions: 1536
  
$document_store: !pw.xpacks.llm.document_store.DocumentStore
  docs: $sources
  parser: $parser
  splitter: $splitter
  retriever_factory: $retriever_factory

question_answerer: !pw.xpacks.llm.question_answering.BaseRAGQuestionAnswerer
  llm: $llm
  indexer: $document_store
  # You can set the number of documents to be included as the context of the query
  # search_topk: 6
  # You can use your own prompt for querying.
  # For that set prompt_template to string with `{query}` used as a placeholder for the question,
  # and `{context}` as a placeholder for context documents.
  # prompt_template: "Given these documents: {context}, please answer the question: {query}"

# Change host and port by uncommenting these lines
# host: "0.0.0.0"
# port: $PATHWAY_PORT

# Cache configuration
# with_cache: true

# If `terminate_on_error` is true then the program will terminate whenever any error is encountered.
# Defaults to false, uncomment the following line if you want to set it to true
# terminate_on_error: true

Mandatory Parameters:

url: The SharePoint site URL, including the site's path. For example: https://company.sharepoint.com/sites/MySite.
tenant: The ID of the SharePoint tenant, typically a GUID.
client_id: The Client ID of the SharePoint application with the required grants to access the data.
cert_path: The path to the certificate (typically a .pem file) added to the application for authentication.
thumbprint: The thumbprint for the specified certificate.
root_path: The path for a directory or file within the SharePoint space to be read.
refresh_interval: Time in seconds between scans when the connector runs in "streaming" mode. Unlike the parameters above, it can be left out, as in the example configuration below.

For more details on additional configurations, visit Pathway's SharePoint Connector page.
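
Before starting the pipeline, it can help to check that every connector parameter is present and well formed. The following is a minimal sketch of such a check, not part of the template: the function name is hypothetical, the environment variable names simply mirror the placeholders in app.yaml, and the format checks (GUID tenant, 40-character hexadecimal thumbprint) are assumptions based on the example values in this article.

```python
import re
import uuid
from typing import Dict, Mapping

# Connector parameter -> environment variable (names taken from the
# placeholders in app.yaml; adjust them to your own setup).
REQUIRED = {
    "url": "SHAREPOINT_URL",
    "tenant": "SHAREPOINT_TENANT",
    "client_id": "SHAREPOINT_CLIENT_ID",
    "thumbprint": "SHAREPOINT_THUMBPRINT",
    "root_path": "SHAREPOINT_ROOT",
}


def load_sharepoint_settings(env: Mapping[str, str]) -> Dict[str, str]:
    """Collect the SharePoint settings and fail early on obvious mistakes."""
    missing = [var for var in REQUIRED.values() if not env.get(var)]
    if missing:
        raise ValueError("Missing SharePoint settings: " + ", ".join(missing))

    settings = {key: env[var].strip() for key, var in REQUIRED.items()}

    if not settings["url"].startswith("https://"):
        raise ValueError("url should be an https:// SharePoint site URL")
    # Assumption: the tenant is given as a GUID (the article says "typically").
    uuid.UUID(settings["tenant"])  # raises ValueError if it is not a GUID
    # Assumption: the thumbprint is 40 hexadecimal characters, as in the example.
    if not re.fullmatch(r"[0-9A-Fa-f]{40}", settings["thumbprint"]):
        raise ValueError("thumbprint should be 40 hexadecimal characters")
    return settings


# Sample values copied from the example configuration in this article.
example_env = {
    "SHAREPOINT_URL": "https://company.sharepoint.com/sites/Datasets",
    "SHAREPOINT_TENANT": "c2efaf1f-8add-4334-b1ca-32776acb61ea",
    "SHAREPOINT_CLIENT_ID": "f521a53a-0b36-4f47-8ef7-60dc07587eb2",
    "SHAREPOINT_THUMBPRINT": "33C1B9D17115E848B1E956E54EECAF6E77AB1B35",
    "SHAREPOINT_ROOT": "Shared Documents/Data",
}
settings = load_sharepoint_settings(example_env)
print(settings["root_path"])
```

In practice you would pass os.environ instead of the sample dictionary, so a missing or mistyped value surfaces before the connector attempts to authenticate.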

Example Configuration:

To illustrate the utility of this connector, consider a scenario where you need to access a dataset stored in the Shared Documents/Data directory of the SharePoint site Datasets. Below is a basic example demonstrating how to configure the connector for reading this dataset in streaming mode:

import pathway as pw

# Read every file under Shared Documents/Data on the Datasets site.
t = pw.xpacks.connectors.sharepoint.read(
    url="https://company.sharepoint.com/sites/Datasets",
    tenant="c2efaf1f-8add-4334-b1ca-32776acb61ea",
    client_id="f521a53a-0b36-4f47-8ef7-60dc07587eb2",
    cert_path="certificate.pem",
    thumbprint="33C1B9D17115E848B1E956E54EECAF6E77AB1B35",
    root_path="Shared Documents/Data",
)

In this setup, the connector targets the Shared Documents/Data directory and recursively scans all subdirectories. This method ensures that no file is overlooked, providing comprehensive access to all pertinent data within the specified path.
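
Once documents are ingested, the retriever configured in app.yaml ranks chunks with a brute-force nearest-neighbor search using the cosine metric. The sketch below illustrates that idea in plain Python. It is not Pathway's implementation: the function names are hypothetical, and the three-dimensional vectors are made-up stand-ins for the 1536-dimensional embeddings produced by the embedder.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query: Sequence[float],
    chunks: List[Tuple[str, Sequence[float]]],
    k: int,
) -> List[str]:
    """Score every chunk against the query and keep the k most similar."""
    ranked = sorted(
        chunks,
        key=lambda item: cosine_similarity(query, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


# Toy embeddings; real ones come from the configured embedder.
chunks = [
    ("contract start date", [0.9, 0.1, 0.0]),
    ("payment terms", [0.1, 0.9, 0.0]),
    ("termination clause", [0.5, 0.5, 0.0]),
]
query_vector = [1.0, 0.0, 0.0]

context = top_k(query_vector, chunks, k=2)
print(context)
```

A brute-force scan compares the query against every stored vector, which is simple and exact; the number of chunks kept corresponds to the search_topk setting shown commented out in app.yaml.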

Step 5: Obtain and Update the Pathway License Key in `app.py`

Pathway is an open-source framework that provides core functionalities for free. However, to use advanced features like the SharePoint connector, you need a Pathway license key. This key unlocks additional enterprise-grade capabilities such as enhanced RAM limits, enterprise connectors (e.g., SharePoint, Delta Table, Iceberg), full persistence, and monitoring.

To obtain your free license key, visit Pathway License Key Page and follow the instructions.

Once you have the key, update it in app.py by replacing the existing demo key:

# Set up the license key for using the SharePoint feature
pw.set_license_key("demo-license-key-with-telemetry")
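
If you prefer not to hard-code the key in app.py, one option is to read it from an environment variable. The sketch below is an assumption-laden illustration rather than part of the template: the variable name PATHWAY_LICENSE_KEY and the helper function are hypothetical, and only pw.set_license_key comes from the snippet above.

```python
import os
from typing import Mapping

# Hypothetical variable name; pick whatever fits your deployment.
LICENSE_ENV_VAR = "PATHWAY_LICENSE_KEY"


def resolve_license_key(env: Mapping[str, str]) -> str:
    """Return the license key from the environment, or fail with a clear message."""
    key = env.get(LICENSE_ENV_VAR, "").strip()
    if not key:
        raise RuntimeError(
            f"Set {LICENSE_ENV_VAR} to your Pathway license key before starting the app"
        )
    return key


# Example with an explicit mapping; in app.py you would pass os.environ and
# then call: pw.set_license_key(resolve_license_key(os.environ))
print(resolve_license_key({LICENSE_ENV_VAR: "  my-license-key  "}))
```

Keeping the key out of the source file makes it easier to avoid committing it to version control, in the same way the OpenAI key is kept in .env.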

Step 6: Running the Project

Locally

If you are using Windows, refer to the Docker instructions in the next section. For a local run, first install the dependencies:

pip install -r requirements.txt

Then, start the app:

python app.py

With Docker

Build the Docker image with:

docker compose build

Then run it with:

docker compose up

This will start the pipeline and the UI for asking questions.

Step 7: Querying the Pipeline

Check the Indexed Files

Check if your files in SharePoint are indexed for information retrieval for LLMs. To test it, query to get the list of available inputs and associated metadata using curl:

curl -X 'POST' \
  'http://localhost:8000/v2/list_documents' \
  -H 'accept: */*' \
  -H 'Content-Type: application/json'

This returns the list of indexed files. For example, if you start with a single contract PDF uploaded to your SharePoint, the response looks like this:

[{"created_at": null, "modified_at": 1718810417, "owner": "root", "path":"data/IdeanomicsInc_20160330_10-K_EX-10.26_9512211_EX-10.26_Content License Agreement.pdf", "seen_at": 1718902304}]

If you add or remove files from the connected folder, repeat the request to see the updated index. The service logs will display the progress of indexing new and modified files.
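
To read this response more comfortably, you can convert the timestamps into dates. The following sketch parses the sample response shown above. It assumes that modified_at and seen_at are Unix timestamps in seconds, which their magnitude suggests but the article does not state; the function name and the output field names are hypothetical.

```python
import json
from datetime import datetime, timezone
from typing import Any, Dict, List

# Sample response copied from the article.
raw_response = (
    '[{"created_at": null, "modified_at": 1718810417, "owner": "root", '
    '"path":"data/IdeanomicsInc_20160330_10-K_EX-10.26_9512211_EX-10.26_'
    'Content License Agreement.pdf", "seen_at": 1718902304}]'
)


def summarize_documents(raw: str) -> List[Dict[str, Any]]:
    """Turn the list_documents payload into rows with readable UTC times."""
    rows = []
    for doc in json.loads(raw):
        # Assumption: both fields are Unix epoch seconds.
        modified = datetime.fromtimestamp(doc["modified_at"], tz=timezone.utc)
        seen = datetime.fromtimestamp(doc["seen_at"], tz=timezone.utc)
        rows.append(
            {
                "path": doc["path"],
                "modified_at": modified,
                "seen_at": seen,
                "seconds_between": doc["seen_at"] - doc["modified_at"],
            }
        )
    return rows


for row in summarize_documents(raw_response):
    print(row["path"])
    print("  modified:", row["modified_at"].isoformat())
    print("  seen:    ", row["seen_at"].isoformat())
```

Comparing the two times after you edit a file in SharePoint is a quick way to confirm that the change has reached the index.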

Ask a Question

You can now run the RAG service. Start by asking a simple question. For example:

curl -X 'POST' \
  'http://localhost:8000/v2/answer' \
  -H 'accept: */*' \
  -H 'Content-Type: application/json' \
  -d '{
  "prompt": "What is the start date of the contract?"
}'

This will return the following answer:

{"response": "The start date of the contract is December 21, 2015."}

If the answer is in any of your indexed documents, the pipeline returns it based on the latest synchronized version of those documents.
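
If you would rather query the service from Python than from curl, the request can be sketched with the standard library alone. The endpoint, headers, and payload mirror the curl command above; the function names and the default base URL (http://localhost:8000) are assumptions for this example, and the response is assumed to have the {"response": ...} shape shown above.

```python
import json
import urllib.request
from typing import Any, Dict


def build_answer_request(
    prompt: str, base_url: str = "http://localhost:8000"
) -> urllib.request.Request:
    """Build the POST request for the /v2/answer endpoint."""
    payload = json.dumps({"prompt": prompt}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/v2/answer",
        data=payload,
        headers={"accept": "*/*", "Content-Type": "application/json"},
        method="POST",
    )


def ask(
    prompt: str, base_url: str = "http://localhost:8000", timeout: float = 30.0
) -> Dict[str, Any]:
    """Send the question to a running pipeline and decode the JSON reply."""
    request = build_answer_request(prompt, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Building the request does not need a running service.
request = build_answer_request("What is the start date of the contract?")
print(request.get_method(), request.full_url)

# With the pipeline running, you could then call:
# print(ask("What is the start date of the contract?")["response"])
```

Separating request construction from sending keeps the example checkable without a live server and makes it easy to point the same code at a different host or port.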

Conclusions

In this app template, you:

Learned about Real-time RAG and key considerations for Enterprise RAG applications.
Successfully created and deployed an Enterprise RAG application using Pathway with Microsoft SharePoint as a data source.

By leveraging the combined power of Pathway and Microsoft SharePoint, you built a secure, efficient and scalable Enterprise RAG system tailored to your specific needs. This traditional RAG setup can be refined with rerankers, adaptive RAG, multimodal RAG, and other techniques.

Additional Resources on Enterprise RAG

Slides AI Search: Set up high accuracy multimodal RAG pipelines for presentations and PDFs on the Slides AI Search GitHub repo. This template helps you build a multi-modal search service using GPT-4o with Metadata Extraction and Vector Index. You can also try out the hosted demo here.
Private RAG with Connected Data Sources using Mistral, Ollama, and Pathway: Set up a private RAG pipeline with adaptive retrieval using Pathway, Mistral, and Ollama. This app template allows you to run the entire application locally while ensuring low costs without compromising on accuracy, making it ideal for production use-cases with sensitive data and explainable AI needs. Get started with the app template here.
Multimodal RAG for PDFs with Text, Images, and Charts: This showcase demonstrates how you can launch a MultiModal RAG pipeline that utilizes GPT-4o in the parsing stage. Pathway extracts information from unstructured financial documents in your folders, updating results as documents change or new ones arrive. Learn more here.

Are you looking to build an Enterprise RAG app?

Pathway is trusted by industry leaders such as NATO and Intel, and is natively available on both AWS and Azure Marketplaces. If you'd like to explore how Pathway can support your RAG and Generative AI initiatives, we invite you to schedule a discovery session with our team.

Schedule a 15-minute demo with one of our experts to see how Pathway can be the right solution for your enterprise needs.

Troubleshooting

To provide feedback or report a bug, please raise an issue on our issue tracker. You can also join the Pathway Discord server (#get-help) and let us know how the Pathway community can help you.