How the Project Works
The project accomplishes its tasks through a series of steps shared below. Make sure you give it a quick read before proceeding to the next step where we explore the repository.
1 - Realtime Indexing:
- Sourcing data: A script called
discounts-data-generator.py
mimics real-time data from external sources. It creates or updates a file nameddiscounts.csv
with random data. Alongside this, a scheduled task (cron job) using Python Crontab runs every minute to fetch the latest data from Rainforest API. Crontab is a time-based job scheduler. - Giving the option to choose particular data source(s): With the Streamlit UI provided, either select Rainforest API as a data source or upload a CSV through the UI file-uploader. It then maps each row into a JSONline schema for better managing large data sets. This format helps in managing large datasets by representing each row as a separate JSON object
- Chunking: The documents are divided into shorter sections for them to be converted into vector embeddings.
- Embedding of data source: These shorter sections are processed through the OpenAI API to generate embeddings.
- Real-time Vector Indexing: An index is created based on these embeddings to facilitate quick searching later on.
2 - Query (Prompt) Embedding and Retrieval
- Query Embedding: For any question asked by the user, an embedding is generated using the OpenAI API for embeddings, i.e.,
text-embedding-ada-002
. - Retrieving: The system compares the vector embedding of the query/prompt and the vector embedding of the data source to find the most relevant information.
3 - Prompt Augmentation and Answer Generation
- The query/prompt and the most relevant sections of data are packaged into a message within the token limit.
- Get an Answer from GPT: This message is sent to
gpt-3.5-turbo
, which then provides an answer.