ask_youtube_playlists.data_processing package

Submodules

ask_youtube_playlists.data_processing.create_documents module

Contains functions to create documents from json files or their corresponding python objects.

ask_youtube_playlists.data_processing.create_documents.extract_documents_from_list_of_dicts(json_data: List[dict], text_key: str = 'text') → List[Document]

Extracts documents from a list of dictionaries.
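As a rough illustration of this kind of extraction, the sketch below builds one document per dictionary. It is not the package's implementation: the `Document` dataclass is a stand-in for the real class, `extract_documents_sketch` is a hypothetical name, and keeping every key other than `text_key` as metadata is an assumption.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Document:
    """Stand-in for the real Document class (assumption, for illustration only)."""
    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def extract_documents_sketch(json_data: List[dict], text_key: str = "text") -> List[Document]:
    """Hypothetical re-implementation: one Document per dictionary."""
    documents = []
    for item in json_data:
        # Assumption: every key other than text_key is kept as metadata.
        metadata = {key: value for key, value in item.items() if key != text_key}
        documents.append(Document(page_content=item[text_key], metadata=metadata))
    return documents


chunks = [
    {"text": "Hello and welcome.", "start": 0.0},
    {"text": "Today we cover embeddings.", "start": 4.2},
]
docs = extract_documents_sketch(chunks)
print([doc.page_content for doc in docs])
```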

ask_youtube_playlists.data_processing.create_documents.get_documents_from_directory(directory_path: str | PathLike, start_with: str = '', text_key: str = 'text') → List[List[Document]]

Extracts the documents from a directory with json files.

Deprecated: use extract_documents_from_list_of_dicts instead.

Parameters:

directory_path (Union[str, os.PathLike]) – Path to the directory with the json files. Usually …/data/playlist_name/processed.
start_with (str) – The json files must start with this string. Defaults to “”.
text_key (str) – The key of the text field. Defaults to “text”.
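The file selection controlled by start_with could look roughly like the sketch below. It only loads the raw dictionaries, and each inner list could then be turned into documents. The helper name `load_json_files_sketch` is hypothetical, and reading the files in sorted order is an assumption, not documented behaviour.

```python
import json
import tempfile
from pathlib import Path
from typing import List


def load_json_files_sketch(directory_path, start_with: str = "") -> List[List[dict]]:
    """Hypothetical helper: load every matching json file as a list of dicts."""
    directory = Path(directory_path)
    loaded = []
    # Sorted so that the result order is deterministic (an assumption).
    for file_path in sorted(directory.glob("*.json")):
        if file_path.name.startswith(start_with):
            with file_path.open(encoding="utf-8") as file:
                loaded.append(json.load(file))
    return loaded


# Demo on a temporary directory with one matching and one non-matching file.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "video_a.json").write_text(json.dumps([{"text": "first"}]), encoding="utf-8")
    (Path(tmp) / "notes.json").write_text(json.dumps([{"text": "skipped"}]), encoding="utf-8")
    per_file = load_json_files_sketch(tmp, start_with="video_")

print(per_file)
```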

ask_youtube_playlists.data_processing.create_embeddings module

Functions to create the Vector database.

class ask_youtube_playlists.data_processing.create_embeddings.EmbeddingModelSpec(model_name: str, model_type: str, max_seq_length: int)

Bases: object

Class to store the specification of an embedding model.

model_name

The name of the embedding model.

Type: str

model_type

The type of the embedding model. Can be sentence-transformers or openai.

Type: str

max_seq_length

The maximum number of tokens the model can handle.

Type: int

max_seq_length: int

model_name: str

model_type: str

ask_youtube_playlists.data_processing.create_embeddings.create_embeddings_pipeline(retriever_directory: str | PathLike, embedding_model_name: str, max_chunk_size: int, min_overlap_size: int, use_st_progress_bar: bool = True) → None

Sets up the embeddings for the given embedding model in the directory.

Steps:

Creates the retriever_directory if it does not exist.
Creates the hyperparams.yaml file.
Chunks the data.
Creates the embeddings and saves them in the retriever_directory.

Parameters:

retriever_directory (PathLike) – The directory where the embeddings will be saved. It should be inside a data/playlist_name directory. This function assumes that the playlist directory contains a raw directory with the json files of each video.
embedding_model_name (str) – The name of the embedding model.
max_chunk_size (int) – The maximum number of characters in a chunk.
min_overlap_size (int) – The minimum number of characters in the overlap between two consecutive chunks.
use_st_progress_bar (bool) – Whether to use the Streamlit progress bar or not.

ask_youtube_playlists.data_processing.create_embeddings.create_vectorstore(embedding_model_name: str, documents: List[Document], vector_store_type: str = 'in-memory', **kwargs) → VectorStore

Returns a vector store that contains the vectors of the documents.

Currently, it only supports “in-memory” mode. In the future, it may support “chroma-db” mode as well.

Note

In order to be able to make the vector store persistent, the vector_store_type should be chroma-db and the kwargs should contain the persist_directory argument with the path to the directory where the vector store will be saved or loaded from. The persist_directory is where Chroma will store its database files on disk, and load them on start.

Parameters:

embedding_model_name (str) – The name of the embedding model.
documents (List[Document]) – List of documents.
vector_store_type (str) – The vector store type. Can be chroma-db or in-memory.
**kwargs – Additional arguments passed to the from_documents method.

Raises:

ValueError – If the persist_directory argument is not provided when the vector store type is chroma-db.

ask_youtube_playlists.data_processing.create_embeddings.get_embedding_model(embedding_model_name: str) → Embeddings

Returns the embedding model.

Parameters: embedding_model_name (str) – The name of the embedding model.
Raises: ValueError – If the model type is not supported.

ask_youtube_playlists.data_processing.create_embeddings.get_embedding_spec(model_name: str) → EmbeddingModelSpec

Returns the embedding model specification.

Parameters: model_name (str) – The name of the embedding model.
Raises: ValueError – If the model name is not supported.
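A specification lookup of this shape can be sketched as a dictionary keyed by model name. The sketch assumes EmbeddingModelSpec is a plain dataclass with the three documented fields. The registry `_SPECS`, the function name `get_embedding_spec_sketch`, and the model names and sequence lengths are placeholders, not the package's real list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EmbeddingModelSpec:
    """Mirror of the documented fields (assumed to be a plain dataclass)."""
    model_name: str
    model_type: str
    max_seq_length: int


# Hypothetical registry: names and limits are placeholders for illustration.
_SPECS = {
    spec.model_name: spec
    for spec in (
        EmbeddingModelSpec("example-st-model", "sentence-transformers", 256),
        EmbeddingModelSpec("example-openai-model", "openai", 8191),
    )
}


def get_embedding_spec_sketch(model_name: str) -> EmbeddingModelSpec:
    """Return the spec for a known model, or raise ValueError otherwise."""
    try:
        return _SPECS[model_name]
    except KeyError:
        raise ValueError(f"Embedding model {model_name!r} is not supported.") from None


spec = get_embedding_spec_sketch("example-st-model")
print(spec.model_type, spec.max_seq_length)
```

Keeping the specs in one mapping means an unsupported name fails early with a ValueError, before any model is loaded.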

ask_youtube_playlists.data_processing.create_embeddings.load_embeddings(embedding_directory: str | PathLike) → List[ndarray]

Loads the embeddings from the embedding_directory.

Parameters:

embedding_directory (PathLike) – The directory where the embeddings are saved.

Returns:

The embeddings. The order of the embeddings in the list is the same as the order of the json files in the processed directory.

Return type:

List[np.ndarray]

ask_youtube_playlists.data_processing.create_embeddings.load_hyperparams(directory: str | PathLike) → Dict[str, str | int]

Loads the hyperparams.yaml file in the directory.

ask_youtube_playlists.data_processing.create_embeddings.load_vectorstore(persist_directory: str | PathLike) → Chroma

Loads a vectorstore from the local disk.

Parameters: persist_directory (Union[str, os.PathLike]) – The directory where the vectorstore is saved.
Returns: The Chroma vectorstore.
Return type: VectorStore

ask_youtube_playlists.data_processing.create_embeddings.save_json(chunked_data: List[dict], path: Path, file_name: str) → None

Saves the data in a json file.

Parameters:

chunked_data (List[dict]) – The data to be saved.
path (PathLike) – The path to the json file.
file_name (str) – The name of the json file.
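A minimal sketch of saving chunked data with the same three arguments is shown below. It assumes that path is the target directory and file_name is the file created inside it, which the parameter descriptions do not state explicitly. The name `save_json_sketch` and the indentation setting are illustrative choices.

```python
import json
import tempfile
from pathlib import Path
from typing import List


def save_json_sketch(chunked_data: List[dict], path: Path, file_name: str) -> None:
    """Hypothetical re-implementation: write the data to path / file_name."""
    # Assumption: `path` is a directory, created on demand.
    path.mkdir(parents=True, exist_ok=True)
    with (path / file_name).open("w", encoding="utf-8") as file:
        json.dump(chunked_data, file, ensure_ascii=False, indent=2)


# Round trip on a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    target_dir = Path(tmp) / "processed"
    save_json_sketch([{"text": "chunk one"}], target_dir, "video.json")
    saved = json.loads((target_dir / "video.json").read_text(encoding="utf-8"))

print(saved)
```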

ask_youtube_playlists.data_processing.create_embeddings.save_vectorstore(chroma_vectorstore: Chroma) → None

Persists the vectorstore to the local disk.

The vectorstore is saved in the persist directory indicated when the vectorstore was created.

Parameters: chroma_vectorstore (VectorStore) – The vectorstore.

ask_youtube_playlists.data_processing.download_transcripts module

Code to download the transcripts from YouTube.

ask_youtube_playlists.data_processing.download_transcripts.create_chunked_data(file_path: Path, max_chunk_size: int, min_overlap_size: int) → List[Dict[str, str | List[str]]]

Creates chunked data from a JSON file.

Parameters:

file_path (pathlib.Path) – The path to the JSON file.
max_chunk_size (int) – The maximum size of a chunk.
min_overlap_size (int) – The minimum size of the overlap between two chunks.

Returns:

A list of dictionaries with the chunked data.

Return type:

List[Dict[str, Union[str, List[str]]]]
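How max_chunk_size and min_overlap_size interact can be illustrated with a character-based sliding window. This is a simplified sketch on a plain string, not the package's algorithm, which works on transcript JSON and may split on different boundaries. The name `chunk_text_sketch` is hypothetical.

```python
from typing import List


def chunk_text_sketch(text: str, max_chunk_size: int, min_overlap_size: int) -> List[str]:
    """Split text into chunks of at most max_chunk_size characters.

    Consecutive chunks share min_overlap_size characters.
    """
    if not 0 <= min_overlap_size < max_chunk_size:
        raise ValueError("min_overlap_size must be >= 0 and smaller than max_chunk_size.")
    # Each new chunk starts this many characters after the previous one.
    step = max_chunk_size - min_overlap_size
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_chunk_size])
        # Stop once the current chunk reaches the end of the text.
        if start + max_chunk_size >= len(text):
            break
        start += step
    return chunks


chunks = chunk_text_sketch("abcdefghij", max_chunk_size=4, min_overlap_size=1)
print(chunks)
```

In this example each chunk begins with the last character of the previous one, so text that straddles a chunk boundary still appears whole in at least one chunk when it is shorter than the overlap.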

ask_youtube_playlists.data_processing.download_transcripts.download_playlist(url: str, data_path: Path, use_st_progress_bar: bool = False) → None

Downloads the transcripts of a YouTube playlist.

Parameters:

url (str) – The URL of the YouTube playlist.
data_path (pathlib.Path) – The path to the data directory.
use_st_progress_bar (bool) – Whether to use a Streamlit progress bar.

ask_youtube_playlists.data_processing.download_transcripts.download_transcript(video_title: str, video_id: str, output_path: Path, verbose: bool = True) → None

Downloads the transcript of a YouTube video.

Parameters:

video_title (str) – The title of the YouTube video.
video_id (str) – The ID of the YouTube video.
output_path (pathlib.Path) – The path to the output file.
verbose (bool) – Whether to print the progress of the download.

Raises:

Exception – If the transcript cannot be downloaded.

ask_youtube_playlists.data_processing.utils module

Utility functions for data processing.

ask_youtube_playlists.data_processing.utils.get_available_directories(data_directory: Path) → List[str]

Returns a list of the available playlists.

The playlists are the names of the directories in the data directory.
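Listing the playlists as described amounts to collecting the names of the subdirectories. A minimal sketch follows; the function name `get_available_directories_sketch` is hypothetical, and sorting the names is an assumption made here for a stable order.

```python
import tempfile
from pathlib import Path
from typing import List


def get_available_directories_sketch(data_directory: Path) -> List[str]:
    """Return the names of the subdirectories, ignoring regular files."""
    # Sorted for a stable order (an assumption, not documented behaviour).
    return sorted(entry.name for entry in data_directory.iterdir() if entry.is_dir())


# Demo: two playlist directories and one stray file.
with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp)
    (data_dir / "playlist_b").mkdir()
    (data_dir / "playlist_a").mkdir()
    (data_dir / "readme.txt").write_text("not a playlist", encoding="utf-8")
    playlists = get_available_directories_sketch(data_dir)

print(playlists)
```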

ask_youtube_playlists.data_processing.utils.get_device() → str

Returns ‘cuda’ if a GPU is available, otherwise ‘cpu’.
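One common way to make this check is through PyTorch's `torch.cuda.is_available()`. The sketch below assumes that is the mechanism, which the documentation does not confirm, and falls back to 'cpu' when PyTorch is not installed. The name `get_device_sketch` is hypothetical.

```python
import importlib.util


def get_device_sketch() -> str:
    """Return 'cuda' when PyTorch reports a usable GPU, otherwise 'cpu'."""
    # Assumption: GPU detection goes through PyTorch; without it, use the CPU.
    if importlib.util.find_spec("torch") is None:
        return "cpu"
    import torch

    return "cuda" if torch.cuda.is_available() else "cpu"


device = get_device_sketch()
print(device)
```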

ask_youtube_playlists.data_processing.utils.is_youtube_playlist(link) → bool

Checks if a string is a YouTube playlist link.
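A check of this kind can be sketched with the standard library by parsing the URL and looking for a non-empty `list` query parameter on a YouTube host. The accepted host names and the rule itself are assumptions for illustration, and the package may use a different test, such as a regular expression. The name `is_youtube_playlist_sketch` is hypothetical.

```python
from urllib.parse import parse_qs, urlparse

# Assumed set of accepted hosts; the real check may differ.
_YOUTUBE_HOSTS = {"youtube.com", "www.youtube.com", "m.youtube.com", "music.youtube.com"}


def is_youtube_playlist_sketch(link: str) -> bool:
    """Return True for http(s) YouTube URLs that carry a `list` parameter."""
    parsed = urlparse(link)
    if parsed.scheme not in ("http", "https"):
        return False
    if parsed.hostname not in _YOUTUBE_HOSTS:
        return False
    # parse_qs drops blank values, so an empty `list=` does not count.
    return bool(parse_qs(parsed.query).get("list"))


print(is_youtube_playlist_sketch("https://www.youtube.com/playlist?list=PL123"))
print(is_youtube_playlist_sketch("https://www.youtube.com/watch?v=abc"))
```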

Module contents

This module contains the functions related to the data.

These are the steps to follow:

1. Download the transcripts of the videos in the playlists.
2. Create chunks of data from the transcripts, where you can decide the size of the chunks and the overlap between them.
3. Create a vectorstore from the chunks of data.