ask_youtube_playlists.data_processing package

Submodules

ask_youtube_playlists.data_processing.create_documents module

Contains functions to create documents from json files or their corresponding python objects.

ask_youtube_playlists.data_processing.create_documents.extract_documents_from_list_of_dicts(json_data: List[dict], text_key: str = 'text') → List[Document]

Extracts documents from a list of dictionaries.
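The mapping from dictionaries to documents can be pictured with a minimal sketch. It assumes that the value under text_key becomes the document text and that every other key is kept as metadata; the Document dataclass below is a stand-in for the real document class, and extract_documents_sketch is a hypothetical name, not the package's implementation.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Document:
    """Stand-in for the real document class (assumption)."""

    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def extract_documents_sketch(
    json_data: List[dict], text_key: str = "text"
) -> List[Document]:
    """Hypothetical sketch: build one Document per dictionary."""
    documents = []
    for item in json_data:
        # Assumption: everything except the text field is kept as metadata.
        metadata = {key: value for key, value in item.items() if key != text_key}
        documents.append(Document(page_content=item[text_key], metadata=metadata))
    return documents
```

A dictionary without the text_key entry raises KeyError in this sketch; the real function may handle that case differently.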

ask_youtube_playlists.data_processing.create_documents.get_documents_from_directory(directory_path: str | PathLike, start_with: str = '', text_key: str = 'text') → List[List[Document]]

Extracts the documents from a directory with json files.

Deprecated. Use extract_documents_from_list_of_dicts instead.

Parameters:
  • directory_path (Union[str, os.PathLike]) – Path to the directory with the json files. Usually …/data/playlist_name/processed.

  • start_with (str) – The json files must start with this string. Defaults to “”.

  • text_key (str) – The key of the text field. Defaults to “text”.

ask_youtube_playlists.data_processing.create_embeddings module

Functions to create the Vector database.

class ask_youtube_playlists.data_processing.create_embeddings.EmbeddingModelSpec(model_name: str, model_type: str, max_seq_length: int)

Bases: object

Class to store the specification of an embedding model.

model_name (str) – The name of the embedding model.

model_type (str) – The type of the embedding model. Can be sentence-transformers or openai.

max_seq_length (int) – The maximum number of tokens the model can handle.

ask_youtube_playlists.data_processing.create_embeddings.create_embeddings_pipeline(retriever_directory: str | PathLike, embedding_model_name: str, max_chunk_size: int, min_overlap_size: int, use_st_progress_bar: bool = True) → None

Sets up the embeddings for the given embedding model in the directory.

Steps:
  1. Creates the retriever_directory if it does not exist.

  2. Creates the hyperparams.yaml file.

  3. Chunks the data.

  4. Creates the embeddings and saves them in the retriever_directory.

Parameters:
  • retriever_directory (PathLike) – The directory where the embeddings will be saved. It should be inside a data/playlist_name directory. This function assumes that the playlist directory contains a raw directory with the json files of each video.

  • embedding_model_name (str) – The name of the embedding model.

  • max_chunk_size (int) – The maximum number of characters in a chunk.

  • min_overlap_size (int) – The minimum number of characters in the overlap between two consecutive chunks.

  • use_st_progress_bar (bool) – Whether to use the Streamlit progress bar or not.

ask_youtube_playlists.data_processing.create_embeddings.create_vectorstore(embedding_model_name: str, documents: List[Document], vector_store_type: str = 'in-memory', **kwargs) → VectorStore

Returns a vector store that contains the vectors of the documents.

Currently, it only supports “in-memory” mode. In the future, it may support “chroma-db” mode as well.

Note

In order to be able to make the vector store persistent, the vector_store_type should be chroma-db and the kwargs should contain the persist_directory argument with the path to the directory where the vector store will be saved or loaded from. The persist_directory is where Chroma will store its database files on disk, and load them on start.

Parameters:
  • embedding_model_name (str) – The name of the embedding model.

  • documents (List[Document]) – List of documents.

  • vector_store_type (str) – The vector store type. Can be chroma-db or in-memory.

  • **kwargs – Additional arguments passed to the from_documents method.

Raises:

ValueError – If the persist_directory argument is not provided when the vector store type is chroma-db.
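The argument rule described above can be sketched as a small validation step. This is a hedged illustration, not the package's code: check_vector_store_args is a hypothetical helper, and rejecting unknown store types with ValueError is an assumption that goes beyond what the documentation states.

```python
from typing import Any

# The two store types named in the documentation.
SUPPORTED_TYPES = ("in-memory", "chroma-db")


def check_vector_store_args(vector_store_type: str = "in-memory", **kwargs: Any) -> None:
    """Hypothetical helper mirroring the documented argument rules."""
    if vector_store_type not in SUPPORTED_TYPES:
        # Assumption: unknown types are rejected; the docs do not say so.
        raise ValueError(f"Unsupported vector store type: {vector_store_type!r}")
    if vector_store_type == "chroma-db" and "persist_directory" not in kwargs:
        # Documented behaviour: chroma-db needs a persist_directory.
        raise ValueError(
            "persist_directory is required when vector_store_type is 'chroma-db'"
        )
```

Under this sketch, a call with vector_store_type='chroma-db' would also pass persist_directory as a keyword argument, for example a path inside the retriever directory.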

ask_youtube_playlists.data_processing.create_embeddings.get_embedding_model(embedding_model_name: str) → Embeddings

Returns the embedding model.

Parameters:

embedding_model_name (str) – The name of the embedding model.

Raises:

ValueError – If the model type is not supported.

ask_youtube_playlists.data_processing.create_embeddings.get_embedding_spec(model_name: str) → EmbeddingModelSpec

Returns the embedding model specification.

Parameters:

model_name (str) – The name of the embedding model.

Raises:

ValueError – If the model name is not supported.
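A lookup of this kind is often backed by a small registry. The sketch below assumes a dictionary keyed by model name; the registry contents, the placeholder model names, and the limits are all hypothetical and do not reflect the models the package actually supports.

```python
from dataclasses import dataclass


@dataclass
class EmbeddingModelSpec:
    """Mirrors the documented fields of the specification class."""

    model_name: str
    model_type: str
    max_seq_length: int


# Hypothetical registry: names and limits are placeholders for illustration.
_EXAMPLE_SPECS = {
    "example-st-model": EmbeddingModelSpec(
        "example-st-model", "sentence-transformers", 256
    ),
    "example-openai-model": EmbeddingModelSpec(
        "example-openai-model", "openai", 8191
    ),
}


def get_embedding_spec_sketch(model_name: str) -> EmbeddingModelSpec:
    """Return the spec for model_name, or raise ValueError if unknown."""
    try:
        return _EXAMPLE_SPECS[model_name]
    except KeyError:
        raise ValueError(f"Unsupported embedding model: {model_name!r}") from None
```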

ask_youtube_playlists.data_processing.create_embeddings.load_embeddings(embedding_directory: str | PathLike) → List[ndarray]

Loads the embeddings from the embedding_directory.

Parameters:

embedding_directory (PathLike) – The directory where the embeddings are saved.

Returns:

The embeddings. The order of the embeddings in the list is the same as the order of the json files in the processed directory.

Return type:

List[np.ndarray]

ask_youtube_playlists.data_processing.create_embeddings.load_hyperparams(directory: str | PathLike) → Dict[str, str | int]

Loads the hyperparams.yaml file in the directory.

ask_youtube_playlists.data_processing.create_embeddings.load_vectorstore(persist_directory: str | PathLike) → Chroma

Loads a vectorstore from the local disk.

Parameters:

persist_directory (Union[str, os.PathLike]) – The directory where the vectorstore is saved.

Returns:

The Chroma vectorstore.

Return type:

VectorStore

ask_youtube_playlists.data_processing.create_embeddings.save_json(chunked_data: List[dict], path: Path, file_name: str) → None

Saves the data in a json file.

Parameters:
  • chunked_data (List[dict]) – The data to be saved.

  • path (pathlib.Path) – The path to the directory where the json file is saved.

  • file_name (str) – The name of the json file.

ask_youtube_playlists.data_processing.create_embeddings.save_vectorstore(chroma_vectorstore: Chroma) → None

Persists the vectorstore to the local disk.

The vectorstore is saved in the persist directory indicated when the vectorstore was created.

Parameters:

chroma_vectorstore (VectorStore) – The vectorstore.

ask_youtube_playlists.data_processing.download_transcripts module

Code to download the transcripts from YouTube.

ask_youtube_playlists.data_processing.download_transcripts.create_chunked_data(file_path: Path, max_chunk_size: int, min_overlap_size: int) → List[Dict[str, str | List[str]]]

Creates chunked data from a JSON file.

Parameters:
  • file_path (pathlib.Path) – The path to the JSON file.

  • max_chunk_size (int) – The maximum size of a chunk.

  • min_overlap_size (int) – The minimum size of the overlap between two chunks.

Returns:

A list of dictionaries with the chunked data.

Return type:

List[Dict[str, Union[str, List[str]]]]
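How max_chunk_size and min_overlap_size interact can be illustrated with a fixed-stride sliding window over a plain string. This is a swapped-in simplification, not the package's chunking algorithm: chunk_text_sketch is a hypothetical function, it measures sizes in characters, and it works on raw text rather than on the transcript JSON structure.

```python
from typing import List


def chunk_text_sketch(text: str, max_chunk_size: int, min_overlap_size: int) -> List[str]:
    """Hypothetical sketch: split text into overlapping character windows."""
    if max_chunk_size <= 0:
        raise ValueError("max_chunk_size must be positive")
    if not 0 <= min_overlap_size < max_chunk_size:
        raise ValueError("min_overlap_size must be in [0, max_chunk_size)")

    # Each window starts this many characters after the previous one,
    # so consecutive windows share exactly min_overlap_size characters.
    step = max_chunk_size - min_overlap_size
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_chunk_size])
        if start + max_chunk_size >= len(text):
            break  # The current window already reaches the end of the text.
        start += step
    return chunks
```

With this stride, a larger min_overlap_size produces more chunks for the same text, which increases the number of embeddings to compute and store.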

ask_youtube_playlists.data_processing.download_transcripts.download_playlist(url: str, data_path: Path, use_st_progress_bar: bool = False) → None

Downloads the transcripts of a YouTube playlist.

Parameters:
  • url (str) – The URL of the YouTube playlist.

  • data_path (pathlib.Path) – The path to the data directory.

  • use_st_progress_bar (bool) – Whether to use a Streamlit progress bar.

ask_youtube_playlists.data_processing.download_transcripts.download_transcript(video_title: str, video_id: str, output_path: Path, verbose: bool = True) → None

Downloads the transcript of a YouTube video.

Parameters:
  • video_title (str) – The title of the YouTube video.

  • video_id (str) – The ID of the YouTube video.

  • output_path (pathlib.Path) – The path to the output file.

  • verbose (bool) – Whether to print the progress of the download.

Raises:

Exception – If the transcript cannot be downloaded.

ask_youtube_playlists.data_processing.utils module

Utility functions for data processing.

ask_youtube_playlists.data_processing.utils.get_available_directories(data_directory: Path) → List[str]

Returns a list of the available playlists.

The playlists are the names of the directories in the data directory.
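Listing the playlist names amounts to listing the subdirectories of the data directory. The sketch below is one way this could look with pathlib; list_playlist_directories_sketch is a hypothetical name, and sorting the result is an assumption made here for a stable order.

```python
from pathlib import Path
from typing import List


def list_playlist_directories_sketch(data_directory: Path) -> List[str]:
    """Hypothetical sketch: names of the subdirectories of data_directory."""
    # Plain files are skipped; only directories count as playlists.
    return sorted(
        entry.name for entry in data_directory.iterdir() if entry.is_dir()
    )
```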

ask_youtube_playlists.data_processing.utils.get_device() → str

Returns ‘cuda’ if a GPU is available, otherwise ‘cpu’.
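A device check of this kind is commonly written against PyTorch. The sketch below assumes PyTorch is the GPU backend, which the documentation does not state, and falls back to 'cpu' when PyTorch is not installed; get_device_sketch is a hypothetical name.

```python
def get_device_sketch() -> str:
    """Hypothetical sketch: 'cuda' when a CUDA GPU is usable, else 'cpu'."""
    try:
        import torch  # Assumption: PyTorch is the GPU backend.
    except ImportError:
        # Without PyTorch there is no way to use a GPU here.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"
```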

ask_youtube_playlists.data_processing.utils.is_youtube_playlist(link) → bool

Checks if a string is a YouTube playlist link.
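One way to perform such a check is to parse the URL and look for a list query parameter on a YouTube host. The sketch below is an assumption about what counts as a playlist link, not the package's rule: is_youtube_playlist_sketch and the accepted host set are hypothetical.

```python
from urllib.parse import parse_qs, urlparse

# Assumption: these hosts are treated as YouTube for the purpose of the check.
_YOUTUBE_HOSTS = {"youtube.com", "www.youtube.com", "m.youtube.com"}


def is_youtube_playlist_sketch(link: str) -> bool:
    """Hypothetical sketch: True for YouTube URLs carrying a list parameter."""
    parsed = urlparse(link)
    if parsed.scheme not in ("http", "https"):
        return False
    if parsed.hostname not in _YOUTUBE_HOSTS:
        return False
    # A non-empty list=... query parameter identifies the playlist.
    return bool(parse_qs(parsed.query).get("list"))
```

Note that under this rule a watch URL that also carries a list parameter is accepted as a playlist link; a stricter check could additionally require the /playlist path.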

Module contents

This module contains the functions related to the data.

These are the steps to follow:
  1. Download the transcripts of the videos in the playlists.

  2. Create chunks of data from the transcripts, where you can decide the size of the chunks and the overlap between them.

  3. Create a vectorstore from the chunks of data.