ask_youtube_playlists.data_processing package
Submodules
ask_youtube_playlists.data_processing.create_documents module
Contains functions to create documents from json files or their corresponding python objects.
- ask_youtube_playlists.data_processing.create_documents.extract_documents_from_list_of_dicts(json_data: List[dict], text_key: str = 'text') List[Document] [source]
Extracts documents from a list of dictionaries.
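A minimal sketch of what this extraction could look like, assuming each dictionary stores its text under text_key and that the remaining keys are kept as metadata. The Document class below is a hypothetical stand-in for the real Document type, and the metadata handling and the name extract_documents_sketch are assumptions, not the package's implementation.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Document:
    """Hypothetical stand-in for the Document type used by the package."""

    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def extract_documents_sketch(
    json_data: List[dict], text_key: str = "text"
) -> List[Document]:
    """Build one Document per dictionary (illustrative only)."""
    documents = []
    for item in json_data:
        # Assumption: every key other than text_key becomes metadata.
        metadata = {key: value for key, value in item.items() if key != text_key}
        documents.append(Document(page_content=item[text_key], metadata=metadata))
    return documents


docs = extract_documents_sketch([{"text": "hello world", "title": "Intro"}])
```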
- ask_youtube_playlists.data_processing.create_documents.get_documents_from_directory(directory_path: str | PathLike, start_with: str = '', text_key: str = 'text') List[List[Document]] [source]
Extracts the documents from a directory with json files.
Deprecated. Use extract_documents_from_list_of_dicts instead.
- Parameters:
directory_path (Union[str, os.PathLike]) – Path to the directory with the json files. Usually …/data/playlist_name/processed.
start_with (str) – The json files must start with this string. Defaults to “”.
text_key (str) – The key of the text field. Defaults to “text”.
ask_youtube_playlists.data_processing.create_embeddings module
Functions to create the Vector database.
- class ask_youtube_playlists.data_processing.create_embeddings.EmbeddingModelSpec(model_name: str, model_type: str, max_seq_length: int)[source]
Bases:
object
Class to store the specification of an embedding model.
- model_name
The name of the embedding model.
- Type:
str
- model_type
The type of the embedding model. Can be sentence-transformers or openai.
- Type:
str
- max_seq_length
The maximum number of tokens the model can handle.
- Type:
int
- max_seq_length: int
- model_name: str
- model_type: str
- ask_youtube_playlists.data_processing.create_embeddings.create_embeddings_pipeline(retriever_directory: str | PathLike, embedding_model_name: str, max_chunk_size: int, min_overlap_size: int, use_st_progress_bar: bool = True) None [source]
Sets up the embeddings for the given embedding model in the directory.
- Steps:
Creates the retriever_directory if it does not exist.
Creates the hyperparams.yaml file.
Chunks the data.
Creates the embeddings and saves them in the retriever_directory.
- Parameters:
retriever_directory (Union[str, os.PathLike]) – The directory where the embeddings will be saved. It should be inside a data/playlist_name directory. This function assumes that the playlist directory contains a raw directory with the json files of each video.
embedding_model_name (str) – The name of the embedding model.
max_chunk_size (int) – The maximum number of characters in a chunk.
min_overlap_size (int) – The minimum number of characters in the overlap between two consecutive chunks.
use_st_progress_bar (bool) – Whether to use the Streamlit progress bar or not.
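The first two steps (creating the directory and writing hyperparams.yaml) can be sketched with the standard library alone. This is only an illustration: the function name init_retriever_directory and the keys written to the file are assumptions, and the YAML is written by hand, which only works for simple values without special characters.

```python
import tempfile
from pathlib import Path
from typing import Dict, Union


def init_retriever_directory(
    retriever_directory: Union[str, Path],
    embedding_model_name: str,
    max_chunk_size: int,
    min_overlap_size: int,
) -> Path:
    """Create the directory and write a flat hyperparams.yaml (illustrative)."""
    directory = Path(retriever_directory)
    # Step 1: create the retriever directory if it does not exist.
    directory.mkdir(parents=True, exist_ok=True)
    # Step 2: write the hyperparameters. The key names are assumptions.
    hyperparams: Dict[str, Union[str, int]] = {
        "embedding_model_name": embedding_model_name,
        "max_chunk_size": max_chunk_size,
        "min_overlap_size": min_overlap_size,
    }
    hyperparams_path = directory / "hyperparams.yaml"
    lines = [f"{key}: {value}" for key, value in hyperparams.items()]
    hyperparams_path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return hyperparams_path


# Usage in a throwaway directory; "example-model" is a placeholder name.
with tempfile.TemporaryDirectory() as tmp:
    written = init_retriever_directory(Path(tmp) / "retriever", "example-model", 512, 50)
    hyperparams_text = written.read_text(encoding="utf-8")
```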
- ask_youtube_playlists.data_processing.create_embeddings.create_vectorstore(embedding_model_name: str, documents: List[Document], vector_store_type: str = 'in-memory', **kwargs) VectorStore [source]
Returns a vector store that contains the vectors of the documents.
Currently, it only supports “in-memory” mode. In the future, it may support “chroma-db” mode as well.
Note
In order to be able to make the vector store persistent, the vector_store_type should be chroma-db and the kwargs should contain the persist_directory argument with the path to the directory where the vector store will be saved or loaded from. The persist_directory is where Chroma will store its database files on disk, and load them on start.
- Parameters:
embedding_model_name (str) – The name of the embedding model.
documents (List[Document]) – List of documents.
vector_store_type (str) – The vector store type. Can be chroma-db or in-memory.
**kwargs – Additional arguments passed to the from_documents method.
- Raises:
ValueError – If the persist_directory argument is not provided when the vector store type is chroma-db.
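The argument check described above could look roughly like the following sketch. The helper name check_vector_store_args is hypothetical, and rejecting unknown store types with a ValueError is an assumption; the documentation only states the error for a missing persist_directory.

```python
from typing import Any, Dict

SUPPORTED_VECTOR_STORE_TYPES = ("in-memory", "chroma-db")


def check_vector_store_args(
    vector_store_type: str = "in-memory", **kwargs: Any
) -> Dict[str, Any]:
    """Validate the store type and its keyword arguments (illustrative)."""
    if vector_store_type not in SUPPORTED_VECTOR_STORE_TYPES:
        # Assumption: unknown types are rejected with a ValueError.
        raise ValueError(f"Unsupported vector store type: {vector_store_type!r}")
    if vector_store_type == "chroma-db" and "persist_directory" not in kwargs:
        # Documented behaviour: chroma-db requires persist_directory.
        raise ValueError("persist_directory is required when using chroma-db")
    return kwargs


checked = check_vector_store_args("chroma-db", persist_directory="some/directory")
```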
- ask_youtube_playlists.data_processing.create_embeddings.get_embedding_model(embedding_model_name: str) Embeddings [source]
Returns the embedding model.
- Parameters:
embedding_model_name (str) – The name of the embedding model.
- Raises:
ValueError – If the model type is not supported.
- ask_youtube_playlists.data_processing.create_embeddings.get_embedding_spec(model_name: str) EmbeddingModelSpec [source]
Returns the embedding model specification.
- Parameters:
model_name (str) – The name of the embedding model.
- Raises:
ValueError – If the model name is not supported.
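One plausible way to implement this lookup is a small registry of specifications keyed by model name. The sketch below is an assumption about the design, not the package's code: the registry, the function name get_embedding_spec_sketch, and the two model entries (with their sequence lengths) are placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EmbeddingModelSpec:
    """Mirrors the documented attributes of EmbeddingModelSpec."""

    model_name: str
    model_type: str  # "sentence-transformers" or "openai"
    max_seq_length: int


# Hypothetical registry; names and lengths are placeholders, not real models.
_SPECS = {
    spec.model_name: spec
    for spec in (
        EmbeddingModelSpec("example-st-model", "sentence-transformers", 256),
        EmbeddingModelSpec("example-openai-model", "openai", 8000),
    )
}


def get_embedding_spec_sketch(model_name: str) -> EmbeddingModelSpec:
    """Return the spec for model_name or raise ValueError if unknown."""
    try:
        return _SPECS[model_name]
    except KeyError:
        raise ValueError(f"Unsupported embedding model: {model_name!r}") from None


spec = get_embedding_spec_sketch("example-st-model")
```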
- ask_youtube_playlists.data_processing.create_embeddings.load_embeddings(embedding_directory: str | PathLike) List[ndarray] [source]
Loads the embeddings from the embedding_directory.
- Parameters:
embedding_directory (PathLike) – The directory where the embeddings are saved.
- Returns:
- The embeddings. The order of the embeddings in the list is the same as the order of the json files in the processed directory.
- Return type:
List[np.ndarray]
- ask_youtube_playlists.data_processing.create_embeddings.load_hyperparams(directory: str | PathLike) Dict[str, str | int] [source]
Loads the hyperparams.yaml file in the directory.
- ask_youtube_playlists.data_processing.create_embeddings.load_vectorstore(persist_directory: str | PathLike) Chroma [source]
Loads a vectorstore from the local disk.
- Parameters:
persist_directory (Union[str, os.PathLike]) – The directory where the vectorstore is saved.
- Returns:
The Chroma vectorstore.
- Return type:
VectorStore
- ask_youtube_playlists.data_processing.create_embeddings.save_json(chunked_data: List[dict], path: Path, file_name: str) None [source]
Saves the data in a json file.
- Parameters:
chunked_data (List[dict]) – The data to be saved.
path (PathLike) – The path to the json file.
file_name (str) – The name of the json file.
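A minimal sketch of such a helper, assuming path is the target directory and file_name is joined to it. Creating the directory when it is missing and returning the written path are conveniences of this sketch; the documented function returns None.

```python
import json
import tempfile
from pathlib import Path
from typing import List


def save_json_sketch(chunked_data: List[dict], path: Path, file_name: str) -> Path:
    """Write chunked_data to path / file_name as UTF-8 JSON (illustrative)."""
    # Assumption: path is a directory, created here if it is missing.
    path.mkdir(parents=True, exist_ok=True)
    target = path / file_name
    with target.open("w", encoding="utf-8") as handle:
        json.dump(chunked_data, handle, ensure_ascii=False, indent=2)
    return target


# Round trip in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    saved = save_json_sketch([{"text": "chunk one"}], Path(tmp) / "processed", "video.json")
    reloaded = json.loads(saved.read_text(encoding="utf-8"))
```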
- ask_youtube_playlists.data_processing.create_embeddings.save_vectorstore(chroma_vectorstore: Chroma) None [source]
Makes the vectorstore persistent on the local disk.
The vectorstore is saved in the persist directory indicated when the vectorstore was created.
- Parameters:
chroma_vectorstore (VectorStore) – The vectorstore.
ask_youtube_playlists.data_processing.download_transcripts module
Code to download the transcripts from YouTube.
- ask_youtube_playlists.data_processing.download_transcripts.create_chunked_data(file_path: Path, max_chunk_size: int, min_overlap_size: int) List[Dict[str, str | List[str]]] [source]
Creates chunked data from a JSON file.
- Parameters:
file_path (pathlib.Path) – The path to the JSON file.
max_chunk_size (int) – The maximum size of a chunk.
min_overlap_size (int) – The minimum size of the overlap between two chunks.
- Returns:
- A list of dictionaries with the chunked data.
- Return type:
List[Dict[str, Union[str, List[str]]]]
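To illustrate how max_chunk_size and min_overlap_size interact, here is a minimal character-based sliding-window sketch. It is not the package's algorithm: it works on a plain string rather than on a transcript JSON file, and the function name chunk_text is hypothetical. Consecutive chunks share exactly min_overlap_size characters, and no chunk exceeds max_chunk_size.

```python
from typing import List


def chunk_text(text: str, max_chunk_size: int, min_overlap_size: int) -> List[str]:
    """Split text into overlapping chunks of at most max_chunk_size characters."""
    if max_chunk_size <= 0:
        raise ValueError("max_chunk_size must be positive")
    if not 0 <= min_overlap_size < max_chunk_size:
        raise ValueError("min_overlap_size must be in [0, max_chunk_size)")
    # Each new chunk starts this many characters after the previous one.
    step = max_chunk_size - min_overlap_size
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_chunk_size])
        if start + max_chunk_size >= len(text):
            # The current chunk already reaches the end of the text.
            break
        start += step
    return chunks


chunks = chunk_text("abcdefghij", max_chunk_size=4, min_overlap_size=1)
# chunks == ["abcd", "defg", "ghij"]
```

A larger min_overlap_size reduces the step between chunks, so the same text produces more chunks.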
- ask_youtube_playlists.data_processing.download_transcripts.download_playlist(url: str, data_path: Path, use_st_progress_bar: bool = False) None [source]
Downloads the transcripts of a YouTube playlist.
- Parameters:
url (str) – The URL of the YouTube playlist.
data_path (pathlib.Path) – The path to the data directory.
use_st_progress_bar (bool) – Whether to use a Streamlit progress bar.
- ask_youtube_playlists.data_processing.download_transcripts.download_transcript(video_title: str, video_id: str, output_path: Path, verbose: bool = True) None [source]
Downloads the transcript of a YouTube video.
- Parameters:
video_title (str) – The title of the YouTube video.
video_id (str) – The ID of the YouTube video.
output_path (pathlib.Path) – The path to the output file.
verbose (bool) – Whether to print the progress of the download.
- Raises:
Exception – If the transcript cannot be downloaded.
ask_youtube_playlists.data_processing.utils module
Utility functions for data processing.
- ask_youtube_playlists.data_processing.utils.get_available_directories(data_directory: Path) List[str] [source]
Returns a list of the available playlists.
The playlists are the names of the directories in the data directory.
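Listing the playlist names amounts to collecting the subdirectories of the data directory. A minimal sketch with pathlib follows; the function name list_playlist_directories is hypothetical, and sorting the names is an assumption made here to get a stable order.

```python
from pathlib import Path
from typing import List


def list_playlist_directories(data_directory: Path) -> List[str]:
    """Return the names of the subdirectories of data_directory, sorted."""
    # Regular files are skipped; only directories count as playlists.
    return sorted(entry.name for entry in data_directory.iterdir() if entry.is_dir())
```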
Module contents
This module contains the functions related to the data.
These are the steps to follow:
1. Download the transcripts of the videos in the playlists.
2. Create chunks of data from the transcripts, where you can decide the size of the chunks and the overlap between them.
3. Create a vectorstore from the chunks of data.