1. Version Control with Git:
git init
git add .
git commit -m "Initial commit"
git push origin main
Version control is essential in MLOps to track changes in code, models, and datasets. Git is the most popular version control system. You can initialize a Git repository, stage changes, commit them, and push to a remote repository.
2. Virtual Environments:
python -m venv myenv
source myenv/bin/activate
Using virtual environments helps maintain isolated Python environments for different projects. You can create a virtual environment using the venv
module and activate it to work within that environment.
3. Dependency Management with pip:
pip install numpy pandas scikit-learn
pip freeze > requirements.txt
Managing dependencies is crucial for reproducibility and ease of deployment. You can use pip
to install required packages and create a requirements.txt
file to record the exact versions of the installed packages.
4. Dockerizing ML Applications:
FROM python:3.9
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . .
CMD ["python", "app.py"]
Dockerizing ML applications ensures consistent and portable deployment across different environments. You can create a Dockerfile that specifies the base image, installs dependencies, copies the code, and defines the entry point for the application.
5. Continuous Integration and Continuous Deployment (CI/CD):
# GitHub Actions workflow
name: CI/CD
on:
  push:
    branches: [main]
jobs:
  build-and-deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Build and Test
        run: |
          python -m venv venv
          source venv/bin/activate
          pip install -r requirements.txt
          python tests.py
      - name: Deploy
        run: |
          docker build -t my-ml-app .
          docker push my-ml-app
CI/CD pipelines automate the build, test, and deployment processes. You can use tools like GitHub Actions or Jenkins to define workflows that trigger on code changes, run tests, and deploy the application.
6. Model Versioning with MLflow:
import mlflow

with mlflow.start_run():
    mlflow.log_param("learning_rate", 0.01)
    mlflow.log_metric("accuracy", 0.95)
    mlflow.sklearn.log_model(model, "model")
7. Model Serving with Flask:
from flask import Flask, request, jsonify
import joblib

app = Flask(__name__)
model = joblib.load("model.pkl")

@app.route("/predict", methods=["POST"])
def predict():
    data = request.json["data"]
    prediction = model.predict(data)
    return jsonify({"prediction": prediction.tolist()})

if __name__ == "__main__":
    app.run()
Model serving involves deploying trained models as web services. Flask is a lightweight web framework in Python. You can create a Flask application that loads a trained model and exposes an endpoint for making predictions.
8. Model Monitoring with Prometheus:
import time
from prometheus_client import start_http_server, Gauge

accuracy_gauge = Gauge("model_accuracy", "Accuracy of the model")

def monitor_model():
    while True:
        accuracy = evaluate_model()
        accuracy_gauge.set(accuracy)
        time.sleep(60)

if __name__ == "__main__":
    start_http_server(8000)
    monitor_model()
Model monitoring helps track the performance of deployed models in production. Prometheus is an open-source monitoring system. You can use Prometheus client libraries to expose metrics from your Python application and set up Prometheus to scrape and store these metrics.
9. Logging with Python Logging:
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
logger.info("Starting the application")
logger.warning("Warning message")
logger.error("Error message")
Logging is essential for debugging and monitoring ML applications. Python provides a built-in logging module. You can configure the logging level, create loggers, and log messages with different severity levels.
10. Testing ML Code with pytest:
def test_model_accuracy():
    model = train_model()
    accuracy = evaluate_model(model)
    assert accuracy > 0.9

def test_data_preprocessing():
    data = load_data()
    preprocessed_data = preprocess_data(data)
    assert preprocessed_data.shape == (1000, 10)
Testing ML code ensures the correctness and reliability of ML pipelines. pytest is a popular testing framework in Python. You can write test functions that assert expected behavior and run them using pytest.
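To make this concrete, here is a self-contained sketch of two idiomatic pytest features, parametrized tests and `pytest.approx` for float comparisons; the toy `normalize` helper is illustrative, not from the article:

```python
import pytest

def normalize(scores):
    # Toy function under test: scale values so they sum to 1.
    total = sum(scores)
    return [s / total for s in scores]

@pytest.mark.parametrize("scores, expected_len", [
    ([1, 1, 2], 3),
    ([5.0], 1),
])
def test_normalize_preserves_length(scores, expected_len):
    assert len(normalize(scores)) == expected_len

def test_normalize_sums_to_one():
    # pytest.approx absorbs floating-point rounding error.
    assert sum(normalize([0.2, 0.3, 0.5])) == pytest.approx(1.0)
```

Running `pytest` in the project directory discovers and executes every `test_*` function automatically.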
11. Experiment Tracking with Weights and Biases (wandb):
import wandb

wandb.init(project="my-project")
wandb.config.update({"learning_rate": 0.01, "batch_size": 32})
for epoch in range(10):
    loss = train_model()
    wandb.log({"loss": loss})
wandb.log_artifact("model.pkl", name="model", type="model")
Experiment tracking helps manage and compare different runs of ML experiments. Weights and Biases (wandb) is a cloud-based platform for experiment tracking. You can use the wandb library to log metrics, hyperparameters, and artifacts during the training process.
12. Model Serialization with joblib:
import joblib

joblib.dump(model, "model.pkl")
loaded_model = joblib.load("model.pkl")
Model serialization allows saving trained models to disk for later use. joblib is a library for efficiently serializing Python objects. You can use joblib to dump trained models to files and load them back when needed.
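The same round-trip idea works with the standard-library pickle module, which joblib builds on; a self-contained sketch, with a plain dict standing in for a trained model:

```python
import os
import pickle
import tempfile

# A stand-in for a trained model: any picklable Python object works.
model = {"coef": [0.5, -1.2], "intercept": 0.1}

path = os.path.join(tempfile.mkdtemp(), "model.pkl")
with open(path, "wb") as f:
    pickle.dump(model, f)

with open(path, "rb") as f:
    loaded_model = pickle.load(f)
```

joblib remains the better choice for scikit-learn estimators because it stores large NumPy arrays more efficiently than plain pickle.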
13. Data Versioning with DVC:
dvc init
dvc add data.csv
git add data.csv.dvc
git commit -m "Add data"
dvc push
Data versioning helps track and manage different versions of datasets. DVC (Data Version Control) is a tool for versioning and sharing datasets and ML models. You can use DVC to track data files, create a DVC repository, and push/pull data versions.
14. Model Deployment with Kubernetes:
# Kubernetes deployment YAML
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ml-deployment
spec:
  replicas: 3
  selector:
    matchLabels:
      app: ml-app
  template:
    metadata:
      labels:
        app: ml-app
    spec:
      containers:
        - name: ml-app
          image: my-ml-app:latest
          ports:
            - containerPort: 5000
Model deployment involves deploying ML models to production environments. Kubernetes is an open-source system for automating deployment, scaling, and management of containerized applications. You can create Kubernetes deployment YAML files to specify the desired state of your ML application.
15. Model Explainability with SHAP:
import shap

explainer = shap.Explainer(model)
shap_values = explainer(X)
shap.summary_plot(shap_values, X)
Model explainability helps understand how ML models make predictions. SHAP (SHapley Additive exPlanations) is a game-theoretic approach to explain the output of ML models. You can use the SHAP library to compute SHAP values and visualize feature importance.
16. Hyperparameter Tuning with Optuna:
import optuna

def objective(trial):
    learning_rate = trial.suggest_float("learning_rate", 1e-5, 1e-1, log=True)
    batch_size = trial.suggest_categorical("batch_size", [32, 64, 128])
    accuracy = train_and_evaluate(learning_rate, batch_size)
    return accuracy

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=100)
Hyperparameter tuning involves finding the best hyperparameters for an ML model. Optuna is an optimization framework for hyperparameter tuning. You can define an objective function that takes hyperparameters as input and returns a performance metric, and use Optuna to optimize the hyperparameters.
17. Model Compression with Quantization:
import tensorflow_model_optimization as tfmot

quantize_model = tfmot.quantization.keras.quantize_model
quantized_model = quantize_model(model)
quantized_model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
Model compression techniques help reduce the size of ML models for deployment. Quantization is a technique that reduces the precision of model weights, thereby reducing the model size. You can use libraries like TensorFlow Model Optimization to apply quantization to your models.
18. Distributed Training with Horovod:
import horovod.tensorflow as hvd
import tensorflow as tf

hvd.init()
model = create_model()
optimizer = hvd.DistributedOptimizer(tf.optimizers.Adam())

@tf.function
def training_step(inputs, labels):
    with tf.GradientTape() as tape:
        predictions = model(inputs, training=True)
        loss = compute_loss(labels, predictions)
    gradients = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))
    return loss

for epoch in range(num_epochs):
    for batch in dataset:
        loss = training_step(batch.inputs, batch.labels)
Distributed training allows training ML models across multiple machines or devices. Horovod is a distributed training framework for TensorFlow, Keras, and PyTorch. You can use Horovod to scale your training across multiple GPUs or machines.
19. Model Serving with TensorFlow Serving:
import tensorflow as tf

model = create_model()
model.save("model", save_format="tf")
# Then serve the SavedModel with the TensorFlow Serving Docker image:
docker run -p 8501:8501 --mount type=bind,source=$(pwd)/model,target=/models/model -e MODEL_NAME=model tensorflow/serving
Model serving involves deploying trained models for inference. TensorFlow Serving is a flexible, high-performance serving system for ML models. You can save your TensorFlow model in the SavedModel format and serve it using TensorFlow Serving.
20. Experiment Tracking with TensorBoard:
import tensorflow as tf

logdir = "logs"
writer = tf.summary.create_file_writer(logdir)
with writer.as_default():
    for step in range(num_steps):
        loss = train_step()
        tf.summary.scalar("loss", loss, step=step)
    writer.flush()
Experiment tracking helps monitor and visualize the training progress of ML models. TensorBoard is a visualization toolkit for TensorFlow. You can use TensorBoard to log scalar summaries, histograms, images, and more during the training process.
21. Model Interpretability with Lime:
import lime
import lime.lime_tabular

explainer = lime.lime_tabular.LimeTabularExplainer(X_train, feature_names=feature_names, class_names=class_names)
explanation = explainer.explain_instance(X_test[0], model.predict_proba, num_features=5)
explanation.show_in_notebook()
Model interpretability helps understand how ML models make predictions for individual instances. Lime (Local Interpretable Model-agnostic Explanations) is a technique for explaining the predictions of black-box models. You can use the Lime library to generate local explanations for individual instances.
22. Continuous Monitoring with Evidently:
import evidently
from evidently.dashboard import Dashboard
from evidently.tabs import DataDriftTab, RegressionPerformanceTab

reference_data = load_reference_data()
production_data = load_production_data()

data_drift_dashboard = Dashboard(tabs=[DataDriftTab()])
data_drift_dashboard.calculate(reference_data, production_data)
data_drift_dashboard.save("data_drift_dashboard.html")

regression_performance_dashboard = Dashboard(tabs=[RegressionPerformanceTab()])
regression_performance_dashboard.calculate(reference_data, production_data)
regression_performance_dashboard.save("regression_performance_dashboard.html")
Continuous monitoring helps detect and alert on issues in production ML systems. Evidently is an open-source Python library for data and model monitoring. You can use Evidently to monitor data drift, model performance, and other metrics in real-time.
23. Workflow Orchestration with Apache Airflow:
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from datetime import datetime

default_args = {
    "start_date": datetime(2023, 1, 1),
}
dag = DAG("ml_pipeline", default_args=default_args, schedule_interval="@daily")

def preprocess_data():
    pass  # Preprocessing logic

def train_model():
    pass  # Training logic

def evaluate_model():
    pass  # Evaluation logic

preprocess_task = PythonOperator(task_id="preprocess_data", python_callable=preprocess_data, dag=dag)
train_task = PythonOperator(task_id="train_model", python_callable=train_model, dag=dag)
evaluate_task = PythonOperator(task_id="evaluate_model", python_callable=evaluate_model, dag=dag)
preprocess_task >> train_task >> evaluate_task
Workflow orchestration involves defining and managing complex ML pipelines. Apache Airflow is an open-source platform for programmatically authoring, scheduling, and monitoring workflows. You can define DAGs (Directed Acyclic Graphs) in Airflow to represent ML pipelines and orchestrate their execution.
24. SHAP Values:
import shap

explainer = shap.Explainer(model)
shap_values = explainer(X)
shap.summary_plot(shap_values, X)
SHAP (SHapley Additive exPlanations) is a framework for explaining the output of machine learning models. It assigns each feature an importance value for a particular prediction. The shap
library provides functions to compute SHAP values and visualize feature importance.
25. Pandera for Data Validation:
from pandera import Check, Column, DataFrameSchema

schema = DataFrameSchema({
    "age": Column(int, Check(lambda x: x > 0)),
    "income": Column(float, Check(lambda x: x >= 0)),
})
try:
    schema.validate(df)
    print("Data validation passed.")
except Exception as e:
    print(f"Data validation failed: {str(e)}")
26. Alibi for drift detection:
from alibi_detect.cd import ChiSquareDrift

drift_detector = ChiSquareDrift(X_ref, p_val=0.05)
drift_preds = drift_detector.predict(X_new)
print(drift_preds)
27. MLflow for Version Control:
import mlflow

mlflow.set_experiment("my-experiment")
with mlflow.start_run():
    mlflow.log_param("learning_rate", 0.01)
    mlflow.log_metric("accuracy", 0.95)
    mlflow.sklearn.log_model(model, "model")
28. MLflow for Model Packaging:
from mlflow.models.signature import infer_signature

signature = infer_signature(X_train, model.predict(X_train))
mlflow.sklearn.log_model(model, "model", signature=signature)
29. MLflow for Model Serving:
import mlflow.pyfunc

class MyModel(mlflow.pyfunc.PythonModel):
    def __init__(self, model):
        self.model = model

    def predict(self, context, input_data):
        return self.model.predict(input_data)

mlflow.pyfunc.save_model("model", python_model=MyModel(model))
30. Evidently for Model Monitoring:
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset

reference_data = load_reference_data()
current_data = load_current_data()
report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=reference_data, current_data=current_data)
report.save_html("drift_report.html")
31. Hyperparameter Tuning:
from sklearn.model_selection import GridSearchCV

param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [None, 5, 10],
    "min_samples_split": [2, 5, 10],
}
grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=5)
grid_search.fit(X_train, y_train)
best_model = grid_search.best_estimator_
32. Model Evaluation:
from sklearn.metrics import classification_report, confusion_matrix

y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
33. BentoML for Model Deployment:
import bentoml
from bentoml.adapters import DataframeInput
from bentoml.frameworks.sklearn import SklearnModelArtifact

@bentoml.env(infer_pip_packages=True)
@bentoml.artifacts([SklearnModelArtifact("model")])
class MyModel(bentoml.BentoService):
    @bentoml.api(input=DataframeInput(), batch=True)
    def predict(self, input_data):
        return self.artifacts.model.predict(input_data)

bento_model = MyModel()
bento_model.pack("model", model)
bento_model.save()
34. Model Inference:
import requests

url = "http://localhost:5000/predict"
data = {"input_data": [[1, 2, 3, 4]]}
response = requests.post(url, json=data)
predictions = response.json()["predictions"]
35. MLflow for Model Registry:
import mlflow

model_name = "my_model"
mlflow.register_model(f"runs:/{run_id}/model", model_name)
36. MLflow for Model Versioning:
import mlflow

model_name = "my_model"
model_version = 1
mlflow.sklearn.log_model(model, "model", registered_model_name=model_name)
37. MLflow for Model Staging:
import mlflow

model_name = "my_model"
model_version = 1
stage = "Production"
client = mlflow.tracking.MlflowClient()
client.transition_model_version_stage(model_name, model_version, stage)
38. Data Preprocessing:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
39. Feature Selection:
from sklearn.feature_selection import SelectKBest, f_classif

selector = SelectKBest(score_func=f_classif, k=10)
X_train_selected = selector.fit_transform(X_train, y_train)
X_test_selected = selector.transform(X_test)
40. Model Monitoring Dashboard Using Streamlit:
import streamlit as st
import pandas as pd

@st.cache
def load_data():
    return pd.read_csv("data.csv")

@st.cache
def load_metrics():
    return pd.read_csv("metrics.csv")
data = load_data()
metrics = load_metrics()
st.title("Model Monitoring Dashboard")
st.write(data)
st.write(metrics)
41. Model Serving with FastAPI:
from typing import List
from fastapi import FastAPI
from pydantic import BaseModel

class InputData(BaseModel):
    features: List[float]

app = FastAPI()

@app.post("/predict")
async def predict(input_data: InputData):
    prediction = model.predict([input_data.features])
    return {"prediction": prediction.tolist()}
42. Use Hydra for configuration management:
import hydra
from omegaconf import DictConfig

@hydra.main(config_path="config", config_name="config")
def main(cfg: DictConfig):
    print(cfg.learning_rate)
    print(cfg.model.architecture)

if __name__ == "__main__":
    main()
43. Implement logging for better debugging and monitoring:
import logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
logger.info("Training started")
44. Use Prometheus for monitoring ML models in production:
from prometheus_client import start_http_server, Gauge

accuracy_gauge = Gauge("model_accuracy", "Accuracy of the model")
45. Use Seldon Core for model deployment and serving:
- Define a Seldon deployment YAML file
- Deploy the model using Seldon Core
- Access the model endpoint for predictions
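The steps above might look like the following SeldonDeployment manifest; the deployment name, model URI, and the choice of Seldon's prebuilt scikit-learn server are illustrative assumptions, not values from the article:

```yaml
apiVersion: machinelearning.seldon.io/v1
kind: SeldonDeployment
metadata:
  name: my-ml-model
spec:
  predictors:
    - name: default
      replicas: 1
      graph:
        name: classifier
        # Use Seldon's prebuilt server for scikit-learn models
        implementation: SKLEARN_SERVER
        modelUri: gs://my-bucket/sklearn-model
```

Applying this with kubectl creates the deployment, and Seldon Core exposes REST and gRPC prediction endpoints for the model graph.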
46. Implement model serving with KFServing:
apiVersion: serving.kubeflow.org/v1alpha2
kind: InferenceService
metadata:
  name: sklearn-iris
spec:
  default:
    predictor:
      sklearn:
        storageUri: gs://my-bucket/sklearn-iris
47. Checking GPU Availability:
import tensorflow as tf

print(tf.config.list_physical_devices("GPU"))
Deep learning models often benefit from running on GPUs for faster computation. You can list the GPUs visible to TensorFlow with tf.config.list_physical_devices("GPU"); the older tf.test.is_gpu_available() helper is deprecated in recent TensorFlow releases.
48. Setting Random Seed:
import tensorflow as tf
import numpy as np

seed_value = 42
tf.random.set_seed(seed_value)
np.random.seed(seed_value)
Setting a fixed random seed ensures reproducibility of your deep learning experiments. You can set the random seed using tf.random.set_seed()
for TensorFlow and np.random.seed()
for NumPy.
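The effect of seeding is easiest to see with the standard-library random module; a small self-contained sketch (the function name and ranges are illustrative):

```python
import random

def sample_batch_ids(seed, n=5, pool=100):
    # A dedicated Random instance keeps the seed local instead of
    # mutating the global random state.
    rng = random.Random(seed)
    return [rng.randrange(pool) for _ in range(n)]

# The same seed reproduces exactly the same draws across runs,
# while a different seed (almost always) produces different ones.
same_a = sample_batch_ids(42)
same_b = sample_batch_ids(42)
other = sample_batch_ids(7)
```

The same principle is why tf.random.set_seed() and np.random.seed() make training runs repeatable, at least up to nondeterminism from parallel hardware.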
49. Normalizing Input Data:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
X_normalized = scaler.fit_transform(X)
Normalizing the input data to a specific range (e.g., [0, 1] or [-1, 1]) can help improve the convergence and stability of deep learning models. You can use scikit-learn’s MinMaxScaler
to normalize the data.
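Under the hood, MinMaxScaler applies (x - min) / (max - min) to each column; a NumPy sketch of that formula (the data values are illustrative):

```python
import numpy as np

def min_max_scale(X):
    # Scale each column to [0, 1]: (x - min) / (max - min).
    mins = X.min(axis=0)
    maxs = X.max(axis=0)
    return (X - mins) / (maxs - mins)

X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])
X_scaled = min_max_scale(X)
```

Like the scaler's fit/transform split, in practice the min and max should come from the training set only and then be reused for the test set.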
50. One-Hot Encoding:
from tensorflow.keras.utils import to_categorical

y_one_hot = to_categorical(y)
One-hot encoding is commonly used to convert categorical variables into a binary vector representation. Keras provides the to_categorical()
function to perform one-hot encoding on target labels.
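The transformation itself is a one-line indexing trick in NumPy, which is roughly what to_categorical() does (the labels here are illustrative):

```python
import numpy as np

def one_hot(y, num_classes):
    # Row i of the identity matrix is the one-hot vector for class i,
    # so indexing by the label array encodes every sample at once.
    return np.eye(num_classes)[y]

y = np.array([0, 2, 1])
y_one_hot = one_hot(y, 3)
```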
51. Splitting Data into Train and Test Sets:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
Splitting the data into training and testing sets is crucial for evaluating the performance of deep learning models. You can use scikit-learn’s train_test_split()
function to split the data.
52. Building a Neural Network with Keras:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

model = Sequential()
model.add(Dense(64, activation='relu', input_shape=(input_dim,)))
model.add(Dense(32, activation='relu'))
model.add(Dense(num_classes, activation='softmax'))
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
Keras provides a high-level API for building neural networks. You can create a sequential model using the Sequential
class and add layers using the add()
method. The compile()
method is used to configure the model’s optimizer, loss function, and metrics.
53. Training a Neural Network:
model.fit(X_train, y_train, epochs=10, batch_size=32, validation_data=(X_test, y_test))
Once the model is built and compiled, you can train it using the fit()
method. Specify the training data, number of epochs, batch size, and validation data (if available).
54. Saving and Loading Models:
model.save('model.h5')
loaded_model = keras.models.load_model('model.h5')
You can save a trained model to disk using the save()
method and later load it using keras.models.load_model()
. This is useful for deploying models or reusing them without retraining.
55. Early Stopping:
from tensorflow.keras.callbacks import EarlyStopping

early_stopping = EarlyStopping(monitor='val_loss', patience=5)
model.fit(X_train, y_train, epochs=50, validation_data=(X_test, y_test), callbacks=[early_stopping])
Early stopping is a technique to prevent overfitting by monitoring a validation metric and stopping the training if there is no improvement after a certain number of epochs. Keras provides the EarlyStopping
callback to implement early stopping.
56. Dropout Regularization:
from tensorflow.keras.layers import Dropout

model.add(Dense(64, activation='relu'))
model.add(Dropout(0.5))
Dropout is a regularization technique that randomly drops out a fraction of neurons during training to prevent overfitting. You can add dropout layers using the Dropout
class in Keras.
57. Batch Normalization:
from tensorflow.keras.layers import BatchNormalization

model.add(Dense(64, activation='relu'))
model.add(BatchNormalization())
Batch normalization is a technique that normalizes the activations of a layer to stabilize the training process and improve convergence. You can add batch normalization layers using the BatchNormalization
class in Keras.
58. Learning Rate Scheduling:
from tensorflow.keras.callbacks import LearningRateScheduler

def lr_scheduler(epoch):
    if epoch < 10:
        return 0.001
    else:
        return 0.001 * tf.math.exp(0.1 * (10 - epoch))

lr_callback = LearningRateScheduler(lr_scheduler)
model.fit(X_train, y_train, epochs=50, callbacks=[lr_callback])
Learning rate scheduling adjusts the learning rate during training to optimize the convergence. Keras provides the LearningRateScheduler
callback to define custom learning rate schedules.
59. Data Augmentation:
from tensorflow.keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(rotation_range=20, width_shift_range=0.1, height_shift_range=0.1, horizontal_flip=True)
datagen.fit(X_train)
model.fit(datagen.flow(X_train, y_train, batch_size=32), epochs=10)
Data augmentation is a technique to artificially increase the size of the training dataset by applying random transformations to the existing data. Keras provides the ImageDataGenerator
class for image data augmentation.
60. Transfer Learning:
from tensorflow.keras.applications import VGG16

base_model = VGG16(weights='imagenet', include_top=False, input_shape=(224, 224, 3))
x = base_model.output
x = tf.keras.layers.GlobalAveragePooling2D()(x)
x = tf.keras.layers.Dense(num_classes, activation='softmax')(x)
model = tf.keras.Model(inputs=base_model.input, outputs=x)
Transfer learning involves using pre-trained models as a starting point for a new task. Keras provides several pre-trained models such as VGG16, ResNet, and Inception that can be used for transfer learning.
61. Fine-tuning Pre-trained Models:
base_model.trainable = False
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
model.fit(X_train, y_train, epochs=10)

base_model.trainable = True
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5), loss='categorical_crossentropy', metrics=['accuracy'])
model.fit(X_train, y_train, epochs=10)
Fine-tuning involves unfreezing some layers of a pre-trained model and training them along with the new layers added for the specific task. This allows the model to adapt to the new dataset while leveraging the pre-learned features.
62. Gradient Clipping:
optimizer = tf.keras.optimizers.Adam(clipvalue=1.0)
model.compile(optimizer=optimizer, loss='categorical_crossentropy')
Gradient clipping is a technique to prevent exploding gradients during training. It limits the magnitude of the gradients to a specific value. Keras optimizers provide the clipvalue
or clipnorm
arguments to clip the gradients.
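Both clipping rules are simple to state in NumPy; a sketch of what clipvalue and clipnorm compute (the gradient values are illustrative):

```python
import numpy as np

def clip_by_value(grad, clip_value):
    # clipvalue: clamp every component into [-clip_value, clip_value].
    return np.clip(grad, -clip_value, clip_value)

def clip_by_norm(grad, clip_norm):
    # clipnorm: rescale the whole vector when its L2 norm exceeds the
    # cap, which preserves the gradient's direction.
    norm = np.linalg.norm(grad)
    if norm > clip_norm:
        return grad * (clip_norm / norm)
    return grad

g = np.array([3.0, 4.0])  # L2 norm is 5
g_val = clip_by_value(g, 1.0)
g_norm = clip_by_norm(g, 1.0)
```

Clipping by norm is often preferred for recurrent networks because it rescales rather than distorts the gradient direction.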
63. Custom Loss Functions:
from tensorflow.keras.losses import Loss

class CustomLoss(Loss):
    def call(self, y_true, y_pred):
        # Implement your custom loss calculation here
        loss = ...
        return loss
model.compile(optimizer='adam', loss=CustomLoss())
Keras allows you to define custom loss functions by subclassing the Loss
class and implementing the call()
method. This is useful when you need a specific loss function that is not provided by Keras.
64. Custom Metrics:
from tensorflow.keras.metrics import Metric

class CustomMetric(Metric):
    def __init__(self, name='custom_metric'):
        super(CustomMetric, self).__init__(name=name)
        # Initialize any variables or states here

    def update_state(self, y_true, y_pred, sample_weight=None):
        # Update the metric state based on the current batch
        # Implement your custom metric calculation here
        pass

    def result(self):
        # Compute and return the final metric value
        # Implement your custom metric aggregation here
        pass

    def reset_states(self):
        # Reset any variables or states between epochs
        # Implement any necessary reset logic here
        pass

model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=[CustomMetric()])
Keras allows you to define custom metrics by subclassing the Metric
class and implementing the update_state()
, result()
, and reset_states()
methods. This is useful when you need to track a specific metric that is not provided by Keras.
65. Fine-tuning a pre-trained model using Hugging Face Transformers:
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

model_name = "bert-base-uncased"
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
training_args = TrainingArguments(
output_dir="output",
num_train_epochs=3,
per_device_train_batch_size=16,
per_device_eval_batch_size=64,
warmup_steps=500,
weight_decay=0.01,
logging_dir="logs",
)
trainer = Trainer(
model=model,
args=training_args,
train_dataset=train_dataset,
eval_dataset=eval_dataset,
)
trainer.train()
66. Using LangChain for question answering with a pre-trained model:
from langchain.llms import OpenAI
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

llm = OpenAI(temperature=0.9)
prompt = PromptTemplate(
input_variables=["question"],
template="Q: {question}\nA:",
)
chain = LLMChain(llm=llm, prompt=prompt)
question = "What is the capital of France?"
response = chain.run(question)
print(response)
67. Creating an index using LlamaIndex:
from llama_index import GPTSimpleVectorIndex, SimpleDirectoryReader

documents = SimpleDirectoryReader("data").load_data()
index = GPTSimpleVectorIndex(documents)
query = "What is the main topic discussed in the documents?"
response = index.query(query)
print(response)
68. Prompt engineering with few-shot learning:
prompt = """
Given the following examples, answer the question below.

Example 1:
Q: What is the capital of France?
A: The capital of France is Paris.
Example 2:
Q: What is the largest planet in our solar system?
A: The largest planet in our solar system is Jupiter.
Example 3:
Q: Who wrote the play "Romeo and Juliet"?
A: William Shakespeare wrote the play "Romeo and Juliet".
Question: What is the currency used in Japan?
Answer:
"""
response = generate_text(prompt)
print(response)
69. Zero-shot classification using Hugging Face’s pipeline:
from transformers import pipeline

classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
sequence = "The movie was a thrilling adventure with amazing special effects."
candidate_labels = ["positive", "negative", "neutral"]
result = classifier(sequence, candidate_labels)
print(f"Sequence: {sequence}")
for label, score in zip(result['labels'], result['scores']):
    print(f"{label}: {score:.2f}")
70. Text summarization using LangChain and OpenAI:
from langchain.llms import OpenAI
from langchain.chains.summarize import load_summarize_chain
from langchain.docstore.document import Document

llm = OpenAI(temperature=0)
text = """
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed euismod, nulla sit amet aliquam lacinia, nisl nisl aliquam nisl, nec aliquam nisl nisl sit amet nisl. Sed euismod, nulla sit amet aliquam lacinia, nisl nisl aliquam nisl, nec aliquam nisl nisl sit amet nisl. Sed euismod, nulla sit amet aliquam lacinia, nisl nisl aliquam nisl, nec aliquam nisl nisl sit amet nisl.
"""
summary_chain = load_summarize_chain(llm, chain_type="map_reduce")
summary = summary_chain.run([Document(page_content=text)])
print(summary)
71. Storing and retrieving embeddings using ChromaDB:
import chromadb
from chromadb.config import Settings

chroma_settings = Settings(
    chroma_db_impl="duckdb+parquet",
    persist_directory="db"
)
client = chromadb.Client(chroma_settings)
collection = client.create_collection(name="my_collection")
collection.add(
    documents=["This is a document", "This is another document"],
    metadatas=[{"source": "example1"}, {"source": "example2"}],
    ids=["id1", "id2"]
)
results = collection.query(
    query_texts=["This is a query document"],
    n_results=2
)
print(results)
72. Similarity search using FAISS:
import faiss
import numpy as np

d = 64 # dimension
nb = 100000 # database size
nq = 10000 # number of queries
np.random.seed(1234) # make reproducible
xb = np.random.random((nb, d)).astype('float32')
xb[:, 0] += np.arange(nb) / 1000.
xq = np.random.random((nq, d)).astype('float32')
xq[:, 0] += np.arange(nq) / 1000.
index = faiss.IndexFlatL2(d) # build the index
print(index.is_trained)
index.add(xb) # add vectors to the index
print(index.ntotal)
k = 4 # we want to see 4 nearest neighbors
D, I = index.search(xq, k) # actual search
print(I[:5]) # neighbors of the 5 first queries
print(D[:5]) # distances of the 5 first queries
73. LoRA (Low-Rank Adaptation) fine-tuning using Hugging Face’s PEFT:
from transformers import AutoModelForSequenceClassification, AutoTokenizer, DataCollatorWithPadding, TrainingArguments, Trainer
from datasets import load_dataset
from peft import LoraConfig, get_peft_model, get_peft_model_state_dict

model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
peft_config = LoraConfig(
task_type="SEQ_CLS",
inference_mode=False,
r=8,
lora_alpha=32,
lora_dropout=0.1,
target_modules=["q", "v"],
)
model = get_peft_model(model, peft_config)
# Prepare the dataset and data collator
dataset = load_dataset("glue", "mrpc", split="train")
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
training_args = TrainingArguments(
output_dir="output_dir",
num_train_epochs=3,
per_device_train_batch_size=8,
per_device_eval_batch_size=8,
gradient_accumulation_steps=2,
learning_rate=1e-4,
weight_decay=0.01,
warmup_ratio=0.06,
fp16=True,
evaluation_strategy="epoch",
save_strategy="epoch",
logging_steps=200,
save_total_limit=2,
seed=42,
)
trainer = Trainer(
model=model,
args=training_args,
train_dataset=dataset,
data_collator=data_collator,
)
trainer.train()
model.save_pretrained("output_dir")
74. QLoRA (Quantized Low-Rank Adaptation) fine-tuning using Hugging Face’s PEFT:
from transformers import AutoModelForSequenceClassification, AutoTokenizer, DataCollatorWithPadding, TrainingArguments, Trainer
from datasets import load_dataset
from peft import LoraConfig, get_peft_model, prepare_model_for_int8_training

model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
peft_config = LoraConfig(
task_type="SEQ_CLS",
inference_mode=False,
r=8,
lora_alpha=32,
lora_dropout=0.1,
target_modules=["q", "v"],
)
model = prepare_model_for_int8_training(model)
model = get_peft_model(model, peft_config)
# Prepare the dataset and data collator
dataset = load_dataset("glue", "mrpc", split="train")
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
training_args = TrainingArguments(
output_dir="output_dir",
num_train_epochs=3,
per_device_train_batch_size=8,
per_device_eval_batch_size=8,
gradient_accumulation_steps=2,
learning_rate=1e-4,
weight_decay=0.01,
warmup_ratio=0.06,
fp16=True,
evaluation_strategy="epoch",
save_strategy="epoch",
logging_steps=200,
save_total_limit=2,
seed=42,
)
trainer = Trainer(
model=model,
args=training_args,
train_dataset=dataset,
data_collator=data_collator,
)
trainer.train()
model.save_pretrained("output_dir")
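The point of quantizing the frozen base model is memory: weight storage scales linearly with bits per parameter. A back-of-the-envelope sketch (the ~110M parameter count for bert-base-uncased is an approximation):

```python
def weight_memory_mib(num_params, bits_per_param):
    """Approximate memory needed to store the model weights, in MiB."""
    return num_params * bits_per_param / 8 / 2**20

params = 110_000_000  # rough parameter count of bert-base-uncased (assumption)
fp32 = weight_memory_mib(params, 32)
int8 = weight_memory_mib(params, 8)
print(f"fp32: {fp32:.0f} MiB, int8: {int8:.0f} MiB")  # fp32: 420 MiB, int8: 105 MiB
```

Activations, optimizer state, and the adapters themselves add overhead on top of this, but the 4x saving on frozen weights is where most of the benefit comes from.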
75. Text-to-image generation using DALL-E:
import openai
import requests
from PIL import Image
from io import BytesIO

openai.api_key = "YOUR_API_KEY"
prompt = "A colorful bird sitting on a tree branch"
response = openai.Image.create(
    prompt=prompt,
    n=1,
    size="512x512"
)
image_url = response['data'][0]['url']
response = requests.get(image_url)
Image.open(BytesIO(response.content)).show()
The Image endpoint returns a URL for each generated image; the snippet downloads it with requests and displays it with Pillow. (This uses the pre-1.0 openai client interface.)
76. Entity recognition using spaCy:
import spacy

nlp = spacy.load("en_core_web_sm")
text = "Apple Inc. is an American multinational technology company headquartered in Cupertino, California."
doc = nlp(text)
for ent in doc.ents:
    print(ent.text, ent.label_)
spaCy's pretrained pipeline detects named entities such as organizations and locations; each entity exposes its text span and a label like ORG or GPE.
77. Sentence embedding using Hugging Face’s sentence transformers:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')
sentences = [
    "The quick brown fox jumps over the lazy dog.",
    "I love to eat pizza.",
    "The sky is blue and the sun is shining.",
]
embeddings = model.encode(sentences)
for sentence, embedding in zip(sentences, embeddings):
    print(f"Sentence: {sentence}")
    print(f"Embedding shape: {embedding.shape}")
    print("---")
Sentence transformers map whole sentences to fixed-size dense vectors (384 dimensions for all-MiniLM-L6-v2), which can then be compared for semantic similarity, clustered, or indexed for search.
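Sentence embeddings are mostly useful when compared to each other, and cosine similarity is the standard metric. A minimal NumPy sketch that works on the vectors model.encode() returns:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0 (identical direction)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0 (orthogonal)
```

Semantically similar sentences produce vectors with similarity close to 1, unrelated ones closer to 0.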
78. Text classification using Hugging Face’s pipeline:
from transformers import pipeline

classifier = pipeline("text-classification", model="distilbert-base-uncased-finetuned-sst-2-english")
texts = [
    "This movie was absolutely amazing!",
    "I didn't enjoy the book at all.",
    "The product worked well, but the customer service was terrible.",
]
results = classifier(texts)
for text, result in zip(texts, results):
    print(f"Text: {text}")
    print(f"Label: {result['label']}, Score: {round(result['score'], 2)}")
    print("---")
The pipeline API bundles tokenization, inference, and post-processing in one call; here a DistilBERT model fine-tuned on SST-2 returns a POSITIVE or NEGATIVE label with a confidence score for each input.
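The score the pipeline reports is the softmax probability of the predicted class over the model's raw logits. A minimal sketch of that final post-processing step (the logit values are hypothetical):

```python
import math

def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([-2.1, 3.4])  # hypothetical [NEGATIVE, POSITIVE] logits
print(max(probs))  # the score the pipeline would report for the winning label
```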
79. Topic modeling using Gensim:
from gensim import corpora, models

documents = [
    "The quick brown fox jumps over the lazy dog.",
    "The lazy dog sleeps all day.",
    "The quick brown fox is very agile.",
    "The dog is man's best friend.",
]
texts = [document.lower().split() for document in documents]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
lda_model = models.LdaMulticore(corpus=corpus, id2word=dictionary, num_topics=2)
for idx, topic in lda_model.print_topics(-1):
    print(f"Topic {idx}: {topic}")
LDA discovers latent topics as weighted distributions over words: each document is tokenized, converted to a bag-of-words vector, and the trained model prints the top words for each of the two topics.
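Under the hood, Dictionary assigns every word an integer id and doc2bow turns a document into sparse (id, count) pairs. A pure-Python illustration of that transformation (a sketch, not Gensim's implementation):

```python
def build_vocab(texts):
    """Map each word to an integer id, in order of first appearance."""
    vocab = {}
    for text in texts:
        for word in text:
            vocab.setdefault(word, len(vocab))
    return vocab

def doc_to_bow(vocab, text):
    """Sparse bag-of-words: sorted (word_id, count) pairs."""
    counts = {}
    for word in text:
        if word in vocab:
            counts[vocab[word]] = counts.get(vocab[word], 0) + 1
    return sorted(counts.items())

vocab = build_vocab([["the", "dog"], ["the", "fox"]])
print(doc_to_bow(vocab, ["the", "the", "dog"]))  # [(0, 2), (1, 1)]
```

These sparse vectors are what the LDA model actually consumes; word order is discarded entirely.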
80. Word embeddings using Gensim:
from gensim.models import Word2Vec

sentences = [
    ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"],
    ["the", "lazy", "dog", "sleeps", "all", "day"],
    ["the", "quick", "brown", "fox", "is", "very", "agile"],
]
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4)
word = "fox"
similar_words = model.wv.most_similar(word)
print(f"Words similar to '{word}':")
for similar_word, similarity in similar_words:
    print(f"- {similar_word}: {similarity}")
Word2Vec learns a dense vector per word from co-occurrence statistics; most_similar ranks the vocabulary by cosine similarity to the query word. (On a toy corpus this small, the neighbors will not be meaningful.)
81. Part-of-speech tagging using NLTK:
import nltk
from nltk.tokenize import word_tokenize

nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
text = "The quick brown fox jumps over the lazy dog."
tokens = word_tokenize(text)
pos_tags = nltk.pos_tag(tokens)
for token, pos in pos_tags:
    print(f"Word: {token}, POS: {pos}")
NLTK first tokenizes the sentence, then its averaged perceptron tagger assigns a Penn Treebank part-of-speech tag (e.g. DT, JJ, NN, VBZ) to each token.
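Penn Treebank tags are fine-grained, and downstream code often collapses them into coarse word classes. A minimal sketch (the mapping below covers only a few common tags and is introduced here for illustration):

```python
COARSE = {
    "DT": "determiner", "JJ": "adjective", "NN": "noun", "NNS": "noun",
    "VB": "verb", "VBZ": "verb", "IN": "preposition",
}

def coarse_tag(penn_tag):
    """Collapse a Penn Treebank tag to a coarse word class (partial mapping)."""
    return COARSE.get(penn_tag, "other")

print(coarse_tag("NN"))   # noun
print(coarse_tag("VBZ"))  # verb
```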
82. Model testing with pytest:
import pytest
from transformers import AutoModelForSequenceClassification, AutoTokenizer

@pytest.fixture
def model_and_tokenizer():
    model_name = "distilbert-base-uncased-finetuned-sst-2-english"
    model = AutoModelForSequenceClassification.from_pretrained(model_name)
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    return model, tokenizer

def test_sentiment_analysis(model_and_tokenizer):
    model, tokenizer = model_and_tokenizer
    positive_text = "This movie was amazing!"
    negative_text = "The book was terrible and boring."
    positive_input_ids = tokenizer(positive_text, return_tensors="pt").input_ids
    negative_input_ids = tokenizer(negative_text, return_tensors="pt").input_ids
    positive_output = model(positive_input_ids)[0].argmax().item()
    negative_output = model(negative_input_ids)[0].argmax().item()
    assert positive_output == 1  # positive sentiment
    assert negative_output == 0  # negative sentiment
Wrapping the model and tokenizer loading in a pytest fixture keeps the expensive setup in one place; the test then asserts that clearly positive and negative inputs map to the expected class indices.
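The two hard-coded assertions above generalize naturally with pytest's parametrize decorator, and the argmax step can be factored into a plain function that is testable without loading the model (predict_label is a name introduced here for illustration):

```python
import pytest

def predict_label(logits):
    """Index of the largest logit (0 = negative, 1 = positive for SST-2 heads)."""
    return max(range(len(logits)), key=lambda i: logits[i])

@pytest.mark.parametrize("logits, expected", [
    ([-1.2, 3.4], 1),   # clearly positive
    ([2.0, -0.5], 0),   # clearly negative
])
def test_predict_label(logits, expected):
    assert predict_label(logits) == expected
```

Keeping the decision logic in a pure function lets most of your test suite run in milliseconds, reserving the slow model-loading fixture for a few end-to-end checks.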
83. Sentiment Analysis with VADER:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()
text = "This movie was amazing! The acting was brilliant and the plot kept me engaged throughout."
scores = analyzer.polarity_scores(text)
print(f"Positive: {scores['pos']}")
print(f"Neutral: {scores['neu']}")
print(f"Negative: {scores['neg']}")
print(f"Compound: {scores['compound']}")
VADER is a lexicon- and rule-based sentiment analyzer that needs no training; polarity_scores returns positive, neutral, and negative proportions plus a normalized compound score between -1 and 1.
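The compound score is usually turned into a discrete label with the ±0.05 thresholds the VADER authors suggest. A minimal sketch of that convention:

```python
def vader_label(compound, threshold=0.05):
    """Map VADER's compound score (-1..1) to a coarse sentiment label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

print(vader_label(0.92))   # positive
print(vader_label(-0.40))  # negative
print(vader_label(0.01))   # neutral
```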
84. TensorFlow: Basic Operations:
import tensorflow as tf

a = tf.constant(3.0)
b = tf.constant(4.0)
c = tf.add(a, b)
print(c.numpy())  # Output: 7.0
This code snippet demonstrates basic operations in TensorFlow. tf.constant creates the constant tensors a and b, tf.add performs element-wise addition to produce c, and the value of c is printed using the numpy() method.
85. PyTorch: Basic Operations:
import torch

a = torch.tensor(3.0)
b = torch.tensor(4.0)
c = a + b
print(c.item())  # Output: 7.0
This code snippet demonstrates basic operations in PyTorch. torch.tensor creates the tensors a and b, the + operator performs element-wise addition to produce c, and the value of c is printed using the item() method.