The creation of the analytical dataset is a crucial stage in the AI/ML lifecycle, embodying the principle “garbage in, garbage out.” An analytical dataset comprises the inputs for your AI/ML model, including the target variable (Y) for supervised learning. This dataset is pivotal as it directly influences the model’s quality and effectiveness. In the upcoming section, we will delve into essential considerations for building, cleaning, and transforming data to construct your first analytical dataset. This foundational work is vital for hypothesis testing and feature selection, setting the stage for developing your initial model version.
Tie data requirements directly to defined user stories or MECE hypothesis trees to clarify what data is needed and why.
For example, suppose a hospital network wants to leverage AI/ML to enhance patient care and optimize hospital operations. The goals are to reduce patient readmission rates and improve the efficiency of resource allocation across its facilities.
One way to lay out requirements is to create a data requirements document listing patient medical history, diagnosis information, and previous admissions. A better way is to first develop user stories and hypotheses, then tie the data requirements back to them, which makes each request easier to understand. Let’s explore that in further detail below:
User Story: As a hospital administrator, I want to predict patient readmission risks to implement targeted interventions and reduce readmission rates.
Hypothesis Tree:
- Hypothesis 1: Patients with certain chronic conditions are more likely to be readmitted.
- Data needed: Patient medical history, diagnosis information, previous admissions.
- Hypothesis 2: Readmission rates are higher for patients discharged on weekends or holidays.
- Data needed: Discharge dates and times, readmission records.
- Hypothesis 3: Post-discharge follow-up improves patient outcomes and reduces readmissions.
- Data needed: Follow-up appointment records, patient outcome measures.
Ensure IT and stakeholders understand the ‘why’ behind data requests to avoid missing essential data and its interconnections.
The specific value here is that by giving IT and data stewards the bigger picture, they can proactively guide you on how to join all the information together, which becomes valuable when building your overall dataset.
To build an AI/ML model you first need to identify which features have the most predictive power. To that end, you need to go through multiple rounds of hypothesis testing, feature engineering, and feature selection. It’s best to create an analytical dataset upfront for all of this, rather than run tests across disaggregated datasets.
Say you have two hypotheses:
- Patients with certain chronic conditions are more likely to be readmitted.
- Readmission rates are higher for patients discharged on weekends or holidays.
Chances are the data to test the two hypotheses comes from different datasets. You could test them using separate datasets or aggregate them into one. There are two reasons why you should aggregate:
- Consistency in Data Handling: Using one dataset ensures that the data processing, cleaning, and transformation steps are consistent across all variables. This uniformity reduces the risk of discrepancies or biases that might arise from handling datasets separately.
- Cross-Hypothesis Insights: Investigating both hypotheses together may uncover insights that would be missed if the datasets were analyzed separately. For instance, the impact of discharge timing might be different for patients with chronic conditions compared to those without, which could lead to a more nuanced understanding and targeted interventions.
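As a sketch of what aggregating the two sources might look like with pandas, here the table names, columns, and values are all hypothetical:

```python
import pandas as pd

# Hypothetical source tables; real schemas will differ.
conditions = pd.DataFrame({
    "patient_id": [1, 2, 3],
    "has_chronic_condition": [True, False, True],
})
discharges = pd.DataFrame({
    "patient_id": [1, 2, 3],
    "discharged_on_weekend": [False, True, True],
    "readmitted_30d": [True, False, True],
})

# One analytical dataset lets both hypotheses be tested with consistent
# handling, including interactions between conditions and discharge timing.
analytical = conditions.merge(discharges, on="patient_id", how="inner")
```

With both variables in one table, a cross-hypothesis question like “does weekend discharge matter more for chronically ill patients?” is a simple group-by rather than a cross-dataset reconciliation exercise.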
You need to understand the level your model is required to be at and the level the data is required to be at to generate business insights that can be adopted.
Using the patient re-admissions use case, by using patient-level data, the model can accurately identify factors that contribute to readmissions for individual patients. This allows the hospital to implement targeted interventions, like personalized follow-up care plans, to prevent readmissions, thus improving patient outcomes and reducing costs.
If the data were only at a more aggregated level, such as the department or ward level, it would be difficult to identify specific risk factors for individual patients, leading to less effective readmission prevention strategies.
Conversely, if the data were too granular, focusing on minute-by-minute monitoring data for all patients, it could overwhelm the model with irrelevant information, making it less efficient and harder to extract meaningful insights.
As you come to understand data sources in further detail, update your architecture and data flow diagrams so you maintain clarity on how core data sources integrate.
People are generally visual learners, and when you update the architecture and data flow diagrams it becomes easier for people to track requirements (what data to gather), ideate on tasks (what pipelines need to be developed, how data is to be joined), and develop acceptance criteria (what the end state looks like). Tracking all of this solely through a Kanban board or Jira tickets is a cumbersome process that leads to challenges if it’s not complemented by an effective visual.
When your data is pulled for analysis, it passes through various storage layers. Understanding what these are is essential to building a robust solution.
Data Lake: This is a central repository that allows you to store all your structured and unstructured data at any scale. Data pulled from various sources is often stored here first. Purpose: The data lake acts as a staging area, where data can be kept in its raw form until it’s needed for analysis. This setup allows for greater flexibility in handling and processing data.
Integration with Cloud Platforms:
- In Google Cloud Platform (GCP), data can be moved from a data lake (like Google Cloud Storage) to analytics services like BigQuery for analysis, or to AI Platform for machine learning tasks.
- In Amazon SageMaker, you might store your data in Amazon S3 (Simple Storage Service) as your data lake, then import it into SageMaker for machine learning model training and analysis.
- For SAS, data can be imported from various sources, including data lakes, into the SAS environment for advanced analytics, data management, and business intelligence.
When you’re ready to analyze your data, it’s often brought into a temporary storage or processing environment that is part of the analytics or machine learning platform you’re using. This could be:
- An instance in Google Cloud’s BigQuery, a powerful data warehouse tool for analysis.
- A notebook instance or data processing job in Amazon SageMaker.
- A workspace in SAS where data is loaded for analysis and model building.
Understand the importance of primary keys in dataset construction and how to identify them.
In a dataset, a primary key is a specific column or set of columns that uniquely identifies each row of the table. Taking the re-admissions use case as an example:
Each patient’s record in the dataset must be unique to avoid duplicating or mixing up information. A primary key (like a patient ID) ensures that each record is distinct, so the AI/ML model can accurately track and analyze each patient’s admission and readmission events.
The primary key allows for the linkage of different data sources or tables. For example, patient ID can link a patient’s demographic information, medical history, and readmission records. This linkage is crucial for a comprehensive analysis of factors influencing readmissions.
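A minimal pandas sketch of using a primary key to link the tables mentioned above; the tables and values are invented for illustration:

```python
import pandas as pd

# Hypothetical tables, each keyed on patient_id (the primary key).
demographics = pd.DataFrame({"patient_id": [101, 102], "age": [67, 45]})
history = pd.DataFrame({"patient_id": [101, 102], "chronic_conditions": [2, 0]})
readmissions = pd.DataFrame({"patient_id": [101, 102], "readmitted": [True, False]})

# The primary key must be unique before joining, or joins will fan out rows.
assert demographics["patient_id"].is_unique

linked = (demographics
          .merge(history, on="patient_id")
          .merge(readmissions, on="patient_id"))
```

The `is_unique` check is worth running on every table before a join; duplicate keys silently multiply rows and corrupt downstream counts.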
Recognize whether your data is structured, semi-structured, or unstructured, and understand the implications for processing and analysis.
Structured Data: This is like data in a neat Excel spreadsheet, where everything has its place in rows and columns. Examples include names, dates, or addresses. It’s easy to enter, store, search, and analyze because of its organized format. For example, in a customer database, each customer’s information (like name, age, and address) would be in separate columns.
Since structured data is so organized, it’s straightforward to use in databases, making it easier to input, query, and analyze. This makes tasks like data pipelining (moving data from one system to another), storage, and model development more efficient and less costly.
Semi-structured Data: This type of data isn’t as neat as structured data but still has some organization. Think of emails where the format isn’t as rigid: they have standard fields (like “To,” “From,” “Subject”), but the body can be free-form text. Other examples include JSON and XML files.
Semi-structured data requires more work to process than structured data because it’s not in a strict table format. However, it’s still easier to work with than unstructured data because it has some level of organization. Tools that can handle variability in data formats are needed for processing and analysis.
Unstructured Data: This is like a big, messy closet where everything is just thrown in. Examples include video files, images, and free-form text documents. This data doesn’t fit neatly into tables, making it harder to organize and analyze.
Unstructured data is the most challenging to work with because it requires more advanced techniques and technologies for processing and analysis. It takes up more storage space, and extracting useful information often involves complex processes like natural language processing (NLP) for text, image recognition for photos, and data mining techniques.
Handling strategy for semi-structured data:
- Parsing: Use software tools to read and extract the useful parts of the data, like pulling contact info from an XML file of customer data.
- Transformation: Convert the data into a more structured format, like a table, so it’s easier to analyze. For example, turning a JSON file of tweets into a spreadsheet where each row is a tweet and columns represent different tweet attributes like the date, text, and user.
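For instance, `pandas.json_normalize` can flatten a JSON payload like the tweets example into a table; the sample records below are made up:

```python
import pandas as pd

# Hypothetical semi-structured payload (e.g., exported tweets).
tweets = [
    {"date": "2024-01-05", "text": "Great service!",
     "user": {"name": "ana", "followers": 120}},
    {"date": "2024-01-06", "text": "Long wait times.",
     "user": {"name": "raj", "followers": 85}},
]

# json_normalize flattens nested fields into columns such as "user.name".
flat = pd.json_normalize(tweets)
```

Each tweet becomes one row, and the nested `user` object is spread into dot-named columns, which is usually enough structure for downstream analysis.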
Handling strategy for unstructured data:
Text Data (Natural Language Processing or NLP):
- Tokenization: Break down text into smaller pieces, like sentences or words, to understand and analyze it better.
- Sentiment Analysis: Determine the emotion or opinion expressed in the text, like figuring out if a product review is positive or negative.
- Keyword Extraction: Identify the most important words or phrases in a text to get a quick sense of what it’s about.
Image Data (Computer Vision):
- Image Recognition: Use AI to identify and classify objects in photos, like distinguishing between images of cats and dogs.
- Feature Extraction: Convert images into a form that a computer can process (like numbers or vectors) to analyze patterns or features in the visual data.
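As an illustration of the text-handling steps above, here is a minimal tokenization and keyword-extraction sketch using only the Python standard library; the review text and stopword list are made up:

```python
from collections import Counter
import re

review = "The battery life is great, but the battery takes hours to charge."

# Tokenization: lowercase the text and split it into word tokens.
tokens = re.findall(r"[a-z']+", review.lower())

# Keyword extraction (naive): most frequent tokens after dropping stopwords.
stopwords = {"the", "is", "but", "to", "a", "an", "and"}
keywords = Counter(t for t in tokens if t not in stopwords).most_common(2)
```

Real NLP pipelines use dedicated libraries for these steps, but the shape of the work — tokenize, filter, count — is the same.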
Understand the differences between time series, cross-sectional, and panel data and the implications they have for your analysis.
Cross-Sectional Data: Think of taking a snapshot of a population at a single point in time, like a survey of voter preferences during an election year. This data type captures various characteristics of subjects at one moment, without considering time.
With cross-sectional data, the focus is on comparing different entities at the same point in time. The analysis might be simpler than with panel or time series data, but it doesn’t capture changes over time, limiting its use for trend analysis.
Time Series Data: This is like a timeline of events or measurements, such as daily stock market prices. Time series data is collected at regular intervals over time, focusing on the temporal sequence of data points.
Analyzing time series data involves looking for trends, patterns, and cyclical behaviors over time. This requires specific techniques like forecasting and trend analysis, which can be crucial for predicting future events based on past patterns.
Panel Data: Imagine tracking the same students’ grades across several years. Panel data combines features of time series data (data points collected over time) and cross-sectional data (data collected at a single point in time but across multiple subjects or entities). It’s like having a detailed record of multiple entities over time.
Panel data allows for more complex analysis, such as examining changes over time within and across groups. It requires more sophisticated analytical approaches to disentangle within-entity changes from differences across entities.
Based on requirements, understand whether you need to work with real-time data or batch data & what implications that has for your data engineering & model build.
Real-time Data: This is like having a continuous flow of information, where data is processed immediately as it comes in, similar to a live traffic update system. Real-time processing allows for instant analysis and decision-making.
- Impact on Data Engineering: Requires a robust infrastructure that can handle streaming data continuously without lag or downtime. Technologies like Apache Kafka or Spark Streaming are often used.
- Impact on Model Build: Models need to be lightweight and efficient to make quick predictions. They might be updated frequently as new data arrives, necessitating automated, ongoing training processes.
Batch Data: This involves collecting data over a period, then processing it all at once, like calculating the average daily temperature from hourly readings at the end of each day. Batch processing is suitable when immediate responses are not critical.
- Impact on Data Engineering: The infrastructure can be simpler and less costly compared to real-time systems. Data is processed in large chunks at scheduled times, using tools like Apache Hadoop or batch processing features in cloud services.
- Impact on Model Build: Models can be more complex and computationally intensive since they don’t need to produce instant outputs. Training can occur on a less frequent schedule, using larger, consolidated datasets.
Create a data dictionary if one doesn’t exist to aid in data understanding and future use. Ensure it’s validated by an SME before moving forward.
I once spent 2–3 weeks on an analysis because my turnover #s weren’t syncing up with what was in a master report. It turns out that was because the column I was using to identify when an employee turned over followed a different definition from the one the report drew from.
Having access to data dictionaries would have helped resolve this issue, but none were available. In any case, we created one for both datasets to pay it forward, so other teams could move past similar challenges at greater speed.
Perform standard data quality checks for outliers, nulls, and inconsistent data types. Collaborate with stakeholders for validation.
Outliers: These are data points that are significantly different from the majority of the data. They can indicate variability in your data, experimental errors, or even true anomalies.
Basic Checks: Use methods like z-scores, IQR (Interquartile Range), or box plots to identify data points that are far from the norm. Plotting data, using scatter plots or histograms, can help visually spot outliers.
Nulls: These are missing or undefined values in your dataset. Null values can skew your analysis and lead to incorrect conclusions if not handled properly.
Basic Checks: Determine the number of null values in each column of your dataset. Assess the impact of nulls on your analysis and decide whether to impute (fill in), remove, or ignore these null values based on the context.
Inconsistent Data Types: This refers to having data that doesn’t match the expected type, like having alphabetical characters in a numeric field. Inconsistent data types can cause errors in data processing and analysis.
Basic Checks: Type Verification: Ensure each column in your dataset matches the expected data type (e.g., numbers, strings, dates). Format Consistency: Check that the data within each column follows a consistent format, especially for dates and categorical data.
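The three checks above can be sketched in a few lines of pandas; the sample data and the 1.5×IQR outlier threshold are illustrative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "length_of_stay": [2, 3, 4, 3, 50],   # 50 looks like an outlier
    "age": [34, np.nan, 61, 47, 29],      # one null value
})

# Nulls: count missing values per column.
null_counts = df.isnull().sum()

# Outliers: flag values outside 1.5 * IQR of the middle 50% of the data.
q1, q3 = df["length_of_stay"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["length_of_stay"] < q1 - 1.5 * iqr) |
              (df["length_of_stay"] > q3 + 1.5 * iqr)]

# Types: verify each column matches the expected dtype.
assert pd.api.types.is_numeric_dtype(df["age"])
```

Running these checks as a routine first pass surfaces most data-entry and extraction problems before they reach the model.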
Establish a clear process for handling missing or null values, balancing statistical methods with business logic and stakeholder input:
Suppose you have a dataset containing daily sales figures for a retail store, and you’re using this data to forecast future sales. You notice that some days have missing sales data (null values). Standard Imputation Approaches might suggest filling these missing values with the mean, median, or mode of the sales data. However, this approach may not always be appropriate because it doesn’t take into account the context or reason for the missing data.
For example, if sales data are missing for specific days because the store was closed for renovation or due to a holiday, using the average of surrounding days would not accurately reflect the store’s sales pattern. Imputing with a standard method could lead to misleading analysis, affecting the accuracy of the sales forecast.
A business stakeholder, such as the store manager, might inform you that the store was closed for a week for renovation, and therefore, it would not be accurate to impute sales data for that period based on historical averages. Instead, the stakeholder might suggest setting the sales to zero for those days or using data from the same period in previous years to estimate the impact of the closure on sales.
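A sketch of combining that business logic with statistical imputation for the renovation scenario; the dates and revenue figures are invented:

```python
import numpy as np
import pandas as pd

# Hypothetical daily sales with nulls, some during a known renovation closure.
sales = pd.DataFrame({
    "date": pd.date_range("2024-03-01", periods=7),
    "revenue": [1200.0, np.nan, np.nan, np.nan, np.nan, 1300.0, 1250.0],
})
closure = sales["date"].between("2024-03-03", "2024-03-05")

# Business logic first: the store was closed, so sales were genuinely zero.
sales.loc[closure, "revenue"] = 0.0

# Statistical imputation only for the nulls that remain unexplained.
sales["revenue"] = sales["revenue"].fillna(sales["revenue"].mean())
```

The order matters: applying the stakeholder’s knowledge first means the mean used for the remaining nulls reflects the true sales pattern, closure days included.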
Conduct basic analysis to validate data quality and understand its characteristics before deep diving into complex models.
To understand whether your data is correct, run some quick analyses re-creating basic reports and validate them against known numbers. Do this early; otherwise you may find yourself deep into the model-building process only to realize you have underlying data quality issues that should have been caught sooner.
Standardization and Normalization: Understand the concepts and their importance in preparing data for analysis.
Standardization is about adjusting the values in your dataset so they have a common scale. It’s like converting temperatures from Fahrenheit to Celsius so everyone understands the temperature in the same way. In technical terms, it involves subtracting the mean of your data and dividing by the standard deviation, resulting in data centered around zero with a standard deviation of one.
When it’s important: You standardize data when you need to compare features that are on different scales or units. For example, if you’re looking at height and weight together, these two are measured in completely different units (inches and pounds). Standardization makes these different measurements comparable.
Normalization, on the other hand, adjusts your data so that the range is scaled down to fit within a specific range, often between 0 and 1. It’s like adjusting scores from different tests to a common scale to see which scores are better, regardless of the test’s total points.
When it’s important: Normalization is key when your data needs to have a bounded range, especially for methods that rely on the length of the vectors in your data, like k-nearest neighbors (KNN) or neural networks.
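Both transforms are one-liners in NumPy; the height values below are illustrative:

```python
import numpy as np

heights = np.array([60.0, 65.0, 70.0, 75.0])  # inches

# Standardization: subtract the mean, divide by the standard deviation,
# yielding data centered at zero with unit standard deviation.
standardized = (heights - heights.mean()) / heights.std()

# Normalization (min-max): rescale values into the [0, 1] range.
normalized = (heights - heights.min()) / (heights.max() - heights.min())
```

In practice you would fit these scaling parameters (mean, std, min, max) on the training set only and reuse them on validation and test data.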
Understanding Data Distribution: Use basic analysis to comprehend data distribution, aiding in hypothesis testing and model building.
Before you dive into hypothesis testing or building predictive models, you should first understand the distribution of your data. This step can guide you in selecting the appropriate statistical tests and algorithms that align with your data’s characteristics. There are three approaches to use:
- Visual Inspection: Use plots like histograms, box plots, or Q-Q plots to visually assess the distribution of your data.
- Descriptive Statistics: Look at measures like mean, median, mode, range, variance, and standard deviation to get a sense of your data’s central tendency and spread.
- Statistical Tests: Perform tests like the Shapiro-Wilk test to check if your data is normally distributed, or use skewness and kurtosis measures to understand the degree of asymmetry and tailedness in your data distribution.
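The descriptive-statistics check takes seconds to run; here is a pandas sketch on made-up spending data with a few high spenders:

```python
import pandas as pd

# Hypothetical customer spending: mostly small purchases plus two big spenders.
spending = pd.Series([20, 25, 22, 30, 28, 24, 500, 650])

summary = {
    "mean": spending.mean(),
    "median": spending.median(),
    "skewness": spending.skew(),  # > 0 indicates a right-skewed distribution
    "kurtosis": spending.kurt(),  # heavy tails relative to a normal curve
}
```

A mean far above the median, together with positive skewness, is exactly the warning sign that averages will misrepresent the typical customer.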
If you want an example of what happens when you don’t do this check, let’s take a standard retail company.
The company assumes that customer spending is normally distributed (a bell curve), so they calculate the average spending and base their offers around this figure. What if the spending data is not normally distributed but heavily skewed with a few high spenders (outliers) and many low spenders?
Using the average in this skewed distribution could misrepresent what most customers typically spend. As a result, the special offers might be too high for the majority of customers, leading to unnecessary expenditure for the company and not effectively targeting the customer base.
Recognize and address class imbalance to improve model accuracy and prevent biased outcomes.
Imagine you have a basket of fruit with apples and oranges, but there are 95 apples and only 5 oranges. If you’re trying to build a machine that sorts the fruit and you teach it using this basket, it might get good at recognizing apples (because it sees them a lot) but not so good at recognizing oranges. In data science, this is a “class imbalance” — you have more of one class (apples) and fewer of another (oranges).
If you don’t address class imbalance, your model might become biased towards the majority class (apples in our example). This means it might often predict the majority class even when the minority class is the correct one, simply because it hasn’t seen enough of the minority class to learn about it properly.
Use machine learning algorithms that are inherently designed to handle class imbalance, or adjust the existing algorithms to give more weight to the minority class.
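One common adjustment is random oversampling of the minority class; here is a pandas sketch using the 95-apples-to-5-oranges example:

```python
import pandas as pd

# Hypothetical imbalanced labels: 95 "apple" rows vs 5 "orange" rows.
df = pd.DataFrame({"label": ["apple"] * 95 + ["orange"] * 5})

# Random oversampling: resample the minority class with replacement
# until both classes are equally represented.
majority = df[df["label"] == "apple"]
minority = df[df["label"] == "orange"]
minority_up = minority.sample(len(majority), replace=True, random_state=0)
balanced = pd.concat([majority, minority_up])
```

Alternatives include undersampling the majority class or passing class weights to the learning algorithm; oversampling is simply the easiest to sketch, and it should only ever be applied to the training split, never to validation or test data.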
Ensure data consistency across related fields, like verifying chronological order in date fields.
Consider a situation where you’re filling out a form with start and end dates for an event. Cross-field validation would involve making sure the start date is before the end date. It’s not just about checking that each date is valid on its own; it’s about ensuring they make sense together.
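A cross-field date check is nearly a one-liner in pandas; the sample rows are invented:

```python
import pandas as pd

events = pd.DataFrame({
    "start_date": pd.to_datetime(["2024-05-01", "2024-05-10"]),
    "end_date": pd.to_datetime(["2024-05-03", "2024-05-08"]),  # row 1 is invalid
})

# Cross-field validation: every start date must precede its end date.
invalid = events[events["start_date"] > events["end_date"]]
```

Flagging rather than silently dropping the invalid rows lets you take them back to stakeholders, since the error may be a swapped pair of dates rather than bad data.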
Feature Engineering: Learn the basics of creating features that can significantly influence model performance.
Feature engineering is the process of using domain knowledge to select, modify, or create new features from raw data to increase the predictive power of machine learning models. Using the hospital’s re-admissions use case, let’s take a couple of examples of feature engineering.
Convert continuous data, such as length of stay, into categories (short, medium, long) if the model performs better with categorical data. Take several treatments per day & group them into low, medium, and high-intensity treatments to see if a higher treatment intensity correlates with readmissions.
What you are seeing here is that the intensity of treatment or length of stay may have useful predictive power, but in its original form of 1 through 10 days or treatments it’s less conducive to prediction than categories like short, medium, and long stay.
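The binning described above can be done with `pandas.cut`; the cut points here are illustrative and should be validated with domain experts:

```python
import pandas as pd

stays = pd.DataFrame({"length_of_stay": [1, 3, 5, 8, 12]})

# Bin the continuous length of stay into categories. The boundaries
# (0-3, 3-7, 7+ days) are hypothetical, not clinical guidance.
stays["stay_category"] = pd.cut(
    stays["length_of_stay"],
    bins=[0, 3, 7, float("inf")],
    labels=["short", "medium", "long"],
)
```

The same pattern works for the treatment-intensity example: replace the day counts with treatments per day and relabel the bins low, medium, and high.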
There are some general rules of thumb to follow around the amount of data required for effective model training and validation.
- More complex models (like deep neural networks) usually need more data to learn effectively without overfitting (memorizing the data too closely).
- If your model is like a sophisticated machine with lots of parts (features and parameters), it needs a lot of data to train each part properly.
- The more varied or diverse your data, the more samples you will need to capture all the possible variations.
- If you’re studying something with lots of different behaviors or characteristics (like predicting customer behavior across different regions), you need enough data to cover all these variations.
- The level of accuracy or precision you need can dictate how much data is required. Higher accuracy typically requires more data to validate the model’s predictions.
- If your model needs to make very precise predictions (like in medical diagnoses), you’ll need a large amount of data to ensure the predictions are reliable and validated.
- You need enough data to not only train the model but also to validate and test it. This usually means dividing your data into separate sets for training, validation, and testing.
- Ensuring you have enough data to validate and test the model means setting aside some of your total data, so you need to collect enough to accommodate this split.
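The train/validation/test split mentioned above can be sketched with NumPy alone; the 70/15/15 ratios are a common convention, not a rule, and the data here is synthetic:

```python
import numpy as np

rng = np.random.default_rng(42)
data = rng.normal(size=(1000, 5))  # 1000 rows, 5 features (synthetic)

# Shuffle the row indices, then carve out 70% train / 15% val / 15% test.
idx = rng.permutation(len(data))
n_train, n_val = int(0.70 * len(data)), int(0.15 * len(data))
train = data[idx[:n_train]]
val = data[idx[n_train:n_train + n_val]]
test = data[idx[n_train + n_val:]]
```

For time series, replace the shuffle with a chronological split so the model is always validated on data that comes after its training window.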
Overall, a good idea is to begin with a smaller set of data to see how the model performs, then incrementally add more data until performance plateaus or meets your objectives.
Know the frequency of data updates, data collection methods, and data management processes.
You would be surprised at how often people build highly useful models only to find the data needed to refresh the model won’t be available at the necessary cadence, making the model less useful than intended.
Follow protocols set by your compliance, legal, and data governance team to avoid any risks.
Unless you are working at a startup, your IT, Legal, and Compliance teams should be able to identify what is sensitive data, and how to handle that data. Follow the rules & don’t skip steps no matter how cumbersome it may seem.
Understand the process of creating data pipelines to automate the flow of data from source to destination.
Creating a data pipeline is like setting up a conveyor belt in a factory to move products from one place to another automatically. In the context of data, this “conveyor belt” is a series of steps that automatically move and process data from its source (where it comes from) to its destination (where it’s needed).
Here’s how it works in simple terms:
Collection: First, you gather data from various sources. This could be like picking up raw materials from different parts of a warehouse. These sources might be databases, websites, sensors, or other places where data is generated.
Transportation: Next, you move the data from where it was collected to a place where it can be processed. This is like the conveyor belt in our factory, carrying products to different stations. In data terms, this often involves transferring data over a network to a central location.
Processing: Once the data arrives at its processing location, it’s cleaned and transformed into a useful format. This is similar to factory workers refining raw materials or assembling parts into a product. For data, processing can mean organizing it, fixing errors, or converting it into a format that’s easy for computers to handle.
Storage: After the data is processed, it’s stored in a place where it can be easily accessed later, like storing finished products in a warehouse. In the data world, this storage can be a database or a data lake, depending on how the data will be used.
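The collection, processing, and storage steps above can be sketched end to end in a few lines; the source records are invented and in-memory SQLite stands in for the destination warehouse:

```python
import sqlite3

# Collection: raw records as they might arrive from a source system.
raw = [
    {"patient_id": "101", "los": " 4 "},
    {"patient_id": "102", "los": "7"},
]

# Processing: clean the raw strings and convert to the expected types.
cleaned = [(int(r["patient_id"]), int(r["los"].strip())) for r in raw]

# Storage: load the processed rows into the destination database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stays (patient_id INTEGER, length_of_stay INTEGER)")
conn.executemany("INSERT INTO stays VALUES (?, ?)", cleaned)
row_count = conn.execute("SELECT COUNT(*) FROM stays").fetchone()[0]
```

A production pipeline wraps each of these stages in scheduling, error handling, and monitoring, but the conveyor-belt shape stays the same.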
Here are two to three rules of thumb for deciding whether to start with flat files for your MVP and when to transition to more complex data pipelines:
Start with Flat Files If:
- Your data volume is low and the processing needs are simple.
- You need to quickly validate your idea or model with minimal setup.
Switch to Data Pipelines When:
- Your data volume and complexity grow, making flat files cumbersome to manage.
- You need real-time processing, automation, or integration with other systems.
Consider Scalability Early:
- Even if starting with flat files, plan for future scaling to ensure a smooth transition to data pipelines when needed.
APIs for Data Acquisition: Learn how to set up APIs to pull data from different sources, facilitating real-time or periodic data updates.
APIs (Application Programming Interfaces) are used for data acquisition by providing a structured way to request and receive data from different sources. They act as a bridge between your application and external data sources, allowing for real-time or periodic data updates. Here’s how it works and how you can set up APIs for data acquisition:
You send a request to the API of a data source (like a weather service or social media platform), and in return, the API sends back the data you asked for. APIs can provide real-time data, allowing your application to have up-to-date information whenever needed. APIs can be set up to automatically pull data at regular intervals, facilitating periodic updates without manual intervention.
Determine which external services or platforms have the data you need and whether they offer an API. You might need to register or request access to use the API. This often involves creating an account with the data provider and getting an API key, which is a unique identifier for authenticating your requests.
Use programming languages like Python or Java, or tools designed for API integration, to write scripts or applications that make requests to the API. Libraries like requests in Python can simplify this process. For periodic data updates, you can schedule your API requests using task schedulers (like cron jobs in Unix/Linux) or workflow automation tools (like Apache Airflow).
Once you receive the data, process it as needed (which may include parsing, cleaning, and transforming) and store it in your database or data warehouse for further use.
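A minimal sketch of the parse-and-prepare step, assuming a hypothetical JSON payload like one a weather API might return; the network call itself is shown only as a comment so the sketch runs offline:

```python
import json

# Hypothetical API response body (a real call might be requests.get(url).text).
payload = '{"city": "Boston", "readings": [{"ts": "2024-06-01T00:00", "temp_f": 61}]}'

def parse_readings(raw):
    """Parse an API response string into flat rows ready for storage."""
    doc = json.loads(raw)
    return [{"city": doc["city"], **reading} for reading in doc["readings"]]

rows = parse_readings(payload)
# From here the rows would be cleaned further and inserted into a database,
# with the whole script scheduled via cron or an orchestrator like Airflow.
```

Keeping the parsing in its own function makes it easy to test against canned payloads before pointing the script at the live API.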
Understand that building an analytical dataset is an iterative process that may change as you build model versions and review the results.
Building an analytical dataset for a project like predicting patient readmissions in a hospital is indeed an iterative process. This means the dataset evolves as you refine your model and learn more about the data and what influences readmissions. Here’s how this iterative process can unfold in the patient readmissions use case:
Start with Basic Data: Initially, you might gather basic patient information, such as demographic details, medical history, and details of the current and past hospital stays.
After testing the initial model, you might find that certain factors, like the length of stay or specific diagnoses, are strong predictors of readmissions. Based on these insights, you decide to add more detailed data to your dataset, such as laboratory test results, medication records, or notes from healthcare providers.
With the additional data, you can create more complex features, such as trends in lab results or the number of readmissions in the past year. You then build and test a second version of the model using the enriched dataset, which may show improved accuracy in predicting readmissions.
The process continues iteratively, with each cycle of model building, testing, and evaluation providing insights that guide further data collection, feature engineering, and model refinement. Feedback from clinicians and hospital administrators might also inform the dataset’s evolution, highlighting other factors that could influence readmissions, such as patient satisfaction or post-discharge support.