Friday, 2 February 2024

Exploration - Exploratory data analysis

 Exploratory Data Analysis Overview

  • Deep dive into data using graphical techniques.
  • Relies on an open mind and open eyes to understand how the variables interact.
  • Aims to discover anomalies not previously identified.
  • May require stepping back to fix any anomalies found before continuing.


 Visualization Techniques in Data Analysis

  • Techniques range from simple line graphs and histograms to complex diagrams such as Sankey and network graphs.
  • Simple graphs can be combined into composite graphs for deeper insight into the data.
  • Graphs can be animated or made interactive to make exploration easier and more enjoyable.

 Interactive Data Exploration Techniques

  • Combining plots for deeper insights.
  • Overlaying several plots for better understanding.
  • Using Pareto diagrams or 80-20 diagrams.
  • Brushing and linking: plots are linked so that a selection or change in one graph is automatically transferred to the others.
  • Example: plotting the average score per country for several questions can reveal a high correlation between the answers.
  • Selecting points on one subplot highlights the corresponding points on the other graphs.
  • Histogram: bins a variable into discrete categories and counts the occurrences in each category.
  • Boxplot: shows the distribution within categories, including the maximum, minimum, median, and other characterizing measures such as quartiles.
  • Techniques include visualization, tabulation, clustering, and other modeling techniques.
  • Building simple models can also be part of exploratory analysis.
  • After data exploration, move on to building models.
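
The numbers behind a Pareto (80-20) diagram can be sketched in a few lines. This is a minimal illustration, not a plotting recipe: the complaint categories and the helper name `pareto_table` are invented for the example. It sorts categories by frequency and tracks the cumulative share, which is what the bars and the line of a Pareto diagram display.

```python
from collections import Counter


def pareto_table(values):
    """Return (category, count, cumulative_share) rows, largest category first."""
    counts = Counter(values)
    total = sum(counts.values())
    rows = []
    running = 0
    for category, count in counts.most_common():
        running += count
        rows.append((category, count, running / total))
    return rows


# Hypothetical complaint categories, invented for illustration.
complaints = ["late"] * 6 + ["damaged"] * 2 + ["wrong item"] + ["other"]

table = pareto_table(complaints)
for category, count, cumulative in table:
    print(f"{category:<12}{count:>3}{cumulative:>8.0%}")
```

In this made-up data the two largest categories account for 80% of all complaints, which is the kind of pattern a Pareto diagram is meant to expose.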

Key objectives of EDA:

  • Gaining familiarity with the data: This involves understanding the structure of the dataset, the data types of each variable, and any missing values present.
  • Identifying patterns and trends: EDA helps uncover relationships between variables, outliers, and potential errors within the data.
  • Formulating hypotheses: Based on the observations and insights gained, you can start forming hypotheses that you can later test through modeling or analysis.
  • Guiding further analysis: EDA lays the groundwork for choosing the appropriate techniques for modeling, feature engineering, and data cleaning.

Common steps involved in EDA:

  1. Data import and cleaning: This involves loading the data into your chosen environment and addressing any missing values, inconsistencies, or formatting issues.
  2. Univariate analysis: This step examines each variable individually, using summary statistics like mean, median, and standard deviation for numerical variables and frequency distributions for categorical variables. Visualizations like histograms, boxplots, and bar charts are helpful in understanding the distribution of each variable.
  3. Bivariate analysis: This step explores the relationships between two variables. Scatter plots, heatmaps, and correlation matrices are commonly used to visualize these relationships.
  4. Multivariate analysis: This step explores the relationships among several variables simultaneously. Dimensionality-reduction techniques such as principal component analysis (PCA) can be used for this purpose.
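
As a rough sketch of steps 2 and 3, the snippet below computes summary statistics for one variable and a Pearson correlation between two variables using only the Python standard library. The `hours` and `score` values are hypothetical, and `pearson` is a helper written for this example rather than a library function.

```python
import math
import statistics


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical data: hours studied and exam score for five students.
hours = [1, 2, 3, 4, 5]
score = [52, 55, 61, 64, 70]

# Univariate analysis: summary statistics for a single numerical variable.
summary = {
    "mean": statistics.fmean(score),
    "median": statistics.median(score),
    "stdev": statistics.stdev(score),
}

# Bivariate analysis: strength of the linear relationship between two variables.
r = pearson(hours, score)

print(summary)
print(f"correlation between hours and score: {r:.3f}")
```

In practice a data-frame library would usually replace these hand-written helpers; the standard library is used here only to keep the sketch self-contained.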

Benefits of EDA:

  • Improved data understanding: A thorough EDA provides a deep understanding of the data, its strengths, and weaknesses, allowing you to make informed decisions about further analysis.
  • Enhanced data quality: By identifying and addressing data quality issues early on, you can ensure the reliability and accuracy of your results.
  • More effective modeling: Understanding the data's characteristics helps you choose the most appropriate modeling techniques and avoid common pitfalls.
  • Clearer communication: EDA findings can be effectively communicated to stakeholders through data visualizations and reports, fostering better collaboration and project understanding.

Data Preparation - Cleansing, integrating, and transforming data

 Data Retrieval Phase and Modeling

  • Data from the retrieval phase is often a "diamond in the rough."
  • Sanitizing and preparing the data improves model performance and reduces the time spent correcting strange output.
  • Models need data in a specific format, so data transformation is almost always required.
  • Correct data errors as early in the process as possible.
  • When early correction is not possible in a realistic setting, take corrective actions in your own program.
  • The figure below shows common actions during data cleansing, integration, and transformation.

1. Data Cleaning 

Data cleansing is a subprocess of the data science process that focuses on removing errors in your data so that the data becomes a true and consistent representation of the processes it originates from.

1.1. Data Entry Errors Overview

  • Data collection and data entry are error-prone processes that often require human intervention.
  • Humans make errors such as typos or lapses in concentration.
  • Data collected by machines is not error-free either: errors arise from human sloppiness or from machine or hardware failure.
  • Examples of machine-related errors include transmission errors and bugs in the extract, transform, and load (ETL) phase.
  • For small data sets, every value can be checked by hand.
  • When a variable has only a few classes, errors can be detected by tabulating the data with counts.
  • For a variable that should take only two values, a frequency table shows whether those are truly the only values present.
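
The frequency-table check can be sketched as follows. The column contents and the set of allowed values are assumptions made up for the example; the idea is simply to count every distinct value and flag the ones that should not exist.

```python
from collections import Counter

# Hypothetical column that should only contain "Good" or "Bad".
quality = ["Good", "Bad", "Good", "Godo", "Bad", "Good", "Bade"]
expected = {"Good", "Bad"}

# Tabulate the data with counts.
frequency = Counter(quality)

# Any value outside the expected set is a likely data entry error.
unexpected = {value: n for value, n in frequency.items() if value not in expected}

for value, n in frequency.most_common():
    print(f"{value:<6}{n}")
print("suspicious values:", unexpected)
```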

1.2. Outliers in Data Analysis

  • Outliers are observations that seem distant from others or follow a different logic or generative process.
  • The easiest way to find outliers is a plot or a table with the minimum and maximum values.
  • Example: when a normal (Gaussian) distribution is expected, a plot of the observed values that shows a few unexpectedly high values points to possible outliers.
  • Outliers can significantly influence data modeling, so it's crucial to investigate them first.
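
A minimal sketch of both ideas: first look at the minimum and maximum, then flag values that sit far from the bulk of the data. The 1.5 × IQR fence used here is a common rule of thumb and an assumption of this example, not a method prescribed by these notes; the sensor readings are invented.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR] (rule-of-thumb fence)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]


# Hypothetical sensor readings with one implausible value.
readings = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 57.0]

# A table with minimum and maximum is often enough to spot the problem.
print("min:", min(readings), "max:", max(readings))

suspects = iqr_outliers(readings)
print("values to investigate:", suspects)
```

Flagged values are candidates for investigation, not automatic deletions: an outlier may be a data entry error or a genuine but rare observation.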

1.3. Dealing with Missing Values in Data Science

  • Missing values aren't always wrong but need separate handling.
  • They may indicate data collection errors or ETL process errors.
  • Common techniques used by data scientists are listed in table 2.4.
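
Table 2.4 is not reproduced in these notes, so the snippet below only illustrates two widely used options as an assumption: omitting the missing observations, and imputing a static value (here the mean) while keeping a flag that records which values were filled in. The `ages` data is hypothetical.

```python
import statistics

# Hypothetical column with missing values encoded as None.
ages = [34, None, 29, 41, None, 36]

# Option 1: omit the observations with missing values.
observed = [a for a in ages if a is not None]

# Option 2: impute a static value (the mean of the observed values).
fill = statistics.fmean(observed)
imputed = [fill if a is None else a for a in ages]

# Keep track of what was imputed so the model can treat it separately.
missing_flags = [a is None for a in ages]

print("observed:", observed)
print("imputed: ", imputed)
print("flags:   ", missing_flags)
```

Omitting values loses information, while imputing can distort the distribution, so the choice depends on why the values are missing.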

2. Transforming Data for Data Modeling 

  • Data cleansing and integration are crucial for data modeling.
  • Data transformation puts the data into a form suitable for the model.
  • Relationships between input and output variables are not always linear; taking the log of the independent variables can make such a relationship linear and much simpler to estimate.
  • Combining two variables into a new variable is another useful transformation.

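The effect of a log transformation can be checked numerically. The sketch below assumes a made-up relationship of the form y = a + b·ln(x) with a = 2 and b = 3, and compares the linear correlation before and after transforming x. The `pearson` helper is written for this example.

```python
import math
import statistics


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical data following y = 2 + 3 * ln(x), an assumed relationship.
x = [1, 2, 4, 8, 16, 32, 64]
y = [2.0 + 3.0 * math.log(v) for v in x]

r_raw = pearson(x, y)                         # x against y: clearly not linear
r_log = pearson([math.log(v) for v in x], y)  # log(x) against y: linear

print(f"correlation with x:      {r_raw:.3f}")
print(f"correlation with log(x): {r_log:.3f}")
```

After the transformation a simple linear model is enough to describe the relationship, which is what makes the estimation problem easier.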
Reducing Variables in Models

  • Too many variables make a model difficult to handle.
  • Techniques based on Euclidean distance perform well only up to about 10 variables.
  • Variables that add no new information to the model are candidates for removal.

Turning Variables into Dummies in Data Science

  • A dummy variable can take only two values: true (1) or false (0).
  • Dummy variables indicate the presence or absence of a categorical effect that may explain an observation.
  • Each class stored in one variable gets its own column, with 1 if the class is present and 0 otherwise.
  • Example: turn a Weekdays column into the columns Monday through Sunday, with 1 in the Monday column if the observation was on a Monday and 0 elsewhere.
  • This technique is popular in modeling and is popular with, but not exclusive to, economists.
  • The next step is to transform and integrate data into usable input for the modeling phase.
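
The weekday example can be sketched without any libraries. The helper name `to_dummies` and the three observations are invented for illustration; each observation becomes a row with one 0/1 column per weekday.

```python
WEEKDAYS = [
    "Monday", "Tuesday", "Wednesday", "Thursday",
    "Friday", "Saturday", "Sunday",
]


def to_dummies(value, categories):
    """Map one categorical value to a dict of 0/1 indicator columns."""
    return {category: int(value == category) for category in categories}


# Hypothetical observations of a single "Weekdays" variable.
observations = ["Monday", "Wednesday", "Monday"]

rows = [to_dummies(day, WEEKDAYS) for day in observations]
for day, row in zip(observations, rows):
    print(day, list(row.values()))
```

Each row contains exactly one 1, in the column of the weekday on which the observation occurred.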

3. Data Combination from Different Sources

  • Data sources include databases, Excel files, text documents, and more.
  • The focus here is the data science process, not a scenario for every type of data.
  • Other data sources, such as key-value stores and document stores, are discussed in later sections.

Different Ways of Combining Data

  1. Joining: enriches an observation from one table with information from another.
  2. Appending or stacking: adds the observations of one table to those of another.

  • Either operation can produce a new physical table or a virtual table (a view).
  • A view consumes less disk space because the combined data is not duplicated.
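
Joining, appending, and views can all be tried with Python's built-in `sqlite3` module. The table names, columns, and rows below are hypothetical and exist only for this sketch: `clients` is joined with `regions`, and the monthly sales tables are appended through a view instead of being copied into a new physical table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE clients   (client_id INTEGER, name TEXT);
    CREATE TABLE regions   (client_id INTEGER, region TEXT);
    CREATE TABLE sales_jan (client_id INTEGER, amount REAL);
    CREATE TABLE sales_feb (client_id INTEGER, amount REAL);

    INSERT INTO clients   VALUES (1, 'Ann'), (2, 'Bob');
    INSERT INTO regions   VALUES (1, 'North'), (2, 'South');
    INSERT INTO sales_jan VALUES (1, 100.0), (2, 50.0);
    INSERT INTO sales_feb VALUES (1, 75.0);

    -- Appending as a view: the rows are combined on request, not stored twice.
    CREATE VIEW sales_all AS
        SELECT client_id, amount FROM sales_jan
        UNION ALL
        SELECT client_id, amount FROM sales_feb;
""")

# Joining: enrich each client with information from another table.
joined = conn.execute("""
    SELECT c.name, r.region
    FROM clients AS c
    JOIN regions AS r ON c.client_id = r.client_id
    ORDER BY c.client_id
""").fetchall()

# Querying the view behaves like querying a stacked table.
stacked = conn.execute("SELECT COUNT(*), SUM(amount) FROM sales_all").fetchone()

print(joined)
print(stacked)
conn.close()
```

The trade-off of a view is that the combination is recomputed on every query, so it saves disk space at the cost of processing time.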

Retrieving Data

 Data Science Steps: Retrieving Required Data

  • Sometimes you need to design the data collection process yourself.
  • Often the company has already collected and stored the data.
  • Data that is not available internally can often be purchased from third parties.
  • Don't hesitate to seek data outside your organization.
  • More organizations are making high-quality data freely available for public and commercial use.

Data can be stored in many forms, ranging from simple text files to tables in a database. The objective now is acquiring all the data you need. This may be difficult, and even if you succeed, data is often like a diamond in the rough: it needs polishing to be of any use to you.

 1. Start with data stored within the company

 Assessing Data Relevance and Quality

  • Assess the quality and relevance of available data within the company.
  • Companies often have a data maintenance program, reducing cleaning work.
  • Data can be stored in official repositories like databases, data marts, data warehouses, and data lakes.
  • Databases are for data storage, data warehouses for data analysis, and data marts serve specific business units.
  • Data lakes contain raw data, while data warehouses and data marts are preprocessed.
  • Data may still exist in Excel files on a domain expert's desktop.

Data Management Challenges in Companies

  • Data becomes scattered across many places as companies grow.
  • Knowledge about the data disperses as people change positions or leave the company.
  • Documentation and metadata are not always a priority.
  • Finding lost data can require Sherlock Holmes-like skills.

Data Access Challenges

  • Organizations often have policies that give each person access only to the data they need.
  • These policies translate into physical and digital barriers known as "Chinese walls."
  • In most countries these walls are mandatory and well regulated for customer data.
  • Getting access to data can be time-consuming and can involve company politics.


2. Don’t be afraid to shop around

Data Sharing and its Importance

  • Companies like Nielsen and GfK specialize in collecting valuable information.
  • Twitter, LinkedIn, and Facebook provide data as a way of enriching their services and ecosystem.
  • Governments and organizations share their data for free, covering a broad range of topics.
  • This data is useful for enriching proprietary data and training data science skills at home.
  • Table 2.1 shows a small selection from the growing number of open-data providers.





3. Do data quality checks now to prevent problems later


Data Science Project Overview

  • Data correction and cleansing take a large share of project time, often up to 80%.
  • Data retrieval is the first point in the data science process at which the data is inspected.
  • Most errors at the retrieval stage are easy to spot, but carelessness can cost many hours solving data issues that could have been prevented.
  • Data is investigated during import, during data preparation, and during the exploratory phase; the goal and depth of the investigation differ each time.
  • During data retrieval, check that the data equals the data in the source document and that the data types are right.
  • Data preparation involves a more detailed check that aims to eliminate typos and other data entry errors.
  • The exploratory phase focuses on learning from the data by examining statistical properties such as distributions, correlations, and outliers.
  • Iterating over these phases is common; for example, an outlier found during exploration can point to a data entry error.
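
A retrieval-time check of this kind can be sketched as below. The CSV content, the expected row count, and the helper name `check_rows` are assumptions made for the example. The check compares the number of imported rows with the number expected from the source and verifies that a numeric column really contains numbers.

```python
import csv
import io

# Hypothetical source extract; in practice this would be read from a file.
SOURCE = """id,amount,date
1,10.50,2024-01-03
2,abc,2024-01-04
3,7.25,2024-01-05
"""
EXPECTED_ROWS = 3  # row count reported by the (hypothetical) source system


def check_rows(text):
    """Parse CSV text and report rows whose 'amount' is not numeric."""
    rows = list(csv.DictReader(io.StringIO(text)))
    problems = []
    # The header is line 1, so the first data row is line 2.
    for line_no, row in enumerate(rows, start=2):
        try:
            float(row["amount"])
        except ValueError:
            problems.append((line_no, "amount", row["amount"]))
    return rows, problems


rows, problems = check_rows(SOURCE)
print("row count matches source:", len(rows) == EXPECTED_ROWS)
print("type problems:", problems)
```

Checks like these are cheap at import time and prevent a bad value from surfacing much later as a puzzling outlier or a failed model fit.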
