Acharya Nagarjuna university information from Manabadi.co.in

Data Retrieval Phase and Modeling

Data from the retrieval phase is often a "diamond in the rough."
Sanitizing and preparing it is crucial: models perform better and less time is spent correcting strange output.
Data transformation is necessary because a model expects its input in a specific format.
Correct data errors as early in the process as possible.
In realistic settings this is not always possible, so corrective actions may still be needed later.
The figure below shows common actions during data cleansing, integration, and transformation.

1. Data Cleaning

Data cleansing is a subprocess of the data science process that focuses on removing errors in your data, so that the data becomes a true and consistent representation of the processes it originates from.

1.1. Data Entry Errors Overview

Data collection and data entry are error-prone processes that often require human intervention.
Humans make typos or lose concentration.
Data collected by machines is not error-free either: errors arise from human sloppiness or from machine or hardware failure.
Examples include transmission errors and bugs in the extract, transform, and load (ETL) phase.
For small data sets, every value can be checked by hand.
For larger ones, data errors can be detected by tabulating the data with counts.
For a variable that should take only two values, a frequency table shows whether those are really the only values present.
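
The frequency-table check can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed method: the column name `quality`, its sample values, and the helper `frequency_table` are all hypothetical.

```python
from collections import Counter


def frequency_table(values):
    """Count how often each distinct value occurs in a column."""
    return Counter(values)


# Hypothetical column that should only contain "Good" or "Bad".
quality = ["Good", "Bad", "Good", "Godo", "Bad", "Good", "bad"]
expected = {"Good", "Bad"}

table = frequency_table(quality)

# Any value outside the expected set is a candidate data entry error.
unexpected = {value: count for value, count in table.items()
              if value not in expected}

for value, count in table.most_common():
    print(f"{value:<6}{count}")
print("Unexpected values:", unexpected)
```

Values such as "Godo" and "bad" occur only once here, which is typical of typos and inconsistent capitalization; they can then be mapped to the correct category or investigated further.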

1.2. Outliers in Data Analysis

Outliers are observations that seem distant from others or follow a different logic or generative process.
The easiest way to find outliers is a plot or a table with the minimum and maximum values.
In the example figure, a normal (Gaussian) distribution is expected, but the bottom graph shows unexpectedly high values.
Outliers can significantly influence data modeling, so it's crucial to investigate them first.
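
A table with minimum and maximum values is easy to produce in Python. The sketch below also flags suspicious points with the interquartile-range (IQR) rule; that rule is an addition for illustration and is not mentioned in the text above. The variable `heights_cm` and its values are made up.

```python
import statistics


def summarize(values):
    """Return the minimum, median, and maximum of a variable."""
    return {
        "min": min(values),
        "median": statistics.median(values),
        "max": max(values),
    }


def iqr_outliers(values, k=1.5):
    """Flag values more than k * IQR below Q1 or above Q3 (assumed rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical heights in centimetres; the last value looks like an entry error.
heights_cm = [162, 168, 171, 174, 175, 177, 180, 183, 1750]

print(summarize(heights_cm))
print("Suspicious values:", iqr_outliers(heights_cm))
```

The maximum alone already exposes the problem here. A rule such as the IQR fence is mainly useful when there are too many variables to inspect by eye, and a flagged value still needs investigation before it is removed.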

1.3. Dealing with Missing Values in Data Science

Missing values aren't always wrong but need separate handling.
They may indicate data collection errors or ETL process errors.
Common techniques used by data scientists are listed in table 2.4.
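
Two widely used options are omitting the incomplete observations and imputing a static value such as the mean. The sketch below shows both; the choice of these two, the helper names, and the `ages` sample are assumptions made for illustration, and missing values are represented by `None`.

```python
import statistics


def omit_missing(values):
    """Drop observations whose value is missing (None)."""
    return [v for v in values if v is not None]


def impute_mean(values):
    """Replace each missing value with the mean of the observed values."""
    mean = statistics.fmean(omit_missing(values))
    return [mean if v is None else v for v in values]


# Hypothetical variable with two missing entries.
ages = [23, None, 31, 27, None, 35]

print(omit_missing(ages))
print(impute_mean(ages))
```

Omitting is simple but loses information from the rest of the observation, while mean imputation keeps every observation at the risk of distorting the variable's distribution.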

2. Transforming Data for Data Modeling

Data cleansing and integration are crucial for data modeling.
Data transformation reshapes the data into a form suitable for modeling.
Relationships between input and output variables are not always linear; taking the log of a variable can turn such a relationship into a linear one and simplify estimation.
Another transformation combines two variables into a new variable.
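
As a worked example, assume an exponential relationship y = a * exp(b * x). Taking the log of the output gives log(y) = log(a) + b * x, which is linear in x and can be fitted with ordinary least squares. The functional form, the parameter values, and the helper `fit_line` below are assumptions chosen for illustration.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Synthetic, noise-free data generated from y = a * exp(b * x).
a_true, b_true = 2.0, 0.5
xs = [0, 1, 2, 3, 4, 5]
ys = [a_true * math.exp(b_true * x) for x in xs]

# After the log transform the relationship is a straight line.
log_ys = [math.log(y) for y in ys]
b_hat, log_a_hat = fit_line(xs, log_ys)
a_hat = math.exp(log_a_hat)

print(f"a is about {a_hat:.3f}, b is about {b_hat:.3f}")
```

With real, noisy data the recovered parameters would only approximate the true ones, but the estimation problem remains a simple linear fit.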

Reducing Variables in Models

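Too many input variables make a model hard to handle, and some techniques perform poorly when overloaded with them.
Techniques based on Euclidean distance, for example, perform well only up to about 10 variables.
Variables that add no new information to the model are candidates for removal.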
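
One simple way to spot variables that carry little new information is to look for pairs that are almost perfectly correlated. The sketch below keeps a variable only if it is not strongly correlated with one already kept. This correlation filter, the 0.95 threshold, and the sample columns are assumptions for illustration, not a method named in the text.

```python
import math


def pearson(xs, ys):
    """Pearson correlation; assumes neither variable is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


def drop_redundant(columns, threshold=0.95):
    """Keep each column unless it nearly duplicates one already kept."""
    kept = {}
    for name, values in columns.items():
        if all(abs(pearson(values, other)) < threshold
               for other in kept.values()):
            kept[name] = values
    return kept


# Hypothetical data: the same height recorded in two units, plus age.
columns = {
    "height_cm": [150, 160, 170, 180, 190],
    "height_in": [59.1, 63.0, 66.9, 70.9, 74.8],
    "age": [34, 21, 45, 29, 52],
}

print(list(drop_redundant(columns)))
```

Here `height_in` is dropped because it repeats the information in `height_cm`. The result depends on column order, since the first of two correlated variables is the one that survives.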

Turning Variables into Dummies in Data Science

Variables can be turned into dummy variables, which take only two values: true (1) or false (0).
Dummy variables indicate the presence or absence of a categorical effect that may explain an observation.
Each class stored in the original variable gets its own column, with 1 where the class is present and 0 otherwise.
Example: turn one Weekdays column into the columns Monday through Sunday; the Monday column holds 1 if the observation was on a Monday and 0 otherwise.
This technique is popular in modeling and is not exclusive to economists.
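
The weekday example can be sketched as follows. The helper `to_dummies` and the sample observations are hypothetical; in practice a data analysis library would usually provide this step.

```python
WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def to_dummies(value, categories):
    """Return one 0/1 indicator per category for a single observation."""
    return {category: int(value == category) for category in categories}


# Hypothetical Weekdays column with three observations.
observations = ["Monday", "Wednesday", "Sunday"]
rows = [to_dummies(day, WEEKDAYS) for day in observations]

for day, row in zip(observations, rows):
    print(day, list(row.values()))
```

Each row contains exactly one 1, in the column matching the observation's weekday, and 0 everywhere else.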
The next step is to transform and integrate data into usable input for the modeling phase.

3. Data Combination from Different Sources

Data sources include databases, Excel files, text documents, etc.
The focus here is on the data science process rather than on presenting scenarios for every type of data.
Other data sources like key-value stores and document stores will be discussed in later sections.

Different Ways of Combining Data

Joining: enriches an observation from one table with information from another.
Appending or stacking: adds observations from one table to another.
The combined data can be stored as a new physical table or exposed as a virtual table by creating a view.
The advantage of a view is that it does not consume additional disk space.
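
Joining and appending can be illustrated with plain lists of dictionaries standing in for tables. This is a minimal sketch: the table names, the `client_id` key, and the helpers `join` and `append` are hypothetical, and the join shown keeps only observations that have a match in the second table.

```python
def join(left, right, key):
    """Enrich each row of `left` with the matching row of `right`."""
    lookup = {row[key]: row for row in right}
    return [{**row, **lookup[row[key]]}
            for row in left if row[key] in lookup]


def append(top, bottom):
    """Stack the observations of `bottom` under those of `top`."""
    return top + bottom


# Hypothetical tables sharing the key "client_id".
clients = [{"client_id": 1, "name": "Asha"},
           {"client_id": 2, "name": "Ravi"}]
regions = [{"client_id": 1, "region": "South"},
           {"client_id": 2, "region": "North"}]

# Hypothetical monthly tables with identical columns.
january = [{"client_id": 1, "sales": 120}]
february = [{"client_id": 2, "sales": 95}]

print(join(clients, regions, "client_id"))
print(append(january, february))
```

Joining adds columns to existing observations, while appending adds observations and leaves the set of columns unchanged.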

Friday, 2 February 2024

Data Preparation - Cleansing, integrating, and transforming data
