Data Retrieval Phase and Modeling
- Data from the retrieval phase is often a "diamond in the rough": it needs work before it's usable.
- Sanitizing and preparing the data are crucial; they improve model performance and reduce the time spent correcting output later.
- The data must often be transformed so it fits the format a specific model expects.
- Correcting data errors as early as possible is recommended, although in realistic settings corrective actions may still be needed later on.
- The subsections below cover common actions taken during data cleansing, integration, and transformation.
1. Data Cleansing
Data cleansing is a subprocess of the data science process that focuses on removing errors in your data, so that your data becomes a true and consistent representation of the processes it originates from.
1.1. Data Entry Errors Overview
- Data collection and entry are error-prone processes that often require human intervention.
- Human errors include typos and lapses in concentration.
- Machine-collected data isn't immune either: errors can stem from human sloppiness as well as machine or hardware failure.
- Examples include transmission errors and bugs in the extract, transform, and load (ETL) phase.
- For small data sets you can hand-check every value.
- For larger data sets, data entry errors can be detected by tabulating the data with counts.
- For a variable that should take only two values (say, "Good" and "Bad"), a frequency table quickly exposes misspelled entries, as in the sketch below.
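A minimal sketch of the frequency-table check, assuming pandas and a hypothetical "quality" column that should contain only "Good" and "Bad":

```python
import pandas as pd

# Hypothetical column that should contain only "Good" and "Bad".
df = pd.DataFrame({"quality": ["Good", "Bad", "Good", "Godo", "Bad", "Bda", "Good"]})

# Tabulating the values with counts exposes the misspelled entries at a glance.
print(df["quality"].value_counts())

# Once spotted, the typos can be corrected in place.
df["quality"] = df["quality"].replace({"Godo": "Good", "Bda": "Bad"})
print(df["quality"].value_counts())
```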
1.2. Outliers in Data Analysis
- Outliers are observations that seem distant from others or follow a different logic or generative process.
- Finding outliers is easy with plots or with a table of the minimum and maximum values.
- For example, when a variable is expected to follow a normal (Gaussian) distribution, a handful of values far above the rest stand out immediately as outliers.
- Outliers can significantly influence your data modeling, so it's crucial to investigate them first; a simple detection sketch follows this list.
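A minimal outlier check, assuming a numeric "measurement" series and the simple rule of flagging points more than three standard deviations from the mean (the data and threshold here are illustrative, not from the source):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Mostly Gaussian data, plus two injected extreme values acting as outliers.
values = np.concatenate([rng.normal(loc=50, scale=5, size=1000), [210.0, 225.0]])
s = pd.Series(values, name="measurement")

# A table of summary statistics already hints at the problem:
# the maximum sits far above the bulk of the distribution.
print(s.describe())

# Simple rule of thumb: flag points more than 3 standard deviations from the mean.
z_scores = (s - s.mean()) / s.std()
print(s[z_scores.abs() > 3])
```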
1.3. Dealing with Missing Values in Data Science
- Missing values aren't always wrong but need separate handling.
- They may indicate data collection errors or ETL process errors.
- Common techniques used by data scientists, such as omitting the observations or imputing a value, are listed in table 2.4; a small sketch of two of them follows this list.
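A small sketch of two common ways to handle missing values, assuming pandas and an illustrative data frame (the column names are hypothetical):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [25, np.nan, 31, 40, np.nan],
    "income": [38000, 52000, np.nan, 61000, 45000],
})

# Option 1: omit the observations that contain missing values.
dropped = df.dropna()

# Option 2: impute a static value, here the column mean.
imputed = df.fillna(df.mean(numeric_only=True))

print(dropped)
print(imputed)
```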
2. Transforming Data for Data Modeling
- Data cleansing and integration are crucial for data modeling.
- Data transformation means reshaping the data into a form the model can work with.
- Relationships between an input and an output variable aren't always linear; taking the log of the independent variable can turn such a relationship into a linear one and simplify the estimation considerably, as in the sketch below.
- Another option is to combine two variables into a single new variable.
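A minimal sketch of the log transformation, assuming NumPy and a hypothetical variable y that grows with the log of x; fitting y against log(x) reduces the problem to a straight line:

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(1, 100, 200)
# Hypothetical nonlinear relationship: y grows with the log of x, plus noise.
y = 3.0 * np.log(x) + rng.normal(scale=0.3, size=x.size)

# Fitting y against log(x) turns the estimation into a simple linear fit.
slope, intercept = np.polyfit(np.log(x), y, deg=1)
print(f"estimated slope ~ {slope:.2f}, intercept ~ {intercept:.2f}")  # slope close to 3
```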
Reducing Variables in Models
- Having too many variables makes a model difficult to handle, and certain techniques don't perform well when overloaded with input variables.
- Techniques based on Euclidean distance, for instance, tend to perform well only up to about 10 variables.
- Reducing the number of variables while retaining as much information as possible keeps the model manageable; the sketch after this list shows one common approach.
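One widely used variable-reduction technique is principal component analysis (PCA); the sketch below uses scikit-learn on synthetic data and is illustrative rather than the source's own example:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
# 200 observations of 10 correlated variables built from 2 underlying factors.
factors = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 10))
X = factors @ mixing + rng.normal(scale=0.1, size=(200, 10))

# Reduce the 10 original variables to 2 components that retain most of the variance.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)                      # (200, 2)
print(pca.explained_variance_ratio_.sum())  # close to 1.0
```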
Turning Variables into Dummies in Data Science
- Variables can be turned into dummy variables, which can take only two values: true (1) or false (0).
- Dummy variables indicate the presence or absence of a categorical effect that may explain an observation.
- You create separate columns for the classes stored in one variable, with a 1 indicating the class is present for that observation and a 0 otherwise.
- Example: turn a single Weekdays variable into separate Monday through Sunday columns, so a 1 in the Monday column shows the observation took place on a Monday (see the sketch after this list).
- This technique is popular in modeling and is not exclusive to economists.
- The next step is to combine and integrate data from different sources into usable input for the modeling phase.
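Before moving on, a minimal sketch of the dummy-variable transformation described above, assuming pandas and a hypothetical weekday column:

```python
import pandas as pd

df = pd.DataFrame({
    "weekday": ["Monday", "Tuesday", "Monday", "Sunday"],
    "sales": [120, 98, 130, 75],
})

# One column per weekday class; 1 marks the class an observation belongs to, 0 otherwise.
dummies = pd.get_dummies(df["weekday"], prefix="weekday", dtype=int)
df = pd.concat([df.drop(columns="weekday"), dummies], axis=1)
print(df)
```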
3. Data Combination from Different Sources
- Data sources include databases, Excel files, text documents, etc.
- The focus here is on the data science process itself, not on presenting a scenario for every possible type of data source.
- Other data sources like key-value stores and document stores will be discussed in later sections.
Different Ways of Combining Data
- Joining: enriching the observations of one table with information from another table.
- Appending or stacking: adding the observations of one table to another (both operations are shown in the sketch after this list).
- When combining data, you can create a new physical table or a virtual table (a view).
- A view consumes less disk space than a physical table, but it has to be recomputed from the underlying tables every time it's queried.
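A minimal sketch of appending and joining with pandas; the client and order tables here are hypothetical:

```python
import pandas as pd

clients = pd.DataFrame({"client_id": [1, 2, 3], "region": ["North", "South", "East"]})
orders_jan = pd.DataFrame({"client_id": [1, 2], "amount": [250, 90]})
orders_feb = pd.DataFrame({"client_id": [2, 3], "amount": [300, 120]})

# Appending (stacking): add the observations of one table to another.
orders = pd.concat([orders_jan, orders_feb], ignore_index=True)

# Joining: enrich each order with information about the client it belongs to.
enriched = orders.merge(clients, on="client_id", how="left")
print(enriched)
```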