Data Science Steps: Retrieving Required Data
- Designing a data collection process yourself may sometimes be necessary.
- Companies have often already collected and stored the data you need.
- Data the company does not have can often be purchased from third parties.
- Don't hesitate to seek data outside your organization.
- More organizations are making high-quality data freely available for public and commercial use.
1. Start with data stored within the company
Assessing Data Relevance and Quality
- Assess the quality and relevance of available data within the company.
- Companies often have a data maintenance program, which reduces the cleaning work left for you.
- Data can be stored in official repositories like databases, data marts, data warehouses, and data lakes.
- Databases are primarily for data storage, data warehouses are designed for reading and analyzing data, and data marts are subsets of a warehouse that serve specific business units.
- Data lakes contain raw data, while data warehouses and data marts hold preprocessed data.
- Data may still exist in Excel files on a domain expert's desktop.
Data Management Challenges in Companies
- Data becomes scattered as companies grow.
- Knowledge about the data disperses as people change positions or leave the company.
- Documentation and metadata are not always a priority.
- Finding lost data may call for Sherlock Holmes-like skills.
Data Access Challenges
- Organizations often have policies ensuring that people can access only the data they need.
- These policies create physical and digital barriers, known as "Chinese walls."
- These "walls" are mandatory and well-regulated for customer data in most countries.
- Accessing data can be time-consuming and influenced by company politics.
2. Don’t be afraid to shop around
Data Sharing and its Importance
- Companies like Nielsen and GFK specialize in collecting valuable information.
- Twitter, LinkedIn, and Facebook provide data so that others can, in turn, enrich their services and ecosystem.
- Governments and organizations share their data for free, covering a broad range of topics.
- This data is useful for enriching proprietary data and for practicing data science skills at home.
- Table 2.1 shows a small selection from the growing number of open-data providers.
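As a rough illustration of what "enriching proprietary data" can mean in practice, the sketch below joins in-house records with an external lookup on a shared key. Everything in it is hypothetical: the `sales` records, the `population_by_postal_code` lookup, and the `enrich` helper are made up for the example and do not come from any real provider.

```python
# Hypothetical in-house sales records keyed by postal code.
sales = [
    {"store": "A", "postal_code": "1000", "revenue": 120_000},
    {"store": "B", "postal_code": "2000", "revenue": 95_000},
    {"store": "C", "postal_code": "9999", "revenue": 40_000},
]

# Hypothetical open-data extract: population per postal code.
population_by_postal_code = {"1000": 180_000, "2000": 520_000}


def enrich(records, lookup, key, new_field):
    """Left join: keep every record and add new_field (None when the lookup has no match)."""
    return [{**record, new_field: lookup.get(record[key])} for record in records]


enriched = enrich(sales, population_by_postal_code, "postal_code", "population")

# Records without a match are worth listing; they often point to key mismatches.
unmatched = [r["store"] for r in enriched if r["population"] is None]
```

A left join is used here so that no proprietary record is silently dropped when the external source has no matching key.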
3. Do data quality checks now to prevent problems later
Data Science Project Overview
- Data correction and cleansing are crucial and often take up to 80% of project time.
- Data retrieval is the first time you inspect the data in the data science process.
- Most errors at the retrieval stage are easy to spot, but carelessness here can cost many hours later fixing data issues that could have been prevented during import.
- Data investigation occurs during import, data preparation, and exploratory phases.
- Data retrieval checks if the data is equal to the source document and if the data types match.
- Data preparation involves a more detailed check, aiming to eliminate typos and data entry errors.
- The exploratory phase focuses on learning from the data, examining statistical properties like distributions, correlations, and outliers.
- Iteration over these phases is common, as outliers can indicate data entry errors.
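The checks described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a prescribed method: the sample CSV, the `EXPECTED_TYPES` schema, and the helper names are all hypothetical, and the outlier rule (a modified z-score based on the median) is just one common choice that the notes above do not specify.

```python
import csv
import io
import statistics

# Hypothetical extract standing in for an imported file or table.
SOURCE_CSV = """customer_id,age,signup_date
1,34,2021-03-01
2,41,2021-03-04
3,29,2021-03-09
4,38,2021-03-11
5,350,2021-03-15
"""

EXPECTED_TYPES = {"customer_id": int, "age": int}  # assumed schema


def load_rows(text):
    """Parse CSV text into a list of dicts (all values are still strings)."""
    return list(csv.DictReader(io.StringIO(text)))


def check_import(rows, expected_row_count, expected_types):
    """Retrieval-phase check: row count matches the source and values parse as expected."""
    problems = []
    if len(rows) != expected_row_count:
        problems.append(f"row count {len(rows)} != source {expected_row_count}")
    for i, row in enumerate(rows, start=1):
        for column, cast in expected_types.items():
            try:
                cast(row[column])
            except (ValueError, TypeError, KeyError):
                problems.append(f"row {i}: {column}={row.get(column)!r} is not {cast.__name__}")
    return problems


def flag_outliers(values, threshold=3.5):
    """Exploratory-phase check: flag values whose modified z-score exceeds the threshold."""
    median = statistics.median(values)
    mad = statistics.median([abs(v - median) for v in values])
    if mad == 0:
        return []  # no spread around the median, so this rule cannot rank values
    return [v for v in values if 0.6745 * abs(v - median) / mad > threshold]


rows = load_rows(SOURCE_CSV)
import_problems = check_import(rows, expected_row_count=5, expected_types=EXPECTED_TYPES)
suspect_ages = flag_outliers([int(r["age"]) for r in rows])
```

In this sample the import check passes, yet the exploratory check still flags an age of 350. That is the kind of outlier that sends you back to the preparation phase to look for a data entry error.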