Monday 29 January 2024

History and Overview of R

What is R?

R is a free and open-source software environment for statistical computing and graphics. It's a programming language specifically designed for data analysis and visualization. Its strengths lie in its extensive statistical functionality, expressive syntax, and powerful graphical capabilities.

What is S?

S is a similar statistical programming language and environment developed earlier at Bell Laboratories. R owes its origin to S, sharing many core concepts and functionalities. Although R isn't a direct extension of S, much code written for S works within R with some adjustments.

The S Philosophy

The S philosophy emphasizes:

  • Interactivity: Users can run commands and see results immediately, facilitating exploration and experimentation.
  • Conciseness: The language is designed to be compact and expressive, allowing for efficient coding.
  • Extensibility: Users can create and share packages to expand the functionality of the language beyond its core features.
  • Data-oriented: Focus is placed on efficient data manipulation and analysis.

Back to R

R builds upon the S philosophy while improving in several areas, including:

  • Object-oriented programming: Provides better structure and organization for large projects.
  • Memory management: Offers more efficient memory handling for complex tasks.
  • Graphical capabilities: Produces publication-quality graphs with rich customization options.

Basic Features of R

  • Data structures: Arrays, matrices, lists, data frames, etc. for organizing and manipulating data.
  • Operators: Mathematical, logical, and data manipulation operators for performing various calculations.
  • Control flow: if, for, while statements for controlling program execution based on conditions.
  • Functions: Built-in and user-defined functions for performing specific tasks.
  • Graphics: Extensive plotting capabilities to visualize data in various ways.
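
The short script below sketches how these pieces fit together in base R. The data values and the grading rule are invented purely for illustration.

    # A small data frame (one of R's core data structures)
    scores <- data.frame(name  = c("Asha", "Ravi", "Meena"),
                         marks = c(72, 85, 91))

    # A user-defined function using control flow and comparison operators
    to_grade <- function(m) {
      if (m >= 90) "A" else if (m >= 75) "B" else "C"
    }

    # Apply the function to each mark and store the result as a new column
    scores$grade <- sapply(scores$marks, to_grade)
    print(scores)

    # Built-in graphics: a simple bar plot of the marks
    barplot(scores$marks, names.arg = scores$name, main = "Marks by student")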

Free Software

R is free and open-source software (FOSS), distributed under the GNU General Public License, meaning anyone can download, use, modify, and redistribute it. This fosters a vibrant community of developers and users who contribute to its continuous improvement.

Design of the R System

R consists of:

  • The R language: Defines the syntax and structure of the code.
  • The R interpreter: Executes the R code and interacts with the user.
  • Packages: Collections of functions and data that extend R's functionalities beyond its core.
  • CRAN (the Comprehensive R Archive Network): the central repository for downloading and installing contributed packages.
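
As a minimal illustration of how packages and CRAN fit into this design, the two commands below download a contributed package from a CRAN mirror and load it into the current session (ggplot2 is used only as an example package):

    install.packages("ggplot2")   # fetch the package from a CRAN mirror
    library(ggplot2)              # load it so its functions become available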

Limitations of R

While powerful, R has some limitations:

  • Steep learning curve: The syntax and concepts can be challenging for beginners.
  • Memory limitations: Objects are generally held in memory, so large datasets and complex analyses may require careful memory management.
  • Debugging difficulties: Tracing errors can be challenging due to the dynamic nature of the language.

R Resources

  • The R Project for Statistical Computing: https://www.r-project.org/
  • RStudio: Popular integrated development environment for R: https://posit.co/
  • DataCamp: Online platform for learning R and data science: https://www.datacamp.com/
  • Books: "Introductory Statistics with R" by Dalgaard, "The R Book" by Crawley, "R in Action" by Kabacoff, and "R for Data Science" by Wickham and Grolemund
  • Forums and communities: Stack Overflow, R-Help mailing list, online forums

Data Science Process

Data science is mostly applied in the context of an organization. When the business asks you to perform a data science project, you’ll first prepare a project charter. This charter contains information such as what you’re going to research, how the company benefits from that, what data and resources you need, a timetable, and deliverables. The process itself typically consists of six steps:

1. Setting the research goal: This initial step involves defining the specific problem or question you want to answer using data. It's crucial to have a clear and well-defined goal to guide the rest of the process.

2. Retrieving data: Once you know what you're looking for, you need to gather the relevant data. This can involve accessing existing data sources, designing and conducting surveys or experiments, or scraping data from the web.
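
In R, for instance, retrieving data often comes down to reading a file or querying a database. The snippet below reads a hypothetical CSV file named sales.csv; the file name and its columns are assumptions made for illustration.

    # Read a local CSV file into a data frame (file name is hypothetical)
    sales <- read.csv("sales.csv", stringsAsFactors = FALSE)

    # A quick look at what was retrieved
    head(sales)   # first few rows
    str(sales)    # column names and types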

3. Data preparation: Raw data is rarely ready for analysis, so this step involves cleaning, organizing, and formatting the data to make it suitable for modeling. This might include tasks like:

  • Data cleaning: Fixing errors, inconsistencies, and missing values.
  • Data integration: Combining data from multiple sources.
  • Data transformation: Converting data into a format compatible with your chosen analysis tools.
  • Feature engineering: Creating new features from existing data to improve the performance of your models.
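
A minimal sketch of these tasks in base R, continuing with the hypothetical sales data frame (all column and file names are assumptions):

    # Data cleaning: replace missing order amounts with the column median
    sales$amount[is.na(sales$amount)] <- median(sales$amount, na.rm = TRUE)

    # Data integration: combine with a second (hypothetical) table via a shared key
    customers <- read.csv("customers.csv", stringsAsFactors = FALSE)
    sales <- merge(sales, customers, by = "customer_id")

    # Data transformation: convert a text date into R's Date type
    sales$order_date <- as.Date(sales$order_date, format = "%Y-%m-%d")

    # Feature engineering: derive a new column from existing ones
    sales$amount_per_item <- sales$amount / sales$quantity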

4. Data exploration: This is where you start to get a feel for the data by analyzing its properties and identifying patterns, trends, and relationships. Exploratory data analysis (EDA) can involve techniques like:

  • Descriptive statistics: Summarizing the data using measures like mean, median, and standard deviation.
  • Data visualization: Creating charts and graphs to represent the data visually.
  • Correlation analysis: Identifying relationships between different variables.
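
In R, a first pass at exploratory data analysis can be as simple as the following (again using the hypothetical sales data frame):

    # Descriptive statistics for every column
    summary(sales)

    # Data visualization: distribution of order amounts
    hist(sales$amount, main = "Order amounts", xlab = "Amount")

    # Correlation analysis between two numeric variables
    cor(sales$amount, sales$quantity)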

5. Data modeling: This step involves using the prepared data to build a model that can answer your research question or make predictions. There are many different types of data models, such as:

  • Regression models: Used to predict a continuous outcome variable based on one or more predictor variables.
  • Classification models: Used to predict a categorical outcome variable.
  • Clustering algorithms: Used to group similar data points together.
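
As an illustration, the sketch below fits a simple regression model and a k-means clustering on the hypothetical sales data in base R; the variable names are assumptions, not part of the original text.

    # Regression: predict the order amount from the quantity ordered
    fit <- lm(amount ~ quantity, data = sales)
    summary(fit)

    # Clustering: group orders into 3 clusters by amount and quantity
    km <- kmeans(scale(sales[, c("amount", "quantity")]), centers = 3)
    sales$cluster <- km$cluster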

6. Presentation and automation: Finally, you need to communicate your findings to others and, if applicable, deploy your model into production. This might involve:

  • Creating reports and presentations: Summarizing your results and insights in a clear and concise way.
  • Developing dashboards and visualizations: Making your results more accessible and interactive.
  • Deploying the model: Integrating your model into a production environment to make predictions on new data.
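
A very small example of the automation side in R: saving a fitted model so a later script can reuse it to score new records (the object and file names are hypothetical).

    # Save the fitted model so a scheduled job can reuse it
    saveRDS(fit, "sales_model.rds")

    # Later, in a production script: load the model and score new data
    fit <- readRDS("sales_model.rds")
    new_orders <- data.frame(quantity = c(2, 5, 1))
    predict(fit, newdata = new_orders)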
 
Reference:

Davy Cielen, Arno D. B. Meysman, and Mohamed Ali, “Introducing Data Science”, Manning Publications, 2016.

Facets of data

 In data science and big data you’ll come across many different types of data, and each of them tends to require different tools and techniques. The main categories of data are these:

  1. Structured data
  2. Unstructured data
  3. Natural language data
  4. Machine-generated data
  5. Graph-based data
  6. Audio, video, and images data
  7. Streaming data

Let’s explore all these interesting data types.

 1.  Structured data

  • Data that is stored in a defined field inside a record and is dependent on a data model is referred to as structured data.
  • Because of this, storing structured data in tables inside databases or Excel files is frequently simple.
  • Database management and querying are best done with SQL, or Structured Query Language.
  • Additionally, you can encounter complex data that is difficult to store in a conventional relational database.
  • One example is hierarchical data, like a family tree.
  • The world isn’t made up of structured data, though; structure is imposed on it by humans and machines. More often, data comes unstructured.
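
For instance, a small structured table can be held in an R data frame and queried much like a database table; the records below are invented for illustration.

    # A structured table: every record has the same, well-defined fields
    employees <- data.frame(id         = 1:3,
                            name       = c("Kiran", "Latha", "Suresh"),
                            department = c("Sales", "HR", "Sales"))

    # The equivalent of a simple SQL "SELECT ... WHERE" on the table
    subset(employees, department == "Sales")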


2. Unstructured data 

  • Unstructured data is data that isn’t easy to fit into a data model because the content is context-specific or varying.
  • One example of unstructured data is your regular email.
  • Although email contains structured elements such as the sender, title, and body text, it’s a challenge to find the number of people who have written an email complaint about a specific employee because so many ways exist to refer to a person, for example.
  • The thousands of different languages and dialects out there further complicate this.
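
To make the difficulty concrete, the small sketch below counts emails that mention an employee, but only for the spellings the analyst thought of in advance; free text allows many more variants, which is exactly what makes unstructured data hard. The messages and the pattern are invented for illustration.

    # A handful of invented email bodies
    emails <- c("I want to complain about Mr. Rao.",
                "Raghav Rao was very unhelpful today.",
                "The service from R. Rao's team was poor.")

    # A hand-written pattern only catches the spellings we anticipated
    pattern <- "Mr\\. Rao|R\\. Rao"
    sum(grepl(pattern, emails))   # counts 2, silently missing "Raghav Rao"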



3. Natural language data

  • Natural language is a special type of unstructured data; it’s challenging to process because it requires knowledge of specific data science techniques and linguistics.
  • The natural language processing community has had success in entity recognition, topic recognition, summarization, text completion, and sentiment analysis, but models trained in one domain don’t generalize well to other domains.
  • The concept of meaning itself is questionable here.
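
Even a task as basic as counting words illustrates the kind of decisions involved: case, punctuation, and word boundaries all have to be handled before anything resembling meaning can be extracted. The sentence below is invented, and the approach is only a toy sketch in base R.

    text  <- "The product is great, really great, but delivery was slow."
    words <- unlist(strsplit(tolower(gsub("[[:punct:]]", "", text)), "\\s+"))
    sort(table(words), decreasing = TRUE)   # simple word-frequency table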


4. Machine-generated data

  • Machine-generated data is information that’s automatically created by a computer, process, application, or other machine without human intervention.
  • Machine-generated data is becoming a major data resource and will continue to do so.
  • The analysis of machine data relies on highly scalable tools, due to its high volume and speed. 
  • Examples of machine data are web server logs, call detail records, network event logs, and telemetry.
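
As a small illustration, a web server log line is machine-generated but regular enough that its fields can be pulled out with a pattern. The log line below is a made-up example in the common Apache access-log layout.

    # One invented log line in the Apache access-log format
    log_line <- '10.0.0.7 - - [29/Jan/2024:10:15:32 +0530] "GET /index.html HTTP/1.1" 200 5321'

    # Extract the client IP, the request, and the status code
    m <- regmatches(log_line,
                    regexec('^(\\S+) .* "([^"]+)" (\\d{3})', log_line))[[1]]
    m[2]   # "10.0.0.7"
    m[3]   # "GET /index.html HTTP/1.1"
    m[4]   # "200"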

5. Graph-based data

Graph-based data represents entities and their relationships as nodes and edges in a graph. This makes it a powerful tool for modeling complex relationships between entities, such as social networks, financial transactions, and knowledge graphs.

For example, in a social network, people are represented as nodes and their friendships are represented as edges. This allows us to analyze things like the spread of information, the formation of communities, and the influence of individuals.
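
A minimal sketch of such a friendship network in R, using the contributed igraph package (not part of base R); the people and friendships are invented.

    library(igraph)   # install.packages("igraph") if it is not already available

    # Each row is an edge: a friendship between two people (the nodes)
    friendships <- data.frame(from = c("Anu", "Anu", "Ravi"),
                              to   = c("Ravi", "Sita", "Sita"))
    g <- graph_from_data_frame(friendships, directed = FALSE)

    degree(g)   # how many friends each person has
    plot(g)     # draw the network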

6. Audio, video, and images data

Audio, video, and images are collectively known as multimedia data. This type of data is characterized by its rich and complex nature, and it can be challenging to store, process, and analyze. However, it also has the potential to provide valuable insights that other types of data cannot.

Here are some examples of how multimedia data is used:

  • Computer vision: Analyzing images and videos to understand the content, such as identifying objects, people, and actions.
  • Speech recognition: Converting spoken language into text.
  • Natural language processing: Understanding the meaning of text and speech.
  • Medical imaging: Analyzing medical images to diagnose diseases.
  • Entertainment: Creating movies, games, and other forms of entertainment.

7. Streaming data

Streaming data is generated continuously and flows into the system in real time, rather than being loaded into a data store in one batch. This type of data is becoming increasingly common due to the growth of the Internet of Things (IoT) and other sensors that generate data constantly.

Here are some examples of how streaming data is used:

  • Fraud detection: Analyzing financial transactions in real-time to identify fraudulent activity.
  • Traffic monitoring: Monitoring traffic flows in real-time to optimize traffic management.
  • Social media analysis: Analyzing social media posts in real-time to understand public opinion and trends.
  • Industrial automation: Monitoring and controlling industrial processes in real-time.
  • Scientific research: Collecting and analyzing data from scientific experiments in real-time.
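
The defining feature is that records are processed as they arrive rather than after the whole batch has been collected. The toy sketch below simulates this in base R by updating a running mean one reading at a time; the sensor readings are invented.

    running_mean <- 0
    n <- 0
    for (reading in c(4.2, 3.9, 5.1, 4.8)) {   # invented readings arriving one by one
      n <- n + 1
      running_mean <- running_mean + (reading - running_mean) / n
      cat("after", n, "readings the mean is", running_mean, "\n")
    }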

Reference:

Davy Cielen, Arno D. B. Meysman, and Mohamed Ali, “Introducing Data Science”, Manning Publications, 2016.

