Introduction to Using Data Sets
Why this matters
Imagine your school's soccer team is running a fundraiser. You're selling team-branded water bottles, and you have a simple notebook where you've jotted down the number of bottles sold each day for the past two weeks: 15, 22, 18, 25, 30, 12, 9, 17, 20, 24, 28, 31, 19, 21.
The coach comes to you and asks, "Great job! What was our single best sales day? And what was our average for the two weeks?"
Suddenly, that simple list of numbers isn't just a list anymore. It's a data set. To answer the coach's questions, you can't just stare at the whole list at once. You need a process. You need a way to go through it systematically.
That's exactly what we're learning today: how to think about collections of data and create a clear, logical plan—an algorithm—to pull useful information out of them.
Concept overview
flowchart TD
A[Start] --> B{Initialize `max_so_far` to the first item};
B --> C{For each remaining item in the list...};
C --> D{Is current item > `max_so_far`?};
D -- Yes --> E[Update `max_so_far` to current item];
D -- No --> F[Move to next item];
E --> F;
C -- Done with all items --> G[End: `max_so_far` holds the maximum value];
F --> C;
Core explanation
Welcome to the world of data collections! It might sound intimidating, but the core idea is something you do every day.
What is a Data Set?
A data set is just a collection of related pieces of information. That's it. (EK 4.2.A.1)
Think about:
- The contacts in your phone: a data set of names and numbers.
- Your favorite Spotify playlist: a data set of songs.
- A grocery list: a data set of items you need to buy.
In computer science, we often work with data sets of numbers. For example, a list of final exam scores for a class: [95, 81, 76, 99, 88, 72]. Each number is a piece of data, and together they form a data set.
The Goal: Answering Questions with Data
We don't collect data just for fun. We use it to solve problems or answer questions. (EK 4.2.A.2)
Looking at our list of scores [95, 81, 76, 99, 88, 72], we could ask:
- What is the highest score? (99)
- What is the lowest score? (72)
- How many students passed? (Assuming a pass is >= 65, all 6 of them)
- What is the average score?
To answer these, we need to manipulate and analyze the data. And that brings us to the single most important concept in this lesson.
The "One at a Time" Principle
When we process a data set, we almost always do it one item at a time.
Imagine you're a cashier at Target. A customer, Carlos, comes up with a full cart. You don't try to look at all the items at once and guess the total. That would be impossible! Instead, you follow a simple algorithm:
- Pick up one item.
- Scan its barcode.
- Add its price to a running total.
- Set the item aside.
- Repeat until the cart is empty.
Computers work the same way. They aren't magical; they're just incredibly fast at following simple, repetitive instructions. When we give a computer a data set, it iterates (or loops) through it, looking at just one value at a time to perform a calculation. (EK 4.2.A.2)
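The cashier's routine can be sketched in Java as a loop with a running total. This is only an illustration: the class name `CartTotal`, the method `totalCents`, and the prices in Carlos's cart are hypothetical, and storing prices as whole cents is an assumption made to keep the arithmetic exact.

```java
class CartTotal {
    // Adds up prices one item at a time, like the cashier's running total.
    // Prices are whole cents (an assumption, to avoid rounding issues).
    static int totalCents(int[] pricesInCents) {
        int runningTotal = 0;                 // the total starts at zero
        for (int price : pricesInCents) {     // pick up exactly one item per pass
            runningTotal += price;            // add its price to the running total
        }
        return runningTotal;                  // cart is empty, so the total is final
    }

    public static void main(String[] args) {
        // Hypothetical cart: the lesson does not list Carlos's items.
        int[] cart = {350, 1200, 425};
        System.out.println("Total in cents: " + totalCents(cart)); // prints 1975
    }
}
```

The loop body never looks at more than one price, which is the whole point of the "one at a time" principle.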
Planning Your Algorithm with a Diagram
Before you write a single line of code, you should have a plan. For data sets, one of the best ways to plan is to use a simple table or chart. This helps you visualize the "one at a time" process. (EK 4.2.A.3)
Let's make a plan to find the highest score in our list: [95, 81, 76, 99, 88, 72].
Our algorithm in plain English would be:
- Create a variable, let's call it `highest_score_so_far`, and set it to the first score in the list.
- Go through the rest of the scores, one by one.
- For each score, compare it to `highest_score_so_far`.
- If the current score is higher, update `highest_score_so_far` to this new score.
- If it's not higher, do nothing and move to the next score.
- Once you've checked all the scores, `highest_score_so_far` will hold the answer.
Let's trace this with a table, which is a great way to represent our plan. (LO 4.2.A)
| Current Score Being Examined | `highest_score_so_far` (before check) | Is Current Score > `highest_score_so_far`? | `highest_score_so_far` (after check) |
|---|---|---|---|
| (Start) | (Initialized to first item: 95) | - | 95 |
| 81 | 95 | No (81 is not > 95) | 95 |
| 76 | 95 | No (76 is not > 95) | 95 |
| 99 | 95 | Yes (99 is > 95) | 99 |
| 88 | 99 | No (88 is not > 99) | 99 |
| 72 | 99 | No (72 is not > 99) | 99 |
This table is our algorithm represented visually. It forces us to think one step at a time. By building this plan, we've defined a clear, repeatable process that a computer can follow perfectly. This skill—translating a question into a step-by-step process—is the foundation of everything we'll do with arrays and ArrayLists.
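Once the plan is traced on paper, it might translate into Java roughly as follows. This is a minimal sketch, not the only way to write it: the class name `HighestScore` and the method `findHighest` are hypothetical, and the method assumes the array holds at least one score.

```java
class HighestScore {
    // Returns the largest value in scores.
    // Assumes scores has at least one element (scores[0] must exist).
    static int findHighest(int[] scores) {
        int highestScoreSoFar = scores[0];            // initialize to the first score
        for (int i = 1; i < scores.length; i++) {     // go through the rest, one by one
            if (scores[i] > highestScoreSoFar) {      // is the current score higher?
                highestScoreSoFar = scores[i];        // yes: update
            }                                         // no: do nothing, move on
        }
        return highestScoreSoFar;                     // holds the answer after the loop
    }

    public static void main(String[] args) {
        int[] scores = {95, 81, 76, 99, 88, 72};
        System.out.println(findHighest(scores)); // prints 99
    }
}
```

Each pass through the loop corresponds to one row of the trace table.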
See it in action
Worked examples
Let's solidify this with a couple of practical examples. The key is to define the problem, then build a step-by-step plan.
Calculating an Average
Problem: You're given the number of minutes Maya spent on her coding homework each night last week: [45, 60, 0, 90, 75, 55, 30]. Calculate the average number of minutes she spent per day.
Solution Walkthrough:
- Step 1: Identify the Goal. We need to calculate an average. The formula for an average is Total Sum / Number of Items. This tells us we need two things from the data set: the sum of all values and the count of all values.
- Step 2: Plan the Algorithm. We'll process the list one item at a time.
  - Initialize a variable `total_minutes` to 0.
  - Initialize a variable `day_count` to 0. (Or we can just use the known size, 7.)
  - Iterate through the list [45, 60, 0, 90, 75, 55, 30].
  - For each number, add it to `total_minutes`.
  - After the loop is finished, divide `total_minutes` by the number of items (7).
- Step 3: Trace the Plan.
| Current Item | `total_minutes` (before add) | `total_minutes` (after add) |
|---|---|---|
| (Start) | 0 | 0 |
| 45 | 0 | 45 |
| 60 | 45 | 105 |
| 0 | 105 | 105 |
| 90 | 105 | 195 |
| 75 | 195 | 270 |
| 55 | 270 | 325 |
| 30 | 325 | 355 |
- Final Calculation: The loop is done. `total_minutes` is 355. The number of items is 7. Average = 355 / 7 ≈ 50.71 minutes.
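The averaging plan above could be sketched in Java like this. The class name `AverageMinutes` and the method `average` are hypothetical, and the method assumes a non-empty array. One Java-specific detail that the paper trace does not show: dividing one `int` by another discards the fraction, so the sketch casts the total to `double` before dividing.

```java
class AverageMinutes {
    // Returns the average of the values.
    // Assumes values has at least one element.
    static double average(int[] values) {
        int total = 0;                          // the sum starts at zero
        for (int value : values) {              // one item at a time
            total += value;                     // add it to the running total
        }
        // Divide only after the loop has finished. The cast avoids integer
        // division, which would turn 355 / 7 into 50.
        return (double) total / values.length;
    }

    public static void main(String[] args) {
        int[] minutes = {45, 60, 0, 90, 75, 55, 30};
        System.out.println(average(minutes));   // roughly 50.71
    }
}
```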
Counting Items that Meet a Condition
Problem: A list represents the point values of prizes at a school carnival: [5, 20, 100, 10, 20, 50, 5, 100]. How many prizes are "big ticket" items, worth 50 points or more?
Solution Walkthrough:
- Step 1: Identify the Goal. We aren't summing or averaging. We are counting how many items meet a specific condition (`value >= 50`).
- Step 2: Plan the Algorithm.
  - Initialize a counter variable, `big_ticket_count`, to 0. This is our bucket.
  - Iterate through the list [5, 20, 100, 10, 20, 50, 5, 100], one item at a time.
  - For each item, ask a question: "Is this value greater than or equal to 50?"
  - If the answer is yes, add 1 to `big_ticket_count`.
  - If the answer is no, do nothing.
  - After checking all items, `big_ticket_count` will hold our answer.
- Step 3: Trace the Plan.
| Current Item | Is Item >= 50? | `big_ticket_count` |
|---|---|---|
| (Start) | - | 0 |
| 5 | No | 0 |
| 20 | No | 0 |
| 100 | Yes | 1 |
| 10 | No | 1 |
| 20 | No | 1 |
| 50 | Yes | 2 |
| 5 | No | 2 |
| 100 | Yes | 3 |
- Final Result: After checking the whole list, the final value of `big_ticket_count` is 3. There are 3 "big ticket" items.
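A counting plan like this one might look as follows in Java. It is a sketch under stated assumptions: the class name `BigTicketCounter`, the method `countAtLeast`, and the `threshold` parameter are hypothetical names chosen for illustration.

```java
class BigTicketCounter {
    // Counts how many values are greater than or equal to threshold.
    static int countAtLeast(int[] values, int threshold) {
        int count = 0;                        // the counter ("bucket") starts at zero
        for (int value : values) {            // one item at a time
            if (value >= threshold) {         // does this item meet the condition?
                count++;                      // yes: add 1 to the counter
            }                                 // no: do nothing
        }
        return count;
    }

    public static void main(String[] args) {
        int[] prizes = {5, 20, 100, 10, 20, 50, 5, 100};
        System.out.println(countAtLeast(prizes, 50)); // prints 3
    }
}
```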
Try it yourself
Time to put these ideas into practice. Don't write code—just think through the algorithm and trace it on paper.
Problem 1: Finding the Lowest Bid
You're helping plan a school event and have collected bids from several catering companies for a pasta dinner. The bids are: [$15.50, $12.00, $18.00, $14.25, $12.50]. Your goal is to find the lowest bid.
- Your Task: Describe the algorithm in plain English. Then, create a trace table similar to the "find the highest score" example to prove your algorithm works.
- Hint: This is the mirror image of finding the maximum. What variable will you need? What will you initialize it to? What question will you ask at each step?
Problem 2: Counting Rainy Days
You have a data set representing daily rainfall in inches for Seattle over 10 days: [0.0, 0.5, 0.2, 0.0, 0.0, 1.1, 0.8, 0.1, 0.0, 0.3]. You want to know on how many days it actually rained.
- Your Task: Describe the algorithm to count the number of days with rainfall greater than 0.0.
- Hint: What variable do you need to keep track of your count? What is the condition that causes you to increment that counter?
In simple terms, a data set is a group of related facts, and this lesson is about making a step-by-step plan to go through those facts one-by-one to answer a question.
- 4.2.A: Represent patterns and algorithms that involve data sets found in everyday life using written language or diagrams.
- 4.2.A.1: A data set is a collection of specific pieces of information or data.
- 4.2.A.2: Data sets can be manipulated and analyzed to solve a problem or answer a question. When analyzing data sets, values within the set are accessed and utilized one at a time and then processed according to the desired outcome.
- 4.2.A.3: Data can be represented in a diagram by using a chart or table. This visual can be used to plan the algorithm that will be used to manipulate the data.
Read what Saavi narrates
Hi everyone, it's Saavi. Let's talk about something we all do without even thinking about it: working with lists of information.
Imagine your school's soccer team is running a fundraiser, and you've jotted down the number of water bottles sold each day: 15, 22, 18, and so on. That list of numbers is a data set. Now, if the coach asks for the best sales day, you need a process to find the answer. You can't just guess.
That's our main idea today. We're learning how to take a collection of information, a data set, and create a step-by-step plan, an algorithm, to get answers from it.
The most important rule is to process the data one item at a time. Think of a cashier scanning your groceries. They scan one item, add its price to the total, and move to the next. They don't try to add it all up at once. We need to think like that cashier.
Let's try a quick example. Say we have the daily high temperatures for five days in Boston: 55, 58, 52, 61, and 60 degrees. We want the average temperature.
First, we need a plan. To get an average, we need the total sum of the temperatures. So, our plan is to go through the list, one day at a time, and add the temperature to a running total.
We'll start with a variable, let's call it `total_temp`, and set it to zero. This is a crucial step. A really common mistake is forgetting to initialize your variables. If you try to add to a variable that has no starting value, the computer doesn't know what to do. So, always start your sums at zero.
Okay, `total_temp` is zero. First day is 55. Add it. Our total is now 55. Next is 58. Add it to 55... our total becomes 113. Next is 52. Add it... total is 165. Then 61... total is 226. Finally, 60... our grand total is 286.
Now that we've gone through the *entire* list, we can do the final step. We divide our total of 286 by the number of days, which was 5. That gives us an average of 57.2 degrees.
See how we trusted the process? We focused on one simple step—adding the current number—and repeated it. That's the heart of working with data sets. Keep practicing this way of thinking, and you'll be ready for anything the exam throws at you. You've got this.
Our brains aren't built to track dozens of values and operations simultaneously. This leads to mental overload and errors.
Trust the "one at a time" process. Define the logic for a *single* item, and then apply that logic repeatedly in a loop. Use a table to trace your algorithm to prove it works.
If you try to add a number to a "sum" variable that doesn't have a starting value, what's the result? It's undefined. In Java, a local variable has no default value, so using it before assigning one is a compile-time error.
Always initialize your "accumulator" variables before you start your loop. Sums and counts should almost always start at 0. A "maximum" or "minimum" variable should be initialized to the first element in the data set.
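To see why a maximum should start at the first element rather than at 0, consider the sketch below. The class `AccumulatorInit`, both method names, and the sub-zero temperature data are hypothetical. `maxFromZero` is written incorrectly on purpose to show the failure.

```java
class AccumulatorInit {
    // Follows the advice above: the maximum starts at the first element.
    // Assumes values has at least one element.
    static int maxFromFirst(int[] values) {
        int max = values[0];
        for (int i = 1; i < values.length; i++) {
            if (values[i] > max) {
                max = values[i];
            }
        }
        return max;
    }

    // Deliberately flawed: starting at 0 gives a wrong answer whenever
    // every value in the data set is negative.
    static int maxFromZero(int[] values) {
        int max = 0;
        for (int value : values) {
            if (value > max) {
                max = value;
            }
        }
        return max;
    }

    public static void main(String[] args) {
        int[] lows = {-8, -3, -12};                 // hypothetical winter lows
        System.out.println(maxFromFirst(lows));     // prints -3
        System.out.println(maxFromZero(lows));      // prints 0, which is not in the data
    }
}
```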
For problems like finding an average, you need the *final* sum and the *final* count. Calculating `sum / count` after every single item is processed is incorrect and computationally wasteful.
The loop's only job is to gather the necessary totals (like the sum and count). Do the final division or other concluding calculation *after* the loop has finished completely.
A small typo like using `>` (greater than) instead of `>=` (greater than or equal to) can completely change the result and cause your program to fail specific test cases.
Read the problem statement carefully. If it says "50 or more," that means `>= 50`. If it says "over 50," that means `> 50`. Write down the condition before you start tracing.
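The difference between the two operators shows up on the carnival prize list, where one prize is worth exactly 50 points. The sketch below is illustrative only: the class `BoundaryCheck` and the methods `countOver` and `countAtLeast` are hypothetical names.

```java
class BoundaryCheck {
    // "over limit" means strictly greater than.
    static int countOver(int[] values, int limit) {
        int count = 0;
        for (int value : values) {
            if (value > limit) {
                count++;
            }
        }
        return count;
    }

    // "limit or more" means greater than or equal to.
    static int countAtLeast(int[] values, int limit) {
        int count = 0;
        for (int value : values) {
            if (value >= limit) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        int[] prizes = {5, 20, 100, 10, 20, 50, 5, 100};
        System.out.println(countAtLeast(prizes, 50)); // prints 3: includes the 50-point prize
        System.out.println(countOver(prizes, 50));    // prints 2: excludes it
    }
}
```

The two methods differ by a single character, yet they disagree whenever a value sits exactly on the boundary.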