Ethical and Social Issues Around Data Collection
Why this matters
Have you ever finished a show on a streaming service, and the very next recommendation is exactly what you were in the mood for? It feels like magic. But it's not magic—it's data. That service knows what you watch, when you pause, what you re-watch, and what you skip. It collects this information to build a profile of your tastes.
This is incredibly useful, but it also raises some big questions. How much information is too much? What if the recommendations started making unfair assumptions about you based on your viewing habits? As programmers, we're the ones building these systems. We're not just writing for loops and creating objects; we're designing experiences that handle personal information and affect real people.
In this lesson, we'll step back from the code itself to look at the data that powers it. We’ll explore our ethical duties to protect user privacy, how to recognize flawed or biased data, and how to choose the right data set for the job.
Concept overview
flowchart TD
A[Start: Define a Problem] --> B{Find a Data Set};
B --> C{Is it relevant to the problem?};
C -- No --> B;
C -- Yes --> D{Does it protect user privacy?};
D -- No --> B;
D -- Yes --> E{Is the collection method potentially biased?};
E -- Yes --> F[Use data with caution & seek more info];
E -- No --> G[Use Data];
Core explanation
Hey everyone, it's Saavi. Before we dive deeper into arrays and ArrayLists, we need to talk about something fundamental: the data we put in them. The code you write is only as good as the data it uses. Let's explore the ethical and social responsibilities that come with that.
Your Digital Footprint: Privacy Risks
Every time you use an app, sign up for a website, or even just walk around with your phone, you leave a trail of digital breadcrumbs. This is your personal data. It can be obvious things like your name, email, and home address. But it can also be your location history, your search queries, your friends list, and what you buy online.
When we write programs, we often need to collect some of this data to make our apps work. A map app needs your location to give you directions. A shopping app needs your address to ship you a package.
However, collecting and storing this data carries a major risk to privacy. If that data is leaked, stolen, or misused, it can have serious consequences for your users. Imagine a health app leaking a user's medical conditions or a social media site exposing private messages.
As programmers, we have an ethical obligation to be good stewards of our users' data. This means:
- Collecting only what's necessary. Does your simple calculator app really need access to a user's contacts? Probably not.
- Being transparent. Tell users what data you're collecting and why.
- Safeguarding the data. Store it securely and protect it from unauthorized access.
This isn't just a nice idea; it's a core professional responsibility.
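To make the stewardship ideas above a little more concrete, here is a minimal Java sketch. The `ShippingRequest` class, its fields, and its redacting `toString()` are hypothetical names invented for illustration, not taken from any real shopping app. The class shows one possible way to collect only what a shipping feature needs and to avoid printing personal details by accident.

```java
// Hypothetical sketch: a shipping feature that stores only what it needs.
class ShippingRequest {
    private final String recipientName;
    private final String streetAddress;

    // There are deliberately no fields for contacts, location history,
    // or search queries: data that is never collected cannot be leaked.
    ShippingRequest(String recipientName, String streetAddress) {
        this.recipientName = recipientName;
        this.streetAddress = streetAddress;
    }

    String getRecipientName() { return recipientName; }
    String getStreetAddress() { return streetAddress; }

    // Printing or logging the object does not reveal personal details.
    @Override
    public String toString() {
        return "ShippingRequest[details hidden]";
    }

    public static void main(String[] args) {
        ShippingRequest request = new ShippingRequest("Maya", "123 Main St");
        System.out.println(request); // prints ShippingRequest[details hidden]
    }
}
```

Real systems need much more than this (encryption, access controls, retention rules), but the habit of asking "does this class really need that field?" is a reasonable place to start.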
"Garbage In, Garbage Out": Data Quality
There's an old saying in computer science: "Garbage in, garbage out." It means that if you feed a program bad data, you're going to get bad results, no matter how brilliant your code is. Let's break down two major types of "garbage" you'll encounter.
1. Algorithmic Bias
An algorithm isn't born with opinions. It learns them from data. Algorithmic bias describes systemic, repeated errors in a program that lead to unfair outcomes for specific groups of people. The bias almost always comes from the data used to train the program.
Think of it like this: you want to build an app that uses AI to screen job applicants for a software engineering role. To train your AI, you feed it 10,000 resumes of people your company has hired in the past. But, historically, your company has mostly hired men named Jared and Jason from two specific universities.
What will your AI learn? It will learn that successful candidates look like... Jared and Jason. It might then unfairly penalize highly qualified applicants like Maya from a state school or Carlos from a coding bootcamp because they don't fit the pattern it was taught.
The algorithm isn't intentionally "sexist" or "elitist." It's just reflecting the bias present in its training data.
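The resume example can be sketched as a toy program. This is only an illustration under simplifying assumptions: the `BiasedScreener` class, its scoring rule, and the placeholder school names are all hypothetical, and real screening systems are far more complex. The point is that the "model" below never looks at skill at all; it only echoes the hiring history it was given.

```java
import java.util.HashMap;
import java.util.Map;

// Toy illustration only: a "model" that learns from past hiring decisions.
class BiasedScreener {
    private final Map<String, Integer> hiresBySchool = new HashMap<>();
    private int totalHires = 0;

    // "Training": count how often each school appears among past hires.
    void train(String[] pastHireSchools) {
        for (String school : pastHireSchools) {
            hiresBySchool.put(school, hiresBySchool.getOrDefault(school, 0) + 1);
            totalHires++;
        }
    }

    // Score = fraction of past hires who came from the applicant's school.
    // Nothing here measures the applicant's actual ability.
    double score(String applicantSchool) {
        if (totalHires == 0) {
            return 0.0;
        }
        return (double) hiresBySchool.getOrDefault(applicantSchool, 0) / totalHires;
    }

    public static void main(String[] args) {
        BiasedScreener screener = new BiasedScreener();
        // Hypothetical hiring history dominated by one school.
        screener.train(new String[] {
            "University A", "University A", "University B", "University A"
        });
        System.out.println(screener.score("University A")); // prints 0.75
        System.out.println(screener.score("State School")); // prints 0.0
    }
}
```

An applicant from "State School" scores 0.0 no matter how qualified they are, simply because nobody from that school appears in the training data.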
2. Incomplete or Inaccurate Data
Sometimes data isn't biased; it's just plain wrong or missing pieces. Using incomplete or inaccurate data can cause your program to fail, give incorrect answers, or run very inefficiently.
Imagine you're building a navigation app for delivery drivers in Seattle. You buy a data set of all the city streets. But what if the data set is from 2015? It's missing a brand-new bridge that saves 20 minutes on a popular route. Your app, working with this outdated data, will keep sending drivers on a longer, less efficient path. The program works "correctly" based on the data it has, but the result is wrong in the real world.
Or, what if a data set of house prices has typos? A house worth $500,000 might be entered as $50,000. If your program calculates the average home price in a neighborhood, this one error could drastically and incorrectly lower the result.
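The effect of a single typo on an average is easy to check with a few lines of Java. The neighborhood prices below are made-up numbers chosen for illustration, and `HousePriceAverage` is a hypothetical class name; only the $500,000 versus $50,000 typo comes from the example above.

```java
// Sketch: how one mistyped value changes an average.
class HousePriceAverage {
    // Returns the arithmetic mean of the given prices.
    static double average(int[] prices) {
        if (prices.length == 0) {
            throw new IllegalArgumentException("prices must not be empty");
        }
        long sum = 0;
        for (int price : prices) {
            sum += price;
        }
        return (double) sum / prices.length;
    }

    public static void main(String[] args) {
        // Hypothetical neighborhood of four houses.
        int[] correct  = {480000, 500000, 520000, 500000};
        // Same data, but one $500,000 house was entered as $50,000.
        int[] withTypo = {480000, 50000, 520000, 500000};

        System.out.println(average(correct));  // prints 500000.0
        System.out.println(average(withTypo)); // prints 387500.0
    }
}
```

The code is identical in both runs; one bad entry among four pulls the average down by $112,500.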
The Right Tool for the Job: Choosing Data Sets
Finally, even if a data set is private, unbiased, and accurate, it might still be the wrong one for your specific problem. The context of a data set is everything.
Let's say you're building an app to help a political campaign in Dallas understand key voter issues. You find a very detailed, high-quality data set about voter concerns... from Boston.
Can you use it? No.
The concerns of voters in Boston (e.g., public transit, winter snow removal) are likely very different from the concerns of voters in Dallas (e.g., highway traffic, property taxes). Using the Boston data to make decisions in Dallas would be like using a French dictionary to translate a Spanish document. The tool is good, but it's for the wrong job.
When you're faced with a problem, always ask: What question am I trying to answer? Then, evaluate potential data sets to see if they are actually relevant to that specific question. A data set about national ice cream flavor preferences won't help you predict the price of gasoline in Atlanta.
Worked examples
Let's walk through a few scenarios to see how these concepts play out in practice.
The Pothole Reporting App
Problem: The city of Chicago wants to create an app that lets residents report potholes. The city plans to use this data to prioritize which streets to repair. After six months, the data shows that 90% of the reported potholes are in wealthier, high-income neighborhoods. A city official suggests this means the roads in lower-income neighborhoods are in better shape. Is this a valid conclusion?
Step-by-Step Solution:
- 1. Identify the Goal: The goal is to identify and prioritize pothole repairs across the entire city.
- 2. Analyze the Data Collection Method: The data is collected through voluntary user submissions on a smartphone app. This is a critical piece of information.
- 3. Look for Potential Bias: Who is most likely to use this app? Residents who have smartphones, reliable internet access, and the time and civic inclination to report a pothole. These factors often correlate with higher income levels. Residents in lower-income areas might lack access to the technology, be working multiple jobs with less free time, or be less trustful that reporting will lead to action.
- 4. Formulate a Conclusion: The conclusion that lower-income neighborhoods have better roads is likely invalid. The data is not a complete picture of all potholes in the city; it's a picture of reported potholes. The collection method has introduced a significant sampling bias. The data over-represents the experiences of wealthier residents and under-represents the experiences of others.
- 5. Recommend a Better Approach: To get a more accurate picture, the city should not rely solely on the app. They could supplement the data by sending city crews to survey streets in neighborhoods with few or no reports.
Where students get it wrong: A common mistake is to take the data at face value. "The data says there are more potholes in rich neighborhoods, so that must be true." The key skill here is to look behind the data and question how it was collected.
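The gap between actual potholes and reported potholes can be shown with a small calculation. Every number in this sketch is an assumption made up for illustration: both neighborhoods are given the same number of real potholes, and only the fraction that gets reported differs. `PotholeReports` and `expectedReports` are hypothetical names.

```java
// Sketch: equal road conditions, unequal reporting.
class PotholeReports {
    // Expected reports = actual potholes times the fraction that get reported.
    static long expectedReports(int actualPotholes, double reportingRate) {
        return Math.round(actualPotholes * reportingRate);
    }

    public static void main(String[] args) {
        // Assumed values: 100 real potholes in each neighborhood.
        long wealthier   = expectedReports(100, 0.9); // 90 reports
        long lowerIncome = expectedReports(100, 0.1); // 10 reports

        double wealthierShare = 100.0 * wealthier / (wealthier + lowerIncome);
        System.out.println(wealthierShare + "% of reports come from the wealthier area");
        // prints 90.0% of reports come from the wealthier area
    }
}
```

Under these assumptions the roads are in identical condition, yet 90% of the reports still come from one neighborhood. The report counts describe who uses the app, not where the potholes are.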
Choosing a Data Set for a Fitness App
Problem: You're developing a new feature for a fitness app that recommends 5K race training plans for beginners. You need a data set to help predict a realistic finishing time for a new user. You have access to the following three data sets:
- Data Set A: The finishing times for every runner in last year's Boston Marathon.
- Data Set B: A data set from a university study containing the weekly mileage and final 5K times for 500 first-time runners.
- Data Set C: Your app's own data, which includes the step counts and calories burned for millions of users.
Which data set is most appropriate?
Step-by-Step Solution:
- 1. State the Specific Question: "What is a realistic 5K finishing time for a beginner runner?"
- 2. Evaluate Data Set A (Boston Marathon): This data set is about marathon runners, who are almost all highly experienced and well-trained. The distance is 26.2 miles, not 5K (3.1 miles). This data is not relevant to beginner 5K runners.
- 3. Evaluate Data Set C (Your App's Data): This data is huge, but is it relevant? It contains step counts and calories, but it doesn't tell you if a user is a runner, a walker, or a cyclist. It also doesn't contain any race finishing times. This data is not appropriate for answering the specific question.
- 4. Evaluate Data Set B (University Study): This data set is directly about the target population: first-time runners. It includes the exact outcome you want to predict: 5K times. Although the sample size (500) is smaller than the other sets, its contents are perfectly aligned with your question.
- 5. Conclusion: Data Set B is the most appropriate choice. It is the only one that is directly relevant to the problem you are trying to solve.
Where students get it wrong: Students are often tempted by "big data" and might choose Data Set C because it has "millions of users." But a smaller, highly relevant data set is always better than a massive, irrelevant one. The quality and context of the data matter more than the quantity.
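The reasoning in this example can be written as a simple relevance check. This is a rough sketch, not a real data-selection method: the `DataSetChooser` class is hypothetical, and the tags are informal labels chosen here to summarize what each data set contains.

```java
import java.util.Arrays;
import java.util.List;

// Sketch: judge a data set by what it records, not by how big it is.
class DataSetChooser {
    // Returns true only if the data set covers everything the question needs.
    static boolean isRelevant(List<String> dataSetTags, List<String> requiredTags) {
        return dataSetTags.containsAll(requiredTags);
    }

    public static void main(String[] args) {
        // What the question "realistic 5K time for a beginner?" requires.
        List<String> required = Arrays.asList("5K time", "beginner runners");

        // Informal summaries of Data Sets A, B, and C.
        List<String> setA = Arrays.asList("marathon time", "experienced runners");
        List<String> setB = Arrays.asList("weekly mileage", "5K time", "beginner runners");
        List<String> setC = Arrays.asList("step count", "calories burned");

        System.out.println("A relevant? " + isRelevant(setA, required)); // prints A relevant? false
        System.out.println("B relevant? " + isRelevant(setB, required)); // prints B relevant? true
        System.out.println("C relevant? " + isRelevant(setC, required)); // prints C relevant? false
    }
}
```

Notice that the check never asks how many rows a data set has. Size only becomes worth comparing once a data set has passed the relevance test.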
Try it yourself
Ready to try applying these ideas? Here are a couple of scenarios to think through.
- 1. Social Media Feature: A new social media app wants to add a "People You May Know" feature. To do this, the programmers plan to ask for permission to upload and analyze a user's entire phone contact list. What is a potential privacy risk of this feature? What is one way the programmers could reduce that risk while still providing a useful feature?
- Hint: Think about the people in the user's contact list who are NOT on the app. Does the company have their consent to have their data?
- 2. Predicting Grades: Your school wants to build a program to identify students at risk of failing a class. They have a data set that includes students' attendance records and their final grades from the previous school year. What is a potential issue with using this data to predict performance in the current school year? Is there a group of students this data might be biased against?
- Hint: Think about students who are new to the school. What data would the system have for them? Also, consider why a student might have poor attendance.
Practice — 8 questions
In simple terms, this topic is about the responsibilities we have as programmers: how to protect user privacy, spot unfairness or errors in data, and choose the right information to build helpful and accurate programs.
- 4.1.A: Explain the risks to privacy from collecting and storing personal data on computer systems.
- 4.1.B: Explain the importance of recognizing data quality and potential issues when using a data set.
- 4.1.C: Identify an appropriate data set to use in order to solve a problem or answer a specific question.
- 4.1.A.1: When using a computer, personal privacy is at risk. When developing new programs, programmers should attempt to safeguard the personal privacy of the user.
- 4.1.B.1: Algorithmic bias describes systemic and repeated errors in a program that create unfair outcomes for a specific group of users.
- 4.1.B.2: Programmers should be aware of the data set collection method and the potential for bias when using this method before using the data to extrapolate new information or drawing conclusions.
- 4.1.B.3: Some data sets are incomplete or contain inaccurate data. Using such data in the development or use of a program can cause the program to work incorrectly or inefficiently.
- 4.1.C.1: Contents of a data set might be related to a specific question or topic and might not be appropriate to give correct answers or extrapolate information for a different question or topic.
Read what Saavi narrates
Hey everyone, it's Saavi.
Have you ever finished a show on a streaming service, and the very next recommendation is... perfect? It feels like magic, but it's really just data. That service knows what you watch, what you skip, and for how long. It's useful... but also makes you think, doesn't it?
As programmers, we're the ones building these systems. So today, we're going to talk about the human side of code. We'll explore our responsibilities when we handle user data... focusing on privacy, fairness, and choosing the right information to solve a problem.
Let's walk through an example together. Imagine the city of Chicago creates an app for residents to report potholes. After a few months, the data shows that most reports are coming from wealthier neighborhoods. A city official might look at that and say, "Great! It looks like the roads in lower-income areas are in better shape."
But wait... let's think like a programmer. We have to question the data. How was it collected? Through a smartphone app. Who is most likely to have a smartphone, a stable internet connection, and the free time to report a pothole? It's possible that this method of collection is creating a bias. The data doesn't show where the potholes *are*, it shows where they're being *reported*. The real problem might be that residents in other areas don't have equal access to the reporting tool.
This brings up a really common mistake students make. They tend to take data at face value. They see a chart and assume it's the whole truth. But as programmers, we have to be detectives. We have to look behind the numbers and ask... how was this collected? Who might be left out? The data itself can be misleading.
So, as you continue your journey in computer science, remember that you're not just writing code. You're building systems that affect real people. Always think critically about the data you use. It’s one of the most important jobs a programmer has. You've got this.
A massive data set can still be biased or irrelevant. If your data is biased, more data just gives you more confidence in the wrong conclusion.
Prioritize the relevance and quality of the data over its size. A small, clean, relevant data set is far more valuable than a large, messy, irrelevant one.
The algorithm is usually just executing instructions. The bias almost always originates from the data it was trained on or the assumptions made by the programmers.
When you see a biased outcome, investigate the data first. Ask: "What patterns in the data would lead to this result?"
Data is not neutral. It's collected by people, with specific tools, in specific contexts. Every data set has a story and a point of view, which may include flaws and biases.
Treat every data set with healthy skepticism. Always question where it came from, how it was collected, and what might be missing.
This leads to solving the wrong problem. If you have data on bird migration, trying to use it to predict stock prices is nonsensical, even if the data is perfect.
Start with your question or problem first. Then, go looking for data that is specifically suited to answering that question.
Seemingly harmless data points (like your location, search history, or "likes") can be combined to create a surprisingly detailed and invasive profile of you.
Understand that almost all user data is personal data. As a programmer, treat all of it with care and respect the user's privacy.