Ethical and Social Issues Around Data Collection

In simple terms: This topic is about our responsibilities as programmers, including how to protect user privacy, spot unfairness or errors in data, and choose the right information to build helpful and accurate programs.

Why this matters

Have you ever finished a show on a streaming service, and the very next recommendation is exactly what you were in the mood for? It feels like magic. But it's not magic—it's data. That service knows what you watch, when you pause, what you re-watch, and what you skip. It collects this information to build a profile of your tastes.

This is incredibly useful, but it also raises some big questions. How much information is too much? What if the recommendations started making unfair assumptions about you based on your viewing habits? As programmers, we're the ones building these systems. We're not just writing for loops and creating objects; we're designing experiences that handle personal information and affect real people.

In this lesson, we'll step back from the code itself to look at the data that powers it. We’ll explore our ethical duties to protect user privacy, how to recognize flawed or biased data, and how to choose the right data set for the job.

The 'magic' of recommendations is an 'if-then' system based on your data.

Concept overview

flowchart TD
    A[Start: Define a Problem] --> B{Find a Data Set};
    B --> C{Is it relevant to the problem?};
    C -- No --> B;
    C -- Yes --> D{Does it protect user privacy?};
    D -- No --> B;
    D -- Yes --> E{Is the collection method potentially biased?};
    E -- Yes --> F[Use data with caution & seek more info];
    E -- No --> G[Use Data];
This diagram shows a flowchart for evaluating a data set. It starts with defining a problem and finding data, then moves through decision points for relevance, privacy, and bias before deciding whether to use the data or find a new set.
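
One way to read the flowchart is as a short decision method. The sketch below is only an illustration: the class name `DataSetCheck`, the method `evaluate`, and its three boolean parameters are invented here, and in real projects each answer comes from careful human judgment rather than a simple true/false flag.

```java
public class DataSetCheck {

    // Mirrors the flowchart's three questions, asked in the same order.
    // All names are hypothetical; the booleans stand in for judgment calls.
    public static String evaluate(boolean relevant,
                                  boolean protectsPrivacy,
                                  boolean collectionMayBeBiased) {
        if (!relevant) {
            return "Find another data set";
        }
        if (!protectsPrivacy) {
            return "Find another data set";
        }
        if (collectionMayBeBiased) {
            return "Use data with caution and seek more info";
        }
        return "Use data";
    }

    public static void main(String[] args) {
        System.out.println(evaluate(true, true, false));  // Use data
        System.out.println(evaluate(true, true, true));   // Use data with caution and seek more info
        System.out.println(evaluate(false, true, false)); // Find another data set
    }
}
```

Notice that relevance and privacy are checked before bias: a data set that fails either of the first two questions is set aside no matter how it was collected.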

Core explanation

Hey everyone, it's Saavi. Before we dive deeper into arrays and ArrayLists, we need to talk about something fundamental: the data we put in them. The code you write is only as good as the data it uses. Let's explore the ethical and social responsibilities that come with that.

Your Digital Footprint: Privacy Risks

Every time you use an app, sign up for a website, or even just walk around with your phone, you leave a trail of digital breadcrumbs. This is your personal data. It can be obvious things like your name, email, and home address. But it can also be your location history, your search queries, your friends list, and what you buy online.

When we write programs, we often need to collect some of this data to make our apps work. A map app needs your location to give you directions. A shopping app needs your address to ship you a package.

However, collecting and storing this data carries a major risk to privacy. If that data is leaked, stolen, or misused, it can have serious consequences for your users. Imagine a health app leaking a user's medical conditions or a social media site exposing private messages.

As programmers, we have an ethical obligation to be good stewards of our users' data. This means:

  • Collecting only what's necessary
    Does your simple calculator app really need access to a user's contacts? Probably not.
  • Being transparent
    Tell users what data you're collecting and why.
  • Safeguarding the data
    Store it securely and protect it from unauthorized access.

Comparing good vs. bad data collection practices for privacy.

This isn't just a nice idea; it's a core professional responsibility.
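
Here is a minimal sketch of what "collecting only what's necessary" might look like in code, assuming an imaginary shipping app. The class `ShippingForm`, the method `keepOnlyNeeded`, and the field names are all made up for this example; a real app would also need secure storage and a clear privacy notice, which this sketch does not cover.

```java
import java.util.HashMap;
import java.util.Map;

public class ShippingForm {

    // Hypothetical allow-list: the only fields this imaginary app needs to ship a package.
    private static final String[] NEEDED_FIELDS = {"name", "address"};

    // Returns a new map holding only the needed fields; everything else is left out.
    public static Map<String, String> keepOnlyNeeded(Map<String, String> submitted) {
        Map<String, String> kept = new HashMap<>();
        for (String field : NEEDED_FIELDS) {
            if (submitted.containsKey(field)) {
                kept.put(field, submitted.get(field));
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Map<String, String> submitted = new HashMap<>();
        submitted.put("name", "Maya");
        submitted.put("address", "12 Example St");
        submitted.put("contacts", "(entire contact list)");
        submitted.put("location", "(location history)");

        // Only the name and address survive; contacts and location are dropped.
        System.out.println(keepOnlyNeeded(submitted).keySet());
    }
}
```

The design choice worth noticing is the allow-list: instead of listing what to throw away, the code lists what it is allowed to keep, so any new field is excluded by default.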

"Garbage In, Garbage Out": Data Quality

There's an old saying in computer science: "Garbage in, garbage out." It means that if you feed a program bad data, you're going to get bad results, no matter how brilliant your code is. Let's break down two major types of "garbage" you'll encounter.

1. Algorithmic Bias

An algorithm isn't born with opinions. It learns them from data. Algorithmic bias describes systemic, repeated errors in a program that lead to unfair outcomes for specific groups of people. The bias almost always comes from the data used to train the program.

How biased training data leads to biased algorithmic outcomes.

Think of it like this: you want to build an app that uses AI to screen job applicants for a software engineering role. To train your AI, you feed it 10,000 resumes of people your company has hired in the past. But, historically, your company has mostly hired men named Jared and Jason from two specific universities.

What will your AI learn? It will learn that successful candidates look like... Jared and Jason. It might then unfairly penalize highly qualified applicants like Maya from a state school or Carlos from a coding bootcamp because they don't fit the pattern it was taught.

The algorithm isn't intentionally "sexist" or "elitist." It's just reflecting the bias present in its training data.
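
To make this concrete, here is a deliberately tiny stand-in for a "trained" screener. It is not how real AI systems work; it simply counts how often each university appears among past hires and scores new applicants by that count. The class `ResumeScreener`, its methods, and the hiring history are all invented for illustration.

```java
import java.util.HashMap;
import java.util.Map;

public class ResumeScreener {

    // "Trains" by counting how many past hires came from each university.
    public static Map<String, Integer> train(String[] pastHireUniversities) {
        Map<String, Integer> counts = new HashMap<>();
        for (String university : pastHireUniversities) {
            counts.put(university, counts.getOrDefault(university, 0) + 1);
        }
        return counts;
    }

    // Scores an applicant by how familiar their university looks to the model.
    // Skills never enter the calculation, so the past pattern is all it can repeat.
    public static int score(Map<String, Integer> model, String university) {
        return model.getOrDefault(university, 0);
    }

    public static void main(String[] args) {
        // Made-up hiring history that is skewed toward two schools.
        String[] pastHires = {"University A", "University A", "University A",
                              "University B", "University B"};
        Map<String, Integer> model = train(pastHires);

        System.out.println("University A applicant: " + score(model, "University A")); // 3
        System.out.println("State school applicant: " + score(model, "State School")); // 0
    }
}
```

Nothing in the code mentions fairness or unfairness. The state school applicant scores zero only because nobody like them appears in the training data.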

2. Incomplete or Inaccurate Data

Sometimes data isn't biased; it's just plain wrong or missing pieces. Using incomplete or inaccurate data can cause your program to fail, give incorrect answers, or run very inefficiently.

Imagine you're building a navigation app for delivery drivers in Seattle. You buy a data set of all the city streets. But what if the data set is from 2015? It's missing a brand-new bridge that saves 20 minutes on a popular route. Your app, working with this outdated data, will keep sending drivers on a longer, less efficient path. The program works "correctly" based on the data it has, but the result is wrong in the real world.

Or, what if a data set of house prices has typos? A house worth $500,000 might be entered as $50,000. If your program calculates the average home price in a neighborhood, this one error could drastically and incorrectly lower the result.
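
The house price typo is easy to check with a little arithmetic. The sketch below assumes a made-up neighborhood of five houses; the class name `HousePrices` and the prices other than the $500,000 / $50,000 pair are invented for this example.

```java
public class HousePrices {

    // Returns the arithmetic mean of the prices.
    public static double average(double[] prices) {
        double sum = 0;
        for (double price : prices) {
            sum += price;
        }
        return sum / prices.length;
    }

    public static void main(String[] args) {
        // Hypothetical neighborhood of five houses.
        double[] correct = {480000, 500000, 520000, 510000, 490000};
        // Same houses, but the $500,000 house was typed as $50,000.
        double[] withTypo = {480000, 50000, 520000, 510000, 490000};

        System.out.println(average(correct));  // 500000.0
        System.out.println(average(withTypo)); // 410000.0
    }
}
```

One missing zero pulls the average down by $90,000, even though `average` itself has no bugs. The code is fine; the data is the problem.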

The Right Tool for the Job: Choosing Data Sets

Finally, even if a data set is private, unbiased, and accurate, it might still be the wrong one for your specific problem. The context of a data set is everything.

Let's say you're building an app to help a political campaign in Dallas understand key voter issues. You find a very detailed, high-quality data set about voter concerns... from Boston.

Can you use it? No.

The concerns of voters in Boston (e.g., public transit, winter snow removal) are likely very different from the concerns of voters in Dallas (e.g., highway traffic, property taxes). Using the Boston data to make decisions in Dallas would be like using a French dictionary to translate a Spanish document. The tool is good, but it's for the wrong job.

When you're faced with a problem, always ask: What question am I trying to answer? Then, evaluate potential data sets to see if they are actually relevant to that specific question. A data set about national ice cream flavor preferences won't help you predict the price of gasoline in Atlanta.

Worked examples

Let's walk through a few scenarios to see how these concepts play out in practice.

Example 1

The Pothole Reporting App

Problem: The city of Chicago wants to create an app that lets residents report potholes. The city plans to use this data to prioritize which streets to repair. After six months, the data shows that 90% of the reported potholes are in higher-income neighborhoods. A city official suggests this means the roads in lower-income neighborhoods are in better shape. Is this a valid conclusion?

Step-by-Step Solution:

  1. Identify the Goal
    The goal is to identify and prioritize pothole repairs across the entire city.
  2. Analyze the Data Collection Method
    The data is collected through voluntary user submissions on a smartphone app. This is a critical piece of information.
  3. Look for Potential Bias
    Who is most likely to use this app? Residents who have smartphones, reliable internet access, and the time and civic inclination to report a pothole. These factors often correlate with higher income levels. Residents in lower-income areas might lack access to the technology, be working multiple jobs with less free time, or be less trustful that reporting will lead to action.
  4. Formulate a Conclusion
    The conclusion that lower-income neighborhoods have better roads is likely invalid. The data is not a complete picture of all potholes in the city; it's a picture of reported potholes. The collection method has introduced a significant sampling bias. The data over-represents the experiences of wealthier residents and under-represents the experiences of others.
  5. Recommend a Better Approach
    To get a more accurate picture, the city should not rely solely on the app. They could supplement the data by sending city crews to survey streets in neighborhoods with few or no reports.

Where students get it wrong: A common mistake is to take the data at face value. "The data says there are more potholes in rich neighborhoods, so that must be true." The key skill here is to look behind the data and question how it was collected.
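
To see how reported potholes can differ from actual potholes, here is a small sketch using made-up numbers. It assumes the city has also surveyed two neighborhoods and found the same number of potholes in each; the class `PotholeReports`, the method `reportingRate`, and every count are hypothetical.

```java
public class PotholeReports {

    // Fraction of the potholes found by a survey that residents also reported in the app.
    public static double reportingRate(int reportedInApp, int foundBySurvey) {
        return (double) reportedInApp / foundBySurvey;
    }

    public static void main(String[] args) {
        // Made-up numbers for two neighborhoods with the same number of real potholes.
        int surveyedEach = 100;
        int reportedHigherIncome = 90;
        int reportedLowerIncome = 10;

        System.out.println(reportingRate(reportedHigherIncome, surveyedEach)); // 0.9
        System.out.println(reportingRate(reportedLowerIncome, surveyedEach));  // 0.1
    }
}
```

In this invented scenario the roads are equally bad, yet the app data alone would suggest one neighborhood has nine times as many potholes. The difference is in who reports, not in the streets.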

Example 2

Choosing a Data Set for a Fitness App

Problem: You're developing a new feature for a fitness app that recommends 5K race training plans for beginners. You need a data set to help predict a realistic finishing time for a new user. You have access to the following three data sets:

  • Data Set A
    The finishing times for every runner in last year's Boston Marathon.
  • Data Set B
    A data set from a university study containing the weekly mileage and final 5K times for 500 first-time runners.
  • Data Set C
    Your app's own data, which includes the step counts and calories burned for millions of users.

Which data set is most appropriate?

Step-by-Step Solution:

  1. State the Specific Question
    "What is a realistic 5K finishing time for a beginner runner?"
  2. Evaluate Data Set A (Boston Marathon)
    This data set is about marathon runners, who are almost all highly experienced and well-trained. The distance is 26.2 miles, not 5K (3.1 miles). This data is not relevant to beginner 5K runners.
  3. Evaluate Data Set C (Your App's Data)
    This data is huge, but is it relevant? It contains step counts and calories, but it doesn't tell you if a user is a runner, a walker, or a cyclist. It also doesn't contain any race finishing times. This data is not appropriate for answering the specific question.
  4. Evaluate Data Set B (University Study)
    This data set is directly about the target population: first-time runners. It includes the exact outcome you want to predict: 5K times. Although the sample size (500) is smaller than the other sets, its contents are perfectly aligned with your question.
  5. Conclusion
    Data Set B is the most appropriate choice. It is the only one that is directly relevant to the problem you are trying to solve.

Where students get it wrong: Students are often tempted by "big data" and might choose Data Set C because it has "millions of users." But a smaller, highly relevant data set is more useful than a massive, irrelevant one. The quality and context of the data matter more than the quantity.
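
The relevance check in this example can be sketched as a simple question: does the data set contain the fields your question needs? The code below is a rough illustration only. The class `DataSetPicker`, the method `hasRequiredFields`, and all field names are invented, and a real evaluation would also look at who is in the data, not just which columns exist.

```java
import java.util.Arrays;
import java.util.List;

public class DataSetPicker {

    // A data set can only answer the question if it holds every field the question needs.
    public static boolean hasRequiredFields(List<String> dataSetFields,
                                            List<String> requiredFields) {
        return dataSetFields.containsAll(requiredFields);
    }

    public static void main(String[] args) {
        // Field names are invented for illustration.
        List<String> required = Arrays.asList("runnerExperience", "fiveKTime");

        List<String> setA = Arrays.asList("runnerName", "marathonTime");
        List<String> setB = Arrays.asList("runnerExperience", "weeklyMileage", "fiveKTime");
        List<String> setC = Arrays.asList("stepCount", "caloriesBurned");

        System.out.println("A: " + hasRequiredFields(setA, required)); // A: false
        System.out.println("B: " + hasRequiredFields(setB, required)); // B: true
        System.out.println("C: " + hasRequiredFields(setC, required)); // C: false
    }
}
```

The size of each data set never appears in the check. Millions of rows of step counts still cannot answer a question about 5K finishing times.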

Visualizing sampling bias: reported potholes vs. actual potholes.
Comparing data sets for a fitness app: which one is most appropriate?

Try it yourself

Ready to try applying these ideas? Here are a couple of scenarios to think through.

  1. Social Media Feature
    A new social media app wants to add a "People You May Know" feature. To do this, the programmers plan to ask for permission to upload and analyze a user's entire phone contact list. What is a potential privacy risk of this feature? What is one way the programmers could reduce that risk while still providing a useful feature?
    • Hint: Think about the people in the user's contact list who are NOT on the app. Does the company have their consent to have their data?
  2. Predicting Grades
    Your school wants to build a program to identify students at risk of failing a class. They have a data set that includes students' attendance records and their final grades from the previous school year. What is a potential issue with using this data to predict performance in the current school year? Is there a group of students this data might be biased against?
    • Hint: Think about students who are new to the school. What data would the system have for them? Also, consider why a student might have poor attendance.
Considering privacy for a 'People You May Know' feature.