Data and Access

Eleanor C Sayre

Abstract

New education researchers are often curious about how much data they need, or whether their small classes are big enough to collect enough data. This article outlines common data types and quantities in education research, and how to match your data types to your research questions. It’s great to use this resource as a starting point to help you frame your work. You’ll want to go more in depth for the particular data types which interest you. There are tons of resources which go into a lot more depth than this quick introduction, and there are a lot of details omitted and subtleties elided here.

What kind of data should I collect?

You should collect data which helps you answer your research question and which helps you generate new research questions to ask. There isn’t a single “best” kind of data to collect, only data which matches research questions well (or poorly). However, most education research projects tend to settle into the following major data types (even though other types are possible).

For small projects or exploratory work, you might only use one data stream to get you started. As projects grow in scope and complexity, many projects use multiple data streams to help triangulate answers to their research questions. Combining different data streams together (of different data types or multiple streams of the same type) is standard practice.

Triangulation happens when you combine different data streams to build a more robust picture of what’s happening in a particular topic or with a particular person.

If you can, you should seek rich data. Rich data are data which, as much as possible, link process information, contextual information (what are they working on? what happened immediately before and after?), and metadata. They are rich because they carry a wealth of interacting and supporting information. The opposite of rich data is impoverished data. Anonymized student grades are impoverished because you don’t have information about how they were graded, what the students did to earn those grades, or who the students are as people. In general, the richer your data are, the more interesting the questions you can answer with them.

Process information shows the processes of how people interact and learn: how are they learning? In contrast, outcome information merely shows the results of those processes: what do they know afterwards?

Metadata is data about data: information about where & when & who the data come from, the instructional context surrounding the data, which ethics approval it was collected under, etc.

Surveys

Surveys are usually administered on paper or electronically. They consist of a series of predetermined questions, and those questions are often multiple-choice, free-response, or a mix. Some surveys have branching logic (e.g. if a participant answers X, then show question A; if they answer Y, show question B), but many surveys do not.
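As a rough illustration of branching logic, here is a minimal Python sketch. The question ids, placeholder texts, and the `next_question` helper are all hypothetical; real survey platforms have their own ways of configuring branches.

```python
# Hypothetical survey flow: question ids, texts, and answer labels are
# placeholders for illustration, not taken from any real instrument.
SURVEY = {
    "screening": {
        "text": "Placeholder screening question",
        "next": {"X": "A", "Y": "B"},  # answer -> id of the question to show next
    },
    "A": {"text": "Placeholder question A", "next": {}},
    "B": {"text": "Placeholder question B", "next": {}},
}


def next_question(current_id, answer):
    """Return the id of the question to show next, or None if this branch ends."""
    return SURVEY[current_id]["next"].get(answer)


print(next_question("screening", "X"))
print(next_question("screening", "Y"))
```

A survey without branching logic is the special case where every participant sees the same questions in the same order.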

Surveys are a great choice …

to bring prevalence information to rich qualitative data.
for screening potential interview participants.
to replicate other studies or compare across groups.

Surveys are a poor choice …

when you have few participants
if you don’t validate them
as a sole source of information in a project.

Other considerations about surveys

Some people use a survey to screen participants for some other research participation. For example, you might survey all students in a class to identify students with particularly interesting ideas for follow-up interviews (or just interest and availability for interviews), or you might conduct follow-up surveys of all your interview participants one year later, to see how their ideas have developed.

Other people use surveys as a primary data source about student ideas. For example, you might administer a concept inventory to all students before and after instruction, so that you can see how their ideas are changing. Alternately, you might want to look for differences in student beliefs about science as a function of student demographics or institutional characteristics.

Where can I get them?

If you know which one you want, write to the author.

If you want to know what’s available, do a literature review.

Can I make my own?

This is very common if the purpose of your survey is very local, or if you’re using the survey information primarily to screen participants for some other research participation. You should check that participants are responding to your questions in ways that you understand, but you don’t need to go through an extensive validation process.

Alternately, some researchers want to develop a new concept inventory or generalizable research survey, for use beyond their particular local context. Validating a new survey about student ideas can be a really extensive process without much payoff at the end, especially because new survey developers often overestimate the desire of other faculty to use their developed surveys. While building a new research survey is an appealing idea to many new researchers, it is rarely actually a good choice. It’s often better to start with qualitative analysis of rich written / verbal information, and use your initial research to motivate other extensions or generalizations of your work.

Rich data is data which, as much as possible, links process information (how do people interact or learn?), contextual information (what are they working on? what happened immediately before and after?), and metadata (when was this collected? who are these people?).

Surveys and statistics

If you use surveys, you might need to use statistics to understand your results quantitatively. Some people are really excited about this. Some people are intimidated.

If your research design might require the use of statistics, then you need to conduct a power analysis (or similar quantitative estimation) to guess at how many participants you will need in order to see the differences you expect. If your power analysis suggests that you will not be able to detect differences with the group of participants you have, then surveys are unlikely to be a good use of your time (or the time of your participants).

(Of course, if you’re using a survey to collect screening or scheduling information about participants, statistics might be excessive or nonsensical.)
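To make the idea of a power analysis concrete, here is a minimal Python sketch for one common case: comparing the means of two equal-sized groups. It assumes a two-sided test and uses a normal approximation. The function name, the default significance level (0.05), and the default power (0.80) are illustrative conventions rather than recommendations from this article. Your own design may call for a different calculation.

```python
from math import ceil
from statistics import NormalDist


def participants_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate participants needed per group to compare two group means.

    Uses the normal approximation n = 2 * ((z_alpha + z_power) / d) ** 2,
    where d is the standardized difference you expect (Cohen's d).
    """
    z = NormalDist()  # standard normal distribution
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for a two-sided test
    z_power = z.inv_cdf(power)          # quantile matching the desired power
    return ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)


# Conventional "small", "medium", and "large" standardized effect sizes.
for d in (0.2, 0.5, 0.8):
    print(f"d = {d}: about {participants_per_group(d)} participants per group")
```

Under these assumptions, a medium-sized difference (d = 0.5) needs about 63 participants per group, and a small one (d = 0.2) needs several hundred. An exact calculation based on the t distribution gives slightly larger numbers. If your class is much smaller than the estimate, that is a sign to reconsider either the survey or the research question.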

Interviews

Interviews are characterized by real-time human interactions, usually between an interviewer (who asks specific questions from an interview protocol) and an interviewee. Some interviews have more structure to the questions, while others flow more like a conversation.

The primer on interviews covers why you should use interviews, the major kinds of interviews, how many you need, and some other common considerations.

Interviews are a great choice …

to probe how individuals/groups think about a concept or procedure
to probe how people reflect on a situation or experience
for validating potential survey questions

Interviews are a poor choice …

when you aren’t quite sure what you’re looking for
if you need to see what happens in classrooms
if there’s a big power differential between you and your participants

Other considerations about interviews

There are lots of different kinds of interviews, and you’ll want to mindfully choose the one(s) you want. Once you’ve chosen, plan to test and revise your interview protocol iteratively before you gather data. Generally, interview data are much richer than survey data, in large part because interviews involve real-time interactions between humans, so more information is available: timing, tone, etc.

Classroom artifacts

Classroom artifacts include anything the students produce as part of their work in the class, for example copies of their homework or exams, or pictures of their whiteboards. They also include artifacts that instructors generate as part of their work in the class, like slides or problem sets. They don’t include anything that researchers specifically ask students to do for research purposes only.

Classroom artifacts are a great choice…

because they are cheap and easy to collect
when you are curious about what happens in real classrooms
because they usually permit reanalysis for multiple projects
as an additional source of data for a larger project
when you want to explore new ideas

Classroom artifacts are a poor choice…

when you need to find out why students do something
if you want to test a new idea or strategy

Other considerations about artifacts

Artifacts are common in archival data, because they are so easy to collect and store. Because they pile up easily and are often analyzed long after they were collected, metadata is very important in understanding artifacts. Always put in more metadata than you think you need.
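One way to keep metadata attached to an artifact is to save a small "sidecar" record next to each file. The Python sketch below is only an illustration. The `ArtifactMetadata` class, its field names, and every value in the example are hypothetical placeholders. The fields mirror the kinds of metadata named in this article: where, when, and from whom the data come, the instructional context, and the ethics approval.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class ArtifactMetadata:
    # All field names are hypothetical; adapt them to your own project.
    file_name: str
    collected_where: str
    collected_when: str
    collected_from: str
    instructional_context: str
    ethics_approval: str
    notes: str = ""


def to_sidecar_json(meta):
    """Serialize one metadata record as JSON text, ready to save beside the artifact."""
    return json.dumps(asdict(meta), indent=2)


# Placeholder values only; none of these describe real data.
example = ArtifactMetadata(
    file_name="whiteboard_photo_001.jpg",
    collected_where="PLACEHOLDER: course and section",
    collected_when="PLACEHOLDER: date and class meeting",
    collected_from="PLACEHOLDER: group or participant code",
    instructional_context="PLACEHOLDER: activity the students were working on",
    ethics_approval="PLACEHOLDER: protocol identifier",
)
print(to_sidecar_json(example))
```

A plain-text format like JSON is a deliberate choice here: it stays readable years later, even without the software that created it.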

You will see awesome things and then not be able to follow up with those students. Students will write intriguing or complicated or illegible answers. By the time you analyze the data, the moment has passed. Even if you can track the students down, they might not remember what they were thinking or why they wrote what they did. This problem gets worse as your data age. It’s particularly salient for instructors who are doing research in their own classes: you cannot analyze these data until the semester is over, and your dual role as instructor and researcher increases the ethical complications of trying to track down participants later.

Ethics board approval for artifacts can be tricky to navigate. IRBs differ widely in how they interpret the regulations which govern the collection of classroom artifacts, especially if those artifacts include student work. Some IRBs treat this as an easy case to handle; others are very concerned about privacy issues or consent issues. After-the-fact permission for student work is ethically fraught: if your students were unaware that you wanted to use their work for research purposes, then they did not have a chance to decline to participate in your research project, and that might look like coercion. In contrast, artifacts which don’t include student work, like syllabi or assignment sheets, don’t require students’ permission to use. Depending on your research questions and how public that information is, you may still need approval from your IRB and consent from the faculty who wrote them. When in doubt, talk to your IRB before you start research.

Classroom observations

Classroom observations are a great choice when you want to see what real students and teachers are really doing in real classrooms. Broadly speaking, collecting observational classroom data means recording what happens in a classroom. Video is the gold standard for recording, but it’s also possible (and sometimes desirable) to collect only audio data, or only field notes from observers. In online or computer-based contexts, classroom observations usually take the form of screen recordings or Zoom recordings; things like students’ written work are more often considered artifacts.

Classroom observations are a great choice…

because the data are rich and multifaceted.
when you need to see what “really happens”
if you’re interested in how people interact with each other or with equipment
early in a research project, to help you generate new ideas

Classroom observations are a poor choice…

when you want to be able to probe participants’ ideas in the moment
if you cannot also collect classroom artifacts to understand what they’re doing
for research questions that depend on controlled conditions or individuals acting alone.

Other considerations with classroom observations

Some observations are highly unstructured and emergent
- This is more common in the early stages of a research project, to help you generate new ideas and questions
- Video data is commonly used to support repeated viewings of the same interaction to gain insight. Video data is the gold standard because it is so rich.
Some observations are highly structured and deterministic
- There are observation protocols which can be used in live classrooms or with classroom video.
- They are more common in situations where video recordings are difficult to obtain (e.g. with children or in the EU), or where large quantities of classroom time need to be analyzed to find patterns.

Archival videos of classroom observations are quite common, but the ethical considerations of how to analyze the video are complicated.

Institutional data

Institutional data, sometimes known as “registrar data”, is data collected by your institution about the students who are enrolled. It includes things like which students registered for which courses in which semesters; their majors, grades, ID numbers, and class (first year, sophomore, junior, senior, etc); and their personal and familial demographics like race, gender, or home zip code.

Course rosters are not research data

To use institutional data for research purposes, it must be covered under a relevant ethics approval and you must have permission from your institution to use it specifically for this purpose. You cannot use old course rosters as research data unless (and until) an ethics board and your institution’s data office have approved this use for those records specifically.

Institutional data are a great choice …

if you are curious about quantitative trends
to enrich classroom artifacts or observational data
if your research questions are about correlations between enrollment and student outcomes

Institutional data are a poor choice …

when you have few participants
if your research questions are about processes or mechanisms
if your institution will not release them to you

Other considerations about institutional data

Institutional data is almost always quantitative in nature: spreadsheets and databases of information about students and their enrollments. While this is information about individuals, it’s very rarely used in research that focuses on telling qualitative stories of individuals.

Institutional data are often controlled by a central office at your institution. This office – separate from your IRB – may have concerns about releasing these data to you, and may only release anonymized data. If you plan to use this information in your research study, you will want to talk to your institutional data office early about what kinds of data they collect, and under what circumstances they’re willing to release them.

Reflections

Reflections include writing that researchers generate to systematically document or understand a setting or the people in it. They can include your generative writing or reflective memoing. The central feature of this data type is that you, the researcher, generate this data.

In contrast, analyzing reflections written by your research participants (either as part of a free-response survey or as assignments for their classes) is another data type – probably survey or classroom artifact.

Reflections are a great choice…

if you want to describe a curriculum or its implementation
if you have a dual role as an instructor and as a researcher, and your research is about your own growth in teaching
as one of several data streams, to help contextualize them or tie them together

Reflections are a poor choice…

as a sole stream of data

Archival data

Archival data are research data which already exist.

Classroom records are not archival data

For a data set to be archival research data, it must have been collected as research data and (still) covered under a relevant ethics approval. You cannot use old classroom records as research data unless (and until) an ethics board has approved this use for those records specifically.

If you happen to have a box of old exams in the back of the file cabinet, it is not archival research data. But, depending on the terms of your prior ethics board approvals, it may be possible for you to analyze old research data for new or related research questions.

Archival data are a great choice…

when you want to explore ideas and possibilities
when you need to write a paper quickly
if your research question is about development over a long time
if you can’t access new data
if you want to make arguments about baseline vs. change

Archival data are a poor choice…

when your research questions are very rigid or specific
when they are very poor quality or missing crucial metadata
if the IRB approval which covered them has expired

Other considerations with archival data

Can I use old exams / homeworks / etc as data?
- This is an IRB question.
- If they aren’t already research data, then probably not.
You get what you get.
- Sometimes, the data are not well-aligned to your research questions.
- You cannot ask follow-up questions of the same participants.
Is it ok to write new papers if parts of the data have already been analyzed?
- Emphatically: yes. Articulate what is different in your new analyses, and cite the other papers.
Who has some archival data that I can work with?
- Lots of people! Start by emailing the authors of recent studies that you find interesting.
The older the data are, the harder it is to remember what’s going on.
- Metadata is enormously important when building a catalog of archival data or maintaining a data library. You’re going to lose data because files get corrupted or whatever, but it turns out that’s not the worst. The worst is when you still have some data files and you have no idea what’s in them or what the students are doing or when the data were taken or what the answers correspond to.

In the US, research on humans needs to be reviewed for ethical considerations by each institution’s Institutional Review Board (“IRB”). It’s pretty common to use “IRB” and “ethics board” interchangeably.

Other considerations: Small sample sizes

Sometimes new researchers are worried that their classes are small or they don’t have the resources to collect and analyze data from a lot of participants. Usually, when they’re worried about this, it’s because they’re implicitly thinking that rigorous studies require statistics, and good statistics require a lot of data points.

Some research is great for small sample sizes

Very rich data or many observations of the same people
case studies or exemplars
claims about existence

Some research is not

Impoverished data (e.g. surveys, homework responses)
claims about prevalence

My papers include a study of 50k students over 20 years, and a close analysis of 37 seconds of video in one group where only two people talk. Sample size matters less than matching your data to your research question.

Examples

These example exercises and discussion prompts might help you think about how to coordinate research questions and data streams.

In each of these scenarios, working out which kinds of data to collect, or how much data you can realistically get, may suggest that you should adjust your research questions to better match your access to data. This research design process of iteratively refining your research questions and data collection plans is important and central to my work as a researcher.

Case study: Maria and Computational Thinking

Physics graduates need computational skills, but Maria’s departmental curriculum doesn’t cover computation. She would like to add training in computational skills, but her department is unwilling to require a new course for everyone.

What kinds of data could Maria collect to show that her homework and activities are helping students learn?
What kinds of data could she collect to suggest how to make changes to her materials to improve student learning?

Maria’s project could be a great example of a design-based research project.

Case study: Kai and Graduating Seniors

Kai worries that their graduating seniors aren’t well prepared for professional life after graduation. Their department has asked them to make recommendations about how the department could better support students in planning for post-grad life.

What kinds of data could Kai collect to
- investigate how current students think about professional life after graduation?
- figure out what their alumni do?
- suggest how the department could better prepare students to think about this topic?

Case study: Arthur and Baseline Data

Arthur’s department teaches an introductory class with 8 different instructors and 10 different GTAs. Many students are struggling with the mathematical formalism, so the department is going to revise the curriculum to support students’ math skills and problem solving. Arthur plans to collect some baseline data so that they will know if the changes are working.

What kinds of data could Arthur collect? Be specific, e.g.:
- not just “surveys”: from whom? how frequently? on which topics?
- not just “registrar data”: what pieces of information?
- not just “classroom artifacts”: which ones? on which topics?
Some baseline data will come from the upcoming year, before the department makes changes to their teaching. Other baseline data concerns semesters which have already passed. What kinds of retrospective data might be available ethically?
What do you think Arthur’s department (or Arthur) means by “if the changes are working”? Working for whom, and in what ways? Think about what kinds of research questions Arthur could pursue, and how those are supported (or not) by each of the data types he could collect.

Arthur’s project could be a great example of a “getting bigger” iterative design.

History

This article was first written on August 1, 2022, and last modified on July 18, 2024.

Citation

For attribution, please cite this work as:

Sayre, Eleanor C. 2022. “Data and Access.” In Research: A Practical Handbook. https://handbook.zaposa.com/articles/data-and-access/.