# Data collection

Ronny Gunnarsson. Data collection [in Science Network TV]. Available at: https://science-network.tv/data-collection/. Accessed June 16, 2024.
This web page provides a brief overview of different ways of collecting observations / data for your study. It is very important to understand the difference between these options when planning a study.

# Data collection in studies using a quantitative approach

Collecting observations is part of what we call data collection. However, data collection starts long before that, while you are planning your study. It involves making a few definitions and decisions before actually collecting the data:

1. Define the population of interest
2. Define the sampling frame
3. Decide the sampling method
4. Decide the inclusion and exclusion criteria
5. Decide what type of data should be collected
6. Decide the sample size
7. Plan practicalities around data collection
8. Perform data collection

## Defining the population of interest

If your project involves humans, the participants in your study can be seen as a sample taken from an underlying population. The results from your project can hopefully be generalised to this underlying population. You must be able to describe the population for which you expect your results to be valid. An example of a population of interest is “all Caucasian women in the age group 40-70 years with diabetes mellitus type II living in a developed country”. You cannot investigate all these people, so you will take a small sample of them and hope that your sample is representative of your population of interest.

## Defining sampling frame

The sampling frame consists of those members of your population of interest who, for practical reasons, are eligible for inclusion. An example could be “all Caucasian women in the age group 40-70 years with diabetes mellitus type II known to the primary health care centre or hospital in XX town”. You will rarely include the whole sampling frame, just a sample of it.
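The step from population of interest to sampling frame can be sketched as a simple filter. This is only an illustration: the `Person` record, its field names, and the `known_to_local_care` flag are assumptions made for the example, and the population definition is simplified to sex, age and diagnosis.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Person:
    person_id: int
    sex: str                   # "F" or "M"
    age: int                   # years
    has_type2_diabetes: bool
    known_to_local_care: bool  # hypothetical flag: known to the local centre or hospital


def in_population_of_interest(p: Person) -> bool:
    """Simplified example population: women aged 40-70 with type II diabetes."""
    return p.sex == "F" and 40 <= p.age <= 70 and p.has_type2_diabetes


def build_sampling_frame(people: list[Person]) -> list[Person]:
    """Keep members of the population of interest who can practically be reached."""
    return [p for p in people
            if in_population_of_interest(p) and p.known_to_local_care]


people = [
    Person(1, "F", 55, True, True),    # in the population and reachable
    Person(2, "F", 62, True, False),   # in the population but not reachable
    Person(3, "M", 50, True, True),    # outside the population of interest
    Person(4, "F", 35, True, True),    # outside the age range
]
frame = build_sampling_frame(people)
print([p.person_id for p in frame])
```

Person 2 belongs to the population of interest but not to the sampling frame, which is exactly the gap between the two concepts.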

## Deciding sampling technique

There are two main approaches to sampling: non-probability sampling and probability sampling. In probability sampling, each individual’s probability of being chosen for the study is known in advance. In non-probability sampling, it is unknown.

### Non-Probability sampling

• Convenience sampling = Grab what you have at hand
• Snowball sampling
• Quota sampling (not truly stratified)
• Consecutive sampling (grab observations in the order they appear)
• Purposeful sampling (purposefully pick individuals to get a reasonable dispersion / variation with respect to age, gender, experience of the phenomena of interest, etc.)

Each individual’s probability of being recruited to the study is unknown in all of the above sampling techniques, and this may increase the risk of bias.
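Two of the non-probability techniques listed above, consecutive sampling and quota sampling, can be sketched in a few lines. The function names, the `(id, sex)` tuples and the quotas are assumptions chosen for illustration. Note that neither function involves any randomness: who ends up in the sample depends entirely on the order of arrival.

```python
from itertools import islice


def consecutive_sample(arrivals, n):
    """Take the first n subjects in the order they appear."""
    return list(islice(arrivals, n))


def quota_sample(arrivals, group_of, quotas):
    """Accept subjects in arrival order until each group's quota is filled."""
    counts = {group: 0 for group in quotas}
    chosen = []
    for subject in arrivals:
        group = group_of(subject)
        if group in counts and counts[group] < quotas[group]:
            chosen.append(subject)
            counts[group] += 1
    return chosen


# Hypothetical arrival order at a clinic: (subject id, sex).
arrivals = [("p1", "F"), ("p2", "F"), ("p3", "M"),
            ("p4", "F"), ("p5", "M"), ("p6", "M")]

first_three = consecutive_sample(arrivals, 3)
two_of_each = quota_sample(arrivals, group_of=lambda s: s[1],
                           quotas={"F": 2, "M": 2})
print(first_three)
print(two_of_each)
```

The quota sample has the same group sizes a stratified sample might have, but the individuals within each group were not drawn at random, which is why quota sampling is not truly stratified.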

### Probability (random) sampling

• Simple random sampling
• Systematic sampling
• Stratified sampling
• Cluster sampling

Each individual’s probability of being recruited to the study can be calculated before data collection commences in all of the above sampling techniques, and this is likely to reduce the risk of bias.
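A minimal sketch of the four probability sampling techniques, using only Python's standard `random` module. The function names, the numeric sampling frame and the clinic names are assumptions for illustration. In each case the inclusion probability is known before any data are collected: `n / N` for simple random and systematic sampling, `n_per_stratum / stratum size` for stratified sampling, and `n_clusters / number of clusters` for cluster sampling.

```python
import random


def simple_random_sample(frame, n, rng):
    """Draw n units without replacement; every unit has probability n / len(frame)."""
    return rng.sample(frame, n)


def systematic_sample(frame, n, rng):
    """Pick a random start, then take every k-th unit."""
    if n <= 0 or len(frame) % n != 0:
        # Simplification of this sketch: keeps inclusion probabilities equal.
        raise ValueError("len(frame) must be a positive multiple of n")
    step = len(frame) // n
    start = rng.randrange(step)
    return frame[start::step]


def stratified_sample(frame, key, n_per_stratum, rng):
    """Split the frame into strata, then draw a simple random sample in each."""
    strata = {}
    for unit in frame:
        strata.setdefault(key(unit), []).append(unit)
    sample = []
    for label in sorted(strata):
        sample.extend(rng.sample(strata[label], n_per_stratum))
    return sample


def cluster_sample(clusters, n_clusters, rng):
    """Draw whole clusters at random and include every unit in them."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [unit for name in chosen for unit in clusters[name]]


rng = random.Random(2024)          # fixed seed so the draw can be reproduced
frame = list(range(1, 101))        # hypothetical sampling frame of 100 ids

srs = simple_random_sample(frame, 10, rng)
systematic = systematic_sample(frame, 10, rng)
stratified = stratified_sample(
    frame, key=lambda i: "odd" if i % 2 else "even", n_per_stratum=5, rng=rng)
clusters = {"clinic_a": [1, 2, 3], "clinic_b": [4, 5], "clinic_c": [6, 7, 8, 9]}
clustered = cluster_sample(clusters, 2, rng)
```

Fixing the seed is a design choice for the example only; it makes the selection reproducible and auditable, which is often useful when documenting how a sample was drawn.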

### Recommended sampling techniques

• Most non-probability sampling methods are OK for a pilot study estimating feasibility before a larger randomised controlled trial (RCT) is done.
• Consecutive sampling is often OK for an early phase I or phase II RCT to show whether there is any kind of effect.
• Some kind of probability sampling is desirable for a large phase III RCT proving effect in the clinical situation. However, most phase III and IV trials use consecutive sampling, which is a non-probability sampling method.
• Probability sampling is required for any observational study trying to clarify associations between different phenomena.
• Purposeful sampling is the preferred option in empiric-holistic (qualitative) studies to ensure enough variation in your observations.

(More description of the different sampling techniques will come)

## Eligibility criteria (inclusion and exclusion criteria)

Eligibility criteria are criteria used to identify subjects suitable to be included and to remain included. There are two separate uses of inclusion and exclusion criteria:

A: Your subjects need to fulfill ALL inclusion criteria to be eligible and included. Some of these criteria may be the absence of pregnancy, dementia, end stage renal disease or other coexisting conditions that make them unsuitable to participate. Exclusion criteria are applied later to determine whether subjects who were previously included should be excluded. Hence, patients are included solely on the basis of inclusion criteria, and exclusion criteria are applied later. Under this use of eligibility criteria, cross-sectional studies, where all data collection is done on a single occasion, do not have exclusion criteria; only longitudinal studies do.

B: Both inclusion and exclusion criteria are used initially to decide whether a participant should be included. Inclusion criteria are usually broader, while the exclusion criteria usually deal with specific factors (like comorbidities) that can mask a treatment effect, or aim to identify participants unlikely to adhere to the study protocol.

It is a common misconception that exclusion criteria are a mirror of the inclusion criteria. A common example of this might be that being female is one of the inclusion criteria and subsequently male gender is stated as an exclusion criterion. However, males were never included in the first place because they did not meet the inclusion criteria. Hence, there is no need to exclude them.
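The logic of eligibility criteria can be sketched as predicates applied to a subject record. Everything here is hypothetical: the dictionary keys, the specific criteria, and the helper names are chosen only to illustrate use B (both sets of criteria applied at enrolment) and why a mirrored exclusion criterion changes nothing.

```python
def meets_all(subject, criteria):
    """True if the subject fulfils every criterion in the list."""
    return all(criterion(subject) for criterion in criteria)


def meets_any(subject, criteria):
    """True if the subject fulfils at least one criterion in the list."""
    return any(criterion(subject) for criterion in criteria)


def eligible_use_b(subject, inclusion, exclusion):
    """Use B: include if ALL inclusion criteria and NO exclusion criterion apply."""
    return meets_all(subject, inclusion) and not meets_any(subject, exclusion)


# Hypothetical criteria, for illustration only.
inclusion = [
    lambda s: s["sex"] == "F",
    lambda s: 40 <= s["age"] <= 70,
]
exclusion = [
    lambda s: s["pregnant"],
    lambda s: s["dementia"],
]
# The "mirror" misconception: excluding males although only females are included.
exclusion_with_mirror = exclusion + [lambda s: s["sex"] == "M"]

subjects = [
    {"id": "anna", "sex": "F", "age": 52, "pregnant": False, "dementia": False},
    {"id": "bo",   "sex": "M", "age": 52, "pregnant": False, "dementia": False},
    {"id": "cara", "sex": "F", "age": 45, "pregnant": True,  "dementia": False},
]

without_mirror = [eligible_use_b(s, inclusion, exclusion) for s in subjects]
with_mirror = [eligible_use_b(s, inclusion, exclusion_with_mirror) for s in subjects]
print(without_mirror == with_mirror)
```

Both lists are identical because the male subject already fails the inclusion criteria, so the mirrored exclusion criterion never has any effect.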

## Decide what type of data should be collected

We use the label “variable” for a specific type of observation. Examples of variables might be age, gender, presence of high blood pressure, etc. These variables have two functions that are useful when you have your results:

1. Describing what kind of observations / patients were included in your study. This is labelled descriptive statistics and tells the readers if your result may be applicable to their situation.
2. Used as the basis to draw conclusions. This is labelled inferential statistics.

Many variables are used for both descriptive and inferential statistics. Variables used for inferential statistics should be submitted to sample size calculations (see below). Sometimes the sample size calculation may show that one variable requires an unreasonably high number of observations / patients. In that scenario this variable might be dropped completely, or it might be kept solely for descriptive statistics. There is usually an interplay between the preliminary list of desired variables and the sample size calculation before you end up with the final list of variables intended for descriptive and / or inferential statistics. The type of data to be collected can be:

1. Direct measurements (such as measurements of the body and its chemistry, body reactions)
2. Indirect measurements of knowledge, attitudes or perceptions using surveys or structured interviews
   1. Binary questions (Yes/No)
   2. Surveys measuring attitudes or perceptions (Likert scale, Visual analogue scale or similar)
   3. Surveys with other fixed response alternatives
3. Structured observations
   1. Structured observations of behavior
   2. Structured observations of events or processes

## Deciding sample size

When using a quantitative approach, it is important to do a sample size calculation for the variables intended to be used for inferential statistics, at least for the primary research questions. This involves making some assumptions and decisions. Please read the web page on sample size estimation for detailed information.
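As a rough illustration of the kind of assumptions involved, the sketch below uses one common normal-approximation formula for comparing the means of two equally sized groups: n per group = 2 · ((z₁₋α/₂ + z_power) · σ / δ)². The formula choice, the function name and the example numbers are assumptions made here, not taken from the sample size page, and a real study should use a method matched to its design.

```python
from math import ceil
from statistics import NormalDist


def n_per_group_two_means(delta, sigma, alpha=0.05, power=0.80):
    """Approximate sample size per group for a two-sided comparison of two means.

    delta: smallest difference in means worth detecting (an assumption you make)
    sigma: expected standard deviation in each group (an assumption you make)
    alpha: significance level (a decision you make)
    power: desired probability of detecting delta if it exists (a decision you make)
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return ceil(2 * ((z_alpha + z_power) * sigma / delta) ** 2)


# Hypothetical example: detect a 5 unit difference when the SD is 10 units.
print(n_per_group_two_means(delta=5, sigma=10))    # 63 per group
# Doubling the difference worth detecting cuts the requirement to about a quarter.
print(n_per_group_two_means(delta=10, sigma=10))   # 16 per group
```

The comparison of the two calls shows why a variable with a small expected difference can demand an unreasonably large sample.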

## Plan practicalities around data collection

(This section is still under construction. Sorry for the inconvenience.)

## Perform data collection

(This section is still under construction. Sorry for the inconvenience.)

# Data collection in studies using a qualitative approach

## Defining criteria for selecting participants

Purposeful sampling is usually the best choice in studies using a qualitative approach. This means purposefully picking individuals to get a reasonable dispersion / variation with respect to age, gender, experience of the phenomena of interest, etc. Random sampling techniques are almost always the wrong choice here.

(This section is still under construction. Sorry for the inconvenience.)

## Decide data collection technique

1. Interviews with one person at a time
   1. Open (unstructured) interviews
   2. Partly open (semi-structured) interviews
2. Interviews and discussions in group = focus groups
3. Documents
   1. Diaries
   2. Written stories
   3. Fiction / Poetry
4. Open (unstructured) observations
   1. Non participatory observations
      1. Non participatory hidden observations (one way mirror or hidden video cameras)
      2. Non participatory disclosed observations (sitting observing or disclosed video camera)
   2. Participatory observations
      1. Participatory hidden observations (Günter Wallraff)
      2. Participatory disclosed observations (common in ethnography, grounded theory and social anthropology)

# References

1. Eligibility criteria [Internet]. [cited 2019 Oct 10]. Available from: https://www.spirit-statement.org/eligibility-criteria/