Chemometrics and related mathematical methods have a long history in the early AI (artificial intelligence) literature (e.g., Hemmer, 2007), and many techniques are readily accessible through data analysis packages. One way to categorize a subset of supervised and unsupervised learning methods in tabular form is shown below (from Takahama et al., 2011):
| Result | Supervised | Unsupervised |
|---|---|---|
| continuous coefficients | regression | PCA / exploratory factor analysis |
| discrete membership | classification | cluster analysis |
Any number of methods can be used together - e.g., baseline correction \(\to\) PCA \(\to\) cluster analysis to obtain spectral groups (a priori baseline correction is required for most methods).
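A minimal sketch of such a chain, using a synthetic matrix standing in for baseline-corrected spectra:

set.seed(1)
M <- matrix(rnorm(50 * 200), 50, 200)        # 50 "spectra" x 200 wavenumbers
scores <- prcomp(M, center=TRUE)$x[, 1:10]   # PCA for dimension reduction
groups <- cutree(hclust(dist(scores)), k=5)  # cluster analysis on the scores
table(groups)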
Examples are shown for several pattern recognition and dimension reduction techniques.
The theory behind individual statistical methods can be reviewed in the referenced literature; this vignette focuses on illustrating their application to FTIR spectra.
Chemometric techniques (e.g., Kowalski, 1984; Brereton, 2018; Geladi, 2003; Massart et al., 1988; Wold and Sjöström, 1998) are typically built around vectors and spectra matrices, which are well-represented by `matrix` objects in R. Several ‘meta-packages’ [e.g., `caret` (Kuhn and Johnson, 2013), `tidymodels`, and `mlr` (Bischl et al., 2016)] consolidate machine learning (ML) methods around application to `data.frame` objects. `Spec` objects can easily be converted to the required object class to access the large suite of available operations without wrapping them in separate `Spec` methods. Note that the landscape of meta-packages is rapidly changing (e.g., the author of `caret` is now developing `tidymodels`, and the developers of `mlr` are now working on `mlr3`), but it is worthwhile following this space for new developments.
library(APRLspec)
library(APRLssb)
library(RJSONIO)
library(tidyverse)
library(gridExtra)
library(reshape2)
specbl <- ReadSpec(file.path("data", "IMPROVE2011_spec_bl.rds"))
The CO\(_2\) peak arises from the air in the FTIR analysis chamber and is decoupled from aerosol chemical composition, so it is removed using spline interpolation (Takahama et al., 2013). This procedure makes a difference mostly for low-concentration samples.
X <- Align(Spec(do.call(rbind, SpecApply(specbl, SpecFUN(RemCO2))), Wavenumbers(specbl)),
Wavenumbers(specbl))
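The packaged `RemCO2` encapsulates this procedure; as a rough illustration of the underlying idea (not the actual implementation, and with assumed band limits of 2280-2400 cm\(^{-1}\)), one can interpolate across the CO\(_2\) band with a cubic spline fit to the surrounding points:

rm_co2_sketch <- function(wn, y, lo=2280, hi=2400) {
  gap <- wn >= lo & wn <= hi                           # points inside the CO2 band
  y[gap] <- spline(wn[!gap], y[!gap], xout=wn[gap])$y  # interpolate across the gap
  y
}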
set.seed(222)
ix <- sample(rownames(X), 5) # 5 random samples selected for illustration
full_join(
melt(specbl[ix,]) %>% mutate(type="original"),
melt(X[ix,]) %>% mutate(type="CO2removed")
) %>% mutate(type = factor(type, c("original", "CO2removed"))) %>%
ggplot +
facet_grid(type~.)+
geom_line(aes(wavenum, spec, group=sample), color=gray(.5))+
scale_x_reverse()+
theme_bw()
## Principal component analysis

PCA is a common first step for regression, clustering, and classification (e.g., Martens and Næs, 1991; Jackson, 2004; Jolliffe and Cadima, 2016). The decomposition of a column-centered matrix is typically written as \(\mathbf{X} = \mathbf{T}\mathbf{P}^T\), where \(\mathbf{T}\) contains the scores and \(\mathbf{P}\) the loadings. By default, `prcomp` centers each column of the matrix by its mean but does not scale (normalize) it prior to decomposition; this is a standard convention in chemometric analysis.
We will examine up to the first \(k=100\) components.
k <- 100
pcc <- prcomp(X, center=TRUE, scale=FALSE, rank=k)
T <- pcc$x # scores
P <- pcc$rotation # loadings
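As a quick check on this convention, the product of the scores and transposed loadings reproduces the column-centered matrix up to the truncation at \(k\) components (a minimal sketch, assuming `X` coerces to a plain matrix):

Xc <- scale(as.matrix(X), center=TRUE, scale=FALSE)  # column-centered spectra
max(abs(Xc - T %*% t(P)))  # residual reflects variance beyond the first k components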
Visualize component profiles (first 5 only).
ggplot(melt(Spec(t(P)[1:5,], Wavenumbers(X))))+
geom_line(aes(wavenum, spec, color=sample))+
scale_x_reverse()+
labs(x=expression(Wavenumber~(cm^-1)), y="Absorbance")+
theme_bw()
PCA is useful for exploring the number of dimensions required to represent a large fraction of the variance in the spectra matrix. In addition to the conventional scree plot (`screeplot` / `plot`), the cumulative proportion of variance explained can be visualized as a function of the number of components.
plot(seq(k), summary(pcc)$importance["Cumulative Proportion",seq(k)],
type="o", xlim=c(1, 10), ylim=c(0, 1), ylab="Cumulative Explained Variation")
Biplots can be used to visualize groups of samples that can be differentiated using two dimensions (the first two shown below).
biplot(pcc, cex=.5)
## Factor analysis
Exploratory factor analysis for spectroscopy can be found under different monikers: multivariate curve resolution (MCR) (de Juan et al., 2014), non-negative matrix factorization (NMF) (Donoho and Stodden, 2004; Lee and Seung, 1999, 2001), and positive matrix factorization (PMF) (Brown et al., 2015; Paatero et al., 2014; Paatero, 1994; Ulbrich et al., 2009).
NMF finds the following decomposition of a spectra matrix \(\mathbf{V}\): \(\mathbf{V} = \mathbf{W} \mathbf{H} + \mathbf{N}\), which is typically written as \(\mathbf{X} = \mathbf{G} \mathbf{F}\) in the PMF literature. Several R packages implement NMF, including NMF (Gaujoux and Seoighe, 2010) and NNLM.
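For intuition, a minimal sketch of the multiplicative-update algorithm of Lee and Seung (2001) for the Frobenius-norm objective is given below; this is for illustration only, and the NMF package is used for the actual decomposition that follows.

## minimal multiplicative updates minimizing ||V - WH||_F^2 (Lee and Seung, 2001)
nmf_mu <- function(V, r, niter=200, eps=1e-9) {
  W <- matrix(runif(nrow(V)*r), nrow(V), r)  # random nonnegative initialization
  H <- matrix(runif(r*ncol(V)), r, ncol(V))
  for (i in seq_len(niter)) {
    H <- H * (t(W) %*% V) / (t(W) %*% W %*% H + eps)  # update H with W fixed
    W <- W * (V %*% t(H)) / (W %*% H %*% t(H) + eps)  # update W with H fixed
  }
  list(W=W, H=H)
}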
library(NMF)
V <- as.matrix(replace(X, X < 0, 0)) # necessary for NMF
Perform decomposition for 3 factors.
nFac <- 3
res <- suppressWarnings(nmf(V, rank=nFac, method="lee", seed="nndsvd"))
W <- res@fit@W
H <- res@fit@H
N <- V - W %*% H
colnames(W) <- rownames(H) <- GenLabel("Factor", seq(ncol(W)))
Evaluate correlations among factors - to the extent possible, solutions with “low” factor correlations (Russell et al., 2009), particularly among factor contributions (columns of \(\mathbf{W}\)), are preferred, as the chances of artificially splitting factors (Ulbrich et al., 2009) or interpreting redundant factors are reduced.
rpairs <- full_join(
melt(cor(W), value.name="rW"),
melt(cor(t(H)), value.name="rH")
) %>%
filter(unclass(Var1) < unclass(Var2)) %>%
mutate(pair = paste(sep="-", Var1, Var2))
ggplot(rpairs)+
geom_point(aes(rW, rH, color=pair), size=3)+
geom_vline(xintercept=0, linetype=2, color=8)+
geom_hline(yintercept=0, linetype=2, color=8)+
lims(x=c(-1, 1), y=c(-1, 1))+
labs(x="correlation between columns of W", y="correlation between rows of H")+
theme_bw()
Visualize factor components (rows of \(\mathbf{H}\)). Spectral features can be noted (e.g., magnitude of carbonyl peak) and overall profiles can be compared against factors found in previous studies (Corrigan et al., 2013; Russell et al., 2011).
ggplot(melt(Spec(H, Wavenumbers(X))))+
geom_line(aes(wavenum, spec, color=sample))+
scale_x_reverse()+
labs(x=expression(Wavenumber~(cm^-1)), y="Absorbance")+
theme_bw()
Calculate and visualize explained variation (Paatero, 2002).
kseq <- setNames(seq(ncol(W)), colnames(W))
vh <- function(k) abs(W[,k,drop=FALSE] %*% H[k,,drop=FALSE])
vhcumul <- function(kseq) Reduce(`+`, lapply(kseq, vh))
evik <- sapply(kseq, compose(rowSums, vh)) / rowSums(vhcumul(kseq) + abs(N))
evkj <- t(sapply(kseq, compose(colSums, vh)) / colSums(vhcumul(kseq) + abs(N)))
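As a sanity check, the explained variations for each sample should sum to less than one across factors, with the shortfall attributable to the residual matrix \(\mathbf{N}\):

range(rowSums(evik))  # < 1; remainder is residual (unexplained) variation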
Visualize explained variation.
coltheme <- colorRampPalette(c("darkblue", "orange"))(max(kseq))
mytheme <- list(scale_y_continuous("Explained variation", limits=c(0, 1), expand=c(0, 0)),
scale_fill_manual(values=rev(coltheme)),
theme(panel.background = element_rect(fill = "black"),
panel.grid = element_blank()))
OrderComp <- function(x) mutate(x, comp = factor(comp, rev(names(kseq))))
gg1 <- ggplot(melt(evik, c("sample", "comp")) %>%
mutate(sample = factor(sample, rownames(evik)[order(rowSums(evik))])) %>%
OrderComp)+
geom_col(aes(sample, value, fill=comp))+
scale_x_discrete("Sample", expand=c(0, 0))+
mytheme+
theme(axis.text.x = element_text(angle=45, hjust=1, size=1))
gg2 <- ggplot(melt(Spec(evkj, Wavenumbers(X)), c("comp", "iw")) %>% OrderComp)+
geom_col(aes(wavenum, spec, fill=comp))+
scale_x_reverse(expression("Wavenumber"~(cm^-1)), expand=c(0, 0))+
mytheme
grid.arrange(gg1, gg2, ncol=1)
In a true application beyond this illustration, initial values, number of factors, and rotational ambiguity should be explored (for which PMF provides a systematic mechanism) (Paatero et al., 2002; Ulbrich et al., 2009). Once such a series of solutions is obtained, correlations of factor contributions with elemental or molecular markers, airmass back trajectory analyses, etc., can be combined to provide interpretation beyond the spectroscopic features visible in the factor profiles (rows of \(\mathbf{H}\)).
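One simple check of sensitivity to initial values is to refit with different random initializations and compare the recovered profiles against the reference solution; a minimal sketch (the numeric `seed` argument to `nmf` sets the random initialization) is:

seeds <- c(1, 22, 333)
fits <- lapply(seeds, function(s) suppressWarnings(nmf(V, rank=nFac, method="lee", seed=s)))
## best absolute correlation of each reference profile (row of H) with the refit
## profiles; values near 1 suggest the profiles are stable across initializations
sapply(fits, function(f) apply(abs(cor(t(H), t(f@fit@H))), 1, max))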
## Cluster analysis

Clustering groups spectra of ambient samples by overall similarity, and several preprocessing considerations apply. For this illustration, we begin by removing regions mostly dominated by water vapor and CO\(_2\), and conservatively removing the 10% of samples with the lowest vector magnitudes.
## exclude regions affected by water vapor and CO2
reg.vap <- Wavenumbers(specbl) > 3700 | findInterval(Wavenumbers(specbl), c(1820, 2400))==1
X <- specbl[,!reg.vap]
## exclude the 10% of samples with lowest vector magnitude (low S/N or possible baseline correction error)
magnitude <- apply(X, 1, norm, "2")
X <- X[magnitude > quantile(magnitude, .1),]
We use a 20-component uncentered PCA (Cadima and Jolliffe, 2009).
k <- 20
pcu <- prcomp(X, center=FALSE, scale=FALSE, rank=k)
T <- pcu$x
P <- pcu$rotation
magnitude.scores <- apply(T, 1, norm, "2")
We can see in a few examples that the selected scores reconstruct the major features of the spectra with minor (high frequency) variations removed.
set.seed(222)
ix <- sample(rownames(T), 5)
full_join(
melt(specbl[ix,]) %>% mutate(type="original"),
melt(Spec(T[ix,] %*% t(P), Wavenumbers(X))) %>% mutate(type="reconstructed")
) %>%
ggplot+
facet_grid(type~.)+
geom_line(aes(wavenum, spec, group=sample), color=gray(.5))+
annotate("rect", xmin=3700, xmax=Inf, ymin=-Inf, ymax=Inf, alpha=.5, fill=gray(.5))+
annotate("rect", xmin=1800, xmax=2400, ymin=-Inf, ymax=Inf, alpha=.5, fill=gray(.5))+
annotate("rect", xmin=-Inf, xmax=1500, ymin=-Inf, ymax=Inf, alpha=.5, fill=gray(.5))+
scale_x_reverse()+
theme_bw()
The vector magnitudes of the spectra typically span orders of magnitude, as do atmospheric concentrations (Ott, 1994). The vector magnitudes of the scores closely track those of the original spectra.
par(mfcol=c(1, 2))
plot(ecdf(magnitude), pch=19)
abline(v=min(magnitude[rownames(X)]), lty=2, col="pink")
plot(magnitude[rownames(X)], magnitude.scores, pch=19)
abline(0, 1)
We use these magnitudes for sample-specific scaling (normalization).
Ts <- T / magnitude.scores
We use hierarchical clustering (Everitt et al., 2011) with Euclidean distance to determine spectral similarity; this approach has often been used for clustering aerosol FTIR spectra since Russell et al. (2009). It is also possible to use Mahalanobis distances (which involve autoscaling of the orthogonal PCA scores) to upweight contributions from components with smaller variances. However, Euclidean distance yields spectral clusters most similar to those obtained by clustering the full spectra, and is less sensitive to the number of components retained in PCA, so we use it here.
dd <- dist(Ts, method="euclidean")
hc <- hclust(dd)
plot(hc, cex=.2)
## rect.hclust(hc, k = 10, border = rainbow(10))
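For comparison, a Mahalanobis-like distance in the score space can be obtained by autoscaling each score column by its component standard deviation before computing Euclidean distances (a sketch using the `pcu` object above):

Tw <- sweep(T, 2, pcu$sdev[seq_len(k)], "/")  # whiten scores by component sdev
dd.maha <- dist(Tw, method="euclidean")       # Mahalanobis distance in the PCA subspace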
The optimal number of clusters depends on the application: for exploration, a larger number of clusters may be beneficial, whereas for numerical modeling (e.g., multilevel modeling) a numerical criterion based on the target variable can be applied. Without defining specific targets, we can select the number of clusters based on the representativeness of their constituent samples, the cleanliness of cluster separation, and so on. Silhouette analysis (Rousseeuw, 1987) is one method for assessing the extent to which the membership of a sample in a proposed cluster is sensible. The silhouette coefficient for sample \(i\), \(s_i = (b_i - a_i)/\max(a_i, b_i)\), where \(a_i\) is the mean distance to the other samples in its own cluster and \(b_i\) is the mean distance to the samples in the nearest neighboring cluster, ranges between -1 and 1. Selecting the number of clusters to minimize negative silhouette coefficients, or to maximize the overall mean silhouette coefficient, is one favored approach. The mean silhouette coefficient (`"si.avg"`) for a cluster indicates how well each sample, on average, is served by its parent cluster. The within-cluster sum of squares (`"wss"`) is another metric used to assess how well a cluster represents its members.
We explore numbers of clusters between 2 and 20; a mean silhouette coefficient of 0 for a cluster is an indicator of a one-sample cluster.
library(cluster) # for Silhouette coefficients
ksweep <- 2:20 # to explore
metrics <- plyr::ldply(ksweep, function(k, tree, m, dist) {
cl <- cutree(tree, k)
si <- silhouette(cl, dist)
spl <- split(names(cl), cl)
cl.wss <- sapply(spl, function(x, m) sum(scale(m[x,,drop=FALSE], scale=FALSE)^2), m)  # within-cluster sum of squared deviations from the centroid
data.frame(k = k,
cl = as.numeric(names(spl)),
cl.n = sapply(spl, length),
wss = cl.wss,
si.avg = sapply(split(si[,"sil_width"], si[,"cluster"]), mean))
}, hc, Ts, dd)
df <- metrics %>%
gather(metric, value,-c(k, cl, cl.n))
df.stat <- df %>%
group_by(metric, k) %>%
filter(cl.n > 1) %>%
summarize(value=weighted.mean(value, cl.n))
ggplot() +
facet_grid(metric~., scale="free_y")+
geom_point(data=df, aes(k, value))+
geom_line(data=df.stat, aes(k, value))+
geom_point(data=df.stat, aes(k, value), shape=21, size=3)+
scale_x_continuous(breaks=ksweep)+
theme_bw()
Selecting 12 clusters, we can see some samples with negative Silhouette coefficients, indicating that their membership in the proposed clusters may be contested (and may be more sensitive to clustering algorithm employed).
k <- 12
grps <- cutree(hc, k)
si <- silhouette(grps, dd)
plot(si)
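The samples with contested membership can be listed directly; the rows of the silhouette object are in the same order as `grps`, so (a minimal sketch):

contested <- names(grps)[si[,"sil_width"] < 0]  # samples with negative silhouette widths
length(contested)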
We can also visualize normalized spectra according to membership to the resulting clusters and assess whether they conform to a “spectroscopist’s judgment” about potential groupings.
ggplot(melt(specbl / magnitude) %>%
mutate(clust = factor(grps[sample])) %>%
filter(!is.na(clust)))+
facet_wrap(~clust)+
geom_line(aes(wavenum, spec, group=sample), color=gray(.5), alpha=.5)+
scale_x_reverse()+
labs(x=expression(Wavenumber~(cm^-1)), y="Absorbance")+
theme_bw()
Further analysis entails comparing functional group distributions, PMF/NMF factor contributions, and back trajectories (e.g., in conjunction with the potential source contribution function, PSCF; Malm et al., 1986; Zeng and Hopke, 1989) among clusters (e.g., Takahama et al., 2011).
In addition, any number of alternative clustering techniques (self-organizing maps, mixture models) are available to explore. Ultimately, the algorithm and number of clusters should be selected on the basis of providing interpretability or performance (e.g., in multilevel modeling; Weakley et al., 2018).
## Classification

After building up a database of spectra, “spectra types” that give an indication of chemical composition or sources can be determined from spectral profiles. In another application, one-class (unary) classifiers can be used to determine whether new samples are substantially different from “old” ones - e.g., those used for building a calibration model (Takahama et al., 2018). Model predictions for samples that differ substantially (“spectral outliers”) from those used for calibration should be treated with caution, as the model may not be well-calibrated for them.
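A minimal sketch of one simple unary check (not the specific method of Takahama et al., 2018) is to flag spectra whose Mahalanobis distance in the calibration PCA score space exceeds a chi-squared cutoff; `Xcal` and `Xnew` are hypothetical calibration and new spectra matrices.

flag_outliers <- function(Xcal, Xnew, ncomp=10, alpha=0.01) {
  pc <- prcomp(Xcal, center=TRUE, scale=FALSE, rank=ncomp)
  project <- function(X) scale(X, center=pc$center, scale=FALSE) %*% pc$rotation
  ## squared Mahalanobis distance in the score subspace (whitened scores)
  d2 <- rowSums(sweep(project(Xnew), 2, pc$sdev[seq_len(ncomp)], "/")^2)
  d2 > qchisq(1 - alpha, df=ncomp)  # cutoff assumes approximate normality of scores
}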
Considering the first classification problem of mapping new spectra to previously-established spectra types, we repartition the samples used for cluster analysis in the previous section into a training set and an evaluation (test) set, and apply \(k\)-nearest neighbor (\(k\)-NN) classification for illustration. We use as reference the cluster categories generated by hierarchical clustering (which, like \(k\)-NN, used Euclidean distance). We will use the `caret` package for building the model.
library(caret)
## create factor out of group indices
cl <- factor(sprintf("Class_%02d", grps))
## split data between training and test sets for this illustration
## categories with only 1 sample will be placed in the training set
set.seed(222)
ix.train <- createDataPartition(cl, p=.8, list=FALSE, times=1)
## Warning in createDataPartition(cl, p = 0.8, list = FALSE, times = 1): Some
## classes have a single record ( Class_12 ) and these will be selected for
## the sample
## 10-fold CV repeated 10 times
set.seed(222)
fitControl <- trainControl(method = "repeatedcv",
number = 10,
repeats = 10)
## train the model
knnFit <- train(Class ~ .,
data=data.frame(Class=factor(cl[ix.train]), Ts[ix.train,]),
method="knn",
trControl=fitControl,
tuneGrid=data.frame(k=2:14))
We can see that the optimal accuracy (fewest misclassified samples) is achieved from the consensus of the \(k=3\) nearest neighbors.
plot(knnFit)
The `predict` method for `train` objects selects the optimal parameter found during the cross-validation phase (i.e., \(k=3\)) to return predicted classes for the samples withheld from training. We can see that the overall accuracy is on the order of 95% (using the group labels determined by hierarchical clustering as the definitive reference).
knnPred <- predict(knnFit, newdata = Ts[-ix.train,])
confusionMatrix(knnPred, cl[-ix.train])
## Confusion Matrix and Statistics
##
## Reference
## Prediction Class_01 Class_02 Class_03 Class_04 Class_05 Class_06 Class_07
## Class_01 26 0 0 1 1 0 0
## Class_02 0 25 0 0 0 0 0
## Class_03 1 2 27 0 0 0 0
## Class_04 0 0 0 3 0 0 0
## Class_05 0 0 0 0 1 0 0
## Class_06 0 1 0 0 0 13 0
## Class_07 0 0 0 0 0 0 16
## Class_08 0 0 0 0 0 0 0
## Class_09 0 0 0 0 0 0 0
## Class_10 0 0 0 0 0 1 0
## Class_11 0 0 0 0 0 0 0
## Class_12 0 0 0 0 0 0 0
## Reference
## Prediction Class_08 Class_09 Class_10 Class_11 Class_12
## Class_01 0 0 0 0 0
## Class_02 0 0 0 0 0
## Class_03 0 0 0 0 0
## Class_04 0 0 0 0 0
## Class_05 0 0 0 0 0
## Class_06 0 0 0 0 0
## Class_07 0 0 0 0 0
## Class_08 7 0 0 0 0
## Class_09 0 0 0 0 0
## Class_10 0 0 4 0 0
## Class_11 0 0 0 0 0
## Class_12 0 0 0 0 0
##
## Overall Statistics
##
## Accuracy : 0.9457
## 95% CI : (0.8914, 0.9779)
## No Information Rate : 0.2171
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9348
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Class_01 Class: Class_02 Class: Class_03
## Sensitivity 0.9630 0.8929 1.0000
## Specificity 0.9804 1.0000 0.9706
## Pos Pred Value 0.9286 1.0000 0.9000
## Neg Pred Value 0.9901 0.9712 1.0000
## Prevalence 0.2093 0.2171 0.2093
## Detection Rate 0.2016 0.1938 0.2093
## Detection Prevalence 0.2171 0.1938 0.2326
## Balanced Accuracy 0.9717 0.9464 0.9853
## Class: Class_04 Class: Class_05 Class: Class_06
## Sensitivity 0.75000 0.500000 0.9286
## Specificity 1.00000 1.000000 0.9913
## Pos Pred Value 1.00000 1.000000 0.9286
## Neg Pred Value 0.99206 0.992188 0.9913
## Prevalence 0.03101 0.015504 0.1085
## Detection Rate 0.02326 0.007752 0.1008
## Detection Prevalence 0.02326 0.007752 0.1085
## Balanced Accuracy 0.87500 0.750000 0.9599
## Class: Class_07 Class: Class_08 Class: Class_09
## Sensitivity 1.000 1.00000 NA
## Specificity 1.000 1.00000 1
## Pos Pred Value 1.000 1.00000 NA
## Neg Pred Value 1.000 1.00000 NA
## Prevalence 0.124 0.05426 0
## Detection Rate 0.124 0.05426 0
## Detection Prevalence 0.124 0.05426 0
## Balanced Accuracy 1.000 1.00000 NA
## Class: Class_10 Class: Class_11 Class: Class_12
## Sensitivity 1.00000 NA NA
## Specificity 0.99200 1 1
## Pos Pred Value 0.80000 NA NA
## Neg Pred Value 1.00000 NA NA
## Prevalence 0.03101 0 0
## Detection Rate 0.03101 0 0
## Detection Prevalence 0.03876 0 0
## Balanced Accuracy 0.99600 NA NA
In addition to \(k\)-NN, any number of more advanced algorithms (partial least squares discriminant analysis, support vector machines, neural networks) are available to explore; there as well, the representativeness of the training set should be considered, much as in multivariate calibration.
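Within the same caret workflow, substituting another classifier amounts to changing the `method` argument; a sketch for partial least squares discriminant analysis (caret's "pls" method with a factor outcome) is:

set.seed(222)
plsFit <- train(Class ~ .,
                data=data.frame(Class=factor(cl[ix.train]), Ts[ix.train,]),
                method="pls",
                trControl=fitControl,
                tuneGrid=data.frame(ncomp=1:10))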
Bischl, B., Lang, M., Kotthoff, L., Schiffner, J., Richter, J., Studerus, E., Casalicchio, G. and Jones, Z. M.: mlr: Machine learning in R, Journal of Machine Learning Research, 17(170), 1–5 [online] Available from: http://jmlr.org/papers/v17/15-066.html, 2016.
Brereton, R. G.: Chemometrics: Data driven extraction for science, John Wiley & Sons Ltd, Chichester., 2018.
Brown, S. G., Eberly, S., Paatero, P. and Norris, G. A.: Methods for estimating uncertainty in PMF solutions: Examples with ambient air and water quality data and guidance on reporting PMF results, Science of The Total Environment, 518–519, 626–635, doi:10.1016/j.scitotenv.2015.01.022, 2015.
Cadima, J. and Jolliffe, I.: On relationships between uncentred and column-centred principal component analysis, Pakistan Journal of Statistics, 25(4), 473–503, 2009.
Corrigan, A. L., Russell, L. M., Takahama, S., Äijälä, M., Ehn, M., Junninen, H., Rinne, J., Petäjä, T., Kulmala, M., Vogel, A. L., Hoffmann, T., Ebben, C. J., Geiger, F. M., Chhabra, P., Seinfeld, J. H., Worsnop, D. R., Song, W., Auld, J. and Williams, J.: Biogenic and biomass burning organic aerosol in a boreal forest at Hyytiälä, Finland, during HUMPPA-COPEC 2010, Atmospheric Chemistry and Physics, 13(24), 12233–12256, doi:10.5194/acp-13-12233-2013, 2013.
de Juan, A., Jaumot, J. and Tauler, R.: Multivariate curve resolution (mcr). Solving the mixture analysis problem, Anal. Methods, 6(14), 4964–4976, doi:10.1039/C4AY00571F, 2014.
Donoho, D. and Stodden, V.: When does non-negative matrix factorization give a correct decomposition into parts?, in Advances in neural information processing systems 16, edited by S. Thrun, L. K. Saul, and B. Schölkopf, pp. 1141–1148, MIT Press. [online] Available from: http://papers.nips.cc/paper/2463-when-does-non-negative-matrix-factorization-give-a-correct-decomposition-into-parts.pdf, 2004.
Everitt, B. S., Landau, S., Leese, M. and Stahl, D.: Cluster Analysis, 5 edition., Wiley, Chichester, West Sussex, U.K., 2011.
Gaujoux, R. and Seoighe, C.: A flexible r package for nonnegative matrix factorization, BMC Bioinformatics, 11(1), 367, doi:10.1186/1471-2105-11-367, 2010.
Geladi, P.: Chemometrics in spectroscopy. Part 1. Classical chemometrics, Spectrochimica Acta Part B: Atomic Spectroscopy, 58(5), 767–782, doi:10.1016/S0584-8547(03)00037-5, 2003.
Hemmer, M. C.: Expert systems in chemistry research, Taylor & Francis, Inc., Bristol, PA, USA., 2007.
Jackson, J. E.: A user’s guide to principal components, John Wiley & Sons., 2004.
Jolliffe, I. T. and Cadima, J.: Principal component analysis: A review and recent developments, Philosophical transactions. Series A, Mathematical, physical, and engineering sciences, 374(2065), doi:10.1098/rsta.2015.0202, 2016.
Kowalski, B. R.: Chemometrics: Mathematics and statistics in chemistry, Springer-Science+Business Media, Dordrecht., 1984.
Kuhn, M. and Johnson, K.: Applied predictive modeling, Springer New York., 2013.
Lee, D. D. and Seung, H. S.: Learning the parts of objects by non-negative matrix factorization, Nature, 401, 788 [online] Available from: https://doi.org/10.1038/44565, 1999.
Lee, D. D. and Seung, H. S.: Algorithms for non-negative matrix factorization, in Advances in neural information processing systems 13, edited by T. K. Leen, T. G. Dietterich, and V. Tresp, pp. 556–562, MIT Press. [online] Available from: http://papers.nips.cc/paper/1861-algorithms-for-non-negative-matrix-factorization.pdf, 2001.
Malm, W. C., Johnson, C. E. and Bresch, J. F.: In receptor methods for source apportionment, edited by T. G. Pace, pp. 127–148, Air Pollution Control Association, Pittsburgh, PA., 1986.
Martens, H. and Næs, T.: Multivariate calibration, John Wiley & Sons, New York., 1991.
Massart, D. L., Vandeginste, B. G. M., Deming, S. N., Michotte, Y. and Kaufman, L.: Chemometrics: A textbook, Elsevier Science, 1988.
Ott, W.: Environmental statistics and data analysis, Taylor & Francis., 1994.
Paatero, P.: Users guide for positive matrix factorization programs PMF2 and PMF3, Part 2: Reference., 2002.
Paatero, P., Hopke, P. K., Song, X. H. and Ramadan, Z.: Understanding and controlling rotations in factor analytic models, Chemometrics and Intelligent Laboratory Systems, 60(1-2), 253–264, doi:10.1016/S0169-7439(01)00200-3, 2002.
Paatero, P., Eberly, S., Brown, S. G. and Norris, G. A.: Methods for estimating uncertainty in factor analytic solutions, Atmospheric Measurement Techniques, 7(3), 781–797, doi:10.5194/amt-7-781-2014, 2014.
Paatero, P. and Tapper, U.: Positive matrix factorization: A non-negative factor model with optimal utilization of error estimates of data values, Environmetrics, 5(2), 111–126, 1994.
Rousseeuw, P. J.: Silhouettes: A graphical aid to the interpretation and validation of cluster analysis, Journal of Computational and Applied Mathematics, 20, 53–65, doi:10.1016/0377-0427(87)90125-7, 1987.
Russell, L. M., Takahama, S., Liu, S., Hawkins, L. N., Covert, D. S., Quinn, P. K. and Bates, T. S.: Oxygenated fraction and mass of organic aerosol from direct emission and atmospheric processing measured on the R/V Ronald Brown during TexAQS/GoMACCS 2006, Journal of Geophysical Research-Atmospheres, 114, D00F05, doi:10.1029/2008JD011275, 2009.
Russell, L. M., Bahadur, R. and Ziemann, P. J.: Identifying organic aerosol sources by comparing functional group composition in chamber and atmospheric particles, Proceedings of the National Academy of Sciences of the United States of America, 108(9), 3516–3521, doi:10.1073/pnas.1006461108, 2011.
Ruthenburg, T. C., Perlin, P. C., Liu, V., McDade, C. E. and Dillner, A. M.: Determination of organic matter and organic matter to organic carbon ratios by infrared spectroscopy with application to selected sites in the IMPROVE network, Atmospheric Environment, 86, 47–57, doi:10.1016/j.atmosenv.2013.12.034, 2014.
Takahama, S., Schwartz, R. E., Russell, L. M., Macdonald, A. M., Sharma, S. and Leaitch, W. R.: Organic functional groups in aerosol particles from burning and non-burning forest emissions at a high-elevation mountain site, Atmospheric Chemistry and Physics, 11(13), 6367–6386, doi:10.5194/acp-11-6367-2011, 2011.
Takahama, S., Johnson, A. and Russell, L. M.: Quantification of carboxylic and carbonyl functional groups in organic aerosol infrared absorbance spectra, Aerosol Science and Technology, 47(3), 310–325, doi:10.1080/02786826.2012.752065, 2013.
Takahama, S., Dillner, A. M., Weakley, A. T., Reggente, M., Bürki, C., Lbadaoui-Darvas, M., Debus, B., Kuzmiakova, A. and Wexler, A. S.: Atmospheric particulate matter characterization by Fourier transform infrared spectroscopy: A review of statistical calibration strategies for carbonaceous aerosol quantification in US measurement networks, Atmospheric Measurement Techniques Discussions, 2018, 1–68, doi:10.5194/amt-2018-70, 2018.
Ulbrich, I. M., Canagaratna, M. R., Zhang, Q., Worsnop, D. R. and Jimenez, J. L.: Interpretation of organic components from positive matrix factorization of aerosol mass spectrometric data, Atmospheric Chemistry and Physics, 9(9), 2891–2918, 2009.
Weakley, A. T., Takahama, S., Wexler, A. S. and Dillner, A. M.: Ambient aerosol composition by infrared spectroscopy and partial least squares in the chemical speciation network: Multilevel modeling for elemental carbon, Aerosol Science and Technology, 52(6), 642–654, doi:10.1080/02786826.2018.1439571, 2018.
Wold, S. and Sjöström, M.: Chemometrics, present and future success, Chemometrics and Intelligent Laboratory Systems, 44(1), 3–14, doi:10.1016/S0169-7439(98)00075-6, 1998.
Zeng, Y. and Hopke, P.: A study of the sources of acid precipitation in Ontario, Canada, Atmospheric Environment (1967), 23(7), 1499–1509, doi:10.1016/0004-6981(89)90409-5, 1989.