Data Visualization

Lecture 09

February 21, 2024

Last Class(es)

Two Ways To Frame “Extreme” Values

“Block” extremes, e.g. annual maxima (block maxima)?
Values which exceed a certain threshold (peaks over threshold)?

Two Ways To Frame “Extreme” Values

Block Maxima: Generalized Extreme Value (GEV) distributions.
Peaks-Over-Thresholds: Generalized Pareto distributions (GP) (plus maybe Poisson processes).
Statistical models are highly sensitive to details: shape parameters \(\xi\), thresholds \(u\), etc.
Models assume the observations are independent.
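
The two framings above can be sketched on a made-up series. This is only an illustration under stated assumptions: the data are synthetic Gaussian noise, the "year" is a hypothetical block of 365 values, and the 95th percentile is an arbitrary choice of threshold \(u\).

```julia
using Random, Statistics

# Synthetic "daily" series: 10 hypothetical years of 365 values each.
rng = MersenneTwister(1)
n_years, n_days = 10, 365
x = randn(rng, n_years * n_days)

# Block maxima: one maximum per year-long block.
blocks = reshape(x, n_days, n_years)          # each column is one year
block_maxima = vec(maximum(blocks, dims=1))

# Peaks over threshold: every value exceeding u (here the 95th percentile).
u = quantile(x, 0.95)
excesses = x[x .> u] .- u                     # amounts by which values exceed u
# In real data, consecutive exceedances often cluster, which conflicts with
# the independence assumption unless the series is declustered first.
```

Block maxima would then be fit with a GEV distribution and the excesses with a GP distribution. Note how many more points survive the threshold approach than the block approach.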

Data Visualization: Basic Principles

Purposes of Visualizing Data

Exploratory Analysis
Communication
Interpretation

Quantitative Summaries Can Be Insufficient

Code

using RDatasets, GLM, Plots
using Plots.PlotMeasures # provides the `mm` unit used for the margins below

# load Anscombe's Quartet data
df = dataset("datasets", "anscombe")

# fit a separate simple linear regression to each of the four data sets
model1 = lm(@formula(Y1 ~ X1), df)
model2 = lm(@formula(Y2 ~ X2), df)
model3 = lm(@formula(Y3 ~ X3), df)
model4 = lm(@formula(Y4 ~ X4), df)

# prediction of a fitted model at a single value X
yHat(model, X) = coef(model)' * [1, X]
x_line = [0, 20] # endpoints used to draw each fitted line

# scatter each data set and overlay its fitted line
p1 = scatter(df.X1, df.Y1, c=:blue, msw=0, ms=8)
plot!(p1, x_line, [yHat(model1, x) for x in x_line], c=:red, linewidth=2)

p2 = scatter(df.X2, df.Y2, c=:blue, msw=0, ms=8)
plot!(p2, x_line, [yHat(model2, x) for x in x_line], c=:red, linewidth=2)

p3 = scatter(df.X3, df.Y3, c=:blue, msw=0, ms=8)
plot!(p3, x_line, [yHat(model3, x) for x in x_line], c=:red, linewidth=2)

p4 = scatter(df.X4, df.Y4, c=:blue, msw=0, ms=8)
plot!(p4, x_line, [yHat(model4, x) for x in x_line], c=:red, linewidth=2)

# combine the panels with shared axis limits and styling
plot(p1, p2, p3, p4, layout = (1,4), xlims=(0,20), ylims=(0,14),
    legend=:none, xlabel = "x", ylabel = "y",
    tickfontsize=16, guidefontsize=18,
    left_margin=5mm, bottom_margin=10mm, right_margin=5mm)
plot!(size=(1200, 400))

Figure 1: Anscombe’s Quartet
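
To see why the quantitative summaries are insufficient here, one can compute them directly. The sketch below types the quartet in by hand so that it only needs the standard library. The variable names are illustrative and are not taken from the plotting code.

```julia
using Statistics

# Anscombe's quartet, typed in by hand (x is shared by the first three sets).
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4   = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
xs = [x123, x123, x123, x4]
ys = [
    [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68],
    [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74],
    [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73],
    [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89],
]

# One set of summary statistics per data set.
summaries = [(mean_x = mean(x), mean_y = mean(y), var_y = var(y), cor_xy = cor(x, y))
             for (x, y) in zip(xs, ys)]

for (i, s) in enumerate(summaries)
    println("Set $i: ", map(v -> round(v, digits=2), s))
end
```

The four rows are nearly indistinguishable, even though the scatterplots show a linear trend, a curve, a line with one outlier, and a vertical cluster with one leverage point.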

Challenges for Effective Visualization

Limits From Cognitive Processes
No “Optimal” Visualization
Temptation To Overload Figures
Easy to “Lie” About The Data

Further Challenges

Following Munzner (2014):

Many possible designs are a bad match with human perceptual and cognitive systems;
Many possible designs are a bad match with the intended task;
Only a small number of possibilities are reasonable choices;
“Randomly choosing possibilities is a bad idea because the odds of finding a very good solution are very low.”

What Can Go Wrong?

Healy (2018):

Bad Taste;
Bad Data;
Bad Perception

Remember: Data Never Speaks For Itself

Data must be understood in a particular context. You need to understand your data and what it says (or does not say!) based on your hypotheses.

What question(s) does your data address?
What transformations make the representation of the data as salient as possible?
What scales or channels are most appropriate?

Some Caveats

There is no recipe for effective visualization. Everything depends on your data and the story you want to tell.
This also means that defaults from data visualization packages are usually bad.
These principles are largely based on Western (American/European) norms.
Many of these guidelines are based on average outcomes, so there is likely to be a lot of individual variation.

Human Perception and Cognition

Stages of Human Visual Perception

Rapid, pre-attentive parallel processing to extract basic features;
Slow serial processing for extraction of patterns;
Goal-based retention of a few pieces of information in working memory related to a question at hand.

Working Memory Is Limited!

Estimates of the number of “bits” we can keep in working memory vary, but:

Limit is small;
Exceeding limit results in cognitive load;
Working memory is subject to “change blindness”

The more cognitive work you ask of your viewer, the less they are likely to take away and retain!

Gestalt Principles

The Gestalt school of psychology identified several principles of perception.

Core idea: Humans are very good at finding structure.

As a result, you need to evaluate the totality of a visual field, not just each component.

Gestalt Principles

Proximity
Similarity
Parallelism
Common Fate
Common Region
Continuity
Closure

Illustration of several Gestalt principles. Adapted from Healy (2018).

Don’t Add Unnecessary Artifacts!

Unnecessary artifacts can be “chartjunk.”

But worse, they might mislead the viewer.

Code

using Plots, LaTeXStrings # LaTeXStrings provides the L"..." labels

# x, y (data) and x_pred, y_med, y_ci_low, y_ci_hi (predictions) are assumed to be defined already
# left panel: prediction band, prediction median, and the data as points only
p1 = plot(x_pred, y_ci_low, fillrange=y_ci_hi, xlabel=L"$x$", ylabel=L"$y$", fillalpha=0.3, fillcolor=:blue, label="95% Prediction Interval", legend=:topleft, linealpha=0, legendfontsize=12, tickfontsize=14, guidefontsize=14)
plot!(p1, x_pred, y_med, color=:blue, label="Prediction Median")
scatter!(p1, x, y, color=:red, markershape=:x, label="Data")

# right panel: the same plot, plus a line joining the data points (the extra artifact)
p2 = plot(x_pred, y_ci_low, fillrange=y_ci_hi, xlabel=L"$x$", ylabel=L"$y$", fillalpha=0.3, fillcolor=:blue, label="95% Prediction Interval", legend=:topleft, linealpha=0, legendfontsize=12, tickfontsize=14, guidefontsize=14)
plot!(p2, x_pred, y_med, color=:blue, label="Prediction Median")
plot!(p2, x, y, color=:red, seriestype=:line, label=false)
scatter!(p2, x, y, color=:red, markershape=:x, label="Data")

plot(p1, p2, layout=(1, 2), size=(800, 600))

Channels For Encoding Information

Channels

A channel is a mechanism for encoding information.

Examples:

Color (Hue/Saturation/Luminance)
Position (1D/2D/3D)
Size (Length/Area/Volume)
Angle
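
As a small illustration of why the choice within the size channel matters, the sketch below encodes values as marker area rather than marker radius. The helper `area_scaled_radius` and its parameters are hypothetical and belong to no plotting package. It relies only on the fact that the area of a circle grows with the square of its radius.

```julia
# Hypothetical helper: radius chosen so that marker *area* is proportional to v.
area_scaled_radius(v; r_max=10.0, v_max=1.0) = r_max * sqrt(v / v_max)

vals = [1.0, 4.0, 9.0]
radii = area_scaled_radius.(vals; v_max=maximum(vals))

# Areas are proportional to radius squared, so these ratios recover the data ratios.
area_ratios = (radii ./ radii[1]) .^ 2
```

Mapping the values directly to radius would instead make the largest marker appear 81 times the area of the smallest, exaggerating a ninefold difference.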

Ordered vs. Categorical Attributes

The channels available depend on the type of attribute:

Ordered attributes can be
- Ordinal: Ranking, no meaning to distance;
- Quantitative: Measure of magnitude which supports arithmetic comparison;
Categorical attributes are unordered.

Channel Effectiveness: Ordered Data

Channels for ordered data, arranged top-to-bottom from more to less effective (channels in the right column are less effective than those in the left). Modified from Healy (2018) after Munzner (2014).

Channel Effectiveness: Categorical Data

Channels for categorical data, arranged top-to-bottom from more to less effective. Modified from Healy (2018) after Munzner (2014).

Preattentive Popout

Try to make your key features “pop out” to the viewer during the pre-attentive scan.

Searching for the blue circle becomes harder as the number of points grows and when only shape, not color, distinguishes it. Adapted from Healy (2018).

Code

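using Distributions, Plots

# sample points uniformly on the unit square; one randomly chosen point is the target
npt = 20
dist = Distributions.Product(Uniform.([0, 0], [1, 1]))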
pts = Tuple.(eachcol(rand(dist, npt)))
blueidx = rand(1:npt)
p1 = scatter(pts[1:end .!= blueidx], color=:red, xticks=:false, yticks=:false, legend=:false, markersize=5, title="Color Only, N=20", framestyle=:box)
scatter!(p1, pts[blueidx, :], color=:blue, markersize=5)

npt = 100
pts = Tuple.(eachcol(rand(dist, npt)))
blueidx = rand(1:npt)
p2 = scatter(pts[1:end .!= blueidx], color=:red, xticks=:false, yticks=:false, legend=:false, markersize=5, title="Color Only, N=100", framestyle=:box)
scatter!(p2, pts[blueidx, :], color=:blue, markersize=5)

npt = 20
pts = Tuple.(eachcol(rand(dist, npt)))
blueidx = rand(1:npt)
p3 = scatter(pts[1:end .!= blueidx], color=:blue, markershape=:utriangle, xticks=:false, yticks=:false, legend=:false, markersize=5, title="Shape Only, N=20", framestyle=:box)
scatter!(p3, pts[blueidx, :], color=:blue, markersize=5, markershape=:circle)

npt = 100
pts = Tuple.(eachcol(rand(dist, npt)))
blueidx = rand(1:npt)
p4 = scatter(pts[1:end .!= blueidx], color=:blue, markershape=:utriangle, xticks=:false, yticks=:false, legend=:false, markersize=5, title="Shape Only, N=100", framestyle=:box)
scatter!(p4, pts[blueidx, :], color=:blue, markersize=5, markershape=:circle)

plot(p1, p2, p3, p4, layout=(2, 2), size=(800, 500))

Channel Interference

When using multiple channels, be careful about interference, where combining channels reduces the effectiveness of each.

Color Schemes

Different color schemes are appropriate depending on whether the data is sequential, divergent, or unordered.

Appropriate Color Schemes

Color schemes should be perceptually uniform to preserve a mapping between changes in perceived colors and changes in attribute values.

Try to also choose color schemes which avoid confusing people who are color blind.

Color Schemes

Good news: Most plotting libraries include a wide variety of perceptually uniform, color-blind safe color schemes.

Bad news: These are not usually the defaults (in particular, avoid “rainbow” color schemes).

Sequential Color Schemes

Sequential schemes change in intensity from low to high as the value changes.

Divergent Color Schemes

Divergent schemes intensify in two directions from a zero or mean value.
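
One practical consequence is that the color limits should be symmetric around the center value, so that the neutral color lands on zero (or the mean). A minimal sketch, assuming a made-up matrix of anomalies and a hypothetical helper `symmetric_clims`:

```julia
# Hypothetical helper: color limits symmetric around `center`.
function symmetric_clims(z; center=0.0)
    half_width = maximum(abs.(z .- center))
    return (center - half_width, center + half_width)
end

z = [-1.0 0.5; 2.0 3.0]            # made-up anomalies
clims = symmetric_clims(z)

# With Plots.jl, these limits could be passed along with a divergent scheme,
# e.g. heatmap(z, c=:balance, clims=clims)
```

Without symmetric limits, the midpoint of the scheme would sit at the middle of the data range (here 1.0) rather than at zero.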

Unordered Color Schemes

Unordered schemes are appropriate for categorical data.

Some Examples

Thoughts On This Plot?

First Street Foundation Return Period Trends

Source: Shu et al. (2023)

How About This One?

Trump Polling Average vs. Employment in Swing States

Source: Joe Weisenthal

Last One!

Source: Bakkensen & Barrage (2022)

Key Points, Upcoming Schedule, and References

Recommendations

Don’t add extraneous artifacts.
Make key features “pop out,” or annotate them.
Summarize data to reduce complexity.
Try to prioritize high-effectiveness channels.
Don’t use 3D!
Mix channels sparingly (but redundancy is good!).
Is the figure an improvement over a table?
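
The recommendation to summarize data can be sketched as follows: rather than drawing every simulated curve, reduce an ensemble to a median and an interval. The ensemble here is made up (noisy draws around a straight line) and the names are illustrative.

```julia
using Random, Statistics

rng = MersenneTwister(42)
x_grid = 0:0.5:10

# Made-up ensemble: 1000 noisy draws around a straight line at each grid point.
sims = [2.0 + 0.5 * x + randn(rng) for _ in 1:1000, x in x_grid]   # 1000 × 21 matrix

# Collapse 1000 curves into three summary curves (one value per grid point).
pointwise_quantile(p) = [quantile(col, p) for col in eachcol(sims)]
lower = pointwise_quantile(0.025)
med   = pointwise_quantile(0.5)
upper = pointwise_quantile(0.975)
```

Plotting `med` with a shaded band between `lower` and `upper` conveys the spread with three series instead of a thousand overlapping lines.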

But the Biggest Recommendation of All…

Be intentional with your choices based on your storytelling goal!

Relying on defaults will usually steer you wrong, and all “rules” can be broken if they help you tell your story more effectively.

What About Exploratory Analysis?

When exploring data, try lots of things.

Don’t over-interpret one visualization.
Try to rely on hypotheses about what you might see instead of dredging through the data.

Upcoming Schedule

Monday: February Break!

Next Wednesday: In-Class Figure Discussion

Assessments

Friday:

Submit figures for discussion (Exercise 5)
HW2 Due

References

Bakkensen, L. A., & Barrage, L. (2022). Going Underwater? Flood Risk Belief Heterogeneity and Coastal Home Price Dynamics. Rev. Financ. Stud., 35, 3666–3709. https://doi.org/10.1093/rfs/hhab122

Healy, K. (2018). Data visualization: A practical introduction. Princeton University Press.

Munzner, T. (2014). Visualization analysis and design. CRC Press.

Shu, E., Hauer, M., & Porter, J. (2023, November). Future population exposure to flood risk: A decomposition approach across Shared-Socioeconomic pathways (SSPs). Research Square. https://doi.org/10.21203/rs.3.rs-3628132/v1