Introduction to Data Visualization

The objective of this course is to provide students, young researchers, and practitioners without a formal education in data visualization with an introduction to and overview of the field. In addition to introducing basic concepts, the course will present diverse visualizations for different kinds of data (e.g., categorical, numeric, hierarchic, network, temporal, and spatial data) and for different kinds of visualization tasks and goals (e.g., retrieve value, filter, compute derived value, find extremum, sort, determine range, characterize distribution, find anomalies, cluster, and correlate). We will also discuss the role of data stories in conveying data-driven insights.


Workshop Objective
Learn the basic principles of how to effectively visualize data using a variety of graphics from charts to maps.
Explore considerations for the various ways to present data, how to ensure your data visualizations are clear and easy to understand, and examples and comparisons of visuals.
What we will do today: discuss a general framework for designing data visualizations, and some foundational elements or considerations for designing visualizations. There will be examples and opportunities for participation and feedback.
What we will not do today: we are not going to talk about how to use specific software like Tableau, or specific how-to's in Excel, though we will link to resources you can use for those.

Data overload and biases
"The problem for me was not ignorance; it was preconceived ideas." -Hans Rosling
There is a lot of data out in the world, ranging from things like traditional census data, to interviews, to photos. It can feel overwhelming to process: how do we make sense of it all? How do we make it manageable?
When people get overwhelmed, we often rely on our biases and pre-conceived notions. So even though we have the data, if it's not accessible, or we feel overburdened, we ignore it.
A pioneer of data visualization in global health, Hans Rosling, would survey people, including so-called experts, donors, and career professionals, on what they thought about the state of global health. Even though the data was widely available, they often fell back on stereotypes or preconceived ideas that certain countries or certain populations were doing worse than they were.

Our visual will ultimately answer our question
For example: Which school district has the fewest 10th grade students who met math standards in 2019?
Chart from: https://roadmapproject.org/data-dashboard/#3rd-11th-grade-assessments
This chart gives us our answer
The reason it's important to identify our question is that the goal of our visual is to answer that question.
For instance, say the state has a bucket of funds for math improvement that it needs to allocate. To ensure the funds go to the places with the greatest need, we want to know which school district has the fewest students who met math standards in 2019.
To answer the question, we build a visual that shows for each school district the number of students who are meeting math standards.
Our visual answers this question by showing clearly that Federal Way, on the far right, has the fewest students meeting standards. Thus, we'd want to allocate funds there to help meet the needs of students.
This answers our question! The goal of your visual should be to answer your question.
We talked at the beginning about how, with so much data available, we can end up overwhelmed. When we pick our content, we want to be selective so that we don't let our viewers get overwhelmed again!

Be honest: minimize distortion

Adapted from Principles of Visualization by Edward Tufte
The only thing I changed was the time period of my data
The flip side of that is that we want to be honest. Many of us have heard the adage that statistics will say anything you want them to. We don't want to manipulate our viewers.
For example, the graph on the left makes it look like COVID cases in King County are fairly flat, nearly leveling off! But the graph on the right, drawn from the same data with only the time period changed, paints a very different picture, with a much sharper increasing trend.
Thinking about our content is also important in considering equity.
This map on the left is the map most people have seen.
But what if we used this map on the right that shows indigenous territories instead?
It's the same data. But the content we're choosing tells our readers a specific (colonial) story or specific worldview.
It's important to be careful when you choose your content for this reason, and I encourage you to be reflective. Think about: What story is this upholding? Does this reinforce harmful stereotypes or messages? Is there missing historical context that's important?
Again, this is where you may need to add further text or information to give your reader those additional insights and the context of why this information is presented this way, or what worldview is being told.

How many '9's are in this visual?
What about now?
Why is this visual easier to understand?
Because it takes advantage of how our brains process information

Adapted from 'Data Visualization: Cognition' presentation by Chris Adolph
Visual A answers our question much more easily

It's fundamentally the same content
But the structure of how we presented our data is different
The reason is that A takes advantage of how our brain processes information: it gives us a pattern to look at.

Why are some visuals easier to understand?
The brain can only absorb so much information at once.
Use visual clues to tell our mind "look here!" For instance, color is a visual clue: Look at me!
The brain can only absorb so much information at once visually. This is again the issue of information overload: when there is too much going on, our brains don't know what to do with it, or where to look. Think of the I Spy book effect.
However, our brains are very good at recognizing visual patterns through what are called pre-attentive attributes. Pre-attentive attributes are essentially visual clues that tell our mind to "look here". For instance, color is a visual clue: between the black dots and the red dot, our eye is drawn to the red dot because of the difference. The red dot doesn't fit the pattern! This is a visual clue.
We want to use these visual clues to make our graphics easier to read: to tell the audience where to look, and how to answer the question.
We're going to start with a bar chart. In this bar chart, the goal is to show the difference in life expectancies among different racial groups in King County. Each bar represents one racial group. To communicate the life expectancy for each group, it uses length (how far the bar extends) and position (where the end point of the bar is). They've also added color, to differentiate between the groups.
Bar charts are good for showing comparisons between groups, or showing rankings, because our mind can easily pull out the patterns of length and position to understand the data for each group. Here, you can easily see that American Indians have a lower life expectancy than other groups.
One thing you'll note is that this visual doesn't ask or answer why there are disparities in life expectancy. Again, this "why" question may be hard to answer in a visual, and would benefit from a footnote or a text box that explains the structural factors that have led to disparate health outcomes.
One challenge with pie graphs is that we are often tempted to cut them into too many slices! As we talked about before, this brings us back to our data overwhelm: we can't process or find a pattern in all 30 of those slices, so they become meaningless. In general, we want to stay with fewer slices!
Purpose: How many COVID cases and deaths are there in every municipality in the US?

Choropleth Maps
Lastly we have text tables! We often don't think of these as visuals, because we use them so often in reports. But they are a visual.
They're really best at showing exact values, and not much else.
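Two simple ways to make a text table easier to scan are sorting the rows and shading them by value. A minimal Python sketch of that idea follows; the county names and numbers are made up for illustration (they are not the NYT figures), and `shade_for` and `format_table` are hypothetical helper names.

```python
SHADES = "░▒▓█"  # lightest to darkest mark


def shade_for(value, lowest, highest):
    """Pick a mark so that larger values get darker shading (color clue)."""
    if highest == lowest:
        return SHADES[-1]
    fraction = (value - lowest) / (highest - lowest)
    index = min(int(fraction * len(SHADES)), len(SHADES) - 1)
    return SHADES[index]


def format_table(rows):
    """Sort rows from highest to lowest (position clue) and shade each one."""
    ordered = sorted(rows, key=lambda row: row[1], reverse=True)
    values = [value for _, value in ordered]
    lowest, highest = min(values), max(values)
    return [
        f"{shade_for(value, lowest, highest)} {name:<10}{value:>6}"
        for name, value in ordered
    ]


# Hypothetical case rates, invented for this example only.
rows = [("County A", 120), ("County B", 480), ("County C", 300)]
for line in format_table(rows):
    print(line)
```

The exact values stay visible, which is what a text table is good at, while the ordering and shading give the eye somewhere to start.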
For instance, in this chart, where the goal of the NYT was to show every single case count in every single county, a table can give us all those numbers. The challenge is that text tables don't naturally use any of our visual clues.
Two visual clues we can add to help interpret text tables:
Color: the areas with the highest case rates have darker colors
Position: we sorted our data, which adds an element of position, so the highest rates are on top here

Small multiples is using panels of the same graphic type repeatedly! Any graphic type could be used (line graphs, bar charts…)
Purpose: What has been the change in uninsured adults, by racial group, from 2013 to 2017?

Data from US Census Bureau, American Community Survey
The last bonus isn't a chart as much as a strategy for multiple charts: what we call small multiples. This is taking panels of the same graphic type and using them side by side by side. It could be any graphic type: a bar chart, a line graph, or any other type.
For instance, in this visual we wanted to answer the question: what is the change in uninsured adults, by racial group, from 2013 to 2017? We gave each group its own miniature line graph and put them all side by side to see their individual trends.
The nice thing about small multiples is that the reader only has to interpret one of the graphs, and then they understand the pattern in every graph after that.

Data from US Census Bureau, American Community Survey
Many charts can be turned into small multiples by adding panels.

Small multiples are really best for displaying large datasets where there is lots of similar information
And you can turn an existing chart into small multiples by just adding panels! For instance, the graph on the left initially plots all the groups on the same visual. It's the same kind of information for each group, though. To make it easier to see, we put each group in its own panel and put the panels side by side: small multiples! (graph on the right)
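As a rough sketch of the idea, assuming the data arrives as (group, year, value) records, the Python below splits one combined series into one panel per group and draws each panel as a tiny row of text bars on a shared scale. The groups and values are invented for illustration and are not the American Community Survey figures; the function names are hypothetical.

```python
from collections import defaultdict

BLOCKS = "▁▂▃▄▅▆▇█"  # eight bar heights, lowest to highest


def split_into_panels(records):
    """Group (group, year, value) records into one time-ordered series per group."""
    panels = defaultdict(list)
    for group, year, value in records:
        panels[group].append((year, value))
    return {group: sorted(series) for group, series in panels.items()}


def sparkline(series, lowest, highest):
    """Draw one panel as a row of block characters."""
    span = (highest - lowest) or 1  # avoid dividing by zero for flat data
    marks = []
    for _, value in series:
        index = int((value - lowest) / span * len(BLOCKS))
        marks.append(BLOCKS[min(index, len(BLOCKS) - 1)])
    return "".join(marks)


def small_multiples(records):
    """Return one text panel per group, all drawn on the same scale."""
    panels = split_into_panels(records)
    values = [value for _, _, value in records]
    lowest, highest = min(values), max(values)
    return {
        group: sparkline(series, lowest, highest)
        for group, series in panels.items()
    }


# Hypothetical percentages, invented for this example only.
records = [
    ("Group A", 2013, 20), ("Group A", 2015, 14), ("Group A", 2017, 10),
    ("Group B", 2013, 12), ("Group B", 2015, 8), ("Group B", 2017, 6),
]
for group, panel in small_multiples(records).items():
    print(f"{group}  {panel}")
```

Using the same low and high values for every panel is the key design choice here: the reader learns the scale once and can then compare panels at a glance.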

Which chart tells a clearer story? Why?

PATH's M&E PrEP presentation by Jenny Shannon and Jonathan Drummey
Purpose: Which district had the biggest change in the number of HIV+ pregnant women between 2015 and 2016?
The goal of the visual here is to identify which district had the biggest change in the number of HIV+ pregnant women between 2015 and 2016. Which do you think is better?
There isn't a single right answer. Both choices are appropriate for visualizing how 2 or more numbers are alike or different. My preference if we are thinking about just identifying the district is Option A, because it highlights the one district with change.
The orientation illustrates a single category increase.
But if we wanted to compare all of the districts, option B might be better because it shows bars for each district. The length helps us understand each district and compare them.
Again, the key question here is also who your audience is, their comfort level, and what they need to understand. Some audiences may be more familiar with bar charts; others may intuitively understand a line graph.
Another key element of choosing color is choosing a color scheme that is accessible to all. Most notably this means accounting for color blindness, which affects roughly 1 in 12 men and 1 in 200 women. The most common form is red-green color blindness, which means that, as we mentioned earlier, a traffic light color scheme isn't a great fit.
The other thing you want to consider is how your visual will be presented. Frequently in my work, we'll print copies of reports or handouts for people to have, and oftentimes these are in black and white.
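One rough way to check a palette before printing is to estimate how light or dark each color becomes in grayscale and flag colors that end up too close together. The sketch below assumes colors are given as `#RRGGBB` strings, uses the common Rec. 601 luma weights as an approximation of what a printer or copier does, and treats a gap of 30 gray levels as too close. That threshold and the helper names are assumptions for illustration, not a standard.

```python
def hex_to_rgb(color):
    """Convert '#RRGGBB' to an (r, g, b) tuple of ints in 0-255."""
    color = color.lstrip("#")
    return tuple(int(color[i:i + 2], 16) for i in (0, 2, 4))


def gray_level(color):
    """Approximate gray level (0 = black, 255 = white) using Rec. 601 weights."""
    r, g, b = hex_to_rgb(color)
    return 0.299 * r + 0.587 * g + 0.114 * b


def too_similar_in_gray(palette, min_gap=30):
    """Return neighbouring color pairs (by gray level) closer than min_gap."""
    levels = sorted((gray_level(color), color) for color in palette)
    return [
        (first, second)
        for (first_level, first), (second_level, second) in zip(levels, levels[1:])
        if second_level - first_level < min_gap
    ]


# A hypothetical red/green/blue palette: red and green print as similar grays.
print(too_similar_in_gray(["#FF0000", "#00AA00", "#0000FF"]))
# A hypothetical dark/medium/light palette keeps its contrast in grayscale.
print(too_similar_in_gray(["#000000", "#777777", "#DDDDDD"]))
```

A check like this does not replace an actual test print, since printers and copiers vary, but it can catch obvious problems early.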

Many color schemes don't translate to black and white
For instance, this color scheme, which used the Excel default (on the left), looks like this when printed (on the right). It's no longer possible to tell which lines are which, particularly in the middle where they cross over. So try printing your graphic to make sure the colors are also black-and-white compatible.

Labels and 'chart noise': Some guidelines
1. Label the important parts of your chart, including a title, axes, and data marks as needed.
2. Use language that is appropriate for your audience.
3. Make sure it's understandable; consider how your visual will be delivered.
Labels are another key component of our graph that helps interpretation. All of your labeling should really be in service to #3, so that your chart is easily understood by your viewers.
For #1, most importantly, this means having a title and axes on your chart. You may also want to add further labeling, such as data labels or a subtitle. Again, be careful of overwhelming your audience with information: we want to keep the information that's critical at the forefront.
Another key element to consider is #2: what language are you using? This includes thinking about the content of your language: is it too technical for your audience? Does it use jargon or abbreviations that people don't know? But it also means thinking about the language itself: is your audience going to be more comfortable viewing and interpreting the materials in Spanish? Another language?
Again, all of this is in service to #3: making sure the chart is understood. Remember, you don't want to clutter your chart and overwhelm the audience. You may not need data labels or a subtitle, so be sparing!
Chart from: https://www.bbc.com/news/world-us-canada-52714804

Labels: make the visual fit for the audience
Consider how your audience is consuming the visual, and their comfort level. Label your visual accordingly so your audience can interpret it!
• Are they looking at the visual on their own, or is someone explaining it to them?
• What is their level of comfort with data visualization?
• How often have they seen this visualization?
• How long will they have to look at this visualization?
As we said, our end goal is understanding.
The last thing I'd encourage you to think about with labeling is again the audience: how will they consume the visual?
Are they comfortable with visualization and don't need much guidance? Will you be available to explain to them the material?
Do they have a half hour to look at the visual and digest, or will they only have two minutes between meetings to glance at it?
Depending on the audience, you may need more or less labeling to help guide them through the chart.

Font: Some guidelines
1. Most important information is largest. Use a logical hierarchy when picking font size.
2. Make your font accessible. Consider how your visual will be delivered.

The last formatting piece to consider is font. There are two key pieces here, listed in the guidelines above.

You want to visualize how adults in Washington commute to the office.
In the following visual, what would you change in the formatting to make it more readable?

You want to visualize how adults in Washington commute to the office.
We changed the color scheme, re-ordered our data, integrated the legend, and added labels and a title.
A checklist for formatting your visual
Structure:
• How can we best present our data?
Formatting:
• What else is needed to make our graphic visually accessible?
Audience:
• Who is our user and what do they need?
Case study: Purpose
There are two program activities offered monthly:
• Play groups for young children
• Parenting skill classes
Purpose: How often did families participate in each of the program activities in the last 12 months?
We reviewed the program logs for 188 program participants to find out how often they participated in each activity.
We discovered four attendance 'categories':
• Never
• Once
• Twice
• Three or more times
We have our purpose: How often did families participate in each of the program activities in the last 12 months?
We have our content: program logs of attendance for 188 participants, broken into four attendance categories.
So now we want to think about structure. We're fundamentally comparing different groups (the different attendance categories), so we want to use a bar chart.
[Slide examples: In Excel; In Google Sheets; In Tableau]
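As a tool-agnostic illustration of this structure step, the Python sketch below sorts raw attendance counts into the four categories and draws a text bar chart. The attendance numbers are invented (the case study gives only the number of participants, not their individual counts), and the function names are hypothetical.

```python
from collections import Counter

CATEGORIES = ["Never", "Once", "Twice", "Three or more times"]


def attendance_category(times_attended):
    """Map a number of sessions attended to one of the four categories."""
    if times_attended < 0:
        raise ValueError("attendance cannot be negative")
    if times_attended >= 3:
        return "Three or more times"
    return CATEGORIES[times_attended]


def count_categories(attendance_counts):
    """Count participants per category, kept in the natural category order."""
    counts = Counter(attendance_category(n) for n in attendance_counts)
    return [(category, counts[category]) for category in CATEGORIES]


def text_bar_chart(category_counts, width=30):
    """Draw one bar per category; the longest bar is `width` characters."""
    largest = max(count for _, count in category_counts) or 1
    lines = []
    for category, count in category_counts:
        bar = "#" * round(count / largest * width)
        lines.append(f"{category:<20}{bar} {count}")
    return lines


# Hypothetical play group attendance for eight families (invented numbers).
play_group_log = [0, 0, 1, 2, 3, 5, 1, 0]
for line in text_bar_chart(count_categories(play_group_log)):
    print(line)
```

The bars are kept in the order Never, Once, Twice, Three or more times rather than sorted by size, because the categories have a natural order that readers expect to see.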

Tableau Resources
If you are interested in using Tableau:
• Tableau Public allows you a free license, but then your visuals are all publicly available.
  Not good for sensitive data, or data that has personal information attached!
• Tableau for Good gives free licenses to non-profits with annual operating budgets under $5 million.
  To have your application processed and to download the software, there is an administrative fee of ~$50.
• Tableau has training resources online, including free videos and 90 days of free eLearning courses.