Smithsonian Institution Collections Visualization

Museum collections can tell us about history and about the values of a society. As one of the world's largest, the Smithsonian's collections are a prime candidate to dig into.

This visualization gives insight into the composition of the Smithsonian's collections, including what items it holds and when and where they come from. Specifically, it breaks items down by unit (such as the National Museum of American History or the Human Studies Film Archives), country of origin, and date (for items from the last couple of centuries). Interaction enables filtering to a specific unit, so trends can be compared between units as well as against the collections as a whole.

Since the Open Access dataset contains 11 million records, the data is in its own way opaque. At 26 GB uncompressed, it's too large for me to load at once, much less search through interactively. To make it tractable, the data was first sampled to learn its basic structure (at the start of this project the format specification had not yet been released) and then processed with Python and Jupyter Lab. After logging summaries and anomalies, string processing was used to clean up typos, inconsistencies, and similar issues. Finally, the aggregated data was written to a JSON file for the visualization to load.
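The stream-and-aggregate step can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: it assumes the records arrive as one JSON object per line, and the field names ("unit", "country", "year") and sample values are hypothetical stand-ins for whatever the real records contain. The point is that only the running counts are held in memory, never the full dataset.

```python
import json
from collections import Counter
from typing import Iterable, Iterator


def iter_records(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed record per non-empty line, skipping malformed lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            yield json.loads(line)
        except json.JSONDecodeError:
            continue  # a real pipeline would log this as an anomaly


def aggregate(records: Iterable[dict]) -> dict:
    """Count records per (unit, country, decade), one record at a time."""
    counts = Counter()
    for rec in records:
        unit = rec.get("unit") or "Unknown"        # hypothetical field name
        country = rec.get("country") or "Unknown"  # hypothetical field name
        year = rec.get("year")                     # hypothetical field name
        decade = (year // 10) * 10 if isinstance(year, int) else None
        counts[(unit, country, decade)] += 1
    # JSON objects cannot use tuples as keys, so emit a flat list of rows.
    return {
        "rows": [
            {"unit": u, "country": c, "decade": d, "count": n}
            for (u, c, d), n in counts.most_common()
        ]
    }


# Illustrative input only; real use would pass an open file handle instead.
sample = [
    '{"unit": "NMAH", "country": "United States", "year": 1905}',
    '{"unit": "NMAH", "country": "United States", "year": 1909}',
    '{"unit": "HSFA", "country": "Peru", "year": 1961}',
    "not json",
]
summary = aggregate(iter_records(sample))
print(json.dumps(summary, indent=2))
```

Because iter_records accepts any iterable of lines, the same function works on a small sample list or on a file object opened over a multi-gigabyte file.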

At a technical level, the data was processed using command-line tools (including grep, head, tail, and awk), Python, Jupyter Lab, and regular expressions; the visualization was created with D3 and a fork of Semantic UI.
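As an illustration of the kind of regular-expression cleanup involved, here is a small sketch under stated assumptions. The alias table, the function names, and the 1800-2099 year range are all hypothetical choices made for this example; the actual cleanup rules used in the project are not reproduced here.

```python
import re
from typing import Optional

# Four-digit years from 1800 to 2099, standing alone as a token.
YEAR_RE = re.compile(r"\b(1[89]\d{2}|20\d{2})\b")
WS_RE = re.compile(r"\s+")

# Hypothetical alias table mapping variant spellings to one canonical name.
ALIASES = {
    "usa": "United States",
    "u.s.": "United States",
}


def extract_year(text: str) -> Optional[int]:
    """Return the first plausible four-digit year in free text, if any."""
    match = YEAR_RE.search(text)
    return int(match.group(1)) if match else None


def normalize_country(text: str) -> str:
    """Collapse whitespace, then map known variants to a canonical name."""
    cleaned = WS_RE.sub(" ", text).strip()
    return ALIASES.get(cleaned.casefold(), cleaned)


print(extract_year("ca. 1905-1910"))
print(extract_year("19th century"))
print(normalize_country("  USA "))
print(normalize_country("United   States"))
```

Values that match no rule pass through unchanged apart from whitespace, which makes leftover inconsistencies easy to spot when the distinct values are listed afterwards.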