Sunday, October 20, 2013

Controlled Vocabularies vs Social Tagging

I am a fan of controlled vocabularies. The precise use of specific terms to describe an item has allowed me to find exactly what I wanted when I wanted it. My initial browsing of the Smithsonian Institution’s collection on Flickr did not sway me from the opinion that social tagging leads to chaos. I found photo after photo with numerous dubious, if not completely erroneous, tags. As an example, the photo of Marie Curie below was tagged as both “black and white” and “sepia”. Without even looking at the photo, we know at least one of these tags is incorrect.

Marie Curie
Creator: Transocean
Collection: Smithsonian Institution
URL: http://www.flickr.com/photos/smithsonian/2583275677/

The photo carries many other useless tags. A single user added the tags “person”, “wearing”, “parka”, and “outside”. In isolation, the tags do not mean much, and combined into the phrase “person wearing parka outside” they describe something that is simply not true. The Nobel Laureate’s photograph was also labeled as “intense”, “old woman”, “sad”, and “upset”, all subjective labels which may not be useful for someone looking for a picture of Marie Curie.

This single photo is illustrative of all the arguments against uncontrolled vocabularies. Users do not combine similar ideas under a single term, instead using a multitude of synonyms. The terms “woman” and “scientist” are applied to this photo, as is “women in science”, but other similar, applicable terms are missing. For example, “woman in science” or “women scientists” are equally accurate, but without a controlled vocabulary, we do not have a preferred term to collect all of the images related to the idea of women in science and therefore may not discover certain photos as part of a search.
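
To make the idea of a preferred term concrete, here is a minimal sketch in Python. The synonym table, the photo identifiers, and the function names are all hypothetical and invented for illustration; a real controlled vocabulary would supply its own mappings.

```python
# Hypothetical preferred-term table: each variant points to one preferred term.
PREFERRED_TERMS = {
    "woman in science": "women in science",
    "women scientists": "women in science",
    "women in science": "women in science",
}


def normalize_tag(tag):
    """Lowercase the tag, collapse whitespace, and map it to its preferred term."""
    key = " ".join(tag.lower().split())
    return PREFERRED_TERMS.get(key, key)


def search(photos, query):
    """Return the ids of photos whose normalized tags include the normalized query."""
    wanted = normalize_tag(query)
    return [
        photo_id
        for photo_id, tags in photos.items()
        if wanted in {normalize_tag(t) for t in tags}
    ]


# Made-up sample data: two photos use different variants of the same idea.
photos = {
    "curie": ["women in science", "scientist"],
    "other": ["Women Scientists"],
    "hine": ["cannery"],
}
print(search(photos, "woman in science"))
```

With the mapping in place, a search on any of the three variants finds both photos. Without it, each variant finds only the photos that happen to use that exact wording.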

Another common problem was that some taggers did not combine words into a single term. The tag “Marie Curie” was accurately applied, but the photo was also tagged as “Marie” and “Curie” separately. Do users want this photo when they search for “Marie”?

When we combine the erroneous terms, broken compound terms, and the lack of preferred terms, I wonder how users can effectively use the system. There is no way to know whether you should have searched for a synonymous term, and search results are flooded with images tagged in error, leaving the user to browse through more results than necessary. All of my initial biases toward controlled vocabularies were confirmed by the tags applied to this image of Marie Curie.

But as I explored the topic more, I began to reconsider my rigid rejection of user-supplied tags. Following the Library of Congress pilot project on Flickr, Michelle Springer et al. reported cases where comments and tags from the user community helped catalogers improve the Library of Congress records for specific photos (25-31). In my own browsing, I found an example of users adding information to enrich the Library of Congress catalog record. The Lewis Wickes Hine collection contains an image of Nan de Gallant, a 9-year-old who worked in a sardine cannery. A user pointed out that Nan’s full name was Anna J. Gallant and provided a link to other photos of Anna. Without input from the community, Anna J. Gallant would have remained the anonymous “Nan.” As this example shows, soliciting contributions from a large, diverse community can greatly enhance our ability to provide correct, rich metadata records.


Nan de Gallant
Creator: Lewis Wickes Hine
Collection: Library of Congress
URL: http://www.flickr.com/photos/library_of_congress/7985823070

Our class tagging exercise further exposed the limits of controlled vocabularies. A photo of Ron Blackburn painting a mural seems like a simple idea to express until you attempt to apply a controlled vocabulary to it. Neither the Library of Congress Name Authority nor the Union List of Artist Names contained an authority record for Ron Blackburn. While both vocabularies have mechanisms for adding names, it was disappointing to not find the artist already listed. A cataloger is left to create an entry whether or not the entry is later added to the larger vocabulary. It was also surprisingly difficult to use the Thesaurus for Graphic Materials to express mural painting as a subject of a piece rather than a mural as a medium. I found myself sympathizing with users who find controlled vocabularies to be slow to be updated, blind in certain cultural areas, and difficult to use.

Now that I have argued for and against both sides, where does that leave us? We must find a way to balance the pros and cons of each system. First, we should use the community’s collective wisdom to improve our controlled vocabularies. As Springer et al. state in the LoC report, one suggestion is to “compare tags used by Flickr members against terms/references found in vocabulary lists used primarily to describe photos at LC like Thesaurus for Graphic Materials (TGM) or Library of Congress Subject Headings (LCSH)” (24). The specific example they give is potentially adding the term “Rosie the Riveter” to the vocabulary. Ironically, the image below does not carry that tag; I recognized it from an article attempting to identify the original “Rosie”.

Woman working on a "Vengeance" dive bomber
Creator: Alfred T. Palmer
Collection: Library of Congress
URL: http://www.flickr.com/photos/library_of_congress/2179038448
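
The comparison Springer et al. suggest, checking user tags against a vocabulary list, could look something like the following Python sketch. The function name, the `min_uses` threshold, and the sample data are my own assumptions for illustration; the one-entry vocabulary stands in for a real list and is not taken from TGM or LCSH.

```python
from collections import Counter


def candidate_terms(user_tags, vocabulary, min_uses=2):
    """Return tags used at least min_uses times that the vocabulary lacks.

    Matching is case-insensitive; results are ordered from most to least used.
    """
    vocab = {term.lower() for term in vocabulary}
    counts = Counter(tag.lower() for tag in user_tags)
    return [
        tag
        for tag, uses in counts.most_common()
        if uses >= min_uses and tag not in vocab
    ]


# Made-up sample data for illustration only.
user_tags = ["Rosie the Riveter", "rosie the riveter", "Airplanes", "airplanes", "sad"]
vocabulary = ["Airplanes"]
print(candidate_terms(user_tags, vocabulary))
```

Anything the function returns is only a candidate. A cataloger would still decide whether a popular tag deserves a place in the vocabulary.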

Second, we should consider the voice of the community at large instead of allowing all tags to be equal. In Introduction to Metadata, Tony Gill advocates for a system where “each time an individual user labels a Web resource with a specific descriptive tag, it counts as a ‘vote’ for the appropriateness of that term for describing the resource. In this way, Web resources are effectively cataloged by individuals for their own benefit, but the community also benefits from the additional metadata that is statistically weighted to minimize the effects of either dishonesty or stupidity.” In other words, one user may erroneously state that Marie Curie is wearing a parka, but it is unlikely most people will make that mistake. We could alleviate the negative effects of these bad tags if we discount the unpopular ones.
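
The voting idea can be sketched in a few lines of Python. This is only an illustration of the principle Gill describes, not his method or any real system’s implementation; the function name, the `min_votes` cutoff, and the sample users are hypothetical.

```python
from collections import Counter


def weighted_tags(tags_by_user, min_votes=2):
    """Count one vote per user per tag and keep tags with at least min_votes."""
    votes = Counter()
    for tags in tags_by_user.values():
        # A set ensures a user who repeats a tag still casts only one vote.
        votes.update({tag.lower() for tag in tags})
    return {tag: count for tag, count in votes.items() if count >= min_votes}


# Made-up sample data: only one user thinks there is a parka in the photo.
tags_by_user = {
    "user_a": ["Marie Curie", "scientist"],
    "user_b": ["marie curie", "parka"],
    "user_c": ["Marie Curie", "scientist", "scientist"],
}
print(weighted_tags(tags_by_user))
```

The lone “parka” vote falls below the cutoff and drops out, while the tags several users agree on survive. The cutoff is a design choice: set it too high and rare but accurate tags, such as a newly identified name, disappear along with the mistakes.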

Finally, better technology interfaces may help. The LoC report recognizes that some of the bad tags are due to “intended word mergers to overcome system syntax requirements (real or perceived) [and] unintended de-linking of multi-word phrases and terms” (Springer et al. 23). Clearer instructions on the allowed syntax and how to enter multi-word phrases could prevent some of the problems I found while browsing the collection.
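
To see how a phrase gets unintentionally de-linked, consider a tag box that splits its input on spaces and treats double-quoted text as a single tag. That rule is an assumption I am making for illustration, not a description of any particular site’s parser, and the function name is hypothetical.

```python
import re

# Either a double-quoted phrase (captured without the quotes) or a run of non-spaces.
TAG_PATTERN = re.compile(r'"([^"]+)"|(\S+)')


def parse_tags(raw):
    """Split a tag-box string into tags, keeping quoted phrases together."""
    return [quoted or bare for quoted, bare in TAG_PATTERN.findall(raw)]


print(parse_tags('Marie Curie scientist'))    # the name is split into two tags
print(parse_tags('"Marie Curie" scientist'))  # the quotes keep the name together
```

Under a rule like this, a user who types a name without quotes produces exactly the separate “Marie” and “Curie” tags I found, which is why a short hint next to the tag box could matter.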

I keep coming back to the question of what we are really trying to accomplish here. We all want to be able to find the resources that satisfy our particular needs. We should not get caught up in defending one side or another in the war of controlled versus uncontrolled vocabularies. Instead, we should take the best of both systems to satisfy the users’ needs.

Springer, Michelle, et al. “For the Common Good: The Library of Congress Flickr Pilot Project.” October 30, 2008. <http://www.loc.gov/rr/print/flickr_report_final.pdf>

Gill, Tony, et al. Introduction to Metadata. 3rd ed. Los Angeles: Getty Publications, 2008. <http://www.getty.edu/research/publications/electronic_publications/intrometadata/index.html>






