Culturomics: Science in Culture
Culturomics (/ˌkʌltʃəˈrɒmɪks/) has been defined as follows:
- “Culturomics is an area of study which looks at how frequently a word is used over a period of time and relates this to changes in culture”. (Macmillan Dictionary, http://www.macmillandictionary.com/dictionary/british/culturomics)
- “Culturomics is a field of investigation which links cultural trends to a quantitative analysis of word use over a particular period of time”. (Macmillan Dictionary, http://www.macmillandictionary.com/buzzword/entries/culturomics.html)
The digitization of books by Google Books has sparked controversy over issues of copyright and book sales, but for linguists and cultural historians this vast project could offer an unprecedented treasure trove.
Researchers at Harvard University in Cambridge, Massachusetts, and the Google Books team in Mountain View, California, herald a new discipline called culturomics, which sifts through this literary bounty for insights into trends in what cultures can and will talk about through the written word.
- “The possibilities with such a new database, and the ability to analyse it in real time, are really exciting,” says linguist Sheila Embleton of York University in Toronto, Canada.
- “Quantitative analysis of this kind can reveal patterns of language usage and of the salience of a subject matter to a degree that would be impossible by other means,” agrees historian Patricia Hudson of Cardiff University, UK.
- “The really great aspect of all this is using huge databases, but they will have to be used in careful ways, especially considering alternative explanations and teasing out the differences in alternatives from the database,” adds Royal Skousen, a linguist at Brigham Young University in Provo, Utah. “I do not like the term ‘culturomics’,” he adds. “It smacks too much of ‘freakonomics’, and both terms smack of amateur sociology.”
- “The ability, via modern technology, to look at just so much at once really opens horizons,” says Embleton. However, Hudson cautions that making effective use of such a resource will require skill and judgement, not just number-crunching. “How this quantitative evidence is generated and how it is interpreted are the most important factors in forming conclusions,” she says. “Quantitative evidence of this kind must always address suitably framed general questions, and employed alongside qualitative evidence and reasoning, or it will not be worth a great deal.”
- “But human categorization can only go so far,” said David Blei, an associate professor of computer science at Princeton. “We don’t have the human power to read and tag all this information.”
- “But these tools have enormous implications,” said Anthony T. Grafton, a history professor at Princeton and a former president of the American Historical Association, referring to their ability to reveal unexpected patterns and associations in the historical record. “These are tools that can pick up big changes,” he said. “You can’t do this by using older, conventional means of reading books and taking notes.”
- BOOKWORM-ARXIV will burrow its way through data stored in roughly 743,000 papers that have been uploaded by scientists, said Paul Ginsparg, founder of arXiv. Authors typically send their papers to arXiv as “preprints”, or unpublished manuscripts, before the works appear in journals. Most of the research is in physics, mathematics, computer science, statistics and the quantitative parts of biology and finance, said Dr. Ginsparg, a professor of physics and information science at Cornell. Readers who use the arXiv interface will be able to click through to the original text. “The papers are not behind a paywall,” Dr. Ginsparg said. He has already been trying out the topic modeling algorithms of Dr. Blei and others on arXiv’s collection of scientific papers. “The technology gives you a way to home in for finer grains of similarity between articles,” Dr. Ginsparg said, “ones that might not be detected by a keyword search.”
- Steven Pinker, cognitive scientist and Harvard professor, collaborated with the observatory team during development of the n-gram viewer. He said the new interface might be particularly useful for historians of science. “They will increasingly be able to test their explanations, conjectures and hypotheses by looking at the rise and fall of phrases in the scientific literature,” he said.
- Dr. Erez Lieberman Aiden said the worlds of Google’s scanned books and arXiv’s papers were just the beginning for the observatory. “We plan on moving on soon to newspapers, blogs, tweets and other aspects of the historical record,” he said.
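Ginsparg’s remark about finer-grained similarity between articles can be illustrated with a minimal sketch. This is not Blei’s topic-modeling algorithm: it uses plain bag-of-words cosine similarity, and the abstract snippets are invented, but it shows the basic idea of comparing documents as word-count vectors rather than by a single keyword.

```python
import math
from collections import Counter

def bow(text):
    """Lowercased bag-of-words counts for one document."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm = (math.sqrt(sum(v * v for v in a.values())) *
            math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

# Invented abstract snippets standing in for arXiv papers.
doc1 = bow("gauge theory on a lattice with fermions")
doc2 = bow("lattice gauge theory and fermion doubling")
doc3 = bow("stochastic gradient descent for deep networks")

print(cosine(doc1, doc2) > cosine(doc1, doc3))  # True: doc1 is nearer to doc2
```

A topic model goes further than this sketch, linking articles that share few or no surface words by mapping them into a common space of topics; cosine over raw counts only captures vocabulary overlap.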
Culturomics is the computational study of cultural and social trends through quantitative analysis of digitized text. The term was coined in a Science paper by Erez Lieberman Aiden and colleagues, published online on 17 December 2010 (and in print in the 14 January 2011 issue), which accompanied the creation of Google’s Ngram Viewer. Culturomics extends the boundaries of rigorous quantitative inquiry to a wide array of new phenomena spanning the social sciences and the humanities.
A team led by Jean-Baptiste Michel and Erez Lieberman Aiden created a data set based on the trillions of words in Google Books, which at the time comprised 15 million books, about 12% of all those ever published. After quality-checking the data, they created a version that captures a third of that corpus and put it online so that anyone can search for patterns of cultural change reflected in the frequency with which words or phrases rise and fall in the corpus over time. Michel and colleagues call it culturomics, the quantitative exploration of massive data sets of digitized culture.
Before you read further, you might enjoy a few minutes playing with the raw data yourself. (Warning: It is addictive.) Go ahead and type in the full names of your scientific heroes and the scientific concepts associated with them. Note that the search is case-sensitive, so “special relativity” is fine, but you should enter “Albert Einstein” rather than “albert einstein.” The plots you’ll see show the frequency of those names and phrases in the pages of all books published each year between the dates you choose.
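The mechanics behind such plots can be sketched in a few lines: for each year, count how often the phrase occurs among all n-gram windows in that year’s books, matching case-sensitively, and normalize. The corpus below is a toy stand-in, not Google’s data.

```python
from collections import defaultdict

# Toy stand-in for the books corpus: (publication year, text) pairs.
corpus = [
    (1905, "Einstein published special relativity in 1905"),
    (1916, "general relativity extended special relativity"),
    (1916, "the theory of relativity was debated widely"),
]

def ngram_frequency(corpus, phrase):
    """Relative frequency of a phrase per year, matched case-sensitively."""
    n = len(phrase.split())
    counts, totals = defaultdict(int), defaultdict(int)
    for year, text in corpus:
        tokens = text.split()
        totals[year] += max(len(tokens) - n + 1, 1)  # n-gram windows in this text
        for i in range(len(tokens) - n + 1):
            if " ".join(tokens[i:i + n]) == phrase:
                counts[year] += 1
    return {year: counts[year] / totals[year] for year in totals}

print(ngram_frequency(corpus, "special relativity"))  # nonzero in 1905 and 1916
print(ngram_frequency(corpus, "Special Relativity"))  # all zeros: case matters
```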
Culturomics: Buzzword or Science?
There has been almost as much discussion of the word-formation strategy underlying the new term culturomics as there has been of the concept itself. Though it looks as if it is modelled on terms like economics and ergonomics, the clue to its derivation is that it is in fact pronounced with a long ‘o’ vowel, as in ‘home’. Culturomics is modelled on the term genomics – the study of gene sequences within living organisms. Such gene sequences are referred to as genomes, and hence, by extension, culturomics makes reference to culturomes.
The suffix -omics, which started with the term genomics, appears to be becoming increasingly productive, spawning a raft of related terms in the bio-sciences. Its use in culturomics is its first significant excursion into a domain outside of biology.
In January 2011 the word culturomics was voted the 2010 neologism ‘least likely to succeed’ by the American Dialect Society. Nevertheless, the term has sparked a flurry of discussion in recent months, engaging the attention of high-profile linguists across the world. With its strong claims about the link between language use and culture, the concept of culturomics is worth a second look.
In late 2010 it was announced that a team of researchers at Harvard University had put together a corpus of 500 billion words from a Google scanning project of more than 5 million texts. This represented an unprecedentedly large body of language data (including German, French, Spanish, Russian, Chinese and Hebrew as well as English), thought to constitute in the region of 4% of all books ever printed. Alongside it, they developed a tool, the n-gram viewer, which produces a graphic representation of the frequency of occurrence of a word or sequence of words (an n-gram) within the corpus over the 200 years between 1800 and 2000.
This experimental tool proves very interesting, because it can give us information about the evolution of language within a 200-year timeframe, indicating what lexical, or indeed grammatical, changes occurred. This graph, showing the comparative use of past participles dreamed and dreamt, is one example of a grammatical swing.
But what’s particularly significant is that the graphs can also give us a very engaging insight into the way language use depicts historical and cultural hotspots. The graph for the word ration, for example, has a dramatic peak in the early 1940s (World War II and aftermath), and a similar effect can be seen for nuclear (1980s) and Beatles (1960s). Other words show a more gradual ascent (e.g. mobile), or decline (e.g. petticoat).
Though by no means a new idea, this kind of language analysis, showing the intersection of language use and concepts of cultural significance, has been dubbed culturomics by the authors of this research. They argue that the graphs are windows into a broad spectrum of evolving cultures – such as for example fame, ethical issues, politics, religion or the adoption of technology. The words themselves, they claim, somehow represent a chunk of our cultural make-up, which they refer to as a culturome.
However, though these graphs can indeed be fascinating, as with all such electronic language tools, the results are not perfect, and the cracks really begin to show when you ask the tool to analyse newer words. Searches for the words podcast and webinar, for example, suggest that they were thriving way back in the late 18th and early 19th centuries! Such anomalies can be put down to a variety of problems, including incorrect dating of sources, typographical errors, the limitations of OCR (scanning) technology (which can skew data by favouring certain text types), and the basic fact that computational tools just can’t interpret the data like humans can (e.g. a ‘spike’ for podcast might have occurred simply because the words pod and cast happened to appear next to each other in a sufficient number of late 18th century texts).
Nevertheless it’s a clever exploitation of a large body of language data, and though its main assertion is something we always knew anyway – that cultural and historical events are reflected in language use – there’s something extremely engaging and slightly addictive about seeing facts brought to life in graph form.
Linguists and lexicographers have expressed skepticism regarding the methods and results of some of these studies, including one by Petersen et al (“When physicists do linguistics”, Ben Zimmer, Boston Globe, February 10, 2013).
Word play: Half a trillion words
Using statistical and computational techniques to analyse vast quantities of data in historical and linguistic research is nothing new — the fields known as quantitative history and quantitative linguistics already do this. But it is the sheer volume of the database created by Google Books that sets the new work apart.
So far, Google has digitized more than 15 million books, representing about 12% of all those ever published in all languages. Michel and his colleagues performed their analyses on just a third of this sample, selected for the good quality of the optical character recognition in the digitization and the reliability of information about a book’s provenance, such as the date and place of publication.
The resulting data set contained over 500 billion words. This is far more than any single person could read: a fast reader would, without breaks for food and sleep, need 80 years to finish the books for the year 2000 alone.
Not all isolated strings of characters in texts are real words. Some are numbers, abbreviations or typos. In fact, 51% of the character strings in 1900, and 31% in 2000, were ‘non-words’. “I really have trouble believing that,” admits Embleton. “If it’s true, it would really shake some of my foundational thoughts about English.”
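The notion of a ‘non-word’ can be illustrated with a rough sketch. The study’s actual lexicon criteria were frequency-based and more involved; here a 1-gram type simply counts as a non-word if it contains non-letter characters or is absent from a tiny, invented word list.

```python
import re

# Tiny invented lexicon; the real study used frequency-based criteria
# over the whole corpus rather than a fixed word list.
LEXICON = {"the", "culture", "of", "books", "word"}

def nonword_share(tokens):
    """Fraction of distinct 1-gram types that are not recognizable words."""
    types = set(tokens)
    nonwords = {t for t in types
                if not re.fullmatch(r"[a-z]+", t.lower())
                or t.lower() not in LEXICON}
    return len(nonwords) / len(types)

tokens = ["the", "culture", "of", "books", "xq7", "teh", "1900"]
print(round(nonword_share(tokens), 3))  # 3 of 7 types are non-words
```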
According to this account, the English language has grown by more than 70% during the past 50 years, and around 8,500 new words are being added each year. Moreover, only about half of the words currently in use are apparently documented in standard dictionaries. “That high amount of lexical ‘dark matter’ is also very hard to believe, and would also shake some foundations,” says Embleton. “I’d love to see the data.”
In principle she already can, because the researchers have made their database public at www.culturomics.org. This will allow others to explore the huge number of potential questions it suggests, not just about word use but about cultural history. Michel and colleagues offer two such examples, concerned with fame and censorship.
They say that actors reach their peak of fame, as recorded in references to names, around the age of 30, while writers take a decade longer but achieve a higher peak. “Science is a poor route to fame,” they add. Physicists and biologists who achieve fame do so only late in life, and “even at their peak, mathematicians tend not to be appreciated by the public”.
Box: A discipline goes digital
The digital humanities — the use of algorithms to search for meaning in databases of text and other media — have been around for decades. Some trace the field’s origins to Roberto Busa, an Italian priest who, in the late 1940s, teamed up with IBM to produce a searchable index of the works of thirteenth-century theologian Thomas Aquinas.
But the field has taken on new life in recent years. Journals have sprouted up and professional societies are blooming. Some universities are now requiring graduate students in the humanities to take statistics and computer-science courses. Funding — far harder to come by in the humanities than in the sciences — flows slightly more generously to those willing to adopt the new methods. This year, the US National Endowment for the Humanities, in collaboration with the National Science Foundation and research institutions in Canada and Britain, plans to hand out 20 grants in the digital humanities, worth a total of US$6 million.
Many researchers in the digital humanities use textual databases composed primarily of books — as Erez Lieberman Aiden does in his ‘culturomics’ project (see ‘Heavy-duty data’). Franco Moretti, a literary scholar at Stanford University in California, has shown that genres of fiction — Gothic novels, for example, or romance — have a textual ‘fingerprint’ that is apparent even in simple frequency counts of nouns, verbs and prepositions. “These genres are different at every scale,” he says, “not only in the huge scene of being held captive by a Count.”
Some researchers are busy digitizing other forms of cultural data. John Coleman, a phonetician at the University of Oxford, UK, is putting 5 million spoken words — about 3 months of speech, end to end — into a database, down to the level of the individual phonemes. Collected largely as recordings made with Sony Walkmen in the 1990s, it contains all sorts of things typically ignored by linguists: neologisms, slurring and sub-verbal honks and snorts. Coleman is already learning how conversation partners take pacing cues from each other, and how pitch of voice reflects attitude. And, he says, he can prove that women and men talk at the same speed. The linguistics textbooks, he says, “are going to have to be rewritten”.
Ichiro Fujinaga, a music technologist at McGill University in Montreal, Canada, is trying to do something similar for music. In a project known as SALAMI (Structural Analysis of Large Amounts of Music Information), Fujinaga is finding the common structural patterns (such as verse–chorus) in 350,000 pieces of music from all over the world. With more than 7,000 hours of Grateful Dead recordings in the database, he says, his team will be able to answer the all-important question: “Did the guitar solos get extended over the years or did they get shorter?”
In a study called Culturomics 2.0, Kalev H. Leetaru examined news archives including print and broadcast media (television and radio transcripts) for words that imparted tone or “mood” as well as geographic data. The research was able to retroactively predict the 2011 Arab Spring and successfully estimate the final location of Osama Bin Laden to within 124 miles.
In a 2012 paper, Alexander M. Petersen and co-authors found a “dramatic shift in the birth rate and death rates of words”: deaths have increased and births have slowed. The authors also identified a universal “tipping point” in the life cycle of new words, about 30 to 50 years after their origin, at which they either enter the long-term lexicon or fall into disuse.
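A crude version of tracking word births and deaths might look like the sketch below, where a word is ‘born’ in the first year its relative frequency exceeds a threshold and ‘dies’ after the last such year. Petersen et al.’s definitions were more careful, and the frequency series here is invented.

```python
def lifespan(series, threshold=1e-9):
    """(birth, death): first and last year a word's relative frequency
    exceeds the threshold, or None if it never does."""
    years = [y for y, f in sorted(series.items()) if f > threshold]
    return (years[0], years[-1]) if years else None

# Invented yearly frequency series for a hypothetical word.
series = {1900: 0.0, 1950: 2e-8, 1980: 5e-7, 2000: 0.0}
print(lifespan(series))  # (1950, 1980)
```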
In a 2014 paper by S. Roth, culturomic analysis is used to trace the decline of religion, the rise of politics, and the relevance of the economy to modern societies, with one of the major results being that modern societies do not appear to be capitalist or economized. This is likely the first application of culturomics in sociology.
Culturomic approaches have been taken in the analysis of newspaper content in a number of studies by I. Flaounas and co-authors. These studies showed macroscopic trends across different news outlets and countries. In 2012, a study of 2.5 million articles suggested that gender bias in news coverage depends on topic, and that the readability of newspaper articles is related to topic. A separate study by the same researchers, covering 1.3 million articles from 27 countries, showed macroscopic patterns in the choice of stories to cover. In particular, countries made similar choices when they were related by economic, geographical and cultural links. The cultural links were revealed by the similarity in voting for the Eurovision song contest. This study was performed on a vast scale, using statistical machine translation, text categorisation and information extraction techniques.
The possibility of detecting mood shifts in a vast population by analysing Twitter content was demonstrated in a study by T. Lansdall-Welfare and co-authors. The study considered 84 million tweets generated by more than 9.8 million users in the United Kingdom over a period of 31 months, showing how public sentiment in the UK changed with the announcement of spending cuts.
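The underlying method can be sketched as simple word-list scoring: count positive and negative words per time slice and normalize by the total word count. The actual study used the WordNet-Affect lexicon and tens of millions of tweets; the word lists and tweets below are invented.

```python
# Invented word lists; the study used the WordNet-Affect lexicon.
POSITIVE = {"happy", "great", "good", "joy"}
NEGATIVE = {"cuts", "angry", "bad", "fear"}

def mood_score(tweets):
    """(positive - negative) word count, normalized by total word count."""
    pos = neg = total = 0
    for tweet in tweets:
        for word in tweet.lower().split():
            total += 1
            pos += word in POSITIVE
            neg += word in NEGATIVE
    return (pos - neg) / total if total else 0.0

before = ["such a great day", "good news everyone"]
after = ["spending cuts announced", "bad news fear ahead"]
print(mood_score(before) > mood_score(after))  # True: mood drops after the cuts
```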
In a 2013 study by S. Sudhahar and co-authors, the automatic parsing of textual corpora enabled the extraction of actors and their relational networks on a vast scale, turning textual data into network data. The resulting networks, which can contain thousands of nodes, are then analysed using tools from network theory to identify the key actors, the key communities or parties, and general properties such as robustness or structural stability of the overall network, or centrality of certain nodes.
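The pipeline from text to network can be sketched as follows: each parsed sentence yields a set of co-occurring actors, co-occurrences become weighted edges, and a centrality score identifies the key actors. The actor sets below are invented, and weighted degree stands in for the richer network measures used in the study.

```python
from collections import defaultdict
from itertools import combinations

# Invented output of a parser: the set of actors named in each sentence.
sentence_actors = [
    {"Obama", "Romney"},
    {"Obama", "Congress"},
    {"Obama", "Romney"},
]

def build_network(actor_sets):
    """Undirected co-occurrence network as a weighted adjacency dict."""
    graph = defaultdict(lambda: defaultdict(int))
    for actors in actor_sets:
        for a, b in combinations(sorted(actors), 2):
            graph[a][b] += 1
            graph[b][a] += 1
    return graph

def weighted_degree(graph):
    """Sum of edge weights at each node, a simple centrality measure."""
    return {node: sum(nbrs.values()) for node, nbrs in graph.items()}

graph = build_network(sentence_actors)
print(weighted_degree(graph))  # Obama is the most central actor
```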
In a 2014 study by S. Sudhahar, T. Lansdall-Welfare and co-authors, 5 million news articles were collected over 5 years and then analyzed, revealing a significant shift in sentiment in the coverage of nuclear power that corresponded with the Fukushima disaster. The study also extracted the concepts associated with nuclear power before and after the disaster, explaining the change in sentiment as a change in narrative framing.
Google Labs’ Ngram Viewer
TED: Erez Lieberman Aiden and Jean-Baptiste Michel’s talk gives an overview of Google Labs’ Ngram Viewer
- Buzzword: Culturomics. http://www.macmillandictionary.com/buzzword/entries/culturomics.html
- Avalanches of Words, Sifted and Sorted. New York Times, March 25, 2012. http://www.nytimes.com/2012/03/25/business/words-by-the-millions-sorted-by-software.html
- Lieberman, Erez, Jean-Baptiste Michel, Joe Jackson, Tina Tang, and Martin A. Nowak. 2007. “Quantifying the Evolutionary Dynamics of Language.” Nature 449 (7163): 713–16. doi:10.1038/nature06137. (Original Paper)
- Lieberman-Aiden, E., N. L. van Berkum, L. Williams, M. Imakaev, T. Ragoczy, A. Telling, I. Amit, et al. 2009. “Comprehensive Mapping of Long-Range Interactions Reveals Folding Principles of the Human Genome.” Science 326 (5950): 289–93. doi:10.1126/science.1181369. (Original Paper)
- Michel, J.-B., Y. K. Shen, A. P. Aiden, A. Veres, M. K. Gray, J. P. Pickett, D. Hoiberg, et al. 2011. “Quantitative Analysis of Culture Using Millions of Digitized Books.” Science 331 (6014): 176–82. doi:10.1126/science.1199644. (Original Paper)
- Hand, Eric. 2011. “Culturomics: Word Play.” Nature 474 (7352): 436–40. doi:10.1038/474436a.
- Bohannon, J. 2010. “Google Opens Books to New Cultural Studies.” Science 330 (6011): 1600–1600. doi:10.1126/science.330.6011.1600.
- Cohen, Patricia (16 December 2010). “In 500 Billion Words, New Window on Culture”. New York Times.
- Ginsparg, Paul. 2011. “ArXiv at 20.” Nature 476 (7359): 145–47. doi:10.1038/476145a.
- Leetaru, Kalev H. (5 September 2011). “Culturomics 2.0: Forecasting Large-Scale Human Behavior Using Global News Media Tone In Time And Space”. First Monday 16 (9).
- Quick, Darren (7 September 2011). “Culturomics research uses quarter-century of media coverage to forecast human behavior”. Gizmag.com. Retrieved 9 September 2011.
- Petersen, Alexander M. (15 March 2012). “Statistical Laws Governing Fluctuations in Word Use from Word Birth to Word Death”. Scientific Reports. doi:10.1038/srep00313.
- Petersen, Alexander M.; Tenenbaum, Joel; Havlin, Shlomo; Stanley, H. Eugene; Perc, Matjaz (10 December 2012). “Languages cool as they expand: Allometric scaling and the decreasing need for new words”. Scientific Reports. doi:10.1038/srep00943.
- Christopher Shea, “The New Science of the Birth and Death of Words”, Wall Street Journal, March 16, 2012
- Roth, S. (2014), “Fashionable functions. A Google ngram view of trends in functional differentiation (1800-2000)”, International Journal of Technology and Human Interaction, Vol. 10, No. 2, pp. 34-58 (online: http://ssrn.com/abstract=2491422).
- Lansdall-Welfare TO, Sudhahar S, Veltri GA, Cristianini N. On the Coverage of Science in the Media: A Big Data Study on the Impact of the Fukushima Disaster. In: Big Data (Big Data), 2014 IEEE International Conference on. IEEE, New York. 2014. p. 60-66.
- Flaounas, Ilias, Marco Turchi, Omar Ali, Nick Fyson, Tijl De Bie, Nick Mosdell, Justin Lewis, and Nello Cristianini. 2010. “The Structure of the EU Mediasphere.” Edited by Diego Di Bernardo. PLoS ONE 5 (12): e14243. doi:10.1371/journal.pone.0014243.
- Aiden, E. L., J. P. Pickett, and J.-B. Michel. 2011. “Culturomics–Response.” Science 332 (6025): 36–37. doi:10.1126/science.332.6025.36-a.
- Bohannon, John. 2011. “Google Books, Wikipedia, and the Future of Culturomics.” Science 331 (6014): 135–135. doi:10.1126/science.331.6014.135.
- Evans, James A., and Jacob G. Foster. 2011. “Metaknowledge.” Science 331 (6018): 721–25. doi:10.1126/science.1201765.
- Hurtley, Stella. 2011. “Books, Books, and More Books.” Science 331 (6014): 122–122. doi:10.1126/science.331.6014.122-d.
- “Random Sample.” 2012. Science 335 (6075): 1423–24. doi:10.1126/science.335.6075.1423-a.
- Spinney, Laura. 2012. “Human Cycles: History as Science.” Nature 488 (7409): 24–26. doi:10.1038/488024a.
- Hughes, J. M., N. J. Foti, D. C. Krakauer, and D. N. Rockmore. 2012. “Quantitative Patterns of Stylistic Influence in the Evolution of Literature.” Proceedings of the National Academy of Sciences 109 (20): 7682–86. doi:10.1073/pnas.1115407109.
- Twenge, Jean M, W Keith Campbell, and Brittany Gentile. 2012. “Increases in Individualistic Words and Phrases in American Books, 1960-2008.” PloS One 7 (7): e40181. doi:10.1371/journal.pone.0040181.
- Schich, M., C. Song, Y.-Y. Ahn, A. Mirsky, M. Martino, A.-L. Barabasi, and D. Helbing. 2014. “A Network Framework of Cultural History.” Science 345 (6196): 558–62. doi:10.1126/science.1240064. (Original Paper)
- Sudhahar, S., G. a. Veltri, and N. Cristianini. 2015. “Automated Analysis of the US Presidential Elections Using Big Data and Network Analysis.” Big Data & Society 2 (1): 1–28. doi:10.1177/2053951715572916.
- Sudhahar, Saatviga, Gianluca De Fazio, Roberto Franzosi, and Nello Cristianini. 2015. “Network Analysis of Narrative Content in Large Corpora.” Natural Language Engineering 21 (01): 81–112. doi:10.1017/S1351324913000247.
- Hayes, Brian (May–June 2011). “Bit Lit”. American Scientist 99 (3): 190. doi:10.1511/2011.90.190.
- Letcher, David W. (April 6, 2011). “Culturomics: A New Way to See Temporal Changes in the Prevalence of Words and Phrases”. American Institute of Higher Education 6th International Conference Proceedings 4 (1): 228. [http://www.amhighed.com/documents/charleston2011/AIHE2011_Proceedings.pdf#page=228]
- Ben Zimmer, “When physicists do linguistics”, Boston Globe, February 10, 2013.[http://bostonglobe.com/ideas/2013/02/10/when-physicists-linguistics/ZoHNxhE6uunmM7976nWsRP/story.html]
- Erez Lieberman Aiden
- home page of Erez Lieberman Aiden. http://www.erez.com/
- Culturomics. http://www.culturomics.org/
- Publications. http://www.erez.com/Science/Publications
- Lieberman, E., Hauert, C., and Nowak, M.A., 2005. Evolutionary dynamics on graphs. Nature, 433 (7023), 312–316.
- Ohtsuki, H., Hauert, C., Lieberman, E., and Nowak, M.A., 2006. A simple rule for the evolution of cooperation on graphs and social networks. Nature, 441 (7092), 502–505.