Culturomics: Science in Culture

“There’s been this humongous black box: ‘What exactly is the role that structure plays in the genome?’ Now we can start looking.”


Zhiwen Hu
Reference: Zhiwen Hu. Culturomics: Science in Culture. Jan. 19, 2016. http://orcp.hustoj.com/2016/01/19/culturomics-2010/.
Culturomics (文化组学) is a portmanteau of “culture” and “genomics”. It refers to the use of mathematical methods to analyse massive bodies of data from the web in order to study the development and evolution of human culture. Taking its cue from genomics, in which researchers analyse huge data sets to study how genes function and change, a team of researchers designed a tool that yields quantitative data on how culture changes over time: it applies the same large-scale approach to the frequency of word use over time, and uses the results to observe trends in human thought and culture.

What’s Culturomics?

Culturomics (/ˌkʌltʃəˈrɒmɪks/) is commonly defined as the application of high-throughput data collection and analysis to the study of human culture.

The digitization of books by Google Books has sparked controversy over issues of copyright and book sales, but for linguists and cultural historians this vast project could offer an unprecedented treasure trove.

COVER. Quantifying the evolutionary dynamics of language, E. Lieberman, J.-B. Michel, J. Jackson, T. Tang & M. A. Nowak, Nature 449, 713–716 (11 October 2007). An unusual paper for Nature perhaps. A calculation of the rate at which a language grows more regular, based on 1,200 years of English usage. The trend follows a simple rule: a verb’s half-life scales as the square root of its frequency. Irregular verbs that are 100 times as rare regularize 10 times faster. Exceptional forms are gradually lost. Next to go, and next to tumble in the cover ‘hour-glass’, is the word ‘wed’. [doi:10.1038/nature06137]
COVER. First described by David Hilbert in 1891, the Hilbert curve is a one-dimensional fractal trajectory that densely fills higher-dimensional space without crossing itself. A new method for reconstructing the three-dimensional architecture of the human genome (Lieberman-Aiden, E., N. L. van Berkum, L. Williams, M. Imakaev, T. Ragoczy, A. Telling, I. Amit, et al. 2009. “Comprehensive Mapping of Long-Range Interactions Reveals Folding Principles of the Human Genome.” Science 326 (5950): 289–293. ), reveals a polymer analog of Hilbert’s curve at the megabase scale.
COVER. Tens of thousands of books appear in this photograph of the interior of the sculpture Idiom, by Matej Krén. Michel et al. describe an even larger collection: a 5.2-million-book corpus containing 4% of all books ever published. Statistical analysis of this corpus makes it possible to study cultural trends quantitatively (Michel, J.-B., Y. K. Shen, A. P. Aiden, A. Veres, M. K. Gray, J. P. Pickett, D. Hoiberg, et al. 2011. “Quantitative Analysis of Culture Using Millions of Digitized Books.” Science 331 (6014): 176–82).

Researchers at Harvard University in Cambridge, Massachusetts, and the Google Books team in Mountain View, California, herald a new discipline called culturomics, which sifts through this literary bounty for insights into trends in what cultures can and will talk about through the written word.

  • “The possibilities with such a new database, and the ability to analyse it in real time, are really exciting,” says linguist Sheila Embleton of York University in Toronto, Canada.
  • “Quantitative analysis of this kind can reveal patterns of language usage and of the salience of a subject matter to a degree that would be impossible by other means,” agrees historian Patricia Hudson of Cardiff University, UK.
  • “The really great aspect of all this is using huge databases, but they will have to be used in careful ways, especially considering alternative explanations and teasing out the differences in alternatives from the database,” adds Royal Skousen, a linguist at Brigham Young University in Provo, Utah. “I do not like the term ‘culturomics’,” he adds. “It smacks too much of ‘freakonomics’, and both terms smack of amateur sociology.”
  • “The ability, via modern technology, to look at just so much at once really opens horizons,” says Embleton. However, Hudson cautions that making effective use of such a resource will require skill and judgement, not just number-crunching. “How this quantitative evidence is generated and how it is interpreted are the most important factors in forming conclusions,” she says. “Quantitative evidence of this kind must always address suitably framed general questions, and be employed alongside qualitative evidence and reasoning, or it will not be worth a great deal.”
  • “But human categorization can only go so far,” said Dr. Blei, an associate professor in computer science. “We don’t have the human power to read and tag all this information.”
  • “But these tools have enormous implications,” said Anthony T. Grafton, a history professor at Princeton and a former president of the American Historical Association, given their ability to reveal unexpected patterns and associations in the historical record. “These are tools that can pick up big changes,” he said. “You can’t do this by using older, conventional means of reading books and taking notes.”
  • BOOKWORM-ARXIV will burrow its way through data stored in roughly 743,000 papers that have been uploaded by scientists, said Paul Ginsparg, founder of arXiv. Authors typically send their papers to arXiv as “preprints” or unpublished manuscripts before the works appear in journals. Most of the research is in physics, mathematics, computer science, statistics and the quantitative parts of biology and finance, said Dr. Ginsparg, a professor of physics and information science at Cornell. Readers who use the arXiv interface will be able to click through to the original text. “The papers are not behind a paywall,” Dr. Ginsparg said. Dr. Ginsparg says he has already been trying out the topic modeling algorithms of Dr. Blei and others on arXiv’s collection of scientific papers. “The technology gives you a way to home in on finer grains of similarity between articles,” Dr. Ginsparg said, “ones that might not be detected by a keyword search.”
  • Steven Pinker, cognitive scientist and Harvard professor, collaborated with the observatory team during development of the n-gram viewer. He said the new interface might be particularly useful for historians of science. “They will increasingly be able to test their explanations, conjectures and hypotheses by looking at the rise and fall of phrases in the scientific literature,” he said.
  • Dr. Erez Lieberman Aiden said the worlds of Google’s scanned books and arXiv’s papers were just the beginning for the observatory. “We plan on moving on soon to newspapers, blogs, tweets and other aspects of the historical record,” he said.
Culturomic analyses study millions of books at once. (A) Top row: Authors have been writing for millennia; ~129 million book editions have been published since the advent of the printing press (upper left). Second row: Libraries and publishing houses provide books to Google for scanning (middle left). Over 15 million books have been digitized. Third row: Each book is associated with metadata. Five million books are chosen for computational analysis (bottom left). Bottom row: A culturomic timeline shows the frequency of “apple” in English books over time (1800–2000). (B) Usage frequency of “slavery”. The Civil War (1861–1865) and the civil rights movement (1955–1968) are highlighted in red. The number in the upper left (1e-4 = 10⁻⁴) is the unit of frequency. (C) Usage frequency over time for “the Great War” (blue), “World War I” (green), and “World War II” (red).

Culturomics applies computational analysis, specifically the quantitative analysis of digitized text, to the study of cultural and social trends. The term was coined in the Science paper, published online on 17 December 2010 (and in print in the 14 January 2011 issue), in which Erez Lieberman Aiden and his colleagues presented the research behind Google’s Ngram Viewer. Culturomics extends the boundaries of rigorous quantitative inquiry to a wide array of new phenomena spanning the social sciences and the humanities.

Erez Lieberman Aiden

A team led by Jean-Baptiste Michel and Erez Lieberman Aiden created a data set based on the trillions of words within Google Books, which currently represents 15 million books, about 12% of all those ever published. After quality-checking the data, they created a version that captures a third of that corpus and put it online so that anyone can search for patterns of cultural change reflected in the frequency with which words or phrases rise and fall in the corpus over time. Michel and colleagues call it culturomics: the quantitative exploration of massive data sets of digitized culture.

Before you read further, you might enjoy a few minutes playing with the raw data yourself. (Warning: it is addictive.) Go ahead and type in the full names of your scientific heroes and the scientific concepts associated with them. Note that the search is case-sensitive, so “special relativity” is fine, but you should enter “Albert Einstein” rather than “albert einstein.” The plots you’ll see show the frequency of those names and phrases in the pages of all books published each year between the dates you choose.
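For readers who prefer scripting to the web interface, the same time series can be fetched programmatically. Below is a minimal sketch in Python, assuming the unofficial JSON endpoint that backs the Ngram Viewer web page (https://books.google.com/ngrams/json); that endpoint is undocumented, so the parameter names used here are assumptions and may change without notice.

```python
# Minimal sketch: fetch an n-gram time series programmatically.
# Assumes the unofficial JSON endpoint behind the Ngram Viewer web page;
# it is undocumented, so these parameter names may change without notice.
import requests

def ngram_series(phrase, start=1800, end=2000, corpus="en-2019"):
    resp = requests.get(
        "https://books.google.com/ngrams/json",
        params={
            "content": phrase,      # case-sensitive, e.g. "Albert Einstein"
            "year_start": start,
            "year_end": end,
            "corpus": corpus,       # corpus identifier (an assumption here)
            "smoothing": 0,         # raw yearly values, no moving average
        },
        timeout=30,
    )
    resp.raise_for_status()
    data = resp.json()
    # Each result carries a "timeseries" of per-year relative frequencies.
    return data[0]["timeseries"] if data else []

freqs = ngram_series("Albert Einstein")
print(f"{len(freqs)} yearly values, peak frequency {max(freqs):.3e}")
```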

Culturomics: Buzzword or Science?

There has been almost as much discussion of the word-formation strategy underlying the new term culturomics as there has of the concept itself. Though it looks as if it is modelled on terms like economics and ergonomics, the clue to its derivation is that it is in fact pronounced with a long ‘o’ vowel, as in ‘home’. Culturomics is modelled on the term genomics – the study of gene sequences within living organisms. Such gene sequences are referred to as genomes, and hence, by extension, culturomics makes reference to culturomes.

Since its debut in the term genomics, the suffix -omics has become increasingly productive, spawning a raft of related terms in the biosciences. Its use in culturomics is its first significant migration into a domain outside biology.

In January 2011 the American Dialect Society voted culturomics the neologism of 2010 ‘least likely to succeed’. Nevertheless the term has sparked a flurry of discussion in recent months, engaging the attention of high-profile linguists across the world. With its strong claims about the link between language use and culture, the concept of culturomics is worth a second look.

In late 2010 it was announced that a team of researchers at Harvard University had put together a corpus of 500 billion words from a Google scanning project covering more than 5 million texts. This represented an unprecedentedly large body of language data (including German, French, Spanish, Russian, Chinese and Hebrew as well as English), thought to constitute in the region of 4% of all books ever printed. Alongside it, they developed a tool, the n-gram viewer, which produces a graphic representation of the frequency of occurrence of a word or sequence of words (an n-gram) within the corpus over the 200 years between 1800 and 2000.
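The computation behind each curve is conceptually simple: for every year, count the occurrences of an n-gram and divide by the total number of n-grams printed that year. Here is a minimal sketch of that normalisation, using a toy year-bucketed corpus in place of the real 500-billion-word data set:

```python
from collections import Counter

def ngrams(tokens, n):
    """Yield successive n-grams (as space-joined strings) from a token list."""
    for i in range(len(tokens) - n + 1):
        yield " ".join(tokens[i:i + n])

def yearly_frequencies(corpus_by_year, phrase):
    """Relative frequency of `phrase` per year: count / total n-grams that year."""
    n = len(phrase.split())
    series = {}
    for year, texts in corpus_by_year.items():
        counts, total = Counter(), 0
        for text in texts:
            for gram in ngrams(text.split(), n):
                counts[gram] += 1
                total += 1
        series[year] = counts[phrase] / total if total else 0.0
    return series

# Toy corpus: two "years" of text standing in for millions of books.
corpus = {
    1900: ["the great war began", "the war to end war"],
    1950: ["world war ii ended", "after the second world war"],
}
print(yearly_frequencies(corpus, "war"))  # {1900: 0.333..., 1950: 0.222...}
```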

Google Labs’ Ngram Viewer.

This experimental tool proves very interesting, because it can give us information about the evolution of language within a 200-year timeframe, indicating what lexical, or indeed grammatical, changes occurred. A graph comparing the past forms dreamed and dreamt, for example, shows one such grammatical swing.

But what’s particularly significant is that the graphs can also give us a very engaging insight into the way language use depicts historical and cultural hotspots. The graph for the word ration, for example, has a dramatic peak in the early 1940s (World War II and aftermath), and a similar effect can be seen for nuclear (1980s) and Beatles (1960s). Other words show a more gradual ascent (e.g. mobile), or decline (e.g. petticoat).

Though by no means a new idea, this kind of language analysis, showing the intersection of language use and concepts of cultural significance, has been dubbed culturomics by the authors of this research. They argue that the graphs are windows into a broad spectrum of evolving cultures – such as for example fame, ethical issues, politics, religion or the adoption of technology. The words themselves, they claim, somehow represent a chunk of our cultural make-up, which they refer to as a culturome.

However, though these graphs can indeed be fascinating, as with all such electronic language tools the results are not perfect, and the cracks really begin to show when you ask the tool to analyse newer words. Searches for the words podcast and webinar, for example, suggest that they were thriving way back in the late 18th and early 19th centuries! Such anomalies can be put down to a variety of problems, including incorrect dating of sources, typographical errors, the limitations of OCR (scanning) technology (which can skew data by favouring certain text types), and the basic fact that computational tools cannot interpret the data the way humans can (e.g. a ‘spike’ for podcast might occur simply because the words pod and cast happened to appear next to each other in a sufficient number of late 18th-century texts).
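One practical way to damp isolated spikes of this kind is smoothing, which the Ngram Viewer itself exposes as a parameter: each yearly value is replaced by the mean of a window of neighbouring years. A minimal sketch of that averaging:

```python
def smooth(series, window=3):
    """Replace each value with the mean of itself and up to `window`
    neighbours on each side (clipped at the ends of the series)."""
    out = []
    for i in range(len(series)):
        lo, hi = max(0, i - window), min(len(series), i + window + 1)
        out.append(sum(series[lo:hi]) / (hi - lo))
    return out

# A single-year OCR artefact is damped after smoothing:
spiky = [0.0, 0.0, 1.0, 0.0, 0.0]
print(smooth(spiky, window=1))  # [0.0, 0.333..., 0.333..., 0.333..., 0.0]
```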

Nevertheless it’s a clever exploitation of a large body of language data, and though its main assertion is something we always knew anyway – that cultural and historical events are reflected in language use – there’s something extremely engaging and slightly addictive about seeing facts brought to life in graph form.

Linguists and lexicographers have expressed skepticism regarding the methods and results of some of these studies, including one by Petersen et al. (“When physicists do linguistics”, Ben Zimmer, Boston Globe, February 10, 2013).

Word play: Half a trillion words

At Harvard, Erez Lieberman Aiden and Jean-Baptiste Michel, standing center and right, are among those working on a browser to note changes in language over time. Credit Kris Snibbe/Harvard University.

Using statistical and computational techniques to analyse vast quantities of data in historical and linguistic research is nothing new — the fields known as quantitative history and quantitative linguistics already do this. But it is the sheer volume of the database created by Google Books that sets the new work apart.

So far, Google has digitized more than 15 million books, representing about 12% of all those ever published in all languages. Michel and his colleagues performed their analyses on just a third of this sample, selected for the good quality of the optical character recognition in the digitization and the reliability of information about a book’s provenance, such as the date and place of publication.
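The selection step can be pictured as a simple filter over book records. The sketch below is purely illustrative; the field names and the quality threshold are invented for the example, since the project’s actual quality metrics are internal:

```python
# Illustrative filter over book records; "ocr_quality", "year", "place"
# and the 0.8 threshold are invented names/values for this sketch.
def select_for_analysis(books, min_ocr_quality=0.8):
    """Keep books with acceptable OCR quality and reliable metadata."""
    return [
        b for b in books
        if b.get("ocr_quality", 0.0) >= min_ocr_quality
        and b.get("year") is not None
        and b.get("place") is not None
    ]

books = [
    {"title": "A", "ocr_quality": 0.95, "year": 1900, "place": "London"},
    {"title": "B", "ocr_quality": 0.40, "year": 1900, "place": "Boston"},
    {"title": "C", "ocr_quality": 0.90, "year": None, "place": "Paris"},
]
print([b["title"] for b in select_for_analysis(books)])  # ['A']
```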

The resulting data set contained over 500 billion words. This is far more than any single person could read: a fast reader would, without breaks for food and sleep, need 80 years to finish the books for the year 2000 alone.
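The 80-year figure follows from back-of-envelope arithmetic, assuming a fast, uninterrupted reading pace of roughly 200 words per minute (the pace is an assumption here):

```python
# Back-of-envelope check of the "80 years" claim.
# Assumes a fast, uninterrupted pace of 200 words per minute.
words_per_minute = 200
minutes_per_year = 60 * 24 * 365           # no breaks for food or sleep
words_in_80_years = words_per_minute * minutes_per_year * 80
print(f"{words_in_80_years:.2e}")          # ~8.41e+09 words
```

In other words, the year-2000 slice of the corpus alone would run to roughly eight billion words.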

Not all isolated strings of characters in texts are real words. Some are numbers, abbreviations or typos. In fact, 51% of the character strings in 1900, and 31% in 2000, were ‘non-words’. “I really have trouble believing that,” admits Embleton. “If it’s true, it would really shake some of my foundational thoughts about English.”
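A ‘non-word’ in this sense is any character string that fails basic lexical checks. The sketch below is a crude, toy approximation of such a classifier; the study’s actual criteria were more elaborate, and the tiny lexicon here stands in for a real dictionary:

```python
import re

LEXICON = {"the", "apple", "war", "slavery"}  # toy stand-in for a dictionary

def is_nonword(token):
    """Crude non-word test: numbers, stray punctuation, or lexicon misses.
    A toy approximation, not the study's actual criteria."""
    if not re.fullmatch(r"[A-Za-z]+", token):
        return True   # numbers, dotted abbreviations, OCR debris
    return token.lower() not in LEXICON

tokens = ["apple", "1900", "teh", "war", "&c."]
print([t for t in tokens if is_nonword(t)])  # ['1900', 'teh', '&c.']
```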

According to this account, the English language has grown by more than 70% during the past 50 years, and around 8,500 new words are being added each year. Moreover, only about half of the words currently in use are apparently documented in standard dictionaries. “That high amount of lexical ‘dark matter’ is also very hard to believe, and would also shake some foundations,” says Embleton. “I’d love to see the data.”

In principle she already can, because the researchers have made their database public at www.culturomics.org. This will allow others to explore the huge number of potential questions it suggests, not just about word use but about cultural history. Michel and colleagues offer two such examples, concerned with fame and censorship.

They say that actors reach their peak of fame, as recorded in references to names, around the age of 30, while writers take a decade longer but achieve a higher peak. “Science is a poor route to fame,” they add. Physicists and biologists who achieve fame do so only late in life, and “even at their peak, mathematicians tend not to be appreciated by the public”.

Box: A discipline goes digital

Many researchers in the digital humanities use textual databases composed primarily of books — as Erez Lieberman Aiden does in his ‘culturomics’ project.

The digital humanities — the use of algorithms to search for meaning in databases of text and other media — have been around for decades. Some trace the field’s origins to Roberto Busa, an Italian priest who, in the late 1940s, teamed up with IBM to produce a searchable index of the works of thirteenth-century theologian Thomas Aquinas.

But the field has taken on new life in recent years. Journals have sprouted up and professional societies are blooming. Some universities are now requiring graduate students in the humanities to take statistics and computer-science courses. Funding — far harder to come by in the humanities than in the sciences — flows slightly more generously to those willing to adopt the new methods. This year, the US National Endowment for the Humanities, in collaboration with the National Science Foundation and research institutions in Canada and Britain, plans to hand out 20 grants in the digital humanities, worth a total of US$6 million.

Many researchers in the digital humanities use textual databases composed primarily of books — as Erez Lieberman Aiden does in his ‘culturomics’ project. Franco Moretti, a literary scholar at Stanford University in California, has shown that genres of fiction — Gothic novels, for example, or romance — have a textual ‘fingerprint’ that is apparent even in simple frequency counts of nouns, verbs and prepositions. “These genres are different at every scale,” he says, “not only in the huge scene of being held captive by a Count.”
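A bare-bones version of such a textual fingerprint can be built from the relative frequencies of a fixed set of function words, which needs no part-of-speech tagger at all. The sketch below is a toy illustration of the idea, not a reconstruction of Moretti’s method:

```python
# Toy genre "fingerprint": relative frequencies of fixed function words,
# compared by cosine similarity. Illustrative only, not Moretti's method.
from math import sqrt

FUNCTION_WORDS = ["the", "of", "in", "to", "a", "and", "by", "with", "on", "at"]

def fingerprint(text):
    """Relative frequency of each function word in the text."""
    tokens = text.lower().split()
    n = len(tokens) or 1
    return [tokens.count(w) / n for w in FUNCTION_WORDS]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

gothic = "the castle loomed in the mist and the count waited by the gate"
romance = "she walked with him in the garden and spoke of love at dusk"
print(f"similarity: {cosine(fingerprint(gothic), fingerprint(romance)):.2f}")
```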

Some researchers are busy digitizing other forms of cultural data. John Coleman, a phonetician at the University of Oxford, UK, is putting 5 million spoken words — about 3 months of speech, end to end — into a database, down to the level of the individual phonemes. Collected largely as recordings made with Sony Walkmen in the 1990s, it contains all sorts of things typically ignored by linguists: neologisms, slurring and sub-verbal honks and snorts. Coleman is already learning how conversation partners take pacing cues from each other, and how pitch of voice reflects attitude. And, he says, he can prove that women and men talk at the same speed. The linguistics textbooks, he says, “are going to have to be rewritten”.

Ichiro Fujinaga, a music technologist at McGill University in Montreal, Canada, is trying to do something similar for music. In a project known as SALAMI (Structural Analysis of Large Amounts of Music Information), Fujinaga is finding the common structural patterns (such as verse–chorus) in 350,000 pieces of music from all over the world. With more than 7,000 hours of Grateful Dead recordings in the database, he says, his team will be able to answer the all-important question: “Did the guitar solos get extended over the years or did they get shorter?”

Culturomics 2.0

Birth and death data of notable individuals reveal interactions between culturally relevant locations over two millennia.

In a study called Culturomics 2.0, Kalev H. Leetaru examined news archives including print and broadcast media (television and radio transcripts) for words that imparted tone or “mood” as well as geographic data. The research was able to retroactively predict the 2011 Arab Spring and successfully estimate the final location of Osama Bin Laden to within 124 miles.
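The tone analysis at the heart of such studies is typically lexicon-based: count positive and negative words in each article and normalise. A minimal sketch with a toy lexicon (real studies rely on large curated tone dictionaries):

```python
# Toy tone lexicons; real studies use large curated tone dictionaries.
POSITIVE = {"peace", "agreement", "growth", "stability"}
NEGATIVE = {"crisis", "protest", "violence", "collapse"}

def tone(text):
    """Tone score in [-1, 1]: (positive - negative) / total matched words."""
    tokens = text.lower().split()
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    matched = pos + neg
    return (pos - neg) / matched if matched else 0.0

print(tone("protest and violence erupted amid the crisis"))  # -1.0
```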

In a 2012 paper, Alexander M. Petersen and co-authors found a “dramatic shift in the birth rate and death rates of words”: deaths have increased and births have slowed. The authors also identified a universal “tipping point” in the life cycle of new words about 30 to 50 years after their origin: words either enter the long-term lexicon or fall into disuse.

In a 2014 paper by S. Roth, culturomic analysis is used to trace the decline of religion, the rise of politics, and the relevance of the economy to modern societies, with one of the major results being that modern societies do not appear to be capitalist or economized. This paper is likely the first application of culturomics in sociology.

Culturomic approaches have been taken in the analysis of newspaper content in a number of studies by I. Flaounas and co-authors. These studies showed macroscopic trends across different news outlets and countries. In 2012, a study of 2.5 million articles suggested that gender bias in news coverage depends on topic, and that the readability of newspaper articles is likewise related to topic. A separate study by the same researchers, covering 1.3 million articles from 27 countries, showed macroscopic patterns in the choice of stories to cover. In particular, countries made similar choices when they were related by economic, geographical and cultural links; the cultural links were revealed by similarity in voting for the Eurovision Song Contest. This study was performed on a vast scale, using statistical machine translation, text categorisation and information extraction techniques.

The possibility of detecting mood shifts in a vast population by analysing Twitter content was demonstrated in a study by T. Lansdall-Welfare and co-authors. The study considered 84 million tweets generated by more than 9.8 million users from the United Kingdom over a period of 31 months, showing how public sentiment in the UK changed with the announcement of spending cuts.

In a 2013 study by S. Sudhahar and co-authors, the automatic parsing of textual corpora enabled the extraction of actors and their relational networks on a vast scale, turning textual data into network data. The resulting networks, which can contain thousands of nodes, are then analysed with tools from network theory to identify the key actors, the key communities or parties, and general properties such as the robustness or structural stability of the overall network, or the centrality of certain nodes.
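The pipeline can be pictured in two stages: extract subject–verb–object triples from parsed text, then analyse the resulting actor network. The sketch below assumes the triples have already been extracted (the parsing stage is not reproduced here) and uses the networkx library for the graph analysis:

```python
import networkx as nx

# Assume subject-verb-object triples already extracted by a parser;
# these examples are invented for illustration.
triples = [
    ("government", "announces", "cuts"),
    ("union", "opposes", "cuts"),
    ("union", "criticises", "government"),
    ("press", "reports", "protest"),
    ("protest", "targets", "government"),
]

G = nx.DiGraph()
for subj, verb, obj in triples:
    G.add_edge(subj, obj, relation=verb)

# Key actors by degree centrality.
centrality = nx.degree_centrality(G)
key_actors = sorted(centrality, key=centrality.get, reverse=True)[:3]
print("key actors:", key_actors)

# Communities detected on the undirected projection of the network.
communities = nx.algorithms.community.greedy_modularity_communities(G.to_undirected())
print("communities:", [sorted(c) for c in communities])
```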

In a 2014 study by S. Sudhahar, T. Lansdall-Welfare and co-authors, 5 million news articles collected over 5 years were analyzed, revealing a significant shift in sentiment in the coverage of nuclear power that corresponded with the Fukushima disaster. The study also extracted the concepts associated with nuclear power before and after the disaster, explaining the change in sentiment by a change in narrative framing.

Further Reading

Google Labs’ Ngram Viewer

Have you played with Google Labs’ Ngram Viewer? It’s an addictive tool that lets you search for words and ideas in a database of 5 million books from across centuries. Erez Lieberman Aiden and Jean-Baptiste Michel show us how it works, and a few of the surprising things we can learn from 500 billion words.

TED: Erez Lieberman Aiden and Jean-Baptiste Michel’s talk gives a whole picture of Google Labs’ Ngram Viewer.
