The Concept of Corpus Linguistics Analysis

Corpus linguistics is a branch of study dating back to the 1980s, defined as “an empirical approach to studying language, which uses observations of attested data in order to make generalizations about lexis, grammar, and semantics.” The definition itself points to the use of attested data, better known as corpora, which Sinclair defines as “a collection of pieces of language that are selected and ordered according to explicit linguistic criteria in order to be used as a sample of the language.”

A corpus-based method is an approach that allows researchers to explore language in depth and to examine how one linguistic item relates to others. As Tognini-Bonelli (2010) observes, there are many differences between reading a text and reading a corpus. A text is read line by line, whereas a corpus is typically analyzed by looking at concordances of specific forms across a variety of sources. The text “is an instance of parole while the patterns shown up by corpus evidence yield insights into langue”.
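
To make the contrast concrete, a concordance (often displayed as keyword-in-context, or KWIC, lines) gathers every occurrence of a search term together with its immediate context. The following is a minimal sketch of that idea in Python; the sample sentence and the five-word window are illustrative assumptions, not part of any cited study.

    import re

    def kwic(text, term, window=5):
        # Return keyword-in-context lines: each occurrence of `term`
        # with `window` words of context on either side.
        tokens = re.findall(r"\w+", text.lower())
        lines = []
        for i, tok in enumerate(tokens):
            if tok == term:
                left = " ".join(tokens[max(0, i - window):i])
                right = " ".join(tokens[i + 1:i + 1 + window])
                lines.append(f"{left:>40}  [{term}]  {right}")
        return lines

    # Illustrative sample text, not drawn from the corpora under study.
    sample = ("Racism is not new, but the editorial frames racism "
              "as a tool of the current political scene.")
    for line in kwic(sample, "racism"):
        print(line)

Reading the aligned lines vertically, rather than each sentence in sequence, is precisely what distinguishes corpus reading from text reading.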

Originally compiled on paper and analyzed by hand, corpora are now built in a completely computerized way, which makes it possible to load large portions of text and significantly accelerates analytical work that was previously conducted manually. The Brown Corpus, compiled by Henry Kučera and W. Nelson Francis in the 1960s, a carefully assembled selection of current American English consisting of a million words drawn from a wide variety of sources, was the first modern corpus of the English language. It inspired and laid the foundations for several works of equal caliber: the LOB Corpus (1960s British English), the Australian Corpus of English (Australian English), and the FLOB Corpus (1990s British English). It also influenced publishers that followed this model in compiling English dictionaries (CIDE, 1995; COBUILD, 1995a; LDOCE, 1995; OALD, 1995) and grammars (e.g., COBUILD, 1990), as well as fields beyond pure linguistic investigation. Indeed, linguists began to apply corpus linguistic analysis to other academic and professional domains. As Stubbs asserts, “a sociolinguist might use a corpus of audio-recorded conversations to study relations between social class and accent; a psycholinguist might use the same corpus to study slips of the tongue; and a lexicographer might be interested in the frequency of different phrases.”

The reason behind this study originates mainly from the observation that the media, in the way they report facts, can shape the audience’s perception of, and attitudes towards, a given subject. This view is supported by Richardson, who asserts that news “is never neutral or valueless”, and it finds its foundations in Critical Discourse Analysis. The latter, as described by Fairclough, is distinctive in its view of the relationship between language and society and sees discourse as a form of social practice. Discursive practices may thus have major ideological effects, as they produce and reproduce unequal power relations between social classes, women and men, and ethnic groups through the ways in which they represent things and position people. Most often people are unaware of the interplay of these underlying dynamics, and CDA aims to make these opaque aspects of discourse as social practice more visible.

On the basis of these assumptions, the analysis will essentially be a quantitative data collection, designed to yield a set of relevant elements such as keywords (words that appear significantly more often in the corpus under study) and sentence structures that will be investigated further in the study. Corpus data are in fact essential for accurately describing language use, and have shown how lexis, grammar, and semantics interact. The analysis will follow a corpus-driven approach, and the data set will consist of a total of 40 online editorials (twenty per newspaper) published within a year and a half, principally concerning racism and everything connected to it, collected from two newspapers: an Italian one, La Repubblica, and an American one, The New York Times. The editorials were deliberately collected within such a specific and short period of time in order to highlight how quickly the concept evolved and was shaped by the two fast-moving political scenes. To ensure that the articles cover the topic in all its facets, they were retrieved from each newspaper’s website by entering keywords such as racism, immigration, or migrants (razzismo, immigrazione, and migranti in Italian). The logic behind the work rests on the hypothesis that the media of each nation, in the way they depict the concept of racism and depending on their specific social and political structures, will employ different expressive techniques in terms of topics and discourse construction.
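
Keyness of this kind is normally computed by comparing a word’s relative frequency in the study corpus against a reference corpus. Below is a minimal sketch of one common measure, a smoothed frequency-per-million ratio; the toy corpora and the smoothing constant of 1 are illustrative assumptions, and the scoring used by any particular platform may differ in detail.

    import re
    from collections import Counter

    def tokenize(text):
        return re.findall(r"\w+", text.lower())

    def keywords(focus_text, reference_text, top_n=10, smooth=1.0):
        # Score each word in the focus corpus by the ratio of its smoothed
        # frequency per million tokens to the same figure in the reference
        # corpus; high scores mark words typical of the focus corpus.
        focus, ref = tokenize(focus_text), tokenize(reference_text)
        f_counts, r_counts = Counter(focus), Counter(ref)
        def fpm(count, total):
            return count * 1_000_000 / total
        scores = {
            w: (fpm(c, len(focus)) + smooth)
               / (fpm(r_counts.get(w, 0), len(ref)) + smooth)
            for w, c in f_counts.items()
        }
        return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]

    # Toy data for illustration only.
    study = "racism racism immigration editorial editorial editorial"
    general = "the the the a a editorial news news"
    print(keywords(study, general, top_n=3))

Words that are frequent in both corpora (like function words) score close to 1 and fall to the bottom of the ranking, which is exactly why keyness is more informative than raw frequency alone.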

The software used in the creation and analysis of the corpus is Sketch Engine by Kilgarriff and Rychlý. One of the main advantages of using a corpus manager such as Sketch Engine is that it allows users to upload and compile their own corpora, which can then be accessed at any time on the platform; moreover, it provides a set of tools that make it easier to look into many different aspects of the language and of the corpus in general. Tools such as keywords and wordlists make it possible to see immediately which topics dominate a given corpus; moreover, simply by choosing an item from the list, and with the help of ‘sketch grammars’, which identify “grammatical relationships between collocates”, it is possible to look at its collocates (the words and patterns it most frequently occurs with) or to explore it further, for example by looking at its concordance.

The first step in this kind of work is to open the Sketch Engine platform, log in, and create a corpus. This is done on the left side of the screen by clicking “Create corpus”, giving a name (in this case “The New York Times Corpus” and “La Repubblica Corpus”), choosing the language (here English or Italian, depending on the corpus to be created), and then clicking the “Create” button. The next step is to upload all the articles previously collected and saved locally, in order to build the final corpus used for the analysis.
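
For readers who prefer to script the preparatory step, the sketch below shows one hypothetical way to gather the saved editorials into a single plain-text file per newspaper before uploading; the folder name, file naming, and one-file-per-article layout are assumptions for illustration, not part of the workflow described above.

    from pathlib import Path

    def compile_corpus(article_dir, out_file):
        # Concatenate every saved .txt article in `article_dir` into a
        # single corpus file, separating documents with a blank line,
        # and return the number of articles included.
        paths = sorted(Path(article_dir).glob("*.txt"))
        with open(out_file, "w", encoding="utf-8") as out:
            for p in paths:
                out.write(p.read_text(encoding="utf-8").strip() + "\n\n")
        return len(paths)

    # Hypothetical layout: one folder of saved editorials per newspaper.
    n = compile_corpus("nyt_editorials", "nyt_corpus.txt")
    print(n, "articles compiled into nyt_corpus.txt")

A single consolidated file of this kind can then be uploaded through the “Create corpus” dialog in one step, although uploading the individual article files works equally well.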

The first area investigated is frequency: by clicking the Word list function on the left side of the platform, it is possible to see how many times the term racism occurred in both corpora. After setting the minimum frequency to 5, it was immediately apparent that the term racism and the names of the two main political figures, Salvini and Trump, are among the most frequent words in both lists, occurring respectively 94 and 140 times in the American corpus and 62 and 78 times in the Italian one. Moreover, the software supports the research by means of word sketches, which, as defined by Kilgarriff, are “one-page automatic, corpus-based summaries of a word’s grammatical and collocational behaviour” and are therefore able to outline the main aspects of the word analyzed from a grammatical and lexical point of view.
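
As a rough local analogue of these two steps (purely illustrative, since Sketch Engine computes them server-side over the uploaded corpus), the sketch below builds a frequency list with the same minimum-frequency cut-off of 5 and counts simple window-based co-occurrences. The three-word window is an assumption, and a real word sketch additionally uses grammatical relations and association scores rather than raw co-occurrence counts.

    import re
    from collections import Counter

    def word_list(text, min_freq=5):
        # Frequency list of all tokens occurring at least `min_freq` times,
        # mirroring the minimum-frequency setting used on the platform.
        counts = Counter(re.findall(r"\w+", text.lower()))
        return [(w, c) for w, c in counts.most_common() if c >= min_freq]

    def collocates(text, node, window=3):
        # Count words co-occurring with `node` within +/- `window` tokens,
        # a crude stand-in for the statistics a word sketch reports.
        tokens = re.findall(r"\w+", text.lower())
        cooc = Counter()
        for i, tok in enumerate(tokens):
            if tok == node:
                cooc.update(tokens[max(0, i - window):i])
                cooc.update(tokens[i + 1:i + 1 + window])
        return cooc

Running word_list over each compiled corpus file and collocates with the node word racism (or razzismo) would reproduce, in miniature, the two observations reported above: which items dominate each corpus, and which words cluster around the term under study.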
