Huber on what the corpus reveals about language acquisition

On 24 March 2026, Visiting Research Fellow (VRF) Eva Huber from the Department of Linguistics at the University of Cologne in Germany gave a talk on what a corpus-based approach can reveal about the first language acquisition of Batangas Tagalog (ISO 639-3 [tgl]). Her talk, entitled “A corpus-based approach to explore how children acquire Batangas Tagalog in their natural environment,” was held at Palma Hall Rm. 428. It is the third installment of the 2026 Linguistics Special Lecture Series (LSLS), which features talks from experts on various linguistic topics.

VRF Huber began her talk by highlighting how languages differ in various aspects. She compared her first language, German, with Tagalog, pointing out that while Tagalog places the verb in the first position in a clause, e.g., kumain ang bata ng mangga ‘the child ate a mango’, German verbs appear in the second position, e.g., Das Kind isst einen Apfel ‘the child eats an apple’. Beyond this cross-linguistic variation, languages also exhibit language-internal variation, i.e., dialectal variation. Comparing the Tagalog of Metro Manila with that of the Southern Tagalog provinces, Huber noted that while the former expresses the actor voice imperfective (ongoing actions) with a combination of the infix <um> and the reduplication of the first consonant-vowel (CV) segment, e.g., kumakain tayo ng kanin ‘we are eating rice’, the latter instead uses the prefix nā- with a lengthened a, e.g., nakain tayo ng kanin ‘we are eating rice’. This ever-present variation in language also means that the way children acquire languages varies. For example, cultural differences, especially with regard to how children socialize, may mean that a child in a small rural Batangas town learns Tagalog differently from a child in a major metropolitan area like Manila.

Huber asserted that accounting for this variation in acquisition requires a large amount of child language data recorded in children’s natural environment. This is where a corpus-based approach to language acquisition comes in.

In this approach, children are recorded in an environment with a caregiver, and both their speech and gestures are transcribed as the basis for later quantitative work. During the recordings, children go about their daily lives, resulting in naturalistic data. This is, for instance, how one of the first language acquisition studies in the Philippines was done, with Brother Andrew Gonzalez recording the daily lives of his nephews. What is important to record in this approach is what the child says, what the child hears (i.e., child directed speech), how the child interacts with others (e.g., turn taking), and surrounding speech (i.e., non-child directed speech that the child can overhear). While the approach is advantageous in capturing highly naturalistic data, the corpus cannot directly capture what the child understands, since researchers can only infer understanding from the child’s reactions.

With this approach as the centerpiece of her talk, Huber then discussed the background of her corpus project, including data collection and annotation, some case studies that highlight the results of a corpus-based approach, and her own impressions of what can be understood using corpus data. Her project, called the BaTa project, involved the collaboration of many people and organizations in both Germany and the Philippines. The project site was Coloconto, a barangay in the Municipality of San Juan, Batangas. Huber jested that audiences may be more familiar with Laiya, a famous beach resort in the municipality. She described the place as very rural, with a population of only 800 people.

Within this population, the BaTa project recorded children aged five (5) months to six (6) years, most of them in preschool. The sampling was cross-sectional, with multiple children per age group and up to 59 children in total, but only 29 were recorded throughout the entire period from May 2022 to October 2023. Each recording lasted almost a full day, at around eight (8) hours. The data is also longitudinal: the project members did the recordings five (5) times, every three (3) months.

As the corpus was naturalistic, recording involved visiting the families and setting up cameras with no instructions other than to go about their days as they usually do. Still, Huber reminded the audience that the observer’s paradox remains: at times the participants were aware of the camera, while at other times they forgot about it. Besides the camera, participants were asked to wear microphones, which were placed in a pocket on the children’s T-shirts.

Given this method, a large amount of data was recorded, so decisions had to be made about what to transcribe. Huber and her colleagues sampled only about one (1) hour of recording per day, within three (3) time periods with a sample of 10 minutes each until something appears in the recording. Huber emphasized that the transcription sample was random and was not intentionally manipulated to favor periods of high interaction. Whenever the data was lacking, they added four (4) segments of seven (7) minutes and 30 seconds each at random, again without looking for portions of high interaction, to reach the one-hour quota. By doing so, Huber said, they were able to capture the different activities of the child throughout the day. The recordings were transcribed in ELAN, a linguistic transcription and annotation tool from the Max Planck Institute for Psycholinguistics, with the help of local speakers in Coloconto. The team later added further annotations, including glosses, part-of-speech (PoS) tags, and English translations, so that people who do not speak Tagalog can also use the data. All in all, the data reached around 140 hours of recording, of which half has been glossed and translated. That said, Huber reminded the audience that the data does not consist of constant speech.
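The arithmetic of the sampling scheme described above (three 10-minute samples plus four 7.5-minute segments, giving one hour) can be sketched in code. This is only an illustrative sketch and not the BaTa project's actual procedure: splitting the day into three equal periods, keeping segments from overlapping, and the function name sample_segments are all assumptions, and the "until something appears in the recording" condition is not modeled.

```python
import random

TEN_MIN = 10 * 60          # main sample length in seconds
EXTRA_LEN = 7 * 60 + 30    # extra segment length in seconds


def overlaps(a, b):
    """True if two (start, end) intervals share any time."""
    return a[0] < b[1] and b[0] < a[1]


def sample_segments(total_seconds, add_extra=False, seed=None):
    """Pick random transcription segments from one day's recording.

    Assumption: the day is split into three equal periods and one
    10-minute sample is drawn uniformly at random from each.
    """
    rng = random.Random(seed)
    period = total_seconds // 3
    segments = []
    for i in range(3):
        start = rng.randint(i * period, (i + 1) * period - TEN_MIN)
        segments.append((start, start + TEN_MIN))

    if add_extra:
        # Assumption: the four extra segments may fall anywhere in the
        # day, as long as they do not overlap segments already chosen.
        attempts = 0
        while len(segments) < 7 and attempts < 10_000:
            attempts += 1
            start = rng.randint(0, total_seconds - EXTRA_LEN)
            candidate = (start, start + EXTRA_LEN)
            if not any(overlaps(candidate, s) for s in segments):
                segments.append(candidate)

    return sorted(segments)


# Example: an eight-hour recording day with the extra segments added.
day = sample_segments(8 * 3600, add_extra=True, seed=1)
total_minutes = sum(end - start for start, end in day) / 60
```

Because the start times come from a random number generator rather than from inspecting the recording, the sample cannot favor stretches of high interaction, which is the property Huber emphasized.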

Some of the initial observations that Huber and members of the BaTa project made about Batangas Tagalog were that it has many purong Tagalog ‘pure Tagalog’ or “deep” Tagalog words and uses fewer English lexemes than Metro Manila Tagalog. Phonological differences between Batangas and Metro Manila Tagalog included pronouncing ‘there’ as diyan and not dyan, ‘that over there’ as iyon and not yun, and ‘that’ as iyan and not yan. Morphological differences included the use of the ka- intensifier in Batangas, e.g., kabait si Ineng ‘Ineng is so kind’ instead of napaka- in napakabait ni Ineng with the same meaning. Syntactic differences were the predominance of ay-inversion, i.e., subject-first word order, in Batangas Tagalog, e.g., ang bata ay kumain ‘the child ate’ instead of kumain ang bata with the same meaning, and its use of pronoun-initial structures, e.g., ako’y uminom ng tubig ‘I drank water’ instead of uminom ako ng tubig with the same meaning. There were also some lexical differences, such as the use of ga instead of ba for the question marker, dine instead of dito for ‘here’, and kuyam instead of langgam for ‘ant’.

To better highlight the relevance of the corpus in understanding Batangas Tagalog and its acquisition, Huber then went on to talk about four case studies: (1) on the language environment, (2) on kinship terms, (3) on disfluency markers, and (4) on how children acquire verbs.

On language environment, Huber reminded the audience that while there is an established correlation between language proficiency and the input children receive, i.e., how much speech they hear and from whom, most of the research has been based on data from the United States of America (USA). In the BaTa project, Huber and her team found that the amount and type of input varied strongly among children, but that mothers were the most well represented in child directed speech. She clarified that this might be a consequence of the recording time: since the recordings were mostly done during the day, fathers who usually went to work in the morning were excluded from the recordings. Although the children did interact with family members for the most part, Huber also emphasized that there were stark differences from child to child. That said, they also found that the older the children got, the more input they received from siblings as opposed to parents, and that child directed speech generally declined with age. Huber’s conclusion, however, was that even in this very homogeneous community, the data tended to vary strongly.

On kinship terms, Huber cited the work of Alice Mitchell and Fiona Jordan and discussed the complex interplay between language and society. She pointed out that it is important for children to know how to refer to people in their environment. During the BaTa project, the team found that adults generally take the child as the anchor, i.e., they choose kin terms from the child’s perspective. For example, when talking about the grandmother while speaking to the child, a mother will call her lola instead of mama ‘mother’, since people overwhelmingly take the child’s perspective when talking to children. Huber said this is completely in line with global trends. In fact, it appears that a large proportion of what children hear are kin terms, followed by the names of adults who are not relatives, characters of movies, and other children, with pronouns being relatively infrequent.

On disfluency markers, Huber cited the work of Maria Bardají on Totoli [txe], an indigenous language of Indonesia, and compared the disfluency markers of that language with those of Batangas Tagalog. According to Huber, these disfluency markers are lexical fillers: while fillers such as um, e, and a are not unusual among languages, Tagalog and Totoli use the interrogatives ano and annu ‘what’ instead. Counting the use of ano as a filler, Huber found that ano was more common in adult directed speech and that Totoli annu was more frequent than Tagalog ano, perhaps because the latter is more polyfunctional.

On how children acquire verbs, Huber underscored that verbs are difficult, since children have to associate them with different events and states and also have to learn how many participants each verb needs and what types of participants these are. Additionally, children need to learn the voice marking system of Tagalog early. Huber recalled that, unlike English-speaking children with the passive, Tagalog-speaking children experience both agent and undergoer voices early on. Citing Garcia and Evan Kidd, Huber noted that in Tagalog child directed speech, the undergoer voice tends to be used more than the actor voice and, because of this, the patient voice is acquired earlier than the agent voice. She also observed that the youngest speakers use only a few verbs frequently but that this distribution becomes flatter, i.e., verb types increase with age. For these youngest speakers, motion verbs were more common, but over time, consumption verbs grow less frequent while action verbs grow more frequent. The use of unmarked verbs was also quite high. In child directed speech, Huber observed that the agent voice is actually less frequent and the undergoer voice more frequent, echoing Garcia and Kidd’s findings. As such, she asked: “is the agent voice more difficult?” To answer this, Huber used entropy to see how varied voice use is, with higher entropy meaning more productivity. She observed that in child directed speech, the entropy remains relatively stable, and that as a child grows older, the child’s entropy changes, but not too drastically. In fact, the difference in entropy grows smaller as the children grow older; so, it is not that the agent voice is more difficult to acquire or even acquired later.
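To illustrate how entropy can quantify variation in voice use, here is a minimal sketch using Shannon entropy, H = Σ p × log2(1/p), over the proportions p of each voice category. The report does not say which entropy measure or unit of analysis was used in the talk, so the formula choice, the function name voice_entropy, and the example tokens below are assumptions made purely for illustration, not data from the BaTa corpus.

```python
import math
from collections import Counter


def voice_entropy(voice_labels):
    """Shannon entropy (in bits) of a sequence of voice labels."""
    counts = Counter(voice_labels)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    entropy = 0.0
    for n in counts.values():
        p = n / total
        entropy += p * math.log2(1 / p)
    return entropy


# Hypothetical verb tokens, invented for illustration only
# (AV = agent voice, UV = undergoer voice).
younger_child = ["UV", "UV", "UV", "AV"]
older_child = ["UV", "AV", "UV", "AV"]

print(voice_entropy(younger_child))
print(voice_entropy(older_child))
```

Under this measure, a speaker who uses only one voice scores 0 bits, while a speaker who uses two voices equally often scores 1 bit, so a rising score reflects more varied voice use.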

Using these four case studies and her experiences from the BaTa project, Huber stressed that corpus data can be a reliable way to study children’s speech because the data is naturalistic. At the same time, Huber reminded us that despite the many merits of a corpus-based approach to studying language acquisition, the data is sparse, there are many participants, and such projects can be very expensive and long. Nevertheless, Huber encouraged us to consider adopting corpus-based approaches and eventually producing our own acquisition sketches.

Following her lecture, there was a short question-and-answer portion in which members of the audience asked about the applicability of corpus-based approaches to their own contexts and about the consistency of orthography use during transcription, and expressed appreciation for the study of Batangas Tagalog.

Huber’s talk can be viewed on the Department’s YouTube channel. Stay tuned for more installments of the LSLS by following us on our social media pages.

Published by UP Department of Linguistics

