Using Eye Movements Recorded in the Visual World Paradigm to Explore the Online Processing of Spoken Language

Likan Zhan

doi:10.3791/58086

A subscription to JoVE is required to view this content. Sign in or start your free trial.

Summary

The visual world paradigm monitors participants' eye movements in the visual workspace as they are listening to or speaking a spoken language. This paradigm can be used to investigate the online processing of a wide range of psycholinguistic questions, including semantically complex statements, such as disjunctive statements.

Abstract

In a typical eye tracking study using the visual world paradigm, participants' eye movements to objects or pictures in the visual workspace are recorded via an eye tracker as the participant produces or comprehends a spoken language describing the concurrent visual world. This paradigm has high versatility, as it can be used in a wide range of populations, including those who cannot read and/or who cannot overtly give their behavioral responses, such as preliterate children, elderly adults, and patients. More importantly, the paradigm is extremely sensitive to fine grained manipulations of the speech signal, and it can be used to study the online processing of most topics in language comprehension at multiple levels, such as the fine grained acoustic phonetic features, the properties of words, and the linguistic structures. The protocol described in this article illustrates how a typical visual world eye tracking study is conducted, with an example showing how the online processing of some semantically complex statements can be explored with the visual world paradigm.

Introduction

Spoken language is a fast, ongoing information flow, which disappears right away. It is a challenge to experimentally study this temporal, rapidly change speech signal. Eye movements recorded in the visual world paradigm can be used to overcome this challenge. In a typical eye tracking study using the visual world paradigm, participants' eye movements to pictures in a display or to real objects in a visual workspace are monitored as they listen to, or produce, spoken language depicting the contents of the visual world¹^,²^,³^,⁴. The basic logic, or the linking hypothesis, behind this paradigm is that comprehending or planning an utterance will (overtly or covertly) shift participants' visual attention to a certain object in the visual world. This attention shift will have a high probability to initiate a saccadic eye movement to bring the attended area into the foveal vision. With this paradigm, researchers intend to determine at what temporal point, with respect to some acoustic landmark in the speech signal, a shift in the participant's visual attention occurs, as measured by a saccadic eye movement to an object or a picture in the visual world. When and where saccadic eye movements are launched in relation to the speech signal are then used to deduce the online language processing. The visual world paradigm can be used to study both the spoken language comprehension¹^,² and production⁵^,⁶. This methodological article will focus on comprehension studies. In a comprehension study using the visual world paradigm, participants' eye movements on the visual display are monitored as they listen to the spoken utterances talking about the visual display.

Different eye tracking systems have been designed in history. The simplest, least expensive, and most portable system is just a normal video camera, which records an image of the participant's eyes. Eye movements are then manually coded through frame-by-frame examination of the video recording. However, the sampling rate of such an eye-tracker is relatively low, and the coding procedure is time consuming. Thus, a contemporary commercial eye tracking system normally uses optical sensors measuring the orientation of the eye in its orbit⁷^,⁸^,⁹. To understand how a contemporary commercial eye-tracking system works, the following points should be considered. First, to correctly measure the direction of the foveal vision, an infrared illuminator (normally with the wavelength around 780-880 nm) is normally laid along or off the optical axis of the camera, making the image of the pupil distinguishably brighter or darker than the surrounding iris. The image of the pupil and/or of the pupil corneal reflection (normally the first Purkinje image) is then used to calculate the orientation of the eye in its orbit. Second, the gaze location in the visual world is actually contingent not only on the eye orientation with respect to the head but also on the head orientation with respect to the visual world. To accurately infer the gaze of regard from the eye orientation, the light source and the camera of the eye-trackers either are fixed with respect to participants' head (head-mounted eye-trackers) or are fixed with respect to the visual world (table-mounted or remote eye-trackers). Third, the participants' head orientation must either be fixed with respect to the visual world or be computationally compensated if participants' head is free to move. When a remote eye-tracker is used in a head-free-to-move mode, the participants' head position is typically recorded by placing a small sticker on participants' forehead. The head orientation is then computationally subtracted from the eye orientation to retrieve the gaze location in the visual world. Fourth, a calibration and a validation process are then required to map the orientation of the eye to the gaze of regard in the visual world. In the calibration process, participants' fixation samples from known target points are recorded to map the raw eye data to gaze position in the visual world. In the validation process, participants are presented with the same target points as the calibration process. The difference existing between the computed fixation position from the calibrated results and the actual position of the fixated target in the visual world are then used to judge the accuracy of the calibration. To further reconfirm the accuracy of the mapping process, a drift check is normally applied on each trial, where a single fixation target is presented to participants to measure the difference between the computed fixation position and the actual position of the current target.

The primary data of a visual world study is a stream of gaze locations in the visual world recorded at the sampling rate of the eye-tracker, ranging over the whole or part of the trial duration. The dependent variable used in a visual world study is typically the proportion of samples that participants' fixations are situated at certain spatial region in the visual world across a certain time window. To analyze the data, a time window has firstly to be selected, often referred to as periods of interest. The time window is typically time-locked to the presentation of some linguistic events in the auditory input. Furthermore, the visual world is also needed to split into several regions of interest (ROIs), each of which is associated with one or more objects. One such region contains the object corresponding to the correct comprehension of the spoken language, and thus is often called the target area. A typical way to visualize the data is a proportion-of-fixation plot, where at each bin in a time window, the proportion of samples with a look to each region of interest are averaged across participants and items.

Using the data obtained from a visual world study, different research questions can be answered: a) On the coarse-grain level, are participants' eye movements in the visual world affected by different auditory linguistic input? b) If there is an effect, what is the trajectory of the effect over the course of the trial? Is it a linear effect or high-order effect? and c) If there is an effect, then on the fine-grain level, when is the earliest temporal point where such an effect emerges and how long does this effect last?

To statistically analyze the results, the following points should be considered. First, the response variable, i.e., proportions of fixations, is both below and above bounded (between 0 and 1), which will follow a multinomial distribution rather than a normal distribution. Henceforth, traditional statistical methods based on normal distribution such as t-test, ANOVA, and linear (mixed-effect) models¹⁰, cannot be directly utilized until the proportions have been transformed to unbounded variables such as with empirical logit formula¹¹ or have been replaced with unbounded dependent variables such as Euclidean distance¹². Statistical techniques that do not require the assumption of normal distribution such generalized linear (mixed-effect) models¹³ can also be used. Second, to explore the changing trajectory of the observed effect, a variable denoting the time-series has to be added into the model. This time-series variable is originally the eye-tracker’s sampling points realigned to the onset of the language input. Since the changing trajectory typically is not linear, a high-order polynomial function of the time-series is normally added into the (generalized) linear (mixed-effect) model, i.e., growth curve analyses¹⁴. Furthermore, participants’ eye positions in the current sampling point is highly dependent on previous sampling point(s), especially when the recording frequency is high, resulting in the problem of autocorrelation. To reduce the autocorrelation between the adjacent sampling points, original data are often down-sampled or binned. In recent years, the generalized additive mixed effect models (GAMM) have also been used to tackle the autocorrelated errors¹²^,¹⁵^,¹⁶. The width of bins varies among different studies, ranging from several milliseconds to several hundred milliseconds. The narrowest bin a study can choose is restricted by the sampling rate of the eye tracker used in the specific study. For example, if an eye tracker has a sampling rate of 500 Hz, then the width of the time window cannot be smaller than 2 ms = 1000/500. Third, when a statistical analysis is repeatedly applied to each time bin of the periods of interest, the familywise error induced from these multiple comparisons should be tackled. As we described earlier, the trajectory analysis informs the researcher whether the effect observed on the coarse-grain level is linear with respect to the changing of the time, but does not show when the observed effect begins to emerge and how long the observed effect lasts. To determine the temporal position when the observed difference starts to diverge, and to figure out the duration of the temporal period that the observed effect lasts, a statistic analysis has to be repeatedly applied to each time bin. These multiple comparisons will introduce the so-called familywise error, no matter what statistical method is used. The familywise error is traditionally corrected with Bonferroni adjustment¹⁷. Recently, a method called nonparametric permutation test originally used in neuroimaging field¹⁸ has been applied to the visual word paradigm¹⁹ to control for the familywise error.

Researchers using the visual world paradigm intend to infer the comprehension of some spoken language from participants’ eye movements in the visual world. To ensure the validity of this deduction, other factors that possibly influence the eye movements should be either ruled out or controlled. The following two factors are among the common ones that need to be considered. The first factor involves some systematic patterns in participants’ explanatory fixations independent of the language input, such as the tendency to fixate on the top left quadrat of the visual world, and moving eyes in the horizontal direction being easier than in the vertical direction, etc.¹²^,²⁰ To make sure that the observed fixation patterns are related to the objects, not to the spatial locations where the objects are situated, the spatial positions of an object should be counterbalanced across different trials or across different participants. The second factor that might affect participants’ eye movements is the basic image features of the objects in the visual world, such as luminance contrast, color and edge orientation, among others²¹. To diagnose this potential confounding, the visual display is normally presented prior to the onset of the spoken language or prior to the onset of the critical acoustic marker of the spoken language, for about 1000 ms. During the temporal period from the onset of the test image to the onset of the test audio, the language input or the disambiguation point of the language input has not been heard yet. Any difference observed between different conditions should be deduced to other confounding factors such as the visual display per se, rather than the language input. Henceforth, eye movements observed in this preview period provide a baseline for determining the effect of the linguistic input. This preview period also allows participants to get familiarized with the visual display, and to reduce the systematic bias of the explanatory fixations when the spoken language is presented.

To illustrate how a typical eye tracking study using the visual world paradigm is conducted, the following protocol describes an experiment adapted from L. Zhan ¹⁷ to explore the online processing of semantically complex statements, i.e., disjunctive statements (S1 or S2), conjunctive statements (S1 and S2), and but-statements (S1 but not-S2). In ordinary conservation, the information expressed by some utterances is actually stronger than its literal meaning. Disjunctive statements like Xiaoming's box contains a cow or a rooster are such utterances. Logically, the disjunctive statement is true as long as the two disjuncts Xiaoming's box contains a cow and Xiaoming's box contains a rooster are not both false. Therefore, the disjunctive statement is true when the two disjuncts are both true, where the corresponding conjunctive statement Xiaoming's box contains a cow and a rooster is also true. In ordinary conversation, however, hearing the disjunctive statement often suggests that the corresponding conjunctive statement is false (scalar implicature); and suggests that the truth values of the two disjuncts are unknown by the speaker (ignorance inference). Accounts in the literature differ in whether two inferences are grammatical or pragmatic processes²²^,²³^,²⁴^,²⁵^,²⁶. The experiment shows how the visual world paradigm can be used to adjudicate between these accounts, by exploring the online processing of three complex statements.

Protocol

All subjects must give informed written consent before the administration of the experimental protocols. All procedures, consent forms, and the experimental protocol were approved by the Research Ethics Committee of the Beijing Language and Culture University.

NOTE: A comprehension study using the visual world paradigm normally consists of the following steps: Introduce the theoretical problems to be explored; Form an experimental design; Prepare the visual and auditory stimuli; Frame the theoretical problem with regard to the experimental design; Select an eye-tracker to track participants' eye movements; Select a software and build a script with the software to present the stimuli; Code and analyze the recorded eye-movements data. A specific experiment can differ from each other in any of the described steps. As an example, a protocol is introduced to conduct the experiment and discuss some points that researchers need to keep in mind when they build and conduct their own experiment using the visual world paradigm.

1. Prepare Test Stimuli

Visual stimuli
1. Download 60 clip arts of animals that are free of copyright from the internet. Open each image one by one with an image editor (e.g., Pixelmator), click Tools | Quick selection tool to select and delete the background. Click Image | Image Size to resize them to 120 x 120 pixels.
2. Invite a student majoring in painting to draw four light green boxes, as illustrated in Figure 1. Use the image editor to rescale the big open box to 320 x 240 pixels, the small closed box with the size of 160 x 160 pixels, and the two small open boxes to 160 x 240 pixels, respectively.
3. Click Pixelmator | File | New to build a template of the test image with the size of 1024 768 pixels. Drag the animals and the boxes to correction locations being illustrated in Figure 1.
  NOTE: The layout of the test image varies between studies, but the optimal way is to use four objects and to put them at the four quadrants of the test image. In this way, it is easier to counterbalance the spatial position of the objects.
4. Create 60 test images like Figure 1, with each animal image being used twice. Counterbalance the spatial locations of the four boxes among the images.
  NOTE: The number of the images does not have to be exact 60, as long as their effect is dissociable from that of the experimental manipulations.
Spoken language stimuli
1. Design four test sentences corresponding to each test image and 240 test sentences in total to be recorded. Ensure that three of the four sentences are in the form of Figure 2; and the filler sentence is in the form of Xiaoming's box doesn't contain a rooster but a cow.
  NOTE: The test sentences should be presented in the native language that participants speak. The participants in this experiment are Chinese from Beijing, Mainland China, so the test language is Mandarin Chinese.
2. Recruit a female native speaker (a native speaker of Mandarin Chinese in this experiment) to record four example statements like Figure 2, as well as audio of all the animals being used in the experiment. When recording the isolated animal names, ask the speaker to imagine that the names of the animals are intact components of a simple sentence, such as Xiaoming's box contains a/an ___, but she only needs to pronounce the name of the animal overtly.
3. Replace the audio segments of the two animals in the example statements with the audio of the two animals used in each trial to create the full list of the test audios. First, open Praat (Any other audio editing software is an eligible alternative) and click Open | Read from file | Navigate to the file | Open and edit, navigate to an element to be replaced, and click View and Edit | Edit | Copy selection to sound clipboard. Second, use the same steps to open an example statement, click paste after selection. Third, click Save | save as wav file to save the edited statement. Repeat the process for all the elements to be changed and all the test sentences.
4. Recruit about 10 native speakers of the test language (Mandarin Chinese here) to determine whether or not the constructed test audio is intelligible and natural.
  NOTE: The test audio is traditionally recorded as a whole, rather than as separate words. This traditional recording method is reasonable if the test audio are themselves separate words. If the spoken language stimuli are sentences rather than separate words, however, this traditional method has several shortcomings: First, a ubiquitous property of a continuous speech is that two or more speech sounds tend to temporally and spatially overlap, which makes it hard to determine the onset of the critical word. Second, the variance between the length of different trials also makes it difficult to combine all the trials together for statistical analyses. Third, the traditional recording method is often time consuming especially when the numbers of the test audio are relatively large, such as the experiments we reported in the protocol. To overcome the shortcomings of the traditional recording method, a different method is proposed to construct the spoken test audios. First, a list of sample sentences containing the words that are common among all the test audio was recorded. Second, all words that change between trials were also recorded in isolation. Finally, sample sentences were replaced with the recorded words to construct the full list of the test audios. Compared to the traditional method, the new method has several advantages. First, all the test audio is exactly the same except for the critical words, and all potential confounding effects in the test audio are henceforth controlled. Second, being the same in length also makes the segmentation of the test audios easier than when the test audios are recorded as a whole. One potential disadvantage of this method is that the constructed audio might be not natural. Henceforth, the naturalness of the test audio has to be evaluated before they are eligible for the actual testing..
Divide the 240 test sentences into four groups, with each group containing 15 conjunctive statements, 15 disjunctive statements, 15 but statements, and 15 filler sentences. Ensure that each participant encounters only one group of 240 trials: he/she sees all the test images but hears only one group of the test audios.
NOTE: This is to address the concern that if the same stimulus is repeated, participants might be getting accustomed to these stimuli and possibly even becoming strategic about how they have responded to the stimuli.
Save all important information regarding the test stimuli into a tab-delimited txt file, with each row corresponding to each of the 240 trials. Ensure that the file includes at least the following columns: experiment_group, sentential_connective, trial_number, test_image, test_audio, test_audio_length, ia_top_left, ia_top_right, ia_bottom_left, ia_bottom_right, animal_1_image, animal_1_audio, animal_1_audio_length, animal_2_image, animal_2_audio, animal_2_audio_length.
NOTE: experiment_group is used to split the 240 trials into 4 groups. sentential_connective corresponds to different experimental conditions. animal_1_image corresponds to the image of the animal that will be firstly presented to familiarize the participants with the animals used in the test image. test_image, test_audio, and test_audio_length refer to the test image and the test audio as well its length used in the current trial. ia_top_left, ia_top_right, ia_bottom_left, ia_bottom_right refer to the name of the four interest areas in the current trial, i.e., whether it is a "Big open" box, "small closed" box, the small open box containing the "first mentioned" animal in the test audio, or the small open box containing the "second mentioned" animal in the test audio. animal_1_audio and animal_1_audio_length refer to the audio and length of the audio corresponding to the animal_1_image. animal_2_image, animal_2_audio, and animal_2_audio_length correspond the second animal that will be presented. One thing to stress is that the sequence to present the two animals is counterbalanced with respect to whether the animal is mentioned in the first or in the second half of the test audios.

2. Frame the Theoretical Prediction with regard to the Experimental Design.

Ensure participants' behavioral responses and eye-movements in the experimental design can be used to differentiate comprehensions of the test sentences and can be used to adjudicate between different accounts to be tested.
NOTE: Given the experimental design, the correct response to a conjunctive statement is the big open box, such as Box A in Figure 1. The correct response to a but-statement is the small open box containing the animal being mentioned in the first half of the test audios, such as Box D in Figure 1. Participants' responses to the disjunctive statement, however, depend on whether and/or how the two discussed inferences are processed. If participants compute neither the scalar implicature nor the ignorance inference, then all the four boxes are eligible options. If participants compute the scalar implicature but not the ignorance inference, then the big open, such as box A in Figure 1, will be ruled out, and the remaining three boxes B, C, and D are all eligible options. If participants compute the ignorance inference but not the scalar implicature, then the small open boxes will be ruled out, i.e., boxes C and D will be ruled out. To summarize, the small closed box, such as box B in Figure 1, will not be chosen as the final option of a disjunctive statement until the scalar implicature and the ignorance inferences are both computed.

3. Build the Experimental Script

Open the Experiment Builder, click File | New to create an experiment project. Input the project name such as vwp_disjunction. Select the project location. Check EyeLink Experiment and choose Eyelink 1000plus from the drop list. These operations will create a subdirectory containing all files related to the experiment It will create a subdirectory named vwp_disjunction with a file named "graph.ebd" in the folder.
NOTE: Experiment Builder is used to build the experimental script to present the test stimuli and to record participants' eye movements as well as their behavioral responses. The Experiment Builder is a What-You-See-Is-What-You-Get tool to build experimental script. It is easy to use, but any other stimuli presentation software is an eligible alternative.
Visualize the hierarchical structure of a typical eye-tracking experiment using the visual world paradigm as seen in Figure 3. Each pink rectangle in the figure is implemented as a SEQUENCE object by Experiment Builder; and each object with gray background is implemented as a node object.
NOTE: A SEQUENCE in the Experiment Builder is an experimental loop controller used to chain together different objects as a complex node. A sequence always begins with a START node. And a data source can be attached to a sequence node to supply different parameters for each trial.
Build the Experiment sequence
1. Click File | Open, browse to the directory of experiment and double click the graph.ebd file in the project directory to open the saved experiment project.
2. Click Edit | Library Manager | Image | Add to load the images into the experiment Project. Similarly, click Edit | Library Manager | Sound | Add to load the audio into the experiment project.
3. Drag a DISPLAY_SCREEN object into the work space and change its label value on the properties panel to rename it as Instruction. Double click to open the Instruction node, and click the Insert Multiline Text Resource button to input the experimental instruction. Ensure the instruction contains the following information:
  In each trial, first you will see images of two animals, one animal each printed on the screen in turn, along with the audio of the animals played on the two speakers situated at both sides of the screen. A black dot will then be presented at the center of the screen. You should press the SPACE key while fixating on the dot. Next, you will see a test image consisting of four boxes printed on the screen and hear a test sentence being played via the two speakers. Your task is to locate Xiaoming's box according to the test sentence you heard and press the corresponding button as soon as possible:
  Top left box --- Left arrow
  Top Right Box --- Up arrow
  Bottom left box --- Left arrow
  Bottom right box --- Right arrow
  In each test image, you will see four boxes situated at the four quadrants and two animals containing in the boxes. The four boxes can vary in two dimensions: its closeness and its size. Whether a box is closed or not influences our epistemic knowledge on that box, but not the animal(s) it contains. If a box is open, then the animal(s) contained in that box is known. If a box is closed, then the animal(s) contained in that box is unknown. The size of a box affects the number of animals contained in the box, but not our epistemic knowledge on that box. No matter the box is closed or not, a small box only and always contains one animal, and a big box always contains two different animals.
  If you are comfortable with the experimental aim and the procedure, please let the experimenter know and we will help you to perform the standard eye tracking calibration and validation routines. If you have any questions, please don't hesitate to ask.
  Note: This is an instruction that will be printed on the screen prior to the experiment (The instructions should be written in the native language the participants speak, such as Mandarin Chinese here).
4. Drag a KEYBOARD object into the work space.
  NOTE: This step is used to end the Instruction screen
5. Drag a SEQUENCE object into the work space and rename it as Block.
6. Select the Block sequence, click the value field of the Data Source property to bring up the Data Source Editor. Click the Import Data Button on the data source editor screen, brow to the .txt file created in step 1.4 to import the data source.
7. Click the Randomization Setting button in the data source editor, check Enable Trial Randomization, select trial_number from the value field of the Column field, and select experimental_group from the drop-list of the Splitting Column field.
8. Drag the second DISPLAY_SCREEN object to the work space and rename it as Goodbye. Double click the Goodbye node and insert the following information: in participants' native language (Mandarin Chinese in this protocol):The experiment is finished and Thank you for very much your participation.
9. Left-click on the START node, drag the arrow to the Instruction node, and release the mouse button to connect the START node to the Instruction node. Repeat the same mouse moves to connect Instruction node to the KEYBOARD node, KEYBOARD node to Block node, then Block node to the Goodbye node. Click View | Arrange Layout to arrange the nodes in the workspace.
Build the block sequence
1. Double click to open the Block sequence. Drag an El_CAMERA_SETUP node into the Block sequence to bring up a camera setup screen on the EyeLink Host PC for the experimenter to perform camera setup, calibration, and validation. Click the Calibration Type field in the Properties panel and choose HV5 from the dropdown list.
  NOTE: The number of locations in the mapping process varies between different experimental designs. The more locations sampled and the more space covered, the greater the accuracy can be expected. But more samples mean more time to finish the processes. So practically, the number of locations in a specific study cannot be very big, especially when participants are preliterate children or clinical patients. In the visual world paradigm, the number of the interest areas is relatively small, and the areas of interest are normally relatively big. The mapping process can reach a satisfying level with relatively small number of locations. In the protocol I described, I used a five points' calibration and validation.
2. Drag a SEQUENCE node into the Block sequence and rename it as Trial. Connect the START node to the CAMERA_SETUP node, then to the SEQUENCE node.
Build the Trial sequence
1. Double click to open the Trial sequence, drag a DISPLAY_SCREEN node into the Trial sequence and rename it as animal_1_image. Double click to open the Screen Builder node and click the Insert Image Resource button on the Screen Builder toolbar to insert an animal image from the uploaded image sources. Click the value field of the Source File Name property, navigate to the DataSource attached to the Block Sequence; and double click the Animal_1_Image column to connect the DISPLAY_SCREEN with the correct column of the data source.
2. Drag a PLAY_SOUND node into the Trial sequence and rename it as animal_1_audio. Click the Sound File property of the animal_1_audio node and connect it with the correct column of the data source (as being described in step 3.5.1).
3. Drag a TIMER node into the Trial sequence and rename it as animal_1_audio_length. Click the Duration Property of the TIMER node and navigate to the correct column of the data source created in 3.4.1.
4. Drag another DISPLAY_SCREEN node, another PLAY_SOUND node, and another TIMER node into the Trial sequence, rename them as animal_2_image, animal_2_audio, and animal_2_audio_duration, repeat the steps being described in steps 3.5.1 - 3.5.3.
  NOTE: These steps are included to control for the potential confounding that the same image might be named differently by different participants. Counterbalance the sequence of presenting the two animals with respect to whether it is mentioned in the first or second half of the test audios.
5. Drag a Prepare Sequence object into the Trial Sequence and change the property Draw To Eyelink Host to IMAGE.
  NOTE: This node is used to preload the image and audio files to memory for real-time image drawing and sound playing. And it is also used to draw feedback graphics on the Host PC so that the participants' gaze accuracy can be monitored.
6. Drag a DRIFT_CORRECT node into the Trial sequence to introduce the drift correction.
7. Drag a new SEQUENCE node and rename it as Recording. Connect the START to these nodes one after one.
Build the Recording sequence
1. Check the Record field in the property panel of the Recording sequence, and double click to open the Recording sequence.
  NOTE: A sequence with Record property checked means that participants' eye movements during this period will be recorded.
2. Drag a new DISPLAY_SCREEN into the Record sequence, rename it as test_image. Add the message test_image_onset into the Message property of the test_image node.
  NOTE: In data analyses stage, the message in the test_image node and the message in the test_audio node (section 3.6.6) are important to locate the onset of the test images and the onset of the test audios in each trial.
3. Double click to open the Screen Builder node and click the Insert Image Resource button on the Screen Builder toolbar to insert any animal image from the uploaded image sources. Click the value field of the Source File Name property, navigate to the DataSource attached to the Block Sequence; and double click the test_image column to connect the DISPLAY_SCREEN with correct column of the data source.
4. Double click the DISPLAY_SCREEN node to open the Screen Builder, click the button of Insert Rectangle Interest Area Region, and draw four rectangular areas of interest as illustrated by the blue boxes in Figure 1. Change the labels of the four areas of interest to Top_Left, Top_Right, Bottom_Left, and Bottom_Right, and connect the DataViewer Name filed with the correct columns of the data source.
  NOTE: These areas are invisible to the participants. To make the areas of interest more meaningful, label the name of top left area in the example as "Box A (big open)", area top right area as "Box B (small closed)", bottom left area as "Box C (second mentioned)", and area bottom right area as "Box D (First Mentioned)", because the two small open boxes contain the two animals being mentioned in the first and second half of the test audios, respectively.
5. Drag a TIMER node into the workspace, rename it as Pause, and change the Duration property to 500 ms.
  NOTE: This TIMER node adds some time lag between the onset of the test image and the onset of the test audio. The time lag gives participants a chance to familiarize with the test images. Participants' eye movements during this preview period also provide a baseline for determining the effects of the spoken language input, especially when the critical words are situated at the beginning of the test audios.
6. Drag a PLAY_SOUND node in to the work space and rename it as test_audio. Click the Sound File property and connect it with the correct column of the data source (as being described in step 3.5.1) and add the message test_audio_onset into the Message property.
7. Drag a TIMER node into the work space, rename it as test_audio_length. Change the Duration Property to 10500 ms.
8. Drag a Null Action -node into the work space.
9. Add a new TIMER node, rename it as record_extension, and change the Duration property to 4000 ms.
10. Add a new KEYBOARD node into the work space, rename it as behavioral responses, and change the acceptable Keys property to "[Up, Down, Right, Left]".
  NOTE: Participants' behavioral choices can be used to double check the validity of the conclusion deduced from participants' eye movements.
11. Connect the START node to Pause, test_audio, test_audio_length, NULL Action, then to Record_extension node. Add another connection from test_audio_length to behavioral_responses node.
  NOTE: By adding these connections, current trial will end and a new trial will start after participants made a key press to choose Xiaoming's Box, or 4000 ms after the offset of the test audio.
12. Drag a VARIABLE node into the work space, rename it as key_pressed, and connect its value property to behavioral_Responses Keyboard | Triggered Data | Key.
13. Drag a RESULT_FILE node into the work space, drag an ADD_TO_RESULT_FILE node into the work space, and connect both the record_extension node and the behavioral_responses node to the ADD_TO_RESULT_FILE node.
Click Experiment | Build to build the experimental script, click Experiment | Test run to test run the experiment. After everything is done, click Experiment | Deploy to create an executable version of the experimental project.
NOTE: For more information on how to use the Experiment Builder, please consult the software manual²⁷.

4. Recruit Participants

Ensure the participants to have normal or corrected normal vision. Recommend that the short-sighted participants to wear contact lenses, but glasses are also acceptable as long as the lenses are clean. Ensure that all participants are native speakers of the testing language, such as Mandarin Chinese here.
NOTE: As a general guideline, a participant is regarded as eligible as long as the participant can see the test images at a distance of about 60 centimeters. In terms of the number of participants, according to some rules of thumb, the number of participants for regression analysis should be no less than 50. Here, thirty-seven postgraduate students from the Beijing Language and Culture University participated in the experiment, which is a little smaller than the recommended amount.

5. Conduct the Experiment

NOTE: When participants are normal developed adults, one experimenter is enough to conduct the conduct the experiment. But if participants are special populations, such as children, two or more experimenters are required.

Select an eye tracker to record participants' eye movements.
NOTE: The eye tracker used in this experiment is Eyelink 1000plus running under the free-to-move head mode. This is a video-based, desktop mounted eye tracking system, using the principle of pupil with corneal reflection (CR) to track eye's rotation. When running under the free-to-move head mode, the eye tracker has the monocular sampling rate of 500 Hz, with a spatial resolution of 0.01° and an average error of less than 0.5°. For more detailed information of the system, please consult its technical specification²⁸^,²⁹. Alternative trackers can be used, but the ones with remote tracking mode are better, especially when participants are preliterate children.
Boot the system on the Host PC to start the Host application of the camera.
To configure the system to desktop remote mode, click the Set Option button, set the Configuration option to Desktop -- Target Sticker -- Monocular -- 16/25mm length -- RTARBLER.
Click the executable version of the experimental project on the Display PC, input participant's name, and choose a group from the prompt window to select the condition value to run.
NOTE: Each test session will create a folder with the inputted name under the subdirectory Results of the experiment project. The EDF file under the folder contained relevant eye movements data.
Ask the participants to sit approximately 60 cm from a 21 inch, 4:3 color monitor with 1024px x 769px resolution, where 27 pixels equals to 1 degree of angle.
Adjust the height of the Display PC monitor, to ensure that when the participant is seated and looking straight ahead, they are looking vertically at the middle to top 75% of the monitor.
NOTE: The chair, desk, and/or the PC monitor are preferred if they are adjustable in height. The chair and the desk with casters should be avoided, as they tend to cause unintentional move and roll.
Place a small target sticker on the participant's forehead, to track the head position even when the pupil image is lost, such as during blinks or sudden movements.
NOTE: Different eye trackers might use different methods to track participants' head. To maximize the lateral movement range of the subject, the tracked eye should be on the same side as the illuminator.
Rotate the focusing arm on the desk mount to bring the eye image into focus.
Click the Calibrate button on the host PC to conduct the calibration process by asking participants to fixate a grid of five fixation targets in random succession with no overt behavioral responses, to map participants' eye movements to the gaze of regard in the visual world.
Click the Validate button on the host PC to validate the calibrated results by asking participants to fixate the same grid of fixation targets. Repeat the calibration and validation routines, when the error is bigger than 1°.
Conduct the two routines at the beginning of the experiment and whenever the measurement accuracy is poor (e.g., after strong head movements or a change in the participants' posture).
Click the Record button on the host PC to start the experiment.
Perform a drift check on each trial by asking participants to press the SPACE key on the keyboard while fixating at the black dot presented in the center of the screen.
NOTE: When the participants are preliterate children or clinical patients, explicitly instructing them to press the keyboard while fixating the black dot is normally impractical. But their attention and eye fixations tend to be automatically attracted by the displayed black dot. In this case, the experimenter should be the person to press the keyboard while the participant is fixating on the black dot.
Present the visual stimuli via the Display PC monitor and play the auditory stimuli via a pair of external speakers situated to the left and right of the monitor (earphones are also acceptable).
NOTE: The recordings are played from the hard disk as 24 kHz mono sound clips. If there is no special reason, mono sound clips are preferred to stereo sound clips. In a stereo sound clip, the difference between the two sound tracks, as well as the difference between the two speakers might affect participants' eye movements. For more information on how to use the eye tracker, please consult the user manual³⁰.

6. Data Coding and Analyses

Open Data Viewer, click File | Import File | Import Multiple Eyelink Data Files to import all the recorded eye tracker files (with the extension of EDF), and save them into a single .EVS file.
Open the saved EVS file and click Analysis | Reports | Sample Report to export the raw sample data with no aggregation.
NOTE: If the eye tracker has a sampling rate of 500 Hz, the exported data will have 500 data points, henceforth 500 rows, per second per trial. If participants' left eye is tracked, ensure the following columns as well as the variables created in the data source are exported: RECORDING_SESSION_LABEL, LEFT_GAZE_X, LEFT_GAZE_Y, LEFT_INTEREST_AREA_LABEL, LEFT_IN_BLINK, LEFT_IN_SACCADE, LEFT_PUPIL_SIZE, SAMPLE_INDEX, SAMPLE_MESSAGE. For more information on how to use the Data Viewer, please consult the software manual³¹.
Restrict the statistical analyses to the temporal window from the onset of the test image to the offset of the test audios, i.e., the temporal window with the duration of 11 s.
Delete the samples where participants' eye movements are not recorded, such as participants blink their eyes, which roughly affects 10% of the recorded data.
NOTE: This is an optional step, as the results are normally the same no matter whether these samples deleted.
Code the data. To construct the data for a specific area of interest in a certain sampling point, code the data as 1 if participants' eye fixation is situated in the area of interest to be analyzed at that sampling point. Code the data as 0 if the eye fixation is not situated in the areas of interest at that sampling point.
Draw a proportion-of-fixation to visualize the obtained data. To calculate the proportion-of-fixations over certain area of interest, average the coded data for all the trials and for all the participants in each sample point under each condition. Plot the calculated proportion-of-fixations on the y-axis against the sampling point on the x-axis, with different panels denoting different areas of interest and with the plotting colors denoting different experimental conditions.
NOTE: In the experiment, the four panels depicted participants' fixation patterns on the four areas of interest. The red, green, and blue lines illustrated participants' fixation patterns when the test statements were conjunctions (S1 and S2), but-statements (S1 but not S2), and disjunctions (S1 or S2), respectively. The software used to draw the descriptive plot is the ggplot2 package from R environment. Other software is also available. Figure 5 is an example of such plot.
Fit a binomial generalized linear mixed model (GLMM) on each area of interest at each sampling point, as the data was coded as either 1 or 0, depending on whether the participant's fixation is situated in or out of the area of interest at that sampling point.
NOTE: As the data is not binned, and the coded data can only be 1 or 0, so the distribution of the coded data is binary rather than normal. Henceforth, a GLMM model with the family of binomial distribution is used. The GLMM model includes a fixed term, the experimental conditions, and two random terms, participants and items. The formula evaluated to the two random terms includes both the intercepts and the slope of the experimental conditions. The software used to do the model fitting is the lme4 package from R environment. Other software is also available. One thing should be mentioned is that the baseline of the fixed items differed when the analyzed interest area, i.e., the analyzed boxes, are different. To be specific, the conjunction (S1 and S2) was chosen as the baseline when analyzing the big-open box (Box A), the disjunction (A and B) was chosen as the baseline when analyzing the small-closed box (Box B), and the but-statement was chosen as the baseline when analyzing the first-mentioned box (Box D).
Bonferroni adjust the p values obtained with Wald z test, to reduce the familywise error induced by multiple comparisons.
NOTE: Bonferroni adjustment is the traditional way to tackle the familywise error induced by multiple comparisons. Other methods are also available, as we described in the introduction section.

Results

Participants' behavioral responses are summarized in Figure 4. As we described earlier, the correct response to a conjunctive statement (S1 and S2) is the big open box, such as Box A in Figure 1. The correct response to a but-statement (S1 but not S2) is the small open box containing the first mentioned animal, such as Box D in Figure 1. Critically, which box is chosen ...

Discussion

To conduct a visual world study, there are several critical steps to follow. First, researchers intend to deduce the interpretation of the auditorily presented language via participants' eye movements in the visual world. Henceforth, in designing the layout of the visual stimuli, the properties of eye movements in a natural task that potentially affect participants' eye movements should be controlled. The effect of the spoken language on participants' eye movements can then be recognized. Second, acoustic cue...

Disclosures

The author declares that he has no competing financial interests.

Acknowledgements

This research was supported by Science Foundation of Beijing Language and Cultural University under the Fundamental Research Funds for the Central Universities (Approval number 15YJ050003).

Materials

Name	Company	Catalog Number	Comments
Pixelmator	Pixelmator Team	http://www.pixelmator.com/pro/	image editing app
Praat	Open Sourse	http://www.fon.hum.uva.nl/praat/	Sound analyses and editting software
Eyelink 1000plus	SR-Research, Inc	https://www.sr-research.com/products/eyelink-1000-plus/	remote infrared eye tracker
Experimental Builder	SR-Research, Inc	https://www.sr-research.com/experiment-builder/	eye tracker software
Data Viewer	SR-Research, Inc	https://www.sr-research.com/data-viewer/	eye tracker software
R	Open Sourse	https://www.r-project.org	free software environment for statistical computing and graphics

References

Tanenhaus, M. K., Spivey-Knowlton, M. J., Eberhard, K. M., Sedivy, J. C. Integration of visual and linguistic information in spoken language comprehension. Science. 268 (5217), 1632-1634 (1995).
Cooper, R. M. The control of eye fixation by the meaning of spoken language: A new methodology for the real-time investigation of speech perception, memory, and language processing. Cognitive Psychology. 6 (1), 84-107 (1974).
Salverda, A. P., Tanenhaus, M. K., de Groot, A. M. B., Hagoort, P. . Research methods in psycholinguistics and the neurobiology of language: A practical guide. , (2017).
Huettig, F., Rommers, J., Meyer, A. S. Using the visual world paradigm to study language processing: A review and critical evaluation. Acta Psychologica. 137 (2), 151-171 (2011).
Meyer, A. S., Sleiderink, A. M., Levelt, W. J. M. Viewing and naming objects: Eye movements during noun phrase production. Cognition. 66 (2), B25-B33 (1998).
Griffin, Z. M., Bock, K. What the eyes say about speaking. Psychological Science. 11 (4), 274-279 (2000).
Young, L. R., Sheena, D. Survey of eye movement recording methods. Behavior Research Methods & Instrumentation. 7 (5), 397-429 (1975).
Conklin, K., Pellicer-Sánchez, A., Carrol, G. . Eye-tracking: A guide for applied linguistics research. , (2018).
Duchowski, A. . Eye tracking methodology: Theory and practice. , (2007).
Baayen, R. H., Davidson, D. J., Bates, D. M. Mixed-effects modeling with crossed random effects for subjects and items. Journal of Memory and Language. 59 (4), 390-412 (2008).
Barr, D. J. Analyzing 'visual world' eyetracking data using multilevel logistic regression. Journal of Memory and Language. 59 (4), 457-474 (2008).
Nixon, J. S., van Rij, J., Mok, P., Baayen, R. H., Chen, Y. The temporal dynamics of perceptual uncertainty: eye movement evidence from Cantonese segment and tone perception. Journal of Memory and Language. 90, 103-125 (2016).
Bolker, B. M., et al. Generalized linear mixed models: A practical guide for ecology and evolution. Trends in Ecology and Evolution. 24 (3), 127-135 (2009).
Mirman, D., Dixon, J. A., Magnuson, J. S. Statistical and computational models of the visual world paradigm: Growth curves and individual differences. Journal of Memory and Language. 59 (4), 475-494 (2008).
Baayen, H., Vasishth, S., Kliegl, R., Bates, D. The cave of shadows: Addressing the human factor with generalized additive mixed models. Journal of Memory and Language. 94, 206-234 (2017).
Baayen, R. H., van Rij, J., de Cat, C., Wood, S., Speelman, D., Heylen, K., Geeraerts, D. . Mixed-Effects Regression Models in Linguistics. 4, 49-69 (2018).
Zhan, L. Scalar and ignorance inferences are both computed immediately upon encountering the sentential connective: The online processing of sentences with disjunction using the visual world paradigm. Frontiers in Psychology. 9, (2018).
Maris, E., Oostenveld, R. Nonparametric statistical testing of EEG- and MEG-data. Journal of Neuroscience Methods. 164 (1), 177-190 (2007).
Barr, D. J., Jackson, L., Phillips, I. Using a voice to put a name to a face: The psycholinguistics of proper name comprehension. Journal of Experimental Psychology-General. 143 (1), 404-413 (2014).
Dahan, D., Tanenhaus, M. K., Salverda, A. P., van Gompel, R. P. G., Fischer, M. H., Murray, W. S., Hill, R. L. . Eye movements: A window on mind and brain. , 471-486 (2007).
Parkhurst, D., Law, K., Niebur, E. Modeling the role of salience in the allocation of overt visual attention. Vision Research. 42 (1), 107-123 (2002).
Grice, H. P., Cole, P., Morgan, J. L. Vol. 3 Speech Acts. Syntax and semantics. , 41-58 (1975).
Sauerland, U. Scalar implicatures in complex sentences. Linguistics and Philosophy. 27 (3), 367-391 (2004).
Chierchia, G. Scalar implicatures and their interface with grammar. Annual Review of Linguistics. 3 (1), 245-264 (2017).
Fox, D., Sauerland, U., Stateva, P. . Presupposition and Implicature in Compositional Semantics. , 71-120 (2007).
Meyer, M. C. . Ignorance and grammar. , (2013).
SR Research Ltd. . SR Research Experiment Builder User Manual (Version 2.1.140). , (2017).
SR Research Ltd. . EyeLink® 1000 Plus Technical Specifications. , (2017).
SR Research Ltd. . EyeLink-1000-Plus-Brochure. , (2017).
SR Research Ltd. . EyeLink® 1000 Plus User Manual (Version 1.0.12). , (2017).
SR Research Ltd. . EyeLink® Data Viewer User’s Manual (Version 3.1.97). , (2017).
McQueen, J. M., Viebahn, M. C. Tracking recognition of spoken words by tracking looks to printed words. The Quarterly Journal of Experimental Psychology. 60 (5), 661-671 (2007).
Altmann, G. T. M., Kamide, Y. Incremental interpretation at verbs: restricting the domain of subsequent reference. Cognition. 73 (3), 247-264 (1999).
Altmann, G. T. M., Kamide, Y. The real-time mediation of visual attention by language and world knowledge: Linking anticipatory (and other) eye movements to linguistic processing. Journal of Memory and Language. 57 (4), 502-518 (2007).
Snedeker, J., Trueswell, J. C. The developing constraints on parsing decisions: The role of lexical-biases and referential scenes in child and adult sentence processing. Cognitive Psychology. 49 (3), 238-299 (2004).
Allopenna, P. D., Magnuson, J. S., Tanenhaus, M. K. Tracking the time course of spoken word recognition using eye movements: Evidence for continuous mapping models. Journal of Memory and Language. 38 (4), 419-439 (1998).
Zhan, L., Crain, S., Zhou, P. The online processing of only if and even if conditional statements: Implications for mental models. Journal of Cognitive Psychology. 27 (3), 367-379 (2015).
Zhan, L., Zhou, P., Crain, S. Using the visual-world paradigm to explore the meaning of conditionals in natural language. Language, Cognition and Neuroscience. 33 (8), 1049-1062 (2018).
Brown-Schmidt, S., Tanenhaus, M. K. Real-time investigation of referential domains in unscripted conversation: A targeted language game approach. Cognitive Science. 32 (4), 643-684 (2008).
Fernald, A., Pinto, J. P., Swingley, D., Weinberg, A., McRoberts, G. W. Rapid gains in speed of verbal processing by infants in the 2nd year. Psychological Science. 9 (3), 228-231 (1998).
Trueswell, J. C., Sekerina, I., Hill, N. M., Logrip, M. L. The kindergarten-path effect: studying on-line sentence processing in young children. Cognition. 73 (2), 89-134 (1999).
Zhou, P., Su, Y., Crain, S., Gao, L. Q., Zhan, L. Children's use of phonological information in ambiguity resolution: a view from Mandarin Chinese. Journal of Child Language. 39 (4), 687-730 (2012).
Zhou, P., Crain, S., Zhan, L. Grammatical aspect and event recognition in children's online sentence comprehension. Cognition. 133 (1), 262-276 (2014).
Zhou, P., Crain, S., Zhan, L. Sometimes children are as good as adults: The pragmatic use of prosody in children's on-line sentence processing. Journal of Memory and Language. 67 (1), 149-164 (2012).
Moscati, V., Zhan, L., Zhou, P. Children's on-line processing of epistemic modals. Journal of Child Language. 44 (5), 1025-1040 (2017).
Helfer, K. S., Staub, A. Competing speech perception in older and younger adults: Behavioral and eye-movement evidence. Ear and Hearing. 35 (2), 161-170 (2014).
Dickey, M. W., Choy, J. W. J., Thompson, C. K. Real-time comprehension of wh-movement in aphasia: Evidence from eyetracking while listening. Brain and Language. 100 (1), 1-22 (2007).
Magnuson, J. S., Nusbaum, H. C. Acoustic differences, listener expectations, and the perceptual accommodation of talker variability. Journal of Experimental Psychology-Human Perception and Performance. 33 (2), 391-409 (2007).
Reinisch, E., Jesse, A., McQueen, J. M. Early use of phonetic information in spoken word recognition: Lexical stress drives eye movements immediately. Quarterly Journal of Experimental Psychology. 63 (4), 772-783 (2010).
Chambers, C. G., Tanenhaus, M. K., Magnuson, J. S. Actions and affordances in syntactic ambiguity resolution. Journal of Experimental Psychology-Learning Memory and Cognition. 30 (3), 687-696 (2004).
Tanenhaus, M. K., Trueswell, J. C., Trueswell, J. C., Tanenhaus, M. K. . Approaches to Studying World-Situated Language Use: Bridging the Language-as-Product and Language-as-Action Traditions. , (2005).

Reprints and Permissions

Request permission to reuse the text or figures of this JoVE article

Request Permission

This article has been published

Video Coming Soon

Keep me updated: