

Title:
MEASURING LANGUAGE PROFICIENCY FROM ELECTROENCEPHALOGRAPHY DATA
Document Type and Number:
WIPO Patent Application WO/2021/035067
Kind Code:
A1
Abstract:
Systems and methods for improving intelligibility of a subject are disclosed. The system can comprise one or more processors including a regression model. The regression model can be configured to receive electroencephalogram (EEG) data of the subject, estimate a linguistic feature from a target sound, the EEG data, or a combination thereof, and decode the linguistic features from the EEG data. The linguistic feature can be selected from the group consisting of a phonemic feature, a phonotactic feature, a semantic feature, and combinations thereof.

Inventors:
MESGARANI NIMA (US)
LIBERTO GIOVANNI (IT)
NIE JINGPING (US)
Application Number:
PCT/US2020/047232
Publication Date:
February 25, 2021
Filing Date:
August 20, 2020
Assignee:
UNIV COLUMBIA (US)
International Classes:
A61B5/12; H04R25/00
Domestic Patent References:
WO2017068414A22017-04-27
Foreign References:
US20110159467A12011-06-30
US20060074664A12006-04-06
US20060074667A12006-04-06
US20180092567A12018-04-05
US20140148724A12014-05-29
Other References:
WONG DANIEL D.E., FUGLSANG SØREN A., HJORTKJÆR JENS, CEOLINI ENEA, SLANEY MALCOLM, CHEVEIGNÉ ALAIN DE: "A Comparison of Temporal Response Function Estimation Methods for Auditory Attention Decoding", BIORXIV, 13 March 2018 (2018-03-13), XP055795286, Retrieved from the Internet [retrieved on 20201019]
Attorney, Agent or Firm:
LEE, Heon, Goo et al. (US)
Claims:
CLAIMS

What is claimed is:

1. A device for improving intelligibility of a subject, comprising: one or more processors including a regression model, wherein the regression model is configured to receive electroencephalogram (EEG) data of the subject, wherein the subject is exposed to a target sound; estimate a linguistic feature from the target sound, the EEG data or a combination thereof, wherein the linguistic feature is selected from the group consisting of a phonemic feature, a phonotactic feature, a semantic feature, and combinations thereof; and decode the linguistic features from the EEG data.

2. The device of claim 1, further comprising one or more sensor components for obtaining the EEG data, wherein the one or more sensor components are coupled to the one or more processors via a wired connection or a wireless connection.

3. The device of claim 1, further comprising one or more sound input components for collecting sounds and one or more sound output components, wherein the one or more sound input and output components are coupled to the one or more processors via a wired connection or a wireless connection.

4. The device of claim 1, wherein the one or more processors are further configured to be trained by receiving training data, wherein the training data comprises electroencephalogram (EEG) data, linguistic feature data, or a combination thereof.

5. The device of claim 1, wherein the regression model is configured to estimate a temporal response function (TRF) for the linguistic features.

6. The device of claim 5, wherein the device is configured to assess language proficiency and/or nativeness of the subject based on the TRF for the linguistic features.

7. The device of claim 1, wherein the device is configured to assess an accuracy of the decoding by predicting a brain response of the subject exposed to the target sound.

8. The device of claim 1, wherein the linguistic feature further comprises envelopes and/or an auditory spectrogram of the target sound.

9. The device of claim 1, wherein the phonemic feature comprises a cohort size at each phoneme, a cohort reduction variable, or a combination thereof.

10. The device of claim 1, wherein the phonotactic feature comprises a phonotactic probability.

11. The device of claim 1, wherein the semantic feature comprises a semantic vector.

12. A method for improving intelligibility of a subject, comprising: receiving electroencephalogram (EEG) data of the subject, wherein the subject is exposed to a target sound; estimating a linguistic feature of the target sound using a regression model, wherein the linguistic feature is selected from the group consisting of a phonemic feature, a phonotactic feature, a semantic feature, and combinations thereof; and decoding the linguistic features from the EEG data.

13. The method of claim 12, wherein the linguistic feature is estimated from the target sound, the EEG data, or a combination thereof.

14. The method of claim 12, further comprising training the regression model by providing training data, wherein the training data comprises electroencephalogram (EEG) data, linguistic feature data, or a combination thereof.

15. The method of claim 12, further comprising measuring electroencephalogram (EEG) data from the subject exposed to the target sound.

16. The method of claim 15, further comprising assessing an accuracy of the decoding by predicting a brain response of the subject exposed to the target sound.

17. The method of claim 12, further comprising estimating a temporal response function (TRF) for the linguistic features.

18. The method of claim 17, further comprising assessing language proficiency and/or nativeness of the subject based on the TRF for the linguistic features.

19. The method of claim 12, further comprising estimating envelopes and/or an auditory spectrogram of the target sound.

20. The method of claim 12, wherein the phonemic feature comprises a cohort size at each phoneme, a cohort reduction variable, or a combination thereof, wherein the phonotactic feature comprises a phonotactic probability, wherein the semantic feature comprises a semantic vector.

Description:
MEASURING LANGUAGE PROFICIENCY FROM

ELECTROENCEPHALOGRAPHY DATA

CROSS-REFERENCE TO RELATED APPLICATION

This application claims priority to U.S. Provisional Patent Application No. 62/889,478, which was filed on August 20, 2019, the entire contents of which are incorporated by reference herein.

STATEMENT REGARDING FEDERALLY SPONSORED RESEARCH

This invention was made with government support under R01DC014279 awarded by the National Institutes of Health. The government has certain rights in the invention.

BACKGROUND

The human brain can process a variety of acoustic and linguistic properties for speech comprehension. Speech perception can be underpinned by a hierarchical neural network that processes different linguistic properties in distinct interconnected cortical layers. For example, a person can attend to one speaker in a multi-speaker and noisy environment while inhibiting and masking out the unattended speech (e.g., the so-called cocktail party phenomenon). Furthermore, the language acquisition process is dependent on the pre-existing knowledge of a person. For example, learning a second language can be a challenging process that differs from native language acquisition.

As brain functions can affect speech comprehension, it can be advantageous to develop hearing aid devices, which can monitor the brain functions and improve the intelligibility of a user. There is a need for improved systems and methods for assessing language proficiency and improving the intelligibility of a user.

SUMMARY

The disclosed subject matter provides systems and methods for improving intelligibility of a subject. An example device can include one or more processors adapted to implement a regression model. The regression model can be configured to receive electroencephalogram (EEG) data of the subject, wherein the subject is exposed to a target sound, estimate a linguistic feature from the target sound, the EEG data, or a combination thereof, and decode the linguistic features from the EEG data.

In certain embodiments, the linguistic feature can be selected from a phonemic feature, a phonotactic feature, a semantic feature, and combinations thereof. In non-limiting embodiments, the phonemic feature can include a cohort size at each phoneme, a cohort reduction variable, or a combination thereof. In some embodiments, the phonotactic feature can include a phonotactic probability. In certain embodiments, the semantic feature can include a semantic vector. In non-limiting embodiments, the linguistic feature can further include envelopes and/or an auditory spectrogram of the target sound.

In certain embodiments, the device can further include one or more sensor components for obtaining the EEG data of the subject. The sensor components can be coupled to the one or more processors via a wired connection or a wireless connection. In non-limiting embodiments, the device can be configured to assess the accuracy of the decoding by predicting a brain response of the subject exposed to the target sound.

In certain embodiments, the processors can be configured to be trained by receiving training data. The training data can include electroencephalogram (EEG) data, linguistic feature data, or a combination thereof.

In certain embodiments, the regression model can be configured to estimate a temporal response function (TRF) for the linguistic features. In non-limiting embodiments, the device can be configured to assess language proficiency and/or native status of the subject based on the TRF.

In certain embodiments, the device can further include one or more sound input components for collecting sounds and one or more sound output components. The sound input and output components can be coupled to the processors via a wired connection or a wireless connection. In non-limiting embodiments, the processors can be configured to amplify the target sound and/or decrease non-target sounds through the sound output components to facilitate hearing of the subject based on the predicted responses.

An example method according to the disclosed subject matter can include receiving EEG data of the subject, wherein the subject is exposed to a target sound, estimating a linguistic feature of the target sound using a regression model, and decoding the linguistic features from the EEG data. The linguistic feature can be selected from a phonemic feature, a phonotactic feature, a semantic feature, and combinations thereof. In non-limiting embodiments, the phonemic feature can include a cohort size at each phoneme, a cohort reduction variable, or a combination thereof. In some embodiments, the phonotactic feature can include a phonotactic probability. In non-limiting embodiments, the semantic property can include a semantic vector.

In certain embodiments, the linguistic feature can be estimated from the target sound, the EEG data, or a combination thereof. In certain embodiments, the method can include training the regression model by providing training data. In non-limiting embodiments, the training data includes electroencephalogram (EEG) data, linguistic feature data, or a combination thereof.

In certain embodiments, the method can include measuring EEG data from the subject exposed to the target sound. In non-limiting embodiments, the method can further include assessing accuracy of the decoding by predicting a brain response of the subject exposed to the target sound.

In certain embodiments, the method can include estimating a temporal response function (TRF) for linguistic features. In non-limiting embodiments, the method can further include assessing language proficiency and/or native status of the subject based on the TRF for the linguistic features.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a diagram showing an example device in accordance with the disclosed subject matter.

FIG. 2 is a block diagram illustrating an example method in accordance with the disclosed subject matter.

FIG. 3 is a block diagram illustrating an example method in accordance with the disclosed subject matter.

FIG. 4 is a block diagram illustrating an example method in accordance with the disclosed subject matter.

FIG. 5A is a block diagram illustrating one or more elements of the presently disclosed techniques in accordance with the disclosed subject matter.

FIG. 5B is a graph showing example speech descriptors extracted from the audio waveform in accordance with the disclosed subject matter.

FIGs. 6A-6E are exemplary stimulus-EEG temporal response function (TRF) assessments for acoustics and phonetic features in accordance with the disclosed subject matter.

FIGs. 7A-7C provide images showing TRFs for various linguistic features. FIG. 7A provides graphs and images showing TRF analysis for phonotactic features. FIG. 7B provides graphs and images showing TRF analysis for semantic features. FIG. 7C provides graphs showing predicted language proficiency in accordance with the disclosed subject matter.

FIG. 8A provides an example speech hierarchy in accordance with the disclosed subject matter. FIG. 8B provides example electrophysiological recordings in accordance with the disclosed subject matter. FIG. 8C provides an example regression model for predicting the brain signal in accordance with the disclosed subject matter. FIG. 8D provides an example decision model in accordance with the disclosed subject matter.

FIG. 9A provides an exemplary graph showing the average AAD accuracy in accordance with the disclosed subject matter. FIG. 9B provides an exemplary graph showing the AAD performance in accordance with the disclosed subject matter. FIG. 9C provides an exemplary graph showing the AAD improvements in accordance with the disclosed subject matter.

FIGs. 10A-10E provide exemplary topoplots of correlations between the estimated and actual brain response in accordance with the disclosed subject matter.

FIG. 11 provides exemplary topoplots of attended and unattended linear model’s coefficients at various time intervals in accordance with the disclosed subject matter.

FIG. 12A provides a map showing an example correlation between the disclosed features at various linguistic levels in accordance with the disclosed subject matter. FIG. 12B provides a map showing the correlation between shuffled in time features at various linguistic levels.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and are intended to provide further explanation of the disclosed subject matter.

DETAILED DESCRIPTION

The disclosed subject matter provides techniques for improving the intelligibility of a subject. The disclosed subject matter provides methods and devices for assessing the language proficiency of the subject and improving the intelligibility of the subject to a target sound. The disclosed subject matter can be used for reconstructing intelligible speech from the human brain.

As used herein, the term "about" or "approximately" means within an acceptable error range for the particular value as determined by one of ordinary skill in the art, which will depend in part on how the value is measured or determined, i.e., the limitations of the measurement system. For example, "about" can mean within 3 or more than 3 standard deviations, per the practice in the art. Alternatively, "about" can mean a range of up to 20%, preferably up to 10%, more preferably up to 5%, and more preferably still up to 1% of a given value. Alternatively, particularly with respect to biological systems or processes, the term can mean within an order of magnitude, preferably within 5-fold, and more preferably within 2-fold, of a value.

As used herein, the term “coupled” refers to the connection of a system component to another system component by a suitable connection technique known in the art. The type of coupling used to connect two or more system components can depend on the scale and operability of the system. For example, and not by way of limitation, the coupling of two or more components of a system can include connecting the imaging device to the imaging processor via a wired connection and/or a wireless connection.

A "subject" herein can be a human or a non-human animal, for example, but not by limitation, rodents such as mice, rats, hamsters, and guinea pigs; rabbits; dogs; cats; sheep; pigs; goats; cattle; horses; and non-human primates such as apes and monkeys, etc. As used herein, the term “intelligibility” refers to a measure of how comprehensible speech is in given conditions or the proportion of a speaker's output that a listener can readily understand.

As shown in Fig. 1, an example device 100 for improving intelligibility of a subject can include one or more processors 101. In certain embodiments, the one or more processors 101 can include a regression model 102. The regression model 102 can be software and/or instructions operable, when executed by the one or more processors, to operate the device 100. The regression model 102 can be configured to cause the device to receive and analyze electroencephalogram (EEG) data 103 and/or sound data 104. For example, the regression model 102 can estimate temporal response functions (TRFs) 105 that can map a stimulus to the subject's brain. For example, a single input event at time t0 can affect the neural signals for a certain time-window [t1, t1+twin], with t1 > 0 and twin > 0. Temporal response functions (TRFs) can describe the speech-EEG mapping within that latency-window for each EEG channel. The TRF can be obtained by a regression model that estimates a filter to optimally predict the neural response from the stimulus features (e.g., a forward model). In non-limiting embodiments, the input of the regression can include time-shifted versions of the stimulus features, so that the time-lags in the latency-window of interest can be simultaneously considered. The regression weights can reflect the relative importance between time-latencies to the stimulus-EEG mapping. The disclosed subject matter can infer the temporal dynamics of the speech responses using the regression model. In some embodiments, the reliability of the TRF models can be assessed using leave-one-out cross-validation across trials, which quantifies the EEG prediction correlation (Pearson's r) on unseen data while controlling for overfitting. In certain embodiments, the regression model 102 can analyze the EEG data 103 and extract various features to identify the relationship (e.g., coupling or TRFs) between the EEG signals 103 and the sound source 104. For example, high-level and/or low-level linguistic features can be extracted from the EEG signals and the sound source by comparing the EEG signals and the sound sources. In non-limiting embodiments, the high-level linguistic features can include a phonetic feature, a phonotactic feature, a semantic feature, and a combination thereof. In some embodiments, the low-level linguistic features can include an envelope and/or a spectrogram of the sound source.
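By way of non-limiting illustration, the forward TRF estimation described above can be sketched as a ridge regression on a time-lagged stimulus matrix. The sketch below is a minimal illustration under assumptions that are not part of the disclosed subject matter: the variable names (stim, eeg, fs), the fixed regularization value, and the random data are purely illustrative.

# Minimal sketch of a forward TRF fit (stimulus -> EEG) with ridge regression.
# Assumes: stim is a (n_samples,) stimulus feature, eeg is (n_samples, n_channels),
# both sampled at fs Hz. Names and the fixed lambda are illustrative only.
import numpy as np

def lagged_design(stim, fs, tmin=0.0, tmax=0.6):
    """Build a design matrix of time-shifted copies of the stimulus."""
    lags = np.arange(int(round(tmin * fs)), int(round(tmax * fs)) + 1)
    n = len(stim)
    X = np.zeros((n, len(lags)))
    for j, lag in enumerate(lags):
        X[lag:, j] = stim[:n - lag] if lag > 0 else stim
    return X

def fit_trf(stim, eeg, fs, lam=1e2):
    """Ridge regression: one TRF (weights over time-lags) per EEG channel."""
    X = lagged_design(stim, fs)
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ eeg)

def predict_eeg(stim, trf, fs):
    return lagged_design(stim, fs) @ trf

# Example: evaluate with Pearson's r on held-out data (the text uses
# leave-one-trial-out cross-validation); random data stand in for recordings.
rng = np.random.default_rng(0)
fs = 64
stim = rng.standard_normal(fs * 60)
eeg = rng.standard_normal((fs * 60, 62))
trf = fit_trf(stim[:fs * 50], eeg[:fs * 50], fs)
pred = predict_eeg(stim[fs * 50:], trf, fs)
r = [np.corrcoef(pred[:, c], eeg[fs * 50:, c])[0, 1] for c in range(eeg.shape[1])]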

In certain embodiments, the regression model can be a linear regression model or a non-linear regression model.

In non-limiting embodiments, machine learning techniques (e.g., convolutional neural network models (CNN)) can be used to predict language proficiency from the neural recording of a subject. The disclosed models can capture nonlinear relationships between the signals in addition to estimating prior probabilities, which can improve prediction accuracy.

In certain embodiments, the phonetic feature can include a cohort size at each phoneme, a cohort reduction variable, or a combination thereof. The cohort of a speech can be defined as the set of words or lexical units that match the acoustic input of a word at any time point during the expression of the word. The cohort of each phoneme can be defined by selecting all the lexical items in the dictionary that have a similar phoneme sequence, starting at the beginning of the lexical unit, to that of the phoneme sequence from the beginning of the word to the current phoneme. The cohort size can be estimated for each phoneme by calculating the log of the number of words in the defined cohort set. With the cohort size at each phoneme, the cohort reduction variable can be defined to be equal to the cohort size at the current phoneme minus the cohort size at the previous phoneme (or, in the case of the initial phoneme, the log of the number of words in the dictionary). In non-limiting embodiments, the phonetic feature can be a categorical descriptor. Because of the correlation between phoneme and acoustic descriptors, TRFs can be fit between their concatenation and the EEG signal. The acoustic descriptor can be a nuisance regressor that can reduce the impact of acoustic-only responses on the TRF model for phonetic features (TRF). As such, this multivariate approach can estimate TRFs for phonetic features and their linear projection to phoneme TRF by reducing the impact of acoustic-only responses.
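By way of non-limiting illustration, the cohort computation described above can be sketched as follows: the cohort size is the log of the number of dictionary words sharing the phoneme prefix, and the cohort reduction is the difference from the previous phoneme. The tiny dictionary and names below are illustrative assumptions only.

# Sketch of cohort size and cohort reduction per phoneme, as described above.
# The miniature pronunciation dictionary is purely illustrative.
import math

lexicon = {
    "cat": ["K", "AE", "T"],
    "cap": ["K", "AE", "P"],
    "can": ["K", "AE", "N"],
    "dog": ["D", "AO", "G"],
}

def cohort_features(word_phonemes, lexicon):
    sizes, reductions = [], []
    prev = math.log(len(lexicon))            # before any phoneme: the whole dictionary
    for k in range(1, len(word_phonemes) + 1):
        prefix = word_phonemes[:k]
        cohort = [w for w, ph in lexicon.items() if ph[:k] == prefix]
        size = math.log(len(cohort)) if cohort else float("-inf")
        sizes.append(size)
        reductions.append(size - prev)       # cohort reduction at this phoneme
        prev = size
    return sizes, reductions

sizes, reductions = cohort_features(["K", "AE", "T"], lexicon)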

In certain embodiments, the phonotactic feature can include phonotactic probabilities. The regression model can be configured to estimate the likelihood that each phoneme sequence p1..k composing a word p1..n, where 1 ≤ k ≤ n, belongs to the sound source (e.g., the language). Then, the regression model can estimate the phonotactic score (i.e., inverse phonotactic likelihood) corresponding to the negative logarithm of the likelihood of a sequence. In non-limiting embodiments, the phonotactic probabilities can be described by the cohort reduction variable. In non-limiting embodiments, the phonotactic feature can include prosody, phonological neighborhood, semantic neighborhood, or combinations thereof.

In certain embodiments, the semantic feature can include a semantic vector, which represents semantic dissimilarity. Semantic dissimilarity can be quantified as the distance of a word from the preceding semantic context, resulting in a sparse vector that marks all content words with larger values for more dissimilar words. For example, the semantic dissimilarity TRFs can be estimated.
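By way of non-limiting illustration, the semantic dissimilarity described above (one minus the correlation between a word's embedding and the average embedding of the preceding context) can be sketched as follows. The embedding lookup and the random vectors are illustrative stand-ins; the disclosure does not prescribe a particular word-vector model.

# Sketch: semantic dissimilarity of each content word with its preceding context.
# `embeddings` maps a word to a fixed-length vector; random vectors stand in for
# any word-embedding model (e.g., word2vec features as discussed in the text).
import numpy as np

def semantic_dissimilarity(content_words, embeddings):
    values = []
    for i, word in enumerate(content_words):
        vec = embeddings[word]
        if i == 0:
            values.append(1.0)                       # no preceding context in this sketch
            continue
        context = np.mean([embeddings[w] for w in content_words[:i]], axis=0)
        r = np.corrcoef(vec, context)[0, 1]
        values.append(1.0 - r)                       # larger value = more dissimilar word
    return values

rng = np.random.default_rng(0)
words = ["brain", "speech", "listener", "banana"]
embeddings = {w: rng.standard_normal(400) for w in words}
dissim = semantic_dissimilarity(words, embeddings)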

In certain embodiments, the regression model can extract low-level linguistic features from the EEG signals. The low-level linguistic features can include an envelope and/or a spectrogram of the sound source. For example, TRFs between amplitude envelope and the EEG signals can be estimated to determine the time dependencies. In non-limiting embodiments, the regression model can extract and analyze a spectrogram, which can provide additional information for the EEG signal. The spectrograms can be estimated using a model of cochlear frequency analysis, and the envelope of the sound source can be estimated by averaging over the amplitude envelope of each frequency in the spectrogram for each time point.
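By way of non-limiting illustration, the broadband envelope described above can be obtained by averaging an auditory-like spectrogram across its frequency bands. In the sketch below, a plain STFT magnitude spectrogram is an assumed stand-in for the cochlear-model spectrogram referenced in the text; names and parameters are illustrative.

# Sketch: broadband envelope as the average of a band-limited spectrogram across
# frequency bands (a simple stand-in for the cochlear-model spectrogram).
import numpy as np
from scipy.signal import spectrogram

def envelope_from_spectrogram(audio, sr, n_bands=16):
    f, t, S = spectrogram(audio, fs=sr, nperseg=512, noverlap=384)
    # Collapse the frequency axis into n_bands coarse bands, then average over bands.
    edges = np.linspace(0, S.shape[0], n_bands + 1, dtype=int)
    bands = np.stack([S[edges[i]:edges[i + 1]].mean(axis=0) for i in range(n_bands)])
    return bands, bands.mean(axis=0)          # (n_bands, n_frames), (n_frames,)

sr = 16000
audio = np.random.default_rng(0).standard_normal(sr * 2)   # 2 s of toy audio
spec16, env = envelope_from_spectrogram(audio, sr)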

In certain embodiments, the disclosed device can include one or more sensor components. As shown in FIG. 1, the sensor component 106 and the processor 101 can be coupled via a wired connection and/or a wireless connection. In non-limiting embodiments, the sensor component 106 can include a plurality of electrodes 107 for measuring electroencephalogram (EEG) data of the subject when the subject is exposed to at least one sound source (e.g., speech) 108. The electrodes 107 responsive to the sound source can be identified by comparing the measured EEG responses 103 to the sound source with EEG signals that are recorded during silence.

In certain embodiments, the regression model can be trained for each linguistic feature. For example, the regression model can receive training data. The training data can include EEG data of the subject exposed to various sound sources. The sound sources can include natural speech stories spoken by several male and female speakers to account for the natural variability of speech. In non-limiting embodiments, the regression model can be trained through machine-learning techniques. For example, a deep learning technique can be used for training the regression model.

In certain embodiments, the device can be configured to assess the language proficiency and nativeness level of the subject. The device can be configured to decode the language proficiency and the nativeness from the brain data (e.g., EEG data) and the extracted linguistic features. For example, the TRFs for the envelope, the spectrogram, the phonetic feature, the phonotactic feature, the semantic feature, or a combination thereof can be used for assessing the language proficiency and nativeness level of the subject. As such information spans three dimensions, namely EEG channels, time latencies, and stimulus features, the disclosed device can perform a multilinear principal component analysis (MPCA) on the TRF of each sound descriptor independently. The language proficiency and nativeness level of the subject can be determined by comparing the value of the TRF with control data (e.g., TRFs of native speakers, non-native speakers, low-proficiency-level speakers, high-proficiency-level speakers, etc.). In non-limiting embodiments, the disclosed device can calculate Pearson's correlations for determining the language proficiency and nativeness level of the subject.

In certain embodiments, the device can be configured to perform auditory attention decoding (AAD) to distinguish a target sound from non-target sounds. For example, the device can be used for distinguishing the attended speaker in a multi-speaker environment from the unattended speaker using the EEG data of a user. The disclosed device can further adjust the volume of the target and the non-target sounds for facilitating hearing. For example, as shown in Fig. 1, the device 100 can include one or more sound input components 109 for collecting sounds 108 and one or more sound output components 110. The device 100 can collect sounds 108 through the sound input components 109 and amplify the target sound and/or decrease non-target sounds to facilitate hearing through the output components 110. The one or more sound input components 109 and sound output components 110 can be coupled to the one or more processors 101 via a wired connection or a wireless connection.

To perform the AAD, the disclosed device can be trained with various sound stimuli/sources and/or EEG data. For example, the disclosed device can analyze the stimuli and the brain responses to concatenate them using the training data. In non-limiting embodiments, the disclosed device can estimate TRFs for the linguistic features to concatenate the stimuli to the brain responses. The linguistic features can include the envelope, the spectrogram, the phonetic feature, the phonotactic feature, the semantic feature, or a combination thereof. In non-limiting embodiments, the trained device can map the concatenated stimuli to the brain responses.

In certain embodiments, the trained device can be configured to predict a sound stimulus, a sound feature, or a combination thereof based on measured EEG data of the subject. In non-limiting embodiments, instead of using encoding models to predict the neural responses from the stimulus features (e.g., forward modeling), the disclosed device can perform the opposite (e.g., backward modeling). In this approach, the disclosed device can decode the features from the neural signals using linear or nonlinear regression models. The disclosed decoding approaches can have the advantage that the correlational structure between EEG recordings can be modeled, which can result in improved decoding accuracy.
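By way of non-limiting illustration, a backward (decoding) model as described above can be sketched as a ridge regression that reconstructs a stimulus feature (here, an envelope) from time-lagged, multi-channel EEG. Variable names, the lag window, the regularization value, and the random data are illustrative assumptions.

# Sketch of a backward model: reconstruct a stimulus feature from lagged EEG.
import numpy as np

def lagged_eeg(eeg, fs, tmin=0.0, tmax=0.4):
    """Stack time-shifted copies of every EEG channel into one design matrix."""
    lags = np.arange(int(round(tmin * fs)), int(round(tmax * fs)) + 1)
    n, n_ch = eeg.shape
    X = np.zeros((n, n_ch * len(lags)))
    for j, lag in enumerate(lags):
        shifted = np.roll(eeg, -lag, axis=0)
        if lag > 0:
            shifted[-lag:] = 0.0               # zero-pad instead of wrapping around
        X[:, j * n_ch:(j + 1) * n_ch] = shifted
    return X

def fit_decoder(eeg, target, fs, lam=1e3):
    X = lagged_eeg(eeg, fs)
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ target)

def reconstruct(eeg, w, fs):
    return lagged_eeg(eeg, fs) @ w

rng = np.random.default_rng(0)
fs, n = 64, 64 * 60
eeg = rng.standard_normal((n, 62))
envelope = rng.standard_normal(n)
w = fit_decoder(eeg[:n // 2], envelope[:n // 2], fs)
recon = reconstruct(eeg[n // 2:], w, fs)
r = np.corrcoef(recon, envelope[n // 2:])[0, 1]    # decoding accuracy as Pearson's r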

In certain embodiments, the disclosed subject matter can provide methods for improving the intelligibility of a subject. As shown in FIG. 2, an exemplary method 200 can include receiving electroencephalogram (EEG) data of the subject, wherein the subject is exposed to a target sound 201, estimating a linguistic feature of the target sound using a regression model 202, and decoding the linguistic features from the EEG data 203. The linguistic feature can be selected from the group consisting of a phonemic feature, a phonotactic feature, a semantic feature, and combinations thereof. In certain embodiments, as shown in FIG. 3, the method can include training the regression model by providing training data 301, wherein the training data comprises electroencephalogram (EEG) data, linguistic feature data, or a combination thereof. The training data can be obtained from the subject and/or others. The training data can be used for machine learning (e.g., deep learning) to concatenate the sound inputs/stimuli and the brain responses using the training data.

In certain embodiments, the method can include measuring electroencephalogram (EEG) data from the subject 304 when the subject is exposed to the target sound. The EEG data can be obtained through one or more sensor components (e.g., electrodes). In non-limiting embodiments, as shown in Fig. 4, the measured EEG data can be used for assessing the accuracy of the decoding 407 by predicting a brain response of the subject exposed to the target sound.

In certain embodiments, the method can further include estimating a temporal response function (TRF) for the linguistic features. For example, the regression model can be utilized for estimating the TRF for the envelope, the spectrogram, the phonetic feature, the phonotactic feature, the semantic feature, or a combination thereof. In non-limiting embodiments, the TRFs for the linguistic features can be used for assessing the language proficiency and nativeness level of the subject 305. The language proficiency and nativeness level of the subject can be determined by comparing the value of the TRF with control data (e.g., TRFs of native speakers, non-native speakers, low-proficiency-level speakers, high-proficiency-level speakers, etc.). In non-limiting embodiments, Pearson's correlations can be performed to compare the TRF values of the subject to the control group. The term "Pearson's correlation" refers to a measure of the strength of the association between two variables. It can be calculated by dividing the covariance of the two variables by the product of their standard deviations.
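For reference, the definition of Pearson's correlation given above corresponds to the standard formula (written here in LaTeX notation):

r_{XY} = \frac{\operatorname{cov}(X,Y)}{\sigma_X \sigma_Y} = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2}\,\sqrt{\sum_i (y_i - \bar{y})^2}}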

In certain embodiments, the method can further include measuring additional linguistic features. For example, envelopes and/or an auditory spectrogram of the target sound can be estimated and used for improving the accuracy of the predicted responses. In non-limiting embodiments, the additional linguistic features can be used in combination with the phonemic feature, the phonotactic feature, and the semantic feature to improve the accuracy. The phonemic feature can include a cohort size at each phoneme, a cohort reduction variable, or a combination thereof. The phonotactic feature can include a phonotactic probability. The semantic feature can include a semantic vector.

In certain embodiments, as shown in Fig. 4, the method can further include assessing accuracy of the decoding 407 by predicting a brain response of the subject exposed to the target sound 406. This technique can be used for performing auditory attention decoding (AAD) to distinguish a target sound from non-target sounds.

EXAMPLES

The following Examples are offered to illustrate the disclosed subject matter but are not to be construed as limiting the scope thereof. The presently disclosed subject matter will be better understood by reference to the following Examples. The Examples are provided as merely illustrative of the disclosed methods and systems and should not be considered as limiting in any way.

EXAMPLE 1: How Proficiency Shapes The Hierarchical Cortical Encoding of Non-Native Speech

Learning a second language (L2) can be a challenging process that differs from the native language (L1) acquisition. Age at the time of learning, frequency of exposure to the language, and interest can be factors that can make a person more or less successful in mastering a second language. Younger learners can generally achieve proficiency levels that are more native-like than older learners. Even in the presence of optimal conditions, adult learners rarely fully master a second language with native-like proficiency. One typical difference can be the "foreign accent" that can characterize L2 speakers, who tend to carry L1 features to their L2 (e.g., phonetic). Similar considerations can be made for speech listening, which can elicit different cortical patterns in L1 and L2 users and for different L2 proficiency levels. However, the precise neural underpinnings of L2 listening and the differences with L1 processing remain poorly understood and somewhat controversial.

One issue that can shed light on L2 acquisition can be determining how increased proficiency shapes the cortical correlates of speech listening. Furthermore, the ability to objectively assess such a transformation can provide information regarding whether and how closely the brain dynamics for increasing L2 proficiency-levels converge to LI processing.

The disclosed subject matter can utilize a framework, which includes a linear modeling approach that allows deriving objective measures of speech perception at multiple linguistic levels from a single electrophysiological recording during which participants listen to continuous natural speech, to investigate how proficiency shapes the hierarchical cortical encoding of L2 speech and how that differs from L1 speech encoding. The disclosed subject matter can be used for analyzing speech processing at the levels of sound acoustics, phonemes, phonotactics, and semantics.

The disclosed subject matter was utilized to assess how proficiency modulates brain responses to all the investigated linguistic properties. Phoneme encoding was expected to become more native-like with proficiency, but not to fully converge. The processing of phonotactics (statistics on phoneme sequences) can occur, at least in part, as a form of implicit learning. Therefore, the encoding of phonotactics was expected to gradually become more native-like with proficiency, a change that would occur across all proficiency levels, even with no speech comprehension. The disclosed subject matter was also utilized to identify whether L2 phonotactics is encoded separately from L1 or remains influenced by L1 statistics.

Different neural patterns were assessed for semantic-level encoding, which can rapidly change at an intermediate level of proficiency. Understanding a few words can facilitate the comprehension of the neighboring ones (semantic priming), determining a turning point beyond which comprehension increases. The disclosed subject matter also includes EEG analyses on word-pair listening, which can indicate that the encoding of semantic dissimilarity can be molded by proficiency. The disclosed subject matter was also utilized to combine cortical measures of acoustic and linguistic speech processing to assess the differences between L1 and L2 encoding during natural speech listening, showing effects of nativeness, proficiency, and a difference between L1 and L2 perception that goes beyond proficiency.

EEG data acquisition and preprocessing: Fifty-two healthy subjects (24 male, aged between 18 and 60, with median = 24 and mean = 26.8) who learned English as a second language (or who did not speak English) participated in the EEG experiment. All the subjects reported having normal hearing and having no history of neurological disorder. The 52 subjects were divided into three groups with different proficiencies in American English listening. They were asked to take a 20-minute English listening level test before the experiment, which provides a CEF level (Common European Framework of Reference for Languages). The CEF levels include C2, C1, B2, B1, A2, A1, and A0 (from high proficiency to low proficiency). Subjects who scored C2 or C1 were classified as Proficient users, subjects who scored at the B level were defined as Independent users, subjects who scored A1 or A2 were categorized as Basic users, and subjects who scored A0 were characterized as No English. Each subject reported no history of hearing impairment or neurological disorder, provided written informed consent, and was paid for their participation. The experiment was carried out in a single session for each participant. EEG data were recorded from 62 electrode positions, digitized at 512 Hz using a BioSemi Active Two system. Audio stimuli were presented at a sampling rate of 44,100 Hz using Sennheiser HD650 headphones and Presentation software. Testing was carried out in a dark room, and subjects were instructed to maintain visual fixation on a crosshair centered on the screen and to minimize motor activities while the stimuli were presented.

Neural data were analyzed offline using MATLAB software (The Mathworks Inc). EEG signals were digitally filtered between 1 and 15 Hz using a Butterworth zero-phase filter (low- and high-pass filters, both of order 2) and down-sampled to 64 Hz. EEG channels with a variance exceeding three times that of the surrounding ones were replaced by an estimate calculated using spherical spline interpolation. All channels were then re-referenced to the average of the two mastoid channels with the goal of maximizing the EEG responses to the auditory stimuli.
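By way of non-limiting illustration, the filtering and down-sampling steps described above can be sketched with standard signal-processing routines; the filter design below is a minimal stand-in (channel interpolation and mastoid re-referencing are omitted), and the random data and names are assumptions.

# Sketch of the preprocessing described above: zero-phase Butterworth band-pass
# (order-2 low- and high-pass sections, 1-15 Hz) followed by resampling to 64 Hz.
import numpy as np
from scipy.signal import butter, filtfilt, resample_poly

def preprocess_eeg(eeg, fs_in=512, fs_out=64):
    b_hp, a_hp = butter(2, 1.0, btype="highpass", fs=fs_in)
    b_lp, a_lp = butter(2, 15.0, btype="lowpass", fs=fs_in)
    filtered = filtfilt(b_lp, a_lp, filtfilt(b_hp, a_hp, eeg, axis=0), axis=0)
    return resample_poly(filtered, fs_out, fs_in, axis=0)

rng = np.random.default_rng(0)
raw = rng.standard_normal((512 * 10, 62))     # 10 s of 62-channel EEG at 512 Hz
eeg = preprocess_eeg(raw)                     # -> shape (64 * 10, 62)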

Fig. 5A shows an example design of the disclosed subject matter. Participants 501 listened to natural speech sentences 502 while 64-channel EEG signals 503 were recorded. Speech descriptors 504 were extracted from the audio waveform capturing various acoustic and linguistic properties. Multivariate linear regression models were fit between each speech descriptor and the preprocessed EEG signal to describe the temporal response function at various linguistic levels. As shown in Fig. 5B, speech descriptors were extracted from the audio waveform to capture the speech acoustics, phonemes, phonotactics, and semantic dissimilarity.

Stimuli and experimental procedure: EEG data were collected in a sound-proof, electrically shielded booth in dim light conditions. Participants listened to short stories narrated by two speakers (1 male) while minimizing motor movements and maintaining visual fixation on a crosshair at the center of the screen. The male and the female narrators were alternated to minimize speaker-specific electrical effects. Stimuli were presented at a sampling rate of 44,100 Hz, monophonically, and at a comfortable volume from loudspeakers in front of the participant. Each session consisted of 20 experimental blocks (3 min each), divided into five sections that were interleaved by short breaks. Participants were asked to attend to speech material from seven audio-stories that were presented in a random order. The engagement with the speech material was assessed by means of behavioural tasks. L2 participants were asked three questions at the end of each block. First, participants were asked whether the last sentence of the section was spoken by a male or female speaker. Next, participants were asked to identify 3-5 high-frequency words in the sentence from a list of eight words. Third, participants performed a phrase-repetition detection task. Specifically, the last two to four words were repeated immediately after the end of some of the sentences (1-5 per block). Given that the target was monitoring attention, a finger-tip clicker was used to count the repetitions so that participants would be engaged in a detection task rather than a counting task, which would instead require additional memory resources and, potentially, reduce their engagement with the main listening task. Participants were asked to indicate how many sentences in the story presented these repetitions at the end of each block. To assess attention in L1 participants, three questions about the content of the story were asked after each block. All L1 participants were attentive and able to answer correctly at least 60% of the questions.

Speech features: The coupling between the EEG data and various properties of the speech stimuli was assessed. These properties were extracted from the stimulus data based on previous research. First, a set of descriptors summarizing low-level acoustic properties of the speech stimuli was defined. Specifically, a time-frequency representation of the speech sounds was calculated using a model of the peripheral auditory system consisting of three stages: (1) a cochlear filterbank with 128 asymmetric filters equally spaced on a logarithmic axis, (2) a hair cell stage consisting of a low-pass filter and a nonlinear compression function, and (3) a lateral inhibitory network consisting of a first-order derivative along the spectral axis. Finally, the envelope was estimated for each frequency band, resulting in a two-dimensional representation simulating the pattern of activity on the auditory nerve (the relevant MATLAB code is available at https://isr.umd.edu/Labs/NSL/Software.htm). This acoustic spectrogram (S) was then resampled to 16 bands. A broadband envelope descriptor (E) was also obtained by averaging all envelopes across the frequency dimension. Finally, the half-way rectified first derivative was used as an additional descriptor, which was shown to contribute to the speech-EEG mapping and was used here to regress out acoustic-related responses as much as possible. Additional speech descriptors were defined to capture neural signatures of higher-order speech processing. The speech material was segmented into time-aligned sequences of phonemes using the Penn Phonetics Lab Forced Aligner Toolkit, and the phoneme alignments were then manually corrected using Praat software. Phoneme onset times were then encoded in an appropriate univariate descriptor (Pon), where ones indicate an onset and all other time samples were marked with zeros. An additional descriptor was also defined to distinguish between vowels and consonants (Pvc). Specifically, this regressor consisted of two vectors, similar to Pon, but marking either vowels or consonants only. While this information was shown to be particularly relevant when describing the cortical responses to speech, there remains additional information on phoneme categories that contributes to those signals. This information was encoded in a 19-dimensional descriptor indicating the phonetic articulatory features corresponding to each phoneme (Phn). Features indicated whether a phoneme was voiced, unvoiced, sonorant, syllabic, consonantal, approximant, plosive, strident, labial, coronal, anterior, dorsal, nasal, fricative, obstruent, front (vowel), back, high, or low. The Phn descriptor encoded this categorical information as step functions, with steps corresponding to the starting and ending time points for each phoneme. Next, phonotactic probability information was encoded in an appropriate two-dimensional vector (Pt). Probabilities were derived by means of the BLICK computational model, which estimates the probability of a phoneme sequence to belong to the English language. This model is based on a combination of explicit theoretical rules from traditional phonology and a maxent grammar, which finds optimal weights for such constraints to best match the phonotactic intuition of native speakers. The phonotactic probability was derived for all phoneme sub-sequences within a word (ph1..k, 1 ≤ k ≤ n, where n is the word length) and used to modulate the magnitude of a phoneme onset vector (Pt1).
A second vector was produced to encode the change in phonotactic probability due to the addition of a phoneme (ph1..k − ph1..k−1, 2 ≤ k ≤ n) (Pt2). Finally, a semantic dissimilarity descriptor was calculated for content words using word2vec, a state-of-the-art algorithm consisting of a neural network for the prediction of a word given the surrounding context. In this specific application, a sliding window of 11 words was used, where the central word was the output and the surrounding 10 words were the input. This approach is based on the "distributional hypothesis" that words with similar meaning occur in similar contexts, and it uses an artificial neural network approach to capture this phenomenon. This network has a 400-dimensional hidden layer that is fully connected to both input and output. For our purposes, the weights of this layer are the features used to describe each word in a 400-dimensional space capturing the co-occurrence of a content word with all others. In this space, words that share similar meanings will have closer proximity. The semantic dissimilarity indices are calculated by subtracting from 1 the Pearson's correlation between a word's feature vector and the average feature vector across all previous words in that particular sentence (the first word in a sentence was instead correlated with the average feature vector for all words in the previous sentence). Thus, if a word is not likely to co-occur with the other words in the sentence, it does not correlate with the context, resulting in a higher semantic dissimilarity value. The semantic dissimilarity vector (Sem) marks the onset of content words with their semantic dissimilarity index.
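By way of non-limiting illustration, the sparse regressors described above (e.g., Pon, Pt1, and Sem; Pt2 is built the same way from the probability changes) can be laid out on the EEG time axis as impulses at phoneme or word onsets, scaled by the corresponding score. The onset times, scores, and names below are illustrative assumptions.

# Sketch: sparse stimulus regressors aligned to the EEG sampling rate.
import numpy as np

def onset_vector(onset_times_s, values, n_samples, fs):
    vec = np.zeros(n_samples)
    idx = np.round(np.asarray(onset_times_s) * fs).astype(int)
    keep = idx < n_samples
    vec[idx[keep]] = np.asarray(values)[keep]
    return vec

fs, n = 64, 64 * 10                                 # 10 s at the EEG rate
phoneme_onsets = [0.12, 0.21, 0.33, 0.48]           # e.g., from a forced aligner
phonotactic_scores = [1.8, 0.6, 2.4, 0.9]           # e.g., negative log-likelihoods
word_onsets = [0.12, 0.48]
dissimilarity = [0.7, 1.2]

Pon = onset_vector(phoneme_onsets, np.ones(len(phoneme_onsets)), n, fs)
Pt1 = onset_vector(phoneme_onsets, phonotactic_scores, n, fs)
Sem = onset_vector(word_onsets, dissimilarity, n, fs)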

Computational model and data analysis: A single input event at time t0 affects the neural signals for a certain time-window [t1, t1+twin], with t1 > 0 and twin > 0. Temporal response functions (TRFs) were fit to describe the speech-EEG mapping within that latency-window for each EEG channel. This was performed by means of a regularized linear regression that estimates a filter allowing optimal prediction of the neural response from the stimulus features (forward model). The input of the regression also included time-shifted versions of the stimulus features, so that the various time-lags in the latency-window of interest were all simultaneously considered. Therefore, the regression weights reflect the relative importance between time-latencies to the stimulus-EEG mapping and were here studied to infer the temporal dynamics of the speech responses. Here, a time-lag window of 0-600 ms was used to fit the TRF models, which is thought to contain most of the EEG responses to speech of interest. The reliability of the TRF models was assessed using a leave-one-out cross-validation procedure (across trials), which quantified the EEG prediction correlation (Pearson's r) on unseen data while controlling for overfitting. Note that the correlation values are calculated with noisy EEG signals; therefore, the r-scores can be highly significant even though they have low absolute values (r ~ 0.1 for sensor-space low-frequency EEG). Stimulus descriptors at the levels of acoustics, phonemes, phonotactics, and semantics were combined in a single TRF model fit procedure. This strategy was adopted with the goal of discerning EEG responses at different processing stages. In fact, larger weights are assigned to regressors that are most relevant for predicting the EEG. For example, a TRF derived with Pt alone can reflect EEG responses to phonotactics and phoneme onsets. A TRF based on the combination of Pt and Pon would instead discern their respective EEG contributions, namely by assigning larger weights to Pt for latencies that are most relevant to phonotactics. Here, individual-subject TRFs were fit by combining Sgr, Phn, Pon, Pt, and Sem, which provided a higher level of detail on spectrotemporal and phonological speech features, at the cost of higher dimensionality. The TRF weights constitute good features to study the spatio-temporal relationship between a stimulus feature and the neural signal. However, studying this relationship for a multivariate speech descriptor, such as Phn, requires the identification of criteria to combine multiple dimensions of TRF weights. One solution is to use the EEG prediction correlation values to quantify the goodness of fit for a multivariate TRF model. Here, the relative enhancement in EEG prediction correlation was considered when Phn was included in the model, thus allowing the relative contribution of phonetic features to the neural signal to be discerned. This isolated index of phoneme-level processing was also shown to correlate with psychometric measures of phonological skills. Additional analyses were conducted with a generic modelling approach. Specifically, one generic TRF model was derived for each of the groups A, B, C, and L1 by averaging the regression weights from all subjects within the group. Then, EEG data from each left-out subject (whose data was not included in the generic models) was predicted with the four models.
The four prediction correlations were used as indicators of how similar the EEG signal from a subject was to the one expected for each of the four groups, providing a simple classifier.
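By way of non-limiting illustration, the generic-model comparison described above can be sketched as follows: a left-out subject's EEG is predicted with one group-average model per group (A, B, C, L1), and the mean prediction correlation per group serves as a similarity score and simple classifier. The predictions below are random stand-ins, not the actual group models.

# Sketch: compare a left-out subject's EEG with per-group generic-model predictions.
import numpy as np

def group_similarity(eeg, group_predictions):
    """eeg: (n_samples, n_channels); group_predictions: dict name -> same shape."""
    scores = {}
    for name, pred in group_predictions.items():
        rs = [np.corrcoef(pred[:, c], eeg[:, c])[0, 1] for c in range(eeg.shape[1])]
        scores[name] = float(np.mean(rs))
    return scores

rng = np.random.default_rng(0)
eeg = rng.standard_normal((64 * 60, 62))
preds = {g: rng.standard_normal(eeg.shape) for g in ["A", "B", "C", "L1"]}
scores = group_similarity(eeg, preds)
best_group = max(scores, key=scores.get)     # group whose model best predicts this subject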

Proficiency-level decoding: Linear regression, gradient boosting regression, and classification were performed. A grid search was used for hyperparameter selection (number of trees, tree depth), together with backward elimination. Nested-loop cross-validation was used, and feature selection was performed by means of factor analysis.

Statistical analysis: Statistical analyses were performed using two-tailed permutation tests for pair-wise comparisons. Correction for multiple comparisons was applied where necessary via the false discovery rate (FDR) approach. One-way ANOVA was used when testing the significance of an effect over multiple (> 2) groups. The values reported use the convention F(df, df_error). Greenhouse-Geisser corrections were applied when the assumption of sphericity was not met (as indicated by a significant Mauchly's test). Cohen's d was used as a measure of effect size.

62-channel EEG was recorded from 74 participants as they listened to continuous speech sentences in the English language. 52 of them were native Chinese speakers with English as a non-native language. English proficiency was assessed by means of a standardized test of receptive skills that assigned participants to seven different CEFR levels (Common European Framework of Reference for Languages): No English (A0 level), Basic user (A1 and A2 levels), Independent user (B1 and B2 levels), and Proficient user (C1 and C2 levels). The remaining 22 participants were native English speakers. To investigate low- versus higher-level brain processing of speech, linear regression models were used to measure the coupling between the low-frequency cortical signals (1-8 Hz) and progressively more abstract properties of the linguistic input. This procedure allowed, for the first time, the simultaneous assessment of L2 speech processing at several distinct linguistic levels with a single listening experiment based on ecologically valid speech stimuli.

Hierarchical cortical encoding of native and non-native speech

Envelope TRF: Forward TRF models were fit between E and the EEG signals (TRFE) for a broad time-latency window from 0 to 600 ms that can capture the time dependencies of interest. Leave-one-out cross-validation indicated that the resulting TRF models can reliably predict the EEG signal for all subjects (renv > 0, permutation test with p < 0.01, where renv indicates the average EEG prediction correlation across all electrodes when using TRFE). Fig. 6A shows the model weights of TRFE after averaging across subjects for L1 and for each of the L2 proficiency groups A 601, B 602, and C 603 (averaged across all electrodes). TRFs for all groups appear strongly temporally synchronized, which was expected for cortical responses to low-level acoustic properties. Furthermore, significant correlations between proficiency and the TRFE magnitude emerged for the negative components at speech-EEG latencies between 60 and 80 ms (p < 0.05, FDR-corrected Pearson's correlation). TRFE for L1 604 participants was more strongly correlated with A than B and C participants, showing significant differences with the TRFs for B and C. The topographical distribution of these TRFs did not show differences for distinct participant groups.

Here, the low-level auditory responses were modeled by considering the acoustic spectrogram (S), which was shown to be a better predictor of the EEG signal. However, observing TRFs for different auditory frequency bands did not provide clear-cut additional insights in this case. Although both TRFE and TRFS were promising in that they showed specific speech-EEG latencies sensitive to the effects of proficiency and nativeness, the ability to assess the functional origins of those effects is hindered by the inherent strong correlation between the acoustic spectrogram and higher-order properties such as phonemes. To clarify this issue, additional analyses were performed by investigating the TRFs in response to higher-order properties of speech.

Phoneme TRF: Phonetic feature information was represented by the categorical descriptor F, where the occurrence of a phoneme is marked by a rectangular pulse for each corresponding phonetic feature (e.g., speech features). Because of the correlation between phoneme and acoustic descriptors, TRFs were fit between their concatenation and the EEG signal. Here, the acoustic descriptor played the role of a nuisance regressor, meaning that it reduced the impact of acoustic-only responses on the TRF model for phonetic features (TRFF). In order to regress out acoustic-related responses as much as possible, the acoustic descriptor A, which consists of the concatenation of the acoustic spectrogram with the half-way rectified first derivative of the envelope, was employed. Fig. 6B and 6D show the resulting TRFs for phonetic features (nuisance regressor weights are not shown) and their linear projection to phoneme TRFs, respectively (TRFPh). To assess the effects of proficiency and nativeness, these multivariate TRFs were summarized by means of a classical multidimensional scaling analysis (MDS) that considered phonemes as objects and time-latencies as dimensions. Specifically, phonemes were first mapped to MDS space for each proficiency group independently. Phonemes in A-, B-, and C-MDS space were then projected to the L1-MDS space by means of a Procrustes analysis. The L1-L2 distance calculated in that space revealed that proficiency increases the similarity between L1 and L2 phoneme TRFs (phoneme distances in the MDS space decrease with proficiency: ANOVA, F(1.6, 62.2) = 12.2; p = 2.5*10^-5). The effect of proficiency on individual phonemic contrasts can also be assessed by means of two-dimensional plots summarising the position of each phoneme in the L1-MDS space. Fig. 6C shows the distance between L1 and L2 phonemes for each language proficiency group calculated at the electrode Cz in the L1-MDS space. The star indicates a significant effect of proficiency on the L1-L2 phoneme distance (ANOVA, p < 0.05). Error bars indicate the SE of the mean across phonemes. FIG. 6D shows topographies depicting the TRF weights for selected speech-EEG time-latencies after averaging the weights across all phonetic features. The weights reflect the relative importance between time-latencies to the stimulus-EEG mapping and allow the temporal dynamics of the speech responses to be inferred. Fig. 6E shows this information, with different grey scales indicating phonemes for L1 and L2 participants, respectively.
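By way of non-limiting illustration, the MDS and Procrustes steps described above can be sketched as follows: each group's phoneme TRFs are embedded in a low-dimensional space, the spaces are Procrustes-aligned to the L1 space, and per-phoneme L1-L2 distances are averaged. The random TRF weights and dimensions below are illustrative stand-ins.

# Sketch: phoneme-space comparison via MDS embedding and Procrustes alignment.
import numpy as np
from scipy.spatial import procrustes
from scipy.spatial.distance import pdist, squareform
from sklearn.manifold import MDS

def phoneme_space(trf_weights):
    """trf_weights: (n_phonemes, n_latencies) -> 2-D MDS embedding of the phonemes."""
    D = squareform(pdist(trf_weights))
    return MDS(n_components=2, dissimilarity="precomputed",
               random_state=0).fit_transform(D)

rng = np.random.default_rng(0)
trf_l1 = rng.standard_normal((35, 39))                 # e.g., 35 phonemes x 39 time-lags
trf_l2 = trf_l1 + 0.3 * rng.standard_normal((35, 39))  # a perturbed L2 stand-in

space_l1 = phoneme_space(trf_l1)
space_l2 = phoneme_space(trf_l2)
aligned_l1, aligned_l2, _ = procrustes(space_l1, space_l2)
l1_l2_distance = np.linalg.norm(aligned_l1 - aligned_l2, axis=1).mean()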

Phonotactics TRF: In a given language, certain phoneme sequences are more likely to be valid speech tokens than others. This is due to language-specific regularities that can be captured, for example, by means of statistical models. One such model, called BLICK, was used to estimate the likelihood that each phoneme sequence p1..k composing a word p1..n, where 1 ≤ k ≤ n, belongs to the language. This model returns the phonotactic score (inverse phonotactic likelihood) corresponding to the negative logarithm of the likelihood of a sequence. Larger values correspond to less likely (more surprising) sequences. These values were used to mark phoneme onsets in a vector, where all other time points were assigned to zero. Fig. 7A compares the corresponding TRF weights between proficiency groups at five scalp locations of interest. First, the TRF for L1 participants 701 was qualitatively similar to the ones described in the literature, with a significant negative component at time-latencies between about 300 and 500 ms. Conversely, TRF patterns for L2 participants (e.g., A 702, B 703, and C 704) showed significant negativities for latencies around 500 ms on frontal but not parietal electrodes, while a significant effect of proficiency occurred at the much earlier lags of 100-160 ms. The topographical patterns in Fig. 7A further clarify that the effect of proficiency begins to emerge at about 80 ms, while previous time latencies showed similar responses for L1 and all L2 proficiency groups.

Semantic dissimilarity TRF: A similar analysis was conducted based on semantic dissimilarity rather than phonotactic scores. Specifically, a 300-dimensional feature space was defined according to the Word2Vec algorithm. Then, semantic dissimilarity was quantified as the distance of a word from the preceding semantic context, thus resulting in a sparse vector that marks all content words with larger values for more dissimilar words (e.g., speech features). Fig. 7B shows the semantic dissimilarity TRFs for five selected scalp channels. TRFs for L1 participants 705 were consistent with the results shown by Broderick and colleagues, with a centro-parietal negativity peaking at peri-stimulus latencies of 360-380 ms. Similar TRF patterns emerged for the L2 C-level participants 706, who showed similar parietal and posterior TRFs at comparable latencies, with a trough between 360 and 420 ms. A significant effect of proficiency was measured at the Oz channel. Interestingly, a centro-frontal negativity arose in the TRFs for L2 participants (e.g., A 707, B 708, and C 706) starting from about 360 ms.

Decoding language proficiency and nativeness: The results indicate that both language proficiency and nativeness shape the cortical responses to speech at various processing stages. In order to assess the strength of these effects and their robustness at the individual-subject level, information at distinct hierarchical levels was combined to decode the language proficiency and nativeness of the participants. The first set of features for the decoding was extracted from the TRFs for E, S, F, Pt, and Sd. Because these information spaces span three dimensions, namely EEG channels, time latencies, and stimulus features (e.g., phonetic features), a multilinear principal component analysis (MPCA) was performed on the TRF of each speech descriptor independently, and the first component was retained. Second, similarities of a participant's EEG signal with those of all other subjects were calculated by means of a generic modeling approach. Specifically, a generic TRF was calculated for the A, B, C, and L1 groups by averaging the TRFs of all subjects within each group. Then, Pearson's correlations were calculated between the EEG signal of a left-out subject and the EEG predictions for each proficiency group. Finally, the last feature included in the decoding was the enhancement in EEG prediction correlation due to phonetic features (FA-A), which was previously suggested to isolate phoneme-level responses. A gradient boosting regression (based on decision trees) was run on the combination of the resulting features to decode English proficiency on the 52 non-native speakers. Fig. 7C shows that the proficiency level was reliably predicted from the EEG features, with a mean-squared error (MSE) of 1.05 and a Pearson's correlation between actual and predicted values of r = 0.84 (p = 4.68 × 10⁻¹⁵). This result was also used to classify high- vs. low-proficiency participants (C vs. A, respectively), with a 97% decoding accuracy (only 1 subject was misclassified). One potential confound is that participants in the groups A0 and A1 were not age-matched to all others. This confound was controlled in two ways. First, it was verified that proficiency could be decoded even when participants aged above 50 were removed to make the A group age-matched. Second, a factor analysis confirmed that, while age is somewhat predictive of proficiency, it explains only a small part of the decoding result. Similar considerations apply to the effect of attention.
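
As a non-limiting illustration of the decoding step, the sketch below runs a gradient boosting regression with leave-one-subject-out evaluation on a combined per-subject feature matrix (e.g., MPCA components of the TRFs, generic-model prediction correlations, and the FA-A gain). The feature matrix, proficiency labels, and model hyperparameters are hypothetical placeholders and do not reproduce the reported values.

```python
# Leave-one-subject-out gradient boosting regression on combined TRF-derived
# features to decode proficiency.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import LeaveOneOut

rng = np.random.default_rng(2)
n_subjects, n_features = 52, 10
X = rng.standard_normal((n_subjects, n_features))        # placeholder features
y = rng.integers(1, 7, size=n_subjects).astype(float)    # placeholder proficiency

preds = np.zeros(n_subjects)
for train_idx, test_idx in LeaveOneOut().split(X):
    model = GradientBoostingRegressor(random_state=0)
    model.fit(X[train_idx], y[train_idx])
    preds[test_idx] = model.predict(X[test_idx])

mse = np.mean((preds - y) ** 2)
r = np.corrcoef(preds, y)[0, 1]
print(f"MSE = {mse:.2f}, Pearson r = {r:.2f}")
```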

Further analyses were conducted to assess the effect of nativeness on the EEG responses to speech. One issue is that L1 and L2 subjects differ both in nativeness and proficiency. The TRF results show that a higher proficiency level does not always lead to EEG responses that are equivalent to those of native speakers. While there was some level of convergence for phoneme responses, this was not the case for phonotactic and semantic dissimilarity responses. A gradient boosting classification analysis on all 74 subjects indicated that nativeness can be decoded with 85% accuracy. To minimize the effect of proficiency, the same procedure was performed to classify nativeness considering only the C and L1 groups, producing a significant classification accuracy of 75%.

EXAMPLE 2: How Proficiency Shapes The Hierarchical Cortical Encoding of Non-Native Speech

The brain's ability to selectively attend to one speaker in a multi-speaker, noisy environment while inhibiting and masking out the unattended speech and noise is famously known as the Cocktail Party Phenomenon (CPP). The attended and unattended speech can be represented in the electrophysiological recordings of a user at different strengths and latencies. This representation of the multi-speaker speech provides the opportunity to decode the attended speech from the electrophysiological signal of a user, one application of which is the design of hearing aids for patients with hearing disabilities. Forward auditory attention decoding (AAD) can be performed by using a regularized linear regression model to reconstruct the brain signal from the attended and unattended stimuli, comparing the reconstruction result with the original brain signal, and determining the attended speaker. This can be performed by extracting the envelope and acoustic features of the attended and unattended speech.

Speech can be broken down into and represented at multiple levels. Low-level and high-level linguistic features can be represented in the electrophysiological recordings. At the lowest level, speech can be represented by its acoustic features, such as its frequency content and envelope. Moving to higher levels, speech can be segmented into distinct units of sound called phonemes (e.g., /b/ and /d/ as in /bad/ versus /dad/). Phonotactic probabilities describe the likelihood that a specific sequence of phonemes occurs in a given language.

Combinations of syllables (particular sequences of phonemes) form high-level linguistic components, namely words, which convey semantic information (Fig. 8A). The addition of high-level features to acoustic features has been shown to enhance the reconstruction of brain signals in a single-speaker task. This raises the question of whether these high-level features can be used to identify the attention of a subject. Various features at different linguistic levels that are extracted from the same speech share some commonalities while having their own unique patterns. The representation of these features in the electrophysiological recordings raises the question of whether a model based on the integration of all these features for AAD can achieve better performance compared to models that only use the envelope of the speech.

Here, the EEG recordings of 16 subjects in a 2-speaker attention task were assessed, and forward models were trained for AAD based on each feature individually and on their combination to examine how their integration would affect the model's performance (Fig. 8).

Participants: 16 native American English speakers were studied in this paper, who, to the best of their knowledge, had no hearing conditions and self-reported normal hearing.

Stimuli and procedure: Subjects were placed in an electrically shielded, sound-proof booth, where their EEG data were collected. The stimuli, multiple short stories, were played to the subjects monophonically through a loudspeaker placed in front of the participants. The loudspeaker volume was set to a comfortable, sufficiently loud, and constant level. Participants were instructed to listen to a multi-speaker speech containing multiple short stories told simultaneously by two speakers (one male and one female voice actor). The experiment consisted of 16 blocks. Between each block, a short break was given to the participants. During each block, participants were instructed to attend only to one speaker, as specified before the block. Subjects were asked to attend to the male speaker in the initial block and to switch their attention after each subsequent block. To ensure that subjects were paying attention and engaged in the experiment, the stories were interrupted at random, on average every 15 s, and the participants were asked to write down the last sentence of the attended speech. The pausing points in the story were similar for all subjects, and the subjects were not informed of when the pauses would occur. The experiment consisted of 39 minutes and 20 s of EEG recordings, which were later divided into 20 trials. Written informed consent was provided by each subject for all procedures.

Recording: EEG data were acquired using a g.HIamp bio-signal amplifier (Guger Technologies) with 62 active electrodes assembled on an elastic cap (10-20 enhanced montage) at a sampling rate of 2 kHz. The ground was a separate frontal electrode (AFz), and the reference was subsequently set to the average of two earlobe electrodes. The earlobe electrodes were chosen as a reference because highly correlated activity across electrodes can render common average referencing ineffective. The channel impedances were maintained below 20 kΩ at all times. To remove the DC component, an online fourth-order high-pass Butterworth filter at 0.01 Hz was applied to the EEG data.

Preprocessing and Feature Extraction: For the purposes of this paper, the EEG data were passed through a bandpass filter ranging from 2 to 15 Hz. Then, the recordings were downsampled to 100 Hz. All EEG channels were checked for any abnormal behavior.
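
A minimal, non-limiting sketch of this preprocessing (2-15 Hz band-pass followed by downsampling to 100 Hz) is given below; the filter order, the use of zero-phase filtering, and the toy data are assumptions.

```python
# Band-pass filter the EEG between 2 and 15 Hz, then downsample from 2 kHz
# to 100 Hz.
import numpy as np
from scipy.signal import butter, filtfilt, resample_poly

fs_in, fs_out = 2000, 100                         # original and target rates (Hz)
b, a = butter(4, [2, 15], btype="bandpass", fs=fs_in)

def preprocess(eeg):
    """eeg: array of shape (n_samples, n_channels) sampled at fs_in."""
    filtered = filtfilt(b, a, eeg, axis=0)                    # zero-phase band-pass
    return resample_poly(filtered, fs_out, fs_in, axis=0)     # 2 kHz -> 100 Hz

eeg_raw = np.random.randn(fs_in * 5, 62)          # 5 s of toy 62-channel EEG
eeg_clean = preprocess(eeg_raw)
```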

The auditory spectrograms of the speech stimuli were estimated using a model of cochlear frequency analysis. The envelopes of the attended and unattended speech were estimated by averaging over all the frequencies in the spectrogram at each time point. To find the phoneme and word timings, the Penn Phonetics Lab Forced Aligner (P2FA) was used. Phonetic features were calculated by mapping each phoneme to its 22 binary phoneme attributes, which determine the voicing, manner, and place of articulation of each phoneme (e.g., voiced, nasal).
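
The sketch below illustrates, in a non-limiting way, how phonemes could be mapped to binary attribute vectors at their onsets; the attribute table is a hypothetical subset of the 22 attributes, and the phoneme labels and onset times are placeholders.

```python
# Build a (time x attribute) phonetic-feature matrix with pulses at phoneme
# onsets, each pulse carrying the phoneme's binary attributes.
import numpy as np

ATTRIBUTES = ["voiced", "nasal", "plosive", "fricative", "bilabial", "alveolar"]
PHONEME_TO_ATTRS = {                 # hypothetical subset, not the full 22-attribute table
    "B": {"voiced", "plosive", "bilabial"},
    "D": {"voiced", "plosive", "alveolar"},
    "S": {"fricative", "alveolar"},
    "M": {"voiced", "nasal", "bilabial"},
}

def phonetic_feature_matrix(phonemes, onsets_s, duration_s, fs=100):
    F = np.zeros((int(duration_s * fs), len(ATTRIBUTES)))
    for ph, onset in zip(phonemes, onsets_s):
        for attr in PHONEME_TO_ATTRS.get(ph, ()):
            F[int(round(onset * fs)), ATTRIBUTES.index(attr)] = 1.0
    return F

F = phonetic_feature_matrix(["B", "S", "M"], [0.10, 0.42, 0.75], duration_s=1.0)
```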

The cohort of a spoken word can be defined as the set of words or lexical units that match the acoustic input of the word at any time point during its expression. The cohort of each phoneme was defined by selecting all the lexical items in the dictionary whose phoneme sequence, starting at the beginning of the lexical unit, matched the phoneme sequence from the beginning of the word to the current phoneme. Then, for each phoneme, the cohort size was calculated by taking the log of the number of words in the cohort set.

Given the cohort size at each phoneme, the cohort reduction variable was defined as the cohort size at the current phoneme minus the cohort size at the previous phoneme or, in the case of the initial phoneme, minus the log of the number of words in the dictionary. AAD performance improves when the phonotactic probabilities are described by the cohort reduction variable.
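
A non-limiting sketch of how the cohort size and cohort reduction variables could be computed from a pronunciation lexicon, following the definitions above, is shown below; the mini-lexicon and phoneme labels are hypothetical placeholders.

```python
# For each phoneme position within a word, the cohort is the set of dictionary
# entries whose pronunciation starts with the same phoneme prefix; the cohort
# size is the log of that set's size, and the cohort reduction is its change
# from the previous position (or from the log dictionary size at onset).
import numpy as np

PRONUNCIATIONS = {                   # hypothetical mini-lexicon
    "bad": ["B", "AE", "D"],
    "bat": ["B", "AE", "T"],
    "ban": ["B", "AE", "N"],
    "dad": ["D", "AE", "D"],
}

def cohort_features(word_phonemes, lexicon):
    sizes, reductions = [], []
    prev = np.log(len(lexicon))      # baseline: log of the dictionary size
    for k in range(1, len(word_phonemes) + 1):
        prefix = word_phonemes[:k]
        cohort = [w for w, pron in lexicon.items() if pron[:k] == prefix]
        size = np.log(max(len(cohort), 1))
        sizes.append(size)
        reductions.append(size - prev)
        prev = size
    return sizes, reductions

sizes, reductions = cohort_features(["B", "AE", "D"], PRONUNCIATIONS)
```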

At the semantic level, each word was quantified using its equivalent 25-dimensional global vectors for word representation (GloVe) embedding from a pre-trained dictionary based on Twitter text.

Forward decoding model: To obtain the temporal response functions (TRFs) that map the stimulus to the brain response, a generalized linear model (GLM) with regularization was employed using the MATLAB mTRF toolbox. To estimate the brain response for a target trial out of the 20 trials, the leave-one-out method was employed. First, the stimuli and the brain responses of all trials other than the target trial were concatenated. A GLM can map the concatenated stimuli (at different lags ranging from 0 to 650 ms) to the concatenated brain responses. Given the GLM parameters, the brain response for the target trial was estimated, and the prediction was compared with the actual brain response. The topoplots of the coefficients of the GLM model (averaged across all trials) for the attended and unattended speech were examined during various time windows.
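
The following non-limiting sketch stands in for the regularized forward model: stimulus features are expanded with 0-650 ms time lags and mapped to the EEG with closed-form ridge regression rather than the MATLAB mTRF toolbox. The regularization constant, the toy data, and the lag handling are assumptions.

```python
# Fit a lagged linear forward model (stimulus -> EEG) with ridge regularization.
import numpy as np

def lag_matrix(stim, fs=100, max_lag_s=0.65):
    """stim: (n_samples, n_features). Returns (n_samples, n_features * n_lags)."""
    n_lags = int(max_lag_s * fs) + 1
    lagged = [np.roll(stim, lag, axis=0) for lag in range(n_lags)]
    for lag, block in enumerate(lagged):
        block[:lag] = 0.0                    # zero out wrapped-around samples
    return np.hstack(lagged)

def fit_trf(stim, eeg, lam=1e2):
    """Closed-form ridge regression from the lagged stimulus to the EEG."""
    X = lag_matrix(stim)
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ eeg)

def predict_eeg(stim, weights):
    return lag_matrix(stim) @ weights

# Toy data: 60 s of a 5-dimensional feature set mapped to 62 EEG channels.
rng = np.random.default_rng(3)
stim = rng.standard_normal((6000, 5))
eeg = rng.standard_normal((6000, 62))
W = fit_trf(stim, eeg)
eeg_hat = predict_eeg(stim, W)
```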

To determine the extent to which various linguistic features can be used to decode the attended speaker, and whether their addition to the envelope of the speech can improve auditory attention decoding, the performance of each linguistic feature and of its combination with the envelope was calculated separately for each subject. Figs. 8B-8D show the method used to find the attended speaker from the linguistic features and the envelope. To do so, a generalized linear model (Fig. 8C) was trained for each trial, based on all the other trials, to estimate and predict the EEG response of the brain from the attended and unattended stimuli of that trial (each linguistic feature individually and the addition of all linguistic features to the envelope). Then, the predicted brain responses were compared to the actual brain responses, and the speaker with the higher correlation was selected as the attended speaker (Fig. 8D). The AAD was performed using the envelope, phonemic, phonotactic, and semantic features, as well as the combination of all features, to examine whether models that use higher-level linguistic features for AAD are more advantageous.
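
As a non-limiting illustration of this decision step, the sketch below predicts the EEG from the attended-speaker and unattended-speaker feature streams and selects the speaker whose prediction correlates more strongly with the recorded EEG. It is a simplified variant that applies a single forward model (passed in as predict_fn, e.g., the predict_eeg helper from the previous sketch) to both streams and pools channels before correlating, whereas the procedure above trains models per trial on all other trials.

```python
# Decide the attended speaker by comparing prediction-vs-EEG correlations.
import numpy as np

def decode_attention(eeg_trial, feats_spk1, feats_spk2, predict_fn):
    """predict_fn maps a stimulus feature matrix to a predicted EEG matrix."""
    pred1 = predict_fn(feats_spk1)
    pred2 = predict_fn(feats_spk2)
    r1 = np.corrcoef(pred1.ravel(), eeg_trial.ravel())[0, 1]
    r2 = np.corrcoef(pred2.ravel(), eeg_trial.ravel())[0, 1]
    return ("speaker 1" if r1 > r2 else "speaker 2"), r1, r2

# Toy usage with a random linear forward model standing in for the trained GLM.
rng = np.random.default_rng(4)
W = rng.standard_normal((5, 62)) * 0.1
predict_fn = lambda feats: feats @ W
eeg_trial = rng.standard_normal((6000, 62))
feats_attended = rng.standard_normal((6000, 5))
feats_unattended = rng.standard_normal((6000, 5))
print(decode_attention(eeg_trial, feats_attended, feats_unattended, predict_fn))
```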

Fig. 9A shows the AAD results for each stimulus feature. The accuracy of auditory attention decoding was evaluated by predicting the brain's response at each time point with a forward linear regression decoder from the envelope of the speech (env), its phonetic vectors (phn), phonotactic vectors (phon), semantic features (sem), and their combination. As can be seen from Fig. 9A, higher-level linguistic features can, on their own, be good indicators for determining the attended speech from a mixture of speeches. Phonetic features were especially accurate at predicting attention. Fig. 9A also shows the accuracy of auditory attention decoding based on the integration of all linguistic features and the envelope (env+phn+phon+sem). As can be seen, the combination of all the features results in a higher AAD accuracy. Fig. 9B compares the accuracy of AAD when combining all features versus only using the envelope of the speech. As can be seen from Fig. 9B, AAD accuracies improved in the majority of subjects when higher-level linguistic features were included in the regression model. Fig. 9C compares the relative improvement in AAD, defined in Equation (1), to the AAD accuracy using the envelope. On average, the AAD performance improved by 20.7 percent. According to Fig. 9, there is a strong negative correlation (-0.81) between the relative improvement and the envelope accuracy, suggesting that including higher-level linguistic features can be especially helpful in improving the performance of AAD for subjects that have poor AAD performance based on the envelope of the stimulus alone.

relative improvement = (multi-feature accuracy - envelope accuracy) / envelope accuracy    (1)

Furthermore, to understand the role of high-level linguistic features in AAD, the average correlations between the actual and estimated responses for each recording channel were assessed. Figs. 10A-10E compare the correlation between the reconstructed EEG and the actual EEG for both the attended and unattended models for each EEG channel (normalized for each subject and then averaged across all subjects). The correlations for the attended model are larger than those of the unattended model for all features, and by combining all the features this difference becomes more significant (the EEG channels had higher correlations for the attended model than the unattended one, with an across-subject averaged p-value of 0.13 for the envelope feature and 0.04 for the integration of all features). The topoplots of the attended and unattended model weights, trained on the combination of all features, for various time intervals can be found in Fig. 11.

The disclosed subject matter was used to show that adding higher-level linguistic features, such as phonemic, phonotactic, and semantic features, to AAD models can improve the performance of auditory attention decoding. The disclosed subject matter provided the following: A) higher-level linguistic features on their own are capable of decoding attention from the neural data, and their accuracies are above chance and comparable to that of using the envelope in AAD; B) a model was established based on the addition of all the low-level and high-level linguistic features to the envelope, and it achieved higher accuracy in the attention decoding task (on average a 5.6% improvement, from 58.12% when using the envelope stimulus to 63.75% when using the combination of linguistic stimuli), which shows that adding higher-level linguistic features to the envelope of the speech can improve AAD performance. It is important to note that the features used in the model are not independent and are correlated (Fig. 12). This correlation between the features limits further improvement in the performance of auditory attention decoding. Further enhancement can be achieved by using a set of features that are independent or by removing the codependency of these features; C) the improvement in the accuracies was more robust in subjects who had poor performance with models based on the envelope stimulus alone, and there existed a strong negative correlation between the improvement achieved when combining all the features and the accuracy obtained when only using the speech envelope. This means that using complementary linguistic features is especially useful when the envelope feature fails to deliver good performance in the attention decoding task; D) the correlation between the estimated response and the brain response, for each recording channel, on average becomes more distinguishable between the attended and unattended models when the linguistic features are added to the envelope, which in turn leads to better attention decoding accuracies.

The disclosed subject matter shows that the primary auditory cortex (AC) encodes both the attended and unattended speech in a multi-speaker task regardless of attention, while the nonprimary auditory cortex selectively encodes the attended speech and masks out the unattended one. Given that the higher areas in the brain mostly encode higher-level linguistic features, especially of the attended speech, including high-level linguistic cues allows the AAD analysis to focus on these higher areas and the nonprimary AC.

Applications of AAD using low-level and high-level linguistic features need to be accompanied by parallel real-time phoneme and whole-word automatic speech recognition (ASR) models. End-to-end models that are capable of generating phoneme and word sequences, and then phonotactic and semantic information, from the acoustic speech in real time can be used in the AAD task in parallel and can improve the performance of auditory attention decoding using higher-level linguistic features.

One way to improve upon the work of this paper would be to use more complex models such as nonlinear models and models based on neural networks. Linear AAD models using high-level linguistic features are not capable of extracting the nonlinear encoding of these features in the neural data.

Furthermore, as shown in Fig. 12, the linguistic features are inherently correlated, and this correlation prevents the linear models from achieving any further improvement in performance. However, by including linguistic features that are not correlated or by removing the codependency among these features, one can achieve better results in AAD using high-level linguistic information.

* * *

The features, structures, or characteristics of certain embodiments described throughout this specification can be combined in any suitable manner in one or more embodiments.

One having ordinary skill in the art will readily understand that the disclosed subject matter as discussed above can be practiced with procedures in a different order, and/or with hardware elements in configurations which are different than those which are disclosed. Therefore, although the disclosed subject matter has been described based upon these embodiments, certain modifications, variations, and alternative constructions would be apparent to those of skill in the art while remaining within the spirit and scope of the disclosed subject matter.