
A computer science research institute in Bordeaux is seeking a post-doctoral researcher to develop a dynamic neuronal model of vocal processing. The ideal candidate should have a solid background in mathematics, proficiency in Python, and an interest in neuroscience. The position offers a gross salary of €2,788 per month, along with various benefits including 7 weeks of annual leave and the possibility of remote work.
When we listen to a song, or to the radio, our brain needs to parse incoming stimuli incrementally and on the fly. When we learn a song, we learn to imitate what we hear by trial and error: we try to reproduce the sounds we hear. There is converging evidence that production and perception (of song, language or gesture) are not separate processes in the brain; rather, they are interwoven. This interweaving is, for instance, what enables people to predict themselves and each other [6]. Interweaving of action and perception is important because it allows a learning agent (e.g. a baby, a bird or a model) to learn from its own actions: for instance, by learning the perceptual consequences (e.g. the heard sounds) of its own actions (e.g. vocal productions) during babbling. In this way, the agent learns in a self-supervised manner. This kind of learning is more biologically plausible than supervised learning, which assumes the availability of “teacher signals” that have to be designed by the modeller. Self-supervised learning is fundamental for developmental processes such as babbling. Schwartz et al. [11] propose that perception and action are co-structured in the course of speech development: gestures are perceptually shaped and form perceptuo-motor units. A clear neuronal model explaining the mechanisms that shape such perceptuo-motor units through development is still missing.
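To make the idea of self-supervised learning from one's own actions concrete, here is a minimal, purely illustrative sketch (not part of the project code): an agent babbles random motor commands, observes the sounds they produce through an assumed, unknown vocal mapping, and fits a forward model predicting the perceptual consequences of its actions. All names, dimensions and the toy vocal mapping are hypothetical.

```python
# Illustrative sketch of self-supervised learning during babbling: the agent
# learns a forward model (action -> expected sound) from its own productions.
import numpy as np

rng = np.random.default_rng(0)
TRUE_MAP = rng.normal(size=(3, 8))        # unknown articulator-to-sound mapping

def vocal_tract(actions):
    """The environment: what the agent actually hears when it acts."""
    return actions @ TRUE_MAP

# Babbling: random actions and the sounds they produce are the only "teacher".
actions = rng.uniform(-1.0, 1.0, size=(500, 3))
sounds = vocal_tract(actions)

# Forward model fitted by least squares on the agent's own (action, sound) pairs.
W_fwd, *_ = np.linalg.lstsq(actions, sounds, rcond=None)

# The agent can now predict the perceptual consequence of a new action.
new_action = rng.uniform(-1.0, 1.0, size=(1, 3))
predicted_sound = new_action @ W_fwd
```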
To learn songs, we need good cognitive representations of sounds and musicality. In order to obtain plausible brain representations of songs or complex sequences of movements, we cannot rely on engineered representations (e.g. word or audio embeddings such as Word2Vec or Wave2vec), because this would prevent us from modelling the representations obtained during developmental and bootstrapping processes. Thus, we want to obtain perceptuo-motor representations that emerge through action-perception mechanisms. The existence of sensorimotor (i.e. mirror) neurons at abstract representation levels (called action-perception circuits [5]), together with the perceptuo-motor shaping of sensorimotor gestures, suggests that similar action-perception mechanisms are implemented at different levels of the hierarchy (e.g. phoneme, syllable or word in the case of human language). Consequently, models of action-perception mechanisms should be stackable as hierarchical processes.
If we want to recognize a song or understand a sentence on the fly, our brain needs to process the information as quickly as possible so as not to saturate our “cognitive buffer” (i.e. working memory) and thus lose what is coming next. Christiansen & Chater propose that when the brain is processing a stimulus (e.g. an utterance), it must avoid getting stuck in the “Now-or-Never Bottleneck” [1]: the brain is forced to extract the necessary information as soon as possible, otherwise the information is lost. Thus, the rich perceptual input needs to be recoded as it arrives, in order to capture the key elements of the sensory information [1]. These compressed (or “chunked”) representations are abstractions of the input (filtering out the details) rather than predictions encoding all the fluctuations of fast incoming inputs. Memory limitations also apply to these recoded representations; hence the brain needs to chunk the compressed representations into multiple levels of representation, of increasing abstraction in perception and of decreasing abstraction in action [1]. Therefore, each sequence of chunks at one level is encoded as a single chunk at the next, higher level.
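As a purely illustrative sketch of this chunk-and-recode idea (the function and fixed window size below are hypothetical; in the project, chunk boundaries would be learned rather than fixed), consecutive items at one level can be grouped into single tokens that become the input stream of the next, more abstract level:

```python
# Minimal illustration of hierarchical chunking: each level compresses a
# window of incoming items into a single token, which feeds the next level.
from itertools import islice

def chunk_stream(stream, size=3):
    """Group every `size` consecutive items into one chunk (a tuple)."""
    it = iter(stream)
    while True:
        chunk = tuple(islice(it, size))
        if not chunk:
            return
        yield chunk

phones = iter("abcabcabdabc")            # level 0: fast, detailed input
syllables = chunk_stream(phones, 3)      # level 1: compressed chunks
words = chunk_stream(syllables, 2)       # level 2: chunks of chunks

print(list(words))
# [(('a','b','c'), ('a','b','c')), (('a','b','d'), ('a','b','c'))]
```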
This post‑doctoral project will be conducted over a 13‑month period, potentially renewable, to allow for in‑depth investigation of developmental sequence learning mechanisms.
The general aim of the ANR DeepPool project is to build a dynamic neuronal model of vocal processing and production: the model should be developmental, hierarchical, and based on action-perception mechanisms. This multi-scale model will span from sensorimotor vocal imitation to the processing and production of long sequences. It will use incremental learning schemes with goal-directed exploration and will aim at symbol emergence. We want to create a generic action-perception mechanism that (i) enables action and perception to shape one another, (ii) allows the development of representations to be bootstrapped from raw sound perception, and (iii) can be stacked as layers of a hierarchical architecture. More info on the ANR DeepPool project:
The post-doc project will explore one or several of the topics of the ANR project above. The methods developed will be based on Recurrent Neural Networks (RNNs), reservoirs in particular, but could also use emerging hybrid models in-between Transformers and reservoirs [14] that we are developing in the team. A reservoir [3] is a random recurrent neural network made of non-linear units; reservoirs have been used to model various cortical areas [2, 12]. Unlike the backpropagation through time (BPTT) used in LSTMs, training a reservoir does not involve unfolding the network in time. In order to build action-perception mechanisms, we will combine various concepts from incremental, developmental, reinforcement and unsupervised learning. In particular, we will build on preliminary results we have on distal learning with reservoirs [4]. We will also use and develop new reinforcement learning rules adapted to reservoir computing, such as Hebbian exploratory rules [7], which we will combine with unsupervised learning rules that we previously developed, such as Dynamic Self-Organising Maps (DSOM) [9]. Moreover, we will enhance such models with a robust long-term memory mechanism that we recently developed [12].
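For readers unfamiliar with reservoir computing, the following minimal NumPy sketch shows the core idea: a fixed random recurrent network is driven by the input, and only a linear readout is trained, so no unfolding in time is required. It is an illustration under assumed hyperparameters, not the project's implementation (which would typically rely on dedicated tools such as the ReservoirPy library).

```python
# Minimal echo state network (reservoir) sketch, for illustration only.
import numpy as np

rng = np.random.default_rng(42)
n_in, n_res = 1, 300

W_in = rng.uniform(-0.5, 0.5, size=(n_res, n_in))        # fixed input weights
W = rng.normal(size=(n_res, n_res))
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))           # spectral radius 0.9

def run_reservoir(inputs):
    """Collect reservoir states for an input sequence (no unfolding in time)."""
    x = np.zeros(n_res)
    states = []
    for u in inputs:
        x = np.tanh(W_in @ np.atleast_1d(u) + W @ x)       # recurrent update
        states.append(x)
    return np.array(states)

# Toy task: one-step-ahead prediction of a sine wave.
t = np.linspace(0, 8 * np.pi, 400)
u, y = np.sin(t[:-1]), np.sin(t[1:])
X = run_reservoir(u)

# Only the linear readout is trained, here with ridge regression.
ridge = 1e-6
W_out = np.linalg.solve(X.T @ X + ridge * np.eye(n_res), X.T @ y)
print("train MSE:", np.mean((X @ W_out - y) ** 2))
```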
We will start by implementing the full sensorimotor architecture that we defined in our review [8]. We will build on our recent results on both human speech and birdsong data. For instance, on the songbird side, we built a simple sensorimotor model using a reservoir as the perceptual decoder, a simple Hebbian learning rule for the inverse model, and a Generative Adversarial Network (GAN) as the sound generator given the motor commands. This model is able to reproduce canary syllables faithfully using only a 3-dimensional latent space [13, 14]. In order to create the core action-perception layer, the first steps will be to incorporate a forward model and to replace the GAN by a reservoir. Later on, we will stack several of these layers at different levels of the hierarchy in order to extract chunks (i.e. groups of acoustic elements) of increasing size and complexity. The models will be bootstrapped through goal-directed learning (e.g. vocal imitation). Model features will not be predefined by the modeller; they will emerge through developmental processes. Because we will be using similar model components, we will be able to apply similar analysis methods, thus facilitating multi-scale analyses.
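As a rough, hypothetical illustration of such a sensorimotor imitation loop (with a fixed linear generator standing in for the GAN, a fixed linear decoder standing in for the reservoir-based perceptual decoder, and a simple error-driven delta rule standing in for the Hebbian inverse-model rule; all dimensions are made up), an inverse model can be learned from the agent's own babbling and then used to imitate a tutor:

```python
# Hypothetical sketch of the imitation loop: babble, hear yourself, learn an
# inverse model (percept -> motor), then imitate a tutor sound.
import numpy as np

rng = np.random.default_rng(1)
n_motor, n_sound, n_percept = 3, 16, 8

G = rng.normal(size=(n_sound, n_motor)) * 0.5    # generator: motor -> sound
D = rng.normal(size=(n_percept, n_sound)) * 0.3  # decoder: sound -> percept
W_inv = np.zeros((n_motor, n_percept))           # inverse model: percept -> motor
lr = 0.05

# Babbling phase: random motor commands, each paired with the percept it evokes.
for _ in range(5000):
    motor = rng.uniform(-1, 1, n_motor)
    percept = D @ np.tanh(G @ motor)
    W_inv += lr * np.outer(motor - W_inv @ percept, percept)   # delta rule

# Imitation: decode a tutor sound, then retrieve a motor command reproducing it.
tutor_motor = rng.uniform(-1, 1, n_motor)
tutor_percept = D @ np.tanh(G @ tutor_motor)
imitation = W_inv @ tutor_percept
print("motor error:", np.linalg.norm(imitation - tutor_motor))
```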
The RNN mechanisms developed will be applied to human speech and bird songs, because both share properties relevant to the project: humans and birds learn to imitate the complex sounds that their fellows produce; they learn them developmentally, starting from a babbling exploration phase; both bird songs and human language share a hierarchical organisation of elements with increasing chunk sizes; temporal context is key to making decisions about chunks (i.e. the delimitation of chunk boundaries is ambiguous if the context is ignored); and vocal production models are available for both humans and birds (e.g. VocalTractLab for the human voice) [8].
Generic models, such as random reservoirs, can have a cross-domain impact, opening up potential adaptations to non-vocal tasks. The methods and neural mechanisms developed will not be limited to audio applications, but will be generic enough to be applied to other domains such as motor gesture learning. Because these methods will be based on online, incremental and loosely supervised learning, they could provide more efficient approaches for machine learning and artificial intelligence. Moreover, such sensorimotor models will be used as tools to analyse our collaborators' experimental neuroscience data from a new perspective, and could, in the long run, help us better understand the mechanisms at work in speech rehabilitation therapies.
Gross remuneration: 2788€ per month (before taxes)