SPOKEN ENGLISH FLUENCY SCORING USING CONVOLUTIONAL NEURAL NETWORKS
Recently, Computer Aided Language Learning (CALL) has received considerable attention as a method for improving the English speaking skills of non-native students. For CALL systems to provide useful tutoring feedback, an automated scoring system is required to evaluate the pronunciation quality and fluency of non-native speakers, as well as the specific mistakes they make.
Most automatic spoken English fluency scoring systems have three components: Automatic Speech Recognition (ASR), fluency feature extraction, and a scoring model. ASR generates a time-aligned word sequence for the input speech. Fluency feature extraction computes features that are assumed to be highly correlated with fluency in spoken English [1, 2, 3]. Various features are investigated in [1, 4].
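As a rough illustration of the feature extraction step in this three-component pipeline, the sketch below derives a few simple timing features from a time-aligned word sequence. The feature set (speech rate, articulation rate, pause count, mean pause length), the `AlignedWord` type, and the 0.2 s pause threshold are assumptions made for this example; they are not necessarily the features studied in [1, 4].

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AlignedWord:
    """One ASR output token with start and end times in seconds (hypothetical type)."""
    text: str
    start: float
    end: float


def fluency_features(words: List[AlignedWord], pause_threshold: float = 0.2) -> dict:
    """Compute a few illustrative fluency features from a time-aligned word sequence."""
    if not words:
        return {"speech_rate": 0.0, "articulation_rate": 0.0,
                "num_pauses": 0, "mean_pause": 0.0}
    total_time = words[-1].end - words[0].start
    # Gaps between consecutive words that reach the threshold count as pauses.
    gaps = [b.start - a.end for a, b in zip(words, words[1:])]
    pauses = [g for g in gaps if g >= pause_threshold]
    # Time spent actually speaking, excluding all gaps between words.
    speaking_time = sum(w.end - w.start for w in words)
    return {
        "speech_rate": len(words) / total_time if total_time > 0 else 0.0,
        "articulation_rate": len(words) / speaking_time if speaking_time > 0 else 0.0,
        "num_pauses": len(pauses),
        "mean_pause": sum(pauses) / len(pauses) if pauses else 0.0,
    }


# Made-up alignment for a three-word utterance.
words = [
    AlignedWord("the", 0.0, 0.2),
    AlignedWord("cat", 0.25, 0.6),
    AlignedWord("sat", 1.1, 1.5),
]
features = fluency_features(words)
print(features)
```

In a conventional system, a feature vector of this kind would be passed to the scoring model, which maps it to a fluency score.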
The Korean-Spoken English Corpus (K-SEC) is a database of English speech produced by Korean speakers, intended for experimental phonetics, phonology, English education, and speech information technology. The K-SEC is composed of six sets of English pronunciations spoken by Koreans. In this study, we used Set#5, which is composed of English sentences. Set#5 was designed to capture i) Korean speakers' intonation and rhythmic patterns in English connected speech, ii) the degree of sandhi between
5.3. Model architectures
Various CNN architectures are evaluated to investigate the effects of the number of kernels and layers. Table 2 lists the CNN models: "conv" indicates a convolutional layer, and "fc" indicates a fully connected layer. "Orthogonal" indicates that the weights are initialized as a random orthogonal matrix. "Model-1" to "Model-8" have a single convolutional layer and are configured to extract different numbers of primitive fluency features. "Model-9" to "Model-11" consist of two convolutional layers to investigate the effect of the mid-term characteristics of the speech signals. "Model-12" to "Model-17" consist of three convolutional layers in order to capture the long-term characteristics. In this work, TensorFlow is used to conduct the experiments.
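The link between the number of convolutional layers and the time span each model can capture can be sketched with a receptive-field calculation. The kernel sizes, strides, and the 16 kHz sampling rate below are hypothetical values chosen only for illustration; they are not the settings listed in Table 2.

```python
from typing import List, Tuple

SAMPLE_RATE_HZ = 16000  # assumed sampling rate, not stated in the text


def receptive_field(layers: List[Tuple[int, int]]) -> int:
    """Return the receptive field, in input samples, of stacked 1-D conv layers.

    Each layer is a (kernel_size, stride) pair, listed from input to output.
    """
    field = 1
    jump = 1  # distance in input samples between adjacent units of the current layer
    for kernel_size, stride in layers:
        field += (kernel_size - 1) * jump
        jump *= stride
    return field


# Hypothetical layer settings for one-, two-, and three-layer models.
configs = {
    "one conv layer": [(400, 160)],
    "two conv layers": [(400, 160), (5, 2)],
    "three conv layers": [(400, 160), (5, 2), (5, 2)],
}

# Convert each receptive field from samples to milliseconds of waveform.
spans_ms = {
    name: 1000.0 * receptive_field(layers) / SAMPLE_RATE_HZ
    for name, layers in configs.items()
}
for name, ms in spans_ms.items():
    print(f"{name}: {ms:.1f} ms of waveform per output unit")
```

Under these assumed settings, each added layer widens the stretch of raw waveform that a single output unit depends on, which is the sense in which deeper models can reflect mid-term and long-term characteristics.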
Although the proposed approach shows promising results, its robustness cannot be guaranteed because the training corpus is small. Nevertheless, the results show that the approach is applicable to scoring the fluency of spoken English.
In this paper, we investigate spoken English fluency scoring using a CNN that takes the raw time-domain waveform as input and jointly optimizes the feature extraction and prediction model parameters. Although the K-SEC is not large enough to train the proposed model parameters robustly, the results show that the proposed approach is effective.
We are currently collecting a large corpus of spoken English that is scored by human experts. Therefore, in future work, we will attempt to train the proposed model robustly, and we will use a hybrid approach that combines learned features and conventional feature