Recently, Computer Aided Language Learning (CALL) has received considerable attention as a method for improving the English speaking skills of non-native students. For CALL systems to provide useful tutoring feedback, an automated scoring system is required to evaluate the pronunciation quality and fluency of non-native speakers of English, as well as the specific mistakes they make.

Most automatic spoken English fluency scoring systems have three components: Automatic Speech Recognition (ASR), fluency feature extraction, and a scoring model. ASR generates time-aligned word sequences for the input speech. Fluency feature extraction computes features that are assumed to be highly correlated with fluency in spoken English [1, 2, 3]. Various features are investigated in [1, 4].
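
As a minimal illustration of the last two components of such a conventional pipeline, the sketch below computes a few fluency features from a time-aligned word sequence and feeds them to a linear scoring model. The specific features (speaking rate, pause count, mean pause length, phonation ratio), the 0.2 s pause threshold, the `AlignedWord` type, and the weights are all assumptions chosen for illustration; they are not the features or model of [1, 2, 3, 4].

```python
from dataclasses import dataclass


@dataclass
class AlignedWord:
    """One word of a hypothetical ASR alignment, with times in seconds."""
    word: str
    start: float
    end: float


def fluency_features(words, pause_threshold=0.2):
    """Compute illustrative fluency features from time-aligned words."""
    if not words:
        raise ValueError("need at least one aligned word")
    total = words[-1].end - words[0].start
    if total <= 0:
        raise ValueError("alignment must span a positive duration")
    # Silence between consecutive words; only long gaps count as pauses.
    gaps = [b.start - a.end for a, b in zip(words, words[1:])]
    pauses = [g for g in gaps if g >= pause_threshold]
    speaking_time = sum(w.end - w.start for w in words)
    return {
        "words_per_second": len(words) / total,
        "pause_count": len(pauses),
        "mean_pause_sec": sum(pauses) / len(pauses) if pauses else 0.0,
        "phonation_ratio": speaking_time / total,
    }


def linear_score(features, weights, bias=0.0):
    """Scoring model: weighted sum of the features plus a bias."""
    return bias + sum(weights[name] * value for name, value in features.items())


# Hypothetical alignment with one 0.5 s pause between "cat" and "sat".
alignment = [
    AlignedWord("the", 0.0, 0.25),
    AlignedWord("cat", 0.25, 0.75),
    AlignedWord("sat", 1.25, 1.75),
    AlignedWord("down", 1.75, 2.0),
]
feats = fluency_features(alignment)

# Arbitrary weights, for illustration only; a real model would be trained
# against human fluency scores.
weights = {
    "words_per_second": 1.0,
    "pause_count": -0.5,
    "mean_pause_sec": -1.0,
    "phonation_ratio": 2.0,
}
fluency = linear_score(feats, weights)
```

In this kind of pipeline the features are fixed by hand and only the scoring model is fitted, which is the design that an approach learning features directly from the waveform departs from.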

Korean-Spoken English Corpus (K-SEC) is a database of English speech sounds spoken by Koreans for experimental phonetics, phonology, English education, and speech information technology. The K-SEC is composed of six sets of English pronunciations spoken by Koreans. In this study, we used Set#5, which is composed of English sentences. Set#5 was designed to examine i) Korean speakers' intonation and rhythmic patterns in English connected speech, ii) the degree of sandhi between

5.3. Model architectures

Various CNN architectures are evaluated to investigate the effects of the number of kernels and layers. Table 2 lists the CNN models: "conv" indicates a convolutional layer, and "fc" indicates a fully connected layer. "Orthogonal" indicates that the weights are initialized as a random orthogonal matrix. "Model-1" to "Model-8" have a single convolutional layer and are configured to extract different numbers of primitive fluency features. "Model-9" to "Model-11" consist of two convolutional layers to investigate the effect of the mid-term characteristics of the speech signals. "Model-12" to "Model-17" consist of three convolutional layers in order to consider the long-term characteristics. In this work, TensorFlow [16] is used to conduct the experiments.
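
To make the single-convolutional-layer configuration concrete, the following dependency-free sketch runs a forward pass of one "conv" layer with randomly orthogonal kernels over a raw waveform, followed by one "fc" output. It is not the TensorFlow implementation used in the experiments: the kernel count (4), kernel width (64), stride (16), ReLU activation, global max pooling, and the synthetic input tone are all assumptions, and the helper names are hypothetical.

```python
import math
import random


def orthogonal_rows(n_rows, n_cols, rng):
    """Random orthonormal rows via Gram-Schmidt on Gaussian draws."""
    if n_rows > n_cols:
        raise ValueError("this sketch needs n_rows <= n_cols")
    rows = []
    while len(rows) < n_rows:
        v = [rng.gauss(0.0, 1.0) for _ in range(n_cols)]
        # Remove the components along the rows accepted so far.
        for u in rows:
            dot = sum(a * b for a, b in zip(v, u))
            v = [a - dot * b for a, b in zip(v, u)]
        norm = math.sqrt(sum(a * a for a in v))
        if norm > 1e-8:  # skip the (unlikely) degenerate draw
            rows.append([a / norm for a in v])
    return rows


def conv1d_relu(signal, kernels, stride):
    """'conv': valid 1-D cross-correlation per kernel, followed by ReLU."""
    width = len(kernels[0])
    feature_maps = []
    for kernel in kernels:
        channel = []
        for start in range(0, len(signal) - width + 1, stride):
            window = signal[start:start + width]
            activation = sum(w * x for w, x in zip(kernel, window))
            channel.append(max(0.0, activation))
        feature_maps.append(channel)
    return feature_maps


def global_max_pool(feature_maps):
    """Keep the strongest response of each kernel over time."""
    return [max(channel) for channel in feature_maps]


def fully_connected(vector, weights, bias):
    """'fc': a single linear output unit producing the score."""
    return bias + sum(w * x for w, x in zip(weights, vector))


rng = random.Random(0)
# Synthetic stand-in for a raw waveform: 0.1 s of a 440 Hz tone at 16 kHz.
waveform = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(1600)]

kernels = orthogonal_rows(n_rows=4, n_cols=64, rng=rng)
feature_maps = conv1d_relu(waveform, kernels, stride=16)
pooled = global_max_pool(feature_maps)
fc_weights = orthogonal_rows(1, len(pooled), rng)[0]
fluency_score = fully_connected(pooled, fc_weights, bias=0.0)
```

Stacking a second or third `conv1d_relu` stage on the feature maps would widen the temporal context each output sees, which is the intuition behind the two- and three-layer models; that extension needs multi-channel kernels and is omitted here.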


Although the proposed approach shows promising feasibility results, it is hard to guarantee that it works robustly because the training corpus is small. However, the results show that the proposed approach is applicable to the problem of scoring fluency in spoken English.

In this paper, we investigate spoken English fluency scoring using a CNN that takes the raw time-domain waveform as input and jointly optimizes the feature extraction and prediction model parameters. Although the K-SEC is not large enough to train the proposed model parameters robustly, the results show that the proposed approach is effective.

We are currently collecting a large corpus of spoken English that is scored by human experts. In future work, we will therefore attempt to train the proposed model robustly, and we will use a hybrid approach by combining learned features and conventional feature