Speech to text: transcribing MP3 audio files using Azure Cognitive Services and .NET Core

Transcribe mp3 audio files to text using Azure SpeechServices and C#

There is a big buzz about AI these days, and major cloud vendors like Amazon Web Services, Azure, and Google Cloud are competing to bring better products to their platforms for a variety of AI tasks. One of these services is speech recognition: generating transcription text from audio.

I recently worked on a project which involved transcribing a large amount of daily generated audio recordings. For the transcription I used Azure SpeechServices to get the text from the previously recorded audio files.

Setting up Azure Cognitive Services - SpeechServices instance

Before you can use Azure SpeechServices you need to add an instance to your Azure account. It is categorized under Azure Cognitive Services, so from the dashboard find Cognitive Services.

Azurespeech 1

If you haven't used any of the Azure Cognitive Services, this list will be empty, like in the picture. Hit the Add or Create cognitive services button to create a new SpeechService Cognitive Service instance. In the search bar type "Speech" and you will see the Speech item in the result list.

Azurespeech 2

Select the Speech item from the result list and populate the mandatory fields.

Azurespeech 3

If you are going to use the Speech service only for demo or development purposes, choose the F0 tier, which is free and comes with certain limitations. Click the Create button and your SpeechService instance is ready for use.

Converting audio from MP3 to WAV format

Unfortunately, Azure SpeechServices does not currently support direct MP3-to-text processing. For now you can only submit audio files in WAV format for transcription. This is definitely a downside, as you need to upload a lot more data in WAV format instead of simply posting the MP3; the conversion from MP3 to WAV is something that could arguably be implemented in SpeechServices itself.
Even if you had WAV files as a source, it would be more convenient to compress them to MP3 before sending them for transcription. Libraries and tools like ffmpeg can be pretty handy for something like that. Since my sample files are in MP3, I will use ffmpeg to convert them to WAV in order to be able to send the audio for recognition.

The good part is that the ffmpeg tool is cross-platform, so if you choose to write your solution in .NET Core, you can have everything functional on every platform your .NET Core application runs on (Windows, Linux, macOS).

To get a WAV file from your MP3 audio recording, you can simply run:

ffmpeg -i sample.mp3 sample.wav

This will process your MP3 and give you a WAV output, which is usually about 10 times larger than the original MP3 file. To reduce the output file size, you can also apply options like reducing the number of channels or using some of the codecs that come with the ffmpeg tool.
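
For example, you can explicitly pin the WAV output to 16-bit PCM with the -acodec option. Treat this as an optional safeguard rather than a required step, since pcm_s16le is typically ffmpeg's default encoder for WAV output anyway:

ffmpeg -i sample.mp3 -acodec pcm_s16le sample.wav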

To downmix the stereo audio to mono, just run ffmpeg with the -ac argument set to 1:

ffmpeg -i sample.mp3 -ac 1 sample.wav

This will impact the output size: it will give you an output approximately 5 times greater than the input MP3 file. Another option when trying to reduce the output file size is lowering the sample rate. Let's say we want to downgrade the sample rate to 22.05 kHz; we would have to add the -ar parameter to the ffmpeg tool:

ffmpeg -i sample.mp3 -ac 1 -ar 22050 sample.wav

This will reduce the output by roughly another 50% and save you time when uploading your audio to your Azure SpeechServices instance. The approach of reducing channels and sample rate worked for me with the audio files I was processing with Azure SpeechServices, but for your audio files and your requirements these command line arguments might not work. I am not an audio expert, so if you need to dig deeper into this subject, please check online how to use the ffmpeg tool for converting audio files from one format to another.

Now this is fine if we run it manually, but we want this to be automated as part of our solution. We can still rely on the ffmpeg tool to do the conversion for us by simply calling it from the code and passing the parameters. Remember I mentioned we might run this on different platforms, so to keep it this way let's include both the Linux and Windows ffmpeg builds. You can download both Linux and Windows ffmpeg executables from www.ffmpeg.org, add them to your .NET Core project and configure the project to copy them to the output folder:

  <ItemGroup>
    <None Update="lib\ffmpeg">
      <CopyToOutputDirectory>PreserveNewest</CopyToOutputDirectory>
    </None>
    <None Update="lib\ffmpeg.exe">
      <CopyToOutputDirectory>PreserveNewest</CopyToOutputDirectory>
    </None>
  </ItemGroup>
    

When you build the project, the executable binaries will be available along with your application binary and you can invoke them from the code using the System.Diagnostics.Process class.


Before we actually run the ffmpeg command as a Process instance in our code, we need to determine the OS our application is executing on. For this purpose I used System.Runtime.InteropServices.RuntimeInformation, which provides methods for probing the OS platform.

Once we know the operating system, we can run the proper binary and process our input MP3 audio file.

// Requires the System, System.Diagnostics, System.IO and System.Runtime.InteropServices namespaces
public static void ConvertMp3ToWav(string inputMp3FilePath, string outputWavFilePath)
{
	// Pick the ffmpeg binary that matches the platform we are running on
	var ffmpegLibWin = Path.Combine(AppDomain.CurrentDomain.BaseDirectory, "lib", "ffmpeg.exe");
	var ffmpegLibLnx = Path.Combine(AppDomain.CurrentDomain.BaseDirectory, "lib", "ffmpeg");
	var procPath = ffmpegLibWin;
	if (!RuntimeInformation.IsOSPlatform(OSPlatform.Windows))
	{
		procPath = ffmpegLibLnx;
	}

	// Quote the paths so file names containing spaces are passed as single arguments
	var process = Process.Start(procPath, $"-i \"{inputMp3FilePath}\" -ac 1 -ar 22050 \"{outputWavFilePath}\"");
	process.WaitForExit();
}
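
A quick way to try it out is a minimal sketch like the one below; the sample paths are placeholders matching the demo files used later in this article:

// Hypothetical usage: convert a local sample and verify the output file exists
ConvertMp3ToWav(@"D:\Temp\Samples\sample.mp3", @"D:\Temp\Samples\sample.wav");
Console.WriteLine(File.Exists(@"D:\Temp\Samples\sample.wav")
	? "Conversion finished"
	: "Conversion failed");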
    

That is one more step done on the way to getting the transcription from our audio file. We now have both an Azure Cognitive Services SpeechService instance in place and the raw audio file in WAV format; now we just need to pass it to Azure and get our transcription text as a result.

Invoking Azure SpeechServices from C#

The final step in getting the transcription from your audio files using Azure SpeechServices is to invoke it from your C# code. I wrote the sample code in .NET Core to be able to run it on both Linux and Windows. For the same reason I included both the Linux and Windows builds of the ffmpeg tool in the project structure and build output.

First things first, to make invoking the API easier I will use the Microsoft.CognitiveServices.Speech NuGet package:

  <ItemGroup>
    <PackageReference Include="Microsoft.CognitiveServices.Speech" Version="1.2.0" />
  </ItemGroup>
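
If you prefer the command line over editing the project file, the same reference can be added with the dotnet CLI:

dotnet add package Microsoft.CognitiveServices.Speech --version 1.2.0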
    

The Azure SpeechServices client class SpeechRecognizer can be instantiated in a few ways. I used the constructor configured with the SpeechService instance key and region. These two values can be fetched from the Azure portal. Just navigate to the Cognitive Services section from your Azure portal subscription home page.

Azurespeech 3 1

Click on the name of the SpeechService instance you created previously and go to the Overview option.

Azurespeech 4

You will need the region value in the code in order to instantiate the SpeechConfig class, which you will then use to instantiate the SpeechRecognizer class that communicates with your SpeechService instance. I chose the West US region for my instance, so the value in my code will be westus.

The second value you need to configure SpeechRecognizer is the SpeechService key. Select the Keys section for the selected SpeechService in the Azure portal and you will find your instance keys on the right side of the portal page.

Azurespeech 5
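
Rather than hard-coding the key in source like in the demo below, you may want to read it from an environment variable. Here is a minimal sketch; the variable name SPEECH_SERVICE_KEY is my own choice, not an Azure convention:

// Read the subscription key from an environment variable (the name is arbitrary)
var speechKey = Environment.GetEnvironmentVariable("SPEECH_SERVICE_KEY");
var config = SpeechConfig.FromSubscription(speechKey, "westus");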

Now we have all the values we need to invoke our SpeechService instance:

    using System;
    using System.Text;
    using System.Threading.Tasks;
    using Microsoft.CognitiveServices.Speech;
    using Microsoft.CognitiveServices.Speech.Audio;

    class Program
    {
        static async Task Main(string[] args)
        {
            var taskCompletionSource = new TaskCompletionSource<int>();
            var config = SpeechConfig.FromSubscription("??????????????????????????", "westus");

            var transcriptionStringBuilder = new StringBuilder();

            using (var audioInput = AudioConfig.FromWavFileInput(@"D:\Temp\Samples\sample.wav"))
            {
                using (var recognizer = new SpeechRecognizer(config, audioInput))
                {
                    // Subscribe to events.
                    recognizer.Recognizing += (sender, eventargs) =>
                    {
                        //TODO: Handle recognized intermediate result
                    };

                    recognizer.Recognized += (sender, eventargs) =>
                    {
                        if (eventargs.Result.Reason == ResultReason.RecognizedSpeech)
                        {
                            transcriptionStringBuilder.Append(eventargs.Result.Text);
                        }
                        else if (eventargs.Result.Reason == ResultReason.NoMatch)
                        {
                            //TODO: Handle not recognized value
                        }
                    };

                    recognizer.Canceled += (sender, eventargs) =>
                    {
                        if (eventargs.Reason == CancellationReason.Error)
                        {
                            //TODO: Handle error
                        }

                        if (eventargs.Reason == CancellationReason.EndOfStream)
                        {
                            Console.WriteLine(transcriptionStringBuilder.ToString());
                        }

                        taskCompletionSource.TrySetResult(0);
                    };

                    recognizer.SessionStarted += (sender, eventargs) =>
                    {
                        //Started recognition session
                    };

                    recognizer.SessionStopped += (sender, eventargs) =>
                    {
                        //Ended recognition session
                        taskCompletionSource.TrySetResult(0);
                    };

                    // Start continuous recognition. Use StopContinuousRecognitionAsync() to stop recognition.
                    await recognizer.StartContinuousRecognitionAsync().ConfigureAwait(false);

                    // Wait for completion.
                    // Task.WaitAny keeps the task rooted.
                    Task.WaitAny(new[] { taskCompletionSource.Task });

                    // Stop recognition.
                    await recognizer.StopContinuousRecognitionAsync();
                }
            }

            Console.ReadKey();
        }
    }
    

Since the recognition method is asynchronous, the Main method of our console application would run to completion without waiting for the recognition to finish. For that reason we block on the TaskCompletionSource task, which is resolved once the recognizer session stops or is canceled.
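
Continuous recognition is the right fit for longer recordings. For short clips, the SDK also exposes a simpler single-shot call, RecognizeOnceAsync, which completes after the first recognized utterance. A minimal sketch, reusing the config and audioInput values from the sample above:

// Single-shot recognition: completes after the first recognized utterance
using (var recognizer = new SpeechRecognizer(config, audioInput))
{
    var result = await recognizer.RecognizeOnceAsync();
    if (result.Reason == ResultReason.RecognizedSpeech)
    {
        Console.WriteLine(result.Text);
    }
}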

Summary

Speech recognition services are only one of the many applied usages of AI models. Currently, cloud service providers offer these services at quite affordable prices and with big free-tier quotas for one simple reason: cloud AI service providers need as much data as possible to train their AI models. Over time, we can only expect these services to get better and more accurate as the amount of data fed to the AI models increases.

 


Disclaimer

The purpose of the code contained in snippets or available for download in this article is solely for learning and demo purposes. The author will not be held responsible for any failure or damages caused by any other usage.

