Captain’s Log

I completed this project a couple of years ago, but once Google changed their API I never got round to updating it. So now seems as good a time as any to start it from scratch and document the process.

For this we take a file containing speech (e.g. from a Dictaphone, if such a thing still exists) and output a transcript of that speech. This can be handy if you have recorded a meeting or a lecture and want a quick way to search through the contents.

First of all – all the code is here

This project takes a voice file and converts it to text using Google Cloud Speech-to-Text. It also analyses the text for sentiment (i.e. is it positive or negative).

The process of the programme is as follows:

  • Convert voice file to compatible format
  • Upload voice file to Google Cloud storage bucket
  • Perform voice-to-text transcription
  • Create and open a text file / open existing file
  • Write time and date at the top of the file, then append text.
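The steps above can be sketched as one run-through. The function names and stub bodies below are placeholders of my own (the real implementations are covered later in the post); they only show how each stage hands its result to the next:

```python
# Illustrative stubs only -- each stands in for a module described below.
def decode_audio(in_filename):
    # 1. Convert the input to 16 kHz mono FLAC; returns the new filename
    return "converted.flac"

def upload_to_gcs(converted_file):
    # 2. Upload to the Cloud Storage bucket; returns the gs:// address
    return "gs://audioprocess/" + converted_file

def speech_recognise(storage_uri):
    # 3. Long-running speech-to-text; here just a canned (confidence, text) pair
    return (0.92, "captain's log, supplemental")

def run_pipeline(in_filename):
    converted = decode_audio(in_filename)
    uri = upload_to_gcs(converted)
    confidence, text = speech_recognise(uri)
    # 4./5. Writing the diary file and moving the input follow the same pattern
    return confidence, text

print(run_pipeline("speech.mp3"))
```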

Now, here be dragons! I’m not going to detail every little step so if you don’t have any familiarity with programming then this might not be for you…

Setting up the project:

Google Cloud Setup:

To use Google Cloud services you first need to set up a project using the following link: Here you will create a project, create an API key and so on. Follow the instructions provided by Google (which look like the picture below).

Once you’ve installed the Cloud SDK (#6 above) you will need to log into your Google account to finish the installation.

First of all, I advise you to clone the code and requirements file from my Github repo, as it will give you all the parts you need. You can then ‘pip install’ all the necessary parts individually, or use my preferred way:

pip install -r requirements.txt

The requirements file has everything at the exact version numbers from when I installed it. If you prefer to have the latest versions and try for yourself, the following modules are required:

  • pydub
  • google-cloud-speech
  • gcloud
  • google-cloud-language

The Code

If you look within my code, I start the file by importing various modules, as well as setting a few global variables. I’ll go through each here:

 os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "DiaryText.json"

I use this in the python file to set the project credentials directly. If you follow the instructions above it looks like you have to change the credentials globally for every project you run; this simplifies the process by setting them per project. For that, however, you need to save the JSON file obtained above into your python project folder (or hardcode the file path here).

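One small variation (my own suggestion, not from the original code): resolve the JSON file to an absolute path up front, so later library calls are unaffected if the working directory changes:

```python
import os

# Hypothetical tweak: pin the key file to an absolute path at startup
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = os.path.abspath("DiaryText.json")
print(os.environ["GOOGLE_APPLICATION_CREDENTIALS"])
```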
parser = argparse.ArgumentParser(description='Convert speech audio to text using Google Speech API')
parser.add_argument('in_filename', help='Input filename (`-` for stdin)')

This function is used to take in arguments from the command line. For example, when I run the programme I call python, then the name of the python file, then the filename of the file I wish to transcribe,
e.g. "python your_script.py C:\Users\craig\Desktop\File.mp3" (substitute your_script.py for whatever you named the script).

#user variables
diary_location = "C:/Users/craig/Google Drive/CaptainsLog/Captains_log.txt"
processed_directory = "C:/Users/craig/Google Drive/CaptainsLog/Processed"

This is a list of variables which are hard-coded into the programme. When I process the files I wish to move them to a set folder. You can set this to your desktop or wherever you wish. 

Getting the file

#Get the name of the voice file to transcribe
args = parser.parse_args()

This gets the name and path of the file we are going to convert
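To see what argparse hands back, you can feed it a pretend command line by passing a list straight to parse_args (normally it reads sys.argv):

```python
import argparse

parser = argparse.ArgumentParser(description='Convert speech audio to text using Google Speech API')
parser.add_argument('in_filename', help='Input filename (`-` for stdin)')

# Simulate running the script with one positional argument
args = parser.parse_args(["C:/Users/craig/Desktop/File.mp3"])
print(args.in_filename)  # -> C:/Users/craig/Desktop/File.mp3
```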

Converting the Voice File

The voice file needs to be in a compatible format. That is FLAC @ 16kHz sample rate. Some other formats may be accepted however for compatibility it’s just easier to convert every file to the same format regardless of origin.

My first attempt at this programme opened a command line directly and fed the file to FFMPEG (a programme which can convert almost any media file to another, cut video, combine audio, basically damn near anything) to convert it. However, while I’m re-doing this project, let’s do it properly and do all the work in Python. For that we use a module called Pydub, which invokes FFMPEG directly.

To do this, we need to install FFMPEG on to our computer. For that we follow this tutorial:

So we feed the module the file:

#Change the file type and bitrate into the correct format
print("Converting the file to the correct format...")
converted_file = decode_audio(args.in_filename)

and we convert the file:

def decode_audio(in_filename, **input_kwargs):
    audio = AudioSegment.from_file(in_filename)
    audio = audio.set_frame_rate(16000)
    audio = audio.set_channels(1)
    print(f"Audio sample rate: {audio.frame_rate}")
    audio.export("converted.flac", format="flac")
    print("Conversion Completed")
    return "converted.flac"

This will create a file called “converted.flac” in your python directory. Once you have completed the programme you can delete this. You could even script the deletion of this file, however I will leave that as an exercise for the reader.
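If you want a head start on that exercise, a minimal sketch could look like this (guarding against the file already being gone):

```python
import os

def cleanup_converted(path="converted.flac"):
    # Delete the intermediate FLAC file if it is still present
    if os.path.exists(path):
        os.remove(path)

cleanup_converted()
```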

Upload to Google Cloud Storage

We then need to upload the file to a Google Cloud Bucket for processing – so we feed the converted file (name) into another module:

#Upload the file to google storage bucket
print("Uploading the file to Google Storage...")
file_name = upload_to_gcs(converted_file)

And we upload the file:

def upload_to_gcs(in_filename):
    # upload resultant file to GCS
    storage_client = storage.Client()
    bucket = storage_client.get_bucket("audioprocess") # your bucket name
    blob = bucket.blob(in_filename)
    blob.upload_from_filename(in_filename) # this performs the actual upload
    print(f"File uploaded successfully to {bucket.name}")
    return str("gs://audioprocess/" + in_filename)

So for this, I have created a bucket called ‘audioprocess’ which keeps the files separate. I take the converted file and upload it into that bucket. The module then returns the gs:// address of the uploaded file for the next step.

Speech to Text

We then need to start Google Cloud running the speech to text process.

#Run the speech to text process via google cloud
print("Running Speech to Text process via Google Cloud...")
gcs_response = speech_recognise(file_name)

Now for completeness I’ll put my code here, however Google have some pretty good examples for all different programming languages at the link here which I used in the creation of this project.

So we start the process. You will notice I included the imports for the Google Cloud components within the module itself – I found this to be the best way as it’s fairly self-contained, and also when I imported more cloud modules they started to interfere with each other.

def speech_recognise(storage_uri):
    from google.cloud import speech_v1
    from google.cloud.speech_v1 import enums

    client = speech_v1.SpeechClient()
    sample_rate_hertz = 16000

    # The language of the supplied audio
    language_code = "en-gb"

    encoding = enums.RecognitionConfig.AudioEncoding.FLAC
    config = {
        "sample_rate_hertz": sample_rate_hertz,
        "language_code": language_code,
        "encoding": encoding,
    }
    audio = {"uri": storage_uri}

    operation = client.long_running_recognize(config, audio)

    print("Waiting for operation to complete...")
    response = operation.result()
    return response

Delete from GCS

Now a little housekeeping. The transcription is done and you have it in memory. You no longer need the file you uploaded to the Google Cloud bucket, so let’s delete it so it’s not taking up space.

#Delete the file from GCS
print("Deleting file from Google Cloud Storage...")
delete_from_gcs(converted_file)

Which starts the following module (which you will see is almost the same as the upload process):

def delete_from_gcs(in_filename):
    # delete the uploaded file from GCS
    storage_client = storage.Client()
    bucket = storage_client.get_bucket("audioprocess") # your bucket name
    blob = bucket.blob(in_filename)
    blob.delete() # this performs the actual deletion
    print("File has been deleted")

Joining the chunks together

Now, when the transcription comes back it is split into little manageable chunks, each with its own accuracy measurement. I have little use for this as I just want the whole file, so this module combines all the component parts and averages the accuracy (or ‘confidence’) values into one overall value. So we start it:

#Print the results to the console & append into one long text document
print("Appending document into one long file...")
output_text = text_and_confidence_combined(gcs_response)
print(f"Output Text: {output_text[1]}\n Confidence {output_text[0]}")

Which starts the following module, which returns the averaged confidence as well as the full joined text:

def text_and_confidence_combined(gcs_response):
    text_output = "" # Blank string for long text document - all parts in one doc
    confidence_total = 0 # Starts the overall confidence total at this point
    confidence_counter = 0 # Starts a counter by which the total will be divided
    for result in gcs_response.results:
        # First alternative is the most probable result
        alternative = result.alternatives[0]
        #print(f"Transcript: {alternative.transcript}") #debug lines
        text_output += alternative.transcript
        #print(f"Confidence: {alternative.confidence}") #debug lines
        confidence_total += alternative.confidence
        confidence_counter += 1
    overall_confidence = confidence_total / confidence_counter
    return overall_confidence, text_output
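You can sanity-check the joining and averaging without calling Google at all by mocking the response shape; SimpleNamespace below is just a stand-in for the real protobuf objects:

```python
from types import SimpleNamespace

def text_and_confidence_combined(gcs_response):
    # Same logic as above: join transcripts, average the confidences
    text_output = ""
    confidence_total = 0
    confidence_counter = 0
    for result in gcs_response.results:
        alternative = result.alternatives[0]
        text_output += alternative.transcript
        confidence_total += alternative.confidence
        confidence_counter += 1
    return confidence_total / confidence_counter, text_output

# Fake response with two chunks
fake = SimpleNamespace(results=[
    SimpleNamespace(alternatives=[SimpleNamespace(transcript="hello ", confidence=0.9)]),
    SimpleNamespace(alternatives=[SimpleNamespace(transcript="world", confidence=0.7)]),
])
confidence, text = text_and_confidence_combined(fake)
print(text)        # -> hello world
print(confidence)  # roughly 0.8
```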

Sentiment Analysis

This is another module from the Google AI suite. “Sentiment Analysis inspects the given text and identifies the prevailing emotional opinion within the text, especially to determine a writer’s attitude as positive, negative, or neutral.”

So we send the joined-up text to this module. As the previous module returned two values, we use the [1] to select the second value, i.e. the text_output:

#Sends the text to recognise sentiment
print("Performing sentiment analysis...")
overall_sentiment = sentiment_analysis(output_text[1])

Which in turn calls this module. Again, this code sample came from the excellent examples provided by Google, with just a few changes by myself which I’ll detail below:

def sentiment_analysis(sentiment_text):
    from google.cloud import language_v1
    from google.cloud.language_v1 import enums

    #Module Setup
    client = language_v1.LanguageServiceClient()
    type_ = enums.Document.Type.PLAIN_TEXT
    language = "en"
    document = {"content": sentiment_text, "type": type_, "language": language}
    encoding_type = enums.EncodingType.UTF8

    #Get overall sentiment of the input document
    response = client.analyze_sentiment(document, encoding_type=encoding_type)
    document_sentiment_score = response.document_sentiment.score
    print(f"Document sentiment score: {document_sentiment_score}")
    document_sentiment_magnitude = response.document_sentiment.magnitude
    print(f"Document sentiment magnitude: {document_sentiment_magnitude}")

    #Interpret the results (covering the whole score range)
    if document_sentiment_score >= 0.2:
        print("Sentiment returned is: Positive")
        return "Positive"
    elif document_sentiment_score > 0.0:
        print("Sentiment returned is: Neutral")
        return "Neutral"
    elif document_sentiment_score == 0.0:
        print("Sentiment returned is: Mixed")
        return "Mixed"
    else:
        print("Sentiment returned is: Negative")
        return "Negative"

Here I have added a separate part to interpret the results. Google gives a summary of what the returned values mean, so from that I have made a global decision on what is positive or negative. It’s a little hacky, so feel free to play around.
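The same interpretation as a standalone helper, which makes it easy to test on its own; the exact cut-offs are my own reading of Google’s guidance, so tune them to taste:

```python
def interpret_sentiment(score):
    # Map a document sentiment score (-1.0 to 1.0) onto a rough label
    if score >= 0.2:
        return "Positive"
    elif score > 0.0:
        return "Neutral"
    elif score == 0.0:
        return "Mixed"
    else:
        return "Negative"

print(interpret_sentiment(0.45))  # -> Positive
print(interpret_sentiment(-0.3))  # -> Negative
```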

Writing to a file

Next we take all this information and write it to a file. To do that we feed it all the values:

#Writes the text to the top of your overall file
print("Appending text to your main Diary file...")
append_text_to_file(output_text, overall_sentiment)

Remember from above that ‘output_text’ actually contains both the confidence and the text values. This invokes the following module:

def append_text_to_file(text_list, sentiment):
    #with open automatically closes file after you leave the code block
    with open(diary_location, 'a') as Captains_Log:
        the_date = datetime.now() # assumes 'from datetime import datetime' at the top of the file
        the_date = the_date.strftime("%d %b %Y")
        confidence = text_list[0]
        text_to_write = text_list[1]
        Captains_Log.write(f"Date: {the_date}, Mood: {sentiment}, Confidence: {confidence} \n")
        Captains_Log.write(text_to_write + "\n")

What I’ve done here is to open a file (which you defined in the global variables section above) and write the current date, the sentiment, the confidence and the overall transcribed text. Using the ‘a’ mode on open() appends to an existing file, or creates a new one if none is present. The ‘with’ block then closes the file when it’s finished.
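To see the append behaviour without touching your real diary, you can point the same pattern at a throwaway file in the temp directory:

```python
import os, tempfile

# Demonstration of the 'a' mode: the file is created on first write,
# and each later write appends rather than overwrites.
demo_path = os.path.join(tempfile.gettempdir(), "captains_log_demo.txt")
if os.path.exists(demo_path):
    os.remove(demo_path)  # start fresh for the demo

for entry in ("Date: 01 Jan 2021, Mood: Positive, Confidence: 0.9", "hello world"):
    with open(demo_path, "a") as log:
        log.write(entry + "\n")

with open(demo_path) as log:
    print(log.read())
```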

Moving the processed file

Lastly, we want to move the file into a directory to let us know we have finished with it. This command moves the file you fed into the programme at the start and puts it in your ‘processed’ directory as defined by the global variables at the top of the file:

#Move the original file into the processed directory
print ("Moving the file to the processed directory in your Google Drive...")
move_to_processed(args.in_filename)

Which calls the following module. You will notice I’ve used ‘shutil’. This is a Python standard library module which allows you to do a number of high-level file operations such as move or copy:

def move_to_processed(file_path):
    destination = processed_directory
    shutil.move(file_path, destination)
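To see shutil.move in action without risking your real folders, here is a self-contained run against a temporary directory (the file is an empty stand-in for real audio):

```python
import os, shutil, tempfile

# Create a scratch source file and a 'Processed' destination directory
workdir = tempfile.mkdtemp()
processed = os.path.join(workdir, "Processed")
os.makedirs(processed)

source = os.path.join(workdir, "Obama1.mp3")
open(source, "w").close()  # empty stand-in for the real audio file

shutil.move(source, processed)
print(os.listdir(processed))  # -> ['Obama1.mp3']
```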

So let’s take this puppy for a spin!

So for this example I have downloaded a long file to test: Obama’s acceptance speech from 2008. I used one of the many YouTube downloading programmes to get this file as an .mp3 and saved it to my desktop. So let’s start the programme. Open your command line and run the following (the name of the python file, then the filename and path; substitute your_script.py for whatever you named the script):

python your_script.py C:\Users\craig\Desktop\Obama1.mp3

Which all going well will produce the following:

All the little status updates, then a loooooooad of text

which finishes with:

and the other half…

So we can see from that, the accuracy of the transcription was around 92%, the mood was ‘mixed’, and all the parts completed successfully.

So go grab that beer!