Transcribing JB Shows with Whisper
Background
Whisper for Audio Transcription
Whisper by OpenAI is a phenomenal audio-to-text transcription engine. The default output formats (TXT, SRT, VTT) are great for immediate use, but you can also integrate the whisper library into your own Python code.
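As a rough illustration of using the library from Python, here is a minimal sketch. It assumes the openai-whisper package and ffmpeg are installed. The helper names (format_timestamp, segments_to_lines, transcribe_episode) and the "base" model choice are mine for illustration, not taken from the jbshows project.

```python
from __future__ import annotations

from typing import Iterable


def format_timestamp(seconds: float) -> str:
    """Render a number of seconds as HH:MM:SS (fractions are truncated)."""
    total = int(seconds)
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def segments_to_lines(segments: Iterable[dict]) -> list[str]:
    """Turn Whisper-style segments into '[HH:MM:SS] text' lines.

    Each segment is expected to carry at least 'start' (seconds) and 'text',
    which is the shape Whisper returns under result["segments"].
    """
    return [
        f"[{format_timestamp(seg['start'])}] {seg['text'].strip()}"
        for seg in segments
    ]


def transcribe_episode(audio_path: str, model_name: str = "base") -> list[dict]:
    """Transcribe one audio file and return its list of segments."""
    # Imported lazily so the helpers above work without Whisper installed.
    import whisper

    model = whisper.load_model(model_name)
    result = model.transcribe(audio_path)
    return result["segments"]
```

Calling segments_to_lines(transcribe_episode("episode.mp3")) would give one timestamped line per segment, where "episode.mp3" is a placeholder path. Keeping the per-segment start times, rather than only the flat text, is what later lets a search hit point back to a position in the audio.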
Jupiter Broadcasting
Jupiter Broadcasting is a podcast network rich in Linux-related content.
Bringing it together
A problem with podcasts, and likely more so with technology podcasts, is that great content gets covered in the shows, but the discoverability of those topics is often limited to show notes or the occasional annotated timeline. From a producer’s standpoint, this means that once a show airs and enters the back catalog, its ongoing value fades quickly, whether that value is directly financial (ad impressions) or comes from being referenced or reused in future shows. From a consumer’s standpoint, a regular listener may be able to remember roughly how long ago a topic of interest came up. However, if a new interest pops up that you know was covered at some point, there is little metadata to help you search for it.
A solution could be full-text searchable transcripts of the shows. As a side project, I scripted the processing of Jupiter Broadcasting shows and built a Django web application called jbshows to demo the idea.
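To make the idea concrete, here is a small sketch of full-text search over transcript segments using a plain in-memory inverted index. It is a toy stand-in for a real search library, not the search backend jbshows uses. The class name TranscriptIndex, the episode identifiers, and the sample text are all hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


class TranscriptIndex:
    """Maps each term to the (episode, start_time) segments containing it."""

    def __init__(self):
        self._postings = defaultdict(set)  # term -> {(episode, start)}
        self._text = {}                    # (episode, start) -> segment text

    def add_segment(self, episode, start, text):
        """Index one transcript segment under every term it contains."""
        key = (episode, start)
        self._text[key] = text.strip()
        for term in tokenize(text):
            self._postings[term].add(key)

    def search(self, query):
        """Return segments containing every query term, sorted by episode and time."""
        terms = tokenize(query)
        if not terms:
            return []
        hits = set.intersection(*(self._postings.get(t, set()) for t in terms))
        return [(ep, start, self._text[(ep, start)]) for ep, start in sorted(hits)]


index = TranscriptIndex()
index.add_segment("episode-a", 120.0, "We talk about ZFS snapshots on Linux.")
index.add_segment("episode-a", 300.0, "A sponsor read goes here.")
index.add_segment("episode-b", 45.0, "Linux kernel news this week.")

for episode, start, text in index.search("zfs snapshots"):
    print(episode, start, text)
```

Because every hit carries an episode identifier and a start time, a web front end could link a result straight to that moment in the show. A real search library adds ranking, stemming, and on-disk storage on top of this basic structure.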
The project was educational. While Whoosh, a pure-Python search library, works, I’d like to revisit the project to clean up the metadata and see whether Solr would provide better results.
Problems
Something I did find is that public-feed podcasts often feature ad reads. The same reads appear across many episodes and add noise to your search results. At the least, they hurt your ability to find non-ad content about the advertised product or service. Ideally you’d ingest an ad-free feed for indexing and have your results link to the ad-populated feed. That would mean transcribing both versions of the podcast: one for search indexing and another for captions of the episode as it aired.
Give it a try
Repositories are available and permissively licensed.