So you just published your conference talk videos on YouTube, and you want to add captions. YouTube will use speech-to-text to try to add its own captions for you, but it's pretty hit or miss, especially for technical talks where there are a lot of proper nouns or jargon. The good news is there's a much better solution, and one that provides benefits to your in-person audience as well!
You can hire a captioning service to transcribe the speakers' presentations live during the event and show the captions on a separate display at the conference as well.
There's a real person at the other end of the screen, listening to everything the speakers are saying and typing really quickly. (There's a special keyboard and years of training that go into this, but there are plenty of resources online if you're interested in learning!)
I've worked with a few people who do captioning for these events, and they're always fantastic. If you run a conference, check out StenoKnight and White Coat Captioning and hire them for your next event! They can either send someone out in person to be at the event and do the captioning right there, or they can connect via Skype and do it all remotely.
Typically they'll provide you with a web page that you can pull up on a large monitor at the venue to display the captions in real time for the in-person audience. This is of course great for people who are deaf or hard of hearing, but it's also really helpful for audience members who are not native English speakers, since presenters often talk too quickly to be understood.
At the end of the conference, ask the captioner to send you the plain text files of everything they typed. This will just be a regular text file, no fancy formatting. Here's where the magic happens.
YouTube has a special feature which will match up typed text with the speech it recognizes in videos, doing all the work of syncing your captions with the video timings automatically. If you have a clear audio recording and accurate text, it does a surprisingly good job. Here are the detailed steps to take your plain text file and turn it into captions on YouTube.
First, launch YouTube Studio.
Click on "Videos" to show the list of all the videos uploaded.
Choose the video you want to add captions to.
Click the sidebar option "Transcriptions".
Then click "Add Language". We don’t want to use the autogenerated captions at all, since the live captionist does a way better job.
That adds a new row to this table. Click "Add" under "Subtitles".
A new window opens. Choose the "transcribe and auto-sync" option. That will let us paste in the transcribed text.
Copy the presenter's captions into the box. Make sure the text you paste starts and ends with the same words the video starts and ends with, or YouTube gets confused trying to line things up.
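If the captioner's file covers a whole session, you'll need to cut it down to a single talk before pasting. Here is a minimal sketch in Python of how that trimming could be done. The function name `extract_talk` is made up for illustration, and it assumes you can pick out the first and last few words spoken in the video.

```python
import re


def _phrase_pattern(phrase):
    # Match the phrase's words in order, allowing any whitespace
    # (including line breaks) between them, ignoring case.
    words = [re.escape(word) for word in phrase.split()]
    return re.compile(r"\s+".join(words), re.IGNORECASE)


def extract_talk(transcript, first_words, last_words):
    """Return the slice of `transcript` from `first_words` through `last_words`."""
    first = _phrase_pattern(first_words).search(transcript)
    if first is None:
        raise ValueError("opening words not found in transcript")
    # Use the last occurrence of the closing words after the opening,
    # in case the speaker says the same phrase earlier in the talk.
    endings = list(_phrase_pattern(last_words).finditer(transcript, first.start()))
    if not endings:
        raise ValueError("closing words not found after the opening words")
    return transcript[first.start():endings[-1].end()].strip()
```

You would read the captioner's text file into a string, call `extract_talk` with the opening and closing words you hear in the video, and paste the result into YouTube.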
Once you do that, click "Set Timings".
This part takes a few minutes, so go make a coffee or tea, since you’ll need it for the next step. You can refresh this page to check if it’s done. It will look like this while YouTube is busy with it.
Finally, when YouTube has finished thinking, the captions will appear as a draft you can edit.
Now YouTube has done its magic, and matched up the typed text with the spoken words! It usually does a pretty good job of it. Click on the draft and you can see what it's done.
You could probably publish this at this point, but I like to do a manual review of everything to make sure it looks good. You can play the video to review the captions and timings (check out the keyboard shortcuts which can really help speed up this step).
While reviewing, I’m mainly looking for the following:
Did YouTube leave any dangling words that could otherwise fit into the previous caption?
For example, this would look better if the word "fit," was in the previous caption frame instead of starting a new caption frame with the end of that phrase.
In that case, just move the word to the other caption frame.
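If you export the caption text and want a quick list of spots to check, the dangling-word pattern can be approximated in code. This is only a rough sketch under my own assumptions: the frames are plain strings in order, and a "dangling" word is a first word that ends in punctuation while the previous frame does not. The name `find_dangling_words` is hypothetical, and the actual fix still happens by hand in YouTube's editor.

```python
from typing import List, Tuple

# Punctuation that suggests a word closes out a phrase.
PHRASE_ENDINGS = (",", ".", ";", ":", "!", "?")


def find_dangling_words(frames: List[str]) -> List[Tuple[int, str]]:
    """Return (frame index, word) pairs where the first word of a frame
    looks like the tail end of the previous frame's phrase."""
    dangling = []
    for index in range(1, len(frames)):
        words = frames[index].split()
        previous = frames[index - 1].rstrip()
        if (
            len(words) > 1                              # the frame has more than the one word
            and words[0].endswith(PHRASE_ENDINGS)       # first word closes a phrase
            and previous
            and not previous.endswith(PHRASE_ENDINGS)   # previous frame was left hanging
        ):
            dangling.append((index, words[0]))
    return dangling
```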
Are there any obvious typos on technical terms or proper nouns?
The live captionists do a pretty good job, but occasionally some typos slip through.
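For terms that get mistyped the same way over and over, you could run a find-and-replace pass over the text before pasting it into YouTube. The sketch below is one way to do that in Python. The function name `fix_terms` and the sample corrections are invented for illustration, so build your own list from the typos you actually see.

```python
import re


def fix_terms(text, corrections):
    """Replace each known mistyped term with its correct spelling.

    `corrections` maps the mistyped form to the intended one. Matching is
    case-insensitive and only applies to whole words or phrases.
    """
    for wrong, right in corrections.items():
        pattern = re.compile(r"\b" + re.escape(wrong) + r"\b", re.IGNORECASE)
        # A function replacement keeps `right` literal, even if it contains backslashes.
        text = pattern.sub(lambda match, right=right: right, text)
    return text
```

A manual read-through is still worth doing afterwards, since a blanket replacement can also change text that was typed correctly.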
If the presenter has any long gaps in between sentences, sometimes YouTube gets confused about the timing.
You can find some of these spots by visually looking at the waveform compared to the length of text in the caption. (In this particular example this happens to be accurate but it usually looks similar to this when it’s wrong.)
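The same waveform-versus-text comparison can be roughed out in code if you have the caption timings available. Here is a minimal sketch that flags frames with very few words for how long they stay on screen. The `(start, end, text)` tuple layout, the function name `suspicious_frames`, and the one-word-per-second threshold are all my own assumptions, so tune them to your speakers.

```python
def suspicious_frames(frames, min_words_per_second=1.0):
    """Return the indexes of caption frames that look too long for their text.

    `frames` is a list of (start_seconds, end_seconds, text) tuples.
    """
    flagged = []
    for index, (start, end, text) in enumerate(frames):
        duration = end - start
        if duration <= 0:
            # A frame with no length is worth a look too.
            flagged.append(index)
            continue
        words_per_second = len(text.split()) / duration
        if words_per_second < min_words_per_second:
            flagged.append(index)
    return flagged
```

Anything this flags is only a candidate. As noted above, a long frame can turn out to be accurate, so check it against the video before changing the timing.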
Once you’re happy with the transcript, click "Save Changes".
Now you need to delete the auto transcript so that only the good transcript is left.
You first have to click "Unpublish".
Then you can click "Delete Draft".
Now there is only one set of captions, the good one! You are finished, congrats!
Now when people watch the video on YouTube and enable captions, they'll be seeing what the live captionist typed during the event!
I've been helping Donut.js record their meetup talks for about 3 years now. It's been a fun way to experiment with various video recording setups, and the talks are always great so it's nice to have those recorded.
I've always used a smaller kit compared to the rig I bring to large multi-day conferences. But lately I've had to hand off some camera gear to their volunteers since I've been out of town on work trips for most of the meetups this year.
I've been looking for smaller and smaller gear so that the kit is easy to transport, quick to set up, and operable by someone without a lot of experience with the gear. I've tried a few iterations of kits for them, but so far the most reliable way to record the meetup is using a camcorder, a separate device to capture the presenter's slides, and a separate audio recorder, then syncing everything up in post.
In an ideal world, I'd be able to hand them a small box and a video camera, they could plug everything in, and it would record a single stream mixing in the presenter's slides, the video camera, and the audio. Here's a little diagram showing what I'd like in an ideal world.
The video camera and presenter's laptop connect to the "magic box" via HDMI. We use a PA in the venue, which has an XLR output, and that needs to connect to the magic box to get audio into the mix. Lastly, I want the presenter's laptop HDMI to pass through to an output so that we can feed the venue's projector from the box. The box should be able to record the video output to an SD card or hard drive. I don't need it to be able to livestream, but bonus points if it can.
Part of the goal is also to cut down on post-production time. Right now I have to sync the audio and video (automatically) in Final Cut or Premiere, then line up the recording of the slides manually. It ultimately isn't that bad, but does mean the whole process takes around an hour for the three talks. Ideally I could record a mix of the video and slides already combined into a single video so that the only work is trimming the start and end, and adding the title slides. Here's a snapshot of the kind of layout I'd like to make.
This means scaling the slides and scaling and cropping the video. I would settle for a side-by-side layout like the below, where both are scaled but not cropped.
My last requirement, and one that rules out a few otherwise great devices, is that the device needs to be simple enough to explain in a single page of instructions. It should be as simple as plugging in the HDMI and audio inputs, turning on the device, and hitting record. After I've set it up once, it can't require any configuration on site.
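So far, I haven't been able to find a device that can do all of these things at the same time. Here's a list of all my requirements:
- Two HDMI inputs: one for the video camera and one for the presenter's laptop
- An XLR audio input for the feed from the venue's PA
- HDMI pass-through of the presenter's laptop to feed the venue's projector
- Recording to an SD card or hard drive (livestreaming is a bonus)
- A combined layout of slides and camera, with the slides scaled and the video scaled and cropped, or at least side by side
- Simple enough to explain in a single page of instructions, with no configuration on site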
Here are a few setups that I've tried or investigated.
The Webcaster X2 is so close to being perfect. It has only one HDMI input, but you can also plug in a USB webcam as a second camera. While that's obviously not going to be as good quality as a real camera, I would consider it good enough for this use case. It's the device that made the screenshot above with the text "My Great Presentation", so you can see that it's able to scale both the HDMI input and the USB webcam. Here's where it fails:
I'm also not a huge fan of the fact that it's actually an Android device, but it is pretty well done anyway, and mostly you can ignore that fact while using it.
Total cost: $300 for the Webcaster, but that's beside the point, since it can't record locally at all and so isn't really an option.
At a recent conference I recorded, I hacked together a DIY version of the box using a few components.
The inputs are connected with short HDMI extenders to expose them to the outside of the box.
This makes setup super easy, since you just plug in the three HDMI connectors and you're good to go. In this conference we were using a lav mic that fed into the camera, so the audio was coming in via that HDMI.
The scaler takes whatever resolution people's laptops throw at it and converts it to 1080p, and also passes that back out for the projector. The multiviewer then takes the scaled computer output and the HDMI camera and creates an image with two smaller windows, one for each video. It's also able to select which one to use audio from. The output of the multiviewer goes into the Atomos recorder to record the final output.
This worked well, but it's kind of a clunky solution, and it wouldn't work for Donut.js, where the audio needs to be fed in separately from an XLR cable. That would require a few more pieces of hardware, such as an HDMI audio injector.
Total cost: $300 scaler, $300 multiviewer, $100 cross-converter, $500 HDMI recorder: $1200 plus some cables and maybe also an audio injector.
I haven't actually used the Roland VR-1HD yet, but based on all the videos and reviews I've seen, it comes very close to being a perfect solution.
It has three HDMI inputs, and one pass-through port which is perfect for feeding the projector.
It even has an XLR input on the side which we can use to input the feed from the PA.
But here's what it's missing:
Switching between picture-in-picture and one of the HDMI sources can be done with the physical buttons on top, and would be easy enough to instruct people how to do.
I suppose I could live with picture-in-picture instead of side by side, but I would feel better about that if it also had built-in recording to an SD card.
By the time you add an external recorder, you're spending $1500 on the VR-1HD and $500 on the recorder, for a total of $2000.
The Tricaster is definitely an all-in-one solution, but I'm ruling it out immediately since it requires quite a lot of configuration to get running and isn't something I would consider handing off to a volunteer. It's also quite expensive at a baseline price of $6000.
The Blackmagic ATEM is my go-to for larger events, and I do really like it. However, it's still a bit too complicated to hand off to someone to use. It doesn't have built-in recording, so you have to pair it with an external recorder. It also doesn't have a built-in scaler, so you need one of those for the slides too.
It also can only do picture-in-picture, not side by side video. In order to do that you need to step up to the much larger and more expensive devices.
I haven't actually put together a complete parts list for what it would cost for this option because I don't think it's viable at all. The ATEM TVS is $1000, the recorder is $600, and the scaler is $300, so the base cost before all the other accessories you'd need is $1900.
I'm including a computer running OBS in the list just so people don't tell me I forgot it. It turns out this isn't actually a very good solution, because it won't be an all-in-one box, and it's also kind of complicated to operate, requiring a monitor, keyboard, and mouse.
Trying to do this on a laptop isn't really feasible since it requires two HDMI capture cards plus a USB audio interface. I wouldn't trust Windows to do this since it's very easy for a Windows computer to accidentally start running auto-updates at inopportune times. Running Linux is an option, but would then likely require more explanation to people using it.
I would want to be able to configure the computer to launch OBS on boot and restore a saved configuration, so that there is no fiddling with software to get it running.
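As a rough idea of what that could look like, here is a sketch in Python that builds the command for starting OBS with a saved setup and recording immediately. It relies on the `--profile`, `--collection`, and `--startrecording` launch parameters, so check them against the version of OBS you have installed. The executable name `obs`, the helper names, and the profile and scene collection names in the usage note are placeholders I made up. Getting the script to run at boot depends on the operating system and isn't covered here.

```python
import subprocess


def obs_command(profile, scene_collection, executable="obs"):
    """Build the command line for starting OBS with a saved setup."""
    return [
        executable,
        "--profile", profile,              # saved output and encoder settings
        "--collection", scene_collection,  # saved scenes and sources
        "--startrecording",                # begin recording as soon as OBS is up
    ]


def launch_obs(profile, scene_collection):
    """Start OBS in the background without waiting for it to exit."""
    return subprocess.Popen(obs_command(profile, scene_collection))
```

Calling `launch_obs("Meetup", "Slides and Camera")` from a login or startup script would be one way to avoid fiddling with the software on site, though a volunteer would still need a way to stop the recording.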
Overall I feel like there are too many moving parts and too many ways this can fail, and all the wires to plug in would make the setup time pretty long.
I've actually used the SlingStudio at Donut.js quite a bit myself, and it is again almost perfect.
The total cost of this setup is:
The Epiphan Pearl Mini sure looks like a fantastic device. I haven't tried it out myself, but I've looked at a bunch of reviews of it. It seems like it's the only thing that actually ticks all of the feature boxes.
The only thing I am not clear on is what happens when you first boot it up. I am hoping that it would restore the last used configuration and be ready to go immediately.
Really the only downside to this is the cost. It's a $3500 device, which is good for what it can do, but also still quite a lot of money. After this much research though, I'm coming around to the idea that maybe it's worth it.
So I think out of all of these options, the best is the Pearl Mini ($3500), and second best is the Roland VR-1HD with external recorder ($2000).
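To keep the totals straight, the prices quoted in this post can be added up in a few lines of Python. This is just a sketch using the numbers above. The dictionary layout and function names are my own, cables and extra accessories are left out, and the Webcaster and SlingStudio are omitted because the post rules out the first and gives no total for the second.

```python
# Component prices in US dollars, as quoted in this post.
SETUPS = {
    "DIY box": {"scaler": 300, "multiviewer": 300, "cross-converter": 100, "HDMI recorder": 500},
    "Roland VR-1HD": {"VR-1HD": 1500, "external recorder": 500},
    "Blackmagic ATEM": {"ATEM TVS": 1000, "recorder": 600, "scaler": 300},  # before other accessories
    "Epiphan Pearl Mini": {"Pearl Mini": 3500},
    "Tricaster": {"baseline price": 6000},
}


def totals(setups):
    """Sum the component prices for each setup."""
    return {name: sum(parts.values()) for name, parts in setups.items()}


def cheapest_first(setups):
    """Return (name, total) pairs sorted from least to most expensive."""
    return sorted(totals(setups).items(), key=lambda item: item[1])
```

Price is only one column of the comparison, of course. The cheaper setups here are the ones that miss a requirement or need extra boxes.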
I'd be curious to hear your thoughts on this! Did I forget about any options? Is there a new device that's come out that I don't know about yet?
Write a blog post response or ping me on Twitter to get in touch!
If I ever do find the perfect solution, I will be sure to post a review video on my YouTube channel!