Speech To Picture - kinda' creepy

October - December 2023 -

This page will give you all the instructions you need to create your own speech driven picture frame. It's a fun project that will amaze your friends. It brings the power of AI image generation to everyone. (A short video of the project )

See a bunch of images from this project
Creepy Picture Frame

Other Rants and Raves

How it started

Bruce is always looking for cool things and sent me a link to the Whisper Frame by Christopher Moravec (as seen on Hackaday) Chris described using a Raspberry Pi and a pretty sophisticated microphone to listen to a conversation and use AI to create an image based on the conversation. I loved the idea. Eavesdrop on the surroundings, create some art, wait for the unsuspecting participants to lookup and say something like, "What the heck? We were just talking about Egypt and now there's a picture of the pyramids? How can that be?" I had to have one.

I code up all kinds of projects but didn't know anything about the OpenAI API, and just a smidge more about Python. Friends at my local coffee shop had been extolling the virtues of GitHub CoPilot for a while and I decided to see if it could help me with this project. Wow, did it ever! (I'll have to write another story about that experience.) I started out to do it on my Macbook, knowing that I'd have a microphone and the compute power I needed. With CoPilot at my side I had things working in a rudimentary way within a day. I was excited.

I decided to start cheap; turns out I was able to end cheap as well! I had an old RPi 3 laying around. For a microphone Chris used an expensive directional microphone; I ordered a cheap $7 USB microphone from Amazon. I had a spare monitor, so that came onto my desk. In a few hours I had the code running on the RPi and when the microphone arrived the next day, that also worked quickly. There were a few bumps along that way, but nothing that couldn't be solved with a little googling.

The project was wrapped up in time to put the display out for our Halloween party. After a lot of discussion we decided that while the surepiticious monitoring would be cool and certainly creepy, it might trigger some privacy concerns from our guests. Instead, I added a button to the RPi that would trigger a session. This put the guest in control of the situation. It turned out to be very popular. It's fun to command it to make pictures on any topic at all.

I've learned a lot along the way that I'll document for you below. My code is in GitHub. You can probably get it running on your own Macbook in half an hour. Running on an RPi requires a bit more, but not much. If you make your own, please email me and let me know how it goes for you and your friends.

I have mine installed at our makerspace where we let people play with it. At open houses, and on tours, it has proven to be a fun interactive exhibit

How it works

Using Python the project uses several calls to the OpenAI cloud for services. The basic flow is:

  1. Collect an amount of audio. Either 10 seconds or 2 minutes.
  2. Send the audio to OpenAI to get a transcript of the conversation.
  3. If we collect 2 minutes, then send the transcript to OpenAI to get a key phrase. Otherwise we'll use the transcript as the key phrase.
  4. Send the key phrase to OpenAI to get 4 images.
  5. Combine the 4 images into a grid and add a caption at the bottom.
  6. Display the images on the screen.

I'll cover each step below and what I learned along the way. If you just want to get going, then you can download the code from my GitHub repo.

Running the Program

Run from the command line as >python pyspeech it will bring up a simple menu. Select "o" for "once" or "a" for 10 iterations of the capture and picture generation.

The -h option will print a list of command line options. The most useful for the beginner are

  1. -d 0 to turn on debugging
  2. -o to set the 10 second capture mode

There are a number of other options for testing that let you start at different points in the pipeline. For example, if you want to work on the display step, then use the -i option to pass in an existing image file. The code won't do all the OpenAI work and just execute the image display step.

Getting Set Up

OpenAI - To use this code you'll need an API account from OpenAI. On their site create an account. Then you need to put some money on account. At first this worried me, but OpenAI lets you set monthly limits on how much you can spend. Once you reach that limit they stop access to the API and send you an email. This protects you from wild spending; without this option I might not have gone forward with the project. The service is pretty cheap. The whole evening of our party people were pressing the button and making images. I had set a limit of $50 hoping that it didn't go too quickly. In the end, the entire evening of fun cost me $3. That's right, less than a latte.

You can get the software running on your Macbook first and purchase the stuff below only if you want to run on an RPi.

Raspberry Pi - I used an old Model 3 that had been sitting around so long the heatsinks were falling off. (I bought more heatsink tape for $6.) You have to set up your RPi but that is well documented other places, and pretty easy these days.

I bought the SunFounder Mini Microphone MI-305 for $8 on Amazon. I was worried about background sounds and limited sensitivity - bah, it works great!

On the RPi I had to use a Python virtual environment. It was simple to do. Instructions are in the header comments of the code.

Collecting Audio

At first this was really tough, even on my Macbook. It seemed like there was no microphone. Ah, you need to give permission for the terminal to access the microphone. On OSX it took me some searching to figure this out. This page helped me. Until the Terminal app shows up in Settings / Privacy & Security / Microphone this program just won't work.

On the RPi I had to add my user to the "audio" group. I did this with usermod -a -G audio

On the RPi you have to add your user to the "audio" group.

Transcribing Audio

Back on the OpenAI site you have to generate an API access key. Go to your account, click on the View API Keys choice, give it a name. To run you'll have to set the Unix environment variable OPENAI_API_KEY (at least this is true in v1.0; when OpenAI releases another version you'll have to figure out how they've changed. The OpenAI API is not stable yet.) There are instructions on how to do this in the comments at the top of the code.

The actual transcription was so easy as to be trivial. CoPilot gave me the code. The transcription is really marvelous. OpenAI charges you by the number of "tokens", which is hard to grasp but in the end costs only pennies.

Getting a Key Phrase

At first I was collecting 3 minutes of audio, which was too much for the later step of picture generation. I decided to ask OpenAI to summarize the transcript and then pick an interesting phrase from the summary.

The summary itself is scarily good. In testing with my friends we sat at our local coffee roastery just chatting and then every few minutes a summary would appear on the screen. We would read it in awe of the technology. You can't believe how good it is. In the end I didn't use the summary, but I've left the code commented out because you might want to try it.

I found that the two step approach was too indirect. And I had a lot of trouble getting a good prompt to use in generating the image. My friend Michael suggested I ask OpenAI to pick "the most interesting noun phrase" from the transcript. That worked great for these long captures. (As an aside about CoPilot - I'm updating this page right now and in a part below about interesting discoveries I had CoPilot suggest this, "look for the first noun phrase that is not a pronoun." How interesting. Perhaps I should go back to the summary and ask CoPilot how to improve my prompt.)

For the short captures I just use the transcript as the key phrase. In this case the whole process is not as amazing as the longer captures, but it's also not as creepy. The use case here is to tell my guests "you have 10 seconds to say a few words." This puts them in control and the results are more fun than privacy invading creepy.

Generating the Images

The images were now interesting, but I had two problems: they were often off the mark, and they were cartoonishly booring. I solved the first problem by asking OpenAI to generate 4 images and then I put them together in a 2x2 grid. This gave me a better chance of getting something interesting. For the second problem I added two sets of random modifiers: one of media type (painting, photo, sketch, etc) and one for artist style (Dali, Van Gogh, Rembrandt, etc). This gave me a lot more interesting results.

This was now working great, but I sometimes couldn't see how the images related to the conversation, particularly when using the long capture. And certainly a week later I had no idea what the prompt was. As a result I added a caption to the bottom of each image grid. This is a simple step, but it really made a difference. CoPilot was a great help in giving me the code I needed for this image manipulation.

Displaying the Image

The CoPilot code was simple and worked, but it had room for improvement. The show() function had the image pop up in a new process window that I had no programmatic control over (perhaps the result of my know-nothing-about-Python approach to this).

Eventually I Googled Python display libraries and the results suggested TKinter. I told CoPilot to use TKinter. I was pleased to see that TKinter is cross platform and when something runs on my Mac, it also runs on the RPi. There was a bit of folderal needed to have TKinter run in a separate thread - that's in my code too. Now I have a real, extensible solution. This opens up the possiblity of a much more sophisticated display; in my GitHub repo you can read the Issues for what I plan to do.

Hardware control

My last spasm on this before the Halloween debut was to get a push button up and working. I wanted people to know when they were being recorded and to have some indication of the progress, since this whole thing can take some time on the RPi.

I decided to use a momentary contact button and a red LED. When the button is pushed the capture starts and the red LED flashes quickly as long as audio is being captured. Then the flashing slows way down and gradually changes as the steps of the process are worked. Eventually it blinks very slowly and then, bang, the image appears on the screen.

I made a quick 3D printed control panel to hang over the top of my monitor. It also holds the RPi upright so that the microphone is exposed. A bit of nano-tape keeps it all in place. I added a hand written sign (I was running out of time) inviting guests to press the button and speak a few words.

The menu now has an option for hardware control (h) that waits for a button press.

The code in the repo now supports "kiosk" mode on the RPi. When the RPi boots, the program starts up and enters hardware control mode. The button is the only way to start the program. This is a good way to deploy the program at a public venue. (Like our makerspace.)

Interesting Discoveries

During one of my many tests I accidentally had OpenAI transcribe a 10 second file with no talking. What came back was some Japanese characters. Google Translate told me it was "Thank you for watching until the end." That's kind of creepy.

Asking for an interesting noun phrase was good but when using OpenAI for keyword generation of a long transcript I'd sometimes get a leading, "The most interesting noun phrase in the text is,". That would mess up the image generation step. So the code now strips out a number of these phrases. Michael suggested I write the summary request with the part "as a sentence fragment" in all caps, thinking OpenAI would take that instruction more seriously.

In one test I recorded my rambling thoughts of whatever came into my mind. I wondered how OpenAI would summarize that. OpenAI responded, "The text is a series of disconnected thoughts." Hah! I ran the same transcript through summarization several times. Once it responded, "The speaker questions whether electricity is real," which was something I had rambled about - that phrase generated an interesting image.

OpenAI will gladly transcribe some bad things, but image generation has some standards. I can get it to return an error of, "your request was rejected as a result of our safety system." I'm not sure how they decide what isn't safe, but good for them. This led to some special code in my program to handle the situation and chastise the user.

Our makerspace had some visitors from Spain. On a lark they tried speaking Spanish to it. I am amazed to say that it worked perfectly. We later had a group from a Korean makerspace come by. We tried Korean. I think the transcription worked (at least unicode characters were returned) but the image generation failed. I might be munging up the unicode characters somewhere in the pipeline. I'll have to look into that.

What's next

Depending on when you read this, these might already be done. Check my GitHub repo issues list.