AI animation of the Niterói Art Museum on legs

Invoking photo-video autolography: AI prompting as creative practice

In late 2024 and 2025, a new class of generative AI systems made it possible to upload still images and use natural language prompts to generate short videos. While models such as OpenAI’s Sora and Google’s Veo could generate video from scratch, it was perhaps more interesting that they could be instructed to mobilise photographs – whether found, captured, or generated. These invoked short moving sequences manifest many features of conventionally created movement images: bodily movements and gestures, facial expressions, camera movement, environmental dynamics, sound and dramatic special effects. This emerging practice occupies an unstable space between writing, photography, animation and cinema, and has yet to be adequately theorised within digital media studies.

This post proposes that photo-video autolography is an emerging creative practice based on invoking moving images from still photographs through written prompts. Building on earlier work on autolography (Chesher & Albarrán-Torres 2023) – the invocation of images from text through generative AI – I argue that photo-video autolography constitutes a distinct cultural technique that reconfigures long-standing relations between word and image, instruction and interpretation, stillness and movement. Unlike traditional animation based on drawing, or filmmaking with motion picture cameras, this practice operates through invocation: prompts addressed to large-scale multimodal models trained on vast repositories of textual, visual, and audiovisual culture.

I analyse prompting not as a form of representation or command, but as an event that mobilises statistically encoded blocs of sensation within AI systems. Through close reflection on a series of autolographic video experiments, ranging from animated architectural photographs to AI-generated portraiture, I show how prompting functions as an aesthetic practice that struggles against cliché while remaining structurally attracted to it.

By situating photo-video autolography in relation to ekphrasis, invocation, and the movement-image, this post contributes to emerging critical understandings of generative AI as a cultural medium rather than just a technical curiosity. It argues that analysing these practices is necessary not only for understanding new aesthetic forms, but for grasping how contemporary AI systems reorganise authorship, creativity, and access to the cultural record.

This video of a museum in Brazil standing up and walking is one of many videos I made using photo-video autolography. In this post, I explain my prompting strategies, but also reflect on the personal and cultural significance of this creative medium. The video above began with a visit to the stunning Niterói Museum of Contemporary Art, directly across the bay from Rio de Janeiro. The building is sometimes compared to a lighthouse, a flower, or a spaceship. Writing prompts for images begins with describing, in your own choice of words, what is already in the scene, so I took the spaceship metaphor and included it in my prompt to encourage Google’s Veo3 to recognise a flying saucer rather than an art museum. Rather than make the spaceship hover and fly into outer space (perhaps a cliché), I thought it might be better to give it legs and make it walk.

Niterói Museum of Contemporary Art

But my first attempt did not achieve what I had hoped. The prompt was:

First attempt: The flying saucer’s windows glow red and hum. A green glow appears underneath with a shushing sound. The ship starts to shake rise up, with rumbling and swelling sci fi music, revealing large muscly human legs holding up the flying saucer, and the whole ship walks slowly one step at a time towards the camera. Science fiction music builds. As the saucer gets closer, The camera starts shaking and the camera man turns and runs away screaming

The first prompt did not produce the video I had intended. After being disappointed that the magic had failed me, I deconstructed the prompt I had used. The phrase ‘revealing large muscly human legs’ resulted in the legs suddenly appearing once the ‘spaceship’ had already risen up, as there was too much prompt text between the rising up and the ‘legs’. The camera movement didn’t end well, either: rather than shaking as I had hoped, the camera drifted slowly away from the spaceship, and there was no scream from the camera operator. I also thought the video vignette lacked a spectacular narrative resolution.

I rephrased the prompt to clarify the relationship between the legs and the spaceship. I simplified the wording and added a more dynamic element by incorporating a fall into the water at the climax of the action. Notice how the water appears only once the spaceship is ready to fall into it.

The flying saucer’s windows glow red and hum. The ship starts to shake and rise up, revealing that it is held up by large muscly human legs. The saucer takes a step towards the camera. It takes another step. Birds scatter. It takes another step closer to the camera and climactic sci fi music crescendos. It takes another step and falls down into the water. (see video at the top).

Writing to make images

Writing for AI image generators is a creative practice based on the cultural technique of prompting: a specialised form of writing that invokes multimodal AI models to generate images guided by the prompt. In a 2023 article in Media International Australia, César Albarrán-Torres and I called this autolography. The term is derived from Greek words for ‘self’, ‘word’ and ‘drawing’, in the same lineage as ‘photography’, which is light-drawing. In this post, I will expand upon this new phenomenon by considering the generation of videos from still images as photo-video autolography: the practice of capturing a photo, importing it into a video model (like OpenAI’s Sora or Google’s Veo), and writing a prompt that animates it, allowing it to escape its status as a photograph and become a moving image.
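For readers who want a concrete sense of this workflow beyond the web interfaces, the following is a minimal sketch of how a photo-to-video invocation might be scripted against Google’s Gemini API, assuming the google-genai Python SDK. The model identifier, filenames and prompt are illustrative, and the exact parameter names and response fields may differ between SDK versions.

```python
# A minimal sketch of photo-video autolography as an API call, assuming the
# google-genai Python SDK and an illustrative Veo model name; exact names and
# response fields may differ between SDK versions.
import time
from google import genai
from google.genai import types

client = genai.Client()  # reads the API key from the environment

# The captured still image that establishes the mise-en-scene
with open("niteroi.jpg", "rb") as f:
    photo = types.Image(image_bytes=f.read(), mime_type="image/jpeg")

# The written prompt that mobilises the scene
prompt = (
    "The flying saucer's windows glow red and hum. The ship starts to shake "
    "and rise up, revealing that it is held up by large muscly human legs. "
    "The saucer takes a step towards the camera."
)

# Video generation is a long-running operation that has to be polled
operation = client.models.generate_videos(
    model="veo-3.0-generate-001",  # illustrative model identifier
    prompt=prompt,
    image=photo,
)
while not operation.done:
    time.sleep(10)
    operation = client.operations.get(operation)

# Save the resulting short clip
video = operation.response.generated_videos[0]
client.files.download(file=video.video)
video.video.save("saucer_walks.mp4")
```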

A distinctive feature of AI prompting is that its sociotechnical protocols are already familiar to anyone capable of writing and imagining. Prompting is an emerging practice of mastering the poetry and protocols of the medium. Different prompters adopt different strategies: some minimalist, others richly descriptive, and others highly formalised and detailed. Yet, because AI is trained on content that statistically favours dominant meanings, prompting well involves steering the generated video away from the most clichéd cultural spaces.

Certainly, autolographic systems are significant achievements in computer science. But this is not where the value of generative AI primarily comes from. Its value derives from accumulated textual and visual artefacts in the cultural record used as training data. Training involves extracting numerical values from meaningful fragments of culture – tokenised text, image primitives such as edges, textures, lighting, and styles, and the physics and performativity of movements – and modelling their relations in high-dimensional space. At planetary scale, this makes substantial portions of the world’s cultural record invocable as new, but derivative, creative works.
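As a toy illustration of what modelling relations in high-dimensional space means in practice: fragments of culture are reduced to vectors of numbers, and nearness in that space stands in for relatedness. The three-dimensional vectors below are invented for the example; real models learn embeddings with thousands of dimensions from training data.

```python
# Toy illustration only: real models learn embeddings with thousands of
# dimensions from training data; these three-dimensional vectors are invented.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Measure how close two fragments of culture sit in the modelled space."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical embeddings for a prompt phrase, a matching image region and an
# unrelated one.
prompt_vec = np.array([0.9, 0.1, 0.3])   # "flying saucer"
saucer_img = np.array([0.8, 0.2, 0.4])   # image region: the museum's disc
pasta_img = np.array([0.1, 0.9, 0.2])    # image region: a bowl of pasta

print(cosine_similarity(prompt_vec, saucer_img))  # high: nearby in the space
print(cosine_similarity(prompt_vec, pasta_img))   # low: distant in the space
```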

Practicing photo-video autolography

Photo-video autolography is different from generating images or videos from prompts alone because it begins with an image from one’s own everyday life, and introduces time and movement. It becomes a practice that resembles storyboarding and directing digital special effects. The captured, curated or generated still image establishes a mise-en-scene, and the prompt dictates how these image elements should be mobilised and transformed in a space of people, objects, holes, physics, planes and light.

My own practice of photo-video autolography has mostly involved capturing digital images with my phone and immediately writing prompts to bring them to life. By grounding the image in the immediate situation, I could begin with a bloc of sensation and relate what is generated to my own experience, movements and desires, particularly as I was travelling in the Americas between August and October 2025. This changed how I took photos, and how I mobilised prompting language.

In the rest of this post I will explore the techniques and aesthetics that I developed through my prompting practice, and how it changed as I made a series of videos. I will talk about its frustrations and surprises, and situate it within the wider relationship between writing and moving images in culture.

First encounter

I discovered this technique a few months before my trip. Just as with still autolography, the first encounter with video generation was experienced as magic. When I first logged onto Sora in late 2024, I wasn’t sure about the capacities or limits of the platform. So of course the first thing I wanted to generate was an invocation.

Wizard summons a spirit, waving his arms and disappears in a puff of smoke

My prompt was pretty short and simple, and yet Sora did not follow it exactly: the wizard did not disappear (which is what would have actually made it magical). I would come to find the unreliability of invoking these earlier models frustrating but sometimes pleasantly surprising.

Invoking flying pasta

When visiting a friend’s restaurant I tried to create a viral video for it (before this kind of content was widely dismissed as AI slop). The model’s failures to follow my prompt gave the video its strangeness and humour. Examining the videos more closely revealed some of the capacities and constraints of the medium and the model. Notably, the video maintains a distorted consistency with the space of the mise-en-scene. While I wanted the plates to spin, the model insisted that the action come from the people rather than from the objects I wanted to move. I began to notice how sequencing and camera movement followed the word order of the prompt. The obvious technical failures of the video model proved hilarious.

Italian man in green suit smiles at the camera. Red headed woman next to him laughs. The setting is a cosy Italian restaurant. A sign reads ‘Ferrara’ Bowls of pasta on the table in front off them spin around throwing off sauce. Everyone laughs.

Generating a video from only a prompt meant placing action in a fictionalised and generic space, so I thought I would experiment with beginning the AI generation with a photo. This was my first experience with photo-video autolography. It turned out that Sora would not accept a photo with a person in it, so I took a photo of a table in the garden area of the restaurant and started prompting an invented character. Aiming to make the video as ridiculous as possible, I used expressive adverbs and a nonsensical description. But even I was surprised by the bizarre spatial and temporal distortions in these almost abject movement images.

Red headed woman dancing frantically on the table while pasta falls from the sky. A flickering sign at the bottom of the frame reads ‘Ferrera’.

The fact that the results were Alice-in-Wonderland bizarre was delightful, but also emphasised the imprecision of prompting with the earlier models. Again there were missing elements, and the model’s interpretation was out of control.

Photo-video autolographic montage

The video vignettes on their own were sometimes unsatisfying, so I instinctively sought to experiment with montage and the transformative power of editing. I took several images around my suburb of Newtown. In taking the images I had to take care to avoid capturing recognisable faces, and I needed to leave negative space through which the elements I would invoke could move. I decided that I would populate these streets and footpaths with animals and people as moving elements. I wanted to re-wild the space of my neighbourhood.

Prompts included ‘a row of camels wander down the street’, ‘Monkeys run wild down the street’ and ‘Five people wearing overalls and flowery hats jump down the street on pogo sticks towards the camera until they smash into the camera’. Sora didn’t generate sound, so I added sound effects and used the AI tool Suno to create a musical soundtrack.
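For readers who want to script the assembly rather than use a conventional editor, here is one possible way to stitch the short clips into a montage and lay a generated soundtrack under them, sketched with ffmpeg invoked from Python. The filenames are illustrative, and this is only a sketch of the editing step, not the tool actually used.

```python
# A sketch of stitching short vignettes into a montage and adding a generated
# soundtrack, using ffmpeg invoked from Python. Filenames are illustrative;
# any conventional video editor does the same job interactively.
import subprocess

clips = ["camels.mp4", "monkeys.mp4", "pogo_sticks.mp4"]  # invoked vignettes, in order

# The concat demuxer expects a small text file listing the clips
with open("clips.txt", "w") as f:
    for clip in clips:
        f.write(f"file '{clip}'\n")

# Join the clips without re-encoding (assumes they share the same format)
subprocess.run(
    ["ffmpeg", "-f", "concat", "-safe", "0", "-i", "clips.txt", "-c", "copy", "montage.mp4"],
    check=True,
)

# Lay the soundtrack under the montage, ending with whichever input is shorter
subprocess.run(
    ["ffmpeg", "-i", "montage.mp4", "-i", "suno_soundtrack.mp3",
     "-map", "0:v", "-map", "1:a", "-c:v", "copy", "-shortest", "montage_with_music.mp4"],
    check=True,
)
```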

Not all the videos produced what I expected. Some of the spaces and people were pretty wonky, but I was impressed with how the model seemed to recognise visual elements in the images I had taken and respect their spatial relationships much of the time. In Deleuze’s (1986) terms, these include perception images (the contours of the situation) and action images (establishing vectors of movement), but not affection images (close-ups of faces). Like the early cinema of attractions, this video emphasises spectacle and sensory impact rather than narrative immersion.

Invoking talkies

With each new AI model, new prompt vernaculars, visual aesthetics and invoked worlds come to operate, and it takes time and experimentation to use them well. In late August 2025, I departed for Trump’s America, just after I signed up for a trial of Google’s Gemini, which gave me access to the Veo3 video generator model. Veo3 was reputed to be more effective and to add the element of sound. I wasn’t sure what to expect, but over a beer with my friend Brian on a rooftop bar in Seattle, I tried Veo3.

Man on rooftop bar starts ranting and jumping up and down while it starts raining frogs

One of the first things that was different about the Veo3 model was that this image of Brian was not rejected, even though it contained a recognisable face. My prompt was quite short again, but this simple invocation summoned a world and a personality that were uncannily rich, familiar and bizarre. The model expanded the vertically oriented video frame to widescreen, inventing a crowd of mystified people in the bar. Frogs dropped from the sky. And Brian ranted in a way that seemed to mock his actual personal style (which is why I like him).

Photo-video autolographic portraiture

Although I didn’t recognise it immediately, I began performing a kind of dynamic AI portraiture. I subjected each friend I visited to autolographic examination, selecting a scene nearby, asking them to pose, and writing a prompt that seemed intuitively to fit. While video models are refined to maintain a likeness from the initial image, that likeness often broke down during movement in amusing or disturbing ways: beards appearing on faces, shirts changing colour, faces puckered into monstrous features. Phantoms often appeared on close examination.

Despite these errors, AI prompting has established itself as a new and distinct cultural technique that becomes more refined with the release of each new model. The efficacy of the platform as a creative environment is bound to its close relation to the familiar and highly developed cultural techniques of taking photos and using language. While writing prompts functions technically as giving instructions for the operation of the machine, it only distantly resembles formal programming languages. Learning to prompt is an intuitive skill that develops over time. Yet it remains a constant battle against the medium’s intrinsic attraction to cliché.

Through practice, I improved my use of the medium and the genre I have defined as photo-video autolography. I learnt how to take a photo that establishes a scene, characters, objects and a site in which action could be invoked. I became sensitised to the poetics and technical requirements of writing more accurate and detailed prompts. From trial and error and watching instructional YouTube videos, I learnt about the underlying operations of AI video and audio generation. As in the early and better days of Twitter, I learnt how to write within constraints – in this case an eight-second limit. I learnt about mise-en-scene, movement, action and narrative. I found ways to reference things that appeared in an image (the woman in a blue dress) and invoke what was not there (Godzilla). I learnt to make humans and non-humans in a scene act (statues, Halloween decorations, murals, brooms, guitars, a stove). I worked out techniques to control camera movement.

Found artworks

A variation on photo-video autolography is the reinterpretation of familiar artworks by adding duration and action to them. This autolographic video, made with Sora 2, was based on Grant Wood’s American Gothic, which I saw at the Art Institute of Chicago. It trades on the viewer’s familiarity with this iconic image, and makes an unexpected transformation in mood and scenario.

The man lifts the fork into the air and plunges it down, shouting ‘Let’s go!’ Old style dance music kicks in. The woman reels backwards and starts dancing the twist. A confetti cannon out of shot fires confetti into the air. The man spins around and dances like Uma Thurman in Pulp Fiction. The camera moves out to reveal a large number of people wearing similar clothes, dancing like anything. 

AI portraiture

I can reflect on my motivations for some of these videos and some of the failures in the generation process. In prompting the tiger in the Japanese gardens in Portland, OR, I was thinking about the scene in Apocalypse Now when the tiger leaps out from the jungle. I meant for the tiger to chase after me and my friend, so I was surprised when we ran illogically alongside it. David T did some impressive break dancing moves, but they were not anatomically correct. When Michael R is sitting on the couch in the flooded room, his invented monologue was a little too literal-minded and flat.

When Godzilla appears in New York (rather than King Kong), I had taken the photo deliberately to leave space for the monster to rampage down, but the resulting video had problems with objects suddenly appearing and disappearing. With this video I tried writing a longer and more detailed prompt and sequenced a more complex mini-narrative. When I was screaming on the exploding Brooklyn Bridge, I don’t know why I sat down rather than falling forward to take cover.

The first video of Michael G flying down the street didn’t turn out the way I expected (although it is kind of weird and cool), so I revised my prompt to get him to turn away from the camera and fly down the street on the broom (which I provided), with the camera following. The video that started with Nickolas standing by the stove in his kitchen also involved a long and detailed prompt. The only annoyance was that he was supposed to say ‘strewth’ at the end of the clip, not at the beginning. The video with the band began with an image of Michael B posing with my Stratocaster. I added the rest of the band in the image generator Nanobanana, and then animated the mural and the band with a detailed action prompt. The weird element is that the guitarist is lifted out of his own body, duplicating the character.

Statues of Santiago

So, making AI videos also serves as a new touristic practice. One day in Santiago I ran around capturing images of statues, and that evening (channelling Night at the Museum) I animated the statues using a combination of Midjourney and Veo3. It demanded a way of looking at the world for things that could serve as faces and bodies and be brought to life. Again, I generated a soundtrack in Suno and edited the clips together into a montage.

Conclusions

So, at a practical level, I propose that photo-video autolography is an accessible creative practice with pedagogical and expressive possibilities. It has the potential to foster, in people learning to use it, a better understanding of photography, writing, large foundation model AI, animation and the moving image. It establishes a genre of still photography for video autolography that requires an awareness of the possibilities for combining people, objects and setting as part of a mise-en-scene to which movement can be added. Prompting is a distinctive form of writing that requires competencies in language as a cultural resource and as a code with operative potential for scripting movement. Autolography is a mode of invocation that calls upon massive invocable domains of AI-modelled resources of written and visual culture to generate new content. Performing this work demands knowledges beyond technical skills, and particularly rewards understanding of cultural history, visual culture, film studies, media studies, science and technology studies and the broader humanities.

References

Chesher, C. (2024). Invocational media: Reconceptualising the computer. Bloomsbury.

Chesher, C., & Albarrán-Torres, C. (2023). The emergence of autolography: the ‘magical’ invocation of images from text through AI. Media International Australia, 189(1), 57–73.

Deleuze, G. (1986). Cinema 1: The movement-image (H. Tomlinson & B. Habberjam, Trans.). University of Minnesota Press.

