
Updates from Overleaf
yamanksingla committed Nov 1, 2024
1 parent ffba281 commit 94d19cf
Showing 7 changed files with 171 additions and 51 deletions.
15 changes: 0 additions & 15 deletions Conclusion.tex
@@ -1,23 +1,8 @@
%\addcontentsline{toc}{chapter}{Conclusion and an Outlook for Future Work}
\chapter{Conclusion And An Outlook For Future Work}
\label{chapter:conclusion}
Communication, in the form of messages, symbols, and culture, is omnipresent.
Arguably, the emergence of communication has been cited as the most recent and most impactful evolutionary transition in the history of life on earth \cite{smith1997major}. It enables cooperation, coordination, information transmission, and culture. Language is unique in being a system that supports unlimited heredity of cultural information, allowing our species to develop a unique kind of open-ended adaptability. Although this feature of language as a carrier of cultural information


three waves:
fourth century BCE Ancient Greece
The early modern period was another epoch of intensified interest, with the rise of print, the Reformation, technoscience, and the colonial pursuits of Europe. The middle of the twentieth century was still another, when propaganda studies, argumentation studies, and the New Rhetoric all arose from the trauma of the Second World War.
We are in another such period

long history of use and study in legal defense and prosecution, scientific discourse, songs, tropes, examples,

In short, the goal is still, as it was for Aristotle, identifying “the available means
of persuasion” (Rhetoric 1355b10–12).


In this thesis, I covered




4 changes: 2 additions & 2 deletions README.md
@@ -1,5 +1,5 @@
# Behavior As A Modality
Yaman K Singla's (work-in-progress) PhD Thesis

The thesis document is available in the PDF format here: [main.pdf](/main.pdf).
Thoughts and questions are welcome. Feel free to email me at [yamank@iiitd.ac.in](mailto:yamank@iiitd.ac.in).
The thesis document is available in PDF format here: [main.pdf](/main.pdf). Click the download button if the PDF does not render on GitHub.
Thoughts and questions are welcome. Feel free to email me at [yamankum@buffalo.edu](mailto:yamankum@buffalo.edu).
28 changes: 28 additions & 0 deletions abstract.tex
@@ -0,0 +1,28 @@
\addcontentsline{toc}{chapter}{Abstract}
\chapter*{Abstract}

Communication, as a system of messages, symbols, and cultural exchanges, is ubiquitous across all species. Scholars have argued that communication represents one of the most transformative evolutionary transitions in life's history \cite{smith1997major}, alongside pivotal developments like chromosomal mechanisms, eukaryotic formation, sexual reproduction, and multicellular life. Its unique capacity to enable cooperation and facilitate the unlimited transmission of cultural information grants species an unprecedented form of adaptive flexibility \cite{kirby2008cumulative}.

Because of the critical role communication plays in the survival and advancement of species, it has been studied since ancient times. The earliest known work on communication, \textit{Precepts} by Ptah-Hotep, appeared more than 4,500 years ago ($\sim$2300 BCE) \cite{gray1946precepts}. Since then, communication has seen three distinct waves of intensified interest: the first in Ancient Greece, where thinkers such as Aristotle, Plato, and Isocrates produced seminal works like \textit{Rhetoric}, \textit{Phaedrus}, and \textit{Antidosis} \cite{hackforth1972plato,rapp2002aristotle,norlin1928isocrates}; the second with the rise of print, the Reformation, the Renaissance, and the European colonial pursuits \cite{mack2011history}; and the third, the most recent, around the Second World War \cite{brinol2012history}.
We currently stand at the cusp of a fourth such phase, precipitated not by political upheaval (such as the spread of democratic ideas or world war) or mechanical innovation (such as the printing press and the steam engine), but by the unprecedented accumulation of digital content and behavioral data. This data now serves as the foundation for developing large language and diffusion models, which hold transformative potential for behavioral scientific inquiry. We will show in this thesis that these tools, while still in their infancy, have the potential to solve many problems considered ambitious in the behavioral sciences.


Communication is composed of seven modalities: the communicator, message, channel, time of receipt, receiver, time of behavior, and receiver's behavior \cite{shannon-weaver-1949,lasswell1948structure,lasswell1971propaganda}. Critically, each turn's behavior becomes the subsequent turn's message, rendering communication a strategic interaction between sender and receiver aimed at optimizing shared or individual objectives \cite{smith2003animal}. Settings such as legal defense and prosecution, scientific discourse, mating, organizational communication, diplomacy, political propaganda, and culture (such as folk songs and maxims) each present different types of goals.


This thesis explores behavioral sciences' enduring mission—first articulated by Aristotle more than two millennia ago—of identifying and leveraging persuasive mechanisms \cite{rapp2002aristotle}. The field has traditionally bifurcated into two epistemological approaches: explanation and prediction. Historically, behavioral scientists have sought explanations that can provide interpretable causal mechanisms behind human and societal functioning. However, societies and humans do not lend themselves to clean-cut equations and formulas, as is evidenced by the limited success of behavioral explanations in predicting behavior. The emergence of extensive digital behavioral repositories has consequently shifted focus towards more robust predictive methodologies.


% We then advance to constructing generalized behavior models—Large Content and Behavior Models (LCBMs)—trained on extensive digital analytics repositories. These models aim to comprehend behavior holistically, unlike task-specific approaches. Our investigation critically examines large language models' limitations in addressing behavioral challenges, revealing that behavioral training data is often inadvertently filtered out as statistical noise. We demonstrate that reintegrating behavioral data not only restores models' behavioral capabilities but enables novel inferential approaches—such as deriving content insights through receiver behavioral responses. Finally, we pioneer content generation research across text and image domains, focusing on metrics of performance and engagement. This includes developing the first automated arena for benchmarking text-to-image model engagement potential, establishing new standards for evaluating and improving content generation systems.



In this thesis, we start with the more traditional approach of behavior explanation, where we cover persuasion strategies in advertising images and videos. We construct the largest set of generic persuasion strategies based on theoretical and empirical studies in marketing, social psychology, and machine learning literature. We introduce the first dataset for studying persuasion strategies in advertisements.

Next, we turn our attention to behavior prediction by constructing general behavior models. These models, similar to large language models, aim to understand behavior \textit{in general}, as opposed to being designed for a specific behavioral task. We use large repositories of digital analytics to train these models. The format of this data follows the general communication model consisting of the communicator, message, time of message, channel, receiver, time of effect, and effect. We call these models Large Content and Behavior Models (LCBMs). We further show that large language models, while used as general-purpose models for a variety of tasks across domains, are unable to solve behavioral problems. We investigate the reason and find that during LLM training, behavioral data is filtered out as noise, causing the models to lose their behavioral capabilities.
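To make the seven-factor data format concrete, here is a minimal sketch of how one communication turn might be serialized into training text for a behavior model. The class name, field names, and example values are our own illustrations, not the thesis's actual schema:

```python
from dataclasses import dataclass


@dataclass
class CommunicationSample:
    """One communication turn in the seven-factor format described above."""
    communicator: str    # sender identity or description
    message: str         # verbalized content (text, caption, transcript, ...)
    time_of_message: str
    channel: str         # e.g. "YouTube", "Twitter"
    receiver: str        # audience description
    time_of_effect: str
    effect: str          # observed receiver behavior (views, likes, ...)

    def to_training_text(self) -> str:
        # Flatten the sample into a single instruction-style string that a
        # language model could be fine-tuned on.
        return (
            f"Communicator: {self.communicator}\n"
            f"Message: {self.message}\n"
            f"Sent: {self.time_of_message} via {self.channel}\n"
            f"Receiver: {self.receiver}\n"
            f"Observed at {self.time_of_effect}: {self.effect}"
        )


sample = CommunicationSample(
    communicator="Brand X official channel",
    message="30-second ad for a new electric car",
    time_of_message="2023-05-01",
    channel="YouTube",
    receiver="Subscribers, ages 18-35",
    time_of_effect="2023-05-08",
    effect="120k views, 4.2k likes",
)
print(sample.to_training_text())
```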


We also show that reintroducing the behavioral training data leads to other positive side effects. Namely, since behavior is an aftereffect of content (the message), we can make inferences about content by looking at receiver behavior. For example, viewers' blood pressure or pupil dilation while watching the movie \textit{Jurassic Park} indicates the excitement level of different scenes. We show results for this hypothesis on more than 30 content understanding tasks across all four modalities: text, image, video, and audio.


Finally, we make initial strides towards solving the problem of generating performant content. We show this both for text, taking the illustrative case of the behavior of memorability, and for images, by generating images that are more engaging. We also develop mechanisms to measure the engagement potential of text-to-image generation models. We show that existing metrics for benchmarking the quality of text-to-image models are not correlated with engagement. We develop a model to measure the engagement potential of an image and release the first automated arena to benchmark the engagement of text-to-image models. We rank several popular text-to-image models on their ability to generate engaging images and further encourage the community to submit their models to the arena.
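The abstract does not specify how the arena converts pairwise engagement comparisons into a model ranking; as a hedged illustration only, such arenas commonly use an Elo-style rating. A minimal sketch, with invented model names and match outcomes:

```python
def elo_update(r_a: float, r_b: float, winner: str, k: float = 32.0):
    """Standard Elo update for one matchup; `winner` is 'a' or 'b'."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    score_a = 1.0 if winner == "a" else 0.0
    r_a_new = r_a + k * (score_a - expected_a)
    r_b_new = r_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return r_a_new, r_b_new


ratings = {"model_a": 1000.0, "model_b": 1000.0}
# Hypothetical matchups: the first-named model's image was judged more
# engaging for a given prompt (e.g. by an engagement-prediction model).
for winner, loser in [("model_a", "model_b"),
                      ("model_a", "model_b"),
                      ("model_b", "model_a")]:
    ratings[winner], ratings[loser] = elo_update(
        ratings[winner], ratings[loser], "a")

print(sorted(ratings, key=ratings.get, reverse=True))  # → ['model_a', 'model_b']
```

Note that Elo updates are zero-sum: the total rating mass is conserved across a matchup, which keeps the leaderboard comparable as new models enter.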
27 changes: 15 additions & 12 deletions chapter-explaining-behavior.tex
@@ -696,6 +696,21 @@ \subsubsection{Video Verbalization}
Using either of these methods, we obtain a set of frames that represent the events in the video. These frames are then processed by a pretrained BLIP-2 model \cite{li2023blip2}. The BLIP model facilitates scene understanding and verbalizes the scene by capturing its most salient aspects. We utilize two different prompts to extract salient information from the frames. The first prompt, ``\textit{Caption this image}", is used to generate a caption that describes what is happening in the image, providing an understanding of the scene. The second prompt, ``\textit{Can you tell the objects that are present in the image?}", helps identify and gather information about the objects depicted in each frame.



\begin{figure*}[!t]
\centering
\includegraphics[width=\textwidth]{images/example-stories.pdf}
\caption{An example of a story generated by the proposed pipeline along with the predicted outputs of the video-understanding tasks on the generated story. The generated story captures information across scenes, characters, event sequences, dialogues, emotions, and the environment. This helps the downstream models to get adequate information about the video to reason about it correctly. The original video can be watched at \url{https://youtu.be/_amwPjAcoC8}.}
\label{fig:example-story}
\end{figure*}


\textit{b. Textual elements in frames:} We also extract the textual information present in the frames, as text often reinforces the message present in a scene and can also inform viewers on what to expect next \cite{9578608}.
For the OCR module, we sample every 10th frame extracted at the native frames-per-second of the video, and these frames are sent to PP-OCR \cite{10.1145/2629489}. We filter the OCR text and use only the unique words for further processing.
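The sampling and filtering steps above can be sketched as follows. The function names are our own, and this is a simplified stand-in for the actual pipeline (which sends the sampled frames to PP-OCR):

```python
def sample_frame_indices(total_frames: int, stride: int = 10) -> list:
    """Indices of every `stride`-th frame (the pipeline samples every 10th
    frame extracted at the video's native frames-per-second)."""
    return list(range(0, total_frames, stride))


def unique_ocr_words(ocr_lines: list) -> list:
    """Keep only the unique words from OCR output (case-insensitive),
    preserving first-seen order, before further processing."""
    seen, out = set(), []
    for line in ocr_lines:
        for word in line.split():
            key = word.lower()
            if key not in seen:
                seen.add(key)
                out.append(word)
    return out


print(sample_frame_indices(45))   # → [0, 10, 20, 30, 40]
print(unique_ocr_words(["BIG SALE", "big sale today", "SALE ends"]))
# → ['BIG', 'SALE', 'today', 'ends']
```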

\noindent \textbf{3. Text Representation of Audio:} The next modality we utilize from the video is the audio content extracted from it. We employ an Automatic Speech Recognition (ASR) module to extract transcripts from the audio. Since the datasets we worked with involved YouTube videos, we utilized the YouTube API to extract the closed caption transcripts associated with those videos.


\begin{landscape}
\begin{figure*}
\includegraphics[width=1.5\textwidth]{images/verbalizing-marketing-graphics.pdf}
@@ -711,18 +726,6 @@ \subsubsection{Video Verbalization}




\noindent \textbf{4. Prompting:} We employ the aforementioned modules to extract textual representations of various modalities present in a video. This ensures that we capture the audio, visual, text, and outside knowledge aspects of the video. Once the raw text is collected and processed, we utilize it to prompt a generative language model in order to generate a coherent story that represents the video. To optimize the prompting process and enable the generation of more detailed stories, we remove similar frame captions and optical character recognition (OCR) outputs, thereby reducing the overall prompt size. %\cy{Do we simply concantenate all the text as the prompt? in what order?}
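The prompt-size reduction step (removing similar frame captions and OCR outputs) can be illustrated with a simple word-overlap filter. The actual similarity measure used in the pipeline is not specified here, so Jaccard similarity over word sets is an assumption:

```python
def jaccard(a: str, b: str) -> float:
    """Word-set Jaccard similarity between two captions."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    if not wa or not wb:
        return 0.0
    return len(wa & wb) / len(wa | wb)


def dedup_captions(captions: list, threshold: float = 0.7) -> list:
    """Keep a caption only if it is not too similar to any already-kept
    caption, shrinking the text passed to the generative model."""
    kept = []
    for cap in captions:
        if all(jaccard(cap, k) < threshold for k in kept):
            kept.append(cap)
    return kept


caps = [
    "a man driving a red car",
    "a man driving a red car on a road",   # near-duplicate, dropped
    "a close-up of a steering wheel",
]
print(dedup_captions(caps))
```

A greedy first-kept-wins pass like this is linear in the number of pairs and order-dependent; that is usually acceptable since consecutive frames produce the most similar captions.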
