


preface
It must be noted that the alignment space, being AI adjacent, is relatively new and still very research-oriented. This thesis therefore draws far more heavily on technical research than on industry statistics in making assumptions about what will succeed and what will emerge, since understanding the research will be crucial to identifying success.
In fact, much of what I argue is that because so much of it is research-oriented, very technical startups that initially prioritize research over commoditization will succeed in the long run. New discoveries mean new opportunities and new ways to neutralize threats, which inherently means new businesses.
I also will not argue that the research produced by startups will have significant impacts on the ML space as a whole. There is a tradeoff between commercialization and impact here. While startups in the alignment space seem to toy with the balance, in the end, you simply cannot open-source everything.
the state of human <> AI alignment/safety
I’ll introduce two widely referenced/used principles for alignment training:
Constitutional AI (CAI)
Encodes training goals in a simple, transparent principles document (a constitution). It works in two stages:
- Supervised learning stage: generate responses to harmful prompts with the help of an AI assistant, have the model critique and revise those responses against the constitution, and fine-tune the model on the revised responses.
- Reinforcement learning stage: generate response pairs, have the “constitutionalized” model evaluate which is better per the constitution, train a preference model on those labels, and use RL with the preference model as the reward signal.
Reinforcement Learning from Human Feedback (RLHF)
Developed largely at OpenAI and popularized through model training there, the process is similar in spirit to CAI. A shortened explanation for brevity:
- Build a reward model by collecting human preference labels over pairs of model responses, then
- Use that reward model to grade the agent’s responses to various (including harmful) inputs and optimize the policy against it with RL. A minimal sketch of the reward-model step follows below.
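To make that reward-model step concrete, here is a minimal, illustrative sketch - not OpenAI’s or Anthropic’s actual pipeline - of training a preference/reward model with a Bradley-Terry-style pairwise loss; the embeddings are random placeholders standing in for a frozen LM’s encodings of (prompt, response) pairs.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Toy reward model: maps a (prompt + response) embedding to a scalar score."""
    def __init__(self, emb_dim: int = 768):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(emb_dim, 256), nn.ReLU(), nn.Linear(256, 1))

    def forward(self, emb: torch.Tensor) -> torch.Tensor:
        return self.score(emb).squeeze(-1)  # one scalar reward per example

reward_model = RewardModel()
opt = torch.optim.Adam(reward_model.parameters(), lr=1e-4)

# Placeholder embeddings for the human-preferred ("chosen") and rejected responses.
chosen_emb = torch.randn(32, 768)
rejected_emb = torch.randn(32, 768)

# Pairwise (Bradley-Terry) loss: push r(chosen) above r(rejected).
r_chosen, r_rejected = reward_model(chosen_emb), reward_model(rejected_emb)
loss = -F.logsigmoid(r_chosen - r_rejected).mean()

opt.zero_grad()
loss.backward()
opt.step()
# The trained reward model then acts as the reward signal for the RL stage (e.g. PPO).
```

In CAI the same loop applies, except the preference labels come from the “constitutionalized” model rather than from humans.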
Yet, many challenges still arise.
Casper et al. (2023)3 point out some key problems with these existing frameworks that prominent frontier models are trained on. These include, but are not limited to:

- Reward hacking: models playing along with alignment training to maximize rewards, plus the difficulty of optimizing reward models themselves.
- Human feedback issues: feedback can be biased, and the concerns largely surround the scalability of human oversight (what happens when models become too capable for humans to meaningfully evaluate?).
- Policy optimization: a policy is typically a function mapping state-action pairs to probabilities (π: S × A → [0,1]), with the optimal policy being the one that maximizes cumulative reward over time. However, learned policies are susceptible to adversarial attacks, struggle when deployed in unfamiliar environments, and can suffer mode collapse (producing output that maximizes reward at the expense of diversity). A toy illustration of the notation follows below.
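For readers unfamiliar with the notation, here is a toy sketch of a policy as a table of action probabilities; the states, actions, and numbers are made up for illustration.

```python
import numpy as np

# A tabular softmax policy pi(a | s): rows are states, columns are actions.
# Each row is a probability distribution over actions, so pi: S x A -> [0, 1].
logits = np.array([[2.0, 0.5, -1.0],
                   [0.1, 0.1, 0.1]])
policy = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
print(policy.round(3))

# The RL objective is to maximize the expected discounted return
# J(pi) = E[ sum_t gamma^t * r_t ]; mode collapse shows up when a row of this
# table concentrates nearly all of its mass on a single high-reward action.
```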
This implies that RLHF/CAI are pieces in a much larger puzzle of a suite of methodologies.
Anthropic, which was founded on the core tenet of a “safer OpenAI”, highlights the scope of the problem quite well in their “Core Views on AI Safety”4 statement. View my full notes on the statement here.
Essentially they’ve bucketed the future of alignment into:
- Alignment capability: developing new algorithms for training AI systems to be more helpful.
- Examples include debate, scaling automated red teaming, constitutional AI, debiasing, and RLHF - essentially blue teaming. Sometimes, new algorithms won’t be effective until AI systems become more capable (which I will talk more about later).
- Alignment science: evaluating and understanding whether AI systems are really aligned, and whether we can extrapolate techniques to broader AI systems. This includes:
- Mechanistic interpretability - filling the gaps left by other alignment research. Anthropic touches on the point of reward hacking here, and they are trying to reverse engineer NNs (neural networks) into human-understandable algorithms, to eventually develop a systematic way of auditing models in the same way code reviews work.
- Studying generalizations in LLMs using influence functions, and how they arise from permutations in data.
- As a commentary, it is worth noting that LLMs are just massive probability formulas and pattern matchers - alignment science has the potential to be limited by this fact. It could simply be a defensive game, but it’s something for time and research to prove.
A core point that Anthropic and other researchers emphasize: as models become more robust, they can do more harm. Many techniques within alignment capability demonstrate increasing returns to robustness, i.e. as models become more advanced, the algorithms and methodologies become more and more useful.
An example is provided by a study titled Bijection Learning5 by Haize Labs, a Soma portco. It was a red teaming experiment that involved creating various advanced ciphers, teaching the model this new “language”, and using it to elicit harmful responses. Most interesting, though, was the demonstrated relationship between how advanced the ciphers were, their ASR (attack success rate), and model robustness:
You have a pre-emergence regime (the flat part of the curve), where the fixed size is too small - i.e. the cipher is too hard - for the model to understand, with 0% ASR and a random (25%) score on the MMLU (since it’s multiple choice). The scaling regime is the increasing portion of the curve: as model capacity under the cipher is restored and MMLU scores rise, responses to jailbreak attempts become less garbled and more harmful. The right, decreasing side of the curve is the saturated regime, where the fixed size of the bijection increases, model capacity returns to 100% of the native MMLU score, and more refusals of harmful prompts emerge.

Essentially, the sweet spot occurs at higher ASR and higher with-jailbreak capacity as we look at more capable models - meaning the stronger the model gets, the more effective bijection learning becomes. Refer to reference 5.
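As a toy illustration only (not Haize’s actual method or cipher scheme), a bijection-style attack starts from a reversible mapping over the alphabet, teaches it to the model, and then poses the harmful request in that encoded “language”:

```python
import random
import string

random.seed(0)

# Build a random bijection over lowercase letters (a fixed-size "cipher").
letters = list(string.ascii_lowercase)
shuffled = letters[:]
random.shuffle(shuffled)
encode_map = dict(zip(letters, shuffled))
decode_map = {v: k for k, v in encode_map.items()}

def encode(text: str) -> str:
    return "".join(encode_map.get(c, c) for c in text.lower())

def decode(text: str) -> str:
    return "".join(decode_map.get(c, c) for c in text.lower())

prompt = "example harmful request"
ciphered = encode(prompt)
# In the attack, the model is first taught the mapping (e.g. via in-context
# examples), then asked to answer the ciphered prompt in the ciphered language,
# sidestepping refusals learned on plain-text harmful prompts.
assert decode(ciphered) == prompt
print(ciphered)
```

The harder the mapping is to learn, the more capable the model has to be before the attack starts working - which is exactly the capability-scaling effect described above.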
In conclusion: as LLMs scale, more opportunities will arise for startups to exploit in the alignment space, and at the same time new developments will render old research ineffective. In a way, we’re looking at Moore’s Law and its implications, but through the lens of orders of magnitude. The result is an industry with a large technical moat, where many sub-spaces become winner-takes-all if you manage to research novel methodologies. Another key point about new research is that it is in line with the “alignment science” described by Anthropic - extrapolating findings to different applications.
For example, Harmbench6 is a standardized evaluation framework designed to assess automated red teaming methods and improve the safety of LLMs. It covers four main categories - standard behaviors, contextual behaviors, copyright behaviors, and multimodal behaviors - with a total of 510 behaviors, and acts as a fast, scalable method for evaluating attacks. It currently exists as an open-source repo, accessible here. The key finding is that no single defense mechanism is entirely effective, which means we will soon be able to concretely identify more areas where further improvement is needed. In summary, opportunity grows along a plane: across methodologies and across time.
In terms of analyzing this space’s capability for investment, it’s interesting to note that both Anthropic and OpenAI outsource a lot of this work to smaller companies/startups.
Anthropic has created an initiative7 that is essentially a YC for alignment startups/researchers. They highlight a list of things they want to see in three areas: AI safety level assessments, advanced capability and safety metrics, and advanced tools for developing evaluations. Although we should be wary that oAI and such companies will likely have the capability to kill offerings like this (comparable to if Netflix were to outsource their recommendation algorithm), it does still prove the need for such research.
Hence: so long as you can transform research into commercializable products, you have an unbounded suite of tools that people will be using in the near future. But you still need the research. We are entering an age of “technicality”, defined by research prowess rather than engineering.
why do we even care about alignment in the first place?
Aside from the obvious (we do not want a Skynet situation), AI is critical for humanity’s future. It is still debatable whether it has been successful in achieving that vast vision (LLMs are not AGI, and we have yet to see vast deployments of agents that make significant contributions to humanity) - i.e., we have a long way to go. Yet, we are already encountering some non-technical roadblocks:
2,851 miles. That is the distance between SV and DC. Summarizing Bill Gurley’s amazing talk on regulatory capture:
- The more regulation in an industry, the less likely that industry is to grow. Gurley walks through various examples - e.g. Epic Systems and the HITECH Act, which led to significant market dominance for Epic and higher healthcare costs. The reason SV has been so successful and so impactful to society is that DC has left it relatively untouched, aside from some Lina Khan actions here and there.
Regulation stems from fear plus lobbying. And, unfortunately, DC simply does not understand SV, and vice versa. This piece8 sent to Soma Reads explained it well: “It’s simply hard to create effective, complementary policy that guides tech innovation in a responsible yet innovative direction, especially in the newest generative AI space, when policymakers are tapping the glass from the outside, passively observing and reacting to systems they don’t fully understand.”
And the reality is that AI is inherently a fearful space. Spaces that people - including lawmakers - are afraid of tend to end up regulated. A very important case study is California SB-1047 (the Safe and Secure Innovation for Frontier Artificial Intelligence Models Act), which was vetoed. I won’t deep dive into the actual details of the act, but critics argued that it would stifle innovation for smaller companies and open-source projects, that it focused too heavily on large-scale models while ignoring the trend toward smaller, more focused models (agents), and that it even proposed a “kill-switch” mechanism that is straight-up impractical in development.
Nearly every tech company, big and small, heavily opposed the bill, which demonstrates a deep misunderstanding - or a gap in understanding at the very least - of the AI space as a whole.
A large portion of the job for people developing AI systems - not just LLMs, but agents, LAMs, etc. - will be convincing the general public and lawmakers that they are developing ethically. Persuasion revolves around logos and pathos: you cannot convince anyone of this unless you are actually doing it and have evidence that you are.
In a pure hypothetical, say a bank deploys an agent that deals with hyper-sensitive information, like SSNs. How do you convince customers that it’s safe? You take your niche group of believers, and over time, prove that there are no flaws in the system, until it is adopted by the masses (an ideology of Clayton Christensen’s Disruptive Innovation9 framework).
And how do you prove that it’s flawless? Through rigorous testing, in all stages, on all fronts, from multiple angles. Until you can prove with 100% efficacy that it does what it’s supposed to do.
This begins to highlight the need for a rapid expansion of the AI alignment space, on a variety of fronts. Frontier LLM developers have been recognizing this. Arguably, AI-adjacent industries are as critical to the success of AI as AI itself. This also happens to be the answer to, “Why are you building in this space?”
the importance of deep research moats:
The landscape is changing. Alignment, and almost every AI-adjacent space, favors the research-heavy: those pursuing deeptech-like levels of academic rigor in building their products.
Adjacent spaces are infrastructure, out of both technical and perceptive necessity. A company like Etched, which is building at deeptech levels in the AI hardware space, is tackling a technical problem of hardware efficiency. In order for AI to scale beyond what exists today without spending unreasonable amounts of money on compute, hardware infrastructure must scale first - if they succeed in actually building their product, you can increase compute by orders of magnitude, which leads to significantly more robust models.
On the other hand, alignment companies are perceptive infrastructure; although you can continue to build frontier models without alignment considerations, essentially everyone refuses to do so because of the implied and real negative consequences. At the surface level, you can build platforms that apply existing discoveries and succeed in the short term. But to have a company with unicorn potential, you need to approach it from Etched’s perspective - build something that you don’t yet know is possible to build.
Essentially, you can only make a real impact if it extends into the unknown. And the only way to do that is through research. Maybe, as of now, Arxiv isn’t populated with alignment research produced by startups - and I’m not suggesting that this will change, because for these kinds of startups, research is the biggest moat.
On that point though, it’s extremely interesting to see the open-source nature of many of the highest-ceiling companies currently in the space. Some have even gone to the extent of building a base product entirely in the open, publicizing many research findings, and then extending it into a pro version for revenue. Arguably the #1 red teaming startup and the #1 guardrail startup both do this, which is telling.
- Perhaps it has something to do with the concept of perceptive infrastructure; in the same way that Anthropic is a PBC, you may need to provide some kind of social benefit to establish credibility - essentially doing the duty discussed in the “why do we even care about alignment” section. But that’s not the main takeaway.
As we have seen, companies that can already be deemed successful (most lack revenue; this is based solely on valuation) do have deep research moats. Perhaps their impact on the frontier AI space won’t be as significant as findings published by researchers in a lab, but that is the tradeoff for commercialization - a balance which, as noted above, is itself shifting. I expect this trend to continue.
And yes, research is typically produced in well-funded labs. Which furthers the need to fund ambitious geniuses doing commercialized research in alignment.
the fronts that demonstrate the most promise:
Highlighted in chronological order of the model building phases, these are the types of startups that will see the most success in the alignment space:
Synthetic Data
It is known that models tend to develop behaviors. According to Anthropic, this is how they arise:
LLMs are pre-trained on raw text to observe patterns and permutations, then fine-tuned in hyperparameterized ways that could end up having unexpected consequences.
- Since fine-tuning is hyperparameterized, the learned model depends on the implicit bias of the pretraining stage, which arises from a large web of representations built up from pretraining on a large fraction of the world’s knowledge.
- Essentially, when a model behaves in a way that it shouldn’t, how do you trace back the root of the problem? Is it baked in from the pretraining phase, such that it will apply across all contexts, or is it just a regurgitation of a specific training sequence? One of the ways to answer these questions is to make training data very traceable - and this is where synthetic data will likely sit in the alignment space.
Alignment in the Age of Synthetic Data10 highlights that “jailbreaks simply aim to move the model out of the aligned distribution it has been trained on during a short finetuning phase, which is often fairly superficial. However, if examples of such knowledge and behavior have never been seen, there is no latent capability to ‘unlock’.”
The value here is that with synthetic datasets you can theoretically (it remains unseen, due to a lack of quality) build models that we have complete control over from the pretraining phase, because we control the entire dataset they were pre-trained on. There are various methods for generating synthetic data (a minimal sketch of the first follows the list):
- Statistical Models: Use techniques like Gaussian distributions, mixture models, or Markov chains to generate data that follows similar statistical properties as the original data
- Generative Adversarial Networks (GANs): Consists of a generator and discriminator network. The generator creates synthetic data while the discriminator evaluates authenticity. Through adversarial training, GANs produce highly realistic synthetic data
- Variational Autoencoders (VAEs): Encode real data into a latent space and decode it back into synthetic data
- Data Augmentation: Applies transformations like rotation, translation, or noise addition to existing data. Commonly used for image and text data to increase the diversity of training datasets
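As a minimal sketch of the statistical-model approach - with made-up column semantics standing in for a real tabular dataset - fitting a Gaussian mixture to real records and sampling synthetic ones might look like:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(42)

# Stand-in for a real tabular dataset (e.g. transaction amount, account age in months).
real_data = np.column_stack([
    rng.lognormal(mean=3.0, sigma=1.0, size=5_000),   # amounts, skewed
    rng.normal(loc=36.0, scale=12.0, size=5_000),     # account age
])

# Fit a mixture model to capture the joint statistical structure...
gmm = GaussianMixture(n_components=5, random_state=0).fit(real_data)

# ...then sample brand-new synthetic records from it.
synthetic_data, _ = gmm.sample(5_000)

# Basic fidelity check: compare means of real vs. synthetic columns.
print(real_data.mean(axis=0), synthetic_data.mean(axis=0))
```

GANs and VAEs follow the same fit-then-sample pattern, just with learned neural generators instead of a closed-form mixture.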
However, in the context of synthetic data, it’s worth noting that the core component isn’t the generation of the data itself, but its quality. Understanding SGD (stochastic gradient descent) and heavy-tailed distributions within the context of specific industries is crucial to implementing proper methodologies. Real-world datasets are extremely varied, i.e. what shows up as a heavy tail in finance will show up differently in healthcare.
Heavy-tailed data modeling’s relevance to alignment is that a 95% CI is awful for healthcare, and 99.9999% is necessary for high finance - outliers in the dataset can confuse the model, make it hallucinate, or generate inaccurate output, which can be quite catastrophic depending on why the user needed to query in the first place.
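To make the tail point concrete, here is a quick sketch (entirely synthetic numbers, not any real dataset) comparing how often a fitted Gaussian versus a heavy-tailed Student-t model expects the extreme events that heavy-tailed data actually contains:

```python
import numpy as np
from scipy import stats

# Pretend these are observed values (e.g. daily returns): mostly mild, occasionally extreme.
observed = stats.t(df=3).rvs(100_000, random_state=0)

threshold = 6.0  # an "extreme event" cutoff
empirical_rate = (np.abs(observed) > threshold).mean()

# Tail probability assigned to the same extremes by a fitted normal vs. a Student-t model.
normal_rate = 2 * stats.norm(loc=observed.mean(), scale=observed.std()).sf(threshold)
t_rate = 2 * stats.t(df=3).sf(threshold)

print(f"empirical: {empirical_rate:.2e}  normal model: {normal_rate:.2e}  t model: {t_rate:.2e}")
# A synthetic dataset sampled from the normal fit would almost never contain these
# outliers, so a model trained on it never learns how to handle the rare cases.
```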
Optimizing synthetic data to deal with these points is not only crucial but also different from the very fundamental approaches:
- Modeling datasets to account for such outliers allows you to produce synthetic datasets that better account for the 1 in a million scenarios at a more granular level.
- Due to the variation across industries, the techniques that best optimize synthetic datasets will vary - hence, there are tons of verticals within this space capable of being explored and generating venture-backable returns.
- For instance, data form matters more in certain industries (healthcare) whereas value/rule-based data matters more in others (finance) - thus those optimizing products for each space will find drastically different ICPs, resulting in no clear “dominating” technique.
Each industry can also apply different techniques based on the type of data - for example, in finance: masking account numbers vs. using DP (differential privacy) to aggregate transaction data without exposing sensitive content. A rough sketch of both follows below.
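Here is a rough, illustrative sketch of those two finance-flavored techniques - masking identifiers versus releasing a differentially private aggregate. The field names, clipping bound, and epsilon are all made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(7)

transactions = [
    {"account": "4539876512340001", "amount": 120.50},
    {"account": "4539876512340002", "amount": 89_000.00},  # an outlier
    {"account": "4539876512340003", "amount": 42.00},
]

# Technique 1: masking - keep only the last four digits of the account number.
masked = [{**t, "account": "*" * 12 + t["account"][-4:]} for t in transactions]

# Technique 2: differential privacy - release a noisy aggregate instead of raw rows.
# Laplace mechanism: noise scale = sensitivity / epsilon.
epsilon = 1.0
clip = 10_000.0                                   # clip amounts to bound sensitivity
clipped = np.clip([t["amount"] for t in transactions], 0, clip)
noisy_total = clipped.sum() + rng.laplace(scale=clip / epsilon)

print(masked[0]["account"], round(noisy_total, 2))
```

The tradeoff is the usual one: masking preserves row-level structure, while the DP aggregate hides individual rows entirely at the cost of added noise.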
There’s a lot to be said about the fundamental research - much of it is, well, fundamental, i.e. the techniques that get produced and used will be relatively static across the verticals that apply them. Hence companies must understand their niche to build long-lasting businesses. It also implies that it will be nearly impossible to have a general-purpose synthetic data startup, at least in the early stages.
In identifying what will succeed:
- Quality is essential across applications - if synthetic data can accurately replicate mathematical trends of real-world events (depending on industry), then it’s valid. As mentioned though, getting synthetic data to a near 100% accuracy is extraordinarily difficult.
- The core moat in this space is quite literally product effectiveness. For data scientists, the only thing that matters is how similar synthetic data is to real data. If a company doesn’t produce the best in its vertical, it will struggle to succeed.
- Then, due to the natural variance of data across industries, it’s nearly impossible for a synthetic data generator to be horizontal. Check Turing’s article11.
- Generating accurate synthetic data is not easy yet. Currently, very deep market maps can be made for the nichest of verticals for synthetic data players. Hence, in the race to “solve” the synthetic data problem, there will be many opportunities.
It’s also worth noting that many companies pursuing synthetic data aren’t doing it exclusively for alignment. It’s just a portion of a broader spectrum of reasons. But the concept of having control over unstructured data is pretty huge and can be preventative.
Guardrailing
Where guardrails stand today is highlighted quite well by Ayyamperumal et al. (2024)12. The paper outlines some of the risks for LLMs and a layered approach to protecting them using guardrails, proposing a Layered Protection Model with three levels: external, secondary, and internal. Key advancements include:
- Taking the problem of prompt obfuscation and recrafting prompts so that prompt injections never reach the model’s input.
- Identifying RAG (which is already established today) as a good approach because it grounds responses in trusted sources and reduces potentially misconstrued parameters from the training phase. Essentially, fine-tuning becomes a process of making sure the correct data is retrieved rather than focusing on the output itself.
- Various bias mitigation techniques:
- Diverse Training Data: Ensuring the training data includes a wide range of perspectives and sources to minimize inherent biases.
- Bias Quantification: Using tools like IBM's AI Fairness 360 to measure and quantify bias in LLMs.
- Post-processing Techniques: Implementing methods like re-ranking to adjust LLM outputs based on fairness criteria.
- Pre-processing Techniques: Applying re-weighting or re-sampling to training data to reduce the influence of biased examples (a small re-weighting sketch follows this list).
- Adversarial Training: Training the model to become less sensitive to demographic features by predicting the source of input data and minimizing differences between predictions.
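A minimal sketch of that pre-processing re-weighting idea - the group labels and examples are made up - where under-represented groups get proportionally larger sample weights so the training loss doesn’t simply echo the majority group:

```python
from collections import Counter

# Hypothetical training examples tagged with a group attribute.
examples = [
    {"text": "example 1", "group": "A"},
    {"text": "example 2", "group": "A"},
    {"text": "example 3", "group": "A"},
    {"text": "example 4", "group": "B"},
]

counts = Counter(ex["group"] for ex in examples)
n, k = len(examples), len(counts)

# Inverse-frequency weights: each group contributes equally to the loss overall.
weights = [n / (k * counts[ex["group"]]) for ex in examples]
print(weights)  # group A examples get ~0.67 each, the lone group B example gets 2.0
```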

Essentially how guardrails work to prevent prompt injections.
Guardrails’ immediate direction points to platforms that ease the integration of checks preventing LLMs from producing incorrect/harmful output, from the beginning of development. There’s lots of opportunity here; most existing input moderation APIs are bad, as highlighted by an experiment from Haize Labs13.
A key factor in the process is that we’re at a stage where the LLM has already been trained/built. Instead, “checkpoints” sit between the input and the output to (a) prevent harmful prompts from being given an output and (b) prevent output from being incorrect/dangerous.
Currently, a lot of it is being done through open-source - for example, Guardrails AI (https://github.com/guardrails-ai/guardrails ) has thousands of commits, and is widely used. But, like synthetic data, this is not a one-size-fits-all space.

Unlike synthetic data, guardrails are hyper-organization specific - while synthetic data fits industries, guardrails need to fit enterprises/organizations. Hence, the level of customization required means that there isn’t really a need for an industry-specific suite.
Regardless of this point, there are different types of guardrails, which include but are not limited to14:
- Rule-based string manipulation
- LLM-based metrics: applying metrics to embeddings and probabilities, via perplexity and semantics (a rough sketch of a perplexity check follows below);
- Perplexity: how much “gibberish” an input contains; high-perplexity prompts, which prevent the LLM from predicting a coherent sequence of words, are usually jailbreak attempts and can be rejected before processing
- Semantics: measuring similarity between the input and a target topic to guardrail against - an input with a high semantic similarity score (to, for example, terrorism) can be rejected
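A rough sketch of the perplexity-based check, using a small open reference model; the threshold is arbitrary and would need tuning on known-good traffic, and a real system would pair it with the semantic check described above:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    """Perplexity of the input under a small reference LM."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean cross-entropy per token
    return float(torch.exp(loss))

PPL_THRESHOLD = 500.0  # arbitrary cutoff; tune on known-good traffic

def allow(prompt: str) -> bool:
    # Gibberish-looking jailbreak payloads (token smuggling, adversarial suffixes)
    # tend to have very high perplexity and get rejected before reaching the main LLM.
    return perplexity(prompt) < PPL_THRESHOLD

print(allow("What is the capital of France?"))
print(allow("zx!!qv 0xFA ~~ drop refus@l vector ]]"))
```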
In identifying what will succeed - the value in building guardrail systems comes from agents. When companies/startups start building agents and you can offer guardrail infrastructure they can easily implement, you create a ton of value. With different types of guardrails, there are multiple niches to succeed in, and there will be differences between LLM guardrails and agent/LAM guardrails. It’s reasonable to assume the trajectory of the AI space as a whole (on which alignment, as an AI-adjacent space, depends) is agents. Hence, those building deployable guardrails for different types of agents will find great success. So long as guardrails are deployable independent of the training phase, it is fairly simple to standardize a product.
Evals/benchmarking
Benchmarking is inherently more science-based (which says a lot already), given the state of AI today (frontier LLMs are research-dominated) - and most benchmarks created today are left open source for open-source LLM devs.
Some of the work that has been done on benchmarking includes the aforementioned Harmbench6, which benchmarks red-teaming methods (not the models themselves, which is worth noting). What they showed:

Understanding the visual: ASR (attack success rate) is fairly consistent across model sizes within a family but varies considerably across families. Reference Harmbench6.
Model size is not a direct indicator of robustness; robustness depends more on training algorithms, highlighting the importance of model-level defenses. Harmbench6 argues that existing alignment methods like RLHF may need to be complemented by more intrinsic approaches, which led to R2D2 (Robust Refusal Dynamic Defense) - an adversarial training method that continually hardens the model’s own refusal behavior against strong attacks, aligning LLMs against adversarial inputs at the model level.
Benchmarking is incredibly important to research, which leads to a conflicting consideration: METR is a nonprofit in AI safety whose thesis is that evals for AI alignment need to be based on the researcher, and which has created a model/framework for measuring how well frontier models perform alignment research relative to humans. Some of their work includes:
- Re-bench - a benchmark for measuring the performance of humans and frontier model agents on ML research engineering tasks
- Autonomy evaluation resources - through a suite of tasks designed to test autonomous abilities at different difficulty levels. https://github.com/METR/task-standard - task repository to test autonomous capabilities.
Metr extends benchmarking beyond research results alone, to the point of measuring human researchers themselves. It’s quite insane. However, it is necessary to completely cover all phases of the alignment research process (starting from human bias). I will say that evaluation of methodologies/research doesn’t tend to be anything but open source, making it difficult to identify VC-backable success here. However, other types do exist:
Arize is a commercialized platform for companies to evaluate the models/agents they produce. Their framework involves tracing the flow of data and making sure agents follow the expected path, then evaluating task performance of LLMs through their platform. But, going back to the section on the importance of research, I would argue that this is more platform than discovery, which makes long-term success harder. So perhaps in this space there isn’t a clear winner yet.
All this is to say that the concept Anthropic mentioned earlier, alignment science, holds in research-heavy fields like this. Harmbench was run on 33 LLMs, but even still, LLMs are not the entire future. You will need to build model-level defenses for agents, multimodal models, etc. Can you extrapolate R2D2 to action models? Probably not directly - it was designed and tested around refusal behavior in LLMs, not action-taking systems. Metr already begins to answer this question of agent task evaluation, and Arize does it in a commercializable manner.
As models scale, we uncover more and more about them. It’s possible that when new things emerge, you have to completely redo something like R2D2. And when you have to extrapolate to more than 33 models, you will likely find that standardized benchmarks eventually become outdated. There is a need for persistent research plus people who are willing to work at the model level; if you can pick up this niche, you have a chance at success.
- What sets a good evaluation startup apart from a bad one? Again, research is the moat - platforms that just leverage existing open-source methodologies without building techniques of their own will likely not succeed long term. Uncovering holes in existing research and creating unique benchmarking systems that no one else can replicate will define a unique product offering.
Red teaming
It’s not unreasonable to assume that the concept of red-teaming (or very comparably, penetration testing) is going to look similar to the cybersecurity space: multiple approaches and consistently evolving alongside the internet. Without diving too deeply into the history of cybersecurity, from a venture perspective, there is no one clear winner in the cybersecurity space. New threats emerge, and new players rise to neutralize them, like a constant endless cycle in the best way possible. In the same vein, right now we’re working on automating red teaming for LLMs, soon we will need to work on automating red teaming for agents, and so forth.
The current main goal of red teaming is developing scalable oversight - addressing the point at which humans can no longer properly assess model safety because models have become too robust. In other words, figuring out how to “automate” this process by building complementary AI systems that supervise learning.
- Essentially, training a model to learn human values over time the same way it would learn other subjects.
- On top of that, larger models (parameters-wise) should have an easier time understanding human values due to robustness.
- This also proves the point that red teaming becomes more robust as models become more robust, as mentioned in the bijection learning example.
It’s not an unexplored space, either. Wang et al. (2023)15 provide a framework for Distribution-Aware Adversarial Attacks (DA3), since existing adversarial attack methods produce high ASRs but are easily detectable due to distribution shifts.
- Adversarial examples from existing attacks typically have lower confidence (Maximum Softmax Probability - a small detector sketch follows this list) and greater divergence from the training data distribution.
- Essentially, DA3 introduces a new evaluation metric, non-detectable ASR, measuring attack success that is resistant to common detection methods.
- This is unlike most of the other spaces because, despite the number of red-teaming methods that exist, you do need to cover a great many of them when completely red-teaming a model. This is simply a given.
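To illustrate the Maximum Softmax Probability signal that DA3-style attacks are designed to evade, here is a tiny detector sketch; the logits and threshold are made up:

```python
import numpy as np

def max_softmax_probability(logits: np.ndarray) -> float:
    z = logits - logits.max()              # numerical stability
    probs = np.exp(z) / np.exp(z).sum()
    return float(probs.max())

MSP_THRESHOLD = 0.7  # illustrative; set from clean validation data in practice

clean_logits = np.array([6.1, 0.4, -1.2])        # confident prediction
adversarial_logits = np.array([1.1, 0.9, 0.8])   # washed-out, low-confidence prediction

for name, logits in [("clean", clean_logits), ("adversarial", adversarial_logits)]:
    msp = max_softmax_probability(logits)
    print(f"{name}: MSP={msp:.2f}, flagged={msp < MSP_THRESHOLD}")

# DA3-style attacks aim to keep the MSP (and the distance to the training
# distribution) looking "clean" so this kind of detector no longer fires.
```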
But, like SaaS and almost everything else in the world, one entity can’t do everything best due to company focus, which could be why most frontier model producers currently outsource red teaming alongside in-house methods, for “perspective”. It’s reasonable to assume that since Anthropic is a PBC (and oAI is governed by a nonprofit), their intentions for outsourcing are possibly different from, say, FAANG; it’s not purely profit-driven. It also means that there can be many winners, so long as building your own LLMs becomes as commonplace as building your own software.
- Still a very research-oriented space as well - lots of red teaming concepts have been covered already in this article, especially in Harmbench.
- Startups like Haize, which are doing as much research as AI researchers themselves, will succeed - that is the only way to stay at the forefront of new methodologies as they emerge. These kinds of companies are the ones that will develop ways to consistently automate existing red teaming methodologies and uncover new ones, hence capturing a very significant market share.
Even so, there is always the risk that those building frontier models create the most scalable automated red teaming internally - frontier model builders will always be ahead, as robust models allow for more robust automation.
limitations
The biggest limitation to this hypothesis is that to develop novel solutions, you need increasingly robust models. In today’s state of AI, that means LLMs.
Anthropic has a major advantage in that it has access to the most advanced models before anyone else, because they’re the ones building them. “We trained the first version of our headline model, Claude, in the spring of 2022, and decided to prioritize using it for safety research rather than public deployments.” They have a substantial head start when it comes to actually doing the research.
This raises the question, will it be a startup problem to solve if theoretically, the most robust products/research will be spun out in-house? Will all alignment-related technologies have to be built in-house moving forward?
The thing with AI-adjacent spaces is that they’re dependent on the trajectory of AI itself. Fortunately, it appears that smaller models in the form of agents are taking a fast grip on the space. It’s too resource-intensive for companies with far less funding than oAI and Anthropic to also take on the alignment side of things; there comes a point where outsourcing is required.
Going back to the Anthropic program mentioned earlier, even companies making LLMs at the scale of Anthropic are still outsourcing, because alignment isn’t a solved problem at all. It is very experimental, which means the more perspectives the better. Given the shape of this, there will be a few winners that come from more humble beginnings.
Regardless, alignment for agents remains hot, diverging from the frontier-model norm. These are easy bets; for now, identifying real research beneath the commoditized offerings remains the underlying differentiator.
references
1. Bai, Y., Jones, A., Ndousse, K., Askell, A., Chen, A., DasSarma, N., ... & Kaplan, J. (2022). Constitutional AI: Harmlessness from AI Feedback. arXiv preprint arXiv:2212.08073.
2. Christiano, P. F., Leike, J., Brown, T., Martic, M., Legg, S., & Amodei, D. (2017). Deep reinforcement learning from human preferences. Advances in Neural Information Processing Systems, 30.
3. Casper, S., Hadfield-Menell, D., & Russell, S. (2023). Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback. arXiv preprint arXiv:2307.15217.
4. Anthropic. (n.d.). Core Views on AI Safety. Retrieved from https://www.anthropic.com/news/core-views-on-ai-safety
5. Haize Labs. (n.d.). Bijection Learning. Retrieved from https://blog.haizelabs.com/posts/bijection/
6. Mazeika, M., Phan, L., Yin, X., Zou, A., Wang, Z., Mu, N., Sakhaee, E., Li, N., Basart, S., Li, B., Forsyth, D., & Hendrycks, D. (2024). HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal. arXiv preprint arXiv:2402.04249.
7. Anthropic. (n.d.). A New Initiative for Developing Third-Party Model Evaluation. Retrieved from https://www.anthropic.com/news/a-new-initiative-for-developing-third-party-model-evaluations
8. Wang, C. (n.d.). Two Essential Ecosystems in Our Country Vastly Misunderstand Each Other at a Terrible Price. Retrieved from https://carolynwangjy.medium.com/two-essential-ecosystems-in-our-country-vastly-misunderstand-each-other-at-a-terrible-price-f6816959dcc3
9. Harvard Business School Online. (n.d.). 4 Keys to Understanding Clayton Christensen's Theory of Disruptive Innovation. Retrieved from https://online.hbs.edu/blog/post/4-keys-to-understanding-clayton-christensens-theory-of-disruptive-innovation
10. Beren, A. (2024, May 11). Alignment in the Age of Synthetic Data. Retrieved from https://www.beren.io/2024-05-11-Alignment-in-the-Age-of-Synthetic-Data/
11. Turing. (n.d.). Synthetic Data Generation Techniques. Retrieved from https://www.turing.com/kb/synthetic-data-generation-techniques#what-is-synthetic-data
12. Ayyamperumal, S. G., & Ge, L. (2024). Current state of LLM Risks and AI Guardrails. arXiv preprint arXiv:2406.12934.
13. Haize Labs. (n.d.). Content Moderation APIs are Bad. Retrieved from https://blog.haizelabs.com/posts/content-moderation-apis-are-bad/
14. ML6. (n.d.). The Landscape of LLM Guardrails: Intervention Levels and Techniques. Retrieved from https://www.ml6.eu/blogpost/the-landscape-of-llm-guardrails-intervention-levels-and-techniques
15. Wang, Y., Dong, X., Caverlee, J., & Yu, P. S. (2023). DA3: A Distribution-Aware Adversarial Attack against Language Models. arXiv preprint arXiv:2311.08598v3.