ARTS Check-in Day 6

A: 278. First Bad Version#

You are a product manager and currently leading a team to develop a new product. Unfortunately, the latest version of your product did not pass the quality check. Since each version is developed based on the previous version, all the versions after a bad version are also bad.
Suppose you have n versions [1, 2, ..., n] and you want to find out the first bad one, which causes all the following ones to be bad.
You can call a function bool isBadVersion(version) which will return whether version is bad. Implement a function to find the first bad version. You should minimize the number of calls to the API.
Example 1:
Input: n = 5, bad = 4
Output: 4
Explanation:
call isBadVersion(3) -> false
call isBadVersion(5) -> true
call isBadVersion(4) -> true
So, 4 is the first bad version.
Example 2:
Input: n = 1, bad = 1
Output: 1

```ts
var solution = function (isBadVersion: any) {
  return function (n: number): number {
    // Linear scan: check versions one by one until the first bad one is found.
    for (let i = 1; i <= n; i += 1) {
      if (isBadVersion(i)) {
        return i;
      }
    }
    return n;
  };
};
```

Submission Result:

Time Limit Exceeded
22/24 cases passed (N/A)

The logic is correct, but it is too slow. I will change it to binary search after the check-in.

According to the reference answer, a better solution using binary search is:

```ts
var solution = function (isBadVersion: any) {
  return function (n: number): number {
    let left = 1
    let right = n
    while (left <= right) {
      const middle = Math.floor((left + right) / 2)
      const bad = isBadVersion(middle) // call the API once per iteration
      if (bad && (middle === 1 || !isBadVersion(middle - 1))) {
        return middle // middle is bad and the previous version is good
      }
      bad ? (right = middle - 1) : (left = middle + 1)
    }
    return -1
  }
}
```

Binary search quickly narrows the search range when locating a position in an ordered sequence, reducing the number of isBadVersion calls from O(n) to O(log n).
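
As a side note, a common alternative formulation keeps the invariant "the first bad version is always inside [left, right]" and never needs the extra isBadVersion(middle - 1) call. This is a minimal sketch for comparison (solutionAlt is just an illustrative name), not part of the original submission:

```ts
// Lower-bound style binary search: each iteration calls the API exactly once
// and halves the range that can still contain the first bad version.
var solutionAlt = function (isBadVersion: any) {
  return function (n: number): number {
    let left = 1
    let right = n
    while (left < right) {
      const middle = left + Math.floor((right - left) / 2)
      if (isBadVersion(middle)) {
        right = middle // middle might be the first bad version, keep it
      } else {
        left = middle + 1 // the first bad version is strictly after middle
      }
    }
    return left // left === right: the first bad version
  }
}
```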

R: Open challenges in LLM research#

The author identifies 10 challenges that LLMs currently need to address, with hallucination and in-context learning being the most discussed topics. The author's own focus is on multimodality, new architectures, and developing GPU alternatives.

Reducing and measuring hallucinations#

Hallucination may be a feature in some creative scenarios, but for most use cases it is a bug. Mitigating and measuring hallucination is therefore a popular research direction. There are some interim techniques for reducing hallucination, such as adding more context to the prompt, chain-of-thought (CoT) prompting, and self-consistency; the article references and introduces these further.

Improving context length and context building#

Most questions need context to be answered well, because the model has to learn the relevant information from the context given in the prompt, a process called "in-context learning."

Context length is particularly important for Retrieval Augmented Generation (RAG). To make RAG work, two stages are needed: 1. Chunking: collect all the documents needed and store them in a vector database after chunking; 2. Querying: when a query is input, it is also embedded into vectors and compared with the data in the vector database for similarity retrieval.
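
To make the two stages concrete, here is a minimal TypeScript sketch of both sides. The embed() function is a hypothetical stand-in for an embedding model API, and the "vector database" is just an in-memory array; none of these names come from the article:

```ts
// Hypothetical embedding function: in a real system this would call an
// embedding model (e.g. via an API); here it is only assumed to exist.
declare function embed(text: string): number[];

interface Chunk {
  text: string;
  vector: number[];
}

// Stage 1 (chunking): split documents and store each chunk with its embedding.
function indexDocuments(documents: string[], chunkSize = 500): Chunk[] {
  const chunks: Chunk[] = [];
  for (const doc of documents) {
    for (let i = 0; i < doc.length; i += chunkSize) {
      const text = doc.slice(i, i + chunkSize);
      chunks.push({ text, vector: embed(text) });
    }
  }
  return chunks;
}

// Cosine similarity between two vectors.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i += 1) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Stage 2 (querying): embed the query and return the top-k most similar chunks,
// which are then placed into the LLM's context.
function retrieve(query: string, chunks: Chunk[], k = 3): Chunk[] {
  const queryVector = embed(query);
  return [...chunks]
    .sort((a, b) => cosine(b.vector, queryVector) - cosine(a.vector, queryVector))
    .slice(0, k);
}
```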

The longer the context length supported by LLM, the more relevant chunked texts can be included in the context, resulting in better generation results.

However, it is not necessarily better to include more content in the context. The model's capacity and processing efficiency should also be considered. Therefore, another parallel path is to optimize the prompt itself to make it easier for LLM to process, thereby improving efficiency. This path is called "Prompt Engineering" or prompt construction.
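
Continuing the sketch above, prompt construction here simply means deciding how the retrieved chunks and the instructions are laid out for the model. The template below is an illustrative choice, not a prescribed format:

```ts
// Assemble a prompt from retrieved chunks plus the user question.
// The layout (instructions first, then numbered context, then the question)
// is one common pattern; the exact wording is an illustrative assumption.
function buildPrompt(question: string, chunks: { text: string }[]): string {
  const context = chunks
    .map((chunk, i) => `[${i + 1}] ${chunk.text}`)
    .join('\n');
  return [
    'Answer the question using only the context below.',
    'If the context is not sufficient, say so instead of guessing.',
    '',
    `Context:\n${context}`,
    '',
    `Question: ${question}`,
  ].join('\n');
}
```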

Collaboration with other modalities#

The consideration of multimodality is necessary because many scenarios involve multimodal data. In addition, the leading LLM models have already made extensive use of text-related data, and further improvements require leveraging the value of multimodal data.

The author is particularly excited about the possibility of multimodal models enabling visually impaired individuals to better access the internet and the real world.

Making LLM faster and cheaper#

When GPT-3.5 was first released, there were concerns about its latency and cost. However, in just half a year, the community has been able to achieve the same performance with only 2% of the memory used by GPT-3.5. Several important techniques for model optimization and compression were written in the author's book many years ago: 1. Model quantization; 2. Knowledge distillation; 3. Low-rank factorization (not sure if it is the same as LoRA); 4. Model pruning. These techniques are still important and popular today.
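
As an illustration of the first technique, here is a minimal sketch of symmetric 8-bit weight quantization. The function names and the per-tensor scaling scheme are illustrative assumptions, not taken from the article or the author's book:

```ts
// Symmetric per-tensor int8 quantization: store weights as 8-bit integers
// plus one float scale, and dequantize back to floats when needed.
function quantizeInt8(weights: Float32Array): { q: Int8Array; scale: number } {
  let maxAbs = 0;
  for (let i = 0; i < weights.length; i += 1) {
    maxAbs = Math.max(maxAbs, Math.abs(weights[i]));
  }
  const scale = maxAbs / 127 || 1; // avoid division by zero for all-zero tensors
  const q = new Int8Array(weights.length);
  for (let i = 0; i < weights.length; i += 1) {
    q[i] = Math.max(-127, Math.min(127, Math.round(weights[i] / scale)));
  }
  return { q, scale };
}

function dequantizeInt8(q: Int8Array, scale: number): Float32Array {
  const out = new Float32Array(q.length);
  for (let i = 0; i < q.length; i += 1) {
    out[i] = q[i] * scale; // approximate reconstruction of the original weight
  }
  return out;
}
```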

Designing new model architectures#

The Transformer architecture has been around since 2017, and how long it can continue to dominate is an open question.

Surpassing the Transformer, which has been continuously optimized for six years, is not easy: a new architecture has to address today's concerns about scale and the hardware it runs on. Transformers were originally designed to run fast on Google's TPUs and were only later optimized for GPUs.

Developing GPU alternatives#

Since AlexNet in 2012, GPUs have become the dominant hardware in the field of deep learning.

The scarcity of GPU resources is widely felt, so in the past decade, some companies have attempted to create new hardware for AI, such as Google's TPU, Graphcore's IPU, and expectations for quantum computing and exploration of photonic chips.

Making agents truly usable#

Agents are LLMs that can take actions, such as browsing the web and sending emails. Compared to other directions, this direction is relatively new.
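
As a rough illustration of what "an LLM that can take actions" means, here is a minimal agent-loop sketch. callLLM and the two tools are hypothetical stand-ins, not the API of Auto-GPT, GPT-Engineer, or any specific framework:

```ts
// Hypothetical LLM call: given a prompt, returns the model's text output.
declare function callLLM(prompt: string): Promise<string>;

// Hypothetical tools the agent is allowed to use.
const tools: Record<string, (arg: string) => Promise<string>> = {
  browse: async (url) => `contents of ${url}`,      // stand-in for a web browser
  sendEmail: async (body) => `sent email: ${body}`, // stand-in for an email client
};

// Minimal agent loop: the LLM either picks a tool to call or gives a final answer.
async function runAgent(task: string, maxSteps = 5): Promise<string> {
  let scratchpad = `Task: ${task}\n`;
  for (let step = 0; step < maxSteps; step += 1) {
    const reply = await callLLM(
      `${scratchpad}\nRespond with "TOOL <name> <argument>" or "FINAL <answer>".`
    );
    if (reply.startsWith('FINAL ')) {
      return reply.slice('FINAL '.length);
    }
    const [, name, ...rest] = reply.split(' ');
    const tool = tools[name];
    const observation = tool ? await tool(rest.join(' ')) : `unknown tool: ${name}`;
    scratchpad += `Action: ${reply}\nObservation: ${observation}\n`;
  }
  return 'No answer within the step limit.';
}
```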

Because the direction is new, people are very enthusiastic about it: Auto-GPT is the 25th most popular repository on GitHub, and GPT-Engineer is also very popular.

Although people are enthusiastic, there is still a considerable amount of skepticism about whether LLM is reliable and can be trusted to handle actions.

A notable recent case is using LLMs for social-science research. Stanford ran an experiment in which one agent was seeded with the intent to host a Valentine's Day party; over the next two simulated days, the agents autonomously spread the invitations and made new friends.

A well-known company in this space is Adept, which demonstrated last year how to make an agent browse the web and add a new account in Salesforce.

Learning from human preferences#

RLHF (Reinforcement Learning from Human Feedback) is a good technique for aligning models with human preferences, but it is somewhat hacky. The author believes that better methods can be found to align models with human preferences.

Some problems with RLHF include:

  1. How do we quantitatively represent human preferences?
    Currently, human preferences are determined by comparison: annotators manually label which of two responses is better, but not how much better (a common way to turn such comparisons into a scalar reward is sketched after this list).
  2. What are human preferences?
    Anthropic measures its models against the three Hs (Helpful, Honest, Harmless), and DeepMind tries to generate responses that satisfy the majority of people.
    What kind of model do we actually want: one that can take a position, or one that avoids controversial topics?
  3. Whose preferences are "human preferences," considering cultural, regional, and political factors?
    It is difficult to obtain training data that represents the preferences of all potential users. For example, OpenAI did not hire annotators over the age of 65, and the annotators were mainly from the Philippines and Bangladesh.
    Community-driven data may still be biased, such as the OpenAssistant dataset, where 90.5% of respondents were male.
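
On the first point, the standard RLHF recipe turns those pairwise comparisons into a scalar signal by fitting a reward model with a Bradley-Terry style loss. The sketch below shows that loss for a single comparison; the reward scores are assumed to come from some scoring model:

```ts
// Pairwise reward-model loss for one human comparison (Bradley-Terry style):
// the labeler only says which response is better, and the loss pushes the
// reward of the chosen response above the reward of the rejected one.
function sigmoid(x: number): number {
  return 1 / (1 + Math.exp(-x));
}

// rewardChosen / rewardRejected are scalar scores from a reward model
// (assumed to exist); the loss is -log sigmoid(r_chosen - r_rejected).
function pairwiseRewardLoss(rewardChosen: number, rewardRejected: number): number {
  return -Math.log(sigmoid(rewardChosen - rewardRejected));
}

// Example: the loss shrinks as the chosen response's reward pulls ahead,
// even though the label itself never said "how much better" it was.
console.log(pairwiseRewardLoss(2.0, 1.5)); // ~0.47
console.log(pairwiseRewardLoss(3.0, 0.5)); // ~0.08
```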

Improving the efficiency of LLM dialogue interfaces#

Since ChatGPT, there have been discussions about what a suitable dialogue interface for a wide range of tasks should look like.

However, this is not a new discussion. In many countries, especially in Asian countries, chat interfaces have been used as the entry point for super applications for ten years.

In 2016, there were discussions suggesting that applications were dead and chatbots were the future.

The author likes chatbot interfaces for three reasons:

  1. Chatbot interfaces are easy to learn: even people who have never interacted with computers before can pick them up quickly.
  2. Chatbot interfaces are easy to interact with; if your hands are busy, voice input also works.
  3. Chatbot interfaces are robust: you can send them any request.

However, the author believes that there are also areas for improvement in chatbot interfaces:

  1. Only one message can be input per turn.
    This is not how we message friends. Sometimes our input is segmented, multi-type (images, locations, links, etc.), or we simply don't want to input a long paragraph.
  2. Multimodal input.
    In terms of multimodality, most of the effort has been focused on building better models, with little effort put into building better user interfaces.
  3. Incorporating generative AI into your workflow.
    For example, if you want to ask a question about how to handle a column in a chart you are working on, you should be able to ask the question directly to that column.
  4. Editing and deleting messages.
    If a message in the conversation is edited or deleted, how should the rest of the conversation be updated?

Building LLM for non-English languages#

The current English-centric LLMs do not work well in terms of performance, latency, and speed on other languages.

Efforts have been made in other languages, such as Symato's efforts in Vietnamese. However, some people believe that this direction is not meaningful for the following reasons:

  1. It is more of a resource allocation problem rather than a research problem. We already know how to do it, but there is a lack of resources invested in other languages, even if data is available.
  2. More pessimistic people believe that multilingualism will disappear in the future, and there will only be English and Mandarin on the Internet.

The impact of LLMs on language learning is still unclear: will they help people learn new languages faster, or will they remove the need to learn new languages altogether?

T: Notepal - Synchronized Reading Notes#

This is a browser extension that can synchronize reading notes from WeChat Read to other software.

S: The Copper Rule#

When evaluating whether something is worth doing and how much effort to put into it, we should abandon the single-perspective view and weigh it along two different dimensions. One is how much benefit the event brings me (cognitive, emotional, material, and physical), called the "benefit value." The other is how quickly that benefit decays over time, called the "benefit half-life." Events with a longer half-life influence us for longer and more significantly.


Reference:

  1. ARTS Challenge