
18.01.2023 | Blog

ChatGPT: The Future of Search?

ChatGPT is currently the talk of the town. Dr. Christoph Goller, Head of Research at search specialist IntraFind, has tried out ChatGPT and shares his thoughts on what it means for search engines.

Search engines are dead! That might be your impression when you see the impressive results of something like ChatGPT. Transformers change everything. The encoder part of a transformer (e.g. BERT) can be used directly for dense retrieval, and the decoder part (language models like GPT) can be used to generate natural-sounding answers. Does this really mean we no longer need classical full-text search based on the inverted file index? Are Lucene and Elasticsearch obsolete? I would rather say that the predictions about the imminent death of search engines are greatly exaggerated.
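
To illustrate what dense retrieval means in practice, here is a minimal sketch in Python. It assumes the Hugging Face transformers library and an off-the-shelf encoder whose name is only an example, not a recommendation: every passage and every query is mapped to a fixed-size vector, and retrieval becomes a nearest-neighbor search over those vectors.

# Minimal sketch of dense retrieval with a BERT-like encoder (illustrative only).
# Assumes the Hugging Face transformers library; the model name is just an example.
import torch
from transformers import AutoTokenizer, AutoModel

MODEL_NAME = "sentence-transformers/all-MiniLM-L6-v2"  # example encoder, not a recommendation
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
encoder = AutoModel.from_pretrained(MODEL_NAME)

def embed(texts):
    """Encode texts into dense vectors by mean-pooling the token embeddings."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**batch).last_hidden_state      # (batch, tokens, dims)
    mask = batch["attention_mask"].unsqueeze(-1)         # ignore padding tokens
    vectors = (hidden * mask).sum(1) / mask.sum(1)       # mean pooling
    return torch.nn.functional.normalize(vectors, dim=1) # unit length for cosine similarity

passages = ["Old Faithful is a geyser in Yellowstone National Park.",
            "Yosemite National Park is known for its granite cliffs and waterfalls."]
query = embed(["Which geysers can I visit?"])
scores = query @ embed(passages).T                       # cosine similarities
print(scores)  # retrieval = ranking passages by these scores

In a real system the passage vectors would be precomputed and stored in an approximate nearest-neighbor index rather than compared in a single matrix product, which leads directly to the scalability question below.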

Problem of Scalability

On the one hand, there is the problem of scalability. Dense retrieval means that for every sentence or paragraph of our documents we must store a dense vector with several hundred dimensions in a data structure that allows approximate nearest-neighbor search. In our experience, reaching the same performance as with classical keyword-based search engines is challenging, to say the least. The hardware costs for ChatGPT are estimated at $100,000 per day with only a few million users, most of whom do not use it as regularly as they use a search engine like Google. An enterprise search solution for a big international company based on ChatGPT would probably have similar hardware requirements, and it is unlikely that such costs would be acceptable.
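
To give a feeling for the numbers involved, here is a rough back-of-envelope calculation. The corpus size, paragraph count and vector dimensionality below are illustrative assumptions, not measurements from any concrete project.

# Rough back-of-envelope estimate of the vector store needed for dense retrieval.
# All numbers below are illustrative assumptions, not measurements.
num_documents   = 10_000_000   # a mid-sized enterprise corpus
paragraphs_each = 20           # dense retrieval indexes paragraphs, not whole documents
dimensions      = 768          # typical BERT-style embedding size
bytes_per_float = 4            # float32

vectors = num_documents * paragraphs_each
raw_bytes = vectors * dimensions * bytes_per_float
print(f"{vectors:,} vectors -> {raw_bytes / 1e9:.0f} GB of raw float32 embeddings")
# Roughly 614 GB before adding the approximate-nearest-neighbor index itself
# (HNSW graphs etc. add further overhead), whereas an inverted index for the
# same corpus is typically a fraction of that and can be served from much
# cheaper hardware.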

Supervised Training Required

Performance and hardware requirements are not the only obstacle. Transformers, trained on huge amounts of general text data in an unsupervised way, are not necessarily suited for representing content in special application domains. There is often domain-specific vocabulary which they cannot handle adequately. Extreme examples are company-specific error or product codes. So, it might be necessary to fine-tune them on specific customer data, which is computationally expensive. Furthermore, transformers have to be trained for the specific task they are used for, which is usually called transfer learning. It may come as a surprise, but this is supervised training, which requires manually labelled data. For dense retrieval (i.e., search based purely on embeddings), BERT-like encoders are usually trained on pairs of queries and answers. The evaluation presented in BEIR shows that a dense retrieval approach trained on one benchmark (domain) does not perform well on other benchmarks. In fact, it performs worse than standard keyword search using BM25. This means that manually labelled training data consisting of pairs of queries and answers in the specific domain may be needed for each customer project.

ChatGPT is basically GPT trained to perform well at chatting. For this, OpenAI generated labelled data for an initial supervised training phase and then used human testers giving feedback for further reinforcement training. So, supervised training with human involvement is paramount here as well. As always, there is no free lunch.
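
As an illustration of what this supervised signal looks like, here is a minimal sketch of the usual bi-encoder fine-tuning recipe: each manually labelled (query, relevant passage) pair forms one row of a batch, and the other passages in the batch act as negatives. The toy encoder and the random feature vectors are placeholders; in practice the encoder would be a BERT-like transformer and the features would be its tokenized inputs.

# Minimal sketch of supervised fine-tuning for dense retrieval with in-batch negatives.
# The "encoder" here is a toy stand-in; in practice it would be a BERT-like model.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
encoder = torch.nn.Sequential(torch.nn.Linear(32, 64), torch.nn.ReLU(), torch.nn.Linear(64, 16))
optimizer = torch.optim.Adam(encoder.parameters(), lr=1e-3)

def train_step(query_feats, passage_feats):
    """One contrastive step: the i-th query should match the i-th passage,
    all other passages in the batch act as negatives (in-batch negatives)."""
    q = F.normalize(encoder(query_feats), dim=1)
    p = F.normalize(encoder(passage_feats), dim=1)
    scores = q @ p.T / 0.05                 # similarity matrix, temperature-scaled
    labels = torch.arange(q.size(0))        # diagonal entries are the true pairs
    loss = F.cross_entropy(scores, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Each row stands for the features of one labelled (query, relevant passage) pair.
queries, passages = torch.randn(8, 32), torch.randn(8, 32)
print(train_step(queries, passages))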

Correctness and Explainability Not Guaranteed

If you search on Google for the statement “Google search is becoming worse” you see that this is currently a frequent complaint. I am not completely sure whether it is really justified, and if there is a deterioration, there may be several different reasons for it. But let me share some personal experience. I spent the summer vacation with my family in the US. We also went to Yosemite Park. I googled “Yosemite geysers to visit” and got tons of hits about “Old Faithful” and other geysers, including many pictures. Fun fact: there are no geysers in Yosemite. But there are several in Yellowstone, which is a day trip away by car. I suspect that the reason for these false hits is Google’s integration of dense retrieval into their normal search. Yellowstone and Yosemite are both very popular national parks, and for dense retrieval the difference between them is not that big. For keyword search, however, it is, and of course for tourists who want to see geysers.

I recently tested ChatGPT. It is really very impressive. But then I tried some more difficult queries.

Question: Which geysers can I visit in Yosemite?

Answer: Yosemite National Park is home to several geysers and hot springs, which are popular attractions for visitors to the park. Some of the geysers and hot springs that you can visit in Yosemite include: Old Faithful Geyser, Basket Dome Geyser, ….

Looks familiar, doesn’t it?

 

Question: Which mammals lay eggs?

Answer: It's important to note that mammals that lay eggs typically only lay one or two eggs per year, while birds and reptiles can lay many eggs per year. Mammal eggs are also typically much smaller than bird and reptile eggs. Some more mammals that lay eggs are: okapis, hyenas, manatees. Okapis are large, rare mammals that live in central Africa. They lay an egg about the size of a chicken egg about once a year. Hyenas are known to lay eggs in burrows and hatch them. Manatees are mammals that live in water and lay eggs, but they are very small and usually do not fertilize.

The first sentence is a bit misleading but not wrong; the mammals listed, however, definitely do not lay eggs. In one answer ChatGPT listed rabbits as egg-laying mammals. Therefore, I asked:

Question: How hard is the shell of rabbit eggs?

Answer: The shell of rabbit eggs is very strong and hard. It consists of calcium carbonate and other minerals that give it the necessary strength. Compared to chicken eggs, rabbit eggs tend to be smaller and have a thinner shell, which is a little more delicate…

This is complete nonsense. By the way, trying the same question again after resetting ChatGPT gave me the opposite answer, stating that rabbit eggs are very thin and brittle.

 

Here is a funny example in German. Maybe it is only funny if you are from Bavaria.

Question: Wo kann ich in Bayern eine Wattwanderung machen?

Answer: Bayern ist ein Bundesland in Deutschland, das viele Möglichkeiten für Wattwanderungen bietet. Einige Orte in Bayern, an denen man Wattwanderungen machen kann, sind: Insel Rügen, Insel Amrum, ….

I asked where I can go on a mudflat hike in Bavaria, and ChatGPT answered that Bavaria offers many opportunities for mudflat hiking, suggesting among others the islands of Rügen and Amrum. It is important to know that Bavaria has no access to the sea and that the suggested places are in northern Germany.

It is important to note that ChatGPT does not always produce the same answer to a question. After a reset, it sometimes did produce correct answers to the above questions, for example stating that there are no geysers in Yosemite and that mammals usually do not lay eggs. ChatGPT is still a test version, and users are expected to give feedback, which is then used for further reinforcement training; in other words, OpenAI uses its testers to generate further training data. However, I seriously doubt that further training will solve the problem.

I also tried to ask a question about myself. I found the answers very interesting. I have been working in information retrieval with a focus on Lucene and Elasticsearch for more than 20 years now. But before that, I did research (PhD) in deep learning. Therefore, I asked:

Question: Who invented Backpropagation through Structure (BPTS)?

This can be answered easily by a Wikipedia entry:

https://en.wikipedia.org/wiki/Backpropagation_through_structure

Google usually finds this page. I assumed that ChatGPT, since it was trained on Wikipedia, would also get it right. However, Backpropagation through Structure is not a technical term that is used very often, even though it has been getting an increasing number of citations in recent years.

ChatGPT gave many different answers to this question. On most tries it randomly named well-known researchers in the Deep Learning area and quoted non-existent papers with arbitrary publication years, claiming they describe BPTS. On one try it claimed that BPTS is a kind of computer network architecture invented by some guys in the 80s. All answers were completely wrong. For the similar question “Who invented the drum brake?” ChatGPT often incorrectly named Westinghouse but sometimes got it right and answered Louis Renault. As an educated guess, Westinghouse is not a bad answer, just as the well-known deep-learning experts are not bad guesses for BPTS. But these answers are wrong.

OpenAI is of course well aware of the problem that answers might not be correct and states it openly. Language models like GPT are trained to produce perfect language, not truth. Some people call them random “bullshit” generators. Meta’s new large language model Galactica was taken down three days after its launch because it produced too much “bullshit”. ChatGPT has been trained in a better way that makes these problems less obvious. But the problems are still there.

What does this mean for search?

Don’t get me wrong! I am really impressed by the progress of deep learning and the capabilities of transformers. Transformers are able to accumulate impressive knowledge about language and about popular facts. The reason for this is that they not only work on the syntactic level but also capture semantics. You cannot get the syntax (language) right without at least a little bit of semantic knowledge. However, they are not a reliable source of truth. Often they only produce something like an educated guess, and often they produce complete nonsense that is grammatically correct.

There may be further breakthroughs in the future, and language models might replace search engines completely someday. I personally don’t think this will happen soon. For now, we have to use transformers and large language models in combination with traditional search engines. Encoders can be used to re-rank results and to solve the vocabulary mismatch problem (the problem that query and document vocabulary differ), but this has to be done in a controlled way to avoid problems such as the Yosemite / Yellowstone mismatch discussed above. Language models such as ChatGPT can be used to formulate answers based on the snippets produced by search engines, so that the answers are not educated guesses but well-founded and explainable (by references to source documents); a rough sketch of such a pipeline follows below. Other hybrid approaches combining deep learning and classical search engines have to be investigated.

IntraFind is exploring a variety of possibilities, partly together with partners in the SEMIARID project. We have already integrated some deep learning techniques, and we will soon see more in our products. Personally, I am happy to finally get the opportunity to work in the field of Deep Learning again.
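
To make the retrieve-then-generate idea a little more concrete, here is the rough structural sketch announced above, written in Python. The three helper functions are toy stand-ins, not real product or library APIs: a keyword engine such as Lucene or Elasticsearch would supply candidate snippets, a transformer encoder would re-rank them, and the language model would only be allowed to formulate an answer from those snippets, with references back to the source documents.

# Very rough structural sketch of a hybrid "retrieve, re-rank, then generate" pipeline.
# The three functions below are toy stand-ins for a keyword engine, a transformer
# re-ranker and a language model; none of this is a real product or library API.

def keyword_search(question, corpus, size=50):
    """Toy BM25 stand-in: rank documents by the number of shared query terms."""
    terms = set(question.lower().split())
    scored = sorted(corpus, key=lambda d: -len(terms & set(d["text"].lower().split())))
    return scored[:size]

def rerank(question, candidates):
    """Placeholder for an encoder-based re-ranker; here it just keeps the keyword order."""
    return candidates

def generate_answer(prompt):
    """Placeholder for a language model constrained to the retrieved snippets."""
    return "(answer formulated from the cited snippets)\n" + prompt

def answer(question, corpus, top_k=2):
    snippets = rerank(question, keyword_search(question, corpus))[:top_k]
    prompt = "Answer using ONLY the sources below and cite them:\n"
    prompt += "\n".join(f"[{i + 1}] {s['text']}" for i, s in enumerate(snippets))
    prompt += f"\nQuestion: {question}"
    return generate_answer(prompt), [s["id"] for s in snippets]

corpus = [{"id": "doc1", "text": "There are no geysers in Yosemite National Park."},
          {"id": "doc2", "text": "Old Faithful is a famous geyser in Yellowstone."}]
print(answer("Which geysers can I visit in Yosemite?", corpus))

The point of this structure is that every statement in the generated answer can be traced back to one of the cited snippets, which is exactly what a bare language model cannot offer.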


The author

Dr. Christoph Goller
Head of Research
Christoph Goller holds a PhD in computer science from the Technical University of Munich, where his research focused on Deep Learning. He is an Apache Lucene committer and has more than 20 years of experience in Information Retrieval, Natural Language Processing and AI. Since 2002 he has been head of IntraFind’s research department.