I recently had the idea of making a web browser that is automated by AI. For example, you could type into a text input field “find the latest NBA stats, open google sheets, and copy all the data over to google sheets” and the browser would do it for you.
Read about how I came up with the idea
I have built a prototype of this browser using Python, Selenium, BeautifulSoup, Langchain, and GPT-4. The code is available here.
The prototype does the following:
- get a url and command from user
- scrape the url for text and clickable elements
- combine instructions, command, clickable elements, and text into a prompt
- feed the prompt to GPT-4
- the response will be the element to click
The commands on this page are relatively specific, and they result in only one element being clicked. However, by using AI to interpret broader instructions and parse it into smaller steps, it is possible to make the browser fully automated by AI.
Improvements In Progress
Currently, this program is prohibitively expensive. The prompt for the YouTube homepage is 4,700 tokens long and contains 280 clickable elements. This means that it would cost 4.7 cents to click one element on one page using GPT-4 Turbo1. Fortunately, there are many optimizations that simultaneously decrease the cost of the program and improve its performance.
First, I am currently only using a single user prompt and response. This prompt includes the general instructions, user command, list of clickable elements on the webpage, and webpage text. The instructions (400 tokens)2 in this prompt can be moved to the system prompt. This will not reduce the token usage for a single page, but significantly reduces the token usage per page as more pages are visited.
Second, there is a lot of text on webpages that is irrelevant to our task. In the YouTube example, homepage suggestions accounted for 2,200 tokens and are unnecessary if the task is to, for example, find the latest MKBHD video. More generally, when the task is to do (as opposed to understand) something, body text usually doesn’t provide much useful context. If link titles were descriptive enough, I could get rid of the website text altogether. However, this is not always the case, so a possible compromise is to attach a set number of context words before and after each link. There are also a significant amount of blank links – links with no title attached. I don’t know where these links are coming from, but it seems like they are unnecessary because every button I have looked for has been titled.
-
GPT-4 Turbo costs 1 cent per 1,000 tokens. GPT-3.5 Turbo is 20x cheaper, but gives worse results and still becomes expensive when scaled up. ↩︎
-
The instructions are always the same; the user command varies. My current instructions are:
I am trying to create a web browser that can be controlled by a user’s text commands. One part of this program is reading the contents of a webpage, interpreting the user’s commands, and clicking the appropriate link based on the command. You are an AI that can help me accomplish the above. I will provide you with the webpage contents and the user’s text command. You will respond with the best link to click.
The format in which I will provide the inputs is as follows: First, I will provide the user command. Second, I will provide a list of all the links on the webpage. Each link will be in the format L{i}[link title], where i is the index. For example, if the webpage text is “This is a LINK32[abc] to a page”, then ‘abc’ is the 32nd clickable link. The index is to help identify the link if there are multiple links with the same title. Third, I will provide the text in the webpage; in the webpage text, the links will be marked with the same format as the list of links : L{i}[link title]. Use the context in the webpage to determine the best link to click. Try to infer as much as possible because the user’s command may be multiple links away. For example, if the site is Youtube and the user command is “Go to MKBHD’s latest video”, you may have to click on the search button. Another example is if the user command is “I want to try the latest AI model,” you may have to read the surrounding context, determine which model is the newest, and find the link that leads to that model.
Reply with only the best link to click in the format L{i}[{title}]. Do not include any other text. Infer as much as possible.