Exploring AirBnB Listings with Semantic Search: A Qdrant and LLM Powered Approach

Harshad Suryawanshi
11 min read · Apr 30, 2024

Introduction

Imagine searching for your ideal AirBnB not just by location and dates, but with a natural language query like “Find an Airbnb in Friedrichshain with an overall score of more than 95.” This is the power of semantic search and LLMs. Leveraging the advanced capabilities of Qdrant — a vector search system enriched with geo-coordinate and date-time functionalities — we aim to demonstrate a practical application that not only responds to complex user queries but also provides intuitive and relevant search results. The app is powered by LlamaIndex, which provides powerful abstractions for LLM orchestration.

Technology Stack Overview

The project is built upon a well-considered stack of technologies, each chosen for its strength in delivering efficient and scalable solutions for AI-powered applications.

  • Qdrant Vector Search: Utilized for its robust support for filtering searches based on location coordinates and date metadata, enhancing the relevance of search results.
  • LLMs (Large Language Models): We leverage “mixtral-8x7b-32768” to process natural language queries and understand contextual nuances, making our search engine more intuitive.
  • Groq: We utilize the Groq API for its exceptional processing speed and responsiveness, which lets the application handle complex natural language queries with low latency. This is crucial for maintaining a fluid user experience, especially when dealing with large datasets and intricate query requirements.
  • Streamlit: Chosen for its simplicity and effectiveness in building interactive UIs for Python applications, allowing users to interact directly with the data.

The Dataset

Our exploration focuses on the “Berlin Airbnb Ratings” dataset from Kaggle. This dataset offers a wealth of information on Berlin listings, including reviews, host details, location data, and more. We work with a truncated version of this dataset, which includes 1000 rows, allowing us to manage our resources efficiently while demonstrating the app’s capabilities.
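For reference, producing such a truncation is a one-off preprocessing step. The sketch below keeps the header plus the first N data rows of a CSV using only the standard library (the function name and sample data are illustrative, not part of the app):

```python
import csv
import io

def truncate_csv(src_text: str, max_rows: int = 1000) -> str:
    """Keep the header row plus the first max_rows data rows of a CSV."""
    rows = list(csv.reader(io.StringIO(src_text)))
    kept = rows[: max_rows + 1]  # header + max_rows data rows
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(kept)
    return out.getvalue()

# Tiny demo: a 3-row CSV truncated to 2 data rows
sample = "name,neighbourhood\nA,Mitte\nB,Kreuzberg\nC,Friedrichshain\n"
print(truncate_csv(sample, max_rows=2))
```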

Here’s a glimpse of the dataset schema:

Application Architecture

Our application is structured around several key components:

  • Qdrant Vector Index: We begin by setting up the Qdrant client and configuring our vector store with the necessary details to handle our Airbnb dataset.
  • Data Loading and Processing: Using PagedCSVReader, we load each row of the dataset into a document structure. Each document is then enriched with metadata extracted directly from the dataset.
  • Vector Embeddings: The data from each document is processed through an embedding model (FastEmbedEmbedding), converting text data into vector format which is then stored in our Qdrant vector store.
  • Streamlit UI: The front-end of our application is developed using Streamlit, which displays the dataset and provides a query input field for users.

How to Run the App

1. Set Up Your Environment

To get started, you need to ensure that Python and Streamlit are installed on your system. Follow these steps:

  • Clone the Repository: First, clone the repository containing the project files using the following Git command:
git clone https://github.com/AI-ANK/Airbnb-Listing-Explorer.git
  • Navigate to the Project Directory:
cd Airbnb-Listing-Explorer
  • Install Dependencies: Install all required dependencies by running:
pip install -r requirements.txt
  • Environment Variables: Create a .env file in the root directory of the project to store sensitive data such as API keys:
GROQ_API=your_groq_api_key_here
QDRANT_URL=your_qdrant_url_here
QDRANT_API_KEY=your_qdrant_api_key_here

2. Configure Qdrant

Before you can launch the application, you need to set up a Qdrant cluster:

  • Create a Qdrant Cluster: Follow the steps outlined in the Qdrant documentation to create a cluster in the Qdrant Cloud. You can find the guide here: Qdrant Cloud Quickstart.
  • Configuration: Make sure to note down the URL and API key for your Qdrant cluster. These will be used in your .env file to enable communication between your application and the Qdrant database.

3. Prepare the Qdrant Collection

  • Collection Name: Decide on a name for your Qdrant collection where the property embeddings will be stored.
  • Populate the Collection: If you are running the application for the first time, you will need to ensure that the code on line 141 of app.py, which handles the creation and population of the collection, is uncommented and configured correctly with your chosen collection name.

4. Launch the Application

Finally, you can start the application by executing the following command in your terminal:

streamlit run app.py

This will start the Streamlit server and open your default web browser to the URL where the app is hosted, typically http://localhost:8501. You can interact with the app through this interface to explore Airbnb property listings.

Code Deep Dive: Bringing the Pieces Together

Library Imports and Initial Configuration

# Required imports
import streamlit as st
import pandas as pd
import os
import json
import re
from pathlib import Path
from dotenv import load_dotenv
from llama_index.core import SimpleDirectoryReader, ServiceContext, StorageContext, VectorStoreIndex
from llama_index.core.vector_stores import (
    MetadataFilter, MetadataFilters, FilterOperator, FilterCondition
)
from llama_index.readers.file import PagedCSVReader
from llama_index.llms.openai import OpenAI
from llama_index.llms.groq import Groq
from llama_index.embeddings.fastembed import FastEmbedEmbedding
from llama_index.vector_stores.qdrant import QdrantVectorStore
from qdrant_client import QdrantClient

Here, we import all the libraries and modules required for the app’s functionality: Pandas for data handling, Pathlib for file paths, python-dotenv for environment variable management, and LlamaIndex’s abstractions for document handling and vector storage.

Initializing Environment and Streamlit

load_dotenv()
st.set_page_config(layout="wide")

We load environment variables and set the Streamlit page configuration to a wide layout for better display of data and components.

Initializing Qdrant, LLM, and Embedding Model

# Initialize Qdrant client
client = QdrantClient(
    url=os.environ['QDRANT_URL'],
    api_key=os.environ['QDRANT_API_KEY'],
)

# Initialize LLM and embedding model
# llm = OpenAI(temperature=0.1, model="gpt-3.5-turbo")
llm = Groq(model="mixtral-8x7b-32768", api_key=os.getenv('GROQ_API'))

embed_model = FastEmbedEmbedding(model_name="BAAI/bge-small-en-v1.5")
service_context = ServiceContext.from_defaults(chunk_size_limit=1024, llm=llm, embed_model=embed_model)

vector_store = QdrantVectorStore(client=client, collection_name="airbnb_5")
storage_context = StorageContext.from_defaults(vector_store=vector_store)

Here, we initiate a connection to our Qdrant server, specify the LLM (Mixtral in this case) and embedding model (FastEmbed), and set up the service context that integrates these components. The QdrantVectorStore object is crucial, as it acts as our interface to store and retrieve vector embeddings within Qdrant.

Data Preparation with PagedCSVReader

@st.cache_data
def load_data():
    reader = PagedCSVReader(encoding="utf-8")
    documents = reader.load_data(file=Path("Airbnb Berlin 1000.csv"))
    return documents

documents = load_data()

For demonstration purposes, we use a truncated version of the dataset containing 1000 rows. The PagedCSVReader efficiently processes the dataset row by row, converting each row into a document object that will later be embedded into vectors and stored in Qdrant.
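To make the document structure concrete: PagedCSVReader renders each CSV row as one text “page” with a “Header: value” line per column. The stdlib sketch below mimics that format (a simplification of the actual reader, which lives in llama_index):

```python
import csv
import io

def rows_to_paged_docs(csv_text: str) -> list[str]:
    """Mimic PagedCSVReader's output: one text page per CSV row,
    each column rendered as a 'Header: value' line."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return ["\n".join(f"{k}: {v}" for k, v in row.items()) for row in reader]

sample = "Listing Name,Latitude,Longitude\nCozy Loft,52.5,13.4\n"
docs = rows_to_paged_docs(sample)
print(docs[0])
# Listing Name: Cozy Loft
# Latitude: 52.5
# Longitude: 13.4
```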

Populating Document Metadata

# Regular expression to extract key-value pairs
pattern = re.compile(r'(\w+[\s\w]*?):\s*(.*)')

for doc in documents:
    # Parse the document text into a dictionary of key-value pairs
    parsed_data = {match.group(1).strip(): match.group(2).strip()
                   for match in pattern.finditer(doc.text)}

    # Check that 'Latitude' and 'Longitude' exist to avoid a KeyError
    if 'Latitude' in parsed_data and 'Longitude' in parsed_data:
        # Nest the coordinates under a 'location' key
        parsed_data['location'] = {
            "lon": float(parsed_data['Longitude']),
            "lat": float(parsed_data['Latitude']),
        }

    # Attach the parsed fields as document metadata
    doc.metadata = parsed_data

We extract relevant information like review date, location (latitude and longitude), and other details from each document and store them as metadata. This metadata will be crucial later for filtering search results based on specific user preferences.
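The parsing step can be exercised in isolation. This standalone snippet applies the same regex to a sample page text (the values are made up for the demo) and builds the nested location dictionary:

```python
import re

# Same key-value pattern as in the app
pattern = re.compile(r'(\w+[\s\w]*?):\s*(.*)')

sample_text = (
    "Listing Name: Cozy Loft\n"
    "Latitude: 52.5310\n"
    "Longitude: 13.3847"
)

parsed = {m.group(1).strip(): m.group(2).strip()
          for m in pattern.finditer(sample_text)}

# Nest the coordinates under a 'location' key, as the app does
if 'Latitude' in parsed and 'Longitude' in parsed:
    parsed['location'] = {
        "lon": float(parsed['Longitude']),
        "lat": float(parsed['Latitude']),
    }

print(parsed['location'])
```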

Vector Embedding and Storage

# Create the index and embed the documents into Qdrant
# (needed only on the first run; comment out afterwards)
index = VectorStoreIndex.from_documents(
    documents, storage_context=storage_context,
    service_context=service_context, show_progress=True
)

# On subsequent runs, load the existing index straight from the Qdrant collection
index = VectorStoreIndex.from_vector_store(
    vector_store, storage_context=storage_context, embed_model=embed_model
)

Using the defined service context, we generate vector embeddings for each document and store them within the Qdrant vector store via the VectorStoreIndex. Qdrant’s efficient vector storage and search capabilities are at the heart of our semantic search functionality.

Streamlit UI and User Interaction

st.title('Airbnb Listing Explorer - Berlin')
# ...

# Input from user
user_query = st.text_input("Enter your query:", "Find a good property in Friedrichshain with Overall rating above 96")

# Processing the query
if st.button("Submit Query"):
    # Generate query vector
    query_vector = embed_model.get_query_embedding(user_query)

    # Regular expression pattern to match the date format MM-DD-YY
    date_pattern = re.compile(r'\b(\d{2}-\d{2}-\d{2})\b')

    # Search for a date in the user query
    date_match = date_pattern.search(user_query)

    # Initialize filters list
    base_filters = []
This large block of code is part of a Streamlit application that handles a user’s query for Airbnb listings, applying sophisticated search filters and displaying results. When the user submits a query through the Streamlit interface, the application first converts the user’s natural language query into a query vector using an embedding model. This vector represents the semantic meaning of the query and is used to search a Qdrant vector store where Airbnb listings are indexed.

    # Check if a date was found and add a date filter
    if date_match:
        date_value = date_match.group(0)  # Extract the matched date
        base_filters.append(
            MetadataFilter(key="review_date", operator=FilterOperator.EQ, value=date_value)
        )

    # Create neighborhood mapping (df is the pandas DataFrame loaded from the CSV)
    unique_neighborhoods = df[['neighbourhood', 'Latitude', 'Longitude']].drop_duplicates(subset='neighbourhood')
    neighborhood_mapping = unique_neighborhoods.set_index('neighbourhood').to_dict(orient='index')

    selected_neighborhood = None
    # Check for a neighborhood name in the user query
    for neighborhood in neighborhood_mapping:
        if neighborhood in user_query:
            selected_neighborhood = neighborhood
            break

    if selected_neighborhood:  # A neighborhood was found in the query
        lat = str(neighborhood_mapping[selected_neighborhood]["Latitude"])
        lon = str(neighborhood_mapping[selected_neighborhood]["Longitude"])

        # Add location filters
        base_filters.append(
            MetadataFilter(key="Latitude", operator=FilterOperator.TEXT_MATCH, value=lat)
        )
        base_filters.append(
            MetadataFilter(key="Longitude", operator=FilterOperator.TEXT_MATCH, value=lon)
        )

    # Combine any filters under a MetadataFilters object with an AND condition
    if base_filters:
        filters = MetadataFilters(
            filters=base_filters,
            condition=FilterCondition.AND,
        )
    else:
        st.write("No valid filters applied based on the user query.")
        filters = None

The code then applies several dynamic filters based on the content of the user’s query. For example, if the query contains a specific date, the code uses a regular expression to extract this date and adds a filter to only include listings that match this date. Similarly, it checks if the query mentions a specific neighborhood, and if so, adds geographic filters for latitude and longitude based on a predefined mapping of neighborhoods to their geographic coordinates. This ensures that the search results are highly relevant to the user’s input. These filters are combined using an AND condition, ensuring all conditions must be met for a listing to be included in the results.
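Stripped of the Streamlit and llama_index machinery, the filter-building logic reduces to the sketch below, where plain dicts stand in for MetadataFilter objects and the neighborhood coordinates are illustrative:

```python
import re

# Illustrative stand-in for the neighborhood-to-coordinates mapping built from the dataset
neighborhood_mapping = {
    "Friedrichshain": {"Latitude": "52.5159", "Longitude": "13.4540"},
    "Kreuzberg": {"Latitude": "52.4987", "Longitude": "13.4180"},
}

def build_filters(user_query: str) -> list[dict]:
    """Derive metadata filters from a query: a date filter for an
    MM-DD-YY match, plus coordinate filters for a known neighborhood."""
    filters = []

    # Date filter: MM-DD-YY anywhere in the query
    date_match = re.search(r'\b(\d{2}-\d{2}-\d{2})\b', user_query)
    if date_match:
        filters.append({"key": "review_date", "op": "==", "value": date_match.group(0)})

    # Neighborhood filter: substring match against the mapping
    for name, coords in neighborhood_mapping.items():
        if name in user_query:
            filters.append({"key": "Latitude", "op": "text_match", "value": coords["Latitude"]})
            filters.append({"key": "Longitude", "op": "text_match", "value": coords["Longitude"]})
            break

    return filters

print(build_filters("List reviews in Kreuzberg on 06-07-15"))
```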

    retriever = index.as_retriever(filters=filters)
    response = retriever.retrieve(user_query)

    # Processing and displaying the results
    text = ''
    properties_list = []  # List to store multiple property dictionaries
    for scored_point in response:
        text += f"\n{scored_point.text}\n"
        # Initialize a new dictionary for the current property
        property_dict = {}
        last_key = None  # Track the last key for appending text

        for line in scored_point.text.split('\n'):
            if ': ' in line:  # The line starts a new key-value pair
                key, value = line.split(': ', 1)
                property_dict[key.strip()] = value.strip()
                last_key = key.strip()  # Update last_key with the current key
            elif last_key:  # No colon found: treat as a continuation of the previous value
                property_dict[last_key] += ' ' + line.strip()

        # Add the current property dictionary to the list
        properties_list.append(property_dict)

    # properties_list contains all the retrieved property dictionaries
    with st.status("Retrieving points/nodes based on user query", expanded=True) as status:
        for property_dict in properties_list:
            st.json(json.dumps(property_dict, indent=4))
        status.update(label="Retrieved points/nodes based on user query", state="complete", expanded=False)

    with st.status("Generating response based on Similarity Search + LLM Call", expanded=True) as status:
        prompt_template = f"""
Using the below context information respond to the user query.
context: '{properties_list}'
query: '{user_query}'
Response structure should look like this:

*Detailed Response*

*Relevant Details in well-formatted Markdown Table Format. Select appropriate columns based on user query*
"""
        llm_response = llm.complete(prompt_template)
        response_parts = llm_response.text.split('```')
        st.markdown(response_parts[0])

Once the filters are set up, the application retrieves matching documents from the Qdrant vector store. For each document returned, the code extracts and formats detailed information about the Airbnb property from the structured text stored in the documents. This includes parsing the property details such as location, price, and ratings. The results are then displayed to the user in a structured format, using Streamlit’s capabilities to dynamically update the webpage. Additionally, the code uses another call to a language model to generate a detailed response based on the query context and the listings retrieved, aiming to provide a richer, more informative answer to the user’s query. This demonstrates an advanced use of vector search combined with natural language processing to deliver a user-friendly and effective search interface.
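The per-node parsing loop can be factored into a small standalone function. This version hardens the colon check slightly (splitting only on ': ') but is otherwise the same continuation-line logic:

```python
def parse_property_text(text: str) -> dict:
    """Parse 'Key: value' lines into a dict; lines without a
    'Key: ' prefix are appended to the previous key's value."""
    prop = {}
    last_key = None
    for line in text.split('\n'):
        if ': ' in line:
            key, value = line.split(': ', 1)
            prop[key.strip()] = value.strip()
            last_key = key.strip()
        elif last_key and line.strip():
            prop[last_key] += ' ' + line.strip()
    return prop

sample = "Listing Name: Cozy Loft\nComments: Great stay,\nwould book again\nPrice: 60"
print(parse_property_text(sample))
# {'Listing Name': 'Cozy Loft', 'Comments': 'Great stay, would book again', 'Price': '60'}
```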

Metadata Filtering: Fine-tuning the Search

We leverage Qdrant’s powerful metadata filtering capabilities to refine the search experience. If the user’s query mentions a specific date or neighborhood, we apply filters based on the document metadata. For example, if the user asks to “List review for listing name ‘Elegant Apartment in Kreuzberg’ on 06-07-15,” we filter the results on the ‘review_date’ metadata. Similarly, if the user mentions “Friedrichshain,” we look up its coordinates and refine the search using the ‘Latitude’ and ‘Longitude’ metadata.

End-to-End Example: A User’s Journey

Let’s follow a user’s journey through our AirBnB listing explorer using the query: “List review for listing name ‘Elegant Apartment in Kreuzberg’ on 06-07-15”.

Step 1: Parsing the Query and Applying Filters

  • Regular Expression for Date Extraction: We employ a regular expression (date_pattern) to identify and extract the date mentioned in the user’s query. In this case, it successfully matches “06-07-15”.
  • Neighborhood Mapping: A predefined mapping between neighborhood names and their corresponding latitude and longitude coordinates is used. If the query included a neighborhood name, we would identify its coordinates using this mapping.
  • Metadata Filters: Based on the extracted date, a metadata filter is created for the ‘review_date’ field. This filter ensures that only documents with a matching review date are considered during the search. If a neighborhood had been mentioned, additional filters for ‘Latitude’ and ‘Longitude’ would be applied.

Step 2: Qdrant Similarity Search with Filters

  • Vector Embedding of the Query: The user’s query is converted into a vector embedding using the same embedding model used for the documents.
  • Searching with Qdrant: Qdrant takes the query vector and efficiently searches its vector space for documents with similar vectors. The applied metadata filters ensure that only documents matching the specified date and/or neighborhood are considered.
  • Retrieving Relevant Nodes: Qdrant returns the most relevant documents (nodes) based on the similarity search and applied filters.

Step 3: Displaying Retrieved Nodes

  • The retrieved nodes, containing information about relevant AirBnB listings, are displayed in the Streamlit UI. This allows the user to quickly see if the retrieved information aligns with their query.

Step 4: LLM Response Generation

  • Prompting the LLM: The retrieved nodes and the user’s original query are combined into a prompt for the LLM. This prompt provides context and instructs the LLM on the desired response format.
  • LLM Processing and Response: The LLM analyzes the context, understands the user’s intent, and generates a response. The response might summarize the reviews for the specified listing on the given date, extract key points from the reviews, or provide additional insights based on the retrieved information.

Step 5: Presenting the Final Results

  • The LLM’s response is presented to the user in a clear and informative format. This could include a textual summary, a table highlighting key aspects of the reviews, or even a visualization of the sentiment analysis from the reviews.
  • In this example, the user would receive a response focusing specifically on reviews for the “Elegant Apartment in Kreuzberg” on “06-07-15”. The combination of Qdrant’s efficient filtering and search capabilities with the LLM’s language understanding allows for a powerful and intuitive way to explore AirBnB listings.

Conclusion

This exploration demonstrates the immense potential of combining Qdrant’s vector search capabilities with the power of LLMs. We’ve built a system that allows users to search for AirBnB listings using natural language, filtering results with precision based on specific criteria. While this example focuses on AirBnB listings, the underlying principles can be applied to various domains, from e-commerce product searches to complex document retrieval systems. As vector search technology and LLMs continue to evolve, we can anticipate even more innovative and user-friendly search experiences.

Your insights and contributions are invaluable as I continue to evolve this tool. Feel free to delve into the project on GitHub or join me on LinkedIn for further discussions.

GitHub Repo

Connect with Me on LinkedIn


Harshad Suryawanshi

IIM Trichy | Driving Analytics Product Content at MSCI | Gen-AI Enthusiast | Large Language Models (LLM) Explorer