
Bringing the DataEngBytes experience into the GenAI era

Updated: May 14

The image features a colorful and vibrant illustration that reads "Bringing the DataEngBytes experience into the GenAI Era". It depicts a diverse group of people interacting with a large smartphone displaying a friendly chatbot character. The chatbot has the DataEngBytes logo on its screen, symbolizing its role as a virtual assistant. The people, shown in various poses, appear engaged in a conversation with the chatbot, highlighting the interactive and accessible nature of the DataEngBytes community event in the age of generative AI.

I founded Cloud Shuttle in 2020, but people in my network often know me as ‘the conference guy’. I founded the Data Engineering meetup in Sydney back in 2017, and the group has since expanded to Melbourne, Brisbane, Perth, Hobart and Auckland (check it out!). Building on the community that grew out of those meetup groups, I then launched DataEngBytes, a community event that has become a series of multi-city, full-day conferences with well over 1,000 attendees each year. In 2024, we’re even expanding into New Zealand for the first time.


Wanting to provide an even better experience for our community, our organisers started to create and add richer features, including our revamped website. Putting on my data and AI cap, I started to toy with the idea of a chatbot that would be useful for our attendees and prospective attendees. 


The challenge


We settled on building a chatbot that our community could ask their most burning questions relating to DataEngBytes. For example:


  • When will the conference be and where?

  • Which companies attended last year?

  • Which companies similar to mine attended last year?

  • (most importantly, of course :-D ) Will there be catering, coffee, swag or prizes?


The problem: LLMs are about language prediction, and their output is based on the corpus they were trained on, which only runs up to a certain point in time. That corpus is the limit of their knowledge, meaning they wouldn’t know anything about DataEngBytes. Not one iota.


For an LLM to be useful for our purposes, Retrieval-Augmented Generation (RAG) is needed to augment what an LLM knows with the event/organisation-specific DataEngBytes information. 

What’s RAG? Let me give you a primer.


Retrieval-Augmented Generation (RAG)


Retrieval-Augmented Generation, or RAG for short, is a technique that makes LLMs more reliable by having them check against external information from authoritative sources before they generate a response. That external information is commonly held in one of two kinds of store: a vector store or a knowledge graph.


The image is an infographic contrasting "Graph Database" with "Vector Store." It is structured as a comparison table with a dark blue background and pink to orange molecular structures signifying network connections. On the left, "Graph Database" is highlighted with properties like structure, retrieval, nature, strengths, challenges, and use cases, mentioning nodes and edges, deterministic nature, and its suitability for domains needing complex inference. On the right, "Vector Store" is defined by vectors in multi-dimensional space, probabilistic nature, fast data retrieval, susceptibility to inaccuracies, and its use for general knowledge and large unstructured data. The infographic is branded with the Cloud Shuttle logo at the bottom right.

Vector stores 


As the name implies, vector stores (also called vector databases) store data as vectors of numbers. They really shine when it comes to handling large volumes of unstructured data. They operate by converting things like text, audio or images into vectors, that is, points in a multi-dimensional space, which enables very fast and efficient retrieval from large data sets. Embeddings, which are used in virtually all well-known NLP models, are vectors of numbers that encode the meaning of words in a way that captures their semantic relationships with other words. What this means is that when a query is run against these vectors, the store looks for its nearest neighbours: the stored vectors that sit closest to the query’s own vector. Because results are ranked by similarity rather than matched exactly, vector stores are probabilistic in nature.
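
To make the nearest-neighbour idea concrete, here is a minimal Python sketch. The three-dimensional vectors and the sentences attached to them are made up for illustration; a real embedding model produces vectors with hundreds or thousands of dimensions, and a real vector store typically uses an approximate index rather than the full scan shown here.

```python
import math

# Hypothetical toy "embeddings": hand-written 3-D vectors, not model output.
DOCUMENTS = {
    "The conference includes catering and coffee": [0.9, 0.1, 0.0],
    "The event is held at a hotel in Sydney": [0.1, 0.9, 0.1],
    "Attendees can win prizes and swag": [0.7, 0.2, 0.3],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_neighbours(query_vector, documents, k=2):
    """Return the k (score, text) pairs most similar to the query vector."""
    scored = [
        (cosine_similarity(query_vector, vector), text)
        for text, vector in documents.items()
    ]
    scored.sort(reverse=True)  # highest similarity first
    return scored[:k]


# Hypothetical embedding of a question such as "Is there food?"
query = [0.8, 0.1, 0.1]
for score, text in nearest_neighbours(query, DOCUMENTS):
    print(f"{score:.3f}  {text}")
```

The search never checks whether a document actually answers the question. It only ranks by closeness, which is why the results are best thought of as likely matches rather than guaranteed ones.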


Knowledge graphs


Knowledge graphs, on the other hand, represent knowledge in a structured format using nodes (entities) and edges (relationships between entities). This structure means data that is relational and hierarchical can be retrieved very efficiently. Because the relationships between nodes are spelled out explicitly in the edges, retrieval is more deterministic in nature. While vector databases can only tell you how similar or related two entities are, knowledge graphs can actually tell you how the entities are related. For example, they can tell you whether one entity is a subset of another, and the exact relationships between different entities. Knowledge graphs can be a highly effective way of knocking out hallucinations in LLMs.
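
The difference between ‘how similar’ and ‘how related’ can be sketched in a few lines of Python. The graph below is a plain in-memory list rather than a real graph database, and `explain_relationship` is a hypothetical helper name; the entities and relationship types mirror the venue example used later in this post.

```python
from collections import deque

# Each edge is (source, relationship, target). This is an in-memory stand-in
# for a graph database; names follow the post's venue example.
EDGES = [
    ("DataEngBytes", "OCCURS", "DataEngBytesSydney2024"),
    ("DataEngBytesSydney2024", "TAKES_PLACE_AT", "SofitelSydney"),
    ("SofitelSydney", "LOCATED_IN", "Sydney"),
]


def explain_relationship(start, goal, edges):
    """Breadth-first search returning the chain of edges from start to goal."""
    outgoing = {}
    for source, relationship, target in edges:
        outgoing.setdefault(source, []).append((relationship, target))

    queue = deque([(start, [])])
    visited = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for relationship, target in outgoing.get(node, []):
            if target not in visited:
                visited.add(target)
                queue.append((target, path + [(node, relationship, target)]))
    return None  # no directed path exists


path = explain_relationship("DataEngBytes", "Sydney", EDGES)
for source, relationship, target in path:
    print(f"{source} -[{relationship}]-> {target}")
```

Given the same graph and the same question, the traversal returns the same chain of relationships every time, which is the deterministic behaviour described above.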


This means that knowledge graphs are better than vector databases for applications that require very deep and specialised knowledge, or where more complex inference and reasoning is required, for example in the academic, engineering and medical fields.


Combining vector stores and knowledge graphs


Depending on your use case, each type can be a powerful solution on its own. They can even be combined for the best of both worlds when you need to leverage structured relational data as well as similarity-based retrieval. For example, combining both can lead to enhanced search capabilities when you use knowledge graphs to understand the context and relationship behind a query, and a vector database to retrieve items contextually similar to the query vector. 


DataEngBytes Chatbot




We built our POC for the DataEngBytes chatbot as an app hosted on LangChain serve, using a Neo4j knowledge graph for retrieval and Amazon Bedrock for the LLM.

While vector stores are formidable in their own right and getting more powerful day by day, we used a knowledge graph-based approach for our chatbot, because the specificity of the DataEngBytes information requires a more deterministic model.


Let’s imagine that attendees or prospective DataEngBytes attendees might want to ask the following questions:


  • When is DataEngBytes in Sydney?

  • What companies attended DataEngBytes in Sydney last year?

  • I come from a small company, what similar companies attended DataEngBytes in Sydney last year?


To enable RAG for our LLM, we spun up a knowledge graph on Neo4j that did the following:


  • Connected our conferences with venues and dates

  • Connected previous conferences with attending companies

  • Connected those companies with their classifications

  • Connected the conference experience with prizes, catering and swag options available.
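
One way to picture these connections is as subject-relationship-object triples that get turned into Cypher `MERGE` statements. The Python sketch below is illustrative only and is not how our graph was actually loaded. The labels `Conference`, `Event`, `Venue`, `City` and `Company` and the relationship types `OCCURS`, `TAKES_PLACE_AT`, `LOCATED_IN` and `ATTENDED` follow the Neo4j screenshots in this post, while `CLASSIFIED_AS`, `Classification`, the company `ExampleCo`, the value `Small` and the `name` property are assumptions made for the example.

```python
import re

# (subject_label, subject_name, relationship, object_label, object_name)
# CLASSIFIED_AS, Classification, "ExampleCo" and "Small" are hypothetical.
TRIPLES = [
    ("Conference", "DataEngBytes", "OCCURS", "Event", "DataEngBytesSydney2024"),
    ("Event", "DataEngBytesSydney2024", "TAKES_PLACE_AT", "Venue", "SofitelSydney"),
    ("Venue", "SofitelSydney", "LOCATED_IN", "City", "Sydney"),
    ("Company", "ExampleCo", "ATTENDED", "Conference", "DataEngBytes"),
    ("Company", "ExampleCo", "CLASSIFIED_AS", "Classification", "Small"),
]

IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def triple_to_cypher(triple):
    """Build one idempotent Cypher MERGE statement plus its parameters."""
    subject_label, subject_name, relationship, object_label, object_name = triple
    # Labels and relationship types are interpolated into the query text,
    # so reject anything that is not a plain identifier.
    for identifier in (subject_label, relationship, object_label):
        if not IDENTIFIER.match(identifier):
            raise ValueError(f"unsafe identifier: {identifier!r}")
    statement = (
        f"MERGE (s:{subject_label} {{name: $subject}}) "
        f"MERGE (o:{object_label} {{name: $object}}) "
        f"MERGE (s)-[:{relationship}]->(o)"
    )
    return statement, {"subject": subject_name, "object": object_name}


for triple in TRIPLES:
    statement, parameters = triple_to_cypher(triple)
    print(statement, parameters)
```

Using `MERGE` rather than `CREATE` means the same triple can be loaded twice without duplicating nodes or relationships. Each statement and its parameters could then be passed to a Neo4j driver session to populate the graph.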


What’s happening under the hood?


1. User types in a plain English query to the chatbot, for example, “Where is the Sydney DataEngBytes conference being held in 2024?”


2. The question gets sent to the LangChain API.


The image displays a dark-themed code editor screen with a snippet of a Cypher query template. The template includes instructions for generating a Cypher statement to query a graph database, emphasizing the use of provided relationship types and properties within a given schema. Example Cypher queries are shown for finding the number of actors in a movie, the date and ticket price of the DataEngBytes Sydney conference, and the venue location of the same event. Placeholder text for a user's question indicates where the input should be inserted to generate a tailored query. The overall appearance suggests a user-friendly interface for querying a database using structured query language.

3. The question gets converted into a Cypher query. This is driven by a prompt template (basically a ‘prompt engineer’s prompt’: an instruction that gives the LLM the schema, rules and examples it needs to produce the output we’re expecting in return). LangChain uses this template to generate a Cypher statement for querying the graph database.

4. LangChain runs the Cypher query against the Neo4j graph database, which has DataEngBytes data preloaded. The graph database houses triples: subject-predicate-object statements, each linking two entities through a relationship. This web of nodes and edges represents relationships such as employment, the venue where the event takes place, the city the event takes place in, attending companies, company sizes, and so on.


 The image shows a coding interface with a dark background where a series of RDF triples are defined. The triples are outlining relationships within the context of a conference called DataEngBytes. The subject-predicate-object statements establish DataEngBytes as a conference, DataEngBytesSydney2024 as an event, and SofitelSydney as a venue. They also define relationships indicating that DataEngBytes occurs at DataEngBytesSydney2024, the event takes place at SofitelSydney, and the venue is located in Sydney.
By grouping triples together, we start building a graph.

The image is a screenshot of a Neo4j graph database interface showing a visual representation of a Cypher query result. The query matches a pattern within the graph that connects a conference named "DataEngBytes" to an event "Dataengbytes Sydney 2024," which in turn is connected to a venue "Sofitel Sydney." The venue is then connected to the city "Sydney." The nodes are color-coded and labeled as "Dataeng," "Sofitel Sydney," and "Sydney" with relationship arrows labeled "OCCURS," "TAKES_PLACE_AT," and "LOCATED_IN" indicating the types of relationships between them. On the right, there is an overview panel listing the node labels "City," "Conference," "Event," "Venue" and relationship types "OCCURS," "TAKES_PLACE_AT," "LOCATED_IN." This visual data graphically demonstrates the relationships and connections within the DataEngBytes conference ecosystem.
A visual representation of the relationships between nodes in our graph database, in this case mapping the relationship between DataEngBytes the conference (Conference), with DataEngBytes Sydney (the event), which takes place at Sofitel Sydney (the venue), located in Sydney (the city).

The image is a screenshot from a Neo4j graph database interface, showing a visual representation of a data query result. The displayed graph consists of multiple clusters of nodes interconnected by lines, indicating relationships. Each cluster represents a conference and its attending companies, with each company as an individual node linked to the central conference node by a relationship labeled "ATTENDED." The interface shows a total of 165 nodes, comprising 8 conference nodes and 157 company nodes. No relationships are highlighted apart from the "ATTENDED" relationship. The background is dark, with the nodes and relationships illuminated for visual emphasis, signifying a complex network of data depicting conference attendance. The overview panel indicates the presence of two node labels, 'Conference' and 'Company,' and one relationship type, 'ATTENDED.' This visualization serves as an analytical tool for understanding the connections between conferences and participating companies within the database.
Another example of our Knowledge Graph mapping the relationship between companies that attended DataEngBytes and their company size.

5. After the query is executed, the information is returned in raw data form (for example a list of venues, event dates, or names of companies similar in size to the one the user inputs into the chatbot).


6. The answer then gets shaped into a conversational English language response and served up to the user.
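
Put together, the six steps can be sketched as a small Python pipeline. This is an illustrative sketch, not the code behind our chatbot: `generate_cypher`, `run_query` and `phrase_answer` are hypothetical stand-ins for the LLM call, the Neo4j query and the final LLM response, and the template wording, the `name` property and the canned result are assumptions made so the example runs without any external service.

```python
# Hypothetical prompt template: gives the LLM a schema plus the user's
# question. The wording is an assumption, not the template from our app.
CYPHER_PROMPT = """Generate a Cypher statement to query a graph database.
Use only the relationship types and properties in this schema:
{schema}

Question: {question}
Cypher:"""

SCHEMA = (
    "(:Conference)-[:OCCURS]->(:Event)-[:TAKES_PLACE_AT]->(:Venue)"
    "-[:LOCATED_IN]->(:City)"
)


def generate_cypher(prompt):
    """Stand-in for the LLM call (step 3): always returns a fixed query."""
    return (
        "MATCH (:Conference {name: 'DataEngBytes'})-[:OCCURS]->(:Event)"
        "-[:TAKES_PLACE_AT]->(v:Venue)-[:LOCATED_IN]->(c:City {name: 'Sydney'}) "
        "RETURN v.name AS venue, c.name AS city"
    )


def run_query(cypher):
    """Stand-in for running the query against Neo4j (steps 4 and 5)."""
    return [{"venue": "Sofitel Sydney", "city": "Sydney"}]


def phrase_answer(question, rows):
    """Stand-in for the LLM turning raw rows into a sentence (step 6)."""
    if not rows:
        return "Sorry, I could not find that in the DataEngBytes data."
    row = rows[0]
    return f"The conference is being held at {row['venue']} in {row['city']}."


def answer(question):
    """Run one question through the whole flow (steps 2 to 6)."""
    prompt = CYPHER_PROMPT.format(schema=SCHEMA, question=question)
    cypher = generate_cypher(prompt)
    rows = run_query(cypher)
    return phrase_answer(question, rows)


print(answer("Where is the Sydney DataEngBytes conference being held in 2024?"))
```

In the real application LangChain takes care of the prompt, the query execution and the final answer; the sketch only shows where each step sits and what is handed from one to the next.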


Conclusion

As my team and I learnt from integrating knowledge graphs and LLMs, expressing the data within your organisation in the right way can lead to seriously powerful outcomes for your customers. When you start thinking about your data in terms of relationships, your data model will naturally follow. I’m excited to see more developments in this space, and I hope you’ll agree that our proof of concept shows that knowledge graphs are the way to truly enable your organisation to capture the full power of its data.



This image features a promotional graphic for a recap of Cloud Shuttle's participation at the GraphSummit Sydney 2024. The design includes a stylized human head with circuit-like patterns and gears, symbolizing the integration of technology and human intellect. The background has a rich navy color with abstract red wave patterns, and the text "GraphSummit Sydney 2024 Cloud Shuttle recap" is displayed in bold, contrasting colors. The overall layout is modern and visually engaging, effectively conveying the theme of technology and innovation.

Read our recap of the GraphSummit Sydney (powered by Neo4j) that took place at Beta Events on Thursday, 9 May 2024. If you're interested in exploring graph solutions for your business, get in touch with us today!



