Building a Semantic Search: Insights from the start of our journey

Research in the field of Artificial Intelligence (AI) is challenging but full of potential – especially for a new team. When CONTACT Research was formed in 2022, AI was designated as one of four central research areas right from the start. Initially, we concentrated on smaller projects, including traditional data analysis. However, with the growing popularity of ChatGPT, we shifted our attention to Large Language Models (LLMs) and took the opportunity to work with cutting-edge tools and technologies in this promising field. But for a new research team, one critical question emerged: where do we start?

Here, we share some of our experiences, which can serve as guidance for others embarking on their AI journey.

The beginning: Why similarity search became our starting point

From the outset, our goal was clear: we wanted more than just a research project; we aimed for a real use case that could ideally be integrated directly into our software. To get started quickly, we opted for small experiments and looked for a specific problem that we could solve step by step.

Our software stores vast amounts of data, from product information to project details. Powerful search capabilities make a decisive difference here. Our existing search function did not recognize synonyms or natural language and sometimes missed what users were really looking for. This shortcoming, together with valuable user feedback, quickly led us to conclude that similarity search was an ideal starting point and should be our first research topic. An LLM has the power to elevate our search functionality to a new level.

The right data makes the difference

Our vision was to make knowledge from various sources such as manuals, tutorials, and specifications easily accessible by asking a simple question. The first and most crucial step was to identify an appropriate data source: one large enough to provide meaningful results but not so extensive that resource constraints would impede progress. In addition, the dataset needed to be of high quality and easily available.

For the experiment, we chose the web-based documentation of our software. It contains no confidential information and is accessible to customers and partners. Initial experiments with this dataset quickly delivered promising results, so we intensified the development of a semantic search application.

What is semantic search?

In short, unlike classic keyword search, semantic search also recognizes related terms and expands queries to include contextually related results – even if these are phrased differently. How does this work? In the first step, semantic indexing, an LLM converts the content of the source texts into vectors and saves them in a database. Search queries are transformed into vectors in the same way and then compared to the stored vectors using a “nearest neighbor” search. The application returns the results as a sorted list with links to the documentation.
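
As an illustration, here is a minimal sketch of this indexing and lookup step in Python. The embedding model is a freely available stand-in chosen for the example, not the model used in our software:

```python
# A minimal sketch of semantic indexing and "nearest neighbor" lookup.
# The embedding model is a freely available stand-in for illustration.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

# Indexing: convert each documentation chunk into a vector.
docs = [
    "How to configure the PDF export",
    "Managing user permissions for projects",
    "Importing CAD models into the workspace",
]
doc_vectors = model.encode(docs, normalize_embeddings=True)

# A query is embedded the same way; with normalized vectors, the dot
# product is the cosine similarity used for the nearest-neighbor search.
query_vector = model.encode(["Who may access a project?"], normalize_embeddings=True)
scores = (doc_vectors @ query_vector.T).ravel()

# Return the documents as a sorted list, best match first.
for idx in np.argsort(scores)[::-1]:
    print(f"{scores[idx]:.3f}  {docs[idx]}")
```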

Plan your infrastructure carefully!

Implementing our project required numerous technical and strategic decisions. For the pipeline that processes the data, LangChain best met our requirements. The hardware also posed challenges: text volumes of this scale overwhelm a laptop, so servers or cloud infrastructure are required. A well-structured database is another critical factor for successful implementation.
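
For illustration, a heavily simplified version of such a pipeline might look as follows. The package layout follows recent LangChain releases; the chunk sizes, the embedding model, and the FAISS vector store are assumptions for this sketch, not our production setup:

```python
# A heavily simplified sketch of an indexing pipeline with LangChain.
# Chunk sizes, embedding model, and the FAISS store are assumptions.
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS

# Stand-in for the raw text of one documentation page.
raw_text = (
    "User permissions are managed per project. Administrators can grant "
    "read or write access to individual members. "
    "The PDF export is configured in the administration settings."
)

# Split long pages into overlapping chunks so that each embedding covers
# a coherent, searchable unit of text.
splitter = RecursiveCharacterTextSplitter(chunk_size=200, chunk_overlap=40)
chunks = splitter.split_text(raw_text)

# Embed all chunks and store the vectors in a local FAISS index.
embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
store = FAISS.from_texts(chunks, embeddings)

# At query time, the question is embedded the same way and the nearest
# chunks are returned, best match first.
for hit in store.similarity_search("Who may access a project?", k=2):
    print(hit.page_content)
```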

Success through teamwork: Focusing on data, scope, and vision

Success in AI projects depends on more than just technology; it is also about the team. Essential roles include Data Engineers who bridge technical expertise and strategic goals, Data Scientists who analyze large amounts of data, and AI Architects who define the vision for AI usage and coordinate the team. While AI tools supported us with “simple” routine tasks and creative impulses, they could not replace the constructive exchange and close collaboration within the team.

Gather feedback and improve

At the end of this first phase, we shared an internal beta version of the Semantic Search with our colleagues. This allowed us to gather valuable feedback in order to plan our next steps. The enthusiasm for further development is high, fueling our motivation to continue.

What’s next?

Our journey in AI research has only just begun, but we have already identified important milestones. Many exciting questions lie ahead: Which model will best suit our long-term needs? How do we make the results accessible to users?

Our team continues to grow – in expertise, members, and visions. Each milestone brings us closer to our goal: integrating the full potential of AI into our work.

For detailed insights into the founding of our AI team and on the Semantic Search, visit the CONTACT Research Blog.

Design decisions in minutes – how AI supports product development

Artificial intelligence (AI) is a hot topic and increasingly important in product development. But how can this technology be effectively integrated into development projects? Together with our client Audi, we put it to the test and examined the potential and challenges of a machine learning (ML) application – a subset of AI – in a real project. For this purpose, we chose a crash management system (CMS): simple enough to achieve a meaningful result, yet complex enough to adequately test the general applicability of the ML method.

Expertise as the Key

ML can only be as effective as the underlying data foundation allows. Therefore, the expertise of the professionals involved plays a critical role. For example, design engineers enter their knowledge of manufacturing and spatial constraints, usable materials, and dependencies into the CAD model. Simulation engineers contribute their expertise on the simulation process, while data scientists assist with sampling and evaluation.

Creating the thousands of design models and corresponding simulation models that ML requires would be a tremendous challenge without automation. The FCM CAT.CAE-Bridge, a specially developed plug-in for CATIA, enables seamless automation across all process steps. Additionally, it embeds all simulation-relevant information (material, properties, solver, and more) directly into the CAD model. The fully automatic translation into a simulation file is done via tools such as ANSA or HyperMesh.

Automated process: Sampling, DoE, model creation, simulation, evaluation with subsequent training of the ML models. (© CONTACT Software)
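
The sampling step at the start of this process can be illustrated with a short sketch: a space-filling design of experiments over the design parameters. The parameter names, bounds, and sample count below are invented for illustration and are not those of the Audi project:

```python
# A sketch of the sampling/DoE step: a space-filling Latin hypercube
# design over hypothetical CMS design parameters (names and bounds are
# invented for illustration).
from scipy.stats import qmc

names = ["wall_thickness_mm", "crash_box_length_mm", "rib_count"]
lower = [1.5, 150.0, 2]
upper = [4.0, 300.0, 8]

sampler = qmc.LatinHypercube(d=len(names), seed=42)
unit_samples = sampler.random(n=1000)            # 1000 points in [0, 1)^3
designs = qmc.scale(unit_samples, lower, upper)  # scaled to the real bounds

# Each row is one parameter set; in the automated process, every set is
# written into the CAD model, meshed, simulated, and evaluated. (Integer
# parameters such as rib_count would be rounded before use.)
print(designs[:3])
```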

Precise Linking of Parameters and Results

Our approach ensures that the relationship between the CAD model and the simulation model is fully preserved. The automated calculation and evaluation of the models based on specific results create an excellent data foundation for the ML process: vectors of input parameters, paired with their corresponding result values, form a clear and comprehensive training basis.

Input parameters (blue) identified based on constrained result vectors (red) that meet the requirements. (© CONTACT Software)

With the trained models and their known accuracy, parameter variations can be quickly tested and their impact on behavior derived, literally within minutes. Once the optimal parameters are identified, they are automatically transferred to the CAD model, and the design process can continue.
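
As an illustration of these two steps, the following self-contained sketch trains a surrogate model on pairs of input parameters and result values, quantifies its accuracy on held-out runs, and then scans a large number of parameter variations in seconds. The data, names, and model choice are assumptions for the sketch, not the setup used in the project:

```python
# A self-contained sketch of the surrogate-modeling step. Random numbers
# stand in for the real CAD parameters and simulation results, which we
# cannot reproduce here; the model choice is illustrative.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
X = rng.uniform(size=(1000, 3))  # 1000 input parameter vectors, scaled to [0, 1]
# Synthetic stand-in for one simulation result value, e.g. peak intrusion:
y = 2.0 * X[:, 0] - X[:, 1] + 0.1 * rng.normal(size=1000)

# Train the surrogate and quantify its "known accuracy" on held-out runs.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
surrogate = RandomForestRegressor(n_estimators=200, random_state=0)
surrogate.fit(X_train, y_train)
print("R^2 on held-out simulations:", r2_score(y_test, surrogate.predict(X_test)))

# Once trained, the surrogate evaluates thousands of parameter variations
# in seconds instead of hours of simulation time.
candidates = rng.uniform(size=(100_000, 3))
predicted = surrogate.predict(candidates)

# Pick the candidate with the best (here: lowest) predicted result; in the
# real process, this parameter set is written back into the CAD model.
best = candidates[np.argmin(predicted)]
print("Best parameter set:", best, "predicted result:", predicted.min())
```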

Conclusion

Our project demonstrated that ML is a valid method for design engineering. The combination of parametric CAD models, simulation, and machine learning provides an efficient approach to making design decisions quickly and accurately. The prerequisites for this are a robust data foundation and the collaboration of the relevant experts on the model. The successful results from the Audi project underscore the potential of our data-based approach for product development.

Digital authenticity: how to spot AI-generated content

In today’s digital age, we often question whether we can trust images, videos, or texts. Tracing the source of information is becoming more and more difficult. Generative AI accelerates the creation of content at an incredible pace. Images and audio files that once required a skilled artist can now be generated by AI models in a matter of seconds. Models like OpenAI’s Sora can even produce high-quality videos!

This technology offers both opportunities and risks. On the one hand, it speeds up creative processes, but on the other hand, it can be misused for malicious purposes, such as phishing attacks or creating deceptively real deepfake videos. So how can we ensure that the content shared online is genuine?

Digital watermarks: invisible protection for content

Digital watermarks are one solution that helps verify the origin of images, videos, or audio files. These patterns are invisible to the human eye but can be detected by algorithms even after minor changes, like compressing or cropping an image, and are difficult to remove. They are primarily used to protect copyright.
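
As a toy illustration of the principle, the following sketch hides one bit per pixel in the least significant bits of an image. Real watermarking schemes are far more robust against compression and cropping than this; the point here is only to show why the change is invisible:

```python
# A toy illustration of image watermarking: hiding a bit pattern in the
# least significant bits (LSBs) of the pixels. Real schemes are far more
# robust; this only shows why the embedded pattern is invisible.
import numpy as np

rng = np.random.default_rng(1)
image = rng.integers(0, 256, size=(4, 4), dtype=np.uint8)    # toy grayscale image
watermark = rng.integers(0, 2, size=(4, 4), dtype=np.uint8)  # one bit per pixel

# Embedding: overwrite each pixel's least significant bit. Each pixel
# value changes by at most 1 out of 255 - imperceptible to the human eye.
marked = (image & 0xFE) | watermark

# Detection: an algorithm that knows where to look reads the bits back.
extracted = marked & 0x01
assert np.array_equal(extracted, watermark)
print("Watermark recovered intact")
```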

However, applying watermarks to text is far more difficult because text has less redundancy than the pixels of an image. A related method is to insert small but visible errors into the original content. Google Maps, for instance, uses this method with fictional streets – if these streets appear in a copy, it signals copyright infringement.

Digital signatures: security through cryptography

Digital signatures are based on asymmetric cryptography. This means that the content is signed with a private key that only the creator possesses. Others can verify the authenticity of the content using a public key. Even the smallest alteration to the content invalidates the signature, making it nearly impossible to forge. Digital signatures already ensure transparency in online communication, for example through the HTTPS protocol for secure browsing.
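
Here is a minimal sketch of this mechanism, using Ed25519 keys from the Python cryptography package (the content string is just an example):

```python
# A minimal sketch of signing and verifying content with asymmetric
# cryptography, using Ed25519 keys from the "cryptography" package.
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey
from cryptography.exceptions import InvalidSignature

private_key = Ed25519PrivateKey.generate()  # kept secret by the creator
public_key = private_key.public_key()       # shared with everyone

content = b"Original photo, taken 2024-05-01"
signature = private_key.sign(content)

# Verification succeeds only for the unmodified content.
public_key.verify(signature, content)       # no exception: authentic

try:
    public_key.verify(signature, content + b" (edited)")
except InvalidSignature:
    print("Even the smallest alteration invalidates the signature")
```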

If all digital content were protected by signatures, the origin and authenticity of any piece of media could be easily verified. For example, you could confirm who took a photo, when, and where. An initiative pushing this forward is the Coalition for Content Provenance and Authenticity (C2PA), which is developing technical standards to apply digital signatures to media and document their origin. Unlike watermarks, signatures are not permanently embedded into the content itself and can be removed without altering the material. In an ideal scenario, everyone would use digital signatures – then, missing signatures would raise doubts about the trustworthiness of the content.

GenAI detectors: AI vs. AI

GenAI detectors provide another way to recognize generated content. Generative models leave behind characteristic patterns in their output, such as specific wording or sentence structures, which other AI models can detect. Tools like GPTZero can already identify with high accuracy whether a text originates from a generative AI model like ChatGPT or Gemini. While these detectors are not perfect yet, they provide an initial indication.
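
As a rough illustration of one signal such detectors can use, the sketch below computes the perplexity of a text under a small public language model; unusually low perplexity suggests the text is “unsurprising” to the model and may be machine-generated. This is a crude heuristic for illustration, not how GPTZero works internally:

```python
# A crude sketch of one detection signal: perplexity under a small
# language model. Low perplexity is a weak hint that text is generated.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean cross-entropy per token
    return float(torch.exp(loss))

print(perplexity("The quick brown fox jumps over the lazy dog."))
```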

What does this mean for users?

Of all the options, digital signatures offer the strongest protection because they work across all types of content and are based on cryptographic methods. It will be interesting to see whether projects like C2PA can establish trusted standards. Still, depending on the goal, different measures may be needed to ensure the trustworthiness of digital content.
In addition to technological solutions, critical thinking remains one of the best tools for navigating the information age. The amount of available information is constantly growing; it is therefore important to question and verify content critically and to stay aware of what generative AI models are capable of.

For a more comprehensive article, check out the CONTACT Research Blog.