Scaling Innovation Discovery with Web-Scale NLP & Early LLMs

I joined this project as a UX designer and quickly realized the AI engine lagged far behind what the product promised. So I dove in, built the ML/NLP stack from scratch, and took the platform to web scale. This case study covers how it grew from a few thousand curated entries to 250M+ documents and LLM-powered search.

Challenge

Transform a small, manually curated knowledge base into a web-scale platform capable of identifying novel sustainable innovations by processing vast volumes of unstructured text data. Evolve from basic search functionality to sophisticated AI-driven insights and discovery.

My Role & Evolution

I joined the team in 2018 as a UX designer. Within a year I had completed the UX research and design work and discovered that the AI engine was far behind what the UI promised. That gap excited me so much that I transitioned to what I’m truly passionate about: AI and NLP.

At the time:

- Our database contained only a few thousand manually curated innovations
- Almost no ML systems were in production
- No automated data pipelines existed
- Search was mostly keyword and rule-based
- The system wasn’t optimized to find solutions to user problems as intended

Key Contributions

Neural Semantic Search Transformation

First, I proposed a shift to neural semantic search:

- Ran an annotation project that resulted in a manually labeled innovation search dataset
- Trained and tested hundreds of different retrieval and re-ranking models
- Implemented vector search solutions
- Helped deploy the best models into production
- The semantic search system I developed is still running in production today
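The retrieval and re-ranking steps above can be illustrated with a minimal sketch. This is not the production system: the vectors, document ids, and relevance scores below are made-up toy values, and `retrieve` and `rerank` are hypothetical helper names. A real setup would use a trained encoder for the vectors and a trained re-ranking model for the second stage.

```python
import math
from typing import Callable

Vector = list[float]

def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve(query_vec: Vector, index: dict[str, Vector], k: int) -> list[str]:
    """First stage: brute-force vector search, returning the top-k document ids."""
    ranked = sorted(index, key=lambda doc_id: cosine(query_vec, index[doc_id]), reverse=True)
    return ranked[:k]

def rerank(candidates: list[str], score: Callable[[str], float]) -> list[str]:
    """Second stage: reorder the candidates with a (usually more expensive) scorer."""
    return sorted(candidates, key=score, reverse=True)

# Toy index: in practice these vectors would come from an encoder model.
index = {
    "solar-paint": [0.9, 0.1, 0.0],
    "algae-fuel": [0.7, 0.3, 0.1],
    "office-chairs": [0.0, 0.1, 0.9],
}
query_vec = [1.0, 0.2, 0.0]

candidates = retrieve(query_vec, index, k=2)

# Stand-in for a re-ranking model: a fixed, made-up relevance score per document.
toy_relevance = {"solar-paint": 0.4, "algae-fuel": 0.8}
results = rerank(candidates, lambda doc_id: toy_relevance[doc_id])
```

The two-stage split keeps the cheap similarity search over the whole index and reserves the costlier scorer for a short candidate list.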

Figure 1: One of my first semantic search demos (2019), showcasing contextual document retrieval

Web-Scale Data Pipeline

I championed the idea that we needed to grow our database by orders of magnitude:

- Proposed scaling from thousands to hundreds of millions of entries by processing web content
- Advocated for importing research papers, patents, and news
- Trained custom ML models for domain classification, page classification, and entity extraction
- Built a pipeline that could mine the internet and import potential innovative solutions
- Scaled the database to over 250 million documents in production
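As a rough illustration of how the three model stages above might chain together, here is a minimal sketch. The keyword rules are stand-ins for the trained classifiers and extraction model, and every name in it (`Page`, `in_domain`, `is_innovation_page`, `extract_entities`, `run_pipeline`, the term lists, the example URLs) is hypothetical rather than taken from the actual pipeline.

```python
from dataclasses import dataclass, field
from typing import Iterable, Iterator

@dataclass
class Page:
    url: str
    text: str
    entities: list[str] = field(default_factory=list)

# Keyword stand-ins for trained models (assumptions for illustration only).
DOMAIN_TERMS = {"solar", "battery", "recycling"}
KNOWN_TECH = {"perovskite", "solid-state battery"}

def in_domain(page: Page) -> bool:
    """Stand-in for the domain classifier."""
    return any(term in page.text.lower() for term in DOMAIN_TERMS)

def is_innovation_page(page: Page) -> bool:
    """Stand-in for the page classifier: drop login pages and near-empty pages."""
    return "login" not in page.url and len(page.text.split()) >= 5

def extract_entities(page: Page) -> list[str]:
    """Stand-in for the entity extraction model."""
    text = page.text.lower()
    return sorted(term for term in KNOWN_TECH if term in text)

def run_pipeline(pages: Iterable[Page]) -> Iterator[Page]:
    """Apply the cheap filters first, then run extraction on the survivors."""
    for page in pages:
        if not in_domain(page):
            continue
        if not is_innovation_page(page):
            continue
        page.entities = extract_entities(page)
        yield page

pages = [
    Page("https://example.org/a", "New perovskite solar cells reach record efficiency"),
    Page("https://example.org/login", "Sign in to view solar offers and more"),
    Page("https://example.org/b", "Ten recipes for a quick weeknight dinner"),
]
kept = list(run_pipeline(pages))
```

Writing the pipeline as a generator means pages stream through one at a time, which matters once the input no longer fits in memory.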

Early LLM Innovation (2019)

Shortly after GPT-2 was released in 2019:

- Fine-tuned it as an idea generator for creative problem solving
- Deployed it as a Slack bot (later rebuilt by colleagues for MS Teams)
- To my knowledge, this was one of the earliest documented enterprise uses of a language model for automated ideation
- This early LLM application sparked user interest and excited investors

Figure 2: Human-AI collaborative ideation platform for generating novel innovation concepts

Advanced NLP Systems

I designed and developed numerous custom ML systems that improved the quality of our data:

Custom NER & Knowledge Graph: Created an entity extraction and linking pipeline
- Extracted named entities and relations
- Linked them to Wikidata and other knowledge bases
- Built a proprietary knowledge base for emerging innovations not yet in public sources, giving users a critical time advantage in discovering emerging technologies
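The linking idea (try a public knowledge base first, fall back to a private one for entities it does not cover yet) can be sketched as follows. This is an assumption-laden toy: the alias table, the `public:` and `private:` id formats, and the `KnowledgeBase` class are all hypothetical, and real linking would use context-aware disambiguation rather than exact alias lookup.

```python
from dataclasses import dataclass, field

@dataclass
class KnowledgeBase:
    # Alias -> public id. The ids here are placeholders, not real Wikidata ids.
    public_aliases: dict[str, str]
    # Normalized mention -> private id, filled in as new entities are seen.
    private_ids: dict[str, str] = field(default_factory=dict)

    def link(self, mention: str) -> str:
        """Return a public id if the mention is known, else a stable private id."""
        key = " ".join(mention.lower().split())  # normalize case and whitespace
        if key in self.public_aliases:
            return self.public_aliases[key]
        if key not in self.private_ids:
            self.private_ids[key] = f"private:{len(self.private_ids) + 1}"
        return self.private_ids[key]

kb = KnowledgeBase({
    "solar cell": "public:solar-cell",
    "photovoltaic cell": "public:solar-cell",
})
mentions = ["Solar Cell", "photovoltaic  cell", "Quantum dot paint", "quantum dot paint"]
links = [kb.link(m) for m in mentions]
```

Repeated mentions of an unknown entity resolve to the same private id, so the private knowledge base accumulates evidence for an emerging technology before any public source lists it.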

Figure 3: Custom-trained NER model identifying and extracting emerging technologies from unstructured text

Pioneering LLM Techniques:
- Started using LLMs for knowledge extraction and data augmentation years ahead of mainstream adoption
- Designed prompting methods similar to what are now called Self-Ask, Chain-of-Thought, and Scratchpads
- Created advanced dialog systems incorporating background search, source citation, and goal completion
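For readers unfamiliar with the Self-Ask pattern mentioned above, here is a minimal sketch of the control loop combined with a background search step. It is not the original implementation: `generate` stands in for any language model call, `search` for any retrieval backend, and the scripted responses and canned search result are toy values chosen only to make the example run.

```python
from typing import Callable

FOLLOW_UP = "Follow up:"
FINAL = "So the final answer is:"

def self_ask(
    question: str,
    generate: Callable[[str], str],
    search: Callable[[str], str],
    max_steps: int = 3,
) -> tuple[str, str]:
    """Alternate model steps and searches until the model gives a final answer.

    Returns (answer, transcript); the answer is empty if no final answer
    appears within max_steps.
    """
    prompt = f"Question: {question}\nAre follow up questions needed here: Yes.\n"
    for _ in range(max_steps):
        step = generate(prompt).strip()
        prompt += step + "\n"
        if step.startswith(FINAL):
            return step[len(FINAL):].strip(), prompt
        if step.startswith(FOLLOW_UP):
            sub_question = step[len(FOLLOW_UP):].strip()
            # Answer the sub-question from search instead of the model's memory.
            prompt += f"Intermediate answer: {search(sub_question)}\n"
    return "", prompt

# Scripted stand-in for a language model (toy responses, not real model output).
scripted = iter([
    "Follow up: Which electrolytes are used in flow batteries?",
    "So the final answer is: vanadium electrolytes",
])

def generate(prompt: str) -> str:
    return next(scripted)

def search(query: str) -> str:
    # Canned stand-in for a search backend.
    return "Vanadium electrolytes are common."

answer, transcript = self_ask("What do flow batteries store energy in?", generate, search)
```

The transcript keeps every intermediate answer, which is also what makes source citation possible in a dialog system built on this loop.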

Figure 4: Evolution to LLM-powered semantic search on large private corpora with context-aware relevance ranking

Advanced Analytics: Developed systems for trend analysis and forecasting
- Created topic modeling and trend analysis tools
- Built LLM-based forecasting capabilities for technology prediction
- Implemented zero-shot innovation detection based on component scores
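One simple way to turn extracted entities into trend signals is to count mentions per year and fit a line through the counts. The sketch below does exactly that with an ordinary least-squares slope. It is an illustrative assumption, not the method used in production, and the mention data is made up.

```python
from collections import Counter, defaultdict
from typing import Iterable

def yearly_counts(mentions: Iterable[tuple[int, str]]) -> dict[str, Counter]:
    """Count how often each entity is mentioned per year."""
    counts: dict[str, Counter] = defaultdict(Counter)
    for year, entity in mentions:
        counts[entity][year] += 1
    return counts

def trend_slope(series: Counter, years: list[int]) -> float:
    """Least-squares slope of mentions per year (positive means rising)."""
    n = len(years)
    ys = [series.get(year, 0) for year in years]
    mean_x = sum(years) / n
    mean_y = sum(ys) / n
    var = sum((x - mean_x) ** 2 for x in years)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, ys))
    return cov / var if var else 0.0

# Toy (year, entity) pairs, as they might come out of an NER pass over documents.
mentions = [
    (2019, "green hydrogen"), (2020, "green hydrogen"), (2020, "green hydrogen"),
    (2021, "green hydrogen"), (2021, "green hydrogen"), (2021, "green hydrogen"),
    (2019, "biofuel"), (2019, "biofuel"), (2020, "biofuel"),
]
years = [2019, 2020, 2021]

counts = yearly_counts(mentions)
slopes = {entity: trend_slope(series, years) for entity, series in counts.items()}
rising_first = sorted(slopes, key=slopes.get, reverse=True)
```

Raw counts grow with the size of the corpus, so in practice the yearly counts would likely need normalizing by the number of documents per year before comparing slopes.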

Figure 5: Technology trend analysis based on entities extracted by custom NER models

Key Technologies

Custom ML Frameworks, Python, PyTorch, Hugging Face (Transformers), spaCy, NLTK, Elasticsearch/OpenSearch, Vector Search (early implementations), GCP, Custom ML Pipelines, Early LLMs (GPT-2 onwards).

Project Highlights

Custom Named Entity Recognition

Figure 6: Custom-trained NER model identifying technologies

Topic Trend Analysis

Figure 7: Visualization of technology trends based on entities extracted by custom NER models

Semantic Search Evolution

From basic semantic search to advanced LLM-powered retrieval:

Figure 8: Early semantic search demo (2019) showing contextual document retrieval

Figure 9: LLM-powered semantic search interface with context-aware relevance ranking

Entity Linking & Knowledge Graph

Our system could identify, extract, and link entities to build a comprehensive knowledge graph:

Figure 10: Entity recognition interface showing automated identification of key technologies

Figure 11: Interactive exploration of newly mined entities and their relationships

LLM Applications

Early adoption of language models for various innovation tasks:

Figure 12: Early Q&A system prototype for innovation research (pre-ChatGPT)

Figure 13: Human-AI collaborative ideation platform for generating novel innovation concepts

Figure 14: LLM-based forecasting system for predicting technology trajectory

Impact & Scale

Scaled the platform’s knowledge base by roughly five orders of magnitude, from a few thousand curated entries to over 250 million documents. Implemented core AI search and extraction capabilities that remain in production use. Pioneered practical LLM applications within the organization years before mainstream adoption.

★★★★★ (5.0) “There isn’t any problem that Klim cannot figure out. Simply exceptional. […] If you are looking for an unreasonably productive superstar, rest assured that Klim will exceed your highest expectations.”

Source: Upwork — Client

★★★★★ (5.0) “Klim is outstanding. Full stop.”

Source: Upwork — Client

“Klim is a hyperproductive ML researcher with cutting edge NLP skills. […] successfully delivered several projects ranging from open ended AI research to classifiers, encoders, rerankers, information retrieval systems, benchmarking, and prompt engineering. Klim has a great capacity for working independently, and produces high quality models […] optimized for production use.”

Source: LinkedIn Recommendation — Dennis Kashkin, Former Manager
