BBUZZ 22 Sessions
Search
Semantic search using AI-powered vector embeddings of text, where relevancy is measured using a vector similarity function, has been a hot topic for the last few years. As a result, platforms and solutions for vector search have been springing up like mushrooms. Even traditional search engines like Elasticsearch and Apache Solr are riding the semantic vector search wave and now support fast but approximate vector search, a building block for supporting AI-powered semantic search at scale.
Undoubtedly, sizeable pre-trained language models like BERT have revolutionized the state-of-the-art on data-rich text search relevancy datasets. However, the question search practitioners are asking themselves is, do these models deliver on their promise of an improved search experience when applied to their domain? Furthermore, is semantic search the silver bullet which outcompetes traditional keyword-based search across many search use cases? This talk delves into these questions and demonstrates how these semantic models can dramatically fail to deliver their promise when used on unseen data in new domains.
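For readers unfamiliar with the vector-similarity part mentioned above, here is a minimal, self-contained sketch (not tied to any particular engine in the talk) of scoring relevancy as the cosine similarity between an embedded query and embedded documents; the toy vectors are invented purely for illustration.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: 1.0 means identical direction, 0.0 means orthogonal."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy embeddings; a real system would obtain these from a text-embedding model.
query_vec = np.array([0.2, 0.7, 0.1])
doc_vecs = {
    "doc_about_cats":   np.array([0.25, 0.65, 0.05]),
    "doc_about_stocks": np.array([0.90, 0.05, 0.40]),
}

# Rank documents by similarity to the query, highest first.
ranked = sorted(doc_vecs.items(),
                key=lambda kv: cosine_similarity(query_vec, kv[1]),
                reverse=True)
for doc_id, vec in ranked:
    print(doc_id, round(cosine_similarity(query_vec, vec), 3))
```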
"Apache Solr 9.0 might be among the most anticipated release for the project in the last decade for Solr. For folks who don't follow the project very closely, the list of changes is a lot to comprehend and digest. This talk would make that process easy for the developers by highlighting some key aspects of the 9.0 release. During this talk, I'd cover the migration of the Solr build system to Gradle and what it means for developers who work with Solr. I will also talk about updates to modules like the movement of HDFS into a non-core plugin and the removal of auto-scaling framework, CDCR, and DIH. In addition, this talk would also showcase some of the key security, scalability, and stability improvements that Solr 9.0 brings to the users. At the end of this talk, the attendees would have a better understanding of the Solr 9.0 release and a high level road map for the project allowing them to plan better."
An online e-commerce search engine is easy to put in place. Scaling it to serve millions of users, adding a marketplace providing thousands of products, and supporting multiple offers, prices and stocks for the same product are additional challenges that are more difficult to address. And what if, in addition, you mix your online search engine with the activity of thousands of physical stores? In this talk we explain how we addressed all these challenges in the context of the largest retail group and online grocery store in France. The constraint of multiple physical stores backed by the online search engine introduces additional challenges that we emphasize and address in detail. Our point of view, as we explain the challenges and solutions, is both technical and functional.
Search is a vital part of the online experience and for many brands a key way to interact with their customers. Yet search results are too often derived from data collected by trackers and analytics, tools that disrespect human rights and GDPR or CCPA regulations. In this talk, we'll outline the negative impact of tracking while exploring alternative solutions that actively protect privacy without detracting from the search experience.
Key takeaways:
- Learn the key principles of a privacy-first platform architecture
- Explore high-demand performance stability in a data-protected environment
- Liberty of liability: look, but don’t touch personal data
This talk is sponsored by Empathy.co
When talking about relevance in search, it often sounds like it is a thing, something that can be touched and seen. Nevertheless, that is not the case. What do I mean by that? In this talk, I will provide some examples of how relevance is often merely seen as a score when it can be, in fact, an engaging relationship where the user and the search UI connect in aesthetic and enjoyable ways. I will present numerous examples of innovative search experiences that challenge prevailing schemas and structures and lead instead to elements of motion and correlated visual action that allow us to perceive the beauty of relevancy on a different level - because relevance is a matter of perception.
The ubiquity of public cloud platforms has made it easy to offload the operational overhead of maintaining on-premise systems and leverage the ability to scale these systems on demand in a matter of minutes. But architecting secure, scalable systems in the public cloud comes with its own challenges. This problem is further complicated when you are migrating from an on-premise system. Such migrations often require infrastructure to operate in a hybrid state where some parts of the system have been migrated to the cloud while the remaining components continue to run on-premise. We must also ensure that the migration is invisible to the user and there is no impact to the overall availability of the system during this transition. Recently, Box Search underwent such a migration for our Solr indexing pipeline and document store, which involved migrating hundreds of terabytes of customer data from on-premise to GCP. In this talk we present the overall system architecture, the migration process and some of the challenges we encountered when running this system in a hybrid state.
Developers usually approach Apache Lucene with a black-box mindset: queries go in, ranked search results come out. Most of us start simple with term queries, then move to boolean queries made up of many sub-queries. It doesn't take long for a well-intentioned search engineer to end up delivering painfully slow search queries returning horrifyingly irrelevant results. In this talk, I will walk us through the internal execution of a Lucene search query from start to finish. First, we'll learn the internal data structures unique to Lucene. We'll go over inverted indices, columnar storage, Finite State Transducers, and more. We'll dive into how they're optimized and stored for maximum performance. Then we'll put our newly acquired knowledge to the test. Starting with an IndexSearcher, we'll see how Lucene optimizes our query through query rewriting. Next, we'll see how concurrent query execution takes advantage of the "embarrassingly parallel" task of iterating over index segments. Finally, we'll learn how result collection, ranking, and re-ranking of TopN search results give us what we were looking for in a nicely organized list. After this talk, you will better understand and appreciate the moving parts behind modern search engines. You can bring this knowledge to your work, creating faster and more relevant search results of your own.
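As a rough mental model of the ideas in this abstract (a toy sketch, not Lucene's actual code), here is a tiny inverted index where a term query iterates its postings and a fixed-size min-heap keeps the TopN scored hits, roughly what a TopN collector does over each segment:

```python
import heapq
from collections import defaultdict

# Build a toy inverted index: term -> list of (doc_id, term_frequency).
docs = {0: "fast search engine", 1: "search results ranking", 2: "fast ranking"}
index = defaultdict(list)
for doc_id, text in docs.items():
    counts = defaultdict(int)
    for term in text.split():
        counts[term] += 1
    for term, tf in counts.items():
        index[term].append((doc_id, tf))

def top_n(term: str, n: int = 2):
    """Score the postings for one term, keeping only the best n hits in a min-heap."""
    heap = []  # (score, doc_id); the smallest score sits at heap[0]
    for doc_id, tf in index.get(term, []):
        score = float(tf)  # stand-in for a real scoring function such as BM25
        if len(heap) < n:
            heapq.heappush(heap, (score, doc_id))
        elif score > heap[0][0]:
            heapq.heapreplace(heap, (score, doc_id))
    return sorted(heap, reverse=True)

print(top_n("search"))
```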
Language models have drawn a lot of attention in NLP in recent years. Despite their short history of development, they have been employed and have delivered astonishing performance in all sorts of NLP tasks, such as translation, question answering, information extraction and intelligent search. However, we should not forget that giant language models are not only data hungry, but also energy hungry. State-of-the-art language models such as BERT, RoBERTa and XLNet process millions of parameters, which is only possible with the help of dozens of sophisticated and expensive chips. The CO2 generated in the process is also massive. Being responsible for such high energy consumption is not easy in times of climate change. In order for companies to benefit from the performance of state-of-the-art language models without putting too much strain on their computing costs, the models used must be reduced to a minimum. Of course, performance should not suffer as a result. One possible means to achieve this is so-called knowledge distillation, which is a common technique among model compression methods. In this presentation, we will show you how you can use knowledge distillation to generate models that achieve performance comparable to state-of-the-art language models, effectively and in a resource-saving manner.
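As a hedged illustration of the distillation objective itself (a generic sketch, not the speakers' setup), the student model is typically trained to match the teacher's softened output distribution in addition to the ground-truth labels:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature: float = 2.0, alpha: float = 0.5):
    """Blend of (a) KL divergence between softened teacher/student distributions
    and (b) ordinary cross-entropy against the hard labels."""
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    kd = F.kl_div(soft_student, soft_targets, reduction="batchmean") * temperature ** 2
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1 - alpha) * ce

# Toy batch: 4 examples, 3 classes; real logits would come from a teacher and a student model.
student = torch.randn(4, 3, requires_grad=True)
teacher = torch.randn(4, 3)
labels = torch.tensor([0, 2, 1, 0])
print(distillation_loss(student, teacher, labels))
```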
Matscholar (Matscholar.com) is a scientific knowledge search engine for materials science researchers. We have indexed information about materials, their properties, and the applications they are used in for millions of materials by text mining the abstracts of more than 5 million materials science research papers. Using a combination of traditional and AI-based search technologies, our system extracts the key pieces of information and makes it possible for researchers to do queries that were previously impossible. Matscholar, which utilizes Vespa.ai and our own bespoke language models, greatly accelerates the speed at which energy and climate tech researchers can make breakthroughs and can even help them discover insights about materials and their properties that have gone unnoticed.
The combination of big data and deep learning has fundamentally changed the way we approach search systems, allowing us to index audio, images, video, and other human-generated data based on an embedding vector instead of an auxiliary description. These advancements are backed by new and oftentimes increasingly complex machine learning (ML) models, leading to an even wider research-to-industry gap despite the introduction of MLOps platforms and a variety of model hubs. We summarize some of the challenges facing practical machine learning in 2022 and beyond as follows: 1) many ML applications require a combination of multiple models, leading to a lot of overly complex and difficult-to-maintain auxiliary code, 2) many engineers are unfamiliar with ML and/or data science, making it difficult for them to train, test, and integrate ML models into existing infrastructure, and 3) constant architectural updates to SOTA deep learning models create significant overhead when deploying said models in production environments. In this talk, we discuss lessons learned from building an open-source (https://github.com/towhee-io/towhee) and scalable framework for generating embedding vectors purpose-built to tackle the above challenges. Early on, we communicated with dozens of industry partners to understand their application(s) and architected our platform around their requirements. This open source project is currently being used by 3 major corporations ($10B+ market value) and a number of small- and mid-size startups in proof-of-concept and production systems.
Are you familiar with the following scenario? You're running your Apache Spark app on EMR, and the log file gets pretty heavy. You try to open it through the AWS UI, or download it straight to your computer. You end up connecting to the server running your driver or any of your executors, relentlessly searching your logs while simultaneously looking at Ganglia and the Spark UI for additional logs and metrics. If you are, this talk is exactly for you. Let me tell you how we made it all easy with just some bootstrap actions, some bash scripts, Beats and Elastic. Customizable per-app logging, with less searching of big log files and more looking into useful Kibana dashboards. This architecture is not a nice-to-have, it's essential.
Learn how the Data Infrastructure team at Brandwatch rearchitected a group of their current Solr clusters and took a new approach in an unconventional manner. By splitting up the reads and writes, experimenting with Solr plugins, using S3, an application written in Rust and adopting the Solr Operator to spin up a cluster on Kubernetes, we were able to achieve our goal of having a cloud-based cluster which comfortably serves 26bn+ documents. You'll understand the whys of our approach, things we discovered, what we have planned, and why rearchitecting things can be a difficult and strenuous task.
Search engine technologies, like OpenSearch, have continued to grow in popularity for a number of different use cases. Features like full-text search, fast ingestion, scalability, faceting, and extensible plugin frameworks were often enhanced with the aim of improving the search use case. However, the side effect of these improvements provided much of the foundation that led people to adopt these technologies for other uses like click stream analytics, log analytics, security analytics, and more. In this talk we will explore how features that started as search enhancements opened the door for new use cases and why we continue to see affinity between search engines and broader analytics workloads.
Implementing a machine learning model for ranking in e-commerce search requires a well-designed approach to how the target metric is defined. In our team we validate our target metrics with online tests on live traffic. This requires both long preparation times and long enough runtimes to yield valid results. Having to choose only a few candidates for the next A/B test is hard and slows us down significantly. So what if we had a way to evaluate the candidates beforehand to make a more informed decision? We came up with an approach to predict how a certain ranking will perform in an onsite test. We leverage historic user interaction data from search events and try to correlate them with ranking metrics like NDCG. This gives us insights into how well the ranking meets the user intent. This is not meant to be a replacement for a real A/B test, but it allows us to narrow down the field of candidates to a manageable number. In this talk we will share our approach to offline ranking validation and how it performed in practice.
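For reference, a minimal implementation of the kind of ranking metric mentioned above (NDCG) might look like this; a generic sketch, not the team's production code:

```python
import math

def dcg(relevances):
    """Discounted cumulative gain for a ranked list of graded relevances."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))

def ndcg(ranked_relevances, k=None):
    """NDCG = DCG of the ranking divided by DCG of the ideal ordering."""
    if k is not None:
        ranked_relevances = ranked_relevances[:k]
    ideal_dcg = dcg(sorted(ranked_relevances, reverse=True))
    return dcg(ranked_relevances) / ideal_dcg if ideal_dcg > 0 else 0.0

# Graded relevance labels (e.g. derived from historic clicks or purchases),
# listed in the order a candidate ranker would show the results.
print(ndcg([3, 2, 0, 1], k=4))   # candidate ranking
print(ndcg([3, 2, 1, 0], k=4))   # the ideal ordering scores 1.0
```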
Defining the KPIs, keeping an eye on customer satisfaction and sales, defining the backlog, configuring the search engine, debugging relevance issues, preventing regressions … These are a few tasks on the list of a search engine administrator. A search engine is a living thing. Seasonality, levels of stock, lifecycle of the products, marketing events, news, etc. are a few of the many factors that force the search engine to constantly evolve. In this context, the life of a search engine manager is tough. In this talk we describe the processes and tools that we put in place to help manage a search engine. We also address the limits between what can be automated and what still needs human supervision.
Elasticsearch (or OpenSearch) clusters likely need to scale to adapt to changes in load. But autoscaling Elasticsearch isn't trivial: indices and shards need to be well sized and well balanced across nodes. Otherwise the cluster will have hotspots and scaling it further will be less and less efficient. This talk focuses on two aspects:
- best practices around scaling Elasticsearch for logs and other time-series data
- how to apply them when deploying Elasticsearch on Kubernetes
In the process, a new (open-source) operator will be introduced (yes, there will be a demo!). This operator will autoscale Elasticsearch while keeping a good balance of load. It does so by changing the number of shards in the index template and rotating indices when the number of nodes changes.
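To make that mechanism concrete, here is a hedged sketch (plain REST calls via `requests`; the cluster URL, index names and thresholds are placeholders, not the operator's actual logic) of changing the shard count in an index template and then rolling the write alias over so the change takes effect on the next index:

```python
import requests

ES = "http://localhost:9200"     # placeholder cluster address
DESIRED_SHARDS = 6               # e.g. derived from the current node count

# 1) Update the composable index template so future indices get the new shard count.
template = {
    "index_patterns": ["logs-*"],
    "template": {"settings": {"number_of_shards": DESIRED_SHARDS,
                              "number_of_replicas": 1}},
}
requests.put(f"{ES}/_index_template/logs-template", json=template).raise_for_status()

# 2) Roll over the write alias: a fresh index is created from the template and becomes
#    the write index, while older indices stay read-only behind the same alias.
resp = requests.post(f"{ES}/logs-write/_rollover",
                     json={"conditions": {"max_primary_shard_size": "30gb"}})
resp.raise_for_status()
print(resp.json())
```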
The National Audiovisual Institute (INA) is a repository of all French audiovisual archives, responsible for archiving over 180 radio and television services, 24/7, since 1995. The generated metadata describing this content currently represents the equivalent of over 50 million documents (e.g. images, audio and video fragments, text excerpts, etc.). Due to the heterogeneity of the content, the data model is directly inspired by the conceptual models of cultural heritage, represented by a large graph with complex relations between generic entities. The challenge in building a global search engine for this particular use case is twofold: on one hand, the capacity to index and keep the entire set of documents updated in a reasonable amount of time, and on the other hand the implementation of complex full-text search capabilities with high performance. Our talk describes the key choices for the graph representation, facilitating the indexing process of the documents, as well as the technical framework set up around Elasticsearch, implementing dedicated search APIs required by different functional areas. We also briefly mention the implementation optimisations that led to fully processing the 50 million documents in less than 48 hours, for an equivalent of an 800GB Elasticsearch index.
Location is an important decision-making factor for many end users. Hotel aggregators, job search portals and property listing companies all filter out results that are too far away. If the results page shows locations that are hard to reach, conversion rates will plummet. If you’re quality-scoring results based on straight-line distance, you’re not personalising your results page as well as you could be. That’s because we never truly travel in a straight line; instead we’re at the mercy of the transport networks around us. Distance never considers the context of accessibility, which is unique to every location around the world. Using distance impacts search result ranking because:
- It doesn’t acknowledge that long distances in quiet rural areas are easier to travel vs. congested urban areas
- It ignores that some locations are situated on fast transport routes – they could appear far away but they may be really easy to access depending on the local infrastructure
- Local geography can massively impact accessibility – mountains, rivers and beaches all provide accessibility challenges
Using real world examples I’ll discuss how to integrate travel times into your recommendation model and what the effects are for businesses and end users. I’ll also discuss how the presence of transport data on search result listings helps reduce cognitive load when users are making a decision. I’ll end with a quick demo showing how to build it into your recommendation engine.
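As a toy illustration of the contrast described above (the coordinates and travel-time numbers below are invented; a real system would call a travel-time API or matrix), ranking the same candidates by straight-line distance and by travel time can produce different orderings:

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Straight-line ('as the crow flies') distance in kilometres."""
    r = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

user = (51.5074, -0.1278)  # central London, purely for illustration
candidates = [
    # name, lat, lon, hypothetical door-to-door travel time in minutes
    ("hotel_near_fast_rail",  51.65, -0.30, 25),
    ("hotel_across_the_river", 51.49, -0.12, 40),
]

by_distance = sorted(candidates, key=lambda c: haversine_km(*user, c[1], c[2]))
by_travel_time = sorted(candidates, key=lambda c: c[3])

print("ranked by distance:   ", [c[0] for c in by_distance])
print("ranked by travel time:", [c[0] for c in by_travel_time])
```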
Context: In the past year, interest in neural search and vector search engines has increased heavily. They promise to solve multi-modal, cross-modal and semantic search problems with ease. However, when quickly trying neural search with off-the-shelf pre-trained models, the results are quite disillusioning: the models lack knowledge about the data at hand. In order to explicitly solve model finetuning for search problems, we implemented an open-source finetuner. It is directly usable with several vector databases due to the underlying data structure. Presentation: In our talk we present our methodology and performance on an example dataset. Afterwards, we show how well the approach transfers to other datasets, such as deepfashion, geolocation geoguessr and more. It will give hands-on guidance on how you can finetune a model in order to make your data better searchable.
If you want to expand your query/documents with synonyms in Apache Lucene, you need to have a predefined file containing the list of terms that share the same semantics. It's not always easy to find a list of basic synonyms for a language and, even if you find one, it doesn’t necessarily match your contextual domain. The term "daemon" in the domain of operating system articles is not a synonym of "devil" but is closer to the term "process". Word2Vec is a two-layer neural network that takes a text as input and outputs a vector representation for each word in the dictionary. Two words with similar meanings are identified with two vectors close to each other. This talk explores our contribution to Apache Lucene that integrates this technique with the text analysis pipeline. We will show how you can automatically generate synonyms on the fly from an Apache Lucene index and how you can use this new feature along with Apache Solr, with practical examples!
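The Lucene contribution does this inside the analysis chain; purely as an illustration of the underlying idea (a sketch with gensim on an invented toy corpus, with an arbitrary threshold), nearest neighbours in Word2Vec space become synonym candidates:

```python
from gensim.models import Word2Vec

# Tiny toy corpus; a real deployment would train on the indexed documents themselves.
corpus = [
    ["the", "daemon", "process", "runs", "in", "the", "background"],
    ["kill", "the", "background", "process", "with", "a", "signal"],
    ["the", "daemon", "restarts", "the", "crashed", "process"],
] * 100  # repeat so the toy model has something to learn from

model = Word2Vec(sentences=corpus, vector_size=50, window=3, min_count=1, epochs=20)

# Nearest neighbours in vector space become synonym candidates for query expansion.
for term, score in model.wv.most_similar("daemon", topn=3):
    if score > 0.5:  # arbitrary cut-off, for illustration only
        print(f"daemon -> {term} ({score:.2f})")
```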
Since version 3 of Apache Lucene and Solr, and from the early beginning of Elasticsearch, the general recommendation has been to use MMapDirectory as the implementation for index access on disk. But why is this so important? This talk will first introduce the audience to the technical details of memory mapping and why using other techniques slows down index access by a significant amount. Of course we no longer need to talk about 32/64-bit Java VMs - everybody now uses 64 bits with Elasticsearch and Solr - but with current Java versions, Lucene still has some 32-bit-like limitations on accessing the on-disk index with memory mapping. We will discuss those limitations, especially with index sizes growing up to terabytes, and afterwards, Uwe will give an introduction to the new Java Foreign Memory Access API (JEP 370, JEP 383, JEP 393, JEP 412, JEP 419), which first appeared with Java 14 but is still incubating. This talk will give an overview of the foreign memory API to be finalized and released to general availability in Java 19 and will present the current state of implementation in Lucene 10. Uwe will show how future versions of Lucene will be backed by next-generation memory mapping and what needs to be done to make this usable in Solr and Elasticsearch - bringing you memory mapping for indexes with tens or maybe hundreds of terabytes in the future!
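As a language-agnostic illustration of the concept only (Python's built-in mmap module, not Lucene's MMapDirectory or the Java APIs discussed in the talk), memory mapping hands random access over to the OS page cache instead of explicit seek/read calls:

```python
import mmap
import os
import tempfile

# Write a small file to play with; a Lucene index file would be orders of magnitude larger.
path = os.path.join(tempfile.gettempdir(), "demo.bin")
with open(path, "wb") as f:
    f.write(b"x" * 1_000_000)

with open(path, "rb") as f:
    # The OS maps the file into virtual memory; pages are loaded lazily on first access
    # and kept in the page cache, so repeated random reads avoid extra system calls.
    with mmap.mmap(f.fileno(), length=0, access=mmap.ACCESS_READ) as mm:
        print(mm[123_456:123_466])   # random access without an explicit seek()/read()
        print(len(mm))
```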
The first integrations of machine learning techniques with search made it possible to improve the ranking of your search results (Learning to Rank) - but one limitation has always been that documents had to contain the keywords that the user typed in the search box in order to be retrieved. For example, the query “tiger” won’t retrieve documents containing only the terms “panthera tigris”. This is called the vocabulary mismatch problem, and over the years it has been mitigated through query and document expansion approaches. Neural search is an Artificial Intelligence technique that allows a search engine to reach those documents that are semantically similar to the user’s query without necessarily containing those terms; it avoids the need for long lists of synonyms by automatically learning the similarity of terms and sentences in your collection through the utilisation of deep neural networks and numerical vector representations. This talk explores the first official Apache Solr contribution on this topic, available from Apache Solr 9.0. During the talk we will give an overview of neural search (don’t worry - we will keep it simple!): we will describe vector representations for queries and documents, and how Approximate K-Nearest Neighbor (KNN) vector search works. We will show how neural search can be used along with deep learning techniques (e.g., BERT) or directly on vector data, and how we implemented this feature in Apache Solr, giving usage examples! Join us as we explore this new exciting Apache Solr feature and learn how you can leverage it to improve your search experience!
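To give a feel for the feature (a sketch only; field names, the toy vector dimension and the Solr URL are placeholders, and the exact syntax should be checked against the Solr 9 reference guide), a dense vector field can be declared via the Schema API and queried with the `knn` query parser:

```python
import requests

SOLR = "http://localhost:8983/solr/mycollection"   # placeholder collection URL

# 1) Add a dense vector field type and field via the Schema API.
schema = {
    "add-field-type": {
        "name": "knn_vector",
        "class": "solr.DenseVectorField",
        "vectorDimension": 4,                 # toy dimension, for illustration
        "similarityFunction": "cosine",
    },
    "add-field": {"name": "embedding", "type": "knn_vector",
                  "indexed": True, "stored": True},
}
requests.post(f"{SOLR}/schema", json=schema).raise_for_status()

# 2) Index a couple of documents carrying embeddings (normally produced by a model).
docs = [
    {"id": "1", "title_s": "tiger",        "embedding": [0.1, 0.9, 0.0, 0.2]},
    {"id": "2", "title_s": "stock market", "embedding": [0.8, 0.1, 0.5, 0.0]},
]
requests.post(f"{SOLR}/update?commit=true", json=docs).raise_for_status()

# 3) Approximate KNN query: the top-2 documents closest to the query vector.
params = {"q": "{!knn f=embedding topK=2}[0.1, 0.85, 0.05, 0.2]",
          "fl": "id,title_s,score"}
print(requests.get(f"{SOLR}/select", params=params).json())
```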
Over the decades, information retrieval has been dominated by classical methods such as BM25. These lexical models are simple and effective yet vulnerable to vocabulary mismatch. With the introduction of pre-trained language models such as BERT and its relatives, deep retrieval models have achieved superior performance with their strong ability to capture semantic relationships. The downside is that training these deep models is computationally expensive, and suitable datasets are not always available for fine-tuning toward the target domain. While deep retrieval models work best on domains close to what they have been trained on, lexical models are comparatively robust across datasets and domains. This suggests that lexical and deep models can complement each other, retrieving different sets of relevant results. But how can these results effectively be combined? And can we learn something from language models to learn new indexing methods? This talk will delve into both these approaches and exemplify when they work well and not so well. We will take a closer look at different strategies to combine them to get the best of both, even in zero-shot cases where we don't have enough data to fine-tune the deep model.
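One widely used way to combine a lexical and a dense result list (named here explicitly as an example; the talk may cover other strategies as well) is reciprocal rank fusion, which only needs the two rankings rather than comparable scores:

```python
def reciprocal_rank_fusion(rankings, k: int = 60):
    """Fuse several ranked lists of doc ids; k dampens the influence of the very top ranks."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

bm25_results = ["d3", "d1", "d7", "d2"]    # lexical ranking (e.g. BM25)
dense_results = ["d7", "d5", "d3", "d9"]   # embedding-based ranking
print(reciprocal_rank_fusion([bm25_results, dense_results]))
```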
Bringing a multimodal experience into the search journey has become of high interest lately: searching images with text, or looking inside an audio file and combining that with the RGB frames of a video stream. Today, vector search algorithms (like FAISS, HNSW, BuddyPQ) and databases (Vespa, Weaviate, Milvus and others) make these experiences a reality. But what if you as a user would like to stay with the familiar Elasticsearch / OpenSearch AND leverage vector search at scale? In this talk we will take a hardware acceleration route to build a vector search experience over products and will show how you can blend the worlds of neural search with symbolic filters. We will discuss use cases where adding multimodal and multilingual vector search will improve recall, and compare results from Elasticsearch/OpenSearch with and without the vector search component using tools like Quepid. We will also investigate different fine-tuning approaches and compare their impact on different quality metrics. We will demonstrate our findings using our end-to-end search solution Muves, which combines traditional symbolic search with multimodal and multilingual vector search and includes an integrated fine-tuner for easy domain adaptation of pre-trained vector models.
What can the hit game Wordle teach us about Information Retrieval, Search and AI/ML? As it turns out, quite a bit! We'll use the Wordle game as our example "text problem" we want to solve, and run through many of the key concepts you need to get started with AI and ML for text. We'll see (with code!) how some common text-related statistics work, and how they can be used to solve (cheat...) Wordle. Then, we'll build ourselves an AI to do the same. Finally, we'll see how that compares to brute-forcing it with regular expressions! We won't solve all your text-related problems, but hopefully you'll learn the key concepts you need for more advanced talks. And if nothing else, you'll understand the python code for an AI to help you win Wordle!
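In the spirit of the talk, here is a tiny sketch (toy word list, not the speaker's code) of two of the ideas mentioned: letter-frequency statistics to pick a strong guess, and a regular expression to brute-force the remaining candidates after feedback:

```python
import re
from collections import Counter

words = ["crane", "slate", "spine", "crate", "shine"]   # stand-in for a real word list

# Statistic: score each word by the summed frequency of its distinct letters.
letter_freq = Counter("".join(words))
def score(word):
    return sum(letter_freq[c] for c in set(word))

best_guess = max(words, key=score)
print("best opening guess:", best_guess)

# Brute force: after feedback, keep words matching a pattern (second letter known to
# be 'r') and drop words containing letters already eliminated.
pattern = re.compile(r"^.r...$")
eliminated = set("sl")
candidates = [w for w in words if pattern.match(w) and not (set(w) & eliminated)]
print("remaining candidates:", candidates)
```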
Search and ranking are part of many important features on the Yelp platform - from looking for a plumber to showing relevant photos of the dish you search for. These varied use cases led to the creation of Yelp’s Elasticsearch-based ranking platform, which we presented at Berlin Buzzwords 2019, allowing real-time indexing, learning-to-rank, and lower maintenance overhead, as well as enabling access to search functionality for more teams at Yelp. We recently built Nrtsearch, a Lucene-based search engine, to replace Elasticsearch. We have open sourced this search engine under the Apache 2.0 license. This talk will detail:
- Challenges associated with scaling Elasticsearch costs and performance: mainly issues related to the document-based replication approach, difficulties with real-time auto scaling of Elasticsearch, and inefficient usage of resources due to hot and cold node issues.
- Architecture of Nrtsearch: it uses Lucene’s near-real-time (NRT) segment replication in a primary-replica architecture - the primary does all writing, including segment merges, while replicas simply copy over segments using Lucene's NRT APIs and serve search queries. Cluster orchestration, availability and management of nodes is left to systems like Kubernetes that excel at resource management and scheduling. It is a truly stateless architecture, deployed as a standard microservice using Kubernetes; state is committed to S3, and upon a restart of a primary or replica, the most recent state from S3 is pulled down.
- Benefits of this architecture: performance increased by up to 50%, cluster costs lowered by up to 50%, and use of standard tools (Kubernetes) to manage operational aspects of the cluster, relieving ranking infrastructure teams to focus on search-related problems.
- Challenges involved in rolling this out to production: Lucene’s segment replication approach and the code itself are not widely used in the industry, so they had some rough edges - and exciting performance bugs!
- Future work: enhance feature support via extensible plugins like vector embeddings, and continue to simplify and open source deployment tooling to help others deploy Nrtsearch in their own cloud environments.
Store
Optimization problems are everywhere, from deciding which clothes to pack in our luggage (aka the knapsack problem), to selecting the tasks that will be worked on during a sprint. Trying to solve these types of problems by hand is a tedious task often resulting in sub-optimal decisions. In this talk, we'll understand how PostgreSQL recursive queries can help. Starting from the proper problem definition, we'll then explore how to build queries that call themselves recursively, what risks are associated with this approach, and what safeguards we can set to optimise performance. Finally, we'll demonstrate how two new features released in PostgreSQL 14 enable easier handling of recursive statements. If you're into PostgreSQL and eager to understand how recursion works, this session is for you!
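As a small taste of the technique (a generic example, not the talk's exact queries; table names and connection details are placeholders), a `WITH RECURSIVE` query can walk a task-dependency hierarchy in a single statement:

```python
import psycopg2

SQL = """
WITH RECURSIVE task_tree AS (
    -- anchor member: start from the top-level tasks
    SELECT id, parent_id, name, 1 AS depth
    FROM tasks
    WHERE parent_id IS NULL
  UNION ALL
    -- recursive member: join children onto what has been found so far
    SELECT t.id, t.parent_id, t.name, tt.depth + 1
    FROM tasks t
    JOIN task_tree tt ON t.parent_id = tt.id
    WHERE tt.depth < 10            -- safeguard against accidental cycles
)
SELECT * FROM task_tree ORDER BY depth, id;
"""

with psycopg2.connect("dbname=demo user=demo") as conn:   # placeholder DSN
    with conn.cursor() as cur:
        cur.execute(SQL)
        for row in cur.fetchall():
            print(row)
```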
There are more data tools available than ever before, and it's easier to build a pipeline than it's ever been. This has resulted in an explosion of innovation, but it also means that data within today's organizations has become increasingly distributed. It can't be contained within a single brain, a single team, or a single platform. Data lineage can help by tracing the relationships between datasets and providing a map of your entire data universe. OpenLineage provides a standard for lineage collection that spans multiple platforms, including Apache Airflow, Apache Spark, Flink, and dbt. This empowers teams to diagnose and address widespread data quality and efficiency issues in real time. In this session, Julien Le Dem from Datakin will show how to trace data lineage across Apache Spark and Apache Airflow. He will walk through the OpenLineage architecture and provide a live demo of a running pipeline with real-time data lineage.
Real-time analytics has transformed the way companies do business. It has unlocked the ability to make real-time decisions around customer incentives, business metrics and fraud detection, and to provide a personalized user experience that accelerates growth and user retention. This is a complex problem and, naturally, there are several OLAP (Online Analytical Processing) solutions out there, each focusing on a different aspect. In order to support all such use cases, we need an ideal OLAP platform that has the ability to support extremely high query throughput with low latency and at the same time provide high query accuracy – in the presence of data duplication and real-time updates. In addition, the same system must be able to ingest data from all kinds of data sources, handle unstructured data and real-time upserts. While there are different ways of solving each such problem scenario, ideally we want one unified platform that can be easily customized. In this talk, we will go over the rich capabilities of Apache Pinot that make it an ideal OLAP platform.
Over the last few years there has been a rise of new Headless Architectures, often based on GraphQL. These architectures are designed to be flexible and scalable, whether it's a CMS, e-commerce platform or API. But often the GraphQL APIs they provide deliver just part of the data that you need to build rich, interactive applications for your users. Luckily, there are several ways to combine Headless GraphQL Architectures to enrich their data. In this talk I'll show how to enrich data in multiple Headless GraphQL Architectures, and compare patterns like Schema Stitching and GraphQL Federation.
More and more people are moving from old-school relational databases to a variant of NoSQL. While starting a green-field project with a document database is easy, migrating from one to the other can be a different story. Simply porting SQL tables to a collection might cause you more harm than good. In this talk, attendees will learn about the basic concepts of document databases, such as documents and collections. They will then learn about some of the standard data schemas available. Finally, the speaker will show real-life examples of data migration and how they can be applied to adopt a new NoSQL database.
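One of the standard schema changes when moving from tables to documents is embedding child rows inside their parent; a minimal sketch (table and field names invented for illustration):

```python
# Rows as they might come from two relational tables joined by order_id.
orders = [{"order_id": 1, "customer": "Ada", "total": 42.5}]
order_items = [
    {"order_id": 1, "sku": "BOOK-1", "qty": 1, "price": 30.0},
    {"order_id": 1, "sku": "PEN-7",  "qty": 5, "price": 2.5},
]

def to_documents(orders, order_items):
    """Embed each order's line items inside the order, yielding one self-contained
    document per order instead of rows spread over two tables."""
    items_by_order = {}
    for item in order_items:
        items_by_order.setdefault(item["order_id"], []).append(
            {k: v for k, v in item.items() if k != "order_id"})
    return [{**order, "items": items_by_order.get(order["order_id"], [])}
            for order in orders]

for doc in to_documents(orders, order_items):
    print(doc)   # ready to insert into a document collection
```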
Scale
In your ever-changing infrastructure, some changes are intentional while others are not. Infrastructure Drift can happen for many reasons: sometimes it happens when adding or removing resources, other times when changing resource definitions, upon resource termination or failure, and even when changes have been made manually or via other automation tools. When something is changed intentionally, it will appear in the source code and should not raise any alarm. However, if any part of the infrastructure has been changed manually, there are tools that can identify this and alert on the change. In other words, if your IaC drifted from its expected state, then you can in fact detect it. Applying simple solutions can empower DevOps and developer velocity, with the reassurance and context for unexpected changes in your IaC, in near real-time. This talk will showcase real-world examples, and practical ways to apply this in your production environments, while doing so safely and at the pace of your engineering cycles. Drift is what happens whenever the real-world state of your infrastructure differs from the state defined in your configuration.
Spark is a trending technology used by many companies for large-scale data analytics. On a first try, companies usually use the cloud provider's managed solution to speed up their time to market, but once Spark is broadly embraced by more teams in the company and the solution needs to support multiple cloud providers, Kubernetes adoption appears, and the journey to make it happen is worth sharing to inspire others in the same situation. In this talk the audience will learn some of the benefits of migrating from AWS EMR to Spark on Kubernetes, from an operability point of view (reliability, portability, scalability), through observability, and finally reviewing efficiency and costs. This talk is based on a real use case that three teams at Empathy.co worked on for six months to make their solution more agnostic, with minimal cloud dependencies.
CI/CD brings tremendous value to development teams. The rapid availability of feedback helps developers make informed decisions about their design choices and lets teams deploy with confidence. But when systems become large and test times go from seconds to hours, how do we get our groove back? In this talk, we’ll explore strategies for validating large, complex systems, such as:
- Setting well-defined component boundaries
- Flexibly modeling dependencies between these components
- Ranking tests by cost versus value
- Testing in production with canary launches and feature flags
These and similar techniques let us minimize test times, maximize confidence, and free our teams up to focus on delivering value to customers.
Kubernetes can be hard. Not only the initial learning and understanding of the concepts, but also the aspect of keeping an overview of what is happening in and around the cluster can be challenging. How can you quickly and easily tell if the cluster is healthy, well utilised and if all applications are running fine? This talk intends to look at the various aspects of Kubernetes observability and to introduce and compare multiple open source tools to achieve that. The range of tools covers different observability levels and requirements of different user groups. It starts with tools simply querying the Kubernetes API and delivering the outputs in an easy-to-understand UI, goes over the possibilities of service meshes, and ends with application-side logging and monitoring. For each level of observability the user has to pay a certain price in terms of configuration and runtime overhead. In turn, the quality and depth of the information is different. The intended take-away is to get a feeling for which type of tooling is the right one for a given purpose. Most options will be shown in a live demonstration.
When building our Kubernetes-native product, we wanted to find the most common sources of failures, anti-patterns and root causes for Kubernetes outages, so we got to work. We rolled up our sleeves and read 100+ Kubernetes post-mortems. This is what we discovered. A smart person learns from their own mistakes, but a truly wise person learns from the mistakes of others. When launching our product, we wanted to learn as much as possible about typical pains in our ecosystem, and did so by reviewing many post-mortems (100+!) to discover the recurring patterns, anti-patterns, and root causes of typical outages in Kubernetes-based systems. In this talk we have aggregated for you the insights we gathered, and in particular will review the most obvious DON’Ts and some less obvious ones, that may help you prevent your next production outage by learning from others’ real world (horror) stories.
It is clearly beneficial for an organization to make data-driven decisions, decentralize access to data processing and empower every team to generate valuable information. There are many ways to achieve these goals, but in an environment of rapid growth, building an accessible Data Platform is just the first step. What happens next determines its long-term success or its dramatic demise. In this presentation, we discuss the main perils of building a platform that processes over 80000 unique datasets built by 1000 people across different teams, how to avoid them, and where to go from there.
Elevator pitch: The use of transfer learning has begun a golden era in applications of ML, but the development of these models “democratically” is still in the dark ages compared to best practices in SWE. I describe how methods of open-source SWE can allow models to be built by a distributed community of researchers. --- Over the past few years, it has become increasingly common to use transfer learning when tackling machine learning problems (e.g. the BERT model on HuggingFace Hub has been downloaded tens of millions of times). However, pre-training often involves training a large model on a large amount of data. This incurs substantial computational (and therefore financial) costs; for example, Lambda estimates that training the GPT-3 language model would cost around $4.6 million. As a result, the most popular pre-trained models are being created by small teams within large, resource-rich corporations. This means that the majority of the research community is excluded from participating in the design and creation of these valuable resources. Here, I elaborate on why we should develop tools that will allow us to build pre-trained models in the same way that we build open-source software. Specifically, models should be developed by a large community of stakeholders who continually update and improve them. Realizing this goal will require porting many ideas from open-source software development to building and training models, which motivates many threads of interesting research and opens up machine learning research for much larger participation.
So you want to deploy a large language model, and keep your latency SLA? NLP adds enormous value to customers, but getting it to work efficiently is fraught with uncertainty and high cost. As transformers and other big neural network architectures make their way into your platform, you may be finding it difficult to get the speed and throughput you need within your budget, or even to understand why it is so expensive. This talk will give an overview of the latency and throughput challenges, and how to solve them. We will go over the product and cost implications as well as the technical improvements that can be used to get things running fast. We will compare solutions and help make sense of difficult-to-understand technology. The audience will walk away with the information they need to decide on the best direction for inference in their production platform. Keywords: MLOps, Inference, Latency
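One of the simpler levers in this space (mentioned here as a generic example, not necessarily the speakers' approach) is post-training dynamic quantization, which converts a model's linear layers to int8 for faster CPU inference; a minimal sketch with a stand-in model:

```python
import torch
import torch.nn as nn

# Stand-in model; in practice this would be a loaded transformer encoder.
model = nn.Sequential(nn.Linear(768, 768), nn.ReLU(), nn.Linear(768, 2)).eval()

# Convert Linear layers to dynamically quantized int8 versions.
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(1, 768)
with torch.no_grad():
    print("fp32 output:", model(x))
    print("int8 output:", quantized(x))
```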
This talk is about creating minimal containers. The author started to dive into Kubernetes and container security some years ago. Minimizing the size and the attack vectors are just two sides of the same coin. As a reward, you get much faster deployment pipelines, enabling more automated testing and higher scalability. A speed-up by a factor of 10 or 20 is not unusual; sometimes the size of a container shrinks by a factor of 100.
- 12factor IX: disposability
- bad examples
- optimizing the size of a container
- building minimal containers from scratch
- a small step in a Dockerfile, a big leap for container size
- debugging minimal containers
- speed up
- security measured by Trivy
Signal AI offers a sophisticated platform to support businesses in their decision making. Customers define searches across billions of documents by using an extensive DSL that includes concepts like entities and topics amongst them. This metadata is being extracted from over 5 million documents each day and is made available to the end users within 30 seconds of its ingestion via a mix of machine learning and text retrieval techniques. Entity Linking is one of the core capabilities in the Signal AI data processing platform. It is a complex system that uses various strategies to achieve the highest quality while retaining excellent throughput characteristics. Back in 2019, one of the existing components of the Entity Linking system was rapidly reaching its limits and could not scale anymore. To overcome the limitation, the team took an innovative approach and used Apache Lucene with its inverted index and term vectors capabilities to enable the identification of rule-based entities. By choosing a percolator model the team had to revisit the previous architecture, breaking it down into smaller components that follow the Single Responsibility Principle for microservices. This talk will take the audience through the evolution of this service, from its inception until today. It will provide details around the technical decisions and trade-offs that make this component one of the most resilient, fast and cost-effective solutions, capable of handling 20 times the number of rules at a fraction of the cost. It will also discuss how the same technology is used to reprocess the entire dataset every night in approximately 15 minutes.
At CybelAngel we scan the internet looking for sensitive data leaks belonging to our clients. As the volume of alerts could count billions of samples, we use machine learning to throw away as much noise as possible to reduce the analysts' workload. We are a growing team of data scientists and a machine learning engineer, planning to double in size. Each of us contributes to projects and we use Notebooks before code industrialisation. As for many other data science teams, a lot of effort and valuable work is encapsulated in a format that is tricky to share, hardly reproducible and simply not built for production purposes. During the talk, we will present what we did to overcome some of these issues and our feedback about notebook versioning and implementation in Google Cloud Platform using open JupyterHub and Jupytext. This talk is addressed to a technical audience but all roles gravitating around a data team are welcome to grasp the challenges of the interaction of data science within the organisation.
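As a hedged sketch of the Jupytext side of this workflow (file names are placeholders, and the calls should be checked against the Jupytext documentation), a notebook can be paired with a plain-Python representation that diffs and versions cleanly in git:

```python
import jupytext

# Read the .ipynb and write a py:percent twin that is friendly to code review and git diffs.
notebook = jupytext.read("analysis.ipynb")                  # placeholder notebook name
jupytext.write(notebook, "analysis.py", fmt="py:percent")   # cells become '# %%' blocks

# Going the other way: regenerate a runnable notebook from the reviewed script.
script = jupytext.read("analysis.py")
jupytext.write(script, "analysis_roundtrip.ipynb")
```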
The promise of accelerated computing presents an interesting paradox: while no one complains when new compute infrastructure is dramatically faster than its predecessor, few people realize how much they’d benefit from acceleration until they have it. It is perhaps unsurprising that a data scientist’s daily work consists of tasks that they can accomplish with their available computing resources, but simply running our existing work faster makes acceleration into a mere luxury. For accelerated computing to fulfill its promise, we need it to transform our work by enabling us to do new things that wouldn’t have been feasible without it. In this talk, we’ll discuss our experiences accelerating data science with specialized hardware and by scaling out on clusters. We’ll present examples of previously-impossible techniques becoming feasible, of the pleasant luxury of improved performance, and of the data science tasks that aren’t likely to justify additional hardware or implementation effort. You’ll leave this talk with a better understanding of how accelerated and scale-out computing can fit into your data science practice, a catalog of techniques that are still well served by standard hardware, and some actionable advice for how to take advantage of parallel and distributed computing across your workflow.
Stream
IoT applications run on IoT devices and can be created to be specific to almost every industry and vertical, from small devices to large ones, including healthcare, industrial automation, smart homes and buildings, automotive, and wearable technology. The possibilities are limitless. Increasingly, IoT applications are using AI and machine learning to add intelligence to devices. Among all of the variables in the IoT ecosystem, one common theme is the need to handle a constrained operating environment, such as unreliable network connectivity, limited bandwidth, low battery power, and so on. We will take a look into the MQTT protocol, how it has evolved from its early days, when it was intended for connecting oil pipelines via satellite, to the ever-increasing demand in IoT and M2M applications, and how this protocol will evolve to meet modern needs, especially in the current cloud computing era. We will study a few outstanding MQTT libraries that are available in the market, such as the Java-based HiveMQ, and open source libraries such as Eclipse Mosquitto.
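To make the publish/subscribe model concrete, here is a minimal sketch using the paho-mqtt client (1.x callback signatures assumed; the broker address and topics are placeholders):

```python
import paho.mqtt.client as mqtt

BROKER = "broker.example.com"          # placeholder broker address

def on_connect(client, userdata, flags, rc):
    # Subscribe once connected; '+' matches any single topic level (here: the device id).
    client.subscribe("sensors/+/temperature", qos=1)

def on_message(client, userdata, msg):
    print(f"{msg.topic}: {msg.payload.decode()}")

client = mqtt.Client()                 # paho-mqtt 1.x style constructor
client.on_connect = on_connect
client.on_message = on_message
client.connect(BROKER, 1883, keepalive=60)

# Publish a reading, then process incoming messages forever.
client.publish("sensors/device42/temperature", "21.5", qos=1)
client.loop_forever()
```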
Kafka data pipeline maintenance can be painful. It usually comes with complicated and lengthy recovery processes, scaling difficulties, traffic ‘moodiness’, and latency issues after downtimes and outages. It doesn’t have to be that way! We’ll examine one of our multi-petabyte scale Kafka pipelines, and go over some of the pitfalls we’ve encountered. We’ll offer solutions that alleviate those problems, and go over comparisons between the before and after. We’ll then explain why some common-sense solutions do not work well and offer an improved, scalable and resilient way of processing your stream. We’ll cover:
- Costs of processing in stream compared to in batch
- Scaling out for bursts and reprocessing
- Making the tradeoff between wait times and costs
- Recovering from outages
- And much more…
Streaming has changed the way we build and think about data pipelines, but at what cost? In this talk, we’ll introduce Materialize, a streaming database that lets you use standard SQL to query streams of data and get low-latency, incrementally updated answers as the underlying data changes. We’ll cover the basic concepts behind Materialize, where it fits in the current data engineering landscape and what makes it unique in comparison to other tools. To tie it all together, we’ll build a simple streaming analytics pipeline — from data ingestion to visualization!
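Since Materialize speaks the Postgres wire protocol, a regular Postgres driver can create and query an incrementally maintained view; a sketch, assuming a streaming source named `purchases` has already been defined (connection details are placeholders):

```python
import psycopg2

conn = psycopg2.connect(host="localhost", port=6875, user="materialize",
                        dbname="materialize")   # placeholder connection details
conn.autocommit = True
cur = conn.cursor()

# Define the transformation once; Materialize keeps the result incrementally updated
# as new events arrive on the 'purchases' source.
cur.execute("""
    CREATE MATERIALIZED VIEW revenue_by_region AS
    SELECT region, sum(amount) AS revenue
    FROM purchases
    GROUP BY region
""")

# Reading the view is an ordinary, low-latency SELECT.
cur.execute("SELECT region, revenue FROM revenue_by_region ORDER BY revenue DESC")
for row in cur.fetchall():
    print(row)
```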
Change Data Capture (CDC) has become a mundane commodity, in large part due to the ever-rising success of [Debezium](https://debezium.io/). But what happens when you want to keep track of changes in your upstream database and Kafka is not part of your stack? In this talk, we’ll walk through how we built a homegrown Postgres CDC connector at [Materialize](https://materialize.com/) as an add-on alternative to CDC support through Kafka+Debezium.
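Not the Materialize implementation itself, but as a rough idea of what consuming Postgres logical replication from code looks like, here is a hedged sketch using psycopg2's replication support (slot name, output plugin and DSN are placeholders):

```python
import psycopg2
from psycopg2.extras import LogicalReplicationConnection

conn = psycopg2.connect("dbname=shop user=repl",            # placeholder DSN
                        connection_factory=LogicalReplicationConnection)
cur = conn.cursor()

# A replication slot remembers how far downstream consumers have read the WAL.
cur.create_replication_slot("demo_slot", output_plugin="test_decoding")
cur.start_replication(slot_name="demo_slot", decode=True)

def consume(msg):
    # Each message describes committed changes (INSERT/UPDATE/DELETE) decoded from the WAL.
    print(msg.payload)
    msg.cursor.send_feedback(flush_lsn=msg.data_start)       # acknowledge progress

cur.consume_stream(consume)   # blocks, invoking consume() for every change
```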
Streaming frameworks are notoriously difficult to learn. Streaming teams often spend as much time educating their users as they spend building solutions based on frameworks like Apache Beam and Apache Flink. In this talk, I share a few different resources from the Apache Beam community that helped make the learning process easier. We will share feedback from users that have tried these resources in their learning, and try to extract common themes to build learning materials for the stateful streaming model that are accessible and easy to digest.
Due to Apache Kafka's widespread integration into enterprise-level infrastructures, monitoring Kafka performance at scale has become an increasingly important task. It can be difficult to understand what is happening in the Kafka cluster and to successfully root-cause and troubleshoot problems. To perform effective diagnosis, meaningful insights and visibility throughout all levels of the cluster are a must. This talk will take a deep dive into which metrics or indicators matter most while running Kafka at scale: how to interpret and correlate these indicators, build dashboards and configure meaningful smart alerts that identify a probable issue before it takes place. This talk concludes with the idea of utilising machine learning to detect anomalies for long-running scheduled Kafka pipelines and predict overall cluster resource usage for future scaling.
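Among such indicators, consumer lag is usually the first one to watch; a minimal sketch (using kafka-python; broker, topic and group id are placeholders) of computing it per partition:

```python
from kafka import KafkaConsumer, TopicPartition

TOPIC, GROUP = "orders", "orders-processor"          # placeholder topic and group id
consumer = KafkaConsumer(bootstrap_servers="localhost:9092",
                         group_id=GROUP,
                         enable_auto_commit=False)

partitions = [TopicPartition(TOPIC, p) for p in consumer.partitions_for_topic(TOPIC)]
end_offsets = consumer.end_offsets(partitions)       # latest offset per partition

for tp in partitions:
    committed = consumer.committed(tp) or 0          # last offset the group committed
    lag = end_offsets[tp] - committed
    print(f"partition {tp.partition}: lag={lag}")
```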
We all know that the world is constantly changing. Data is continuously produced and thus should be consumed in a similar fashion by enterprise systems. Message queues and logs such as Apache Kafka can be found in almost every architecture, while databases and other batch systems still provide the foundation. Change Data Capture (CDC) has become popular for capturing committed changes from a database and propagating those changes to downstream consumers. In this talk, we will introduce Apache Flink as a general data processor for various kinds of use cases on both finite and infinite streams. We demonstrate Flink's SQL engine as a changelog processor that is shipped with an ecosystem tailored to process CDC data and maintain materialized views. We will use Kafka as an upsert log, Debezium for connecting to databases, and enrich streams of various sources using different kinds of joins. Finally, we illustrate how to combine Flink's Table API with the DataStream API for event-driven applications beyond SQL.
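To illustrate the flavour of the SQL involved (a sketch only; the connector options come from the flink-cdc-connectors and upsert-kafka connectors and should be checked against their documentation, and all hosts and credentials are placeholders), a CDC table can be declared and a continuously maintained aggregate written to Kafka as an upsert log:

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Source: a Postgres table captured as a changelog stream via CDC.
t_env.execute_sql("""
    CREATE TABLE orders (
        order_id INT,
        region   STRING,
        amount   DECIMAL(10, 2),
        PRIMARY KEY (order_id) NOT ENFORCED
    ) WITH (
        'connector' = 'postgres-cdc',
        'hostname' = 'localhost', 'port' = '5432',
        'username' = 'flink', 'password' = 'secret',
        'database-name' = 'shop', 'schema-name' = 'public', 'table-name' = 'orders'
    )
""")

# Sink: an upsert log in Kafka, keyed by region.
t_env.execute_sql("""
    CREATE TABLE revenue_by_region (
        region  STRING,
        revenue DECIMAL(10, 2),
        PRIMARY KEY (region) NOT ENFORCED
    ) WITH (
        'connector' = 'upsert-kafka',
        'topic' = 'revenue_by_region',
        'properties.bootstrap.servers' = 'localhost:9092',
        'key.format' = 'json', 'value.format' = 'json'
    )
""")

# Materialized-view-style query: Flink keeps the aggregate updated as changes arrive.
t_env.execute_sql("""
    INSERT INTO revenue_by_region
    SELECT region, SUM(amount) FROM orders GROUP BY region
""")
```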
This case study offers an entertaining way to learn about the possibilities of stream processing, which can be applied to projects in fields that require easy access to current information, such as finance, mobility and energy. We’ll use the Quix platform to set up a series of open source data sets and code samples that collect, transform and deliver data to a machine learning model that learns to handle real-time heart rate data. We’ll show how to include complex transformations of the data, such as how to calculate calories burned with Python.
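As a simplified sketch of the kind of transformation involved (not the Quix code; the calorie model is deliberately left as a user-supplied placeholder rather than a made-up formula), a rolling window over incoming heart-rate samples can feed whatever downstream model estimates calories:

```python
from collections import deque

def rolling_mean_hr(samples, window_size=10):
    """Yield a smoothed heart-rate value for every incoming sample."""
    window = deque(maxlen=window_size)
    for hr in samples:
        window.append(hr)
        yield sum(window) / len(window)

def calories_per_minute(avg_hr, weight_kg, age):
    """Placeholder hook: plug in whichever published regression or ML model you trust."""
    raise NotImplementedError

incoming = [92, 95, 101, 110, 118, 121, 119]   # toy heart-rate stream (bpm)
for smoothed in rolling_mean_hr(incoming, window_size=3):
    print(round(smoothed, 1))
```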
Production profiling is no new challenge in the operations world. Companies with huge data center scale, from the Googles to the Facebooks, have long moved to continuous cross-cluster production profiling to constantly optimize performance and SLAs, but this practice has not yet carried over universally. Many large-scale enterprises still use homegrown solutions to deploy profilers sporadically for research and to gain short-term production performance insights. However, this has its limitations, as profilers are often language-specific, come with substantial overhead, and provide limited ability to aggregate and analyze clusterwide data. Production profiling has significantly improved today, and by leveraging tools like eBPF it’s now possible to have a better understanding of your clusterwide performance including CPU, memory, pod health and more, for both stability and cost optimization, as well as advanced segmentation and analysis. In this talk we walk you through the basics: how to use modern continuous profiling tools, how to read a flamegraph, and what to look for, with a real-world demo in modern, complex microservices environments. We will then continue to advanced profiling of cross-cluster deployment comparisons, code performance profiling over time, and even how to provide feedback loops for developers to optimize performance from the foundations of their code.
Apache Druid is the open source analytics database that enables development of modern data-intensive applications of any size. It provides sub-second response times on streaming and historical data and can scale to deliver real-time analytics with data ingestion at any data flow rate – with lightning fast queries at any concurrency. Sounds great, right? But any large distributed system can be difficult and time-consuming to deploy and monitor. Deployment requirements change significantly from use case to use case, from dev/test clusters on the laptop to hundreds of nodes in the cloud. Kubernetes has become the de-facto standard for making these complicated systems be much easier to deploy and operate. In this talk you will learn about Druid's microservice architecture and the benefits of deploying it on Kubernetes. We will walk you through the open source project's Helm Chart design and how it can be used to deploy and manage clusters of any size with ease.
You’re curious about what Apache Kafka does and how it works, but between the terminology and explanations that seem to start at a complex level, it's been difficult to get started. This session is different. We'll talk about what Kafka is, what it does and how it works in simple terms, with easy-to-understand and funny examples that you can share later at a dinner table with your family. This session is for curious minds who might have never worked with distributed streaming systems before, or who are beginners to event streaming applications. But let simplicity not deceive you - by the end of the session you’ll be equipped to create your own Apache Kafka event stream!