Method, Distributed System and Device for Efficiently Quantifying a Similarity of Large Data Sets
Simple Summary
Content extracted from patent full text and abstract with AI.
This invention provides a method, system, and device for efficiently calculating the similarity between large data sets, even when only limited hardware resources are available. The approach uses a two-stage process: it first performs a quick initial similarity check using resource-efficient statistical metrics, and applies more sophisticated, computationally intensive analyses only if the data sets are deemed sufficiently similar. Dissimilar data sets are therefore rejected before the expensive stage runs, which saves computation time and hardware usage.
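The two-stage idea can be sketched as follows. This is a minimal illustration, not the patented method: using mean and standard deviation as the cheap screen, using the largest gap between empirical distribution functions as the expensive stage, and the names `cheap_screen`, `detailed_similarity`, `two_stage_similarity`, and `tolerance` are all assumptions made for this example.

```python
import bisect
import statistics


def cheap_screen(a, b, tolerance=0.25):
    """Stage 1 (assumed metrics): compare mean and standard deviation only."""
    mean_a, mean_b = statistics.fmean(a), statistics.fmean(b)
    sd_a, sd_b = statistics.pstdev(a), statistics.pstdev(b)
    scale = max(sd_a, sd_b, 1e-12)  # avoid division by zero for constant data
    return (abs(mean_a - mean_b) / scale <= tolerance
            and abs(sd_a - sd_b) / scale <= tolerance)


def detailed_similarity(a, b):
    """Stage 2 (assumed metric): 1 minus the largest gap between the empirical CDFs."""
    sa, sb = sorted(a), sorted(b)
    max_gap = 0.0
    for x in sa + sb:
        cdf_a = bisect.bisect_right(sa, x) / len(sa)
        cdf_b = bisect.bisect_right(sb, x) / len(sb)
        max_gap = max(max_gap, abs(cdf_a - cdf_b))
    return 1.0 - max_gap


def two_stage_similarity(a, b, tolerance=0.25):
    """Return None if stage 1 rejects the pair, otherwise the stage 2 score."""
    if not cheap_screen(a, b, tolerance):
        return None  # the expensive stage is skipped entirely
    return detailed_similarity(a, b)
```

With this sketch, a pair whose means differ by many standard deviations returns `None` without ever sorting the data, while a pair that passes the screen receives a score between 0 and 1.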
Use Cases
- Comparing large real-world data sets (e.g., in scientific research, business analytics, or IoT) to assess similarity or changes over time.
- Quality assurance of synthetically generated data (e.g., validating simulated or predicted data against actual sensor data).
- Data deduplication and redundancy checks in big data storage systems.
- Detecting anomalies or outliers in streaming sensor or transactional data using similarity comparisons.
- Matching and integrating large, heterogeneous data sources in data warehousing or ETL pipelines.
- Providing similarity analysis as a distributed cloud-based service to third-party users.
- Mobile or embedded applications that require on-device data similarity calculations with minimal hardware resources.
Benefits
- Efficiently handles very large data sets, crucial for big data applications.
- Reduces unnecessary computational effort and hardware resource consumption by filtering non-similar data sets early in the process.
- Handles diverse data types (nominal, ordinal, interval, ratio), making it broadly applicable across many domains.
- Parallelizable and suitable for distributed environments, allowing scalable performance on clusters or cloud.
- Works with both real (sensor) data and synthetic/estimated data, enabling flexible usage scenarios.
- Can be offered as a network-based service or integrated into lightweight, mobile, or embedded devices.
- Enables more timely and cost-effective operations in data-intensive fields by optimizing resource usage.
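One common way to make summary statistics parallelizable, as the list above mentions, is to compute small mergeable summaries per chunk and combine them afterwards. The sketch below assumes that approach; the listing does not say how the patent distributes work, and the names `partial_stats` and `merged_mean_std` are hypothetical. Threads are used only to keep the example self-contained; a real deployment would more likely use separate processes or cluster nodes.

```python
from concurrent.futures import ThreadPoolExecutor
import math


def partial_stats(chunk):
    """Per-chunk summary: count, sum, and sum of squares."""
    return len(chunk), sum(chunk), sum(x * x for x in chunk)


def merged_mean_std(chunks, max_workers=4):
    """Combine per-chunk summaries into a global mean and population std dev."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        parts = list(pool.map(partial_stats, chunks))
    n = sum(p[0] for p in parts)
    total = sum(p[1] for p in parts)
    total_sq = sum(p[2] for p in parts)
    mean = total / n
    # Clamp tiny negative values caused by floating-point rounding.
    variance = max(total_sq / n - mean * mean, 0.0)
    return mean, math.sqrt(variance)
```

Each chunk is reduced to three numbers, so only those summaries need to travel between workers. The sum-of-squares formula can lose precision when values are large relative to their spread; a numerically stable merge would be preferable in that case.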
Technical Classifications (CPCs)
Main Classifications
Physics & Measurement
Sub Classifications
Computing & Calculating
CPC Codes
Inventors & Applicants
Applicants
Deutsche Telekom Ag
Univ Berlin Tech
Patent Abstract
The present invention is directed towards an efficient computation of similarity values of large data sets. According to the present invention, it is possible to handle large data sets even with poor hardware equipment. Data sets are classified and furthermore, according to the assigned class, subsequent operations are selected. Special focus of the suggested subject matter lies on hardware requirements, which are considered towards enhanced resource efficiency.
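The abstract's statement that operations are selected according to an assigned class can be pictured as a lookup table. The sketch below is illustrative only: the abstract does not define the classes, so the three classes, the thresholds `low` and `high`, and the operation names are all assumptions.

```python
from enum import Enum


class SimilarityClass(Enum):
    DISSIMILAR = "dissimilar"
    BORDERLINE = "borderline"
    SIMILAR = "similar"


# Hypothetical mapping from class to follow-up operations (names are illustrative).
OPERATIONS = {
    SimilarityClass.DISSIMILAR: [],
    SimilarityClass.BORDERLINE: ["histogram_comparison"],
    SimilarityClass.SIMILAR: ["histogram_comparison", "distribution_test"],
}


def classify(coarse_score, low=0.3, high=0.7):
    """Assign a class from a coarse similarity score in [0, 1] (thresholds assumed)."""
    if coarse_score < low:
        return SimilarityClass.DISSIMILAR
    if coarse_score < high:
        return SimilarityClass.BORDERLINE
    return SimilarityClass.SIMILAR


def select_operations(coarse_score):
    """Return the follow-up operations selected for the assigned class."""
    return OPERATIONS[classify(coarse_score)]
```

Keeping the mapping in a table makes the resource cost of each class explicit: the dissimilar class triggers no further work at all.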
Key Information
Publication No.
EP3109771A1
Family ID
53496440
Publication Date
2016-12-28
Application No.
EP15173150A
Application Date
2015-06-22
Priority Date
2015-06-22
Granted
No
Possible Cooperation
For further information please contact the transfer office.