Method, Distributed System and Device for Efficiently Quantifying a Similarity of Large Data Sets

Publication: EP3109771A1

Published: 2016-12-28

Family Size: 1

Granted: No

Simple Summary

Content extracted from patent full text and abstract with AI.

This invention provides a method, system, and device for efficiently quantifying the similarity between large data sets, even on limited hardware. The approach works in two stages: it first computes an initial similarity estimate from resource-efficient statistical metrics, and it applies more sophisticated, computationally intensive analyses only if that estimate indicates the data sets are sufficiently similar. Dissimilar data sets are thus filtered out early, which saves computation time and hardware resources.
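The two-stage idea can be illustrated with a minimal sketch. It is not taken from the patent: the choice of mean and standard deviation as the cheap metrics, the two-sample Kolmogorov-Smirnov statistic as the expensive analysis, the function names, and the `threshold` value are all assumptions made for illustration.

```python
import statistics
from bisect import bisect_right


def cheap_similarity(a, b):
    """Stage 1 (assumed metric): compare means and standard deviations."""
    mean_a, mean_b = statistics.fmean(a), statistics.fmean(b)
    std_a, std_b = statistics.pstdev(a), statistics.pstdev(b)
    scale = max(std_a, std_b, 1e-12)  # avoid division by zero
    mean_gap = abs(mean_a - mean_b) / scale
    std_gap = abs(std_a - std_b) / scale
    return 1.0 / (1.0 + mean_gap + std_gap)  # 1.0 means identical summaries


def ks_statistic(a, b):
    """Stage 2 (assumed analysis): two-sample Kolmogorov-Smirnov statistic.

    More expensive than stage 1 because both samples must be sorted.
    """
    sa, sb = sorted(a), sorted(b)
    d = 0.0
    for x in sa + sb:
        cdf_a = bisect_right(sa, x) / len(sa)
        cdf_b = bisect_right(sb, x) / len(sb)
        d = max(d, abs(cdf_a - cdf_b))
    return d


def two_stage_similarity(a, b, threshold=0.5):
    """Return (stage_reached, score); threshold is a hypothetical cut-off."""
    score = cheap_similarity(a, b)
    if score < threshold:
        return 1, score  # dissimilar: skip the expensive stage
    return 2, 1.0 - ks_statistic(a, b)


if __name__ == "__main__":
    print(two_stage_similarity([1, 2, 3, 4, 5], [1, 2, 3, 4, 5]))
    print(two_stage_similarity([1, 2, 3, 4, 5], [100, 101, 102, 103, 104]))
```

In this sketch the second call stops after stage 1, because the summary statistics already differ strongly, so the sorting-based comparison is never run.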

Use Cases

Comparing large real-world data sets (e.g., in scientific research, business analytics, or IoT) to assess similarity or changes over time.
Quality assurance of synthetically generated data (e.g., validating simulated or predicted data against actual sensor data).
Data deduplication and redundancy checks in big data storage systems.
Detecting anomalies or outliers in streaming sensor or transactional data using similarity comparisons.
Matching and integrating large, heterogeneous data sources in data warehousing or ETL pipelines.
Providing similarity analysis as a distributed cloud-based service to third-party users.
Mobile or embedded applications that require on-device data similarity calculations with minimal hardware resources.
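For the streaming, distributed, and on-device scenarios listed above, one way to keep memory use constant is to maintain a running summary that can be updated per value and merged across partitions. The sketch below uses Welford's online algorithm and the standard pairwise merge formula for mean and variance. The patent text shown here does not name these techniques; the `RunningSummary` class and `summarize` helper are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RunningSummary:
    """Constant-memory summary (hypothetical helper, not from the patent)."""
    n: int = 0
    mean: float = 0.0
    m2: float = 0.0  # sum of squared deviations from the running mean

    def add(self, x):
        # Welford's online update: one value at a time, no buffering.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def merge(self, other):
        # Combine summaries computed independently, e.g. on different nodes.
        if other.n == 0:
            return RunningSummary(self.n, self.mean, self.m2)
        if self.n == 0:
            return RunningSummary(other.n, other.mean, other.m2)
        n = self.n + other.n
        delta = other.mean - self.mean
        mean = self.mean + delta * other.n / n
        m2 = self.m2 + other.m2 + delta * delta * self.n * other.n / n
        return RunningSummary(n, mean, m2)

    @property
    def variance(self):
        # Population variance; 0.0 for an empty summary.
        return self.m2 / self.n if self.n else 0.0


def summarize(values):
    s = RunningSummary()
    for v in values:
        s.add(v)
    return s


if __name__ == "__main__":
    left = summarize([1.0, 2.0])
    right = summarize([3.0, 4.0])
    merged = left.merge(right)
    print(merged.n, merged.mean, merged.variance)
```

Because `merge` is associative, partial summaries from many workers or sensor batches can be combined in any order before two data sets are compared.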

Benefits

Efficiently handles very large data sets, crucial for big data applications.
Reduces unnecessary computational effort and hardware resource consumption by filtering non-similar data sets early in the process.
Handles diverse data types (nominal, ordinal, interval, ratio), making it broadly applicable across many domains.
Parallelizable and suitable for distributed environments, allowing scalable performance on clusters or cloud.
Works with both real (sensor) data and synthetic/estimated data, enabling flexible usage scenarios.
Can be offered as a network-based service or integrated into lightweight, mobile, or embedded devices.
Enables more timely and cost-effective operations in data-intensive fields by optimizing resource usage.
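Supporting nominal, ordinal, interval, and ratio data implies that the inexpensive first-pass metric depends on the measurement scale. The following sketch shows one possible dispatch. The specific metrics (category-frequency overlap, median rank, mean gap) and all function names are assumptions for illustration, not the metrics claimed in the patent.

```python
import statistics
from collections import Counter


def nominal_similarity(a, b):
    """Overlap of relative category frequencies (1.0 = same distribution)."""
    fa, fb = Counter(a), Counter(b)
    return sum(min(fa[k] / len(a), fb[k] / len(b)) for k in fa.keys() | fb.keys())


def ordinal_similarity(a, b, levels):
    """Compare median ranks, normalised by the number of ordered levels."""
    rank = {level: i for i, level in enumerate(levels)}
    med_a = statistics.median(rank[x] for x in a)
    med_b = statistics.median(rank[x] for x in b)
    span = max(len(levels) - 1, 1)
    return 1.0 - abs(med_a - med_b) / span


def numeric_similarity(a, b):
    """Interval/ratio data: gap between means relative to the pooled spread."""
    spread = max(statistics.pstdev(list(a) + list(b)), 1e-12)
    gap = abs(statistics.fmean(a) - statistics.fmean(b)) / spread
    return 1.0 / (1.0 + gap)


def cheap_metric(scale, a, b, levels=None):
    """Pick a first-pass metric by measurement scale (hypothetical dispatch)."""
    if scale == "nominal":
        return nominal_similarity(a, b)
    if scale == "ordinal":
        return ordinal_similarity(a, b, levels)
    if scale in ("interval", "ratio"):
        return numeric_similarity(a, b)
    raise ValueError(f"unknown measurement scale: {scale!r}")


if __name__ == "__main__":
    print(cheap_metric("nominal", ["a", "a", "b"], ["a", "b", "b"]))
    print(cheap_metric("ordinal", ["low", "mid"], ["mid", "high"],
                       levels=["low", "mid", "high"]))
    print(cheap_metric("ratio", [1, 2, 3], [1, 2, 3]))
```

Each metric returns a value in [0, 1], so a single threshold could in principle gate the more expensive analysis regardless of the data type.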

Technical Classifications (CPCs)

Main Classifications

Physics & Measurement

Sub Classifications

Computing & Calculating

CPC Codes

G06F17/18

Inventors & Applicants

Inventors

Applicants

Deutsche Telekom AG

Univ Berlin Tech

Patent Abstract

The present invention is directed towards an efficient computation of similarity values of large data sets. According to the present invention, it is possible to handle large data sets even with poor hardware equipment. Data sets are classified and furthermore, according to the assigned class, subsequent operations are selected. Special focus of the suggested subject matter lies on hardware requirements, which are considered towards enhanced resource efficiency.

Key Information

Publication No.

EP3109771A1

Family ID

53496440

Publication Date

2016-12-28

Application No.

EP15173150A

Application Date

2015-06-22

Priority Date

2015-06-22

Granted

No

Possible Cooperation

For further information please contact the transfer office.
