Application-Aware Local-Global Source Deduplication for Cloud Backup Services of Personal Storage

ABSTRACT

In personal computing devices that rely on a cloud storage environment for data backup, an imminent challenge facing source deduplication for cloud backup services is the low deduplication efficiency due to a combination of the resource-intensive nature of deduplication and the limited system resources. In this paper, we present ALG-Dedupe, an Application-aware Local-Global source deduplication scheme that improves data deduplication efficiency by exploiting application awareness, and further combines local and global duplicate detection to strike a good balance between cloud storage capacity saving and deduplication time reduction. We perform experiments via a prototype implementation to demonstrate that our scheme can significantly improve deduplication efficiency over the state-of-the-art methods with low system overhead, resulting in a shortened backup window, increased power efficiency, and reduced cost for cloud backup services of personal storage.

EXISTING SYSTEM

Data deduplication, although not traditionally considered backup software in its own right, can be quite useful when backing up large volumes of data. The deduplication process works by identifying unique chunks of data, removing redundant copies, and thereby making data easier to store. For example, if a marketing director sends a 10 MB PowerPoint document to everyone in a company of, say, 500 people, and each of those people saves the document to their hard drive, the presentation will take up roughly 5 GB of collective storage on the backup disk, tape, or server. With data deduplication, however, only one instance of the document is actually saved, reducing that 5 GB of storage to just 10 MB. When the document needs to be accessed, the computer retrieves the one copy that was initially saved. Deduplication drastically reduces the amount of storage space needed to back up a server or system because the process is more granular than other compression schemes. Instead of comparing entire files to determine whether they are the same, deduplication segments data into blocks and looks for repetition. Redundant blocks are removed from the backup, so more data can be stored.
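
As a minimal illustration of this block-level idea (a sketch, not the system's actual implementation; the class name BlockDeduplicator and the 4 KB block size are assumptions made for the example), the following Java code splits data into fixed-size blocks, fingerprints each block, and stores only blocks that have not been seen before:

```java
import java.math.BigInteger;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

/** Minimal fixed-size block deduplicator: each unique block is stored only once. */
public class BlockDeduplicator {
    private static final int BLOCK_SIZE = 4096;              // 4 KB blocks (illustrative choice)
    private final Map<String, byte[]> store = new HashMap<String, byte[]>(); // fingerprint -> block

    /** Splits the data into fixed-size blocks and keeps only blocks not seen before. */
    public void backup(byte[] data) throws NoSuchAlgorithmException {
        for (int off = 0; off < data.length; off += BLOCK_SIZE) {
            byte[] block = Arrays.copyOfRange(data, off, Math.min(off + BLOCK_SIZE, data.length));
            String fp = fingerprint(block);
            if (!store.containsKey(fp)) {
                store.put(fp, block);                          // duplicate blocks are skipped
            }
        }
    }

    /** SHA-256 fingerprint of a block, rendered as a hex string. */
    private static String fingerprint(byte[] block) throws NoSuchAlgorithmException {
        byte[] digest = MessageDigest.getInstance("SHA-256").digest(block);
        return new BigInteger(1, digest).toString(16);
    }

    public int uniqueBlockCount() {
        return store.size();
    }
}
```

With fixed-size blocks the redundancy test is a simple hash-table lookup on the fingerprint, which is what makes block-level deduplication more granular than whole-file comparison.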

PROPOSED SYSTEM:

We propose an Application-aware Local-Global source deduplication scheme that not only exploits application awareness but also combines local and global duplicate detection, achieving high deduplication efficiency by reducing the deduplication latency to as low as that of application-aware local deduplication while saving as much cloud storage cost as application-aware global deduplication. Our application-aware deduplication design is motivated by a systematic deduplication analysis of personal storage.
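
The local-then-global decision at the heart of this design can be sketched as follows, assuming a small client-side fingerprint set and a remote cloud index; the LocalGlobalDetector and CloudIndex names are hypothetical placeholders rather than the scheme's actual interfaces:

```java
import java.util.HashSet;
import java.util.Set;

/** Sketch of local-global duplicate detection: consult the cheap local index first,
 *  and only query the cloud-side global index on a local miss. */
public class LocalGlobalDetector {

    /** Hypothetical cloud-side index of chunk fingerprints. */
    public interface CloudIndex {
        boolean contains(String fingerprint);
        void add(String fingerprint);
    }

    private final Set<String> localIndex = new HashSet<String>(); // small client-side index
    private final CloudIndex cloudIndex;

    public LocalGlobalDetector(CloudIndex cloudIndex) {
        this.cloudIndex = cloudIndex;
    }

    /** Returns true if the chunk must be uploaded (i.e., it is globally unique). */
    public boolean isUploadNeeded(String fingerprint) {
        if (localIndex.contains(fingerprint)) {
            return false;                                    // local hit: no network round trip
        }
        boolean duplicate = cloudIndex.contains(fingerprint); // remote lookup only on local miss
        localIndex.add(fingerprint);                          // remember the chunk locally either way
        if (!duplicate) {
            cloudIndex.add(fingerprint);                      // register the new chunk globally
        }
        return !duplicate;
    }
}
```

The design intent is that the common case is resolved by the cheap local lookup, and only chunks that miss locally pay the network round trip to the global index.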

Advantage:

To achieve high deduplication efficiency by reducing the deduplication latency to as low as that of application-aware local deduplication, while saving as much cloud storage cost as application-aware global deduplication.

FEATURES:

  1. The scheme targets personal computing devices that rely on a cloud storage environment for data backup.
  2. Data deduplication, an effective data compression approach that exploits data redundancy, partitions large data objects into smaller parts, called chunks, and represents these chunks by their fingerprints.
  3. An uploaded file is converted into binary form. Computers store all characters as numbers held as binary data; binary code uses the digits 0 and 1 (binary numbers) to represent computer instructions or text.
  4. When the user uploads the same file again, the system checks it against all previously uploaded data and does not allow the duplicate to be stored (deduplication), so cloud space is not wasted.
  5. An uploaded file is not stored in the cloud directly; it is first sent to the cloud admin, and only after the admin approves it is the file stored in the cloud and made available for download.
  6. Depending on whether the file type is compressed, and on whether Static Chunking (SC) can outperform Content-Defined Chunking (CDC) in deduplication efficiency, we divide files into three main categories: compressed files, static uncompressed files, and dynamic uncompressed files (see the sketch after this list).
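
A minimal sketch of this three-way classification and the chunking strategy it selects is shown below; the extension lists and the strategy mapping are illustrative assumptions, not the scheme's definitive policy:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Sketch: classify a file by type and pick a chunking strategy for it. */
public class FileClassifier {

    public enum Category { COMPRESSED, STATIC_UNCOMPRESSED, DYNAMIC_UNCOMPRESSED }
    public enum ChunkingStrategy { WHOLE_FILE, STATIC_CHUNKING, CONTENT_DEFINED_CHUNKING }

    // Illustrative extension lists only; a real deployment would tune these per application.
    private static final Set<String> COMPRESSED_EXT =
            new HashSet<String>(Arrays.asList("zip", "rar", "jpg", "mp3", "mp4"));
    private static final Set<String> STATIC_EXT =
            new HashSet<String>(Arrays.asList("exe", "dll", "pdf", "iso"));

    public static Category classify(String fileName) {
        String ext = extensionOf(fileName);
        if (COMPRESSED_EXT.contains(ext)) return Category.COMPRESSED;
        if (STATIC_EXT.contains(ext)) return Category.STATIC_UNCOMPRESSED;
        return Category.DYNAMIC_UNCOMPRESSED;   // e.g., documents and mailboxes that change often
    }

    public static ChunkingStrategy strategyFor(Category category) {
        switch (category) {
            case COMPRESSED:          return ChunkingStrategy.WHOLE_FILE;              // little intra-file redundancy
            case STATIC_UNCOMPRESSED: return ChunkingStrategy.STATIC_CHUNKING;         // SC is cheap and sufficient
            default:                  return ChunkingStrategy.CONTENT_DEFINED_CHUNKING; // CDC handles shifted content
        }
    }

    private static String extensionOf(String fileName) {
        int dot = fileName.lastIndexOf('.');
        return dot < 0 ? "" : fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
    }
}
```

In practice the extension-to-category mapping would be tuned per application, which is exactly where application awareness enters the picture.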

PROBLEM STATEMENT:

For a backup dataset with logical dataset size L, its physical dataset size will be reduced to P_L after local source deduplication in personal computing devices, and further decreased to P_G by global source deduplication in the cloud, where P_G ≤ P_L. We divide the backup process into three parts: local duplicate detection, global duplicate detection, and unique data cloud store. Here, the latencies for chunking and fingerprinting are included in the duplicate detection latency. Meanwhile, we assume an average local duplicate detection latency T_L, an average global duplicate detection latency T_G, and an average cloud storage I/O bandwidth B for an average chunk size C, where T_G ≥ T_L. We can then build models to calculate BWS_L and BWS_G, the average backup window size per chunk of local-source-deduplication-based cloud backup and global-source-deduplication-based cloud backup, respectively.
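
One plausible way to instantiate these models, offered as an illustrative sketch under the assumptions above rather than as the document's exact equations, is to charge each chunk its duplicate detection latency plus the transfer time of the fraction of data that remains unique after deduplication:

```latex
% Illustrative per-chunk backup window models (sketch, not necessarily the exact formulation).
% L: logical dataset size; P_L / P_G: physical size after local / global deduplication;
% T_L / T_G: average local / global duplicate detection latency;
% C: average chunk size; B: average cloud storage I/O bandwidth.
\begin{align}
  \mathrm{BWS}_L &\approx T_L + \frac{P_L}{L}\cdot\frac{C}{B} \\
  \mathrm{BWS}_G &\approx T_L + T_G + \frac{P_G}{L}\cdot\frac{C}{B}
\end{align}
```

Read this way, global detection adds the extra per-chunk lookup latency T_G but transfers less unique data because P_G ≤ P_L, which is precisely the latency-versus-capacity trade-off that ALG-Dedupe aims to balance.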

SCOPE:

In the last decade, we have seen the world surpass a billion Internet-connected computers, the widespread adoption of Internet-enabled smartphones, the rise of the "cloud", and the digitization of nearly every photo, movie, song, and other file. Data has grown exponentially, devices have proliferated, and the risk of data loss has skyrocketed along with these trends.

MODULE DESCRIPTION:

Number of Modules:

After careful analysis the system has been identified to have the following modules:

  1. Cloud backup.
  2. Personal storage.
  3. Source deduplication.
  4. Application awareness.

Cloud backup:

Cloud backup, also known as online backup, is a strategy for backing up data that involves sending a copy of the data over a proprietary or public network to an off-site server. The server is usually hosted by a third-party service provider, who charges the backup customer a fee based on capacity, bandwidth, or number of users. In the enterprise, the off-site server might be proprietary, but the chargeback method would be similar.

Online backup systems are typically built around a client software application that runs on a schedule determined by the level of service the customer has purchased. If the customer has contracted for daily backups, for instance, then the application collects, compresses, encrypts, and transfers data to the service provider's servers every 24 hours. To reduce the amount of bandwidth consumed and the time it takes to transfer files, the service provider might only perform incremental backups after the initial full backup.

Third-party cloud backup has gained popularity with small offices and home users because of its convenience. Capital expenditures for additional hardware are not required, and backups can be run dark, which means they can run automatically without manual intervention.
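
The schedule-driven full-versus-incremental behavior described above can be pictured with the small Java sketch below; the BackupScheduler class and its fields are assumptions made for illustration, not any provider's actual client:

```java
import java.time.Duration;
import java.time.Instant;

/** Sketch of a schedule-driven backup client: one full backup first, incrementals afterwards. */
public class BackupScheduler {

    public enum BackupType { NONE, FULL, INCREMENTAL }

    private final Duration interval;    // e.g., Duration.ofHours(24) for a daily-backup plan
    private Instant lastBackup;         // null until the initial full backup has run

    public BackupScheduler(Duration interval) {
        this.interval = interval;
    }

    /** Decides what kind of backup (if any) is due at the given time. */
    public BackupType nextBackup(Instant now) {
        if (lastBackup == null) {
            lastBackup = now;
            return BackupType.FULL;             // first run: transfer everything
        }
        if (Duration.between(lastBackup, now).compareTo(interval) >= 0) {
            lastBackup = now;
            return BackupType.INCREMENTAL;      // later runs: only changed data
        }
        return BackupType.NONE;                 // not due yet
    }
}
```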

Personal storage:

Cloud storage is a model of data storage where the digital data is stored in logical pools, the physical storage spans multiple servers (and often locations), and the physical environment is typically owned and managed by a hosting company. These cloud storage providers are responsible for keeping the data available and accessible, and the physical environment protected and running. People and organizations buy or lease storage capacity from the providers to store end user, organization, or application data.

Cloud storage services may be accessed through a co-located cloud compute service, a web service application programming interface (API), or by applications that utilize the API, such as cloud desktop storage, a cloud storage gateway, or Web-based content management systems.

Architecture Overview

In the ALG-Dedupe architecture, tiny files are first filtered out by a file size filter for efficiency reasons, and backup data streams are broken into chunks by an intelligent chunker using an application-aware chunking strategy. Data chunks from the same type of files are then deduplicated in the application-aware deduplicator by generating chunk fingerprints in the hash engine and performing a data redundancy check, first against the local index on the client side and then, for chunks not found there, against the global index in the cloud.
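
The end-to-end client-side flow can be sketched as the pipeline below. The component names (DedupePipeline, IntelligentChunker, HashEngine, and so on) are hypothetical placeholders rather than the prototype's real classes, and the redundancy check is delegated to a local-global detector like the one sketched earlier:

```java
import java.util.List;

/** Sketch of the ALG-Dedupe client-side pipeline: filter, chunk, fingerprint, check, upload. */
public class DedupePipeline {

    // Hypothetical component interfaces standing in for the prototype's real modules.
    public interface IntelligentChunker { List<byte[]> chunk(byte[] file, String fileName); }
    public interface HashEngine { String fingerprint(byte[] chunk); }
    public interface RedundancyChecker { boolean isUploadNeeded(String fingerprint); }
    public interface CloudStore { void upload(String fingerprint, byte[] chunk); }

    private static final int TINY_FILE_THRESHOLD = 10 * 1024;   // assumption: files below 10 KB skip chunking

    private final IntelligentChunker chunker;
    private final HashEngine hashEngine;
    private final RedundancyChecker checker;
    private final CloudStore cloud;

    public DedupePipeline(IntelligentChunker chunker, HashEngine hashEngine,
                          RedundancyChecker checker, CloudStore cloud) {
        this.chunker = chunker;
        this.hashEngine = hashEngine;
        this.checker = checker;
        this.cloud = cloud;
    }

    public void backupFile(String fileName, byte[] content) {
        if (content.length < TINY_FILE_THRESHOLD) {
            cloud.upload(hashEngine.fingerprint(content), content);  // assumption: tiny files stored directly
            return;
        }
        for (byte[] chunk : chunker.chunk(content, fileName)) {      // application-aware chunking
            String fp = hashEngine.fingerprint(chunk);
            if (checker.isUploadNeeded(fp)) {                         // local, then global, redundancy check
                cloud.upload(fp, chunk);                              // only unique chunks leave the client
            }
        }
    }
}
```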

Source deduplication:

In this section, we investigate how data redundancy, the space utilization efficiency of popular data chunking methods, and the computational overhead of typical hash functions change across different applications of personal computing, to motivate our research. We perform a preliminary experimental study on datasets collected from desktops in our research group, volunteers' personal laptops, personal workstations for image processing and financial analysis, and a shared home server. Table 1 outlines the key dataset characteristics: the number of devices, applications, and dataset size for each studied workload. To the best of our knowledge, this is the first systematic deduplication analysis on personal storage.

Data deduplication:

Data deduplication is a specialized data compression technique for eliminating duplicate copies of repeating data. Related and somewhat synonymous terms are intelligent (data) compression and single-instance (data) storage. This technique is used to improve storage utilization and can also be applied to network data transfers to reduce the number of bytes that must be sent. In the deduplication process, unique chunks of data, or byte patterns, are identified and stored during a process of analysis. As the analysis continues, other chunks are compared to the stored copy and whenever a match occurs, the redundant chunk is replaced with a small reference that points to the stored chunk. Given that the same byte pattern may occur dozens, hundreds, or even thousands of times (the match frequency is dependent on the chunk size), the amount of data that must be stored or transferred can be greatly reduced.
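
The "small reference that points to the stored chunk" can be pictured as a per-file recipe of fingerprints, as in the following sketch; ChunkStore and FileRecipe are hypothetical names used only for illustration:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Sketch: redundant chunks are replaced by fingerprint references kept in a per-file recipe. */
public class ChunkStore {
    private final Map<String, byte[]> chunks = new HashMap<String, byte[]>(); // fingerprint -> unique chunk

    /** Stores the chunk if unseen and returns its fingerprint, which acts as the reference. */
    public String put(String fingerprint, byte[] chunk) {
        if (!chunks.containsKey(fingerprint)) {
            chunks.put(fingerprint, chunk);     // first occurrence: keep the data
        }
        return fingerprint;                     // later occurrences: only the reference is kept
    }

    /** A file is recorded as an ordered list of chunk references. */
    public static class FileRecipe {
        public final List<String> chunkRefs = new ArrayList<String>();
    }

    /** Rebuilds the original byte stream from a recipe by following each reference. */
    public byte[] restore(FileRecipe recipe) {
        int total = 0;
        for (String ref : recipe.chunkRefs) {
            total += chunks.get(ref).length;
        }
        byte[] out = new byte[total];
        int pos = 0;
        for (String ref : recipe.chunkRefs) {
            byte[] chunk = chunks.get(ref);
            System.arraycopy(chunk, 0, out, pos, chunk.length);
            pos += chunk.length;
        }
        return out;
    }
}
```

Restoring a file then amounts to walking its recipe and copying each referenced chunk back in order.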

This type of deduplication is different from that performed by standard file-compression tools, such as LZ77 and LZ78. Whereas these tools identify short repeated substrings inside individual files, the intent of storage-based data deduplication is to inspect large volumes of data and identify large sections, such as entire files or large sections of files, that are identical, in order to store only one copy of them. This copy may be additionally compressed by single-file compression techniques. For example, a typical email system might contain 100 instances of the same 1 MB (megabyte) file attachment. Each time the email platform is backed up, all 100 instances of the attachment are saved, requiring 100 MB of storage space; with data deduplication, only one instance of the attachment is actually stored, and each subsequent instance is simply referenced back to that saved copy.

Application awareness:

Application awareness is the capacity of a system to maintain information about connected applications to optimize their operation and that of any subsystems that they run or control.

An application-aware network uses current information about applications connected to it, such as application state and resource requirements. That capacity is central to software-defined networking (SDN), enabling the network to efficiently allocate resources for the most effective operation of both applications and the network itself.

Application-aware storage systems rely upon built-in intelligence about relevant applications and their utilization patterns. Once the storage "understands" the applications and usage conditions, it is possible to optimize data layouts, caching behaviors, and quality of service (QoS) levels.

System Configuration:

HARDWARE REQUIREMENTS:

Hardware - Pentium

Speed - 1.1 GHz

RAM - 1 GB

Hard Disk - 20 GB

Floppy Drive - 1.44 MB

Key Board - Standard Windows Keyboard

Mouse - Two or Three Button Mouse

Monitor - SVGA

SOFTWARE REQUIREMENTS:

Operating System: Windows

Technology: Java and J2EE

Web Technologies: HTML, JavaScript, CSS

IDE: MyEclipse

Web Server: Tomcat

Toolkit: Android Phone

Database: MySQL

Java Version: J2SDK 1.5

CONCLUSION

In this paper, we propose ALG-Dedupe, an application-aware local-global source deduplication scheme for cloud backup in the personal computing environment that improves deduplication efficiency. The intelligent deduplication strategy in ALG-Dedupe is designed to exploit file semantics to minimize computational overhead and maximize deduplication effectiveness using application awareness. It combines local deduplication and global deduplication to balance the effectiveness and latency of deduplication. The proposed application-aware index structure can significantly relieve the disk index lookup bottleneck by dividing a central index into many independent small indices to optimize lookup performance. In our prototype evaluation, ALG-Dedupe is shown to improve the deduplication efficiency of state-of-the-art application-oblivious source deduplication approaches by a factor of 1.6X to 2.3X with very low system overhead, shorten the backup window size by 26 percent to 37 percent, improve power efficiency by more than a third, and save 41 percent to 64 percent of cloud cost for the cloud backup service. Compared with our previous local-deduplication-only design, AA-Dedupe, it can reduce cloud cost by 23 percent without increasing the backup window size. As a direction of future work, we plan to further optimize our scheme for other resource-constrained mobile devices, such as smartphones and tablets, and to investigate the secure deduplication issue in cloud backup services for the personal computing environment.