01
█ What is data?
In simple terms, data is a carrier of information. More precisely, data refers to the raw symbols recorded and stored from the objective world.
In the era we live in, when people refer to data, they usually mean the text, images, audio, and video files in computer systems, which are ultimately binary 0s and 1s.
The entire computer system, even the entire digital world, operates around data. The CPU computes data. Memory and hard drives store data. Communication networks transmit data.
Thus, in computer science, data is defined as: “The collective term for all symbols that can be input into a computer and processed by programs.”
It’s important to note that data itself has no inherent meaning; it is unprocessed “raw material.” Only after processing and analysis can data be transformed into meaningful information.
Others put it this way: information is the result of processing data, an interpretation that ascribes meaning to it. Although this may sound abstract, it captures the relationship between data and information precisely.
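To make the relationship concrete, here is a minimal sketch in Python (the temperature readings and the interpretation rule are invented for illustration):

```python
# Raw data: a bare list of temperature readings; by itself it carries no meaning
readings = [21.5, 22.0, 23.1, 24.8, 26.3, 27.0, 26.5, 25.2]

# Processing: compute summary statistics from the raw symbols
average = sum(readings) / len(readings)
peak = max(readings)

# Information: an interpretation that ascribes meaning to the data
print(f"Average temperature: {average:.1f} °C, peak: {peak:.1f} °C")
if peak > 26:
    print("Interpretation: a warm day; ventilation may be needed.")
```

The list of numbers at the top is data; the summary and the sentence at the end, which a person can act on, are information.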
02
█ Characteristics of Data
Data has many characteristics. By my rough count, there are fourteen:
- Symbolic: Data exists in symbolic form, such as numbers and, as mentioned earlier, text, images, audio, and video.
- Objectivity: Data reflects the attributes, status, and relationships of things in the real world. It exists objectively, and does not change according to subjective will.
- Quantifiable: Data usually exists in a quantifiable form, making it easier to count, measure, and analyze statistically.
- Comparability: Data can be compared and analyzed to reveal relationships and differences between data.
- Transmissibility: Data can be transmitted in various ways, such as through electronic documents, paper reports, etc., enabling information to be passed between different individuals or organizations.
- Storability: Data can be stored in databases, file systems, or other storage media for future access and use.
- Processability: Data can be processed through calculations, analysis, and manipulation to extract useful information or transform it into knowledge.
- Multidimensionality: Data can be observed and analyzed from multiple perspectives, such as time, space, categories, and other dimensions.
- Diversity: Data comes in various categories and forms, catering to different fields and needs.
- Timeliness: Data may change over time, and some data may lose its value or accuracy after a certain point.
- Reliability: Data should be reliable, meaning its source, collection method, and processing should be trustworthy to ensure accuracy.
- Relevance: Data items are related to one another, and changes in some data may affect others.
- Interpretability: Data should be interpretable and understandable, with the meaning behind it and its representation of the real world clearly defined.
- Restrictiveness: Data may be subject to limitations like privacy, legal, and ethical factors, and using data requires adherence to relevant regulations.
Not all these characteristics are always met in real situations.
For instance, pursuing timeliness may mean sacrificing some storage capacity, since collecting and processing real-time data demands more space and cost.
Similarly, to improve the reliability of data, more resources may be invested in data validation and cleansing, which may increase the complexity and time cost of data processing.
In short, data that meets more of these characteristics is considered high-quality data, and its value is greater. In practice, we need to balance the various characteristics of data according to specific scenarios and needs.
03
█ Data Classification
Earlier, we mentioned that data has the characteristic of diversity, meaning it comes in various forms and categories.
There are many ways to classify data. For example, the most common classification is by structure, which includes structured data, semi-structured data, and unstructured data.
Structured data refers to data that can be represented by a predefined data model or data that can be stored in a relational database. For example, the ages of all students in a class or the prices of all products in a supermarket are structured data.
Unstructured data is data without a predefined model, such as web articles, email content, images, audio, and video.
Semi-structured data falls between the two, in formats such as XML and JSON. It has some organizational structure, but not as strict as that of structured data.
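To make the three categories concrete, here is a minimal sketch in Python (the records are invented; JSON stands in for semi-structured data, as above):

```python
import json

# Structured data: a fixed schema, ready for a relational table (name, age)
students = [("Alice", 20), ("Bob", 21)]   # every row has the same columns

# Semi-structured data: self-describing fields that may vary between records
record = json.loads('{"name": "Alice", "age": 20, "hobbies": ["chess"]}')
print(record["name"], record.get("hobbies", []))  # schema is discovered, not fixed

# Unstructured data: no predefined model; meaning must be extracted separately
essay = "Some say data is the new oil, but raw text has no schema at all..."
```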
Currently, unstructured data accounts for the largest share. For example, in the internet domain, unstructured data already accounts for more than 80%.
Data can also be classified according to its source.
For instance, marketing data, business system data, production data from businesses; social content data, order data, user data from the internet industry; and social governance data, geographic data, economic data from government departments, etc.
According to the nature of the data, it can be further divided into location data (describing spatial positions, such as coordinates), qualitative data (describing attributes of things, like “rainy weather”), quantitative data (reflecting numerical characteristics, such as length and weight), and temporal data (recording time-related features, such as dates and timestamps).
In summary, each classification method has its specific application scenario and value.
Understanding how to classify data helps us better understand the nature of data and how to manage and utilize data effectively in different scenarios.
04
█ Data Measurement
Earlier, we also mentioned that data has the characteristic of being quantifiable. This means data can be measured.
The most common units for measuring data are KB, MB, GB, TB, and so on.
The data handled by traditional PCs and smartphones is typically in the GB/TB range. For instance, our hard drives are commonly 1TB/2TB/4TB in capacity.
Above TB, we have PB, EB, ZB, and so on.
The relationships between these units are as follows. Each step is a factor of 1024, the common binary convention (strictly speaking, SI prefixes denote powers of 1000, and the 1024-based units are properly written KiB, MiB, and so on):
1 KB = 1024 B (KB – kilobyte)
1 MB = 1024 KB (MB – megabyte)
1 GB = 1024 MB (GB – gigabyte)
1 TB = 1024 GB (TB – terabyte)
1 PB = 1024 TB (PB – petabyte)
1 EB = 1024 PB (EB – exabyte)
1 ZB = 1024 EB (ZB – zettabyte)
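In code, walking up this ladder is just repeated division by 1024. A minimal sketch, with a helper name (humanize_bytes) of my own choosing:

```python
# Convert a raw byte count into a human-readable unit (binary convention)
UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB"]

def humanize_bytes(n: float) -> str:
    for unit in UNITS:
        if n < 1024 or unit == UNITS[-1]:
            return f"{n:.1f} {unit}"
        n /= 1024  # one step up the ladder

print(humanize_bytes(3_500_000_000))  # -> 3.3 GB
print(humanize_bytes(1024 ** 7))      # -> 1.0 ZB
```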
At first glance, these letters may not seem very intuitive. Let me give you an example.
1TB can be stored on a single hard drive. It holds about 200,000 photos, 200,000 MP3 songs, or 200,000 eBooks.
1PB requires about two server racks. It holds about 200 million photos or 200 million MP3 songs. If someone continuously listens to this music, it would take almost two thousand years.
1EB requires about 2,000 server racks. Placed side by side, they would stretch about 1.2 kilometers; housed in a data center, they would occupy floor space equal to about 21 standard basketball courts.
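These illustrations are easy to sanity-check with a rough calculation, assuming about 5 MB per photo or MP3 file and about 4 minutes per song (my assumptions, not figures from the text):

```python
TB = 1024 ** 4           # bytes in a terabyte (binary convention)
PB = 1024 ** 5           # bytes in a petabyte
FILE_SIZE = 5 * 10**6    # assume ~5 MB per photo or MP3 file

print(f"Files per TB: {TB // FILE_SIZE:,}")   # ~219,902 -> "about 200,000"

songs_per_pb = PB // FILE_SIZE                # ~225 million songs
minutes = songs_per_pb * 4                    # assume ~4 minutes per song
print(f"Listening time: {minutes / (60 * 24 * 365):,.0f} years")  # ~1,714 years
```

The results land close to the figures above: about 200,000 files per terabyte, and well over a thousand years of continuous listening for a petabyte of music.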
Internet giants like Alibaba, Baidu, and Tencent are said to have data close to the EB level. Currently, the total amount of data generated by humanity is at the ZB level.
According to IDC, in 2020, the total amount of data created, captured, replicated, and consumed globally was about 64ZB. By 2025, the global data volume may reach an astounding 163ZB. If we were to build a data center to store this data, its area would be larger than 196 Bird’s Nest stadiums.
05
█ Data Generation Stages
The volume of data in human society is not only large but growing rapidly, by roughly 50% every year; at that rate it more than doubles every two years (1.5 × 1.5 = 2.25).
Why is it growing so fast?
To understand this, we must look back at the three important stages of data generation in human society.
The first stage was from 1940 to 1990.
After the invention of computers and databases, the complexity of data management dropped sharply. Various industries began generating computer data, which was recorded in databases. The data produced in this period was mostly the structured data discussed earlier, and it was generated passively.
The second stage was from 1990 to 2010.
With the explosion of the internet, online content began growing rapidly, at first mostly professionally generated content (PGC). After Web 2.0, people began using blogs, Facebook, YouTube, and other social networks to produce large amounts of user-generated content (UGC), actively creating massive volumes of data. The arrival of mobile smart devices accelerated this stage even further.
The third stage has been from 2010 to the present.
With the development of the Internet of Things (IoT), all kinds of sensing endpoints (such as ubiquitous sensors and cameras) began to generate large amounts of data automatically. Digital transformation has also led businesses to build numerous systems that accumulate and manage data. The total volume of human data has jumped once again.
After going through the three stages of “passive-active-automatic” development, human data volume has exploded.
It’s worth mentioning that as we enter the AI era, we may soon see a fourth stage of data explosion: intelligent machines producing AI-generated content (AIGC) at an ever-increasing rate.
06
█ The Role and Significance of Data
Data is the carrier of information. Its most basic role is to record and represent.
For example, attendance data records employees’ work hours, leave, tardiness, and early departures. This data not only tells us about employees’ attendance but can also be analyzed further to assess work efficiency, teamwork, and potential management issues.
Another example is health check data, which records our height, weight, blood pressure, blood sugar, and other physiological indicators. This data helps us understand our health status, identify potential health problems, and provides critical information for disease prevention and treatment.
Beyond personal work and life, science, business, and public administration have their own systems and data. The volumes there are far larger, often reaching the scale of big data.
By deeply mining and analyzing vast amounts of data, businesses and government departments can uncover hidden patterns and trends behind the data, providing strong support for future development and decision-making.
In the scientific field, experimental data, observational data, simulation data, etc., form the foundation of scientific research. These data not only help scientists verify theories and discover new phenomena but also promote scientific and technological progress and innovation.
For example, astronomical observation data in astronomy records the movement of galaxies, the birth and death of stars, and other cosmic phenomena. These data provide critical clues for understanding the origin and evolution of the universe.
In the business world, sales data, customer data, and market data are key to business operations and decision-making. By analyzing these data, businesses can understand market demand, optimize product design, and improve customer satisfaction, allowing them to formulate more accurate market strategies and business plans.
For instance, e-commerce platforms analyze users’ purchase history and browsing behavior to recommend products that better meet their needs, enhancing both user experience and platform sales.
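As a toy illustration of that idea, here is a minimal co-occurrence recommender in Python; real platforms use far more sophisticated models, and the purchase histories below are invented:

```python
from collections import Counter
from itertools import combinations

# Invented purchase histories: one set of product IDs per user
histories = [
    {"phone", "case", "charger"},
    {"phone", "case"},
    {"laptop", "mouse"},
    {"phone", "charger"},
]

# Count how often each pair of products is bought together
pair_counts = Counter()
for basket in histories:
    for a, b in combinations(sorted(basket), 2):
        pair_counts[(a, b)] += 1

def recommend(product: str, top_n: int = 2) -> list:
    """Suggest the items most often co-purchased with `product`."""
    scores = Counter()
    for (a, b), count in pair_counts.items():
        if a == product:
            scores[b] += count
        elif b == product:
            scores[a] += count
    return [item for item, _ in scores.most_common(top_n)]

print(recommend("phone"))  # e.g. ['case', 'charger']
```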
In public management, government data, public service data, social survey data, etc., are the basis for policy-making and implementation. This data helps governments understand the current state of society, predict future trends, and provides a basis for policy evaluation and optimization.
For example, by analyzing traffic flow data, the government can plan traffic routes more effectively and optimize public transport services, thereby alleviating urban traffic congestion.
07
█ Final Thoughts
In conclusion, data has become an essential intangible asset in this era, often referred to as the “new oil.”
From personal lives to global governance, data plays an indispensable role and has become the core resource driving efficiency, scientific discovery, and social progress.
The recent surge in AI further enhances the value of data. One of the three key elements of artificial intelligence is data (the other two are computing power and algorithms). As the “fuel” for AI, the quality and quantity of data directly determine the performance and accuracy of AI systems.
In the future, as data scales grow exponentially and technology continues to advance, the value of data will be further unleashed.
Well, that’s the basic knowledge about data. Everyone clear on it?