Images – A grid of pixels, usually with three RGB color channels (a 3D array of height × width × channels, a.k.a. a tensor) – can be raveled/unrolled/flattened into a longer 1D array.
Sound – when digitally recorded, a long series of numbers (a 1D array, with x = time and y = amplitude of the sound) – usually represented by its frequencies (the frequency domain) for analysis.
Text – represented through parts of speech, embeddings, or word frequencies.
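For the Images item, the sketch below shows an RGB image as a 3D NumPy array and flattens it to 1D. The 4 × 5 image of random pixel values is a made-up example, and the use of NumPy's `ravel` is an assumption about tooling, not something the notes specify.

```python
import numpy as np

# Hypothetical 4x5 RGB image: height x width x 3 channels (a 3D array / tensor)
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(4, 5, 3), dtype=np.uint8)

# ravel() unrolls the array into 1D (a view when possible); reshape(-1) is equivalent
flat = image.ravel()

print(image.shape)  # (4, 5, 3)
print(flat.shape)   # (60,)

# Flattening is reversible as long as the original shape is kept
restored = flat.reshape(image.shape)
```

The 1D length is simply height × width × channels (4 × 5 × 3 = 60).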
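For the Sound item, here is a minimal sketch of going from a 1D amplitude array to frequencies with a Fourier transform. A synthetic 440 Hz sine wave stands in for a real recording, and the 8000 Hz sample rate is an assumed value chosen for illustration.

```python
import numpy as np

sample_rate = 8000  # samples per second (assumed)
duration = 1.0      # seconds
t = np.arange(int(sample_rate * duration)) / sample_rate  # time axis (x)

# 1D array of amplitudes (y): a pure 440 Hz tone standing in for a recording
signal = np.sin(2 * np.pi * 440 * t)

# Real FFT: magnitude of each frequency component of the signal
spectrum = np.abs(np.fft.rfft(signal))
freqs = np.fft.rfftfreq(len(signal), d=1 / sample_rate)

# The strongest frequency component should be the tone we generated
dominant = freqs[np.argmax(spectrum)]
print(signal.shape)  # (8000,)
print(dominant)      # approximately 440
```

With one second of audio the frequency bins are 1 Hz apart, so the 440 Hz tone lands cleanly in a single bin.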
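For the Text item, the word-frequency idea can be sketched with the standard library alone. The sample sentence and the simple regex tokenizer are illustrative assumptions; parts of speech and embeddings usually need trained models and are not shown here.

```python
import re
from collections import Counter

text = "The cat sat on the mat. The mat was flat."

# Lowercase and split into word tokens (a deliberately simple tokenizer)
tokens = re.findall(r"[a-z']+", text.lower())

# Count how often each word occurs
counts = Counter(tokens)
print(counts.most_common(2))  # [('the', 3), ('mat', 2)]

# A fixed, sorted vocabulary turns the counts into a numeric vector (bag of words)
vocab = sorted(counts)
vector = [counts[w] for w in vocab]
```

Each position in `vector` corresponds to one word in `vocab`, which gives text the same kind of fixed-length numeric array as the flattened image above.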