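Python 3 - Partial solution (760 742 734 710 705 657 chars)
(Last edit, I promise)
This seems like a really, really hard problem (especially recognizing where the notes start or end). Automatic transcription of music appears to be an open research topic (not that I know anything about it). So here is a partial solution that does no note segmentation (e.g. it prints "Twinkle" all at once when it hears the frequency) and probably only works for that specific ogg file: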
A=-52
F=44100
C=4096
import pyaudio as P
import array
import scipy.signal as G
import numpy as N
import math
L=math.log
i=0
j=[9,2,0,2,4,5,7,9]
k=[2,4,5,7]
n=j+k+k+j
w="Twinkle, |twinkle, |little |star,\n|How I |wonder |what you |are.\n|Up a|bove the |world so |high,\n|Like a |diamond |in the |sky.\n".split('|')
w+=w[:8]
e=P.PyAudio().open(F,1,8,1,0,None,0,C)
while i<24:
 g=array.array('h',e.read(C));b=sum(map(abs,g))/C
 if b>0 and 20*L(b/32768,10)>A:
  f=G.fftconvolve(g,g[::-1])[C:];d=N.diff(f);s=0
  while d[s]<=0:s+=1
  x=N.argmax(f[s:])+s;u=f[x-1];v=f[x+1]
  if int(12*L(((u-v)/2/(u-2*f[x]+v)+x)*F/C/440,2))==n[i]+15:print(w[i],end='',flush=1);i+=1
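This requires...
Changing A=-52 (the minimum amplitude) on the top line depending on your microphone, the amount of ambient sound, how loud the song is playing, and so on. On my microphone, anything below -57 seems to pick up a lot of extraneous noise, and anything above -49 requires playing the song very loudly.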
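To get a feel for what a threshold like -52 means: the code compares it against the mean absolute sample value, expressed in decibels relative to 16-bit full scale (32768). The following is a minimal sketch of that calculation only. The name level_dbfs and the two sample buffers are made up for illustration and are not part of the program above.

```python
import math

FULL_SCALE = 32768  # magnitude of the largest 16-bit sample


def level_dbfs(samples):
    """Mean absolute level of 16-bit samples, in dB relative to full scale."""
    if not samples:
        return float("-inf")
    mean_abs = sum(abs(s) for s in samples) / len(samples)
    if mean_abs == 0:
        return float("-inf")  # digital silence has no finite dB level
    return 20 * math.log10(mean_abs / FULL_SCALE)


# Hypothetical buffers: a very quiet one and a fairly loud one.
quiet = [82, -82] * 2048
loud = [8192, -8192] * 2048

THRESHOLD = -52  # same role as A in the golfed code
for name, buf in [("quiet", quiet), ("loud", loud)]:
    level = level_dbfs(buf)
    print(name, round(level, 2), "accepted" if level > THRESHOLD else "ignored")
```

A mean level of about 82 out of 32768 sits right around -52 dB, which gives an idea of how quiet the accepted signal can be. Raising the threshold towards -49 only lets through louder buffers.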
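This could be golfed a LOT more. In particular, I'm sure there are ways to save a bunch of characters on the words array. This is my first non-trivial program in Python, so I'm not very familiar with the language yet.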
I stole the code for frequency detection via autocorrelation from https://gist.github.com/endolith/255291
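For readers unfamiliar with the technique, here is a rough, dependency-free sketch of pitch detection by autocorrelation: correlate the buffer with shifted copies of itself, skip the lobe around zero lag, take the highest remaining peak, and refine it with a parabolic fit. It uses a slow direct sum rather than the FFT-based convolution in the answer, and autocorr_pitch is a hypothetical name. It also converts the peak lag to Hz as sample_rate / lag, which is the usual convention, whereas the code in this answer scales the lag by F/C.

```python
import math


def autocorr_pitch(samples, sample_rate):
    """Estimate the fundamental frequency (Hz) of a periodic buffer."""
    n = len(samples)
    max_lag = n // 2
    # Autocorrelation for non-negative lags (direct sum, O(n^2)).
    corr = [sum(samples[t] * samples[t + lag] for t in range(n - lag))
            for lag in range(max_lag + 1)]
    # Walk down the zero-lag lobe until the correlation starts rising.
    start = 1
    while start < max_lag and corr[start] <= corr[start - 1]:
        start += 1
    if start >= max_lag:
        raise ValueError("no periodic peak found")
    # Highest peak after the first low point.
    peak = max(range(start, max_lag), key=corr.__getitem__)
    # Parabolic interpolation around the peak for sub-sample accuracy.
    a, b, c = corr[peak - 1], corr[peak], corr[peak + 1]
    denom = a - 2 * b + c
    offset = 0.5 * (a - c) / denom if denom else 0.0
    return sample_rate / (peak + offset)


# Synthetic test tone: a pure 440 Hz sine, 1024 samples at 44.1 kHz.
rate = 44100
tone = [math.sin(2 * math.pi * 440 * t / rate) for t in range(1024)]
print(round(autocorr_pitch(tone, rate), 1))
```

On a clean sine this should land within a few Hz of the true pitch. Real recordings with overtones and noise are much less forgiving, which is part of why this answer only claims to work on one specific file.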
Ungolfed:
import pyaudio
from array import array
import scipy.signal
import numpy
import math
import sys
MIN_AMPLITUDE = -52
FRAMERATE = 44100
def first(values):
    # Index of the first positive element, or 0 if there is none
    for i in range(len(values)):
        if values[i] > 0:
            return i
    return 0
# Based on: https://en.wikipedia.org/wiki/Decibel#Acoustics
def getAmplitude(sig):
    total = 0
    elems = float(len(sig))
    for x in sig:
        total += numpy.abs(x) / elems
    if total == 0:
        return -99
    else:
        return 20 * math.log(total / 32768., 10)
# Based on: https://en.wikipedia.org/wiki/Piano_key_frequencies
def getNote(freq):
    return int(12 * math.log(freq / 440, 2) + 49)
# --------------------------------------------------------------------------
# This is stolen straight from here w/ very slight modifications: https://gist.github.com/endolith/255291
def parabolic(f, x):
    return 1/2. * (f[x-1] - f[x+1]) / (f[x-1] - 2 * f[x] + f[x+1]) + x
def getFrequency(sig):
    # Calculate autocorrelation (same thing as convolution, but with
    # one input reversed in time), and throw away the negative lags
    corr = scipy.signal.fftconvolve(sig, sig[::-1], mode='full')
    # Integer division: a float slice index raises TypeError in Python 3
    corr = corr[len(corr)//2:]
    # Find the first low point
    diffs = numpy.diff(corr)
    # Find the next peak after the low point (other than 0 lag). This bit is
    # not reliable for long signals, due to the desired peak occurring between
    # samples, and other peaks appearing higher.
    # Should use a weighting function to de-emphasize the peaks at longer lags.
    start = first(diffs)
    peak = numpy.argmax(corr[start:]) + start
    return parabolic(corr, peak) * (FRAMERATE / len(sig))
# --------------------------------------------------------------------------
# These are the wrong keys (i.e. it is detecting middle C as an A), but I'm far too lazy to figure out why.
# Anyway, these are what are detected from the Wikipedia .ogg file:
notes = [73, 66, 64, 66, 68, 69, 71, 73, 66, 68, 69, 71, 66, 68, 69, 71]
words = ["Twinkle, ", "twinkle, ", "little ", "star,\n", "How I ", "wonder ", "what you ", "are.\n", "Up a", "bove the ", "world so ", "high,\n", "Like a ", "diamond ", "in the ", "sky.\n"]
notes += notes[:8]
words += words[:8]
pa = pyaudio.PyAudio()
stream = pa.open(format=pyaudio.paInt16, channels=1, rate=FRAMERATE, input=True, frames_per_buffer=4096)
idx = 0
while idx < len(notes):
    # Read signal
    sig = array('h', stream.read(4096))
    if getAmplitude(sig) > MIN_AMPLITUDE:
        note = getNote(getFrequency(sig))
        if note == notes[idx]:
            sys.stdout.write(words[idx])
            sys.stdout.flush()
            idx += 1
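A possible explanation for the "wrong keys": getFrequency multiplies the autocorrelation peak lag by FRAMERATE / len(sig), while the period-to-frequency conversion would normally be FRAMERATE / lag. The sketch below works through that arithmetic for ideal tones in standard tuning. It assumes the peak sits exactly one period away from zero lag, and key_number and as_reported are hypothetical helper names; nothing here was checked against the actual recording.

```python
import math

FRAMERATE = 44100
CHUNK = 4096  # buffer length used by the program


def key_number(freq):
    """Piano key number (A4 = 440 Hz = key 49), truncated like getNote."""
    return int(12 * math.log(freq / 440, 2) + 49)


def as_reported(true_freq):
    """Value the program's scaling would produce for an ideal tone."""
    lag = FRAMERATE / true_freq        # period in samples
    return lag * FRAMERATE / CHUNK     # lag * (FRAMERATE / len(sig))


# name, true frequency in Hz
for name, freq in [("C4", 261.63), ("G4", 392.00), ("A4", 440.00)]:
    print(name, key_number(freq), key_number(as_reported(freq)))
# C4 40 73
# G4 47 66
# A4 49 64
```

Under these assumptions middle C comes out as key 73 (an A), and G4 and A4 come out as 66 and 64, which matches the start of the notes table. So the table may simply be the real melody seen through that scaling; because a longer lag gives a larger reported value, lower notes end up with higher key numbers.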