I have just copy-pasted together a script that retrieves YouTube URLs from 4chan/8chan music-sharing threads and writes them to a playlist file. I can then open the playlist with a program like VLC.
import sys
import re
import urllib.request
from collections import OrderedDict
if len(sys.argv) < 3:
    print("usage: python3 " + sys.argv[0] + " url playlist.m3u")
    sys.exit(1)
# I need to fake my User-Agent; otherwise I get a 403
req = urllib.request.Request(
    sys.argv[1],
    data=None,
    headers={
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'
    }
)
response = urllib.request.urlopen(req)
html = response.read().decode('utf-8')
matches = re.findall(r'https?://www\.youtube\.com/watch\?v=[A-Za-z0-9._%+-]+', html)
urls = list(OrderedDict.fromkeys(matches))
with open(sys.argv[2], 'w') as f:
    f.write("\n".join(urls))
Yes, I know, using regex to parse HTML is horrible. Please tell me what I should do instead. Other feedback is welcome.
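
For context, the closest regex-free alternative I can think of is the standard library's `html.parser` combined with `urllib.parse`. Below is a rough sketch, not a tested replacement. It assumes the links appear as `href` attributes of `<a>` tags, which I have not verified for every board; URLs posted as plain text would be missed. The names `LinkCollector` and `youtube_urls` are made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse, parse_qs


class LinkCollector(HTMLParser):
    """Collects the href attribute of every <a> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def youtube_urls(html):
    """Return unique YouTube watch URLs found in anchor tags, in page order."""
    collector = LinkCollector()
    collector.feed(html)
    seen = {}  # dicts keep insertion order, so this also deduplicates
    for href in collector.hrefs:
        parts = urlparse(href)
        # Assumption: only www.youtube.com/watch links are of interest
        if parts.netloc == "www.youtube.com" and parts.path == "/watch":
            video_ids = parse_qs(parts.query).get("v")
            if video_ids:
                seen["https://www.youtube.com/watch?v=" + video_ids[0]] = None
    return list(seen)
```

In the script above, `urls = youtube_urls(html)` would stand in for the `re.findall` and `OrderedDict` lines. Normalizing each link to its video ID also collapses duplicates that differ only in extra query parameters such as a timestamp.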