• PYTHON > equivalents of the Linux commands ( wget - grep - sed )

      wget

      urllib.request should work. Set it up in a while (not done) loop, check whether a local file already exists, and if it does, send a GET with a Range header specifying how far you got in downloading it. Use read() to append to the local file until an error occurs.
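A minimal sketch of that loop, assuming the server honors Range requests (the function names and the chunk size are mine, not from the original answer):

```python
import os
import urllib.request

def resume_offset(localfile):
    # How many bytes we already have on disk (0 if there is no partial file).
    return os.path.getsize(localfile) if os.path.exists(localfile) else 0

def resume_download(url, localfile, chunk_size=8192):
    start = resume_offset(localfile)
    req = urllib.request.Request(url)
    if start:
        # Ask the server only for the bytes we are still missing.
        req.add_header("Range", "bytes=%d-" % start)
    # Append mode, so a partial file keeps growing where it left off.
    with urllib.request.urlopen(req) as resp, open(localfile, "ab") as f:
        while True:
            chunk = resp.read(chunk_size)
            if not chunk:
                break
            f.write(chunk)
```

Wrap the resume_download call itself in the retry loop; a production version should also check that the response status is 206 (Partial Content) before appending, since a server that ignores Range will resend the whole file.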

      This is also potentially a duplicate of "Python urllib2 resume download doesn’t work when network reconnects".

       

      When I try urllib.request.urlopen or urllib.request.Request with a string containing the URL as the url argument, I get ValueError: unknown url type.

       

      There is also a nice Python module named wget (the python-wget project on Bitbucket) that is pretty easy to use. This demonstrates the simplicity of the design:

       

      >>> import wget
      >>> url = 'http://www.futurecrew.com/skaven/song_files/mp3/razorback.mp3'
      >>> filename = wget.download(url)
      100% [................................................] 3841532 / 3841532
      >>> filename
      'razorback.mp3'

       

      However, if wget doesn’t work (I’ve had trouble with certain PDF files), try this solution.

       

      Edit: You can also use the out parameter to write to a custom output directory instead of the current working directory.

       

      >>> output_directory = <directory_name>
      >>> filename = wget.download(url, out=output_directory)
      >>> filename
      'razorback.mp3'

       

       

      Sorry for the late reply; I didn’t see this notification for some reason. You most likely need to pip install wget.

       

    • Try sudo apt-get install python3-wget.
    • bitbucket.org/techtonik/python-wget
    • import urllib.request
      import urllib.error
      
      attempts = 0
      
      while attempts < 3:
         try:
            response = urllib.request.urlopen("http://example.com", timeout=5)
            content = response.read()
            with open("local/index.html", "wb") as f:
                f.write(content)
            break
         except urllib.error.URLError as e:
            attempts += 1
            print(type(e))

       

      I had to do something like this on a version of Linux that didn’t have the right options compiled into wget. This example downloads the memory-analysis tool ‘guppy’. I’m not sure whether it matters, but I kept the target file’s name the same as the name in the URL…

      Here’s what I came up with:

       

      python -c "import requests; r = requests.get('https://pypi.python.org/packages/source/g/guppy/guppy-0.1.10.tar.gz') ; open('guppy-0.1.10.tar.gz' , 'wb').write(r.content)"

       

      That’s the one-liner; here it is a little more readable:

       

      import requests

      fname = 'guppy-0.1.10.tar.gz'
      url = 'https://pypi.python.org/packages/source/g/guppy/' + fname
      r = requests.get(url)
      with open(fname, 'wb') as f:
          f.write(r.content)

       

      This worked for downloading a tarball; I was able to extract and use the package after downloading it.
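For the extraction step, the stdlib tarfile module avoids shelling out to tar; a small sketch (the function name is mine):

```python
import tarfile

def extract_tarball(fname, dest="."):
    # "r:*" autodetects the compression, so .tar.gz and .tar.bz2 both work.
    with tarfile.open(fname, "r:*") as tar:
        tar.extractall(dest)
```

For example, extract_tarball('guppy-0.1.10.tar.gz') unpacks the archive into the current directory.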

       

      To address a question, here is an implementation with a progress bar printed to STDOUT. There is probably a more portable way to do this without the clint package, but this was tested on my machine and works fine:

       

      #!/usr/bin/env python
      
      from clint.textui import progress
      import requests
      
      fname = 'guppy-0.1.10.tar.gz'
      url = 'https://pypi.python.org/packages/source/g/guppy/' + fname
      
      r = requests.get(url, stream=True)
      with open(fname, 'wb') as f:
          total_length = int(r.headers.get('content-length'))
          for chunk in progress.bar(r.iter_content(chunk_size=1024), expected_size=(total_length/1024) + 1):
              if chunk:
                  f.write(chunk)
                  f.flush()
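As hinted above, a more portable version needs nothing beyond the standard library; a sketch with urllib.request standing in for requests (the helper names are mine):

```python
import sys
import urllib.request

def progress_line(done, total):
    # One line of the progress display; total may be 0 when the server
    # sends no Content-Length header.
    pct = done * 100 // total if total else 0
    return "\r%d%% [%d / %d]" % (pct, done, total)

def download_with_progress(url, fname, chunk_size=8192):
    with urllib.request.urlopen(url) as resp, open(fname, "wb") as f:
        total = int(resp.headers.get("Content-Length", 0))
        done = 0
        while True:
            chunk = resp.read(chunk_size)
            if not chunk:
                break
            f.write(chunk)
            done += len(chunk)
            # \r rewinds to the start of the line, so the bar updates in place.
            sys.stdout.write(progress_line(done, total))
            sys.stdout.flush()
        sys.stdout.write("\n")
```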

       

      Any way to show the progress of the download?

      easy as py:

       

      class Downloader():
          def download_manager(self, url, destination='Files/DownloaderApp/', try_number="10", time_out="60"):
              #threading.Thread(target=self._wget_dl, args=(url, destination, try_number, time_out)).start()
              return self._wget_dl(url, destination, try_number, time_out) == 0

          def _wget_dl(self, url, destination, try_number, time_out):
              import subprocess
              command = ["wget", "-c", "-P", destination, "-t", try_number, "-T", time_out, url]
              try:
                  download_state = subprocess.call(command)
              except Exception as e:
                  print(e)
                  return -1
              # download_state == 0 means the download succeeded
              return download_state

      FYI: This won’t work on Windows, as the wget command isn’t available there.
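If portability matters, you can at least detect up front whether a wget binary is on the PATH, using only the standard library (the function name is mine):

```python
import shutil

def wget_available():
    # shutil.which works on Linux, macOS and Windows alike; it returns the
    # full path of the executable, or None if wget is not on the PATH.
    return shutil.which("wget") is not None
```

Calling wget_available() before building the subprocess command lets the class above fail fast with a clear message instead of raising FileNotFoundError.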

      Let me improve the example with threads, in case you want to download many files.

       

      import random
      import threading
      
      import requests
      from clint.textui import progress
      
      # You must define a proxy list;
      # I suggest https://free-proxy-list.net/
      proxies = [
          {'http': 'http://34.208.47.183:80'},
          {'http': 'http://40.69.191.149:3128'},
          {'http': 'http://104.154.205.214:1080'},
          {'http': 'http://52.11.190.64:3128'}
      ]
      
      # You must define the list of files you want to download
      videos = [
          "https://i.stack.imgur.com/g2BHi.jpg",
          "https://i.stack.imgur.com/NURaP.jpg"
      ]
      
      threads = []
      
      def download(video, selected_proxy):
          print("Downloading file named {} via proxy {}...".format(video, selected_proxy))
          r = requests.get(video, stream=True, proxies=selected_proxy)
          # The last path segment of the URL is the local file name
          filename = video.split("/")[-1]
          with open(filename, 'wb') as f:
              total_length = int(r.headers.get('content-length'))
              for chunk in progress.bar(r.iter_content(chunk_size=1024), expected_size=(total_length / 1024) + 1):
                  if chunk:
                      f.write(chunk)
                      f.flush()
      
      for video in videos:
          selected_proxy = random.choice(proxies)
          t = threading.Thread(target=download, args=(video, selected_proxy))
          threads.append(t)
      
      for t in threads:
          t.start()
    • This does none of the things OP asked for (and several things they didn’t ask for).
    • The example tries to show wget’s multi-download feature
    • No one asked for that. OP asked for the equivalent of -c, --read-timeout=5, and --tries=0 (with a single URL).
    • I’m really glad to see it here, serendipity being the cornerstone of the internet. I might add, whilst here, that during my research I came across requests-threads for multithreading with the requests library: github.com/requests/requests-threads
    • A solution that I often find simpler and more robust is to simply execute a terminal command within python. In your case:

       

      import os
      url = 'https://www.someurl.com'
      os.system(f'wget -c --read-timeout=5 --tries=0 "{url}"')
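A variant worth considering: passing the arguments as a list through subprocess avoids the shell-quoting problem entirely (the helper name is mine):

```python
import shutil
import subprocess

def wget_command(url):
    # Same flags as the os.system() call above, but as an argument list:
    # no shell is involved, so quotes or spaces in the URL cannot break it.
    return ["wget", "-c", "--read-timeout=5", "--tries=0", url]

url = "https://www.someurl.com"
# Uncomment to actually download (wget must be on the PATH):
# if shutil.which("wget"):
#     subprocess.run(wget_command(url), check=False)
```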

       

      grep

      import re
      import sys
      
      file = open(sys.argv[2], "r")
      
      for line in file:
           if re.search(sys.argv[1], line):
               print(line, end='')

      search instead of match, to find the pattern anywhere in the string

      end='' in print() suppresses the extra newline (line already ends with one)

      argv includes the Python file name, so arguments start at index 1

      This doesn’t handle multiple arguments (like grep does) or expand wildcards (like the Unix shell would). If you wanted this functionality you could get it using the following:

       

      import re
      import sys
      import glob
      
      for arg in sys.argv[2:]:
          for file in glob.iglob(arg):
              for line in open(file, 'r'):
                  if re.search(sys.argv[1], line):
                      print(line, end='')

      You should compile your regex before entering the loops.

    • This has two downvotes and I have no idea why. Anyone who downvoted want to leave a comment? I know you could add regex compilation etc., but I thought that would detract from the clarity of the answer. I don’t think there is anything incorrect, and I have actually run the code, unlike some of the other answers.
    • This answer was perfect for me, thanks. Just another quick question: how would I print if no matches were found?
    • "you should compile your regex before using the loops." No, Python will compile and cache it on its own; that is a common myth. It is a nice thing to do for readability reasons, though.
    • The reasonable answer to the natural question is "Because the code is part of a much larger Python script, and who wants to call out to grep in such a case?" In short, I’m glad this question is here because I’m replacing a bash script with a Python script that is hopefully easier on the system.
    • Concise and memory efficient:

       

      #!/usr/bin/env python
      # file: grep.py
      import re, sys
      
      sys.stdout.writelines(l for l in sys.stdin if re.search(sys.argv[1], l))

       

      It works like egrep (without too much error handling), e.g.:

       

      cat file-to-be-searched | ./grep.py "RE"

       

      And here is the one-liner:

       

      cat file-to-be-searched | python -c "import re,sys; sys.stdout.writelines(l for l in sys.stdin if re.search(sys.argv[1],l))" "RE"

       

      Adapted from a grep in python.

      Accepts a list of filenames via [2:], does no exception handling:

       

      #!/usr/bin/env python
      import re, sys, os
      
      for f in filter(os.path.isfile, sys.argv[2:]):
          for line in open(f):
              if re.match(sys.argv[1], line):
                  print(line, end='')

       

      sys.argv[1] and sys.argv[2:] work if you run it as a standalone executable, i.e. after chmod +x

       

      What’s the difference between re.match and re.search?

      see Nick Fortescue’s top answer: "search instead of match to find anywhere in string"

      1. Use sys.argv to get the command-line parameters
      2. Use open() and read() to manipulate the file
      3. Use the Python re module to match lines

      sed

      You might be interested in pyp. Citing my other answer:

      "The Pyed Piper", or pyp, is a linux command line text manipulation tool similar to awk or sed, but which uses standard python string and list methods as well as custom functions evolved to generate fast results in an intense production environment.

      The real problem is that the variable line always has a value. The test for "no matches found" is whether any match occurred, so the code "if line == None:" should be replaced with an "else:" clause.
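Concretely, that fix can be sketched with a flag (the function name and the message text are mine):

```python
import re
import sys

def grep(pattern, lines, out=sys.stdout):
    # Track whether anything matched so we can report "no matches" afterwards.
    matched = False
    regex = re.compile(pattern)
    for line in lines:
        if regex.search(line):
            out.write(line)
            matched = True
    if not matched:
        out.write("no matches found\n")
    return matched
```

For example, grep("foo", open("file.txt")) prints the matching lines, or "no matches found" when there are none.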

      There is the pyp project, which we owe to Sony Pictures Imageworks, who needed to simplify the automation of build tasks for their films.

      And there is pyped, which I briefly discussed here (an article that deserves an update, since I have replaced dateutils with arrow).

      Both were nice, but they had convoluted syntaxes, were not Python 3 compatible, and lacked auto-imports. However, pyped recently reached v1.0, so it is stable, and it has a brand-new approach to syntax that makes the little beast very pleasant to use.

      Stdin, line by line

      Installation is just pip:

       

      pip install --user pyped

       

      And with that you get the pyp command. It is mostly used after another command. Typically:

       

      cat /etc/fstab | pyp "something"

       

      The trick is that "something" can be any Python expression. Generally an expression that print()s something.

      Pyped automatically makes two variables available to that expression:

    • The current line, in the variable x.
    • The current line number, in the variable i.
    • The Python expression is called once for each line.

      For example, suppose I have a file "fortune.txt" containing:

       

      bitcoin (btc) : 5
      euros (€) : 100
      dollars ($) : 80

       

      If I want to uppercase everything, I do:

       

      cat fortune.txt | pyp "print(x.upper())"
      BITCOIN (BTC) : 5
      EUROS (€) : 100
      DOLLARS ($) : 80

       

      You can chain several expressions. So, if I want to keep only the amount and the symbol:

       

      cat fortune.txt | pyp "devise, sign, _, value = x.split()" "sign = sign.strip('()')" "print('%s%s' % (value, sign))"
      5btc
      100€
      80$

       

      OK, it is longer than perl, but much easier to write and to reread. And I am using a language I already know. And there is no need for an incomprehensible mix of sed, awk and cut.

      If I really need readability, I can even put it on several lines:

       

      cat fortune.txt | pyp "
      devise, sign, _, value = x.split()
      sign = sign.strip('()')
      print('%s%s' % (value, sign))
      "
      5btc
      100€
      80$
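For comparison, here is what that pipeline does as plain Python (a sketch; the function name convert is mine, the field names follow the pyp example):

```python
def convert(lines):
    # Same transformation as the pyp pipeline above:
    # keep the amount and the currency symbol from each line.
    out = []
    for x in lines:
        devise, sign, _, value = x.split()
        out.append(value + sign.strip('()'))
    return out
```

Fed the lines of fortune.txt, convert returns the same "5btc", "100€", "80$" values the pipeline prints.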

       

      You will have noticed that I use print() and do not seem to worry about unicode. That is because pyped does this at the start of the script:

       

      from __future__ import print_function, unicode_literals, division, absolute_import

       

      As a result, we are still on Python 2.7, but we get true division, the print function, absolute imports and, above all, unicode everywhere. pyped even turns x into a unicode object for you.

      In any case, pyped is compatible with Python 3.

      Processing everything at once

      Sometimes you need access to all the lines, not just one line at a time. pyped allows this with the -i option. The variables x and i disappear in favor of the variable l, which contains an iterable over all the lines.

      For example, want to sort all that?

       

      cat fortune.txt | pyp -i "
      lignes = (x.split() for x in l)
      lignes = sorted((v, s.strip('()')) for d, s, _, v in lignes)
      for ligne in lignes: print('%s%s' % ligne)
      "
      100€
      5btc
      80$

       

      Among the most useful features is the -b option, which runs a piece of code before the loop. Handy for importing things, such as the tarfile module to extract an archive before using its contents.

      Most of the time, though, there is nothing to import, because pyped already auto-imports the most useful modules: math, datetime, re, json, hashlib, uuid, etc.

 
