2.6.0-test10-mm1
--- diff/Documentation/filesystems/Locking	2003-08-26 10:00:51.000000000 +0100
+++ source/Documentation/filesystems/Locking	2003-11-26 10:09:05.000000000 +0000
@@ -420,7 +420,7 @@
 prototypes:
 	void (*open)(struct vm_area_struct*);
 	void (*close)(struct vm_area_struct*);
-	struct page *(*nopage)(struct vm_area_struct*, unsigned long, int);
+	struct page *(*nopage)(struct vm_area_struct*, unsigned long, int *);
 
 locking rules:
 		BKL	mmap_sem
--- diff/Documentation/filesystems/proc.txt	2003-09-17 12:28:01.000000000 +0100
+++ source/Documentation/filesystems/proc.txt	2003-11-26 10:09:05.000000000 +0000
@@ -900,6 +900,15 @@
 Every mounted file system needs a super block, so if you plan to mount lots of
 file systems, you may want to increase these numbers.
 
+aio-nr and aio-max-nr
+---------------------
+
+aio-nr is the running total of the number of events specified on the
+io_setup system call for all currently active aio contexts.  If aio-nr
+reaches aio-max-nr then io_setup will fail with EAGAIN.  Note that
+raising aio-max-nr does not result in the pre-allocation or re-sizing
+of any kernel data structures.
+
 2.2 /proc/sys/fs/binfmt_misc - Miscellaneous binary formats
 -----------------------------------------------------------
--- diff/Documentation/kernel-parameters.txt	2003-10-09 09:47:33.000000000 +0100
+++ source/Documentation/kernel-parameters.txt	2003-11-26 10:09:05.000000000 +0000
@@ -24,6 +24,7 @@
 	HW	Appropriate hardware is enabled.
 	IA-32	IA-32 aka i386 architecture is enabled.
 	IA-64	IA-64 architecture is enabled.
+	IOSCHED	More than one I/O scheduler is enabled.
 	IP_PNP	IP DHCP, BOOTP, or RARP is enabled.
 	ISAPNP	ISA PnP code is enabled.
 	ISDN	Appropriate ISDN support is enabled.
@@ -91,6 +92,11 @@
 			ht -- run only enough ACPI to enable Hyper Threading
 			See also Documentation/pm.txt.
 
+	acpi_pic_sci=	[HW,ACPI] ACPI System Control Interrupt trigger mode
+			Format: { level | edge }
+			level	Force PIC-mode SCI to Level Trigger (default)
+			edge	Force PIC-mode SCI to Edge Trigger
+
 	ad1816=		[HW,OSS]
 			Format: <io>,<irq>,<dma>,<dma2>
 			See also Documentation/sound/oss/AD1816.
@@ -214,7 +220,7 @@
 			Forces specified timesource (if available) to be used
 			when calculating gettimeofday().  If the specified
 			timesource is not available, it defaults to PIT.
-			Format: { pit | tsc | cyclone | ... }
+			Format: { pit | tsc | cyclone | pmtmr }
 
 	hpet=		[IA-32,HPET] option to disable HPET and use PIT.
 			Format: disable
@@ -303,6 +309,10 @@
 			See comment before function elanfreq_setup() in
 			arch/i386/kernel/cpu/cpufreq/elanfreq.c.
 
+	elevator=	[IOSCHED]
+			Format: {"as"|"cfq"|"deadline"|"noop"}
+			See Documentation/as-iosched.txt for details
+
 	es1370=		[HW,OSS]
 			Format: <lineout>[,<micbias>]
 			See also header of sound/oss/es1370.c.
--- diff/Documentation/kobject.txt	2003-09-17 12:28:01.000000000 +0100
+++ source/Documentation/kobject.txt	2003-11-26 10:09:05.000000000 +0000
@@ -2,7 +2,7 @@
 
 Patrick Mochel	<mochel@osdl.org>
 
-Updated: 3 June 2003
+Updated: 12 November 2003
 
 Copyright (c) 2003 Patrick Mochel
 
@@ -128,7 +128,33 @@
 (like the networking layer).
 
 
-1.4 sysfs
+1.4 Order dependencies
+
+Fields in a kobject must be initialized before they are used, as
+indicated in this table:
+
+	k_name		Before kobject_add()	(1)
+	name		Before kobject_add()	(1)
+	refcount	Before kobject_init()	(2)
+	entry		Set by kobject_init()
+	parent		Before kobject_add()	(3)
+	kset		Before kobject_init()	(4)
+	ktype		Before kobject_add()	(4)
+	dentry		Set by kobject_add()
+
+(1) If k_name isn't already set when kobject_add() is called,
+it will be set to name.
+
+(2) Although kobject_init() sets refcount to 1, a warning will be logged
+if refcount is not equal to 0 beforehand.
+
+(3) If parent isn't already set when kobject_add() is called,
+it will be set to kset's embedded kobject.
+
+(4) kset and ktype are optional.  If they are used, they should be set
+at the times indicated.
+
+1.5 sysfs
 
 Each kobject receives a directory in sysfs.  This directory is created
 under the kobject's parent directory.
@@ -210,7 +236,25 @@
 name.  The kobject, if found, is returned.
 
 
-2.3 sysfs
+2.3 Order dependencies
+
+Fields in a kset must be initialized before they are used, as indicated
+in this table:
+
+	subsys		Before kset_add()
+	ktype		Before kset_add()	(1)
+	list		Set by kset_init()
+	kobj		Before kset_init()	(2)
+	hotplug_ops	Before kset_add()	(1)
+
+(1) ktype and hotplug_ops are optional.  If they are used, they should
+be set at the times indicated.
+
+(2) kobj is passed to kobject_init() during kset_init() and to
+kobject_add() during kset_add(); it must be initialized accordingly.
+
+
+2.4 sysfs
 
 ksets are represented in sysfs when their embedded kobjects are
 registered.  They follow the same rules of parenting, with one
@@ -352,7 +396,21 @@
 -  Sets obj->subsys.kset.kobj.kset to the subsystem's embedded kset.
 
 
-4.4 sysfs
+4.4 Order dependencies
+
+Fields in a subsystem must be initialized before they are used,
+as indicated in this table:
+
+	kset		Before subsystem_register()	(1)
+	rwsem		Set by subsystem_register()
+
+(1) kset is minimally initialized by the decl_subsys macro.  It is
+passed to kset_init() and kset_add() by subsystem_register().  If its
+subsys member isn't already set, subsystem_register() sets it to the
+containing subsystem.
+
+
+4.5 sysfs
 
 subsystems are represented in sysfs via their embedded kobjects.  They
 follow the same rules as previously mentioned with no exceptions.  They
--- diff/Documentation/scsi/BusLogic.txt	2003-08-20 14:16:23.000000000 +0100
+++ source/Documentation/scsi/BusLogic.txt	2003-11-26 10:09:05.000000000 +0000
@@ -577,7 +577,7 @@
 INSMOD Loadable Kernel Module Installation Facility:
 
   insmod BusLogic.o \
-      'BusLogic_Options="QueueDepth:[,7,15];QueueDepth:31,BusSettleTime:30"'
+      'BusLogic="QueueDepth:[,7,15];QueueDepth:31,BusSettleTime:30"'
 
 NOTE: Module Utilities 2.1.71 or later is required for correct parsing
       of driver options containing commas.
--- diff/Documentation/sysctl/fs.txt	2003-01-02 10:43:02.000000000 +0000
+++ source/Documentation/sysctl/fs.txt	2003-11-26 10:09:05.000000000 +0000
@@ -138,3 +138,13 @@
 can have. You only need to increase super-max if you need to
 mount more filesystems than the current value in super-max
 allows you to.
+
+==============================================================
+
+aio-nr & aio-max-nr:
+
+aio-nr shows the current system-wide number of asynchronous io
+requests.  aio-max-nr allows you to change the maximum value
+aio-nr can grow to.
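For illustration only (this program is not part of the patch, and the
65536-event request is an arbitrary example value): a minimal userspace
sketch of how these two sysctls surface through the io_setup(2) system
call.

	#include <stdio.h>
	#include <errno.h>
	#include <unistd.h>
	#include <sys/syscall.h>
	#include <linux/aio_abi.h>

	int main(void)
	{
		aio_context_t ctx = 0;	/* must be zeroed before io_setup */

		/* asks the kernel to account 65536 events against aio-nr */
		if (syscall(SYS_io_setup, 65536, &ctx) < 0) {
			if (errno == EAGAIN)
				fprintf(stderr, "aio-nr would exceed aio-max-nr\n");
			return 1;
		}
		/* tearing the context down releases the events again */
		syscall(SYS_io_destroy, ctx);
		return 0;
	}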
+ +============================================================== --- diff/MAINTAINERS 2003-11-25 15:24:57.000000000 +0000 +++ source/MAINTAINERS 2003-11-26 10:09:08.000000000 +0000 @@ -1154,6 +1154,12 @@ W: http://developer.osdl.org/rddunlap/kj-patches/ S: Maintained +KGDB FOR I386 PLATFORM +P: George Anzinger +M: george@mvista.com +L: linux-net@vger.kernel.org +S: Supported + KERNEL NFSD P: Neil Brown M: neilb@cse.unsw.edu.au @@ -1234,8 +1240,8 @@ S: Maintained LSILOGIC/SYMBIOS/NCR 53C8XX and 53C1010 PCI-SCSI drivers -P: Gerard Roudier -M: groudier@free.fr +P: Matthew Wilcox +M: matthew@wil.cx L: linux-scsi@vger.kernel.org S: Maintained --- diff/Makefile 2003-11-25 15:24:57.000000000 +0000 +++ source/Makefile 2003-11-26 10:09:08.000000000 +0000 @@ -1,7 +1,7 @@ VERSION = 2 PATCHLEVEL = 6 SUBLEVEL = 0 -EXTRAVERSION = -test10 +EXTRAVERSION = -test10-mm1 # *DOCUMENTATION* # To see a list of typical targets execute "make help" @@ -275,7 +275,7 @@ CPPFLAGS := -D__KERNEL__ -Iinclude \ $(if $(KBUILD_SRC),-Iinclude2 -I$(srctree)/include) -CFLAGS := -Wall -Wstrict-prototypes -Wno-trigraphs -O2 \ +CFLAGS := -Wall -Wstrict-prototypes -Wno-trigraphs \ -fno-strict-aliasing -fno-common AFLAGS := -D__ASSEMBLY__ @@ -431,6 +431,12 @@ # --------------------------------------------------------------------------- +ifdef CONFIG_CC_OPTIMIZE_FOR_SIZE +CFLAGS += -Os +else +CFLAGS += -O2 +endif + ifndef CONFIG_FRAME_POINTER CFLAGS += -fomit-frame-pointer endif --- diff/arch/alpha/kernel/irq.c 2003-10-09 09:47:33.000000000 +0100 +++ source/arch/alpha/kernel/irq.c 2003-11-26 10:09:04.000000000 +0000 @@ -529,19 +529,21 @@ #ifdef CONFIG_SMP int j; #endif - int i; + int i = *(int *) v; struct irqaction * action; unsigned long flags; #ifdef CONFIG_SMP - seq_puts(p, " "); - for (i = 0; i < NR_CPUS; i++) - if (cpu_online(i)) - seq_printf(p, "CPU%d ", i); - seq_putc(p, '\n'); + if (i == 0) { + seq_puts(p, " "); + for (i = 0; i < NR_CPUS; i++) + if (cpu_online(i)) + seq_printf(p, "CPU%d ", i); + seq_putc(p, '\n'); + } #endif - for (i = 0; i < ACTUAL_NR_IRQS; i++) { + if (i < ACTUAL_NR_IRQS) { spin_lock_irqsave(&irq_desc[i].lock, flags); action = irq_desc[i].action; if (!action) @@ -568,15 +570,16 @@ seq_putc(p, '\n'); unlock: spin_unlock_irqrestore(&irq_desc[i].lock, flags); - } + } else if (i == ACTUAL_NR_IRQS) { #ifdef CONFIG_SMP - seq_puts(p, "IPI: "); - for (i = 0; i < NR_CPUS; i++) - if (cpu_online(i)) - seq_printf(p, "%10lu ", cpu_data[i].ipi_count); - seq_putc(p, '\n'); + seq_puts(p, "IPI: "); + for (i = 0; i < NR_CPUS; i++) + if (cpu_online(i)) + seq_printf(p, "%10lu ", cpu_data[i].ipi_count); + seq_putc(p, '\n'); #endif - seq_printf(p, "ERR: %10lu\n", irq_err_count); + seq_printf(p, "ERR: %10lu\n", irq_err_count); + } return 0; } --- diff/arch/alpha/kernel/traps.c 2003-10-09 09:47:33.000000000 +0100 +++ source/arch/alpha/kernel/traps.c 2003-11-26 10:09:04.000000000 +0000 @@ -636,6 +636,7 @@ lock_kernel(); printk("Bad unaligned kernel access at %016lx: %p %lx %ld\n", pc, va, opcode, reg); + dump_stack(); do_exit(SIGSEGV); got_exception: --- diff/arch/arm/Makefile 2003-10-27 09:20:36.000000000 +0000 +++ source/arch/arm/Makefile 2003-11-26 10:09:04.000000000 +0000 @@ -14,8 +14,6 @@ GZFLAGS :=-9 #CFLAGS +=-pipe -CFLAGS :=$(CFLAGS:-O2=-Os) - ifeq ($(CONFIG_FRAME_POINTER),y) CFLAGS +=-fno-omit-frame-pointer -mapcs -mno-sched-prolog endif --- diff/arch/arm/kernel/irq.c 2003-10-09 09:47:33.000000000 +0100 +++ source/arch/arm/kernel/irq.c 2003-11-26 10:09:04.000000000 +0000 @@ -169,11 +169,11 @@ int show_interrupts(struct 
seq_file *p, void *v) { - int i; + int i = *(int *) v; struct irqaction * action; unsigned long flags; - for (i = 0 ; i < NR_IRQS ; i++) { + if (i < NR_IRQS) { spin_lock_irqsave(&irq_controller_lock, flags); action = irq_desc[i].action; if (!action) @@ -187,12 +187,12 @@ seq_putc(p, '\n'); unlock: spin_unlock_irqrestore(&irq_controller_lock, flags); - } - + } else if (i == NR_IRQS) { #ifdef CONFIG_ARCH_ACORN - show_fiq_list(p, v); + show_fiq_list(p, v); #endif - seq_printf(p, "Err: %10lu\n", irq_err_count); + seq_printf(p, "Err: %10lu\n", irq_err_count); + } return 0; } --- diff/arch/arm26/kernel/irq.c 2003-10-09 09:47:33.000000000 +0100 +++ source/arch/arm26/kernel/irq.c 2003-11-26 10:09:04.000000000 +0000 @@ -135,10 +135,10 @@ int show_interrupts(struct seq_file *p, void *v) { - int i; + int i = *(int *) v; struct irqaction * action; - for (i = 0 ; i < NR_IRQS ; i++) { + if (i < NR_IRQS) { action = irq_desc[i].action; if (!action) continue; @@ -148,10 +148,10 @@ seq_printf(p, ", %s", action->name); } seq_putc(p, '\n'); + } else if (i == NR_IRQS) { + show_fiq_list(p, v); + seq_printf(p, "Err: %10lu\n", irq_err_count); } - - show_fiq_list(p, v); - seq_printf(p, "Err: %10lu\n", irq_err_count); return 0; } --- diff/arch/cris/kernel/irq.c 2003-10-09 09:47:33.000000000 +0100 +++ source/arch/cris/kernel/irq.c 2003-11-26 10:09:04.000000000 +0000 @@ -89,11 +89,11 @@ int show_interrupts(struct seq_file *p, void *v) { - int i; + int i = *(int *) v; struct irqaction * action; unsigned long flags; - for (i = 0; i < NR_IRQS; i++) { + if (i < NR_IRQS) { local_irq_save(flags); action = irq_action[i]; if (!action) --- diff/arch/h8300/Kconfig 2003-10-27 09:20:36.000000000 +0000 +++ source/arch/h8300/Kconfig 2003-11-26 10:09:04.000000000 +0000 @@ -5,6 +5,10 @@ mainmenu "uClinux/h8300 (w/o MMU) Kernel Configuration" +config H8300 + bool + default y + config MMU bool default n --- diff/arch/h8300/Makefile 2003-08-26 10:00:51.000000000 +0100 +++ source/arch/h8300/Makefile 2003-11-26 10:09:04.000000000 +0000 @@ -34,7 +34,7 @@ ldflags-$(CONFIG_CPU_H8S) := -mh8300self CFLAGS += $(cflags-y) -CFLAGS += -mint32 -fno-builtin -Os +CFLAGS += -mint32 -fno-builtin CFLAGS += -g CFLAGS += -D__linux__ CFLAGS += -DUTS_SYSNAME=\"uClinux\" --- diff/arch/h8300/platform/h8300h/ints.c 2003-10-09 09:47:33.000000000 +0100 +++ source/arch/h8300/platform/h8300h/ints.c 2003-11-26 10:09:04.000000000 +0000 @@ -228,9 +228,9 @@ int show_interrupts(struct seq_file *p, void *v) { - int i; + int i = *(int *) v; - for (i = 0; i < NR_IRQS; i++) { + if (i < NR_IRQS) { if (irq_list[i]) { seq_printf(p, "%3d: %10u ",i,irq_list[i]->count); seq_printf(p, "%s\n", irq_list[i]->devname); --- diff/arch/h8300/platform/h8s/ints.c 2003-10-27 09:20:36.000000000 +0000 +++ source/arch/h8300/platform/h8s/ints.c 2003-11-26 10:09:04.000000000 +0000 @@ -280,9 +280,9 @@ int show_interrupts(struct seq_file *p, void *v) { - int i; + int i = *(int *) v; - for (i = 0; i < NR_IRQS; i++) { + if (i < NR_IRQS) { if (irq_list[i]) { seq_printf(p, "%3d: %10u ",i,irq_list[i]->count); seq_printf(p, "%s\n", irq_list[i]->devname); --- diff/arch/i386/Kconfig 2003-10-09 09:47:33.000000000 +0100 +++ source/arch/i386/Kconfig 2003-11-26 10:09:04.000000000 +0000 @@ -397,6 +397,54 @@ depends on MWINCHIP3D || MWINCHIP2 || MWINCHIPC6 default y +config X86_4G + bool "4 GB kernel-space and 4 GB user-space virtual memory support" + help + This option is only useful for systems that have more than 1 GB + of RAM. 
+
+	  The default kernel VM layout leaves 1 GB of virtual memory for
+	  kernel-space mappings, and 3 GB of VM for user-space applications.
+	  This option ups both the kernel-space VM and the user-space VM to
+	  4 GB.
+
+	  The cost of this option is additional TLB flushes done at
+	  system-entry points that transition from user-mode into kernel-mode.
+	  I.e. system calls and page faults, and IRQs that interrupt user-mode
+	  code.  There's also additional overhead to kernel operations that copy
+	  memory to/from user-space.  The overhead from this is hard to quantify
+	  and depends on the workload - it can be anything from no visible
+	  overhead to 20-30% overhead.  A good rule of thumb is to expect a
+	  runtime overhead of about 20%.
+
+	  The upside is the much increased kernel-space VM, which more than
+	  quadruples the maximum amount of RAM supported.  Kernels compiled with
+	  this option boot on 64GB of RAM and still have more than 3.1 GB of
+	  'lowmem' left.  Another bonus is that highmem IO bouncing decreases,
+	  if used with drivers that still use bounce-buffers.
+
+	  There's also a 33% increase in user-space VM size - database
+	  applications might see a boost from this.
+
+	  But the cost of the TLB flushes and the runtime overhead has to be
+	  weighed against the bonuses offered by the larger VM spaces.  The
+	  dividing line depends on the actual workload - there might be 4 GB
+	  systems that benefit from this option.  Systems with less than 4 GB
+	  of RAM will rarely see a benefit from this option - but it's not
+	  out of the question; the exact circumstances have to be considered.
+
+config X86_SWITCH_PAGETABLES
+	def_bool X86_4G
+
+config X86_4G_VM_LAYOUT
+	def_bool X86_4G
+
+config X86_UACCESS_INDIRECT
+	def_bool X86_4G
+
+config X86_HIGH_ENTRY
+	def_bool X86_4G
+
 config HPET_TIMER
 	bool "HPET Timer Support"
 	help
@@ -784,6 +832,25 @@
 
 	  See <file:Documentation/mtrr.txt> for more information.
 
+config EFI
+	bool "Boot from EFI support (EXPERIMENTAL)"
+	depends on ACPI
+	default n
+	---help---
+
+	This enables the kernel to boot on EFI platforms using
+	system configuration information passed to it from the firmware.
+	This also enables the kernel to use any EFI runtime services that are
+	available (such as the EFI variable services).
+
+	This option is only useful on systems that have EFI firmware
+	and will result in a kernel image that is ~8k larger.  In addition,
+	you must use the latest ELILO loader available at
+	ftp.hpl.hp.com/pub/linux-ia64/ in order to take advantage of kernel
+	initialization using EFI information (neither GRUB nor LILO know
+	anything about EFI).  However, even with this option, the resultant
+	kernel should continue to boot on existing non-EFI platforms.
+
 config HAVE_DEC_LOCK
 	bool
 	depends on (SMP || PREEMPT) && X86_CMPXCHG
@@ -793,7 +860,7 @@
 # Summit needs it only when NUMA is on
 config BOOT_IOREMAP
 	bool
-	depends on ((X86_SUMMIT || X86_GENERICARCH) && NUMA)
+	depends on (((X86_SUMMIT || X86_GENERICARCH) && NUMA) || (X86 && EFI))
 	default y
 
 endmenu
@@ -1030,6 +1097,25 @@
 	depends on PCI && ((PCI_GODIRECT || PCI_GOANY) || X86_VISWS)
 	default y
 
+config PCI_USE_VECTOR
+	bool "Vector-based interrupt indexing"
+	depends on X86_LOCAL_APIC
+	default n
+	help
+	   This replaces the current existing IRQ-based index interrupt scheme
+	   with the vector-based index scheme.  The advantages of vector-based
+	   over IRQ-based indexing are listed below:
+	   1) Support MSI implementation.
+	   2) Support future IOxAPIC hotplug
+
+	   Note that this enables MSI, Message Signaled Interrupt, on all
+	   MSI capable device functions detected if users also install the
+	   MSI patch.  A Message Signaled Interrupt enables an MSI-capable
+	   hardware device to send an inbound Memory Write on its PCI bus
+	   instead of asserting an IRQ signal on the device IRQ pin.
+
+	   If you don't know what to do here, say N.
+
 source "drivers/pci/Kconfig"
 
 config ISA
@@ -1187,6 +1273,15 @@
 	  This results in a large slowdown, but helps to find certain types
 	  of memory corruptions.
 
+config SPINLINE
+	bool "Spinlock inlining"
+	depends on DEBUG_KERNEL
+	help
+	  This will change spinlocks from out of line to inline, making them
+	  account cost to the callers in readprofile, rather than the lock
+	  itself (as ".text.lock.filename").  This can be helpful for finding
+	  the callers of locks.
+
 config DEBUG_HIGHMEM
 	bool "Highmem debugging"
 	depends on DEBUG_KERNEL && HIGHMEM
@@ -1203,20 +1298,208 @@
 	  Say Y here only if you plan to use gdb to debug the kernel.
 	  If you don't debug the kernel, you can say N.
 
+config LOCKMETER
+	bool "Kernel lock metering"
+	depends on SMP
+	help
+	  Say Y to enable kernel lock metering, which adds overhead to SMP locks,
+	  but allows you to see various statistics using the lockstat command.
+
 config DEBUG_SPINLOCK_SLEEP
 	bool "Sleep-inside-spinlock checking"
 	help
 	  If you say Y here, various routines which may sleep will become very
 	  noisy if they are called with a spinlock held.
 
+config KGDB
+	bool "Include kgdb kernel debugger"
+	depends on DEBUG_KERNEL
+	help
+	  If you say Y here, the system will be compiled with the debug
+	  option (-g) and a debugging stub will be included in the
+	  kernel.  This stub communicates with gdb on another (host)
+	  computer via a serial port.  The host computer should have
+	  access to the kernel binary file (vmlinux) and a serial port
+	  that is connected to the target machine.  Gdb can be made to
+	  configure the serial port or you can use stty and setserial to
+	  do this.  See the 'target' command in gdb.  This option also
+	  configures in the ability to request a breakpoint early in the
+	  boot process.  To request the breakpoint just include 'kgdb'
+	  as a boot option when booting the target machine.  The system
+	  will then break as soon as it looks at the boot options.  This
+	  option also installs a breakpoint in panic and sends any
+	  kernel faults to the debugger.  For more information see the
+	  Documentation/i386/kgdb.txt file.
+
+choice
+	depends on KGDB
+	prompt "Debug serial port BAUD"
+	default KGDB_115200BAUD
+	help
+	  Gdb and the kernel stub need to agree on the baud rate to be
+	  used.  Some systems (x86 family at this writing) allow this to
+	  be configured.
+
+config KGDB_9600BAUD
+	bool "9600"
+
+config KGDB_19200BAUD
+	bool "19200"
+
+config KGDB_38400BAUD
+	bool "38400"
+
+config KGDB_57600BAUD
+	bool "57600"
+
+config KGDB_115200BAUD
+	bool "115200"
+endchoice
+
+config KGDB_PORT
+	hex "hex I/O port address of the debug serial port"
+	depends on KGDB
+	default 3f8
+	help
+	  Some systems (x86 family at this writing) allow the port
+	  address to be configured.  The number entered is assumed to be
+	  hex, don't put 0x in front of it.  The standard addresses are:
+	  COM1 3f8, irq 4 and COM2 2f8, irq 3.  Setserial /dev/ttySx
+	  will tell you what you have.  It is good to test the serial
+	  connection with a live system before trying to debug.
+
+config KGDB_IRQ
+	int "IRQ of the debug serial port"
+	depends on KGDB
+	default 4
+	help
+	  This is the irq for the debug port.  If everything is working
+	  correctly and the kernel has interrupts on, a control-C sent to
+	  the port should cause a break into the kernel debug stub.
+
+config DEBUG_INFO
+	bool
+	depends on KGDB
+	default y
+
+config KGDB_MORE
+	bool "Add any additional compile options"
+	depends on KGDB
+	default n
+	help
+	  Saying yes here turns on the ability to enter additional
+	  compile options.
+
+
+config KGDB_OPTIONS
+	depends on KGDB_MORE
+	string "Additional compile arguments"
+	default "-O1"
+	help
+	  This option allows you to enter additional compile options for
+	  the whole kernel compile.  Each platform will have a default
+	  that seems right for it.  For example on PPC "-ggdb -O1", and
+	  for i386 "-O1".  Note that by configuring KGDB "-g" is already
+	  turned on.  In addition, on i386 platforms
+	  "-fomit-frame-pointer" is deleted from the standard compile
+	  options.
+
+config NO_KGDB_CPUS
+	int "Number of CPUs"
+	depends on KGDB && SMP
+	default NR_CPUS
+	help
+
+	  This option sets the number of cpus for kgdb ONLY.  It is used
+	  to prune some internal structures so they look "nice" when
+	  displayed with gdb.  This is to overcome possibly larger
+	  numbers that may have been entered above.  Enter the real
+	  number to get nice clean kgdb_info displays.
+
+config KGDB_TS
+	bool "Enable kgdb time stamp macros?"
+	depends on KGDB
+	default n
+	help
+	  Kgdb event macros allow you to instrument your code with calls
+	  to the kgdb event recording function.  The event log may be
+	  examined with gdb at a break point.  Turning on this
+	  capability also allows you to choose how many events to
+	  keep.  Kgdb always keeps the latest events.
+
+choice
+	depends on KGDB_TS
+	prompt "Max number of time stamps to save?"
+	default KGDB_TS_128
+
+config KGDB_TS_64
+	bool "64"
+
+config KGDB_TS_128
+	bool "128"
+
+config KGDB_TS_256
+	bool "256"
+
+config KGDB_TS_512
+	bool "512"
+
+config KGDB_TS_1024
+	bool "1024"
+
+endchoice
+
+config STACK_OVERFLOW_TEST
+	bool "Turn on kernel stack overflow testing?"
+	depends on KGDB
+	default n
+	help
+	  This option enables code in the front line interrupt handlers
+	  to check for kernel stack overflow on interrupts and system
+	  calls.  This is part of the kgdb code on x86 systems.
+
+config KGDB_CONSOLE
+	bool "Enable serial console through kgdb port"
+	depends on KGDB
+	default n
+	help
+	  This option enables the command line "console=kgdb" option.
+	  When the system is booted with this option in the command line
+	  all kernel printk output is sent to gdb (as well as to other
+	  consoles).  For this to work gdb must be connected.  For this
+	  reason, this command line option will generate a breakpoint if
+	  gdb has not yet connected.  After the gdb continue command is
+	  given all pent-up console output will be printed by gdb on the
+	  host machine.  Neither this option, nor KGDB require the
+	  serial driver to be configured.
+
+config KGDB_SYSRQ
+	bool "Turn on SysRq 'G' command to do a break?"
+	depends on KGDB
+	default y
+	help
+	  This option includes an option in the SysRq code that allows
+	  you to enter SysRq G which generates a breakpoint to the KGDB
+	  stub.  This will work if the keyboard is alive and can
+	  interrupt the system.  Because of constraints on when the
+	  serial port interrupt can be enabled, this code may allow you
+	  to interrupt the system before the serial port control-C is
+	  available.  Just say yes here.
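Taken together, a typical use of the options above (illustrative, not
part of the patch; the port and baud rate must match your KGDB_PORT and
BAUD selections) is to boot the target with:

	kgdb console=kgdb

so that the kernel breaks early and waits for the debugger, and then on
the host, with the matching vmlinux loaded into gdb:

	(gdb) set remotebaud 115200
	(gdb) target remote /dev/ttyS0

'kgdb' requests the early breakpoint described in the KGDB help text,
and 'console=kgdb' additionally routes printk output to gdb once it has
connected.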
+
 config FRAME_POINTER
 	bool "Compile the kernel with frame pointers"
+	default KGDB
 	help
 	  If you say Y here the resulting kernel image will be slightly larger
 	  and slower, but it will give very useful debugging information.
 	  If you don't debug the kernel, you can say N, but we may not be able
 	  to solve problems without frame pointers.
 
+config MAGIC_SYSRQ
+	bool
+	depends on KGDB_SYSRQ
+	default y
+
 config X86_EXTRA_IRQS
 	bool
 	depends on X86_LOCAL_APIC || X86_VOYAGER
--- diff/arch/i386/Makefile	2003-10-09 09:47:16.000000000 +0100
+++ source/arch/i386/Makefile	2003-11-26 10:09:04.000000000 +0000
@@ -84,6 +84,9 @@
 # default subarch .h files
 mflags-y += -Iinclude/asm-i386/mach-default
 
+mflags-$(CONFIG_KGDB) += -gdwarf-2
+mflags-$(CONFIG_KGDB_MORE) += $(shell echo $(CONFIG_KGDB_OPTIONS) | sed -e 's/"//g')
+
 head-y := arch/i386/kernel/head.o arch/i386/kernel/init_task.o
 
 libs-y 					+= arch/i386/lib/
--- diff/arch/i386/boot/setup.S	2003-10-09 09:47:16.000000000 +0100
+++ source/arch/i386/boot/setup.S	2003-11-26 10:09:04.000000000 +0000
@@ -162,7 +162,7 @@
 						# can be located anywhere in
 						# low memory 0x10000 or higher.
 
-ramdisk_max:	.long MAXMEM-1			# (Header version 0x0203 or later)
+ramdisk_max:	.long __MAXMEM-1		# (Header version 0x0203 or later)
 						# The highest safe address for
 						# the contents of an initrd
--- diff/arch/i386/kernel/Makefile	2003-10-09 09:47:16.000000000 +0100
+++ source/arch/i386/kernel/Makefile	2003-11-26 10:09:04.000000000 +0000
@@ -7,13 +7,14 @@
 obj-y	:= process.o semaphore.o signal.o entry.o traps.o irq.o vm86.o \
 		ptrace.o i8259.o ioport.o ldt.o setup.o time.o sys_i386.o \
 		pci-dma.o i386_ksyms.o i387.o dmi_scan.o bootflag.o \
-		doublefault.o
+		doublefault.o entry_trampoline.o
 
 obj-y				+= cpu/
 obj-y				+= timers/
 obj-$(CONFIG_ACPI_BOOT)		+= acpi/
 obj-$(CONFIG_X86_BIOS_REBOOT)	+= reboot.o
 obj-$(CONFIG_MCA)		+= mca.o
+obj-$(CONFIG_KGDB)		+= kgdb_stub.o
 obj-$(CONFIG_X86_MSR)		+= msr.o
 obj-$(CONFIG_X86_CPUID)		+= cpuid.o
 obj-$(CONFIG_MICROCODE)		+= microcode.o
@@ -30,6 +31,7 @@
 obj-y				+= sysenter.o vsyscall.o
 obj-$(CONFIG_ACPI_SRAT) 	+= srat.o
 obj-$(CONFIG_HPET_TIMER) 	+= time_hpet.o
+obj-$(CONFIG_EFI) 		+= efi.o efi_stub.o
 
 EXTRA_AFLAGS   := -traditional
--- diff/arch/i386/kernel/acpi/boot.c	2003-11-25 15:24:57.000000000 +0000
+++ source/arch/i386/kernel/acpi/boot.c	2003-11-26 10:09:04.000000000 +0000
@@ -26,6 +26,7 @@
 #include <linux/init.h>
 #include <linux/config.h>
 #include <linux/acpi.h>
+#include <linux/efi.h>
 #include <asm/pgalloc.h>
 #include <asm/io_apic.h>
 #include <asm/apic.h>
@@ -249,29 +250,66 @@
 
 #ifdef CONFIG_ACPI_BUS
 /*
- * Set specified PIC IRQ to level triggered mode.
+ * "acpi_pic_sci=level" (current default)
+ * programs the PIC-mode SCI to Level Trigger.
+ * (NO-OP if the BIOS set Level Trigger already)
+ *
+ * If a PIC-mode SCI is not recognized or gives spurious IRQ7's
+ * it may require Edge Trigger -- use "acpi_pic_sci=edge"
+ * (NO-OP if the BIOS set Edge Trigger already)
 *
 * Port 0x4d0-4d1 are ECLR1 and ECLR2, the Edge/Level Control Registers
 * for the 8259 PIC.  bit[n] = 1 means irq[n] is Level, otherwise Edge.
 * ECLR1 is IRQ's 0-7 (IRQ 0, 1, 2 must be 0)
 * ECLR2 is IRQ's 8-15 (IRQ 8, 13 must be 0)
- *
- * As the BIOS should have done this for us,
- * print a warning if the IRQ wasn't already set to level.
*/ -void acpi_pic_set_level_irq(unsigned int irq) +static int __initdata acpi_pic_sci_trigger; /* 0: level, 1: edge */ + +void __init +acpi_pic_sci_set_trigger(unsigned int irq) { unsigned char mask = 1 << (irq & 7); unsigned int port = 0x4d0 + (irq >> 3); unsigned char val = inb(port); + + printk(PREFIX "IRQ%d SCI:", irq); if (!(val & mask)) { - printk(KERN_WARNING PREFIX "IRQ %d was Edge Triggered, " - "setting to Level Triggerd\n", irq); - outb(val | mask, port); + printk(" Edge"); + + if (!acpi_pic_sci_trigger) { + printk(" set to Level"); + outb(val | mask, port); + } + } else { + printk(" Level"); + + if (acpi_pic_sci_trigger) { + printk(" set to Edge"); + outb(val | mask, port); + } + } + printk(" Trigger.\n"); +} + +int __init +acpi_pic_sci_setup(char *str) +{ + while (str && *str) { + if (strncmp(str, "level", 5) == 0) + acpi_pic_sci_trigger = 0; /* force level trigger */ + if (strncmp(str, "edge", 4) == 0) + acpi_pic_sci_trigger = 1; /* force edge trigger */ + str = strchr(str, ','); + if (str) + str += strspn(str, ", \t"); } + return 1; } + +__setup("acpi_pic_sci=", acpi_pic_sci_setup); + #endif /* CONFIG_ACPI_BUS */ @@ -326,11 +364,48 @@ } #endif +/* detect the location of the ACPI PM Timer */ +#ifdef CONFIG_X86_PM_TIMER +extern u32 pmtmr_ioport; + +static int __init acpi_parse_fadt(unsigned long phys, unsigned long size) +{ + struct fadt_descriptor_rev2 *fadt =0; + + fadt = (struct fadt_descriptor_rev2*) __acpi_map_table(phys,size); + if(!fadt) { + printk(KERN_WARNING PREFIX "Unable to map FADT\n"); + return 0; + } + + if (fadt->revision >= FADT2_REVISION_ID) { + /* FADT rev. 2 */ + if (fadt->xpm_tmr_blk.address_space_id != ACPI_ADR_SPACE_SYSTEM_IO) + return 0; + + pmtmr_ioport = fadt->xpm_tmr_blk.address; + } else { + /* FADT rev. 1 */ + pmtmr_ioport = fadt->V1_pm_tmr_blk; + } + if (pmtmr_ioport) + printk(KERN_INFO PREFIX "PM-Timer IO Port: %#x\n", pmtmr_ioport); + return 0; +} +#endif + + unsigned long __init acpi_find_rsdp (void) { unsigned long rsdp_phys = 0; + if (efi_enabled) { + if (efi.acpi20) + return __pa(efi.acpi20); + else if (efi.acpi) + return __pa(efi.acpi); + } /* * Scan memory looking for the RSDP signature. First search EBDA (low * memory) paragraphs and then search upper memory (E0000-FFFFF). @@ -519,5 +594,9 @@ acpi_table_parse(ACPI_HPET, acpi_parse_hpet); #endif +#ifdef CONFIG_X86_PM_TIMER + acpi_table_parse(ACPI_FADT, acpi_parse_fadt); +#endif + return 0; } --- diff/arch/i386/kernel/asm-offsets.c 2003-10-09 09:47:16.000000000 +0100 +++ source/arch/i386/kernel/asm-offsets.c 2003-11-26 10:09:04.000000000 +0000 @@ -4,9 +4,11 @@ * to extract and format the required data. 
*/ +#include <linux/sched.h> #include <linux/signal.h> #include <asm/ucontext.h> #include "sigframe.h" +#include <asm/fixmap.h> #define DEFINE(sym, val) \ asm volatile("\n->" #sym " %0 " #val : : "i" (val)) @@ -28,4 +30,17 @@ DEFINE(RT_SIGFRAME_sigcontext, offsetof (struct rt_sigframe, uc.uc_mcontext)); + DEFINE(TI_task, offsetof (struct thread_info, task)); + DEFINE(TI_exec_domain, offsetof (struct thread_info, exec_domain)); + DEFINE(TI_flags, offsetof (struct thread_info, flags)); + DEFINE(TI_preempt_count, offsetof (struct thread_info, preempt_count)); + DEFINE(TI_addr_limit, offsetof (struct thread_info, addr_limit)); + DEFINE(TI_real_stack, offsetof (struct thread_info, real_stack)); + DEFINE(TI_virtual_stack, offsetof (struct thread_info, virtual_stack)); + DEFINE(TI_user_pgd, offsetof (struct thread_info, user_pgd)); + + DEFINE(FIX_ENTRY_TRAMPOLINE_0_addr, __fix_to_virt(FIX_ENTRY_TRAMPOLINE_0)); + DEFINE(FIX_VSYSCALL_addr, __fix_to_virt(FIX_VSYSCALL)); + DEFINE(PAGE_SIZE_asm, PAGE_SIZE); + DEFINE(task_thread_db7, offsetof (struct task_struct, thread.debugreg[7])); } --- diff/arch/i386/kernel/cpu/common.c 2003-10-09 09:47:16.000000000 +0100 +++ source/arch/i386/kernel/cpu/common.c 2003-11-26 10:09:04.000000000 +0000 @@ -510,16 +510,20 @@ BUG(); enter_lazy_tlb(&init_mm, current); - load_esp0(t, thread->esp0); + load_esp0(t, thread); set_tss_desc(cpu,t); cpu_gdt_table[cpu][GDT_ENTRY_TSS].b &= 0xfffffdff; load_TR_desc(); - load_LDT(&init_mm.context); + if (cpu) + load_LDT(&init_mm.context); /* Set up doublefault TSS pointer in the GDT */ __set_tss_desc(cpu, GDT_ENTRY_DOUBLEFAULT_TSS, &doublefault_tss); cpu_gdt_table[cpu][GDT_ENTRY_DOUBLEFAULT_TSS].b &= 0xfffffdff; + if (cpu) + trap_init_virtual_GDT(); + /* Clear %fs and %gs. */ asm volatile ("xorl %eax, %eax; movl %eax, %fs; movl %eax, %gs"); --- diff/arch/i386/kernel/cpu/cpufreq/speedstep-centrino.c 2003-09-17 12:28:01.000000000 +0100 +++ source/arch/i386/kernel/cpu/cpufreq/speedstep-centrino.c 2003-11-26 10:09:04.000000000 +0000 @@ -73,6 +73,16 @@ { .frequency = CPUFREQ_TABLE_END } }; +/* Ultra Low Voltage Intel Pentium M processor 1000MHz */ +static struct cpufreq_frequency_table op_1000[] = + { + OP(600, 844), + OP(800, 972), + OP(900, 988), + OP(1000, 1004), + { .frequency = CPUFREQ_TABLE_END } + }; + /* Low Voltage Intel Pentium M processor 1.10GHz */ static struct cpufreq_frequency_table op_1100[] = { @@ -165,6 +175,7 @@ static const struct cpu_model models[] = { _CPU( 900, " 900"), + CPU(1000), CPU(1100), CPU(1200), CPU(1300), --- diff/arch/i386/kernel/cpu/intel.c 2003-11-25 15:24:57.000000000 +0000 +++ source/arch/i386/kernel/cpu/intel.c 2003-11-26 10:09:04.000000000 +0000 @@ -8,11 +8,10 @@ #include <asm/processor.h> #include <asm/msr.h> #include <asm/uaccess.h> +#include <asm/desc.h> #include "cpu.h" -extern int trap_init_f00f_bug(void); - #ifdef CONFIG_X86_INTEL_USERCOPY /* * Alignment at which movsl is preferred for bulk memory copies. 
@@ -157,7 +156,7 @@ c->f00f_bug = 1; if ( !f00f_workaround_enabled ) { - trap_init_f00f_bug(); + trap_init_virtual_IDT(); printk(KERN_NOTICE "Intel Pentium with F0 0F bug - workaround enabled.\n"); f00f_workaround_enabled = 1; } --- diff/arch/i386/kernel/dmi_scan.c 2003-10-09 09:47:33.000000000 +0100 +++ source/arch/i386/kernel/dmi_scan.c 2003-11-26 10:09:04.000000000 +0000 @@ -16,6 +16,7 @@ int is_sony_vaio_laptop; int is_unsafe_smbus; +int es7000_plat = 0; struct dmi_header { @@ -1011,6 +1012,7 @@ printk(KERN_NOTICE "ACPI disabled because your bios is from %s and too old\n", s); printk(KERN_NOTICE "You can enable it with acpi=force\n"); acpi_disabled = 1; + acpi_ht = 0; } } } --- diff/arch/i386/kernel/doublefault.c 2003-10-09 09:47:16.000000000 +0100 +++ source/arch/i386/kernel/doublefault.c 2003-11-26 10:09:04.000000000 +0000 @@ -7,12 +7,13 @@ #include <asm/uaccess.h> #include <asm/pgtable.h> #include <asm/desc.h> +#include <asm/fixmap.h> #define DOUBLEFAULT_STACKSIZE (1024) static unsigned long doublefault_stack[DOUBLEFAULT_STACKSIZE]; #define STACK_START (unsigned long)(doublefault_stack+DOUBLEFAULT_STACKSIZE) -#define ptr_ok(x) ((x) > 0xc0000000 && (x) < 0xc1000000) +#define ptr_ok(x) (((x) > __PAGE_OFFSET && (x) < (__PAGE_OFFSET + 0x01000000)) || ((x) >= FIXADDR_START)) static void doublefault_fn(void) { @@ -38,8 +39,8 @@ printk("eax = %08lx, ebx = %08lx, ecx = %08lx, edx = %08lx\n", t->eax, t->ebx, t->ecx, t->edx); - printk("esi = %08lx, edi = %08lx\n", - t->esi, t->edi); + printk("esi = %08lx, edi = %08lx, ebp = %08lx\n", + t->esi, t->edi, t->ebp); } } --- diff/arch/i386/kernel/entry.S 2003-11-25 15:24:57.000000000 +0000 +++ source/arch/i386/kernel/entry.S 2003-11-26 10:09:04.000000000 +0000 @@ -43,11 +43,25 @@ #include <linux/config.h> #include <linux/linkage.h> #include <asm/thread_info.h> +#include <asm/asm_offsets.h> #include <asm/errno.h> #include <asm/segment.h> +#include <asm/page.h> #include <asm/smp.h> #include <asm/page.h> #include "irq_vectors.h" + /* We do not recover from a stack overflow, but at least + * we know it happened and should be able to track it down. + */ +#ifdef CONFIG_STACK_OVERFLOW_TEST +#define STACK_OVERFLOW_TEST \ + testl $7680,%esp; \ + jnz 10f; \ + call stack_overflow; \ +10: +#else +#define STACK_OVERFLOW_TEST +#endif #define nr_syscalls ((syscall_table_size)/4) @@ -87,7 +101,102 @@ #define resume_kernel restore_all #endif -#define SAVE_ALL \ +#ifdef CONFIG_X86_HIGH_ENTRY + +#ifdef CONFIG_X86_SWITCH_PAGETABLES + +#if defined(CONFIG_PREEMPT) && defined(CONFIG_SMP) +/* + * If task is preempted in __SWITCH_KERNELSPACE, and moved to another cpu, + * __switch_to repoints %esp to the appropriate virtual stack; but %ebp is + * left stale, so we must check whether to repeat the real stack calculation. + */ +#define repeat_if_esp_changed \ + xorl %esp, %ebp; \ + testl $0xffffe000, %ebp; \ + jnz 0b +#else +#define repeat_if_esp_changed +#endif + +/* clobbers ebx, edx and ebp */ + +#define __SWITCH_KERNELSPACE \ + cmpl $0xff000000, %esp; \ + jb 1f; \ + \ + /* \ + * switch pagetables and load the real stack, \ + * keep the stack offset: \ + */ \ + \ + movl $swapper_pg_dir-__PAGE_OFFSET, %edx; \ + \ + /* GET_THREAD_INFO(%ebp) intermixed */ \ +0: \ + movl %esp, %ebp; \ + movl %esp, %ebx; \ + andl $0xffffe000, %ebp; \ + andl $0x00001fff, %ebx; \ + orl TI_real_stack(%ebp), %ebx; \ + repeat_if_esp_changed; \ + \ + movl %edx, %cr3; \ + movl %ebx, %esp; \ +1: + +#endif + + +#define __SWITCH_USERSPACE \ + /* interrupted any of the user return paths? 
*/ \ + \ + movl EIP(%esp), %eax; \ + \ + cmpl $int80_ret_start_marker, %eax; \ + jb 33f; /* nope - continue with sysexit check */\ + cmpl $int80_ret_end_marker, %eax; \ + jb 22f; /* yes - switch to virtual stack */ \ +33: \ + cmpl $sysexit_ret_start_marker, %eax; \ + jb 44f; /* nope - continue with user check */ \ + cmpl $sysexit_ret_end_marker, %eax; \ + jb 22f; /* yes - switch to virtual stack */ \ + /* return to userspace? */ \ +44: \ + movl EFLAGS(%esp),%ecx; \ + movb CS(%esp),%cl; \ + testl $(VM_MASK | 3),%ecx; \ + jz 2f; \ +22: \ + /* \ + * switch to the virtual stack, then switch to \ + * the userspace pagetables. \ + */ \ + \ + GET_THREAD_INFO(%ebp); \ + movl TI_virtual_stack(%ebp), %edx; \ + movl TI_user_pgd(%ebp), %ecx; \ + \ + movl %esp, %ebx; \ + andl $0x1fff, %ebx; \ + orl %ebx, %edx; \ +int80_ret_start_marker: \ + movl %edx, %esp; \ + movl %ecx, %cr3; \ + \ + __RESTORE_ALL; \ +int80_ret_end_marker: \ +2: + +#else /* !CONFIG_X86_HIGH_ENTRY */ + +#define __SWITCH_KERNELSPACE +#define __SWITCH_USERSPACE + +#endif + +#define __SAVE_ALL \ cld; \ pushl %es; \ pushl %ds; \ @@ -102,7 +211,7 @@ movl %edx, %ds; \ movl %edx, %es; -#define RESTORE_INT_REGS \ +#define __RESTORE_INT_REGS \ popl %ebx; \ popl %ecx; \ popl %edx; \ @@ -111,29 +220,28 @@ popl %ebp; \ popl %eax -#define RESTORE_REGS \ - RESTORE_INT_REGS; \ -1: popl %ds; \ -2: popl %es; \ +#define __RESTORE_REGS \ + __RESTORE_INT_REGS; \ +111: popl %ds; \ +222: popl %es; \ .section .fixup,"ax"; \ -3: movl $0,(%esp); \ - jmp 1b; \ -4: movl $0,(%esp); \ - jmp 2b; \ +444: movl $0,(%esp); \ + jmp 111b; \ +555: movl $0,(%esp); \ + jmp 222b; \ .previous; \ .section __ex_table,"a";\ .align 4; \ - .long 1b,3b; \ - .long 2b,4b; \ + .long 111b,444b;\ + .long 222b,555b;\ .previous - -#define RESTORE_ALL \ - RESTORE_REGS \ +#define __RESTORE_ALL \ + __RESTORE_REGS \ addl $4, %esp; \ -1: iret; \ +333: iret; \ .section .fixup,"ax"; \ -2: sti; \ +666: sti; \ movl $(__USER_DS), %edx; \ movl %edx, %ds; \ movl %edx, %es; \ @@ -142,10 +250,19 @@ .previous; \ .section __ex_table,"a";\ .align 4; \ - .long 1b,2b; \ + .long 333b,666b;\ .previous +#define SAVE_ALL \ + __SAVE_ALL; \ + __SWITCH_KERNELSPACE; \ + STACK_OVERFLOW_TEST; + +#define RESTORE_ALL \ + __SWITCH_USERSPACE; \ + __RESTORE_ALL; +.section .entry.text,"ax" ENTRY(lcall7) pushfl # We get a different stack layout with call @@ -163,7 +280,7 @@ movl %edx,EIP(%ebp) # Now we move them to their "normal" places movl %ecx,CS(%ebp) # andl $-8192, %ebp # GET_THREAD_INFO - movl TI_EXEC_DOMAIN(%ebp), %edx # Get the execution domain + movl TI_exec_domain(%ebp), %edx # Get the execution domain call *4(%edx) # Call the lcall7 handler for the domain addl $4, %esp popl %eax @@ -208,7 +325,7 @@ cli # make sure we don't miss an interrupt # setting need_resched or sigpending # between sampling and the iret - movl TI_FLAGS(%ebp), %ecx + movl TI_flags(%ebp), %ecx andl $_TIF_WORK_MASK, %ecx # is there any work to be done on # int/exception return? jne work_pending @@ -216,18 +333,18 @@ #ifdef CONFIG_PREEMPT ENTRY(resume_kernel) - cmpl $0,TI_PRE_COUNT(%ebp) # non-zero preempt_count ? + cmpl $0,TI_preempt_count(%ebp) # non-zero preempt_count ? jnz restore_all need_resched: - movl TI_FLAGS(%ebp), %ecx # need_resched set ? + movl TI_flags(%ebp), %ecx # need_resched set ? testb $_TIF_NEED_RESCHED, %cl jz restore_all testl $IF_MASK,EFLAGS(%esp) # interrupts off (exception path) ? 
jz restore_all - movl $PREEMPT_ACTIVE,TI_PRE_COUNT(%ebp) + movl $PREEMPT_ACTIVE,TI_preempt_count(%ebp) sti call schedule - movl $0,TI_PRE_COUNT(%ebp) + movl $0,TI_preempt_count(%ebp) cli jmp need_resched #endif @@ -246,37 +363,50 @@ pushl $(__USER_CS) pushl $SYSENTER_RETURN -/* - * Load the potential sixth argument from user stack. - * Careful about security. - */ - cmpl $__PAGE_OFFSET-3,%ebp - jae syscall_fault -1: movl (%ebp),%ebp -.section __ex_table,"a" - .align 4 - .long 1b,syscall_fault -.previous - pushl %eax SAVE_ALL GET_THREAD_INFO(%ebp) cmpl $(nr_syscalls), %eax jae syscall_badsys - testb $_TIF_SYSCALL_TRACE,TI_FLAGS(%ebp) + testb $_TIF_SYSCALL_TRACE,TI_flags(%ebp) jnz syscall_trace_entry call *sys_call_table(,%eax,4) movl %eax,EAX(%esp) cli - movl TI_FLAGS(%ebp), %ecx + movl TI_flags(%ebp), %ecx testw $_TIF_ALLWORK_MASK, %cx jne syscall_exit_work + +#ifdef CONFIG_X86_SWITCH_PAGETABLES + + GET_THREAD_INFO(%ebp) + movl TI_virtual_stack(%ebp), %edx + movl TI_user_pgd(%ebp), %ecx + movl %esp, %ebx + andl $0x1fff, %ebx + orl %ebx, %edx +sysexit_ret_start_marker: + movl %edx, %esp + movl %ecx, %cr3 +#endif + /* + * only ebx is not restored by the userspace sysenter vsyscall + * code, it assumes it to be callee-saved. + */ + movl EBX(%esp), %ebx + /* if something modifies registers it must also disable sysexit */ + movl EIP(%esp), %edx movl OLDESP(%esp), %ecx + sti sysexit +#ifdef CONFIG_X86_SWITCH_PAGETABLES +sysexit_ret_end_marker: + nop +#endif # system call handler stub @@ -287,7 +417,7 @@ cmpl $(nr_syscalls), %eax jae syscall_badsys # system call tracing in operation - testb $_TIF_SYSCALL_TRACE,TI_FLAGS(%ebp) + testb $_TIF_SYSCALL_TRACE,TI_flags(%ebp) jnz syscall_trace_entry syscall_call: call *sys_call_table(,%eax,4) @@ -296,10 +426,23 @@ cli # make sure we don't miss an interrupt # setting need_resched or sigpending # between sampling and the iret - movl TI_FLAGS(%ebp), %ecx + movl TI_flags(%ebp), %ecx testw $_TIF_ALLWORK_MASK, %cx # current->work jne syscall_exit_work restore_all: +#ifdef CONFIG_TRAP_BAD_SYSCALL_EXITS + movl EFLAGS(%esp), %eax # mix EFLAGS and CS + movb CS(%esp), %al + testl $(VM_MASK | 3), %eax + jz resume_kernelX # returning to kernel or vm86-space + + cmpl $0,TI_preempt_count(%ebp) # non-zero preempt_count ? + jz resume_kernelX + + int $3 + +resume_kernelX: +#endif RESTORE_ALL # perform work that needs to be done immediately before resumption @@ -312,7 +455,7 @@ cli # make sure we don't miss an interrupt # setting need_resched or sigpending # between sampling and the iret - movl TI_FLAGS(%ebp), %ecx + movl TI_flags(%ebp), %ecx andl $_TIF_WORK_MASK, %ecx # is there any work to be done other # than syscall tracing? 
jz restore_all @@ -327,6 +470,22 @@ # vm86-space xorl %edx, %edx call do_notify_resume + +#if CONFIG_X86_HIGH_ENTRY + /* + * Reload db7 if necessary: + */ + movl TI_flags(%ebp), %ecx + testb $_TIF_DB7, %cl + jnz work_db7 + + jmp restore_all + +work_db7: + movl TI_task(%ebp), %edx; + movl task_thread_db7(%edx), %edx; + movl %edx, %db7; +#endif jmp restore_all ALIGN @@ -382,7 +541,7 @@ */ .data ENTRY(interrupt) -.text +.previous vector=0 ENTRY(irq_entries_start) @@ -392,7 +551,7 @@ jmp common_interrupt .data .long 1b -.text +.previous vector=vector+1 .endr @@ -433,12 +592,17 @@ movl ES(%esp), %edi # get the function address movl %eax, ORIG_EAX(%esp) movl %ecx, ES(%esp) - movl %esp, %edx pushl %esi # push the error code - pushl %edx # push the pt_regs pointer movl $(__USER_DS), %edx movl %edx, %ds movl %edx, %es + +/* clobbers edx, ebx and ebp */ + __SWITCH_KERNELSPACE + + leal 4(%esp), %edx # prepare pt_regs + pushl %edx # push pt_regs + call *%edi addl $8, %esp jmp ret_from_exception @@ -529,7 +693,7 @@ pushl %edx call do_nmi addl $8, %esp - RESTORE_ALL + jmp restore_all nmi_stack_fixup: FIX_STACK(12,nmi_stack_correct, 1) @@ -606,6 +770,8 @@ pushl $do_spurious_interrupt_bug jmp error_code +.previous + .data ENTRY(sys_call_table) .long sys_restart_syscall /* 0 - old "setup()" system call, used for restarting */ --- diff/arch/i386/kernel/head.S 2003-10-09 09:47:16.000000000 +0100 +++ source/arch/i386/kernel/head.S 2003-11-26 10:09:04.000000000 +0000 @@ -16,6 +16,7 @@ #include <asm/pgtable.h> #include <asm/desc.h> #include <asm/cache.h> +#include <asm/asm_offsets.h> #define OLD_CL_MAGIC_ADDR 0x90020 #define OLD_CL_MAGIC 0xA33F @@ -330,7 +331,7 @@ /* This is the default interrupt "handler" :-) */ int_msg: - .asciz "Unknown interrupt\n" + .asciz "Unknown interrupt or fault at EIP %p %p %p\n" ALIGN ignore_int: cld @@ -342,9 +343,17 @@ movl $(__KERNEL_DS),%eax movl %eax,%ds movl %eax,%es + pushl 16(%esp) + pushl 24(%esp) + pushl 32(%esp) + pushl 40(%esp) pushl $int_msg call printk popl %eax + popl %eax + popl %eax + popl %eax + popl %eax popl %ds popl %es popl %edx @@ -377,23 +386,27 @@ .fill NR_CPUS-1,8,0 # space for the other GDT descriptors /* - * This is initialized to create an identity-mapping at 0-8M (for bootup - * purposes) and another mapping of the 0-8M area at virtual address + * This is initialized to create an identity-mapping at 0-16M (for bootup + * purposes) and another mapping of the 0-16M area at virtual address * PAGE_OFFSET. */ .org 0x1000 ENTRY(swapper_pg_dir) .long 0x00102007 .long 0x00103007 - .fill BOOT_USER_PGD_PTRS-2,4,0 - /* default: 766 entries */ + .long 0x00104007 + .long 0x00105007 + .fill BOOT_USER_PGD_PTRS-4,4,0 + /* default: 764 entries */ .long 0x00102007 .long 0x00103007 - /* default: 254 entries */ - .fill BOOT_KERNEL_PGD_PTRS-2,4,0 + .long 0x00104007 + .long 0x00105007 + /* default: 252 entries */ + .fill BOOT_KERNEL_PGD_PTRS-4,4,0 /* - * The page tables are initialized to only 8MB here - the final page + * The page tables are initialized to only 16MB here - the final page * tables are set up later depending on memory size. */ .org 0x2000 @@ -402,15 +415,21 @@ .org 0x3000 ENTRY(pg1) +.org 0x4000 +ENTRY(pg2) + +.org 0x5000 +ENTRY(pg3) + /* * empty_zero_page must immediately follow the page tables ! (The * initialization loop counts until empty_zero_page) */ -.org 0x4000 +.org 0x6000 ENTRY(empty_zero_page) -.org 0x5000 +.org 0x7000 /* * Real beginning of normal "text" segment @@ -419,12 +438,12 @@ ENTRY(_stext) /* - * This starts the data section. 
Note that the above is all - * in the text section because it has alignment requirements - * that we cannot fulfill any other way. + * This starts the data section. */ .data +.align PAGE_SIZE_asm + /* * The Global Descriptor Table contains 28 quadwords, per-CPU. */ @@ -439,7 +458,9 @@ .quad 0x00cf9a000000ffff /* kernel 4GB code at 0x00000000 */ .quad 0x00cf92000000ffff /* kernel 4GB data at 0x00000000 */ #endif - .align L1_CACHE_BYTES + +.align PAGE_SIZE_asm + ENTRY(cpu_gdt_table) .quad 0x0000000000000000 /* NULL descriptor */ .quad 0x0000000000000000 /* 0x0b reserved */ --- diff/arch/i386/kernel/i386_ksyms.c 2003-10-09 09:47:33.000000000 +0100 +++ source/arch/i386/kernel/i386_ksyms.c 2003-11-26 10:09:04.000000000 +0000 @@ -98,7 +98,6 @@ EXPORT_SYMBOL_NOVERS(__down_failed_trylock); EXPORT_SYMBOL_NOVERS(__up_wakeup); /* Networking helper routines. */ -EXPORT_SYMBOL(csum_partial_copy_generic); /* Delay loops */ EXPORT_SYMBOL(__ndelay); EXPORT_SYMBOL(__udelay); @@ -112,13 +111,17 @@ EXPORT_SYMBOL(strpbrk); EXPORT_SYMBOL(strstr); +#if !defined(CONFIG_X86_UACCESS_INDIRECT) EXPORT_SYMBOL(strncpy_from_user); -EXPORT_SYMBOL(__strncpy_from_user); +EXPORT_SYMBOL(__direct_strncpy_from_user); EXPORT_SYMBOL(clear_user); EXPORT_SYMBOL(__clear_user); EXPORT_SYMBOL(__copy_from_user_ll); EXPORT_SYMBOL(__copy_to_user_ll); EXPORT_SYMBOL(strnlen_user); +#else /* CONFIG_X86_UACCESS_INDIRECT */ +EXPORT_SYMBOL(direct_csum_partial_copy_generic); +#endif EXPORT_SYMBOL(dma_alloc_coherent); EXPORT_SYMBOL(dma_free_coherent); --- diff/arch/i386/kernel/i387.c 2003-10-09 09:47:33.000000000 +0100 +++ source/arch/i386/kernel/i387.c 2003-11-26 10:09:04.000000000 +0000 @@ -218,6 +218,7 @@ static int convert_fxsr_to_user( struct _fpstate __user *buf, struct i387_fxsave_struct *fxsave ) { + struct _fpreg tmp[8]; /* 80 bytes scratch area */ unsigned long env[7]; struct _fpreg __user *to; struct _fpxreg *from; @@ -234,23 +235,25 @@ if ( __copy_to_user( buf, env, 7 * sizeof(unsigned long) ) ) return 1; - to = &buf->_st[0]; + to = tmp; from = (struct _fpxreg *) &fxsave->st_space[0]; for ( i = 0 ; i < 8 ; i++, to++, from++ ) { unsigned long *t = (unsigned long *)to; unsigned long *f = (unsigned long *)from; - if (__put_user(*f, t) || - __put_user(*(f + 1), t + 1) || - __put_user(from->exponent, &to->exponent)) - return 1; + *t = *f; + *(t + 1) = *(f+1); + to->exponent = from->exponent; } + if (copy_to_user(buf->_st, tmp, sizeof(struct _fpreg [8]))) + return 1; return 0; } static int convert_fxsr_from_user( struct i387_fxsave_struct *fxsave, struct _fpstate __user *buf ) { + struct _fpreg tmp[8]; /* 80 bytes scratch area */ unsigned long env[7]; struct _fpxreg *to; struct _fpreg __user *from; @@ -258,6 +261,8 @@ if ( __copy_from_user( env, buf, 7 * sizeof(long) ) ) return 1; + if (copy_from_user(tmp, buf->_st, sizeof(struct _fpreg [8]))) + return 1; fxsave->cwd = (unsigned short)(env[0] & 0xffff); fxsave->swd = (unsigned short)(env[1] & 0xffff); @@ -269,15 +274,14 @@ fxsave->fos = env[6]; to = (struct _fpxreg *) &fxsave->st_space[0]; - from = &buf->_st[0]; + from = tmp; for ( i = 0 ; i < 8 ; i++, to++, from++ ) { unsigned long *t = (unsigned long *)to; unsigned long *f = (unsigned long *)from; - if (__get_user(*t, f) || - __get_user(*(t + 1), f + 1) || - __get_user(to->exponent, &from->exponent)) - return 1; + *t = *f; + *(t + 1) = *(f + 1); + to->exponent = from->exponent; } return 0; } --- diff/arch/i386/kernel/i8259.c 2003-06-30 10:07:18.000000000 +0100 +++ source/arch/i386/kernel/i8259.c 2003-11-26 10:09:04.000000000 +0000 @@ 
-419,8 +419,10 @@ * us. (some of these will be overridden and become * 'special' SMP interrupts) */ - for (i = 0; i < NR_IRQS; i++) { + for (i = 0; i < (NR_VECTORS - FIRST_EXTERNAL_VECTOR); i++) { int vector = FIRST_EXTERNAL_VECTOR + i; + if (i >= NR_IRQS) + break; if (vector != SYSCALL_VECTOR) set_intr_gate(vector, interrupt[i]); } --- diff/arch/i386/kernel/init_task.c 2003-10-09 09:47:33.000000000 +0100 +++ source/arch/i386/kernel/init_task.c 2003-11-26 10:09:04.000000000 +0000 @@ -26,7 +26,7 @@ */ union thread_union init_thread_union __attribute__((__section__(".data.init_task"))) = - { INIT_THREAD_INFO(init_task) }; + { INIT_THREAD_INFO(init_task, init_thread_union) }; /* * Initial task structure. @@ -44,5 +44,5 @@ * section. Since TSS's are completely CPU-local, we want them * on exact cacheline boundaries, to eliminate cacheline ping-pong. */ -struct tss_struct init_tss[NR_CPUS] __cacheline_aligned = { [0 ... NR_CPUS-1] = INIT_TSS }; +struct tss_struct init_tss[NR_CPUS] __attribute__((__section__(".data.tss"))) = { [0 ... NR_CPUS-1] = INIT_TSS }; --- diff/arch/i386/kernel/io_apic.c 2003-10-27 09:20:43.000000000 +0000 +++ source/arch/i386/kernel/io_apic.c 2003-11-26 10:09:04.000000000 +0000 @@ -76,6 +76,14 @@ int apic, pin, next; } irq_2_pin[PIN_MAP_SIZE]; +#ifdef CONFIG_PCI_USE_VECTOR +int vector_irq[NR_IRQS] = { [0 ... NR_IRQS -1] = -1}; +#define vector_to_irq(vector) \ + (platform_legacy_irq(vector) ? vector : vector_irq[vector]) +#else +#define vector_to_irq(vector) (vector) +#endif + /* * The common case is 1:1 IRQ<->pin mappings. Sometimes there are * shared ISA-space IRQs, so we have to support them. We are super @@ -249,7 +257,7 @@ clear_IO_APIC_pin(apic, pin); } -static void set_ioapic_affinity(unsigned int irq, cpumask_t cpumask) +static void set_ioapic_affinity_irq(unsigned int irq, cpumask_t cpumask) { unsigned long flags; int pin; @@ -288,7 +296,7 @@ extern cpumask_t irq_affinity[NR_IRQS]; -static cpumask_t __cacheline_aligned pending_irq_balance_cpumask[NR_IRQS]; +cpumask_t __cacheline_aligned pending_irq_balance_cpumask[NR_IRQS]; #define IRQBALANCE_CHECK_ARCH -999 static int irqbalance_disabled = IRQBALANCE_CHECK_ARCH; @@ -670,13 +678,11 @@ __setup("noirqbalance", irqbalance_disable); -static void set_ioapic_affinity(unsigned int irq, cpumask_t mask); - static inline void move_irq(int irq) { /* note - we hold the desc->lock */ if (unlikely(!cpus_empty(pending_irq_balance_cpumask[irq]))) { - set_ioapic_affinity(irq, pending_irq_balance_cpumask[irq]); + set_ioapic_affinity_irq(irq, pending_irq_balance_cpumask[irq]); cpus_clear(pending_irq_balance_cpumask[irq]); } } @@ -853,7 +859,7 @@ if (irq_entry == -1) continue; irq = pin_2_irq(irq_entry, ioapic, pin); - set_ioapic_affinity(irq, mask); + set_ioapic_affinity_irq(irq, mask); } } @@ -1141,7 +1147,8 @@ /* irq_vectors is indexed by the sum of all RTEs in all I/O APICs. 
*/ u8 irq_vector[NR_IRQ_VECTORS] = { FIRST_DEVICE_VECTOR , 0 }; -static int __init assign_irq_vector(int irq) +#ifndef CONFIG_PCI_USE_VECTOR +int __init assign_irq_vector(int irq) { static int current_vector = FIRST_DEVICE_VECTOR, offset = 0; BUG_ON(irq >= NR_IRQ_VECTORS); @@ -1158,11 +1165,36 @@ } IO_APIC_VECTOR(irq) = current_vector; + return current_vector; } +#endif + +static struct hw_interrupt_type ioapic_level_type; +static struct hw_interrupt_type ioapic_edge_type; -static struct hw_interrupt_type ioapic_level_irq_type; -static struct hw_interrupt_type ioapic_edge_irq_type; +#define IOAPIC_AUTO -1 +#define IOAPIC_EDGE 0 +#define IOAPIC_LEVEL 1 + +static inline void ioapic_register_intr(int irq, int vector, unsigned long trigger) +{ + if (use_pci_vector() && !platform_legacy_irq(irq)) { + if ((trigger == IOAPIC_AUTO && IO_APIC_irq_trigger(irq)) || + trigger == IOAPIC_LEVEL) + irq_desc[vector].handler = &ioapic_level_type; + else + irq_desc[vector].handler = &ioapic_edge_type; + set_intr_gate(vector, interrupt[vector]); + } else { + if ((trigger == IOAPIC_AUTO && IO_APIC_irq_trigger(irq)) || + trigger == IOAPIC_LEVEL) + irq_desc[irq].handler = &ioapic_level_type; + else + irq_desc[irq].handler = &ioapic_edge_type; + set_intr_gate(vector, interrupt[irq]); + } +} void __init setup_IO_APIC_irqs(void) { @@ -1220,13 +1252,7 @@ if (IO_APIC_IRQ(irq)) { vector = assign_irq_vector(irq); entry.vector = vector; - - if (IO_APIC_irq_trigger(irq)) - irq_desc[irq].handler = &ioapic_level_irq_type; - else - irq_desc[irq].handler = &ioapic_edge_irq_type; - - set_intr_gate(vector, interrupt[irq]); + ioapic_register_intr(irq, vector, IOAPIC_AUTO); if (!apic && (irq < 16)) disable_8259A_irq(irq); @@ -1273,7 +1299,7 @@ * The timer IRQ doesn't have to know that behind the * scene we have a 8259A-master in AEOI mode ... */ - irq_desc[0].handler = &ioapic_edge_irq_type; + irq_desc[0].handler = &ioapic_edge_type; /* * Add it to the IO-APIC irq-routing table: @@ -1624,10 +1650,6 @@ unsigned char old_id; unsigned long flags; - if (acpi_ioapic) - /* This gets done during IOAPIC enumeration for ACPI. */ - return; - /* * This is broken; anything with a real cpu count has to * circumvent this idiocy regardless. @@ -1763,9 +1785,6 @@ * that was delayed but this is now handled in the device * independent code. */ -#define enable_edge_ioapic_irq unmask_IO_APIC_irq - -static void disable_edge_ioapic_irq (unsigned int irq) { /* nothing */ } /* * Starting up a edge-triggered IO-APIC interrupt is @@ -1776,7 +1795,6 @@ * This is not complete - we should be able to fake * an edge even if it isn't on the 8259A... */ - static unsigned int startup_edge_ioapic_irq(unsigned int irq) { int was_pending = 0; @@ -1794,8 +1812,6 @@ return was_pending; } -#define shutdown_edge_ioapic_irq disable_edge_ioapic_irq - /* * Once we have recorded IRQ_PENDING already, we can mask the * interrupt for real. This prevents IRQ storms from unhandled @@ -1810,9 +1826,6 @@ ack_APIC_irq(); } -static void end_edge_ioapic_irq (unsigned int i) { /* nothing */ } - - /* * Level triggered interrupts can just be masked, * and shutting down and starting up the interrupt @@ -1834,10 +1847,6 @@ return 0; /* don't check for pending */ } -#define shutdown_level_ioapic_irq mask_IO_APIC_irq -#define enable_level_ioapic_irq unmask_IO_APIC_irq -#define disable_level_ioapic_irq mask_IO_APIC_irq - static void end_level_ioapic_irq (unsigned int irq) { unsigned long v; @@ -1864,6 +1873,7 @@ * The idea is from Manfred Spraul. 
--macro */ i = IO_APIC_VECTOR(irq); + v = apic_read(APIC_TMR + ((i & ~0x1f) >> 1)); ack_APIC_irq(); @@ -1898,7 +1908,57 @@ } } -static void mask_and_ack_level_ioapic_irq (unsigned int irq) { /* nothing */ } +#ifdef CONFIG_PCI_USE_VECTOR +static unsigned int startup_edge_ioapic_vector(unsigned int vector) +{ + int irq = vector_to_irq(vector); + + return startup_edge_ioapic_irq(irq); +} + +static void ack_edge_ioapic_vector(unsigned int vector) +{ + int irq = vector_to_irq(vector); + + ack_edge_ioapic_irq(irq); +} + +static unsigned int startup_level_ioapic_vector (unsigned int vector) +{ + int irq = vector_to_irq(vector); + + return startup_level_ioapic_irq (irq); +} + +static void end_level_ioapic_vector (unsigned int vector) +{ + int irq = vector_to_irq(vector); + + end_level_ioapic_irq(irq); +} + +static void mask_IO_APIC_vector (unsigned int vector) +{ + int irq = vector_to_irq(vector); + + mask_IO_APIC_irq(irq); +} + +static void unmask_IO_APIC_vector (unsigned int vector) +{ + int irq = vector_to_irq(vector); + + unmask_IO_APIC_irq(irq); +} + +static void set_ioapic_affinity_vector (unsigned int vector, + unsigned long cpu_mask) +{ + int irq = vector_to_irq(vector); + + set_ioapic_affinity_irq(irq, cpu_mask); +} +#endif /* * Level and edge triggered IO-APIC interrupts need different handling, @@ -1908,26 +1968,25 @@ * edge-triggered handler, without risking IRQ storms and other ugly * races. */ - -static struct hw_interrupt_type ioapic_edge_irq_type = { +static struct hw_interrupt_type ioapic_edge_type = { .typename = "IO-APIC-edge", - .startup = startup_edge_ioapic_irq, - .shutdown = shutdown_edge_ioapic_irq, - .enable = enable_edge_ioapic_irq, - .disable = disable_edge_ioapic_irq, - .ack = ack_edge_ioapic_irq, - .end = end_edge_ioapic_irq, + .startup = startup_edge_ioapic, + .shutdown = shutdown_edge_ioapic, + .enable = enable_edge_ioapic, + .disable = disable_edge_ioapic, + .ack = ack_edge_ioapic, + .end = end_edge_ioapic, .set_affinity = set_ioapic_affinity, }; -static struct hw_interrupt_type ioapic_level_irq_type = { +static struct hw_interrupt_type ioapic_level_type = { .typename = "IO-APIC-level", - .startup = startup_level_ioapic_irq, - .shutdown = shutdown_level_ioapic_irq, - .enable = enable_level_ioapic_irq, - .disable = disable_level_ioapic_irq, - .ack = mask_and_ack_level_ioapic_irq, - .end = end_level_ioapic_irq, + .startup = startup_level_ioapic, + .shutdown = shutdown_level_ioapic, + .enable = enable_level_ioapic, + .disable = disable_level_ioapic, + .ack = mask_and_ack_level_ioapic, + .end = end_level_ioapic, .set_affinity = set_ioapic_affinity, }; @@ -1947,7 +2006,13 @@ * 0x80, because int 0x80 is hm, kind of importantish. ;) */ for (irq = 0; irq < NR_IRQS ; irq++) { - if (IO_APIC_IRQ(irq) && !IO_APIC_VECTOR(irq)) { + int tmp = irq; + if (use_pci_vector()) { + if (!platform_legacy_irq(tmp)) + if ((tmp = vector_to_irq(tmp)) == -1) + continue; + } + if (IO_APIC_IRQ(tmp) && !IO_APIC_VECTOR(tmp)) { /* * Hmm.. We don't have an entry for this, * so default to an old-fashioned 8259 @@ -2217,12 +2282,14 @@ /* * Set up IO-APIC IRQ routing. 
*/ - setup_ioapic_ids_from_mpc(); + if (!acpi_ioapic) + setup_ioapic_ids_from_mpc(); sync_Arb_IDs(); setup_IO_APIC_irqs(); init_IO_APIC_traps(); check_timer(); - print_IO_APIC(); + if (!acpi_ioapic) + print_IO_APIC(); } /* @@ -2379,10 +2446,12 @@ "IRQ %d Mode:%i Active:%i)\n", ioapic, mp_ioapics[ioapic].mpc_apicid, pin, entry.vector, irq, edge_level, active_high_low); + if (use_pci_vector() && !platform_legacy_irq(irq)) + irq = IO_APIC_VECTOR(irq); if (edge_level) { - irq_desc[irq].handler = &ioapic_level_irq_type; + irq_desc[irq].handler = &ioapic_level_type; } else { - irq_desc[irq].handler = &ioapic_edge_irq_type; + irq_desc[irq].handler = &ioapic_edge_type; } set_intr_gate(entry.vector, interrupt[irq]); --- diff/arch/i386/kernel/irq.c 2003-10-27 09:20:43.000000000 +0000 +++ source/arch/i386/kernel/irq.c 2003-11-26 10:09:04.000000000 +0000 @@ -45,6 +45,7 @@ #include <asm/delay.h> #include <asm/desc.h> #include <asm/irq.h> +#include <asm/kgdb.h> /* * Linux has a controller-independent x86 interrupt architecture. @@ -138,17 +139,19 @@ int show_interrupts(struct seq_file *p, void *v) { - int i, j; + int i = *(int *) v, j; struct irqaction * action; unsigned long flags; - seq_printf(p, " "); - for (j=0; j<NR_CPUS; j++) - if (cpu_online(j)) - seq_printf(p, "CPU%d ",j); - seq_putc(p, '\n'); + if (i == 0) { + seq_printf(p, " "); + for (j=0; j<NR_CPUS; j++) + if (cpu_online(j)) + seq_printf(p, "CPU%d ",j); + seq_putc(p, '\n'); + } - for (i = 0 ; i < NR_IRQS ; i++) { + if (i < NR_IRQS) { spin_lock_irqsave(&irq_desc[i].lock, flags); action = irq_desc[i].action; if (!action) @@ -170,28 +173,32 @@ seq_putc(p, '\n'); skip: spin_unlock_irqrestore(&irq_desc[i].lock, flags); - } - seq_printf(p, "NMI: "); - for (j = 0; j < NR_CPUS; j++) - if (cpu_online(j)) - seq_printf(p, "%10u ", nmi_count(j)); - seq_putc(p, '\n'); + } else if (i == NR_IRQS) { + seq_printf(p, "NMI: "); + for (j = 0; j < NR_CPUS; j++) + if (cpu_online(j)) + seq_printf(p, "%10u ", nmi_count(j)); + seq_putc(p, '\n'); #ifdef CONFIG_X86_LOCAL_APIC - seq_printf(p, "LOC: "); - for (j = 0; j < NR_CPUS; j++) - if (cpu_online(j)) - seq_printf(p, "%10u ", irq_stat[j].apic_timer_irqs); - seq_putc(p, '\n'); + seq_printf(p, "LOC: "); + for (j = 0; j < NR_CPUS; j++) + if (cpu_online(j)) + seq_printf(p, "%10u ", irq_stat[j].apic_timer_irqs); + seq_putc(p, '\n'); #endif - seq_printf(p, "ERR: %10u\n", atomic_read(&irq_err_count)); + seq_printf(p, "ERR: %10u\n", atomic_read(&irq_err_count)); #ifdef CONFIG_X86_IO_APIC #ifdef APIC_MISMATCH_DEBUG - seq_printf(p, "MIS: %10u\n", atomic_read(&irq_mis_count)); + seq_printf(p, "MIS: %10u\n", atomic_read(&irq_mis_count)); #endif #endif + } return 0; } + + + #ifdef CONFIG_SMP inline void synchronize_irq(unsigned int irq) { @@ -502,6 +509,17 @@ irq_exit(); +#ifdef CONFIG_KGDB + /* + * We need to do this after clearing out of all the interrupt + * machinery because kgdb will reenter the NIC driver and the IRQ + * system. synchronize_irq() (at least) will deadlock. 
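The show_interrupts() rewrite above changes it from "print the whole table in one call" to a per-row seq_file ->show() callback: row 0 also prints the CPU header, rows 0..NR_IRQS-1 print one IRQ line each, and row NR_IRQS prints the NMI/LOC/ERR summary. The matching iterator lives outside these hunks; a minimal sketch of what it has to look like (names invented):

    static void *irq_seq_start(struct seq_file *f, loff_t *pos)
    {
            /* rows 0..NR_IRQS inclusive; hand ->show() the row index */
            return *pos <= NR_IRQS ? pos : NULL;
    }

    static void *irq_seq_next(struct seq_file *f, void *v, loff_t *pos)
    {
            (*pos)++;
            return *pos <= NR_IRQS ? pos : NULL;
    }

    static void irq_seq_stop(struct seq_file *f, void *v)
    {
    }

    static struct seq_operations irq_seq_ops = {
            .start  = irq_seq_start,
            .next   = irq_seq_next,
            .stop   = irq_seq_stop,
            .show   = show_interrupts,
    };

Note that show_interrupts() reads the index as *(int *) v while the iterator hands back a pointer to a loff_t; that type-pun only works because i386 is little-endian and the index stays well below 2^31.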
+ */ + if (kgdb_eth_need_breakpoint[smp_processor_id()]) { + kgdb_eth_need_breakpoint[smp_processor_id()] = 0; + BREAKPOINT; + } +#endif return 1; } @@ -949,19 +967,13 @@ static int irq_affinity_read_proc(char *page, char **start, off_t off, int count, int *eof, void *data) { - int k, len; - cpumask_t tmp = irq_affinity[(long)data]; + int len; if (count < HEX_DIGITS+1) return -EINVAL; - len = 0; - for (k = 0; k < sizeof(cpumask_t)/sizeof(u16); ++k) { - int j = sprintf(page, "%04hx", (u16)cpus_coerce(tmp)); - len += j; - page += j; - cpus_shift_right(tmp, tmp, 16); - } + len = format_cpumask(page, irq_affinity[(long)data]); + page += len; len += sprintf(page, "\n"); return len; } @@ -1000,10 +1012,16 @@ static int prof_cpu_mask_read_proc (char *page, char **start, off_t off, int count, int *eof, void *data) { - unsigned long *mask = (unsigned long *) data; + int len; + cpumask_t *mask = (cpumask_t *)data; + if (count < HEX_DIGITS+1) return -EINVAL; - return sprintf (page, "%08lx\n", *mask); + + len = format_cpumask(page, *mask); + page += len; + len += sprintf (page, "\n"); + return len; } static int prof_cpu_mask_write_proc (struct file *file, const char __user *buffer, --- diff/arch/i386/kernel/ldt.c 2003-10-09 09:47:16.000000000 +0100 +++ source/arch/i386/kernel/ldt.c 2003-11-26 10:09:04.000000000 +0000 @@ -2,7 +2,7 @@ * linux/kernel/ldt.c * * Copyright (C) 1992 Krishna Balasubramanian and Linus Torvalds - * Copyright (C) 1999 Ingo Molnar <mingo@redhat.com> + * Copyright (C) 1999, 2003 Ingo Molnar <mingo@redhat.com> */ #include <linux/errno.h> @@ -18,6 +18,8 @@ #include <asm/system.h> #include <asm/ldt.h> #include <asm/desc.h> +#include <linux/highmem.h> +#include <asm/atomic_kmap.h> #ifdef CONFIG_SMP /* avoids "defined but not used" warnig */ static void flush_ldt(void *null) @@ -29,34 +31,31 @@ static int alloc_ldt(mm_context_t *pc, int mincount, int reload) { - void *oldldt; - void *newldt; - int oldsize; + int oldsize, newsize, i; if (mincount <= pc->size) return 0; + /* + * LDT got larger - reallocate if necessary. 
+ */ oldsize = pc->size; mincount = (mincount+511)&(~511); - if (mincount*LDT_ENTRY_SIZE > PAGE_SIZE) - newldt = vmalloc(mincount*LDT_ENTRY_SIZE); - else - newldt = kmalloc(mincount*LDT_ENTRY_SIZE, GFP_KERNEL); - - if (!newldt) - return -ENOMEM; - - if (oldsize) - memcpy(newldt, pc->ldt, oldsize*LDT_ENTRY_SIZE); - oldldt = pc->ldt; - memset(newldt+oldsize*LDT_ENTRY_SIZE, 0, (mincount-oldsize)*LDT_ENTRY_SIZE); - pc->ldt = newldt; - wmb(); + newsize = mincount*LDT_ENTRY_SIZE; + for (i = 0; i < newsize; i += PAGE_SIZE) { + int nr = i/PAGE_SIZE; + BUG_ON(i >= 64*1024); + if (!pc->ldt_pages[nr]) { + pc->ldt_pages[nr] = alloc_page(GFP_HIGHUSER); + if (!pc->ldt_pages[nr]) + return -ENOMEM; + clear_highpage(pc->ldt_pages[nr]); + } + } pc->size = mincount; - wmb(); - if (reload) { #ifdef CONFIG_SMP cpumask_t mask; + preempt_disable(); load_LDT(pc); mask = cpumask_of_cpu(smp_processor_id()); @@ -67,21 +66,20 @@ load_LDT(pc); #endif } - if (oldsize) { - if (oldsize*LDT_ENTRY_SIZE > PAGE_SIZE) - vfree(oldldt); - else - kfree(oldldt); - } return 0; } static inline int copy_ldt(mm_context_t *new, mm_context_t *old) { - int err = alloc_ldt(new, old->size, 0); - if (err < 0) + int i, err, size = old->size, nr_pages = (size*LDT_ENTRY_SIZE + PAGE_SIZE-1)/PAGE_SIZE; + + err = alloc_ldt(new, size, 0); + if (err < 0) { + new->size = 0; return err; - memcpy(new->ldt, old->ldt, old->size*LDT_ENTRY_SIZE); + } + for (i = 0; i < nr_pages; i++) + copy_user_highpage(new->ldt_pages[i], old->ldt_pages[i], 0); return 0; } @@ -96,6 +94,7 @@ init_MUTEX(&mm->context.sem); mm->context.size = 0; + memset(mm->context.ldt_pages, 0, sizeof(struct page *) * MAX_LDT_PAGES); old_mm = current->mm; if (old_mm && old_mm->context.size > 0) { down(&old_mm->context.sem); @@ -107,23 +106,21 @@ /* * No need to lock the MM as we are the last user + * Do not touch the ldt register, we are already + * in the next thread. */ void destroy_context(struct mm_struct *mm) { - if (mm->context.size) { - if (mm == current->active_mm) - clear_LDT(); - if (mm->context.size*LDT_ENTRY_SIZE > PAGE_SIZE) - vfree(mm->context.ldt); - else - kfree(mm->context.ldt); - mm->context.size = 0; - } + int i, nr_pages = (mm->context.size*LDT_ENTRY_SIZE + PAGE_SIZE-1) / PAGE_SIZE; + + for (i = 0; i < nr_pages; i++) + __free_page(mm->context.ldt_pages[i]); + mm->context.size = 0; } static int read_ldt(void __user * ptr, unsigned long bytecount) { - int err; + int err, i; unsigned long size; struct mm_struct * mm = current->mm; @@ -138,8 +135,25 @@ size = bytecount; err = 0; - if (copy_to_user(ptr, mm->context.ldt, size)) - err = -EFAULT; + /* + * This is necessary just in case we got here straight from a + * context-switch where the ptes were set but no tlb flush + * was done yet. We rather avoid doing a TLB flush in the + * context-switch path and do it here instead. 
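A hypothetical helper (not in the patch) spelling out the page-count rule that copy_ldt() and destroy_context() above, and load_LDT_nolock() further down, all repeat inline: an LDT of count descriptors, at LDT_ENTRY_SIZE (8) bytes each, spans this many of the individually allocated highmem pages, with a hard ceiling of 64KB (16 pages) for the architectural maximum of 8192 entries:

    static inline int ldt_nr_pages(int count)
    {
            /* ceil(count * 8 bytes / PAGE_SIZE); at most 16 on i386 */
            return (count * LDT_ENTRY_SIZE + PAGE_SIZE - 1) / PAGE_SIZE;
    }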
+         */
+        __flush_tlb_global();
+
+        for (i = 0; i < size; i += PAGE_SIZE) {
+                int nr = i / PAGE_SIZE, bytes;
+                char *kaddr = kmap(mm->context.ldt_pages[nr]);
+
+                bytes = size - i;
+                if (bytes > PAGE_SIZE)
+                        bytes = PAGE_SIZE;
+                if (copy_to_user(ptr + i, kaddr, bytes))
+                        err = -EFAULT;
+                kunmap(mm->context.ldt_pages[nr]);
+        }
         up(&mm->context.sem);
         if (err < 0)
                 return err;
@@ -158,7 +172,7 @@
 
         err = 0;
         address = &default_ldt[0];
-        size = 5*sizeof(struct desc_struct);
+        size = 5*LDT_ENTRY_SIZE;
 
         if (size > bytecount)
                 size = bytecount;
@@ -200,7 +214,15 @@
                 goto out_unlock;
         }
 
-        lp = (__u32 *) ((ldt_info.entry_number << 3) + (char *) mm->context.ldt);
+        /*
+         * No rescheduling allowed from this point to the install.
+         *
+         * We do a TLB flush for the same reason as in the read_ldt() path.
+         */
+        preempt_disable();
+        __flush_tlb_global();
+        lp = (__u32 *) ((ldt_info.entry_number << 3) +
+                        (char *) __kmap_atomic_vaddr(KM_LDT_PAGE0));
 
         /* Allow LDTs to be cleared by the user. */
         if (ldt_info.base_addr == 0 && ldt_info.limit == 0) {
@@ -221,6 +243,7 @@
         *lp     = entry_1;
         *(lp+1) = entry_2;
         error   = 0;
+        preempt_enable();
 
 out_unlock:
         up(&mm->context.sem);
@@ -248,3 +271,26 @@
         }
         return ret;
 }
+
+/*
+ * load one particular LDT into the current CPU
+ */
+void load_LDT_nolock(mm_context_t *pc, int cpu)
+{
+        struct page **pages = pc->ldt_pages;
+        int count = pc->size;
+        int nr_pages, i;
+
+        if (likely(!count)) {
+                pages = &default_ldt_page;
+                count = 5;
+        }
+        nr_pages = (count*LDT_ENTRY_SIZE + PAGE_SIZE-1) / PAGE_SIZE;
+
+        for (i = 0; i < nr_pages; i++) {
+                __kunmap_atomic_type(KM_LDT_PAGE0 - i);
+                __kmap_atomic(pages[i], KM_LDT_PAGE0 - i);
+        }
+        set_ldt_desc(cpu, (void *)__kmap_atomic_vaddr(KM_LDT_PAGE0), count);
+        load_LDT_desc();
+}
--- diff/arch/i386/kernel/mpparse.c 2003-11-25 15:24:57.000000000 +0000
+++ source/arch/i386/kernel/mpparse.c 2003-11-26 10:09:04.000000000 +0000
@@ -668,7 +668,7 @@
         * Read the physical hardware table. Anything here will
         * override the defaults.
         */
-        if (!smp_read_mpc((void *)mpf->mpf_physptr)) {
+        if (!smp_read_mpc((void *)phys_to_virt(mpf->mpf_physptr))) {
                smp_found_config = 0;
                printk(KERN_ERR "BIOS bug, MP table errors detected!...\n");
                printk(KERN_ERR "... disabling SMP support. (tell your hw vendor)\n");
@@ -1129,8 +1129,11 @@
                        continue;
                ioapic_pin = irq - mp_ioapic_routing[ioapic].irq_start;
 
-                if (!ioapic && (irq < 16))
-                        irq += 16;
+                if (es7000_plat) {
+                        if (!ioapic && (irq < 16))
+                                irq += 16;
+                }
+
                /*
                 * Avoid pin reprogramming.
PRTs typically include entries * with redundant pin->irq mappings (but unique PCI devices); @@ -1147,21 +1150,29 @@ if ((1<<bit) & mp_ioapic_routing[ioapic].pin_programmed[idx]) { printk(KERN_DEBUG "Pin %d-%d already programmed\n", mp_ioapic_routing[ioapic].apic_id, ioapic_pin); - entry->irq = irq; + if (use_pci_vector() && !platform_legacy_irq(irq)) + irq = IO_APIC_VECTOR(irq); + entry->irq = irq; continue; } mp_ioapic_routing[ioapic].pin_programmed[idx] |= (1<<bit); - if (!io_apic_set_pci_routing(ioapic, ioapic_pin, irq, edge_level, active_high_low)) - entry->irq = irq; - + if (!io_apic_set_pci_routing(ioapic, ioapic_pin, irq, edge_level, active_high_low)) { + if (use_pci_vector() && !platform_legacy_irq(irq)) + irq = IO_APIC_VECTOR(irq); + entry->irq = irq; + } printk(KERN_DEBUG "%02x:%02x:%02x[%c] -> %d-%d -> IRQ %d\n", entry->id.segment, entry->id.bus, entry->id.device, ('A' + entry->pin), mp_ioapic_routing[ioapic].apic_id, ioapic_pin, entry->irq); } + + print_IO_APIC(); + + return; } #endif /*CONFIG_ACPI_PCI*/ --- diff/arch/i386/kernel/nmi.c 2003-10-09 09:47:16.000000000 +0100 +++ source/arch/i386/kernel/nmi.c 2003-11-26 10:09:04.000000000 +0000 @@ -31,7 +31,16 @@ #include <asm/mpspec.h> #include <asm/nmi.h> +#ifdef CONFIG_KGDB +#include <asm/kgdb.h> +#ifdef CONFIG_SMP +unsigned int nmi_watchdog = NMI_IO_APIC; +#else +unsigned int nmi_watchdog = NMI_LOCAL_APIC; +#endif +#else unsigned int nmi_watchdog = NMI_NONE; +#endif static unsigned int nmi_hz = HZ; unsigned int nmi_perfctr_msr; /* the MSR to reset in NMI handler */ extern void show_registers(struct pt_regs *regs); @@ -408,6 +417,9 @@ for (i = 0; i < NR_CPUS; i++) alert_counter[i] = 0; } +#ifdef CONFIG_KGDB +int tune_watchdog = 5*HZ; +#endif void nmi_watchdog_tick (struct pt_regs * regs) { @@ -421,12 +433,24 @@ sum = irq_stat[cpu].apic_timer_irqs; +#ifdef CONFIG_KGDB + if (! in_kgdb(regs) && last_irq_sums[cpu] == sum ) { + +#else if (last_irq_sums[cpu] == sum) { +#endif /* * Ayiee, looks like this CPU is stuck ... * wait a few IRQs (5 seconds) before doing the oops ... */ alert_counter[cpu]++; +#ifdef CONFIG_KGDB + if (alert_counter[cpu] == tune_watchdog) { + kgdb_handle_exception(2, SIGPWR, 0, regs); + last_irq_sums[cpu] = sum; + alert_counter[cpu] = 0; + } +#endif if (alert_counter[cpu] == 5*nmi_hz) { spin_lock(&nmi_print_lock); /* --- diff/arch/i386/kernel/process.c 2003-10-09 09:47:33.000000000 +0100 +++ source/arch/i386/kernel/process.c 2003-11-26 10:09:04.000000000 +0000 @@ -47,6 +47,7 @@ #include <asm/i387.h> #include <asm/irq.h> #include <asm/desc.h> +#include <asm/atomic_kmap.h> #ifdef CONFIG_MATH_EMULATION #include <asm/math_emu.h> #endif @@ -302,6 +303,9 @@ struct task_struct *tsk = current; memset(tsk->thread.debugreg, 0, sizeof(unsigned long)*8); +#ifdef CONFIG_X86_HIGH_ENTRY + clear_thread_flag(TIF_DB7); +#endif memset(tsk->thread.tls_array, 0, sizeof(tsk->thread.tls_array)); /* * Forget coprocessor state.. @@ -315,9 +319,8 @@ if (dead_task->mm) { // temporary debugging check if (dead_task->mm->context.size) { - printk("WARNING: dead process %8s still has LDT? <%p/%d>\n", + printk("WARNING: dead process %8s still has LDT? <%d>\n", dead_task->comm, - dead_task->mm->context.ldt, dead_task->mm->context.size); BUG(); } @@ -352,7 +355,17 @@ p->thread.esp = (unsigned long) childregs; p->thread.esp0 = (unsigned long) (childregs+1); + /* + * get the two stack pages, for the virtual stack. + * + * IMPORTANT: this code relies on the fact that the task + * structure is an 8K aligned piece of physical memory. 
+ */ + p->thread.stack_page0 = virt_to_page((unsigned long)p->thread_info); + p->thread.stack_page1 = virt_to_page((unsigned long)p->thread_info + PAGE_SIZE); + p->thread.eip = (unsigned long) ret_from_fork; + p->thread_info->real_stack = p->thread_info; savesegment(fs,p->thread.fs); savesegment(gs,p->thread.gs); @@ -504,10 +517,41 @@ __unlazy_fpu(prev_p); +#ifdef CONFIG_X86_HIGH_ENTRY + /* + * Set the ptes of the virtual stack. (NOTE: a one-page TLB flush is + * needed because otherwise NMIs could interrupt the + * user-return code with a virtual stack and stale TLBs.) + */ + __kunmap_atomic_type(KM_VSTACK0); + __kunmap_atomic_type(KM_VSTACK1); + __kmap_atomic(next->stack_page0, KM_VSTACK0); + __kmap_atomic(next->stack_page1, KM_VSTACK1); + + /* + * NOTE: here we rely on the task being the stack as well + */ + next_p->thread_info->virtual_stack = + (void *)__kmap_atomic_vaddr(KM_VSTACK0); + +#if defined(CONFIG_PREEMPT) && defined(CONFIG_SMP) + /* + * If next was preempted on entry from userspace to kernel, + * and now it's on a different cpu, we need to adjust %esp. + * This assumes that entry.S does not copy %esp while on the + * virtual stack (with interrupts enabled): which is so, + * except within __SWITCH_KERNELSPACE itself. + */ + if (unlikely(next->esp >= TASK_SIZE)) { + next->esp &= THREAD_SIZE - 1; + next->esp |= (unsigned long) next_p->thread_info->virtual_stack; + } +#endif +#endif /* * Reload esp0, LDT and the page table pointer: */ - load_esp0(tss, next->esp0); + load_virtual_esp0(tss, next_p); /* * Load the per-thread Thread-Local Storage descriptor. --- diff/arch/i386/kernel/reboot.c 2003-10-09 09:47:33.000000000 +0100 +++ source/arch/i386/kernel/reboot.c 2003-11-26 10:09:04.000000000 +0000 @@ -8,6 +8,7 @@ #include <linux/init.h> #include <linux/interrupt.h> #include <linux/mc146818rtc.h> +#include <linux/efi.h> #include <asm/uaccess.h> #include <asm/apic.h> #include "mach_reboot.h" @@ -154,12 +155,11 @@ CMOS_WRITE(0x00, 0x8f); spin_unlock_irqrestore(&rtc_lock, flags); - /* Remap the kernel at virtual address zero, as well as offset zero - from the kernel segment. This assumes the kernel segment starts at - virtual address PAGE_OFFSET. */ - - memcpy (swapper_pg_dir, swapper_pg_dir + USER_PGD_PTRS, - sizeof (swapper_pg_dir [0]) * KERNEL_PGD_PTRS); + /* + * Remap the first 16 MB of RAM (which includes the kernel image) + * at virtual address zero: + */ + setup_identity_mappings(swapper_pg_dir, 0, 16*1024*1024); /* * Use `swapper_pg_dir' as our page directory. 
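A userspace illustration (invented addresses, not kernel code) of the CONFIG_PREEMPT/CONFIG_SMP %esp fix-up in __switch_to() above: since the task structure plus stack is an 8KB-aligned 8KB block, a saved stack pointer can be rebased into the per-CPU virtual-stack window by keeping the low offset bits and swapping the base bits, provided the new window is 8KB aligned as well.

    #include <stdio.h>

    #define THREAD_SIZE_DEMO 8192UL                 /* 8KB, 8KB aligned */

    int main(void)
    {
            unsigned long new_base = 0xfe800000UL;  /* pretend kmap window */
            unsigned long esp = 0xc1234000UL + 0x1a74;  /* saved esp */

            esp &= THREAD_SIZE_DEMO - 1;            /* keep offset 0x1a74 */
            esp |= new_base;                        /* rebase onto window */
            printf("rebased esp = %#lx\n", esp);
            return 0;
    }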
@@ -263,7 +263,12 @@ disable_IO_APIC(); #endif - if(!reboot_thru_bios) { + if (!reboot_thru_bios) { + if (efi_enabled) { + efi.reset_system(EFI_RESET_COLD, EFI_SUCCESS, 0, 0); + __asm__ __volatile__("lidt %0": :"m" (no_idt)); + __asm__ __volatile__("int3"); + } /* rebooting needs to touch the page at absolute addr 0 */ *((unsigned short *)__va(0x472)) = reboot_mode; for (;;) { @@ -273,6 +278,8 @@ __asm__ __volatile__("int3"); } } + if (efi_enabled) + efi.reset_system(EFI_RESET_WARM, EFI_SUCCESS, 0, 0); machine_real_restart(jump_to_bios, sizeof(jump_to_bios)); } @@ -287,6 +294,8 @@ void machine_power_off(void) { + if (efi_enabled) + efi.reset_system(EFI_RESET_SHUTDOWN, EFI_SUCCESS, 0, 0); if (pm_power_off) pm_power_off(); } --- diff/arch/i386/kernel/setup.c 2003-10-27 09:20:43.000000000 +0000 +++ source/arch/i386/kernel/setup.c 2003-11-26 10:09:04.000000000 +0000 @@ -36,6 +36,8 @@ #include <linux/root_dev.h> #include <linux/highmem.h> #include <linux/module.h> +#include <linux/efi.h> +#include <linux/init.h> #include <video/edid.h> #include <asm/e820.h> #include <asm/mpspec.h> @@ -56,6 +58,10 @@ * Machine setup.. */ +#ifdef CONFIG_EFI +int efi_enabled = 0; +#endif + /* cpu data as detected by the assembly code in head.S */ struct cpuinfo_x86 new_cpu_data __initdata = { 0, 0, 0, 0, -1, 1, 0, 0, -1 }; /* common cpu data for all cpus */ @@ -144,6 +150,20 @@ unsigned long long current_addr = 0; int i; + if (efi_enabled) { + for (i = 0; i < memmap.nr_map; i++) { + current_addr = memmap.map[i].phys_addr + + (memmap.map[i].num_pages << 12); + if (memmap.map[i].type == EFI_CONVENTIONAL_MEMORY) { + if (current_addr >= size) { + memmap.map[i].num_pages -= + (((current_addr-size) + PAGE_SIZE-1) >> PAGE_SHIFT); + memmap.nr_map = i + 1; + return; + } + } + } + } for (i = 0; i < e820.nr_map; i++) { if (e820.map[i].type == E820_RAM) { current_addr = e820.map[i].addr + e820.map[i].size; @@ -159,17 +179,21 @@ static void __init add_memory_region(unsigned long long start, unsigned long long size, int type) { - int x = e820.nr_map; + int x; - if (x == E820MAX) { - printk(KERN_ERR "Ooops! Too many entries in the memory map!\n"); - return; - } + if (!efi_enabled) { + x = e820.nr_map; - e820.map[x].addr = start; - e820.map[x].size = size; - e820.map[x].type = type; - e820.nr_map++; + if (x == E820MAX) { + printk(KERN_ERR "Ooops! Too many entries in the memory map!\n"); + return; + } + + e820.map[x].addr = start; + e820.map[x].size = size; + e820.map[x].type = type; + e820.nr_map++; + } } /* add_memory_region */ #define E820_DEBUG 1 @@ -446,7 +470,6 @@ static void __init setup_memory_region(void) { char *who = machine_specific_memory_setup(); - printk(KERN_INFO "BIOS-provided physical RAM map:\n"); print_memory_map(who); } /* setup_memory_region */ @@ -584,6 +607,23 @@ } /* + * Callback for efi_memory_walk. + */ +static int __init +efi_find_max_pfn(unsigned long start, unsigned long end, void *arg) +{ + unsigned long *max_pfn = arg, pfn; + + if (start < end) { + pfn = PFN_UP(end -1); + if (pfn > *max_pfn) + *max_pfn = pfn; + } + return 0; +} + + +/* * Find the highest page frame number we have available */ void __init find_max_pfn(void) @@ -591,6 +631,11 @@ int i; max_pfn = 0; + if (efi_enabled) { + efi_memmap_walk(efi_find_max_pfn, &max_pfn); + return; + } + for (i = 0; i < e820.nr_map; i++) { unsigned long start, end; /* RAM? */ @@ -665,6 +710,25 @@ } #ifndef CONFIG_DISCONTIGMEM + +/* + * Free all available memory for boot time allocation. 
Used + * as a callback function by efi_memory_walk() + */ + +static int __init +free_available_memory(unsigned long start, unsigned long end, void *arg) +{ + /* check max_low_pfn */ + if (start >= ((max_low_pfn + 1) << PAGE_SHIFT)) + return 0; + if (end >= ((max_low_pfn + 1) << PAGE_SHIFT)) + end = (max_low_pfn + 1) << PAGE_SHIFT; + if (start < end) + free_bootmem(start, end - start); + + return 0; +} /* * Register fully available low RAM pages with the bootmem allocator. */ @@ -672,6 +736,10 @@ { int i; + if (efi_enabled) { + efi_memmap_walk(free_available_memory, NULL); + return; + } for (i = 0; i < e820.nr_map; i++) { unsigned long curr_pfn, last_pfn, size; /* @@ -799,9 +867,9 @@ * Request address space for all standard RAM and ROM resources * and also for regions reported as reserved by the e820. */ -static void __init register_memory(unsigned long max_low_pfn) +static void __init +legacy_init_iomem_resources(struct resource *code_resource, struct resource *data_resource) { - unsigned long low_mem_size; int i; probe_roms(); @@ -826,11 +894,26 @@ * so we try it repeatedly and let the resource manager * test it. */ - request_resource(res, &code_resource); - request_resource(res, &data_resource); + request_resource(res, code_resource); + request_resource(res, data_resource); } } +} +/* + * Request address space for all standard resources + */ +static void __init register_memory(unsigned long max_low_pfn) +{ + unsigned long low_mem_size; + int i; + + if (efi_enabled) + efi_initialize_iomem_resources(&code_resource, &data_resource); + else + legacy_init_iomem_resources(&code_resource, &data_resource); + + /* EFI systems may still have VGA */ request_graphics_resource(); /* request I/O space for devices used on all i[345]86 PCs */ @@ -950,6 +1033,13 @@ __setup("noreplacement", noreplacement_setup); +/* + * Determine if we were loaded by an EFI loader. If so, then we have also been + * passed the efi memmap, systab, etc., so we should use these data structures + * for initialization. Note, the efi init code path is determined by the + * global efi_enabled. This allows the same kernel image to be used on existing + * systems (with a traditional BIOS) as well as on EFI systems. + */ void __init setup_arch(char **cmdline_p) { unsigned long max_low_pfn; @@ -958,6 +1048,18 @@ pre_setup_arch_hook(); early_cpu_init(); + /* + * FIXME: This isn't an official loader_type right + * now but does currently work with elilo. + * If we were configured as an EFI kernel, check to make + * sure that we were loaded correctly from elilo and that + * the system table is valid. If not, then initialize normally. + */ +#ifdef CONFIG_EFI + if ((LOADER_TYPE == 0x50) && EFI_SYSTAB) + efi_enabled = 1; +#endif + ROOT_DEV = old_decode_dev(ORIG_ROOT_DEV); drive_info = DRIVE_INFO; screen_info = SCREEN_INFO; @@ -979,7 +1081,11 @@ rd_doload = ((RAMDISK_FLAGS & RAMDISK_LOAD_FLAG) != 0); #endif ARCH_SETUP - setup_memory_region(); + if (efi_enabled) + efi_init(); + else + setup_memory_region(); + copy_edd(); if (!MOUNT_ROOT_RDONLY) @@ -1013,6 +1119,8 @@ #ifdef CONFIG_X86_GENERICARCH generic_apic_probe(*cmdline_p); #endif + if (efi_enabled) + efi_map_memmap(); /* * Parse the ACPI tables for possible boot-time SMP configuration. 
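Both EFI callbacks above (efi_find_max_pfn() and free_available_memory()) follow the same efi_memmap_walk() convention: the walker invokes the callback once per usable physical range, passing [start, end) as byte addresses plus an opaque argument, and both callbacks return 0, taken here to mean "continue the walk". One more example in the same style, purely illustrative:

    /* hypothetical callback: total up usable conventional memory */
    static int __init count_usable_bytes(unsigned long start,
                                         unsigned long end, void *arg)
    {
            unsigned long long *total = arg;

            if (start < end)
                    *total += end - start;
            return 0;
    }

    /* usage sketch:
     *      unsigned long long bytes = 0;
     *      efi_memmap_walk(count_usable_bytes, &bytes);
     */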
@@ -1028,7 +1136,8 @@ #ifdef CONFIG_VT #if defined(CONFIG_VGA_CONSOLE) - conswitchp = &vga_con; + if (!efi_enabled || (efi_mem_type(0xa0000) != EFI_CONVENTIONAL_MEMORY)) + conswitchp = &vga_con; #elif defined(CONFIG_DUMMY_CONSOLE) conswitchp = &dummy_con; #endif --- diff/arch/i386/kernel/signal.c 2003-11-25 15:24:57.000000000 +0000 +++ source/arch/i386/kernel/signal.c 2003-11-26 10:09:04.000000000 +0000 @@ -128,28 +128,29 @@ */ static int -restore_sigcontext(struct pt_regs *regs, struct sigcontext __user *sc, int *peax) +restore_sigcontext(struct pt_regs *regs, + struct sigcontext __user *__sc, int *peax) { - unsigned int err = 0; + struct sigcontext scratch; /* 88 bytes of scratch area */ /* Always make any pending restarted system calls return -EINTR */ current_thread_info()->restart_block.fn = do_no_restart_syscall; -#define COPY(x) err |= __get_user(regs->x, &sc->x) + if (copy_from_user(&scratch, __sc, sizeof(scratch))) + return -EFAULT; + +#define COPY(x) regs->x = scratch.x #define COPY_SEG(seg) \ - { unsigned short tmp; \ - err |= __get_user(tmp, &sc->seg); \ + { unsigned short tmp = scratch.seg; \ regs->x##seg = tmp; } #define COPY_SEG_STRICT(seg) \ - { unsigned short tmp; \ - err |= __get_user(tmp, &sc->seg); \ + { unsigned short tmp = scratch.seg; \ regs->x##seg = tmp|3; } #define GET_SEG(seg) \ - { unsigned short tmp; \ - err |= __get_user(tmp, &sc->seg); \ + { unsigned short tmp = scratch.seg; \ loadsegment(seg,tmp); } GET_SEG(gs); @@ -168,27 +169,23 @@ COPY_SEG_STRICT(ss); { - unsigned int tmpflags; - err |= __get_user(tmpflags, &sc->eflags); + unsigned int tmpflags = scratch.eflags; regs->eflags = (regs->eflags & ~0x40DD5) | (tmpflags & 0x40DD5); regs->orig_eax = -1; /* disable syscall checks */ } { - struct _fpstate __user * buf; - err |= __get_user(buf, &sc->fpstate); + struct _fpstate * buf = scratch.fpstate; if (buf) { if (verify_area(VERIFY_READ, buf, sizeof(*buf))) - goto badframe; - err |= restore_i387(buf); + return -EFAULT; + if (restore_i387(buf)) + return -EFAULT; } } - err |= __get_user(*peax, &sc->eax); - return err; - -badframe: - return 1; + *peax = scratch.eax; + return 0; } asmlinkage int sys_sigreturn(unsigned long __unused) @@ -266,46 +263,47 @@ */ static int -setup_sigcontext(struct sigcontext __user *sc, struct _fpstate __user *fpstate, +setup_sigcontext(struct sigcontext __user *__sc, struct _fpstate __user *fpstate, struct pt_regs *regs, unsigned long mask) { - int tmp, err = 0; + struct sigcontext sc; /* 88 bytes of scratch area */ + int tmp; tmp = 0; __asm__("movl %%gs,%0" : "=r"(tmp): "0"(tmp)); - err |= __put_user(tmp, (unsigned int *)&sc->gs); + *(unsigned int *)&sc.gs = tmp; __asm__("movl %%fs,%0" : "=r"(tmp): "0"(tmp)); - err |= __put_user(tmp, (unsigned int *)&sc->fs); - - err |= __put_user(regs->xes, (unsigned int *)&sc->es); - err |= __put_user(regs->xds, (unsigned int *)&sc->ds); - err |= __put_user(regs->edi, &sc->edi); - err |= __put_user(regs->esi, &sc->esi); - err |= __put_user(regs->ebp, &sc->ebp); - err |= __put_user(regs->esp, &sc->esp); - err |= __put_user(regs->ebx, &sc->ebx); - err |= __put_user(regs->edx, &sc->edx); - err |= __put_user(regs->ecx, &sc->ecx); - err |= __put_user(regs->eax, &sc->eax); - err |= __put_user(current->thread.trap_no, &sc->trapno); - err |= __put_user(current->thread.error_code, &sc->err); - err |= __put_user(regs->eip, &sc->eip); - err |= __put_user(regs->xcs, (unsigned int *)&sc->cs); - err |= __put_user(regs->eflags, &sc->eflags); - err |= __put_user(regs->esp, &sc->esp_at_signal); - err |= 
__put_user(regs->xss, (unsigned int *)&sc->ss); + *(unsigned int *)&sc.fs = tmp; + *(unsigned int *)&sc.es = regs->xes; + *(unsigned int *)&sc.ds = regs->xds; + sc.edi = regs->edi; + sc.esi = regs->esi; + sc.ebp = regs->ebp; + sc.esp = regs->esp; + sc.ebx = regs->ebx; + sc.edx = regs->edx; + sc.ecx = regs->ecx; + sc.eax = regs->eax; + sc.trapno = current->thread.trap_no; + sc.err = current->thread.error_code; + sc.eip = regs->eip; + *(unsigned int *)&sc.cs = regs->xcs; + sc.eflags = regs->eflags; + sc.esp_at_signal = regs->esp; + *(unsigned int *)&sc.ss = regs->xss; tmp = save_i387(fpstate); if (tmp < 0) - err = 1; - else - err |= __put_user(tmp ? fpstate : NULL, &sc->fpstate); + return 1; + sc.fpstate = tmp ? fpstate : NULL; /* non-iBCS2 extensions.. */ - err |= __put_user(mask, &sc->oldmask); - err |= __put_user(current->thread.cr2, &sc->cr2); + sc.oldmask = mask; + sc.cr2 = current->thread.cr2; - return err; + if (copy_to_user(__sc, &sc, sizeof(sc))) + return 1; + return 0; } /* @@ -443,7 +441,7 @@ /* Create the ucontext. */ err |= __put_user(0, &frame->uc.uc_flags); err |= __put_user(0, &frame->uc.uc_link); - err |= __put_user(current->sas_ss_sp, &frame->uc.uc_stack.ss_sp); + err |= __put_user(current->sas_ss_sp, (unsigned long *)&frame->uc.uc_stack.ss_sp); err |= __put_user(sas_ss_flags(regs->esp), &frame->uc.uc_stack.ss_flags); err |= __put_user(current->sas_ss_size, &frame->uc.uc_stack.ss_size); --- diff/arch/i386/kernel/smp.c 2003-10-09 09:47:16.000000000 +0100 +++ source/arch/i386/kernel/smp.c 2003-11-26 10:09:04.000000000 +0000 @@ -327,10 +327,12 @@ if (flush_mm == cpu_tlbstate[cpu].active_mm) { if (cpu_tlbstate[cpu].state == TLBSTATE_OK) { +#ifndef CONFIG_X86_SWITCH_PAGETABLES if (flush_va == FLUSH_ALL) local_flush_tlb(); else __flush_tlb_one(flush_va); +#endif } else leave_mm(cpu); } @@ -396,21 +398,6 @@ spin_unlock(&tlbstate_lock); } -void flush_tlb_current_task(void) -{ - struct mm_struct *mm = current->mm; - cpumask_t cpu_mask; - - preempt_disable(); - cpu_mask = mm->cpu_vm_mask; - cpu_clear(smp_processor_id(), cpu_mask); - - local_flush_tlb(); - if (!cpus_empty(cpu_mask)) - flush_tlb_others(cpu_mask, mm, FLUSH_ALL); - preempt_enable(); -} - void flush_tlb_mm (struct mm_struct * mm) { cpumask_t cpu_mask; @@ -442,7 +429,10 @@ if (current->active_mm == mm) { if(current->mm) - __flush_tlb_one(va); +#ifndef CONFIG_X86_SWITCH_PAGETABLES + __flush_tlb_one(va) +#endif + ; else leave_mm(smp_processor_id()); } @@ -466,7 +456,17 @@ { on_each_cpu(do_flush_tlb_all, 0, 1, 1); } - +#ifdef CONFIG_KGDB +/* + * By using the NMI code instead of a vector we just sneak thru the + * word generator coming out with just what we want. AND it does + * not matter if clustered_apic_mode is set or not. + */ +void smp_send_nmi_allbutself(void) +{ + send_IPI_allbutself(APIC_DM_NMI); +} +#endif /* * this function sends a 'reschedule' IPI to another CPU. 
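The restore_sigcontext()/setup_sigcontext() rewrite above trades a long chain of __get_user()/__put_user() calls for one bulk copy through an 88-byte on-stack scratch sigcontext. Every individual user access carries its own exception setup, and with the 4G/4G split the rest of this series introduces, user accesses become more expensive still, so one copy_from_user()/copy_to_user() per signal frame is both cheaper and simpler to error-check. The bare pattern, with an invented helper name:

    /* sketch: fetch a whole user structure in a single access */
    static int fetch_sigcontext(struct sigcontext __user *usc,
                                struct sigcontext *scratch)
    {
            if (copy_from_user(scratch, usc, sizeof(*scratch)))
                    return -EFAULT;
            /* every later field access is plain kernel memory */
            return 0;
    }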
* it goes straight through and wastes no time serializing --- diff/arch/i386/kernel/sysenter.c 2003-10-09 09:47:16.000000000 +0100 +++ source/arch/i386/kernel/sysenter.c 2003-11-26 10:09:04.000000000 +0000 @@ -18,13 +18,18 @@ #include <asm/msr.h> #include <asm/pgtable.h> #include <asm/unistd.h> +#include <linux/highmem.h> extern asmlinkage void sysenter_entry(void); void enable_sep_cpu(void *info) { int cpu = get_cpu(); +#ifdef CONFIG_X86_HIGH_ENTRY + struct tss_struct *tss = (struct tss_struct *) __fix_to_virt(FIX_TSS_0) + cpu; +#else struct tss_struct *tss = init_tss + cpu; +#endif tss->ss1 = __KERNEL_CS; tss->esp1 = sizeof(struct tss_struct) + (unsigned long) tss; --- diff/arch/i386/kernel/time.c 2003-11-25 15:24:57.000000000 +0000 +++ source/arch/i386/kernel/time.c 2003-11-26 10:09:04.000000000 +0000 @@ -44,6 +44,7 @@ #include <linux/module.h> #include <linux/sysdev.h> #include <linux/bcd.h> +#include <linux/efi.h> #include <asm/io.h> #include <asm/smp.h> @@ -94,7 +95,7 @@ { unsigned long seq; unsigned long usec, sec; - unsigned long max_ntp_tick = tick_usec - tickadj; + unsigned long max_ntp_tick; do { unsigned long lost; @@ -110,13 +111,14 @@ * Better to lose some accuracy than have time go backwards.. */ if (unlikely(time_adjust < 0)) { + max_ntp_tick = (USEC_PER_SEC / HZ) - tickadj; usec = min(usec, max_ntp_tick); if (lost) usec += lost * max_ntp_tick; } else if (unlikely(lost)) - usec += lost * tick_usec; + usec += lost * (USEC_PER_SEC / HZ); sec = xtime.tv_sec; usec += (xtime.tv_nsec / 1000); @@ -174,7 +176,10 @@ /* gets recalled with irq locally disabled */ spin_lock(&rtc_lock); - retval = mach_set_rtc_mmss(nowtime); + if (efi_enabled) + retval = efi_set_rtc_mmss(nowtime); + else + retval = mach_set_rtc_mmss(nowtime); spin_unlock(&rtc_lock); return retval; @@ -232,7 +237,13 @@ >= USEC_AFTER - ((unsigned) TICK_SIZE) / 2 && (xtime.tv_nsec / 1000) <= USEC_BEFORE + ((unsigned) TICK_SIZE) / 2) { - if (set_rtc_mmss(xtime.tv_sec) == 0) + /* horrible...FIXME */ + if (efi_enabled) { + if (efi_set_rtc_mmss(xtime.tv_sec) == 0) + last_rtc_update = xtime.tv_sec; + else + last_rtc_update = xtime.tv_sec - 600; + } else if (set_rtc_mmss(xtime.tv_sec) == 0) last_rtc_update = xtime.tv_sec; else last_rtc_update = xtime.tv_sec - 600; /* do it again in 60 s */ @@ -286,7 +297,10 @@ spin_lock(&rtc_lock); - retval = mach_get_cmos_time(); + if (efi_enabled) + retval = efi_get_time(); + else + retval = mach_get_cmos_time(); spin_unlock(&rtc_lock); @@ -297,6 +311,7 @@ set_kset_name("pit"), }; + /* XXX this driverfs stuff should probably go elsewhere later -john */ static struct sys_device device_i8253 = { .id = 0, @@ -328,6 +343,8 @@ } cur_timer = select_timer(); + printk(KERN_INFO "Using %s for high-res timesource\n",cur_timer->name); + time_init_hook(); } #endif @@ -344,12 +361,13 @@ return; } #endif - xtime.tv_sec = get_cmos_time(); wall_to_monotonic.tv_sec = -xtime.tv_sec; xtime.tv_nsec = (INITIAL_JIFFIES % HZ) * (NSEC_PER_SEC / HZ); wall_to_monotonic.tv_nsec = -xtime.tv_nsec; cur_timer = select_timer(); + printk(KERN_INFO "Using %s for high-res timesource\n",cur_timer->name); + time_init_hook(); } --- diff/arch/i386/kernel/timers/Makefile 2003-09-30 15:46:11.000000000 +0100 +++ source/arch/i386/kernel/timers/Makefile 2003-11-26 10:09:04.000000000 +0000 @@ -6,3 +6,4 @@ obj-$(CONFIG_X86_CYCLONE_TIMER) += timer_cyclone.o obj-$(CONFIG_HPET_TIMER) += timer_hpet.o +obj-$(CONFIG_X86_PM_TIMER) += timer_pm.o --- diff/arch/i386/kernel/timers/common.c 2003-09-30 15:46:11.000000000 +0100 +++ 
source/arch/i386/kernel/timers/common.c 2003-11-26 10:09:04.000000000 +0000 @@ -137,3 +137,23 @@ } #endif +/* calculate cpu_khz */ +void __init init_cpu_khz(void) +{ + if (cpu_has_tsc) { + unsigned long tsc_quotient = calibrate_tsc(); + if (tsc_quotient) { + /* report CPU clock rate in Hz. + * The formula is (10^6 * 2^32) / (2^32 * 1 / (clocks/us)) = + * clock/second. Our precision is about 100 ppm. + */ + { unsigned long eax=0, edx=1000; + __asm__("divl %2" + :"=a" (cpu_khz), "=d" (edx) + :"r" (tsc_quotient), + "0" (eax), "1" (edx)); + printk("Detected %lu.%03lu MHz processor.\n", cpu_khz / 1000, cpu_khz % 1000); + } + } + } +} --- diff/arch/i386/kernel/timers/timer.c 2003-09-17 12:28:01.000000000 +0100 +++ source/arch/i386/kernel/timers/timer.c 2003-11-26 10:09:04.000000000 +0000 @@ -19,6 +19,9 @@ #ifdef CONFIG_HPET_TIMER &timer_hpet, #endif +#ifdef CONFIG_X86_PM_TIMER + &timer_pmtmr, +#endif &timer_tsc, &timer_pit, NULL, --- diff/arch/i386/kernel/timers/timer_cyclone.c 2003-10-09 09:47:33.000000000 +0100 +++ source/arch/i386/kernel/timers/timer_cyclone.c 2003-11-26 10:09:04.000000000 +0000 @@ -212,26 +212,7 @@ } } - /* init cpu_khz. - * XXX - This should really be done elsewhere, - * and in a more generic fashion. -johnstul@us.ibm.com - */ - if (cpu_has_tsc) { - unsigned long tsc_quotient = calibrate_tsc(); - if (tsc_quotient) { - /* report CPU clock rate in Hz. - * The formula is (10^6 * 2^32) / (2^32 * 1 / (clocks/us)) = - * clock/second. Our precision is about 100 ppm. - */ - { unsigned long eax=0, edx=1000; - __asm__("divl %2" - :"=a" (cpu_khz), "=d" (edx) - :"r" (tsc_quotient), - "0" (eax), "1" (edx)); - printk("Detected %lu.%03lu MHz processor.\n", cpu_khz / 1000, cpu_khz % 1000); - } - } - } + init_cpu_khz(); /* Everything looks good! */ return 0; @@ -253,6 +234,7 @@ /* cyclone timer_opts struct */ struct timer_opts timer_cyclone = { + .name = "cyclone", .init = init_cyclone, .mark_offset = mark_offset_cyclone, .get_offset = get_offset_cyclone, --- diff/arch/i386/kernel/timers/timer_hpet.c 2003-10-09 09:47:33.000000000 +0100 +++ source/arch/i386/kernel/timers/timer_hpet.c 2003-11-26 10:09:04.000000000 +0000 @@ -178,6 +178,7 @@ /* tsc timer_opts struct */ struct timer_opts timer_hpet = { + .name = "hpet", .init = init_hpet, .mark_offset = mark_offset_hpet, .get_offset = get_offset_hpet, --- diff/arch/i386/kernel/timers/timer_none.c 2003-05-21 11:49:59.000000000 +0100 +++ source/arch/i386/kernel/timers/timer_none.c 2003-11-26 10:09:04.000000000 +0000 @@ -36,6 +36,7 @@ /* tsc timer_opts struct */ struct timer_opts timer_none = { + .name = "none", .init = init_none, .mark_offset = mark_offset_none, .get_offset = get_offset_none, --- diff/arch/i386/kernel/timers/timer_pit.c 2003-05-21 11:50:14.000000000 +0100 +++ source/arch/i386/kernel/timers/timer_pit.c 2003-11-26 10:09:04.000000000 +0000 @@ -149,6 +149,7 @@ /* tsc timer_opts struct */ struct timer_opts timer_pit = { + .name = "pit", .init = init_pit, .mark_offset = mark_offset_pit, .get_offset = get_offset_pit, --- diff/arch/i386/kernel/timers/timer_tsc.c 2003-11-25 15:24:57.000000000 +0000 +++ source/arch/i386/kernel/timers/timer_tsc.c 2003-11-26 10:09:04.000000000 +0000 @@ -472,6 +472,7 @@ /* tsc timer_opts struct */ struct timer_opts timer_tsc = { + .name = "tsc", .init = init_tsc, .mark_offset = mark_offset_tsc, .get_offset = get_offset_tsc, --- diff/arch/i386/kernel/traps.c 2003-10-27 09:20:36.000000000 +0000 +++ source/arch/i386/kernel/traps.c 2003-11-26 10:09:04.000000000 +0000 @@ -54,12 +54,8 @@ #include "mach_traps.h" 
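In timers/common.c above, the divl in the new init_cpu_khz() is a 64/32-bit division: calibrate_tsc() returns 2^32 divided by the number of TSC ticks per microsecond, and dividing edx:eax = 1000 * 2^32 by that quotient yields ticks per millisecond, i.e. the clock rate in kHz. The same arithmetic in portable C, with an invented clock rate:

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
            double ticks_per_usec = 2393.7;         /* pretend 2.3937 GHz */
            uint32_t quotient = (uint32_t)(4294967296.0 / ticks_per_usec);
            uint32_t cpu_khz  = (uint32_t)((1000ULL << 32) / quotient);

            printf("Detected %u.%03u MHz processor.\n",
                   cpu_khz / 1000, cpu_khz % 1000);
            return 0;
    }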
-asmlinkage int system_call(void); -asmlinkage void lcall7(void); -asmlinkage void lcall27(void); - -struct desc_struct default_ldt[] = { { 0, 0 }, { 0, 0 }, { 0, 0 }, - { 0, 0 }, { 0, 0 } }; +struct desc_struct default_ldt[] __attribute__((__section__(".data.default_ldt"))) = { { 0, 0 }, { 0, 0 }, { 0, 0 }, { 0, 0 }, { 0, 0 } }; +struct page *default_ldt_page; /* Do we ignore FPU interrupts ? */ char ignore_fpu_irq = 0; @@ -91,6 +87,43 @@ asmlinkage void spurious_interrupt_bug(void); asmlinkage void machine_check(void); +#ifdef CONFIG_KGDB +extern void sysenter_entry(void); +#include <asm/kgdb.h> +#include <linux/init.h> +extern void int3(void); +extern void debug(void); +void set_intr_gate(unsigned int n, void *addr); +static void set_intr_usr_gate(unsigned int n, void *addr); +/* + * Should be able to call this breakpoint() very early in + * bring up. Just hard code the call where needed. + * The breakpoint() code is here because set_?_gate() functions + * are local (static) to trap.c. They need be done only once, + * but it does not hurt to do them over. + */ +void breakpoint(void) +{ + init_entry_mappings(); + set_intr_usr_gate(3,&int3); /* disable ints on trap */ + set_intr_gate(1,&debug); + set_intr_gate(14,&page_fault); + + BREAKPOINT; +} +#define CHK_REMOTE_DEBUG(trapnr,signr,error_code,regs,after) \ + { \ + if (!user_mode(regs) ) \ + { \ + kgdb_handle_exception(trapnr, signr, error_code, regs); \ + after; \ + } else if ((trapnr == 3) && (regs->eflags &0x200)) local_irq_enable(); \ + } +#else +#define CHK_REMOTE_DEBUG(trapnr,signr,error_code,regs,after) +#endif + + static int kstack_depth_to_print = 24; void show_trace(struct task_struct *task, unsigned long * stack) @@ -175,8 +208,9 @@ ss = regs->xss & 0xffff; } print_modules(); - printk("CPU: %d\nEIP: %04x:[<%08lx>] %s\nEFLAGS: %08lx\n", - smp_processor_id(), 0xffff & regs->xcs, regs->eip, print_tainted(), regs->eflags); + printk("CPU: %d\nEIP: %04x:[<%08lx>] %s VLI\nEFLAGS: %08lx\n", + smp_processor_id(), 0xffff & regs->xcs, + regs->eip, print_tainted(), regs->eflags); print_symbol("EIP is at %s\n", regs->eip); printk("eax: %08lx ebx: %08lx ecx: %08lx edx: %08lx\n", @@ -192,23 +226,27 @@ * time of the fault.. */ if (in_kernel) { + u8 *eip; printk("\nStack: "); show_stack(NULL, (unsigned long*)esp); printk("Code: "); - if(regs->eip < PAGE_OFFSET) - goto bad; - for(i=0;i<20;i++) - { - unsigned char c; - if(__get_user(c, &((unsigned char*)regs->eip)[i])) { -bad: + eip = (u8 *)regs->eip - 43; + for (i = 0; i < 64; i++, eip++) { + unsigned char c = 0xff; + + if ((user_mode(regs) && get_user(c, eip)) || + (!user_mode(regs) && __direct_get_user(c, eip))) { + printk(" Bad EIP value."); break; } - printk("%02x ", c); + if (eip == (u8 *)regs->eip) + printk("<%02x> ", c); + else + printk("%02x ", c); } } printk("\n"); @@ -255,12 +293,36 @@ void die(const char * str, struct pt_regs * regs, long err) { static int die_counter; + int nl = 0; console_verbose(); spin_lock_irq(&die_lock); bust_spinlocks(1); handle_BUG(regs); printk("%s: %04lx [#%d]\n", str, err & 0xffff, ++die_counter); +#ifdef CONFIG_PREEMPT + printk("PREEMPT "); + nl = 1; +#endif +#ifdef CONFIG_SMP + printk("SMP "); + nl = 1; +#endif +#ifdef CONFIG_DEBUG_PAGEALLOC + printk("DEBUG_PAGEALLOC"); + nl = 1; +#endif + if (nl) + printk("\n"); +#ifdef CONFIG_KGDB + /* This is about the only place we want to go to kgdb even if in + * user mode. But we must go in via a trap so within kgdb we will + * always be in kernel mode. 
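The new "Code:" dump in show_registers() above prints a 64-byte window starting 43 bytes before the faulting instruction and brackets the byte at EIP (the added "VLI" tag in the banner appears to mark this variable-length-instruction dump format); the leading bytes give a disassembler enough context to resynchronize across variable-length x86 instructions. A userspace mock-up of just the formatting, over an empty stand-in buffer:

    #include <stdio.h>

    static void dump_code(const unsigned char *eip)
    {
            const unsigned char *p = eip - 43;
            int i;

            printf("Code: ");
            for (i = 0; i < 64; i++, p++)
                    printf(p == eip ? "<%02x> " : "%02x ", *p);
            printf("\n");
    }

    int main(void)
    {
            static unsigned char text[128];         /* stand-in stream */

            dump_code(text + 64);
            return 0;
    }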
+ */ + if (user_mode(regs)) + BREAKPOINT; +#endif + CHK_REMOTE_DEBUG(0,SIGTRAP,err,regs,) show_registers(regs); bust_spinlocks(0); spin_unlock_irq(&die_lock); @@ -330,6 +392,7 @@ #define DO_ERROR(trapnr, signr, str, name) \ asmlinkage void do_##name(struct pt_regs * regs, long error_code) \ { \ + CHK_REMOTE_DEBUG(trapnr,signr,error_code,regs,)\ do_trap(trapnr, signr, str, 0, regs, error_code, NULL); \ } @@ -347,7 +410,9 @@ #define DO_VM86_ERROR(trapnr, signr, str, name) \ asmlinkage void do_##name(struct pt_regs * regs, long error_code) \ { \ + CHK_REMOTE_DEBUG(trapnr, signr, error_code,regs, return)\ do_trap(trapnr, signr, str, 1, regs, error_code, NULL); \ + return; \ } #define DO_VM86_ERROR_INFO(trapnr, signr, str, name, sicode, siaddr) \ @@ -394,8 +459,10 @@ return; gp_in_kernel: - if (!fixup_exception(regs)) + if (!fixup_exception(regs)){ + CHK_REMOTE_DEBUG(13,SIGSEGV,error_code,regs,) die("general protection fault", regs, error_code); + } } static void mem_parity_error(unsigned char reason, struct pt_regs * regs) @@ -534,10 +601,18 @@ if (regs->eflags & X86_EFLAGS_IF) local_irq_enable(); - /* Mask out spurious debug traps due to lazy DR7 setting */ + /* + * Mask out spurious debug traps due to lazy DR7 setting or + * due to 4G/4G kernel mode: + */ if (condition & (DR_TRAP0|DR_TRAP1|DR_TRAP2|DR_TRAP3)) { if (!tsk->thread.debugreg[7]) goto clear_dr7; + if (!user_mode(regs)) { + // restore upon return-to-userspace: + set_thread_flag(TIF_DB7); + goto clear_dr7; + } } if (regs->eflags & VM_MASK) @@ -557,8 +632,18 @@ * allowing programs to debug themselves without the ptrace() * interface. */ +#ifdef CONFIG_KGDB + /* + * I think this is the only "real" case of a TF in the kernel + * that really belongs to user space. Others are + * "Ours all ours!" + */ + if (((regs->xcs & 3) == 0) && ((void *)regs->eip == sysenter_entry)) + goto clear_TF_reenable; +#else if ((regs->xcs & 3) == 0) goto clear_TF_reenable; +#endif if ((tsk->ptrace & (PT_DTRACE|PT_PTRACED)) == PT_DTRACE) goto clear_TF; } @@ -570,6 +655,17 @@ info.si_errno = 0; info.si_code = TRAP_BRKPT; +#ifdef CONFIG_KGDB + /* + * If this is a kernel mode trap, we need to reset db7 to allow us + * to continue sanely ALSO skip the signal delivery + */ + if ((regs->xcs & 3) == 0) + goto clear_dr7; + + /* if not kernel, allow ints but only if they were on */ + if ( regs->eflags & 0x200) local_irq_enable(); +#endif /* If this is a kernel mode trap, save the user PC on entry to * the kernel, that's what the debugger can make sense of. */ @@ -584,6 +680,7 @@ __asm__("movl %0,%%db7" : /* no output */ : "r" (0)); + CHK_REMOTE_DEBUG(1,SIGTRAP,error_code,regs,) return; debug_vm86: @@ -779,19 +876,53 @@ #endif /* CONFIG_MATH_EMULATION */ -#ifdef CONFIG_X86_F00F_BUG -void __init trap_init_f00f_bug(void) +void __init trap_init_virtual_IDT(void) { - __set_fixmap(FIX_F00F_IDT, __pa(&idt_table), PAGE_KERNEL_RO); - /* - * Update the IDT descriptor and reload the IDT so that - * it uses the read-only mapped virtual address. + * "idt" is magic - it overlaps the idt_descr + * variable so that updating idt will automatically + * update the idt descriptor.. 
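The "magic overlap" comment above refers to the 6-byte pseudo-descriptor that the lidt/lgdt instructions consume, so redirecting the IDT to its read-only fixmap alias is just a rewrite of the base field plus a reload, as the code below does. The layout, roughly as asm-i386/desc.h declares it in this era:

    struct Xgt_desc_struct {
            unsigned short size;            /* limit: sizeof(idt_table) - 1 */
            unsigned long address __attribute__((packed)); /* linear base */
    };

    /* after __set_fixmap(FIX_IDT, __pa(&idt_table), PAGE_KERNEL_RO):
     *      idt_descr.address = __fix_to_virt(FIX_IDT);
     *      asm volatile("lidt %0" : : "m" (idt_descr));
     */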
*/ - idt_descr.address = fix_to_virt(FIX_F00F_IDT); + __set_fixmap(FIX_IDT, __pa(&idt_table), PAGE_KERNEL_RO); + idt_descr.address = __fix_to_virt(FIX_IDT); + __asm__ __volatile__("lidt %0" : : "m" (idt_descr)); } + +void __init trap_init_virtual_GDT(void) +{ + int cpu = smp_processor_id(); + struct Xgt_desc_struct *gdt_desc = cpu_gdt_descr + cpu; + struct Xgt_desc_struct tmp_desc = {0, 0}; + struct tss_struct * t; + + __asm__ __volatile__("sgdt %0": "=m" (tmp_desc): :"memory"); + +#ifdef CONFIG_X86_HIGH_ENTRY + if (!cpu) { + __set_fixmap(FIX_GDT_0, __pa(cpu_gdt_table), PAGE_KERNEL); + __set_fixmap(FIX_GDT_1, __pa(cpu_gdt_table) + PAGE_SIZE, PAGE_KERNEL); + __set_fixmap(FIX_TSS_0, __pa(init_tss), PAGE_KERNEL); + __set_fixmap(FIX_TSS_1, __pa(init_tss) + 1*PAGE_SIZE, PAGE_KERNEL); + __set_fixmap(FIX_TSS_2, __pa(init_tss) + 2*PAGE_SIZE, PAGE_KERNEL); + __set_fixmap(FIX_TSS_3, __pa(init_tss) + 3*PAGE_SIZE, PAGE_KERNEL); + } + + gdt_desc->address = __fix_to_virt(FIX_GDT_0) + sizeof(cpu_gdt_table[0]) * cpu; +#else + gdt_desc->address = (unsigned long)cpu_gdt_table[cpu]; +#endif + __asm__ __volatile__("lgdt %0": "=m" (*gdt_desc)); + +#ifdef CONFIG_X86_HIGH_ENTRY + t = (struct tss_struct *) __fix_to_virt(FIX_TSS_0) + cpu; +#else + t = init_tss + cpu; #endif + set_tss_desc(cpu, t); + cpu_gdt_table[cpu][GDT_ENTRY_TSS].b &= 0xfffffdff; + load_TR_desc(); +} #define _set_gate(gate_addr,type,dpl,addr,seg) \ do { \ @@ -818,20 +949,26 @@ _set_gate(idt_table+n,14,0,addr,__KERNEL_CS); } -static void __init set_trap_gate(unsigned int n, void *addr) +void __init set_trap_gate(unsigned int n, void *addr) { _set_gate(idt_table+n,15,0,addr,__KERNEL_CS); } -static void __init set_system_gate(unsigned int n, void *addr) +void __init set_system_gate(unsigned int n, void *addr) { _set_gate(idt_table+n,15,3,addr,__KERNEL_CS); } -static void __init set_call_gate(void *a, void *addr) +void __init set_call_gate(void *a, void *addr) { _set_gate(a,12,3,addr,__KERNEL_CS); } +#ifdef CONFIG_KGDB +void set_intr_usr_gate(unsigned int n, void *addr) +{ + _set_gate(idt_table+n,14,3,addr,__KERNEL_CS); +} +#endif static void __init set_task_gate(unsigned int n, unsigned int gdt_entry) { @@ -850,11 +987,16 @@ #ifdef CONFIG_X86_LOCAL_APIC init_apic_mappings(); #endif + init_entry_mappings(); set_trap_gate(0,÷_error); set_intr_gate(1,&debug); set_intr_gate(2,&nmi); +#ifndef CONFIG_KGDB set_system_gate(3,&int3); /* int3-5 can be called from all */ +#else + set_intr_usr_gate(3,&int3); /* int3-5 can be called from all */ +#endif set_system_gate(4,&overflow); set_system_gate(5,&bounds); set_trap_gate(6,&invalid_op); --- diff/arch/i386/kernel/vm86.c 2003-11-25 15:24:57.000000000 +0000 +++ source/arch/i386/kernel/vm86.c 2003-11-26 10:09:04.000000000 +0000 @@ -124,7 +124,8 @@ tss = init_tss + get_cpu(); current->thread.esp0 = current->thread.saved_esp0; - load_esp0(tss, current->thread.esp0); + current->thread.sysenter_cs = __KERNEL_CS; + load_virtual_esp0(tss, current); current->thread.saved_esp0 = 0; put_cpu(); @@ -301,8 +302,10 @@ asm volatile("movl %%gs,%0":"=m" (tsk->thread.saved_gs)); tss = init_tss + get_cpu(); - tss->esp0 = tsk->thread.esp0 = (unsigned long) &info->VM86_TSS_ESP0; - disable_sysenter(tss); + tsk->thread.esp0 = (unsigned long) &info->VM86_TSS_ESP0; + if (cpu_has_sep) + tsk->thread.sysenter_cs = 0; + load_virtual_esp0(tss, tsk); put_cpu(); tsk->thread.screen_bitmap = info->screen_bitmap; --- diff/arch/i386/kernel/vmlinux.lds.S 2003-10-09 09:47:16.000000000 +0100 +++ source/arch/i386/kernel/vmlinux.lds.S 2003-11-26 
10:09:04.000000000 +0000 @@ -3,6 +3,9 @@ */ #include <asm-generic/vmlinux.lds.h> +#include <linux/config.h> +#include <asm/page.h> +#include <asm/asm_offsets.h> OUTPUT_FORMAT("elf32-i386", "elf32-i386", "elf32-i386") OUTPUT_ARCH(i386) @@ -10,7 +13,7 @@ jiffies = jiffies_64; SECTIONS { - . = 0xC0000000 + 0x100000; + . = __PAGE_OFFSET + 0x100000; /* read-only */ _text = .; /* Text and read-only data */ .text : { @@ -19,6 +22,19 @@ *(.gnu.warning) } = 0x9090 +#ifdef CONFIG_X86_4G + . = ALIGN(PAGE_SIZE_asm); + __entry_tramp_start = .; + . = FIX_ENTRY_TRAMPOLINE_0_addr; + __start___entry_text = .; + .entry.text : AT (__entry_tramp_start) { *(.entry.text) } + __entry_tramp_end = __entry_tramp_start + SIZEOF(.entry.text); + . = __entry_tramp_end; + . = ALIGN(PAGE_SIZE_asm); +#else + .entry.text : { *(.entry.text) } +#endif + _etext = .; /* End of text section */ . = ALIGN(16); /* Exception table */ @@ -34,15 +50,12 @@ CONSTRUCTORS } - . = ALIGN(4096); + . = ALIGN(PAGE_SIZE_asm); __nosave_begin = .; .data_nosave : { *(.data.nosave) } - . = ALIGN(4096); + . = ALIGN(PAGE_SIZE_asm); __nosave_end = .; - . = ALIGN(4096); - .data.page_aligned : { *(.data.idt) } - . = ALIGN(32); .data.cacheline_aligned : { *(.data.cacheline_aligned) } @@ -52,7 +65,7 @@ .data.init_task : { *(.data.init_task) } /* will be freed after init */ - . = ALIGN(4096); /* Init code and data */ + . = ALIGN(PAGE_SIZE_asm); /* Init code and data */ __init_begin = .; .init.text : { _sinittext = .; @@ -91,7 +104,7 @@ from .altinstructions and .eh_frame */ .exit.text : { *(.exit.text) } .exit.data : { *(.exit.data) } - . = ALIGN(4096); + . = ALIGN(PAGE_SIZE_asm); __initramfs_start = .; .init.ramfs : { *(.init.ramfs) } __initramfs_end = .; @@ -99,10 +112,22 @@ __per_cpu_start = .; .data.percpu : { *(.data.percpu) } __per_cpu_end = .; - . = ALIGN(4096); + . = ALIGN(PAGE_SIZE_asm); __init_end = .; /* freed after init ends here */ - + + . = ALIGN(PAGE_SIZE_asm); + .data.page_aligned_tss : { *(.data.tss) } + + . = ALIGN(PAGE_SIZE_asm); + .data.page_aligned_default_ldt : { *(.data.default_ldt) } + + . = ALIGN(PAGE_SIZE_asm); + .data.page_aligned_idt : { *(.data.idt) } + + . = ALIGN(PAGE_SIZE_asm); + .data.page_aligned_gdt : { *(.data.gdt) } + __bss_start = .; /* BSS */ .bss : { *(.bss) } __bss_stop = .; @@ -122,4 +147,6 @@ .stab.index 0 : { *(.stab.index) } .stab.indexstr 0 : { *(.stab.indexstr) } .comment 0 : { *(.comment) } + + } --- diff/arch/i386/kernel/vsyscall-sysenter.S 2003-10-09 09:47:16.000000000 +0100 +++ source/arch/i386/kernel/vsyscall-sysenter.S 2003-11-26 10:09:04.000000000 +0000 @@ -7,6 +7,11 @@ .type __kernel_vsyscall,@function __kernel_vsyscall: .LSTART_vsyscall: + cmpl $192, %eax + jne 1f + int $0x80 + ret +1: push %ecx .Lpush_ecx: push %edx --- diff/arch/i386/kernel/vsyscall.lds 2003-10-09 09:47:16.000000000 +0100 +++ source/arch/i386/kernel/vsyscall.lds 2003-11-26 10:09:04.000000000 +0000 @@ -5,7 +5,7 @@ */ /* This must match <asm/fixmap.h>. 
*/ -VSYSCALL_BASE = 0xffffe000; +VSYSCALL_BASE = 0xffffd000; SECTIONS { --- diff/arch/i386/lib/Makefile 2003-10-09 09:47:16.000000000 +0100 +++ source/arch/i386/lib/Makefile 2003-11-26 10:09:04.000000000 +0000 @@ -9,4 +9,5 @@ lib-$(CONFIG_X86_USE_3DNOW) += mmx.o lib-$(CONFIG_HAVE_DEC_LOCK) += dec_and_lock.o +lib-$(CONFIG_KGDB) += kgdb_serial.o lib-$(CONFIG_DEBUG_IOVIRT) += iodebug.o --- diff/arch/i386/lib/checksum.S 2003-10-09 09:47:16.000000000 +0100 +++ source/arch/i386/lib/checksum.S 2003-11-26 10:09:04.000000000 +0000 @@ -280,14 +280,14 @@ .previous .align 4 -.globl csum_partial_copy_generic +.globl direct_csum_partial_copy_generic #ifndef CONFIG_X86_USE_PPRO_CHECKSUM #define ARGBASE 16 #define FP 12 -csum_partial_copy_generic: +direct_csum_partial_copy_generic: subl $4,%esp pushl %edi pushl %esi @@ -422,7 +422,7 @@ #define ARGBASE 12 -csum_partial_copy_generic: +direct_csum_partial_copy_generic: pushl %ebx pushl %edi pushl %esi --- diff/arch/i386/lib/dec_and_lock.c 2003-10-09 09:47:16.000000000 +0100 +++ source/arch/i386/lib/dec_and_lock.c 2003-11-26 10:09:04.000000000 +0000 @@ -10,6 +10,7 @@ #include <linux/spinlock.h> #include <asm/atomic.h> +#ifndef ATOMIC_DEC_AND_LOCK int atomic_dec_and_lock(atomic_t *atomic, spinlock_t *lock) { int counter; @@ -38,3 +39,5 @@ spin_unlock(lock); return 0; } +#endif + --- diff/arch/i386/lib/getuser.S 2003-10-09 09:47:16.000000000 +0100 +++ source/arch/i386/lib/getuser.S 2003-11-26 10:09:04.000000000 +0000 @@ -9,6 +9,7 @@ * return value. */ #include <asm/thread_info.h> +#include <asm/asm_offsets.h> /* @@ -28,7 +29,7 @@ .globl __get_user_1 __get_user_1: GET_THREAD_INFO(%edx) - cmpl TI_ADDR_LIMIT(%edx),%eax + cmpl TI_addr_limit(%edx),%eax jae bad_get_user 1: movzbl (%eax),%edx xorl %eax,%eax @@ -40,7 +41,7 @@ addl $1,%eax jc bad_get_user GET_THREAD_INFO(%edx) - cmpl TI_ADDR_LIMIT(%edx),%eax + cmpl TI_addr_limit(%edx),%eax jae bad_get_user 2: movzwl -1(%eax),%edx xorl %eax,%eax @@ -52,7 +53,7 @@ addl $3,%eax jc bad_get_user GET_THREAD_INFO(%edx) - cmpl TI_ADDR_LIMIT(%edx),%eax + cmpl TI_addr_limit(%edx),%eax jae bad_get_user 3: movl -3(%eax),%edx xorl %eax,%eax --- diff/arch/i386/lib/usercopy.c 2003-10-09 09:47:16.000000000 +0100 +++ source/arch/i386/lib/usercopy.c 2003-11-26 10:09:04.000000000 +0000 @@ -76,7 +76,7 @@ * and returns @count. */ long -__strncpy_from_user(char *dst, const char __user *src, long count) +__direct_strncpy_from_user(char *dst, const char __user *src, long count) { long res; __do_strncpy_from_user(dst, src, count, res); @@ -102,7 +102,7 @@ * and returns @count. */ long -strncpy_from_user(char *dst, const char __user *src, long count) +direct_strncpy_from_user(char *dst, const char __user *src, long count) { long res = -EFAULT; if (access_ok(VERIFY_READ, src, 1)) @@ -147,7 +147,7 @@ * On success, this will be zero. */ unsigned long -clear_user(void __user *to, unsigned long n) +direct_clear_user(void __user *to, unsigned long n) { might_sleep(); if (access_ok(VERIFY_WRITE, to, n)) @@ -167,7 +167,7 @@ * On success, this will be zero. */ unsigned long -__clear_user(void __user *to, unsigned long n) +__direct_clear_user(void __user *to, unsigned long n) { __do_clear_user(to, n); return n; @@ -184,7 +184,7 @@ * On exception, returns 0. * If the string is too long, returns a value greater than @n. 
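The direct_* renames in checksum.S and usercopy.c above only make sense alongside the uaccess indirection the rest of this series adds: with the 4G/4G split enabled, the generic entry points become wrappers that switch into the user pagetables first, and without it the old implementations are mapped straight back. A hypothetical reconstruction of the glue (the config symbol and exact defines are assumed, not quoted from these hunks):

    #ifndef CONFIG_X86_UACCESS_INDIRECT
    /* no 4G/4G split: the direct versions are the real entry points */
    # define strncpy_from_user      direct_strncpy_from_user
    # define __strncpy_from_user    __direct_strncpy_from_user
    # define clear_user             direct_clear_user
    # define __clear_user           __direct_clear_user
    # define strnlen_user           direct_strnlen_user
    #else
    /* 4G/4G: out-of-line wrappers that enter the user pagetables */
    extern long strncpy_from_user(char *dst, const char __user *src,
                                  long count);
    extern unsigned long clear_user(void __user *to, unsigned long n);
    extern long strnlen_user(const char __user *s, long n);
    #endif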
*/ -long strnlen_user(const char __user *s, long n) +long direct_strnlen_user(const char __user *s, long n) { unsigned long mask = -__addr_ok(s); unsigned long res, tmp; @@ -573,3 +573,4 @@ n = __copy_user_zeroing_intel(to, (const void *) from, n); return n; } + --- diff/arch/i386/mach-es7000/es7000.c 2003-06-30 10:07:28.000000000 +0100 +++ source/arch/i386/mach-es7000/es7000.c 2003-11-26 10:09:04.000000000 +0000 @@ -51,8 +51,6 @@ int mip_port; unsigned long mip_addr, host_addr; -static int es7000_plat; - /* * Parse the OEM Table */ --- diff/arch/i386/math-emu/fpu_system.h 2003-10-09 09:47:16.000000000 +0100 +++ source/arch/i386/math-emu/fpu_system.h 2003-11-26 10:09:04.000000000 +0000 @@ -15,6 +15,7 @@ #include <linux/sched.h> #include <linux/kernel.h> #include <linux/mm.h> +#include <asm/atomic_kmap.h> /* This sets the pointer FPU_info to point to the argument part of the stack frame of math_emulate() */ @@ -22,7 +23,7 @@ /* s is always from a cpu register, and the cpu does bounds checking * during register load --> no further bounds checks needed */ -#define LDT_DESCRIPTOR(s) (((struct desc_struct *)current->mm->context.ldt)[(s) >> 3]) +#define LDT_DESCRIPTOR(s) (((struct desc_struct *)__kmap_atomic_vaddr(KM_LDT_PAGE0))[(s) >> 3]) #define SEG_D_SIZE(x) ((x).b & (3 << 21)) #define SEG_G_BIT(x) ((x).b & (1 << 23)) #define SEG_GRANULARITY(x) (((x).b & (1 << 23)) ? 4096 : 1) --- diff/arch/i386/mm/extable.c 2003-10-09 09:47:16.000000000 +0100 +++ source/arch/i386/mm/extable.c 2003-11-26 10:09:04.000000000 +0000 @@ -6,6 +6,52 @@ #include <linux/module.h> #include <linux/spinlock.h> #include <asm/uaccess.h> +#include <asm/pgtable.h> + +extern struct exception_table_entry __start___ex_table[]; +extern struct exception_table_entry __stop___ex_table[]; + +/* + * The exception table needs to be sorted because we use the macros + * which put things into the exception table in a variety of sections + * as well as the init section and the main kernel text section. + */ +static inline void +sort_ex_table(struct exception_table_entry *start, + struct exception_table_entry *finish) +{ + struct exception_table_entry el, *p, *q; + + /* insertion sort */ + for (p = start + 1; p < finish; ++p) { + /* start .. 
p-1 is sorted */ + if (p[0].insn < p[-1].insn) { + /* move element p down to its right place */ + el = *p; + q = p; + do { + /* el comes before q[-1], move q[-1] up one */ + q[0] = q[-1]; + --q; + } while (q > start && el.insn < q[-1].insn); + *q = el; + } + } +} + +void fixup_sort_exception_table(void) +{ + struct exception_table_entry *p; + + /* + * Fix up the trampoline exception addresses: + */ + for (p = __start___ex_table; p < __stop___ex_table; p++) { + p->insn = (unsigned long)(void *)p->insn; + p->fixup = (unsigned long)(void *)p->fixup; + } + sort_ex_table(__start___ex_table, __stop___ex_table); +} /* Simple binary search */ const struct exception_table_entry * @@ -15,13 +61,15 @@ { while (first <= last) { const struct exception_table_entry *mid; - long diff; mid = (last - first) / 2 + first; - diff = mid->insn - value; - if (diff == 0) + /* + * careful, the distance between entries can be + * larger than 2GB: + */ + if (mid->insn == value) return mid; - else if (diff < 0) + else if (mid->insn < value) first = mid+1; else last = mid-1; --- diff/arch/i386/mm/fault.c 2003-10-27 09:20:36.000000000 +0000 +++ source/arch/i386/mm/fault.c 2003-11-26 10:09:04.000000000 +0000 @@ -27,6 +27,7 @@ #include <asm/pgalloc.h> #include <asm/hardirq.h> #include <asm/desc.h> +#include <asm/tlbflush.h> extern void die(const char *,struct pt_regs *,long); @@ -104,8 +105,17 @@ if (seg & (1<<2)) { /* Must lock the LDT while reading it. */ down(¤t->mm->context.sem); +#if 1 + /* horrible hack for 4/4 disabled kernels. + I'm not quite sure what the TLB flush is good for, + it's mindlessly copied from the read_ldt code */ + __flush_tlb_global(); + desc = kmap(current->mm->context.ldt_pages[(seg&~7)/PAGE_SIZE]); + desc = (void *)desc + ((seg & ~7) % PAGE_SIZE); +#else desc = current->mm->context.ldt; desc = (void *)desc + (seg & ~7); +#endif } else { /* Must disable preemption while reading the GDT. */ desc = (u32 *)&cpu_gdt_table[get_cpu()]; @@ -118,6 +128,9 @@ (desc[1] & 0xff000000); if (seg & (1<<2)) { +#if 1 + kunmap((void *)((unsigned long)desc & PAGE_MASK)); +#endif up(¤t->mm->context.sem); } else put_cpu(); @@ -243,6 +256,19 @@ * (error_code & 4) == 0, and that the fault was not a * protection error (error_code & 1) == 0. */ +#ifdef CONFIG_X86_4G + /* + * On 4/4 all kernels faults are either bugs, vmalloc or prefetch + */ + if (unlikely((regs->xcs & 3) == 0)) { + if (error_code & 3) + goto bad_area_nosemaphore; + + /* If it's vm86 fall through */ + if (!(regs->eflags & VM_MASK)) + goto vmalloc_fault; + } +#else if (unlikely(address >= TASK_SIZE)) { if (!(error_code & 5)) goto vmalloc_fault; @@ -252,6 +278,7 @@ */ goto bad_area_nosemaphore; } +#endif mm = tsk->mm; @@ -402,6 +429,12 @@ * Oops. The kernel tried to access some bad page. We'll have to * terminate things with extreme prejudice. */ +#ifdef CONFIG_KGDB + if (!user_mode(regs)){ + kgdb_handle_exception(14,SIGBUS, error_code, regs); + return; + } +#endif bust_spinlocks(1); --- diff/arch/i386/mm/hugetlbpage.c 2003-11-25 15:24:57.000000000 +0000 +++ source/arch/i386/mm/hugetlbpage.c 2003-11-26 10:09:04.000000000 +0000 @@ -534,7 +534,7 @@ * this far. 
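The binary-search rewrite in extable.c above fixes a real subtlety: once trampoline addresses are folded in, two exception-table entries can lie more than 2GB apart, so the old signed test (diff = mid->insn - value) can overflow and steer the search the wrong way, while comparing the unsigned values directly cannot. A standalone demonstration (it exhibits the bug where long is 32 bits, as on i386):

    #include <stdio.h>

    int main(void)
    {
            unsigned long a = 0x10000000UL;         /* low address */
            unsigned long b = 0xf0000000UL;         /* > 2GB higher */
            long diff = a - b;                      /* old signed test */

            printf("signed diff claims a %s b\n", diff < 0 ? "<" : ">=");
            printf("unsigned compare:  a %s b\n", a < b ? "<" : ">=");
            return 0;
    }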
*/ static struct page *hugetlb_nopage(struct vm_area_struct *vma, - unsigned long address, int unused) + unsigned long address, int *unused) { BUG(); return NULL; --- diff/arch/i386/mm/init.c 2003-10-09 09:47:33.000000000 +0100 +++ source/arch/i386/mm/init.c 2003-11-26 10:09:04.000000000 +0000 @@ -26,6 +26,7 @@ #include <linux/bootmem.h> #include <linux/slab.h> #include <linux/proc_fs.h> +#include <linux/efi.h> #include <asm/processor.h> #include <asm/system.h> @@ -39,125 +40,13 @@ #include <asm/tlb.h> #include <asm/tlbflush.h> #include <asm/sections.h> +#include <asm/desc.h> DEFINE_PER_CPU(struct mmu_gather, mmu_gathers); unsigned long highstart_pfn, highend_pfn; static int do_test_wp_bit(void); -/* - * Creates a middle page table and puts a pointer to it in the - * given global directory entry. This only returns the gd entry - * in non-PAE compilation mode, since the middle layer is folded. - */ -static pmd_t * __init one_md_table_init(pgd_t *pgd) -{ - pmd_t *pmd_table; - -#ifdef CONFIG_X86_PAE - pmd_table = (pmd_t *) alloc_bootmem_low_pages(PAGE_SIZE); - set_pgd(pgd, __pgd(__pa(pmd_table) | _PAGE_PRESENT)); - if (pmd_table != pmd_offset(pgd, 0)) - BUG(); -#else - pmd_table = pmd_offset(pgd, 0); -#endif - - return pmd_table; -} - -/* - * Create a page table and place a pointer to it in a middle page - * directory entry. - */ -static pte_t * __init one_page_table_init(pmd_t *pmd) -{ - if (pmd_none(*pmd)) { - pte_t *page_table = (pte_t *) alloc_bootmem_low_pages(PAGE_SIZE); - set_pmd(pmd, __pmd(__pa(page_table) | _PAGE_TABLE)); - if (page_table != pte_offset_kernel(pmd, 0)) - BUG(); - - return page_table; - } - - return pte_offset_kernel(pmd, 0); -} - -/* - * This function initializes a certain range of kernel virtual memory - * with new bootmem page tables, everywhere page tables are missing in - * the given range. - */ - -/* - * NOTE: The pagetables are allocated contiguous on the physical space - * so we can cache the place of the first one and move around without - * checking the pgd every time. - */ -static void __init page_table_range_init (unsigned long start, unsigned long end, pgd_t *pgd_base) -{ - pgd_t *pgd; - pmd_t *pmd; - int pgd_idx, pmd_idx; - unsigned long vaddr; - - vaddr = start; - pgd_idx = pgd_index(vaddr); - pmd_idx = pmd_index(vaddr); - pgd = pgd_base + pgd_idx; - - for ( ; (pgd_idx < PTRS_PER_PGD) && (vaddr != end); pgd++, pgd_idx++) { - if (pgd_none(*pgd)) - one_md_table_init(pgd); - - pmd = pmd_offset(pgd, vaddr); - for (; (pmd_idx < PTRS_PER_PMD) && (vaddr != end); pmd++, pmd_idx++) { - if (pmd_none(*pmd)) - one_page_table_init(pmd); - - vaddr += PMD_SIZE; - } - pmd_idx = 0; - } -} - -/* - * This maps the physical memory to kernel virtual address space, a total - * of max_low_pfn pages, by creating page tables starting from address - * PAGE_OFFSET. - */ -static void __init kernel_physical_mapping_init(pgd_t *pgd_base) -{ - unsigned long pfn; - pgd_t *pgd; - pmd_t *pmd; - pte_t *pte; - int pgd_idx, pmd_idx, pte_ofs; - - pgd_idx = pgd_index(PAGE_OFFSET); - pgd = pgd_base + pgd_idx; - pfn = 0; - - for (; pgd_idx < PTRS_PER_PGD; pgd++, pgd_idx++) { - pmd = one_md_table_init(pgd); - if (pfn >= max_low_pfn) - continue; - for (pmd_idx = 0; pmd_idx < PTRS_PER_PMD && pfn < max_low_pfn; pmd++, pmd_idx++) { - /* Map with big pages if possible, otherwise create normal page tables. 
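Both the kernel_physical_mapping_init() deleted above and the setup_identity_mappings() that replaces it later in this diff use the same large-page shortcut, as the PSE branch just below shows: when the CPU has PSE, one pmd entry maps PTRS_PER_PTE small pages (4MB) in one go and no pte page is allocated at all. A runnable sketch of that coverage arithmetic (the lowmem size is illustrative, non-PAE layout assumed):

#include <stdio.h>

#define PAGE_SHIFT	12
#define PTRS_PER_PTE	1024

int main(void)
{
	unsigned long pfn = 0, pmds = 0;
	unsigned long max_low_pfn = 0x38000;	/* 896MB of lowmem */

	/* with PSE, each pmd slot advances the pfn by a whole pte page's worth */
	while (pfn < max_low_pfn) {
		pfn += PTRS_PER_PTE;
		pmds++;
	}
	printf("%lu pmd entries, no pte pages, cover %lu MB\n",
	       pmds, max_low_pfn >> (20 - PAGE_SHIFT));
	return 0;
}

With 4K pages this prints 224 pmd entries for 896MB, which is why the PSE path also skips the one_page_table_init() bootmem allocations.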
*/ - if (cpu_has_pse) { - set_pmd(pmd, pfn_pmd(pfn, PAGE_KERNEL_LARGE)); - pfn += PTRS_PER_PTE; - } else { - pte = one_page_table_init(pmd); - - for (pte_ofs = 0; pte_ofs < PTRS_PER_PTE && pfn < max_low_pfn; pte++, pfn++, pte_ofs++) - set_pte(pte, pfn_pte(pfn, PAGE_KERNEL)); - } - } - } -} - static inline int page_kills_ppro(unsigned long pagenr) { if (pagenr >= 0x70000 && pagenr <= 0x7003F) @@ -165,12 +54,30 @@ return 0; } +extern int is_available_memory(efi_memory_desc_t *); + static inline int page_is_ram(unsigned long pagenr) { int i; + unsigned long addr, end; + + if (efi_enabled) { + efi_memory_desc_t *md; + + for (i = 0; i < memmap.nr_map; i++) { + md = &memmap.map[i]; + if (!is_available_memory(md)) + continue; + addr = (md->phys_addr+PAGE_SIZE-1) >> PAGE_SHIFT; + end = (md->phys_addr + (md->num_pages << EFI_PAGE_SHIFT)) >> PAGE_SHIFT; + + if ((pagenr >= addr) && (pagenr < end)) + return 1; + } + return 0; + } for (i = 0; i < e820.nr_map; i++) { - unsigned long addr, end; if (e820.map[i].type != E820_RAM) /* not usable memory */ continue; @@ -187,11 +94,8 @@ return 0; } -#ifdef CONFIG_HIGHMEM pte_t *kmap_pte; -pgprot_t kmap_prot; -EXPORT_SYMBOL(kmap_prot); EXPORT_SYMBOL(kmap_pte); #define kmap_get_fixmap_pte(vaddr) \ @@ -199,29 +103,7 @@ void __init kmap_init(void) { - unsigned long kmap_vstart; - - /* cache the first kmap pte */ - kmap_vstart = __fix_to_virt(FIX_KMAP_BEGIN); - kmap_pte = kmap_get_fixmap_pte(kmap_vstart); - - kmap_prot = PAGE_KERNEL; -} - -void __init permanent_kmaps_init(pgd_t *pgd_base) -{ - pgd_t *pgd; - pmd_t *pmd; - pte_t *pte; - unsigned long vaddr; - - vaddr = PKMAP_BASE; - page_table_range_init(vaddr, vaddr + PAGE_SIZE*LAST_PKMAP, pgd_base); - - pgd = swapper_pg_dir + pgd_index(vaddr); - pmd = pmd_offset(pgd, vaddr); - pte = pte_offset_kernel(pmd, vaddr); - pkmap_page_table = pte; + kmap_pte = kmap_get_fixmap_pte(__fix_to_virt(FIX_KMAP_BEGIN)); } void __init one_highpage_init(struct page *page, int pfn, int bad_ppro) @@ -236,6 +118,8 @@ SetPageReserved(page); } +#ifdef CONFIG_HIGHMEM + #ifndef CONFIG_DISCONTIGMEM void __init set_highmem_pages_init(int bad_ppro) { @@ -247,12 +131,9 @@ #else extern void set_highmem_pages_init(int); #endif /* !CONFIG_DISCONTIGMEM */ - #else -#define kmap_init() do { } while (0) -#define permanent_kmaps_init(pgd_base) do { } while (0) -#define set_highmem_pages_init(bad_ppro) do { } while (0) -#endif /* CONFIG_HIGHMEM */ +# define set_highmem_pages_init(bad_ppro) do { } while (0) +#endif unsigned long __PAGE_KERNEL = _PAGE_KERNEL; @@ -262,30 +143,125 @@ extern void __init remap_numa_kva(void); #endif -static void __init pagetable_init (void) +static __init void prepare_pagetables(pgd_t *pgd_base, unsigned long address) +{ + pgd_t *pgd; + pmd_t *pmd; + pte_t *pte; + + pgd = pgd_base + pgd_index(address); + pmd = pmd_offset(pgd, address); + if (!pmd_present(*pmd)) { + pte = (pte_t *) alloc_bootmem_low_pages(PAGE_SIZE); + set_pmd(pmd, __pmd(_PAGE_TABLE + __pa(pte))); + } +} + +static void __init fixrange_init (unsigned long start, unsigned long end, pgd_t *pgd_base) +{ + unsigned long vaddr; + + for (vaddr = start; vaddr != end; vaddr += PAGE_SIZE) + prepare_pagetables(pgd_base, vaddr); +} + +void setup_identity_mappings(pgd_t *pgd_base, unsigned long start, unsigned long end) { unsigned long vaddr; - pgd_t *pgd_base = swapper_pg_dir; + pgd_t *pgd; + int i, j, k; + pmd_t *pmd; + pte_t *pte, *pte_base; + + pgd = pgd_base; + + for (i = 0; i < PTRS_PER_PGD; pgd++, i++) { + vaddr = i*PGDIR_SIZE; + if (end && (vaddr >= end)) + break; + 
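		/*
		 * Aside on the EFI-aware page_is_ram() above: each EFI
		 * descriptor is clipped to whole pages before the pfn is
		 * tested. Worked example with made-up values and
		 * PAGE_SHIFT == EFI_PAGE_SHIFT == 12:
		 *
		 *   phys_addr = 0x9f800, num_pages = 16
		 *   first = (0x9f800 + 0xfff)   >> 12 = 0xa0  (rounded up)
		 *   end   = (0x9f800 + 0x10000) >> 12 = 0xaf  (truncated)
		 *
		 * so only pfns 0xa0..0xae are treated as RAM and the partial
		 * pages at both edges of the descriptor are excluded.
		 */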
pmd = pmd_offset(pgd, 0); + for (j = 0; j < PTRS_PER_PMD; pmd++, j++) { + vaddr = i*PGDIR_SIZE + j*PMD_SIZE; + if (end && (vaddr >= end)) + break; + if (vaddr < start) + continue; + if (cpu_has_pse) { + unsigned long __pe; + + set_in_cr4(X86_CR4_PSE); + boot_cpu_data.wp_works_ok = 1; + __pe = _KERNPG_TABLE + _PAGE_PSE + vaddr - start; + /* Make it "global" too if supported */ + if (cpu_has_pge) { + set_in_cr4(X86_CR4_PGE); +#if !defined(CONFIG_X86_SWITCH_PAGETABLES) + __pe += _PAGE_GLOBAL; + __PAGE_KERNEL |= _PAGE_GLOBAL; +#endif + } + set_pmd(pmd, __pmd(__pe)); + continue; + } + if (!pmd_present(*pmd)) + pte_base = (pte_t *) alloc_bootmem_low_pages(PAGE_SIZE); + else + pte_base = (pte_t *) page_address(pmd_page(*pmd)); + pte = pte_base; + for (k = 0; k < PTRS_PER_PTE; pte++, k++) { + vaddr = i*PGDIR_SIZE + j*PMD_SIZE + k*PAGE_SIZE; + if (end && (vaddr >= end)) + break; + if (vaddr < start) + continue; + *pte = mk_pte_phys(vaddr-start, PAGE_KERNEL); + } + set_pmd(pmd, __pmd(_KERNPG_TABLE + __pa(pte_base))); + } + } +} +static void __init pagetable_init (void) +{ + unsigned long vaddr, end; + pgd_t *pgd_base; #ifdef CONFIG_X86_PAE int i; - /* Init entries of the first-level page table to the zero page */ - for (i = 0; i < PTRS_PER_PGD; i++) - set_pgd(pgd_base + i, __pgd(__pa(empty_zero_page) | _PAGE_PRESENT)); #endif - /* Enable PSE if available */ - if (cpu_has_pse) { - set_in_cr4(X86_CR4_PSE); - } + /* + * This can be zero as well - no problem, in that case we exit + * the loops anyway due to the PTRS_PER_* conditions. + */ + end = (unsigned long)__va(max_low_pfn*PAGE_SIZE); - /* Enable PGE if available */ - if (cpu_has_pge) { - set_in_cr4(X86_CR4_PGE); - __PAGE_KERNEL |= _PAGE_GLOBAL; + pgd_base = swapper_pg_dir; +#ifdef CONFIG_X86_PAE + /* + * It causes too many problems if there's no proper pmd set up + * for all 4 entries of the PGD - so we allocate all of them. + * PAE systems will not miss this extra 4-8K anyway ... + */ + for (i = 0; i < PTRS_PER_PGD; i++) { + pmd_t *pmd = (pmd_t *) alloc_bootmem_low_pages(PAGE_SIZE); + set_pgd(pgd_base + i, __pgd(__pa(pmd) + 0x1)); } +#endif + /* + * Set up lowmem-sized identity mappings at PAGE_OFFSET: + */ + setup_identity_mappings(pgd_base, PAGE_OFFSET, end); - kernel_physical_mapping_init(pgd_base); + /* + * Add flat-mode identity-mappings - SMP needs it when + * starting up on an AP from real-mode. (In the non-PAE + * case we already have these mappings through head.S.) + * All user-space mappings are explicitly cleared after + * SMP startup. + */ +#if CONFIG_SMP && CONFIG_X86_PAE + setup_identity_mappings(pgd_base, 0, 16*1024*1024); +#endif remap_numa_kva(); /* @@ -293,38 +269,64 @@ * created - mappings will be set by set_fixmap(): */ vaddr = __fix_to_virt(__end_of_fixed_addresses - 1) & PMD_MASK; - page_table_range_init(vaddr, 0, pgd_base); + fixrange_init(vaddr, 0, pgd_base); - permanent_kmaps_init(pgd_base); +#if CONFIG_HIGHMEM + { + pgd_t *pgd; + pmd_t *pmd; + pte_t *pte; -#ifdef CONFIG_X86_PAE - /* - * Add low memory identity-mappings - SMP needs it when - * starting up on an AP from real-mode. In the non-PAE - * case we already have these mappings through head.S. - * All user-space mappings are explicitly cleared after - * SMP startup. 
- */ - pgd_base[0] = pgd_base[USER_PTRS_PER_PGD]; + /* + * Permanent kmaps: + */ + vaddr = PKMAP_BASE; + fixrange_init(vaddr, vaddr + PAGE_SIZE*LAST_PKMAP, pgd_base); + + pgd = swapper_pg_dir + pgd_index(vaddr); + pmd = pmd_offset(pgd, vaddr); + pte = pte_offset_kernel(pmd, vaddr); + pkmap_page_table = pte; + } #endif } -void zap_low_mappings (void) +/* + * Clear kernel pagetables in a PMD_SIZE-aligned range. + */ +static void clear_mappings(pgd_t *pgd_base, unsigned long start, unsigned long end) { - int i; + unsigned long vaddr; + pgd_t *pgd; + pmd_t *pmd; + int i, j; + + pgd = pgd_base; + + for (i = 0; i < PTRS_PER_PGD; pgd++, i++) { + vaddr = i*PGDIR_SIZE; + if (end && (vaddr >= end)) + break; + pmd = pmd_offset(pgd, 0); + for (j = 0; j < PTRS_PER_PMD; pmd++, j++) { + vaddr = i*PGDIR_SIZE + j*PMD_SIZE; + if (end && (vaddr >= end)) + break; + if (vaddr < start) + continue; + pmd_clear(pmd); + } + } + flush_tlb_all(); +} + +void __init zap_low_mappings(void) +{ + printk("zapping low mappings.\n"); /* * Zap initial low-memory mappings. - * - * Note that "pgd_clear()" doesn't do it for - * us, because pgd_clear() is a no-op on i386. */ - for (i = 0; i < USER_PTRS_PER_PGD; i++) -#ifdef CONFIG_X86_PAE - set_pgd(swapper_pg_dir+i, __pgd(1 + __pa(empty_zero_page))); -#else - set_pgd(swapper_pg_dir+i, __pgd(0)); -#endif - flush_tlb_all(); + clear_mappings(swapper_pg_dir, 0, 16*1024*1024); } #ifndef CONFIG_DISCONTIGMEM @@ -388,12 +390,6 @@ void __init test_wp_bit(void) { - if (cpu_has_pse) { - /* Ok, all PSE-capable CPUs are definitely handling the WP bit right. */ - boot_cpu_data.wp_works_ok = 1; - return; - } - printk("Checking if this processor honours the WP bit even in supervisor mode... "); /* Any page-aligned address will do, the test is non-destructive */ @@ -428,6 +424,7 @@ #endif /* !CONFIG_DISCONTIGMEM */ static struct kcore_list kcore_mem, kcore_vmalloc; +extern void fixup_sort_exception_table(void); void __init mem_init(void) { @@ -436,6 +433,8 @@ int tmp; int bad_ppro; + fixup_sort_exception_table(); + #ifndef CONFIG_DISCONTIGMEM if (!mem_map) BUG(); @@ -511,13 +510,18 @@ #ifndef CONFIG_SMP zap_low_mappings(); #endif + entry_trampoline_setup(); + default_ldt_page = virt_to_page(default_ldt); + load_LDT(&init_mm.context); } -kmem_cache_t *pgd_cache; -kmem_cache_t *pmd_cache; +kmem_cache_t *pgd_cache, *pmd_cache, *kpmd_cache; void __init pgtable_cache_init(void) { + void (*ctor)(void *, kmem_cache_t *, unsigned long); + void (*dtor)(void *, kmem_cache_t *, unsigned long); + if (PTRS_PER_PMD > 1) { pmd_cache = kmem_cache_create("pmd", PTRS_PER_PMD*sizeof(pmd_t), @@ -527,13 +531,36 @@ NULL); if (!pmd_cache) panic("pgtable_cache_init(): cannot create pmd cache"); + + if (TASK_SIZE > PAGE_OFFSET) { + kpmd_cache = kmem_cache_create("kpmd", + PTRS_PER_PMD*sizeof(pmd_t), + 0, + SLAB_HWCACHE_ALIGN | SLAB_MUST_HWCACHE_ALIGN, + kpmd_ctor, + NULL); + if (!kpmd_cache) + panic("pgtable_cache_init(): " + "cannot create kpmd cache"); + } } + + if (PTRS_PER_PMD == 1 || TASK_SIZE <= PAGE_OFFSET) + ctor = pgd_ctor; + else + ctor = NULL; + + if (PTRS_PER_PMD == 1 && TASK_SIZE <= PAGE_OFFSET) + dtor = pgd_dtor; + else + dtor = NULL; + pgd_cache = kmem_cache_create("pgd", PTRS_PER_PGD*sizeof(pgd_t), 0, SLAB_HWCACHE_ALIGN | SLAB_MUST_HWCACHE_ALIGN, - pgd_ctor, - PTRS_PER_PMD == 1 ? 
pgd_dtor : NULL); + ctor, + dtor); if (!pgd_cache) panic("pgtable_cache_init(): Cannot create pgd cache"); } --- diff/arch/i386/mm/pgtable.c 2003-10-09 09:47:16.000000000 +0100 +++ source/arch/i386/mm/pgtable.c 2003-11-26 10:09:04.000000000 +0000 @@ -21,6 +21,7 @@ #include <asm/e820.h> #include <asm/tlb.h> #include <asm/tlbflush.h> +#include <asm/atomic_kmap.h> void show_mem(void) { @@ -157,11 +158,20 @@ memset(pmd, 0, PTRS_PER_PMD*sizeof(pmd_t)); } +void kpmd_ctor(void *__pmd, kmem_cache_t *cache, unsigned long flags) +{ + pmd_t *kpmd, *pmd; + kpmd = pmd_offset(&swapper_pg_dir[PTRS_PER_PGD-1], + (PTRS_PER_PMD - NR_SHARED_PMDS)*PMD_SIZE); + pmd = (pmd_t *)__pmd + (PTRS_PER_PMD - NR_SHARED_PMDS); + + memset(__pmd, 0, (PTRS_PER_PMD - NR_SHARED_PMDS)*sizeof(pmd_t)); + memcpy(pmd, kpmd, NR_SHARED_PMDS*sizeof(pmd_t)); +} + /* - * List of all pgd's needed for non-PAE so it can invalidate entries - * in both cached and uncached pgd's; not needed for PAE since the - * kernel pmd is shared. If PAE were not to share the pmd a similar - * tactic would be needed. This is essentially codepath-based locking + * List of all pgd's needed so it can invalidate entries in both cached + * and uncached pgd's. This is essentially codepath-based locking * against pageattr.c; it is the unique case in which a valid change * of kernel pagetables can't be lazily synchronized by vmalloc faults. * vmalloc faults work because attached pagetables are never freed. @@ -170,30 +180,60 @@ * could be used. The locking scheme was chosen on the basis of * manfred's recommendations and having no core impact whatsoever. * -- wli + * + * The entire issue goes away when XKVA is configured. */ spinlock_t pgd_lock = SPIN_LOCK_UNLOCKED; LIST_HEAD(pgd_list); -void pgd_ctor(void *pgd, kmem_cache_t *cache, unsigned long unused) +/* + * This is not that hard to figure out. + * (a) PTRS_PER_PMD == 1 means non-PAE. + * (b) PTRS_PER_PMD > 1 means PAE. + * (c) TASK_SIZE > PAGE_OFFSET means XKVA. + * (d) TASK_SIZE <= PAGE_OFFSET means non-XKVA. + * + * Do *NOT* back out the preconstruction like the patch I'm cleaning + * up after this very instant did, or at all, for that matter. + * This is never called when PTRS_PER_PMD > 1 && TASK_SIZE > PAGE_OFFSET. 
+ * -- wli + */ +void pgd_ctor(void *__pgd, kmem_cache_t *cache, unsigned long unused) { + pgd_t *pgd = (pgd_t *)__pgd; unsigned long flags; - if (PTRS_PER_PMD == 1) - spin_lock_irqsave(&pgd_lock, flags); + if (PTRS_PER_PMD == 1) { + if (TASK_SIZE <= PAGE_OFFSET) + spin_lock_irqsave(&pgd_lock, flags); + else + memcpy(&pgd[PTRS_PER_PGD - NR_SHARED_PMDS], + &swapper_pg_dir[PTRS_PER_PGD - NR_SHARED_PMDS], + NR_SHARED_PMDS * sizeof(pgd_t)); + } - memcpy((pgd_t *)pgd + USER_PTRS_PER_PGD, - swapper_pg_dir + USER_PTRS_PER_PGD, - (PTRS_PER_PGD - USER_PTRS_PER_PGD) * sizeof(pgd_t)); + if (TASK_SIZE <= PAGE_OFFSET) + memcpy(pgd + USER_PTRS_PER_PGD, + swapper_pg_dir + USER_PTRS_PER_PGD, + (PTRS_PER_PGD - USER_PTRS_PER_PGD) * sizeof(pgd_t)); if (PTRS_PER_PMD > 1) return; - list_add(&virt_to_page(pgd)->lru, &pgd_list); - spin_unlock_irqrestore(&pgd_lock, flags); - memset(pgd, 0, USER_PTRS_PER_PGD*sizeof(pgd_t)); + if (TASK_SIZE > PAGE_OFFSET) + memset(pgd, 0, (PTRS_PER_PGD - NR_SHARED_PMDS)*sizeof(pgd_t)); + else { + list_add(&virt_to_page(pgd)->lru, &pgd_list); + spin_unlock_irqrestore(&pgd_lock, flags); + memset(pgd, 0, USER_PTRS_PER_PGD*sizeof(pgd_t)); + } } -/* never called when PTRS_PER_PMD > 1 */ +/* + * Never called when PTRS_PER_PMD > 1 || TASK_SIZE > PAGE_OFFSET + * for with PAE we would list_del() multiple times, and for non-PAE + * with XKVA all the AGP pgd shootdown code is unnecessary. + */ void pgd_dtor(void *pgd, kmem_cache_t *cache, unsigned long unused) { unsigned long flags; /* can be called from interrupt context */ @@ -203,6 +243,12 @@ spin_unlock_irqrestore(&pgd_lock, flags); } +/* + * See the comments above pgd_ctor() wrt. preconstruction. + * Do *NOT* memcpy() here. If you do, you back out important + * anti-cache pollution code. + * + */ pgd_t *pgd_alloc(struct mm_struct *mm) { int i; @@ -211,15 +257,33 @@ if (PTRS_PER_PMD == 1 || !pgd) return pgd; + /* + * In the 4G userspace case alias the top 16 MB virtual + * memory range into the user mappings as well (these + * include the trampoline and CPU data structures). + */ for (i = 0; i < USER_PTRS_PER_PGD; ++i) { - pmd_t *pmd = kmem_cache_alloc(pmd_cache, GFP_KERNEL); + kmem_cache_t *cache; + pmd_t *pmd; + + if (TASK_SIZE > PAGE_OFFSET && i == USER_PTRS_PER_PGD - 1) + cache = kpmd_cache; + else + cache = pmd_cache; + + pmd = kmem_cache_alloc(cache, GFP_KERNEL); if (!pmd) goto out_oom; set_pgd(&pgd[i], __pgd(1 + __pa((u64)((u32)pmd)))); } - return pgd; + return pgd; out_oom: + /* + * we don't have to handle the kpmd_cache here, since it's the + * last allocation, and has either nothing to free or when it + * succeeds the whole operation succeeds. + */ for (i--; i >= 0; i--) kmem_cache_free(pmd_cache, (void *)__va(pgd_val(pgd[i])-1)); kmem_cache_free(pgd_cache, pgd); @@ -230,10 +294,29 @@ { int i; - /* in the PAE case user pgd entries are overwritten before usage */ - if (PTRS_PER_PMD > 1) - for (i = 0; i < USER_PTRS_PER_PGD; ++i) - kmem_cache_free(pmd_cache, (void *)__va(pgd_val(pgd[i])-1)); /* in the non-PAE case, clear_page_tables() clears user pgd entries */ + if (PTRS_PER_PMD == 1) + goto out_free; + + /* in the PAE case user pgd entries are overwritten before usage */ + for (i = 0; i < USER_PTRS_PER_PGD; ++i) { + kmem_cache_t *cache; + pmd_t *pmd = __va(pgd_val(pgd[i]) - 1); + + /* + * only userspace pmd's are cleared for us + * by mm/memory.c; it's a slab cache invariant + * that we must separate the kernel pmd slab + * at all times, else we'll have bad pmd's.
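The four cases (a)-(d) listed above pgd_ctor() are what drive the ctor/dtor selection in pgtable_cache_init() earlier in this patch. A self-contained restatement, reconstructed from the patch's two conditionals (pick() is a made-up name; PAE means PTRS_PER_PMD > 1, XKVA means TASK_SIZE > PAGE_OFFSET):

#include <stdio.h>

struct callbacks { const char *ctor, *dtor; };

static struct callbacks pick(int pae, int xkva)
{
	struct callbacks cb = { "NULL", "NULL" };

	if (!pae || !xkva)	/* every case except PAE+XKVA preconstructs */
		cb.ctor = "pgd_ctor";
	if (!pae && !xkva)	/* only plain non-PAE maintains pgd_list */
		cb.dtor = "pgd_dtor";
	return cb;
}

int main(void)
{
	int pae, xkva;

	for (pae = 0; pae <= 1; pae++)
		for (xkva = 0; xkva <= 1; xkva++) {
			struct callbacks cb = pick(pae, xkva);
			printf("PAE=%d XKVA=%d -> ctor=%s dtor=%s\n",
			       pae, xkva, cb.ctor, cb.dtor);
		}
	return 0;
}

The PAE+XKVA corner gets neither callback, matching the comment's note that pgd_ctor() is never called there.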
+ */ + if (TASK_SIZE > PAGE_OFFSET && i == USER_PTRS_PER_PGD - 1) + cache = kpmd_cache; + else + cache = pmd_cache; + + kmem_cache_free(cache, pmd); + } +out_free: kmem_cache_free(pgd_cache, pgd); } + --- diff/arch/i386/pci/acpi.c 2003-10-09 09:47:16.000000000 +0100 +++ source/arch/i386/pci/acpi.c 2003-11-26 10:09:04.000000000 +0000 @@ -15,10 +15,11 @@ static int __init pci_acpi_init(void) { + extern int acpi_disabled; if (pcibios_scanned) return 0; - if (!(pci_probe & PCI_NO_ACPI_ROUTING)) { + if (!(pci_probe & PCI_NO_ACPI_ROUTING) && !acpi_disabled) { if (!acpi_pci_irq_init()) { printk(KERN_INFO "PCI: Using ACPI for IRQ routing\n"); printk(KERN_INFO "PCI: if you experience problems, try using option 'pci=noacpi' or even 'acpi=off'\n"); --- diff/arch/i386/pci/fixup.c 2003-08-20 14:16:24.000000000 +0100 +++ source/arch/i386/pci/fixup.c 2003-11-26 10:09:04.000000000 +0000 @@ -6,27 +6,52 @@ #include <linux/init.h> #include "pci.h" +static void __devinit i450nx_scan_bus(struct pci_bus *parent, u8 busnr) +{ + struct list_head *tmp; + + pci_scan_bus(busnr, &pci_root_ops, NULL); + + list_for_each(tmp, &parent->children) { + u8 childnr; + struct pci_dev *dev = pci_dev_b(tmp); + + if (dev->hdr_type != PCI_HEADER_TYPE_BRIDGE) + continue; + pci_read_config_byte(dev, PCI_PRIMARY_BUS, &childnr); + if (childnr != busnr) + continue; + + printk(KERN_WARNING "PCI: Removing fake PCI bridge %s\n", + pci_name(dev)); + pci_remove_bus_device(dev); + break; + } +} static void __devinit pci_fixup_i450nx(struct pci_dev *d) { /* * i450NX -- Find and scan all secondary buses on all PXB's. + * Some manufacturers added fake PCI-PCI bridges that also point + * to the peer busses. Look for them and delete them. */ int pxb, reg; u8 busno, suba, subb; - printk(KERN_WARNING "PCI: Searching for i450NX host bridges on %s\n", pci_name(d)); + printk(KERN_NOTICE "PCI: Searching for i450NX host bridges on %s\n", pci_name(d)); reg = 0xd0; - for(pxb=0; pxb<2; pxb++) { + for (pxb = 0; pxb < 2; pxb++) { pci_read_config_byte(d, reg++, &busno); pci_read_config_byte(d, reg++, &suba); pci_read_config_byte(d, reg++, &subb); DBG("i450NX PXB %d: %02x/%02x/%02x\n", pxb, busno, suba, subb); if (busno) - pci_scan_bus(busno, &pci_root_ops, NULL); /* Bus A */ + i450nx_scan_bus(d->bus, busno); /* Bus A */ if (suba < subb) - pci_scan_bus(suba+1, &pci_root_ops, NULL); /* Bus B */ + i450nx_scan_bus(d->bus, suba+1); /* Bus B */ } + pcibios_last_bus = -1; } --- diff/arch/i386/pci/irq.c 2003-11-25 15:24:57.000000000 +0000 +++ source/arch/i386/pci/irq.c 2003-11-26 10:09:04.000000000 +0000 @@ -455,7 +455,10 @@ #if 0 /* Let's see what chip this is supposed to be ... */ /* We must not touch 440GX even if we have tables. 
440GX has different IRQ routing weirdness */ - if (pci_find_device(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_82440GX, NULL)) + if ( pci_find_device(PCI_VENDOR_ID_INTEL, + PCI_DEVICE_ID_INTEL_82443GX_0, NULL) || + pci_find_device(PCI_VENDOR_ID_INTEL, + PCI_DEVICE_ID_INTEL_82443GX_2, NULL)) return 0; #endif @@ -695,9 +698,10 @@ return NULL; } -static irqreturn_t pcibios_test_irq_handler(int irq, void *dev_id, struct pt_regs *regs) +static irqreturn_t +pcibios_test_irq_handler(int irq, void *dev_id, struct pt_regs *regs) { - return IRQ_NONE; + return IRQ_HANDLED; } static int pcibios_lookup_irq(struct pci_dev *dev, int assign) @@ -813,8 +817,10 @@ if ( dev2->irq && dev2->irq != irq && \ (!(pci_probe & PCI_USE_PIRQ_MASK) || \ ((1 << dev2->irq) & mask)) ) { +#ifndef CONFIG_PCI_USE_VECTOR printk(KERN_INFO "IRQ routing conflict for %s, have irq %d, want irq %d\n", pci_name(dev2), dev2->irq, irq); +#endif continue; } dev2->irq = irq; @@ -878,6 +884,10 @@ bridge->bus->number, PCI_SLOT(bridge->devfn), pin, irq); } if (irq >= 0) { + if (use_pci_vector() && + !platform_legacy_irq(irq)) + irq = IO_APIC_VECTOR(irq); + printk(KERN_INFO "PCI->APIC IRQ transform: (B%d,I%d,P%d) -> %d\n", dev->bus->number, PCI_SLOT(dev->devfn), pin, irq); dev->irq = irq; --- diff/arch/ia64/Kconfig 2003-10-27 09:20:36.000000000 +0000 +++ source/arch/ia64/Kconfig 2003-11-26 10:09:04.000000000 +0000 @@ -164,11 +164,6 @@ The ACPI Sourceforge project may also be of interest: <http://sf.net/projects/acpi/> -config ACPI_EFI - bool - depends on !IA64_HP_SIM - default y - config ACPI_INTERPRETER bool depends on !IA64_HP_SIM @@ -404,6 +399,11 @@ To use this option, you have to ensure that the "/proc file system support" (CONFIG_PROC_FS) is enabled, too. +config EFI + bool + depends on !IA64_HP_SIM + default y + config EFI_VARS tristate "/proc/efi/vars support" help --- diff/arch/ia64/defconfig 2003-07-08 09:55:17.000000000 +0100 +++ source/arch/ia64/defconfig 2003-11-26 10:09:04.000000000 +0000 @@ -48,7 +48,6 @@ CONFIG_IA64_PAGE_SIZE_16KB=y # CONFIG_IA64_PAGE_SIZE_64KB is not set CONFIG_ACPI=y -CONFIG_ACPI_EFI=y CONFIG_ACPI_INTERPRETER=y CONFIG_ACPI_KERNEL_CONFIG=y CONFIG_IA64_L1_CACHE_SHIFT=7 @@ -76,6 +75,7 @@ CONFIG_COMPAT=y CONFIG_PERFMON=y CONFIG_IA64_PALINFO=y +CONFIG_EFI=y CONFIG_EFI_VARS=y CONFIG_NR_CPUS=16 CONFIG_BINFMT_ELF=y --- diff/arch/ia64/ia32/binfmt_elf32.c 2003-10-27 09:20:43.000000000 +0000 +++ source/arch/ia64/ia32/binfmt_elf32.c 2003-11-26 10:09:04.000000000 +0000 @@ -60,10 +60,12 @@ extern unsigned long *ia32_gdt; struct page * -ia32_install_shared_page (struct vm_area_struct *vma, unsigned long address, int no_share) +ia32_install_shared_page (struct vm_area_struct *vma, unsigned long address, int *type) { struct page *pg = ia32_shared_page[smp_processor_id()]; get_page(pg); + if (type) + *type = VM_FAULT_MINOR; return pg; } --- diff/arch/ia64/kernel/irq.c 2003-11-25 15:24:57.000000000 +0000 +++ source/arch/ia64/kernel/irq.c 2003-11-26 10:09:04.000000000 +0000 @@ -160,18 +160,20 @@ int show_interrupts(struct seq_file *p, void *v) { - int i, j; + int j, i = *(int *) v; struct irqaction * action; irq_desc_t *idesc; unsigned long flags; - seq_puts(p, " "); - for (j=0; j<NR_CPUS; j++) - if (cpu_online(j)) - seq_printf(p, "CPU%d ",j); - seq_putc(p, '\n'); + if (i == 0) { + seq_puts(p, " "); + for (j=0; j<NR_CPUS; j++) + if (cpu_online(j)) + seq_printf(p, "CPU%d ",j); + seq_putc(p, '\n'); + } - for (i = 0 ; i < NR_IRQS ; i++) { + if (i < NR_IRQS) { idesc = irq_descp(i); spin_lock_irqsave(&idesc->lock, flags); action = 
idesc->action; @@ -194,25 +196,26 @@ seq_putc(p, '\n'); skip: spin_unlock_irqrestore(&idesc->lock, flags); - } - seq_puts(p, "NMI: "); - for (j = 0; j < NR_CPUS; j++) - if (cpu_online(j)) - seq_printf(p, "%10u ", nmi_count(j)); - seq_putc(p, '\n'); + } else if (i == NR_IRQS) { + seq_puts(p, "NMI: "); + for (j = 0; j < NR_CPUS; j++) + if (cpu_online(j)) + seq_printf(p, "%10u ", nmi_count(j)); + seq_putc(p, '\n'); #ifdef CONFIG_X86_LOCAL_APIC - seq_puts(p, "LOC: "); - for (j = 0; j < NR_CPUS; j++) - if (cpu_online(j)) - seq_printf(p, "%10u ", irq_stat[j].apic_timer_irqs); - seq_putc(p, '\n'); + seq_puts(p, "LOC: "); + for (j = 0; j < NR_CPUS; j++) + if (cpu_online(j)) + seq_printf(p, "%10u ", irq_stat[j].apic_timer_irqs); + seq_putc(p, '\n'); #endif - seq_printf(p, "ERR: %10u\n", atomic_read(&irq_err_count)); + seq_printf(p, "ERR: %10u\n", atomic_read(&irq_err_count)); #ifdef CONFIG_X86_IO_APIC #ifdef APIC_MISMATCH_DEBUG - seq_printf(p, "MIS: %10u\n", atomic_read(&irq_mis_count)); + seq_printf(p, "MIS: %10u\n", atomic_read(&irq_mis_count)); #endif #endif + } return 0; } @@ -974,19 +977,13 @@ static int irq_affinity_read_proc (char *page, char **start, off_t off, int count, int *eof, void *data) { - int k, len; - cpumask_t tmp = irq_affinity[(long)data]; + int len; if (count < HEX_DIGITS+1) return -EINVAL; - len = 0; - for (k = 0; k < sizeof(cpumask_t)/sizeof(u16); ++k) { - int j = sprintf(page, "%04hx", (u16)cpus_coerce(tmp)); - len += j; - page += j; - cpus_shift_right(tmp, tmp, 16); - } + len = format_cpumask(page, irq_affinity[(long)data]); + page += len; len += sprintf(page, "\n"); return len; } @@ -1034,17 +1031,13 @@ int count, int *eof, void *data) { cpumask_t *mask = (cpumask_t *)data; - int k, len = 0; + int len; if (count < HEX_DIGITS+1) return -EINVAL; - for (k = 0; k < sizeof(cpumask_t)/sizeof(u16); ++k) { - int j = sprintf(page, "%04hx", (u16)cpus_coerce(*mask)); - len += j; - page += j; - cpus_shift_right(*mask, *mask, 16); - } + len = format_cpumask(page, *mask); + page += len; len += sprintf(page, "\n"); return len; } --- diff/arch/ia64/kernel/perfmon.c 2003-11-25 15:24:57.000000000 +0000 +++ source/arch/ia64/kernel/perfmon.c 2003-11-26 10:09:04.000000000 +0000 @@ -2157,6 +2157,7 @@ d_add(file->f_dentry, inode); file->f_vfsmnt = mntget(pfmfs_mnt); + file->f_mapping = inode->i_mapping; file->f_op = &pfm_file_ops; file->f_mode = FMODE_READ; --- diff/arch/ia64/kernel/setup.c 2003-10-27 09:20:43.000000000 +0000 +++ source/arch/ia64/kernel/setup.c 2003-11-26 10:09:04.000000000 +0000 @@ -54,6 +54,10 @@ # error "struct cpuinfo_ia64 too big!" #endif +#ifdef CONFIG_EFI +int efi_enabled = 1; +#endif + #ifdef CONFIG_SMP unsigned long __per_cpu_offset[NR_CPUS]; #endif --- diff/arch/ia64/mm/hugetlbpage.c 2003-10-27 09:20:36.000000000 +0000 +++ source/arch/ia64/mm/hugetlbpage.c 2003-11-26 10:09:04.000000000 +0000 @@ -518,7 +518,7 @@ return 1; } -static struct page *hugetlb_nopage(struct vm_area_struct * area, unsigned long address, int unused) +static struct page *hugetlb_nopage(struct vm_area_struct * area, unsigned long address, int *unused) { BUG(); return NULL; --- diff/arch/m68k/kernel/ints.c 2003-10-09 09:47:33.000000000 +0100 +++ source/arch/m68k/kernel/ints.c 2003-11-26 10:09:04.000000000 +0000 @@ -253,19 +253,18 @@ int show_interrupts(struct seq_file *p, void *v) { - int i; + int i = *(int *) v; /* autovector interrupts */ if (mach_default_handler) { - for (i = 0; i < SYS_IRQS; i++) { + if (i < SYS_IRQS) { seq_printf(p, "auto %2d: %10u ", i, i ? 
kstat_cpu(0).irqs[i] : num_spurious); seq_puts(p, " "); seq_printf(p, "%s\n", irq_list[i].devname); } - } - - mach_get_irq_list(p, v); + } else if (i == SYS_IRQS) + mach_get_irq_list(p, v); return 0; } --- diff/arch/m68knommu/platform/5307/ints.c 2003-10-09 09:47:33.000000000 +0100 +++ source/arch/m68knommu/platform/5307/ints.c 2003-11-26 10:09:04.000000000 +0000 @@ -254,9 +254,9 @@ int show_interrupts(struct seq_file *p, void *v) { - int i; + int i = *(int *) v; - for (i = 0; i < NR_IRQS; i++) { + if (i < NR_IRQS) { if (irq_list[i].flags & IRQ_FLG_STD) continue; @@ -269,7 +269,7 @@ seq_printf(p, "%s\n", irq_list[i].devname); } - if (mach_get_irq_list) + if (i == NR_IRQS && mach_get_irq_list) mach_get_irq_list(p, v); return(0); } --- diff/arch/m68knommu/platform/68328/ints.c 2003-10-09 09:47:33.000000000 +0100 +++ source/arch/m68knommu/platform/68328/ints.c 2003-11-26 10:09:04.000000000 +0000 @@ -198,9 +198,9 @@ int show_interrupts(struct seq_file *p, void *v) { - int i; + int i = *(int *) v; - for (i = 0; i < NR_IRQS; i++) { + if (i < NR_IRQS) { if (int_irq_list[i].flags & IRQ_FLG_STD) continue; @@ -211,7 +211,8 @@ seq_printf(p, " "); seq_printf(p, "%s\n", int_irq_list[i].devname); } - seq_printf(p, " : %10u spurious\n", num_spurious); + if (i == NR_IRQS) + seq_printf(p, " : %10u spurious\n", num_spurious); return 0; } --- diff/arch/m68knommu/platform/68360/ints.c 2003-10-09 09:47:33.000000000 +0100 +++ source/arch/m68knommu/platform/68360/ints.c 2003-11-26 10:09:04.000000000 +0000 @@ -278,9 +278,9 @@ int show_interrupts(struct seq_file *p, void *v) { - int i; + int i = *(int *) v; - for (i = 0; i < NR_IRQS; i++) { + if (i < NR_IRQS) { if (int_irq_list[i].flags & IRQ_FLG_STD) continue; @@ -291,7 +291,8 @@ seq_printf(p, " "); seq_printf(p, "%s\n", int_irq_list[i].devname); } - seq_printf(p, " : %10u spurious\n", num_spurious); + if (i == NR_IRQS) + seq_printf(p, " : %10u spurious\n", num_spurious); return 0; } --- diff/arch/mips/kernel/irq.c 2003-10-09 09:47:33.000000000 +0100 +++ source/arch/mips/kernel/irq.c 2003-11-26 10:09:04.000000000 +0000 @@ -85,17 +85,19 @@ int show_interrupts(struct seq_file *p, void *v) { - int i, j; + int i = *(int *) v, j; struct irqaction * action; unsigned long flags; - seq_printf(p, " "); - for (j=0; j<NR_CPUS; j++) - if (cpu_online(j)) - seq_printf(p, "CPU%d ",j); - seq_putc(p, '\n'); + if (i == 0) { + seq_printf(p, " "); + for (j=0; j<NR_CPUS; j++) + if (cpu_online(j)) + seq_printf(p, "CPU%d ",j); + seq_putc(p, '\n'); + } - for (i = 0 ; i < NR_IRQS ; i++) { + if (i < NR_IRQS) { spin_lock_irqsave(&irq_desc[i].lock, flags); action = irq_desc[i].action; if (!action) @@ -117,10 +119,10 @@ seq_putc(p, '\n'); skip: spin_unlock_irqrestore(&irq_desc[i].lock, flags); + } else if (i == NR_IRQS) { + seq_putc(p, '\n'); + seq_printf(p, "ERR: %10u\n", atomic_read(&irq_err_count)); } - seq_putc(p, '\n'); - seq_printf(p, "ERR: %10u\n", atomic_read(&irq_err_count)); - return 0; } @@ -872,17 +874,13 @@ static int irq_affinity_read_proc (char *page, char **start, off_t off, int count, int *eof, void *data) { - int len, k; - cpumask_t tmp = irq_affinity[(long)data]; + int len; if (count < HEX_DIGITS+1) return -EINVAL; - for (k = 0; k < sizeof(cpumask_t)/sizeof(u16); ++k) { - int j = sprintf(page, "%04hx", cpus_coerce(tmp)); - len += j; - page += j; - cpus_shift_right(tmp, tmp, 16); - } + + len = format_cpumask(page, irq_affinity[(long)data]); + page += len; len += sprintf(page, "\n"); return len; } @@ -918,19 +916,14 @@ static int prof_cpu_mask_read_proc (char *page, char 
**start, off_t off, int count, int *eof, void *data) { - int len, k; - cpumask_t *mask = (cpumask_t *)data, tmp; + int len; + cpumask_t *mask = (cpumask_t *)data; if (count < HEX_DIGITS+1) return -EINVAL; - tmp = *mask; - for (k = 0; k < sizeof(cpumask_t)/sizeof(u16); ++k) { - int j = sprintf(page, "%04hx", cpus_coerce(tmp)); - len += j; - page += j; - cpus_shift_right(tmp, tmp, 16); - } + len = format_cpumask(page, *mask); + page += len; len += sprintf(page, "\n"); return len; } --- diff/arch/parisc/kernel/irq.c 2003-10-09 09:47:33.000000000 +0100 +++ source/arch/parisc/kernel/irq.c 2003-11-26 10:09:04.000000000 +0000 @@ -215,29 +215,30 @@ int show_interrupts(struct seq_file *p, void *v) { #ifdef CONFIG_PROC_FS - unsigned int regnr = 0; + unsigned int regnr = *(int *) v, i; - seq_puts(p, " "); + if (regnr == 0) { + seq_puts(p, " "); #ifdef CONFIG_SMP - for (regnr = 0; regnr < NR_CPUS; regnr++) + for (i = 0; i < NR_CPUS; i++) #endif - seq_printf(p, " CPU%02d ", regnr); + seq_printf(p, " CPU%02d ", i); #ifdef PARISC_IRQ_CR16_COUNTS - seq_printf(p, "[min/avg/max] (CPU cycle counts)"); + seq_printf(p, "[min/avg/max] (CPU cycle counts)"); #endif - seq_putc(p, '\n'); + seq_putc(p, '\n'); + } /* We don't need *irqsave lock variants since this is ** only allowed to change while in the base context. */ spin_lock(&irq_lock); - for (regnr = 0; regnr < NR_IRQ_REGS; regnr++) { - unsigned int i; + if (regnr < NR_IRQ_REGS) { struct irq_region *region = irq_region[regnr]; if (!region || !region->action) - continue; + goto skip; for (i = 0; i <= MAX_CPU_IRQ; i++) { struct irqaction *action = &region->action[i]; @@ -286,9 +287,9 @@ seq_putc(p, '\n'); } } + skip: spin_unlock(&irq_lock); - seq_putc(p, '\n'); #endif /* CONFIG_PROC_FS */ return 0; } --- diff/arch/parisc/kernel/sys_parisc.c 2003-10-09 09:47:33.000000000 +0100 +++ source/arch/parisc/kernel/sys_parisc.c 2003-11-26 10:09:04.000000000 +0000 @@ -93,17 +93,13 @@ unsigned long arch_get_unmapped_area(struct file *filp, unsigned long addr, unsigned long len, unsigned long pgoff, unsigned long flags) { - struct inode *inode; - if (len > TASK_SIZE) return -ENOMEM; if (!addr) addr = TASK_UNMAPPED_BASE; - inode = filp ?
filp->f_dentry->d_inode : NULL; - - if (inode && (flags & MAP_SHARED)) { - addr = get_shared_area(inode->i_mapping, addr, len, pgoff); + if (filp && (flags & MAP_SHARED)) { + addr = get_shared_area(filp->f_mapping, addr, len, pgoff); } else { addr = get_unshared_area(addr, len); } --- diff/arch/ppc/boot/ld.script 2003-10-09 09:47:16.000000000 +0100 +++ source/arch/ppc/boot/ld.script 2003-11-26 10:09:04.000000000 +0000 @@ -82,6 +82,7 @@ *(__ksymtab) *(__ksymtab_strings) *(__bug_table) + *(__kcrctab) } } --- diff/arch/ppc/kernel/irq.c 2003-10-09 09:47:33.000000000 +0100 +++ source/arch/ppc/kernel/irq.c 2003-11-26 10:09:04.000000000 +0000 @@ -346,17 +346,19 @@ int show_interrupts(struct seq_file *p, void *v) { - int i, j; + int i = *(int *) v, j; struct irqaction * action; unsigned long flags; - seq_puts(p, " "); - for (j=0; j<NR_CPUS; j++) - if (cpu_online(j)) - seq_printf(p, "CPU%d ", j); - seq_putc(p, '\n'); + if (i == 0) { + seq_puts(p, " "); + for (j=0; j<NR_CPUS; j++) + if (cpu_online(j)) + seq_printf(p, "CPU%d ", j); + seq_putc(p, '\n'); + } - for (i = 0 ; i < NR_IRQS ; i++) { + if (i < NR_IRQS) { spin_lock_irqsave(&irq_desc[i].lock, flags); action = irq_desc[i].action; if ( !action || !action->handler ) @@ -381,22 +383,23 @@ seq_putc(p, '\n'); skip: spin_unlock_irqrestore(&irq_desc[i].lock, flags); - } + } else if (i == NR_IRQS) { #ifdef CONFIG_TAU_INT - if (tau_initialized){ - seq_puts(p, "TAU: "); - for (j = 0; j < NR_CPUS; j++) - if (cpu_online(j)) - seq_printf(p, "%10u ", tau_interrupts(j)); - seq_puts(p, " PowerPC Thermal Assist (cpu temp)\n"); - } + if (tau_initialized){ + seq_puts(p, "TAU: "); + for (j = 0; j < NR_CPUS; j++) + if (cpu_online(j)) + seq_printf(p, "%10u ", tau_interrupts(j)); + seq_puts(p, " PowerPC Thermal Assist (cpu temp)\n"); + } #endif #ifdef CONFIG_SMP - /* should this be per processor send/receive? */ - seq_printf(p, "IPI (recv/sent): %10u/%u\n", - atomic_read(&ipi_recv), atomic_read(&ipi_sent)); + /* should this be per processor send/receive? 
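Every show_interrupts() conversion in this patch (ia64, m68k, mips, parisc, the ppc hunk here, and more below) has the same shape: v carries one record index instead of the function looping itself, record 0 also prints the header, records below NR_IRQS print one interrupt line each, and record NR_IRQS prints the trailing totals. A sketch of that shape in kernel style (show_one is an illustrative name, not from the patch):

#include <linux/seq_file.h>
#include <linux/kernel_stat.h>
#include <linux/irq.h>

extern atomic_t irq_err_count;	/* arch counter, as used above */

static int show_one(struct seq_file *p, void *v)
{
	int i = *(int *) v;	/* record index from the seq_file iterator */

	if (i == 0)		/* header, first record only */
		seq_puts(p, "           CPU0\n");
	if (i < NR_IRQS)	/* one line per interrupt */
		seq_printf(p, "%3d: %10u\n", i, kstat_cpu(0).irqs[i]);
	else if (i == NR_IRQS)	/* summary record past the last IRQ */
		seq_printf(p, "ERR: %10u\n", atomic_read(&irq_err_count));
	return 0;
}

The matching iterator (not visible in these hunks) has to run from 0 through NR_IRQS inclusive, one index past the last interrupt, or the summary lines are never emitted.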
*/ + seq_printf(p, "IPI (recv/sent): %10u/%u\n", + atomic_read(&ipi_recv), atomic_read(&ipi_sent)); #endif - seq_printf(p, "BAD: %10u\n", ppc_spurious_interrupts); + seq_printf(p, "BAD: %10u\n", ppc_spurious_interrupts); + } return 0; } @@ -574,19 +577,13 @@ static int irq_affinity_read_proc (char *page, char **start, off_t off, int count, int *eof, void *data) { - cpumask_t tmp = irq_affinity[(long)data]; - int k, len = 0; + int len; if (count < HEX_DIGITS+1) return -EINVAL; - for (k = 0; k < sizeof(cpumask_t)/sizeof(u16); ++k) { - int j = sprintf(page, "%04hx", (u16)cpus_coerce(tmp)); - len += j; - page += j; - cpus_shift_right(tmp, tmp, 16); - } - + len = format_cpumask(page, irq_affinity[(long)data]); + page += len; len += sprintf(page, "\n"); return len; } @@ -665,17 +662,13 @@ int count, int *eof, void *data) { cpumask_t mask = *(cpumask_t *)data; - int k, len = 0; + int len; if (count < HEX_DIGITS+1) return -EINVAL; - for (k = 0; k < sizeof(cpumask_t)/sizeof(u16); ++k) { - int j = sprintf(page, "%04hx", (u16)cpus_coerce(mask)); - len += j; - page += j; - cpus_shift_right(mask, mask, 16); - } + len = format_cpumask(page, mask); + page += len; len += sprintf(page, "\n"); return len; } --- diff/arch/ppc64/kernel/irq.c 2003-10-27 09:20:37.000000000 +0000 +++ source/arch/ppc64/kernel/irq.c 2003-11-26 10:09:04.000000000 +0000 @@ -300,7 +300,7 @@ spin_lock_irqsave(&desc->lock, flags); switch (desc->depth) { case 1: { - unsigned int status = desc->status & ~(IRQ_DISABLED | IRQ_INPROGRESS); + unsigned int status = desc->status & ~IRQ_DISABLED; desc->status = status; if ((status & (IRQ_PENDING | IRQ_REPLAY)) == IRQ_PENDING) { desc->status = status | IRQ_REPLAY; @@ -323,18 +323,20 @@ int show_interrupts(struct seq_file *p, void *v) { - int i, j; + int i = *(int *) v, j; struct irqaction * action; unsigned long flags; - seq_printf(p, " "); - for (j=0; j<NR_CPUS; j++) { - if (cpu_online(j)) - seq_printf(p, "CPU%d ",j); + if (i == 0) { + seq_printf(p, " "); + for (j=0; j<NR_CPUS; j++) { + if (cpu_online(j)) + seq_printf(p, "CPU%d ",j); + } + seq_putc(p, '\n'); } - seq_putc(p, '\n'); - for (i = 0 ; i < NR_IRQS ; i++) { + if (i < NR_IRQS) { spin_lock_irqsave(&irq_desc[i].lock, flags); action = irq_desc[i].action; if (!action || !action->handler) @@ -359,8 +361,8 @@ seq_putc(p, '\n'); skip: spin_unlock_irqrestore(&irq_desc[i].lock, flags); - } - seq_printf(p, "BAD: %10u\n", ppc_spurious_interrupts); + } else if (i == NR_IRQS) + seq_printf(p, "BAD: %10u\n", ppc_spurious_interrupts); return 0; } @@ -657,18 +659,13 @@ static int irq_affinity_read_proc (char *page, char **start, off_t off, int count, int *eof, void *data) { - int k, len; - cpumask_t tmp = irq_affinity[(long)data]; + int len; if (count < HEX_DIGITS+1) return -EINVAL; - for (k = 0; k < sizeof(cpumask_t) / sizeof(u16); ++k) { - int j = sprintf(page, "%04hx", (u16)cpus_coerce(tmp)); - len += j; - page += j; - cpus_shift_right(tmp, tmp, 16); - } + len = format_cpumask(page, irq_affinity[(long)data]); + page += len; len += sprintf(page, "\n"); return len; } @@ -744,10 +741,16 @@ static int prof_cpu_mask_read_proc (char *page, char **start, off_t off, int count, int *eof, void *data) { - unsigned long *mask = (unsigned long *) data; + int len; + cpumask_t *mask = (cpumask_t *) data; + if (count < HEX_DIGITS+1) return -EINVAL; - return sprintf (page, "%08lx\n", *mask); + + len = format_cpumask(page, *mask); + page += len; + len += sprintf (page, "\n"); + return len; } static int prof_cpu_mask_write_proc (struct file *file, const char __user 
*buffer, --- diff/arch/ppc64/kernel/misc.S 2003-10-27 09:20:37.000000000 +0000 +++ source/arch/ppc64/kernel/misc.S 2003-11-26 10:09:04.000000000 +0000 @@ -843,15 +843,15 @@ .llong .sys_ni_syscall .llong .sys_ni_syscall .llong .sys_ni_syscall - .llong .sys_ni_syscall /* 245 */ - .llong .sys_ni_syscall - .llong .sys_ni_syscall - .llong .sys_ni_syscall - .llong .sys_ni_syscall + .llong .compat_clock_settime /* 245 */ + .llong .compat_clock_gettime + .llong .compat_clock_getres + .llong .compat_clock_nanosleep + .llong .sys_ni_syscall /* 249 swapcontext */ .llong .sys32_tgkill /* 250 */ .llong .sys32_utimes - .llong .sys_statfs64 - .llong .sys_fstatfs64 + .llong .compat_statfs64 + .llong .compat_fstatfs64 .balign 8 _GLOBAL(sys_call_table) --- diff/arch/ppc64/kernel/time.c 2003-10-09 09:47:33.000000000 +0100 +++ source/arch/ppc64/kernel/time.c 2003-11-26 10:09:04.000000000 +0000 @@ -91,6 +91,9 @@ unsigned long processor_freq; spinlock_t rtc_lock = SPIN_LOCK_UNLOCKED; +unsigned long tb_to_ns_scale; +unsigned long tb_to_ns_shift; + struct gettimeofday_struct do_gtod; extern unsigned long wall_jiffies; @@ -313,11 +316,13 @@ /* * Scheduler clock - returns current time in nanosec units. * - * This is wrong, but my CPUs run at 1GHz, so nyer nyer. + * Note: mulhdu(a, b) (multiply high double unsigned) returns + * the high 64 bits of a * b, i.e. (a * b) >> 64, where a and b + * are 64-bit unsigned numbers. */ unsigned long long sched_clock(void) { - return get_tb(); + return mulhdu(get_tb(), tb_to_ns_scale) << tb_to_ns_shift; } /* @@ -473,9 +478,30 @@ /* This function is only called on the boot processor */ unsigned long flags; struct rtc_time tm; + struct div_result res; + unsigned long scale, shift; ppc_md.calibrate_decr(); + /* + * Compute scale factor for sched_clock. + * The calibrate_decr() function has set tb_ticks_per_sec, + * which is the timebase frequency. + * We compute 1e9 * 2^64 / tb_ticks_per_sec and interpret + * the 128-bit result as a 64.64 fixed-point number. + * We then shift that number right until it is less than 1.0, + * giving us the scale factor and shift count to use in + * sched_clock(). + */ + div128_by_32(1000000000, 0, tb_ticks_per_sec, &res); + scale = res.result_low; + for (shift = 0; res.result_high != 0; ++shift) { + scale = (scale >> 1) | (res.result_high << 63); + res.result_high >>= 1; + } + tb_to_ns_scale = scale; + tb_to_ns_shift = shift; + #ifdef CONFIG_PPC_ISERIES if (!piranha_simulator) #endif --- diff/arch/ppc64/mm/hugetlbpage.c 2003-10-27 09:20:37.000000000 +0000 +++ source/arch/ppc64/mm/hugetlbpage.c 2003-11-26 10:09:04.000000000 +0000 @@ -921,7 +921,7 @@ * this far. 
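The scale/shift setup above is easy to check with concrete numbers. For a hypothetical 200MHz timebase, 10^9 * 2^64 / (2 * 10^8) = 5 * 2^64; three right shifts bring the 64.64 value below 1.0, leaving scale = 0xa000000000000000 (i.e. 0.625) and shift = 3, so sched_clock() yields tb * 0.625 * 8 = 5ns per tick, as expected. A userspace check, using GCC's unsigned __int128 in place of div128_by_32() and mulhdu():

#include <stdio.h>

int main(void)
{
	unsigned long long tps = 200000000ULL;	/* timebase ticks per second */
	unsigned __int128 res = ((unsigned __int128) 1000000000ULL << 64) / tps;
	unsigned long long scale, tb, ns;
	unsigned int shift = 0;

	while (res >> 64) {			/* shift until < 1.0 in 64.64 */
		res >>= 1;
		shift++;
	}
	scale = (unsigned long long) res;

	tb = tps;				/* one second's worth of ticks */
	ns = (unsigned long long)(((unsigned __int128) tb * scale) >> 64) << shift;
	printf("scale=%#llx shift=%u ns=%llu\n", scale, shift, ns);
	return 0;
}

This prints scale=0xa000000000000000 shift=3 ns=1000000000; the kernel loop produces the same bits by ORing result_high in from the top as it shifts.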
*/ static struct page *hugetlb_nopage(struct vm_area_struct *vma, - unsigned long address, int unused) + unsigned long address, int *unused) { BUG(); return NULL; --- diff/arch/ppc64/mm/numa.c 2003-10-27 09:20:37.000000000 +0000 +++ source/arch/ppc64/mm/numa.c 2003-11-26 10:09:04.000000000 +0000 @@ -108,7 +108,7 @@ for (memory = find_type_devices("memory"); memory; memory = memory->next) { - int *tmp1, *tmp2; + unsigned int *tmp1, *tmp2; unsigned long i; unsigned long start = 0; unsigned long size = 0; --- diff/arch/s390/kernel/compat_wrapper.S 2003-06-09 14:18:18.000000000 +0100 +++ source/arch/s390/kernel/compat_wrapper.S 2003-11-26 10:09:04.000000000 +0000 @@ -5,6 +5,7 @@ * S390 version * Copyright (C) 2000 IBM Deutschland Entwicklung GmbH, IBM Corporation * Author(s): Gerhard Tonn (ton@de.ibm.com), +* Thomas Spatzier (tspat@de.ibm.com) */ .globl sys32_exit_wrapper @@ -1230,3 +1231,37 @@ lgfr %r4,%r4 # int lgfr %r5,%r5 # int jg sys_epoll_wait # branch to system call + + .globl sys32_io_setup_wrapper +sys32_io_setup_wrapper: + llgfr %r2,%r2 # unsigned int + llgtr %r3,%r3 # u32 * + jg compat_sys_io_setup + + .globl sys32_io_destroy_wrapper +sys32_io_destroy_wrapper: + llgfr %r2,%r2 # (aio_context_t) u32 + jg sys_io_destroy + + .globl sys32_io_getevents_wrapper +sys32_io_getevents_wrapper: + llgfr %r2,%r2 # (aio_context_t) u32 + lgfr %r3,%r3 # long + lgfr %r4,%r4 # long + llgtr %r5,%r5 # struct io_event * + llgtr %r6,%r6 # struct compat_timespec * + jg compat_sys_io_getevents + + .globl sys32_io_submit_wrapper +sys32_io_submit_wrapper: + llgfr %r2,%r2 # (aio_context_t) u32 + lgfr %r3,%r3 # long + llgtr %r4,%r4 # struct iocb ** + jg compat_sys_io_submit + + .globl sys32_io_cancel_wrapper +sys32_io_cancel_wrapper: + llgfr %r2,%r2 # (aio_context_t) u32 + llgtr %r3,%r3 # struct iocb * + llgtr %r4,%r4 # struct io_event * + jg sys_io_cancel --- diff/arch/s390/kernel/setup.c 2003-10-09 09:47:33.000000000 +0100 +++ source/arch/s390/kernel/setup.c 2003-11-26 10:09:04.000000000 +0000 @@ -617,17 +617,17 @@ int show_interrupts(struct seq_file *p, void *v) { - int i, j; + int i = *(int *) v, j; - seq_puts(p, " "); - - for (j=0; j<NR_CPUS; j++) - if (cpu_online(j)) - seq_printf(p, "CPU%d ",j); - - seq_putc(p, '\n'); + if (i == 0) { + seq_puts(p, " "); + for (j=0; j<NR_CPUS; j++) + if (cpu_online(j)) + seq_printf(p, "CPU%d ",j); + seq_putc(p, '\n'); + } - for (i = 0 ; i < NR_IRQS ; i++) { + if (i < NR_IRQS) { seq_printf(p, "%s: ", intrclass_names[i]); #ifndef CONFIG_SMP seq_printf(p, "%10u ", kstat_irqs(i)); --- diff/arch/s390/kernel/syscalls.S 2003-10-09 09:47:33.000000000 +0100 +++ source/arch/s390/kernel/syscalls.S 2003-11-26 10:09:04.000000000 +0000 @@ -251,11 +251,11 @@ SYSCALL(sys_sched_getaffinity,sys_sched_getaffinity,sys32_sched_getaffinity_wrapper) /* 240 */ SYSCALL(sys_tgkill,sys_tgkill,sys_tgkill) NI_SYSCALL /* reserved for TUX */ -SYSCALL(sys_io_setup,sys_io_setup,sys_ni_syscall) -SYSCALL(sys_io_destroy,sys_io_destroy,sys_ni_syscall) -SYSCALL(sys_io_getevents,sys_io_getevents,sys_ni_syscall) /* 245 */ -SYSCALL(sys_io_submit,sys_io_submit,sys_ni_syscall) -SYSCALL(sys_io_cancel,sys_io_cancel,sys_ni_syscall) +SYSCALL(sys_io_setup,sys_io_setup,sys32_io_setup_wrapper) +SYSCALL(sys_io_destroy,sys_io_destroy,sys32_io_destroy_wrapper) +SYSCALL(sys_io_getevents,sys_io_getevents,sys32_io_getevents_wrapper) /* 245 */ +SYSCALL(sys_io_submit,sys_io_submit,sys32_io_submit_wrapper) +SYSCALL(sys_io_cancel,sys_io_cancel,sys32_io_cancel_wrapper) SYSCALL(sys_exit_group,sys_exit_group,sys32_exit_group_wrapper) 
SYSCALL(sys_epoll_create,sys_epoll_create,sys_epoll_create_wrapper) SYSCALL(sys_epoll_ctl,sys_epoll_ctl,sys_epoll_ctl_wrapper) /* 250 */ --- diff/arch/sh/kernel/irq.c 2003-10-09 09:47:33.000000000 +0100 +++ source/arch/sh/kernel/irq.c 2003-11-26 10:09:04.000000000 +0000 @@ -93,17 +93,19 @@ #if defined(CONFIG_PROC_FS) int show_interrupts(struct seq_file *p, void *v) { - int i, j; + int i = *(int *) v, j; struct irqaction * action; unsigned long flags; - seq_puts(p, " "); - for (j=0; j<NR_CPUS; j++) - if (cpu_online(j)) - seq_printf(p, "CPU%d ",j); - seq_putc(p, '\n'); + if (i == 0) { + seq_puts(p, " "); + for (j=0; j<NR_CPUS; j++) + if (cpu_online(j)) + seq_printf(p, "CPU%d ",j); + seq_putc(p, '\n'); + } - for (i = 0 ; i < ACTUAL_NR_IRQS ; i++) { + if (i < ACTUAL_NR_IRQS) { spin_lock_irqsave(&irq_desc[i].lock, flags); action = irq_desc[i].action; if (!action) --- diff/arch/sparc/kernel/irq.c 2003-10-09 09:47:33.000000000 +0100 +++ source/arch/sparc/kernel/irq.c 2003-11-26 10:09:04.000000000 +0000 @@ -102,7 +102,7 @@ int show_interrupts(struct seq_file *p, void *v) { - int i; + int i = *(int *) v; struct irqaction * action; unsigned long flags; #ifdef CONFIG_SMP @@ -114,7 +114,7 @@ return show_sun4d_interrupts(p, v); } - for (i = 0 ; i < NR_IRQS ; i++) { + if (i < NR_IRQS) { local_irq_save(flags); action = *(i + irq_action); if (!action) --- diff/arch/sparc/kernel/sun4d_irq.c 2003-05-21 11:50:14.000000000 +0100 +++ source/arch/sparc/kernel/sun4d_irq.c 2003-11-26 10:09:04.000000000 +0000 @@ -75,13 +75,13 @@ int show_sun4d_interrupts(struct seq_file *p, void *v) { - int i, j = 0, k = 0, sbusl; + int i = *(int *) v, j = 0, k = 0, sbusl; struct irqaction * action; #ifdef CONFIG_SMP int x; #endif - for (i = 0 ; i < NR_IRQS ; i++) { + if (i < NR_IRQS) { sbusl = pil_to_sbus[i]; if (!sbusl) { action = *(i + irq_action); --- diff/arch/sparc/kernel/time.c 2003-11-25 15:24:57.000000000 +0000 +++ source/arch/sparc/kernel/time.c 2003-11-26 10:09:05.000000000 +0000 @@ -638,3 +638,12 @@ return -1; } } + +/* + * Returns nanoseconds + */ + +unsigned long long sched_clock(void) +{ + return (unsigned long long)jiffies * (1000000000 / HZ); +} --- diff/arch/sparc64/Kconfig 2003-11-25 15:24:57.000000000 +0000 +++ source/arch/sparc64/Kconfig 2003-11-26 10:09:04.000000000 +0000 @@ -808,12 +808,19 @@ best used in conjunction with the NMI watchdog so that spinlock deadlocks are also debuggable. +config LOCKMETER + bool "Kernel lock metering" + depends on SMP && !PREEMPT + help + Say Y to enable kernel lock metering, which adds overhead to SMP locks, + but allows you to see various statistics using the lockstat command. + # We have a custom atomic_dec_and_lock() implementation but it's not # compatible with spinlock debugging so we need to fall back on # the generic version in that case. 
config HAVE_DEC_LOCK bool - depends on SMP && !DEBUG_SPINLOCK + depends on SMP && !DEBUG_SPINLOCK && !LOCKMETER default y config DEBUG_SPINLOCK_SLEEP --- diff/arch/sparc64/kernel/irq.c 2003-10-09 09:47:33.000000000 +0100 +++ source/arch/sparc64/kernel/irq.c 2003-11-26 10:09:04.000000000 +0000 @@ -128,14 +128,14 @@ int show_interrupts(struct seq_file *p, void *v) { unsigned long flags; - int i; + int i = *(int *) v; struct irqaction *action; #ifdef CONFIG_SMP int j; #endif spin_lock_irqsave(&irq_action_lock, flags); - for (i = 0; i < (NR_IRQS + 1); i++) { + if (i <= NR_IRQS) { if (!(action = *(i + irq_action))) continue; seq_printf(p, "%3d: ", i); --- diff/arch/sparc64/kernel/systbls.S 2003-11-25 15:24:57.000000000 +0000 +++ source/arch/sparc64/kernel/systbls.S 2003-11-26 10:09:04.000000000 +0000 @@ -72,8 +72,8 @@ /*250*/ .word sys32_mremap, sys32_sysctl, sys_getsid, sys_fdatasync, sys32_nfsservctl .word sys_ni_syscall, compat_clock_settime, compat_clock_gettime, compat_clock_getres, compat_clock_nanosleep /*260*/ .word compat_sys_sched_getaffinity, compat_sys_sched_setaffinity, compat_timer_settime, compat_timer_gettime, sys_timer_getoverrun - .word sys_timer_delete, sys32_timer_create, sys_ni_syscall, sys_ni_syscall, sys_ni_syscall -/*270*/ .word sys_ni_syscall, sys_ni_syscall, sys_ni_syscall, sys_ni_syscall + .word sys_timer_delete, sys32_timer_create, sys_ni_syscall, compat_sys_io_setup, sys_io_destroy +/*270*/ .word compat_sys_io_submit, sys_io_cancel, compat_sys_io_getevents, sys_ni_syscall /* Now the 64-bit native Linux syscall table. */ --- diff/arch/sparc64/lib/rwlock.S 2003-11-25 15:24:57.000000000 +0000 +++ source/arch/sparc64/lib/rwlock.S 2003-11-26 10:09:04.000000000 +0000 @@ -85,5 +85,20 @@ __write_trylock_fail: retl mov 0, %o0 + + .globl __read_trylock +__read_trylock: /* %o0 = lock_ptr */ + ldsw [%o0], %g5 + brlz,pn %g5, 100f + add %g5, 1, %g7 + cas [%o0], %g5, %g7 + cmp %g5, %g7 + bne,pn %icc, __read_trylock + membar #StoreLoad | #StoreStore + retl + mov 1, %o0 +100: retl + mov 0, %o0 + rwlock_impl_end: --- diff/arch/sparc64/mm/hugetlbpage.c 2003-10-27 09:20:43.000000000 +0000 +++ source/arch/sparc64/mm/hugetlbpage.c 2003-11-26 10:09:04.000000000 +0000 @@ -504,7 +504,7 @@ * this far. 
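The new __read_trylock above follows the standard cas loop: load the counter, bail out if it is negative (a writer holds the lock), otherwise try to install count+1 and retry if the word changed underneath. An illustrative C rendering (the helper name is made up, and GCC's __sync builtin stands in for the cas instruction and the membar):

#include <stdio.h>

static int read_trylock_sketch(int *lock)
{
	int old;

	do {
		old = *lock;
		if (old < 0)		/* negative count: writer active */
			return 0;
		/* cas: store old+1 only if the word still equals old */
	} while (!__sync_bool_compare_and_swap(lock, old, old + 1));
	/* the builtin is a full barrier, covering the asm's
	   membar #StoreLoad | #StoreStore */
	return 1;
}

int main(void)
{
	int lock = 0;

	printf("uncontended: %d, count now %d\n",
	       read_trylock_sketch(&lock), lock);
	lock = -1;			/* writer holds it */
	printf("against writer: %d\n", read_trylock_sketch(&lock));
	return 0;
}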
*/ static struct page *hugetlb_nopage(struct vm_area_struct *vma, - unsigned long address, int unused) + unsigned long address, int *unused) { BUG(); return NULL; --- diff/arch/um/drivers/ubd_kern.c 2003-10-09 09:47:16.000000000 +0100 +++ source/arch/um/drivers/ubd_kern.c 2003-11-26 10:09:05.000000000 +0000 @@ -49,9 +49,9 @@ static void (*do_ubd)(void); -static int ubd_open(struct inode * inode, struct file * filp); -static int ubd_release(struct inode * inode, struct file * file); -static int ubd_ioctl(struct inode * inode, struct file * file, +static int ubd_open(struct block_device *bdev, struct file * filp); +static int ubd_release(struct gendisk *disk); +static int ubd_ioctl(struct block_device *bdev, struct file * file, unsigned int cmd, unsigned long arg); #define MAX_DEV (8) @@ -710,9 +710,9 @@ device_initcall(ubd_driver_init); -static int ubd_open(struct inode *inode, struct file *filp) +static int ubd_open(struct block_device *bdev, struct file *filp) { - struct gendisk *disk = inode->i_bdev->bd_disk; + struct gendisk *disk = bdev->bd_disk; struct ubd *dev = disk->private_data; int err = -EISDIR; @@ -739,9 +739,8 @@ return(err); } -static int ubd_release(struct inode * inode, struct file * file) +static int ubd_release(struct gendisk *disk) { - struct gendisk *disk = inode->i_bdev->bd_disk; struct ubd *dev = disk->private_data; if(--dev->count == 0) @@ -865,11 +864,11 @@ } } -static int ubd_ioctl(struct inode * inode, struct file * file, +static int ubd_ioctl(struct block_device *bdev, struct file * file, unsigned int cmd, unsigned long arg) { struct hd_geometry *loc = (struct hd_geometry *) arg; - struct ubd *dev = inode->i_bdev->bd_disk->private_data; + struct ubd *dev = bdev->bd_disk->private_data; int err; struct hd_driveid ubd_id = { .cyls = 0, @@ -890,7 +889,7 @@ case HDIO_SET_UNMASKINTR: if(!capable(CAP_SYS_ADMIN)) return(-EACCES); - if((arg > 1) || (inode->i_bdev->bd_contains != inode->i_bdev)) + if((arg > 1) || (bdev->bd_contains != bdev)) return(-EINVAL); return(0); @@ -910,7 +909,7 @@ case HDIO_SET_MULTCOUNT: if(!capable(CAP_SYS_ADMIN)) return(-EACCES); - if(inode->i_bdev->bd_contains != inode->i_bdev) + if(bdev->bd_contains != bdev) return(-EINVAL); return(0); --- diff/arch/um/kernel/irq.c 2003-10-09 09:47:33.000000000 +0100 +++ source/arch/um/kernel/irq.c 2003-11-26 10:09:05.000000000 +0000 @@ -577,9 +577,15 @@ static int irq_affinity_read_proc (char *page, char **start, off_t off, int count, int *eof, void *data) { + int len; + if (count < HEX_DIGITS+1) return -EINVAL; - return sprintf (page, "%08lx\n", irq_affinity[(long)data]); + + len = format_cpumask(page, irq_affinity[(long)data]); + page += len; + len += sprintf(page, "\n"); + return len; } static unsigned int parse_hex_value (const char *buffer, @@ -652,18 +658,14 @@ static int prof_cpu_mask_read_proc (char *page, char **start, off_t off, int count, int *eof, void *data) { - cpumask_t tmp, *mask = (cpumask_t *) data; - int k, len = 0; + cpumask_t *mask = (cpumask_t *)data; + int len; if (count < HEX_DIGITS+1) return -EINVAL; - tmp = *mask; - for (k = 0; k < sizeof(cpumask_t)/sizeof(u16); ++k) { - int j = sprintf(page, "%04hx", cpus_coerce(tmp)); - len += j; - page += j; - cpus_shift_right(tmp, tmp, 16); - } + + len = format_cpumask(page, *mask); + page += len; len += sprintf(page, "\n"); return len; } --- diff/arch/v850/kernel/irq.c 2003-10-09 09:47:33.000000000 +0100 +++ source/arch/v850/kernel/irq.c 2003-11-26 10:09:05.000000000 +0000 @@ -81,16 +81,18 @@ int show_interrupts(struct seq_file *p, void *v) { -
int i; + int i = *(int *) v; struct irqaction * action; unsigned long flags; - seq_puts(p, " "); - for (i=0; i < 1 /*smp_num_cpus*/; i++) - seq_printf(p, "CPU%d ", i); - seq_putc(p, '\n'); + if (i == 0) { + seq_puts(p, " "); + for (i=0; i < 1 /*smp_num_cpus*/; i++) + seq_printf(p, "CPU%d ", i); + seq_putc(p, '\n'); + } - for (i = 0 ; i < NR_IRQS ; i++) { + if (i < NR_IRQS) { int j, count, num; const char *type_name = irq_desc[i].handler->typename; spin_lock_irqsave(&irq_desc[j].lock, flags); @@ -121,8 +123,8 @@ seq_putc(p, '\n'); skip: spin_unlock_irqrestore(&irq_desc[j].lock, flags); - } - seq_printf(p, "ERR: %10lu\n", irq_err_count); + } else if (i == NR_IRQS) + seq_printf(p, "ERR: %10lu\n", irq_err_count); return 0; } --- diff/arch/x86_64/Kconfig 2003-10-09 09:47:33.000000000 +0100 +++ source/arch/x86_64/Kconfig 2003-11-26 10:09:05.000000000 +0000 @@ -371,6 +371,12 @@ turn this on, unless you're 100% sure that you don't have any 32-bit programs left. +config IA32_AOUT + bool "IA32 a.out support" + depends on IA32_EMULATION + help + Support old a.out binaries in the 32bit emulation. + config COMPAT bool depends on IA32_EMULATION @@ -505,6 +511,7 @@ Normally you should say N. config IOMMU_DEBUG + depends on GART_IOMMU && DEBUG_KERNEL bool "Force IOMMU to on" help Force the IOMMU to on even when you have less than 4GB of memory and add --- diff/arch/x86_64/boot/compressed/head.S 2003-10-09 09:47:16.000000000 +0100 +++ source/arch/x86_64/boot/compressed/head.S 2003-11-26 10:09:05.000000000 +0000 @@ -26,6 +26,7 @@ .code32 .text +#define IN_BOOTLOADER #include <linux/linkage.h> #include <asm/segment.h> --- diff/arch/x86_64/boot/compressed/misc.c 2003-10-09 09:47:16.000000000 +0100 +++ source/arch/x86_64/boot/compressed/misc.c 2003-11-26 10:09:05.000000000 +0000 @@ -9,6 +9,7 @@ * High loaded stuff by Hans Lermen & Werner Almesberger, Feb. 1996 */ +#define IN_BOOTLOADER #include "miscsetup.h" #include <asm/io.h> --- diff/arch/x86_64/defconfig 2003-09-30 15:46:12.000000000 +0100 +++ source/arch/x86_64/defconfig 2003-11-26 10:09:05.000000000 +0000 @@ -59,7 +59,6 @@ CONFIG_X86_IO_APIC=y CONFIG_X86_LOCAL_APIC=y CONFIG_MTRR=y -# CONFIG_HUGETLB_PAGE is not set CONFIG_SMP=y # CONFIG_PREEMPT is not set CONFIG_K8_NUMA=y @@ -79,9 +78,9 @@ # # ACPI (Advanced Configuration and Power Interface) Support # -# CONFIG_ACPI_HT is not set CONFIG_ACPI=y CONFIG_ACPI_BOOT=y +CONFIG_ACPI_INTERPRETER=y CONFIG_ACPI_SLEEP=y CONFIG_ACPI_SLEEP_PROC_FS=y CONFIG_ACPI_AC=y @@ -94,11 +93,29 @@ CONFIG_ACPI_TOSHIBA=y CONFIG_ACPI_DEBUG=y CONFIG_ACPI_BUS=y -CONFIG_ACPI_INTERPRETER=y CONFIG_ACPI_EC=y CONFIG_ACPI_POWER=y CONFIG_ACPI_PCI=y CONFIG_ACPI_SYSTEM=y +CONFIG_ACPI_RELAXED_AML=y + +# +# CPU Frequency scaling +# +CONFIG_CPU_FREQ=y +CONFIG_CPU_FREQ_PROC_INTF=y +CONFIG_CPU_FREQ_DEFAULT_GOV_PERFORMANCE=y +# CONFIG_CPU_FREQ_DEFAULT_GOV_USERSPACE is not set +CONFIG_CPU_FREQ_GOV_PERFORMANCE=y +CONFIG_CPU_FREQ_GOV_POWERSAVE=y +CONFIG_CPU_FREQ_GOV_USERSPACE=y +# CONFIG_CPU_FREQ_24_API is not set +CONFIG_CPU_FREQ_TABLE=y + +# +# CPUFreq processor drivers +# +CONFIG_X86_POWERNOW_K8=y # # Bus options (PCI etc.) 
@@ -246,6 +263,7 @@ # CONFIG_SCSI_AIC79XX is not set # CONFIG_SCSI_ADVANSYS is not set # CONFIG_SCSI_MEGARAID is not set +# CONFIG_SCSI_SATA is not set # CONFIG_SCSI_BUSLOGIC is not set # CONFIG_SCSI_CPQFCTS is not set # CONFIG_SCSI_DMX3191D is not set @@ -325,7 +343,9 @@ # CONFIG_IP_SCTP is not set # CONFIG_ATM is not set # CONFIG_VLAN_8021Q is not set -# CONFIG_LLC is not set +# CONFIG_LLC2 is not set +# CONFIG_IPX is not set +# CONFIG_ATALK is not set # CONFIG_X25 is not set # CONFIG_LAPB is not set # CONFIG_NET_DIVERT is not set @@ -358,7 +378,7 @@ # Ethernet (10 or 100Mbit) # CONFIG_NET_ETHERNET=y -# CONFIG_MII is not set +CONFIG_MII=y # CONFIG_HAPPYMEAL is not set # CONFIG_SUNGEM is not set # CONFIG_NET_VENDOR_3COM is not set @@ -388,7 +408,6 @@ # CONFIG_SIS900 is not set # CONFIG_EPIC100 is not set # CONFIG_SUNDANCE is not set -# CONFIG_TLAN is not set # CONFIG_VIA_RHINE is not set # @@ -421,10 +440,10 @@ # CONFIG_NET_RADIO is not set # -# Token Ring devices (depends on LLC=y) +# Token Ring devices # +# CONFIG_TR is not set # CONFIG_NET_FC is not set -# CONFIG_RCPCI is not set # CONFIG_SHAPER is not set # @@ -443,6 +462,11 @@ # CONFIG_IRDA is not set # +# Bluetooth support +# +# CONFIG_BT is not set + +# # ISDN subsystem # # CONFIG_ISDN_BOOL is not set @@ -485,6 +509,7 @@ # CONFIG_KEYBOARD_NEWTON is not set CONFIG_INPUT_MOUSE=y CONFIG_MOUSE_PS2=y +# CONFIG_MOUSE_PS2_SYNAPTICS is not set # CONFIG_MOUSE_SERIAL is not set # CONFIG_INPUT_JOYSTICK is not set # CONFIG_INPUT_TOUCHSCREEN is not set @@ -504,6 +529,7 @@ CONFIG_SERIAL_8250=y CONFIG_SERIAL_8250_CONSOLE=y # CONFIG_SERIAL_8250_ACPI is not set +CONFIG_SERIAL_8250_NR_UARTS=4 # CONFIG_SERIAL_8250_EXTENDED is not set # @@ -520,7 +546,11 @@ # CONFIG_I2C is not set # -# I2C Hardware Sensors Mainboard support +# I2C Algorithms +# + +# +# I2C Hardware Bus support # # @@ -549,7 +579,6 @@ # CONFIG_DTLK is not set # CONFIG_R3964 is not set # CONFIG_APPLICOM is not set -# CONFIG_SONYPI is not set # # Ftape, the floppy tape device driver @@ -559,6 +588,7 @@ # CONFIG_DRM is not set # CONFIG_MWAVE is not set CONFIG_RAW_DRIVER=y +CONFIG_MAX_RAW_DEVS=256 CONFIG_HANGCHECK_TIMER=y # @@ -619,10 +649,13 @@ # Pseudo filesystems # CONFIG_PROC_FS=y +CONFIG_PROC_KCORE=y # CONFIG_DEVFS_FS is not set CONFIG_DEVPTS_FS=y # CONFIG_DEVPTS_FS_XATTR is not set CONFIG_TMPFS=y +CONFIG_HUGETLBFS=y +CONFIG_HUGETLB_PAGE=y CONFIG_RAMFS=y # @@ -647,6 +680,7 @@ CONFIG_NFS_FS=y CONFIG_NFS_V3=y # CONFIG_NFS_V4 is not set +# CONFIG_NFS_DIRECTIO is not set CONFIG_NFSD=y CONFIG_NFSD_V3=y # CONFIG_NFSD_V4 is not set @@ -707,13 +741,15 @@ # CONFIG_SOUND_MAESTRO is not set # CONFIG_SOUND_MAESTRO3 is not set CONFIG_SOUND_ICH=y -# CONFIG_SOUND_RME96XX is not set # CONFIG_SOUND_SONICVIBES is not set # CONFIG_SOUND_TRIDENT is not set +# CONFIG_SOUND_MSNDCLAS is not set +# CONFIG_SOUND_MSNDPIN is not set # CONFIG_SOUND_VIA82CXXX is not set # CONFIG_SOUND_OSS is not set # CONFIG_SOUND_ALI5455 is not set # CONFIG_SOUND_FORTE is not set +# CONFIG_SOUND_RME96XX is not set # CONFIG_SOUND_AD1980 is not set # @@ -723,11 +759,6 @@ # CONFIG_USB_GADGET is not set # -# Bluetooth support -# -# CONFIG_BT is not set - -# # Profiling support # CONFIG_PROFILING=y @@ -743,8 +774,7 @@ # CONFIG_INIT_DEBUG is not set # CONFIG_DEBUG_INFO is not set # CONFIG_FRAME_POINTER is not set -CONFIG_IOMMU_DEBUG=y -CONFIG_IOMMU_LEAK=y +# CONFIG_IOMMU_DEBUG is not set CONFIG_MCE_DEBUG=y # @@ -760,4 +790,4 @@ # # Library routines # -# CONFIG_CRC32 is not set +CONFIG_CRC32=y --- diff/arch/x86_64/ia32/Makefile 
2003-09-30 15:46:12.000000000 +0100 +++ source/arch/x86_64/ia32/Makefile 2003-11-26 10:09:05.000000000 +0000 @@ -6,6 +6,8 @@ ia32_signal.o tls32.o \ ia32_binfmt.o fpu32.o ptrace32.o ipc32.o syscall32.o +obj-$(CONFIG_IA32_AOUT) += ia32_aout.o + $(obj)/syscall32.o: $(src)/syscall32.c $(obj)/vsyscall.so # Teach kbuild about targets --- diff/arch/x86_64/ia32/ia32_signal.c 2003-09-17 12:28:03.000000000 +0100 +++ source/arch/x86_64/ia32/ia32_signal.c 2003-11-26 10:09:05.000000000 +0000 @@ -173,6 +173,9 @@ { unsigned int err = 0; + /* Always make any pending restarted system calls return -EINTR */ + current_thread_info()->restart_block.fn = do_no_restart_syscall; + #if DEBUG_SIG printk("SIG restore_sigcontext: sc=%p err(%x) eip(%x) cs(%x) flg(%x)\n", sc, sc->err, sc->eip, sc->cs, sc->eflags); --- diff/arch/x86_64/ia32/ia32entry.S 2003-09-17 12:28:03.000000000 +0100 +++ source/arch/x86_64/ia32/ia32entry.S 2003-11-26 10:09:05.000000000 +0000 @@ -221,7 +221,7 @@ .quad sys_chmod /* 15 */ .quad sys_lchown16 .quad ni_syscall /* old break syscall holder */ - .quad ni_syscall /* (old)stat */ + .quad sys_stat .quad sys32_lseek .quad sys_getpid /* 20 */ .quad sys_mount /* mount */ @@ -231,7 +231,7 @@ .quad sys_stime /* stime */ /* 25 */ .quad sys32_ptrace /* ptrace */ .quad sys_alarm /* XXX sign extension??? */ - .quad ni_syscall /* (old)fstat */ + .quad sys_fstat /* (old)fstat */ .quad sys_pause .quad compat_sys_utime /* 30 */ .quad ni_syscall /* old stty syscall holder */ @@ -287,7 +287,7 @@ .quad sys_setgroups16 .quad sys32_old_select .quad sys_symlink - .quad ni_syscall /* (old)lstat */ + .quad sys_lstat .quad sys_readlink /* 85 */ .quad sys_uselib .quad sys_swapon @@ -330,10 +330,10 @@ .quad sys32_adjtimex .quad sys32_mprotect /* 125 */ .quad compat_sys_sigprocmask - .quad sys32_module_warning /* create_module */ + .quad quiet_ni_syscall /* create_module */ .quad sys_init_module .quad sys_delete_module - .quad sys32_module_warning /* 130 get_kernel_syms */ + .quad quiet_ni_syscall /* 130 get_kernel_syms */ .quad ni_syscall /* quotactl */ .quad sys_getpgid .quad sys_fchdir @@ -396,8 +396,8 @@ .quad stub32_vfork /* 190 */ .quad compat_sys_getrlimit .quad sys32_mmap2 - .quad sys_truncate - .quad sys_ftruncate + .quad sys32_truncate64 + .quad sys32_ftruncate64 .quad sys32_stat64 /* 195 */ .quad sys32_lstat64 .quad sys32_fstat64 --- diff/arch/x86_64/ia32/sys_ia32.c 2003-10-09 09:47:33.000000000 +0100 +++ source/arch/x86_64/ia32/sys_ia32.c 2003-11-26 10:09:05.000000000 +0000 @@ -110,6 +110,22 @@ return 0; } +extern long sys_truncate(char *, loff_t); +extern long sys_ftruncate(int, loff_t); + +asmlinkage long +sys32_truncate64(char * filename, unsigned long offset_low, unsigned long offset_high) +{ + return sys_truncate(filename, ((loff_t) offset_high << 32) | offset_low); +} + + +asmlinkage long +sys32_ftruncate64(unsigned int fd, unsigned long offset_low, unsigned long offset_high) +{ + return sys_ftruncate(fd, ((loff_t) offset_high << 32) | offset_low); +} + /* Another set for IA32/LFS -- x86_64 struct stat is different due to support for 64bit inode numbers. 
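   (Editor's aside, not in the original patch: the sys32_truncate64/sys32_ftruncate64 wrappers above reassemble the 64-bit length that the 32-bit ABI passes as two registers; e.g. offset_low = 0 with offset_high = 1 gives ((loff_t)1 << 32) | 0 = 4 GiB.)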
*/ @@ -1817,13 +1833,6 @@ } #endif -long sys32_module_warning(void) -{ - printk(KERN_INFO "%s: 32bit 2.4.x modutils not supported on 64bit kernel\n", - current->comm); - return -ENOSYS ; -} - extern long sys_io_setup(unsigned nr_reqs, aio_context_t *ctx); long sys32_io_setup(unsigned nr_reqs, u32 *ctx32p) @@ -1989,12 +1998,16 @@ long sys32_vm86_warning(void) { + struct task_struct *me = current; + static char lastcomm[8]; + if (strcmp(lastcomm, me->comm)) { printk(KERN_INFO "%s: vm86 mode not supported on 64 bit kernel\n", - current->comm); - return -ENOSYS ; + me->comm); + strcpy(lastcomm, me->comm); + } + return -ENOSYS; } - struct exec_domain ia32_exec_domain = { .name = "linux/x86", .pers_low = PER_LINUX32, --- diff/arch/x86_64/ia32/syscall32.c 2003-06-09 14:18:18.000000000 +0100 +++ source/arch/x86_64/ia32/syscall32.c 2003-11-26 10:09:05.000000000 +0000 @@ -30,10 +30,12 @@ int map_syscall32(struct mm_struct *mm, unsigned long address) { pte_t *pte; + pmd_t *pmd; int err = 0; + down_read(&mm->mmap_sem); spin_lock(&mm->page_table_lock); - pmd_t *pmd = pmd_alloc(mm, pgd_offset(mm, address), address); + pmd = pmd_alloc(mm, pgd_offset(mm, address), address); if (pmd && (pte = pte_alloc_map(mm, pmd, address)) != NULL) { if (pte_none(*pte)) { set_pte(pte, --- diff/arch/x86_64/kernel/Makefile 2003-10-09 09:47:33.000000000 +0100 +++ source/arch/x86_64/kernel/Makefile 2003-11-26 10:09:05.000000000 +0000 @@ -18,13 +18,16 @@ obj-$(CONFIG_X86_IO_APIC) += io_apic.o mpparse.o obj-$(CONFIG_PM) += suspend.o obj-$(CONFIG_SOFTWARE_SUSPEND) += suspend_asm.o +obj-$(CONFIG_CPU_FREQ) += cpufreq/ obj-$(CONFIG_EARLY_PRINTK) += early_printk.o obj-$(CONFIG_GART_IOMMU) += pci-gart.o aperture.o obj-$(CONFIG_DUMMY_IOMMU) += pci-nommu.o pci-dma.o obj-$(CONFIG_MODULES) += module.o +obj-y += topology.o + bootflag-y += ../../i386/kernel/bootflag.o -cpuid-$(CONFIG_X86_CPUID) += ../../i386/kernel/cpuid.o +cpuid-$(subst m,y,$(CONFIG_X86_CPUID)) += ../../i386/kernel/cpuid.o +topology-y += ../../i386/mach-default/topology.o -obj-$(CONFIG_CPU_FREQ) += cpufreq/ --- diff/arch/x86_64/kernel/acpi/boot.c 2003-11-25 15:24:57.000000000 +0000 +++ source/arch/x86_64/kernel/acpi/boot.c 2003-11-26 10:09:05.000000000 +0000 @@ -253,6 +253,70 @@ #ifdef CONFIG_ACPI_BUS /* + * "acpi_pic_sci=level" (current default) + * programs the PIC-mode SCI to Level Trigger. + * (NO-OP if the BIOS set Level Trigger already) + * + * If a PIC-mode SCI is not recognized or gives spurious IRQ7's + * it may require Edge Trigger -- use "acpi_pic_sci=edge" + * (NO-OP if the BIOS set Edge Trigger already) + * + * Port 0x4d0-4d1 are ECLR1 and ECLR2, the Edge/Level Control Registers + * for the 8259 PIC. bit[n] = 1 means irq[n] is Level, otherwise Edge. 
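+ * Worked example (editor's addition, not in the original patch): for SCI = IRQ 9, port = 0x4d0 + (9 >> 3) = 0x4d1 and mask = 1 << (9 & 7) = 0x02, so forcing Level trigger is outb(inb(0x4d1) | 0x02, 0x4d1) and forcing Edge clears the same bit.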
+ * ECLR1 is IRQ's 0-7 (IRQ 0, 1, 2 must be 0) + * ECLR2 is IRQ's 8-15 (IRQ 8, 13 must be 0) + */ + +static int __initdata acpi_pic_sci_trigger; /* 0: level, 1: edge */ + +void __init +acpi_pic_sci_set_trigger(unsigned int irq) +{ + unsigned char mask = 1 << (irq & 7); + unsigned int port = 0x4d0 + (irq >> 3); + unsigned char val = inb(port); + + + printk(PREFIX "IRQ%d SCI:", irq); + if (!(val & mask)) { + printk(" Edge"); + + if (!acpi_pic_sci_trigger) { + printk(" set to Level"); + outb(val | mask, port); + } + } else { + printk(" Level"); + + if (acpi_pic_sci_trigger) { + printk(" set to Edge"); + outb(val & ~mask, port); + } + } + printk(" Trigger.\n"); +} + +int __init +acpi_pic_sci_setup(char *str) +{ + while (str && *str) { + if (strncmp(str, "level", 5) == 0) + acpi_pic_sci_trigger = 0; /* force level trigger */ + if (strncmp(str, "edge", 4) == 0) + acpi_pic_sci_trigger = 1; /* force edge trigger */ + str = strchr(str, ','); + if (str) + str += strspn(str, ", \t"); + } + return 1; +} + +__setup("acpi_pic_sci=", acpi_pic_sci_setup); + +#endif /* CONFIG_ACPI_BUS */ + +#ifdef CONFIG_ACPI_BUS +/* * Set specified PIC IRQ to level triggered mode. * * Port 0x4d0-4d1 are ECLR1 and ECLR2, the Edge/Level Control Registers --- diff/arch/x86_64/kernel/acpi/sleep.c 2003-05-21 11:50:00.000000000 +0100 +++ source/arch/x86_64/kernel/acpi/sleep.c 2003-11-26 10:09:05.000000000 +0000 @@ -56,6 +56,7 @@ /* address in low memory of the wakeup routine. */ unsigned long acpi_wakeup_address = 0; +unsigned long acpi_video_flags; extern char wakeup_start, wakeup_end; extern unsigned long FASTCALL(acpi_copy_wakeup_routine(unsigned long)); @@ -116,6 +117,22 @@ printk(KERN_DEBUG "ACPI: have wakeup address 0x%8.8lx\n", acpi_wakeup_address); } +static int __init acpi_sleep_setup(char *str) +{ + while ((str != NULL) && (*str != '\0')) { + if (strncmp(str, "s3_bios", 7) == 0) + acpi_video_flags = 1; + if (strncmp(str, "s3_mode", 7) == 0) + acpi_video_flags |= 2; + str = strchr(str, ','); + if (str != NULL) + str += strspn(str, ", \t"); + } + return 1; +} + +__setup("acpi_sleep=", acpi_sleep_setup); + #endif /*CONFIG_ACPI_SLEEP*/ void acpi_pci_link_exit(void) {} --- diff/arch/x86_64/kernel/acpi/wakeup.S 2003-07-08 09:55:18.000000000 +0100 +++ source/arch/x86_64/kernel/acpi/wakeup.S 2003-11-26 10:09:05.000000000 +0000 @@ -41,7 +41,19 @@ cmpl $0x12345678, %eax jne bogus_real_magic + testl $1, video_flags - wakeup_code + jz 1f lcall $0xc000,$3 + movw %cs, %ax + movw %ax, %ds # Bios might have played with that + movw %ax, %ss +1: + + testl $2, video_flags - wakeup_code + jz 1f + mov video_mode - wakeup_code, %ax + call mode_seta +1: movw $0xb800, %ax movw %ax,%fs @@ -250,6 +262,7 @@ .quad 0 real_magic: .quad 0 video_mode: .quad 0 +video_flags: .quad 0 bogus_real_magic: movb $0xba,%al ; outb %al,$0x80 @@ -382,8 +395,10 @@ movl %eax, saved_efer movl %edx, saved_efer2 -# movq saved_videomode, %rdx # FIXME: videomode - movq %rdx, video_mode - wakeup_start (,%rdi) + movl saved_video_mode, %edx + movl %edx, video_mode - wakeup_start (,%rdi) + movl acpi_video_flags, %edx + movl %edx, video_flags - wakeup_start (,%rdi) movq $0x12345678, real_magic - wakeup_start (,%rdi) movq $0x123456789abcdef0, %rdx movq %rdx, saved_magic @@ -415,8 +430,6 @@ .LFB5: subq $8, %rsp .LCFI2: - testl %edi, %edi - jne .L99 xorl %eax, %eax call save_processor_state --- diff/arch/x86_64/kernel/apic.c 2003-10-27 09:20:37.000000000 +0000 +++ source/arch/x86_64/kernel/apic.c 2003-11-26 10:09:05.000000000 +0000 @@ -42,6 +42,8 @@ static DEFINE_PER_CPU(int, 
prof_old_multiplier) = 1; static DEFINE_PER_CPU(int, prof_counter) = 1; +static void apic_pm_activate(void); + void enable_NMI_through_LVT0 (void * dummy) { unsigned int v, ver; @@ -435,6 +437,7 @@ if (nmi_watchdog == NMI_LOCAL_APIC) setup_apic_nmi_watchdog(); + apic_pm_activate(); } #ifdef CONFIG_PM @@ -556,7 +559,7 @@ #else /* CONFIG_PM */ -static inline void apic_pm_activate(void) { } +static void apic_pm_activate(void) { } #endif /* CONFIG_PM */ @@ -579,7 +582,6 @@ if (nmi_watchdog != NMI_NONE) nmi_watchdog = NMI_LOCAL_APIC; - apic_pm_activate(); return 0; } --- diff/arch/x86_64/kernel/bluesmoke.c 2003-08-20 14:16:26.000000000 +0100 +++ source/arch/x86_64/kernel/bluesmoke.c 2003-11-26 10:09:05.000000000 +0000 @@ -200,11 +200,14 @@ static void check_k8_nb(int header) { struct pci_dev *nb; + u32 statuslow, statushigh; + unsigned short errcode; + int i; + nb = find_k8_nb(); if (nb == NULL) return; - u32 statuslow, statushigh; pci_read_config_dword(nb, 0x48, &statuslow); pci_read_config_dword(nb, 0x4c, &statushigh); if (!(statushigh & (1<<31))) @@ -215,50 +218,42 @@ printk(KERN_ERR "Northbridge status %08x%08x\n", statushigh,statuslow); - unsigned short errcode = statuslow & 0xffff; - switch (errcode >> 8) { - case 0: + printk(KERN_ERR " Error %s\n", extendederr[(statuslow >> 16) & 0xf]); + + errcode = statuslow & 0xffff; + switch ((statuslow >> 16) & 0xF) { + case 5: printk(KERN_ERR " GART TLB error %s %s\n", transaction[(errcode >> 2) & 3], cachelevel[errcode & 3]); break; - case 1: - if (errcode & (1<<11)) { - printk(KERN_ERR " bus error %s %s %s %s %s\n", - partproc[(errcode >> 10) & 0x3], - timeout[(errcode >> 9) & 1], + case 8: + printk(KERN_ERR " ECC error syndrome %x\n", + (((statuslow >> 24) & 0xff) << 8) | ((statushigh >> 15) & 0x7f)); + /*FALL THROUGH*/ + default: + printk(KERN_ERR " bus error %s, %s\n %s\n %s, %s\n", + partproc[(errcode >> 9) & 0x3], + timeout[(errcode >> 8) & 1], memtrans[(errcode >> 4) & 0xf], memoryio[(errcode >> 2) & 0x3], cachelevel[(errcode & 0x3)]); - } else if (errcode & (1<<8)) { - printk(KERN_ERR " memory error %s %s %s\n", - memtrans[(errcode >> 4) & 0xf], - transaction[(errcode >> 2) & 0x3], - cachelevel[(errcode & 0x3)]); - } else { - printk(KERN_ERR " unknown error code %x\n", errcode); - } - break; - } - if (statushigh & ((1<<14)|(1<<13))) - printk(KERN_ERR " ECC syndrome bits %x\n", - (((statuslow >> 24) & 0xff) << 8) | ((statushigh >> 15) & 0x7f)); - errcode = (statuslow >> 16) & 0xf; - printk(KERN_ERR " extended error %s\n", extendederr[(statuslow >> 16) & 0xf]); - /* should only print when it was a HyperTransport related error. 
*/ printk(KERN_ERR " link number %x\n", (statushigh >> 4) & 3); + break; + } - int i; - for (i = 0; i < 32; i++) + for (i = 0; i < 32; i++) { + if (i == 26 || i == 28) + continue; if (highbits[i] && (statushigh & (1<<i))) printk(KERN_ERR " %s\n", highbits[i]); - + } if (statushigh & (1<<26)) { u32 addrhigh, addrlow; pci_read_config_dword(nb, 0x54, &addrhigh); pci_read_config_dword(nb, 0x50, &addrlow); - printk(KERN_ERR " error address %08x%08x\n", addrhigh,addrlow); + printk(KERN_ERR " NB error address %08x%08x\n", addrhigh,addrlow); } statushigh &= ~(1<<31); pci_write_config_dword(nb, 0x4c, statushigh); @@ -307,9 +302,6 @@ wrmsrl(MSR_IA32_MC0_STATUS+4*4, 0); wrmsrl(MSR_IA32_MCG_STATUS, 0); - if (regs && (status & (1<<1))) - printk(KERN_EMERG "MCE at RIP %lx RSP %lx\n", regs->rip, regs->rsp); - others: generic_machine_check(regs, error_code); @@ -373,6 +365,9 @@ wrmsrl(MSR_IA32_MC0_STATUS+4*i,0); } + if (cap & (1<<8)) + wrmsrl(MSR_IA32_MCG_CTL, 0xffffffffffffffffULL); + set_in_cr4(X86_CR4_MCE); if (mcheck_interval && (smp_processor_id() == 0)) { --- diff/arch/x86_64/kernel/e820.c 2003-06-09 14:18:18.000000000 +0100 +++ source/arch/x86_64/kernel/e820.c 2003-11-26 10:09:05.000000000 +0000 @@ -2,10 +2,6 @@ * Handle the memory map. * The functions here do the job until bootmem takes over. * $Id: 00001.patch,v 1.1 2004/02/16 16:51:56 agk Exp $ - - * AK: some of these functions are not used in 2.5 yet but they will be when - * NUMA is completely merged. - */ #include <linux/config.h> #include <linux/kernel.h> --- diff/arch/x86_64/kernel/head.S 2003-08-20 14:16:26.000000000 +0100 +++ source/arch/x86_64/kernel/head.S 2003-11-26 10:09:05.000000000 +0000 @@ -38,6 +38,9 @@ movl %ebx,%ebp /* Save trampoline flag */ + movl $__KERNEL_DS,%eax + movl %eax,%ds + /* If the CPU doesn't support CPUID this will double fault. * Unfortunately it is hard to check for CPUID without a stack. */ @@ -114,25 +117,11 @@ movl $(pGDT32 - __START_KERNEL_map), %eax lgdt (%eax) +second: movl $(ljumpvector - __START_KERNEL_map), %eax /* Finally jump in 64bit mode */ ljmp *(%eax) -second: - /* abuse syscall to get into 64bit mode. this way we don't need - a working low identity mapping just for the short 32bit roundtrip. - XXX kludge. this should not be needed. */ - movl $MSR_STAR,%ecx - xorl %eax,%eax - movl $(__USER32_CS<<16)|__KERNEL_CS,%edx - wrmsr - - movl $MSR_CSTAR,%ecx - movl $0xffffffff,%edx - movl $0x80100100,%eax # reach_long64 absolute - wrmsr - syscall - .code64 .org 0x100 reach_long64: --- diff/arch/x86_64/kernel/io_apic.c 2003-11-25 15:24:57.000000000 +0000 +++ source/arch/x86_64/kernel/io_apic.c 2003-11-26 10:09:05.000000000 +0000 @@ -147,6 +147,13 @@ struct IO_APIC_route_entry entry; unsigned long flags; + /* Check delivery_mode to be sure we're not clearing an SMI pin */ + spin_lock_irqsave(&ioapic_lock, flags); + *(((int*)&entry) + 0) = io_apic_read(apic, 0x10 + 2 * pin); + *(((int*)&entry) + 1) = io_apic_read(apic, 0x11 + 2 * pin); + spin_unlock_irqrestore(&ioapic_lock, flags); + if (entry.delivery_mode == dest_SMI) + return; /* * Disable it in the IO-APIC irq-routing table: */ @@ -1087,8 +1094,6 @@ unsigned char old_id; unsigned long flags; - if (acpi_ioapic) return; /* ACPI does that already */ - /* * Set the IOAPIC ID to the value stored in the MPC table. 
*/ @@ -1212,8 +1217,6 @@ */ #define enable_edge_ioapic_irq unmask_IO_APIC_irq -static void disable_edge_ioapic_irq (unsigned int irq) { /* nothing */ } - /* * Starting up a edge-triggered IO-APIC interrupt is * nasty - we need to make sure that we get the edge. @@ -1256,9 +1259,6 @@ ack_APIC_irq(); } -static void end_edge_ioapic_irq (unsigned int i) { /* nothing */ } - - /* * Level triggered interrupts can just be masked, * and shutting down and starting up the interrupt @@ -1343,8 +1343,6 @@ } } -static void mask_and_ack_level_ioapic_irq (unsigned int irq) { /* nothing */ } - static void set_ioapic_affinity (unsigned int irq, cpumask_t mask) { unsigned long flags; @@ -1673,12 +1671,14 @@ /* * Set up the IO-APIC IRQ routing table. */ - setup_ioapic_ids_from_mpc(); + if (!acpi_ioapic) + setup_ioapic_ids_from_mpc(); sync_Arb_IDs(); setup_IO_APIC_irqs(); init_IO_APIC_traps(); check_timer(); - print_IO_APIC(); + if (!acpi_ioapic) + print_IO_APIC(); } /* Ensure the ACPI SCI interrupt level is active low, edge-triggered */ --- diff/arch/x86_64/kernel/irq.c 2003-10-09 09:47:33.000000000 +0100 +++ source/arch/x86_64/kernel/irq.c 2003-11-26 10:09:05.000000000 +0000 @@ -138,17 +138,19 @@ int show_interrupts(struct seq_file *p, void *v) { - int i, j; + int i = *(int *) v, j; struct irqaction * action; unsigned long flags; - seq_printf(p, " "); - for (j=0; j<NR_CPUS; j++) - if (cpu_online(j)) - seq_printf(p, "CPU%d ",j); - seq_putc(p, '\n'); + if (i == 0) { + seq_printf(p, " "); + for (j=0; j<NR_CPUS; j++) + if (cpu_online(j)) + seq_printf(p, "CPU%d ",j); + seq_putc(p, '\n'); + } - for (i = 0 ; i < NR_IRQS ; i++) { + if (i < NR_IRQS) { spin_lock_irqsave(&irq_desc[i].lock, flags); action = irq_desc[i].action; if (!action) @@ -170,25 +172,26 @@ seq_putc(p, '\n'); skip: spin_unlock_irqrestore(&irq_desc[i].lock, flags); - } - seq_printf(p, "NMI: "); - for (j = 0; j < NR_CPUS; j++) - if (cpu_online(j)) - seq_printf(p, "%10u ", cpu_pda[j].__nmi_count); - seq_putc(p, '\n'); + } else if (i == NR_IRQS) { + seq_printf(p, "NMI: "); + for (j = 0; j < NR_CPUS; j++) + if (cpu_online(j)) + seq_printf(p, "%10u ", cpu_pda[j].__nmi_count); + seq_putc(p, '\n'); #ifdef CONFIG_X86_LOCAL_APIC - seq_printf(p, "LOC: "); - for (j = 0; j < NR_CPUS; j++) - if (cpu_online(j)) - seq_printf(p, "%10u ", cpu_pda[j].apic_timer_irqs); - seq_putc(p, '\n'); + seq_printf(p, "LOC: "); + for (j = 0; j < NR_CPUS; j++) + if (cpu_online(j)) + seq_printf(p, "%10u ", cpu_pda[j].apic_timer_irqs); + seq_putc(p, '\n'); #endif - seq_printf(p, "ERR: %10u\n", atomic_read(&irq_err_count)); + seq_printf(p, "ERR: %10u\n", atomic_read(&irq_err_count)); #ifdef CONFIG_X86_IO_APIC #ifdef APIC_MISMATCH_DEBUG - seq_printf(p, "MIS: %10u\n", atomic_read(&irq_mis_count)); + seq_printf(p, "MIS: %10u\n", atomic_read(&irq_mis_count)); #endif #endif + } return 0; } @@ -850,18 +853,13 @@ static int irq_affinity_read_proc (char *page, char **start, off_t off, int count, int *eof, void *data) { - int k, len; - cpumask_t tmp = irq_affinity[(long)data]; + int len; if (count < HEX_DIGITS+1) return -EINVAL; - for (k = len = 0; k < sizeof(cpumask_t)/sizeof(u16); ++k) { - int j = sprintf(page, "%04hx", (u16)cpus_coerce(tmp)); - len += j; - page += j; - cpus_shift_right(tmp, tmp, 16); - } + len = format_cpumask(page, irq_affinity[(long)data]); + page += len; len += sprintf(page, "\n"); return len; } @@ -897,19 +895,14 @@ static int prof_cpu_mask_read_proc (char *page, char **start, off_t off, int count, int *eof, void *data) { - cpumask_t tmp, *mask = (cpumask_t *) data; - 
int k, len; + cpumask_t *mask = (cpumask_t *)data; + int len; if (count < HEX_DIGITS+1) return -EINVAL; - tmp = *mask; - for (k = len = 0; k < sizeof(cpumask_t)/sizeof(u16); ++k) { - int j = sprintf(page, "%04hx", (u16)cpus_coerce(tmp)); - len += j; - page += j; - cpus_shift_right(tmp, tmp, 16); - } + len = format_cpumask(page, *mask); + page += len; len += sprintf(page, "\n"); return len; } --- diff/arch/x86_64/kernel/mpparse.c 2003-09-30 15:46:12.000000000 +0100 +++ source/arch/x86_64/kernel/mpparse.c 2003-11-26 10:09:05.000000000 +0000 @@ -950,6 +950,8 @@ entry->irq); } + print_IO_APIC(); + return; } --- diff/arch/x86_64/kernel/pci-gart.c 2003-11-25 15:24:57.000000000 +0000 +++ source/arch/x86_64/kernel/pci-gart.c 2003-11-26 10:09:05.000000000 +0000 @@ -31,6 +31,10 @@ #include <asm/kdebug.h> #include <asm/proto.h> +/* Workarounds for specific drivers */ +#define FUSION_WORKAROUND 1 +#define THREEWARE_WORKAROUND 1 + dma_addr_t bad_dma_address; unsigned long iommu_bus_base; /* GART remapping area (physical) */ @@ -44,12 +48,13 @@ #ifdef CONFIG_IOMMU_DEBUG int panic_on_overflow = 1; int force_iommu = 1; -int sac_force_size = 0; #else -int panic_on_overflow = 1; /* for testing */ +int panic_on_overflow = 0; int force_iommu = 0; -int sac_force_size = 256*1024*1024; #endif +int iommu_merge = 1; +int iommu_sac_force = 0; +int iommu_fullflush = 0; /* Allocation bitmap for the remapping area */ static spinlock_t iommu_bitmap_lock = SPIN_LOCK_UNLOCKED; @@ -137,10 +142,15 @@ /* recheck flush count inside lock */ if (need_flush) { for (i = 0; northbridges[i]; i++) { + u32 w; if (bus >= 0 && !(cpu_isset_const(i, bus_cpumask))) continue; pci_write_config_dword(northbridges[i], 0x9c, northbridge_flush_word[i] | 1); + /* Make sure the hardware actually executed the flush. */ + do { + pci_read_config_dword(northbridges[i], 0x9c, &w); + } while (w & 1); flushed++; } if (!flushed) @@ -152,6 +162,8 @@ static inline void flush_gart(struct pci_dev *dev) { + if (iommu_fullflush) + need_flush = 1; if (need_flush) __flush_gart(dev); } @@ -174,11 +186,16 @@ } else { dma_mask = hwdev->consistent_dma_mask; } + if (dma_mask == 0) dma_mask = 0xffffffff; if (dma_mask < 0xffffffff || no_iommu) gfp |= GFP_DMA; + /* Kludge to make it bug-to-bug compatible with i386. i386 + uses the normal dma_mask for alloc_consistent. */ + dma_mask &= hwdev->dma_mask; + memory = (void *)__get_free_pages(gfp, get_order(size)); if (memory == NULL) { return NULL; @@ -381,6 +398,16 @@ return nents; } +static void dump_sg(struct scatterlist *sg, int stopat) +{ + int k; + for (k = 0; k < stopat; k++) + printk(KERN_EMERG "sg[%d] page:%p dma:%lx offset:%u length:%u\n", + k, + sg[k].page, (unsigned long)sg[k].dma_address, sg[k].offset, + sg[k].length); +} + /* Map multiple scatterlist entries continuous into the first. 
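   (Editor's note, not in the original: "continuous" means the entries are remapped into one contiguous IOMMU aperture range -- every page of every entry gets a GART entry via GPTE_ENCODE(), and only the first output entry keeps a dma_address together with the summed length.)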
*/ static int __pci_map_cont(struct scatterlist *sg, int start, int stopat, struct scatterlist *sout, unsigned long pages) @@ -394,7 +421,9 @@ for (i = start; i < stopat; i++) { struct scatterlist *s = &sg[i]; - unsigned long start_addr = s->dma_address; + unsigned long pages, addr; + unsigned long phys_addr = s->dma_address; + BUG_ON(i > start && s->offset); if (i == start) { *sout = *s; @@ -403,15 +432,23 @@ } else { sout->length += s->length; } - unsigned long addr = start_addr; - while (addr < start_addr + s->length) { + + addr = phys_addr; + pages = to_pages(s->offset, s->length); + while (pages--) { iommu_gatt_base[iommu_page] = GPTE_ENCODE(addr); SET_LEAK(iommu_page); addr += PAGE_SIZE; iommu_page++; } } - BUG_ON(iommu_page - iommu_start != pages); + if (iommu_page - iommu_start != pages) { + printk(KERN_EMERG + "iommu_page:%lx iommu_start:%lx pages:%lu start:%d stopat:%d\n", + iommu_page, iommu_start, pages, start, stopat); + dump_sg(sg, stopat); + panic("IOMMU confused"); + } return 0; } @@ -458,8 +495,8 @@ if (i > start) { struct scatterlist *ps = &sg[i-1]; /* Can only merge when the last chunk ends on a page - boundary. */ - if (!force_iommu || !need || (i-1 > start && ps->offset) || + boundary and the new one doesn't have an offset. */ + if (!iommu_merge || !need || s->offset || (ps->offset + ps->length) % PAGE_SIZE) { if (pci_map_cont(sg, start, i, sg+out, pages, need) < 0) @@ -539,19 +576,28 @@ if (mask < 0x00ffffff) return 0; +#ifdef FUSION_WORKAROUND + if (dev->vendor == PCI_VENDOR_ID_LSI_LOGIC && mask > 0xffffffff) { + iommu_merge = 1; + return 0; + } +#endif +#ifdef THREEWARE_WORKAROUND + if (dev->vendor == PCI_VENDOR_ID_3WARE && mask <= 0xffffffff) + iommu_fullflush = 1; +#endif + /* Tell the device to use SAC when IOMMU force is on. This allows the driver to use cheaper accesses in some cases. Problem with this is that if we overflow the IOMMU area and return DAC as fallback address the device may not handle it correctly. - As a compromise we only do this if the IOMMU area is >= 256MB - which should make overflow unlikely enough. As a special case some controllers have a 39bit address mode that is as efficient as 32bit (aic79xx). Don't force SAC for these. Assume all masks <= 40 bits are of this type. Normally this doesn't make any difference, but gives more gentle handling of IOMMU overflow. */ - if (force_iommu && (mask > 0xffffffffffULL) && (iommu_size >= sac_force_size)){ + if (iommu_sac_force && (mask >= 0xffffffffffULL)) { printk(KERN_INFO "%s: Force SAC with mask %Lx\n", dev->slot_name,mask); return 0; } @@ -680,7 +726,7 @@ unsigned long iommu_start; struct pci_dev *dev; -#ifndef CONFIG_AGP_AMD_8151 +#ifndef CONFIG_AGP_AMD64 no_agp = 1; #else /* Makefile puts PCI initialization via subsys_initcall first. */ @@ -776,7 +822,8 @@ /* Must execute after PCI subsystem */ fs_initcall(pci_iommu_init); -/* iommu=[size][,noagp][,off][,force][,noforce][,leak][,memaper[=order]] +/* iommu=[size][,noagp][,off][,force][,noforce][,leak][,memaper[=order]][,merge] + [,forcesac][,fullflush][,nomerge] size set size of iommu (in bytes) noagp don't initialize the AGP driver and use full aperture. off don't use the IOMMU @@ -784,6 +831,10 @@ memaper[=order] allocate an own aperture over RAM with size 32MB^order. noforce don't force IOMMU usage. Default. force Force IOMMU. + merge Do SG merging. Implies force (experimental) + nomerge Don't do SG merging. 
+ forcesac Force SAC mode for masks <40bits (experimental) + fullflush Flush IOMMU on each allocation (for testing) */ __init int iommu_setup(char *opt) { @@ -809,6 +860,16 @@ panic_on_overflow = 1; if (!memcmp(p, "nopanic", 7)) panic_on_overflow = 0; + if (!memcmp(p, "merge", 5)) { + iommu_merge = 1; + force_iommu = 1; + } + if (!memcmp(p, "nomerge", 7)) + iommu_merge = 0; + if (!memcmp(p, "forcesac", 8)) + iommu_sac_force = 1; + if (!memcmp(p, "fullflush", 9)) + iommu_fullflush = 1; #ifdef CONFIG_IOMMU_LEAK if (!memcmp(p,"leak", 4)) { leak_trace = 1; --- diff/arch/x86_64/kernel/pci-nommu.c 2003-08-26 10:00:52.000000000 +0100 +++ source/arch/x86_64/kernel/pci-nommu.c 2003-11-26 10:09:05.000000000 +0000 @@ -4,6 +4,8 @@ #include <linux/string.h> #include <asm/proto.h> +int iommu_merge = 0; + /* * Dummy IO MMU functions */ --- diff/arch/x86_64/kernel/setup.c 2003-10-27 09:20:37.000000000 +0000 +++ source/arch/x86_64/kernel/setup.c 2003-11-26 10:09:05.000000000 +0000 @@ -332,6 +332,7 @@ void __init setup_arch(char **cmdline_p) { unsigned long low_mem_size; + unsigned long kernel_end; ROOT_DEV = old_decode_dev(ORIG_ROOT_DEV); drive_info = DRIVE_INFO; @@ -380,7 +381,6 @@ (table_end - table_start) << PAGE_SHIFT); /* reserve kernel */ - unsigned long kernel_end; kernel_end = round_up(__pa_symbol(&_end),PAGE_SIZE); reserve_bootmem_generic(HIGH_MEMORY, kernel_end - HIGH_MEMORY); --- diff/arch/x86_64/kernel/signal.c 2003-06-30 10:07:29.000000000 +0100 +++ source/arch/x86_64/kernel/signal.c 2003-11-26 10:09:05.000000000 +0000 @@ -93,6 +93,8 @@ { unsigned int err = 0; + /* Always make any pending restarted system calls return -EINTR */ + current_thread_info()->restart_block.fn = do_no_restart_syscall; #define COPY(x) err |= __get_user(regs->x, &sc->x) --- diff/arch/x86_64/kernel/smpboot.c 2003-09-17 12:28:03.000000000 +0100 +++ source/arch/x86_64/kernel/smpboot.c 2003-11-26 10:09:05.000000000 +0000 @@ -54,7 +54,7 @@ #include <asm/proto.h> /* Bitmask of currently online CPUs */ -cpumask_t cpu_online_map; +cpumask_t cpu_online_map = { 1 }; static cpumask_t cpu_callin_map; cpumask_t cpu_callout_map; --- diff/arch/x86_64/mm/fault.c 2003-11-25 15:24:57.000000000 +0000 +++ source/arch/x86_64/mm/fault.c 2003-11-26 10:09:05.000000000 +0000 @@ -73,6 +73,9 @@ if (regs->cs & (1<<2)) return 0; + if ((regs->cs & 3) != 0 && regs->rip >= TASK_SIZE) + return 0; + while (scan_more && instr < max_instr) { unsigned char opcode; unsigned char instr_hi; --- diff/drivers/acorn/block/fd1772.c 2003-09-17 12:28:03.000000000 +0100 +++ source/drivers/acorn/block/fd1772.c 2003-11-26 10:09:05.000000000 +0000 @@ -365,13 +365,12 @@ static void floppy_off(unsigned int nr); static void setup_req_params(int drive); static void redo_fd_request(void); -static int fd_ioctl(struct inode *inode, struct file *filp, unsigned int +static int fd_ioctl(struct block_device *bdev, struct file *filp, unsigned int cmd, unsigned long param); static void fd_probe(int drive); static int fd_test_drive_present(int drive); static void config_types(void); -static int floppy_open(struct inode *inode, struct file *filp); -static int floppy_release(struct inode *inode, struct file *filp); +static int floppy_open(struct block_device *bdev, struct file *filp); static void do_fd_request(request_queue_t *); /************************* End of Prototypes **************************/ @@ -1309,11 +1308,9 @@ return 0; } -static int fd_ioctl(struct inode *inode, struct file *filp, +static int fd_ioctl(struct block_device *bdev, struct file *filp, unsigned int cmd, 
unsigned long param) { - struct block_device *bdev = inode->i_bdev; - switch (cmd) { case FDFMTEND: case FDFLUSH: @@ -1453,10 +1450,11 @@ * drive with different device numbers. */ -static int floppy_open(struct inode *inode, struct file *filp) +static int floppy_open(struct block_device *bdev, struct file *filp) { - int drive = iminor(inode) & 3; - int type = iminor(inode) >> 2; + struct archy_floppy_struct *p = bdev->bd_disk->private_data; + int drive = p - unit; + int type = MINOR(bdev->bd_dev) >> 2; int old_dev = fd_device[drive]; if (fd_ref[drive] && old_dev != type) @@ -1476,10 +1474,13 @@ return 0; if (filp->f_mode & 3) { - check_disk_change(inode->i_bdev); + check_disk_change(bdev); if (filp->f_mode & 2) { - if (unit[drive].wpstat) { - floppy_release(inode, filp); + if (p->wpstat) { + if (fd_ref[drive] < 0) + fd_ref[drive] = 0; + else + fd_ref[drive]--; return -EROFS; } } @@ -1487,10 +1488,10 @@ return 0; } - -static int floppy_release(struct inode *inode, struct file *filp) +static int floppy_release(struct gendisk *disk) { - int drive = iminor(inode) & 3; + struct archy_floppy_struct *p = disk->private_data; + int drive = p - unit; if (fd_ref[drive] < 0) fd_ref[drive] = 0; --- diff/drivers/acorn/block/mfmhd.c 2003-08-20 14:16:26.000000000 +0100 +++ source/drivers/acorn/block/mfmhd.c 2003-11-26 10:09:05.000000000 +0000 @@ -1153,9 +1153,9 @@ * The 'front' end of the mfm driver follows... */ -static int mfm_ioctl(struct inode *inode, struct file *file, u_int cmd, u_long arg) +static int mfm_ioctl(struct block_device *bdev, struct file *file, u_int cmd, u_long arg) { - struct mfm_info *p = inode->i_bdev->bd_disk->private_data; + struct mfm_info *p = bdev->bd_disk->private_data; struct hd_geometry *geo = (struct hd_geometry *) arg; if (cmd != HDIO_GETGEO) return -EINVAL; @@ -1167,7 +1167,7 @@ return -EFAULT; if (put_user (p->cylinders, &geo->cylinders)) return -EFAULT; - if (put_user (get_start_sect(inode->i_bdev), &geo->start)) + if (put_user (get_start_sect(bdev), &geo->start)) return -EFAULT; return 0; } --- diff/drivers/acpi/Kconfig 2003-10-09 09:47:33.000000000 +0100 +++ source/drivers/acpi/Kconfig 2003-11-26 10:09:05.000000000 +0000 @@ -251,12 +251,6 @@ This driver will enable your system to shut down using ACPI, and dump your ACPI DSDT table using /proc/acpi/dsdt. -config ACPI_EFI - bool - depends on ACPI_INTERPRETER - depends on IA64 - default y - config ACPI_RELAXED_AML bool "Relaxed AML" depends on ACPI_INTERPRETER @@ -269,5 +263,23 @@ particular, many Toshiba laptops require this for correct operation of the AC module. +config X86_PM_TIMER + bool "Power Management Timer Support" + depends on X86 && ACPI + depends on ACPI_BOOT && EXPERIMENTAL + default n + help + The Power Management Timer is available on all ACPI-capable systems, + in most cases even if ACPI is unusable or blacklisted. + + This timing source is not affected by power management features + like aggressive processor idling, throttling, frequency and/or + voltage scaling, unlike the commonly used Time Stamp Counter + (TSC) timing source. + + So, if you see messages like 'Losing too many ticks!' in the + kernel logs, and/or you are using this on a notebook which + does not yet have an HPET, you should say "Y" here. 
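[Editor's sketch -- not part of the patch: how a PM-timer based time source reads the counter. pmtmr_ioport standing in for the port discovered from the FADT's PM timer block is an assumption; the 3.579545 MHz rate and the guaranteed 24-bit width come from the ACPI specification.

	#include <linux/types.h>
	#include <asm/io.h>

	#define PMTMR_TICKS_PER_SEC	3579545

	static u32 pmtmr_ioport;	/* assumed: filled in from the FADT */

	static inline u32 read_pmtmr(void)
	{
		/* only the low 24 bits are architecturally guaranteed */
		return inl(pmtmr_ioport) & 0xffffff;
	}

Elapsed time between two reads is then ((t2 - t1) & 0xffffff) / 3579545 seconds, the mask handling counter wrap.]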
+ endmenu --- diff/drivers/acpi/bus.c 2003-10-27 09:20:43.000000000 +0000 +++ source/drivers/acpi/bus.c 2003-11-26 10:09:05.000000000 +0000 @@ -39,7 +39,7 @@ #define _COMPONENT ACPI_BUS_COMPONENT ACPI_MODULE_NAME ("acpi_bus") -extern void acpi_pic_set_level_irq(unsigned int irq); +extern void __init acpi_pic_sci_set_trigger(unsigned int irq); FADT_DESCRIPTOR acpi_fadt; struct acpi_device *acpi_root; @@ -615,7 +615,7 @@ if (acpi_ioapic) mp_config_ioapic_for_sci(acpi_fadt.sci_int); else - acpi_pic_set_level_irq(acpi_fadt.sci_int); + acpi_pic_sci_set_trigger(acpi_fadt.sci_int); #endif status = acpi_enable_subsystem(ACPI_FULL_INITIALIZATION); --- diff/drivers/acpi/osl.c 2003-08-26 10:00:52.000000000 +0100 +++ source/drivers/acpi/osl.c 2003-11-26 10:09:05.000000000 +0000 @@ -41,10 +41,7 @@ #include <acpi/acpi_bus.h> #include <asm/uaccess.h> -#ifdef CONFIG_ACPI_EFI #include <linux/efi.h> -u64 efi_mem_attributes (u64 phys_addr); -#endif #define _COMPONENT ACPI_OS_SERVICES @@ -140,22 +137,24 @@ acpi_status acpi_os_get_root_pointer(u32 flags, struct acpi_pointer *addr) { -#ifdef CONFIG_ACPI_EFI - addr->pointer_type = ACPI_PHYSICAL_POINTER; - if (efi.acpi20) - addr->pointer.physical = (acpi_physical_address) virt_to_phys(efi.acpi20); - else if (efi.acpi) - addr->pointer.physical = (acpi_physical_address) virt_to_phys(efi.acpi); - else { - printk(KERN_ERR PREFIX "System description tables not found\n"); - return AE_NOT_FOUND; - } -#else - if (ACPI_FAILURE(acpi_find_root_pointer(flags, addr))) { - printk(KERN_ERR PREFIX "System description tables not found\n"); - return AE_NOT_FOUND; + if (efi_enabled) { + addr->pointer_type = ACPI_PHYSICAL_POINTER; + if (efi.acpi20) + addr->pointer.physical = + (acpi_physical_address) virt_to_phys(efi.acpi20); + else if (efi.acpi) + addr->pointer.physical = + (acpi_physical_address) virt_to_phys(efi.acpi); + else { + printk(KERN_ERR PREFIX "System description tables not found\n"); + return AE_NOT_FOUND; + } + } else { + if (ACPI_FAILURE(acpi_find_root_pointer(flags, addr))) { + printk(KERN_ERR PREFIX "System description tables not found\n"); + return AE_NOT_FOUND; + } } -#endif /*CONFIG_ACPI_EFI*/ return AE_OK; } @@ -163,22 +162,22 @@ acpi_status acpi_os_map_memory(acpi_physical_address phys, acpi_size size, void **virt) { -#ifdef CONFIG_ACPI_EFI - if (EFI_MEMORY_WB & efi_mem_attributes(phys)) { - *virt = phys_to_virt(phys); + if (efi_enabled) { + if (EFI_MEMORY_WB & efi_mem_attributes(phys)) { + *virt = phys_to_virt(phys); + } else { + *virt = ioremap(phys, size); + } } else { - *virt = ioremap(phys, size); - } -#else - if (phys > ULONG_MAX) { - printk(KERN_ERR PREFIX "Cannot map memory that high\n"); - return AE_BAD_PARAMETER; + if (phys > ULONG_MAX) { + printk(KERN_ERR PREFIX "Cannot map memory that high\n"); + return AE_BAD_PARAMETER; + } + /* + * ioremap checks to ensure this is in reserved space + */ + *virt = ioremap((unsigned long) phys, size); } - /* - * ioremap checks to ensure this is in reserved space - */ - *virt = ioremap((unsigned long) phys, size); -#endif if (!*virt) return AE_NO_MEMORY; @@ -369,19 +368,17 @@ { u32 dummy; void *virt_addr; - -#ifdef CONFIG_ACPI_EFI int iomem = 0; - if (EFI_MEMORY_WB & efi_mem_attributes(phys_addr)) { + if (efi_enabled) { + if (EFI_MEMORY_WB & efi_mem_attributes(phys_addr)) { + virt_addr = phys_to_virt(phys_addr); + } else { + iomem = 1; + virt_addr = ioremap(phys_addr, width); + } + } else virt_addr = phys_to_virt(phys_addr); - } else { - iomem = 1; - virt_addr = ioremap(phys_addr, width); - } -#else - virt_addr = 
phys_to_virt(phys_addr); -#endif if (!value) value = &dummy; @@ -399,10 +396,10 @@ BUG(); } -#ifdef CONFIG_ACPI_EFI - if (iomem) - iounmap(virt_addr); -#endif + if (efi_enabled) { + if (iomem) + iounmap(virt_addr); + } return AE_OK; } @@ -414,19 +411,17 @@ u32 width) { void *virt_addr; - -#ifdef CONFIG_ACPI_EFI int iomem = 0; - if (EFI_MEMORY_WB & efi_mem_attributes(phys_addr)) { + if (efi_enabled) { + if (EFI_MEMORY_WB & efi_mem_attributes(phys_addr)) { + virt_addr = phys_to_virt(phys_addr); + } else { + iomem = 1; + virt_addr = ioremap(phys_addr, width); + } + } else virt_addr = phys_to_virt(phys_addr); - } else { - iomem = 1; - virt_addr = ioremap(phys_addr, width); - } -#else - virt_addr = phys_to_virt(phys_addr); -#endif switch (width) { case 8: @@ -442,10 +437,8 @@ BUG(); } -#ifdef CONFIG_ACPI_EFI if (iomem) iounmap(virt_addr); -#endif return AE_OK; } --- diff/drivers/acpi/pci_irq.c 2003-10-09 09:47:33.000000000 +0100 +++ source/drivers/acpi/pci_irq.c 2003-11-26 10:09:05.000000000 +0000 @@ -237,7 +237,7 @@ PCI Interrupt Routing Support -------------------------------------------------------------------------- */ -static int +int acpi_pci_irq_lookup (struct pci_bus *bus, int device, int pin) { struct acpi_prt_entry *entry = NULL; --- diff/drivers/base/firmware_class.c 2003-08-26 10:00:52.000000000 +0100 +++ source/drivers/base/firmware_class.c 2003-11-26 10:09:05.000000000 +0000 @@ -415,18 +415,22 @@ void (*cont)(const struct firmware *fw, void *context); }; -static void +static int request_firmware_work_func(void *arg) { struct firmware_work *fw_work = arg; const struct firmware *fw; - if (!arg) - return; + if (!arg) { + WARN_ON(1); + return 0; + } + daemonize("firmware/%s", fw_work->name); request_firmware(&fw, fw_work->name, fw_work->device); fw_work->cont(fw, fw_work->context); release_firmware(fw); module_put(fw_work->module); kfree(fw_work); + return 0; } /** @@ -451,6 +455,8 @@ { struct firmware_work *fw_work = kmalloc(sizeof (struct firmware_work), GFP_ATOMIC); + int ret; + if (!fw_work) return -ENOMEM; if (!try_module_get(module)) { @@ -465,9 +471,14 @@ .context = context, .cont = cont, }; - INIT_WORK(&fw_work->work, request_firmware_work_func, fw_work); - schedule_work(&fw_work->work); + ret = kernel_thread(request_firmware_work_func, fw_work, + CLONE_FS | CLONE_FILES); + + if (ret < 0) { + fw_work->cont(NULL, fw_work->context); + return ret; + } return 0; } --- diff/drivers/base/node.c 2003-08-26 10:00:52.000000000 +0100 +++ source/drivers/base/node.c 2003-11-26 10:09:05.000000000 +0000 @@ -19,15 +19,11 @@ { struct node *node_dev = to_node(dev); cpumask_t tmp = node_dev->cpumap; - int k, len = 0; + int len = 0; - for (k = 0; k < sizeof(cpumask_t)/sizeof(u16); ++k) { - int j = sprintf(buf, "%04hx", (u16)cpus_coerce(tmp)); - len += j; - buf += j; - cpus_shift_right(tmp, tmp, 16); - } - len += sprintf(buf, "\n"); + len = format_cpumask(buf, node_dev->cpumap); + buf += len; + len += sprintf(buf, "\n"); return len; } static SYSDEV_ATTR(cpumap,S_IRUGO,node_read_cpumap,NULL); --- diff/drivers/block/DAC960.c 2003-09-30 15:46:12.000000000 +0100 +++ source/drivers/block/DAC960.c 2003-11-26 10:09:05.000000000 +0000 @@ -67,9 +67,9 @@ } } -static int DAC960_open(struct inode *inode, struct file *file) +static int DAC960_open(struct block_device *bdev, struct file *file) { - struct gendisk *disk = inode->i_bdev->bd_disk; + struct gendisk *disk = bdev->bd_disk; DAC960_Controller_T *p = disk->queue->queuedata; int drive_nr = (long)disk->private_data; @@ -84,17 +84,17 @@ return -ENXIO; } - 
check_disk_change(inode->i_bdev); + check_disk_change(bdev); if (!get_capacity(p->disks[drive_nr])) return -ENXIO; return 0; } -static int DAC960_ioctl(struct inode *inode, struct file *file, +static int DAC960_ioctl(struct block_device *bdev, struct file *file, unsigned int cmd, unsigned long arg) { - struct gendisk *disk = inode->i_bdev->bd_disk; + struct gendisk *disk = bdev->bd_disk; DAC960_Controller_T *p = disk->queue->queuedata; int drive_nr = (long)disk->private_data; struct hd_geometry g, *loc = (struct hd_geometry *)arg; @@ -128,7 +128,7 @@ g.cylinders = i->ConfigurableDeviceSize / (g.heads * g.sectors); } - g.start = get_start_sect(inode->i_bdev); + g.start = get_start_sect(bdev); return copy_to_user(loc, &g, sizeof g) ? -EFAULT : 0; } --- diff/drivers/block/Kconfig.iosched 2003-10-09 09:47:16.000000000 +0100 +++ source/drivers/block/Kconfig.iosched 2003-11-26 10:09:05.000000000 +0000 @@ -27,3 +27,10 @@ a disk at any one time, its behaviour is almost identical to the anticipatory I/O scheduler and so is a good choice. +config IOSCHED_CFQ + bool "CFQ I/O scheduler" if EMBEDDED + default y + ---help--- + The CFQ I/O scheduler tries to distribute bandwidth equally + among all processes in the system. It should provide a fair + working environment, suitable for desktop systems. --- diff/drivers/block/Makefile 2003-10-27 09:20:37.000000000 +0000 +++ source/drivers/block/Makefile 2003-11-26 10:09:05.000000000 +0000 @@ -18,6 +18,7 @@ obj-$(CONFIG_IOSCHED_NOOP) += noop-iosched.o obj-$(CONFIG_IOSCHED_AS) += as-iosched.o obj-$(CONFIG_IOSCHED_DEADLINE) += deadline-iosched.o +obj-$(CONFIG_IOSCHED_CFQ) += cfq-iosched.o obj-$(CONFIG_MAC_FLOPPY) += swim3.o obj-$(CONFIG_BLK_DEV_FD) += floppy.o obj-$(CONFIG_BLK_DEV_FD98) += floppy98.o --- diff/drivers/block/acsi.c 2003-10-09 09:47:33.000000000 +0100 +++ source/drivers/block/acsi.c 2003-11-26 10:09:05.000000000 +0000 @@ -359,10 +359,9 @@ static void do_end_requests( void ); static void do_acsi_request( request_queue_t * ); static void redo_acsi_request( void ); -static int acsi_ioctl( struct inode *inode, struct file *file, unsigned int +static int acsi_ioctl(struct block_device *bdev, struct file *file, unsigned int cmd, unsigned long arg ); -static int acsi_open( struct inode * inode, struct file * filp ); -static int acsi_release( struct inode * inode, struct file * file ); +static int acsi_open(struct block_device *bdev, struct file *filp); static void acsi_prevent_removal(struct acsi_info_struct *aip, int flag ); static int acsi_change_blk_size( int target, int lun); static int acsi_mode_sense( int target, int lun, SENSE_DATA *sd ); @@ -1081,10 +1080,10 @@ ***********************************************************************/ -static int acsi_ioctl( struct inode *inode, struct file *file, - unsigned int cmd, unsigned long arg ) +static int acsi_ioctl(struct block_device *bdev, struct file *file, + unsigned int cmd, unsigned long arg ) { - struct gendisk *disk = inode->i_bdev->bd_disk; + struct gendisk *disk = bdev->bd_disk; struct acsi_info_struct *aip = disk->private_data; switch (cmd) { case HDIO_GETGEO: @@ -1096,7 +1095,7 @@ put_user( 64, &geo->heads ); put_user( 32, &geo->sectors ); put_user( aip->size >> 11, &geo->cylinders ); - put_user(get_start_sect(inode->i_bdev), &geo->start); + put_user(get_start_sect(bdev), &geo->start); return 0; } case SCSI_IOCTL_GET_IDLUN: @@ -1126,16 +1125,16 @@ * */ -static int acsi_open( struct inode * inode, struct file * filp ) +static int acsi_open(struct block_device *bdev, struct file *filp) { - 
struct gendisk *disk = inode->i_bdev->bd_disk; + struct gendisk *disk = bdev->bd_disk; struct acsi_info_struct *aip = disk->private_data; if (aip->access_count == 0 && aip->removable) { #if 0 aip->changed = 1; /* safety first */ #endif - check_disk_change( inode->i_bdev ); + check_disk_change(bdev); if (aip->changed) /* revalidate was not successful (no medium) */ return -ENXIO; acsi_prevent_removal(aip, 1); @@ -1143,10 +1142,11 @@ aip->access_count++; if (filp && filp->f_mode) { - check_disk_change( inode->i_bdev ); + check_disk_change(bdev); if (filp->f_mode & 2) { if (aip->read_only) { - acsi_release( inode, filp ); + if (--aip->access_count == 0 && aip->removable) + acsi_prevent_removal(aip, 0); return -EROFS; } } @@ -1160,9 +1160,8 @@ * be forgotten about... */ -static int acsi_release( struct inode * inode, struct file * file ) +static int acsi_release(struct gendisk *disk) { - struct gendisk *disk = inode->i_bdev->bd_disk; struct acsi_info_struct *aip = disk->private_data; if (--aip->access_count == 0 && aip->removable) acsi_prevent_removal(aip, 0); @@ -1328,8 +1327,6 @@ ********************************************************************/ -extern struct block_device_operations acsi_fops; - static struct gendisk *acsi_gendisk[MAX_DEV]; #define MAX_SCSI_DEVICE_CODE 10 --- diff/drivers/block/amiflop.c 2003-09-17 12:28:04.000000000 +0100 +++ source/drivers/block/amiflop.c 2003-11-26 10:09:05.000000000 +0000 @@ -1434,10 +1434,11 @@ redo_fd_request(); } -static int fd_ioctl(struct inode *inode, struct file *filp, +static int fd_ioctl(struct block_device *bdev, struct file *filp, unsigned int cmd, unsigned long param) { - int drive = iminor(inode) & 3; + struct amiga_floppy_struct *floppy = bdev->bd_disk->private_data; + int drive = floppy - unit; static struct floppy_struct getprm; switch(cmd){ @@ -1459,7 +1460,7 @@ rel_fdc(); return -EBUSY; } - fsync_bdev(inode->i_bdev); + fsync_bdev(bdev); if (fd_motor_on(drive) == 0) { rel_fdc(); return -ENODEV; @@ -1488,7 +1489,7 @@ break; case FDFMTEND: floppy_off(drive); - invalidate_bdev(inode->i_bdev, 0); + invalidate_bdev(bdev, 0); break; case FDGETPRM: memset((void *)&getprm, 0, sizeof (getprm)); @@ -1559,10 +1560,11 @@ * /dev/PS0 etc), and disallows simultaneous access to the same * drive with different device numbers. 
*/ -static int floppy_open(struct inode *inode, struct file *filp) +static int floppy_open(struct block_device *bdev, struct file *filp) { - int drive = iminor(inode) & 3; - int system = (iminor(inode) & 4) >> 2; + struct amiga_floppy_struct *p = bdev->bd_disk->private_data; + int drive = p - unit; + int system = (MINOR(bdev->bd_dev) & 4) >> 2; int old_dev; unsigned long flags; @@ -1572,7 +1574,7 @@ return -EBUSY; if (filp && filp->f_mode & 3) { - check_disk_change(inode->i_bdev); + check_disk_change(bdev); if (filp->f_mode & 2 ) { int wrprot; @@ -1607,9 +1609,10 @@ return 0; } -static int floppy_release(struct inode * inode, struct file * filp) +static int floppy_release(struct gendisk *disk) { - int drive = iminor(inode) & 3; + struct amiga_floppy_struct *p = disk->private_data; + int drive = p - unit; if (unit[drive].dirty == 1) { del_timer (flush_track_timer + drive); --- diff/drivers/block/as-iosched.c 2003-11-25 15:24:57.000000000 +0000 +++ source/drivers/block/as-iosched.c 2003-11-26 10:09:05.000000000 +0000 @@ -70,6 +70,7 @@ /* Bits in as_io_context.state */ enum as_io_states { AS_TASK_RUNNING=0, /* Process has not exitted */ + AS_TASK_IOSTARTED, /* Process has started some IO */ AS_TASK_IORUNNING, /* Process has completed some IO */ }; @@ -99,7 +100,14 @@ sector_t last_sector[2]; /* last REQ_SYNC & REQ_ASYNC sectors */ struct list_head *dispatch; /* driver dispatch queue */ struct list_head *hash; /* request hash */ - unsigned long new_success; /* anticipation success on new proc */ + + unsigned long exit_prob; /* probability a task will exit while + being waited on */ + unsigned long new_ttime_total; /* mean thinktime on new proc */ + unsigned long new_ttime_mean; + u64 new_seek_total; /* mean seek on new proc */ + sector_t new_seek_mean; + unsigned long current_batch_expires; unsigned long last_check_fifo[2]; int changed_batch; /* 1: waiting for old batch to end */ @@ -137,6 +145,10 @@ scheduler */ AS_RQ_DISPATCHED, /* On the dispatch list. 
It belongs to the driver now */ + AS_RQ_PRESCHED, /* Debug poisoning for requests being used */ + AS_RQ_REMOVED, + AS_RQ_MERGED, + AS_RQ_POSTSCHED, /* when they shouldn't be */ }; struct as_rq { @@ -183,6 +195,7 @@ /* Called when the task exits */ static void exit_as_io_context(struct as_io_context *aic) { + WARN_ON(!test_bit(AS_TASK_RUNNING, &aic->state)); clear_bit(AS_TASK_RUNNING, &aic->state); } @@ -585,18 +598,11 @@ int status = ad->antic_status; if (status == ANTIC_WAIT_REQ || status == ANTIC_WAIT_NEXT) { - struct as_io_context *aic; - if (status == ANTIC_WAIT_NEXT) del_timer(&ad->antic_timer); ad->antic_status = ANTIC_FINISHED; /* see as_work_handler */ kblockd_schedule_work(&ad->antic_work); - - aic = ad->io_context->aic; - if (aic->seek_samples == 0) - /* new process */ - ad->new_success = (ad->new_success * 3) / 4 + 256; } } @@ -612,14 +618,15 @@ spin_lock_irqsave(q->queue_lock, flags); if (ad->antic_status == ANTIC_WAIT_REQ || ad->antic_status == ANTIC_WAIT_NEXT) { - struct as_io_context *aic; + struct as_io_context *aic = ad->io_context->aic; + ad->antic_status = ANTIC_FINISHED; kblockd_schedule_work(&ad->antic_work); - aic = ad->io_context->aic; - if (aic->seek_samples == 0) - /* new process */ - ad->new_success = (ad->new_success * 3) / 4; + if (aic->ttime_samples == 0) { + /* process anticipated on has exited or timed out */ + ad->exit_prob = (7*ad->exit_prob + 256)/8; + } } spin_unlock_irqrestore(q->queue_lock, flags); } @@ -633,7 +640,7 @@ unsigned long delay; /* milliseconds */ sector_t last = ad->last_sector[ad->batch_data_dir]; sector_t next = arq->request->sector; - sector_t delta; /* acceptable close offset (in sectors) */ + sector_t delta; /* acceptable close offset (in sectors) */ if (ad->antic_status == ANTIC_OFF || !ad->ioc_finished) delay = 0; @@ -650,6 +657,7 @@ return (last - (delta>>1) <= next) && (next <= last + delta); } +static void as_update_thinktime(struct as_data *ad, struct as_io_context *aic, unsigned long ttime); /* * as_can_break_anticipation returns true if we have been anticipating this * request. @@ -667,9 +675,27 @@ { struct io_context *ioc; struct as_io_context *aic; + sector_t s; + + ioc = ad->io_context; + BUG_ON(!ioc); + + if (arq && ioc == arq->io_context) { + /* request from same process */ + return 1; + } if (arq && arq->is_sync == REQ_SYNC && as_close_req(ad, arq)) { /* close request */ + struct as_io_context *aic = ioc->aic; + if (aic) { + unsigned long thinktime; + spin_lock(&aic->lock); + thinktime = jiffies - aic->last_end_request; + aic->last_end_request = jiffies; + as_update_thinktime(ad, aic, thinktime); + spin_unlock(&aic->lock); + } return 1; } @@ -681,20 +707,14 @@ return 1; } - ioc = ad->io_context; - BUG_ON(!ioc); - - if (arq && ioc == arq->io_context) { - /* request from same process */ - return 1; - } - aic = ioc->aic; if (!aic) return 0; if (!test_bit(AS_TASK_RUNNING, &aic->state)) { /* process anticipated on has exitted */ + if (aic->ttime_samples == 0) + ad->exit_prob = (7*ad->exit_prob + 256)/8; return 1; } @@ -708,28 +728,36 @@ return 1; } - if (ad->new_success < 256 && - (aic->seek_samples == 0 || aic->ttime_samples == 0)) { - /* - * Process has just started IO and we have a bad history of - * success anticipating on new processes! 
- */ - return 1; - } - - if (aic->ttime_mean > ad->antic_expire) { + if (aic->ttime_samples == 0) { + if (ad->new_ttime_mean > ad->antic_expire) + return 1; + if (ad->exit_prob > 128) + return 1; + } else if (aic->ttime_mean > ad->antic_expire) { /* the process thinks too much between requests */ return 1; } - if (arq && aic->seek_samples) { - sector_t s; - if (ad->last_sector[REQ_SYNC] < arq->request->sector) - s = arq->request->sector - ad->last_sector[REQ_SYNC]; - else - s = ad->last_sector[REQ_SYNC] - arq->request->sector; + if (!arq) + return 0; + + if (ad->last_sector[REQ_SYNC] < arq->request->sector) + s = arq->request->sector - ad->last_sector[REQ_SYNC]; + else + s = ad->last_sector[REQ_SYNC] - arq->request->sector; + + if (aic->seek_samples == 0) { + /* + * Process has just started IO. Use past statistics to + * gauge success possibility + */ + if (ad->new_seek_mean/2 > s) { + /* this request is better than what we're expecting */ + return 1; + } - if (aic->seek_mean > (s>>1)) { + } else { + if (aic->seek_mean/2 > s) { /* this request is better than what we're expecting */ return 1; } @@ -774,12 +802,51 @@ return 1; } +static void as_update_thinktime(struct as_data *ad, struct as_io_context *aic, unsigned long ttime) +{ + /* fixed point: 1.0 == 1<<8 */ + if (aic->ttime_samples == 0) { + ad->new_ttime_total = (7*ad->new_ttime_total + 256*ttime) / 8; + ad->new_ttime_mean = ad->new_ttime_total / 256; + + ad->exit_prob = (7*ad->exit_prob)/8; + } + aic->ttime_samples = (7*aic->ttime_samples + 256) / 8; + aic->ttime_total = (7*aic->ttime_total + 256*ttime) / 8; + aic->ttime_mean = (aic->ttime_total + 128) / aic->ttime_samples; +} + +static void as_update_seekdist(struct as_data *ad, struct as_io_context *aic, sector_t sdist) +{ + u64 total; + + if (aic->seek_samples == 0) { + ad->new_seek_total = (7*ad->new_seek_total + 256*(u64)sdist)/8; + ad->new_seek_mean = ad->new_seek_total / 256; + } + + /* + * Don't allow the seek distance to get too large from the + * odd fragment, pagein, etc + */ + if (aic->seek_samples <= 60) /* second&third seek */ + sdist = min(sdist, (aic->seek_mean * 4) + 2*1024*1024); + else + sdist = min(sdist, (aic->seek_mean * 4) + 2*1024*64); + + aic->seek_samples = (7*aic->seek_samples + 256) / 8; + aic->seek_total = (7*aic->seek_total + (u64)256*sdist) / 8; + total = aic->seek_total + (aic->seek_samples/2); + do_div(total, aic->seek_samples); + aic->seek_mean = (sector_t)total; +} + /* * as_update_iohist keeps a decaying histogram of IO thinktimes, and * updates @aic->ttime_mean based on that. It is called when a new * request is queued. 
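 * (Editor's note, not in the original: as_update_thinktime() above is a fixed-point exponentially weighted mean with 256 == 1.0: total = (7*total + 256*sample)/8 and samples = (7*samples + 256)/8, so mean = total/samples weights the newest sample by about 1/8 and earlier history by 7/8 per update; as_update_seekdist() applies the same decay to seek distances.)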
*/ -static void as_update_iohist(struct as_io_context *aic, struct request *rq) +static void as_update_iohist(struct as_data *ad, struct as_io_context *aic, struct request *rq) { struct as_rq *arq = RQ_DATA(rq); int data_dir = arq->is_sync; @@ -790,60 +857,29 @@ return; if (data_dir == REQ_SYNC) { + unsigned long in_flight = atomic_read(&aic->nr_queued) + + atomic_read(&aic->nr_dispatched); spin_lock(&aic->lock); - - if (test_bit(AS_TASK_IORUNNING, &aic->state) - && !atomic_read(&aic->nr_queued) - && !atomic_read(&aic->nr_dispatched)) { + if (test_bit(AS_TASK_IORUNNING, &aic->state) || + test_bit(AS_TASK_IOSTARTED, &aic->state)) { /* Calculate read -> read thinktime */ - thinktime = jiffies - aic->last_end_request; - thinktime = min(thinktime, MAX_THINKTIME-1); - /* fixed point: 1.0 == 1<<8 */ - aic->ttime_samples += 256; - aic->ttime_total += 256*thinktime; - if (aic->ttime_samples) - /* fixed point factor is cancelled here */ - aic->ttime_mean = (aic->ttime_total + 128) - / aic->ttime_samples; - aic->ttime_samples = (aic->ttime_samples>>1) - + (aic->ttime_samples>>2); - aic->ttime_total = (aic->ttime_total>>1) - + (aic->ttime_total>>2); - } - - /* Calculate read -> read seek distance */ - if (!aic->seek_samples) - seek_dist = 0; - else if (aic->last_request_pos < rq->sector) - seek_dist = rq->sector - aic->last_request_pos; - else - seek_dist = aic->last_request_pos - rq->sector; - + if (test_bit(AS_TASK_IORUNNING, &aic->state) + && in_flight == 0) { + thinktime = jiffies - aic->last_end_request; + thinktime = min(thinktime, MAX_THINKTIME-1); + } else + thinktime = 0; + as_update_thinktime(ad, aic, thinktime); + + /* Calculate read -> read seek distance */ + if (aic->last_request_pos < rq->sector) + seek_dist = rq->sector - aic->last_request_pos; + else + seek_dist = aic->last_request_pos - rq->sector; + as_update_seekdist(ad, aic, seek_dist); + } aic->last_request_pos = rq->sector + rq->nr_sectors; - - /* - * Don't allow the seek distance to get too large from the - * odd fragment, pagein, etc - */ - if (aic->seek_samples < 400) /* second&third seek */ - seek_dist = min(seek_dist, (aic->seek_mean * 4) - + 2*1024*1024); - else - seek_dist = min(seek_dist, (aic->seek_mean * 4) - + 2*1024*64); - - aic->seek_samples += 256; - aic->seek_total += (u64)256*seek_dist; - if (aic->seek_samples) { - u64 total = aic->seek_total + (aic->seek_samples>>1); - do_div(total, aic->seek_samples); - aic->seek_mean = (sector_t)total; - } - aic->seek_samples = (aic->seek_samples>>1) - + (aic->seek_samples>>2); - aic->seek_total = (aic->seek_total>>1) - + (aic->seek_total>>2); - + set_bit(AS_TASK_IOSTARTED, &aic->state); spin_unlock(&aic->lock); } } @@ -908,15 +944,25 @@ { struct as_data *ad = q->elevator.elevator_data; struct as_rq *arq = RQ_DATA(rq); - struct as_io_context *aic; WARN_ON(!list_empty(&rq->queuelist)); - if (unlikely(arq->state != AS_RQ_DISPATCHED)) - return; + if (arq->state == AS_RQ_PRESCHED) { + WARN_ON(arq->io_context); + goto out; + } + + if (arq->state == AS_RQ_MERGED) + goto out_ioc; + + if (arq->state != AS_RQ_REMOVED) { + printk("arq->state %d\n", arq->state); + WARN_ON(1); + goto out; + } if (!blk_fs_request(rq)) - return; + goto out; if (ad->changed_batch && ad->nr_dispatched == 1) { kblockd_schedule_work(&ad->antic_work); @@ -940,10 +986,7 @@ ad->new_batch = 0; } - if (!arq->io_context) - return; - - if (ad->io_context == arq->io_context) { + if (ad->io_context == arq->io_context && ad->io_context) { ad->antic_start = jiffies; ad->ioc_finished = 1; if (ad->antic_status == 
ANTIC_WAIT_REQ) { @@ -955,18 +998,23 @@ } } - aic = arq->io_context->aic; - if (!aic) - return; +out_ioc: + if (!arq->io_context) + goto out; - spin_lock(&aic->lock); if (arq->is_sync == REQ_SYNC) { - set_bit(AS_TASK_IORUNNING, &aic->state); - aic->last_end_request = jiffies; + struct as_io_context *aic = arq->io_context->aic; + if (aic) { + spin_lock(&aic->lock); + set_bit(AS_TASK_IORUNNING, &aic->state); + aic->last_end_request = jiffies; + spin_unlock(&aic->lock); + } } - spin_unlock(&aic->lock); put_io_context(arq->io_context); +out: + arq->state = AS_RQ_POSTSCHED; } /* @@ -1035,14 +1083,14 @@ struct as_rq *arq = RQ_DATA(rq); if (unlikely(arq->state == AS_RQ_NEW)) - return; - - if (!arq) { - WARN_ON(1); - return; - } + goto out; if (ON_RB(&arq->rb_node)) { + if (arq->state != AS_RQ_QUEUED) { + printk("arq->state %d\n", arq->state); + WARN_ON(1); + goto out; + } /* * We'll lose the aliased request(s) here. I don't think this * will ever happen, but if it does, hopefully someone will @@ -1050,8 +1098,16 @@ */ WARN_ON(!list_empty(&rq->queuelist)); as_remove_queued_request(q, rq); - } else + } else { + if (arq->state != AS_RQ_DISPATCHED) { + printk("arq->state %d\n", arq->state); + WARN_ON(1); + goto out; + } as_remove_dispatched_request(q, rq); + } +out: + arq->state = AS_RQ_REMOVED; } /* @@ -1229,9 +1285,9 @@ */ goto dispatch_writes; - if (ad->batch_data_dir == REQ_ASYNC) { + if (ad->batch_data_dir == REQ_ASYNC) { WARN_ON(ad->new_batch); - ad->changed_batch = 1; + ad->changed_batch = 1; } ad->batch_data_dir = REQ_SYNC; arq = list_entry_fifo(ad->fifo_list[ad->batch_data_dir].next); @@ -1247,8 +1303,8 @@ dispatch_writes: BUG_ON(RB_EMPTY(&ad->sort_list[REQ_ASYNC])); - if (ad->batch_data_dir == REQ_SYNC) { - ad->changed_batch = 1; + if (ad->batch_data_dir == REQ_SYNC) { + ad->changed_batch = 1; /* * new_batch might be 1 when the queue runs out of @@ -1291,8 +1347,6 @@ ad->new_batch = 1; ad->changed_batch = 0; - - arq->request->flags |= REQ_SOFTBARRIER; } /* @@ -1369,8 +1423,8 @@ arq->io_context = as_get_io_context(); if (arq->io_context) { + as_update_iohist(ad, arq->io_context->aic, arq->request); atomic_inc(&arq->io_context->aic->nr_queued); - as_update_iohist(arq->io_context->aic, arq->request); } alias = as_add_arq_rb(ad, arq); @@ -1391,6 +1445,7 @@ } else { as_add_aliased_request(ad, arq, alias); + /* * have we been anticipating this request? 
* or does it come from the same process as the one we are @@ -1416,6 +1471,11 @@ struct as_rq *arq = RQ_DATA(rq); if (arq) { + if (arq->state != AS_RQ_REMOVED) { + printk("arq->state %d\n", arq->state); + WARN_ON(1); + } + arq->state = AS_RQ_DISPATCHED; if (arq->io_context && arq->io_context->aic) atomic_inc(&arq->io_context->aic->nr_dispatched); @@ -1427,8 +1487,6 @@ /* Stop anticipating - let this request get through */ as_antic_stop(ad); - - return; } static void @@ -1437,10 +1495,20 @@ struct as_data *ad = q->elevator.elevator_data; struct as_rq *arq = RQ_DATA(rq); + if (arq) { + if (arq->state != AS_RQ_PRESCHED) { + printk("arq->state: %d\n", arq->state); + WARN_ON(1); + } + arq->state = AS_RQ_NEW; + } + /* barriers must flush the reorder queue */ if (unlikely(rq->flags & (REQ_SOFTBARRIER | REQ_HARDBARRIER) - && where == ELEVATOR_INSERT_SORT)) + && where == ELEVATOR_INSERT_SORT)) { + WARN_ON(1); where = ELEVATOR_INSERT_BACK; + } switch (where) { case ELEVATOR_INSERT_BACK: @@ -1675,7 +1743,8 @@ * kill knowledge of next, this one is a goner */ as_remove_queued_request(q, next); - put_io_context(anext->io_context); + + anext->state = AS_RQ_MERGED; } /* @@ -1708,6 +1777,11 @@ return; } + if (arq->state != AS_RQ_POSTSCHED && arq->state != AS_RQ_PRESCHED) { + printk("arq->state %d\n", arq->state); + WARN_ON(1); + } + mempool_free(arq, ad->arq_pool); rq->elevator_private = NULL; } @@ -1721,7 +1795,7 @@ memset(arq, 0, sizeof(*arq)); RB_CLEAR(&arq->rb_node); arq->request = rq; - arq->state = AS_RQ_NEW; + arq->state = AS_RQ_PRESCHED; arq->io_context = NULL; INIT_LIST_HEAD(&arq->hash); arq->on_hash = 0; @@ -1823,8 +1897,6 @@ if (ad->write_batch_count < 2) ad->write_batch_count = 2; - ad->new_success = 512; - return 0; } @@ -1860,6 +1932,17 @@ return count; } +static ssize_t as_est_show(struct as_data *ad, char *page) +{ + int pos = 0; + + pos += sprintf(page+pos, "%lu %% exit probability\n", 100*ad->exit_prob/256); + pos += sprintf(page+pos, "%lu ms new thinktime\n", ad->new_ttime_mean); + pos += sprintf(page+pos, "%llu sectors new seek distance\n", (unsigned long long)ad->new_seek_mean); + + return pos; +} + #define SHOW_FUNCTION(__FUNC, __VAR) \ static ssize_t __FUNC(struct as_data *ad, char *page) \ { \ @@ -1891,6 +1974,10 @@ &ad->batch_expire[REQ_ASYNC], 0, INT_MAX); #undef STORE_FUNCTION +static struct as_fs_entry as_est_entry = { + .attr = {.name = "est_time", .mode = S_IRUGO }, + .show = as_est_show, +}; static struct as_fs_entry as_readexpire_entry = { .attr = {.name = "read_expire", .mode = S_IRUGO | S_IWUSR }, .show = as_readexpire_show, @@ -1918,6 +2005,7 @@ }; static struct attribute *default_attrs[] = { + &as_est_entry.attr, &as_readexpire_entry.attr, &as_writeexpire_entry.attr, &as_anticexpire_entry.attr, --- diff/drivers/block/ataflop.c 2003-09-17 12:28:04.000000000 +0100 +++ source/drivers/block/ataflop.c 2003-11-26 10:09:05.000000000 +0000 @@ -364,13 +364,12 @@ static __inline__ void copy_buffer( void *from, void *to); static void setup_req_params( int drive ); static void redo_fd_request( void); -static int fd_ioctl( struct inode *inode, struct file *filp, unsigned int +static int fd_ioctl(struct block_device *bdev, struct file *filp, unsigned int cmd, unsigned long param); static void fd_probe( int drive ); static int fd_test_drive_present( int drive ); static void config_types( void ); -static int floppy_open( struct inode *inode, struct file *filp ); -static int floppy_release( struct inode * inode, struct file * filp ); +static int floppy_open(struct block_device *bdev, 
struct file *filp ); /************************* End of Prototypes **************************/ @@ -1496,10 +1495,10 @@ atari_enable_irq( IRQ_MFP_FDC ); } -static int fd_ioctl(struct inode *inode, struct file *filp, +static int fd_ioctl(struct block_device *bdev, struct file *filp, unsigned int cmd, unsigned long param) { - struct gendisk *disk = inode->i_bdev->bd_disk; + struct gendisk *disk = bdev->bd_disk; struct atari_floppy_struct *floppy = disk->private_data; int drive = floppy - unit; int type = floppy->type; @@ -1673,7 +1672,7 @@ /* invalidate the buffer track to force a reread */ BufferDrive = -1; set_bit(drive, &fake_change); - check_disk_change(inode->i_bdev); + check_disk_change(bdev); return 0; default: return -EINVAL; @@ -1816,10 +1815,10 @@ * drive with different device numbers. */ -static int floppy_open( struct inode *inode, struct file *filp ) +static int floppy_open(struct block_device *bdev, struct file *filp ) { - struct atari_floppy_struct *p = inode->i_bdev->bd_disk->private_data; - int type = iminor(inode) >> 2; + struct atari_floppy_struct *p = bdev->bd_disk->private_data; + int type = MINOR(bdev->bd_dev) >> 2; DPRINT(("fd_open: type=%d\n",type)); if (p->ref && p->type != type) @@ -1839,14 +1838,13 @@ return 0; if (filp->f_mode & 3) { - check_disk_change(inode->i_bdev); + check_disk_change(bdev); if (filp->f_mode & 2) { if (p->wpstat) { if (p->ref < 0) p->ref = 0; else p->ref--; - floppy_release(inode, filp); return -EROFS; } } @@ -1855,9 +1853,9 @@ } -static int floppy_release( struct inode * inode, struct file * filp ) +static int floppy_release(struct gendisk *disk) { - struct atari_floppy_struct *p = inode->i_bdev->bd_disk->private_data; + struct atari_floppy_struct *p = disk->private_data; if (p->ref < 0) p->ref = 0; else if (!p->ref--) { --- diff/drivers/block/cciss.c 2003-10-09 09:47:33.000000000 +0100 +++ source/drivers/block/cciss.c 2003-11-26 10:09:05.000000000 +0000 @@ -112,9 +112,9 @@ static ctlr_info_t *hba[MAX_CTLR]; static void do_cciss_request(request_queue_t *q); -static int cciss_open(struct inode *inode, struct file *filep); -static int cciss_release(struct inode *inode, struct file *filep); -static int cciss_ioctl(struct inode *inode, struct file *filep, +static int cciss_open(struct block_device *bdev, struct file *filep); +static int cciss_release(struct gendisk *disk); +static int cciss_ioctl(struct block_device *bdev, struct file *filep, unsigned int cmd, unsigned long arg); static int revalidate_allvol(ctlr_info_t *host); @@ -362,13 +362,13 @@ /* * Open. Make sure the device is really there. */ -static int cciss_open(struct inode *inode, struct file *filep) +static int cciss_open(struct block_device *bdev, struct file *filep) { - ctlr_info_t *host = get_host(inode->i_bdev->bd_disk); - drive_info_struct *drv = get_drv(inode->i_bdev->bd_disk); + ctlr_info_t *host = get_host(bdev->bd_disk); + drive_info_struct *drv = get_drv(bdev->bd_disk); #ifdef CCISS_DEBUG - printk(KERN_DEBUG "cciss_open %s\n", inode->i_bdev->bd_disk->disk_name); + printk(KERN_DEBUG "cciss_open %s\n", bdev->bd_disk->disk_name); #endif /* CCISS_DEBUG */ /* @@ -378,7 +378,7 @@ * for "raw controller". */ if (drv->nr_blocks == 0) { - if (iminor(inode) != 0) + if (bdev != bdev->bd_contains || drv != host->drv) return -ENXIO; if (!capable(CAP_SYS_ADMIN)) return -EPERM; @@ -390,13 +390,13 @@ /* * Close. Sync first. 
*/ -static int cciss_release(struct inode *inode, struct file *filep) +static int cciss_release(struct gendisk *disk) { - ctlr_info_t *host = get_host(inode->i_bdev->bd_disk); - drive_info_struct *drv = get_drv(inode->i_bdev->bd_disk); + ctlr_info_t *host = get_host(disk); + drive_info_struct *drv = get_drv(disk); #ifdef CCISS_DEBUG - printk(KERN_DEBUG "cciss_release %s\n", inode->i_bdev->bd_disk->disk_name); + printk(KERN_DEBUG "cciss_release %s\n", disk->disk_name); #endif /* CCISS_DEBUG */ drv->usage_count--; @@ -407,10 +407,9 @@ /* * ioctl */ -static int cciss_ioctl(struct inode *inode, struct file *filep, +static int cciss_ioctl(struct block_device *bdev, struct file *filep, unsigned int cmd, unsigned long arg) { - struct block_device *bdev = inode->i_bdev; struct gendisk *disk = bdev->bd_disk; ctlr_info_t *host = get_host(disk); drive_info_struct *drv = get_drv(disk); @@ -433,7 +432,7 @@ driver_geo.sectors = 0x3f; driver_geo.cylinders = (int)drv->nr_blocks / (0xff*0x3f); } - driver_geo.start= get_start_sect(inode->i_bdev); + driver_geo.start= get_start_sect(bdev); if (copy_to_user((void *) arg, &driver_geo, sizeof( struct hd_geometry))) return -EFAULT; --- diff/drivers/block/cpqarray.c 2003-10-09 09:47:33.000000000 +0100 +++ source/drivers/block/cpqarray.c 2003-11-26 10:09:05.000000000 +0000 @@ -128,9 +128,9 @@ unsigned int blkcnt, unsigned int log_unit ); -static int ida_open(struct inode *inode, struct file *filep); -static int ida_release(struct inode *inode, struct file *filep); -static int ida_ioctl(struct inode *inode, struct file *filep, unsigned int cmd, unsigned long arg); +static int ida_open(struct block_device *bdev, struct file *filep); +static int ida_release(struct gendisk *disk); +static int ida_ioctl(struct block_device *bdev, struct file *filep, unsigned int cmd, unsigned long arg); static int ida_ctlr_ioctl(ctlr_info_t *h, int dsk, ida_ioctl_t *io); static void do_ida_request(request_queue_t *q); @@ -715,12 +715,12 @@ /* * Open. Make sure the device is really there. */ -static int ida_open(struct inode *inode, struct file *filep) +static int ida_open(struct block_device *bdev, struct file *filep) { - drv_info_t *drv = get_drv(inode->i_bdev->bd_disk); - ctlr_info_t *host = get_host(inode->i_bdev->bd_disk); + drv_info_t *drv = get_drv(bdev->bd_disk); + ctlr_info_t *host = get_host(bdev->bd_disk); - DBGINFO(printk("ida_open %s\n", inode->i_bdev->bd_disk->disk_name)); + DBGINFO(printk("ida_open %s\n", bdev->bd_disk->disk_name)); /* * Root is allowed to open raw volume zero even if it's not configured * so array config can still work. I don't think I really like this, @@ -740,9 +740,9 @@ /* * Close. Sync first. */ -static int ida_release(struct inode *inode, struct file *filep) +static int ida_release(struct gendisk *disk) { - ctlr_info_t *host = get_host(inode->i_bdev->bd_disk); + ctlr_info_t *host = get_host(disk); host->usage_count--; return 0; } @@ -1022,10 +1022,10 @@ * ida_ioctl does some miscellaneous stuff like reporting drive geometry, * setting readahead and submitting commands from userspace to the controller. 
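 *
 * As elsewhere in this patch, the method now takes the block_device
 * directly; the prototype is, roughly:
 *
 *	int (*ioctl)(struct block_device *bdev, struct file *filp,
 *		     unsigned int cmd, unsigned long arg);
 *
 * so the handler reaches its per-disk data via bdev->bd_disk rather
 * than through inode->i_bdev.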
*/ -static int ida_ioctl(struct inode *inode, struct file *filep, unsigned int cmd, unsigned long arg) +static int ida_ioctl(struct block_device *bdev, struct file *filep, unsigned int cmd, unsigned long arg) { - drv_info_t *drv = get_drv(inode->i_bdev->bd_disk); - ctlr_info_t *host = get_host(inode->i_bdev->bd_disk); + drv_info_t *drv = get_drv(bdev->bd_disk); + ctlr_info_t *host = get_host(bdev->bd_disk); int error; int diskinfo[4]; struct hd_geometry *geo = (struct hd_geometry *)arg; @@ -1046,7 +1046,7 @@ put_user(diskinfo[0], &geo->heads); put_user(diskinfo[1], &geo->sectors); put_user(diskinfo[2], &geo->cylinders); - put_user(get_start_sect(inode->i_bdev), &geo->start); + put_user(get_start_sect(bdev), &geo->start); return 0; case IDAGETDRVINFO: if (copy_to_user(&io->c.drv, drv, sizeof(drv_info_t))) @@ -1076,7 +1076,7 @@ put_user(host->ctlr_sig, (int*)arg); return 0; case IDAREVALIDATEVOLS: - if (iminor(inode) != 0) + if (bdev != bdev->bd_contains || drv != host->drv) return -ENXIO; return revalidate_allvol(host); case IDADRIVERVERSION: --- diff/drivers/block/cryptoloop.c 2003-08-26 10:00:52.000000000 +0100 +++ source/drivers/block/cryptoloop.c 2003-11-26 10:09:05.000000000 +0000 @@ -87,43 +87,49 @@ static int -cryptoloop_transfer_ecb(struct loop_device *lo, int cmd, char *raw_buf, - char *loop_buf, int size, sector_t IV) +cryptoloop_transfer_ecb(struct loop_device *lo, int cmd, + struct page *raw_page, unsigned raw_off, + struct page *loop_page, unsigned loop_off, + int size, sector_t IV) { struct crypto_tfm *tfm = (struct crypto_tfm *) lo->key_data; struct scatterlist sg_out = { 0, }; struct scatterlist sg_in = { 0, }; encdec_ecb_t encdecfunc; - char const *in; - char *out; + struct page *in_page, *out_page; + unsigned in_offs, out_offs; if (cmd == READ) { - in = raw_buf; - out = loop_buf; + in_page = raw_page; + in_offs = raw_off; + out_page = loop_page; + out_offs = loop_off; encdecfunc = tfm->crt_u.cipher.cit_decrypt; } else { - in = loop_buf; - out = raw_buf; + in_page = loop_page; + in_offs = loop_off; + out_page = raw_page; + out_offs = raw_off; encdecfunc = tfm->crt_u.cipher.cit_encrypt; } while (size > 0) { const int sz = min(size, LOOP_IV_SECTOR_SIZE); - sg_in.page = virt_to_page(in); - sg_in.offset = (unsigned long)in & ~PAGE_MASK; + sg_in.page = in_page; + sg_in.offset = in_offs; sg_in.length = sz; - sg_out.page = virt_to_page(out); - sg_out.offset = (unsigned long)out & ~PAGE_MASK; + sg_out.page = out_page; + sg_out.offset = out_offs; sg_out.length = sz; encdecfunc(tfm, &sg_out, &sg_in, sz); size -= sz; - in += sz; - out += sz; + in_offs += sz; + out_offs += sz; } return 0; @@ -135,24 +141,30 @@ unsigned int nsg, u8 *iv); static int -cryptoloop_transfer_cbc(struct loop_device *lo, int cmd, char *raw_buf, - char *loop_buf, int size, sector_t IV) +cryptoloop_transfer_cbc(struct loop_device *lo, int cmd, + struct page *raw_page, unsigned raw_off, + struct page *loop_page, unsigned loop_off, + int size, sector_t IV) { struct crypto_tfm *tfm = (struct crypto_tfm *) lo->key_data; struct scatterlist sg_out = { 0, }; struct scatterlist sg_in = { 0, }; encdec_cbc_t encdecfunc; - char const *in; - char *out; + struct page *in_page, *out_page; + unsigned in_offs, out_offs; if (cmd == READ) { - in = raw_buf; - out = loop_buf; + in_page = raw_page; + in_offs = raw_off; + out_page = loop_page; + out_offs = loop_off; encdecfunc = tfm->crt_u.cipher.cit_decrypt_iv; } else { - in = loop_buf; - out = raw_buf; + in_page = loop_page; + in_offs = loop_off; + out_page = raw_page; + out_offs = 
raw_off; encdecfunc = tfm->crt_u.cipher.cit_encrypt_iv; } @@ -161,39 +173,43 @@ u32 iv[4] = { 0, }; iv[0] = cpu_to_le32(IV & 0xffffffff); - sg_in.page = virt_to_page(in); - sg_in.offset = offset_in_page(in); + sg_in.page = in_page; + sg_in.offset = in_offs; sg_in.length = sz; - sg_out.page = virt_to_page(out); - sg_out.offset = offset_in_page(out); + sg_out.page = out_page; + sg_out.offset = out_offs; sg_out.length = sz; encdecfunc(tfm, &sg_out, &sg_in, sz, (u8 *)iv); IV++; size -= sz; - in += sz; - out += sz; + in_offs += sz; + out_offs += sz; } return 0; } static int -cryptoloop_transfer(struct loop_device *lo, int cmd, char *raw_buf, - char *loop_buf, int size, sector_t IV) +cryptoloop_transfer(struct loop_device *lo, int cmd, + struct page *raw_page, unsigned raw_off, + struct page *loop_page, unsigned loop_off, + int size, sector_t IV) { struct crypto_tfm *tfm = (struct crypto_tfm *) lo->key_data; if(tfm->crt_cipher.cit_mode == CRYPTO_TFM_MODE_ECB) { lo->transfer = cryptoloop_transfer_ecb; - return cryptoloop_transfer_ecb(lo, cmd, raw_buf, loop_buf, size, IV); + return cryptoloop_transfer_ecb(lo, cmd, raw_page, raw_off, + loop_page, loop_off, size, IV); } if(tfm->crt_cipher.cit_mode == CRYPTO_TFM_MODE_CBC) { lo->transfer = cryptoloop_transfer_cbc; - return cryptoloop_transfer_cbc(lo, cmd, raw_buf, loop_buf, size, IV); + return cryptoloop_transfer_cbc(lo, cmd, raw_page, raw_off, + loop_page, loop_off, size, IV); } /* This is not supposed to happen */ --- diff/drivers/block/floppy.c 2003-09-30 15:46:12.000000000 +0100 +++ source/drivers/block/floppy.c 2003-11-26 10:09:05.000000000 +0000 @@ -3456,14 +3456,14 @@ return 0; } -static int fd_ioctl(struct inode *inode, struct file *filp, unsigned int cmd, +static int fd_ioctl(struct block_device *bdev, struct file *filp, unsigned int cmd, unsigned long param) { #define FD_IOCTL_ALLOWED ((filp) && (filp)->private_data) #define OUT(c,x) case c: outparam = (const char *) (x); break #define IN(c,x,tag) case c: *(x) = inparam. tag ; return 0 - int drive = (long)inode->i_bdev->bd_disk->private_data; + int drive = (long)bdev->bd_disk->private_data; int i, type = ITYPE(UDRS->fd_device); int ret; int size; @@ -3539,11 +3539,11 @@ current_type[drive] = NULL; floppy_sizes[drive] = MAX_DISK_SIZE << 1; UDRS->keep_data = 0; - return invalidate_drive(inode->i_bdev); + return invalidate_drive(bdev); case FDSETPRM: case FDDEFPRM: return set_geometry(cmd, & inparam.g, - drive, type, inode->i_bdev); + drive, type, bdev); case FDGETPRM: ECALL(get_floppy_geometry(drive, type, (struct floppy_struct**) @@ -3574,7 +3574,7 @@ case FDFMTEND: case FDFLUSH: LOCK_FDC(drive,1); - return invalidate_drive(inode->i_bdev); + return invalidate_drive(bdev); case FDSETEMSGTRESH: UDP->max_errors.reporting = @@ -3685,9 +3685,9 @@ printk("\n"); } -static int floppy_release(struct inode * inode, struct file * filp) +static int floppy_release(struct gendisk *disk) { - int drive = (long)inode->i_bdev->bd_disk->private_data; + int drive = (long)disk->private_data; down(&open_lock); if (UDRS->fd_ref < 0) @@ -3708,9 +3708,9 @@ * /dev/PS0 etc), and disallows simultaneous access to the same * drive with different device numbers. 
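 *
 * Note that with the bdev-based open() below, the minor number is
 * recovered as MINOR(bdev->bd_dev) where the old code used
 * iminor(inode), e.g.
 *
 *	UDRS->fd_device = MINOR(bdev->bd_dev);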
*/ -static int floppy_open(struct inode * inode, struct file * filp) +static int floppy_open(struct block_device *bdev, struct file * filp) { - int drive = (long)inode->i_bdev->bd_disk->private_data; + int drive = (long)bdev->bd_disk->private_data; int old_dev; int try; int res = -EBUSY; @@ -3719,7 +3719,7 @@ filp->private_data = (void*) 0; down(&open_lock); old_dev = UDRS->fd_device; - if (opened_bdev[drive] && opened_bdev[drive] != inode->i_bdev) + if (opened_bdev[drive] && opened_bdev[drive] != bdev) goto out2; if (!UDRS->fd_ref && (UDP->flags & FD_BROKEN_DCL)){ @@ -3739,7 +3739,7 @@ else UDRS->fd_ref++; - opened_bdev[drive] = inode->i_bdev; + opened_bdev[drive] = bdev; res = -ENXIO; @@ -3774,9 +3774,9 @@ } } - UDRS->fd_device = iminor(inode); - set_capacity(disks[drive], floppy_sizes[iminor(inode)]); - if (old_dev != -1 && old_dev != iminor(inode)) { + UDRS->fd_device = MINOR(bdev->bd_dev); + set_capacity(disks[drive], floppy_sizes[MINOR(bdev->bd_dev)]); + if (old_dev != -1 && old_dev != MINOR(bdev->bd_dev)) { if (buffer_drive == drive) buffer_track = -1; } @@ -3784,8 +3784,7 @@ /* Allow ioctls if we have write-permissions even if read-only open. * Needed so that programs such as fdrawcmd still can work on write * protected disks */ - if ((filp->f_mode & 2) || - (inode->i_sb && (permission(inode,2, NULL) == 0))) + if ((filp->f_mode & 2) || permission(filp->f_dentry->d_inode,2,NULL) == 0) filp->private_data = (void*) 8; if (UFDCS->rawcmd == 1) @@ -3794,7 +3793,7 @@ if (!(filp->f_flags & O_NDELAY)) { if (filp->f_mode & 3) { UDRS->last_checked = 0; - check_disk_change(inode->i_bdev); + check_disk_change(bdev); if (UTESTF(FD_DISK_CHANGED)) goto out; } --- diff/drivers/block/floppy98.c 2003-09-17 12:28:04.000000000 +0100 +++ source/drivers/block/floppy98.c 2003-11-26 10:09:05.000000000 +0000 @@ -3484,14 +3484,14 @@ return 0; } -static int fd_ioctl(struct inode *inode, struct file *filp, unsigned int cmd, +static int fd_ioctl(struct block_device *bdev, struct file *filp, unsigned int cmd, unsigned long param) { #define FD_IOCTL_ALLOWED ((filp) && (filp)->private_data) #define OUT(c,x) case c: outparam = (const char *) (x); break #define IN(c,x,tag) case c: *(x) = inparam. tag ; return 0 - int drive = (long)inode->i_bdev->bd_disk->private_data; + int drive = (long)bdev->bd_disk->private_data; int i, type = ITYPE(UDRS->fd_device); int ret; int size; @@ -3566,11 +3566,11 @@ current_type[drive] = NULL; floppy_sizes[drive] = MAX_DISK_SIZE << 1; UDRS->keep_data = 0; - return invalidate_drive(inode->i_bdev); + return invalidate_drive(bdev); case FDSETPRM: case FDDEFPRM: return set_geometry(cmd, & inparam.g, - drive, type, inode->i_bdev); + drive, type, bdev); case FDGETPRM: ECALL(get_floppy_geometry(drive, type, (struct floppy_struct**) @@ -3625,7 +3625,7 @@ case FDFMTEND: case FDFLUSH: LOCK_FDC(drive,1); - return invalidate_drive(inode->i_bdev); + return invalidate_drive(bdev); case FDSETEMSGTRESH: UDP->max_errors.reporting = @@ -3735,9 +3735,9 @@ printk("\n"); } -static int floppy_release(struct inode * inode, struct file * filp) +static int floppy_release(struct gendisk *disk) { - int drive = (long)inode->i_bdev->bd_disk->private_data; + int drive = (long)disk->private_data; down(&open_lock); if (UDRS->fd_ref < 0) @@ -3758,11 +3758,10 @@ * /dev/PS0 etc), and disallows simultaneous access to the same * drive with different device numbers. 
*/ -#define RETERR(x) do{floppy_release(inode,filp); return -(x);}while(0) -static int floppy_open(struct inode * inode, struct file * filp) +static int floppy_open(struct block_device *bdev, struct file *filp) { - int drive = (long)inode->i_bdev->bd_disk->private_data; + int drive = (long)bdev->bd_disk->private_data; int old_dev; int try; int res = -EBUSY; @@ -3789,7 +3788,7 @@ down(&open_lock); old_dev = UDRS->fd_device; - if (opened_bdev[drive] && opened_bdev[drive] != inode->i_bdev) + if (opened_bdev[drive] && opened_bdev[drive] != bdev) goto out2; if (!UDRS->fd_ref && (UDP->flags & FD_BROKEN_DCL)){ @@ -3809,7 +3808,7 @@ else UDRS->fd_ref++; - opened_bdev[drive] = inode->i_bdev; + opened_bdev[drive] = bdev; res = -ENXIO; @@ -3844,9 +3843,9 @@ } } - UDRS->fd_device = iminor(inode); - set_capacity(disks[drive], floppy_sizes[iminor(inode)]); - if (old_dev != -1 && old_dev != iminor(inode)) { + UDRS->fd_device = MINOR(bdev->bd_dev); + set_capacity(disks[drive], floppy_sizes[MINOR(bdev->bd_dev)]); + if (old_dev != -1 && old_dev != MINOR(bdev->bd_dev)) { if (buffer_drive == drive) buffer_track = -1; } @@ -3859,8 +3858,7 @@ /* Allow ioctls if we have write-permissions even if read-only open. * Needed so that programs such as fdrawcmd still can work on write * protected disks */ - if ((filp->f_mode & 2) || - (inode->i_sb && (permission(inode,2) == 0))) + if ((filp->f_mode & 2) || permission(filp->f_dentry->d_inode,2) == 0) filp->private_data = (void*) 8; if (UFDCS->rawcmd == 1) @@ -3873,7 +3871,7 @@ if (!(filp->f_flags & O_NDELAY)) { if (filp->f_mode & 3) { UDRS->last_checked = 0; - check_disk_change(inode->i_bdev); + check_disk_change(bdev); if (UTESTF(FD_DISK_CHANGED)) goto out; } --- diff/drivers/block/genhd.c 2003-10-09 09:47:33.000000000 +0100 +++ source/drivers/block/genhd.c 2003-11-26 10:09:05.000000000 +0000 @@ -296,7 +296,7 @@ static struct kobject *base_probe(dev_t dev, int *part, void *data) { - request_module("block-major-%d", MAJOR(dev)); + request_module("block-major-%d-%d", MAJOR(dev), MINOR(dev)); return NULL; } --- diff/drivers/block/ioctl.c 2003-09-17 12:28:04.000000000 +0100 +++ source/drivers/block/ioctl.c 2003-11-26 10:09:05.000000000 +0000 @@ -132,10 +132,9 @@ return put_user(val, (u64 *)arg); } -int blkdev_ioctl(struct inode *inode, struct file *file, unsigned cmd, +int blkdev_ioctl(struct block_device *bdev, struct file *file, unsigned cmd, unsigned long arg) { - struct block_device *bdev = inode->i_bdev; struct gendisk *disk = bdev->bd_disk; struct backing_dev_info *bdi; int holder; @@ -194,7 +193,7 @@ if (!capable(CAP_SYS_ADMIN)) return -EACCES; if (disk->fops->ioctl) { - ret = disk->fops->ioctl(inode, file, cmd, arg); + ret = disk->fops->ioctl(bdev, file, cmd, arg); if (ret != -EINVAL) return ret; } @@ -203,7 +202,7 @@ return 0; case BLKROSET: if (disk->fops->ioctl) { - ret = disk->fops->ioctl(inode, file, cmd, arg); + ret = disk->fops->ioctl(bdev, file, cmd, arg); if (ret != -EINVAL) return ret; } @@ -215,7 +214,7 @@ return 0; default: if (disk->fops->ioctl) - return disk->fops->ioctl(inode, file, cmd, arg); + return disk->fops->ioctl(bdev, file, cmd, arg); } return -ENOTTY; } --- diff/drivers/block/ll_rw_blk.c 2003-11-25 15:24:57.000000000 +0000 +++ source/drivers/block/ll_rw_blk.c 2003-11-26 10:09:05.000000000 +0000 @@ -517,10 +517,10 @@ { int bits, i; - if (depth > q->nr_requests * 2) { - depth = q->nr_requests * 2; - printk(KERN_ERR "%s: adjusted depth to %d\n", - __FUNCTION__, depth); + if (depth > q->nr_requests / 2) { + q->nr_requests = depth * 2; + 
printk(KERN_INFO "%s: large TCQ depth: adjusted nr_requests " + "to %lu\n", __FUNCTION__, q->nr_requests); } tags->tag_index = kmalloc(depth * sizeof(struct request *), GFP_ATOMIC); @@ -1335,6 +1335,8 @@ &iosched_as; #elif defined(CONFIG_IOSCHED_DEADLINE) &iosched_deadline; +#elif defined(CONFIG_IOSCHED_CFQ) + &iosched_cfq; #elif defined(CONFIG_IOSCHED_NOOP) &elevator_noop; #else @@ -1353,6 +1355,10 @@ if (!strcmp(str, "as")) chosen_elevator = &iosched_as; #endif +#ifdef CONFIG_IOSCHED_CFQ + if (!strcmp(str, "cfq")) + chosen_elevator = &iosched_cfq; +#endif #ifdef CONFIG_IOSCHED_NOOP if (!strcmp(str, "noop")) chosen_elevator = &elevator_noop; @@ -1875,29 +1881,52 @@ spin_unlock_irqrestore(q->queue_lock, flags); } } - EXPORT_SYMBOL(blk_put_request); /** - * blk_congestion_wait - wait for a queue to become uncongested + * blk_congestion_wait_wq - wait for a queue to become uncongested, * @rw: READ or WRITE * @timeout: timeout in jiffies + * @wait : wait queue entry to use for waiting or async notification + * (NULL defaults to synchronous behaviour) * * Waits for up to @timeout jiffies for a queue (any queue) to exit congestion. * If no queues are congested then just wait for the next request to be * returned. + * + * If the wait queue parameter specifies an async i/o callback, + * then instead of blocking, just register the callback on the wait + * queue for async notification when the queue gets uncongested. */ -void blk_congestion_wait(int rw, long timeout) +int blk_congestion_wait_wq(int rw, long timeout, wait_queue_t *wait) { - DEFINE_WAIT(wait); wait_queue_head_t *wqh = &congestion_wqh[rw]; + DEFINE_WAIT(local_wait); + + if (!wait) + wait = &local_wait; blk_run_queues(); - prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE); + prepare_to_wait(wqh, wait, TASK_UNINTERRUPTIBLE); + if (!is_sync_wait(wait)) { + /* + * if we've queued an async wait queue + * callback do not block; just tell the + * caller to return and retry later when + * the callback is notified + */ + return -EIOCBRETRY; + } io_schedule_timeout(timeout); - finish_wait(wqh, &wait); + finish_wait(wqh, wait); + return 0; } +EXPORT_SYMBOL(blk_congestion_wait_wq); +void blk_congestion_wait(int rw, long timeout) +{ + blk_congestion_wait_wq(rw, timeout, NULL); +} EXPORT_SYMBOL(blk_congestion_wait); /* --- diff/drivers/block/loop.c 2003-10-09 09:47:16.000000000 +0100 +++ source/drivers/block/loop.c 2003-11-26 10:09:05.000000000 +0000 @@ -55,6 +55,7 @@ #include <linux/errno.h> #include <linux/major.h> #include <linux/wait.h> +#include <linux/blkdev.h> #include <linux/blkpg.h> #include <linux/init.h> #include <linux/devfs_fs_kernel.h> @@ -75,24 +76,34 @@ /* * Transfer functions */ -static int transfer_none(struct loop_device *lo, int cmd, char *raw_buf, - char *loop_buf, int size, sector_t real_block) +static int transfer_none(struct loop_device *lo, int cmd, + struct page *raw_page, unsigned raw_off, + struct page *loop_page, unsigned loop_off, + int size, sector_t real_block) { - if (raw_buf != loop_buf) { - if (cmd == READ) - memcpy(loop_buf, raw_buf, size); - else - memcpy(raw_buf, loop_buf, size); - } + char *raw_buf = kmap_atomic(raw_page, KM_USER0) + raw_off; + char *loop_buf = kmap_atomic(loop_page, KM_USER1) + loop_off; + + if (cmd == READ) + memcpy(loop_buf, raw_buf, size); + else + memcpy(raw_buf, loop_buf, size); + kunmap_atomic(raw_buf, KM_USER0); + kunmap_atomic(loop_buf, KM_USER1); + cond_resched(); return 0; } -static int transfer_xor(struct loop_device *lo, int cmd, char *raw_buf, - char *loop_buf, int size, sector_t 
real_block) -{ - char *in, *out, *key; - int i, keysize; +static int transfer_xor(struct loop_device *lo, int cmd, + struct page *raw_page, unsigned raw_off, + struct page *loop_page, unsigned loop_off, + int size, sector_t real_block) +{ + char *raw_buf = kmap_atomic(raw_page, KM_USER0) + raw_off; + char *loop_buf = kmap_atomic(loop_page, KM_USER1) + loop_off; + char *in, *out, *key; + int i, keysize; if (cmd == READ) { in = raw_buf; @@ -106,6 +117,10 @@ keysize = lo->lo_encrypt_key_size; for (i = 0; i < size; i++) *out++ = *in++ ^ key[(i & 511) % keysize]; + + kunmap_atomic(raw_buf, KM_USER0); + kunmap_atomic(loop_buf, KM_USER1); + cond_resched(); return 0; } @@ -140,8 +155,7 @@ sector_t x; /* Compute loopsize in bytes */ - size = i_size_read(lo->lo_backing_file->f_dentry-> - d_inode->i_mapping->host); + size = i_size_read(lo->lo_backing_file->f_mapping->host); offset = lo->lo_offset; loopsize = size - offset; if (lo->lo_sizelimit > 0 && lo->lo_sizelimit < loopsize) @@ -162,32 +176,33 @@ } static inline int -lo_do_transfer(struct loop_device *lo, int cmd, char *rbuf, - char *lbuf, int size, sector_t rblock) +lo_do_transfer(struct loop_device *lo, int cmd, + struct page *rpage, unsigned roffs, + struct page *lpage, unsigned loffs, + int size, sector_t rblock) { if (!lo->transfer) return 0; - return lo->transfer(lo, cmd, rbuf, lbuf, size, rblock); + return lo->transfer(lo, cmd, rpage, roffs, lpage, loffs, size, rblock); } static int do_lo_send(struct loop_device *lo, struct bio_vec *bvec, int bsize, loff_t pos) { struct file *file = lo->lo_backing_file; /* kudos to NFsckingS */ - struct address_space *mapping = file->f_dentry->d_inode->i_mapping; + struct address_space *mapping = file->f_mapping; struct address_space_operations *aops = mapping->a_ops; struct page *page; - char *kaddr, *data; pgoff_t index; - unsigned size, offset; + unsigned size, offset, bv_offs; int len; int ret = 0; down(&mapping->host->i_sem); index = pos >> PAGE_CACHE_SHIFT; offset = pos & ((pgoff_t)PAGE_CACHE_SIZE - 1); - data = kmap(bvec->bv_page) + bvec->bv_offset; + bv_offs = bvec->bv_offset; len = bvec->bv_len; while (len > 0) { sector_t IV; @@ -204,25 +219,28 @@ goto fail; if (aops->prepare_write(file, page, offset, offset+size)) goto unlock; - kaddr = kmap(page); - transfer_result = lo_do_transfer(lo, WRITE, kaddr + offset, - data, size, IV); + transfer_result = lo_do_transfer(lo, WRITE, page, offset, + bvec->bv_page, bv_offs, + size, IV); if (transfer_result) { + char *kaddr; + /* * The transfer failed, but we still write the data to * keep prepare/commit calls balanced. 
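 *
 * (prepare_write() and commit_write() must stay paired on the
 * page: the error path therefore zeroes the affected region and
 * still falls through to commit_write() before giving up.)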
*/ printk(KERN_ERR "loop: transfer error block %llu\n", (unsigned long long)index); + kaddr = kmap_atomic(page, KM_USER0); memset(kaddr + offset, 0, size); + kunmap_atomic(kaddr, KM_USER0); } flush_dcache_page(page); - kunmap(page); if (aops->commit_write(file, page, offset, offset+size)) goto unlock; if (transfer_result) goto unlock; - data += size; + bv_offs += size; len -= size; offset = 0; index++; @@ -232,7 +250,6 @@ } up(&mapping->host->i_sem); out: - kunmap(bvec->bv_page); return ret; unlock: @@ -247,12 +264,10 @@ static int lo_send(struct loop_device *lo, struct bio *bio, int bsize, loff_t pos) { - unsigned vecnr; - int ret = 0; - - for (vecnr = 0; vecnr < bio->bi_vcnt; vecnr++) { - struct bio_vec *bvec = &bio->bi_io_vec[vecnr]; + struct bio_vec *bvec; + int i, ret = 0; + bio_for_each_segment(bvec, bio, i) { ret = do_lo_send(lo, bvec, bsize, pos); if (ret < 0) break; @@ -263,7 +278,8 @@ struct lo_read_data { struct loop_device *lo; - char *data; + struct page *page; + unsigned offset; int bsize; }; @@ -271,7 +287,6 @@ lo_read_actor(read_descriptor_t *desc, struct page *page, unsigned long offset, unsigned long size) { - char *kaddr; unsigned long count = desc->count; struct lo_read_data *p = (struct lo_read_data*)desc->buf; struct loop_device *lo = p->lo; @@ -282,18 +297,16 @@ if (size > count) size = count; - kaddr = kmap(page); - if (lo_do_transfer(lo, READ, kaddr + offset, p->data, size, IV)) { + if (lo_do_transfer(lo, READ, page, offset, p->page, p->offset, size, IV)) { size = 0; printk(KERN_ERR "loop: transfer error block %ld\n", page->index); desc->error = -EINVAL; } - kunmap(page); desc->count = count - size; desc->written += size; - p->data += size; + p->offset += size; return size; } @@ -306,24 +319,22 @@ int retval; cookie.lo = lo; - cookie.data = kmap(bvec->bv_page) + bvec->bv_offset; + cookie.page = bvec->bv_page; + cookie.offset = bvec->bv_offset; cookie.bsize = bsize; file = lo->lo_backing_file; retval = file->f_op->sendfile(file, &pos, bvec->bv_len, lo_read_actor, &cookie); - kunmap(bvec->bv_page); return (retval < 0)? retval: 0; } static int lo_receive(struct loop_device *lo, struct bio *bio, int bsize, loff_t pos) { - unsigned vecnr; - int ret = 0; - - for (vecnr = 0; vecnr < bio->bi_vcnt; vecnr++) { - struct bio_vec *bvec = &bio->bi_io_vec[vecnr]; + struct bio_vec *bvec; + int i, ret = 0; + bio_for_each_segment(bvec, bio, i) { ret = do_lo_receive(lo, bvec, bsize, pos); if (ret < 0) break; @@ -345,23 +356,6 @@ return ret; } -static int loop_end_io_transfer(struct bio *, unsigned int, int); - -static void loop_put_buffer(struct bio *bio) -{ - /* - * check bi_end_io, may just be a remapped bio - */ - if (bio && bio->bi_end_io == loop_end_io_transfer) { - int i; - - for (i = 0; i < bio->bi_vcnt; i++) - __free_page(bio->bi_io_vec[i].bv_page); - - bio_put(bio); - } -} - /* * Add bio to back of pending list */ @@ -399,129 +393,8 @@ return bio; } -/* - * if this was a WRITE lo->transfer stuff has already been done. 
for READs, - * queue it for the loop thread and let it do the transfer out of - * bi_end_io context (we don't want to do decrypt of a page with irqs - * disabled) - */ -static int loop_end_io_transfer(struct bio *bio, unsigned int bytes_done, int err) -{ - struct bio *rbh = bio->bi_private; - struct loop_device *lo = rbh->bi_bdev->bd_disk->private_data; - - if (bio->bi_size) - return 1; - - if (err || bio_rw(bio) == WRITE) { - bio_endio(rbh, rbh->bi_size, err); - if (atomic_dec_and_test(&lo->lo_pending)) - up(&lo->lo_bh_mutex); - loop_put_buffer(bio); - } else - loop_add_bio(lo, bio); - - return 0; -} - -static struct bio *loop_copy_bio(struct bio *rbh) -{ - struct bio *bio; - struct bio_vec *bv; - int i; - - bio = bio_alloc(__GFP_NOWARN, rbh->bi_vcnt); - if (!bio) - return NULL; - - /* - * iterate iovec list and alloc pages - */ - __bio_for_each_segment(bv, rbh, i, 0) { - struct bio_vec *bbv = &bio->bi_io_vec[i]; - - bbv->bv_page = alloc_page(__GFP_NOWARN|__GFP_HIGHMEM); - if (bbv->bv_page == NULL) - goto oom; - - bbv->bv_len = bv->bv_len; - bbv->bv_offset = bv->bv_offset; - } - - bio->bi_vcnt = rbh->bi_vcnt; - bio->bi_size = rbh->bi_size; - - return bio; - -oom: - while (--i >= 0) - __free_page(bio->bi_io_vec[i].bv_page); - - bio_put(bio); - return NULL; -} - -static struct bio *loop_get_buffer(struct loop_device *lo, struct bio *rbh) -{ - struct bio *bio; - - /* - * When called on the page reclaim -> writepage path, this code can - * trivially consume all memory. So we drop PF_MEMALLOC to avoid - * stealing all the page reserves and throttle to the writeout rate. - * pdflush will have been woken by page reclaim. Let it do its work. - */ - do { - int flags = current->flags; - - current->flags &= ~PF_MEMALLOC; - bio = loop_copy_bio(rbh); - if (flags & PF_MEMALLOC) - current->flags |= PF_MEMALLOC; - - if (bio == NULL) - blk_congestion_wait(WRITE, HZ/10); - } while (bio == NULL); - - bio->bi_end_io = loop_end_io_transfer; - bio->bi_private = rbh; - bio->bi_sector = rbh->bi_sector + (lo->lo_offset >> 9); - bio->bi_rw = rbh->bi_rw; - bio->bi_bdev = lo->lo_device; - - return bio; -} - -static int loop_transfer_bio(struct loop_device *lo, - struct bio *to_bio, struct bio *from_bio) -{ - sector_t IV; - struct bio_vec *from_bvec, *to_bvec; - char *vto, *vfrom; - int ret = 0, i; - - IV = from_bio->bi_sector + (lo->lo_offset >> 9); - - __bio_for_each_segment(from_bvec, from_bio, i, 0) { - to_bvec = &to_bio->bi_io_vec[i]; - - kmap(from_bvec->bv_page); - kmap(to_bvec->bv_page); - vfrom = page_address(from_bvec->bv_page) + from_bvec->bv_offset; - vto = page_address(to_bvec->bv_page) + to_bvec->bv_offset; - ret |= lo_do_transfer(lo, bio_data_dir(to_bio), vto, vfrom, - from_bvec->bv_len, IV); - kunmap(from_bvec->bv_page); - kunmap(to_bvec->bv_page); - IV += from_bvec->bv_len >> 9; - } - - return ret; -} - static int loop_make_request(request_queue_t *q, struct bio *old_bio) { - struct bio *new_bio = NULL; struct loop_device *lo = q->queuedata; int rw = bio_rw(old_bio); @@ -543,31 +416,11 @@ printk(KERN_ERR "loop: unknown command (%x)\n", rw); goto err; } - - /* - * file backed, queue for loop_thread to handle - */ - if (lo->lo_flags & LO_FLAGS_DO_BMAP) { - loop_add_bio(lo, old_bio); - return 0; - } - - /* - * piggy old buffer on original, and submit for I/O - */ - new_bio = loop_get_buffer(lo, old_bio); - if (rw == WRITE) { - if (loop_transfer_bio(lo, new_bio, old_bio)) - goto err; - } - - generic_make_request(new_bio); + loop_add_bio(lo, old_bio); return 0; - err: if 
(atomic_dec_and_test(&lo->lo_pending)) up(&lo->lo_bh_mutex); - loop_put_buffer(new_bio); out: bio_io_error(old_bio, old_bio->bi_size); return 0; @@ -580,20 +433,8 @@ { int ret; - /* - * For block backed loop, we know this is a READ - */ - if (lo->lo_flags & LO_FLAGS_DO_BMAP) { - ret = do_bio_filebacked(lo, bio); - bio_endio(bio, bio->bi_size, ret); - } else { - struct bio *rbh = bio->bi_private; - - ret = loop_transfer_bio(lo, bio, rbh); - - bio_endio(rbh, rbh->bi_size, ret); - loop_put_buffer(bio); - } + ret = do_bio_filebacked(lo, bio); + bio_endio(bio, bio->bi_size, ret); } /* @@ -660,6 +501,7 @@ struct file *file; struct inode *inode; struct block_device *lo_device = NULL; + struct address_space *mapping; unsigned lo_blocksize; int lo_flags = 0; int error; @@ -676,35 +518,27 @@ if (!file) goto out; - error = -EINVAL; - inode = file->f_dentry->d_inode; + mapping = file->f_mapping; + inode = mapping->host; if (!(file->f_mode & FMODE_WRITE)) lo_flags |= LO_FLAGS_READ_ONLY; - if (S_ISBLK(inode->i_mode)) { - lo_device = inode->i_bdev; - if (lo_device == bdev) { - error = -EBUSY; - goto out; - } - lo_blocksize = block_size(lo_device); - if (bdev_read_only(lo_device)) - lo_flags |= LO_FLAGS_READ_ONLY; - } else if (S_ISREG(inode->i_mode)) { - struct address_space_operations *aops = inode->i_mapping->a_ops; + error = -EINVAL; + + if (S_ISREG(inode->i_mode) || S_ISBLK(inode->i_mode)) { + struct address_space_operations *aops = mapping->a_ops; /* * If we can't read - sorry. If we only can't write - well, * it's going to be read-only. */ - if (!inode->i_fop->sendfile) + if (!lo_file->f_op->sendfile) goto out_putf; if (!aops->prepare_write || !aops->commit_write) lo_flags |= LO_FLAGS_READ_ONLY; lo_blocksize = inode->i_blksize; - lo_flags |= LO_FLAGS_DO_BMAP; error = 0; } else goto out_putf; @@ -728,9 +562,8 @@ fput(file); goto out_putf; } - lo->old_gfp_mask = mapping_gfp_mask(inode->i_mapping); - mapping_set_gfp_mask(inode->i_mapping, - lo->old_gfp_mask & ~(__GFP_IO|__GFP_FS)); + lo->old_gfp_mask = mapping_gfp_mask(mapping); + mapping_set_gfp_mask(mapping, lo->old_gfp_mask & ~(__GFP_IO|__GFP_FS)); set_blocksize(bdev, lo_blocksize); @@ -742,21 +575,6 @@ */ blk_queue_make_request(lo->lo_queue, loop_make_request); lo->lo_queue->queuedata = lo; - - /* - * we remap to a block device, make sure we correctly stack limits - */ - if (S_ISBLK(inode->i_mode)) { - request_queue_t *q = bdev_get_queue(lo_device); - - blk_queue_max_sectors(lo->lo_queue, q->max_sectors); - blk_queue_max_phys_segments(lo->lo_queue,q->max_phys_segments); - blk_queue_max_hw_segments(lo->lo_queue, q->max_hw_segments); - blk_queue_max_segment_size(lo->lo_queue, q->max_segment_size); - blk_queue_segment_boundary(lo->lo_queue, q->seg_boundary_mask); - blk_queue_merge_bvec(lo->lo_queue, q->merge_bvec_fn); - } - kernel_thread(loop_thread, lo, CLONE_KERNEL); down(&lo->lo_sem); @@ -846,7 +664,7 @@ memset(lo->lo_file_name, 0, LO_NAME_SIZE); invalidate_bdev(bdev, 0); set_capacity(disks[lo->lo_number], 0); - mapping_set_gfp_mask(filp->f_dentry->d_inode->i_mapping, gfp); + mapping_set_gfp_mask(filp->f_mapping, gfp); lo->lo_state = Lo_unbound; fput(filp); /* This is safe: open() is still holding a reference. 
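 *
 * (Compare lo_open() below, which takes lo_ctl_mutex and bumps
 * lo->lo_refcnt for every opener.)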
*/ @@ -1056,19 +874,19 @@ return err; } -static int lo_ioctl(struct inode * inode, struct file * file, +static int lo_ioctl(struct block_device *bdev, struct file *file, unsigned int cmd, unsigned long arg) { - struct loop_device *lo = inode->i_bdev->bd_disk->private_data; + struct loop_device *lo = bdev->bd_disk->private_data; int err; down(&lo->lo_ctl_mutex); switch (cmd) { case LOOP_SET_FD: - err = loop_set_fd(lo, file, inode->i_bdev, arg); + err = loop_set_fd(lo, file, bdev, arg); break; case LOOP_CLR_FD: - err = loop_clr_fd(lo, inode->i_bdev); + err = loop_clr_fd(lo, bdev); break; case LOOP_SET_STATUS: err = loop_set_status_old(lo, (struct loop_info *) arg); @@ -1089,9 +907,9 @@ return err; } -static int lo_open(struct inode *inode, struct file *file) +static int lo_open(struct block_device *bdev, struct file *file) { - struct loop_device *lo = inode->i_bdev->bd_disk->private_data; + struct loop_device *lo = bdev->bd_disk->private_data; down(&lo->lo_ctl_mutex); lo->lo_refcnt++; @@ -1100,9 +918,9 @@ return 0; } -static int lo_release(struct inode *inode, struct file *file) +static int lo_release(struct gendisk *disk) { - struct loop_device *lo = inode->i_bdev->bd_disk->private_data; + struct loop_device *lo = disk->private_data; down(&lo->lo_ctl_mutex); --lo->lo_refcnt; @@ -1124,6 +942,7 @@ MODULE_PARM(max_loop, "i"); MODULE_PARM_DESC(max_loop, "Maximum number of loop devices (1-256)"); MODULE_LICENSE("GPL"); +MODULE_ALIAS_BLOCKDEV_MAJOR(LOOP_MAJOR); int loop_register_transfer(struct loop_func_table *funcs) { --- diff/drivers/block/nbd.c 2003-08-26 10:00:52.000000000 +0100 +++ source/drivers/block/nbd.c 2003-11-26 10:09:05.000000000 +0000 @@ -535,10 +535,10 @@ return; } -static int nbd_ioctl(struct inode *inode, struct file *file, +static int nbd_ioctl(struct block_device *bdev, struct file *file, unsigned int cmd, unsigned long arg) { - struct nbd_device *lo = inode->i_bdev->bd_disk->private_data; + struct nbd_device *lo = bdev->bd_disk->private_data; int error; struct request sreq ; @@ -593,7 +593,7 @@ error = -EINVAL; file = fget(arg); if (file) { - inode = file->f_dentry->d_inode; + struct inode *inode = file->f_dentry->d_inode; if (inode->i_sock) { lo->file = file; lo->sock = SOCKET_I(inode); @@ -606,20 +606,20 @@ case NBD_SET_BLKSIZE: lo->blksize = arg; lo->bytesize &= ~(lo->blksize-1); - inode->i_bdev->bd_inode->i_size = lo->bytesize; - set_blocksize(inode->i_bdev, lo->blksize); + bdev->bd_inode->i_size = lo->bytesize; + set_blocksize(bdev, lo->blksize); set_capacity(lo->disk, lo->bytesize >> 9); return 0; case NBD_SET_SIZE: lo->bytesize = arg & ~(lo->blksize-1); - inode->i_bdev->bd_inode->i_size = lo->bytesize; - set_blocksize(inode->i_bdev, lo->blksize); + bdev->bd_inode->i_size = lo->bytesize; + set_blocksize(bdev, lo->blksize); set_capacity(lo->disk, lo->bytesize >> 9); return 0; case NBD_SET_SIZE_BLOCKS: lo->bytesize = ((u64) arg) * lo->blksize; - inode->i_bdev->bd_inode->i_size = lo->bytesize; - set_blocksize(inode->i_bdev, lo->blksize); + bdev->bd_inode->i_size = lo->bytesize; + set_blocksize(bdev, lo->blksize); set_capacity(lo->disk, lo->bytesize >> 9); return 0; case NBD_DO_IT: @@ -664,11 +664,11 @@ case NBD_PRINT_DEBUG: #ifdef PARANOIA printk(KERN_INFO "%s: next = %p, prev = %p. 
Global: in %d, out %d\n", - inode->i_bdev->bd_disk->disk_name, lo->queue_head.next, + bdev->bd_disk->disk_name, lo->queue_head.next, lo->queue_head.prev, requests_in, requests_out); #else printk(KERN_INFO "%s: next = %p, prev = %p\n", - inode->i_bdev->bd_disk->disk_name, + bdev->bd_disk->disk_name, lo->queue_head.next, lo->queue_head.prev); #endif return 0; --- diff/drivers/block/paride/pcd.c 2003-08-20 14:16:27.000000000 +0100 +++ source/drivers/block/paride/pcd.c 2003-11-26 10:09:05.000000000 +0000 @@ -243,23 +243,23 @@ /* kernel glue structures */ -static int pcd_block_open(struct inode *inode, struct file *file) +static int pcd_block_open(struct block_device *bdev, struct file *file) { - struct pcd_unit *cd = inode->i_bdev->bd_disk->private_data; - return cdrom_open(&cd->info, inode, file); + struct pcd_unit *cd = bdev->bd_disk->private_data; + return cdrom_open(&cd->info, bdev, file); } -static int pcd_block_release(struct inode *inode, struct file *file) +static int pcd_block_release(struct gendisk *disk) { - struct pcd_unit *cd = inode->i_bdev->bd_disk->private_data; - return cdrom_release(&cd->info, file); + struct pcd_unit *cd = disk->private_data; + return cdrom_release(&cd->info); } -static int pcd_block_ioctl(struct inode *inode, struct file *file, +static int pcd_block_ioctl(struct block_device *bdev, struct file *file, unsigned cmd, unsigned long arg) { - struct pcd_unit *cd = inode->i_bdev->bd_disk->private_data; - return cdrom_ioctl(&cd->info, inode, cmd, arg); + struct pcd_unit *cd = bdev->bd_disk->private_data; + return cdrom_ioctl(&cd->info, bdev, cmd, arg); } static int pcd_block_media_changed(struct gendisk *disk) --- diff/drivers/block/paride/pd.c 2003-09-17 12:28:04.000000000 +0100 +++ source/drivers/block/paride/pd.c 2003-11-26 10:09:05.000000000 +0000 @@ -236,11 +236,11 @@ #define IDE_EJECT 0xed void pd_setup(char *str, int *ints); -static int pd_open(struct inode *inode, struct file *file); +static int pd_open(struct block_device *bdev, struct file *file); static void do_pd_request(request_queue_t * q); -static int pd_ioctl(struct inode *inode, struct file *file, +static int pd_ioctl(struct block_device *bdev, struct file *file, unsigned int cmd, unsigned long arg); -static int pd_release(struct inode *inode, struct file *file); +static int pd_release(struct gendisk *disk); static int pd_revalidate(struct gendisk *p); static int pd_detect(void); static void do_pd_read(void); @@ -304,8 +304,6 @@ /* kernel glue structures */ -extern struct block_device_operations pd_fops; - static struct block_device_operations pd_fops = { .owner = THIS_MODULE, .open = pd_open, @@ -337,9 +335,9 @@ } } -static int pd_open(struct inode *inode, struct file *file) +static int pd_open(struct block_device *bdev, struct file *file) { - struct pd_unit *disk = inode->i_bdev->bd_disk->private_data; + struct pd_unit *disk = bdev->bd_disk->private_data; disk->access++; @@ -350,10 +348,10 @@ return 0; } -static int pd_ioctl(struct inode *inode, struct file *file, +static int pd_ioctl(struct block_device *bdev, struct file *file, unsigned int cmd, unsigned long arg) { - struct pd_unit *disk = inode->i_bdev->bd_disk->private_data; + struct pd_unit *disk = bdev->bd_disk->private_data; struct hd_geometry *geo = (struct hd_geometry *) arg; struct hd_geometry g; @@ -372,7 +370,7 @@ g.sectors = disk->sectors; g.cylinders = disk->cylinders; } - g.start = get_start_sect(inode->i_bdev); + g.start = get_start_sect(bdev); if (copy_to_user(geo, &g, sizeof(struct hd_geometry))) return -EFAULT; return 0; @@ 
-381,9 +379,9 @@ } } -static int pd_release(struct inode *inode, struct file *file) +static int pd_release(struct gendisk *p) { - struct pd_unit *disk = inode->i_bdev->bd_disk->private_data; + struct pd_unit *disk = p->private_data; if (!--disk->access && disk->removable) pd_doorlock(disk, IDE_DOORUNLOCK); --- diff/drivers/block/paride/pf.c 2003-08-20 14:16:27.000000000 +0100 +++ source/drivers/block/paride/pf.c 2003-11-26 10:09:05.000000000 +0000 @@ -222,12 +222,12 @@ #define ATAPI_READ_10 0x28 #define ATAPI_WRITE_10 0x2a -static int pf_open(struct inode *inode, struct file *file); +static int pf_open(struct block_device *bdev, struct file *file); static void do_pf_request(request_queue_t * q); -static int pf_ioctl(struct inode *inode, struct file *file, +static int pf_ioctl(struct block_device *bdev, struct file *file, unsigned int cmd, unsigned long arg); -static int pf_release(struct inode *inode, struct file *file); +static int pf_release(struct gendisk *disk); static int pf_detect(void); static void do_pf_read(void); @@ -315,9 +315,9 @@ } } -static int pf_open(struct inode *inode, struct file *file) +static int pf_open(struct block_device *bdev, struct file *file) { - struct pf_unit *pf = inode->i_bdev->bd_disk->private_data; + struct pf_unit *pf = bdev->bd_disk->private_data; pf_identify(pf); @@ -334,9 +334,9 @@ return 0; } -static int pf_ioctl(struct inode *inode, struct file *file, unsigned int cmd, unsigned long arg) +static int pf_ioctl(struct block_device *bdev, struct file *file, unsigned int cmd, unsigned long arg) { - struct pf_unit *pf = inode->i_bdev->bd_disk->private_data; + struct pf_unit *pf = bdev->bd_disk->private_data; struct hd_geometry *geo = (struct hd_geometry *) arg; struct hd_geometry g; sector_t capacity; @@ -365,9 +365,9 @@ return 0; } -static int pf_release(struct inode *inode, struct file *file) +static int pf_release(struct gendisk *disk) { - struct pf_unit *pf = inode->i_bdev->bd_disk->private_data; + struct pf_unit *pf = disk->private_data; if (pf->access <= 0) return -EINVAL; --- diff/drivers/block/ps2esdi.c 2003-10-09 09:47:33.000000000 +0100 +++ source/drivers/block/ps2esdi.c 2003-11-26 10:09:05.000000000 +0000 @@ -81,7 +81,7 @@ static void ps2esdi_normal_interrupt_handler(u_int); static void ps2esdi_initial_reset_int_handler(u_int); static void ps2esdi_geometry_int_handler(u_int); -static int ps2esdi_ioctl(struct inode *inode, struct file *file, +static int ps2esdi_ioctl(struct block_device *bdev, struct file *file, u_int cmd, u_long arg); static int ps2esdi_read_status_words(int num_words, int max_words, u_short * buffer); @@ -1059,10 +1059,10 @@ } -static int ps2esdi_ioctl(struct inode *inode, +static int ps2esdi_ioctl(struct block_device *bdev, struct file *file, u_int cmd, u_long arg) { - struct ps2esdi_i_struct *p = inode->i_bdev->bd_disk->private_data; + struct ps2esdi_i_struct *p = bdev->bd_disk->private_data; struct ps2esdi_geometry *geometry = (struct ps2esdi_geometry *) arg; int err; @@ -1073,7 +1073,7 @@ put_user(p->head, (char *) &geometry->heads); put_user(p->sect, (char *) &geometry->sectors); put_user(p->cyl, (short *) &geometry->cylinders); - put_user(get_start_sect(inode->i_bdev), (long *) &geometry->start); + put_user(get_start_sect(bdev), (long *) &geometry->start); return 0; } --- diff/drivers/block/rd.c 2003-10-09 09:47:16.000000000 +0100 +++ source/drivers/block/rd.c 2003-11-26 10:09:05.000000000 +0000 @@ -1,15 +1,15 @@ /* * ramdisk.c - Multiple RAM disk driver - gzip-loading version - v. 0.8 beta. 
- * - * (C) Chad Page, Theodore Ts'o, et. al, 1995. + * + * (C) Chad Page, Theodore Ts'o, et. al, 1995. * * This RAM disk is designed to have filesystems created on it and mounted - * just like a regular floppy disk. - * + * just like a regular floppy disk. + * * It also does something suggested by Linus: use the buffer cache as the * RAM disk data. This makes it possible to dynamically allocate the RAM disk - * buffer - with some consequences I have to deal with as I write this. - * + * buffer - with some consequences I have to deal with as I write this. + * * This code is based on the original ramdisk.c, written mostly by * Theodore Ts'o (TYT) in 1991. The code was largely rewritten by * Chad Page to use the buffer cache to store the RAM disk data in @@ -33,7 +33,7 @@ * * Added initrd: Werner Almesberger & Hans Lermen, Feb '96 * - * 4/25/96 : Made RAM disk size a parameter (default is now 4 MB) + * 4/25/96 : Made RAM disk size a parameter (default is now 4 MB) * - Chad Page * * Add support for fs images split across >1 disk, Paul Gortmaker, Mar '98 @@ -60,7 +60,7 @@ #include <asm/uaccess.h> /* The RAM disk size is now a parameter */ -#define NUM_RAMDISKS 16 /* This cannot be overridden (yet) */ +#define NUM_RAMDISKS 16 /* This cannot be overridden (yet) */ /* Various static variables go here. Most are used only in the RAM disk code. */ @@ -73,7 +73,7 @@ * Parameters for the boot-loading of the RAM disk. These are set by * init/main.c (from arguments to the kernel command line) or from the * architecture-specific setup routine (from the stored boot sector - * information). + * information). */ int rd_size = CONFIG_BLK_DEV_RAM_SIZE; /* Size of the RAM disks */ /* @@ -94,7 +94,7 @@ * 2000 Transmeta Corp. * aops copied from ramfs. */ -static int ramdisk_readpage(struct file *file, struct page * page) +static int ramdisk_readpage(struct file *file, struct page *page) { if (!PageUptodate(page)) { void *kaddr = kmap_atomic(page, KM_USER0); @@ -108,7 +108,8 @@ return 0; } -static int ramdisk_prepare_write(struct file *file, struct page *page, unsigned offset, unsigned to) +static int ramdisk_prepare_write(struct file *file, struct page *page, + unsigned offset, unsigned to) { if (!PageUptodate(page)) { void *kaddr = kmap_atomic(page, KM_USER0); @@ -122,7 +123,8 @@ return 0; } -static int ramdisk_commit_write(struct file *file, struct page *page, unsigned offset, unsigned to) +static int ramdisk_commit_write(struct file *file, struct page *page, + unsigned offset, unsigned to) { return 0; } @@ -212,7 +214,7 @@ * 19-JAN-1998 Richard Gooch <rgooch@atnf.csiro.au> Added devfs support * */ -static int rd_make_request(request_queue_t * q, struct bio *bio) +static int rd_make_request(request_queue_t *q, struct bio *bio) { struct block_device *bdev = bio->bi_bdev; struct address_space * mapping = bdev->bd_inode->i_mapping; @@ -242,17 +244,19 @@ return 0; } -static int rd_ioctl(struct inode *inode, struct file *file, unsigned int cmd, unsigned long arg) +static int rd_ioctl(struct block_device *bdev, struct file *file, + unsigned int cmd, unsigned long arg) { int error; - struct block_device *bdev = inode->i_bdev; if (cmd != BLKFLSBUF) return -EINVAL; - /* special: we want to release the ramdisk memory, - it's not like with the other blockdevices where - this ioctl only flushes away the buffer cache. 
*/ + /* + * special: we want to release the ramdisk memory, it's not like with + * the other blockdevices where this ioctl only flushes away the buffer + * cache + */ error = -EBUSY; down(&bdev->bd_sem); if (bdev->bd_openers <= 2) { @@ -268,16 +272,15 @@ .memory_backed = 1, /* Does not contribute to dirty memory */ }; -static int rd_open(struct inode * inode, struct file * filp) +static int rd_open(struct block_device *bdev, struct file *filp) { - unsigned unit = iminor(inode); + unsigned unit = MINOR(bdev->bd_dev); /* * Immunize device against invalidate_buffers() and prune_icache(). */ if (rd_bdev[unit] == NULL) { - struct block_device *bdev = inode->i_bdev; - inode = igrab(bdev->bd_inode); + struct inode *inode = igrab(bdev->bd_inode); rd_bdev[unit] = bdev; bdev->bd_openers++; bdev->bd_block_size = rd_blocksize; @@ -295,12 +298,14 @@ .ioctl = rd_ioctl, }; -/* Before freeing the module, invalidate all of the protected buffers! */ -static void __exit rd_cleanup (void) +/* + * Before freeing the module, invalidate all of the protected buffers! + */ +static void __exit rd_cleanup(void) { int i; - for (i = 0 ; i < NUM_RAMDISKS; i++) { + for (i = 0; i < NUM_RAMDISKS; i++) { struct block_device *bdev = rd_bdev[i]; rd_bdev[i] = NULL; if (bdev) { @@ -311,17 +316,19 @@ put_disk(rd_disks[i]); } devfs_remove("rd"); - unregister_blkdev(RAMDISK_MAJOR, "ramdisk" ); + unregister_blkdev(RAMDISK_MAJOR, "ramdisk"); } -/* This is the registration and initialization section of the RAM disk driver */ -static int __init rd_init (void) +/* + * This is the registration and initialization section of the RAM disk driver + */ +static int __init rd_init(void) { int i; int err = -ENOMEM; if (rd_blocksize > PAGE_SIZE || rd_blocksize < 512 || - (rd_blocksize & (rd_blocksize-1))) { + (rd_blocksize & (rd_blocksize-1))) { printk("RAMDISK: wrong blocksize %d, reverting to defaults\n", rd_blocksize); rd_blocksize = BLOCK_SIZE; @@ -362,8 +369,8 @@ /* rd_size is given in kB */ printk("RAMDISK driver initialized: " - "%d RAM disks of %dK size %d blocksize\n", - NUM_RAMDISKS, rd_size, rd_blocksize); + "%d RAM disks of %dK size %d blocksize\n", + NUM_RAMDISKS, rd_size, rd_blocksize); return 0; out_queue: --- diff/drivers/block/scsi_ioctl.c 2003-10-27 09:20:43.000000000 +0000 +++ source/drivers/block/scsi_ioctl.c 2003-11-26 10:09:05.000000000 +0000 @@ -312,7 +312,7 @@ return -EFAULT; if (in_len > PAGE_SIZE || out_len > PAGE_SIZE) return -EINVAL; - if (get_user(opcode, sic->data)) + if (get_user(opcode, (int *)sic->data)) return -EFAULT; bytes = max(in_len, out_len); --- diff/drivers/block/swim3.c 2003-08-20 14:16:27.000000000 +0100 +++ source/drivers/block/swim3.c 2003-11-26 10:09:05.000000000 +0000 @@ -239,10 +239,9 @@ int interruptible); static void release_drive(struct floppy_state *fs); static int fd_eject(struct floppy_state *fs); -static int floppy_ioctl(struct inode *inode, struct file *filp, +static int floppy_ioctl(struct block_device *bdev, struct file *filp, unsigned int cmd, unsigned long param); -static int floppy_open(struct inode *inode, struct file *filp); -static int floppy_release(struct inode *inode, struct file *filp); +static int floppy_open(struct block_device *bdev, struct file *filp); static int floppy_check_change(struct gendisk *disk); static int floppy_revalidate(struct gendisk *disk); static int swim3_add_device(struct device_node *swims); @@ -811,10 +810,10 @@ static struct floppy_struct floppy_type = { 2880,18,2,80,0,0x1B,0x00,0xCF,0x6C,NULL }; /* 7 1.44MB 3.5" */ -static int floppy_ioctl(struct 
inode *inode, struct file *filp, +static int floppy_ioctl(struct block_device *bdev, struct file *filp, unsigned int cmd, unsigned long param) { - struct floppy_state *fs = inode->i_bdev->bd_disk->private_data; + struct floppy_state *fs = bdev->bd_disk->private_data; int err; if ((cmd & 0x80) && !capable(CAP_SYS_ADMIN)) @@ -838,9 +837,9 @@ return -ENOTTY; } -static int floppy_open(struct inode *inode, struct file *filp) +static int floppy_open(struct block_device *bdev, struct file *filp) { - struct floppy_state *fs = inode->i_bdev->bd_disk->private_data; + struct floppy_state *fs = bdev->bd_disk->private_data; volatile struct swim3 *sw = fs->swim3; int n, err = 0; @@ -876,7 +875,7 @@ if (err == 0 && (filp->f_flags & O_NDELAY) == 0 && (filp->f_mode & 3)) { - check_disk_change(inode->i_bdev); + check_disk_change(bdev); if (fs->ejected) err = -ENXIO; } @@ -904,9 +903,9 @@ return 0; } -static int floppy_release(struct inode *inode, struct file *filp) +static int floppy_release(struct gendisk *disk) { - struct floppy_state *fs = inode->i_bdev->bd_disk->private_data; + struct floppy_state *fs = disk->private_data; volatile struct swim3 *sw = fs->swim3; if (fs->ref_count > 0 && --fs->ref_count == 0) { swim3_action(fs, MOTOR_OFF); --- diff/drivers/block/swim_iop.c 2003-10-09 09:47:33.000000000 +0100 +++ source/drivers/block/swim_iop.c 2003-11-26 10:09:05.000000000 +0000 @@ -98,10 +98,10 @@ static void swimiop_status_update(int, struct swim_drvstatus *); static int swimiop_eject(struct floppy_state *fs); -static int floppy_ioctl(struct inode *inode, struct file *filp, +static int floppy_ioctl(struct block_device *bdev, struct file *filp, unsigned int cmd, unsigned long param); -static int floppy_open(struct inode *inode, struct file *filp); -static int floppy_release(struct inode *inode, struct file *filp); +static int floppy_open(struct block_device *bdev, struct file *filp); +static int floppy_release(struct gendisk *disk); static int floppy_check_change(struct gendisk *disk); static int floppy_revalidate(struct gendisk *disk); static int grab_drive(struct floppy_state *fs, enum swim_state state, @@ -348,10 +348,10 @@ static struct floppy_struct floppy_type = { 2880,18,2,80,0,0x1B,0x00,0xCF,0x6C,NULL }; /* 7 1.44MB 3.5" */ -static int floppy_ioctl(struct inode *inode, struct file *filp, +static int floppy_ioctl(struct block_device *bdev, struct file *filp, unsigned int cmd, unsigned long param) { - struct floppy_state *fs = inode->i_bdev->bd_disk->private_data; + struct floppy_state *fs = bdev->bd_disk->private_data; int err; if ((cmd & 0x80) && !capable(CAP_SYS_ADMIN)) @@ -372,15 +372,15 @@ return -ENOTTY; } -static int floppy_open(struct inode *inode, struct file *filp) +static int floppy_open(struct block_device *bdev, struct file *filp) { - struct floppy_state *fs = inode->i_bdev->bd_disk->private_data; + struct floppy_state *fs = bdev->bd_disk->private_data; if (fs->ref_count == -1 || filp->f_flags & O_EXCL) return -EBUSY; if ((filp->f_flags & O_NDELAY) == 0 && (filp->f_mode & 3)) { - check_disk_change(inode->i_bdev); + check_disk_change(bdev); if (fs->ejected) return -ENXIO; } @@ -396,9 +396,9 @@ return 0; } -static int floppy_release(struct inode *inode, struct file *filp) +static int floppy_release(struct gendisk *disk) { - struct floppy_state *fs = inode->i_bdev->bd_disk->private_data; + struct floppy_state *fs = disk->private_data; if (fs->ref_count > 0) fs->ref_count--; return 0; --- diff/drivers/block/umem.c 2003-09-30 15:46:12.000000000 +0100 +++ source/drivers/block/umem.c 
2003-11-26 10:09:05.000000000 +0000 @@ -153,7 +153,6 @@ }; static struct cardinfo cards[MM_MAXCARDS]; -static struct block_device_operations mm_fops; static struct timer_list battery_timer; static int num_cards = 0; @@ -818,10 +817,10 @@ -- mm_ioctl ----------------------------------------------------------------------------------- */ -static int mm_ioctl(struct inode *i, struct file *f, unsigned int cmd, unsigned long arg) +static int mm_ioctl(struct block_device *bdev, struct file *f, unsigned int cmd, unsigned long arg) { if (cmd == HDIO_GETGEO) { - struct cardinfo *card = i->i_bdev->bd_disk->private_data; + struct cardinfo *card = bdev->bd_disk->private_data; int size = card->mm_size * (1024 / MM_HARDSECT); struct hd_geometry geo; /* @@ -831,7 +830,7 @@ */ geo.heads = 64; geo.sectors = 32; - geo.start = get_start_sect(i->i_bdev); + geo.start = get_start_sect(bdev); geo.cylinders = size / (geo.heads * geo.sectors); if (copy_to_user((void *) arg, &geo, sizeof(geo))) --- diff/drivers/block/xd.c 2003-09-30 15:46:12.000000000 +0100 +++ source/drivers/block/xd.c 2003-11-26 10:09:05.000000000 +0000 @@ -322,9 +322,9 @@ } /* xd_ioctl: handle device ioctl's */ -static int xd_ioctl (struct inode *inode,struct file *file,u_int cmd,u_long arg) +static int xd_ioctl (struct block_device *bdev,struct file *file,u_int cmd,u_long arg) { - XD_INFO *p = inode->i_bdev->bd_disk->private_data; + XD_INFO *p = bdev->bd_disk->private_data; switch (cmd) { case HDIO_GETGEO: @@ -334,7 +334,7 @@ g.heads = p->heads; g.sectors = p->sectors; g.cylinders = p->cylinders; - g.start = get_start_sect(inode->i_bdev); + g.start = get_start_sect(bdev); return copy_to_user(geometry, &g, sizeof g) ? -EFAULT : 0; } case HDIO_SET_DMA: --- diff/drivers/block/xd.h 2003-05-21 11:50:14.000000000 +0100 +++ source/drivers/block/xd.h 2003-11-26 10:09:05.000000000 +0000 @@ -105,7 +105,7 @@ static u_char xd_initdrives (void (*init_drive)(u_char drive)); static void do_xd_request (request_queue_t * q); -static int xd_ioctl (struct inode *inode,struct file *file,unsigned int cmd,unsigned long arg); +static int xd_ioctl (struct block_device *bdev,struct file *file,unsigned int cmd,unsigned long arg); static int xd_readwrite (u_char operation,XD_INFO *disk,char *buffer,u_int block,u_int count); static void xd_recalibrate (u_char drive); --- diff/drivers/block/z2ram.c 2003-09-30 15:46:12.000000000 +0100 +++ source/drivers/block/z2ram.c 2003-11-26 10:09:05.000000000 +0000 @@ -67,7 +67,6 @@ static spinlock_t z2ram_lock = SPIN_LOCK_UNLOCKED; -static struct block_device_operations z2_fops; static struct gendisk *z2ram_gendisk; static void do_z2_request(request_queue_t *q) @@ -141,7 +140,7 @@ } static int -z2_open( struct inode *inode, struct file *filp ) +z2_open( struct block_device *bdev, struct file *filp ) { int device; int max_z2_map = ( Z2RAM_SIZE / Z2RAM_CHUNKSIZE ) * @@ -150,7 +149,7 @@ sizeof( z2ram_map[0] ); int rc = -ENOMEM; - device = iminor(inode); + device = MINOR(bdev->bd_dev); if ( current_device != -1 && current_device != device ) { @@ -301,8 +300,7 @@ return rc; } -static int -z2_release( struct inode *inode, struct file *filp ) +static int z2_release(struct gendisk *disk) { if ( current_device == -1 ) return 0; --- diff/drivers/cdrom/aztcd.c 2003-09-17 12:28:04.000000000 +0100 +++ source/drivers/cdrom/aztcd.c 2003-11-26 10:09:05.000000000 +0000 @@ -330,10 +330,10 @@ /* Kernel Interface Functions */ static int check_aztcd_media_change(struct gendisk *disk); -static int aztcd_ioctl(struct inode *ip, struct file *fp, unsigned int 
cmd, +static int aztcd_ioctl(struct block_device *bdev, struct file *fp, unsigned int cmd, unsigned long arg); -static int aztcd_open(struct inode *ip, struct file *fp); -static int aztcd_release(struct inode *inode, struct file *file); +static int aztcd_open(struct block_device *bdev, struct file *fp); +static int aztcd_release(struct gendisk *disk); static struct block_device_operations azt_fops = { .owner = THIS_MODULE, @@ -1153,7 +1153,7 @@ /* * Kernel IO-controls */ -static int aztcd_ioctl(struct inode *ip, struct file *fp, unsigned int cmd, +static int aztcd_ioctl(struct block_device *bdev, struct file *fp, unsigned int cmd, unsigned long arg) { int i; @@ -1171,8 +1171,6 @@ cmd, jiffies); printk("aztcd Status %x\n", getAztStatus()); #endif - if (!ip) - RETURNM("aztcd_ioctl 1", -EINVAL); if (getAztStatus() < 0) RETURNM("aztcd_ioctl 2", -EIO); if ((!aztTocUpToDate) || (aztDiskChanged)) { @@ -1624,7 +1622,7 @@ /* * Open the device special file. Check that a disk is in. */ -static int aztcd_open(struct inode *ip, struct file *fp) +static int aztcd_open(struct block_device *bdev, struct file *fp) { int st; @@ -1673,12 +1671,11 @@ /* * On close, we flush all azt blocks from the buffer cache. */ -static int aztcd_release(struct inode *inode, struct file *file) +static int aztcd_release(struct gendisk *disk) { #ifdef AZT_DEBUG printk("aztcd: executing aztcd_release\n"); - printk("inode: %p, device: %s file: %p\n", inode, - inode->i_bdev->bd_disk->disk_name, file); + printk("disk: %p, device: %s\n", disk, disk->disk_name); #endif if (!--azt_open_count) { azt_invalidate_buffers(); --- diff/drivers/cdrom/cdrom.c 2003-10-09 09:47:33.000000000 +0100 +++ source/drivers/cdrom/cdrom.c 2003-11-26 10:09:05.000000000 +0000 @@ -367,6 +367,7 @@ ENSURE(generic_packet, CDC_GENERIC_PACKET); cdi->mc_flags = 0; cdo->n_minors = 0; + cdi->for_data = 0; cdi->options = CDO_USE_FFLAGS; if (autoclose==1 && CDROM_CAN(CDC_CLOSE_TRAY)) @@ -416,7 +417,7 @@ * is in their own interest: device control becomes a lot easier * this way. */ -int cdrom_open(struct cdrom_device_info *cdi, struct inode *ip, struct file *fp) +int cdrom_open(struct cdrom_device_info *cdi, struct block_device *bdev, struct file *fp) { int ret; @@ -437,7 +438,7 @@ cdinfo(CD_OPEN, "Use count for \"/dev/%s\" now %d\n", cdi->name, cdi->use_count); /* Do this on open. Don't wait for mount, because they might not be mounting, but opening with O_NONBLOCK */ - check_disk_change(ip->i_bdev); + check_disk_change(bdev); return ret; } @@ -530,6 +531,7 @@ cdinfo(CD_OPEN, "door locked.\n"); } cdinfo(CD_OPEN, "device opened successfully.\n"); + cdi->for_data = 1; return ret; /* Something failed. Try to unlock the drive, because some drivers @@ -605,30 +607,29 @@ /* Admittedly, the logic below could be performed in a nicer way. 
*/ -int cdrom_release(struct cdrom_device_info *cdi, struct file *fp) +int cdrom_release(struct cdrom_device_info *cdi) { struct cdrom_device_ops *cdo = cdi->ops; - int opened_for_data; cdinfo(CD_CLOSE, "entering cdrom_release\n"); if (cdi->use_count > 0) cdi->use_count--; - if (cdi->use_count == 0) - cdinfo(CD_CLOSE, "Use count for \"/dev/%s\" now zero\n", cdi->name); - if (cdi->use_count == 0 && - cdo->capability & CDC_LOCK && !keeplocked) { + if (cdi->use_count) { + cdo->release(cdi); + return 0; + } + + cdinfo(CD_CLOSE, "Use count for \"/dev/%s\" now zero\n", cdi->name); + if (cdo->capability & CDC_LOCK && !keeplocked) { cdinfo(CD_CLOSE, "Unlocking door!\n"); cdo->lock_door(cdi, 0); } - opened_for_data = !(cdi->options & CDO_USE_FFLAGS) || - !(fp && fp->f_flags & O_NONBLOCK); cdo->release(cdi); - if (cdi->use_count == 0) { /* last process that closes dev*/ - if (opened_for_data && - cdi->options & CDO_AUTO_EJECT && CDROM_CAN(CDC_OPEN_TRAY)) - cdo->tray_move(cdi, 1); - } + if (cdi->for_data && + cdi->options & CDO_AUTO_EJECT && CDROM_CAN(CDC_OPEN_TRAY)) + cdo->tray_move(cdi, 1); + cdi->for_data = 0; return 0; } @@ -1433,14 +1434,14 @@ * these days. ATAPI / SCSI specific code now mainly resides in * mmc_ioct(). */ -int cdrom_ioctl(struct cdrom_device_info *cdi, struct inode *ip, +int cdrom_ioctl(struct cdrom_device_info *cdi, struct block_device *bdev, unsigned int cmd, unsigned long arg) { struct cdrom_device_ops *cdo = cdi->ops; int ret; /* Try the generic SCSI command ioctl's first.. */ - ret = scsi_cmd_ioctl(ip->i_bdev, cmd, arg); + ret = scsi_cmd_ioctl(bdev, cmd, arg); if (ret != -ENOTTY) return ret; @@ -1593,7 +1594,7 @@ cdinfo(CD_DO_IOCTL, "entering CDROM_RESET\n"); if (!CDROM_CAN(CDC_RESET)) return -ENOSYS; - invalidate_bdev(ip->i_bdev, 0); + invalidate_bdev(bdev, 0); return cdo->reset(cdi); } --- diff/drivers/cdrom/cdu31a.c 2003-09-17 12:28:04.000000000 +0100 +++ source/drivers/cdrom/cdu31a.c 2003-11-26 10:09:05.000000000 +0000 @@ -3167,20 +3167,20 @@ .name = "cdu31a" }; -static int scd_block_open(struct inode *inode, struct file *file) +static int scd_block_open(struct block_device *bdev, struct file *file) { - return cdrom_open(&scd_info, inode, file); + return cdrom_open(&scd_info, bdev, file); } -static int scd_block_release(struct inode *inode, struct file *file) +static int scd_block_release(struct gendisk *disk) { - return cdrom_release(&scd_info, file); + return cdrom_release(&scd_info); } -static int scd_block_ioctl(struct inode *inode, struct file *file, +static int scd_block_ioctl(struct block_device *bdev, struct file *file, unsigned cmd, unsigned long arg) { - return cdrom_ioctl(&scd_info, inode, cmd, arg); + return cdrom_ioctl(&scd_info, bdev, cmd, arg); } static int scd_block_media_changed(struct gendisk *disk) --- diff/drivers/cdrom/cm206.c 2003-09-17 12:28:04.000000000 +0100 +++ source/drivers/cdrom/cm206.c 2003-11-26 10:09:05.000000000 +0000 @@ -1350,20 +1350,20 @@ .name = "cm206", }; -static int cm206_block_open(struct inode *inode, struct file *file) +static int cm206_block_open(struct block_device *bdev, struct file *file) { - return cdrom_open(&cm206_info, inode, file); + return cdrom_open(&cm206_info, bdev, file); } -static int cm206_block_release(struct inode *inode, struct file *file) +static int cm206_block_release(struct gendisk *disk) { - return cdrom_release(&cm206_info, file); + return cdrom_release(&cm206_info); } -static int cm206_block_ioctl(struct inode *inode, struct file *file, +static int cm206_block_ioctl(struct block_device *bdev, struct 
file *file, unsigned cmd, unsigned long arg) { - return cdrom_ioctl(&cm206_info, inode, cmd, arg); + return cdrom_ioctl(&cm206_info, bdev, cmd, arg); } static int cm206_block_media_changed(struct gendisk *disk) --- diff/drivers/cdrom/gscd.c 2003-09-30 15:46:12.000000000 +0100 +++ source/drivers/cdrom/gscd.c 2003-11-26 10:09:05.000000000 +0000 @@ -91,10 +91,10 @@ /* Schnittstellen zum Kern/FS */ static void __do_gscd_request(unsigned long dummy); -static int gscd_ioctl(struct inode *, struct file *, unsigned int, +static int gscd_ioctl(struct block_device *, struct file *, unsigned int, unsigned long); -static int gscd_open(struct inode *, struct file *); -static int gscd_release(struct inode *, struct file *); +static int gscd_open(struct block_device *, struct file *); +static int gscd_release(struct gendisk *disk); static int check_gscd_med_chg(struct gendisk *disk); /* GoldStar Funktionen */ @@ -190,8 +190,8 @@ #endif -static int gscd_ioctl(struct inode *ip, struct file *fp, unsigned int cmd, - unsigned long arg) +static int gscd_ioctl(struct block_device *bdev, struct file *fp, + unsigned int cmd, unsigned long arg) { unsigned char to_do[10]; unsigned char dummy; @@ -338,7 +338,7 @@ * Open the device special file. Check that a disk is in. */ -static int gscd_open(struct inode *ip, struct file *fp) +static int gscd_open(struct block_device *bdev, struct file *fp) { int st; @@ -368,7 +368,7 @@ * On close, we flush all gscd blocks from the buffer cache. */ -static int gscd_release(struct inode *inode, struct file *file) +static int gscd_release(struct gendisk *disk) { #ifdef GSCD_DEBUG --- diff/drivers/cdrom/mcd.c 2003-09-17 12:28:04.000000000 +0100 +++ source/drivers/cdrom/mcd.c 2003-11-26 10:09:05.000000000 +0000 @@ -214,20 +214,20 @@ .name = "mcd", }; -static int mcd_block_open(struct inode *inode, struct file *file) +static int mcd_block_open(struct block_device *bdev, struct file *file) { - return cdrom_open(&mcd_info, inode, file); + return cdrom_open(&mcd_info, bdev, file); } -static int mcd_block_release(struct inode *inode, struct file *file) +static int mcd_block_release(struct gendisk *disk) { - return cdrom_release(&mcd_info, file); + return cdrom_release(&mcd_info); } -static int mcd_block_ioctl(struct inode *inode, struct file *file, +static int mcd_block_ioctl(struct block_device *bdev, struct file *file, unsigned cmd, unsigned long arg) { - return cdrom_ioctl(&mcd_info, inode, cmd, arg); + return cdrom_ioctl(&mcd_info, bdev, cmd, arg); } static int mcd_block_media_changed(struct gendisk *disk) --- diff/drivers/cdrom/mcdx.c 2003-09-17 12:28:04.000000000 +0100 +++ source/drivers/cdrom/mcdx.c 2003-11-26 10:09:05.000000000 +0000 @@ -221,23 +221,23 @@ int mcdx_init(void); void do_mcdx_request(request_queue_t * q); -static int mcdx_block_open(struct inode *inode, struct file *file) +static int mcdx_block_open(struct block_device *bdev, struct file *file) { - struct s_drive_stuff *p = inode->i_bdev->bd_disk->private_data; - return cdrom_open(&p->info, inode, file); + struct s_drive_stuff *p = bdev->bd_disk->private_data; + return cdrom_open(&p->info, bdev, file); } -static int mcdx_block_release(struct inode *inode, struct file *file) +static int mcdx_block_release(struct gendisk *disk) { - struct s_drive_stuff *p = inode->i_bdev->bd_disk->private_data; - return cdrom_release(&p->info, file); + struct s_drive_stuff *p = disk->private_data; + return cdrom_release(&p->info); } -static int mcdx_block_ioctl(struct inode *inode, struct file *file, +static int mcdx_block_ioctl(struct 
block_device *bdev, struct file *file, unsigned cmd, unsigned long arg) { - struct s_drive_stuff *p = inode->i_bdev->bd_disk->private_data; - return cdrom_ioctl(&p->info, inode, cmd, arg); + struct s_drive_stuff *p = bdev->bd_disk->private_data; + return cdrom_ioctl(&p->info, bdev, cmd, arg); } static int mcdx_block_media_changed(struct gendisk *disk) --- diff/drivers/cdrom/optcd.c 2003-09-17 12:28:04.000000000 +0100 +++ source/drivers/cdrom/optcd.c 2003-11-26 10:09:05.000000000 +0000 @@ -1713,16 +1713,13 @@ /* VFS calls */ -static int opt_ioctl(struct inode *ip, struct file *fp, +static int opt_ioctl(struct block_device *bdev, struct file *fp, unsigned int cmd, unsigned long arg) { int status, err, retval = 0; DEBUG((DEBUG_VFS, "starting opt_ioctl")); - if (!ip) - return -EINVAL; - if (cmd == CDROMRESET) return cdromreset(); @@ -1844,7 +1841,7 @@ static int open_count = 0; /* Open device special file; check that a disk is in. */ -static int opt_open(struct inode *ip, struct file *fp) +static int opt_open(struct block_device *bdev, struct file *fp) { DEBUG((DEBUG_VFS, "starting opt_open")); @@ -1904,13 +1901,12 @@ /* Release device special file; flush all blocks from the buffer cache */ -static int opt_release(struct inode *ip, struct file *fp) +static int opt_release(struct gendisk *disk) { int status; DEBUG((DEBUG_VFS, "executing opt_release")); - DEBUG((DEBUG_VFS, "inode: %p, device: %s, file: %p\n", - ip, ip->i_bdev->bd_disk->disk_name, fp)); + DEBUG((DEBUG_VFS, "disk: %p, device: %s\n", disk, disk->disk_name)); if (!--open_count) { toc_uptodate = 0; --- diff/drivers/cdrom/sbpcd.c 2003-10-09 09:47:33.000000000 +0100 +++ source/drivers/cdrom/sbpcd.c 2003-11-26 10:09:05.000000000 +0000 @@ -5356,23 +5356,23 @@ } /*==========================================================================*/ -static int sbpcd_block_open(struct inode *inode, struct file *file) +static int sbpcd_block_open(struct block_device *bdev, struct file *file) { - struct sbpcd_drive *p = inode->i_bdev->bd_disk->private_data; - return cdrom_open(p->sbpcd_infop, inode, file); + struct sbpcd_drive *p = bdev->bd_disk->private_data; + return cdrom_open(p->sbpcd_infop, bdev, file); } -static int sbpcd_block_release(struct inode *inode, struct file *file) +static int sbpcd_block_release(struct gendisk *disk) { - struct sbpcd_drive *p = inode->i_bdev->bd_disk->private_data; - return cdrom_release(p->sbpcd_infop, file); + struct sbpcd_drive *p = disk->private_data; + return cdrom_release(p->sbpcd_infop); } -static int sbpcd_block_ioctl(struct inode *inode, struct file *file, +static int sbpcd_block_ioctl(struct block_device *bdev, struct file *file, unsigned cmd, unsigned long arg) { - struct sbpcd_drive *p = inode->i_bdev->bd_disk->private_data; - return cdrom_ioctl(p->sbpcd_infop, inode, cmd, arg); + struct sbpcd_drive *p = bdev->bd_disk->private_data; + return cdrom_ioctl(p->sbpcd_infop, bdev, cmd, arg); } static int sbpcd_block_media_changed(struct gendisk *disk) --- diff/drivers/cdrom/sjcd.c 2003-10-27 09:20:37.000000000 +0000 +++ source/drivers/cdrom/sjcd.c 2003-11-26 10:09:05.000000000 +0000 @@ -713,16 +713,13 @@ /* * Do some user commands. */ -static int sjcd_ioctl(struct inode *ip, struct file *fp, +static int sjcd_ioctl(struct block_device *bdev, struct file *fp, unsigned int cmd, unsigned long arg) { #if defined( SJCD_TRACE ) printk("SJCD:ioctl\n"); #endif - if (ip == NULL) - return (-EINVAL); - sjcd_get_status(); if (!sjcd_status_valid) return (-EIO); @@ -1526,7 +1523,7 @@ /* * Open the device special file. 
Check disk is in. */ -static int sjcd_open(struct inode *ip, struct file *fp) +static int sjcd_open(struct block_device *bdev, struct file *fp) { /* * Check the presence of device. @@ -1611,7 +1608,7 @@ /* * On close, we flush all sjcd blocks from the buffer cache. */ -static int sjcd_release(struct inode *inode, struct file *file) +static int sjcd_release(struct gendisk *disk) { int s; --- diff/drivers/cdrom/sonycd535.c 2003-10-09 09:47:33.000000000 +0100 +++ source/drivers/cdrom/sonycd535.c 2003-11-26 10:09:05.000000000 +0000 @@ -201,7 +201,7 @@ static int read_subcode(void); static void sony_get_toc(void); -static int cdu_open(struct inode *inode, struct file *filp); +static int cdu_open(struct block_device *bdev, struct file *filp); static inline unsigned int int_to_bcd(unsigned int val); static unsigned int bcd_to_int(unsigned int bcd); static int do_sony_cmd(Byte * cmd, int nCmd, Byte status[2], @@ -1061,7 +1061,7 @@ * The big ugly ioctl handler. */ static int -cdu_ioctl(struct inode *inode, +cdu_ioctl(struct block_device *bdev, struct file *file, unsigned int cmd, unsigned long arg) @@ -1360,9 +1360,7 @@ * Open the drive for operations. Spin the drive up and read the table of * contents if these have not already been done. */ -static int -cdu_open(struct inode *inode, - struct file *filp) +static int cdu_open(struct block_device *bdev, struct file *filp) { Byte status[2], cmd_buff[2]; @@ -1385,7 +1383,7 @@ sony_inuse = 0; return -EIO; } - check_disk_change(inode->i_bdev); + check_disk_change(bdev); sony_usage++; #ifdef LOCK_DOORS @@ -1402,9 +1400,7 @@ * Close the drive. Spin it down if no task is using it. The spin * down will fail if playing audio, so audio play is OK. */ -static int -cdu_release(struct inode *inode, - struct file *filp) +static int cdu_release(struct gendisk *disk) { Byte status[2], cmd_no; --- diff/drivers/char/agp/alpha-agp.c 2003-09-30 15:46:12.000000000 +0100 +++ source/drivers/char/agp/alpha-agp.c 2003-11-26 10:09:05.000000000 +0000 @@ -13,7 +13,7 @@ static struct page *alpha_core_agp_vm_nopage(struct vm_area_struct *vma, unsigned long address, - int write_access) + int *type) { alpha_agp_info *agp = agp_bridge->dev_private_data; dma_addr_t dma_addr; @@ -30,6 +30,8 @@ */ page = virt_to_page(__va(pa)); get_page(page); + if (type) + *type = VM_FAULT_MINOR; return page; } --- diff/drivers/char/drm/drmP.h 2003-09-30 15:46:12.000000000 +0100 +++ source/drivers/char/drm/drmP.h 2003-11-26 10:09:05.000000000 +0000 @@ -760,16 +760,16 @@ /* Mapping support (drm_vm.h) */ extern struct page *DRM(vm_nopage)(struct vm_area_struct *vma, unsigned long address, - int write_access); + int *type); extern struct page *DRM(vm_shm_nopage)(struct vm_area_struct *vma, unsigned long address, - int write_access); + int *type); extern struct page *DRM(vm_dma_nopage)(struct vm_area_struct *vma, unsigned long address, - int write_access); + int *type); extern struct page *DRM(vm_sg_nopage)(struct vm_area_struct *vma, unsigned long address, - int write_access); + int *type); extern void DRM(vm_open)(struct vm_area_struct *vma); extern void DRM(vm_close)(struct vm_area_struct *vma); extern void DRM(vm_shm_close)(struct vm_area_struct *vma); --- diff/drivers/char/drm/drm_vm.h 2003-07-22 18:54:27.000000000 +0100 +++ source/drivers/char/drm/drm_vm.h 2003-11-26 10:09:05.000000000 +0000 @@ -76,7 +76,7 @@ */ struct page *DRM(vm_nopage)(struct vm_area_struct *vma, unsigned long address, - int write_access) + int *type) { #if __REALLY_HAVE_AGP drm_file_t *priv = vma->vm_file->private_data; @@ 
-133,6 +133,8 @@ baddr, __va(agpmem->memory->memory[offset]), offset, atomic_read(&page->count)); + if (type) + *type = VM_FAULT_MINOR; return page; } vm_nopage_error: @@ -154,7 +156,7 @@ */ struct page *DRM(vm_shm_nopage)(struct vm_area_struct *vma, unsigned long address, - int write_access) + int *type) { drm_map_t *map = (drm_map_t *)vma->vm_private_data; unsigned long offset; @@ -170,6 +172,8 @@ if (!page) return NOPAGE_OOM; get_page(page); + if (type) + *type = VM_FAULT_MINOR; DRM_DEBUG("shm_nopage 0x%lx\n", address); return page; @@ -268,7 +272,7 @@ */ struct page *DRM(vm_dma_nopage)(struct vm_area_struct *vma, unsigned long address, - int write_access) + int *type) { drm_file_t *priv = vma->vm_file->private_data; drm_device_t *dev = priv->dev; @@ -287,6 +291,8 @@ (offset & (~PAGE_MASK)))); get_page(page); + if (type) + *type = VM_FAULT_MINOR; DRM_DEBUG("dma_nopage 0x%lx (page %lu)\n", address, page_nr); return page; @@ -304,7 +310,7 @@ */ struct page *DRM(vm_sg_nopage)(struct vm_area_struct *vma, unsigned long address, - int write_access) + int *type) { drm_map_t *map = (drm_map_t *)vma->vm_private_data; drm_file_t *priv = vma->vm_file->private_data; @@ -325,6 +331,8 @@ page_offset = (offset >> PAGE_SHIFT) + (map_offset >> PAGE_SHIFT); page = entry->pagelist[page_offset]; get_page(page); + if (type) + *type = VM_FAULT_MINOR; return page; } --- diff/drivers/char/keyboard.c 2003-10-27 09:20:43.000000000 +0000 +++ source/drivers/char/keyboard.c 2003-11-26 10:09:05.000000000 +0000 @@ -941,16 +941,16 @@ 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, - 80, 81, 82, 83, 43, 85, 86, 87, 88,115,119,120,121,375,123, 90, - 284,285,309,298,312, 91,327,328,329,331,333,335,336,337,338,339, - 367,288,302,304,350, 92,334,512,116,377,109,111,373,347,348,349, - 360, 93, 94, 95, 98,376,100,101,321,316,354,286,289,102,351,355, + 80, 81, 82, 83, 84, 93, 86, 87, 88, 94, 95, 85,259,375,260, 90, + 284,285,309,311,312, 91,327,328,329,331,333,335,336,337,338,339, + 367,288,302,304,350, 89,334,326,116,377,109,111,126,347,348,349, + 360,261,262,263,298,376,100,101,321,316,373,286,289,102,351,355, 103,104,105,275,287,279,306,106,274,107,294,364,358,363,362,361, - 291,108,381,281,290,272,292,305,280, 99,112,257,258,359,270,114, - 118,117,125,374,379,115,112,125,121,123,264,265,266,267,268,269, - 271,273,276,277,278,282,283,295,296,297,299,300,301,293,303,307, - 308,310,313,314,315,317,318,319,320,357,322,323,324,325,326,330, - 332,340,365,342,343,344,345,346,356,113,341,368,369,370,371,372 }; + 291,108,381,281,290,272,292,305,280, 99,112,257,258,359,113,114, + 264,117,271,374,379,115,125,273,121,123, 92,265,266,267,268,269, + 120,119,118,277,278,282,283,295,296,297,299,300,301,293,303,307, + 308,310,313,314,315,317,318,319,320,357,322,323,324,325,276,330, + 332,340,365,342,343,344,345,346,356,270,341,368,369,370,371,372 }; #ifdef CONFIG_MAC_EMUMOUSEBTN extern int mac_hid_mouse_emulate_buttons(int, int, int); @@ -972,11 +972,18 @@ if (keycode > 255 || !x86_keycodes[keycode]) return -1; - if (keycode == KEY_PAUSE) { - put_queue(vc, 0xe1); - put_queue(vc, 0x1d | up_flag); - put_queue(vc, 0x45 | up_flag); - return 0; + switch (keycode) { + case KEY_PAUSE: + put_queue(vc, 0xe1); + put_queue(vc, 0x1d | up_flag); + put_queue(vc, 0x45 | up_flag); + return 0; + case KEY_LANG1: + if (!up_flag) put_queue(vc, 0xf1); + return 0; + case KEY_LANG2: + if (!up_flag) put_queue(vc, 0xf2); 
+ return 0; } if (keycode == KEY_SYSRQ && sysrq_alt) { @@ -1052,6 +1059,9 @@ } if (sysrq_down && down && !rep) { handle_sysrq(kbd_sysrq_xlate[keycode], regs, tty); +#ifdef CONFIG_KGDB_SYSRQ + sysrq_down = 0; /* in case we miss the "up" event */ +#endif return; } #endif --- diff/drivers/char/mem.c 2003-10-27 09:20:43.000000000 +0000 +++ source/drivers/char/mem.c 2003-11-26 10:09:05.000000000 +0000 @@ -40,6 +40,7 @@ extern void tapechar_init(void); #endif +#ifdef pgprot_noncached /* * Architectures vary in how they handle caching for addresses * outside of main memory. @@ -65,19 +66,21 @@ && addr >= __pa(high_memory); #elif defined(CONFIG_IA64) /* - * On ia64, we ignore O_SYNC because we cannot tolerate memory attribute aliases. + * On ia64, we ignore O_SYNC because we cannot tolerate memory + * attribute aliases. */ return !(efi_mem_attributes(addr) & EFI_MEMORY_WB); #else /* - * Accessing memory above the top the kernel knows about or through a file pointer - * that was marked O_SYNC will be done non-cached. + * Accessing memory above the top the kernel knows about or through a + * file pointer that was marked O_SYNC will be done non-cached. */ if (file->f_flags & O_SYNC) return 1; return addr >= __pa(high_memory); #endif } +#endif /* pgprot_noncached */ #ifndef ARCH_HAS_VALID_PHYS_ADDR_RANGE static inline int valid_phys_addr_range(unsigned long addr, size_t *count) @@ -167,28 +170,24 @@ return do_write_mem(file, __va(p), p, buf, count, ppos); } -static int mmap_mem(struct file * file, struct vm_area_struct * vma) +static int mmap_mem(struct file *file, struct vm_area_struct *vma) { unsigned long offset = vma->vm_pgoff << PAGE_SHIFT; - int uncached; - uncached = uncached_access(file, offset); #ifdef pgprot_noncached - if (uncached) + if (uncached_access(file, offset)) vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot); #endif - /* Don't try to swap out physical pages.. */ - vma->vm_flags |= VM_RESERVED; - /* - * Don't dump addresses that are not real memory to a core file. + * Don't try to swap out physical pages.. + * And treat /dev/mem mappings as "IO" regions: they may not + * describe valid pageframes. 
*/ - if (uncached) - vma->vm_flags |= VM_IO; + vma->vm_flags |= VM_RESERVED|VM_IO; - if (remap_page_range(vma, vma->vm_start, offset, vma->vm_end-vma->vm_start, - vma->vm_page_prot)) + if (remap_page_range(vma, vma->vm_start, offset, + vma->vm_end-vma->vm_start, vma->vm_page_prot)) return -EAGAIN; return 0; } --- diff/drivers/char/raw.c 2003-10-09 09:47:34.000000000 +0100 +++ source/drivers/char/raw.c 2003-11-26 10:09:05.000000000 +0000 @@ -74,6 +74,7 @@ goto out; } filp->f_flags |= O_DIRECT; + filp->f_mapping = bdev->bd_inode->i_mapping; if (++raw_devices[minor].inuse == 1) filp->f_dentry->d_inode->i_mapping = bdev->bd_inode->i_mapping; --- diff/drivers/char/sn_serial.c 2003-10-27 09:20:43.000000000 +0000 +++ source/drivers/char/sn_serial.c 2003-11-26 10:09:05.000000000 +0000 @@ -21,8 +21,9 @@ #include <linux/sysrq.h> #include <linux/circ_buf.h> #include <linux/serial_reg.h> +#include <asm/uaccess.h> #include <asm/sn/sn_sal.h> -#include <asm/sn/pci/pciio.h> /* this is needed for get_console_nasid */ +#include <asm/sn/pci/pciio.h> #include <asm/sn/simulator.h> #include <asm/sn/sn2/sn_private.h> @@ -771,7 +772,7 @@ off_t begin = 0; len += sprintf(page, "sn_serial: nasid:%d irq:%d tx:%d rx:%d\n", - get_console_nasid(), sn_sal_irq, + ia64_sn_get_console_nasid(), sn_sal_irq, sn_total_tx_count, sn_total_rx_count); *eof = 1; @@ -813,6 +814,9 @@ { unsigned long flags; + if (sn_sal_is_asynch) + return; + sn_debug_printf("sn_serial: about to switch to asynchronous console\n"); /* without early_printk, we may be invoked late enough to race --- diff/drivers/char/sysrq.c 2003-10-09 09:47:16.000000000 +0100 +++ source/drivers/char/sysrq.c 2003-11-26 10:09:05.000000000 +0000 @@ -35,6 +35,25 @@ #include <linux/spinlock.h> #include <asm/ptrace.h> +#ifdef CONFIG_KGDB_SYSRQ + +#define GDB_OP &kgdb_op +static void kgdb_sysrq(int key, struct pt_regs *pt_regs, struct tty_struct *tty) +{ + printk("kgdb sysrq\n"); + breakpoint(); +} + +static struct sysrq_key_op kgdb_op = { + .handler = kgdb_sysrq, + .help_msg = "kGdb|Fgdb", + .action_msg = "Debug breakpoint\n", +}; + +#else +#define GDB_OP NULL +#endif + extern void reset_vc(unsigned int); @@ -238,8 +257,8 @@ /* c */ NULL, /* d */ NULL, /* e */ &sysrq_term_op, -/* f */ NULL, -/* g */ NULL, +/* f */ GDB_OP, +/* g */ GDB_OP, /* h */ NULL, /* i */ &sysrq_kill_op, /* j */ NULL, --- diff/drivers/char/watchdog/i810-tco.c 2003-09-30 15:46:13.000000000 +0100 +++ source/drivers/char/watchdog/i810-tco.c 2003-11-26 10:09:05.000000000 +0000 @@ -232,9 +232,8 @@ /* someone wrote to us, we should reload the timer */ tco_timer_reload (); - return 1; } - return 0; + return len; } static int i810tco_ioctl (struct inode *inode, struct file *file, --- diff/drivers/char/watchdog/ib700wdt.c 2003-09-17 12:28:05.000000000 +0100 +++ source/drivers/char/watchdog/ib700wdt.c 2003-11-26 10:09:05.000000000 +0000 @@ -161,9 +161,8 @@ } } ibwdt_ping(); - return 1; } - return 0; + return count; } static int --- diff/drivers/char/watchdog/indydog.c 2003-08-20 14:16:27.000000000 +0100 +++ source/drivers/char/watchdog/indydog.c 2003-11-26 10:09:05.000000000 +0000 @@ -113,9 +113,8 @@ } } indydog_ping(); - return 1; } - return 0; + return len; } static int indydog_ioctl(struct inode *inode, struct file *file, --- diff/drivers/char/watchdog/machzwd.c 2003-09-17 12:28:05.000000000 +0100 +++ source/drivers/char/watchdog/machzwd.c 2003-11-26 10:09:05.000000000 +0000 @@ -343,10 +343,9 @@ next_heartbeat = jiffies + ZF_USER_TIMEO; dprintk("user ping at %ld\n", jiffies); - return 1; } - return 0; + return 
count; } static int zf_ioctl(struct inode *inode, struct file *file, unsigned int cmd, --- diff/drivers/char/watchdog/mixcomwd.c 2003-08-20 14:16:27.000000000 +0100 +++ source/drivers/char/watchdog/mixcomwd.c 2003-11-26 10:09:05.000000000 +0000 @@ -156,9 +156,8 @@ } } mixcomwd_ping(); - return 1; } - return 0; + return len; } static int mixcomwd_ioctl(struct inode *inode, struct file *file, --- diff/drivers/char/watchdog/pcwd.c 2003-09-17 12:28:05.000000000 +0100 +++ source/drivers/char/watchdog/pcwd.c 2003-11-26 10:09:05.000000000 +0000 @@ -419,9 +419,8 @@ } } pcwd_send_heartbeat(); - return 1; } - return 0; + return len; } static int pcwd_open(struct inode *ino, struct file *filep) --- diff/drivers/char/watchdog/sa1100_wdt.c 2003-08-20 14:16:27.000000000 +0100 +++ source/drivers/char/watchdog/sa1100_wdt.c 2003-11-26 10:09:05.000000000 +0000 @@ -106,7 +106,7 @@ OSMR3 = OSCR + pre_margin; } - return len ? 1 : 0; + return len; } static struct watchdog_info ident = { --- diff/drivers/char/watchdog/softdog.c 2003-08-20 14:16:27.000000000 +0100 +++ source/drivers/char/watchdog/softdog.c 2003-11-26 10:09:05.000000000 +0000 @@ -155,9 +155,8 @@ } } mod_timer(&watchdog_ticktock, jiffies+(soft_margin*HZ)); - return 1; } - return 0; + return len; } static int softdog_ioctl(struct inode *inode, struct file *file, --- diff/drivers/char/watchdog/wdt.c 2003-09-17 12:28:05.000000000 +0100 +++ source/drivers/char/watchdog/wdt.c 2003-11-26 10:09:05.000000000 +0000 @@ -265,9 +265,8 @@ } } wdt_ping(); - return 1; } - return 0; + return count; } /** --- diff/drivers/i2c/i2c-dev.c 2003-10-27 09:20:37.000000000 +0000 +++ source/drivers/i2c/i2c-dev.c 2003-11-26 10:09:05.000000000 +0000 @@ -223,7 +223,7 @@ /* Put an arbritrary limit on the number of messages that can * be sent at once */ - if (rdwr_arg.nmsgs > 42) + if (rdwr_arg.nmsgs > I2C_RDRW_IOCTL_MAX_MSGS) return -EINVAL; rdwr_pa = (struct i2c_msg *) --- diff/drivers/ide/Kconfig 2003-11-25 15:24:57.000000000 +0000 +++ source/drivers/ide/Kconfig 2003-11-26 10:09:05.000000000 +0000 @@ -745,6 +745,14 @@ This driver adds PIO/(U)DMA support for the ServerWorks OSB4/CSB5 chipsets. +config BLK_DEV_SGIIOC4 + tristate "Silicon Graphics IOC4 chipset support" + depends on IA64_SGI_SN2 + help + This driver adds PIO & MultiMode DMA-2 support for the SGI IOC4 + chipset, which has one channel and can support two devices. + Please say Y here if you have an Altix System from SGI. + config BLK_DEV_SIIMAGE tristate "Silicon Image chipset support" help --- diff/drivers/ide/arm/icside.c 2003-08-26 10:00:52.000000000 +0100 +++ source/drivers/ide/arm/icside.c 2003-11-26 10:09:05.000000000 +0000 @@ -214,7 +214,7 @@ #define NR_ENTRIES 256 #define TABLE_SIZE (NR_ENTRIES * 8) -static void ide_build_sglist(ide_drive_t *drive, struct request *rq) +static void icside_build_sglist(ide_drive_t *drive, struct request *rq) { ide_hwif_t *hwif = drive->hwif; struct icside_state *state = hwif->hwif_data; @@ -543,7 +543,7 @@ BUG_ON(hwif->sg_dma_active); BUG_ON(dma_channel_active(hwif->hw.dma)); - ide_build_sglist(drive, rq); + icside_build_sglist(drive, rq); /* * Ensure that we have the right interrupt routed. 
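
The watchdog hunks above (i810-tco, ib700wdt, indydog, machzwd, mixcomwd,
pcwd, sa1100_wdt, softdog, wdt) all make the same correction: a driver's
write() method must return the number of bytes it actually consumed, not a
hard-coded 1, otherwise userspace always sees a one-byte short write no
matter how much data it passed in.  A minimal sketch of the resulting
pattern; the mydog_* names and the ping helper are illustrative only, not
part of this patch:

	#include <linux/fs.h>		/* struct file, loff_t */

	static void mydog_ping(void);	/* hardware heartbeat, defined elsewhere */

	static ssize_t mydog_write(struct file *file, const char *buf,
				   size_t len, loff_t *ppos)
	{
		/* Can't seek (pwrite) on a watchdog device */
		if (ppos != &file->f_pos)
			return -ESPIPE;

		if (len)
			mydog_ping();	/* any write pats the watchdog */

		return len;		/* report every byte as consumed */
	}

As in the hunks above, a zero-length write simply returns 0; any other
write claims the full count, since the data itself is discarded and only
the act of writing matters.
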
--- diff/drivers/ide/ide-cd.c 2003-09-30 15:46:14.000000000 +0100 +++ source/drivers/ide/ide-cd.c 2003-11-26 10:09:05.000000000 +0000 @@ -3331,39 +3331,39 @@ .complete_power_step = ide_cdrom_complete_power_step, }; -static int idecd_open(struct inode * inode, struct file * file) +static int idecd_open(struct block_device *bdev, struct file * file) { - ide_drive_t *drive = inode->i_bdev->bd_disk->private_data; + ide_drive_t *drive = bdev->bd_disk->private_data; struct cdrom_info *info = drive->driver_data; int rc = -ENOMEM; drive->usage++; if (!info->buffer) - info->buffer = (char *) kmalloc(SECTOR_BUFFER_SIZE, GFP_KERNEL); - if (!info->buffer || (rc = cdrom_open(&info->devinfo, inode, file))) + info->buffer = kmalloc(SECTOR_BUFFER_SIZE, + GFP_KERNEL|__GFP_REPEAT); + if (!info->buffer || (rc = cdrom_open(&info->devinfo, bdev, file))) drive->usage--; return rc; } -static int idecd_release(struct inode * inode, struct file * file) +static int idecd_release(struct gendisk *disk) { - ide_drive_t *drive = inode->i_bdev->bd_disk->private_data; + ide_drive_t *drive = disk->private_data; struct cdrom_info *info = drive->driver_data; - cdrom_release (&info->devinfo, file); + cdrom_release(&info->devinfo); drive->usage--; return 0; } -static int idecd_ioctl (struct inode *inode, struct file *file, +static int idecd_ioctl (struct block_device *bdev, struct file *file, unsigned int cmd, unsigned long arg) { - struct block_device *bdev = inode->i_bdev; ide_drive_t *drive = bdev->bd_disk->private_data; int err = generic_ide_ioctl(bdev, cmd, arg); if (err == -EINVAL) { struct cdrom_info *info = drive->driver_data; - err = cdrom_ioctl(&info->devinfo, inode, cmd, arg); + err = cdrom_ioctl(&info->devinfo, bdev, cmd, arg); } return err; } --- diff/drivers/ide/ide-disk.c 2003-09-17 12:28:05.000000000 +0100 +++ source/drivers/ide/ide-disk.c 2003-11-26 10:09:05.000000000 +0000 @@ -1734,9 +1734,9 @@ .complete_power_step = idedisk_complete_power_step, }; -static int idedisk_open(struct inode *inode, struct file *filp) +static int idedisk_open(struct block_device *bdev, struct file *filp) { - ide_drive_t *drive = inode->i_bdev->bd_disk->private_data; + ide_drive_t *drive = bdev->bd_disk->private_data; drive->usage++; if (drive->removable && drive->usage == 1) { ide_task_t args; @@ -1744,7 +1744,7 @@ memset(&args, 0, sizeof(ide_task_t)); args.tfRegister[IDE_COMMAND_OFFSET] = WIN_DOORLOCK; args.command_type = ide_cmd_type_parser(&args); - check_disk_change(inode->i_bdev); + check_disk_change(bdev); /* * Ignore the return code from door_lock, * since the open() has already succeeded, @@ -1782,9 +1782,9 @@ return 0; } -static int idedisk_release(struct inode *inode, struct file *filp) +static int idedisk_release(struct gendisk *disk) { - ide_drive_t *drive = inode->i_bdev->bd_disk->private_data; + ide_drive_t *drive = disk->private_data; if (drive->removable && drive->usage == 1) { ide_task_t args; memset(&args, 0, sizeof(ide_task_t)); @@ -1798,10 +1798,9 @@ return 0; } -static int idedisk_ioctl(struct inode *inode, struct file *file, +static int idedisk_ioctl(struct block_device *bdev, struct file *file, unsigned int cmd, unsigned long arg) { - struct block_device *bdev = inode->i_bdev; return generic_ide_ioctl(bdev, cmd, arg); } --- diff/drivers/ide/ide-dma.c 2003-10-27 09:20:37.000000000 +0000 +++ source/drivers/ide/ide-dma.c 2003-11-26 10:09:05.000000000 +0000 @@ -200,8 +200,8 @@ * kernel provide the necessary cache management so that we can * operate in a portable fashion */ - -static int ide_build_sglist 
(ide_drive_t *drive, struct request *rq) + +int ide_build_sglist(ide_drive_t *drive, struct request *rq) { ide_hwif_t *hwif = HWIF(drive); struct scatterlist *sg = hwif->sg_table; @@ -220,6 +220,8 @@ return pci_map_sg(hwif->pci_dev, sg, nents, hwif->sg_dma_direction); } +EXPORT_SYMBOL_GPL(ide_build_sglist); + /** * ide_raw_build_sglist - map IDE scatter gather for DMA * @drive: the drive to build the DMA table for @@ -230,8 +232,8 @@ * of the kernel provide the necessary cache management so that we can * operate in a portable fashion */ - -static int ide_raw_build_sglist (ide_drive_t *drive, struct request *rq) + +int ide_raw_build_sglist(ide_drive_t *drive, struct request *rq) { ide_hwif_t *hwif = HWIF(drive); struct scatterlist *sg = hwif->sg_table; @@ -270,6 +272,8 @@ return pci_map_sg(hwif->pci_dev, sg, nents, hwif->sg_dma_direction); } +EXPORT_SYMBOL_GPL(ide_raw_build_sglist); + /** * ide_build_dmatable - build IDE DMA table * --- diff/drivers/ide/ide-floppy.c 2003-10-27 09:20:37.000000000 +0000 +++ source/drivers/ide/ide-floppy.c 2003-11-26 10:09:05.000000000 +0000 @@ -1866,9 +1866,9 @@ .drives = LIST_HEAD_INIT(idefloppy_driver.drives), }; -static int idefloppy_open(struct inode *inode, struct file *filp) +static int idefloppy_open(struct block_device *bdev, struct file *filp) { - ide_drive_t *drive = inode->i_bdev->bd_disk->private_data; + ide_drive_t *drive = bdev->bd_disk->private_data; idefloppy_floppy_t *floppy = drive->driver_data; idefloppy_pc_t pc; @@ -1908,7 +1908,7 @@ idefloppy_create_prevent_cmd(&pc, 1); (void) idefloppy_queue_pc_tail(drive, &pc); } - check_disk_change(inode->i_bdev); + check_disk_change(bdev); } else if (test_bit(IDEFLOPPY_FORMAT_IN_PROGRESS, &floppy->flags)) { drive->usage--; return -EBUSY; @@ -1916,9 +1916,9 @@ return 0; } -static int idefloppy_release(struct inode *inode, struct file *filp) +static int idefloppy_release(struct gendisk *disk) { - ide_drive_t *drive = inode->i_bdev->bd_disk->private_data; + ide_drive_t *drive = disk->private_data; idefloppy_pc_t pc; debug_log(KERN_INFO "Reached idefloppy_release\n"); @@ -1938,10 +1938,9 @@ return 0; } -static int idefloppy_ioctl(struct inode *inode, struct file *file, +static int idefloppy_ioctl(struct block_device *bdev, struct file *file, unsigned int cmd, unsigned long arg) { - struct block_device *bdev = inode->i_bdev; ide_drive_t *drive = bdev->bd_disk->private_data; idefloppy_floppy_t *floppy = drive->driver_data; int err = generic_ide_ioctl(bdev, cmd, arg); --- diff/drivers/ide/ide-tape.c 2003-11-25 15:24:57.000000000 +0000 +++ source/drivers/ide/ide-tape.c 2003-11-26 10:09:05.000000000 +0000 @@ -1,5 +1,5 @@ /* - * linux/drivers/ide/ide-tape.c Version 1.17b Oct, 2002 + * linux/drivers/ide/ide-tape.c Version 1.18 Nov, 2003 * * Copyright (C) 1995 - 1999 Gadi Oxman <gadio@netvision.net.il> * @@ -422,7 +422,7 @@ * sharing a (fast) ATA-2 disk with any (slow) new ATAPI device. */ -#define IDETAPE_VERSION "1.17b-ac1" +#define IDETAPE_VERSION "1.18" #include <linux/config.h> #include <linux/module.h> @@ -450,9 +450,6 @@ #include <asm/unaligned.h> #include <asm/bitops.h> - -#define NO_LONGER_REQUIRED (1) - /* * OnStream support */ @@ -652,9 +649,11 @@ #define IDETAPE_PC_STACK (10 + IDETAPE_MAX_PC_RETRIES) /* - * Some tape drives require a long irq timeout + * Some drives (for example, Seagate STT3401A Travan) require a very long + * timeout, because they don't return an interrupt or clear their busy bit + * until after the command completes (even retension commands). 
 */ -#define IDETAPE_WAIT_CMD (60*HZ) +#define IDETAPE_WAIT_CMD (900*HZ) /* * The following parameter is used to select the point in the internal @@ -1032,6 +1031,10 @@ /* the door is currently locked */ int door_locked; + /* the tape hardware is write protected */ + char drv_write_prot; + /* the tape is write protected (hardware or opened as read-only) */ + char write_prot; /* * OnStream flags @@ -1164,6 +1167,8 @@ #define IDETAPE_DRQ_INTERRUPT 6 /* DRQ interrupt device */ #define IDETAPE_READ_ERROR 7 #define IDETAPE_PIPELINE_ACTIVE 8 /* pipeline active */ +/* 0 = no tape is loaded, so we don't rewind after ejecting */ +#define IDETAPE_MEDIUM_PRESENT 9 /* * Supported ATAPI tape drives packet commands @@ -1665,6 +1670,20 @@ idetape_update_buffers(pc); } + /* + * If error was the result of a zero-length read or write command, + * with sense key=5, asc=0x22, ascq=0, let it slide. Some drives + * (e.g. Seagate STT3401A Travan) don't support 0-length read/writes. + */ + if ((pc->c[0] == IDETAPE_READ_CMD || pc->c[0] == IDETAPE_WRITE_CMD) + && pc->c[4] == 0 && pc->c[3] == 0 && pc->c[2] == 0) { /* length==0 */ + if (result->sense_key == 5) { + /* don't report an error, everything's ok */ + pc->error = 0; + /* don't retry read/write */ + set_bit(PC_ABORT, &pc->flags); + } + } if (pc->c[0] == IDETAPE_READ_CMD && result->filemark) { pc->error = IDETAPE_ERROR_FILEMARK; set_bit(PC_ABORT, &pc->flags); @@ -1805,10 +1824,15 @@ } } -static void idetape_abort_pipeline (ide_drive_t *drive, idetape_stage_t *last_stage) +/* + * This will free all the pipeline stages starting from new_last_stage->next + * to the end of the list, and point tape->last_stage to new_last_stage. + */ +static void idetape_abort_pipeline(ide_drive_t *drive, + idetape_stage_t *new_last_stage) { idetape_tape_t *tape = drive->driver_data; - idetape_stage_t *stage = tape->next_stage; + idetape_stage_t *stage = new_last_stage->next; idetape_stage_t *nstage; #if IDETAPE_DEBUG_LOG @@ -1822,9 +1846,9 @@ --tape->nr_pending_stages; stage = nstage; } - tape->last_stage = last_stage; - if (last_stage) - last_stage->next = NULL; + if (new_last_stage) + new_last_stage->next = NULL; + tape->last_stage = new_last_stage; tape->next_stage = NULL; } @@ -2430,7 +2454,14 @@ if (page_code != IDETAPE_BLOCK_DESCRIPTOR) pc->c[1] = 8; /* DBD = 1 - Don't return block descriptors */ pc->c[2] = page_code; - pc->c[3] = 255; /* Don't limit the returned information */ + /* + * Changed pc->c[3] to 0 (255 will at best return unused info). + * + * For SCSI this byte is defined as subpage instead of high byte + * of length and some IDE drives seem to interpret it this way + * and return an error when 255 is used. 
+ */ + pc->c[3] = 0; pc->c[4] = 255; /* (We will just discard data in that case) */ if (page_code == IDETAPE_BLOCK_DESCRIPTOR) pc->request_transfer = 12; @@ -2544,8 +2575,9 @@ if (status.b.dsc) { if (status.b.check) { /* Error detected */ - printk(KERN_ERR "ide-tape: %s: I/O error, ",tape->name); - + if (pc->c[0] != IDETAPE_TEST_UNIT_READY_CMD) + printk(KERN_ERR "ide-tape: %s: I/O error, ", + tape->name); /* Retry operation */ return idetape_retry_pc(drive); } @@ -3295,25 +3327,28 @@ { idetape_tape_t *tape = drive->driver_data; idetape_pc_t pc; + int load_attempted = 0; /* * Wait for the tape to become ready */ + set_bit(IDETAPE_MEDIUM_PRESENT, &tape->flags); timeout += jiffies; while (time_before(jiffies, timeout)) { idetape_create_test_unit_ready_cmd(&pc); if (!__idetape_queue_pc_tail(drive, &pc)) return 0; - if (tape->sense_key == 2 && tape->asc == 4 && tape->ascq == 2) { + if ((tape->sense_key == 2 && tape->asc == 4 && tape->ascq == 2) + || (tape->asc == 0x3A)) { /* no media */ + if (load_attempted) + return -ENOMEDIUM; idetape_create_load_unload_cmd(drive, &pc, IDETAPE_LU_LOAD_MASK); __idetape_queue_pc_tail(drive, &pc); - idetape_create_test_unit_ready_cmd(&pc); - if (!__idetape_queue_pc_tail(drive, &pc)) - return 0; - } - if (!(tape->sense_key == 2 && tape->asc == 4 && - (tape->ascq == 1 || tape->ascq == 8))) - break; + load_attempted = 1; + /* not about to be ready */ + } else if (!(tape->sense_key == 2 && tape->asc == 4 && + (tape->ascq == 1 || tape->ascq == 8))) + return -EIO; current->state = TASK_INTERRUPTIBLE; schedule_timeout(HZ / 10); } @@ -3369,25 +3404,10 @@ printk(KERN_INFO "ide-tape: Reached idetape_read_position\n"); #endif /* IDETAPE_DEBUG_LOG */ -#ifdef NO_LONGER_REQUIRED - idetape_flush_tape_buffers(drive); -#endif idetape_create_read_position_cmd(&pc); if (idetape_queue_pc_tail(drive, &pc)) return -1; position = tape->first_frame_position; -#ifdef NO_LONGER_REQUIRED - if (tape->onstream) { - if ((position != tape->last_frame_position - tape->blocks_in_buffer) && - (position != tape->last_frame_position + tape->blocks_in_buffer)) { - if (tape->blocks_in_buffer == 0) { - printk("ide-tape: %s: correcting read position %d, %d, %d\n", tape->name, position, tape->last_frame_position, tape->blocks_in_buffer); - position = tape->last_frame_position; - tape->first_frame_position = position; - } - } - } -#endif return position; } @@ -3436,6 +3456,8 @@ if (tape->chrdev_direction != idetape_direction_read) return 0; + + /* Remove merge stage. */ cnt = tape->merge_stage_size / tape->tape_block_size; if (test_and_clear_bit(IDETAPE_FILEMARK, &tape->flags)) ++cnt; /* Filemarks count as 1 sector */ @@ -3444,9 +3466,12 @@ __idetape_kfree_stage(tape->merge_stage); tape->merge_stage = NULL; } + + /* Clear pipeline flags. */ clear_bit(IDETAPE_PIPELINE_ERROR, &tape->flags); tape->chrdev_direction = idetape_direction_none; - + + /* Remove pipeline stages. */ if (tape->first_stage == NULL) return 0; @@ -4059,13 +4084,17 @@ * Issue a read 0 command to ensure that DSC handshake * is switched from completion mode to buffer available * mode. + * No point in issuing this if DSC overlap isn't supported, + * some drives (Seagate STT3401A) will return an error. 
*/ - bytes_read = idetape_queue_rw_tail(drive, REQ_IDETAPE_READ, 0, tape->merge_stage->bh); - if (bytes_read < 0) { - __idetape_kfree_stage(tape->merge_stage); - tape->merge_stage = NULL; - tape->chrdev_direction = idetape_direction_none; - return bytes_read; + if (drive->dsc_overlap) { + bytes_read = idetape_queue_rw_tail(drive, REQ_IDETAPE_READ, 0, tape->merge_stage->bh); + if (bytes_read < 0) { + __idetape_kfree_stage(tape->merge_stage); + tape->merge_stage = NULL; + tape->chrdev_direction = idetape_direction_none; + return bytes_read; + } } } if (tape->restart_speed_control_req) @@ -4898,6 +4927,10 @@ return -ENXIO; } + /* The drive is write protected. */ + if (tape->write_prot) + return -EACCES; + #if IDETAPE_DEBUG_LOG if (tape->debug_level >= 3) printk(KERN_INFO "ide-tape: Reached idetape_chrdev_write, " @@ -4979,13 +5012,17 @@ * Issue a write 0 command to ensure that DSC handshake * is switched from completion mode to buffer available * mode. + * No point in issuing this if DSC overlap isn't supported, + * some drives (Seagate STT3401A) will return an error. */ - retval = idetape_queue_rw_tail(drive, REQ_IDETAPE_WRITE, 0, tape->merge_stage->bh); - if (retval < 0) { - __idetape_kfree_stage(tape->merge_stage); - tape->merge_stage = NULL; - tape->chrdev_direction = idetape_direction_none; - return retval; + if (drive->dsc_overlap) { + retval = idetape_queue_rw_tail(drive, REQ_IDETAPE_WRITE, 0, tape->merge_stage->bh); + if (retval < 0) { + __idetape_kfree_stage(tape->merge_stage); + tape->merge_stage = NULL; + tape->chrdev_direction = idetape_direction_none; + return retval; + } } #if ONSTREAM_DEBUG if (tape->debug_level >= 2) @@ -5141,7 +5178,7 @@ * Note: * * MTBSF and MTBSFM are not supported when the tape doesn't - * supports spacing over filemarks in the reverse direction. + * support spacing over filemarks in the reverse direction. * In this case, MTFSFM is also usually not supported (it is * supported in the rare case in which we crossed the filemark * during our read-ahead pipelined operation mode). @@ -5211,6 +5248,8 @@ } switch (mt_op) { case MTWEOF: + if (tape->write_prot) + return -EACCES; idetape_discard_read_pipeline(drive, 1); for (i = 0; i < mt_count; i++) { retval = idetape_write_filemark(drive); @@ -5231,9 +5270,21 @@ return (idetape_queue_pc_tail(drive, &pc)); case MTUNLOAD: case MTOFFL: + /* + * If door is locked, attempt to unlock before + * attempting to eject. + */ + if (tape->door_locked) { + if (idetape_create_prevent_cmd(drive, &pc, 0)) + if (!idetape_queue_pc_tail(drive, &pc)) + tape->door_locked = DOOR_UNLOCKED; + } idetape_discard_read_pipeline(drive, 0); idetape_create_load_unload_cmd(drive, &pc,!IDETAPE_LU_LOAD_MASK); - return (idetape_queue_pc_tail(drive, &pc)); + retval = idetape_queue_pc_tail(drive, &pc); + if (!retval) + clear_bit(IDETAPE_MEDIUM_PRESENT, &tape->flags); + return retval; case MTNOP: idetape_discard_read_pipeline(drive, 0); return (idetape_flush_tape_buffers(drive)); @@ -5409,6 +5460,8 @@ mtget.mt_gstat |= GMT_EOD(0xffffffff); if (position <= OS_DATA_STARTFRAME1) mtget.mt_gstat |= GMT_BOT(0xffffffff); + } else if (tape->drv_write_prot) { + mtget.mt_gstat |= GMT_WR_PROT(0xffffffff); } if (copy_to_user((char *) arg,(char *) &mtget, sizeof(struct mtget))) return -EFAULT; @@ -5530,6 +5583,8 @@ return 1; } +static void idetape_get_blocksize_from_block_descriptor(ide_drive_t *drive); + /* * Our character device open function. 
*/ @@ -5539,7 +5594,8 @@ ide_drive_t *drive; idetape_tape_t *tape; idetape_pc_t pc; - + int retval; + #if IDETAPE_DEBUG_LOG printk(KERN_INFO "ide-tape: Reached idetape_chrdev_open\n"); #endif /* IDETAPE_DEBUG_LOG */ @@ -5552,11 +5608,7 @@ if (test_and_set_bit(IDETAPE_BUSY, &tape->flags)) return -EBUSY; - if (!tape->onstream) { - idetape_read_position(drive); - if (!test_bit(IDETAPE_ADDRESS_VALID, &tape->flags)) - (void) idetape_rewind_tape(drive); - } else { + if (tape->onstream) { if (minor & 64) { tape->tape_block_size = tape->stage_size = 32768 + 512; tape->raw = 1; @@ -5566,16 +5618,42 @@ } idetape_onstream_mode_sense_tape_parameter_page(drive, tape->debug_level); } - if (idetape_wait_ready(drive, 60 * HZ)) { + retval = idetape_wait_ready(drive, 60 * HZ); + if (retval) { clear_bit(IDETAPE_BUSY, &tape->flags); printk(KERN_ERR "ide-tape: %s: drive not ready\n", tape->name); - return -EBUSY; + return retval; } - if (tape->onstream) - idetape_read_position(drive); + + idetape_read_position(drive); + if (!test_bit(IDETAPE_ADDRESS_VALID, &tape->flags)) + (void)idetape_rewind_tape(drive); + if (tape->chrdev_direction != idetape_direction_read) clear_bit(IDETAPE_PIPELINE_ERROR, &tape->flags); + /* Read block size and write protect status from drive. */ + idetape_get_blocksize_from_block_descriptor(drive); + + /* Set write protect flag if device is opened as read-only. */ + if ((filp->f_flags & O_ACCMODE) == O_RDONLY) + tape->write_prot = 1; + else + tape->write_prot = tape->drv_write_prot; + + /* Make sure drive isn't write protected if user wants to write. */ + if (tape->write_prot) { + if ((filp->f_flags & O_ACCMODE) == O_WRONLY || + (filp->f_flags & O_ACCMODE) == O_RDWR) { + clear_bit(IDETAPE_BUSY, &tape->flags); + return -EROFS; + } + } + + /* + * Lock the tape drive door so user can't eject. + * Analyze headers for OnStream drives. + */ if (tape->chrdev_direction == idetape_direction_none) { if (idetape_create_prevent_cmd(drive, &pc, 1)) { if (!idetape_queue_pc_tail(drive, &pc)) { @@ -5638,7 +5716,7 @@ __idetape_kfree_stage(tape->cache_stage); tape->cache_stage = NULL; } - if (minor < 128) + if (minor < 128 && test_bit(IDETAPE_MEDIUM_PRESENT, &tape->flags)) (void) idetape_rewind_tape(drive); if (tape->chrdev_direction == idetape_direction_none) { if (tape->door_locked == DOOR_LOCKED) { @@ -6059,6 +6137,8 @@ header = (idetape_mode_parameter_header_t *) pc.buffer; block_descrp = (idetape_parameter_block_descriptor_t *) (pc.buffer + sizeof(idetape_mode_parameter_header_t)); tape->tape_block_size =( block_descrp->length[0]<<16) + (block_descrp->length[1]<<8) + block_descrp->length[2]; + tape->drv_write_prot = (header->dsp & 0x80) >> 7; + #if IDETAPE_DEBUG_INFO printk(KERN_INFO "ide-tape: Adjusted block size - %d\n", tape->tape_block_size); #endif /* IDETAPE_DEBUG_INFO */ @@ -6139,6 +6219,9 @@ } } #endif /* CONFIG_BLK_DEV_IDEPCI */ + /* Seagate Travan drives do not support DSC overlap. 
*/ + if (strstr(drive->id->model, "Seagate STT3401")) + drive->dsc_overlap = 0; tape->drive = drive; tape->minor = minor; tape->name[0] = 'h'; @@ -6306,24 +6389,23 @@ .release = idetape_chrdev_release, }; -static int idetape_open(struct inode *inode, struct file *filp) +static int idetape_open(struct block_device *bdev, struct file *filp) { - ide_drive_t *drive = inode->i_bdev->bd_disk->private_data; + ide_drive_t *drive = bdev->bd_disk->private_data; drive->usage++; return 0; } -static int idetape_release(struct inode *inode, struct file *filp) +static int idetape_release(struct gendisk *disk) { - ide_drive_t *drive = inode->i_bdev->bd_disk->private_data; + ide_drive_t *drive = disk->private_data; drive->usage--; return 0; } -static int idetape_ioctl(struct inode *inode, struct file *file, +static int idetape_ioctl(struct block_device *bdev, struct file *file, unsigned int cmd, unsigned long arg) { - struct block_device *bdev = inode->i_bdev; ide_drive_t *drive = bdev->bd_disk->private_data; int err = generic_ide_ioctl(bdev, cmd, arg); if (err == -EINVAL) --- diff/drivers/ide/ide.c 2003-10-27 09:20:37.000000000 +0000 +++ source/drivers/ide/ide.c 2003-11-26 10:09:05.000000000 +0000 @@ -458,7 +458,7 @@ EXPORT_SYMBOL(ide_probe_module); -static int ide_open (struct inode * inode, struct file * filp) +static int ide_open (struct block_device *bdev, struct file * filp) { return -ENXIO; } --- diff/drivers/ide/legacy/hd.c 2003-08-26 10:00:52.000000000 +0100 +++ source/drivers/ide/legacy/hd.c 2003-11-26 10:09:05.000000000 +0000 @@ -656,10 +656,10 @@ enable_irq(HD_IRQ); } -static int hd_ioctl(struct inode * inode, struct file * file, +static int hd_ioctl(struct block_device *bdev, struct file *file, unsigned int cmd, unsigned long arg) { - struct hd_i_struct *disk = inode->i_bdev->bd_disk->private_data; + struct hd_i_struct *disk = bdev->bd_disk->private_data; struct hd_geometry *loc = (struct hd_geometry *) arg; struct hd_geometry g; @@ -670,7 +670,7 @@ g.heads = disk->head; g.sectors = disk->sect; g.cylinders = disk->cyl; - g.start = get_start_sect(inode->i_bdev); + g.start = get_start_sect(bdev); return copy_to_user(loc, &g, sizeof g) ? -EFAULT : 0; } --- diff/drivers/ide/legacy/hd98.c 2003-08-20 14:16:28.000000000 +0100 +++ source/drivers/ide/legacy/hd98.c 2003-11-26 10:09:05.000000000 +0000 @@ -652,10 +652,10 @@ enable_irq(HD_IRQ); } -static int hd_ioctl(struct inode * inode, struct file * file, +static int hd_ioctl(struct block_device *bdev, struct file *file, unsigned int cmd, unsigned long arg) { - struct hd_i_struct *disk = inode->i_bdev->bd_disk->private_data; + struct hd_i_struct *disk = bdev->bd_disk->private_data; struct hd_geometry *loc = (struct hd_geometry *) arg; struct hd_geometry g; @@ -666,7 +666,7 @@ g.heads = disk->head; g.sectors = disk->sect; g.cylinders = disk->cyl; - g.start = get_start_sect(inode->i_bdev); + g.start = get_start_sect(bdev); return copy_to_user(loc, &g, sizeof g) ? 
-EFAULT : 0; } --- diff/drivers/ide/pci/Makefile 2003-10-09 09:47:34.000000000 +0100 +++ source/drivers/ide/pci/Makefile 2003-11-26 10:09:05.000000000 +0000 @@ -21,6 +21,7 @@ obj-$(CONFIG_BLK_DEV_PIIX) += piix.o obj-$(CONFIG_BLK_DEV_RZ1000) += rz1000.o obj-$(CONFIG_BLK_DEV_SVWKS) += serverworks.o +obj-$(CONFIG_BLK_DEV_SGIIOC4) += sgiioc4.o obj-$(CONFIG_BLK_DEV_SIIMAGE) += siimage.o obj-$(CONFIG_BLK_DEV_SIS5513) += sis5513.o obj-$(CONFIG_BLK_DEV_SL82C105) += sl82c105.o --- diff/drivers/ide/pci/piix.c 2003-10-27 09:20:43.000000000 +0000 +++ source/drivers/ide/pci/piix.c 2003-11-26 10:09:05.000000000 +0000 @@ -768,8 +768,8 @@ /* Only on the original revision: IDE DMA can hang */ if(rev == 0x00) no_piix_dma = 1; - /* On all revisions PXB bus lock must be disabled for IDE */ - else if(cfg & (1<<14)) + /* On all revisions below 5 PXB bus lock must be disabled for IDE */ + else if(cfg & (1<<14) && rev < 5) no_piix_dma = 2; } if(no_piix_dma) --- diff/drivers/ide/setup-pci.c 2003-10-27 09:20:37.000000000 +0000 +++ source/drivers/ide/setup-pci.c 2003-11-26 10:09:05.000000000 +0000 @@ -474,6 +474,11 @@ * state */ +#ifndef CONFIG_BLK_DEV_IDEDMA_PCI +static void ide_hwif_setup_dma(struct pci_dev *dev, ide_pci_device_t *d, ide_hwif_t *hwif) +{ +} +#else static void ide_hwif_setup_dma(struct pci_dev *dev, ide_pci_device_t *d, ide_hwif_t *hwif) { u16 pcicmd; @@ -516,6 +521,7 @@ } } } +#endif /* CONFIG_BLK_DEV_IDEDMA_PCI*/ /** * ide_setup_pci_controller - set up IDE PCI --- diff/drivers/ieee1394/dma.c 2003-08-20 14:16:09.000000000 +0100 +++ source/drivers/ieee1394/dma.c 2003-11-26 10:09:05.000000000 +0000 @@ -187,7 +187,7 @@ /* nopage() handler for mmap access */ static struct page* -dma_region_pagefault(struct vm_area_struct *area, unsigned long address, int write_access) +dma_region_pagefault(struct vm_area_struct *area, unsigned long address, int *type) { unsigned long offset; unsigned long kernel_virt_addr; @@ -202,6 +202,8 @@ (address > (unsigned long) area->vm_start + (PAGE_SIZE * dma->n_pages)) ) goto out; + if (type) + *type = VM_FAULT_MINOR; offset = address - area->vm_start; kernel_virt_addr = (unsigned long) dma->kvirt + offset; ret = vmalloc_to_page((void*) kernel_virt_addr); --- diff/drivers/input/input.c 2003-09-30 15:46:14.000000000 +0100 +++ source/drivers/input/input.c 2003-11-26 10:09:05.000000000 +0000 @@ -447,9 +447,10 @@ list_add_tail(&dev->node, &input_dev_list); list_for_each_entry(handler, &input_handler_list, node) - if ((id = input_match_device(handler->id_table, dev))) - if ((handle = handler->connect(handler, dev, id))) - input_link_handle(handle); + if (!handler->blacklist || !input_match_device(handler->blacklist, dev)) + if ((id = input_match_device(handler->id_table, dev))) + if ((handle = handler->connect(handler, dev, id))) + input_link_handle(handle); #ifdef CONFIG_HOTPLUG input_call_hotplug("add", dev); @@ -507,9 +508,10 @@ list_add_tail(&handler->node, &input_handler_list); list_for_each_entry(dev, &input_dev_list, node) - if ((id = input_match_device(handler->id_table, dev))) - if ((handle = handler->connect(handler, dev, id))) - input_link_handle(handle); + if (!handler->blacklist || !input_match_device(handler->blacklist, dev)) + if ((id = input_match_device(handler->id_table, dev))) + if ((handle = handler->connect(handler, dev, id))) + input_link_handle(handle); #ifdef CONFIG_PROC_FS input_devices_state++; --- diff/drivers/input/joydev.c 2003-09-30 15:45:46.000000000 +0100 +++ source/drivers/input/joydev.c 2003-11-26 10:09:05.000000000 +0000 @@ -380,10 +380,6 @@ 
struct joydev *joydev; int i, j, t, minor; - /* Avoid tablets */ - if (test_bit(EV_KEY, dev->evbit) && test_bit(BTN_TOUCH, dev->keybit)) - return NULL; - for (minor = 0; minor < JOYDEV_MINORS && joydev_table[minor]; minor++); if (minor == JOYDEV_MINORS) { printk(KERN_ERR "joydev: no more free joydev devices\n"); @@ -464,6 +460,15 @@ joydev_free(joydev); } +static struct input_device_id joydev_blacklist[] = { + { + .flags = INPUT_DEVICE_ID_MATCH_EVBIT | INPUT_DEVICE_ID_MATCH_KEYBIT, + .evbit = { BIT(EV_KEY) }, + .keybit = { [LONG(BTN_TOUCH)] = BIT(BTN_TOUCH) }, + }, /* Avoid itouchpads, touchscreens and tablets */ + { }, /* Terminating entry */ +}; + static struct input_device_id joydev_ids[] = { { .flags = INPUT_DEVICE_ID_MATCH_EVBIT | INPUT_DEVICE_ID_MATCH_ABSBIT, @@ -493,6 +498,7 @@ .minor = JOYDEV_MINOR_BASE, .name = "joydev", .id_table = joydev_ids, + .blacklist = joydev_blacklist, }; static int __init joydev_init(void) --- diff/drivers/input/keyboard/atkbd.c 2003-11-25 15:24:57.000000000 +0000 +++ source/drivers/input/keyboard/atkbd.c 2003-11-26 10:09:05.000000000 +0000 @@ -48,33 +48,30 @@ */ static unsigned char atkbd_set2_keycode[512] = { - 0, 67, 65, 63, 61, 59, 60, 88, 0, 68, 66, 64, 62, 15, 41, 85, - 0, 56, 42,182, 29, 16, 2, 89, 0, 0, 44, 31, 30, 17, 3, 90, - 0, 46, 45, 32, 18, 5, 4, 91, 90, 57, 47, 33, 20, 19, 6, 0, - 91, 49, 48, 35, 34, 21, 7, 0, 0, 0, 50, 36, 22, 8, 9, 0, + + 0, 67, 65, 63, 61, 59, 60, 88, 0, 68, 66, 64, 62, 15, 41,117, + 0, 56, 42,182, 29, 16, 2, 0, 0, 0, 44, 31, 30, 17, 3, 0, + 0, 46, 45, 32, 18, 5, 4,186, 0, 57, 47, 33, 20, 19, 6, 85, + 0, 49, 48, 35, 34, 21, 7, 89, 0, 0, 50, 36, 22, 8, 9, 90, 0, 51, 37, 23, 24, 11, 10, 0, 0, 52, 53, 38, 39, 25, 12, 0, - 122, 89, 40,120, 26, 13, 0, 0, 58, 54, 28, 27, 0, 43, 0, 0, - 85, 86, 90, 91, 92, 93, 14, 94, 95, 79,183, 75, 71,121, 0,123, + 0,181, 40, 0, 26, 13, 0, 0, 58, 54, 28, 27, 0, 43, 0,194, + 0, 86,193,192,184, 0, 14,185, 0, 79,182, 75, 71,124, 0, 0, 82, 83, 80, 76, 77, 72, 1, 69, 87, 78, 81, 74, 55, 73, 70, 99, - 0, 0, 0, 65, 99, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, - 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, - 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, - 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, - 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, - 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, + 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, - 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,255, - 0, 0, 92, 90, 85, 0,137, 0, 0, 0, 0, 91, 89,144,115, 0, - 217,100,255, 0, 97,165,164, 0,156, 0, 0,140,115, 0, 0,125, - 173,114, 0,113,152,163,151,126,128,166, 0,140, 0,147, 0,127, - 159,167,115,160,164, 0, 0,116,158, 0,150,166, 0, 0, 0,142, - 157, 0,114,166,168, 0, 0,213,155, 0, 98,113, 0,163, 0,138, - 226, 0, 0, 0, 0, 0,153,140, 0,255, 96, 0, 0, 0,143, 0, - 133, 0,116, 0,143, 0,174,133, 0,107, 0,105,102, 0, 0,112, - 110,111,108,112,106,103, 0,119, 0,118,109, 0, 99,104,119 + 217,100,255, 0, 97,165, 0, 0,156, 0, 0, 0, 0, 0, 0,125, + 173,114, 0,113, 0, 0, 0,126,128, 0, 0,140, 0, 0, 0,127, + 159, 0,115, 0,164, 0, 0,116,158, 0,150,166, 0, 0, 0,142, + 157, 0, 0, 0, 0, 0, 0, 0,155, 0, 98, 0, 0,163, 0, 0, + 226, 0, 0, 0, 0, 0, 0, 0, 0,255, 96, 0, 0, 0,143, 0, + 0, 0, 0, 0, 0, 0, 0, 0, 0,107, 0,105,102, 0, 0,112, + 110,111,108,112,106,103, 0,119, 0,118,109, 0, 99,104,119, 0, + + 0, 0, 0, 65, 99, }; static unsigned char atkbd_set3_keycode[512] = { + 0, 0, 0, 0, 0, 0, 0, 59, 1,138,128,129,130, 15, 41, 60, 131, 29, 42, 86, 58, 16, 2, 61,133, 56, 44, 31, 30, 17, 3, 62, 134, 46, 45, 32, 18, 5, 4, 63,135, 57, 47, 33, 
20, 19, 6, 64, @@ -83,25 +80,21 @@ 113,114, 40, 84, 26, 13, 87, 99, 97, 54, 28, 27, 43, 84, 88, 70, 108,105,119,103,111,107, 14,110, 0, 79,106, 75, 71,109,102,104, 82, 83, 80, 76, 77, 72, 69, 98, 0, 96, 81, 0, 78, 73, 55, 85, + 89, 90, 91, 92, 74,185,184,182, 0, 0, 0,125,126,127,112, 0, 0,139,150,163,165,115,152,150,166,140,160,154,113,114,167,168, - 148,149,147,140, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, - 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, - 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, - 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, - 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, - 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,255 + 148,149,147,140 }; static unsigned char atkbd_unxlate_table[128] = { - 0,118, 22, 30, 38, 37, 46, 54, 61, 62, 70, 69, 78, 85,102, 13, - 21, 29, 36, 45, 44, 53, 60, 67, 68, 77, 84, 91, 90, 20, 28, 27, - 35, 43, 52, 51, 59, 66, 75, 76, 82, 14, 18, 93, 26, 34, 33, 42, - 50, 49, 58, 65, 73, 74, 89,124, 17, 41, 88, 5, 6, 4, 12, 3, - 11, 2, 10, 1, 9,119,126,108,117,125,123,107,115,116,121,105, - 114,122,112,113,127, 96, 97,120, 7, 15, 23, 31, 39, 47, 55, 63, - 71, 79, 86, 94, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 87,111, - 19, 25, 57, 81, 83, 92, 95, 98, 99,100,101,103,104,106,109,110 + 0,118, 22, 30, 38, 37, 46, 54, 61, 62, 70, 69, 78, 85,102, 13, + 21, 29, 36, 45, 44, 53, 60, 67, 68, 77, 84, 91, 90, 20, 28, 27, + 35, 43, 52, 51, 59, 66, 75, 76, 82, 14, 18, 93, 26, 34, 33, 42, + 50, 49, 58, 65, 73, 74, 89,124, 17, 41, 88, 5, 6, 4, 12, 3, + 11, 2, 10, 1, 9,119,126,108,117,125,123,107,115,116,121,105, + 114,122,112,113,127, 96, 97,120, 7, 15, 23, 31, 39, 47, 55, 63, + 71, 79, 86, 94, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 87,111, + 19, 25, 57, 81, 83, 92, 95, 98, 99,100,101,103,104,106,109,110 }; #define ATKBD_CMD_SETLEDS 0x10ed @@ -125,6 +118,9 @@ #define ATKBD_RET_EMULX 0x80 #define ATKBD_RET_EMUL1 0xe1 #define ATKBD_RET_RELEASE 0xf0 +#define ATKBD_RET_HANGUEL 0xf1 +#define ATKBD_RET_HANJA 0xf2 +#define ATKBD_RET_ERR 0xff #define ATKBD_KEY_UNKNOWN 0 #define ATKBD_KEY_NULL 255 @@ -156,6 +152,17 @@ unsigned long time; }; +static void atkbd_report_key(struct input_dev *dev, struct pt_regs *regs, int code, int value) +{ + input_regs(dev, regs); + if (value == 3) { + input_report_key(dev, code, 1); + input_report_key(dev, code, 0); + } else + input_event(dev, EV_KEY, code, value); + input_sync(dev); +} + /* * atkbd_interrupt(). Here takes place processing of data received from * the keyboard into events. 
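The atkbd_report_key() helper added above encodes a small convention: a value of 3 means "synthesise a full press/release pulse", which the interrupt handler uses a little further down for the Hangul/Hanja keys, since those keys send no break codes; values 0, 1 and 2 pass straight through as the usual release, press and autorepeat events. A standalone illustration in plain C (report() is a hypothetical stub standing in for the input-core calls, not part of the patch):

/* Sketch only: models the value handling of atkbd_report_key().
 * report() is a hypothetical stand-in for input_report_key() /
 * input_event(); it just shows which events would be emitted. */
#include <stdio.h>

static void report(int code, int value)
{
	printf("EV_KEY code=%d value=%d\n", code, value);
}

static void demo_report_key(int code, int value)
{
	if (value == 3) {
		report(code, 1);	/* synthetic press */
		report(code, 0);	/* synthetic release */
	} else {
		report(code, value);	/* 0=release, 1=press, 2=autorepeat */
	}
}

int main(void)
{
	demo_report_key(150, 3);	/* e.g. a key with no break code */
	return 0;
}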
@@ -184,47 +191,37 @@ atkbd->resend = 0; #endif - switch (code) { - case ATKBD_RET_ACK: - atkbd->ack = 1; - goto out; - case ATKBD_RET_NAK: - atkbd->ack = -1; - goto out; - } - - if (atkbd->translated) do { - - if (atkbd->emul != 1) { - if (code == ATKBD_RET_EMUL0 || code == ATKBD_RET_EMUL1) - break; - if (code == ATKBD_RET_BAT) { - if (!atkbd->bat_xl) - break; - atkbd->bat_xl = 0; - } - if (code == (ATKBD_RET_BAT & 0x7f)) - atkbd->bat_xl = 1; - } - - if (code < 0x80) { - code = atkbd_unxlate_table[code]; - break; + if (!atkbd->ack) + switch (code) { + case ATKBD_RET_ACK: + atkbd->ack = 1; + goto out; + case ATKBD_RET_NAK: + atkbd->ack = -1; + goto out; } - if (atkbd->cmdcnt) - break; - - code = atkbd_unxlate_table[code & 0x7f]; - atkbd->release = 1; - - } while (0); - if (atkbd->cmdcnt) { atkbd->cmdbuf[--atkbd->cmdcnt] = code; goto out; } + if (atkbd->translated) { + + if (atkbd->emul || + !(code == ATKBD_RET_EMUL0 || code == ATKBD_RET_EMUL1 || + code == ATKBD_RET_HANGUEL || code == ATKBD_RET_HANJA || + code == ATKBD_RET_ERR || + (code == ATKBD_RET_BAT && !atkbd->bat_xl))) { + atkbd->release = code >> 7; + code &= 0x7f; + } + + if (!atkbd->emul && + (code & 0x7f) == (ATKBD_RET_BAT & 0x7f)) + atkbd->bat_xl = !atkbd->release; + } + switch (code) { case ATKBD_RET_BAT: serio_rescan(atkbd->serio); @@ -238,22 +235,33 @@ case ATKBD_RET_RELEASE: atkbd->release = 1; goto out; + case ATKBD_RET_HANGUEL: + atkbd_report_key(&atkbd->dev, regs, KEY_LANG1, 3); + goto out; + case ATKBD_RET_HANJA: + atkbd_report_key(&atkbd->dev, regs, KEY_LANG2, 3); + goto out; + case ATKBD_RET_ERR: + printk(KERN_WARNING "atkbd.c: Keyboard on %s reports too many keys pressed.\n", serio->phys); + goto out; } + if (atkbd->set != 3) + code = (code & 0x7f) | ((code & 0x80) << 1); if (atkbd->emul) { if (--atkbd->emul) goto out; - code |= 0x100; + code |= (atkbd->set != 3) ? 0x80 : 0x100; } switch (atkbd->keycode[code]) { case ATKBD_KEY_NULL: break; case ATKBD_KEY_UNKNOWN: - printk(KERN_WARNING "atkbd.c: Unknown key %s (%s set %d, code %#x, data %#x, on %s).\n", + printk(KERN_WARNING "atkbd.c: Unknown key %s (%s set %d, code %#x on %s).\n", atkbd->release ? "released" : "pressed", atkbd->translated ? "translated" : "raw", - atkbd->set, code, data, serio->phys); + atkbd->set, code, serio->phys); break; default: value = atkbd->release ? 
0 : @@ -273,9 +281,7 @@ break; } - input_regs(&atkbd->dev, regs); - input_event(&atkbd->dev, EV_KEY, atkbd->keycode[code], value); - input_sync(&atkbd->dev); + atkbd_report_key(&atkbd->dev, regs, atkbd->keycode[code], value); } atkbd->release = 0; @@ -369,10 +375,11 @@ static int atkbd_event(struct input_dev *dev, unsigned int type, unsigned int code, int value) { struct atkbd *atkbd = dev->private; - struct { int p; u8 v; } period[] = - { {30, 0x00}, {25, 0x02}, {20, 0x04}, {15, 0x08}, {10, 0x0c}, {7, 0x10}, {5, 0x14}, {0, 0x14} }; - struct { int d; u8 v; } delay[] = - { {1000, 0x60}, {750, 0x40}, {500, 0x20}, {250, 0x00}, {0, 0x00} }; + const short period[32] = + { 33, 37, 42, 46, 50, 54, 58, 63, 67, 75, 83, 92, 100, 109, 116, 125, + 133, 149, 167, 182, 200, 217, 232, 250, 270, 303, 333, 370, 400, 435, 470, 500 }; + const short delay[4] = + { 250, 500, 750, 1000 }; char param[2]; int i, j; @@ -406,11 +413,11 @@ if (atkbd_softrepeat) return 0; i = j = 0; - while (period[i].p > dev->rep[REP_PERIOD]) i++; - while (delay[j].d > dev->rep[REP_DELAY]) j++; - dev->rep[REP_PERIOD] = period[i].p; - dev->rep[REP_DELAY] = delay[j].d; - param[0] = period[i].v | delay[j].v; + while (i < 32 && period[i] < dev->rep[REP_PERIOD]) i++; + while (j < 4 && delay[j] < dev->rep[REP_DELAY]) j++; + dev->rep[REP_PERIOD] = period[i]; + dev->rep[REP_DELAY] = delay[j]; + param[0] = i | (j << 5); atkbd_command(atkbd, param, ATKBD_CMD_SETREP); return 0; @@ -578,6 +585,7 @@ struct atkbd *atkbd = serio->private; input_unregister_device(&atkbd->dev); serio_close(serio); + serio->private = NULL; kfree(atkbd); } @@ -623,6 +631,7 @@ atkbd->dev.rep[REP_PERIOD] = 33; } + atkbd->ack = 1; atkbd->serio = serio; init_input_dev(&atkbd->dev); @@ -636,6 +645,7 @@ serio->private = atkbd; if (serio_open(serio, dev)) { + serio->private = NULL; kfree(atkbd); return; } @@ -644,6 +654,7 @@ if (atkbd_probe(atkbd)) { serio_close(serio); + serio->private = NULL; kfree(atkbd); return; } @@ -665,16 +676,22 @@ sprintf(atkbd->phys, "%s/input0", serio->phys); - if (atkbd->set == 3) - memcpy(atkbd->keycode, atkbd_set3_keycode, sizeof(atkbd->keycode)); - else + if (atkbd->translated) { + for (i = 0; i < 128; i++) { + atkbd->keycode[i] = atkbd_set2_keycode[atkbd_unxlate_table[i]]; + atkbd->keycode[i | 0x80] = atkbd_set2_keycode[atkbd_unxlate_table[i] | 0x80]; + } + } else if (atkbd->set == 2) { memcpy(atkbd->keycode, atkbd_set2_keycode, sizeof(atkbd->keycode)); + } else { + memcpy(atkbd->keycode, atkbd_set3_keycode, sizeof(atkbd->keycode)); + } atkbd->dev.name = atkbd->name; atkbd->dev.phys = atkbd->phys; atkbd->dev.id.bustype = BUS_I8042; atkbd->dev.id.vendor = 0x0001; - atkbd->dev.id.product = atkbd->set; + atkbd->dev.id.product = atkbd->translated ? 
1 : atkbd->set;
 	atkbd->dev.id.version = atkbd->id;
 	for (i = 0; i < 512; i++)
@@ -686,7 +703,6 @@
 	printk(KERN_INFO "input: %s on %s\n", atkbd->name, serio->phys);
 }
-
 static struct serio_dev atkbd_dev = {
 	.interrupt =	atkbd_interrupt,
 	.connect =	atkbd_connect,
--- diff/drivers/input/mouse/logips2pp.c	2003-10-09 09:47:16.000000000 +0100
+++ source/drivers/input/mouse/logips2pp.c	2003-11-26 10:09:05.000000000 +0000
@@ -10,6 +10,7 @@
  */
 #include <linux/input.h>
+#include <linux/serio.h>
 #include "psmouse.h"
 #include "logips2pp.h"
--- diff/drivers/input/mouse/psmouse-base.c	2003-11-25 15:24:57.000000000 +0000
+++ source/drivers/input/mouse/psmouse-base.c	2003-11-26 10:09:05.000000000 +0000
@@ -139,7 +139,8 @@
 		goto out;
 	}
-	if (psmouse->pktcnt && time_after(jiffies, psmouse->last + HZ/2)) {
+	if (psmouse->state == PSMOUSE_ACTIVATED &&
+	    psmouse->pktcnt && time_after(jiffies, psmouse->last + HZ/2)) {
 		printk(KERN_WARNING "psmouse.c: %s at %s lost synchronization, throwing %d bytes away.\n",
 		       psmouse->name, psmouse->phys, psmouse->pktcnt);
 		psmouse->pktcnt = 0;
@@ -274,24 +275,18 @@
 		return PSMOUSE_PS2;
 	/*
-	 * Try Synaptics TouchPad magic ID
+	 * Try Synaptics TouchPad
 	 */
-
-	param[0] = 0;
-	psmouse_command(psmouse, param, PSMOUSE_CMD_SETRES);
-	psmouse_command(psmouse, param, PSMOUSE_CMD_SETRES);
-	psmouse_command(psmouse, param, PSMOUSE_CMD_SETRES);
-	psmouse_command(psmouse, param, PSMOUSE_CMD_SETRES);
-	psmouse_command(psmouse, param, PSMOUSE_CMD_GETINFO);
-
-	if (param[1] == 0x47) {
+	if (synaptics_detect(psmouse) == 0) {
 		psmouse->vendor = "Synaptics";
 		psmouse->name = "TouchPad";
-		if (!synaptics_init(psmouse))
+
+#ifdef CONFIG_MOUSE_PS2_SYNAPTICS
+		if (synaptics_init(psmouse) == 0)
 			return PSMOUSE_SYNAPTICS;
-		else
-			return PSMOUSE_PS2;
-	}
+#endif
+		return PSMOUSE_PS2;
+	}
 	/*
 	 * Try Genius NetMouse magic init.
@@ -513,7 +508,18 @@
 	struct psmouse *psmouse = serio->private;
 	psmouse->state = PSMOUSE_IGNORE;
-	synaptics_disconnect(psmouse);
+
+	if (psmouse->ptport) {
+		if (psmouse->ptport->deactivate)
+			psmouse->ptport->deactivate(psmouse);
+		__serio_unregister_port(&psmouse->ptport->serio); /* we have serio_sem */
+		kfree(psmouse->ptport);
+		psmouse->ptport = NULL;
+	}
+
+	if (psmouse->disconnect)
+		psmouse->disconnect(psmouse);
+
 	input_unregister_device(&psmouse->dev);
 	serio_close(serio);
 	kfree(psmouse);
@@ -526,20 +532,11 @@
 static int psmouse_pm_callback(struct pm_dev *dev, pm_request_t request, void *data)
 {
 	struct psmouse *psmouse = dev->data;
-	struct serio_dev *ser_dev = psmouse->serio->dev;
-
-	synaptics_disconnect(psmouse);
-
-	/* We need to reopen the serio port to reinitialize the i8042 controller */
-	serio_close(psmouse->serio);
-	serio_open(psmouse->serio, ser_dev);
-
-	/* Probe and re-initialize the mouse */
-	psmouse_probe(psmouse);
-	psmouse_initialize(psmouse);
-	synaptics_pt_init(psmouse);
-	psmouse_activate(psmouse);
+	if (request == PM_RESUME) {
+		psmouse->state = PSMOUSE_IGNORE;
+		serio_reconnect(psmouse->serio);
+	}
 	return 0;
 }
@@ -547,7 +544,6 @@
 * psmouse_connect() is a callback from the serio module when
 * an unhandled serio port is found.
 */
-
static void psmouse_connect(struct serio *serio, struct serio_dev *dev)
{
	struct psmouse *psmouse;
@@ -572,7 +568,6 @@
 	psmouse->dev.private = psmouse;
 	serio->private = psmouse;
-
 	if (serio_open(serio, dev)) {
 		kfree(psmouse);
 		return;
@@ -584,10 +579,12 @@
 		return;
 	}
-	pmdev = pm_register(PM_SYS_DEV, PM_SYS_UNKNOWN, psmouse_pm_callback);
-	if (pmdev) {
-		psmouse->dev.pm_dev = pmdev;
-		pmdev->data = psmouse;
+	if (serio->type != SERIO_PS_PSTHRU) {
+		pmdev = pm_register(PM_SYS_DEV, PM_SYS_UNKNOWN, psmouse_pm_callback);
+		if (pmdev) {
+			psmouse->dev.pm_dev = pmdev;
+			pmdev->data = psmouse;
+		}
 	}
 	sprintf(psmouse->devname, "%s %s %s",
@@ -608,14 +605,70 @@
 	psmouse_initialize(psmouse);
-	synaptics_pt_init(psmouse);
+	if (psmouse->ptport) {
+		printk(KERN_INFO "serio: %s port at %s\n", psmouse->ptport->serio.name, psmouse->phys);
+		__serio_register_port(&psmouse->ptport->serio); /* we have serio_sem */
+		if (psmouse->ptport->activate)
+			psmouse->ptport->activate(psmouse);
+	}
+
+	psmouse_activate(psmouse);
+}
+
+
+static int psmouse_reconnect(struct serio *serio)
+{
+	struct psmouse *psmouse = serio->private;
+	struct serio_dev *dev = serio->dev;
+	int old_type = psmouse->type;
+
+	if (!dev) {
+		printk(KERN_DEBUG "psmouse: reconnect request, but serio is disconnected, ignoring...\n");
+		return -1;
+	}
+
+	/* We need to reopen the serio port to reinitialize the i8042 controller */
+	serio_close(serio);
+	if (serio_open(serio, dev)) {
+		/* do a disconnect here as serio_open leaves dev as NULL so disconnect
+		 * will not be called automatically later
+		 */
+		psmouse_disconnect(serio);
+		return -1;
+	}
+
+	psmouse->state = PSMOUSE_NEW_DEVICE;
+	psmouse->type = psmouse->acking = psmouse->cmdcnt = psmouse->pktcnt = 0;
+	if (psmouse->reconnect) {
+		if (psmouse->reconnect(psmouse))
+			return -1;
+	} else if (psmouse_probe(psmouse) != old_type)
+		return -1;
+
+	/* ok, the device type (and capabilities) match the old one,
+	 * we can continue using it, complete initialization
+	 */
+	psmouse->type = old_type;
+	psmouse_initialize(psmouse);
+
+	if (psmouse->ptport) {
+		if (psmouse_reconnect(&psmouse->ptport->serio)) {
+			__serio_unregister_port(&psmouse->ptport->serio);
+			__serio_register_port(&psmouse->ptport->serio);
+			if (psmouse->ptport->activate)
+				psmouse->ptport->activate(psmouse);
+		}
+	}
 	psmouse_activate(psmouse);
+	return 0;
 }
+
 static struct serio_dev psmouse_dev = {
 	.interrupt =	psmouse_interrupt,
 	.connect =	psmouse_connect,
+	.reconnect =	psmouse_reconnect,
 	.disconnect =	psmouse_disconnect,
 	.cleanup =	psmouse_cleanup,
 };
--- diff/drivers/input/mouse/psmouse.h	2003-10-09 09:47:16.000000000 +0100
+++ source/drivers/input/mouse/psmouse.h	2003-11-26 10:09:05.000000000 +0000
@@ -22,10 +22,20 @@
 #define PSMOUSE_ACTIVATED	1
 #define PSMOUSE_IGNORE		2
+struct psmouse;
+
+struct psmouse_ptport {
+	struct serio serio;
+
+	void (*activate)(struct psmouse *parent);
+	void (*deactivate)(struct psmouse *parent);
+};
+
 struct psmouse {
 	void *private;
 	struct input_dev dev;
 	struct serio *serio;
+	struct psmouse_ptport *ptport;
 	char *vendor;
 	char *name;
 	unsigned char cmdbuf[8];
@@ -41,6 +51,9 @@
 	char error;
 	char devname[64];
 	char phys[32];
+
+	int (*reconnect)(struct psmouse *psmouse);
+	void (*disconnect)(struct psmouse *psmouse);
 };
 #define PSMOUSE_PS2	1
--- diff/drivers/input/mouse/synaptics.c	2003-10-09 09:47:16.000000000 +0100
+++ source/drivers/input/mouse/synaptics.c	2003-11-26 10:09:05.000000000 +0000
@@ -2,7 +2,8 @@
  * Synaptics TouchPad PS/2 mouse driver
  *
  * 2003 Dmitry Torokhov <dtor@mail.ru>
- * Added support
for pass-through port + * Added support for pass-through port. Special thanks to Peter Berg Larsen + * for explaining various Synaptics quirks. * * 2003 Peter Osterlund <petero2@telia.com> * Ported to 2.5 input device infrastructure. @@ -194,9 +195,7 @@ static int synaptics_query_hardware(struct psmouse *psmouse) { - struct synaptics_data *priv = psmouse->private; int retries = 0; - int mode; while ((retries++ < 3) && synaptics_reset(psmouse)) printk(KERN_ERR "synaptics reset failed\n"); @@ -208,7 +207,14 @@ if (synaptics_capability(psmouse)) return -1; - mode = SYN_BIT_ABSOLUTE_MODE | SYN_BIT_HIGH_RATE; + return 0; +} + +static int synaptics_set_mode(struct psmouse *psmouse, int mode) +{ + struct synaptics_data *priv = psmouse->private; + + mode |= SYN_BIT_ABSOLUTE_MODE | SYN_BIT_HIGH_RATE; if (SYN_ID_MAJOR(priv->identity) >= 4) mode |= SYN_BIT_DISABLE_GESTURE; if (SYN_CAP_EXTENDED(priv->capabilities)) @@ -265,49 +271,38 @@ } } -int synaptics_pt_init(struct psmouse *psmouse) +static void synaptics_pt_activate(struct psmouse *psmouse) { - struct synaptics_data *priv = psmouse->private; - struct serio *port; - struct psmouse *child; + struct psmouse *child = psmouse->ptport->serio.private; - if (psmouse->type != PSMOUSE_SYNAPTICS) - return -1; - if (!SYN_CAP_EXTENDED(priv->capabilities)) - return -1; - if (!SYN_CAP_PASS_THROUGH(priv->capabilities)) - return -1; + /* adjust the touchpad to child's choice of protocol */ + if (child && child->type >= PSMOUSE_GENPS) { + if (synaptics_set_mode(psmouse, SYN_BIT_FOUR_BYTE_CLIENT)) + printk(KERN_INFO "synaptics: failed to enable 4-byte guest protocol\n"); + } +} + +static void synaptics_pt_create(struct psmouse *psmouse) +{ + struct psmouse_ptport *port; - priv->ptport = port = kmalloc(sizeof(struct serio), GFP_KERNEL); + psmouse->ptport = port = kmalloc(sizeof(struct psmouse_ptport), GFP_KERNEL); if (!port) { - printk(KERN_ERR "synaptics: not enough memory to allocate serio port\n"); - return -1; + printk(KERN_ERR "synaptics: not enough memory to allocate pass-through port\n"); + return; } - memset(port, 0, sizeof(struct serio)); - port->type = SERIO_PS_PSTHRU; - port->name = "Synaptics pass-through"; - port->phys = "synaptics-pt/serio0"; - port->write = synaptics_pt_write; - port->open = synaptics_pt_open; - port->close = synaptics_pt_close; - port->driver = psmouse; - - printk(KERN_INFO "serio: %s port at %s\n", port->name, psmouse->phys); - serio_register_slave_port(port); + memset(port, 0, sizeof(struct psmouse_ptport)); - /* adjust the touchpad to child's choice of protocol */ - child = port->private; - if (child && child->type >= PSMOUSE_GENPS) { - if (synaptics_mode_cmd(psmouse, (SYN_BIT_ABSOLUTE_MODE | - SYN_BIT_HIGH_RATE | - SYN_BIT_DISABLE_GESTURE | - SYN_BIT_FOUR_BYTE_CLIENT | - SYN_BIT_W_MODE))) - printk(KERN_INFO "synaptics: failed to enable 4-byte guest protocol\n"); - } + port->serio.type = SERIO_PS_PSTHRU; + port->serio.name = "Synaptics pass-through"; + port->serio.phys = "synaptics-pt/serio0"; + port->serio.write = synaptics_pt_write; + port->serio.open = synaptics_pt_open; + port->serio.close = synaptics_pt_close; + port->serio.driver = psmouse; - return 0; + port->activate = synaptics_pt_activate; } /***************************************************************************** @@ -371,27 +366,82 @@ clear_bit(REL_Y, dev->relbit); } +static void synaptics_disconnect(struct psmouse *psmouse) +{ + synaptics_mode_cmd(psmouse, 0); + kfree(psmouse->private); +} + +static int synaptics_reconnect(struct psmouse *psmouse) +{ + struct 
synaptics_data *priv = psmouse->private; + struct synaptics_data old_priv = *priv; + + if (synaptics_detect(psmouse)) + return -1; + + if (synaptics_query_hardware(psmouse)) { + printk(KERN_ERR "Unable to query Synaptics hardware.\n"); + return -1; + } + + if (old_priv.identity != priv->identity || + old_priv.model_id != priv->model_id || + old_priv.capabilities != priv->capabilities || + old_priv.ext_cap != priv->ext_cap) + return -1; + + if (synaptics_set_mode(psmouse, 0)) { + printk(KERN_ERR "Unable to initialize Synaptics hardware.\n"); + return -1; + } + + return 0; +} + +int synaptics_detect(struct psmouse *psmouse) +{ + unsigned char param[4]; + + param[0] = 0; + + psmouse_command(psmouse, param, PSMOUSE_CMD_SETRES); + psmouse_command(psmouse, param, PSMOUSE_CMD_SETRES); + psmouse_command(psmouse, param, PSMOUSE_CMD_SETRES); + psmouse_command(psmouse, param, PSMOUSE_CMD_SETRES); + psmouse_command(psmouse, param, PSMOUSE_CMD_GETINFO); + + return param[1] == 0x47 ? 0 : -1; +} + int synaptics_init(struct psmouse *psmouse) { struct synaptics_data *priv; -#ifndef CONFIG_MOUSE_PS2_SYNAPTICS - return -1; -#endif - psmouse->private = priv = kmalloc(sizeof(struct synaptics_data), GFP_KERNEL); if (!priv) return -1; memset(priv, 0, sizeof(struct synaptics_data)); if (synaptics_query_hardware(psmouse)) { - printk(KERN_ERR "Unable to query/initialize Synaptics hardware.\n"); + printk(KERN_ERR "Unable to query Synaptics hardware.\n"); + goto init_fail; + } + + if (synaptics_set_mode(psmouse, 0)) { + printk(KERN_ERR "Unable to initialize Synaptics hardware.\n"); goto init_fail; } + if (SYN_CAP_EXTENDED(priv->capabilities) && SYN_CAP_PASS_THROUGH(priv->capabilities)) + synaptics_pt_create(psmouse); + print_ident(priv); set_input_params(&psmouse->dev, priv); + psmouse->disconnect = synaptics_disconnect; + psmouse->reconnect = synaptics_reconnect; + return 0; init_fail: @@ -399,36 +449,13 @@ return -1; } -void synaptics_disconnect(struct psmouse *psmouse) -{ - struct synaptics_data *priv = psmouse->private; - - if (psmouse->type == PSMOUSE_SYNAPTICS && priv) { - synaptics_mode_cmd(psmouse, 0); - if (priv->ptport) { - serio_unregister_slave_port(priv->ptport); - kfree(priv->ptport); - } - kfree(priv); - } -} - /***************************************************************************** * Functions to interpret the absolute mode packets ****************************************************************************/ static void synaptics_parse_hw_state(unsigned char buf[], struct synaptics_data *priv, struct synaptics_hw_state *hw) { - hw->up = 0; - hw->down = 0; - hw->b0 = 0; - hw->b1 = 0; - hw->b2 = 0; - hw->b3 = 0; - hw->b4 = 0; - hw->b5 = 0; - hw->b6 = 0; - hw->b7 = 0; + memset(hw, 0, sizeof(struct synaptics_hw_state)); if (SYN_MODEL_NEWABS(priv->model_id)) { hw->x = (((buf[3] & 0x10) << 8) | @@ -570,64 +597,47 @@ input_sync(dev); } +static int synaptics_validate_byte(struct psmouse *psmouse) +{ + static unsigned char newabs_mask[] = { 0xC8, 0x00, 0x00, 0xC8, 0x00 }; + static unsigned char newabs_rslt[] = { 0x80, 0x00, 0x00, 0xC0, 0x00 }; + static unsigned char oldabs_mask[] = { 0xC0, 0x60, 0x00, 0xC0, 0x60 }; + static unsigned char oldabs_rslt[] = { 0xC0, 0x00, 0x00, 0x80, 0x00 }; + struct synaptics_data *priv = psmouse->private; + int idx = psmouse->pktcnt - 1; + + if (SYN_MODEL_NEWABS(priv->model_id)) + return (psmouse->packet[idx] & newabs_mask[idx]) == newabs_rslt[idx]; + else + return (psmouse->packet[idx] & oldabs_mask[idx]) == oldabs_rslt[idx]; +} + void synaptics_process_byte(struct psmouse 
*psmouse, struct pt_regs *regs) { struct input_dev *dev = &psmouse->dev; struct synaptics_data *priv = psmouse->private; - unsigned char data = psmouse->packet[psmouse->pktcnt - 1]; - int newabs = SYN_MODEL_NEWABS(priv->model_id); input_regs(dev, regs); - switch (psmouse->pktcnt) { - case 1: - if (newabs ? ((data & 0xC8) != 0x80) : ((data & 0xC0) != 0xC0)) { - printk(KERN_WARNING "Synaptics driver lost sync at 1st byte\n"); - goto bad_sync; - } - break; - case 2: - if (!newabs && ((data & 0x60) != 0x00)) { - printk(KERN_WARNING "Synaptics driver lost sync at 2nd byte\n"); - goto bad_sync; - } - break; - case 4: - if (newabs ? ((data & 0xC8) != 0xC0) : ((data & 0xC0) != 0x80)) { - printk(KERN_WARNING "Synaptics driver lost sync at 4th byte\n"); - goto bad_sync; - } - break; - case 5: - if (!newabs && ((data & 0x60) != 0x00)) { - printk(KERN_WARNING "Synaptics driver lost sync at 5th byte\n"); - goto bad_sync; - } - break; - default: - if (psmouse->pktcnt < 6) - break; /* Wait for full packet */ - + if (psmouse->pktcnt >= 6) { /* Full packet received */ if (priv->out_of_sync) { priv->out_of_sync = 0; printk(KERN_NOTICE "Synaptics driver resynced.\n"); } - if (priv->ptport && synaptics_is_pt_packet(psmouse->packet)) - synaptics_pass_pt_packet(priv->ptport, psmouse->packet); + if (psmouse->ptport && psmouse->ptport->serio.dev && synaptics_is_pt_packet(psmouse->packet)) + synaptics_pass_pt_packet(&psmouse->ptport->serio, psmouse->packet); else synaptics_process_packet(psmouse); - psmouse->pktcnt = 0; - break; - } - return; - bad_sync: - priv->out_of_sync++; - psmouse->pktcnt = 0; - if (psmouse_resetafter > 0 && priv->out_of_sync == psmouse_resetafter) { - psmouse->state = PSMOUSE_IGNORE; - serio_rescan(psmouse->serio); + } else if (psmouse->pktcnt && !synaptics_validate_byte(psmouse)) { + printk(KERN_WARNING "Synaptics driver lost sync at byte %d\n", psmouse->pktcnt); + psmouse->pktcnt = 0; + if (++priv->out_of_sync == psmouse_resetafter) { + psmouse->state = PSMOUSE_IGNORE; + printk(KERN_NOTICE "synaptics: issuing reconnect request\n"); + serio_reconnect(psmouse->serio); + } } } --- diff/drivers/input/mouse/synaptics.h 2003-10-09 09:47:16.000000000 +0100 +++ source/drivers/input/mouse/synaptics.h 2003-11-26 10:09:05.000000000 +0000 @@ -9,11 +9,9 @@ #ifndef _SYNAPTICS_H #define _SYNAPTICS_H - extern void synaptics_process_byte(struct psmouse *psmouse, struct pt_regs *regs); +extern int synaptics_detect(struct psmouse *psmouse); extern int synaptics_init(struct psmouse *psmouse); -extern int synaptics_pt_init(struct psmouse *psmouse); -extern void synaptics_disconnect(struct psmouse *psmouse); /* synaptics queries */ #define SYN_QUE_IDENTIFY 0x00 @@ -105,8 +103,6 @@ /* Data for normal processing */ unsigned int out_of_sync; /* # of packets out of sync */ int old_w; /* Previous w value */ - - struct serio *ptport; /* pass-through port */ }; #endif /* _SYNAPTICS_H */ --- diff/drivers/input/serio/serio.c 2003-10-09 09:47:16.000000000 +0100 +++ source/drivers/input/serio/serio.c 2003-11-26 10:09:05.000000000 +0000 @@ -49,14 +49,15 @@ EXPORT_SYMBOL(serio_interrupt); EXPORT_SYMBOL(serio_register_port); -EXPORT_SYMBOL(serio_register_slave_port); +EXPORT_SYMBOL(__serio_register_port); EXPORT_SYMBOL(serio_unregister_port); -EXPORT_SYMBOL(serio_unregister_slave_port); +EXPORT_SYMBOL(__serio_unregister_port); EXPORT_SYMBOL(serio_register_device); EXPORT_SYMBOL(serio_unregister_device); EXPORT_SYMBOL(serio_open); EXPORT_SYMBOL(serio_close); EXPORT_SYMBOL(serio_rescan); +EXPORT_SYMBOL(serio_reconnect); 
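Before the serio.c hunks that follow, it may help to see the shape of the new reconnect path in isolation: a SERIO_RECONNECT event is queued just like a SERIO_RESCAN, and kseriod first offers the bound driver a chance to revalidate the same device in place, falling back to a full disconnect-and-rescan only when that fails. A standalone model in plain C with hypothetical stand-in types (the real code below operates on struct serio under serio_sem):

/* Sketch only: models the event dispatch introduced below.
 * The structures and find_dev() are simplified stand-ins. */
#include <stdio.h>

enum { SERIO_RESCAN = 1, SERIO_RECONNECT = 2 };

struct serio;
struct serio_dev {
	int (*reconnect)(struct serio *);
	void (*disconnect)(struct serio *);
};
struct serio {
	struct serio_dev *dev;
};

static void find_dev(struct serio *serio)	/* models serio_find_dev() */
{
	printf("rebinding port to a fresh driver instance\n");
}

static void handle_event(struct serio *serio, int type)
{
	switch (type) {
	case SERIO_RECONNECT:
		/* Same device expected: let the driver revalidate in place. */
		if (serio->dev && serio->dev->reconnect &&
		    serio->dev->reconnect(serio) == 0)
			break;
		/* reconnect failed - fall through to a full rescan */
	case SERIO_RESCAN:
		if (serio->dev && serio->dev->disconnect)
			serio->dev->disconnect(serio);
		find_dev(serio);
		break;
	}
}

static int failing_reconnect(struct serio *serio)
{
	return -1;	/* pretend the device changed under us */
}

int main(void)
{
	struct serio_dev dev = { failing_reconnect, 0 };
	struct serio port = { &dev };

	handle_event(&port, SERIO_RECONNECT);	/* falls back to rescan */
	return 0;
}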
struct serio_event {
	int type;
@@ -83,10 +84,20 @@
 }
 #define SERIO_RESCAN		1
+#define SERIO_RECONNECT		2
 static DECLARE_WAIT_QUEUE_HEAD(serio_wait);
 static DECLARE_COMPLETION(serio_exited);
+static void serio_invalidate_pending_events(struct serio *serio)
+{
+	struct serio_event *event;
+
+	list_for_each_entry(event, &serio_event_list, node)
+		if (event->serio == serio)
+			event->serio = NULL;
+}
+
 void serio_handle_events(void)
 {
 	struct list_head *node, *next;
@@ -95,17 +106,27 @@
 	list_for_each_safe(node, next, &serio_event_list) {
 		event = container_of(node, struct serio_event, node);
+		down(&serio_sem);
+		if (event->serio == NULL)
+			goto event_done;
+
 		switch (event->type) {
+			case SERIO_RECONNECT :
+				if (event->serio->dev && event->serio->dev->reconnect)
+					if (event->serio->dev->reconnect(event->serio) == 0)
+						break;
+				/* reconnect failed - fall through to rescan */
+
 			case SERIO_RESCAN :
-				down(&serio_sem);
 				if (event->serio->dev && event->serio->dev->disconnect)
 					event->serio->dev->disconnect(event->serio);
 				serio_find_dev(event->serio);
-				up(&serio_sem);
 				break;
 			default:
 				break;
 		}
+event_done:
+		up(&serio_sem);
 		list_del_init(node);
 		kfree(event);
 	}
@@ -130,18 +151,27 @@
 	complete_and_exit(&serio_exited, 0);
 }
-void serio_rescan(struct serio *serio)
+static void serio_queue_event(struct serio *serio, int event_type)
 {
 	struct serio_event *event;
-	if (!(event = kmalloc(sizeof(struct serio_event), GFP_ATOMIC)))
-		return;
+	if ((event = kmalloc(sizeof(struct serio_event), GFP_ATOMIC))) {
+		event->type = event_type;
+		event->serio = serio;
+
+		list_add_tail(&event->node, &serio_event_list);
+		wake_up(&serio_wait);
+	}
+}
-	event->type = SERIO_RESCAN;
-	event->serio = serio;
+void serio_rescan(struct serio *serio)
+{
+	serio_queue_event(serio, SERIO_RESCAN);
+}
-	list_add_tail(&event->node, &serio_event_list);
-	wake_up(&serio_wait);
+void serio_reconnect(struct serio *serio)
+{
+	serio_queue_event(serio, SERIO_RECONNECT);
 }
 irqreturn_t serio_interrupt(struct serio *serio,
@@ -163,17 +193,16 @@
 void serio_register_port(struct serio *serio)
 {
 	down(&serio_sem);
-	list_add_tail(&serio->node, &serio_list);
-	serio_find_dev(serio);
+	__serio_register_port(serio);
 	up(&serio_sem);
 }
 /*
- * Same as serio_register_port but does not try to acquire serio_sem.
- * Should be used when registering a serio from other input device's
+ * Should only be called directly if serio_sem has already been taken,
+ * for example when registering a serio from another input device's
 * connect() function.
 */
-void serio_register_slave_port(struct serio *serio)
+void __serio_register_port(struct serio *serio)
 {
 	list_add_tail(&serio->node, &serio_list);
 	serio_find_dev(serio);
@@ -182,19 +211,18 @@
 void serio_unregister_port(struct serio *serio)
 {
 	down(&serio_sem);
-	list_del_init(&serio->node);
-	if (serio->dev && serio->dev->disconnect)
-		serio->dev->disconnect(serio);
+	__serio_unregister_port(serio);
 	up(&serio_sem);
 }
 /*
- * Same as serio_unregister_port but does not try to acquire serio_sem.
- * Should be used when unregistering a serio from other input device's
+ * Should only be called directly if serio_sem has already been taken,
+ * for example when unregistering a serio from another input device's
 * disconnect() function.
*/ -void serio_unregister_slave_port(struct serio *serio) +void __serio_unregister_port(struct serio *serio) { + serio_invalidate_pending_events(serio); list_del_init(&serio->node); if (serio->dev && serio->dev->disconnect) serio->dev->disconnect(serio); --- diff/drivers/isdn/eicon/Kconfig 2002-11-11 11:09:36.000000000 +0000 +++ source/drivers/isdn/eicon/Kconfig 2003-11-26 10:09:05.000000000 +0000 @@ -13,7 +13,7 @@ choice prompt "Eicon active card support" optional - depends on ISDN_DRV_EICON && ISDN + depends on ISDN_DRV_EICON && ISDN && m config ISDN_DRV_EICON_DIVAS tristate "Eicon driver" --- diff/drivers/md/Kconfig 2003-10-09 09:47:34.000000000 +0100 +++ source/drivers/md/Kconfig 2003-11-26 10:09:05.000000000 +0000 @@ -138,6 +138,7 @@ config DM_IOCTL_V4 bool "ioctl interface version 4" depends on BLK_DEV_DM + default y ---help--- Recent tools use a new version of the ioctl interface, only select this option if you intend using such tools. --- diff/drivers/md/dm-ioctl-v1.c 2003-09-30 15:46:14.000000000 +0100 +++ source/drivers/md/dm-ioctl-v1.c 2003-11-26 10:09:05.000000000 +0000 @@ -566,7 +566,7 @@ if (r) return r; - r = dm_table_create(&t, get_mode(param)); + r = dm_table_create(&t, get_mode(param), param->target_count); if (r) return r; @@ -894,7 +894,7 @@ struct mapped_device *md; struct dm_table *t; - r = dm_table_create(&t, get_mode(param)); + r = dm_table_create(&t, get_mode(param), param->target_count); if (r) return r; --- diff/drivers/md/dm-ioctl-v4.c 2003-09-30 15:46:14.000000000 +0100 +++ source/drivers/md/dm-ioctl-v4.c 2003-11-26 10:09:05.000000000 +0000 @@ -872,7 +872,7 @@ struct hash_cell *hc; struct dm_table *t; - r = dm_table_create(&t, get_mode(param)); + r = dm_table_create(&t, get_mode(param), param->target_count); if (r) return r; --- diff/drivers/md/dm-table.c 2003-11-25 15:24:57.000000000 +0000 +++ source/drivers/md/dm-table.c 2003-11-26 10:09:05.000000000 +0000 @@ -12,6 +12,7 @@ #include <linux/namei.h> #include <linux/ctype.h> #include <linux/slab.h> +#include <linux/interrupt.h> #include <asm/atomic.h> #define MAX_DEPTH 16 @@ -202,7 +203,7 @@ return 0; } -int dm_table_create(struct dm_table **result, int mode) +int dm_table_create(struct dm_table **result, int mode, unsigned num_targets) { struct dm_table *t = kmalloc(sizeof(*t), GFP_NOIO); @@ -213,8 +214,12 @@ INIT_LIST_HEAD(&t->devices); atomic_set(&t->holders, 1); - /* allocate a single nodes worth of targets to begin with */ - if (alloc_targets(t, KEYS_PER_NODE)) { + if (!num_targets) + num_targets = KEYS_PER_NODE; + + num_targets = dm_round_up(num_targets, KEYS_PER_NODE); + + if (alloc_targets(t, num_targets)) { kfree(t); t = NULL; return -ENOMEM; @@ -626,6 +631,16 @@ return 0; } +static void set_default_limits(struct io_restrictions *rs) +{ + rs->max_sectors = MAX_SECTORS; + rs->max_phys_segments = MAX_PHYS_SEGMENTS; + rs->max_hw_segments = MAX_HW_SEGMENTS; + rs->hardsect_size = 1 << SECTOR_SHIFT; + rs->max_segment_size = MAX_SEGMENT_SIZE; + rs->seg_boundary_mask = -1; +} + int dm_table_add_target(struct dm_table *t, const char *type, sector_t start, sector_t len, char *params) { @@ -638,6 +653,7 @@ tgt = t->targets + t->num_targets; memset(tgt, 0, sizeof(*tgt)); + set_default_limits(&tgt->limits); tgt->type = dm_get_target_type(type); if (!tgt->type) { @@ -731,22 +747,28 @@ return r; } -static spinlock_t _event_lock = SPIN_LOCK_UNLOCKED; +static DECLARE_MUTEX(_event_lock); void dm_table_event_callback(struct dm_table *t, void (*fn)(void *), void *context) { - spin_lock_irq(&_event_lock); + 
down(&_event_lock); t->event_fn = fn; t->event_context = context; - spin_unlock_irq(&_event_lock); + up(&_event_lock); } void dm_table_event(struct dm_table *t) { - spin_lock(&_event_lock); + /* + * You can no longer call dm_table_event() from interrupt + * context, use a bottom half instead. + */ + BUG_ON(in_interrupt()); + + down(&_event_lock); if (t->event_fn) t->event_fn(t->event_context); - spin_unlock(&_event_lock); + up(&_event_lock); } sector_t dm_table_get_size(struct dm_table *t) --- diff/drivers/md/dm.c 2003-09-30 15:46:14.000000000 +0100 +++ source/drivers/md/dm.c 2003-11-26 10:09:05.000000000 +0000 @@ -160,20 +160,16 @@ /* * Block device functions */ -static int dm_blk_open(struct inode *inode, struct file *file) +static int dm_blk_open(struct block_device *bdev, struct file *file) { - struct mapped_device *md; - - md = inode->i_bdev->bd_disk->private_data; + struct mapped_device *md = bdev->bd_disk->private_data; dm_get(md); return 0; } -static int dm_blk_close(struct inode *inode, struct file *file) +static int dm_blk_close(struct gendisk *disk) { - struct mapped_device *md; - - md = inode->i_bdev->bd_disk->private_data; + struct mapped_device *md = disk->private_data; dm_put(md); return 0; } @@ -666,6 +662,20 @@ up_write(&md->lock); } +static void __set_size(struct gendisk *disk, sector_t size) +{ + struct block_device *bdev; + + set_capacity(disk, size); + bdev = bdget_disk(disk, 0); + if (bdev) { + down(&bdev->bd_inode->i_sem); + i_size_write(bdev->bd_inode, size << SECTOR_SHIFT); + up(&bdev->bd_inode->i_sem); + bdput(bdev); + } +} + static int __bind(struct mapped_device *md, struct dm_table *t) { request_queue_t *q = md->queue; @@ -673,7 +683,7 @@ md->map = t; size = dm_table_get_size(t); - set_capacity(md->disk, size); + __set_size(md->disk, size); if (size == 0) return 0; @@ -692,7 +702,6 @@ dm_table_event_callback(md->map, NULL, NULL); dm_table_put(md->map); md->map = NULL; - set_capacity(md->disk, 0); } /* --- diff/drivers/md/dm.h 2003-08-20 14:16:09.000000000 +0100 +++ source/drivers/md/dm.h 2003-11-26 10:09:05.000000000 +0000 @@ -95,7 +95,7 @@ * Functions for manipulating a table. Tables are also reference * counted. *---------------------------------------------------------------*/ -int dm_table_create(struct dm_table **result, int mode); +int dm_table_create(struct dm_table **result, int mode, unsigned num_targets); void dm_table_get(struct dm_table *t); void dm_table_put(struct dm_table *t); --- diff/drivers/md/linear.c 2003-09-30 15:46:14.000000000 +0100 +++ source/drivers/md/linear.c 2003-11-26 10:09:05.000000000 +0000 @@ -113,8 +113,17 @@ } disk->rdev = rdev; + blk_queue_stack_limits(mddev->queue, rdev->bdev->bd_disk->queue); + /* as we don't honour merge_bvec_fn, we must never risk + * violating it, so limit ->max_sector to one PAGE, as + * a one page request is never in violation. 
+ */ + if (rdev->bdev->bd_disk->queue->merge_bvec_fn && + mddev->queue->max_sectors > (PAGE_SIZE>>9)) + mddev->queue->max_sectors = (PAGE_SIZE>>9); + disk->size = rdev->size; mddev->array_size += rdev->size; --- diff/drivers/md/md.c 2003-09-30 15:46:14.000000000 +0100 +++ source/drivers/md/md.c 2003-11-26 10:09:05.000000000 +0000 @@ -2360,11 +2360,10 @@ return 1; } -static int md_ioctl(struct inode *inode, struct file *file, +static int md_ioctl(struct block_device *bdev, struct file *file, unsigned int cmd, unsigned long arg) { char b[BDEVNAME_SIZE]; - unsigned int minor = iminor(inode); int err = 0; struct hd_geometry *loc = (struct hd_geometry *) arg; mddev_t *mddev = NULL; @@ -2372,11 +2371,6 @@ if (!capable(CAP_SYS_ADMIN)) return -EACCES; - if (minor >= MAX_MD_DEVS) { - MD_BUG(); - return -EINVAL; - } - /* * Commands dealing with the RAID driver but not any * particular array: @@ -2405,7 +2399,7 @@ * Commands creating/starting a new array: */ - mddev = inode->i_bdev->bd_inode->u.generic_ip; + mddev = bdev->bd_inode->u.generic_ip; if (!mddev) { BUG(); @@ -2527,7 +2521,7 @@ (short *) &loc->cylinders); if (err) goto abort_unlock; - err = put_user (get_start_sect(inode->i_bdev), + err = put_user (get_start_sect(bdev), (long *) &loc->start); goto done_unlock; } @@ -2605,12 +2599,12 @@ return err; } -static int md_open(struct inode *inode, struct file *file) +static int md_open(struct block_device *bdev, struct file *file) { /* * Succeed if we can find or allocate a mddev structure. */ - mddev_t *mddev = mddev_find(iminor(inode)); + mddev_t *mddev = mddev_find(MINOR(bdev->bd_dev)); int err = -ENOMEM; if (!mddev) @@ -2621,16 +2615,16 @@ err = 0; mddev_unlock(mddev); - inode->i_bdev->bd_inode->u.generic_ip = mddev_get(mddev); + bdev->bd_inode->u.generic_ip = mddev_get(mddev); put: mddev_put(mddev); out: return err; } -static int md_release(struct inode *inode, struct file * file) +static int md_release(struct gendisk *disk) { - mddev_t *mddev = inode->i_bdev->bd_inode->u.generic_ip; + mddev_t *mddev = disk->private_data; if (!mddev) BUG(); --- diff/drivers/md/multipath.c 2003-09-30 15:46:14.000000000 +0100 +++ source/drivers/md/multipath.c 2003-11-26 10:09:05.000000000 +0000 @@ -273,6 +273,17 @@ p->rdev = rdev; blk_queue_stack_limits(mddev->queue, rdev->bdev->bd_disk->queue); + + /* as we don't honour merge_bvec_fn, we must never risk + * violating it, so limit ->max_sector to one PAGE, as + * a one page request is never in violation. + * (Note: it is very unlikely that a device with + * merge_bvec_fn will be involved in multipath.) 
+ */ + if (rdev->bdev->bd_disk->queue->merge_bvec_fn && + mddev->queue->max_sectors > (PAGE_SIZE>>9)) + mddev->queue->max_sectors = (PAGE_SIZE>>9); + conf->working_disks++; rdev->raid_disk = path; rdev->in_sync = 1; @@ -410,8 +421,16 @@ disk = conf->multipaths + disk_idx; disk->rdev = rdev; + blk_queue_stack_limits(mddev->queue, rdev->bdev->bd_disk->queue); + /* as we don't honour merge_bvec_fn, we must never risk + * violating it, not that we ever expect a device with + * a merge_bvec_fn to be involved in multipath */ + if (rdev->bdev->bd_disk->queue->merge_bvec_fn && + mddev->queue->max_sectors > (PAGE_SIZE>>9)) + mddev->queue->max_sectors = (PAGE_SIZE>>9); + if (!rdev->faulty) conf->working_disks++; } --- diff/drivers/md/raid0.c 2003-10-27 09:20:43.000000000 +0000 +++ source/drivers/md/raid0.c 2003-11-26 10:09:05.000000000 +0000 @@ -112,8 +112,18 @@ goto abort; } zone->dev[j] = rdev1; + blk_queue_stack_limits(mddev->queue, rdev1->bdev->bd_disk->queue); + /* as we don't honour merge_bvec_fn, we must never risk + * violating it, so limit ->max_sector to one PAGE, as + * a one page request is never in violation. + */ + + if (rdev1->bdev->bd_disk->queue->merge_bvec_fn && + mddev->queue->max_sectors > (PAGE_SIZE>>9)) + mddev->queue->max_sectors = (PAGE_SIZE>>9); + if (!smallest || (rdev1->size <smallest->size)) smallest = rdev1; cnt++; @@ -301,6 +311,22 @@ conf->hash_spacing++; } + /* calculate the max read-ahead size. + * For read-ahead of large files to be effective, we need to + * readahead at least a whole stripe. i.e. number of devices + * multiplied by chunk size. + * If an individual device has an ra_pages greater than the + * chunk size, then we will not drive that device as hard as it + * wants. We consider this a configuration error: a larger + * chunksize should be used in that case. + */ + { + int stripe = mddev->raid_disks * mddev->chunk_size / PAGE_CACHE_SIZE; + if (mddev->queue->backing_dev_info.ra_pages < stripe) + mddev->queue->backing_dev_info.ra_pages = stripe; + } + + blk_queue_merge_bvec(mddev->queue, raid0_mergeable_bvec); return 0; --- diff/drivers/md/raid1.c 2003-11-25 15:24:57.000000000 +0000 +++ source/drivers/md/raid1.c 2003-11-26 10:09:05.000000000 +0000 @@ -677,8 +677,17 @@ for (mirror=0; mirror < mddev->raid_disks; mirror++) if ( !(p=conf->mirrors+mirror)->rdev) { p->rdev = rdev; + blk_queue_stack_limits(mddev->queue, rdev->bdev->bd_disk->queue); + /* as we don't honour merge_bvec_fn, we must never risk + * violating it, so limit ->max_sector to one PAGE, as + * a one page request is never in violation. + */ + if (rdev->bdev->bd_disk->queue->merge_bvec_fn && + mddev->queue->max_sectors > (PAGE_SIZE>>9)) + mddev->queue->max_sectors = (PAGE_SIZE>>9); + p->head_position = 0; rdev->raid_disk = mirror; found = 1; @@ -1077,8 +1086,17 @@ disk = conf->mirrors + disk_idx; disk->rdev = rdev; + blk_queue_stack_limits(mddev->queue, rdev->bdev->bd_disk->queue); + /* as we don't honour merge_bvec_fn, we must never risk + * violating it, so limit ->max_sector to one PAGE, as + * a one page request is never in violation. 
+ */ + if (rdev->bdev->bd_disk->queue->merge_bvec_fn && + mddev->queue->max_sectors > (PAGE_SIZE>>9)) + mddev->queue->max_sectors = (PAGE_SIZE>>9); + disk->head_position = 0; if (!rdev->faulty && rdev->in_sync) conf->working_disks++; --- diff/drivers/md/raid5.c 2003-09-30 15:46:14.000000000 +0100 +++ source/drivers/md/raid5.c 2003-11-26 10:09:05.000000000 +0000 @@ -1571,6 +1571,16 @@ print_raid5_conf(conf); + /* read-ahead size must cover a whole stripe, which is + * (n-1) * chunksize where 'n' is the number of raid devices + */ + { + int stripe = (mddev->raid_disks-1) * mddev->chunk_size + / PAGE_CACHE_SIZE; + if (mddev->queue->backing_dev_info.ra_pages < stripe) + mddev->queue->backing_dev_info.ra_pages = stripe; + } + /* Ok, everything is just fine now */ mddev->array_size = mddev->size * (mddev->raid_disks - 1); return 0; --- diff/drivers/media/video/video-buf.c 2003-11-25 15:24:57.000000000 +0000 +++ source/drivers/media/video/video-buf.c 2003-11-26 10:09:05.000000000 +0000 @@ -1078,7 +1078,7 @@ */ static struct page* videobuf_vm_nopage(struct vm_area_struct *vma, unsigned long vaddr, - int write_access) + int *type) { struct page *page; @@ -1090,6 +1090,8 @@ if (!page) return NOPAGE_OOM; clear_user_page(page_address(page), vaddr, page); + if (type) + *type = VM_FAULT_MINOR; return page; } --- diff/drivers/message/i2o/i2o_block.c 2003-08-20 14:16:29.000000000 +0100 +++ source/drivers/message/i2o/i2o_block.c 2003-11-26 10:09:05.000000000 +0000 @@ -885,10 +885,10 @@ * Issue device specific ioctl calls. */ -static int i2ob_ioctl(struct inode *inode, struct file *file, +static int i2ob_ioctl(struct block_device *bdev, struct file *file, unsigned int cmd, unsigned long arg) { - struct gendisk *disk = inode->i_bdev->bd_disk; + struct gendisk *disk = bdev->bd_disk; struct i2ob_device *dev = disk->private_data; /* Anyone capable of this syscall can do *real bad* things */ @@ -901,7 +901,7 @@ struct hd_geometry g; i2o_block_biosparam(get_capacity(disk), &g.cylinders, &g.heads, &g.sectors); - g.start = get_start_sect(inode->i_bdev); + g.start = get_start_sect(bdev); return copy_to_user((void *)arg,&g, sizeof(g))?-EFAULT:0; } @@ -927,9 +927,8 @@ * Close the block device down */ -static int i2ob_release(struct inode *inode, struct file *file) +static int i2ob_release(struct gendisk *disk) { - struct gendisk *disk = inode->i_bdev->bd_disk; struct i2ob_device *dev = disk->private_data; /* @@ -999,9 +998,9 @@ * Open the block device. 
*/ -static int i2ob_open(struct inode *inode, struct file *file) +static int i2ob_open(struct block_device *bdev, struct file *file) { - struct gendisk *disk = inode->i_bdev->bd_disk; + struct gendisk *disk = bdev->bd_disk; struct i2ob_device *dev = disk->private_data; if(!dev->i2odev) --- diff/drivers/mtd/mtd_blkdevs.c 2003-09-30 15:46:15.000000000 +0100 +++ source/drivers/mtd/mtd_blkdevs.c 2003-11-26 10:09:06.000000000 +0000 @@ -141,14 +141,12 @@ } -int blktrans_open(struct inode *i, struct file *f) +static int blktrans_open(struct block_device *bdev, struct file *f) { - struct mtd_blktrans_dev *dev; - struct mtd_blktrans_ops *tr; + struct mtd_blktrans_dev *dev = bdev->bd_disk->private_data; + struct mtd_blktrans_ops *tr = dev->tr; int ret = -ENODEV; - dev = i->i_bdev->bd_disk->private_data; - tr = dev->tr; if (!try_module_get(dev->mtd->owner)) goto out; @@ -172,15 +170,12 @@ return ret; } -int blktrans_release(struct inode *i, struct file *f) +static int blktrans_release(struct gendisk *disk) { - struct mtd_blktrans_dev *dev; - struct mtd_blktrans_ops *tr; + struct mtd_blktrans_dev *dev = disk->private_data; + struct mtd_blktrans_ops *tr = dev->tr; int ret = 0; - dev = i->i_bdev->bd_disk->private_data; - tr = dev->tr; - if (tr->release) ret = tr->release(dev); @@ -194,10 +189,10 @@ } -static int blktrans_ioctl(struct inode *inode, struct file *file, +static int blktrans_ioctl(struct block_device *bdev, struct file *file, unsigned int cmd, unsigned long arg) { - struct mtd_blktrans_dev *dev = inode->i_bdev->bd_disk->private_data; + struct mtd_blktrans_dev *dev = bdev->bd_disk->private_data; struct mtd_blktrans_ops *tr = dev->tr; switch (cmd) { @@ -217,7 +212,7 @@ if (ret) return ret; - g.start = get_start_sect(inode->i_bdev); + g.start = get_start_sect(bdev); if (copy_to_user((void *)arg, &g, sizeof(g))) return -EFAULT; return 0; --- diff/drivers/net/3c527.c 2003-11-25 15:24:57.000000000 +0000 +++ source/drivers/net/3c527.c 2003-11-26 10:09:06.000000000 +0000 @@ -1,9 +1,10 @@ -/* 3c527.c: 3Com Etherlink/MC32 driver for Linux 2.4 +/* 3c527.c: 3Com Etherlink/MC32 driver for Linux 2.4 and 2.6. * * (c) Copyright 1998 Red Hat Software Inc * Written by Alan Cox. * Further debugging by Carl Drougge. - * Modified by Richard Procter (rnp@netlink.co.nz) + * Initial SMP support by Felipe W Damasio <felipewd@terra.com.br> + * Heavily modified by Richard Procter <rnp@paradise.net.nz> * * Based on skeleton.c written 1993-94 by Donald Becker and ne2.c * (for the MCA stuff) written by Wim Dumon. @@ -17,11 +18,11 @@ */ #define DRV_NAME "3c527" -#define DRV_VERSION "0.6a" -#define DRV_RELDATE "2001/11/17" +#define DRV_VERSION "0.7-SMP" +#define DRV_RELDATE "2003/09/21" static const char *version = -DRV_NAME ".c:v" DRV_VERSION " " DRV_RELDATE " Richard Proctor (rnp@netlink.co.nz)\n"; +DRV_NAME ".c:v" DRV_VERSION " " DRV_RELDATE " Richard Procter <rnp@paradise.net.nz>\n"; /** * DOC: Traps for the unwary @@ -100,7 +101,9 @@ #include <linux/string.h> #include <linux/wait.h> #include <linux/ethtool.h> +#include <linux/completion.h> +#include <asm/semaphore.h> #include <asm/uaccess.h> #include <asm/system.h> #include <asm/bitops.h> @@ -143,19 +146,19 @@ static const int WORKAROUND_82586=1; /* Pointers to buffers and their on-card records */ - struct mc32_ring_desc { volatile struct skb_header *p; struct sk_buff *skb; }; - /* Information that needs to be kept for each board. 
*/ struct mc32_local { - struct net_device_stats net_stats; int slot; + + u32 base; + struct net_device_stats net_stats; volatile struct mc32_mailbox *rx_box; volatile struct mc32_mailbox *tx_box; volatile struct mc32_mailbox *exec_box; @@ -165,22 +168,23 @@ u16 tx_len; /* Transmit list count */ u16 rx_len; /* Receive list count */ - u32 base; - u16 exec_pending; - u16 mc_reload_wait; /* a multicast load request is pending */ + u16 xceiver_desired_state; /* HALTED or RUNNING */ + u16 cmd_nonblocking; /* Thread is uninterested in command result */ + u16 mc_reload_wait; /* A multicast load request is pending */ u32 mc_list_valid; /* True when the mclist is set */ - u16 xceiver_state; /* Current transceiver state. bitmapped */ - u16 desired_state; /* The state we want the transceiver to be in */ - atomic_t tx_count; /* buffers left */ - wait_queue_head_t event; struct mc32_ring_desc tx_ring[TX_RING_LEN]; /* Host Transmit ring */ struct mc32_ring_desc rx_ring[RX_RING_LEN]; /* Host Receive ring */ + atomic_t tx_count; /* buffers left */ + atomic_t tx_ring_head; /* index to tx en-queue end */ u16 tx_ring_tail; /* index to tx de-queue end */ - u16 tx_ring_head; /* index to tx en-queue end */ u16 rx_ring_tail; /* index to rx de-queue end */ + + struct semaphore cmd_mutex; /* Serialises issuing of execute commands */ + struct completion execution_cmd; /* Card has completed an execute command */ + struct completion xceiver_cmd; /* Card has completed a tx or rx command */ }; /* The station (ethernet) address prefix, used for a sanity check. */ @@ -236,7 +240,6 @@ { static int current_mca_slot = -1; int i; - int adapter_found = 0; SET_MODULE_OWNER(dev); @@ -247,11 +250,11 @@ Autodetecting MCA cards is extremely simple. Just search for the card. */ - for(i = 0; (mc32_adapters[i].name != NULL) && !adapter_found; i++) { + for(i = 0; (mc32_adapters[i].name != NULL); i++) { current_mca_slot = mca_find_unused_adapter(mc32_adapters[i].id, 0); - if((current_mca_slot != MCA_NOTFOUND) && !adapter_found) { + if(current_mca_slot != MCA_NOTFOUND) { if(!mc32_probe1(dev, current_mca_slot)) { mca_set_adapter_name(current_mca_slot, @@ -409,7 +412,7 @@ * Grab the IRQ */ - i = request_irq(dev->irq, &mc32_interrupt, SA_SHIRQ, dev->name, dev); + i = request_irq(dev->irq, &mc32_interrupt, SA_SHIRQ | SA_SAMPLE_RANDOM, dev->name, dev); if (i) { release_region(dev->base_addr, MC32_IO_EXTENT); printk(KERN_ERR "%s: unable to get IRQ %d.\n", dev->name, dev->irq); @@ -498,7 +501,9 @@ lp->tx_len = lp->exec_box->data[9]; /* Transmit list count */ lp->rx_len = lp->exec_box->data[11]; /* Receive list count */ - init_waitqueue_head(&lp->event); + init_MUTEX_LOCKED(&lp->cmd_mutex); + init_completion(&lp->execution_cmd); + init_completion(&lp->xceiver_cmd); printk("%s: Firmware Rev %d. %d RX buffers, %d TX buffers. Base of 0x%08X.\n", dev->name, lp->exec_box->data[12], lp->rx_len, lp->tx_len, lp->base); @@ -511,10 +516,6 @@ dev->tx_timeout = mc32_timeout; dev->watchdog_timeo = HZ*5; /* Board does all the work */ dev->ethtool_ops = &netdev_ethtool_ops; - - lp->xceiver_state = HALTED; - - lp->tx_ring_tail=lp->tx_ring_head=0; /* Fill in the fields of the device structure with ethernet values. */ ether_setup(dev); @@ -539,7 +540,7 @@ * status of any pending commands and takes very little time at all. 
*/ -static void mc32_ready_poll(struct net_device *dev) +static inline void mc32_ready_poll(struct net_device *dev) { int ioaddr = dev->base_addr; while(!(inb(ioaddr+HOST_STATUS)&HOST_STATUS_CRR)); @@ -554,31 +555,38 @@ * @len: Length of the data block * * Send a command from interrupt state. If there is a command - * currently being executed then we return an error of -1. It simply - * isn't viable to wait around as commands may be slow. Providing we - * get in, we busy wait for the board to become ready to accept the - * command and issue it. We do not wait for the command to complete - * --- the card will interrupt us when it's done. + * currently being executed then we return an error of -1. It + * simply isn't viable to wait around as commands may be + * slow. This can theoretically be starved on SMP, but it's hard + * to see a realistic situation. We do not wait for the command + * to complete --- we rely on the interrupt handler to tidy up + * after us. */ static int mc32_command_nowait(struct net_device *dev, u16 cmd, void *data, int len) { struct mc32_local *lp = (struct mc32_local *)dev->priv; int ioaddr = dev->base_addr; + int ret = -1; - if(lp->exec_pending) - return -1; - - lp->exec_pending=3; - lp->exec_box->mbox=0; - lp->exec_box->mbox=cmd; - memcpy((void *)lp->exec_box->data, data, len); - barrier(); /* the memcpy forgot the volatile so be sure */ + if (down_trylock(&lp->cmd_mutex) == 0) + { + lp->cmd_nonblocking=1; + lp->exec_box->mbox=0; + lp->exec_box->mbox=cmd; + memcpy((void *)lp->exec_box->data, data, len); + barrier(); /* the memcpy forgot the volatile so be sure */ + + /* Send the command */ + mc32_ready_poll(dev); + outb(1<<6, ioaddr+HOST_CMD); - /* Send the command */ - while(!(inb(ioaddr+HOST_STATUS)&HOST_STATUS_CRR)); - outb(1<<6, ioaddr+HOST_CMD); - return 0; + ret = 0; + + /* Interrupt handler will signal mutex on completion */ + } + + return ret; } @@ -592,76 +600,47 @@ * Sends exec commands in a user context. This permits us to wait around * for the replies and also to wait for the command buffer to complete * from a previous command before we execute our command. After our - * command completes we will complete any pending multicast reload + * command completes we will attempt any pending multicast reload * we blocked off by hogging the exec buffer. * * You feed the card a command, you wait, it interrupts you get a * reply. All well and good. The complication arises because you use * commands for filter list changes which come in at bh level from things * like IPV6 group stuff. - * - * We have a simple state machine - * - * 0 - nothing issued - * - * 1 - command issued, wait reply - * - * 2 - reply waiting - reader then goes to state 0 - * - * 3 - command issued, trash reply. 
In which case the irq - * takes it back to state 0 - * */ static int mc32_command(struct net_device *dev, u16 cmd, void *data, int len) { struct mc32_local *lp = (struct mc32_local *)dev->priv; int ioaddr = dev->base_addr; - unsigned long flags; int ret = 0; + down(&lp->cmd_mutex); + /* - * Wait for a command - */ - - save_flags(flags); - cli(); - - while(lp->exec_pending) - sleep_on(&lp->event); - - /* - * Issue mine + * My Turn */ - lp->exec_pending=1; - - restore_flags(flags); - + lp->cmd_nonblocking=0; lp->exec_box->mbox=0; lp->exec_box->mbox=cmd; memcpy((void *)lp->exec_box->data, data, len); barrier(); /* the memcpy forgot the volatile so be sure */ - /* Send the command */ - while(!(inb(ioaddr+HOST_STATUS)&HOST_STATUS_CRR)); - outb(1<<6, ioaddr+HOST_CMD); - - save_flags(flags); - cli(); + mc32_ready_poll(dev); + outb(1<<6, ioaddr+HOST_CMD); - while(lp->exec_pending!=2) - sleep_on(&lp->event); - lp->exec_pending=0; - restore_flags(flags); + wait_for_completion(&lp->execution_cmd); if(lp->exec_box->mbox&(1<<13)) ret = -1; + up(&lp->cmd_mutex); + /* - * A multicast set got blocked - do it now - */ - + * A multicast set got blocked - try it now + */ + if(lp->mc_reload_wait) { mc32_reset_multicast_list(dev); @@ -678,11 +657,9 @@ * This may be called from the interrupt state, where it is used * to restart the rx ring if the card runs out of rx buffers. * - * First, we check if it's ok to start the transceiver. We then show - * the card where to start in the rx ring and issue the - * commands to start reception and transmission. We don't wait - * around for these to complete. - */ + * We must first check if it's ok to (re)start the transceiver. See + * mc32_close for details. + */ static void mc32_start_transceiver(struct net_device *dev) { @@ -690,24 +667,20 @@ int ioaddr = dev->base_addr; /* Ignore RX overflow on device closure */ - if (lp->desired_state==HALTED) + if (lp->xceiver_desired_state==HALTED) return; + /* Give the card the offset to the post-EOL-bit RX descriptor */ mc32_ready_poll(dev); - - lp->tx_box->mbox=0; lp->rx_box->mbox=0; - - /* Give the card the offset to the post-EOL-bit RX descriptor */ lp->rx_box->data[0]=lp->rx_ring[prev_rx(lp->rx_ring_tail)].p->next; - outb(HOST_CMD_START_RX, ioaddr+HOST_CMD); mc32_ready_poll(dev); + lp->tx_box->mbox=0; outb(HOST_CMD_RESTRT_TX, ioaddr+HOST_CMD); /* card ignores this on RX restart */ /* We are not interrupted on start completion */ - lp->xceiver_state=RUNNING; } @@ -727,25 +700,17 @@ { struct mc32_local *lp = (struct mc32_local *)dev->priv; int ioaddr = dev->base_addr; - unsigned long flags; mc32_ready_poll(dev); - - lp->tx_box->mbox=0; lp->rx_box->mbox=0; - outb(HOST_CMD_SUSPND_RX, ioaddr+HOST_CMD); + wait_for_completion(&lp->xceiver_cmd); + mc32_ready_poll(dev); + lp->tx_box->mbox=0; outb(HOST_CMD_SUSPND_TX, ioaddr+HOST_CMD); - - save_flags(flags); - cli(); - - while(lp->xceiver_state!=HALTED) - sleep_on(&lp->event); - - restore_flags(flags); -} + wait_for_completion(&lp->xceiver_cmd); +} /** @@ -756,7 +721,7 @@ * the point where mc32_start_transceiver() can be called. * * The card sets up the receive ring for us. We are required to use the - * ring it provides although we can change the size of the ring. + * ring it provides, although the size of the ring is configurable. * * We allocate an sk_buff for each ring entry in turn and * initalise its house-keeping info. 
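The mc32_command()/mc32_command_nowait() rework above reduces to one pattern: a semaphore serialises access to the single exec mailbox, and a completion carries the "command done" event from the interrupt handler back to the sleeping caller. A condensed sketch of that pattern (hypothetical card_* names; only the semaphore and completion primitives used in the hunks above are assumed):

        #include <asm/semaphore.h>
        #include <linux/completion.h>

        struct card {
                struct semaphore cmd_mutex;     /* one exec command in flight */
                struct completion cmd_done;     /* irq handler -> sleeping caller */
                int cmd_nonblocking;            /* nobody waits for the reply */
        };

        static void card_setup(struct card *c)
        {
                init_MUTEX_LOCKED(&c->cmd_mutex);  /* released once the board is up */
                init_completion(&c->cmd_done);
        }

        /* Process context: queue behind any earlier command, then sleep. */
        static int card_command(struct card *c)
        {
                down(&c->cmd_mutex);
                c->cmd_nonblocking = 0;
                /* ... fill the mailbox and kick the card here ... */
                wait_for_completion(&c->cmd_done);
                up(&c->cmd_mutex);
                return 0;
        }

        /* Interrupt context: must not sleep; fail if a command is in flight. */
        static int card_command_nowait(struct card *c)
        {
                if (down_trylock(&c->cmd_mutex))
                        return -1;
                c->cmd_nonblocking = 1;
                /* ... fill the mailbox and kick the card here ... */
                return 0;       /* the irq handler releases cmd_mutex for us */
        }

        /* Irq handler, on "execution complete" status: */
        static void card_cmd_irq(struct card *c)
        {
                if (c->cmd_nonblocking)
                        up(&c->cmd_mutex);      /* no waiter; tidy up here */
                else
                        complete(&c->cmd_done); /* waiter wakes and does the up() */
        }

Creating the mutex locked, as mc32_probe1() does with init_MUTEX_LOCKED(), means no command can be issued until mc32_open() performs the first up() once the board is ready.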
At the same time, we read @@ -777,7 +742,7 @@ rx_base=lp->rx_chain; - for(i=0;i<RX_RING_LEN;i++) + for(i=0; i<RX_RING_LEN; i++) { lp->rx_ring[i].skb=alloc_skb(1532, GFP_KERNEL); skb_reserve(lp->rx_ring[i].skb, 18); @@ -814,21 +779,19 @@ * * Free the buffer for each ring slot. This may be called * before mc32_load_rx_ring(), eg. on error in mc32_open(). + * Requires rx skb pointers to point to a valid skb, or NULL. */ static void mc32_flush_rx_ring(struct net_device *dev) { struct mc32_local *lp = (struct mc32_local *)dev->priv; - - struct sk_buff *skb; int i; for(i=0; i < RX_RING_LEN; i++) { - skb = lp->rx_ring[i].skb; - if (skb!=NULL) { - kfree_skb(skb); - skb=NULL; + if (lp->rx_ring[i].skb) { + dev_kfree_skb(lp->rx_ring[i].skb); + lp->rx_ring[i].skb = NULL; } lp->rx_ring[i].p=NULL; } @@ -860,7 +823,7 @@ tx_base=lp->tx_box->data[0]; - for(i=0;i<lp->tx_len;i++) + for(i=0 ; i<TX_RING_LEN ; i++) { p=isa_bus_to_virt(lp->base+tx_base); lp->tx_ring[i].p=p; @@ -869,11 +832,12 @@ tx_base=p->next; } - /* -1 so that tx_ring_head cannot "lap" tx_ring_tail, */ - /* which would be bad news for mc32_tx_ring as cur. implemented */ + /* -1 so that tx_ring_head cannot "lap" tx_ring_tail */ + /* see mc32_tx_ring */ atomic_set(&lp->tx_count, TX_RING_LEN-1); - lp->tx_ring_head=lp->tx_ring_tail=0; + atomic_set(&lp->tx_ring_head, 0); + lp->tx_ring_tail=0; } @@ -881,47 +845,29 @@ * mc32_flush_tx_ring - free transmit ring * @lp: Local data of 3c527 to flush the tx ring of * - * We have to consider two cases here. We want to free the pending - * buffers only. If the ring buffer head is past the start then the - * ring segment we wish to free wraps through zero. The tx ring - * house-keeping variables are then reset. + * If the ring is non-empty, zip over it, freeing any + * allocated sk_buffs. The tx ring house-keeping variables are + * then reset. Requires tx skb pointers to point to a valid skb, + * or NULL. */ static void mc32_flush_tx_ring(struct net_device *dev) { struct mc32_local *lp = (struct mc32_local *)dev->priv; - - if(lp->tx_ring_tail!=lp->tx_ring_head) + int i; + + for (i=0; i < TX_RING_LEN; i++) { - int i; - if(lp->tx_ring_tail < lp->tx_ring_head) - { - for(i=lp->tx_ring_tail;i<lp->tx_ring_head;i++) - { - dev_kfree_skb(lp->tx_ring[i].skb); - lp->tx_ring[i].skb=NULL; - lp->tx_ring[i].p=NULL; - } - } - else + if (lp->tx_ring[i].skb) { - for(i=lp->tx_ring_tail; i<TX_RING_LEN; i++) - { - dev_kfree_skb(lp->tx_ring[i].skb); - lp->tx_ring[i].skb=NULL; - lp->tx_ring[i].p=NULL; - } - for(i=0; i<lp->tx_ring_head; i++) - { - dev_kfree_skb(lp->tx_ring[i].skb); - lp->tx_ring[i].skb=NULL; - lp->tx_ring[i].p=NULL; - } + dev_kfree_skb(lp->tx_ring[i].skb); + lp->tx_ring[i].skb = NULL; } } - + atomic_set(&lp->tx_count, 0); - lp->tx_ring_tail=lp->tx_ring_head=0; + atomic_set(&lp->tx_ring_head, 0); + lp->tx_ring_tail=0; } @@ -958,6 +904,12 @@ regs|=HOST_CTRL_INTE; outb(regs, ioaddr+HOST_CTRL); + /* + * Allow ourselves to issue commands + */ + + up(&lp->cmd_mutex); + /* * Send the indications on command @@ -1010,7 +962,7 @@ return -ENOBUFS; } - lp->desired_state = RUNNING; + lp->xceiver_desired_state = RUNNING; /* And finally, set the ball rolling... */ mc32_start_transceiver(dev); @@ -1047,61 +999,64 @@ * Transmit a buffer. This normally means throwing the buffer onto * the transmit queue as the queue is quite large. If the queue is * full then we set tx_busy and return.
Once the interrupt handler - * gets messages telling it to reclaim transmit queue entries we will + * gets messages telling it to reclaim transmit queue entries, we will * clear tx_busy and the kernel will start calling this again. * - * We use cli rather than spinlocks. Since I have no access to an SMP - * MCA machine I don't plan to change it. It is probably the top - * performance hit for this driver on SMP however. + * We do not disable interrupts or acquire any locks; this can + * run concurrently with mc32_tx_ring(), and the function itself + * is serialised at a higher layer. However, similarly for the + * card itself, we must ensure that we update tx_ring_head only + * after we've established a valid packet on the tx ring (and + * before we let the card "see" it, to prevent it racing with the + * irq handler). + * */ static int mc32_send_packet(struct sk_buff *skb, struct net_device *dev) { struct mc32_local *lp = (struct mc32_local *)dev->priv; - unsigned long flags; + u32 head = atomic_read(&lp->tx_ring_head); volatile struct skb_header *p, *np; netif_stop_queue(dev); - save_flags(flags); - cli(); - - if(atomic_read(&lp->tx_count)==0) - { - restore_flags(flags); + if(atomic_read(&lp->tx_count)==0) { return 1; } + skb = skb_padto(skb, ETH_ZLEN); + if (skb == NULL) { + netif_wake_queue(dev); + return 0; + } + atomic_dec(&lp->tx_count); /* P is the last sending/sent buffer as a pointer */ - p=lp->tx_ring[lp->tx_ring_head].p; + p=lp->tx_ring[head].p; - lp->tx_ring_head=next_tx(lp->tx_ring_head); + head = next_tx(head); /* NP is the buffer we will be loading */ - np=lp->tx_ring[lp->tx_ring_head].p; - - if (skb->len < ETH_ZLEN) { - skb = skb_padto(skb, ETH_ZLEN); - if (skb == NULL) - goto out; - } + np=lp->tx_ring[head].p; /* We will need this to flush the buffer out */ - lp->tx_ring[lp->tx_ring_head].skb = skb; - - np->length = (skb->len < ETH_ZLEN) ? ETH_ZLEN : skb->len; - + lp->tx_ring[head].skb=skb; + + np->length = unlikely(skb->len < ETH_ZLEN) ? ETH_ZLEN : skb->len; np->data = isa_virt_to_bus(skb->data); np->status = 0; np->control = CONTROL_EOP | CONTROL_EOL; wmb(); - p->control &= ~CONTROL_EOL; /* Clear EOL on p */ -out: - restore_flags(flags); + /* + * The new frame has been setup; we can now + * let the interrupt handler and card "see" it + */ + + atomic_set(&lp->tx_ring_head, head); + p->control &= ~CONTROL_EOL; netif_wake_queue(dev); return 0; @@ -1182,10 +1137,11 @@ { struct mc32_local *lp=dev->priv; volatile struct skb_header *p; - u16 rx_ring_tail = lp->rx_ring_tail; - u16 rx_old_tail = rx_ring_tail; - + u16 rx_ring_tail; + u16 rx_old_tail; int x=0; + + rx_old_tail = rx_ring_tail = lp->rx_ring_tail; do { @@ -1275,9 +1231,14 @@ struct mc32_local *lp=(struct mc32_local *)dev->priv; volatile struct skb_header *np; - /* NB: lp->tx_count=TX_RING_LEN-1 so that tx_ring_head cannot "lap" tail here */ + /* + * We rely on head==tail to mean 'queue empty'. 
+ * This is why lp->tx_count=TX_RING_LEN-1: in order to prevent + * tx_ring_head wrapping to tail and confusing a 'queue empty' + * condition with 'queue full' + */ - while (lp->tx_ring_tail != lp->tx_ring_head) + while (lp->tx_ring_tail != atomic_read(&lp->tx_ring_head)) { u16 t; @@ -1388,8 +1349,7 @@ break; case 3: /* Halt */ case 4: /* Abort */ - lp->xceiver_state |= TX_HALTED; - wake_up(&lp->event); + complete(&lp->xceiver_cmd); break; default: printk("%s: strange tx ack %d\n", dev->name, status&7); @@ -1404,8 +1364,7 @@ break; case 3: /* Halt */ case 4: /* Abort */ - lp->xceiver_state |= RX_HALTED; - wake_up(&lp->event); + complete(&lp->xceiver_cmd); break; case 6: /* Out of RX buffers stat */ @@ -1421,26 +1380,17 @@ status>>=3; if(status&1) { - - /* 0=no 1=yes 2=replied, get cmd, 3 = wait reply & dump it */ - - if(lp->exec_pending!=3) { - lp->exec_pending=2; - wake_up(&lp->event); - } - else - { - lp->exec_pending=0; - - /* A new multicast set may have been - blocked while the old one was - running. If so, do it now. */ + /* + * No thread is waiting: we need to tidy + * up ourself. + */ + if (lp->cmd_nonblocking) { + up(&lp->cmd_mutex); if (lp->mc_reload_wait) mc32_reset_multicast_list(dev); - else - wake_up(&lp->event); } + else complete(&lp->execution_cmd); } if(status&2) { @@ -1493,12 +1443,12 @@ static int mc32_close(struct net_device *dev) { struct mc32_local *lp = (struct mc32_local *)dev->priv; - int ioaddr = dev->base_addr; + u8 regs; u16 one=1; - lp->desired_state = HALTED; + lp->xceiver_desired_state = HALTED; netif_stop_queue(dev); /* @@ -1511,11 +1461,10 @@ mc32_halt_transceiver(dev); - /* Catch any waiting commands */ + /* Ensure we issue no more commands beyond this point */ + + down(&lp->cmd_mutex); - while(lp->exec_pending==1) - sleep_on(&lp->event); - /* Ok the card is now stopping */ regs=inb(ioaddr+HOST_CTRL); @@ -1542,12 +1491,9 @@ static struct net_device_stats *mc32_get_stats(struct net_device *dev) { - struct mc32_local *lp; + struct mc32_local *lp = (struct mc32_local *)dev->priv; mc32_update_stats(dev); - - lp = (struct mc32_local *)dev->priv; - return &lp->net_stats; } --- diff/drivers/net/3c527.h 2002-10-16 04:27:13.000000000 +0100 +++ source/drivers/net/3c527.h 2003-11-26 10:09:06.000000000 +0000 @@ -27,10 +27,8 @@ #define HOST_RAMPAGE 8 -#define RX_HALTED (1<<0) -#define TX_HALTED (1<<1) -#define HALTED (RX_HALTED | TX_HALTED) -#define RUNNING 0 +#define HALTED 0 +#define RUNNING 1 struct mc32_mailbox { --- diff/drivers/net/3c59x.c 2003-10-09 09:47:16.000000000 +0100 +++ source/drivers/net/3c59x.c 2003-11-26 10:09:06.000000000 +0000 @@ -1063,6 +1063,22 @@ return rc; } +#ifdef CONFIG_NET_POLL_CONTROLLER +static void vortex_rx_poll(struct net_device *dev) +{ + disable_irq(dev->irq); + vortex_interrupt(dev->irq, (void *)dev, 0); + enable_irq(dev->irq); +} + +static void boomerang_rx_poll(struct net_device *dev) +{ + disable_irq(dev->irq); + boomerang_interrupt(dev->irq, (void *)dev, 0); + enable_irq(dev->irq); +} +#endif + /* * Start up the PCI/EISA device which is described by *gendev. * Return 0 on success. 
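The vortex_rx_poll()/boomerang_rx_poll() pair above is the first of several CONFIG_NET_POLL_CONTROLLER hooks this patch adds (8139too, e100, eepro100, tlan and tulip follow the same shape): run the driver's normal interrupt handler with the device's own irq masked, so netconsole or kgdb-over-ethernet can drain the rings even when interrupt delivery cannot be relied on. A generic sketch (hypothetical mydev_* names, stock 2.6 handler signature):

        #include <linux/netdevice.h>
        #include <linux/interrupt.h>

        /* the driver's existing interrupt handler, defined elsewhere */
        static irqreturn_t mydev_interrupt(int irq, void *dev_id, struct pt_regs *regs);

        #ifdef CONFIG_NET_POLL_CONTROLLER
        /*
         * Polling 'interrupt' - pump the rx/tx paths by hand.  disable_irq()
         * masks the line and waits out a handler already running elsewhere,
         * so the direct call below cannot race with the real interrupt.
         */
        static void mydev_poll_controller(struct net_device *dev)
        {
                disable_irq(dev->irq);
                mydev_interrupt(dev->irq, dev, NULL);
                enable_irq(dev->irq);
        }
        #endif

        /* and in the probe routine, under the same #ifdef:
         *      dev->poll_controller = mydev_poll_controller;
         */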
@@ -1450,6 +1466,13 @@ dev->set_multicast_list = set_rx_mode; dev->tx_timeout = vortex_tx_timeout; dev->watchdog_timeo = (watchdog * HZ) / 1000; +#ifdef CONFIG_NET_POLL_CONTROLLER + if (vp->full_bus_master_tx) + dev->poll_controller = boomerang_rx_poll; + else + dev->poll_controller = vortex_rx_poll; +#endif + if (pdev) { vp->pm_state_valid = 1; pci_save_state(VORTEX_PCI(vp), vp->power_state); --- diff/drivers/net/8139too.c 2003-11-25 15:24:57.000000000 +0000 +++ source/drivers/net/8139too.c 2003-11-26 10:09:06.000000000 +0000 @@ -620,6 +620,10 @@ static void rtl8139_hw_start (struct net_device *dev); static struct ethtool_ops rtl8139_ethtool_ops; +#ifdef CONFIG_NET_POLL_CONTROLLER +static void rtl8139_rx_poll (struct net_device *dev); +#endif + #ifdef USE_IO_OPS #define RTL_R8(reg) inb (((unsigned long)ioaddr) + (reg)) @@ -972,6 +976,10 @@ dev->tx_timeout = rtl8139_tx_timeout; dev->watchdog_timeo = TX_TIMEOUT; +#ifdef CONFIG_NET_POLL_CONTROLLER + dev->poll_controller = rtl8139_rx_poll; +#endif + /* note: the hardware is not capable of sg/csum/highdma, however * through the use of skb_copy_and_csum_dev we enable these * features @@ -2390,6 +2398,15 @@ return &tp->stats; } +#ifdef CONFIG_NET_POLL_CONTROLLER +static void rtl8139_rx_poll (struct net_device *dev) +{ + disable_irq(dev->irq); + rtl8139_interrupt(dev->irq, (void *)dev, 0); + enable_irq(dev->irq); +} +#endif + /* Set or clear the multicast filter for this adaptor. This routine is not state sensitive and need not be SMP locked. */ @@ -2475,10 +2492,11 @@ tp->stats.rx_missed_errors += RTL_R32 (RxMissed); RTL_W32 (RxMissed, 0); + spin_unlock_irqrestore (&tp->lock, flags); + pci_set_power_state (pdev, 3); pci_save_state (pdev, tp->pci_state); - spin_unlock_irqrestore (&tp->lock, flags); return 0; } --- diff/drivers/net/Kconfig 2003-10-27 09:20:38.000000000 +0000 +++ source/drivers/net/Kconfig 2003-11-26 10:09:06.000000000 +0000 @@ -657,7 +657,7 @@ config ELMC_II tristate "3c527 \"EtherLink/MC 32\" support (EXPERIMENTAL)" - depends on NET_VENDOR_3COM && MCA && EXPERIMENTAL && BROKEN_ON_SMP + depends on NET_VENDOR_3COM && MCA && MCA_LEGACY help If you have a network (Ethernet) card of this type, say Y and read the Ethernet-HOWTO, available from @@ -1283,6 +1283,19 @@ <file:Documentation/networking/net-modules.txt>. The module will be called b44. +config FORCEDETH + tristate "Reverse Engineered nForce Ethernet support (EXPERIMENTAL)" + depends on NET_PCI && PCI && EXPERIMENTAL + help + If you have a network (Ethernet) controller of this type, say Y and + read the Ethernet-HOWTO, available from + <http://www.tldp.org/docs.html#howto>. + + To compile this driver as a module, choose M here and read + <file:Documentation/networking/net-modules.txt>. The module will be + called forcedeth. + + config CS89x0 tristate "CS89x0 support" depends on NET_PCI && ISA @@ -2441,6 +2454,9 @@ To compile this driver as a module, choose M here: the module will be called shaper. If unsure, say N. 
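The 8139too suspend hunk above moves spin_unlock_irqrestore() in front of the PCI power calls; the might_sleep() this patch adds to pci_set_power_state() (see the drivers/pci/pci.c hunk below) now flags the old ordering as a bug. The rule as a sketch, against the 2.6.0 API in which pci_save_state() fills a caller-supplied buffer (mydev_* names are hypothetical):

        #include <linux/pci.h>
        #include <linux/netdevice.h>
        #include <linux/spinlock.h>

        struct mydev_priv {
                spinlock_t lock;
                u32 pci_state[16];      /* saved PCI config space */
        };

        static int mydev_suspend(struct pci_dev *pdev, u32 state)
        {
                struct net_device *dev = pci_get_drvdata(pdev);
                struct mydev_priv *np = dev->priv;
                unsigned long flags;

                netif_device_detach(dev);

                spin_lock_irqsave(&np->lock, flags);
                /* ... stop the chip, harvest the final counters ... */
                spin_unlock_irqrestore(&np->lock, flags);

                /* Only once the spinlock is dropped: pci_set_power_state()
                 * may sleep, and sleeping with a lock held or irqs off is
                 * exactly what might_sleep() will now complain about. */
                pci_set_power_state(pdev, 3);
                pci_save_state(pdev, np->pci_state);
                return 0;
        }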
+config NET_POLL_CONTROLLER + def_bool KGDB + source "drivers/net/wan/Kconfig" source "drivers/net/pcmcia/Kconfig" --- diff/drivers/net/Makefile 2003-10-09 09:47:16.000000000 +0100 +++ source/drivers/net/Makefile 2003-11-26 10:09:06.000000000 +0000 @@ -32,6 +32,8 @@ obj-$(CONFIG_OAKNET) += oaknet.o 8390.o +obj-$(CONFIG_KGDB) += kgdb_eth.o + obj-$(CONFIG_DGRS) += dgrs.o obj-$(CONFIG_RCPCI) += rcpci.o obj-$(CONFIG_VORTEX) += 3c59x.o @@ -95,6 +97,7 @@ obj-$(CONFIG_NE3210) += ne3210.o 8390.o obj-$(CONFIG_NET_SB1250_MAC) += sb1250-mac.o obj-$(CONFIG_B44) += b44.o +obj-$(CONFIG_FORCEDETH) += forcedeth.o obj-$(CONFIG_PPP) += ppp_generic.o slhc.o obj-$(CONFIG_PPP_ASYNC) += ppp_async.o --- diff/drivers/net/e100/e100_main.c 2003-10-09 09:47:16.000000000 +0100 +++ source/drivers/net/e100/e100_main.c 2003-11-26 10:09:06.000000000 +0000 @@ -539,6 +539,15 @@ readw(&(bdp->scb->scb_status)); /* flushes last write, read-safe */ } +#ifdef CONFIG_NET_POLL_CONTROLLER +static void e100_rx_poll(struct net_device *dev) +{ + disable_irq(dev->irq); + e100intr(dev->irq, (void *)dev, 0); + enable_irq(dev->irq); +} +#endif + static int __devinit e100_found1(struct pci_dev *pcid, const struct pci_device_id *ent) { @@ -631,7 +640,9 @@ dev->set_multicast_list = &e100_set_multi; dev->set_mac_address = &e100_set_mac; dev->do_ioctl = &e100_ioctl; - +#ifdef CONFIG_NET_POLL_CONTROLLER + dev->poll_controller = e100_rx_poll; +#endif if (bdp->flags & USE_IPCB) dev->features = NETIF_F_SG | NETIF_F_IP_CSUM | NETIF_F_HW_VLAN_TX | NETIF_F_HW_VLAN_RX; --- diff/drivers/net/eepro100.c 2003-10-09 09:47:16.000000000 +0100 +++ source/drivers/net/eepro100.c 2003-11-26 10:09:06.000000000 +0000 @@ -543,6 +543,9 @@ static int speedo_rx(struct net_device *dev); static void speedo_tx_buffer_gc(struct net_device *dev); static irqreturn_t speedo_interrupt(int irq, void *dev_instance, struct pt_regs *regs); +#ifdef CONFIG_NET_POLL_CONTROLLER +static void poll_speedo (struct net_device *dev); +#endif static int speedo_close(struct net_device *dev); static struct net_device_stats *speedo_get_stats(struct net_device *dev); static int speedo_ioctl(struct net_device *dev, struct ifreq *rq, int cmd); @@ -885,6 +888,9 @@ dev->get_stats = &speedo_get_stats; dev->set_multicast_list = &set_rx_mode; dev->do_ioctl = &speedo_ioctl; +#ifdef CONFIG_NET_POLL_CONTROLLER + dev->poll_controller = &poll_speedo; +#endif if (register_netdevice(dev)) goto err_free_unlock; @@ -1675,6 +1681,22 @@ return IRQ_RETVAL(handled); } +#ifdef CONFIG_NET_POLL_CONTROLLER + +/* + * Polling 'interrupt' - used by things like netconsole to send skbs + * without having to re-enable interrupts. It's not called while + * the interrupt routine is executing. + */ + +static void poll_speedo (struct net_device *dev) +{ + disable_irq(dev->irq); + speedo_interrupt (dev->irq, dev, NULL); + enable_irq(dev->irq); +} +#endif + static inline struct RxFD *speedo_rx_alloc(struct net_device *dev, int entry) { struct speedo_private *sp = (struct speedo_private *)dev->priv; --- diff/drivers/net/pppoe.c 2003-10-09 09:47:34.000000000 +0100 +++ source/drivers/net/pppoe.c 2003-11-26 10:09:06.000000000 +0000 @@ -352,7 +352,8 @@ if (!__pppoe_xmit( relay_po->sk, skb)) goto abort_put; } else { - sock_queue_rcv_skb(sk, skb); + if (sock_queue_rcv_skb(sk, skb)) + goto abort_kfree; } return NET_RX_SUCCESS; --- diff/drivers/net/sis900.c 2003-11-25 15:24:58.000000000 +0000 +++ source/drivers/net/sis900.c 2003-11-26 10:09:06.000000000 +0000 @@ -18,10 +18,11 @@ preliminary Rev. 1.0 Jan. 
18, 1998 http://www.sis.com.tw/support/databook.htm + Rev 1.08.07 Nov. 2 2003 Daniele Venzano <webvenza@libero.it> add suspend/resume support Rev 1.08.06 Sep. 24 2002 Mufasa Yang bug fix for Tx timeout & add SiS963 support - Rev 1.08.05 Jun. 6 2002 Mufasa Yang bug fix for read_eeprom & Tx descriptor over-boundary + Rev 1.08.05 Jun. 6 2002 Mufasa Yang bug fix for read_eeprom & Tx descriptor over-boundary Rev 1.08.04 Apr. 25 2002 Mufasa Yang <mufasa@sis.com.tw> added SiS962 support - Rev 1.08.03 Feb. 1 2002 Matt Domsch <Matt_Domsch@dell.com> update to use library crc32 function + Rev 1.08.03 Feb. 1 2002 Matt Domsch <Matt_Domsch@dell.com> update to use library crc32 function Rev 1.08.02 Nov. 30 2001 Hui-Fen Hsu workaround for EDB & bug fix for dhcp problem Rev 1.08.01 Aug. 25 2001 Hui-Fen Hsu update for 630ET & workaround for ICS1893 PHY Rev 1.08.00 Jun. 11 2001 Hui-Fen Hsu workaround for RTL8201 PHY and some bug fix @@ -72,7 +73,7 @@ #include "sis900.h" #define SIS900_MODULE_NAME "sis900" -#define SIS900_DRV_VERSION "v1.08.06 9/24/2002" +#define SIS900_DRV_VERSION "v1.08.07 11/02/2003" static char version[] __devinitdata = KERN_INFO "sis900.c: " SIS900_DRV_VERSION "\n"; @@ -169,6 +170,7 @@ unsigned int tx_full; /* The Tx queue is full. */ u8 host_bridge_rev; + u32 pci_state[16]; }; MODULE_AUTHOR("Jim Huang <cmhuang@sis.com.tw>, Ollie Lho <ollie@sis.com.tw>"); @@ -305,7 +307,7 @@ *( ((u16 *)net_dev->dev_addr) + i) = inw(ioaddr + rfdr); } - /* enable packet filitering */ + /* enable packet filtering */ outl(rfcrSave | RFEN, rfcr + ioaddr); return 1; @@ -994,7 +996,7 @@ } } - /* enable packet filitering */ + /* enable packet filtering */ outl(rfcrSave | RFEN, rfcr + ioaddr); } @@ -1466,7 +1468,7 @@ * @net_dev: the net device to transmit with * * Set the transmit buffer descriptor, - * and write TxENA to enable transimt state machine. + * and write TxENA to enable transmit state machine. * tell upper layer if the buffer is full */ @@ -2184,11 +2186,72 @@ pci_set_drvdata(pci_dev, NULL); } +#ifdef CONFIG_PM + +static int sis900_suspend(struct pci_dev *pci_dev, u32 state) +{ + struct net_device *net_dev = pci_get_drvdata(pci_dev); + struct sis900_private *sis_priv = net_dev->priv; + long ioaddr = net_dev->base_addr; + + if(!netif_running(net_dev)) + return 0; + + netif_stop_queue(net_dev); + + /* Stop the chip's Tx and Rx Status Machine */ + outl(RxDIS | TxDIS | inl(ioaddr + cr), ioaddr + cr); + + pci_set_power_state(pci_dev, 3); + pci_save_state(pci_dev, sis_priv->pci_state); + + return 0; +} + +static int sis900_resume(struct pci_dev *pci_dev) +{ + struct net_device *net_dev = pci_get_drvdata(pci_dev); + struct sis900_private *sis_priv = net_dev->priv; + long ioaddr = net_dev->base_addr; + + if(!netif_running(net_dev)) + return 0; + pci_restore_state(pci_dev, sis_priv->pci_state); + pci_set_power_state(pci_dev, 0); + + sis900_init_rxfilter(net_dev); + + sis900_init_tx_ring(net_dev); + sis900_init_rx_ring(net_dev); + + set_rx_mode(net_dev); + + netif_device_attach(net_dev); + netif_start_queue(net_dev); + + /* Workaround for EDB */ + sis900_set_mode(ioaddr, HW_SPEED_10_MBPS, FDX_CAPABLE_HALF_SELECTED); + + /* Enable all known interrupts by setting the interrupt mask. 
*/ + outl((RxSOVR|RxORN|RxERR|RxOK|TxURN|TxERR|TxIDLE), ioaddr + imr); + outl(RxENA | inl(ioaddr + cr), ioaddr + cr); + outl(IE, ioaddr + ier); + + sis900_check_mode(net_dev, sis_priv->mii); + + return 0; +} +#endif /* CONFIG_PM */ + static struct pci_driver sis900_pci_driver = { .name = SIS900_MODULE_NAME, .id_table = sis900_pci_tbl, .probe = sis900_probe, .remove = __devexit_p(sis900_remove), +#ifdef CONFIG_PM + .suspend = sis900_suspend, + .resume = sis900_resume, +#endif /* CONFIG_PM */ }; static int __init sis900_init_module(void) --- diff/drivers/net/tg3.c 2003-11-25 15:24:58.000000000 +0000 +++ source/drivers/net/tg3.c 2003-11-26 10:09:06.000000000 +0000 @@ -34,6 +34,10 @@ #include <asm/byteorder.h> #include <asm/uaccess.h> +#ifdef CONFIG_KGDB +#include <asm/kgdb.h> +#endif + #ifdef CONFIG_SPARC64 #include <asm/idprom.h> #include <asm/oplib.h> @@ -1454,6 +1458,17 @@ return 0; } +static irqreturn_t tg3_interrupt(int irq, void *dev_id, struct pt_regs *regs); +#ifdef CONFIG_KGDB +static void tg3_poll_controller(struct net_device *dev) +{ + disable_irq(dev->irq); + tg3_interrupt(dev->irq, (void *)dev, 0); + enable_irq(dev->irq); +} +#endif + + struct tg3_fiber_aneginfo { int state; #define ANEG_STATE_UNKNOWN 0 @@ -2183,6 +2198,100 @@ } #endif +#ifdef CONFIG_KGDB +/* Get skb from receive buffer */ +static void upcall_kgdb_hook(struct net_device *dev, int* drop) +{ + struct tg3 *tp = dev->priv; + u32 rx_rcb_ptr = tp->rx_rcb_ptr; + u16 hw_idx, sw_idx; + + hw_idx = tp->hw_status->idx[0].rx_producer; + sw_idx = rx_rcb_ptr % TG3_RX_RCB_RING_SIZE(tp); + while (sw_idx != hw_idx) { + struct tg3_rx_buffer_desc *desc = &tp->rx_rcb[sw_idx]; + unsigned int len; + struct sk_buff *skb; + dma_addr_t dma_addr; + u32 opaque_key, desc_idx ; + + desc_idx = desc->opaque & RXD_OPAQUE_INDEX_MASK; + opaque_key = desc->opaque & RXD_OPAQUE_RING_MASK; + if (opaque_key == RXD_OPAQUE_RING_STD) { + dma_addr = pci_unmap_addr(&tp->rx_std_buffers[desc_idx], + mapping); + skb = tp->rx_std_buffers[desc_idx].skb; + } else if (opaque_key == RXD_OPAQUE_RING_JUMBO) { + dma_addr = pci_unmap_addr(&tp->rx_jumbo_buffers[desc_idx], + mapping); + skb = tp->rx_jumbo_buffers[desc_idx].skb; + } + else { + goto next_pkt; + } + + + if ((desc->err_vlan & RXD_ERR_MASK) != 0 && + (desc->err_vlan != RXD_ERR_ODD_NIBBLE_RCVD_MII)) { + goto next_pkt; + } + + len = ((desc->idx_len & RXD_LEN_MASK) >> RXD_LEN_SHIFT) - 4; /* omit crc */ + + if (len > RX_COPY_THRESHOLD) { + int skb_size; + if (opaque_key == RXD_OPAQUE_RING_STD) + skb_size = RX_PKT_BUF_SZ; + else if (opaque_key == RXD_OPAQUE_RING_JUMBO) + skb_size = RX_JUMBO_PKT_BUF_SZ; + else + goto next_pkt; + skb = dev_alloc_skb(skb_size); + if (skb == NULL) + goto next_pkt; + skb->dev = tp->dev; + skb_reserve(skb, tp->rx_offset); + skb_put(skb, len); + } else { + struct sk_buff *copy_skb; + copy_skb = dev_alloc_skb(len + 2); + if (copy_skb == NULL) + goto next_pkt; + + copy_skb->dev = tp->dev; + skb_reserve(copy_skb, 2); + skb_put(copy_skb, len); + memcpy(copy_skb->data, skb->data, len); + + /* We'll reuse the original ring buffer. */ + skb = copy_skb; + } + if ((tp->tg3_flags & TG3_FLAG_RX_CHECKSUMS) && + (desc->type_flags & RXD_FLAG_TCPUDP_CSUM) && + (((desc->ip_tcp_csum & RXD_TCPCSUM_MASK) + >> RXD_TCPCSUM_SHIFT) == 0xffff)) + skb->ip_summed = CHECKSUM_UNNECESSARY; + else + skb->ip_summed = CHECKSUM_NONE; + + skb->protocol = eth_type_trans(skb, tp->dev); +/*into gdb driver*/ + if (!kgdb_net_interrupt(skb)) { + /* No.. 
if we're 'trapped' then junk it */ + if (kgdb_eth_is_trapped()) + *drop=1; + } else { + /* kgdb_eth ate the packet... drop it silently */ + *drop=1; + } + kfree_skb(skb); +next_pkt: + rx_rcb_ptr++; + sw_idx = rx_rcb_ptr % TG3_RX_RCB_RING_SIZE(tp); + } +} +#endif + /* The RX ring scheme is composed of multiple rings which post fresh * buffers to the chip, and one special ring the chip uses to report * status back to the host. @@ -2453,9 +2562,15 @@ tr32(MAILBOX_INTERRUPT_0 + TG3_64BIT_REG_LOW); sblk->status &= ~SD_STATUS_UPDATED; - if (likely(tg3_has_work(dev, tp))) + if (likely(tg3_has_work(dev, tp))) { +#ifdef CONFIG_KGDB + if (dev->poll_controller != NULL) { + int drop=0; + upcall_kgdb_hook(dev, &drop); /*drop may be used later */ + } +#endif netif_rx_schedule(dev); /* schedule NAPI poll */ - else { + } else { /* no work, shared interrupt perhaps? re-enable * interrupts, and flush that PCI write */ @@ -7636,6 +7751,9 @@ dev->watchdog_timeo = TG3_TX_TIMEOUT; dev->change_mtu = tg3_change_mtu; dev->irq = pdev->irq; +#ifdef CONFIG_KGDB + dev->poll_controller = tg3_poll_controller; +#endif err = tg3_get_invariants(tp); if (err) { --- diff/drivers/net/tlan.c 2003-11-25 15:24:58.000000000 +0000 +++ source/drivers/net/tlan.c 2003-11-26 10:09:06.000000000 +0000 @@ -297,6 +297,7 @@ static int TLan_probe1( struct pci_dev *pdev, long ioaddr, int irq, int rev, const struct pci_device_id *ent); static void TLan_tx_timeout( struct net_device *dev); static int tlan_init_one( struct pci_dev *pdev, const struct pci_device_id *ent); +static void TLan_Poll(struct net_device *dev); static u32 TLan_HandleInvalid( struct net_device *, u16 ); static u32 TLan_HandleTxEOF( struct net_device *, u16 ); @@ -453,6 +454,25 @@ pci_set_drvdata( pdev, NULL ); } +#ifdef CONFIG_NET_POLL_CONTROLLER + +/* + * Polling 'interrupt' - used by things like netconsole to send skbs + * without having to re-enable interrupts. It's not called while + * the interrupt routine is executing. + */ + +static void TLan_Poll (struct net_device *dev) +{ + disable_irq(dev->irq); + TLan_HandleInterrupt(dev->irq, dev, NULL); + enable_irq(dev->irq); +} + +#endif + + + static struct pci_driver tlan_driver = { .name = "tlan", .id_table = tlan_pci_tbl, @@ -895,6 +915,9 @@ dev->do_ioctl = &TLan_ioctl; dev->tx_timeout = &TLan_tx_timeout; dev->watchdog_timeo = TX_TIMEOUT; +#ifdef CONFIG_NET_POLL_CONTROLLER + dev->poll_controller = &TLan_Poll; +#endif return 0; --- diff/drivers/net/tulip/Kconfig 2003-09-30 15:46:16.000000000 +0100 +++ source/drivers/net/tulip/Kconfig 2003-11-26 10:09:06.000000000 +0000 @@ -68,6 +68,26 @@ obscure bugs if your mainboard has memory controller timing issues. If in doubt, say N. +config TULIP_NAPI + bool "Use NAPI RX polling " + depends on TULIP + ---help--- + This is useful for servers and routers dealing with high network loads. + + See <file:Documentation/networking/NAPI_HOWTO.txt>. + + If in doubt, say N. + +config TULIP_NAPI_HW_MITIGATION + bool "Use Interrupt Mitigation " + depends on TULIP_NAPI + ---help--- + Use HW to reduce RX interrupts. Not strictly necessary since NAPI reduces + RX interrupts by itself. This reduces RX interrupts even at + low traffic levels, at the cost of a small latency. + + If in doubt, say Y.
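TULIP_NAPI above switches the receive path to the 2.6 NAPI model that tulip_poll() implements in the following hunks: the interrupt handler masks the RX sources and schedules the device, and the poll method consumes a bounded amount of work before re-enabling interrupts. The contract reduced to a sketch (the mydev_* helpers are hypothetical stand-ins for the chip-specific pieces):

        #include <linux/kernel.h>
        #include <linux/netdevice.h>

        static void mydev_mask_rx_intrs(struct net_device *dev);   /* chip specific */
        static void mydev_unmask_rx_intrs(struct net_device *dev); /* chip specific */
        static int mydev_rx_one(struct net_device *dev);           /* 1 = got a packet */

        /* Interrupt handler: mask rx sources, defer the work to softirq. */
        static void mydev_rx_irq(struct net_device *dev)
        {
                mydev_mask_rx_intrs(dev);
                netif_rx_schedule(dev);         /* put dev on the poll list */
        }

        /* dev->poll: eat at most quota/budget packets per invocation. */
        static int mydev_poll(struct net_device *dev, int *budget)
        {
                int limit = min(*budget, dev->quota);
                int received = 0;

                while (received < limit && mydev_rx_one(dev))
                        received++;

                dev->quota -= received;
                *budget -= received;

                if (received >= limit)
                        return 1;               /* not done: poll us again */

                netif_rx_complete(dev);         /* off the poll list first... */
                mydev_unmask_rx_intrs(dev);     /* ...then re-arm the rx irq */
                return 0;
        }

        /* probe:  dev->poll = mydev_poll;  dev->weight = 16; */

Completing before unmasking matters: in the reverse order an interrupt can fire while the device is still on the poll list, its netif_rx_schedule() is ignored, and the pending packets are never polled - the race the comments in tulip_poll() below circle around.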
+ config DE4X5 tristate "Generic DECchip & DIGITAL EtherWORKS PCI/EISA" depends on NET_TULIP && (PCI || EISA) --- diff/drivers/net/tulip/interrupt.c 2003-06-09 14:18:19.000000000 +0100 +++ source/drivers/net/tulip/interrupt.c 2003-11-26 10:09:06.000000000 +0000 @@ -19,13 +19,13 @@ #include <linux/etherdevice.h> #include <linux/pci.h> - int tulip_rx_copybreak; unsigned int tulip_max_interrupt_work; -#ifdef CONFIG_NET_HW_FLOWCONTROL - +#ifdef CONFIG_TULIP_NAPI_HW_MITIGATION #define MIT_SIZE 15 +#define MIT_TABLE 15 /* We use 0 or max */ + unsigned int mit_table[MIT_SIZE+1] = { /* CRS11 21143 hardware Mitigation Control Interrupt @@ -99,16 +99,28 @@ return refilled; } +#ifdef CONFIG_TULIP_NAPI -static int tulip_rx(struct net_device *dev) +void oom_timer(unsigned long data) +{ + struct net_device *dev = (struct net_device *)data; + netif_rx_schedule(dev); +} + +int tulip_poll(struct net_device *dev, int *budget) { struct tulip_private *tp = (struct tulip_private *)dev->priv; int entry = tp->cur_rx % RX_RING_SIZE; - int rx_work_limit = tp->dirty_rx + RX_RING_SIZE - tp->cur_rx; + int rx_work_limit = *budget; int received = 0; -#ifdef CONFIG_NET_HW_FLOWCONTROL - int drop = 0, mit_sel = 0; + if (!netif_running(dev)) + goto done; + + if (rx_work_limit > dev->quota) + rx_work_limit = dev->quota; + +#ifdef CONFIG_TULIP_NAPI_HW_MITIGATION /* that one buffer is needed for mit activation; or might be a bug in the ring buffer code; check later -- JHS*/ @@ -119,6 +131,237 @@ if (tulip_debug > 4) printk(KERN_DEBUG " In tulip_rx(), entry %d %8.8x.\n", entry, tp->rx_ring[entry].status); + + do { + /* Acknowledge current RX interrupt sources. */ + outl((RxIntr | RxNoBuf), dev->base_addr + CSR5); + + + /* If we own the next entry, it is a new packet. Send it up. */ + while ( ! (tp->rx_ring[entry].status & cpu_to_le32(DescOwned))) { + s32 status = le32_to_cpu(tp->rx_ring[entry].status); + + + if (tp->dirty_rx + RX_RING_SIZE == tp->cur_rx) + break; + + if (tulip_debug > 5) + printk(KERN_DEBUG "%s: In tulip_rx(), entry %d %8.8x.\n", + dev->name, entry, status); + if (--rx_work_limit < 0) + goto not_done; + + if ((status & 0x38008300) != 0x0300) { + if ((status & 0x38000300) != 0x0300) { + /* Ignore earlier buffers. */ + if ((status & 0xffff) != 0x7fff) { + if (tulip_debug > 1) + printk(KERN_WARNING "%s: Oversized Ethernet frame " + "spanned multiple buffers, status %8.8x!\n", + dev->name, status); + tp->stats.rx_length_errors++; + } + } else if (status & RxDescFatalErr) { + /* There was a fatal error. */ + if (tulip_debug > 2) + printk(KERN_DEBUG "%s: Receive error, Rx status %8.8x.\n", + dev->name, status); + tp->stats.rx_errors++; /* end of a packet.*/ + if (status & 0x0890) tp->stats.rx_length_errors++; + if (status & 0x0004) tp->stats.rx_frame_errors++; + if (status & 0x0002) tp->stats.rx_crc_errors++; + if (status & 0x0001) tp->stats.rx_fifo_errors++; + } + } else { + /* Omit the four octet CRC from the length. */ + short pkt_len = ((status >> 16) & 0x7ff) - 4; + struct sk_buff *skb; + +#ifndef final_version + if (pkt_len > 1518) { + printk(KERN_WARNING "%s: Bogus packet size of %d (%#x).\n", + dev->name, pkt_len, pkt_len); + pkt_len = 1518; + tp->stats.rx_length_errors++; + } +#endif + /* Check if the packet is long enough to accept without copying + to a minimally-sized skbuff.
*/ + if (pkt_len < tulip_rx_copybreak + && (skb = dev_alloc_skb(pkt_len + 2)) != NULL) { + skb->dev = dev; + skb_reserve(skb, 2); /* 16 byte align the IP header */ + pci_dma_sync_single(tp->pdev, + tp->rx_buffers[entry].mapping, + pkt_len, PCI_DMA_FROMDEVICE); +#if ! defined(__alpha__) + eth_copy_and_sum(skb, tp->rx_buffers[entry].skb->tail, + pkt_len, 0); + skb_put(skb, pkt_len); +#else + memcpy(skb_put(skb, pkt_len), + tp->rx_buffers[entry].skb->tail, + pkt_len); +#endif + } else { /* Pass up the skb already on the Rx ring. */ + char *temp = skb_put(skb = tp->rx_buffers[entry].skb, + pkt_len); + +#ifndef final_version + if (tp->rx_buffers[entry].mapping != + le32_to_cpu(tp->rx_ring[entry].buffer1)) { + printk(KERN_ERR "%s: Internal fault: The skbuff addresses " + "do not match in tulip_rx: %08x vs. %08x %p / %p.\n", + dev->name, + le32_to_cpu(tp->rx_ring[entry].buffer1), + tp->rx_buffers[entry].mapping, + skb->head, temp); + } +#endif + + pci_unmap_single(tp->pdev, tp->rx_buffers[entry].mapping, + PKT_BUF_SZ, PCI_DMA_FROMDEVICE); + + tp->rx_buffers[entry].skb = NULL; + tp->rx_buffers[entry].mapping = 0; + } + skb->protocol = eth_type_trans(skb, dev); + + netif_receive_skb(skb); + + dev->last_rx = jiffies; + tp->stats.rx_packets++; + tp->stats.rx_bytes += pkt_len; + } + received++; + + entry = (++tp->cur_rx) % RX_RING_SIZE; + if (tp->cur_rx - tp->dirty_rx > RX_RING_SIZE/4) + tulip_refill_rx(dev); + + } + + /* New ack strategy... irq does not ack Rx any longer + hopefully this helps */ + + /* Really bad things can happen here... If a new packet arrives + * and an irq arrives (tx or just due to occasionally unset + * mask), it will be acked by the irq handler, but the new thread + * is not scheduled. It is a major hole in the design. + * No idea how to fix this if "playing with fire" will fail + * tomorrow (night 011029). If it will not fail, we have + * finally won: the amount of IO did not increase at all. */ + } while ((inl(dev->base_addr + CSR5) & RxIntr)); + +done: + + #ifdef CONFIG_TULIP_NAPI_HW_MITIGATION + + /* We use this simplistic scheme for IM. It's proven by + real life installations. We can have IM enabled + continuously but this would cause unnecessary latency. + Unfortunately we can't use all the NET_RX_* feedback here. + This would turn on IM for devices that are not contributing + to backlog congestion with unnecessary latency. + + We monitor the device RX-ring and have: + + HW Interrupt Mitigation either ON or OFF. + + ON: More than 1 pkt received (per intr.) OR we are dropping + OFF: Only 1 pkt received + + Note. We only use min and max (0, 15) settings from mit_table */ + + + if( tp->flags & HAS_INTR_MITIGATION) { + if( received > 1 ) { + if( ! tp->mit_on ) { + tp->mit_on = 1; + outl(mit_table[MIT_TABLE], dev->base_addr + CSR11); + } + } + else { + if( tp->mit_on ) { + tp->mit_on = 0; + outl(0, dev->base_addr + CSR11); + } + } + } + +#endif /* CONFIG_TULIP_NAPI_HW_MITIGATION */ + + dev->quota -= received; + *budget -= received; + + tulip_refill_rx(dev); + + /* If RX ring is not full we are out of memory. */ + if (tp->rx_buffers[tp->dirty_rx % RX_RING_SIZE].skb == NULL) goto oom; + + /* Remove us from polling list and enable RX intr. */ + + netif_rx_complete(dev); + outl(tulip_tbl[tp->chip_id].valid_intrs, dev->base_addr+CSR7); + + /* The last op happens after poll completion. Which means the following: + * 1. it can race with disabling irqs in irq handler + * 2. it can race with dis/enabling irqs in other poll threads + * 3.
if an irq raised after beginning loop, it will be immediately + * triggered here. + * + * Summarizing: the logic results in some redundant irqs both + * due to races in masking and due to too late acking of already + * processed irqs. But it must not result in losing events. + */ + + return 0; + + not_done: + if (!received) { + + received = dev->quota; /* Not to happen */ + } + dev->quota -= received; + *budget -= received; + + if (tp->cur_rx - tp->dirty_rx > RX_RING_SIZE/2 || + tp->rx_buffers[tp->dirty_rx % RX_RING_SIZE].skb == NULL) + tulip_refill_rx(dev); + + if (tp->rx_buffers[tp->dirty_rx % RX_RING_SIZE].skb == NULL) goto oom; + + return 1; + + + oom: /* Executed with RX ints disabled */ + + + /* Start timer, stop polling, but do not enable rx interrupts. */ + mod_timer(&tp->oom_timer, jiffies+1); + + /* Think: timer_pending() was an explicit signature of bug. + * Timer can be pending now but fired and completed + * before we did netif_rx_complete(). See? We would lose it. */ + + /* remove ourselves from the polling list */ + netif_rx_complete(dev); + + return 0; +} + +#else /* CONFIG_TULIP_NAPI */ + +static int tulip_rx(struct net_device *dev) +{ + struct tulip_private *tp = (struct tulip_private *)dev->priv; + int entry = tp->cur_rx % RX_RING_SIZE; + int rx_work_limit = tp->dirty_rx + RX_RING_SIZE - tp->cur_rx; + int received = 0; + + if (tulip_debug > 4) + printk(KERN_DEBUG " In tulip_rx(), entry %d %8.8x.\n", entry, + tp->rx_ring[entry].status); /* If we own the next entry, it is a new packet. Send it up. */ while ( ! (tp->rx_ring[entry].status & cpu_to_le32(DescOwned))) { s32 status = le32_to_cpu(tp->rx_ring[entry].status); @@ -163,11 +406,6 @@ } #endif -#ifdef CONFIG_NET_HW_FLOWCONTROL - drop = atomic_read(&netdev_dropping); - if (drop) - goto throttle; -#endif /* Check if the packet is long enough to accept without copying to a minimally-sized skbuff. */ if (pkt_len < tulip_rx_copybreak @@ -209,44 +447,9 @@ tp->rx_buffers[entry].mapping = 0; } skb->protocol = eth_type_trans(skb, dev); -#ifdef CONFIG_NET_HW_FLOWCONTROL - mit_sel = -#endif - netif_rx(skb); -#ifdef CONFIG_NET_HW_FLOWCONTROL - switch (mit_sel) { - case NET_RX_SUCCESS: - case NET_RX_CN_LOW: - case NET_RX_CN_MOD: - break; - - case NET_RX_CN_HIGH: - rx_work_limit -= NET_RX_CN_HIGH; /* additional*/ - break; - case NET_RX_DROP: - rx_work_limit = -1; - break; - default: - printk("unknown feedback return code %d\n", mit_sel); - break; - } + netif_rx(skb); - drop = atomic_read(&netdev_dropping); - if (drop) { -throttle: - rx_work_limit = -1; - mit_sel = NET_RX_DROP; - - if (tp->fc_bit) { - long ioaddr = dev->base_addr; - - /* disable Rx & RxNoBuf ints. */ - outl(tulip_tbl[tp->chip_id].valid_intrs&RX_A_NBF_STOP, ioaddr + CSR7); - set_bit(tp->fc_bit, &netdev_fc_xoff); - } - } -#endif dev->last_rx = jiffies; tp->stats.rx_packets++; tp->stats.rx_bytes += pkt_len; @@ -254,42 +457,9 @@ received++; entry = (++tp->cur_rx) % RX_RING_SIZE; } -#ifdef CONFIG_NET_HW_FLOWCONTROL - - /* We use this simplistic scheme for IM. It's proven by - real life installations. We can have IM enabled - continuesly but this would cause unnecessary latency. - Unfortunely we can't use all the NET_RX_* feedback here. - This would turn on IM for devices that is not contributing - to backlog congestion with unnecessary latency. - - We monitor the device RX-ring and have: - - HW Interrupt Mitigation either ON or OFF. - - ON: More then 1 pkt received (per intr.) OR we are dropping - OFF: Only 1 pkt received - - Note. 
We only use min and max (0, 15) settings from mit_table */ - - - if( tp->flags & HAS_INTR_MITIGATION) { - if((received > 1 || mit_sel == NET_RX_DROP) - && tp->mit_sel != 15 ) { - tp->mit_sel = 15; - tp->mit_change = 1; /* Force IM change */ - } - if((received <= 1 && mit_sel != NET_RX_DROP) && tp->mit_sel != 0 ) { - tp->mit_sel = 0; - tp->mit_change = 1; /* Force IM change */ - } - } - - return RX_RING_SIZE+1; /* maxrx+1 */ -#else return received; -#endif } +#endif /* CONFIG_TULIP_NAPI */ static inline unsigned int phy_interrupt (struct net_device *dev) { @@ -323,7 +493,6 @@ struct tulip_private *tp = (struct tulip_private *)dev->priv; long ioaddr = dev->base_addr; int csr5; - int entry; int missed; int rx = 0; int tx = 0; @@ -331,6 +500,11 @@ int maxrx = RX_RING_SIZE; int maxtx = TX_RING_SIZE; int maxoi = TX_RING_SIZE; +#ifdef CONFIG_TULIP_NAPI + int rxd = 0; +#else + int entry; +#endif unsigned int work_count = tulip_max_interrupt_work; unsigned int handled = 0; @@ -346,22 +520,41 @@ tp->nir++; do { + +#ifdef CONFIG_TULIP_NAPI + + if (!rxd && (csr5 & (RxIntr | RxNoBuf))) { + rxd++; + /* Mask RX intrs and add the device to poll list. */ + outl(tulip_tbl[tp->chip_id].valid_intrs&~RxPollInt, ioaddr + CSR7); + netif_rx_schedule(dev); + + if (!(csr5&~(AbnormalIntr|NormalIntr|RxPollInt|TPLnkPass))) + break; + } + + /* Acknowledge the interrupt sources we handle here ASAP + the poll function does Rx and RxNoBuf acking */ + + outl(csr5 & 0x0001ff3f, ioaddr + CSR5); + +#else /* Acknowledge all of the current interrupt sources ASAP. */ outl(csr5 & 0x0001ffff, ioaddr + CSR5); - if (tulip_debug > 4) - printk(KERN_DEBUG "%s: interrupt csr5=%#8.8x new csr5=%#8.8x.\n", - dev->name, csr5, inl(dev->base_addr + CSR5)); if (csr5 & (RxIntr | RxNoBuf)) { -#ifdef CONFIG_NET_HW_FLOWCONTROL - if ((!tp->fc_bit) || - (!test_bit(tp->fc_bit, &netdev_fc_xoff))) -#endif rx += tulip_rx(dev); tulip_refill_rx(dev); } +#endif /* CONFIG_TULIP_NAPI */ + + if (tulip_debug > 4) + printk(KERN_DEBUG "%s: interrupt csr5=%#8.8x new csr5=%#8.8x.\n", + dev->name, csr5, inl(dev->base_addr + CSR5)); + + if (csr5 & (TxNoBuf | TxDied | TxIntr | TimerInt)) { unsigned int dirty_tx; @@ -462,15 +655,8 @@ } if (csr5 & RxDied) { /* Missed a Rx frame. */ tp->stats.rx_missed_errors += inl(ioaddr + CSR8) & 0xffff; -#ifdef CONFIG_NET_HW_FLOWCONTROL - if (tp->fc_bit && !test_bit(tp->fc_bit, &netdev_fc_xoff)) { - tp->stats.rx_errors++; - tulip_start_rxtx(tp); - } -#else tp->stats.rx_errors++; tulip_start_rxtx(tp); -#endif } /* * NB: t21142_lnk_change() does a del_timer_sync(), so be careful if this @@ -504,10 +690,6 @@ if (tulip_debug > 2) printk(KERN_ERR "%s: Re-enabling interrupts, %8.8x.\n", dev->name, csr5); -#ifdef CONFIG_NET_HW_FLOWCONTROL - if (tp->fc_bit && (test_bit(tp->fc_bit, &netdev_fc_xoff))) - if (net_ratelimit()) printk("BUG!! enabling interrupt when FC off (timerintr.) \n"); -#endif outl(tulip_tbl[tp->chip_id].valid_intrs, ioaddr + CSR7); tp->ttimer = 0; oi++; @@ -520,16 +702,9 @@ /* Acknowledge all interrupt sources. 
*/ outl(0x8001ffff, ioaddr + CSR5); if (tp->flags & HAS_INTR_MITIGATION) { -#ifdef CONFIG_NET_HW_FLOWCONTROL - if(tp->mit_change) { - outl(mit_table[tp->mit_sel], ioaddr + CSR11); - tp->mit_change = 0; - } -#else /* Josip Loncaric at ICASE did extensive experimentation to develop a good interrupt mitigation setting.*/ outl(0x8b240000, ioaddr + CSR11); -#endif } else if (tp->chip_id == LC82C168) { /* the LC82C168 doesn't have a hw timer.*/ outl(0x00, ioaddr + CSR7); @@ -537,10 +712,8 @@ } else { /* Mask all interrupting sources, set timer to re-enable. */ -#ifndef CONFIG_NET_HW_FLOWCONTROL outl(((~csr5) & 0x0001ebef) | AbnormalIntr | TimerInt, ioaddr + CSR7); outl(0x0012, ioaddr + CSR11); -#endif } break; } @@ -550,6 +723,21 @@ break; csr5 = inl(ioaddr + CSR5); + +#ifdef CONFIG_TULIP_NAPI + if (rxd) + csr5 &= ~RxPollInt; + } while ((csr5 & (TxNoBuf | + TxDied | + TxIntr | + TimerInt | + /* Abnormal intr. */ + RxDied | + TxFIFOUnderflow | + TxJabber | + TPLnkFail | + SytemError )) != 0); +#else } while ((csr5 & (NormalIntr|AbnormalIntr)) != 0); tulip_refill_rx(dev); @@ -574,6 +762,7 @@ } } } +#endif /* CONFIG_TULIP_NAPI */ if ((missed = inl(ioaddr + CSR8) & 0x1ffff)) { tp->stats.rx_dropped += missed & 0x10000 ? 0x10000 : missed; --- diff/drivers/net/tulip/tulip.h 2003-06-09 14:18:19.000000000 +0100 +++ source/drivers/net/tulip/tulip.h 2003-11-26 10:09:06.000000000 +0000 @@ -126,6 +126,7 @@ CFDD_Snooze = (1 << 30), }; +#define RxPollInt (RxIntr|RxNoBuf|RxDied|RxJabber) /* The bits in the CSR5 status registers, mostly interrupt sources. */ enum status_bits { @@ -251,9 +252,9 @@ Making the Tx ring too large decreases the effectiveness of channel bonding and packet priority. There are no ill effects from too-large receive rings. */ -#define TX_RING_SIZE 16 -#define RX_RING_SIZE 32 +#define TX_RING_SIZE 32 +#define RX_RING_SIZE 128 #define MEDIA_MASK 31 #define PKT_BUF_SZ 1536 /* Size of each temporary Rx buffer. */ @@ -343,17 +344,15 @@ int flags; struct net_device_stats stats; struct timer_list timer; /* Media selection timer. */ + struct timer_list oom_timer; /* Out of memory timer. */ u32 mc_filter[2]; spinlock_t lock; spinlock_t mii_lock; unsigned int cur_rx, cur_tx; /* The next free ring entry */ unsigned int dirty_rx, dirty_tx; /* The ring entries to be free()ed. */ -#ifdef CONFIG_NET_HW_FLOWCONTROL -#define RX_A_NBF_STOP 0xffffff3f /* To disable RX and RX-NOBUF ints. */ - int fc_bit; - int mit_sel; - int mit_change; /* Signal for Interrupt Mitigtion */ +#ifdef CONFIG_TULIP_NAPI_HW_MITIGATION + int mit_on; #endif unsigned int full_duplex:1; /* Full-duplex operation requested. 
*/ unsigned int full_duplex_lock:1; @@ -415,6 +414,10 @@ extern int tulip_rx_copybreak; irqreturn_t tulip_interrupt(int irq, void *dev_instance, struct pt_regs *regs); int tulip_refill_rx(struct net_device *dev); +#ifdef CONFIG_TULIP_NAPI +int tulip_poll(struct net_device *dev, int *budget); +#endif + /* media.c */ int tulip_mdio_read(struct net_device *dev, int phy_id, int location); @@ -438,6 +441,7 @@ extern const char * const medianame[]; extern const char tulip_media_cap[]; extern struct tulip_chip_table tulip_tbl[]; +void oom_timer(unsigned long data); extern u8 t21040_csr13[]; #ifndef USE_IO_OPS --- diff/drivers/net/tulip/tulip_core.c 2003-11-25 15:24:58.000000000 +0000 +++ source/drivers/net/tulip/tulip_core.c 2003-11-26 10:09:06.000000000 +0000 @@ -14,11 +14,17 @@ */ +#include <linux/config.h> + #define DRV_NAME "tulip" +#ifdef CONFIG_TULIP_NAPI +#define DRV_VERSION "1.1.13-NAPI" /* Keep at least for test */ +#else #define DRV_VERSION "1.1.13" +#endif #define DRV_RELDATE "May 11, 2002" -#include <linux/config.h> + #include <linux/module.h> #include "tulip.h" #include <linux/pci.h> @@ -247,6 +253,9 @@ static struct net_device_stats *tulip_get_stats(struct net_device *dev); static int private_ioctl(struct net_device *dev, struct ifreq *rq, int cmd); static void set_rx_mode(struct net_device *dev); +#ifdef CONFIG_NET_POLL_CONTROLLER +static void poll_tulip(struct net_device *dev); +#endif @@ -466,29 +475,16 @@ to an alternate media type. */ tp->timer.expires = RUN_AT(next_tick); add_timer(&tp->timer); -} - -#ifdef CONFIG_NET_HW_FLOWCONTROL -/* Enable receiver */ -void tulip_xon(struct net_device *dev) -{ - struct tulip_private *tp = (struct tulip_private *)dev->priv; - - clear_bit(tp->fc_bit, &netdev_fc_xoff); - if (netif_running(dev)){ - - tulip_refill_rx(dev); - outl(tulip_tbl[tp->chip_id].valid_intrs, dev->base_addr+CSR7); - } -} +#ifdef CONFIG_TULIP_NAPI + init_timer(&tp->oom_timer); + tp->oom_timer.data = (unsigned long)dev; + tp->oom_timer.function = oom_timer; #endif +} static int tulip_open(struct net_device *dev) { -#ifdef CONFIG_NET_HW_FLOWCONTROL - struct tulip_private *tp = (struct tulip_private *)dev->priv; -#endif int retval; if ((retval = request_irq(dev->irq, &tulip_interrupt, SA_SHIRQ, dev->name, dev))) @@ -498,10 +494,6 @@ tulip_up (dev); -#ifdef CONFIG_NET_HW_FLOWCONTROL - tp->fc_bit = netdev_register_fc(dev, tulip_xon); -#endif - netif_start_queue (dev); return 0; @@ -582,10 +574,7 @@ #endif /* Stop and restart the chip's Tx processes . */ -#ifdef CONFIG_NET_HW_FLOWCONTROL - if (tp->fc_bit && test_bit(tp->fc_bit,&netdev_fc_xoff)) - printk("BUG tx_timeout restarting rx when fc on\n"); -#endif + tulip_restart_rxtx(tp); /* Trigger an immediate transmit demand. */ outl(0, ioaddr + CSR1); @@ -742,7 +731,9 @@ unsigned long flags; del_timer_sync (&tp->timer); - +#ifdef CONFIG_TULIP_NAPI + del_timer_sync (&tp->oom_timer); +#endif spin_lock_irqsave (&tp->lock, flags); /* Disable interrupts by clearing the interrupt mask. 
*/ @@ -781,13 +772,6 @@ netif_stop_queue (dev); -#ifdef CONFIG_NET_HW_FLOWCONTROL - if (tp->fc_bit) { - int bit = tp->fc_bit; - tp->fc_bit = 0; - netdev_unregister_fc(bit); - } -#endif tulip_down (dev); if (tulip_debug > 1) @@ -1629,10 +1613,17 @@ dev->hard_start_xmit = tulip_start_xmit; dev->tx_timeout = tulip_tx_timeout; dev->watchdog_timeo = TX_TIMEOUT; +#ifdef CONFIG_TULIP_NAPI + dev->poll = tulip_poll; + dev->weight = 16; +#endif dev->stop = tulip_close; dev->get_stats = tulip_get_stats; dev->do_ioctl = private_ioctl; dev->set_multicast_list = set_rx_mode; +#ifdef CONFIG_NET_POLL_CONTROLLER + dev->poll_controller = &poll_tulip; +#endif if (register_netdev(dev)) goto err_out_free_ring; @@ -1790,6 +1781,24 @@ } +#ifdef CONFIG_NET_POLL_CONTROLLER + +/* + * Polling 'interrupt' - used by things like netconsole to send skbs + * without having to re-enable interrupts. It's not called while + * the interrupt routine is executing. + */ + +static void poll_tulip (struct net_device *dev) +{ + disable_irq(dev->irq); + tulip_interrupt (dev->irq, dev, NULL); + enable_irq(dev->irq); +} + +#endif + + static struct pci_driver tulip_driver = { .name = DRV_NAME, .id_table = tulip_pci_tbl, --- diff/drivers/pci/Makefile 2003-10-09 09:47:34.000000000 +0100 +++ source/drivers/pci/Makefile 2003-11-26 10:09:06.000000000 +0000 @@ -27,6 +27,7 @@ obj-$(CONFIG_SGI_IP27) += setup-irq.o obj-$(CONFIG_SGI_IP32) += setup-irq.o obj-$(CONFIG_X86_VISWS) += setup-irq.o +obj-$(CONFIG_PCI_USE_VECTOR) += msi.o # Cardbus & CompactPCI use setup-bus obj-$(CONFIG_HOTPLUG) += setup-bus.o --- diff/drivers/pci/pci.c 2003-10-09 09:47:34.000000000 +0100 +++ source/drivers/pci/pci.c 2003-11-26 10:09:06.000000000 +0000 @@ -223,6 +223,8 @@ int pm; u16 pmcsr; + might_sleep(); + /* bound the state we're entering */ if (state > 3) state = 3; --- diff/drivers/pci/probe.c 2003-10-09 09:47:16.000000000 +0100 +++ source/drivers/pci/probe.c 2003-11-26 10:09:06.000000000 +0000 @@ -176,7 +176,7 @@ limit |= (io_limit_hi << 16); } - if (base && base <= limit) { + if (base <= limit) { res->flags = (io_base_lo & PCI_IO_RANGE_TYPE_MASK) | IORESOURCE_IO; res->start = base; res->end = limit + 0xfff; @@ -187,7 +187,7 @@ pci_read_config_word(dev, PCI_MEMORY_LIMIT, &mem_limit_lo); base = (mem_base_lo & PCI_MEMORY_RANGE_MASK) << 16; limit = (mem_limit_lo & PCI_MEMORY_RANGE_MASK) << 16; - if (base && base <= limit) { + if (base <= limit) { res->flags = (mem_base_lo & PCI_MEMORY_RANGE_TYPE_MASK) | IORESOURCE_MEM; res->start = base; res->end = limit + 0xfffff; @@ -213,7 +213,7 @@ } #endif } - if (base && base <= limit) { + if (base <= limit) { res->flags = (mem_base_lo & PCI_MEMORY_RANGE_TYPE_MASK) | IORESOURCE_MEM | IORESOURCE_PREFETCH; res->start = base; res->end = limit + 0xfffff; @@ -552,6 +552,7 @@ struct pci_dev *dev; dev = pci_scan_device(bus, devfn); + pci_scan_msi_device(dev); if (func == 0) { if (!dev) break; --- diff/drivers/pci/proc.c 2003-08-20 14:16:31.000000000 +0100 +++ source/drivers/pci/proc.c 2003-11-26 10:09:06.000000000 +0000 @@ -25,7 +25,7 @@ { loff_t new = -1; - lock_kernel(); + down(&file->f_dentry->d_inode->i_sem); switch (whence) { case 0: new = off; @@ -37,10 +37,12 @@ new = PCI_CFG_SPACE_SIZE + off; break; } - unlock_kernel(); if (new < 0 || new > PCI_CFG_SPACE_SIZE) - return -EINVAL; - return (file->f_pos = new); + new = -EINVAL; + else + file->f_pos = new; + up(&file->f_dentry->d_inode->i_sem); + return new; } static ssize_t --- diff/drivers/pci/remove.c 2003-08-20 14:16:15.000000000 +0100 +++ source/drivers/pci/remove.c 
2003-11-26 10:09:06.000000000 +0000 @@ -14,6 +14,8 @@ { int i; + msi_remove_pci_irq_vectors(dev); + for (i = 0; i < PCI_NUM_RESOURCES; i++) { struct resource *res = dev->resource + i; if (res->parent) --- diff/drivers/pcmcia/i82365.c 2003-10-09 09:47:34.000000000 +0100 +++ source/drivers/pcmcia/i82365.c 2003-11-26 10:09:06.000000000 +0000 @@ -1211,6 +1211,7 @@ return 0; } /* i365_set_mem_map */ +#if 0 /* driver model ordering issue */ /*====================================================================== Routines for accessing socket information and register dumps via @@ -1250,6 +1251,7 @@ static CLASS_DEVICE_ATTR(exca, S_IRUGO, show_exca, NULL); static CLASS_DEVICE_ATTR(info, S_IRUGO, show_info, NULL); +#endif /*====================================================================*/ @@ -1414,10 +1416,12 @@ pcmcia_unregister_socket(&socket[i].socket); break; } +#if 0 /* driver model ordering issue */ class_device_create_file(&socket[i].socket.dev, &class_device_attr_info); class_device_create_file(&socket[i].socket.dev, &class_device_attr_exca); +#endif } /* Finally, schedule a polling interrupt */ --- diff/drivers/pnp/Kconfig 2003-08-20 14:16:31.000000000 +0100 +++ source/drivers/pnp/Kconfig 2003-11-26 10:09:06.000000000 +0000 @@ -30,33 +30,9 @@ comment "Protocols" depends on PNP -config ISAPNP - bool "ISA Plug and Play support (EXPERIMENTAL)" - depends on PNP && EXPERIMENTAL - help - Say Y here if you would like support for ISA Plug and Play devices. - Some information is in <file:Documentation/isapnp.txt>. +source "drivers/pnp/isapnp/Kconfig" - If unsure, say Y. - -config PNPBIOS - bool "Plug and Play BIOS support (EXPERIMENTAL)" - depends on PNP && EXPERIMENTAL - ---help--- - Linux uses the PNPBIOS as defined in "Plug and Play BIOS - Specification Version 1.0A May 5, 1994" to autodetect built-in - mainboard resources (e.g. parallel port resources). - - Some features (e.g. event notification, docking station information, - ISAPNP services) are not used. - - Note: ACPI is expected to supersede PNPBIOS some day, currently it - co-exists nicely. - - See latest pcmcia-cs (stand-alone package) for a nice "lspnp" tools, - or have a look at /proc/bus/pnp. - - If unsure, say Y. +source "drivers/pnp/pnpbios/Kconfig" endmenu --- diff/drivers/pnp/isapnp/core.c 2003-09-30 15:46:16.000000000 +0100 +++ source/drivers/pnp/isapnp/core.c 2003-11-26 10:09:06.000000000 +0000 @@ -890,11 +890,9 @@ header[4], header[5], header[6], header[7], header[8]); printk(KERN_DEBUG "checksum = 0x%x\n", checksum); #endif - /* Don't be strict on the checksum, here ! - e.g. 'SCM SwapBox Plug and Play' has header[8]==0 (should be: b7)*/ - if (header[8] == 0) - ; - else if (checksum == 0x00 || checksum != header[8]) /* not valid CSN */ + /* Per Section 6.1 of the Plug and Play ISA Specification (Version 1.0a), */ + /* Bit[7] of Vendor ID Byte 0 must be 0 */ + if (header[0] & 0x80) /* not valid CSN */ continue; if ((card = isapnp_alloc(sizeof(struct pnp_card))) == NULL) continue; --- diff/drivers/pnp/pnpbios/Makefile 2003-08-20 14:16:31.000000000 +0100 +++ source/drivers/pnp/pnpbios/Makefile 2003-11-26 10:09:06.000000000 +0000 @@ -2,6 +2,6 @@ # Makefile for the kernel PNPBIOS driver. 
# -pnpbios-proc-$(CONFIG_PROC_FS) = proc.o +pnpbios-proc-$(CONFIG_PNPBIOS_PROC_FS) = proc.o obj-y := core.o bioscalls.o rsparser.o $(pnpbios-proc-y) --- diff/drivers/pnp/pnpbios/core.c 2003-09-30 15:46:16.000000000 +0100 +++ source/drivers/pnp/pnpbios/core.c 2003-11-26 10:09:06.000000000 +0000 @@ -353,16 +353,8 @@ for(nodenum=0; nodenum<0xff; ) { u8 thisnodenum = nodenum; - /* eventually we will want to use PNPMODE_STATIC here but for now - * dynamic will help us catch buggy bioses to add to the blacklist. - */ - if (!pnpbios_dont_use_current_config) { - if (pnp_bios_get_dev_node(&nodenum, (char )PNPMODE_DYNAMIC, node)) - break; - } else { - if (pnp_bios_get_dev_node(&nodenum, (char )PNPMODE_STATIC, node)) - break; - } + if (pnp_bios_get_dev_node(&nodenum, (char )PNPMODE_STATIC, node)) + break; nodes_got++; dev = pnpbios_kmalloc(sizeof (struct pnp_dev), GFP_KERNEL); if (!dev) --- diff/drivers/pnp/pnpbios/pnpbios.h 2003-09-30 15:46:16.000000000 +0100 +++ source/drivers/pnp/pnpbios/pnpbios.h 2003-11-26 10:09:06.000000000 +0000 @@ -36,7 +36,7 @@ extern void pnpbios_print_status(const char * module, u16 status); extern void pnpbios_calls_init(union pnp_bios_install_struct * header); -#ifdef CONFIG_PROC_FS +#ifdef CONFIG_PNPBIOS_PROC_FS extern int pnpbios_interface_attach_device(struct pnp_bios_node * node); extern int pnpbios_proc_init (void); extern void pnpbios_proc_exit (void); @@ -44,4 +44,4 @@ static inline int pnpbios_interface_attach_device(struct pnp_bios_node * node) { return 0; } static inline int pnpbios_proc_init (void) { return 0; } static inline void pnpbios_proc_exit (void) { ; } -#endif /* CONFIG_PROC */ +#endif /* CONFIG_PNPBIOS_PROC_FS */ --- diff/drivers/s390/block/dasd.c 2003-10-09 09:47:34.000000000 +0100 +++ source/drivers/s390/block/dasd.c 2003-11-26 10:09:06.000000000 +0000 @@ -1643,9 +1643,9 @@ } static int -dasd_open(struct inode *inp, struct file *filp) +dasd_open(struct block_device *bdev, struct file *filp) { - struct gendisk *disk = inp->i_bdev->bd_disk; + struct gendisk *disk = bdev->bd_disk; struct dasd_device *device = disk->private_data; int rc; @@ -1676,10 +1676,8 @@ return rc; } -static int -dasd_release(struct inode *inp, struct file *filp) +static int dasd_release(struct gendisk *disk) { - struct gendisk *disk = inp->i_bdev->bd_disk; struct dasd_device *device = disk->private_data; if (device->state < DASD_STATE_BASIC) { --- diff/drivers/s390/block/dasd_int.h 2003-10-09 09:47:34.000000000 +0100 +++ source/drivers/s390/block/dasd_int.h 2003-11-26 10:09:06.000000000 +0000 @@ -493,7 +493,7 @@ void dasd_ioctl_exit(void); int dasd_ioctl_no_register(struct module *, int, dasd_ioctl_fn_t); int dasd_ioctl_no_unregister(struct module *, int, dasd_ioctl_fn_t); -int dasd_ioctl(struct inode *, struct file *, unsigned int, unsigned long); +int dasd_ioctl(struct block_device *, struct file *, unsigned int, unsigned long); /* externals in dasd_proc.c */ int dasd_proc_init(void); --- diff/drivers/s390/block/dasd_ioctl.c 2003-09-30 15:46:16.000000000 +0100 +++ source/drivers/s390/block/dasd_ioctl.c 2003-11-26 10:09:06.000000000 +0000 @@ -78,10 +78,9 @@ } int -dasd_ioctl(struct inode *inp, struct file *filp, +dasd_ioctl(struct block_device *bdev, struct file *filp, unsigned int no, unsigned long data) { - struct block_device *bdev = inp->i_bdev; struct dasd_device *device = bdev->bd_disk->private_data; struct dasd_ioctl *ioctl; const char *dir; --- diff/drivers/s390/block/xpram.c 2003-09-30 15:46:16.000000000 +0100 +++ source/drivers/s390/block/xpram.c 2003-11-26 
10:09:06.000000000 +0000
@@ -328,7 +328,7 @@
 	return 0;
 }
 
-static int xpram_ioctl (struct inode *inode, struct file *filp,
+static int xpram_ioctl (struct block_device *bdev, struct file *filp,
 		 unsigned int cmd, unsigned long arg)
 {
 	struct hd_geometry *geo;
--- diff/drivers/s390/char/tape_block.c	2003-09-30 15:46:16.000000000 +0100
+++ source/drivers/s390/char/tape_block.c	2003-11-26 10:09:06.000000000 +0000
@@ -27,8 +27,8 @@
 /*
  * file operation structure for tape block frontend
  */
-static int tapeblock_open(struct inode *, struct file *);
-static int tapeblock_release(struct inode *, struct file *);
+static int tapeblock_open(struct block_device *, struct file *);
+static int tapeblock_release(struct gendisk *);
 
 static struct block_device_operations tapeblock_fops = {
 	.owner		= THIS_MODULE,
@@ -299,9 +299,9 @@
  * Block frontend tape device open function.
  */
 static int
-tapeblock_open(struct inode *inode, struct file *filp)
+tapeblock_open(struct block_device *bdev, struct file *filp)
 {
-	struct gendisk *disk = inode->i_bdev->bd_disk;
+	struct gendisk *disk = bdev->bd_disk;
 	struct tape_device *device = disk->private_data;
 	int rc;
 
@@ -336,9 +336,8 @@
  * Block frontend tape device release function.
  */
 static int
-tapeblock_release(struct inode *inode, struct file *filp)
+tapeblock_release(struct gendisk *disk)
 {
-	struct gendisk *disk = inode->i_bdev->bd_disk;
 	struct tape_device *device = disk->private_data;
 
 	tape_release(device);
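
The s390 conversions above, like the ide-scsi one further down, all retarget block drivers at the experimental prototypes used in this -mm series: open() now receives the struct block_device directly, and release() sees only the gendisk, so neither needs an inode. A minimal sketch of a driver written against these signatures -- illustrative only, not part of the patch, with a hypothetical foo_device:

	#include <linux/module.h>
	#include <linux/blkdev.h>
	#include <linux/genhd.h>

	struct foo_device {
		int usage;		/* hypothetical per-device state */
	};

	static int foo_open(struct block_device *bdev, struct file *filp)
	{
		/* the gendisk, not an inode, carries the driver context */
		struct foo_device *foo = bdev->bd_disk->private_data;

		foo->usage++;
		return 0;
	}

	static int foo_release(struct gendisk *disk)
	{
		/* no inode or file here -- last close is per-disk */
		struct foo_device *foo = disk->private_data;

		foo->usage--;
		return 0;
	}

	static struct block_device_operations foo_fops = {
		.owner   = THIS_MODULE,
		.open    = foo_open,
		.release = foo_release,
	};
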
--- diff/drivers/scsi/BusLogic.c	2003-09-30 15:46:17.000000000 +0100
+++ source/drivers/scsi/BusLogic.c	2003-11-26 10:09:06.000000000 +0000
@@ -3462,6 +3462,18 @@
 	return false;
 }
 
+/* Error Handling (EH) support */
+
+static int BusLogic_host_reset(Scsi_Cmnd *SCpnt)
+{
+	BusLogic_HostAdapter_T *HostAdapter =
+		(BusLogic_HostAdapter_T *) SCpnt->device->host->hostdata;
+
+	/* printk("BusLogic_host_reset\n"); */
+	HostAdapter->HostAdapterExternalReset = 1;
+	BusLogic_ResetHostAdapter(HostAdapter, NULL, 0);
+	return SUCCESS;
+}
 
 /*
   BusLogic_QueueCommand creates a CCB for Command and places it into an
@@ -4589,220 +4601,8 @@
   adapters whereas the remaining options apply individually only to the
   selected host adapter.
 
-  The BusLogic Driver Probing Options comprise the following:
-
-  IO:<integer>
-
-    The "IO:" option specifies an ISA I/O Address to be probed for a non-PCI
-    MultiMaster Host Adapter.  If neither "IO:" nor "NoProbeISA" options are
-    specified, then the standard list of BusLogic MultiMaster ISA I/O Addresses
-    will be probed (0x330, 0x334, 0x230, 0x234, 0x130, and 0x134).  Multiple
-    "IO:" options may be specified to precisely determine the I/O Addresses to
-    be probed, but the probe order will always follow the standard list.
-
-  NoProbe
-
-    The "NoProbe" option disables all probing and therefore no BusLogic Host
-    Adapters will be detected.
-
-  NoProbeISA
-
-    The "NoProbeISA" option disables probing of the standard BusLogic ISA I/O
-    Addresses and therefore only PCI MultiMaster and FlashPoint Host Adapters
-    will be detected.
-
-  NoProbePCI
-
-    The "NoProbePCI" options disables the interrogation of PCI Configuration
-    Space and therefore only ISA Multimaster Host Adapters will be detected, as
-    well as PCI Multimaster Host Adapters that have their ISA Compatible I/O
-    Port set to "Primary" or "Alternate".
-
-  NoSortPCI
-
-    The "NoSortPCI" option forces PCI MultiMaster Host Adapters to be
-    enumerated in the order provided by the PCI BIOS, ignoring any setting of
-    the AutoSCSI "Use Bus And Device # For PCI Scanning Seq." option.
-
-  MultiMasterFirst
-
-    The "MultiMasterFirst" option forces MultiMaster Host Adapters to be probed
-    before FlashPoint Host Adapters.  By default, if both FlashPoint and PCI
-    MultiMaster Host Adapters are present, this driver will probe for
-    FlashPoint Host Adapters first unless the BIOS primary disk is controlled
-    by the first PCI MultiMaster Host Adapter, in which case MultiMaster Host
-    Adapters will be probed first.
-
-  FlashPointFirst
-
-    The "FlashPointFirst" option forces FlashPoint Host Adapters to be probed
-    before MultiMaster Host Adapters.
-
-  The BusLogic Driver Tagged Queuing Options allow for explicitly specifying
-  the Queue Depth and whether Tagged Queuing is permitted for each Target
-  Device (assuming that the Target Device supports Tagged Queuing).  The Queue
-  Depth is the number of SCSI Commands that are allowed to be concurrently
-  presented for execution (either to the Host Adapter or Target Device).  Note
-  that explicitly enabling Tagged Queuing may lead to problems; the option to
-  enable or disable Tagged Queuing is provided primarily to allow disabling
-  Tagged Queuing on Target Devices that do not implement it correctly.  The
-  following options are available:
-
-  QueueDepth:<integer>
-
-    The "QueueDepth:" or QD:" option specifies the Queue Depth to use for all
-    Target Devices that support Tagged Queuing, as well as the maximum Queue
-    Depth for devices that do not support Tagged Queuing.  If no Queue Depth
-    option is provided, the Queue Depth will be determined automatically based
-    on the Host Adapter's Total Queue Depth and the number, type, speed, and
-    capabilities of the detected Target Devices.  For Host Adapters that
-    require ISA Bounce Buffers, the Queue Depth is automatically set by default
-    to BusLogic_TaggedQueueDepthBB or BusLogic_UntaggedQueueDepthBB to avoid
-    excessive preallocation of DMA Bounce Buffer memory.  Target Devices that
-    do not support Tagged Queuing always have their Queue Depth set to
-    BusLogic_UntaggedQueueDepth or BusLogic_UntaggedQueueDepthBB, unless a
-    lower Queue Depth option is provided.  A Queue Depth of 1 automatically
-    disables Tagged Queuing.
-
-  QueueDepth:[<integer>,<integer>...]
-
-    The "QueueDepth:[...]" or "QD:[...]" option specifies the Queue Depth
-    individually for each Target Device.  If an <integer> is omitted, the
-    associated Target Device will have its Queue Depth selected automatically.
-
-  TaggedQueuing:Default
-
-    The "TaggedQueuing:Default" or "TQ:Default" option permits Tagged Queuing
-    based on the firmware version of the BusLogic Host Adapter and based on
-    whether the Queue Depth allows queuing multiple commands.
-
-  TaggedQueuing:Enable
-
-    The "TaggedQueuing:Enable" or "TQ:Enable" option enables Tagged Queuing for
-    all Target Devices on this Host Adapter, overriding any limitation that
-    would otherwise be imposed based on the Host Adapter firmware version.
-
-  TaggedQueuing:Disable
-
-    The "TaggedQueuing:Disable" or "TQ:Disable" option disables Tagged Queuing
-    for all Target Devices on this Host Adapter.
-
-  TaggedQueuing:<Target-Spec>
-
-    The "TaggedQueuing:<Target-Spec>" or "TQ:<Target-Spec>" option controls
-    Tagged Queuing individually for each Target Device.  <Target-Spec> is a
-    sequence of "Y", "N", and "X" characters.  "Y" enables Tagged Queuing, "N"
-    disables Tagged Queuing, and "X" accepts the default based on the firmware
-    version.
The first character refers to Target Device 0, the second to - Target Device 1, and so on; if the sequence of "Y", "N", and "X" characters - does not cover all the Target Devices, unspecified characters are assumed - to be "X". - - The BusLogic Driver Error Recovery Option allows for explicitly specifying - the Error Recovery action to be performed when BusLogic_ResetCommand is - called due to a SCSI Command failing to complete successfully. The following - options are available: - - ErrorRecovery:Default - - The "ErrorRecovery:Default" or "ER:Default" option selects between the Hard - Reset and Bus Device Reset options based on the recommendation of the SCSI - Subsystem. - - ErrorRecovery:HardReset - - The "ErrorRecovery:HardReset" or "ER:HardReset" option will initiate a Host - Adapter Hard Reset which also causes a SCSI Bus Reset. - - ErrorRecovery:BusDeviceReset - - The "ErrorRecovery:BusDeviceReset" or "ER:BusDeviceReset" option will send - a Bus Device Reset message to the individual Target Device causing the - error. If Error Recovery is again initiated for this Target Device and no - SCSI Command to this Target Device has completed successfully since the Bus - Device Reset message was sent, then a Hard Reset will be attempted. - - ErrorRecovery:None - - The "ErrorRecovery:None" or "ER:None" option suppresses Error Recovery. - This option should only be selected if a SCSI Bus Reset or Bus Device Reset - will cause the Target Device or a critical operation to suffer a complete - and unrecoverable failure. - - ErrorRecovery:<Target-Spec> - - The "ErrorRecovery:<Target-Spec>" or "ER:<Target-Spec>" option controls - Error Recovery individually for each Target Device. <Target-Spec> is a - sequence of "D", "H", "B", and "N" characters. "D" selects Default, "H" - selects Hard Reset, "B" selects Bus Device Reset, and "N" selects None. - The first character refers to Target Device 0, the second to Target Device - 1, and so on; if the sequence of "D", "H", "B", and "N" characters does not - cover all the possible Target Devices, unspecified characters are assumed - to be "D". - - The BusLogic Driver Miscellaneous Options comprise the following: - - BusSettleTime:<seconds> - - The "BusSettleTime:" or "BST:" option specifies the Bus Settle Time in - seconds. The Bus Settle Time is the amount of time to wait between a Host - Adapter Hard Reset which initiates a SCSI Bus Reset and issuing any SCSI - Commands. If unspecified, it defaults to BusLogic_DefaultBusSettleTime. - - InhibitTargetInquiry - - The "InhibitTargetInquiry" option inhibits the execution of an Inquire - Target Devices or Inquire Installed Devices command on MultiMaster Host - Adapters. This may be necessary with some older Target Devices that do not - respond correctly when Logical Units above 0 are addressed. - - The BusLogic Driver Debugging Options comprise the following: - - TraceProbe - - The "TraceProbe" option enables tracing of Host Adapter Probing. - - TraceHardwareReset - - The "TraceHardwareReset" option enables tracing of Host Adapter Hardware - Reset. - - TraceConfiguration - - The "TraceConfiguration" option enables tracing of Host Adapter - Configuration. - - TraceErrors - - The "TraceErrors" option enables tracing of SCSI Commands that return an - error from the Target Device. The CDB and Sense Data will be printed for - each SCSI Command that fails. - - Debug - - The "Debug" option enables all debugging options. 
-
-  The following examples demonstrate setting the Queue Depth for Target Devices
-  1 and 2 on the first host adapter to 7 and 15, the Queue Depth for all Target
-  Devices on the second host adapter to 31, and the Bus Settle Time on the
-  second host adapter to 30 seconds.
-
-  Linux Kernel Command Line:
-
-    linux BusLogic=QueueDepth:[,7,15];QueueDepth:31,BusSettleTime:30
-
-  LILO Linux Boot Loader (in /etc/lilo.conf):
-
-    append = "BusLogic=QueueDepth:[,7,15];QueueDepth:31,BusSettleTime:30"
-
-  INSMOD Loadable Kernel Module Installation Facility:
-
-    insmod BusLogic.o \
-      'BusLogic="QueueDepth:[,7,15];QueueDepth:31,BusSettleTime:30"'
-
-  NOTE: Module Utilities 2.1.71 or later is required for correct parsing
-        of driver options containing commas.
-
+  The BusLogic Driver Options are described in
+  <file:Documentation/scsi/BusLogic.txt>.
 */
 
 static int __init BusLogic_ParseDriverOptions(char *OptionsString)
@@ -5126,6 +4926,7 @@
 	.queuecommand		= BusLogic_QueueCommand,
 	.slave_configure	= BusLogic_SlaveConfigure,
 	.bios_param		= BusLogic_BIOSDiskParameters,
+	.eh_host_reset_handler	= BusLogic_host_reset,
 	.unchecked_isa_dma	= 1,
 	.max_sectors		= 128,
 	.use_clustering		= ENABLE_CLUSTERING,
--- diff/drivers/scsi/Kconfig	2003-11-25 15:24:58.000000000 +0000
+++ source/drivers/scsi/Kconfig	2003-11-26 10:09:06.000000000 +0000
@@ -55,6 +55,14 @@
 	  In this case, do not compile the driver for your SCSI host adapter
 	  (below) as a module either.
 
+config MAX_SD_DISKS
+	int "Maximum number of SCSI disks to support (256-8192)"
+	depends on BLK_DEV_SD
+	default "256"
+	help
+	  The maximum number of SCSI disks to support.  Default is 256.
+	  Change this value if you want the kernel to support lots of SCSI devices.
+
 config CHR_DEV_ST
 	tristate "SCSI tape support"
 	depends on SCSI
@@ -911,37 +919,34 @@
 	depends on SCSI_SYM53C8XX_2
 	default "1"
 	---help---
-	  This option only applies to PCI-SCSI chip that are PCI DAC capable
-	  (875A, 895A, 896, 1010-33, 1010-66, 1000).
+	  This option only applies to PCI-SCSI chips that are PCI DAC
+	  capable (875A, 895A, 896, 1010-33, 1010-66, 1000).
 
-	  When set to 0, only PCI 32 bit DMA addressing (SAC) will be performed.
-	  When set to 1, 40 bit DMA addressing (with upper 24 bits of address
-	  set to zero) is supported. The addressable range is here 1 TB.
-	  When set to 2, full 64 bits of address for DMA are supported, but only
-	  16 segments of 4 GB can be addressed. The addressable range is so
-	  limited to 64 GB.
-
-	  The safest value is 0 (32 bit DMA addressing) that is guessed to still
-	  fit most of real machines.
-
-	  The preferred value 1 (40 bit DMA addressing) should make happy
-	  properly engineered PCI DAC capable host bridges. You may configure
-	  this option for Intel platforms with more than 4 GB of memory.
-
-	  The still experimental value 2 (64 bit DMA addressing with 16 x 4GB
-	  segments limitation) can be used on systems that require PCI address
-	  bits past bit 39 to be set for the addressing of memory using PCI
-	  DAC cycles.
+	  When set to 0, the driver will program the chip to only perform
+	  32-bit DMA.  When set to 1, the chip will be able to perform DMA
+	  to addresses up to 1TB.  When set to 2, the driver supports the
+	  full 64-bit DMA address range, but can only address 16 segments
+	  of 4 GB each.  This limits the total addressable range to 64 GB.
+
+	  Most machines with less than 4GB of memory should use a setting
+	  of 0 for best performance.  If your machine has 4GB of memory
+	  or more, you should set this option to 1 (the default).
+
+	  The still experimental value 2 (64 bit DMA addressing with 16
+	  x 4GB segments limitation) can be used on systems that require
+	  PCI address bits past bit 39 to be set for the addressing of
+	  memory using PCI DAC cycles.
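
As a rough illustration of what the three settings above mean at the PCI layer -- this sketch is not from the sym53c8xx driver; sym_setup_dma_mask and its fallback ordering are hypothetical -- the configured mode typically ends up selecting a DMA mask at probe time:

	#include <linux/pci.h>

	/* map the Kconfig value to a PCI DMA mask; returns 0 on success */
	static int sym_setup_dma_mask(struct pci_dev *pdev, int addressing_mode)
	{
		switch (addressing_mode) {
		case 2:	/* full 64 bit, but only 16 x 4GB segments */
			if (pci_set_dma_mask(pdev, 0xffffffffffffffffULL) == 0)
				return 0;
			/* fall through and retry with 40 bits */
		case 1:	/* 40 bit DAC, 1TB addressable */
			if (pci_set_dma_mask(pdev, 0x000000ffffffffffULL) == 0)
				return 0;
			/* fall through and retry with 32 bits */
		case 0:	/* 32 bit single address cycle only */
		default:
			return pci_set_dma_mask(pdev, 0x00000000ffffffffULL);
		}
	}
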
 
 config SCSI_SYM53C8XX_DEFAULT_TAGS
 	int "default tagged command queue depth"
 	depends on SCSI_SYM53C8XX_2
 	default "16"
 	help
-	  This is the default value of the command queue depth the driver will
-	  announce to the generic SCSI layer for devices that support tagged
-	  command queueing. This value can be changed from the boot command line.
-	  This is a soft limit that cannot exceed CONFIG_SCSI_SYM53C8XX_MAX_TAGS.
+	  This is the default value of the command queue depth the
+	  driver will announce to the generic SCSI layer for devices
+	  that support tagged command queueing.  This value can be changed
+	  from the boot command line.  This is a soft limit that cannot
+	  exceed CONFIG_SCSI_SYM53C8XX_MAX_TAGS.
 
 config SCSI_SYM53C8XX_MAX_TAGS
 	int "maximum number of queued commands"
@@ -954,11 +959,12 @@
 	  This value is used as a compiled-in hard limit.
 
 config SCSI_SYM53C8XX_IOMAPPED
-	bool "use normal IO"
+	bool "use port IO"
 	depends on SCSI_SYM53C8XX_2
 	help
-	  If you say Y here, the driver will preferently use normal IO rather than
-	  memory mapped IO.
+	  If you say Y here, the driver will use port IO to access
+	  the card.  This is significantly slower than using memory
+	  mapped IO.  Most people should answer N.
 
 config SCSI_ZALON
 	tristate "Zalon SCSI support"
--- diff/drivers/scsi/aic7xxx/Makefile	2003-10-09 09:47:16.000000000 +0100
+++ source/drivers/scsi/aic7xxx/Makefile	2003-11-26 10:09:06.000000000 +0000
@@ -58,7 +58,9 @@
 			-p $(obj)/aic7xxx_reg_print.c -i aic7xxx_osm.h
 
 ifeq ($(CONFIG_AIC7XXX_BUILD_FIRMWARE),y)
-$(aic7xxx-gen-y): $(src)/aic7xxx.seq $(src)/aic7xxx.reg $(obj)/aicasm/aicasm
+$(aic7xxx-gen-y): $(src)/aic7xxx.seq
+
+$(src)/aic7xxx.seq: $(obj)/aicasm/aicasm $(src)/aic7xxx.reg
 	$(obj)/aicasm/aicasm -I$(src) -r $(obj)/aic7xxx_reg.h \
 			     $(aicasm-7xxx-opts-y) -o $(obj)/aic7xxx_seq.h \
 			     $(src)/aic7xxx.seq
@@ -72,7 +74,9 @@
 			     -p $(obj)/aic79xx_reg_print.c -i aic79xx_osm.h
 
 ifeq ($(CONFIG_AIC79XX_BUILD_FIRMWARE),y)
-$(aic79xx-gen-y): $(src)/aic79xx.seq $(src)/aic79xx.reg $(obj)/aicasm/aicasm
+$(aic79xx-gen-y): $(src)/aic79xx.seq
+
+$(src)/aic79xx.seq: $(obj)/aicasm/aicasm $(src)/aic79xx.reg
 	$(obj)/aicasm/aicasm -I$(src) -r $(obj)/aic79xx_reg.h \
 			     $(aicasm-79xx-opts-y) -o $(obj)/aic79xx_seq.h \
 			     $(src)/aic79xx.seq
--- diff/drivers/scsi/aic7xxx/aic7xxx_osm_pci.c	2003-09-17 12:28:09.000000000 +0100
+++ source/drivers/scsi/aic7xxx/aic7xxx_osm_pci.c	2003-11-26 10:09:06.000000000 +0000
@@ -100,9 +100,10 @@
 		ahc_lock(ahc, &s);
 		ahc_intr_enable(ahc, FALSE);
 		ahc_unlock(ahc, &s);
-		ahc_free(ahc);
 	}
 	ahc_list_unlock(&l);
+	if (ahc)
+		ahc_free(ahc);
 }
 #endif /* !LINUX_VERSION_CODE < KERNEL_VERSION(2,4,0) */
 
--- diff/drivers/scsi/aic7xxx/aicasm/Makefile	2003-10-09 09:47:16.000000000 +0100
+++ source/drivers/scsi/aic7xxx/aicasm/Makefile	2003-11-26 10:09:06.000000000 +0000
@@ -49,14 +49,18 @@
 clean:
 	rm -f $(clean-files)
 
-aicasm_gram.c aicasm_gram.h: aicasm_gram.y
+aicasm_gram.c: aicasm_gram.h
+	mv $(<:.h=).tab.c $(<:.h=.c)
+
+aicasm_gram.h: aicasm_gram.y
 	$(YACC) $(YFLAGS) -b $(<:.y=) $<
-	mv $(<:.y=).tab.c $(<:.y=.c)
 	mv $(<:.y=).tab.h $(<:.y=.h)
 
-aicasm_macro_gram.c aicasm_macro_gram.h: aicasm_macro_gram.y
+aicasm_macro_gram.c: aicasm_macro_gram.h
+	mv $(<:.h=).tab.c $(<:.h=.c)
+
+aicasm_macro_gram.h: aicasm_macro_gram.y
 	$(YACC) $(YFLAGS) -b $(<:.y=) -p mm $<
-	mv $(<:.y=).tab.c $(<:.y=.c)
 	mv $(<:.y=).tab.h
$(<:.y=.h) aicasm_scan.c: aicasm_scan.l --- diff/drivers/scsi/aic7xxx_old/aic7xxx_proc.c 2003-10-09 09:47:16.000000000 +0100 +++ source/drivers/scsi/aic7xxx_old/aic7xxx_proc.c 2003-11-26 10:09:06.000000000 +0000 @@ -90,9 +90,7 @@ unsigned char i; unsigned char tindex; - HBAptr = NULL; - - for(p=first_aic7xxx; p->host != HBAptr; p=p->next) + for(p=first_aic7xxx; p && p->host != HBAptr; p=p->next) ; if (!p) --- diff/drivers/scsi/ide-scsi.c 2003-11-25 15:24:58.000000000 +0000 +++ source/drivers/scsi/ide-scsi.c 2003-11-26 10:09:06.000000000 +0000 @@ -635,24 +635,23 @@ .drives = LIST_HEAD_INIT(idescsi_driver.drives), }; -static int idescsi_ide_open(struct inode *inode, struct file *filp) +static int idescsi_ide_open(struct block_device *bdev, struct file *filp) { - ide_drive_t *drive = inode->i_bdev->bd_disk->private_data; + ide_drive_t *drive = bdev->bd_disk->private_data; drive->usage++; return 0; } -static int idescsi_ide_release(struct inode *inode, struct file *filp) +static int idescsi_ide_release(struct gendisk *disk) { - ide_drive_t *drive = inode->i_bdev->bd_disk->private_data; + ide_drive_t *drive = disk->private_data; drive->usage--; return 0; } -static int idescsi_ide_ioctl(struct inode *inode, struct file *file, +static int idescsi_ide_ioctl(struct block_device *bdev, struct file *file, unsigned int cmd, unsigned long arg) { - struct block_device *bdev = inode->i_bdev; return generic_ide_ioctl(bdev, cmd, arg); } --- diff/drivers/scsi/libata-core.c 2003-11-25 15:24:58.000000000 +0000 +++ source/drivers/scsi/libata-core.c 2003-11-26 10:09:06.000000000 +0000 @@ -133,7 +133,11 @@ struct ata_ioports *ioaddr = &ap->ioaddr; unsigned int is_addr = tf->flags & ATA_TFLAG_ISADDR; - outb(tf->ctl, ioaddr->ctl_addr); + if (tf->ctl != ap->last_ctl) { + outb(tf->ctl, ioaddr->ctl_addr); + ap->last_ctl = tf->ctl; + ata_wait_idle(ap); + } if (is_addr && (tf->flags & ATA_TFLAG_LBA48)) { outb(tf->hob_feature, ioaddr->error_addr); @@ -187,7 +191,11 @@ struct ata_ioports *ioaddr = &ap->ioaddr; unsigned int is_addr = tf->flags & ATA_TFLAG_ISADDR; - writeb(tf->ctl, ap->ioaddr.ctl_addr); + if (tf->ctl != ap->last_ctl) { + writeb(tf->ctl, ap->ioaddr.ctl_addr); + ap->last_ctl = tf->ctl; + ata_wait_idle(ap); + } if (is_addr && (tf->flags & ATA_TFLAG_LBA48)) { writeb(tf->hob_feature, (void *) ioaddr->error_addr); @@ -1281,9 +1289,9 @@ /* software reset. 
causes dev0 to be selected */
 	if (ap->flags & ATA_FLAG_MMIO) {
 		writeb(ap->ctl, ioaddr->ctl_addr);
-		udelay(10);	/* FIXME: flush */
+		udelay(20);	/* FIXME: flush */
 		writeb(ap->ctl | ATA_SRST, ioaddr->ctl_addr);
-		udelay(10);	/* FIXME: flush */
+		udelay(20);	/* FIXME: flush */
 		writeb(ap->ctl, ioaddr->ctl_addr);
 	} else {
 		outb(ap->ctl, ioaddr->ctl_addr);
@@ -2755,6 +2763,7 @@
 	ap->cbl = ATA_CBL_NONE;
 	ap->device[0].flags = ATA_DFLAG_MASTER;
 	ap->active_tag = ATA_TAG_POISON;
+	ap->last_ctl = 0xFF;
 
 	/* ata_engine init */
 	ap->eng.flags = 0;
--- diff/drivers/scsi/libata.h	2003-11-25 15:24:58.000000000 +0000
+++ source/drivers/scsi/libata.h	2003-11-26 10:09:06.000000000 +0000
@@ -26,7 +26,7 @@
 #define __LIBATA_H__
 
 #define DRV_NAME	"libata"
-#define DRV_VERSION	"0.80"	/* must be exactly four chars */
+#define DRV_VERSION	"0.81"	/* must be exactly four chars */
 
 struct ata_scsi_args {
 	struct ata_port		*ap;
--- diff/drivers/scsi/qla1280.c	2003-10-09 09:47:34.000000000 +0100
+++ source/drivers/scsi/qla1280.c	2003-11-26 10:09:06.000000000 +0000
@@ -4,6 +4,7 @@
 * QLogic QLA1280 (Ultra2) and QLA12160 (Ultra3) SCSI driver
 * Copyright (C) 2000 Qlogic Corporation (www.qlogic.com)
 * Copyright (C) 2001-2003 Jes Sorensen, Wild Open Source Inc.
+* Copyright (C) 2003 Christoph Hellwig
 *
 * This program is free software; you can redistribute it and/or modify it
 * under the terms of the GNU General Public License as published by the
@@ -16,9 +17,13 @@
 * General Public License for more details.
 *
 ******************************************************************************/
-#define QLA1280_VERSION      "3.23.37"
+#define QLA1280_VERSION      "3.23.38"
 /*****************************************************************************
     Revision History:
+    Rev  3.23.38 October 18, 2003, Christoph Hellwig
+	- Convert to new-style hotpluggable driver for 2.6
+	- Fix missing scsi_unregister/scsi_host_put on HBA removal
+	- Kill some of the cruft
     Rev  3.23.37 October 1, 2003, Jes Sorensen
 	- Make MMIO depend on CONFIG_X86_VISWS instead of yet another
 	  random CONFIG option
@@ -337,9 +342,6 @@
  *  Compile time Options:
  *	    0 - Disable and 1 - Enable
  */
-#define  QL1280_LUN_SUPPORT	0
-#define  WATCHDOGTIMER		0
-
 #define  DEBUG_QLA1280_INTR	0
 #define  DEBUG_PRINT_NVRAM	0
 #define  DEBUG_QLA1280		0
@@ -419,6 +421,14 @@
 	}
 	device->queue_depth = depth;
 }
+static inline struct Scsi_Host *scsi_host_alloc(Scsi_Host_Template *t, size_t s)
+{
+	return scsi_register(t, s);
+}
+static inline void scsi_host_put(struct Scsi_Host *h)
+{
+	scsi_unregister(h);
+}
 #else
 #define HOST_LOCK			ha->host->host_lock
 #endif
@@ -431,30 +441,23 @@
 #define ia64_platform_is(foo)	(!strcmp(x, platform_name))
 #endif
 
+static int qla1280_probe_one(struct pci_dev *, const struct pci_device_id *);
+static void qla1280_remove_one(struct pci_dev *);
+
 /*
  *  QLogic Driver Support Function Prototypes.
*/ static void qla1280_done(struct scsi_qla_host *, struct srb **, struct srb **); -static void qla1280_done_q_put(struct srb *, struct srb **, struct srb **); -static int qla1280_slave_configure(Scsi_Device *); #if LINUX_VERSION_CODE < 0x020545 -static void qla1280_select_queue_depth(struct Scsi_Host *, Scsi_Device *); static void qla1280_get_target_options(struct scsi_cmnd *, struct scsi_qla_host *); #endif - -static int qla1280_return_status(struct response * sts, Scsi_Cmnd * cp); -static void qla1280_mem_free(struct scsi_qla_host *ha); static int qla1280_get_token(char *); static int qla1280_setup(char *s) __init; -static inline void qla1280_enable_intrs(struct scsi_qla_host *); -static inline void qla1280_disable_intrs(struct scsi_qla_host *); /* * QLogic ISP1280 Hardware Support Function Prototypes. */ -static int qla1280_initialize_adapter(struct scsi_qla_host *ha); static int qla1280_isp_firmware(struct scsi_qla_host *); -static int qla1280_pci_config(struct scsi_qla_host *); static int qla1280_chip_diag(struct scsi_qla_host *); static int qla1280_setup_chip(struct scsi_qla_host *); static int qla1280_init_rings(struct scsi_qla_host *); @@ -473,7 +476,6 @@ static void qla1280_reset_adapter(struct scsi_qla_host *); static void qla1280_marker(struct scsi_qla_host *, int, int, int, u8); static void qla1280_isp_cmd(struct scsi_qla_host *); -irqreturn_t qla1280_intr_handler(int, void *, struct pt_regs *); static void qla1280_isr(struct scsi_qla_host *, struct srb **, struct srb **); static void qla1280_rst_aen(struct scsi_qla_host *); static void qla1280_status_entry(struct scsi_qla_host *, struct response *, @@ -486,11 +488,9 @@ static request_t *qla1280_req_pkt(struct scsi_qla_host *); static int qla1280_check_for_dead_scsi_bus(struct scsi_qla_host *, unsigned int); -static int qla1280_mem_alloc(struct scsi_qla_host *ha); - -static void qla12160_get_target_parameters(struct scsi_qla_host *, +static void qla1280_get_target_parameters(struct scsi_qla_host *, Scsi_Device *); -static int qla12160_set_target_parameters(struct scsi_qla_host *, int, int); +static int qla1280_set_target_parameters(struct scsi_qla_host *, int, int); static struct qla_driver_setup driver_setup __initdata; @@ -525,10 +525,6 @@ return flags; } -#if QL1280_LUN_SUPPORT -static void qla1280_enable_lun(struct scsi_qla_host *, int, int); -#endif - #if DEBUG_QLA1280 static void __qla1280_print_scsi_cmd(Scsi_Cmnd * cmd); static void __qla1280_dump_buffer(char *, int); @@ -547,8 +543,6 @@ __setup("qla1280=", qla1280_setup); #endif -MODULE_LICENSE("GPL"); - /* We use the Scsi_Pointer structure that's included with each command * SCSI_Cmnd as a scratchpad for our SRB. 
@@ -572,28 +566,23 @@ #define CMD_RESULT(Cmnd) Cmnd->result #define CMD_HANDLE(Cmnd) Cmnd->host_scribble #if LINUX_VERSION_CODE < 0x020545 -#define CMD_HOST(Cmnd) Cmnd->host #define CMD_REQUEST(Cmnd) Cmnd->request.cmd -#define SCSI_BUS_32(Cmnd) Cmnd->channel -#define SCSI_TCN_32(Cmnd) Cmnd->target -#define SCSI_LUN_32(Cmnd) Cmnd->lun #else -#define CMD_HOST(Cmnd) Cmnd->device->host #define CMD_REQUEST(Cmnd) Cmnd->request->cmd +#endif + +#define CMD_HOST(Cmnd) Cmnd->device->host #define SCSI_BUS_32(Cmnd) Cmnd->device->channel #define SCSI_TCN_32(Cmnd) Cmnd->device->id #define SCSI_LUN_32(Cmnd) Cmnd->device->lun -#endif + /*****************************************/ /* ISP Boards supported by this driver */ /*****************************************/ -#define NUM_OF_ISP_DEVICES 6 - struct qla_boards { unsigned char name[9]; /* Board ID String */ - unsigned long device_id; /* Device PCI ID */ int numPorts; /* Number of SCSI ports */ unsigned short *fwcode; /* pointer to FW array */ unsigned short *fwlen; /* number of words in array */ @@ -601,28 +590,38 @@ unsigned char *fwver; /* Ptr to F/W version array */ }; -struct qla_boards ql1280_board_tbl[NUM_OF_ISP_DEVICES] = { - /* Name , Board PCI Device ID, Number of ports */ - {"QLA12160", PCI_DEVICE_ID_QLOGIC_ISP12160, 2, - &fw12160i_code01[0], &fw12160i_length01, +/* NOTE: qla1280_pci_tbl and ql1280_board_tbl must be in the same order */ +static struct pci_device_id qla1280_pci_tbl[] = { + {PCI_VENDOR_ID_QLOGIC, PCI_DEVICE_ID_QLOGIC_ISP12160, + PCI_ANY_ID, PCI_ANY_ID, 0, 0, 0}, + {PCI_VENDOR_ID_QLOGIC, PCI_DEVICE_ID_QLOGIC_ISP1080, + PCI_ANY_ID, PCI_ANY_ID, 0, 0, 1}, + {PCI_VENDOR_ID_QLOGIC, PCI_DEVICE_ID_QLOGIC_ISP1240, + PCI_ANY_ID, PCI_ANY_ID, 0, 0, 2}, + {PCI_VENDOR_ID_QLOGIC, PCI_DEVICE_ID_QLOGIC_ISP1280, + PCI_ANY_ID, PCI_ANY_ID, 0, 0, 3}, + {PCI_VENDOR_ID_QLOGIC, PCI_DEVICE_ID_QLOGIC_ISP10160, + PCI_ANY_ID, PCI_ANY_ID, 0, 0, 4}, + {0,} +}; +MODULE_DEVICE_TABLE(pci, qla1280_pci_tbl); + +static struct qla_boards ql1280_board_tbl[] = { + /* Name , Number of ports, FW details */ + {"QLA12160", 2, &fw12160i_code01[0], &fw12160i_length01, &fw12160i_addr01, &fw12160i_version_str[0]}, - {"QLA1080", PCI_DEVICE_ID_QLOGIC_ISP1080, 1, - &fw1280ei_code01[0], &fw1280ei_length01, + {"QLA1080", 1, &fw1280ei_code01[0], &fw1280ei_length01, &fw1280ei_addr01, &fw1280ei_version_str[0]}, - {"QLA1240", PCI_DEVICE_ID_QLOGIC_ISP1240, 2, - &fw1280ei_code01[0], &fw1280ei_length01, + {"QLA1240", 2, &fw1280ei_code01[0], &fw1280ei_length01, &fw1280ei_addr01, &fw1280ei_version_str[0]}, - {"QLA1280", PCI_DEVICE_ID_QLOGIC_ISP1280, 2, - &fw1280ei_code01[0], &fw1280ei_length01, + {"QLA1280", 2, &fw1280ei_code01[0], &fw1280ei_length01, &fw1280ei_addr01, &fw1280ei_version_str[0]}, - {"QLA10160", PCI_DEVICE_ID_QLOGIC_ISP10160, 1, - &fw12160i_code01[0], &fw12160i_length01, + {"QLA10160", 1, &fw12160i_code01[0], &fw12160i_length01, &fw12160i_addr01, &fw12160i_version_str[0]}, - {" ", 0, 0} + {" ", 0} }; static int qla1280_verbose = 1; -static struct scsi_qla_host *qla1280_hostlist; static int qla1280_buffer_size; static char *qla1280_buffer; @@ -671,31 +670,19 @@ int size = 0; int len = 0; struct qla_boards *bdp; -#ifdef BOGUS_QUEUE - struct scsi_lu *up; - uint32_t b, t, l; -#endif -#if LINUX_VERSION_CODE >= 0x020600 - ha = (struct scsi_qla_host *)host->hostdata; -#else +#if LINUX_VERSION_CODE < 0x020600 struct Scsi_Host *host; - /* Find the host that was specified */ - for (ha = qla1280_hostlist; (ha != NULL) - && ha->host->host_no != hostno; ha = ha->next) ; - - /* if host 
wasn't found then exit */ - if (!ha) { - size = sprintf(buffer, "Can't find adapter for host " - "number %d\n", hostno); - if (size > length) { - return size; - } else { - return 0; - } + + for (host = scsi_hostlist; host; host = host->next) { + if (host->host_no == hostno) + goto found; } - host = ha->host; + return -ESRCH; + + found: #endif + ha = (struct scsi_qla_host *)host->hostdata; if (inout) return -ENOSYS; @@ -749,51 +736,6 @@ size = sprintf(PROC_BUF, "\n"); /* 1 */ len += size; - size = sprintf(PROC_BUF, "SCSI device Information:\n"); - len += size; -#ifdef BOGUS_QUEUE - /* scan for all equipment stats */ - for (b = 0; b < MAX_BUSES; b++) - for (t = 0; t < MAX_TARGETS; t++) { - for (l = 0; l < MAX_LUNS; l++) { - up = LU_Q(ha, b, t, l); - if (up == NULL) - continue; - /* unused device/lun */ - if (up->io_cnt == 0 || up->io_cnt < 2) - continue; - /* total reads since boot */ - /* total writes since boot */ - /* total requests since boot */ - size = sprintf (PROC_BUF, - "(%2d:%2d:%2d): Total reqs %ld,", - b, t, l, up->io_cnt); - len += size; - /* current number of pending requests */ - size = sprintf(PROC_BUF, " Pend reqs %d,", - up->q_outcnt); - len += size; -#if 0 - /* avg response time */ - size = sprintf(PROC_BUF, " Avg resp time %ld%%,", - (up->resp_time / up->io_cnt) * - 100); - len += size; - - /* avg active time */ - size = sprintf(PROC_BUF, - " Avg active time %ld%%\n", - (up->act_time / up->io_cnt) * 100); -#else - size = sprintf(PROC_BUF, "\n"); -#endif - len += size; - } - if (len >= qla1280_buffer_size) - break; - } -#endif - if (len >= qla1280_buffer_size) { printk(KERN_WARNING "qla1280: Overflow buffer in qla1280_proc.c\n"); @@ -871,312 +813,6 @@ return chksum; } - -/************************************************************************** - * qla1280_do_device_init - * This routine will register the device with the SCSI subsystem, - * initialize the host adapter structure and call the device init - * routines. - * - * Input: - * pdev - pointer to struct pci_dev for adapter - * template - pointer to SCSI template - * devnum - the device number - * bdp - pointer to struct _qlaboards - * num_hosts - the host number - * - * Returns: - * host - pointer to SCSI host structure - **************************************************************************/ -struct Scsi_Host * -qla1280_do_device_init(struct pci_dev *pdev, Scsi_Host_Template * template, - int devnum, struct qla_boards *bdp, int num_hosts) -{ - struct Scsi_Host *host; - struct scsi_qla_host *ha; - - printk(KERN_INFO "qla1280: %s found on PCI bus %i, dev %i\n", - bdp->name, pdev->bus->number, PCI_SLOT(pdev->devfn)); - - host = scsi_register(template, sizeof(struct scsi_qla_host)); - if (!host) { - printk(KERN_WARNING - "qla1280: Failed to register host, aborting.\n"); - goto error; - } - -#if LINUX_VERSION_CODE < 0x020545 - scsi_set_pci_device(host, pdev); -#else - scsi_set_device(host, &pdev->dev); -#endif - ha = (struct scsi_qla_host *)host->hostdata; - /* Clear our data area */ - memset(ha, 0, sizeof(struct scsi_qla_host)); - /* Sanitize the information from PCI BIOS. 
*/ - host->irq = pdev->irq; - ha->pci_bus = pdev->bus->number; - ha->pci_device_fn = pdev->devfn; - ha->pdev = pdev; - ha->device_id = bdp->device_id; - ha->devnum = devnum; /* specifies microcode load address */ - - if (qla1280_mem_alloc(ha)) { - printk(KERN_INFO "qla1x160: Failed to get memory\n"); - goto error_scsi_unregister; - } - - ha->ports = bdp->numPorts; - /* following needed for all cases of OS versions */ - ha->host = host; - ha->host_no = host->host_no; - - host->can_queue = 0xfffff; /* unlimited */ - host->cmd_per_lun = 1; - host->base = (unsigned long)ha->mmpbase; - host->max_channel = bdp->numPorts - 1; - host->max_lun = MAX_LUNS - 1; - host->max_id = MAX_TARGETS; - host->max_sectors = 1024; -#if LINUX_VERSION_CODE < 0x020545 - host->select_queue_depths = qla1280_select_queue_depth; -#endif - - ha->instance = num_hosts; - host->unique_id = ha->instance; - - if (qla1280_pci_config(ha)) { - printk(KERN_INFO "qla1x160: Unable to configure PCI\n"); - goto error_mem_alloced; - } - - /* Disable ISP interrupts. */ - qla1280_disable_intrs(ha); - - /* Register the IRQ with Linux (sharable) */ - if (request_irq(host->irq, qla1280_intr_handler, SA_SHIRQ, - "qla1280", ha)) { - printk("qla1280 : Failed to reserve interrupt %d already " - "in use\n", host->irq); - goto error_iounmap; - } -#if !MEMORY_MAPPED_IO - /* Register the I/O space with Linux */ - if (!request_region(host->io_port, 0xff, "qla1280")) { - printk("qla1280: Failed to reserve i/o region 0x%04lx-0x%04lx" - " already in use\n", - host->io_port, host->io_port + 0xff); - goto error_free_irq; - } -#endif - - /* load the F/W, read paramaters, and init the H/W */ - if (qla1280_initialize_adapter(ha)) { - printk(KERN_INFO "qla1x160: Failed to initialize adapter\n"); - goto error_release_region; - } - - /* set our host ID (need to do something about our two IDs) */ - host->this_id = ha->bus_settings[0].id; - - return host; - - error_release_region: -#if !MEMORY_MAPPED_IO - release_region(host->io_port, 0xff); - error_free_irq: -#endif - free_irq(host->irq, ha); - error_iounmap: -#if MEMORY_MAPPED_IO - if (ha->mmpbase) - iounmap((void *)(((unsigned long) ha->mmpbase) & PAGE_MASK)); -#endif - error_mem_alloced: - qla1280_mem_free(ha); - error_scsi_unregister: - scsi_unregister(host); - error: - return NULL; -} - -/************************************************************************** - * qla1280_detect - * This routine will probe for Qlogic 1280 SCSI host adapters. - * It returns the number of host adapters of a particular - * type that were found. It also initialize all data necessary for - * the driver. It is passed-in the host number, so that it - * knows where its first entry is in the scsi_hosts[] array. - * - * Input: - * template - pointer to SCSI template - * - * Returns: - * num - number of host adapters found. - **************************************************************************/ -static int -qla1280_detect(Scsi_Host_Template * template) -{ - struct pci_dev *pdev = NULL; - struct Scsi_Host *host; - struct scsi_qla_host *ha, *cur_ha; - struct qla_boards *bdp; - uint16_t subsys_vendor, subsys_device; - int num_hosts = 0; - int devnum = 0; - - ENTER("qla1280_detect"); - - if (sizeof(struct srb) > sizeof(Scsi_Pointer)) { - printk(KERN_WARNING - "qla1280_detect: [WARNING] struct srb too big\n"); - return 0; - } -#ifdef MODULE - /* - * If we are called as a module, the qla1280 pointer may not be null - * and it would point to our bootup string, just like on the lilo - * command line. 
IF not NULL, then process this config string with - * qla1280_setup - * - * Boot time Options - * To add options at boot time add a line to your lilo.conf file like: - * append="qla1280=verbose,max_tags:{{255,255,255,255},{255,255,255,255}}" - * which will result in the first four devices on the first two - * controllers being set to a tagged queue depth of 32. - */ - if (qla1280) - qla1280_setup(qla1280); -#endif - - bdp = &ql1280_board_tbl[0]; - qla1280_hostlist = NULL; - template->proc_name = "qla1280"; - - /* First Initialize QLA12160 on PCI Bus 1 Dev 2 */ - while ((pdev = pci_find_subsys(PCI_VENDOR_ID_QLOGIC, bdp->device_id, - PCI_ANY_ID, PCI_ANY_ID, pdev))) { - - /* find QLA12160 device on PCI bus=1 slot=2 */ - if ((pdev->bus->number != 1) || (PCI_SLOT(pdev->devfn) != 2)) - continue; - - /* Bypass all AMI SUBSYS VENDOR IDs */ - if (pdev->subsystem_vendor == PCI_VENDOR_ID_AMI) { - printk(KERN_INFO - "qla1x160: Skip AMI SubSys Vendor ID Chip\n"); - continue; - } - - if (pci_enable_device(pdev)) - goto find_devices; - - host = qla1280_do_device_init(pdev, template, devnum, - bdp, num_hosts); - if (!host) - continue; - ha = (struct scsi_qla_host *)host->hostdata; - - /* this preferred device will always be the first one found */ - cur_ha = qla1280_hostlist = ha; - num_hosts++; - } - - find_devices: - - pdev = NULL; - /* Try and find each different type of adapter we support */ - for (devnum = 0; bdp->device_id != 0 && devnum < NUM_OF_ISP_DEVICES; - devnum++, bdp++) { - /* PCI_SUBSYSTEM_IDS supported */ - while ((pdev = pci_find_subsys(PCI_VENDOR_ID_QLOGIC, - bdp->device_id, PCI_ANY_ID, - PCI_ANY_ID, pdev))) { - if (pci_enable_device(pdev)) - continue; - /* found an adapter */ - subsys_vendor = pdev->subsystem_vendor; - subsys_device = pdev->subsystem_device; - - /* - * skip QLA12160 already initialized on - * PCI Bus 1 Dev 2 since we already initialized - * and presented it - */ - if ((bdp->device_id == PCI_DEVICE_ID_QLOGIC_ISP12160)&& - (pdev->bus->number == 1) && - (PCI_SLOT(pdev->devfn) == 2)) - continue; - - /* Bypass all AMI SUBSYS VENDOR IDs */ - if (subsys_vendor == PCI_VENDOR_ID_AMI) { - printk(KERN_INFO - "qla1x160: Skip AMI SubSys Vendor ID Chip\n"); - continue; - } - dprintk(1, "qla1x160: Supported Device Found VID=%x " - "DID=%x SSVID=%x SSDID=%x\n", pdev->vendor, - pdev->device, subsys_vendor, subsys_device); - - host = qla1280_do_device_init(pdev, template, - devnum, bdp, num_hosts); - if (!host) - continue; - ha = (struct scsi_qla_host *)host->hostdata; - - if (qla1280_hostlist == NULL) { - cur_ha = qla1280_hostlist = ha; - } else { - cur_ha = qla1280_hostlist; - while (cur_ha->next != NULL) - cur_ha = cur_ha->next; - cur_ha->next = ha; - } - num_hosts++; - } /* end of WHILE */ - } /* end of FOR */ - - LEAVE("qla1280_detect"); - return num_hosts; -} - -/************************************************************************** - * qla1280_release - * Free the passed in Scsi_Host memory structures prior to unloading the - * module. 
- **************************************************************************/ -static int -qla1280_release(struct Scsi_Host *host) -{ - struct scsi_qla_host *ha = (struct scsi_qla_host *)host->hostdata; - - ENTER("qla1280_release"); - - if (!ha->flags.online) - return 0; - - /* turn-off interrupts on the card */ - WRT_REG_WORD(&ha->iobase->ictrl, 0); - - /* Detach interrupts */ - if (host->irq) - free_irq(host->irq, ha); - -#if MEMORY_MAPPED_IO - if (ha->mmpbase) - iounmap(ha->mmpbase); -#else - /* release io space registers */ - if (host->io_port) - release_region(host->io_port, 0xff); -#endif /* MEMORY_MAPPED_IO */ - - qla1280_mem_free(ha); - - ENTER("qla1280_release"); - return 0; -} - /************************************************************************** * qla1280_info * Return a string describing the driver. @@ -1193,11 +829,11 @@ ha = (struct scsi_qla_host *)host->hostdata; bdp = &ql1280_board_tbl[ha->devnum]; memset(bp, 0, sizeof(qla1280_scsi_name_buffer)); + sprintf (bp, - "QLogic %s PCI to SCSI Host Adapter: bus %d device %d irq %d\n" + "QLogic %s PCI to SCSI Host Adapter\n" " Firmware version: %2d.%02d.%02d, Driver version %s", - &bdp->name[0], ha->pci_bus, (ha->pci_device_fn & 0xf8) >> 3, - host->irq, bdp->fwver[0], bdp->fwver[1], bdp->fwver[2], + &bdp->name[0], bdp->fwver[0], bdp->fwver[1], bdp->fwver[2], QLA1280_VERSION); return bp; } @@ -1216,38 +852,19 @@ static int qla1280_queuecommand(Scsi_Cmnd * cmd, void (*fn) (Scsi_Cmnd *)) { - struct scsi_qla_host *ha; - struct srb *sp; - struct Scsi_Host *host; - int bus, target, lun; - int status; - - /*ENTER("qla1280_queuecommand"); - */ - dprintk(2, "qla1280_queuecommand(): jiffies %li\n", jiffies); - - host = CMD_HOST(cmd); - ha = (struct scsi_qla_host *)host->hostdata; + struct Scsi_Host *host = cmd->device->host; + struct scsi_qla_host *ha = (struct scsi_qla_host *)host->hostdata; + struct srb *sp = (struct srb *)&cmd->SCp; - /* send command to adapter */ - sp = (struct srb *)CMD_SP(cmd); - sp->cmd = cmd; cmd->scsi_done = fn; + sp->cmd = cmd; sp->flags = 0; qla1280_print_scsi_cmd(5, cmd); - /* Generate LU queue on bus, target, LUN */ - bus = SCSI_BUS_32(cmd); - target = SCSI_TCN_32(cmd); - lun = SCSI_LUN_32(cmd); if (ha->flags.enable_64bit_addressing) - status = qla1280_64bit_start_scsi(ha, sp); - else - status = qla1280_32bit_start_scsi(ha, sp); - - /*LEAVE("qla1280_queuecommand"); */ - return status; + return qla1280_64bit_start_scsi(ha, sp); + return qla1280_32bit_start_scsi(ha, sp); } enum action { @@ -1553,29 +1170,105 @@ unsigned long capacity = disk->capacity; #endif - heads = 64; - sectors = 32; - cylinders = (unsigned long)capacity / (heads * sectors); - if (cylinders > 1024) { - heads = 255; - sectors = 63; - cylinders = (unsigned long)capacity / (heads * sectors); - /* if (cylinders > 1023) - cylinders = 1023; */ + heads = 64; + sectors = 32; + cylinders = (unsigned long)capacity / (heads * sectors); + if (cylinders > 1024) { + heads = 255; + sectors = 63; + cylinders = (unsigned long)capacity / (heads * sectors); + /* if (cylinders > 1023) + cylinders = 1023; */ + } + + geom[0] = heads; + geom[1] = sectors; + geom[2] = cylinders; + + return 0; +} + +#if LINUX_VERSION_CODE < 0x020600 +static int +qla1280_detect(Scsi_Host_Template *template) +{ + struct pci_device_id *id = &qla1280_pci_tbl[0]; + struct pci_dev *pdev = NULL; + int num_hosts = 0; + + if (sizeof(struct srb) > sizeof(Scsi_Pointer)) { + printk(KERN_WARNING + "qla1280: struct srb too big, aborting\n"); + return 0; + } + +#ifdef MODULE + /* + * If we are 
called as a module, the qla1280 pointer may not be null + * and it would point to our bootup string, just like on the lilo + * command line. IF not NULL, then process this config string with + * qla1280_setup + * + * Boot time Options + * To add options at boot time add a line to your lilo.conf file like: + * append="qla1280=verbose,max_tags:{{255,255,255,255},{255,255,255,255}}" + * which will result in the first four devices on the first two + * controllers being set to a tagged queue depth of 32. + */ + if (qla1280) + qla1280_setup(qla1280); +#endif + + /* First Initialize QLA12160 on PCI Bus 1 Dev 2 */ + while ((pdev = pci_find_device(id->vendor, id->device, pdev))) { + if (pdev->bus->number == 1 && PCI_SLOT(pdev->devfn) == 2) { + if (!qla1280_probe_one(pdev, id)) + num_hosts++; + } + } + + pdev = NULL; + /* Try and find each different type of adapter we support */ + for (id = &qla1280_pci_tbl[0]; id->device; id++) { + while ((pdev = pci_find_device(id->vendor, id->device, pdev))) { + /* + * skip QLA12160 already initialized on + * PCI Bus 1 Dev 2 since we already initialized + * and presented it + */ + if (id->device == PCI_DEVICE_ID_QLOGIC_ISP12160 && + pdev->bus->number == 1 && + PCI_SLOT(pdev->devfn) == 2) + continue; + + if (!qla1280_probe_one(pdev, id)) + num_hosts++; + } } - geom[0] = heads; - geom[1] = sectors; - geom[2] = cylinders; + return num_hosts; +} + +/* + * This looks a bit ugly as we could just pass down host to + * qla1280_remove_one, but I want to keep qla1280_release purely a wrapper + * around pci_driver::remove as used from 2.6 onwards. + */ +static int +qla1280_release(struct Scsi_Host *host) +{ + struct scsi_qla_host *ha = (struct scsi_qla_host *)host->hostdata; + qla1280_remove_one(ha->pdev); return 0; } +#endif /************************************************************************** * qla1280_intr_handler * Handles the H/W interrupt **************************************************************************/ -irqreturn_t +static irqreturn_t qla1280_intr_handler(int irq, void *dev_id, struct pt_regs *regs) { struct scsi_qla_host *ha; @@ -1613,7 +1306,7 @@ static int -qla12160_set_target_parameters(struct scsi_qla_host *ha, int bus, int target) +qla1280_set_target_parameters(struct scsi_qla_host *ha, int bus, int target) { uint8_t mr; uint16_t mb[MAILBOX_REGISTER_COUNT]; @@ -1622,8 +1315,8 @@ nv = &ha->nvram; - if (ha->device_id == PCI_DEVICE_ID_QLOGIC_ISP12160 || - ha->device_id == PCI_DEVICE_ID_QLOGIC_ISP10160) + if (ha->pdev->device == PCI_DEVICE_ID_QLOGIC_ISP12160 || + ha->pdev->device == PCI_DEVICE_ID_QLOGIC_ISP10160) is1x160 = 1; else is1x160 = 0; @@ -1710,8 +1403,8 @@ (driver_setup.wide_mask && (~driver_setup.wide_mask & (1 << target)))) nv->bus[bus].target[target].parameter.f.enable_wide = 0; - if (ha->device_id == PCI_DEVICE_ID_QLOGIC_ISP12160 || - ha->device_id == PCI_DEVICE_ID_QLOGIC_ISP10160) { + if (ha->pdev->device == PCI_DEVICE_ID_QLOGIC_ISP12160 || + ha->pdev->device == PCI_DEVICE_ID_QLOGIC_ISP10160) { if (driver_setup.no_ppr || (driver_setup.ppr_mask && (~driver_setup.ppr_mask & (1 << target)))) @@ -1719,11 +1412,9 @@ } spin_lock_irqsave(HOST_LOCK, flags); - if (nv->bus[bus].target[target].parameter.f.enable_sync) { - status = qla12160_set_target_parameters(ha, bus, target); - } - - qla12160_get_target_parameters(ha, device); + if (nv->bus[bus].target[target].parameter.f.enable_sync) + status = qla1280_set_target_parameters(ha, bus, target); + qla1280_get_target_parameters(ha, device); spin_unlock_irqrestore(HOST_LOCK, flags); return status; 
} @@ -1750,16 +1441,11 @@ if (scsi_devs) qla1280_check_for_dead_scsi_bus(ha, scsi_devs->channel); - LEAVE("qla1280_select_queue_depth"); } #endif /* - * Driver Support Routines - */ - -/* * qla1280_done * Process completed commands. * @@ -1961,100 +1647,19 @@ LEAVE("qla1280_put_done_q"); } - -/* -* qla1280_mem_alloc -* Allocates adapter memory. -* -* Returns: -* 0 = success. -* 1 = failure. -*/ -static int -qla1280_mem_alloc(struct scsi_qla_host *ha) -{ - int status = 1; - dma_addr_t dma_handle; - - ENTER("qla1280_mem_alloc"); - - /* get consistent memory allocated for request and response rings */ - ha->request_ring = pci_alloc_consistent(ha->pdev, - ((REQUEST_ENTRY_CNT + 1) * - (sizeof(request_t))), - &dma_handle); - if (!ha->request_ring) - goto error; - ha->request_dma = dma_handle; - ha->response_ring = pci_alloc_consistent(ha->pdev, - ((RESPONSE_ENTRY_CNT + 1) * - (sizeof(struct response))), - &dma_handle); - if (!ha->response_ring) - goto error; - ha->response_dma = dma_handle; - status = 0; - goto finish; - - error: - if (status) - dprintk(2, "qla1280_mem_alloc: **** FAILED ****\n"); - - if (ha->request_ring) - pci_free_consistent(ha->pdev, - ((REQUEST_ENTRY_CNT + 1) * - (sizeof(request_t))), - ha->request_ring, ha->request_dma); - finish: - LEAVE("qla1280_mem_alloc"); - return status; -} - -/* - * qla1280_mem_free - * Frees adapter allocated memory. - * - * Input: - * ha = adapter block pointer. - */ -static void -qla1280_mem_free(struct scsi_qla_host *ha) -{ - ENTER("qlc1280_mem_free"); - /* free consistent memory allocated for request and response rings */ - if (ha->request_ring) - pci_free_consistent(ha->pdev, - ((REQUEST_ENTRY_CNT + 1) * - (sizeof(request_t))), - ha->request_ring, ha->request_dma); - - if (ha->response_ring) - pci_free_consistent(ha->pdev, - ((RESPONSE_ENTRY_CNT + 1) * - (sizeof(struct response))), - ha->response_ring, ha->response_dma); - - if (qla1280_buffer) { - free_page((unsigned long) qla1280_buffer); - qla1280_buffer = NULL; - } - - LEAVE("qlc1280_mem_free"); -} - /****************************************************************************/ /* QLogic ISP1280 Hardware Support Functions. */ /****************************************************************************/ /* - * qla2100_enable_intrs - * qla2100_disable_intrs - * - * Input: - * ha = adapter block pointer. - * - * Returns: - * None + * qla2100_enable_intrs + * qla2100_disable_intrs + * + * Input: + * ha = adapter block pointer. + * + * Returns: + * None */ static inline void qla1280_enable_intrs(struct scsi_qla_host *ha) @@ -2090,7 +1695,7 @@ * Returns: * 0 = success */ -static int +static int __devinit qla1280_initialize_adapter(struct scsi_qla_host *ha) { struct device_reg *reg; @@ -2286,7 +1891,7 @@ * Returns: * 0 = success. */ -static int +static int __devinit qla1280_pci_config(struct scsi_qla_host *ha) { #if MEMORY_MAPPED_IO @@ -2713,8 +2318,8 @@ ENTER("qla1280_nvram_config"); - if (ha->device_id == PCI_DEVICE_ID_QLOGIC_ISP12160 || - ha->device_id == PCI_DEVICE_ID_QLOGIC_ISP10160) + if (ha->pdev->device == PCI_DEVICE_ID_QLOGIC_ISP12160 || + ha->pdev->device == PCI_DEVICE_ID_QLOGIC_ISP10160) is1x160 = 1; else is1x160 = 0; @@ -4196,48 +3801,6 @@ LEAVE("qla1280_isp_cmd"); } -#if QL1280_LUN_SUPPORT -/* - * qla1280_enable_lun - * Issue enable LUN entry IOCB. - * - * Input: - * ha = adapter block pointer. - * bus = SCSI BUS number. - * lun = LUN number. 
- */
-static void
-qla1280_enable_lun(struct scsi_qla_host *ha, int bus, int lun)
-{
-	struct elun_entry *pkt;
-
-	ENTER("qla1280_enable_lun");
-
-	/* Get request packet. */
-	/*
-	  if (pkt = (struct elun_entry *)qla1280_req_pkt(ha))
-	  {
-	  pkt->entry_type = ENABLE_LUN_TYPE;
-	  pkt->lun = cpu_to_le16(bus ? lun | BIT_15 : lun);
-	  pkt->command_count = 32;
-	  pkt->immed_notify_count = 1;
-	  pkt->group_6_length = MAX_CMDSZ;
-	  pkt->group_7_length = MAX_CMDSZ;
-	  pkt->timeout = cpu_to_le16(0x30);
-
-	  qla1280_isp_cmd(ha);
-	  }
-	*/
-	pkt = (struct elun_entry *) 1;
-
-	if (!pkt)
-		dprintk(2, "qla1280_enable_lun: **** FAILED ****\n");
-	else
-		dprintk(3, "qla1280_enable_lun: exiting normally\n");
-}
-#endif
-
-
 /****************************************************************************/
 /*                        Interrupt Service Routine.                        */
 /****************************************************************************/
@@ -4883,7 +4446,7 @@
 }
 
 static void
-qla12160_get_target_parameters(struct scsi_qla_host *ha, Scsi_Device *device)
+qla1280_get_target_parameters(struct scsi_qla_host *ha, Scsi_Device *device)
 {
 	uint16_t mb[MAILBOX_REGISTER_COUNT];
 	int bus, target, lun;
@@ -5125,34 +4688,274 @@
 	return ret;
 }
 
-
-static Scsi_Host_Template driver_template = {
-	.proc_info		= qla1280_proc_info,
+static Scsi_Host_Template qla1280_driver_template = {
+	.proc_name		= "qla1280",
 	.name			= "Qlogic ISP 1280/12160",
+#if LINUX_VERSION_CODE >= 0x020545
+	.slave_configure	= qla1280_slave_configure,
+#else
 	.detect			= qla1280_detect,
 	.release		= qla1280_release,
+#endif
 	.info			= qla1280_info,
 	.queuecommand		= qla1280_queuecommand,
-#if LINUX_VERSION_CODE >= 0x020545
-	.slave_configure	= qla1280_slave_configure,
-#endif
 	.eh_abort_handler	= qla1280_eh_abort,
 	.eh_device_reset_handler= qla1280_eh_device_reset,
 	.eh_bus_reset_handler	= qla1280_eh_bus_reset,
 	.eh_host_reset_handler	= qla1280_eh_adapter_reset,
 	.bios_param		= qla1280_biosparam,
-	.can_queue		= 255,
+	.proc_info		= qla1280_proc_info,
+	.can_queue		= 0xfffff,
 	.this_id		= -1,
 	.sg_tablesize		= SG_ALL,
-	.cmd_per_lun		= 3,
+	.cmd_per_lun		= 1,
 	.use_clustering		= ENABLE_CLUSTERING,
 #if LINUX_VERSION_CODE < 0x020545
 	.use_new_eh_code	= 1,
 #endif
 };
 
-#include "scsi_module.c"
+static int __devinit
+qla1280_probe_one(struct pci_dev *pdev, const struct pci_device_id *id)
+{
+	int devnum = id->driver_data;
+	struct qla_boards *bdp = &ql1280_board_tbl[devnum];
+	struct Scsi_Host *host;
+	struct scsi_qla_host *ha;
+	int error = -ENODEV;
+
+	/* Bypass all AMI SUBSYS VENDOR IDs */
+	if (pdev->subsystem_vendor == PCI_VENDOR_ID_AMI) {
+		printk(KERN_INFO
+		       "qla1280: Skipping AMI SubSys Vendor ID Chip\n");
+		goto error;
+	}
+
+	printk(KERN_INFO "qla1280: %s found on PCI bus %i, dev %i\n",
+	       bdp->name, pdev->bus->number, PCI_SLOT(pdev->devfn));
+
+	if (pci_enable_device(pdev)) {
+		printk(KERN_WARNING
+		       "qla1280: Failed to enable pci device, aborting.\n");
+		goto error;
+	}
+
+	error = -ENOMEM;
+	host = scsi_host_alloc(&qla1280_driver_template, sizeof(*ha));
+	if (!host) {
+		printk(KERN_WARNING
+		       "qla1280: Failed to register host, aborting.\n");
+		goto error;
+	}
+
+	ha = (struct scsi_qla_host *)host->hostdata;
+	memset(ha, 0, sizeof(struct scsi_qla_host));
+
+	ha->pdev = pdev;
+	ha->devnum = devnum;	/* specifies microcode load address */
+
+	ha->request_ring = pci_alloc_consistent(ha->pdev,
+			((REQUEST_ENTRY_CNT + 1) * (sizeof(request_t))),
+			&ha->request_dma);
+	if (!ha->request_ring) {
+		printk(KERN_INFO "qla1280: Failed to get request memory\n");
+		goto error_put_host;
+	}
+
+	ha->response_ring = pci_alloc_consistent(ha->pdev,
+			((RESPONSE_ENTRY_CNT + 1) * (sizeof(struct response))),
+			&ha->response_dma);
+	if (!ha->response_ring) {
+		printk(KERN_INFO "qla1280: Failed to get response memory\n");
+		goto error_free_request_ring;
+	}
+
+	ha->ports = bdp->numPorts;
+
+	ha->host = host;
+	ha->host_no = host->host_no;
+
+	host->irq = pdev->irq;
+	host->base = (unsigned long)ha->mmpbase;
+	host->max_channel = bdp->numPorts - 1;
+	host->max_lun = MAX_LUNS - 1;
+	host->max_id = MAX_TARGETS;
+	host->max_sectors = 1024;
+	host->unique_id = host->host_no;
+
+#if LINUX_VERSION_CODE < 0x020545
+	host->select_queue_depths = qla1280_select_queue_depth;
+#endif
+
+	error = -ENODEV;
+	if (qla1280_pci_config(ha)) {
+		printk(KERN_INFO "qla1280: Unable to configure PCI\n");
+		goto error_free_response_ring;
+	}
+
+	/* Disable ISP interrupts. */
+	qla1280_disable_intrs(ha);
+
+	/* Register the IRQ with Linux (sharable) */
+	if (request_irq(pdev->irq, qla1280_intr_handler, SA_SHIRQ,
+				"qla1280", ha)) {
+		printk("qla1280 : Failed to reserve interrupt %d already "
+		       "in use\n", pdev->irq);
+		goto error_iounmap;
+	}
+
+#if !MEMORY_MAPPED_IO
+	/* Register the I/O space with Linux */
+	if (!request_region(host->io_port, 0xff, "qla1280")) {
+		printk("qla1280: Failed to reserve i/o region 0x%04lx-0x%04lx"
+		       " already in use\n",
+		       host->io_port, host->io_port + 0xff);
+		goto error_free_irq;
+	}
+#endif
+
+	/* load the F/W, read parameters, and init the H/W */
+	if (qla1280_initialize_adapter(ha)) {
+		printk(KERN_INFO "qla1x160: Failed to initialize adapter\n");
+		goto error_release_region;
+	}
+
+	/* set our host ID  (need to do something about our two IDs) */
+	host->this_id = ha->bus_settings[0].id;
+
+	pci_set_drvdata(pdev, host);
+
+#if LINUX_VERSION_CODE >= 0x020600
+	error = scsi_add_host(host, &pdev->dev);
+	if (error)
+		goto error_disable_adapter;
+	scsi_scan_host(host);
+#else
+	scsi_set_pci_device(host, pdev);
+#endif
+
+	return 0;
+
+#if LINUX_VERSION_CODE >= 0x020600
+ error_disable_adapter:
+	WRT_REG_WORD(&ha->iobase->ictrl, 0);
+#endif
+ error_release_region:
+#if !MEMORY_MAPPED_IO
+	release_region(host->io_port, 0xff);
+ error_free_irq:
+#endif
+	free_irq(pdev->irq, ha);
+ error_iounmap:
+#if MEMORY_MAPPED_IO
+	iounmap((void *)(((unsigned long) ha->mmpbase) & PAGE_MASK));
+#endif
+ error_free_response_ring:
+	pci_free_consistent(ha->pdev,
+			((RESPONSE_ENTRY_CNT + 1) * (sizeof(struct response))),
+			ha->response_ring, ha->response_dma);
+ error_free_request_ring:
+	pci_free_consistent(ha->pdev,
+			((REQUEST_ENTRY_CNT + 1) * (sizeof(request_t))),
+			ha->request_ring, ha->request_dma);
+ error_put_host:
+	scsi_host_put(host);
+ error:
+	return error;
+}
+
+/*
+ * Older ia64 toolchains have problems with relative links when this
+ * goes into the .exit.text section
+ */
+#if !defined(CONFIG_QLA1280_MODULE) && defined(__ia64__) && (__GNUC__ == 2)
+static void
+#else
+static void __devexit
+#endif
+qla1280_remove_one(struct pci_dev *pdev)
+{
+	struct Scsi_Host *host = pci_get_drvdata(pdev);
+	struct scsi_qla_host *ha = (struct scsi_qla_host *)host->hostdata;
+
+#if LINUX_VERSION_CODE >= 0x020600
+	scsi_remove_host(host);
+#endif
+
+	WRT_REG_WORD(&ha->iobase->ictrl, 0);
+
+	free_irq(pdev->irq, ha);
+
+#if MEMORY_MAPPED_IO
+	iounmap(ha->mmpbase);
+#else
+	release_region(host->io_port, 0xff);
+#endif
+
+	pci_free_consistent(ha->pdev,
+			((REQUEST_ENTRY_CNT + 1) * (sizeof(request_t))),
+			ha->request_ring, ha->request_dma);
+	pci_free_consistent(ha->pdev,
+			((RESPONSE_ENTRY_CNT + 1) * (sizeof(struct response))),
ha->response_ring, ha->response_dma); + + scsi_host_put(host); +} + +#if LINUX_VERSION_CODE >= 0x020600 +static struct pci_driver qla1280_pci_driver = { + .name = "qla1280", + .id_table = qla1280_pci_tbl, + .probe = qla1280_probe_one, + .remove = __devexit_p(qla1280_remove_one), +}; + +static int __init +qla1280_init(void) +{ + if (sizeof(struct srb) > sizeof(Scsi_Pointer)) { + printk(KERN_WARNING + "qla1280: struct srb too big, aborting\n"); + return -EINVAL; + } + +#ifdef MODULE + /* + * If we are called as a module, the qla1280 pointer may not be null + * and it would point to our bootup string, just like on the lilo + * command line. IF not NULL, then process this config string with + * qla1280_setup + * + * Boot time Options + * To add options at boot time add a line to your lilo.conf file like: + * append="qla1280=verbose,max_tags:{{255,255,255,255},{255,255,255,255}}" + * which will result in the first four devices on the first two + * controllers being set to a tagged queue depth of 32. + */ + if (qla1280) + qla1280_setup(qla1280); +#endif + + return pci_module_init(&qla1280_pci_driver); +} +static void __exit +qla1280_exit(void) +{ + pci_unregister_driver(&qla1280_pci_driver); +} + +module_init(qla1280_init); +module_exit(qla1280_exit); + +#else +# define driver_template qla1280_driver_template +# include "scsi_module.c" +#endif + +MODULE_AUTHOR("Qlogic & Jes Sorensen"); +MODULE_DESCRIPTION("Qlogic ISP SCSI (qla1x80/qla1x160) driver"); +MODULE_LICENSE("GPL"); /* * Overrides for Emacs so that we almost follow Linus's tabbing style. --- diff/drivers/scsi/qla1280.h 2003-09-30 15:46:17.000000000 +0100 +++ source/drivers/scsi/qla1280.h 2003-11-26 10:09:06.000000000 +0000 @@ -1021,11 +1021,7 @@ unsigned char *mmpbase; /* memory mapped address */ unsigned long host_no; - unsigned long instance; struct pci_dev *pdev; - uint32_t device_id; - uint8_t pci_bus; - uint8_t pci_device_fn; uint8_t devnum; uint8_t revision; uint8_t ports; @@ -1040,18 +1036,9 @@ /* BUS configuration data */ struct bus_param bus_settings[MAX_BUSES]; -#if 0 - /* bottom half run queue */ - struct tq_struct run_qla_bh; -#endif - /* Received ISP mailbox data. */ volatile uint16_t mailbox_out[MAILBOX_REGISTER_COUNT]; -#ifdef UNUSED - struct timer_list dev_timer[MAX_TARGETS]; -#endif - dma_addr_t request_dma; /* Physical Address */ request_t *request_ring; /* Base virtual address */ request_t *request_ring_ptr; /* Current address. */ @@ -1063,15 +1050,6 @@ struct response *response_ring_ptr; /* Current address. */ uint16_t rsp_ring_index; /* Current index. 
*/ -#if WATCHDOGTIMER - /* Watchdog queue, lock and total timer */ - uint8_t watchdog_q_lock; /* Lock for watchdog queue */ - struct srb *wdg_q_first; /* First job on watchdog queue */ - struct srb *wdg_q_last; /* Last job on watchdog queue */ - uint32_t total_timeout; /* Total timeout (quantum count) */ - uint32_t watchdogactive; -#endif - struct srb *done_q_first; /* First job on done queue */ struct srb *done_q_last; /* Last job on done queue */ --- diff/drivers/scsi/sata_promise.c 2003-11-25 15:24:58.000000000 +0000 +++ source/drivers/scsi/sata_promise.c 2003-11-26 10:09:06.000000000 +0000 @@ -213,6 +213,8 @@ board_2037x }, { PCI_VENDOR_ID_PROMISE, 0x3375, PCI_ANY_ID, PCI_ANY_ID, 0, 0, board_2037x }, + { PCI_VENDOR_ID_PROMISE, 0x3376, PCI_ANY_ID, PCI_ANY_ID, 0, 0, + board_2037x }, { PCI_VENDOR_ID_PROMISE, 0x3318, PCI_ANY_ID, PCI_ANY_ID, 0, 0, board_20319 }, { PCI_VENDOR_ID_PROMISE, 0x3319, PCI_ANY_ID, PCI_ANY_ID, 0, 0, --- diff/drivers/scsi/sata_svw.c 2003-11-25 15:24:58.000000000 +0000 +++ source/drivers/scsi/sata_svw.c 2003-11-26 10:09:06.000000000 +0000 @@ -44,7 +44,7 @@ #endif /* CONFIG_ALL_PPC */ #define DRV_NAME "ata_k2" -#define DRV_VERSION "1.02" +#define DRV_VERSION "1.03" static u32 k2_sata_scr_read (struct ata_port *ap, unsigned int sc_reg) @@ -69,8 +69,11 @@ struct ata_ioports *ioaddr = &ap->ioaddr; unsigned int is_addr = tf->flags & ATA_TFLAG_ISADDR; - writeb(tf->ctl, ioaddr->ctl_addr); - + if (tf->ctl != ap->last_ctl) { + writeb(tf->ctl, ioaddr->ctl_addr); + ap->last_ctl = tf->ctl; + ata_wait_idle(ap); + } if (is_addr && (tf->flags & ATA_TFLAG_LBA48)) { writew(tf->feature | (((u16)tf->hob_feature) << 8), ioaddr->error_addr); writew(tf->nsect | (((u16)tf->hob_nsect) << 8), ioaddr->nsect_addr); @@ -311,13 +314,24 @@ rc = -ENODEV; goto err_out_unmap; } + + /* Clear a magic bit in SCR1 according to Darwin; it helps + * some funky Seagate drives (though so far, it was already + * set by the firmware on the machines I had access to). + */ + writel(readl(mmio_base + 0x80) & ~0x00040000, mmio_base + 0x80); + + /* Clear SATA error & interrupts we don't use */ + writel(0xffffffff, mmio_base + 0x44); + writel(0x0, mmio_base + 0x88); + probe_ent->sht = &k2_sata_sht; - probe_ent->host_flags = ATA_FLAG_SATA | ATA_FLAG_NO_LEGACY | - ATA_FLAG_SRST | ATA_FLAG_MMIO; + probe_ent->host_flags = ATA_FLAG_SATA | ATA_FLAG_SATA_RESET | + ATA_FLAG_NO_LEGACY | ATA_FLAG_MMIO; probe_ent->port_ops = &k2_sata_ops; - probe_ent->n_ports = 2; - probe_ent->irq = pdev->irq; - probe_ent->irq_flags = SA_SHIRQ; + probe_ent->n_ports = 2; + probe_ent->irq = pdev->irq; + probe_ent->irq_flags = SA_SHIRQ; probe_ent->mmio_base = mmio_base; /* --- diff/drivers/scsi/scsi.c 2003-11-25 15:24:58.000000000 +0000 +++ source/drivers/scsi/scsi.c 2003-11-26 10:09:06.000000000 +0000 @@ -367,6 +367,16 @@ unsigned long timeout; int rtn = 0; + /* check if the device is still usable */ + if (unlikely(cmd->device->sdev_state == SDEV_DEL)) { + /* in SDEV_DEL we error all commands. DID_NO_CONNECT + * returns an immediate error upwards, and signals + * that the device is no longer present */ + cmd->result = DID_NO_CONNECT << 16; + scsi_done(cmd); + /* return 0 (because the command has been processed) */ + goto out; + } /* Assign a unique nonzero serial_number. 
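The scsi.c hunk above depends on sdev_state holding a single enum value rather than the old bitmask, so "is this device dead?" becomes a plain equality test in the dispatch path. Roughly, the lifecycle this patch series moves to (state names are the ones used by the series; the comments paraphrase the behaviour visible in these hunks):

	/* one value at a time; equality tests replace set_bit/test_bit */
	enum scsi_device_state {
		SDEV_CREATED,	/* allocated, not yet visible in sysfs */
		SDEV_RUNNING,	/* registered and accepting commands */
		SDEV_CANCEL,	/* removal under way; commands erroring out */
		SDEV_DEL,	/* gone; fail everything immediately */
	};

A command that arrives for a SDEV_DEL device is completed at once with DID_NO_CONNECT, which the upper layers read as "device no longer present" without waiting for a timeout.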
*/ /* XXX(hch): this is racy */ if (++serial_number == 0) @@ -893,7 +903,7 @@ */ int scsi_device_get(struct scsi_device *sdev) { - if (test_bit(SDEV_DEL, &sdev->sdev_state)) + if (sdev->sdev_state == SDEV_DEL) return -ENXIO; if (!get_device(&sdev->sdev_gendev)) return -ENXIO; @@ -1015,7 +1025,7 @@ struct list_head *lh, *lh_sf; unsigned long flags; - set_bit(SDEV_CANCEL, &sdev->sdev_state); + sdev->sdev_state = SDEV_CANCEL; spin_lock_irqsave(&sdev->list_lock, flags); list_for_each_entry(scmd, &sdev->cmd_list, list) { --- diff/drivers/scsi/scsi_error.c 2003-09-30 15:46:17.000000000 +0100 +++ source/drivers/scsi/scsi_error.c 2003-11-26 10:09:06.000000000 +0000 @@ -911,7 +911,9 @@ if (rtn == SUCCESS) { scsi_sleep(BUS_RESET_SETTLE_TIME); + spin_lock_irqsave(scmd->device->host->host_lock, flags); scsi_report_bus_reset(scmd->device->host, scmd->device->channel); + spin_unlock_irqrestore(scmd->device->host->host_lock, flags); } return rtn; @@ -940,7 +942,9 @@ if (rtn == SUCCESS) { scsi_sleep(HOST_RESET_SETTLE_TIME); + spin_lock_irqsave(scmd->device->host->host_lock, flags); scsi_report_bus_reset(scmd->device->host, scmd->device->channel); + spin_unlock_irqrestore(scmd->device->host->host_lock, flags); } return rtn; @@ -1608,7 +1612,7 @@ * * Returns: Nothing * - * Lock status: No locks are assumed held. + * Lock status: Host lock must be held. * * Notes: This only needs to be called if the reset is one which * originates from an unknown location. Resets originated @@ -1622,7 +1626,7 @@ { struct scsi_device *sdev; - shost_for_each_device(sdev, shost) { + __shost_for_each_device(sdev, shost) { if (channel == sdev->channel) { sdev->was_reset = 1; sdev->expecting_cc_ua = 1; @@ -1642,7 +1646,7 @@ * * Returns: Nothing * - * Lock status: No locks are assumed held. + * Lock status: Host lock must be held. * * Notes: This only needs to be called if the reset is one which * originates from an unknown location. Resets originated @@ -1656,7 +1660,7 @@ { struct scsi_device *sdev; - shost_for_each_device(sdev, shost) { + __shost_for_each_device(sdev, shost) { if (channel == sdev->channel && target == sdev->id) { sdev->was_reset = 1; --- diff/drivers/scsi/scsi_lib.c 2003-11-25 15:24:58.000000000 +0000 +++ source/drivers/scsi/scsi_lib.c 2003-11-26 10:09:06.000000000 +0000 @@ -923,6 +923,22 @@ { struct scsi_device *sdev = q->queuedata; struct scsi_cmnd *cmd; + int specials_only = 0; + + if(unlikely(sdev->sdev_state != SDEV_RUNNING)) { + /* OK, we're not in a running state, so don't prep + * user commands */ + if(sdev->sdev_state == SDEV_DEL) { + /* Device is fully deleted, no commands + * at all allowed down */ + printk(KERN_ERR "scsi%d (%d:%d): rejecting I/O to dead device\n", + sdev->host->host_no, sdev->id, sdev->lun); + return BLKPREP_KILL; + } + /* OK, we only allow special commands (i.e. not + * user initiated ones) */ + specials_only = 1; + } /* * Find the actual device driver associated with this command. @@ -945,6 +961,14 @@ } else cmd = req->special; } else if (req->flags & (REQ_CMD | REQ_BLOCK_PC)) { + + if(unlikely(specials_only)) { + printk(KERN_ERR "scsi%d (%d:%d): rejecting I/O to device being removed\n", + sdev->host->host_no, sdev->id, sdev->lun); + return BLKPREP_KILL; + } + + /* * Just check to see if the device is online. 
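The request-prep changes above use that state as admission control: once a device leaves SDEV_RUNNING, user-originated I/O is failed at prep time, while internally generated "special" requests are still let through so error handling and teardown can finish. Reduced to its skeleton (a sketch, not the full function):

	static int foo_prep_fn(struct request_queue *q, struct request *req)
	{
		struct scsi_device *sdev = q->queuedata;
		int specials_only = 0;

		if (unlikely(sdev->sdev_state != SDEV_RUNNING)) {
			if (sdev->sdev_state == SDEV_DEL)
				return BLKPREP_KILL;	/* device is gone */
			specials_only = 1;	/* midlayer requests only */
		}

		/* REQ_CMD/REQ_BLOCK_PC mark user-originated requests */
		if (specials_only && (req->flags & (REQ_CMD | REQ_BLOCK_PC)))
			return BLKPREP_KILL;

		return BLKPREP_OK;
	}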
If * it isn't, we refuse to process ordinary commands @@ -1127,6 +1151,10 @@ struct scsi_cmnd *cmd; struct request *req; + if(!get_device(&sdev->sdev_gendev)) + /* We must be tearing the block queue down already */ + return; + /* * To start with, we keep looping until the queue is empty, or until * the host is no longer able to accept any more requests. @@ -1199,7 +1227,7 @@ } } - return; + goto out; not_ready: spin_unlock_irq(shost->host_lock); @@ -1217,6 +1245,12 @@ sdev->device_busy--; if(sdev->device_busy == 0) blk_plug_device(q); + out: + /* must be careful here...if we trigger the ->remove() function + * we cannot be holding the q lock */ + spin_unlock_irq(q->queue_lock); + put_device(&sdev->sdev_gendev); + spin_lock_irq(q->queue_lock); } u64 scsi_calculate_bounce_limit(struct Scsi_Host *shost) --- diff/drivers/scsi/scsi_priv.h 2003-10-27 09:20:43.000000000 +0000 +++ source/drivers/scsi/scsi_priv.h 2003-11-26 10:09:06.000000000 +0000 @@ -130,7 +130,6 @@ extern int scsi_scan_host_selected(struct Scsi_Host *, unsigned int, unsigned int, unsigned int, int); extern void scsi_forget_host(struct Scsi_Host *); -extern void scsi_free_sdev(struct scsi_device *); extern void scsi_rescan_device(struct device *); /* scsi_sysctl.c */ @@ -143,7 +142,8 @@ #endif /* CONFIG_SYSCTL */ /* scsi_sysfs.c */ -extern int scsi_device_register(struct scsi_device *); +extern void scsi_device_dev_release(struct device *); +extern int scsi_sysfs_add_sdev(struct scsi_device *); extern int scsi_sysfs_add_host(struct Scsi_Host *); extern int scsi_sysfs_register(void); extern void scsi_sysfs_unregister(void); --- diff/drivers/scsi/scsi_scan.c 2003-11-25 15:24:58.000000000 +0000 +++ source/drivers/scsi/scsi_scan.c 2003-11-26 10:09:06.000000000 +0000 @@ -205,6 +205,7 @@ sdev->lun = lun; sdev->channel = channel; sdev->online = TRUE; + sdev->sdev_state = SDEV_CREATED; INIT_LIST_HEAD(&sdev->siblings); INIT_LIST_HEAD(&sdev->same_target_siblings); INIT_LIST_HEAD(&sdev->cmd_list); @@ -236,6 +237,25 @@ goto out_free_queue; } + if (get_device(&sdev->host->shost_gendev)) { + + device_initialize(&sdev->sdev_gendev); + sdev->sdev_gendev.parent = &sdev->host->shost_gendev; + sdev->sdev_gendev.bus = &scsi_bus_type; + sdev->sdev_gendev.release = scsi_device_dev_release; + sprintf(sdev->sdev_gendev.bus_id,"%d:%d:%d:%d", + sdev->host->host_no, sdev->channel, sdev->id, + sdev->lun); + + class_device_initialize(&sdev->sdev_classdev); + sdev->sdev_classdev.dev = &sdev->sdev_gendev; + sdev->sdev_classdev.class = &sdev_class; + snprintf(sdev->sdev_classdev.class_id, BUS_ID_SIZE, + "%d:%d:%d:%d", sdev->host->host_no, + sdev->channel, sdev->id, sdev->lun); + } else + goto out_free_queue; + /* * If there are any same target siblings, add this to the * sibling list @@ -273,36 +293,6 @@ } /** - * scsi_free_sdev - cleanup and free a scsi_device - * @sdev: cleanup and free this scsi_device - * - * Description: - * Undo the actions in scsi_alloc_sdev, including removing @sdev from - * the list, and freeing @sdev. 
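The scsi_lib.c hunk above pins the scsi_device for the whole request function: put_device() can drop the last reference and run the release callback, and that must never happen while the queue lock is held. The shape of the guard, condensed from the hunk:

	static void foo_request_fn(struct request_queue *q)
	{
		struct scsi_device *sdev = q->queuedata;

		/* a failed get_device() means teardown already started */
		if (!get_device(&sdev->sdev_gendev))
			return;

		/* ... dispatch requests; all exits funnel to out: ... */
 out:
		/* the final put may free sdev via ->release(), so drop
		 * the queue lock around it */
		spin_unlock_irq(q->queue_lock);
		put_device(&sdev->sdev_gendev);
		spin_lock_irq(q->queue_lock);
	}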
- **/ -void scsi_free_sdev(struct scsi_device *sdev) -{ - unsigned long flags; - - spin_lock_irqsave(sdev->host->host_lock, flags); - list_del(&sdev->siblings); - list_del(&sdev->same_target_siblings); - spin_unlock_irqrestore(sdev->host->host_lock, flags); - - if (sdev->request_queue) - scsi_free_queue(sdev->request_queue); - - spin_lock_irqsave(sdev->host->host_lock, flags); - list_del(&sdev->starved_entry); - if (sdev->single_lun && --sdev->sdev_target->starget_refcnt == 0) - kfree(sdev->sdev_target); - spin_unlock_irqrestore(sdev->host->host_lock, flags); - - kfree(sdev->inquiry); - kfree(sdev); -} - -/** * scsi_probe_lun - probe a single LUN using a SCSI INQUIRY * @sreq: used to send the INQUIRY * @inq_result: area to store the INQUIRY result @@ -642,7 +632,7 @@ * register it and tell the rest of the kernel * about it. */ - scsi_device_register(sdev); + scsi_sysfs_add_sdev(sdev); return SCSI_SCAN_LUN_PRESENT; } @@ -748,8 +738,11 @@ if (res == SCSI_SCAN_LUN_PRESENT) { if (sdevp) *sdevp = sdev; - } else - scsi_free_sdev(sdev); + } else { + if (sdev->host->hostt->slave_destroy) + sdev->host->hostt->slave_destroy(sdev); + put_device(&sdev->sdev_gendev); + } out: return res; } @@ -1301,5 +1294,8 @@ void scsi_free_host_dev(struct scsi_device *sdev) { BUG_ON(sdev->id != sdev->host->this_id); - scsi_free_sdev(sdev); + + if (sdev->host->hostt->slave_destroy) + sdev->host->hostt->slave_destroy(sdev); + put_device(&sdev->sdev_gendev); } --- diff/drivers/scsi/scsi_sysfs.c 2003-11-25 15:24:58.000000000 +0000 +++ source/drivers/scsi/scsi_sysfs.c 2003-11-26 10:09:06.000000000 +0000 @@ -115,14 +115,29 @@ put_device(&sdev->sdev_gendev); } -static void scsi_device_dev_release(struct device *dev) +void scsi_device_dev_release(struct device *dev) { struct scsi_device *sdev; struct device *parent; + unsigned long flags; parent = dev->parent; sdev = to_scsi_device(dev); - scsi_free_sdev(sdev); + + spin_lock_irqsave(sdev->host->host_lock, flags); + list_del(&sdev->siblings); + list_del(&sdev->same_target_siblings); + list_del(&sdev->starved_entry); + if (sdev->single_lun && --sdev->sdev_target->starget_refcnt == 0) + kfree(sdev->sdev_target); + spin_unlock_irqrestore(sdev->host->host_lock, flags); + + if (sdev->request_queue) + scsi_free_queue(sdev->request_queue); + + kfree(sdev->inquiry); + kfree(sdev); + put_device(parent); } @@ -321,29 +336,20 @@ } /** - * scsi_device_register - register a scsi device with the scsi bus - * @sdev: scsi_device to register + * scsi_sysfs_add_sdev - add scsi device to sysfs + * @sdev: scsi_device to add * * Return value: * 0 on Success / non-zero on Failure **/ -int scsi_device_register(struct scsi_device *sdev) +int scsi_sysfs_add_sdev(struct scsi_device *sdev) { - int error = 0, i; + int error = -EINVAL, i; + + if (sdev->sdev_state != SDEV_CREATED) + return error; - set_bit(SDEV_ADD, &sdev->sdev_state); - device_initialize(&sdev->sdev_gendev); - sprintf(sdev->sdev_gendev.bus_id,"%d:%d:%d:%d", - sdev->host->host_no, sdev->channel, sdev->id, sdev->lun); - sdev->sdev_gendev.parent = &sdev->host->shost_gendev; - sdev->sdev_gendev.bus = &scsi_bus_type; - sdev->sdev_gendev.release = scsi_device_dev_release; - - class_device_initialize(&sdev->sdev_classdev); - sdev->sdev_classdev.dev = &sdev->sdev_gendev; - sdev->sdev_classdev.class = &sdev_class; - snprintf(sdev->sdev_classdev.class_id, BUS_ID_SIZE, "%d:%d:%d:%d", - sdev->host->host_no, sdev->channel, sdev->id, sdev->lun); + sdev->sdev_state = SDEV_RUNNING; error = device_add(&sdev->sdev_gendev); if (error) { @@ -351,8 +357,6 
@@ return error; } - get_device(sdev->sdev_gendev.parent); - error = class_device_add(&sdev->sdev_classdev); if (error) { printk(KERN_INFO "error 2\n"); @@ -384,8 +388,11 @@ return error; clean_device: + sdev->sdev_state = SDEV_CANCEL; + device_del(&sdev->sdev_gendev); put_device(&sdev->sdev_gendev); + return error; } @@ -396,12 +403,14 @@ **/ void scsi_remove_device(struct scsi_device *sdev) { - class_device_unregister(&sdev->sdev_classdev); - set_bit(SDEV_DEL, &sdev->sdev_state); - if (sdev->host->hostt->slave_destroy) - sdev->host->hostt->slave_destroy(sdev); - device_del(&sdev->sdev_gendev); - put_device(&sdev->sdev_gendev); + if (sdev->sdev_state == SDEV_RUNNING || sdev->sdev_state == SDEV_CANCEL) { + sdev->sdev_state = SDEV_DEL; + class_device_unregister(&sdev->sdev_classdev); + device_del(&sdev->sdev_gendev); + if (sdev->host->hostt->slave_destroy) + sdev->host->hostt->slave_destroy(sdev); + put_device(&sdev->sdev_gendev); + } } int scsi_register_driver(struct device_driver *drv) --- diff/drivers/scsi/sd.c 2003-10-27 09:20:43.000000000 +0000 +++ source/drivers/scsi/sd.c 2003-11-26 10:09:06.000000000 +0000 @@ -62,6 +62,7 @@ */ #define SD_MAJORS 16 #define SD_DISKS (SD_MAJORS << 4) +#define TOTAL_SD_DISKS CONFIG_MAX_SD_DISKS /* * Time out in seconds for disks and Magneto-opticals (which are slower). @@ -95,7 +96,7 @@ }; -static unsigned long sd_index_bits[SD_DISKS / BITS_PER_LONG]; +static unsigned long sd_index_bits[TOTAL_SD_DISKS / BITS_PER_LONG]; static spinlock_t sd_index_lock = SPIN_LOCK_UNLOCKED; static int sd_revalidate_disk(struct gendisk *disk); @@ -130,6 +131,9 @@ return SCSI_DISK1_MAJOR + major_idx - 1; case 8 ... 15: return SCSI_DISK8_MAJOR + major_idx - 8; +#define MAX_IDX (TOTAL_SD_DISKS >> 4) + case 16 ... MAX_IDX: + return SCSI_DISK15_MAJOR; default: BUG(); return 0; /* shut up gcc */ @@ -378,9 +382,9 @@ * In the latter case @inode and @filp carry an abridged amount * of information as noted above. **/ -static int sd_open(struct inode *inode, struct file *filp) +static int sd_open(struct block_device *bdev, struct file *filp) { - struct gendisk *disk = inode->i_bdev->bd_disk; + struct gendisk *disk = bdev->bd_disk; struct scsi_disk *sdkp = scsi_disk(disk); struct scsi_device *sdev; int retval; @@ -402,7 +406,7 @@ goto error_out; if (sdev->removable || sdkp->write_prot) - check_disk_change(inode->i_bdev); + check_disk_change(bdev); /* * If the drive is empty, just let the open fail. @@ -453,9 +457,8 @@ * Note: may block (uninterruptible) if error recovery is underway * on this disk. **/ -static int sd_release(struct inode *inode, struct file *filp) +static int sd_release(struct gendisk *disk) { - struct gendisk *disk = inode->i_bdev->bd_disk; struct scsi_disk *sdkp = scsi_disk(disk); struct scsi_device *sdev = sdkp->device; @@ -518,10 +521,9 @@ * Note: most ioctls are forward onto the block subsystem or further * down in the scsi subsytem. 
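The sd.c hunks just below raise the disk count beyond what two-letter names can express, so the name generator becomes a small base-26 scheme: sda..sdz, then sdaa..sdzz, then sdaaa and up, with the -1 offsets appearing because there is no "zero" letter. As a standalone check of that arithmetic (hypothetical helper, same formulas as the hunk):

	/* index 0 -> "sda", 25 -> "sdz", 26 -> "sdaa",
	 * 701 -> "sdzz", 702 -> "sdaaa" */
	static void sd_format_name(char *buf, unsigned int index)
	{
		if (index < 26)
			sprintf(buf, "sd%c", 'a' + index % 26);
		else if (index < (26 * 27))
			sprintf(buf, "sd%c%c",
				'a' + index / 26 - 1, 'a' + index % 26);
		else
			sprintf(buf, "sd%c%c%c",
				'a' + (index / 26 - 1) / 26 - 1,
				'a' + (index / 26 - 1) % 26,
				'a' + index % 26);
	}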
**/ -static int sd_ioctl(struct inode * inode, struct file * filp, +static int sd_ioctl(struct block_device *bdev, struct file *filp, unsigned int cmd, unsigned long arg) { - struct block_device *bdev = inode->i_bdev; struct gendisk *disk = bdev->bd_disk; struct scsi_device *sdp = scsi_disk(disk)->device; int error; @@ -1320,8 +1322,8 @@ goto out_free; spin_lock(&sd_index_lock); - index = find_first_zero_bit(sd_index_bits, SD_DISKS); - if (index == SD_DISKS) { + index = find_first_zero_bit(sd_index_bits, TOTAL_SD_DISKS); + if (index == TOTAL_SD_DISKS) { spin_unlock(&sd_index_lock); error = -EBUSY; goto out_put; @@ -1336,15 +1338,24 @@ sdkp->openers = 0; gd->major = sd_major(index >> 4); - gd->first_minor = (index & 15) << 4; + if (index > SD_DISKS) + gd->first_minor = ((index - SD_DISKS) & 15) << 4; + else + gd->first_minor = (index & 15) << 4; gd->minors = 16; gd->fops = &sd_fops; - if (index >= 26) { + if (index < 26) { + sprintf(gd->disk_name, "sd%c", 'a' + index % 26); + } else if (index < (26*27)) { sprintf(gd->disk_name, "sd%c%c", 'a' + index/26-1,'a' + index % 26); } else { - sprintf(gd->disk_name, "sd%c", 'a' + index % 26); + const unsigned int m1 = (index/ 26 - 1) / 26 - 1; + const unsigned int m2 = (index / 26 - 1) % 26; + const unsigned int m3 = index % 26; + sprintf(gd->disk_name, "sd%c%c%c", + 'a' + m1, 'a' + m2, 'a' + m3); } strcpy(gd->devfs_name, sdp->devfs_name); --- diff/drivers/scsi/sg.c 2003-11-25 15:24:58.000000000 +0000 +++ source/drivers/scsi/sg.c 2003-11-26 10:09:06.000000000 +0000 @@ -1118,7 +1118,7 @@ } static struct page * -sg_vma_nopage(struct vm_area_struct *vma, unsigned long addr, int unused) +sg_vma_nopage(struct vm_area_struct *vma, unsigned long addr, int *type) { Sg_fd *sfp; struct page *page = NOPAGE_SIGBUS; @@ -1158,6 +1158,8 @@ page = virt_to_page(page_ptr); get_page(page); /* increment page count */ } + if (type) + *type = VM_FAULT_MINOR; return page; } --- diff/drivers/scsi/sr.c 2003-10-27 09:20:43.000000000 +0000 +++ source/drivers/scsi/sr.c 2003-11-26 10:09:06.000000000 +0000 @@ -413,22 +413,22 @@ return 1; } -static int sr_block_open(struct inode *inode, struct file *file) +static int sr_block_open(struct block_device *bdev, struct file *file) { - struct scsi_cd *cd = scsi_cd(inode->i_bdev->bd_disk); - return cdrom_open(&cd->cdi, inode, file); + struct scsi_cd *cd = scsi_cd(bdev->bd_disk); + return cdrom_open(&cd->cdi, bdev, file); } -static int sr_block_release(struct inode *inode, struct file *file) +static int sr_block_release(struct gendisk *disk) { - struct scsi_cd *cd = scsi_cd(inode->i_bdev->bd_disk); - return cdrom_release(&cd->cdi, file); + struct scsi_cd *cd = scsi_cd(disk); + return cdrom_release(&cd->cdi); } -static int sr_block_ioctl(struct inode *inode, struct file *file, unsigned cmd, - unsigned long arg) +static int sr_block_ioctl(struct block_device *bdev, struct file *file, + unsigned cmd, unsigned long arg) { - struct scsi_cd *cd = scsi_cd(inode->i_bdev->bd_disk); + struct scsi_cd *cd = scsi_cd(bdev->bd_disk); struct scsi_device *sdev = cd->device; /* @@ -440,7 +440,7 @@ case SCSI_IOCTL_GET_BUS_NUMBER: return scsi_ioctl(sdev, cmd, (void *)arg); } - return cdrom_ioctl(&cd->cdi, inode, cmd, arg); + return cdrom_ioctl(&cd->cdi, bdev, cmd, arg); } static int sr_block_media_changed(struct gendisk *disk) --- diff/drivers/scsi/sym53c8xx_2/sym53c8xx.h 2003-09-30 15:46:17.000000000 +0100 +++ source/drivers/scsi/sym53c8xx_2/sym53c8xx.h 2003-11-26 10:09:06.000000000 +0000 @@ -55,19 +55,7 @@ #include <linux/config.h> -/* - * Use normal IO if 
configured. - * Normal IO forced for alpha. - * Forced to MMIO for sparc. - */ -#if defined(__alpha__) -#define SYM_CONF_IOMAPPED -#elif defined(__sparc__) -#undef SYM_CONF_IOMAPPED -/* #elif defined(__powerpc__) */ -/* #define SYM_CONF_IOMAPPED */ -/* #define SYM_OPT_NO_BUS_MEMORY_MAPPING */ -#elif defined(CONFIG_SCSI_SYM53C8XX_IOMAPPED) +#ifdef CONFIG_SCSI_SYM53C8XX_IOMAPPED #define SYM_CONF_IOMAPPED #endif @@ -93,8 +81,6 @@ */ #if 1 #define SYM_CONF_NVRAM_SUPPORT (1) -#define SYM_SETUP_SYMBIOS_NVRAM (1) -#define SYM_SETUP_TEKRAM_NVRAM (1) #endif /* --- diff/drivers/scsi/sym53c8xx_2/sym_fw.c 2003-09-17 12:28:10.000000000 +0100 +++ source/drivers/scsi/sym53c8xx_2/sym_fw.c 2003-11-26 10:09:06.000000000 +0000 @@ -89,9 +89,6 @@ }; static struct sym_fwz_ofs sym_fw1z_ofs = { SYM_GEN_FW_Z(struct SYM_FWZ_SCR) -#ifdef SYM_OPT_NO_BUS_MEMORY_MAPPING - SYM_GEN_Z(struct SYM_FWZ_SCR, start_ram) -#endif }; #undef SYM_FWA_SCR #undef SYM_FWB_SCR @@ -122,10 +119,6 @@ }; static struct sym_fwz_ofs sym_fw2z_ofs = { SYM_GEN_FW_Z(struct SYM_FWZ_SCR) -#ifdef SYM_OPT_NO_BUS_MEMORY_MAPPING - SYM_GEN_Z(struct SYM_FWZ_SCR, start_ram) - SYM_GEN_Z(struct SYM_FWZ_SCR, start_ram64) -#endif }; #undef SYM_FWA_SCR #undef SYM_FWB_SCR @@ -146,22 +139,10 @@ { struct sym_fw1a_scr *scripta0; struct sym_fw1b_scr *scriptb0; -#ifdef SYM_OPT_NO_BUS_MEMORY_MAPPING - struct sym_fw1z_scr *scriptz0 = - (struct sym_fw1z_scr *) np->scriptz0; -#endif scripta0 = (struct sym_fw1a_scr *) np->scripta0; scriptb0 = (struct sym_fw1b_scr *) np->scriptb0; -#ifdef SYM_OPT_NO_BUS_MEMORY_MAPPING - /* - * Set up BUS physical address of SCRIPTS that is to - * be copied to on-chip RAM by the SCRIPTS processor. - */ - scriptz0->scripta0_ba[0] = cpu_to_scr(vtobus(scripta0)); -#endif - /* * Remove LED support if not needed. */ @@ -199,25 +180,10 @@ { struct sym_fw2a_scr *scripta0; struct sym_fw2b_scr *scriptb0; -#ifdef SYM_OPT_NO_BUS_MEMORY_MAPPING - struct sym_fw2z_scr *scriptz0 = - (struct sym_fw2z_scr *) np->scriptz0; -#endif scripta0 = (struct sym_fw2a_scr *) np->scripta0; scriptb0 = (struct sym_fw2b_scr *) np->scriptb0; -#ifdef SYM_OPT_NO_BUS_MEMORY_MAPPING - /* - * Set up BUS physical address of SCRIPTS that is to - * be copied to on-chip RAM by the SCRIPTS processor. - */ - scriptz0->scripta0_ba64[0] = /* Nothing is missing here */ - scriptz0->scripta0_ba[0] = cpu_to_scr(vtobus(scripta0)); - scriptz0->scriptb0_ba64[0] = cpu_to_scr(vtobus(scriptb0)); - scriptz0->ram_seg64[0] = np->scr_ram_seg; -#endif - /* * Remove LED support if not needed. 
*/ --- diff/drivers/scsi/sym53c8xx_2/sym_fw.h 2002-10-16 04:28:23.000000000 +0100 +++ source/drivers/scsi/sym53c8xx_2/sym_fw.h 2003-11-26 10:09:06.000000000 +0000 @@ -113,10 +113,6 @@ }; struct sym_fwz_ofs { SYM_GEN_FW_Z(u_short) -#ifdef SYM_OPT_NO_BUS_MEMORY_MAPPING - SYM_GEN_Z(u_short, start_ram) - SYM_GEN_Z(u_short, start_ram64) -#endif }; /* @@ -136,10 +132,6 @@ }; struct sym_fwz_ba { SYM_GEN_FW_Z(u32) -#ifdef SYM_OPT_NO_BUS_MEMORY_MAPPING - SYM_GEN_Z(u32, start_ram) - SYM_GEN_Z(u32, start_ram64) -#endif }; #undef SYM_GEN_A #undef SYM_GEN_B --- diff/drivers/scsi/sym53c8xx_2/sym_fw1.h 2003-05-21 11:49:46.000000000 +0100 +++ source/drivers/scsi/sym53c8xx_2/sym_fw1.h 2003-11-26 10:09:06.000000000 +0000 @@ -234,10 +234,6 @@ struct SYM_FWZ_SCR { u32 snooptest [ 9]; u32 snoopend [ 2]; -#ifdef SYM_OPT_NO_BUS_MEMORY_MAPPING - u32 start_ram [ 1]; - u32 scripta0_ba [ 4]; -#endif }; static struct SYM_FWA_SCR SYM_FWA_SCR = { @@ -1851,24 +1847,5 @@ */ SCR_INT, 99, -#ifdef SYM_OPT_NO_BUS_MEMORY_MAPPING - /* - * We may use MEMORY MOVE instructions to load the on chip-RAM, - * if it happens that mapping PCI memory is not possible. - * But writing the RAM from the CPU is the preferred method, - * since PCI 2.2 seems to disallow PCI self-mastering. - */ -}/*-------------------------< START_RAM >------------------------*/,{ - /* - * Load the script into on-chip RAM, - * and jump to start point. - */ - SCR_COPY (sizeof(struct SYM_FWA_SCR)), -}/*-------------------------< SCRIPTA0_BA >----------------------*/,{ - 0, - PADDR_A (start), - SCR_JUMP, - PADDR_A (init), -#endif /* SYM_OPT_NO_BUS_MEMORY_MAPPING */ }/*--------------------------<>----------------------------------*/ }; --- diff/drivers/scsi/sym53c8xx_2/sym_fw2.h 2003-05-21 11:49:46.000000000 +0100 +++ source/drivers/scsi/sym53c8xx_2/sym_fw2.h 2003-11-26 10:09:06.000000000 +0000 @@ -228,14 +228,6 @@ struct SYM_FWZ_SCR { u32 snooptest [ 6]; u32 snoopend [ 2]; -#ifdef SYM_OPT_NO_BUS_MEMORY_MAPPING - u32 start_ram [ 1]; - u32 scripta0_ba [ 4]; - u32 start_ram64 [ 3]; - u32 scripta0_ba64 [ 3]; - u32 scriptb0_ba64 [ 6]; - u32 ram_seg64 [ 1]; -#endif }; static struct SYM_FWA_SCR SYM_FWA_SCR = { @@ -1944,51 +1936,5 @@ */ SCR_INT, 99, -#ifdef SYM_OPT_NO_BUS_MEMORY_MAPPING - /* - * We may use MEMORY MOVE instructions to load the on chip-RAM, - * if it happens that mapping PCI memory is not possible. - * But writing the RAM from the CPU is the preferred method, - * since PCI 2.2 seems to disallow PCI self-mastering. - */ -}/*-------------------------< START_RAM >------------------------*/,{ - /* - * Load the script into on-chip RAM, - * and jump to start point. - */ - SCR_COPY (sizeof(struct SYM_FWA_SCR)), -}/*-------------------------< SCRIPTA0_BA >----------------------*/,{ - 0, - PADDR_A (start), - SCR_JUMP, - PADDR_A (init), -}/*-------------------------< START_RAM64 >----------------------*/,{ - /* - * Load the RAM and start for 64 bit PCI (895A,896). - * Both scripts (script and scripth) are loaded into - * the RAM which is 8K (4K for 825A/875/895). - * We also need to load some 32-63 bit segments - * address of the SCRIPTS processor. - * LOAD/STORE ABSOLUTE always refers to on-chip RAM - * in our implementation. The main memory is - * accessed using LOAD/STORE DSA RELATIVE. 
- */ - SCR_LOAD_REL (mmws, 4), - offsetof (struct sym_hcb, scr_ram_seg), - SCR_COPY (sizeof(struct SYM_FWA_SCR)), -}/*-------------------------< SCRIPTA0_BA64 >--------------------*/,{ - 0, - PADDR_A (start), - SCR_COPY (sizeof(struct SYM_FWB_SCR)), -}/*-------------------------< SCRIPTB0_BA64 >--------------------*/,{ - 0, - PADDR_B (start64), - SCR_LOAD_REL (mmrs, 4), - offsetof (struct sym_hcb, scr_ram_seg), - SCR_JUMP64, - PADDR_B (start64), -}/*-------------------------< RAM_SEG64 >------------------------*/,{ - 0, -#endif /* SYM_OPT_NO_BUS_MEMORY_MAPPING */ }/*-------------------------<>-----------------------------------*/ }; --- diff/drivers/scsi/sym53c8xx_2/sym_glue.c 2003-10-09 09:47:16.000000000 +0100 +++ source/drivers/scsi/sym53c8xx_2/sym_glue.c 2003-11-26 10:09:06.000000000 +0000 @@ -167,34 +167,16 @@ #define SYM_SCMD_PTR(ucmd) sym_que_entry(ucmd, struct scsi_cmnd, SCp) #define SYM_SOFTC_PTR(cmd) (((struct host_data *)cmd->device->host->hostdata)->ncb) -/* - * Deal with DMA mapping/unmapping. - */ -#define bus_unmap_sg(pdev, sgptr, sgcnt, dir) \ - pci_unmap_sg(pdev, sgptr, sgcnt, dir) -#define bus_unmap_single(pdev, mapping, bufptr, dir) \ - pci_unmap_single(pdev, mapping, bufptr, dir) -#define bus_map_single(pdev, bufptr, bufsiz, dir) \ - pci_map_single(pdev, bufptr, bufsiz, dir) -#define bus_map_sg(pdev, sgptr, sgcnt, dir) \ - pci_map_sg(pdev, sgptr, sgcnt, dir) -#define bus_dma_sync_sg(pdev, sgptr, sgcnt, dir) \ - pci_dma_sync_sg(pdev, sgptr, sgcnt, dir) -#define bus_dma_sync_single(pdev, mapping, bufsiz, dir) \ - pci_dma_sync_single(pdev, mapping, bufsiz, dir) -#define bus_sg_dma_address(sc) sg_dma_address(sc) -#define bus_sg_dma_len(sc) sg_dma_len(sc) - static void __unmap_scsi_data(struct pci_dev *pdev, struct scsi_cmnd *cmd) { int dma_dir = scsi_to_pci_dma_dir(cmd->sc_data_direction); switch(SYM_UCMD_PTR(cmd)->data_mapped) { case 2: - bus_unmap_sg(pdev, cmd->buffer, cmd->use_sg, dma_dir); + pci_unmap_sg(pdev, cmd->buffer, cmd->use_sg, dma_dir); break; case 1: - bus_unmap_single(pdev, SYM_UCMD_PTR(cmd)->data_mapping, + pci_unmap_single(pdev, SYM_UCMD_PTR(cmd)->data_mapping, cmd->request_bufflen, dma_dir); break; } @@ -206,7 +188,7 @@ dma_addr_t mapping; int dma_dir = scsi_to_pci_dma_dir(cmd->sc_data_direction); - mapping = bus_map_single(pdev, cmd->request_buffer, + mapping = pci_map_single(pdev, cmd->request_buffer, cmd->request_bufflen, dma_dir); if (mapping) { SYM_UCMD_PTR(cmd)->data_mapped = 1; @@ -221,7 +203,7 @@ int use_sg; int dma_dir = scsi_to_pci_dma_dir(cmd->sc_data_direction); - use_sg = bus_map_sg(pdev, cmd->buffer, cmd->use_sg, dma_dir); + use_sg = pci_map_sg(pdev, cmd->buffer, cmd->use_sg, dma_dir); if (use_sg > 0) { SYM_UCMD_PTR(cmd)->data_mapped = 2; SYM_UCMD_PTR(cmd)->data_mapping = use_sg; @@ -236,10 +218,10 @@ switch(SYM_UCMD_PTR(cmd)->data_mapped) { case 2: - bus_dma_sync_sg(pdev, cmd->buffer, cmd->use_sg, dma_dir); + pci_dma_sync_sg(pdev, cmd->buffer, cmd->use_sg, dma_dir); break; case 1: - bus_dma_sync_single(pdev, SYM_UCMD_PTR(cmd)->data_mapping, + pci_dma_sync_single(pdev, SYM_UCMD_PTR(cmd)->data_mapping, cmd->request_bufflen, dma_dir); break; } @@ -469,8 +451,8 @@ data = &cp->phys.data[SYM_CONF_MAX_SG - use_sg]; for (segment = 0; segment < use_sg; segment++) { - dma_addr_t baddr = bus_sg_dma_address(&scatter[segment]); - unsigned int len = bus_sg_dma_len(&scatter[segment]); + dma_addr_t baddr = sg_dma_address(&scatter[segment]); + unsigned int len = sg_dma_len(&scatter[segment]); sym_build_sge(np, &data[segment], baddr, len); cp->data_len += 
len; @@ -1595,10 +1577,8 @@ if (np->s.mmio_va) iounmap(np->s.mmio_va); #endif -#ifndef SYM_OPT_NO_BUS_MEMORY_MAPPING if (np->s.ram_va) iounmap(np->s.ram_va); -#endif /* * Free O/S independent resources. */ @@ -1650,14 +1630,13 @@ * If all is OK, install interrupt handling and * start the timer daemon. */ -static int __devinit -sym_attach (struct scsi_host_template *tpnt, int unit, struct sym_device *dev) +static struct Scsi_Host * __devinit sym_attach(struct scsi_host_template *tpnt, + int unit, struct sym_device *dev) { struct host_data *host_data; struct sym_hcb *np = NULL; struct Scsi_Host *instance = NULL; unsigned long flags; - struct sym_nvram *nvram = dev->nvram; struct sym_fw *fw; printk(KERN_INFO @@ -1762,20 +1741,18 @@ np->ram_ws = 8192; else np->ram_ws = 4096; -#ifndef SYM_OPT_NO_BUS_MEMORY_MAPPING np->s.ram_va = ioremap(dev->s.base_2_c, np->ram_ws); if (!np->s.ram_va) { printf_err("%s: can't map PCI MEMORY region\n", sym_name(np)); goto attach_failed; } -#endif } /* * Perform O/S independent stuff. */ - if (sym_hcb_attach(np, fw, nvram)) + if (sym_hcb_attach(np, fw, dev->nvram)) goto attach_failed; @@ -1843,13 +1820,7 @@ spin_unlock_irqrestore(instance->host_lock, flags); - /* - * Now let the generic SCSI driver - * look for the SCSI devices on the bus .. - */ - scsi_add_host(instance, &dev->pdev->dev); /* XXX: handle failure */ - scsi_scan_host(instance); - return 0; + return instance; reset_failed: printf_err("%s: FATAL ERROR: CHECK SCSI BUS - CABLES, " @@ -1857,13 +1828,13 @@ spin_unlock_irqrestore(instance->host_lock, flags); attach_failed: if (!instance) - return -1; + return NULL; printf_info("%s: giving up ...\n", sym_name(np)); if (np) sym_free_resources(np); scsi_host_put(instance); - return -1; + return NULL; } @@ -2115,7 +2086,7 @@ * Ignore Symbios chips controlled by various RAID controllers. * These controllers set value 0x52414944 at RAM end - 16. */ -#if defined(__i386__) && !defined(SYM_OPT_NO_BUS_MEMORY_MAPPING) +#if defined(__i386__) if (base_2_c) { unsigned int ram_size, ram_val; void *ram_ptr; @@ -2202,12 +2173,9 @@ /* - * Linux release module stuff. - * * Called before unloading the module. * Detach the host. * We have to free resources and halt the NCR chip. - * */ static int __devexit sym_detach(struct sym_hcb *np) { @@ -2216,18 +2184,15 @@ del_timer_sync(&np->s.timer); /* - * Reset NCR chip. - * We should use sym_soft_reset(), but we donnot want to do - * so, since we may not be safe if interrupts occur. + * Reset NCR chip. + * We should use sym_soft_reset(), but we don't want to do + * so, since we may not be safe if interrupts occur. 
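With sym_attach() now returning the Scsi_Host (or NULL on failure) instead of registering it itself, midlayer registration moves out into the PCI probe routine, which picks up the usual unwind ladder; the sym2_probe hunk below shows the real thing. In outline (error paths condensed, setup steps elided):

	static int __devinit foo_probe(struct pci_dev *pdev,
				       const struct pci_device_id *ent)
	{
		struct Scsi_Host *host;

		/* ... pci_enable_device(), pci_request_regions(),
		 * NVRAM reading ... */

		host = foo_attach(pdev);	/* NULL on failure */
		if (!host)
			goto free;
		if (scsi_add_host(host, &pdev->dev))
			goto detach;
		scsi_scan_host(host);		/* probe the bus for LUNs */
		return 0;

	 detach:
		foo_detach(pdev);
	 free:
		pci_release_regions(pdev);
		pci_disable_device(pdev);
		return -ENODEV;
	}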
*/ printk("%s: resetting chip\n", sym_name(np)); OUTB (nc_istat, SRST); UDELAY (10); OUTB (nc_istat, 0); - /* - * Free host resources - */ sym_free_resources(np); return 1; @@ -2336,6 +2301,7 @@ { struct sym_device sym_dev; struct sym_nvram nvram; + struct Scsi_Host *instance; memset(&sym_dev, 0, sizeof(sym_dev)); memset(&nvram, 0, sizeof(nvram)); @@ -2354,12 +2320,20 @@ sym_get_nvram(&sym_dev, &nvram); - if (sym_attach(&sym2_template, attach_count, &sym_dev)) + instance = sym_attach(&sym2_template, attach_count, &sym_dev); + if (!instance) goto free; + if (scsi_add_host(instance, &pdev->dev)) + goto detach; + scsi_scan_host(instance); + attach_count++; + return 0; + detach: + sym_detach(pci_get_drvdata(pdev)); free: pci_release_regions(pdev); disable: @@ -2369,7 +2343,13 @@ static void __devexit sym2_remove(struct pci_dev *pdev) { - sym_detach(pci_get_drvdata(pdev)); + struct sym_hcb *np = pci_get_drvdata(pdev); + struct Scsi_Host *host = np->s.host; + + scsi_remove_host(host); + scsi_host_put(host); + + sym_detach(np); pci_release_regions(pdev); pci_disable_device(pdev); --- diff/drivers/scsi/sym53c8xx_2/sym_glue.h 2003-10-09 09:47:34.000000000 +0100 +++ source/drivers/scsi/sym53c8xx_2/sym_glue.h 2003-11-26 10:09:06.000000000 +0000 @@ -74,10 +74,6 @@ #define bzero(d, n) memset((d), 0, (n)) #endif -#ifndef bcmp -#define bcmp(a, b, n) memcmp((a), (b), (n)) -#endif - /* * General driver includes. */ @@ -96,7 +92,6 @@ #define SYM_OPT_SNIFF_INQUIRY #define SYM_OPT_LIMIT_COMMAND_REORDERING #define SYM_OPT_ANNOUNCE_TRANSFER_RATE -#define SYM_OPT_BUS_DMA_ABSTRACTION /* * Print a message with severity. --- diff/drivers/scsi/sym53c8xx_2/sym_hipd.c 2003-10-09 09:47:34.000000000 +0100 +++ source/drivers/scsi/sym53c8xx_2/sym_hipd.c 2003-11-26 10:09:06.000000000 +0000 @@ -50,7 +50,7 @@ * SUCH DAMAGE. */ -#define SYM_DRIVER_NAME "sym-2.1.18b" +#define SYM_DRIVER_NAME "sym-2.1.18f" #ifdef __FreeBSD__ #include <dev/sym/sym_glue.h> @@ -751,8 +751,6 @@ &np->maxwide, &scsi_mode)) return period; - printk("scsi_mode = %d, period = %ld\n", scsi_mode, pdc_period); - if (scsi_mode >= 0) { /* C3000 PDC reports period/mode */ SYM_SETUP_SCSI_DIFF = 0; @@ -1060,12 +1058,10 @@ * and BUS width. 
*/ if (np->features & FE_ULTRA3) { - if (tp->tinfo.user.period <= 9 && - tp->tinfo.user.width == BUS_16_BIT) { - tp->tinfo.user.options |= PPR_OPT_DT; - tp->tinfo.user.offset = np->maxoffs_dt; - tp->tinfo.user.spi_version = 3; - } + tp->tinfo.user.options |= PPR_OPT_DT; + tp->tinfo.user.period = np->minsync_dt; + tp->tinfo.user.offset = np->maxoffs_dt; + tp->tinfo.user.spi_version = 3; } if (!tp->usrtags) @@ -1962,13 +1958,6 @@ if (sym_verbose >= 2) printf ("%s: Downloading SCSI SCRIPTS.\n", sym_name(np)); -#ifdef SYM_OPT_NO_BUS_MEMORY_MAPPING - np->fw_patch(np); - if (np->ram_ws == 8192) - phys = SCRIPTZ_BA (np, start_ram64); - else - phys = SCRIPTZ_BA (np, start_ram); -#else if (np->ram_ws == 8192) { OUTRAM_OFF(4096, np->scriptb0, np->scriptb_sz); phys = scr_to_cpu(np->scr_ram_seg); @@ -1980,7 +1969,6 @@ else phys = SCRIPTA_BA (np, init); OUTRAM_OFF(0, np->scripta0, np->scripta_sz); -#endif } else phys = SCRIPTA_BA (np, init); @@ -2136,9 +2124,15 @@ sym_settrans(np, target, 0, ofs, per, wide, div, fak); - tp->tinfo.goal.period = tp->tinfo.curr.period = per; - tp->tinfo.goal.offset = tp->tinfo.curr.offset = ofs; - tp->tinfo.goal.options = tp->tinfo.curr.options = 0; + tp->tinfo.curr.period = per; + tp->tinfo.curr.offset = ofs; + tp->tinfo.curr.options = 0; + + if (!(tp->tinfo.goal.options & PPR_OPT_MASK)) { + tp->tinfo.goal.period = per; + tp->tinfo.goal.offset = ofs; + tp->tinfo.goal.options = 0; + } sym_xpt_async_nego_sync(np, target); } @@ -4151,8 +4145,10 @@ /* * Check values against our limits. */ - if (wide > np->maxwide) - {chg = 1; wide = np->maxwide;} + if (wide > np->maxwide) { + chg = 1; + wide = np->maxwide; + } if (!wide || !(np->features & FE_ULTRA3)) dt &= ~PPR_OPT_DT; if (req) { @@ -4306,8 +4302,10 @@ /* * Check values against our limits. */ - if (wide > np->maxwide) - {chg = 1; wide = np->maxwide;} + if (wide > np->maxwide) { + chg = 1; + wide = np->maxwide; + } if (req) { if (wide > tp->tinfo.user.width) {chg = 1; wide = tp->tinfo.user.width;} --- diff/drivers/scsi/sym53c8xx_2/sym_hipd.h 2003-09-30 15:46:17.000000000 +0100 +++ source/drivers/scsi/sym53c8xx_2/sym_hipd.h 2003-11-26 10:09:06.000000000 +0000 @@ -59,12 +59,6 @@ * They may be defined in platform specific headers, if they * are useful. * - * SYM_OPT_NO_BUS_MEMORY_MAPPING - * When this option is set, the driver will not load the - * on-chip RAM using MMIO, but let the SCRIPTS processor - * do the work using MOVE MEMORY instructions. - * (set for Linux/PPC) - * * SYM_OPT_HANDLE_DIR_UNKNOWN * When this option is set, the SCRIPTS used by the driver * are able to handle SCSI transfers with direction not @@ -75,12 +69,6 @@ * When this option is set, the driver will use a queue per * device and handle QUEUE FULL status requeuing internally. * - * SYM_OPT_BUS_DMA_ABSTRACTION - * When this option is set, the driver allocator is responsible - * of maintaining bus physical addresses and so provides virtual - * to bus physical address translation of driver data structures. - * (set for FreeBSD-4 and Linux 2.3) - * * SYM_OPT_SNIFF_INQUIRY * When this option is set, the driver sniff out successful * INQUIRY response and performs negotiations accordingly. @@ -92,10 +80,8 @@ * (set for Linux) */ #if 0 -#define SYM_OPT_NO_BUS_MEMORY_MAPPING #define SYM_OPT_HANDLE_DIR_UNKNOWN #define SYM_OPT_HANDLE_DEVICE_QUEUEING -#define SYM_OPT_BUS_DMA_ABSTRACTION #define SYM_OPT_SNIFF_INQUIRY #define SYM_OPT_LIMIT_COMMAND_REORDERING #endif @@ -958,9 +944,7 @@ /* * DMA pool handle for this HBA. 
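The sym_hipd.h change just below replaces '#if SYM_CONF_NVRAM_SUPPORT' blocks scattered through the C files with the usual header idiom: real prototypes when the feature is compiled in, empty static inline stubs when it is not, so every call site stays unconditional and the stubs vanish at compile time. Generically (CONFIG_FOO_NVRAM and the foo_* names are placeholders for the pattern only):

	#ifdef CONFIG_FOO_NVRAM
	void foo_nvram_setup(struct foo *np, struct foo_nvram *nvp);
	int  foo_nvram_read(struct foo *np, struct foo_nvram *nvp);
	#else
	static inline void foo_nvram_setup(struct foo *np,
					   struct foo_nvram *nvp) { }
	static inline int foo_nvram_read(struct foo *np,
					 struct foo_nvram *nvp)
	{
		nvp->type = 0;	/* report "no NVRAM found" */
		return 0;
	}
	#endif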
*/ -#ifdef SYM_OPT_BUS_DMA_ABSTRACTION m_pool_ident_t bus_dmat; -#endif /* * O/S specific data structure @@ -1133,9 +1117,20 @@ /* * NVRAM reading (sym_nvram.c). */ +#if SYM_CONF_NVRAM_SUPPORT void sym_nvram_setup_host (hcb_p np, struct sym_nvram *nvram); void sym_nvram_setup_target (hcb_p np, int target, struct sym_nvram *nvp); int sym_read_nvram (sdev_p np, struct sym_nvram *nvp); +#else +static inline void sym_nvram_setup_host(hcb_p np, struct sym_nvram *nvram) { } +static inline void sym_nvram_setup_target(hcb_p np, int target, struct sym_nvram *nvp) { } +static inline int sym_read_nvram(sdev_p np, struct sym_nvram *nvp) +{ + nvp->type = 0; + return 0; +} +#endif + /* * FIRMWARES (sym_fw.c) @@ -1347,7 +1342,6 @@ * Virtual to bus physical translation for a given cluster. * Such a structure is only useful with DMA abstraction. */ -#ifdef SYM_OPT_BUS_DMA_ABSTRACTION typedef struct sym_m_vtob { /* Virtual to Bus address translation */ struct sym_m_vtob *next; #ifdef SYM_HAVE_M_SVTOB @@ -1363,7 +1357,6 @@ #define VTOB_HASH_MASK (VTOB_HASH_SIZE-1) #define VTOB_HASH_CODE(m) \ ((((m_addr_t) (m)) >> SYM_MEM_CLUSTER_SHIFT) & VTOB_HASH_MASK) -#endif /* SYM_OPT_BUS_DMA_ABSTRACTION */ /* * Memory pool of a given kind. @@ -1375,7 +1368,6 @@ * method are expected to tell the driver about. */ typedef struct sym_m_pool { -#ifdef SYM_OPT_BUS_DMA_ABSTRACTION m_pool_ident_t dev_dmat; /* Identifies the pool (see above) */ m_addr_t (*get_mem_cluster)(struct sym_m_pool *); #ifdef SYM_MEM_FREE_UNUSED @@ -1389,10 +1381,6 @@ int nump; m_vtob_p vtob[VTOB_HASH_SIZE]; struct sym_m_pool *next; -#else -#define M_GET_MEM_CLUSTER() sym_get_mem_cluster() -#define M_FREE_MEM_CLUSTER(p) sym_free_mem_cluster(p) -#endif /* SYM_OPT_BUS_DMA_ABSTRACTION */ struct sym_m_link h[SYM_MEM_CLUSTER_SHIFT - SYM_MEM_SHIFT + 1]; } *m_pool_p; @@ -1406,12 +1394,10 @@ * Alloc, free and translate addresses to bus physical * for DMAable memory. */ -#ifdef SYM_OPT_BUS_DMA_ABSTRACTION void *__sym_calloc_dma_unlocked(m_pool_ident_t dev_dmat, int size, char *name); void __sym_mfree_dma_unlocked(m_pool_ident_t dev_dmat, void *m,int size, char *name); u32 __vtobus_unlocked(m_pool_ident_t dev_dmat, void *m); -#endif /* * Verbs used by the driver code for DMAable memory handling. --- diff/drivers/scsi/sym53c8xx_2/sym_malloc.c 2003-02-26 16:01:09.000000000 +0000 +++ source/drivers/scsi/sym53c8xx_2/sym_malloc.c 2003-11-26 10:09:06.000000000 +0000 @@ -204,18 +204,9 @@ /* * Default memory pool we donnot need to involve in DMA. * - * If DMA abtraction is not needed, the generic allocator - * calls directly some kernel allocator. - * * With DMA abstraction, we use functions (methods), to * distinguish between non DMAable memory and DMAable memory. */ -#ifndef SYM_OPT_BUS_DMA_ABSTRACTION - -static struct sym_m_pool mp0; - -#else - static m_addr_t ___mp0_get_mem_cluster(m_pool_p mp) { m_addr_t m = (m_addr_t) sym_get_mem_cluster(); @@ -240,8 +231,6 @@ {0, ___mp0_get_mem_cluster}; #endif -#endif /* SYM_OPT_BUS_DMA_ABSTRACTION */ - /* * Actual memory allocation routine for non-DMAed memory. */ @@ -260,7 +249,6 @@ __sym_mfree(&mp0, ptr, size, name); } -#ifdef SYM_OPT_BUS_DMA_ABSTRACTION /* * Methods that maintains DMAable pools according to user allocations. * New pools are created on the fly when a new pool id is provided. @@ -417,5 +405,3 @@ panic("sym: VTOBUS FAILED!\n"); return (u32)(vp ? 
vp->baddr + (((m_addr_t) m) - a) : 0); } - -#endif /* SYM_OPT_BUS_DMA_ABSTRACTION */ --- diff/drivers/scsi/sym53c8xx_2/sym_nvram.c 2002-10-16 04:28:24.000000000 +0100 +++ source/drivers/scsi/sym53c8xx_2/sym_nvram.c 2003-11-26 10:09:06.000000000 +0000 @@ -59,25 +59,22 @@ /* * Some poor and bogus sync table that refers to Tekram NVRAM layout. */ -#if SYM_CONF_NVRAM_SUPPORT static u_char Tekram_sync[16] = {25,31,37,43, 50,62,75,125, 12,15,18,21, 6,7,9,10}; #ifdef SYM_CONF_DEBUG_NVRAM static u_char Tekram_boot_delay[7] = {3, 5, 10, 20, 30, 60, 120}; #endif -#endif /* * Get host setup from NVRAM. */ -void sym_nvram_setup_host (hcb_p np, struct sym_nvram *nvram) +void sym_nvram_setup_host(struct sym_hcb *np, struct sym_nvram *nvram) { -#if SYM_CONF_NVRAM_SUPPORT /* * Get parity checking, host ID, verbose mode * and miscellaneous host flags from NVRAM. */ - switch(nvram->type) { + switch (nvram->type) { case SYM_SYMBIOS_NVRAM: if (!(nvram->data.Symbios.flags & SYMBIOS_PARITY_ENABLE)) np->rv_scntl0 &= ~0x0a; @@ -95,41 +92,15 @@ default: break; } -#endif -} - -/* - * Get target setup from NVRAM. - */ -#if SYM_CONF_NVRAM_SUPPORT -static void sym_Symbios_setup_target(hcb_p np,int target, Symbios_nvram *nvram); -static void sym_Tekram_setup_target(hcb_p np,int target, Tekram_nvram *nvram); -#endif - -void sym_nvram_setup_target (hcb_p np, int target, struct sym_nvram *nvp) -{ -#if SYM_CONF_NVRAM_SUPPORT - switch(nvp->type) { - case SYM_SYMBIOS_NVRAM: - sym_Symbios_setup_target (np, target, &nvp->data.Symbios); - break; - case SYM_TEKRAM_NVRAM: - sym_Tekram_setup_target (np, target, &nvp->data.Tekram); - break; - default: - break; - } -#endif } -#if SYM_CONF_NVRAM_SUPPORT /* * Get target set-up from Symbios format NVRAM. */ static void -sym_Symbios_setup_target(hcb_p np, int target, Symbios_nvram *nvram) +sym_Symbios_setup_target(struct sym_hcb *np, int target, Symbios_nvram *nvram) { - tcb_p tp = &np->target[target]; + struct sym_tcb *tp = &np->target[target]; Symbios_target *tn = &nvram->target[target]; tp->tinfo.user.period = tn->sync_period ? (tn->sync_period + 3) / 4 : 0; @@ -149,9 +120,9 @@ * Get target set-up from Tekram format NVRAM. */ static void -sym_Tekram_setup_target(hcb_p np, int target, Tekram_nvram *nvram) +sym_Tekram_setup_target(struct sym_hcb *np, int target, Tekram_nvram *nvram) { - tcb_p tp = &np->target[target]; + struct sym_tcb *tp = &np->target[target]; struct Tekram_target *tn = &nvram->target[target]; int i; @@ -160,8 +131,8 @@ tp->tinfo.user.period = Tekram_sync[i]; } - tp->tinfo.user.width = - (tn->flags & TEKRAM_WIDE_NEGO) ? BUS_16_BIT : BUS_8_BIT; + tp->tinfo.user.width = (tn->flags & TEKRAM_WIDE_NEGO) ? + BUS_16_BIT : BUS_8_BIT; if (tn->flags & TEKRAM_TAGGED_COMMANDS) { tp->usrtags = 2 << nvram->max_tags_index; @@ -175,11 +146,28 @@ np->rv_scntl0 &= ~0x0a; /* SCSI parity checking disabled */ } +/* + * Get target setup from NVRAM. + */ +void sym_nvram_setup_target(struct sym_hcb *np, int target, struct sym_nvram *nvp) +{ + switch (nvp->type) { + case SYM_SYMBIOS_NVRAM: + sym_Symbios_setup_target(np, target, &nvp->data.Symbios); + break; + case SYM_TEKRAM_NVRAM: + sym_Tekram_setup_target(np, target, &nvp->data.Tekram); + break; + default: + break; + } +} + #ifdef SYM_CONF_DEBUG_NVRAM /* * Dump Symbios format NVRAM for debugging purpose. */ -static void sym_display_Symbios_nvram(sdev_p np, Symbios_nvram *nvram) +static void sym_display_Symbios_nvram(struct sym_device *np, Symbios_nvram *nvram) { int i; @@ -211,7 +199,7 @@ /* * Dump TEKRAM format NVRAM for debugging purpose. 
*/ -static void sym_display_Tekram_nvram(sdev_p np, Tekram_nvram *nvram) +static void sym_display_Tekram_nvram(struct sym_device *np, Tekram_nvram *nvram) { int i, tags, boot_delay; char *rem; @@ -221,7 +209,7 @@ boot_delay = 0; if (nvram->boot_delay_index < 6) boot_delay = Tekram_boot_delay[nvram->boot_delay_index]; - switch((nvram->flags & TEKRAM_REMOVABLE_FLAGS) >> 6) { + switch ((nvram->flags & TEKRAM_REMOVABLE_FLAGS) >> 6) { default: case 0: rem = ""; break; case 1: rem = " REMOVABLE=boot device"; break; @@ -257,49 +245,12 @@ sync); } } -#endif /* SYM_CONF_DEBUG_NVRAM */ -#endif /* SYM_CONF_NVRAM_SUPPORT */ - - -/* - * Try reading Symbios or Tekram NVRAM - */ -#if SYM_CONF_NVRAM_SUPPORT -static int sym_read_Symbios_nvram (sdev_p np, Symbios_nvram *nvram); -static int sym_read_Tekram_nvram (sdev_p np, Tekram_nvram *nvram); -#endif - -int sym_read_nvram (sdev_p np, struct sym_nvram *nvp) -{ -#if SYM_CONF_NVRAM_SUPPORT - /* - * Try to read SYMBIOS nvram. - * Try to read TEKRAM nvram if Symbios nvram not found. - */ - if (SYM_SETUP_SYMBIOS_NVRAM && - !sym_read_Symbios_nvram (np, &nvp->data.Symbios)) { - nvp->type = SYM_SYMBIOS_NVRAM; -#ifdef SYM_CONF_DEBUG_NVRAM - sym_display_Symbios_nvram(np, &nvp->data.Symbios); -#endif - } - else if (SYM_SETUP_TEKRAM_NVRAM && - !sym_read_Tekram_nvram (np, &nvp->data.Tekram)) { - nvp->type = SYM_TEKRAM_NVRAM; -#ifdef SYM_CONF_DEBUG_NVRAM - sym_display_Tekram_nvram(np, &nvp->data.Tekram); -#endif - } - else - nvp->type = 0; #else - nvp->type = 0; -#endif - return nvp->type; -} +static void sym_display_Symbios_nvram(struct sym_device *np, Symbios_nvram *nvram) { } +static void sym_display_Tekram_nvram(struct sym_device *np, Tekram_nvram *nvram) { } +#endif /* SYM_CONF_DEBUG_NVRAM */ -#if SYM_CONF_NVRAM_SUPPORT /* * 24C16 EEPROM reading. * @@ -316,11 +267,11 @@ /* * Set/clear data/clock bit in GPIO0 */ -static void S24C16_set_bit(sdev_p np, u_char write_bit, u_char *gpreg, +static void S24C16_set_bit(struct sym_device *np, u_char write_bit, u_char *gpreg, int bit_mode) { UDELAY (5); - switch (bit_mode){ + switch (bit_mode) { case SET_BIT: *gpreg |= write_bit; break; @@ -342,7 +293,7 @@ /* * Send START condition to NVRAM to wake it up. */ -static void S24C16_start(sdev_p np, u_char *gpreg) +static void S24C16_start(struct sym_device *np, u_char *gpreg) { S24C16_set_bit(np, 1, gpreg, SET_BIT); S24C16_set_bit(np, 0, gpreg, SET_CLK); @@ -353,7 +304,7 @@ /* * Send STOP condition to NVRAM - puts NVRAM to sleep... ZZzzzz!! 
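The S24C16_* helpers in this region bit-bang an I2C-style serial EEPROM through two GPIO lines on the chip: set the data line, let it settle, then pulse the clock. Stripped of the Symbios specifics, the core of such a bit-bang write looks roughly like this (foo_gpio_write() and the bit masks are placeholders, not driver API):

	/* clock one bit out to the EEPROM: data first, then a
	 * clock pulse, with setup/hold delays for the slow part */
	static void foo_send_bit(struct foo_dev *np, int bit,
				 unsigned char *gpreg)
	{
		if (bit)
			*gpreg |= FOO_GPIO_DATA;
		else
			*gpreg &= ~FOO_GPIO_DATA;
		foo_gpio_write(np, *gpreg);
		udelay(5);			/* data setup time */
		foo_gpio_write(np, *gpreg | FOO_GPIO_CLK);
		udelay(5);			/* clock high */
		foo_gpio_write(np, *gpreg);	/* clock low again */
	}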
*/ -static void S24C16_stop(sdev_p np, u_char *gpreg) +static void S24C16_stop(struct sym_device *np, u_char *gpreg) { S24C16_set_bit(np, 0, gpreg, SET_CLK); S24C16_set_bit(np, 1, gpreg, SET_BIT); @@ -363,7 +314,7 @@ * Read or write a bit to the NVRAM, * read if GPIO0 input else write if GPIO0 output */ -static void S24C16_do_bit(sdev_p np, u_char *read_bit, u_char write_bit, +static void S24C16_do_bit(struct sym_device *np, u_char *read_bit, u_char write_bit, u_char *gpreg) { S24C16_set_bit(np, write_bit, gpreg, SET_BIT); @@ -378,7 +329,7 @@ * Output an ACK to the NVRAM after reading, * change GPIO0 to output and when done back to an input */ -static void S24C16_write_ack(sdev_p np, u_char write_bit, u_char *gpreg, +static void S24C16_write_ack(struct sym_device *np, u_char write_bit, u_char *gpreg, u_char *gpcntl) { OUTB (nc_gpcntl, *gpcntl & 0xfe); @@ -390,7 +341,7 @@ * Input an ACK from NVRAM after writing, * change GPIO0 to input and when done back to an output */ -static void S24C16_read_ack(sdev_p np, u_char *read_bit, u_char *gpreg, +static void S24C16_read_ack(struct sym_device *np, u_char *read_bit, u_char *gpreg, u_char *gpcntl) { OUTB (nc_gpcntl, *gpcntl | 0x01); @@ -402,7 +353,7 @@ * WRITE a byte to the NVRAM and then get an ACK to see it was accepted OK, * GPIO0 must already be set as an output */ -static void S24C16_write_byte(sdev_p np, u_char *ack_data, u_char write_data, +static void S24C16_write_byte(struct sym_device *np, u_char *ack_data, u_char write_data, u_char *gpreg, u_char *gpcntl) { int x; @@ -417,7 +368,7 @@ * READ a byte from the NVRAM and then send an ACK to say we have got it, * GPIO0 must already be set as an input */ -static void S24C16_read_byte(sdev_p np, u_char *read_data, u_char ack_data, +static void S24C16_read_byte(struct sym_device *np, u_char *read_data, u_char ack_data, u_char *gpreg, u_char *gpcntl) { int x; @@ -435,7 +386,7 @@ /* * Read 'len' bytes starting at 'offset'. */ -static int sym_read_S24C16_nvram (sdev_p np, int offset, u_char *data, int len) +static int sym_read_S24C16_nvram(struct sym_device *np, int offset, u_char *data, int len) { u_char gpcntl, gpreg; u_char old_gpcntl, old_gpreg; @@ -514,7 +465,7 @@ * Try reading Symbios NVRAM. * Return 0 if OK. */ -static int sym_read_Symbios_nvram (sdev_p np, Symbios_nvram *nvram) +static int sym_read_Symbios_nvram(struct sym_device *np, Symbios_nvram *nvram) { static u_char Symbios_trailer[6] = {0xfe, 0xfe, 0, 0, 0, 0}; u_char *data = (u_char *) nvram; @@ -528,7 +479,7 @@ /* check valid NVRAM signature, verify byte count and checksum */ if (nvram->type != 0 || - bcmp(nvram->trailer, Symbios_trailer, 6) || + memcmp(nvram->trailer, Symbios_trailer, 6) || nvram->byte_count != len - 12) return 1; @@ -555,7 +506,7 @@ /* * Pulse clock bit in GPIO0 */ -static void T93C46_Clk(sdev_p np, u_char *gpreg) +static void T93C46_Clk(struct sym_device *np, u_char *gpreg) { OUTB (nc_gpreg, *gpreg | 0x04); UDELAY (2); @@ -565,7 +516,7 @@ /* * Read bit from NVRAM */ -static void T93C46_Read_Bit(sdev_p np, u_char *read_bit, u_char *gpreg) +static void T93C46_Read_Bit(struct sym_device *np, u_char *read_bit, u_char *gpreg) { UDELAY (2); T93C46_Clk(np, gpreg); @@ -575,7 +526,7 @@ /* * Write bit to GPIO0 */ -static void T93C46_Write_Bit(sdev_p np, u_char write_bit, u_char *gpreg) +static void T93C46_Write_Bit(struct sym_device *np, u_char write_bit, u_char *gpreg) { if (write_bit & 0x01) *gpreg |= 0x02; @@ -593,7 +544,7 @@ /* * Send STOP condition to NVRAM - puts NVRAM to sleep... ZZZzzz!! 
*/ -static void T93C46_Stop(sdev_p np, u_char *gpreg) +static void T93C46_Stop(struct sym_device *np, u_char *gpreg) { *gpreg &= 0xef; OUTB (nc_gpreg, *gpreg); @@ -605,7 +556,7 @@ /* * Send read command and address to NVRAM */ -static void T93C46_Send_Command(sdev_p np, u_short write_data, +static void T93C46_Send_Command(struct sym_device *np, u_short write_data, u_char *read_bit, u_char *gpreg) { int x; @@ -620,7 +571,8 @@ /* * READ 2 bytes from the NVRAM */ -static void T93C46_Read_Word(sdev_p np, u_short *nvram_data, u_char *gpreg) +static void T93C46_Read_Word(struct sym_device *np, + unsigned short *nvram_data, unsigned char *gpreg) { int x; u_char read_bit; @@ -639,13 +591,13 @@ /* * Read Tekram NvRAM data. */ -static int T93C46_Read_Data(sdev_p np, u_short *data,int len,u_char *gpreg) +static int T93C46_Read_Data(struct sym_device *np, unsigned short *data, + int len, unsigned char *gpreg) { - u_char read_bit; - int x; + int x; for (x = 0; x < len; x++) { - + unsigned char read_bit; /* output read command and address */ T93C46_Send_Command(np, 0x180 | x, &read_bit, gpreg); if (read_bit & 0x01) @@ -660,7 +612,7 @@ /* * Try reading 93C46 Tekram NVRAM. */ -static int sym_read_T93C46_nvram (sdev_p np, Tekram_nvram *nvram) +static int sym_read_T93C46_nvram(struct sym_device *np, Tekram_nvram *nvram) { u_char gpcntl, gpreg; u_char old_gpcntl, old_gpreg; @@ -692,7 +644,7 @@ * Try reading Tekram NVRAM. * Return 0 if OK. */ -static int sym_read_Tekram_nvram (sdev_p np, Tekram_nvram *nvram) +static int sym_read_Tekram_nvram (struct sym_device *np, Tekram_nvram *nvram) { u_char *data = (u_char *) nvram; int len = sizeof(*nvram); @@ -700,13 +652,13 @@ int x; switch (np->device_id) { - case PCI_ID_SYM53C885: - case PCI_ID_SYM53C895: - case PCI_ID_SYM53C896: + case PCI_DEVICE_ID_NCR_53C885: + case PCI_DEVICE_ID_NCR_53C895: + case PCI_DEVICE_ID_NCR_53C896: x = sym_read_S24C16_nvram(np, TEKRAM_24C16_NVRAM_ADDRESS, data, len); break; - case PCI_ID_SYM53C875: + case PCI_DEVICE_ID_NCR_53C875: x = sym_read_S24C16_nvram(np, TEKRAM_24C16_NVRAM_ADDRESS, data, len); if (!x) @@ -727,4 +679,19 @@ return 0; } -#endif /* SYM_CONF_NVRAM_SUPPORT */ +/* + * Try reading Symbios or Tekram NVRAM + */ +int sym_read_nvram(struct sym_device *np, struct sym_nvram *nvp) +{ + if (!sym_read_Symbios_nvram(np, &nvp->data.Symbios)) { + nvp->type = SYM_SYMBIOS_NVRAM; + sym_display_Symbios_nvram(np, &nvp->data.Symbios); + } else if (!sym_read_Tekram_nvram(np, &nvp->data.Tekram)) { + nvp->type = SYM_TEKRAM_NVRAM; + sym_display_Tekram_nvram(np, &nvp->data.Tekram); + } else { + nvp->type = 0; + } + return nvp->type; +} --- diff/drivers/serial/8250.c 2003-10-27 09:20:43.000000000 +0000 +++ source/drivers/serial/8250.c 2003-11-26 10:09:06.000000000 +0000 @@ -844,7 +844,7 @@ if (unlikely(tty->flip.count >= TTY_FLIPBUF_SIZE)) { tty->flip.work.func((void *)tty); if (tty->flip.count >= TTY_FLIPBUF_SIZE) - return; // if TTY_DONT_FLIP is set + return; /* if TTY_DONT_FLIP is set */ } ch = serial_inp(up, UART_RX); *tty->flip.char_buf_ptr = ch; @@ -1205,12 +1205,21 @@ spin_unlock_irqrestore(&up->port.lock, flags); } +#ifdef CONFIG_KGDB +static int kgdb_irq = -1; +#endif + static int serial8250_startup(struct uart_port *port) { struct uart_8250_port *up = (struct uart_8250_port *)port; unsigned long flags; int retval; +#ifdef CONFIG_KGDB + if (up->port.irq == kgdb_irq) + return -EBUSY; +#endif + up->capabilities = uart_config[up->port.type].flags; if (up->port.type == PORT_16C950) { @@ -1876,6 +1885,10 @@ for (i = 0; i < UART_NR; i++) { 
struct uart_8250_port *up = &serial8250_ports[i]; +#ifdef CONFIG_KGDB + if (up->port.irq == kgdb_irq) + up->port.kgdb = 1; +#endif up->port.line = i; up->port.ops = &serial8250_pops; init_timer(&up->timer); @@ -2145,6 +2158,31 @@ uart_resume_port(&serial8250_reg, &serial8250_ports[line].port); } +#ifdef CONFIG_KGDB +/* + * Find all the ports using the given irq and shut them down. + * Result should be that the irq will be released. + */ +void shutdown_for_kgdb(struct async_struct * info) +{ + int irq = info->state->irq; + struct uart_8250_port *up; + int ttyS; + + kgdb_irq = irq; /* save for later init */ + for (ttyS = 0; ttyS < UART_NR; ttyS++){ + up = &serial8250_ports[ttyS]; + if (up->port.irq == irq && (irq_lists + irq)->head) { +#ifdef CONFIG_DEBUG_SPINLOCK /* ugly business... */ + if(up->port.lock.magic != SPINLOCK_MAGIC) + spin_lock_init(&up->port.lock); +#endif + serial8250_shutdown(&up->port); + } + } +} +#endif /* CONFIG_KGDB */ + static int __init serial8250_init(void) { int ret, i; --- diff/drivers/serial/serial_core.c 2003-11-25 15:24:58.000000000 +0000 +++ source/drivers/serial/serial_core.c 2003-11-26 10:09:06.000000000 +0000 @@ -1862,6 +1862,9 @@ if (flow == 'r') termios.c_cflag |= CRTSCTS; + if (!port->ops) + return 0; /* "console=" on ia64 */ + port->ops->set_termios(port, &termios, NULL); co->cflag = termios.c_cflag; @@ -1975,6 +1978,11 @@ { unsigned int flags; +#ifdef CONFIG_KGDB + if (port->kgdb) + return; +#endif + /* * If there isn't a port here, don't do anything further. */ --- diff/drivers/usb/class/cdc-acm.c 2003-10-27 09:20:44.000000000 +0000 +++ source/drivers/usb/class/cdc-acm.c 2003-11-26 10:09:06.000000000 +0000 @@ -1,5 +1,5 @@ /* - * acm.c Version 0.22 + * cdc-acm.c * * Copyright (c) 1999 Armin Fuerst <fuerst@in.tum.de> * Copyright (c) 1999 Pavel Machek <pavel@suse.cz> @@ -26,6 +26,7 @@ * v0.21 - revert to probing on device for devices with multiple configs * v0.22 - probe only the control interface. if usbcore doesn't choose the * config we want, sysadmin changes bConfigurationValue in sysfs. 
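The KGDB hooks in the serial changes above work by IRQ ownership: shutdown_for_kgdb() records the debugger's IRQ and shuts down every 8250 port sharing it, and serial8250_startup() then refuses to bring such a port back up, so the tty layer can never reclaim the line out from under the debugger. The guard itself is a single comparison (sketch):

	static int kgdb_irq = -1;	/* IRQ claimed by the debugger */

	static int foo_uart_startup(struct uart_port *port)
	{
		/* never hand the debugger's IRQ back to the tty layer */
		if (port->irq == kgdb_irq)
			return -EBUSY;
		/* ... normal startup ... */
		return 0;
	}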
+ * v0.23 - use softirq for rx processing, as needed by tty layer */ /* @@ -44,6 +45,8 @@ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA */ +#undef DEBUG + #include <linux/kernel.h> #include <linux/errno.h> #include <linux/init.h> @@ -54,14 +57,13 @@ #include <linux/module.h> #include <linux/smp_lock.h> #include <asm/uaccess.h> -#undef DEBUG #include <linux/usb.h> #include <asm/byteorder.h> /* * Version Information */ -#define DRIVER_VERSION "v0.21" +#define DRIVER_VERSION "v0.23" #define DRIVER_AUTHOR "Armin Fuerst, Pavel Machek, Johannes Erdfelt, Vojtech Pavlik" #define DRIVER_DESC "USB Abstract Control Model driver for USB modems and ISDN adapters" @@ -146,7 +148,8 @@ struct tty_struct *tty; /* the corresponding tty */ struct urb *ctrlurb, *readurb, *writeurb; /* urbs */ struct acm_line line; /* line coding (bits, stop, parity) */ - struct work_struct work; /* work queue entry for line discipline waking up */ + struct work_struct work; /* work queue entry for line discipline waking up */ + struct tasklet_struct bh; /* rx processing */ unsigned int ctrlin; /* input control lines (DCD, DSR, RI, break, overruns) */ unsigned int ctrlout; /* output control lines (DTR, RTS) */ unsigned int writesize; /* max packet size for the output bulk endpoint */ @@ -184,9 +187,10 @@ #define acm_send_break(acm, ms) acm_ctrl_msg(acm, ACM_REQ_SEND_BREAK, ms, NULL, 0) /* - * Interrupt handler for various ACM control events + * Interrupt handlers for various ACM device responses */ +/* control interface reports status changes with "interrupt" transfers */ static void acm_ctrl_irq(struct urb *urb, struct pt_regs *regs) { struct acm *acm = urb->context; @@ -251,20 +255,30 @@ __FUNCTION__, status); } +/* data interface returns incoming bytes, or we got unthrottled */ static void acm_read_bulk(struct urb *urb, struct pt_regs *regs) { struct acm *acm = urb->context; - struct tty_struct *tty = acm->tty; - unsigned char *data = urb->transfer_buffer; - int i = 0; if (!ACM_READY(acm)) return; if (urb->status) - dbg("nonzero read bulk status received: %d", urb->status); + dev_dbg(&acm->data->dev, "bulk rx status %d\n", urb->status); + + /* calling tty_flip_buffer_push() in_irq() isn't allowed */ + tasklet_schedule(&acm->bh); +} + +static void acm_rx_tasklet(unsigned long _acm) +{ + struct acm *acm = (void *)_acm; + struct urb *urb = acm->readurb; + struct tty_struct *tty = acm->tty; + unsigned char *data = urb->transfer_buffer; + int i = 0; - if (!urb->status && !acm->throttle) { + if (urb->actual_length > 0 && !acm->throttle) { for (i = 0; i < urb->actual_length && !acm->throttle; i++) { /* if we insert more than TTY_FLIPBUF_SIZE characters, * we drop them. 
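The cdc-acm rework here is the standard answer to "can't do that in hard-irq context": the bulk-in completion handler does nothing but schedule a tasklet, and the tasklet, running in softirq context where tty_flip_buffer_push() is legal, drains the buffer and resubmits the URB. Schematically (throttling and error paths omitted, "foo" names hypothetical):

	static void foo_read_bulk(struct urb *urb, struct pt_regs *regs)
	{
		struct foo *f = urb->context;

		/* tty_flip_buffer_push() must not run in_irq(),
		 * so defer all rx work to softirq context */
		tasklet_schedule(&f->bh);
	}

	static void foo_rx_tasklet(unsigned long _f)
	{
		struct foo *f = (void *)_f;
		struct urb *urb = f->readurb;

		/* ... copy urb->transfer_buffer into the flip buffer ... */
		tty_flip_buffer_push(f->tty);

		urb->actual_length = 0;
		usb_submit_urb(urb, GFP_ATOMIC);	/* re-arm rx */
	}

The tasklet is wired up once at probe time (bh.func = the tasklet, bh.data = the device pointer), exactly as the acm_probe hunk below does.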
*/ @@ -285,10 +299,12 @@ urb->actual_length = 0; urb->dev = acm->dev; - if (usb_submit_urb(urb, GFP_ATOMIC)) - dbg("failed resubmitting read urb"); + i = usb_submit_urb(urb, GFP_ATOMIC); + if (i) + dev_dbg(&acm->data->dev, "bulk rx resubmit %d\n", i); } +/* data interface wrote those outgoing bytes */ static void acm_write_bulk(struct urb *urb, struct pt_regs *regs) { struct acm *acm = (struct acm *)urb->context; @@ -621,6 +637,8 @@ acm->minor = minor; acm->dev = dev; + acm->bh.func = acm_rx_tasklet; + acm->bh.data = (unsigned long) acm; INIT_WORK(&acm->work, acm_softint, acm); if (!(buf = kmalloc(ctrlsize + readsize + acm->writesize, GFP_KERNEL))) { --- diff/drivers/usb/core/urb.c 2003-08-20 14:16:31.000000000 +0100 +++ source/drivers/usb/core/urb.c 2003-11-26 10:09:06.000000000 +0000 @@ -268,7 +268,7 @@ /* "high bandwidth" mode, 1-3 packets/uframe? */ if (dev->speed == USB_SPEED_HIGH) { int mult = 1 + ((max >> 11) & 0x03); - max &= 0x03ff; + max &= 0x07ff; max *= mult; } --- diff/drivers/usb/host/ehci-sched.c 2003-08-20 14:16:31.000000000 +0100 +++ source/drivers/usb/host/ehci-sched.c 2003-11-26 10:09:06.000000000 +0000 @@ -580,10 +580,10 @@ maxp = urb->dev->epmaxpacketout [epnum]; buf1 = 0; } - buf1 |= (maxp & 0x03ff); + buf1 |= (maxp & 0x07ff); multi = 1; multi += (maxp >> 11) & 0x03; - maxp &= 0x03ff; + maxp &= 0x07ff; maxp *= multi; /* transfer can't fit in any uframe? */ --- diff/drivers/usb/input/powermate.c 2003-09-17 12:28:10.000000000 +0100 +++ source/drivers/usb/input/powermate.c 2003-11-26 10:09:06.000000000 +0000 @@ -54,7 +54,11 @@ #define UPDATE_PULSE_AWAKE (1<<2) #define UPDATE_PULSE_MODE (1<<3) -#define POWERMATE_PAYLOAD_SIZE 3 +/* at least two versions of the hardware exist, with differing payload + sizes. the first three bytes always contain the "interesting" data in + the relevant format. 
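The 0x03ff -> 0x07ff fixes in urb.c and ehci-sched.c above both stem from the USB 2.0 layout of wMaxPacketSize: bits 10..0 carry the maximum packet size, so the correct mask is 0x07ff, while bits 12..11 carry the number of additional transactions per microframe for high-bandwidth endpoints. A small self-contained illustration:

	/* decode wMaxPacketSize for a high-speed, high-bandwidth endpoint */
	static unsigned int decode_maxp(unsigned int wMaxPacketSize)
	{
		unsigned int mult = 1 + ((wMaxPacketSize >> 11) & 0x03); /* 1..3 packets/uframe */
		unsigned int max = wMaxPacketSize & 0x07ff;	/* bits 10..0, not 0x03ff */

		return max * mult;	/* total bytes per microframe */
	}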
*/ +#define POWERMATE_PAYLOAD_SIZE_MAX 6 +#define POWERMATE_PAYLOAD_SIZE_MIN 3 struct powermate_device { signed char *data; dma_addr_t data_dma; @@ -269,7 +273,7 @@ static int powermate_alloc_buffers(struct usb_device *udev, struct powermate_device *pm) { - pm->data = usb_buffer_alloc(udev, POWERMATE_PAYLOAD_SIZE, + pm->data = usb_buffer_alloc(udev, POWERMATE_PAYLOAD_SIZE_MAX, SLAB_ATOMIC, &pm->data_dma); if (!pm->data) return -1; @@ -284,7 +288,7 @@ static void powermate_free_buffers(struct usb_device *udev, struct powermate_device *pm) { if (pm->data) - usb_buffer_free(udev, POWERMATE_PAYLOAD_SIZE, + usb_buffer_free(udev, POWERMATE_PAYLOAD_SIZE_MAX, pm->data, pm->data_dma); if (pm->configcr) usb_buffer_free(udev, sizeof(*(pm->configcr)), @@ -347,12 +351,14 @@ pipe = usb_rcvintpipe(udev, endpoint->bEndpointAddress); maxp = usb_maxpacket(udev, pipe, usb_pipeout(pipe)); - if (maxp != POWERMATE_PAYLOAD_SIZE) - printk("powermate: Expected payload of %d bytes, found %d bytes!\n", POWERMATE_PAYLOAD_SIZE, maxp); - + if(maxp < POWERMATE_PAYLOAD_SIZE_MIN || maxp > POWERMATE_PAYLOAD_SIZE_MAX){ + printk("powermate: Expected payload of %d--%d bytes, found %d bytes!\n", + POWERMATE_PAYLOAD_SIZE_MIN, POWERMATE_PAYLOAD_SIZE_MAX, maxp); + maxp = POWERMATE_PAYLOAD_SIZE_MAX; + } usb_fill_int_urb(pm->irq, udev, pipe, pm->data, - POWERMATE_PAYLOAD_SIZE, powermate_irq, + maxp, powermate_irq, pm, endpoint->bInterval); pm->irq->transfer_dma = pm->data_dma; pm->irq->transfer_flags |= URB_NO_TRANSFER_DMA_MAP; --- diff/drivers/video/fbmon.c 2003-06-30 10:07:23.000000000 +0100 +++ source/drivers/video/fbmon.c 2003-11-26 10:09:06.000000000 +0000 @@ -890,30 +890,6 @@ u32 vtotal; }; -/* - * a simple function to get the square root of integers - */ -static u32 fb_sqrt(int x) -{ - register int op, res, one; - - op = x; - res = 0; - - one = 1 << 30; - while (one > op) one >>= 2; - - while (one != 0) { - if (op >= res + one) { - op = op - (res + one); - res = res + 2 * one; - } - res /= 2; - one /= 4; - } - return((u32) res); -} - /** * fb_get_vblank - get vertical blank time * @hfreq: horizontal freq @@ -1002,7 +978,7 @@ h_period += (M_VAL * xres * 2 * 1000)/(5 * dclk); h_period *=10000; - h_period = fb_sqrt((int) h_period); + h_period = int_sqrt(h_period); h_period -= (100 - C_VAL) * 100; h_period *= 1000; h_period /= 2 * M_VAL; --- diff/drivers/video/radeonfb.c 2003-11-25 15:24:58.000000000 +0000 +++ source/drivers/video/radeonfb.c 2003-11-26 10:09:06.000000000 +0000 @@ -679,7 +679,7 @@ */ static char *mode_option __initdata; -static char noaccel = 1; +static char noaccel = 0; static char mirror = 0; static int panel_yres __initdata = 0; static char force_dfp __initdata = 0; @@ -1099,7 +1099,7 @@ printk("radeonfb: detected DFP panel size from BIOS: %dx%d\n", rinfo->panel_xres, rinfo->panel_yres); - for(i=0; i<20; i++) { + for(i=0; i<21; i++) { tmp0 = rinfo->bios_seg + readw(tmp+64+i*2); if (tmp0 == 0) break; @@ -1241,9 +1241,6 @@ radeon_fifo_wait (1); OUTREG(RB2D_DSTCACHE_MODE, 0); - /* XXX */ - rinfo->pitch = ((rinfo->xres_virtual * (rinfo->bpp / 8) + 0x3f)) >> 6; - radeon_fifo_wait (1); temp = INREG(DEFAULT_PITCH_OFFSET); OUTREG(DEFAULT_PITCH_OFFSET, ((temp & 0xc0000000) | @@ -1782,6 +1779,7 @@ int hsync_start, hsync_fudge, bytpp, hsync_wid, vsync_wid; int primary_mon = PRIMARY_MONITOR(rinfo); int depth = var_to_depth(mode); + int accel = (mode->accel_flags & FB_ACCELF_TEXT) != 0; rinfo->xres = mode->xres; rinfo->yres = mode->yres; @@ -1878,7 +1876,15 @@ newmode.crtc_v_sync_strt_wid = (((vSyncStart - 1) & 0xfff) | 
(vsync_wid << 16) | (v_sync_pol << 23)); - newmode.crtc_pitch = (mode->xres_virtual >> 3); + if (accel) { + /* We first calculate the engine pitch */ + rinfo->pitch = ((mode->xres_virtual * ((mode->bits_per_pixel + 1) / 8) + 0x3f) + & ~(0x3f)) >> 6; + + /* Then, re-multiply it to get the CRTC pitch */ + newmode.crtc_pitch = (rinfo->pitch << 3) / ((mode->bits_per_pixel + 1) / 8); + } else + newmode.crtc_pitch = (mode->xres_virtual >> 3); newmode.crtc_pitch |= (newmode.crtc_pitch << 16); #if defined(__BIG_ENDIAN) @@ -2085,18 +2091,21 @@ if (!rinfo->asleep) { radeon_write_mode (rinfo, &newmode); /* (re)initialize the engine */ - if (!noaccel) + if (noaccel) radeon_engine_init (rinfo); } /* Update fix */ - info->fix.line_length = rinfo->pitch*64; + if (accel) + info->fix.line_length = rinfo->pitch*64; + else + info->fix.line_length = mode->xres_virtual * ((mode->bits_per_pixel + 1) / 8); info->fix.visual = rinfo->depth == 8 ? FB_VISUAL_PSEUDOCOLOR : FB_VISUAL_DIRECTCOLOR; #ifdef CONFIG_BOOTX_TEXT /* Update debug text engine */ btext_update_display(rinfo->fb_base_phys, mode->xres, mode->yres, - rinfo->depth, rinfo->pitch*64); + rinfo->depth, info->fix.line_length); #endif return 0; @@ -3022,11 +3031,6 @@ */ radeon_save_state (rinfo, &rinfo->init_state); - if (!noaccel) { - /* initialize the engine */ - radeon_engine_init (rinfo); - } - /* set all the vital stuff */ radeon_set_fbinfo (rinfo); --- diff/drivers/video/sis/init301.c 2003-10-09 09:47:17.000000000 +0100 +++ source/drivers/video/sis/init301.c 2003-11-26 10:09:06.000000000 +0000 @@ -11712,7 +11712,7 @@ } temp = GetOEMLCDPtr(SiS_Pr,HwDeviceExtension, ROMAddr, 1); - if(temp = 0xFFFF) return; + if(temp == 0xFFFF) return; index = SiS_Pr->SiS_VBModeIDTable[ModeIdIndex]._VB_LCDHIndex; for(i=0x14, j=0; i<=0x17; i++, j++) { --- diff/fs/Kconfig 2003-10-27 09:20:38.000000000 +0000 +++ source/fs/Kconfig 2003-11-26 10:09:07.000000000 +0000 @@ -246,6 +246,7 @@ config JFS_FS tristate "JFS filesystem support" + select NLS help This is a port of IBM's Journaled Filesystem . More information is available in the file Documentation/filesystems/jfs.txt. @@ -485,6 +486,7 @@ config JOLIET bool "Microsoft Joliet CDROM extensions" depends on ISO9660_FS + select NLS help Joliet is a Microsoft extension for the ISO 9660 CD-ROM file system which allows for long filenames in unicode format (unicode is the @@ -530,6 +532,7 @@ config FAT_FS tristate "DOS FAT fs support" + select NLS help If you want to use one of the FAT-based file systems (the MS-DOS, VFAT (Windows 95) and UMSDOS (used to run Linux on top of an @@ -651,6 +654,7 @@ config NTFS_FS tristate "NTFS file system support" + select NLS help NTFS is the file system of Microsoft Windows NT, 2000, XP and 2003. @@ -962,6 +966,7 @@ config BEFS_FS tristate "BeOS file systemv(BeFS) support (read only) (EXPERIMENTAL)" depends on EXPERIMENTAL + select NLS help The BeOS File System (BeFS) is the native file system of Be, Inc's BeOS. 
Notable features include support for arbitrary attributes @@ -1440,6 +1445,7 @@ config SMB_FS tristate "SMB file system support (to mount Windows shares etc.)" depends on INET + select NLS help SMB (Server Message Block) is the protocol Windows for Workgroups (WfW), Windows 95/98, Windows NT and OS/2 Lan Manager use to share @@ -1495,6 +1501,7 @@ config CIFS tristate "CIFS support (advanced network filesystem for Samba, Window and other CIFS compliant servers)(EXPERIMENTAL)" depends on INET + select NLS help This is the client VFS module for the Common Internet File System (CIFS) protocol which is the successor to the Server Message Block --- diff/fs/Kconfig.binfmt 2003-11-25 15:24:58.000000000 +0000 +++ source/fs/Kconfig.binfmt 2003-11-26 10:09:07.000000000 +0000 @@ -23,10 +23,6 @@ ld.so (check the file <file:Documentation/Changes> for location and latest version). - To compile this as a module, choose M here: the module will be called - binfmt_elf. Saying M or N here is dangerous because some crucial - programs on your system might be in ELF format. - config BINFMT_FLAT tristate "Kernel support for flat binaries" depends on !MMU || SUPERH --- diff/fs/aio.c 2003-11-25 15:24:58.000000000 +0000 +++ source/fs/aio.c 2003-11-26 10:09:06.000000000 +0000 @@ -27,6 +27,8 @@ #include <linux/aio.h> #include <linux/highmem.h> #include <linux/workqueue.h> +#include <linux/writeback.h> +#include <linux/pagemap.h> #include <asm/kmap_types.h> #include <asm/uaccess.h> @@ -38,6 +40,9 @@ #define dprintk(x...) do { ; } while (0) #endif +long aio_run = 0; /* for testing only */ +long aio_wakeups = 0; /* for testing only */ + /*------ sysctl variables----*/ atomic_t aio_nr = ATOMIC_INIT(0); /* current system wide number of aio requests */ unsigned aio_max_nr = 0x10000; /* system wide maximum number of aio requests */ @@ -47,6 +52,7 @@ static kmem_cache_t *kioctx_cachep; static struct workqueue_struct *aio_wq; +static struct workqueue_struct *aio_fput_wq; /* Used for rare fput completion. */ static void aio_fput_routine(void *); @@ -74,6 +80,7 @@ panic("unable to create kioctx cache"); aio_wq = create_workqueue("aio"); + aio_fput_wq = create_workqueue("aio_fput"); pr_debug("aio_setup: sizeof(struct page) = %d\n", (int)sizeof(struct page)); @@ -281,6 +288,7 @@ struct kiocb *iocb = list_kiocb(pos); list_del_init(&iocb->ki_list); cancel = iocb->ki_cancel; + kiocbSetCancelled(iocb); if (cancel) { iocb->ki_users++; spin_unlock_irq(&ctx->ctx_lock); @@ -341,6 +349,11 @@ aio_cancel_all(ctx); wait_for_all_aios(ctx); + /* + * this is an overkill, but ensures we don't leave + * the ctx on the aio_wq + */ + flush_workqueue(aio_wq); if (1 != atomic_read(&ctx->users)) printk(KERN_DEBUG @@ -400,6 +413,7 @@ req->ki_cancel = NULL; req->ki_retry = NULL; req->ki_user_obj = NULL; + INIT_LIST_HEAD(&req->ki_run_list); /* Check if the completion queue has enough free space to * accept an event from this io. @@ -499,7 +513,7 @@ spin_lock(&fput_lock); list_add(&req->ki_list, &fput_head); spin_unlock(&fput_lock); - queue_work(aio_wq, &fput_work); + queue_work(aio_fput_wq, &fput_work); } else really_put_req(ctx, req); return 1; @@ -541,65 +555,324 @@ return ioctx; } +/* + * use_mm + * Makes the calling kernel thread take on the specified + * mm context. + * Called by the retry thread to execute retries within the + * iocb issuer's mm context, so that copy_from/to_user + * operations work seamlessly for aio. + * (Note: this routine is intended to be called only + * from a kernel thread context) + */
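use_mm()/unuse_mm() below are what let the retry path run with the submitter's address space. Combined with set_fs(), the usage pattern looks as follows; this is a condensed sketch with locking elided — the real aio_kick_handler later in this patch holds ctx->ctx_lock around the run-list processing:

	static void run_iocbs_in_issuer_mm(struct kioctx *ctx)	/* illustrative only */
	{
		mm_segment_t oldfs = get_fs();

		set_fs(USER_DS);	/* make copy_*_user() range-check user pointers */
		use_mm(ctx->mm);	/* adopt the submitting task's address space */
		/* ... run the iocbs queued on ctx->run_list ... */
		unuse_mm(ctx->mm);
		set_fs(oldfs);
	}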
static void use_mm(struct mm_struct *mm) { - struct mm_struct *active_mm = current->active_mm; + struct mm_struct *active_mm; + struct task_struct *tsk = current; + + task_lock(tsk); + active_mm = tsk->active_mm; atomic_inc(&mm->mm_count); - current->mm = mm; - if (mm != active_mm) { - current->active_mm = mm; - activate_mm(active_mm, mm); - } + tsk->mm = mm; + tsk->active_mm = mm; + activate_mm(active_mm, mm); + task_unlock(tsk); + + mmdrop(active_mm); } -static void unuse_mm(struct mm_struct *mm) +/* + * unuse_mm + * Reverses the effect of use_mm, i.e. releases the + * specified mm context which was earlier taken on + * by the calling kernel thread + * (Note: this routine is intended to be called only + * from a kernel thread context) + * + * Comments: Called with ctx->ctx_lock held. This nests + * the task_lock inside the ctx_lock. + */ +void unuse_mm(struct mm_struct *mm) { - current->mm = NULL; + struct task_struct *tsk = current; + + task_lock(tsk); + tsk->mm = NULL; /* active_mm is still 'mm' */ - enter_lazy_tlb(mm, current); + enter_lazy_tlb(mm, tsk); + task_unlock(tsk); } -/* Run on kevent's context. FIXME: needs to be per-cpu and warn if an - * operation blocks. +/* + * Queue up a kiocb to be retried. Assumes that the kiocb + * has already been marked as kicked, and places it on + * the retry run list for the corresponding ioctx, if it + * isn't already queued. Returns 1 if it actually queued + * the kiocb (to tell the caller to activate the work + * queue to process it), or 0, if it found that it was + * already queued. + * + * Should be called with the spin lock iocb->ki_ctx->ctx_lock + * held */ -static void aio_kick_handler(void *data) +static inline int __queue_kicked_iocb(struct kiocb *iocb) { - struct kioctx *ctx = data; + struct kioctx *ctx = iocb->ki_ctx; - use_mm(ctx->mm); + if (list_empty(&iocb->ki_run_list)) { + list_add_tail(&iocb->ki_run_list, + &ctx->run_list); + iocb->ki_queued++; + return 1; + } + return 0; +} - spin_lock_irq(&ctx->ctx_lock); - while (!list_empty(&ctx->run_list)) { - struct kiocb *iocb; - long ret; +/* aio_run_iocb + * This is the core aio execution routine. It is + * invoked both for initial i/o submission and + * subsequent retries via the aio_kick_handler. + * Expects to be invoked with iocb->ki_ctx->lock + * already held. The lock is released and reacquired + * as needed during processing. + * + * Calls the iocb retry method (already setup for the + * iocb on initial submission) for operation specific + * handling, but takes care of most of the common retry + * execution details for a given iocb. The retry method + * needs to be non-blocking as far as possible, to avoid + * holding up other iocbs waiting to be serviced by the + * retry kernel thread. + * + * The trickier parts in this code have to do with + * ensuring that only one retry instance is in progress + * for a given iocb at any time. Providing that guarantee + * simplifies the coding of individual aio operations as + * it avoids various potential races. + */ +static ssize_t aio_run_iocb(struct kiocb *iocb) +{ + struct kioctx *ctx = iocb->ki_ctx; + ssize_t (*retry)(struct kiocb *); + ssize_t ret; - iocb = list_entry(ctx->run_list.next, struct kiocb, - ki_run_list); - list_del(&iocb->ki_run_list); - iocb->ki_users ++; - spin_unlock_irq(&ctx->ctx_lock); + if (iocb->ki_retried++ > 1024*1024) { + printk("Maximal retry count.
Bytes done %Zd\n", + iocb->ki_nbytes - iocb->ki_left); + return -EAGAIN; + } - kiocbClearKicked(iocb); - ret = iocb->ki_retry(iocb); + if (!(iocb->ki_retried & 0xff)) { + pr_debug("%ld retry: %d of %d (kick %ld, Q %ld run %ld, wake %ld)\n", + iocb->ki_retried, + iocb->ki_nbytes - iocb->ki_left, iocb->ki_nbytes, + iocb->ki_kicked, iocb->ki_queued, aio_run, aio_wakeups); + } + + if (!(retry = iocb->ki_retry)) { + printk("aio_run_iocb: iocb->ki_retry = NULL\n"); + return 0; + } + + /* + * We don't want the next retry iteration for this + * operation to start until this one has returned and + * updated the iocb state. However, wait_queue functions + * can trigger a kick_iocb from interrupt context in the + * meantime, indicating that data is available for the next + * iteration. We want to remember that and enable the + * next retry iteration _after_ we are through with + * this one. + * + * So, in order to be able to register a "kick", but + * prevent it from being queued now, we clear the kick + * flag, but make the kick code *think* that the iocb is + * still on the run list until we are actually done. + * When we are done with this iteration, we check if + * the iocb was kicked in the meantime and if so, queue + * it up afresh. + */ + + kiocbClearKicked(iocb); + + /* + * This is so that aio_complete knows it doesn't need to + * pull the iocb off the run list (We can't just call + * INIT_LIST_HEAD because we don't want a kick_iocb to + * queue this on the run list yet) + */ + iocb->ki_run_list.next = iocb->ki_run_list.prev = NULL; + iocb->ki_retry = NULL; + spin_unlock_irq(&ctx->ctx_lock); + + /* Quit retrying if the i/o has been cancelled */ + if (kiocbIsCancelled(iocb)) { + ret = -EINTR; + aio_complete(iocb, ret, 0); + /* must not access the iocb after this */ + goto out; + } + + /* + * Now we are all set to call the retry method in async + * context. By setting this thread's io_wait context + * to point to the wait queue entry inside the currently + * running iocb for the duration of the retry, we ensure + * that async notification wakeups are queued by the + * operation instead of blocking waits, and when notified, + * cause the iocb to be kicked for continuation (through + * the aio_wake_function callback). + */ + BUG_ON(current->io_wait != NULL); + current->io_wait = &iocb->ki_wait; + ret = retry(iocb); + current->io_wait = NULL; + + if (-EIOCBRETRY != ret) { if (-EIOCBQUEUED != ret) { + BUG_ON(!list_empty(&iocb->ki_wait.task_list)); aio_complete(iocb, ret, 0); - iocb = NULL; + /* must not access the iocb after this */ } + } else { + /* + * Issue an additional retry to avoid waiting forever if + * no waits were queued (e.g. in case of a short read). + */ + if (list_empty(&iocb->ki_wait.task_list)) + kiocbSetKicked(iocb); + } +out: + spin_lock_irq(&ctx->ctx_lock); - spin_lock_irq(&ctx->ctx_lock); - if (NULL != iocb) - __aio_put_req(ctx, iocb); + if (-EIOCBRETRY == ret) { + /* + * OK, now that we are done with this iteration + * and know that there is more left to go, + * this is where we let go so that a subsequent + * "kick" can start the next iteration + */ + iocb->ki_retry = retry; + /* will make __queue_kicked_iocb succeed from here on */ + INIT_LIST_HEAD(&iocb->ki_run_list); + /* we must queue the next iteration ourselves, if it + * has already been kicked */ + if (kiocbIsKicked(iocb)) { + __queue_kicked_iocb(iocb); + } } + return ret; +} + +/* + * __aio_run_iocbs: + * Process all pending retries queued on the ioctx + * run list. 
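The single-retry-instance guarantee described above boils down to a small flag dance; the following lines are condensed from aio_run_iocb for reference, not new code:

	kiocbClearKicked(iocb);			/* accept new kicks from now on... */
	iocb->ki_run_list.next = iocb->ki_run_list.prev = NULL;
						/* ...but park them: a non-empty-looking
						 * entry makes __queue_kicked_iocb() a
						 * no-op while this iteration runs */
	spin_unlock_irq(&ctx->ctx_lock);
	ret = retry(iocb);			/* the actual non-blocking I/O step */
	spin_lock_irq(&ctx->ctx_lock);
	if (ret == -EIOCBRETRY) {
		INIT_LIST_HEAD(&iocb->ki_run_list);	/* re-arm queueing */
		if (kiocbIsKicked(iocb))
			__queue_kicked_iocb(iocb);	/* a kick arrived meanwhile */
	}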
+ * Assumes it is operating within the aio issuer's mm + * context. Expects to be called with ctx->ctx_lock held + */ +static int __aio_run_iocbs(struct kioctx *ctx) +{ + struct kiocb *iocb; + int count = 0; + LIST_HEAD(run_list); + + list_splice_init(&ctx->run_list, &run_list); + while (!list_empty(&run_list)) { + iocb = list_entry(run_list.next, struct kiocb, + ki_run_list); + list_del(&iocb->ki_run_list); + /* + * Hold an extra reference while retrying i/o. + */ + iocb->ki_users++; /* grab extra reference */ + aio_run_iocb(iocb); + if (__aio_put_req(ctx, iocb)) /* drop extra ref */ + put_ioctx(ctx); + count++; + } + aio_run++; + if (!list_empty(&ctx->run_list)) + return 1; + return 0; +} + +/* + * aio_run_iocbs: + * Process all pending retries queued on the ioctx + * run list. + * Assumes it is operating within the aio issuer's mm + * context. + */ +static inline void aio_run_iocbs(struct kioctx *ctx) +{ + int requeue; + + spin_lock_irq(&ctx->ctx_lock); + requeue = __aio_run_iocbs(ctx); spin_unlock_irq(&ctx->ctx_lock); + if (requeue) + queue_work(aio_wq, &ctx->wq); +} +/* + * aio_kick_handler: + * Work queue handler triggered to process pending + * retries on an ioctx. Takes on the aio issuer's + * mm context before running the iocbs, so that + * copy_xxx_user operates on the issuer's address + * space. + * Run on aiod's context. + */ +static void aio_kick_handler(void *data) +{ + struct kioctx *ctx = data; + mm_segment_t oldfs = get_fs(); + int requeue; + + set_fs(USER_DS); + use_mm(ctx->mm); + spin_lock_irq(&ctx->ctx_lock); + requeue = __aio_run_iocbs(ctx); unuse_mm(ctx->mm); + spin_unlock_irq(&ctx->ctx_lock); + set_fs(oldfs); + if (requeue) + queue_work(aio_wq, &ctx->wq); } -void kick_iocb(struct kiocb *iocb) + +/* + * Called by kick_iocb to queue the kiocb for retry + * and if required activate the aio work queue to process + * it + */ +void queue_kicked_iocb(struct kiocb *iocb) { struct kioctx *ctx = iocb->ki_ctx; + unsigned long flags; + int run = 0; + + WARN_ON((!list_empty(&iocb->ki_wait.task_list))); + spin_lock_irqsave(&ctx->ctx_lock, flags); + run = __queue_kicked_iocb(iocb); + spin_unlock_irqrestore(&ctx->ctx_lock, flags); + if (run) { + queue_work(aio_wq, &ctx->wq); + aio_wakeups++; + } +} + +/* + * kick_iocb: + * Called typically from a wait queue callback context + * (aio_wake_function) to trigger a retry of the iocb. + * The retry is usually executed by aio workqueue + * threads (See aio_kick_handler). + */ +void kick_iocb(struct kiocb *iocb) +{ /* sync iocbs are easy: they can only ever be executing from a * single context. 
*/ if (is_sync_kiocb(iocb)) { @@ -608,12 +881,10 @@ return; } + iocb->ki_kicked++; + /* If its already kicked we shouldn't queue it again */ if (!kiocbTryKick(iocb)) { - unsigned long flags; - spin_lock_irqsave(&ctx->ctx_lock, flags); - list_add_tail(&iocb->ki_run_list, &ctx->run_list); - spin_unlock_irqrestore(&ctx->ctx_lock, flags); - schedule_work(&ctx->wq); + queue_kicked_iocb(iocb); } } @@ -666,6 +937,9 @@ */ spin_lock_irqsave(&ctx->ctx_lock, flags); + if (iocb->ki_run_list.prev && !list_empty(&iocb->ki_run_list)) + list_del_init(&iocb->ki_run_list); + ring = kmap_atomic(info->ring_pages[0], KM_IRQ1); tail = info->tail; @@ -694,6 +968,11 @@ pr_debug("added to ring %p at [%lu]\n", iocb, tail); + pr_debug("%ld retries: %d of %d (kicked %ld, Q %ld run %ld wake %ld)\n", + iocb->ki_retried, + iocb->ki_nbytes - iocb->ki_left, iocb->ki_nbytes, + iocb->ki_kicked, iocb->ki_queued, aio_run, aio_wakeups); + /* everything turned out well, dispose of the aiocb. */ ret = __aio_put_req(ctx, iocb); @@ -808,6 +1087,7 @@ int i = 0; struct io_event ent; struct timeout to; + int event_loop = 0; /* testing only */ /* needed to zero any padding within an entry (there shouldn't be * any, but C is fun! @@ -857,7 +1137,6 @@ add_wait_queue_exclusive(&ctx->wait, &wait); do { set_task_state(tsk, TASK_INTERRUPTIBLE); - ret = aio_read_evt(ctx, &ent); if (ret) break; @@ -867,6 +1146,7 @@ if (to.timed_out) /* Only check after read evt */ break; schedule(); + event_loop++; if (signal_pending(tsk)) { ret = -EINTR; break; @@ -894,6 +1174,9 @@ if (timeout) clear_timeout(&to); out: + pr_debug("event loop executed %d times\n", event_loop); + pr_debug("aio_run %ld\n", aio_run); + pr_debug("aio_wakeups %ld\n", aio_wakeups); return i ? i : ret; } @@ -923,6 +1206,11 @@ aio_cancel_all(ioctx); wait_for_all_aios(ioctx); + /* + * this is an overkill, but ensures we don't leave + * the ctx on the aio_wq + */ + flush_workqueue(aio_wq); put_ioctx(ioctx); /* once for the lookup */ } @@ -985,13 +1273,191 @@ return -EINVAL; } +/* + * Retry method for aio_read (also used for first time submit) + * Responsible for updating iocb state as retries progress + */ +static ssize_t aio_pread(struct kiocb *iocb) +{ + struct file *file = iocb->ki_filp; + ssize_t ret = 0; + + ret = file->f_op->aio_read(iocb, iocb->ki_buf, + iocb->ki_left, iocb->ki_pos); + + /* + * Can't just depend on iocb->ki_left to determine + * whether we are done. This may have been a short read. + */ + if (ret > 0) { + iocb->ki_buf += ret; + iocb->ki_left -= ret; + + ret = -EIOCBRETRY; + } + + /* This means we must have transferred all that we could */ + /* No need to retry anymore */ + if ((ret == 0) || (iocb->ki_left == 0)) + ret = iocb->ki_nbytes - iocb->ki_left; + + return ret; +} + +/* + * Retry method for aio_write (also used for first time submit) + * Responsible for updating iocb state as retries progress + */ +static ssize_t aio_pwrite(struct kiocb *iocb) +{ + struct file *file = iocb->ki_filp; + struct address_space *mapping = file->f_mapping; + struct inode *inode = mapping->host; + ssize_t ret = 0; + + ret = file->f_op->aio_write(iocb, iocb->ki_buf, + iocb->ki_left, iocb->ki_pos); + + /* + * Even if iocb->ki_left = 0, we may need to wait + * for a balance_dirty_pages to complete + */ + if (ret > 0) { + iocb->ki_buf += iocb->ki_buf ? 
ret : 0; + iocb->ki_left -= ret; + + ret = -EIOCBRETRY; + } + + /* This means we must have transferred all that we could */ + /* No need to retry anymore unless we need to osync data */ + if (ret == 0) { + ret = iocb->ki_nbytes - iocb->ki_left; + if (!iocb->ki_buf) + return ret; + + /* Set things up for potential O_SYNC */ + if (ret > 0 && ((file->f_flags & O_SYNC) || IS_SYNC(inode))) { + iocb->ki_buf = NULL; + iocb->ki_pos -= ret; /* back up fpos */ + iocb->ki_left = ret; /* sync what we have written out */ + iocb->ki_nbytes = ret; + ret = -EIOCBRETRY; + } + } + + return ret; +} + +static ssize_t aio_fdsync(struct kiocb *iocb) +{ + struct file *file = iocb->ki_filp; + ssize_t ret = -EINVAL; + + if (file->f_op->aio_fsync) + ret = file->f_op->aio_fsync(iocb, 1); + return ret; +} + +static ssize_t aio_fsync(struct kiocb *iocb) +{ + struct file *file = iocb->ki_filp; + ssize_t ret = -EINVAL; + + if (file->f_op->aio_fsync) + ret = file->f_op->aio_fsync(iocb, 0); + return ret; +} + +/* + * aio_setup_iocb: + * Performs the initial checks and aio retry method + * setup for the kiocb at the time of io submission. + */ +ssize_t aio_setup_iocb(struct kiocb *kiocb) +{ + struct file *file = kiocb->ki_filp; + ssize_t ret = 0; + + switch (kiocb->ki_opcode) { + case IOCB_CMD_PREAD: + ret = -EBADF; + if (unlikely(!(file->f_mode & FMODE_READ))) + break; + ret = -EFAULT; + if (unlikely(!access_ok(VERIFY_WRITE, kiocb->ki_buf, + kiocb->ki_left))) + break; + ret = -EINVAL; + if (file->f_op->aio_read) + kiocb->ki_retry = aio_pread; + break; + case IOCB_CMD_PWRITE: + ret = -EBADF; + if (unlikely(!(file->f_mode & FMODE_WRITE))) + break; + ret = -EFAULT; + if (unlikely(!access_ok(VERIFY_READ, kiocb->ki_buf, + kiocb->ki_left))) + break; + ret = -EINVAL; + if (file->f_op->aio_write) + kiocb->ki_retry = aio_pwrite; + break; + case IOCB_CMD_FDSYNC: + ret = -EINVAL; + if (file->f_op->aio_fsync) + kiocb->ki_retry = aio_fdsync; + break; + case IOCB_CMD_FSYNC: + ret = -EINVAL; + if (file->f_op->aio_fsync) + kiocb->ki_retry = aio_fsync; + break; + default: + dprintk("EINVAL: io_submit: no operation provided\n"); + ret = -EINVAL; + } + + if (!kiocb->ki_retry) + return ret; + + return 0; +} + +/* + * aio_wake_function: + * wait queue callback function for aio notification, + * Simply triggers a retry of the operation via kick_iocb. + * + * This callback is specified in the wait queue entry in + * a kiocb (current->io_wait points to this wait queue + * entry when an aio operation executes; it is used + * instead of a synchronous wait when an i/o blocking + * condition is encountered during aio). + * + * Note: + * This routine is executed with the wait queue lock held. + * Since kick_iocb acquires iocb->ctx->ctx_lock, it nests + * the ioctx lock inside the wait queue lock. This is safe + * because this callback isn't used for wait queues which + * are nested inside ioctx lock (i.e. 
ctx->wait) + */ +int aio_wake_function(wait_queue_t *wait, unsigned mode, int sync) +{ + struct kiocb *iocb = container_of(wait, struct kiocb, ki_wait); + + list_del_init(&wait->task_list); + kick_iocb(iocb); + return 1; +} + int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb, struct iocb *iocb) { struct kiocb *req; struct file *file; ssize_t ret; - char *buf; /* enforce forwards compatibility on users */ if (unlikely(iocb->aio_reserved1 || iocb->aio_reserved2 || @@ -1032,52 +1498,31 @@ req->ki_user_data = iocb->aio_data; req->ki_pos = iocb->aio_offset; - buf = (char *)(unsigned long)iocb->aio_buf; + req->ki_buf = (char *)(unsigned long)iocb->aio_buf; + req->ki_left = req->ki_nbytes = iocb->aio_nbytes; + req->ki_opcode = iocb->aio_lio_opcode; + init_waitqueue_func_entry(&req->ki_wait, aio_wake_function); + INIT_LIST_HEAD(&req->ki_wait.task_list); + req->ki_run_list.next = req->ki_run_list.prev = NULL; + req->ki_retry = NULL; + req->ki_retried = 0; + req->ki_kicked = 0; + req->ki_queued = 0; + aio_run = 0; + aio_wakeups = 0; - switch (iocb->aio_lio_opcode) { - case IOCB_CMD_PREAD: - ret = -EBADF; - if (unlikely(!(file->f_mode & FMODE_READ))) - goto out_put_req; - ret = -EFAULT; - if (unlikely(!access_ok(VERIFY_WRITE, buf, iocb->aio_nbytes))) - goto out_put_req; - ret = -EINVAL; - if (file->f_op->aio_read) - ret = file->f_op->aio_read(req, buf, - iocb->aio_nbytes, req->ki_pos); - break; - case IOCB_CMD_PWRITE: - ret = -EBADF; - if (unlikely(!(file->f_mode & FMODE_WRITE))) - goto out_put_req; - ret = -EFAULT; - if (unlikely(!access_ok(VERIFY_READ, buf, iocb->aio_nbytes))) - goto out_put_req; - ret = -EINVAL; - if (file->f_op->aio_write) - ret = file->f_op->aio_write(req, buf, - iocb->aio_nbytes, req->ki_pos); - break; - case IOCB_CMD_FDSYNC: - ret = -EINVAL; - if (file->f_op->aio_fsync) - ret = file->f_op->aio_fsync(req, 1); - break; - case IOCB_CMD_FSYNC: - ret = -EINVAL; - if (file->f_op->aio_fsync) - ret = file->f_op->aio_fsync(req, 0); - break; - default: - dprintk("EINVAL: io_submit: no operation provided\n"); - ret = -EINVAL; - } + ret = aio_setup_iocb(req); + + if (ret) + goto out_put_req; + + spin_lock_irq(&ctx->ctx_lock); + ret = aio_run_iocb(req); + spin_unlock_irq(&ctx->ctx_lock); + if (-EIOCBRETRY == ret) + queue_work(aio_wq, &ctx->wq); aio_put_req(req); /* drop extra ref to req */ - if (likely(-EIOCBQUEUED == ret)) - return 0; - aio_complete(req, ret, 0); /* will drop i/o ref to req */ return 0; out_put_req: --- diff/fs/binfmt_elf.c 2003-10-27 09:20:44.000000000 +0000 +++ source/fs/binfmt_elf.c 2003-11-26 10:09:06.000000000 +0000 @@ -82,13 +82,17 @@ #define BAD_ADDR(x) ((unsigned long)(x) > TASK_SIZE) -static void set_brk(unsigned long start, unsigned long end) +static int set_brk(unsigned long start, unsigned long end) { start = ELF_PAGEALIGN(start); end = ELF_PAGEALIGN(end); - if (end > start) - do_brk(start, end - start); + if (end > start) { + unsigned long addr = do_brk(start, end - start); + if (BAD_ADDR(addr)) + return addr; + } current->mm->start_brk = current->mm->brk = end; + return 0; } @@ -381,8 +385,11 @@ elf_bss = ELF_PAGESTART(elf_bss + ELF_MIN_ALIGN - 1); /* What we have mapped so far */ /* Map the last of the bss segment */ - if (last_bss > elf_bss) - do_brk(elf_bss, last_bss - elf_bss); + if (last_bss > elf_bss) { + error = do_brk(elf_bss, last_bss - elf_bss); + if (BAD_ADDR(error)) + goto out_close; + } *interp_load_addr = load_addr; error = ((unsigned long) interp_elf_ex->e_entry) + load_addr; @@ -672,7 +679,12 @@ /* There was a PT_LOAD 
segment with p_memsz > p_filesz before this one. Map anonymous pages, if needed, and clear the area. */ - set_brk (elf_bss + load_bias, elf_brk + load_bias); + retval = set_brk (elf_bss + load_bias, + elf_brk + load_bias); + if (retval) { + send_sig(SIGKILL, current, 0); + goto out_free_dentry; + } nbyte = ELF_PAGEOFFSET(elf_bss); if (nbyte) { nbyte = ELF_MIN_ALIGN - nbyte; @@ -737,6 +749,18 @@ start_data += load_bias; end_data += load_bias; + /* Calling set_brk effectively mmaps the pages that we need + * for the bss and break sections. We must do this before + * mapping in the interpreter, to make sure it doesn't wind + * up getting placed where the bss needs to go. + */ + retval = set_brk(elf_bss, elf_brk); + if (retval) { + send_sig(SIGKILL, current, 0); + goto out_free_dentry; + } + padzero(elf_bss); + if (elf_interpreter) { if (interpreter_type == INTERPRETER_AOUT) elf_entry = load_aout_interp(&interp_ex, @@ -782,13 +806,6 @@ current->mm->end_data = end_data; current->mm->start_stack = bprm->p; - /* Calling set_brk effectively mmaps the pages that we need - * for the bss and break sections - */ - set_brk(elf_bss, elf_brk); - - padzero(elf_bss); - if (current->personality & MMAP_PAGE_ZERO) { /* Why this, you ask??? Well SVr4 maps page 0 as read-only, and some applications "depend" upon this behavior. --- diff/fs/block_dev.c 2003-10-09 09:47:34.000000000 +0100 +++ source/fs/block_dev.c 2003-11-26 10:09:06.000000000 +0000 @@ -25,6 +25,22 @@ #include <linux/namei.h> #include <asm/uaccess.h> +struct bdev_inode { + struct block_device bdev; + struct inode vfs_inode; +}; + +static inline struct bdev_inode *BDEV_I(struct inode *inode) +{ + return container_of(inode, struct bdev_inode, vfs_inode); +} + +inline struct block_device *I_BDEV(struct inode *inode) +{ + return &BDEV_I(inode)->bdev; +} + +EXPORT_SYMBOL(I_BDEV); static sector_t max_block(struct block_device *bdev) { @@ -100,10 +116,10 @@ blkdev_get_block(struct inode *inode, sector_t iblock, struct buffer_head *bh, int create) { - if (iblock >= max_block(inode->i_bdev)) + if (iblock >= max_block(I_BDEV(inode))) return -EIO; - bh->b_bdev = inode->i_bdev; + bh->b_bdev = I_BDEV(inode); bh->b_blocknr = iblock; set_buffer_mapped(bh); return 0; @@ -113,10 +129,10 @@ blkdev_get_blocks(struct inode *inode, sector_t iblock, unsigned long max_blocks, struct buffer_head *bh, int create) { - if ((iblock + max_blocks) > max_block(inode->i_bdev)) + if ((iblock + max_blocks) > max_block(I_BDEV(inode))) return -EIO; - bh->b_bdev = inode->i_bdev; + bh->b_bdev = I_BDEV(inode); bh->b_blocknr = iblock; bh->b_size = max_blocks << inode->i_blkbits; set_buffer_mapped(bh); @@ -128,9 +144,9 @@ loff_t offset, unsigned long nr_segs) { struct file *file = iocb->ki_filp; - struct inode *inode = file->f_dentry->d_inode->i_mapping->host; + struct inode *inode = file->f_mapping->host; - return blockdev_direct_IO(rw, iocb, inode, inode->i_bdev, iov, offset, + return blockdev_direct_IO(rw, iocb, inode, I_BDEV(inode), iov, offset, nr_segs, blkdev_get_blocks, NULL); } @@ -161,11 +177,10 @@ */ static loff_t block_llseek(struct file *file, loff_t offset, int origin) { - struct inode *bd_inode; + struct inode *bd_inode = file->f_mapping->host; loff_t size; loff_t retval; - bd_inode = file->f_dentry->d_inode->i_bdev->bd_inode; down(&bd_inode->i_sem); size = i_size_read(bd_inode); @@ -188,15 +203,13 @@ } /* - * Filp may be NULL when we are called by an msync of a vma - * since the vma has no handle. 
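Both the new I_BDEV() and the reworked bd_acquire() above rest on one layout fact: the block device and its inode are allocated together in a single bdev_inode, so either can be recovered from the other by pointer arithmetic. A minimal restatement of the idiom, with my_bdev_of() as a hypothetical name for what the patch calls I_BDEV():

	#include <linux/kernel.h>	/* container_of() */

	static inline struct block_device *my_bdev_of(struct inode *inode)
	{
		/* inode is the vfs_inode member of some struct bdev_inode */
		return &container_of(inode, struct bdev_inode, vfs_inode)->bdev;
	}

	/* typical call site, as in the reworked file_operations methods:
	 *	struct block_device *bdev = my_bdev_of(filp->f_mapping->host);
	 */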
+ * Filp is never NULL; the only case when ->fsync() is called with + * NULL first argument is nfsd_sync_dir() and that's not a directory. */ static int block_fsync(struct file *filp, struct dentry *dentry, int datasync) { - struct inode * inode = dentry->d_inode; - - return sync_blockdev(inode->i_bdev); + return sync_blockdev(I_BDEV(filp->f_mapping->host)); } /* @@ -206,16 +219,6 @@ static spinlock_t bdev_lock __cacheline_aligned_in_smp = SPIN_LOCK_UNLOCKED; static kmem_cache_t * bdev_cachep; -struct bdev_inode { - struct block_device bdev; - struct inode vfs_inode; -}; - -static inline struct bdev_inode *BDEV_I(struct inode *inode) -{ - return container_of(inode, struct bdev_inode, vfs_inode); -} - static struct inode *bdev_alloc_inode(struct super_block *sb) { struct bdev_inode *ei = kmem_cache_alloc(bdev_cachep, SLAB_KERNEL); @@ -387,26 +390,27 @@ EXPORT_SYMBOL(bdput); -int bd_acquire(struct inode *inode) +static struct block_device *bd_acquire(struct inode *inode) { struct block_device *bdev; spin_lock(&bdev_lock); - if (inode->i_bdev && igrab(inode->i_bdev->bd_inode)) { + bdev = inode->i_bdev; + if (bdev && igrab(bdev->bd_inode)) { spin_unlock(&bdev_lock); - return 0; + return bdev; } spin_unlock(&bdev_lock); bdev = bdget(inode->i_rdev); - if (!bdev) - return -ENOMEM; - spin_lock(&bdev_lock); - if (inode->i_bdev) - __bd_forget(inode); - inode->i_bdev = bdev; - inode->i_mapping = bdev->bd_inode->i_mapping; - list_add(&inode->i_devices, &bdev->bd_inodes); - spin_unlock(&bdev_lock); - return 0; + if (bdev) { + spin_lock(&bdev_lock); + if (inode->i_bdev) + __bd_forget(inode); + inode->i_bdev = bdev; + inode->i_mapping = bdev->bd_inode->i_mapping; + list_add(&inode->i_devices, &bdev->bd_inodes); + spin_unlock(&bdev_lock); + } + return bdev; } /* Call when you free inode */ @@ -531,13 +535,14 @@ bdev->bd_inode->i_blkbits = blksize_bits(bsize); } -static int do_open(struct block_device *bdev, struct inode *inode, struct file *file) +static int do_open(struct block_device *bdev, struct file *file) { struct module *owner = NULL; struct gendisk *disk; int ret = -ENXIO; int part; + file->f_mapping = bdev->bd_inode->i_mapping; lock_kernel(); disk = get_gendisk(bdev->bd_dev, &part); if (!disk) { @@ -554,7 +559,7 @@ if (!part) { struct backing_dev_info *bdi; if (disk->fops->open) { - ret = disk->fops->open(inode, file); + ret = disk->fops->open(bdev, file); if (ret) goto out_first; } @@ -599,7 +604,7 @@ module_put(owner); if (bdev->bd_contains == bdev) { if (bdev->bd_disk->fops->open) { - ret = bdev->bd_disk->fops->open(inode, file); + ret = bdev->bd_disk->fops->open(bdev, file); if (ret) goto out; } @@ -647,7 +652,7 @@ fake_file.f_dentry = &fake_dentry; fake_dentry.d_inode = bdev->bd_inode; - return do_open(bdev, bdev->bd_inode, &fake_file); + return do_open(bdev, &fake_file); } EXPORT_SYMBOL(blkdev_get); @@ -665,10 +670,9 @@ */ filp->f_flags |= O_LARGEFILE; - bd_acquire(inode); - bdev = inode->i_bdev; + bdev = bd_acquire(inode); - res = do_open(bdev, inode, filp); + res = do_open(bdev, filp); if (res) return res; @@ -687,7 +691,6 @@ int blkdev_put(struct block_device *bdev, int kind) { int ret = 0; - struct inode *bd_inode = bdev->bd_inode; struct gendisk *disk = bdev->bd_disk; down(&bdev->bd_sem); @@ -696,14 +699,14 @@ switch (kind) { case BDEV_FILE: case BDEV_FS: - sync_blockdev(bd_inode->i_bdev); + sync_blockdev(bdev); break; } kill_bdev(bdev); } if (bdev->bd_contains == bdev) { if (disk->fops->release) - ret = disk->fops->release(bd_inode, NULL); + ret = disk->fops->release(disk); } else { 
down(&bdev->bd_contains->bd_sem); bdev->bd_contains->bd_part_count--; @@ -734,11 +737,12 @@ EXPORT_SYMBOL(blkdev_put); -int blkdev_close(struct inode * inode, struct file * filp) +static int blkdev_close(struct inode * inode, struct file * filp) { - if (inode->i_bdev->bd_holder == filp) - bd_release(inode->i_bdev); - return blkdev_put(inode->i_bdev, BDEV_FILE); + struct block_device *bdev = I_BDEV(filp->f_mapping->host); + if (bdev->bd_holder == filp) + bd_release(bdev); + return blkdev_put(bdev, BDEV_FILE); } static ssize_t blkdev_file_write(struct file *file, const char __user *buf, @@ -757,6 +761,11 @@ return generic_file_aio_write_nolock(iocb, &local_iov, 1, &iocb->ki_pos); } +static int block_ioctl(struct inode *inode, struct file *file, unsigned cmd, + unsigned long arg) +{ + return blkdev_ioctl(I_BDEV(file->f_mapping->host), file, cmd, arg); +} struct address_space_operations def_blk_aops = { .readpage = blkdev_readpage, @@ -778,7 +787,7 @@ .aio_write = blkdev_file_aio_write, .mmap = generic_file_mmap, .fsync = block_fsync, - .ioctl = blkdev_ioctl, + .ioctl = block_ioctl, .readv = generic_file_readv, .writev = generic_file_writev, .sendfile = generic_file_sendfile, @@ -791,7 +800,7 @@ int res; mm_segment_t old_fs = get_fs(); set_fs(KERNEL_DS); - res = blkdev_ioctl(bdev->bd_inode, NULL, cmd, arg); + res = blkdev_ioctl(bdev, NULL, cmd, arg); set_fs(old_fs); return res; } @@ -828,11 +837,10 @@ error = -EACCES; if (nd.mnt->mnt_flags & MNT_NODEV) goto fail; - error = bd_acquire(inode); - if (error) + error = -ENOMEM; + bdev = bd_acquire(inode); + if (!bdev) goto fail; - bdev = inode->i_bdev; - out: path_release(&nd); return bdev; --- diff/fs/buffer.c 2003-10-09 09:47:34.000000000 +0100 +++ source/fs/buffer.c 2003-11-26 10:09:06.000000000 +0000 @@ -116,27 +116,50 @@ } /* - * Block until a buffer comes unlocked. This doesn't stop it + * Wait until a buffer comes unlocked. This doesn't stop it * from becoming locked again - you have to lock it yourself * if you want to preserve its state. + * If the wait queue parameter specifies an async i/o callback, + * then instead of blocking, we just queue up the callback + * on the wait queue for async notification when the buffer gets + * unlocked. 
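From the caller's side the convention looks as follows — a sketch assuming the wait_on_buffer_wq() wrapper the patched headers provide around __wait_on_buffer_wq(), used the same way by __block_prepare_write further down:

	/* returns 0 once bh is unlocked, -EIO on a failed read, or -EIOCBRETRY
	 * if an aio callback was queued on current->io_wait instead of sleeping */
	static int wait_for_buffer(struct buffer_head *bh)
	{
		int err = wait_on_buffer_wq(bh, current->io_wait);

		if (err)
			return err;	/* -EIOCBRETRY: iocb is kicked on unlock */
		if (!buffer_uptodate(bh))
			return -EIO;
		return 0;
	}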
+ * A NULL wait queue parameter defaults to synchronous behaviour */ -void __wait_on_buffer(struct buffer_head * bh) +int __wait_on_buffer_wq(struct buffer_head * bh, wait_queue_t *wait) { wait_queue_head_t *wqh = bh_waitq_head(bh); - DEFINE_WAIT(wait); + DEFINE_WAIT(local_wait); + + if (!wait) + wait = &local_wait; if (atomic_read(&bh->b_count) == 0 && (!bh->b_page || !PageLocked(bh->b_page))) buffer_error(); do { - prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE); + prepare_to_wait(wqh, wait, TASK_UNINTERRUPTIBLE); if (buffer_locked(bh)) { blk_run_queues(); + if (!is_sync_wait(wait)) { + /* + * if we've queued an async wait queue + * callback do not block; just tell the + * caller to return and retry later when + * the callback is notified + */ + return -EIOCBRETRY; + } io_schedule(); } } while (buffer_locked(bh)); - finish_wait(wqh, &wait); + finish_wait(wqh, wait); + return 0; +} + +void __wait_on_buffer(struct buffer_head * bh) +{ + __wait_on_buffer_wq(bh, NULL); } static void @@ -314,8 +337,7 @@ asmlinkage long sys_fsync(unsigned int fd) { struct file * file; - struct dentry * dentry; - struct inode * inode; + struct address_space *mapping; int ret, err; ret = -EBADF; @@ -323,8 +345,7 @@ if (!file) goto out; - dentry = file->f_dentry; - inode = dentry->d_inode; + mapping = file->f_mapping; ret = -EINVAL; if (!file->f_op || !file->f_op->fsync) { @@ -333,17 +354,17 @@ } /* We need to protect against concurrent writers.. */ - down(&inode->i_sem); + down(&mapping->host->i_sem); current->flags |= PF_SYNCWRITE; - ret = filemap_fdatawrite(inode->i_mapping); - err = file->f_op->fsync(file, dentry, 0); + ret = filemap_fdatawrite(mapping); + err = file->f_op->fsync(file, file->f_dentry, 0); if (!ret) ret = err; - err = filemap_fdatawait(inode->i_mapping); + err = filemap_fdatawait(mapping); if (!ret) ret = err; current->flags &= ~PF_SYNCWRITE; - up(&inode->i_sem); + up(&mapping->host->i_sem); out_putf: fput(file); @@ -354,8 +375,7 @@ asmlinkage long sys_fdatasync(unsigned int fd) { struct file * file; - struct dentry * dentry; - struct inode * inode; + struct address_space *mapping; int ret, err; ret = -EBADF; @@ -363,24 +383,23 @@ if (!file) goto out; - dentry = file->f_dentry; - inode = dentry->d_inode; - ret = -EINVAL; if (!file->f_op || !file->f_op->fsync) goto out_putf; - down(&inode->i_sem); + mapping = file->f_mapping; + + down(&mapping->host->i_sem); current->flags |= PF_SYNCWRITE; - ret = filemap_fdatawrite(inode->i_mapping); - err = file->f_op->fsync(file, dentry, 1); + ret = filemap_fdatawrite(mapping); + err = file->f_op->fsync(file, file->f_dentry, 1); if (!ret) ret = err; - err = filemap_fdatawait(inode->i_mapping); + err = filemap_fdatawait(mapping); if (!ret) ret = err; current->flags &= ~PF_SYNCWRITE; - up(&inode->i_sem); + up(&mapping->host->i_sem); out_putf: fput(file); @@ -432,6 +451,7 @@ printk("block=%llu, b_blocknr=%llu\n", (unsigned long long)block, (unsigned long long)bh->b_blocknr); printk("b_state=0x%08lx, b_size=%u\n", bh->b_state, bh->b_size); + printk("device blocksize: %d\n", 1 << bd_inode->i_blkbits); out_unlock: spin_unlock(&bd_mapping->private_lock); page_cache_release(page); @@ -1296,9 +1316,12 @@ __brelse(bh); } -static struct buffer_head *__bread_slow(struct buffer_head *bh) +static struct buffer_head *__bread_slow_wq(struct buffer_head *bh, + wait_queue_t *wait) { - lock_buffer(bh); + if (-EIOCBRETRY == lock_buffer_wq(bh, wait)) + return ERR_PTR(-EIOCBRETRY); + if (buffer_uptodate(bh)) { unlock_buffer(bh); return bh; @@ -1308,7 +1331,8 @@ get_bh(bh); 
bh->b_end_io = end_buffer_read_sync; submit_bh(READ, bh); - wait_on_buffer(bh); + if (-EIOCBRETRY == wait_on_buffer_wq(bh, wait)) + return ERR_PTR(-EIOCBRETRY); if (buffer_uptodate(bh)) return bh; } @@ -1316,6 +1340,11 @@ return NULL; } +static inline struct buffer_head *__bread_slow(struct buffer_head *bh) +{ + return __bread_slow_wq(bh, NULL); +} + /* * Per-cpu buffer LRU implementation. To reduce the cost of __find_get_block(). * The bhs[] array is sorted - newest buffer is at bhs[0]. Buffers have their @@ -1503,6 +1532,18 @@ } EXPORT_SYMBOL(__bread); +struct buffer_head * +__bread_wq(struct block_device *bdev, sector_t block, int size, + wait_queue_t *wait) +{ + struct buffer_head *bh = __getblk(bdev, block, size); + + if (!buffer_uptodate(bh)) + bh = __bread_slow_wq(bh, wait); + return bh; +} +EXPORT_SYMBOL(__bread_wq); + /* * invalidate_bh_lrus() is called rarely - at unmount. Because it is only for * unmount it only needs to ensure that all buffers from the target device are @@ -1980,8 +2021,9 @@ /* * If we issued read requests - let them complete. */ - while(wait_bh > wait) { - wait_on_buffer(*--wait_bh); + while (wait_bh > wait) { + if ((err = wait_on_buffer_wq(*--wait_bh, current->io_wait))) + return err; if (!buffer_uptodate(*wait_bh)) return -EIO; } @@ -3039,6 +3081,7 @@ EXPORT_SYMBOL(__bforget); EXPORT_SYMBOL(__brelse); EXPORT_SYMBOL(__wait_on_buffer); +EXPORT_SYMBOL(__wait_on_buffer_wq); EXPORT_SYMBOL(block_commit_write); EXPORT_SYMBOL(block_prepare_write); EXPORT_SYMBOL(block_read_full_page); --- diff/fs/coda/file.c 2003-09-30 15:46:18.000000000 +0100 +++ source/fs/coda/file.c 2003-11-26 10:09:06.000000000 +0000 @@ -89,6 +89,7 @@ coda_inode = coda_file->f_dentry->d_inode; host_inode = host_file->f_dentry->d_inode; + coda_file->f_mapping = host_file->f_mapping; if (coda_inode->i_mapping == &coda_inode->i_data) coda_inode->i_mapping = host_inode->i_mapping; --- diff/fs/compat.c 2003-11-25 15:24:59.000000000 +0000 +++ source/fs/compat.c 2003-11-26 10:09:06.000000000 +0000 @@ -559,3 +559,96 @@ return compat_sys_fcntl64(fd, cmd, arg); } +extern asmlinkage long sys_io_setup(unsigned nr_reqs, aio_context_t *ctx); + +asmlinkage long +compat_sys_io_setup(unsigned nr_reqs, u32 *ctx32p) +{ + long ret; + aio_context_t ctx64; + + mm_segment_t oldfs = get_fs(); + if (unlikely(get_user(ctx64, ctx32p))) + return -EFAULT; + + set_fs(KERNEL_DS); + ret = sys_io_setup(nr_reqs, &ctx64); + set_fs(oldfs); + /* truncating is ok because it's a user address */ + if (!ret) + ret = put_user((u32) ctx64, ctx32p); + return ret; +} + +extern asmlinkage long sys_io_getevents(aio_context_t ctx_id, + long min_nr, + long nr, + struct io_event *events, + struct timespec *timeout); + +asmlinkage long +compat_sys_io_getevents(aio_context_t ctx_id, + unsigned long min_nr, + unsigned long nr, + struct io_event *events, + struct compat_timespec *timeout) +{ + long ret; + struct timespec t; + struct timespec *ut = NULL; + + ret = -EFAULT; + if (unlikely(!access_ok(VERIFY_WRITE, events, + nr * sizeof(struct io_event)))) + goto out; + if (timeout) { + if (get_compat_timespec(&t, timeout)) + goto out; + + ut = compat_alloc_user_space(sizeof(*ut)); + if (copy_to_user(ut, &t, sizeof(t)) ) + goto out; + } + ret = sys_io_getevents(ctx_id, min_nr, nr, events, ut); +out: + return ret; +} + +extern asmlinkage long sys_io_submit(aio_context_t, long, + struct iocb __user **); + +static inline long +copy_iocb(long nr, u32 *ptr32, u64 *ptr64) +{ + compat_uptr_t uptr; + int i; + + for (i = 0; i < nr; ++i) { + if (get_user(uptr, 
ptr32 + i)) + return -EFAULT; + if (put_user((u64)compat_ptr(uptr), ptr64 + i)) + return -EFAULT; + } + return 0; +} + +#define MAX_AIO_SUBMITS (PAGE_SIZE/sizeof(struct iocb *)) + +asmlinkage long +compat_sys_io_submit(aio_context_t ctx_id, int nr, u32 *iocb) +{ + struct iocb **iocb64; + long ret; + + if (unlikely(nr < 0)) + return -EINVAL; + + if (nr > MAX_AIO_SUBMITS) + nr = MAX_AIO_SUBMITS; + + iocb64 = compat_alloc_user_space(nr * sizeof(*iocb64)); + ret = copy_iocb(nr, iocb, (u64 *) iocb64); + if (!ret) + ret = sys_io_submit(ctx_id, nr, iocb64); + return ret; +} --- diff/fs/compat_ioctl.c 2003-10-27 09:20:38.000000000 +0000 +++ source/fs/compat_ioctl.c 2003-11-26 10:09:06.000000000 +0000 @@ -63,6 +63,8 @@ #include <linux/ctype.h> #include <linux/ioctl32.h> #include <linux/ncp_fs.h> +#include <linux/i2c.h> +#include <linux/i2c-dev.h> #include <net/sock.h> /* siocdevprivate_ioctl */ #include <net/bluetooth/bluetooth.h> @@ -128,7 +130,7 @@ set_fs (KERNEL_DS); err = sys_ioctl(fd, cmd, (unsigned long)&val); set_fs (old_fs); - if (!err && put_user(val, (u32 *)arg)) + if (!err && put_user(val, (u32 *)compat_ptr(arg))) return -EFAULT; return err; } @@ -136,15 +138,16 @@ static int rw_long(unsigned int fd, unsigned int cmd, unsigned long arg) { mm_segment_t old_fs = get_fs(); + u32 *argptr = compat_ptr(arg); int err; unsigned long val; - if(get_user(val, (u32 *)arg)) + if(get_user(val, argptr)) return -EFAULT; set_fs (KERNEL_DS); err = sys_ioctl(fd, cmd, (unsigned long)&val); set_fs (old_fs); - if (!err && put_user(val, (u32 *)arg)) + if (!err && put_user(val, argptr)) return -EFAULT; return err; } @@ -1701,7 +1704,7 @@ set_fs(old_fs); if (err >= 0) - err = put_user(kuid, (compat_pid_t *)arg); + err = put_user(kuid, (compat_uid_t *)arg); return err; } @@ -2869,6 +2872,105 @@ return err; } + +/* + * I2C layer ioctls + */ + +struct i2c_msg32 { + u16 addr; + u16 flags; + u16 len; + compat_caddr_t buf; +}; + +struct i2c_rdwr_ioctl_data32 { + compat_caddr_t msgs; /* struct i2c_msg __user *msgs */ + u32 nmsgs; +}; + +struct i2c_smbus_ioctl_data32 { + u8 read_write; + u8 command; + u32 size; + compat_caddr_t data; /* union i2c_smbus_data *data */ +}; + +static int do_i2c_rdwr_ioctl(unsigned int fd, unsigned int cmd, unsigned long arg) +{ + struct i2c_rdwr_ioctl_data *tdata; + struct i2c_rdwr_ioctl_data32 *udata; + struct i2c_msg *tmsgs; + struct i2c_msg32 *umsgs; + compat_caddr_t datap; + int nmsgs, i; + + tdata = compat_alloc_user_space(sizeof(*tdata)); + if (tdata == NULL) + return -ENOMEM; + if (verify_area(VERIFY_WRITE, tdata, sizeof(*tdata))) + return -EFAULT; + + udata = (struct i2c_rdwr_ioctl_data32 *)compat_ptr(arg); + if (verify_area(VERIFY_READ, udata, sizeof(*udata))) + return -EFAULT; + if (__get_user(nmsgs, &udata->nmsgs) || __put_user(nmsgs, &tdata->nmsgs)) + return -EFAULT; + if (nmsgs > I2C_RDRW_IOCTL_MAX_MSGS) + return -EINVAL; + if (__get_user(datap, &udata->msgs)) + return -EFAULT; + umsgs = (struct i2c_msg32 *)compat_ptr(datap); + if (verify_area(VERIFY_READ, umsgs, sizeof(struct i2c_msg) * nmsgs)) + return -EFAULT; + + tmsgs = compat_alloc_user_space(sizeof(struct i2c_msg) * nmsgs); + if (tmsgs == NULL) + return -ENOMEM; + if (verify_area(VERIFY_WRITE, tmsgs, sizeof(struct i2c_msg) * nmsgs)) + return -EFAULT; + if (__put_user(tmsgs, &tdata->msgs)) + return -ENOMEM; + for (i = 0; i < nmsgs; i++) { + if (__copy_in_user(&tmsgs[i].addr, + &umsgs[i].addr, + 3 * sizeof(u16))) + return -EFAULT; + if (__get_user(datap, &umsgs[i].buf) || + __put_user(compat_ptr(datap), &tmsgs[i].buf)) + 
return -EFAULT; + } + return sys_ioctl(fd, cmd, (unsigned long)tdata); +} + +static int do_i2c_smbus_ioctl(unsigned int fd, unsigned int cmd, unsigned long arg) +{ + struct i2c_smbus_ioctl_data *tdata; + struct i2c_smbus_ioctl_data32 *udata; + compat_caddr_t datap; + + tdata = compat_alloc_user_space(sizeof(*tdata)); + if (tdata == NULL) + return -ENOMEM; + if (verify_area(VERIFY_WRITE, tdata, sizeof(*tdata))) + return -EFAULT; + + udata = (struct i2c_smbus_ioctl_data32 *)compat_ptr(arg); + if (verify_area(VERIFY_READ, udata, sizeof(*udata))) + return -EFAULT; + + if (__copy_in_user(&tdata->read_write, &udata->read_write, 2 * sizeof(u8))) + return -EFAULT; + if (__copy_in_user(&tdata->size, &udata->size, 2 * sizeof(u32))) + return -EFAULT; + if (__get_user(datap, &udata->data) || + __put_user(compat_ptr(datap), &tdata->data)) + return -EFAULT; + + return sys_ioctl(fd, cmd, (unsigned long)tdata); +} + + #undef CODE #endif @@ -2979,7 +3081,7 @@ HANDLE_IOCTL(VIDIOCGFREQ32, do_video_ioctl) HANDLE_IOCTL(VIDIOCSFREQ32, do_video_ioctl) /* One SMB ioctl needs translations. */ -#define SMB_IOC_GETMOUNTUID_32 _IOR('u', 1, compat_pid_t) +#define SMB_IOC_GETMOUNTUID_32 _IOR('u', 1, compat_uid_t) HANDLE_IOCTL(SMB_IOC_GETMOUNTUID_32, do_smb_getmountuid) HANDLE_IOCTL(ATM_GETLINKRATE32, do_atm_ioctl) HANDLE_IOCTL(ATM_GETNAMES32, do_atm_ioctl) @@ -3027,5 +3129,10 @@ HANDLE_IOCTL(USBDEVFS_REAPURB32, do_usbdevfs_reapurb) HANDLE_IOCTL(USBDEVFS_REAPURBNDELAY32, do_usbdevfs_reapurb) HANDLE_IOCTL(USBDEVFS_DISCSIGNAL32, do_usbdevfs_discsignal) +/* i2c */ +HANDLE_IOCTL(I2C_FUNCS, w_long) +HANDLE_IOCTL(I2C_RDWR, do_i2c_rdwr_ioctl) +HANDLE_IOCTL(I2C_SMBUS, do_i2c_smbus_ioctl) + #undef DECLARES #endif --- diff/fs/cramfs/inode.c 2003-11-25 15:24:59.000000000 +0000 +++ source/fs/cramfs/inode.c 2003-11-26 10:09:06.000000000 +0000 @@ -112,8 +112,8 @@ */ static void *cramfs_read(struct super_block *sb, unsigned int offset, unsigned int len) { - struct buffer_head * bh_array[BLKS_PER_BUF]; - struct buffer_head * read_array[BLKS_PER_BUF]; + struct address_space *mapping = sb->s_bdev->bd_inode->i_mapping; + struct page *pages[BLKS_PER_BUF]; unsigned i, blocknr, buffer, unread; unsigned long devsize; char *data; @@ -138,33 +138,36 @@ return read_buffers[i] + blk_offset; } - devsize = sb->s_bdev->bd_inode->i_size >> 12; - if (!devsize) - devsize = ~0UL; + devsize = mapping->host->i_size >> PAGE_CACHE_SHIFT; /* Ok, read in BLKS_PER_BUF pages completely first. */ unread = 0; for (i = 0; i < BLKS_PER_BUF; i++) { - struct buffer_head *bh; + struct page *page = NULL; - bh = NULL; if (blocknr + i < devsize) { - bh = sb_getblk(sb, blocknr + i); - if (!buffer_uptodate(bh)) - read_array[unread++] = bh; + page = read_cache_page(mapping, blocknr + i, + (filler_t *)mapping->a_ops->readpage, + NULL); + /* synchronous error? */ + if (IS_ERR(page)) + page = NULL; } - bh_array[i] = bh; + pages[i] = page; } - if (unread) { - ll_rw_block(READ, unread, read_array); - do { - unread--; - wait_on_buffer(read_array[unread]); - } while (unread); + for (i = 0; i < BLKS_PER_BUF; i++) { + struct page *page = pages[i]; + if (page) { + wait_on_page_locked(page); + if (!PageUptodate(page)) { + /* asynchronous error */ + page_cache_release(page); + pages[i] = NULL; + } + } } - /* Ok, copy them to the staging area without sleeping. 
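The buffer-head-free read path above reduces to the following pattern; read_dev_page() is a hypothetical helper name, the logic is lifted from the new cramfs_read():

	static struct page *read_dev_page(struct address_space *mapping, unsigned long n)
	{
		struct page *page = read_cache_page(mapping, n,
					(filler_t *)mapping->a_ops->readpage, NULL);

		if (IS_ERR(page))
			return NULL;			/* synchronous error */
		wait_on_page_locked(page);
		if (!PageUptodate(page)) {		/* asynchronous error */
			page_cache_release(page);
			return NULL;
		}
		return page;	/* access via kmap(), then kunmap() and
				 * page_cache_release() when done */
	}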
*/ buffer = next_buffer; next_buffer = NEXT_BUFFER(buffer); buffer_blocknr[buffer] = blocknr; @@ -172,10 +175,11 @@ data = read_buffers[buffer]; for (i = 0; i < BLKS_PER_BUF; i++) { - struct buffer_head * bh = bh_array[i]; - if (bh) { - memcpy(data, bh->b_data, PAGE_CACHE_SIZE); - brelse(bh); + struct page *page = pages[i]; + if (page) { + memcpy(data, kmap(page), PAGE_CACHE_SIZE); + kunmap(page); + page_cache_release(page); } else memset(data, 0, PAGE_CACHE_SIZE); data += PAGE_CACHE_SIZE; @@ -202,8 +206,6 @@ sb->s_fs_info = sbi; memset(sbi, 0, sizeof(struct cramfs_sb_info)); - sb_set_blocksize(sb, PAGE_CACHE_SIZE); - /* Invalidate the read buffers on mount: think disk change.. */ down(&read_mutex); for (i = 0; i < READ_BUFFERS; i++) --- diff/fs/dcache.c 2003-10-09 09:47:34.000000000 +0100 +++ source/fs/dcache.c 2003-11-26 10:09:06.000000000 +0000 @@ -639,24 +639,9 @@ /* * This is called from kswapd when we think we need some more memory. - * - * We don't want the VM to steal _all_ unused dcache. Because that leads to - * the VM stealing all unused inodes, which shoots down recently-used - * pagecache. So what we do is to tell fibs to the VM about how many reapable - * objects there are in this cache. If the number of unused dentries is - * less than half of the total dentry count then return zero. The net effect - * is that the number of unused dentries will be, at a minimum, equal to the - * number of used ones. - * - * If unused_ratio is set to 5, the number of unused dentries will not fall - * below 5* the number of used ones. */ static int shrink_dcache_memory(int nr, unsigned int gfp_mask) { - int nr_used; - int nr_unused; - const int unused_ratio = 1; - if (nr) { /* * Nasty deadlock avoidance. @@ -672,11 +657,7 @@ if (gfp_mask & __GFP_FS) prune_dcache(nr); } - nr_unused = dentry_stat.nr_unused; - nr_used = dentry_stat.nr_dentry - nr_unused; - if (nr_unused < nr_used * unused_ratio) - return 0; - return nr_unused - nr_used * unused_ratio; + return dentry_stat.nr_unused; } #define NAME_ALLOC_LEN(len) ((len+16) & ~15) --- diff/fs/devfs/base.c 2003-10-09 09:47:34.000000000 +0100 +++ source/fs/devfs/base.c 2003-11-26 10:09:06.000000000 +0000 @@ -1955,15 +1955,9 @@ return 0; } /* End Function devfs_notify_change */ -static void devfs_clear_inode (struct inode *inode) -{ - if ( S_ISBLK (inode->i_mode) ) bdput (inode->i_bdev); -} /* End Function devfs_clear_inode */ - static struct super_operations devfs_sops = { .drop_inode = generic_delete_inode, - .clear_inode = devfs_clear_inode, .statfs = simple_statfs, }; @@ -2015,11 +2009,7 @@ inode->i_rdev = de->u.cdev.dev; } else if ( S_ISBLK (de->mode) ) - { - inode->i_rdev = de->u.bdev.dev; - if (bd_acquire (inode) != 0) - PRINTK ("(%d): no block device from bdget()\n",(int)inode->i_ino); - } + init_special_inode(inode, de->mode, de->u.bdev.dev); else if ( S_ISFIFO (de->mode) ) inode->i_fop = &def_fifo_fops; else if ( S_ISDIR (de->mode) ) @@ -2118,11 +2108,7 @@ if (de == NULL) return -ENODEV; if ( S_ISDIR (de->mode) ) return 0; file->private_data = de->info; - if ( S_ISBLK (inode->i_mode) ) - { - file->f_op = &def_blk_fops; - err = def_blk_fops.open (inode, file); /* Module refcount unchanged */ - } else if (S_ISCHR(inode->i_mode)) { + if (S_ISCHR(inode->i_mode)) { ops = devfs_get_ops (de); /* Now have module refcount */ file->f_op = ops; if (file->f_op) --- diff/fs/direct-io.c 2003-11-25 15:24:59.000000000 +0000 +++ source/fs/direct-io.c 2003-11-26 10:09:06.000000000 +0000 @@ -52,6 +52,10 @@ * * If blkfactor is zero then the user's request was 
aligned to the filesystem's * blocksize. + * + * needs_locking is set for regular files on direct-IO-naive filesystems. It + * determines whether we need to do the fancy locking which prevents direct-IO + * from being able to read uninitialised disk blocks. */ struct dio { @@ -59,6 +63,7 @@ struct bio *bio; /* bio under assembly */ struct inode *inode; int rw; + int needs_locking; /* doesn't change */ unsigned blkbits; /* doesn't change */ unsigned blkfactor; /* When we're using an alignment which is finer than the filesystem's soft @@ -204,8 +209,10 @@ */ static void dio_complete(struct dio *dio, loff_t offset, ssize_t bytes) { - if (dio->end_io) + if (dio->end_io && dio->result) dio->end_io(dio->inode, offset, bytes, dio->map_bh.b_private); + if (dio->needs_locking) + up_read(&dio->inode->i_alloc_sem); } /* @@ -218,8 +225,14 @@ if (dio->is_async) { dio_complete(dio, dio->block_in_file << dio->blkbits, dio->result); - aio_complete(dio->iocb, dio->result, 0); - kfree(dio); + /* Complete AIO later if falling back to buffered I/O */ + if (dio->result != -ENOTBLK) { + aio_complete(dio->iocb, dio->result, 0); + kfree(dio); + } else { + if (dio->waiter) + wake_up_process(dio->waiter); + } } } } @@ -449,6 +462,7 @@ unsigned long fs_count; /* Number of filesystem-sized blocks */ unsigned long dio_count;/* Number of dio_block-sized blocks */ unsigned long blkmask; + int beyond_eof = 0; /* * If there was a memory error and we've overwritten all the @@ -466,8 +480,19 @@ if (dio_count & blkmask) fs_count++; + if (dio->needs_locking) { + if (dio->block_in_file >= (i_size_read(dio->inode) >> + dio->blkbits)) + beyond_eof = 1; + } + /* + * For writes inside i_size we forbid block creations: only + * overwrites are permitted. We fall back to buffered writes + * at a higher level for inside-i_size block-instantiating + * writes. + */ ret = (*dio->get_blocks)(dio->inode, fs_startblk, fs_count, - map_bh, dio->rw == WRITE); + map_bh, (dio->rw == WRITE) && beyond_eof); } return ret; } @@ -774,6 +799,10 @@ if (!buffer_mapped(map_bh)) { char *kaddr; + /* AKPM: eargh, -ENOTBLK is a hack */ + if (dio->rw == WRITE) + return -ENOTBLK; + if (dio->block_in_file >= i_size_read(dio->inode)>>blkbits) { /* We hit eof */ @@ -839,23 +868,21 @@ return ret; } +/* + * Releases both i_sem and i_alloc_sem + */ static int direct_io_worker(int rw, struct kiocb *iocb, struct inode *inode, const struct iovec *iov, loff_t offset, unsigned long nr_segs, - unsigned blkbits, get_blocks_t get_blocks, dio_iodone_t end_io) + unsigned blkbits, get_blocks_t get_blocks, dio_iodone_t end_io, + struct dio *dio) { unsigned long user_addr; int seg; int ret = 0; int ret2; - struct dio *dio; size_t bytes; - dio = kmalloc(sizeof(*dio), GFP_KERNEL); - if (!dio) - return -ENOMEM; - dio->is_async = !is_sync_kiocb(iocb); - dio->bio = NULL; dio->inode = inode; dio->rw = rw; @@ -864,7 +891,6 @@ dio->start_zero_done = 0; dio->block_in_file = offset >> blkbits; dio->blocks_available = 0; - dio->cur_page = NULL; dio->boundary = 0; @@ -947,14 +973,51 @@ dio_bio_submit(dio); /* + * It is possible that we return short IO due to end of file. + * In that case, we need to release all the pages we got hold of. + */ + dio_cleanup(dio); + + /* + * All block lookups have been performed. For READ requests + * we can let i_sem go now that it has achieved its purpose + * of protecting us from looking up uninitialized blocks.
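
Putting the pieces together, the locking that needs_locking selects can be sketched as one sequence (assuming a READ on a regular file over a direct-IO-naive filesystem; error paths omitted):

    static void dio_read_locking_sketch(struct inode *inode)
    {
            down(&inode->i_sem);              /* hold off writers/truncate */
            filemap_write_and_wait(inode->i_mapping);
            down_read(&inode->i_alloc_sem);   /* hold off block deallocation */

            /* ... look up blocks, submit BIOs ... */

            up(&inode->i_sem);                /* lookups done: drop early */

            /* ... BIOs complete ... */

            up_read(&inode->i_alloc_sem);     /* done by dio_complete() */
    }
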
+ */ + if ((rw == READ) && dio->needs_locking) + up(&dio->inode->i_sem); + + /* * OK, all BIOs are submitted, so we can decrement bio_count to truly * reflect the number of to-be-processed BIOs. */ if (dio->is_async) { if (ret == 0) ret = dio->result; /* Bytes written */ + if (ret == -ENOTBLK) { + /* + * The request will be reissued via buffered I/O + * when we return; any I/O already issued + * effectively becomes redundant. + */ + dio->result = ret; + dio->waiter = current; + } finished_one_bio(dio); /* This can free the dio */ blk_run_queues(); + if (ret == -ENOTBLK) { + /* + * Wait for already issued I/O to drain out and + * release its references to user-space pages + * before returning to fall back on buffered I/O + */ + set_current_state(TASK_UNINTERRUPTIBLE); + while (atomic_read(&dio->bio_count)) { + io_schedule(); + set_current_state(TASK_UNINTERRUPTIBLE); + } + set_current_state(TASK_RUNNING); + dio->waiter = NULL; + } } else { finished_one_bio(dio); ret2 = dio_await_completion(dio); @@ -974,6 +1037,9 @@ ret = i_size - offset; } dio_complete(dio, offset, ret); + /* We could have also come here on an AIO file extend */ + if (!is_sync_kiocb(iocb) && (ret != -ENOTBLK)) + aio_complete(iocb, ret, 0); kfree(dio); } return ret; @@ -981,11 +1047,17 @@ /* * This is a library function for use by filesystem drivers. + * + * For writes to S_ISREG files, we are called under i_sem and return with i_sem + * held, even though it is internally dropped. + * + * For writes to S_ISBLK files, i_sem is not held on entry; it is never taken. */ int -blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode, +__blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode, struct block_device *bdev, const struct iovec *iov, loff_t offset, - unsigned long nr_segs, get_blocks_t get_blocks, dio_iodone_t end_io) + unsigned long nr_segs, get_blocks_t get_blocks, dio_iodone_t end_io, + int needs_special_locking) { int seg; size_t size; @@ -994,6 +1066,9 @@ unsigned bdev_blkbits = 0; unsigned blocksize_mask = (1 << blkbits) - 1; ssize_t retval = -EINVAL; + loff_t end = offset; + struct dio *dio; + int needs_locking; if (bdev) bdev_blkbits = blksize_bits(bdev_hardsect_size(bdev)); @@ -1010,6 +1085,7 @@ for (seg = 0; seg < nr_segs; seg++) { addr = (unsigned long)iov[seg].iov_base; size = iov[seg].iov_len; + end += size; if ((addr & blocksize_mask) || (size & blocksize_mask)) { if (bdev) blkbits = bdev_blkbits; @@ -1019,10 +1095,43 @@ } } - retval = direct_io_worker(rw, iocb, inode, iov, offset, - nr_segs, blkbits, get_blocks, end_io); + dio = kmalloc(sizeof(*dio), GFP_KERNEL); + retval = -ENOMEM; + if (!dio) + goto out; + + /* + * For regular files, + * readers need to grab i_sem and i_alloc_sem + * writers need to grab i_alloc_sem only (i_sem is already held) + */ + needs_locking = 0; + if (S_ISREG(inode->i_mode) && needs_special_locking) { + needs_locking = 1; + if (rw == READ) { + down(&inode->i_sem); + retval = filemap_write_and_wait(inode->i_mapping); + if (retval) { + up(&inode->i_sem); + kfree(dio); + goto out; + } + } + down_read(&inode->i_alloc_sem); + } + dio->needs_locking = needs_locking; + /* + * For file-extending writes, updating i_size before data + * writeouts complete can expose uninitialized blocks. So + * even for AIO, we need to wait for I/O to complete before + * returning in this case.
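
Seen from a filesystem, the -ENOTBLK convention amounts to the sketch below; foo_get_blocks() and foo_buffered_write() are hypothetical stand-ins, and in reality the fallback decision lives in the generic write path rather than in per-filesystem code:

    static ssize_t foo_direct_write(struct kiocb *iocb, struct inode *inode,
                                    const struct iovec *iov, loff_t pos,
                                    unsigned long nr_segs)
    {
            ssize_t ret;

            ret = blockdev_direct_IO(WRITE, iocb, inode, inode->i_sb->s_bdev,
                                     iov, pos, nr_segs, foo_get_blocks, NULL);
            if (ret == -ENOTBLK)
                    /* inside-i_size block-instantiating write: go buffered */
                    ret = foo_buffered_write(iocb, iov, pos, nr_segs);
            return ret;
    }
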
+ */ + dio->is_async = !is_sync_kiocb(iocb) && !((rw == WRITE) && + (end > i_size_read(inode))); + + retval = direct_io_worker(rw, iocb, inode, iov, offset, + nr_segs, blkbits, get_blocks, end_io, dio); out: return retval; } - -EXPORT_SYMBOL(blockdev_direct_IO); +EXPORT_SYMBOL(__blockdev_direct_IO); --- diff/fs/dquot.c 2003-11-25 15:24:59.000000000 +0000 +++ source/fs/dquot.c 2003-11-26 10:09:06.000000000 +0000 @@ -192,6 +192,8 @@ struct dqstats dqstats; +static void dqput(struct dquot *dquot); + static inline int const hashfn(struct super_block *sb, unsigned int id, int type) { return((((unsigned long)sb>>L1_CACHE_SHIFT) ^ id) * (MAXQUOTAS - type)) % NR_DQHASH; @@ -339,8 +341,11 @@ continue; if (!dquot_dirty(dquot)) continue; + atomic_inc(&dquot->dq_count); + dqstats.lookups++; spin_unlock(&dq_list_lock); - sb->dq_op->sync_dquot(dquot); + sb->dq_op->write_dquot(dquot); + dqput(dquot); goto restart; } spin_unlock(&dq_list_lock); @@ -427,7 +432,7 @@ } if (dquot_dirty(dquot)) { spin_unlock(&dq_list_lock); - commit_dqblk(dquot); + dquot->dq_sb->dq_op->write_dquot(dquot); goto we_slept; } atomic_dec(&dquot->dq_count); @@ -1083,7 +1088,7 @@ .free_space = dquot_free_space, .free_inode = dquot_free_inode, .transfer = dquot_transfer, - .sync_dquot = commit_dqblk + .write_dquot = commit_dqblk }; /* Function used by filesystems for initializing the dquot_operations structure */ @@ -1207,9 +1212,9 @@ error = -EINVAL; if (!fmt->qf_ops->check_quota_file(sb, type)) goto out_file_init; - /* We don't want quota on quota files */ + /* We don't want quota and atime on quota files (deadlocks possible) */ dquot_drop_nolock(inode); - inode->i_flags |= S_NOQUOTA; + inode->i_flags |= S_NOQUOTA | S_NOATIME; dqopt->ops[type] = fmt->qf_ops; dqopt->info[type].dqi_format = fmt; --- diff/fs/eventpoll.c 2003-10-09 09:47:17.000000000 +0100 +++ source/fs/eventpoll.c 2003-11-26 10:09:06.000000000 +0000 @@ -740,6 +740,7 @@ d_add(dentry, inode); file->f_vfsmnt = mntget(eventpoll_mnt); file->f_dentry = dget(dentry); + file->f_mapping = inode->i_mapping; file->f_pos = 0; file->f_flags = O_RDONLY; --- diff/fs/ext2/acl.c 2003-10-27 09:20:38.000000000 +0000 +++ source/fs/ext2/acl.c 2003-11-26 10:09:06.000000000 +0000 @@ -322,7 +322,7 @@ check_capabilities: /* Allowed to override Discretionary Access Control? */ - if ((mask & (MAY_READ|MAY_WRITE)) || (inode->i_mode & S_IXUGO)) + if (!(mask & MAY_EXEC) || (inode->i_mode & S_IXUGO)) if (capable(CAP_DAC_OVERRIDE)) return 0; /* Read and search granted if capable(CAP_DAC_READ_SEARCH) */ --- diff/fs/ext2/inode.c 2003-10-09 09:47:34.000000000 +0100 +++ source/fs/ext2/inode.c 2003-11-26 10:09:06.000000000 +0000 @@ -257,11 +257,12 @@ * or when it reads all @depth-1 indirect blocks successfully and finds * the whole chain, all way to the data (returns %NULL, *err == 0). */ -static Indirect *ext2_get_branch(struct inode *inode, +static Indirect *ext2_get_branch_wq(struct inode *inode, int depth, int *offsets, Indirect chain[4], - int *err) + int *err, + wait_queue_t *wait) { struct super_block *sb = inode->i_sb; Indirect *p = chain; @@ -273,8 +274,8 @@ if (!p->key) goto no_block; while (--depth) { - bh = sb_bread(sb, le32_to_cpu(p->key)); - if (!bh) + bh = sb_bread_wq(sb, le32_to_cpu(p->key), wait); + if (!bh || IS_ERR(bh)) goto failure; read_lock(&EXT2_I(inode)->i_meta_lock); if (!verify_chain(chain, p)) @@ -292,11 +293,21 @@ *err = -EAGAIN; goto no_block; failure: - *err = -EIO; + *err = IS_ERR(bh) ? 
PTR_ERR(bh) : -EIO; no_block: return p; } +static Indirect *ext2_get_branch(struct inode *inode, + int depth, + int *offsets, + Indirect chain[4], + int *err) +{ + return ext2_get_branch_wq(inode, depth, offsets, chain, + err, NULL); +} + /** * ext2_find_near - find a place for allocation with sufficient locality * @inode: owner @@ -536,7 +547,8 @@ * reachable from inode. */ -static int ext2_get_block(struct inode *inode, sector_t iblock, struct buffer_head *bh_result, int create) +static int ext2_get_block_wq(struct inode *inode, sector_t iblock, + struct buffer_head *bh_result, int create, wait_queue_t *wait) { int err = -EIO; int offsets[4]; @@ -551,7 +563,8 @@ goto out; reread: - partial = ext2_get_branch(inode, depth, offsets, chain, &err); + partial = ext2_get_branch_wq(inode, depth, offsets, chain, &err, + wait); /* Simplest case - block found, no allocation needed */ if (!partial) { @@ -565,7 +578,7 @@ } /* Next simple case - plain lookup or failed read of indirect block */ - if (!create || err == -EIO) { + if (!create || err == -EIO || err == -EIOCBRETRY) { cleanup: while (partial > chain) { brelse(partial->bh); @@ -606,6 +619,19 @@ goto reread; } +static int ext2_get_block_async(struct inode *inode, sector_t iblock, + struct buffer_head *bh_result, int create) +{ + return ext2_get_block_wq(inode, iblock, bh_result, create, + current->io_wait); +} + +static int ext2_get_block(struct inode *inode, sector_t iblock, + struct buffer_head *bh_result, int create) +{ + return ext2_get_block_wq(inode, iblock, bh_result, create, NULL); +} + static int ext2_writepage(struct page *page, struct writeback_control *wbc) { return block_write_full_page(page, ext2_get_block, wbc); @@ -627,7 +653,7 @@ ext2_prepare_write(struct file *file, struct page *page, unsigned from, unsigned to) { - return block_prepare_write(page,from,to,ext2_get_block); + return block_prepare_write(page,from,to,ext2_get_block_async); } static int @@ -659,7 +685,7 @@ loff_t offset, unsigned long nr_segs) { struct file *file = iocb->ki_filp; - struct inode *inode = file->f_dentry->d_inode->i_mapping->host; + struct inode *inode = file->f_mapping->host; return blockdev_direct_IO(rw, iocb, inode, inode->i_sb->s_bdev, iov, offset, nr_segs, ext2_get_blocks, NULL); --- diff/fs/ext3/acl.c 2003-10-27 09:20:38.000000000 +0000 +++ source/fs/ext3/acl.c 2003-11-26 10:09:06.000000000 +0000 @@ -327,7 +327,7 @@ check_capabilities: /* Allowed to override Discretionary Access Control? 
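
The wait-queue threading added to ext2 above (ext2_get_branch_wq() via sb_bread_wq()) gives metadata reads a retry mode: a NULL wait argument sleeps as before, while an AIO caller passing current->io_wait gets -EIOCBRETRY back and is re-driven once the buffer is up to date. A minimal sketch of the convention (read_one_block() is hypothetical; sb_bread_wq() is used as in the hunk above):

    static int read_one_block(struct super_block *sb, sector_t blk,
                              wait_queue_t *wait, struct buffer_head **out)
    {
            struct buffer_head *bh = sb_bread_wq(sb, blk, wait);

            if (IS_ERR(bh))
                    return PTR_ERR(bh);     /* -EIOCBRETRY for async callers */
            if (!bh)
                    return -EIO;
            *out = bh;
            return 0;
    }
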
*/ - if ((mask & (MAY_READ|MAY_WRITE)) || (inode->i_mode & S_IXUGO)) + if (!(mask & MAY_EXEC) || (inode->i_mode & S_IXUGO)) if (capable(CAP_DAC_OVERRIDE)) return 0; /* Read and search granted if capable(CAP_DAC_READ_SEARCH) */ --- diff/fs/ext3/inode.c 2003-10-27 09:20:38.000000000 +0000 +++ source/fs/ext3/inode.c 2003-11-26 10:09:06.000000000 +0000 @@ -1532,7 +1532,7 @@ unsigned long nr_segs) { struct file *file = iocb->ki_filp; - struct inode *inode = file->f_dentry->d_inode->i_mapping->host; + struct inode *inode = file->f_mapping->host; struct ext3_inode_info *ei = EXT3_I(inode); handle_t *handle = NULL; int ret; --- diff/fs/ext3/super.c 2003-10-09 09:47:34.000000000 +0100 +++ source/fs/ext3/super.c 2003-11-26 10:09:06.000000000 +0000 @@ -340,6 +340,7 @@ */ static int ext3_blkdev_put(struct block_device *bdev) { + bd_release(bdev); return blkdev_put(bdev, BDEV_FS); } @@ -1480,6 +1481,13 @@ if (bdev == NULL) return NULL; + if (bd_claim(bdev, sb)) { + printk(KERN_ERR + "EXT3: failed to claim external journal device.\n"); + blkdev_put(bdev, BDEV_FS); + return NULL; + } + blocksize = sb->s_blocksize; hblock = bdev_hardsect_size(bdev); if (blocksize < hblock) { @@ -1944,9 +1952,9 @@ /* Blocks: quota info + (4 pointer blocks + 1 entry block) * (3 indirect + 1 descriptor + 1 bitmap) + superblock */ #define EXT3_V0_QFMT_BLOCKS 27 -static int (*old_sync_dquot)(struct dquot *dquot); +static int (*old_write_dquot)(struct dquot *dquot); -static int ext3_sync_dquot(struct dquot *dquot) +static int ext3_write_dquot(struct dquot *dquot) { int nblocks; int ret; @@ -1971,7 +1979,7 @@ ret = PTR_ERR(handle); goto out; } - ret = old_sync_dquot(dquot); + ret = old_write_dquot(dquot); err = ext3_journal_stop(handle); if (ret == 0) ret = err; @@ -2004,8 +2012,8 @@ goto out1; #ifdef CONFIG_QUOTA init_dquot_operations(&ext3_qops); - old_sync_dquot = ext3_qops.sync_dquot; - ext3_qops.sync_dquot = ext3_sync_dquot; + old_write_dquot = ext3_qops.write_dquot; + ext3_qops.write_dquot = ext3_write_dquot; #endif err = register_filesystem(&ext3_fs_type); if (err) --- diff/fs/fcntl.c 2003-10-09 09:47:34.000000000 +0100 +++ source/fs/fcntl.c 2003-11-26 10:09:06.000000000 +0000 @@ -229,8 +229,8 @@ arg |= O_NONBLOCK; if (arg & O_DIRECT) { - if (!inode->i_mapping || !inode->i_mapping->a_ops || - !inode->i_mapping->a_ops->direct_IO) + if (!filp->f_mapping || !filp->f_mapping->a_ops || + !filp->f_mapping->a_ops->direct_IO) return -EINVAL; } --- diff/fs/file_table.c 2003-10-09 09:47:34.000000000 +0100 +++ source/fs/file_table.c 2003-11-26 10:09:06.000000000 +0000 @@ -120,6 +120,7 @@ filp->f_mode = (flags+1) & O_ACCMODE; atomic_set(&filp->f_count, 1); filp->f_dentry = dentry; + filp->f_mapping = dentry->d_inode->i_mapping; filp->f_uid = current->fsuid; filp->f_gid = current->fsgid; filp->f_op = dentry->d_inode->i_fop; @@ -183,9 +184,9 @@ fops_put(file->f_op); if (file->f_mode & FMODE_WRITE) put_write_access(inode); + file_kill(file); file->f_dentry = NULL; file->f_vfsmnt = NULL; - file_kill(file); file_free(file); dput(dentry); mntput(mnt); --- diff/fs/fs-writeback.c 2003-10-09 09:47:34.000000000 +0100 +++ source/fs/fs-writeback.c 2003-11-26 10:09:06.000000000 +0000 @@ -514,7 +514,7 @@ * OSYNC_INODE: the inode itself */ -int generic_osync_inode(struct inode *inode, int what) +int generic_osync_inode(struct inode *inode, struct address_space *mapping, int what) { int err = 0; int need_write_inode_now = 0; @@ -522,14 +522,14 @@ current->flags |= PF_SYNCWRITE; if (what & OSYNC_DATA) - err = filemap_fdatawrite(inode->i_mapping); + 
err = filemap_fdatawrite(mapping); if (what & (OSYNC_METADATA|OSYNC_DATA)) { - err2 = sync_mapping_buffers(inode->i_mapping); + err2 = sync_mapping_buffers(mapping); if (!err) err = err2; } if (what & OSYNC_DATA) { - err2 = filemap_fdatawait(inode->i_mapping); + err2 = filemap_fdatawait(mapping); if (!err) err = err2; } --- diff/fs/hugetlbfs/inode.c 2003-10-27 09:20:44.000000000 +0000 +++ source/fs/hugetlbfs/inode.c 2003-11-26 10:09:06.000000000 +0000 @@ -165,7 +165,7 @@ pagevec_init(&pvec, 0); next = start; while (1) { - if (!pagevec_lookup(&pvec, mapping, next, PAGEVEC_SIZE)) { + if (!pagevec_lookup(&pvec, mapping, &next, PAGEVEC_SIZE)) { if (next == start) break; next = start; @@ -176,9 +176,6 @@ struct page *page = pvec.pages[i]; lock_page(page); - if (page->index > next) - next = page->index; - ++next; truncate_huge_page(page); unlock_page(page); hugetlb_put_quota(mapping); @@ -194,6 +191,7 @@ hlist_del_init(&inode->i_hash); list_del_init(&inode->i_list); + list_del_init(&inode->i_sb_list); inode->i_state |= I_FREEING; inodes_stat.nr_inodes--; spin_unlock(&inode_lock); @@ -236,6 +234,7 @@ hlist_del_init(&inode->i_hash); out_truncate: list_del_init(&inode->i_list); + list_del_init(&inode->i_sb_list); inode->i_state |= I_FREEING; inodes_stat.nr_inodes--; spin_unlock(&inode_lock); @@ -788,6 +787,7 @@ inode->i_nlink = 0; file->f_vfsmnt = mntget(hugetlbfs_vfsmount); file->f_dentry = dentry; + file->f_mapping = inode->i_mapping; file->f_op = &hugetlbfs_file_operations; file->f_mode = FMODE_WRITE | FMODE_READ; return file; --- diff/fs/inode.c 2003-10-27 09:20:38.000000000 +0000 +++ source/fs/inode.c 2003-11-26 10:09:06.000000000 +0000 @@ -183,6 +183,7 @@ INIT_LIST_HEAD(&inode->i_dentry); INIT_LIST_HEAD(&inode->i_devices); sema_init(&inode->i_sem, 1); + init_rwsem(&inode->i_alloc_sem); INIT_RADIX_TREE(&inode->i_data.page_tree, GFP_ATOMIC); spin_lock_init(&inode->i_data.page_lock); init_MUTEX(&inode->i_data.i_shared_sem); @@ -285,7 +286,7 @@ /* * Invalidate all inodes for a device. 
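
The i_sb_list/s_inodes pairing introduced above gives each super_block a private list of its inodes, so whole-filesystem walks no longer scan the global in-use/unused lists and filter on i_sb. The resulting idiom (a sketch; process_inode() is hypothetical, and inode_lock is held exactly as in the converted callers):

    struct inode *inode;

    spin_lock(&inode_lock);
    list_for_each_entry(inode, &sb->s_inodes, i_sb_list)
            process_inode(inode);
    spin_unlock(&inode_lock);
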
*/ -static int invalidate_list(struct list_head *head, struct super_block * sb, struct list_head * dispose) +static int invalidate_list(struct list_head *head, struct list_head *dispose) { struct list_head *next; int busy = 0, count = 0; @@ -298,13 +299,12 @@ next = next->next; if (tmp == head) break; - inode = list_entry(tmp, struct inode, i_list); - if (inode->i_sb != sb) - continue; + inode = list_entry(tmp, struct inode, i_sb_list); invalidate_inode_buffers(inode); if (!atomic_read(&inode->i_count)) { hlist_del_init(&inode->i_hash); list_del(&inode->i_list); + list_del(&inode->i_sb_list); list_add(&inode->i_list, dispose); inode->i_state |= I_FREEING; count++; @@ -340,10 +340,7 @@ down(&iprune_sem); spin_lock(&inode_lock); - busy = invalidate_list(&inode_in_use, sb, &throw_away); - busy |= invalidate_list(&inode_unused, sb, &throw_away); - busy |= invalidate_list(&sb->s_dirty, sb, &throw_away); - busy |= invalidate_list(&sb->s_io, sb, &throw_away); + busy = invalidate_list(&sb->s_inodes, &throw_away); spin_unlock(&inode_lock); dispose_list(&throw_away); @@ -443,6 +440,7 @@ continue; } hlist_del_init(&inode->i_hash); + list_del_init(&inode->i_sb_list); list_move(&inode->i_list, &freeable); inode->i_state |= I_FREEING; nr_pruned++; @@ -553,6 +551,7 @@ spin_lock(&inode_lock); inodes_stat.nr_inodes++; list_add(&inode->i_list, &inode_in_use); + list_add(&inode->i_sb_list, &sb->s_inodes); inode->i_ino = ++last_ino; inode->i_state = 0; spin_unlock(&inode_lock); @@ -601,6 +600,7 @@ inodes_stat.nr_inodes++; list_add(&inode->i_list, &inode_in_use); + list_add(&inode->i_sb_list, &sb->s_inodes); hlist_add_head(&inode->i_hash, head); inode->i_state = I_LOCK|I_NEW; spin_unlock(&inode_lock); @@ -649,6 +649,7 @@ inode->i_ino = ino; inodes_stat.nr_inodes++; list_add(&inode->i_list, &inode_in_use); + list_add(&inode->i_sb_list, &sb->s_inodes); hlist_add_head(&inode->i_hash, head); inode->i_state = I_LOCK|I_NEW; spin_unlock(&inode_lock); @@ -984,6 +985,7 @@ struct super_operations *op = inode->i_sb->s_op; list_del_init(&inode->i_list); + list_del_init(&inode->i_sb_list); inode->i_state|=I_FREEING; inodes_stat.nr_inodes--; spin_unlock(&inode_lock); @@ -1031,6 +1033,7 @@ hlist_del_init(&inode->i_hash); } list_del_init(&inode->i_list); + list_del_init(&inode->i_sb_list); inode->i_state|=I_FREEING; inodes_stat.nr_inodes--; spin_unlock(&inode_lock); @@ -1221,34 +1224,17 @@ void remove_dquot_ref(struct super_block *sb, int type) { struct inode *inode; - struct list_head *act_head; LIST_HEAD(tofree_head); if (!sb->dq_op) return; /* nothing to do */ spin_lock(&inode_lock); /* This lock is for inodes code */ /* We don't have to lock against quota code - test IS_QUOTAINIT is just for speedup... 
*/ - - list_for_each(act_head, &inode_in_use) { - inode = list_entry(act_head, struct inode, i_list); - if (inode->i_sb == sb && IS_QUOTAINIT(inode)) - remove_inode_dquot_ref(inode, type, &tofree_head); - } - list_for_each(act_head, &inode_unused) { - inode = list_entry(act_head, struct inode, i_list); - if (inode->i_sb == sb && IS_QUOTAINIT(inode)) - remove_inode_dquot_ref(inode, type, &tofree_head); - } - list_for_each(act_head, &sb->s_dirty) { - inode = list_entry(act_head, struct inode, i_list); - if (IS_QUOTAINIT(inode)) - remove_inode_dquot_ref(inode, type, &tofree_head); - } - list_for_each(act_head, &sb->s_io) { - inode = list_entry(act_head, struct inode, i_list); + + list_for_each_entry(inode, &sb->s_inodes, i_sb_list) if (IS_QUOTAINIT(inode)) remove_inode_dquot_ref(inode, type, &tofree_head); - } + spin_unlock(&inode_lock); put_dquot_list(&tofree_head); --- diff/fs/intermezzo/file.c 2003-10-09 09:47:34.000000000 +0100 +++ source/fs/intermezzo/file.c 2003-11-26 10:09:06.000000000 +0000 @@ -336,7 +336,7 @@ unlock_kernel(); return; } - error = presto_journal_close(&rec, fset, file, + error = presto_journal_close(&rec, fset, fdata, file->f_dentry, &fdata->fd_version, &new_file_ver); --- diff/fs/intermezzo/intermezzo_fs.h 2003-10-09 09:47:17.000000000 +0100 +++ source/fs/intermezzo/intermezzo_fs.h 2003-11-26 10:09:06.000000000 +0000 @@ -603,7 +603,7 @@ int presto_journal_open(struct rec_info *, struct presto_file_set *, struct dentry *, struct presto_version *old_ver); int presto_journal_close(struct rec_info *rec, struct presto_file_set *, - struct file *, struct dentry *, + struct presto_file_data *, struct dentry *, struct presto_version *old_file_ver, struct presto_version *new_file_ver); int presto_write_lml_close(struct rec_info *rec, --- diff/fs/intermezzo/journal.c 2003-10-09 09:47:34.000000000 +0100 +++ source/fs/intermezzo/journal.c 2003-11-26 10:09:06.000000000 +0000 @@ -2103,12 +2103,11 @@ int presto_journal_close(struct rec_info *rec, struct presto_file_set *fset, - struct file *file, struct dentry *dentry, + struct presto_file_data *fd, struct dentry *dentry, struct presto_version *old_file_ver, struct presto_version *new_file_ver) { int opcode = KML_OPCODE_CLOSE; - struct presto_file_data *fd; char *buffer, *path, *logrecord, record[316]; struct dentry *root; int error, size, i; @@ -2137,7 +2136,6 @@ root = fset->fset_dentry; - fd = (struct presto_file_data *)file->private_data; if (fd) { open_ngroups = fd->fd_ngroups; for (i = 0; i < fd->fd_ngroups; i++) --- diff/fs/intermezzo/presto.c 2003-10-09 09:47:34.000000000 +0100 +++ source/fs/intermezzo/presto.c 2003-11-26 10:09:06.000000000 +0000 @@ -259,11 +259,8 @@ if (info->flags & LENTO_FL_WRITE_KML) { - struct file file; - file.private_data = NULL; - file.f_dentry = dentry; presto_getversion(&new_ver, dentry->d_inode); - error = presto_journal_close(&rec, fset, &file, dentry, + error = presto_journal_close(&rec, fset, NULL, dentry, &new_ver); if ( error ) { EXIT; --- diff/fs/intermezzo/vfs.c 2003-10-09 09:47:34.000000000 +0100 +++ source/fs/intermezzo/vfs.c 2003-11-26 10:09:06.000000000 +0000 @@ -321,7 +321,7 @@ } if (fdata->fd_info.flags & LENTO_FL_KML) - rc = presto_journal_close(&rec, fset, file, file->f_dentry, + rc = presto_journal_close(&rec, fset, fdata, file->f_dentry, &fdata->fd_version, &fdata->fd_info.remote_version); if (rc) { @@ -431,14 +431,11 @@ if ( presto_do_kml(info, dentry) ) { if ((iattr->ia_valid & ATTR_SIZE) && (old_size != inode->i_size)) { - struct file file; /* Journal a close whenever we see a 
potential truncate * At the receiving end, lento should explicitly remove * ATTR_SIZE from the list of valid attributes */ presto_getversion(&new_ver, inode); - file.private_data = NULL; - file.f_dentry = dentry; - error = presto_journal_close(&rec, fset, &file, dentry, + error = presto_journal_close(&rec, fset, NULL, dentry, &old_ver, &new_ver); } @@ -2086,7 +2083,9 @@ } } + /* XXX: where the fuck is ->f_vfsmnt? */ f->f_dentry = dentry; + f->f_mapping = dentry->d_inode->i_mapping; f->f_pos = 0; //f->f_reada = 0; f->f_op = NULL; --- diff/fs/ioctl.c 2003-07-22 18:54:27.000000000 +0100 +++ source/fs/ioctl.c 2003-11-26 10:09:07.000000000 +0000 @@ -22,7 +22,7 @@ switch (cmd) { case FIBMAP: { - struct address_space *mapping = inode->i_mapping; + struct address_space *mapping = filp->f_mapping; int res; /* do we support this mess? */ if (!mapping->a_ops->bmap) --- diff/fs/jbd/commit.c 2003-10-27 09:20:44.000000000 +0000 +++ source/fs/jbd/commit.c 2003-11-26 10:09:07.000000000 +0000 @@ -264,6 +264,16 @@ jbd_unlock_bh_state(bh); journal_remove_journal_head(bh); __brelse(bh); + if (need_resched() && commit_transaction-> + t_sync_datalist) { + commit_transaction->t_sync_datalist = + next_jh; + if (bufs) + break; + spin_unlock(&journal->j_list_lock); + cond_resched(); + goto write_out_data; + } } } if (bufs == ARRAY_SIZE(wbuf)) { @@ -284,8 +294,7 @@ cond_resched(); journal_brelse_array(wbuf, bufs); spin_lock(&journal->j_list_lock); - if (bufs) - goto write_out_data_locked; + goto write_out_data_locked; } /* --- diff/fs/jffs/intrep.c 2003-09-30 15:46:18.000000000 +0100 +++ source/fs/jffs/intrep.c 2003-11-26 10:09:07.000000000 +0000 @@ -3337,18 +3337,16 @@ int result = 0; D1(int i = 1); + daemonize("jffs_gcd"); + c->gc_task = current; lock_kernel(); - exit_mm(c->gc_task); - - set_special_pids(1, 1); init_completion(&c->gc_thread_comp); /* barrier */ spin_lock_irq(&current->sighand->siglock); siginitsetinv (&current->blocked, sigmask(SIGHUP) | sigmask(SIGKILL) | sigmask(SIGSTOP) | sigmask(SIGCONT)); recalc_sigpending(); spin_unlock_irq(&current->sighand->siglock); - strcpy(current->comm, "jffs_gcd"); D1(printk (KERN_NOTICE "jffs_garbage_collect_thread(): Starting infinite loop.\n")); --- diff/fs/jfs/acl.c 2003-10-09 09:47:17.000000000 +0100 +++ source/fs/jfs/acl.c 2003-11-26 10:09:07.000000000 +0000 @@ -191,7 +191,7 @@ * Read/write DACs are always overridable. * Executable DACs are overridable if at least one exec bit is set.
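
The permission-check change repeated above for ext2, ext3, jfs, namei.c and xfs reduces, for a CAP_DAC_OVERRIDE holder, to the predicate below. The old mask test let an exec request be overridden whenever read or write was requested alongside it, even on a file with no exec bits set; the new test denies override for any request containing MAY_EXEC unless at least one x bit is present:

    /* May CAP_DAC_OVERRIDE satisfy this access request? */
    static int dac_overridable(int mask, mode_t mode)
    {
            return !(mask & MAY_EXEC) || (mode & S_IXUGO);
    }
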
*/ - if ((mask & (MAY_READ|MAY_WRITE)) || (inode->i_mode & S_IXUGO)) + if (!(mask & MAY_EXEC) || (inode->i_mode & S_IXUGO)) if (capable(CAP_DAC_OVERRIDE)) return 0; --- diff/fs/jfs/inode.c 2003-10-27 09:20:38.000000000 +0000 +++ source/fs/jfs/inode.c 2003-11-26 10:09:07.000000000 +0000 @@ -306,7 +306,7 @@ loff_t offset, unsigned long nr_segs) { struct file *file = iocb->ki_filp; - struct inode *inode = file->f_dentry->d_inode->i_mapping->host; + struct inode *inode = file->f_mapping->host; return blockdev_direct_IO(rw, iocb, inode, inode->i_sb->s_bdev, iov, offset, nr_segs, jfs_get_blocks, NULL); --- diff/fs/jfs/jfs_logmgr.c 2003-08-20 14:16:32.000000000 +0100 +++ source/fs/jfs/jfs_logmgr.c 2003-11-26 10:09:07.000000000 +0000 @@ -1415,6 +1415,10 @@ int i; struct tblock *target; + /* jfs_write_inode may call us during read-only mount */ + if (!log) + return; + jfs_info("jfs_flush_journal: log:0x%p wait=%d", log, wait); LOGGC_LOCK(log); --- diff/fs/locks.c 2003-10-27 09:20:39.000000000 +0000 +++ source/fs/locks.c 2003-11-26 10:09:07.000000000 +0000 @@ -1454,7 +1454,7 @@ */ if (IS_MANDLOCK(inode) && (inode->i_mode & (S_ISGID | S_IXGRP)) == S_ISGID) { - struct address_space *mapping = inode->i_mapping; + struct address_space *mapping = filp->f_mapping; if (!list_empty(&mapping->i_mmap_shared)) { error = -EAGAIN; @@ -1592,7 +1592,7 @@ */ if (IS_MANDLOCK(inode) && (inode->i_mode & (S_ISGID | S_IXGRP)) == S_ISGID) { - struct address_space *mapping = inode->i_mapping; + struct address_space *mapping = filp->f_mapping; if (!list_empty(&mapping->i_mmap_shared)) { error = -EAGAIN; --- diff/fs/namei.c 2003-10-09 09:47:34.000000000 +0100 +++ source/fs/namei.c 2003-11-26 10:09:07.000000000 +0000 @@ -190,7 +190,7 @@ * Read/write DACs are always overridable. * Executable DACs are overridable if at least one exec bit is set. */ - if ((mask & (MAY_READ|MAY_WRITE)) || (inode->i_mode & S_IXUGO)) + if (!(mask & MAY_EXEC) || (inode->i_mode & S_IXUGO)) if (capable(CAP_DAC_OVERRIDE)) return 0; --- diff/fs/ncpfs/Kconfig 2002-11-11 11:09:38.000000000 +0000 +++ source/fs/ncpfs/Kconfig 2003-11-26 10:09:07.000000000 +0000 @@ -65,6 +65,7 @@ config NCPFS_NLS bool "Use Native Language Support" depends on NCP_FS + select NLS help Allows you to use codepages and I/O charsets for file name translation between the server file system and input/output. This --- diff/fs/ncpfs/mmap.c 2002-10-16 04:28:30.000000000 +0100 +++ source/fs/ncpfs/mmap.c 2003-11-26 10:09:07.000000000 +0000 @@ -26,7 +26,7 @@ * Fill in the supplied page for mmap */ static struct page* ncp_file_mmap_nopage(struct vm_area_struct *area, - unsigned long address, int write_access) + unsigned long address, int *type) { struct file *file = area->vm_file; struct dentry *dentry = file->f_dentry; @@ -85,6 +85,15 @@ memset(pg_addr + already_read, 0, PAGE_SIZE - already_read); flush_dcache_page(page); kunmap(page); + + /* + * If I understand ncp_read_kernel() properly, the above always + * fetches from the network, here the analogue of disk. + * -- wli + */ + if (type) + *type = VM_FAULT_MAJOR; + inc_page_state(pgmajfault); return page; } --- diff/fs/nfs/file.c 2003-10-09 09:47:34.000000000 +0100 +++ source/fs/nfs/file.c 2003-11-26 10:09:07.000000000 +0000 @@ -266,7 +266,7 @@ int nfs_lock(struct file *filp, int cmd, struct file_lock *fl) { - struct inode * inode = filp->f_dentry->d_inode; + struct inode * inode = filp->f_mapping->host; int status = 0; int status2; @@ -309,13 +309,13 @@ * Flush all pending writes before doing anything * with locks.. 
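
ncp_file_mmap_nopage() above is converted to the new ->nopage signature, in which the old write_access flag becomes an int *type out-parameter reporting the fault class. The shape of a conforming handler (a sketch; fetch_page_somehow() is hypothetical):

    static struct page *foo_nopage(struct vm_area_struct *area,
                                   unsigned long address, int *type)
    {
            struct page *page = fetch_page_somehow(area, address);

            if (page && type) {
                    *type = VM_FAULT_MAJOR; /* had to go to backing store */
                    inc_page_state(pgmajfault);
            }
            return page;
    }
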
*/ - status = filemap_fdatawrite(inode->i_mapping); + status = filemap_fdatawrite(filp->f_mapping); down(&inode->i_sem); status2 = nfs_wb_all(inode); if (!status) status = status2; up(&inode->i_sem); - status2 = filemap_fdatawait(inode->i_mapping); + status2 = filemap_fdatawait(filp->f_mapping); if (!status) status = status2; if (status < 0) @@ -335,11 +335,11 @@ */ out_ok: if ((IS_SETLK(cmd) || IS_SETLKW(cmd)) && fl->fl_type != F_UNLCK) { - filemap_fdatawrite(inode->i_mapping); + filemap_fdatawrite(filp->f_mapping); down(&inode->i_sem); nfs_wb_all(inode); /* we may have slept */ up(&inode->i_sem); - filemap_fdatawait(inode->i_mapping); + filemap_fdatawait(filp->f_mapping); nfs_zap_caches(inode); } return status; --- diff/fs/nls/Kconfig 2002-11-11 11:09:38.000000000 +0000 +++ source/fs/nls/Kconfig 2003-11-26 10:09:07.000000000 +0000 @@ -1,24 +1,25 @@ # # Native language support configuration # -# smb wants NLS -config SMB_NLS - bool - depends on SMB_FS - default y -# msdos and Joliet want NLS +menu "Native Language Support" + config NLS - bool - depends on JOLIET || FAT_FS || NTFS_FS || NCPFS_NLS || SMB_NLS || JFS_FS || CIFS || BEFS_FS - default y + tristate "Base native language support" + ---help--- + The base Native Language Support. A number of filesystems + depend on it (e.g. FAT, JOLIET, NT, BEOS filesystems), as well + as the ability of some filesystems to use native languages + (NCP, SMB). + If unsure, say Y. -menu "Native Language Support" - depends on NLS + To compile this code as a module, choose M here: the module + will be called nls_base. config NLS_DEFAULT string "Default NLS Option" + depends on NLS default "iso8859-1" ---help--- The default NLS used when mounting file system. Note, that this is @@ -38,6 +39,7 @@ config NLS_CODEPAGE_437 tristate "Codepage 437 (United States, Canada)" + depends on NLS help The Microsoft FAT file system family can deal with filenames in native language character sets. These character sets are stored @@ -50,6 +52,7 @@ config NLS_CODEPAGE_737 tristate "Codepage 737 (Greek)" + depends on NLS help The Microsoft FAT file system family can deal with filenames in native language character sets. These character sets are stored @@ -62,6 +65,7 @@ config NLS_CODEPAGE_775 tristate "Codepage 775 (Baltic Rim)" + depends on NLS help The Microsoft FAT file system family can deal with filenames in native language character sets. These character sets are stored @@ -75,6 +79,7 @@ config NLS_CODEPAGE_850 tristate "Codepage 850 (Europe)" + depends on NLS ---help--- The Microsoft FAT file system family can deal with filenames in native language character sets. These character sets are stored in @@ -91,6 +96,7 @@ config NLS_CODEPAGE_852 tristate "Codepage 852 (Central/Eastern Europe)" + depends on NLS ---help--- The Microsoft FAT file system family can deal with filenames in native language character sets. These character sets are stored in @@ -106,6 +112,7 @@ config NLS_CODEPAGE_855 tristate "Codepage 855 (Cyrillic)" + depends on NLS help The Microsoft FAT file system family can deal with filenames in native language character sets. These character sets are stored in @@ -117,6 +124,7 @@ config NLS_CODEPAGE_857 tristate "Codepage 857 (Turkish)" + depends on NLS help The Microsoft FAT file system family can deal with filenames in native language character sets. 
These character sets are stored in @@ -128,6 +136,7 @@ config NLS_CODEPAGE_860 tristate "Codepage 860 (Portuguese)" + depends on NLS help The Microsoft FAT file system family can deal with filenames in native language character sets. These character sets are stored in @@ -139,6 +148,7 @@ config NLS_CODEPAGE_861 tristate "Codepage 861 (Icelandic)" + depends on NLS help The Microsoft FAT file system family can deal with filenames in native language character sets. These character sets are stored in @@ -150,6 +160,7 @@ config NLS_CODEPAGE_862 tristate "Codepage 862 (Hebrew)" + depends on NLS help The Microsoft FAT file system family can deal with filenames in native language character sets. These character sets are stored in @@ -161,6 +172,7 @@ config NLS_CODEPAGE_863 tristate "Codepage 863 (Canadian French)" + depends on NLS help The Microsoft FAT file system family can deal with filenames in native language character sets. These character sets are stored in @@ -173,6 +185,7 @@ config NLS_CODEPAGE_864 tristate "Codepage 864 (Arabic)" + depends on NLS help The Microsoft FAT file system family can deal with filenames in native language character sets. These character sets are stored in @@ -184,6 +197,7 @@ config NLS_CODEPAGE_865 tristate "Codepage 865 (Norwegian, Danish)" + depends on NLS help The Microsoft FAT file system family can deal with filenames in native language character sets. These character sets are stored in @@ -196,6 +210,7 @@ config NLS_CODEPAGE_866 tristate "Codepage 866 (Cyrillic/Russian)" + depends on NLS help The Microsoft FAT file system family can deal with filenames in native language character sets. These character sets are stored in @@ -208,6 +223,7 @@ config NLS_CODEPAGE_869 tristate "Codepage 869 (Greek)" + depends on NLS help The Microsoft FAT file system family can deal with filenames in native language character sets. These character sets are stored in @@ -219,6 +235,7 @@ config NLS_CODEPAGE_936 tristate "Simplified Chinese charset (CP936, GB2312)" + depends on NLS help The Microsoft FAT file system family can deal with filenames in native language character sets. These character sets are stored in @@ -231,6 +248,7 @@ config NLS_CODEPAGE_950 tristate "Traditional Chinese charset (Big5)" + depends on NLS help The Microsoft FAT file system family can deal with filenames in native language character sets. These character sets are stored in @@ -243,6 +261,7 @@ config NLS_CODEPAGE_932 tristate "Japanese charsets (Shift-JIS, EUC-JP)" + depends on NLS help The Microsoft FAT file system family can deal with filenames in native language character sets. These character sets are stored in @@ -256,6 +275,7 @@ config NLS_CODEPAGE_949 tristate "Korean charset (CP949, EUC-KR)" + depends on NLS help The Microsoft FAT file system family can deal with filenames in native language character sets. These character sets are stored in @@ -267,6 +287,7 @@ config NLS_CODEPAGE_874 tristate "Thai charset (CP874, TIS-620)" + depends on NLS help The Microsoft FAT file system family can deal with filenames in native language character sets. 
These character sets are stored in @@ -278,6 +299,7 @@ config NLS_ISO8859_8 tristate "Hebrew charsets (ISO-8859-8, CP1255)" + depends on NLS help If you want to display filenames with native language characters from the Microsoft FAT file system family or from JOLIET CD-ROMs @@ -287,6 +309,7 @@ config NLS_CODEPAGE_1250 tristate "Windows CP1250 (Slavic/Central European Languages)" + depends on NLS help If you want to display filenames with native language characters from the Microsoft FAT file system family or from JOLIET CDROMs @@ -298,6 +321,7 @@ config NLS_CODEPAGE_1251 tristate "Windows CP1251 (Bulgarian, Belarusian)" + depends on NLS help The Microsoft FAT file system family can deal with filenames in native language character sets. These character sets are stored in @@ -310,6 +334,7 @@ config NLS_ISO8859_1 tristate "NLS ISO 8859-1 (Latin 1; Western European Languages)" + depends on NLS help If you want to display filenames with native language characters from the Microsoft FAT file system family or from JOLIET CD-ROMs @@ -322,6 +347,7 @@ config NLS_ISO8859_2 tristate "NLS ISO 8859-2 (Latin 2; Slavic/Central European Languages)" + depends on NLS help If you want to display filenames with native language characters from the Microsoft FAT file system family or from JOLIET CD-ROMs @@ -333,6 +359,7 @@ config NLS_ISO8859_3 tristate "NLS ISO 8859-3 (Latin 3; Esperanto, Galician, Maltese, Turkish)" + depends on NLS help If you want to display filenames with native language characters from the Microsoft FAT file system family or from JOLIET CD-ROMs @@ -343,6 +370,7 @@ config NLS_ISO8859_4 tristate "NLS ISO 8859-4 (Latin 4; old Baltic charset)" + depends on NLS help If you want to display filenames with native language characters from the Microsoft FAT file system family or from JOLIET CD-ROMs @@ -353,6 +381,7 @@ config NLS_ISO8859_5 tristate "NLS ISO 8859-5 (Cyrillic)" + depends on NLS help If you want to display filenames with native language characters from the Microsoft FAT file system family or from JOLIET CD-ROMs @@ -364,6 +393,7 @@ config NLS_ISO8859_6 tristate "NLS ISO 8859-6 (Arabic)" + depends on NLS help If you want to display filenames with native language characters from the Microsoft FAT file system family or from JOLIET CD-ROMs @@ -373,6 +403,7 @@ config NLS_ISO8859_7 tristate "NLS ISO 8859-7 (Modern Greek)" + depends on NLS help If you want to display filenames with native language characters from the Microsoft FAT file system family or from JOLIET CD-ROMs @@ -382,6 +413,7 @@ config NLS_ISO8859_9 tristate "NLS ISO 8859-9 (Latin 5; Turkish)" + depends on NLS help If you want to display filenames with native language characters from the Microsoft FAT file system family or from JOLIET CD-ROMs @@ -392,6 +424,7 @@ config NLS_ISO8859_13 tristate "NLS ISO 8859-13 (Latin 7; Baltic)" + depends on NLS help If you want to display filenames with native language characters from the Microsoft FAT file system family or from JOLIET CD-ROMs @@ -402,6 +435,7 @@ config NLS_ISO8859_14 tristate "NLS ISO 8859-14 (Latin 8; Celtic)" + depends on NLS help If you want to display filenames with native language characters from the Microsoft FAT file system family or from JOLIET CD-ROMs @@ -413,6 +447,7 @@ config NLS_ISO8859_15 tristate "NLS ISO 8859-15 (Latin 9; Western European Languages with Euro)" + depends on NLS ---help--- If you want to display filenames with native language characters from the Microsoft FAT file system family or from JOLIET CD-ROMs @@ -429,6 +464,7 @@ config NLS_KOI8_R tristate "NLS 
KOI8-R (Russian)" + depends on NLS help If you want to display filenames with native language characters from the Microsoft FAT file system family or from JOLIET CD-ROMs @@ -438,6 +474,7 @@ config NLS_KOI8_U tristate "NLS KOI8-U/RU (Ukrainian, Belarusian)" + depends on NLS help If you want to display filenames with native language characters from the Microsoft FAT file system family or from JOLIET CD-ROMs @@ -447,6 +484,7 @@ config NLS_UTF8 tristate "NLS UTF8" + depends on NLS help If you want to display filenames with native language characters from the Microsoft FAT file system family or from JOLIET CD-ROMs --- diff/fs/nls/nls_base.c 2003-09-30 15:46:19.000000000 +0100 +++ source/fs/nls/nls_base.c 2003-11-26 10:09:07.000000000 +0000 @@ -480,7 +480,7 @@ if (default_nls != NULL) return default_nls; else - return &default_table; + return &default_table; } EXPORT_SYMBOL(register_nls); @@ -492,3 +492,5 @@ EXPORT_SYMBOL(utf8_mbstowcs); EXPORT_SYMBOL(utf8_wctomb); EXPORT_SYMBOL(utf8_wcstombs); + +MODULE_LICENSE("Dual BSD/GPL"); --- diff/fs/open.c 2003-10-27 09:20:39.000000000 +0000 +++ source/fs/open.c 2003-11-26 10:09:07.000000000 +0000 @@ -192,7 +192,9 @@ newattrs.ia_size = length; newattrs.ia_valid = ATTR_SIZE | ATTR_CTIME; down(&dentry->d_inode->i_sem); + down_write(&dentry->d_inode->i_alloc_sem); err = notify_change(dentry, &newattrs); + up_write(&dentry->d_inode->i_alloc_sem); up(&dentry->d_inode->i_sem); return err; } @@ -776,7 +778,8 @@ goto cleanup_file; } - file_ra_state_init(&f->f_ra, inode->i_mapping); + f->f_mapping = inode->i_mapping; + file_ra_state_init(&f->f_ra, f->f_mapping); f->f_dentry = dentry; f->f_vfsmnt = mnt; f->f_pos = 0; @@ -792,8 +795,8 @@ /* NB: we're sure to have correct a_ops only after f_op->open */ if (f->f_flags & O_DIRECT) { - if (!inode->i_mapping || !inode->i_mapping->a_ops || - !inode->i_mapping->a_ops->direct_IO) { + if (!f->f_mapping || !f->f_mapping->a_ops || + !f->f_mapping->a_ops->direct_IO) { fput(f); f = ERR_PTR(-EINVAL); } --- diff/fs/pipe.c 2003-10-09 09:47:34.000000000 +0100 +++ source/fs/pipe.c 2003-11-26 10:09:07.000000000 +0000 @@ -13,6 +13,7 @@ #include <linux/fs.h> #include <linux/mount.h> #include <linux/pipe_fs_i.h> +#include <linux/uio.h> #include <asm/uaccess.h> #include <asm/ioctls.h> @@ -43,19 +44,63 @@ down(PIPE_SEM(*inode)); } +static inline int +pipe_iov_copy_from_user(void *to, struct iovec *iov, unsigned long len) +{ + unsigned long copy; + + while (len > 0) { + while (!iov->iov_len) + iov++; + copy = min_t(unsigned long, len, iov->iov_len); + + if (copy_from_user(to, iov->iov_base, copy)) + return -EFAULT; + to += copy; + len -= copy; + iov->iov_base += copy; + iov->iov_len -= copy; + } + return 0; +} + +static inline int +pipe_iov_copy_to_user(struct iovec *iov, const void *from, unsigned long len) +{ + unsigned long copy; + + while (len > 0) { + while (!iov->iov_len) + iov++; + copy = min_t(unsigned long, len, iov->iov_len); + + if (copy_to_user(iov->iov_base, from, copy)) + return -EFAULT; + from += copy; + len -= copy; + iov->iov_base += copy; + iov->iov_len -= copy; + } + return 0; +} + static ssize_t -pipe_read(struct file *filp, char __user *buf, size_t count, loff_t *ppos) +pipe_readv(struct file *filp, const struct iovec *_iov, + unsigned long nr_segs, loff_t *ppos) { struct inode *inode = filp->f_dentry->d_inode; int do_wakeup; ssize_t ret; + struct iovec *iov = (struct iovec *)_iov; + size_t total_len; /* pread is not allowed on pipes. 
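
With pipe_readv() and pipe_writev() wired into the pipe and fifo file_operations below, readv(2) and writev(2) on a pipe are handled in one pass instead of the VFS's one-call-per-segment fallback. A userspace sketch:

    #include <sys/uio.h>
    #include <unistd.h>

    static ssize_t send_record(int pipe_wfd)
    {
            char hdr[4]  = { 'H', 'D', 'R', ':' };
            char body[8] = "payload";
            struct iovec iov[2] = {
                    { .iov_base = hdr,  .iov_len = sizeof(hdr)  },
                    { .iov_base = body, .iov_len = sizeof(body) },
            };

            /* Both segments enter the pipe contiguously; totals of up
             * to PIPE_BUF bytes are written atomically. */
            return writev(pipe_wfd, iov, 2);
    }
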
*/ if (unlikely(ppos != &filp->f_pos)) return -ESPIPE; - + + total_len = iov_length(iov, nr_segs); /* Null read succeeds. */ - if (unlikely(count == 0)) + if (unlikely(total_len == 0)) return 0; do_wakeup = 0; @@ -67,12 +112,12 @@ char *pipebuf = PIPE_BASE(*inode) + PIPE_START(*inode); ssize_t chars = PIPE_MAX_RCHUNK(*inode); - if (chars > count) - chars = count; + if (chars > total_len) + chars = total_len; if (chars > size) chars = size; - if (copy_to_user(buf, pipebuf, chars)) { + if (pipe_iov_copy_to_user(iov, pipebuf, chars)) { if (!ret) ret = -EFAULT; break; } @@ -81,12 +126,11 @@ PIPE_START(*inode) += chars; PIPE_START(*inode) &= (PIPE_SIZE - 1); PIPE_LEN(*inode) -= chars; - count -= chars; - buf += chars; + total_len -= chars; do_wakeup = 1; + if (!total_len) + break; /* common path: read succeeded */ } - if (!count) - break; /* common path: read succeeded */ if (PIPE_LEN(*inode)) /* test for cyclic buffers */ continue; if (!PIPE_WRITERS(*inode)) @@ -126,24 +170,35 @@ } static ssize_t -pipe_write(struct file *filp, const char __user *buf, size_t count, loff_t *ppos) +pipe_read(struct file *filp, char __user *buf, size_t count, loff_t *ppos) +{ + struct iovec iov = { .iov_base = buf, .iov_len = count }; + return pipe_readv(filp, &iov, 1, ppos); +} + +static ssize_t +pipe_writev(struct file *filp, const struct iovec *_iov, + unsigned long nr_segs, loff_t *ppos) { struct inode *inode = filp->f_dentry->d_inode; ssize_t ret; size_t min; int do_wakeup; + struct iovec *iov = (struct iovec *)_iov; + size_t total_len; /* pwrite is not allowed on pipes. */ if (unlikely(ppos != &filp->f_pos)) return -ESPIPE; - + + total_len = iov_length(iov, nr_segs); /* Null write succeeds. */ - if (unlikely(count == 0)) + if (unlikely(total_len == 0)) return 0; do_wakeup = 0; ret = 0; - min = count; + min = total_len; if (min > PIPE_BUF) min = 1; down(PIPE_SEM(*inode)); @@ -164,23 +219,22 @@ * syscall merging. 
*/ do_wakeup = 1; - if (chars > count) - chars = count; + if (chars > total_len) + chars = total_len; if (chars > free) chars = free; - if (copy_from_user(pipebuf, buf, chars)) { + if (pipe_iov_copy_from_user(pipebuf, iov, chars)) { if (!ret) ret = -EFAULT; break; } - ret += chars; + PIPE_LEN(*inode) += chars; - count -= chars; - buf += chars; + total_len -= chars; + if (!total_len) + break; } - if (!count) - break; if (PIPE_FREE(*inode) && ret) { /* handle cyclic data buffers */ min = 1; @@ -214,6 +268,14 @@ } static ssize_t +pipe_write(struct file *filp, const char __user *buf, + size_t count, loff_t *ppos) +{ + struct iovec iov = { .iov_base = (void __user *)buf, .iov_len = count }; + return pipe_writev(filp, &iov, 1, ppos); +} + +static ssize_t bad_pipe_r(struct file *filp, char __user *buf, size_t count, loff_t *ppos) { return -EBADF; @@ -405,6 +467,7 @@ struct file_operations read_fifo_fops = { .llseek = no_llseek, .read = pipe_read, + .readv = pipe_readv, .write = bad_pipe_w, .poll = fifo_poll, .ioctl = pipe_ioctl, @@ -417,6 +480,7 @@ .llseek = no_llseek, .read = bad_pipe_r, .write = pipe_write, + .writev = pipe_writev, .poll = fifo_poll, .ioctl = pipe_ioctl, .open = pipe_write_open, @@ -427,7 +491,9 @@ struct file_operations rdwr_fifo_fops = { .llseek = no_llseek, .read = pipe_read, + .readv = pipe_readv, .write = pipe_write, + .writev = pipe_writev, .poll = fifo_poll, .ioctl = pipe_ioctl, .open = pipe_rdwr_open, @@ -438,6 +504,7 @@ struct file_operations read_pipe_fops = { .llseek = no_llseek, .read = pipe_read, + .readv = pipe_readv, .write = bad_pipe_w, .poll = pipe_poll, .ioctl = pipe_ioctl, @@ -450,6 +517,7 @@ .llseek = no_llseek, .read = bad_pipe_r, .write = pipe_write, + .writev = pipe_writev, .poll = pipe_poll, .ioctl = pipe_ioctl, .open = pipe_write_open, @@ -460,7 +528,9 @@ struct file_operations rdwr_pipe_fops = { .llseek = no_llseek, .read = pipe_read, + .readv = pipe_readv, .write = pipe_write, + .writev = pipe_writev, .poll = pipe_poll, .ioctl = pipe_ioctl, .open = pipe_rdwr_open, @@ -580,6 +650,7 @@ d_add(dentry, inode); f1->f_vfsmnt = f2->f_vfsmnt = mntget(mntget(pipe_mnt)); f1->f_dentry = f2->f_dentry = dget(dentry); + f1->f_mapping = f2->f_mapping = inode->i_mapping; /* read file */ f1->f_pos = f2->f_pos = 0; --- diff/fs/proc/base.c 2003-10-09 09:47:34.000000000 +0100 +++ source/fs/proc/base.c 2003-11-26 10:09:07.000000000 +0000 @@ -1524,6 +1524,7 @@ struct inode *inode; struct proc_inode *ei; unsigned tgid; + int died; if (dentry->d_name.len == 4 && !memcmp(dentry->d_name.name,"self",4)) { inode = new_inode(dir->i_sb); @@ -1567,12 +1568,21 @@ dentry->d_op = &pid_base_dentry_operations; + died = 0; + d_add(dentry, inode); spin_lock(&task->proc_lock); task->proc_dentry = dentry; - d_add(dentry, inode); + if (!pid_alive(task)) { + dentry = proc_pid_unhash(task); + died = 1; + } spin_unlock(&task->proc_lock); put_task_struct(task); + if (died) { + proc_pid_flush(dentry); + goto out; + } return NULL; out: return ERR_PTR(-ENOENT); @@ -1612,10 +1622,7 @@ dentry->d_op = &pid_base_dentry_operations; - spin_lock(&task->proc_lock); - task->proc_dentry = dentry; d_add(dentry, inode); - spin_unlock(&task->proc_lock); put_task_struct(task); return NULL; --- diff/fs/proc/proc_misc.c 2003-10-09 09:47:17.000000000 +0100 +++ source/fs/proc/proc_misc.c 2003-11-26 10:09:07.000000000 +0000 @@ -473,30 +473,46 @@ return proc_calc_metrics(page, start, off, count, eof, len); } -extern int show_interrupts(struct seq_file *p, void *v); -static int interrupts_open(struct inode *inode, 
struct file *file) +/* + * /proc/interrupts + */ +static void *int_seq_start(struct seq_file *f, loff_t *pos) +{ + return (*pos <= NR_IRQS) ? pos : NULL; +} + +static void *int_seq_next(struct seq_file *f, void *v, loff_t *pos) +{ + (*pos)++; + if (*pos > NR_IRQS) + return NULL; + return pos; +} + +static void int_seq_stop(struct seq_file *f, void *v) { - unsigned size = 4096 * (1 + num_online_cpus() / 8); - char *buf = kmalloc(size, GFP_KERNEL); - struct seq_file *m; - int res; - - if (!buf) - return -ENOMEM; - res = single_open(file, show_interrupts, NULL); - if (!res) { - m = file->private_data; - m->buf = buf; - m->size = size; - } else - kfree(buf); - return res; + /* Nothing to do */ } + + +extern int show_interrupts(struct seq_file *f, void *v); /* In arch code */ +static struct seq_operations int_seq_ops = { + .start = int_seq_start, + .next = int_seq_next, + .stop = int_seq_stop, + .show = show_interrupts +}; + +int interrupts_open(struct inode *inode, struct file *filp) +{ + return seq_open(filp, &int_seq_ops); +} + static struct file_operations proc_interrupts_operations = { .open = interrupts_open, .read = seq_read, .llseek = seq_lseek, - .release = single_release, + .release = seq_release, }; static int filesystems_read_proc(char *page, char **start, off_t off, @@ -638,6 +654,36 @@ entry->proc_fops = f; } +#ifdef CONFIG_LOCKMETER +extern ssize_t get_lockmeter_info(char *, size_t, loff_t *); +extern ssize_t put_lockmeter_info(const char *, size_t); +extern int get_lockmeter_info_size(void); + +/* + * This function accesses lock metering information. + */ +static ssize_t read_lockmeter(struct file *file, char *buf, + size_t count, loff_t *ppos) +{ + return get_lockmeter_info(buf, count, ppos); +} + +/* + * Writing to /proc/lockmeter resets the counters + */ +static ssize_t write_lockmeter(struct file * file, const char * buf, + size_t count, loff_t *ppos) +{ + return put_lockmeter_info(buf, count); +} + +static struct file_operations proc_lockmeter_operations = { + NULL, /* lseek */ + read: read_lockmeter, + write: write_lockmeter, +}; +#endif /* CONFIG_LOCKMETER */ + void __init proc_misc_init(void) { struct proc_dir_entry *entry; @@ -705,6 +751,13 @@ if (entry) entry->proc_fops = &proc_sysrq_trigger_operations; #endif +#ifdef CONFIG_LOCKMETER + entry = create_proc_entry("lockmeter", S_IWUSR | S_IRUGO, NULL); + if (entry) { + entry->proc_fops = &proc_lockmeter_operations; + entry->size = get_lockmeter_info_size(); + } +#endif #ifdef CONFIG_PPC32 { extern struct file_operations ppc_htab_operations; --- diff/fs/proc/proc_tty.c 2003-06-30 10:07:24.000000000 +0100 +++ source/fs/proc/proc_tty.c 2003-11-26 10:09:07.000000000 +0000 @@ -198,6 +198,7 @@ return; ent->read_proc = driver->read_proc; ent->write_proc = driver->write_proc; + ent->owner = driver->owner; ent->data = driver; driver->proc_entry = ent; --- diff/fs/proc/task_mmu.c 2003-09-17 12:28:11.000000000 +0100 +++ source/fs/proc/task_mmu.c 2003-11-26 10:09:07.000000000 +0000 @@ -2,6 +2,7 @@ #include <linux/hugetlb.h> #include <linux/seq_file.h> #include <asm/uaccess.h> +#include <asm/pgtable.h> char *task_mem(struct mm_struct *mm, char *buffer) { @@ -105,12 +106,22 @@ if (len < 1) len = 1; seq_printf(m, "%*c", len, ' '); - seq_path(m, file->f_vfsmnt, file->f_dentry, " \t\n\\"); + seq_path(m, file->f_vfsmnt, file->f_dentry, ""); } seq_putc(m, '\n'); return 0; } +#ifdef FIXADDR_USER_START +static struct vm_area_struct fixmap_vma = { + .vm_mm = NULL, + .vm_start = FIXADDR_USER_START, + .vm_end = FIXADDR_USER_END, + .vm_page_prot 
= PAGE_READONLY, + .vm_flags = VM_READ | VM_EXEC, +}; +#endif + static void *m_start(struct seq_file *m, loff_t *pos) { struct task_struct *task = m->private; @@ -128,6 +139,10 @@ if (!map) { up_read(&mm->mmap_sem); mmput(mm); +#ifdef FIXADDR_USER_START + if (l == (loff_t) -1) + map = &fixmap_vma; +#endif } return map; } @@ -135,6 +150,10 @@ static void m_stop(struct seq_file *m, void *v) { struct vm_area_struct *map = v; +#ifdef FIXADDR_USER_START + if (map == &fixmap_vma) + return; +#endif if (map) { struct mm_struct *mm = map->vm_mm; up_read(&mm->mmap_sem); @@ -149,6 +168,10 @@ if (map->vm_next) return map->vm_next; m_stop(m, v); +#ifdef FIXADDR_USER_START + if (map != &fixmap_vma) + return &fixmap_vma; +#endif return NULL; } --- diff/fs/read_write.c 2003-10-09 09:47:34.000000000 +0100 +++ source/fs/read_write.c 2003-11-26 10:09:07.000000000 +0000 @@ -28,7 +28,7 @@ loff_t generic_file_llseek(struct file *file, loff_t offset, int origin) { long long retval; - struct inode *inode = file->f_dentry->d_inode->i_mapping->host; + struct inode *inode = file->f_mapping->host; down(&inode->i_sem); switch (origin) { --- diff/fs/reiserfs/file.c 2003-09-30 15:46:19.000000000 +0100 +++ source/fs/reiserfs/file.c 2003-11-26 10:09:07.000000000 +0000 @@ -1052,7 +1052,7 @@ /* Check if we can write to specified region of file, file is not overly big and this kind of stuff. Adjust pos and count, if needed */ - res = generic_write_checks(inode, file, &pos, &count, 0); + res = generic_write_checks(file, &pos, &count, 0); if (res) goto out; @@ -1179,7 +1179,7 @@ } if ((file->f_flags & O_SYNC) || IS_SYNC(inode)) - res = generic_osync_inode(inode, OSYNC_METADATA|OSYNC_DATA); + res = generic_osync_inode(inode, file->f_mapping, OSYNC_METADATA|OSYNC_DATA); up(&inode->i_sem); return (already_written != 0)?already_written:res; --- diff/fs/reiserfs/inode.c 2003-09-30 15:46:19.000000000 +0100 +++ source/fs/reiserfs/inode.c 2003-11-26 10:09:07.000000000 +0000 @@ -2375,7 +2375,7 @@ loff_t offset, unsigned long nr_segs) { struct file *file = iocb->ki_filp; - struct inode *inode = file->f_dentry->d_inode->i_mapping->host; + struct inode *inode = file->f_mapping->host; return blockdev_direct_IO(rw, iocb, inode, inode->i_sb->s_bdev, iov, offset, nr_segs, reiserfs_get_blocks_direct_io, NULL); --- diff/fs/reiserfs/journal.c 2003-09-30 15:46:19.000000000 +0100 +++ source/fs/reiserfs/journal.c 2003-11-26 10:09:07.000000000 +0000 @@ -1937,18 +1937,13 @@ journal -> j_dev_file = filp_open( jdev_name, 0, 0 ); if( !IS_ERR( journal -> j_dev_file ) ) { - struct inode *jdev_inode; - - jdev_inode = journal -> j_dev_file -> f_dentry -> d_inode; - journal -> j_dev_bd = jdev_inode -> i_bdev; + struct inode *jdev_inode = journal->j_dev_file->f_mapping->host; if( !S_ISBLK( jdev_inode -> i_mode ) ) { printk( "journal_init_dev: '%s' is not a block device\n", jdev_name ); result = -ENOTBLK; - } else if( jdev_inode -> i_bdev == NULL ) { - printk( "journal_init_dev: bdev uninitialized for '%s'\n", jdev_name ); - result = -ENOMEM; } else { /* ok */ + journal->j_dev_bd = I_BDEV(jdev_inode); set_blocksize(journal->j_dev_bd, super->s_blocksize); } } else { --- diff/fs/stat.c 2003-10-09 09:47:34.000000000 +0100 +++ source/fs/stat.c 2003-11-26 10:09:07.000000000 +0000 @@ -106,7 +106,7 @@ EXPORT_SYMBOL(vfs_fstat); #if !defined(__alpha__) && !defined(__sparc__) && !defined(__ia64__) \ - && !defined(CONFIG_ARCH_S390) && !defined(__hppa__) && !defined(__x86_64__) \ + && !defined(CONFIG_ARCH_S390) && !defined(__hppa__) \ && !defined(__arm__) && 
!defined(CONFIG_V850) && !defined(__powerpc64__) /* --- diff/fs/super.c 2003-10-09 09:47:34.000000000 +0100 +++ source/fs/super.c 2003-11-26 10:09:07.000000000 +0000 @@ -66,6 +66,7 @@ INIT_LIST_HEAD(&s->s_files); INIT_LIST_HEAD(&s->s_instances); INIT_HLIST_HEAD(&s->s_anon); + INIT_LIST_HEAD(&s->s_inodes); init_rwsem(&s->s_umount); sema_init(&s->s_lock, 1); down_write(&s->s_umount); --- diff/fs/sysfs/bin.c 2003-09-17 12:28:12.000000000 +0100 +++ source/fs/sysfs/bin.c 2003-11-26 10:09:07.000000000 +0000 @@ -152,6 +152,9 @@ struct dentry * parent; int error = 0; + if (nosysfs) + return 0; + if (!kobj || !attr) return -EINVAL; @@ -185,6 +188,9 @@ int sysfs_remove_bin_file(struct kobject * kobj, struct bin_attribute * attr) { + if (nosysfs) + return 0; + sysfs_hash_and_remove(kobj->dentry,attr->attr.name); return 0; } --- diff/fs/sysfs/dir.c 2003-10-09 09:47:17.000000000 +0100 +++ source/fs/sysfs/dir.c 2003-11-26 10:09:07.000000000 +0000 @@ -46,6 +46,8 @@ int sysfs_create_subdir(struct kobject * k, const char * n, struct dentry ** d) { + if (nosysfs) + return 0; return create_dir(k,k->dentry,n,d); } @@ -61,6 +63,9 @@ struct dentry * parent; int error = 0; + if (nosysfs) + return 0; + if (!kobj) return -EINVAL; @@ -94,6 +99,8 @@ void sysfs_remove_subdir(struct dentry * d) { + if (nosysfs) + return; remove_dir(d); } @@ -110,8 +117,12 @@ void sysfs_remove_dir(struct kobject * kobj) { struct list_head * node; - struct dentry * dentry = dget(kobj->dentry); + struct dentry *dentry; + + if (nosysfs) + return; + dentry = dget(kobj->dentry); if (!dentry) return; @@ -122,8 +133,8 @@ node = dentry->d_subdirs.next; while (node != &dentry->d_subdirs) { struct dentry * d = list_entry(node,struct dentry,d_child); - list_del_init(node); + node = node->next; pr_debug(" o %s (%d): ",d->d_name.name,atomic_read(&d->d_count)); if (d->d_inode) { d = dget_locked(d); @@ -139,9 +150,7 @@ spin_lock(&dcache_lock); } pr_debug(" done\n"); - node = dentry->d_subdirs.next; } - list_del_init(&dentry->d_child); spin_unlock(&dcache_lock); up(&dentry->d_inode->i_sem); @@ -156,6 +165,9 @@ { struct dentry * new_dentry, * parent; + if (nosysfs) + return; + if (!strcmp(kobject_name(kobj), new_name)) return; --- diff/fs/sysfs/file.c 2003-09-30 15:46:19.000000000 +0100 +++ source/fs/sysfs/file.c 2003-11-26 10:09:07.000000000 +0000 @@ -350,6 +350,9 @@ struct dentry * dentry; int error; + if (nosysfs) + return 0; + down(&dir->d_inode->i_sem); dentry = sysfs_get_dentry(dir,attr->name); if (!IS_ERR(dentry)) { @@ -374,6 +377,9 @@ int sysfs_create_file(struct kobject * kobj, const struct attribute * attr) { + if (nosysfs) + return 0; + if (kobj && attr) return sysfs_add_file(kobj->dentry,attr); return -EINVAL; @@ -394,6 +400,9 @@ struct dentry * victim; int res = -ENOENT; + if (nosysfs) + return 0; + down(&dir->d_inode->i_sem); victim = sysfs_get_dentry(dir, attr->name); if (!IS_ERR(victim)) { --- diff/fs/sysfs/group.c 2003-09-30 15:46:19.000000000 +0100 +++ source/fs/sysfs/group.c 2003-11-26 10:09:07.000000000 +0000 @@ -45,6 +45,9 @@ struct dentry * dir; int error; + if (nosysfs) + return 0; + if (grp->name) { error = sysfs_create_subdir(kobj,grp->name,&dir); if (error) @@ -65,6 +68,9 @@ { struct dentry * dir; + if (nosysfs) + return; + if (grp->name) dir = sysfs_get_dentry(kobj->dentry,grp->name); else --- diff/fs/sysfs/inode.c 2003-10-09 09:47:17.000000000 +0100 +++ source/fs/sysfs/inode.c 2003-11-26 10:09:07.000000000 +0000 @@ -11,7 +11,8 @@ #include <linux/pagemap.h> #include <linux/namei.h> #include <linux/backing-dev.h> -extern 
struct super_block * sysfs_sb; +#include <linux/init.h> +#include "sysfs.h" static struct address_space_operations sysfs_aops = { .readpage = simple_readpage, @@ -24,6 +25,8 @@ .memory_backed = 1, /* Does not contribute to dirty memory */ }; +int nosysfs; + struct inode * sysfs_new_inode(mode_t mode) { struct inode * inode = new_inode(sysfs_sb); @@ -44,6 +47,10 @@ { int error = 0; struct inode * inode = NULL; + + if (nosysfs) + return 0; + if (dentry) { if (!dentry->d_inode) { if ((inode = sysfs_new_inode(mode))) @@ -87,6 +94,8 @@ { struct dentry * victim; + if (nosysfs) + return; down(&dir->d_inode->i_sem); victim = sysfs_get_dentry(dir,name); if (!IS_ERR(victim)) { @@ -107,4 +116,9 @@ up(&dir->d_inode->i_sem); } - +static int __init nosysfs_setup(char *str) +{ + nosysfs = 1; + return 1; +} +__setup("nosysfs", nosysfs_setup); --- diff/fs/sysfs/symlink.c 2003-09-17 12:28:12.000000000 +0100 +++ source/fs/sysfs/symlink.c 2003-11-26 10:09:07.000000000 +0000 @@ -79,6 +79,9 @@ char * path; char * s; + if (nosysfs) + return 0; + depth = object_depth(kobj); size = object_path_length(target) + depth * 3 - 1; if (size > PATH_MAX) --- diff/fs/sysfs/sysfs.h 2003-09-30 15:46:19.000000000 +0100 +++ source/fs/sysfs/sysfs.h 2003-11-26 10:09:07.000000000 +0000 @@ -1,4 +1,6 @@ +struct super_block; +extern struct super_block *sysfs_sb; extern struct vfsmount * sysfs_mount; extern struct inode * sysfs_new_inode(mode_t mode); --- diff/fs/udf/super.c 2003-10-09 09:47:34.000000000 +0100 +++ source/fs/udf/super.c 2003-11-26 10:09:07.000000000 +0000 @@ -414,7 +414,7 @@ case Opt_utf8: uopt->flags |= (1 << UDF_FLAG_UTF8); break; -#ifdef CONFIG_NLS +#if defined(CONFIG_NLS) || defined(CONFIG_NLS_MODULE) case Opt_iocharset: uopt->nls_map = load_nls(args[0].from); uopt->flags |= (1 << UDF_FLAG_NLS_MAP); @@ -1510,7 +1510,7 @@ "utf8 cannot be combined with iocharset\n"); goto error_out; } -#ifdef CONFIG_NLS +#if defined(CONFIG_NLS) || defined(CONFIG_NLS_MODULE) if ((uopt.flags & (1 << UDF_FLAG_NLS_MAP)) && !uopt.nls_map) { uopt.nls_map = load_nls_default(); @@ -1674,7 +1674,7 @@ udf_release_data(UDF_SB_TYPESPAR(sb, UDF_SB_PARTITION(sb)).s_spar_map[i]); } } -#ifdef CONFIG_NLS +#if defined(CONFIG_NLS) || defined(CONFIG_NLS_MODULE) if (UDF_QUERY_FLAG(sb, UDF_FLAG_NLS_MAP)) unload_nls(UDF_SB(sb)->s_nls_map); #endif @@ -1766,7 +1766,7 @@ udf_release_data(UDF_SB_TYPESPAR(sb, UDF_SB_PARTITION(sb)).s_spar_map[i]); } } -#ifdef CONFIG_NLS +#if defined(CONFIG_NLS) || defined(CONFIG_NLS_MODULE) if (UDF_QUERY_FLAG(sb, UDF_FLAG_NLS_MAP)) unload_nls(UDF_SB(sb)->s_nls_map); #endif --- diff/fs/xfs/linux/xfs_aops.c 2003-10-27 09:20:39.000000000 +0000 +++ source/fs/xfs/linux/xfs_aops.c 2003-11-26 10:09:07.000000000 +0000 @@ -974,7 +974,7 @@ unsigned long nr_segs) { struct file *file = iocb->ki_filp; - struct inode *inode = file->f_dentry->d_inode->i_mapping->host; + struct inode *inode = file->f_mapping->host; vnode_t *vp = LINVFS_GET_VP(inode); page_buf_bmap_t pbmap; int maps = 1; @@ -984,7 +984,8 @@ if (error) return -error; - return blockdev_direct_IO(rw, iocb, inode, pbmap.pbm_target->pbr_bdev, + return blockdev_direct_IO_no_locking(rw, iocb, inode, + pbmap.pbm_target->pbr_bdev, iov, offset, nr_segs, linvfs_get_blocks_direct, linvfs_unwritten_convert_direct); --- diff/fs/xfs/linux/xfs_file.c 2003-10-27 09:20:39.000000000 +0000 +++ source/fs/xfs/linux/xfs_file.c 2003-11-26 10:09:07.000000000 +0000 @@ -112,7 +112,7 @@ { struct iovec iov = {(void *)buf, count}; struct file *file = iocb->ki_filp; - struct inode *inode = 
file->f_dentry->d_inode->i_mapping->host; + struct inode *inode = file->f_mapping->host; vnode_t *vp = LINVFS_GET_VP(inode); int error; @@ -160,7 +160,7 @@ unsigned long nr_segs, loff_t *ppos) { - struct inode *inode = file->f_dentry->d_inode->i_mapping->host; + struct inode *inode = file->f_mapping->host; vnode_t *vp = LINVFS_GET_VP(inode); struct kiocb kiocb; int error; @@ -207,7 +207,7 @@ unsigned long nr_segs, loff_t *ppos) { - struct inode *inode = file->f_dentry->d_inode->i_mapping->host; + struct inode *inode = file->f_mapping->host; vnode_t *vp = LINVFS_GET_VP(inode); struct kiocb kiocb; int error; --- diff/fs/xfs/xfs_inode.c 2003-10-27 09:20:39.000000000 +0000 +++ source/fs/xfs/xfs_inode.c 2003-11-26 10:09:07.000000000 +0000 @@ -3722,7 +3722,7 @@ * Read/write DACs are always overridable. * Executable DACs are overridable if at least one exec bit is set. */ - if ((orgmode & (S_IRUSR|S_IWUSR)) || (inode->i_mode & S_IXUGO)) + if (!(orgmode & S_IXUSR) || (inode->i_mode & S_IXUGO)) if (capable_cred(cr, CAP_DAC_OVERRIDE)) return 0; --- diff/include/asm-alpha/spinlock.h 2003-10-09 09:47:17.000000000 +0100 +++ source/include/asm-alpha/spinlock.h 2003-11-26 10:09:07.000000000 +0000 @@ -6,6 +6,10 @@ #include <linux/kernel.h> #include <asm/current.h> +#ifdef CONFIG_LOCKMETER +#undef DEBUG_SPINLOCK +#undef DEBUG_RWLOCK +#endif /* * Simple spin lock operations. There are two variants, one clears IRQ's @@ -95,9 +99,18 @@ typedef struct { volatile int write_lock:1, read_counter:31; +#ifdef CONFIG_LOCKMETER + /* required for LOCKMETER since all bits in lock are used */ + /* need this storage for CPU and lock INDEX ............. */ + unsigned magic; +#endif } /*__attribute__((aligned(32)))*/ rwlock_t; +#ifdef CONFIG_LOCKMETER +#define RW_LOCK_UNLOCKED (rwlock_t) { 0, 0, 0 } +#else #define RW_LOCK_UNLOCKED (rwlock_t) { 0, 0 } +#endif #define rwlock_init(x) do { *(x) = RW_LOCK_UNLOCKED; } while(0) #define rwlock_is_locked(x) (*(volatile int *)(x) != 0) @@ -169,4 +182,41 @@ : "m" (*lock) : "memory"); } +#ifdef CONFIG_LOCKMETER +static inline int _raw_write_trylock(rwlock_t *lock) +{ + long temp,result; + + __asm__ __volatile__( + " ldl_l %1,%0\n" + " mov $31,%2\n" + " bne %1,1f\n" + " or $31,1,%2\n" + " stl_c %2,%0\n" + "1: mb\n" + : "=m" (*(volatile int *)lock), "=&r" (temp), "=&r" (result) + : "m" (*(volatile int *)lock) + ); + + return (result); +} + +static inline int _raw_read_trylock(rwlock_t *lock) +{ + unsigned long temp,result; + + __asm__ __volatile__( + " ldl_l %1,%0\n" + " mov $31,%2\n" + " blbs %1,1f\n" + " subl %1,2,%2\n" + " stl_c %2,%0\n" + "1: mb\n" + : "=m" (*(volatile int *)lock), "=&r" (temp), "=&r" (result) + : "m" (*(volatile int *)lock) + ); + return (result); +} +#endif /* CONFIG_LOCKMETER */ + #endif /* _ALPHA_SPINLOCK_H */ --- diff/include/asm-generic/cpumask_const_value.h 2003-08-26 10:00:54.000000000 +0100 +++ source/include/asm-generic/cpumask_const_value.h 2003-11-26 10:09:07.000000000 +0000 @@ -3,7 +3,7 @@ typedef const cpumask_t cpumask_const_t; -#define mk_cpumask_const(map) ((cpumask_const_t)(map)) +#define mk_cpumask_const(map) (map) #define cpu_isset_const(cpu, map) cpu_isset(cpu, map) #define cpus_and_const(dst,src1,src2) cpus_and(dst, src1, src2) #define cpus_or_const(dst,src1,src2) cpus_or(dst, src1, src2) --- diff/include/asm-i386/bugs.h 2003-10-09 09:47:17.000000000 +0100 +++ source/include/asm-i386/bugs.h 2003-11-26 10:09:07.000000000 +0000 @@ -1,11 +1,11 @@ /* * include/asm-i386/bugs.h * - * Copyright (C) 1994 Linus Torvalds + * Copyright (C) 1994 Linus 
Torvalds * * Cyrix stuff, June 1998 by: * - Rafael R. Reilova (moved everything from head.S), - * <rreilova@ececs.uc.edu> + * <rreilova@ececs.uc.edu> * - Channing Corn (tests & fixes), * - Andrew D. Balsa (code cleanup). * @@ -25,7 +25,20 @@ #include <asm/processor.h> #include <asm/i387.h> #include <asm/msr.h> - +#ifdef CONFIG_KGDB +/* + * Provide the command line "gdb" initial break + */ +int __init kgdb_initial_break(char * str) +{ + if (*str == '\0'){ + breakpoint(); + return 1; + } + return 0; +} +__setup("gdb",kgdb_initial_break); +#endif static int __init no_halt(char *s) { boot_cpu_data.hlt_works_ok = 0; @@ -140,7 +153,7 @@ : "ecx", "edi" ); /* If this fails, it means that any user program may lock the CPU hard. Too bad. */ if (res != 12345678) printk( "Buggy.\n" ); - else printk( "OK.\n" ); + else printk( "OK.\n" ); #endif } --- diff/include/asm-i386/checksum.h 2003-11-25 15:24:59.000000000 +0000 +++ source/include/asm-i386/checksum.h 2003-11-26 10:09:07.000000000 +0000 @@ -25,7 +25,7 @@ * better 64-bit) boundary */ -asmlinkage unsigned int csum_partial_copy_generic( const char *src, char *dst, int len, int sum, +asmlinkage unsigned int direct_csum_partial_copy_generic( const char *src, char *dst, int len, int sum, int *src_err_ptr, int *dst_err_ptr); /* @@ -39,14 +39,19 @@ unsigned int csum_partial_copy_nocheck ( const char *src, char *dst, int len, int sum) { - return csum_partial_copy_generic ( src, dst, len, sum, NULL, NULL); + /* + * The direct function is OK for kernel-space => kernel-space copies: + */ + return direct_csum_partial_copy_generic ( src, dst, len, sum, NULL, NULL); } static __inline__ unsigned int csum_partial_copy_from_user ( const char *src, char *dst, int len, int sum, int *err_ptr) { - return csum_partial_copy_generic ( src, dst, len, sum, err_ptr, NULL); + if (copy_from_user(dst, src, len)) + *err_ptr = -EFAULT; + return csum_partial(dst, len, sum); } /* @@ -172,11 +177,26 @@ * Copy and checksum to user */ #define HAVE_CSUM_COPY_USER -static __inline__ unsigned int csum_and_copy_to_user(const char *src, char *dst, +static __inline__ unsigned int direct_csum_and_copy_to_user(const char *src, char *dst, int len, int sum, int *err_ptr) { if (access_ok(VERIFY_WRITE, dst, len)) - return csum_partial_copy_generic(src, dst, len, sum, NULL, err_ptr); + return direct_csum_partial_copy_generic(src, dst, len, sum, NULL, err_ptr); + + if (len) + *err_ptr = -EFAULT; + + return -1; /* invalid checksum */ +} + +static __inline__ unsigned int csum_and_copy_to_user(const char *src, char *dst, + int len, int sum, int *err_ptr) +{ + if (access_ok(VERIFY_WRITE, dst, len)) { + if (copy_to_user(dst, src, len)) + *err_ptr = -EFAULT; + return csum_partial(src, len, sum); + } if (len) *err_ptr = -EFAULT; --- diff/include/asm-i386/desc.h 2003-10-09 09:47:17.000000000 +0100 +++ source/include/asm-i386/desc.h 2003-11-26 10:09:07.000000000 +0000 @@ -21,6 +21,13 @@ extern struct Xgt_desc_struct idt_descr, cpu_gdt_descr[NR_CPUS]; +extern void trap_init_virtual_IDT(void); +extern void trap_init_virtual_GDT(void); + +asmlinkage int system_call(void); +asmlinkage void lcall7(void); +asmlinkage void lcall27(void); + #define load_TR_desc() __asm__ __volatile__("ltr %%ax"::"a" (GDT_ENTRY_TSS*8)) #define load_LDT_desc() __asm__ __volatile__("lldt %%ax"::"a" (GDT_ENTRY_LDT*8)) @@ -30,6 +37,7 @@ */ extern struct desc_struct default_ldt[]; extern void set_intr_gate(unsigned int irq, void * addr); +extern void set_trap_gate(unsigned int n, void *addr); #define _set_tssldt_desc(n,addr,limit,type) \ 
__asm__ __volatile__ ("movw %w3,0(%2)\n\t" \ @@ -90,31 +98,8 @@ #undef C } -static inline void clear_LDT(void) -{ - int cpu = get_cpu(); - - set_ldt_desc(cpu, &default_ldt[0], 5); - load_LDT_desc(); - put_cpu(); -} - -/* - * load one particular LDT into the current CPU - */ -static inline void load_LDT_nolock(mm_context_t *pc, int cpu) -{ - void *segments = pc->ldt; - int count = pc->size; - - if (likely(!count)) { - segments = &default_ldt[0]; - count = 5; - } - - set_ldt_desc(cpu, segments, count); - load_LDT_desc(); -} +extern struct page *default_ldt_page; +extern void load_LDT_nolock(mm_context_t *pc, int cpu); static inline void load_LDT(mm_context_t *pc) { @@ -123,6 +108,6 @@ put_cpu(); } -#endif /* !__ASSEMBLY__ */ +#endif /* !__ASSEMBLY__ */ #endif --- diff/include/asm-i386/fixmap.h 2003-10-09 09:47:17.000000000 +0100 +++ source/include/asm-i386/fixmap.h 2003-11-26 10:09:07.000000000 +0000 @@ -18,17 +18,15 @@ #include <asm/acpi.h> #include <asm/apicdef.h> #include <asm/page.h> -#ifdef CONFIG_HIGHMEM #include <linux/threads.h> #include <asm/kmap_types.h> -#endif /* * Here we define all the compile-time 'special' virtual * addresses. The point is to have a constant address at * compile time, but to set the physical address only - * in the boot process. We allocate these special addresses - * from the end of virtual memory (0xfffff000) backwards. + * in the boot process. We allocate these special addresses + * from the end of virtual memory (0xffffe000) backwards. * Also this lets us do fail-safe vmalloc(), we * can guarantee that these special addresses and * vmalloc()-ed addresses never overlap. @@ -41,11 +39,20 @@ * TLB entries of such buffers will not be flushed across * task switches. */ + +/* + * on UP currently we will have no trace of the fixmap mechanism, + * no page table allocations, etc. This might change in the + * future, say framebuffers for the console driver(s) could be + * fix-mapped? + */ enum fixed_addresses { FIX_HOLE, FIX_VSYSCALL, #ifdef CONFIG_X86_LOCAL_APIC FIX_APIC_BASE, /* local (CPU) APIC -- required for SMP or not */ +#else + FIX_VSTACK_HOLE_1, #endif #ifdef CONFIG_X86_IO_APIC FIX_IO_APIC_BASE_0, @@ -57,16 +64,21 @@ FIX_LI_PCIA, /* Lithium PCI Bridge A */ FIX_LI_PCIB, /* Lithium PCI Bridge B */ #endif -#ifdef CONFIG_X86_F00F_BUG - FIX_F00F_IDT, /* Virtual mapping for IDT */ -#endif + FIX_IDT, + FIX_GDT_1, + FIX_GDT_0, + FIX_TSS_3, + FIX_TSS_2, + FIX_TSS_1, + FIX_TSS_0, + FIX_ENTRY_TRAMPOLINE_1, + FIX_ENTRY_TRAMPOLINE_0, #ifdef CONFIG_X86_CYCLONE_TIMER FIX_CYCLONE_TIMER, /*cyclone timer register*/ + FIX_VSTACK_HOLE_2, #endif -#ifdef CONFIG_HIGHMEM FIX_KMAP_BEGIN, /* reserved pte's for temporary kernel mappings */ FIX_KMAP_END = FIX_KMAP_BEGIN+(KM_TYPE_NR*NR_CPUS)-1, -#endif #ifdef CONFIG_ACPI_BOOT FIX_ACPI_BEGIN, FIX_ACPI_END = FIX_ACPI_BEGIN + FIX_ACPI_PAGES - 1, @@ -95,12 +107,15 @@ __set_fixmap(idx, 0, __pgprot(0)) /* - * used by vmalloc.c. + * used by vmalloc.c and various other places. * * Leave one empty page between vmalloc'ed areas and * the start of the fixmap. + + * IMPORTANT: don't change FIXADDR_TOP without adjusting KM_VSTACK0 + * and KM_VSTACK1 so that the virtual stack is 8K aligned. 
*/ -#define FIXADDR_TOP (0xfffff000UL) +#define FIXADDR_TOP (0xffffe000UL) #define __FIXADDR_SIZE (__end_of_permanent_fixed_addresses << PAGE_SHIFT) #define FIXADDR_START (FIXADDR_TOP - __FIXADDR_SIZE) --- diff/include/asm-i386/highmem.h 2003-10-09 09:47:34.000000000 +0100 +++ source/include/asm-i386/highmem.h 2003-11-26 10:09:07.000000000 +0000 @@ -25,26 +25,19 @@ #include <linux/threads.h> #include <asm/kmap_types.h> #include <asm/tlbflush.h> +#include <asm/atomic_kmap.h> /* declarations for highmem.c */ extern unsigned long highstart_pfn, highend_pfn; -extern pte_t *kmap_pte; -extern pgprot_t kmap_prot; extern pte_t *pkmap_page_table; - -extern void kmap_init(void); +extern void kmap_init(void) __init; /* * Right now we initialize only a single pte table. It can be extended * easily, subsequent pte tables have to be allocated in one physical * chunk of RAM. */ -#if NR_CPUS <= 32 -#define PKMAP_BASE (0xff800000UL) -#else -#define PKMAP_BASE (0xff600000UL) -#endif #ifdef CONFIG_X86_PAE #define LAST_PKMAP 512 #else --- diff/include/asm-i386/hw_irq.h 2003-10-27 09:20:44.000000000 +0000 +++ source/include/asm-i386/hw_irq.h 2003-11-26 10:09:07.000000000 +0000 @@ -41,6 +41,7 @@ asmlinkage void error_interrupt(void); asmlinkage void spurious_interrupt(void); asmlinkage void thermal_interrupt(struct pt_regs); +#define platform_legacy_irq(irq) ((irq) < 16) #endif void mask_irq(unsigned int irq); --- diff/include/asm-i386/io_apic.h 2003-08-26 10:00:54.000000000 +0100 +++ source/include/asm-i386/io_apic.h 2003-11-26 10:09:07.000000000 +0000 @@ -13,6 +13,46 @@ #ifdef CONFIG_X86_IO_APIC +#ifdef CONFIG_PCI_USE_VECTOR +static inline int use_pci_vector(void) {return 1;} +static inline void disable_edge_ioapic_vector(unsigned int vector) { } +static inline void mask_and_ack_level_ioapic_vector(unsigned int vector) { } +static inline void end_edge_ioapic_vector (unsigned int vector) { } +#define startup_level_ioapic startup_level_ioapic_vector +#define shutdown_level_ioapic mask_IO_APIC_vector +#define enable_level_ioapic unmask_IO_APIC_vector +#define disable_level_ioapic mask_IO_APIC_vector +#define mask_and_ack_level_ioapic mask_and_ack_level_ioapic_vector +#define end_level_ioapic end_level_ioapic_vector +#define set_ioapic_affinity set_ioapic_affinity_vector + +#define startup_edge_ioapic startup_edge_ioapic_vector +#define shutdown_edge_ioapic disable_edge_ioapic_vector +#define enable_edge_ioapic unmask_IO_APIC_vector +#define disable_edge_ioapic disable_edge_ioapic_vector +#define ack_edge_ioapic ack_edge_ioapic_vector +#define end_edge_ioapic end_edge_ioapic_vector +#else +static inline int use_pci_vector(void) {return 0;} +static inline void disable_edge_ioapic_irq(unsigned int irq) { } +static inline void mask_and_ack_level_ioapic_irq(unsigned int irq) { } +static inline void end_edge_ioapic_irq (unsigned int irq) { } +#define startup_level_ioapic startup_level_ioapic_irq +#define shutdown_level_ioapic mask_IO_APIC_irq +#define enable_level_ioapic unmask_IO_APIC_irq +#define disable_level_ioapic mask_IO_APIC_irq +#define mask_and_ack_level_ioapic mask_and_ack_level_ioapic_irq +#define end_level_ioapic end_level_ioapic_irq +#define set_ioapic_affinity set_ioapic_affinity_irq + +#define startup_edge_ioapic startup_edge_ioapic_irq +#define shutdown_edge_ioapic disable_edge_ioapic_irq +#define enable_edge_ioapic unmask_IO_APIC_irq +#define disable_edge_ioapic disable_edge_ioapic_irq +#define ack_edge_ioapic ack_edge_ioapic_irq +#define end_edge_ioapic end_edge_ioapic_irq +#endif + #define 
APIC_MISMATCH_DEBUG #define IO_APIC_BASE(idx) \ @@ -177,4 +217,6 @@ #define io_apic_assign_pci_irqs 0 #endif +extern int assign_irq_vector(int irq); + #endif --- diff/include/asm-i386/kmap_types.h 2003-10-09 09:47:17.000000000 +0100 +++ source/include/asm-i386/kmap_types.h 2003-11-26 10:09:07.000000000 +0000 @@ -3,30 +3,36 @@ #include <linux/config.h> -#ifdef CONFIG_DEBUG_HIGHMEM -# define D(n) __KM_FENCE_##n , -#else -# define D(n) -#endif - enum km_type { -D(0) KM_BOUNCE_READ, -D(1) KM_SKB_SUNRPC_DATA, -D(2) KM_SKB_DATA_SOFTIRQ, -D(3) KM_USER0, -D(4) KM_USER1, -D(5) KM_BIO_SRC_IRQ, -D(6) KM_BIO_DST_IRQ, -D(7) KM_PTE0, -D(8) KM_PTE1, -D(9) KM_PTE2, -D(10) KM_IRQ0, -D(11) KM_IRQ1, -D(12) KM_SOFTIRQ0, -D(13) KM_SOFTIRQ1, -D(14) KM_TYPE_NR -}; - -#undef D + /* + * IMPORTANT: don't move these 3 entries, and only add entries in + * pairs: the 4G/4G virtual stack must be 8K aligned on each cpu. + */ + KM_BOUNCE_READ, + KM_VSTACK1, + KM_VSTACK0, + KM_LDT_PAGE15, + KM_LDT_PAGE0 = KM_LDT_PAGE15 + 16-1, + KM_USER_COPY, + KM_VSTACK_HOLE, + KM_SKB_SUNRPC_DATA, + KM_SKB_DATA_SOFTIRQ, + KM_USER0, + KM_USER1, + KM_BIO_SRC_IRQ, + KM_BIO_DST_IRQ, + KM_PTE0, + KM_PTE1, + KM_PTE2, + KM_IRQ0, + KM_IRQ1, + KM_SOFTIRQ0, + KM_SOFTIRQ1, + /* + * Add new entries in pairs: + * the 4G/4G virtual stack must be 8K aligned on each cpu. + */ + KM_TYPE_NR +}; #endif --- diff/include/asm-i386/mach-default/irq_vectors.h 2003-10-27 09:20:44.000000000 +0000 +++ source/include/asm-i386/mach-default/irq_vectors.h 2003-11-26 10:09:07.000000000 +0000 @@ -76,6 +76,18 @@ * Since vectors 0x00-0x1f are used/reserved for the CPU, * the usable vector space is 0x20-0xff (224 vectors) */ + +/* + * The maximum number of vectors supported by i386 processors + * is limited to 256. For processors other than i386, NR_VECTORS + * should be changed accordingly. + */ +#define NR_VECTORS 256 + +#ifdef CONFIG_PCI_USE_VECTOR +#define NR_IRQS FIRST_SYSTEM_VECTOR +#define NR_IRQ_VECTORS NR_IRQS +#else #ifdef CONFIG_X86_IO_APIC #define NR_IRQS 224 # if (224 >= 32 * NR_CPUS) @@ -87,6 +99,7 @@ #define NR_IRQS 16 #define NR_IRQ_VECTORS NR_IRQS #endif +#endif #define FPU_IRQ 13 --- diff/include/asm-i386/mach-default/mach_apic.h 2003-09-30 15:46:19.000000000 +0100 +++ source/include/asm-i386/mach-default/mach_apic.h 2003-11-26 10:09:07.000000000 +0000 @@ -5,12 +5,12 @@ #define APIC_DFR_VALUE (APIC_DFR_FLAT) -static inline cpumask_t target_cpus(void) +static inline cpumask_const_t target_cpus(void) { #ifdef CONFIG_SMP - return cpu_online_map; + return mk_cpumask_const(cpu_online_map); #else - return cpumask_of_cpu(0); + return mk_cpumask_const(cpumask_of_cpu(0)); #endif } #define TARGET_CPUS (target_cpus()) --- diff/include/asm-i386/mmu.h 2003-10-09 09:47:17.000000000 +0100 +++ source/include/asm-i386/mmu.h 2003-11-26 10:09:07.000000000 +0000 @@ -8,10 +8,13 @@ * * cpu_vm_mask is used to optimize ldt flushing. 
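 *
 * (Illustrative sketch only, assuming the stock 2.6 kmap_atomic()
 * interface: with the page-array representation below, touching one
 * page of a task's LDT becomes something like
 *
 *	char *v = kmap_atomic(mm->context.ldt_pages[i], KM_LDT_PAGE0 - i);
 *	... up to PAGE_SIZE/8 descriptors of 8 bytes each ...
 *	kunmap_atomic(v, KM_LDT_PAGE0 - i);
 *
 * the sixteen KM_LDT_PAGE* slots reserved in kmap_types.h earlier in
 * this patch exist so that all MAX_LDT_PAGES pages can be mapped at
 * once; the slot arithmetic here is a guess at the convention, not a
 * quote from the patch.)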
*/ + +#define MAX_LDT_PAGES 16 + typedef struct { int size; struct semaphore sem; - void *ldt; + struct page *ldt_pages[MAX_LDT_PAGES]; } mm_context_t; #endif --- diff/include/asm-i386/mmu_context.h 2003-10-09 09:47:17.000000000 +0100 +++ source/include/asm-i386/mmu_context.h 2003-11-26 10:09:07.000000000 +0000 @@ -29,6 +29,10 @@ { int cpu = smp_processor_id(); +#ifdef CONFIG_X86_SWITCH_PAGETABLES + if (tsk->mm) + tsk->thread_info->user_pgd = (void *)__pa(tsk->mm->pgd); +#endif if (likely(prev != next)) { /* stop flush ipis for the previous mm */ cpu_clear(cpu, prev->cpu_vm_mask); @@ -39,12 +43,14 @@ cpu_set(cpu, next->cpu_vm_mask); /* Re-load page tables */ +#if !defined(CONFIG_X86_SWITCH_PAGETABLES) load_cr3(next->pgd); +#endif /* * load the LDT, if the LDT is different: */ - if (unlikely(prev->context.ldt != next->context.ldt)) + if (unlikely(prev->context.size + next->context.size)) load_LDT_nolock(&next->context, cpu); } #ifdef CONFIG_SMP @@ -56,7 +62,9 @@ /* We were in lazy tlb mode and leave_mm disabled * tlb flush IPI delivery. We must reload %cr3. */ +#if !defined(CONFIG_X86_SWITCH_PAGETABLES) load_cr3(next->pgd); +#endif load_LDT_nolock(&next->context, cpu); } } @@ -67,6 +75,6 @@ asm("movl %0,%%fs ; movl %0,%%gs": :"r" (0)) #define activate_mm(prev, next) \ - switch_mm((prev),(next),NULL) + switch_mm((prev),(next),current) #endif --- diff/include/asm-i386/page.h 2003-10-09 09:47:17.000000000 +0100 +++ source/include/asm-i386/page.h 2003-11-26 10:09:07.000000000 +0000 @@ -1,6 +1,8 @@ #ifndef _I386_PAGE_H #define _I386_PAGE_H +#include <linux/config.h> + /* PAGE_SHIFT determines the page size */ #define PAGE_SHIFT 12 #define PAGE_SIZE (1UL << PAGE_SHIFT) @@ -9,11 +11,10 @@ #define LARGE_PAGE_MASK (~(LARGE_PAGE_SIZE-1)) #define LARGE_PAGE_SIZE (1UL << PMD_SHIFT) -#ifdef __KERNEL__ -#ifndef __ASSEMBLY__ - #include <linux/config.h> +#ifdef __KERNEL__ +#ifndef __ASSEMBLY__ #ifdef CONFIG_X86_USE_3DNOW #include <asm/mmx.h> @@ -88,8 +89,19 @@ * * If you want more physical memory than this then see the CONFIG_HIGHMEM4G * and CONFIG_HIGHMEM64G options in the kernel configuration. + * + * Note: on PAE the kernel must never go below 32 MB, we use the + * first 8 entries of the 2-level boot pgd for PAE magic. */ +#ifdef CONFIG_X86_4G_VM_LAYOUT +#define __PAGE_OFFSET (0x02000000) +#define TASK_SIZE (0xff000000) +#else +#define __PAGE_OFFSET (0xc0000000) +#define TASK_SIZE (0xc0000000) +#endif + /* * This much address space is reserved for vmalloc() and iomap() * as well as fixmap mappings. 
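For a feel for the numbers above (an illustration, not part of the patch): with CONFIG_X86_4G_VM_LAYOUT the kernel's direct mapping begins at 32 MB rather than the usual 3 GB, and user space runs up to 0xff000000. A small stand-alone program, using only the constants quoted in the hunk above, shows the resulting __pa()/__va() arithmetic:

	#include <stdio.h>

	/* standalone copies of the constants from the hunk above (4G/4G case) */
	#define PAGE_OFFSET_4G	0x02000000UL	/* direct map starts at 32 MB */
	#define TASK_SIZE_4G	0xff000000UL	/* user space ends just below 4 GB */

	int main(void)
	{
		unsigned long kva = 0x02100000UL;	/* made-up kernel virtual address */

		/* __pa(x) is (x) - PAGE_OFFSET; __va() is the inverse */
		printf("__pa(%#lx) = %#lx\n", kva, kva - PAGE_OFFSET_4G);
		printf("user address space: %lu MB\n", TASK_SIZE_4G >> 20);
		return 0;
	}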
@@ -114,16 +126,10 @@ #endif /* __ASSEMBLY__ */ -#ifdef __ASSEMBLY__ -#define __PAGE_OFFSET (0xC0000000) -#else -#define __PAGE_OFFSET (0xC0000000UL) -#endif - - #define PAGE_OFFSET ((unsigned long)__PAGE_OFFSET) #define VMALLOC_RESERVE ((unsigned long)__VMALLOC_RESERVE) -#define MAXMEM (-__PAGE_OFFSET-__VMALLOC_RESERVE) +#define __MAXMEM (-__PAGE_OFFSET-__VMALLOC_RESERVE) +#define MAXMEM ((unsigned long)(-PAGE_OFFSET-VMALLOC_RESERVE)) #define __pa(x) ((unsigned long)(x)-PAGE_OFFSET) #define __va(x) ((void *)((unsigned long)(x)+PAGE_OFFSET)) #define pfn_to_kaddr(pfn) __va((pfn) << PAGE_SHIFT) --- diff/include/asm-i386/pgtable.h 2003-10-09 09:47:34.000000000 +0100 +++ source/include/asm-i386/pgtable.h 2003-11-26 10:09:07.000000000 +0000 @@ -32,16 +32,17 @@ #define ZERO_PAGE(vaddr) (virt_to_page(empty_zero_page)) extern unsigned long empty_zero_page[1024]; extern pgd_t swapper_pg_dir[1024]; -extern kmem_cache_t *pgd_cache; -extern kmem_cache_t *pmd_cache; +extern kmem_cache_t *pgd_cache, *pmd_cache, *kpmd_cache; extern spinlock_t pgd_lock; extern struct list_head pgd_list; void pmd_ctor(void *, kmem_cache_t *, unsigned long); +void kpmd_ctor(void *, kmem_cache_t *, unsigned long); void pgd_ctor(void *, kmem_cache_t *, unsigned long); void pgd_dtor(void *, kmem_cache_t *, unsigned long); void pgtable_cache_init(void); void paging_init(void); +void setup_identity_mappings(pgd_t *pgd_base, unsigned long start, unsigned long end); #endif /* !__ASSEMBLY__ */ @@ -51,6 +52,11 @@ * newer 3-level PAE-mode page tables. */ #ifndef __ASSEMBLY__ + +extern void set_system_gate(unsigned int n, void *addr); +extern void init_entry_mappings(void); +extern void entry_trampoline_setup(void); + #ifdef CONFIG_X86_PAE # include <asm/pgtable-3level.h> #else @@ -63,7 +69,12 @@ #define PGDIR_SIZE (1UL << PGDIR_SHIFT) #define PGDIR_MASK (~(PGDIR_SIZE-1)) -#define USER_PTRS_PER_PGD (TASK_SIZE/PGDIR_SIZE) +#if defined(CONFIG_X86_PAE) && defined(CONFIG_X86_4G_VM_LAYOUT) +# define USER_PTRS_PER_PGD 4 +#else +# define USER_PTRS_PER_PGD ((TASK_SIZE/PGDIR_SIZE) + ((TASK_SIZE % PGDIR_SIZE) + PGDIR_SIZE-1)/PGDIR_SIZE) +#endif + #define FIRST_USER_PGD_NR 0 #define USER_PGD_PTRS (PAGE_OFFSET >> PGDIR_SHIFT) @@ -233,6 +244,7 @@ #define mk_pte(page, pgprot) pfn_pte(page_to_pfn(page), (pgprot)) #define mk_pte_huge(entry) ((entry).pte_low |= _PAGE_PRESENT | _PAGE_PSE) +#define mk_pte_phys(physpage, pgprot) pfn_pte((physpage) >> PAGE_SHIFT, pgprot) static inline pte_t pte_modify(pte_t pte, pgprot_t newprot) { --- diff/include/asm-i386/processor.h 2003-10-27 09:20:39.000000000 +0000 +++ source/include/asm-i386/processor.h 2003-11-26 10:09:07.000000000 +0000 @@ -291,11 +291,6 @@ extern unsigned int BIOS_revision; extern unsigned int mca_pentium_flag; -/* - * User space process size: 3GB (default). - */ -#define TASK_SIZE (PAGE_OFFSET) - /* This decides where the kernel will search for a free chunk of vm * space during mmap's. */ @@ -406,7 +401,9 @@ struct thread_struct { /* cached TLS descriptors. */ struct desc_struct tls_array[GDT_ENTRY_TLS_ENTRIES]; + void *stack_page0, *stack_page1; unsigned long esp0; + unsigned long sysenter_cs; unsigned long eip; unsigned long esp; unsigned long fs; @@ -428,6 +425,7 @@ #define INIT_THREAD { \ .vm86_info = NULL, \ + .sysenter_cs = __KERNEL_CS, \ .io_bitmap_ptr = NULL, \ } @@ -447,21 +445,14 @@ .io_bitmap = { [ 0 ... 
IO_BITMAP_LONGS] = ~0 }, \ } -static inline void load_esp0(struct tss_struct *tss, unsigned long esp0) +static inline void +load_esp0(struct tss_struct *tss, struct thread_struct *thread) { - tss->esp0 = esp0; + tss->esp0 = thread->esp0; /* This can only happen when SEP is enabled, no need to test "SEP"arately */ - if ((unlikely(tss->ss1 != __KERNEL_CS))) { - tss->ss1 = __KERNEL_CS; - wrmsr(MSR_IA32_SYSENTER_CS, __KERNEL_CS, 0); - } -} - -static inline void disable_sysenter(struct tss_struct *tss) -{ - if (cpu_has_sep) { - tss->ss1 = 0; - wrmsr(MSR_IA32_SYSENTER_CS, 0, 0); + if (unlikely(tss->ss1 != thread->sysenter_cs)) { + tss->ss1 = thread->sysenter_cs; + wrmsr(MSR_IA32_SYSENTER_CS, thread->sysenter_cs, 0); } } @@ -491,6 +482,23 @@ */ extern int kernel_thread(int (*fn)(void *), void * arg, unsigned long flags); +#ifdef CONFIG_X86_HIGH_ENTRY +#define virtual_esp0(tsk) \ + ((unsigned long)(tsk)->thread_info->virtual_stack + ((tsk)->thread.esp0 - (unsigned long)(tsk)->thread_info->real_stack)) +#else +# define virtual_esp0(tsk) ((tsk)->thread.esp0) +#endif + +#define load_virtual_esp0(tss, task) \ + do { \ + tss->esp0 = virtual_esp0(task); \ + if (unlikely(tss->ss1 != task->thread.sysenter_cs)) { \ + tss->ss1 = task->thread.sysenter_cs; \ + wrmsr(MSR_IA32_SYSENTER_CS, \ + task->thread.sysenter_cs, 0); \ + } \ + } while (0) + extern unsigned long thread_saved_pc(struct task_struct *tsk); void show_trace(struct task_struct *task, unsigned long *stack); --- diff/include/asm-i386/rwlock.h 2003-10-09 09:47:17.000000000 +0100 +++ source/include/asm-i386/rwlock.h 2003-11-26 10:09:07.000000000 +0000 @@ -20,28 +20,52 @@ #define RW_LOCK_BIAS 0x01000000 #define RW_LOCK_BIAS_STR "0x01000000" -#define __build_read_lock_ptr(rw, helper) \ - asm volatile(LOCK "subl $1,(%0)\n\t" \ - "js 2f\n" \ - "1:\n" \ - LOCK_SECTION_START("") \ - "2:\tcall " helper "\n\t" \ - "jmp 1b\n" \ - LOCK_SECTION_END \ - ::"a" (rw) : "memory") - -#define __build_read_lock_const(rw, helper) \ - asm volatile(LOCK "subl $1,%0\n\t" \ - "js 2f\n" \ - "1:\n" \ - LOCK_SECTION_START("") \ - "2:\tpushl %%eax\n\t" \ - "leal %0,%%eax\n\t" \ - "call " helper "\n\t" \ - "popl %%eax\n\t" \ - "jmp 1b\n" \ - LOCK_SECTION_END \ - :"=m" (*(volatile int *)rw) : : "memory") +#ifdef CONFIG_SPINLINE + + #define __build_read_lock_ptr(rw, helper) \ + asm volatile(LOCK "subl $1,(%0)\n\t" \ + "jns 1f\n\t" \ + "call " helper "\n\t" \ + "1:\t" \ + ::"a" (rw) : "memory") + + #define __build_read_lock_const(rw, helper) \ + asm volatile(LOCK "subl $1,%0\n\t" \ + "jns 1f\n\t" \ + "pushl %%eax\n\t" \ + "leal %0,%%eax\n\t" \ + "call " helper "\n\t" \ + "popl %%eax\n\t" \ + "1:\t" \ + :"=m" (*(volatile int *)rw) : : "memory") + +#else /* !CONFIG_SPINLINE */ + + #define __build_read_lock_ptr(rw, helper) \ + asm volatile(LOCK "subl $1,(%0)\n\t" \ + "js 2f\n" \ + "1:\n" \ + LOCK_SECTION_START("") \ + "2:\tcall " helper "\n\t" \ + "jmp 1b\n" \ + LOCK_SECTION_END \ + ::"a" (rw) : "memory") + + #define __build_read_lock_const(rw, helper) \ + asm volatile(LOCK "subl $1,%0\n\t" \ + "js 2f\n" \ + "1:\n" \ + LOCK_SECTION_START("") \ + "2:\tpushl %%eax\n\t" \ + "leal %0,%%eax\n\t" \ + "call " helper "\n\t" \ + "popl %%eax\n\t" \ + "jmp 1b\n" \ + LOCK_SECTION_END \ + :"=m" (*(volatile int *)rw) : : "memory") + +#endif /* CONFIG_SPINLINE */ + #define __build_read_lock(rw, helper) do { \ if (__builtin_constant_p(rw)) \ @@ -50,28 +74,51 @@ __build_read_lock_ptr(rw, helper); \ } while (0) -#define __build_write_lock_ptr(rw, helper) \ - asm volatile(LOCK "subl $" RW_LOCK_BIAS_STR 
",(%0)\n\t" \ - "jnz 2f\n" \ - "1:\n" \ - LOCK_SECTION_START("") \ - "2:\tcall " helper "\n\t" \ - "jmp 1b\n" \ - LOCK_SECTION_END \ - ::"a" (rw) : "memory") - -#define __build_write_lock_const(rw, helper) \ - asm volatile(LOCK "subl $" RW_LOCK_BIAS_STR ",%0\n\t" \ - "jnz 2f\n" \ - "1:\n" \ - LOCK_SECTION_START("") \ - "2:\tpushl %%eax\n\t" \ - "leal %0,%%eax\n\t" \ - "call " helper "\n\t" \ - "popl %%eax\n\t" \ - "jmp 1b\n" \ - LOCK_SECTION_END \ - :"=m" (*(volatile int *)rw) : : "memory") +#ifdef CONFIG_SPINLINE + + #define __build_write_lock_ptr(rw, helper) \ + asm volatile(LOCK "subl $" RW_LOCK_BIAS_STR ",(%0)\n\t" \ + "jz 1f\n\t" \ + "call " helper "\n\t" \ + "1:\n" \ + ::"a" (rw) : "memory") + + #define __build_write_lock_const(rw, helper) \ + asm volatile(LOCK "subl $" RW_LOCK_BIAS_STR ",%0\n\t" \ + "jz 1f\n\t" \ + "pushl %%eax\n\t" \ + "leal %0,%%eax\n\t" \ + "call " helper "\n\t" \ + "popl %%eax\n\t" \ + "1:\n" \ + :"=m" (*(volatile int *)rw) : : "memory") + +#else /* !CONFIG_SPINLINE */ + + #define __build_write_lock_ptr(rw, helper) \ + asm volatile(LOCK "subl $" RW_LOCK_BIAS_STR ",(%0)\n\t" \ + "jnz 2f\n" \ + "1:\n" \ + LOCK_SECTION_START("") \ + "2:\tcall " helper "\n\t" \ + "jmp 1b\n" \ + LOCK_SECTION_END \ + ::"a" (rw) : "memory") + + #define __build_write_lock_const(rw, helper) \ + asm volatile(LOCK "subl $" RW_LOCK_BIAS_STR ",%0\n\t" \ + "jnz 2f\n" \ + "1:\n" \ + LOCK_SECTION_START("") \ + "2:\tpushl %%eax\n\t" \ + "leal %0,%%eax\n\t" \ + "call " helper "\n\t" \ + "popl %%eax\n\t" \ + "jmp 1b\n" \ + LOCK_SECTION_END \ + :"=m" (*(volatile int *)rw) : : "memory") + +#endif /* CONFIG_SPINLINE */ #define __build_write_lock(rw, helper) do { \ if (__builtin_constant_p(rw)) \ --- diff/include/asm-i386/setup.h 2003-09-30 15:46:19.000000000 +0100 +++ source/include/asm-i386/setup.h 2003-11-26 10:09:07.000000000 +0000 @@ -29,6 +29,11 @@ #define IST_INFO (*(struct ist_info *) (PARAM+0x60)) #define DRIVE_INFO (*(struct drive_info_struct *) (PARAM+0x80)) #define SYS_DESC_TABLE (*(struct sys_desc_table_struct*)(PARAM+0xa0)) +#define EFI_SYSTAB ((efi_system_table_t *) *((unsigned long *)(PARAM+0x1c4))) +#define EFI_MEMDESC_SIZE (*((unsigned long *) (PARAM+0x1c8))) +#define EFI_MEMDESC_VERSION (*((unsigned long *) (PARAM+0x1cc))) +#define EFI_MEMMAP ((efi_memory_desc_t *) *((unsigned long *)(PARAM+0x1d0))) +#define EFI_MEMMAP_SIZE (*((unsigned long *) (PARAM+0x1d4))) #define MOUNT_ROOT_RDONLY (*(unsigned short *) (PARAM+0x1F2)) #define RAMDISK_FLAGS (*(unsigned short *) (PARAM+0x1F8)) #define VIDEO_MODE (*(unsigned short *) (PARAM+0x1FA)) --- diff/include/asm-i386/spinlock.h 2003-10-09 09:47:17.000000000 +0100 +++ source/include/asm-i386/spinlock.h 2003-11-26 10:09:07.000000000 +0000 @@ -43,18 +43,35 @@ #define spin_is_locked(x) (*(volatile signed char *)(&(x)->lock) <= 0) #define spin_unlock_wait(x) do { barrier(); } while(spin_is_locked(x)) -#define spin_lock_string \ - "\n1:\t" \ - "lock ; decb %0\n\t" \ - "js 2f\n" \ - LOCK_SECTION_START("") \ - "2:\t" \ - "rep;nop\n\t" \ - "cmpb $0,%0\n\t" \ - "jle 2b\n\t" \ - "jmp 1b\n" \ - LOCK_SECTION_END +#ifdef CONFIG_SPINLINE + #define spin_lock_string \ + "\n1:\t" \ + "lock ; decb %0\n\t" \ + "js 2f\n" \ + "jmp 3f\n" \ + "2:\t" \ + "rep;nop\n\t" \ + "cmpb $0,%0\n\t" \ + "jle 2b\n\t" \ + "jmp 1b\n" \ + "3:\t" + +#else /* !CONFIG_SPINLINE */ + + #define spin_lock_string \ + "\n1:\t" \ + "lock ; decb %0\n\t" \ + "js 2f\n" \ + LOCK_SECTION_START("") \ + "2:\t" \ + "rep;nop\n\t" \ + "cmpb $0,%0\n\t" \ + "jle 2b\n\t" \ + "jmp 1b\n" \ + 
LOCK_SECTION_END + +#endif /* CONFIG_SPINLINE */ /* * This works. Despite all the confusion. * (except on PPro SMP or if we are using OOSTORE) @@ -138,6 +155,11 @@ */ typedef struct { volatile unsigned int lock; +#ifdef CONFIG_LOCKMETER + /* required for LOCKMETER since all bits in lock are used */ + /* and we need this storage for CPU and lock INDEX */ + unsigned lockmeter_magic; +#endif #ifdef CONFIG_DEBUG_SPINLOCK unsigned magic; #endif @@ -145,11 +167,19 @@ #define RWLOCK_MAGIC 0xdeaf1eed +#ifdef CONFIG_LOCKMETER +#ifdef CONFIG_DEBUG_SPINLOCK +#define RWLOCK_MAGIC_INIT , 0, RWLOCK_MAGIC +#else +#define RWLOCK_MAGIC_INIT , 0 +#endif +#else /* !CONFIG_LOCKMETER */ #ifdef CONFIG_DEBUG_SPINLOCK #define RWLOCK_MAGIC_INIT , RWLOCK_MAGIC #else #define RWLOCK_MAGIC_INIT /* */ #endif +#endif /* !CONFIG_LOCKMETER */ #define RW_LOCK_UNLOCKED (rwlock_t) { RW_LOCK_BIAS RWLOCK_MAGIC_INIT } @@ -196,4 +226,60 @@ return 0; } +#ifdef CONFIG_LOCKMETER +static inline int _raw_read_trylock(rwlock_t *lock) +{ +/* FIXME -- replace with assembler */ + atomic_t *count = (atomic_t *)lock; + atomic_dec(count); + if (count->counter > 0) + return 1; + atomic_inc(count); + return 0; +} +#endif + +#if defined(CONFIG_LOCKMETER) && defined(CONFIG_HAVE_DEC_LOCK) +extern void _metered_spin_lock (spinlock_t *lock); +extern void _metered_spin_unlock(spinlock_t *lock); + +/* + * Matches what is in arch/i386/lib/dec_and_lock.c, except this one is + * "static inline" so that the spin_lock(), if actually invoked, is charged + * against the real caller, not against the catch-all atomic_dec_and_lock + */ +static inline int atomic_dec_and_lock(atomic_t *atomic, spinlock_t *lock) +{ + int counter; + int newcount; + +repeat: + counter = atomic_read(atomic); + newcount = counter-1; + + if (!newcount) + goto slow_path; + + asm volatile("lock; cmpxchgl %1,%2" + :"=a" (newcount) + :"r" (newcount), "m" (atomic->counter), "0" (counter)); + + /* If the above failed, "eax" will have changed */ + if (newcount != counter) + goto repeat; + return 0; + +slow_path: + preempt_disable(); + _metered_spin_lock(lock); + if (atomic_dec_and_test(atomic)) + return 1; + _metered_spin_unlock(lock); + preempt_enable(); + return 0; +} + +#define ATOMIC_DEC_AND_LOCK +#endif + #endif /* __ASM_SPINLOCK_H */ --- diff/include/asm-i386/string.h 2003-10-09 09:47:17.000000000 +0100 +++ source/include/asm-i386/string.h 2003-11-26 10:09:07.000000000 +0000 @@ -56,6 +56,29 @@ return dest; } +/* + * This is a more generic variant of strncpy_count() suitable for + * implementing string-access routines with all sorts of return + * code semantics. It's used by mm/usercopy.c. 
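+ *
+ * (A worked illustration of the convention implemented by the assembly
+ * below, not in the original: the count is decremented once per byte
+ * stored and incremented once on exit, so for
+ *
+ *	n = strncpy_count(dst, src, size);
+ *
+ * n == 0 means dst was exhausted before the NUL was stored, and
+ * n > 0 means the NUL made it, with size - n + 1 bytes stored in
+ * total; e.g. size 4 and src "hi" stores 'h', 'i', '\0' and
+ * returns 2.)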
+ */ +static inline size_t strncpy_count(char * dest,const char *src,size_t count) +{ + __asm__ __volatile__( + + "1:\tdecl %0\n\t" + "js 2f\n\t" + "lodsb\n\t" + "stosb\n\t" + "testb %%al,%%al\n\t" + "jne 1b\n\t" + "2:" + "incl %0" + : "=c" (count) + :"S" (src),"D" (dest),"0" (count) : "memory"); + + return count; +} + #define __HAVE_ARCH_STRCAT static inline char * strcat(char * dest,const char * src) { @@ -299,14 +322,9 @@ static inline void * memmove(void * dest,const void * src, size_t n) { int d0, d1, d2; -if (dest<src) -__asm__ __volatile__( - "rep\n\t" - "movsb" - : "=&c" (d0), "=&S" (d1), "=&D" (d2) - :"0" (n),"1" (src),"2" (dest) - : "memory"); -else +if (dest<src) { + memcpy(dest,src,n); +} else __asm__ __volatile__( "std\n\t" "rep\n\t" --- diff/include/asm-i386/system.h 2003-06-09 14:18:20.000000000 +0100 +++ source/include/asm-i386/system.h 2003-11-26 10:09:07.000000000 +0000 @@ -470,6 +470,7 @@ extern unsigned long dmi_broken; extern int is_sony_vaio_laptop; +extern int es7000_plat; #define BROKEN_ACPI_Sx 0x0001 #define BROKEN_INIT_AFTER_S1 0x0002 --- diff/include/asm-i386/thread_info.h 2003-10-09 09:47:17.000000000 +0100 +++ source/include/asm-i386/thread_info.h 2003-11-26 10:09:07.000000000 +0000 @@ -33,23 +33,12 @@ 0-0xBFFFFFFF for user-thread 0-0xFFFFFFFF for kernel-thread */ - struct restart_block restart_block; + void *real_stack, *virtual_stack, *user_pgd; + struct restart_block restart_block; __u8 supervisor_stack[0]; }; -#else /* !__ASSEMBLY__ */ - -/* offsets into the thread_info struct for assembly code access */ -#define TI_TASK 0x00000000 -#define TI_EXEC_DOMAIN 0x00000004 -#define TI_FLAGS 0x00000008 -#define TI_STATUS 0x0000000C -#define TI_CPU 0x00000010 -#define TI_PRE_COUNT 0x00000014 -#define TI_ADDR_LIMIT 0x00000018 -#define TI_RESTART_BLOCK 0x000001C - #endif #define PREEMPT_ACTIVE 0x4000000 @@ -61,7 +50,7 @@ */ #ifndef __ASSEMBLY__ -#define INIT_THREAD_INFO(tsk) \ +#define INIT_THREAD_INFO(tsk, thread_info) \ { \ .task = &tsk, \ .exec_domain = &default_exec_domain, \ @@ -72,6 +61,7 @@ .restart_block = { \ .fn = do_no_restart_syscall, \ }, \ + .real_stack = &thread_info, \ } #define init_thread_info (init_thread_union.thread_info) @@ -113,6 +103,7 @@ #define TIF_NEED_RESCHED 3 /* rescheduling necessary */ #define TIF_SINGLESTEP 4 /* restore singlestep on return to user mode */ #define TIF_IRET 5 /* return with iret */ +#define TIF_DB7 6 /* has debug registers */ #define TIF_POLLING_NRFLAG 16 /* true if poll_idle() is polling TIF_NEED_RESCHED */ #define _TIF_SYSCALL_TRACE (1<<TIF_SYSCALL_TRACE) @@ -121,6 +112,7 @@ #define _TIF_NEED_RESCHED (1<<TIF_NEED_RESCHED) #define _TIF_SINGLESTEP (1<<TIF_SINGLESTEP) #define _TIF_IRET (1<<TIF_IRET) +#define _TIF_DB7 (1<<TIF_DB7) #define _TIF_POLLING_NRFLAG (1<<TIF_POLLING_NRFLAG) #define _TIF_WORK_MASK 0x0000FFFE /* work to do on interrupt/exception return */ --- diff/include/asm-i386/timer.h 2003-09-30 15:46:19.000000000 +0100 +++ source/include/asm-i386/timer.h 2003-11-26 10:09:07.000000000 +0000 @@ -11,6 +11,7 @@ * last timer interrupt. 
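 *
 * (Sketch of how a timesource fills this in, using the existing PIT
 * driver as the example; the function names follow
 * arch/i386/kernel/timers/timer_pit.c and are assumptions, not part
 * of this hunk:
 *
 *	static struct timer_opts timer_pit = {
 *		.name = "pit",
 *		.init = init_pit,
 *		.mark_offset = mark_offset_pit,
 *		.get_offset = get_offset_pit,
 *	};
 *
 * the new "name" field gives the chosen timesource a printable
 * identity, e.g. for boot-time reporting and override matching.)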
*/ struct timer_opts{ + char* name; int (*init)(char *override); void (*mark_offset)(void); unsigned long (*get_offset)(void); @@ -39,9 +40,13 @@ #endif extern unsigned long calibrate_tsc(void); +extern void init_cpu_khz(void); #ifdef CONFIG_HPET_TIMER extern struct timer_opts timer_hpet; extern unsigned long calibrate_tsc_hpet(unsigned long *tsc_hpet_quotient_ptr); #endif +#ifdef CONFIG_X86_PM_TIMER +extern struct timer_opts timer_pmtmr; +#endif #endif --- diff/include/asm-i386/tlbflush.h 2003-10-09 09:47:17.000000000 +0100 +++ source/include/asm-i386/tlbflush.h 2003-11-26 10:09:07.000000000 +0000 @@ -85,22 +85,28 @@ static inline void flush_tlb_mm(struct mm_struct *mm) { +#ifndef CONFIG_X86_SWITCH_PAGETABLES if (mm == current->active_mm) __flush_tlb(); +#endif } static inline void flush_tlb_page(struct vm_area_struct *vma, unsigned long addr) { +#ifndef CONFIG_X86_SWITCH_PAGETABLES if (vma->vm_mm == current->active_mm) __flush_tlb_one(addr); +#endif } static inline void flush_tlb_range(struct vm_area_struct *vma, unsigned long start, unsigned long end) { +#ifndef CONFIG_X86_SWITCH_PAGETABLES if (vma->vm_mm == current->active_mm) __flush_tlb(); +#endif } #else @@ -111,11 +117,10 @@ __flush_tlb() extern void flush_tlb_all(void); -extern void flush_tlb_current_task(void); extern void flush_tlb_mm(struct mm_struct *); extern void flush_tlb_page(struct vm_area_struct *, unsigned long); -#define flush_tlb() flush_tlb_current_task() +#define flush_tlb() flush_tlb_all() static inline void flush_tlb_range(struct vm_area_struct * vma, unsigned long start, unsigned long end) { --- diff/include/asm-i386/uaccess.h 2003-10-09 09:47:17.000000000 +0100 +++ source/include/asm-i386/uaccess.h 2003-11-26 10:09:07.000000000 +0000 @@ -26,7 +26,7 @@ #define KERNEL_DS MAKE_MM_SEG(0xFFFFFFFFUL) -#define USER_DS MAKE_MM_SEG(PAGE_OFFSET) +#define USER_DS MAKE_MM_SEG(TASK_SIZE) #define get_ds() (KERNEL_DS) #define get_fs() (current_thread_info()->addr_limit) @@ -149,6 +149,45 @@ :"=a" (ret),"=d" (x) \ :"0" (ptr)) +extern int get_user_size(unsigned int size, void *val, const void *ptr); +extern int put_user_size(unsigned int size, const void *val, void *ptr); +extern int zero_user_size(unsigned int size, void *ptr); +extern int copy_str_fromuser_size(unsigned int size, void *val, const void *ptr); +extern int strlen_fromuser_size(unsigned int size, const void *ptr); + + +# define indirect_get_user(x,ptr) \ +({ int __ret_gu,__val_gu; \ + __typeof__(ptr) __ptr_gu = (ptr); \ + __ret_gu = get_user_size(sizeof(*__ptr_gu), &__val_gu,__ptr_gu) ? -EFAULT : 0;\ + (x) = (__typeof__(*__ptr_gu))__val_gu; \ + __ret_gu; \ +}) +#define indirect_put_user(x,ptr) \ +({ \ + __typeof__(*(ptr)) *__ptr_pu = (ptr), __x_pu = (x); \ + put_user_size(sizeof(*__ptr_pu), &__x_pu, __ptr_pu) ? 
-EFAULT : 0; \ +}) +#define __indirect_put_user indirect_put_user +#define __indirect_get_user indirect_get_user + +#define indirect_copy_from_user(to,from,n) get_user_size(n,to,from) +#define indirect_copy_to_user(to,from,n) put_user_size(n,from,to) + +#define __indirect_copy_from_user indirect_copy_from_user +#define __indirect_copy_to_user indirect_copy_to_user + +#define indirect_strncpy_from_user(dst, src, count) \ + copy_str_fromuser_size(count, dst, src) + +extern int strlen_fromuser_size(unsigned int size, const void *ptr); +#define indirect_strnlen_user(str, n) strlen_fromuser_size(n, str) +#define indirect_strlen_user(str) indirect_strnlen_user(str, ~0UL >> 1) + +extern int zero_user_size(unsigned int size, void *ptr); + +#define indirect_clear_user(mem, len) zero_user_size(len, mem) +#define __indirect_clear_user clear_user /* Careful: we have to cast the result to the type of the pointer for sign reasons */ /** @@ -168,7 +207,7 @@ * Returns zero on success, or -EFAULT on error. * On error, the variable @x is set to zero. */ -#define get_user(x,ptr) \ +#define direct_get_user(x,ptr) \ ({ int __ret_gu,__val_gu; \ switch(sizeof (*(ptr))) { \ case 1: __get_user_x(1,__ret_gu,__val_gu,ptr); break; \ @@ -198,7 +237,7 @@ * * Returns zero on success, or -EFAULT on error. */ -#define put_user(x,ptr) \ +#define direct_put_user(x,ptr) \ __put_user_check((__typeof__(*(ptr)))(x),(ptr),sizeof(*(ptr))) @@ -222,7 +261,7 @@ * Returns zero on success, or -EFAULT on error. * On error, the variable @x is set to zero. */ -#define __get_user(x,ptr) \ +#define __direct_get_user(x,ptr) \ __get_user_nocheck((x),(ptr),sizeof(*(ptr))) @@ -245,7 +284,7 @@ * * Returns zero on success, or -EFAULT on error. */ -#define __put_user(x,ptr) \ +#define __direct_put_user(x,ptr) \ __put_user_nocheck((__typeof__(*(ptr)))(x),(ptr),sizeof(*(ptr))) #define __put_user_nocheck(x,ptr,size) \ @@ -396,7 +435,7 @@ * On success, this will be zero. */ static inline unsigned long -__copy_to_user(void __user *to, const void *from, unsigned long n) +__direct_copy_to_user(void __user *to, const void *from, unsigned long n) { if (__builtin_constant_p(n)) { unsigned long ret; @@ -434,7 +473,7 @@ * data to the requested size using zero bytes. */ static inline unsigned long -__copy_from_user(void *to, const void __user *from, unsigned long n) +__direct_copy_from_user(void *to, const void __user *from, unsigned long n) { if (__builtin_constant_p(n)) { unsigned long ret; @@ -468,11 +507,11 @@ * On success, this will be zero. */ static inline unsigned long -copy_to_user(void __user *to, const void *from, unsigned long n) +direct_copy_to_user(void __user *to, const void *from, unsigned long n) { might_sleep(); if (access_ok(VERIFY_WRITE, to, n)) - n = __copy_to_user(to, from, n); + n = __direct_copy_to_user(to, from, n); return n; } @@ -493,11 +532,11 @@ * data to the requested size using zero bytes. */ static inline unsigned long -copy_from_user(void *to, const void __user *from, unsigned long n) +direct_copy_from_user(void *to, const void __user *from, unsigned long n) { might_sleep(); if (access_ok(VERIFY_READ, from, n)) - n = __copy_from_user(to, from, n); + n = __direct_copy_from_user(to, from, n); else memset(to, 0, n); return n; @@ -520,10 +559,68 @@ * If there is a limit on the length of a valid string, you may wish to * consider using strnlen_user() instead. 
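 *
 * (Calling conventions are unchanged by the direct/indirect split that
 * follows; an illustrative caller, not from the patch:
 *
 *	if (copy_from_user(kbuf, ubuf, len))
 *		return -EFAULT;		nonzero means bytes were left undone
 *	if (get_user(val, uptr))
 *		return -EFAULT;
 *
 * the CONFIG_X86_UACCESS_INDIRECT block further down tabulates the
 * exact return-code and zeroing semantics of each routine.)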
*/ -#define strlen_user(str) strnlen_user(str, ~0UL >> 1) -long strnlen_user(const char __user *str, long n); -unsigned long clear_user(void __user *mem, unsigned long len); -unsigned long __clear_user(void __user *mem, unsigned long len); +long direct_strncpy_from_user(char *dst, const char *src, long count); +long __direct_strncpy_from_user(char *dst, const char *src, long count); +#define direct_strlen_user(str) direct_strnlen_user(str, ~0UL >> 1) +long direct_strnlen_user(const char *str, long n); +unsigned long direct_clear_user(void *mem, unsigned long len); +unsigned long __direct_clear_user(void *mem, unsigned long len); + +extern int indirect_uaccess; + +#ifdef CONFIG_X86_UACCESS_INDIRECT + +/* + * Return code and zeroing semantics: + + __clear_user 0 <-> bytes not done + clear_user 0 <-> bytes not done + __copy_to_user 0 <-> bytes not done + copy_to_user 0 <-> bytes not done + __copy_from_user 0 <-> bytes not done, zero rest + copy_from_user 0 <-> bytes not done, zero rest + __get_user 0 <-> -EFAULT + get_user 0 <-> -EFAULT + __put_user 0 <-> -EFAULT + put_user 0 <-> -EFAULT + strlen_user strlen + 1 <-> 0 + strnlen_user strlen + 1 (or n+1) <-> 0 + strncpy_from_user strlen (or n) <-> -EFAULT + + */ + +#define __clear_user(mem,len) __indirect_clear_user(mem,len) +#define clear_user(mem,len) indirect_clear_user(mem,len) +#define __copy_to_user(to,from,n) __indirect_copy_to_user(to,from,n) +#define copy_to_user(to,from,n) indirect_copy_to_user(to,from,n) +#define __copy_from_user(to,from,n) __indirect_copy_from_user(to,from,n) +#define copy_from_user(to,from,n) indirect_copy_from_user(to,from,n) +#define __get_user(val,ptr) __indirect_get_user(val,ptr) +#define get_user(val,ptr) indirect_get_user(val,ptr) +#define __put_user(val,ptr) __indirect_put_user(val,ptr) +#define put_user(val,ptr) indirect_put_user(val,ptr) +#define strlen_user(str) indirect_strlen_user(str) +#define strnlen_user(src,count) indirect_strnlen_user(src,count) +#define strncpy_from_user(dst,src,count) \ + indirect_strncpy_from_user(dst,src,count) + +#else + +#define __clear_user __direct_clear_user +#define clear_user direct_clear_user +#define __copy_to_user __direct_copy_to_user +#define copy_to_user direct_copy_to_user +#define __copy_from_user __direct_copy_from_user +#define copy_from_user direct_copy_from_user +#define __get_user __direct_get_user +#define get_user direct_get_user +#define __put_user __direct_put_user +#define put_user direct_put_user +#define strlen_user direct_strlen_user +#define strnlen_user direct_strnlen_user +#define strncpy_from_user direct_strncpy_from_user + +#endif /* CONFIG_X86_UACCESS_INDIRECT */ #endif /* __i386_UACCESS_H */ --- diff/include/asm-ia64/spinlock.h 2003-10-09 09:47:17.000000000 +0100 +++ source/include/asm-ia64/spinlock.h 2003-11-26 10:09:07.000000000 +0000 @@ -110,8 +110,18 @@ typedef struct { volatile int read_counter : 31; volatile int write_lock : 1; +#ifdef CONFIG_LOCKMETER + /* required for LOCKMETER since all bits in lock are used */ + /* and we need this storage for CPU and lock INDEX */ + unsigned lockmeter_magic; +#endif } rwlock_t; + +#ifdef CONFIG_LOCKMETER +#define RW_LOCK_UNLOCKED (rwlock_t) { 0, 0, 0 } +#else #define RW_LOCK_UNLOCKED (rwlock_t) { 0, 0 } +#endif #define rwlock_init(x) do { *(x) = RW_LOCK_UNLOCKED; } while(0) #define rwlock_is_locked(x) (*(volatile int *) (x) != 0) @@ -127,6 +137,48 @@ } \ } while (0) +#ifdef CONFIG_LOCKMETER +/* + * HACK: This works, but we still have a timing window that affects performance: + * we see that no one owns 
the Write lock, then someone * else grabs for Write + lock before we do a read_lock(). + * This means that on rare occasions our read_lock() will stall and spin-wait + * until we acquire for Read, instead of simply returning a trylock failure. + */ +static inline int _raw_read_trylock(rwlock_t *rw) +{ + if (rw->write_lock) { + return 0; + } else { + _raw_read_lock(rw); + return 1; + } +} + +static inline int _raw_write_trylock(rwlock_t *rw) +{ + if (!(rw->write_lock)) { + /* isn't currently write-locked... that looks promising... */ + if (test_and_set_bit(31, rw) == 0) { + /* now it is write-locked by me... */ + if (rw->read_counter) { + /* really read-locked, so release write-lock and fail */ + clear_bit(31, rw); + } else { + /* we've got the write-lock, no read-lockers... success! */ + barrier(); + return 1; + } + + } + } + + /* falls through ... fails to write-lock */ + barrier(); + return 0; +} +#endif + #define _raw_read_unlock(rw) \ do { \ rwlock_t *__read_lock_ptr = (rw); \ @@ -190,4 +242,25 @@ clear_bit(31, (x)); \ }) +#ifdef CONFIG_LOCKMETER +extern void _metered_spin_lock (spinlock_t *lock); +extern void _metered_spin_unlock(spinlock_t *lock); + +/* + * Use a less efficient, and inline, atomic_dec_and_lock() if lockmetering + * so we can see the callerPC of who is actually doing the spin_lock(). + * Otherwise, all we see is the generic rollup of all locks done by + * atomic_dec_and_lock(). + */ +static inline int atomic_dec_and_lock(atomic_t *atomic, spinlock_t *lock) +{ + _metered_spin_lock(lock); + if (atomic_dec_and_test(atomic)) + return 1; + _metered_spin_unlock(lock); + return 0; +} +#define ATOMIC_DEC_AND_LOCK +#endif + #endif /* _ASM_IA64_SPINLOCK_H */ --- diff/include/asm-mips/spinlock.h 2003-10-09 09:47:17.000000000 +0100 +++ source/include/asm-mips/spinlock.h 2003-11-26 10:09:07.000000000 +0000 @@ -91,9 +91,18 @@ typedef struct { volatile unsigned int lock; +#ifdef CONFIG_LOCKMETER + /* required for LOCKMETER since all bits in lock are used */ + /* and we need this storage for CPU and lock INDEX */ + unsigned lockmeter_magic; +#endif } rwlock_t; +#ifdef CONFIG_LOCKMETER +#define RW_LOCK_UNLOCKED (rwlock_t) { 0, 0 } +#else #define RW_LOCK_UNLOCKED (rwlock_t) { 0 } +#endif #define rwlock_init(x) do { *(x) = RW_LOCK_UNLOCKED; } while(0) --- diff/include/asm-sparc64/spinlock.h 2003-11-25 15:24:59.000000000 +0000 +++ source/include/asm-sparc64/spinlock.h 2003-11-26 10:09:07.000000000 +0000 @@ -30,15 +30,23 @@ #ifndef CONFIG_DEBUG_SPINLOCK -typedef unsigned char spinlock_t; -#define SPIN_LOCK_UNLOCKED 0 +typedef struct { + unsigned char lock; + unsigned int index; +} spinlock_t; -#define spin_lock_init(lock) (*((unsigned char *)(lock)) = 0) -#define spin_is_locked(lock) (*((volatile unsigned char *)(lock)) != 0) +#ifdef CONFIG_LOCKMETER +#define SPIN_LOCK_UNLOCKED (spinlock_t) {0, 0} +#else +#define SPIN_LOCK_UNLOCKED (spinlock_t) { 0 } +#endif -#define spin_unlock_wait(lock) \ +#define spin_lock_init(__lock) do { *(__lock) = SPIN_LOCK_UNLOCKED; } while(0) +#define spin_is_locked(__lock) (*((volatile unsigned char *)(&((__lock)->lock))) != 0) + +#define spin_unlock_wait(__lock) \ do { membar("#LoadLoad"); \ -} while(*((volatile unsigned char *)lock)) +} while(*((volatile unsigned char *)(&(((spinlock_t *)__lock)->lock)))) static __inline__ void _raw_spin_lock(spinlock_t *lock) { @@ -109,17 +117,31 @@ #ifndef CONFIG_DEBUG_SPINLOCK -typedef unsigned int rwlock_t; -#define RW_LOCK_UNLOCKED 0 -#define rwlock_init(lp) do { *(lp) = RW_LOCK_UNLOCKED; } while(0) -#define 
rwlock_is_locked(x) (*(x) != RW_LOCK_UNLOCKED) +#ifdef CONFIG_LOCKMETER +typedef struct { + unsigned int lock; + unsigned int index; + unsigned int cpu; +} rwlock_t; +#define RW_LOCK_UNLOCKED (rwlock_t) { 0, 0, 0xff } +#else +typedef struct { + unsigned int lock; +} rwlock_t; +#define RW_LOCK_UNLOCKED (rwlock_t) { 0 } +#endif + +#define rwlock_init(lp) do { *(lp) = RW_LOCK_UNLOCKED; } while(0) +#define rwlock_is_locked(x) ((x)->lock != 0) +extern int __read_trylock(rwlock_t *); extern void __read_lock(rwlock_t *); extern void __read_unlock(rwlock_t *); extern void __write_lock(rwlock_t *); extern void __write_unlock(rwlock_t *); extern int __write_trylock(rwlock_t *); +#define _raw_read_trylock(p) __read_trylock(p) #define _raw_read_lock(p) __read_lock(p) #define _raw_read_unlock(p) __read_unlock(p) #define _raw_write_lock(p) __write_lock(p) --- diff/include/asm-x86_64/a.out.h 2002-10-16 04:27:55.000000000 +0100 +++ source/include/asm-x86_64/a.out.h 2003-11-26 10:09:07.000000000 +0000 @@ -1,13 +1,11 @@ #ifndef __X8664_A_OUT_H__ #define __X8664_A_OUT_H__ - -/* Note: a.out is not supported in 64bit mode. This is just here to - still let some old things compile. */ +/* 32bit a.out */ struct exec { - unsigned long a_info; /* Use macros N_MAGIC, etc for access */ + unsigned int a_info; /* Use macros N_MAGIC, etc for access */ unsigned a_text; /* length of text, in bytes */ unsigned a_data; /* length of data, in bytes */ unsigned a_bss; /* length of uninitialized data area for file, in bytes */ @@ -23,7 +21,7 @@ #ifdef __KERNEL__ -#define STACK_TOP TASK_SIZE +#define STACK_TOP 0xc0000000 #endif --- diff/include/asm-x86_64/apic.h 2003-08-20 14:16:34.000000000 +0100 +++ source/include/asm-x86_64/apic.h 2003-11-26 10:09:07.000000000 +0000 @@ -79,7 +79,7 @@ extern void enable_lapic_nmi_watchdog(void); extern void disable_timer_nmi_watchdog(void); extern void enable_timer_nmi_watchdog(void); -extern inline void nmi_watchdog_tick (struct pt_regs * regs, unsigned reason); +extern void nmi_watchdog_tick (struct pt_regs * regs, unsigned reason); extern int APIC_init_uniprocessor (void); extern void disable_APIC_timer(void); extern void enable_APIC_timer(void); --- diff/include/asm-x86_64/calling.h 2003-06-09 14:18:20.000000000 +0100 +++ source/include/asm-x86_64/calling.h 2003-11-26 10:09:07.000000000 +0000 @@ -8,7 +8,7 @@ #define R14 8 #define R13 16 #define R12 24 -#define RBP 36 +#define RBP 32 #define RBX 40 /* arguments: interrupts/non tracing syscalls only save upto here*/ #define R11 48 --- diff/include/asm-x86_64/desc.h 2003-11-25 15:24:59.000000000 +0000 +++ source/include/asm-x86_64/desc.h 2003-11-26 10:09:07.000000000 +0000 @@ -190,7 +190,7 @@ /* * load one particular LDT into the current CPU */ -extern inline void load_LDT_nolock (mm_context_t *pc, int cpu) +static inline void load_LDT_nolock (mm_context_t *pc, int cpu) { int count = pc->size; --- diff/include/asm-x86_64/fixmap.h 2003-06-30 10:07:29.000000000 +0100 +++ source/include/asm-x86_64/fixmap.h 2003-11-26 10:09:07.000000000 +0000 @@ -76,7 +76,7 @@ * directly without translation, we catch the bug with a NULL-deference * kernel oops. Illegal ranges of incoming indices are caught too. 
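 *
 * (The idiom, roughly, assuming the usual helper names in this header:
 * the out-of-range branch calls a function that is declared but never
 * defined,
 *
 *	if (idx >= __end_of_fixed_addresses)
 *		__this_fixmap_does_not_exist();
 *	return __fix_to_virt(idx);
 *
 * so a constant illegal index fails at link time, while a valid
 * constant index compiles down to a bare address.)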
*/ -extern inline unsigned long fix_to_virt(const unsigned int idx) +static inline unsigned long fix_to_virt(const unsigned int idx) { /* * this branch gets completely eliminated after inlining, --- diff/include/asm-x86_64/hw_irq.h 2003-11-25 15:24:59.000000000 +0000 +++ source/include/asm-x86_64/hw_irq.h 2003-11-26 10:09:07.000000000 +0000 @@ -173,6 +173,8 @@ static inline void hw_resend_irq(struct hw_interrupt_type *h, unsigned int i) {} #endif +#define platform_legacy_irq(irq) ((irq) < 16) + #endif #endif /* _ASM_HW_IRQ_H */ --- diff/include/asm-x86_64/io.h 2003-08-26 10:00:54.000000000 +0100 +++ source/include/asm-x86_64/io.h 2003-11-26 10:09:07.000000000 +0000 @@ -52,7 +52,7 @@ * Talk about misusing macros.. */ #define __OUT1(s,x) \ -extern inline void out##s(unsigned x value, unsigned short port) { +static inline void out##s(unsigned x value, unsigned short port) { #define __OUT2(s,s1,s2) \ __asm__ __volatile__ ("out" #s " %" s1 "0,%" s2 "1" @@ -62,7 +62,7 @@ __OUT1(s##_p,x) __OUT2(s,s1,"w") __FULL_SLOW_DOWN_IO : : "a" (value), "Nd" (port));} \ #define __IN1(s) \ -extern inline RETURN_TYPE in##s(unsigned short port) { RETURN_TYPE _v; +static inline RETURN_TYPE in##s(unsigned short port) { RETURN_TYPE _v; #define __IN2(s,s1,s2) \ __asm__ __volatile__ ("in" #s " %" s2 "1,%" s1 "0" @@ -72,12 +72,12 @@ __IN1(s##_p) __IN2(s,s1,"w") __FULL_SLOW_DOWN_IO : "=a" (_v) : "Nd" (port) ,##i ); return _v; } \ #define __INS(s) \ -extern inline void ins##s(unsigned short port, void * addr, unsigned long count) \ +static inline void ins##s(unsigned short port, void * addr, unsigned long count) \ { __asm__ __volatile__ ("rep ; ins" #s \ : "=D" (addr), "=c" (count) : "d" (port),"0" (addr),"1" (count)); } #define __OUTS(s) \ -extern inline void outs##s(unsigned short port, const void * addr, unsigned long count) \ +static inline void outs##s(unsigned short port, const void * addr, unsigned long count) \ { __asm__ __volatile__ ("rep ; outs" #s \ : "=S" (addr), "=c" (count) : "d" (port),"0" (addr),"1" (count)); } @@ -125,12 +125,12 @@ * Change virtual addresses to physical addresses and vv. * These are pretty trivial */ -extern inline unsigned long virt_to_phys(volatile void * address) +static inline unsigned long virt_to_phys(volatile void * address) { return __pa(address); } -extern inline void * phys_to_virt(unsigned long address) +static inline void * phys_to_virt(unsigned long address) { return __va(address); } @@ -148,7 +148,7 @@ extern void * __ioremap(unsigned long offset, unsigned long size, unsigned long flags); -extern inline void * ioremap (unsigned long offset, unsigned long size) +static inline void * ioremap (unsigned long offset, unsigned long size) { return __ioremap(offset, size, 0); } @@ -304,8 +304,8 @@ /* Disable vmerge for now. Need to fix the block layer code to check for non iommu addresses first. When the IOMMU is force it is safe to enable. */ -extern int force_iommu; -#define BIO_VERMGE_BOUNDARY (force_iommu ? 4096 : 0) +extern int iommu_merge; +#define BIO_VMERGE_BOUNDARY (iommu_merge ? 
4096 : 0) #endif /* __KERNEL__ */ --- diff/include/asm-x86_64/io_apic.h 2003-08-26 10:00:54.000000000 +0100 +++ source/include/asm-x86_64/io_apic.h 2003-11-26 10:09:07.000000000 +0000 @@ -174,6 +174,24 @@ #define io_apic_assign_pci_irqs 0 #endif +static inline int use_pci_vector(void) {return 0;} +static inline void disable_edge_ioapic_irq(unsigned int irq) { } +static inline void mask_and_ack_level_ioapic_irq(unsigned int irq) { } +static inline void end_edge_ioapic_irq (unsigned int irq) { } +#define startup_level_ioapic startup_level_ioapic_irq +#define shutdown_level_ioapic mask_IO_APIC_irq +#define enable_level_ioapic unmask_IO_APIC_irq +#define disable_level_ioapic mask_IO_APIC_irq +#define mask_and_ack_level_ioapic mask_and_ack_level_ioapic_irq +#define end_level_ioapic end_level_ioapic_irq + +#define startup_edge_ioapic startup_edge_ioapic_irq +#define shutdown_edge_ioapic disable_edge_ioapic_irq +#define enable_edge_ioapic unmask_IO_APIC_irq +#define disable_edge_ioapic disable_edge_ioapic_irq +#define ack_edge_ioapic ack_edge_ioapic_irq +//#define end_edge_ioapic end_edge_ioapic_irq + void enable_NMI_through_LVT0 (void * dummy); #endif --- diff/include/asm-x86_64/msr.h 2003-05-21 11:50:16.000000000 +0100 +++ source/include/asm-x86_64/msr.h 2003-11-26 10:09:07.000000000 +0000 @@ -67,7 +67,7 @@ : "=a" (low), "=d" (high) \ : "c" (counter)) -extern inline void cpuid(int op, int *eax, int *ebx, int *ecx, int *edx) +static inline void cpuid(int op, int *eax, int *ebx, int *ecx, int *edx) { __asm__("cpuid" : "=a" (*eax), @@ -80,7 +80,7 @@ /* * CPUID functions returning a single datum */ -extern inline unsigned int cpuid_eax(unsigned int op) +static inline unsigned int cpuid_eax(unsigned int op) { unsigned int eax; @@ -90,7 +90,7 @@ : "bx", "cx", "dx"); return eax; } -extern inline unsigned int cpuid_ebx(unsigned int op) +static inline unsigned int cpuid_ebx(unsigned int op) { unsigned int eax, ebx; @@ -100,7 +100,7 @@ : "cx", "dx" ); return ebx; } -extern inline unsigned int cpuid_ecx(unsigned int op) +static inline unsigned int cpuid_ecx(unsigned int op) { unsigned int eax, ecx; @@ -110,7 +110,7 @@ : "bx", "dx" ); return ecx; } -extern inline unsigned int cpuid_edx(unsigned int op) +static inline unsigned int cpuid_edx(unsigned int op) { unsigned int eax, edx; --- diff/include/asm-x86_64/pgalloc.h 2003-05-21 11:50:16.000000000 +0100 +++ source/include/asm-x86_64/pgalloc.h 2003-11-26 10:09:07.000000000 +0000 @@ -69,7 +69,7 @@ free_page((unsigned long)pte); } -extern inline void pte_free(struct page *pte) +static inline void pte_free(struct page *pte) { __free_page(pte); } --- diff/include/asm-x86_64/pgtable.h 2003-10-09 09:47:34.000000000 +0100 +++ source/include/asm-x86_64/pgtable.h 2003-11-26 10:09:08.000000000 +0000 @@ -71,7 +71,7 @@ #define pml4_none(x) (!pml4_val(x)) #define pgd_none(x) (!pgd_val(x)) -extern inline int pgd_present(pgd_t pgd) { return !pgd_none(pgd); } +static inline int pgd_present(pgd_t pgd) { return !pgd_none(pgd); } static inline void set_pte(pte_t *dst, pte_t val) { @@ -88,7 +88,7 @@ pgd_val(*dst) = pgd_val(val); } -extern inline void pgd_clear (pgd_t * pgd) +static inline void pgd_clear (pgd_t * pgd) { set_pgd(pgd, __pgd(0)); } @@ -242,23 +242,23 @@ * Undefined behaviour if not.. 
*/ static inline int pte_user(pte_t pte) { return pte_val(pte) & _PAGE_USER; } -extern inline int pte_read(pte_t pte) { return pte_val(pte) & _PAGE_USER; } -extern inline int pte_exec(pte_t pte) { return pte_val(pte) & _PAGE_USER; } -extern inline int pte_dirty(pte_t pte) { return pte_val(pte) & _PAGE_DIRTY; } -extern inline int pte_young(pte_t pte) { return pte_val(pte) & _PAGE_ACCESSED; } -extern inline int pte_write(pte_t pte) { return pte_val(pte) & _PAGE_RW; } +static inline int pte_read(pte_t pte) { return pte_val(pte) & _PAGE_USER; } +static inline int pte_exec(pte_t pte) { return pte_val(pte) & _PAGE_USER; } +static inline int pte_dirty(pte_t pte) { return pte_val(pte) & _PAGE_DIRTY; } +static inline int pte_young(pte_t pte) { return pte_val(pte) & _PAGE_ACCESSED; } +static inline int pte_write(pte_t pte) { return pte_val(pte) & _PAGE_RW; } static inline int pte_file(pte_t pte) { return pte_val(pte) & _PAGE_FILE; } -extern inline pte_t pte_rdprotect(pte_t pte) { set_pte(&pte, __pte(pte_val(pte) & ~_PAGE_USER)); return pte; } -extern inline pte_t pte_exprotect(pte_t pte) { set_pte(&pte, __pte(pte_val(pte) & ~_PAGE_USER)); return pte; } -extern inline pte_t pte_mkclean(pte_t pte) { set_pte(&pte, __pte(pte_val(pte) & ~_PAGE_DIRTY)); return pte; } -extern inline pte_t pte_mkold(pte_t pte) { set_pte(&pte, __pte(pte_val(pte) & ~_PAGE_ACCESSED)); return pte; } -extern inline pte_t pte_wrprotect(pte_t pte) { set_pte(&pte, __pte(pte_val(pte) & ~_PAGE_RW)); return pte; } -extern inline pte_t pte_mkread(pte_t pte) { set_pte(&pte, __pte(pte_val(pte) | _PAGE_USER)); return pte; } -extern inline pte_t pte_mkexec(pte_t pte) { set_pte(&pte, __pte(pte_val(pte) | _PAGE_USER)); return pte; } -extern inline pte_t pte_mkdirty(pte_t pte) { set_pte(&pte, __pte(pte_val(pte) | _PAGE_DIRTY)); return pte; } -extern inline pte_t pte_mkyoung(pte_t pte) { set_pte(&pte, __pte(pte_val(pte) | _PAGE_ACCESSED)); return pte; } -extern inline pte_t pte_mkwrite(pte_t pte) { set_pte(&pte, __pte(pte_val(pte) | _PAGE_RW)); return pte; } +static inline pte_t pte_rdprotect(pte_t pte) { set_pte(&pte, __pte(pte_val(pte) & ~_PAGE_USER)); return pte; } +static inline pte_t pte_exprotect(pte_t pte) { set_pte(&pte, __pte(pte_val(pte) & ~_PAGE_USER)); return pte; } +static inline pte_t pte_mkclean(pte_t pte) { set_pte(&pte, __pte(pte_val(pte) & ~_PAGE_DIRTY)); return pte; } +static inline pte_t pte_mkold(pte_t pte) { set_pte(&pte, __pte(pte_val(pte) & ~_PAGE_ACCESSED)); return pte; } +static inline pte_t pte_wrprotect(pte_t pte) { set_pte(&pte, __pte(pte_val(pte) & ~_PAGE_RW)); return pte; } +static inline pte_t pte_mkread(pte_t pte) { set_pte(&pte, __pte(pte_val(pte) | _PAGE_USER)); return pte; } +static inline pte_t pte_mkexec(pte_t pte) { set_pte(&pte, __pte(pte_val(pte) | _PAGE_USER)); return pte; } +static inline pte_t pte_mkdirty(pte_t pte) { set_pte(&pte, __pte(pte_val(pte) | _PAGE_DIRTY)); return pte; } +static inline pte_t pte_mkyoung(pte_t pte) { set_pte(&pte, __pte(pte_val(pte) | _PAGE_ACCESSED)); return pte; } +static inline pte_t pte_mkwrite(pte_t pte) { set_pte(&pte, __pte(pte_val(pte) | _PAGE_RW)); return pte; } static inline int ptep_test_and_clear_dirty(pte_t *ptep) { return test_and_clear_bit(_PAGE_BIT_DIRTY, ptep); } static inline int ptep_test_and_clear_young(pte_t *ptep) { return test_and_clear_bit(_PAGE_BIT_ACCESSED, ptep); } static inline void ptep_set_wrprotect(pte_t *ptep) { clear_bit(_PAGE_BIT_RW, ptep); } @@ -359,7 +359,7 @@ } /* Change flags of a PTE */ -extern inline pte_t pte_modify(pte_t pte, 
pgprot_t newprot) +static inline pte_t pte_modify(pte_t pte, pgprot_t newprot) { pte_val(pte) &= _PAGE_CHG_MASK; pte_val(pte) |= pgprot_val(newprot); --- diff/include/asm-x86_64/processor.h 2003-11-25 15:24:59.000000000 +0000 +++ source/include/asm-x86_64/processor.h 2003-11-26 10:09:08.000000000 +0000 @@ -304,13 +304,13 @@ #define KSTK_ESP(tsk) -1 /* sorry. doesn't work for syscall. */ /* REP NOP (PAUSE) is a good thing to insert into busy-wait loops. */ -extern inline void rep_nop(void) +static inline void rep_nop(void) { __asm__ __volatile__("rep;nop": : :"memory"); } /* Stop speculative execution */ -extern inline void sync_core(void) +static inline void sync_core(void) { int tmp; asm volatile("cpuid" : "=a" (tmp) : "0" (1) : "ebx","ecx","edx","memory"); --- diff/include/asm-x86_64/smp.h 2003-11-25 15:24:59.000000000 +0000 +++ source/include/asm-x86_64/smp.h 2003-11-26 10:09:08.000000000 +0000 @@ -74,7 +74,7 @@ return GET_APIC_ID(*(unsigned int *)(APIC_BASE+APIC_ID)); } -#define safe_smp_processor_id() (cpuid_ebx(1) >> 24) +#define safe_smp_processor_id() (disable_apic ? 0 : hard_smp_processor_id()) #define cpu_online(cpu) cpu_isset(cpu, cpu_online_map) #endif /* !ASSEMBLY */ --- diff/include/asm-x86_64/stat.h 2002-11-18 10:11:55.000000000 +0000 +++ source/include/asm-x86_64/stat.h 2003-11-26 10:09:08.000000000 +0000 @@ -26,4 +26,19 @@ long __unused[3]; }; +/* For 32bit emulation */ +struct __old_kernel_stat { + unsigned short st_dev; + unsigned short st_ino; + unsigned short st_mode; + unsigned short st_nlink; + unsigned short st_uid; + unsigned short st_gid; + unsigned short st_rdev; + unsigned int st_size; + unsigned int st_atime; + unsigned int st_mtime; + unsigned int st_ctime; +}; + #endif --- diff/include/asm-x86_64/system.h 2003-09-17 12:28:12.000000000 +0100 +++ source/include/asm-x86_64/system.h 2003-11-26 10:09:08.000000000 +0000 @@ -188,7 +188,7 @@ #define __xg(x) ((volatile long *)(x)) -extern inline void set_64bit(volatile unsigned long *ptr, unsigned long val) +static inline void set_64bit(volatile unsigned long *ptr, unsigned long val) { *ptr = val; } --- diff/include/asm-x86_64/topology.h 2003-11-25 15:24:59.000000000 +0000 +++ source/include/asm-x86_64/topology.h 2003-11-26 10:09:08.000000000 +0000 @@ -23,6 +23,7 @@ static inline unsigned long pcibus_to_cpumask(int bus) { + BUG_ON(bus >= MAX_MP_BUSSES); return mp_bus_to_cpumask[bus] & cpu_online_map; } --- diff/include/asm-x86_64/uaccess.h 2003-10-09 09:47:34.000000000 +0100 +++ source/include/asm-x86_64/uaccess.h 2003-11-26 10:09:08.000000000 +0000 @@ -48,7 +48,7 @@ #define access_ok(type,addr,size) (__range_not_ok(addr,size) == 0) -extern inline int verify_area(int type, const void * addr, unsigned long size) +static inline int verify_area(int type, const void * addr, unsigned long size) { return access_ok(type,addr,size) ? 
0 : -EFAULT; } --- diff/include/asm-x86_64/unistd.h 2003-10-09 09:47:34.000000000 +0100 +++ source/include/asm-x86_64/unistd.h 2003-11-26 10:09:08.000000000 +0000 @@ -532,6 +532,7 @@ __SYSCALL(__NR_utimes, sys_utimes) #define __NR_vserver 236 __SYSCALL(__NR_vserver, sys_ni_syscall) +/* 237,238,239 reserved for NUMA API */ #define __NR_syscall_max __NR_vserver #ifndef __NO_STUBS @@ -623,11 +624,11 @@ type name (type1 arg1,type2 arg2,type3 arg3,type4 arg4,type5 arg5,type6 arg6) \ { \ long __res; \ -__asm__ volatile ("movq %5,%%r10 ; movq %6,%%r8 ; movq %7,%%r9" __syscall \ +__asm__ volatile ("movq %5,%%r10 ; movq %6,%%r8 ; movq %7,%%r9 ; " __syscall \ : "=a" (__res) \ : "0" (__NR_##name),"D" ((long)(arg1)),"S" ((long)(arg2)), \ - "d" ((long)(arg3)),"g" ((long)(arg4)),"g" ((long)(arg5), \ - "g" ((long)(arg6),) : \ + "d" ((long)(arg3)), "g" ((long)(arg4)), "g" ((long)(arg5)), \ + "g" ((long)(arg6)) : \ __syscall_clobber,"r8","r10","r9" ); \ __syscall_return(type,__res); \ } --- diff/include/linux/aio.h 2003-10-09 09:47:17.000000000 +0100 +++ source/include/linux/aio.h 2003-11-26 10:09:08.000000000 +0000 @@ -29,21 +29,26 @@ #define KIF_LOCKED 0 #define KIF_KICKED 1 #define KIF_CANCELLED 2 +#define KIF_SYNCED 3 #define kiocbTryLock(iocb) test_and_set_bit(KIF_LOCKED, &(iocb)->ki_flags) #define kiocbTryKick(iocb) test_and_set_bit(KIF_KICKED, &(iocb)->ki_flags) +#define kiocbTrySync(iocb) test_and_set_bit(KIF_SYNCED, &(iocb)->ki_flags) #define kiocbSetLocked(iocb) set_bit(KIF_LOCKED, &(iocb)->ki_flags) #define kiocbSetKicked(iocb) set_bit(KIF_KICKED, &(iocb)->ki_flags) #define kiocbSetCancelled(iocb) set_bit(KIF_CANCELLED, &(iocb)->ki_flags) +#define kiocbSetSynced(iocb) set_bit(KIF_SYNCED, &(iocb)->ki_flags) #define kiocbClearLocked(iocb) clear_bit(KIF_LOCKED, &(iocb)->ki_flags) #define kiocbClearKicked(iocb) clear_bit(KIF_KICKED, &(iocb)->ki_flags) #define kiocbClearCancelled(iocb) clear_bit(KIF_CANCELLED, &(iocb)->ki_flags) +#define kiocbClearSynced(iocb) clear_bit(KIF_SYNCED, &(iocb)->ki_flags) #define kiocbIsLocked(iocb) test_bit(KIF_LOCKED, &(iocb)->ki_flags) #define kiocbIsKicked(iocb) test_bit(KIF_KICKED, &(iocb)->ki_flags) #define kiocbIsCancelled(iocb) test_bit(KIF_CANCELLED, &(iocb)->ki_flags) +#define kiocbIsSynced(iocb) test_bit(KIF_SYNCED, &(iocb)->ki_flags) struct kiocb { struct list_head ki_run_list; @@ -54,7 +59,7 @@ struct file *ki_filp; struct kioctx *ki_ctx; /* may be NULL for sync ops */ int (*ki_cancel)(struct kiocb *, struct io_event *); - long (*ki_retry)(struct kiocb *); + ssize_t (*ki_retry)(struct kiocb *); struct list_head ki_list; /* the aio core uses this * for cancellation */ @@ -63,6 +68,16 @@ __u64 ki_user_data; /* user's data for completion */ loff_t ki_pos; + /* State that we remember to be able to restart/retry */ + unsigned short ki_opcode; + size_t ki_nbytes; /* copy of iocb->aio_nbytes */ + char *ki_buf; /* remaining iocb->aio_buf */ + size_t ki_left; /* remaining bytes */ + wait_queue_t ki_wait; + long ki_retried; /* just for testing */ + long ki_kicked; /* just for testing */ + long ki_queued; /* just for testing */ + char private[KIOCB_PRIVATE_SIZE]; }; @@ -77,6 +92,8 @@ (x)->ki_ctx = &tsk->active_mm->default_kioctx; \ (x)->ki_cancel = NULL; \ (x)->ki_user_obj = tsk; \ + (x)->ki_user_data = 0; \ + init_wait((&(x)->ki_wait)); \ } while (0) #define AIO_RING_MAGIC 0xa10a10a1 @@ -159,6 +176,17 @@ #define get_ioctx(kioctx) do { if (unlikely(atomic_read(&(kioctx)->users) <= 0)) BUG(); atomic_inc(&(kioctx)->users); } while (0) #define put_ioctx(kioctx) do { if 
(unlikely(atomic_dec_and_test(&(kioctx)->users))) __put_ioctx(kioctx); else if (unlikely(atomic_read(&(kioctx)->users) < 0)) BUG(); } while (0) +#define in_aio() !is_sync_wait(current->io_wait) +/* may be used for debugging */ +#define warn_if_async() if (in_aio()) {\ + printk(KERN_ERR "%s(%s:%d) called in async context!\n", \ + __FUNCTION__, __FILE__, __LINE__); \ + dump_stack(); \ + } + +#define io_wait_to_kiocb(wait) container_of(wait, struct kiocb, ki_wait) +#define is_retried_kiocb(iocb) ((iocb)->ki_retried > 1) + #include <linux/aio_abi.h> static inline struct kiocb *list_kiocb(struct list_head *h) @@ -167,6 +195,7 @@ } /* for sysctl: */ -extern unsigned aio_max_nr, aio_max_size, aio_max_pinned; +extern atomic_t aio_nr; +extern unsigned aio_max_nr; #endif /* __LINUX__AIO_H */ --- diff/include/linux/blkdev.h 2003-11-25 15:24:59.000000000 +0000 +++ source/include/linux/blkdev.h 2003-11-26 10:09:08.000000000 +0000 @@ -595,6 +595,7 @@ extern int blk_queue_resize_tags(request_queue_t *, int); extern void blk_queue_invalidate_tags(request_queue_t *); extern void blk_congestion_wait(int rw, long timeout); +extern int blk_congestion_wait_wq(int rw, long timeout, wait_queue_t *wait); extern void blk_rq_bio_prep(request_queue_t *, struct request *, struct bio *); extern void blk_rq_prep_restart(struct request *); --- diff/include/linux/buffer_head.h 2003-10-09 09:47:17.000000000 +0100 +++ source/include/linux/buffer_head.h 2003-11-26 10:09:08.000000000 +0000 @@ -162,6 +162,7 @@ void invalidate_bdev(struct block_device *, int); int sync_blockdev(struct block_device *bdev); void __wait_on_buffer(struct buffer_head *); +int __wait_on_buffer_wq(struct buffer_head *, wait_queue_t *wait); wait_queue_head_t *bh_waitq_head(struct buffer_head *bh); void wake_up_buffer(struct buffer_head *bh); int fsync_bdev(struct block_device *); @@ -173,6 +174,8 @@ void __bforget(struct buffer_head *); void __breadahead(struct block_device *, sector_t block, int size); struct buffer_head *__bread(struct block_device *, sector_t block, int size); +struct buffer_head *__bread_wq(struct block_device *, sector_t block, + int size, wait_queue_t *wait); struct buffer_head *alloc_buffer_head(int gfp_flags); void free_buffer_head(struct buffer_head * bh); void FASTCALL(unlock_buffer(struct buffer_head *bh)); @@ -207,12 +210,6 @@ int nobh_commit_write(struct file *, struct page *, unsigned, unsigned); int nobh_truncate_page(struct address_space *, loff_t); -#define OSYNC_METADATA (1<<0) -#define OSYNC_DATA (1<<1) -#define OSYNC_INODE (1<<2) -int generic_osync_inode(struct inode *, int); - - /* * inline definitions */ @@ -230,13 +227,13 @@ static inline void brelse(struct buffer_head *bh) { - if (bh) + if (bh && !IS_ERR(bh)) __brelse(bh); } static inline void bforget(struct buffer_head *bh) { - if (bh) + if (bh && !IS_ERR(bh)) __bforget(bh); } @@ -253,7 +250,12 @@ } static inline struct buffer_head * -sb_getblk(struct super_block *sb, sector_t block) +sb_bread_wq(struct super_block *sb, sector_t block, wait_queue_t *wait) +{ + return __bread_wq(sb->s_bdev, block, sb->s_blocksize, wait); +} + +static inline struct buffer_head *sb_getblk(struct super_block *sb, sector_t block) { return __getblk(sb->s_bdev, block, sb->s_blocksize); } @@ -277,16 +279,34 @@ * __wait_on_buffer() just to trip a debug check. Because debug code in inline * functions is bloaty. 
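The hunk below converts the buffer_head waiters to the aio retry convention: each *_wq helper takes an explicit wait queue entry, where NULL means "block as before" and a kiocb's asynchronous entry makes a would-block surface as -EIOCBRETRY instead of sleeping. A hypothetical caller, glossing over the bookkeeping a real retry path needs (names invented):

    static int read_block(struct buffer_head *bh, wait_queue_t *wait)
    {
            int err;

            err = lock_buffer_wq(bh, wait);  /* -EIOCBRETRY if async */
            if (err)
                    return err;
            submit_bh(READ, bh);             /* issue the I/O */
            return wait_on_buffer_wq(bh, wait);
    }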
*/ -static inline void wait_on_buffer(struct buffer_head *bh) + +static inline int wait_on_buffer_wq(struct buffer_head *bh, wait_queue_t *wait) { if (buffer_locked(bh) || atomic_read(&bh->b_count) == 0) - __wait_on_buffer(bh); + return __wait_on_buffer_wq(bh, wait); + + return 0; +} + +static inline void wait_on_buffer(struct buffer_head *bh) +{ + wait_on_buffer_wq(bh, NULL); +} + +static inline int lock_buffer_wq(struct buffer_head *bh, wait_queue_t *wait) +{ + while (test_set_buffer_locked(bh)) { + int ret = __wait_on_buffer_wq(bh, wait); + if (ret) + return ret; + } + + return 0; } static inline void lock_buffer(struct buffer_head *bh) { - while (test_set_buffer_locked(bh)) - __wait_on_buffer(bh); + lock_buffer_wq(bh, NULL); } #endif /* _LINUX_BUFFER_HEAD_H */ --- diff/include/linux/cdrom.h 2003-07-08 09:55:19.000000000 +0100 +++ source/include/linux/cdrom.h 2003-11-26 10:09:08.000000000 +0000 @@ -743,6 +743,7 @@ /* per-device flags */ __u8 sanyo_slot : 2; /* Sanyo 3 CD changer support */ __u8 reserved : 6; /* not used yet */ + int for_data; struct cdrom_write_settings write; }; @@ -776,9 +777,9 @@ }; /* the general block_device operations structure: */ -extern int cdrom_open(struct cdrom_device_info *, struct inode *, struct file *); -extern int cdrom_release(struct cdrom_device_info *, struct file *); -extern int cdrom_ioctl(struct cdrom_device_info *, struct inode *, unsigned, unsigned long); +extern int cdrom_open(struct cdrom_device_info *, struct block_device *, struct file *); +extern int cdrom_release(struct cdrom_device_info *); +extern int cdrom_ioctl(struct cdrom_device_info *, struct block_device *, unsigned, unsigned long); extern int cdrom_media_changed(struct cdrom_device_info *); extern int register_cdrom(struct cdrom_device_info *cdi); --- diff/include/linux/compat.h 2003-10-09 09:47:34.000000000 +0100 +++ source/include/linux/compat.h 2003-11-26 10:09:08.000000000 +0000 @@ -88,7 +88,7 @@ __u32 f_namelen; __u32 f_frsize; __u32 f_spare[5]; -}; +} __attribute__((packed)); struct compat_dirent { u32 d_ino; --- diff/include/linux/compat_ioctl.h 2003-10-09 09:47:34.000000000 +0100 +++ source/include/linux/compat_ioctl.h 2003-11-26 10:09:08.000000000 +0000 @@ -678,3 +678,10 @@ COMPATIBLE_IOCTL(NBD_PRINT_DEBUG) COMPATIBLE_IOCTL(NBD_SET_SIZE_BLOCKS) COMPATIBLE_IOCTL(NBD_DISCONNECT) +/* i2c */ +COMPATIBLE_IOCTL(I2C_SLAVE) +COMPATIBLE_IOCTL(I2C_SLAVE_FORCE) +COMPATIBLE_IOCTL(I2C_TENBIT) +COMPATIBLE_IOCTL(I2C_PEC) +COMPATIBLE_IOCTL(I2C_RETRIES) +COMPATIBLE_IOCTL(I2C_TIMEOUT) --- diff/include/linux/compiler-gcc.h 2003-10-09 09:47:17.000000000 +0100 +++ source/include/linux/compiler-gcc.h 2003-11-26 10:09:08.000000000 +0000 @@ -13,5 +13,5 @@ shouldn't recognize the original var, and make assumptions about it */ #define RELOC_HIDE(ptr, off) \ ({ unsigned long __ptr; \ - __asm__ ("" : "=g"(__ptr) : "0"(ptr)); \ + __asm__ ("" : "=r"(__ptr) : "0"(ptr)); \ (typeof(ptr)) (__ptr + (off)); }) --- diff/include/linux/config.h 2003-10-09 09:47:17.000000000 +0100 +++ source/include/linux/config.h 2003-11-26 10:09:08.000000000 +0000 @@ -2,5 +2,8 @@ #define _LINUX_CONFIG_H #include <linux/autoconf.h> +#if defined(__i386__) && !defined(IN_BOOTLOADER) +#include <asm/kgdb.h> +#endif #endif --- diff/include/linux/cpumask.h 2003-08-26 10:00:54.000000000 +0100 +++ source/include/linux/cpumask.h 2003-11-26 10:09:08.000000000 +0000 @@ -53,19 +53,35 @@ static inline int next_online_cpu(int cpu, cpumask_t map) { do - cpu = next_cpu_const(cpu, map); + cpu = next_cpu_const(cpu, mk_cpumask_const(map)); while 
(cpu < NR_CPUS && !cpu_online(cpu)); return cpu; } #define for_each_cpu(cpu, map) \ - for (cpu = first_cpu_const(map); \ + for (cpu = first_cpu_const(mk_cpumask_const(map)); \ cpu < NR_CPUS; \ - cpu = next_cpu_const(cpu,map)) + cpu = next_cpu_const(cpu,mk_cpumask_const(map))) #define for_each_online_cpu(cpu, map) \ - for (cpu = first_cpu_const(map); \ + for (cpu = first_cpu_const(mk_cpumask_const(map)); \ cpu < NR_CPUS; \ cpu = next_online_cpu(cpu,map)) +static inline int format_cpumask(char *buf, cpumask_t cpus) +{ + int k, len = 0; + + for (k = sizeof(cpumask_t)/sizeof(long) - 1; k >= 0; --k) { + int m; + cpumask_t tmp; + + cpus_shift_right(tmp, cpus, BITS_PER_LONG*k); + m = sprintf(buf, "%0*lx", (int)(2*sizeof(long)), cpus_coerce(tmp)); + len += m; + buf += m; + } + return len; +} + #endif /* __LINUX_CPUMASK_H */ --- diff/include/linux/efi.h 2003-08-20 14:16:34.000000000 +0100 +++ source/include/linux/efi.h 2003-11-26 10:09:08.000000000 +0000 @@ -16,6 +16,8 @@ #include <linux/time.h> #include <linux/types.h> #include <linux/proc_fs.h> +#include <linux/rtc.h> +#include <linux/ioport.h> #include <asm/page.h> #include <asm/system.h> @@ -77,18 +79,23 @@ #define EFI_MAX_MEMORY_TYPE 14 /* Attribute values: */ -#define EFI_MEMORY_UC 0x0000000000000001 /* uncached */ -#define EFI_MEMORY_WC 0x0000000000000002 /* write-coalescing */ -#define EFI_MEMORY_WT 0x0000000000000004 /* write-through */ -#define EFI_MEMORY_WB 0x0000000000000008 /* write-back */ -#define EFI_MEMORY_WP 0x0000000000001000 /* write-protect */ -#define EFI_MEMORY_RP 0x0000000000002000 /* read-protect */ -#define EFI_MEMORY_XP 0x0000000000004000 /* execute-protect */ -#define EFI_MEMORY_RUNTIME 0x8000000000000000 /* range requires runtime mapping */ +#define EFI_MEMORY_UC ((u64)0x0000000000000001ULL) /* uncached */ +#define EFI_MEMORY_WC ((u64)0x0000000000000002ULL) /* write-coalescing */ +#define EFI_MEMORY_WT ((u64)0x0000000000000004ULL) /* write-through */ +#define EFI_MEMORY_WB ((u64)0x0000000000000008ULL) /* write-back */ +#define EFI_MEMORY_WP ((u64)0x0000000000001000ULL) /* write-protect */ +#define EFI_MEMORY_RP ((u64)0x0000000000002000ULL) /* read-protect */ +#define EFI_MEMORY_XP ((u64)0x0000000000004000ULL) /* execute-protect */ +#define EFI_MEMORY_RUNTIME ((u64)0x8000000000000000ULL) /* range requires runtime mapping */ #define EFI_MEMORY_DESCRIPTOR_VERSION 1 #define EFI_PAGE_SHIFT 12 +/* + * For current x86 implementations of EFI, there is + * additional padding in the mem descriptors. This is not + * the case in ia64. Need to have this fixed in the f/w. 
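Because the map is walked as a flat array, the i386 pad added below is what keeps sizeof(efi_memory_desc_t) in step with the descriptor stride the firmware actually wrote. A sketch of such a walk, using the struct efi_memory_map declared further down in this hunk (the inspection body is left empty):

    static void walk_memmap(void)
    {
            int i;

            for (i = 0; i < memmap.nr_map; i++) {
                    efi_memory_desc_t *md = &memmap.map[i];
                    /* inspect md->type, md->phys_addr, md->num_pages */
            }
    }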
+ */ typedef struct { u32 type; u32 pad; @@ -96,6 +103,9 @@ u64 virt_addr; u64 num_pages; u64 attribute; +#if defined (__i386__) + u64 pad1; +#endif } efi_memory_desc_t; typedef int efi_freemem_callback_t (unsigned long start, unsigned long end, void *arg); @@ -132,11 +142,12 @@ */ #define EFI_RESET_COLD 0 #define EFI_RESET_WARM 1 +#define EFI_RESET_SHUTDOWN 2 /* * EFI Runtime Services table */ -#define EFI_RUNTIME_SERVICES_SIGNATURE 0x5652453544e5552 +#define EFI_RUNTIME_SERVICES_SIGNATURE ((u64)0x5652453544e5552ULL) #define EFI_RUNTIME_SERVICES_REVISION 0x00010000 typedef struct { @@ -169,6 +180,10 @@ typedef efi_status_t efi_get_next_high_mono_count_t (u32 *count); typedef void efi_reset_system_t (int reset_type, efi_status_t status, unsigned long data_size, efi_char16_t *data); +typedef efi_status_t efi_set_virtual_address_map_t (unsigned long memory_map_size, + unsigned long descriptor_size, + u32 descriptor_version, + efi_memory_desc_t *virtual_map); /* * EFI Configuration Table and GUID definitions @@ -194,12 +209,15 @@ #define HCDP_TABLE_GUID \ EFI_GUID( 0xf951938d, 0x620b, 0x42ef, 0x82, 0x79, 0xa8, 0x4b, 0x79, 0x61, 0x78, 0x98 ) +#define UGA_IO_PROTOCOL_GUID \ + EFI_GUID( 0x61a4d49e, 0x6f68, 0x4f1b, 0xb9, 0x22, 0xa8, 0x6e, 0xed, 0xb, 0x7, 0xa2 ) + typedef struct { efi_guid_t guid; unsigned long table; } efi_config_table_t; -#define EFI_SYSTEM_TABLE_SIGNATURE 0x5453595320494249 +#define EFI_SYSTEM_TABLE_SIGNATURE ((u64)0x5453595320494249ULL) #define EFI_SYSTEM_TABLE_REVISION ((1 << 16) | 00) typedef struct { @@ -218,6 +236,13 @@ unsigned long tables; } efi_system_table_t; +struct efi_memory_map { + efi_memory_desc_t *phys_map; + efi_memory_desc_t *map; + int nr_map; + unsigned long desc_version; +}; + /* * All runtime access to EFI goes through this structure: */ @@ -230,6 +255,7 @@ void *sal_systab; /* SAL system table */ void *boot_info; /* boot info table */ void *hcdp; /* HCDP table */ + void *uga; /* UGA table */ efi_get_time_t *get_time; efi_set_time_t *set_time; efi_get_wakeup_time_t *get_wakeup_time; @@ -239,6 +265,7 @@ efi_set_variable_t *set_variable; efi_get_next_high_mono_count_t *get_next_high_mono_count; efi_reset_system_t *reset_system; + efi_set_virtual_address_map_t *set_virtual_address_map; } efi; static inline int @@ -260,12 +287,25 @@ extern void efi_init (void); extern void efi_map_pal_code (void); +extern void efi_map_memmap(void); extern void efi_memmap_walk (efi_freemem_callback_t callback, void *arg); extern void efi_gettimeofday (struct timespec *ts); extern void efi_enter_virtual_mode (void); /* switch EFI to virtual mode, if possible */ extern u64 efi_get_iobase (void); extern u32 efi_mem_type (unsigned long phys_addr); extern u64 efi_mem_attributes (unsigned long phys_addr); +extern void efi_initialize_iomem_resources(struct resource *code_resource, + struct resource *data_resource); +extern efi_status_t phys_efi_get_time(efi_time_t *tm, efi_time_cap_t *tc); +extern unsigned long inline __init efi_get_time(void); +extern int inline __init efi_set_rtc_mmss(unsigned long nowtime); +extern struct efi_memory_map memmap; + +#ifdef CONFIG_EFI +extern int efi_enabled; +#else +#define efi_enabled 0 +#endif /* * Variable Attributes --- diff/include/linux/elevator.h 2003-10-09 09:47:17.000000000 +0100 +++ source/include/linux/elevator.h 2003-11-26 10:09:08.000000000 +0000 @@ -94,6 +94,11 @@ */ extern elevator_t iosched_as; +/* + * completely fair queueing I/O scheduler + */ +extern elevator_t iosched_cfq; + extern int elevator_init(request_queue_t *, elevator_t 
*); extern void elevator_exit(request_queue_t *); extern inline int elv_rq_merge_ok(struct request *, struct bio *); --- diff/include/linux/errno.h 2003-10-09 09:47:17.000000000 +0100 +++ source/include/linux/errno.h 2003-11-26 10:09:08.000000000 +0000 @@ -22,6 +22,7 @@ #define EBADTYPE 527 /* Type not supported by server */ #define EJUKEBOX 528 /* Request initiated, but will not complete before timeout */ #define EIOCBQUEUED 529 /* iocb queued, will get completion event */ +#define EIOCBRETRY 530 /* iocb queued, will trigger a retry */ #endif --- diff/include/linux/fs.h 2003-10-09 09:47:17.000000000 +0100 +++ source/include/linux/fs.h 2003-11-26 10:09:08.000000000 +0000 @@ -369,6 +369,7 @@ struct inode { struct hlist_node i_hash; struct list_head i_list; + struct list_head i_sb_list; struct list_head i_dentry; unsigned long i_ino; atomic_t i_count; @@ -388,6 +389,7 @@ unsigned short i_bytes; spinlock_t i_lock; /* i_blocks, i_bytes, maybe i_size */ struct semaphore i_sem; + struct rw_semaphore i_alloc_sem; struct inode_operations *i_op; struct file_operations *i_fop; /* former ->i_op->default_file_ops */ struct super_block *i_sb; @@ -480,6 +482,8 @@ return MAJOR(inode->i_rdev); } +extern struct block_device *I_BDEV(struct inode *inode); + struct fown_struct { rwlock_t lock; /* protects pid, uid, euid fields */ int pid; /* pid or -pgrp where SIGIO should be sent */ @@ -526,6 +530,7 @@ /* Used by fs/eventpoll.c to link all the hooks to this file */ struct list_head f_ep_links; spinlock_t f_ep_lock; + struct address_space *f_mapping; }; extern spinlock_t files_lock; #define file_list_lock() spin_lock(&files_lock); @@ -687,6 +692,7 @@ atomic_t s_active; void *s_security; + struct list_head s_inodes; /* all inodes */ struct list_head s_dirty; /* dirty inodes */ struct list_head s_io; /* parked for writeback */ struct hlist_head s_anon; /* anonymous dentries for (nfs) exporting */ @@ -749,6 +755,11 @@ #define DT_SOCK 12 #define DT_WHT 14 +#define OSYNC_METADATA (1<<0) +#define OSYNC_DATA (1<<1) +#define OSYNC_INODE (1<<2) +int generic_osync_inode(struct inode *, struct address_space *, int); + /* * This is the "filldir" function type, used by readdir() to let * the kernel specify what kind of dirent layout it wants to have. 
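The next hunk re-types the block_device_operations methods to take the objects they actually operate on rather than an inode/file pair. A made-up driver would adapt along these lines (mydisk_spin_up/down are assumed helpers):

    static int mydisk_open(struct block_device *bdev, struct file *filp)
    {
            struct mydisk *md = bdev->bd_disk->private_data;

            return mydisk_spin_up(md);
    }

    static int mydisk_release(struct gendisk *disk)
    {
            mydisk_spin_down(disk->private_data);
            return 0;
    }

    static struct block_device_operations mydisk_fops = {
            .owner   = THIS_MODULE,
            .open    = mydisk_open,
            .release = mydisk_release,
    };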
@@ -758,9 +769,9 @@ typedef int (*filldir_t)(void *, const char *, int, loff_t, ino_t, unsigned); struct block_device_operations { - int (*open) (struct inode *, struct file *); - int (*release) (struct inode *, struct file *); - int (*ioctl) (struct inode *, struct file *, unsigned, unsigned long); + int (*open) (struct block_device *, struct file *); + int (*release) (struct gendisk *); + int (*ioctl) (struct block_device *, struct file *, unsigned, unsigned long); int (*media_changed) (struct gendisk *); int (*revalidate_disk) (struct gendisk *); struct module *owner; @@ -1123,11 +1134,9 @@ extern int register_blkdev(unsigned int, const char *); extern int unregister_blkdev(unsigned int, const char *); extern struct block_device *bdget(dev_t); -extern int bd_acquire(struct inode *inode); extern void bd_forget(struct inode *inode); extern void bdput(struct block_device *); extern int blkdev_open(struct inode *, struct file *); -extern int blkdev_close(struct inode *, struct file *); extern struct block_device *open_by_devnum(dev_t, unsigned, int); extern struct file_operations def_blk_fops; extern struct address_space_operations def_blk_aops; @@ -1135,7 +1144,7 @@ extern struct file_operations bad_sock_fops; extern struct file_operations def_fifo_fops; extern int ioctl_by_bdev(struct block_device *, unsigned, unsigned long); -extern int blkdev_ioctl(struct inode *, struct file *, unsigned, unsigned long); +extern int blkdev_ioctl(struct block_device *, struct file *, unsigned, unsigned long); extern int blkdev_get(struct block_device *, mode_t, unsigned, int); extern int blkdev_put(struct block_device *, int); extern int bd_claim(struct block_device *, void *); @@ -1202,6 +1211,7 @@ extern int filemap_fdatawrite(struct address_space *); extern int filemap_flush(struct address_space *); extern int filemap_fdatawait(struct address_space *); +extern int filemap_write_and_wait(struct address_space *mapping); extern void sync_supers(void); extern void sync_filesystems(int wait); extern void emergency_sync(void); @@ -1295,8 +1305,7 @@ extern int file_read_actor(read_descriptor_t * desc, struct page *page, unsigned long offset, unsigned long size); extern int file_send_actor(read_descriptor_t * desc, struct page *page, unsigned long offset, unsigned long size); extern ssize_t generic_file_read(struct file *, char __user *, size_t, loff_t *); -int generic_write_checks(struct inode *inode, struct file *file, - loff_t *pos, size_t *count, int isblk); +int generic_write_checks(struct file *file, loff_t *pos, size_t *count, int isblk); extern ssize_t generic_file_write(struct file *, const char __user *, size_t, loff_t *); extern ssize_t generic_file_aio_read(struct kiocb *, char __user *, size_t, loff_t); extern ssize_t __generic_file_aio_read(struct kiocb *, const struct iovec *, unsigned long, loff_t *); @@ -1314,9 +1323,6 @@ file_ra_state_init(struct file_ra_state *ra, struct address_space *mapping); extern ssize_t generic_file_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov, loff_t offset, unsigned long nr_segs); -extern int blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode, - struct block_device *bdev, const struct iovec *iov, loff_t offset, - unsigned long nr_segs, get_blocks_t *get_blocks, dio_iodone_t *end_io); extern ssize_t generic_file_readv(struct file *filp, const struct iovec *iov, unsigned long nr_segs, loff_t *ppos); ssize_t generic_file_writev(struct file *filp, const struct iovec *iov, @@ -1330,7 +1336,7 @@ read_descriptor_t * desc, read_actor_t 
actor) { - do_generic_mapping_read(filp->f_dentry->d_inode->i_mapping, + do_generic_mapping_read(filp->f_mapping, &filp->f_ra, filp, ppos, @@ -1338,6 +1344,32 @@ actor); } +int __blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode, + struct block_device *bdev, const struct iovec *iov, loff_t offset, + unsigned long nr_segs, get_blocks_t get_blocks, dio_iodone_t end_io, + int needs_special_locking); + +/* + * For filesystems which need locking between buffered and direct access + */ +static inline int blockdev_direct_IO(int rw, struct kiocb *iocb, + struct inode *inode, struct block_device *bdev, const struct iovec *iov, + loff_t offset, unsigned long nr_segs, get_blocks_t get_blocks, + dio_iodone_t end_io) +{ + return __blockdev_direct_IO(rw, iocb, inode, bdev, iov, offset, + nr_segs, get_blocks, end_io, 1); +} + +static inline int blockdev_direct_IO_no_locking(int rw, struct kiocb *iocb, + struct inode *inode, struct block_device *bdev, const struct iovec *iov, + loff_t offset, unsigned long nr_segs, get_blocks_t get_blocks, + dio_iodone_t end_io) +{ + return __blockdev_direct_IO(rw, iocb, inode, bdev, iov, offset, + nr_segs, get_blocks, end_io, 0); +} + extern struct file_operations generic_ro_fops; #define special_file(m) (S_ISCHR(m)||S_ISBLK(m)||S_ISFIFO(m)||S_ISSOCK(m)) --- diff/include/linux/i2c-dev.h 2003-06-30 10:07:24.000000000 +0100 +++ source/include/linux/i2c-dev.h 2003-11-26 10:09:08.000000000 +0000 @@ -43,4 +43,6 @@ __u32 nmsgs; /* number of i2c_msgs */ }; +#define I2C_RDRW_IOCTL_MAX_MSGS 42 + #endif /* _LINUX_I2C_DEV_H */ --- diff/include/linux/ide.h 2003-10-27 09:20:39.000000000 +0000 +++ source/include/linux/ide.h 2003-11-26 10:09:08.000000000 +0000 @@ -1693,6 +1693,8 @@ #define GOOD_DMA_DRIVE 1 #ifdef CONFIG_BLK_DEV_IDEDMA_PCI +extern int ide_build_sglist(ide_drive_t *, struct request *); +extern int ide_raw_build_sglist(ide_drive_t *, struct request *); extern int ide_build_dmatable(ide_drive_t *, struct request *); extern void ide_destroy_dmatable(ide_drive_t *); extern ide_startstop_t ide_dma_intr(ide_drive_t *); --- diff/include/linux/init_task.h 2003-10-09 09:47:17.000000000 +0100 +++ source/include/linux/init_task.h 2003-11-26 10:09:08.000000000 +0000 @@ -108,6 +108,7 @@ .proc_lock = SPIN_LOCK_UNLOCKED, \ .switch_lock = SPIN_LOCK_UNLOCKED, \ .journal_info = NULL, \ + .io_wait = NULL, \ } --- diff/include/linux/input.h 2003-09-30 15:46:20.000000000 +0100 +++ source/include/linux/input.h 2003-11-26 10:09:08.000000000 +0000 @@ -870,6 +870,7 @@ char *name; struct input_device_id *id_table; + struct input_device_id *blacklist; struct list_head h_list; struct list_head node; --- diff/include/linux/kernel.h 2003-10-27 09:20:44.000000000 +0000 +++ source/include/linux/kernel.h 2003-11-26 10:09:08.000000000 +0000 @@ -87,6 +87,8 @@ asmlinkage int printk(const char * fmt, ...) __attribute__ ((format (printf, 1, 2))); +unsigned long int_sqrt(unsigned long); + static inline void console_silent(void) { console_loglevel = 0; --- diff/include/linux/keyboard.h 2003-02-26 16:00:55.000000000 +0000 +++ source/include/linux/keyboard.h 2003-11-26 10:09:08.000000000 +0000 @@ -2,7 +2,6 @@ #define __LINUX_KEYBOARD_H #include <linux/wait.h> -#include <linux/input.h> #define KG_SHIFT 0 #define KG_CTRL 2 @@ -17,7 +16,7 @@ #define NR_SHIFT 9 -#define NR_KEYS (KEY_MAX+1) +#define NR_KEYS 255 #define MAX_NR_KEYMAPS 256 /* This means 128Kb if all keymaps are allocated. Only the superuser may increase the number of keymaps beyond MAX_NR_OF_USER_KEYMAPS. 
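Returning to the fs.h hunk above: the two blockdev_direct_IO wrappers differ only in the needs_special_locking flag they pass. A filesystem that already serializes buffered against direct writers itself would pick the unlocked flavour; a hypothetical example (myfs_get_blocks is an assumed get_blocks_t):

    static ssize_t myfs_direct_IO(int rw, struct kiocb *iocb,
                    const struct iovec *iov, loff_t offset,
                    unsigned long nr_segs)
    {
            struct inode *inode = iocb->ki_filp->f_mapping->host;

            return blockdev_direct_IO_no_locking(rw, iocb, inode,
                            inode->i_sb->s_bdev, iov, offset, nr_segs,
                            myfs_get_blocks, NULL);
    }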
*/ --- diff/include/linux/libata.h 2003-11-25 15:24:59.000000000 +0000 +++ source/include/linux/libata.h 2003-11-26 10:09:08.000000000 +0000 @@ -310,6 +310,7 @@ struct ata_ioports ioaddr; /* ATA cmd/ctl/dma register blocks */ u8 ctl; /* cache of ATA control register */ + u8 last_ctl; /* Cache last written value */ unsigned int bus_state; unsigned int port_state; unsigned int pio_mask; @@ -522,12 +523,12 @@ struct ata_ioports *ioaddr = &ap->ioaddr; ap->ctl &= ~ATA_NIEN; + ap->last_ctl = ap->ctl; if (ap->flags & ATA_FLAG_MMIO) writeb(ap->ctl, ioaddr->ctl_addr); else outb(ap->ctl, ioaddr->ctl_addr); - return ata_wait_idle(ap); } --- diff/include/linux/list.h 2003-10-09 09:47:34.000000000 +0100 +++ source/include/linux/list.h 2003-11-26 10:09:08.000000000 +0000 @@ -142,8 +142,11 @@ * Note: list_empty on entry does not return true after this, the entry is * in an undefined state. */ +#include <linux/kernel.h> /* BUG_ON */ static inline void list_del(struct list_head *entry) { + BUG_ON(entry->prev->next != entry); + BUG_ON(entry->next->prev != entry); __list_del(entry->prev, entry->next); entry->next = LIST_POISON1; entry->prev = LIST_POISON2; --- diff/include/linux/loop.h 2003-08-20 14:16:34.000000000 +0100 +++ source/include/linux/loop.h 2003-11-26 10:09:08.000000000 +0000 @@ -34,8 +34,9 @@ loff_t lo_sizelimit; int lo_flags; int (*transfer)(struct loop_device *, int cmd, - char *raw_buf, char *loop_buf, int size, - sector_t real_block); + struct page *raw_page, unsigned raw_off, + struct page *loop_page, unsigned loop_off, + int size, sector_t real_block); char lo_file_name[LO_NAME_SIZE]; char lo_crypt_name[LO_NAME_SIZE]; char lo_encrypt_key[LO_KEY_SIZE]; @@ -70,8 +71,7 @@ /* * Loop flags */ -#define LO_FLAGS_DO_BMAP 1 -#define LO_FLAGS_READ_ONLY 2 +#define LO_FLAGS_READ_ONLY 1 #include <asm/posix_types.h> /* for __kernel_old_dev_t */ #include <asm/types.h> /* for __u64 */ @@ -128,8 +128,10 @@ /* Support for loadable transfer modules */ struct loop_func_table { int number; /* filter type */ - int (*transfer)(struct loop_device *lo, int cmd, char *raw_buf, - char *loop_buf, int size, sector_t real_block); + int (*transfer)(struct loop_device *lo, int cmd, + struct page *raw_page, unsigned raw_off, + struct page *loop_page, unsigned loop_off, + int size, sector_t real_block); int (*init)(struct loop_device *, const struct loop_info64 *); /* release is called from loop_unregister_transfer or clr_fd */ int (*release)(struct loop_device *); --- diff/include/linux/mm.h 2003-10-09 09:47:34.000000000 +0100 +++ source/include/linux/mm.h 2003-11-26 10:09:08.000000000 +0000 @@ -143,7 +143,7 @@ struct vm_operations_struct { void (*open)(struct vm_area_struct * area); void (*close)(struct vm_area_struct * area); - struct page * (*nopage)(struct vm_area_struct * area, unsigned long address, int unused); + struct page * (*nopage)(struct vm_area_struct * area, unsigned long address, int *type); int (*populate)(struct vm_area_struct * area, unsigned long address, unsigned long len, pgprot_t prot, unsigned long pgoff, int nonblock); }; @@ -322,8 +322,10 @@ /* * The zone field is never updated after free_area_init_core() * sets it, so none of the operations on it need to be atomic. + * We'll have up to log2(MAX_NUMNODES * MAX_NR_ZONES) zones + * total, so we use NODES_SHIFT here to get enough bits. 
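To make the arithmetic in that zone comment concrete (illustrative config values, not from the patch): a 64-bit kernel with NODES_SHIFT = 6 and MAX_NR_ZONES_SHIFT = 2 gets

    ZONE_SHIFT = BITS_PER_LONG - NODES_SHIFT - MAX_NR_ZONES_SHIFT
               = 64 - 6 - 2 = 56

so bits 56..63 of page->flags index zone_table (roughly node << MAX_NR_ZONES_SHIFT | zone) while bits 0..55 stay free for page flags. The old hard-coded form below reserves only 8 bits, which stops being enough once MAX_NUMNODES * MAX_NR_ZONES exceeds 256, hence the configurable shift.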
*/ -#define ZONE_SHIFT (BITS_PER_LONG - 8) +#define ZONE_SHIFT (BITS_PER_LONG - NODES_SHIFT - MAX_NR_ZONES_SHIFT) struct zone; extern struct zone *zone_table[]; @@ -405,7 +407,7 @@ extern void show_free_areas(void); struct page *shmem_nopage(struct vm_area_struct * vma, - unsigned long address, int unused); + unsigned long address, int *type); struct file *shmem_file_setup(char * name, loff_t size, unsigned long flags); void shmem_lock(struct file * file, int lock); int shmem_zero_setup(struct vm_area_struct *); @@ -563,7 +565,7 @@ extern void truncate_inode_pages(struct address_space *, loff_t); /* generic vm_area_ops exported for stackable file systems */ -extern struct page *filemap_nopage(struct vm_area_struct *, unsigned long, int); +struct page *filemap_nopage(struct vm_area_struct *, unsigned long, int *); /* mm/page-writeback.c */ int write_one_page(struct page *page, int wait); --- diff/include/linux/mmzone.h 2003-10-09 09:47:34.000000000 +0100 +++ source/include/linux/mmzone.h 2003-11-26 10:09:08.000000000 +0000 @@ -159,7 +159,10 @@ #define ZONE_DMA 0 #define ZONE_NORMAL 1 #define ZONE_HIGHMEM 2 -#define MAX_NR_ZONES 3 + +#define MAX_NR_ZONES 3 /* Sync this with MAX_NR_ZONES_SHIFT */ +#define MAX_NR_ZONES_SHIFT 2 /* ceil(log2(MAX_NR_ZONES)) */ + #define GFP_ZONEMASK 0x03 /* @@ -284,8 +287,6 @@ struct file; int min_free_kbytes_sysctl_handler(struct ctl_table *, int, struct file *, void *, size_t *); -extern void setup_per_zone_pages_min(void); - #ifdef CONFIG_NUMA #define MAX_NR_MEMBLKS BITS_PER_LONG /* Max number of Memory Blocks */ --- diff/include/linux/netdevice.h 2003-11-25 15:24:59.000000000 +0000 +++ source/include/linux/netdevice.h 2003-11-26 10:09:08.000000000 +0000 @@ -456,6 +456,13 @@ /* bridge stuff */ struct net_bridge_port *br_port; +#ifdef CONFIG_KGDB + int kgdb_is_trapped; +#endif +#ifdef CONFIG_NET_POLL_CONTROLLER + void (*poll_controller)(struct net_device *); +#endif + #ifdef CONFIG_NET_FASTROUTE #define NETDEV_FASTROUTE_HMASK 0xF /* Semi-private data. Keep it at the end of device struct. 
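The poll_controller hook added in the netdevice.h hunk above gives the stack a way to ask a driver to pump its receive path when interrupts cannot be trusted (kgdb-over-ethernet being the consumer in this tree). The conventional NIC-driver implementation, with invented names:

    #ifdef CONFIG_NET_POLL_CONTROLLER
    static void mynic_poll_controller(struct net_device *dev)
    {
            disable_irq(dev->irq);
            mynic_interrupt(dev->irq, dev, NULL);   /* the ISR, reused */
            enable_irq(dev->irq);
    }
    #endif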
*/ @@ -533,6 +540,11 @@ extern struct net_device *dev_get_by_index(int ifindex); extern struct net_device *__dev_get_by_index(int ifindex); extern int dev_restart(struct net_device *dev); +#ifdef CONFIG_KGDB +extern int kgdb_eth_is_trapped(void); +extern int kgdb_net_interrupt(struct sk_buff *skb); +extern void kgdb_send_arp_request(void); +#endif typedef int gifconf_func_t(struct net_device * dev, char * bufptr, int len); extern int register_gifconf(unsigned int family, gifconf_func_t * gifconf); @@ -591,12 +603,22 @@ static inline void netif_wake_queue(struct net_device *dev) { +#ifdef CONFIG_KGDB + if (kgdb_eth_is_trapped()) { + return; + } +#endif if (test_and_clear_bit(__LINK_STATE_XOFF, &dev->state)) __netif_schedule(dev); } static inline void netif_stop_queue(struct net_device *dev) { +#ifdef CONFIG_KGDB + if (kgdb_eth_is_trapped()) { + return; + } +#endif set_bit(__LINK_STATE_XOFF, &dev->state); } --- diff/include/linux/pagemap.h 2003-10-09 09:47:17.000000000 +0100 +++ source/include/linux/pagemap.h 2003-11-26 10:09:08.000000000 +0000 @@ -8,7 +8,6 @@ #include <linux/fs.h> #include <linux/list.h> #include <linux/highmem.h> -#include <linux/pagemap.h> #include <asm/uaccess.h> #include <linux/gfp.h> @@ -71,7 +70,7 @@ extern struct page * find_or_create_page(struct address_space *mapping, unsigned long index, unsigned int gfp_mask); extern unsigned int find_get_pages(struct address_space *mapping, - pgoff_t start, unsigned int nr_pages, + pgoff_t *next, unsigned int nr_pages, struct page **pages); /* @@ -153,17 +152,27 @@ extern void FASTCALL(__lock_page(struct page *page)); extern void FASTCALL(unlock_page(struct page *page)); -static inline void lock_page(struct page *page) + +extern int FASTCALL(__lock_page_wq(struct page *page, wait_queue_t *wait)); +static inline int lock_page_wq(struct page *page, wait_queue_t *wait) { if (TestSetPageLocked(page)) - __lock_page(page); + return __lock_page_wq(page, wait); + else + return 0; +} + +static inline void lock_page(struct page *page) +{ + lock_page_wq(page, NULL); } /* * This is exported only for wait_on_page_locked/wait_on_page_writeback. * Never use this directly! */ -extern void FASTCALL(wait_on_page_bit(struct page *page, int bit_nr)); +extern int FASTCALL(wait_on_page_bit_wq(struct page *page, int bit_nr, + wait_queue_t *wait)); /* * Wait for a page to be unlocked. @@ -172,19 +181,33 @@ * ie with increased "page->count" so that the page won't * go away during the wait.. 
*/ -static inline void wait_on_page_locked(struct page *page) +static inline int wait_on_page_locked_wq(struct page *page, wait_queue_t *wait) { if (PageLocked(page)) - wait_on_page_bit(page, PG_locked); + return wait_on_page_bit_wq(page, PG_locked, wait); + return 0; +} + +static inline int wait_on_page_writeback_wq(struct page *page, + wait_queue_t *wait) +{ + if (PageWriteback(page)) + return wait_on_page_bit_wq(page, PG_writeback, wait); + return 0; +} + +static inline void wait_on_page_locked(struct page *page) +{ + wait_on_page_locked_wq(page, NULL); } /* * Wait for a page to complete writeback */ + static inline void wait_on_page_writeback(struct page *page) { - if (PageWriteback(page)) - wait_on_page_bit(page, PG_writeback); + wait_on_page_writeback_wq(page, NULL); } extern void end_page_writeback(struct page *page); --- diff/include/linux/pagevec.h 2003-10-09 09:47:17.000000000 +0100 +++ source/include/linux/pagevec.h 2003-11-26 10:09:08.000000000 +0000 @@ -23,7 +23,7 @@ void __pagevec_lru_add_active(struct pagevec *pvec); void pagevec_strip(struct pagevec *pvec); unsigned int pagevec_lookup(struct pagevec *pvec, struct address_space *mapping, - pgoff_t start, unsigned int nr_pages); + pgoff_t *next, unsigned int nr_pages); static inline void pagevec_init(struct pagevec *pvec, int cold) { --- diff/include/linux/parser.h 2003-10-09 09:47:34.000000000 +0100 +++ source/include/linux/parser.h 2003-11-26 10:09:08.000000000 +0000 @@ -1,3 +1,14 @@ +/* + * linux/include/linux/parser.h + * + * Header for lib/parser.c + * Intended use of these functions is parsing filesystem argument lists, + * but could potentially be used anywhere else that simple option=arg + * parsing is required. + */ + + +/* associates an integer enumerator with a pattern string. 
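A typical consumer of the API this header describes, modelled on filesystem mount-option parsing (the option names are invented for the example):

    enum { Opt_uid, Opt_ro, Opt_err };

    static match_table_t tokens = {
            {Opt_uid, "uid=%d"},
            {Opt_ro,  "ro"},
            {Opt_err, NULL}
    };

    static int parse_option(char *p, int *uid, int *readonly)
    {
            substring_t args[MAX_OPT_ARGS];

            switch (match_token(p, tokens, args)) {
            case Opt_uid:
                    return match_int(&args[0], uid);
            case Opt_ro:
                    *readonly = 1;
                    return 0;
            default:
                    return -EINVAL;
            }
    }

match_token() fills args[] with the substrings captured by %d and friends; the match_* helpers then convert them.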
*/ struct match_token { int token; char *pattern; @@ -5,15 +16,16 @@ typedef struct match_token match_table_t[]; +/* Maximum number of arguments that match_token will find in a pattern */ enum {MAX_OPT_ARGS = 3}; +/* Describe the location within a string of a substring */ typedef struct { char *from; char *to; } substring_t; -int match_token(char *s, match_table_t table, substring_t args[]); - +int match_token(char *, match_table_t table, substring_t args[]); int match_int(substring_t *, int *result); int match_octal(substring_t *, int *result); int match_hex(substring_t *, int *result); --- diff/include/linux/pci.h 2003-10-09 09:47:34.000000000 +0100 +++ source/include/linux/pci.h 2003-11-26 10:09:08.000000000 +0000 @@ -36,6 +36,7 @@ #define PCI_COMMAND_WAIT 0x80 /* Enable address/data stepping */ #define PCI_COMMAND_SERR 0x100 /* Enable SERR */ #define PCI_COMMAND_FAST_BACK 0x200 /* Enable back-to-back writes */ +#define PCI_COMMAND_INTX_DISABLE 0x400 /* INTx Emulation Disable */ #define PCI_STATUS 0x06 /* 16 bits */ #define PCI_STATUS_CAP_LIST 0x10 /* Support Capability List */ @@ -198,6 +199,8 @@ #define PCI_CAP_ID_MSI 0x05 /* Message Signalled Interrupts */ #define PCI_CAP_ID_CHSWP 0x06 /* CompactPCI HotSwap */ #define PCI_CAP_ID_PCIX 0x07 /* PCI-X */ +#define PCI_CAP_ID_EXP 0x10 /* PCI Express */ +#define PCI_CAP_ID_MSIX 0x11 /* MSI-X */ #define PCI_CAP_LIST_NEXT 1 /* Next capability in the list */ #define PCI_CAP_FLAGS 2 /* Capability defined flags (16 bits) */ #define PCI_CAP_SIZEOF 4 @@ -275,11 +278,13 @@ #define PCI_MSI_FLAGS_QSIZE 0x70 /* Message queue size configured */ #define PCI_MSI_FLAGS_QMASK 0x0e /* Maximum queue size available */ #define PCI_MSI_FLAGS_ENABLE 0x01 /* MSI feature enabled */ +#define PCI_MSI_FLAGS_MASKBIT 0x100 /* 64-bit mask bits allowed */ #define PCI_MSI_RFU 3 /* Rest of capability flags */ #define PCI_MSI_ADDRESS_LO 4 /* Lower 32 bits */ #define PCI_MSI_ADDRESS_HI 8 /* Upper 32 bits (if PCI_MSI_FLAGS_64BIT set) */ #define PCI_MSI_DATA_32 8 /* 16 bits of data for 32-bit devices */ #define PCI_MSI_DATA_64 12 /* 16 bits of data for 64-bit devices */ +#define PCI_MSI_MASK_BIT 16 /* Mask bits register */ /* CompactPCI Hotswap Register */ @@ -695,6 +700,18 @@ extern struct pci_dev *isa_bridge; #endif +#ifndef CONFIG_PCI_USE_VECTOR +static inline void pci_scan_msi_device(struct pci_dev *dev) {} +static inline int pci_enable_msi(struct pci_dev *dev) {return -1;} +static inline void msi_remove_pci_irq_vectors(struct pci_dev *dev) {} +#else +extern void pci_scan_msi_device(struct pci_dev *dev); +extern int pci_enable_msi(struct pci_dev *dev); +extern void msi_remove_pci_irq_vectors(struct pci_dev *dev); +extern int msi_alloc_vectors(struct pci_dev* dev, int *vector, int nvec); +extern int msi_free_vectors(struct pci_dev* dev, int *vector, int nvec); +#endif + #endif /* CONFIG_PCI */ /* Include architecture-dependent settings and functions */ --- diff/include/linux/pci_ids.h 2003-11-25 15:24:59.000000000 +0000 +++ source/include/linux/pci_ids.h 2003-11-26 10:09:08.000000000 +0000 @@ -900,6 +900,7 @@ #define PCI_VENDOR_ID_SGI 0x10a9 #define PCI_DEVICE_ID_SGI_IOC3 0x0003 +#define PCI_DEVICE_ID_SGI_IOC4 0x100a #define PCI_VENDOR_ID_SGI_LITHIUM 0x1002 #define PCI_VENDOR_ID_ACC 0x10aa @@ -2052,6 +2053,7 @@ #define PCI_DEVICE_ID_INTEL_82443MX_3 0x719b #define PCI_DEVICE_ID_INTEL_82443GX_0 0x71a0 #define PCI_DEVICE_ID_INTEL_82443GX_1 0x71a1 +#define PCI_DEVICE_ID_INTEL_82443GX_2 0x71a2 #define PCI_DEVICE_ID_INTEL_82372FB_0 0x7600 #define PCI_DEVICE_ID_INTEL_82372FB_1 
0x7601 #define PCI_DEVICE_ID_INTEL_82372FB_2 0x7602 --- diff/include/linux/percpu_counter.h 2003-05-21 11:50:16.000000000 +0100 +++ source/include/linux/percpu_counter.h 2003-11-26 10:09:08.000000000 +0000 @@ -8,17 +8,14 @@ #include <linux/spinlock.h> #include <linux/smp.h> #include <linux/threads.h> +#include <linux/percpu.h> #ifdef CONFIG_SMP -struct __percpu_counter { - long count; -} ____cacheline_aligned; - struct percpu_counter { spinlock_t lock; long count; - struct __percpu_counter counters[NR_CPUS]; + long *counters; }; #if NR_CPUS >= 16 @@ -29,12 +26,14 @@ static inline void percpu_counter_init(struct percpu_counter *fbc) { - int i; - spin_lock_init(&fbc->lock); fbc->count = 0; - for (i = 0; i < NR_CPUS; i++) - fbc->counters[i].count = 0; + fbc->counters = alloc_percpu(long); +} + +static inline void percpu_counter_destroy(struct percpu_counter *fbc) +{ + free_percpu(fbc->counters); } void percpu_counter_mod(struct percpu_counter *fbc, long amount); @@ -69,6 +68,10 @@ fbc->count = 0; } +static inline void percpu_counter_destroy(struct percpu_counter *fbc) +{ +} + static inline void percpu_counter_mod(struct percpu_counter *fbc, long amount) { --- diff/include/linux/quota.h 2003-08-20 14:16:34.000000000 +0100 +++ source/include/linux/quota.h 2003-11-26 10:09:08.000000000 +0000 @@ -250,7 +250,7 @@ void (*free_space) (struct inode *, qsize_t); void (*free_inode) (const struct inode *, unsigned long); int (*transfer) (struct inode *, struct iattr *); - int (*sync_dquot) (struct dquot *); + int (*write_dquot) (struct dquot *); }; /* Operations handling requests from userspace */ --- diff/include/linux/sched.h 2003-11-25 15:24:59.000000000 +0000 +++ source/include/linux/sched.h 2003-11-26 10:09:08.000000000 +0000 @@ -151,6 +151,7 @@ extern void show_state(void); extern void show_regs(struct pt_regs *); +extern void show_trace_task(task_t *tsk); /* * TASK is a pointer to the task whose backtrace we want to see (or NULL for current @@ -205,7 +206,6 @@ unsigned long rss, total_vm, locked_vm; unsigned long def_flags; cpumask_t cpu_vm_mask; - unsigned long swap_address; unsigned long saved_auxv[40]; /* for /proc/PID/auxv */ @@ -464,6 +464,13 @@ unsigned long ptrace_message; siginfo_t *last_siginfo; /* For ptrace use. */ +/* + * current io wait handle: wait queue entry to use for io waits + * If this thread is processing aio, this points at the waitqueue + * inside the currently handled kiocb. It may be NULL (i.e. default + * to a stack based synchronous wait) if its doing sync IO. 
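How the new io_wait field is meant to be consulted, pieced together from the aio.h and wait.h hunks elsewhere in this patch (the helper itself is invented): code that may block picks the kiocb's wait entry when servicing an aio retry and a stack-based entry otherwise, so one function serves both the sync and async paths.

    static int wait_for_event(wait_queue_head_t *q, wait_queue_t *stack_wait)
    {
            wait_queue_t *wait = current->io_wait ? current->io_wait
                                                  : stack_wait;

            if (!is_sync_wait(wait)) {
                    add_wait_queue(q, wait);
                    return -EIOCBRETRY;     /* aio core retries the iocb */
            }
            /* sync case: ordinary prepare_to_wait()/schedule() here */
            return 0;
    }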
+ */ + wait_queue_t *io_wait; }; static inline pid_t process_group(struct task_struct *tsk) --- diff/include/linux/serial_core.h 2003-10-09 09:47:17.000000000 +0100 +++ source/include/linux/serial_core.h 2003-11-26 10:09:08.000000000 +0000 @@ -158,7 +158,9 @@ unsigned char x_char; /* xon/xoff char */ unsigned char regshift; /* reg offset shift */ unsigned char iotype; /* io access style */ - +#ifdef CONFIG_KGDB + int kgdb; /* in use by kgdb */ +#endif #define UPIO_PORT (0) #define UPIO_HUB6 (1) #define UPIO_MEM (2) --- diff/include/linux/serio.h 2003-10-09 09:47:34.000000000 +0100 +++ source/include/linux/serio.h 2003-11-26 10:09:08.000000000 +0000 @@ -49,6 +49,7 @@ irqreturn_t (*interrupt)(struct serio *, unsigned char, unsigned int, struct pt_regs *); void (*connect)(struct serio *, struct serio_dev *dev); + int (*reconnect)(struct serio *); void (*disconnect)(struct serio *); void (*cleanup)(struct serio *); @@ -58,12 +59,13 @@ int serio_open(struct serio *serio, struct serio_dev *dev); void serio_close(struct serio *serio); void serio_rescan(struct serio *serio); +void serio_reconnect(struct serio *serio); irqreturn_t serio_interrupt(struct serio *serio, unsigned char data, unsigned int flags, struct pt_regs *regs); void serio_register_port(struct serio *serio); -void serio_register_slave_port(struct serio *serio); +void __serio_register_port(struct serio *serio); void serio_unregister_port(struct serio *serio); -void serio_unregister_slave_port(struct serio *serio); +void __serio_unregister_port(struct serio *serio); void serio_register_device(struct serio_dev *dev); void serio_unregister_device(struct serio_dev *dev); --- diff/include/linux/spinlock.h 2003-10-09 09:47:17.000000000 +0100 +++ source/include/linux/spinlock.h 2003-11-26 10:09:08.000000000 +0000 @@ -15,6 +15,12 @@ #include <asm/processor.h> /* for cpu relax */ #include <asm/system.h> +#ifdef CONFIG_KGDB +#include <asm/current.h> +#define SET_WHO(x, him) (x)->who = him; +#else +#define SET_WHO(x, him) +#endif /* * Must define these before including other files, inline functions need them @@ -55,6 +61,9 @@ const char *module; char *owner; int oline; +#ifdef CONFIG_KGDB + struct task_struct *who; +#endif } spinlock_t; #define SPIN_LOCK_UNLOCKED (spinlock_t) { SPINLOCK_MAGIC, 0, 10, __FILE__ , NULL, 0} @@ -66,6 +75,7 @@ (x)->module = __FILE__; \ (x)->owner = NULL; \ (x)->oline = 0; \ + SET_WHO(x, NULL) \ } while (0) #define CHECK_LOCK(x) \ @@ -88,6 +98,7 @@ (x)->lock = 1; \ (x)->owner = __FILE__; \ (x)->oline = __LINE__; \ + SET_WHO(x, current) \ } while (0) /* without debugging, spin_is_locked on UP always says @@ -118,6 +129,7 @@ (x)->lock = 1; \ (x)->owner = __FILE__; \ (x)->oline = __LINE__; \ + SET_WHO(x, current) \ 1; \ }) @@ -184,6 +196,17 @@ #endif /* !SMP */ +#ifdef CONFIG_LOCKMETER +extern void _metered_spin_lock (spinlock_t *lock); +extern void _metered_spin_unlock (spinlock_t *lock); +extern int _metered_spin_trylock(spinlock_t *lock); +extern void _metered_read_lock (rwlock_t *lock); +extern void _metered_read_unlock (rwlock_t *lock); +extern void _metered_write_lock (rwlock_t *lock); +extern void _metered_write_unlock (rwlock_t *lock); +extern int _metered_write_trylock(rwlock_t *lock); +#endif + /* * Define the various spin_lock and rw_lock methods. Note we define these * regardless of whether CONFIG_SMP or CONFIG_PREEMPT are set. The various @@ -389,6 +412,141 @@ _raw_spin_trylock(lock) ? 
1 : \ ({preempt_enable(); local_bh_enable(); 0;});}) +#ifdef CONFIG_LOCKMETER +#undef spin_lock +#undef spin_trylock +#undef spin_unlock +#undef spin_lock_irqsave +#undef spin_lock_irq +#undef spin_lock_bh +#undef read_lock +#undef read_unlock +#undef write_lock +#undef write_unlock +#undef write_trylock +#undef spin_unlock_bh +#undef read_lock_irqsave +#undef read_lock_irq +#undef read_lock_bh +#undef read_unlock_bh +#undef write_lock_irqsave +#undef write_lock_irq +#undef write_lock_bh +#undef write_unlock_bh + +#define spin_lock(lock) \ +do { \ + preempt_disable(); \ + _metered_spin_lock(lock); \ +} while(0) + +#define spin_trylock(lock) ({preempt_disable(); _metered_spin_trylock(lock) ? \ + 1 : ({preempt_enable(); 0;});}) +#define spin_unlock(lock) \ +do { \ + _metered_spin_unlock(lock); \ + preempt_enable(); \ +} while (0) + +#define spin_lock_irqsave(lock, flags) \ +do { \ + local_irq_save(flags); \ + preempt_disable(); \ + _metered_spin_lock(lock); \ +} while (0) + +#define spin_lock_irq(lock) \ +do { \ + local_irq_disable(); \ + preempt_disable(); \ + _metered_spin_lock(lock); \ +} while (0) + +#define spin_lock_bh(lock) \ +do { \ + local_bh_disable(); \ + preempt_disable(); \ + _metered_spin_lock(lock); \ +} while (0) + +#define spin_unlock_bh(lock) \ +do { \ + _metered_spin_unlock(lock); \ + preempt_enable(); \ + local_bh_enable(); \ +} while (0) + + +#define read_lock(lock) ({preempt_disable(); _metered_read_lock(lock);}) +#define read_unlock(lock) ({_metered_read_unlock(lock); preempt_enable();}) +#define write_lock(lock) ({preempt_disable(); _metered_write_lock(lock);}) +#define write_unlock(lock) ({_metered_write_unlock(lock); preempt_enable();}) +#define write_trylock(lock) ({preempt_disable();_metered_write_trylock(lock) ? \ + 1 : ({preempt_enable(); 0;});}) +#define spin_unlock_no_resched(lock) \ +do { \ + _metered_spin_unlock(lock); \ + preempt_enable_no_resched(); \ +} while (0) + +#define read_lock_irqsave(lock, flags) \ +do { \ + local_irq_save(flags); \ + preempt_disable(); \ + _metered_read_lock(lock); \ +} while (0) + +#define read_lock_irq(lock) \ +do { \ + local_irq_disable(); \ + preempt_disable(); \ + _metered_read_lock(lock); \ +} while (0) + +#define read_lock_bh(lock) \ +do { \ + local_bh_disable(); \ + preempt_disable(); \ + _metered_read_lock(lock); \ +} while (0) + +#define read_unlock_bh(lock) \ +do { \ + _metered_read_unlock(lock); \ + preempt_enable(); \ + local_bh_enable(); \ +} while (0) + +#define write_lock_irqsave(lock, flags) \ +do { \ + local_irq_save(flags); \ + preempt_disable(); \ + _metered_write_lock(lock); \ +} while (0) + +#define write_lock_irq(lock) \ +do { \ + local_irq_disable(); \ + preempt_disable(); \ + _metered_write_lock(lock); \ +} while (0) + +#define write_lock_bh(lock) \ +do { \ + local_bh_disable(); \ + preempt_disable(); \ + _metered_write_lock(lock); \ +} while (0) + +#define write_unlock_bh(lock) \ +do { \ + _metered_write_unlock(lock); \ + preempt_enable(); \ + local_bh_enable(); \ +} while (0) + +#endif /* !CONFIG_LOCKMETER */ + /* "lock on reference count zero" */ #ifndef ATOMIC_DEC_AND_LOCK #include <asm/atomic.h> --- diff/include/linux/sysctl.h 2003-11-25 15:24:59.000000000 +0000 +++ source/include/linux/sysctl.h 2003-11-26 10:09:08.000000000 +0000 @@ -601,6 +601,8 @@ FS_LEASE_TIME=15, /* int: maximum time to wait for a lease break */ FS_DQSTATS=16, /* disc quota usage statistics */ FS_XFS=17, /* struct: control xfs parameters */ + FS_AIO_NR=18, /* current system-wide number of aio requests */ + FS_AIO_MAX_NR=19, /* 
system-wide maximum number of aio requests */ }; /* /proc/sys/fs/quota/ */ --- diff/include/linux/sysfs.h 2003-08-26 10:00:54.000000000 +0100 +++ source/include/linux/sysfs.h 2003-11-26 10:09:08.000000000 +0000 @@ -66,4 +66,6 @@ int sysfs_create_group(struct kobject *, const struct attribute_group *); void sysfs_remove_group(struct kobject *, const struct attribute_group *); +extern int nosysfs; + #endif /* _SYSFS_H_ */ --- diff/include/linux/wait.h 2003-10-09 09:47:17.000000000 +0100 +++ source/include/linux/wait.h 2003-11-26 10:09:08.000000000 +0000 @@ -80,6 +80,15 @@ return !list_empty(&q->task_list); } +/* + * Used to distinguish between sync and async io wait context: + * sync i/o typically specifies a NULL wait queue entry or a wait + * queue entry bound to a task (current task) to wake up. + * aio specifies a wait queue entry with an async notification + * callback routine, not associated with any task. + */ +#define is_sync_wait(wait) (!(wait) || ((wait)->task)) + extern void FASTCALL(add_wait_queue(wait_queue_head_t *q, wait_queue_t * wait)); extern void FASTCALL(add_wait_queue_exclusive(wait_queue_head_t *q, wait_queue_t * wait)); extern void FASTCALL(remove_wait_queue(wait_queue_head_t *q, wait_queue_t * wait)); --- diff/include/linux/writeback.h 2003-10-09 09:47:17.000000000 +0100 +++ source/include/linux/writeback.h 2003-11-26 10:09:08.000000000 +0000 @@ -84,9 +84,13 @@ void __user *, size_t *); void page_writeback_init(void); -void balance_dirty_pages_ratelimited(struct address_space *mapping); +int balance_dirty_pages_ratelimited(struct address_space *mapping); int pdflush_operation(void (*fn)(unsigned long), unsigned long arg0); int do_writepages(struct address_space *mapping, struct writeback_control *wbc); +ssize_t sync_page_range(struct inode *inode, struct address_space *mapping, + loff_t pos, size_t count); +ssize_t sync_page_range_nolock(struct inode *inode, struct address_space + *mapping, loff_t pos, size_t count); /* pdflush.c */ extern int nr_pdflush_threads; /* Global so it can be exported to sysctl --- diff/include/net/sock.h 2003-06-30 10:07:34.000000000 +0100 +++ source/include/net/sock.h 2003-11-26 10:09:08.000000000 +0000 @@ -917,6 +917,7 @@ static inline int sock_queue_rcv_skb(struct sock *sk, struct sk_buff *skb) { int err = 0; + int skb_len; /* Cast skb->rcvbuf to unsigned... It's pointless, but reduces number of warnings when compiling with -W --ANK @@ -937,9 +938,18 @@ skb->dev = NULL; skb_set_owner_r(skb, sk); + + /* Cache the SKB length before we tack it onto the receive + * queue. Once it is added it no longer belongs to us and + * may be freed by other threads of control pulling packets + * from the queue. 
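+ * (Concretely: cache skb->len into skb_len before skb_queue_tail(),
+ * and pass only the cached skb_len to sk->sk_data_ready(), as the
+ * code below does; the skb itself must not be touched again.)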
+ */ + skb_len = skb->len; + skb_queue_tail(&sk->sk_receive_queue, skb); + if (!sock_flag(sk, SOCK_DEAD)) - sk->sk_data_ready(sk, skb->len); + sk->sk_data_ready(sk, skb_len); out: return err; } --- diff/include/scsi/scsi_device.h 2003-11-25 15:24:59.000000000 +0000 +++ source/include/scsi/scsi_device.h 2003-11-26 10:09:08.000000000 +0000 @@ -14,11 +14,15 @@ /* * sdev state */ -enum { - SDEV_ADD, - SDEV_DEL, - SDEV_CANCEL, - SDEV_RECOVERY, +enum scsi_device_state { + SDEV_CREATED, /* device created but not added to sysfs + * Only internal commands allowed (for inq) */ + SDEV_RUNNING, /* device properly configured + * All commands allowed */ + SDEV_CANCEL, /* beginning to delete device + * Only error handler commands allowed */ + SDEV_DEL, /* device deleted + * no commands allowed */ }; struct scsi_device { @@ -99,7 +103,7 @@ struct device sdev_gendev; struct class_device sdev_classdev; - unsigned long sdev_state; + enum scsi_device_state sdev_state; }; #define to_scsi_device(d) \ container_of(d, struct scsi_device, sdev_gendev) --- diff/include/sound/i2c.h 2002-10-16 04:27:20.000000000 +0100 +++ source/include/sound/i2c.h 2003-11-26 10:09:08.000000000 +0000 @@ -58,7 +58,7 @@ snd_card_t *card; /* card which I2C belongs to */ char name[32]; /* some useful label */ - spinlock_t lock; + struct semaphore lock_mutex; snd_i2c_bus_t *master; /* master bus when SCK/SCL is shared */ struct list_head buses; /* master: slave buses sharing SCK/SCL, slave: link list */ @@ -84,15 +84,15 @@ static inline void snd_i2c_lock(snd_i2c_bus_t *bus) { if (bus->master) - spin_lock(&bus->master->lock); + down(&bus->master->lock_mutex); else - spin_lock(&bus->lock); + down(&bus->lock_mutex); } static inline void snd_i2c_unlock(snd_i2c_bus_t *bus) { if (bus->master) - spin_unlock(&bus->master->lock); + up(&bus->master->lock_mutex); else - spin_unlock(&bus->lock); + up(&bus->lock_mutex); } int snd_i2c_sendbytes(snd_i2c_device_t *device, unsigned char *bytes, int count); --- diff/init/Kconfig 2003-09-30 15:46:21.000000000 +0100 +++ source/init/Kconfig 2003-11-26 10:09:08.000000000 +0000 @@ -43,7 +43,7 @@ config STANDALONE bool "Select only drivers that don't need compile-time external firmware" if EXPERIMENTAL - default y + default n help Select this option if you don't have magic firmware for drivers that need it. @@ -196,6 +196,19 @@ source "drivers/block/Kconfig.iosched" +config CC_OPTIMIZE_FOR_SIZE + bool "Optimize for size" if EMBEDDED + default y if ARM || H8300 + default n + help + Enabling this option will pass "-Os" instead of "-O2" to gcc + resulting in a smaller kernel. + + WARNING: some versions of gcc may generate incorrect code with this + option. If problems are observed, a gcc upgrade may be needed. + + If unsure, say N. 
+ endmenu # General setup --- diff/init/main.c 2003-10-27 09:20:44.000000000 +0000 +++ source/init/main.c 2003-11-26 10:09:08.000000000 +0000 @@ -38,6 +38,7 @@ #include <linux/moduleparam.h> #include <linux/writeback.h> #include <linux/cpu.h> +#include <linux/efi.h> #include <asm/io.h> #include <asm/bugs.h> @@ -374,7 +375,7 @@ static void rest_init(void) { - kernel_thread(init, NULL, CLONE_KERNEL); + kernel_thread(init, NULL, CLONE_FS | CLONE_SIGHAND); unlock_kernel(); cpu_idle(); } @@ -395,7 +396,6 @@ lock_kernel(); printk(linux_banner); setup_arch(&command_line); - setup_per_zone_pages_min(); setup_per_cpu_areas(); /* @@ -443,6 +443,10 @@ pidmap_init(); pgtable_cache_init(); pte_chain_init(); +#ifdef CONFIG_X86 + if (efi_enabled) + efi_enter_virtual_mode(); +#endif fork_init(num_physpages); proc_caches_init(); buffer_init(); --- diff/ipc/sem.c 2003-10-09 09:47:17.000000000 +0100 +++ source/ipc/sem.c 2003-11-26 10:09:08.000000000 +0000 @@ -59,6 +59,8 @@ * (c) 1999 Manfred Spraul <manfreds@colorfullife.com> * Enforced range limit on SEM_UNDO * (c) 2001 Red Hat Inc <alan@redhat.com> + * Lockless wakeup + * (c) 2003 Manfred Spraul <manfred@colorfullife.com> */ #include <linux/config.h> @@ -118,6 +120,40 @@ #endif } +/* + * Lockless wakeup algorithm: + * Without the check/retry algorithm a lockless wakeup is possible: + * - queue.status is initialized to -EINTR before blocking. + * - wakeup is performed by + * * unlinking the queue entry from sma->sem_pending + * * setting queue.status to IN_WAKEUP + * This is the notification for the blocked thread that a + * result value is imminent. + * * call wake_up_process + * * set queue.status to the final value. + * - the previously blocked thread checks queue.status: + * * if it's IN_WAKEUP, then it must wait until the value changes + * * if it's not -EINTR, then the operation was completed by + * update_queue. semtimedop can return queue.status without + * performing any operation on the semaphore array. + * * otherwise it must acquire the spinlock and check what's up. + * + * The two-stage algorithm is necessary to protect against the following + * races: + * - if queue.status is set after wake_up_process, then the woken up idle + * thread could race forward and try (and fail) to acquire sma->lock + * before update_queue had a chance to set queue.status + * - if queue.status is written before wake_up_process and if the + * blocked process is woken up by a signal between writing + * queue.status and the wake_up_process, then the woken up + * process could return from semtimedop and die by calling + * sys_exit before wake_up_process is called. Then wake_up_process + * will oops, because the task structure is already invalid. + * (yes, this happened on s390 with sysv msg). + * + */ +#define IN_WAKEUP 1 + static int newary (key_t key, int nsems, int semflg) { int id; @@ -331,16 +367,25 @@ int error; struct sem_queue * q; - for (q = sma->sem_pending; q; q = q->next) { - + q = sma->sem_pending; + while(q) { error = try_atomic_semop(sma, q->sops, q->nsops, q->undo, q->pid); /* Does q->sleeper still need to sleep? */ if (error <= 0) { - q->status = error; + struct sem_queue *n; remove_from_queue(sma,q); + n = q->next; + q->status = IN_WAKEUP; wake_up_process(q->sleeper); + /* hands-off: q will disappear immediately after + * writing q->status. + */ + q->status = error; + q = n; + } else { + q = q->next; } } } @@ -409,10 +454,16 @@ un->semid = -1; /* Wake up all pending processes and let them fail with EIDRM. 
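+ * (This walk applies the two-stage IN_WAKEUP handshake documented
+ * above: q->status is set to IN_WAKEUP before wake_up_process() and
+ * to the final -EIDRM only afterwards, so a woken task never reads a
+ * stale status or exits before the wakeup completes.)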
*/ - for (q = sma->sem_pending; q; q = q->next) { - q->status = -EIDRM; + q = sma->sem_pending; + while(q) { + struct sem_queue *n; + /* lazy remove_from_queue: we are killing the whole queue */ q->prev = NULL; + n = q->next; + q->status = IN_WAKEUP; wake_up_process(q->sleeper); /* doesn't sleep */ + q->status = -EIDRM; /* hands-off q */ + q = n; } /* Remove the semaphore set from the ID array*/ @@ -1083,6 +1134,18 @@ else schedule(); + error = queue.status; + while(unlikely(error == IN_WAKEUP)) { + cpu_relax(); + error = queue.status; + } + + if (error != -EINTR) { + /* fast path: update_queue already obtained all requested + * resources */ + goto out_free; + } + sma = sem_lock(semid); if(sma==NULL) { if(queue.prev != NULL) @@ -1095,7 +1158,7 @@ * If queue.status != -EINTR we are woken up by another process */ error = queue.status; - if (queue.status != -EINTR) { + if (error != -EINTR) { goto out_unlock_free; } --- diff/kernel/Makefile 2003-10-09 09:47:34.000000000 +0100 +++ source/kernel/Makefile 2003-11-26 10:09:08.000000000 +0000 @@ -11,6 +11,7 @@ obj-$(CONFIG_FUTEX) += futex.o obj-$(CONFIG_GENERIC_ISA_DMA) += dma.o obj-$(CONFIG_SMP) += cpu.o +obj-$(CONFIG_LOCKMETER) += lockmeter.o obj-$(CONFIG_UID16) += uid16.o obj-$(CONFIG_MODULES) += module.o obj-$(CONFIG_KALLSYMS) += kallsyms.o --- diff/kernel/compat.c 2003-09-17 12:28:12.000000000 +0100 +++ source/kernel/compat.c 2003-11-26 10:09:08.000000000 +0000 @@ -204,7 +204,8 @@ ret = sys_sigprocmask(how, set ? &s : NULL, oset ? &s : NULL); set_fs(old_fs); if (ret == 0) - ret = put_user(s, oset); + if (oset) + ret = put_user(s, oset); return ret; } --- diff/kernel/fork.c 2003-10-27 09:20:39.000000000 +0000 +++ source/kernel/fork.c 2003-11-26 10:09:08.000000000 +0000 @@ -129,7 +129,12 @@ { unsigned long flags; - __set_current_state(state); + /* + * don't alter the task state if this is just going to + * queue an async wait queue callback + */ + if (is_sync_wait(wait)) + __set_current_state(state); wait->flags &= ~WQ_FLAG_EXCLUSIVE; spin_lock_irqsave(&q->lock, flags); if (list_empty(&wait->task_list)) @@ -144,7 +149,12 @@ { unsigned long flags; - __set_current_state(state); + /* + * don't alter the task state if this is just going to + * queue an async wait queue callback + */ + if (is_sync_wait(wait)) + __set_current_state(state); wait->flags |= WQ_FLAG_EXCLUSIVE; spin_lock_irqsave(&q->lock, flags); if (list_empty(&wait->task_list)) @@ -290,9 +300,9 @@ atomic_dec(&inode->i_writecount); /* insert tmp into the share list, just after mpnt */ - down(&inode->i_mapping->i_shared_sem); + down(&file->f_mapping->i_shared_sem); list_add_tail(&tmp->shared, &mpnt->shared); - up(&inode->i_mapping->i_shared_sem); + up(&file->f_mapping->i_shared_sem); } /* @@ -903,6 +913,7 @@ p->start_time = get_jiffies_64(); p->security = NULL; p->io_context = NULL; + p->io_wait = NULL; retval = -ENOMEM; if ((retval = security_task_alloc(p))) @@ -1014,6 +1025,7 @@ if (current->signal->group_exit) { spin_unlock(¤t->sighand->siglock); write_unlock_irq(&tasklist_lock); + retval = -EAGAIN; goto bad_fork_cleanup_namespace; } p->tgid = current->tgid; --- diff/kernel/futex.c 2003-10-27 09:20:39.000000000 +0000 +++ source/kernel/futex.c 2003-11-26 10:09:08.000000000 +0000 @@ -246,7 +246,7 @@ * Drop a reference to the resource addressed by a key. * The hash bucket spinlock must not be held. 
*/ -static inline void drop_key_refs(union futex_key *key) +static void drop_key_refs(union futex_key *key) { if (key->both.ptr != 0) { if (key->both.offset & 1) @@ -260,7 +260,7 @@ * The hash bucket lock must be held when this is called. * Afterwards, the futex_q must not be accessed. */ -static inline void wake_futex(struct futex_q *q) +static void wake_futex(struct futex_q *q) { list_del_init(&q->list); if (q->filp) @@ -384,7 +384,7 @@ */ /* The key must be already stored in q->key. */ -static inline void queue_me(struct futex_q *q, int fd, struct file *filp) +static void queue_me(struct futex_q *q, int fd, struct file *filp) { struct futex_hash_bucket *bh; @@ -577,6 +577,7 @@ filp->f_op = &futex_fops; filp->f_vfsmnt = mntget(futex_mnt); filp->f_dentry = dget(futex_mnt->mnt_root); + filp->f_mapping = filp->f_dentry->d_inode->i_mapping; if (signal) { int err; --- diff/kernel/kmod.c 2003-10-27 09:20:44.000000000 +0000 +++ source/kernel/kmod.c 2003-11-26 10:09:08.000000000 +0000 @@ -185,14 +185,19 @@ sub_info->retval = 0; pid = kernel_thread(____call_usermodehelper, sub_info, SIGCHLD); - if (pid < 0) + if (pid < 0) { sub_info->retval = pid; - else + } else { /* We don't have a SIGCHLD signal handler, so this * always returns -ECHILD, but the important thing is * that it blocks. */ - sys_wait4(pid, NULL, 0, NULL); + mm_segment_t fs; + fs = get_fs(); + set_fs(KERNEL_DS); + sys_wait4(pid, &sub_info->retval, 0, NULL); + set_fs(fs); + } complete(sub_info->complete); return 0; } @@ -210,7 +215,7 @@ * until that is done. */ if (sub_info->wait) pid = kernel_thread(wait_for_helper, sub_info, - CLONE_KERNEL | SIGCHLD); + CLONE_FS | CLONE_FILES | SIGCHLD); else pid = kernel_thread(____call_usermodehelper, sub_info, CLONE_VFORK | SIGCHLD); --- diff/kernel/pid.c 2003-10-27 09:20:39.000000000 +0000 +++ source/kernel/pid.c 2003-11-26 10:09:08.000000000 +0000 @@ -268,6 +268,9 @@ * machine. From a minimum of 16 slots up to 4096 slots at one gigabyte or * more. */ +#ifdef CONFIG_KGDB +int kgdb_pid_init_done; /* so we don't call prior to... */ +#endif void __init pidhash_init(void) { int i, j, pidhash_size; @@ -289,6 +292,9 @@ for (j = 0; j < pidhash_size; j++) INIT_LIST_HEAD(&pid_hash[i][j]); } +#ifdef CONFIG_KGDB + kgdb_pid_init_done++; +#endif } void __init pidmap_init(void) --- diff/kernel/printk.c 2003-10-09 09:47:34.000000000 +0100 +++ source/kernel/printk.c 2003-11-26 10:09:08.000000000 +0000 @@ -447,9 +447,13 @@ char *p; static char printk_buf[1024]; static int log_level_unknown = 1; + static int printk_cpu = -1; - if (oops_in_progress) { - /* If a crash is occurring, make sure we can't deadlock */ + if (oops_in_progress && printk_cpu == smp_processor_id()) { + /* + * If a crash is occurring during printk() on this CPU, make + * sure we can't deadlock + */ spin_lock_init(&logbuf_lock); /* And make sure that we print immediately */ init_MUTEX(&console_sem); @@ -457,6 +461,7 @@ /* This stops the holder of console_sem just where we want him */ spin_lock_irqsave(&logbuf_lock, flags); + printk_cpu = smp_processor_id(); /* Emit the output into the temporary buffer */ va_start(args, fmt); --- diff/kernel/sched.c 2003-11-25 15:24:59.000000000 +0000 +++ source/kernel/sched.c 2003-11-26 10:09:08.000000000 +0000 @@ -1061,10 +1061,11 @@ * the lock held. * * We fend off statistical fluctuations in runqueue lengths by - * saving the runqueue length during the previous load-balancing - * operation and using the smaller one the current and saved lengths. 
- * If a runqueue is long enough for a longer amount of time then - * we recognize it and pull tasks from it. + * saving the runqueue length (as seen by the balancing CPU) during + * the previous load-balancing operation and using the smaller one + * of the current and saved lengths. If a runqueue is long enough + * for a longer amount of time then we recognize it and pull tasks + * from it. * * The 'current runqueue length' is a statistical maximum variable, * for that one we take the longer one - to avoid fluctuations in @@ -1512,33 +1513,20 @@ spin_lock_irq(&rq->lock); - /* - * if entering off of a kernel preemption go straight - * to picking the next task. - */ - if (unlikely(preempt_count() & PREEMPT_ACTIVE)) - goto pick_next_task; - - switch (prev->state) { - case TASK_INTERRUPTIBLE: - if (unlikely(signal_pending(prev))) { + if (prev->state != TASK_RUNNING && + likely(!(preempt_count() & PREEMPT_ACTIVE)) ) { + if (unlikely(signal_pending(prev)) && + prev->state == TASK_INTERRUPTIBLE) prev->state = TASK_RUNNING; - break; - } - default: - deactivate_task(prev, rq); - prev->nvcsw++; - break; - case TASK_RUNNING: - prev->nivcsw++; + else + deactivate_task(prev, rq); } -pick_next_task: - if (unlikely(!rq->nr_running)) { + #ifdef CONFIG_SMP + if (unlikely(!rq->nr_running)) load_balance(rq, 1, cpu_to_node_mask(smp_processor_id())); - if (rq->nr_running) - goto pick_next_task; #endif + if (unlikely(!rq->nr_running)) { next = rq->idle; rq->expired_timestamp = 0; goto switch_tasks; @@ -1585,6 +1573,12 @@ prev->timestamp = now; if (likely(prev != next)) { + if (prev->state == TASK_RUNNING || + unlikely(preempt_count() & PREEMPT_ACTIVE)) + prev->nivcsw++; + else + prev->nvcsw++; + next->timestamp = now; rq->nr_switches++; rq->curr = next; @@ -1891,6 +1885,13 @@ EXPORT_SYMBOL(set_user_nice); +#if defined( CONFIG_KGDB) +struct task_struct * kgdb_get_idle(int this_cpu) +{ + return cpu_rq(this_cpu)->idle; +} +#endif + #ifndef __alpha__ /* @@ -2445,17 +2446,16 @@ static void show_task(task_t * p) { - unsigned long free = 0; task_t *relative; - int state; - static const char * stat_nam[] = { "R", "S", "D", "T", "Z", "W" }; + unsigned state; + static const char *stat_nam[] = { "R", "S", "D", "T", "Z", "W" }; printk("%-13.13s ", p->comm); state = p->state ? 
__ffs(p->state) + 1 : 0; - if (((unsigned) state) < sizeof(stat_nam)/sizeof(char *)) + if (state < ARRAY_SIZE(stat_nam)) printk(stat_nam[state]); else - printk(" "); + printk("?"); #if (BITS_PER_LONG == 32) if (p == current) printk(" current "); @@ -2467,13 +2467,7 @@ else printk(" %016lx ", thread_saved_pc(p)); #endif - { - unsigned long * n = (unsigned long *) (p->thread_info+1); - while (!*n) - n++; - free = (unsigned long) n - (unsigned long)(p->thread_info+1); - } - printk("%5lu %5d %6d ", free, p->pid, p->parent->pid); + printk("%5d %6d ", p->pid, p->parent->pid); if ((relative = eldest_child(p))) printk("%5d ", relative->pid); else @@ -2500,12 +2494,12 @@ #if (BITS_PER_LONG == 32) printk("\n" - " free sibling\n"); - printk(" task PC stack pid father child younger older\n"); + " sibling\n"); + printk(" task PC pid father child younger older\n"); #else printk("\n" - " free sibling\n"); - printk(" task PC stack pid father child younger older\n"); + " sibling\n"); + printk(" task PC pid father child younger older\n"); #endif read_lock(&tasklist_lock); do_each_thread(g, p) { --- diff/kernel/softirq.c 2003-10-09 09:47:34.000000000 +0100 +++ source/kernel/softirq.c 2003-11-26 10:09:08.000000000 +0000 @@ -117,11 +117,22 @@ void local_bh_enable(void) { + if (in_irq()) { + printk("local_bh_enable() was called in hard irq context. " + "This is probably a bug\n"); + dump_stack(); + } + __local_bh_enable(); - WARN_ON(irqs_disabled()); - if (unlikely(!in_interrupt() && - local_softirq_pending())) + if (unlikely(!in_interrupt() && local_softirq_pending())) { + if (irqs_disabled()) { + printk("local_bh_enable() was called with local " + "interrupts disabled. This is probably a" + " bug\n"); + dump_stack(); + } invoke_softirq(); + } preempt_check_resched(); } EXPORT_SYMBOL(local_bh_enable); --- diff/kernel/sys.c 2003-10-27 09:20:44.000000000 +0000 +++ source/kernel/sys.c 2003-11-26 10:09:08.000000000 +0000 @@ -1323,8 +1323,6 @@ * either stopped or zombied. In the zombied case the task won't get * reaped till shortly after the call to getrusage(), in both cases the * task being examined is in a frozen state so the counters won't change. - * - * FIXME! Get the fault counts properly! 
 */ int getrusage(struct task_struct *p, int who, struct rusage __user *ru) { --- diff/kernel/sysctl.c 2003-10-09 09:47:34.000000000 +0100 +++ source/kernel/sysctl.c 2003-11-26 10:09:08.000000000 +0000 @@ -794,6 +794,22 @@ .mode = 0644, .proc_handler = &proc_dointvec, }, + { + .ctl_name = FS_AIO_NR, + .procname = "aio-nr", + .data = &aio_nr, + .maxlen = sizeof(aio_nr), + .mode = 0444, + .proc_handler = &proc_dointvec, + }, + { + .ctl_name = FS_AIO_MAX_NR, + .procname = "aio-max-nr", + .data = &aio_max_nr, + .maxlen = sizeof(aio_max_nr), + .mode = 0644, + .proc_handler = &proc_dointvec, + }, { .ctl_name = 0 } }; --- diff/lib/Makefile 2003-11-25 15:24:59.000000000 +0000 +++ source/lib/Makefile 2003-11-26 10:09:08.000000000 +0000 @@ -5,7 +5,7 @@ lib-y := errno.o ctype.o string.o vsprintf.o cmdline.o \ bust_spinlocks.o rbtree.o radix-tree.o dump_stack.o \ - kobject.o idr.o div64.o parser.o + kobject.o idr.o div64.o parser.o int_sqrt.o lib-$(CONFIG_RWSEM_GENERIC_SPINLOCK) += rwsem-spinlock.o lib-$(CONFIG_RWSEM_XCHGADD_ALGORITHM) += rwsem.o --- diff/lib/kobject.c 2003-10-27 09:20:39.000000000 +0000 +++ source/lib/kobject.c 2003-11-26 10:09:08.000000000 +0000 @@ -16,6 +16,7 @@ #include <linux/string.h> #include <linux/module.h> #include <linux/stat.h> +#include <linux/sysfs.h> /** * populate_dir - populate directory with attributes. @@ -170,8 +171,10 @@ memset (kobj_path, 0x00, kobj_path_length); fill_kobj_path (kset, kobj, kobj_path, kobj_path_length); - envp [i++] = scratch; - scratch += sprintf (scratch, "DEVPATH=%s", kobj_path) + 1; + if (!nosysfs) { + envp [i++] = scratch; + scratch += sprintf (scratch, "DEVPATH=%s", kobj_path) + 1; + } if (kset->hotplug_ops->hotplug) { /* have the kset specific function add its stuff */ --- diff/lib/parser.c 2003-10-09 09:47:34.000000000 +0100 +++ source/lib/parser.c 2003-11-26 10:09:08.000000000 +0000 @@ -11,6 +11,17 @@ #include <linux/slab.h> #include <linux/string.h> +/** + * match_one: - Determines if a string matches a simple pattern + * @s: the string to examine for presence of the pattern + * @p: the string containing the pattern + * @args: array of %MAX_OPT_ARGS &substring_t elements. Used to return match + * locations. + * + * Description: Determines if the pattern @p is present in string @s. Can only + * match extremely simple token=arg style patterns. If the pattern is found, + * the location(s) of the arguments will be returned in the @args array. + */ static int match_one(char *s, char *p, substring_t args[]) { char *meta; @@ -74,6 +85,20 @@ } } +/** + * match_token: - Find a token (and optional args) in a string + * @s: the string to examine for token/argument pairs + * @table: match_table_t describing the set of allowed option tokens and the + * arguments that may be associated with them. Must be terminated with a + * &struct match_token whose pattern is set to the NULL pointer. + * @args: array of %MAX_OPT_ARGS &substring_t elements. Used to return match + * locations. + * + * Description: Detects which, if any, of a set of token strings has been passed + * to it. Tokens can include up to MAX_OPT_ARGS instances of basic C-style + * format identifiers which will be taken into account when matching the + * tokens, and whose locations will be returned in the @args array.
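+ *
+ * A minimal usage sketch (the token table and option string here are
+ * hypothetical, not part of this patch):
+ *
+ *	enum { Opt_uid, Opt_err };
+ *	static match_table_t tokens = {
+ *		{Opt_uid, "uid=%d"},
+ *		{Opt_err, NULL}
+ *	};
+ *	substring_t args[MAX_OPT_ARGS];
+ *	int uid, token = match_token(opt_str, tokens, args);
+ *	if (token == Opt_uid)
+ *		match_int(&args[0], &uid);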
+ */ int match_token(char *s, match_table_t table, substring_t args[]) { struct match_token *p; @@ -84,6 +109,16 @@ return p->token; } +/** + * match_number: scan a number in the given base from a substring_t + * @s: substring to be scanned + * @result: resulting integer on success + * @base: base to use when converting string + * + * Description: Given a &substring_t and a base, attempts to parse the substring + * as a number in that base. On success, sets @result to the integer represented + * by the string and returns 0. Returns either -ENOMEM or -EINVAL on failure. + */ static int match_number(substring_t *s, int *result, int base) { char *endp; @@ -103,27 +138,71 @@ return ret; } +/** + * match_int: - scan a decimal representation of an integer from a substring_t + * @s: substring_t to be scanned + * @result: resulting integer on success + * + * Description: Attempts to parse the &substring_t @s as a decimal integer. On + * success, sets @result to the integer represented by the string and returns 0. + * Returns either -ENOMEM or -EINVAL on failure. + */ int match_int(substring_t *s, int *result) { return match_number(s, result, 0); } +/** + * match_octal: - scan an octal representation of an integer from a substring_t + * @s: substring_t to be scanned + * @result: resulting integer on success + * + * Description: Attempts to parse the &substring_t @s as an octal integer. On + * success, sets @result to the integer represented by the string and returns + * 0. Returns either -ENOMEM or -EINVAL on failure. + */ int match_octal(substring_t *s, int *result) { return match_number(s, result, 8); } +/** + * match_hex: - scan a hex representation of an integer from a substring_t + * @s: substring_t to be scanned + * @result: resulting integer on success + * + * Description: Attempts to parse the &substring_t @s as a hexadecimal integer. + * On success, sets @result to the integer represented by the string and + * returns 0. Returns either -ENOMEM or -EINVAL on failure. + */ int match_hex(substring_t *s, int *result) { return match_number(s, result, 16); } +/** + * match_strcpy: - copies the characters from a substring_t to a string + * @to: string to copy characters to. + * @s: &substring_t to copy + * + * Description: Copies the set of characters represented by the given + * &substring_t @s to the C-style string @to. Caller guarantees that @to is + * large enough to hold the characters of @s. + */ void match_strcpy(char *to, substring_t *s) { memcpy(to, s->from, s->to - s->from); to[s->to - s->from] = '\0'; } +/** + * match_strdup: - allocate a new string with the contents of a substring_t + * @s: &substring_t to copy + * + * Description: Allocates and returns a string filled with the contents of + * the &substring_t @s. The caller is responsible for freeing the returned + * string with kfree().
+ */ char *match_strdup(substring_t *s) { char *p = kmalloc(s->to - s->from + 1, GFP_KERNEL); --- diff/mm/Makefile 2003-10-09 09:47:17.000000000 +0100 +++ source/mm/Makefile 2003-11-26 10:09:08.000000000 +0000 @@ -12,3 +12,6 @@ slab.o swap.o truncate.o vmscan.o $(mmu-y) obj-$(CONFIG_SWAP) += page_io.o swap_state.o swapfile.o + +obj-$(CONFIG_X86_4G) += usercopy.o + --- diff/mm/fadvise.c 2003-09-30 15:46:21.000000000 +0100 +++ source/mm/fadvise.c 2003-11-26 10:09:08.000000000 +0000 @@ -23,7 +23,6 @@ asmlinkage long sys_fadvise64_64(int fd, loff_t offset, loff_t len, int advice) { struct file *file = fget(fd); - struct inode *inode; struct address_space *mapping; struct backing_dev_info *bdi; pgoff_t start_index; @@ -33,8 +32,7 @@ if (!file) return -EBADF; - inode = file->f_dentry->d_inode; - mapping = inode->i_mapping; + mapping = file->f_mapping; if (!mapping) { ret = -EINVAL; goto out; --- diff/mm/filemap.c 2003-11-25 15:24:59.000000000 +0000 +++ source/mm/filemap.c 2003-11-26 10:09:08.000000000 +0000 @@ -73,6 +73,9 @@ * ->mmap_sem * ->i_sem (msync) * + * ->i_sem + * ->i_alloc_sem (various) + * * ->inode_lock * ->sb_lock (fs/fs-writeback.c) * ->mapping->page_lock (__sync_single_inode) @@ -226,6 +229,18 @@ EXPORT_SYMBOL(filemap_fdatawait); +int filemap_write_and_wait(struct address_space *mapping) +{ + int retval = 0; + + if (mapping->nrpages) { + retval = filemap_fdatawrite(mapping); + if (retval == 0) + retval = filemap_fdatawait(mapping); + } + return retval; +} + /* * This adds a page to the page cache, starting out as locked, unreferenced, * not uptodate and with no errors. @@ -292,22 +307,42 @@ return &zone->wait_table[hash_ptr(page, zone->wait_table_bits)]; } -void wait_on_page_bit(struct page *page, int bit_nr) +/* + * wait for the specified page bit to be cleared + * this could be a synchronous wait or could just queue an async + * notification callback depending on the wait queue entry parameter + * + * A NULL wait queue parameter defaults to sync behaviour + */ +int wait_on_page_bit_wq(struct page *page, int bit_nr, wait_queue_t *wait) { wait_queue_head_t *waitqueue = page_waitqueue(page); - DEFINE_WAIT(wait); + DEFINE_WAIT(local_wait); + + if (!wait) + wait = &local_wait; /* default to a sync wait entry */ do { - prepare_to_wait(waitqueue, &wait, TASK_UNINTERRUPTIBLE); + prepare_to_wait(waitqueue, wait, TASK_UNINTERRUPTIBLE); if (test_bit(bit_nr, &page->flags)) { sync_page(page); + if (!is_sync_wait(wait)) { + /* + * if we've queued an async wait queue + * callback do not block; just tell the + * caller to return and retry later when + * the callback is notified + */ + return -EIOCBRETRY; + } io_schedule(); } } while (test_bit(bit_nr, &page->flags)); - finish_wait(waitqueue, &wait); -} + finish_wait(waitqueue, wait); -EXPORT_SYMBOL(wait_on_page_bit); + return 0; +} +EXPORT_SYMBOL(wait_on_page_bit_wq); /** * unlock_page() - unlock a locked page @@ -317,7 +352,9 @@ * Unlocks the page and wakes up sleepers in ___wait_on_page_locked(). * Also wakes sleepers in wait_on_page_writeback() because the wakeup * mechananism between PageLocked pages and PageWriteback pages is shared. - * But that's OK - sleepers in wait_on_page_writeback() just go back to sleep. + * But that's OK - sleepers in wait_on_page_writeback() just go back to sleep, + * or in case the wakeup notifies async wait queue entries, as in the case + * of aio, retries would be triggered and may re-queue their callbacks. 
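+ *
+ * A sketch of the aio-side flow this refers to (illustrative only,
+ * using the lock_page_wq() interface added elsewhere in this patch):
+ *
+ *	if (lock_page_wq(page, current->io_wait) == -EIOCBRETRY)
+ *		return -EIOCBRETRY;
+ *
+ * unlock_page() then fires the queued callback, which re-drives the
+ * operation instead of waking a sleeping task.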
* * The first mb is necessary to safely close the critical section opened by the * TestSetPageLocked(), the second mb is necessary to enforce ordering between @@ -358,26 +395,51 @@ EXPORT_SYMBOL(end_page_writeback); /* - * Get a lock on the page, assuming we need to sleep to get it. + * Get a lock on the page, assuming we need to either sleep to get it + * or to queue an async notification callback to try again when it's + * available. + * + * A NULL wait queue parameter defaults to sync behaviour. Otherwise + * it specifies the wait queue entry to be used for async notification + * or waiting. * * Ugly: running sync_page() in state TASK_UNINTERRUPTIBLE is scary. If some * random driver's requestfn sets TASK_RUNNING, we could busywait. However * chances are that on the second loop, the block layer's plug list is empty, * so sync_page() will then return in state TASK_UNINTERRUPTIBLE. */ -void __lock_page(struct page *page) +int __lock_page_wq(struct page *page, wait_queue_t *wait) { wait_queue_head_t *wqh = page_waitqueue(page); - DEFINE_WAIT(wait); + DEFINE_WAIT(local_wait); + + if (!wait) + wait = &local_wait; while (TestSetPageLocked(page)) { - prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE); + prepare_to_wait(wqh, wait, TASK_UNINTERRUPTIBLE); if (PageLocked(page)) { sync_page(page); + if (!is_sync_wait(wait)) { + /* + * if we've queued an async wait queue + * callback do not block; just tell the + * caller to return and retry later when + * the callback is notified + */ + return -EIOCBRETRY; + } io_schedule(); } } - finish_wait(wqh, &wait); + finish_wait(wqh, wait); + return 0; +} +EXPORT_SYMBOL(__lock_page_wq); + +void __lock_page(struct page *page) +{ + __lock_page_wq(page, NULL); } EXPORT_SYMBOL(__lock_page); @@ -432,8 +494,8 @@ * * Returns zero if the page was not present. find_lock_page() may sleep. */ -struct page *find_lock_page(struct address_space *mapping, - unsigned long offset) +struct page *find_lock_page_wq(struct address_space *mapping, + unsigned long offset, wait_queue_t *wait) { struct page *page; @@ -444,7 +506,10 @@ page_cache_get(page); if (TestSetPageLocked(page)) { spin_unlock(&mapping->page_lock); - lock_page(page); + if (-EIOCBRETRY == lock_page_wq(page, wait)) { + page_cache_release(page); + return ERR_PTR(-EIOCBRETRY); + } spin_lock(&mapping->page_lock); /* Has the page been truncated while we slept? */ @@ -461,6 +526,12 @@ EXPORT_SYMBOL(find_lock_page); +struct page *find_lock_page(struct address_space *mapping, + unsigned long offset) +{ + return find_lock_page_wq(mapping, offset, NULL); +} + /** * find_or_create_page - locate or add a pagecache page * @@ -521,9 +592,12 @@ * The search returns a group of mapping-contiguous pages with ascending * indexes. There may be holes in the indices due to not-present pages. * - * find_get_pages() returns the number of pages which were found. + * find_get_pages() returns the number of pages which were found + * and also atomically sets the next offset to continue looking up + * mapping-contiguous pages from (useful when doing a range of + * pagevec lookups in chunks of PAGEVEC_SIZE).
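+ *
+ * A minimal caller sketch (names hypothetical):
+ *
+ *	pgoff_t next = 0;
+ *	struct page *pages[PAGEVEC_SIZE];
+ *	unsigned int i, nr;
+ *
+ *	while ((nr = find_get_pages(mapping, &next, PAGEVEC_SIZE, pages))) {
+ *		for (i = 0; i < nr; i++)
+ *			page_cache_release(pages[i]);
+ *	}
+ *
+ * After each call *next points past the last page returned, so the
+ * loop needs no manual index arithmetic.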
*/ -unsigned int find_get_pages(struct address_space *mapping, pgoff_t start, +unsigned int find_get_pages(struct address_space *mapping, pgoff_t *next, unsigned int nr_pages, struct page **pages) { unsigned int i; @@ -531,9 +605,12 @@ spin_lock(&mapping->page_lock); ret = radix_tree_gang_lookup(&mapping->page_tree, - (void **)pages, start, nr_pages); + (void **)pages, *next, nr_pages); for (i = 0; i < ret; i++) page_cache_get(pages[i]); + if (ret) + *next = pages[ret - 1]->index + 1; + spin_unlock(&mapping->page_lock); return ret; } @@ -587,21 +664,46 @@ read_actor_t actor) { struct inode *inode = mapping->host; - unsigned long index, offset; + unsigned long index, offset, first, last, end_index; + loff_t isize = i_size_read(inode); struct page *cached_page; int error; cached_page = NULL; - index = *ppos >> PAGE_CACHE_SHIFT; + first = *ppos >> PAGE_CACHE_SHIFT; offset = *ppos & ~PAGE_CACHE_MASK; + last = (*ppos + desc->count) >> PAGE_CACHE_SHIFT; + end_index = isize >> PAGE_CACHE_SHIFT; + if (last > end_index) + last = end_index; + /* Don't repeat the readahead if we are executing aio retries */ + if (in_aio()) { + if (is_retried_kiocb(io_wait_to_kiocb(current->io_wait))) + goto done_readahead; + } + + /* + * Let the readahead logic know upfront about all + * the pages we'll need to satisfy this request + */ + for (index = first; index < last; index++) + page_cache_readahead(mapping, ra, filp, index); + + if (ra->next_size == -1UL) { + /* the readahead window was maximally shrunk */ + /* explicitly readahead at least what is needed now */ + for (index = first; index < last; index++) + handle_ra_miss(mapping, ra, index); + do_page_cache_readahead(mapping, filp, first, last - first); + } + +done_readahead: + index = first; for (;;) { struct page *page; - unsigned long end_index, nr, ret; - loff_t isize = i_size_read(inode); + unsigned long nr, ret; - end_index = isize >> PAGE_CACHE_SHIFT; - if (index > end_index) break; nr = PAGE_CACHE_SIZE; @@ -612,7 +714,6 @@ } cond_resched(); - page_cache_readahead(mapping, ra, filp, index); nr = nr - offset; find_page: @@ -662,7 +763,12 @@ goto page_ok; /* Get exclusive access to the page ... */ - lock_page(page); + + if (lock_page_wq(page, current->io_wait)) { + pr_debug("queued lock page \n"); + error = -EIOCBRETRY; + goto sync_error; + } /* Did it get unhashed before we got the lock? */ if (!page->mapping) { @@ -684,13 +790,23 @@ if (!error) { if (PageUptodate(page)) goto page_ok; - wait_on_page_locked(page); + if (wait_on_page_locked_wq(page, current->io_wait)) { + pr_debug("queued wait_on_page \n"); + error = -EIOCBRETRY; + goto sync_error; + } + if (PageUptodate(page)) goto page_ok; error = -EIO; } - /* UHHUH! A synchronous read error occurred. Report it */ +sync_error: + /* We don't have uptodate data in the page yet */ + /* Could be due to an error or because we need to + * retry when we get an async i/o notification. + * Report the reason. 
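+ * (-EIOCBRETRY means an async wait queue callback was queued and the
+ * read will be re-driven when it fires; any other value here is a
+ * hard read error.)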
+ */ desc->error = error; page_cache_release(page); break; @@ -804,7 +920,7 @@ struct address_space *mapping; struct inode *inode; - mapping = filp->f_dentry->d_inode->i_mapping; + mapping = filp->f_mapping; inode = mapping->host; retval = 0; if (!count) @@ -844,22 +960,19 @@ out: return retval; } - EXPORT_SYMBOL(__generic_file_aio_read); -ssize_t -generic_file_aio_read(struct kiocb *iocb, char __user *buf, size_t count, loff_t pos) +ssize_t generic_file_aio_read(struct kiocb *iocb, char __user *buf, + size_t count, loff_t pos) { struct iovec local_iov = { .iov_base = buf, .iov_len = count }; - BUG_ON(iocb->ki_pos != pos); return __generic_file_aio_read(iocb, &local_iov, 1, &iocb->ki_pos); } - EXPORT_SYMBOL(generic_file_aio_read); -ssize_t -generic_file_read(struct file *filp, char __user *buf, size_t count, loff_t *ppos) +ssize_t generic_file_read(struct file *filp, char __user *buf, + size_t count, loff_t *ppos) { struct iovec local_iov = { .iov_base = buf, .iov_len = count }; struct kiocb kiocb; @@ -871,10 +984,10 @@ ret = wait_on_sync_kiocb(&kiocb); return ret; } - EXPORT_SYMBOL(generic_file_read); -int file_send_actor(read_descriptor_t * desc, struct page *page, unsigned long offset, unsigned long size) +int file_send_actor(read_descriptor_t * desc, struct page *page, + unsigned long offset, unsigned long size) { ssize_t written; unsigned long count = desc->count; @@ -936,7 +1049,7 @@ file = fget(fd); if (file) { if (file->f_mode & FMODE_READ) { - struct address_space *mapping = file->f_dentry->d_inode->i_mapping; + struct address_space *mapping = file->f_mapping; unsigned long start = offset >> PAGE_CACHE_SHIFT; unsigned long end = (offset + count - 1) >> PAGE_CACHE_SHIFT; unsigned long len = end - start + 1; @@ -955,7 +1068,7 @@ static int FASTCALL(page_cache_read(struct file * file, unsigned long offset)); static int page_cache_read(struct file * file, unsigned long offset) { - struct address_space *mapping = file->f_dentry->d_inode->i_mapping; + struct address_space *mapping = file->f_mapping; struct page *page; int error; @@ -990,16 +1103,16 @@ * it in the page cache, and handles the special cases reasonably without * having a lot of duplicated code. */ -struct page * filemap_nopage(struct vm_area_struct * area, unsigned long address, int unused) +struct page * filemap_nopage(struct vm_area_struct * area, unsigned long address, int *type) { int error; struct file *file = area->vm_file; - struct address_space *mapping = file->f_dentry->d_inode->i_mapping; + struct address_space *mapping = file->f_mapping; struct file_ra_state *ra = &file->f_ra; struct inode *inode = mapping->host; struct page *page; unsigned long size, pgoff, endoff; - int did_readaround = 0; + int did_readaround = 0, majmin = VM_FAULT_MINOR; pgoff = ((address - area->vm_start) >> PAGE_CACHE_SHIFT) + area->vm_pgoff; endoff = ((area->vm_end - area->vm_start) >> PAGE_CACHE_SHIFT) + area->vm_pgoff; @@ -1048,6 +1161,14 @@ if (ra->mmap_miss > ra->mmap_hit + MMAP_LOTSAMISS) goto no_cached_page; + /* + * To keep the pgmajfault counter straight, we need to + * check did_readaround, as this is an inner loop. + */ + if (!did_readaround) { + majmin = VM_FAULT_MAJOR; + inc_page_state(pgmajfault); + } did_readaround = 1; do_page_cache_readahead(mapping, file, pgoff & ~(MMAP_READAROUND-1), MMAP_READAROUND); @@ -1069,6 +1190,8 @@ * Found the page and have a reference on it. 
*/ mark_page_accessed(page); + if (type) + *type = majmin; return page; outside_data_content: @@ -1104,7 +1227,10 @@ return NULL; page_not_uptodate: - inc_page_state(pgmajfault); + if (!did_readaround) { + majmin = VM_FAULT_MAJOR; + inc_page_state(pgmajfault); + } lock_page(page); /* Did it get unhashed while we waited for it? */ @@ -1166,7 +1292,7 @@ static struct page * filemap_getpage(struct file *file, unsigned long pgoff, int nonblock) { - struct address_space *mapping = file->f_dentry->d_inode->i_mapping; + struct address_space *mapping = file->f_mapping; struct page *page; int error; @@ -1278,7 +1404,7 @@ int nonblock) { struct file *file = vma->vm_file; - struct address_space *mapping = file->f_dentry->d_inode->i_mapping; + struct address_space *mapping = file->f_mapping; struct inode *inode = mapping->host; unsigned long size; struct mm_struct *mm = vma->vm_mm; @@ -1337,7 +1463,7 @@ int generic_file_mmap(struct file * file, struct vm_area_struct * vma) { - struct address_space *mapping = file->f_dentry->d_inode->i_mapping; + struct address_space *mapping = file->f_mapping; struct inode *inode = mapping->host; if (!mapping->a_ops->readpage) @@ -1460,7 +1586,9 @@ int err; struct page *page; repeat: - page = find_lock_page(mapping, index); + page = find_lock_page_wq(mapping, index, current->io_wait); + if (IS_ERR(page)) + return page; if (!page) { if (!*cached_page) { *cached_page = page_cache_alloc(mapping); @@ -1605,9 +1733,9 @@ * Returns appropriate error code that caller should return or * zero in case that write should be allowed. */ -inline int generic_write_checks(struct inode *inode, - struct file *file, loff_t *pos, size_t *count, int isblk) +inline int generic_write_checks(struct file *file, loff_t *pos, size_t *count, int isblk) { + struct inode *inode = file->f_mapping->host; unsigned long limit = current->rlim[RLIMIT_FSIZE].rlim_cur; if (unlikely(*pos < 0)) @@ -1669,7 +1797,7 @@ *count = inode->i_sb->s_maxbytes - *pos; } else { loff_t isize; - if (bdev_read_only(inode->i_bdev)) + if (bdev_read_only(I_BDEV(inode))) return -EPERM; isize = i_size_read(inode); if (*pos >= isize) { @@ -1687,6 +1815,7 @@ /* * Write to a file through the page cache. + * Called under i_sem for S_ISREG files. * * We put everything into the page cache prior to writing it. This is not a * problem when writing full pages. With partial pages, however, we first have @@ -1695,11 +1824,11 @@ * okir@monad.swb.de */ ssize_t -generic_file_aio_write_nolock(struct kiocb *iocb, const struct iovec *iov, +__generic_file_aio_write_nolock(struct kiocb *iocb, const struct iovec *iov, unsigned long nr_segs, loff_t *ppos) { struct file *file = iocb->ki_filp; - struct address_space * mapping = file->f_dentry->d_inode->i_mapping; + struct address_space * mapping = file->f_mapping; struct address_space_operations *a_ops = mapping->a_ops; size_t ocount; /* original count */ size_t count; /* after file limit checks */ @@ -1746,11 +1875,10 @@ current->backing_dev_info = mapping->backing_dev_info; written = 0; - err = generic_write_checks(inode, file, &pos, &count, isblk); + err = generic_write_checks(file, &pos, &count, isblk); if (err) goto out; - if (count == 0) goto out; @@ -1775,12 +1903,19 @@ /* * Sync the fs metadata but not the minor inode changes and * of course not the data as we did direct DMA for the IO. + * i_sem is held, which protects generic_osync_inode() from + * livelocking. 
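+ * (The non-aio O_SYNC paths added elsewhere in this patch follow the
+ * inverse rule: sync_page_range() re-takes i_sem internally, so those
+ * callers invoke it only after dropping the lock, roughly:
+ *
+ *	up(&inode->i_sem);
+ *	if (ret > 0 && ((file->f_flags & O_SYNC) || IS_SYNC(inode)))
+ *		ret = sync_page_range(inode, mapping, pos, ret);
+ *
+ * where pos is the offset the write started at.)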
*/ if (written >= 0 && file->f_flags & O_SYNC) - status = generic_osync_inode(inode, OSYNC_METADATA); + status = generic_osync_inode(inode, mapping, OSYNC_METADATA); if (written >= 0 && !is_sync_kiocb(iocb)) written = -EIOCBQUEUED; - goto out_status; + if (written != -ENOTBLK) + goto out_status; + /* + * direct-io write to a hole: fall through to buffered I/O + */ + written = 0; } buf = iov->iov_base; @@ -1804,6 +1939,10 @@ fault_in_pages_readable(buf, bytes); page = __grab_cache_page(mapping,index,&cached_page,&lru_pvec); + if (IS_ERR(page)) { + status = PTR_ERR(page); + break; + } if (!page) { status = -ENOMEM; break; @@ -1852,7 +1991,11 @@ page_cache_release(page); if (status < 0) break; - balance_dirty_pages_ratelimited(mapping); + status = balance_dirty_pages_ratelimited(mapping); + if (status < 0) { + pr_debug("async balance_dirty_pages\n"); + break; + } cond_resched(); } while (count); *ppos = pos; @@ -1863,12 +2006,22 @@ /* * For now, when the user asks for O_SYNC, we'll actually give O_DSYNC */ - if (status >= 0) { - if ((file->f_flags & O_SYNC) || IS_SYNC(inode)) - status = generic_osync_inode(inode, - OSYNC_METADATA|OSYNC_DATA); - } + if (likely(status >= 0)) { + if (unlikely((file->f_flags & O_SYNC) || IS_SYNC(inode))) { + if (!a_ops->writepage) + status = generic_osync_inode(inode, mapping, + OSYNC_METADATA|OSYNC_DATA); + } + } + /* + * If we get here for O_DIRECT writes then we must have fallen through + * to buffered writes (block instantiation inside i_size). So we sync + * the file data here, to try to honour O_DIRECT expectations. + */ + if (unlikely(file->f_flags & O_DIRECT) && written) + status = filemap_write_and_wait(mapping); + out_status: err = written ? written : status; out: @@ -1880,6 +2033,55 @@ EXPORT_SYMBOL(generic_file_aio_write_nolock); ssize_t +generic_file_aio_write_nolock(struct kiocb *iocb, const struct iovec *iov, + unsigned long nr_segs, loff_t *ppos) +{ + struct file *file = iocb->ki_filp; + struct address_space *mapping = file->f_mapping; + struct inode *inode = mapping->host; + ssize_t ret; + loff_t pos = *ppos; + + if (!iov->iov_base && !is_sync_kiocb(iocb)) { + /* nothing to transfer, may just need to sync data */ + ret = iov->iov_len; /* vector AIO not supported yet */ + goto osync; + } + + ret = __generic_file_aio_write_nolock(iocb, iov, nr_segs, ppos); + + /* + * Avoid doing a sync in parts for aio - its more efficient to + * call in again after all the data has been copied + */ + if (!is_sync_kiocb(iocb)) + return ret; + +osync: + if (ret > 0 && ((file->f_flags & O_SYNC) || IS_SYNC(inode))) { + ret = sync_page_range_nolock(inode, mapping, pos, ret); + if (ret >= 0) + *ppos = pos + ret; + } + return ret; +} + + +ssize_t +__generic_file_write_nolock(struct file *file, const struct iovec *iov, + unsigned long nr_segs, loff_t *ppos) +{ + struct kiocb kiocb; + ssize_t ret; + + init_sync_kiocb(&kiocb, file); + ret = __generic_file_aio_write_nolock(&kiocb, iov, nr_segs, ppos); + if (-EIOCBQUEUED == ret) + ret = wait_on_sync_kiocb(&kiocb); + return ret; +} + +ssize_t generic_file_write_nolock(struct file *file, const struct iovec *iov, unsigned long nr_segs, loff_t *ppos) { @@ -1899,36 +2101,62 @@ size_t count, loff_t pos) { struct file *file = iocb->ki_filp; - struct inode *inode = file->f_dentry->d_inode->i_mapping->host; - ssize_t err; - struct iovec local_iov = { .iov_base = (void __user *)buf, .iov_len = count }; + struct address_space *mapping = file->f_mapping; + struct inode *inode = mapping->host; + ssize_t ret; + struct iovec local_iov = { 
.iov_base = (void __user *)buf, + .iov_len = count }; - BUG_ON(iocb->ki_pos != pos); + if (!buf && !is_sync_kiocb(iocb)) { + /* nothing to transfer, may just need to sync data */ + ret = count; + goto osync; + } down(&inode->i_sem); - err = generic_file_aio_write_nolock(iocb, &local_iov, 1, + ret = __generic_file_aio_write_nolock(iocb, &local_iov, 1, &iocb->ki_pos); up(&inode->i_sem); - return err; -} + /* + * Avoid doing a sync in parts for aio - its more efficient to + * call in again after all the data has been copied + */ + if (!is_sync_kiocb(iocb)) + return ret; +osync: + if (ret > 0 && ((file->f_flags & O_SYNC) || IS_SYNC(inode))) { + ret = sync_page_range(inode, mapping, pos, ret); + if (ret >= 0) + iocb->ki_pos = pos + ret; + } + return ret; +} EXPORT_SYMBOL(generic_file_aio_write); ssize_t generic_file_write(struct file *file, const char __user *buf, size_t count, loff_t *ppos) { - struct inode *inode = file->f_dentry->d_inode->i_mapping->host; - ssize_t err; - struct iovec local_iov = { .iov_base = (void __user *)buf, .iov_len = count }; + struct address_space *mapping = file->f_mapping; + struct inode *inode = mapping->host; + ssize_t ret; + struct iovec local_iov = { .iov_base = (void __user *)buf, + .iov_len = count }; down(&inode->i_sem); - err = generic_file_write_nolock(file, &local_iov, 1, ppos); + ret = __generic_file_write_nolock(file, &local_iov, 1, ppos); up(&inode->i_sem); - return err; -} + if (ret > 0 && ((file->f_flags & O_SYNC) || IS_SYNC(inode))) { + ssize_t err; + err = sync_page_range(inode, mapping, *ppos - ret, ret); + if (err < 0) + ret = err; + } + return ret; +} EXPORT_SYMBOL(generic_file_write); ssize_t generic_file_readv(struct file *filp, const struct iovec *iov, @@ -1947,39 +2175,46 @@ EXPORT_SYMBOL(generic_file_readv); ssize_t generic_file_writev(struct file *file, const struct iovec *iov, - unsigned long nr_segs, loff_t * ppos) + unsigned long nr_segs, loff_t *ppos) { - struct inode *inode = file->f_dentry->d_inode; + struct address_space *mapping = file->f_mapping; + struct inode *inode = mapping->host; ssize_t ret; down(&inode->i_sem); - ret = generic_file_write_nolock(file, iov, nr_segs, ppos); + ret = __generic_file_write_nolock(file, iov, nr_segs, ppos); up(&inode->i_sem); + + if (ret > 0 && ((file->f_flags & O_SYNC) || IS_SYNC(inode))) { + ssize_t err; + + err = sync_page_range(inode, mapping, *ppos - ret, ret); + if (err < 0) + ret = err; + } return ret; } EXPORT_SYMBOL(generic_file_writev); +/* + * Called under i_sem for writes to S_ISREG files + */ ssize_t generic_file_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov, loff_t offset, unsigned long nr_segs) { struct file *file = iocb->ki_filp; - struct address_space *mapping = file->f_dentry->d_inode->i_mapping; + struct address_space *mapping = file->f_mapping; ssize_t retval; - if (mapping->nrpages) { - retval = filemap_fdatawrite(mapping); - if (retval == 0) - retval = filemap_fdatawait(mapping); - if (retval) - goto out; + retval = filemap_write_and_wait(mapping); + if (retval == 0) { + retval = mapping->a_ops->direct_IO(rw, iocb, iov, + offset, nr_segs); + if (rw == WRITE && mapping->nrpages) + invalidate_inode_pages2(mapping); } - - retval = mapping->a_ops->direct_IO(rw, iocb, iov, offset, nr_segs); - if (rw == WRITE && mapping->nrpages) - invalidate_inode_pages2(mapping); -out: return retval; } --- diff/mm/highmem.c 2003-10-09 09:47:34.000000000 +0100 +++ source/mm/highmem.c 2003-11-26 10:09:08.000000000 +0000 @@ -285,7 +285,7 @@ struct bio_vec *tovec, *fromvec; int i; - 
__bio_for_each_segment(tovec, to, i, 0) { + bio_for_each_segment(tovec, to, i) { fromvec = from->bi_io_vec + i; /* @@ -314,7 +314,7 @@ /* * free up bounce indirect pages used */ - __bio_for_each_segment(bvec, bio, i, 0) { + bio_for_each_segment(bvec, bio, i) { org_vec = bio_orig->bi_io_vec + i; if (bvec->bv_page == org_vec->bv_page) continue; @@ -437,7 +437,7 @@ bio->bi_rw = (*bio_orig)->bi_rw; bio->bi_vcnt = (*bio_orig)->bi_vcnt; - bio->bi_idx = 0; + bio->bi_idx = (*bio_orig)->bi_idx; bio->bi_size = (*bio_orig)->bi_size; if (pool == page_pool) { --- diff/mm/madvise.c 2003-08-20 14:16:34.000000000 +0100 +++ source/mm/madvise.c 2003-11-26 10:09:08.000000000 +0000 @@ -65,7 +65,7 @@ end = vma->vm_end; end = ((end - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff; - force_page_cache_readahead(file->f_dentry->d_inode->i_mapping, + force_page_cache_readahead(file->f_mapping, file, start, max_sane_readahead(end - start)); return 0; } --- diff/mm/memory.c 2003-11-25 15:24:59.000000000 +0000 +++ source/mm/memory.c 2003-11-26 10:09:08.000000000 +0000 @@ -107,7 +107,8 @@ pte_free_tlb(tlb, page); } -static inline void free_one_pgd(struct mmu_gather *tlb, pgd_t * dir) +static inline void free_one_pgd(struct mmu_gather *tlb, pgd_t * dir, + int pgd_idx) { int j; pmd_t * pmd; @@ -121,8 +122,11 @@ } pmd = pmd_offset(dir, 0); pgd_clear(dir); - for (j = 0; j < PTRS_PER_PMD ; j++) + for (j = 0; j < PTRS_PER_PMD ; j++) { + if (pgd_idx * PGDIR_SIZE + j * PMD_SIZE >= TASK_SIZE) + break; free_one_pmd(tlb, pmd+j); + } pmd_free_tlb(tlb, pmd); } @@ -135,11 +139,13 @@ void clear_page_tables(struct mmu_gather *tlb, unsigned long first, int nr) { pgd_t * page_dir = tlb->mm->pgd; + int pgd_idx = first; page_dir += first; do { - free_one_pgd(tlb, page_dir); + free_one_pgd(tlb, page_dir, pgd_idx); page_dir++; + pgd_idx++; } while (--nr); } @@ -437,7 +443,7 @@ unsigned long address, unsigned long size) { pmd_t * pmd; - unsigned long end; + unsigned long end, pgd_boundary; if (pgd_none(*dir)) return; @@ -448,8 +454,9 @@ } pmd = pmd_offset(dir, address); end = address + size; - if (end > ((address + PGDIR_SIZE) & PGDIR_MASK)) - end = ((address + PGDIR_SIZE) & PGDIR_MASK); + pgd_boundary = ((address + PGDIR_SIZE) & PGDIR_MASK); + if (pgd_boundary && (end > pgd_boundary)) + end = pgd_boundary; do { zap_pte_range(tlb, pmd, address, end - address); address = (address + PMD_SIZE) & PMD_MASK; @@ -603,6 +610,11 @@ might_sleep(); if (is_vm_hugetlb_page(vma)) { + static int x; + if (x < 10) { + x++; + dump_stack(); + } zap_hugepage_range(vma, address, size); return; } @@ -685,6 +697,7 @@ struct page **pages, struct vm_area_struct **vmas) { int i; + int vm_io; unsigned int flags; /* @@ -741,8 +754,10 @@ } #endif - if (!vma || (pages && (vma->vm_flags & VM_IO)) - || !(flags & vma->vm_flags)) + if (!vma) + return i ? : -EFAULT; + vm_io = vma->vm_flags & VM_IO; + if ((pages && vm_io) || !(flags & vma->vm_flags)) return i ? : -EFAULT; if (is_vm_hugetlb_page(vma)) { @@ -750,9 +765,17 @@ &start, &len, i); continue; } + spin_lock(&mm->page_table_lock); do { - struct page *map; + struct page *map = NULL; + + /* + * We don't follow pagetables for VM_IO regions - they + * may have no pageframes. 
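+ * (VM_IO is typically set on ranges a driver mapped with
+ * remap_page_range(); if the caller asked for the pages themselves,
+ * the check above already failed the request with -EFAULT.)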
+ */ + if (vm_io) + goto no_follow; while (!(map = follow_page(mm, start, write))) { spin_unlock(&mm->page_table_lock); switch (handle_mm_fault(mm,vma,start,write)) { @@ -784,6 +807,7 @@ if (!PageReserved(pages[i])) page_cache_get(pages[i]); } +no_follow: if (vmas) vmas[i] = vma; i++; @@ -1147,7 +1171,7 @@ invalidate_mmap_range_list(&mapping->i_mmap_shared, hba, hlen); up(&mapping->i_shared_sem); } -EXPORT_SYMBOL_GPL(invalidate_mmap_range); +EXPORT_SYMBOL(invalidate_mmap_range); /* * Handle all mappings that got truncated by a "truncate()" @@ -1400,7 +1424,7 @@ pte_t entry; struct pte_chain *pte_chain; int sequence = 0; - int ret; + int ret = VM_FAULT_MINOR; if (!vma->vm_ops || !vma->vm_ops->nopage) return do_anonymous_page(mm, vma, page_table, @@ -1409,12 +1433,12 @@ spin_unlock(&mm->page_table_lock); if (vma->vm_file) { - mapping = vma->vm_file->f_dentry->d_inode->i_mapping; + mapping = vma->vm_file->f_mapping; sequence = atomic_read(&mapping->truncate_count); } smp_rmb(); /* Prevent CPU from reordering lock-free ->nopage() */ retry: - new_page = vma->vm_ops->nopage(vma, address & PAGE_MASK, 0); + new_page = vma->vm_ops->nopage(vma, address & PAGE_MASK, &ret); /* no page was available -- either SIGBUS or OOM */ if (new_page == NOPAGE_SIGBUS) @@ -1483,14 +1507,12 @@ pte_unmap(page_table); page_cache_release(new_page); spin_unlock(&mm->page_table_lock); - ret = VM_FAULT_MINOR; goto out; } /* no need to invalidate: a not-present page shouldn't be cached */ update_mmu_cache(vma, address, entry); spin_unlock(&mm->page_table_lock); - ret = VM_FAULT_MAJOR; goto out; oom: ret = VM_FAULT_OOM; --- diff/mm/mincore.c 2003-06-09 14:18:20.000000000 +0100 +++ source/mm/mincore.c 2003-11-26 10:09:08.000000000 +0000 @@ -26,7 +26,7 @@ unsigned long pgoff) { unsigned char present = 0; - struct address_space * as = vma->vm_file->f_dentry->d_inode->i_mapping; + struct address_space * as = vma->vm_file->f_mapping; struct page * page; page = find_get_page(as, pgoff); --- diff/mm/mmap.c 2003-10-27 09:20:44.000000000 +0000 +++ source/mm/mmap.c 2003-11-26 10:09:08.000000000 +0000 @@ -79,11 +79,10 @@ struct file *file = vma->vm_file; if (file) { - struct inode *inode = file->f_dentry->d_inode; - - down(&inode->i_mapping->i_shared_sem); - __remove_shared_vm_struct(vma, inode); - up(&inode->i_mapping->i_shared_sem); + struct address_space *mapping = file->f_mapping; + down(&mapping->i_shared_sem); + __remove_shared_vm_struct(vma, file->f_dentry->d_inode); + up(&mapping->i_shared_sem); } } @@ -234,11 +233,10 @@ file = vma->vm_file; if (file) { - struct inode * inode = file->f_dentry->d_inode; - struct address_space *mapping = inode->i_mapping; + struct address_space *mapping = file->f_mapping; if (vma->vm_flags & VM_DENYWRITE) - atomic_dec(&inode->i_writecount); + atomic_dec(&file->f_dentry->d_inode->i_writecount); if (vma->vm_flags & VM_SHARED) list_add_tail(&vma->shared, &mapping->i_mmap_shared); @@ -264,7 +262,7 @@ struct address_space *mapping = NULL; if (vma->vm_file) - mapping = vma->vm_file->f_dentry->d_inode->i_mapping; + mapping = vma->vm_file->f_mapping; if (mapping) down(&mapping->i_shared_sem); @@ -382,7 +380,7 @@ if (vm_flags & VM_SPECIAL) return 0; - i_shared_sem = file ? &inode->i_mapping->i_shared_sem : NULL; + i_shared_sem = file ? 
&file->f_mapping->i_shared_sem : NULL; if (!prev) { prev = rb_entry(rb_parent, struct vm_area_struct, vm_rb); @@ -1197,7 +1195,7 @@ new->vm_ops->open(new); if (vma->vm_file) - mapping = vma->vm_file->f_dentry->d_inode->i_mapping; + mapping = vma->vm_file->f_mapping; if (mapping) down(&mapping->i_shared_sem); --- diff/mm/msync.c 2003-05-21 11:50:10.000000000 +0100 +++ source/mm/msync.c 2003-11-26 10:09:08.000000000 +0000 @@ -146,20 +146,20 @@ ret = filemap_sync(vma, start, end-start, flags); if (!ret && (flags & MS_SYNC)) { - struct inode *inode = file->f_dentry->d_inode; + struct address_space *mapping = file->f_mapping; int err; - down(&inode->i_sem); - ret = filemap_fdatawrite(inode->i_mapping); + down(&mapping->host->i_sem); + ret = filemap_fdatawrite(mapping); if (file->f_op && file->f_op->fsync) { err = file->f_op->fsync(file,file->f_dentry,1); if (err && !ret) ret = err; } - err = filemap_fdatawait(inode->i_mapping); + err = filemap_fdatawait(mapping); if (!ret) ret = err; - up(&inode->i_sem); + up(&mapping->host->i_sem); } } return ret; --- diff/mm/oom_kill.c 2003-10-09 09:47:17.000000000 +0100 +++ source/mm/oom_kill.c 2003-11-26 10:09:08.000000000 +0000 @@ -24,20 +24,6 @@ /* #define DEBUG */ /** - * int_sqrt - oom_kill.c internal function, rough approximation to sqrt - * @x: integer of which to calculate the sqrt - * - * A very rough approximation to the sqrt() function. - */ -static unsigned int int_sqrt(unsigned int x) -{ - unsigned int out = x; - while (x & ~(unsigned int)1) x >>=2, out >>=1; - if (x) out -= out >> 2; - return (out ? out : 1); -} - -/** * oom_badness - calculate a numeric value for how bad this task has been * @p: task struct of which task we should calculate * @@ -57,7 +43,7 @@ static int badness(struct task_struct *p) { - int points, cpu_time, run_time; + int points, cpu_time, run_time, s; if (!p->mm) return 0; @@ -77,8 +63,12 @@ cpu_time = (p->utime + p->stime) >> (SHIFT_HZ + 3); run_time = (get_jiffies_64() - p->start_time) >> (SHIFT_HZ + 10); - points /= int_sqrt(cpu_time); - points /= int_sqrt(int_sqrt(run_time)); + s = int_sqrt(cpu_time); + if (s) + points /= s; + s = int_sqrt(int_sqrt(run_time)); + if (s) + points /= s; /* * Niced processes are most likely less important, so double --- diff/mm/page-writeback.c 2003-10-27 09:20:39.000000000 +0000 +++ source/mm/page-writeback.c 2003-11-26 10:09:08.000000000 +0000 @@ -28,6 +28,7 @@ #include <linux/smp.h> #include <linux/sysctl.h> #include <linux/cpu.h> +#include <linux/pagevec.h> /* * The maximum number of pages to writeout in a single bdflush/kupdate @@ -146,7 +147,7 @@ * If we're over `background_thresh' then pdflush is woken to perform some * writeout. 
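+ *
+ * With this patch the throttle can also return -EIOCBRETRY instead of
+ * blocking when called from aio context, i.e. when current->io_wait
+ * carries an async wait queue entry; callers propagate it so the
+ * write is retried from the iocb callback, as in the caller sketch:
+ *
+ *	status = balance_dirty_pages_ratelimited(mapping);
+ *	if (status < 0)
+ *		break;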
*/ -static void balance_dirty_pages(struct address_space *mapping) +static int balance_dirty_pages(struct address_space *mapping) { struct page_state ps; long nr_reclaimable; @@ -163,6 +164,7 @@ .sync_mode = WB_SYNC_NONE, .older_than_this = NULL, .nr_to_write = write_chunk, + .nonblocking = !is_sync_wait(current->io_wait) }; get_dirty_limits(&ps, &background_thresh, &dirty_thresh); @@ -189,7 +191,11 @@ if (pages_written >= write_chunk) break; /* We've done our duty */ } - blk_congestion_wait(WRITE, HZ/10); + if (-EIOCBRETRY == blk_congestion_wait_wq(WRITE, HZ/10, + current->io_wait)) { + pr_debug("async blk congestion wait\n"); + return -EIOCBRETRY; + } } if (nr_reclaimable + ps.nr_writeback <= dirty_thresh) @@ -197,6 +203,8 @@ if (!writeback_in_progress(bdi) && nr_reclaimable > background_thresh) pdflush_operation(background_writeout, 0); + + return 0; } /** @@ -212,7 +220,7 @@ * decrease the ratelimiting by a lot, to prevent individual processes from * overshooting the limit by (ratelimit_pages) each. */ -void balance_dirty_pages_ratelimited(struct address_space *mapping) +int balance_dirty_pages_ratelimited(struct address_space *mapping) { static DEFINE_PER_CPU(int, ratelimits) = 0; long ratelimit; @@ -228,10 +236,10 @@ if (get_cpu_var(ratelimits)++ >= ratelimit) { __get_cpu_var(ratelimits) = 0; put_cpu_var(ratelimits); - balance_dirty_pages(mapping); - return; + return balance_dirty_pages(mapping); } put_cpu_var(ratelimits); + return 0; } /* @@ -567,3 +575,152 @@ return 0; } EXPORT_SYMBOL(test_clear_page_dirty); + + +static ssize_t operate_on_page_range(struct address_space *mapping, + loff_t pos, size_t count, int (*operator)(struct page *)) +{ + pgoff_t first = pos >> PAGE_CACHE_SHIFT; + pgoff_t last = (pos + count - 1) >> PAGE_CACHE_SHIFT; /* inclusive */ + pgoff_t next = first, curr = first; + struct pagevec pvec; + ssize_t ret = 0, bytes = 0; + int i, nr; + + if (count == 0) + return 0; + + pagevec_init(&pvec, 0); + while ((nr = pagevec_lookup(&pvec, mapping, &next, + min((pgoff_t)PAGEVEC_SIZE, last - next + 1)))) { + for (i = 0; i < pagevec_count(&pvec); i++) { + struct page *page = pvec.pages[i]; + + curr = page->index; + if (page->mapping != mapping) /* truncated ?*/ { + curr = next; + break; + } else { + ret = (*operator)(page); + if (ret == -EIOCBRETRY) + break; + if (PageError(page)) { + if (!ret) + ret = -EIO; + } else + curr++; + } + } + pagevec_release(&pvec); + if ((ret == -EIOCBRETRY) || (next > last)) + break; + } + if (!nr) + curr = last + 1; + + bytes = (curr << PAGE_CACHE_SHIFT) - pos; + if (bytes > count) + bytes = count; + return (bytes && (!ret || (ret == -EIOCBRETRY))) ? 
bytes : ret; +} + +static int page_waiter(struct page *page) +{ + return wait_on_page_writeback_wq(page, current->io_wait); +} + +static size_t +wait_on_page_range(struct address_space *mapping, loff_t pos, size_t count) +{ + return operate_on_page_range(mapping, pos, count, page_waiter); +} + +static int page_writer(struct page *page) +{ + struct writeback_control wbc = { + .sync_mode = WB_SYNC_ALL, + .nr_to_write = 1, + }; + + lock_page(page); + if (!page->mapping) { /* truncated */ + unlock_page(page); + return 0; + } + if (!test_clear_page_dirty(page)) { + unlock_page(page); + return 0; + } + wait_on_page_writeback(page); + return page->mapping->a_ops->writepage(page, &wbc); +} + +static ssize_t +write_out_page_range(struct address_space *mapping, loff_t pos, size_t count) +{ + return operate_on_page_range(mapping, pos, count, page_writer); +} + +/* + * Write and wait upon all the pages in the passed range. This is a "data + * integrity" operation. It waits upon in-flight writeout before starting and + * waiting upon new writeout. If there was an IO error, return it. + * + * We need to re-take i_sem during the generic_osync_inode list walk because + * it is otherwise livelockable. + */ +ssize_t sync_page_range(struct inode *inode, struct address_space *mapping, + loff_t pos, size_t count) +{ + int ret = 0; + + if (in_aio()) { + /* Already issued writeouts for this iocb ? */ + if (kiocbTrySync(io_wait_to_kiocb(current->io_wait))) + goto do_wait; /* just need to check if done */ + } + if (!mapping->a_ops->writepage) + return 0; + if (mapping->backing_dev_info->memory_backed) + return 0; + ret = write_out_page_range(mapping, pos, count); + if (ret >= 0) { + down(&inode->i_sem); + ret = generic_osync_inode(inode, mapping, OSYNC_METADATA); + up(&inode->i_sem); + } +do_wait: + if (ret >= 0) + ret = wait_on_page_range(mapping, pos, count); + return ret; +} + +/* + * It is really better to use sync_page_range, rather than call + * sync_page_range_nolock while holding i_sem, if you don't + * want to block parallel O_SYNC writes until the pages in this + * range are written out. + */ +ssize_t sync_page_range_nolock(struct inode *inode, struct address_space + *mapping, loff_t pos, size_t count) +{ + ssize_t ret = 0; + + if (in_aio()) { + /* Already issued writeouts for this iocb ? */ + if (kiocbTrySync(io_wait_to_kiocb(current->io_wait))) + goto do_wait; /* just need to check if done */ + } + if (!mapping->a_ops->writepage) + return 0; + if (mapping->backing_dev_info->memory_backed) + return 0; + ret = write_out_page_range(mapping, pos, count); + if (ret >= 0) { + ret = generic_osync_inode(inode, mapping, OSYNC_METADATA); + } +do_wait: + if (ret >= 0) + ret = wait_on_page_range(mapping, pos, count); + return ret; +} --- diff/mm/page_alloc.c 2003-10-09 09:47:34.000000000 +0100 +++ source/mm/page_alloc.c 2003-11-26 10:09:08.000000000 +0000 @@ -672,6 +672,7 @@ printk("%s: page allocation failure." " order:%d, mode:0x%x\n", p->comm, order, gfp_mask); + dump_stack(); } return NULL; got_pg: @@ -1589,7 +1590,7 @@ * that the pages_{min,low,high} values for each zone are set correctly * with respect to min_free_kbytes. */ -void setup_per_zone_pages_min(void) +static void setup_per_zone_pages_min(void) { unsigned long pages_min = min_free_kbytes >> (PAGE_SHIFT - 10); unsigned long lowmem_pages = 0; @@ -1633,6 +1634,45 @@ } /* + * Initialise min_free_kbytes. + * + * For small machines we want it small (128k min). For large machines + * we want it large (16MB max). 
But it is not linear, because network + * bandwidth does not increase linearly with machine size. We use + * + * min_free_kbytes = sqrt(lowmem_kbytes) + * + * which yields + * + * 16MB: 128k + * 32MB: 181k + * 64MB: 256k + * 128MB: 362k + * 256MB: 512k + * 512MB: 724k + * 1024MB: 1024k + * 2048MB: 1448k + * 4096MB: 2048k + * 8192MB: 2896k + * 16384MB: 4096k + */ +static int __init init_per_zone_pages_min(void) +{ + unsigned long lowmem_kbytes; + + lowmem_kbytes = nr_free_buffer_pages() * (PAGE_SIZE >> 10); + + min_free_kbytes = int_sqrt(lowmem_kbytes); + if (min_free_kbytes < 128) + min_free_kbytes = 128; + if (min_free_kbytes > 16384) + min_free_kbytes = 16384; + setup_per_zone_pages_min(); + return 0; +} +module_init(init_per_zone_pages_min) + +/* * min_free_kbytes_sysctl_handler - just a wrapper around proc_dointvec() so * that we can call setup_per_zone_pages_min() whenever min_free_kbytes * changes. --- diff/mm/pdflush.c 2003-10-09 09:47:17.000000000 +0100 +++ source/mm/pdflush.c 2003-11-26 10:09:08.000000000 +0000 @@ -84,6 +84,8 @@ unsigned long when_i_went_to_sleep; }; +static int wakeup_count = 100; + static int __pdflush(struct pdflush_work *my_work) { daemonize("pdflush"); @@ -112,7 +114,10 @@ spin_lock_irq(&pdflush_lock); if (!list_empty(&my_work->list)) { - printk("pdflush: bogus wakeup!\n"); + if (wakeup_count > 0) { + wakeup_count--; + printk("pdflush: bogus wakeup!\n"); + } my_work->fn = NULL; continue; } @@ -182,6 +187,7 @@ { unsigned long flags; int ret = 0; + static int poke_count = 0; if (fn == NULL) BUG(); /* Hard to diagnose if it's deferred */ @@ -190,9 +196,19 @@ if (list_empty(&pdflush_list)) { spin_unlock_irqrestore(&pdflush_lock, flags); ret = -1; + if (wakeup_count < 100 && poke_count < 10) { + printk("%s: no threads\n", __FUNCTION__); + dump_stack(); + poke_count++; + } } else { struct pdflush_work *pdf; + if (wakeup_count < 100 && poke_count < 10) { + printk("%s: found a thread\n", __FUNCTION__); + dump_stack(); + poke_count++; + } pdf = list_entry(pdflush_list.next, struct pdflush_work, list); list_del_init(&pdf->list); if (list_empty(&pdflush_list)) --- diff/mm/readahead.c 2003-10-09 09:47:34.000000000 +0100 +++ source/mm/readahead.c 2003-11-26 10:09:08.000000000 +0000 @@ -347,6 +347,8 @@ unsigned min; unsigned orig_next_size; unsigned actual; + int first_access=0; + unsigned long preoffset=0; /* * Here we detect the case where the application is performing @@ -370,16 +372,18 @@ min = get_min_readahead(ra); orig_next_size = ra->next_size; - if (ra->next_size == 0 && offset == 0) { + if (ra->next_size == 0) { /* - * Special case - first read from first page. + * Special case - first read. * We'll assume it's a whole-file read, and * grow the window fast. */ + first_access=1; ra->next_size = max / 2; goto do_io; } + preoffset = ra->prev_page; ra->prev_page = offset; if (offset >= ra->start && offset <= (ra->start + ra->size)) { @@ -439,20 +443,44 @@ * ahead window and get some I/O underway for the new * current window. */ + if (!first_access && preoffset >= ra->start && + preoffset < (ra->start + ra->size)) { + /* Heuristic: If 'n' pages were + * accessed in the current window, there + * is a high probability that around 'n' pages + * shall be used in the next current window. + * + * To minimize lazy-readahead triggered + * in the next current window, read in + * an extra page. 
+ */ + ra->next_size = preoffset - ra->start + 2; + } ra->start = offset; ra->size = ra->next_size; ra->ahead_start = 0; /* Invalidate these */ ra->ahead_size = 0; actual = do_page_cache_readahead(mapping, filp, offset, ra->size); - check_ra_success(ra, ra->size, actual, orig_next_size); + if(!first_access) { + /* + * do not adjust the readahead window size the first + * time, the ahead window might get closed if all + * the pages are already in the cache. + */ + check_ra_success(ra, ra->size, actual, orig_next_size); + } } else { /* * This read request is within the current window. It is time * to submit I/O for the ahead window while the application is - * crunching through the current window. + * about to step into the ahead window. + * Heuristic: Defer reading the ahead window till we hit + * the last page in the current window. (lazy readahead) + * If we read in earlier we run the risk of wasting + * the ahead window. */ - if (ra->ahead_start == 0) { + if (ra->ahead_start == 0 && offset == (ra->start + ra->size -1)) { ra->ahead_start = ra->start + ra->size; ra->ahead_size = ra->next_size; actual = do_page_cache_readahead(mapping, filp, @@ -488,7 +516,7 @@ const unsigned long max = get_max_readahead(ra); if (offset != ra->prev_page + 1) { - ra->size = 0; /* Not sequential */ + ra->size = ra->size?ra->size-1:0; /* Not sequential */ } else { ra->size++; /* A sequential read */ if (ra->size >= max) { /* Resume readahead */ --- diff/mm/shmem.c 2003-10-27 09:20:44.000000000 +0000 +++ source/mm/shmem.c 2003-11-26 10:09:08.000000000 +0000 @@ -71,7 +71,7 @@ }; static int shmem_getpage(struct inode *inode, unsigned long idx, - struct page **pagep, enum sgp_type sgp); + struct page **pagep, enum sgp_type sgp, int *type); static inline struct page *shmem_dir_alloc(unsigned int gfp_mask) { @@ -540,7 +540,7 @@ if (attr->ia_size & (PAGE_CACHE_SIZE-1)) { (void) shmem_getpage(inode, attr->ia_size>>PAGE_CACHE_SHIFT, - &page, SGP_READ); + &page, SGP_READ, NULL); } /* * Reset SHMEM_PAGEIN flag so that shmem_truncate can @@ -765,7 +765,7 @@ * vm. 
If we swap it in we mark it dirty since we also free the swap * entry since a page cannot live in both the swap and page cache */ -static int shmem_getpage(struct inode *inode, unsigned long idx, struct page **pagep, enum sgp_type sgp) +static int shmem_getpage(struct inode *inode, unsigned long idx, struct page **pagep, enum sgp_type sgp, int *type) { struct address_space *mapping = inode->i_mapping; struct shmem_inode_info *info = SHMEM_I(inode); @@ -774,7 +774,7 @@ struct page *swappage; swp_entry_t *entry; swp_entry_t swap; - int error; + int error, majmin = VM_FAULT_MINOR; if (idx >= SHMEM_MAX_INDEX) return -EFBIG; @@ -811,6 +811,10 @@ if (!swappage) { shmem_swp_unmap(entry); spin_unlock(&info->lock); + /* here we actually do the io */ + if (majmin == VM_FAULT_MINOR && type) + inc_page_state(pgmajfault); + majmin = VM_FAULT_MAJOR; swapin_readahead(swap); swappage = read_swap_cache_async(swap); if (!swappage) { @@ -959,6 +963,8 @@ } else *pagep = ZERO_PAGE(0); } + if (type) + *type = majmin; return 0; failed: @@ -969,7 +975,7 @@ return error; } -struct page *shmem_nopage(struct vm_area_struct *vma, unsigned long address, int unused) +struct page *shmem_nopage(struct vm_area_struct *vma, unsigned long address, int *type) { struct inode *inode = vma->vm_file->f_dentry->d_inode; struct page *page = NULL; @@ -980,7 +986,7 @@ idx += vma->vm_pgoff; idx >>= PAGE_CACHE_SHIFT - PAGE_SHIFT; - error = shmem_getpage(inode, idx, &page, SGP_CACHE); + error = shmem_getpage(inode, idx, &page, SGP_CACHE, type); if (error) return (error == -ENOMEM)? NOPAGE_OOM: NOPAGE_SIGBUS; @@ -1007,7 +1013,7 @@ /* * Will need changing if PAGE_CACHE_SIZE != PAGE_SIZE */ - err = shmem_getpage(inode, pgoff, &page, sgp); + err = shmem_getpage(inode, pgoff, &page, sgp, NULL); if (err) return err; if (page) { @@ -1157,7 +1163,7 @@ shmem_prepare_write(struct file *file, struct page *page, unsigned offset, unsigned to) { struct inode *inode = page->mapping->host; - return shmem_getpage(inode, page->index, &page, SGP_WRITE); + return shmem_getpage(inode, page->index, &page, SGP_WRITE, NULL); } static ssize_t @@ -1180,7 +1186,7 @@ pos = *ppos; written = 0; - err = generic_write_checks(inode, file, &pos, &count, 0); + err = generic_write_checks(file, &pos, &count, 0); if (err || !count) goto out; @@ -1214,7 +1220,7 @@ * But it still may be a good idea to prefault below. 
*/ - err = shmem_getpage(inode, index, &page, SGP_WRITE); + err = shmem_getpage(inode, index, &page, SGP_WRITE, NULL); if (err) break; @@ -1296,7 +1302,7 @@ break; } - desc->error = shmem_getpage(inode, index, &page, SGP_READ); + desc->error = shmem_getpage(inode, index, &page, SGP_READ, NULL); if (desc->error) { if (desc->error == -EINVAL) desc->error = 0; @@ -1552,7 +1558,7 @@ iput(inode); return -ENOMEM; } - error = shmem_getpage(inode, 0, &page, SGP_WRITE); + error = shmem_getpage(inode, 0, &page, SGP_WRITE, NULL); if (error) { vm_unacct_memory(VM_ACCT(1)); iput(inode); @@ -1590,7 +1596,7 @@ static int shmem_readlink(struct dentry *dentry, char __user *buffer, int buflen) { struct page *page = NULL; - int res = shmem_getpage(dentry->d_inode, 0, &page, SGP_READ); + int res = shmem_getpage(dentry->d_inode, 0, &page, SGP_READ, NULL); if (res) return res; res = vfs_readlink(dentry, buffer, buflen, kmap(page)); @@ -1603,7 +1609,7 @@ static int shmem_follow_link(struct dentry *dentry, struct nameidata *nd) { struct page *page = NULL; - int res = shmem_getpage(dentry->d_inode, 0, &page, SGP_READ); + int res = shmem_getpage(dentry->d_inode, 0, &page, SGP_READ, NULL); if (res) return res; res = vfs_follow_link(nd, kmap(page)); @@ -1972,6 +1978,7 @@ inode->i_nlink = 0; /* It is unlinked */ file->f_vfsmnt = mntget(shm_mnt); file->f_dentry = dentry; + file->f_mapping = inode->i_mapping; file->f_op = &shmem_file_operations; file->f_mode = FMODE_WRITE | FMODE_READ; return(file); --- diff/mm/slab.c 2003-10-27 09:20:44.000000000 +0000 +++ source/mm/slab.c 2003-11-26 10:09:08.000000000 +0000 @@ -1180,7 +1180,8 @@ cachep = NULL; goto opps; } - slab_size = L1_CACHE_ALIGN(cachep->num*sizeof(kmem_bufctl_t)+sizeof(struct slab)); + slab_size = L1_CACHE_ALIGN(cachep->num*sizeof(kmem_bufctl_t) + + sizeof(struct slab)); /* * If the slab has been placed off-slab, and we have enough space then @@ -1224,10 +1225,13 @@ * the cache that's used by kmalloc(24), otherwise * the creation of further caches will BUG(). */ - cachep->array[smp_processor_id()] = &initarray_generic.cache; + cachep->array[smp_processor_id()] = + &initarray_generic.cache; g_cpucache_up = PARTIAL; } else { - cachep->array[smp_processor_id()] = kmalloc(sizeof(struct arraycache_init),GFP_KERNEL); + cachep->array[smp_processor_id()] = + kmalloc(sizeof(struct arraycache_init), + GFP_KERNEL); } BUG_ON(!ac_data(cachep)); ac_data(cachep)->avail = 0; @@ -1241,7 +1245,7 @@ } cachep->lists.next_reap = jiffies + REAPTIMEOUT_LIST3 + - ((unsigned long)cachep)%REAPTIMEOUT_LIST3; + ((unsigned long)cachep)%REAPTIMEOUT_LIST3; /* Need the semaphore to access the chain. */ down(&cache_chain_sem); @@ -1254,16 +1258,24 @@ list_for_each(p, &cache_chain) { kmem_cache_t *pc = list_entry(p, kmem_cache_t, next); char tmp; - /* This happens when the module gets unloaded and doesn't - destroy its slab cache and noone else reuses the vmalloc - area of the module. Print a warning. */ - if (__get_user(tmp,pc->name)) { - printk("SLAB: cache with size %d has lost its name\n", - pc->objsize); + + /* + * This happens when the module gets unloaded and + * doesn't destroy its slab cache and noone else reuses + * the vmalloc area of the module. Print a warning. 
+ */ +#ifdef CONFIG_X86_UACCESS_INDIRECT + if (__direct_get_user(tmp,pc->name)) { +#else + if (__get_user(tmp,pc->name)) { +#endif + printk("SLAB: cache with size %d has lost its " + "name\n", pc->objsize); continue; } if (!strcmp(pc->name,name)) { - printk("kmem_cache_create: duplicate cache %s\n",name); + printk("kmem_cache_create: duplicate " + "cache %s\n",name); up(&cache_chain_sem); BUG(); } @@ -1890,6 +1902,15 @@ *dbg_redzone1(cachep, objp) = RED_ACTIVE; *dbg_redzone2(cachep, objp) = RED_ACTIVE; } + { + int objnr; + struct slab *slabp; + + slabp = GET_PAGE_SLAB(virt_to_page(objp)); + + objnr = (objp - slabp->s_mem) / cachep->objsize; + slab_bufctl(slabp)[objnr] = (int)caller; + } objp += obj_dbghead(cachep); if (cachep->ctor && cachep->flags & SLAB_POISON) { unsigned long ctor_flags = SLAB_CTOR_CONSTRUCTOR; @@ -1951,12 +1972,14 @@ objnr = (objp - slabp->s_mem) / cachep->objsize; check_slabp(cachep, slabp); #if DEBUG +#if 0 if (slab_bufctl(slabp)[objnr] != BUFCTL_FREE) { printk(KERN_ERR "slab: double free detected in cache '%s', objp %p.\n", cachep->name, objp); BUG(); } #endif +#endif slab_bufctl(slabp)[objnr] = slabp->free; slabp->free = objnr; STATS_DEC_ACTIVE(cachep); @@ -2693,6 +2716,29 @@ .show = s_show, }; +static void do_dump_slabp(kmem_cache_t *cachep) +{ +#if DEBUG + struct list_head *q; + + check_irq_on(); + spin_lock_irq(&cachep->spinlock); + list_for_each(q,&cachep->lists.slabs_full) { + struct slab *slabp; + int i; + slabp = list_entry(q, struct slab, list); + for (i = 0; i < cachep->num; i++) { + unsigned long sym = slab_bufctl(slabp)[i]; + + printk("obj %p/%d: %p", slabp, i, (void *)sym); + print_symbol(" <%s>", sym); + printk("\n"); + } + } + spin_unlock_irq(&cachep->spinlock); +#endif +} + #define MAX_SLABINFO_WRITE 128 /** * slabinfo_write - Tuning for the slab allocator @@ -2733,9 +2779,11 @@ batchcount < 1 || batchcount > limit || shared < 0) { - res = -EINVAL; + do_dump_slabp(cachep); + res = 0; } else { - res = do_tune_cpucache(cachep, limit, batchcount, shared); + res = do_tune_cpucache(cachep, limit, + batchcount, shared); } break; } --- diff/mm/swap.c 2003-11-25 15:24:59.000000000 +0000 +++ source/mm/swap.c 2003-11-26 10:09:08.000000000 +0000 @@ -348,12 +348,15 @@ * The search returns a group of mapping-contiguous pages with ascending * indexes. There may be holes in the indices due to not-present pages. * - * pagevec_lookup() returns the number of pages which were found. + * pagevec_lookup() returns the number of pages which were found + * and also atomically sets the next offset to continue looking up + * mapping contiguous pages from (useful when doing a range of + * pagevec lookups in chunks of PAGEVEC_SIZE). 
*/ unsigned int pagevec_lookup(struct pagevec *pvec, struct address_space *mapping, - pgoff_t start, unsigned int nr_pages) + pgoff_t *next, unsigned int nr_pages) { - pvec->nr = find_get_pages(mapping, start, nr_pages, pvec->pages); + pvec->nr = find_get_pages(mapping, next, nr_pages, pvec->pages); return pagevec_count(pvec); } @@ -386,17 +389,19 @@ #ifdef CONFIG_SMP void percpu_counter_mod(struct percpu_counter *fbc, long amount) { + long count; + long *pcount; int cpu = get_cpu(); - long count = fbc->counters[cpu].count; - count += amount; + pcount = per_cpu_ptr(fbc->counters, cpu); + count = *pcount + amount; if (count >= FBC_BATCH || count <= -FBC_BATCH) { spin_lock(&fbc->lock); fbc->count += count; spin_unlock(&fbc->lock); count = 0; } - fbc->counters[cpu].count = count; + *pcount = count; put_cpu(); } EXPORT_SYMBOL(percpu_counter_mod); --- diff/mm/swapfile.c 2003-10-27 09:20:39.000000000 +0000 +++ source/mm/swapfile.c 2003-11-26 10:09:08.000000000 +0000 @@ -912,7 +912,7 @@ sector_t last_block; int ret; - inode = sis->swap_file->f_dentry->d_inode; + inode = sis->swap_file->f_mapping->host; if (S_ISBLK(inode->i_mode)) { ret = add_swap_extent(sis, 0, sis->max, 0); goto done; @@ -1031,13 +1031,13 @@ if (IS_ERR(victim)) goto out; - mapping = victim->f_dentry->d_inode->i_mapping; + mapping = victim->f_mapping; prev = -1; swap_list_lock(); for (type = swap_list.head; type >= 0; type = swap_info[type].next) { p = swap_info + type; if ((p->flags & SWP_ACTIVE) == SWP_ACTIVE) { - if (p->swap_file->f_dentry->d_inode->i_mapping==mapping) + if (p->swap_file->f_mapping == mapping) break; } prev = type; @@ -1099,13 +1099,12 @@ swap_device_unlock(p); swap_list_unlock(); vfree(swap_map); - if (S_ISBLK(swap_file->f_dentry->d_inode->i_mode)) { - struct block_device *bdev; - bdev = swap_file->f_dentry->d_inode->i_bdev; + if (S_ISBLK(mapping->host->i_mode)) { + struct block_device *bdev = I_BDEV(mapping->host); set_blocksize(bdev, p->old_block_size); bd_release(bdev); } else { - up(&swap_file->f_dentry->d_inode->i_mapping->host->i_sem); + up(&mapping->host->i_sem); } filp_close(swap_file, NULL); err = 0; @@ -1231,8 +1230,8 @@ int swapfilesize; unsigned short *swap_map; struct page *page = NULL; - struct inode *inode; - struct inode *downed_inode = NULL; + struct inode *inode = NULL; + int did_down = 0; if (!capable(CAP_SYS_ADMIN)) return -EPERM; @@ -1279,8 +1278,8 @@ } p->swap_file = swap_file; - inode = swap_file->f_dentry->d_inode; - mapping = swap_file->f_dentry->d_inode->i_mapping; + mapping = swap_file->f_mapping; + inode = mapping->host; error = -EBUSY; for (i = 0; i < nr_swapfiles; i++) { @@ -1288,32 +1287,32 @@ if (i == type || !q->swap_file) continue; - if (mapping == q->swap_file->f_dentry->d_inode->i_mapping) + if (mapping == q->swap_file->f_mapping) goto bad_swap; } error = -EINVAL; if (S_ISBLK(inode->i_mode)) { - bdev = inode->i_bdev; + bdev = I_BDEV(inode); error = bd_claim(bdev, sys_swapon); if (error < 0) { bdev = NULL; goto bad_swap; } p->old_block_size = block_size(bdev); - error = set_blocksize(inode->i_bdev, PAGE_SIZE); + error = set_blocksize(bdev, PAGE_SIZE); if (error < 0) goto bad_swap; p->bdev = bdev; } else if (S_ISREG(inode->i_mode)) { p->bdev = inode->i_sb->s_bdev; - downed_inode = mapping->host; - down(&downed_inode->i_sem); + down(&inode->i_sem); + did_down = 1; } else { goto bad_swap; } - swapfilesize = i_size_read(mapping->host) >> PAGE_SHIFT; + swapfilesize = i_size_read(inode) >> PAGE_SHIFT; /* * Read the swap header. 
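An aside on the pagevec_lookup() interface change above: find_get_pages() now advances *next past the last page it returned, so callers no longer recompute the continuation index from page->index by hand. A minimal sketch of the converted calling pattern, distilled from the mm/truncate.c hunks later in this patch (mapping and start stand in for the caller's own state):

	struct pagevec pvec;
	pgoff_t next = start;	/* advanced inside pagevec_lookup() */
	int i;

	pagevec_init(&pvec, 0);
	while (pagevec_lookup(&pvec, mapping, &next, PAGEVEC_SIZE)) {
		for (i = 0; i < pagevec_count(&pvec); i++) {
			struct page *page = pvec.pages[i];

			/* ... operate on page; no "next = page->index + 1" ... */
		}
		pagevec_release(&pvec);
	}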
@@ -1465,8 +1464,8 @@ } if (name) putname(name); - if (error && downed_inode) - up(&downed_inode->i_sem); + if (error && did_down) + up(&inode->i_sem); return error; } --- diff/mm/truncate.c 2003-10-09 09:47:34.000000000 +0100 +++ source/mm/truncate.c 2003-11-26 10:09:08.000000000 +0000 @@ -122,14 +122,10 @@ pagevec_init(&pvec, 0); next = start; - while (pagevec_lookup(&pvec, mapping, next, PAGEVEC_SIZE)) { + while (pagevec_lookup(&pvec, mapping, &next, PAGEVEC_SIZE)) { for (i = 0; i < pagevec_count(&pvec); i++) { struct page *page = pvec.pages[i]; - pgoff_t page_index = page->index; - if (page_index > next) - next = page_index; - next++; if (TestSetPageLocked(page)) continue; if (PageWriteback(page)) { @@ -155,7 +151,7 @@ next = start; for ( ; ; ) { - if (!pagevec_lookup(&pvec, mapping, next, PAGEVEC_SIZE)) { + if (!pagevec_lookup(&pvec, mapping, &next, PAGEVEC_SIZE)) { if (next == start) break; next = start; @@ -166,14 +162,19 @@ lock_page(page); wait_on_page_writeback(page); - if (page->index > next) - next = page->index; - next++; truncate_complete_page(mapping, page); unlock_page(page); } pagevec_release(&pvec); } + + if (lstart == 0) { + WARN_ON(mapping->nrpages); + WARN_ON(!list_empty(&mapping->clean_pages)); + WARN_ON(!list_empty(&mapping->dirty_pages)); + WARN_ON(!list_empty(&mapping->locked_pages)); + WARN_ON(!list_empty(&mapping->io_pages)); + } } EXPORT_SYMBOL(truncate_inode_pages); @@ -201,17 +202,13 @@ pagevec_init(&pvec, 0); while (next <= end && - pagevec_lookup(&pvec, mapping, next, PAGEVEC_SIZE)) { + pagevec_lookup(&pvec, mapping, &next, PAGEVEC_SIZE)) { for (i = 0; i < pagevec_count(&pvec); i++) { struct page *page = pvec.pages[i]; if (TestSetPageLocked(page)) { - next++; continue; } - if (page->index > next) - next = page->index; - next++; if (PageDirty(page) || PageWriteback(page)) goto unlock; if (page_mapped(page)) @@ -250,14 +247,13 @@ int i; pagevec_init(&pvec, 0); - while (pagevec_lookup(&pvec, mapping, next, PAGEVEC_SIZE)) { + while (pagevec_lookup(&pvec, mapping, &next, PAGEVEC_SIZE)) { for (i = 0; i < pagevec_count(&pvec); i++) { struct page *page = pvec.pages[i]; lock_page(page); if (page->mapping == mapping) { /* truncate race? */ wait_on_page_writeback(page); - next = page->index + 1; if (page_mapped(page)) clear_page_dirty(page); else --- diff/mm/vmscan.c 2003-10-09 09:47:34.000000000 +0100 +++ source/mm/vmscan.c 2003-11-26 10:09:08.000000000 +0000 @@ -779,7 +779,7 @@ count = atomic_read(&zone->refill_counter); if (count > SWAP_CLUSTER_MAX * 4) count = SWAP_CLUSTER_MAX * 4; - atomic_sub(count, &zone->refill_counter); + atomic_set(&zone->refill_counter, 0); refill_inactive_zone(zone, count, ps, priority); } return shrink_cache(nr_pages, zone, gfp_mask, --- diff/net/core/dev.c 2003-11-25 15:24:59.000000000 +0000 +++ source/net/core/dev.c 2003-11-26 10:09:08.000000000 +0000 @@ -111,6 +111,10 @@ #endif /* CONFIG_NET_RADIO */ #include <asm/current.h> +#ifdef CONFIG_KGDB +#include <asm/kgdb.h> +#endif + /* This define, if set, will randomly drop a packet when congestion * is more than moderate. It helps fairness in the multi-interface * case when one of them is a hog, but it kills performance for the @@ -1380,7 +1384,6 @@ } #endif - /** * netif_rx - post buffer to the network code * @skb: buffer to post @@ -1405,6 +1408,21 @@ struct softnet_data *queue; unsigned long flags; +#ifdef CONFIG_KGDB + /* See if kgdb_eth wants this packet */ + if (!kgdb_net_interrupt(skb)) { + /* No.. 
if we're 'trapped' then junk it */ + if (kgdb_eth_is_trapped()) { + kfree_skb(skb); + return NET_RX_DROP; + } + } else { + /* kgdb_eth ate the packet... drop it silently */ + kfree_skb(skb); + return NET_RX_DROP; + } +#endif + if (!skb->stamp.tv_sec) do_gettimeofday(&skb->stamp); --- diff/net/econet/af_econet.c 2003-10-09 09:47:34.000000000 +0100 +++ source/net/econet/af_econet.c 2003-11-26 10:09:08.000000000 +0000 @@ -1041,12 +1041,15 @@ if (!sk) goto drop; - return ec_queue_packet(sk, skb, edev->net, hdr->src_stn, hdr->cb, - hdr->port); + if (ec_queue_packet(sk, skb, edev->net, hdr->src_stn, hdr->cb, + hdr->port)) + goto drop; + + return 0; drop: kfree_skb(skb); - return 0; + return NET_RX_DROP; } static struct packet_type econet_packet_type = { --- diff/net/ipv6/mcast.c 2003-11-25 15:24:59.000000000 +0000 +++ source/net/ipv6/mcast.c 2003-11-26 10:09:08.000000000 +0000 @@ -47,6 +47,9 @@ #include <linux/proc_fs.h> #include <linux/seq_file.h> +#include <linux/netfilter.h> +#include <linux/netfilter_ipv6.h> + #include <net/sock.h> #include <net/snmp.h> @@ -1270,6 +1273,7 @@ struct mld2_report *pmr = (struct mld2_report *)skb->h.raw; int payload_len, mldlen; struct inet6_dev *idev = in6_dev_get(skb->dev); + int err; payload_len = skb->tail - (unsigned char *)skb->nh.ipv6h - sizeof(struct ipv6hdr); @@ -1278,8 +1282,10 @@ pmr->csum = csum_ipv6_magic(&pip6->saddr, &pip6->daddr, mldlen, IPPROTO_ICMPV6, csum_partial(skb->h.raw, mldlen, 0)); - dev_queue_xmit(skb); - ICMP6_INC_STATS(idev,Icmp6OutMsgs); + err = NF_HOOK(PF_INET6, NF_IP6_LOCAL_OUT, skb, NULL, skb->dev, + dev_queue_xmit); + if (!err) + ICMP6_INC_STATS(idev,Icmp6OutMsgs); if (likely(idev != NULL)) in6_dev_put(idev); } @@ -1608,12 +1614,15 @@ idev = in6_dev_get(skb->dev); - dev_queue_xmit(skb); - if (type == ICMPV6_MGM_REDUCTION) - ICMP6_INC_STATS(idev, Icmp6OutGroupMembReductions); - else - ICMP6_INC_STATS(idev, Icmp6OutGroupMembResponses); - ICMP6_INC_STATS(idev, Icmp6OutMsgs); + err = NF_HOOK(PF_INET6, NF_IP6_LOCAL_OUT, skb, NULL, skb->dev, + dev_queue_xmit); + if (!err) { + if (type == ICMPV6_MGM_REDUCTION) + ICMP6_INC_STATS(idev, Icmp6OutGroupMembReductions); + else + ICMP6_INC_STATS(idev, Icmp6OutGroupMembResponses); + ICMP6_INC_STATS(idev, Icmp6OutMsgs); + } if (likely(idev != NULL)) in6_dev_put(idev); --- diff/net/socket.c 2003-11-25 15:24:59.000000000 +0000 +++ source/net/socket.c 2003-11-26 10:09:09.000000000 +0000 @@ -394,6 +394,7 @@ file->f_dentry->d_op = &sockfs_dentry_operations; d_add(file->f_dentry, SOCK_INODE(sock)); file->f_vfsmnt = mntget(sock_mnt); + file->f_mapping = file->f_dentry->d_inode->i_mapping; sock->file = file; file->f_op = SOCK_INODE(sock)->i_fop = &socket_file_ops; --- diff/sound/core/pcm_native.c 2003-10-27 09:20:44.000000000 +0000 +++ source/sound/core/pcm_native.c 2003-11-26 10:09:09.000000000 +0000 @@ -2779,7 +2779,7 @@ return mask; } -static struct page * snd_pcm_mmap_status_nopage(struct vm_area_struct *area, unsigned long address, int no_share) +static struct page * snd_pcm_mmap_status_nopage(struct vm_area_struct *area, unsigned long address, int *type) { snd_pcm_substream_t *substream = (snd_pcm_substream_t *)area->vm_private_data; snd_pcm_runtime_t *runtime; @@ -2791,6 +2791,8 @@ page = virt_to_page(runtime->status); if (!PageReserved(page)) get_page(page); + if (type) + *type = VM_FAULT_MINOR; return page; } @@ -2817,7 +2819,7 @@ return 0; } -static struct page * snd_pcm_mmap_control_nopage(struct vm_area_struct *area, unsigned long address, int no_share) +static struct page * 
snd_pcm_mmap_control_nopage(struct vm_area_struct *area, unsigned long address, int *type) { snd_pcm_substream_t *substream = (snd_pcm_substream_t *)area->vm_private_data; snd_pcm_runtime_t *runtime; @@ -2829,6 +2831,8 @@ page = virt_to_page(runtime->control); if (!PageReserved(page)) get_page(page); + if (type) + *type = VM_FAULT_MINOR; return page; } @@ -2867,7 +2871,7 @@ atomic_dec(&substream->runtime->mmap_count); } -static struct page * snd_pcm_mmap_data_nopage(struct vm_area_struct *area, unsigned long address, int no_share) +static struct page * snd_pcm_mmap_data_nopage(struct vm_area_struct *area, unsigned long address, int *type) { snd_pcm_substream_t *substream = (snd_pcm_substream_t *)area->vm_private_data; snd_pcm_runtime_t *runtime; @@ -2895,6 +2899,8 @@ } if (!PageReserved(page)) get_page(page); + if (type) + *type = VM_FAULT_MINOR; return page; } --- diff/sound/i2c/i2c.c 2003-06-09 14:18:21.000000000 +0100 +++ source/sound/i2c/i2c.c 2003-11-26 10:09:09.000000000 +0000 @@ -84,7 +84,7 @@ bus = (snd_i2c_bus_t *)snd_magic_kcalloc(snd_i2c_bus_t, 0, GFP_KERNEL); if (bus == NULL) return -ENOMEM; - spin_lock_init(&bus->lock); + init_MUTEX(&bus->lock_mutex); INIT_LIST_HEAD(&bus->devices); INIT_LIST_HEAD(&bus->buses); bus->card = card; --- diff/sound/oss/Kconfig 2003-10-27 09:20:39.000000000 +0000 +++ source/sound/oss/Kconfig 2003-11-26 10:09:09.000000000 +0000 @@ -25,7 +25,7 @@ depends on SOUND_PRIME!=n && SOUND && PCI help Say Y or M if you have a PCI sound card using the CMI8338 - or the CMI8378 chipset. Data on these chips are available at + or the CMI8738 chipset. Data on these chips are available at <http://www.cmedia.com.tw/>. A userspace utility to control some internal registers of these --- diff/sound/oss/cmpci.c 2003-09-17 12:28:13.000000000 +0100 +++ source/sound/oss/cmpci.c 2003-11-26 10:09:09.000000000 +0000 @@ -2876,7 +2876,6 @@ void initialize_chip(struct pci_dev *pcidev) { struct cm_state *s; - mm_segment_t fs; int i, val; #if defined(CONFIG_SOUND_CMPCI_MIDI) || defined(CONFIG_SOUND_CMPCI_FM) unsigned char reg_mask = 0; @@ -3038,8 +3037,6 @@ #endif pci_set_master(pcidev); /* enable bus mastering */ /* initialize the chips */ - fs = get_fs(); - set_fs(KERNEL_DS); /* set mixer output */ frobindir(s, DSP_MIX_OUTMIXIDX, 0x1f, 0x1f); /* set mixer input */ --- diff/sound/oss/emu10k1/audio.c 2003-09-17 12:28:13.000000000 +0100 +++ source/sound/oss/emu10k1/audio.c 2003-11-26 10:09:09.000000000 +0000 @@ -989,7 +989,7 @@ return 0; } -static struct page *emu10k1_mm_nopage (struct vm_area_struct * vma, unsigned long address, int write_access) +static struct page *emu10k1_mm_nopage (struct vm_area_struct * vma, unsigned long address, int *type) { struct emu10k1_wavedevice *wave_dev = vma->vm_private_data; struct woinst *woinst = wave_dev->woinst; @@ -1032,6 +1032,8 @@ get_page (dmapage); DPD(3, "page: %#lx\n", (unsigned long) dmapage); + if (type) + *type = VM_FAULT_MINOR; return dmapage; } --- diff/sound/oss/via82cxxx_audio.c 2003-10-27 09:20:39.000000000 +0000 +++ source/sound/oss/via82cxxx_audio.c 2003-11-26 10:09:09.000000000 +0000 @@ -2116,7 +2116,7 @@ static struct page * via_mm_nopage (struct vm_area_struct * vma, - unsigned long address, int write_access) + unsigned long address, int *type) { struct via_info *card = vma->vm_private_data; struct via_channel *chan = &card->ch_out; @@ -2124,12 +2124,11 @@ unsigned long pgoff; int rd, wr; - DPRINTK ("ENTER, start %lXh, ofs %lXh, pgoff %ld, addr %lXh, wr %d\n", + DPRINTK ("ENTER, start %lXh, ofs %lXh, pgoff %ld, addr %lXh\n", 
vma->vm_start, address - vma->vm_start, (address - vma->vm_start) >> PAGE_SHIFT, - address, - write_access); + address); if (address > vma->vm_end) { DPRINTK ("EXIT, returning NOPAGE_SIGBUS\n"); @@ -2167,6 +2166,8 @@ DPRINTK ("EXIT, returning page %p for cpuaddr %lXh\n", dmapage, (unsigned long) chan->pgtbl[pgoff].cpuaddr); get_page (dmapage); + if (type) + *type = VM_FAULT_MINOR; return dmapage; } --- diff/sound/pci/intel8x0.c 2003-10-09 09:47:17.000000000 +0100 +++ source/sound/pci/intel8x0.c 2003-11-26 10:09:09.000000000 +0000 @@ -2271,10 +2271,8 @@ t = stop_time.tv_sec - start_time.tv_sec; t *= 1000000; - if (stop_time.tv_usec < start_time.tv_usec) - t -= start_time.tv_usec - stop_time.tv_usec; - else - t += stop_time.tv_usec - start_time.tv_usec; + t += stop_time.tv_usec - start_time.tv_usec; + printk(KERN_INFO "%s: measured %lu usecs\n", __FUNCTION__, t); if (t == 0) { snd_printk(KERN_ERR "?? calculation error..\n"); return; --- diff/usr/gen_init_cpio.c 2003-09-30 15:46:21.000000000 +0100 +++ source/usr/gen_init_cpio.c 2003-11-26 10:09:09.000000000 +0000 @@ -197,6 +197,7 @@ for (i = 0; i < buf.st_size; ++i) fputc(filebuf[i], stdout); + offset += buf.st_size; close(file); free(filebuf); push_pad(); --- diff/Documentation/MSI-HOWTO.txt 1970-01-01 01:00:00.000000000 +0100 +++ source/Documentation/MSI-HOWTO.txt 2003-11-26 10:09:05.000000000 +0000 @@ -0,0 +1,321 @@ + The MSI Driver Guide HOWTO + Tom L Nguyen tom.l.nguyen@intel.com + 10/03/2003 + +1. About this guide + +This guide describes the basics of Message Signaled Interrupts(MSI), the +advantages of using MSI over traditional interrupt mechanisms, and how +to enable your driver to use MSI or MSI-X. Also included is a Frequently +Asked Questions. + +2. Copyright 2003 Intel Corporation + +3. What is MSI/MSI-X? + +Message Signaled Interrupt (MSI), as described in the PCI Local Bus +Specification Revision 2.3 or latest, is an optional feature, and a +required feature for PCI Express devices. MSI enables a device function +to request service by sending an Inbound Memory Write on its PCI bus to +the FSB as a Message Signal Interrupt transaction. Because MSI is +generated in the form of a Memory Write, all transaction conditions, +such as a Retry, Master-Abort, Target-Abort or normal completion, are +supported. + +A PCI device that supports MSI must also support pin IRQ assertion +interrupt mechanism to provide backward compatibility for systems that +do not support MSI. In Systems, which support MSI, the bus driver is +responsible for initializing the message address and message data of +the device function's MSI/MSI-X capability structure during device +initial configuration. + +An MSI capable device function indicates MSI support by implementing +the MSI/MSI-X capability structure in its PCI capability list. The +device function may implement both the MSI capability structure and +the MSI-X capability structure; however, the bus driver should not +enable both, but instead enable only the MSI-X capability structure. + +The MSI capability structure contains Message Control register, +Message Address register and Message Data register. These registers +provide the bus driver control over MSI. The Message Control register +indicates the MSI capability supported by the device. The Message +Address register specifies the target address and the Message Data +register specifies the characteristics of the message. To request +service, the device function writes the content of the Message Data +register to the target address. 
The device and its software driver
+are prohibited from writing to these registers.
+
+The MSI-X capability structure is an optional extension to MSI. It
+uses an independent and separate capability structure. There are
+some key advantages to implementing the MSI-X capability structure
+over the MSI capability structure, as described below.
+
+ - Support for a larger maximum number of vectors per function.
+
+ - The ability for system software to configure
+   each vector with an independent message address and message
+   data, specified by a table that resides in Memory Space.
+
+ - MSI and MSI-X both support per-vector masking. Per-vector
+   masking is an optional extension of MSI but a required
+   feature for MSI-X. Per-vector masking provides the kernel
+   the ability to mask/unmask an MSI when servicing its software
+   interrupt service routine. If per-vector masking is
+   not supported, then the device driver must provide the
+   hardware/software synchronization to ensure that the device
+   generates an MSI only when the driver wants it to do so.
+
+4. Why use MSI?
+
+By simplifying board design, MSI allows board
+designers to remove out-of-band interrupt routing. MSI is another
+step towards a legacy-free environment.
+
+Due to increasing pressure on chipset and processor packages to
+reduce pin count, the need for interrupt pins is expected to
+diminish over time. Devices, due to pin constraints, may implement
+messages to increase performance.
+
+PCI Express endpoints use INTx emulation (in-band messages) instead
+of IRQ pin assertion. Using INTx emulation requires interrupt
+sharing among devices connected to the same node (PCI bridge), while
+MSI is unique (non-shared) and does not require BIOS configuration
+support. As a result, the PCI Express technology requires MSI
+support for better interrupt performance.
+
+Using MSI enables a device function to support two or more
+vectors, which can be configured to target different CPUs to
+increase scalability.
+
+5. Configuring a driver to use MSI/MSI-X
+
+By default, the kernel will not enable MSI/MSI-X on any device that
+supports this capability once the patch is installed. A kernel
+configuration option must be selected to enable MSI/MSI-X support.
+
+5.1 Including MSI support into the kernel
+
+Including MSI support in the kernel requires users to apply the
+VECTOR-base patch first and then the MSI patch, because the MSI
+support needs the VECTOR-based scheme. Once these patches are
+installed, setting CONFIG_PCI_USE_VECTOR enables the VECTOR-based
+scheme and the option for MSI-capable device drivers to selectively
+enable MSI (using pci_enable_msi as described below).
+
+Since the target of the inbound message is the local APIC, providing
+CONFIG_PCI_USE_VECTOR is dependent on whether CONFIG_X86_LOCAL_APIC
+is enabled or not.
+
+int pci_enable_msi(struct pci_dev *)
+
+With this new API, any existing device driver that would like to have
+MSI enabled on its device function must call this explicitly. A
+successful call will initialize the MSI/MSI-X capability structure
+with ONE vector, regardless of whether the device function is
+capable of supporting multiple messages. This vector replaces the
+pre-assigned dev->irq with a new MSI vector. To avoid a conflict
+between the newly assigned vector and the existing pre-assigned one,
+the device driver must call this API before calling request_irq(...).
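+
+As an illustration, here is a minimal sketch of that call sequence
+for a hypothetical driver (the mydev names are invented and the
+error handling is simplified):
+
+	#include <linux/pci.h>
+	#include <linux/interrupt.h>
+
+	/* hypothetical handler for a hypothetical device */
+	static irqreturn_t mydev_interrupt(int irq, void *dev_id,
+					   struct pt_regs *regs)
+	{
+		/* ... service the device ... */
+		return IRQ_HANDLED;
+	}
+
+	static int mydev_setup_irq(struct pci_dev *dev)
+	{
+		/*
+		 * Try MSI first.  On success, dev->irq has been
+		 * replaced with the new MSI vector; on failure, the
+		 * device keeps its pre-assigned pin-IRQ vector.
+		 */
+		if (pci_enable_msi(dev) != 0)
+			printk(KERN_INFO "mydev: MSI unavailable, "
+					 "staying in pin-IRQ mode\n");
+
+		/* must come after pci_enable_msi(), never before */
+		return request_irq(dev->irq, mydev_interrupt, SA_SHIRQ,
+				   "mydev", dev);
+	}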
+
+The diagram below shows the events that switch the interrupt
+mode of an MSI-capable device function between MSI mode and
+PIN-IRQ assertion mode.
+
+	 ------------    pci_enable_msi     ------------------------
+	|            |  <===============   |                        |
+	|  MSI MODE  |                     | PIN-IRQ ASSERTION MODE |
+	|            |  ===============>   |                        |
+	 ------------       free_irq        ------------------------
+
+5.2 Configuring for MSI support
+
+Due to the non-contiguous fashion of vector assignment in the
+existing Linux kernel, this patch does not support multiple
+messages, regardless of whether the device function is capable of
+supporting more than one vector. The bus driver initializes only
+entry 0 of this capability structure if pci_enable_msi(...) is
+called successfully by the device driver.
+
+5.3 Configuring for MSI-X support
+
+Both the MSI capability structure and the MSI-X capability structure
+share the semantics described above; however, because system
+software can configure each vector of the MSI-X capability
+structure with an independent message address and message data, the
+non-contiguous fashion of vector assignment in the existing Linux
+kernel has no impact on supporting multiple messages on MSI-X
+capable device functions. By default, as mentioned above, ONE vector
+is always allocated to the MSI-X capability structure at entry 0.
+The bus driver does not initialize the other entries of the
+MSI-X table.
+
+Note that the PCI subsystem should have full control of an MSI-X
+table that resides in Memory Space. The software device driver
+should not access this table.
+
+To request additional vectors, the device's software driver should
+call msi_alloc_vectors(). It is recommended that the driver call
+this function once, during the initialization phase of the device
+driver.
+
+The function msi_alloc_vectors(), once invoked, enables either
+all of the requested vectors or none of them, depending on the
+current availability of vector resources. If no vector resources
+are available, the device function still works with ONE vector.
+If vector resources are available for the number of vectors
+requested by the driver, this function reconfigures the MSI-X
+capability structure of the device with additional messages,
+starting from entry 1. For example, the device may be capable of
+supporting a maximum of 32 vectors while its software driver
+requests only 4 of them.
+
+After this call succeeds, the device driver is responsible for
+calling request_irq(), enable_irq(), etc. for each vector, to hook
+it up to its corresponding interrupt service handler. It is the
+device driver's choice to have all vectors share the same interrupt
+service handler or to give each vector a unique interrupt service
+handler.
+
+In addition to the function msi_alloc_vectors(), another function
+msi_free_vectors() is provided to allow the software driver to
+release a number of vectors back to the vector resource pool. Once
+invoked, the PCI subsystem disables (masks) each vector released.
+These vectors are no longer valid for the hardware device or its
+software driver to use. As with free_irq(), it is recommended that
+the device driver call msi_free_vectors() to release all
+additional vectors it previously requested.
+
+int msi_alloc_vectors(struct pci_dev *dev, int *vector, int nvec)
+
+This API enables the software driver to request additional messages
+from the PCI subsystem. Depending on the number of vectors
+available, the PCI subsystem enables either all of them or none.
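+
+Before the argument-by-argument description, a short sketch of how
+a driver might use this pair of calls (the mydev names and the
+vector count are invented; unwinding of partially requested IRQs is
+elided):
+
+	#define MYDEV_NR_VECTORS	4
+
+	static int mydev_setup_msix(struct pci_dev *dev)
+	{
+		int vector[MYDEV_NR_VECTORS];
+		int i;
+
+		/*
+		 * All or nothing: a non-zero return means the device
+		 * keeps working with the ONE vector at entry 0.
+		 */
+		if (msi_alloc_vectors(dev, vector, MYDEV_NR_VECTORS) != 0)
+			return 0;
+
+		for (i = 0; i < MYDEV_NR_VECTORS; i++) {
+			/* reuses mydev_interrupt from the sketch above */
+			if (request_irq(vector[i], mydev_interrupt,
+					SA_SHIRQ, "mydev", dev) != 0) {
+				/* give the whole set back */
+				msi_free_vectors(dev, vector,
+						 MYDEV_NR_VECTORS);
+				return -EBUSY;
+			}
+		}
+		return 0;
+	}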
+
+Argument dev points to the device (pci_dev) structure.
+Argument vector is a pointer of integer type. The number of
+elements is indicated in argument nvec.
+Argument nvec is an integer indicating the number of messages
+requested.
+A return of zero indicates that the requested vectors were
+successfully allocated. Otherwise, the resources are not
+available.
+
+int msi_free_vectors(struct pci_dev* dev, int *vector, int nvec)
+
+This API enables the software driver to inform the PCI subsystem
+that it is willing to release a number of vectors back to the
+MSI resource pool. Once invoked, the PCI subsystem disables each
+MSI-X entry associated with each vector stored in the second
+argument. These vectors are no longer valid for the hardware device
+or its software driver to use.
+
+Argument dev points to the device (pci_dev) structure.
+Argument vector is a pointer of integer type. The number of
+elements is indicated in argument nvec.
+Argument nvec is an integer indicating the number of messages
+released.
+A return of zero indicates that the vectors were successfully
+released. Otherwise, it indicates a failure.
+
+5.4 Hardware requirements for MSI support
+MSI support requires support from both system hardware and
+individual hardware device functions.
+
+5.4.1 System hardware support
+Since the target of the MSI address is the CPU's local APIC,
+enabling MSI support in the Linux kernel depends on whether the
+existing system hardware supports a local APIC. Users should verify
+that their system runs correctly with CONFIG_X86_LOCAL_APIC=y.
+
+In an SMP environment, CONFIG_X86_LOCAL_APIC is automatically set;
+however, in a UP environment, users must manually set
+CONFIG_X86_LOCAL_APIC. Once CONFIG_X86_LOCAL_APIC=y, setting
+CONFIG_PCI_USE_VECTOR enables the VECTOR-based scheme and
+the option for MSI-capable device drivers to selectively enable
+MSI (using pci_enable_msi as described below).
+
+Note that the CONFIG_X86_IO_APIC setting is irrelevant, because an
+MSI vector is allocated anew at runtime and MSI support does not
+depend on BIOS support. This key independence enables MSI support
+on future IOxAPIC-free platforms.
+
+5.4.2 Device hardware support
+The hardware device function supports MSI by indicating the
+MSI/MSI-X capability structure in its PCI capability list. By
+default, this capability structure will not be initialized by
+the kernel to enable MSI during system boot. In other words,
+the device function runs in its default pin-assertion mode.
+Note that in many cases hardware supporting MSI has bugs,
+which may result in a system hang. The software driver of specific
+MSI-capable hardware is responsible for deciding whether to call
+pci_enable_msi. A return of zero indicates that the kernel
+successfully initialized the MSI/MSI-X capability structure of the
+device function. The device function is now running in MSI mode.
+
+5.5 How to tell whether MSI is enabled on a device function
+
+At the driver level, a return of zero from pci_enable_msi(...)
+indicates to the device driver that its device function is
+initialized successfully and ready to run in MSI mode.
+
+At the user level, users can run 'cat /proc/interrupts'
+to display the vectors allocated for the device and its interrupt
+mode, as shown below.
+ + CPU0 CPU1 + 0: 324639 0 IO-APIC-edge timer + 1: 1186 0 IO-APIC-edge i8042 + 2: 0 0 XT-PIC cascade + 12: 2797 0 IO-APIC-edge i8042 + 14: 6543 0 IO-APIC-edge ide0 + 15: 1 0 IO-APIC-edge ide1 +169: 0 0 IO-APIC-level uhci-hcd +185: 0 0 IO-APIC-level uhci-hcd +193: 138 10 PCI MSI aic79xx +201: 30 0 PCI MSI aic79xx +225: 30 0 IO-APIC-level aic7xxx +233: 30 0 IO-APIC-level aic7xxx +NMI: 0 0 +LOC: 324553 325068 +ERR: 0 +MIS: 0 + +6. FAQ + +Q1. Are there any limitations on using the MSI? + +A1. If the PCI device supports MSI and conforms to the +specification and the platform supports the APIC local bus, +then using MSI should work. + +Q2. Will it work on all the Pentium processors (P3, P4, Xeon, +AMD processors)? In P3 IPI's are transmitted on the APIC local +bus and in P4 and Xeon they are transmitted on the system +bus. Are there any implications with this? + +A2. MSI support enables a PCI device sending an inbound +memory write (0xfeexxxxx as target address) on its PCI bus +directly to the FSB. Since the message address has a +redirection hint bit cleared, it should work. + +Q3. The target address 0xfeexxxxx will be translated by the +Host Bridge into an interrupt message. Are there any +limitations on the chipsets such as Intel 8xx, Intel e7xxx, +or VIA? + +A3. If these chipsets support an inbound memory write with +target address set as 0xfeexxxxx, as conformed to PCI +specification 2.3 or latest, then it should work. + +Q4. From the driver point of view, if the MSI is lost because +of the errors occur during inbound memory write, then it may +wait for ever. Is there a mechanism for it to recover? + +A4. Since the target of the transaction is an inbound memory +write, all transaction termination conditions (Retry, +Master-Abort, Target-Abort, or normal completion) are +supported. A device sending an MSI must abide by all the PCI +rules and conditions regarding that inbound memory write. So, +if a retry is signaled it must retry, etc... We believe that +the recommendation for Abort is also a retry (refer to PCI +specification 2.3 or latest). --- diff/Documentation/i386/kgdb/andthen 1970-01-01 01:00:00.000000000 +0100 +++ source/Documentation/i386/kgdb/andthen 2003-11-26 10:09:05.000000000 +0000 @@ -0,0 +1,100 @@ + +define set_andthen + set var $thp=0 + set var $thp=(struct kgdb_and_then_struct *)&kgdb_data[0] + set var $at_size = (sizeof kgdb_data)/(sizeof *$thp) + set var $at_oc=kgdb_and_then_count + set var $at_cc=$at_oc +end + +define andthen_next + set var $at_cc=$arg0 +end + +define andthen + andthen_set_edge + if ($at_cc >= $at_oc) + printf "Outside window. Window size is %d\n",($at_oc-$at_low) + else + printf "%d: ",$at_cc + output *($thp+($at_cc++ % $at_size )) + printf "\n" + end +end +define andthen_set_edge + set var $at_oc=kgdb_and_then_count + set var $at_low = $at_oc - $at_size + if ($at_low < 0 ) + set var $at_low = 0 + end + if (( $at_cc > $at_oc) || ($at_cc < $at_low)) + printf "Count outside of window, setting count to " + if ($at_cc >= $at_oc) + set var $at_cc = $at_oc + else + set var $at_cc = $at_low + end + printf "%d\n",$at_cc + end +end + +define beforethat + andthen_set_edge + if ($at_cc <= $at_low) + printf "Outside window. Window size is %d\n",($at_oc-$at_low) + else + printf "%d: ",$at_cc-1 + output *($thp+(--$at_cc % $at_size )) + printf "\n" + end +end + +document andthen_next + andthen_next <count> + . sets the number of the event to display next. If this event + . is not in the event pool, either andthen or beforethat will + . 
correct it to the nearest event pool edge. The event pool + . ends at the last event recorded and begins <number of events> + . prior to that. If beforethat is used next, it will display + . event <count> -1. +. + andthen commands are: set_andthen, andthen_next, andthen and beforethat +end + + +document andthen + andthen +. displays the next event in the list. <set_andthen> sets up to display +. the oldest saved event first. +. <count> (optional) count of the event to display. +. note the number of events saved is specified at configure time. +. if events are saved between calls to andthen the index will change +. but the displayed event will be the next one (unless the event buffer +. is overrun). +. +. andthen commands are: set_andthen, andthen_next, andthen and beforethat +end + +document set_andthen + set_andthen +. sets up to use the <andthen> and <beforethat> commands. +. if you have defined your own struct, use the above and +. then enter the following: +. p $thp=(struct kgdb_and_then_structX *)&kgdb_data[0] +. where <kgdb_and_then_structX> is the name of your structure. +. +. andthen commands are: set_andthen, andthen_next, andthen and beforethat +end + +document beforethat + beforethat +. displays the next prior event in the list. <set_andthen> sets up to +. display the last occuring event first. +. +. note the number of events saved is specified at configure time. +. if events are saved between calls to beforethat the index will change +. but the displayed event will be the next one (unless the event buffer +. is overrun). +. +. andthen commands are: set_andthen, andthen_next, andthen and beforethat +end --- diff/Documentation/i386/kgdb/debug-nmi.txt 1970-01-01 01:00:00.000000000 +0100 +++ source/Documentation/i386/kgdb/debug-nmi.txt 2003-11-26 10:09:05.000000000 +0000 @@ -0,0 +1,37 @@ +Subject: Debugging with NMI +Date: Mon, 12 Jul 1999 11:28:31 -0500 +From: David Grothe <dave@gcom.com> +Organization: Gcom, Inc +To: David Grothe <dave@gcom.com> + +Kernel hackers: + +Maybe this is old hat, but it is new to me -- + +On an ISA bus machine, if you short out the A1 and B1 pins of an ISA +slot you will generate an NMI to the CPU. This interrupts even a +machine that is hung in a loop with interrupts disabled. Used in +conjunction with kgdb < +ftp://ftp.gcom.com/pub/linux/src/kgdb-2.3.35/kgdb-2.3.35.tgz > you can +gain debugger control of a machine that is hung in the kernel! Even +without kgdb the kernel will print a stack trace so you can find out +where it was hung. + +The A1/B1 pins are directly opposite one another and the farthest pins +towards the bracket end of the ISA bus socket. You can stick a paper +clip or multi-meter probe between them to short them out. + +I had a spare ISA bus to PC104 bus adapter around. The PC104 end of the +board consists of two rows of wire wrap pins. So I wired a push button +between the A1/B1 pins and now have an ISA board that I can stick into +any ISA bus slot for debugger entry. + +Microsoft has a circuit diagram of a PCI card at +http://www.microsoft.com/hwdev/DEBUGGING/DMPSW.HTM. If you want to +build one you will have to mail them and ask for the PAL equations. +Nobody makes one comercially. + +[THIS TIP COMES WITH NO WARRANTY WHATSOEVER. It works for me, but if +your machine catches fire, it is your problem, not mine.] 
+ +-- Dave (the kgdb guy) --- diff/Documentation/i386/kgdb/gdb-globals.txt 1970-01-01 01:00:00.000000000 +0100 +++ source/Documentation/i386/kgdb/gdb-globals.txt 2003-11-26 10:09:05.000000000 +0000 @@ -0,0 +1,71 @@ +Sender: akale@veritas.com +Date: Fri, 23 Jun 2000 19:26:35 +0530 +From: "Amit S. Kale" <akale@veritas.com> +Organization: Veritas Software (India) +To: Dave Grothe <dave@gcom.com>, linux-kernel@vger.rutgers.edu +CC: David Milburn <dmilburn@wirespeed.com>, + "Edouard G. Parmelan" <Edouard.Parmelan@quadratec.fr>, + ezannoni@cygnus.com, Keith Owens <kaos@ocs.com.au> +Subject: Re: Module debugging using kgdb + +Dave Grothe wrote: +> +> Amit: +> +> There is a 2.4.0 version of kgdb on our ftp site: +> ftp://ftp.gcom.com/pub/linux/src/kgdb. I mirrored your version of gdb +> and loadmodule.sh there. +> +> Have a look at the README file and see if I go it right. If not, send +> me some corrections and I will update it. +> +> Does your version of gdb solve the global variable problem? + +Yes. +Thanks to Elena Zanoni, gdb (developement version) can now calculate +correctly addresses of dynamically loaded object files. I have not been +following gdb developement for sometime and am not sure when symbol +address calculation fix is going to appear in a gdb stable version. + +Elena, any idea when the fix will make it to a prebuilt gdb from a +redhat release? + +For the time being I have built a gdb developement version. It can be +used for module debugging with loadmodule.sh script. + +The problem with calculating of module addresses with previous versions +of gdb was as follows: +gdb did not use base address of a section while calculating address of +a symbol in the section in an object file loaded via 'add-symbol-file'. +It used address of .text segment instead. Due to this addresses of +symbols in .data, .bss etc. (e.g. global variables) were calculated incorrectly. + +Above mentioned fix allow gdb to use base address of a segment while +calculating address of a symbol in it. It adds a parameter '-s' to +'add-symbol-file' command for specifying base address of a segment. + +loadmodule.sh script works as follows. + +1. Copy a module file to target machine. +2. Load the module on the target machine using insmod with -m parameter. +insmod produces a module load map which contains base addresses of all +sections in the module and addresses of symbols in the module file. +3. Find all sections and their base addresses in the module from +the module map. +4. Generate a script that loads the module file. The script uses +'add-symbol-file' and specifies address of text segment followed by +addresses of all segments in the module. + +Here is an example gdb script produced by loadmodule.sh script. + +add-symbol-file foo 0xd082c060 -s .text.lock 0xd08cbfb5 +-s .fixup 0xd08cfbdf -s .rodata 0xd08cfde0 -s __ex_table 0xd08e3b38 +-s .data 0xd08e3d00 -s .bss 0xd08ec8c0 -s __ksymtab 0xd08ee838 + +With this command gdb can calculate addresses of symbols in ANY segment +in a module file. + +Regards. 
+--
+Amit Kale
+Veritas Software ( http://www.veritas.com )
--- diff/Documentation/i386/kgdb/gdbinit	1970-01-01 01:00:00.000000000 +0100
+++ source/Documentation/i386/kgdb/gdbinit	2003-11-26 10:09:05.000000000 +0000
@@ -0,0 +1,14 @@
+shell echo -e "\003" >/dev/ttyS0
+set remotebaud 38400
+target remote /dev/ttyS0
+define si
+stepi
+printf "EAX=%08x EBX=%08x ECX=%08x EDX=%08x\n", $eax, $ebx, $ecx, $edx
+printf "ESI=%08x EDI=%08x EBP=%08x ESP=%08x\n", $esi, $edi, $ebp, $esp
+x/i $eip
+end
+define ni
+nexti
+printf "EAX=%08x EBX=%08x ECX=%08x EDX=%08x\n", $eax, $ebx, $ecx, $edx
+printf "ESI=%08x EDI=%08x EBP=%08x ESP=%08x\n", $esi, $edi, $ebp, $esp
+x/i $eip
--- diff/Documentation/i386/kgdb/gdbinit-modules	1970-01-01 01:00:00.000000000 +0100
+++ source/Documentation/i386/kgdb/gdbinit-modules	2003-11-26 10:09:05.000000000 +0000
@@ -0,0 +1,146 @@
+#
+# Useful GDB user-commands to debug Linux Kernel Modules with gdbstub.
+#
+# This doesn't work for Linux-2.0 or older.
+#
+# Author Edouard G. Parmelan <Edouard.Parmelan@quadratec.fr>
+#
+#
+# Fri Apr 30 20:33:29 CEST 1999
+# First public release.
+#
+# Major cleanup after experimenting with the Linux-2.0 kernel without
+# success.  Symbols of a module are not in the correct order, I can't
+# explain why :(
+#
+# Fri Mar 19 15:41:40 CET 1999
+# Initial version.
+#
+# Thu Jan 6 16:29:03 CST 2000
+# A little fixing by Dave Grothe <dave@gcom.com>
+#
+# Mon Jun 19 09:33:13 CDT 2000
+# Alignment changes from Edouard Parmelan
+#
+# The basic idea is to find where insmod loads the module and inform
+# GDB to load the symbol table of the module with the GDB command
+# ``add-symbol-file <object> <address>''.
+#
+# The Linux kernel holds the list of all loaded modules in module_list;
+# this list ends with &kernel_module (exactly with module->next == NULL,
+# but the last module is not a real module).
+#
+# Insmod allocates the struct module before the object file.  Since
+# Linux-2.1, this structure contains its size.  The real address of
+# the object file is then (char*)module + module->size_of_struct.
+#
+# You can use three user functions ``mod-list'', ``mod-print-symbols''
+# and ``mod-add-symbols''.
+#
+# mod-list lists all loaded modules with the format:
+#	<module-address> <module-name>
+#
+# As soon as you have found the address of your module, you can
+# print its exported symbols (mod-print-symbols) or inform GDB to add
+# symbols from your module file (mod-add-symbols).
+#
+# The argument that you give to mod-print-symbols or mod-add-symbols
+# is the <module-address> from the mod-list command.
+#
+# When using the mod-add-symbols command you must also give the full
+# pathname of the module's object code file.
+#
+# The command mod-add-lis is an example of how to make this easier.
+# You can edit this macro to contain the path name of your own
+# favorite module and then use it as a shorthand to load it.  You
+# still need the module-address, however.
+#
+# The internal function ``mod-validate'' sets the GDB variable $mod
+# as a ``struct module*'' if the kernel knows the module; otherwise
+# $mod is set to NULL.  This ensures that symbols are not added for a
+# wrong address.
+#
+# Have a nice hacking day !
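+#
+# Example session (the module name and the addresses below are made up):
+#	(gdb) mod-list
+#	0xd0832000	mymod
+#	(gdb) mod-print-symbols 0xd0832000
+#	0xd0832060	mymod_do_something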
+# +# +define mod-list + set $mod = (struct module*)module_list + # the last module is the kernel, ignore it + while $mod != &kernel_module + printf "%p\t%s\n", (long)$mod, ($mod)->name + set $mod = $mod->next + end +end +document mod-list +List all modules in the form: <module-address> <module-name> +Use the <module-address> as the argument for the other +mod-commands: mod-print-symbols, mod-add-symbols. +end + +define mod-validate + set $mod = (struct module*)module_list + while ($mod != $arg0) && ($mod != &kernel_module) + set $mod = $mod->next + end + if $mod == &kernel_module + set $mod = 0 + printf "%p is not a module\n", $arg0 + end +end +document mod-validate +mod-validate <module-address> +Internal user-command used to validate the module parameter. +If <module> is a real loaded module, set $mod to it otherwise set $mod to 0. +end + + +define mod-print-symbols + mod-validate $arg0 + if $mod != 0 + set $i = 0 + while $i < $mod->nsyms + set $sym = $mod->syms[$i] + printf "%p\t%s\n", $sym->value, $sym->name + set $i = $i + 1 + end + end +end +document mod-print-symbols +mod-print-symbols <module-address> +Print all exported symbols of the module. see mod-list +end + + +define mod-add-symbols-align + mod-validate $arg0 + if $mod != 0 + set $mod_base = ($mod->size_of_struct + (long)$mod) + if ($arg2 != 0) && (($mod_base & ($arg2 - 1)) != 0) + set $mod_base = ($mod_base | ($arg2 - 1)) + 1 + end + add-symbol-file $arg1 $mod_base + end +end +document mod-add-symbols-align +mod-add-symbols-align <module-address> <object file path name> <align> +Load the symbols table of the module from the object file where +first section aligment is <align>. +To retreive alignment, use `objdump -h <object file path name>'. +end + +define mod-add-symbols + mod-add-symbols-align $arg0 $arg1 sizeof(long) +end +document mod-add-symbols +mod-add-symbols <module-address> <object file path name> +Load the symbols table of the module from the object file. +Default alignment is 4. See mod-add-symbols-align. +end + +define mod-add-lis + mod-add-symbols-align $arg0 /usr/src/LiS/streams.o 16 +end +document mod-add-lis +mod-add-lis <module-address> +Does mod-add-symbols <module-address> /usr/src/LiS/streams.o +end --- diff/Documentation/i386/kgdb/gdbinit.hw 1970-01-01 01:00:00.000000000 +0100 +++ source/Documentation/i386/kgdb/gdbinit.hw 2003-11-26 10:09:05.000000000 +0000 @@ -0,0 +1,117 @@ + +#Using ia-32 hardware breakpoints. +# +#4 hardware breakpoints are available in ia-32 processors. These breakpoints +#do not need code modification. They are set using debug registers. +# +#Each hardware breakpoint can be of one of the +#three types: execution, write, access. +#1. An Execution breakpoint is triggered when code at the breakpoint address is +#executed. +#2. A write breakpoint ( aka watchpoints ) is triggered when memory location +#at the breakpoint address is written. +#3. An access breakpoint is triggered when memory location at the breakpoint +#address is either read or written. +# +#As hardware breakpoints are available in limited number, use software +#breakpoints ( br command in gdb ) instead of execution hardware breakpoints. +# +#Length of an access or a write breakpoint defines length of the datatype to +#be watched. Length is 1 for char, 2 short , 3 int. +# +#For placing execution, write and access breakpoints, use commands +#hwebrk, hwwbrk, hwabrk +#To remove a breakpoint use hwrmbrk command. +# +#These commands take following types of arguments. 
For arguments associated
+#with each command, use help command.
+#1. breakpointno: 0 to 3
+#2. length: 1 to 3
+#3. address: Memory location in hex ( without 0x ) e.g c015e9bc
+#
+#Use the command exinfo to find which hardware breakpoint occurred.
+
+#hwebrk breakpointno address
+define hwebrk
+	maintenance packet Y$arg0,0,0,$arg1
+end
+document hwebrk
+	hwebrk <breakpointno> <address>
+	Places a hardware execution breakpoint
+	<breakpointno> = 0 - 3
+	<address> = Hex digits without leading "0x".
+end
+
+#hwwbrk breakpointno length address
+define hwwbrk
+	maintenance packet Y$arg0,1,$arg1,$arg2
+end
+document hwwbrk
+	hwwbrk <breakpointno> <length> <address>
+	Places a hardware write breakpoint
+	<breakpointno> = 0 - 3
+	<length> = 1 (1 byte), 2 (2 byte), 3 (4 byte)
+	<address> = Hex digits without leading "0x".
+end
+
+#hwabrk breakpointno length address
+define hwabrk
+	maintenance packet Y$arg0,1,$arg1,$arg2
+end
+document hwabrk
+	hwabrk <breakpointno> <length> <address>
+	Places a hardware access breakpoint
+	<breakpointno> = 0 - 3
+	<length> = 1 (1 byte), 2 (2 byte), 3 (4 byte)
+	<address> = Hex digits without leading "0x".
+end
+
+#hwrmbrk breakpointno
+define hwrmbrk
+	maintenance packet y$arg0
+end
+document hwrmbrk
+	hwrmbrk <breakpointno>
+	<breakpointno> = 0 - 3
+	Removes a hardware breakpoint
+end
+
+define reboot
+	maintenance packet r
+end
+#exinfo
+define exinfo
+	maintenance packet qE
+end
+document exinfo
+	exinfo
+	Gives information about a breakpoint.
+end
+define get_th
+	p $th=(struct thread_info *)((int)$esp & ~8191)
+end
+document get_th
+	get_th
+	Gets and prints the current thread_info pointer.  Defines $th to be it.
+end
+define get_cu
+	p $cu=((struct thread_info *)((int)$esp & ~8191))->task
+end
+document get_cu
+	get_cu
+	Gets and prints the "current" value.  Defines $cu to be it.
+end
+define int_off
+	set var $flags=$eflags
+	set $eflags=$eflags&~0x200
+	end
+define int_on
+	set var $eflags|=$flags&0x200
+	end
+document int_off
+	saves the current interrupt state and clears the processor interrupt
+	flag.  Use int_on to restore the saved flag.
+end
+document int_on
+	Restores the interrupt flag saved by int_off.
+end
--- diff/Documentation/i386/kgdb/kgdb.txt	1970-01-01 01:00:00.000000000 +0100
+++ source/Documentation/i386/kgdb/kgdb.txt	2003-11-26 10:09:05.000000000 +0000
@@ -0,0 +1,775 @@
+Last edit: <20030806.1637.12>
+This file has information specific to the i386 kgdb option.  Other
+platforms with the kgdb option may behave in a similar fashion.
+
+New features:
+============
+20030806.1557.37
+This version was made against the 2.6.0-test2 kernel.  We have made the
+following changes:
+
+- The getthread() code in the stub calls find_task_by_pid().  It fails
+  if we are early in the bring up such that the pid arrays have yet to
+  be allocated.  We have added a line to kernel/pid.c to make
+  "kgdb_pid_init_done" true once the arrays are allocated.  This way the
+  getthread() code knows not to call it.  This is only used by the thread
+  debugging stuff and threads will not yet exist at this point in the
+  boot.
+
+- For some reason, gdb was not asking for a new thread list when the
+  "info thread" command was given.  We changed to the newer version of
+  the thread info command and gdb now seems to ask when needed.  Result,
+  we now get all threads in the thread list.
+
+- We now respond to the ThreadExtraInfo request from gdb with the thread
+  name from task_struct .comm.  This then appears in the thread list.
+  Thoughts on additional options for this are welcome.
Things such as
+  "has BKL" and "Preempted" come to mind.  I think we could have a flag
+  word that could enable different bits of info here.
+
+- We now honor, sort of, the C and S commands.  These are continue and
+  single-step after delivering a signal.  We ignore the signal and do the
+  requested action.  This only happens when we told gdb that a signal
+  was the reason for entry, which is only done on memory faults.  The
+  result is that you can now continue into the Oops.
+
+- We changed the -g to -gdwarf-2.  This seems to be the same as -ggdb,
+  but it is more exact on what language to use.
+
+- We added two dwarf2 include files and a bit of code at the end of
+  entry.S.  This does not yet work, so it is disabled.  Still we want to
+  keep track of the code and "maybe" someone out there can fix it.
+
+- Randy Dunlap sent some fixups for this file which are now merged.
+
+- Hugh Dickins sent a fix to a bit of code in traps.c that prevents a
+  compiler warning if CONFIG_KGDB is off (now who would do that :).
+
+- Andrew Morton sent a fix for the serial driver which is now merged.
+
+- Andrew also sent a change to the stub around the cpu management code
+  which is also merged.
+
+- Andrew also sent a patch to make "f" as well as "g" work as SysRq
+  commands to enter kgdb, merged.
+
+- If CONFIG_KGDB and CONFIG_DEBUG_SPINLOCKS are both set we added a
+  "who" field to the spinlock data struct.  This is filled with
+  "current" whenever the spinlock succeeds.  Useful if you want to know
+  who has the lock.
+
+- And last, but not least, we fixed the "get_cu" macro to properly get
+  the current value of "current".
+
+New features:
+============
+20030505.1827.27
+We are starting to align with the sourceforge version, at least in
+commands.  To this end, the boot command string to start kgdb at
+boot time has been changed from "kgdb" to "gdb".
+
+Andrew Morton sent a couple of patches which are now included as follows:
+1.) We now return a flag to the interrupt handler.
+2.) We no longer use smp_num_cpus (a conflict with the lock meter).
+3.) And, from William Lee Irwin III <wli@holomorphy.com>, code to make
+    sure high-mem is set up before we attempt to register our interrupt
+    handler.
+We now include asm/kgdb.h from config.h so you will most likely never
+have to include it.  It also 'NULLS' the kgdb macros you might have in
+your code when CONFIG_KGDB is not defined.  This allows you to just
+turn off CONFIG_KGDB to turn off all the kgdb_ts() calls and such.
+This include is conditioned on the machine being an x86 so as to not
+mess with other archs.
+
+20020801.1129.03
+This is currently the version for the 2.4.18 (and beyond?) kernel.
+
+We have several new "features" beginning with this version:
+
+1.) Kgdb now syncs the "other" CPUs with a cross-CPU NMI.  No more
+    waiting and it will pull that guy out of an IRQ off spin lock :)
+
+2.) We doctored up the code that tells where a task is waiting and
+    included it so that the "info thread" command will show a bit more
+    than "schedule()".  Try it...
+
+3.) Added the ability to call a function from gdb.  All the standard gdb
+    issues apply, i.e. if you hit a breakpoint in the function, you are
+    not allowed to call another (gdb limitation, not kgdb).  To help
+    this capability we added a memory allocation function.  Gdb does not
+    return this memory (it is used for strings that you pass to that function
+    you are calling from gdb) so we fixed up a way to allow you to
+    manually return the memory (see below).
+
+4.)
Kgdb time stamps (kgdb_ts()) are enhanced to expand what was the + interrupt flag to now also include the preemption count and the + "in_interrupt" info. The flag is now called "with_pif" to indicate + the order, preempt_count, in_interrupt, flag. The preempt_count is + shifted left by 4 bits so you can read the count in hex by dropping + the low order digit. In_interrupt is in bit 1, and the flag is in + bit 0. + +5.) The command: "p kgdb_info" is now expanded and prints something + like: +(gdb) p kgdb_info +$2 = {used_malloc = 0, called_from = 0xc0107506, entry_tsc = 67468627259, + errcode = 0, vector = 3, print_debug_info = 0, hold_on_sstep = 1, + cpus_waiting = {{task = 0xc027a000, pid = 32768, hold = 0, + regs = 0xc027bf84}, {task = 0x0, pid = 0, hold = 0, regs = 0x0}}} + + Things to note here: a.) used_malloc is the amount of memory that + has been malloc'ed to do calls from gdb. You can reclaim this + memory like this: "p kgdb_info.used_malloc=0" Cool, huh? b.) + cpus_waiting is now "sized" by the number of CPUs you enter at + configure time in the kgdb configure section. This is NOT used + anywhere else in the system, but it is "nice" here. c.) The task's + "pid" is now in the structure. This is the pid you will need to use + to decode to the thread id to get gdb to look at that thread. + Remember that the "info thread" command prints a list of threads + wherein it numbers each thread with its reference number followed + by the thread's pid. Note that the per-CPU idle threads actually + have pids of 0 (yes, there is more than one pid 0 in an SMP system). + To avoid confusion, kgdb numbers these threads with numbers beyond + the MAX_PID. That is why you see 32768 and above. + +6.) A subtle change, we now provide the complete register set for tasks + that are active on the other CPUs. This allows better trace back on + those tasks. + + And, let's mention what we could not fix. Back-trace from all but the + thread that we trapped will, most likely, have a bogus entry in it. + The problem is that gdb does not recognize the entry code for + functions that use "current" near (at all?) the entry. The compiler + is putting the "current" decode as the first two instructions of the + function where gdb expects to find %ebp changing code. Back trace + also has trouble with interrupt frames. I am talking with Daniel + Jacobowitz about some way to fix this, but don't hold your breath. + +20011220.0050.35 +Major enhancement with this version is the ability to hold one or more +CPUs in an SMP system while allowing the others to continue. Also, by +default only the current CPU is enabled on single-step commands (please +note that gdb issues single-step commands at times other than when you +use the si command). + +Another change is to collect some useful information in +a global structure called "kgdb_info". You should be able to just: + +p kgdb_info + +although I have seen cases where the first time this is done gdb just +prints the first member but prints the whole structure if you then enter +CR (carriage return or enter). This also works: + +p *&kgdb_info + +Here is a sample: +(gdb) p kgdb_info +$4 = {called_from = 0xc010732c, entry_tsc = 32804123790856, errcode = 0, + vector = 3, print_debug_info = 0} + +"Called_from" is the return address from the current entry into kgdb. +Sometimes it is useful to know why you are in kgdb, for example, was +it an NMI or a real breakpoint? 
The simple way to interrogate this +return address is: + +l *0xc010732c + +which will print the surrounding few lines of source code. + +"Entry_tsc" is the CPU TSC on entry to kgdb (useful to compare to the +kgdb_ts entries). + +"errcode" and "vector" are other entry parameters which may be helpful on +some traps. + +"print_debug_info" is the internal debugging kgdb print enable flag. Yes, +you can modify it. + +In SMP systems kgdb_info also includes the "cpus_waiting" structure and +"hold_on_step": + +(gdb) p kgdb_info +$7 = {called_from = 0xc0112739, entry_tsc = 1034936624074, errcode = 0, + vector = 2, print_debug_info = 0, hold_on_sstep = 1, cpus_waiting = {{ + task = 0x0, hold = 0, regs = 0x0}, {task = 0xc71b8000, hold = 0, + regs = 0xc71b9f70}, {task = 0x0, hold = 0, regs = 0x0}, {task = 0x0, + hold = 0, regs = 0x0}, {task = 0x0, hold = 0, regs = 0x0}, {task = 0x0, + hold = 0, regs = 0x0}, {task = 0x0, hold = 0, regs = 0x0}, {task = 0x0, + hold = 0, regs = 0x0}}} + +"Cpus_waiting" has an entry for each CPU other than the current one that +has been stopped. Each entry contains the task_struct address for that +CPU, the address of the regs for that task and a hold flag. All these +have the proper typing so that, for example: + +p *kgdb_info.cpus_waiting[1].regs + +will print the registers for CPU 1. + +"Hold_on_sstep" is a new feature with this version and comes up set or +true. What this means is that whenever kgdb is asked to single-step all +other CPUs are held (i.e. not allowed to execute). The flag applies to +all but the current CPU and, again, can be changed: + +p kgdb_info.hold_on_sstep=0 + +restores the old behavior of letting all CPUs run during single-stepping. + +Likewise, each CPU has a "hold" flag, which if set, locks that CPU out +of execution. Note that this has some risk in cases where the CPUs need +to communicate with each other. If kgdb finds no CPU available on exit, +it will push a message thru gdb and stay in kgdb. Note that it is legal +to hold the current CPU as long as at least one CPU can execute. + +20010621.1117.09 +This version implements an event queue. Events are signaled by calling +a function in the kgdb stub and may be examined from gdb. See EVENTS +below for details. This version also tightens up the interrupt and SMP +handling to not allow interrupts on the way to kgdb from a breakpoint +trap. It is fine to allow these interrupts for user code, but not +system debugging. + +Version +======= + +This version of the kgdb package was developed and tested on +kernel version 2.4.16. It will not install on any earlier kernels. +It is possible that it will continue to work on later versions +of 2.4 and then versions of 2.5 (I hope). + + +Debugging Setup +=============== + +Designate one machine as the "development" machine. This is the +machine on which you run your compiles and which has your source +code for the kernel. Designate a second machine as the "target" +machine. This is the machine that will run your experimental +kernel. + +The two machines will be connected together via a serial line out +one or the other of the COM ports of the PC. You will need the +appropriate modem eliminator (null modem) cable(s) for this. + +Decide on which tty port you want the machines to communicate, then +connect them up back-to-back using the null modem cable. COM1 is +/dev/ttyS0 and COM2 is /dev/ttyS1. You should test this connection +with the two machines prior to trying to debug a kernel. 
Once you +have it working, on the TARGET machine, enter: + +setserial /dev/ttyS0 (or what ever tty you are using) + +and record the port address and the IRQ number. + +On the DEVELOPMENT machine you need to apply the patch for the kgdb +hooks. You have probably already done that if you are reading this +file. + +On your DEVELOPMENT machine, go to your kernel source directory and do +"make Xconfig" where X is one of "x", "menu", or "". If you are +configuring in the standard serial driver, it must not be a module. +Either yes or no is ok, but making the serial driver a module means it +will initialize after kgdb has set up the UART interrupt code and may +cause a failure of the control-C option discussed below. The configure +question for the serial driver is under the "Character devices" heading +and is: + +"Standard/generic (8250/16550 and compatible UARTs) serial support" + +Go down to the kernel debugging menu item and open it up. Enable the +kernel kgdb stub code by selecting that item. You can also choose to +turn on the "-ggdb -O1" compile options. The -ggdb causes the compiler +to put more debug info (like local symbols) in the object file. On the +i386 -g and -ggdb are the same so this option just reduces to "O1". The +-O1 reduces the optimization level. This may be helpful in some cases, +be aware, however, that this may also mask the problem you are looking +for. + +The baud rate. Default is 115200. What ever you choose be sure that +the host machine is set to the same speed. I recommend the default. + +The port. This is the I/O address of the serial UART that you should +have gotten using setserial as described above. The standard COM1 port +(3f8) using IRQ 4 is default. COM2 is 2f8 which by convention uses IRQ +3. + +The port IRQ (see above). + +Stack overflow test. This option makes a minor change in the trap, +system call and interrupt code to detect stack overflow and transfer +control to kgdb if it happens. (Some platforms have this in the +baseline code, but the i386 does not.) + +You can also configure the system to recognize the boot option +"console=kgdb" which if given will cause all console output during +booting to be put thru gdb as well as other consoles. This option +requires that gdb and kgdb be connected prior to sending console output +so, if they are not, a breakpoint is executed to force the connection. +This will happen before any kernel output (it is going thru gdb, right), +and will stall the boot until the connection is made. + +You can also configure in a patch to SysRq to enable the kGdb SysRq. +This request generates a breakpoint. Since the serial port IRQ line is +set up after any serial drivers, it is possible that this command will +work when the control-C will not. + +Save and exit the Xconfig program. Then do "make clean" , "make dep" +and "make bzImage" (or whatever target you want to make). This gets the +kernel compiled with the "-g" option set -- necessary for debugging. + +You have just built the kernel on your DEVELOPMENT machine that you +intend to run on your TARGET machine. + +To install this new kernel, use the following installation procedure. +Remember, you are on the DEVELOPMENT machine patching the kernel source +for the kernel that you intend to run on the TARGET machine. + +Copy this kernel to your target machine using your usual procedures. I +usually arrange to copy development: +/usr/src/linux/arch/i386/boot/bzImage to /vmlinuz on the TARGET machine +via a LAN based NFS access. 
That is, I run the cp command on the target +and copy from the development machine via the LAN. Run Lilo (see "man +lilo" for details on how to set this up) on the new kernel on the target +machine so that it will boot! Then boot the kernel on the target +machine. + +On the DEVELOPMENT machine, create a file called .gdbinit in the +directory /usr/src/linux. An example .gdbinit file looks like this: + +shell echo -e "\003" >/dev/ttyS0 +set remotebaud 38400 (or what ever speed you have chosen) +target remote /dev/ttyS0 + + +Change the "echo" and "target" definition so that it specifies the tty +port that you intend to use. Change the "remotebaud" definition to +match the data rate that you are going to use for the com line. + +You are now ready to try it out. + +Boot your target machine with "kgdb" in the boot command i.e. something +like: + +lilo> test kgdb + +or if you also want console output thru gdb: + +lilo> test kgdb console=kgdb + +You should see the lilo message saying it has loaded the kernel and then +all output stops. The kgdb stub is trying to connect with gdb. Start +gdb something like this: + + +On your DEVELOPMENT machine, cd /usr/src/linux and enter "gdb vmlinux". +When gdb gets the symbols loaded it will read your .gdbinit file and, if +everything is working correctly, you should see gdb print out a few +lines indicating that a breakpoint has been taken. It will actually +show a line of code in the target kernel inside the kgdb activation +code. + +The gdb interaction should look something like this: + + linux-dev:/usr/src/linux# gdb vmlinux + GDB is free software and you are welcome to distribute copies of it + under certain conditions; type "show copying" to see the conditions. + There is absolutely no warranty for GDB; type "show warranty" for details. + GDB 4.15.1 (i486-slackware-linux), + Copyright 1995 Free Software Foundation, Inc... + breakpoint () at i386-stub.c:750 + 750 } + (gdb) + +You can now use whatever gdb commands you like to set breakpoints. +Enter "continue" to start your target machine executing again. At this +point the target system will run at full speed until it encounters +your breakpoint or gets a segment violation in the kernel, or whatever. + +If you have the kgdb console enabled when you continue, gdb will print +out all the console messages. + +The above example caused a breakpoint relatively early in the boot +process. For the i386 kgdb it is possible to code a break instruction +as the first C-language point in init/main.c, i.e. as the first instruction +in start_kernel(). This could be done as follows: + +#include <asm/kgdb.h> + breakpoint(); + +This breakpoint() is really a function that sets up the breakpoint and +single-step hardware trap cells and then executes a breakpoint. Any +early hard coded breakpoint will need to use this function. Once the +trap cells are set up they need not be set again, but doing it again +does not hurt anything, so you don't need to be concerned about which +breakpoint is hit first. Once the trap cells are set up (and the kernel +sets them up in due course even if breakpoint() is never called) the +macro: + +BREAKPOINT; + +will generate an inline breakpoint. This may be more useful as it stops +the processor at the instruction instead of in a function a step removed +from the location of interest. In either case <asm/kgdb.h> must be +included to define both breakpoint() and BREAKPOINT. 
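+
+As a sketch of the above (the function name and the condition tested
+here are made up for illustration; breakpoint(), BREAKPOINT and
+<asm/kgdb.h> are the pieces the patch actually provides):
+
+#include <asm/kgdb.h>
+
+void __init my_early_debug_hook(int value)
+{
+	breakpoint();		/* sets up the trap cells, then traps to gdb */
+
+	if (value < 0)		/* hypothetical "it hurts" condition */
+		BREAKPOINT;	/* inline trap at this exact instruction */
+}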
+
+Triggering kgdbstub at other times
+==================================
+
+Often you don't need to enter the debugger until much later in the boot
+or even after the machine has been running for some time.  Once the
+kernel is booted and interrupts are on, you can force the system to
+enter the debugger by sending a control-C to the debug port.  This is
+what the first line of the recommended .gdbinit file does.  This allows
+you to start gdb any time after the system is up as well as when the
+system is already at a breakpoint.  (In the case where the system is
+already at a breakpoint the control-C is not needed, however, it will
+be ignored by the target so no harm is done.  Also note that the echo
+command assumes that the port speed is already set.  This will be true
+once gdb has connected, but it is best to set the port speed before you
+run gdb.)
+
+Another simple way to do this is to put the following file in your ~/bin
+directory:
+
+#!/bin/bash
+echo -e "\003" > /dev/ttyS0
+
+Here, the ttyS0 should be replaced with whatever port you are using.
+The "\003" is control-C.  Once you are connected with gdb, you can enter
+control-C at the command prompt.
+
+An alternative way to get control to the debugger is to enable the kGdb
+SysRq command.  Then you would enter Alt-SysRq-g (all three keys at the
+same time, but push them down in the order given).  To refresh your
+memory of the available SysRq commands try Alt-SysRq-=.  Actually any
+undefined command could replace the "=", but I like to KNOW that what I
+am pushing will never be defined.
+
+Debugging hints
+===============
+
+You can break into the target machine at any time from the development
+machine by typing ^C (see above paragraph).  If the target machine has
+interrupts enabled this will stop it in the kernel and enter the
+debugger.
+
+There is unfortunately no way of breaking into the kernel if it is
+in a loop with interrupts disabled, so if this happens to you then
+you need to place exploratory breakpoints or printk's into the kernel
+to find out where it is looping.  The exploratory breakpoints can be
+entered either thru gdb or hard coded into the source.  This is very
+handy if you do something like:
+
+if (<it hurts>) BREAKPOINT;
+
+There is a copy of an e-mail in the Documentation/i386/kgdb/ directory
+(debug-nmi.txt) which describes how to create an NMI on an ISA bus
+machine using a paper clip.  I have a sophisticated version of this made
+by wiring a push button switch into a PC104/ISA bus adapter card.  The
+adapter card nicely furnishes wire wrap pins for all the ISA bus
+signals.
+
+When you are done debugging the kernel on the target machine it is a
+good idea to leave it in a running state.  This makes reboots faster,
+bypassing the fsck.  So do a gdb "continue" as the last gdb command if
+this is possible.  To terminate gdb itself on the development machine
+and leave the target machine running, first clear all breakpoints and
+continue, then type ^Z to suspend gdb and then kill it with "kill %1" or
+something similar.
+
+If gdbstub Does Not Work
+========================
+
+If it doesn't work, you will have to troubleshoot it.  Do the easy
+things first like double checking your cabling and data rates.  You
+might try some non-kernel based programs to see if the back-to-back
+connection works properly.  Just something simple like cat /etc/hosts
+>/dev/ttyS0 on one machine and cat /dev/ttyS0 on the other will tell you
+if you can send data from one machine to the other.  Make sure it works
+in both directions.
There is no point in tearing out your hair in the
+kernel if the line doesn't work.
+
+All of the real action takes place in the file
+/usr/src/linux/arch/i386/kernel/kgdb_stub.c.  That is the code on the target
+machine that interacts with gdb on the development machine.  In gdb you can
+turn on a debug switch with the following command:
+
+	set remotedebug
+
+This will print out the protocol messages that gdb is exchanging with
+the target machine.
+
+Another place to look is /usr/src/linux/arch/i386/lib/kgdb_serial.c.  This is
+the code that talks to the serial port on the target side.  There might
+be a problem there.  In particular there is a section of this code that
+tests the UART which will tell you what UART you have if you define
+"PRNT" (just remove "_off" from the #define PRNT_off).  To view this
+report you will need to boot the system without any breakpoints.  This
+allows the kernel to run to the point where it calls kgdb to set up
+interrupts.  At this time kgdb will test the UART and print out the type
+it finds.  (You need to wait so that the printks are actually being
+printed.  Early in the boot they are cached, waiting for the console to
+be enabled.  Also, if kgdb is entered thru a breakpoint it is possible
+to cause a deadlock by calling printk when the console is locked.  The
+stub thus avoids doing printks from breakpoints, especially in the
+serial code.)  At this time, if the UART fails to do the expected thing,
+kgdb will print out (using printk) information on what failed.  (These
+messages will be buried in all the other boot up messages.  Look for
+lines that start with "gdb_hook_interrupt:".  You may want to use dmesg
+once the system is up to view the log.)  If this fails or if you still
+don't connect, review your answers for the port address.  Use:
+
+setserial /dev/ttyS0
+
+to get the current port and IRQ information.  This command will also
+tell you what the system found for the UART type.  The stub recognizes
+the following UART types:
+
+16450, 16550, and 16550A
+
+If you are really desperate you can use printk debugging in the
+kgdbstub code in the target kernel until you get it working.  In particular,
+there is a global variable in /usr/src/linux/arch/i386/kernel/kgdb_stub.c
+named "remote_debug".  Compile your kernel with this set to 1, rather
+than 0 and the debug stub will print out lots of stuff as it does
+what it does.  Likewise there are debug printks in the kgdb_serial.c
+code that can be turned on with simple changes in the macro defines.
+
+
+Debugging Loadable Modules
+==========================
+
+This technique comes courtesy of Edouard Parmelan
+<Edouard.Parmelan@quadratec.fr>
+
+When you run gdb, enter the command
+
+source gdbinit-modules
+
+This will read in a file of gdb macros that was installed in your
+kernel source directory when kgdb was installed.  This file implements
+the following commands:
+
+mod-list
+  Lists the loaded modules in the form <module-address> <module-name>
+
+mod-print-symbols <module-address>
+  Prints all the symbols in the indicated module.
+
+mod-add-symbols <module-address> <object-file-path-name>
+  Loads the symbols from the object file and associates them
+  with the indicated module.
+
+After you have loaded the module that you want to debug, use the command
+mod-list to find the <module-address> of your module.  Then use that
+address in the mod-add-symbols command to load your module's symbols.
+From that point onward you can debug your module as if it were a part
+of the kernel.
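+
+For example, a session might look like this (the module name, path and
+address are made up for illustration):
+
+	(gdb) mod-list
+	0xd0832000	mymod
+	(gdb) mod-add-symbols 0xd0832000 /usr/src/mymod/mymod.o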
+
+The file gdbinit-modules also contains a command named mod-add-lis as
+an example of how to construct a command of your own to load your
+favorite module.  The idea is to "can" the pathname of the module
+in the command so you don't have to type so much.
+
+Threads
+=======
+
+Each process in a target machine is seen as a gdb thread.  gdb thread
+related commands (info threads, thread n) can be used.
+
+ia-32 hardware breakpoints
+==========================
+
+kgdb stub contains support for hardware breakpoints using debugging features
+of ia-32(x86) processors.  These breakpoints do not need code modification.
+They use debugging registers.  4 hardware breakpoints are available in ia-32
+processors.
+
+Each hardware breakpoint can be of one of the following three types.
+
+1. Execution breakpoint - An Execution breakpoint is triggered when code
+   at the breakpoint address is executed.
+
+   As a limited number of hardware breakpoints are available, it is
+   advisable to use software breakpoints ( break command ) instead
+   of execution hardware breakpoints, unless modification of code
+   is to be avoided.
+
+2. Write breakpoint - A write breakpoint is triggered when memory
+   location at the breakpoint address is written.
+
+   A write breakpoint can be placed for data of variable length.  Length of
+   a write breakpoint indicates length of the datatype to be
+   watched.  Length is 1 for 1 byte data, 2 for 2 byte data, 3 for
+   4 byte data.
+
+3. Access breakpoint - An access breakpoint is triggered when memory
+   location at the breakpoint address is either read or written.
+
+   Access breakpoints also have lengths similar to write breakpoints.
+
+IO breakpoints in ia-32 are not supported.
+
+Since gdb stub at present does not use the protocol used by gdb for hardware
+breakpoints, hardware breakpoints are accessed through gdb macros.  gdb macros
+for hardware breakpoints are described below.
+
+hwebrk - Places an execution breakpoint
+	hwebrk breakpointno address
+hwwbrk - Places a write breakpoint
+	hwwbrk breakpointno length address
+hwabrk - Places an access breakpoint
+	hwabrk breakpointno length address
+hwrmbrk - Removes a breakpoint
+	hwrmbrk breakpointno
+exinfo - Tells whether a software or hardware breakpoint has occurred.
+	Prints number of the hardware breakpoint if a hardware breakpoint has
+	occurred.
+
+Arguments required by these commands are as follows:
+breakpointno - 0 to 3
+length - 1 to 3
+address - Memory location in hex digits ( without 0x ) e.g c015e9bc
+
+SMP support
+===========
+
+When a breakpoint occurs or user issues a break ( Ctrl + C ) to gdb
+client, all the processors are forced to enter the debugger.  Current
+thread corresponds to the thread running on the processor where
+breakpoint occurred.  Threads running on other processor(s) appear
+similar to other non-running threads in the 'info threads' output.
+Within the kgdb stub there is a structure "waiting_cpus" in which kgdb
+records the values of "current" and "regs" for each CPU other than the
+one that hit the breakpoint.  "current" is a pointer to the task
+structure for the task that CPU is running, while "regs" points to the
+saved registers for the task.  This structure can be examined with the
+gdb "p" command.
+
+ia-32 hardware debugging registers on all processors are set to the same
+values.  Hence any hardware breakpoints may occur on any processor.
+
+gdb troubleshooting
+===================
+
+1. gdb hangs
+Kill it.  Restart gdb.  Connect to target machine.
+
+2.
gdb cannot connect to target machine (after killing a gdb and +restarting another) If the target machine was not inside debugger when +you killed gdb, gdb cannot connect because the target machine won't +respond. In this case echo "Ctrl+C"(ASCII 3) to the serial line. +e.g. echo -e "\003" > /dev/ttyS1 +This forces that target machine into the debugger, after which you +can connect. + +3. gdb cannot connect even after echoing Ctrl+C into serial line +Try changing serial line settings min to 1 and time to 0 +e.g. stty min 1 time 0 < /dev/ttyS1 +Try echoing again + +Check serial line speed and set it to correct value if required +e.g. stty ispeed 115200 ospeed 115200 < /dev/ttyS1 + +EVENTS +====== + +Ever want to know the order of things happening? Which CPU did what and +when? How did the spinlock get the way it is? Then events are for +you. Events are defined by calls to an event collection interface and +saved for later examination. In this case, kgdb events are saved by a +very fast bit of code in kgdb which is fully SMP and interrupt protected +and they are examined by using gdb to display them. Kgdb keeps only +the last N events, where N must be a power of two and is defined at +configure time. + + +Events are signaled to kgdb by calling: + +kgdb_ts(data0,data1) + +For each call kgdb records each call in an array along with other info. +Here is the array definition: + +struct kgdb_and_then_struct { +#ifdef CONFIG_SMP + int on_cpu; +#endif + long long at_time; + int from_ln; + char * in_src; + void *from; + int with_if; + int data0; + int data1; +}; + +For SMP machines the CPU is recorded, for all machines the TSC is +recorded (gets a time stamp) as well as the line number and source file +the call was made from. The address of the (from), the "if" (interrupt +flag) and the two data items are also recorded. The macro kgdb_ts casts +the types to int, so you can put any 32-bit values here. There is a +configure option to select the number of events you want to keep. A +nice number might be 128, but you can keep up to 1024 if you want. The +number must be a power of two. An "andthen" macro library is provided +for gdb to help you look at these events. It is also possible to define +a different structure for the event storage and cast the data to this +structure. For example the following structure is defined in kgdb: + +struct kgdb_and_then_struct2 { +#ifdef CONFIG_SMP + int on_cpu; +#endif + long long at_time; + int from_ln; + char * in_src; + void *from; + int with_if; + struct task_struct *t1; + struct task_struct *t2; +}; + +If you use this for display, the data elements will be displayed as +pointers to task_struct entries. You may want to define your own +structure to use in casting. You should only change the last two items +and you must keep the structure size the same. Kgdb will handle these +as 32-bit ints, but within that constraint you can define a structure to +cast to any 32-bit quantity. This need only be available to gdb and is +only used for casting in the display code. + +Final Items +=========== + +I picked up this code from Amit S. Kale and enhanced it. + +If you make some really cool modification to this stuff, or if you +fix a bug, please let me know. + +George Anzinger +<george@mvista.com> + +Amit S. Kale +<akale@veritas.com> + +(First kgdb by David Grothe <dave@gcom.com>) + +(modified by Tigran Aivazian <tigran@sco.com>) + Putting gdbstub into the kernel config menu. + +(modified by Scott Foehner <sfoehner@engr.sgi.com>) + Hooks for entering gdbstub at boot time. 
+
+(modified by Amit S. Kale <akale@veritas.com>)
+    Threads, ia-32 hw debugging, mp support, console support,
+    nmi watchdog handling.
+
+(modified by George Anzinger <george@mvista.com>)
+    Extended threads to include the idle threads.
+    Enhancements to allow breakpoint() at first C code.
+    Use of module_init() and __setup() to automate the configure.
+    Enhanced the cpu "collection" code to work in early bring-up.
+    Added ability to call functions from gdb
+    Print info thread stuff without going back to schedule()
+    Now collect the "other" cpus with an IPI/NMI.
--- diff/Documentation/i386/kgdb/kgdbeth.txt	1970-01-01 01:00:00.000000000 +0100
+++ source/Documentation/i386/kgdb/kgdbeth.txt	2003-11-26 10:09:05.000000000 +0000
@@ -0,0 +1,118 @@
+KGDB over ethernet
+==================
+
+Authors
+-------
+
+Robert Walsh <rjwalsh@durables.org> (2.6 port)
+wangdi <wangdi@clusterfs.com> (2.6 port)
+San Mehat (original 2.4 code)
+
+
+Introduction
+------------
+
+KGDB supports debugging over ethernet.  Only a limited set of ethernet
+devices are supported right now, but adding support for new devices
+should not be too complicated.  See "New Devices" below for details.
+
+
+Terminology
+-----------
+
+This document uses the following terms:
+
+  TARGET: the machine being debugged.
+  HOST:   the machine running gdb.
+
+
+Usage
+-----
+
+You need to use the following command-line options on the TARGET kernel:
+
+  kgdbeth=DEVICENUM
+  kgdbeth_remoteip=HOSTIPADDR
+  kgdbeth_remotemac=REMOTEMAC
+  kgdbeth_localmac=LOCALMAC
+
+kgdbeth=DEVICENUM sets the ethernet device number to listen on for
+debugging packets.  e.g. kgdbeth=0 listens on eth0.
+
+kgdbeth_remoteip=HOSTIPADDR sets the IP address of the HOST machine.
+Only packets originating from this IP address will be accepted by the
+debugger.  e.g. kgdbeth_remoteip=192.168.2.2
+
+kgdbeth_remotemac=REMOTEMAC sets the ethernet address of the HOST machine.
+e.g. kgdbeth_remotemac=00:07:70:12:4E:F5
+
+kgdbeth_localmac=LOCALMAC sets the ethernet address of the TARGET machine.
+e.g. kgdbeth_localmac=00:10:9F:18:21:3C
+
+You can also set the following command-line option on the TARGET kernel:
+
+  kgdbeth_listenport=PORT
+
+kgdbeth_listenport sets the UDP port to listen on for gdb debugging
+packets.  The default value is "6443".  e.g. kgdbeth_listenport=7654
+causes the kernel to listen on UDP port 7654 for debugging packets.
+
+On the HOST side, run gdb as normal and use a remote UDP host as the
+target:
+
+  % gdb ./vmlinux
+  GNU gdb Red Hat Linux (5.3post-0.20021129.18rh)
+  Copyright 2003 Free Software Foundation, Inc.
+  GDB is free software, covered by the GNU General Public License, and you are
+  welcome to change it and/or distribute copies of it under certain conditions.
+  Type "show copying" to see the conditions.
+  There is absolutely no warranty for GDB.  Type "show warranty" for details.
+  This GDB was configured as "i386-redhat-linux-gnu"...
+  (gdb) target remote udp:HOSTNAME:6443
+
+You can now continue as if you were debugging over a serial line.
+
+Observations
+------------
+
+I've used this with NFS and various other network applications (ssh,
+etc.) and it doesn't appear to interfere with their operation in
+any way.  It doesn't seem to affect the NIC it uses - i.e. you don't
+need a dedicated NIC for this.
+
+Limitations
+-----------
+
+In the initial release of this code you _must_ break into the system with the
+debugger by hand, early after boot, as described above.
+
+Otherwise, the first time the kernel tries to enter the debugger (say, via an
+oops or a BUG), the kgdb stub will doublefault and die because things aren't
+fully set up yet.
+
+Supported devices
+-----------------
+
+Right now, the following drivers are supported:
+
+  e100 driver (drivers/net/e100/*)
+  3c59x driver (drivers/net/3c59x.c)
+
+
+New devices
+-----------
+
+Supporting a new device is straightforward.  Just add a "poll" routine to
+the driver and hook it into the poll_controller field in the netdevice
+structure.  For an example, look in drivers/net/3c59x.c and search
+for CONFIG_KGDB (two places.)
+
+The poll routine is usually quite simple - it's usually enough to just
+disable interrupts, call the device's interrupt routine and re-enable
+interrupts again.
+
+
+Bug reports
+-----------
+
+Send bug reports to Robert Walsh <rjwalsh@durables.org>.
--- diff/Documentation/i386/kgdb/loadmodule.sh	1970-01-01 01:00:00.000000000 +0100
+++ source/Documentation/i386/kgdb/loadmodule.sh	2003-11-26 10:09:05.000000000 +0000
@@ -0,0 +1,78 @@
+#!/bin/sh
+# This script loads a module on a target machine and generates a gdb script.
+# Source the generated gdb script to load the module file at appropriate
+# addresses in gdb.
+#
+# Usage:
+# Loading the module on target machine and generating gdb script:
+# [foo]$ loadmodule.sh <modulename>
+#
+# Loading the module file into gdb:
+# (gdb) source <gdbscriptpath>
+#
+# Modify following variables according to your setup.
+#	TESTMACHINE - Name of the target machine
+#	GDBSCRIPTS - The directory where a gdb script will be generated
+#
+# Author: Amit S. Kale (akale@veritas.com).
+#
+# If you run into problems, please check files pointed to by following
+# variables.
+#	ERRFILE - /tmp/<modulename>.errs contains stderr output of insmod
+#	MAPFILE - /tmp/<modulename>.map contains stdout output of insmod
+#	GDBSCRIPT - $GDBSCRIPTS/load<modulename> gdb script.
+
+TESTMACHINE=foo
+GDBSCRIPTS=/home/bar
+
+if [ $# -lt 1 ] ; then {
+	echo Usage: $0 modulefile
+	exit
+} ; fi
+
+MODULEFILE=$1
+MODULEFILEBASENAME=`basename $1`
+
+if [ $MODULEFILE = $MODULEFILEBASENAME ] ; then {
+	MODULEFILE=`pwd`/$MODULEFILE
+} fi
+
+ERRFILE=/tmp/$MODULEFILEBASENAME.errs
+MAPFILE=/tmp/$MODULEFILEBASENAME.map
+GDBSCRIPT=$GDBSCRIPTS/load$MODULEFILEBASENAME
+
+function findaddr() {
+	local ADDR=0x$(echo "$SEGMENTS" | \
+		grep "$1" | sed 's/^[^ ]*[ ]*[^ ]*[ ]*//' | \
+		sed 's/[ ]*[^ ]*$//')
+	echo $ADDR
+}
+
+function checkerrs() {
+	if [ "`cat $ERRFILE`" != "" ] ; then {
+		cat $ERRFILE
+		exit
+	} fi
+}
+
+#load the module
+echo Copying $MODULEFILE to $TESTMACHINE
+rcp $MODULEFILE root@${TESTMACHINE}:
+
+echo Loading module $MODULEFILE
+rsh -l root $TESTMACHINE /sbin/insmod -m ./`basename $MODULEFILE` \
+	> $MAPFILE 2> $ERRFILE
+checkerrs
+
+SEGMENTS=`head -n 11 $MAPFILE | tail -n 10`
+TEXTADDR=$(findaddr "\\.text[^.]")
+LOADSTRING="add-symbol-file $MODULEFILE $TEXTADDR"
+SEGADDRS=`echo "$SEGMENTS" | awk '//{
+	if ($1 != ".text" && $1 != ".this" &&
+	    $1 != ".kstrtab" && $1 != ".kmodtab") {
+		print " -s " $1 " 0x" $3 " "
+	}
+}'`
+LOADSTRING="$LOADSTRING $SEGADDRS"
+echo Generating script $GDBSCRIPT
+echo $LOADSTRING > $GDBSCRIPT
--- diff/Documentation/must-fix.txt	1970-01-01 01:00:00.000000000 +0100
+++ source/Documentation/must-fix.txt	2003-11-26 10:09:05.000000000 +0000
@@ -0,0 +1,292 @@
+
+Must-fix bugs
+=============
+
+drivers/char/
+~~~~~~~~~~~~~
+
+o TTY locking is broken.
+
+  o see FIXME in do_tty_hangup().
This causes ppp BUGs in local_bh_enable() + + o Other problems: aviro, dipankar, Alan have details. + + o somebody will have to document the tty driver and ldisc API + +drivers/tty +~~~~~~~~~~~ + +o viro: tty_driver refcounting, tty/misc/upper levels of sound still not + completely fixed. + +drivers/block/ +~~~~~~~~~~~~~~ + +o ideraid hasn't been ported to 2.5 at all yet. + + We need to understand whether the proposed BIO split code will suffice + for this. + +drivers/input/ +~~~~~~~~~~~~~~ + +o rmk: unconverted keyboard/mouse drivers (there's a deadline of 2.6.0 + currently on these remaining in my/Linus' tree.) + +o viro: large absence of locking. + +o viro: parport is nearly as bad as that and there the code is more hairy. + IMO parport is more of "figure out what API changes are needed for its + users, get them done ASAP, then fix generic layer at leisure" + +o (Albert Cahalan) Lots of people (check Google) get this message from the + kernel: + + psmouse.c: Lost synchronization, throwing 2 bytes away. + + (the number of bytes will be 1, 2, or 3) + + At work, I get it when there is heavy NFS traffic. The mouse goes crazy, + jumping around and doing random cut-and-paste all over everything. This + is with a decently fast and modern PC. + +o There seem to be too many reports of keyboards and mice failing or acting + strangely. + + +drivers/misc/ +~~~~~~~~~~~~~ + +o rmk: UCB1[23]00 drivers, currently sitting in drivers/misc in the ARM + tree. (touchscreen, audio, gpio, type device.) + + These need to be moved out of drivers/misc/ and into real places + +o viro: actually, misc.c has a good chance to die. With cdev-cidr that's + trivial. + +drivers/net/ +~~~~~~~~~~~~ + +drivers/net/irda/ +~~~~~~~~~~~~~~~~~ + +o dongle drivers need to be converted to sir-dev + +o irport need to be converted to sir-kthread + +o new drivers (irtty-sir/smsc-ircc2/donauboe) need more testing + +o rmk: Refuse IrDA initialisation if sizeof(structures) is incorrect (I'm + not sure if we still need this; I think gcc 2.95.3 on ARM shows this + problem though.) + +drivers/pci/ +~~~~~~~~~~~~ + +o alan: Some cardbus crashes the system + + (bugzilla, please?) + +drivers/pcmcia/ +~~~~~~~~~~~~~~~ + +o alan: This is a locking disaster. + + (rmk, brodo: in progress) + +drivers/pld/ +~~~~~~~~~~~~ + +o rmk: EPXA (ARM platform) PLD hotswap drivers (drivers/pld) + + (rmk: will work out what to do here. maybe drivers/arm/) + +drivers/video/ +~~~~~~~~~~~~~~ + +o Lots of drivers don't compile, others do but don't work. + +drivers/scsi/ +~~~~~~~~~~~~~ + +o hch: shost->my_devices isn't locked down at all + +o Convert am53c974, dpt_i2o, initio and pci2220i to DMA-mapping + +o Make inia100, cpqfc, pci2000 and dc390t compile + +o Convert + + wd33c99 based: a2091 a3000 gpv11 mvme174 sgiwd93 + + 53c7xx based: amiga7xxx bvme6000 mvme16x initio am53c974 pci2000 + pci2220i dc390t + + To new error handling + + It also might be possible to shift the 53c7xx based drivers over to + 53c700 which does the new EH stuff, but I don't have the hardware to check + such a shift. + + For the non-compiling stuff, I've probably missed a few that just aren't + compilable on my platforms, so any updates would be welcome. Also, are + some of our non-compiling or unconverted drivers obsolete? + +o rmk: I have a pending todo: I need to put the scsi error handling through + a workout on my scsi bus from hell to make sure it does the right thing + and doesn't get wedged. + +o James B: USB hot-removal crash: "It's a known scsi refcounting issue." 
+
+o James B: refcounting issues in SCSI and in the block layer.
+
+fs/
+~~~
+
+o AIO/direct-IO writes can race with truncate and wreck filesystems.
+  (Badari has a patch)
+
+o viro: fs/char_dev.c needs removal of aeb stuff and merge of cdev-cidr.
+  In progress.
+
+o forward-port sct's O_DIRECT fixes (Badari has a patch)
+
+o viro: there is some generic stuff for namei/namespace/super, but that's a
+  slow-merge and can go in 2.6 just fine
+
+o andi: also soft needs to be fixed - there are quite a lot of
+  uninterruptible waits in sunrpc/nfs
+
+o trond: NFS has a mmap-versus-truncate problem
+
+kernel/sched.c
+~~~~~~~~~~~~~~
+
+o Starvation, general interactivity need close monitoring.
+
+kernel/
+~~~~~~~
+
+o Alan: 32bit uid support is *still* broken for process accounting.
+
+  Create a 32bit uid, turn accounting on.  Shock horror it doesn't work
+  because the field is 16bit.  We need an acct structure flag day for 2.6
+  IMHO
+
+  (alan has patch)
+
+o viro: core sysctl code is racy.  And so is its interaction with sysfs.
+
+o (ingo) rwsems (on x86) are limited to 32766 waiting processes.  This
+  means that setting pid_max to above 32K is unsafe :-(
+
+  An option is to use CONFIG_RWSEM_GENERIC_SPINLOCK variant all the time,
+  for all archs, and not inline any part of the ops.
+
+lib/kobject.c
+~~~~~~~~~~~~~
+
+o kobject refcounting (comments from Al Viro):
+
+  _anything_ can grab a temporary reference to kobject.  IOW, if kobject is
+  embedded into something that could be freed - it _MUST_ have a destructor
+  and that destructor _MUST_ be the destructor for containing object.
+
+  Any violation of the above (and we already have a bunch of those) is a
+  user-triggerable memory corruption.
+
+  We can tolerate it for a while in 2.5 (e.g. during work on a subsystem we
+  can decide to switch to that way of handling objects and have the
+  subsystem vulnerable for a while), but all such windows must be closed
+  before 2.6 and during 2.6 we can't open them at all.
+
+o All block drivers which control multiple gendisks with a single
+  request_queue are broken, due to one-to-one assumptions in the request
+  queue sysfs hookup.
+
+mm/
+~~~
+
+o GFP_DMA32 (or something like that).  Lots of ideas.  jejb, zaitcev,
+  willy, arjan, wli.
+
+  Specifically, 64-bit systems need to be able to enforce 32-bit addressing
+  limits for device metadata like network cards' ring buffers and SCSI
+  command descriptors.
+
+o access_process_vm() doesn't flush right.  We probably need new flushing
+  primitives to do this (davem?)
+
+
+modules
+~~~~~~~
+
+  (Rusty)
+
+net/
+~~~~
+
+  (davem)
+
+o UDP apps can in theory deadlock, because the ip_append_data path can end
+  up sleeping while the socket lock is held.
+
+  It is OK to sleep with the socket lock held, normally.  But in this case
+  the sleep happens while waiting for socket memory/space to become
+  available, if another context needs to take the socket lock to free up the
+  space we could hang.
+
+  I sent a rough patch on how to fix this to Alexey, and he is analyzing
+  the situation.  I expect a final fix from him next week or so.
+
+o Semantics for IPSEC during operations such as TCP connect suck currently.
+
+  When we first try to connect to a destination, we may need to ask the
+  IPSEC key management daemon to resolve the IPSEC routes for us.  For the
+  purposes of what the kernel needs to do, you can think of it like ARP.  We
+  can't send the packet out properly until we resolve the path.
+ + What happens now for IPSEC is basically this: + + O_NONBLOCK: returns -EAGAIN over and over until route is resolved + + !O_NONBLOCK: Sleeps until route is resolved + + These semantics are total crap. The solution, which Alexey is working + on, is to allow incomplete routes to exist. These "incomplete" routes + merely put the packet onto a "resolution queue", and once the key manager + does it's thing we finish the output of the packet. This is precisely how + ARP works. + + I don't know when Alexey will be done with this. + +net/*/netfilter/ +~~~~~~~~~~~~~~~~ + + (Rusty) + +o Rework conntrack hashing. + +o Module relationship bogosity fix (trivial, have patch). + +sound/ +~~~~~~ + +global +~~~~~~ + +o viro: 64-bit dev_t (not a mustfix for 2.6.0). 32-bit dev_t is done, 64-bit + means extra work on nfsd/raid/etc. + +o alan: Forward port 2.4 fixes + - Chris Wright: Security fixes including execve holes, execve vs proc races + +o There are about 60 or 70 security related checks that need doing + (copy_user etc) from Stanford tools. (badari is looking into this, and + hollisb) + +o A couple of hundred real looking bugzilla bugs + +o viro: cdev rework. Mostly done. + --- diff/Documentation/should-fix.txt 1970-01-01 01:00:00.000000000 +0100 +++ source/Documentation/should-fix.txt 2003-11-26 10:09:05.000000000 +0000 @@ -0,0 +1,620 @@ +Not-ready features and speedups +=============================== + +Legend: + +PRI1: We're totally lame if this doesn't get in +PRI2: Would be nice +PRI3: Not very important + +drivers/block/ +~~~~~~~~~~~~~~ + +o viro: paride drivers need a big cleanup. Partially done, but ATAPI drivers + need serious work and bug fixing. + + PRI2 + +drivers/char/rtc/ +~~~~~~~~~~~~~~~~~ + +o rmk, trini: add support for alarms to the existing generic rtc driver. + + PRI2 + +console drivers +~~~~~~~~~~~~~~~ + (Pavel Machek <pavel@ucw.cz>) + +o There are few must-fix bugs in cursor handling. + +o Play with gpm selection for a while and your cursor gets corrupted with + random dots. Ouch. + +device mapper +~~~~~~~~~~~~~ + +o ioctl interface cleanup patch is ready (redo the structure layouts) + + PRI1 + +o A port of the 2.4 snapshot and mirror targets is in progress + + PRI1 + +o the fs interface to dm needs to be redone. gregkh was going to work on + this. viro is interested in seeing work thus-far. + + PRI2 + +drivers/net/wireless/ +~~~~~~~~~~~~~~~~~~~~~ + + (Jean Tourrilhes <jt@bougret.hpl.hp.com>) + +o get latest orinoco changes from David. + + PRI1 + +o get the latest airo.c fixes from CVS. This will hopefully fix problems + people have reported on the LKML. + + PRI1 + +o get HostAP driver in the kernel. No consolidation of the 802.11 + management across driver can happen until this one is in (which is probably + 2.7.X material). I think Jouni is mostly ready but didn't find time for + it. + + PRI2 + +o get more wireless drivers into the kernel. The most "integrable" drivers + at this point seem the NWN driver, Pavel's Spectrum driver and the Atmel + driver. + + PRI1 + +o The last two drivers mentioned above are held up by firmware issues (see + flamewar on LKML a few days ago). So maybe fixing those firmware issues + should be a requirement for 2.6.X, because we can expect more wireless + devices to need firmware upload at startup coming to market. + + (in progress?) 
+net/*/netfilter/
+~~~~~~~~~~~~~~~~
+
+  (Rusty)
+
+o Rework conntrack hashing.
+
+o Module relationship bogosity fix (trivial, have patch).
+
+sound/
+~~~~~~
+
+global
+~~~~~~
+
+o viro: 64-bit dev_t (not a mustfix for 2.6.0).  32-bit dev_t is done;
+  64-bit means extra work on nfsd/raid/etc.
+
+o alan: Forward port 2.4 fixes
+  - Chris Wright: Security fixes including execve holes, execve vs proc races
+
+o There are about 60 or 70 security related checks that need doing
+  (copy_user etc) from Stanford tools.  (badari is looking into this, and
+  hollisb)
+
+o A couple of hundred real-looking bugzilla bugs
+
+o viro: cdev rework.  Mostly done.
+
--- diff/Documentation/should-fix.txt	1970-01-01 01:00:00.000000000 +0100
+++ source/Documentation/should-fix.txt	2003-11-26 10:09:05.000000000 +0000
@@ -0,0 +1,620 @@
+Not-ready features and speedups
+===============================
+
+Legend:
+
+PRI1:	We're totally lame if this doesn't get in
+PRI2:	Would be nice
+PRI3:	Not very important
+
+drivers/block/
+~~~~~~~~~~~~~~
+
+o viro: paride drivers need a big cleanup.  Partially done, but ATAPI drivers
+  need serious work and bug fixing.
+
+  PRI2
+
+drivers/char/rtc/
+~~~~~~~~~~~~~~~~~
+
+o rmk, trini: add support for alarms to the existing generic rtc driver.
+
+  PRI2
+
+console drivers
+~~~~~~~~~~~~~~~
+  (Pavel Machek <pavel@ucw.cz>)
+
+o There are a few must-fix bugs in cursor handling.
+
+o Play with gpm selection for a while and your cursor gets corrupted with
+  random dots.  Ouch.
+
+device mapper
+~~~~~~~~~~~~~
+
+o ioctl interface cleanup patch is ready (redo the structure layouts)
+
+  PRI1
+
+o A port of the 2.4 snapshot and mirror targets is in progress
+
+  PRI1
+
+o the fs interface to dm needs to be redone.  gregkh was going to work on
+  this.  viro is interested in seeing work thus-far.
+
+  PRI2
+
+drivers/net/wireless/
+~~~~~~~~~~~~~~~~~~~~~
+
+  (Jean Tourrilhes <jt@bougret.hpl.hp.com>)
+
+o get latest orinoco changes from David.
+
+  PRI1
+
+o get the latest airo.c fixes from CVS.  This will hopefully fix problems
+  people have reported on the LKML.
+
+  PRI1
+
+o get the HostAP driver into the kernel.  No consolidation of the 802.11
+  management across drivers can happen until this one is in (which is
+  probably 2.7.X material).  I think Jouni is mostly ready but didn't find
+  time for it.
+
+  PRI2
+
+o get more wireless drivers into the kernel.  The most "integrable" drivers
+  at this point seem the NWN driver, Pavel's Spectrum driver and the Atmel
+  driver.
+
+  PRI1
+
+o The last two drivers mentioned above are held up by firmware issues (see
+  the flamewar on LKML a few days ago).  So maybe fixing those firmware
+  issues should be a requirement for 2.6.X, because we can expect more
+  wireless devices that need firmware upload at startup to come to market.
+
+  (in progress?)
+
+  PRI1
+
+drivers/usb/gadget/
+~~~~~~~~~~~~~~~~~~~
+
+o rmk: SA11xx USB client/gadget code (David B has been doing some work on
+  this, and keeps trying to prod me, but unfortunately I haven't had the time
+  to look at his work, sorry David.)
+
+  PRI3
+
+fs/
+~~~
+
+o ext3 and ext2 block allocators have serious failure modes - interleaved
+  allocations.
+
+  PRI3
+
+o Integrate Chris Mason's 2.4 reiserfs ordered data and data journaling
+  patches.  They make reiserfs a lot safer.
+
+  Ordered: PRI2
+  data journalled: PRI3
+
+o (Trond:)  Yes: I'm still working on an atomic "open()", i.e. one
+  where we short-circuit the usual VFS path_walk() + lookup() +
+  permission() + create() + .... bullsh*t...
+
+  I have several reasons for wanting to do this (all of
+  them related to NFS of course, but much of the reasoning applies
+  to *all* networked file systems).
+
+  1) The above sequence is simply not atomic on *any* networked
+  filesystem.
+
+  2) It introduces a sh*tload of completely unnecessary RPC calls (why
+  do a 'permission' RPC call when the server is in *any* case going to
+  tell you whether or not this operation is allowed.  Why do a
+  'lookup()' when the 'create()' call can be made to tell you whether or
+  not a file already exists).
+
+  3) It is incompatible with some operations: the current create()
+  doesn't pass an 'EXCLUSIVE' flag down to the filesystems.
+
+  4) (NFS specific?) open() has very different cache consistency
+  requirements when compared to most other VFS operations.
+
+  I'd very much like for something like Peter Braam's 'lookup with
+  intent' or (better yet) for a proper dentry->open() to be integrated with
+  path_walk()/open_namei().  I'm still working on the latter (Peter has
+  already completed the lookup with intent stuff).
+
+  (All this is in progress, see http://www.fys.uio.no/~trondmy/src)
+
+  (Is awaiting Peter Braam's intent patches.  Applicable to CIFS)
+
+  (A sketch of the hook in question follows at the end of this section.)
+
+  PRI2 (?)
+
+o viro: convert more filesystems to use lib/parser.c for options.
+
+  PRI2
+
+o aio: fs IO isn't async at present.  suparna has restart patches, they're
+  in -mm.  Need to get Ben to review/comment.
+
+  PRI1
+
+o drepper: various filesystems use ->pid wrongly
+
+  PRI1
+
+o hch: devfs: there's a fundamental lookup vs devfsd race that's only
+  fixable by introducing a lookup vs devfs deadlock.  I can't see how this is
+  fixable without getting rid of the current devfsd design.  Mandrake seems
+  to have a workaround for this so this is at least not triggered so easily,
+  but that's not what I'd consider a fix.
+
+  PRI2
+
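+  (To make the atomic-open discussion above concrete, roughly the shape of
+  hook being asked for.  'open_atomic' is a made-up name - no such inode
+  operation exists today, and locking/error handling are hand-waved:)
+
+	/* today: several steps, each a potential RPC, none of it atomic */
+	dentry = lookup_hash(&nd.last, nd.dentry);
+	error = permission(dir->d_inode, MAY_WRITE | MAY_EXEC, &nd);
+	error = vfs_create(dir->d_inode, dentry, mode, &nd);
+
+	/* wanted: one operation a network fs can map onto one RPC */
+	error = dir->d_inode->i_op->open_atomic(dir->d_inode, dentry,
+						flags, mode);
+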
+kernel/
+~~~~~~~
+
+o rusty: Zippel's Reference count simplification.  Tricky code, but cuts
+  about 120 lines from module.c.  Patch exists, needs stressing.
+
+  PRI3
+
+o rusty: Fix module-failed-init races by starting module "disabled".  Patch
+  exists, requires some subsystems (i.e. add_partition) to explicitly say
+  "make module live now".  Without patch we are no worse off than 2.4 etc.
+
+  PRI1
+
+o Integrate userspace irq balancing daemon.
+
+  PRI2
+
+o kexec.  Seems to work, was in -mm.
+
+  PRI3
+
+o rmk: lib/inflate.c must not use static variables (causes these to be
+  referenced via GOTOFF relocations in the PIC decompressor.  We have a PIC
+  decompressor to avoid having to hard code a per-platform zImage link
+  address into the makefiles.)
+
+  PRI2
+
+o klibc merge?
+
+  PRI2
+
+mm/
+~~~
+
+o dropbehind for large files
+
+  PRI2
+
+net/
+~~~~
+
+  (davem)
+
+o Real serious use of IPSEC is hampered by lack of MPLS support.  MPLS is a
+  switching technology that works by switching based upon fixed length labels
+  prepended to packets.  Many people use this and IPSEC to implement VPNs
+  over public networks; it is also used for things like traffic engineering.
+
+  A good reference site is:
+
+  http://www.mplsrc.com/
+
+  Anyways, an existing (crappy) implementation exists.  I've almost
+  completed a rewrite; I should have something in the tree next week.
+
+  PRI1
+
+o Sometimes we generate IP fragments when it truly isn't necessary.
+
+  The way IP fragmentation is specified, each fragment must be modulo 8
+  bytes in length.  So suppose the device has an MTU that is not 0 modulo 8;
+  ethernet even classifies in this way.  1500 == (8 * 187) + 4
+
+  Our IP fragmenting engine can fragment on packets that are sized within
+  the last modulo 8 bytes of the MTU.  This happens in obscure cases, but it
+  does happen.
+
+  I've proposed a fix to Alexey, whereby very late in the output path we
+  check the packet; if we fragmented but the data length would fit into the
+  MTU, we unfragment the packet.
+
+  This is low priority, because technically it creates suboptimal behavior
+  rather than mis-operation.
+
+  PRI1
+
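+  (Worked example of the above, assuming an IP header of 24 bytes, i.e.
+  one carrying options:)
+
+	unsigned int mtu = 1500;			/* 8*187 + 4 */
+	unsigned int hlen = 24;				/* header + options */
+	unsigned int frag_data = (mtu - hlen) & ~7;	/* 1472, not 1476 */
+
+	/* The largest fragment is hlen + frag_data = 1496 < mtu, so a
+	 * datagram of total length 1497..1500 gets split in two even
+	 * though it would have fit the MTU whole - exactly the case the
+	 * late "unfragment" check described above would catch. */
+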
+net/*/netfilter/
+~~~~~~~~~~~~~~~~
+
+o Lots of misc. cleanups, which are happening slowly.
+
+  PRI2
+
+power management
+~~~~~~~~~~~~~~~~
+
+o PM code in mainline is currently b0rked.  Fixes in -mm
+
+  PRI1
+
+o Pat and Pavel disagree over swsusp.  Need to sort that out.
+
+  PRI2
+
+o Frame buffer restore codepaths (that requires some deep PCI magic)
+
+  PRI2
+
+o XFree86 hooks
+
+  PRI2
+
+o AGP restoration
+
+  PRI2
+
+o DRI restoration
+
+  (davej/Alan: not super-critical, can crash laptop on restore.  davej
+  looking into it.)
+
+  PRI2
+
+o IDE suspend/resume without races (Ben is looking at this a little)
+
+  PRI2
+
+o Pat: There are already CPU device structures; MTRRs should be a
+  dynamically registered interface of CPUs, which implies there needs
+  to be some other glue to know that there are MTRRs that need to be
+  saved/restored.
+
+  PRI1
+
+global
+~~~~~~
+
+o We need a kernel-side API for reporting error events to userspace (could
+  be async to 2.6 itself)
+
+  (Prototype core based on netlink exists)
+
+  PRI2
+
+o Kai: Introduce a sane, easy and standard way to build external modules
+  - make clean and make modules_install are both broken
+
+  PRI2
+
+drivers
+~~~~~~~
+
+o Alan: Cardbus/PCMCIA requires that all of Russell's stuff is merged to do
+  multiheader right and so on
+
+  PRI1
+
+drivers/acpi/
+~~~~~~~~~~~~~
+
+o Fix acpi for all newer IBM Thinkpads; see
+  http://bugme.osdl.org/show_bug.cgi?id=1038 for more information
+
+o alan: VIA APIC stuff is one bit of this; there are also some other
+  reports that were caused by ACPI not setting level vs edge trigger some
+  times
+
+  PRI1
+
+o mochel: it seems the acpi irq routing code could use a serious rewrite.
+
+  grover: The problem is the ACPI irq routing code is trying to piggyback
+  on the existing MPS-specific data structures, and it's generally a hack.
+  So yes mochel is right, but it is also purging MPS-ities from common code
+  as well.  I've done some preliminary work in this area and it doesn't seem
+  to break anything (yet) but a rewrite in this area imho should not be
+  rushed out the door.  And, I think the above bugs can be fixed w/o the
+  rewrite.
+
+  PRI2
+
+o mochel: ACPI suspend doesn't work.  Important, not critical.  Pat is
+  working it.
+
+  PRI2
+
+drivers/block/
+~~~~~~~~~~~~~~
+
+o Floppy is almost unusably buggy still
+
+  akpm: we need more people to test & report.
+
+  alan: "Floppy has worked for me since the patches that went in 2.5.69-ac
+  and I think -bk somewhere"
+
+  PRI1
+
+drivers/char/
+~~~~~~~~~~~~~
+
+
+drivers/ide/
+~~~~~~~~~~~~
+
+  (Alan)
+
+o IDE PIO has occasional unexplained PIO disk-eating reports
+
+  PRI1
+
+o IDE has multiple zillions of races/hangs in 2.5 still
+
+  PRI1
+
+o IDE scsi needs rewriting
+
+  PRI2
+
+o IDE needs significant reworking to handle Simplex right
+
+  PRI2
+
+o IDE hotplug handling for 2.5 is completely broken still
+
+  PRI2
+
+o There are lots of other IDE bugs that won't go away until the taskfile
+  stuff is included, the locking bugs that allow any user to hang the IDE
+  layer in 2.5, and some other updates are forward ported.  (esp. HPT372N).
+
+  PRI1
+
+drivers/isdn/
+~~~~~~~~~~~~~
+
+  (Kai, rmk)
+
+o isdn_tty locking is completely broken (cli() and friends)
+
+  PRI2
+
+o fix other drivers
+
+  PRI2
+
+o lots more cleanups, adaptation to recent APIs etc
+
+  PRI3
+
+o fixup tty-based ISDN drivers which provide TIOCM* ioctls (see my recent
+  3-set patch for serial stuff)
+
+  Alternatively, we could re-introduce the fallback to driver ioctl parsing
+  for these if not enough drivers get updated.
+
+  PRI3
+
+drivers/net/
+~~~~~~~~~~~~
+
+o davej: Either wireless network drivers or PCMCIA broke somewhere along
+  the way.  A configuration that worked fine under 2.4 doesn't receive any
+  packets.  Need to look into this more to make sure I don't have any
+  misconfiguration that just 'happened to work' under 2.4
+
+  PRI1
+
+drivers/scsi/
+~~~~~~~~~~~~~
+
+o jejb: qlogic -
+
+  o Merge the feral driver.  It covers all qlogic chips: 1020 all the way
+    up to 23xxx.  http://linux-scsi.bkbits.net/scsi-isp-2.5
+
+  o qla2xxx: only for FC chips.  Has significant build issues.  hch
+    promises to send me a "must fix" list for this.
+    http://linux-scsi.bkbits.net/scsi-qla2xxx-2.5
+
+  PRI2
+
+o hch, Mike Anderson, Badari Pulavarty: scsi locking issues
+
+  o there are lots of members of struct Scsi_Host/scsi_device/scsi_cmnd
+    with very unclear locking; many of them probably want to become
+    atomic_t's or bitmaps (for the 1bit bitfields).
+
+  o there's lots of volatile abuse in the scsi code that needs to be
+    thought about.
+
+  o there are some global variables incremented without any locks
+
+  PRI2
+
+sound/
+~~~~~~
+
+o rmk: several OSS drivers for SA11xx-based hardware in need of
+  ALSA-ification and L3 bus support code for these.
+
+o rmk: linux/sound/drivers/mpu401/mpu401.c and
+  linux/sound/drivers/virmidi.c complained about 'errno' at some time in the
+  past; need to confirm whether this is still a problem.
+
+o rmk: need to complete ALSA-ification of the WaveArtist driver for both
+  NetWinder and other stuff (there's some fairly fundamental differences in
+  the way the mixer needs to be handled for the NetWinder.)
+
+  (Issues with forward-porting 2.4 bugfixes.)
+  (Killing off OSS is 2.7 material)
+
+  PRI2
+
+arch/i386/
+~~~~~~~~~~
+
+o Also PC9800 merge needs finishing to the point we want for 2.6 (not all).
+
+  PRI3
+
+o davej: PAT support (for mtrr exhaustion w/ AGP)
+
+  PRI2
+
+o 2.5.x won't boot on some 440GX
+
+  alan: Problem understood now, feasible fix in 2.4/2.4-ac.  (440GX has two
+  IRQ routers; we use the $PIR table with the PIIX, but the 440GX doesn't
+  use the PIIX for its IRQ routing).  Fall back to BIOS for 440GX works and
+  Intel concurs.
+
+  PRI1
+
+o 2.5.x doesn't handle VIA APIC right yet.
+
+  1. We must write the PCI_INTERRUPT_LINE
+
+  2. We have quirk handlers that seem to trash it.
+
+  (A sketch of point 1 follows at the end of this section.)
+
+  PRI1
+
+o ECC driver questions are not yet sorted (DaveJ is working on this) (Dan
+  Hollis)
+
+  alan: ECC - I have some test bits from Dan's stuff - they need no kernel
+  core changes for most platforms.  That means we can treat it as a random
+  driver merge.
+
+  PRI3
+
+o alan: 2.4 has some fixes for tsc handling bugs.  One where some BIOSes in
+  SMM mode mess up our toggle on the time high/low or mangle the counter,
+  and one where a few chips need religious use of _p for timer access and we
+  don't do that.  This is forward porting little bits of fixup.
+
+  ACPI HZ stuff we can't trap - a lot of ACPI is implemented as outb's
+  triggering SMM traps
+
+  PRI1
+
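+  (Sketch of point 1 from the VIA APIC item above - roughly what needs to
+  happen once IRQ routing has picked 'irq' for a device; this is not the
+  actual patch:)
+
+	pci_write_config_byte(dev, PCI_INTERRUPT_LINE, irq);
+	dev->irq = irq;
+
+  Any quirk handler that rewrites PCI_INTERRUPT_LINE afterwards (point 2)
+  would undo this, which is why both halves need fixing together.
+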
+arch/x86_64/
+~~~~~~~~~~~~
+
+  (Andi)
+
+o time handling is broken.  Need to move up 2.4 time.c code.
+
+  PRI1
+
+o NMI watchdog seems to tick too fast
+
+  PRI2
+
+o need to coredump 64bit vsyscall code with dwarf2
+
+  PRI2
+
+o Consider merging of Erich Focht's very clean and simple homenode NUMA
+  scheduler (I have my own in 2.4, but Erich's 2.5 version is much cleaner)
+
+  PRI2
+
+o Consider port of the Simple NUMA API from 2.4/homenode.
+
+  PRI3
+
+o move 64bit signal trampolines into vsyscall code and add dwarf2 for it.
+  (in progress)
+
+  PRI1
+
+o describe kernel assembly with dwarf2 annotations for kgdb
+
+  PRI3
+
+arch/alpha/
+~~~~~~~~~~~
+
+o rth: Ptrace writes are broken.  This means we can't (reliably) set
+  breakpoints or modify variables from gdb.
+
+  PRI1
+
+arch/arm/
+~~~~~~~~~
+
+o rmk: missing raw keyboard translation tables for all ARM machines.
+  Haven't even looked into this at all.  This could be messy since there
+  isn't an ARM architecture standard.  I'm presently hoping that it won't be
+  an issue.  If it does become one, I guess we'll see
+  drivers/char/keyboard.c explode.
+
+  PRI2
+
+arch/others/
+~~~~~~~~~~~~
+
+o SH needs resyncing, as do some other ports.  SH64 needs merging.
+  No impact on mainstream platforms hopefully.
+
+  PRI2
+
+arch/s390/
+~~~~~~~~~~
+
+o A nasty memory management problem causes random crashes.  These appear
+  to be fixed/hidden by the objrmap patch; more investigation is needed.
+
+  PRI1
+
+drivers/s390/
+~~~~~~~~~~~~~
+
+o Early userspace and 64-bit dev_t will allow the removal of most of
+  dasd_devmap.c and dasd_genhd.c.
+
+  PRI2
+
+o The 3270 console driver needs to be replaced with a working one
+  (prototype is there, needs to be finished).
+
+  PRI2
+
+o Minor interface changes are pending in cio/ when the z990 machines are
+  out.
+
+  PRI2
+
+o Jan Glauber is working on a fix for the timer issues related to running
+  on virtualized CPUs (wall-clock vs. cpu time).
+
+  PRI1
+
+o a block device driver for ramdisks shared among virtual machines
+
+  PRI3
+
+o driver for crypto hardware
+
+  PRI3
+
+o 'claw' network device driver
+
+  PRI3
+
--- diff/arch/i386/kernel/efi.c	1970-01-01 01:00:00.000000000 +0100
+++ source/arch/i386/kernel/efi.c	2003-11-26 10:09:04.000000000 +0000
@@ -0,0 +1,645 @@
+/*
+ * Extensible Firmware Interface
+ *
+ * Based on Extensible Firmware Interface Specification version 1.0
+ *
+ * Copyright (C) 1999 VA Linux Systems
+ * Copyright (C) 1999 Walt Drummond <drummond@valinux.com>
+ * Copyright (C) 1999-2002 Hewlett-Packard Co.
+ *	David Mosberger-Tang <davidm@hpl.hp.com>
+ *	Stephane Eranian <eranian@hpl.hp.com>
+ *
+ * Not all EFI Runtime Services are implemented yet, as EFI only
+ * supports physical mode addressing on SoftSDV.  This is to be fixed
+ * in a future version. 
--drummond 1999-07-20 + * + * Implemented EFI runtime services and virtual mode calls. --davidm + * + * Goutham Rao: <goutham.rao@intel.com> + * Skip non-WB memory and ignore empty memory ranges. + */ + +#include <linux/config.h> +#include <linux/kernel.h> +#include <linux/init.h> +#include <linux/mm.h> +#include <linux/types.h> +#include <linux/time.h> +#include <linux/spinlock.h> +#include <linux/bootmem.h> +#include <linux/ioport.h> +#include <linux/proc_fs.h> +#include <linux/efi.h> + +#include <asm/setup.h> +#include <asm/io.h> +#include <asm/page.h> +#include <asm/pgtable.h> +#include <asm/processor.h> +#include <asm/desc.h> +#include <asm/pgalloc.h> +#include <asm/tlbflush.h> + +#define EFI_DEBUG 0 +#define PFX "EFI: " + +extern efi_status_t asmlinkage efi_call_phys(void *, ...); + +struct efi efi; +struct efi efi_phys __initdata; +struct efi_memory_map memmap __initdata; + +/* + * We require an early boot_ioremap mapping mechanism initially + */ +extern void * boot_ioremap(unsigned long, unsigned long); + +/* + * efi_dir is allocated here, but the directory isn't created + * here, as proc_mkdir() doesn't work this early in the bootup + * process. Therefore, each module, like efivars, must test for + * if (!efi_dir) efi_dir = proc_mkdir("efi", NULL); + * prior to creating their own entries under /proc/efi. + */ +#ifdef CONFIG_PROC_FS +struct proc_dir_entry *efi_dir; +#endif + + +/* + * To make EFI call EFI runtime service in physical addressing mode we need + * prelog/epilog before/after the invocation to disable interrupt, to + * claim EFI runtime service handler exclusively and to duplicate a memory in + * low memory space say 0 - 3G. + */ + +static unsigned long efi_rt_eflags; +static spinlock_t efi_rt_lock = SPIN_LOCK_UNLOCKED; +static pgd_t efi_bak_pg_dir_pointer[2]; + +static void efi_call_phys_prelog(void) +{ + unsigned long cr4; + unsigned long temp; + + spin_lock(&efi_rt_lock); + local_irq_save(efi_rt_eflags); + + /* + * If I don't have PSE, I should just duplicate two entries in page + * directory. If I have PSE, I just need to duplicate one entry in + * page directory. + */ + __asm__ __volatile__("movl %%cr4, %0":"=r"(cr4)); + + if (cr4 & X86_CR4_PSE) { + efi_bak_pg_dir_pointer[0].pgd = + swapper_pg_dir[pgd_index(0)].pgd; + swapper_pg_dir[0].pgd = + swapper_pg_dir[pgd_index(PAGE_OFFSET)].pgd; + } else { + efi_bak_pg_dir_pointer[0].pgd = + swapper_pg_dir[pgd_index(0)].pgd; + efi_bak_pg_dir_pointer[1].pgd = + swapper_pg_dir[pgd_index(0x400000)].pgd; + swapper_pg_dir[pgd_index(0)].pgd = + swapper_pg_dir[pgd_index(PAGE_OFFSET)].pgd; + temp = PAGE_OFFSET + 0x400000; + swapper_pg_dir[pgd_index(0x400000)].pgd = + swapper_pg_dir[pgd_index(temp)].pgd; + } + + /* + * After the lock is released, the original page table is restored. 
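+	 * The pgd entries duplicated above alias virtual addresses 0-4MB
+	 * (0-8MB in the non-PSE case) to the same frames as the kernel
+	 * mapping, so the firmware can be entered through physical
+	 * addresses while paging stays enabled; the GDT is switched to
+	 * its physical address below for the same reason.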
+ */ + local_flush_tlb(); + + cpu_gdt_descr[0].address = __pa(cpu_gdt_descr[0].address); + __asm__ __volatile__("lgdt %0":"=m" + (*(struct Xgt_desc_struct *) __pa(&cpu_gdt_descr[0]))); +} + +static void efi_call_phys_epilog(void) +{ + unsigned long cr4; + + cpu_gdt_descr[0].address = + (unsigned long) __va(cpu_gdt_descr[0].address); + __asm__ __volatile__("lgdt %0":"=m"(cpu_gdt_descr)); + __asm__ __volatile__("movl %%cr4, %0":"=r"(cr4)); + + if (cr4 & X86_CR4_PSE) { + swapper_pg_dir[pgd_index(0)].pgd = + efi_bak_pg_dir_pointer[0].pgd; + } else { + swapper_pg_dir[pgd_index(0)].pgd = + efi_bak_pg_dir_pointer[0].pgd; + swapper_pg_dir[pgd_index(0x400000)].pgd = + efi_bak_pg_dir_pointer[1].pgd; + } + + /* + * After the lock is released, the original page table is restored. + */ + local_flush_tlb(); + + local_irq_restore(efi_rt_eflags); + spin_unlock(&efi_rt_lock); +} + +static efi_status_t +phys_efi_set_virtual_address_map(unsigned long memory_map_size, + unsigned long descriptor_size, + u32 descriptor_version, + efi_memory_desc_t *virtual_map) +{ + efi_status_t status; + + efi_call_phys_prelog(); + status = efi_call_phys(efi_phys.set_virtual_address_map, + memory_map_size, descriptor_size, + descriptor_version, virtual_map); + efi_call_phys_epilog(); + return status; +} + +efi_status_t +phys_efi_get_time(efi_time_t *tm, efi_time_cap_t *tc) +{ + efi_status_t status; + + efi_call_phys_prelog(); + status = efi_call_phys(efi_phys.get_time, tm, tc); + efi_call_phys_epilog(); + return status; +} + +int inline efi_set_rtc_mmss(unsigned long nowtime) +{ + int real_seconds, real_minutes; + efi_status_t status; + efi_time_t eft; + efi_time_cap_t cap; + + spin_lock(&efi_rt_lock); + status = efi.get_time(&eft, &cap); + spin_unlock(&efi_rt_lock); + if (status != EFI_SUCCESS) + panic("Ooops, efitime: can't read time!\n"); + real_seconds = nowtime % 60; + real_minutes = nowtime / 60; + + if (((abs(real_minutes - eft.minute) + 15)/30) & 1) + real_minutes += 30; + real_minutes %= 60; + + eft.minute = real_minutes; + eft.second = real_seconds; + + if (status != EFI_SUCCESS) { + printk("Ooops: efitime: can't read time!\n"); + return -1; + } + return 0; +} +/* + * This should only be used during kernel init and before runtime + * services have been remapped, therefore, we'll need to call in physical + * mode. Note, this call isn't used later, so mark it __init. + */ +unsigned long inline __init efi_get_time(void) +{ + efi_status_t status; + efi_time_t eft; + efi_time_cap_t cap; + + status = phys_efi_get_time(&eft, &cap); + if (status != EFI_SUCCESS) + printk("Oops: efitime: can't read time status: 0x%lx\n",status); + + return mktime(eft.year, eft.month, eft.day, eft.hour, + eft.minute, eft.second); +} + +int is_available_memory(efi_memory_desc_t * md) +{ + if (!(md->attribute & EFI_MEMORY_WB)) + return 0; + + switch (md->type) { + case EFI_LOADER_CODE: + case EFI_LOADER_DATA: + case EFI_BOOT_SERVICES_CODE: + case EFI_BOOT_SERVICES_DATA: + case EFI_CONVENTIONAL_MEMORY: + return 1; + } + return 0; +} + +/* + * We need to map the EFI memory map again after paging_init(). 
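+ * The boot_ioremap() mapping set up in efi_init() is unusable once the
+ * real page tables are live, hence the bt_ioremap() below.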
+ */ +void __init efi_map_memmap(void) +{ + memmap.map = NULL; + + memmap.map = (efi_memory_desc_t *) + bt_ioremap((unsigned long) memmap.phys_map, + (memmap.nr_map * sizeof(efi_memory_desc_t))); + + if (memmap.map == NULL) + printk(KERN_ERR PFX "Could not remap the EFI memmap!\n"); +} + +void __init print_efi_memmap(void) +{ + efi_memory_desc_t *md; + int i; + + for (i = 0; i < memmap.nr_map; i++) { + md = &memmap.map[i]; + printk(KERN_INFO "mem%02u: type=%u, attr=0x%llx, " + "range=[0x%016llx-0x%016llx) (%lluMB)\n", + i, md->type, md->attribute, md->phys_addr, + md->phys_addr + (md->num_pages << EFI_PAGE_SHIFT), + (md->num_pages >> (20 - EFI_PAGE_SHIFT))); + } +} + +/* + * Walks the EFI memory map and calls CALLBACK once for each EFI + * memory descriptor that has memory that is available for kernel use. + */ +void efi_memmap_walk(efi_freemem_callback_t callback, void *arg) +{ + int prev_valid = 0; + struct range { + unsigned long start; + unsigned long end; + } prev, curr; + efi_memory_desc_t *md; + unsigned long start, end; + int i; + + for (i = 0; i < memmap.nr_map; i++) { + md = &memmap.map[i]; + + if ((md->num_pages == 0) || (!is_available_memory(md))) + continue; + + curr.start = md->phys_addr; + curr.end = curr.start + (md->num_pages << EFI_PAGE_SHIFT); + + if (!prev_valid) { + prev = curr; + prev_valid = 1; + } else { + if (curr.start < prev.start) + printk(KERN_INFO PFX "Unordered memory map\n"); + if (prev.end == curr.start) + prev.end = curr.end; + else { + start = + (unsigned long) (PAGE_ALIGN(prev.start)); + end = (unsigned long) (prev.end & PAGE_MASK); + if ((end > start) + && (*callback) (start, end, arg) < 0) + return; + prev = curr; + } + } + } + if (prev_valid) { + start = (unsigned long) PAGE_ALIGN(prev.start); + end = (unsigned long) (prev.end & PAGE_MASK); + if (end > start) + (*callback) (start, end, arg); + } +} + +void __init efi_init(void) +{ + efi_config_table_t *config_tables; + efi_runtime_services_t *runtime; + efi_char16_t *c16; + char vendor[100] = "unknown"; + unsigned long num_config_tables; + int i = 0; + + memset(&efi, 0, sizeof(efi) ); + memset(&efi_phys, 0, sizeof(efi_phys)); + + efi_phys.systab = EFI_SYSTAB; + memmap.phys_map = EFI_MEMMAP; + memmap.nr_map = EFI_MEMMAP_SIZE/EFI_MEMDESC_SIZE; + memmap.desc_version = EFI_MEMDESC_VERSION; + + efi.systab = (efi_system_table_t *) + boot_ioremap((unsigned long) efi_phys.systab, + sizeof(efi_system_table_t)); + /* + * Verify the EFI Table + */ + if (efi.systab == NULL) + printk(KERN_ERR PFX "Woah! Couldn't map the EFI system table.\n"); + if (efi.systab->hdr.signature != EFI_SYSTEM_TABLE_SIGNATURE) + printk(KERN_ERR PFX "Woah! 
EFI system table signature incorrect\n"); + if ((efi.systab->hdr.revision ^ EFI_SYSTEM_TABLE_REVISION) >> 16 != 0) + printk(KERN_ERR PFX + "Warning: EFI system table major version mismatch: " + "got %d.%02d, expected %d.%02d\n", + efi.systab->hdr.revision >> 16, + efi.systab->hdr.revision & 0xffff, + EFI_SYSTEM_TABLE_REVISION >> 16, + EFI_SYSTEM_TABLE_REVISION & 0xffff); + /* + * Grab some details from the system table + */ + num_config_tables = efi.systab->nr_tables; + config_tables = (efi_config_table_t *)efi.systab->tables; + runtime = efi.systab->runtime; + + /* + * Show what we know for posterity + */ + c16 = (efi_char16_t *) boot_ioremap(efi.systab->fw_vendor, 2); + if (c16) { + for (i = 0; i < sizeof(vendor) && *c16; ++i) + vendor[i] = *c16++; + vendor[i] = '\0'; + } else + printk(KERN_ERR PFX "Could not map the firmware vendor!\n"); + + printk(KERN_INFO PFX "EFI v%u.%.02u by %s \n", + efi.systab->hdr.revision >> 16, + efi.systab->hdr.revision & 0xffff, vendor); + + /* + * Let's see what config tables the firmware passed to us. + */ + config_tables = (efi_config_table_t *) + boot_ioremap((unsigned long) config_tables, + num_config_tables * sizeof(efi_config_table_t)); + + if (config_tables == NULL) + printk(KERN_ERR PFX "Could not map EFI Configuration Table!\n"); + + for (i = 0; i < num_config_tables; i++) { + if (efi_guidcmp(config_tables[i].guid, MPS_TABLE_GUID) == 0) { + efi.mps = (void *)config_tables[i].table; + printk(KERN_INFO " MPS=0x%lx ", config_tables[i].table); + } else + if (efi_guidcmp(config_tables[i].guid, ACPI_20_TABLE_GUID) == 0) { + efi.acpi20 = __va(config_tables[i].table); + printk(KERN_INFO " ACPI 2.0=0x%lx ", config_tables[i].table); + } else + if (efi_guidcmp(config_tables[i].guid, ACPI_TABLE_GUID) == 0) { + efi.acpi = __va(config_tables[i].table); + printk(KERN_INFO " ACPI=0x%lx ", config_tables[i].table); + } else + if (efi_guidcmp(config_tables[i].guid, SMBIOS_TABLE_GUID) == 0) { + efi.smbios = (void *) config_tables[i].table; + printk(KERN_INFO " SMBIOS=0x%lx ", config_tables[i].table); + } else + if (efi_guidcmp(config_tables[i].guid, HCDP_TABLE_GUID) == 0) { + efi.hcdp = (void *)config_tables[i].table; + printk(KERN_INFO " HCDP=0x%lx ", config_tables[i].table); + } else + if (efi_guidcmp(config_tables[i].guid, UGA_IO_PROTOCOL_GUID) == 0) { + efi.uga = (void *)config_tables[i].table; + printk(KERN_INFO " UGA=0x%lx ", config_tables[i].table); + } + } + printk("\n"); + + /* + * Check out the runtime services table. We need to map + * the runtime services table so that we can grab the physical + * address of several of the EFI runtime functions, needed to + * set the firmware into virtual mode. + */ + + runtime = (efi_runtime_services_t *) boot_ioremap((unsigned long) + runtime, + sizeof(efi_runtime_services_t)); + if (runtime != NULL) { + /* + * We will only need *early* access to the following + * two EFI runtime services before set_virtual_address_map + * is invoked. + */ + efi_phys.get_time = (efi_get_time_t *) runtime->get_time; + efi_phys.set_virtual_address_map = + (efi_set_virtual_address_map_t *) + runtime->set_virtual_address_map; + } else + printk(KERN_ERR PFX "Could not map the runtime service table!\n"); + + /* Map the EFI memory map for use until paging_init() */ + + memmap.map = (efi_memory_desc_t *) + boot_ioremap((unsigned long) EFI_MEMMAP, EFI_MEMMAP_SIZE); + + if (memmap.map == NULL) + printk(KERN_ERR PFX "Could not map the EFI memory map!\n"); + + if (EFI_MEMDESC_SIZE != sizeof(efi_memory_desc_t)) { + printk(KERN_WARNING PFX "Warning! 
Kernel-defined memdesc doesn't " + "match the one from EFI!\n"); + } +#if EFI_DEBUG + print_efi_memmap(); +#endif +} + +/* + * This function will switch the EFI runtime services to virtual mode. + * Essentially, look through the EFI memmap and map every region that + * has the runtime attribute bit set in its memory descriptor and update + * that memory descriptor with the virtual address obtained from ioremap(). + * This enables the runtime services to be called without having to + * thunk back into physical mode for every invocation. + */ + +void __init efi_enter_virtual_mode(void) +{ + efi_memory_desc_t *md; + efi_status_t status; + int i; + + efi.systab = NULL; + + for (i = 0; i < memmap.nr_map; i++) { + md = &memmap.map[i]; + + if (md->attribute & EFI_MEMORY_RUNTIME) { + md->virt_addr = + (unsigned long)ioremap(md->phys_addr, + md->num_pages << EFI_PAGE_SHIFT); + if (!(unsigned long)md->virt_addr) { + printk(KERN_ERR PFX "ioremap of 0x%lX failed\n", + (unsigned long)md->phys_addr); + } + + if (((unsigned long)md->phys_addr <= + (unsigned long)efi_phys.systab) && + ((unsigned long)efi_phys.systab < + md->phys_addr + + ((unsigned long)md->num_pages << + EFI_PAGE_SHIFT))) { + unsigned long addr; + + addr = md->virt_addr - md->phys_addr + + (unsigned long)efi_phys.systab; + efi.systab = (efi_system_table_t *)addr; + } + } + } + + if (!efi.systab) + BUG(); + + status = phys_efi_set_virtual_address_map( + sizeof(efi_memory_desc_t) * memmap.nr_map, + sizeof(efi_memory_desc_t), + memmap.desc_version, + memmap.phys_map); + + if (status != EFI_SUCCESS) { + printk (KERN_ALERT "You are screwed! " + "Unable to switch EFI into virtual mode " + "(status=%lx)\n", status); + panic("EFI call to SetVirtualAddressMap() failed!"); + } + + /* + * Now that EFI is in virtual mode, update the function + * pointers in the runtime service table to the new virtual addresses. 
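+	 * From this point on the members of struct efi call straight
+	 * into the firmware; no further thunking through efi_call_phys()
+	 * is needed.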
+ */ + + efi.get_time = (efi_get_time_t *) efi.systab->runtime->get_time; + efi.set_time = (efi_set_time_t *) efi.systab->runtime->set_time; + efi.get_wakeup_time = (efi_get_wakeup_time_t *) + efi.systab->runtime->get_wakeup_time; + efi.set_wakeup_time = (efi_set_wakeup_time_t *) + efi.systab->runtime->set_wakeup_time; + efi.get_variable = (efi_get_variable_t *) + efi.systab->runtime->get_variable; + efi.get_next_variable = (efi_get_next_variable_t *) + efi.systab->runtime->get_next_variable; + efi.set_variable = (efi_set_variable_t *) + efi.systab->runtime->set_variable; + efi.get_next_high_mono_count = (efi_get_next_high_mono_count_t *) + efi.systab->runtime->get_next_high_mono_count; + efi.reset_system = (efi_reset_system_t *) + efi.systab->runtime->reset_system; +} + +void __init +efi_initialize_iomem_resources(struct resource *code_resource, + struct resource *data_resource) +{ + struct resource *res; + efi_memory_desc_t *md; + int i; + + for (i = 0; i < memmap.nr_map; i++) { + md = &memmap.map[i]; + + if ((md->phys_addr + (md->num_pages << EFI_PAGE_SHIFT)) > + 0x100000000ULL) + continue; + res = alloc_bootmem_low(sizeof(struct resource)); + switch (md->type) { + case EFI_RESERVED_TYPE: + res->name = "Reserved Memory"; + break; + case EFI_LOADER_CODE: + res->name = "Loader Code"; + break; + case EFI_LOADER_DATA: + res->name = "Loader Data"; + break; + case EFI_BOOT_SERVICES_DATA: + res->name = "BootServices Data"; + break; + case EFI_BOOT_SERVICES_CODE: + res->name = "BootServices Code"; + break; + case EFI_RUNTIME_SERVICES_CODE: + res->name = "Runtime Service Code"; + break; + case EFI_RUNTIME_SERVICES_DATA: + res->name = "Runtime Service Data"; + break; + case EFI_CONVENTIONAL_MEMORY: + res->name = "Conventional Memory"; + break; + case EFI_UNUSABLE_MEMORY: + res->name = "Unusable Memory"; + break; + case EFI_ACPI_RECLAIM_MEMORY: + res->name = "ACPI Reclaim"; + break; + case EFI_ACPI_MEMORY_NVS: + res->name = "ACPI NVS"; + break; + case EFI_MEMORY_MAPPED_IO: + res->name = "Memory Mapped IO"; + break; + case EFI_MEMORY_MAPPED_IO_PORT_SPACE: + res->name = "Memory Mapped IO Port Space"; + break; + default: + res->name = "Reserved"; + break; + } + res->start = md->phys_addr; + res->end = res->start + ((md->num_pages << EFI_PAGE_SHIFT) - 1); + res->flags = IORESOURCE_MEM | IORESOURCE_BUSY; + if (request_resource(&iomem_resource, res) < 0) + printk(KERN_ERR PFX "Failed to allocate res %s : 0x%lx-0x%lx\n", + res->name, res->start, res->end); + /* + * We don't know which region contains kernel data so we try + * it repeatedly and let the resource manager test it. 
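+	 * request_resource() simply fails for the regions that do not
+	 * contain the kernel, so only the matching parent ends up
+	 * claiming code_resource and data_resource.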
+ */ + if (md->type == EFI_CONVENTIONAL_MEMORY) { + request_resource(res, code_resource); + request_resource(res, data_resource); + } + } +} + +/* + * Convenience functions to obtain memory types and attributes + */ + +u32 efi_mem_type(unsigned long phys_addr) +{ + efi_memory_desc_t *md; + int i; + + for (i = 0; i < memmap.nr_map; i++) { + md = &memmap.map[i]; + if ((md->phys_addr <= phys_addr) && (phys_addr < + (md->phys_addr + (md-> num_pages << EFI_PAGE_SHIFT)) )) + return md->type; + } + return 0; +} + +u64 efi_mem_attributes(unsigned long phys_addr) +{ + efi_memory_desc_t *md; + int i; + + for (i = 0; i < memmap.nr_map; i++) { + md = &memmap.map[i]; + if ((md->phys_addr <= phys_addr) && (phys_addr < + (md->phys_addr + (md-> num_pages << EFI_PAGE_SHIFT)) )) + return md->attribute; + } + return 0; +} --- diff/arch/i386/kernel/efi_stub.S 1970-01-01 01:00:00.000000000 +0100 +++ source/arch/i386/kernel/efi_stub.S 2003-11-26 10:09:04.000000000 +0000 @@ -0,0 +1,124 @@ +/* + * EFI call stub for IA32. + * + * This stub allows us to make EFI calls in physical mode with interrupts + * turned off. + */ + +#include <linux/config.h> +#include <linux/linkage.h> +#include <asm/page.h> +#include <asm/pgtable.h> + +/* + * efi_call_phys(void *, ...) is a function with variable parameters. + * All the callers of this function assure that all the parameters are 4-bytes. + */ + +/* + * In gcc calling convention, EBX, ESP, EBP, ESI and EDI are all callee save. + * So we'd better save all of them at the beginning of this function and restore + * at the end no matter how many we use, because we can not assure EFI runtime + * service functions will comply with gcc calling convention, too. + */ + +.text +ENTRY(efi_call_phys) + /* + * 0. The function can only be called in Linux kernel. So CS has been + * set to 0x0010, DS and SS have been set to 0x0018. In EFI, I found + * the values of these registers are the same. And, the corresponding + * GDT entries are identical. So I will do nothing about segment reg + * and GDT, but change GDT base register in prelog and epilog. + */ + + /* + * 1. Now I am running with EIP = <physical address> + PAGE_OFFSET. + * But to make it smoothly switch from virtual mode to flat mode. + * The mapping of lower virtual memory has been created in prelog and + * epilog. + */ + movl $1f, %edx + subl $__PAGE_OFFSET, %edx + jmp *%edx +1: + + /* + * 2. Now on the top of stack is the return + * address in the caller of efi_call_phys(), then parameter 1, + * parameter 2, ..., param n. To make things easy, we save the return + * address of efi_call_phys in a global variable. + */ + popl %edx + movl %edx, saved_return_addr + /* get the function pointer into ECX*/ + popl %ecx + movl %ecx, efi_rt_function_ptr + movl $2f, %edx + subl $__PAGE_OFFSET, %edx + pushl %edx + + /* + * 3. Clear PG bit in %CR0. + */ + movl %cr0, %edx + andl $0x7fffffff, %edx + movl %edx, %cr0 + jmp 1f +1: + + /* + * 4. Adjust stack pointer. + */ + subl $__PAGE_OFFSET, %esp + + /* + * 5. Call the physical function. + */ + jmp *%ecx + +2: + /* + * 6. After EFI runtime service returns, control will return to + * following instruction. We'd better readjust stack pointer first. + */ + addl $__PAGE_OFFSET, %esp + + /* + * 7. Restore PG bit + */ + movl %cr0, %edx + orl $0x80000000, %edx + movl %edx, %cr0 + jmp 1f +1: + /* + * 8. Now restore the virtual mode from flat mode by + * adding EIP with PAGE_OFFSET. + */ + movl $1f, %edx + jmp *%edx +1: + + /* + * 9. Balance the stack. 
And because EAX contain the return value, + * we'd better not clobber it. + */ + leal efi_rt_function_ptr, %edx + movl (%edx), %ecx + pushl %ecx + + /* + * 10. Push the saved return address onto the stack and return. + */ + leal saved_return_addr, %edx + movl (%edx), %ecx + pushl %ecx + ret +.previous + +.data +saved_return_addr: + .long 0 +efi_rt_function_ptr: + .long 0 --- diff/arch/i386/kernel/entry_trampoline.c 1970-01-01 01:00:00.000000000 +0100 +++ source/arch/i386/kernel/entry_trampoline.c 2003-11-26 10:09:04.000000000 +0000 @@ -0,0 +1,75 @@ +/* + * linux/arch/i386/kernel/entry_trampoline.c + * + * (C) Copyright 2003 Ingo Molnar + * + * This file contains the needed support code for 4GB userspace + */ + +#include <linux/init.h> +#include <linux/smp.h> +#include <linux/mm.h> +#include <linux/sched.h> +#include <linux/kernel.h> +#include <linux/string.h> +#include <linux/highmem.h> +#include <asm/desc.h> +#include <asm/atomic_kmap.h> + +extern char __entry_tramp_start, __entry_tramp_end, __start___entry_text; + +void __init init_entry_mappings(void) +{ +#ifdef CONFIG_X86_HIGH_ENTRY + void *tramp; + + /* + * We need a high IDT and GDT for the 4G/4G split: + */ + trap_init_virtual_IDT(); + + __set_fixmap(FIX_ENTRY_TRAMPOLINE_0, __pa((unsigned long)&__entry_tramp_start), PAGE_KERNEL); + __set_fixmap(FIX_ENTRY_TRAMPOLINE_1, __pa((unsigned long)&__entry_tramp_start) + PAGE_SIZE, PAGE_KERNEL); + tramp = (void *)fix_to_virt(FIX_ENTRY_TRAMPOLINE_0); + + printk("mapped 4G/4G trampoline to %p.\n", tramp); + BUG_ON((void *)&__start___entry_text != tramp); + /* + * Virtual kernel stack: + */ + BUG_ON(__kmap_atomic_vaddr(KM_VSTACK0) & 8191); + BUG_ON(sizeof(struct desc_struct)*NR_CPUS*GDT_ENTRIES > 2*PAGE_SIZE); + BUG_ON((unsigned int)&__entry_tramp_end - (unsigned int)&__entry_tramp_start > 2*PAGE_SIZE); + + /* + * set up the initial thread's virtual stack related + * fields: + */ + current->thread.stack_page0 = virt_to_page((char *)current->thread_info); + current->thread.stack_page1 = virt_to_page((char *)current->thread_info + PAGE_SIZE); + current->thread_info->virtual_stack = (void *)__kmap_atomic_vaddr(KM_VSTACK0); + + __kunmap_atomic_type(KM_VSTACK0); + __kunmap_atomic_type(KM_VSTACK1); + __kmap_atomic(current->thread.stack_page0, KM_VSTACK0); + __kmap_atomic(current->thread.stack_page1, KM_VSTACK1); + +#endif + printk("current: %p\n", current); + printk("current->thread_info: %p\n", current->thread_info); + current->thread_info->real_stack = (void *)current->thread_info; + current->thread_info->user_pgd = NULL; + current->thread.esp0 = (unsigned long)current->thread_info->real_stack + THREAD_SIZE; +} + + + +void __init entry_trampoline_setup(void) +{ + /* + * old IRQ entries set up by the boot code will still hang + * around - they are a sign of hw trouble anyway, now they'll + * produce a double fault message. + */ + trap_init_virtual_GDT(); +} --- diff/arch/i386/kernel/kgdb_stub.c 1970-01-01 01:00:00.000000000 +0100 +++ source/arch/i386/kernel/kgdb_stub.c 2003-11-26 10:09:04.000000000 +0000 @@ -0,0 +1,2492 @@ +/* + * + * This program is free software; you can redistribute it and/or modify it + * under the terms of the GNU General Public License as published by the + * Free Software Foundation; either version 2, or (at your option) any + * later version. + * + * This program is distributed in the hope that it will be useful, but + * WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. 
See the GNU + * General Public License for more details. + * + */ + +/* + * Copyright (c) 2000 VERITAS Software Corporation. + * + */ +/**************************************************************************** + * Header: remcom.c,v 1.34 91/03/09 12:29:49 glenne Exp $ + * + * Module name: remcom.c $ + * Revision: 1.34 $ + * Date: 91/03/09 12:29:49 $ + * Contributor: Lake Stevens Instrument Division$ + * + * Description: low level support for gdb debugger. $ + * + * Considerations: only works on target hardware $ + * + * Written by: Glenn Engel $ + * Updated by: David Grothe <dave@gcom.com> + * Updated by: Robert Walsh <rjwalsh@durables.org> + * Updated by: wangdi <wangdi@clusterfs.com> + * ModuleState: Experimental $ + * + * NOTES: See Below $ + * + * Modified for 386 by Jim Kingdon, Cygnus Support. + * Compatibility with 2.1.xx kernel by David Grothe <dave@gcom.com> + * + * Changes to allow auto initilization. All that is needed is that it + * be linked with the kernel and a break point (int 3) be executed. + * The header file <asm/kgdb.h> defines BREAKPOINT to allow one to do + * this. It should also be possible, once the interrupt system is up, to + * call putDebugChar("+"). Once this is done, the remote debugger should + * get our attention by sending a ^C in a packet. George Anzinger + * <george@mvista.com> + * Integrated into 2.2.5 kernel by Tigran Aivazian <tigran@sco.com> + * Added thread support, support for multiple processors, + * support for ia-32(x86) hardware debugging. + * Amit S. Kale ( akale@veritas.com ) + * + * Modified to support debugging over ethernet by Robert Walsh + * <rjwalsh@durables.org> and wangdi <wangdi@clusterfs.com>, based on + * code by San Mehat. + * + * + * To enable debugger support, two things need to happen. One, a + * call to set_debug_traps() is necessary in order to allow any breakpoints + * or error conditions to be properly intercepted and reported to gdb. + * Two, a breakpoint needs to be generated to begin communication. This + * is most easily accomplished by a call to breakpoint(). Breakpoint() + * simulates a breakpoint by executing an int 3. + * + ************* + * + * The following gdb commands are supported: + * + * command function Return value + * + * g return the value of the CPU registers hex data or ENN + * G set the value of the CPU registers OK or ENN + * + * mAA..AA,LLLL Read LLLL bytes at address AA..AA hex data or ENN + * MAA..AA,LLLL: Write LLLL bytes at address AA.AA OK or ENN + * + * c Resume at current address SNN ( signal NN) + * cAA..AA Continue at address AA..AA SNN + * + * s Step one instruction SNN + * sAA..AA Step one instruction from AA..AA SNN + * + * k kill + * + * ? What was the last sigval ? SNN (signal NN) + * + * All commands and responses are sent with a packet which includes a + * checksum. A packet consists of + * + * $<packet info>#<checksum>. + * + * where + * <packet info> :: <characters representing the command or response> + * <checksum> :: < two hex digits computed as modulo 256 sum of <packetinfo>> + * + * When a packet is received, it is first acknowledged with either '+' or '-'. + * '+' indicates a successful transfer. '-' indicates a failed transfer. 
+ * + * Example: + * + * Host: Reply: + * $m0,10#2a +$00010203040506070809101112131415#42 + * + ****************************************************************************/ +#define KGDB_VERSION "<20030915.1651.33>" +#include <linux/config.h> +#include <linux/types.h> +#include <asm/string.h> /* for strcpy */ +#include <linux/kernel.h> +#include <linux/sched.h> +#include <asm/vm86.h> +#include <asm/system.h> +#include <asm/ptrace.h> /* for linux pt_regs struct */ +#include <asm/kgdb_local.h> +#include <linux/list.h> +#include <asm/atomic.h> +#include <asm/processor.h> +#include <linux/irq.h> +#include <asm/desc.h> +#include <linux/inet.h> +#include <linux/kallsyms.h> + +/************************************************************************ + * + * external low-level support routines + */ +typedef void (*Function) (void); /* pointer to a function */ + +/* Thread reference */ +typedef unsigned char threadref[8]; + +extern int tty_putDebugChar(int); /* write a single character */ +extern int tty_getDebugChar(void); /* read and return a single char */ +extern void tty_flushDebugChar(void); /* flush pending characters */ +extern int eth_putDebugChar(int); /* write a single character */ +extern int eth_getDebugChar(void); /* read and return a single char */ +extern void eth_flushDebugChar(void); /* flush pending characters */ +extern void kgdb_eth_set_trapmode(int); +extern void kgdb_eth_reply_arp(void); /*send arp request */ +extern volatile int kgdb_eth_is_initializing; + + +/************************************************************************/ +/* BUFMAX defines the maximum number of characters in inbound/outbound buffers*/ +/* at least NUMREGBYTES*2 are needed for register packets */ +/* Longer buffer is needed to list all threads */ +#define BUFMAX 400 + +char *kgdb_version = KGDB_VERSION; + +/* debug > 0 prints ill-formed commands in valid packets & checksum errors */ +int debug_regs = 0; /* set to non-zero to print registers */ + +/* filled in by an external module */ +char *gdb_module_offsets; + +static const char hexchars[] = "0123456789abcdef"; + +/* Number of bytes of registers. */ +#define NUMREGBYTES 64 +/* + * Note that this register image is in a different order than + * the register image that Linux produces at interrupt time. + * + * Linux's register image is defined by struct pt_regs in ptrace.h. + * Just why GDB uses a different order is a historical mystery. + */ +enum regnames { _EAX, /* 0 */ + _ECX, /* 1 */ + _EDX, /* 2 */ + _EBX, /* 3 */ + _ESP, /* 4 */ + _EBP, /* 5 */ + _ESI, /* 6 */ + _EDI, /* 7 */ + _PC /* 8 also known as eip */ , + _PS /* 9 also known as eflags */ , + _CS, /* 10 */ + _SS, /* 11 */ + _DS, /* 12 */ + _ES, /* 13 */ + _FS, /* 14 */ + _GS /* 15 */ +}; + +/*************************** ASSEMBLY CODE MACROS *************************/ +/* + * Put the error code here just in case the user cares. + * Likewise, the vector number here (since GDB only gets the signal + * number through the usual means, and that's not very specific). + * The called_from is the return address so he can tell how we entered kgdb. + * This will allow him to seperate out the various possible entries. 
+ */ +#define REMOTE_DEBUG 0 /* set != to turn on printing (also available in info) */ + +#define PID_MAX PID_MAX_DEFAULT + +#ifdef CONFIG_SMP +void smp_send_nmi_allbutself(void); +#define IF_SMP(x) x +#undef MAX_NO_CPUS +#ifndef CONFIG_NO_KGDB_CPUS +#define CONFIG_NO_KGDB_CPUS 2 +#endif +#if CONFIG_NO_KGDB_CPUS > NR_CPUS +#define MAX_NO_CPUS NR_CPUS +#else +#define MAX_NO_CPUS CONFIG_NO_KGDB_CPUS +#endif +#define hold_init hold_on_sstep: 1, +#define MAX_CPU_MASK (unsigned long)((1LL << MAX_NO_CPUS) - 1LL) +#define NUM_CPUS num_online_cpus() +#else +#define IF_SMP(x) +#define hold_init +#undef MAX_NO_CPUS +#define MAX_NO_CPUS 1 +#define NUM_CPUS 1 +#endif +#define NOCPU (struct task_struct *)0xbad1fbad +/* *INDENT-OFF* */ +struct kgdb_info { + int used_malloc; + void *called_from; + long long entry_tsc; + int errcode; + int vector; + int print_debug_info; +#ifdef CONFIG_SMP + int hold_on_sstep; + struct { + volatile struct task_struct *task; + int pid; + int hold; + struct pt_regs *regs; + } cpus_waiting[MAX_NO_CPUS]; +#endif +} kgdb_info = {hold_init print_debug_info:REMOTE_DEBUG, vector:-1}; + +/* *INDENT-ON* */ + +#define used_m kgdb_info.used_malloc +/* + * This is little area we set aside to contain the stack we + * need to build to allow gdb to call functions. We use one + * per cpu to avoid locking issues. We will do all this work + * with interrupts off so that should take care of the protection + * issues. + */ +#define LOOKASIDE_SIZE 200 /* should be more than enough */ +#define MALLOC_MAX 200 /* Max malloc size */ +struct { + unsigned int esp; + int array[LOOKASIDE_SIZE]; +} fn_call_lookaside[MAX_NO_CPUS]; + +static int trap_cpu; +static unsigned int OLD_esp; + +#define END_OF_LOOKASIDE &fn_call_lookaside[trap_cpu].array[LOOKASIDE_SIZE] +#define IF_BIT 0x200 +#define TF_BIT 0x100 + +#define MALLOC_ROUND 8-1 + +static char malloc_array[MALLOC_MAX]; +IF_SMP(static void to_gdb(const char *mess)); +void * +malloc(int size) +{ + + if (size <= (MALLOC_MAX - used_m)) { + int old_used = used_m; + used_m += ((size + MALLOC_ROUND) & (~MALLOC_ROUND)); + return &malloc_array[old_used]; + } else { + return NULL; + } +} + +/* + * I/O dispatch functions... + * Based upon kgdb_eth, either call the ethernet + * handler or the serial one.. + */ +void +putDebugChar(int c) +{ + if (kgdb_eth == -1) { + tty_putDebugChar(c); + } else { + eth_putDebugChar(c); + } +} + +int +getDebugChar(void) +{ + if (kgdb_eth == -1) { + return tty_getDebugChar(); + } else { + return eth_getDebugChar(); + } +} + +void +flushDebugChar(void) +{ + if (kgdb_eth == -1) { + tty_flushDebugChar(); + } else { + eth_flushDebugChar(); + } +} + +/* + * Gdb calls functions by pushing agruments, including a return address + * on the stack and the adjusting EIP to point to the function. The + * whole assumption in GDB is that we are on a different stack than the + * one the "user" i.e. code that hit the break point, is on. This, of + * course is not true in the kernel. Thus various dodges are needed to + * do the call without directly messing with EIP (which we can not change + * as it is just a location and not a register. To adjust it would then + * require that we move every thing below EIP up or down as needed. This + * will not work as we may well have stack relative pointer on the stack + * (such as the pointer to regs, for example). 
+ + * So here is what we do: + * We detect gdb attempting to store into the stack area and instead, store + * into the fn_call_lookaside.array at the same relative location as if it + * were the area ESP pointed at. We also trap ESP modifications + * and uses these to adjust fn_call_lookaside.esp. On entry + * fn_call_lookaside.esp will be set to point at the last entry in + * fn_call_lookaside.array. This allows us to check if it has changed, and + * if so, on exit, we add the registers we will use to do the move and a + * trap/ interrupt return exit sequence. We then adjust the eflags in the + * regs array (remember we now have a copy in the fn_call_lookaside.array) to + * kill the interrupt bit, AND we change EIP to point at our set up stub. + * As part of the register set up we preset the registers to point at the + * begining and end of the fn_call_lookaside.array, so all the stub needs to + * do is move words from the array to the stack until ESP= the desired value + * then do the rti. This will then transfer to the desired function with + * all the correct registers. Nifty huh? + */ +extern asmlinkage void fn_call_stub(void); +extern asmlinkage void fn_rtn_stub(void); +/* *INDENT-OFF* */ +__asm__("fn_rtn_stub:\n\t" + "movl %eax,%esp\n\t" + "fn_call_stub:\n\t" + "1:\n\t" + "addl $-4,%ebx\n\t" + "movl (%ebx), %eax\n\t" + "pushl %eax\n\t" + "cmpl %esp,%ecx\n\t" + "jne 1b\n\t" + "popl %eax\n\t" + "popl %ebx\n\t" + "popl %ecx\n\t" + "iret \n\t"); +/* *INDENT-ON* */ +#define gdb_i386vector kgdb_info.vector +#define gdb_i386errcode kgdb_info.errcode +#define waiting_cpus kgdb_info.cpus_waiting +#define remote_debug kgdb_info.print_debug_info +#define hold_cpu(cpu) kgdb_info.cpus_waiting[cpu].hold +/* gdb locks */ + +#ifdef CONFIG_SMP +static int in_kgdb_called; +static spinlock_t waitlocks[MAX_NO_CPUS] = + {[0 ... MAX_NO_CPUS - 1] = SPIN_LOCK_UNLOCKED }; +/* + * The following array has the thread pointer of each of the "other" + * cpus. We make it global so it can be seen by gdb. + */ +volatile int in_kgdb_entry_log[MAX_NO_CPUS]; +volatile struct pt_regs *in_kgdb_here_log[MAX_NO_CPUS]; +/* +static spinlock_t continuelocks[MAX_NO_CPUS]; +*/ +spinlock_t kgdb_spinlock = SPIN_LOCK_UNLOCKED; +/* waiters on our spinlock plus us */ +static atomic_t spinlock_waiters = ATOMIC_INIT(1); +static int spinlock_count = 0; +static int spinlock_cpu = 0; +/* + * Note we use nested spin locks to account for the case where a break + * point is encountered when calling a function by user direction from + * kgdb. Also there is the memory exception recursion to account for. + * Well, yes, but this lets other cpus thru too. Lets add a + * cpu id to the lock. + */ +#define KGDB_SPIN_LOCK(x) if( spinlock_count == 0 || \ + spinlock_cpu != smp_processor_id()){\ + atomic_inc(&spinlock_waiters); \ + while (! 
spin_trylock(x)) {\ + in_kgdb(®s);\ + }\ + atomic_dec(&spinlock_waiters); \ + spinlock_count = 1; \ + spinlock_cpu = smp_processor_id(); \ + }else{ \ + spinlock_count++; \ + } +#define KGDB_SPIN_UNLOCK(x) if( --spinlock_count == 0) spin_unlock(x) +#else +unsigned kgdb_spinlock = 0; +#define KGDB_SPIN_LOCK(x) --*x +#define KGDB_SPIN_UNLOCK(x) ++*x +#endif + +int +hex(char ch) +{ + if ((ch >= 'a') && (ch <= 'f')) + return (ch - 'a' + 10); + if ((ch >= '0') && (ch <= '9')) + return (ch - '0'); + if ((ch >= 'A') && (ch <= 'F')) + return (ch - 'A' + 10); + return (-1); +} + +/* scan for the sequence $<data>#<checksum> */ +void +getpacket(char *buffer) +{ + unsigned char checksum; + unsigned char xmitcsum; + int i; + int count; + char ch; + + do { + /* wait around for the start character, ignore all other characters */ + while ((ch = (getDebugChar() & 0x7f)) != '$') ; + checksum = 0; + xmitcsum = -1; + + count = 0; + + /* now, read until a # or end of buffer is found */ + while (count < BUFMAX) { + ch = getDebugChar() & 0x7f; + if (ch == '#') + break; + checksum = checksum + ch; + buffer[count] = ch; + count = count + 1; + } + buffer[count] = 0; + + if (ch == '#') { + xmitcsum = hex(getDebugChar() & 0x7f) << 4; + xmitcsum += hex(getDebugChar() & 0x7f); + if ((remote_debug) && (checksum != xmitcsum)) { + printk + ("bad checksum. My count = 0x%x, sent=0x%x. buf=%s\n", + checksum, xmitcsum, buffer); + } + + if (checksum != xmitcsum) + putDebugChar('-'); /* failed checksum */ + else { + putDebugChar('+'); /* successful transfer */ + /* if a sequence char is present, reply the sequence ID */ + if (buffer[2] == ':') { + putDebugChar(buffer[0]); + putDebugChar(buffer[1]); + /* remove sequence chars from buffer */ + count = strlen(buffer); + for (i = 3; i <= count; i++) + buffer[i - 3] = buffer[i]; + } + } + } + } while (checksum != xmitcsum); + + if (remote_debug) + printk("R:%s\n", buffer); + flushDebugChar(); +} + +/* send the packet in buffer. */ + +void +putpacket(char *buffer) +{ + unsigned char checksum; + int count; + char ch; + + /* $<packet info>#<checksum>. */ + + if (kgdb_eth == -1) { + do { + if (remote_debug) + printk("T:%s\n", buffer); + putDebugChar('$'); + checksum = 0; + count = 0; + + while ((ch = buffer[count])) { + putDebugChar(ch); + checksum += ch; + count += 1; + } + + putDebugChar('#'); + putDebugChar(hexchars[checksum >> 4]); + putDebugChar(hexchars[checksum % 16]); + flushDebugChar(); + + } while ((getDebugChar() & 0x7f) != '+'); + } else { + /* + * For udp, we can not transfer too much bytes once. 
+ * We only transfer MAX_SEND_COUNT size bytes each time + */ + +#define MAX_SEND_COUNT 30 + + int send_count = 0, i = 0; + char send_buf[MAX_SEND_COUNT]; + + do { + if (remote_debug) + printk("T:%s\n", buffer); + putDebugChar('$'); + checksum = 0; + count = 0; + send_count = 0; + while ((ch = buffer[count])) { + if (send_count >= MAX_SEND_COUNT) { + for(i = 0; i < MAX_SEND_COUNT; i++) { + putDebugChar(send_buf[i]); + } + flushDebugChar(); + send_count = 0; + } else { + send_buf[send_count] = ch; + checksum += ch; + count ++; + send_count++; + } + } + for(i = 0; i < send_count; i++) + putDebugChar(send_buf[i]); + putDebugChar('#'); + putDebugChar(hexchars[checksum >> 4]); + putDebugChar(hexchars[checksum % 16]); + flushDebugChar(); + } while ((getDebugChar() & 0x7f) != '+'); + } +} + +static char remcomInBuffer[BUFMAX]; +static char remcomOutBuffer[BUFMAX]; +static short error; + +void +debug_error(char *format, char *parm) +{ + if (remote_debug) + printk(format, parm); +} + +static void +print_regs(struct pt_regs *regs) +{ + printk("EAX=%08lx ", regs->eax); + printk("EBX=%08lx ", regs->ebx); + printk("ECX=%08lx ", regs->ecx); + printk("EDX=%08lx ", regs->edx); + printk("\n"); + printk("ESI=%08lx ", regs->esi); + printk("EDI=%08lx ", regs->edi); + printk("EBP=%08lx ", regs->ebp); + printk("ESP=%08lx ", (long) ®s->esp); + printk("\n"); + printk(" DS=%08x ", regs->xds); + printk(" ES=%08x ", regs->xes); + printk(" SS=%08x ", __KERNEL_DS); + printk(" FL=%08lx ", regs->eflags); + printk("\n"); + printk(" CS=%08x ", regs->xcs); + printk(" IP=%08lx ", regs->eip); +#if 0 + printk(" FS=%08x ", regs->fs); + printk(" GS=%08x ", regs->gs); +#endif + printk("\n"); + +} /* print_regs */ + +#define NEW_esp fn_call_lookaside[trap_cpu].esp + +static void +regs_to_gdb_regs(int *gdb_regs, struct pt_regs *regs) +{ + gdb_regs[_EAX] = regs->eax; + gdb_regs[_EBX] = regs->ebx; + gdb_regs[_ECX] = regs->ecx; + gdb_regs[_EDX] = regs->edx; + gdb_regs[_ESI] = regs->esi; + gdb_regs[_EDI] = regs->edi; + gdb_regs[_EBP] = regs->ebp; + gdb_regs[_DS] = regs->xds; + gdb_regs[_ES] = regs->xes; + gdb_regs[_PS] = regs->eflags; + gdb_regs[_CS] = regs->xcs; + gdb_regs[_PC] = regs->eip; + /* Note, as we are a debugging the kernel, we will always + * trap in kernel code, this means no priviledge change, + * and so the pt_regs structure is not completely valid. In a non + * privilege change trap, only EFLAGS, CS and EIP are put on the stack, + * SS and ESP are not stacked, this means that the last 2 elements of + * pt_regs is not valid (they would normally refer to the user stack) + * also, using regs+1 is no good because you end up will a value that is + * 2 longs (8) too high. This used to cause stepping over functions + * to fail, so my fix is to use the address of regs->esp, which + * should point at the end of the stack frame. Note I have ignored + * completely exceptions that cause an error code to be stacked, such + * as double fault. Stuart Hughes, Zentropix. + * original code: gdb_regs[_ESP] = (int) (regs + 1) ; + + * this is now done on entry and moved to OLD_esp (as well as NEW_esp). 
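+	 * NEW_esp is what gdb is shown (and may modify through _ESP);
+	 * OLD_esp keeps the value captured at trap entry so that
+	 * set_char() can redirect gdb's writes aimed at the stack area
+	 * into the lookaside buffer.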
+ */ + gdb_regs[_ESP] = NEW_esp; + gdb_regs[_SS] = __KERNEL_DS; + gdb_regs[_FS] = 0xFFFF; + gdb_regs[_GS] = 0xFFFF; +} /* regs_to_gdb_regs */ + +static void +gdb_regs_to_regs(int *gdb_regs, struct pt_regs *regs) +{ + regs->eax = gdb_regs[_EAX]; + regs->ebx = gdb_regs[_EBX]; + regs->ecx = gdb_regs[_ECX]; + regs->edx = gdb_regs[_EDX]; + regs->esi = gdb_regs[_ESI]; + regs->edi = gdb_regs[_EDI]; + regs->ebp = gdb_regs[_EBP]; + regs->xds = gdb_regs[_DS]; + regs->xes = gdb_regs[_ES]; + regs->eflags = gdb_regs[_PS]; + regs->xcs = gdb_regs[_CS]; + regs->eip = gdb_regs[_PC]; + NEW_esp = gdb_regs[_ESP]; /* keep the value */ +#if 0 /* can't change these */ + regs->esp = gdb_regs[_ESP]; + regs->xss = gdb_regs[_SS]; + regs->fs = gdb_regs[_FS]; + regs->gs = gdb_regs[_GS]; +#endif + +} /* gdb_regs_to_regs */ +extern void scheduling_functions_start_here(void); +extern void scheduling_functions_end_here(void); +#define first_sched ((unsigned long) scheduling_functions_start_here) +#define last_sched ((unsigned long) scheduling_functions_end_here) + +int thread_list = 0; + +void +get_gdb_regs(struct task_struct *p, struct pt_regs *regs, int *gdb_regs) +{ + unsigned long stack_page; + int count = 0; + IF_SMP(int i); + if (!p || p == current) { + regs_to_gdb_regs(gdb_regs, regs); + return; + } +#ifdef CONFIG_SMP + for (i = 0; i < MAX_NO_CPUS; i++) { + if (p == kgdb_info.cpus_waiting[i].task) { + regs_to_gdb_regs(gdb_regs, + kgdb_info.cpus_waiting[i].regs); + gdb_regs[_ESP] = + (int) &kgdb_info.cpus_waiting[i].regs->esp; + + return; + } + } +#endif + memset(gdb_regs, 0, NUMREGBYTES); + gdb_regs[_ESP] = p->thread.esp; + gdb_regs[_PC] = p->thread.eip; + gdb_regs[_EBP] = *(int *) gdb_regs[_ESP]; + gdb_regs[_EDI] = *(int *) (gdb_regs[_ESP] + 4); + gdb_regs[_ESI] = *(int *) (gdb_regs[_ESP] + 8); + +/* + * This code is to give a more informative notion of where a process + * is waiting. It is used only when the user asks for a thread info + * list. If he then switches to the thread, s/he will find the task + * is in schedule, but a back trace should show the same info we come + * up with. This code was shamelessly purloined from process.c. It was + * then enhanced to provide more registers than simply the program + * counter. + */ + + if (!thread_list) { + return; + } + + if (p->state == TASK_RUNNING) + return; + stack_page = (unsigned long) p->thread_info; + if (gdb_regs[_ESP] < stack_page || gdb_regs[_ESP] > 8188 + stack_page) + return; + /* include/asm-i386/system.h:switch_to() pushes ebp last. */ + do { + if (gdb_regs[_EBP] < stack_page || + gdb_regs[_EBP] > 8184 + stack_page) + return; + gdb_regs[_PC] = *(unsigned long *) (gdb_regs[_EBP] + 4); + gdb_regs[_ESP] = gdb_regs[_EBP] + 8; + gdb_regs[_EBP] = *(unsigned long *) gdb_regs[_EBP]; + if (gdb_regs[_PC] < first_sched || gdb_regs[_PC] >= last_sched) + return; + } while (count++ < 16); + return; +} + +/* Indicate to caller of mem2hex or hex2mem that there has been an + error. 
*/ +static volatile int mem_err = 0; +static volatile int mem_err_expected = 0; +static volatile int mem_err_cnt = 0; +static int garbage_loc = -1; + +int +get_char(char *addr) +{ + return *addr; +} + +void +set_char(char *addr, int val, int may_fault) +{ + /* + * This code traps references to the area mapped to the kernel + * stack as given by the regs and, instead, stores to the + * fn_call_lookaside[cpu].array + */ + if (may_fault && + (unsigned int) addr < OLD_esp && + ((unsigned int) addr > (OLD_esp - (unsigned int) LOOKASIDE_SIZE))) { + addr = (char *) END_OF_LOOKASIDE - ((char *) OLD_esp - addr); + } + *addr = val; +} + +/* convert the memory pointed to by mem into hex, placing result in buf */ +/* return a pointer to the last char put in buf (null) */ +/* If MAY_FAULT is non-zero, then we should set mem_err in response to + a fault; if zero treat a fault like any other fault in the stub. */ +char * +mem2hex(char *mem, char *buf, int count, int may_fault) +{ + int i; + unsigned char ch; + + if (may_fault) { + mem_err_expected = 1; + mem_err = 0; + } + for (i = 0; i < count; i++) { + /* printk("%lx = ", mem) ; */ + + ch = get_char(mem++); + + /* printk("%02x\n", ch & 0xFF) ; */ + if (may_fault && mem_err) { + if (remote_debug) + printk("Mem fault fetching from addr %lx\n", + (long) (mem - 1)); + *buf = 0; /* truncate buffer */ + return (buf); + } + *buf++ = hexchars[ch >> 4]; + *buf++ = hexchars[ch % 16]; + } + *buf = 0; + if (may_fault) + mem_err_expected = 0; + return (buf); +} + +/* convert the hex array pointed to by buf into binary to be placed in mem */ +/* return a pointer to the character AFTER the last byte written */ +/* NOTE: We use the may fault flag to also indicate if the write is to + * the registers (0) or "other" memory (!=0) + */ +char * +hex2mem(char *buf, char *mem, int count, int may_fault) +{ + int i; + unsigned char ch; + + if (may_fault) { + mem_err_expected = 1; + mem_err = 0; + } + for (i = 0; i < count; i++) { + ch = hex(*buf++) << 4; + ch = ch + hex(*buf++); + set_char(mem++, ch, may_fault); + + if (may_fault && mem_err) { + if (remote_debug) + printk("Mem fault storing to addr %lx\n", + (long) (mem - 1)); + return (mem); + } + } + if (may_fault) + mem_err_expected = 0; + return (mem); +} + +/**********************************************/ +/* WHILE WE FIND NICE HEX CHARS, BUILD AN INT */ +/* RETURN NUMBER OF CHARS PROCESSED */ +/**********************************************/ +int +hexToInt(char **ptr, int *intValue) +{ + int numChars = 0; + int hexValue; + + *intValue = 0; + + while (**ptr) { + hexValue = hex(**ptr); + if (hexValue >= 0) { + *intValue = (*intValue << 4) | hexValue; + numChars++; + } else + break; + + (*ptr)++; + } + + return (numChars); +} + +#define stubhex(h) hex(h) +#ifdef old_thread_list + +static int +stub_unpack_int(char *buff, int fieldlength) +{ + int nibble; + int retval = 0; + + while (fieldlength) { + nibble = stubhex(*buff++); + retval |= nibble; + fieldlength--; + if (fieldlength) + retval = retval << 4; + } + return retval; +} +#endif +static char * +pack_hex_byte(char *pkt, int byte) +{ + *pkt++ = hexchars[(byte >> 4) & 0xf]; + *pkt++ = hexchars[(byte & 0xf)]; + return pkt; +} + +#define BUF_THREAD_ID_SIZE 16 + +static char * +pack_threadid(char *pkt, threadref * id) +{ + char *limit; + unsigned char *altid; + + altid = (unsigned char *) id; + limit = pkt + BUF_THREAD_ID_SIZE; + while (pkt < limit) + pkt = pack_hex_byte(pkt, *altid++); + return pkt; +} + +#ifdef old_thread_list +static char * +unpack_byte(char *buf, int 
*value) +{ + *value = stub_unpack_int(buf, 2); + return buf + 2; +} + +static char * +unpack_threadid(char *inbuf, threadref * id) +{ + char *altref; + char *limit = inbuf + BUF_THREAD_ID_SIZE; + int x, y; + + altref = (char *) id; + + while (inbuf < limit) { + x = stubhex(*inbuf++); + y = stubhex(*inbuf++); + *altref++ = (x << 4) | y; + } + return inbuf; +} +#endif +void +int_to_threadref(threadref * id, int value) +{ + unsigned char *scan; + + scan = (unsigned char *) id; + { + int i = 4; + while (i--) + *scan++ = 0; + } + *scan++ = (value >> 24) & 0xff; + *scan++ = (value >> 16) & 0xff; + *scan++ = (value >> 8) & 0xff; + *scan++ = (value & 0xff); +} +int +int_to_hex_v(unsigned char * id, int value) +{ + unsigned char *start = id; + int shift; + int ch; + + for (shift = 28; shift >= 0; shift -= 4) { + if ((ch = (value >> shift) & 0xf) || (id != start)) { + *id = hexchars[ch]; + id++; + } + } + if (id == start) + *id++ = '0'; + return id - start; +} +#ifdef old_thread_list + +static int +threadref_to_int(threadref * ref) +{ + int i, value = 0; + unsigned char *scan; + + scan = (char *) ref; + scan += 4; + i = 4; + while (i-- > 0) + value = (value << 8) | ((*scan++) & 0xff); + return value; +} +#endif +static int +cmp_str(char *s1, char *s2, int count) +{ + while (count--) { + if (*s1++ != *s2++) + return 0; + } + return 1; +} + +#if 1 /* this is a hold over from 2.4 where O(1) was "sometimes" */ +extern struct task_struct *kgdb_get_idle(int cpu); +#define idle_task(cpu) kgdb_get_idle(cpu) +#else +#define idle_task(cpu) init_tasks[cpu] +#endif + +extern int kgdb_pid_init_done; + +struct task_struct * +getthread(int pid) +{ + struct task_struct *thread; + if (pid >= PID_MAX && pid <= (PID_MAX + MAX_NO_CPUS)) { + + return idle_task(pid - PID_MAX); + } else { + /* + * find_task_by_pid is relatively safe all the time + * Other pid functions require lock downs which imply + * that we may be interrupting them (as we get here + * in the middle of most any lock down). + * Still we don't want to call until the table exists! 
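+		 * (Thread ids at or above PID_MAX are synthetic here:
+		 * id PID_MAX + n names the idle task of cpu n, which
+		 * has no real pid of its own.)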
+	 */
+		if (kgdb_pid_init_done){
+			thread = find_task_by_pid(pid);
+			if (thread) {
+				return thread;
+			}
+		}
+	}
+	return NULL;
+}
+/* *INDENT-OFF* */
+struct hw_breakpoint {
+	unsigned enabled;
+	unsigned type;
+	unsigned len;
+	unsigned addr;
+} breakinfo[4] = { {enabled:0},
+		   {enabled:0},
+		   {enabled:0},
+		   {enabled:0}};
+/* *INDENT-ON* */
+unsigned hw_breakpoint_status;
+void
+correct_hw_break(void)
+{
+	int breakno;
+	int correctit;
+	int breakbit;
+	unsigned dr7;
+
+	asm volatile ("movl %%db7, %0\n":"=r" (dr7)
+		      :);
+	/* *INDENT-OFF* */
+	do {
+		unsigned addr0, addr1, addr2, addr3;
+		asm volatile ("movl %%db0, %0\n"
+			      "movl %%db1, %1\n"
+			      "movl %%db2, %2\n"
+			      "movl %%db3, %3\n"
+			      :"=r" (addr0), "=r"(addr1),
+			      "=r"(addr2), "=r"(addr3)
+			      :);
+	} while (0);
+	/* *INDENT-ON* */
+	correctit = 0;
+	for (breakno = 0; breakno < 4; breakno++) {
+		breakbit = 2 << (breakno << 1);
+		if (!(dr7 & breakbit) && breakinfo[breakno].enabled) {
+			correctit = 1;
+			dr7 |= breakbit;
+			dr7 &= ~(0xf0000 << (breakno << 2));
+			dr7 |= (((breakinfo[breakno].len << 2) |
+				 breakinfo[breakno].type) << 16) <<
+			    (breakno << 2);
+			switch (breakno) {
+			case 0:
+				asm volatile ("movl %0, %%dr0\n"::"r"
+					      (breakinfo[breakno].addr));
+				break;
+
+			case 1:
+				asm volatile ("movl %0, %%dr1\n"::"r"
+					      (breakinfo[breakno].addr));
+				break;
+
+			case 2:
+				asm volatile ("movl %0, %%dr2\n"::"r"
+					      (breakinfo[breakno].addr));
+				break;
+
+			case 3:
+				asm volatile ("movl %0, %%dr3\n"::"r"
+					      (breakinfo[breakno].addr));
+				break;
+			}
+		} else if ((dr7 & breakbit) && !breakinfo[breakno].enabled) {
+			correctit = 1;
+			dr7 &= ~breakbit;
+			dr7 &= ~(0xf0000 << (breakno << 2));
+		}
+	}
+	if (correctit) {
+		asm volatile ("movl %0, %%db7\n"::"r" (dr7));
+	}
+}
+
+int
+remove_hw_break(unsigned breakno)
+{
+	if (!breakinfo[breakno].enabled) {
+		return -1;
+	}
+	breakinfo[breakno].enabled = 0;
+	return 0;
+}
+
+int
+set_hw_break(unsigned breakno, unsigned type, unsigned len, unsigned addr)
+{
+	if (breakinfo[breakno].enabled) {
+		return -1;
+	}
+	breakinfo[breakno].enabled = 1;
+	breakinfo[breakno].type = type;
+	breakinfo[breakno].len = len;
+	breakinfo[breakno].addr = addr;
+	return 0;
+}
+
+#ifdef CONFIG_SMP
+static int in_kgdb_console = 0;
+
+int
+in_kgdb(struct pt_regs *regs)
+{
+	unsigned flags;
+	int cpu = smp_processor_id();
+	in_kgdb_called = 1;
+	if (!spin_is_locked(&kgdb_spinlock)) {
+		if (in_kgdb_here_log[cpu] ||	/* we are holding this cpu */
+		    in_kgdb_console) {	/* or we are doing slow i/o */
+			return 1;
+		}
+		return 0;
+	}
+
+	/* As I see it the only reason not to let all cpus spin on
+	 * the same spin_lock is to allow selected ones to proceed.
+	 * This would be a good thing, so we leave it this way.
+	 * Maybe someday....  Done !
+
+	 * in_kgdb() is called from an NMI so we don't pretend
+	 * to have any resources, like printk() for example.
+	 */
+
+	kgdb_local_irq_save(flags);	/* only local here, to avoid hanging */
+	/*
+	 * log arrival of this cpu
+	 * The NMI keeps on ticking.  Protect against recurring more
+	 * than once, and ignore the cpu that has the kgdb lock
+	 */
+	in_kgdb_entry_log[cpu]++;
+	in_kgdb_here_log[cpu] = regs;
+	if (cpu == spinlock_cpu || waiting_cpus[cpu].task) {
+		goto exit_in_kgdb;
+	}
+	/*
+	 * For protection of the initialization of the spin locks by kgdb,
+	 * it locks the kgdb spinlock before it gets the wait locks set
+	 * up.  We wait here for the wait lock to be taken.  If the
+	 * kgdb lock goes away first?? 
Well, it could be a slow exit
+	 * sequence where the wait lock is removed prior to the kgdb lock,
+	 * so if kgdb gets unlocked, we just exit.
+	 */
+	while (spin_is_locked(&kgdb_spinlock) &&
+	       !spin_is_locked(waitlocks + cpu)) ;
+	if (!spin_is_locked(&kgdb_spinlock)) {
+		goto exit_in_kgdb;
+	}
+	waiting_cpus[cpu].task = current;
+	waiting_cpus[cpu].pid = (current->pid) ? : (PID_MAX + cpu);
+	waiting_cpus[cpu].regs = regs;
+
+	spin_unlock_wait(waitlocks + cpu);
+	/*
+	 * log departure of this cpu
+	 */
+	waiting_cpus[cpu].task = 0;
+	waiting_cpus[cpu].pid = 0;
+	waiting_cpus[cpu].regs = 0;
+	correct_hw_break();
+      exit_in_kgdb:
+	in_kgdb_here_log[cpu] = 0;
+	kgdb_local_irq_restore(flags);
+	return 1;
+	/*
+	   spin_unlock(continuelocks + smp_processor_id());
+	 */
+}
+
+void
+smp__in_kgdb(struct pt_regs regs)
+{
+	ack_APIC_irq();
+	in_kgdb(&regs);
+}
+#else
+int
+in_kgdb(struct pt_regs *regs)
+{
+	return (kgdb_spinlock);
+}
+#endif
+
+void
+printexceptioninfo(int exceptionNo, int errorcode, char *buffer)
+{
+	unsigned dr6;
+	int i;
+	switch (exceptionNo) {
+	case 1:		/* debug exception */
+		break;
+	case 3:		/* breakpoint */
+		sprintf(buffer, "Software breakpoint");
+		return;
+	default:
+		sprintf(buffer, "Details not available");
+		return;
+	}
+	asm volatile ("movl %%db6, %0\n":"=r" (dr6)
+		      :);
+	if (dr6 & 0x4000) {
+		sprintf(buffer, "Single step");
+		return;
+	}
+	for (i = 0; i < 4; ++i) {
+		if (dr6 & (1 << i)) {
+			sprintf(buffer, "Hardware breakpoint %d", i);
+			return;
+		}
+	}
+	sprintf(buffer, "Unknown trap");
+	return;
+}
+
+/*
+ * This function does all command processing for interfacing to gdb.
+ *
+ * NOTE:  The INT nn instruction leaves the state of the interrupt
+ *        enable flag UNCHANGED.  That means that when this routine
+ *        is entered via a breakpoint (INT 3) instruction from code
+ *        that has interrupts enabled, then interrupts will STILL BE
+ *        enabled when this routine is entered.  The first thing that
+ *        we do here is disable interrupts so as to prevent recursive
+ *        entries and bothersome serial interrupts while we are
+ *        trying to run the serial port in polled mode.
+ *
+ * For kernel version 2.1.xx the kgdb_cli() actually gets a spin lock so
+ * it is always necessary to do a restore_flags before returning
+ * so as to let go of that lock.
+ */
+int
+kgdb_handle_exception(int exceptionVector,
+		      int signo, int err_code, struct pt_regs *linux_regs)
+{
+	struct task_struct *usethread = NULL;
+	struct task_struct *thread_list_start = 0, *thread = NULL;
+	int addr, length;
+	unsigned long address;
+	int breakno, breaktype;
+	char *ptr;
+	int newPC;
+	threadref thref;
+	int threadid;
+	int thread_min = PID_MAX + MAX_NO_CPUS;
+#ifdef old_thread_list
+	int maxthreads;
+#endif
+	int nothreads;
+	unsigned long flags;
+	int gdb_regs[NUMREGBYTES / 4];
+	int dr6;
+	IF_SMP(int entry_state = 0);	/* 0, ok, 1, no nmi, 2 sync failed */
+#define NO_NMI 1
+#define NO_SYNC 2
+#define regs (*linux_regs)
+#define NUMREGS NUMREGBYTES/4
+	/*
+	 * If the entry is not from the kernel then return to the Linux
+	 * trap handler and let it process the interrupt normally.
+	 */
+	if ((linux_regs->eflags & VM_MASK) || (3 & linux_regs->xcs)) {
+		printk("ignoring non-kernel exception\n");
+		print_regs(&regs);
+		return (0);
+	}
+	/*
+	 * If we're using eth mode, set the 'mode' in the netdevice. 
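+	 * (kgdb_eth presumably stays -1 unless an ethernet debug device
+	 * was configured via the gdbeth= boot option at the bottom of
+	 * this file; with serial-only debugging the trapmode call below
+	 * is skipped.)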
+	 */
+
+	__asm__("movl %%cr2,%0":"=r" (address));
+
+	if (kgdb_eth != -1) {
+		kgdb_eth_set_trapmode(1);
+	}
+
+	kgdb_local_irq_save(flags);
+
+	/* Get kgdb spinlock */
+
+	KGDB_SPIN_LOCK(&kgdb_spinlock);
+	rdtscll(kgdb_info.entry_tsc);
+	/*
+	 * We depend on this spinlock and the NMI watchdog to control the
+	 * other cpus.  They will arrive at "in_kgdb()" as a result of the
+	 * NMI and will wait there for the following spin locks to be
+	 * released.
+	 */
+#ifdef CONFIG_SMP
+
+#if 0
+	if (cpu_callout_map & ~MAX_CPU_MASK) {
+		printk("kgdb : too many cpus, possibly not mapped"
+		       " in contiguous space, change MAX_NO_CPUS"
+		       " in kgdb_stub and make new kernel.\n"
+		       " cpu_callout_map is %lx\n", cpu_callout_map);
+		goto exit_just_unlock;
+	}
+#endif
+	if (spinlock_count == 1) {
+		int time, end_time, dum;
+		int i;
+		int cpu_logged_in[MAX_NO_CPUS] = {[0 ... MAX_NO_CPUS - 1] = (0)
+		};
+		if (remote_debug) {
+			printk("kgdb : cpu %d entry, syncing others\n",
+			       smp_processor_id());
+		}
+		for (i = 0; i < MAX_NO_CPUS; i++) {
+			/*
+			 * Use trylock as we may already hold the lock if
+			 * we are holding the cpu.  Net result is all
+			 * locked.
+			 */
+			spin_trylock(&waitlocks[i]);
+		}
+		for (i = 0; i < MAX_NO_CPUS; i++)
+			cpu_logged_in[i] = 0;
+		/*
+		 * Wait for their arrival.  We know the watchdog is active if
+		 * in_kgdb() has ever been called, as it is always called on a
+		 * watchdog tick.
+		 */
+		rdtsc(dum, time);
+		end_time = time + 2;	/* Note: we use the High order bits! */
+		i = 1;
+		if (num_online_cpus() > 1) {
+			int me_in_kgdb = in_kgdb_entry_log[smp_processor_id()];
+			smp_send_nmi_allbutself();
+			while (i < num_online_cpus() && time != end_time) {
+				int j;
+				for (j = 0; j < MAX_NO_CPUS; j++) {
+					if (waiting_cpus[j].task &&
+					    !cpu_logged_in[j]) {
+						i++;
+						cpu_logged_in[j] = 1;
+						if (remote_debug) {
+							printk
+							    ("kgdb : cpu %d arrived at kgdb\n",
+							     j);
+						}
+						break;
+					} else if (!waiting_cpus[j].task &&
+						   !cpu_online(j)) {
+						waiting_cpus[j].task = NOCPU;
+						cpu_logged_in[j] = 1;
+						waiting_cpus[j].hold = 1;
+						break;
+					}
+					if (!waiting_cpus[j].task &&
+					    in_kgdb_here_log[j]) {
+
+						int wait = 100000;
+						while (wait--) ;
+						if (!waiting_cpus[j].task &&
+						    in_kgdb_here_log[j]) {
+							printk
+							    ("kgdb : cpu %d stall"
+							     " in in_kgdb\n",
+							     j);
+							i++;
+							cpu_logged_in[j] = 1;
+							waiting_cpus[j].task =
+							    (struct task_struct
+							     *) 1;
+						}
+					}
+				}
+
+				if (in_kgdb_entry_log[smp_processor_id()] >
+				    (me_in_kgdb + 10)) {
+					break;
+				}
+
+				rdtsc(dum, time);
+			}
+			if (i < num_online_cpus()) {
+				printk
+				    ("kgdb : time out, proceeding without sync\n");
+#if 0
+				printk("kgdb : Waiting_cpus: 0 = %d, 1 = %d\n",
+				       waiting_cpus[0].task != 0,
+				       waiting_cpus[1].task != 0);
+				printk("kgdb : Cpu_logged in: 0 = %d, 1 = %d\n",
+				       cpu_logged_in[0], cpu_logged_in[1]);
+				printk
+				    ("kgdb : in_kgdb_here_log in: 0 = %d, 1 = %d\n",
+				     in_kgdb_here_log[0] != 0,
+				     in_kgdb_here_log[1] != 0);
+#endif
+				entry_state = NO_SYNC;
+			} else {
+#if 0
+				int ent =
+				    in_kgdb_entry_log[smp_processor_id()] -
+				    me_in_kgdb;
+				printk("kgdb : sync after %d entries\n", ent);
+#endif
+			}
+		} else {
+			if (remote_debug) {
+				printk
+				    ("kgdb : %d cpus, but watchdog not active\n"
+				     "proceeding without locking down other cpus\n",
+				     num_online_cpus());
+			}
+			entry_state = NO_NMI;
+		}
+	}
+#endif
+
+	if (remote_debug) {
+		printk("handle_exception(exceptionVector=%d, "
+		       "signo=%d, err_code=%d, linux_regs=%p)\n",
+		       exceptionVector, signo, err_code, linux_regs);
+		printk("      address: %lx\n", address);
+
+		if (debug_regs) {
+			print_regs(&regs);
+			show_trace(current, (unsigned long *)&regs);
+		}
+	}
+
+	
/* Disable hardware debugging while we are in kgdb */
+	/* Get the debug register status register */
+/* *INDENT-OFF* */
+	__asm__("movl %0,%%db7"
+		:	/* no output */
+		:"r"(0));
+
+	asm volatile ("movl %%db6, %0\n"
+		      :"=r" (hw_breakpoint_status)
+		      :);
+
+/* *INDENT-ON* */
+	switch (exceptionVector) {
+	case 0:			/* divide error */
+	case 1:			/* debug exception */
+	case 2:			/* NMI */
+	case 3:			/* breakpoint */
+	case 4:			/* overflow */
+	case 5:			/* bounds check */
+	case 6:			/* invalid opcode */
+	case 7:			/* device not available */
+	case 8:			/* double fault (errcode) */
+	case 10:		/* invalid TSS (errcode) */
+	case 12:		/* stack fault (errcode) */
+	case 16:		/* floating point error */
+	case 17:		/* alignment check (errcode) */
+	default:		/* any undocumented */
+		break;
+	case 11:		/* segment not present (errcode) */
+	case 13:		/* general protection (errcode) */
+	case 14:		/* page fault (special errcode) */
+	case 19:		/* cache flush denied */
+		if (mem_err_expected) {
+			/*
+			 * This fault occurred because of the
+			 * get_char or set_char routines.  These
+			 * two routines use either eax or edx to
+			 * indirectly reference the location in
+			 * memory that they are working with.
+			 * For a page fault, when we return the
+			 * instruction will be retried, so we
+			 * have to make sure that these
+			 * registers point to valid memory.
+			 */
+			mem_err = 1;	/* set mem error flag */
+			mem_err_expected = 0;
+			mem_err_cnt++;	/* helps in debugging */
+			/* make valid address */
+			regs.eax = (long) &garbage_loc;
+			/* make valid address */
+			regs.edx = (long) &garbage_loc;
+			if (remote_debug)
+				printk("Return after memory error: "
+				       "mem_err_cnt=%d\n", mem_err_cnt);
+			if (debug_regs)
+				print_regs(&regs);
+			goto exit_kgdb;
+		}
+		break;
+	}
+	if (remote_debug)
+		printk("kgdb : entered kgdb on cpu %d\n", smp_processor_id());
+
+	gdb_i386vector = exceptionVector;
+	gdb_i386errcode = err_code;
+	kgdb_info.called_from = __builtin_return_address(0);
+#ifdef CONFIG_SMP
+	/*
+	 * OK, we can now communicate, so let's tell gdb about the sync,
+	 * but only if we had a problem.
+	 */
+	switch (entry_state) {
+	case NO_NMI:
+		to_gdb("NMI not active, other cpus not stopped\n");
+		break;
+	case NO_SYNC:
+		to_gdb("Some cpus not stopped, see 'kgdb_info' for details\n");
+	default:;
+	}
+
+#endif
+/*
+ * Set up the gdb function call area.
+ */
+	trap_cpu = smp_processor_id();
+	OLD_esp = NEW_esp = (int) (&linux_regs->esp);
+
+	IF_SMP(once_again:)
+	    /* reply to host that an exception has occurred */
+	    remcomOutBuffer[0] = 'S';
+	remcomOutBuffer[1] = hexchars[signo >> 4];
+	remcomOutBuffer[2] = hexchars[signo % 16];
+	remcomOutBuffer[3] = 0;
+
+	if (kgdb_eth_is_initializing) {
+		kgdb_eth_is_initializing = 0;
+	} else {
+		putpacket(remcomOutBuffer);
+	}
+
+	kgdb_eth_reply_arp();
+	while (1 == 1) {
+		error = 0;
+		remcomOutBuffer[0] = 0;
+		getpacket(remcomInBuffer);
+		switch (remcomInBuffer[0]) {
+		case '?':
+			remcomOutBuffer[0] = 'S';
+			remcomOutBuffer[1] = hexchars[signo >> 4];
+			remcomOutBuffer[2] = hexchars[signo % 16];
+			remcomOutBuffer[3] = 0;
+			break;
+		case 'd':
+			remote_debug = !(remote_debug);	/* toggle debug flag */
+			printk("Remote debug %s\n",
+			       remote_debug ? 
"on" : "off"); + break; + case 'g': /* return the value of the CPU registers */ + get_gdb_regs(usethread, ®s, gdb_regs); + mem2hex((char *) gdb_regs, + remcomOutBuffer, NUMREGBYTES, 0); + break; + case 'G': /* set the value of the CPU registers - return OK */ + hex2mem(&remcomInBuffer[1], + (char *) gdb_regs, NUMREGBYTES, 0); + if (!usethread || usethread == current) { + gdb_regs_to_regs(gdb_regs, ®s); + strcpy(remcomOutBuffer, "OK"); + } else { + strcpy(remcomOutBuffer, "E00"); + } + break; + + case 'P':{ /* set the value of a single CPU register - + return OK */ + /* + * For some reason, gdb wants to talk about psudo + * registers (greater than 15). These may have + * meaning for ptrace, but for us it is safe to + * ignor them. We do this by dumping them into + * _GS which we also ignor, but do have memory for. + */ + int regno; + + ptr = &remcomInBuffer[1]; + regs_to_gdb_regs(gdb_regs, ®s); + if ((!usethread || usethread == current) && + hexToInt(&ptr, ®no) && + *ptr++ == '=' && (regno >= 0)) { + regno = + (regno >= NUMREGS ? _GS : regno); + hex2mem(ptr, (char *) &gdb_regs[regno], + 4, 0); + gdb_regs_to_regs(gdb_regs, ®s); + strcpy(remcomOutBuffer, "OK"); + break; + } + strcpy(remcomOutBuffer, "E01"); + break; + } + + /* mAA..AA,LLLL Read LLLL bytes at address AA..AA */ + case 'm': + /* TRY TO READ %x,%x. IF SUCCEED, SET PTR = 0 */ + ptr = &remcomInBuffer[1]; + if (hexToInt(&ptr, &addr) && + (*(ptr++) == ',') && (hexToInt(&ptr, &length))) { + ptr = 0; + /* + * hex doubles the byte count + */ + if (length > (BUFMAX / 2)) + length = BUFMAX / 2; + mem2hex((char *) addr, + remcomOutBuffer, length, 1); + if (mem_err) { + strcpy(remcomOutBuffer, "E03"); + debug_error("memory fault\n", NULL); + } + } + + if (ptr) { + strcpy(remcomOutBuffer, "E01"); + debug_error + ("malformed read memory command: %s\n", + remcomInBuffer); + } + break; + + /* MAA..AA,LLLL: + Write LLLL bytes at address AA.AA return OK */ + case 'M': + /* TRY TO READ '%x,%x:'. IF SUCCEED, SET PTR = 0 */ + ptr = &remcomInBuffer[1]; + if (hexToInt(&ptr, &addr) && + (*(ptr++) == ',') && + (hexToInt(&ptr, &length)) && (*(ptr++) == ':')) { + hex2mem(ptr, (char *) addr, length, 1); + + if (mem_err) { + strcpy(remcomOutBuffer, "E03"); + debug_error("memory fault\n", NULL); + } else { + strcpy(remcomOutBuffer, "OK"); + } + + ptr = 0; + } + if (ptr) { + strcpy(remcomOutBuffer, "E02"); + debug_error + ("malformed write memory command: %s\n", + remcomInBuffer); + } + break; + case 'S': + remcomInBuffer[0] = 's'; + case 'C': + /* Csig;AA..AA where ;AA..AA is optional + * continue with signal + * Since signals are meaning less to us, delete that + * part and then fall into the 'c' code. 
+			 */
+			ptr = &remcomInBuffer[1];
+			length = 2;
+			while (*ptr && *ptr != ';') {
+				length++;
+				ptr++;
+			}
+			if (*ptr) {
+				/* slide the optional address left, over
+				 * the signal number and the ';'
+				 */
+				do {
+					ptr++;
+					*(ptr - length + 1) = *ptr;
+				} while (*ptr);
+			} else {
+				remcomInBuffer[1] = 0;
+			}
+
+			/* cAA..AA    Continue at address AA..AA(optional) */
+			/* sAA..AA    Step one instruction from AA..AA(optional) */
+			/* D          detach, reply OK and then continue */
+		case 'c':
+		case 's':
+		case 'D':
+
+			/* try to read optional parameter,
+			   pc unchanged if no parm */
+			ptr = &remcomInBuffer[1];
+			if (hexToInt(&ptr, &addr)) {
+				if (remote_debug)
+					printk("Changing EIP to 0x%x\n", addr);
+
+				regs.eip = addr;
+			}
+
+			newPC = regs.eip;
+
+			if (kgdb_eth != -1) {
+				kgdb_eth_set_trapmode(0);
+			}
+
+			/* clear the trace bit */
+			regs.eflags &= 0xfffffeff;
+
+			/* set the trace bit if we're stepping */
+			if (remcomInBuffer[0] == 's')
+				regs.eflags |= 0x100;
+
+			/* detach is a friendly version of continue.  Note that
+			   debugging is still enabled (e.g. hit control-C)
+			 */
+			if (remcomInBuffer[0] == 'D') {
+				strcpy(remcomOutBuffer, "OK");
+				putpacket(remcomOutBuffer);
+			}
+
+			if (remote_debug) {
+				printk("Resuming execution\n");
+				print_regs(&regs);
+			}
+			asm volatile ("movl %%db6, %0\n":"=r" (dr6)
+				      :);
+			if (!(dr6 & 0x4000)) {
+				for (breakno = 0; breakno < 4; ++breakno) {
+					if (dr6 & (1 << breakno) &&
+					    (breakinfo[breakno].type == 0)) {
+						/* Set restore flag */
+						regs.eflags |= 0x10000;
+						break;
+					}
+				}
+			}
+			correct_hw_break();
+			asm volatile ("movl %0, %%db6\n"::"r" (0));
+			goto exit_kgdb;
+
+			/* kill the program */
+		case 'k':	/* do nothing */
+			break;
+
+			/* query */
+		case 'q':
+			nothreads = 0;
+			switch (remcomInBuffer[1]) {
+			case 'f':
+				threadid = 1;
+				thread_list = 2;
+				thread_list_start = (usethread ? 
: current);
+				/* fall through */
+			case 's':
+				if (!cmp_str(&remcomInBuffer[2],
+					     "ThreadInfo", 10))
+					break;
+
+				remcomOutBuffer[nothreads++] = 'm';
+				for (; threadid < PID_MAX + MAX_NO_CPUS;
+				     threadid++) {
+					thread = getthread(threadid);
+					if (thread) {
+						nothreads += int_to_hex_v(
+							&remcomOutBuffer[
+								nothreads],
+							threadid);
+						if (thread_min > threadid)
+							thread_min = threadid;
+						remcomOutBuffer[
+							nothreads] = ',';
+						nothreads++;
+						if (nothreads > BUFMAX - 10)
+							break;
+					}
+				}
+				if (remcomOutBuffer[nothreads - 1] == 'm') {
+					remcomOutBuffer[nothreads - 1] = 'l';
+				} else {
+					nothreads--;
+				}
+				remcomOutBuffer[nothreads] = 0;
+				break;
+
+#ifdef old_thread_list /* Old thread info request */
+			case 'L':
+				/* List threads */
+				thread_list = 2;
+				thread_list_start = (usethread ? : current);
+				unpack_byte(remcomInBuffer + 3, &maxthreads);
+				unpack_threadid(remcomInBuffer + 5, &thref);
+				do {
+					int buf_thread_limit =
+					    (BUFMAX - 22) / BUF_THREAD_ID_SIZE;
+					if (maxthreads > buf_thread_limit) {
+						maxthreads = buf_thread_limit;
+					}
+				} while (0);
+				remcomOutBuffer[0] = 'q';
+				remcomOutBuffer[1] = 'M';
+				remcomOutBuffer[4] = '0';
+				pack_threadid(remcomOutBuffer + 5, &thref);
+
+				threadid = threadref_to_int(&thref);
+				for (nothreads = 0;
+				     nothreads < maxthreads &&
+				     threadid < PID_MAX + MAX_NO_CPUS;
+				     threadid++) {
+					thread = getthread(threadid);
+					if (thread) {
+						int_to_threadref(&thref,
+								 threadid);
+						pack_threadid(remcomOutBuffer +
+							      21 +
+							      nothreads * 16,
+							      &thref);
+						nothreads++;
+						if (thread_min > threadid)
+							thread_min = threadid;
+					}
+				}
+
+				if (threadid == PID_MAX + MAX_NO_CPUS) {
+					remcomOutBuffer[4] = '1';
+				}
+				pack_hex_byte(remcomOutBuffer + 2, nothreads);
+				remcomOutBuffer[21 + nothreads * 16] = '\0';
+				break;
+#endif
+			case 'C':
+				/* Current thread id */
+				remcomOutBuffer[0] = 'Q';
+				remcomOutBuffer[1] = 'C';
+				threadid = current->pid;
+				if (!threadid) {
+					/*
+					 * idle thread
+					 */
+					for (threadid = PID_MAX;
+					     threadid < PID_MAX + MAX_NO_CPUS;
+					     threadid++) {
+						if (current ==
+						    idle_task(threadid -
+							      PID_MAX))
+							break;
+					}
+				}
+				int_to_threadref(&thref, threadid);
+				pack_threadid(remcomOutBuffer + 2, &thref);
+				remcomOutBuffer[18] = '\0';
+				break;
+
+			case 'E':
+				/* Print exception info */
+				printexceptioninfo(exceptionVector,
+						   err_code, remcomOutBuffer);
+				break;
+			case 'T':{
+					char * nptr;
+					/* Thread extra info */
+					if (!cmp_str(&remcomInBuffer[2],
+						     "hreadExtraInfo,", 15)) {
+						break;
+					}
+					ptr = &remcomInBuffer[17];
+					hexToInt(&ptr, &threadid);
+					thread = getthread(threadid);
+					if (!thread)
+						break;	/* no such thread */
+					nptr = &thread->comm[0];
+					length = 0;
+					ptr = &remcomOutBuffer[0];
+					do {
+						length++;
+						ptr = pack_hex_byte(ptr, *nptr++);
+					} while (*nptr && length < 16);
+					/*
+					 * would like that 16 to be the size of
+					 * task_struct.comm but don't know the
+					 * syntax..
+					 */
+					*ptr = 0;
+				}
+			}
+			break;
+
+			/* task related */
+		case 'H':
+			switch (remcomInBuffer[1]) {
+			case 'g':
+				ptr = &remcomInBuffer[2];
+				hexToInt(&ptr, &threadid);
+				thread = getthread(threadid);
+				if (!thread) {
+					remcomOutBuffer[0] = 'E';
+					remcomOutBuffer[1] = '\0';
+					break;
+				}
+				/*
+				 * Just in case I forget what this is all about,
+				 * the "thread info" command to gdb causes it
+				 * to ask for a thread list.  It then switches
+				 * to each thread and asks for the registers.
+				 * For this (and only this) usage, we want to
+				 * fudge the registers of tasks not on the run
+				 * list (i.e. waiting) to show the routine that
+				 * called schedule.  Also, gdb is a minimalist
+				 * in that if the current thread is the last
+				 * it will not re-read the info when done.
+				 * This means that in this case we must show
+				 * the real registers.  So here is how we do it:
+				 * on each entry we keep track of the min
+				 * thread in the list (the last that gdb will
+				 * get info for), and we also keep track of
+				 * the starting thread.
+				 * "thread_list" is cleared when switching back
+				 * to the min thread if it was current, or,
+				 * if it was not current, thread_list is set
+				 * to 1.  When the switch to current comes,
+				 * if thread_list is 1, clear it, else do
+				 * nothing. 
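+				 *
+				 * A sketch of the usual round trip: the
+				 * qfThreadInfo query sets thread_list to 2
+				 * and records thread_list_start; gdb then
+				 * sends Hg for each reported thread, ending
+				 * on the smallest id (thread_min), at which
+				 * point thread_list drops to 1 (or straight
+				 * to 0 if that is also the starting thread);
+				 * the final switch back to the starting
+				 * thread clears it, so later Hg packets see
+				 * real registers again.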
+				 */
+				usethread = thread;
+				if ((thread_list == 1) &&
+				    (thread == thread_list_start)) {
+					thread_list = 0;
+				}
+				if (thread_list && (threadid == thread_min)) {
+					if (thread == thread_list_start) {
+						thread_list = 0;
+					} else {
+						thread_list = 1;
+					}
+				}
+				/* fall through */
+			case 'c':
+				remcomOutBuffer[0] = 'O';
+				remcomOutBuffer[1] = 'K';
+				remcomOutBuffer[2] = '\0';
+				break;
+			}
+			break;
+
+			/* Query thread status */
+		case 'T':
+			ptr = &remcomInBuffer[1];
+			hexToInt(&ptr, &threadid);
+			thread = getthread(threadid);
+			if (thread) {
+				remcomOutBuffer[0] = 'O';
+				remcomOutBuffer[1] = 'K';
+				remcomOutBuffer[2] = '\0';
+				if (thread_min > threadid)
+					thread_min = threadid;
+			} else {
+				remcomOutBuffer[0] = 'E';
+				remcomOutBuffer[1] = '\0';
+			}
+			break;
+
+		case 'Y':	/* set up a hardware breakpoint */
+			ptr = &remcomInBuffer[1];
+			hexToInt(&ptr, &breakno);
+			ptr++;
+			hexToInt(&ptr, &breaktype);
+			ptr++;
+			hexToInt(&ptr, &length);
+			ptr++;
+			hexToInt(&ptr, &addr);
+			if (set_hw_break(breakno & 0x3,
+					 breaktype & 0x3,
+					 length & 0x3, addr) == 0) {
+				strcpy(remcomOutBuffer, "OK");
+			} else {
+				strcpy(remcomOutBuffer, "ERROR");
+			}
+			break;
+
+			/* Remove hardware breakpoint */
+		case 'y':
+			ptr = &remcomInBuffer[1];
+			hexToInt(&ptr, &breakno);
+			if (remove_hw_break(breakno & 0x3) == 0) {
+				strcpy(remcomOutBuffer, "OK");
+			} else {
+				strcpy(remcomOutBuffer, "ERROR");
+			}
+			break;
+
+		case 'r':	/* reboot */
+			strcpy(remcomOutBuffer, "OK");
+			putpacket(remcomOutBuffer);
+			/*to_gdb("Rebooting\n"); */
+			/* triple fault, no return from here */
+			{
+				static long no_idt[2];
+				__asm__ __volatile__("lidt %0"::"m"(no_idt[0]));
+				BREAKPOINT;
+			}
+
+		}		/* switch */
+
+		/* reply to the request */
+		putpacket(remcomOutBuffer);
+	}			/* while(1==1) */
+	/*
+	 *  reached by goto only.
+	 */
+      exit_kgdb:
+	/*
+	 * Here is where we set up to trap a gdb function call.  NEW_esp
+	 * will be changed if we are trying to do this.  We handle both
+	 * adding and subtracting, thus allowing gdb to put grunge on
+	 * the stack which it removes later.
+	 */
+	if (NEW_esp != OLD_esp) {
+		int *ptr = END_OF_LOOKASIDE;
+		if (NEW_esp < OLD_esp)
+			ptr -= (OLD_esp - NEW_esp) / sizeof (int);
+		*--ptr = linux_regs->eflags;
+		*--ptr = linux_regs->xcs;
+		*--ptr = linux_regs->eip;
+		*--ptr = linux_regs->ecx;
+		*--ptr = linux_regs->ebx;
+		*--ptr = linux_regs->eax;
+		linux_regs->ecx = NEW_esp - (sizeof (int) * 6);
+		linux_regs->ebx = (unsigned int) END_OF_LOOKASIDE;
+		if (NEW_esp < OLD_esp) {
+			linux_regs->eip = (unsigned int) fn_call_stub;
+		} else {
+			linux_regs->eip = (unsigned int) fn_rtn_stub;
+			linux_regs->eax = NEW_esp;
+		}
+		linux_regs->eflags &= ~(IF_BIT | TF_BIT);
+	}
+#ifdef CONFIG_SMP
+	/*
+	 * Release gdb wait locks.
+	 * Sanity check time.  Must have at least one cpu to run.  Also single
+	 * step must not be done if the current cpu is on hold.
+	 */
+	if (spinlock_count == 1) {
+		int ss_hold = (regs.eflags & 0x100) && kgdb_info.hold_on_sstep;
+		int cpu_avail = 0;
+		int i;
+
+		for (i = 0; i < MAX_NO_CPUS; i++) {
+			if (!cpu_online(i))
+				break;
+			if (!hold_cpu(i)) {
+				cpu_avail = 1;
+			}
+		}
+		/*
+		 * Early in the bring up there will be NO cpus on line... 
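+		 * (hold_cpu(i) presumably reflects the per-cpu
+		 * kgdb_info.hold_cpu flags that gdb can set; the checks
+		 * below refuse to resume when every online cpu is held,
+		 * and refuse to single step on a held cpu.)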
+		 */
+		if (!cpu_avail && !cpus_empty(cpu_online_map)) {
+			to_gdb("No cpus unblocked, see 'kgdb_info.hold_cpu'\n");
+			goto once_again;
+		}
+		if (hold_cpu(smp_processor_id()) && (regs.eflags & 0x100)) {
+			to_gdb
+			    ("Current cpu must be unblocked to single step\n");
+			goto once_again;
+		}
+		if (!(ss_hold)) {
+			int i;
+			for (i = 0; i < MAX_NO_CPUS; i++) {
+				if (!hold_cpu(i)) {
+					spin_unlock(&waitlocks[i]);
+				}
+			}
+		} else {
+			spin_unlock(&waitlocks[smp_processor_id()]);
+		}
+		/* Release kgdb spinlock */
+		KGDB_SPIN_UNLOCK(&kgdb_spinlock);
+		/*
+		 * If this cpu is on hold, this is where we
+		 * do it.  Note, the NMI will pull us out of here,
+		 * but will return as the above lock is not held.
+		 * We will stay here till another cpu releases the lock for us.
+		 */
+		spin_unlock_wait(waitlocks + smp_processor_id());
+		kgdb_local_irq_restore(flags);
+		return (0);
+	}
+#if 0
+exit_just_unlock:
+#endif
+#endif
+	/* Release kgdb spinlock */
+	KGDB_SPIN_UNLOCK(&kgdb_spinlock);
+	kgdb_local_irq_restore(flags);
+	return (0);
+}
+
+/* This function is used to set up exception handlers for tracing and
+ * breakpoints.
+ * It is not actually needed, as the hook above does all that is needed.
+ * We leave it for backward compatibility...
+ */
+void
+set_debug_traps(void)
+{
+	/*
+	 * linux_debug_hook is defined in traps.c.  We store a pointer
+	 * to our own exception handler into it.
+
+	 * But really folks, ever hear of labeled common, an old Fortran
+	 * concept?  Lots of folks can reference it and it is defined if
+	 * anyone does.  Only one can initialize it at link time.  We do
+	 * this with the hook.  See the statement above.  No need for any
+	 * executable code and it is ready as soon as the kernel is
+	 * loaded.  Very desirable in kernel debugging.
+
+	   linux_debug_hook = handle_exception ;
+	 */
+
+	/* In case GDB is started before us, ack any packets (presumably
+	   "$?#xx") sitting there.
+	   putDebugChar ('+');
+
+	   initialized = 1;
+	 */
+}
+
+/* This function will generate a breakpoint exception.  It is used at the
+   beginning of a program to sync up with a debugger and can be used
+   otherwise as a quick means to stop program execution and "break" into
+   the debugger. */
+/* But really, just use the BREAKPOINT macro.  We will handle the int stuff
+ */
+
+#ifdef later
+/*
+ * possibly we should not go thru the traps.c code at all?  Someday. 
+ */
+void
+do_kgdb_int3(struct pt_regs *regs, long error_code)
+{
+	kgdb_handle_exception(3, 5, error_code, regs);
+	return;
+}
+#endif
+#undef regs
+#ifdef CONFIG_TRAP_BAD_SYSCALL_EXITS
+asmlinkage void
+bad_sys_call_exit(int stuff)
+{
+	struct pt_regs *regs = (struct pt_regs *) &stuff;
+	printk("Sys call %d return with %x preempt_count\n",
+	       (int) regs->orig_eax, preempt_count());
+}
+#endif
+#ifdef CONFIG_STACK_OVERFLOW_TEST
+#include <asm/kgdb.h>
+asmlinkage void
+stack_overflow(void)
+{
+#ifdef BREAKPOINT
+	BREAKPOINT;
+#else
+	printk("Kernel stack overflow, looping forever\n");
+#endif
+	while (1) {
+	}
+}
+#endif
+
+#if defined(CONFIG_SMP) || defined(CONFIG_KGDB_CONSOLE)
+char gdbconbuf[BUFMAX];
+
+static void
+kgdb_gdb_message(const char *s, unsigned count)
+{
+	int i;
+	int wcount;
+	char *bufptr;
+	/*
+	 * This takes care of NMI while spinning out chars to gdb
+	 */
+	IF_SMP(in_kgdb_console = 1);
+	gdbconbuf[0] = 'O';
+	bufptr = gdbconbuf + 1;
+	while (count > 0) {
+		if ((count << 1) > (BUFMAX - 2)) {
+			wcount = (BUFMAX - 2) >> 1;
+		} else {
+			wcount = count;
+		}
+		count -= wcount;
+		for (i = 0; i < wcount; i++) {
+			bufptr = pack_hex_byte(bufptr, s[i]);
+		}
+		*bufptr = '\0';
+		s += wcount;
+
+		putpacket(gdbconbuf);
+
+	}
+	IF_SMP(in_kgdb_console = 0);
+}
+#endif
+#ifdef CONFIG_SMP
+static void
+to_gdb(const char *s)
+{
+	int count = 0;
+	while (s[count] && (count++ < BUFMAX)) ;
+	kgdb_gdb_message(s, count);
+}
+#endif
+#ifdef CONFIG_KGDB_CONSOLE
+#include <linux/console.h>
+#include <linux/init.h>
+#include <linux/fs.h>
+#include <asm/uaccess.h>
+#include <asm/semaphore.h>
+
+void
+kgdb_console_write(struct console *co, const char *s, unsigned count)
+{
+
+	if (gdb_i386vector == -1) {
+		/*
+		 * We have not yet talked to gdb.  What to do...
+		 * let's break; on continue we can do the write.
+		 * But first tell him what's up.  Uh, well, no can do,
+		 * as this IS the console.  Oh well...
+		 * We do need to wait or the messages will be lost.
+		 * Other option would be to tell the above code to
+		 * ignore this breakpoint and do an auto return,
+		 * but that might confuse gdb.  Also this happens
+		 * early enough in boot up that we don't have the traps
+		 * set up yet, so...
+		 */
+		breakpoint();
+	}
+	kgdb_gdb_message(s, count);
+}
+
+/*
+ * ------------------------------------------------------------
+ * Serial KGDB driver
+ * ------------------------------------------------------------
+ */
+
+static struct console kgdbcons = {
+	name:"kgdb",
+	write:kgdb_console_write,
+#ifdef CONFIG_KGDB_USER_CONSOLE
+	device:kgdb_console_device,
+#endif
+	flags:CON_PRINTBUFFER | CON_ENABLED,
+	index:-1,
+};
+
+/*
+ * The trick here is that this file gets linked before printk.o.
+ * That means we get to peer at the console info in the command
+ * line before it does.  If we are up, we register, otherwise,
+ * do nothing.  By returning 0, we allow printk to look also.
+ */
+static int kgdb_console_enabled;
+
+int __init
+kgdb_console_init(char *str)
+{
+	if ((strncmp(str, "kgdb", 4) == 0) || (strncmp(str, "gdb", 3) == 0)) {
+		register_console(&kgdbcons);
+		kgdb_console_enabled = 1;
+	}
+	return 0;		/* let others look at the string */
+}
+
+__setup("console=", kgdb_console_init);
+
+#ifdef CONFIG_KGDB_USER_CONSOLE
+static kdev_t kgdb_console_device(struct console *c);
+/* This stuff sort of works, but it knocks out telnet devices;
+ * we are leaving it here in case we (or you) find time to figure it out
+ * better.. 
+ */
+
+/*
+ * We need a real char device as well for when the console is opened for user
+ * space activities.
+ */
+
+static int
+kgdb_consdev_open(struct inode *inode, struct file *file)
+{
+	return 0;
+}
+
+static ssize_t
+kgdb_consdev_write(struct file *file, const char *buf,
+		   size_t count, loff_t * ppos)
+{
+	int size, ret = 0;
+	static char kbuf[128];
+	static DECLARE_MUTEX(sem);
+
+	/* We are not reentrant... */
+	if (down_interruptible(&sem))
+		return -ERESTARTSYS;
+
+	while (count > 0) {
+		/* need to copy the data from user space */
+		size = count;
+		if (size > sizeof (kbuf))
+			size = sizeof (kbuf);
+		if (copy_from_user(kbuf, buf, size)) {
+			ret = -EFAULT;
+			break;
+		}
+		kgdb_console_write(&kgdbcons, kbuf, size);
+		count -= size;
+		ret += size;
+		buf += size;
+	}
+
+	up(&sem);
+
+	return ret;
+}
+
+struct file_operations kgdb_consdev_fops = {
+	open:kgdb_consdev_open,
+	write:kgdb_consdev_write
+};
+static kdev_t
+kgdb_console_device(struct console *c)
+{
+	return MKDEV(TTYAUX_MAJOR, 1);
+}
+
+/*
+ * This routine gets called from the serial stub in i386/lib.
+ * This is so it is done late in bring up (just before the console open).
+ */
+void
+kgdb_console_finit(void)
+{
+	if (kgdb_console_enabled) {
+		char *cptr = cdevname(MKDEV(TTYAUX_MAJOR, 1));
+		char *cp = cptr;
+		while (*cptr && *cptr != '(')
+			cptr++;
+		*cptr = 0;
+		unregister_chrdev(TTYAUX_MAJOR, cp);
+		register_chrdev(TTYAUX_MAJOR, "kgdb", &kgdb_consdev_fops);
+	}
+}
+#endif
+#endif
+#ifdef CONFIG_KGDB_TS
+#include <asm/msr.h>		/* time stamp code */
+#include <asm/hardirq.h>	/* in_interrupt */
+#ifdef CONFIG_KGDB_TS_64
+#define DATA_POINTS 64
+#endif
+#ifdef CONFIG_KGDB_TS_128
+#define DATA_POINTS 128
+#endif
+#ifdef CONFIG_KGDB_TS_256
+#define DATA_POINTS 256
+#endif
+#ifdef CONFIG_KGDB_TS_512
+#define DATA_POINTS 512
+#endif
+#ifdef CONFIG_KGDB_TS_1024
+#define DATA_POINTS 1024
+#endif
+#ifndef DATA_POINTS
+#define DATA_POINTS 128		/* must be a power of two */
+#endif
+#define INDEX_MASK (DATA_POINTS - 1)
+#if (INDEX_MASK & DATA_POINTS)
+#error "CONFIG_KGDB_TS_COUNT must be a power of 2"
+#endif
+struct kgdb_and_then_struct {
+#ifdef CONFIG_SMP
+	int on_cpu;
+#endif
+	struct task_struct *task;
+	long long at_time;
+	int from_ln;
+	char *in_src;
+	void *from;
+	int *with_shpf;
+	int data0;
+	int data1;
+};
+struct kgdb_and_then_struct2 {
+#ifdef CONFIG_SMP
+	int on_cpu;
+#endif
+	struct task_struct *task;
+	long long at_time;
+	int from_ln;
+	char *in_src;
+	void *from;
+	int *with_shpf;
+	struct task_struct *t1;
+	struct task_struct *t2;
+};
+struct kgdb_and_then_struct kgdb_data[DATA_POINTS];
+
+struct kgdb_and_then_struct *kgdb_and_then = &kgdb_data[0];
+int kgdb_and_then_count;
+
+void
+kgdb_tstamp(int line, char *source, int data0, int data1)
+{
+	static spinlock_t ts_spin = SPIN_LOCK_UNLOCKED;
+	int flags;
+	kgdb_local_irq_save(flags);
+	spin_lock(&ts_spin);
+	rdtscll(kgdb_and_then->at_time);
+#ifdef CONFIG_SMP
+	kgdb_and_then->on_cpu = smp_processor_id();
+#endif
+	kgdb_and_then->task = current;
+	kgdb_and_then->from_ln = line;
+	kgdb_and_then->in_src = source;
+	kgdb_and_then->from = __builtin_return_address(0);
+	kgdb_and_then->with_shpf = (int *) (((flags & IF_BIT) >> 9) |
+					    (preempt_count() << 8));
+	kgdb_and_then->data0 = data0;
+	kgdb_and_then->data1 = data1;
+	kgdb_and_then = &kgdb_data[++kgdb_and_then_count & INDEX_MASK];
+	spin_unlock(&ts_spin);
+	kgdb_local_irq_restore(flags);
+#ifdef CONFIG_PREEMPT
+
+#endif
+	return;
+}
+#endif
+typedef int gdb_debug_hook(int exceptionVector,
+			   int signo, int 
err_code, struct pt_regs *linux_regs);
+gdb_debug_hook *linux_debug_hook = &kgdb_handle_exception;	/* hysterical reasons... */
+
+static int __init kgdb_opt_kgdbeth(char *str)
+{
+	kgdb_eth = simple_strtoul(str, NULL, 10);
+	return 1;
+}
+
+static int __init kgdb_opt_kgdbeth_remoteip(char *str)
+{
+	kgdb_remoteip = in_aton(str);
+	return 1;
+}
+
+static int __init kgdb_opt_kgdbeth_listenport(char *str)
+{
+	kgdb_listenport = simple_strtoul(str, NULL, 10);
+	kgdb_sendport = kgdb_listenport - 1;
+	return 1;
+}
+
+static int __init parse_hw_addr(char *str, unsigned char *addr)
+{
+	int i;
+	char *p;
+
+	p = str;
+	i = 0;
+	while(1)
+	{
+		unsigned int c;
+
+		sscanf(p, "%x:", &c);
+		addr[i++] = c;
+		while((*p != 0) && (*p != ':')) {
+			p++;
+		}
+		if (*p == 0) {
+			break;
+		}
+		p++;
+	}
+
+	return 1;
+}
+
+static int __init kgdb_opt_kgdbeth_remotemac(char *str)
+{
+	return parse_hw_addr(str, kgdb_remotemac);
+}
+static int __init kgdb_opt_kgdbeth_localmac(char *str)
+{
+	return parse_hw_addr(str, kgdb_localmac);
+}
+
+
+__setup("gdbeth=", kgdb_opt_kgdbeth);
+__setup("gdbeth_remoteip=", kgdb_opt_kgdbeth_remoteip);
+__setup("gdbeth_listenport=", kgdb_opt_kgdbeth_listenport);
+__setup("gdbeth_remotemac=", kgdb_opt_kgdbeth_remotemac);
+__setup("gdbeth_localmac=", kgdb_opt_kgdbeth_localmac);
+
--- diff/arch/i386/kernel/timers/timer_pm.c	1970-01-01 01:00:00.000000000 +0100
+++ source/arch/i386/kernel/timers/timer_pm.c	2003-11-26 10:09:04.000000000 +0000
@@ -0,0 +1,203 @@
+/*
+ * (C) Dominik Brodowski <linux@brodo.de> 2003
+ *
+ * Driver to use the Power Management Timer (PMTMR) available in some
+ * southbridges as primary timing source for the Linux kernel.
+ *
+ * Based on parts of linux/drivers/acpi/hardware/hwtimer.c, timer_pit.c,
+ * timer_hpet.c, and on Arjan van de Ven's implementation for 2.4.
+ *
+ * This file is licensed under the GPL v2.
+ */
+
+
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/device.h>
+#include <linux/init.h>
+#include <asm/types.h>
+#include <asm/timer.h>
+#include <asm/smp.h>
+#include <asm/io.h>
+#include <asm/arch_hooks.h>
+
+
+/* The I/O port the PMTMR resides at.
+ * The location is detected during setup_arch(),
+ * in arch/i386/acpi/boot.c */
+u32 pmtmr_ioport = 0;
+
+
+/* value of the Power timer at last timer interrupt */
+static u32 offset_tick;
+static u32 offset_delay;
+
+static unsigned long long monotonic_base;
+static seqlock_t monotonic_lock = SEQLOCK_UNLOCKED;
+
+#define ACPI_PM_MASK 0xFFFFFF /* limit it to 24 bits */
+
+static int init_pmtmr(char* override)
+{
+	u32 value1, value2;
+	unsigned int i;
+
+	if (override[0] && strncmp(override,"pmtmr",5))
+		return -ENODEV;
+
+	if (!pmtmr_ioport)
+		return -ENODEV;
+
+	/* "verify" this timing source */
+	value1 = inl(pmtmr_ioport);
+	value1 &= ACPI_PM_MASK;
+	for (i=0; i < 10000; i++) {
+		value2 = inl(pmtmr_ioport);
+		value2 &= ACPI_PM_MASK;
+		if (value2 == value1)
+			continue;
+		if (value2 > value1)
+			goto pm_good;
+		if ((value2 < value1) && ((value2) < 0xFFF))
+			goto pm_good;
+		printk(KERN_INFO "PM-Timer had inconsistent results: %#x, %#x - aborting.\n", value1, value2);
+		return -EINVAL;
+	}
+	printk(KERN_INFO "PM-Timer had no reasonable result: %#x - aborting.\n", value1);
+	return -ENODEV;
+
+pm_good:
+	init_cpu_khz();
+	return 0;
+}
+
+static inline u32 cyc2us(u32 cycles)
+{
+	/* The Power Management Timer ticks at 3.579545 ticks per microsecond. 
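+	 * (That is the ACPI-mandated 3.579545 MHz rate, i.e. the
+	 * 14.31818 MHz ISA oscillator divided by 4, so one tick is
+	 * roughly 0.2794 us.)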
+	 * 1 / PM_TIMER_FREQUENCY == 0.27936511 =~ 286/1024 [error: 0.024%]
+	 *
+	 * Even with HZ = 100, delta is at maximum 35796 ticks, so it can
+	 * easily be multiplied with 286 (=0x11E) without having to fear
+	 * u32 overflows.
+	 */
+	cycles *= 286;
+	return (cycles >> 10);
+}
+
+/*
+ * this gets called during each timer interrupt
+ *  - Called while holding the writer xtime_lock
+ */
+static void mark_offset_pmtmr(void)
+{
+	u32 lost, delta, last_offset;
+	static int first_run = 1;
+	last_offset = offset_tick;
+
+	write_seqlock(&monotonic_lock);
+
+	offset_tick = inl(pmtmr_ioport);
+	offset_tick &= ACPI_PM_MASK; /* limit it to 24 bits */
+
+	/* calculate tick interval */
+	delta = (offset_tick - last_offset) & ACPI_PM_MASK;
+
+	/* convert to usecs */
+	delta = cyc2us(delta);
+
+	/* update the monotonic base value */
+	monotonic_base += delta*NSEC_PER_USEC;
+	write_sequnlock(&monotonic_lock);
+
+	/* convert to ticks */
+	delta += offset_delay;
+	lost = delta/(USEC_PER_SEC/HZ);
+	offset_delay = delta%(USEC_PER_SEC/HZ);
+
+
+	/* compensate for lost ticks */
+	if (lost >= 2)
+		jiffies += lost - 1;
+
+	/* don't calculate delay for first run,
+	   or if we've got less than a tick */
+	if (first_run || (lost < 1)) {
+		first_run = 0;
+		offset_delay = 0;
+	}
+
+	return;
+}
+
+
+static unsigned long long monotonic_clock_pmtmr(void)
+{
+	u32 last_offset, this_offset;
+	unsigned long long base, ret;
+	unsigned seq;
+
+
+	/* atomically read monotonic base & last_offset */
+	do {
+		seq = read_seqbegin(&monotonic_lock);
+		last_offset = offset_tick;
+		base = monotonic_base;
+	} while (read_seqretry(&monotonic_lock, seq));
+
+	/* Read the pmtmr */
+	this_offset = inl(pmtmr_ioport) & ACPI_PM_MASK;
+
+	/* convert to nanoseconds */
+	ret = (this_offset - last_offset) & ACPI_PM_MASK;
+	ret = base + (cyc2us(ret)*NSEC_PER_USEC);
+	return ret;
+}
+
+/*
+ * copied from delay_pit
+ */
+static void delay_pmtmr(unsigned long loops)
+{
+	int d0;
+	__asm__ __volatile__(
+		"\tjmp 1f\n"
+		".align 16\n"
+		"1:\tjmp 2f\n"
+		".align 16\n"
+		"2:\tdecl %0\n\tjns 2b"
+		:"=&a" (d0)
+		:"0" (loops));
+}
+
+
+/*
+ * get the offset (in microseconds) from the last call to mark_offset()
+ *  - Called holding a reader xtime_lock
+ */
+static unsigned long get_offset_pmtmr(void)
+{
+	u32 now, offset, delta = 0;
+
+	offset = offset_tick;
+	now = inl(pmtmr_ioport);
+	now &= ACPI_PM_MASK;
+	delta = (now - offset)&ACPI_PM_MASK;
+
+	return (unsigned long) offset_delay + cyc2us(delta);
+}
+
+
+/* acpi timer_opts struct */
+struct timer_opts timer_pmtmr = {
+	.name			= "acpi_pm_timer",
+	.init			= init_pmtmr,
+	.mark_offset		= mark_offset_pmtmr,
+	.get_offset		= get_offset_pmtmr,
+	.monotonic_clock	= monotonic_clock_pmtmr,
+	.delay			= delay_pmtmr,
+};
+
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Dominik Brodowski <linux@brodo.de>");
+MODULE_DESCRIPTION("Power Management Timer (PMTMR) as primary timing source for x86");
--- diff/arch/i386/lib/kgdb_serial.c	1970-01-01 01:00:00.000000000 +0100
+++ source/arch/i386/lib/kgdb_serial.c	2003-11-26 10:09:04.000000000 +0000
@@ -0,0 +1,499 @@
+/*
+ * Serial interface GDB stub
+ *
+ * Written (hacked together) by David Grothe (dave@gcom.com)
+ * Modified to allow invocation early in boot (see also kgdb.h
+ * for instructions) by George Anzinger (george@mvista.com)
+ * Modified to handle debugging over ethernet by Robert Walsh
+ * <rjwalsh@durables.org> and wangdi <wangdi@clusterfs.com>, based on
+ * code by San Mehat. 
+ *
+ */
+
+#include <linux/module.h>
+#include <linux/errno.h>
+#include <linux/signal.h>
+#include <linux/sched.h>
+#include <linux/timer.h>
+#include <linux/interrupt.h>
+#include <linux/tty.h>
+#include <linux/tty_flip.h>
+#include <linux/serial.h>
+#include <linux/serial_reg.h>
+#include <linux/config.h>
+#include <linux/major.h>
+#include <linux/string.h>
+#include <linux/fcntl.h>
+#include <linux/ptrace.h>
+#include <linux/ioport.h>
+#include <linux/mm.h>
+#include <linux/init.h>
+#include <linux/highmem.h>
+#include <asm/system.h>
+#include <asm/io.h>
+#include <asm/segment.h>
+#include <asm/bitops.h>
+#include <asm/kgdb_local.h>
+#ifdef CONFIG_KGDB_USER_CONSOLE
+extern void kgdb_console_finit(void);
+#endif
+#define PRNT_off
+#define TEST_EXISTANCE
+#ifdef PRNT
+#define dbprintk(s) printk s
+#else
+#define dbprintk(s)
+#endif
+#define TEST_INTERRUPT_off
+#ifdef TEST_INTERRUPT
+#define intprintk(s) printk s
+#else
+#define intprintk(s)
+#endif
+
+#define IRQ_T(info) ((info->flags & ASYNC_SHARE_IRQ) ? SA_SHIRQ : SA_INTERRUPT)
+
+#define GDB_BUF_SIZE 512	/* power of 2, please */
+
+static char gdb_buf[GDB_BUF_SIZE];
+static int gdb_buf_in_inx;
+static atomic_t gdb_buf_in_cnt;
+static int gdb_buf_out_inx;
+
+struct async_struct *gdb_async_info;
+static int gdb_async_irq;
+
+#define outb_px(a,b) outb_p(b,a)
+
+static void program_uart(struct async_struct *info);
+static void write_char(struct async_struct *info, int chr);
+/*
+ * Get a byte from the hardware data buffer and return it
+ */
+static int
+read_data_bfr(struct async_struct *info)
+{
+	char it = inb_p(info->port + UART_LSR);
+
+	if (it & UART_LSR_DR)
+		return (inb_p(info->port + UART_RX));
+	/*
+	 * If we have a framing error assume somebody messed with
+	 * our uart.  Reprogram it and send '-' both ways...
+	 */
+	if (it & 0xc) {
+		program_uart(info);
+		write_char(info, '-');
+		return ('-');
+	}
+	return (-1);
+
+}				/* read_data_bfr */
+
+/*
+ * Get a char if available, return -1 if nothing available.
+ * Empty the receive buffer first, then look at the interface hardware.
+
+ * Locking here is a bit of a problem.  We MUST not lock out communication
+ * if we are trying to talk to gdb about a kgdb entry.  On the other hand
+ * we can lose chars in the console pass thru if we don't lock.  It is also
+ * possible that we could hold the lock or be waiting for it when kgdb
+ * NEEDS to talk.  Since kgdb locks down the world, it does not need locks.
+ * We do, of course, have possible issues with interrupting a uart operation,
+ * but we will just depend on the uart status to help keep that straight.
+
+ */
+static spinlock_t uart_interrupt_lock = SPIN_LOCK_UNLOCKED;
+#ifdef CONFIG_SMP
+extern spinlock_t kgdb_spinlock;
+#endif
+
+static int
+read_char(struct async_struct *info)
+{
+	int chr;
+	unsigned long flags;
+	local_irq_save(flags);
+#ifdef CONFIG_SMP
+	if (!spin_is_locked(&kgdb_spinlock)) {
+		spin_lock(&uart_interrupt_lock);
+	}
+#endif
+	if (atomic_read(&gdb_buf_in_cnt) != 0) {	/* intr routine has q'd chars */
+		chr = gdb_buf[gdb_buf_out_inx++];
+		gdb_buf_out_inx &= (GDB_BUF_SIZE - 1);
+		atomic_dec(&gdb_buf_in_cnt);
+	} else {
+		chr = read_data_bfr(info);
+	}
+#ifdef CONFIG_SMP
+	if (!spin_is_locked(&kgdb_spinlock)) {
+		spin_unlock(&uart_interrupt_lock);
+	}
+#endif
+	local_irq_restore(flags);
+	return (chr);
+}
+
+/*
+ * Wait until the interface can accept a char, then write it. 
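+ * ("Can accept" means the transmit holding register is empty, i.e.
+ * the UART_LSR_THRE bit is set in the line status register, which is
+ * what the loop below polls for.)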
+ */
+static void
+write_char(struct async_struct *info, int chr)
+{
+	while (!(inb_p(info->port + UART_LSR) & UART_LSR_THRE)) ;
+
+	outb_p(chr, info->port + UART_TX);
+
+}				/* write_char */
+
+/*
+ * Mostly we don't need a spinlock, but since the console goes
+ * thru here with interrupts on, well, we need to catch those
+ * chars.
+ */
+/*
+ * This is the receiver interrupt routine for the GDB stub.
+ * It will receive a limited number of characters of input
+ * from the gdb host machine and save them up in a buffer.
+ *
+ * When the gdb stub routine tty_getDebugChar() is called it
+ * draws characters out of the buffer until it is empty and
+ * then reads directly from the serial port.
+ *
+ * We do not attempt to write chars from the interrupt routine
+ * since the stubs do all of that via tty_putDebugChar() which
+ * writes one byte after waiting for the interface to become
+ * ready.
+ *
+ * The debug stubs like to run with interrupts disabled since,
+ * after all, they run as a consequence of a breakpoint in
+ * the kernel.
+ *
+ * Perhaps someone who knows more about the tty driver than I
+ * care to learn can make this work for any low level serial
+ * driver.
+ */
+static irqreturn_t
+gdb_interrupt(int irq, void *dev_id, struct pt_regs *regs)
+{
+	struct async_struct *info;
+	unsigned long flags;
+
+	info = gdb_async_info;
+	if (!info || !info->tty || irq != gdb_async_irq)
+		return IRQ_NONE;
+
+	local_irq_save(flags);
+	spin_lock(&uart_interrupt_lock);
+	do {
+		int chr = read_data_bfr(info);
+		intprintk(("Debug char on int: %x hex\n", chr));
+		if (chr < 0)
+			continue;
+
+		if (chr == 3) {	/* Ctrl-C means remote interrupt */
+			BREAKPOINT;
+			continue;
+		}
+
+		if (atomic_read(&gdb_buf_in_cnt) >= GDB_BUF_SIZE) {
+			/* buffer overflow tosses early char */
+			read_char(info);
+		}
+		gdb_buf[gdb_buf_in_inx++] = chr;
+		gdb_buf_in_inx &= (GDB_BUF_SIZE - 1);
+	} while (inb_p(info->port + UART_IIR) & UART_IIR_RDI);
+	spin_unlock(&uart_interrupt_lock);
+	local_irq_restore(flags);
+	return IRQ_HANDLED;
+}				/* gdb_interrupt */
+
+/*
+ * Just a NULL routine for testing.
+ */
+void
+gdb_null(void)
+{
+}				/* gdb_null */
+
+/* These structures are filled in with values defined in asm/kgdb_local.h
+ */
+static struct serial_state state = SB_STATE;
+static struct async_struct local_info = SB_INFO;
+static int ok_to_enable_ints = 0;
+static void kgdb_enable_ints_now(void);
+
+extern char *kgdb_version;
+/*
+ * Hook an IRQ for KGDB.
+ *
+ * This routine is called from tty_putDebugChar, below.
+ */
+static int ints_disabled = 1;
+int
+gdb_hook_interrupt(struct async_struct *info, int verb)
+{
+	struct serial_state *state = info->state;
+	unsigned long flags;
+	int port;
+#ifdef TEST_EXISTANCE
+	int scratch, scratch2;
+#endif
+
+	/* The above fails if memory management is not set up yet.
+	 * Rather than fail the set up, just keep track of the fact
+	 * and pick up the interrupt thing later.
+	 */
+	gdb_async_info = info;
+	port = gdb_async_info->port;
+	gdb_async_irq = state->irq;
+	if (verb) {
+		printk("kgdb %s : port =%x, IRQ=%d, divisor =%d\n",
+		       kgdb_version,
+		       port,
+		       gdb_async_irq, gdb_async_info->state->custom_divisor);
+	}
+	local_irq_save(flags);
+#ifdef TEST_EXISTANCE
+	/* Existence test */
+	/* Should not need all this, but just in case.... 
*/
+
+	scratch = inb_p(port + UART_IER);
+	outb_px(port + UART_IER, 0);
+	outb_px(0x080, 0xff);	/* settling write to the POST port */
+	scratch2 = inb_p(port + UART_IER);
+	outb_px(port + UART_IER, scratch);
+	if (scratch2) {
+		printk
+		    ("gdb_hook_interrupt: Could not clear IER, not a UART!\n");
+		local_irq_restore(flags);
+		return 1;	/* We failed; there's nothing here */
+	}
+	scratch2 = inb_p(port + UART_LCR);
+	outb_px(port + UART_LCR, 0xBF);	/* set up for StarTech test */
+	outb_px(port + UART_EFR, 0);	/* EFR is the same as FCR */
+	outb_px(port + UART_LCR, 0);
+	outb_px(port + UART_FCR, UART_FCR_ENABLE_FIFO);
+	scratch = inb_p(port + UART_IIR) >> 6;
+	if (scratch == 1) {
+		printk("gdb_hook_interrupt: Undefined UART type!"
+		       "  Not a UART! \n");
+		local_irq_restore(flags);
+		return 1;
+	} else {
+		dbprintk(("gdb_hook_interrupt: UART type "
+			  "is %d where 0=16450, 2=16550 3=16550A\n", scratch));
+	}
+	scratch = inb_p(port + UART_MCR);
+	outb_px(port + UART_MCR, UART_MCR_LOOP | scratch);
+	outb_px(port + UART_MCR, UART_MCR_LOOP | 0x0A);
+	scratch2 = inb_p(port + UART_MSR) & 0xF0;
+	outb_px(port + UART_MCR, scratch);
+	if (scratch2 != 0x90) {
+		printk("gdb_hook_interrupt: "
+		       "Loop back test failed! Not a UART!\n");
+		local_irq_restore(flags);
+		return scratch2 + 1000;	/* force 0 to fail */
+	}
+#endif				/* test existence */
+	program_uart(info);
+	local_irq_restore(flags);
+
+	return (0);
+
+}				/* gdb_hook_interrupt */
+
+static void
+program_uart(struct async_struct *info)
+{
+	int port = info->port;
+
+	(void) inb_p(port + UART_RX);
+	outb_px(port + UART_IER, 0);
+
+	(void) inb_p(port + UART_RX);	/* serial driver comments say */
+	(void) inb_p(port + UART_IIR);	/* this clears the interrupt regs */
+	(void) inb_p(port + UART_MSR);
+	outb_px(port + UART_LCR, UART_LCR_WLEN8 | UART_LCR_DLAB);
+	outb_px(port + UART_DLL, info->state->custom_divisor & 0xff);	/* LS */
+	outb_px(port + UART_DLM, info->state->custom_divisor >> 8);	/* MS */
+	outb_px(port + UART_MCR, info->MCR);
+
+	outb_px(port + UART_FCR, UART_FCR_ENABLE_FIFO | UART_FCR_TRIGGER_1 | UART_FCR_CLEAR_XMIT | UART_FCR_CLEAR_RCVR);	/* set fcr */
+	outb_px(port + UART_LCR, UART_LCR_WLEN8);	/* reset DLAB */
+	outb_px(port + UART_FCR, UART_FCR_ENABLE_FIFO | UART_FCR_TRIGGER_1);	/* set fcr */
+	if (!ints_disabled) {
+		intprintk(("KGDB: Sending %d to port %x offset %d\n",
+			   gdb_async_info->IER,
+			   (int) gdb_async_info->port, UART_IER));
+		outb_px(gdb_async_info->port + UART_IER, gdb_async_info->IER);
+	}
+	return;
+}
+
+/*
+ * tty_getDebugChar
+ *
+ * This is a GDB stub routine.  It waits for a character from the
+ * serial interface and then returns it.  If there is no serial
+ * interface connection then it returns a bogus value which will
+ * almost certainly cause the system to hang.
+ */
+int kgdb_in_isr = 0;
+int kgdb_in_lsr = 0;
+extern spinlock_t kgdb_spinlock;
+
+/* Caller takes needed protections */
+
+int
+tty_getDebugChar(void)
+{
+	volatile int chr, dum, time, end_time;
+
+	dbprintk(("tty_getDebugChar(port %x): ", gdb_async_info->port));
+
+	if (gdb_async_info == NULL) {
+		gdb_hook_interrupt(&local_info, 0);
+	}
+	/*
+	 * This trick says if we wait a very long time and get
+	 * no char, return the -1 and let the upper level deal
+	 * with it. 
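+	 * (The rdtsc() macro below puts the high 32 bits of the TSC in
+	 * 'time', so waiting for it to advance by 2 is a coarse timeout
+	 * on the order of 2^32 processor cycles.)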
+	 */
+	rdtsc(dum, time);
+	end_time = time + 2;
+	while (((chr = read_char(gdb_async_info)) == -1) &&
+	       (end_time - time) > 0) {
+		rdtsc(dum, time);
+	};
+	/*
+	 * This covers our butts if some other code messes with
+	 * our uart, hey, it happens :o)
+	 */
+	if (chr == -1)
+		program_uart(gdb_async_info);
+
+	dbprintk(("%c\n", chr > ' ' && chr < 0x7F ? chr : ' '));
+	return (chr);
+
+}				/* tty_getDebugChar */
+
+static int count = 3;
+static spinlock_t one_at_atime = SPIN_LOCK_UNLOCKED;
+
+static int __init
+kgdb_enable_ints(void)
+{
+	if (kgdb_eth != -1) {
+		return 0;
+	}
+	if (gdb_async_info == NULL) {
+		gdb_hook_interrupt(&local_info, 1);
+	}
+	ok_to_enable_ints = 1;
+	kgdb_enable_ints_now();
+#ifdef CONFIG_KGDB_USER_CONSOLE
+	kgdb_console_finit();
+#endif
+	return 0;
+}
+
+#ifdef CONFIG_SERIAL_8250
+void shutdown_for_kgdb(struct async_struct *gdb_async_info);
+#endif
+
+#ifdef CONFIG_DISCONTIGMEM
+static inline int kgdb_mem_init_done(void)
+{
+	return highmem_start_page != NULL;
+}
+#else
+static inline int kgdb_mem_init_done(void)
+{
+	return max_mapnr != 0;
+}
+#endif
+
+static void
+kgdb_enable_ints_now(void)
+{
+	if (!spin_trylock(&one_at_atime))
+		return;
+	if (!ints_disabled)
+		goto exit;
+	if (kgdb_mem_init_done() &&
+	    ints_disabled) {	/* don't try till mem init */
+#ifdef CONFIG_SERIAL_8250
+		/*
+		 * The ifdef here allows the system to be configured
+		 * without the serial driver.
+		 * Don't make it a module, however, as it will steal the port
+		 */
+		shutdown_for_kgdb(gdb_async_info);
+#endif
+		ints_disabled = request_irq(gdb_async_info->state->irq,
+					    gdb_interrupt,
+					    IRQ_T(gdb_async_info),
+					    "KGDB-stub", NULL);
+		intprintk(("KGDB: request_irq returned %d\n", ints_disabled));
+	}
+	if (!ints_disabled) {
+		intprintk(("KGDB: Sending %d to port %x offset %d\n",
+			   gdb_async_info->IER,
+			   (int) gdb_async_info->port, UART_IER));
+		outb_px(gdb_async_info->port + UART_IER, gdb_async_info->IER);
+	}
+      exit:
+	spin_unlock(&one_at_atime);
+}
+
+/*
+ * tty_putDebugChar
+ *
+ * This is a GDB stub routine.  It waits until the interface is ready
+ * to transmit a char and then sends it.  If there is no serial
+ * interface connection then it simply returns to its caller, having
+ * pretended to send the char.  Caller takes needed protections.
+ */
+void
+tty_putDebugChar(int chr)
+{
+	dbprintk(("tty_putDebugChar(port %x): chr=%02x '%c', ints_on=%d\n",
+		  gdb_async_info->port,
+		  chr,
+		  chr > ' ' && chr < 0x7F ? chr : ' ', ints_disabled ? 0 : 1));
+
+	if (gdb_async_info == NULL) {
+		gdb_hook_interrupt(&local_info, 0);
+	}
+
+	write_char(gdb_async_info, chr);	/* this routine will wait */
+	count = (chr == '#') ? 0 : count + 1;
+	if (count == 2) {	/* try to enable after each complete packet */
+		if (ints_disabled && ok_to_enable_ints)
+			kgdb_enable_ints_now();
+
+		/* We do this a lot because, well, we really want to get these
+		 * interrupts.  The serial driver will clear these bits when it
+		 * initializes the chip.  Everything else it does is ok,
+		 * but this.
+		 */
+		if (!ints_disabled) {
+			outb_px(gdb_async_info->port + UART_IER,
+				gdb_async_info->IER);
+		}
+	}
+
+}				/* tty_putDebugChar */
+
+/*
+ * This does nothing for the serial port, since it doesn't buffer. 
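+ * (flushDebugChar() only matters for transports that batch output;
+ * presumably the ethernet stub buffers a packet and sends it on
+ * flush, while this serial path writes each char immediately.)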
+ */ + +void tty_flushDebugChar(void) +{ +} + +module_init(kgdb_enable_ints); --- diff/arch/x86_64/ia32/ia32_aout.c 1970-01-01 01:00:00.000000000 +0100 +++ source/arch/x86_64/ia32/ia32_aout.c 2003-11-26 10:09:05.000000000 +0000 @@ -0,0 +1,506 @@ +/* + * a.out loader for x86-64 + * + * Copyright (C) 1991, 1992, 1996 Linus Torvalds + * Hacked together by Andi Kleen + */ + +#include <linux/module.h> + +#include <linux/time.h> +#include <linux/kernel.h> +#include <linux/mm.h> +#include <linux/mman.h> +#include <linux/a.out.h> +#include <linux/errno.h> +#include <linux/signal.h> +#include <linux/string.h> +#include <linux/fs.h> +#include <linux/file.h> +#include <linux/stat.h> +#include <linux/fcntl.h> +#include <linux/ptrace.h> +#include <linux/user.h> +#include <linux/slab.h> +#include <linux/binfmts.h> +#include <linux/personality.h> +#include <linux/init.h> + +#include <asm/system.h> +#include <asm/uaccess.h> +#include <asm/pgalloc.h> +#include <asm/cacheflush.h> +#include <asm/user32.h> + +#undef WARN_OLD + +extern int ia32_setup_arg_pages(struct linux_binprm *bprm); + +static int load_aout_binary(struct linux_binprm *, struct pt_regs * regs); +static int load_aout_library(struct file*); +static int aout_core_dump(long signr, struct pt_regs * regs, struct file *file); + +/* + * fill in the user structure for a core dump.. + */ +static void dump_thread32(struct pt_regs * regs, struct user32 * dump) +{ + u32 fs,gs; + +/* changed the size calculations - should hopefully work better. lbt */ + dump->magic = CMAGIC; + dump->start_code = 0; + dump->start_stack = regs->rsp & ~(PAGE_SIZE - 1); + dump->u_tsize = ((unsigned long) current->mm->end_code) >> PAGE_SHIFT; + dump->u_dsize = ((unsigned long) (current->mm->brk + (PAGE_SIZE-1))) >> PAGE_SHIFT; + dump->u_dsize -= dump->u_tsize; + dump->u_ssize = 0; + dump->u_debugreg[0] = current->thread.debugreg0; + dump->u_debugreg[1] = current->thread.debugreg1; + dump->u_debugreg[2] = current->thread.debugreg2; + dump->u_debugreg[3] = current->thread.debugreg3; + dump->u_debugreg[4] = 0; + dump->u_debugreg[5] = 0; + dump->u_debugreg[6] = current->thread.debugreg6; + dump->u_debugreg[7] = current->thread.debugreg7; + + if (dump->start_stack < 0xc0000000) + dump->u_ssize = ((unsigned long) (0xc0000000 - dump->start_stack)) >> PAGE_SHIFT; + + dump->regs.ebx = regs->rbx; + dump->regs.ecx = regs->rcx; + dump->regs.edx = regs->rdx; + dump->regs.esi = regs->rsi; + dump->regs.edi = regs->rdi; + dump->regs.ebp = regs->rbp; + dump->regs.eax = regs->rax; + dump->regs.ds = current->thread.ds; + dump->regs.es = current->thread.es; + asm("movl %%fs,%0" : "=r" (fs)); dump->regs.fs = fs; + asm("movl %%gs,%0" : "=r" (gs)); dump->regs.gs = gs; + dump->regs.orig_eax = regs->orig_rax; + dump->regs.eip = regs->rip; + dump->regs.cs = regs->cs; + dump->regs.eflags = regs->eflags; + dump->regs.esp = regs->rsp; + dump->regs.ss = regs->ss; + +#if 1 /* FIXME */ + dump->u_fpvalid = 0; +#else + dump->u_fpvalid = dump_fpu (regs, &dump->i387); +#endif +} + +static struct linux_binfmt aout_format = { + .module = THIS_MODULE, + .load_binary = load_aout_binary, + .load_shlib = load_aout_library, + .core_dump = aout_core_dump, + .min_coredump = PAGE_SIZE +}; + +static void set_brk(unsigned long start, unsigned long end) +{ + start = PAGE_ALIGN(start); + end = PAGE_ALIGN(end); + if (end <= start) + return; + do_brk(start, end - start); +} + +/* + * These are the only things you should do on a core-file: use only these + * macros to write out all the necessary info. 
+ */ + +static int dump_write(struct file *file, const void *addr, int nr) +{ + return file->f_op->write(file, addr, nr, &file->f_pos) == nr; +} + +#define DUMP_WRITE(addr, nr) \ + if (!dump_write(file, (void *)(addr), (nr))) \ + goto end_coredump; + +#define DUMP_SEEK(offset) \ +if (file->f_op->llseek) { \ + if (file->f_op->llseek(file,(offset),0) != (offset)) \ + goto end_coredump; \ +} else file->f_pos = (offset) + +/* + * Routine writes a core dump image in the current directory. + * Currently only a stub-function. + * + * Note that setuid/setgid files won't make a core-dump if the uid/gid + * changed due to the set[u|g]id. It's enforced by the "current->mm->dumpable" + * field, which also makes sure the core-dumps won't be recursive if the + * dumping of the process results in another error.. + */ + +static int aout_core_dump(long signr, struct pt_regs * regs, struct file *file) +{ + mm_segment_t fs; + int has_dumped = 0; + unsigned long dump_start, dump_size; + struct user32 dump; +# define START_DATA(u) (u.u_tsize << PAGE_SHIFT) +# define START_STACK(u) (u.start_stack) + + fs = get_fs(); + set_fs(KERNEL_DS); + has_dumped = 1; + current->flags |= PF_DUMPCORE; + strncpy(dump.u_comm, current->comm, sizeof(current->comm)); + dump.u_ar0 = (u32)(((unsigned long)(&dump.regs)) - ((unsigned long)(&dump))); + dump.signal = signr; + dump_thread32(regs, &dump); + +/* If the size of the dump file exceeds the rlimit, then see what would happen + if we wrote the stack, but not the data area. */ + if ((dump.u_dsize+dump.u_ssize+1) * PAGE_SIZE > + current->rlim[RLIMIT_CORE].rlim_cur) + dump.u_dsize = 0; + +/* Make sure we have enough room to write the stack and data areas. */ + if ((dump.u_ssize+1) * PAGE_SIZE > + current->rlim[RLIMIT_CORE].rlim_cur) + dump.u_ssize = 0; + +/* make sure we actually have a data and stack area to dump */ + set_fs(USER_DS); + if (verify_area(VERIFY_READ, (void *) (unsigned long)START_DATA(dump), dump.u_dsize << PAGE_SHIFT)) + dump.u_dsize = 0; + if (verify_area(VERIFY_READ, (void *) (unsigned long)START_STACK(dump), dump.u_ssize << PAGE_SHIFT)) + dump.u_ssize = 0; + + set_fs(KERNEL_DS); +/* struct user */ + DUMP_WRITE(&dump,sizeof(dump)); +/* Now dump all of the user data. Include malloced stuff as well */ + DUMP_SEEK(PAGE_SIZE); +/* now we start writing out the user space info */ + set_fs(USER_DS); +/* Dump the data area */ + if (dump.u_dsize != 0) { + dump_start = START_DATA(dump); + dump_size = dump.u_dsize << PAGE_SHIFT; + DUMP_WRITE(dump_start,dump_size); + } +/* Now prepare to dump the stack area */ + if (dump.u_ssize != 0) { + dump_start = START_STACK(dump); + dump_size = dump.u_ssize << PAGE_SHIFT; + DUMP_WRITE(dump_start,dump_size); + } +/* Finally dump the task struct. Not be used by gdb, but could be useful */ + set_fs(KERNEL_DS); + DUMP_WRITE(current,sizeof(*current)); +end_coredump: + set_fs(fs); + return has_dumped; +} + +/* + * create_aout_tables() parses the env- and arg-strings in new user + * memory and creates the pointer tables from them, and puts their + * addresses on the "stack", returning the new stack pointer value. 
+ */ +static u32 * create_aout_tables(char * p, struct linux_binprm * bprm) +{ + u32 *argv, *envp; + u32 * sp; + int argc = bprm->argc; + int envc = bprm->envc; + + sp = (u32 *) ((-(unsigned long)sizeof(u32)) & (unsigned long) p); + sp -= envc+1; + envp = (u32 *) sp; + sp -= argc+1; + argv = (u32 *) sp; + put_user((unsigned long) envp,--sp); + put_user((unsigned long) argv,--sp); + put_user(argc,--sp); + current->mm->arg_start = (unsigned long) p; + while (argc-->0) { + char c; + put_user((u32)(unsigned long)p,argv++); + do { + get_user(c,p++); + } while (c); + } + put_user(NULL,argv); + current->mm->arg_end = current->mm->env_start = (unsigned long) p; + while (envc-->0) { + char c; + put_user((u32)(unsigned long)p,envp++); + do { + get_user(c,p++); + } while (c); + } + put_user(NULL,envp); + current->mm->env_end = (unsigned long) p; + return sp; +} + +/* + * These are the functions used to load a.out style executables and shared + * libraries. There is no binary dependent code anywhere else. + */ + +static int load_aout_binary(struct linux_binprm * bprm, struct pt_regs * regs) +{ + struct exec ex; + unsigned long error; + unsigned long fd_offset; + unsigned long rlim; + int retval; + + ex = *((struct exec *) bprm->buf); /* exec-header */ + if ((N_MAGIC(ex) != ZMAGIC && N_MAGIC(ex) != OMAGIC && + N_MAGIC(ex) != QMAGIC && N_MAGIC(ex) != NMAGIC) || + N_TRSIZE(ex) || N_DRSIZE(ex) || + i_size_read(bprm->file->f_dentry->d_inode) < ex.a_text+ex.a_data+N_SYMSIZE(ex)+N_TXTOFF(ex)) { + return -ENOEXEC; + } + + fd_offset = N_TXTOFF(ex); + + /* Check initial limits. This avoids letting people circumvent + * size limits imposed on them by creating programs with large + * arrays in the data or bss. + */ + rlim = current->rlim[RLIMIT_DATA].rlim_cur; + if (rlim >= RLIM_INFINITY) + rlim = ~0; + if (ex.a_data + ex.a_bss > rlim) + return -ENOMEM; + + /* Flush all traces of the currently running executable */ + retval = flush_old_exec(bprm); + if (retval) + return retval; + + regs->cs = __USER32_CS; + regs->r8 = regs->r9 = regs->r10 = regs->r11 = regs->r12 = + regs->r13 = regs->r14 = regs->r15 = 0; + set_thread_flag(TIF_IA32); + + /* OK, This is the point of no return */ + set_personality(PER_LINUX); + + current->mm->end_code = ex.a_text + + (current->mm->start_code = N_TXTADDR(ex)); + current->mm->end_data = ex.a_data + + (current->mm->start_data = N_DATADDR(ex)); + current->mm->brk = ex.a_bss + + (current->mm->start_brk = N_BSSADDR(ex)); + current->mm->free_area_cache = TASK_UNMAPPED_BASE; + + current->mm->rss = 0; + current->mm->mmap = NULL; + compute_creds(bprm); + current->flags &= ~PF_FORKNOEXEC; + + if (N_MAGIC(ex) == OMAGIC) { + unsigned long text_addr, map_size; + loff_t pos; + + text_addr = N_TXTADDR(ex); + + pos = 32; + map_size = ex.a_text+ex.a_data; + + error = do_brk(text_addr & PAGE_MASK, map_size); + if (error != (text_addr & PAGE_MASK)) { + send_sig(SIGKILL, current, 0); + return error; + } + + error = bprm->file->f_op->read(bprm->file, (char *)text_addr, + ex.a_text+ex.a_data, &pos); + if ((signed long)error < 0) { + send_sig(SIGKILL, current, 0); + return error; + } + + flush_icache_range(text_addr, text_addr+ex.a_text+ex.a_data); + } else { +#ifdef WARN_OLD + static unsigned long error_time, error_time2; + if ((ex.a_text & 0xfff || ex.a_data & 0xfff) && + (N_MAGIC(ex) != NMAGIC) && (jiffies-error_time2) > 5*HZ) + { + printk(KERN_NOTICE "executable not page aligned\n"); + error_time2 = jiffies; + } + + if ((fd_offset & ~PAGE_MASK) != 0 && + (jiffies-error_time) > 5*HZ) + { + 
printk(KERN_WARNING + "fd_offset is not page aligned. Please convert program: %s\n", + bprm->file->f_dentry->d_name.name); + error_time = jiffies; + } +#endif + + if (!bprm->file->f_op->mmap||((fd_offset & ~PAGE_MASK) != 0)) { + loff_t pos = fd_offset; + do_brk(N_TXTADDR(ex), ex.a_text+ex.a_data); + bprm->file->f_op->read(bprm->file,(char *)N_TXTADDR(ex), + ex.a_text+ex.a_data, &pos); + flush_icache_range((unsigned long) N_TXTADDR(ex), + (unsigned long) N_TXTADDR(ex) + + ex.a_text+ex.a_data); + goto beyond_if; + } + + down_write(&current->mm->mmap_sem); + error = do_mmap(bprm->file, N_TXTADDR(ex), ex.a_text, + PROT_READ | PROT_EXEC, + MAP_FIXED | MAP_PRIVATE | MAP_DENYWRITE | MAP_EXECUTABLE | MAP_32BIT, + fd_offset); + up_write(&current->mm->mmap_sem); + + if (error != N_TXTADDR(ex)) { + send_sig(SIGKILL, current, 0); + return error; + } + + down_write(&current->mm->mmap_sem); + error = do_mmap(bprm->file, N_DATADDR(ex), ex.a_data, + PROT_READ | PROT_WRITE | PROT_EXEC, + MAP_FIXED | MAP_PRIVATE | MAP_DENYWRITE | MAP_EXECUTABLE | MAP_32BIT, + fd_offset + ex.a_text); + up_write(&current->mm->mmap_sem); + if (error != N_DATADDR(ex)) { + send_sig(SIGKILL, current, 0); + return error; + } + } +beyond_if: + set_binfmt(&aout_format); + + set_brk(current->mm->start_brk, current->mm->brk); + + retval = ia32_setup_arg_pages(bprm); + if (retval < 0) { + /* Someone check-me: is this error path enough? */ + send_sig(SIGKILL, current, 0); + return retval; + } + + current->mm->start_stack = + (unsigned long) create_aout_tables((char *) bprm->p, bprm); + /* start thread */ + asm volatile("movl %0,%%fs" :: "r" (0)); \ + asm volatile("movl %0,%%es; movl %0,%%ds": :"r" (__USER32_DS)); + load_gs_index(0); + (regs)->rip = ex.a_entry; + (regs)->rsp = current->mm->start_stack; + (regs)->eflags = 0x200; + (regs)->cs = __USER32_CS; + (regs)->ss = __USER32_DS; + set_fs(USER_DS); + if (unlikely(current->ptrace & PT_PTRACED)) { + if (current->ptrace & PT_TRACE_EXEC) + ptrace_notify ((PTRACE_EVENT_EXEC << 8) | SIGTRAP); + else + send_sig(SIGTRAP, current, 0); + } + return 0; +} + +static int load_aout_library(struct file *file) +{ + struct inode * inode; + unsigned long bss, start_addr, len; + unsigned long error; + int retval; + struct exec ex; + + inode = file->f_dentry->d_inode; + + retval = -ENOEXEC; + error = kernel_read(file, 0, (char *) &ex, sizeof(ex)); + if (error != sizeof(ex)) + goto out; + + /* We come in here for the regular a.out style of shared libraries */ + if ((N_MAGIC(ex) != ZMAGIC && N_MAGIC(ex) != QMAGIC) || N_TRSIZE(ex) || + N_DRSIZE(ex) || ((ex.a_entry & 0xfff) && N_MAGIC(ex) == ZMAGIC) || + i_size_read(inode) < ex.a_text+ex.a_data+N_SYMSIZE(ex)+N_TXTOFF(ex)) { + goto out; + } + + if (N_FLAGS(ex)) + goto out; + + /* For QMAGIC, the starting address is 0x20 into the page. We mask + this off to get the starting address for the page */ + + start_addr = ex.a_entry & 0xfffff000; + + if ((N_TXTOFF(ex) & ~PAGE_MASK) != 0) { + loff_t pos = N_TXTOFF(ex); + +#ifdef WARN_OLD + static unsigned long error_time; + if ((jiffies-error_time) > 5*HZ) + { + printk(KERN_WARNING + "N_TXTOFF is not page aligned. Please convert library: %s\n", + file->f_dentry->d_name.name); + error_time = jiffies; + } +#endif + + do_brk(start_addr, ex.a_text + ex.a_data + ex.a_bss); + + file->f_op->read(file, (char *)start_addr, + ex.a_text + ex.a_data, &pos); + flush_icache_range((unsigned long) start_addr, + (unsigned long) start_addr + ex.a_text + ex.a_data); + + retval = 0; + goto out; + } + /* Now use mmap to map the library into memory.
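 *
 * (Editorial aside, not part of the patch: text and data are mapped in a
 * single MAP_FIXED mapping at start_addr, and any bss that extends past
 * the page-aligned end of that mapping is obtained with do_brk(). With
 * hypothetical sizes a_text=0x3000, a_data=0x1800, a_bss=0x2000:
 * len = PAGE_ALIGN(0x4800) = 0x5000, bss end = 0x6800, so an extra
 * 0x1800 bytes are set up at start_addr + 0x5000.)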
*/ + down_write(&current->mm->mmap_sem); + error = do_mmap(file, start_addr, ex.a_text + ex.a_data, + PROT_READ | PROT_WRITE | PROT_EXEC, + MAP_FIXED | MAP_PRIVATE | MAP_DENYWRITE | MAP_32BIT, + N_TXTOFF(ex)); + up_write(&current->mm->mmap_sem); + retval = error; + if (error != start_addr) + goto out; + + len = PAGE_ALIGN(ex.a_text + ex.a_data); + bss = ex.a_text + ex.a_data + ex.a_bss; + if (bss > len) { + error = do_brk(start_addr + len, bss - len); + retval = error; + if (error != start_addr + len) + goto out; + } + retval = 0; +out: + return retval; +} + +static int __init init_aout_binfmt(void) +{ + return register_binfmt(&aout_format); +} + +static void __exit exit_aout_binfmt(void) +{ + unregister_binfmt(&aout_format); +} + +module_init(init_aout_binfmt); +module_exit(exit_aout_binfmt); +MODULE_LICENSE("GPL"); --- diff/drivers/block/cfq-iosched.c 1970-01-01 01:00:00.000000000 +0100 +++ source/drivers/block/cfq-iosched.c 2003-11-26 10:09:05.000000000 +0000 @@ -0,0 +1,707 @@ +/* + * linux/drivers/block/cfq-iosched.c + * + * CFQ, or complete fairness queueing, disk scheduler. + * + * Based on ideas from a previously unfinished io + * scheduler (round robin per-process disk scheduling) and Andrea Arcangeli. + * + * Copyright (C) 2003 Jens Axboe <axboe@suse.de> + */ +#include <linux/kernel.h> +#include <linux/fs.h> +#include <linux/blkdev.h> +#include <linux/elevator.h> +#include <linux/bio.h> +#include <linux/config.h> +#include <linux/module.h> +#include <linux/slab.h> +#include <linux/init.h> +#include <linux/compiler.h> +#include <linux/hash.h> +#include <linux/rbtree.h> +#include <linux/mempool.h> + +/* + * tunables + */ +static int cfq_quantum = 4; +static int cfq_queued = 8; + +#define CFQ_QHASH_SHIFT 6 +#define CFQ_QHASH_ENTRIES (1 << CFQ_QHASH_SHIFT) +#define list_entry_qhash(entry) list_entry((entry), struct cfq_queue, cfq_hash) + +#define CFQ_MHASH_SHIFT 8 +#define CFQ_MHASH_BLOCK(sec) ((sec) >> 3) +#define CFQ_MHASH_ENTRIES (1 << CFQ_MHASH_SHIFT) +#define CFQ_MHASH_FN(sec) (hash_long(CFQ_MHASH_BLOCK((sec)),CFQ_MHASH_SHIFT)) +#define ON_MHASH(crq) !list_empty(&(crq)->hash) +#define rq_hash_key(rq) ((rq)->sector + (rq)->nr_sectors) +#define list_entry_hash(ptr) list_entry((ptr), struct cfq_rq, hash) + +#define list_entry_cfqq(ptr) list_entry((ptr), struct cfq_queue, cfq_list) + +#define RQ_DATA(rq) ((struct cfq_rq *) (rq)->elevator_private) + +static kmem_cache_t *crq_pool; +static kmem_cache_t *cfq_pool; +static mempool_t *cfq_mpool; + +struct cfq_data { + struct list_head rr_list; + struct list_head *dispatch; + struct list_head *cfq_hash; + + struct list_head *crq_hash; + + unsigned int busy_queues; + unsigned int max_queued; + + mempool_t *crq_pool; +}; + +struct cfq_queue { + struct list_head cfq_hash; + struct list_head cfq_list; + struct rb_root sort_list; + int pid; + int queued[2]; +#if 0 + /* + * with a simple addition like this, we can do io priorities. almost. + * does need a split request free list, too. + */ + int io_prio +#endif +}; + +struct cfq_rq { + struct rb_node rb_node; + sector_t rb_key; + + struct request *request; + + struct cfq_queue *cfq_queue; + + struct list_head hash; +}; + +static void cfq_put_queue(struct cfq_data *cfqd, struct cfq_queue *cfqq); +static struct cfq_queue *cfq_find_cfq_hash(struct cfq_data *cfqd, int pid); +static void cfq_dispatch_sort(struct list_head *head, struct cfq_rq *crq); + +/* + * lots of deadline iosched dupes, can be abstracted later...
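 *
 * (Editorial aside, not part of the patch: the overall shape is one
 * cfq_queue per process group, hashed by tgid in cfq_hash, each holding
 * its pending requests in a per-queue rbtree sorted by sector. Queues
 * with work stand on rr_list, and cfq_dispatch_requests() moves one
 * request from each busy queue per pass -- round-robin service, which is
 * where the "complete fairness" comes from. crq_hash is only a back-merge
 * lookup table keyed by a request's end sector, as in deadline.)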
+ */ +static inline void __cfq_del_crq_hash(struct cfq_rq *crq) +{ + list_del_init(&crq->hash); +} + +static inline void cfq_del_crq_hash(struct cfq_rq *crq) +{ + if (ON_MHASH(crq)) + __cfq_del_crq_hash(crq); +} + +static void cfq_remove_merge_hints(request_queue_t *q, struct cfq_rq *crq) +{ + cfq_del_crq_hash(crq); + + if (q->last_merge == crq->request) + q->last_merge = NULL; +} + +static inline void cfq_add_crq_hash(struct cfq_data *cfqd, struct cfq_rq *crq) +{ + struct request *rq = crq->request; + + BUG_ON(ON_MHASH(crq)); + + list_add(&crq->hash, &cfqd->crq_hash[CFQ_MHASH_FN(rq_hash_key(rq))]); +} + +static struct request *cfq_find_rq_hash(struct cfq_data *cfqd, sector_t offset) +{ + struct list_head *hash_list = &cfqd->crq_hash[CFQ_MHASH_FN(offset)]; + struct list_head *entry, *next = hash_list->next; + + while ((entry = next) != hash_list) { + struct cfq_rq *crq = list_entry_hash(entry); + struct request *__rq = crq->request; + + next = entry->next; + + BUG_ON(!ON_MHASH(crq)); + + if (!rq_mergeable(__rq)) { + __cfq_del_crq_hash(crq); + continue; + } + + if (rq_hash_key(__rq) == offset) + return __rq; + } + + return NULL; +} + +/* + * rb tree support functions + */ +#define RB_NONE (2) +#define RB_EMPTY(node) ((node)->rb_node == NULL) +#define RB_CLEAR(node) ((node)->rb_color = RB_NONE) +#define RB_CLEAR_ROOT(root) ((root)->rb_node = NULL) +#define ON_RB(node) ((node)->rb_color != RB_NONE) +#define rb_entry_crq(node) rb_entry((node), struct cfq_rq, rb_node) +#define rq_rb_key(rq) (rq)->sector + +static inline void cfq_del_crq_rb(struct cfq_queue *cfqq, struct cfq_rq *crq) +{ + if (ON_RB(&crq->rb_node)) { + cfqq->queued[rq_data_dir(crq->request)]--; + rb_erase(&crq->rb_node, &cfqq->sort_list); + crq->cfq_queue = NULL; + } +} + +static struct cfq_rq * +__cfq_add_crq_rb(struct cfq_queue *cfqq, struct cfq_rq *crq) +{ + struct rb_node **p = &cfqq->sort_list.rb_node; + struct rb_node *parent = NULL; + struct cfq_rq *__crq; + + while (*p) { + parent = *p; + __crq = rb_entry_crq(parent); + + if (crq->rb_key < __crq->rb_key) + p = &(*p)->rb_left; + else if (crq->rb_key > __crq->rb_key) + p = &(*p)->rb_right; + else + return __crq; + } + + rb_link_node(&crq->rb_node, parent, p); + return 0; +} + +static void +cfq_add_crq_rb(struct cfq_data *cfqd, struct cfq_queue *cfqq,struct cfq_rq *crq) +{ + struct request *rq = crq->request; + struct cfq_rq *__alias; + + crq->rb_key = rq_rb_key(rq); + cfqq->queued[rq_data_dir(rq)]++; +retry: + __alias = __cfq_add_crq_rb(cfqq, crq); + if (!__alias) { + rb_insert_color(&crq->rb_node, &cfqq->sort_list); + crq->cfq_queue = cfqq; + return; + } + + cfq_del_crq_rb(cfqq, __alias); + cfq_dispatch_sort(cfqd->dispatch, __alias); + goto retry; +} + +static struct request * +cfq_find_rq_rb(struct cfq_data *cfqd, sector_t sector) +{ + struct cfq_queue *cfqq = cfq_find_cfq_hash(cfqd, current->tgid); + struct rb_node *n; + + if (!cfqq) + goto out; + + n = cfqq->sort_list.rb_node; + while (n) { + struct cfq_rq *crq = rb_entry_crq(n); + + if (sector < crq->rb_key) + n = n->rb_left; + else if (sector > crq->rb_key) + n = n->rb_right; + else + return crq->request; + } + +out: + return NULL; +} + +static void cfq_remove_request(request_queue_t *q, struct request *rq) +{ + struct cfq_data *cfqd = q->elevator.elevator_data; + struct cfq_rq *crq = RQ_DATA(rq); + + if (crq) { + struct cfq_queue *cfqq = crq->cfq_queue; + + cfq_remove_merge_hints(q, crq); + list_del_init(&rq->queuelist); + + if (cfqq) { + cfq_del_crq_rb(cfqq, crq); + + if (RB_EMPTY(&cfqq->sort_list)) + 
cfq_put_queue(cfqd, cfqq); + } + } +} + +static int +cfq_merge(request_queue_t *q, struct request **req, struct bio *bio) +{ + struct cfq_data *cfqd = q->elevator.elevator_data; + struct request *__rq; + int ret; + + ret = elv_try_last_merge(q, bio); + if (ret != ELEVATOR_NO_MERGE) { + __rq = q->last_merge; + goto out_insert; + } + + __rq = cfq_find_rq_hash(cfqd, bio->bi_sector); + if (__rq) { + BUG_ON(__rq->sector + __rq->nr_sectors != bio->bi_sector); + + if (elv_rq_merge_ok(__rq, bio)) { + ret = ELEVATOR_BACK_MERGE; + goto out; + } + } + + __rq = cfq_find_rq_rb(cfqd, bio->bi_sector + bio_sectors(bio)); + if (__rq) { + if (elv_rq_merge_ok(__rq, bio)) { + ret = ELEVATOR_FRONT_MERGE; + goto out; + } + } + + return ELEVATOR_NO_MERGE; +out: + q->last_merge = __rq; +out_insert: + *req = __rq; + return ret; +} + +static void cfq_merged_request(request_queue_t *q, struct request *req) +{ + struct cfq_data *cfqd = q->elevator.elevator_data; + struct cfq_rq *crq = RQ_DATA(req); + + cfq_del_crq_hash(crq); + cfq_add_crq_hash(cfqd, crq); + + if (ON_RB(&crq->rb_node) && (rq_rb_key(req) != crq->rb_key)) { + struct cfq_queue *cfqq = crq->cfq_queue; + + cfq_del_crq_rb(cfqq, crq); + cfq_add_crq_rb(cfqd, cfqq, crq); + } + + q->last_merge = req; +} + +static void +cfq_merged_requests(request_queue_t *q, struct request *req, + struct request *next) +{ + cfq_merged_request(q, req); + cfq_remove_request(q, next); +} + +static void cfq_dispatch_sort(struct list_head *head, struct cfq_rq *crq) +{ + struct list_head *entry = head; + struct request *__rq; + + if (!list_empty(head)) { + __rq = list_entry_rq(head->next); + + if (crq->request->sector < __rq->sector) { + entry = head->prev; + goto link; + } + } + + while ((entry = entry->prev) != head) { + __rq = list_entry_rq(entry); + + if (crq->request->sector <= __rq->sector) + break; + } + +link: + list_add_tail(&crq->request->queuelist, entry); +} + +static inline void +__cfq_dispatch_requests(request_queue_t *q, struct cfq_data *cfqd, + struct cfq_queue *cfqq) +{ + struct cfq_rq *crq = rb_entry_crq(rb_first(&cfqq->sort_list)); + + cfq_del_crq_rb(cfqq, crq); + cfq_remove_merge_hints(q, crq); + cfq_dispatch_sort(cfqd->dispatch, crq); +} + +static int cfq_dispatch_requests(request_queue_t *q, struct cfq_data *cfqd) +{ + struct cfq_queue *cfqq; + struct list_head *entry, *tmp; + int ret, queued, good_queues; + + if (list_empty(&cfqd->rr_list)) + return 0; + + queued = ret = 0; +restart: + good_queues = 0; + list_for_each_safe(entry, tmp, &cfqd->rr_list) { + cfqq = list_entry_cfqq(cfqd->rr_list.next); + + BUG_ON(RB_EMPTY(&cfqq->sort_list)); + + __cfq_dispatch_requests(q, cfqd, cfqq); + + if (RB_EMPTY(&cfqq->sort_list)) + cfq_put_queue(cfqd, cfqq); + else + good_queues++; + + queued++; + ret = 1; + } + + if ((queued < cfq_quantum) && good_queues) + goto restart; + + return ret; +} + +static struct request *cfq_next_request(request_queue_t *q) +{ + struct cfq_data *cfqd = q->elevator.elevator_data; + struct request *rq; + + if (!list_empty(cfqd->dispatch)) { + struct cfq_rq *crq; +dispatch: + rq = list_entry_rq(cfqd->dispatch->next); + + BUG_ON(q->last_merge == rq); + crq = RQ_DATA(rq); + if (crq) + BUG_ON(ON_MHASH(crq)); + + return rq; + } + + if (cfq_dispatch_requests(q, cfqd)) + goto dispatch; + + return NULL; +} + +static inline struct cfq_queue * +__cfq_find_cfq_hash(struct cfq_data *cfqd, int pid, const int hashval) +{ + struct list_head *hash_list = &cfqd->cfq_hash[hashval]; + struct list_head *entry; + + list_for_each(entry, hash_list) { + struct cfq_queue 
*__cfqq = list_entry_qhash(entry); + + if (__cfqq->pid == pid) + return __cfqq; + } + + return NULL; +} + +static struct cfq_queue *cfq_find_cfq_hash(struct cfq_data *cfqd, int pid) +{ + const int hashval = hash_long(current->tgid, CFQ_QHASH_SHIFT); + + return __cfq_find_cfq_hash(cfqd, pid, hashval); +} + +static void cfq_put_queue(struct cfq_data *cfqd, struct cfq_queue *cfqq) +{ + cfqd->busy_queues--; + list_del(&cfqq->cfq_list); + list_del(&cfqq->cfq_hash); + mempool_free(cfqq, cfq_mpool); +} + +static struct cfq_queue *cfq_get_queue(struct cfq_data *cfqd, int pid) +{ + const int hashval = hash_long(current->tgid, CFQ_QHASH_SHIFT); + struct cfq_queue *cfqq = __cfq_find_cfq_hash(cfqd, pid, hashval); + + if (!cfqq) { + cfqq = mempool_alloc(cfq_mpool, GFP_NOIO); + + INIT_LIST_HEAD(&cfqq->cfq_hash); + INIT_LIST_HEAD(&cfqq->cfq_list); + RB_CLEAR_ROOT(&cfqq->sort_list); + + cfqq->pid = pid; + cfqq->queued[0] = cfqq->queued[1] = 0; + list_add(&cfqq->cfq_hash, &cfqd->cfq_hash[hashval]); + } + + return cfqq; +} + +static void cfq_enqueue(struct cfq_data *cfqd, struct cfq_rq *crq) +{ + struct cfq_queue *cfqq; + + cfqq = cfq_get_queue(cfqd, current->tgid); + + cfq_add_crq_rb(cfqd, cfqq, crq); + + if (list_empty(&cfqq->cfq_list)) { + list_add(&cfqq->cfq_list, &cfqd->rr_list); + cfqd->busy_queues++; + } +} + +static void +cfq_insert_request(request_queue_t *q, struct request *rq, int where) +{ + struct cfq_data *cfqd = q->elevator.elevator_data; + struct cfq_rq *crq = RQ_DATA(rq); + + switch (where) { + case ELEVATOR_INSERT_BACK: + while (cfq_dispatch_requests(q, cfqd)) + ; + list_add_tail(&rq->queuelist, cfqd->dispatch); + break; + case ELEVATOR_INSERT_FRONT: + list_add(&rq->queuelist, cfqd->dispatch); + break; + case ELEVATOR_INSERT_SORT: + BUG_ON(!blk_fs_request(rq)); + cfq_enqueue(cfqd, crq); + break; + default: + printk("%s: bad insert point %d\n", __FUNCTION__,where); + return; + } + + if (rq_mergeable(rq)) { + cfq_add_crq_hash(cfqd, crq); + + if (!q->last_merge) + q->last_merge = rq; + } +} + +static int cfq_queue_empty(request_queue_t *q) +{ + struct cfq_data *cfqd = q->elevator.elevator_data; + + if (list_empty(cfqd->dispatch) && list_empty(&cfqd->rr_list)) + return 1; + + return 0; +} + +static struct request * +cfq_former_request(request_queue_t *q, struct request *rq) +{ + struct cfq_rq *crq = RQ_DATA(rq); + struct rb_node *rbprev = rb_prev(&crq->rb_node); + + if (rbprev) + return rb_entry_crq(rbprev)->request; + + return NULL; +} + +static struct request * +cfq_latter_request(request_queue_t *q, struct request *rq) +{ + struct cfq_rq *crq = RQ_DATA(rq); + struct rb_node *rbnext = rb_next(&crq->rb_node); + + if (rbnext) + return rb_entry_crq(rbnext)->request; + + return NULL; +} + +static int cfq_may_queue(request_queue_t *q, int rw) +{ + struct cfq_data *cfqd = q->elevator.elevator_data; + struct cfq_queue *cfqq; + int ret = 1; + + if (!cfqd->busy_queues) + goto out; + + cfqq = cfq_find_cfq_hash(cfqd, current->tgid); + if (cfqq) { + int limit = (q->nr_requests - cfq_queued) / cfqd->busy_queues; + + if (limit < 3) + limit = 3; + else if (limit > cfqd->max_queued) + limit = cfqd->max_queued; + + if (cfqq->queued[rw] > limit) + ret = 0; + } +out: + return ret; +} + +static void cfq_put_request(request_queue_t *q, struct request *rq) +{ + struct cfq_data *cfqd = q->elevator.elevator_data; + struct cfq_rq *crq = RQ_DATA(rq); + + if (crq) { + BUG_ON(q->last_merge == rq); + BUG_ON(ON_MHASH(crq)); + + mempool_free(crq, cfqd->crq_pool); + rq->elevator_private = NULL; + } +} + +static int 
cfq_set_request(request_queue_t *q, struct request *rq, int gfp_mask) +{ + struct cfq_data *cfqd = q->elevator.elevator_data; + struct cfq_rq *crq = mempool_alloc(cfqd->crq_pool, gfp_mask); + + if (crq) { + RB_CLEAR(&crq->rb_node); + crq->request = rq; + crq->cfq_queue = NULL; + INIT_LIST_HEAD(&crq->hash); + rq->elevator_private = crq; + return 0; + } + + return 1; +} + +static void cfq_exit(request_queue_t *q, elevator_t *e) +{ + struct cfq_data *cfqd = e->elevator_data; + + e->elevator_data = NULL; + mempool_destroy(cfqd->crq_pool); + kfree(cfqd->crq_hash); + kfree(cfqd->cfq_hash); + kfree(cfqd); +} + +static int cfq_init(request_queue_t *q, elevator_t *e) +{ + struct cfq_data *cfqd; + int i; + + cfqd = kmalloc(sizeof(*cfqd), GFP_KERNEL); + if (!cfqd) + return -ENOMEM; + + memset(cfqd, 0, sizeof(*cfqd)); + INIT_LIST_HEAD(&cfqd->rr_list); + + cfqd->crq_hash = kmalloc(sizeof(struct list_head) * CFQ_MHASH_ENTRIES, GFP_KERNEL); + if (!cfqd->crq_hash) + goto out_crqhash; + + cfqd->cfq_hash = kmalloc(sizeof(struct list_head) * CFQ_QHASH_ENTRIES, GFP_KERNEL); + if (!cfqd->cfq_hash) + goto out_cfqhash; + + cfqd->crq_pool = mempool_create(BLKDEV_MIN_RQ, mempool_alloc_slab, mempool_free_slab, crq_pool); + if (!cfqd->crq_pool) + goto out_crqpool; + + for (i = 0; i < CFQ_MHASH_ENTRIES; i++) + INIT_LIST_HEAD(&cfqd->crq_hash[i]); + for (i = 0; i < CFQ_QHASH_ENTRIES; i++) + INIT_LIST_HEAD(&cfqd->cfq_hash[i]); + + cfqd->dispatch = &q->queue_head; + e->elevator_data = cfqd; + + /* + * just set it to some high value, we want anyone to be able to queue + * some requests. fairness is handled differently + */ + cfqd->max_queued = q->nr_requests; + q->nr_requests = 8192; + + return 0; +out_crqpool: + kfree(cfqd->cfq_hash); +out_cfqhash: + kfree(cfqd->crq_hash); +out_crqhash: + kfree(cfqd); + return -ENOMEM; +} + +static int __init cfq_slab_setup(void) +{ + crq_pool = kmem_cache_create("crq_pool", sizeof(struct cfq_rq), 0, 0, + NULL, NULL); + + if (!crq_pool) + panic("cfq_iosched: can't init crq pool\n"); + + cfq_pool = kmem_cache_create("cfq_pool", sizeof(struct cfq_queue), 0, 0, + NULL, NULL); + + if (!cfq_pool) + panic("cfq_iosched: can't init cfq pool\n"); + + cfq_mpool = mempool_create(64, mempool_alloc_slab, mempool_free_slab, cfq_pool); + + if (!cfq_mpool) + panic("cfq_iosched: can't init cfq mpool\n"); + + return 0; +} + +subsys_initcall(cfq_slab_setup); + +elevator_t iosched_cfq = { + .elevator_name = "cfq", + .elevator_merge_fn = cfq_merge, + .elevator_merged_fn = cfq_merged_request, + .elevator_merge_req_fn = cfq_merged_requests, + .elevator_next_req_fn = cfq_next_request, + .elevator_add_req_fn = cfq_insert_request, + .elevator_remove_req_fn = cfq_remove_request, + .elevator_queue_empty_fn = cfq_queue_empty, + .elevator_former_req_fn = cfq_former_request, + .elevator_latter_req_fn = cfq_latter_request, + .elevator_set_req_fn = cfq_set_request, + .elevator_put_req_fn = cfq_put_request, + .elevator_may_queue_fn = cfq_may_queue, + .elevator_init_fn = cfq_init, + .elevator_exit_fn = cfq_exit, +}; + +EXPORT_SYMBOL(iosched_cfq); --- diff/drivers/ide/pci/sgiioc4.c 1970-01-01 01:00:00.000000000 +0100 +++ source/drivers/ide/pci/sgiioc4.c 2003-11-26 10:09:05.000000000 +0000 @@ -0,0 +1,833 @@ +/* + * Copyright (c) 2003 Silicon Graphics, Inc. All Rights Reserved. + * + * This program is free software; you can redistribute it and/or modify it + * under the terms of version 2 of the GNU General Public License + * as published by the Free Software Foundation. 
+ * + * This program is distributed in the hope that it would be useful, but + * WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. + * + * You should have received a copy of the GNU General Public + * License along with this program; if not, write the Free Software + * Foundation, Inc., 59 Temple Place - Suite 330, Boston MA 02111-1307, USA. + * + * Contact information: Silicon Graphics, Inc., 1600 Amphitheatre Pkwy, + * Mountain View, CA 94043, or: + * + * http://www.sgi.com + * + * For further information regarding this notice, see: + * + * http://oss.sgi.com/projects/GenInfo/NoticeExplan + */ + +#include <linux/config.h> +#include <linux/module.h> +#include <linux/types.h> +#include <linux/pci.h> +#include <linux/delay.h> +#include <linux/hdreg.h> +#include <linux/init.h> +#include <linux/kernel.h> +#include <linux/timer.h> +#include <linux/mm.h> +#include <linux/ioport.h> +#include <linux/blkdev.h> +#include <asm/io.h> + +#include <linux/ide.h> + +/* IOC4 Specific Definitions */ +#define IOC4_CMD_OFFSET 0x100 +#define IOC4_CTRL_OFFSET 0x120 +#define IOC4_DMA_OFFSET 0x140 +#define IOC4_INTR_OFFSET 0x0 + +#define IOC4_TIMING 0x00 +#define IOC4_DMA_PTR_L 0x01 +#define IOC4_DMA_PTR_H 0x02 +#define IOC4_DMA_ADDR_L 0x03 +#define IOC4_DMA_ADDR_H 0x04 +#define IOC4_BC_DEV 0x05 +#define IOC4_BC_MEM 0x06 +#define IOC4_DMA_CTRL 0x07 +#define IOC4_DMA_END_ADDR 0x08 + +/* Bits in the IOC4 Control/Status Register */ +#define IOC4_S_DMA_START 0x01 +#define IOC4_S_DMA_STOP 0x02 +#define IOC4_S_DMA_DIR 0x04 +#define IOC4_S_DMA_ACTIVE 0x08 +#define IOC4_S_DMA_ERROR 0x10 +#define IOC4_ATA_MEMERR 0x02 + +/* Read/Write Directions */ +#define IOC4_DMA_WRITE 0x04 +#define IOC4_DMA_READ 0x00 + +/* Interrupt Register Offsets */ +#define IOC4_INTR_REG 0x03 +#define IOC4_INTR_SET 0x05 +#define IOC4_INTR_CLEAR 0x07 + +#define IOC4_IDE_CACHELINE_SIZE 128 +#define IOC4_CMD_CTL_BLK_SIZE 0x20 +#define IOC4_SUPPORTED_FIRMWARE_REV 46 + +typedef struct { + u32 timing_reg0; + u32 timing_reg1; + u32 low_mem_ptr; + u32 high_mem_ptr; + u32 low_mem_addr; + u32 high_mem_addr; + u32 dev_byte_count; + u32 mem_byte_count; + u32 status; +} ioc4_dma_regs_t; + +/* Each Physical Region Descriptor Entry size is 16 bytes (2 * 64 bits) */ +/* IOC4 has only 1 IDE channel */ +#define IOC4_PRD_BYTES 16 +#define IOC4_PRD_ENTRIES (PAGE_SIZE /(4*IOC4_PRD_BYTES)) + + +static void +sgiioc4_init_hwif_ports(hw_regs_t * hw, unsigned long data_port, + unsigned long ctrl_port, unsigned long irq_port) +{ + unsigned long reg = data_port; + int i; + + /* Registers are word (32 bit) aligned */ + for (i = IDE_DATA_OFFSET; i <= IDE_STATUS_OFFSET; i++) + hw->io_ports[i] = reg + i * 4; + + if (ctrl_port) + hw->io_ports[IDE_CONTROL_OFFSET] = ctrl_port; + + if (irq_port) + hw->io_ports[IDE_IRQ_OFFSET] = irq_port; +} + +static void +sgiioc4_maskproc(ide_drive_t * drive, int mask) +{ + ide_hwif_t *hwif = HWIF(drive); + hwif->OUTB(mask ? 
(drive->ctl | 2) : (drive->ctl & ~2), + IDE_CONTROL_REG); +} + + +static int +sgiioc4_checkirq(ide_hwif_t * hwif) +{ + u8 intr_reg = + hwif->INL(hwif->io_ports[IDE_IRQ_OFFSET] + IOC4_INTR_REG * 4); + + if (intr_reg & 0x03) + return 1; + + return 0; +} + + +static int +sgiioc4_clearirq(ide_drive_t * drive) +{ + u32 intr_reg; + ide_hwif_t *hwif = HWIF(drive); + unsigned long other_ir = + hwif->io_ports[IDE_IRQ_OFFSET] + (IOC4_INTR_REG << 2); + + /* Code to check for PCI error conditions */ + intr_reg = hwif->INL(other_ir); + if (intr_reg & 0x03) { /* Valid IOC4-IDE interrupt */ + /* + * Using hwif->INB to read the IDE_STATUS_REG has a side effect + * of clearing the interrupt. The first read should clear it + * if it is set. The second read should return a "clear" status + * if it got cleared. If not, then spin for a bit trying to + * clear it. + */ + u8 stat = hwif->INB(IDE_STATUS_REG); + int count = 0; + stat = hwif->INB(IDE_STATUS_REG); + while ((stat & 0x80) && (count++ < 100)) { + udelay(1); + stat = hwif->INB(IDE_STATUS_REG); + } + + if (intr_reg & 0x02) { + /* Error when transferring DMA data on PCI bus */ + u32 pci_err_addr_low, pci_err_addr_high, + pci_stat_cmd_reg; + + pci_err_addr_low = + hwif->INL(hwif->io_ports[IDE_IRQ_OFFSET]); + pci_err_addr_high = + hwif->INL(hwif->io_ports[IDE_IRQ_OFFSET] + 4); + pci_read_config_dword(hwif->pci_dev, PCI_COMMAND, + &pci_stat_cmd_reg); + printk(KERN_ERR + "%s(%s) : PCI Bus Error when doing DMA:" + " status-cmd reg is 0x%x\n", + __FUNCTION__, drive->name, pci_stat_cmd_reg); + printk(KERN_ERR + "%s(%s) : PCI Error Address is 0x%x%x\n", + __FUNCTION__, drive->name, + pci_err_addr_high, pci_err_addr_low); + /* Clear the PCI Error indicator */ + pci_write_config_dword(hwif->pci_dev, PCI_COMMAND, + 0x00000146); + } + + /* Clear the Interrupt, Error bits on the IOC4 */ + hwif->OUTL(0x03, other_ir); + + intr_reg = hwif->INL(other_ir); + } + + return intr_reg & 3; +} + +static int +sgiioc4_ide_dma_begin(ide_drive_t * drive) +{ + ide_hwif_t *hwif = HWIF(drive); + unsigned int reg = hwif->INL(hwif->dma_base + IOC4_DMA_CTRL * 4); + unsigned int temp_reg = reg | IOC4_S_DMA_START; + + hwif->OUTL(temp_reg, hwif->dma_base + IOC4_DMA_CTRL * 4); + + return 0; +} + +static u32 +sgiioc4_ide_dma_stop(ide_hwif_t *hwif, u64 dma_base) +{ + u32 ioc4_dma; + int count; + + count = 0; + ioc4_dma = hwif->INL(dma_base + IOC4_DMA_CTRL * 4); + while ((ioc4_dma & IOC4_S_DMA_STOP) && (count++ < 200)) { + udelay(1); + ioc4_dma = hwif->INL(dma_base + IOC4_DMA_CTRL * 4); + } + return ioc4_dma; +} + +/* Stops the IOC4 DMA Engine */ +static int +sgiioc4_ide_dma_end(ide_drive_t * drive) +{ + u32 ioc4_dma, bc_dev, bc_mem, num, valid = 0, cnt = 0; + ide_hwif_t *hwif = HWIF(drive); + u64 dma_base = hwif->dma_base; + int dma_stat = 0; + unsigned long *ending_dma = (unsigned long *) hwif->dma_base2; + + hwif->OUTL(IOC4_S_DMA_STOP, dma_base + IOC4_DMA_CTRL * 4); + + ioc4_dma = sgiioc4_ide_dma_stop(hwif, dma_base); + + if (ioc4_dma & IOC4_S_DMA_STOP) { + printk(KERN_ERR + "%s(%s): IOC4 DMA STOP bit is still 1 :" + "ioc4_dma_reg 0x%x\n", + __FUNCTION__, drive->name, ioc4_dma); + dma_stat = 1; + } + + /* + * The IOC4 will DMA 1's to the ending dma area to indicate that + * previous data DMA is complete. This is necessary because of relaxed + * ordering between register reads and DMA writes on the Altix. 
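 *
 * (Editorial aside, not part of the patch: the poll below scans the
 * ending-dma area -- IOC4_IDE_CACHELINE_SIZE bytes at hwif->dma_base2 --
 * for any nonzero word, retrying for up to roughly 200us. A readback of
 * the DMA control register alone would not do, since it does not order
 * the device's memory writes with respect to the CPU on this platform.)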
+ */ + while ((cnt++ < 200) && (!valid)) { + for (num = 0; num < 16; num++) { + if (ending_dma[num]) { + valid = 1; + break; + } + } + udelay(1); + } + if (!valid) { + printk(KERN_ERR "%s(%s) : DMA incomplete\n", __FUNCTION__, + drive->name); + dma_stat = 1; + } + + bc_dev = hwif->INL(dma_base + IOC4_BC_DEV * 4); + bc_mem = hwif->INL(dma_base + IOC4_BC_MEM * 4); + + if ((bc_dev & 0x01FF) || (bc_mem & 0x1FF)) { + if (bc_dev > bc_mem + 8) { + printk(KERN_ERR + "%s(%s): WARNING!! byte_count_dev %d " + "!= byte_count_mem %d\n", + __FUNCTION__, drive->name, bc_dev, bc_mem); + } + } + + drive->waiting_for_dma = 0; + ide_destroy_dmatable(drive); + + return dma_stat; +} + +static int +sgiioc4_ide_dma_check(ide_drive_t * drive) +{ + if (ide_config_drive_speed(drive, XFER_MW_DMA_2) != 0) { + printk(KERN_INFO + "Could not set %s in Multimode-2 DMA mode | " + "Drive %s using PIO instead\n", + drive->name, drive->name); + drive->using_dma = 0; + } else + drive->using_dma = 1; + + return 0; +} + +static int +sgiioc4_ide_dma_on(ide_drive_t * drive) +{ + drive->using_dma = 1; + + return HWIF(drive)->ide_dma_host_on(drive); +} + +static int +sgiioc4_ide_dma_off(ide_drive_t * drive) +{ + printk(KERN_INFO "%s: DMA disabled\n", drive->name); + + return HWIF(drive)->ide_dma_off_quietly(drive); +} + +static int +sgiioc4_ide_dma_off_quietly(ide_drive_t * drive) +{ + drive->using_dma = 0; + + return HWIF(drive)->ide_dma_host_off(drive); +} + +/* returns 1 if dma irq issued, 0 otherwise */ +static int +sgiioc4_ide_dma_test_irq(ide_drive_t * drive) +{ + return sgiioc4_checkirq(HWIF(drive)); +} + +static int +sgiioc4_ide_dma_host_on(ide_drive_t * drive) +{ + if (drive->using_dma) + return 0; + + return 1; +} + +static int +sgiioc4_ide_dma_host_off(ide_drive_t * drive) +{ + sgiioc4_clearirq(drive); + + return 0; +} + +static int +sgiioc4_ide_dma_verbose(ide_drive_t * drive) +{ + if (drive->using_dma == 1) + printk(", UDMA(16)"); + else + printk(", PIO"); + + return 1; +} + +static int +sgiioc4_ide_dma_lostirq(ide_drive_t * drive) +{ + HWIF(drive)->resetproc(drive); + + return __ide_dma_lostirq(drive); +} + +static void +sgiioc4_resetproc(ide_drive_t * drive) +{ + sgiioc4_ide_dma_end(drive); + sgiioc4_clearirq(drive); +} + +static u8 +sgiioc4_INB(unsigned long port) +{ + u8 reg = (u8) inb(port); + + if ((port & 0xFFF) == 0x11C) { /* Status register of IOC4 */ + if (reg & 0x51) { /* Not busy...check for interrupt */ + unsigned long other_ir = port - 0x110; + unsigned int intr_reg = (u32) inl(other_ir); + + /* Clear the Interrupt, Error bits on the IOC4 */ + if (intr_reg & 0x03) { + outl(0x03, other_ir); + intr_reg = (u32) inl(other_ir); + } + } + } + + return reg; +} + +/* Creates a dma map for the scatter-gather list entries */ +static void __init +ide_dma_sgiioc4(ide_hwif_t * hwif, unsigned long dma_base) +{ + int num_ports = sizeof (ioc4_dma_regs_t); + + printk(KERN_INFO "%s: BM-DMA at 0x%04lx-0x%04lx\n", hwif->name, + dma_base, dma_base + num_ports - 1); + + if (!request_region(dma_base, num_ports, hwif->name)) { + printk(KERN_ERR + "%s(%s) -- ERROR, Addresses 0x%p to 0x%p " + "ALREADY in use\n", + __FUNCTION__, hwif->name, (void *) dma_base, + (void *) dma_base + num_ports - 1); + goto dma_alloc_failure; + } + + hwif->dma_base = dma_base; + hwif->dmatable_cpu = pci_alloc_consistent(hwif->pci_dev, + IOC4_PRD_ENTRIES * IOC4_PRD_BYTES, + &hwif->dmatable_dma); + + if (!hwif->dmatable_cpu) + goto dma_alloc_failure; + + hwif->sg_table = + kmalloc(sizeof (struct scatterlist) * IOC4_PRD_ENTRIES, GFP_KERNEL); + + if
(!hwif->sg_table) + goto dma_sgalloc_failure; + + hwif->dma_base2 = (unsigned long) + pci_alloc_consistent(hwif->pci_dev, + IOC4_IDE_CACHELINE_SIZE, + (dma_addr_t *) &(hwif->dma_status)); + + if (!hwif->dma_base2) + goto dma_base2alloc_failure; + + return; + +dma_base2alloc_failure: + kfree(hwif->sg_table); + +dma_sgalloc_failure: + pci_free_consistent(hwif->pci_dev, + IOC4_PRD_ENTRIES * IOC4_PRD_BYTES, + hwif->dmatable_cpu, hwif->dmatable_dma); + printk(KERN_INFO + "%s() -- Error! Unable to allocate DMA Maps for drive %s\n", + __FUNCTION__, hwif->name); + printk(KERN_INFO + "Changing from DMA to PIO mode for Drive %s\n", hwif->name); + +dma_alloc_failure: + /* Disable DMA because we could not allocate any DMA maps */ + hwif->autodma = 0; + hwif->atapi_dma = 0; +} + +/* Initializes the IOC4 DMA Engine */ +static void +sgiioc4_configure_for_dma(int dma_direction, ide_drive_t * drive) +{ + u32 ioc4_dma; + ide_hwif_t *hwif = HWIF(drive); + u64 dma_base = hwif->dma_base; + u32 dma_addr, ending_dma_addr; + + ioc4_dma = hwif->INL(dma_base + IOC4_DMA_CTRL * 4); + + if (ioc4_dma & IOC4_S_DMA_ACTIVE) { + printk(KERN_WARNING + "%s(%s):Warning!! DMA from previous transfer was still active\n", + __FUNCTION__, drive->name); + hwif->OUTL(IOC4_S_DMA_STOP, dma_base + IOC4_DMA_CTRL * 4); + ioc4_dma = sgiioc4_ide_dma_stop(hwif, dma_base); + + if (ioc4_dma & IOC4_S_DMA_STOP) + printk(KERN_ERR + "%s(%s) : IOC4 Dma STOP bit is still 1\n", + __FUNCTION__, drive->name); + } + + ioc4_dma = hwif->INL(dma_base + IOC4_DMA_CTRL * 4); + if (ioc4_dma & IOC4_S_DMA_ERROR) { + printk(KERN_WARNING + "%s(%s) : Warning!! - DMA Error during Previous" + " transfer | status 0x%x\n", + __FUNCTION__, drive->name, ioc4_dma); + hwif->OUTL(IOC4_S_DMA_STOP, dma_base + IOC4_DMA_CTRL * 4); + ioc4_dma = sgiioc4_ide_dma_stop(hwif, dma_base); + + if (ioc4_dma & IOC4_S_DMA_STOP) + printk(KERN_ERR + "%s(%s) : IOC4 DMA STOP bit is still 1\n", + __FUNCTION__, drive->name); + } + + /* Address of the Scatter Gather List */ + dma_addr = cpu_to_le32(hwif->dmatable_dma); + hwif->OUTL(dma_addr, dma_base + IOC4_DMA_PTR_L * 4); + + /* Address of the Ending DMA */ + memset((unsigned int *) hwif->dma_base2, 0, IOC4_IDE_CACHELINE_SIZE); + ending_dma_addr = cpu_to_le32(hwif->dma_status); + hwif->OUTL(ending_dma_addr, dma_base + IOC4_DMA_END_ADDR * 4); + + hwif->OUTL(dma_direction, dma_base + IOC4_DMA_CTRL * 4); + drive->waiting_for_dma = 1; +} + +/* IOC4 Scatter Gather list Format */ +/* 128 Bit entries to support 64 bit addresses in the future */ +/* The Scatter Gather list Entry should be in the BIG-ENDIAN Format */ +/* --------------------------------------------------------------------- */ +/* | Upper 32 bits - Zero | Lower 32 bits- address | */ +/* --------------------------------------------------------------------- */ +/* | Upper 32 bits - Zero |EOL| 15 unused | 16 Bit Length| */ +/* --------------------------------------------------------------------- */ +/* Creates the scatter gather list, DMA Table */ +static unsigned int +sgiioc4_build_dma_table(ide_drive_t * drive, struct request *rq, int ddir) +{ + ide_hwif_t *hwif = HWIF(drive); + unsigned int *table = hwif->dmatable_cpu; + unsigned int count = 0, i = 1; + struct scatterlist *sg; + + if (HWGROUP(drive)->rq->flags & REQ_DRIVE_TASKFILE) + hwif->sg_nents = i = ide_raw_build_sglist(drive, rq); + else + hwif->sg_nents = i = ide_build_sglist(drive, rq); + + if (!i) + return 0; /* sglist of length Zero */ + + sg = hwif->sg_table; + while (i && sg_dma_len(sg)) { + dma_addr_t cur_addr; + int cur_len;
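		/*
		 * (Editorial note, not part of the patch: the splitting
		 * below keeps any single PRD entry from crossing a 64 KiB
		 * address boundary -- bcount is the distance from cur_addr
		 * to the next 64 KiB line, capped at cur_len. A full
		 * 0x10000 span yields an xcount of 0, which the hardware
		 * presumably interprets as 64 KiB, as in standard IDE PRD
		 * tables.)
		 */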
+ cur_addr = sg_dma_address(sg); + cur_len = sg_dma_len(sg); + + while (cur_len) { + if (count++ >= IOC4_PRD_ENTRIES) { + printk(KERN_WARNING + "%s: DMA table too small\n", + drive->name); + goto use_pio_instead; + } else { + u32 xcount, bcount = + 0x10000 - (cur_addr & 0xffff); + + if (bcount > cur_len) + bcount = cur_len; + + /* put the addr, length in + * the IOC4 dma-table format */ + *table = 0x0; + table++; + *table = cpu_to_be32(cur_addr); + table++; + *table = 0x0; + table++; + + xcount = bcount & 0xffff; + *table = cpu_to_be32(xcount); + table++; + + cur_addr += bcount; + cur_len -= bcount; + } + } + + sg++; + i--; + } + + if (count) { + table--; + *table |= cpu_to_be32(0x80000000); + return count; + } + +use_pio_instead: + pci_unmap_sg(hwif->pci_dev, hwif->sg_table, hwif->sg_nents, + hwif->sg_dma_direction); + hwif->sg_dma_active = 0; + + return 0; /* revert to PIO for this request */ +} + +static int +sgiioc4_ide_dma_read(ide_drive_t * drive) +{ + struct request *rq = HWGROUP(drive)->rq; + unsigned int count = 0; + + if (!(count = sgiioc4_build_dma_table(drive, rq, PCI_DMA_FROMDEVICE))) { + /* try PIO instead of DMA */ + return 1; + } + /* Writes FROM the IOC4 TO Main Memory */ + sgiioc4_configure_for_dma(IOC4_DMA_WRITE, drive); + + return 0; +} + +static int +sgiioc4_ide_dma_write(ide_drive_t * drive) +{ + struct request *rq = HWGROUP(drive)->rq; + unsigned int count = 0; + + if (!(count = sgiioc4_build_dma_table(drive, rq, PCI_DMA_TODEVICE))) { + /* try PIO instead of DMA */ + return 1; + } + + sgiioc4_configure_for_dma(IOC4_DMA_READ, drive); + /* Writes TO the IOC4 FROM Main Memory */ + + return 0; +} + +static void __init +ide_init_sgiioc4(ide_hwif_t * hwif) +{ + hwif->mmio = 2; + hwif->autodma = 1; + hwif->atapi_dma = 1; + hwif->ultra_mask = 0x0; /* Disable Ultra DMA */ + hwif->mwdma_mask = 0x2; /* Multimode-2 DMA */ + hwif->swdma_mask = 0x2; + hwif->identify = NULL; + hwif->tuneproc = NULL; /* Sets timing for PIO mode */ + hwif->speedproc = NULL; /* Sets timing for DMA &/or PIO modes */ + hwif->selectproc = NULL;/* Use the default routine to select drive */ + hwif->reset_poll = NULL;/* No HBA specific reset_poll needed */ + hwif->pre_reset = NULL; /* No HBA specific pre_reset needed */ + hwif->resetproc = &sgiioc4_resetproc;/* Reset DMA engine, + clear interrupts */ + hwif->intrproc = NULL; /* Enable or Disable interrupt from drive */ + hwif->maskproc = &sgiioc4_maskproc; /* Mask on/off NIEN register */ + hwif->quirkproc = NULL; + hwif->busproc = NULL; + + hwif->ide_dma_read = &sgiioc4_ide_dma_read; + hwif->ide_dma_write = &sgiioc4_ide_dma_write; + hwif->ide_dma_begin = &sgiioc4_ide_dma_begin; + hwif->ide_dma_end = &sgiioc4_ide_dma_end; + hwif->ide_dma_check = &sgiioc4_ide_dma_check; + hwif->ide_dma_on = &sgiioc4_ide_dma_on; + hwif->ide_dma_off = &sgiioc4_ide_dma_off; + hwif->ide_dma_off_quietly = &sgiioc4_ide_dma_off_quietly; + hwif->ide_dma_test_irq = &sgiioc4_ide_dma_test_irq; + hwif->ide_dma_host_on = &sgiioc4_ide_dma_host_on; + hwif->ide_dma_host_off = &sgiioc4_ide_dma_host_off; + hwif->ide_dma_bad_drive = &__ide_dma_bad_drive; + hwif->ide_dma_good_drive = &__ide_dma_good_drive; + hwif->ide_dma_count = &__ide_dma_count; + hwif->ide_dma_verbose = &sgiioc4_ide_dma_verbose; + hwif->ide_dma_retune = &__ide_dma_retune; + hwif->ide_dma_lostirq = &sgiioc4_ide_dma_lostirq; + hwif->ide_dma_timeout = &__ide_dma_timeout; + hwif->INB = &sgiioc4_INB; +} + +static int __init +sgiioc4_ide_setup_pci_device(struct pci_dev *dev, ide_pci_device_t * d) +{ + unsigned long base, ctl,
dma_base, irqport; + ide_hwif_t *hwif; + int h; + + for (h = 0; h < MAX_HWIFS; ++h) { + hwif = &ide_hwifs[h]; + /* Find an empty HWIF */ + if (hwif->chipset == ide_unknown) + break; + } + + /* Get the CmdBlk and CtrlBlk Base Registers */ + base = pci_resource_start(dev, 0) + IOC4_CMD_OFFSET; + ctl = pci_resource_start(dev, 0) + IOC4_CTRL_OFFSET; + irqport = pci_resource_start(dev, 0) + IOC4_INTR_OFFSET; + dma_base = pci_resource_start(dev, 0) + IOC4_DMA_OFFSET; + + if (!request_region(base, IOC4_CMD_CTL_BLK_SIZE, hwif->name)) { + printk(KERN_ERR + "%s : %s -- ERROR, Port Addresses " + "0x%p to 0x%p ALREADY in use\n", + __FUNCTION__, hwif->name, (void *) base, + (void *) base + IOC4_CMD_CTL_BLK_SIZE); + return 1; + } + + if (hwif->io_ports[IDE_DATA_OFFSET] != base) { + /* Initialize the IO registers */ + sgiioc4_init_hwif_ports(&hwif->hw, base, ctl, irqport); + memcpy(hwif->io_ports, hwif->hw.io_ports, + sizeof (hwif->io_ports)); + hwif->noprobe = !hwif->io_ports[IDE_DATA_OFFSET]; + } + + hwif->irq = dev->irq; + hwif->chipset = ide_pci; + hwif->pci_dev = dev; + hwif->channel = 0; /* Single Channel chip */ + hwif->cds = (struct ide_pci_device_s *) d; + hwif->gendev.parent = &dev->dev;/* setup proper ancestral information */ + + /* Initializing chipset IRQ Registers */ + hwif->OUTL(0x03, irqport + IOC4_INTR_SET * 4); + + ide_init_sgiioc4(hwif); + + if (dma_base) + ide_dma_sgiioc4(hwif, dma_base); + else + printk(KERN_INFO "%s: %s Bus-Master DMA disabled\n", + hwif->name, d->name); + + probe_hwif_init(hwif); + return 0; +} + +/* This ensures that we can build this for generic kernels without + * having all the SN2 code sync'd and merged. + */ +typedef enum pciio_endian_e { + PCIDMA_ENDIAN_BIG, + PCIDMA_ENDIAN_LITTLE +} pciio_endian_t; +pciio_endian_t __attribute__ ((weak)) snia_pciio_endian_set(struct pci_dev + *pci_dev, pciio_endian_t device_end, + pciio_endian_t desired_end); + +static unsigned int __init +pci_init_sgiioc4(struct pci_dev *dev, ide_pci_device_t * d) +{ + unsigned int class_rev; + + if (pci_enable_device(dev)) { + printk(KERN_ERR + "Failed to enable device %s at slot %s\n", + d->name, dev->slot_name); + return 1; + } + pci_set_master(dev); + + pci_read_config_dword(dev, PCI_CLASS_REVISION, &class_rev); + class_rev &= 0xff; + printk(KERN_INFO "%s: IDE controller at PCI slot %s, revision %d\n", + d->name, dev->slot_name, class_rev); + if (class_rev < IOC4_SUPPORTED_FIRMWARE_REV) { + printk(KERN_ERR "Skipping %s IDE controller in slot %s: " + "firmware is obsolete - please upgrade to revision " + "46 or higher\n", d->name, dev->slot_name); + return 1; + } + + /* Enable Byte Swapping in the PIC... */ + if (snia_pciio_endian_set) { + snia_pciio_endian_set(dev, PCIDMA_ENDIAN_LITTLE, + PCIDMA_ENDIAN_BIG); + } else { + printk(KERN_ERR + "Failed to set endianness for device %s at slot %s\n", + d->name, dev->slot_name); + return 1; + } + + return sgiioc4_ide_setup_pci_device(dev, d); +} + +static ide_pci_device_t sgiioc4_chipsets[] __devinitdata = { + { + /* Channel 0 */ + .vendor = PCI_VENDOR_ID_SGI, + .device = PCI_DEVICE_ID_SGI_IOC4, + .name = "SGIIOC4", + .init_hwif = ide_init_sgiioc4, + .init_dma = ide_dma_sgiioc4, + .channels = 1, + .autodma = AUTODMA, + /* SGI IOC4 doesn't have enablebits.
*/ + .bootable = ON_BOARD, + } +}; + +static int __devinit +sgiioc4_init_one(struct pci_dev *dev, const struct pci_device_id *id) +{ + ide_pci_device_t *d = &sgiioc4_chipsets[id->driver_data]; + if (dev->device != d->device) { + printk(KERN_ERR "Error in %s(dev 0x%p | id 0x%p )\n", + __FUNCTION__, (void *) dev, (void *) id); + BUG(); + } + + if (pci_init_sgiioc4(dev, d)) + return 0; + + MOD_INC_USE_COUNT; + + return 0; +} + +static struct pci_device_id sgiioc4_pci_tbl[] = { + {PCI_VENDOR_ID_SGI, PCI_DEVICE_ID_SGI_IOC4, PCI_ANY_ID, + PCI_ANY_ID, 0x0b4000, 0xFFFFFF, 0}, + {0} +}; + +static struct pci_driver driver = { + .name = "SGI-IOC4 IDE", + .id_table = sgiioc4_pci_tbl, + .probe = sgiioc4_init_one, +}; + +static int +sgiioc4_ide_init(void) +{ + return ide_pci_register_driver(&driver); +} + +static void +sgiioc4_ide_exit(void) +{ + ide_pci_unregister_driver(&driver); +} + +module_init(sgiioc4_ide_init); +module_exit(sgiioc4_ide_exit); + +MODULE_AUTHOR("Aniket Malatpure - Silicon Graphics Inc. (SGI)"); +MODULE_DESCRIPTION("PCI driver module for SGI IOC4 Base-IO Card"); +MODULE_LICENSE("GPL"); --- diff/drivers/net/forcedeth.c 1970-01-01 01:00:00.000000000 +0100 +++ source/drivers/net/forcedeth.c 2003-11-26 10:09:06.000000000 +0000 @@ -0,0 +1,1420 @@ +/* + * forcedeth: Ethernet driver for NVIDIA nForce media access controllers. + * + * Note: This driver is a cleanroom reimplementation based on reverse + * engineered documentation written by Carl-Daniel Hailfinger + * and Andrew de Quincey. It's neither supported nor endorsed + * by NVIDIA Corp. Use at your own risk. + * + * NVIDIA, nForce and other NVIDIA marks are trademarks or registered + * trademarks of NVIDIA Corporation in the United States and other + * countries. + * + * Copyright (C) 2003 Manfred Spraul + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + * + * Changelog: + * 0.01: 05 Oct 2003: First release that compiles without warnings. + * 0.02: 05 Oct 2003: Fix bug for drain_tx: do not try to free NULL skbs. + * Check all PCI BARs for the register window. + * udelay added to mii_rw. + * 0.03: 06 Oct 2003: Initialize dev->irq. + * 0.04: 07 Oct 2003: Initialize np->lock, reduce handled irqs, add printks. + * 0.05: 09 Oct 2003: printk removed again, irq status print tx_timeout. + * 0.06: 10 Oct 2003: MAC Address read updated, pff flag generation updated, + * irq mask updated + * 0.07: 14 Oct 2003: Further irq mask updates. + * 0.08: 20 Oct 2003: rx_desc.Length initialization added, alloc_rx refill + * added into irq handler, NULL check for drain_ring. + * 0.09: 20 Oct 2003: Basic link speed irq implementation. Only handle the + * requested interrupt sources. + * 0.10: 20 Oct 2003: First cleanup for release. + * 0.11: 21 Oct 2003: hexdump for tx added, rx buffer sizes increased. + * MAC Address init fix, set_multicast cleanup. 
+ * 0.12: 23 Oct 2003: Cleanups for release. + * 0.13: 25 Oct 2003: Limit for concurrent tx packets increased to 10. + * Set link speed correctly. start rx before starting + * tx (start_rx sets the link speed). + * 0.14: 25 Oct 2003: Nic dependent irq mask. + * 0.15: 08 Nov 2003: fix smp deadlock with set_multicast_list during + * open. + * 0.16: 15 Nov 2003: include file cleanup for ppc64, rx buffer size + * increased to 1628 bytes. + * 0.17: 16 Nov 2003: undo rx buffer size increase. Subtract 1 from + * the tx length. + * 0.18: 17 Nov 2003: fix oops due to late initialization of dev_stats + * + * Known bugs: + * The irq handling is wrong - no tx done interrupts are generated. + * This means recovery from netif_stop_queue only happens in the hw timer + * interrupt (1/2 second), or if an rx packet arrives by chance. + */ +#define FORCEDETH_VERSION "0.18" + +#include <linux/module.h> +#include <linux/types.h> +#include <linux/pci.h> +#include <linux/netdevice.h> +#include <linux/etherdevice.h> +#include <linux/delay.h> +#include <linux/spinlock.h> +#include <linux/ethtool.h> +#include <linux/timer.h> +#include <linux/skbuff.h> +#include <linux/mii.h> +#include <linux/random.h> +#include <linux/init.h> + +#include <asm/io.h> +#include <asm/uaccess.h> +#include <asm/system.h> + +#if 0 +#define dprintk printk +#else +#define dprintk(x...) do { } while (0) +#endif + + +/* + * Hardware access: + */ + +#define DEV_NEED_LASTPACKET1 0x0001 +#define DEV_IRQMASK_1 0x0002 +#define DEV_IRQMASK_2 0x0004 + +enum { + NvRegIrqStatus = 0x000, +#define NVREG_IRQSTAT_MIIEVENT 0x040 +#define NVREG_IRQSTAT_MASK 0x1ff + NvRegIrqMask = 0x004, +#define NVREG_IRQ_UNKNOWN 0x0005 +#define NVREG_IRQ_RX 0x0002 +#define NVREG_IRQ_TX2 0x0010 +#define NVREG_IRQ_TIMER 0x0020 +#define NVREG_IRQ_LINK 0x0040 +#define NVREG_IRQ_TX1 0x0100 +#define NVREG_IRQMASK_WANTED_1 0x005f +#define NVREG_IRQMASK_WANTED_2 0x0147 + + NvRegUnknownSetupReg6 = 0x008, +#define NVREG_UNKSETUP6_VAL 3 + + NvRegPollingInterval = 0x00c, + NvRegMisc1 = 0x080, +#define NVREG_MISC1_HD 0x02 +#define NVREG_MISC1_FORCE 0x3b0f3c + + NvRegTransmitterControl = 0x084, +#define NVREG_XMITCTL_START 0x01 + NvRegTransmitterStatus = 0x088, +#define NVREG_XMITSTAT_BUSY 0x01 + + NvRegPacketFilterFlags = 0x8c, +#define NVREG_PFF_ALWAYS 0x7F0008 +#define NVREG_PFF_PROMISC 0x80 +#define NVREG_PFF_MYADDR 0x20 + + NvRegOffloadConfig = 0x90, +#define NVREG_OFFLOAD_HOMEPHY 0x601 +#define NVREG_OFFLOAD_NORMAL 0x5ee + NvRegReceiverControl = 0x094, +#define NVREG_RCVCTL_START 0x01 + NvRegReceiverStatus = 0x98, +#define NVREG_RCVSTAT_BUSY 0x01 + + NvRegRandomSeed = 0x9c, +#define NVREG_RNDSEED_MASK 0x00ff +#define NVREG_RNDSEED_FORCE 0x7f00 + + NvRegUnknownSetupReg1 = 0xA0, +#define NVREG_UNKSETUP1_VAL 0x16070f + NvRegUnknownSetupReg2 = 0xA4, +#define NVREG_UNKSETUP2_VAL 0x16 + NvRegMacAddrA = 0xA8, + NvRegMacAddrB = 0xAC, + NvRegMulticastAddrA = 0xB0, +#define NVREG_MCASTADDRA_FORCE 0x01 + NvRegMulticastAddrB = 0xB4, + NvRegMulticastMaskA = 0xB8, + NvRegMulticastMaskB = 0xBC, + + NvRegTxRingPhysAddr = 0x100, + NvRegRxRingPhysAddr = 0x104, + NvRegRingSizes = 0x108, +#define NVREG_RINGSZ_TXSHIFT 0 +#define NVREG_RINGSZ_RXSHIFT 16 + NvRegUnknownTransmitterReg = 0x10c, + NvRegLinkSpeed = 0x110, +#define NVREG_LINKSPEED_FORCE 0x10000 +#define NVREG_LINKSPEED_10 10 +#define NVREG_LINKSPEED_100 100 +#define NVREG_LINKSPEED_1000 1000 + NvRegUnknownSetupReg5 = 0x130, +#define NVREG_UNKSETUP5_BIT31 (1<<31) + NvRegUnknownSetupReg3 = 0x134, +#define NVREG_UNKSETUP3_VAL1 0x200010 +
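	/*
	 * (Editorial note, not part of the patch: each NvReg* enumerator is
	 * an offset into the NIC's register window, and the #defines that
	 * follow an enumerator are the bit masks and magic values observed
	 * for that register, used as in this hypothetical sketch:
	 *
	 *	writel(NVREG_TXRXCTL_KICK, base + NvRegTxRxControl);
	 *
	 * with base the ioremapped register window.)
	 */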
+	NvRegTxRxControl = 0x144,
+#define NVREG_TXRXCTL_KICK	0x0001
+#define NVREG_TXRXCTL_BIT1	0x0002
+#define NVREG_TXRXCTL_BIT2	0x0004
+#define NVREG_TXRXCTL_IDLE	0x0008
+#define NVREG_TXRXCTL_RESET	0x0010
+	NvRegMIIStatus = 0x180,
+#define NVREG_MIISTAT_ERROR		0x0001
+#define NVREG_MIISTAT_LINKCHANGE	0x0008
+#define NVREG_MIISTAT_MASK		0x000f
+#define NVREG_MIISTAT_MASK2		0x000f
+	NvRegUnknownSetupReg4 = 0x184,
+#define NVREG_UNKSETUP4_VAL	8
+
+	NvRegAdapterControl = 0x188,
+#define NVREG_ADAPTCTL_START	0x02
+#define NVREG_ADAPTCTL_LINKUP	0x04
+#define NVREG_ADAPTCTL_PHYVALID	0x4000
+#define NVREG_ADAPTCTL_RUNNING	0x100000
+#define NVREG_ADAPTCTL_PHYSHIFT	24
+	NvRegMIISpeed = 0x18c,
+#define NVREG_MIISPEED_BIT8	(1<<8)
+#define NVREG_MIIDELAY	5
+	NvRegMIIControl = 0x190,
+#define NVREG_MIICTL_INUSE	0x10000
+#define NVREG_MIICTL_WRITE	0x08000
+#define NVREG_MIICTL_ADDRSHIFT	5
+	NvRegMIIData = 0x194,
+	NvRegWakeUpFlags = 0x200,
+#define NVREG_WAKEUPFLAGS_VAL		0x7770
+#define NVREG_WAKEUPFLAGS_BUSYSHIFT	24
+#define NVREG_WAKEUPFLAGS_ENABLESHIFT	16
+#define NVREG_WAKEUPFLAGS_D3SHIFT	12
+#define NVREG_WAKEUPFLAGS_D2SHIFT	8
+#define NVREG_WAKEUPFLAGS_D1SHIFT	4
+#define NVREG_WAKEUPFLAGS_D0SHIFT	0
+#define NVREG_WAKEUPFLAGS_ACCEPT_MAGPAT		0x01
+#define NVREG_WAKEUPFLAGS_ACCEPT_WAKEUPPAT	0x02
+#define NVREG_WAKEUPFLAGS_ACCEPT_LINKCHANGE	0x04
+
+	NvRegPatternCRC = 0x204,
+	NvRegPatternMask = 0x208,
+	NvRegPowerCap = 0x268,
+#define NVREG_POWERCAP_D3SUPP	(1<<30)
+#define NVREG_POWERCAP_D2SUPP	(1<<26)
+#define NVREG_POWERCAP_D1SUPP	(1<<25)
+	NvRegPowerState = 0x26c,
+#define NVREG_POWERSTATE_POWEREDUP	0x8000
+#define NVREG_POWERSTATE_VALID		0x0100
+#define NVREG_POWERSTATE_MASK		0x0003
+#define NVREG_POWERSTATE_D0		0x0000
+#define NVREG_POWERSTATE_D1		0x0001
+#define NVREG_POWERSTATE_D2		0x0002
+#define NVREG_POWERSTATE_D3		0x0003
+};
+
+struct ring_desc {
+	u32 PacketBuffer;
+	u16 Length;
+	u16 Flags;
+};
+
+#define NV_TX_LASTPACKET	(1<<0)
+#define NV_TX_RETRYERROR	(1<<3)
+#define NV_TX_LASTPACKET1	(1<<8)
+#define NV_TX_DEFERRED		(1<<10)
+#define NV_TX_CARRIERLOST	(1<<11)
+#define NV_TX_LATECOLLISION	(1<<12)
+#define NV_TX_UNDERFLOW		(1<<13)
+#define NV_TX_ERROR		(1<<14)
+#define NV_TX_VALID		(1<<15)
+
+#define NV_RX_DESCRIPTORVALID	(1<<0)
+#define NV_RX_MISSEDFRAME	(1<<1)
+#define NV_RX_SUBSTRACT1	(1<<3)
+#define NV_RX_ERROR1		(1<<7)
+#define NV_RX_ERROR2		(1<<8)
+#define NV_RX_ERROR3		(1<<9)
+#define NV_RX_ERROR4		(1<<10)
+#define NV_RX_CRCERR		(1<<11)
+#define NV_RX_OVERFLOW		(1<<12)
+#define NV_RX_FRAMINGERR	(1<<13)
+#define NV_RX_ERROR		(1<<14)
+#define NV_RX_AVAIL		(1<<15)
+
+/* Miscellaneous hardware related defines: */
+#define NV_PCI_REGSZ		0x270
+
+/* various timeout delays: all in usec */
+#define NV_TXRX_RESET_DELAY	4
+#define NV_TXSTOP_DELAY1	10
+#define NV_TXSTOP_DELAY1MAX	500000
+#define NV_TXSTOP_DELAY2	100
+#define NV_RXSTOP_DELAY1	10
+#define NV_RXSTOP_DELAY1MAX	500000
+#define NV_RXSTOP_DELAY2	100
+#define NV_SETUP5_DELAY		5
+#define NV_SETUP5_DELAYMAX	50000
+#define NV_POWERUP_DELAY	5
+#define NV_POWERUP_DELAYMAX	5000
+#define NV_MIIBUSY_DELAY	50
+#define NV_MIIPHY_DELAY	10
+#define NV_MIIPHY_DELAYMAX	10000
+
+#define NV_WAKEUPPATTERNS	5
+#define NV_WAKEUPMASKENTRIES	4
+
+/* General driver defaults */
+#define NV_WATCHDOG_TIMEO	(2*HZ)
+#define DEFAULT_MTU		1500	/* also maximum supported, at least for now */
+
+#define RX_RING		128
+#define TX_RING		16
+/* limited to 1 packet until we understand NV_TX_LASTPACKET */
+#define TX_LIMIT_STOP	10
+#define TX_LIMIT_START	5
+
+/* rx/tx mac addr + type + vlan + align + slack*/
+#define RX_NIC_BUFSIZE		(DEFAULT_MTU + 64)
+/* even more slack */
+#define RX_ALLOC_BUFSIZE	(DEFAULT_MTU + 128)
+
+#define OOM_REFILL	(1+HZ/20)
+/*
+ * SMP locking:
+ * All hardware access under dev->priv->lock, except the performance
+ * critical parts:
+ * - rx is (pseudo-) lockless: it relies on the single-threading provided
+ *	by the arch code for interrupts.
+ * - tx setup is lockless: it relies on dev->xmit_lock. Actual submission
+ *	needs dev->priv->lock :-(
+ * - set_multicast_list: preparation lockless, relies on dev->xmit_lock.
+ */
+
+/* in dev: base, irq */
+struct fe_priv {
+	spinlock_t lock;
+
+	/* General data:
+	 * Locking: spin_lock(&np->lock); */
+	struct net_device_stats stats;
+	int in_shutdown;
+	u32 linkspeed;
+	int duplex;
+	int phyaddr;
+
+	/* General data: RO fields */
+	dma_addr_t ring_addr;
+	struct pci_dev *pci_dev;
+	u32 orig_mac[2];
+	u32 irqmask;
+
+	/* rx specific fields.
+	 * Locking: Within irq handler or disable_irq+spin_lock(&np->lock);
+	 */
+	struct ring_desc *rx_ring;
+	unsigned int cur_rx, refill_rx;
+	struct sk_buff *rx_skbuff[RX_RING];
+	dma_addr_t rx_dma[RX_RING];
+	unsigned int rx_buf_sz;
+	struct timer_list oom_kick;
+
+	/*
+	 * tx specific fields.
+	 */
+	struct ring_desc *tx_ring;
+	unsigned int next_tx, nic_tx;
+	struct sk_buff *tx_skbuff[TX_RING];
+	dma_addr_t tx_dma[TX_RING];
+	u16 tx_flags;
+};
+
+static inline struct fe_priv *get_nvpriv(struct net_device *dev)
+{
+	return (struct fe_priv *) dev->priv;
+}
+
+static inline u8 *get_hwbase(struct net_device *dev)
+{
+	return (u8 *) dev->base_addr;
+}
+
+static inline void pci_push(u8 * base)
+{
+	/* force out pending posted writes */
+	readl(base);
+}
+
+static int reg_delay(struct net_device *dev, int offset, u32 mask, u32 target,
+				int delay, int delaymax, const char *msg)
+{
+	u8 *base = get_hwbase(dev);
+
+	pci_push(base);
+	do {
+		udelay(delay);
+		delaymax -= delay;
+		if (delaymax < 0) {
+			if (msg)
+				printk(msg);
+			return 1;
+		}
+	} while ((readl(base + offset) & mask) != target);
+	return 0;
+}
+
+#define MII_READ	(-1)
+/* mii_rw: read/write a register on the PHY.
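+ *
+ * (Pass MII_READ (-1) as the value argument to request a read; the return
+ * value is then the register contents, or -1 on a timeout or error.)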
+ * + * Caller must guarantee serialization + */ +static int mii_rw(struct net_device *dev, int addr, int miireg, int value) +{ + u8 *base = get_hwbase(dev); + int was_running; + u32 reg; + int retval; + + writel(NVREG_MIISTAT_MASK, base + NvRegMIIStatus); + was_running = 0; + reg = readl(base + NvRegAdapterControl); + if (reg & NVREG_ADAPTCTL_RUNNING) { + was_running = 1; + writel(reg & ~NVREG_ADAPTCTL_RUNNING, base + NvRegAdapterControl); + } + reg = readl(base + NvRegMIIControl); + if (reg & NVREG_MIICTL_INUSE) { + writel(NVREG_MIICTL_INUSE, base + NvRegMIIControl); + udelay(NV_MIIBUSY_DELAY); + } + + reg = NVREG_MIICTL_INUSE | (addr << NVREG_MIICTL_ADDRSHIFT) | miireg; + if (value != MII_READ) { + writel(value, base + NvRegMIIData); + reg |= NVREG_MIICTL_WRITE; + } + writel(reg, base + NvRegMIIControl); + + if (reg_delay(dev, NvRegMIIControl, NVREG_MIICTL_INUSE, 0, + NV_MIIPHY_DELAY, NV_MIIPHY_DELAYMAX, NULL)) { + dprintk(KERN_DEBUG "%s: mii_rw of reg %d at PHY %d timed out.\n", + dev->name, miireg, addr); + retval = -1; + } else if (value != MII_READ) { + /* it was a write operation - fewer failures are detectable */ + dprintk(KERN_DEBUG "%s: mii_rw wrote 0x%x to reg %d at PHY %d\n", + dev->name, value, miireg, addr); + retval = 0; + } else if (readl(base + NvRegMIIStatus) & NVREG_MIISTAT_ERROR) { + dprintk(KERN_DEBUG "%s: mii_rw of reg %d at PHY %d failed.\n", + dev->name, miireg, addr); + retval = -1; + } else { + /* FIXME: why is that required? */ + udelay(50); + retval = readl(base + NvRegMIIData); + dprintk(KERN_DEBUG "%s: mii_rw read from reg %d at PHY %d: 0x%x.\n", + dev->name, miireg, addr, retval); + } + if (was_running) { + reg = readl(base + NvRegAdapterControl); + writel(reg | NVREG_ADAPTCTL_RUNNING, base + NvRegAdapterControl); + } + return retval; +} + +static void start_rx(struct net_device *dev) +{ + struct fe_priv *np = get_nvpriv(dev); + u8 *base = get_hwbase(dev); + + dprintk(KERN_DEBUG "%s: start_rx\n", dev->name); + /* Already running? Stop it. 
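+	 * (on this reverse engineered hardware, writing NVREG_RCVCTL_START
+	 * while the receiver is already running apparently stops it)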
*/ + if (readl(base + NvRegReceiverControl) & NVREG_RCVCTL_START) { + writel(NVREG_RCVCTL_START, base + NvRegReceiverControl); + pci_push(base); + } + writel(np->linkspeed, base + NvRegLinkSpeed); + pci_push(base); + writel(NVREG_RCVCTL_START, base + NvRegReceiverControl); + pci_push(base); +} + +static void stop_rx(struct net_device *dev) +{ + u8 *base = get_hwbase(dev); + + dprintk(KERN_DEBUG "%s: stop_rx\n", dev->name); + writel(0, base + NvRegReceiverControl); + reg_delay(dev, NvRegReceiverStatus, NVREG_RCVSTAT_BUSY, 0, + NV_RXSTOP_DELAY1, NV_RXSTOP_DELAY1MAX, + KERN_INFO "stop_rx: ReceiverStatus remained busy"); + + udelay(NV_RXSTOP_DELAY2); + writel(0, base + NvRegLinkSpeed); +} + +static void start_tx(struct net_device *dev) +{ + u8 *base = get_hwbase(dev); + + dprintk(KERN_DEBUG "%s: start_tx\n", dev->name); + writel(NVREG_XMITCTL_START, base + NvRegTransmitterControl); + pci_push(base); +} + +static void stop_tx(struct net_device *dev) +{ + u8 *base = get_hwbase(dev); + + dprintk(KERN_DEBUG "%s: stop_tx\n", dev->name); + writel(0, base + NvRegTransmitterControl); + reg_delay(dev, NvRegTransmitterStatus, NVREG_XMITSTAT_BUSY, 0, + NV_TXSTOP_DELAY1, NV_TXSTOP_DELAY1MAX, + KERN_INFO "stop_tx: TransmitterStatus remained busy"); + + udelay(NV_TXSTOP_DELAY2); + writel(0, base + NvRegUnknownTransmitterReg); +} + +static void txrx_reset(struct net_device *dev) +{ + u8 *base = get_hwbase(dev); + + dprintk(KERN_DEBUG "%s: txrx_reset\n", dev->name); + writel(NVREG_TXRXCTL_BIT2 | NVREG_TXRXCTL_RESET, base + NvRegTxRxControl); + pci_push(base); + udelay(NV_TXRX_RESET_DELAY); + writel(NVREG_TXRXCTL_BIT2, base + NvRegTxRxControl); + pci_push(base); +} + +/* + * get_stats: dev->get_stats function + * Get latest stats value from the nic. + * Called with read_lock(&dev_base_lock) held for read - + * only synchronized against unregister_netdevice. + */ +static struct net_device_stats *get_stats(struct net_device *dev) +{ + struct fe_priv *np = get_nvpriv(dev); + + /* It seems that the nic always generates interrupts and doesn't + * accumulate errors internally. Thus the current values in np->stats + * are already up to date. + */ + return &np->stats; +} + + +/* + * nic_ioctl: dev->do_ioctl function + * Called with rtnl_lock held. + */ +static int nic_ioctl(struct net_device *dev, struct ifreq *rq, int cmd) +{ + return -EOPNOTSUPP; +} + +/* + * alloc_rx: fill rx ring entries. + * Return 1 if the allocations for the skbs failed and the + * rx engine is without Available descriptors + */ +static int alloc_rx(struct net_device *dev) +{ + struct fe_priv *np = get_nvpriv(dev); + unsigned int refill_rx = np->refill_rx; + + while (np->cur_rx != refill_rx) { + int nr = refill_rx % RX_RING; + struct sk_buff *skb; + + if (np->rx_skbuff[nr] == NULL) { + + skb = dev_alloc_skb(RX_ALLOC_BUFSIZE); + if (!skb) + break; + + skb->dev = dev; + np->rx_skbuff[nr] = skb; + } else { + skb = np->rx_skbuff[nr]; + } + np->rx_dma[nr] = pci_map_single(np->pci_dev, skb->data, skb->len, + PCI_DMA_FROMDEVICE); + np->rx_ring[nr].PacketBuffer = cpu_to_le32(np->rx_dma[nr]); + np->rx_ring[nr].Length = cpu_to_le16(RX_NIC_BUFSIZE); + wmb(); + np->rx_ring[nr].Flags = cpu_to_le16(NV_RX_AVAIL); + dprintk(KERN_DEBUG "%s: alloc_rx: Packet %d marked as Available\n", + dev->name, refill_rx); + refill_rx++; + } + if (np->refill_rx != refill_rx) { + /* FIXME: made progress. 
Kick hardware */
+	}
+	np->refill_rx = refill_rx;
+	if (np->cur_rx - refill_rx == RX_RING)
+		return 1;
+	return 0;
+}
+
+static void do_rx_refill(unsigned long data)
+{
+	struct net_device *dev = (struct net_device *) data;
+	struct fe_priv *np = get_nvpriv(dev);
+
+	disable_irq(dev->irq);
+	if (alloc_rx(dev)) {
+		spin_lock(&np->lock);
+		if (!np->in_shutdown)
+			mod_timer(&np->oom_kick, jiffies + OOM_REFILL);
+		spin_unlock(&np->lock);
+	}
+	enable_irq(dev->irq);
+}
+
+static int init_ring(struct net_device *dev)
+{
+	struct fe_priv *np = get_nvpriv(dev);
+	int i;
+
+	np->next_tx = np->nic_tx = 0;
+	for (i = 0; i < TX_RING; i++) {
+		np->tx_ring[i].Flags = 0;
+	}
+
+	np->cur_rx = RX_RING;
+	np->refill_rx = 0;
+	for (i = 0; i < RX_RING; i++) {
+		np->rx_ring[i].Flags = 0;
+	}
+	init_timer(&np->oom_kick);
+	np->oom_kick.data = (unsigned long) dev;
+	np->oom_kick.function = &do_rx_refill;	/* timer handler */
+
+	return alloc_rx(dev);
+}
+
+static void drain_tx(struct net_device *dev)
+{
+	struct fe_priv *np = get_nvpriv(dev);
+	int i;
+	for (i = 0; i < TX_RING; i++) {
+		np->tx_ring[i].Flags = 0;
+		if (np->tx_skbuff[i]) {
+			pci_unmap_single(np->pci_dev, np->tx_dma[i],
+						np->tx_skbuff[i]->len,
+						PCI_DMA_TODEVICE);
+			dev_kfree_skb(np->tx_skbuff[i]);
+			np->tx_skbuff[i] = NULL;
+			np->stats.tx_dropped++;
+		}
+	}
+}
+static void drain_ring(struct net_device *dev)
+{
+	struct fe_priv *np = get_nvpriv(dev);
+	int i;
+
+	drain_tx(dev);
+
+	for (i = 0; i < RX_RING; i++) {
+		np->rx_ring[i].Flags = 0;
+		wmb();
+		if (np->rx_skbuff[i]) {
+			pci_unmap_single(np->pci_dev, np->rx_dma[i],
+						np->rx_skbuff[i]->len,
+						PCI_DMA_FROMDEVICE);
+			dev_kfree_skb(np->rx_skbuff[i]);
+			np->rx_skbuff[i] = NULL;
+		}
+	}
+}
+
+/*
+ * start_xmit: dev->hard_start_xmit function
+ * Called with dev->xmit_lock held.
+ */
+static int start_xmit(struct sk_buff *skb, struct net_device *dev)
+{
+	struct fe_priv *np = get_nvpriv(dev);
+	int nr = np->next_tx % TX_RING;
+
+	np->tx_skbuff[nr] = skb;
+	np->tx_dma[nr] = pci_map_single(np->pci_dev, skb->data, skb->len,
+					PCI_DMA_TODEVICE);
+
+	np->tx_ring[nr].PacketBuffer = cpu_to_le32(np->tx_dma[nr]);
+	np->tx_ring[nr].Length = cpu_to_le16(skb->len-1);
+
+	spin_lock_irq(&np->lock);
+	wmb();
+	np->tx_ring[nr].Flags = np->tx_flags;
+	dprintk(KERN_DEBUG "%s: start_xmit: packet %d queued for transmission.\n",
+				dev->name, np->next_tx);
+	{
+		int j;
+		for (j=0; j<64; j++) {
+			if ((j%16) == 0)
+				dprintk("\n%03x:", j);
+			dprintk(" %02x", ((unsigned char*)skb->data)[j]);
+		}
+		dprintk("\n");
+	}
+
+	np->next_tx++;
+
+	dev->trans_start = jiffies;
+	if (np->next_tx - np->nic_tx >= TX_LIMIT_STOP)
+		netif_stop_queue(dev);
+	spin_unlock_irq(&np->lock);
+	writel(NVREG_TXRXCTL_KICK, get_hwbase(dev) + NvRegTxRxControl);
+	return 0;
+}
+
+/*
+ * tx_done: check for completed packets, release the skbs.
+ *
+ * Caller must own np->lock.
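+ *
+ * (tx_done walks the ring from nic_tx towards next_tx and stops at the
+ * first descriptor that still has NV_TX_VALID set, i.e. one that is
+ * still owned by the hardware.)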
+ */ +static void tx_done(struct net_device *dev) +{ + struct fe_priv *np = get_nvpriv(dev); + + while (np->nic_tx < np->next_tx) { + struct ring_desc *prd; + int i = np->nic_tx % TX_RING; + + prd = &np->tx_ring[i]; + + dprintk(KERN_DEBUG "%s: tx_done: looking at packet %d, Flags 0x%x.\n", + dev->name, np->nic_tx, prd->Flags); + if (prd->Flags & cpu_to_le16(NV_TX_VALID)) + break; + if (prd->Flags & cpu_to_le16(NV_TX_RETRYERROR|NV_TX_CARRIERLOST|NV_TX_LATECOLLISION| + NV_TX_UNDERFLOW|NV_TX_ERROR)) { + if (prd->Flags & cpu_to_le16(NV_TX_UNDERFLOW)) + np->stats.tx_fifo_errors++; + if (prd->Flags & cpu_to_le16(NV_TX_CARRIERLOST)) + np->stats.tx_carrier_errors++; + np->stats.tx_errors++; + } else { + np->stats.tx_packets++; + np->stats.tx_bytes += np->tx_skbuff[i]->len; + } + pci_unmap_single(np->pci_dev, np->tx_dma[i], + np->tx_skbuff[i]->len, + PCI_DMA_TODEVICE); + dev_kfree_skb_irq(np->tx_skbuff[i]); + np->tx_skbuff[i] = NULL; + np->nic_tx++; + } + if (np->next_tx - np->nic_tx < TX_LIMIT_START) + netif_wake_queue(dev); +} +/* + * tx_timeout: dev->tx_timeout function + * Called with dev->xmit_lock held. + */ +static void tx_timeout(struct net_device *dev) +{ + struct fe_priv *np = get_nvpriv(dev); + u8 *base = get_hwbase(dev); + + dprintk(KERN_DEBUG "%s: Got tx_timeout.\n", dev->name); + dprintk(KERN_DEBUG "%s: irq: %08x\n", dev->name, + readl(base + NvRegIrqStatus) & NVREG_IRQSTAT_MASK); + + spin_lock_irq(&np->lock); + + /* 1) stop tx engine */ + stop_tx(dev); + + /* 2) check that the packets were not sent already: */ + tx_done(dev); + + /* 3) if there are dead entries: clear everything */ + if (np->next_tx != np->nic_tx) { + drain_tx(dev); + np->next_tx = np->nic_tx = 0; + writel((u32) (np->ring_addr + RX_RING*sizeof(struct ring_desc)), base + NvRegTxRingPhysAddr); + netif_wake_queue(dev); + } + + /* 4) restart tx engine */ + start_tx(dev); + spin_unlock_irq(&np->lock); +} + +static void rx_process(struct net_device *dev) +{ + struct fe_priv *np = get_nvpriv(dev); + + for (;;) { + struct ring_desc *prd; + struct sk_buff *skb; + int len; + int i; + if (np->cur_rx - np->refill_rx >= RX_RING) + break; /* ring empty - do not continue */ + + i = np->cur_rx % RX_RING; + prd = &np->rx_ring[i]; + dprintk(KERN_DEBUG "%s: rx_process: looking at packet %d, Flags 0x%x.\n", + dev->name, np->cur_rx, prd->Flags); + + if (prd->Flags & cpu_to_le16(NV_RX_AVAIL)) + break; /* still owned by hardware, */ + + /* the packet is for us - immediately tear down the pci mapping, and + * prefetch the first cacheline of the packet. 
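+		 * (The descriptor flags are inspected afterwards; on any fatal
+		 * NV_RX_* error the buffer is left in the ring and recycled
+		 * via the next_pkt path.)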
+ */ + pci_unmap_single(np->pci_dev, np->rx_dma[i], + np->rx_skbuff[i]->len, + PCI_DMA_FROMDEVICE); + prefetch(np->rx_skbuff[i]->data); + + { + int j; + dprintk(KERN_DEBUG "Dumping packet (flags 0x%x).",prd->Flags); + for (j=0; j<64; j++) { + if ((j%16) == 0) + dprintk("\n%03x:", j); + dprintk(" %02x", ((unsigned char*)np->rx_skbuff[i]->data)[j]); + } + dprintk("\n"); + } + /* look at what we actually got: */ + if (!(prd->Flags & cpu_to_le16(NV_RX_DESCRIPTORVALID))) + goto next_pkt; + + + len = le16_to_cpu(prd->Length); + + if (prd->Flags & cpu_to_le16(NV_RX_MISSEDFRAME)) { + np->stats.rx_missed_errors++; + np->stats.rx_errors++; + goto next_pkt; + } + if (prd->Flags & cpu_to_le16(NV_RX_ERROR1|NV_RX_ERROR2|NV_RX_ERROR3|NV_RX_ERROR4)) { + np->stats.rx_errors++; + goto next_pkt; + } + if (prd->Flags & cpu_to_le16(NV_RX_CRCERR)) { + np->stats.rx_crc_errors++; + np->stats.rx_errors++; + goto next_pkt; + } + if (prd->Flags & cpu_to_le16(NV_RX_OVERFLOW)) { + np->stats.rx_over_errors++; + np->stats.rx_errors++; + goto next_pkt; + } + if (prd->Flags & cpu_to_le16(NV_RX_ERROR)) { + /* framing errors are soft errors, the rest is fatal. */ + if (prd->Flags & cpu_to_le16(NV_RX_FRAMINGERR)) { + if (prd->Flags & cpu_to_le16(NV_RX_SUBSTRACT1)) { + len--; + } + } else { + np->stats.rx_errors++; + goto next_pkt; + } + } + /* got a valid packet - forward it to the network core */ + skb = np->rx_skbuff[i]; + np->rx_skbuff[i] = NULL; + + skb_put(skb, len); + skb->protocol = eth_type_trans(skb, dev); + dprintk(KERN_DEBUG "%s: rx_process: packet %d with %d bytes, proto %d accepted.\n", + dev->name, np->cur_rx, len, skb->protocol); + netif_rx(skb); + dev->last_rx = jiffies; + np->stats.rx_packets++; + np->stats.rx_bytes += len; +next_pkt: + np->cur_rx++; + } +} + +/* + * change_mtu: dev->change_mtu function + * Called with dev_base_lock held for read. + */ +static int change_mtu(struct net_device *dev, int new_mtu) +{ + if (new_mtu > DEFAULT_MTU) + return -EINVAL; + dev->mtu = new_mtu; + return 0; +} + +/* + * change_mtu: dev->change_mtu function + * Called with dev->xmit_lock held. 
+ */ +static void set_multicast(struct net_device *dev) +{ + struct fe_priv *np = get_nvpriv(dev); + u8 *base = get_hwbase(dev); + u32 addr[2]; + u32 mask[2]; + u32 pff; + + memset(addr, 0, sizeof(addr)); + memset(mask, 0, sizeof(mask)); + + if (dev->flags & IFF_PROMISC) { + printk(KERN_NOTICE "%s: Promiscuous mode enabled.\n", dev->name); + pff = NVREG_PFF_PROMISC; + } else { + pff = NVREG_PFF_MYADDR; + + if (dev->flags & IFF_ALLMULTI || dev->mc_list) { + u32 alwaysOff[2]; + u32 alwaysOn[2]; + + alwaysOn[0] = alwaysOn[1] = alwaysOff[0] = alwaysOff[1] = 0xffffffff; + if (dev->flags & IFF_ALLMULTI) { + alwaysOn[0] = alwaysOn[1] = alwaysOff[0] = alwaysOff[1] = 0; + } else { + struct dev_mc_list *walk; + + walk = dev->mc_list; + while (walk != NULL) { + u32 a, b; + a = le32_to_cpu(*(u32 *) walk->dmi_addr); + b = le16_to_cpu(*(u16 *) (&walk->dmi_addr[4])); + alwaysOn[0] &= a; + alwaysOff[0] &= ~a; + alwaysOn[1] &= b; + alwaysOff[1] &= ~b; + walk = walk->next; + } + } + addr[0] = alwaysOn[0]; + addr[1] = alwaysOn[1]; + mask[0] = alwaysOn[0] | alwaysOff[0]; + mask[1] = alwaysOn[1] | alwaysOff[1]; + } + } + addr[0] |= NVREG_MCASTADDRA_FORCE; + pff |= NVREG_PFF_ALWAYS; + spin_lock_irq(&np->lock); + stop_rx(dev); + writel(addr[0], base + NvRegMulticastAddrA); + writel(addr[1], base + NvRegMulticastAddrB); + writel(mask[0], base + NvRegMulticastMaskA); + writel(mask[1], base + NvRegMulticastMaskB); + writel(pff, base + NvRegPacketFilterFlags); + start_rx(dev); + spin_unlock_irq(&np->lock); +} + +static int update_linkspeed(struct net_device *dev) +{ + struct fe_priv *np = get_nvpriv(dev); + int adv, lpa, newls, newdup; + + adv = mii_rw(dev, np->phyaddr, MII_ADVERTISE, MII_READ); + lpa = mii_rw(dev, np->phyaddr, MII_LPA, MII_READ); + dprintk(KERN_DEBUG "%s: update_linkspeed: PHY advertises 0x%04x, lpa 0x%04x.\n", + dev->name, adv, lpa); + + /* FIXME: handle parallel detection properly, handle gigabit ethernet */ + lpa = lpa & adv; + if (lpa & LPA_100FULL) { + newls = NVREG_LINKSPEED_FORCE|NVREG_LINKSPEED_100; + newdup = 1; + } else if (lpa & LPA_100HALF) { + newls = NVREG_LINKSPEED_FORCE|NVREG_LINKSPEED_100; + newdup = 0; + } else if (lpa & LPA_10FULL) { + newls = NVREG_LINKSPEED_FORCE|NVREG_LINKSPEED_10; + newdup = 1; + } else if (lpa & LPA_10HALF) { + newls = NVREG_LINKSPEED_FORCE|NVREG_LINKSPEED_10; + newdup = 0; + } else { + dprintk(KERN_DEBUG "%s: bad ability %04x - falling back to 10HD.\n", dev->name, lpa); + newls = NVREG_LINKSPEED_FORCE|NVREG_LINKSPEED_10; + newdup = 0; + } + if (np->duplex != newdup || np->linkspeed != newls) { + np->duplex = newdup; + np->linkspeed = newls; + return 1; + } + return 0; +} + +static void link_irq(struct net_device *dev) +{ + struct fe_priv *np = get_nvpriv(dev); + u8 *base = get_hwbase(dev); + u32 miistat; + int miival; + + miistat = readl(base + NvRegMIIStatus); + writel(NVREG_MIISTAT_MASK, base + NvRegMIIStatus); + printk(KERN_DEBUG "%s: link change notification, status 0x%x.\n", dev->name, miistat); + + miival = mii_rw(dev, np->phyaddr, MII_BMSR, MII_READ); + if (miival & BMSR_ANEGCOMPLETE) { + update_linkspeed(dev); + + if (netif_carrier_ok(dev)) { + stop_rx(dev); + } else { + netif_carrier_on(dev); + printk(KERN_INFO "%s: link up.\n", dev->name); + } + writel(NVREG_MISC1_FORCE | ( np->duplex ? 
0 : NVREG_MISC1_HD), + base + NvRegMisc1); + start_rx(dev); + } else { + if (netif_carrier_ok(dev)) { + netif_carrier_off(dev); + printk(KERN_INFO "%s: link down.\n", dev->name); + stop_rx(dev); + } + writel(np->linkspeed, base + NvRegLinkSpeed); + pci_push(base); + } +} + +static irqreturn_t nic_irq(int foo, void *data, struct pt_regs *regs) +{ + struct net_device *dev = (struct net_device *) data; + struct fe_priv *np = get_nvpriv(dev); + u8 *base = get_hwbase(dev); + u32 events; + + dprintk(KERN_DEBUG "%s: nic_irq\n", dev->name); + + for (;;) { + events = readl(base + NvRegIrqStatus) & NVREG_IRQSTAT_MASK; + writel(NVREG_IRQSTAT_MASK, base + NvRegIrqStatus); + pci_push(base); + dprintk(KERN_DEBUG "%s: irq: %08x\n", dev->name, events); + if (!(events & np->irqmask)) + break; + + /* FIXME: only call the required processing functions */ + if (events & (NVREG_IRQ_TX1|NVREG_IRQ_TX2)) { + spin_lock(&np->lock); + tx_done(dev); + spin_unlock(&np->lock); + } + + if (events & NVREG_IRQ_RX) { + rx_process(dev); + if (alloc_rx(dev)) { + spin_lock(&np->lock); + if (!np->in_shutdown) + mod_timer(&np->oom_kick, jiffies + OOM_REFILL); + spin_unlock(&np->lock); + } + } + + if (events & NVREG_IRQ_LINK) { + spin_lock(&np->lock); + link_irq(dev); + spin_unlock(&np->lock); + } + if (events & (NVREG_IRQ_UNKNOWN)) { + printk("%s: received irq with unknown source 0x%x.\n", dev->name, events); + } + /* FIXME: general errors, link change interrupts */ + } + dprintk(KERN_DEBUG "%s: nic_irq completed\n", dev->name); + + return IRQ_HANDLED; +} + +static int open(struct net_device *dev) +{ + struct fe_priv *np = get_nvpriv(dev); + u8 *base = get_hwbase(dev); + int ret, oom, i; + + dprintk(KERN_DEBUG "forcedeth: open\n"); + + /* 1) erase previous misconfiguration */ + /* 4.1-1: stop adapter: ignored, 4.3 seems to be overkill */ + writel(NVREG_MCASTADDRA_FORCE, base + NvRegMulticastAddrA); + writel(0, base + NvRegMulticastAddrB); + writel(0, base + NvRegMulticastMaskA); + writel(0, base + NvRegMulticastMaskB); + writel(0, base + NvRegPacketFilterFlags); + writel(0, base + NvRegAdapterControl); + writel(0, base + NvRegLinkSpeed); + writel(0, base + NvRegUnknownTransmitterReg); + txrx_reset(dev); + writel(0, base + NvRegUnknownSetupReg6); + + /* 2) initialize descriptor rings */ + np->in_shutdown = 0; + oom = init_ring(dev); + + /* 3) set mac address */ + { + u32 mac[2]; + + mac[0] = (dev->dev_addr[0] << 0) + (dev->dev_addr[1] << 8) + + (dev->dev_addr[2] << 16) + (dev->dev_addr[3] << 24); + mac[1] = (dev->dev_addr[4] << 0) + (dev->dev_addr[5] << 8); + + writel(mac[0], base + NvRegMacAddrA); + writel(mac[1], base + NvRegMacAddrB); + } + + /* 4) continue setup */ + np->linkspeed = NVREG_LINKSPEED_FORCE|NVREG_LINKSPEED_10; + np->duplex = 0; + writel(NVREG_UNKSETUP3_VAL1, base + NvRegUnknownSetupReg3); + writel(0, base + NvRegTxRxControl); + pci_push(base); + writel(NVREG_TXRXCTL_BIT1, base + NvRegTxRxControl); + reg_delay(dev, NvRegUnknownSetupReg5, NVREG_UNKSETUP5_BIT31, NVREG_UNKSETUP5_BIT31, + NV_SETUP5_DELAY, NV_SETUP5_DELAYMAX, + KERN_INFO "open: SetupReg5, Bit 31 remained off\n"); + writel(0, base + NvRegUnknownSetupReg4); + + /* 5) Find a suitable PHY */ + writel(NVREG_MIISPEED_BIT8|NVREG_MIIDELAY, base + NvRegMIISpeed); + for (i = 1; i < 32; i++) { + int id1, id2; + + id1 = mii_rw(dev, i, MII_PHYSID1, MII_READ); + if (id1 < 0) + continue; + id2 = mii_rw(dev, i, MII_PHYSID2, MII_READ); + if (id2 < 0) + continue; + dprintk(KERN_DEBUG "%s: open: Found PHY %04x:%04x at address %d.\n", + dev->name, id1, id2, i); + 
np->phyaddr = i;
+
+		update_linkspeed(dev);
+
+		break;
+	}
+	if (i == 32) {
+		printk(KERN_INFO "%s: open: failing due to lack of suitable PHY.\n",
+				dev->name);
+		ret = -EINVAL;
+		goto out_drain;
+	}
+
+	/* 6) continue setup */
+	writel(NVREG_MISC1_FORCE | ( np->duplex ? 0 : NVREG_MISC1_HD),
+				base + NvRegMisc1);
+	writel(readl(base + NvRegTransmitterStatus), base + NvRegTransmitterStatus);
+	writel(NVREG_PFF_ALWAYS, base + NvRegPacketFilterFlags);
+	writel(NVREG_OFFLOAD_NORMAL, base + NvRegOffloadConfig);
+
+	writel(readl(base + NvRegReceiverStatus), base + NvRegReceiverStatus);
+	get_random_bytes(&i, sizeof(i));
+	writel(NVREG_RNDSEED_FORCE | (i&NVREG_RNDSEED_MASK), base + NvRegRandomSeed);
+	writel(NVREG_UNKSETUP1_VAL, base + NvRegUnknownSetupReg1);
+	writel(NVREG_UNKSETUP2_VAL, base + NvRegUnknownSetupReg2);
+	writel(NVREG_UNKSETUP6_VAL, base + NvRegUnknownSetupReg6);
+	writel((np->phyaddr << NVREG_ADAPTCTL_PHYSHIFT)|NVREG_ADAPTCTL_PHYVALID,
+			base + NvRegAdapterControl);
+	writel(NVREG_UNKSETUP4_VAL, base + NvRegUnknownSetupReg4);
+	writel(NVREG_WAKEUPFLAGS_VAL, base + NvRegWakeUpFlags);
+
+	/* 7) start packet processing */
+	writel((u32) np->ring_addr, base + NvRegRxRingPhysAddr);
+	writel((u32) (np->ring_addr + RX_RING*sizeof(struct ring_desc)), base + NvRegTxRingPhysAddr);
+	writel( ((RX_RING-1) << NVREG_RINGSZ_RXSHIFT) + ((TX_RING-1) << NVREG_RINGSZ_TXSHIFT),
+			base + NvRegRingSizes);
+
+	i = readl(base + NvRegPowerState);
+	if ( (i & NVREG_POWERSTATE_POWEREDUP) == 0) {
+		writel(NVREG_POWERSTATE_POWEREDUP|i, base + NvRegPowerState);
+	}
+	pci_push(base);
+	udelay(10);
+	writel(readl(base + NvRegPowerState) | NVREG_POWERSTATE_VALID, base + NvRegPowerState);
+	writel(NVREG_ADAPTCTL_RUNNING, base + NvRegAdapterControl);
+
+
+	writel(0, base + NvRegIrqMask);
+	pci_push(base);
+	writel(NVREG_IRQSTAT_MASK, base + NvRegIrqStatus);
+	pci_push(base);
+	writel(NVREG_MIISTAT_MASK2, base + NvRegMIIStatus);
+	writel(NVREG_IRQSTAT_MASK, base + NvRegIrqStatus);
+	pci_push(base);
+
+	ret = request_irq(dev->irq, &nic_irq, SA_SHIRQ, dev->name, dev);
+	if (ret)
+		goto out_drain;
+
+	writel(np->irqmask, base + NvRegIrqMask);
+
+	spin_lock_irq(&np->lock);
+	writel(NVREG_MCASTADDRA_FORCE, base + NvRegMulticastAddrA);
+	writel(0, base + NvRegMulticastAddrB);
+	writel(0, base + NvRegMulticastMaskA);
+	writel(0, base + NvRegMulticastMaskB);
+	writel(NVREG_PFF_ALWAYS|NVREG_PFF_MYADDR, base + NvRegPacketFilterFlags);
+	start_rx(dev);
+	start_tx(dev);
+	netif_start_queue(dev);
+	if (oom)
+		mod_timer(&np->oom_kick, jiffies + OOM_REFILL);
+	if (!(mii_rw(dev, np->phyaddr, MII_BMSR, MII_READ) & BMSR_ANEGCOMPLETE)) {
+		printk("%s: no link during initialization.\n", dev->name);
+		netif_carrier_off(dev);
+	}
+
+	spin_unlock_irq(&np->lock);
+
+	return 0;
+out_drain:
+	drain_ring(dev);
+	return ret;
+}
+
+static int close(struct net_device *dev)
+{
+	struct fe_priv *np = get_nvpriv(dev);
+
+	spin_lock_irq(&np->lock);
+	np->in_shutdown = 1;
+	spin_unlock_irq(&np->lock);
+	synchronize_irq(dev->irq);
+
+	del_timer_sync(&np->oom_kick);
+
+	netif_stop_queue(dev);
+	spin_lock_irq(&np->lock);
+	stop_tx(dev);
+	stop_rx(dev);
+	spin_unlock_irq(&np->lock);
+
+	free_irq(dev->irq, dev);
+
+	drain_ring(dev);
+
+	/* FIXME: power down nic */
+
+	return 0;
+}
+
+static int __devinit probe_nic(struct pci_dev *pci_dev, const struct pci_device_id *id)
+{
+	struct net_device *dev;
+	struct fe_priv *np;
+	unsigned long addr;
+	u8 *base;
+	int err, i;
+
+	dev = alloc_etherdev(sizeof(struct fe_priv));
+	err = -ENOMEM;
+	if (!dev)
+		goto out;
+
+	np = get_nvpriv(dev);
+	np->pci_dev = pci_dev;
+	spin_lock_init(&np->lock);
+	SET_MODULE_OWNER(dev);
+	SET_NETDEV_DEV(dev, &pci_dev->dev);
+
+	err = pci_enable_device(pci_dev);
+	if (err) {
+		printk(KERN_INFO "forcedeth: pci_enable_dev failed: %d\n", err);
+		goto out_free;
+	}
+
+	pci_set_master(pci_dev);
+
+	err = pci_request_regions(pci_dev, dev->name);
+	if (err < 0)
+		goto out_disable;
+
+	err = -EINVAL;
+	addr = 0;
+	for (i = 0; i < DEVICE_COUNT_RESOURCE; i++) {
+		dprintk(KERN_DEBUG "forcedeth: resource %d start %p len %ld flags 0x%08lx.\n",
+				i, (void*)pci_resource_start(pci_dev, i),
+				pci_resource_len(pci_dev, i),
+				pci_resource_flags(pci_dev, i));
+		if (pci_resource_flags(pci_dev, i) & IORESOURCE_MEM &&
+				pci_resource_len(pci_dev, i) >= NV_PCI_REGSZ) {
+			addr = pci_resource_start(pci_dev, i);
+			break;
+		}
+	}
+	if (i == DEVICE_COUNT_RESOURCE) {
+		printk(KERN_INFO "forcedeth: Couldn't find register window.\n");
+		goto out_relreg;
+	}
+
+	err = -ENOMEM;
+	dev->base_addr = (unsigned long) ioremap(addr, NV_PCI_REGSZ);
+	if (!dev->base_addr)
+		goto out_relreg;
+	dev->irq = pci_dev->irq;
+	np->rx_ring = pci_alloc_consistent(pci_dev, sizeof(struct ring_desc) * (RX_RING + TX_RING),
+						&np->ring_addr);
+	if (!np->rx_ring)
+		goto out_unmap;
+	np->tx_ring = &np->rx_ring[RX_RING];
+
+	dev->open = open;
+	dev->stop = close;
+	dev->hard_start_xmit = start_xmit;
+	dev->get_stats = get_stats;
+	dev->change_mtu = change_mtu;
+	dev->set_multicast_list = set_multicast;
+	dev->do_ioctl = nic_ioctl;
+	dev->tx_timeout = tx_timeout;
+	dev->watchdog_timeo = NV_WATCHDOG_TIMEO;
+
+	pci_set_drvdata(pci_dev, dev);
+
+	err = register_netdev(dev);
+	if (err) {
+		printk(KERN_INFO "forcedeth: unable to register netdev: %d\n", err);
+		goto out_freering;
+	}
+
+	printk(KERN_INFO "%s: forcedeth.c: subsystem: %05x:%04x\n",
+			dev->name, pci_dev->subsystem_vendor, pci_dev->subsystem_device);
+
+
+	/* read the mac address */
+	base = get_hwbase(dev);
+	np->orig_mac[0] = readl(base + NvRegMacAddrA);
+	np->orig_mac[1] = readl(base + NvRegMacAddrB);
+
+	dev->dev_addr[0] = (np->orig_mac[1] >> 8) & 0xff;
+	dev->dev_addr[1] = (np->orig_mac[1] >> 0) & 0xff;
+	dev->dev_addr[2] = (np->orig_mac[0] >> 24) & 0xff;
+	dev->dev_addr[3] = (np->orig_mac[0] >> 16) & 0xff;
+	dev->dev_addr[4] = (np->orig_mac[0] >> 8) & 0xff;
+	dev->dev_addr[5] = (np->orig_mac[0] >> 0) & 0xff;
+
+	dprintk(KERN_DEBUG "%s: MAC Address %02x:%02x:%02x:%02x:%02x:%02x\n", dev->name,
+			dev->dev_addr[0], dev->dev_addr[1], dev->dev_addr[2],
+			dev->dev_addr[3], dev->dev_addr[4], dev->dev_addr[5]);
+
+	np->tx_flags = cpu_to_le16(NV_TX_LASTPACKET|NV_TX_VALID);
+	if (id->driver_data & DEV_NEED_LASTPACKET1)
+		np->tx_flags |= cpu_to_le16(NV_TX_LASTPACKET1);
+	if (id->driver_data & DEV_IRQMASK_1)
+		np->irqmask = NVREG_IRQMASK_WANTED_1;
+	if (id->driver_data & DEV_IRQMASK_2)
+		np->irqmask = NVREG_IRQMASK_WANTED_2;
+
+	return 0;
+
+out_freering:
+	pci_free_consistent(np->pci_dev, sizeof(struct ring_desc) * (RX_RING + TX_RING),
+				np->rx_ring, np->ring_addr);
+out_unmap:
+	iounmap(get_hwbase(dev));
+out_relreg:
+	pci_release_regions(pci_dev);
+out_disable:
+	pci_disable_device(pci_dev);
+out_free:
+	kfree(dev);
+	pci_set_drvdata(pci_dev, NULL);
+out:
+	return err;
+}
+
+static void __devexit remove_nic(struct pci_dev *pci_dev)
+{
+	struct net_device *dev = pci_get_drvdata(pci_dev);
+	struct fe_priv *np = get_nvpriv(dev);
+	u8 *base = get_hwbase(dev);
+
+	unregister_netdev(dev);
+
+	/* special op: write back the misordered MAC address - otherwise
+ * the next probe_nic would see a wrong address. + */ + writel(np->orig_mac[0], base + NvRegMacAddrA); + writel(np->orig_mac[1], base + NvRegMacAddrB); + + /* free all structures */ + pci_free_consistent(np->pci_dev, sizeof(struct ring_desc) * (RX_RING + TX_RING), np->rx_ring, np->ring_addr); + iounmap(get_hwbase(dev)); + pci_release_regions(pci_dev); + pci_disable_device(pci_dev); + kfree(dev); + pci_set_drvdata(pci_dev, NULL); +} + +static struct pci_device_id pci_tbl[] = { + { /* nForce Ethernet Controller */ + .vendor = PCI_VENDOR_ID_NVIDIA, + .device = 0x1C3, + .subvendor = PCI_ANY_ID, + .subdevice = PCI_ANY_ID, + .driver_data = DEV_IRQMASK_1, + }, + { /* nForce2 Ethernet Controller */ + .vendor = PCI_VENDOR_ID_NVIDIA, + .device = 0x0066, + .subvendor = PCI_ANY_ID, + .subdevice = PCI_ANY_ID, + .driver_data = DEV_NEED_LASTPACKET1|DEV_IRQMASK_2, + }, + { /* nForce3 Ethernet Controller */ + .vendor = PCI_VENDOR_ID_NVIDIA, + .device = 0x00D6, + .subvendor = PCI_ANY_ID, + .subdevice = PCI_ANY_ID, + .driver_data = DEV_NEED_LASTPACKET1|DEV_IRQMASK_2, + }, + {0,}, +}; + +static struct pci_driver driver = { + .name = "forcedeth", + .id_table = pci_tbl, + .probe = probe_nic, + .remove = __devexit_p(remove_nic), +}; + + +static int __init init_nic(void) +{ + printk(KERN_INFO "forcedeth.c: Reverse Engineered nForce ethernet driver. Version %s.\n", FORCEDETH_VERSION); + return pci_module_init(&driver); +} + +static void __exit exit_nic(void) +{ + pci_unregister_driver(&driver); +} + +MODULE_AUTHOR("Manfred Spraul <manfred@colorfullife.com>"); +MODULE_DESCRIPTION("Reverse Engineered nForce ethernet driver"); +MODULE_LICENSE("GPL"); + +MODULE_DEVICE_TABLE(pci, pci_tbl); + +module_init(init_nic); +module_exit(exit_nic); --- diff/drivers/net/kgdb_eth.c 1970-01-01 01:00:00.000000000 +0100 +++ source/drivers/net/kgdb_eth.c 2003-11-26 10:09:06.000000000 +0000 @@ -0,0 +1,517 @@ +/* + * Network interface GDB stub + * + * Written by San Mehat (nettwerk@biodome.org) + * Based upon 'gdbserial' by David Grothe (dave@gcom.com) + * and Scott Foehner (sfoehner@engr.sgi.com) + * + * Twiddled for 2.6 by Robert Walsh <rjwalsh@durables.org> + * and wangdi <wangdi@clusterfs.com>. 
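+ *
+ * (The stub tunnels the gdb remote protocol over raw UDP frames and polls
+ * the NIC directly through dev->poll_controller while the kernel is
+ * stopped, so it does not depend on the normal network stack.)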
+ */ + +#include <linux/module.h> +#include <linux/errno.h> +#include <linux/signal.h> +#include <linux/sched.h> +#include <linux/timer.h> +#include <linux/interrupt.h> +#include <linux/config.h> +#include <linux/major.h> +#include <linux/string.h> +#include <linux/fcntl.h> +#include <linux/termios.h> +#include <asm/kgdb.h> +#include <linux/if_ether.h> +#include <linux/netdevice.h> +#include <linux/etherdevice.h> +#include <linux/skbuff.h> +#include <linux/delay.h> +#include <net/tcp.h> +#include <net/udp.h> + +#include <asm/system.h> +#include <asm/io.h> +#include <asm/segment.h> +#include <asm/bitops.h> +#include <asm/system.h> +#include <asm/irq.h> +#include <asm/atomic.h> + +#define GDB_BUF_SIZE 512 /* power of 2, please */ + +static char kgdb_buf[GDB_BUF_SIZE] ; +static int kgdb_buf_in_inx ; +static atomic_t kgdb_buf_in_cnt ; +static int kgdb_buf_out_inx ; + +extern void set_debug_traps(void) ; /* GDB routine */ +extern void breakpoint(void); + +unsigned int kgdb_remoteip = 0; +unsigned short kgdb_listenport = 6443; +unsigned short kgdb_sendport= 6442; +int kgdb_eth = -1; /* Default tty mode */ +unsigned char kgdb_remotemac[6] = {0xff,0xff,0xff,0xff,0xff,0xff}; +unsigned char kgdb_localmac[6] = {0xff,0xff,0xff,0xff,0xff,0xff}; +volatile int kgdb_eth_is_initializing = 0; +int kgdb_eth_need_breakpoint[NR_CPUS]; + +struct net_device *kgdb_netdevice = NULL; + +/* + * Get a char if available, return -1 if nothing available. + * Empty the receive buffer first, then look at the interface hardware. + */ +static int +read_char(void) +{ + /* intr routine has queued chars */ + if (atomic_read(&kgdb_buf_in_cnt) != 0) + { + int chr; + + chr = kgdb_buf[kgdb_buf_out_inx++] ; + kgdb_buf_out_inx &= (GDB_BUF_SIZE - 1) ; + atomic_dec(&kgdb_buf_in_cnt) ; + return chr; + } + + return -1; /* no data */ +} + +/* + * Wait until the interface can accept a char, then write it. 
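+ * (write_buffer hand-builds the ethernet, IP and UDP headers around the
+ * payload, since the normal output path cannot be used from the stub.)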
+ */ +static void +write_buffer(char *buf, int len) +{ + int total_len, eth_len, ip_len, udp_len; + struct in_device *in_dev; + struct sk_buff *skb; + struct udphdr *udph; + struct iphdr *iph; + struct ethhdr *eth; + + if (!(in_dev = (struct in_device *) kgdb_netdevice->ip_ptr)) { + panic("No in_device available for interface!\n"); + } + + if (!(in_dev->ifa_list)) { + panic("No interface address set for interface!\n"); + } + + udp_len = len + sizeof(struct udphdr); + ip_len = eth_len = udp_len + sizeof(struct iphdr); + total_len = eth_len + ETH_HLEN; + + if (!(skb = alloc_skb(total_len, GFP_ATOMIC))) { + return; + } + + atomic_set(&skb->users, 1); + skb_reserve(skb, total_len - len); + + memcpy(skb->data, (unsigned char *) buf, len); + skb->len += len; + + udph = (struct udphdr *) skb_push(skb, sizeof(*udph)); + udph->source = htons(kgdb_listenport); + udph->dest = htons(kgdb_sendport); + udph->len = htons(udp_len); + udph->check = 0; + + iph = (struct iphdr *)skb_push(skb, sizeof(*iph)); + iph->version = 4; + iph->ihl = 5; + iph->tos = 0; + iph->tot_len = htons(ip_len); + iph->id = 0; + iph->frag_off = 0; + iph->ttl = 64; + iph->protocol = IPPROTO_UDP; + iph->check = 0; + iph->saddr = in_dev->ifa_list->ifa_address; + iph->daddr = kgdb_remoteip; + iph->check = ip_fast_csum((unsigned char *)iph, iph->ihl); + + eth = (struct ethhdr *) skb_push(skb, ETH_HLEN); + eth->h_proto = htons(ETH_P_IP); + memcpy(eth->h_source, kgdb_localmac, kgdb_netdevice->addr_len); + memcpy(eth->h_dest, kgdb_remotemac, kgdb_netdevice->addr_len); + +repeat: + spin_lock(&kgdb_netdevice->xmit_lock); + kgdb_netdevice->xmit_lock_owner = smp_processor_id(); + + if (netif_queue_stopped(kgdb_netdevice)) { + kgdb_netdevice->xmit_lock_owner = -1; + spin_unlock(&kgdb_netdevice->xmit_lock); + + kgdb_netdevice->poll_controller(kgdb_netdevice); + goto repeat; + } + + kgdb_netdevice->hard_start_xmit(skb, kgdb_netdevice); + kgdb_netdevice->xmit_lock_owner = -1; + spin_unlock(&kgdb_netdevice->xmit_lock); +} + +/* + * In the interrupt state the target machine will not respond to any + * arp requests, so handle them here. 
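+ * (The reply skb is prepared in make_arp_request() and transmitted later
+ * from kgdb_eth_reply_arp(), once the transmitter can be locked.)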
+ */ + +static struct sk_buff *send_skb = NULL; + +void +kgdb_eth_reply_arp(void) +{ + if (send_skb) { + spin_lock(&kgdb_netdevice->xmit_lock); + kgdb_netdevice->xmit_lock_owner = smp_processor_id(); + kgdb_netdevice->hard_start_xmit(send_skb, kgdb_netdevice); + kgdb_netdevice->xmit_lock_owner = -1; + spin_unlock(&kgdb_netdevice->xmit_lock); + send_skb = NULL; + } +} + +static int +make_arp_request(struct sk_buff *skb) +{ + struct arphdr *arp; + unsigned char *arp_ptr; + int type = ARPOP_REPLY; + int ptype = ETH_P_ARP; + u32 sip, tip; + unsigned char *sha, *tha; + struct in_device *in_dev = (struct in_device *) kgdb_netdevice->ip_ptr; + + /* No arp on this interface */ + + if (kgdb_netdevice->flags & IFF_NOARP) { + return 0; + } + + if (!pskb_may_pull(skb, (sizeof(struct arphdr) + + (2 * kgdb_netdevice->addr_len) + + (2 * sizeof(u32))))) { + return 0; + } + + skb->h.raw = skb->nh.raw = skb->data; + arp = skb->nh.arph; + + if ((arp->ar_hrd != htons(ARPHRD_ETHER) && + arp->ar_hrd != htons(ARPHRD_IEEE802)) || + arp->ar_pro != htons(ETH_P_IP)) { + return 0; + } + + /* Understand only these message types */ + + if (arp->ar_op != htons(ARPOP_REQUEST)) { + return 0; + } + + /* Extract fields */ + + arp_ptr= (unsigned char *)(arp+1); + sha = arp_ptr; + arp_ptr += kgdb_netdevice->addr_len; + memcpy(&sip, arp_ptr, 4); + arp_ptr += 4; + tha = arp_ptr; + arp_ptr += kgdb_netdevice->addr_len; + memcpy(&tip, arp_ptr, 4); + + if (tip != in_dev->ifa_list->ifa_address) { + return 0; + } + + if (kgdb_remoteip != sip) { + return 0; + } + + /* + * Check for bad requests for 127.x.x.x and requests for multicast + * addresses. If this is one such, delete it. + */ + + if (LOOPBACK(tip) || MULTICAST(tip)) { + return 0; + } + + /* reply to the ARP request */ + + send_skb = alloc_skb(sizeof(struct arphdr) + 2 * (kgdb_netdevice->addr_len + 4) + LL_RESERVED_SPACE(kgdb_netdevice), GFP_ATOMIC); + + if (send_skb == NULL) { + return 0; + } + + skb_reserve(send_skb, LL_RESERVED_SPACE(kgdb_netdevice)); + send_skb->nh.raw = send_skb->data; + arp = (struct arphdr *) skb_put(send_skb, sizeof(struct arphdr) + 2 * (kgdb_netdevice->addr_len + 4)); + send_skb->dev = kgdb_netdevice; + send_skb->protocol = htons(ETH_P_ARP); + + /* Fill the device header for the ARP frame */ + + if (kgdb_netdevice->hard_header && + kgdb_netdevice->hard_header(send_skb, kgdb_netdevice, ptype, + kgdb_remotemac, kgdb_localmac, + send_skb->len) < 0) { + kfree_skb(send_skb); + return 0; + } + + /* + * Fill out the arp protocol part. + * + * we only support ethernet device type, + * which (according to RFC 1390) should always equal 1 (Ethernet). + */ + + arp->ar_hrd = htons(kgdb_netdevice->type); + arp->ar_pro = htons(ETH_P_IP); + + arp->ar_hln = kgdb_netdevice->addr_len; + arp->ar_pln = 4; + arp->ar_op = htons(type); + + arp_ptr=(unsigned char *)(arp + 1); + + memcpy(arp_ptr, kgdb_netdevice->dev_addr, kgdb_netdevice->addr_len); + arp_ptr += kgdb_netdevice->addr_len; + memcpy(arp_ptr, &tip, 4); + arp_ptr += 4; + memcpy(arp_ptr, kgdb_localmac, kgdb_netdevice->addr_len); + arp_ptr += kgdb_netdevice->addr_len; + memcpy(arp_ptr, &sip, 4); + return 0; +} + + +/* + * Accept an skbuff from net_device layer and add the payload onto + * kgdb buffer + * + * When the kgdb stub routine getDebugChar() is called it draws characters + * out of the buffer until it is empty and then reads directly from the + * serial port. 
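+ * (For this ethernet stub the equivalent is eth_getDebugChar() below,
+ * which polls the NIC whenever the buffer runs dry.)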
+ *
+ * We do not attempt to write chars from the interrupt routine since
+ * the stubs do all of that via putDebugChar() which writes one byte
+ * after waiting for the interface to become ready.
+ *
+ * The debug stubs like to run with interrupts disabled since, after all,
+ * they run as a consequence of a breakpoint in the kernel.
+ *
+ * NOTE: Return value of 1 means it was for us and is an indication to
+ * the calling driver to destroy the sk_buff and not send it up the stack.
+ */
+int
+kgdb_net_interrupt(struct sk_buff *skb)
+{
+	unsigned char chr;
+	struct iphdr *iph = (struct iphdr*)skb->data;
+	struct udphdr *udph= (struct udphdr*)(skb->data+(iph->ihl<<2));
+	unsigned char *data = (unsigned char *) udph + sizeof(struct udphdr);
+	int len;
+	int i;
+
+	if ((kgdb_eth != -1) && (!kgdb_netdevice) &&
+	    (iph->protocol == IPPROTO_UDP) &&
+	    (be16_to_cpu(udph->dest) == kgdb_listenport)) {
+		kgdb_sendport = be16_to_cpu(udph->source);
+
+		while (kgdb_eth_is_initializing)
+			;
+		if (!kgdb_netdevice)
+			kgdb_eth_hook();
+		if (!kgdb_netdevice) {
+			/* Let's not even try again. */
+			kgdb_eth = -1;
+			return 0;
+		}
+	}
+	if (!kgdb_netdevice) {
+		return 0;
+	}
+	if (skb->protocol == __constant_htons(ETH_P_ARP) && !send_skb) {
+		make_arp_request(skb);
+		return 0;
+	}
+	if (iph->protocol != IPPROTO_UDP) {
+		return 0;
+	}
+
+	if (be16_to_cpu(udph->dest) != kgdb_listenport) {
+		return 0;
+	}
+
+	len = (be16_to_cpu(iph->tot_len) -
+	       (sizeof(struct udphdr) + sizeof(struct iphdr)));
+
+	for (i = 0; i < len; i++) {
+		chr = data[i];
+		if (chr == 3) {
+			kgdb_eth_need_breakpoint[smp_processor_id()] = 1;
+			continue;
+		}
+		if (atomic_read(&kgdb_buf_in_cnt) >= GDB_BUF_SIZE) {
+			/* buffer overflow, clear it */
+			kgdb_buf_in_inx = 0;
+			atomic_set(&kgdb_buf_in_cnt, 0);
+			kgdb_buf_out_inx = 0;
+			break;
+		}
+		kgdb_buf[kgdb_buf_in_inx++] = chr;
+		kgdb_buf_in_inx &= (GDB_BUF_SIZE - 1);
+		atomic_inc(&kgdb_buf_in_cnt);
+	}
+
+	if (!kgdb_netdevice->kgdb_is_trapped) {
+		/*
+		 * If a new gdb instance is trying to attach, we need to
+		 * break here.
+		 */
+		if (!strncmp(data, "$Hc-1#09", 8))
+			kgdb_eth_need_breakpoint[smp_processor_id()] = 1;
+	}
+	return 1;
+}
+EXPORT_SYMBOL(kgdb_net_interrupt);
+
+int
+kgdb_eth_hook(void)
+{
+	char kgdb_netdev[16];
+	extern void kgdb_respond_ok(void);
+
+	if (kgdb_remotemac[0] == 0xff) {
+		panic("ERROR! 'gdbeth_remotemac' option not set!\n");
+	}
+	if (kgdb_localmac[0] == 0xff) {
+		panic("ERROR! 'gdbeth_localmac' option not set!\n");
+	}
+	if (kgdb_remoteip == 0) {
+		panic("ERROR! 'gdbeth_remoteip' option not set!\n");
+	}
+
+	sprintf(kgdb_netdev, "eth%d", kgdb_eth);
+
+#ifdef CONFIG_SMP
+	if (num_online_cpus() > CONFIG_NO_KGDB_CPUS) {
+		printk("kgdb: too many cpus. Cannot enable debugger with more than %d cpus\n", CONFIG_NO_KGDB_CPUS);
+		return -1;
+	}
+#endif
+	for (kgdb_netdevice = dev_base;
+	     kgdb_netdevice != NULL;
+	     kgdb_netdevice = kgdb_netdevice->next) {
+		if (strncmp(kgdb_netdevice->name, kgdb_netdev, IFNAMSIZ) == 0) {
+			break;
+		}
+	}
+	if (!kgdb_netdevice) {
+		printk("KGDB NET: Unable to find interface %s\n", kgdb_netdev);
+		return -ENODEV;
+	}
+
+	/*
+	 * Call GDB routine to setup the exception vectors for the debugger
+	 */
+	set_debug_traps();
+
+	/*
+	 * Call the breakpoint() routine in GDB to start the debugging
+	 * session.
+	 */
+	kgdb_eth_is_initializing = 1;
+	kgdb_eth_need_breakpoint[smp_processor_id()] = 1;
+	return 0;
+}
+
+/*
+ * getDebugChar
+ *
+ * This is a GDB stub routine. It waits for a character from the
+ * serial interface and then returns it. 
If there is no serial + * interface connection then it returns a bogus value which will + * almost certainly cause the system to hang. + */ +int +eth_getDebugChar(void) +{ + volatile int chr; + + while ((chr = read_char()) < 0) { + if (send_skb) { + kgdb_eth_reply_arp(); + } + if (kgdb_netdevice->poll_controller) { + kgdb_netdevice->poll_controller(kgdb_netdevice); + } else { + printk("KGDB NET: Error - Device %s is not supported!\n", kgdb_netdevice->name); + panic("Please add support for kgdb net to this driver"); + } + } + return chr; +} + +#define ETH_QUEUE_SIZE 256 +static char eth_queue[ETH_QUEUE_SIZE]; +static int outgoing_queue; + +void +eth_flushDebugChar(void) +{ + if(outgoing_queue) { + write_buffer(eth_queue, outgoing_queue); + + outgoing_queue = 0; + } +} + +static void +put_char_on_queue(int chr) +{ + eth_queue[outgoing_queue++] = chr; + if(outgoing_queue == ETH_QUEUE_SIZE) + { + eth_flushDebugChar(); + } +} + +/* + * eth_putDebugChar + * + * This is a GDB stub routine. It waits until the interface is ready + * to transmit a char and then sends it. + */ +void +eth_putDebugChar(int chr) +{ + put_char_on_queue(chr); /* this routine will wait */ +} + +void +kgdb_eth_set_trapmode(int mode) +{ + if (!kgdb_netdevice) { + return; + } + kgdb_netdevice->kgdb_is_trapped = mode; +} + +int +kgdb_eth_is_trapped() +{ + if (!kgdb_netdevice) { + return 0; + } + return kgdb_netdevice->kgdb_is_trapped; +} +EXPORT_SYMBOL(kgdb_eth_is_trapped); --- diff/drivers/pci/msi.c 1970-01-01 01:00:00.000000000 +0100 +++ source/drivers/pci/msi.c 2003-11-26 10:09:06.000000000 +0000 @@ -0,0 +1,1068 @@ +/* + * linux/drivers/pci/msi.c + */ + +#include <linux/mm.h> +#include <linux/irq.h> +#include <linux/interrupt.h> +#include <linux/init.h> +#include <linux/config.h> +#include <linux/ioport.h> +#include <linux/smp_lock.h> +#include <linux/pci.h> +#include <linux/proc_fs.h> + +#include <asm/errno.h> +#include <asm/io.h> +#include <asm/smp.h> +#include <asm/desc.h> +#include <asm/io_apic.h> +#include <mach_apic.h> + +#include <linux/pci_msi.h> + +_DEFINE_DBG_BUFFER + +static spinlock_t msi_lock = SPIN_LOCK_UNLOCKED; +static struct msi_desc* msi_desc[NR_IRQS] = { [0 ... NR_IRQS-1] = NULL }; +static kmem_cache_t* msi_cachep; + +static int pci_msi_enable = 1; +static int nr_alloc_vectors = 0; +static int nr_released_vectors = 0; +static int nr_reserved_vectors = NR_HP_RESERVED_VECTORS; +static int nr_msix_devices = 0; + +#ifndef CONFIG_X86_IO_APIC +int vector_irq[NR_IRQS] = { [0 ... 
NR_IRQS -1] = -1}; +u8 irq_vector[NR_IRQS] = { FIRST_DEVICE_VECTOR , 0 }; +#endif + +static void msi_cache_ctor(void *p, kmem_cache_t *cache, unsigned long flags) +{ + memset(p, 0, NR_IRQS * sizeof(struct msi_desc)); +} + +static int msi_cache_init(void) +{ + msi_cachep = kmem_cache_create("msi_cache", + NR_IRQS * sizeof(struct msi_desc), + 0, SLAB_HWCACHE_ALIGN, msi_cache_ctor, NULL); + if (!msi_cachep) + return -ENOMEM; + + return 0; +} + +static void msi_set_mask_bit(unsigned int vector, int flag) +{ + struct msi_desc *entry; + + entry = (struct msi_desc *)msi_desc[vector]; + if (!entry || !entry->dev || !entry->mask_base) + return; + switch (entry->msi_attrib.type) { + case PCI_CAP_ID_MSI: + { + int pos; + unsigned int mask_bits; + + pos = entry->mask_base; + entry->dev->bus->ops->read(entry->dev->bus, entry->dev->devfn, + pos, 4, &mask_bits); + mask_bits &= ~(1); + mask_bits |= flag; + entry->dev->bus->ops->write(entry->dev->bus, entry->dev->devfn, + pos, 4, mask_bits); + break; + } + case PCI_CAP_ID_MSIX: + { + int offset = entry->msi_attrib.entry_nr * PCI_MSIX_ENTRY_SIZE + + PCI_MSIX_ENTRY_VECTOR_CTRL_OFFSET; + writel(flag, entry->mask_base + offset); + break; + } + default: + break; + } +} + +#ifdef CONFIG_SMP +static void set_msi_affinity(unsigned int vector, unsigned long cpu_mask) +{ + struct msi_desc *entry; + struct msg_address address; + unsigned int dest_id; + + entry = (struct msi_desc *)msi_desc[vector]; + if (!entry || !entry->dev) + return; + + switch (entry->msi_attrib.type) { + case PCI_CAP_ID_MSI: + { + int pos; + + if (!(pos = pci_find_capability(entry->dev, PCI_CAP_ID_MSI))) + return; + + entry->dev->bus->ops->read(entry->dev->bus, entry->dev->devfn, + msi_lower_address_reg(pos), 4, + &address.lo_address.value); + dest_id = (address.lo_address.u.dest_id & + MSI_ADDRESS_HEADER_MASK) | + (cpu_mask_to_apicid(cpu_mask) << MSI_TARGET_CPU_SHIFT); + address.lo_address.u.dest_id = dest_id; + entry->msi_attrib.current_cpu = cpu_mask_to_apicid(cpu_mask); + entry->dev->bus->ops->write(entry->dev->bus, entry->dev->devfn, + msi_lower_address_reg(pos), 4, + address.lo_address.value); + break; + } + case PCI_CAP_ID_MSIX: + { + int offset = entry->msi_attrib.entry_nr * PCI_MSIX_ENTRY_SIZE + + PCI_MSIX_ENTRY_LOWER_ADDR_OFFSET; + + address.lo_address.value = readl(entry->mask_base + offset); + dest_id = (address.lo_address.u.dest_id & + MSI_ADDRESS_HEADER_MASK) | + (cpu_mask_to_apicid(cpu_mask) << MSI_TARGET_CPU_SHIFT); + address.lo_address.u.dest_id = dest_id; + entry->msi_attrib.current_cpu = cpu_mask_to_apicid(cpu_mask); + writel(address.lo_address.value, entry->mask_base + offset); + break; + } + default: + break; + } +} + +static inline void move_msi(int vector) +{ + if (unlikely(pending_irq_balance_cpumask[vector])) { + set_msi_affinity(vector, pending_irq_balance_cpumask[vector]); + pending_irq_balance_cpumask[vector] = 0; + } +} +#endif + +static void mask_MSI_irq(unsigned int vector) +{ + msi_set_mask_bit(vector, 1); +} + +static void unmask_MSI_irq(unsigned int vector) +{ + msi_set_mask_bit(vector, 0); +} + +static unsigned int startup_msi_irq_wo_maskbit(unsigned int vector) +{ + return 0; /* never anything pending */ +} + +static void pci_disable_msi(unsigned int vector); +static void shutdown_msi_irq(unsigned int vector) +{ + pci_disable_msi(vector); +} + +#define shutdown_msi_irq_wo_maskbit shutdown_msi_irq +static void enable_msi_irq_wo_maskbit(unsigned int vector) {} +static void disable_msi_irq_wo_maskbit(unsigned int vector) {} +static void 
ack_msi_irq_wo_maskbit(unsigned int vector) {} +static void end_msi_irq_wo_maskbit(unsigned int vector) +{ + move_msi(vector); + ack_APIC_irq(); +} + +static unsigned int startup_msi_irq_w_maskbit(unsigned int vector) +{ + unmask_MSI_irq(vector); + return 0; /* never anything pending */ +} + +#define shutdown_msi_irq_w_maskbit shutdown_msi_irq +#define enable_msi_irq_w_maskbit unmask_MSI_irq +#define disable_msi_irq_w_maskbit mask_MSI_irq +#define ack_msi_irq_w_maskbit mask_MSI_irq + +static void end_msi_irq_w_maskbit(unsigned int vector) +{ + move_msi(vector); + unmask_MSI_irq(vector); + ack_APIC_irq(); +} + +/* + * Interrupt Type for MSI-X PCI/PCI-X/PCI-Express Devices, + * which implement the MSI-X Capability Structure. + */ +static struct hw_interrupt_type msix_irq_type = { + .typename = "PCI MSI-X", + .startup = startup_msi_irq_w_maskbit, + .shutdown = shutdown_msi_irq_w_maskbit, + .enable = enable_msi_irq_w_maskbit, + .disable = disable_msi_irq_w_maskbit, + .ack = ack_msi_irq_w_maskbit, + .end = end_msi_irq_w_maskbit, + .set_affinity = set_msi_irq_affinity +}; + +/* + * Interrupt Type for MSI PCI/PCI-X/PCI-Express Devices, + * which implement the MSI Capability Structure with + * Mask-and-Pending Bits. + */ +static struct hw_interrupt_type msi_irq_w_maskbit_type = { + .typename = "PCI MSI", + .startup = startup_msi_irq_w_maskbit, + .shutdown = shutdown_msi_irq_w_maskbit, + .enable = enable_msi_irq_w_maskbit, + .disable = disable_msi_irq_w_maskbit, + .ack = ack_msi_irq_w_maskbit, + .end = end_msi_irq_w_maskbit, + .set_affinity = set_msi_irq_affinity +}; + +/* + * Interrupt Type for MSI PCI/PCI-X/PCI-Express Devices, + * which implement the MSI Capability Structure without + * Mask-and-Pending Bits. + */ +static struct hw_interrupt_type msi_irq_wo_maskbit_type = { + .typename = "PCI MSI", + .startup = startup_msi_irq_wo_maskbit, + .shutdown = shutdown_msi_irq_wo_maskbit, + .enable = enable_msi_irq_wo_maskbit, + .disable = disable_msi_irq_wo_maskbit, + .ack = ack_msi_irq_wo_maskbit, + .end = end_msi_irq_wo_maskbit, + .set_affinity = set_msi_irq_affinity +}; + +static void msi_data_init(struct msg_data *msi_data, + unsigned int vector) +{ + memset(msi_data, 0, sizeof(struct msg_data)); + msi_data->vector = (u8)vector; + msi_data->delivery_mode = MSI_DELIVERY_MODE; + msi_data->level = MSI_LEVEL_MODE; + msi_data->trigger = MSI_TRIGGER_MODE; +} + +static void msi_address_init(struct msg_address *msi_address) +{ + unsigned int dest_id; + + memset(msi_address, 0, sizeof(struct msg_address)); + msi_address->hi_address = (u32)0; + dest_id = (MSI_ADDRESS_HEADER << MSI_ADDRESS_HEADER_SHIFT) | + (MSI_TARGET_CPU << MSI_TARGET_CPU_SHIFT); + msi_address->lo_address.u.dest_mode = MSI_LOGICAL_MODE; + msi_address->lo_address.u.redirection_hint = MSI_REDIRECTION_HINT_MODE; + msi_address->lo_address.u.dest_id = dest_id; +} + +static int pci_vector_resources(void) +{ + static int res = -EINVAL; + int nr_free_vectors; + + if (res == -EINVAL) { + int i, repeat; + for (i = NR_REPEATS; i > 0; i--) { + if ((FIRST_DEVICE_VECTOR + i * 8) > FIRST_SYSTEM_VECTOR) + continue; + break; + } + i++; + repeat = (FIRST_SYSTEM_VECTOR - FIRST_DEVICE_VECTOR)/i; + res = i * repeat - NR_RESERVED_VECTORS + 1; + } + + nr_free_vectors = res + nr_released_vectors - nr_alloc_vectors; + + return nr_free_vectors; +} + +int assign_irq_vector(int irq) +{ + static int current_vector = FIRST_DEVICE_VECTOR, offset = 0; + + if (irq != MSI_AUTO && IO_APIC_VECTOR(irq) > 0) + return IO_APIC_VECTOR(irq); +next: + current_vector += 8; + if 
(current_vector == SYSCALL_VECTOR)
+		goto next;
+
+	if (current_vector > FIRST_SYSTEM_VECTOR) {
+		offset++;
+		current_vector = FIRST_DEVICE_VECTOR + offset;
+	}
+
+	if (current_vector == FIRST_SYSTEM_VECTOR)
+		return -ENOSPC;
+
+	vector_irq[current_vector] = irq;
+	if (irq != MSI_AUTO)
+		IO_APIC_VECTOR(irq) = current_vector;
+
+	nr_alloc_vectors++;
+
+	return current_vector;
+}
+
+static int assign_msi_vector(void)
+{
+	static int new_vector_avail = 1;
+	int vector;
+	unsigned long flags;
+
+	/*
+	 * msi_lock is provided to ensure that each successfully allocated
+	 * MSI vector is unique among drivers.
+	 */
+	spin_lock_irqsave(&msi_lock, flags);
+	if (!(pci_vector_resources() > 0)) {
+		spin_unlock_irqrestore(&msi_lock, flags);
+		return -EBUSY;
+	}
+
+	if (!new_vector_avail) {
+		/*
+		 * vector_irq[] = -1 indicates that this specific vector is:
+		 * - assigned for MSI (since MSI has no associated IRQ) or
+		 * - assigned for legacy if less than 16, or
+		 * - having no corresponding 1:1 vector-to-IOxAPIC IRQ mapping
+		 * vector_irq[] = 0 indicates that this vector, previously
+		 * assigned for MSI, was freed by a hotplug remove operation.
+		 * Such a vector will be reused for any subsequent hotplug add
+		 * operation.
+		 * vector_irq[] > 0 indicates that this vector is assigned for
+		 * IOxAPIC IRQs. This vector and its value provide a 1-to-1
+		 * vector-to-IOxAPIC IRQ mapping.
+		 */
+		for (vector = FIRST_DEVICE_VECTOR; vector < NR_IRQS; vector++) {
+			if (vector_irq[vector] != 0)
+				continue;
+			vector_irq[vector] = -1;
+			nr_released_vectors--;
+			spin_unlock_irqrestore(&msi_lock, flags);
+			return vector;
+		}
+		spin_unlock_irqrestore(&msi_lock, flags);
+		return -EBUSY;
+	}
+
+	vector = assign_irq_vector(MSI_AUTO);
+	if (vector == (FIRST_SYSTEM_VECTOR - 8))
+		new_vector_avail = 0;
+
+	spin_unlock_irqrestore(&msi_lock, flags);
+	return vector;
+}
+
+static int get_new_vector(void)
+{
+	int vector;
+
+	if ((vector = assign_msi_vector()) > 0)
+		set_intr_gate(vector, interrupt[vector]);
+
+	return vector;
+}
+
+static int msi_init(void)
+{
+	static int status = -ENOMEM;
+
+	if (!status)
+		return status;
+
+	if ((status = msi_cache_init()) < 0) {
+		pci_msi_enable = 0;
+		printk(KERN_INFO "WARNING: MSI INIT FAILURE\n");
+		return status;
+	}
+	printk(KERN_INFO "MSI INIT SUCCESS\n");
+
+	return status;
+}
+
+static int get_msi_vector(struct pci_dev *dev)
+{
+	return get_new_vector();
+}
+
+static struct msi_desc* alloc_msi_entry(void)
+{
+	struct msi_desc *entry;
+
+	entry = (struct msi_desc*) kmem_cache_alloc(msi_cachep, SLAB_KERNEL);
+	if (!entry)
+		return NULL;
+
+	memset(entry, 0, sizeof(struct msi_desc));
+	entry->link.tail = entry->link.head = 0;	/* single message */
+	entry->dev = NULL;
+
+	return entry;
+}
+
+static void attach_msi_entry(struct msi_desc *entry, int vector)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&msi_lock, flags);
+	msi_desc[vector] = entry;
+	spin_unlock_irqrestore(&msi_lock, flags);
+}
+
+static void irq_handler_init(int cap_id, int pos, int mask)
+{
+	spin_lock(&irq_desc[pos].lock);
+	if (cap_id == PCI_CAP_ID_MSIX)
+		irq_desc[pos].handler = &msix_irq_type;
+	else {
+		if (!mask)
+			irq_desc[pos].handler = &msi_irq_wo_maskbit_type;
+		else
+			irq_desc[pos].handler = &msi_irq_w_maskbit_type;
+	}
+	spin_unlock(&irq_desc[pos].lock);
+}
+
+static void enable_msi_mode(struct pci_dev *dev, int pos, int type)
+{
+	u32 control;
+
+	dev->bus->ops->read(dev->bus, dev->devfn,
+		msi_control_reg(pos), 2, &control);
+	if (type == PCI_CAP_ID_MSI) {
+		/* Set enabled bits to single MSI & enable MSI_enable bit */
MSI_enable bit */ + msi_enable(control, 1); + dev->bus->ops->write(dev->bus, dev->devfn, + msi_control_reg(pos), 2, control); + } else { + msix_enable(control); + dev->bus->ops->write(dev->bus, dev->devfn, + msi_control_reg(pos), 2, control); + } + if (pci_find_capability(dev, PCI_CAP_ID_EXP)) { + /* PCI Express Endpoint device detected */ + u32 cmd; + dev->bus->ops->read(dev->bus, dev->devfn, PCI_COMMAND, 2, &cmd); + cmd |= PCI_COMMAND_INTX_DISABLE; + dev->bus->ops->write(dev->bus, dev->devfn, PCI_COMMAND, 2, cmd); + } +} + +static void disable_msi_mode(struct pci_dev *dev, int pos, int type) +{ + u32 control; + + dev->bus->ops->read(dev->bus, dev->devfn, + msi_control_reg(pos), 2, &control); + if (type == PCI_CAP_ID_MSI) { + /* Clear the MSI_enable bit */ + msi_disable(control); + dev->bus->ops->write(dev->bus, dev->devfn, + msi_control_reg(pos), 2, control); + } else { + msix_disable(control); + dev->bus->ops->write(dev->bus, dev->devfn, + msi_control_reg(pos), 2, control); + } + if (pci_find_capability(dev, PCI_CAP_ID_EXP)) { + /* PCI Express Endpoint device detected */ + u32 cmd; + dev->bus->ops->read(dev->bus, dev->devfn, PCI_COMMAND, 2, &cmd); + cmd &= ~PCI_COMMAND_INTX_DISABLE; + dev->bus->ops->write(dev->bus, dev->devfn, PCI_COMMAND, 2, cmd); + } +} + +static int msi_lookup_vector(struct pci_dev *dev) +{ + int vector; + unsigned long flags; + + spin_lock_irqsave(&msi_lock, flags); + for (vector = FIRST_DEVICE_VECTOR; vector < NR_IRQS; vector++) { + if (!msi_desc[vector] || msi_desc[vector]->dev != dev || + msi_desc[vector]->msi_attrib.entry_nr || + msi_desc[vector]->msi_attrib.default_vector != dev->irq) + continue; /* not entry 0, skip */ + spin_unlock_irqrestore(&msi_lock, flags); + /* This pre-assigned entry-0 MSI vector for this device + already exists. Override dev->irq with this vector */ + dev->irq = vector; + return 0; + } + spin_unlock_irqrestore(&msi_lock, flags); + + return -EACCES; +} + +void pci_scan_msi_device(struct pci_dev *dev) +{ + if (!dev) + return; + + if (pci_find_capability(dev, PCI_CAP_ID_MSIX) > 0) { + nr_reserved_vectors++; + nr_msix_devices++; + } else if (pci_find_capability(dev, PCI_CAP_ID_MSI) > 0) + nr_reserved_vectors++; +} + +/** + * msi_capability_init - configure device's MSI capability structure + * @dev: pointer to the pci_dev data structure of MSI device function + * + * Set up the MSI capability structure of the device function with a + * single MSI vector, regardless of whether the device function is + * capable of handling multiple messages. A return of zero indicates + * successful setup of entry zero with the new MSI vector; non-zero + * indicates failure. 
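+ * + * Illustrative effect (an editorial sketch; the vector values are + * hypothetical): if the pin-assertion vector was 0x31 before the call, + * then on success + * + *	entry->msi_attrib.default_vector == 0x31   (old dev->irq, restored on disable) + *	dev->irq == vector                         (the new MSI vector) + * + * so drivers simply keep passing dev->irq to request_irq()/free_irq(). 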
+ **/ +static int msi_capability_init(struct pci_dev *dev) +{ + struct msi_desc *entry; + struct msg_address address; + struct msg_data data; + int pos, vector; + u32 control; + + pos = pci_find_capability(dev, PCI_CAP_ID_MSI); + if (!pos) + return -EINVAL; + + dev->bus->ops->read(dev->bus, dev->devfn, msi_control_reg(pos), + 2, &control); + if (control & PCI_MSI_FLAGS_ENABLE) + return 0; + + if (!msi_lookup_vector(dev)) { + /* Lookup success */ + enable_msi_mode(dev, pos, PCI_CAP_ID_MSI); + return 0; + } + /* MSI Entry Initialization */ + if (!(entry = alloc_msi_entry())) + return -ENOMEM; + + if ((vector = get_msi_vector(dev)) < 0) { + kmem_cache_free(msi_cachep, entry); + return -EBUSY; + } + entry->msi_attrib.type = PCI_CAP_ID_MSI; + entry->msi_attrib.entry_nr = 0; + entry->msi_attrib.maskbit = is_mask_bit_support(control); + entry->msi_attrib.default_vector = dev->irq; + dev->irq = vector; /* default pre-assigned ioapic vector saved above */ + entry->dev = dev; + if (is_mask_bit_support(control)) { + entry->mask_base = msi_mask_bits_reg(pos, + is_64bit_address(control)); + } + /* Replace with MSI handler */ + irq_handler_init(PCI_CAP_ID_MSI, vector, entry->msi_attrib.maskbit); + /* Configure MSI capability structure */ + msi_address_init(&address); + msi_data_init(&data, vector); + entry->msi_attrib.current_cpu = ((address.lo_address.u.dest_id >> + MSI_TARGET_CPU_SHIFT) & MSI_TARGET_CPU_MASK); + dev->bus->ops->write(dev->bus, dev->devfn, msi_lower_address_reg(pos), + 4, address.lo_address.value); + if (is_64bit_address(control)) { + dev->bus->ops->write(dev->bus, dev->devfn, + msi_upper_address_reg(pos), 4, address.hi_address); + dev->bus->ops->write(dev->bus, dev->devfn, + msi_data_reg(pos, 1), 2, *((u32*)&data)); + } else + dev->bus->ops->write(dev->bus, dev->devfn, + msi_data_reg(pos, 0), 2, *((u32*)&data)); + if (entry->msi_attrib.maskbit) { + unsigned int maskbits, temp; + /* All MSIs are unmasked by default; mask them all */ + dev->bus->ops->read(dev->bus, dev->devfn, + msi_mask_bits_reg(pos, is_64bit_address(control)), 4, + &maskbits); + temp = (1 << multi_msi_capable(control)); + temp = ((temp - 1) & ~temp); + maskbits |= temp; + dev->bus->ops->write(dev->bus, dev->devfn, + msi_mask_bits_reg(pos, is_64bit_address(control)), 4, + maskbits); + } + attach_msi_entry(entry, vector); + /* Set MSI enabled bits */ + enable_msi_mode(dev, pos, PCI_CAP_ID_MSI); + + return 0; +} + +/** + * msix_capability_init - configure device's MSI-X capability + * @dev: pointer to the pci_dev data structure of MSI-X device function + * + * Set up the MSI-X capability structure of the device function with a + * single MSI-X vector. A return of zero indicates successful setup of + * entry zero with the new MSI-X vector; non-zero indicates failure. + * To request additional MSI-X vectors, device drivers must use the + * following supported APIs: + * 1) msi_alloc_vectors(...) for requesting one or more MSI-X vectors + * 2) msi_free_vectors(...) for releasing one or more MSI-X vectors + * back to PCI subsystem before calling free_irq(...) 
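+ * + * A hypothetical driver-side sequence (editorial sketch only; pdev, + * vec[] and the vector count are illustrative, not part of this API): + * + *	int vec[4]; + *	if (!pci_enable_msi(pdev) && + *	    !msi_alloc_vectors(pdev, vec, 4)) { + *		... request_irq() on pdev->irq and vec[0..3] ... + *		msi_free_vectors(pdev, vec, 4); + *		... then free_irq() on each vector ... + *	} 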
+ **/ +static int msix_capability_init(struct pci_dev *dev) +{ + struct msi_desc *entry; + struct msg_address address; + struct msg_data data; + int vector = 0, pos, i, dev_msi_cap; + u32 phys_addr, table_offset; + u32 control; + u8 bir; + void *base; + + pos = pci_find_capability(dev, PCI_CAP_ID_MSIX); + if (!pos) + return -EINVAL; + + /* Request & Map MSI-X table region */ + dev->bus->ops->read(dev->bus, dev->devfn, msi_control_reg(pos), 2, + &control); + if (control & PCI_MSIX_FLAGS_ENABLE) + return 0; + + if (!msi_lookup_vector(dev)) { + /* Lookup success */ + enable_msi_mode(dev, pos, PCI_CAP_ID_MSIX); + return 0; + } + + dev_msi_cap = multi_msix_capable(control); + dev->bus->ops->read(dev->bus, dev->devfn, + msix_table_offset_reg(pos), 4, &table_offset); + bir = (u8)(table_offset & PCI_MSIX_FLAGS_BIRMASK); + phys_addr = pci_resource_start (dev, bir); + phys_addr += (u32)(table_offset & ~PCI_MSIX_FLAGS_BIRMASK); + if (!request_mem_region(phys_addr, + dev_msi_cap * PCI_MSIX_ENTRY_SIZE, + "MSI-X iomap Failure")) + return -ENOMEM; + base = ioremap_nocache(phys_addr, dev_msi_cap * PCI_MSIX_ENTRY_SIZE); + if (base == NULL) + goto free_region; + /* MSI Entry Initialization */ + entry = alloc_msi_entry(); + if (!entry) + goto free_iomap; + if ((vector = get_msi_vector(dev)) < 0) + goto free_entry; + + entry->msi_attrib.type = PCI_CAP_ID_MSIX; + entry->msi_attrib.entry_nr = 0; + entry->msi_attrib.maskbit = 1; + entry->msi_attrib.default_vector = dev->irq; + dev->irq = vector; /* default pre-assigned ioapic vector saved above */ + entry->dev = dev; + entry->mask_base = (unsigned long)base; + /* Replace with MSI handler */ + irq_handler_init(PCI_CAP_ID_MSIX, vector, 1); + /* Configure MSI-X capability structure */ + msi_address_init(&address); + msi_data_init(&data, vector); + entry->msi_attrib.current_cpu = ((address.lo_address.u.dest_id >> + MSI_TARGET_CPU_SHIFT) & MSI_TARGET_CPU_MASK); + writel(address.lo_address.value, base + PCI_MSIX_ENTRY_LOWER_ADDR_OFFSET); + writel(address.hi_address, base + PCI_MSIX_ENTRY_UPPER_ADDR_OFFSET); + writel(*(u32*)&data, base + PCI_MSIX_ENTRY_DATA_OFFSET); + /* Zero-initialize the remaining entries, 1 .. dev_msi_cap - 1. + Loop on i, not pos, which must keep the capability offset + for enable_msi_mode() below. */ + for (i = 1; i < dev_msi_cap; i++) { + writel(0, base + i * PCI_MSIX_ENTRY_SIZE + + PCI_MSIX_ENTRY_LOWER_ADDR_OFFSET); + writel(0, base + i * PCI_MSIX_ENTRY_SIZE + + PCI_MSIX_ENTRY_UPPER_ADDR_OFFSET); + writel(0, base + i * PCI_MSIX_ENTRY_SIZE + + PCI_MSIX_ENTRY_DATA_OFFSET); + } + attach_msi_entry(entry, vector); + /* Set MSI enabled bits */ + enable_msi_mode(dev, pos, PCI_CAP_ID_MSIX); + + return 0; + +free_entry: + kmem_cache_free(msi_cachep, entry); +free_iomap: + iounmap(base); +free_region: + release_mem_region(phys_addr, dev_msi_cap * PCI_MSIX_ENTRY_SIZE); + + return ((vector < 0) ? -EBUSY : -ENOMEM); +} + +/** + * pci_enable_msi - configure device's MSI(X) capability structure + * @dev: pointer to the pci_dev data structure of MSI(X) device function + * + * Set up the MSI/MSI-X capability structure of the device function with + * a single MSI(X) vector when its driver requests MSI(X) mode for the + * hardware device function. A return of zero indicates successful setup + * of entry zero with the new MSI(X) vector; non-zero indicates failure. 
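+ * + * Typical call site (an editorial sketch; my_isr and the device name + * are placeholders): + * + *	if (pci_enable_msi(pdev)) + *		... fall back to the pin-based (INTx) pdev->irq ... + *	if (request_irq(pdev->irq, my_isr, SA_SHIRQ, "mydev", pdev)) + *		... handle the error ... 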
+ **/ +int pci_enable_msi(struct pci_dev* dev) +{ + int status = -EINVAL; + + if (!pci_msi_enable || !dev) + return status; + + if (msi_init() < 0) + return -ENOMEM; + + if ((status = msix_capability_init(dev)) == -EINVAL) + status = msi_capability_init(dev); + if (!status) + nr_reserved_vectors--; + + return status; +} + +static int msi_free_vector(struct pci_dev* dev, int vector); +static void pci_disable_msi(unsigned int vector) +{ + int head, tail, type, default_vector; + struct msi_desc *entry; + struct pci_dev *dev; + unsigned long flags; + + spin_lock_irqsave(&msi_lock, flags); + entry = msi_desc[vector]; + if (!entry || !entry->dev) { + spin_unlock_irqrestore(&msi_lock, flags); + return; + } + dev = entry->dev; + type = entry->msi_attrib.type; + head = entry->link.head; + tail = entry->link.tail; + default_vector = entry->msi_attrib.default_vector; + spin_unlock_irqrestore(&msi_lock, flags); + + disable_msi_mode(dev, pci_find_capability(dev, type), type); + /* Restore dev->irq to its default pin-assertion vector */ + dev->irq = default_vector; + if (type == PCI_CAP_ID_MSIX && head != tail) { + /* A bad driver did not call msi_free_vectors before exit; + we must do the cleanup here */ + while (1) { + spin_lock_irqsave(&msi_lock, flags); + entry = msi_desc[vector]; + head = entry->link.head; + tail = entry->link.tail; + spin_unlock_irqrestore(&msi_lock, flags); + if (tail == head) + break; + if (msi_free_vector(dev, entry->link.tail)) + break; + } + } +} + +static int msi_alloc_vector(struct pci_dev* dev, int head) +{ + struct msi_desc *entry; + struct msg_address address; + struct msg_data data; + int i, offset, pos, dev_msi_cap, vector; + u32 low_address, control; + unsigned long base = 0L; + unsigned long flags; + + spin_lock_irqsave(&msi_lock, flags); + entry = msi_desc[dev->irq]; + if (!entry) { + spin_unlock_irqrestore(&msi_lock, flags); + return -EINVAL; + } + base = entry->mask_base; + spin_unlock_irqrestore(&msi_lock, flags); + + pos = pci_find_capability(dev, PCI_CAP_ID_MSIX); + dev->bus->ops->read(dev->bus, dev->devfn, msi_control_reg(pos), + 2, &control); + dev_msi_cap = multi_msix_capable(control); + for (i = 1; i < dev_msi_cap; i++) { + if (!(low_address = readl(base + i * PCI_MSIX_ENTRY_SIZE))) + break; + } + if (i >= dev_msi_cap) + return -EINVAL; + + /* MSI Entry Initialization */ + if (!(entry = alloc_msi_entry())) + return -ENOMEM; + + if ((vector = get_new_vector()) < 0) { + kmem_cache_free(msi_cachep, entry); + return vector; + } + entry->msi_attrib.type = PCI_CAP_ID_MSIX; + entry->msi_attrib.entry_nr = i; + entry->msi_attrib.maskbit = 1; + entry->dev = dev; + entry->link.head = head; + entry->mask_base = base; + irq_handler_init(PCI_CAP_ID_MSIX, vector, 1); + /* Configure MSI-X capability structure */ + msi_address_init(&address); + msi_data_init(&data, vector); + entry->msi_attrib.current_cpu = ((address.lo_address.u.dest_id >> + MSI_TARGET_CPU_SHIFT) & MSI_TARGET_CPU_MASK); + offset = entry->msi_attrib.entry_nr * PCI_MSIX_ENTRY_SIZE; + writel(address.lo_address.value, base + offset + + PCI_MSIX_ENTRY_LOWER_ADDR_OFFSET); + writel(address.hi_address, base + offset + + PCI_MSIX_ENTRY_UPPER_ADDR_OFFSET); + writel(*(u32*)&data, base + offset + PCI_MSIX_ENTRY_DATA_OFFSET); + writel(1, base + offset + PCI_MSIX_ENTRY_VECTOR_CTRL_OFFSET); + attach_msi_entry(entry, vector); + + return vector; +} + +static int msi_free_vector(struct pci_dev* dev, int vector) +{ + struct msi_desc *entry; + int entry_nr, type; + unsigned long base = 0L; + unsigned long flags; +
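+ /* + * Detach the descriptor from its neighbours (link.head/link.tail), + * mark the vector reusable (vector_irq[] = 0, nr_released_vectors++), + * free the descriptor itself, and for MSI-X also mask and clear the + * corresponding table entry. + */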
+ spin_lock_irqsave(&msi_lock, flags); + entry = msi_desc[vector]; + if (!entry || entry->dev != dev) { + spin_unlock_irqrestore(&msi_lock, flags); + return -EINVAL; + } + type = entry->msi_attrib.type; + entry_nr = entry->msi_attrib.entry_nr; + base = entry->mask_base; + if (entry->link.tail != entry->link.head) { + msi_desc[entry->link.head]->link.tail = entry->link.tail; + if (entry->link.tail) + msi_desc[entry->link.tail]->link.head = entry->link.head; + } + entry->dev = NULL; + vector_irq[vector] = 0; + nr_released_vectors++; + msi_desc[vector] = NULL; + spin_unlock_irqrestore(&msi_lock, flags); + + kmem_cache_free(msi_cachep, entry); + if (type == PCI_CAP_ID_MSIX) { + int offset; + + offset = entry_nr * PCI_MSIX_ENTRY_SIZE; + writel(1, base + offset + PCI_MSIX_ENTRY_VECTOR_CTRL_OFFSET); + writel(0, base + offset + PCI_MSIX_ENTRY_LOWER_ADDR_OFFSET); + } + + return 0; +} + +/** + * msi_alloc_vectors - allocate additional MSI-X vectors + * @dev: pointer to the pci_dev data structure of MSI-X device function + * @vector: pointer to an array of newly allocated MSI-X vectors + * @nvec: number of MSI-X vectors requested for allocation by device driver + * + * Allocate additional MSI-X vectors requested by the device driver. A + * return of zero indicates successful setup of the MSI-X capability + * structure with the newly allocated MSI-X vectors; non-zero indicates + * failure. + **/ +int msi_alloc_vectors(struct pci_dev* dev, int *vector, int nvec) +{ + struct msi_desc *entry; + int i, head, pos, vec, free_vectors, alloc_vectors; + int *vectors = (int *)vector; + u32 control; + unsigned long flags; + + if (!pci_msi_enable || !dev) + return -EINVAL; + + if (!(pos = pci_find_capability(dev, PCI_CAP_ID_MSIX))) + return -EINVAL; + + dev->bus->ops->read(dev->bus, dev->devfn, msi_control_reg(pos), 2, &control); + if (nvec > multi_msix_capable(control)) + return -EINVAL; + + spin_lock_irqsave(&msi_lock, flags); + entry = msi_desc[dev->irq]; + if (!entry || entry->dev != dev || /* legal call */ + entry->msi_attrib.type != PCI_CAP_ID_MSIX || /* must be MSI-X */ + entry->link.head != entry->link.tail) { /* already multi */ + spin_unlock_irqrestore(&msi_lock, flags); + return -EINVAL; + } + /* + * msi_lock ensures that enough vector resources are + * available before granting. 
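+ * For example (purely hypothetical numbers): with 100 free vectors, + * 16 reserved and 4 MSI-X devices in the system, a device may claim + * at most (100 - 16) / 4 = 21 additional vectors per request. 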
+ */ + free_vectors = pci_vector_resources(); + /* Ensure that each MSI/MSI-X device has one vector reserved by + default, to prevent any single MSI-X driver from taking all + available resources */ + free_vectors -= nr_reserved_vectors; + /* Find the average of free vectors among MSI-X devices */ + if (nr_msix_devices > 0) + free_vectors /= nr_msix_devices; + spin_unlock_irqrestore(&msi_lock, flags); + + if (nvec > free_vectors) + return -EBUSY; + + alloc_vectors = 0; + head = dev->irq; + for (i = 0; i < nvec; i++) { + if ((vec = msi_alloc_vector(dev, head)) < 0) + break; + *(vectors + i) = vec; + head = vec; + alloc_vectors++; + } + if (alloc_vectors != nvec) { + for (i = 0; i < alloc_vectors; i++) { + vec = *(vectors + i); + msi_free_vector(dev, vec); + } + spin_lock_irqsave(&msi_lock, flags); + msi_desc[dev->irq]->link.tail = msi_desc[dev->irq]->link.head; + spin_unlock_irqrestore(&msi_lock, flags); + return -EBUSY; + } + if (nr_msix_devices > 0) + nr_msix_devices--; + + return 0; +} + +/** + * msi_free_vectors - reclaim MSI-X vectors to unused state + * @dev: pointer to the pci_dev data structure of MSI-X device function + * @vector: pointer to an array of released MSI-X vectors + * @nvec: number of MSI-X vectors requested for release by device driver + * + * Reclaim MSI-X vectors released by the device driver to the unused + * state, so that they may be reused later. A return of zero indicates + * success; non-zero indicates failure. Device drivers should call this + * before calling free_irq(). + **/ +int msi_free_vectors(struct pci_dev* dev, int *vector, int nvec) +{ + struct msi_desc *entry; + int i; + unsigned long flags; + + if (!pci_msi_enable) + return -EINVAL; + + spin_lock_irqsave(&msi_lock, flags); + entry = msi_desc[dev->irq]; + if (!entry || entry->dev != dev || + entry->msi_attrib.type != PCI_CAP_ID_MSIX || + entry->link.head == entry->link.tail) { /* Nothing to free */ + spin_unlock_irqrestore(&msi_lock, flags); + return -EINVAL; + } + spin_unlock_irqrestore(&msi_lock, flags); + + for (i = 0; i < nvec; i++) { + if (*(vector + i) == dev->irq) + continue; /* never free entry 0, even if passed by mistake */ + msi_free_vector(dev, *(vector + i)); + } + + return 0; +} + +/** + * msi_remove_pci_irq_vectors - reclaim MSI(X) vectors to unused state + * @dev: pointer to the pci_dev data structure of MSI(X) device function + * + * Called during hotplug removal, when the device function is + * hot-removed. All MSI/MSI-X vectors previously assigned to this + * device function are reclaimed to the unused state, so that they may + * be reused later. 
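+ * + * Hotplug-path sketch (editorial; the surrounding teardown steps are + * illustrative, not prescribed by this file): + * + *	msi_remove_pci_irq_vectors(dev);	reclaim all vectors first + *	... then detach and free the struct pci_dev ... 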
+ **/ +void msi_remove_pci_irq_vectors(struct pci_dev* dev) +{ + struct msi_desc *entry; + int type, temp; + unsigned long base = 0L; + unsigned long flags; + + if (!pci_msi_enable || !dev) + return; + + if (!pci_find_capability(dev, PCI_CAP_ID_MSI)) { + if (!pci_find_capability(dev, PCI_CAP_ID_MSIX)) + return; + } + temp = dev->irq; + if (msi_lookup_vector(dev)) + return; + + spin_lock_irqsave(&msi_lock, flags); + entry = msi_desc[dev->irq]; + if (!entry || entry->dev != dev) { + spin_unlock_irqrestore(&msi_lock, flags); + return; + } + type = entry->msi_attrib.type; + base = entry->mask_base; /* entry is freed below by msi_free_vector() */ + spin_unlock_irqrestore(&msi_lock, flags); + + msi_free_vector(dev, dev->irq); + if (type == PCI_CAP_ID_MSIX) { + int i, pos, dev_msi_cap; + u32 phys_addr, table_offset; + u32 control; + u8 bir; + + pos = pci_find_capability(dev, PCI_CAP_ID_MSIX); + dev->bus->ops->read(dev->bus, dev->devfn, msi_control_reg(pos), 2, &control); + dev_msi_cap = multi_msix_capable(control); + dev->bus->ops->read(dev->bus, dev->devfn, + msix_table_offset_reg(pos), 4, &table_offset); + bir = (u8)(table_offset & PCI_MSIX_FLAGS_BIRMASK); + phys_addr = pci_resource_start (dev, bir); + phys_addr += (u32)(table_offset & ~PCI_MSIX_FLAGS_BIRMASK); + for (i = FIRST_DEVICE_VECTOR; i < NR_IRQS; i++) { + spin_lock_irqsave(&msi_lock, flags); + if (!msi_desc[i] || msi_desc[i]->dev != dev) { + spin_unlock_irqrestore(&msi_lock, flags); + continue; + } + spin_unlock_irqrestore(&msi_lock, flags); + msi_free_vector(dev, i); + } + writel(1, base + PCI_MSIX_ENTRY_VECTOR_CTRL_OFFSET); + iounmap((void*)base); + release_mem_region(phys_addr, dev_msi_cap * PCI_MSIX_ENTRY_SIZE); + } + dev->irq = temp; + nr_reserved_vectors++; +} + +EXPORT_SYMBOL(pci_enable_msi); +EXPORT_SYMBOL(msi_alloc_vectors); +EXPORT_SYMBOL(msi_free_vectors); --- diff/drivers/pnp/isapnp/Kconfig 1970-01-01 01:00:00.000000000 +0100 +++ source/drivers/pnp/isapnp/Kconfig 2003-11-26 10:09:06.000000000 +0000 @@ -0,0 +1,11 @@ +# +# ISA Plug and Play configuration +# +config ISAPNP + bool "ISA Plug and Play support (EXPERIMENTAL)" + depends on PNP && EXPERIMENTAL + help + Say Y here if you would like support for ISA Plug and Play devices. + Some information is in <file:Documentation/isapnp.txt>. + + If unsure, say Y. --- diff/drivers/pnp/pnpbios/Kconfig 1970-01-01 01:00:00.000000000 +0100 +++ source/drivers/pnp/pnpbios/Kconfig 2003-11-26 10:09:06.000000000 +0000 @@ -0,0 +1,41 @@ +# +# Plug and Play BIOS configuration +# +config PNPBIOS + bool "Plug and Play BIOS support (EXPERIMENTAL)" + depends on PNP && EXPERIMENTAL + ---help--- + Linux uses the PNPBIOS as defined in "Plug and Play BIOS + Specification Version 1.0A May 5, 1994" to autodetect built-in + mainboard resources (e.g. parallel port resources). + + Some features (e.g. event notification, docking station information, + ISAPNP services) are not currently implemented. + + If you would like the kernel to detect and allocate resources to + your mainboard devices (on some systems they are disabled by the + BIOS) say Y here. Also the PNPBIOS can help prevent resource + conflicts between mainboard devices and other bus devices. + + Note: ACPI is expected to supersede PNPBIOS some day, currently it + co-exists nicely. If you have a non-ISA system that supports ACPI, + you probably don't need PNPBIOS support. + +config PNPBIOS_PROC_FS + bool "Plug and Play BIOS /proc interface" + depends on PNPBIOS && PROC_FS + ---help--- + If you say Y here and to "/proc file system support", you will be + able to directly access the PNPBIOS. 
This includes resource + allocation, ESCD, and other PNPBIOS services. Using this + interface is potentially dangerous because the PNPBIOS driver will + not be notified of any resource changes made by writing directly. + Also some buggy systems will fault when accessing certain features + in the PNPBIOS /proc interface (e.g. ESCD). + + See the latest pcmcia-cs (stand-alone package) for a nice set of + PNPBIOS /proc interface tools (lspnp and setpnp). + + Unless you are debugging or have other specific reasons, it is + recommended that you say N here. + --- diff/include/asm-alpha/lockmeter.h 1970-01-01 01:00:00.000000000 +0100 +++ source/include/asm-alpha/lockmeter.h 2003-11-26 10:09:07.000000000 +0000 @@ -0,0 +1,90 @@ +/* + * Written by John Hawkes (hawkes@sgi.com) + * Based on klstat.h by Jack Steiner (steiner@sgi.com) + * + * Modified by Peter Rival (frival@zk3.dec.com) + */ + +#ifndef _ALPHA_LOCKMETER_H +#define _ALPHA_LOCKMETER_H + +#include <asm/hwrpb.h> +#define CPU_CYCLE_FREQUENCY hwrpb->cycle_freq + +#define get_cycles64() get_cycles() + +#define THIS_CPU_NUMBER smp_processor_id() + +#include <linux/version.h> +#if LINUX_VERSION_CODE < KERNEL_VERSION(2,3,0) +#define local_irq_save(x) \ + __save_and_cli(x) +#define local_irq_restore(x) \ + __restore_flags(x) +#endif /* Linux version 2.2.x */ + +#define SPINLOCK_MAGIC_INIT /**/ + +/* + * Macros to cache and retrieve an index value inside of a lock + * these macros assume that there are less than 65536 simultaneous + * (read mode) holders of a rwlock. + * We also assume that the hash table has less than 32767 entries. + * the high order bit is used for write locking a rw_lock + * Note: although these defines and macros are the same as what is being used + * in include/asm-i386/lockmeter.h, they are present here to easily + * allow an alternate Alpha implementation. + */ +/* + * instrumented spinlock structure -- never used to allocate storage + * only used in macros below to overlay a spinlock_t + */ +typedef struct inst_spinlock_s { + /* remember, Alpha is little endian */ + unsigned short lock; + unsigned short index; +} inst_spinlock_t; +#define PUT_INDEX(lock_ptr,indexv) ((inst_spinlock_t *)(lock_ptr))->index = indexv +#define GET_INDEX(lock_ptr) ((inst_spinlock_t *)(lock_ptr))->index + +/* + * macros to cache and retrieve an index value in a read/write lock + * as well as the cpu where a reader busy period started + * we use the 2nd word (the debug word) for this, so require the + * debug word to be present + */ +/* + * instrumented rwlock structure -- never used to allocate storage + * only used in macros below to overlay a rwlock_t + */ +typedef struct inst_rwlock_s { + volatile int lock; + unsigned short index; + unsigned short cpu; +} inst_rwlock_t; +#define PUT_RWINDEX(rwlock_ptr,indexv) ((inst_rwlock_t *)(rwlock_ptr))->index = indexv +#define GET_RWINDEX(rwlock_ptr) ((inst_rwlock_t *)(rwlock_ptr))->index +#define PUT_RW_CPU(rwlock_ptr,cpuv) ((inst_rwlock_t *)(rwlock_ptr))->cpu = cpuv +#define GET_RW_CPU(rwlock_ptr) ((inst_rwlock_t *)(rwlock_ptr))->cpu + +/* + * return true if rwlock is write locked + * (note that other lock attempts can cause the lock value to be negative) + */ +#define RWLOCK_IS_WRITE_LOCKED(rwlock_ptr) (((inst_rwlock_t *)rwlock_ptr)->lock & 1) +#define IABS(x) ((x) > 0 ? 
(x) : -(x)) + +#define RWLOCK_READERS(rwlock_ptr) rwlock_readers(rwlock_ptr) +extern inline int rwlock_readers(rwlock_t *rwlock_ptr) +{ + int tmp = (int) ((inst_rwlock_t *)rwlock_ptr)->lock; + /* readers subtract 2, so we have to: */ + /* - andnot off a possible writer (bit 0) */ + /* - get the absolute value */ + /* - divide by 2 (right shift by one) */ + /* to find the number of readers */ + if (tmp == 0) return(0); + else return(IABS(tmp & ~1)>>1); +} + +#endif /* _ALPHA_LOCKMETER_H */ --- diff/include/asm-i386/atomic_kmap.h 1970-01-01 01:00:00.000000000 +0100 +++ source/include/asm-i386/atomic_kmap.h 2003-11-26 10:09:07.000000000 +0000 @@ -0,0 +1,95 @@ +/* + * atomic_kmap.h: temporary virtual kernel memory mappings + * + * Copyright (C) 2003 Ingo Molnar <mingo@redhat.com> + */ + +#ifndef _ASM_ATOMIC_KMAP_H +#define _ASM_ATOMIC_KMAP_H + +#ifdef __KERNEL__ + +#include <linux/config.h> +#include <asm/tlbflush.h> + +#ifdef CONFIG_DEBUG_HIGHMEM +#define HIGHMEM_DEBUG 1 +#else +#define HIGHMEM_DEBUG 0 +#endif + +extern pte_t *kmap_pte; +#define kmap_prot PAGE_KERNEL + +#define PKMAP_BASE (0xff000000UL) +#define NR_SHARED_PMDS ((0xffffffff-PKMAP_BASE+1)/PMD_SIZE) + +static inline unsigned long __kmap_atomic_vaddr(enum km_type type) +{ + enum fixed_addresses idx; + + idx = type + KM_TYPE_NR*smp_processor_id(); + return __fix_to_virt(FIX_KMAP_BEGIN + idx); +} + +static inline void *__kmap_atomic_noflush(struct page *page, enum km_type type) +{ + enum fixed_addresses idx; + unsigned long vaddr; + + idx = type + KM_TYPE_NR*smp_processor_id(); + vaddr = __fix_to_virt(FIX_KMAP_BEGIN + idx); + /* + * NOTE: entries that rely on some secondary TLB-flush + * effect must not be global: + */ + set_pte(kmap_pte-idx, mk_pte(page, PAGE_KERNEL)); + + return (void*) vaddr; +} + +static inline void *__kmap_atomic(struct page *page, enum km_type type) +{ + enum fixed_addresses idx; + unsigned long vaddr; + + idx = type + KM_TYPE_NR*smp_processor_id(); + vaddr = __fix_to_virt(FIX_KMAP_BEGIN + idx); +#if HIGHMEM_DEBUG + BUG_ON(!pte_none(*(kmap_pte-idx))); +#else + /* + * Performance optimization - do not flush if the new + * pte is the same as the old one: + */ + if (pte_val(*(kmap_pte-idx)) == pte_val(mk_pte(page, kmap_prot))) + return (void *) vaddr; +#endif + set_pte(kmap_pte-idx, mk_pte(page, kmap_prot)); + __flush_tlb_one(vaddr); + + return (void*) vaddr; +} + +static inline void __kunmap_atomic(void *kvaddr, enum km_type type) +{ +#if HIGHMEM_DEBUG + unsigned long vaddr = (unsigned long) kvaddr & PAGE_MASK; + enum fixed_addresses idx = type + KM_TYPE_NR*smp_processor_id(); + + BUG_ON(vaddr != __fix_to_virt(FIX_KMAP_BEGIN+idx)); + /* + * force other mappings to Oops if they'll try to access + * this pte without first remapping it + */ + pte_clear(kmap_pte-idx); + __flush_tlb_one(vaddr); +#endif +} + +#define __kunmap_atomic_type(type) \ + __kunmap_atomic((void *)__kmap_atomic_vaddr(type), (type)) + +#endif /* __KERNEL__ */ + +#endif /* _ASM_ATOMIC_KMAP_H */ --- diff/include/asm-i386/kgdb.h 1970-01-01 01:00:00.000000000 +0100 +++ source/include/asm-i386/kgdb.h 2003-11-26 10:09:07.000000000 +0000 @@ -0,0 +1,76 @@ +#ifndef __KGDB +#define __KGDB + +/* + * This file should not include ANY others. This makes it usable + * most anywhere without the fear of include order or inclusion. + * Make it so! + * + * This file may be included all the time. It is only active if + * CONFIG_KGDB is defined, otherwise it stubs out all the macros + and entry points. 
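+ * + * Minimal use sketch (editorial; assumes CONFIG_KGDB=y and a gdb + * attached over the configured serial line): + * + *	if (suspect_condition) + *		BREAKPOINT;	traps into the stub via int $3 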
+ */ +#if defined(CONFIG_KGDB) && !defined(__ASSEMBLY__) + +extern void breakpoint(void); +#define INIT_KGDB_INTS kgdb_enable_ints() + +#ifndef BREAKPOINT +#define BREAKPOINT asm(" int $3") +#endif + +struct sk_buff; + +extern int kgdb_eth; +extern unsigned kgdb_remoteip; +extern unsigned short kgdb_listenport; +extern unsigned short kgdb_sendport; +extern unsigned char kgdb_remotemac[6]; +extern unsigned char kgdb_localmac[6]; +extern int kgdb_eth_need_breakpoint[]; + +extern int kgdb_tty_hook(void); +extern int kgdb_eth_hook(void); +extern int gdb_net_interrupt(struct sk_buff *skb); + +/* + * GDB debug stub (or any debug stub) can point the 'linux_debug_hook' + * pointer to its routine and it will be entered as the first thing + * when a trap occurs. + * + * Return values are, at present, undefined. + * + * The debug hook routine does not necessarily return to its caller. + * It has the register image and thus may choose to resume execution + * anywhere it pleases. + */ +struct pt_regs; +struct sk_buff; + +extern int kgdb_handle_exception(int trapno, + int signo, int err_code, struct pt_regs *regs); +extern int in_kgdb(struct pt_regs *regs); +extern int kgdb_net_interrupt(struct sk_buff *skb); + +#ifdef CONFIG_KGDB_TS +void kgdb_tstamp(int line, char *source, int data0, int data1); +/* + * This is the time stamp function. The macro adds the source info and + * does a cast on the data to allow most any 32-bit value. + */ + +#define kgdb_ts(data0,data1) kgdb_tstamp(__LINE__,__FILE__,(int)data0,(int)data1) +#else +#define kgdb_ts(data0,data1) +#endif +#else /* CONFIG_KGDB && ! __ASSEMBLY__ ,stubs follow... */ +#ifndef BREAKPOINT +#define BREAKPOINT +#endif +#define kgdb_ts(data0,data1) +#define in_kgdb +#define kgdb_handle_exception +#define breakpoint +#define INIT_KGDB_INTS +#endif +#endif /* __KGDB */ --- diff/include/asm-i386/kgdb_local.h 1970-01-01 01:00:00.000000000 +0100 +++ source/include/asm-i386/kgdb_local.h 2003-11-26 10:09:07.000000000 +0000 @@ -0,0 +1,102 @@ +#ifndef __KGDB_LOCAL +#define __KGDB_LOCAL +#include <linux/config.h> +#include <linux/types.h> +#include <linux/serial.h> +#include <linux/serialP.h> +#include <linux/spinlock.h> +#include <asm/processor.h> +#include <asm/msr.h> +#include <asm/kgdb.h> + +#define PORT 0x3f8 +#ifdef CONFIG_KGDB_PORT +#undef PORT +#define PORT CONFIG_KGDB_PORT +#endif +#define IRQ 4 +#ifdef CONFIG_KGDB_IRQ +#undef IRQ +#define IRQ CONFIG_KGDB_IRQ +#endif +#define SB_CLOCK 1843200 +#define SB_BASE (SB_CLOCK/16) +#define SB_BAUD9600 SB_BASE/9600 +#define SB_BAUD192 SB_BASE/19200 +#define SB_BAUD384 SB_BASE/38400 +#define SB_BAUD576 SB_BASE/57600 +#define SB_BAUD1152 SB_BASE/115200 +#ifdef CONFIG_KGDB_9600BAUD +#define SB_BAUD SB_BAUD9600 +#endif +#ifdef CONFIG_KGDB_19200BAUD +#define SB_BAUD SB_BAUD192 +#endif +#ifdef CONFIG_KGDB_38400BAUD +#define SB_BAUD SB_BAUD384 +#endif +#ifdef CONFIG_KGDB_57600BAUD +#define SB_BAUD SB_BAUD576 +#endif +#ifdef CONFIG_KGDB_115200BAUD +#define SB_BAUD SB_BAUD1152 +#endif +#ifndef SB_BAUD +#define SB_BAUD SB_BAUD1152 /* Start with this if not given */ +#endif + +#ifndef CONFIG_X86_TSC +#undef rdtsc +#define rdtsc(a,b) if (a++ > 10000){a = 0; b++;} +#undef rdtscll +#define rdtscll(s) s++ +#endif + +#ifdef _raw_read_unlock /* must use a name that is "define"ed, not an inline */ +#undef spin_lock +#undef spin_trylock +#undef spin_unlock +#define spin_lock _raw_spin_lock +#define spin_trylock _raw_spin_trylock +#define spin_unlock _raw_spin_unlock +#else +#endif +#undef spin_unlock_wait +#define 
spin_unlock_wait(x) do { cpu_relax(); barrier();} \ + while(spin_is_locked(x)) + +#define SB_IER 1 +#define SB_MCR UART_MCR_OUT2 | UART_MCR_DTR | UART_MCR_RTS + +#define FLAGS 0 +#define SB_STATE { \ + magic: SSTATE_MAGIC, \ + baud_base: SB_BASE, \ + port: PORT, \ + irq: IRQ, \ + flags: FLAGS, \ + custom_divisor:SB_BAUD} +#define SB_INFO { \ + magic: SERIAL_MAGIC, \ + port: PORT,0,FLAGS, \ + state: &state, \ + tty: (struct tty_struct *)&state, \ + IER: SB_IER, \ + MCR: SB_MCR} +extern void putDebugChar(int); +/* RTAI support needs us to really stop/start interrupts */ + +#define kgdb_sti() __asm__ __volatile__("sti": : :"memory") +#define kgdb_cli() __asm__ __volatile__("cli": : :"memory") +#define kgdb_local_save_flags(x) __asm__ __volatile__(\ + "pushfl ; popl %0":"=g" (x): /* no input */) +#define kgdb_local_irq_restore(x) __asm__ __volatile__(\ + "pushl %0 ; popfl": \ + /* no output */ :"g" (x):"memory", "cc") +#define kgdb_local_irq_save(x) kgdb_local_save_flags(x); kgdb_cli() + +#ifdef CONFIG_SERIAL +extern void shutdown_for_kgdb(struct async_struct *info); +#endif +#define INIT_KDEBUG putDebugChar('+'); +#endif /* __KGDB_LOCAL */ --- diff/include/asm-i386/lockmeter.h 1970-01-01 01:00:00.000000000 +0100 +++ source/include/asm-i386/lockmeter.h 2003-11-26 10:09:07.000000000 +0000 @@ -0,0 +1,127 @@ +/* + * Copyright (C) 1999,2000 Silicon Graphics, Inc. + * + * Written by John Hawkes (hawkes@sgi.com) + * Based on klstat.h by Jack Steiner (steiner@sgi.com) + * + * Modified by Ray Bryant (raybry@us.ibm.com) + * Changes Copyright (C) 2000 IBM, Inc. + * Added save of index in spinlock_t to improve efficiency + * of "hold" time reporting for spinlocks. + * Added support for hold time statistics for read and write + * locks. + * Moved machine dependent code here from include/lockmeter.h. + * + */ + +#ifndef _I386_LOCKMETER_H +#define _I386_LOCKMETER_H + +#include <asm/spinlock.h> +#include <asm/rwlock.h> + +#include <linux/version.h> + +#ifdef __KERNEL__ +extern unsigned long cpu_khz; +#define CPU_CYCLE_FREQUENCY (cpu_khz * 1000) +#else +#define CPU_CYCLE_FREQUENCY 450000000 +#endif + +#define THIS_CPU_NUMBER smp_processor_id() + +#if LINUX_VERSION_CODE < KERNEL_VERSION(2,3,0) +#define local_irq_save(x) \ + __asm__ __volatile__("pushfl ; popl %0 ; cli":"=g" (x): /* no input */ :"memory") + +#define local_irq_restore(x) \ + __asm__ __volatile__("pushl %0 ; popfl": /* no output */ :"g" (x):"memory") +#endif /* Linux version 2.2.x */ + +/* + * macros to cache and retrieve an index value inside of a spin lock + * these macros assume that there are less than 65536 simultaneous + * (read mode) holders of a rwlock. Not normally a problem!! + * we also assume that the hash table has less than 65535 entries. 
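+ * + * Example (hypothetical index value): PUT_INDEX(&mylock, 42) caches 42 + * in the upper 16 bits of the 32-bit lock word (see the inst_spinlock_t + * overlay below), leaving the lock bits in the low half untouched; + * GET_INDEX(&mylock) reads it back. 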
+ */ +/* + * instrumented spinlock structure -- never used to allocate storage + * only used in macros below to overlay a spinlock_t + */ +typedef struct inst_spinlock_s { + /* remember, Intel is little endian */ + unsigned short lock; + unsigned short index; +} inst_spinlock_t; +#define PUT_INDEX(lock_ptr,indexv) ((inst_spinlock_t *)(lock_ptr))->index = indexv +#define GET_INDEX(lock_ptr) ((inst_spinlock_t *)(lock_ptr))->index + +/* + * macros to cache and retrieve an index value in a read/write lock + * as well as the cpu where a reader busy period started + * we use the 2nd word (the debug word) for this, so require the + * debug word to be present + */ +/* + * instrumented rwlock structure -- never used to allocate storage + * only used in macros below to overlay a rwlock_t + */ +typedef struct inst_rwlock_s { + volatile int lock; + unsigned short index; + unsigned short cpu; +} inst_rwlock_t; +#define PUT_RWINDEX(rwlock_ptr,indexv) ((inst_rwlock_t *)(rwlock_ptr))->index = indexv +#define GET_RWINDEX(rwlock_ptr) ((inst_rwlock_t *)(rwlock_ptr))->index +#define PUT_RW_CPU(rwlock_ptr,cpuv) ((inst_rwlock_t *)(rwlock_ptr))->cpu = cpuv +#define GET_RW_CPU(rwlock_ptr) ((inst_rwlock_t *)(rwlock_ptr))->cpu + +/* + * return the number of readers for a rwlock_t + */ +#define RWLOCK_READERS(rwlock_ptr) rwlock_readers(rwlock_ptr) + +extern inline int rwlock_readers(rwlock_t *rwlock_ptr) +{ + int tmp = (int) rwlock_ptr->lock; + /* read and write lock attempts may cause the lock value to + * temporarily be negative. Until it is >= 0 we know nothing: we + * can't tell whether it is -1 because it was write locked and + * somebody tried to read lock it, or because it was read locked + * and somebody tried to write lock it. */ + do { + tmp = (int) rwlock_ptr->lock; + } while (tmp < 0); + if (tmp == 0) return(0); + else return(RW_LOCK_BIAS-tmp); +} + +/* + * return true if rwlock is write locked + * (note that other lock attempts can cause the lock value to be negative) + */ +#define RWLOCK_IS_WRITE_LOCKED(rwlock_ptr) ((rwlock_ptr)->lock <= 0) +#define IABS(x) ((x) > 0 ? (x) : -(x)) +#define RWLOCK_IS_READ_LOCKED(rwlock_ptr) ((IABS((rwlock_ptr)->lock) % RW_LOCK_BIAS) != 0) + +/* this is a lot of typing just to get gcc to emit "rdtsc" */ +static inline long long get_cycles64 (void) +{ +#ifndef CONFIG_X86_TSC + #error this code requires CONFIG_X86_TSC +#else + union longlong_u { + long long intlong; + struct intint_s { + uint32_t eax; + uint32_t edx; + } intint; + } longlong; + + rdtsc(longlong.intint.eax,longlong.intint.edx); + return longlong.intlong; +#endif +} + +#endif /* _I386_LOCKMETER_H */ --- diff/include/asm-ia64/lockmeter.h 1970-01-01 01:00:00.000000000 +0100 +++ source/include/asm-ia64/lockmeter.h 2003-11-26 10:09:07.000000000 +0000 @@ -0,0 +1,72 @@ +/* + * Copyright (C) 1999,2000 Silicon Graphics, Inc. + * + * Written by John Hawkes (hawkes@sgi.com) + * Based on klstat.h by Jack Steiner (steiner@sgi.com) + */ + +#ifndef _IA64_LOCKMETER_H +#define _IA64_LOCKMETER_H + +#ifdef local_cpu_data +#define CPU_CYCLE_FREQUENCY local_cpu_data->itc_freq +#else +#define CPU_CYCLE_FREQUENCY my_cpu_data.itc_freq +#endif +#define get_cycles64() get_cycles() + +#define THIS_CPU_NUMBER smp_processor_id() + +/* + * macros to cache and retrieve an index value inside of a lock + * these macros assume that there are less than 65536 simultaneous + * (read mode) holders of a rwlock. 
+ * we also assume that the hash table has less than 32767 entries. + */ +/* + * instrumented spinlock structure -- never used to allocate storage + * only used in macros below to overlay a spinlock_t + */ +typedef struct inst_spinlock_s { + /* remember, Intel is little endian */ + volatile unsigned short lock; + volatile unsigned short index; +} inst_spinlock_t; +#define PUT_INDEX(lock_ptr,indexv) ((inst_spinlock_t *)(lock_ptr))->index = indexv +#define GET_INDEX(lock_ptr) ((inst_spinlock_t *)(lock_ptr))->index + +/* + * macros to cache and retrieve an index value in a read/write lock + * as well as the cpu where a reader busy period started + * we use the 2nd word (the debug word) for this, so require the + * debug word to be present + */ +/* + * instrumented rwlock structure -- never used to allocate storage + * only used in macros below to overlay a rwlock_t + */ +typedef struct inst_rwlock_s { + volatile int read_counter:31; + volatile int write_lock:1; + volatile unsigned short index; + volatile unsigned short cpu; +} inst_rwlock_t; +#define PUT_RWINDEX(rwlock_ptr,indexv) ((inst_rwlock_t *)(rwlock_ptr))->index = indexv +#define GET_RWINDEX(rwlock_ptr) ((inst_rwlock_t *)(rwlock_ptr))->index +#define PUT_RW_CPU(rwlock_ptr,cpuv) ((inst_rwlock_t *)(rwlock_ptr))->cpu = cpuv +#define GET_RW_CPU(rwlock_ptr) ((inst_rwlock_t *)(rwlock_ptr))->cpu + +/* + * return the number of readers for a rwlock_t + */ +#define RWLOCK_READERS(rwlock_ptr) ((rwlock_ptr)->read_counter) + +/* + * return true if rwlock is write locked + * (note that other lock attempts can cause the lock value to be negative) + */ +#define RWLOCK_IS_WRITE_LOCKED(rwlock_ptr) ((rwlock_ptr)->write_lock) +#define RWLOCK_IS_READ_LOCKED(rwlock_ptr) ((rwlock_ptr)->read_counter) + +#endif /* _IA64_LOCKMETER_H */ + --- diff/include/asm-mips/lockmeter.h 1970-01-01 01:00:00.000000000 +0100 +++ source/include/asm-mips/lockmeter.h 2003-11-26 10:09:07.000000000 +0000 @@ -0,0 +1,126 @@ +/* + * Copyright (C) 1999,2000 Silicon Graphics, Inc. + * + * Written by John Hawkes (hawkes@sgi.com) + * Based on klstat.h by Jack Steiner (steiner@sgi.com) + * Ported to mips32 for Asita Technologies + * by D.J. Barrow ( dj.barrow@asitatechnologies.com ) + */ +#ifndef _ASM_LOCKMETER_H +#define _ASM_LOCKMETER_H + +/* do_gettimeoffset is a function pointer on mips */ +/* & it is not included by <linux/time.h> */ +#include <asm/time.h> +#include <linux/time.h> +#include <asm/div64.h> + +#define SPINLOCK_MAGIC_INIT /* */ + +#define CPU_CYCLE_FREQUENCY get_cpu_cycle_frequency() + +#define THIS_CPU_NUMBER smp_processor_id() + +static uint32_t cpu_cycle_frequency = 0; + +static uint32_t get_cpu_cycle_frequency(void) +{ + /* a total hack, slow and invasive, but ... 
it works */ + int sec; + uint32_t start_cycles; + struct timeval tv; + + if (cpu_cycle_frequency == 0) { /* uninitialized */ + do_gettimeofday(&tv); + sec = tv.tv_sec; /* set up to catch the tv_sec rollover */ + while (sec == tv.tv_sec) { do_gettimeofday(&tv); } + sec = tv.tv_sec; /* rolled over to a new sec value */ + start_cycles = get_cycles(); + while (sec == tv.tv_sec) { do_gettimeofday(&tv); } + cpu_cycle_frequency = get_cycles() - start_cycles; + } + + return cpu_cycle_frequency; +} + +extern struct timeval xtime; + +static uint64_t get_cycles64(void) +{ + static uint64_t last_get_cycles64 = 0; + uint64_t ret; + unsigned long sec; + unsigned long usec, usec_offset; + +again: + sec = xtime.tv_sec; + usec = xtime.tv_usec; + usec_offset = do_gettimeoffset(); + if ((xtime.tv_sec != sec) || + (xtime.tv_usec != usec)|| + (usec_offset >= 20000)) + goto again; + + ret = ((uint64_t)(usec + usec_offset) * cpu_cycle_frequency); + /* We can't do a normal 64 bit division on mips without libgcc.a */ + do_div(ret,1000000); + ret += ((uint64_t)sec * cpu_cycle_frequency); + + /* XXX why does time go backwards? do_gettimeoffset? general time adj? */ + if (ret <= last_get_cycles64) + ret = last_get_cycles64+1; + last_get_cycles64 = ret; + + return ret; +} + +/* + * macros to cache and retrieve an index value inside of a lock + * these macros assume that there are less than 65536 simultaneous + * (read mode) holders of a rwlock. + * we also assume that the hash table has less than 32767 entries. + * the high order bit is used for write locking a rw_lock + */ +#define INDEX_MASK 0x7FFF0000 +#define READERS_MASK 0x0000FFFF +#define INDEX_SHIFT 16 +#define PUT_INDEX(lockp,index) \ + lockp->lock = (((lockp->lock) & ~INDEX_MASK) | (index) << INDEX_SHIFT) +#define GET_INDEX(lockp) \ + (((lockp->lock) & INDEX_MASK) >> INDEX_SHIFT) + +/* + * macros to cache and retrieve an index value in a read/write lock + * as well as the cpu where a reader busy period started + * we use the 2nd word (the debug word) for this, so require the + * debug word to be present + */ +/* + * instrumented rwlock structure -- never used to allocate storage + * only used in macros below to overlay a rwlock_t + */ +typedef struct inst_rwlock_s { + volatile int lock; + unsigned short index; + unsigned short cpu; +} inst_rwlock_t; +#define PUT_RWINDEX(rwlock_ptr,indexv) ((inst_rwlock_t *)(rwlock_ptr))->index = indexv +#define GET_RWINDEX(rwlock_ptr) ((inst_rwlock_t *)(rwlock_ptr))->index +#define PUT_RW_CPU(rwlock_ptr,cpuv) ((inst_rwlock_t *)(rwlock_ptr))->cpu = cpuv +#define GET_RW_CPU(rwlock_ptr) ((inst_rwlock_t *)(rwlock_ptr))->cpu + +/* + * return the number of readers for a rwlock_t + */ +#define RWLOCK_READERS(rwlock_ptr) rwlock_readers(rwlock_ptr) + +extern inline int rwlock_readers(rwlock_t *rwlock_ptr) +{ + int tmp = (int) rwlock_ptr->lock; + return (tmp >= 0) ? tmp : 0; +} + +#define RWLOCK_IS_WRITE_LOCKED(rwlock_ptr) ((rwlock_ptr)->lock < 0) +#define RWLOCK_IS_READ_LOCKED(rwlock_ptr) ((rwlock_ptr)->lock > 0) + +#endif /* _ASM_LOCKMETER_H */ --- diff/include/asm-sparc64/lockmeter.h 1970-01-01 01:00:00.000000000 +0100 +++ source/include/asm-sparc64/lockmeter.h 2003-11-26 10:09:07.000000000 +0000 @@ -0,0 +1,45 @@ +/* + * Copyright (C) 2000 Anton Blanchard (anton@linuxcare.com) + * Copyright (C) 2003 David S. 
Miller (davem@redhat.com) + */ + +#ifndef _SPARC64_LOCKMETER_H +#define _SPARC64_LOCKMETER_H + +#include <linux/smp.h> +#include <asm/spinlock.h> +#include <asm/timer.h> +#include <asm/timex.h> + +/* Actually, this is not the CPU frequency but the system tick + * frequency, which is good enough for lock metering. + */ +#define CPU_CYCLE_FREQUENCY (timer_tick_offset * HZ) +#define THIS_CPU_NUMBER smp_processor_id() + +#define PUT_INDEX(lock_ptr,indexv) (lock_ptr)->index = (indexv) +#define GET_INDEX(lock_ptr) (lock_ptr)->index + +#define PUT_RWINDEX(rwlock_ptr,indexv) (rwlock_ptr)->index = (indexv) +#define GET_RWINDEX(rwlock_ptr) (rwlock_ptr)->index +#define PUT_RW_CPU(rwlock_ptr,cpuv) (rwlock_ptr)->cpu = (cpuv) +#define GET_RW_CPU(rwlock_ptr) (rwlock_ptr)->cpu + +#define RWLOCK_READERS(rwlock_ptr) rwlock_readers(rwlock_ptr) + +extern inline int rwlock_readers(rwlock_t *rwlock_ptr) +{ + signed int tmp = rwlock_ptr->lock; + + if (tmp > 0) + return tmp; + else + return 0; +} + +#define RWLOCK_IS_WRITE_LOCKED(rwlock_ptr) ((signed int)((rwlock_ptr)->lock) < 0) +#define RWLOCK_IS_READ_LOCKED(rwlock_ptr) ((signed int)((rwlock_ptr)->lock) > 0) + +#define get_cycles64() get_cycles() + +#endif /* _SPARC64_LOCKMETER_H */ --- diff/include/asm-x86_64/cpu.h 1970-01-01 01:00:00.000000000 +0100 +++ source/include/asm-x86_64/cpu.h 2003-11-26 10:09:07.000000000 +0000 @@ -0,0 +1 @@ +#include <asm-i386/cpu.h> --- diff/include/asm-x86_64/memblk.h 1970-01-01 01:00:00.000000000 +0100 +++ source/include/asm-x86_64/memblk.h 2003-11-26 10:09:07.000000000 +0000 @@ -0,0 +1 @@ +#include <asm-i386/memblk.h> --- diff/include/asm-x86_64/node.h 1970-01-01 01:00:00.000000000 +0100 +++ source/include/asm-x86_64/node.h 2003-11-26 10:09:07.000000000 +0000 @@ -0,0 +1 @@ +#include <asm-i386/node.h> --- diff/include/linux/dwarf2-lang.h 1970-01-01 01:00:00.000000000 +0100 +++ source/include/linux/dwarf2-lang.h 2003-11-26 10:09:08.000000000 +0000 @@ -0,0 +1,132 @@ +#ifndef DWARF2_LANG +#define DWARF2_LANG +#include <linux/dwarf2.h> + +/* + * This is free software; you can redistribute it and/or modify it under + * the terms of the GNU General Public License as published by the Free + * Software Foundation; either version 2, or (at your option) any later + * version. + */ +/* + * This file defines macros that allow generation of DWARF debug records + * for asm files. This file is platform independent. Register numbers + * (which are about the only thing that is platform dependent) are to be + * supplied by a platform defined file. + */ +#define DWARF_preamble() .section .debug_frame,"",@progbits +/* + * This macro starts a debug frame section. The debug_frame describes + * where to find the registers that the enclosing function saved on + * entry. + * + * ORD is used by the label generator and should be the same as what is + * passed to CFI_postamble. + * + * pc, pc register gdb ordinal. + * + * code_align this is the factor used to define locations or regions + * where the given definitions apply. If you use labels to define these, + * this should be 1. + * + * data_align this is the factor used to define register offsets. If + * you use struct offset, this should be the size of the register in + * bytes or the negative of that. This is how it is used: you will + * define a register as the reference register, say the stack pointer, + * then you will say where a register is located relative to this + * reference register's value, say 40 for register 3 (the gdb register + * number). The <40> will be multiplied by <data_align> to define the + * byte offset of the given register (3, in this example). So if your + * <40> is the byte offset and the reference register points at the + * beginning, you would want 1 for the data_align. If <40> was the 40th + * 4-byte element in that structure you would want 4. And if your + * reference register points at the end of the structure you would want + * a negative data_align value (and you would have to do other math as + * well). 
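+ * + * Worked example (editorial sketch): with data_align = -4 on a 32-bit + * target, CFA_define_offset(3, 2) below records that register 3 was + * saved at 2 * -4 = -8 bytes from the CFA, i.e. 8 bytes below the + * reference register. 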
+ */ + +#define CFI_preamble(ORD, pc, code_align, data_align) \ +.section .debug_frame,"",@progbits ; \ +frame/**/_/**/ORD: \ + .long end/**/_/**/ORD-start/**/_/**/ORD; \ +start/**/_/**/ORD: \ + .long DW_CIE_ID; \ + .byte DW_CIE_VERSION; \ + .byte 0 ; \ + .uleb128 code_align; \ + .sleb128 data_align; \ + .byte pc; + +/* + * After the above macro and prior to the CFI_postamble, you need to + * define the initial state. This starts with defining the reference + * register and, usually, the pc. Here are some helper macros: + */ + +#define CFA_define_reference(reg, offset) \ + .byte DW_CFA_def_cfa; \ + .uleb128 reg; \ + .uleb128 (offset); + +#define CFA_define_offset(reg, offset) \ + .byte (DW_CFA_offset + reg); \ + .uleb128 (offset); + +#define CFI_postamble(ORD) \ + .align 4; \ +end/**/_/**/ORD: +/* + * So now your code pushes stuff on the stack, you need a new location + * and the rules for what to do. This starts a running description of + * the call frame. You need to describe what changes with respect to + * the call registers as the location of the pc moves through the code. + * The following builds an FDE (frame description entry). Like the + * above, it has a preamble and a postamble. It also is tied to the CFI + * above. + * The first entry after the preamble must be the location in the code + * that the call frame is being described for. + */ +#define FDE_preamble(ORD, fde_no, initial_address, length) \ + .long FDE_end/**/_/**/fde_no-FDE_start/**/_/**/fde_no; \ +FDE_start/**/_/**/fde_no: \ + .long frame/**/_/**/ORD; \ + .long initial_address; \ + .long length; + +#define FDE_postamble(fde_no) \ + .align 4; \ +FDE_end/**/_/**/fde_no: +/* + * That done, you can now add registers, subtract registers, move the + * reference and even change the reference. You can also define a new + * area of code the info applies to. For discontinuous bits you should + * start a new FDE. You may have as many as you like. + */ + +/* + * To advance the address by <bytes> + */ + +#define FDE_advance(bytes) \ + .byte DW_CFA_advance_loc4 \ + .long bytes + + + +/* + * With the above you can define all the register locations. But + * suppose the reference register moves... Takes the new offset NOT an + * increment. This is how esp is tracked if it is not saved. + */ + +#define CFA_define_cfa_offset(offset) \ + .byte DW_CFA_def_cfa_offset; \ + .uleb128 (offset); +/* + * Or suppose you want to use a different reference register... + */ +#define CFA_define_cfa_register(reg) \ + .byte DW_CFA_def_cfa_register; \ + .uleb128 reg; + +#endif --- diff/include/linux/dwarf2.h 1970-01-01 01:00:00.000000000 +0100 +++ source/include/linux/dwarf2.h 2003-11-26 10:09:08.000000000 +0000 @@ -0,0 +1,738 @@ +/* Declarations and definitions of codes relating to the DWARF2 symbolic + debugging information format. + Copyright (C) 1992, 1993, 1995, 1996, 1997, 1999, 2000, 2001, 2002 + Free Software Foundation, Inc. + + Written by Gary Funck (gary@intrepid.com) The Ada Joint Program + Office (AJPO), Florida State University and Silicon Graphics Inc. 
+ provided support for this effort -- June 21, 1995. + + Derived from the DWARF 1 implementation written by Ron Guilmette + (rfg@netcom.com), November 1990. + + This file is part of GCC. + + GCC is free software; you can redistribute it and/or modify it under + the terms of the GNU General Public License as published by the Free + Software Foundation; either version 2, or (at your option) any later + version. + + GCC is distributed in the hope that it will be useful, but WITHOUT + ANY WARRANTY; without even the implied warranty of MERCHANTABILITY + or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public + License for more details. + + You should have received a copy of the GNU General Public License + along with GCC; see the file COPYING. If not, write to the Free + Software Foundation, 59 Temple Place - Suite 330, Boston, MA + 02111-1307, USA. */ + +/* This file is derived from the DWARF specification (a public document) + Revision 2.0.0 (July 27, 1993) developed by the UNIX International + Programming Languages Special Interest Group (UI/PLSIG) and distributed + by UNIX International. Copies of this specification are available from + UNIX International, 20 Waterview Boulevard, Parsippany, NJ, 07054. + + This file also now contains definitions from the DWARF 3 specification. */ + +/* This file is shared between GCC and GDB, and should not contain + prototypes. */ + +#ifndef _ELF_DWARF2_H +#define _ELF_DWARF2_H + +/* Structure found in the .debug_line section. */ +#ifndef __ASSEMBLY__ +typedef struct +{ + unsigned char li_length [4]; + unsigned char li_version [2]; + unsigned char li_prologue_length [4]; + unsigned char li_min_insn_length [1]; + unsigned char li_default_is_stmt [1]; + unsigned char li_line_base [1]; + unsigned char li_line_range [1]; + unsigned char li_opcode_base [1]; +} +DWARF2_External_LineInfo; + +typedef struct +{ + unsigned long li_length; + unsigned short li_version; + unsigned int li_prologue_length; + unsigned char li_min_insn_length; + unsigned char li_default_is_stmt; + int li_line_base; + unsigned char li_line_range; + unsigned char li_opcode_base; +} +DWARF2_Internal_LineInfo; + +/* Structure found in .debug_pubnames section. */ +typedef struct +{ + unsigned char pn_length [4]; + unsigned char pn_version [2]; + unsigned char pn_offset [4]; + unsigned char pn_size [4]; +} +DWARF2_External_PubNames; + +typedef struct +{ + unsigned long pn_length; + unsigned short pn_version; + unsigned long pn_offset; + unsigned long pn_size; +} +DWARF2_Internal_PubNames; + +/* Structure found in .debug_info section. */ +typedef struct +{ + unsigned char cu_length [4]; + unsigned char cu_version [2]; + unsigned char cu_abbrev_offset [4]; + unsigned char cu_pointer_size [1]; +} +DWARF2_External_CompUnit; + +typedef struct +{ + unsigned long cu_length; + unsigned short cu_version; + unsigned long cu_abbrev_offset; + unsigned char cu_pointer_size; +} +DWARF2_Internal_CompUnit; + +typedef struct +{ + unsigned char ar_length [4]; + unsigned char ar_version [2]; + unsigned char ar_info_offset [4]; + unsigned char ar_pointer_size [1]; + unsigned char ar_segment_size [1]; +} +DWARF2_External_ARange; + +typedef struct +{ + unsigned long ar_length; + unsigned short ar_version; + unsigned long ar_info_offset; + unsigned char ar_pointer_size; + unsigned char ar_segment_size; +} +DWARF2_Internal_ARange; + +#define ENUM(name) enum name { +#define IF_NOT_ASM(a) a +#define COMMA , +#else +#define ENUM(name) +#define IF_NOT_ASM(a) +#define COMMA + +#endif + +/* Tag names and codes. 
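(Expansion sketch, editorial: with __ASSEMBLY__ undefined, the ENUM/COMMA/IF_NOT_ASM wrappers above turn each list below into a C enum, e.g. enum dwarf_tag { DW_TAG_padding = 0x00, ... }; under __ASSEMBLY__ the wrappers expand to nothing, leaving plain assembler symbol assignments such as DW_TAG_padding = 0x00.) 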
*/ +ENUM(dwarf_tag) + + DW_TAG_padding = 0x00 COMMA + DW_TAG_array_type = 0x01 COMMA + DW_TAG_class_type = 0x02 COMMA + DW_TAG_entry_point = 0x03 COMMA + DW_TAG_enumeration_type = 0x04 COMMA + DW_TAG_formal_parameter = 0x05 COMMA + DW_TAG_imported_declaration = 0x08 COMMA + DW_TAG_label = 0x0a COMMA + DW_TAG_lexical_block = 0x0b COMMA + DW_TAG_member = 0x0d COMMA + DW_TAG_pointer_type = 0x0f COMMA + DW_TAG_reference_type = 0x10 COMMA + DW_TAG_compile_unit = 0x11 COMMA + DW_TAG_string_type = 0x12 COMMA + DW_TAG_structure_type = 0x13 COMMA + DW_TAG_subroutine_type = 0x15 COMMA + DW_TAG_typedef = 0x16 COMMA + DW_TAG_union_type = 0x17 COMMA + DW_TAG_unspecified_parameters = 0x18 COMMA + DW_TAG_variant = 0x19 COMMA + DW_TAG_common_block = 0x1a COMMA + DW_TAG_common_inclusion = 0x1b COMMA + DW_TAG_inheritance = 0x1c COMMA + DW_TAG_inlined_subroutine = 0x1d COMMA + DW_TAG_module = 0x1e COMMA + DW_TAG_ptr_to_member_type = 0x1f COMMA + DW_TAG_set_type = 0x20 COMMA + DW_TAG_subrange_type = 0x21 COMMA + DW_TAG_with_stmt = 0x22 COMMA + DW_TAG_access_declaration = 0x23 COMMA + DW_TAG_base_type = 0x24 COMMA + DW_TAG_catch_block = 0x25 COMMA + DW_TAG_const_type = 0x26 COMMA + DW_TAG_constant = 0x27 COMMA + DW_TAG_enumerator = 0x28 COMMA + DW_TAG_file_type = 0x29 COMMA + DW_TAG_friend = 0x2a COMMA + DW_TAG_namelist = 0x2b COMMA + DW_TAG_namelist_item = 0x2c COMMA + DW_TAG_packed_type = 0x2d COMMA + DW_TAG_subprogram = 0x2e COMMA + DW_TAG_template_type_param = 0x2f COMMA + DW_TAG_template_value_param = 0x30 COMMA + DW_TAG_thrown_type = 0x31 COMMA + DW_TAG_try_block = 0x32 COMMA + DW_TAG_variant_part = 0x33 COMMA + DW_TAG_variable = 0x34 COMMA + DW_TAG_volatile_type = 0x35 COMMA + /* DWARF 3. */ + DW_TAG_dwarf_procedure = 0x36 COMMA + DW_TAG_restrict_type = 0x37 COMMA + DW_TAG_interface_type = 0x38 COMMA + DW_TAG_namespace = 0x39 COMMA + DW_TAG_imported_module = 0x3a COMMA + DW_TAG_unspecified_type = 0x3b COMMA + DW_TAG_partial_unit = 0x3c COMMA + DW_TAG_imported_unit = 0x3d COMMA + /* SGI/MIPS Extensions. */ + DW_TAG_MIPS_loop = 0x4081 COMMA + /* GNU extensions. */ + DW_TAG_format_label = 0x4101 COMMA /* For FORTRAN 77 and Fortran 90. */ + DW_TAG_function_template = 0x4102 COMMA /* For C++. */ + DW_TAG_class_template = 0x4103 COMMA /* For C++. */ + DW_TAG_GNU_BINCL = 0x4104 COMMA + DW_TAG_GNU_EINCL = 0x4105 COMMA + /* Extensions for UPC. See: http://upc.gwu.edu/~upc. */ + DW_TAG_upc_shared_type = 0x8765 COMMA + DW_TAG_upc_strict_type = 0x8766 COMMA + DW_TAG_upc_relaxed_type = 0x8767 +IF_NOT_ASM(};) + +#define DW_TAG_lo_user 0x4080 +#define DW_TAG_hi_user 0xffff + +/* Flag that tells whether entry has a child or not. */ +#define DW_children_no 0 +#define DW_children_yes 1 + +/* Form names and codes. */ +ENUM(dwarf_form) + + DW_FORM_addr = 0x01 COMMA + DW_FORM_block2 = 0x03 COMMA + DW_FORM_block4 = 0x04 COMMA + DW_FORM_data2 = 0x05 COMMA + DW_FORM_data4 = 0x06 COMMA + DW_FORM_data8 = 0x07 COMMA + DW_FORM_string = 0x08 COMMA + DW_FORM_block = 0x09 COMMA + DW_FORM_block1 = 0x0a COMMA + DW_FORM_data1 = 0x0b COMMA + DW_FORM_flag = 0x0c COMMA + DW_FORM_sdata = 0x0d COMMA + DW_FORM_strp = 0x0e COMMA + DW_FORM_udata = 0x0f COMMA + DW_FORM_ref_addr = 0x10 COMMA + DW_FORM_ref1 = 0x11 COMMA + DW_FORM_ref2 = 0x12 COMMA + DW_FORM_ref4 = 0x13 COMMA + DW_FORM_ref8 = 0x14 COMMA + DW_FORM_ref_udata = 0x15 COMMA + DW_FORM_indirect = 0x16 +IF_NOT_ASM(};) + +/* Attribute names and codes. 
*/ + +ENUM(dwarf_attribute) + + DW_AT_sibling = 0x01 COMMA + DW_AT_location = 0x02 COMMA + DW_AT_name = 0x03 COMMA + DW_AT_ordering = 0x09 COMMA + DW_AT_subscr_data = 0x0a COMMA + DW_AT_byte_size = 0x0b COMMA + DW_AT_bit_offset = 0x0c COMMA + DW_AT_bit_size = 0x0d COMMA + DW_AT_element_list = 0x0f COMMA + DW_AT_stmt_list = 0x10 COMMA + DW_AT_low_pc = 0x11 COMMA + DW_AT_high_pc = 0x12 COMMA + DW_AT_language = 0x13 COMMA + DW_AT_member = 0x14 COMMA + DW_AT_discr = 0x15 COMMA + DW_AT_discr_value = 0x16 COMMA + DW_AT_visibility = 0x17 COMMA + DW_AT_import = 0x18 COMMA + DW_AT_string_length = 0x19 COMMA + DW_AT_common_reference = 0x1a COMMA + DW_AT_comp_dir = 0x1b COMMA + DW_AT_const_value = 0x1c COMMA + DW_AT_containing_type = 0x1d COMMA + DW_AT_default_value = 0x1e COMMA + DW_AT_inline = 0x20 COMMA + DW_AT_is_optional = 0x21 COMMA + DW_AT_lower_bound = 0x22 COMMA + DW_AT_producer = 0x25 COMMA + DW_AT_prototyped = 0x27 COMMA + DW_AT_return_addr = 0x2a COMMA + DW_AT_start_scope = 0x2c COMMA + DW_AT_stride_size = 0x2e COMMA + DW_AT_upper_bound = 0x2f COMMA + DW_AT_abstract_origin = 0x31 COMMA + DW_AT_accessibility = 0x32 COMMA + DW_AT_address_class = 0x33 COMMA + DW_AT_artificial = 0x34 COMMA + DW_AT_base_types = 0x35 COMMA + DW_AT_calling_convention = 0x36 COMMA + DW_AT_count = 0x37 COMMA + DW_AT_data_member_location = 0x38 COMMA + DW_AT_decl_column = 0x39 COMMA + DW_AT_decl_file = 0x3a COMMA + DW_AT_decl_line = 0x3b COMMA + DW_AT_declaration = 0x3c COMMA + DW_AT_discr_list = 0x3d COMMA + DW_AT_encoding = 0x3e COMMA + DW_AT_external = 0x3f COMMA + DW_AT_frame_base = 0x40 COMMA + DW_AT_friend = 0x41 COMMA + DW_AT_identifier_case = 0x42 COMMA + DW_AT_macro_info = 0x43 COMMA + DW_AT_namelist_items = 0x44 COMMA + DW_AT_priority = 0x45 COMMA + DW_AT_segment = 0x46 COMMA + DW_AT_specification = 0x47 COMMA + DW_AT_static_link = 0x48 COMMA + DW_AT_type = 0x49 COMMA + DW_AT_use_location = 0x4a COMMA + DW_AT_variable_parameter = 0x4b COMMA + DW_AT_virtuality = 0x4c COMMA + DW_AT_vtable_elem_location = 0x4d COMMA + /* DWARF 3 values. */ + DW_AT_allocated = 0x4e COMMA + DW_AT_associated = 0x4f COMMA + DW_AT_data_location = 0x50 COMMA + DW_AT_stride = 0x51 COMMA + DW_AT_entry_pc = 0x52 COMMA + DW_AT_use_UTF8 = 0x53 COMMA + DW_AT_extension = 0x54 COMMA + DW_AT_ranges = 0x55 COMMA + DW_AT_trampoline = 0x56 COMMA + DW_AT_call_column = 0x57 COMMA + DW_AT_call_file = 0x58 COMMA + DW_AT_call_line = 0x59 COMMA + /* SGI/MIPS extensions. */ + DW_AT_MIPS_fde = 0x2001 COMMA + DW_AT_MIPS_loop_begin = 0x2002 COMMA + DW_AT_MIPS_tail_loop_begin = 0x2003 COMMA + DW_AT_MIPS_epilog_begin = 0x2004 COMMA + DW_AT_MIPS_loop_unroll_factor = 0x2005 COMMA + DW_AT_MIPS_software_pipeline_depth = 0x2006 COMMA + DW_AT_MIPS_linkage_name = 0x2007 COMMA + DW_AT_MIPS_stride = 0x2008 COMMA + DW_AT_MIPS_abstract_name = 0x2009 COMMA + DW_AT_MIPS_clone_origin = 0x200a COMMA + DW_AT_MIPS_has_inlines = 0x200b COMMA + /* GNU extensions. */ + DW_AT_sf_names = 0x2101 COMMA + DW_AT_src_info = 0x2102 COMMA + DW_AT_mac_info = 0x2103 COMMA + DW_AT_src_coords = 0x2104 COMMA + DW_AT_body_begin = 0x2105 COMMA + DW_AT_body_end = 0x2106 COMMA + DW_AT_GNU_vector = 0x2107 COMMA + /* VMS extensions. */ + DW_AT_VMS_rtnbeg_pd_address = 0x2201 COMMA + /* UPC extension. */ + DW_AT_upc_threads_scaled = 0x3210 +IF_NOT_ASM(};) + +#define DW_AT_lo_user 0x2000 /* Implementation-defined range start. */ +#define DW_AT_hi_user 0x3ff0 /* Implementation-defined range end. */ + +/* Location atom names and codes. 
*/ +ENUM(dwarf_location_atom) + + DW_OP_addr = 0x03 COMMA + DW_OP_deref = 0x06 COMMA + DW_OP_const1u = 0x08 COMMA + DW_OP_const1s = 0x09 COMMA + DW_OP_const2u = 0x0a COMMA + DW_OP_const2s = 0x0b COMMA + DW_OP_const4u = 0x0c COMMA + DW_OP_const4s = 0x0d COMMA + DW_OP_const8u = 0x0e COMMA + DW_OP_const8s = 0x0f COMMA + DW_OP_constu = 0x10 COMMA + DW_OP_consts = 0x11 COMMA + DW_OP_dup = 0x12 COMMA + DW_OP_drop = 0x13 COMMA + DW_OP_over = 0x14 COMMA + DW_OP_pick = 0x15 COMMA + DW_OP_swap = 0x16 COMMA + DW_OP_rot = 0x17 COMMA + DW_OP_xderef = 0x18 COMMA + DW_OP_abs = 0x19 COMMA + DW_OP_and = 0x1a COMMA + DW_OP_div = 0x1b COMMA + DW_OP_minus = 0x1c COMMA + DW_OP_mod = 0x1d COMMA + DW_OP_mul = 0x1e COMMA + DW_OP_neg = 0x1f COMMA + DW_OP_not = 0x20 COMMA + DW_OP_or = 0x21 COMMA + DW_OP_plus = 0x22 COMMA + DW_OP_plus_uconst = 0x23 COMMA + DW_OP_shl = 0x24 COMMA + DW_OP_shr = 0x25 COMMA + DW_OP_shra = 0x26 COMMA + DW_OP_xor = 0x27 COMMA + DW_OP_bra = 0x28 COMMA + DW_OP_eq = 0x29 COMMA + DW_OP_ge = 0x2a COMMA + DW_OP_gt = 0x2b COMMA + DW_OP_le = 0x2c COMMA + DW_OP_lt = 0x2d COMMA + DW_OP_ne = 0x2e COMMA + DW_OP_skip = 0x2f COMMA + DW_OP_lit0 = 0x30 COMMA + DW_OP_lit1 = 0x31 COMMA + DW_OP_lit2 = 0x32 COMMA + DW_OP_lit3 = 0x33 COMMA + DW_OP_lit4 = 0x34 COMMA + DW_OP_lit5 = 0x35 COMMA + DW_OP_lit6 = 0x36 COMMA + DW_OP_lit7 = 0x37 COMMA + DW_OP_lit8 = 0x38 COMMA + DW_OP_lit9 = 0x39 COMMA + DW_OP_lit10 = 0x3a COMMA + DW_OP_lit11 = 0x3b COMMA + DW_OP_lit12 = 0x3c COMMA + DW_OP_lit13 = 0x3d COMMA + DW_OP_lit14 = 0x3e COMMA + DW_OP_lit15 = 0x3f COMMA + DW_OP_lit16 = 0x40 COMMA + DW_OP_lit17 = 0x41 COMMA + DW_OP_lit18 = 0x42 COMMA + DW_OP_lit19 = 0x43 COMMA + DW_OP_lit20 = 0x44 COMMA + DW_OP_lit21 = 0x45 COMMA + DW_OP_lit22 = 0x46 COMMA + DW_OP_lit23 = 0x47 COMMA + DW_OP_lit24 = 0x48 COMMA + DW_OP_lit25 = 0x49 COMMA + DW_OP_lit26 = 0x4a COMMA + DW_OP_lit27 = 0x4b COMMA + DW_OP_lit28 = 0x4c COMMA + DW_OP_lit29 = 0x4d COMMA + DW_OP_lit30 = 0x4e COMMA + DW_OP_lit31 = 0x4f COMMA + DW_OP_reg0 = 0x50 COMMA + DW_OP_reg1 = 0x51 COMMA + DW_OP_reg2 = 0x52 COMMA + DW_OP_reg3 = 0x53 COMMA + DW_OP_reg4 = 0x54 COMMA + DW_OP_reg5 = 0x55 COMMA + DW_OP_reg6 = 0x56 COMMA + DW_OP_reg7 = 0x57 COMMA + DW_OP_reg8 = 0x58 COMMA + DW_OP_reg9 = 0x59 COMMA + DW_OP_reg10 = 0x5a COMMA + DW_OP_reg11 = 0x5b COMMA + DW_OP_reg12 = 0x5c COMMA + DW_OP_reg13 = 0x5d COMMA + DW_OP_reg14 = 0x5e COMMA + DW_OP_reg15 = 0x5f COMMA + DW_OP_reg16 = 0x60 COMMA + DW_OP_reg17 = 0x61 COMMA + DW_OP_reg18 = 0x62 COMMA + DW_OP_reg19 = 0x63 COMMA + DW_OP_reg20 = 0x64 COMMA + DW_OP_reg21 = 0x65 COMMA + DW_OP_reg22 = 0x66 COMMA + DW_OP_reg23 = 0x67 COMMA + DW_OP_reg24 = 0x68 COMMA + DW_OP_reg25 = 0x69 COMMA + DW_OP_reg26 = 0x6a COMMA + DW_OP_reg27 = 0x6b COMMA + DW_OP_reg28 = 0x6c COMMA + DW_OP_reg29 = 0x6d COMMA + DW_OP_reg30 = 0x6e COMMA + DW_OP_reg31 = 0x6f COMMA + DW_OP_breg0 = 0x70 COMMA + DW_OP_breg1 = 0x71 COMMA + DW_OP_breg2 = 0x72 COMMA + DW_OP_breg3 = 0x73 COMMA + DW_OP_breg4 = 0x74 COMMA + DW_OP_breg5 = 0x75 COMMA + DW_OP_breg6 = 0x76 COMMA + DW_OP_breg7 = 0x77 COMMA + DW_OP_breg8 = 0x78 COMMA + DW_OP_breg9 = 0x79 COMMA + DW_OP_breg10 = 0x7a COMMA + DW_OP_breg11 = 0x7b COMMA + DW_OP_breg12 = 0x7c COMMA + DW_OP_breg13 = 0x7d COMMA + DW_OP_breg14 = 0x7e COMMA + DW_OP_breg15 = 0x7f COMMA + DW_OP_breg16 = 0x80 COMMA + DW_OP_breg17 = 0x81 COMMA + DW_OP_breg18 = 0x82 COMMA + DW_OP_breg19 = 0x83 COMMA + DW_OP_breg20 = 0x84 COMMA + DW_OP_breg21 = 0x85 COMMA + DW_OP_breg22 = 0x86 COMMA + DW_OP_breg23 = 0x87 COMMA + DW_OP_breg24 = 0x88 COMMA + DW_OP_breg25 = 
0x89 COMMA + DW_OP_breg26 = 0x8a COMMA + DW_OP_breg27 = 0x8b COMMA + DW_OP_breg28 = 0x8c COMMA + DW_OP_breg29 = 0x8d COMMA + DW_OP_breg30 = 0x8e COMMA + DW_OP_breg31 = 0x8f COMMA + DW_OP_regx = 0x90 COMMA + DW_OP_fbreg = 0x91 COMMA + DW_OP_bregx = 0x92 COMMA + DW_OP_piece = 0x93 COMMA + DW_OP_deref_size = 0x94 COMMA + DW_OP_xderef_size = 0x95 COMMA + DW_OP_nop = 0x96 COMMA + /* DWARF 3 extensions. */ + DW_OP_push_object_address = 0x97 COMMA + DW_OP_call2 = 0x98 COMMA + DW_OP_call4 = 0x99 COMMA + DW_OP_call_ref = 0x9a COMMA + /* GNU extensions. */ + DW_OP_GNU_push_tls_address = 0xe0 +IF_NOT_ASM(};) + +#define DW_OP_lo_user 0xe0 /* Implementation-defined range start. */ +#define DW_OP_hi_user 0xff /* Implementation-defined range end. */ + +/* Type encodings. */ +ENUM(dwarf_type) + + DW_ATE_void = 0x0 COMMA + DW_ATE_address = 0x1 COMMA + DW_ATE_boolean = 0x2 COMMA + DW_ATE_complex_float = 0x3 COMMA + DW_ATE_float = 0x4 COMMA + DW_ATE_signed = 0x5 COMMA + DW_ATE_signed_char = 0x6 COMMA + DW_ATE_unsigned = 0x7 COMMA + DW_ATE_unsigned_char = 0x8 COMMA + /* DWARF 3. */ + DW_ATE_imaginary_float = 0x9 +IF_NOT_ASM(};) + +#define DW_ATE_lo_user 0x80 +#define DW_ATE_hi_user 0xff + +/* Array ordering names and codes. */ +ENUM(dwarf_array_dim_ordering) + + DW_ORD_row_major = 0 COMMA + DW_ORD_col_major = 1 +IF_NOT_ASM(};) + +/* Access attribute. */ +ENUM(dwarf_access_attribute) + + DW_ACCESS_public = 1 COMMA + DW_ACCESS_protected = 2 COMMA + DW_ACCESS_private = 3 +IF_NOT_ASM(};) + +/* Visibility. */ +ENUM(dwarf_visibility_attribute) + + DW_VIS_local = 1 COMMA + DW_VIS_exported = 2 COMMA + DW_VIS_qualified = 3 +IF_NOT_ASM(};) + +/* Virtuality. */ +ENUM(dwarf_virtuality_attribute) + + DW_VIRTUALITY_none = 0 COMMA + DW_VIRTUALITY_virtual = 1 COMMA + DW_VIRTUALITY_pure_virtual = 2 +IF_NOT_ASM(};) + +/* Case sensitivity. */ +ENUM(dwarf_id_case) + + DW_ID_case_sensitive = 0 COMMA + DW_ID_up_case = 1 COMMA + DW_ID_down_case = 2 COMMA + DW_ID_case_insensitive = 3 +IF_NOT_ASM(};) + +/* Calling convention. */ +ENUM(dwarf_calling_convention) + + DW_CC_normal = 0x1 COMMA + DW_CC_program = 0x2 COMMA + DW_CC_nocall = 0x3 +IF_NOT_ASM(};) + +#define DW_CC_lo_user 0x40 +#define DW_CC_hi_user 0xff + +/* Inline attribute. */ +ENUM(dwarf_inline_attribute) + + DW_INL_not_inlined = 0 COMMA + DW_INL_inlined = 1 COMMA + DW_INL_declared_not_inlined = 2 COMMA + DW_INL_declared_inlined = 3 +IF_NOT_ASM(};) + +/* Discriminant lists. */ +ENUM(dwarf_discrim_list) + + DW_DSC_label = 0 COMMA + DW_DSC_range = 1 +IF_NOT_ASM(};) + +/* Line number opcodes. */ +ENUM(dwarf_line_number_ops) + + DW_LNS_extended_op = 0 COMMA + DW_LNS_copy = 1 COMMA + DW_LNS_advance_pc = 2 COMMA + DW_LNS_advance_line = 3 COMMA + DW_LNS_set_file = 4 COMMA + DW_LNS_set_column = 5 COMMA + DW_LNS_negate_stmt = 6 COMMA + DW_LNS_set_basic_block = 7 COMMA + DW_LNS_const_add_pc = 8 COMMA + DW_LNS_fixed_advance_pc = 9 COMMA + /* DWARF 3. */ + DW_LNS_set_prologue_end = 10 COMMA + DW_LNS_set_epilogue_begin = 11 COMMA + DW_LNS_set_isa = 12 +IF_NOT_ASM(};) + +/* Line number extended opcodes. */ +ENUM(dwarf_line_number_x_ops) + + DW_LNE_end_sequence = 1 COMMA + DW_LNE_set_address = 2 COMMA + DW_LNE_define_file = 3 +IF_NOT_ASM(};) + +/* Call frame information. 
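+
+   In the DWARF encoding, DW_CFA_advance_loc, DW_CFA_offset and
+   DW_CFA_restore are "primary" opcodes: they occupy the top two bits
+   of the instruction byte and carry their first operand in the low
+   six bits, i.e.
+
+	primary = insn & 0xc0;	(0x40, 0x80 or 0xc0)
+	operand = insn & 0x3f;	(address delta or register number)
+
+   which is why their values below are 0x40, 0x80 and 0xc0, while all
+   remaining opcodes use the whole byte.
+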
*/ +ENUM(dwarf_call_frame_info) + + DW_CFA_advance_loc = 0x40 COMMA + DW_CFA_offset = 0x80 COMMA + DW_CFA_restore = 0xc0 COMMA + DW_CFA_nop = 0x00 COMMA + DW_CFA_set_loc = 0x01 COMMA + DW_CFA_advance_loc1 = 0x02 COMMA + DW_CFA_advance_loc2 = 0x03 COMMA + DW_CFA_advance_loc4 = 0x04 COMMA + DW_CFA_offset_extended = 0x05 COMMA + DW_CFA_restore_extended = 0x06 COMMA + DW_CFA_undefined = 0x07 COMMA + DW_CFA_same_value = 0x08 COMMA + DW_CFA_register = 0x09 COMMA + DW_CFA_remember_state = 0x0a COMMA + DW_CFA_restore_state = 0x0b COMMA + DW_CFA_def_cfa = 0x0c COMMA + DW_CFA_def_cfa_register = 0x0d COMMA + DW_CFA_def_cfa_offset = 0x0e COMMA + + /* DWARF 3. */ + DW_CFA_def_cfa_expression = 0x0f COMMA + DW_CFA_expression = 0x10 COMMA + DW_CFA_offset_extended_sf = 0x11 COMMA + DW_CFA_def_cfa_sf = 0x12 COMMA + DW_CFA_def_cfa_offset_sf = 0x13 COMMA + + /* SGI/MIPS specific. */ + DW_CFA_MIPS_advance_loc8 = 0x1d COMMA + + /* GNU extensions. */ + DW_CFA_GNU_window_save = 0x2d COMMA + DW_CFA_GNU_args_size = 0x2e COMMA + DW_CFA_GNU_negative_offset_extended = 0x2f +IF_NOT_ASM(};) + +#define DW_CIE_ID 0xffffffff +#define DW_CIE_VERSION 1 + +#define DW_CFA_extended 0 +#define DW_CFA_lo_user 0x1c +#define DW_CFA_hi_user 0x3f + +#define DW_CHILDREN_no 0x00 +#define DW_CHILDREN_yes 0x01 + +#define DW_ADDR_none 0 + +/* Source language names and codes. */ +ENUM(dwarf_source_language) + + DW_LANG_C89 = 0x0001 COMMA + DW_LANG_C = 0x0002 COMMA + DW_LANG_Ada83 = 0x0003 COMMA + DW_LANG_C_plus_plus = 0x0004 COMMA + DW_LANG_Cobol74 = 0x0005 COMMA + DW_LANG_Cobol85 = 0x0006 COMMA + DW_LANG_Fortran77 = 0x0007 COMMA + DW_LANG_Fortran90 = 0x0008 COMMA + DW_LANG_Pascal83 = 0x0009 COMMA + DW_LANG_Modula2 = 0x000a COMMA + DW_LANG_Java = 0x000b COMMA + /* DWARF 3. */ + DW_LANG_C99 = 0x000c COMMA + DW_LANG_Ada95 = 0x000d COMMA + DW_LANG_Fortran95 = 0x000e COMMA + /* MIPS. */ + DW_LANG_Mips_Assembler = 0x8001 COMMA + /* UPC. */ + DW_LANG_Upc = 0x8765 +IF_NOT_ASM(};) + +#define DW_LANG_lo_user 0x8000 /* Implementation-defined range start. */ +#define DW_LANG_hi_user 0xffff /* Implementation-defined range start. */ + +/* Names and codes for macro information. */ +ENUM(dwarf_macinfo_record_type) + + DW_MACINFO_define = 1 COMMA + DW_MACINFO_undef = 2 COMMA + DW_MACINFO_start_file = 3 COMMA + DW_MACINFO_end_file = 4 COMMA + DW_MACINFO_vendor_ext = 255 +IF_NOT_ASM(};) + +/* @@@ For use with GNU frame unwind information. */ + +#define DW_EH_PE_absptr 0x00 +#define DW_EH_PE_omit 0xff + +#define DW_EH_PE_uleb128 0x01 +#define DW_EH_PE_udata2 0x02 +#define DW_EH_PE_udata4 0x03 +#define DW_EH_PE_udata8 0x04 +#define DW_EH_PE_sleb128 0x09 +#define DW_EH_PE_sdata2 0x0A +#define DW_EH_PE_sdata4 0x0B +#define DW_EH_PE_sdata8 0x0C +#define DW_EH_PE_signed 0x08 + +#define DW_EH_PE_pcrel 0x10 +#define DW_EH_PE_textrel 0x20 +#define DW_EH_PE_datarel 0x30 +#define DW_EH_PE_funcrel 0x40 +#define DW_EH_PE_aligned 0x50 + +#define DW_EH_PE_indirect 0x80 + +#endif /* _ELF_DWARF2_H */ --- diff/include/linux/lockmeter.h 1970-01-01 01:00:00.000000000 +0100 +++ source/include/linux/lockmeter.h 2003-11-26 10:09:08.000000000 +0000 @@ -0,0 +1,320 @@ +/* + * Copyright (C) 1999-2002 Silicon Graphics, Inc. + * + * Written by John Hawkes (hawkes@sgi.com) + * Based on klstat.h by Jack Steiner (steiner@sgi.com) + * + * Modified by Ray Bryant (raybry@us.ibm.com) Feb-Apr 2000 + * Changes Copyright (C) 2000 IBM, Inc. 
+ * Added save of index in spinlock_t to improve efficiency + * of "hold" time reporting for spinlocks + * Added support for hold time statistics for read and write + * locks. + * Moved machine dependent code to include/asm/lockmeter.h. + * + */ + +#ifndef _LINUX_LOCKMETER_H +#define _LINUX_LOCKMETER_H + + +/*--------------------------------------------------- + * architecture-independent lockmeter.h + *-------------------------------------------------*/ + +/* + * raybry -- version 2: added efficient hold time statistics + * requires lstat recompile, so flagged as new version + * raybry -- version 3: added global reader lock data + * hawkes -- version 4: removed some unnecessary fields to simplify mips64 port + */ +#define LSTAT_VERSION 5 + +int lstat_update(void*, void*, int); +int lstat_update_time(void*, void*, int, uint32_t); + +/* + * Currently, the mips64 and sparc64 kernels talk to a 32-bit lockstat, so we + * need to force compatibility in the inter-communication data structure. + */ + +#if defined(CONFIG_MIPS32_COMPAT) +#define TIME_T uint32_t +#elif defined(CONFIG_SPARC) || defined(CONFIG_SPARC64) +#define TIME_T uint64_t +#else +#define TIME_T time_t +#endif + +#if defined(__KERNEL__) || (!defined(CONFIG_MIPS32_COMPAT) && !defined(CONFIG_SPARC) && !defined(CONFIG_SPARC64)) || (_MIPS_SZLONG==32) +#define POINTER void * +#else +#define POINTER int64_t +#endif + +/* + * Values for the "action" parameter passed to lstat_update. + * ZZZ - do we want a try-success status here??? + */ +#define LSTAT_ACT_NO_WAIT 0 +#define LSTAT_ACT_SPIN 1 +#define LSTAT_ACT_REJECT 2 +#define LSTAT_ACT_WW_SPIN 3 +#define LSTAT_ACT_SLEPT 4 /* UNUSED */ + +#define LSTAT_ACT_MAX_VALUES 4 /* NOTE: Increase to 5 if use ACT_SLEPT */ + +/* + * Special values for the low 2 bits of an RA passed to + * lstat_update. + */ +/* we use these values to figure out what kind of lock data */ +/* is stored in the statistics table entry at index ....... */ +#define LSTAT_RA_SPIN 0 /* spin lock data */ +#define LSTAT_RA_READ 1 /* read lock statistics */ +#define LSTAT_RA_SEMA 2 /* RESERVED */ +#define LSTAT_RA_WRITE 3 /* write lock statistics*/ + +#define LSTAT_RA(n) \ + ((void*)( ((unsigned long)__builtin_return_address(0) & ~3) | n) ) + +/* + * Constants used for lock addresses in the lstat_directory + * to indicate special values of the lock address. + */ +#define LSTAT_MULTI_LOCK_ADDRESS NULL + +/* + * Maximum size of the lockstats tables. Increase this value + * if its not big enough. (Nothing bad happens if its not + * big enough although some locks will not be monitored.) + * We record overflows of this quantity in lstat_control.dir_overflows + * + * Note: The max value here must fit into the field set + * and obtained by the macro's PUT_INDEX() and GET_INDEX(). + * This value depends on how many bits are available in the + * lock word in the particular machine implementation we are on. + */ +#define LSTAT_MAX_STAT_INDEX 2000 + +/* + * Size and mask for the hash table into the directory. + */ +#define LSTAT_HASH_TABLE_SIZE 4096 /* must be 2**N */ +#define LSTAT_HASH_TABLE_MASK (LSTAT_HASH_TABLE_SIZE-1) + +#define DIRHASH(ra) ((unsigned long)(ra)>>2 & LSTAT_HASH_TABLE_MASK) + +/* + * This defines an entry in the lockstat directory. It contains + * information about a lock being monitored. + * A directory entry only contains the lock identification - + * counts on usage of the lock are kept elsewhere in a per-cpu + * data structure to minimize cache line pinging. 
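+ *
+ * Lookups walk from the hash table into a chain of directory entries:
+ *
+ *	hashtab[DIRHASH(ra)] -> dir[i] -> dir[dir[i].next_stat_index] -> ...
+ *
+ * where every entry on one chain shares the same hash bucket and index
+ * 0 terminates the chain; see lstat_lookup() and lstat_make_dir_entry()
+ * in kernel/lockmeter.c for the walk and the insertion.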
+ */ +typedef struct { + POINTER caller_ra; /* RA of code that set lock */ + POINTER lock_ptr; /* lock address */ + ushort next_stat_index; /* Used to link multiple locks that have the same hash table value */ +} lstat_directory_entry_t; + +/* + * A multi-dimensioned array used to contain counts for lock accesses. + * The array is 3-dimensional: + * - CPU number. Keep from thrashing cache lines between CPUs + * - Directory entry index. Identifies the lock + * - Action. Indicates what kind of contention occurred on an + * access to the lock. + * + * The index of an entry in the directory is the same as the 2nd index + * of the entry in the counts array. + */ +/* + * This table contains data for spin_locks, write locks, and read locks + * Not all data is used for all cases. In particular, the hold time + * information is not stored here for read locks since that is a global + * (e. g. cannot be separated out by return address) quantity. + * See the lstat_read_lock_counts_t structure for the global read lock + * hold time. + */ +typedef struct { + uint64_t cum_wait_ticks; /* sum of wait times */ + /* for write locks, sum of time a */ + /* writer is waiting for a reader */ + int64_t cum_hold_ticks; /* cumulative sum of holds */ + /* not used for read mode locks */ + /* must be signed. ............... */ + uint32_t max_wait_ticks; /* max waiting time */ + uint32_t max_hold_ticks; /* max holding time */ + uint64_t cum_wait_ww_ticks; /* sum times writer waits on writer*/ + uint32_t max_wait_ww_ticks; /* max wait time writer vs writer */ + /* prev 2 only used for write locks*/ + uint32_t acquire_time; /* time lock acquired this CPU */ + uint32_t count[LSTAT_ACT_MAX_VALUES]; +} lstat_lock_counts_t; + +typedef lstat_lock_counts_t lstat_cpu_counts_t[LSTAT_MAX_STAT_INDEX]; + +/* + * User request to: + * - turn statistic collection on/off, or to reset + */ +#define LSTAT_OFF 0 +#define LSTAT_ON 1 +#define LSTAT_RESET 2 +#define LSTAT_RELEASE 3 + +#define LSTAT_MAX_READ_LOCK_INDEX 1000 +typedef struct { + POINTER lock_ptr; /* address of lock for output stats */ + uint32_t read_lock_count; + int64_t cum_hold_ticks; /* sum of read lock hold times over */ + /* all callers. ....................*/ + uint32_t write_index; /* last write lock hash table index */ + uint32_t busy_periods; /* count of busy periods ended this */ + uint64_t start_busy; /* time this busy period started. ..*/ + uint64_t busy_ticks; /* sum of busy periods this lock. ..*/ + uint64_t max_busy; /* longest busy period for this lock*/ + uint32_t max_readers; /* maximum number of readers ...... */ +#ifdef USER_MODE_TESTING + rwlock_t entry_lock; /* lock for this read lock entry... */ + /* avoid having more than one rdr at*/ + /* needed for user space testing... */ + /* not needed for kernel 'cause it */ + /* is non-preemptive. ............. */ +#endif +} lstat_read_lock_counts_t; +typedef lstat_read_lock_counts_t lstat_read_lock_cpu_counts_t[LSTAT_MAX_READ_LOCK_INDEX]; + +#if defined(__KERNEL__) || defined(USER_MODE_TESTING) + +#ifndef USER_MODE_TESTING +#include <asm/lockmeter.h> +#else +#include "asm_newlockmeter.h" +#endif + +/* + * Size and mask for the hash table into the directory. + */ +#define LSTAT_HASH_TABLE_SIZE 4096 /* must be 2**N */ +#define LSTAT_HASH_TABLE_MASK (LSTAT_HASH_TABLE_SIZE-1) + +#define DIRHASH(ra) ((unsigned long)(ra)>>2 & LSTAT_HASH_TABLE_MASK) + +/* + * This version eliminates the per processor lock stack. What we do is to + * store the index of the lock hash structure in unused bits in the lock + * itself. 
+ * Then on unlock we can find the statistics record without doing
+ * any additional hash or lock stack lookup.  This works for spin_locks.
+ * Hold time reporting is now basically as cheap as wait time reporting
+ * so we ignore the difference between LSTAT_ON_HOLD and LSTAT_ON_WAIT
+ * as in version 1.1.* of lockmeter.
+ *
+ * For rw_locks, we store the index of a global reader stats structure in
+ * the lock and the writer index is stored in the latter structure.
+ * For read mode locks we hash at the time of the lock to find an entry
+ * in the directory for reader wait time and the like.
+ * At unlock time for read mode locks, we update just the global structure
+ * so we don't need to know the reader directory index value at unlock time.
+ */
+
+/*
+ * Protocol to change lstat_control.state
+ * This is complicated because we don't want the cum_hold_time for
+ * a rw_lock to be decremented in _read_lock_ without making sure it
+ * is incremented in _read_unlock_ and vice versa.  So here is the
+ * way we change the state of lstat_control.state:
+ * I.  To Turn Statistics On
+ *     After allocating storage, set lstat_control.state non-zero.
+ *     This works because we don't start updating statistics for in-use
+ *     locks until the reader lock count goes to zero.
+ * II. To Turn Statistics Off:
+ *     (0) Disable interrupts on this CPU
+ *     (1) Seize the lstat_control.directory_lock
+ *     (2) Obtain the current value of lstat_control.next_free_read_lock_index
+ *     (3) Store a zero in lstat_control.state.
+ *     (4) Release the lstat_control.directory_lock
+ *     (5) For each lock in the read lock list up to the saved value
+ *         (well, -1) of the next_free_read_lock_index, do the following:
+ *         (a) Check validity of the stored lock address
+ *             by making sure that the word at the saved addr
+ *             has an index that matches this entry.  If not
+ *             valid, then skip this entry.
+ *         (b) If there is a write lock already set on this lock,
+ *             skip to (d) below.
+ *         (c) Set a non-metered write lock on the lock
+ *         (d) Set the cached INDEX in the lock to zero
+ *         (e) Release the non-metered write lock.
+ *     (6) Re-enable interrupts
+ *
+ * These rules ensure that a read lock will not have its statistics
+ * partially updated even though the global lock recording state has
+ * changed.  See put_lockmeter_info() for the implementation.
+ *
+ * The reason for (b) is that there may be write locks set on the
+ * syscall path to put_lockmeter_info() from user space.  If we do
+ * not do this check, then we can deadlock.  A similar problem would
+ * occur if the lock was read locked by the current CPU.  At the
+ * moment this does not appear to happen.
+ */
+
+/*
+ * Main control structure for lockstat.  Used to turn statistics on/off
+ * and to maintain directory info.
+ */
+typedef struct {
+	int		state;
+	spinlock_t	control_lock;	/* used to serialize turning statistics on/off */
+	spinlock_t	directory_lock;	/* serializes adding entries to the directory */
+	volatile int	next_free_dir_index;	/* next free entry in the directory */
+	/* FIXME not all of these fields are used / needed ..............
*/ + /* the following fields represent data since */ + /* first "lstat on" or most recent "lstat reset" */ + TIME_T first_started_time; /* time when measurement first enabled */ + TIME_T started_time; /* time when measurement last started */ + TIME_T ending_time; /* time when measurement last disabled */ + uint64_t started_cycles64; /* cycles when measurement last started */ + uint64_t ending_cycles64; /* cycles when measurement last disabled */ + uint64_t enabled_cycles64; /* total cycles with measurement enabled */ + int intervals; /* number of measurement intervals recorded */ + /* i. e. number of times did lstat on;lstat off */ + lstat_directory_entry_t *dir; /* directory */ + int dir_overflow; /* count of times ran out of space in directory */ + int rwlock_overflow; /* count of times we couldn't allocate a rw block*/ + ushort *hashtab; /* hash table for quick dir scans */ + lstat_cpu_counts_t *counts[NR_CPUS]; /* Array of pointers to per-cpu stats */ + int next_free_read_lock_index; /* next rwlock reader (global) stats block */ + lstat_read_lock_cpu_counts_t *read_lock_counts[NR_CPUS]; /* per cpu read lock stats */ +} lstat_control_t; + +#endif /* defined(__KERNEL__) || defined(USER_MODE_TESTING) */ + +typedef struct { + short lstat_version; /* version of the data */ + short state; /* the current state is returned */ + int maxcpus; /* Number of cpus present */ + int next_free_dir_index; /* index of the next free directory entry */ + TIME_T first_started_time; /* when measurement enabled for first time */ + TIME_T started_time; /* time in secs since 1969 when stats last turned on */ + TIME_T ending_time; /* time in secs since 1969 when stats last turned off */ + uint32_t cycleval; /* cycles per second */ +#ifdef notyet + void *kernel_magic_addr; /* address of kernel_magic */ + void *kernel_end_addr; /* contents of kernel magic (points to "end") */ +#endif + int next_free_read_lock_index; /* index of next (global) read lock stats struct */ + uint64_t started_cycles64; /* cycles when measurement last started */ + uint64_t ending_cycles64; /* cycles when stats last turned off */ + uint64_t enabled_cycles64; /* total cycles with measurement enabled */ + int intervals; /* number of measurement intervals recorded */ + /* i.e. number of times we did lstat on;lstat off*/ + int dir_overflow; /* number of times we wanted more space in directory */ + int rwlock_overflow; /* # of times we wanted more space in read_locks_count */ + struct new_utsname uts; /* info about machine where stats are measured */ + /* -T option of lockstat allows data to be */ + /* moved to another machine. ................. */ +} lstat_user_request_t; + +#endif /* _LINUX_LOCKMETER_H */ --- diff/include/linux/pci_msi.h 1970-01-01 01:00:00.000000000 +0100 +++ source/include/linux/pci_msi.h 2003-11-26 10:09:08.000000000 +0000 @@ -0,0 +1,193 @@ +/* + * ../include/linux/pci_msi.h + * + */ + +#ifndef _ASM_PCI_MSI_H +#define _ASM_PCI_MSI_H + +#include <linux/pci.h> + +#define MSI_AUTO -1 +#define NR_REPEATS 23 +#define NR_RESERVED_VECTORS 3 /*FIRST_DEVICE_VECTOR,FIRST_SYSTEM_VECTOR,0x80 */ + +/* + * Assume the maximum number of hot plug slots supported by the system is about + * ten. The worstcase is that each of these slots is hot-added with a device, + * which has two MSI/MSI-X capable functions. To avoid any MSI-X driver, which + * attempts to request all available vectors, NR_HP_RESERVED_VECTORS is defined + * as below to ensure at least one message is assigned to each detected MSI/ + * MSI-X device function. 
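+ * (Ten slots times two MSI/MSI-X capable functions each is where the
+ * figure of 20 below comes from.)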
+ */ +#define NR_HP_RESERVED_VECTORS 20 + +extern int vector_irq[NR_IRQS]; +extern cpumask_t pending_irq_balance_cpumask[NR_IRQS]; +extern void (*interrupt[NR_IRQS])(void); + +#ifdef CONFIG_SMP +#define set_msi_irq_affinity set_msi_affinity +#else +#define set_msi_irq_affinity NULL +static inline void move_msi(int vector) {} +#endif + +#ifndef CONFIG_X86_IO_APIC +static inline int get_ioapic_vector(struct pci_dev *dev) { return -1;} +static inline void restore_ioapic_irq_handler(int irq) {} +#else +extern void restore_ioapic_irq_handler(int irq); +#endif + +/* + * MSI-X Address Register + */ +#define PCI_MSIX_FLAGS_QSIZE 0x7FF +#define PCI_MSIX_FLAGS_ENABLE (1 << 15) +#define PCI_MSIX_FLAGS_BIRMASK (7 << 0) +#define PCI_MSIX_FLAGS_BITMASK (1 << 0) + +#define PCI_MSIX_ENTRY_LOWER_ADDR_OFFSET 0 +#define PCI_MSIX_ENTRY_UPPER_ADDR_OFFSET 4 +#define PCI_MSIX_ENTRY_DATA_OFFSET 8 +#define PCI_MSIX_ENTRY_VECTOR_CTRL_OFFSET 12 +#define PCI_MSIX_ENTRY_SIZE 16 + +#define msi_control_reg(base) (base + PCI_MSI_FLAGS) +#define msi_lower_address_reg(base) (base + PCI_MSI_ADDRESS_LO) +#define msi_upper_address_reg(base) (base + PCI_MSI_ADDRESS_HI) +#define msi_data_reg(base, is64bit) \ + ( (is64bit == 1) ? base+PCI_MSI_DATA_64 : base+PCI_MSI_DATA_32 ) +#define msi_mask_bits_reg(base, is64bit) \ + ( (is64bit == 1) ? base+PCI_MSI_MASK_BIT : base+PCI_MSI_MASK_BIT-4) +#define msi_disable(control) control &= ~PCI_MSI_FLAGS_ENABLE +#define multi_msi_capable(control) \ + (1 << ((control & PCI_MSI_FLAGS_QMASK) >> 1)) +#define multi_msi_enable(control, num) \ + control |= (((num >> 1) << 4) & PCI_MSI_FLAGS_QSIZE); +#define is_64bit_address(control) (control & PCI_MSI_FLAGS_64BIT) +#define is_mask_bit_support(control) (control & PCI_MSI_FLAGS_MASKBIT) +#define msi_enable(control, num) multi_msi_enable(control, num); \ + control |= PCI_MSI_FLAGS_ENABLE + +#define msix_control_reg msi_control_reg +#define msix_table_offset_reg(base) (base + 0x04) +#define msix_pba_offset_reg(base) (base + 0x08) +#define msix_enable(control) control |= PCI_MSIX_FLAGS_ENABLE +#define msix_disable(control) control &= ~PCI_MSIX_FLAGS_ENABLE +#define msix_table_size(control) ((control & PCI_MSIX_FLAGS_QSIZE)+1) +#define multi_msix_capable msix_table_size +#define msix_unmask(address) (address & ~PCI_MSIX_FLAGS_BITMASK) +#define msix_mask(address) (address | PCI_MSIX_FLAGS_BITMASK) +#define msix_is_pending(address) (address & PCI_MSIX_FLAGS_PENDMASK) + +extern char __dbg_str_buf[256]; +#define _DEFINE_DBG_BUFFER char __dbg_str_buf[256]; +#define _DBG_K_TRACE_ENTRY ((unsigned int)0x00000001) +#define _DBG_K_TRACE_EXIT ((unsigned int)0x00000002) +#define _DBG_K_INFO ((unsigned int)0x00000004) +#define _DBG_K_ERROR ((unsigned int)0x00000008) +#define _DBG_K_TRACE (_DBG_K_TRACE_ENTRY | _DBG_K_TRACE_EXIT) + +#define _DEBUG_LEVEL (_DBG_K_INFO | _DBG_K_ERROR | _DBG_K_TRACE) +#define _DBG_PRINT( dbg_flags, args... 
) \
+if ( _DEBUG_LEVEL & (dbg_flags) ) \
+{ \
+	int len; \
+	len = sprintf(__dbg_str_buf, "%s:%d: %s ", \
+		__FILE__, __LINE__, __FUNCTION__ ); \
+	sprintf(__dbg_str_buf + len, args); \
+	printk(KERN_INFO "%s\n", __dbg_str_buf); \
+}
+
+#define MSI_FUNCTION_TRACE_ENTER \
+	_DBG_PRINT (_DBG_K_TRACE_ENTRY, "%s", "[Entry]");
+#define MSI_FUNCTION_TRACE_EXIT \
+	_DBG_PRINT (_DBG_K_TRACE_EXIT, "%s", "[Exit]");
+
+/*
+ * MSI Defined Data Structures
+ */
+#define MSI_ADDRESS_HEADER		0xfee
+#define MSI_ADDRESS_HEADER_SHIFT	12
+#define MSI_ADDRESS_HEADER_MASK		0xfff000
+#define MSI_TARGET_CPU_SHIFT		4
+#define MSI_TARGET_CPU_MASK		0xff
+#define MSI_DELIVERY_MODE		0
+#define MSI_LEVEL_MODE			1	/* Edge always assert */
+#define MSI_TRIGGER_MODE		0	/* MSI is edge sensitive */
+#define MSI_LOGICAL_MODE		1
+#define MSI_REDIRECTION_HINT_MODE	0
+#ifdef CONFIG_SMP
+#define MSI_TARGET_CPU		logical_smp_processor_id()
+#else
+#define MSI_TARGET_CPU		TARGET_CPUS
+#endif
+
+struct msg_data {
+#if defined(__LITTLE_ENDIAN_BITFIELD)
+	__u32	vector		:  8;
+	__u32	delivery_mode	:  3;	/* 000b: FIXED | 001b: lowest prior */
+	__u32	reserved_1	:  3;
+	__u32	level		:  1;	/* 0: deassert | 1: assert */
+	__u32	trigger		:  1;	/* 0: edge | 1: level */
+	__u32	reserved_2	: 16;
+#elif defined(__BIG_ENDIAN_BITFIELD)
+	__u32	reserved_2	: 16;
+	__u32	trigger		:  1;	/* 0: edge | 1: level */
+	__u32	level		:  1;	/* 0: deassert | 1: assert */
+	__u32	reserved_1	:  3;
+	__u32	delivery_mode	:  3;	/* 000b: FIXED | 001b: lowest prior */
+	__u32	vector		:  8;
+#else
+#error "Bitfield endianness not defined! Check your byteorder.h"
+#endif
+} __attribute__ ((packed));
+
+struct msg_address {
+	union {
+		struct {
+#if defined(__LITTLE_ENDIAN_BITFIELD)
+			__u32	reserved_1	:  2;
+			__u32	dest_mode	:  1;	/* 0: physical | 1: logical */
+			__u32	redirection_hint:  1;	/* 0: dedicated CPU | 1: lowest priority */
+			__u32	reserved_2	:  4;
+			__u32	dest_id		: 24;	/* Destination ID */
+#elif defined(__BIG_ENDIAN_BITFIELD)
+			__u32	dest_id		: 24;	/* Destination ID */
+			__u32	reserved_2	:  4;
+			__u32	redirection_hint:  1;	/* 0: dedicated CPU | 1: lowest priority */
+			__u32	dest_mode	:  1;	/* 0: physical | 1: logical */
+			__u32	reserved_1	:  2;
+#else
+#error "Bitfield endianness not defined! Check your byteorder.h"
+#endif
+		} u;
+		__u32	value;
+	} lo_address;
+	__u32	hi_address;
+} __attribute__ ((packed));
+
+struct msi_desc {
+	struct {
+		__u8	type		: 5;	/* {0: unused, 5h:MSI, 11h:MSI-X} */
+		__u8	maskbit		: 1;	/* mask-pending bit supported ? */
+		__u8	reserved	: 2;	/* reserved */
+		__u8	entry_nr;	/* specific enabled entry */
+		__u8	default_vector;	/* default pre-assigned vector */
+		__u8	current_cpu;	/* current destination cpu */
+	} msi_attrib;
+
+	struct {
+		__u16	head;
+		__u16	tail;
+	} link;
+
+	unsigned long	mask_base;
+	struct pci_dev	*dev;
+};
+
+#endif /* _ASM_PCI_MSI_H */
--- diff/kernel/lockmeter.c	1970-01-01 01:00:00.000000000 +0100
+++ source/kernel/lockmeter.c	2003-11-26 10:09:08.000000000 +0000
@@ -0,0 +1,1178 @@
+/*
+ * Copyright (C) 1999,2000 Silicon Graphics, Inc.
+ *
+ * Written by John Hawkes (hawkes@sgi.com)
+ * Based on klstat.c by Jack Steiner (steiner@sgi.com)
+ *
+ * Modified by Ray Bryant (raybry@us.ibm.com)
+ * Changes Copyright (C) 2000 IBM, Inc.
+ * Added save of index in spinlock_t to improve efficiency
+ * of "hold" time reporting for spinlocks
+ * Added support for hold time statistics for read and write
+ * locks.
+ */ + +#include <linux/config.h> +#include <linux/types.h> +#include <linux/errno.h> +#include <linux/slab.h> +#include <linux/sched.h> +#include <linux/smp.h> +#include <linux/threads.h> +#include <linux/version.h> +#include <linux/vmalloc.h> +#include <linux/spinlock.h> +#include <linux/utsname.h> +#include <linux/module.h> +#include <asm/system.h> +#include <asm/uaccess.h> + +#include <linux/lockmeter.h> + +#define ASSERT(cond) +#define bzero(loc,size) memset(loc,0,size) + +/*<---------------------------------------------------*/ +/* lockmeter.c */ +/*>---------------------------------------------------*/ + +static lstat_control_t lstat_control __cacheline_aligned = + { LSTAT_OFF, SPIN_LOCK_UNLOCKED, SPIN_LOCK_UNLOCKED, + 19 * 0, NR_CPUS * 0, 0, NR_CPUS * 0 }; + +static ushort lstat_make_dir_entry(void *, void *); + +/* + * lstat_lookup + * + * Given a RA, locate the directory entry for the lock. + */ +static ushort +lstat_lookup(void *lock_ptr, void *caller_ra) +{ + ushort index; + lstat_directory_entry_t *dirp; + + dirp = lstat_control.dir; + + index = lstat_control.hashtab[DIRHASH(caller_ra)]; + while (dirp[index].caller_ra != caller_ra) { + if (index == 0) { + return lstat_make_dir_entry(lock_ptr, caller_ra); + } + index = dirp[index].next_stat_index; + } + + if (dirp[index].lock_ptr != NULL && dirp[index].lock_ptr != lock_ptr) { + dirp[index].lock_ptr = NULL; + } + + return index; +} + +/* + * lstat_make_dir_entry + * Called to add a new lock to the lock directory. + */ +static ushort +lstat_make_dir_entry(void *lock_ptr, void *caller_ra) +{ + lstat_directory_entry_t *dirp; + ushort index, hindex; + unsigned long flags; + + /* lock the table without recursively reentering this metering code */ + local_irq_save(flags); + _raw_spin_lock(&lstat_control.directory_lock); + + hindex = DIRHASH(caller_ra); + index = lstat_control.hashtab[hindex]; + dirp = lstat_control.dir; + while (index && dirp[index].caller_ra != caller_ra) + index = dirp[index].next_stat_index; + + if (index == 0) { + if (lstat_control.next_free_dir_index < LSTAT_MAX_STAT_INDEX) { + index = lstat_control.next_free_dir_index++; + lstat_control.dir[index].caller_ra = caller_ra; + lstat_control.dir[index].lock_ptr = lock_ptr; + lstat_control.dir[index].next_stat_index = + lstat_control.hashtab[hindex]; + lstat_control.hashtab[hindex] = index; + } else { + lstat_control.dir_overflow++; + } + } + _raw_spin_unlock(&lstat_control.directory_lock); + local_irq_restore(flags); + return index; +} + +int +lstat_update(void *lock_ptr, void *caller_ra, int action) +{ + int index; + int cpu; + + ASSERT(action < LSTAT_ACT_MAX_VALUES); + + if (lstat_control.state == LSTAT_OFF) + return 0; + + index = lstat_lookup(lock_ptr, caller_ra); + cpu = THIS_CPU_NUMBER; + (*lstat_control.counts[cpu])[index].count[action]++; + (*lstat_control.counts[cpu])[index].acquire_time = get_cycles(); + + return index; +} + +int +lstat_update_time(void *lock_ptr, void *caller_ra, int action, uint32_t ticks) +{ + ushort index; + int cpu; + + ASSERT(action < LSTAT_ACT_MAX_VALUES); + + if (lstat_control.state == LSTAT_OFF) + return 0; + + index = lstat_lookup(lock_ptr, caller_ra); + cpu = THIS_CPU_NUMBER; + (*lstat_control.counts[cpu])[index].count[action]++; + (*lstat_control.counts[cpu])[index].cum_wait_ticks += (uint64_t) ticks; + if ((*lstat_control.counts[cpu])[index].max_wait_ticks < ticks) + (*lstat_control.counts[cpu])[index].max_wait_ticks = ticks; + + (*lstat_control.counts[cpu])[index].acquire_time = get_cycles(); + + return index; +} + +void 
+_metered_spin_lock(spinlock_t * lock_ptr) +{ + if (lstat_control.state == LSTAT_OFF) { + _raw_spin_lock(lock_ptr); /* do the real lock */ + PUT_INDEX(lock_ptr, 0); /* clean index in case lockmetering */ + /* gets turned on before unlock */ + } else { + void *this_pc = LSTAT_RA(LSTAT_RA_SPIN); + int index; + + if (_raw_spin_trylock(lock_ptr)) { + index = lstat_update(lock_ptr, this_pc, + LSTAT_ACT_NO_WAIT); + } else { + uint32_t start_cycles = get_cycles(); + _raw_spin_lock(lock_ptr); /* do the real lock */ + index = lstat_update_time(lock_ptr, this_pc, + LSTAT_ACT_SPIN, get_cycles() - start_cycles); + } + /* save the index in the lock itself for use in spin unlock */ + PUT_INDEX(lock_ptr, index); + } +} + +int +_metered_spin_trylock(spinlock_t * lock_ptr) +{ + if (lstat_control.state == LSTAT_OFF) { + return _raw_spin_trylock(lock_ptr); + } else { + int retval; + void *this_pc = LSTAT_RA(LSTAT_RA_SPIN); + + if ((retval = _raw_spin_trylock(lock_ptr))) { + int index = lstat_update(lock_ptr, this_pc, + LSTAT_ACT_NO_WAIT); + /* + * save the index in the lock itself for use in spin + * unlock + */ + PUT_INDEX(lock_ptr, index); + } else { + lstat_update(lock_ptr, this_pc, LSTAT_ACT_REJECT); + } + + return retval; + } +} + +void +_metered_spin_unlock(spinlock_t * lock_ptr) +{ + int index = -1; + + if (lstat_control.state != LSTAT_OFF) { + index = GET_INDEX(lock_ptr); + /* + * If statistics were turned off when we set the lock, + * then the index can be zero. If that is the case, + * then collect no stats on this call. + */ + if (index > 0) { + uint32_t hold_time; + int cpu = THIS_CPU_NUMBER; + hold_time = get_cycles() - + (*lstat_control.counts[cpu])[index].acquire_time; + (*lstat_control.counts[cpu])[index].cum_hold_ticks += + (uint64_t) hold_time; + if ((*lstat_control.counts[cpu])[index].max_hold_ticks < + hold_time) + (*lstat_control.counts[cpu])[index]. + max_hold_ticks = hold_time; + } + } + + /* make sure we don't have a stale index value saved */ + PUT_INDEX(lock_ptr, 0); + _raw_spin_unlock(lock_ptr); /* do the real unlock */ +} + +/* + * allocate the next global read lock structure and store its index + * in the rwlock at "lock_ptr". + */ +uint32_t +alloc_rwlock_struct(rwlock_t * rwlock_ptr) +{ + int index; + unsigned long flags; + int cpu = THIS_CPU_NUMBER; + + /* If we've already overflowed, then do a quick exit */ + if (lstat_control.next_free_read_lock_index > + LSTAT_MAX_READ_LOCK_INDEX) { + lstat_control.rwlock_overflow++; + return 0; + } + + local_irq_save(flags); + _raw_spin_lock(&lstat_control.directory_lock); + + /* It is possible this changed while we were waiting for the directory_lock */ + if (lstat_control.state == LSTAT_OFF) { + index = 0; + goto unlock; + } + + /* It is possible someone else got here first and set the index */ + if ((index = GET_RWINDEX(rwlock_ptr)) == 0) { + /* + * we can't turn on read stats for this lock while there are + * readers (this would mess up the running hold time sum at + * unlock time) + */ + if (RWLOCK_READERS(rwlock_ptr) != 0) { + index = 0; + goto unlock; + } + + /* + * if stats are turned on after being off, we may need to + * return an old index from when the statistics were on last + * time. + */ + for (index = 1; index < lstat_control.next_free_read_lock_index; + index++) + if ((*lstat_control.read_lock_counts[cpu])[index]. 
+ lock_ptr == rwlock_ptr) + goto put_index_and_unlock; + + /* allocate the next global read lock structure */ + if (lstat_control.next_free_read_lock_index >= + LSTAT_MAX_READ_LOCK_INDEX) { + lstat_control.rwlock_overflow++; + index = 0; + goto unlock; + } + index = lstat_control.next_free_read_lock_index++; + + /* + * initialize the global read stats data structure for each + * cpu + */ + for (cpu = 0; cpu < num_online_cpus(); cpu++) { + (*lstat_control.read_lock_counts[cpu])[index].lock_ptr = + rwlock_ptr; + } +put_index_and_unlock: + /* store the index for the read lock structure into the lock */ + PUT_RWINDEX(rwlock_ptr, index); + } + +unlock: + _raw_spin_unlock(&lstat_control.directory_lock); + local_irq_restore(flags); + return index; +} + +void +_metered_read_lock(rwlock_t * rwlock_ptr) +{ + void *this_pc; + uint32_t start_cycles; + int index; + int cpu; + unsigned long flags; + int readers_before, readers_after; + uint64_t cycles64; + + if (lstat_control.state == LSTAT_OFF) { + _raw_read_lock(rwlock_ptr); + /* clean index in case lockmetering turns on before an unlock */ + PUT_RWINDEX(rwlock_ptr, 0); + return; + } + + this_pc = LSTAT_RA(LSTAT_RA_READ); + cpu = THIS_CPU_NUMBER; + index = GET_RWINDEX(rwlock_ptr); + + /* allocate the global stats entry for this lock, if needed */ + if (index == 0) + index = alloc_rwlock_struct(rwlock_ptr); + + readers_before = RWLOCK_READERS(rwlock_ptr); + if (_raw_read_trylock(rwlock_ptr)) { + /* + * We have decremented the lock to count a new reader, + * and have confirmed that no writer has it locked. + */ + /* update statistics if enabled */ + if (index > 0) { + local_irq_save(flags); + lstat_update((void *) rwlock_ptr, this_pc, + LSTAT_ACT_NO_WAIT); + /* preserve value of TSC so cum_hold_ticks and start_busy use same value */ + cycles64 = get_cycles64(); + (*lstat_control.read_lock_counts[cpu])[index]. + cum_hold_ticks -= cycles64; + + /* record time and cpu of start of busy period */ + /* this is not perfect (some race conditions are possible) */ + if (readers_before == 0) { + (*lstat_control.read_lock_counts[cpu])[index]. + start_busy = cycles64; + PUT_RW_CPU(rwlock_ptr, cpu); + } + readers_after = RWLOCK_READERS(rwlock_ptr); + if (readers_after > + (*lstat_control.read_lock_counts[cpu])[index]. + max_readers) + (*lstat_control.read_lock_counts[cpu])[index]. + max_readers = readers_after; + local_irq_restore(flags); + } + + return; + } + /* If we get here, then we could not quickly grab the read lock */ + + start_cycles = get_cycles(); /* start counting the wait time */ + + /* Now spin until read_lock is successful */ + _raw_read_lock(rwlock_ptr); + + lstat_update_time((void *) rwlock_ptr, this_pc, LSTAT_ACT_SPIN, + get_cycles() - start_cycles); + + /* update statistics if they are enabled for this lock */ + if (index > 0) { + local_irq_save(flags); + cycles64 = get_cycles64(); + (*lstat_control.read_lock_counts[cpu])[index].cum_hold_ticks -= + cycles64; + + /* this is not perfect (some race conditions are possible) */ + if (readers_before == 0) { + (*lstat_control.read_lock_counts[cpu])[index]. + start_busy = cycles64; + PUT_RW_CPU(rwlock_ptr, cpu); + } + readers_after = RWLOCK_READERS(rwlock_ptr); + if (readers_after > + (*lstat_control.read_lock_counts[cpu])[index].max_readers) + (*lstat_control.read_lock_counts[cpu])[index]. 
+ max_readers = readers_after; + local_irq_restore(flags); + } +} + +void +_metered_read_unlock(rwlock_t * rwlock_ptr) +{ + int index; + int cpu; + unsigned long flags; + uint64_t busy_length; + uint64_t cycles64; + + if (lstat_control.state == LSTAT_OFF) { + _raw_read_unlock(rwlock_ptr); + return; + } + + index = GET_RWINDEX(rwlock_ptr); + cpu = THIS_CPU_NUMBER; + + if (index > 0) { + local_irq_save(flags); + /* + * preserve value of TSC so cum_hold_ticks and busy_ticks are + * consistent. + */ + cycles64 = get_cycles64(); + (*lstat_control.read_lock_counts[cpu])[index].cum_hold_ticks += + cycles64; + (*lstat_control.read_lock_counts[cpu])[index].read_lock_count++; + + /* + * once again, this is not perfect (some race conditions are + * possible) + */ + if (RWLOCK_READERS(rwlock_ptr) == 1) { + int cpu1 = GET_RW_CPU(rwlock_ptr); + uint64_t last_start_busy = + (*lstat_control.read_lock_counts[cpu1])[index]. + start_busy; + (*lstat_control.read_lock_counts[cpu])[index]. + busy_periods++; + if (cycles64 > last_start_busy) { + busy_length = cycles64 - last_start_busy; + (*lstat_control.read_lock_counts[cpu])[index]. + busy_ticks += busy_length; + if (busy_length > + (*lstat_control. + read_lock_counts[cpu])[index]. + max_busy) + (*lstat_control. + read_lock_counts[cpu])[index]. + max_busy = busy_length; + } + } + local_irq_restore(flags); + } + _raw_read_unlock(rwlock_ptr); +} + +void +_metered_write_lock(rwlock_t * rwlock_ptr) +{ + uint32_t start_cycles; + void *this_pc; + uint32_t spin_ticks = 0; /* in anticipation of a potential wait */ + int index; + int write_index = 0; + int cpu; + enum { + writer_writer_conflict, + writer_reader_conflict + } why_wait = writer_writer_conflict; + + if (lstat_control.state == LSTAT_OFF) { + _raw_write_lock(rwlock_ptr); + /* clean index in case lockmetering turns on before an unlock */ + PUT_RWINDEX(rwlock_ptr, 0); + return; + } + + this_pc = LSTAT_RA(LSTAT_RA_WRITE); + cpu = THIS_CPU_NUMBER; + index = GET_RWINDEX(rwlock_ptr); + + /* allocate the global stats entry for this lock, if needed */ + if (index == 0) { + index = alloc_rwlock_struct(rwlock_ptr); + } + + if (_raw_write_trylock(rwlock_ptr)) { + /* We acquired the lock on the first try */ + write_index = lstat_update((void *) rwlock_ptr, this_pc, + LSTAT_ACT_NO_WAIT); + /* save the write_index for use in unlock if stats enabled */ + if (index > 0) + (*lstat_control.read_lock_counts[cpu])[index]. + write_index = write_index; + return; + } + + /* If we get here, then we could not quickly grab the write lock */ + start_cycles = get_cycles(); /* start counting the wait time */ + + why_wait = RWLOCK_READERS(rwlock_ptr) ? + writer_reader_conflict : writer_writer_conflict; + + /* Now set the lock and wait for conflicts to disappear */ + _raw_write_lock(rwlock_ptr); + + spin_ticks = get_cycles() - start_cycles; + + /* update stats -- if enabled */ + if (index > 0 && spin_ticks) { + if (why_wait == writer_reader_conflict) { + /* waited due to a reader holding the lock */ + write_index = lstat_update_time((void *)rwlock_ptr, + this_pc, LSTAT_ACT_SPIN, spin_ticks); + } else { + /* + * waited due to another writer holding the lock + */ + write_index = lstat_update_time((void *)rwlock_ptr, + this_pc, LSTAT_ACT_WW_SPIN, spin_ticks); + (*lstat_control.counts[cpu])[write_index]. + cum_wait_ww_ticks += spin_ticks; + if (spin_ticks > + (*lstat_control.counts[cpu])[write_index]. + max_wait_ww_ticks) { + (*lstat_control.counts[cpu])[write_index]. 
+					max_wait_ww_ticks = spin_ticks;
+			}
+		}
+
+		/* save the directory index for use on write_unlock */
+		(*lstat_control.read_lock_counts[cpu])[index].
+			write_index = write_index;
+	}
+}
+
+void
+_metered_write_unlock(rwlock_t * rwlock_ptr)
+{
+	int index;
+	int cpu;
+	int write_index;
+	uint32_t hold_time;
+
+	if (lstat_control.state == LSTAT_OFF) {
+		_raw_write_unlock(rwlock_ptr);
+		return;
+	}
+
+	cpu = THIS_CPU_NUMBER;
+	index = GET_RWINDEX(rwlock_ptr);
+
+	/* update statistics if stats enabled for this lock */
+	if (index > 0) {
+		write_index =
+			(*lstat_control.read_lock_counts[cpu])[index].write_index;
+
+		hold_time = get_cycles() -
+			(*lstat_control.counts[cpu])[write_index].acquire_time;
+		(*lstat_control.counts[cpu])[write_index].cum_hold_ticks +=
+			(uint64_t) hold_time;
+		if ((*lstat_control.counts[cpu])[write_index].max_hold_ticks <
+				hold_time)
+			(*lstat_control.counts[cpu])[write_index].
+				max_hold_ticks = hold_time;
+	}
+	_raw_write_unlock(rwlock_ptr);
+}
+
+int
+_metered_write_trylock(rwlock_t * rwlock_ptr)
+{
+	int retval;
+	void *this_pc = LSTAT_RA(LSTAT_RA_WRITE);
+
+	if ((retval = _raw_write_trylock(rwlock_ptr))) {
+		lstat_update(rwlock_ptr, this_pc, LSTAT_ACT_NO_WAIT);
+	} else {
+		lstat_update(rwlock_ptr, this_pc, LSTAT_ACT_REJECT);
+	}
+
+	return retval;
+}
+
+static void
+init_control_space(void)
+{
+	/* Set all control space pointers to null and indices to "empty" */
+	int cpu;
+
+	/*
+	 * Access CPU_CYCLE_FREQUENCY at the outset, which in some
+	 * architectures may trigger a runtime calculation that uses a
+	 * spinlock.  Let's do this before lockmetering is turned on.
+	 */
+	if (CPU_CYCLE_FREQUENCY == 0)
+		BUG();
+
+	lstat_control.hashtab = NULL;
+	lstat_control.dir = NULL;
+	for (cpu = 0; cpu < NR_CPUS; cpu++) {
+		lstat_control.counts[cpu] = NULL;
+		lstat_control.read_lock_counts[cpu] = NULL;
+	}
+}
+
+static int
+reset_lstat_data(void)
+{
+	int cpu, flags;
+
+	flags = 0;
+	lstat_control.next_free_dir_index = 1;	/* 0 is for overflows */
+	lstat_control.next_free_read_lock_index = 1;
+	lstat_control.dir_overflow = 0;
+	lstat_control.rwlock_overflow = 0;
+
+	lstat_control.started_cycles64 = 0;
+	lstat_control.ending_cycles64 = 0;
+	lstat_control.enabled_cycles64 = 0;
+	lstat_control.first_started_time = 0;
+	lstat_control.started_time = 0;
+	lstat_control.ending_time = 0;
+	lstat_control.intervals = 0;
+
+	/*
+	 * paranoia -- in case someone does a "lockstat reset" before
+	 * "lockstat on"
+	 */
+	if (lstat_control.hashtab) {
+		bzero(lstat_control.hashtab,
+			LSTAT_HASH_TABLE_SIZE * sizeof (short));
+		bzero(lstat_control.dir, LSTAT_MAX_STAT_INDEX *
+			sizeof (lstat_directory_entry_t));
+
+		for (cpu = 0; cpu < num_online_cpus(); cpu++) {
+			bzero(lstat_control.counts[cpu],
+				sizeof (lstat_cpu_counts_t));
+			bzero(lstat_control.read_lock_counts[cpu],
+				sizeof (lstat_read_lock_cpu_counts_t));
+		}
+	}
+#ifdef NOTDEF
+	_raw_spin_unlock(&lstat_control.directory_lock);
+	local_irq_restore(flags);
+#endif
+	return 1;
+}
+
+static void
+release_control_space(void)
+{
+	/*
+	 * Called either (1) when allocation of kmem fails, or (2) when
+	 * the user writes LSTAT_RELEASE to /proc/lockmeter.
+	 * Assume that all pointers have been initialized to zero,
+	 * i.e., nonzero pointers are valid addresses.
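+	 * Note that hashtab and read_lock_counts[] were allocated with
+	 * kmalloc() while dir and counts[] came from vmalloc() (see the
+	 * LSTAT_ON case of put_lockmeter_info()), so the kfree()/vfree()
+	 * calls below must stay paired that way.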
+ */ + int cpu; + + if (lstat_control.hashtab) { + kfree(lstat_control.hashtab); + lstat_control.hashtab = NULL; + } + + if (lstat_control.dir) { + vfree(lstat_control.dir); + lstat_control.dir = NULL; + } + + for (cpu = 0; cpu < NR_CPUS; cpu++) { + if (lstat_control.counts[cpu]) { + vfree(lstat_control.counts[cpu]); + lstat_control.counts[cpu] = NULL; + } + if (lstat_control.read_lock_counts[cpu]) { + kfree(lstat_control.read_lock_counts[cpu]); + lstat_control.read_lock_counts[cpu] = NULL; + } + } +} + +int +get_lockmeter_info_size(void) +{ + return sizeof (lstat_user_request_t) + + num_online_cpus() * sizeof (lstat_cpu_counts_t) + + num_online_cpus() * sizeof (lstat_read_lock_cpu_counts_t) + + (LSTAT_MAX_STAT_INDEX * sizeof (lstat_directory_entry_t)); +} + +ssize_t +get_lockmeter_info(char *buffer, size_t max_len, loff_t * last_index) +{ + lstat_user_request_t req; + struct timeval tv; + ssize_t next_ret_bcount; + ssize_t actual_ret_bcount = 0; + int cpu; + + *last_index = 0; /* a one-shot read */ + + req.lstat_version = LSTAT_VERSION; + req.state = lstat_control.state; + req.maxcpus = num_online_cpus(); + req.cycleval = CPU_CYCLE_FREQUENCY; +#ifdef notyet + req.kernel_magic_addr = (void *) &_etext; + req.kernel_end_addr = (void *) &_etext; +#endif + req.uts = system_utsname; + req.intervals = lstat_control.intervals; + + req.first_started_time = lstat_control.first_started_time; + req.started_time = lstat_control.started_time; + req.started_cycles64 = lstat_control.started_cycles64; + + req.next_free_dir_index = lstat_control.next_free_dir_index; + req.next_free_read_lock_index = lstat_control.next_free_read_lock_index; + req.dir_overflow = lstat_control.dir_overflow; + req.rwlock_overflow = lstat_control.rwlock_overflow; + + if (lstat_control.state == LSTAT_OFF) { + if (req.intervals == 0) { + /* mesasurement is off and no valid data present */ + next_ret_bcount = sizeof (lstat_user_request_t); + req.enabled_cycles64 = 0; + + if ((actual_ret_bcount + next_ret_bcount) > max_len) + return actual_ret_bcount; + + copy_to_user(buffer, (void *) &req, next_ret_bcount); + actual_ret_bcount += next_ret_bcount; + return actual_ret_bcount; + } else { + /* + * measurement is off but valid data present + * fetch time info from lstat_control + */ + req.ending_time = lstat_control.ending_time; + req.ending_cycles64 = lstat_control.ending_cycles64; + req.enabled_cycles64 = lstat_control.enabled_cycles64; + } + } else { + /* + * this must be a read while data active--use current time, + * etc + */ + do_gettimeofday(&tv); + req.ending_time = tv.tv_sec; + req.ending_cycles64 = get_cycles64(); + req.enabled_cycles64 = req.ending_cycles64 - + req.started_cycles64 + lstat_control.enabled_cycles64; + } + + next_ret_bcount = sizeof (lstat_user_request_t); + if ((actual_ret_bcount + next_ret_bcount) > max_len) + return actual_ret_bcount; + + copy_to_user(buffer, (void *) &req, next_ret_bcount); + actual_ret_bcount += next_ret_bcount; + + if (!lstat_control.counts[0]) /* not initialized? 
+		return actual_ret_bcount;
+
+	next_ret_bcount = sizeof (lstat_cpu_counts_t);
+	for (cpu = 0; cpu < num_online_cpus(); cpu++) {
+		if ((actual_ret_bcount + next_ret_bcount) > max_len)
+			return actual_ret_bcount;	/* leave early */
+		copy_to_user(buffer + actual_ret_bcount,
+				lstat_control.counts[cpu], next_ret_bcount);
+		actual_ret_bcount += next_ret_bcount;
+	}
+
+	next_ret_bcount = LSTAT_MAX_STAT_INDEX *
+			sizeof (lstat_directory_entry_t);
+	if (((actual_ret_bcount + next_ret_bcount) > max_len)
+			|| !lstat_control.dir)
+		return actual_ret_bcount;	/* leave early */
+
+	copy_to_user(buffer + actual_ret_bcount, lstat_control.dir,
+			next_ret_bcount);
+	actual_ret_bcount += next_ret_bcount;
+
+	next_ret_bcount = sizeof (lstat_read_lock_cpu_counts_t);
+	for (cpu = 0; cpu < num_online_cpus(); cpu++) {
+		if (actual_ret_bcount + next_ret_bcount > max_len)
+			return actual_ret_bcount;
+		copy_to_user(buffer + actual_ret_bcount,
+				lstat_control.read_lock_counts[cpu],
+				next_ret_bcount);
+		actual_ret_bcount += next_ret_bcount;
+	}
+
+	return actual_ret_bcount;
+}
+
+/*
+ * Writing to the /proc lockmeter node enables or disables metering,
+ * based upon the first byte of the "written" data.
+ * The following values are defined:
+ * LSTAT_ON: 1st call: allocates storage, initializes and turns on measurement
+ *           subsequent calls just turn on measurement
+ * LSTAT_OFF: turns off measurement
+ * LSTAT_RESET: resets statistics
+ * LSTAT_RELEASE: releases statistics storage
+ *
+ * This allows one to accumulate statistics over several lockstat runs:
+ *
+ * lockstat on
+ * lockstat off
+ * ...repeat above as desired...
+ * lockstat get
+ * ...now start a new set of measurements...
+ * lockstat reset
+ * lockstat on
+ * ...
+ *
+ */
+ssize_t
+put_lockmeter_info(const char *buffer, size_t len)
+{
+	int error = 0;
+	int dirsize, countsize, read_lock_countsize, hashsize;
+	int cpu;
+	char put_char;
+	int i, read_lock_blocks;
+	unsigned long flags;
+	rwlock_t *lock_ptr;
+	struct timeval tv;
+
+	if (len <= 0)
+		return -EINVAL;
+
+	_raw_spin_lock(&lstat_control.control_lock);
+
+	get_user(put_char, buffer);
+	switch (put_char) {
+
+	case LSTAT_OFF:
+		if (lstat_control.state != LSTAT_OFF) {
+			/*
+			 * To avoid seeing read lock hold times in an
+			 * inconsistent state, we have to follow this protocol
+			 * to turn off statistics
+			 */
+			local_irq_save(flags);
+			/*
+			 * getting this lock will stop any read lock block
+			 * allocations
+			 */
+			_raw_spin_lock(&lstat_control.directory_lock);
+			/*
+			 * keep any more read lock blocks from being
+			 * allocated
+			 */
+			lstat_control.state = LSTAT_OFF;
+			/* record how many read lock blocks there are */
+			read_lock_blocks =
+				lstat_control.next_free_read_lock_index;
+			_raw_spin_unlock(&lstat_control.directory_lock);
+			/* now go through the list of read locks */
+			cpu = THIS_CPU_NUMBER;
+			for (i = 1; i < read_lock_blocks; i++) {
+				lock_ptr =
+					(*lstat_control.read_lock_counts[cpu])[i].
+					lock_ptr;
+				/* is this saved lock address still valid? */
+				if (GET_RWINDEX(lock_ptr) == i) {
+					/*
+					 * lock address appears to still be
+					 * valid because we only hold one lock
+					 * at a time, this can't cause a
+					 * deadlock unless this is a lock held
+					 * as part of the current system call
+					 * path.  At the moment there
+					 * are no READ mode locks held to get
+					 * here from user space, so we solve
+					 * this by skipping locks held in
+					 * write mode.
+					 */
+					if (RWLOCK_IS_WRITE_LOCKED(lock_ptr)) {
+						PUT_RWINDEX(lock_ptr, 0);
+						continue;
+					}
+					/*
+					 * now we know there are no read
+					 * holders of this lock!
+					 * statistics collection for this
+					 * lock
+					 */
+					_raw_write_lock(lock_ptr);
+					PUT_RWINDEX(lock_ptr, 0);
+					_raw_write_unlock(lock_ptr);
+				}
+				/*
+				 * it may still be possible for the hold time
+				 * sum to be negative, e.g. if a lock is
+				 * reallocated while "busy"; we will have to
+				 * fix this up in the data reduction program.
+				 */
+			}
+			local_irq_restore(flags);
+			lstat_control.intervals++;
+			lstat_control.ending_cycles64 = get_cycles64();
+			lstat_control.enabled_cycles64 +=
+			    lstat_control.ending_cycles64 -
+			    lstat_control.started_cycles64;
+			do_gettimeofday(&tv);
+			lstat_control.ending_time = tv.tv_sec;
+			/*
+			 * don't deallocate the structures -- we may do a
+			 * "lockstat on" to add to the data that is already
+			 * there. Use LSTAT_RELEASE to release storage.
+			 */
+		} else {
+			error = -EBUSY;	/* already OFF */
+		}
+		break;
+
+	case LSTAT_ON:
+		if (lstat_control.state == LSTAT_OFF) {
+#ifdef DEBUG_LOCKMETER
+			printk("put_lockmeter_info(cpu=%d): LSTAT_ON\n",
+			       THIS_CPU_NUMBER);
+#endif
+			lstat_control.next_free_dir_index = 1;	/* 0 is for overflows */
+
+			dirsize = LSTAT_MAX_STAT_INDEX *
+			    sizeof (lstat_directory_entry_t);
+			hashsize =
+			    (1 + LSTAT_HASH_TABLE_SIZE) * sizeof (ushort);
+			countsize = sizeof (lstat_cpu_counts_t);
+			read_lock_countsize =
+			    sizeof (lstat_read_lock_cpu_counts_t);
+#ifdef DEBUG_LOCKMETER
+			printk(" dirsize:%d", dirsize);
+			printk(" hashsize:%d", hashsize);
+			printk(" countsize:%d", countsize);
+			printk(" read_lock_countsize:%d\n",
+			       read_lock_countsize);
+#endif
+#ifdef DEBUG_LOCKMETER
+			{
+				int secs;
+				unsigned long cycles;
+				uint64_t cycles64;
+
+				/* wait for the start of the next second... */
+				do_gettimeofday(&tv);
+				secs = tv.tv_sec;
+				do {
+					do_gettimeofday(&tv);
+				} while (secs == tv.tv_sec);
+				cycles = get_cycles();
+				cycles64 = get_cycles64();
+				/* ...then count cycles for one second */
+				secs = tv.tv_sec;
+				do {
+					do_gettimeofday(&tv);
+				} while (secs == tv.tv_sec);
+				cycles = get_cycles() - cycles;
+				cycles64 = get_cycles64() - cycles64;
+				printk("lockmeter: cycleFrequency:%d "
+				       "cycles:%lu cycles64:%llu\n",
+				       CPU_CYCLE_FREQUENCY, cycles,
+				       (unsigned long long) cycles64);
+			}
+#endif
+
+			/*
+			 * if this is the first call, allocate storage and
+			 * initialize
+			 */
+			if (!lstat_control.hashtab) {
+
+				spin_lock_init(&lstat_control.directory_lock);
+
+				/* guarantee all pointers at zero */
+				init_control_space();
+
+				lstat_control.hashtab =
+				    kmalloc(hashsize, GFP_KERNEL);
+				if (!lstat_control.hashtab) {
+					error = -ENOSPC;
+#ifdef DEBUG_LOCKMETER
+					printk("!!error kmalloc of hashtab\n");
+#endif
+				}
+				lstat_control.dir = vmalloc(dirsize);
+				if (!lstat_control.dir) {
+					error = -ENOSPC;
+#ifdef DEBUG_LOCKMETER
+					printk("!!error vmalloc of dir\n");
+#endif
+				}
+
+				for (cpu = 0; cpu < num_online_cpus(); cpu++) {
+					lstat_control.counts[cpu] =
+					    vmalloc(countsize);
+					if (!lstat_control.counts[cpu]) {
+						error = -ENOSPC;
+#ifdef DEBUG_LOCKMETER
+						printk("!!error vmalloc of "
+						       "counts[%d]\n", cpu);
+#endif
+					}
+					lstat_control.read_lock_counts[cpu] =
+					    (lstat_read_lock_cpu_counts_t *)
+					    kmalloc(read_lock_countsize,
+						    GFP_KERNEL);
+					if (!lstat_control.
+					    read_lock_counts[cpu]) {
+						error = -ENOSPC;
+#ifdef DEBUG_LOCKMETER
+						printk("!!error kmalloc of "
+						       "read_lock_counts[%d]\n",
+						       cpu);
+#endif
+					}
+				}
+			}
+
+			if (error) {
+				/*
+				 * One or more allocation failures -- free
+				 * everything
+				 */
+				release_control_space();
+			} else {
+
+				if (!reset_lstat_data()) {
+					error = -EINVAL;
+					break;
+				}
+
+				/*
+				 * record starting and ending times and the
+				 * like
+				 */
+				if (lstat_control.intervals == 0) {
+					do_gettimeofday(&tv);
+					lstat_control.first_started_time =
+					    tv.tv_sec;
+				}
+				lstat_control.started_cycles64 = get_cycles64();
+				do_gettimeofday(&tv);
+				lstat_control.started_time = tv.tv_sec;
+
+				lstat_control.state = LSTAT_ON;
+			}
+		} else {
+			error = -EBUSY;	/* already ON */
+		}
+		break;
+
+	case LSTAT_RESET:
+		if (lstat_control.state == LSTAT_OFF) {
+			if (!reset_lstat_data())
+				error = -EINVAL;
+		} else {
+			error = -EBUSY;	/* still on; can't reset */
+		}
+		break;
+
+	case LSTAT_RELEASE:
+		if (lstat_control.state == LSTAT_OFF) {
+			release_control_space();
+			lstat_control.intervals = 0;
+			lstat_control.enabled_cycles64 = 0;
+		} else {
+			error = -EBUSY;
+		}
+		break;
+
+	default:
+		error = -EINVAL;
+	}			/* switch */
+
+	_raw_spin_unlock(&lstat_control.control_lock);
+	return error ? error : len;
+}
+
+#ifdef USER_MODE_TESTING
+/* following used for user mode testing */
+void
+lockmeter_init()
+{
+	int dirsize, hashsize, countsize, read_lock_countsize, cpu;
+
+	printf("lstat_control is at %p size=%zu\n",
+	       (void *) &lstat_control, sizeof (lstat_control));
+	printf("sizeof(spinlock_t)=%zu\n", sizeof (spinlock_t));
+	lstat_control.state = LSTAT_ON;
+
+	lstat_control.directory_lock = SPIN_LOCK_UNLOCKED;
+	lstat_control.next_free_dir_index = 1;	/* 0 is for overflows */
+	lstat_control.next_free_read_lock_index = 1;
+
+	dirsize = LSTAT_MAX_STAT_INDEX * sizeof (lstat_directory_entry_t);
+	hashsize = (1 + LSTAT_HASH_TABLE_SIZE) * sizeof (ushort);
+	countsize = sizeof (lstat_cpu_counts_t);
+	read_lock_countsize = sizeof (lstat_read_lock_cpu_counts_t);
+
+	lstat_control.hashtab = (ushort *) malloc(hashsize);
+
+	if (lstat_control.hashtab == 0) {
+		printf("malloc failure at line %d in lockmeter.c\n",
+		       __LINE__);
+		exit(0);
+	}
+
+	lstat_control.dir = (lstat_directory_entry_t *) malloc(dirsize);
+
+	if (lstat_control.dir == 0) {
+		printf("malloc failure at line %d in lockmeter.c\n",
+		       __LINE__);
+		exit(0);
+	}
+
+	for (cpu = 0; cpu < num_online_cpus(); cpu++) {
+		int j, k;
+		j = (int) (lstat_control.counts[cpu] =
+			   (lstat_cpu_counts_t *) malloc(countsize));
+		k = (int) (lstat_control.read_lock_counts[cpu] =
+			   (lstat_read_lock_cpu_counts_t *)
+			   malloc(read_lock_countsize));
+		if (j == 0 || k == 0) {
+			printf("malloc failure for cpu=%d at line %d in "
+			       "lockmeter.c\n", cpu, __LINE__);
+			exit(0);
+		}
+	}
+
+	memset(lstat_control.hashtab, 0, hashsize);
+	memset(lstat_control.dir, 0, dirsize);
+
+	for (cpu = 0; cpu < num_online_cpus(); cpu++) {
+		memset(lstat_control.counts[cpu], 0, countsize);
+		memset(lstat_control.read_lock_counts[cpu], 0,
+		       read_lock_countsize);
+	}
+}
+
+asm(" \n\
+.align	4 \n\
+.globl	__write_lock_failed \n\
+__write_lock_failed: \n\
+	" LOCK "addl	$" RW_LOCK_BIAS_STR ",(%eax) \n\
+1:	cmpl	$" RW_LOCK_BIAS_STR ",(%eax) \n\
+	jne	1b \n\
+\n\
+	" LOCK "subl	$" RW_LOCK_BIAS_STR ",(%eax) \n\
+	jnz	__write_lock_failed \n\
+	ret \n\
+\n\
+\n\
+.align	4 \n\
+.globl	__read_lock_failed \n\
+__read_lock_failed: \n\
+	lock ; incl	(%eax) \n\
+1:	cmpl	$1,(%eax) \n\
+	js	1b \n\
+\n\
+	lock ; decl	(%eax) \n\
+	js	__read_lock_failed \n\
+	ret \n\
+");
+#endif
+
+EXPORT_SYMBOL(_metered_spin_lock);
+EXPORT_SYMBOL(_metered_spin_unlock); +EXPORT_SYMBOL(_metered_spin_trylock); +EXPORT_SYMBOL(_metered_read_lock); +EXPORT_SYMBOL(_metered_read_unlock); +EXPORT_SYMBOL(_metered_write_lock); +EXPORT_SYMBOL(_metered_write_unlock); --- diff/lib/int_sqrt.c 1970-01-01 01:00:00.000000000 +0100 +++ source/lib/int_sqrt.c 2003-11-26 10:09:08.000000000 +0000 @@ -0,0 +1,32 @@ + +#include <linux/kernel.h> +#include <linux/module.h> + +/** + * int_sqrt - rough approximation to sqrt + * @x: integer of which to calculate the sqrt + * + * A very rough approximation to the sqrt() function. + */ +unsigned long int_sqrt(unsigned long x) +{ + unsigned long op, res, one; + + op = x; + res = 0; + + one = 1 << 30; + while (one > op) + one >>= 2; + + while (one != 0) { + if (op >= res + one) { + op = op - (res + one); + res = res + 2 * one; + } + res /= 2; + one /= 4; + } + return res; +} +EXPORT_SYMBOL(int_sqrt); --- diff/mm/usercopy.c 1970-01-01 01:00:00.000000000 +0100 +++ source/mm/usercopy.c 2003-11-26 10:09:08.000000000 +0000 @@ -0,0 +1,277 @@ +/* + * linux/mm/usercopy.c + * + * (C) Copyright 2003 Ingo Molnar + * + * Generic implementation of all the user-VM access functions, without + * relying on being able to access the VM directly. + */ + +#include <linux/module.h> +#include <linux/sched.h> +#include <linux/errno.h> +#include <linux/mm.h> +#include <linux/highmem.h> +#include <linux/pagemap.h> +#include <linux/smp_lock.h> +#include <linux/ptrace.h> +#include <linux/interrupt.h> + +#include <asm/pgtable.h> +#include <asm/uaccess.h> +#include <asm/atomic_kmap.h> + +/* + * Get kernel address of the user page and pin it. + */ +static inline struct page *pin_page(unsigned long addr, int write) +{ + struct mm_struct *mm = current->mm ? : &init_mm; + struct page *page = NULL; + int ret; + + spin_lock(&mm->page_table_lock); + /* + * Do a quick atomic lookup first - this is the fastpath. + */ + page = follow_page(mm, addr, write); + if (likely(page != NULL)) { + if (!PageReserved(page)) + get_page(page); + spin_unlock(&mm->page_table_lock); + return page; + } + + /* + * No luck - bad address or need to fault in the page: + */ + spin_unlock(&mm->page_table_lock); + + /* + * In the context of filemap_copy_from_user(), we are not allowed + * to sleep. We must fail this usercopy attempt and allow + * filemap_copy_from_user() to recover: drop its atomic kmap and use + * a sleeping kmap instead. + */ + if (in_atomic()) + return NULL; + + down_read(&mm->mmap_sem); + ret = get_user_pages(current, mm, addr, 1, write, 0, &page, NULL); + up_read(&mm->mmap_sem); + if (ret <= 0) + return NULL; + return page; +} + +static inline void unpin_page(struct page *page) +{ + put_page(page); +} + +/* + * Access another process' address space. 
+ * Source/target buffer must be kernel space.
+ * Do not walk the page table directly; use get_user_pages.
+ */
+static int rw_vm(unsigned long addr, void *buf, int len, int write)
+{
+	if (!len)
+		return 0;
+
+	/* ignore errors, just check how much was successfully transferred */
+	while (len) {
+		struct page *page = NULL;
+		int bytes, offset;
+		void *maddr;
+
+		page = pin_page(addr, write);
+		if (!page)
+			break;
+
+		bytes = len;
+		offset = addr & (PAGE_SIZE-1);
+		if (bytes > PAGE_SIZE-offset)
+			bytes = PAGE_SIZE-offset;
+
+		maddr = kmap_atomic(page, KM_USER_COPY);
+
+/* transfer the common small sizes with a single access rather than memcpy */
+#define HANDLE_TYPE(type) \
+	case sizeof(type): *(type *)(maddr+offset) = *(type *)(buf); break;
+
+		if (write) {
+			switch (bytes) {
+			HANDLE_TYPE(char);
+			HANDLE_TYPE(int);
+			HANDLE_TYPE(long long);
+			default:
+				memcpy(maddr + offset, buf, bytes);
+			}
+		} else {
+#undef HANDLE_TYPE
+#define HANDLE_TYPE(type) \
+	case sizeof(type): *(type *)(buf) = *(type *)(maddr+offset); break;
+			switch (bytes) {
+			HANDLE_TYPE(char);
+			HANDLE_TYPE(int);
+			HANDLE_TYPE(long long);
+			default:
+				memcpy(buf, maddr + offset, bytes);
+			}
+#undef HANDLE_TYPE
+		}
+		kunmap_atomic(maddr, KM_USER_COPY);
+		unpin_page(page);
+		len -= bytes;
+		buf += bytes;
+		addr += bytes;
+	}
+
+	return len;
+}
+
+static int str_vm(unsigned long addr, void *buf0, int len, int copy)
+{
+	struct mm_struct *mm = current->mm ? : &init_mm;
+	struct page *page;
+	void *buf = buf0;
+
+	if (!len)
+		return len;
+
+	down_read(&mm->mmap_sem);
+	/* ignore errors, just check how much was successfully transferred */
+	while (len) {
+		int bytes, ret, offset, left, copied;
+		char *maddr;
+
+		ret = get_user_pages(current, mm, addr, 1, copy == 2, 0,
+				     &page, NULL);
+		if (ret <= 0) {
+			up_read(&mm->mmap_sem);
+			return -EFAULT;
+		}
+
+		bytes = len;
+		offset = addr & (PAGE_SIZE-1);
+		if (bytes > PAGE_SIZE-offset)
+			bytes = PAGE_SIZE-offset;
+
+		maddr = kmap_atomic(page, KM_USER_COPY);
+		if (copy == 2) {
+			memset(maddr + offset, 0, bytes);
+			copied = bytes;
+			left = 0;
+		} else if (copy == 1) {
+			left = strncpy_count(buf, maddr + offset, bytes);
+			copied = bytes - left;
+		} else {
+			copied = strnlen(maddr + offset, bytes);
+			left = bytes - copied;
+		}
+		BUG_ON(bytes < 0 || copied < 0);
+		kunmap_atomic(maddr, KM_USER_COPY);
+		page_cache_release(page);
+		len -= copied;
+		buf += copied;
+		addr += copied;
+		if (left)
+			break;
+	}
+	up_read(&mm->mmap_sem);
+
+	return len;
+}
+
+/*
+ * Copies memory from userspace (ptr) into kernelspace (val).
+ *
+ * returns # of bytes not copied.
+ */
+int get_user_size(unsigned int size, void *val, const void *ptr)
+{
+	int ret;
+
+	if (unlikely(segment_eq(get_fs(), KERNEL_DS)))
+		ret = __direct_copy_from_user(val, ptr, size);
+	else
+		ret = rw_vm((unsigned long)ptr, val, size, 0);
+	if (ret)
+		/*
+		 * Zero the rest:
+		 */
+		memset(val + size - ret, 0, ret);
+	return ret;
+}
+
+/*
+ * Copies memory from kernelspace (val) into userspace (ptr).
+ *
+ * returns # of bytes not copied.
+ */ +int put_user_size(unsigned int size, const void *val, void *ptr) +{ + if (unlikely(segment_eq(get_fs(), KERNEL_DS))) + return __direct_copy_to_user(ptr, val, size); + else + return rw_vm((unsigned long)ptr, (void *)val, size, 1); +} + +int copy_str_fromuser_size(unsigned int size, void *val, const void *ptr) +{ + int copied, left; + + if (unlikely(segment_eq(get_fs(), KERNEL_DS))) { + left = strncpy_count(val, ptr, size); + copied = size - left; + BUG_ON(copied < 0); + + return copied; + } + left = str_vm((unsigned long)ptr, val, size, 1); + if (left < 0) + return left; + copied = size - left; + BUG_ON(copied < 0); + + return copied; +} + +int strlen_fromuser_size(unsigned int size, const void *ptr) +{ + int copied, left; + + if (unlikely(segment_eq(get_fs(), KERNEL_DS))) { + copied = strnlen(ptr, size) + 1; + BUG_ON(copied < 0); + + return copied; + } + left = str_vm((unsigned long)ptr, NULL, size, 0); + if (left < 0) + return 0; + copied = size - left + 1; + BUG_ON(copied < 0); + + return copied; +} + +int zero_user_size(unsigned int size, void *ptr) +{ + int left; + + if (unlikely(segment_eq(get_fs(), KERNEL_DS))) { + memset(ptr, 0, size); + return 0; + } + left = str_vm((unsigned long)ptr, NULL, size, 2); + if (left < 0) + return size; + return left; +} + +EXPORT_SYMBOL(get_user_size); +EXPORT_SYMBOL(put_user_size); +EXPORT_SYMBOL(zero_user_size); +EXPORT_SYMBOL(copy_str_fromuser_size); +EXPORT_SYMBOL(strlen_fromuser_size); +
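
The put_lockmeter_info() command-byte protocol above is normally driven
from user space by the lockstat utility. As a rough illustration only (a
sketch, not part of the patch: the /proc node name and the numeric values
of the LSTAT_* command bytes are assumptions here; the real values come
from the lockmeter headers and the lockstat sources), an equivalent
minimal driver would look like this:

	#include <fcntl.h>
	#include <stdio.h>
	#include <stdlib.h>
	#include <unistd.h>

	#define LOCKMETER_PROC	"/proc/lockmeter"	/* assumed node name */

	/* assumed command-byte values; see the kernel's LSTAT_* definitions */
	enum { CMD_LSTAT_ON = 1, CMD_LSTAT_OFF = 2 };

	/* only the first byte written is examined by put_lockmeter_info() */
	static void lockstat_cmd(char cmd)
	{
		int fd = open(LOCKMETER_PROC, O_WRONLY);

		if (fd < 0 || write(fd, &cmd, 1) != 1) {
			perror(LOCKMETER_PROC);
			exit(1);
		}
		close(fd);
	}

	int main(void)
	{
		lockstat_cmd(CMD_LSTAT_ON);	/* 1st call allocates and starts */
		sleep(10);			/* workload of interest runs here */
		lockstat_cmd(CMD_LSTAT_OFF);	/* stop; data kept for a later get */
		return 0;
	}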
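
The digit-by-digit method that lib/int_sqrt.c (above) uses can be checked
in isolation. The following stand-alone user-space harness is illustrative
only: test_sqrt is a local copy of the patch's algorithm, and the check
confirms that it returns floor(sqrt(x)) exactly, so "rough approximation"
in the kernel-doc comment is conservative for 32-bit inputs:

	#include <assert.h>
	#include <stdio.h>

	static unsigned long test_sqrt(unsigned long x)
	{
		unsigned long op = x, res = 0, one = 1UL << 30;

		/* scale "one" down to a power of four no larger than x */
		while (one > op)
			one >>= 2;

		/* peel off one result bit per iteration, high bit first */
		while (one != 0) {
			if (op >= res + one) {
				op -= res + one;
				res += 2 * one;
			}
			res /= 2;
			one /= 4;
		}
		return res;
	}

	int main(void)
	{
		unsigned long x;

		for (x = 0; x < 1000000; x++) {
			unsigned long r = test_sqrt(x);

			/* r must be the largest integer whose square is <= x */
			assert(r * r <= x && (r + 1) * (r + 1) > x);
		}
		printf("floor(sqrt(x)) verified for all x < 1000000\n");
		return 0;
	}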